### 9.0.1 Collinearity
In this section, we explore the degree of correlation between the predictor variables in our retail sales dataset; that is, we test the model for collinearity. Two variables that are collinear carry similar information about the variance within the dataset.

To detect collinearity, we create a correlation matrix and look for pairs of variables whose correlation coefficients have large absolute values.
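
As a rough illustration of the correlation-matrix check, the sketch below builds a small synthetic dataframe (the columns a, b and c are hypothetical stand-ins, not the retail data) and lists the pairs whose absolute correlation exceeds an illustrative cutoff of 0.8. The cutoff is an assumption chosen for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical columns, not the retail dataset)
rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=n),   # nearly a copy of "a"
    "c": rng.normal(size=n),             # unrelated to "a" and "b"
})

# Absolute pairwise Pearson correlations
corr = toy.corr().abs()

# Keep each pair once: upper triangle, excluding the diagonal
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

# Flag pairs above the illustrative cutoff
high = pairs[pairs > 0.8]
print(high)
```

Only the pair (a, b) should be flagged here, since b was constructed as a noisy copy of a.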

### 9.0.2 Multicollinearity
Detecting multicollinearity follows a more complicated procedure, because it emerges when three or more highly correlated variables are included in the model. It can emerge even when no isolated pair of variables is collinear.
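
The following sketch, on synthetic data, illustrates how multicollinearity can hide from pairwise checks. The variable x4 is built as (almost exactly) the sum of three independent predictors, so no single pairwise correlation is very large, yet x4 is almost perfectly explained by the other three together. All names and the noise level are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x1, x2, x3 = rng.normal(size=(3, n))
# x4 is almost exactly the sum of the other three predictors
x4 = x1 + x2 + x3 + 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3, x4])

# Largest absolute pairwise correlation (diagonal zeroed out)
corr = np.corrcoef(X, rowvar=False)
max_pairwise = np.abs(corr - np.eye(4)).max()

# R^2 from regressing x4 on x1, x2, x3 (with an intercept)
A = np.column_stack([np.ones(n), x1, x2, x3])
coef, *_ = np.linalg.lstsq(A, x4, rcond=None)
resid = x4 - A @ coef
r2 = 1 - resid.var() / x4.var()

print(f"max pairwise |r| = {max_pairwise:.2f}, R^2 of x4 on the rest = {r2:.4f}")
```

The pairwise correlations stay moderate while the R² of x4 on the remaining predictors is close to 1, which is the situation a correlation matrix alone would miss.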

To test for multicollinearity, we will use the Variance Inflation Factor (VIF).

### 9.0.3 Variance Inflation Factor
The variance inflation factor (VIF) is a measure of collinearity among predictor variables within a multiple regression. For a given predictor, it is the ratio of the variance of its estimated coefficient (beta) in the full model to the variance that coefficient would have if the predictor were fit alone.
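
The same quantity is commonly written in terms of an auxiliary regression. The formula below is the standard identity rather than something stated in the text above, and the index $j$ is notation introduced here for illustration:

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
$$

Here $R_j^2$ is the coefficient of determination obtained by regressing predictor $j$ on all the other predictors. A predictor that is unrelated to the others has $R_j^2 = 0$ and a VIF of 1, while $R_j^2 = 0.8$ gives a VIF of 5 and $R_j^2 = 0.9$ gives a VIF of 10.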

### 9.0.4 Steps for Implementing VIF

1. Run a multiple regression
2. Calculate the VIF Factors
3. Inspect the factors for each predictor variable: if the VIF is between 5 and 10 (or higher), multicollinearity is likely present and we would consider dropping the variable.
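
The steps above can be sketched without any regression library. The helper vif_by_hand below is a hypothetical name introduced for this example; it regresses each predictor on the others with NumPy least squares and converts the resulting R² into a VIF as 1 / (1 - R²). The data are synthetic and the cutoff of 5 is only illustrative.

```python
import numpy as np

def vif_by_hand(X):
    """Return the VIF of each column of X (predictors only, no constant column)."""
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # auxiliary regression with intercept
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - resid.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic predictors: x2 is strongly related to x1, x3 is independent
rng = np.random.default_rng(2)
n = 400
x1 = rng.normal(size=n)
x2 = x1 + 0.2 * rng.normal(size=n)
x3 = rng.normal(size=n)

vifs = vif_by_hand(np.column_stack([x1, x2, x3]))
flagged = vifs > 5   # illustrative cutoff
print(vifs.round(1), flagged)
```

In this toy setup the first two predictors should be flagged and the third should sit near 1.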

Let us import the libraries required for analysis.

In [1]:
#imports
import pandas as pd
import numpy as np
from patsy import dmatrices
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

### 9.0.5 Data Import
We will now import the dataset.

In [2]:
# Load master_dataset.xlsx (the keyword is sheet_name; the older sheetname spelling was removed from pandas)
df = pd.read_excel('master_dataset.xlsx', sheet_name='Sheet1')

### 9.0.6 Data Manipulation
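We first preview the rows that contain no missing values, and then drop all the non-numeric columns.

In [3]:
# Preview rows without missing values (the result is displayed, not assigned back to df)
df.dropna()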

Unnamed: 0,Store,Date,Temperature,Fuel_Price,MarkDown1,MarkDown2,MarkDown3,MarkDown4,MarkDown5,CPI,...,Books,Musical_Instruments,Star_Wars,Movies_TV,Video_Games,Portable_Audios,Cameras_Camcoders,Auto_Electronics,Wearable_Tech,Smart_homes
0,1,2010-05-02,42.31,2.572,10382.90,6115.67,215.07,2406.62,6551.42,211.096358,...,73315.81,57022.45,118966.90,58034.24,56157.83,113009.41,27930.71,32954.82,10344.16,0.01
1,1,2010-12-02,38.51,2.548,10382.90,6115.67,215.07,2406.62,6551.42,211.242170,...,77280.42,57845.36,126907.41,63245.00,66172.11,111466.37,5265.09,30149.20,14740.14,0.01
2,1,2010-02-19,39.93,2.514,10382.90,6115.67,215.07,2406.62,6551.42,211.289143,...,78602.71,59462.22,122267.65,69962.56,62795.87,124821.44,5265.09,33726.13,10139.42,0.01
3,1,2010-02-26,46.63,2.561,10382.90,6115.67,215.07,2406.62,6551.42,211.319643,...,76091.36,63011.44,135066.75,62581.64,72212.32,107952.07,28420.73,31585.78,12087.95,20.00
4,1,2010-05-03,46.50,2.625,10382.90,6115.67,215.07,2406.62,6551.42,211.350143,...,71718.48,57335.17,125048.08,57630.02,55501.07,103652.58,28420.73,28457.31,10871.74,20.00
5,1,2010-12-03,57.79,2.667,10382.90,6115.67,215.07,2406.62,6551.42,211.380643,...,79049.67,64708.91,130199.04,68070.49,64285.98,124220.10,28420.73,36014.75,10282.67,20.00
6,1,2010-03-19,54.58,2.720,10382.90,6115.67,215.07,2406.62,6551.42,211.215635,...,73277.05,51969.94,119421.70,56070.53,45989.26,100834.31,23153.17,30659.40,10262.38,20.00
7,1,2010-03-26,51.45,2.732,10382.90,6115.67,215.07,2406.62,6551.42,211.018042,...,76760.92,63250.87,127434.37,62781.44,64718.53,117716.13,27824.48,31733.15,11323.78,20.00
8,1,2010-02-04,62.27,2.719,10382.90,6115.67,215.07,2406.62,6551.42,210.820450,...,71666.24,55614.69,116688.75,55082.40,59159.95,113117.35,27824.48,30294.48,10063.69,20.00
9,1,2010-09-04,65.86,2.770,10382.90,6115.67,215.07,2406.62,6551.42,210.622857,...,76433.03,59985.57,126058.88,71718.04,64495.92,133056.97,27824.48,35109.14,10994.44,20.00


In [4]:
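# Drop non-numeric columns
# (_get_numeric_data is a private pandas method; df.select_dtypes(include=['number', 'bool']) is a public alternative)
df = df._get_numeric_data()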

In [5]:
df.head()

Unnamed: 0,Store,Temperature,Fuel_Price,MarkDown1,MarkDown2,MarkDown3,MarkDown4,MarkDown5,CPI,Unemployment,...,Books,Musical_Instruments,Star_Wars,Movies_TV,Video_Games,Portable_Audios,Cameras_Camcoders,Auto_Electronics,Wearable_Tech,Smart_homes
0,1,42.31,2.572,10382.9,6115.67,215.07,2406.62,6551.42,211.096358,8.106,...,73315.81,57022.45,118966.9,58034.24,56157.83,113009.41,27930.71,32954.82,10344.16,0.01
1,1,38.51,2.548,10382.9,6115.67,215.07,2406.62,6551.42,211.24217,8.106,...,77280.42,57845.36,126907.41,63245.0,66172.11,111466.37,5265.09,30149.2,14740.14,0.01
2,1,39.93,2.514,10382.9,6115.67,215.07,2406.62,6551.42,211.289143,8.106,...,78602.71,59462.22,122267.65,69962.56,62795.87,124821.44,5265.09,33726.13,10139.42,0.01
3,1,46.63,2.561,10382.9,6115.67,215.07,2406.62,6551.42,211.319643,8.106,...,76091.36,63011.44,135066.75,62581.64,72212.32,107952.07,28420.73,31585.78,12087.95,20.0
4,1,46.5,2.625,10382.9,6115.67,215.07,2406.62,6551.42,211.350143,8.106,...,71718.48,57335.17,125048.08,57630.02,55501.07,103652.58,28420.73,28457.31,10871.74,20.0


Let us use MarkDown1 as the target variable and the others as predictor variables, and then run a multiple regression. First, we create a duplicate dataframe and drop the column "MarkDown1", our target variable, from the duplicate.

In [6]:
target = df['MarkDown1']

In [7]:
# Duplicate the dataframe as df_dup (.copy() is needed; plain assignment would only alias df)
df_dup = df.copy()

In [8]:
# Drop "MarkDown1" from the duplicate
df_dup = df_dup.drop(columns=['MarkDown1'])
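
When duplicating a dataframe before dropping columns, it matters whether the duplicate is an independent copy or just a second name for the same object. The toy example below (made-up values, hypothetical variable names) contrasts the two cases.

```python
import pandas as pd

# Case 1: plain assignment creates an alias, not a copy
original = pd.DataFrame({"MarkDown1": [1.0, 2.0], "CPI": [211.1, 211.2]})
alias = original
alias.drop(columns=["MarkDown1"], inplace=True)
alias_cols = list(original.columns)       # the original lost the column too

# Case 2: .copy() creates an independent dataframe
original = pd.DataFrame({"MarkDown1": [1.0, 2.0], "CPI": [211.1, 211.2]})
duplicate = original.copy()
duplicate = duplicate.drop(columns=["MarkDown1"])
cols_after_copy = list(original.columns)  # the original is untouched

print(alias_cols, cols_after_copy)
```

With the alias, an in-place drop also removes the column from the original; with the copy, the original keeps every column.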

### 9.0.7 Numpy Array
We will create two NumPy arrays: X, representing the predictor variables, and y, representing the target variable.

In [10]:
#Numpy array
X = df_dup.values
y = target.values

### 9.0.8 Reshape Array
We will now reshape the target array into a single column in preparation for regression analysis.

In [11]:
# Print the dimensions of X and y before reshaping
print("Dimensions of y before reshaping: {}".format(y.shape))
print("Dimensions of X before reshaping: {}".format(X.shape))

Dimensions of y before reshaping: (8190,)
Dimensions of X before reshaping: (8190, 92)


In [12]:
# Reshape  the dimension of y
y = y.reshape(-1, 1)
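
As a small illustration of what the reshape does, the sketch below uses a made-up four-element array. Passing -1 lets NumPy infer the number of rows, and the 1 requests a single column.

```python
import numpy as np

y_flat = np.array([10.0, 20.0, 30.0, 40.0])   # one-dimensional, shape (4,)
y_col = y_flat.reshape(-1, 1)                  # two-dimensional column, shape (4, 1)

print(y_flat.shape, y_col.shape)
```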

### 9.0.9 Multiple Regression

In [20]:
# Add a constant (intercept) column to X
X = sm.add_constant(X)


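We will run the linear regression in scikit-learn (sklearn), which is pretty much the gold standard for machine learning in Python. Note that its LinearRegression fits an intercept by default, so the constant column added above is redundant; its coefficient in the output further down is effectively zero. To use linear regression, we need to import it: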

In [22]:
from sklearn import linear_model

Then we’ll fit a model:

In [23]:
lm = linear_model.LinearRegression()

In [24]:
model = lm.fit(X,y)

The lm.fit() function fits a linear model. We want to use the model to make predictions, so we’ll use lm.predict():

In [25]:
predictions = lm.predict(X)

We print only the first 5 predictions for y to save room; removing [:5] would print the entire array:

In [27]:
print(predictions[:5])

[[  7234.75615153]
 [ 12361.17684614]
 [ 12578.73163448]
 [ 11702.03855637]
 [  8002.05489265]]


Remember, lm.predict() predicts y (the dependent variable) using the linear model we fitted. We use built-in attributes and methods to return the score (the R² of the fit), the coefficients and the estimated intercept.

In [28]:
lm.score(X,y)

0.6857431849075899
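
For a linear regression, the score reported by scikit-learn is the coefficient of determination, R². The sketch below computes the same quantity by hand on synthetic data using only NumPy; the data and variable names are assumptions made for the example and are unrelated to the retail dataset.

```python
import numpy as np

# Synthetic data: y depends linearly on x plus noise
rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

# Ordinary least squares with an intercept
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ coef

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(round(r2, 3))
```

An R² of about 0.69, as reported above, would mean that roughly 69% of the variance in the target is accounted for by the fitted model on the same data it was trained on.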

In [29]:
lm.coef_

array([[ -1.13921338e-10,   1.20500336e+02,  -1.27638829e+00,
         -3.53095573e+02,   1.35663608e-01,  -6.41334627e-02,
          1.10872046e+00,   6.30729742e-03,  -1.05537392e-02,
          5.98986987e+01,   7.96124877e+02,   3.54716258e-02,
         -1.85512385e-02,  -8.98316094e-02,  -1.03114195e-03,
          3.20915222e-01,  -9.44050332e-03,   1.40405164e-01,
          1.57409274e-02,  -3.93847773e-02,  -6.76029606e-03,
          1.38345898e-01,   1.28062377e-02,  -2.59795512e-01,
          1.70205423e-01,  -2.92213411e-01,   4.88036516e-03,
          1.30569333e-01,  -1.06646088e-02,   5.44034030e-02,
         -1.99629574e-01,   4.70159307e-01,  -6.87432790e-02,
         -2.19933174e-02,  -2.31341208e-01,  -7.68720506e-02,
         -1.22479278e-01,   8.30781434e-01,  -2.57322650e+00,
         -2.90807838e-02,   4.20582519e-01,  -1.12719850e-01,
          1.13657487e-01,   3.17410244e-01,  -2.16439259e-02,
          7.78665468e-03,  -7.74034552e-02,  -1.53960471e-01,
        

In [30]:
lm.intercept_

array([-4236.86232933])

These are all estimated parts of the multiple regression equation.
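
Written out, the fitted equation has the usual form below. The symbols are notation introduced here for illustration: $p$ is the number of predictor columns, $\hat{\beta}_0$ corresponds to lm.intercept_, and the $\hat{\beta}_j$ correspond to the entries of lm.coef_.

$$
\hat{y} = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x_j
$$

Each value returned by lm.predict() is this sum evaluated for one row of X.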

### 9.0.10 Generating VIF

In [40]:
# Reload master_dataset.xlsx
df = pd.read_excel('master_dataset.xlsx', sheet_name='Sheet1')

In [41]:
# Subset the dataframe to the columns of interest and drop rows with missing values
df = df[['MarkDown1','MarkDown2','MarkDown3', 'MarkDown4', 'MarkDown5', 'CPI', 'Unemployment']].dropna() 

#### 9.0.10.1 Run Multiple Regression

In [42]:
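# Gather the features into a formula string.
# Note: df.columns includes MarkDown1, so it also appears on the right-hand side
# and receives its own VIF in the table further down.
features = "+".join(df.columns)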


In [43]:
# get y and X dataframes based on this regression:
y, X = dmatrices('MarkDown1 ~' + features, df, return_type='dataframe')

#### 9.0.10.2 Calculate VIF Factors

In [44]:
# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns

In [45]:
vif.round(1)

| | VIF Factor | features |
|---|---|---|
| 0 | 56.7 | Intercept |
| 1 | 2.4 | MarkDown1 |
| 2 | 1.2 | MarkDown2 |
| 3 | 1.0 | MarkDown3 |
| 4 | 2.2 | MarkDown4 |
| 5 | 1.0 | MarkDown5 |
| 6 | 1.1 | CPI |
| 7 | 1.1 | Unemployment |
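
To read the table against the threshold from the steps listed earlier, one option is to set the intercept row aside (its VIF is usually not interpreted as a collinearity signal) and filter the remaining predictors. The sketch below re-enters the values from the table above by hand; the cutoff of 5 is the lower end of the 5-10 range mentioned earlier.

```python
import pandas as pd

# Values copied from the VIF table above
vif_table = pd.DataFrame({
    "VIF Factor": [56.7, 2.4, 1.2, 1.0, 2.2, 1.0, 1.1, 1.1],
    "features": ["Intercept", "MarkDown1", "MarkDown2", "MarkDown3",
                 "MarkDown4", "MarkDown5", "CPI", "Unemployment"],
})

# Ignore the intercept, then keep predictors at or above the cutoff
predictors = vif_table[vif_table["features"] != "Intercept"]
flagged = predictors[predictors["VIF Factor"] >= 5]

print(flagged)
```

No predictor in this subset reaches the cutoff, so none of these seven variables would be dropped on VIF grounds.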


In [76]:
# Reload master_dataset.xlsx
df = pd.read_excel('master_dataset.xlsx', sheet_name='Sheet1')

In [77]:
# Drop Non-Numeric Columns
df = df._get_numeric_data()

In [78]:
df.keys()

Index(['Store', 'Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2',
       'MarkDown3', 'MarkDown4', 'MarkDown5', 'CPI', 'Unemployment',
       'IsHoliday', 'Size', 'Jewelry', 'Pets', 'TV_Video', 'Cell_Phones',
       'Pharmaceutical ', 'Health_beauty', 'Toy ', 'Home_others', 'Kitchen',
       'Bedding', 'Bathroom', 'Office_supplies ', 'School_Supplies',
       'Home_Office', 'Craft_general', 'Floral', 'Beading', 'Paint', 'Framing',
       'outdoor', 'Auto', 'School_Uniforms', 'Baby_Toddlers_Clothing',
       'Baby_Kids_Shoes', 'Clearance_Clothings', 'Boys_Clothing',
       'Girls_Clothing', 'Women_Clothing', 'Intimates_Sleepwears',
       'Men_Clothings', 'Precious_Metals', 'Active_Wear', 'Adult_Shoes',
       'Bags_Accessories', 'Sportswear', 'Computer', 'Music', 'Luggage',
       'Food', 'Fruit', 'Grocery', 'Laundry', 'IPad_Tablets',
       'Heating_Cooling', 'Swim_Shop', 'Gift_cards', 'Baby_Essentials',
       'Cribs', 'Car_Seats', 'Strollers', 'Bikes', 'Photo',
       'Househol

In [80]:
# Remove the trailing space from the column name
df = df.rename(columns={'Toy ': 'Toy'})

In [82]:
df = df.rename(columns={'Pharmaceutical ': 'Pharmaceutical'})

In [83]:
df = df.rename(columns={'Office_supplies ': 'Office_supplies'})
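
Column names with stray trailing spaces can also be cleaned in one pass rather than one at a time. A minimal sketch on a toy dataframe (the values are made up; only the column names mirror the ones above):

```python
import pandas as pd

toy = pd.DataFrame({"Toy ": [1], "Pharmaceutical ": [2], "Kitchen": [3]})

# Strip leading and trailing whitespace from every column name
toy.columns = toy.columns.str.strip()

print(list(toy.columns))
```

This approach assumes that no two names become identical after stripping.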

In [84]:
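# Note: 'MarkDown1' appears twice in this list, so the selection contains two identical MarkDown1 columns
df = df[['MarkDown1', 'Store', 'Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5', 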
         'CPI', 'Unemployment', 'Size', 'Jewelry', 'Pets', 'TV_Video', 'Cell_Phones', 'Pharmaceutical', 'Health_beauty', 'Toy',
         'Home_others', 'Kitchen', 'Bedding', 'Bathroom', 'Office_supplies', 'School_Supplies', 'Home_Office', 'Craft_general',
         'Floral','Beading', 'Paint', 'Framing', 'outdoor', 'Auto', 'School_Uniforms', 'Baby_Toddlers_Clothing', 'Baby_Kids_Shoes',
         'Clearance_Clothings', 'Boys_Clothing', 'Girls_Clothing', 'Women_Clothing', 'Intimates_Sleepwears', 'Men_Clothings',
         'Precious_Metals', 'Active_Wear', 'Adult_Shoes', 'Bags_Accessories', 'Sportswear', 'Computer', 'Music', 'Luggage',
         'Food', 'Fruit', 'Grocery', 'Laundry', 'IPad_Tablets', 'Heating_Cooling', 'Swim_Shop', 'Gift_cards', 'Baby_Essentials',
         'Cribs', 'Car_Seats', 'Strollers', 'Bikes', 'Photo', 'Household_Essentials', 'Air_Quality', 'Light_bulbs', 'Gardening',
         'Building_Materials', 'Hardware', 'Electrical', 'Home_Safety', 'Tools', 'Teen_Room', 'Kids_Room', 'Lighting', 
         'Home_Decor', 'Mattresses', 'Furniture', 'Storage', 'Appliances', 'Pioneer_Woman', 'Computer_Software', 'Books', 
         'Musical_Instruments', 'Star_Wars', 'Movies_TV', 'Video_Games', 'Portable_Audios', 'Cameras_Camcoders', 
         'Auto_Electronics', 'Wearable_Tech', 'Smart_homes']].dropna()

In [85]:
# Gather the features into a formula string
features = "+".join(df.columns)

In [86]:
# get y and X dataframes based on this regression:
y, X = dmatrices('MarkDown1 ~' + features, df, return_type='dataframe')

In [87]:
# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns

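The cell above triggers a divide-by-zero warning at the statsmodels line vif = 1. / (1. - r_squared_i). Because 'MarkDown1' is selected twice in In [84], its two copies are perfectly collinear, so r_squared_i equals 1 and the corresponding VIF is infinite.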


In [88]:
vif.round(1)

| | VIF Factor | features |
|---|---|---|
| 0 | 2053.400000 | Intercept |
| 1 | inf | MarkDown1[0] |
| 2 | inf | MarkDown1[1] |
| 3 | 25.200000 | Store |
| 4 | 1.300000 | Temperature |
| 5 | 1.200000 | Fuel_Price |
| 6 | 1.600000 | MarkDown2 |
| 7 | 1.000000 | MarkDown3 |
| 8 | 2.600000 | MarkDown4 |
| 9 | 1.100000 | MarkDown5 |