## 18.0 Introduction
We now fit further regressions, using the predictors selected from the VIF calculations, against three target variables:
* Transport (Auto)-Items
* Fashion (Wears)-Items and
* Food-Items

We will use statsmodels and scikit-learn (sklearn) for multiple regression.

### 18.0.1 Python Libraries
To start, we import the necessary libraries.

In [1]:
#import Libraries
import pandas as pd
import numpy as np
import seaborn as sns; sns.set()
import matplotlib
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn import linear_model
%matplotlib inline  

## 18.1 LINEAR REGRESSION IMPLEMENTING VIF RESULT

### 18.1.0 TARGET - Auto-items
Objective: Predict Transport (Auto) Items sales performance

In [2]:
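# Read sheet 'Sheet1' of 'master_dataset.xlsx' into a DataFrame df using pd.read_excel().
# Note: the keyword is `sheet_name`; the older spelling `sheetname` is no longer accepted by current pandas.
df = pd.read_excel('master_dataset.xlsx', sheet_name='Sheet1')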

After importing the master dataset 'master_dataset.xlsx', we create a features DataFrame from the columns with VIF < 5:

In [3]:
# select the 30 predictor features
features = df[['Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5', 'CPI', 'Unemployment',
               'Size','Jewelry', 'Bathroom', 'Beading', 'Paint', 'Sportswear', 'Music', 'Fruit', 'Laundry', 'Heating_Cooling',
               'Gift_cards', 'Baby_Essentials', 'Car_Seats', 'Strollers', 'Photo', 'Air_Quality', 'Light_bulbs', 'Gardening',
               'Building_Materials', 'Kids_Room', 'Lighting']]
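
The VIF < 5 filter itself is not shown in this section. As a minimal sketch of what such a filter does, assuming each VIF is computed as 1 / (1 - R²) from regressing one predictor on all the others plus an intercept, the following uses made-up columns `x0`, `x1`, `x2` (hypothetical names and data, not the master dataset):

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of X.

    VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing column j
    on all other columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    out = np.empty(n_cols)
    for j in range(n_cols):
        target = X[:, j]
        others = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Hypothetical data: x2 is almost a copy of x0, x1 is independent of both.
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = rng.normal(size=500)
x2 = x0 + 0.05 * rng.normal(size=500)

names = ["x0", "x1", "x2"]
scores = vif(np.column_stack([x0, x1, x2]))

# Keep only the columns below the VIF < 5 threshold used in the text.
kept = [name for name, score in zip(names, scores) if score < 5]
print(dict(zip(names, scores.round(2))), kept)
```

The two nearly collinear columns receive very large VIFs and are dropped, while the independent column stays close to 1 and is kept.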

We now create a column for auto items using the VIFs as weighting factors:

In [4]:
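# Target: VIF-weighted combination of the auto-related columns
# Note: the constant 73 in the numerator is not a weighted column; it adds 73/62.1 to every target value.
df['Auto-items'] = (29.4*df['Auto'] + 14.0*df['Precious_Metals'] + 7.3*df['Hardware'] + 73 + 
                    11.4*df['Tools'])/(29.4 + 14.0 + 7.3 + 11.4)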
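
Writing the weights out twice, once in the numerator and once in the denominator, makes it easy for the two to drift apart. A minimal sketch of a helper that derives both from one dictionary is shown below. The function name `weighted_target` and the toy values are assumptions for illustration; the weights are the ones from the cell above, and the sketch does not reproduce the constant 73 that appears there.

```python
import numpy as np

def weighted_target(columns, weights):
    """Weighted mean of several columns.

    `columns[name]` must return a 1-D numeric sequence (a dict of lists
    or a pandas DataFrame both work); `weights` maps column name -> weight.
    """
    total = sum(weights.values())
    acc = sum(w * np.asarray(columns[name], dtype=float)
              for name, w in weights.items())
    return acc / total

# Weights taken from the Auto-items cell; the data values are made up.
auto_weights = {"Auto": 29.4, "Precious_Metals": 14.0,
                "Hardware": 7.3, "Tools": 11.4}
toy = {"Auto": [100.0, 200.0], "Precious_Metals": [10.0, 20.0],
       "Hardware": [50.0, 0.0], "Tools": [30.0, 60.0]}

print(weighted_target(toy, auto_weights))
```

Because the denominator is always the sum of the weights actually used, a column of identical values is returned unchanged, which is a quick sanity check for a weighted mean.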

#### 18.1.1.1 Linear Regression Using Statsmodels
We now run multiple regression with statsmodels. First, we fit and summarize the OLS model without an intercept, which forces the fitted surface through the origin.

##### 18.1.1.1.1 Linear Regression Using Statsmodels - without intercept

In [5]:
# Fit and summarize OLS model
mod = sm.OLS(df['Auto-items'], features)

In [6]:
res = mod.fit()

In [7]:
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:             Auto-items   R-squared:                       0.916
Model:                            OLS   Adj. R-squared:                  0.916
Method:                 Least Squares   F-statistic:                     2981.
Date:                Wed, 13 Dec 2017   Prob (F-statistic):               0.00
Time:                        20:49:39   Log-Likelihood:                -83515.
No. Observations:                8190   AIC:                         1.671e+05
Df Residuals:                    8160   BIC:                         1.673e+05
Df Model:                          30                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Temperature            6.9854      4

##### 18.1.1.1.2 Linear Regression Using Statsmodels - with intercept

In [8]:
# add a constant column so that the OLS model includes an intercept
X = sm.add_constant(features)

In [9]:
# Fit and summarize OLS model
mod = sm.OLS(df['Auto-items'], X)

In [10]:
res = mod.fit()

In [11]:
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:             Auto-items   R-squared:                       0.664
Model:                            OLS   Adj. R-squared:                  0.663
Method:                 Least Squares   F-statistic:                     537.0
Date:                Wed, 13 Dec 2017   Prob (F-statistic):               0.00
Time:                        20:50:22   Log-Likelihood:                -83508.
No. Observations:                8190   AIC:                         1.671e+05
Df Residuals:                    8159   BIC:                         1.673e+05
Df Model:                          30                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const               3885.2340   1003
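
The two summaries report very different R-squared values (0.916 without an intercept, 0.664 with one), even though their log-likelihoods are almost identical. The two figures are not directly comparable: according to the statsmodels documentation, R-squared is computed against the centered total sum of squares when a constant is included and against the uncentered total sum of squares when it is omitted. The sketch below reproduces that effect with plain NumPy on made-up data; the helper `ols_r2` and the simulated values are assumptions for illustration only.

```python
import numpy as np

def ols_r2(X, y, add_const):
    """Fit OLS by least squares; return (centered R2, uncentered R2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if add_const:
        X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ssr = np.sum((y - X @ beta) ** 2)                    # residual sum of squares
    centered = 1.0 - ssr / np.sum((y - y.mean()) ** 2)   # TSS around the mean
    uncentered = 1.0 - ssr / np.sum(y ** 2)              # TSS around zero
    return centered, uncentered

# Made-up data: a positive predictor and a target far from zero.
rng = np.random.default_rng(0)
x = rng.uniform(50, 100, size=500)
y = 200 + 0.5 * x + rng.normal(scale=20, size=500)
X = x.reshape(-1, 1)

with_c, _ = ols_r2(X, y, add_const=True)
no_c_centered, no_c_uncentered = ols_r2(X, y, add_const=False)

print(f"with intercept, centered R2:      {with_c:.3f}")
print(f"without intercept, uncentered R2: {no_c_uncentered:.3f}")
print(f"without intercept, centered R2:   {no_c_centered:.3f}")
```

When the target has a mean far from zero, the uncentered figure is inflated because the model is credited with explaining the mean itself. Measured on the same centered scale, the no-intercept fit can never beat the fit with an intercept.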

#### 18.1.1.2 Linear Regression Using SKLearn

We now fit the same linear regression with scikit-learn.

We first create NumPy arrays of the features and target, reshape the target into a column vector, and fit a model:

In [12]:
#create numpy array
X = features.values
y = df['Auto-items'].values

In [13]:
#Reshape y
y = y.reshape(-1,1)
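
As a small illustration of the reshape step, on made-up values: `reshape(-1, 1)` turns a 1-D array into a single-column 2-D array, with `-1` telling NumPy to infer the number of rows.

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0])   # made-up target values, shape (3,)
y_col = y.reshape(-1, 1)        # one column, row count inferred

print(y.shape, y_col.shape)     # (3,) (3, 1)
```

`LinearRegression` accepts a target of either shape. Passing the column form is why the predictions, `coef_` and `intercept_` shown later come back as 2-D or wrapped arrays rather than flat ones.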

In [14]:
#fit regression model
lm = linear_model.LinearRegression()
model = lm.fit(X,y)

The lm.fit() function fits a linear model. We want to use the model to make predictions, so we’ll use lm.predict():

In [15]:
#run prediction 
predictions = lm.predict(X)

The cell below prints the first five predictions for y; removing [0:5] would print the entire list:

In [16]:
print((predictions)[0:5])

[[ 16751.57615155]
 [ 26312.91041307]
 [ 23880.26436963]
 [ 28823.90757628]
 [ 19912.86796577]]


In [17]:
# R-squared of the model on the data it was fitted to
lm.score(X,y)

0.6638054762066663

In [18]:
lm.coef_

array([[  7.19661999e+00,  -7.10174001e+01,  -1.31491736e-03,
         -2.79359384e-02,   2.32647632e-02,   8.49310876e-02,
          6.51977549e-03,  -3.23784846e+00,   1.86582578e+02,
          4.52467900e-03,   1.27279374e-01,   3.68407255e-01,
          1.38375589e-01,   6.83289172e-01,   4.31712574e-01,
          1.47206523e-02,   8.32498722e-01,   1.62528086e+03,
         -1.67316147e+00,   1.85668491e+00,  -7.40211906e-01,
         -8.34910017e-02,   1.67321443e+01,   8.96308552e+00,
         -8.26558314e-02,   1.13306768e-01,   1.02010242e+00,
         -3.68752370e+00,   2.44895760e+00,   3.52037440e+01]])

In [19]:
lm.intercept_

array([ 3885.23398773])
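
To make the roles of `coef_` and `intercept_` concrete, the following sketch rebuilds predictions by hand as `X @ coef_.T + intercept_`. It uses plain NumPy least squares on made-up data as a stand-in for the fitted scikit-learn model, so the arrays named `coef` and `intercept` are assumptions that only mimic the shapes seen above.

```python
import numpy as np

# Made-up data: 200 rows, 3 features, target stored as a column vector.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
noise = rng.normal(scale=0.1, size=200)
y = (4.0 + X @ np.array([2.0, -1.0, 0.5]) + noise).reshape(-1, 1)

# Least squares with an explicit column of ones for the intercept.
design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)   # shape (4, 1)

intercept = beta[0]      # shape (1,), plays the role of lm.intercept_
coef = beta[1:].T        # shape (1, 3), plays the role of lm.coef_

# Each prediction is the intercept plus the coefficient-weighted features.
predictions = X @ coef.T + intercept                # shape (200, 1)
print(coef.shape, intercept.shape, predictions.shape)
```

Read this way, each entry of `coef_` is the change in the predicted target for a one-unit change in the corresponding feature with the others held fixed, and `intercept_` is the prediction when every feature is zero.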

### 18.2.1 TARGET - Fashion-items
Objective: Predict Fashion Items sales performance

In [22]:
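# Target: VIF-weighted combination of the clothing-related columns
# Note: the denominator omits the weights 21.4, 11.3, 34.5 and 6.7 used in the numerator,
# so the result is a rescaled weighted sum rather than a true weighted mean.
df['Wears-items'] = (23.0*df['Clearance_Clothings'] + 18.7*df['Boys_Clothing'] + 9.1*df['Girls_Clothing'] + 
                     21.4*df['Women_Clothing'] + 13.9*df['Intimates_Sleepwears'] + 11.3*df['Men_Clothings'] + 
                     34.5*df['Active_Wear'] + 18.7*df['Adult_Shoes'] + 8.4*df['Bags_Accessories'] + 6.7*df['Luggage'] + 
                     44.2*df['Swim_Shop'] + 5.7*df['Pioneer_Woman']
                    )/(23.0 + 18.7 + 9.1 + 13.9 + 18.7 + 8.4 + 44.2 + 5.7)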

#### 18.2.1.1 Linear Regression Using Statsmodels - without intercept

In [23]:
# Fit and summarize OLS model
mod = sm.OLS(df['Wears-items'], features)

In [24]:
res = mod.fit()

In [25]:
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:            Wears-items   R-squared:                       0.943
Model:                            OLS   Adj. R-squared:                  0.942
Method:                 Least Squares   F-statistic:                     4471.
Date:                Wed, 13 Dec 2017   Prob (F-statistic):               0.00
Time:                        20:54:17   Log-Likelihood:                -77936.
No. Observations:                8190   AIC:                         1.559e+05
Df Residuals:                    8160   BIC:                         1.561e+05
Df Model:                          30                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Temperature            9.4944      2

#### 18.2.1.2 Linear Regression Using Statsmodels - with intercept

In [26]:
# add a constant column so that the OLS model includes an intercept
X = sm.add_constant(features)

In [27]:
# Fit and summarize OLS model
mod = sm.OLS(df['Wears-items'], X)

In [28]:
res = mod.fit()

In [29]:
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:            Wears-items   R-squared:                       0.785
Model:                            OLS   Adj. R-squared:                  0.784
Method:                 Least Squares   F-statistic:                     990.8
Date:                Wed, 13 Dec 2017   Prob (F-statistic):               0.00
Time:                        20:54:49   Log-Likelihood:                -77927.
No. Observations:                8190   AIC:                         1.559e+05
Df Residuals:                    8159   BIC:                         1.561e+05
Df Model:                          30                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const               2127.3615    507

#### 18.2.2 Linear Regression Using SKLearn

In [30]:
#create numpy array
X = features.values
y = df['Wears-items'].values

In [31]:
#Reshape y
y = y.reshape(-1,1)

In [32]:
#fit regression model
lm = linear_model.LinearRegression()
model = lm.fit(X,y)

In [33]:
#run prediction 
predictions = lm.predict(X)

In [34]:
print((predictions)[0:5])

[[ 11061.508697  ]
 [ 19659.40487071]
 [ 15040.27919271]
 [ 17474.41428156]
 [ 13389.42792125]]


In [35]:
# R-squared of the model on the data it was fitted to
lm.score(X,y)

0.78462742270889585

In [36]:
lm.coef_

array([[  9.61003305e+00,  -7.87158267e+01,  -1.08792092e-02,
         -8.63172503e-03,   6.10842243e-03,   2.87603593e-02,
          2.25984280e-03,  -4.17814759e+00,  -4.40761240e+01,
          1.12197448e-03,   1.41459291e-01,   2.78163312e-01,
         -1.13398271e-02,   9.02091900e-01,   7.41929394e-01,
         -3.86825051e-03,   5.38385944e-01,   1.18368817e+02,
          9.06782311e+00,   8.80891500e-01,  -4.55285820e-01,
          4.65539770e-02,  -1.27717651e+01,   4.76266947e+00,
          3.07136541e-02,   1.88269675e-02,   4.05767647e-01,
          2.35321634e-01,   3.23859873e-01,   1.19230971e+01]])

In [37]:
lm.intercept_

array([ 2127.36148067])

### 18.3.1 TARGET - Food-items
Objective: Predict Food Items sales performance

In [38]:
df['Food-items'] = (10.8*df['Office_supplies '] + 88.6*df['School_Supplies'] + 44.6*df['Home_Office'] +
                    11.0*df['Craft_general'] + 134.3*df['Books'])/(10.8 + 88.6 + 44.6 + 11.0 + 134.3)

#### 18.3.1.1 Linear Regression Using Statsmodels - without intercept

In [39]:
# Fit and summarize OLS model
mod = sm.OLS(df['Food-items'], features)

In [40]:
res = mod.fit()

In [41]:
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:             Food-items   R-squared:                       0.886
Model:                            OLS   Adj. R-squared:                  0.886
Method:                 Least Squares   F-statistic:                     2121.
Date:                Wed, 13 Dec 2017   Prob (F-statistic):               0.00
Time:                        20:57:03   Log-Likelihood:                -88546.
No. Observations:                8190   AIC:                         1.772e+05
Df Residuals:                    8160   BIC:                         1.774e+05
Df Model:                          30                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Temperature           35.3048      7

#### 18.3.1.2 Linear Regression Using Statsmodels - with intercept

In [42]:
# add a constant column so that the OLS model includes an intercept
X = sm.add_constant(features)

In [44]:
# Fit and summarize OLS model
mod = sm.OLS(df['Food-items'], X)

In [45]:
res = mod.fit()

In [46]:
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:             Food-items   R-squared:                       0.554
Model:                            OLS   Adj. R-squared:                  0.553
Method:                 Least Squares   F-statistic:                     338.2
Date:                Wed, 13 Dec 2017   Prob (F-statistic):               0.00
Time:                        20:57:44   Log-Likelihood:                -88546.
No. Observations:                8190   AIC:                         1.772e+05
Df Residuals:                    8159   BIC:                         1.774e+05
Df Model:                          30                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const               -165.0207   1856

#### 18.3.2 Linear Regression Using SKLearn

In [47]:
#create numpy array
X = features.values
y = df['Food-items'].values

In [48]:
#Reshape y
y = y.reshape(-1,1)

In [49]:
#fit regression model
lm = linear_model.LinearRegression()
model = lm.fit(X,y)

In [50]:
#run prediction 
predictions = lm.predict(X)

In [51]:
print((predictions)[0:5])

[[ 46364.4178115 ]
 [ 74605.77633882]
 [ 58046.54415954]
 [ 55294.03057185]
 [ 46644.73452003]]


In [52]:
# R-squared of the model on the data it was fitted to
lm.score(X,y)

0.55425128162889259

In [53]:
lm.coef_

array([[  3.52957899e+01,   1.61633726e+02,  -6.17108951e-02,
         -5.84126002e-02,  -2.29651559e-02,  -5.57845001e-02,
          1.34500202e-02,   4.90492874e+01,   1.88870803e+02,
          1.33169709e-02,   4.62225912e-01,   5.09467483e-01,
         -7.83371455e-02,  -1.96734223e+00,   1.05896019e-01,
          7.00194684e-02,  -2.10903142e-01,  -2.51830902e+03,
         -1.35649863e+01,  -3.35119844e+00,  -9.94915824e-01,
         -1.45006722e+00,  -1.74258833e+01,  -7.01998667e+00,
          1.58918567e-01,   1.55860659e+00,  -7.51032902e-02,
          1.47337817e+01,  -2.28436340e+00,   1.59678354e+02]])

In [54]:
lm.intercept_

array([-165.02066664])