## 17.0 Introduction
We now run further regressions of the following target variables on the predictors retained after the VIF calculation:
* Health-items
* Kids-Items
* Office-Items
* Transport (Auto)-Items
* Fashion (Wears)-Items and
* Food-Items

We will use statsmodels and scikit-learn (sklearn) for multiple regression.

### 17.0.1 Python Libraries
To start, we import the necessary libraries.

In [1]:
#import Libraries
import pandas as pd
import numpy as np
import seaborn as sns; sns.set()
import matplotlib
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn import linear_model
%matplotlib inline  

## 17.1 LINEAR REGRESSION IMPLEMENTING VIF RESULT

### 17.1.0 TARGET - Health-items 
Objective: Predict Health Items sales performance 

In [2]:
# Read sheet 'Sheet1' of 'master_dataset.xlsx' into a DataFrame df using pd.read_excel().
# The keyword is sheet_name; the older spelling sheetname was removed from pandas.
df = pd.read_excel('master_dataset.xlsx', sheet_name='Sheet1')

After importing the master dataset 'master_dataset.xlsx', we create a features DataFrame from the columns with VIF < 5.

In [3]:
# select the 30 predictor features
features = df[['Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5', 'CPI', 'Unemployment',
               'Size','Jewelry', 'Bathroom', 'Beading', 'Paint', 'Sportswear', 'Music', 'Fruit', 'Laundry', 'Heating_Cooling',
               'Gift_cards', 'Baby_Essentials', 'Car_Seats', 'Strollers', 'Photo', 'Air_Quality', 'Light_bulbs', 'Gardening',
               'Building_Materials', 'Kids_Room', 'Lighting']]
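
The VIF calculation itself is not shown in this section. As a reminder of what the VIF < 5 filter means, here is a minimal sketch under the usual definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on all other columns plus an intercept. The function name vif and the synthetic columns a, b, c are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic data: c is nearly a copy of a, b is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.05 * rng.normal(size=500)

vifs = vif(np.column_stack([a, b, c]))
keep = vifs < 5   # same threshold as the one used for the features above
print(vifs, keep)
```

In this toy example only the independent column b passes the filter, because a and c explain each other almost perfectly.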

We now create a column for health items using the VIFs as weighting factors:  

In [4]:
# Target 
df['Health-items'] = (18.900000*df['Pharmaceutical '] + 9.700000*df['Health_beauty'])/(18.900000 + 9.700000)
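
The same weighted average can be wrapped in a small helper so that the denominator is always the sum of the weights actually used in the numerator. This is only a sketch: weighted_target and the two-row toy DataFrame are hypothetical, while the column names and VIF weights are the ones from the cell above.

```python
import pandas as pd

def weighted_target(df, weights):
    """Weighted mean of the columns named in `weights` (hypothetical helper)."""
    cols = list(weights)
    w = pd.Series(weights)
    # Multiply each column by its weight, sum across columns, normalise by the total weight.
    return df[cols].mul(w, axis=1).sum(axis=1) / w.sum()

# Toy data standing in for the master dataset.
toy = pd.DataFrame({'Pharmaceutical ': [100.0, 200.0],
                    'Health_beauty': [50.0, 80.0]})

health = weighted_target(toy, {'Pharmaceutical ': 18.9, 'Health_beauty': 9.7})
print(health)
```

Passing the weights as a dictionary keeps each weight next to its column name, which makes a mismatch between numerator and denominator harder to introduce.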

#### 17.1.1.1 Linear Regression Using Statsmodel
We now run a multiple regression with statsmodels. First, we fit and summarize an OLS model with zero intercept; sm.OLS does not add a constant term unless one is included in the design matrix.

##### 17.1.1.1.1 Linear Regression Using Statsmodel; without intercept

In [5]:
# Fit and summarize OLS model
mod = sm.OLS(df['Health-items'], features)

In [6]:
res = mod.fit()

In [7]:
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:           Health-items   R-squared:                       0.888
Model:                            OLS   Adj. R-squared:                  0.888
Method:                 Least Squares   F-statistic:                     2162.
Date:                Wed, 13 Dec 2017   Prob (F-statistic):               0.00
Time:                        20:44:28   Log-Likelihood:                -83486.
No. Observations:                8190   AIC:                         1.670e+05
Df Residuals:                    8160   BIC:                         1.672e+05
Df Model:                          30                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Temperature            7.6780      4

##### 17.1.1.1.2 Linear Regression Using Statsmodel - with intercept

In [8]:
# Add a constant column so that the OLS model is fitted with an intercept
X = sm.add_constant(features)

In [9]:
# Fit and summarize OLS model
mod = sm.OLS(df['Health-items'], X)

In [10]:
res = mod.fit()

In [11]:
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:           Health-items   R-squared:                       0.754
Model:                            OLS   Adj. R-squared:                  0.753
Method:                 Least Squares   F-statistic:                     831.9
Date:                Wed, 13 Dec 2017   Prob (F-statistic):               0.00
Time:                        20:44:28   Log-Likelihood:                -83484.
No. Observations:                8190   AIC:                         1.670e+05
Df Residuals:                    8159   BIC:                         1.672e+05
Df Model:                          30                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const              -2020.6795   1000
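
The two summaries report almost the same log-likelihood but very different R-squared values (0.888 without intercept, 0.754 with). This is expected: when the model has no constant, statsmodels reports the uncentered R-squared, 1 - SSR / sum(y^2), instead of the centered one, 1 - SSR / sum((y - mean(y))^2), so the two numbers are not directly comparable. The sketch below reproduces the effect with NumPy on synthetic data; the data and the helper ols_ssr are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, size=200)            # predictor with a non-zero mean
y = 5.0 + 2.0 * x + rng.normal(size=200)      # true model has an intercept

def ols_ssr(X, y):
    """Least-squares fit; returns the residual sum of squares (hypothetical helper)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

# Without intercept: uncentered total sum of squares in the denominator.
ssr0 = ols_ssr(x[:, None], y)
r2_uncentered = 1 - ssr0 / np.sum(y ** 2)

# With intercept: centered total sum of squares in the denominator.
ssr1 = ols_ssr(np.column_stack([np.ones_like(x), x]), y)
r2_centered = 1 - ssr1 / np.sum((y - y.mean()) ** 2)

print(r2_uncentered, r2_centered)
```

The model without intercept fits no better (its residual sum of squares is at least as large), yet its reported R-squared is higher because the denominator also counts the mean level of y.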

#### 17.1.1.2 Linear Regression Using SKLearn

We now fit the same linear regression with scikit-learn.

We create numpy arrays of the features and target, reshape the target into a column vector and fit a model:

In [12]:
#create numpy array
X = features.values
y = df['Health-items'].values

In [13]:
#Reshape y
y = y.reshape(-1,1)
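
As a quick illustration of what the reshape does, the sketch below uses a toy three-element array (not data from the notebook). reshape(-1, 1) turns a 1-D array of shape (n,) into a column of shape (n, 1). LinearRegression also accepts a 1-D target; a 2-D target is why the predictions and coefficients are 2-D arrays.

```python
import numpy as np

y = np.array([3.0, 1.0, 2.0])      # 1-D target, shape (3,)
y_col = y.reshape(-1, 1)           # column vector, shape (3, 1); -1 lets numpy infer the length
y_flat = y_col.ravel()             # back to 1-D when a flat array is needed

print(y.shape, y_col.shape, y_flat.shape)
```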

In [14]:
#fit regression model
lm = linear_model.LinearRegression()
model = lm.fit(X,y)

The lm.fit() function fits a linear model. We want to use the model to make predictions, so we’ll use lm.predict():

In [15]:
#run prediction 
predictions = lm.predict(X)

We print only the first five predictions for y to save space; removing [0:5] would print the entire array:

In [16]:
print((predictions)[0:5])

[[ 12393.67082356]
 [ 27902.48731316]
 [ 20597.71855269]
 [ 28357.00684406]
 [ 16121.94106584]]


In [17]:
# R-squared of the fitted model, evaluated on the same data used for fitting
lm.score(X,y)

0.75362809113826268

In [18]:
lm.coef_

array([[  7.56816197e+00,   1.16695272e+02,   2.68396122e-02,
          3.45733483e-03,   2.94697832e-03,  -2.98601001e-03,
          6.32707305e-03,  -3.08886963e+00,   5.24475035e+01,
          2.45185444e-04,   2.32175293e-01,   2.53462465e-01,
          2.05170776e-01,   7.75290490e-01,   4.33998868e-01,
          8.88484136e-03,   3.59485911e-01,   9.12623494e+02,
          3.14952902e+00,   2.17203262e+00,  -1.89185057e-02,
         -4.23863037e-01,   2.00907055e+01,   8.01038134e+00,
         -3.96425476e-02,   1.06481875e-01,   2.37198889e+00,
          4.58462350e-01,   4.01177252e+00,   1.64995454e+01]])

In [19]:
lm.intercept_

array([-2020.67947819])
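
The intercept (-2020.679) and the score (0.7536) agree with the const term and R-squared of the statsmodels fit with intercept. LinearRegression fits an intercept by default (fit_intercept=True), which is equivalent to ordinary least squares on a design matrix with an added column of ones. A minimal NumPy sketch of that equivalence, on synthetic data with assumed true coefficients:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
true_coef = np.array([1.5, -2.0, 0.5])                 # assumed values for the toy data
y = -4.0 + X @ true_coef + 0.1 * rng.normal(size=100)

# Prepend a column of ones, as sm.add_constant does for the statsmodels fit.
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, coef = beta[0], beta[1:]

# Centered R-squared, the quantity returned by lm.score(X, y).
pred = intercept + X @ coef
r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

print(intercept, coef, r2)
```

Here intercept plays the role of lm.intercept_ and coef the role of lm.coef_.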

### 17.2.1 TARGET - Kids-items
Objective: Predict Kids Items sales performance

In [20]:
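# Note: the denominator omits the 16.6 (Baby_Kids_Shoes) and 6.7 (Cribs) weights, so the target
# is the weighted mean multiplied by a constant. R-squared is unaffected by this rescaling;
# predictions, coefficients and the intercept scale by the same constant.
df['Kids-items'] = (19.600000*df['Toy '] + 48.6*df['School_Uniforms'] + 26.9*df['Baby_Toddlers_Clothing'] + 
                    16.6*df['Baby_Kids_Shoes'] + 6.7*df['Cribs'] + 11.4*df['Bikes'] + 31.9*df['Teen_Room']
                   )/(19.600000 + 48.6 + 26.9 + 11.4 + 31.9)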

#### 17.2.1.1 Linear Regression Using Statsmodel - without intercept

In [21]:
# Fit and summarize OLS model
mod = sm.OLS(df['Kids-items'], features)

In [22]:
res = mod.fit()

In [23]:
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:             Kids-items   R-squared:                       0.943
Model:                            OLS   Adj. R-squared:                  0.943
Method:                 Least Squares   F-statistic:                     4525.
Date:                Wed, 13 Dec 2017   Prob (F-statistic):               0.00
Time:                        20:44:29   Log-Likelihood:                -80980.
No. Observations:                8190   AIC:                         1.620e+05
Df Residuals:                    8160   BIC:                         1.622e+05
Df Model:                          30                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Temperature            7.4543      3

#### 17.2.1.2 Linear Regression Using Statsmodel - with intercept

In [24]:
# Add a constant column so that the OLS model is fitted with an intercept
X = sm.add_constant(features)

In [25]:
# Fit and summarize OLS model
mod = sm.OLS(df['Kids-items'], X)

In [26]:
res = mod.fit()

In [27]:
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:             Kids-items   R-squared:                       0.829
Model:                            OLS   Adj. R-squared:                  0.828
Method:                 Least Squares   F-statistic:                     1316.
Date:                Wed, 13 Dec 2017   Prob (F-statistic):               0.00
Time:                        20:44:29   Log-Likelihood:                -80975.
No. Observations:                8190   AIC:                         1.620e+05
Df Residuals:                    8159   BIC:                         1.622e+05
Df Model:                          30                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const               2421.7677    736

#### 17.2.2 Linear Regression Using SKLearn

In [28]:
#create numpy array
X = features.values
y = df['Kids-items'].values

In [29]:
#Reshape y
y = y.reshape(-1,1)

In [30]:
#fit regression model
lm = linear_model.LinearRegression()
model = lm.fit(X,y)

In [31]:
#run prediction 
predictions = lm.predict(X)

In [32]:
print((predictions)[0:5])

[[ 10602.32849381]
 [ 23489.86200449]
 [ 18639.41378362]
 [ 21395.04983033]
 [ 15715.64316725]]


In [33]:
# generate the score
lm.score(X,y)

0.82871181096273983

In [34]:
lm.coef_

array([[  7.58596378e+00,   2.23054462e+02,  -1.54184052e-02,
         -3.33576308e-02,   7.45956551e-03,   6.49702069e-02,
          6.35291447e-03,  -1.37759187e+01,   6.91346635e+01,
         -2.95057534e-03,   1.74280001e-01,   4.55231841e-01,
          8.78361468e-02,   8.16047189e-01,   1.28167983e+00,
         -1.97235477e-02,   9.08791405e-01,   1.09490251e+03,
          1.91347695e+01,   2.53022058e+00,  -4.92909342e-01,
         -9.79278772e-02,  -1.71394925e+01,   6.39967873e+00,
         -7.10702631e-03,   3.46024188e-02,   1.29470977e+00,
         -1.83425481e+00,   9.36868594e-01,   1.04460773e+01]])

In [35]:
lm.intercept_

array([ 2421.76774539])

### 17.3.1 TARGET - Office-items
Objective: Predict Office Items sales performance

In [36]:
df['Office-items'] = (10.800000*df['Office_supplies '] + 88.600000*df['School_Supplies'] + 44.600000*df['Home_Office'] +
                     11.000000*df['Craft_general'] + 134.3*df['Books'])/(10.800000 + 88.600000 + 44.600000 + 11.000000 + 134.3)

#### 17.3.1.1 Linear Regression Using Statsmodel - without intercept

In [37]:
# Fit and summarize OLS model
mod = sm.OLS(df['Office-items'], features)

In [38]:
res = mod.fit()

In [39]:
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:           Office-items   R-squared:                       0.886
Model:                            OLS   Adj. R-squared:                  0.886
Method:                 Least Squares   F-statistic:                     2121.
Date:                Wed, 13 Dec 2017   Prob (F-statistic):               0.00
Time:                        20:44:29   Log-Likelihood:                -88546.
No. Observations:                8190   AIC:                         1.772e+05
Df Residuals:                    8160   BIC:                         1.774e+05
Df Model:                          30                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Temperature           35.3048      7

#### 17.3.1.2 Linear Regression Using Statsmodel - with intercept

In [40]:
# Add a constant column so that the OLS model is fitted with an intercept
X = sm.add_constant(features)

In [41]:
# Fit and summarize OLS model
mod = sm.OLS(df['Office-items'], X)

In [42]:
res = mod.fit()

In [43]:
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:           Office-items   R-squared:                       0.554
Model:                            OLS   Adj. R-squared:                  0.553
Method:                 Least Squares   F-statistic:                     338.2
Date:                Wed, 13 Dec 2017   Prob (F-statistic):               0.00
Time:                        20:44:29   Log-Likelihood:                -88546.
No. Observations:                8190   AIC:                         1.772e+05
Df Residuals:                    8159   BIC:                         1.774e+05
Df Model:                          30                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const               -165.0207   1856

#### 17.3.2 Linear Regression Using SKLearn

In [44]:
#create numpy array
X = features.values
y = df['Office-items'].values

In [45]:
#Reshape y
y = y.reshape(-1,1)

In [46]:
#fit regression model
lm = linear_model.LinearRegression()
model = lm.fit(X,y)

In [47]:
#run prediction 
predictions = lm.predict(X)

In [48]:
print((predictions)[0:5])

[[ 46364.4178115 ]
 [ 74605.77633882]
 [ 58046.54415954]
 [ 55294.03057185]
 [ 46644.73452003]]


In [49]:
# generate the score
lm.score(X,y)

0.55425128162889259

In [50]:
lm.coef_

array([[  3.52957899e+01,   1.61633726e+02,  -6.17108951e-02,
         -5.84126002e-02,  -2.29651559e-02,  -5.57845001e-02,
          1.34500202e-02,   4.90492874e+01,   1.88870803e+02,
          1.33169709e-02,   4.62225912e-01,   5.09467483e-01,
         -7.83371455e-02,  -1.96734223e+00,   1.05896019e-01,
          7.00194684e-02,  -2.10903142e-01,  -2.51830902e+03,
         -1.35649863e+01,  -3.35119844e+00,  -9.94915824e-01,
         -1.45006722e+00,  -1.74258833e+01,  -7.01998667e+00,
          1.58918567e-01,   1.55860659e+00,  -7.51032902e-02,
          1.47337817e+01,  -2.28436340e+00,   1.59678354e+02]])

In [51]:
lm.intercept_

array([-165.02066664])