# Chapter 3 - Linear Regression

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
import statsmodels.formula.api as smf

In [2]:
advertising = pd.read_csv('Data/Advertising.csv')

In [3]:
advertising = advertising.drop(["Unnamed: 0"], axis=1)

In [4]:
advertising.head()

,TV,radio,newspaper,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9


## Simple Linear Regression

$$Y=\beta_0+\beta_1X+\epsilon$$

In [5]:
# Data preparation: center TV at its mean (with_std=False skips variance scaling)
sc_X = StandardScaler(with_std = False)
X = sc_X.fit_transform(advertising.TV.values.reshape(-1,1))
y = advertising.sales

In [6]:
# Linear Model
regressor = LinearRegression()
regressor.fit(X,y)
print(regressor.coef_)
print(regressor.intercept_)

[ 0.04753664]
14.0225


In [7]:
# without scaling (as in book)
X = advertising.TV.values.reshape(-1,1)
y = advertising.sales

regressor = LinearRegression()
regressor.fit(X,y)
print(regressor.coef_)
print(regressor.intercept_)

[ 0.04753664]
7.03259354913
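The two fits above share the same slope but have different intercepts. That is expected: subtracting the mean from the predictor shifts the line horizontally without tilting it, and the intercept of the centered fit is simply the mean of the response (14.0225 is the mean of sales). Here is a small sketch of that fact on synthetic data; the data and the helper name `fit_line` are made up for illustration.

```python
import numpy as np

# Synthetic stand-in for the TV/sales data (values are illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 300, size=200)
y = 7.0 + 0.05 * x + rng.normal(0, 1, size=200)

def fit_line(x, y):
    """Return (intercept, slope) of the least squares line."""
    slope, intercept = np.polyfit(x, y, deg=1)  # polyfit returns highest degree first
    return intercept, slope

b0_raw, b1_raw = fit_line(x, y)
b0_centered, b1_centered = fit_line(x - x.mean(), y)

print(b1_raw, b1_centered)                    # the slopes agree
print(b0_centered, y.mean())                  # centered intercept equals the mean of y
print(b0_raw, y.mean() - b1_raw * x.mean())   # raw intercept recovered from the centered fit
```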


In [8]:
# Table of statistical tests
est = smf.ols('sales ~ TV', advertising).fit()
est.summary().tables[1]

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,7.0326,0.458,15.360,0.000,6.130,7.935
TV,0.0475,0.003,17.668,0.000,0.042,0.053


In [9]:
# Confidence intervals for the slope and intercept terms
est.conf_int(alpha=0.05, cols = None)

,0,1
Intercept,6.129719,7.935468
TV,0.042231,0.052843


Above, we see that both 95% confidence intervals exclude zero: roughly [6.130, 7.935] for the intercept $\hat\beta_0$ and [0.042, 0.053] for the slope $\hat\beta_1$. (The estimates sit in the middle of their intervals by construction, so that alone tells us nothing.) Additionally, for each parameter the t-statistic is huge (we usually only need about 2 in absolute value to achieve significance at the 5% level, but both of ours are over 15!) and the p-values are very small, giving strong evidence of a relationship between TV advertising expenditure and sales.
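To make the table less of a black box, here is a rough sketch of how the slope's standard error, t-statistic, and an approximate 95% interval can be computed by hand for simple linear regression. It runs on synthetic data (not the advertising set) and uses the "estimate plus or minus 2 standard errors" approximation, so the interval is slightly different from the exact t-based one that statsmodels reports.

```python
import numpy as np

# Synthetic data, for illustration only
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 300, size=n)
y = 7.0 + 0.05 * x + rng.normal(0, 3, size=n)

x_bar, y_bar = x.mean(), y.mean()
sxx = np.sum((x - x_bar) ** 2)

# Least squares estimates
slope = np.sum((x - x_bar) * (y - y_bar)) / sxx
intercept = y_bar - slope * x_bar

# Residual variance estimate: RSS / (n - 2)
residuals = y - (intercept + slope * x)
sigma2 = np.sum(residuals ** 2) / (n - 2)

se_slope = np.sqrt(sigma2 / sxx)        # standard error of the slope
t_slope = slope / se_slope              # t-statistic for H0: slope = 0
ci_slope = (slope - 2 * se_slope, slope + 2 * se_slope)  # approximate 95% interval

print(slope, se_slope, t_slope, ci_slope)
```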

In [10]:
est.summary().tables[0]

Dep. Variable:,sales,R-squared:,0.612
Model:,OLS,Adj. R-squared:,0.61
Method:,Least Squares,F-statistic:,312.1
Date:,"Thu, 14 Jun 2018",Prob (F-statistic):,1.47e-42
Time:,10:21:05,Log-Likelihood:,-519.05
No. Observations:,200,AIC:,1042.0
Df Residuals:,198,BIC:,1049.0
Df Model:,1,,
Covariance Type:,nonrobust,,


Finally, from our R-squared value in the top right corner of this table, we see that TV expenditure alone explains about 61% of the variance in sales.
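For reference, $R^2 = 1 - \text{RSS}/\text{TSS}$, and in simple linear regression it equals the squared correlation between the predictor and the response. That matches the numbers in this notebook: the TV/sales correlation reported further down is 0.782224, and $0.782224^2 \approx 0.612$. A minimal sketch on synthetic data (the data are made up; only the identity matters):

```python
import numpy as np

# Synthetic data, for illustration only
rng = np.random.default_rng(2)
x = rng.uniform(0, 300, size=200)
y = 7.0 + 0.05 * x + rng.normal(0, 3, size=200)

slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
r_squared = 1 - rss / tss

r = np.corrcoef(x, y)[0, 1]           # sample correlation between x and y
print(r_squared, r ** 2)              # the two values agree
```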

## Multiple Linear Regression

$$Y=\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_pX_p+\epsilon$$

In our advertising dataset, this equation becomes:
$$\text{sales}=\beta_0+\beta_1\cdot\text{TV}+\beta_2\cdot\text{radio}+\beta_3\cdot\text{newspaper}+\epsilon$$

Now, given our estimated parameters $\hat\beta_0,\hat\beta_1,\dots,\hat\beta_p$, we make predictions using $\hat y=\hat\beta_0+\hat\beta_1x_1+\hat\beta_2x_2+\dots+\hat\beta_px_p$.
We estimate the parameters $\hat\beta_0,\hat\beta_1,\dots,\hat\beta_p$ by finding the values that minimize the RSS (residual sum of squares), using the multiple least squares method.
$$\text{RSS}=\sum^{n}_{i=1}(y_i-\hat y_i)^2$$
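As a sanity check on the idea of "the values that minimize the RSS", the sketch below solves the least squares problem directly with `np.linalg.lstsq` on synthetic data and compares the RSS at the solution with the RSS at other coefficient vectors. The data, `true_beta`, and the helper `rss` are illustrative assumptions, not part of the advertising analysis.

```python
import numpy as np

# Synthetic data with three predictors, for illustration only
rng = np.random.default_rng(3)
n = 200
X = rng.uniform(0, 100, size=(n, 3))
true_beta = np.array([3.0, 0.05, 0.2, 0.0])   # intercept followed by three slopes
y = true_beta[0] + X @ true_beta[1:] + rng.normal(0, 1, size=n)

# Design matrix with a leading column of ones for the intercept
design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)

def rss(beta):
    """Residual sum of squares for a given coefficient vector."""
    return np.sum((y - design @ beta) ** 2)

print(beta_hat)
print(rss(beta_hat), rss(true_beta))   # the fitted RSS is never larger
```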

In [11]:
# multiple least squares with statsmodels
multiple_est = smf.ols('sales ~ TV + radio + newspaper', advertising).fit()
multiple_est.summary()

Dep. Variable:,sales,R-squared:,0.897
Model:,OLS,Adj. R-squared:,0.896
Method:,Least Squares,F-statistic:,570.3
Date:,"Thu, 14 Jun 2018",Prob (F-statistic):,1.58e-96
Time:,10:21:10,Log-Likelihood:,-386.18
No. Observations:,200,AIC:,780.4
Df Residuals:,196,BIC:,793.6
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,2.9389,0.312,9.422,0.000,2.324,3.554
TV,0.0458,0.001,32.809,0.000,0.043,0.049
radio,0.1885,0.009,21.893,0.000,0.172,0.206
newspaper,-0.0010,0.006,-0.177,0.860,-0.013,0.011

Omnibus:,60.414,Durbin-Watson:,2.084
Prob(Omnibus):,0.0,Jarque-Bera (JB):,151.241
Skew:,-1.327,Prob(JB):,1.44e-33
Kurtosis:,6.332,Cond. No.,454.0


As seen from the p-values above, TV and radio are significantly associated with sales once the other media are accounted for, but newspaper is not (p = 0.860). It may make sense to leave newspaper out of our model.

In [12]:
# Linear regression with Scikit-learn
multiple_X = advertising[['TV', 'radio', 'newspaper']]
multiple_y = advertising[['sales']]

regressor = LinearRegression()
regressor.fit(multiple_X, multiple_y)
print(regressor.coef_)
print(regressor.intercept_)

[[ 0.04576465  0.18853002 -0.00103749]]
[ 2.93888937]


In [13]:
# correlations
advertising.corr()

,TV,radio,newspaper,sales
TV,1.0,0.054809,0.056648,0.782224
radio,0.054809,1.0,0.354104,0.576223
newspaper,0.056648,0.354104,1.0,0.228299
sales,0.782224,0.576223,0.228299,1.0


### Important questions when it comes to applying multiple linear regression

1. How do we choose the best subset of predictors/features for our model? When there are $p$ predictors, there are $2^p$ possible subsets, which is too many to try exhaustively unless $p$ is small.

#### Forward Selection
* begin with a null model that contains an intercept but no predictors
* fit $p$ simple linear regression models and add to the null model the variable that results in the lowest RSS
* add to that model the variable that results in the lowest RSS for the new two-variable model
* continue until some pre-defined stopping rule is met (e.g. no remaining variable would enter with a p-value below a chosen threshold)
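The forward selection steps can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions: the stopping rule is a fixed number of features (`max_features`) rather than a p-value threshold, and the function names `rss_of` and `forward_selection` are made up for this example.

```python
import numpy as np

def rss_of(X, y, cols):
    """RSS of the least squares fit of y on an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.sum((y - design @ beta) ** 2)

def forward_selection(X, y, max_features):
    """Greedily add the predictor that gives the lowest RSS, up to max_features."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        best = min(remaining, key=lambda j: rss_of(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data: column 2 matters most, column 0 a little, columns 1 and 3 not at all
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 4))
y = 1.0 + 3.0 * X[:, 2] + 1.0 * X[:, 0] + rng.normal(0, 0.5, size=200)

order = forward_selection(X, y, max_features=2)
print(order)  # column indices in the order they were added
```

Note that this is greedy: it fits at most $p + (p-1) + \dots$ models instead of all $2^p$ subsets, so it is not guaranteed to find the best subset.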

#### Backward Selection
* start with all variables in the model
* remove the variable with the largest p-value
* fit the new $(p-1)$-variable model, then again remove the variable with the largest p-value
* continue until some stopping rule is reached (e.g. all remaining variables have p-values below a chosen threshold)
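Backward selection can be sketched in the same spirit. Within a single fitted model every coefficient shares the same residual degrees of freedom, so the variable with the largest p-value is also the one with the smallest $|t|$; the sketch below uses that to avoid computing p-values. The threshold of 2 is an assumed rule of thumb, and the function names are hypothetical.

```python
import numpy as np

def t_statistics(X, y, cols):
    """t-statistics of the non-intercept coefficients for the chosen columns."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = resid @ resid / (n - design.shape[1])     # residual variance estimate
    cov = sigma2 * np.linalg.inv(design.T @ design)    # covariance of the estimates
    return (beta / np.sqrt(np.diag(cov)))[1:]          # drop the intercept's t

def backward_selection(X, y, t_threshold=2.0):
    """Drop the weakest variable until every remaining |t| reaches the threshold."""
    cols = list(range(X.shape[1]))
    while cols:
        t = np.abs(t_statistics(X, y, cols))
        weakest = int(np.argmin(t))
        if t[weakest] >= t_threshold:
            break
        cols.pop(weakest)
    return cols

# Synthetic data: only columns 0 and 2 carry signal
rng = np.random.default_rng(6)
X = rng.normal(size=(200, 4))
y = 1.0 + 3.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(0, 0.5, size=200)

kept = backward_selection(X, y)
print(kept)  # indices of the columns that survive
```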

2. Dealing with qualitative or categorical data

* We use dummy variables
* Note: these ISLR videos provide the best mathematical explanation I've seen of dummy variables
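A tiny sketch of what a dummy variable does, using made-up numbers: a two-level factor is encoded as a 0/1 column, the intercept becomes the mean of the baseline group, and the dummy's coefficient becomes the difference between the two group means. The arrays `group` and `balance` are invented for illustration.

```python
import numpy as np

# Made-up data: a two-level factor and a numeric response
group = np.array(["Male", "Female", "Female", "Male", "Female", "Male"])
balance = np.array([300.0, 520.0, 480.0, 340.0, 500.0, 320.0])

# Dummy variable: 1 for Female, 0 for the baseline level (Male)
is_female = (group == "Female").astype(float)

design = np.column_stack([np.ones(len(balance)), is_female])
beta, *_ = np.linalg.lstsq(design, balance, rcond=None)
intercept, coef_female = beta

male_mean = balance[group == "Male"].mean()
female_mean = balance[group == "Female"].mean()

print(intercept, male_mean)                   # intercept matches the baseline mean
print(coef_female, female_mean - male_mean)   # coefficient matches the difference in means
```

statsmodels does this encoding automatically for string columns in a formula, which is where labels such as `Gender[T.Female]` come from.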

In [15]:
credit = pd.read_csv('Data/Credit.csv')
credit = credit.drop(["Unnamed: 0"], axis=1)

In [16]:
credit.head()

,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity,Balance
0,14.891,3606,283,2,34,11,Male,No,Yes,Caucasian,333
1,106.025,6645,483,3,82,15,Female,Yes,Yes,Asian,903
2,104.593,7075,514,4,71,11,Male,No,No,Asian,580
3,148.924,9504,681,3,36,11,Female,No,No,Asian,964
4,55.882,4897,357,2,68,16,Male,No,Yes,Caucasian,331


In [17]:
# No significant difference in balance by gender (p = 0.669 below)
est = smf.ols('Balance ~ Gender', credit).fit()
est.summary().tables[1]

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,509.8031,33.128,15.389,0.000,444.675,574.931
Gender[T.Female],19.7331,46.051,0.429,0.669,-70.801,110.267


In [18]:
# No significant difference in balance by ethnicity either
est = smf.ols('Balance ~ Ethnicity', credit).fit()
est.summary().tables[1]

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,531.0000,46.319,11.464,0.000,439.939,622.061
Ethnicity[T.Asian],-18.6863,65.021,-0.287,0.774,-146.515,109.142
Ethnicity[T.Caucasian],-12.5025,56.681,-0.221,0.826,-123.935,98.930


### Extensions of the Linear Model

#### Interactions

In [19]:
# Interaction between TV and radio (in a formula, TV*radio expands to TV + radio + TV:radio)
est = smf.ols('sales ~ TV + radio + TV*radio', advertising).fit()
est.summary().tables[1]

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,6.7502,0.248,27.233,0.000,6.261,7.239
TV,0.0191,0.002,12.699,0.000,0.016,0.022
radio,0.0289,0.009,3.241,0.001,0.011,0.046
TV:radio,0.0011,5.24e-05,20.727,0.000,0.001,0.001


In [20]:
# Having trouble figuring out how to do this with scikit-learn...
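One possible way to do this in scikit-learn, offered as a sketch rather than the definitive answer: build the product column explicitly with `PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)`, which turns two input columns into `[x0, x1, x0*x1]`, and fit `LinearRegression` on the result. The example uses noise-free synthetic data whose columns merely stand in for TV and radio, so the fitted coefficients can be checked against the ones used to generate it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-ins for TV and radio (not the advertising data)
rng = np.random.default_rng(5)
X = rng.uniform(0, 100, size=(200, 2))
y = 6.0 + 0.02 * X[:, 0] + 0.03 * X[:, 1] + 0.001 * X[:, 0] * X[:, 1]

# interaction_only=True keeps x0, x1 and x0*x1 but skips the squared terms
interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_int = interactions.fit_transform(X)

regressor = LinearRegression().fit(X_int, y)
print(regressor.coef_)       # coefficients for x0, x1 and x0*x1
print(regressor.intercept_)
```

With the real data, the same transformer would be applied to `advertising[['TV', 'radio']]`. Adding the product column by hand (`advertising.TV * advertising.radio`) would work just as well.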

In [21]:
# Interaction between a quantitative variable (Income) and a qualitative one (Student)
est = smf.ols('Balance ~ Income + Income*Student', credit).fit()
est.summary().tables[1]

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,200.6232,33.698,5.953,0.000,134.373,266.873
Student[T.Yes],476.6758,104.351,4.568,0.000,271.524,681.827
Income,6.2182,0.592,10.502,0.000,5.054,7.382
Income:Student[T.Yes],-1.9992,1.731,-1.155,0.249,-5.403,1.404
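One way to read this table is as two separate regression lines: the `Student[T.Yes]` term shifts the intercept for students, and the `Income:Student[T.Yes]` term shifts their slope. The sketch below plugs in the rounded coefficients from the table (the function name `predicted_balance` is made up). Keep in mind that the interaction term has a p-value of 0.249, so the evidence that the two slopes really differ is weak.

```python
# Coefficients copied from the Balance ~ Income + Income*Student table above
intercept = 200.6232
student_yes = 476.6758
income = 6.2182
income_student = -1.9992

def predicted_balance(income_value, is_student):
    """Predicted balance from the fitted interaction model."""
    s = 1 if is_student else 0
    return (intercept + student_yes * s) + (income + income_student * s) * income_value

# Intercept and slope of each line
non_student_line = (intercept, income)
student_line = (intercept + student_yes, income + income_student)

print(non_student_line)   # intercept and slope for non-students
print(student_line)       # intercept and slope for students
print(predicted_balance(50, False), predicted_balance(50, True))
```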


#### Non-linearity

In [23]:
auto = pd.read_csv('Data/Auto.csv')

In [24]:
auto['horsepower'] = pd.to_numeric(auto.horsepower, errors='coerce')
auto['horsepower2'] = auto.horsepower**2
auto.head(3)

,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name,horsepower2
0,18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu,16900.0
1,15.0,8,350.0,165.0,3693,11.5,70,1,buick skylark 320,27225.0
2,18.0,8,318.0,150.0,3436,11.0,70,1,plymouth satellite,22500.0


In [25]:
est = smf.ols('mpg ~ horsepower + horsepower2', auto).fit()
est.summary().tables[1]

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,56.9001,1.800,31.604,0.000,53.360,60.440
horsepower,-0.4662,0.031,-14.978,0.000,-0.527,-0.405
horsepower2,0.0012,0.000,10.080,0.000,0.001,0.001
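With a squared term in the model, the effect of horsepower on mpg is no longer a single number: the slope at a given horsepower is $\hat\beta_1 + 2\hat\beta_2\cdot\text{horsepower}$. A small sketch using the rounded coefficients from the table (the function names are made up, and because `horsepower2` is rounded to 0.0012 the outputs are only approximate):

```python
# Rounded coefficients copied from the mpg ~ horsepower + horsepower2 table above
intercept, b_hp, b_hp2 = 56.9001, -0.4662, 0.0012

def predicted_mpg(hp):
    """Predicted mpg from the fitted quadratic."""
    return intercept + b_hp * hp + b_hp2 * hp ** 2

def marginal_effect(hp):
    """Change in predicted mpg per extra unit of horsepower, evaluated at hp."""
    return b_hp + 2 * b_hp2 * hp

print(predicted_mpg(100))     # predicted mpg at 100 horsepower
print(marginal_effect(100))   # negative: more horsepower lowers mpg here
print(marginal_effect(200))   # close to zero: the curve has flattened out
```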


In [26]:
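# Interaction model: change in sales per unit of radio when TV = 250 (radio coef + TV:radio coef * TV)
0.0289 + 0.0011 * 250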

0.3039