## Multiple Regression

A multiple linear regression is simply a linear regression that involves more than one predictor variable. It is represented as:

$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$

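Each *β<sub>i</sub>*, along with *α*, is estimated using the least squares method, which picks the values that minimise the sum of squared differences between the observed and predicted values of *Y*.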
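As a rough illustration of least squares estimation, the sketch below recovers *α* and the *β<sub>i</sub>* from synthetic data using NumPy's `lstsq`. The data, the "true" coefficient values, and all variable names are assumptions made up for this example; they are not taken from the advertising dataset.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the advertising dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(50, 2))       # two made-up predictors
true_alpha, true_betas = 3.0, np.array([0.05, 0.2])
y_demo = true_alpha + X_demo @ true_betas        # no noise, so the fit is exact

# Prepend a column of ones so the intercept is estimated with the slopes
design = np.column_stack([np.ones(len(X_demo)), X_demo])

# Least squares: choose coefficients that minimise the sum of squared residuals
estimates, *_ = np.linalg.lstsq(design, y_demo, rcond=None)
alpha_hat, betas_hat = estimates[0], estimates[1:]
print(alpha_hat, betas_hat)
```

Because the synthetic response contains no noise, the estimates should match the chosen coefficients up to floating-point error. With real data they would only approximate the underlying relationship.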

As mentioned previously, the RSE generally decreases as we add variables that are significant predictors of the output variable, so using more variables can improve how well the model fits the data.
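To make this concrete, here is a minimal sketch that computes the RSE as $\sqrt{RSS / (n - p - 1)}$ for a one-predictor and a two-predictor model. The data is synthetic, and the helper names `residual_standard_error` and `fit_predict` are hypothetical. The second predictor is constructed to be a genuine driver of the response, which is the situation described above.

```python
import numpy as np

def residual_standard_error(y, y_hat, n_predictors):
    """RSE = sqrt(RSS / (n - p - 1)) for a model with p predictors and an intercept."""
    rss = np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2)
    return np.sqrt(rss / (len(y) - n_predictors - 1))

def fit_predict(X, y):
    """Fit least squares with an intercept and return the fitted values."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ coef

# Synthetic data: both predictors truly influence the response
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 100, size=(200, 2))
y_demo = 3.0 + 0.05 * X_demo[:, 0] + 0.2 * X_demo[:, 1] + rng.normal(0, 1.0, 200)

rse_one = residual_standard_error(y_demo, fit_predict(X_demo[:, :1], y_demo), 1)
rse_two = residual_standard_error(y_demo, fit_predict(X_demo, y_demo), 2)
print(rse_one, rse_two)  # the two-predictor RSE is the smaller of the two here
```

Note that the denominator shrinks as *p* grows, so adding a predictor that explains almost nothing can leave the RSE flat or even nudge it upwards.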

However, it also increases the complexity of model building, since the process of selecting which variables to keep and which to discard can become tedious.

With this simple dataset of three predictor variables, there can be seven possible models:

1. Sales ~ TV
2. Sales ~ Newspaper
3. Sales ~ Radio
4. Sales ~ TV + Radio
5. Sales ~ TV + Newspaper
6. Sales ~ Newspaper + Radio
7. Sales ~ TV + Radio + Newspaper

Generally, if there are *p* possible predictor variables, there are *(2<sup>p</sup> - 1)* possible models, a number that gets large very quickly!
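The seven models listed above can be enumerated programmatically. This is a small sketch using `itertools.combinations`; the variable name `candidate_models` is our own.

```python
from itertools import combinations

predictors = ['TV', 'Radio', 'Newspaper']

# Every non-empty subset of the predictors is a candidate model
candidate_models = [
    list(subset)
    for size in range(1, len(predictors) + 1)
    for subset in combinations(predictors, size)
]

for subset in candidate_models:
    print('Sales ~ ' + ' + '.join(subset))

print(len(candidate_models))  # 2**3 - 1 = 7
```

The subsets come out grouped by size, so the order differs slightly from the list above, but the same seven models appear.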


In [14]:
# Import necessary libraries and data
import pandas as pd
import numpy as np

advert = pd.read_csv('advertising.csv')

In [15]:
advert

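| | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| 0 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 17.2 | 45.9 | 69.3 | 9.3 |
| 3 | 151.5 | 41.3 | 58.5 | 18.5 |
| 4 | 180.8 | 10.8 | 58.4 | 12.9 |
| ... | ... | ... | ... | ... |
| 195 | 38.2 | 3.7 | 13.8 | 7.6 |
| 196 | 94.2 | 4.9 | 8.1 | 9.7 |
| 197 | 177.0 | 9.3 | 6.4 | 12.8 |
| 198 | 283.6 | 42.0 | 66.2 | 25.5 |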


In [16]:
train_test_split(X, Y, test_size=0.2, random_state=0)

[        TV  Radio  Newspaper
 134   36.9   38.6       65.6
 66    31.5   24.6        2.2
 26   142.9   29.3       12.6
 113  209.6   20.6       10.7
 168  215.4   23.6       57.6
 ..     ...    ...        ...
 67   139.3   14.5       10.2
 192   17.2    4.1       31.6
 117   76.4    0.8       14.8
 47   239.9   41.5       18.5
 172   19.6   20.1       17.0
 
 [160 rows x 3 columns],
         TV  Radio  Newspaper
 18    69.2   20.5       18.3
 170   50.0   11.6       18.4
 107   90.4    0.3       23.2
 98   289.7   42.3       51.2
 177  170.2    7.8       35.2
 182   56.2    5.7       29.7
 5      8.7   48.9       75.0
 146  240.1    7.3        8.7
 12    23.8   35.1       65.9
 152  197.6   23.3       14.2
 61   261.3   42.7       54.7
 125   87.2   11.8       25.9
 180  156.6    2.6        8.3
 154  187.8   21.1        9.5
 80    76.4   26.7       22.3
 7    120.2   19.6       11.6
 33   265.6   20.0        0.3
 130    0.7   39.6        8.7
 37    74.7   49.4       45.7
 74   213.4  

In [5]:
# Import necessary scikit-learn methods
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Build linear regression model using TV, Radio, and Newspaper as predictors
# Split data into predictors X and output Y
predictors = ['TV', 'Radio', 'Newspaper']
X = advert[predictors]
Y = advert['Sales']

# Split data into training and testing sets using `train_test_split` method
trainX, testX, trainY, testY = train_test_split(X, Y, test_size=0.2, random_state=0)

# Initialise and fit model
lm = LinearRegression()
lm.fit(trainX, trainY)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [17]:
print(f'alpha = {lm.intercept_}')
print(f'betas = {lm.coef_}')

alpha = 2.994893030495332
betas = [ 0.04458402  0.19649703 -0.00278146]


In [18]:
lm.score(trainX, trainY)

0.9067114990146383
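For a regression model, `score` returns the coefficient of determination R², so the value above says the model explains roughly 91% of the variance in `Sales` on the training data. A minimal sketch of that calculation follows; the helper `r_squared` is a hypothetical name and not part of scikit-learn.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - RSS / TSS, the share of variance explained by the predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - rss / tss

print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect predictions give 1.0
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # predicting the mean gives 0.0
```

Calling `lm.score(testX, testY)` would report the same statistic on the held-out data, which is usually a fairer measure of how well the model generalises.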

In [19]:
lm.predict(testX)

array([10.05739563,  7.4522807 ,  7.0197076 , 24.08029725, 12.01786259,
        6.53793858, 12.78286918, 15.10974587, 10.76974013, 16.34357951,
       22.88297477,  9.12924467, 10.46455672, 15.48743552, 11.58555633,
       12.17296914, 18.76551502, 10.78318566, 15.90515992, 17.30651279,
       24.06692057,  9.59834224, 15.13512211, 12.38591525,  5.71360885,
       15.24749314, 12.29402334, 20.9421167 , 13.40991558,  9.04348832,
       12.89239415, 21.40272028, 18.13802209, 21.17320803,  6.56974433,
        6.14114206,  7.89018394, 13.01541434, 14.68953791,  6.18835143])
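Each prediction is just the fitted equation applied to one row of `testX`. As a sanity check, the sketch below reproduces the first prediction by hand, using the printed `alpha` and `betas` and the first test row shown earlier (index 18). It assumes that row is the first row of `testX`. The names `row_18` and `manual_prediction` are our own.

```python
import numpy as np

# Values copied from the printed output above (betas are rounded as printed)
alpha = 2.994893030495332
betas = np.array([0.04458402, 0.19649703, -0.00278146])

# Test row with index 18 from the split shown earlier: TV, Radio, Newspaper
row_18 = np.array([69.2, 20.5, 18.3])

# alpha + beta_1 * TV + beta_2 * Radio + beta_3 * Newspaper
manual_prediction = alpha + row_18 @ betas
print(manual_prediction)  # approximately 10.0574, matching the first predicted value
```

The tiny discrepancy in the last decimal places comes from using the rounded coefficients as printed.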

## Feature Selection

In [10]:
from sklearn.feature_selection import RFE   # Recursive Feature Elimination
from sklearn.svm import SVR                 # Support Vector Regression

# Start with all possible predictors
predictors = ['TV', 'Radio', 'Newspaper']
X = advert[predictors]
Y = advert['Sales']

# Use support vector regression with a linear kernel as the base estimator
estimator = SVR(kernel="linear")

# Using RFE, specify 2 predictors for the final model
# and 1 predictor to remove at each iteration
selector = RFE(estimator, n_features_to_select=2, step=1)
selector = selector.fit(X, Y)

We use the `SVR` class with a linear kernel as the base estimator. Then, using `RFE`, we set the number of variables to keep in the final model to two (`n_features_to_select`) and the number of variables to remove at each iteration to one (`step`).

For more information about these methods, you can read the documentation for [`RFE`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html) and [`SVR`](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html).
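To give a feel for what recursive elimination does, here is a simplified hand-rolled sketch. It is not scikit-learn's implementation. It swaps in ordinary least squares for `SVR`, ranks predictors by their squared coefficient (which, to our understanding, is roughly how `RFE` treats linear estimators), and runs on synthetic data in which the third predictor is pure noise. The function and variable names are hypothetical.

```python
import numpy as np

def least_squares_slopes(X, y):
    """Fit least squares with an intercept and return only the slopes."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1:]  # drop the intercept

def eliminate_recursively(X, y, names, n_keep):
    """Repeatedly refit and drop the predictor with the smallest squared slope."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        slopes = least_squares_slopes(X[:, remaining], y)
        weakest = int(np.argmin(slopes ** 2))
        remaining.pop(weakest)  # remove one predictor per iteration (step=1)
    return [names[i] for i in remaining]

# Synthetic data: x3 has no effect on the response
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = 3.0 + 0.05 * X_demo[:, 0] + 0.2 * X_demo[:, 1] + rng.normal(0, 1.0, 200)

kept = eliminate_recursively(X_demo, y_demo, ['x1', 'x2', 'x3'], n_keep=2)
print(kept)  # ['x1', 'x2']
```

Comparing raw coefficients only makes sense here because all three synthetic predictors share the same scale. With predictors on different scales, standardising them first would be a sensible precaution.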

To see which variables were selected, we use the `support_` attribute, a boolean mask in the same order as `predictors`:

In [12]:
selector.support_

array([ True,  True, False])

In [13]:
selector.ranking_

array([1, 1, 2])
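Both arrays follow the order of `predictors`, and a rank of 1 marks a selected variable, so TV and Radio are kept while Newspaper is dropped. A short sketch for turning the outputs into names follows. The arrays are copied from the outputs above rather than read from `selector`, and `selected` and `rank_by_name` are names of our own.

```python
predictors = ['TV', 'Radio', 'Newspaper']

# Copied from the selector.support_ and selector.ranking_ outputs above
support = [True, True, False]
ranking = [1, 1, 2]

# Keep the predictor names whose mask entry is True
selected = [name for name, keep in zip(predictors, support) if keep]
print(selected)  # ['TV', 'Radio']

# Pair each predictor with its rank (1 means selected)
rank_by_name = dict(zip(predictors, ranking))
print(rank_by_name)  # {'TV': 1, 'Radio': 1, 'Newspaper': 2}
```

In the notebook itself, `selector.support_` and `selector.ranking_` could be passed in directly in place of the hard-coded lists.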