# Linear Regression with `scikit-learn`

In [None]:
# Import necessary libraries and data
import pandas as pd
import numpy as np

advert = pd.read_csv('advertising.csv')

We’ve learnt to implement linear regression models using `statsmodels`. Now let’s learn to do it using `scikit-learn`, a commonly used Python package for machine learning. It has more built-in methods for the routine tasks associated with regression.

In the last step, we manually split our dataset into train and test sets. `scikit-learn` has a built-in method to do this for us!

In [None]:
# Import necessary scikit-learn methods
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Build linear regression model using TV and Radio as predictors
# Split data into predictors X and output Y
predictors = ['TV', 'Radio']
X = advert[predictors]
Y = advert['Sales']

# Split data into training and testing sets using `train_test_split` method
trainX, testX, trainY, testY = train_test_split(X, Y, test_size=0.2, random_state=0)

# Initialise and fit model
lm = LinearRegression()
lm.fit(trainX, trainY)

Again, we have split the dataset 80:20 (`test_size=0.2`). You can check that the split was done correctly by printing out the shapes of the training and testing arrays:

In [None]:
print(trainX.shape)
print(testX.shape)
print(trainY.shape)
print(testY.shape)
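
As a rough illustration of what the split produces, the sketch below runs `train_test_split` on a synthetic stand-in for the advertising data. The 200-row size, the `demo` DataFrame, and the coefficients used to generate `Sales` are assumptions made for this example, not values taken from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the advertising data (size and values are assumptions)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'TV': rng.uniform(0, 300, 200),
    'Radio': rng.uniform(0, 50, 200),
})
demo['Sales'] = 3 + 0.05 * demo['TV'] + 0.2 * demo['Radio'] + rng.normal(0, 1, 200)

X_demo = demo[['TV', 'Radio']]
y_demo = demo['Sales']

# Same call as in the lesson: hold out 20% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Shapes of the four pieces
print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)

# Predictors and target are shuffled together, so their index labels still match
print(X_tr.index.equals(y_tr.index))
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the test set every time the cell is run.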

The following will display the parameters of the model:

In [None]:
print(f'alpha = {lm.intercept_}')
print(f'betas = {lm.coef_}')

`lm.intercept_` holds *α*, and `lm.coef_` returns an array with our coefficients *β<sub>1</sub>* and *β<sub>2</sub>*, in the same order as `predictors` (`TV`, then `Radio`).
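
To see how these parameters relate to the model's predictions, here is a minimal sketch on made-up data (the arrays `X_demo` and `y_demo` and the generating coefficients are assumptions for illustration only). It rebuilds the predictions by hand as the intercept plus the weighted sum of the predictors, and checks that this matches `.predict()`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: two predictors and a nearly linear response
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 100, size=(50, 2))
y_demo = 2.0 + 0.5 * X_demo[:, 0] + 1.5 * X_demo[:, 1] + rng.normal(0, 0.1, 50)

model = LinearRegression().fit(X_demo, y_demo)
alpha = model.intercept_   # scalar intercept
betas = model.coef_        # one coefficient per predictor column

# Prediction by hand: alpha + beta_1 * x_1 + beta_2 * x_2
manual = alpha + X_demo @ betas

print(f'alpha = {alpha}')
print(f'betas = {betas}')
print(np.allclose(manual, model.predict(X_demo)))
```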

The value of *R<sup>2</sup>* can be returned simply by calling `.score()`. Here it is computed on the training data, and it turns out to be very close to the value obtained using the `statsmodels` method in the previous step (0.901)!

In [None]:
lm.score(trainX, trainY)
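
For a regressor, `.score()` returns the coefficient of determination. The sketch below, which uses made-up data (`X_demo`, `y_demo` and the generating coefficients are assumptions), computes *R<sup>2</sup>* by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compares it with `.score()` and with `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Made-up data for illustration
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 100, size=(80, 2))
y_demo = 1.0 + 0.3 * X_demo[:, 0] + 0.7 * X_demo[:, 1] + rng.normal(0, 5, 80)

model = LinearRegression().fit(X_demo, y_demo)
y_hat = model.predict(X_demo)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_demo - y_hat) ** 2)          # unexplained variation
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)
print(np.isclose(r2_manual, model.score(X_demo, y_demo)))
print(np.isclose(r2_manual, r2_score(y_demo, y_hat)))
```

Passing held-out data to `.score()` instead (for example the test arrays) gives an estimate of how well the model generalises, which is usually a little lower than the training score.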

To predict sales, pass the test predictors into `.predict()`. An array of 40 predictions (because there are 40 rows in the test dataset) is returned:

In [None]:
lm.predict(testX)
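
One way to inspect predictions is to line them up against the actual values they are meant to match. The sketch below does this on a synthetic stand-in for the dataset (the 200 rows, the column values, and the `comparison` table are assumptions for this example), and also reports the root mean squared error on the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the advertising data (values are assumptions)
rng = np.random.default_rng(2)
X_demo = pd.DataFrame({
    'TV': rng.uniform(0, 300, 200),
    'Radio': rng.uniform(0, 50, 200),
})
y_demo = (3 + 0.05 * X_demo['TV'] + 0.2 * X_demo['Radio']
          + rng.normal(0, 1, 200)).rename('Sales')

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

# One prediction per test row
preds = model.predict(X_te)

# Side-by-side view of actual and predicted values, keeping the original row labels
comparison = pd.DataFrame(
    {'actual': y_te.to_numpy(), 'predicted': preds},
    index=y_te.index,
)
rmse = np.sqrt(np.mean((comparison['actual'] - comparison['predicted']) ** 2))

print(preds.shape)
print(comparison.head())
print(f'test RMSE = {rmse:.3f}')
```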

### Feature selection with scikit-learn

As stated before, many statistical tools and packages have built-in methods for variable selection (feel free to review the “Multiple Regression with `statsmodels`” step if you need to). If done manually, feature selection is time-consuming and tedious, which slows down the whole modelling process.

One advantage of using `scikit-learn` is that it has a method for feature selection. This method, called **Recursive Feature Elimination** (**RFE**), works similarly to backward selection.

The model is first fitted with all the variables, and each variable is assigned a weight (for a linear model, its coefficient). In subsequent iterations, the variables with the smallest weights are pruned until the specified number of variables is left.
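
The elimination loop described above can be sketched by hand. This is a simplified illustration, not scikit-learn's actual implementation: it uses `LinearRegression` and the absolute value of each coefficient as the weight, on made-up features that already share a common scale. The names `feature_a` to `feature_c` and the generating coefficients are assumptions for the example.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Made-up data: feature_c contributes almost nothing to the response
rng = np.random.default_rng(3)
names = ['feature_a', 'feature_b', 'feature_c']
X_demo = rng.normal(size=(200, 3))
y_demo = (4.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + 0.01 * X_demo[:, 2]
          + rng.normal(0, 0.5, 200))

n_to_keep = 2
remaining = list(range(X_demo.shape[1]))  # column indices still in the model
removal_order = []

while len(remaining) > n_to_keep:
    # Refit on the surviving columns and weigh each one by |coefficient|
    model = LinearRegression().fit(X_demo[:, remaining], y_demo)
    weights = np.abs(model.coef_)
    # Drop the column with the smallest weight (one per iteration)
    weakest = remaining[int(np.argmin(weights))]
    remaining.remove(weakest)
    removal_order.append(weakest)

kept = [names[i] for i in remaining]
print('kept:', kept)
print('removed, in order:', [names[i] for i in removal_order])

# The built-in RFE with the same estimator should agree on this data
rfe = RFE(LinearRegression(), n_features_to_select=2, step=1).fit(X_demo, y_demo)
print('RFE support_:', rfe.support_)
```

Because the weights are raw coefficients, a comparison like this is only meaningful when the predictors are on similar scales.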

Let’s give it a go!

In [None]:
from sklearn.feature_selection import RFE   # Recursive Feature Elimination
from sklearn.svm import SVR                 # Support Vector Regression

# Start with all possible predictors
predictors = ['TV', 'Radio', 'Newspaper']
X = advert[predictors]
Y = advert['Sales']

# Estimate a linear model
estimator = SVR(kernel="linear")

# Using RFE, specify 2 predictors for the final model
# and 1 predictor to remove at each iteration
selector = RFE(estimator, n_features_to_select=2, step=1)
selector = selector.fit(X, Y)

We use the `SVR` class with a linear kernel as the estimator for our linear model. Then, using `RFE`, we set the number of variables to keep in the model to two (`n_features_to_select=2`) and the number of variables to remove at each iteration to one (`step=1`).

For more information about these classes, you can read the documentation for [`RFE`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html) and [`SVR`](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html).

To get the list of selected variables, we use:

In [None]:
selector.support_

This returns an array of `True`s and `False`s in the order of the `predictors`. In our example, we specified the order of our predictors to be: `predictors = ['TV', 'Radio','Newspaper']`. The output, therefore, means `TV` and `Radio` were selected (`True`) while `Newspaper` was not selected (`False`). 

This concurs with the variable selection we had done manually!

We can also return the predictors’ significance rankings using `ranking_`. Selected variables have a ranking of `1`, while the rest receive higher numbers according to the iteration in which they were removed: the earlier a variable was eliminated, the larger its rank and the less significant it is.

In [None]:
selector.ranking_
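
Since `support_` and `ranking_` are plain arrays, it can help to pair them with the predictor names. The sketch below does so on made-up data with four features, so that more than one variable is eliminated and the ranks go beyond `2`. Note that it swaps in `LinearRegression` as the estimator rather than the `SVR` used above, and that the feature names and generating coefficients are assumptions for the example.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Made-up data: the features matter less and less from left to right
rng = np.random.default_rng(4)
names = ['feature_a', 'feature_b', 'feature_c', 'feature_d']
X_demo = rng.normal(size=(300, 4))
y_demo = (5.0 * X_demo[:, 0] + 3.0 * X_demo[:, 1] + 0.5 * X_demo[:, 2]
          + rng.normal(0, 0.1, 300))   # feature_d has no effect at all

selector_demo = RFE(LinearRegression(), n_features_to_select=2, step=1)
selector_demo = selector_demo.fit(X_demo, y_demo)

# Pair each name with its selection flag and its rank
selected = dict(zip(names, selector_demo.support_.tolist()))
ranks = dict(zip(names, selector_demo.ranking_.tolist()))

# Rank 1 = kept; larger rank = eliminated earlier
print(selected)
print(ranks)
```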

We've covered quite a lot this lesson!