# Lecture 2
## Introduction to Sklearn
### Pipelines!

<ol>
<li> Used data: Boston Data Set ( https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html )
<li> Notebook Goal: Learn how to build a pipeline in sklearn. Use pipelines for hyperparameter tuning and model selection
<li> Extra Exercise: Yes, see below.
</ol>


In [3]:
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

We will use the Boston Housing data set. 

This data set consists of 14 numerical features. The goal of this notebook is to construct a regression model that tries to predict the median house price of a house in Boston using several explanatory variables.

In [4]:
column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
df = pd.read_csv('../Data//housing.csv', header=None, delimiter=r"\s+", names=column_names)
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2


We will use the following plan of attack:
<ol>
<li> Split the data set in a training and a test data set
<li> Standardize the training data using StandardScaler
<li> Then, reduce the dimension of the model using PCA
<li> Afterwards, perform Ridge regression on the principal components
<li> Evaluate the model performance on the test set using the R-squared.
</ol>

We will first do this without using a pipeline.

In [5]:
X_train, X_test, y_train, y_test = train_test_split(df.drop(['MEDV'],axis=1), df['MEDV'], random_state=0)

scaler = StandardScaler()
pca = PCA()
ridge=Ridge()

X_train = scaler.fit_transform(X_train)
X_train = pca.fit_transform(X_train)
ridge.fit(X_train, y_train)

X_test = scaler.transform(X_test)
X_test = pca.transform(X_test)
preds = ridge.predict(X_test)

r2_score(y_test,preds)

0.6345884564889062

The messy part of this code is that we have some repetition: first apply either fit_transform or fit three times, then use transform or predict three times. 

Pipelines essentially just group the desired models and runs them sequentially. A pipeline consists of an array of several steps, each step has as format ('Name_of_step','method used in step') where 'method_used_in_step' is a class from Sklearn (or your own custom class). 

The resulting pipeline is then a typical sklearn class, with fit and predict methods. Calling a fit on training data then sequentially fits (and transforms) the methods in the pipeline to the training data. Using the predict function then applies these fitted transformers to the test data and returns the predictions. 

Because everything is done at once, the probability of data leakage is very low!

In [6]:
pipe = Pipeline(
    [
        ('scaler', StandardScaler()), #Step one: Scaling the data
        ('reduce_dim', PCA()), #Step two: perform PCA in order to reduce dimensions
        ('regressor', Ridge()) #Step three: Bring in the regression model

    ]
)
pipe.fit(X_train, y_train)
pipe.score(X_test,y_test)

0.6357103588722749

# Hyperparameter Finetuning

In this section, we will consider how to fine-tune the hyperparameters of the model. Recall that hyperparameters are model-specific parameters which are __user-supplied__ and are __NOT__ part of the fitting procedure. Examples are the amount of neighbours in the k-NN algorithm, or the learning rate in some neural networks.

For some hyperparameters there are specific rules that one can follow, for example the scree plot in PCA. 
A more general rule is just to use the hyperparameters which leads to the best fitting model! 

This is basically what we will do in the next section. We will supply for each hyperparameter an interval of possible values. We will then use a method called GridSearchCV which considers each hyperparameter combination, fits this to the data and evaluates the performance of this model. In order to reduce the risk of overfitting, this method uses cross validation.

The GridSearchCV method takes as input the pipeline (without hyperparameters) and the chosen intervals. It returns the best performing model, fit on the data.

As an example, we will consider the previous setting. We think that the number of components for PCA is between 1 and 10, and that the best fitting regularization parameter alpha (see documentation!) lies in the set 
$$
\alpha \in \left\{ 2^{-6}, 2^{-5}, ..., 2^{4}, 2^5 \right\}.
$$

Recall that the naming convention of the pipeline consisted of a pair ('Name_of_step', 'Method_of_step'). If we want to change a hyperparameter called 'theta' for the method in this step using GridSearchCV, we will use the following naming convention
<ol>
<li> Name of the step in the pipe : 'Name_of_step'
<li> Double underscore
<li> Name of the hyperparameter: 'theta'
</ol>
In our case: Name_of_step__theta.

Following code should make this clear.

In [7]:
number_features = np.arange(1,11)
alpha_to_test = 2.0**np.arange(-6,6)

params_to_consider = {'reduce_dim__n_components': number_features, 
                    'regressor__alpha': alpha_to_test   
                    }

We can then continue as follows:

In [8]:
gridsearch = GridSearchCV(pipe, param_grid = params_to_consider, verbose=1).fit(X_train,y_train)
print('Final score is: ', gridsearch.score(X_test,y_test) )

Fitting 5 folds for each of 120 candidates, totalling 600 fits
Final score is:  0.57801002901813


Notice that we had 120 candidates. This comes from the fact that we have 10 different values of N_components and 12 different values of alpha.

When fitting your pipeline, make sure not to include impossible values! Not only is this bad practice, it can blow up the amount of necessary searches.

# Model Selection

We can easily extend above framework so that it can also compare different model classes, instead of only models of the same class but with different hyperparameters. 

Suppose for example we are not sure which choice would be the best: Standardization or Robust scaling. We can tweak the code above to find the following:

In [9]:
pipe = Pipeline([
    ('scaler', []),
    ('reduce_dim',PCA()),
    ('regressor',Ridge())
])

grid = {'scaler':[StandardScaler(),RobustScaler()]}

gridsearch = GridSearchCV(pipe,grid,verbose=1).fit(X_train,y_train)
print('Testing score:', gridsearch.score(X_test,y_test))

Fitting 5 folds for each of 2 candidates, totalling 10 fits
Testing score: 0.6364746260666787


In [10]:
gridsearch.best_params_ #Robust scaler was the best!

{'scaler': RobustScaler()}

__Exercise__:

Extend the above generalization so that it can consider not only different scalers, but also different regression methods. Compare in this way Ridge regression with OLS regression.