# Pipeline Setup

In [1]:
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

data = load_boston()
X_train, X_test, y_train, y_test = train_test_split(data['data'], data['target'])
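`load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the cell above only runs on older releases. A minimal sketch of the same setup on a current release follows. It assumes `load_diabetes` as a stand-in regression dataset (it is not the dataset used in this notebook, so the scores printed later will differ), and it fixes `random_state` only to make the sketch reproducible.

```python
# Assumption: load_diabetes is a bundled stand-in dataset, not the one used in the notebook.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

data = load_diabetes()

# random_state is an addition for reproducibility; the original cell does not set it.
X_train, X_test, y_train, y_test = train_test_split(
    data['data'], data['target'], random_state=0
)

print(X_train.shape, X_test.shape)
```

The stand-in has 10 features, so the later search over 1 to 10 components remains valid.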

In [15]:
from sklearn.preprocessing import StandardScaler, RobustScaler, QuantileTransformer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
import numpy as np

## Pipeline Details:

* Data Normalization: this tutorial compares three normalization methods: `StandardScaler`, `RobustScaler`, and `QuantileTransformer` (see the scikit-learn documentation for details).

* Dimensionality Reduction: we selected Principal Component Analysis (PCA) and a univariate feature selection algorithm as possible candidates.

* Regression: we apply a simple regularized linear model (`Ridge`), although the approach extends easily to other learning algorithms.
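To get a feel for how the three normalization methods differ, the sketch below applies each of them to a hypothetical toy feature: skewed values plus one large outlier. The data and the `n_quantiles=100` setting are assumptions made for illustration only.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer, RobustScaler, StandardScaler

rng = np.random.RandomState(0)

# Hypothetical toy feature: 99 skewed values plus one large outlier.
x = np.concatenate([rng.exponential(size=99), [50.0]]).reshape(-1, 1)

scalers = {
    'standard': StandardScaler(),   # zero mean, unit variance
    'robust': RobustScaler(),       # centers on the median, scales by the IQR
    'quantile': QuantileTransformer(n_quantiles=100),  # maps to a uniform [0, 1] range
}

scaled = {name: s.fit_transform(x) for name, s in scalers.items()}

for name, values in scaled.items():
    print(name, values.min(), values.max())
```

The printed ranges show how much each method lets the outlier stretch the scaled feature.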

# Without Pipeline

In [33]:
scaler = StandardScaler()
pca = PCA()
ridge = Ridge()

In [35]:
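# Caution: this overwrites X_train with its scaled version.
X_train = scaler.fit_transform(X_train)

In [38]:
# X_train is overwritten again and now holds the PCA-projected data.
# Later cells that fit on X_train therefore receive already-transformed inputs
# while X_test stays raw; this is the likely cause of the strongly negative
# test scores reported further down.
X_train = pca.fit_transform(X_train)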

In [40]:
ridge.fit(X_train, y_train)

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

# With Pipeline

In [41]:
from sklearn.pipeline import Pipeline

In [42]:
pipe = Pipeline([
    
    ('scaler', StandardScaler()),
    ('reduce_dim', PCA()),
    ('regressor', Ridge())
    
])

In [43]:
pipe

Pipeline(memory=None,
     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('reduce_dim', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('regressor', Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001))])

In [44]:
pipe = pipe.fit(X_train, y_train)

In [45]:
print('Testing score: ', pipe.score(X_test, y_test))

Testing score:  -5573.01401859


In [50]:
# pipe.steps is a list of (name, estimator) tuples; index 1 is the PCA step

pipe.steps[1][1].explained_variance_

array([ 1.0026455,  1.0026455,  1.0026455,  1.0026455,  1.0026455,
        1.0026455,  1.0026455,  1.0026455,  1.0026455,  1.0026455,
        1.0026455,  1.0026455,  1.0026455])

In [56]:
pipe.steps[2][1].coef_

array([-2.24262291, -5.63727879, -3.07670887, -1.43650915, -1.63554129,
        2.03750121,  1.00838457, -0.52069999, -2.32708928, -0.5815485 ,
       -1.3458464 , -1.386344  ,  0.13674465])

![](https://iaml.it/user/pages/05.blog/optimizing-sklearn-pipelines/images/pipeline-diagram.png)

During training, `fit_transform` is invoked on every transformer in the pipeline (and `fit` on the final estimator); at test time, `transform` is called on each transformer and `predict` on the final estimator. So far, using a pipeline is mostly a matter of cleaner, more compact code. Now let's move on to hyperparameter tuning.
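The call sequence described above can be checked with a minimal sketch that compares a pipeline against the same steps chained by hand. The synthetic data from `make_regression`, the `_demo` names, and the choice of 5 components are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data, used only for illustration.
X, y = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = X[:150], X[150:], y[:150], y[150:]

# Manual chain: fit_transform on the training data, transform on the test data.
scaler_demo = StandardScaler()
pca_demo = PCA(n_components=5)
ridge_demo = Ridge()
Z_tr = pca_demo.fit_transform(scaler_demo.fit_transform(X_tr))
ridge_demo.fit(Z_tr, y_tr)
manual_pred = ridge_demo.predict(pca_demo.transform(scaler_demo.transform(X_te)))

# The pipeline performs the same sequence of calls internally.
pipe_demo = Pipeline([
    ('scaler', StandardScaler()),
    ('reduce_dim', PCA(n_components=5)),
    ('regressor', Ridge()),
])
pipe_demo.fit(X_tr, y_tr)
pipe_pred = pipe_demo.predict(X_te)

print(np.allclose(manual_pred, pipe_pred))
```

Note that the manual chain keeps the raw arrays intact by writing results to new variables, so the raw test set is always transformed with statistics learned from the raw training set.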

# Pipeline Tuning (Base Version)

We are interested in tuning the following hyperparameters:
* `n_components` of PCA()
* `alpha` of Ridge()

In [66]:
n_features_to_test = np.arange(1, 11)
alpha_to_test = 2.0**np.arange(-6, 6)

In [67]:
n_features_to_test

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [68]:
alpha_to_test

array([  1.56250000e-02,   3.12500000e-02,   6.25000000e-02,
         1.25000000e-01,   2.50000000e-01,   5.00000000e-01,
         1.00000000e+00,   2.00000000e+00,   4.00000000e+00,
         8.00000000e+00,   1.60000000e+01,   3.20000000e+01])

## Create Dict of Params

The two parameters interact and should be optimized jointly: changing the number of PCA components may change the best regularization strength, and vice versa. It is therefore important to evaluate all of their combinations, and this is where the pipeline module really helps.

In [69]:
params = {
    
    'reduce_dim__n_components': n_features_to_test,
    'regressor__alpha': alpha_to_test
}
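The keys in the dictionary above follow the `<step name>__<parameter name>` convention. The sketch below, which rebuilds the pipeline under the hypothetical name `pipe_demo`, shows one way to check that the keys are valid, set them by hand, and count the resulting combinations with `ParameterGrid`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# pipe_demo and params_demo mirror the notebook's objects under hypothetical names.
pipe_demo = Pipeline([
    ('scaler', StandardScaler()),
    ('reduce_dim', PCA()),
    ('regressor', Ridge()),
])

params_demo = {
    'reduce_dim__n_components': np.arange(1, 11),
    'regressor__alpha': 2.0 ** np.arange(-6, 6),
}

# Every valid nested name appears among the keys of get_params().
valid_names = pipe_demo.get_params().keys()
all_valid = all(name in valid_names for name in params_demo)

# The same names can be used to set parameters by hand.
pipe_demo.set_params(reduce_dim__n_components=3, regressor__alpha=0.5)

# 10 component counts x 12 alphas = 120 candidate combinations.
n_candidates = len(ParameterGrid(params_demo))
print(all_valid, n_candidates)
```

The count of 120 matches the number of candidates that `GridSearchCV` reports in the next section.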

# GridSearchCV on Pipeline

In [70]:
from sklearn.model_selection import GridSearchCV

In [72]:
gridsearch = GridSearchCV(pipe, params, verbose=1).fit(X_train, y_train)

Fitting 3 folds for each of 120 candidates, totalling 360 fits


[Parallel(n_jobs=1)]: Done 360 out of 360 | elapsed:    1.6s finished


In [73]:
print('Final score is: ', gridsearch.score(X_test, y_test))

Final score is:  -6316.70817709


In [76]:
print('Best params are: \n', gridsearch.best_params_)

Best params are: 
 {'reduce_dim__n_components': 10, 'regressor__alpha': 0.015625}
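Beyond `best_params_`, a fitted search exposes the refitted pipeline and the per-candidate results. The sketch below uses hypothetical synthetic data and a deliberately small grid. The log above reports 3 folds, which was the default in older scikit-learn releases; from version 0.22 the default `cv` is 5, so the sketch sets `cv=3` explicitly.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data and a small grid, used only for illustration.
X, y = make_regression(n_samples=120, n_features=6, noise=10.0, random_state=0)

pipe_demo = Pipeline([
    ('scaler', StandardScaler()),
    ('reduce_dim', PCA()),
    ('regressor', Ridge()),
])
params_demo = {
    'reduce_dim__n_components': [2, 4, 6],
    'regressor__alpha': [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe_demo, params_demo, cv=3).fit(X, y)

# Mean cross-validated score (R^2 for regressors) of the best candidate.
print(search.best_params_, search.best_score_)

# best_estimator_ is the whole pipeline, refitted on all the data passed to fit.
best_pca = search.best_estimator_.named_steps['reduce_dim']

# cv_results_ holds one entry per candidate combination.
n_rows = len(search.cv_results_['params'])
```

Calling `search.score` or `search.predict` delegates to `best_estimator_`, which is why the notebook can score the search object directly on the test set.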


# Advanced Pipeline Tuning

In [77]:
scalers_to_test = [StandardScaler(), RobustScaler(), QuantileTransformer()]


In [78]:
# Using a step name as a key lets the search swap in different estimators for that step.
params = {'scaler': scalers_to_test,
          'reduce_dim__n_components': n_features_to_test,
          'regressor__alpha': alpha_to_test}

In [79]:
# A list of dicts defines independent sub-grids: one for PCA, one for SelectKBest.
params = [
        {'scaler': scalers_to_test,
         'reduce_dim': [PCA()],
         'reduce_dim__n_components': n_features_to_test,
         'regressor__alpha': alpha_to_test},

        {'scaler': scalers_to_test,
         'reduce_dim': [SelectKBest(f_regression)],
         'reduce_dim__k': n_features_to_test,
         'regressor__alpha': alpha_to_test}
        ]
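Two separate dictionaries are needed because the two reduction steps expose different parameters (`n_components` for PCA, `k` for `SelectKBest`). The sketch below, using hypothetical `_demo` names, shows that a whole step can be replaced through `set_params` and counts the candidates in each sub-grid.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer, RobustScaler, StandardScaler

# pipe_demo mirrors the notebook's pipeline under a hypothetical name.
pipe_demo = Pipeline([
    ('scaler', StandardScaler()),
    ('reduce_dim', PCA()),
    ('regressor', Ridge()),
])

# A step name is itself a parameter, so an entire step can be swapped out.
pipe_demo.set_params(reduce_dim=SelectKBest(f_regression), reduce_dim__k=5)

scalers_demo = [StandardScaler(), RobustScaler(), QuantileTransformer()]
n_features_demo = np.arange(1, 11)
alphas_demo = 2.0 ** np.arange(-6, 6)

grid_demo = [
    {'scaler': scalers_demo,
     'reduce_dim': [PCA()],
     'reduce_dim__n_components': n_features_demo,
     'regressor__alpha': alphas_demo},
    {'scaler': scalers_demo,
     'reduce_dim': [SelectKBest(f_regression)],
     'reduce_dim__k': n_features_demo,
     'regressor__alpha': alphas_demo},
]

# Each sub-grid has 3 scalers x 10 sizes x 12 alphas = 360 candidates.
sizes = [len(ParameterGrid(g)) for g in grid_demo]
total = len(ParameterGrid(grid_demo))
print(sizes, total)
```

The two sub-grids add up to 360 + 360 = 720 candidates, the figure reported by the search log.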

In [80]:
gridsearch = GridSearchCV(pipe, params, verbose=1).fit(X_train, y_train)
print('Final score is: ', gridsearch.score(X_test, y_test))

Fitting 3 folds for each of 720 candidates, totalling 2160 fits
Final score is:  -5298.26495423


[Parallel(n_jobs=1)]: Done 2160 out of 2160 | elapsed:   33.3s finished


In [81]:
print('Best params are: ', gridsearch.best_params_)

Best params are:  {'reduce_dim': SelectKBest(k=9, score_func=<function f_regression at 0x108ecd620>), 'reduce_dim__k': 9, 'regressor__alpha': 8.0, 'scaler': StandardScaler(copy=True, with_mean=True, with_std=True)}
