# More on Pipelines
We already saw how pipelines can make our lives easier in chapter todo. However, when using model evaluation tools such as ``cross_validate`` and ``GridSearchCV``, using pipelines becomes essential for obtaining valid results.
Using pipelines in ``GridSearchCV`` also enables a variety of powerful use cases. We'll explore both of these in this chapter.

## Data leakage: a common error
Let's start with an error that's commonly made when using cross-validation: leaking information from the validation parts of the data.
This error has been made not only countless times by beginning data scientists, but also in several published scientific research articles.
When doing any preprocessing, it is essential that the preprocessing happens within cross-validation, not outside of it.
While we haven't seen the details of feature selection yet, it provides an excellent example, and so we'll quickly go over it.

### Automatic univariate feature selection
When working with high dimensional datasets, it can be beneficial to work with only a subset of the features. This will reduce the computational burden, increase interpretability, and in some cases can even improve generalization performance.
There are several methods for automating this process, which we will discuss in depth in chapter todo. One of the simplest methods of automatic feature selection is using univariate statistics to rank features.
Univariate means that we look at only one feature at a time and evaluate its relationship with the target, often with a simple statistical measure such as an F-test or t-test.
We can then rank all the features by the strength of their response (or alternatively by how significant their association with the target was) and select the ones deemed most important.
A version of this is implemented in the ``SelectPercentile`` transformer in scikit-learn, which allows you to keep a fixed percentage of the existing features.
This can be a quick and easy way to subselect features from a very wide dataset and is commonly used. Here is a quick example on the breast cancer dataset:

In [5]:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# load the dataset and split it into training and test set
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(X_train.shape)

(426, 30)


In [6]:
# Create a standard pipeline out of scaler and classifier
pipe_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
# Fit and evaluate as a baseline
pipe_knn.fit(X_train, y_train)
pipe_knn.score(X_test, y_test)

0.958041958041958

In [7]:
# create a pipeline subselecting 20% of the features according to univariate statistics
# Order of scaling and selection does not matter in this case
pipe_select = make_pipeline(StandardScaler(), SelectPercentile(percentile=20), KNeighborsClassifier())
# Fit the pipeline
pipe_select.fit(X_train, y_train)
# slice off the classifier, look at shape of transformed data:
pipe_select[:-1].transform(X_train).shape

(426, 6)

As expected, of the 30 original features, ``SelectPercentile`` only kept 20%, meaning 6. Now let's evaluate the whole pipeline:

In [10]:
pipe_select.score(X_test, y_test)

0.958041958041958

The performance using only 20% of the features is actually identical to the performance when using all the features, but the resulting model might be much more interpretable.
We can see which features were selected by TODO.
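
One possible way to inspect the selection, as a minimal sketch: the ``SelectPercentile`` step exposes a boolean mask through ``get_support()``, which can be used to index the feature names that ship with the dataset. The sketch rebuilds the pipeline from above so that it is self-contained; the variable names ``mask`` and ``selected_names`` are only illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=42)

pipe_select = make_pipeline(StandardScaler(), SelectPercentile(percentile=20),
                            KNeighborsClassifier())
pipe_select.fit(X_train, y_train)

# boolean mask with one entry per input feature; True means "kept"
mask = pipe_select['selectpercentile'].get_support()
# feature_names is a NumPy array, so it can be indexed with the mask
selected_names = cancer.feature_names[mask]
print(selected_names)
```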

Now that we have familiarized ourselves with how ``SelectPercentile`` works (at least in general terms), let's look at the data leakage example mentioned above.

In [26]:
# TODO hide
import numpy as np
rng = np.random.RandomState(42)
X = rng.normal(size=(100, 10000))
y = rng.normal(size=(100,)) > 0

Say someone gave you a binary classification dataset like this:

In [31]:
print(X.shape, y.shape)
# count appearances of 0 and 1 in y
print(np.bincount(y))

(100, 10000) (100,)
[53 47]


It's very wide, meaning it has many features compared to the number of samples. This is quite common in sensor networks or in biomedical data, for example.
Given the small size of the dataset, we might want to use cross-validation to assess performance, instead of using a single train-test split.
One might start like this:

In [28]:
# select most informative 5% of features
select = SelectPercentile(percentile=5)
select.fit(X, y)
X_selected = select.transform(X)
print(X_selected.shape)

(100, 500)


Now the dataset seems much more manageable at 500 features (which is arguably still a lot), and we can evaluate our model with ``cross_val_score``:

In [29]:
from sklearn.model_selection import cross_val_score
# run cross-validation with the subselected features
cross_val_score(KNeighborsClassifier(), X_selected, y)

array([1., 1., 1., 1., 1.])

```{margin}
If a model looks too good to be true, an experienced data scientist usually looks for the mistake. Often it's a case of information leakage,
so if you ever observe very high accuracy, you might do well to be skeptical at first.
```

It looks like it's our lucky day: we created a model that classifies our dataset perfectly across all folds. From this evaluation, we might be quite certain we found a good model.
However, we made a mistake: we applied the feature selection procedure outside of the cross-validation. We should apply it inside the cross-validation instead.
In scikit-learn, we can easily do that using a pipeline (as we did above).


In [30]:
pipe = make_pipeline(SelectPercentile(percentile=5), KNeighborsClassifier())
# run cross-validation on the original dataset using the pipeline
cross_val_score(pipe, X, y)

array([0.45, 0.5 , 0.5 , 0.5 , 0.7 ])

If we use the proper evaluation technique, our results change drastically: our model performs around chance level for a roughly balanced dataset like this one. In other words, we might conclude that the model didn't learn anything.
Where does this dramatic difference come from? When we called ``fit`` on ``SelectPercentile`` before the cross-validation, it had access to the full dataset, which includes the training and test parts for each of the splits. This means it could extract information from all parts of the data, even those that we meant to use as validation set during cross-validation. This is a classical example of information leakage, and a good reason to always use pipelines!
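
To make the leak a little more tangible, here is a small sketch (not part of the original example) that compares the features picked when ``SelectPercentile`` sees all 100 samples with the features picked when it only sees the training part of one split. The data is regenerated with the same recipe as above; the fixed 80/20 split and the names ``full_idx``, ``train_idx`` and ``overlap`` are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile

rng = np.random.RandomState(42)
X = rng.normal(size=(100, 10000))
y = rng.normal(size=(100,)) > 0

# hypothetical single split: first 80 rows for training, last 20 held out
train = np.arange(80)

# features chosen when the selector sees every sample (the leaky way)
full_idx = SelectPercentile(percentile=5).fit(X, y).get_support(indices=True)
# features chosen when the selector only sees the training rows
train_idx = SelectPercentile(percentile=5).fit(
    X[train], y[train]).get_support(indices=True)

# features that both selections agree on
overlap = np.intersect1d(full_idx, train_idx)
print(len(full_idx), len(train_idx), len(overlap))
```

Wherever the two sets differ, the full-data selection was influenced by the held-out rows, which is exactly the information a model should not have access to during validation.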

To make the difference in the computation a bit more apparent, I wrote down a more explicit version of the same computation, not using ``cross_val_score`` or ``Pipeline`` (we're using ``KFold`` here which is a way to get the indices to perform K-fold cross-validation, we'll see this in more detail in TODO):

````{list-table}
---
header-rows: 1
---
* - preprocessing before cross validation
  - preprocessing within cross validation
* - ```python
    # BAD!
    select = SelectPercentile(percentile=5)
    select.fit(X, y)  # includes the cv test parts!
    X_sel = select.transform(X)
    scores = []
    for train, test in KFold().split(X, y):
        knn = KNeighborsClassifier().fit(X_sel[train], y[train])
        score = knn.score(X_sel[test], y[test])
        scores.append(score)
    ```
  - ```python
    # GOOD!
    scores = []
    select = SelectPercentile(percentile=5)
    for train, test in KFold().split(X, y):
        select.fit(X[train], y[train])
        X_sel_train = select.transform(X[train])
        knn = KNeighborsClassifier().fit(X_sel_train, y[train])
        X_sel_test = select.transform(X[test])
        score = knn.score(X_sel_test, y[test])
        scores.append(score)
    ```
* - equivalent to:
    ```python
    select = SelectPercentile(percentile=5)
    X_selected = select.fit_transform(X, y)
    scores = cross_val_score(KNeighborsClassifier(), X_selected, y)
    ```
  - ```python
    pipe = make_pipeline(SelectPercentile(percentile=5),
                         KNeighborsClassifier())
    scores = cross_val_score(pipe, X, y)
    ```
    
````

## Pipeline and GridSearchCV
```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

knn_pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())
param_grid = {'kneighborsregressor__n_neighbors': range(1, 10)}
grid = GridSearchCV(knn_pipe, param_grid, cv=10)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.score(X_test, y_test))
```

```
{'kneighborsregressor__n_neighbors': 7}
0.60
```




The step names in a pipeline are important for using
pipelines with grid search. Recall that for using
``GridSearchCV`` you need to specify a parameter grid as a
dictionary, where the keys are the parameter names. If you
are using a pipeline inside ``GridSearchCV``, you need to
specify not only the parameter name, but also the step name,
because multiple steps could have a parameter with the same
name.

The way to do this is to use the step name, then two
underscores, and then the parameter name, as the key for the
``param_grid`` dictionary. With ``make_pipeline``, the step
name is the lowercased class name, which is why the key
above is ``kneighborsregressor__n_neighbors``.

You can see that ``best_params_`` has this same format.

This way you can tune the parameters of all steps in a
pipeline at once!
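
If you are unsure which keys a pipeline accepts, one option is to list them with ``get_params()``, which every scikit-learn estimator provides. The following is a small sketch using a classifier pipeline like the one from the beginning of the chapter; the variable name ``valid_keys`` is only illustrative.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# every key containing a double underscore addresses a parameter of one step
valid_keys = sorted(k for k in pipe.get_params() if '__' in k)
print(valid_keys)

# the same naming scheme also works for set_params
pipe.set_params(kneighborsclassifier__n_neighbors=3)
print(pipe['kneighborsclassifier'].n_neighbors)
```

Any of the printed keys can be used as a key in ``param_grid``.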

And you don’t have to worry about leaking information, since
all transformations are contained in the pipeline.

You should always use pipelines for preprocessing. Not only
does it make your code shorter, it also makes it less likely
that you have bugs.

## Going wild with Pipelines
```python
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, random_state=0)

from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
pipe = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(),
    Ridge())

param_grid = {'polynomialfeatures__degree': [1, 2, 3],
              'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(pipe, param_grid=param_grid,
                    n_jobs=-1, return_train_score=True)
grid.fit(X_train, y_train)
```

## Going wilder with Pipelines
```python
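from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

pipe = Pipeline([('scaler', StandardScaler()),
                 ('regressor', Ridge())])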

param_grid = {'scaler': [StandardScaler(), MinMaxScaler(),
                         'passthrough'],
              'regressor': [Ridge(), Lasso()],
              'regressor__alpha': np.logspace(-3, 3, 7)}


grid = GridSearchCV(pipe, param_grid)
grid.fit(X_train, y_train)
grid.score(X_test, y_test)
```

## Going wildest with Pipelines
```python
from sklearn.tree import DecisionTreeRegressor
pipe = Pipeline([('scaler', StandardScaler()),
                 ('regressor', Ridge())])

# check out searchgrid for more convenience
param_grid = [{'regressor': [DecisionTreeRegressor()],
               'regressor__max_depth': [2, 3, 4],
               'scaler': ['passthrough']},
              {'regressor': [Ridge()],
               'regressor__alpha': [0.1, 1],
               'scaler': [StandardScaler(), MinMaxScaler(),
                          'passthrough']}
             ]
grid = GridSearchCV(pipe, param_grid)
grid.fit(X_train, y_train)
grid.score(X_test, y_test)
```

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import warnings
import sklearn
sklearn.set_config(print_changed_only=True)

In [2]:
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# load and split the data
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# compute minimum and maximum on the training data
scaler = MinMaxScaler().fit(X_train)
# rescale training data
X_train_scaled = scaler.transform(X_train)

svm = SVC()
# learn an SVM on the scaled training data
svm.fit(X_train_scaled, y_train)
# scale test data and score the scaled data
X_test_scaled = scaler.transform(X_test)
svm.score(X_test_scaled, y_test)

0.972027972027972

## Using Pipelines in Grid-searches
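
The grid in this section refers to a pipeline called ``pipe`` with a step named ``svm``, whose definition is not shown here. A minimal sketch of such a pipeline, assuming it chains the ``MinMaxScaler`` and ``SVC`` used in the manual version above (the step name ``scaler`` is an assumption, ``svm`` is taken from the parameter grid), could look like this:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# assumed definition: scaling and the SVM bundled into one estimator
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```

Fitting the pipeline calls ``fit`` on the scaler and then on the SVM, so the manual ``transform`` calls are no longer needed.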

In [6]:
param_grid = {'svm__C': [0.001, 0.01, 0.1, 1, 10, 100],
              'svm__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

In [7]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, param_grid=param_grid)
grid.fit(X_train, y_train)
print("best cross-validation accuracy:", grid.best_score_)
print("test set score: ", grid.score(X_test, y_test))
print("best parameters: ", grid.best_params_)

best cross-validation accuracy: 0.9812311901504789
test set score:  0.972027972027972
best parameters:  {'svm__C': 1, 'svm__gamma': 1}


## Not using Pipelines vs feature selection

In [8]:
rnd = np.random.RandomState(seed=0)
X = rnd.normal(size=(100, 10000))
y = rnd.normal(size=(100,))

In [9]:
from sklearn.feature_selection import SelectPercentile, f_regression

select = SelectPercentile(score_func=f_regression,
                          percentile=5)
select.fit(X, y)
X_selected = select.transform(X)
print(X_selected.shape)

(100, 500)


In [10]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
np.mean(cross_val_score(Ridge(), X_selected, y))

0.9057953065239822

In [11]:
from sklearn.pipeline import Pipeline

pipe = Pipeline([("select", SelectPercentile(score_func=f_regression, percentile=5)),
                 ("ridge", Ridge())])
np.mean(cross_val_score(pipe, X, y))

-0.2465542238495281

## The General Pipeline Interface

In [12]:
def fit(self, X, y):
    X_transformed = X
    for step in self.steps[:-1]:
        # iterate over all but the final step
        # fit and transform the data
        X_transformed = step[1].fit_transform(X_transformed, y)
    # fit the last step
    self.steps[-1][1].fit(X_transformed, y)
    return self

In [13]:
def predict(self, X):
    X_transformed = X
    for step in self.steps[:-1]:
        # iterate over all but the final step
        # transform the data
        X_transformed = step[1].transform(X_transformed)
    # predict using the last step
    return self.steps[-1][1].predict(X_transformed)
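
The two functions above are written as methods and cannot be run on their own. As a rough sketch, they can be wrapped in a small class and compared with the real ``Pipeline``. The class name ``MiniPipeline`` is made up for this illustration, and it leaves out everything else that scikit-learn's ``Pipeline`` does (parameter handling, caching, ``'passthrough'`` steps, and so on):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC


class MiniPipeline:
    """Hypothetical, stripped-down pipeline for illustration only."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) tuples

    def fit(self, X, y):
        X_transformed = X
        for name, step in self.steps[:-1]:
            # fit each transformer and pass its output on to the next step
            X_transformed = step.fit_transform(X_transformed, y)
        # fit the final estimator on the fully transformed data
        self.steps[-1][1].fit(X_transformed, y)
        return self

    def predict(self, X):
        X_transformed = X
        for name, step in self.steps[:-1]:
            # only transform here, nothing is re-fitted
            X_transformed = step.transform(X_transformed)
        return self.steps[-1][1].predict(X_transformed)


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mini = MiniPipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
real = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

mini_pred = mini.fit(X_train, y_train).predict(X_test)
real_pred = real.fit(X_train, y_train).predict(X_test)
# check whether both produce the same predictions on the test set
print(np.array_equal(mini_pred, real_pred))
```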

### Convenient Pipeline creation with ``make_pipeline``

In [14]:
from sklearn.pipeline import make_pipeline
# standard syntax
pipe_long = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC(C=100))])
# abbreviated syntax
pipe_short = make_pipeline(MinMaxScaler(), SVC(C=100))

In [15]:
pipe_short.steps

[('minmaxscaler', MinMaxScaler()), ('svc', SVC(C=100))]

In [16]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipe = make_pipeline(StandardScaler(), PCA(n_components=2),
                     StandardScaler())
pipe.steps

[('standardscaler-1', StandardScaler()),
 ('pca', PCA(n_components=2)),
 ('standardscaler-2', StandardScaler())]

### Accessing step attributes

In [17]:
# fit the pipeline defined above to the cancer dataset
pipe.fit(cancer.data)
# extract the first two principal components from the "pca" step
components = pipe.named_steps.pca.components_
print(components.shape)

(2, 30)


In [18]:
pipe['pca']

PCA(n_components=2)

In [19]:
pipe[0]

StandardScaler()

In [20]:
pipe[1]

PCA(n_components=2)

In [21]:
pipe[:2]

Pipeline(steps=[('standardscaler-1', StandardScaler()),
                ('pca', PCA(n_components=2))])

### Accessing attributes in a grid-searched pipeline

In [22]:
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

In [23]:
param_grid = {'logisticregression__C': [0.01, 0.1, 1, 10, 100]}

In [24]:
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=4)
grid = GridSearchCV(pipe, param_grid)
grid.fit(X_train, y_train)

GridSearchCV(estimator=Pipeline(steps=[('standardscaler', StandardScaler()),
                                       ('logisticregression',
                                        LogisticRegression(max_iter=1000))]),
             param_grid={'logisticregression__C': [0.01, 0.1, 1, 10, 100]})

In [25]:
print(grid.best_estimator_)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression(C=1, max_iter=1000))])


In [26]:
print(grid.best_estimator_.named_steps.logisticregression)
print(grid.best_estimator_['logisticregression'])

LogisticRegression(C=1, max_iter=1000)
LogisticRegression(C=1, max_iter=1000)


In [27]:
print(grid.best_estimator_.named_steps.logisticregression.coef_)

[[-0.43570655 -0.34266946 -0.40809443 -0.5344574  -0.14971847  0.61034122
  -0.72634347 -0.78538827  0.03886087  0.27497198 -1.29780109  0.04926005
  -0.67336941 -0.93447426 -0.13939555  0.45032641 -0.13009864 -0.10144273
   0.43432027  0.71596578 -1.09068862 -1.09463976 -0.85183755 -1.06406198
  -0.74316099  0.07252425 -0.82323903 -0.65321239 -0.64379499 -0.42026013]]


In [28]:
print(grid.best_estimator_['logisticregression'].coef_)

[[-0.43570655 -0.34266946 -0.40809443 -0.5344574  -0.14971847  0.61034122
  -0.72634347 -0.78538827  0.03886087  0.27497198 -1.29780109  0.04926005
  -0.67336941 -0.93447426 -0.13939555  0.45032641 -0.13009864 -0.10144273
   0.43432027  0.71596578 -1.09068862 -1.09463976 -0.85183755 -1.06406198
  -0.74316099  0.07252425 -0.82323903 -0.65321239 -0.64379499 -0.42026013]]


### Grid-searching preprocessing steps and model parameters

In [29]:
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, random_state=0)

from sklearn.preprocessing import PolynomialFeatures
pipe = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(),
    Ridge())

In [30]:
param_grid = {'polynomialfeatures__degree': [1, 2, 3],
              'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}

In [31]:
grid = GridSearchCV(pipe, param_grid=param_grid,
                    n_jobs=-1, return_train_score=True)
grid.fit(X_train, y_train)

GridSearchCV(estimator=Pipeline(steps=[('standardscaler', StandardScaler()),
                                       ('polynomialfeatures',
                                        PolynomialFeatures()),
                                       ('ridge', Ridge())]),
             n_jobs=-1,
             param_grid={'polynomialfeatures__degree': [1, 2, 3],
                         'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100]},
             return_train_score=True)

In [32]:
res = pd.DataFrame(grid.cv_results_)
res.head()

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_polynomialfeatures__degree | param_ridge__alpha | params | split0_test_score | split1_test_score | split2_test_score | ... | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | split3_train_score | split4_train_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.009374 | 0.00223897 | 0.000798 | 0.000399 | 1 | 0.001 | `{'polynomialfeatures__degree': 1, 'ridge__alph...` | 0.43355 | 0.486062 | 0.514033 | ... | 0.512238 | 0.049383 | 5 | 0.579986 | 0.565378 | 0.556566 | 0.54734 | 0.547173 | 0.559289 | 0.012349 |
| 1 | 0.007978 | 0.0008919101 | 0.001197 | 0.000399 | 1 | 0.01 | `{'polynomialfeatures__degree': 1, 'ridge__alph...` | 0.433578 | 0.486053 | 0.514045 | ... | 0.512254 | 0.049385 | 4 | 0.579986 | 0.565378 | 0.556566 | 0.54734 | 0.547173 | 0.559289 | 0.012349 |
| 2 | 0.012566 | 0.006752189 | 0.001595 | 0.000489 | 1 | 0.1 | `{'polynomialfeatures__degree': 1, 'ridge__alph...` | 0.433841 | 0.485966 | 0.514161 | ... | 0.512406 | 0.049413 | 3 | 0.579982 | 0.565377 | 0.556563 | 0.547339 | 0.547166 | 0.559285 | 0.012349 |
| 3 | 0.003186 | 0.001924576 | 0.000598 | 0.000489 | 1 | 1.0 | `{'polynomialfeatures__degree': 1, 'ridge__alph...` | 0.435508 | 0.485451 | 0.514905 | ... | 0.513376 | 0.049622 | 2 | 0.579731 | 0.565309 | 0.556384 | 0.547307 | 0.546763 | 0.559099 | 0.012352 |
| 4 | 0.001994 | 7.599534e-07 | 0.000598 | 0.000488 | 1 | 10.0 | `{'polynomialfeatures__degree': 1, 'ridge__alph...` | 0.441858 | 0.487365 | 0.516466 | ... | 0.515968 | 0.048215 | 1 | 0.577828 | 0.564338 | 0.554608 | 0.54677 | 0.543759 | 0.557461 | 0.012428 |


In [33]:
res = pd.pivot_table(res, index=['param_polynomialfeatures__degree', 'param_ridge__alpha'],
               values=['mean_train_score', 'mean_test_score'])

In [34]:
res['mean_train_score'].unstack()

| param_polynomialfeatures__degree (rows) / param_ridge__alpha (columns) | 0.001 | 0.010 | 0.100 | 1.000 | 10.000 | 100.000 |
|---|---|---|---|---|---|---|
| 1 | 0.559289 | 0.559289 | 0.559285 | 0.559099 | 0.557461 | 0.53627 |
| 2 | 0.666707 | 0.666268 | 0.664697 | 0.661747 | 0.650237 | 0.605903 |
| 3 | 0.974622 | 0.958083 | 0.927772 | 0.883468 | 0.823504 | 0.719794 |


In [35]:
res['mean_test_score'].unstack()

| param_polynomialfeatures__degree (rows) / param_ridge__alpha (columns) | 0.001 | 0.010 | 0.100 | 1.000 | 10.000 | 100.000 |
|---|---|---|---|---|---|---|
| 1 | 0.512238 | 0.512254 | 0.512406 | 0.513376 | 0.515968 | 0.507934 |
| 2 | 0.374003 | 0.379897 | 0.392917 | 0.428173 | 0.471838 | 0.492605 |
| 3 | -51.023088 | -16.45233 | -5.166146 | -1.194386 | 0.060621 | 0.315993 |


In [36]:
print(grid.best_params_)

{'polynomialfeatures__degree': 1, 'ridge__alpha': 10}


In [37]:
# get_feature_names was removed in scikit-learn 1.2; get_feature_names_out replaces it
list(grid.best_estimator_['polynomialfeatures'].get_feature_names_out(diabetes.feature_names))

['1', 'age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

In [38]:
grid.score(X_test, y_test)

0.3580342762914789

In [39]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import RepeatedKFold

In [40]:
pipe = Pipeline([('scaler', StandardScaler()), ('regressor', Ridge())])

param_grid = {'scaler': [StandardScaler(), MinMaxScaler(), 'passthrough'],
              'regressor': [Ridge(), Lasso()],
              'regressor__alpha': np.logspace(-3, 3, 7)}

grid = GridSearchCV(pipe, param_grid,
                    cv=RepeatedKFold(n_splits=10, n_repeats=10))
grid.fit(X_train, y_train)
grid.score(X_test, y_test)

0.3556784125736949

In [41]:
grid.best_score_

0.5041191056200575

In [42]:
grid.best_params_

{'regressor': Lasso(), 'regressor__alpha': 1.0, 'scaler': StandardScaler()}

In [43]:
from sklearn.tree import DecisionTreeRegressor
param_grid = [{'regressor': [DecisionTreeRegressor()],
               'regressor__max_depth': [2, 3, 4]},
              {'regressor': [Ridge()],
               'regressor__alpha': [0.1, 1]}
             ]