# Algorithm Chains and Pipelines

We will cover how to use the `Pipeline` class to simplify the process of building chains of transformations and models. In particular, we will see how we can combine `Pipeline` and `GridSearchCV` to search over parameters for all processing steps at once.

In [1]:
import numpy as np
import pandas as pd

In [2]:
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# load and split the data
cancer = load_breast_cancer()

In [2]:
cancer

{'data': array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
         1.189e-01],
        [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
         8.902e-02],
        [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
         8.758e-02],
        ...,
        [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
         7.820e-02],
        [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
         1.240e-01],
        [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
         7.039e-02]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
        1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
        1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
        1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0

In [3]:
X = cancer.data
y = cancer.target

In [4]:
pd.DataFrame(X).describe()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | ... | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 |
| mean | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.09636 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | 0.062798 | ... | 16.26919 | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 |
| std | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.07972 | 0.038803 | 0.027414 | 0.00706 | ... | 4.833242 | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 |
| min | 6.981 | 9.71 | 43.79 | 143.5 | 0.05263 | 0.01938 | 0.0 | 0.0 | 0.106 | 0.04996 | ... | 7.93 | 12.02 | 50.41 | 185.2 | 0.07117 | 0.02729 | 0.0 | 0.0 | 0.1565 | 0.05504 |
| 25% | 11.7 | 16.17 | 75.17 | 420.3 | 0.08637 | 0.06492 | 0.02956 | 0.02031 | 0.1619 | 0.0577 | ... | 13.01 | 21.08 | 84.11 | 515.3 | 0.1166 | 0.1472 | 0.1145 | 0.06493 | 0.2504 | 0.07146 |
| 50% | 13.37 | 18.84 | 86.24 | 551.1 | 0.09587 | 0.09263 | 0.06154 | 0.0335 | 0.1792 | 0.06154 | ... | 14.97 | 25.41 | 97.66 | 686.5 | 0.1313 | 0.2119 | 0.2267 | 0.09993 | 0.2822 | 0.08004 |
| 75% | 15.78 | 21.8 | 104.1 | 782.7 | 0.1053 | 0.1304 | 0.1307 | 0.074 | 0.1957 | 0.06612 | ... | 18.79 | 29.72 | 125.4 | 1084.0 | 0.146 | 0.3391 | 0.3829 | 0.1614 | 0.3179 | 0.09208 |
| max | 28.11 | 39.28 | 188.5 | 2501.0 | 0.1634 | 0.3454 | 0.4268 | 0.2012 | 0.304 | 0.09744 | ... | 36.04 | 49.54 | 251.2 | 4254.0 | 0.2226 | 1.058 | 1.252 | 0.291 | 0.6638 | 0.2075 |


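The features span very different ranges (for example, column 3 runs from 143.5 to 2501, while column 4 stays below 0.2), so it is necessary to **rescale** the data before training.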
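
As a minimal sketch of what min-max rescaling does, the toy example below applies `MinMaxScaler` to a small array and compares the result with the formula $(x - \min) / (\max - \min)$ computed by hand, column by column. The array `X_toy` and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical values): two features on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

# MinMaxScaler maps each column to the range [0, 1].
scaled = MinMaxScaler().fit_transform(X_toy)

# The same computation written out by hand, column by column.
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
manual = (X_toy - col_min) / (col_max - col_min)

print(scaled)
print("Matches the manual formula:", np.allclose(scaled, manual))
```

After scaling, both columns cover the same range, so neither dominates simply because of its units.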

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)

In [6]:
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

**Q1. Why not scale `X` before splitting the data?**

In [7]:
sgd = SGDClassifier()

In [8]:
sgd.fit(X_train_scaled, y_train)
print("Test score: {:.2f}".format(sgd.score(X_test_scaled, y_test)))

Test score: 0.96


## 1. Parameter Selection with Preprocessing

Now let’s say we want to find better parameters for `SGDClassifier` using `GridSearchCV`:

In [9]:
from sklearn.model_selection import GridSearchCV
# for illustration purposes only, don't use this code!
param_grid = {'alpha': [0.00001, 0.0001, 0.001, 0.01, 1, 10]}
grid = GridSearchCV(SGDClassifier(), param_grid=param_grid, cv=10)
grid.fit(X_train_scaled, y_train)
print("Best cross-validation accuracy: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)
print("Test set accuracy: {:.2f}".format(grid.score(X_test_scaled, y_test)))


Best cross-validation accuracy: 0.97
Best parameters:  {'alpha': 0.001}
Test set accuracy: 0.96


**Q2. Let's apply the logic from Q1 here. What's going on?**

1. When scaling the data, we used all the data in the training set to compute the minimum and maximum of each feature.
2. We then used the scaled training data to run our grid search with cross-validation.
3. For each split in the cross-validation, some part of the original training set is declared the training part of the split, and some the test part of the split.
4. The test part is used to measure how a model trained on the training part performs on new data. However, we already used the information contained in the test part of the split when scaling the data.
5. **Remember** that the test part of each cross-validation split is part of the training set, and we used information from the entire training set to find the right scaling of the data.

- This is fundamentally different from how new data looks to the model.
- If we observe new data (say, in the form of our test set), this data will not have been used to scale the training data, and it might have a different minimum and maximum than the training data.
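
The last point can be checked directly. The sketch below repeats the split used earlier (the variable names such as `X_tr` and `scaler_demo` are chosen here only to avoid overwriting the notebook's own variables) and prints the overall range of the scaled training and test data. The training data lands in $[0, 1]$ by construction, whereas the test data may fall outside that range, because its values played no part in fitting the scaler.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0)

# The scaler only ever sees the training data.
scaler_demo = MinMaxScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
X_te_s = scaler_demo.transform(X_te)

# Training data spans [0, 1]; test data is not guaranteed to.
print("train range:", X_tr_s.min(), X_tr_s.max())
print("test range: ", X_te_s.min(), X_te_s.max())
```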

In [10]:
from IPython.display import Image
Image(url="https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781449369880/files/assets/malp_0601.png", 
      width=1000, height=750)

- So, the splits in the cross-validation no longer correctly mirror how new data will look to the modeling process. 
- We already leaked information from these parts of the data into our modeling process. 
- This will lead to overly optimistic results during cross-validation, and possibly the selection of suboptimal parameters.
- To get around this problem, the splitting of the dataset during cross-validation should be done before doing any preprocessing.
- Any process that extracts knowledge from the dataset should only ever be learned from the training portion of the dataset, and therefore be contained inside the cross-validation loop.
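
To make "inside the cross-validation loop" concrete, here is a hand-written sketch of such a loop, in which the scaler is refit on the training part of every split. It is only an illustration: the choice of `StratifiedKFold` with five shuffled folds and the fixed `random_state` values are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import MinMaxScaler

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, train_size=0.8, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in cv.split(X_tr, y_tr):
    # Fit the scaler on the training part of this split only.
    fold_scaler = MinMaxScaler().fit(X_tr[train_idx])
    X_fold_train = fold_scaler.transform(X_tr[train_idx])
    # The held-out part is transformed, never used for fitting.
    X_fold_val = fold_scaler.transform(X_tr[val_idx])

    model = SGDClassifier(random_state=0).fit(X_fold_train, y_tr[train_idx])
    fold_scores.append(model.score(X_fold_val, y_tr[val_idx]))

print("Mean cross-validation accuracy: {:.2f}".format(np.mean(fold_scores)))
```

Writing this loop by hand for every preprocessing step quickly becomes tedious and error-prone.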

## 2. Building Pipelines

Let’s look at how we can use the `Pipeline` class to express the workflow for training an `SGDClassifier` after scaling the data with `MinMaxScaler` (for now without the grid search):

In [11]:
from sklearn.pipeline import Pipeline
pipe = Pipeline([("scaler", MinMaxScaler()), ("sgd", SGDClassifier())])

Here, we created two steps: the first, called "scaler", is an instance of `MinMaxScaler`, and the second, called "sgd", is an instance of `SGDClassifier`. Now, we can fit the pipeline, like any other scikit-learn estimator:

In [12]:
pipe.fit(X_train, y_train)

Pipeline(steps=[('scaler', MinMaxScaler()), ('sgd', SGDClassifier())])

Here, `pipe.fit` first calls `fit` on the first step (the scaler), then transforms the training data using the scaler, and finally fits the SGDClassifier with the scaled data. To evaluate on the test data, we simply call `pipe.score`:

In [13]:
print("Test score: {:.2f}".format(pipe.score(X_test, y_test)))

Test score: 0.97


Calling the `score` method on the pipeline first transforms the test data using the scaler, and then calls the `score` method on the `SGDClassifier` using the scaled test data.
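
The sketch below spells out these steps by hand and checks that they give the same result as the pipeline. A fixed `random_state=0` is passed to `SGDClassifier` here (an assumption added for this comparison) so that both fits are deterministic. The names `pipe_demo`, `manual_scaler`, and `manual_sgd` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X_c, y_c = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_c, y_c, train_size=0.8, random_state=0)

# Pipeline version: one fit call, one score call.
pipe_demo = Pipeline([("scaler", MinMaxScaler()),
                      ("sgd", SGDClassifier(random_state=0))])
pipe_demo.fit(X_tr, y_tr)
pipe_score = pipe_demo.score(X_te, y_te)

# Manual version of the same steps.
manual_scaler = MinMaxScaler()
X_tr_scaled = manual_scaler.fit_transform(X_tr)   # fit + transform the training data
manual_sgd = SGDClassifier(random_state=0).fit(X_tr_scaled, y_tr)
manual_score = manual_sgd.score(manual_scaler.transform(X_te), y_te)

same_predictions = np.array_equal(
    pipe_demo.predict(X_te),
    manual_sgd.predict(manual_scaler.transform(X_te)))
print("Identical predictions:", same_predictions)
```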

Using the pipeline, we reduced the code needed for our “preprocessing + classification” process. The main benefit of using the pipeline, however, is that we can now use this single estimator in `cross_val_score` or `GridSearchCV`.
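
For instance, passing the pipeline to `cross_val_score` could look like the sketch below. The scaler is then refit on the training part of each split automatically. The choice of `cv=5` and the variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X_cv, y_cv = load_breast_cancer(return_X_y=True)

pipe_cv = Pipeline([("scaler", MinMaxScaler()),
                    ("sgd", SGDClassifier(random_state=0))])

# The whole pipeline (scaling + classifier) is refit on each training split.
cv_scores = cross_val_score(pipe_cv, X_cv, y_cv, cv=5)
print("Per-fold accuracy:", np.round(cv_scores, 2))
print("Mean accuracy: {:.2f}".format(cv_scores.mean()))
```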

## 3. Using Pipelines in Grid Searches

Using a pipeline in a grid search works the same way as using any other estimator.

We define a parameter grid to search over, and construct a `GridSearchCV` from the pipeline and the parameter grid. 

When specifying the parameter grid, there is a slight change, though. 

We need to specify for each parameter which step of the pipeline it belongs to.

Here we want to tune `alpha`, which is a parameter of `SGDClassifier`.

The syntax to define a parameter grid for a pipeline is to specify for each parameter the step name, followed by `__` (a double underscore), followed by the parameter name. 
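
One way to see which names are available is to inspect `get_params()` on the pipeline, as in the short sketch below. The same `step__parameter` names are accepted by `set_params`. The variable name `pipe_params` is illustrative.

```python
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

pipe_params = Pipeline([("scaler", MinMaxScaler()), ("sgd", SGDClassifier())])

# Every parameter of every step is exposed as "<step name>__<parameter name>".
all_params = pipe_params.get_params()
sgd_keys = sorted(k for k in all_params if k.startswith("sgd__"))
print(sgd_keys)

# The same naming scheme works for setting a parameter on a single step.
pipe_params.set_params(sgd__alpha=0.01)
print(pipe_params.named_steps["sgd"].alpha)
```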

To search over the `alpha` parameter of `SGDClassifier`, we therefore use `"sgd__alpha"` as the key in the parameter grid dictionary:

In [23]:
param_grid = {"sgd__alpha": [0.00001, 0.0001, 0.001, 0.01, 1, 10]}

With this parameter grid we can use `GridSearchCV` as usual:

In [22]:
grid = GridSearchCV(pipe, param_grid=param_grid, cv=10)
grid.fit(X_train, y_train)
print("Best cross-validation accuracy: {:.2f}".format(grid.best_score_))
print("Test set score: {:.2f}".format(grid.score(X_test, y_test)))
print("Best parameters: {}".format(grid.best_params_))


Best cross-validation accuracy: 0.97
Test set score: 0.96
Best parameters: {'sgd__alpha': 0.0001}


In contrast to the grid search we did before, the `MinMaxScaler` is now refit for each split in the cross-validation using only the training part of the split, so no information is leaked from the test part into the parameter search.
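
After the search, `best_estimator_` holds the whole pipeline refit on the full training set (the default `refit=True` behavior), and its fitted steps can be inspected through `named_steps`. The sketch below uses a smaller, illustrative `alpha` grid, `cv=5`, and a fixed `random_state`. These are assumptions made to keep the example quick, not the settings used above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X_g, y_g = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_g, y_g, train_size=0.8, random_state=0)

pipe_g = Pipeline([("scaler", MinMaxScaler()),
                   ("sgd", SGDClassifier(random_state=0))])
grid_g = GridSearchCV(pipe_g, {"sgd__alpha": [0.0001, 0.001, 0.01]}, cv=5)
grid_g.fit(X_tr, y_tr)

# The best pipeline, refit on all of X_tr with the best alpha.
best_pipe = grid_g.best_estimator_
best_scaler = best_pipe.named_steps["scaler"]
best_sgd = best_pipe.named_steps["sgd"]

# The refit scaler learned its minima from the full training set.
print(np.allclose(best_scaler.data_min_, X_tr.min(axis=0)))
print("Coefficient array shape:", best_sgd.coef_.shape)
```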

## 4. Illustrating Information Leakage

Reference: Hastie, Tibshirani, and Friedman, *The Elements of Statistical Learning*

Let’s consider a synthetic regression task with 100 samples and 10,000 features that are sampled independently from a Gaussian distribution. The target is sampled from a Gaussian as well:

In [24]:
rnd = np.random.RandomState(seed=0)
X = rnd.normal(size=(100, 10000))
y = rnd.normal(size=(100,))

In [26]:
X.shape

(100, 10000)

Given the way we created the dataset, there is no relation between the data, `X`, and the target, `y` (they are independent), so it should not be possible to learn anything from this dataset.

We will now do the following:

1. Select the most informative of the 10,000 features (the top 5%, i.e. 500 features) using `SelectPercentile` feature selection.
2. Evaluate a `Ridge` regressor on the selected features using cross-validation.

In [27]:
from sklearn.feature_selection import SelectPercentile, f_regression

select = SelectPercentile(score_func=f_regression, percentile=5).fit(X, y)
X_selected = select.transform(X)
print("X_selected.shape: {}".format(X_selected.shape))

X_selected.shape: (100, 500)


In [28]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
print("Cross-validation accuracy (cv only on ridge): {:.2f}".format(np.mean(cross_val_score(Ridge(), X_selected, y, cv=5))))

Cross-validation accuracy (cv only on ridge): 0.91


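The mean $R^2$ across the cross-validation folds is 0.91, indicating a very good model. (For a regressor, `cross_val_score` reports $R^2$ by default, not accuracy, despite the label in the printed message.)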

This clearly cannot be right, as our data is entirely random. 

What happened here is that our feature selection picked out some features among the 10,000 random features that are (by chance) very well correlated with the target. 

Because we fit the feature selection outside of the cross-validation, it could find features that are correlated both on the training and the test folds. 

The information we leaked from the test folds was very informative, leading to highly unrealistic results. 

Let’s compare this to a proper cross-validation using a pipeline:

In [29]:
pipe = Pipeline([("select", SelectPercentile(score_func=f_regression, percentile=5)),
                 ("ridge", Ridge())])
print("Cross-validation accuracy (pipeline): {:.2f}".format(np.mean(cross_val_score(pipe, X, y, cv=5))))

Cross-validation accuracy (pipeline): -0.25


This time, we get a negative $R^2$ score, indicating a very poor model. 
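
As a reminder of why $R^2$ can be negative: it is defined as $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, so a model that always predicts the mean $\bar{y}$ scores 0, and any model that does worse than that scores below 0. The toy values in the sketch below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0])

# Always predicting the mean gives R^2 = 0.
r2_mean = r2_score(y_true, np.full(3, y_true.mean()))

# Predictions that are worse than the mean give a negative R^2:
# residual sum of squares = 8, total sum of squares = 2, so R^2 = 1 - 8/2 = -3.
r2_bad = r2_score(y_true, np.array([3.0, 2.0, 1.0]))

print(r2_mean, r2_bad)
```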

Using the pipeline, the feature selection is now inside the cross-validation loop. 

This means features can only be selected using the training folds of the data, not the test fold. 

The feature selection finds features that are correlated with the target on the training set, but because the data is entirely random, these features are not correlated with the target on the test set. 
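
This can be checked numerically. The sketch below selects features on the first 80 samples only and then compares the mean absolute correlation with the target for the selected features and for all features, on both the selection samples and the 20 held-out samples. The 80/20 split and the helper `mean_abs_corr` are assumptions introduced for this illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression

rng = np.random.RandomState(0)
X_rand = rng.normal(size=(100, 10000))
y_rand = rng.normal(size=100)

train, test = np.arange(80), np.arange(80, 100)

# Select the top 5% of features using the training samples only.
selector = SelectPercentile(score_func=f_regression, percentile=5)
selector.fit(X_rand[train], y_rand[train])
chosen = selector.get_support()

def mean_abs_corr(X_part, y_part):
    """Mean absolute Pearson correlation between each column and the target."""
    Xc = X_part - X_part.mean(axis=0)
    yc = y_part - y_part.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / np.sqrt(
        (Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(r).mean()

train_selected = mean_abs_corr(X_rand[train][:, chosen], y_rand[train])
train_all = mean_abs_corr(X_rand[train], y_rand[train])
test_selected = mean_abs_corr(X_rand[test][:, chosen], y_rand[test])
test_all = mean_abs_corr(X_rand[test], y_rand[test])

print("train: selected {:.3f} vs all {:.3f}".format(train_selected, train_all))
print("test:  selected {:.3f} vs all {:.3f}".format(test_selected, test_all))
```

On the samples used for selection, the chosen features stand out clearly. On the held-out samples, they should look no better than an average random feature.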

In this example, rectifying the data leakage issue in the feature selection makes the difference between concluding that a model works very well and concluding that a model does not work at all.

**Do not confuse scikit-learn's `Pipeline` with data science pipelines**

Data science pipelines and workflows involve many complex, multidisciplinary, and iterative steps.

In that context, "pipeline" usually refers to the end-to-end process.

Let’s take a typical machine learning model development workflow as an example.

We start with data preparation, then move to model training and tuning.

Eventually, we deploy our model (or application) into a production environment.

Each of those steps consists of several subtasks.

Consider how these steps map onto your final project.

In [36]:
Image(url="https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781492079385/files/assets/dsaw_0101.png", 
      width=1000, height=400)