## Algorithm Chains and Pipelines

In [1]:
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
# allow multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

The following exercise is adapted from Chapter 6 of *Introduction to Machine Learning with Python* by Andreas C. Müller and Sarah Guido.

In this notebook we will review how to chain together many different processing steps and machine learning models by using the *Pipeline* class. This was already covered in the previous notebooks, especially the notebooks from Lectures 2 and 4, but here we will give a few more details.
 
Let's start by importing the breast cancer dataset and again use the KNN classifier. Before fitting the KNN classifier, we will split the data into training and test sets, fit the MinMax scaler on the training data only, and then transform both the training and the test data.


In [2]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import classification_report

# load and split the data
bc = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(bc.data, bc.target, random_state=0)

# compute minimum and maximum on the training data
scaler = MinMaxScaler().fit(X_train)
# rescale both the training and the test data with the training-set minimum and maximum
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

KNeighborsClassifier()

In [3]:
y_pred = knn.predict(X_test_scaled)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.98      0.91      0.94        53
           1       0.95      0.99      0.97        90

    accuracy                           0.96       143
   macro avg       0.96      0.95      0.95       143
weighted avg       0.96      0.96      0.96       143



For *KNeighborsClassifier*, the default scoring metric is accuracy. Hence, if we were only interested in accuracy, we could simply have used the *score* method instead of *predict* and *classification_report*.

In [4]:
knn_score=knn.score(X_test_scaled, y_test)
print("Test score: {:.2f}".format(knn_score))

Test score: 0.96
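
As a quick check of the statement above, the following sketch compares *score* with an explicit call to *accuracy_score* on the predictions. The variable names `acc_from_score` and `acc_from_metric` are chosen here for illustration only.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# same data, split, and scaling as in the cells above
bc = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(bc.data, bc.target, random_state=0)
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train_scaled, y_train)

# accuracy via the classifier's own score method
acc_from_score = knn.score(X_test_scaled, y_test)
# accuracy computed explicitly from the predictions
acc_from_metric = accuracy_score(y_test, knn.predict(X_test_scaled))

print("Both routes agree:", np.isclose(acc_from_score, acc_from_metric))
```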


Now let’s say we want to find better parameters for KNN using *GridSearchCV*, as discussed in Lecture 4. 
Recall that some of the parameters of *KNeighborsClassifier* are:
- `n_neighbors`: number of neighbors to use for neighbor queries (the default is 5).
- `weights` {‘uniform’, ‘distance’}: 
    - `uniform`: all points in each neighborhood are weighted equally.
    - `distance`: points are weighted by the inverse of their distance, so closer neighbors have more influence.


Recall that, after fitting, *GridSearchCV* exposes the attributes:
- `best_estimator_`: the estimator that gave the highest mean score on the left-out data.
- `best_score_`: the mean cross-validated score of `best_estimator_`.
- `best_params_`: the parameter setting that gave the best results on the held-out data.
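
The three attributes listed above can be inspected as in the following minimal sketch. To keep the example focused on the attributes themselves, it searches over the unscaled training data, so the numbers it prints are not meant to match the results later in this notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

bc = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(bc.data, bc.target, random_state=0)

param_grid = {'n_neighbors': range(1, 20, 2),
              'weights': ['uniform', 'distance']}
grid = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

# the parameter combination with the highest mean cross-validated score
print("best_params_:", grid.best_params_)
# the mean cross-validated score of that combination
print("best_score_: {:.3f}".format(grid.best_score_))
# an estimator with those parameters, ready to be used for prediction
print("best_estimator_:", grid.best_estimator_)

# best_score_ is the largest entry of the per-candidate mean test scores
best_mean = grid.cv_results_['mean_test_score'].max()
print("matches cv_results_:", np.isclose(grid.best_score_, best_mean))
```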

    
    
### A naive and WRONG approach to doing a grid search with data scaling might look like this:

In [6]:
from sklearn.model_selection import GridSearchCV
# for illustration purposes only, don't use this code!
param_grid = {'n_neighbors': range(1, 20,2), 
              'weights': ['uniform', 'distance']
             }
grid = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=5)
grid.fit(X_train_scaled, y_train)
print("Best cross-validation accuracy: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)
print("Test set accuracy: {:.2f}".format(grid.score(X_test_scaled, y_test)))

GridSearchCV(cv=5, estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': range(1, 20, 2),
                         'weights': ['uniform', 'distance']})

Best cross-validation accuracy: 0.97
Best parameters:  {'n_neighbors': 9, 'weights': 'uniform'}
Test set accuracy: 0.97


Here, we ran the grid search over the parameters of *KNeighborsClassifier* using the scaled data. However, there is a subtle catch in what we just did. When scaling the data, we used ALL the data in the training set to compute the minimum and maximum of the data. We then used the scaled training data to run our grid search using cross-validation. **For each split in the cross-validation, some part of the original training set will be declared the training part of the split, and some the test part of the split. The test part is used to measure the performance of a model trained on the training part when applied to new data. However, we already used the information contained in the test part of the split, when scaling the data. Remember that the test part in each split in the cross-validation is part of the training set, and we used the information from the entire training set to find the right scaling of the data.**

This is fundamentally different from how new data looks to the model. If we observe new data (say, in the form of our test set), this data will not have been used to scale the training data, and it might have a different minimum and maximum than the training data. **So, the splits in the cross-validation no longer correctly mirror how new data will look to the modeling process. We already leaked information from these parts of the data into our modeling process.** This will lead to overly optimistic results during cross-validation, and possibly the selection of suboptimal parameters.

To get around this problem, **the splitting of the dataset during cross-validation should be done BEFORE doing any preprocessing.** Any process that extracts knowledge from the dataset should only ever be learned from the training portion of the dataset, and therefore be contained inside the cross-validation loop.
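
To make "inside the cross-validation loop" concrete, here is a hand-written sketch of a leak-free cross-validation: the scaler is refit on the training part of every split and only applied to the held-out part. The use of *StratifiedKFold* with five splits and the name `fold_scores` are assumptions made for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

bc = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(bc.data, bc.target, random_state=0)

cv = StratifiedKFold(n_splits=5)
fold_scores = []
for train_idx, val_idx in cv.split(X_train, y_train):
    # learn the minimum and maximum from the training part of this split only
    scaler = MinMaxScaler().fit(X_train[train_idx])
    X_fold_train = scaler.transform(X_train[train_idx])
    # the held-out part is transformed but never used for fitting the scaler
    X_fold_val = scaler.transform(X_train[val_idx])

    knn = KNeighborsClassifier(n_neighbors=5).fit(X_fold_train, y_train[train_idx])
    fold_scores.append(knn.score(X_fold_val, y_train[val_idx]))

print("Leak-free CV accuracy per split:", np.round(fold_scores, 3))
print("Mean: {:.3f}".format(np.mean(fold_scores)))
```

Writing this loop by hand for every model and every parameter combination quickly becomes tedious and error-prone.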

To achieve this in *scikit-learn* with the *cross_val_score* function and the *GridSearchCV* function, we can use the *Pipeline* class. The *Pipeline* class is a class that allows “gluing” together multiple processing steps into a single *scikit-learn* estimator. The *Pipeline* class itself has fit, predict, and score methods and behaves just like any other model in *scikit-learn*. The most common use case of the *Pipeline* class is in chaining preprocessing steps (like scaling of the data) together with a supervised model like a classifier.

### Building Pipelines
First, we build a pipeline object by providing it with a list of steps. Each step is a tuple containing a name (any string of your choosing) and an instance of an estimator:

In [7]:
from sklearn.pipeline import Pipeline
pipe = Pipeline([("scaler", MinMaxScaler()), ("knn", KNeighborsClassifier(n_neighbors=5))])

Here, we created two steps: the first, called "scaler", is an instance of *MinMaxScaler*, and the second, called "knn", is an instance of *KNeighborsClassifier*. Now, we can fit the pipeline, like any other *scikit-learn* estimator:

In [8]:
pipe.fit(X_train, y_train)

Pipeline(steps=[('scaler', MinMaxScaler()), ('knn', KNeighborsClassifier())])

Here, *pipe.fit* first calls fit on the first step (the scaler), **then transforms the training data using the scaler,** and finally fits the *KNeighborsClassifier* with the scaled data. To evaluate on the test data, we simply call *pipe.score*:

In [9]:
pipe.score(X_test, y_test)

0.958041958041958
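
The calls to *pipe.fit* and *pipe.score* above can be reproduced by hand. The sketch below fits the two steps manually and checks that the predictions agree with those of the pipeline; the flag `same_predictions` is a name introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

bc = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(bc.data, bc.target, random_state=0)

# the pipeline does scaling and classification in one object
pipe = Pipeline([("scaler", MinMaxScaler()), ("knn", KNeighborsClassifier(n_neighbors=5))])
pipe.fit(X_train, y_train)

# the same two steps, carried out manually
scaler = MinMaxScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(X_train), y_train)

same_predictions = np.array_equal(
    pipe.predict(X_test),
    knn.predict(scaler.transform(X_test)),
)
print("Pipeline and manual predictions agree:", same_predictions)
```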

Calling the *score* method on the pipeline **first** transforms the test data using the scaler, and **then** calls the *score* method on the *KNeighborsClassifier* **using the scaled test data.** As you can see, the result is identical to the one we got from the code at the beginning of this notebook, when doing the transformations by hand. Using the pipeline, we reduced the code needed for our “preprocessing + classification” process. **The main benefit of using the pipeline, however, is that we can now use this single estimator in *cross_val_score* or *GridSearchCV*.**
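
As a small illustration of the last point, the pipeline can be passed to *cross_val_score* like any other estimator. In this sketch the scaler is refit on the training part of each of the five splits, because it lives inside the pipeline.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

bc = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(bc.data, bc.target, random_state=0)

pipe = Pipeline([("scaler", MinMaxScaler()), ("knn", KNeighborsClassifier(n_neighbors=5))])

# note that we pass the UNSCALED training data; scaling happens inside each split
scores = cross_val_score(pipe, X_train, y_train, cv=5)
print("Cross-validation scores:", scores)
print("Mean cross-validation accuracy: {:.3f}".format(scores.mean()))
```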

Using a pipeline in a grid search works the same way as using any other estimator. We define a parameter grid to search over, and construct a *GridSearchCV* from the pipeline and the parameter grid. **When specifying the parameter grid, there is a slight change, though. We need to specify for each parameter which step of the pipeline it belongs to.** Both parameters that we want to adjust, *n_neighbors* and *weights*, are parameters of *KNeighborsClassifier*, the second step. We gave this step the name "knn". **The syntax to define a parameter grid for a pipeline is to specify for each parameter the step name, followed by __ (a double underscore),** followed by the parameter name. To search over the *n_neighbors* parameter of knn we therefore have to use "knn__n_neighbors" as the key in the parameter grid dictionary, and similarly for *weights*.

Recall that we can also use *get_params* to see the names of all the parameters of an estimator, as we did in the notebook for Lecture 4. All *scikit-learn* estimators have *get_params* and *set_params* methods. The *get_params* method can be called without arguments and returns a dictionary of the `__init__` parameters of the estimator, together with their values. Let's check the parameters of our pipeline:

In [10]:
pipe.get_params()

{'memory': None,
 'steps': [('scaler', MinMaxScaler()), ('knn', KNeighborsClassifier())],
 'verbose': False,
 'scaler': MinMaxScaler(),
 'knn': KNeighborsClassifier(),
 'scaler__clip': False,
 'scaler__copy': True,
 'scaler__feature_range': (0, 1),
 'knn__algorithm': 'auto',
 'knn__leaf_size': 30,
 'knn__metric': 'minkowski',
 'knn__metric_params': None,
 'knn__n_jobs': None,
 'knn__n_neighbors': 5,
 'knn__p': 2,
 'knn__weights': 'uniform'}
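
The keys in this dictionary are exactly the names that *set_params* accepts. The following sketch changes two parameters of the "knn" step through the pipeline, using the same double-underscore syntax; the values 15 and "distance" are arbitrary choices for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

pipe = Pipeline([("scaler", MinMaxScaler()), ("knn", KNeighborsClassifier(n_neighbors=5))])

# <step name>__<parameter name> addresses a parameter of one step
pipe.set_params(knn__n_neighbors=15, knn__weights="distance")

print(pipe.get_params()["knn__n_neighbors"])  # 15
print(pipe.named_steps["knn"].weights)        # distance
```

This is essentially what *GridSearchCV* does for every candidate in the parameter grid before fitting the pipeline.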

In [11]:
param_grid = {'knn__n_neighbors': range(1, 20,2), 
              'knn__weights': ['uniform', 'distance']
             }


In [12]:
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation accuracy: {:.2f}".format(grid.best_score_))
print("Test set score: {:.2f}".format(grid.score(X_test, y_test)))
print("Best parameters: {}".format(grid.best_params_))

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scaler', MinMaxScaler()),
                                       ('knn', KNeighborsClassifier())]),
             param_grid={'knn__n_neighbors': range(1, 20, 2),
                         'knn__weights': ['uniform', 'distance']})

Best cross-validation accuracy: 0.97
Test set score: 0.97
Best parameters: {'knn__n_neighbors': 15, 'knn__weights': 'distance'}


**In contrast to the grid search we did before, for each split in the cross-validation the *MinMaxScaler* is now refit using ONLY the training part of the split, so no information is leaked from the test part of the split into the parameter search.**

### The General Pipeline Interface
The *Pipeline* class is not restricted to preprocessing and classification, but can in fact join any number of estimators together. For example, we could build a pipeline containing feature extraction, feature selection, scaling, and classification, for a total of four steps. Similarly, the last step could be regression or clustering instead of classification.

**The only requirement for estimators in a pipeline is that all but the last step need to have a transform method, so they can produce a new representation of the data that can be used in the next step.**

Internally, during the call to *Pipeline.fit*, the pipeline calls fit and then transform on each step in turn, with the input given by the output of the transform method of the previous step. For the last step in the pipeline, just fit is called.
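
The behavior described above can be sketched in a few lines. This is a simplified illustration, not the actual *scikit-learn* implementation; the helper names `sketch_fit` and `sketch_predict` are hypothetical and exist only in this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


def sketch_fit(steps, X, y):
    """Simplified stand-in for Pipeline.fit (hypothetical helper)."""
    X_transformed = X
    # all steps but the last: fit, then transform and pass the result on
    for name, estimator in steps[:-1]:
        X_transformed = estimator.fit(X_transformed, y).transform(X_transformed)
    # the last step is only fit
    steps[-1][1].fit(X_transformed, y)
    return steps


def sketch_predict(steps, X):
    """Simplified stand-in for Pipeline.predict (hypothetical helper)."""
    X_transformed = X
    # all steps but the last only transform at prediction time
    for name, estimator in steps[:-1]:
        X_transformed = estimator.transform(X_transformed)
    return steps[-1][1].predict(X_transformed)


bc = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(bc.data, bc.target, random_state=0)

steps = [("scaler", MinMaxScaler()), ("knn", KNeighborsClassifier(n_neighbors=5))]
sketch_fit(steps, X_train, y_train)

# compare against the real Pipeline, built from fresh estimator instances
pipe = Pipeline([("scaler", MinMaxScaler()), ("knn", KNeighborsClassifier(n_neighbors=5))])
pipe.fit(X_train, y_train)

sketch_matches = np.array_equal(sketch_predict(steps, X_test), pipe.predict(X_test))
print("Sketch agrees with Pipeline:", sketch_matches)
```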

Here is an illustration of the pipeline process.

<div>
<img src="attachment:dcd49abb-7716-4870-9dc3-f96d137fb952.png" width="600"/>
</div>

The pipeline is actually even more general than this. There is no requirement for the last step in a pipeline to have a predict function, and we could create a pipeline just containing, for example, a scaler and PCA. Then, **because the last step (PCA) has a transform method, we could call transform on the pipeline to get the output of *PCA.transform* applied to the data that was processed by the previous step. The last step of a pipeline is only required to have a fit method.**
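
A minimal sketch of such a transform-only pipeline follows. The choice of *StandardScaler* as the scaler, the two components, and the step names "scaler" and "pca" are assumptions made for this example.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

bc = datasets.load_breast_cancer()

# no classifier or regressor at the end: the last step is a transformer
pipe = Pipeline([("scaler", StandardScaler()), ("pca", PCA(n_components=2))])
pipe.fit(bc.data)

# transform scales the data and then projects it onto the two components
X_pca = pipe.transform(bc.data)
print("Original shape:", bc.data.shape)  # (569, 30)
print("Reduced shape:", X_pca.shape)     # (569, 2)
```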

Creating a pipeline using the syntax described earlier is sometimes a bit cumbersome, and we often don’t need user-specified names for each step. There is a convenience function, *make_pipeline*, that will create a pipeline for us and automatically name each step based on its class. We already used this in the Lecture 4 notebook. The syntax for *make_pipeline* is as follows:

In [14]:
from sklearn.pipeline import make_pipeline
# standard syntax
pipe_long = Pipeline([("scaler", MinMaxScaler()), ("knn", KNeighborsClassifier(n_neighbors=5))])
# abbreviated syntax
pipe_short = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))

The pipeline objects pipe_long and pipe_short do exactly the same thing, but pipe_short has steps that were automatically named. We can see the names of the steps by looking at the steps attribute:

In [15]:
pipe_short.steps

[('minmaxscaler', MinMaxScaler()),
 ('kneighborsclassifier', KNeighborsClassifier())]

The steps are named *minmaxscaler* and *kneighborsclassifier*. In general, the step names are just lowercase versions of the class names. If multiple steps have the same class, a number is appended.

Often we will want to inspect attributes of one of the steps of the pipeline, say, the coefficients of a linear model or the components extracted by PCA. The easiest way to access the steps in a pipeline is via the *named_steps* attribute, which is a dictionary from the step names to the estimators:

In [14]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipe = make_pipeline(StandardScaler(), PCA(n_components=2), StandardScaler())
print("Pipeline steps:\n{}".format(pipe.steps))

# fit the pipeline defined before to the cancer dataset
pipe.fit(bc.data)
# extract the first two principal components from the "pca" step
components = pipe.named_steps["pca"].components_
print("components.shape: {}".format(components.shape))

Pipeline steps:
[('standardscaler-1', StandardScaler()), ('pca', PCA(n_components=2)), ('standardscaler-2', StandardScaler())]


Pipeline(steps=[('standardscaler-1', StandardScaler()),
                ('pca', PCA(n_components=2)),
                ('standardscaler-2', StandardScaler())])

components.shape: (2, 30)


As we discussed earlier, one of the main reasons to use pipelines is for doing grid searches. A common task is to access some of the steps of a pipeline inside a grid search. Let’s grid search a *LogisticRegression* classifier on the cancer dataset, using *Pipeline* and *StandardScaler* to scale the data before passing it to the *LogisticRegression* classifier. First we create a pipeline using the *make_pipeline* function:

In [15]:
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(StandardScaler(), LogisticRegression(solver="liblinear"))

Next, we create a parameter grid. As explained in Lecture 2, the regularization parameter to tune for LogisticRegression is the parameter *C*. We use a logarithmic grid for this parameter, searching between 0.01 and 100. Because we used the *make_pipeline* function, the name of the *LogisticRegression* step in the pipeline is the lowercased class name, *logisticregression*. To tune the parameter *C*, we therefore have to specify a parameter grid for *logisticregression__C*:

In [16]:
param_grid = {'logisticregression__C': [0.01, 0.1, 1, 10, 100]}

In [17]:
X_train, X_test, y_train, y_test = train_test_split(bc.data, bc.target, random_state=0)
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('standardscaler', StandardScaler()),
                                       ('logisticregression',
                                        LogisticRegression(solver='liblinear'))]),
             param_grid={'logisticregression__C': [0.01, 0.1, 1, 10, 100]})

We can access the best model found by *GridSearchCV*, trained on all the training data, with the attribute *grid.best_estimator_*:

In [18]:
grid.best_estimator_.named_steps["logisticregression"]

LogisticRegression(C=0.1, solver='liblinear')

Now that we have the trained *LogisticRegression* instance, we can access the coefficients (weights) associated with each input feature:

In [19]:
grid.best_estimator_.named_steps["logisticregression"].coef_

array([[-0.34544958, -0.38444935, -0.33988553, -0.3505656 , -0.16913186,
        -0.03157163, -0.32665637, -0.42678968, -0.2045492 ,  0.16941073,
        -0.53273967, -0.00933038, -0.43498257, -0.40182175,  0.07447853,
         0.24949134,  0.0994215 , -0.07708802,  0.09495805,  0.26302164,
        -0.49111303, -0.48321499, -0.46467249, -0.45726692, -0.32629663,
        -0.16551265, -0.37312226, -0.48736181, -0.36902284, -0.18518844]])
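
The raw array is easier to read when each coefficient is paired with its feature name. The sketch below repeats the grid search from the cells above and prints the five coefficients with the largest absolute value; the choice of five and the names `coefs` and `order` are made here for illustration. Because the features were standardized inside the pipeline, the magnitudes are on a comparable scale.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

bc = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(bc.data, bc.target, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(solver="liblinear"))
param_grid = {'logisticregression__C': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

# coef_ has shape (1, n_features) for a binary problem; take the single row
coefs = grid.best_estimator_.named_steps["logisticregression"].coef_[0]

# indices of the coefficients, sorted by decreasing absolute value
order = np.argsort(np.abs(coefs))[::-1]
for i in order[:5]:
    print("{:<25s} {: .3f}".format(bc.feature_names[i], coefs[i]))
```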

### Grid-Searching Which Model To Use

We can even go further in combining *GridSearchCV* and *Pipeline*: it is also possible to search over the actual steps being performed in the pipeline (say whether to use *StandardScaler* or *MinMaxScaler*). This leads to an even bigger search space and should be considered carefully. Trying all possible solutions is usually not a viable machine learning strategy. However, here is an example comparing a *DecisionTreeClassifier* and a *KNeighborsClassifier* on the breast cancer dataset. We know that the *KNeighborsClassifier* needs the data to be scaled, so we also search over whether to use *MinMaxScaler* or no preprocessing. For the *DecisionTreeClassifier*, we know that no preprocessing is necessary. 

We start by defining the pipeline. Here, we explicitly name the steps. We want two steps, one for the preprocessing and then a classifier. We can instantiate this using *KNeighborsClassifier* and *MinMaxScaler*:

In [20]:
pipe = Pipeline([('preprocessing', MinMaxScaler()), ('classifier', KNeighborsClassifier())])

Now we can define the *param_grid* to search over. We want the classifier to be either a *DecisionTreeClassifier* or a *KNeighborsClassifier*. Because they have different parameters to tune, and need different preprocessing, we can make use of a list of search grids. To assign an estimator to a step, we use the name of the step as the parameter name. When we want to skip a step in the pipeline (for example, because we don't need preprocessing for the *DecisionTreeClassifier*), we can set that step to *None*. Note that *GridSearchCV* allows the *param_grid* to be a list of dictionaries. Each dictionary in the list is expanded into an independent grid. 

In [21]:
from sklearn.tree import DecisionTreeClassifier

param_grid = [
    {'classifier': [KNeighborsClassifier()],
     'preprocessing': [MinMaxScaler(), None],
     'classifier__n_neighbors': range(1, 20, 2),
     'classifier__weights': ['uniform', 'distance']},
    {'classifier': [DecisionTreeClassifier()],
     'preprocessing': [None],
     'classifier__max_depth': range(2, 10),
     'classifier__criterion': ["gini", "entropy"]}]
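
To see how large this search space is, the list of dictionaries can be expanded with *ParameterGrid* from *sklearn.model_selection*. The sketch below rebuilds the grid from the cell above and counts the candidates in each dictionary; the names `n_knn`, `n_tree`, and `n_total` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

param_grid = [
    {'classifier': [KNeighborsClassifier()],
     'preprocessing': [MinMaxScaler(), None],
     'classifier__n_neighbors': range(1, 20, 2),
     'classifier__weights': ['uniform', 'distance']},
    {'classifier': [DecisionTreeClassifier()],
     'preprocessing': [None],
     'classifier__max_depth': range(2, 10),
     'classifier__criterion': ["gini", "entropy"]}]

# each dictionary is expanded independently into all combinations of its values
n_knn = len(ParameterGrid(param_grid[0]))   # 1 * 2 * 10 * 2 = 40
n_tree = len(ParameterGrid(param_grid[1]))  # 1 * 1 * 8 * 2 = 16
n_total = len(ParameterGrid(param_grid))    # 40 + 16 = 56

print("KNN candidates:", n_knn)
print("Decision tree candidates:", n_tree)
print("Total candidates:", n_total)
```

With `cv=5`, each of these 56 candidates is fit five times, which is why a search over whole pipeline steps should be set up with care.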

Now we can instantiate and run the grid search on the breast_cancer dataset:

In [22]:
X_train, X_test, y_train, y_test = train_test_split(bc.data, bc.target, random_state=0)

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best params:\n{}\n".format(grid.best_params_))
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Test-set score: {:.2f}".format(grid.score(X_test, y_test)))

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessing', MinMaxScaler()),
                                       ('classifier', KNeighborsClassifier())]),
             param_grid=[{'classifier': [KNeighborsClassifier(n_neighbors=15,
                                                              weights='distance')],
                          'classifier__n_neighbors': range(1, 20, 2),
                          'classifier__weights': ['uniform', 'distance'],
                          'preprocessing': [MinMaxScaler(), None]},
                         {'classifier': [DecisionTreeClassifier()],
                          'classifier__criterion': ['gini', 'entropy'],
                          'classifier__max_depth': range(2, 10),
                          'preprocessing': [None]}])

Best params:
{'classifier': KNeighborsClassifier(n_neighbors=15, weights='distance'), 'classifier__n_neighbors': 15, 'classifier__weights': 'distance', 'preprocessing': MinMaxScaler()}

Best cross-validation score: 0.97
Test-set score: 0.97


The outcome of the grid search is that *KNeighborsClassifier* with *MinMaxScaler* preprocessing, *n_neighbors=15*, *weights='distance'* gave the best result.