# Building Pipelines

The Pipeline class allows “gluing” together multiple processing steps into a single
scikit-learn estimator. The Pipeline class itself has fit, predict, and score methods
and behaves just like any other model in scikit-learn. Its most common use case is
chaining preprocessing steps (like scaling of the data) together with a
supervised model like a classifier.

Let’s look at how we can use the Pipeline class to express the workflow for training
an SVM after scaling the data with MinMaxScaler (for now without the grid search).

In [4]:
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Load and split the data
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

First, we build a pipeline object by providing it with a list of steps. Each step is a tuple
containing a name (any string of your choosing) and an instance of an estimator:

In [5]:
from sklearn.pipeline import Pipeline
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

Here, we created two steps: the first, called "scaler", is an instance of MinMaxScaler,
and the second, called "svm", is an instance of SVC. Now, we can fit the pipeline, like
any other scikit-learn estimator:

In [6]:
pipe.fit(X_train, y_train)

Pipeline(steps=[('scaler', MinMaxScaler()), ('svm', SVC())])

pipe.fit first calls fit on the first step (the scaler), then transforms the training
data using the scaler, and finally fits the SVM with the scaled data.

In [7]:
print("Test score: {:.2f}".format(pipe.score(X_test, y_test)))

Test score: 0.97


The score method on the pipeline first transforms the test data using the
scaler, and then calls the score method on the SVM using the scaled test data.
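
As a rough sketch of what the fit and score calls do, the same workflow can be written out by hand. The variable names scaler, svm, manual_score, and pipe_score are our own; the snippet only uses the estimators already introduced and should give the same test score as the pipeline.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# By hand: fit the scaler on the training data, scale it, then fit the SVM
scaler = MinMaxScaler().fit(X_train)
svm = SVC().fit(scaler.transform(X_train), y_train)
# By hand: scale the test data with the already-fitted scaler, then score
manual_score = svm.score(scaler.transform(X_test), y_test)

# The same two stages expressed as a pipeline
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
pipe.fit(X_train, y_train)
pipe_score = pipe.score(X_test, y_test)

print("Manual score:   {:.2f}".format(manual_score))
print("Pipeline score: {:.2f}".format(pipe_score))
```

Note that the test data is only ever passed to transform, never to fit, in both versions.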

Using the pipeline, we reduced the code needed for our “preprocessing + classification”
process to a single fit call and a single score call.

# Using Pipelines in Grid Searches

Using a pipeline in a grid search works the same way as using any other estimator. We
define a hyperparameter grid to search over, and construct a GridSearchCV from the pipeline
and the hyperparameter grid.

When specifying the hyperparameter grid, we need to state for each hyperparameter which
step of the pipeline it belongs to. Both hyperparameters that we want to adjust, C and gamma,
are hyperparameters of SVC, the second step, which we named "svm". The syntax to define a
hyperparameter grid for a pipeline is to specify for each hyperparameter the step name,
followed by __ (a double underscore), followed by the hyperparameter name.
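
One way to see which names are valid is to look at the keys returned by get_params, which every scikit-learn estimator provides. The following is a minimal sketch, assuming the same two-step pipeline as above; the variable name svm_params is our own.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

# Parameters of a step show up as "<step name>__<parameter name>"
svm_params = sorted(k for k in pipe.get_params() if k.startswith("svm__"))
print("Some SVC parameters: {}".format(svm_params[:5]))

# The same names work with set_params, which is what a grid search relies on
pipe.set_params(svm__C=10, svm__gamma=0.1)
print("C={}, gamma={}".format(pipe.named_steps["svm"].C,
                              pipe.named_steps["svm"].gamma))
```

Any key listed this way can be used in the hyperparameter grid.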

In [8]:
param_grid = {'svm__C': [0.001, 0.01, 0.1, 1, 10, 100],
              'svm__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

In [9]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best cross-validation accuracy: {:.2f}".format(grid.best_score_))
print("Test set score: {:.2f}".format(grid.score(X_test, y_test)))
print("Best parameters: {}".format(grid.best_params_))

Best cross-validation accuracy: 0.98
Test set score: 0.97
Best parameters: {'svm__C': 1, 'svm__gamma': 1}


For each split in the cross-validation, the MinMaxScaler is refit using only the
training part of that split, so no information leaks from the held-out part into the
parameter search.
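
To make this concrete, the sketch below imitates a single parameter setting of the search with an explicit loop: for each fold, a fresh copy of the pipeline is fit on the training indices only, and we check that the scaler's learned minimum matches the minimum of exactly those rows. This is an illustration, not the GridSearchCV implementation; the names fold_pipe and fold_checks are our own, and we assume stratified five-fold splitting.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

fold_checks = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X, y):
    # Fit an unfitted copy of the whole pipeline on the training rows only
    fold_pipe = clone(pipe).fit(X[train_idx], y[train_idx])
    scaler = fold_pipe.named_steps["scaler"]
    # The scaler's statistics come from the training rows of this fold alone
    fold_checks.append(
        np.allclose(scaler.data_min_, X[train_idx].min(axis=0)))

print("Scaler fit on training rows only, per fold: {}".format(fold_checks))
```

If we had scaled all of X before splitting, the scaler's statistics would also depend on the held-out rows of each fold.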

# General Pipeline Interface

The only requirement for estimators in a pipeline is that all but the last step need to
have a transform method, so they can produce a new representation of the data that
can be used in the next step.
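
This means that our own classes can serve as intermediate steps, as long as they provide fit and transform. The following is a minimal sketch; Log1pTransformer is a hypothetical class written for this example, not part of scikit-learn, and it assumes non-negative input features (which holds for the cancer dataset).

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class Log1pTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that applies log(1 + x) to every feature."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so calls can be chained
        return self

    def transform(self, X):
        return np.log1p(X)


X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([("log", Log1pTransformer()),
                 ("scaler", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
pipe.fit(X, y)
train_accuracy = pipe.score(X, y)
print("Training accuracy: {:.2f}".format(train_accuracy))
```

TransformerMixin supplies fit_transform based on the fit and transform methods we defined.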

Internally, Pipeline.fit and Pipeline.predict work roughly as follows (a simplified
sketch, not the actual implementation):

In [10]:
def fit(self, X, y):
    X_transformed = X
    for name, estimator in self.steps[:-1]:
        # Iterate over all but the final step
        # Fit and transform the data
        X_transformed = estimator.fit_transform(X_transformed, y)
        
    # Fit the last step
    self.steps[-1][1].fit(X_transformed, y)
    
    return self

In [11]:
def predict(self, X):
    X_transformed = X
    for step in self.steps[:-1]:
        # Iterate over all but the final step
        # Transform the data
        X_transformed = step[1].transform(X_transformed)
        
    # Predict using the last step
    return self.steps[-1][1].predict(X_transformed)
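
To try these two methods out, we can wrap them in a small class and compare its predictions with those of the real Pipeline. SimplePipeline is a hypothetical name used only for this illustration; it leaves out everything the real class does beyond fit and predict.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC


class SimplePipeline:
    """Hypothetical, stripped-down stand-in for Pipeline (illustration only)."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) tuples

    def fit(self, X, y):
        X_transformed = X
        for name, estimator in self.steps[:-1]:
            # Fit and transform with every step except the last
            X_transformed = estimator.fit_transform(X_transformed, y)
        # Fit the last step on the transformed data
        self.steps[-1][1].fit(X_transformed, y)
        return self

    def predict(self, X):
        X_transformed = X
        for name, estimator in self.steps[:-1]:
            # Only transform here; nothing is refit at prediction time
            X_transformed = estimator.transform(X_transformed)
        return self.steps[-1][1].predict(X_transformed)


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

simple = SimplePipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
real = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

simple_pred = simple.fit(X_train, y_train).predict(X_test)
real_pred = real.fit(X_train, y_train).predict(X_test)
predictions_match = np.array_equal(simple_pred, real_pred)
print("Same predictions as Pipeline: {}".format(predictions_match))
```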

### make_pipeline

The convenience function make_pipeline creates a pipeline for us and automatically
names each step based on its class:

In [12]:
from sklearn.pipeline import make_pipeline
# Standard syntax
pipe_long = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC(C=100))])

# Abbreviated syntax
pipe_short = make_pipeline(MinMaxScaler(), SVC(C=100))

pipe_short has steps that were automatically named. We can see the names of the
steps by looking at the steps attribute:

In [13]:
print("Pipeline steps:\n{}".format(pipe_short.steps))

Pipeline steps:
[('minmaxscaler', MinMaxScaler()), ('svc', SVC(C=100))]
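
Because the step names double as parameter prefixes, the automatically generated names matter as soon as we set or search parameters. As a quick sketch with the two pipelines from above (the variable names long_names, short_names, c_long, and c_short are our own):

```python
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

pipe_long = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC(C=100))])
pipe_short = make_pipeline(MinMaxScaler(), SVC(C=100))

long_names = [name for name, _ in pipe_long.steps]
short_names = [name for name, _ in pipe_short.steps]
print("Names chosen by us:      {}".format(long_names))
print("Names chosen by default: {}".format(short_names))

# The parameter prefix follows the step name, not the class
c_long = pipe_long.get_params()["svm__C"]
c_short = pipe_short.get_params()["svc__C"]
print("C via svm__C: {}, C via svc__C: {}".format(c_long, c_short))
```

A grid written for pipe_long (with keys like svm__C) would therefore need its keys renamed before it could be used with pipe_short.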


### Accessing Step Attributes

Often you will want to inspect attributes of one of the steps of the pipeline—say, the
coefficients of a linear model or the components extracted by PCA. The easiest way to
access the steps in a pipeline is via the named_steps attribute, which is a dictionary
from the step names to the estimators:

In [14]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipe = make_pipeline(StandardScaler(), PCA(n_components=2), StandardScaler())
print("Pipeline steps:\n{}".format(pipe.steps))

# Fit the pipeline defined before to the cancer dataset
pipe.fit(cancer.data)

# Extract the first two principal components from the "pca" step
components = pipe.named_steps["pca"].components_

print("components.shape: {}".format(components.shape))

Pipeline steps:
[('standardscaler-1', StandardScaler()), ('pca', PCA(n_components=2)), ('standardscaler-2', StandardScaler())]
components.shape: (2, 30)
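
Besides the dictionary-style lookup, a step can also be reached in a few other ways. The sketch below assumes a reasonably recent scikit-learn release, in which a Pipeline can be indexed by position or by step name and named_steps allows attribute access; all four expressions should return the same fitted PCA object.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), StandardScaler())
pipe.fit(cancer.data)

by_name = pipe.named_steps["pca"]   # dictionary-style lookup
by_attr = pipe.named_steps.pca      # attribute-style lookup
by_key = pipe["pca"]                # indexing the pipeline by step name
by_pos = pipe[1]                    # indexing the pipeline by position

print("components.shape: {}".format(by_pos.components_.shape))
```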


### Accessing Attributes in a Grid-Searched Pipeline

Let’s grid search a LogisticRegression classifier on the cancer dataset, using a
pipeline in which StandardScaler scales the data before it is passed to the classifier.

In [18]:
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

Because we used the
make_pipeline function, the name of the LogisticRegression step in the pipeline is
the lowercased class name, logisticregression. To tune the hyperparameter C, we therefore
have to specify a hyperparameter grid for logisticregression__C:

In [19]:
param_grid = {'logisticregression__C': [0.01, 0.1, 1, 10, 100]}

X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=4)

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('standardscaler', StandardScaler()),
                                       ('logisticregression',
                                        LogisticRegression(max_iter=1000))]),
             param_grid={'logisticregression__C': [0.01, 0.1, 1, 10, 100]})

The best model found by
GridSearchCV, trained on all the training data, is stored in grid.best_estimator_:

In [20]:
print("Best estimator:\n{}".format(grid.best_estimator_))

Best estimator:
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression(C=1, max_iter=1000))])


This best_estimator_ in our case is a pipeline with two steps, standardscaler and
logisticregression. To access the logisticregression step, we can use the
named_steps attribute of the pipeline, as explained earlier:

In [21]:
print("Logistic regression step:\n{}".format(grid.best_estimator_.named_steps["logisticregression"]))

Logistic regression step:
LogisticRegression(C=1, max_iter=1000)


In [22]:
print("Logistic regression coefficients:\n{}".format(grid.best_estimator_.named_steps["logisticregression"].coef_))

Logistic regression coefficients:
[[-0.43570655 -0.34266946 -0.40809443 -0.5344574  -0.14971847  0.61034122
  -0.72634347 -0.78538827  0.03886087  0.27497198 -1.29780109  0.04926005
  -0.67336941 -0.93447426 -0.13939555  0.45032641 -0.13009864 -0.10144273
   0.43432027  0.71596578 -1.09068862 -1.09463976 -0.85183755 -1.06406198
  -0.74316099  0.07252425 -0.82323903 -0.65321239 -0.64379499 -0.42026013]]
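
The raw coefficient array is hard to read on its own. As a small sketch, we can pair each coefficient with the corresponding entry in cancer.feature_names and list the largest ones by absolute value. The variable names coefs, order, and top are our own, and the snippet repeats the grid search so that it runs by itself.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=4)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {'logisticregression__C': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

# coef_ has shape (1, n_features) for a binary problem; take the single row
coefs = grid.best_estimator_.named_steps["logisticregression"].coef_[0]

# Sort feature indices by absolute coefficient, largest first
order = np.argsort(np.abs(coefs))[::-1]
top = [(cancer.feature_names[i], coefs[i]) for i in order[:5]]

for name, value in top:
    print("{:<25s} {: .3f}".format(name, value))
```

Since the features were standardized before reaching the classifier, the coefficient magnitudes are on a comparable scale.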
