# The General Pipeline Interface

The Pipeline class is not restricted to preprocessing and classification, but can in fact join any number of estimators together. For example, you could build a pipeline containing feature extraction, feature selection, scaling, and classification, for a total of four steps. Similarly, the last step could be regression or clustering instead of classification. 

The only requirement for estimators in a pipeline is that all but the last step need to have a transform method, so they can produce a new representation of the data that can be used in the next step. 
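To make the role of the transform method concrete, here is a minimal sketch of what a pipeline does during fit and predict, written out by hand. The three-step pipeline, the step names, the choice of SelectKBest with k=10, and max_iter=1000 are illustrative assumptions, not part of the original example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Illustrative pipeline: two transformers followed by a final estimator.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

# Roughly the same thing by hand: every step except the last is fit and then
# used to transform the data, and the result is passed to the next step.
manual_steps = [StandardScaler(), SelectKBest(f_classif, k=10)]
X_t = X
for step in manual_steps:
    X_t = step.fit_transform(X_t, y)
manual_clf = LogisticRegression(max_iter=1000).fit(X_t, y)

# At prediction time only transform is called on the intermediate steps.
X_new = X
for step in manual_steps:
    X_new = step.transform(X_new)
manual_pred = manual_clf.predict(X_new)

same_predictions = bool(np.array_equal(manual_pred, pipe.predict(X)))
print("Manual and pipeline predictions agree:", same_predictions)
```

The last step only needs fit (and predict, if you want predictions), which is why it can be a classifier, a regressor, or a clustering model.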


# Convenient Pipeline Creation with make_pipeline

Creating a pipeline using the syntax described earlier is sometimes a bit cumbersome, and we often don’t need user-specified names for each step. There is a convenience function, make_pipeline, that will create a pipeline for us and automatically name each step based on its class. The syntax for make_pipeline is as follows:


In [5]:
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
# standard syntax
pipe_long = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC(C=100))])
# abbreviated syntax
pipe_short = make_pipeline(MinMaxScaler(), SVC(C=100))

The pipeline objects pipe_long and pipe_short do exactly the same thing, but pipe_short has steps that were automatically named. We can see the names of the steps by looking at the steps attribute:


In [6]:
print("Pipeline steps: \n{}".format(pipe_short.steps))

Pipeline steps: 
[('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('svc', SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]


The steps are named minmaxscaler and svc. In general, the step names are just lowercase versions of the class names. If multiple steps have the same class, a number is appended:


In [7]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA 

pipe=make_pipeline(StandardScaler(),PCA(n_components=2),StandardScaler())
print("Pipeline steps: \n{}".format(pipe.steps))

Pipeline steps: 
[('standardscaler-1', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('standardscaler-2', StandardScaler(copy=True, with_mean=True, with_std=True))]


As you can see, the first StandardScaler step was named standardscaler-1 and the second standardscaler-2. However, in such settings it might be better to use the Pipeline construction with explicit names, to give more semantic names to each step. 
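As a sketch of that alternative, the same three steps could be built with the Pipeline constructor and explicit names. The names scale_inputs and scale_components are hypothetical, chosen here only for illustration.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same steps as the make_pipeline example, but with hand-picked
# (hypothetical) names that say what each scaler is for.
pipe_named = Pipeline([
    ("scale_inputs", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("scale_components", StandardScaler()),
])

step_names = [name for name, _ in pipe_named.steps]
print("Step names: {}".format(step_names))
```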

# Accessing Step Attributes 

Often you will want to inspect attributes of one of the steps of the pipeline—say, the coefficients of a linear model or the components extracted by PCA. The easiest way to access the steps in a pipeline is via the named_steps attribute, which is a dictionary from the step names to the estimators:


In [8]:
# fit the pipeline defined before to the cancer dataset
from sklearn.datasets import load_breast_cancer
cancer=load_breast_cancer()

pipe.fit(cancer.data)
# extract the first two principal components from the pca step
components=pipe.named_steps["pca"].components_
print("components.shape: {}".format(components.shape))

components.shape: (2, 30)
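If you prefer to address a step by position rather than by name, the steps attribute shown earlier can be indexed as well, since it is a list of (name, estimator) tuples. The following sketch rebuilds the pipeline so that it is self-contained, and checks that both routes return the same fitted PCA object.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), StandardScaler())
pipe.fit(cancer.data)

# Access by name ...
pca_by_name = pipe.named_steps["pca"]
# ... or by position: steps is a list of (name, estimator) tuples.
step_name, pca_by_position = pipe.steps[1]

print("Step at index 1: {}".format(step_name))
print("Same object: {}".format(pca_by_name is pca_by_position))
```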


# Accessing Attributes in a Grid-Searched Pipeline

As discussed earlier in this tutorial, one of the main reasons to use pipelines is to run grid searches. A common task is to access some of the steps of a pipeline inside a grid search. Let’s grid search a LogisticRegression classifier on the cancer dataset, using a Pipeline with a StandardScaler to scale the data before passing it to the LogisticRegression classifier. First we create a pipeline using the make_pipeline function:


In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
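# GridSearchCV lives in sklearn.model_selection (sklearn.grid_search was removed in scikit-learn 0.20)
from sklearn.model_selection import GridSearchCV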

pipe=make_pipeline(StandardScaler(),LogisticRegression())

Next, we create a parameter grid. As explained in Supervised learning, the regularization parameter to tune for LogisticRegression is the parameter C. We use a logarithmic grid for this parameter, with values ranging from 0.01 to 100 (the grid used here omits C=1). Because we used the make_pipeline function, the name of the LogisticRegression step in the pipeline is the lowercased class name, logisticregression. To tune the parameter C, we therefore have to specify a parameter grid for logisticregression__C:


In [14]:
param_grid={'logisticregression__C':[0.01,0.1,10,100]}

# split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=4)

grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)

# fit the grid search on the training data
grid.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'logisticregression__C': [0.01, 0.1, 10, 100]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
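The key logisticregression__C follows the stepname__parameter convention. As a small sketch, the valid keys for a pipeline can be listed with get_params, which is handy when you are unsure what name make_pipeline assigned to a step. The variable names param_names and tunable_C are only illustrative.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# get_params returns every settable parameter, including the nested
# "<step name>__<parameter>" keys that a parameter grid can refer to.
param_names = sorted(pipe.get_params().keys())

# Keep only the keys that end in "__C" to find the regularization parameter.
tunable_C = [name for name in param_names if name.endswith("__C")]
print(tunable_C)  # ['logisticregression__C']
```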

So how do we access the coefficients of the best LogisticRegression model that was found by GridSearchCV? We know that the best model found by GridSearchCV, trained on all the training data, is stored in grid.best_estimator_:


In [15]:
print('Best estimator: \n{}'.format(grid.best_estimator_))

Best estimator: 
Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])


This best_estimator_ in our case is a pipeline with two steps, standardscaler and logisticregression. To access the logisticregression step, we can use the named_steps attribute of the pipeline, as explained earlier:


In [18]:
print("Logistic regression step: \n{}".format(grid.best_estimator_.named_steps["logisticregression"]))

Logistic regression step: 
LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)


Now that we have the trained LogisticRegression instance, we can access the coefficients (weights) associated with each input feature:


In [19]:
print("Logistic regression coefficients: \n{}".format(grid.best_estimator_.named_steps["logisticregression"].coef_))

Logistic regression coefficients: 
[[-0.38856355 -0.37529972 -0.37624793 -0.39649439 -0.11519359  0.01709608
  -0.3550729  -0.38995414 -0.05780518  0.20879795 -0.49487753 -0.0036321
  -0.37122718 -0.38337777 -0.04488715  0.19752816  0.00424822 -0.04857196
   0.21023226  0.22444999 -0.54669761 -0.52542026 -0.49881157 -0.51451071
  -0.39256847 -0.12293451 -0.38827425 -0.4169485  -0.32533663 -0.13926972]]


This might be a somewhat lengthy expression, but it often comes in handy for understanding your models.
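One way to make the raw coefficient array easier to read is to pair each weight with its feature name. The sketch below repeats the grid search so that it runs on its own. Showing only the five largest weights by absolute value, and setting max_iter=1000 to avoid convergence warnings, are assumptions made for this illustration, so the numbers may differ slightly from the output above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=4)

# max_iter=1000 is an assumption for this sketch, not part of the original run.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 10, 100]}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

# coef_ has shape (1, n_features) for binary classification; take the first row.
coef = grid.best_estimator_.named_steps["logisticregression"].coef_[0]

# Sort feature indices by absolute weight, largest first, and keep the top five.
order = np.argsort(np.abs(coef))[::-1]
top_features = [(str(cancer.feature_names[i]), float(coef[i])) for i in order[:5]]

for name, weight in top_features:
    print("{:25s} {: .3f}".format(name, weight))
```

Because the features were standardized before fitting, the magnitudes of the weights are at least roughly comparable across features.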
