Import necessary libraries

In [1]:
from IPython.display import display, Math, Latex

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="whitegrid")

#### **Chaining Transformers**

* The preprocessing transformations are applied one after another on the input feature matrix.

* It is important to apply exactly the same transformations, in the same order, to the training, evaluation and test sets.

* Failing to do so shifts the input distribution the model sees, which leads to incorrect predictions and therefore to a misleading performance evaluation.

* The `sklearn.pipeline` module provides utilities to build a composite estimator, as a chain of transformers and estimators.
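A minimal sketch of the points above, assuming a small made-up feature matrix (`X_train` and `X_test` are hypothetical toy arrays): the chain learns its imputation and scaling statistics from the training data only, then replays the same steps in the same order on the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; np.nan marks a missing value.
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
X_test = np.array([[np.nan, 20.0], [4.0, 40.0]])

preprocess = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Column means and standard deviations are learned from the training set only.
X_train_t = preprocess.fit_transform(X_train)

# The test set is transformed with the training statistics; nothing is refitted.
X_test_t = preprocess.transform(X_test)
```

Calling `fit_transform` on the test set instead would recompute the statistics from the test data, which is the kind of mismatch the bullets above warn about.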

## Pipeline

Sequentially apply a list of transformers and estimators.

* Intermediate steps of the pipeline must be transformers, i.e. they must implement the `fit` and `transform` methods.

* The final estimator only needs to implement `fit`.

The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters.
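As a rough illustration of cross-validating several steps together, the sketch below runs `cross_val_score` on a two-step pipeline. The synthetic data from `make_regression` and the step names are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, used only for illustration.
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=0)

model = Pipeline(steps=[
    ("scaler", StandardScaler()),       # intermediate step: a transformer
    ("regressor", LinearRegression()),  # final step: only needs fit
])

# In each fold the scaler is fitted on the training part only,
# so no information from the held-out part reaches the preprocessing.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```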

### **1. Creating Pipelines**

A pipeline can be created with `Pipeline()`.

It takes a list of `('estimator_name', estimator(...))` tuples. The pipeline object exposes the interface of its last step.

In [2]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

estimators = [
    ('simpleImputer' , SimpleImputer()),
    ('standardScaler' , StandardScaler()),
]

pipe = Pipeline(steps=estimators)

The same pipeline can also be created with the `make_pipeline()` helper function, which does not take step names and instead names each step after the lowercased class name of its estimator.

In [3]:
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(SimpleImputer(), StandardScaler())
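To see which names `make_pipeline()` assigned, one option is to read them back from `pipe.steps`. This is a small sketch; `step_names` is a variable introduced only for the example.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(SimpleImputer(), StandardScaler())

# Each entry of pipe.steps is a (name, estimator) tuple.
step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['simpleimputer', 'standardscaler']
```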

### **2. Accessing Individual Steps in a Pipeline**

In [4]:
from sklearn.decomposition import PCA
estimators = [
    ('simpleImputer', SimpleImputer()),
    ('pca', PCA()),
    ('regressor', LinearRegression())
]
pipe = Pipeline(steps=estimators)

Let's print the number of steps in this pipeline:

In [5]:
print(len(pipe.steps))

3


Let's look at each of the steps:

In [6]:
print(pipe.steps)

[('simpleImputer', SimpleImputer()), ('pca', PCA()), ('regressor', LinearRegression())]


Individual steps can be accessed in several ways: by attribute through `named_steps`, by position through `steps`, or by name through indexing:

In [7]:
print(pipe.named_steps.regressor)

LinearRegression()


In [8]:
pipe.steps[1]

('pca', PCA())

In [9]:
pipe['pca']

PCA()
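Besides the lookups above, a pipeline can also be indexed by position or sliced. The sketch below rebuilds the same three-step pipeline so that it runs on its own; the variable names are chosen for illustration.

```python
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[
    ("simpleImputer", SimpleImputer()),
    ("pca", PCA()),
    ("regressor", LinearRegression()),
])

second_step = pipe[1]      # the PCA estimator itself, without its name
last_step = pipe[-1]       # the final estimator
preprocessing = pipe[:-1]  # a new Pipeline holding every step except the last

print(type(preprocessing).__name__, len(preprocessing.steps))  # Pipeline 2
```

Slicing is handy when only the preprocessing part of a pipeline is needed, for example to inspect the transformed features.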

### **3. Accessing Parameters of a Step in a Pipeline**

Parameters of the estimators in a pipeline can be accessed using the `<step>__<parameter>` syntax; note the two underscores.

In [10]:
estimators = [
    ('simpleImputer', SimpleImputer()),
    ('pca', PCA()),
    ('regressor', LinearRegression())
]

pipe = Pipeline(steps=estimators)
pipe.set_params(pca__n_components=2)

Pipeline(steps=[('simpleImputer', SimpleImputer()),
                ('pca', PCA(n_components=2)),
                ('regressor', LinearRegression())])

In the above example, `n_components` of the `PCA()` step is set after the pipeline is created.
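The same double-underscore names can be used to read parameters back. The sketch below repeats the pipeline definition so that it is self-contained and then queries `get_params()`.

```python
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[
    ("simpleImputer", SimpleImputer()),
    ("pca", PCA()),
    ("regressor", LinearRegression()),
])
pipe.set_params(pca__n_components=2)

# get_params() returns a flat dict keyed by the same <step>__<parameter> names.
params = pipe.get_params()
n_components = params["pca__n_components"]
imputer_strategy = params["simpleImputer__strategy"]
print(n_components, imputer_strategy)  # 2 mean
```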


### **4. GridSearchCV with Pipeline**

Using the naming convention for nested parameters, a grid search can tune both the steps themselves and their parameters.

In [11]:
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

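# The step names must match the keys used in `param_grid`.
pipe = Pipeline(steps=[
    ('imputer', SimpleImputer()),
    ('clf', SVC())
])

param_grid = dict(
    imputer=['passthrough', SimpleImputer(), KNNImputer()],
    clf=[SVC(), LogisticRegression()],
    clf__C=[0.1, 1, 10, 100])  # note the double underscore: <step>__<parameter>

grid_search = GridSearchCV(pipe, param_grid=param_grid)

* `C` is the inverse of the regularization strength: the lower its value, the stronger the regularization.

* In the example above, `clf__C` provides a set of values of `C` for the grid search.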
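A runnable sketch of such a search, assuming synthetic data from `make_classification` and a pipeline whose step names (`imputer`, `clf`) match the grid keys. The grid is kept small and `max_iter=1000` is added only so that the example runs quickly and converges.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic classification data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

pipe = Pipeline(steps=[
    ("imputer", SimpleImputer()),
    ("clf", SVC()),
])

param_grid = {
    "imputer": ["passthrough", SimpleImputer()],        # swap or skip a whole step
    "clf": [SVC(), LogisticRegression(max_iter=1000)],  # swap the final estimator
    "clf__C": [0.1, 1, 10],                             # nested parameter of the `clf` step
}

grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid_search.fit(X, y)

print(grid_search.best_params_)
```

The grid contains 2 x 2 x 3 = 12 candidate configurations, each evaluated on 3 folds.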

### **Caching Transformers**

Transforming data is a computationally expensive step.

* For grid search, transformers need not be applied for every parameter configuration.

* They can be applied only once, and the transformed data can be reused.

* This can be achieved by setting the `memory` parameter of the `Pipeline` object.


In [12]:
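import tempfile
# `memory` expects a directory path (str) or a joblib.Memory object,
# so use mkdtemp(), which returns the path of a new temporary directory.
tempDirPath = tempfile.mkdtemp()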

In [13]:
estimators = [
    ('simpleImputer', SimpleImputer()),
    ('pca', PCA(2)),
    ('regressor', LinearRegression())
]

pipe = Pipeline(steps=estimators, memory=tempDirPath)
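A self-contained sketch of a cached pipeline being fitted and then cleaned up. The random data and the names `cache_dir` and `cached_pipe` are assumptions made for the example.

```python
import shutil
import tempfile

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Path (str) of a fresh temporary directory that will hold the cache.
cache_dir = tempfile.mkdtemp()

# Hypothetical random data, used only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0])

cached_pipe = Pipeline(
    steps=[("pca", PCA(n_components=2)), ("regressor", LinearRegression())],
    memory=cache_dir,
)
cached_pipe.fit(X, y)
predictions = cached_pipe.predict(X)

# With caching enabled, the transformers are cloned before fitting,
# so the fitted PCA should be inspected through the pipeline.
fitted_pca = cached_pipe.named_steps["pca"]

# Remove the cache directory once it is no longer needed.
shutil.rmtree(cache_dir)
```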

### **FeatureUnion**

Concatenates results of multiple transformer objects.

* Applies a list of transformer objects in parallel, and their outputs are concatenated side-by-side into a larger matrix.

* `FeatureUnion` and `Pipeline` can be used to create complex transformers.
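A small sketch of the side-by-side concatenation, assuming random input data. The choice of `PCA` and `StandardScaler` as the two branches is only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

# Hypothetical random data with 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))

union = FeatureUnion(transformer_list=[
    ("pca", PCA(n_components=2)),  # contributes 2 columns
    ("scaler", StandardScaler()),  # contributes 5 columns
])

# Each transformer sees the full input; the outputs are stacked column-wise.
X_union = union.fit_transform(X)
print(X_union.shape)  # (20, 7)
```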

### **5. Visualizing Pipelines**

In [14]:
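from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Numeric branch: keep the first four columns, then impute and scale them.
num_pipeline = Pipeline([
    ('selector', ColumnTransformer([(
        'select_first_4', 'passthrough', slice(0, 4))])),

    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

# Categorical branch: OneHotEncoder replaces LabelBinarizer here, because
# LabelBinarizer is meant for target labels and its fit_transform(y) signature
# does not accept the (X, y) call made by ColumnTransformer.
cat_pipeline = ColumnTransformer([
    ('one_hot_encoder', OneHotEncoder(), [4]),
])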

full_pipeline = FeatureUnion(transformer_list=[('num_pipeline', num_pipeline),
                                               ('cat_pipeline', cat_pipeline)
                                               ])


In [15]:
from sklearn import set_config
set_config(display='diagram')

# Displays the HTML representation in a Jupyter context
full_pipeline