# Pipeline

## Table of Contents

1. `Pipeline` class
1. `fit` and `transform` methods in pipeline

### 1. `Pipeline` class

`Pipeline` can be used to chain multiple transformers (and an optional final estimator) into one. This is useful as there is often a fixed sequence of steps in processing the data.

Load libraries.

In [6]:
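import sklearn as sk
from sklearn.pipeline import Pipeline
# `Imputer` was removed from `sklearn.preprocessing` in scikit-learn 0.22;
# `SimpleImputer` replaces it and is aliased here to keep the name used below.
from sklearn.impute import SimpleImputer as Imputer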
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

Build a `Pipeline` and store it in an object `pipe`. The `Pipeline` is built using a list of _(key, value)_ pairs, where the _key_ is a string containing the name you want to give this step and _value_ is an estimator/transformer object:

In [8]:
pipe = Pipeline([('imputer', Imputer(strategy='mean')),
                 ('scaler' , MinMaxScaler()),
                 ('reduce_dim', PCA()),
                 ('knn'    , KNeighborsClassifier())
                 ])
pipe

The output above shows that the four steps (three transformers followed by a classifier) are chained in the order we defined in the `Pipeline`. In addition, the default init parameters are displayed.

Note: All estimators in a pipeline, except the last one, must be transformers (i.e. must have a `transform` method). The last estimator may be any type (transformer, classifier, etc.).
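As a rough illustration of this rule, the sketch below deliberately places a classifier in the middle of a pipeline. The name `bad_pipe` is hypothetical. Depending on the scikit-learn version, the check runs either when the pipeline is constructed or when `fit` is called, so both calls sit inside the `try` block.

```python
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)

# Deliberately wrong: a classifier (no `transform` method) is used
# as an intermediate step instead of as the last step.
error_message = None
try:
    bad_pipe = Pipeline([('knn', KNeighborsClassifier()),
                         ('scaler', MinMaxScaler())])
    bad_pipe.fit(X, y)
except TypeError as exc:
    # scikit-learn rejects intermediate steps that cannot transform.
    error_message = str(exc)

print(error_message)
```

Swapping the two steps, so that the scaler comes first and the classifier last, gives a valid pipeline.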

The steps of a pipeline are stored as a list of _(name, estimator)_ tuples in the `steps` attribute:

In [12]:
pipe.steps[2]

The final estimator is also stored in the `steps` attribute.

In [14]:
pipe.steps[-1]

The steps of a pipeline can also be looked up by name through the dict-like `named_steps` attribute:

In [16]:
pipe.named_steps['reduce_dim']
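To make the relationship between `steps` and `named_steps` concrete, here is a minimal sketch on a smaller, hypothetical pipeline called `demo_pipe`. It suggests that both attributes give access to the very same estimator object.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# A small two-step pipeline used only for this illustration.
demo_pipe = Pipeline([('scaler', MinMaxScaler()),
                      ('reduce_dim', PCA())])

# Each entry of `steps` is a (name, estimator) tuple.
name, estimator = demo_pipe.steps[1]
print(name)

# `named_steps` returns the same object, looked up by its name.
same_object = demo_pipe.named_steps['reduce_dim'] is estimator
print(same_object)
```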

Parameters of the steps in the pipeline can be set using the `set_params` method with the `<step name>__<parameter>` syntax:

In [18]:
pipe.set_params(imputer__strategy='most_frequent')
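The same `<step name>__<parameter>` keys can be read back with `get_params`. The sketch below uses a hypothetical two-step pipeline, `demo_pipe`, and changes the `n_components` parameter of its `PCA` step. The value `10` is arbitrary.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# A small two-step pipeline used only for this illustration.
demo_pipe = Pipeline([('scaler', MinMaxScaler()),
                      ('reduce_dim', PCA())])

# Read the nested parameter before and after changing it.
before = demo_pipe.get_params()['reduce_dim__n_components']
demo_pipe.set_params(reduce_dim__n_components=10)
after = demo_pipe.get_params()['reduce_dim__n_components']
print(before, after)

# The change is applied to the step object itself.
step_value = demo_pipe.named_steps['reduce_dim'].n_components
print(step_value)
```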

This section introduced how to build a pipeline, how to access its steps, and how to set the parameters of those steps.

### 2. `fit` and `transform` methods in pipeline

In scikit-learn, transformers have a `fit` method, which learns model parameters from a training set, and a `transform` method, which applies the learned transformation to data, including unseen data. `fit_transform` does both in one call on the training data, which may be more convenient and efficient. Calling `fit` on the pipeline is the same as calling `fit` on each step in turn, transforming the input with that step, and passing the result on to the next step; the final estimator is only fit.
 - [Pipeline process](https://ibb.co/m4sz3o)
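The difference between `fit`, `transform`, and `fit_transform` can be sketched on a single transformer. The arrays `X_seen` and `X_unseen` are made-up toy data, not part of the digits example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data: one feature, three "training" rows and two "unseen" rows.
X_seen = np.array([[0.0], [5.0], [10.0]])
X_unseen = np.array([[2.5], [20.0]])

scaler = MinMaxScaler()
scaler.fit(X_seen)                          # learns min=0 and max=10
scaled_unseen = scaler.transform(X_unseen)  # → [[0.25], [2.0]]
print(scaled_unseen)

# `fit_transform` gives the same result as `fit` followed by `transform`.
two_calls = MinMaxScaler().fit(X_seen).transform(X_seen)
one_call = MinMaxScaler().fit_transform(X_seen)
print(np.allclose(two_calls, one_call))
```

Note that the unseen value `20.0` maps to `2.0`, outside the [0, 1] range, because the scaling parameters come only from the data passed to `fit`.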

Load libraries.

In [23]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

Prepare a sample dataset from the sklearn library, and store it in the `digits` object.

In [25]:
digits = load_digits()

Split the predictors (`digits.data`) and labels (`digits.target`) into train and test sets using the `train_test_split` function.

In [27]:
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=.2, random_state=0)

Fit the pipeline using `X_train` as training data and `y_train` as target values. The fitted parameters are stored on the steps of the `pipe` object itself.

In [29]:
pipe.fit(X_train, y_train)

The above pipeline process consists of several intermediate steps (`Imputer`, `MinMaxScaler`, `PCA`) and a final estimator (`KNeighborsClassifier`). When the `fit` method of the pipeline is called with the training data:
1. The `Imputer` object is fit to the training data
1. The `Imputer` object transforms the training data
1. The `MinMaxScaler` object is fit to the transformed training data from the `Imputer` object (previous step)
1. The `MinMaxScaler` object transforms the transformed training data from the `Imputer` object
1. The `PCA` object is fit to the transformed data from the `MinMaxScaler` object
1. The `PCA` object transforms the transformed data from the `MinMaxScaler` object
1. The `KNeighborsClassifier` object is fit to the transformed data from the `PCA` object and the training labels

That's all it does. In particular, the `KNeighborsClassifier` is only fit; it __does not transform__ the data (as a classifier, it has no `transform` method).
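The steps listed above can be sketched by hand and compared with the pipeline. This is a minimal sketch, not the library's internal code. It assumes a recent scikit-learn in which `SimpleImputer` (from `sklearn.impute`) stands in for `Imputer`, and the names `manual_pred` and `same_predictions` are introduced only for the comparison.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer  # assumed stand-in for `Imputer`
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=.2, random_state=0)

# Manual version: fit and transform with each transformer in turn,
# then fit the classifier on the fully transformed training data.
imputer = SimpleImputer(strategy='mean')
scaler = MinMaxScaler()
pca = PCA()
knn = KNeighborsClassifier()

Xt = imputer.fit_transform(X_train)
Xt = scaler.fit_transform(Xt)
Xt = pca.fit_transform(Xt)
knn.fit(Xt, y_train)

# Test data is only transformed (never re-fit) before prediction.
Xt_test = pca.transform(scaler.transform(imputer.transform(X_test)))
manual_pred = knn.predict(Xt_test)

# Pipeline version of the same sequence.
pipe = Pipeline([('imputer', SimpleImputer(strategy='mean')),
                 ('scaler', MinMaxScaler()),
                 ('reduce_dim', PCA()),
                 ('knn', KNeighborsClassifier())])
pipe.fit(X_train, y_train)

same_predictions = np.array_equal(manual_pred, pipe.predict(X_test))
print(same_predictions)
```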

Apply the pipeline transformations to the test data, and score with the final estimator.

In [32]:
print('Test accuracy: %.3f' % pipe.score(X_test, y_test))
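For a classifier as the final step, `pipe.score` should agree with computing the accuracy of `pipe.predict` by hand. The sketch below checks this under the assumption that `SimpleImputer` (from `sklearn.impute`) replaces `Imputer`. The names `pipeline_score` and `manual_score` are introduced only for this comparison.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer  # assumed stand-in for `Imputer`
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=.2, random_state=0)

pipe = Pipeline([('imputer', SimpleImputer(strategy='mean')),
                 ('scaler', MinMaxScaler()),
                 ('reduce_dim', PCA()),
                 ('knn', KNeighborsClassifier())])
pipe.fit(X_train, y_train)

# `score` transforms X_test with every fitted transformer, then
# delegates to the classifier's own `score` (mean accuracy).
pipeline_score = pipe.score(X_test, y_test)

# The same quantity computed from the pipeline's predictions.
manual_score = accuracy_score(y_test, pipe.predict(X_test))

print('Pipeline score: %.3f' % pipeline_score)
print('Manual accuracy: %.3f' % manual_score)
```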

The pipeline has all the methods that the last estimator in the pipeline has, i.e. if the last estimator is a classifier, the pipeline can be used as a classifier with all the methods that classifier has. If the last estimator is a transformer, again, so is the pipeline.
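A short sketch of this behaviour, using two hypothetical pipelines: `clf_pipe` ends in a classifier and `trf_pipe` ends in a transformer. The choice of `n_components=2` is arbitrary.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)

# Last step is a classifier, so the pipeline can predict.
clf_pipe = Pipeline([('scaler', MinMaxScaler()),
                     ('knn', KNeighborsClassifier())])
clf_pipe.fit(X, y)
predictions = clf_pipe.predict(X[:5])
print(predictions.shape)  # (5,)

# Last step is a transformer, so the pipeline can transform.
trf_pipe = Pipeline([('scaler', MinMaxScaler()),
                     ('reduce_dim', PCA(n_components=2))])
X_reduced = trf_pipe.fit_transform(X)
print(X_reduced.shape)  # (1797, 2)

# Methods the final step lacks are not exposed by the pipeline.
clf_has_transform = hasattr(clf_pipe, 'transform')
trf_has_predict = hasattr(trf_pipe, 'predict')
print(clf_has_transform, trf_has_predict)
```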

Below is the `Pipeline` command that originally created the object `pipe`.

In [35]:
pipe = Pipeline([('imputer', Imputer(strategy='mean')),
                 ('scaler' , MinMaxScaler()),
                 ('reduce_dim', PCA()),
                 ('knn'    , KNeighborsClassifier())
                 ])
pipe