# Pipelines

Pipelines are sequences of transformers with an optional estimator at the end.

The entire composite is:
- a _transformer_ if it does not contain an estimator
- an _estimator_ if the sequence ends with an estimator

If a pipeline is:
- a transformer, then it (the entire pipeline) has a `fit` method and a `transform` method
- an estimator, then it (the entire pipeline) has a `fit` method and a `predict` method

The work involved in understanding pipelines is in understanding how the above methods (of the pipeline) are composed of the `fit`, `transform` and `predict` methods of the transformers (and optional estimator) that make up the pipeline. 

This will be demonstrated below in two sections: __Transformer pipelines__ and __Estimator pipelines__.
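
As a rough illustration of that composition, here is a minimal sketch of a hand-written pipeline. `ToyPipeline` is a hypothetical class written for this note, not the scikit-learn implementation (the real `Pipeline` has more features, for example `fit_transform` and parameter handling). The sketch also assumes scikit-learn 0.20 or later, where `SimpleImputer` is available in `sklearn.impute`, and uses made-up data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler


class ToyPipeline:
    """Hypothetical, simplified stand-in for sklearn.pipeline.Pipeline."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, object) pairs

    def fit(self, X, y=None):
        # Every step except the last is fitted, then used to transform
        # the data that is handed to the next step.
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        # The last step (transformer or estimator) is only fitted.
        self.steps[-1][1].fit(X, y)
        return self

    def transform(self, X):
        # Transformer pipeline: apply every step's transform in order.
        for _, step in self.steps:
            X = step.transform(X)
        return X

    def predict(self, X):
        # Estimator pipeline: transform with all but the last step,
        # then predict with the final estimator.
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


# Made-up data with one missing value in the first column.
sketch_data = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0]])
sketch_pipe = ToyPipeline([("imputer", SimpleImputer(strategy="mean")),
                           ("scaler", MinMaxScaler())])
sketch_result = sketch_pipe.fit(sketch_data).transform(sketch_data)
print(sketch_result)
```

The point of the sketch is the data flow: during `fit`, each step sees the output of the previous step, not the raw input.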

## Reference
- http://scikit-learn.org/stable/modules/pipeline.html
- http://scikit-learn.org/stable/data_transforms.html
- http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
- http://scikit-learn.org/stable/modules/feature_extraction.html#feature-extraction
- http://scikit-learn.org/stable/modules/unsupervised_reduction.html#data-reduction
- http://scikit-learn.org/stable/modules/kernel_approximation.html#kernel-approximation

## Table of Contents
1. Setup
1. Transformer pipelines
1. Estimator pipelines
1. Pipelines with train and test datasets

## 1. Setup

Import the `pandas`, `numpy` and `sklearn` libraries. The `train_test_split` function, with which we will create train and test datasets, is imported further below.

In [7]:
import pandas  as pd
import numpy   as np
import sklearn as sk

Display the version numbers of the numpy, pandas and scikit-learn packages:

In [9]:
print('numpy  ',np.__version__)
print('pandas ',pd.__version__)
print('sklearn',sk.__version__)

Load the iris dataset to use in the demonstration below. Create datasets for features and the target.

In [11]:
from sklearn.datasets import load_iris
iris          = load_iris()   # load the dataset once and reuse it
iris_features = iris.data
iris_target   = iris.target
(iris_features.shape, 
 iris_target.shape
)

Create train and test datasets from the feature and target datasets.

In [13]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(iris_features, iris_target)
x_train.shape, x_test.shape, y_train.shape, y_test.shape
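
The split above is random, so the rows that land in each dataset change from run to run. If a reproducible split is wanted, one option is sketched below. The `random_state` value of 0 and the use of `stratify` are illustrative choices, not something the notebook prescribes, and the variable names are hypothetical so that they do not overwrite the datasets created above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_demo = load_iris()

# random_state fixes the shuffle so the split is the same on every run.
# stratify keeps the class proportions equal in both splits.
xs_train, xs_test, ys_train, ys_test = train_test_split(
    iris_demo.data, iris_demo.target,
    random_state=0, stratify=iris_demo.target,
)

# With the default test_size of 0.25, the 150 rows split into 112 and 38.
print(xs_train.shape, xs_test.shape)
```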

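The pipelines below are created with an `Imputer` object, a `MinMaxScaler` object and a `LogisticRegression` classifier object. (`Imputer` was removed from `sklearn.preprocessing` in scikit-learn 0.22; its replacement is `SimpleImputer` in `sklearn.impute`.)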
For details see:
- `Imputer`: https://bentley.cloud.databricks.com/#notebook/430288
- `MinMaxScaler`: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
- `LogisticRegression`: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

## 2. Transformer pipelines

A transformer pipeline is a sequence of transformers. (It does not contain an estimator.)

For this demonstration the pipeline will consist of (in this order):
1. An `Imputer` object that will complete the missing values
2. A `MinMaxScaler` object that will rescale each column

This pipeline is created below in `xfm`.
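
The iris data contains no missing values, so the imputation step has no visible effect on it. To see both steps at work, here is a small sketch on made-up data. It assumes scikit-learn 0.20 or later and uses `SimpleImputer` in place of `Imputer`; the array and variable names are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Made-up data: one missing value in the first column.
toy_data = np.array([[1.0, 10.0],
                     [np.nan, 20.0],
                     [5.0, 40.0]])

toy_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", MinMaxScaler()),
])
toy_scaled = toy_pipe.fit(toy_data).transform(toy_data)

# Step 1 replaces NaN with the column mean, (1 + 5) / 2 = 3.
# Step 2 rescales each column to the range [0, 1].
print(toy_scaled)
```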

### `fit` and `transform`

Create pipeline and individual transformers.

In [19]:
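from sklearn.pipeline      import Pipeline
from sklearn.preprocessing import MinMaxScaler
try:
    from sklearn.impute import SimpleImputer as Imputer  # scikit-learn >= 0.20
except ImportError:
    from sklearn.preprocessing import Imputer            # older releases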

xfm = Pipeline([
  ('imputer', Imputer(strategy="mean")),
  ('scaler',  MinMaxScaler())
])
imp = Imputer(strategy="mean")
sca = MinMaxScaler()

Fit each individual transformer and use it to transform the `x_train` dataset, in the same order as the pipeline. Note that the scaler is fit on the output of the imputer, not on the raw data.

In [21]:
x_train_imp      = imp.fit(x_train)    .transform(x_train)
x_train_imp_sca  = sca.fit(x_train_imp).transform(x_train_imp)

Fit and transform the `x_train` dataset on the entire pipeline.

In [23]:
xfm.fit(x_train)
x_train_xfm = xfm.transform(x_train)

Compare the results of these two methods. (They should be the same.)

In [25]:
np.all(x_train_imp_sca == x_train_xfm)

They are the same.
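
The separate `fit` and `transform` calls can also be combined into a single `fit_transform` call on the pipeline. The sketch below checks that both routes give the same result. It assumes scikit-learn 0.20 or later (`SimpleImputer` in place of `Imputer`); `make_pipe` is a hypothetical helper, and `np.allclose` is used instead of `==` so the check tolerates floating-point rounding.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

features = load_iris().data


def make_pipe():
    # Hypothetical helper: returns a fresh, unfitted pipeline each time.
    return Pipeline([
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", MinMaxScaler()),
    ])


two_calls = make_pipe().fit(features).transform(features)
one_call = make_pipe().fit_transform(features)

same_result = np.allclose(two_calls, one_call)
print(same_result)
```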

### `fit` only

Recreate the pipeline and individual transformers.

In [29]:
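from sklearn.pipeline      import Pipeline
from sklearn.preprocessing import MinMaxScaler
try:
    from sklearn.impute import SimpleImputer as Imputer  # scikit-learn >= 0.20
except ImportError:
    from sklearn.preprocessing import Imputer            # older releases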

xfm = Pipeline([
  ('imputer', Imputer(strategy="mean")),
  ('scaler',  MinMaxScaler())
])
imp = Imputer(strategy="mean")
sca = MinMaxScaler()

Fit the individual transformers on the `x_train` dataset in the same order as the entire pipeline.

In [31]:
x_train_imp      = imp.fit(x_train).transform(x_train)
sca.fit(x_train_imp)

Fit the pipeline on the `x_train` dataset.

In [33]:
xfm.fit(x_train)

Compare the `data_range_` attribute from the last individual transformer and from the pipeline.

In [35]:
(sca                      .data_range_, 
 xfm.named_steps['scaler'].data_range_
)

They are the same.
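
Other fitted attributes can be inspected through `named_steps` in the same way. The sketch below looks at the fill values learned by the imputer and the range learned by the scaler. It assumes scikit-learn 0.20 or later (`SimpleImputer` in place of `Imputer`) and fits on the full iris feature set for brevity; the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

iris_x = load_iris().data

inspect_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", MinMaxScaler()),
])
inspect_pipe.fit(iris_x)

fitted_imputer = inspect_pipe.named_steps["imputer"]
fitted_scaler = inspect_pipe.named_steps["scaler"]

# statistics_ holds the per-column fill values (the column means here).
print(fitted_imputer.statistics_)
# data_range_ is data_max_ - data_min_ for each column.
print(fitted_scaler.data_range_)
```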

### `transform` only (the transformers were already fit on `x_train` above)

Transform the `x_test` dataset using the individual transformers in the same order as the entire pipeline.

In [39]:
x_test_imp     = imp.transform(x_test)
x_test_imp_sca = sca.transform(x_test_imp)

Transform the `x_test` dataset using the entire pipeline.

In [41]:
x_test_xfm     = xfm.transform(x_test)

Check whether the two results are the same.

In [43]:
np.all(x_test_imp_sca==x_test_xfm)

They are.
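
One consequence of fitting on the train data and only transforming the test data is that transformed test values are not guaranteed to lie in [0, 1]: the scaler applies the minimum and range it learned from the train data. A small sketch on made-up numbers (not taken from the iris data) shows this:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-column data for illustration.
demo_train = np.array([[0.0], [10.0]])
demo_test = np.array([[-5.0], [5.0], [20.0]])

demo_scaler = MinMaxScaler().fit(demo_train)   # learns min 0 and range 10
demo_test_scaled = demo_scaler.transform(demo_test)

# Each value is mapped with (x - 0) / 10, so values outside the
# train range fall outside [0, 1]: -0.5, 0.5 and 2.0.
print(demo_test_scaled.ravel())
```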

## 3. Estimator pipelines

__TBD__

In [47]:
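from sklearn.pipeline      import Pipeline
from sklearn.linear_model  import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
try:
    from sklearn.impute import SimpleImputer as Imputer  # scikit-learn >= 0.20
except ImportError:
    from sklearn.preprocessing import Imputer            # older releases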

est = Pipeline([
  ('imputer', Imputer(strategy="mean")),
  ('scaler',  MinMaxScaler()),
  ('logreg',  LogisticRegression())
])
imp = Imputer(strategy="mean")
sca = MinMaxScaler()
log = LogisticRegression()
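
A comparison in the style of the transformer section might look like the sketch below: fit and predict with the individual objects, do the same with the pipeline, and check that the predictions agree. This is an illustration under stated assumptions rather than the author's worked example. It assumes scikit-learn 0.20 or later (`SimpleImputer` in place of `Imputer`); the `random_state=0` and `max_iter=1000` settings and all variable names are choices made here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

iris_data = load_iris()
xa, xb, ya, yb = train_test_split(iris_data.data, iris_data.target,
                                  random_state=0)

est_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", MinMaxScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])
imp_step = SimpleImputer(strategy="mean")
sca_step = MinMaxScaler()
log_step = LogisticRegression(max_iter=1000)

# Individual objects: fit and transform with each transformer in order,
# then fit the classifier on the fully transformed train data.
xa_prepared = sca_step.fit_transform(imp_step.fit_transform(xa))
log_step.fit(xa_prepared, ya)
# Predict: only transform the test data, then call predict.
manual_pred = log_step.predict(sca_step.transform(imp_step.transform(xb)))

# Pipeline: a single fit and a single predict.
est_pipe.fit(xa, ya)
pipe_pred = est_pipe.predict(xb)

same_pred = np.array_equal(manual_pred, pipe_pred)
print(same_pred)
```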

__The End__