# Pipelines

## Table of Contents
1. Setup
1. Pipelines

## Setup

Import `pandas`, `numpy` and `sklearn`. Note the corresponding versions.

In [5]:
import pandas  as pd
import numpy   as np
import sklearn as sk
pd.__version__, np.__version__, sk.__version__

Related Scikit-learn documentation, for the version we are using, can be found at:
- Documentation: http://scikit-learn.org/0.18/documentation.html
- User guide: http://scikit-learn.org/0.18/user_guide.html
- Pipelines: http://scikit-learn.org/0.18/modules/pipeline.html
- Data transformations: http://scikit-learn.org/0.18/data_transforms.html
    - Preprocessing: http://scikit-learn.org/0.18/modules/preprocessing.html#preprocessing
    - Feature extraction: http://scikit-learn.org/0.18/modules/feature_extraction.html#feature-extraction
    - Unsupervised dimensionality reduction: http://scikit-learn.org/0.18/modules/unsupervised_reduction.html#data-reduction
    - Expand (kernel approximation): http://scikit-learn.org/0.18/modules/kernel_approximation.html#kernel-approximation

### Setup Dataset

The first two columns will be treated as numeric. The last column will be treated as categorical.

In [9]:
# np.nan is the canonical spelling; the np.NaN alias is not available in every NumPy release
data_array = np.array([[1,  np.nan, 1],
                       [6,  0,      2],
                       [11, 10,     3],
                       [8,  8,      np.nan]
                      ])
data_array

In [10]:
from sklearn import pipeline
from sklearn import preprocessing

num_pipeline = pipeline.Pipeline([
        ('imputer',       preprocessing.Imputer(strategy="mean")),
    ])
num_pipeline

## Pipelines

In [12]:
data_array[:,[1]]

In [13]:
num_pipeline.fit_transform(data_array[:,[1]])

In [14]:
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)

In [15]:
imp.fit([[1, 2], 
         [np.nan, 3], 
         [7, 6]])

In [16]:
imp.statistics_

In [17]:
X = [[np.nan, 2], 
     [6, np.nan], 
     [9, 0]]

In [18]:
print(imp.transform(X)) 
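The mean imputation above can be reproduced by hand as a rough check. The sketch below uses only NumPy and the same two arrays passed to `imp.fit` and `imp.transform`. The names `X_train`, `X_new`, `column_means` and `X_filled` are illustrative and not part of the notebook.

```python
import numpy as np

# Training data: the same values passed to imp.fit above.
X_train = np.array([[1.0,    2.0],
                    [np.nan, 3.0],
                    [7.0,    6.0]])

# "fit": learn one mean per column, ignoring missing entries.
column_means = np.nanmean(X_train, axis=0)

# New data: the same values passed to imp.transform above.
X_new = np.array([[np.nan, 2.0],
                  [6.0,    np.nan],
                  [9.0,    0.0]])

# "transform": replace each NaN with the mean learned for its column.
X_filled = np.where(np.isnan(X_new), column_means, X_new)

print(column_means)
print(X_filled)
```

The learned means play the role of `imp.statistics_`: they come from the training data only and are reused unchanged on the new data.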

A `Pipeline` object takes as input a list of tuples, where each tuple has:
- A name (`string`)
- A transformer object (every step except the last must implement both `fit` and `transform`)

Below, two transformers are run in sequence. The `fit` method of the second takes as input the output from the `transform` method of the first.

In [20]:
from sklearn import pipeline
from sklearn import preprocessing

num_pipeline = pipeline.Pipeline([('imputer', preprocessing.Imputer(strategy="median")),
                                  ('scaler',  preprocessing.MinMaxScaler())
    ])
num_pipeline

In [21]:
data_array[:,[0,1]]

In [22]:
num_pipeline.fit_transform(data_array[:,[0,1]])
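The chaining described above can be sketched by calling the two transformers by hand and comparing the result with the pipeline. This is a minimal illustration, not the notebook's own cell. The `try`/`except` import is an assumption added so the sketch also runs on newer scikit-learn releases, where the imputer lives in `sklearn.impute` as `SimpleImputer`. The names `X_num`, `manual` and `piped` are illustrative.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

try:
    # Newer scikit-learn releases (assumption: not the 0.18 used in this notebook)
    from sklearn.impute import SimpleImputer as Imputer
except ImportError:
    # scikit-learn 0.18, as used in this notebook
    from sklearn.preprocessing import Imputer

# The two numeric columns of data_array.
X_num = np.array([[1,  np.nan],
                  [6,  0],
                  [11, 10],
                  [8,  8]], dtype=float)

# Manual chaining: the scaler is fitted on the imputer's output.
imputer = Imputer(strategy="median")
scaler = MinMaxScaler()
manual = scaler.fit_transform(imputer.fit_transform(X_num))

# The same two steps wrapped in a Pipeline.
piped = Pipeline([("imputer", Imputer(strategy="median")),
                  ("scaler",  MinMaxScaler())]).fit_transform(X_num)

print(np.allclose(manual, piped))
```

In the second column the missing value is replaced by the median of 0, 10 and 8, which is 8, before each column is rescaled to the range 0 to 1.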

This completes the numeric pipeline (`num_pipeline`) for this demonstration. Now we create a categorical pipeline (`cat_pipeline`).

In [24]:
data_array[:,[2]]

In [25]:
# Expected column 2 after median imputation (NaN -> 2) and min-max scaling
# of [1, 2, 3, 2]:
0, 0.5, 1, 0.5

In [26]:
from sklearn import pipeline
from sklearn import preprocessing

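# Caution (scikit-learn 0.18): OneHotEncoder is documented to take integer input,
# so fractional values produced by the scaler (e.g. 0.5) may be truncated first.
cat_pipeline = pipeline.Pipeline([('imputer', preprocessing.Imputer(strategy="median")),
                                  ('scaler',  preprocessing.MinMaxScaler()),
                                  ('onehot',  preprocessing.OneHotEncoder())
                                 ])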
cat_pipeline

In [27]:
cat_pipeline.fit_transform(data_array[:,[2]]).toarray()
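For comparison, one-hot encoding itself can be sketched with NumPy alone. The example below is independent of `cat_pipeline` and does not reproduce its exact output. It assumes the categorical column has already been imputed with the median, giving the values 1, 2, 3, 2, and it skips the scaling step. The names `category_col`, `levels`, `codes` and `one_hot` are illustrative.

```python
import numpy as np

# Column 2 of data_array after replacing NaN with the median (2).
category_col = np.array([1.0, 2.0, 3.0, 2.0])

# Map each distinct value to an integer code 0..k-1.
levels, codes = np.unique(category_col, return_inverse=True)

# One row per sample, one column per distinct level, a single 1 per row.
one_hot = np.eye(len(levels))[codes]

print(levels)
print(one_hot)
```

Each row contains exactly one 1, in the column that corresponds to that sample's category.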

__The End__