## An Introduction to Pipelines in Scikit-Learn
### Chris Feller | Galvanize Data Science Immersive

---

### First, Some Background on Scikit-Learn's API Design

There are three main objects you will work with in scikit-learn:
    <br>
1. **Estimators**: An object that can estimate some parameters based on a dataset. An estimator learns from the data. This is the `.fit()` step. Example: `StandardScaler()` is an estimator since it takes in a dataset and calculates or 'learns' the dataset's parameters (e.g., mean and standard deviation). 
    <br>
2. **Transformers**: An object that changes the data in some way. This is the `.transform()` step. Example: `Imputer()` (renamed `SimpleImputer()` in newer scikit-learn releases) is a transformer because it imputes null values, thus changing the underlying data. It is important to note that some estimators (such as `StandardScaler()`) can also transform a dataset.
    <br>
3. **Predictors**: An object that is capable of making predictions on a given dataset. Takes a dataset of new instances and returns a dataset of corresponding predictions. This is the `.predict()` step. Also has a `.score()` method to measure the quality of the predictions given a test set.


The general process of scikit-learn is:
```
.fit()
.transform()
.predict()
```
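
As a rough illustration of the three roles, here is a minimal sketch on a tiny made-up dataset (the arrays `X` and `y` are invented for this example). `StandardScaler` plays the estimator and transformer roles, and `LogisticRegression` plays the predictor role:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset: one feature, two well-separated classes
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

scaler = StandardScaler()
scaler.fit(X)                      # estimator step: learns mean_ and scale_
X_scaled = scaler.transform(X)     # transformer step: applies what was learned

model = LogisticRegression()
model.fit(X_scaled, y)             # a predictor is also an estimator
y_pred = model.predict(X_scaled)   # predictor step
accuracy = model.score(X_scaled, y)
```

Note that the scaler stores what it learned in attributes ending with an underscore (`mean_`, `scale_`), which is the scikit-learn convention for fitted state.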

### So What Are Pipelines and Why Are They Useful?

Usually, this `.fit()`, `.transform()`, `.predict()` process happens in sequence, but in separate self-contained steps. However, Pipelines make it so that we can chain these steps together into a single unit that still executes in sequential order. In essence, a Pipeline is a way to chain multiple estimators, transformers, and predictors into a single unit. 

This is useful for a multitude of reasons:
1. Readability
    - The intent of the code is clearer and the code is more readable
2. Efficiency
    - You only make one `fit()` and `predict()` call on the entire sequence instead of at each step
    - Leads to a faster modeling loop and easier experimentation 
    - It makes it trivial to move ordering of the Pipeline pieces, or to swap pieces in and out
3. Safety
    - Ensures that each transformation of the data is being performed in the correct order, which protects from inadvertent data leakage
    - You don't have to keep track of data during intermediate steps
4. Joint Parameter Selection
    - You can grid-search over all parameters of all your transformers and estimators at once
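
To make the safety and experimentation points a bit more concrete, here is a small sketch on synthetic data (the dataset and step names are assumptions made for illustration). Because the scaler lives inside the pipeline, `cross_val_score` refits it on each fold's training portion, so the held-out portion never influences the scaling:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([
    ('std_scaler', StandardScaler()),
    ('logistic_model', LogisticRegression()),
])

# Each of the 5 folds refits the scaler on that fold's training portion only
scores = cross_val_score(pipe, X, y, cv=5)

# Swapping a piece out is a one-liner on the same object:
# 'passthrough' tells the pipeline to skip that step
pipe.set_params(std_scaler='passthrough')
scores_unscaled = cross_val_score(pipe, X, y, cv=5)
```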
   

### How Do They Work?

The Pipeline is built using a list of `(key, value)` tuples, where the `key` is a string containing the name you want to give the step and `value` is an estimator object. The names of each step can be anything you'd like as long as they don't include any double underscores `__`. Here's a general outline:
```
pipeline = Pipeline([
    ('step_1_name', estimator()),
    ('step_2_name', transformer()),
    ('step_3_name', predictor()),
    ])
```
All estimators in a pipeline, except for the last one, must be transformers (i.e., they take X, do something to X, and then return a transformed X). In other words, each of the estimators, except for the last one, must implement a `fit()` and a `transform()` method. The last estimator can be of any type (transformer, predictor, etc.). The Pipeline as a whole exposes the methods of its last estimator: if the last estimator is a predictor such as Logistic Regression, the pipeline has `fit()`, `predict()`, and `score()` methods but no `transform()` method. If instead the last estimator is a transformer, the pipeline has `fit()` and `transform()` methods but no `predict()` method.
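
A quick way to see which methods a given pipeline exposes is to ask it directly with `hasattr`. The sketch below builds two small pipelines (the variable names are made up for this example), one ending in a predictor and one ending in a transformer:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

ends_with_predictor = Pipeline([
    ('std_scaler', StandardScaler()),
    ('logistic_model', LogisticRegression()),
])
ends_with_transformer = Pipeline([
    ('std_scaler', StandardScaler()),
])

# The pipeline only offers what its last step offers
predictor_has_predict = hasattr(ends_with_predictor, 'predict')
predictor_has_transform = hasattr(ends_with_predictor, 'transform')
transformer_has_predict = hasattr(ends_with_transformer, 'predict')
transformer_has_transform = hasattr(ends_with_transformer, 'transform')
```

Here the pipeline ending in `LogisticRegression` reports a `predict` method but no `transform` method, and the pipeline ending in `StandardScaler` reports the opposite.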

Calling `fit()` on the pipeline is the same as calling `fit()` on each estimator in turn, transforming the input, and passing the result on to the next step. In other words, when you call the pipeline's `fit()` method, it calls `fit_transform()` sequentially on all transformers, passing the output of each call as the input to the next, until it reaches the final estimator, for which it just calls `fit()`. Calling the pipeline's `predict()` method works the same way, except that each transformer only calls `transform()` (nothing is refit) and the final estimator calls `predict()`.
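
The sequence described above can be spelled out by hand and compared against the pipeline. This is only a sketch on synthetic data, with step names chosen for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# The pipeline way: one fit() call, one predict() call
pipe = Pipeline([
    ('std_scaler', StandardScaler()),
    ('logistic_model', LogisticRegression()),
])
pipe.fit(X, y)
pipe_pred = pipe.predict(X)

# The same steps written out manually
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)                  # fit_transform() for each transformer
model = LogisticRegression().fit(X_scaled, y)       # fit() only for the final estimator
manual_pred = model.predict(scaler.transform(X))    # transform(), then predict()

same = np.array_equal(pipe_pred, manual_pred)
```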


Example in which we are imputing the median for missing values, standardizing the data, and then building a Logistic Regression model:

In [1]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
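from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer as Imputer  # Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it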
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

pipeline = Pipeline([
    ('imputer', Imputer(strategy='median')),
    ('std_scaler', StandardScaler()),
    ('logistic_model', LogisticRegression()),
    ])

# Create Fake Data
X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
                             n_clusters_per_class=1)
# Perform Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit and Predict on Pipeline
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(y_pred)

[1 1 1 0 1 0 0 1 1 1 0 1 0 0 1 1 0 1 0 0 1 0 0 0 1]


### FeatureUnions

In some situations, you'll need to combine the output of two separate pipelines into one. For example, you may have a pandas DataFrame which requires separate processing for numeric features and categorical features. You could create a Pipeline to deal with the numeric features and a separate Pipeline to deal with the categorical features and then combine the two prior to fitting a model. `FeatureUnion` does this seamlessly, by applying multiple transformers to the same input in parallel and concatenating the columns they return using `np.hstack`. In practice, Pipelines and FeatureUnions are commonly used in conjunction to create complex workflows.

`Pipeline` chains things together in sequential order, while `FeatureUnion` chains things together in parallel. `Pipeline` items need to happen in order, while `FeatureUnion` items can happen at the same time.

A few things to keep in mind when using `FeatureUnion`:
* A FeatureUnion has no way of checking whether two transformers might produce identical features. It only produces a true union when the feature sets are disjoint, and making sure they are is the caller's responsibility. So the output of your FeatureUnion may have duplicate features if you're not careful.
* For some complex transformers, alignment may be tricky. Pandas is good at this, but sometimes runs into trouble in FeatureUnions because `np.hstack` is called, which ignores indexes.
* Writing a generalizable transformer often means you will expect the correct columns to be selected from your X matrix. Oftentimes this means writing a custom selector (via Custom Transformers, which we'll get to next), which is extra work.
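
The first two caveats are easy to reproduce. In this sketch, a hypothetical DataFrame with a non-default index is passed through two transformers that happen to produce the same feature. Assuming scikit-learn's default output configuration, the result is a plain NumPy array containing the duplicated column, and the original index is gone:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

# Hypothetical frame with a non-default index
df = pd.DataFrame({'a': [1.0, 2.0, 3.0]}, index=[10, 20, 30])

# Two transformers that (accidentally) produce the same feature
union = FeatureUnion([
    ('identity_1', FunctionTransformer()),
    ('identity_2', FunctionTransformer()),
])
out = union.fit_transform(df)

# No error and no warning: the duplicate is simply stacked next to the original
has_duplicate = np.array_equal(out[:, 0], out[:, 1])
```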

Example in which we take an array of random numbers and then concatenate two additional arrays to it, one with the square-root of the original array and one with the square of the original array:

In [2]:
import numpy as np
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion

# Create Fake Data
X = np.random.random((5,1))

feat_union = FeatureUnion([('identity',FunctionTransformer()),
                    ('sqrt',FunctionTransformer(np.sqrt)),
                    ('square', FunctionTransformer(lambda x: x**2))])

print(feat_union.fit_transform(X))

[[  7.15427374e-01   8.45829400e-01   5.11836328e-01]
 [  5.20721797e-01   7.21610558e-01   2.71151190e-01]
 [  6.99023359e-01   8.36076168e-01   4.88633656e-01]
 [  6.94400787e-01   8.33307138e-01   4.82192453e-01]
 [  1.10036141e-02   1.04898113e-01   1.21079523e-04]]


### Custom Transformers

Often during preprocessing and feature selection, we write our own functions that transform the data (e.g. drop columns, multiply two columns together, etc.). And although scikit-learn provides many useful transformers, such as `StandardScaler()`, that do much of this preprocessing for us, you will need to write your own for tasks such as custom cleanup operations. As a reminder, a transformer is just an object that responds to `fit()`, `transform()`, and `fit_transform()`. A transformer can be thought of as a data in, data out black box.

To create a custom transformer that will fit into a Pipeline, all you need is to create a class and implement three methods:
1. `.fit()`  - which returns `self`
2. `.transform()`
3. `.fit_transform()`
    - You can get this last one for free by simply adding `TransformerMixin` as a base class.
    - Similarly, if you use `BaseEstimator` as a base class (and avoid `*args` and `**kwargs` in your constructor) your transformer will have grid-searchable parameters, which will be very useful later on. 


**Custom Transformer Template**:
```
class MyTransformer(TransformerMixin, BaseEstimator):
    '''A template for a custom transformer.'''

    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # transform X via code or additional methods
        return X
```

* `fit()` **ALWAYS** returns `self`. Sometimes it can set state variables if you will need those to transform test data later on. Otherwise it just does nothing. Either way, it returns `self`.
* Even though your `fit()` method isn't doing anything with `y`, it still needs to be a parameter. Just set it to `None`. On the flip side, the `transform` method **ONLY** takes `X` and won't work if you include a `y`.
* `transform()` is where most of the transformations happen! In this method, `X` is transformed somehow (perhaps through other methods within the transformer), and then the transformed `X` is returned.
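
To show the template with some actual state and a tunable parameter, here is a minimal sketch of a made-up transformer, `OutlierClipper` (the class name and its `n_std` parameter are hypothetical, not part of scikit-learn). It learns clipping bounds in `fit()` and applies them in `transform()`:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class OutlierClipper(TransformerMixin, BaseEstimator):
    '''Hypothetical transformer: clip values to mean +/- n_std standard deviations.'''

    def __init__(self, n_std=3.0):
        # Store constructor arguments unchanged so get_params()/set_params() work
        self.n_std = n_std

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        mean, std = X.mean(axis=0), X.std(axis=0)
        # State learned from the training data
        self.lower_ = mean - self.n_std * std
        self.upper_ = mean + self.n_std * std
        return self

    def transform(self, X):
        return np.clip(np.asarray(X, dtype=float), self.lower_, self.upper_)

X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_new = np.array([[0.0], [3.0], [100.0]])

clipper = OutlierClipper(n_std=1.0)
X_train_clipped = clipper.fit_transform(X_train)   # provided by TransformerMixin
X_new_clipped = clipper.transform(X_new)           # reuses bounds learned from X_train
params = clipper.get_params()                      # provided by BaseEstimator
```

Because `n_std` is a plain constructor argument, it would show up in a grid search as `<step name>__n_std` once the transformer is placed in a Pipeline.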


Example in which we implement our own version of `StandardScaler()`:

In [3]:
from sklearn.base import TransformerMixin, BaseEstimator
import numpy as np

class MyScaler(TransformerMixin, BaseEstimator):
    """
    Scale to zero mean and unit variance.
    """
    def fit(self, X, y=None):
        """
        Recommended signature for custom transformer's
        fit method.

        Set state in your transformer with whatever information
        is needed to transform later.
        """
        self.mean = np.mean(X, axis=0)
        self.scale = np.std(X, axis=0)
        #You have to return self, so we can chain!
        return self

    def transform(self, X):
        """
        Recommended signature for custom transformer's
        transform method.

        Use state (if any) to transform some X data. This X
        may be the same X passed to fit, but it may also be new data,
        as in the case of a CV dataset. Both are treated the same.
        """
        #Do transformations
        Xt = X.copy()
        Xt -= self.mean
        Xt /= self.scale
        return Xt
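
As a sanity check, a scaler like this should agree with `StandardScaler()` on data it was not fitted on. The sketch below uses a condensed copy of the class (restated so the snippet runs on its own, with `y=None` in `fit`) and random data generated only for this comparison:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class MyScaler(TransformerMixin, BaseEstimator):
    """Condensed copy of the scaler above so this snippet runs on its own."""
    def fit(self, X, y=None):
        self.mean = np.mean(X, axis=0)
        self.scale = np.std(X, axis=0)
        return self

    def transform(self, X):
        return (X - self.mean) / self.scale

rng = np.random.RandomState(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(50, 3))
X_new = rng.normal(loc=5.0, scale=2.0, size=(10, 3))

mine = MyScaler().fit(X_train)
reference = StandardScaler().fit(X_train)

# Both scalers use the training mean and (population) standard deviation
matches = np.allclose(mine.transform(X_new), reference.transform(X_new))
```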

### GridSearching with Pipeline

One of the best features of Pipelines is that you can pass your entire pipeline object into GridSearchCV to optimize your hyperparameters. By combining GridSearchCV with Pipeline you can also cross-validate and optimize any upstream transformers. This could come in handy if you were doing dimensionality reduction before classifying and wanted to tune the number of components together with the classifier's parameters, as in the example below:
    
```
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ('reduce_dim', PCA()),
    ('clf', SVC()),
    ])

param_grid = {'reduce_dim__n_components': [2, 5, 10],
              'clf__C': [0.1, 10, 100]}

grid_search = GridSearchCV(pipe, param_grid=param_grid)
```
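
Going one step further, a grid search can also swap out an entire step, which is one way to compare dimensionality reduction techniques. The following is a sketch on synthetic data. The choice of `SelectKBest` as the alternative to `PCA` is an assumption made for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic data, for illustration only
X, y = make_classification(n_samples=150, n_features=12, n_informative=4,
                           random_state=0)

pipe = Pipeline([
    ('reduce_dim', PCA()),
    ('clf', SVC()),
])

# A list of dicts searches each dict as its own grid, so the step itself
# can be swapped alongside the parameters that belong to it
param_grid = [
    {'reduce_dim': [PCA()],
     'reduce_dim__n_components': [2, 5, 10],
     'clf__C': [0.1, 10, 100]},
    {'reduce_dim': [SelectKBest()],
     'reduce_dim__k': [2, 5, 10],
     'clf__C': [0.1, 10, 100]},
]

grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid_search.fit(X, y)
best_params = grid_search.best_params_
best_score = grid_search.best_score_
```

After fitting, `best_params` reports which technique won along with its settings, and `grid_search.best_estimator_` is the whole pipeline refit with those settings.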

Note that when building the `param_grid` dictionary to gridsearch over, the naming convention is as follows: the name of the pipeline step (e.g., `clf`), followed by two underscores, followed by the name of the parameter. For example, below `my_classifier` is the name of the random forest classifier in the pipeline and `min_samples_split` is the parameter we are trying to optimize.
```
from sklearn.ensemble import RandomForestClassifier
pipeline = Pipeline([("my_classifier", RandomForestClassifier())])
parameters = {"my_classifier__min_samples_split": [2, 3, 4, 5]}
cv = GridSearchCV(pipeline, param_grid = parameters)
```
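
If you are unsure what a parameter is called, the pipeline can list every valid name for you. This short sketch uses `get_params()` for that, and shows that `set_params()` follows the same `step__parameter` convention:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('my_classifier', RandomForestClassifier())])

# Every key that GridSearchCV will accept for this pipeline
valid_names = sorted(pipeline.get_params().keys())

# set_params() uses the same naming convention
pipeline.set_params(my_classifier__min_samples_split=4)
current_value = pipeline.named_steps['my_classifier'].min_samples_split
```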


 ---

## Examples Using the Titanic Dataset

Let's compare an entire process with and without Pipelines to see their true value in action. Using the Titanic dataset, we'll build a Random Forest Classifier to predict which passengers survived the historic shipwreck. In the process, we'll impute values for missing data, standardize numeric features, one hot encode categorical features, and gridsearch to find the optimal parameters in the final model. 

### The Non-Pipeline Way

In [2]:
# Libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

pd.options.mode.chained_assignment = None 

# Load data
df = pd.read_csv('../data/titanic.csv')

# Select features to use in our final model
df_features = df[['Sex', 'Embarked', 'Pclass', 'Fare', 'Age', 'SibSp', 'Parch', 'Survived']]

# Perform a train/test split 
y = df_features.pop('Survived')
X = df_features
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Impute missing values w/ the median of the numeric features and the mode of categorical features
# (statistics are computed on the training set only, so nothing leaks in from the test set)
fill_values = {'Fare': X_train['Fare'].median(),
               'Age': X_train['Age'].median(),
               'Embarked': X_train['Embarked'].mode()[0]}
X_train = X_train.fillna(fill_values)
X_test = X_test.fillna(fill_values)

# One hot encode categorical features
for col in ['Sex', 'Embarked', 'Pclass']:
    X_train = pd.concat([X_train, pd.get_dummies(X_train[col], prefix=col,
                        drop_first=True)], axis=1)
    X_train.drop(col, inplace=True, axis=1)
    X_test = pd.concat([X_test, pd.get_dummies(X_test[col],
                        prefix=col, drop_first=True)], axis=1)
    X_test.drop(col, inplace=True, axis=1)

# Standardize the numeric features
features_to_scale = ['Fare', 'Age', 'SibSp', 'Parch']
scaler = StandardScaler()
scaler.fit(X_train[features_to_scale])
X_train[features_to_scale] = scaler.transform(X_train[features_to_scale])
X_test[features_to_scale] = scaler.transform(X_test[features_to_scale])

# Gridsearch on random forest classifier
random_forest_param_list = {"max_depth": [None],
          "max_features": np.arange(1,9, 1),
          "min_samples_split": np.arange(2,100,10),
          "min_samples_leaf": np.arange(2,100, 100),
          "bootstrap": [True, False],
          "n_estimators" :[100,200, 300],
          "criterion": ["gini"]}
rf = RandomForestClassifier()
g = GridSearchCV(rf, random_forest_param_list, scoring='accuracy', cv=3, n_jobs=-1)
g.fit(X_train, y_train)

# Make predictions
y_pred = g.predict(X_test)
print('Final Model Accuracy: ', accuracy_score(y_test, y_pred))

Final Model Accuracy:  0.811659192825


### The Pipeline Way

Now, let's take advantage of all that Pipelines have to offer. We'll start by creating three custom transformers. After performing a train/test split we'll create two separate pipelines, one for numerical features and another for categorical features. We'll then create one final pipeline, which includes a FeatureUnion to combine the outputs of the first two pipelines. We'll gridsearch on that final pipeline and make predictions on the test data.

In [3]:
# Libraries
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split, GridSearchCV
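from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer as Imputer  # Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it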
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Build Custom Transformers
class CustomSelector(TransformerMixin, BaseEstimator):
    '''Custom Transformer to select categorical or numerical features'''

    def __init__(self, categorical=True):
        self.categorical = categorical

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if self.categorical:
            return X[['Sex', 'Embarked', 'Pclass']]
        else:
            return X[['Fare', 'Age', 'SibSp', 'Parch']]
        
class CustomCategoricalImputer(TransformerMixin, BaseEstimator):
    '''Custom Transformer to impute the most frequent value in categorical features'''

    def fit(self, X, y=None):
        self.fill = X.mode().iloc[0]
        return self

    def transform(self, X):
        return X.fillna(self.fill)
    
class CustomCategoricalEncoder(TransformerMixin, BaseEstimator):
    '''Custom Transformer to implement One Hot Encoding on categorical features'''

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        for col in ['Sex', 'Embarked', 'Pclass']:
            X = pd.concat([X, pd.get_dummies(X[col], prefix=col,
                        drop_first=True)], axis=1)
            X.drop(col, inplace=True, axis=1)
        return X
    
# Load data
df = pd.read_csv('../data/titanic.csv')

# Select features to use in our final model
df_features = df[['Sex', 'Embarked', 'Pclass', 'Fare', 'Age', 'SibSp', 'Parch', 'Survived']]

# Perform train/test split
y = df_features.pop('Survived')
X = df_features
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Build Pipeline for numerical features
num_pipeline = Pipeline([
    ('selector', CustomSelector(categorical=False)),
    ('imputer', Imputer(strategy='median')),
    ('scaler', StandardScaler()),
])

# Build Pipeline for categorical features
cat_pipeline = Pipeline([
    ('selector', CustomSelector(categorical=True)),
    ('imputer', CustomCategoricalImputer()),
    ('cat_encoder', CustomCategoricalEncoder())
])

# Build Full Pipeline
full_pipeline = Pipeline([
    ('feat_union', FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline),
    ])),
    ('rf', RandomForestClassifier())
])

# Gridsearch on entire pipeline
random_forest_param_list = {"rf__max_depth": [None],
          "rf__max_features": np.arange(1,9, 1),
          "rf__min_samples_split": np.arange(2,100,10),
          "rf__min_samples_leaf": np.arange(2,100, 100),
          "rf__bootstrap": [True, False],
          "rf__n_estimators" :[100,200, 300],
          "rf__criterion": ["gini"]}

grid = GridSearchCV(full_pipeline, param_grid=random_forest_param_list, scoring='accuracy', cv=3, n_jobs=-1)
grid.fit(X_train, y_train)

# Make predictions
y_pred = grid.predict(X_test)
print('Final Model Accuracy: ', accuracy_score(y_test, y_pred))

Final Model Accuracy:  0.784753363229
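
One caveat about calling `pd.get_dummies` inside `transform()`, as `CustomCategoricalEncoder` does: the columns it produces depend on which categories happen to be present in the data it is given, so a cross-validation fold or test set that is missing a category can come out with a different set of columns. A possible workaround, sketched below with a hypothetical `ColumnAligningEncoder` (not part of the original notebook), is to record the dummy columns in `fit()` and align to them in `transform()`:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnAligningEncoder(TransformerMixin, BaseEstimator):
    '''Hypothetical variant: one hot encode, then align to the columns seen in fit().'''

    def fit(self, X, y=None):
        # Cast to str so numeric codes such as Pclass are encoded too
        dummies = pd.get_dummies(X.astype(str), drop_first=True, dtype=int)
        self.columns_ = dummies.columns
        return self

    def transform(self, X):
        # Encode every category present, then keep exactly the fitted columns:
        # unseen or dropped categories are discarded, missing ones become 0
        dummies = pd.get_dummies(X.astype(str), dtype=int)
        return dummies.reindex(columns=self.columns_, fill_value=0)

# Toy frames made up for this example
train = pd.DataFrame({'Embarked': ['C', 'Q', 'S', 'S']})
test = pd.DataFrame({'Embarked': ['Q', 'Q']})   # 'S' never appears here

encoder = ColumnAligningEncoder().fit(train)
train_cols = list(encoder.transform(train).columns)
test_encoded = encoder.transform(test)
```

With this approach the test frame gets the same columns as the training frame, and the category that never appears in it is filled with zeros.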
