# Pipelines in sklearn

When working with data, the same "process" is often repeated many times, and recoding it each time becomes tedious. A simple example is standardizing the data before fitting a regularized regression on it.

Luckily, sklearn has "Pipelines", which chain together multiple steps in a data analysis process. By constructing one you can consolidate all of the steps you went through into a single object.

---

### Load packages and cleaned "titanic" dataset

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [2]:
titanic = pd.read_csv('/Users/kiefer/github-repos/DSI-SF-2/datasets/titanic/titanic_clean.csv')

---

### Loading the pipeline objects

From the `sklearn.pipeline` module we are going to import `Pipeline` and `make_pipeline`.

`Pipeline` is the class that will hold our data analysis process. The `make_pipeline` function is a convenience function that takes in a series of estimators or preprocessing steps and returns a `Pipeline` object.

We'll start with the more explicit construction using `Pipeline` and then move on to the convenience function.

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline

---

The term "pipeline" is jargon for a series of concatenated data transformations. Each stage of a pipeline feeds from the previous stage, i.e. the output of a stage is plugged into the input of the next stage and data flows through the pipeline from beginning to end.


![pipeline](./images/pipeline.png)
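
The idea of stages feeding into one another can be sketched in plain Python, without sklearn. This is only an illustration: the stage functions `drop_missing` and `center` and the helper `run_pipeline` are hypothetical names made up for this sketch.

```python
from functools import reduce

# Hypothetical stages: each one takes the previous stage's output as its input.
def drop_missing(values):
    return [v for v in values if v is not None]

def center(values):
    mean = sum(values) / float(len(values))
    return [v - mean for v in values]

def run_pipeline(stages, data):
    # Feed the data through every stage, in order, from beginning to end.
    return reduce(lambda acc, stage: stage(acc), stages, data)

result = run_pipeline([drop_missing, center], [1.0, None, 2.0, 3.0])
```

Here `result` holds the centered values of the three non-missing entries. An sklearn `Pipeline` follows the same pattern, with the added ability to "fit" each stage.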

---

Pipelines provide a higher level of abstraction than the individual building blocks of a data science process and are a nice and convenient way to organize analyses.

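Let's take a look at the titanic data. First, though, a short aside on saving and reloading Python objects with `cPickle`, which we'll use later to save a fitted pipeline: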

In [6]:
import cPickle

df = pd.DataFrame({'a':[1,2,3,4], 'b':[1,2,4,5]})

# pickle files should be opened in binary mode; "with" closes the file for us
with open('/Users/kiefer/Desktop/small_df.p', 'wb') as f:
    cPickle.dump(df, f)

with open('/Users/kiefer/Desktop/dumb_list.p', 'wb') as f:
    cPickle.dump([0,2,3,55,'a','dx',-12.2], f)

In [8]:
# read the pickle back in binary mode
with open('/Users/kiefer/Desktop/small_df.p', 'rb') as f:
    loaded_df = cPickle.load(f)

loaded_df

Unnamed: 0,a,b
0,1,1
1,2,2
2,3,4
3,4,5


In [9]:
titanic.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,S


There are some preprocessing steps we're going to do before classifying whether or not passengers survived:

1. Remove unwanted columns.
2. Convert categorical string or numeric columns to dummy coded columns.
3. Standardize the predictor matrix.

In [10]:
data = titanic.drop(['PassengerId', 'Name'], axis=1)

In [12]:
def code_pclass(df):
    # Reference pclass is pclass==1
    df['pclass_2'] = df.Pclass.map(lambda x: 1 if x == 2 else 0)
    df['pclass_3'] = df.Pclass.map(lambda x: 1 if x == 3 else 0)
    return df

def code_gender(df):
    # male is reference class
    df['female'] = df.Sex.map(lambda x: 1 if x == 'female' else 0)
    return df

# embarked is either S, C, or Q
def code_embarked(df):
    # S is the reference class
    df['embarked_C'] = df.Embarked.map(lambda x: 1 if x == 'C' else 0)
    df['embarked_Q'] = df.Embarked.map(lambda x: 1 if x == 'Q' else 0)
    return df
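
As an aside, pandas can produce similar dummy columns with `pd.get_dummies`. The sketch below is an alternative to the hand-written functions, not what this notebook uses. It runs on a made-up toy frame and assumes a pandas version that supports the `drop_first` argument. Note that `drop_first=True` drops the first sorted level, so the reference classes differ from the ones chosen by hand above (`female` and `C` are dropped rather than `male` and `S`).

```python
import pandas as pd

# Toy frame standing in for the titanic columns (values are made up).
toy = pd.DataFrame({
    'Pclass': [3, 1, 2, 3],
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', 'Q', 'S'],
})

# drop_first=True drops the first (sorted) level of each column,
# which then acts as the reference class.
dummies = pd.get_dummies(toy, columns=['Pclass', 'Sex', 'Embarked'], drop_first=True)
dummy_columns = sorted(dummies.columns)
```

Writing the coding functions by hand, as done above, keeps full control over which level is the reference class.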

---

### Remove unwanted columns from data and convert categorical to dummy-coded columns

For now we'll do this manually and then later integrate it into the pipeline.

In [15]:
data = code_pclass(data)
data = code_gender(data)
data = code_embarked(data)
data.head(3)

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,pclass_2,pclass_3,female,embarked_C,embarked_Q
0,0,3,male,22.0,1,0,7.25,S,0,1,0,0,0
1,1,1,female,38.0,1,0,71.2833,C,0,0,1,1,0
2,1,3,female,26.0,0,0,7.925,S,0,1,1,0,0


In [16]:
data = data.drop(['Sex','Pclass','Embarked'], axis=1)

---

### Using a pipeline to standardize the data and fit the model

Now we'll split the data into the predictor matrix X and the target vector y, standardize the X matrix, and fit a logistic regression model on Survived.

First, split into X, y:

In [17]:
y = data.Survived.values
X = data.drop('Survived', axis=1)
print y.shape, X.shape

(712,) (712, 9)


In [21]:
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split

In [22]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.4)

In [23]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
Xtrain_n = ss.fit_transform(Xtrain)

In [26]:
Xtest_n = ss.transform(Xtest)

In [27]:
Xtest_n.mean()

-0.020216928300224521

In [25]:
Xtrain_n.mean()

2.7271676691286699e-17

In [24]:
ss.mean_

array([  2.96124356e+01,   5.12880562e-01,   4.47306792e-01,
         3.55833529e+01,   2.45901639e-01,   4.77751756e-01,
         3.88758782e-01,   2.01405152e-01,   3.51288056e-02])

In [20]:
ss.scale_

array([ 14.4827517 ,   0.93003832,   0.85358139,  52.9014591 ,
         0.42888163,   0.49999803,   0.48108187,   0.38632532,   0.19436903])

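`StandardScaler` was imported above, and `LogisticRegression` is imported in the next code cell. For comparison with the pipeline, this is the manual standardization of the full predictor matrix: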

In [28]:
Xn = ss.fit_transform(X)

Next we're going to build one of these pipelines that can combine the steps. Below, we make the standard scaler object as well as the logistic regression object, then put them together into the pipeline object.

In [32]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

ss = StandardScaler()
lr = LogisticRegression()

lr_pipe = Pipeline(steps=[['scaler', ss], ['lr_model', lr]])

**Pipelines combine both pre-processing and model building steps into a single object**. 

Rather than manually building transformations and then feeding them into the models, pipelines tie both of these steps together.

Furthermore, pipelines are equipped with the methods of the final estimator step:

- `fit()`
- `predict()` and/or `predict_proba()`
- `score()`
- ... etc.
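
To make the equivalence concrete, here is a minimal sketch on synthetic data (not the titanic data). It compares the pipeline's `predict_proba()` against the same two steps done by hand. The names `X_demo`, `y_demo` and `demo_pipe` are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 100 rows, 3 features, binary target.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

# Pipeline version: scaling and model fitting in one object.
demo_pipe = Pipeline(steps=[('scaler', StandardScaler()),
                            ('lr_model', LogisticRegression())])
demo_pipe.fit(X_demo, y_demo)

# Manual version: the same two steps done by hand.
scaler = StandardScaler().fit(X_demo)
model = LogisticRegression().fit(scaler.transform(X_demo), y_demo)

pipe_probs = demo_pipe.predict_proba(X_demo)
manual_probs = model.predict_proba(scaler.transform(X_demo))
same = np.allclose(pipe_probs, manual_probs)
```

If the two routes agree, `same` is `True`: the pipeline is doing nothing more than applying the scaler and then calling the model's own method.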

Fitting the model is then a single call to the pipeline's `fit()` method, as the `lr_pipe.fit(X, y)` cell in the next section shows.


---

### Using pipelines with training and testing data

Next we'll use the pipeline with separate training and testing sets. One of the greatest benefits of pipelines, in my opinion, is that the preprocessing steps before the model fitting retain the "fit" information from the training data so that it can be applied to the testing data.

In the pipeline we built above, for example, the first standardization step is "fit" on the data we put into it. This means that the `StandardScaler` object takes the mean and standard deviation of that data and performs the procedure with those values.

It _also_ means that were we to predict or score on future data, the standard scaler in the pipeline would use the training data's mean and standard deviation to standardize that test data. This is what we want! You definitely don't want to standardize the training and testing data to their own means and standard deviations.

This hasn't been an issue for us thus far since we standardized the whole dataset and then split it into training and testing sets, which is only possible because we have all of the data right away. There are many scenarios in which the test data is actually data that we have not collected yet. In that case, you need to save the standardization procedure you used on the training data so it can be applied to this future data.
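
One way to save the fitted procedure is to pickle the whole fitted pipeline. The sketch below uses the standard `pickle` module and synthetic data, both of which are assumptions of this example. It serializes to an in-memory byte string instead of a file, and checks that the restored pipeline treats "future" data exactly like the original.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic "training" data that we have today.
rng = np.random.RandomState(1)
X_old = rng.normal(loc=5.0, scale=2.0, size=(80, 2))
y_old = (X_old[:, 0] > 5.0).astype(int)

fitted = Pipeline(steps=[('scaler', StandardScaler()),
                         ('lr_model', LogisticRegression())]).fit(X_old, y_old)

# Serialize to bytes; writing these bytes to a file opened in 'wb' mode works the same way.
blob = pickle.dumps(fitted)
restored = pickle.loads(blob)

# "Future" data that did not exist at training time.
X_new = rng.normal(loc=5.0, scale=2.0, size=(5, 2))
same_predictions = np.array_equal(fitted.predict(X_new), restored.predict(X_new))
same_means = np.allclose(restored.named_steps['scaler'].mean_,
                         fitted.named_steps['scaler'].mean_)
```

The restored scaler carries the training means and standard deviations with it, so new data is standardized with the training parameters.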

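The data was already split into training and testing sets with `train_test_split` above. The cells below first fit and score the pipeline on the full X, y, then fit it on the training data and score it on the testing data: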


In [34]:
lr_pipe.fit(X,y)

Pipeline(steps=[['scaler', StandardScaler(copy=True, with_mean=True, with_std=True)], ['lr_model', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)]])

In [35]:
lr_pipe.score(X, y)

0.8019662921348315

In [36]:
lr_pipe.fit(Xtrain, ytrain)

Pipeline(steps=[['scaler', StandardScaler(copy=True, with_mean=True, with_std=True)], ['lr_model', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)]])

In [37]:
lr_pipe.score(Xtest, ytest)

0.81052631578947365

The last two cells above fit the pipeline with the training data and then score it on the testing data.

For the sake of example, standardize the Xtrain and Xtest separately and show that their normalization parameters differ.
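
A minimal sketch of that comparison, using synthetic data in place of `Xtrain` and `Xtest` (the names `X_tr`, `X_te`, `train_scaler` and `test_scaler` are made up for this example):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and testing predictor matrices.
rng = np.random.RandomState(42)
X_all = rng.normal(loc=10.0, scale=3.0, size=(200, 2))
X_tr, X_te = X_all[:120], X_all[120:]

# Fit one scaler per subset (this is what you should NOT do in practice).
train_scaler = StandardScaler().fit(X_tr)
test_scaler = StandardScaler().fit(X_te)
params_differ = not np.allclose(train_scaler.mean_, test_scaler.mean_)

# The correct approach: standardize the test data with the training parameters.
X_te_n = train_scaler.transform(X_te)
train_mean_after = train_scaler.transform(X_tr).mean(axis=0)
test_mean_after = X_te_n.mean(axis=0)
```

The training columns come out with mean zero, while the test columns are only close to zero, just as `Xtest_n.mean()` was slightly off zero earlier.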

---

### Many built-in transformations and preprocessing steps

Sklearn comes with a wide variety of useful classes for preprocessing your data prior to model fitting, and all of them can be put into pipelines.

These can be found in the `sklearn.preprocessing` module, and it is worth familiarizing yourself with them if you want to use them in your code:

**Data Manipulators**

- Binarizer
- KernelCenterer
- MaxAbsScaler
- MinMaxScaler
- Normalizer
- OneHotEncoder
- PolynomialFeatures
- RobustScaler
- StandardScaler

**Data Imputation**

- Imputer (replaced by `SimpleImputer` in the `sklearn.impute` module in newer scikit-learn releases)

**Function Transformer**

- FunctionTransformer

**Label Manipulators**

- LabelBinarizer
- LabelEncoder
- MultiLabelBinarizer
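
As a small illustration of chaining two of these classes, the sketch below puts a `FunctionTransformer` (wrapping `np.log1p`) in front of a `MinMaxScaler`. The input values are made up, and this particular combination is just an example, not a recommendation for the titanic data.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

# Made-up, strongly skewed positive values in a single column.
skewed = np.array([[1.0], [10.0], [100.0], [1000.0]])

prep_pipe = Pipeline(steps=[
    ('log', FunctionTransformer(np.log1p)),  # apply log(1 + x) to every entry
    ('minmax', MinMaxScaler()),              # rescale each column to the range [0, 1]
])
scaled = prep_pipe.fit_transform(skewed)
```

A pipeline made only of transformers has no `predict()`, but it still supports `fit()`, `transform()` and `fit_transform()`.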



---

### Custom transformations

It's not always possible to use a built-in transformation class to do what you want. In fact, it's likely that you're going to run into a scenario where you need a customized preprocessing step before model fitting.

Let's take our titanic data, for example. Say we wanted a preprocessor that would remove the columns we didn't want and create the dummy-coded columns before sending it through to the standardization step.

Custom transformer classes start with this template code:


In [39]:
# we need to import the template classes to create a class that works like an sklearn class
from sklearn.base import BaseEstimator, TransformerMixin

# our "TitanicPreprocessor" is going to do the processing
class TitanticPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def transform(self, X, *args):
        return X

    def fit(self, X, *args):
        return self

Some notes on this class:

1. We have to load in the `BaseEstimator` and `TransformerMixin` classes for our preprocessor to "inherit" from in the class definition.
2. The two required methods are `fit` and `transform`, which will be used to chain the processes together in our pipeline.
3. The `*args` argument tells the method to expect an arbitrary number of arguments after whatever arguments were listed explicitly. This lets `fit` accept the `y` that a pipeline passes along, even though our preprocessor ignores it.
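
To see what the two base classes contribute, here is a minimal sketch with a hypothetical `AddConstant` transformer (not part of sklearn and not used elsewhere in this notebook). `TransformerMixin` supplies `fit_transform()` and `BaseEstimator` supplies `get_params()`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class AddConstant(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that adds a fixed value to every entry."""

    def __init__(self, value=1.0):
        self.value = value

    def fit(self, X, *args):
        # Nothing to learn here; *args absorbs the y that a Pipeline passes along.
        return self

    def transform(self, X, *args):
        return np.asarray(X) + self.value

adder = AddConstant(value=2.0)
out = adder.fit_transform(np.array([[1.0, 2.0]]))  # fit_transform comes from TransformerMixin
params = adder.get_params()                        # get_params comes from BaseEstimator
```

Storing constructor arguments as attributes with the same name, as `value` is stored here, is what allows `get_params()` to find them.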

**Add the dummy-coding functions we wrote above to the class:**

**Modify the `transform` function to perform these preprocessing steps, returning the new DataFrame.**

Also, keep track of the final column names in a class attribute.

**Add a function to remove the unnecessary columns after dummy-coding:**

In [40]:
class TitanticPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def code_pclass(self, df):
        # Reference pclass is pclass==1
        df['pclass_2'] = df.Pclass.map(lambda x: 1 if x == 2 else 0)
        df['pclass_3'] = df.Pclass.map(lambda x: 1 if x == 3 else 0)
        return df

    def code_gender(self, df):
        # male is reference class
        df['female'] = df.Sex.map(lambda x: 1 if x == 'female' else 0)
        return df

    # embarked is either S, C, or Q
    def code_embarked(self, df):
        # S is the reference class
        df['embarked_C'] = df.Embarked.map(lambda x: 1 if x == 'C' else 0)
        df['embarked_Q'] = df.Embarked.map(lambda x: 1 if x == 'Q' else 0)
        return df
    
    def remove_cols(self, df):
        # drop only the unwanted columns that are actually present
        unwanted = ['PassengerId','Pclass','Name','Sex','Embarked']
        present = [col for col in unwanted if col in df.columns]
        return df.drop(present, axis=1)

    def transform(self, X, *args):
        # work on a copy so the caller's DataFrame is not modified in place
        X = X.copy()
        X = self.code_pclass(X)
        X = self.code_gender(X)
        X = self.code_embarked(X)
        X = self.remove_cols(X)
        return X

    def fit(self, X, *args):
        return self

In [41]:
tproc = TitanticPreprocessor()

In [42]:
Xraw = titanic.drop('Survived', axis=1)

In [43]:
Xraw.head(2)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,S
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C


In [44]:
Xproc = tproc.transform(Xraw)

In [45]:
Xproc.head(2)

Unnamed: 0,Age,SibSp,Parch,Fare,pclass_2,pclass_3,female,embarked_C,embarked_Q
0,22.0,1,0,7.25,0,1,0,0,0
1,38.0,1,0,71.2833,0,0,1,1,0


In [46]:
pipe2 = Pipeline(steps=[
        ['tproc', TitanticPreprocessor()],
        ['ss', StandardScaler()],
        ['lr', LogisticRegression()]
    ])

In [47]:
pipe2.fit(Xraw, y)

Pipeline(steps=[['tproc', TitanticPreprocessor()], ['ss', StandardScaler(copy=True, with_mean=True, with_std=True)], ['lr', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)]])

In [48]:
pipe2.score(Xraw, y)

0.8019662921348315

In [49]:
# save the fitted pipeline; pickle files should be opened in binary mode
with open('/Users/kiefer/Desktop/pickled_pipe.p', 'wb') as f:
    cPickle.dump(pipe2, f)

---

### Use the custom TitanticPreprocessor in a pipeline

We'll put it before the StandardScaler in our original pipeline.

Fit on the training data and test on the testing data like before, with the new pipeline. You'll need to create a new X, y with the original non-manually preprocessed data!
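
The train/test workflow with a custom preprocessor can be sketched as follows. Everything here is a stand-in: the data is synthetic (not the real titanic file), and `SexCoder` is a hypothetical, much smaller substitute for the preprocessor class defined above.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class SexCoder(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the custom preprocessor: dummy-codes Sex."""

    def fit(self, X, *args):
        return self

    def transform(self, X, *args):
        X = X.copy()  # leave the caller's DataFrame untouched
        X['female'] = (X['Sex'] == 'female').astype(int)
        return X.drop('Sex', axis=1)

# Synthetic "raw" data with a string column, like the unprocessed titanic data.
rng = np.random.RandomState(0)
n = 200
raw = pd.DataFrame({'Sex': rng.choice(['male', 'female'], size=n),
                    'Age': rng.uniform(1, 80, size=n)})

# Made-up outcome: mostly determined by Sex, with roughly 10% of labels flipped.
is_female = np.asarray(raw['Sex'] == 'female', dtype=bool)
flip = rng.uniform(size=n) < 0.1
survived = np.where(flip, ~is_female, is_female).astype(int)

raw_train, raw_test, y_train, y_test = train_test_split(
    raw, survived, test_size=0.4, random_state=0)

pipe = Pipeline(steps=[('prep', SexCoder()),
                       ('ss', StandardScaler()),
                       ('lr', LogisticRegression())])
pipe.fit(raw_train, y_train)                 # every step is fit on training data only
test_accuracy = pipe.score(raw_test, y_test)  # test data reuses the fitted steps
```

Because the raw frames go straight into the pipeline, no manual preprocessing of the test set is needed.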

---

### Looking at pipeline internals with `.get_params()`

Use the `.get_params()` method on the pipeline object to get out all of the parameters from the different steps as a dictionary.

You can pull out the feature names we stored by accessing our preprocessor object from the dictionary, then pulling out the attribute from that:
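
A minimal sketch of this lookup is below. It assumes a hypothetical `DummyCoder` preprocessor that stores its output column names in an attribute called `feature_names`. Both the class and the attribute name are made up for this example, since the class defined earlier does not store the names.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class DummyCoder(BaseEstimator, TransformerMixin):
    """Hypothetical preprocessor that records its output column names."""

    def fit(self, X, *args):
        return self

    def transform(self, X, *args):
        X = X.copy()
        X['female'] = (X['Sex'] == 'female').astype(int)
        X = X.drop('Sex', axis=1)
        # feature_names is an assumed attribute name, chosen for this sketch.
        self.feature_names = list(X.columns)
        return X

frame = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                      'Age': [22.0, 38.0, 26.0]},
                     columns=['Sex', 'Age'])

pipe = Pipeline(steps=[('prep', DummyCoder()), ('ss', StandardScaler())])
pipe.fit(frame)

params = pipe.get_params()              # dictionary of steps and their parameters
names = params['prep'].feature_names    # the step object is stored under its name
scaler_setting = params['ss__with_mean']  # nested parameters use step__parameter keys
```

The same step object is also reachable through `pipe.named_steps['prep']`.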

---

### The `make_pipeline()` convenience function

`make_pipeline()` essentially does the same thing as `Pipeline`, the only difference being that you just insert your objects as arguments to the function and it will create the pipeline for you. This means that it will name the steps itself, rather than you doing it.
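
A short sketch of the automatic naming (the variable names are made up for this example):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# No step names are given; make_pipeline derives them from the class names.
auto_pipe = make_pipeline(StandardScaler(), LogisticRegression())
step_names = [name for name, _ in auto_pipe.steps]
```

The generated names are the lowercased class names, which matters when you later refer to a step by name, for example in `get_params()` keys.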