<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Pipelines in Scikit-Learn

_Authors: Kiefer Katovich (SF)_

---

### Learning Objectives

- Learn what a scikit-learn pipeline is and the scenarios for which it's useful.
- Standardize data as part of a pipeline.
- Use pipelines with training and testing data.
- Practice object-oriented programming and building a custom transformation in scikit-learn.
- Put the custom Titanic preprocessor into a pipeline.
- Investigate the internals of scikit-learn pipelines.
- Practice using the `make_pipeline()` function to easily create pipeline objects.


### Lesson Guide
- [Introduction](#intro)
- [Loading the Pipeline Objects](#pipe-objects)
- [Processing Steps for the Titanic Data](#steps)
- [Standardizing Data as Part of a Pipeline](#standardize)
- [Pipelines With Training and Testing Data](#pipe-train-test)
- [Built-in Transformations and Preprocessing Steps](#built-in)
- [Custom Transformations](#custom)
- [Putting the Custom `TitanicPreprocessor()` in a Pipeline](#custom-pipe)
- [Pipeline Internals](#internals)
- [The `make_pipeline()` Convenience Function](#make-pipe)

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

<a id='intro'></a>

## Introduction to Pipelines

---

When working with data, the same process is often repeated multiple times and can become tedious to recode. A simple example of this is standardizing data before using regularized regression or other models.

Luckily, scikit-learn has "pipelines" that chain together multiple steps in a data analysis process. By constructing these, you can consolidate all of the steps in a process into a single object.

This code-along introduces how to use pipelines and also serves as object-oriented programming practice.


### Load the Titanic Data

In [2]:
titanic = pd.read_csv('./datasets/titanic_clean.csv')

<a id='pipe-objects'></a>

## Loading the Pipeline Objects

---

From the `sklearn.pipeline` module, we’re going to import `Pipeline` and `make_pipeline`.

`Pipeline` is the class that will hold our data analysis process. `make_pipeline()` is a convenience function that takes in a series of estimators or preprocessing steps and returns a `Pipeline` object.

We'll start with the more explicit construction using `Pipeline` and then move on to the convenience function.

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline


The term "pipeline" is jargon for a series of concatenated data transformations. Each stage of a pipeline builds off its predecessor (i.e., the output of a stage is plugged into the input of the next stage) and data flow through the pipeline from beginning to end.


![pipeline](./assets/pipeline.png)

---

Pipelines provide a higher level of abstraction than the individual building blocks of a data science process and are a convenient way to organize analyses.
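To make the "output of one stage feeds the next" idea concrete, here is a minimal sketch on synthetic data (the arrays and step names below are assumptions for illustration, not part of the Titanic lesson). It chains a scaler and a model by hand and then wraps the same two stages in a `Pipeline`; both routes should produce the same predictions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic toy data (illustrative only, not the Titanic set).
rng = np.random.RandomState(0)
X_toy = rng.normal(loc=50, scale=10, size=(100, 3))
y_toy = (X_toy[:, 0] + rng.normal(scale=5, size=100) > 50).astype(int)

# Manual chaining: the scaler's output becomes the model's input.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)
manual_model = LogisticRegression().fit(X_scaled, y_toy)
manual_pred = manual_model.predict(scaler.transform(X_toy))

# The same two stages wrapped in a single Pipeline object.
toy_pipe = Pipeline([('scale', StandardScaler()),
                     ('model', LogisticRegression())])
pipe_pred = toy_pipe.fit(X_toy, y_toy).predict(X_toy)

# Compare the two sets of predictions.
print(np.array_equal(manual_pred, pipe_pred))
```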

**Let's take a look at the Titanic data:**

In [6]:
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.05,S


<a id='steps'></a>

## Processing Steps for the Titanic Data

---

There are some preprocessing steps we’ll complete before classifying whether or not passengers survived.

1) Remove unwanted columns.
2) Convert categorical string or numeric columns to dummy-coded columns.
3) Standardize the predictor matrix.

For now, we'll do this manually and integrate it into the pipeline later.

In [5]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 712 entries, 0 to 711
Data columns (total 10 columns):
PassengerId    712 non-null int64
Survived       712 non-null int64
Pclass         712 non-null int64
Name           712 non-null object
Sex            712 non-null object
Age            712 non-null float64
SibSp          712 non-null int64
Parch          712 non-null int64
Fare           712 non-null float64
Embarked       712 non-null object
dtypes: float64(2), int64(5), object(3)
memory usage: 55.7+ KB


In [18]:
df = titanic.drop(['PassengerId','Name'], axis=1)

In [19]:
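# Dummy-code the categorical columns (Pclass is numeric, so get_dummies leaves it unchanged).
dummies = pd.get_dummies(df[['Pclass', 'Sex', 'Embarked']], drop_first=True)

# Merge the dummy-coded columns back onto the remaining columns by index.
df2 = (df.drop(['Pclass', 'Sex', 'Embarked'], axis=1)
         .merge(dummies, left_index=True, right_index=True, how='outer'))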

In [20]:
df2.head()

Unnamed: 0,Survived,Age,SibSp,Parch,Fare,Pclass,Sex_male,Embarked_Q,Embarked_S
0,0,22.0,1,0,7.25,3,1,0,1
1,1,38.0,1,0,71.2833,1,0,0,0
2,1,26.0,0,0,7.925,3,0,0,1
3,1,35.0,1,0,53.1,1,0,0,1
4,0,35.0,0,0,8.05,3,1,0,1


<a id='standardize'></a>
## Using a Pipeline to Standardize the Data and Fit the Model

---

Now, we'll split the data up into the `x`, `y` predictor-target format, standardize the `x` matrix, and fit a logistic regression model on `Survived`.

First, split into `x` and `y`:

In [23]:
x = df2.iloc[:,1:]
y = df2['Survived']

Import the `LogisticRegression` and `StandardScaler` classes.

In [24]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

Next, we're going to build a pipeline that can combine the steps above. Below, we create the `StandardScaler` and `LogisticRegression` objects, then combine them into the `Pipeline` object.

In [25]:
pipe1 = Pipeline([('SS', StandardScaler()),
                  ('logit', LogisticRegression())])

**Pipelines combine both preprocessing and model-building steps into a single object**. 

Rather than manually building transformations and feeding them into the models, pipelines tie both of these steps together.

Furthermore, a pipeline exposes the methods of its final estimator step:

- `fit()`.
- `predict()` and/or `predict_proba()`.
- `score()`.
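As a rough illustration of those methods, the sketch below fits a scaler-plus-logistic-regression pipeline on synthetic data (the data and the `demo_pipe` name are assumptions for this example) and calls `predict()`, `predict_proba()`, and `score()` directly on the pipeline object.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data with a binary target (illustrative only).
rng = np.random.RandomState(1)
X_demo = rng.normal(size=(80, 2))
y_demo = (X_demo[:, 0] - X_demo[:, 1] > 0).astype(int)

demo_pipe = Pipeline([('SS', StandardScaler()),
                      ('logit', LogisticRegression())])
demo_pipe.fit(X_demo, y_demo)

# Each call runs the data through the scaler first, then the model.
labels = demo_pipe.predict(X_demo[:5])        # predicted class labels
probs = demo_pipe.predict_proba(X_demo[:5])   # one probability column per class
accuracy = demo_pipe.score(X_demo, y_demo)    # mean accuracy for a classifier

print(labels)
print(probs.round(3))
print(accuracy)
```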

Use the pipeline to fit the model.


In [26]:
pipe1.fit(x, y)

Pipeline(memory=None,
     steps=[('SS', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logit', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [29]:
pred = pipe1.predict(x)

In [32]:
pipe1.score(x, y)

0.800561797752809

<a id='pipe-train-test'></a>

## Using Pipelines With Training and Testing Data

---

Next, we'll split up this data into training and testing sets. One of the greatest benefits to using pipelines is that the preprocessing steps retain the “fit” information from the training data, which can then be applied to the testing data.

In the pipeline above, for example, the first standardization step is fit on the data we put into it. This means that the `StandardScaler` object takes the mean and standard deviation of the data and performs the procedure with those values.

It _also_ means that, were we to predict or score on future data, the standard scaler in the pipeline would use the training data's mean and standard deviation to standardize the testing data. This is what we want! You don't want to standardize the training and testing data to their own means and standard deviations.

There are many scenarios in which the testing data are actually data we haven’t collected yet. In these cases, you need to save the standardization procedure you used on the training set to use on this future data.
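One possible way to save a fitted pipeline for data that arrives later is to serialize it. The sketch below uses Python's built-in `pickle` module on synthetic data; this is an assumption for illustration rather than the lesson's own method, and pickled objects should only be loaded from trusted sources and with a matching scikit-learn version.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic "already collected" data (illustrative only).
rng = np.random.RandomState(7)
X_collected = rng.normal(loc=40, scale=12, size=(120, 2))
y_collected = (X_collected[:, 0] > 40).astype(int)

fitted_pipe = Pipeline([('SS', StandardScaler()),
                        ('logit', LogisticRegression())])
fitted_pipe.fit(X_collected, y_collected)

# Serialize the fitted pipeline; these bytes could also be written to a file.
saved_bytes = pickle.dumps(fitted_pipe)

# Later: restore the pipeline and predict on data that arrived afterwards.
restored_pipe = pickle.loads(saved_bytes)
X_future = rng.normal(loc=40, scale=12, size=(5, 2))
same = np.array_equal(fitted_pipe.predict(X_future),
                      restored_pipe.predict(X_future))
print(same)
```

The restored scaler still carries the mean and standard deviation learned from the original training data, so future rows are standardized the same way.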

Below, split up into training and testing `x` and `y`:


In [33]:
from sklearn.model_selection import train_test_split
# stratify=y keeps the proportion of survivors the same in the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y)

Fit the pipeline with the training data, then score it on the testing data.

In [34]:
pipe1.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('SS', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logit', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [35]:
pipe1.score(X_test, y_test)

0.7972027972027972

In [45]:
X_train.head()

Unnamed: 0,Age,SibSp,Parch,Fare,Pclass,Sex_male,Embarked_Q,Embarked_S
472,49.0,1,0,56.9292,1,1,0,0
28,21.0,0,0,8.05,3,1,0,1
440,62.0,0,0,26.55,1,1,0,1
646,26.0,0,0,7.8875,3,1,0,1
287,40.0,1,4,27.9,3,1,0,1


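**For the sake of example, standardize `X_test` twice: once with a scaler fit on `X_test` itself and once with a scaler fit on `X_train`. The outputs differ because each scaler uses a different set of means and standard deviations.**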

In [42]:
StandardScaler().fit_transform(X_test)

array([[ 0.12500696, -0.60921364, -0.5043669 , ..., -1.30311673,
        -0.2268713 ,  0.6448578 ],
       [ 0.42211079, -0.60921364, -0.5043669 , ...,  0.76739096,
        -0.2268713 , -1.55072948],
       [-0.76630454, -0.60921364, -0.5043669 , ...,  0.76739096,
        -0.2268713 , -1.55072948],
       ...,
       [-0.54347666, -0.60921364, -0.5043669 , ...,  0.76739096,
        -0.2268713 , -1.55072948],
       [-1.50906412, -0.60921364,  1.98268366, ..., -1.30311673,
        -0.2268713 ,  0.6448578 ],
       [-2.17754774, -0.60921364,  1.98268366, ...,  0.76739096,
        -0.2268713 , -1.55072948]])

In [44]:
StandardScaler().fit(X_train).transform(X_test)

array([[ 0.17167165, -0.5428972 , -0.50768983, ..., -1.32745468,
        -0.19575793,  0.50603423],
       [ 0.44335538, -0.5428972 , -0.50768983, ...,  0.75332139,
        -0.19575793, -1.97615091],
       [-0.64337954, -0.5428972 , -0.50768983, ...,  0.75332139,
        -0.19575793, -1.97615091],
       ...,
       [-0.43961675, -0.5428972 , -0.50768983, ...,  0.75332139,
        -0.19575793, -1.97615091],
       [-1.32258887, -0.5428972 ,  1.80331428, ..., -1.32745468,
        -0.19575793,  0.50603423],
       [-1.93387727, -0.5428972 ,  1.80331428, ...,  0.75332139,
        -0.19575793, -1.97615091]])
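The difference can also be inspected directly through the scalers' fitted `mean_` and `scale_` attributes. The following is a small sketch on synthetic data (the arrays and variable names are assumptions, not the Titanic split used above).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic two-column data with different locations and spreads.
rng = np.random.RandomState(42)
X_all = rng.normal(loc=[30.0, 100.0], scale=[10.0, 40.0], size=(200, 2))
X_tr, X_te = train_test_split(X_all, test_size=0.2, random_state=42)

# Fit one scaler per subset and compare what each one learned.
train_scaler = StandardScaler().fit(X_tr)
test_scaler = StandardScaler().fit(X_te)

print('train means:', train_scaler.mean_)
print('test means: ', test_scaler.mean_)
print('train stds: ', train_scaler.scale_)
print('test stds:  ', test_scaler.scale_)

# Scaling the test set with the training scaler reuses the training
# statistics, so the result is generally not centered exactly at zero.
X_te_scaled = train_scaler.transform(X_te)
print(X_te_scaled.mean(axis=0))
```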

<a id='built-in'></a>
## Built-In Transformations and Preprocessing Steps

---

Scikit-learn comes with a wide variety of useful classes for preprocessing your data prior to model fitting that can be put into pipelines.

These can be found in the `sklearn.preprocessing` module. Familiarize yourself with these classes if you want to make use of them in your code. 

They include:

**Data Manipulators**

- `Binarizer`
- `KernelCenterer`
- `MaxAbsScaler`
- `MinMaxScaler`
- `Normalizer`
- `OneHotEncoder`
- `PolynomialFeatures`
- `RobustScaler`
- `StandardScaler`

**Data Imputation**

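- `Imputer` (removed in scikit-learn 0.22; newer releases provide `SimpleImputer` in the `sklearn.impute` module instead)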

**Function Transformer**

- `FunctionTransformer`

**Label Manipulators**

- `LabelBinarizer`
- `LabelEncoder`
- `MultiLabelBinarizer`
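Built-in transformers can be chained with one another as well as with models. As a hedged sketch, the pipeline below applies a log transform through `FunctionTransformer` and then rescales with `MinMaxScaler`; the fare values are copied from the `head()` output earlier, while the step names and the choice of `np.log1p` are assumptions for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

# The five fares shown in titanic.head(), as a single-column matrix.
fares = np.array([[7.25], [71.2833], [7.925], [53.1], [8.05]])

# A transformer-only pipeline: log(1 + x), then rescale to the [0, 1] range.
prep = Pipeline([('log', FunctionTransformer(np.log1p)),
                 ('minmax', MinMaxScaler())])

fares_scaled = prep.fit_transform(fares)
print(fares_scaled.round(3))
```

Because the final step is a transformer rather than a model, this pipeline offers `fit_transform()` and `transform()` instead of `predict()`.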

<a id='custom'></a>
## Custom Transformations

---

It's not always possible to use a built-in transformation class to do what you want. In fact, it's likely that you're going to run into a scenario where you need a customized preprocessing step before model fitting.

Let's take our Titanic data, for example. Say we wanted a preprocessor that would remove the columns we didn't want and create the dummy-coded columns before sending the set through to the standardization step.

Custom transformer classes start with this template code:


In [46]:
# We need to import the template classes to create something that works like a scikit-learn class.
from sklearn.base import BaseEstimator, TransformerMixin

# Our "TitanicPreprocessor" is going to do the processing.
class TitanicPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def transform(self, X, *args):
        return X

    def fit(self, X, *args):
        return self

Some notes on this class:

1) We have to load in the `BaseEstimator` and `TransformerMixin` classes for our preprocessor to inherit from in the class definition.
2) The two required functions are `fit()` and `transform()`, which will be used to chain the processes together in our pipeline.
3) The `*args` argument tells the function to expect an arbitrary number of arguments after whatever arguments were explicitly listed, such as the `y` that a pipeline passes to `fit()`.
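To show the template doing something, here is a minimal sketch of a transformer that drops named columns from a DataFrame. The `ColumnDropper` class and the toy frame are hypothetical names invented for this example; they are not the `TitanicPreprocessor` the exercises below ask you to build.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnDropper(BaseEstimator, TransformerMixin):
    """Hypothetical example transformer: drops the named columns."""

    def __init__(self, columns=None):
        # Store constructor arguments unchanged so get_params() can report them.
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn here; returning self allows method chaining.
        return self

    def transform(self, X):
        # errors='ignore' skips any listed column that is not present.
        return X.drop(columns=self.columns or [], errors='ignore')


toy = pd.DataFrame({'id': [1, 2, 3],
                    'age': [22.0, 38.0, 26.0],
                    'fare': [7.25, 71.28, 7.93]})

dropper = ColumnDropper(columns=['id'])
result = dropper.fit_transform(toy)   # fit_transform() comes from TransformerMixin

print(list(result.columns))
print(dropper.get_params())           # get_params() comes from BaseEstimator
```

Inheriting from the two base classes is what supplies `fit_transform()` and `get_params()` without any extra code in the class body.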

**Add the dummy coding functions we wrote to the class.**

In [14]:
# A:

**Add a function to remove the unnecessary columns after dummy coding.**

In [15]:
# A:

**Modify the `transform()` function to perform these preprocessing steps, returning the new DataFrame.**

Also, keep track of the final column names in a class attribute.

In [16]:
# A:

<a id='custom-pipe'></a>
## Use the Custom `TitanicPreprocessor()` in a Pipeline
---

We'll put it before the `StandardScaler` in our original pipeline.

In [17]:
# A:

Fit on the training data and test on the testing data with the new pipeline. You'll need to create a new `x`, `y` with the original, non-manually preprocessed data.

In [18]:
# A:
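For comparison with your own answer, the sketch below shows one way a custom step can sit in front of `StandardScaler` in a pipeline. It uses a tiny hand-made DataFrame and a hypothetical `DummyCoder` class (both assumptions for illustration). The transformer records the training column names in `fit()` and realigns to them in `transform()`, so new data ends up with the same columns even when a category is missing.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class DummyCoder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: dummy-codes string columns with pandas."""

    def fit(self, X, y=None):
        # Remember the training-time columns (first category level dropped).
        self.feature_names_ = list(pd.get_dummies(X, drop_first=True).columns)
        return self

    def transform(self, X):
        # Code every level present, then align to the training columns;
        # levels missing from X are filled with 0.
        dummied = pd.get_dummies(X)
        aligned = dummied.reindex(columns=self.feature_names_, fill_value=0)
        return aligned.astype(float)


# Toy data for illustration only.
toy = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0, 54.0],
    'Sex': ['male', 'female', 'female', 'female', 'male', 'male'],
})
survived = [0, 1, 1, 1, 0, 0]

toy_pipe = Pipeline([('dummy', DummyCoder()),
                     ('SS', StandardScaler()),
                     ('logit', LogisticRegression())])
toy_pipe.fit(toy, survived)

new_passenger = pd.DataFrame({'Age': [30.0], 'Sex': ['male']})
prediction = toy_pipe.predict(new_passenger)

print(toy_pipe.named_steps['dummy'].feature_names_)
print(prediction)
```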

<a id='internals'></a>
## Looking at Pipeline Internals With `.get_params()`

---

Use the `.get_params()` method on the pipeline object to extract all of the parameters from the different steps as a dictionary.

In [19]:
# A:
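As a small sketch of what that dictionary looks like, the example below builds a standalone two-step pipeline (the `example_pipe` name and the `C=0.5` setting are assumptions for illustration). Step objects appear under their step names, and each step's hyperparameters appear under keys of the form `<step>__<parameter>`.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

example_pipe = Pipeline([('SS', StandardScaler()),
                         ('logit', LogisticRegression(C=0.5))])
params = example_pipe.get_params()

# Step objects are stored under their step names.
print(type(params['SS']).__name__)

# Hyperparameters are stored under '<step>__<parameter>'.
print(params['logit__C'])
print(params['SS__with_mean'])

# set_params() accepts the same naming scheme.
example_pipe.set_params(logit__C=10.0)
print(example_pipe.named_steps['logit'].C)
```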

You can pull out the feature names we stored by accessing our preprocessor object from the dictionary, then pulling out the attribute.

In [20]:
# A:
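The same lookup works for attributes learned during fitting. In this sketch (synthetic data and names are assumptions for illustration), the fitted scaler is retrieved in two ways, through the `get_params()` dictionary and through the pipeline's `named_steps` attribute, and its learned `mean_` is read from it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only).
rng = np.random.RandomState(3)
X_small = rng.normal(loc=10, scale=2, size=(60, 2))
y_small = (X_small[:, 1] > 10).astype(int)

fit_pipe = Pipeline([('SS', StandardScaler()),
                     ('logit', LogisticRegression())]).fit(X_small, y_small)

# Route 1: through the get_params() dictionary, keyed by step name.
scaler_from_params = fit_pipe.get_params()['SS']

# Route 2: through the named_steps attribute.
scaler_from_steps = fit_pipe.named_steps['SS']

print(scaler_from_params.mean_)
print(scaler_from_params is scaler_from_steps)
```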

<a id='make-pipe'></a>
## The `make_pipeline()` Convenience Function

---

`make_pipeline()` does essentially the same thing as `Pipeline`, except that you pass your objects directly as arguments and the function creates the pipeline for you. This means that it names the steps itself, so you don't have to.

In [21]:
# A:
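A short sketch of the automatic naming, using the same scaler and model as earlier (the `auto_pipe` name is an assumption): `make_pipeline()` labels each step with the lowercased name of its class.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# No (name, object) tuples needed; the objects are passed directly.
auto_pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Collect the generated step names.
step_names = [name for name, _ in auto_pipe.steps]
print(step_names)
```

Those generated names are the ones to use as prefixes in `get_params()` keys and as keys in `named_steps`.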