# Pipelines 
*Author: Douglas Strodtman (SaMo), adjusted by Jeff Hale (DC)*


## Learning Objectives

By the end of this lesson students will be able to:

- Understand what a pipeline is
- Use sklearn pipelines with transformers and an estimator

---

## Pipelines

Pipelines make it easier to chain multiple preprocessing transformations with a model, so that a single `fit` call fits every transformer and the final estimator on our data.

Pipelines are an extremely powerful tool for your machine learning workflow. 🛠

---

Here we'll use the Boston data with `VarianceThreshold`, `StandardScaler`, and `LinearRegression`.

In [66]:
# imports
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold, f_regression
from sklearn.model_selection import GridSearchCV

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [67]:
# read in the data
boston = pd.read_csv('../data/boston_data.csv')

In [68]:
# inspect 
boston.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,5.33,36.2


In [69]:
# break into X and y
X = boston.drop(columns=['MEDV'])
y = boston['MEDV']

In [70]:
X.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,LSTAT
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,4.98
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,9.14
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,4.03
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,2.94
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,5.33


In [71]:
y.head()

0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
Name: MEDV, dtype: float64

In [72]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
X_train.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,LSTAT
182,0.09103,0.0,2.46,0,0.488,7.155,92.2,2.7006,3,193,17.8,4.82
155,3.53501,0.0,19.58,1,0.871,6.152,82.6,1.7455,5,403,14.7,15.02
280,0.03578,20.0,3.33,0,0.4429,7.82,64.5,4.6947,5,216,14.9,3.76
126,0.38735,0.0,25.65,0,0.581,5.613,95.6,1.7572,2,188,19.1,27.26
329,0.06724,0.0,3.24,0,0.46,6.333,17.2,5.2146,4,430,16.9,7.34


## Pipeline Syntax

We set up a pipeline by passing a list of tuples in the format
```
('string_name', ClassObject())
```
Note that we can also instantiate a step beforehand and pass the resulting object in the tuple (each of the steps that we're using is an sklearn class).
```
lr = LinearRegression()
('linreg', lr)
```

We can include as many steps as we'd like. 


The final step has to be an **estimator**.
Each prior step has to be a **transformer** (an object with `fit` and `transform` methods).
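As a quick illustration of that rule, here is a minimal sketch of what happens when a non-transformer sits in the middle of a pipeline. The tiny arrays `X_tiny` and `y_tiny` are made up for this example. Depending on the sklearn version, the error is raised either when the pipeline is created or when `fit` is called, so both are wrapped in the same `try` block.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data, only here so that fit can be called
X_tiny = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]])
y_tiny = np.array([1.0, 2.0, 3.0, 4.0])

error_message = None
try:
    bad_pipeline = Pipeline([
        ('lr', LinearRegression()),  # not a transformer: it has no transform method
        ('sc', StandardScaler()),
    ])
    bad_pipeline.fit(X_tiny, y_tiny)
except TypeError as err:
    # sklearn refuses intermediate steps that cannot transform the data
    error_message = str(err)

print(error_message)
```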

Look at the following example:

Create a pipeline instance with the following named steps, using the provided names (and arguments where provided):

- `vt`: `VarianceThreshold(.10)`
- `sc`: `StandardScaler()`
- `lr`: `LinearRegression()`

You can name the steps whatever you like. The names are only used to refer to the steps later, if you want to.

In [73]:
list_for_pipeline = [
    ('vt', VarianceThreshold(threshold=.10)),
    ('sc', StandardScaler()),
    ('lr', LinearRegression())   
]


pipeline = Pipeline(list_for_pipeline)
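The step names are what let us reach back into a pipeline later. The sketch below rebuilds the same three steps under a separate name, `demo_pipeline` (chosen here so it does not overwrite `pipeline`), and shows two common uses: looking a step up through `named_steps`, and changing one of its parameters with the `<step name>__<parameter name>` convention of `set_params`.

```python
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline([
    ('vt', VarianceThreshold(threshold=.10)),
    ('sc', StandardScaler()),
    ('lr', LinearRegression()),
])

# look up a step by the name we gave it
print(demo_pipeline.named_steps['vt'])

# change a parameter of one step: step name, double underscore, parameter name
demo_pipeline.set_params(vt__threshold=0.5)
print(demo_pipeline.named_steps['vt'].threshold)
```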


To use the pipeline, we just call `fit` on it with our training data.

In [74]:
# fit with our training data
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vt', VarianceThreshold(threshold=0.1)),
                ('sc',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('lr',
                 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                                  normalize=False))],
         verbose=False)

Then we can `score` on our training data.

In [75]:
# score training data
pipeline.score(X_train, y_train)

0.7201743335728632

and on our test data.

In [76]:
# score test data
pipeline.score(X_test, y_test)

0.68201363327704

Does the fitting and scoring process look familiar?
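It should: a pipeline exposes the same `fit` and `score` methods as a single estimator. To make clear what it does for us behind the scenes, here is a rough sketch that runs the three steps by hand and compares the result with a pipeline. It uses synthetic data from `make_regression` as a stand-in (an assumption of this example, not the housing data), and the `demo`-style variable names are chosen so nothing above is overwritten.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption for this sketch)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=42)

# Manual version: fit each transformer on the training data, then fit the model
vt = VarianceThreshold(threshold=.10)
sc = StandardScaler()
lr = LinearRegression()

Xd_train_vt = vt.fit_transform(Xd_train)
Xd_train_sc = sc.fit_transform(Xd_train_vt)
lr.fit(Xd_train_sc, yd_train)

# The test data is only transformed, never fit
manual_score = lr.score(sc.transform(vt.transform(Xd_test)), yd_test)

# Pipeline version: the same steps in two calls
demo_pipeline = Pipeline([
    ('vt', VarianceThreshold(threshold=.10)),
    ('sc', StandardScaler()),
    ('lr', LinearRegression()),
])
demo_pipeline.fit(Xd_train, yd_train)
pipeline_score = demo_pipeline.score(Xd_test, yd_test)

# Both approaches should give the same test score
print(np.isclose(manual_score, pipeline_score))
```

Besides being shorter, the pipeline version removes the chance of accidentally calling `fit_transform` on the test data.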

### Create another pipeline. 

Name this one `pipe`. This time keep only the features whose variance is greater than 0.7 (the threshold is a raw variance, not a percentage). Keep all the other steps the same. Fit it and score it.


Does the performance improve?

In [77]:
pipe = Pipeline([
    ('vt', VarianceThreshold(threshold=.7)),
    ('sc', StandardScaler()),
    ('lr', LinearRegression())   
])

In [83]:
pipe

Pipeline(memory=None,
         steps=[('vt', VarianceThreshold(threshold=0.7)),
                ('sc',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('lr',
                 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                                  normalize=False))],
         verbose=False)

In [77]:
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

0.6790628755239603
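Instead of building one pipeline per threshold by hand, the `GridSearchCV` class imported at the top can try several values for us, because a pipeline's step parameters are addressable as `<step name>__<parameter name>`. The following is only a sketch on made-up data: the arrays `X_demo` and `y_demo`, the three candidate thresholds, and `cv=5` are all assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: three informative features with variances near 1, 4, and 0.01
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 2.0, 0.1])
y_demo = X_demo @ np.array([3.0, -1.0, 20.0]) + rng.normal(scale=0.5, size=200)

search = GridSearchCV(
    Pipeline([
        ('vt', VarianceThreshold()),
        ('sc', StandardScaler()),
        ('lr', LinearRegression()),
    ]),
    # 'vt__threshold' targets the threshold parameter of the step named 'vt'
    param_grid={'vt__threshold': [0.0, 0.1, 0.7]},
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

In this toy data the low-variance feature is still informative, which is a reminder that a small variance does not by itself make a feature useless.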

### Which features were left after VarianceThreshold did its job?

In [80]:
features = pipe.named_steps['vt']
features

VarianceThreshold(threshold=0.7)

In [81]:
features.get_support()

array([ True,  True,  True, False, False, False,  True,  True,  True,
        True,  True,  True])

In [82]:
X.columns[features.get_support()]

# https://stackoverflow.com/a/43189631/4590385

Index(['CRIM', ' ZN ', 'INDUS ', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO',
       'LSTAT'],
      dtype='object')
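To see why a column is dropped, it can help to look at the variances the selector computed during `fit`, which are stored in its `variances_` attribute. Below is a small sketch on a made-up DataFrame named `toy` (the column names and values are invented for this example). Note that a 0/1 column such as `flag` can never have a variance above 0.25, so a threshold of 0.7 will always remove it.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Made-up columns with very different spreads
toy = pd.DataFrame({
    'wide': [0.0, 10.0, 20.0, 30.0],      # large spread
    'narrow': [0.50, 0.52, 0.48, 0.50],   # values barely move
    'flag': [0, 0, 0, 1],                 # binary column
})

selector = VarianceThreshold(threshold=0.7)
selector.fit(toy)

# pair each column name with the variance computed during fit
variances = pd.Series(selector.variances_, index=toy.columns)
print(variances)

# only columns whose variance is above the threshold are kept
kept = list(toy.columns[selector.get_support()])
print(kept)
```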

# Summary

You've seen how to use pipelines in sklearn.

## Check for understanding

- Why would you want to use a Pipeline?
- What does every step in a Pipeline before the final step have to be?
- What does the final step of a Pipeline have to be?

Pipelines are a timesaving tool for your toolkit! 🛠