# Pipelines 
*Author: Douglas Strodtman (SaMo), adjusted by Jeff Hale (DC)*


## Learning Objectives

By the end of this lesson students will be able to:

- Understand what a pipeline is
- Use sklearn pipelines with transformers and an estimator

---

## Pipelines

Pipelines make it easier to chain multiple preprocessing transformations and then fit a model to our data, all through a single object.

Pipelines are an extremely powerful tool for your machine learning workflow. 🛠
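
To make the idea concrete, here is a minimal sketch comparing a manual scale-then-fit workflow with the equivalent pipeline. The data comes from `make_regression` as a synthetic stand-in (an assumption, not the lesson's Boston data), and the names `X_demo`, `y_demo`, and `demo_pipe` are made up for this illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the lesson's Boston data)
X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)

# Manual workflow: fit and apply each step by hand
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)
manual_model = LinearRegression().fit(X_scaled, y_demo)
manual_preds = manual_model.predict(scaler.transform(X_demo))

# Pipeline workflow: one object chains the same two steps
demo_pipe = Pipeline([('sc', StandardScaler()), ('lr', LinearRegression())])
demo_pipe.fit(X_demo, y_demo)
pipe_preds = demo_pipe.predict(X_demo)

# Both workflows should give the same predictions
same_predictions = np.allclose(manual_preds, pipe_preds)
print(same_predictions)
```

The pipeline version has fewer intermediate variables to keep track of, which matters more as the number of preprocessing steps grows.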

---

Here we'll use the Boston data with `VarianceThreshold`, `SelectKBest`, `StandardScaler`, and `LinearRegression`.

In [1]:
# imports
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_regression
from sklearn.model_selection import GridSearchCV

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# read in the data
boston = pd.read_csv('../data/boston_data.csv')

In [3]:
# inspect 
boston.head()

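| Unnamed: 0 | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 5.33 | 36.2 |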


In [4]:
# break into X and y
X = boston.drop('MEDV', axis=1)
y = boston['MEDV']

In [5]:
X.head()

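| Unnamed: 0 | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 5.33 |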


In [6]:
y.head()

0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
Name: MEDV, dtype: float64

In [7]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=41)

## Pipeline Syntax

We set up a pipeline by passing a list of tuples in the format
```
('string_name', ClassObject())
```
Note that we can also instantiate a step beforehand and pass the instance in the tuple (each of the steps that we're using is an instance of a sklearn class).
```
lr = LinearRegression()
('linreg', lr)
```

We can include as many steps as we'd like. 


The final step has to be an **estimator** (it must implement `fit`).
Each prior step has to be a **transformer** (it must implement both `fit` and `transform`).
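
One rough way to see the difference is to check which methods each object exposes. The sketch below uses the four classes from this lesson; the list name `candidate_steps` is made up for illustration, and checking with `hasattr` is only a quick heuristic, not how sklearn itself is documented to validate steps.

```python
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# The four objects used in this lesson, in pipeline order
candidate_steps = [
    VarianceThreshold(threshold=.05),
    StandardScaler(),
    SelectKBest(f_regression, k=5),
    LinearRegression(),
]

# Every step can be fit; only the transformers expose transform
for step in candidate_steps:
    print(type(step).__name__,
          '| fit:', hasattr(step, 'fit'),
          '| transform:', hasattr(step, 'transform'))

transformer_flags = [hasattr(step, 'transform') for step in candidate_steps]
```

Only `LinearRegression` lacks a `transform` method, which is why it can only sit in the final position.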

Look at the following example:

In [8]:
# create a pipeline instance with the following steps using the provided names (and arguments where provided)
#     var_thresh: VarianceThreshold(threshold=.05)
#     sc: StandardScaler()
#     kbest: SelectKBest(f_regression, k=5)
#     lr: LinearRegression()

In [9]:
pipeline = Pipeline([
    ('var_thresh', VarianceThreshold(threshold=.05)),  # feature selector that removes all low-variance features
    ('sc', StandardScaler()),                          # standardize each feature to zero mean and unit variance
    ('kbest', SelectKBest(f_regression, k=5)),         # keep the 5 features with the highest f_regression scores;
                                                       # f_regression scores each feature on its own with a
                                                       # univariate linear regression F-test
                                                       # https://stats.stackexchange.com/a/253255/198892
    ('lr', LinearRegression())])                       # final estimator

To use the pipeline, we just call `fit` on it with our training data.

In [10]:
# fit with our training data
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('var_thresh', VarianceThreshold(threshold=0.05)),
                ('sc',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('kbest',
                 SelectKBest(k=5,
                             score_func=<function f_regression at 0x1a19fdfef0>)),
                ('lr',
                 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                                  normalize=False))],
         verbose=False)
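
Once a pipeline is fitted, each fitted step can be reached by name through `named_steps`. The following is a minimal sketch on synthetic data from `make_regression` (an assumption, not the Boston data), with a shortened pipeline that leaves out `VarianceThreshold`; `X_demo`, `y_demo`, and `demo_pipe` are names invented for this example.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the lesson's Boston data)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, n_informative=3,
                                 noise=5.0, random_state=0)

demo_pipe = Pipeline([
    ('sc', StandardScaler()),
    ('kbest', SelectKBest(f_regression, k=3)),
    ('lr', LinearRegression())])
demo_pipe.fit(X_demo, y_demo)

# Boolean mask over the 8 input columns: True where kbest kept the feature
kept_mask = demo_pipe.named_steps['kbest'].get_support()
print(kept_mask)

# The regression only sees the selected columns, so it has one coefficient each
coefs = demo_pipe.named_steps['lr'].coef_
print(coefs)
```

The same pattern should work on the lesson's `pipeline` object, for example `pipeline.named_steps['kbest'].get_support()`, to see which columns survived feature selection.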

Then we can `score` the pipeline on our training data.

In [11]:
# score training data
pipeline.score(X_train, y_train)

0.694095011029009

And on our test data:

In [12]:
# score test data
pipeline.score(X_test, y_test)

0.5871056549075859
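
For a pipeline ending in `LinearRegression`, `score` reports R², computed after the fitted transformers have been applied to the data passed in. The sketch below checks this against `r2_score` on synthetic data from `make_regression` (an assumption, not the Boston data); the variable names are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the lesson's Boston data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=41)

demo_pipe = Pipeline([('sc', StandardScaler()), ('lr', LinearRegression())])
demo_pipe.fit(X_tr, y_tr)  # the scaler learns its mean and std from the training rows only

# score() transforms X_te with the already-fitted scaler, predicts, and returns R^2
pipe_score = demo_pipe.score(X_te, y_te)
manual_r2 = r2_score(y_te, demo_pipe.predict(X_te))

scores_match = np.isclose(pipe_score, manual_r2)
print(scores_match)
```

Because the transformers are fitted only inside `fit`, scoring on the test set does not leak test-set statistics into the preprocessing.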

Create another pipeline and name it `pip`. This time select only the 3 best features and keep all the other steps the same. Fit it on the training data and score it on both the training and test data.


Does the performance improve?

# Summary

You've seen how to use pipelines in sklearn.

## Check for understanding

- Why would you want to use a Pipeline?
- What does every step in a Pipeline before the final step have to be?
- What does the final step of a Pipeline have to be?

Pipelines are a timesaving tool for your toolkit! 🛠