# Pipelines 
*Author: Douglas Strodtman (SaMo), adjusted by Jeff Hale (DC)*


## Learning Objectives

By the end of this lesson, students will be able to:

- Understand what a pipeline is
- Use sklearn pipelines with transformers and an estimator

---

## Pipelines

Pipelines make it easier to chain multiple preprocessing transformations together with a model: a single call to `fit` fits each transformer, in order, and then fits the model on the transformed data.

Pipelines are an extremely powerful tool for your machine learning workflow. 🛠

---

Here we'll use the Boston data with `VarianceThreshold`, `SelectKBest`, `StandardScaler`, and `LinearRegression`.

In [None]:
# imports
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_regression
from sklearn.model_selection import GridSearchCV

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# read in the data
boston = pd.read_csv('../data/boston_data.csv')

In [None]:
# inspect 
boston.head()

In [None]:
# break into X and y


In [None]:
X.head()

In [None]:
y.head()

In [None]:
# train test split
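
As a rough sketch of the two steps above, the cell below splits a small synthetic DataFrame into `X` and `y` and then into training and test sets. The column names (`a`, `b`, `c`, `target`) are hypothetical stand-ins, not the columns of `boston_data.csv`; swap in the real target column from `boston.head()`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lesson's DataFrame (hypothetical column names).
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "target"])

# Features are every column except the target; the target is a single Series.
X = df.drop(columns="target")
y = df["target"]

# Hold out part of the rows for testing (25% by default); fix the seed so the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

print(X_train.shape, X_test.shape)
```

Splitting before building the pipeline means every transformer is fit only on the training rows.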


## Pipeline Syntax

We set up a pipeline by passing a list of tuples in the format
```
('string_name', ClassObject())
```
Note that we can also create the object for a step beforehand and pass it in the tuple (each of the objects that we're using is an instance of a sklearn class).
```
lr = LinearRegression()
('linreg', lr)
```

We can include as many steps as we'd like. 


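The final step has to be an **estimator** (an object with a `fit` method, such as a model).
Each prior step has to be a **transformer** (an object with both `fit` and `transform` methods).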
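
To make the syntax concrete, here is a minimal two-step sketch. The name `demo_pipe` and the step names `ss` and `lr` are arbitrary labels chosen for illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Each step is a (name, instance) tuple; order in the list is the order of execution.
demo_pipe = Pipeline([
    ("ss", StandardScaler()),    # transformer: has fit and transform
    ("lr", LinearRegression()),  # final estimator: has fit and predict
])

# Steps can be looked up by name once the pipeline exists.
print(list(demo_pipe.named_steps))
print(hasattr(demo_pipe.named_steps["ss"], "transform"))
```

The step names matter later: they are how you refer to an individual step or its parameters.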

Look at the following example:

In [None]:
# create a pipeline instance with the following steps using the provided names (and arguments where provided)
#     var_thresh: VarianceThreshold(.05)
#     ss: StandardScaler()
#     kbest: SelectKBest(f_regression, k=5)
#     lr: LinearRegression()

In [None]:
# Note: SelectKBest(f_regression, k=4) scores each feature on its own with a
# univariate linear regression F-test (a one-feature linear model with an intercept,
# like LinearRegression(fit_intercept=True)) and keeps the 4 features with the highest scores.
# https://stats.stackexchange.com/a/253255/198892
  

To use the pipeline, we just call `fit` on it with our training data.

In [None]:
# fit with our training data


Then we can `score` on our training data.

In [None]:
# score training data


And on our test data.

In [None]:
# score test data
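
The fit-then-score pattern can be sketched end to end as follows. This uses synthetic data from `make_regression` rather than the Boston CSV, and `demo_pipe` is a hypothetical name, so the scores it prints say nothing about the lesson's dataset.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the lesson's dataset.
X, y = make_regression(n_samples=200, n_features=10, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

demo_pipe = Pipeline([
    ("ss", StandardScaler()),
    ("kbest", SelectKBest(f_regression, k=5)),
    ("lr", LinearRegression()),
])

# fit() fits and applies each transformer in order on the training data,
# then fits the final estimator on the transformed result.
demo_pipe.fit(X_train, y_train)

# score() passes the data through the already-fitted transformers and then
# calls the final estimator's score (R^2 for LinearRegression).
train_r2 = demo_pipe.score(X_train, y_train)
test_r2 = demo_pipe.score(X_test, y_test)
print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
```

Because the scaler and the feature selector are fit only inside `fit`, the test rows never influence the preprocessing.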


Create another pipeline. Name this one `pipe`. This time select only the 3 best features. Keep all the other steps the same. Fit it and score it.


Does the performance improve?
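
One way to run this kind of comparison without retyping every step is to copy a pipeline and change a single parameter using the `stepname__parameter` convention. The sketch below does this on synthetic data; the names `base` and `scores` are hypothetical, and the results do not carry over to the Boston data.

```python
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; the scores below do not reflect the lesson's dataset.
X, y = make_regression(n_samples=200, n_features=10, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base = Pipeline([
    ("ss", StandardScaler()),
    ("kbest", SelectKBest(f_regression, k=5)),
    ("lr", LinearRegression()),
])

scores = {}
for k in (5, 3):
    # clone() gives an unfitted copy; "kbest__k" targets the k parameter of the "kbest" step.
    candidate = clone(base).set_params(kbest__k=k)
    candidate.fit(X_train, y_train)
    scores[k] = candidate.score(X_test, y_test)

print(scores)
```

Comparing test scores side by side like this makes it easier to judge whether dropping features helped or hurt.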

# Summary

You've seen how to use pipelines in sklearn.

## Check for understanding

- Why would you want to use a Pipeline?
- What does every step in a Pipeline before the final step have to be?
- What does the final step of a Pipeline have to be?

Pipelines are a timesaving tool for your toolkit! 🛠