# Pipelines

Almost every machine learning problem involves several steps before you arrive at a final result. Commonly, you must:

* load the data
* clean the data
* extract or engineer features
* fit a model
* evaluate this model
* iterate

Scikit-learn is designed from the ground up to make these steps easy, with minimal boilerplate. Recall that there are three (really, four) main objects and interfaces in scikit-learn:

* Estimators
* Predictors
* Transformers

The scikit-learn [pipeline](http://scikit-learn.org/stable/modules/pipeline.html) abstraction builds on these interfaces, allowing us to build a chain of transformers and estimators and then use the whole pipeline as if it were a single estimator.

We've already seen some steps that are common in a machine learning pipeline. In the coming sections, we're going to dive into some more methods that we may also want to use. First, let's fix ideas.

## Transformer Interface

Recall the transformer interface. A transformer is intended to filter or modify the data in a supervised or unsupervised way.

```python
new_data = obj.transform(data)
```

The interface is

```python
class Transformer:

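    def fit(self, X, y=None):
        """Learn any parameters needed from X (and optionally y)."""
        return self

    def transform(self, X):
        """Apply the learned transformation and return the new data."""
        return X

    def fit_transform(self, X, y=None):
        """Fit to the data, then transform it."""
        return self.fit(X, y).transform(X)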
```
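As a concrete illustration of this interface, here is a minimal sketch of a custom transformer. The class name `MeanCenterer` and the toy array are hypothetical and exist only for this example; inheriting from `TransformerMixin` supplies `fit_transform` for free.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that subtracts each column's training mean."""

    def fit(self, X, y=None):
        # Learn the per-column means from the data passed to fit only.
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        # Reuse the statistics learned in fit; nothing is re-estimated here.
        return np.asarray(X, dtype=float) - self.mean_


# Toy data, made up for illustration.
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
centered = MeanCenterer().fit_transform(X_demo)
print(centered)
```

Keeping the learned state in `fit` and only applying it in `transform` is what lets the same transformer be reused on held-out data later.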

## Estimator Interface

Recall the estimator interface.

```python
class Estimator:
  
    def fit(self, X, y=None):
        """Fit model to data X (and y)"""
        self.some_attribute = self.some_fitting_method(X, y)
        return self
            
    def predict(self, X_test):
        """Make prediction based on passed features"""
        pred = self.make_prediction(X_test)
        return pred
```
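To make the estimator interface concrete, here is a small sketch of a baseline regressor that always predicts the mean of the training targets. The class name `MeanRegressor` and the toy arrays are assumptions made for this example, not part of scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin


class MeanRegressor(BaseEstimator, RegressorMixin):
    """Hypothetical estimator that predicts the training-target mean."""

    def fit(self, X, y):
        # "Fitting" here just stores the mean of the training targets.
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        # Return one prediction per row of X.
        return np.full(len(X), self.mean_)


# Toy data, made up for illustration.
X_small = np.array([[0.0], [1.0], [2.0]])
y_small = np.array([1.0, 2.0, 3.0])

model = MeanRegressor().fit(X_small, y_small)
pred = model.predict(np.array([[5.0], [6.0]]))
print(pred)
```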

## Putting it Together

```python
from sklearn.pipeline import Pipeline

# Schematic only: Transformer, Estimator, and their arguments stand in
# for concrete scikit-learn classes and settings.
estimator = Pipeline([
    ('transformer1', Transformer(*args1)),
    ('transformer2', Transformer(*args2)),
    ('estimator', Estimator(*args))
])

estimator.fit(X_train, y_train)

y_pred = estimator.predict(X_test)
```

By chaining transformers and estimators together, our code is much easier to read and maintain than it would be if we managed each step by hand.

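Under the hood, the pipeline's `fit` calls `fit` on the first transformer, then `transform` on `X`, and passes the transformed `X` to the next step, and so on until the final estimator, where it simply calls `fit` on the transformed `X` and `y`. At prediction time, `predict` applies each transformer's `transform` in turn and then calls `predict` on the final estimator.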
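The chaining can be sketched by hand and compared with what `Pipeline` does. This is an illustration rather than scikit-learn's actual implementation; the choice of the iris data, `StandardScaler`, two PCA components, and `max_iter=1000` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Manual chaining: fit and transform each step, feeding the result forward.
scaler = StandardScaler()
pca = PCA(n_components=2)
clf = LogisticRegression(max_iter=1000)

X_scaled = scaler.fit_transform(X, y)
X_reduced = pca.fit_transform(X_scaled, y)
clf.fit(X_reduced, y)

# At prediction time only transform is called on the fitted transformers.
manual_pred = clf.predict(pca.transform(scaler.transform(X)))

# The same steps expressed as a pipeline (fresh, unfitted instances).
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('clf', LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
pipe_pred = pipe.predict(X)

print(np.array_equal(manual_pred, pipe_pred))
```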

### Example: Chaining PCA and Logistic Regression

Recall that we can use PCA for unsupervised dimensionality reduction.

In [None]:
from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

logistic = linear_model.LogisticRegression()

pca = decomposition.PCA()

In [None]:
estimator = Pipeline(steps=[
    ('pca', pca), 
    ('logistic', logistic)
])

In [None]:
digits = datasets.load_digits()

(X_digits, X_digits_test, 
 y_digits, y_digits_test) = train_test_split(digits.data, digits.target, test_size=.2)

In [None]:
X_digits[:2]

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

fig, axes = plt.subplots(3, 3)

def plot_image(ax, img):
    ax.imshow(img.reshape(8, 8), cmap=plt.cm.gray_r)
    
for i in range(3):
    for j in range(3):
        plot_image(axes[i, j], X_digits[i * 3 + j])

In [None]:
y_digits[:9]

Calling `fit` on the pipeline fits every step in order.

In [None]:
estimator.fit(X_digits, y_digits)

Each fitted transformer or estimator is available from the pipeline.

In [None]:
estimator.named_steps.keys()
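Because each step is stored by name, its fitted attributes can be inspected after training. The following self-contained sketch rebuilds a similar pipeline under a different name (`demo_pipe`); the settings `n_components=20` and `max_iter=5000` are assumptions chosen for this example, not values used elsewhere in this notebook.

```python
from sklearn import datasets, decomposition, linear_model
from sklearn.pipeline import Pipeline

digits = datasets.load_digits()

demo_pipe = Pipeline(steps=[
    ('pca', decomposition.PCA(n_components=20)),
    ('logistic', linear_model.LogisticRegression(max_iter=5000)),
])
demo_pipe.fit(digits.data, digits.target)

# Look up fitted steps by the names given when the pipeline was built.
fitted_pca = demo_pipe.named_steps['pca']
variance_kept = fitted_pca.explained_variance_ratio_.sum()

# The classifier saw PCA-reduced features, so it has one column per component.
n_coef_rows, n_coef_cols = demo_pipe.named_steps['logistic'].coef_.shape

print(variance_kept, n_coef_rows, n_coef_cols)
```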

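How did we do? `score` reports mean accuracy on the held-out test set, and the confusion matrix that follows breaks the errors down by class.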

In [None]:
estimator.score(X_digits_test, y_digits_test)

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_digits_test, 
                 estimator.predict(X_digits_test))

If you find naming the steps a little tedious, the convenience function `make_pipeline` names them for you. It uses the lowercased class name of each step and appends a number when the same class appears more than once, which avoids collisions.

In [None]:
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(pca, logistic)

pipe.named_steps

In [None]:
pipe = make_pipeline(pca, pca, logistic)

pipe.named_steps
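The generated names can be checked directly. This small sketch uses fresh estimator instances rather than the `pca` and `logistic` objects defined earlier, so it runs on its own.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

single = make_pipeline(PCA(), LogisticRegression())
double = make_pipeline(PCA(), PCA(), LogisticRegression())

# Step names are derived from the lowercased class names.
single_names = [name for name, _ in single.steps]
# Repeated classes get a numeric suffix so that names stay unique.
double_names = [name for name, _ in double.steps]

print(single_names)  # ['pca', 'logisticregression']
print(double_names)  # ['pca-1', 'pca-2', 'logisticregression']
```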

### Avoid Contamination

The following is an example of a common gotcha in statistical learning.

In [None]:
import numpy as np

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


newsgroups = fetch_20newsgroups(categories=[
    'sci.space', 'alt.atheism', 'comp.graphics'
])

(X, X_test, 
 y, y_test) = train_test_split(newsgroups.data, newsgroups.target)

vectorizer = TfidfVectorizer()
X_vect = vectorizer.fit_transform(X)

clf = LogisticRegression()

param_grid = {
    'C': np.logspace(-1, 2, num=4)
}

grid = GridSearchCV(clf, param_grid=param_grid, cv=5)

grid.fit(X_vect, y)

Can anyone see what we did wrong here?
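One way to read the problem: the vectorizer was fit on all of `X` before cross-validation, so the vocabulary and IDF weights were computed using documents that later serve as validation folds. A possible remedy, sketched here on a tiny made-up corpus so that it runs without downloading data, is to put the vectorizer inside the pipeline so it is refit on the training portion of each fold. The step names `tfidf` and `clf`, the documents, and `cv=3` are assumptions for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus, invented for illustration: 0 = space, 1 = graphics.
docs = [
    "the rocket reached orbit", "nasa launched a new satellite",
    "the shuttle docked with the station", "astronauts repaired the telescope",
    "the probe flew past jupiter", "a comet passed near the moon",
    "the gpu renders polygons quickly", "shaders compute pixel colors",
    "the image uses a jpeg format", "ray tracing simulates light",
    "the texture was mapped onto the mesh", "opengl draws the triangles",
]
labels = np.array([0] * 6 + [1] * 6)

# The vectorizer is now a pipeline step, so each CV split fits it
# on that split's training documents only.
text_clf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression()),
])

# Parameters of a step are addressed as <step name>__<parameter name>.
param_grid = {'clf__C': np.logspace(-1, 2, num=4)}

grid = GridSearchCV(text_clf, param_grid=param_grid, cv=3)
grid.fit(docs, labels)

print(grid.best_params_)
```

Note that the grid search now receives the raw documents rather than a pre-vectorized matrix.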

### Exercise

Create a pipeline out of a [`StandardScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) and [`Ridge`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge) regression and apply it to the Boston housing dataset (load the data using `sklearn.datasets.load_boston`). Note that `load_boston` was removed in scikit-learn 1.2; on newer versions, substitute another regression dataset, for example the one returned by `sklearn.datasets.fetch_california_housing`. Then try adding the [`PolynomialFeatures`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) transformer as a second preprocessing step, and grid-search the degree of the polynomials (try 1, 2 and 3).

Hint: See the scikit-learn [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn-pipeline-pipeline) on how to pass grid-search parameters through to individual steps of a pipeline.

In [None]:
%load solutions/4a-ridge-grid.py