<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Pipelines in Sklearn



---

## Learning Objectives
### Core
- Learn what an sklearn pipeline is and scenarios where they are useful
- Standardize data as part of a pipeline
- Use pipelines with training and testing data
- Use the `make_pipeline` function to easily create pipeline objects

### Target
- Be able to build a custom transformation in sklearn and use it in a pipeline
- Investigate the internals of sklearn pipelines

### Stretch
- Understand the concepts behind transformers and estimators
- Get an intuition of how we can leverage the pipeline with gridsearch

<h1>Lesson Guide<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Learning-Objectives" data-toc-modified-id="Learning-Objectives-1">Learning Objectives</a></span><ul class="toc-item"><li><span><a href="#Core" data-toc-modified-id="Core-1.1">Core</a></span></li><li><span><a href="#Target" data-toc-modified-id="Target-1.2">Target</a></span></li><li><span><a href="#Stretch" data-toc-modified-id="Stretch-1.3">Stretch</a></span></li></ul></li><li><span><a href="#Introduction-to-pipelines" data-toc-modified-id="Introduction-to-pipelines-2">Introduction to pipelines</a></span><ul class="toc-item"><li><span><a href="#Load-the-titanic-data" data-toc-modified-id="Load-the-titanic-data-2.1">Load the titanic data</a></span></li></ul></li><li><span><a href="#Loading-the-pipeline-objects" data-toc-modified-id="Loading-the-pipeline-objects-3">Loading the pipeline objects</a></span><ul class="toc-item"><li><span><a href="#The-titanic-data" data-toc-modified-id="The-titanic-data-3.1">The titanic data</a></span></li></ul></li><li><span><a href="#Preprocessing-steps-for-the-titanic-data" data-toc-modified-id="Preprocessing-steps-for-the-titanic-data-4">Preprocessing steps for the titanic data</a></span><ul class="toc-item"><li><span><a href="#Check-for-a-few-categorical-variables" data-toc-modified-id="Check-for-a-few-categorical-variables-4.1">Check for a few categorical variables</a></span></li></ul></li><li><span><a href="#Select-feature-and-target-variables" data-toc-modified-id="Select-feature-and-target-variables-5">Select feature and target variables</a></span></li><li><span><a href="#Standardize-the-data-and-fit-a-LogisticRegression-model" data-toc-modified-id="Standardize-the-data-and-fit-a-LogisticRegression-model-6">Standardize the data and fit a LogisticRegression model</a></span></li><li><span><a href="#Add-a-train-test-split" data-toc-modified-id="Add-a-train-test-split-7">Add a train-test split</a></span></li><li><span><a 
href="#Use-a-pipeline-to-standardize-the-data-and-fit-the-model" data-toc-modified-id="Use-a-pipeline-to-standardize-the-data-and-fit-the-model-8">Use a pipeline to standardize the data and fit the model</a></span></li><li><span><a href="#Using-pipelines-with-training-and-testing-data" data-toc-modified-id="Using-pipelines-with-training-and-testing-data-9">Using pipelines with training and testing data</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Split-up-into-training-and-testing-X,-y,-fit-on-the-training-data-and-score-on-training-and-test-data" data-toc-modified-id="Split-up-into-training-and-testing-X,-y,-fit-on-the-training-data-and-score-on-training-and-test-data-9.0.1">Split up into training and testing X, y, fit on the training data and score on training and test data</a></span></li></ul></li><li><span><a href="#Exercise:-Experiment-by-setting-up-a-pipeline-using-different-scaling-methods-or-models" data-toc-modified-id="Exercise:-Experiment-by-setting-up-a-pipeline-using-different-scaling-methods-or-models-9.1">Exercise: Experiment by setting up a pipeline using different scaling methods or models</a></span></li></ul></li><li><span><a href="#Built-in-transformations-and-preprocessing-steps" data-toc-modified-id="Built-in-transformations-and-preprocessing-steps-10">Built-in transformations and preprocessing steps</a></span></li><li><span><a href="#Custom-transformations" data-toc-modified-id="Custom-transformations-11">Custom transformations</a></span><ul class="toc-item"><li><span><a href="#Custom-transformer-classes-start-with-this-template-code:" data-toc-modified-id="Custom-transformer-classes-start-with-this-template-code:-11.1">Custom transformer classes start with this template code:</a></span></li><li><span><a href="#Add-functions-to-the-class" data-toc-modified-id="Add-functions-to-the-class-11.2">Add functions to the class</a></span></li><li><span><a href="#Test-the-preprocessing-function" 
data-toc-modified-id="Test-the-preprocessing-function-11.3">Test the preprocessing function</a></span></li><li><span><a href="#Use-the-custom-TitanticPreprocessor-in-a-pipeline" data-toc-modified-id="Use-the-custom-TitanticPreprocessor-in-a-pipeline-11.4">Use the custom <code>TitanticPreprocessor</code> in a pipeline</a></span></li></ul></li><li><span><a href="#Looking-at-pipeline-internals-with-.get_params()" data-toc-modified-id="Looking-at-pipeline-internals-with-.get_params()-12">Looking at pipeline internals with <code>.get_params()</code></a></span></li><li><span><a href="#The-make_pipeline()-convenience-function" data-toc-modified-id="The-make_pipeline()-convenience-function-13">The <code>make_pipeline()</code> convenience function</a></span></li><li><span><a href="#Feature-Union" data-toc-modified-id="Feature-Union-14">Feature Union</a></span><ul class="toc-item"><li><span><a href="#Feature-Extractor" data-toc-modified-id="Feature-Extractor-14.1">Feature Extractor</a></span></li><li><span><a href="#Create-a-dummifyer-class" data-toc-modified-id="Create-a-dummifyer-class-14.2">Create a dummifyer class</a></span></li><li><span><a href="#Build-the-feature-union" data-toc-modified-id="Build-the-feature-union-14.3">Build the feature union</a></span></li></ul></li><li><span><a href="#Independent-Practice" data-toc-modified-id="Independent-Practice-15">Independent Practice</a></span><ul class="toc-item"><li><span><a href="#Improve-the-model-using-grid-search" data-toc-modified-id="Improve-the-model-using-grid-search-15.1">Improve the model using grid search</a></span><ul class="toc-item"><li><span><a href="#Gridsearch-with-pipeline" data-toc-modified-id="Gridsearch-with-pipeline-15.1.1">Gridsearch with pipeline</a></span></li><li><span><a href="#Use-different-scaling" data-toc-modified-id="Use-different-scaling-15.1.2">Use different scaling</a></span></li><li><span><a href="#Binarize-predictor-variables" 
data-toc-modified-id="Binarize-predictor-variables-15.1.3">Binarize predictor variables</a></span></li><li><span><a href="#Use-polynomial-features" data-toc-modified-id="Use-polynomial-features-15.1.4">Use polynomial features</a></span></li></ul></li></ul></li><li><span><a href="#Conclusions" data-toc-modified-id="Conclusions-16">Conclusions</a></span></li><li><span><a href="#Additional-resources" data-toc-modified-id="Additional-resources-17">Additional resources</a></span></li></ul></div>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=1.5)

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

## Introduction to pipelines

---

When working with data, the same sequence of steps is often repeated many times, and recoding it each time becomes tedious. A simple example is standardizing the data before fitting a regularized regression or another scale-sensitive model.

Luckily, sklearn has "Pipelines" that chain together multiple steps in a data analysis process. By constructing these you can consolidate all of the steps you went through into a single object.

This codealong introduces how to use these pipelines and also serves as object-oriented programming practice.


### Load the titanic data

In [2]:
titanic = pd.read_csv(
    '../../../../resource-datasets/titanic/titanic_clean.csv')

## Loading the pipeline objects

---

From the `sklearn.pipeline` module we are going to import `Pipeline` and `make_pipeline`.

`Pipeline` is the class that will hold our data analysis process. `make_pipeline` is a convenience function that takes a series of preprocessing steps and estimators and returns a `Pipeline` object.

We'll start with the more explicit construction using `Pipeline` and then move on to the convenience function.

[sklearn pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline

The term "pipeline" is jargon for a series of concatenated data transformations. Each stage of a pipeline feeds from the previous stage, i.e. the output of a stage is plugged into the input of the next stage and data flows through the pipeline from beginning to end.


![pipeline](./assets/pipeline.png)

---

Pipelines provide a higher level of abstraction than the individual building blocks of a data science process and are a nice and convenient way to organize analyses.

### The titanic data

What are preprocessing steps that you would carry out before fitting a model?

In [4]:
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.05,S


## Preprocessing steps for the titanic data

---

There are some preprocessing steps we're going to do before classifying whether or not passengers survived:

- Remove unwanted columns
- Convert categorical string or numeric columns to dummy coded columns
- Standardize the predictor matrix

For now we'll do this manually and then later integrate it into the pipeline.

### Check for a few categorical variables

In [5]:
for column in ['Pclass', 'Sex', 'Embarked', 'SibSp', 'Parch']:
    print(np.sort(titanic[column].unique()))

[1 2 3]
['female' 'male']
['C' 'Q' 'S']
[0 1 2 3 4 5]
[0 1 2 3 4 5 6]


In [6]:
def drop_cols(df, columns):
    return df.drop(columns, axis=1)

In [7]:
def make_dummy_cols(X, columns):
    X = pd.get_dummies(X, columns=columns, drop_first=True)
    return X

In [8]:
# drop unwanted columns
data = drop_cols(titanic, ['PassengerId', 'Name'])
# dummify columns
data = make_dummy_cols(data, ['Pclass', 'Sex', 'Embarked'])

data.head()

Unnamed: 0,Survived,Age,SibSp,Parch,Fare,Pclass_2,Pclass_3,Sex_male,Embarked_Q,Embarked_S
0,0,22.0,1,0,7.25,0,1,1,0,1
1,1,38.0,1,0,71.2833,0,0,0,0,0
2,1,26.0,0,0,7.925,0,1,0,0,1
3,1,35.0,1,0,53.1,0,0,0,0,1
4,0,35.0,0,0,8.05,0,1,1,0,1


## Select feature and target variables


Now we'll split the data up into the X, y predictor target format, standardize the X matrix, and fit a Logistic Regression model on Survived.

In [9]:
X = data.copy()
y = X.pop('Survived')

In [10]:
X.shape, y.shape

((712, 9), (712,))

## Standardize the data and fit a LogisticRegression model

In [11]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

In [12]:
scaler = StandardScaler()
model = LogisticRegression(solver='lbfgs', random_state=1)

In [13]:
X_s = scaler.fit_transform(X)
model.fit(X_s, y)
model.score(X_s, y)

0.8019662921348315

## Add a train-test split

In [14]:
from sklearn.model_selection import train_test_split, cross_val_score

In [15]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [16]:
model.fit(X_train, y_train)
print(cross_val_score(model, X_train, y_train, cv=5).mean())
print(model.score(X_train, y_train))
print(model.score(X_test, y_test))

0.7931427142714271
0.8072289156626506
0.8130841121495327


## Use a pipeline to standardize the data and fit the model

Next we're going to build a pipeline that combines these steps: the standard scaler and the logistic regression go into a single pipeline object.

In the pipeline we give each step a name and the object it uses. Every step except the last needs both a `fit` and a `transform` method; the last step only needs `fit`. To fit the model, we call `fit` on the whole pipeline.

[Sklearn pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline)
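
As a rough illustration of what calling `fit` and `predict` on a pipeline amounts to, the sketch below chains a scaler and a logistic regression by hand and checks that a `Pipeline` built from the same steps gives the same predictions. It uses synthetic data from `make_classification` rather than the titanic data, and all variable names (`X_demo`, `demo_pipe`, ...) are chosen for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data (an assumption; not the titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)

# manual chaining: the scaler is fit and transforms the data,
# the final model is only fit on the transformed data
manual_scaler = StandardScaler()
manual_model = LogisticRegression(solver='lbfgs', random_state=1)
manual_model.fit(manual_scaler.fit_transform(X_demo), y_demo)
manual_pred = manual_model.predict(manual_scaler.transform(X_demo))

# the same two steps wrapped in a pipeline
demo_pipe = Pipeline(steps=[('scaler', StandardScaler()),
                            ('model', LogisticRegression(solver='lbfgs', random_state=1))])
demo_pipe.fit(X_demo, y_demo)
pipe_pred = demo_pipe.predict(X_demo)

# both routes should produce identical predictions
print(np.array_equal(manual_pred, pipe_pred))
```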

In [17]:
pipe = Pipeline(steps=[('scaler', scaler),
                       ('model', model)])

**Pipelines combine both pre-processing and model building steps into a single object**. 

Rather than manually building transformations and then feeding them into the models, pipelines tie both of these steps together.

Furthermore, pipelines are equipped with the methods of the final estimator step:

- `fit()`
- `predict()` and/or `predict_proba()`
- `score()`
- ... etc.

Use the pipeline to fit the model:

In [18]:
pipe.fit(X, y)

Pipeline(memory=None,
         steps=[('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('model',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='warn', n_jobs=None,
                                    penalty='l2', random_state=1,
                                    solver='lbfgs', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)

In [19]:
pipe.score(X, y)

0.8019662921348315

In [20]:
predictions = pipe.predict(X)
predictions[:10]

array([0, 1, 1, 1, 0, 0, 0, 1, 1, 1])

We can obtain the different steps involved by calling `.steps` (returning a list of name-object tuples) or `.named_steps` (returning a dictionary-like object). We can get values out of each of those.

In [21]:
pipe.steps

[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)),
 ('model',
  LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                     intercept_scaling=1, l1_ratio=None, max_iter=100,
                     multi_class='warn', n_jobs=None, penalty='l2',
                     random_state=1, solver='lbfgs', tol=0.0001, verbose=0,
                     warm_start=False))]

In [22]:
pipe.named_steps

{'scaler': StandardScaler(copy=True, with_mean=True, with_std=True),
 'model': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                    intercept_scaling=1, l1_ratio=None, max_iter=100,
                    multi_class='warn', n_jobs=None, penalty='l2',
                    random_state=1, solver='lbfgs', tol=0.0001, verbose=0,
                    warm_start=False)}

In [23]:
# the column means inferred by the standard scaler
pipe.steps[0][1].mean_

array([29.6420927 ,  0.51404494,  0.43258427, 34.5672514 ,  0.24297753,
        0.49859551,  0.63623596,  0.03932584,  0.77808989])

In [24]:
# the coefficients determined with logistic regression
pipe.named_steps['model'].coef_

array([[-0.60222826, -0.32648765, -0.05202477,  0.0934684 , -0.47528871,
        -1.14705344, -1.24761907, -0.15821131, -0.1700639 ]])

## Using pipelines with training and testing data

---

Next we'll split this data into training and testing sets. One of the greatest benefits of using pipelines is that the preprocessing steps retain what they learned (their "fit") from the training data, so that exactly the same transformation can be applied to the testing data.

In the pipeline we built above, for example, the first standardization step is "fit" on the data we put into it. This means that the `StandardScaler` object takes the mean and standard deviation of that data and performs the procedure with those values.

It _also_ means that were we to predict or score on future data, the standard scaler in the pipeline would use the training data's mean and standard deviation to standardize that test data. This is what we want! You definitely don't want to standardize the training and testing data to their own means and standard deviations.

There are many scenarios in which the test data is actually data that we have not collected yet. In this case, you need to save the standardization procedure you used on the training data to use on this future data.
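
A minimal sketch of both points, under some assumptions: the data is synthetic (`make_classification`), the variable names are illustrative, and `pickle` is just one of several ways to store a fitted pipeline. The first check confirms that the scaler inside the pipeline holds the training means; the second stores the fitted pipeline and restores it before predicting on held-out data.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data (an assumption; not the titanic data)
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1)

demo_pipe = Pipeline(steps=[('scaler', StandardScaler()),
                            ('model', LogisticRegression(solver='lbfgs', random_state=1))])
demo_pipe.fit(X_tr, y_tr)

# the scaler inside the pipeline stores the training means
fitted_scaler = demo_pipe.named_steps['scaler']
print(np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0)))

# serialise the fitted pipeline and restore it, as one might do
# before scoring data that has not been collected yet
restored_pipe = pickle.loads(pickle.dumps(demo_pipe))
print(np.array_equal(restored_pipe.predict(X_te), demo_pipe.predict(X_te)))
```

Scoring or predicting with the pipeline never refits the scaler, so the test data is always standardized with the stored training statistics.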

#### Split up into training and testing X, y, fit on the training data and score on training and test data

In [25]:
X = data.copy()
y = X.pop('Survived')
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)
pipe.fit(X_train, y_train)
print(cross_val_score(pipe, X_train, y_train, cv=5).mean())
print(pipe.score(X_train, y_train))
print(pipe.score(X_test, y_test))

0.7931427142714271
0.8072289156626506
0.8130841121495327


### Exercise: Experiment by setting up a pipeline using different scaling methods or models

In [26]:
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegressionCV

In [27]:
# try with MinMaxScaling:
scaler = MinMaxScaler()
model = KNeighborsClassifier()
pipe = Pipeline(steps=[('scaler', scaler), ('model', model)])
# make sure to redefine pipe after setting up scaler and model

pipe.fit(X_train,y_train)
print(cross_val_score(pipe, X_train, y_train, cv=5).mean())
print(pipe.score(X_train, y_train))
print(pipe.score(X_test, y_test))

0.8051641164116411
0.8493975903614458
0.7616822429906542


In [28]:
# try with Polynomial Features:
poly = PolynomialFeatures()
scaler = StandardScaler()
model = LogisticRegressionCV(max_iter=10000, penalty='l2', solver='liblinear')
pipe = Pipeline(steps=[('poly', poly), ('scaler', scaler), ('model', model)])
# make sure to redefine pipe after setting up scaler and model

pipe.fit(X_train,y_train)
print(cross_val_score(pipe, X_train, y_train, cv=5).mean())
print(pipe.score(X_train, y_train))
print(pipe.score(X_test, y_test))



0.8030841084108411
0.8453815261044176
0.7990654205607477


## Built-in transformations and preprocessing steps

---

Sklearn comes with a wide variety of useful classes for preprocessing your data prior to model fitting that can be put into pipelines.

Most of them can be found in the `sklearn.preprocessing` module, which comes loaded with many useful preprocessing classes. Feel free to familiarize yourself with them if you want to use them in your code:

**Data Manipulators**

- Binarizer
- KernelCenterer
- MaxAbsScaler
- MinMaxScaler
- Normalizer
- OneHotEncoder
- PolynomialFeatures
- RobustScaler
- StandardScaler

**Data Imputation**

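- SimpleImputer (in `sklearn.impute`; the older `preprocessing.Imputer` has been removed from recent sklearn versions)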

**Function Transformer**

- FunctionTransformer

**Label Manipulators**

- LabelBinarizer
- LabelEncoder
- MultiLabelBinarizer

[Sklearn preprocessing](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)
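
As a small example of one of these classes, `FunctionTransformer` wraps an ordinary function so that it can be used as a pipeline step. The sketch below is only an illustration: it applies `np.log1p` to the first five `Fare` values shown earlier and then standardizes the result. The choice of a log transform and the names `fares` and `log_scale_pipe` are assumptions made for the example.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# the first five Fare values from titanic.head(), as a single-column matrix
fares = np.array([[7.25], [71.2833], [7.925], [53.1], [8.05]])

# step 1 applies log(1 + x) element-wise, step 2 standardizes the result
log_scale_pipe = Pipeline(steps=[('log', FunctionTransformer(np.log1p)),
                                 ('scaler', StandardScaler())])
fares_transformed = log_scale_pipe.fit_transform(fares)
print(fares_transformed.shape)
```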

## Custom transformations

---

It's not always possible to use a built-in transformation class to do what you want. In fact, it's likely that you're going to run into a scenario where you need a customized preprocessing step before model fitting.

Let's take our titanic data, for example. Say we wanted a preprocessor that would remove the columns we didn't want and create the dummy-coded columns before sending it through to the standardization step.

### Custom transformer classes start with this template code:

In [29]:
# we need to import the template classes to create a class that works like an sklearn class
from sklearn.base import BaseEstimator, TransformerMixin

# our "TitanicPreprocessor" is going to do the processing


class TitanticPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def transform(self, X, *args):
        return X

    def fit(self, X, *args):
        return self

In [30]:
tp = TitanticPreprocessor()

In [31]:
tp.fit_transform(X_train)

Unnamed: 0,Age,SibSp,Parch,Fare,Pclass_2,Pclass_3,Sex_male,Embarked_Q,Embarked_S
416,40.5,0,0,7.7500,0,1,1,1,0
111,22.0,0,0,7.7500,0,1,0,0,1
643,18.0,0,0,7.7750,0,1,0,0,1
370,48.0,0,0,13.0000,1,0,1,0,1
59,30.0,0,0,12.4750,0,1,0,0,1
367,48.0,0,0,26.5500,0,0,1,0,1
168,35.0,0,0,21.0000,1,0,0,0,1
380,22.0,0,0,7.5208,0,1,1,0,1
612,32.0,0,0,8.3625,0,1,1,0,1
378,34.0,1,0,21.0000,1,0,1,0,1


In [32]:
tp.get_params()

{}

Some notes on this class:

1. We import the `BaseEstimator` and `TransformerMixin` classes so that our preprocessor can inherit from them.
2. The class must define the methods `fit` and `transform`, which the pipeline uses to chain the steps together.
3. The `*args` argument lets a method accept any number of additional positional arguments after the ones listed explicitly, for example the `y` that a pipeline passes along to `fit`.

If you are confused about those classes, [this article](http://danielhnyk.cz/creating-your-own-estimator-scikit-learn/) gives a nice overview.
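
To make the role of the two parent classes a little more concrete, here is a small sketch with a hypothetical transformer, `DemoScaler`, that multiplies its input by a constant. The class and its `factor` parameter are invented for this illustration. We only write `fit` and `transform`; `get_params` and `set_params` come from `BaseEstimator`, and `fit_transform` comes from `TransformerMixin`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class DemoScaler(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that multiplies its input by a constant."""

    def __init__(self, factor=1.0):
        # parameters are stored under the same name as the __init__ argument,
        # which is what get_params / set_params rely on
        self.factor = factor

    def fit(self, X, *args):
        # nothing to learn from the data
        return self

    def transform(self, X, *args):
        return np.asarray(X) * self.factor


demo = DemoScaler(factor=2.0)
print(demo.get_params())               # inherited from BaseEstimator
print(demo.fit_transform([1, 2, 3]))   # inherited from TransformerMixin
demo.set_params(factor=10.0)           # inherited from BaseEstimator
print(demo.transform([1, 2, 3]))
```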

### Add functions to the class

- Include the dummy-coding function we wrote above
- Add a function which removes unnecessary columns
- Modify the `transform` function to perform these preprocessing steps, returning the new DataFrame
- The fit function does not need to be modified
- Add an attribute which stores the final column names

In [33]:
class TitanticPreprocessor(BaseEstimator, TransformerMixin):

    def __init__(self, columns_to_drop=None, columns_to_dummify=None, drop_first=True):
        self.feature_names = []
        self.columns_to_drop = columns_to_drop
        self.columns_to_dummify = columns_to_dummify
        self.drop_first = drop_first
        
    def _drop_unused_cols(self, X):
        # drop the unwanted columns, silently skipping any that are not present
        return X.drop(columns=self.columns_to_drop or [], errors='ignore')

    def _make_dummy_cols(self, X):
        # dummy-code the selected categorical columns
        X = pd.get_dummies(X, columns=self.columns_to_dummify, drop_first=self.drop_first)
        return X

    def transform(self, X, *args):
        X = self._make_dummy_cols(X)
        X = self._drop_unused_cols(X)
        self.feature_names = X.columns
        return X

    def fit(self, X, *args):
        return self
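
One caveat of calling `pd.get_dummies` inside `transform` is that the resulting columns depend on the categories present in whatever data is passed in, so training and testing data could in principle end up with different columns. A possible variation, sketched below, records the dummy columns during `fit` and aligns to them in `transform`. This is not the class used in the rest of the lesson; `FitAwareDummifier` and its `columns_` attribute are hypothetical names, and the two small DataFrames are made up for the example.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FitAwareDummifier(BaseEstimator, TransformerMixin):
    """Hypothetical variation: remember the dummy columns seen during fit."""

    def __init__(self, columns_to_dummify=None, drop_first=True):
        self.columns_to_dummify = columns_to_dummify
        self.drop_first = drop_first

    def fit(self, X, *args):
        dummies = pd.get_dummies(X, columns=self.columns_to_dummify,
                                 drop_first=self.drop_first)
        # store the column layout learned from the training data
        self.columns_ = dummies.columns
        return self

    def transform(self, X, *args):
        # drop_first is not applied here: the stored layout already omits the
        # dropped level, and a category missing from X must not shift it
        dummies = pd.get_dummies(X, columns=self.columns_to_dummify)
        # align to the training layout: columns missing in X are filled with 0,
        # columns not seen during fit are discarded
        return dummies.reindex(columns=self.columns_, fill_value=0)


# made-up data: the test set only contains one of the three categories
train = pd.DataFrame({'Embarked': ['C', 'Q', 'S'], 'Age': [22.0, 38.0, 26.0]})
test = pd.DataFrame({'Embarked': ['S', 'S'], 'Age': [35.0, 54.0]})

dummifier = FitAwareDummifier(columns_to_dummify=['Embarked']).fit(train)
test_dummies = dummifier.transform(test)
print(test_dummies.columns.tolist())
```

With this layout the test data keeps an all-zero `Embarked_Q` column, so the column order matches what a downstream scaler or model saw during training.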

### Test the preprocessing function

In [34]:
tprep = TitanticPreprocessor(columns_to_drop=['PassengerId', 'Name'],
                             columns_to_dummify=['Sex', 'Pclass', 'Embarked'])
tprep.fit(titanic)

TitanticPreprocessor(columns_to_drop=['PassengerId', 'Name'],
                     columns_to_dummify=['Sex', 'Pclass', 'Embarked'],
                     drop_first=True)

In [35]:
titanic.head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C


In [36]:
tprep.transform(titanic).head()

Unnamed: 0,Survived,Age,SibSp,Parch,Fare,Sex_male,Pclass_2,Pclass_3,Embarked_Q,Embarked_S
0,0,22.0,1,0,7.25,1,0,1,0,1
1,1,38.0,1,0,71.2833,0,0,0,0,0
2,1,26.0,0,0,7.925,0,0,1,0,1
3,1,35.0,1,0,53.1,0,0,0,0,1
4,0,35.0,0,0,8.05,1,0,1,0,1


### Use the custom `TitanticPreprocessor` in a pipeline
---

We'll put it before the `StandardScaler` in our original pipeline.

In [39]:
columns_to_drop = ['PassengerId', 'Name']
columns_to_dummify = ['Sex', 'Pclass', 'Embarked']

tprep = TitanticPreprocessor(columns_to_drop=columns_to_drop,
                             columns_to_dummify=columns_to_dummify)
scaler = StandardScaler()
model = LogisticRegression(solver='lbfgs', random_state=1)

pipe = Pipeline(steps=[('titanic_prep', tprep),
                       ('scaler', scaler),
                       ('model', model)])

Fit on the training data and score on the testing data as before, using the new pipeline. You'll need to create a new X, y from the original data, without the manual preprocessing!

In [40]:
X = titanic.copy()
y = X.pop('Survived')

In [41]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)
pipe.fit(X_train, y_train)
print(cross_val_score(pipe, X_train, y_train, cv=5).mean())
print(pipe.score(X_train, y_train))
print(pipe.score(X_test, y_test))

In [None]:
pipe.named_steps['model'].coef_

## Looking at pipeline internals with `.get_params()`

---

Call the `.get_params()` method on the pipeline object to get the parameters of all the steps as a single dictionary.

In [None]:
pipe.get_params()
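
The keys of that dictionary follow a `<step name>__<parameter name>` pattern. The sketch below shows the pattern on a small stand-alone pipeline (`demo_pipe`, a name chosen for this example) rather than on the titanic pipeline, and uses `set_params` to change a parameter of one step through the same key.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipe = Pipeline(steps=[('scaler', StandardScaler()),
                            ('model', LogisticRegression(solver='lbfgs', random_state=1))])

params = demo_pipe.get_params()
# step parameters are exposed as '<step name>__<parameter name>'
print(params['model__C'])
print(params['scaler__with_mean'])

# the same keys can be used to change a parameter of a step
demo_pipe.set_params(model__C=0.1)
print(demo_pipe.named_steps['model'].C)
```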

You can pull out the feature names we stored by accessing our preprocessor object through `named_steps` and reading its `feature_names` attribute:

In [None]:
pipe.named_steps['titanic_prep'].feature_names

## The `make_pipeline()` convenience function

---

`make_pipeline()` does essentially the same thing as `Pipeline`. The only difference is that you pass your objects directly as arguments and the function names the steps for you, rather than you choosing the names.

In [None]:
auto_pipe = make_pipeline(
    TitanticPreprocessor(columns_to_drop=columns_to_drop,
                         columns_to_dummify=columns_to_dummify),
    StandardScaler(),
    LogisticRegression(solver='lbfgs', random_state=1))

In [None]:
auto_pipe.fit(X_train, y_train)
print(auto_pipe.score(X_train, y_train))
print(auto_pipe.score(X_test, y_test))
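
To see which names `make_pipeline` picks, we can inspect `.steps` on the result. The sketch below uses a small stand-alone pipeline (`demo_auto_pipe`, a name chosen for this example) with only built-in classes.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

demo_auto_pipe = make_pipeline(StandardScaler(),
                               LogisticRegression(solver='lbfgs', random_state=1))

# the generated step names are the lowercased class names
step_names = [name for name, step in demo_auto_pipe.steps]
print(step_names)
```

These generated names are also the prefixes to use in `get_params()` keys, for example `logisticregression__C`.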

## Feature Union

Sometimes we want to give different treatments to different data columns. Some of the columns we might have to dummify, others might contain missing values for which we will choose a variety of imputation methods. This can be done in an efficient way with feature unions which can combine a variety of transformer objects.

As with pipelines, sklearn offers a `FeatureUnion` class, in which all the transformation steps are set up with explicit names, and a `make_union` convenience function that names them for you.

[sklearn feature union](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html)
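
Before building the titanic version, here is a minimal sketch of what a feature union does, using a tiny made-up matrix and two built-in scalers. Each transformer in the union receives the full input, and the outputs are stacked side by side. The names `X_demo` and `demo_union` are chosen for this example only.

```python
import numpy as np
from sklearn.pipeline import make_union
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# a tiny made-up matrix with two columns
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0]])

# each transformer sees the full input; the outputs are concatenated column-wise
demo_union = make_union(StandardScaler(), MinMaxScaler())
X_union = demo_union.fit_transform(X_demo)

# two columns from the StandardScaler followed by two from the MinMaxScaler
print(X_union.shape)
```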

In [43]:
from sklearn.preprocessing import LabelBinarizer, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_union, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin

### Feature Extractor

First we create a class which does nothing but extract a single column from a DataFrame and return it as a two-dimensional numpy array with one column.

In [44]:
# Create a helper class to extract features one by one in a pipeline
class FeatureExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        self.column = column

    def fit(self, X, *args):
        return self

    def transform(self, X, *args):
        X = X[self.column].values.reshape(-1, 1)
        return X

In [45]:
fare_extractor = FeatureExtractor('Fare')
fare_extractor.fit_transform(X_train)[:5]

array([[ 7.75 ],
       [ 7.75 ],
       [ 7.775],
       [13.   ],
       [12.475]])

### Create a dummifyer class

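This time, however, we use sklearn's `LabelBinarizer` instead of `pd.get_dummies` and drop the first dummy column whenever there is more than one. Note that the `LabelBinarizer` is refit on whatever data is passed to `transform`.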

In [46]:
class CustomBinarizer(BaseEstimator, TransformerMixin):
    def fit(self, X, *args):
        return self

    def transform(self, X, *args):
        X = LabelBinarizer().fit(X).transform(X)
        if X.shape[1] > 1:
            return X[:, 1:]
        else:
            return X

In [47]:
LabelBinarizer().fit_transform([0, 1, 2, 0])

array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 0, 0]])

In [48]:
binarizer = CustomBinarizer()
binarizer.fit_transform(titanic['Embarked'])[:5]

array([[0, 1],
       [0, 0],
       [0, 1],
       [0, 1],
       [0, 1]])

### Build the feature union

Now for each column we will build a pipeline that extracts the indicated column, dummifies the column if required and imputes any missing values according to the specified method. 

Then we build the feature union which will be integrated into a pipeline with the standard scaler and the logistic regression.

In [67]:
# Create a pipeline to binarize labels and impute missing values with an appropriate method
pclass_pipe = make_pipeline(
    FeatureExtractor('Pclass'),
    CustomBinarizer(),
    SimpleImputer(strategy='most_frequent')
)
embarked_pipe = make_pipeline(
    FeatureExtractor('Embarked'),
    CustomBinarizer(),
    SimpleImputer(strategy='most_frequent')
)
sex_pipe = make_pipeline(
    FeatureExtractor('Sex'),
    CustomBinarizer(),
    SimpleImputer(strategy='most_frequent')
)
age_pipe = make_pipeline(
    FeatureExtractor('Age'),
    SimpleImputer(strategy='mean')
)
sibsp_pipe = make_pipeline(
    FeatureExtractor('SibSp'),
    SimpleImputer(strategy='most_frequent')
)
parch_pipe = make_pipeline(
    FeatureExtractor('Parch'),
    SimpleImputer(strategy='most_frequent')
)
fare_pipe = make_pipeline(
    FeatureExtractor('Fare'),
    SimpleImputer(strategy='most_frequent')
)

fu = make_union(pclass_pipe, sex_pipe, embarked_pipe,
                age_pipe, sibsp_pipe, parch_pipe, fare_pipe)

In [50]:
train_X = fu.fit_transform(X_train)
test_X = fu.transform(X_test)

In [73]:
train_X[:4, :]

array([[ 0.   ,  1.   ,  1.   ,  1.   ,  0.   , 40.5  ,  0.   ,  0.   ,
         7.75 ],
       [ 0.   ,  1.   ,  0.   ,  0.   ,  1.   , 22.   ,  0.   ,  0.   ,
         7.75 ],
       [ 0.   ,  1.   ,  0.   ,  0.   ,  1.   , 18.   ,  0.   ,  0.   ,
         7.775],
       [ 1.   ,  0.   ,  1.   ,  0.   ,  1.   , 48.   ,  0.   ,  0.   ,
        13.   ]])

In [52]:
fu_pipe = make_pipeline(fu, scaler, model)
fu_pipe.fit(X_train, y_train)

print(cross_val_score(fu_pipe, X_train, y_train, cv=5).mean())
print(fu_pipe.score(X_train, y_train))
print(fu_pipe.score(X_test, y_test))

0.7931427142714271
0.8072289156626506
0.8130841121495327


## Independent Practice
### Improve the model using grid search

- Find out how to refer to the model tuning parameters in the pipeline (use `fu_pipe.get_params()`).
- How would you modify your pipeline to use the `MinMaxScaler`?
- What about using another model like kNN?
- Create a transformer which returns the binarized version of a variable for being above or below a given threshold, e.g. for the Parch and SibSp columns.
- Create polynomial features and/or interaction terms
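
As a starting point for the first bullets, here is one possible way to grid search over a whole pipeline, sketched on synthetic data from `make_classification` rather than on the titanic data. The step names, the grid values and the variable names are assumptions chosen for the example. Parameters of a step are addressed as `<step name>__<parameter name>`, and a whole step can be swapped by using the step name itself as a key.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# synthetic stand-in data (an assumption; not the titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)

demo_pipe = Pipeline(steps=[('scaler', StandardScaler()),
                            ('model', LogisticRegression(solver='lbfgs', random_state=1))])

# 'model__C' tunes a parameter of the model step,
# 'scaler' replaces the whole scaler step with each candidate in turn
param_grid = {'scaler': [StandardScaler(), MinMaxScaler()],
              'model__C': [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(demo_pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_)
print(search.best_score_)
```

Because the pipeline is refit inside every cross-validation fold, the scaler only ever sees the training part of each fold.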

In [53]:
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

#### Gridsearch with pipeline

In [87]:
kn = KNeighborsClassifier() 
kn_params = {'n_neighbors': [3,5,7,9,21,31,51,101],
            'weights': ['uniform', 'distance'],
            'metric': ['euclidean', 'manhattan']}

model = GridSearchCV(kn, kn_params, n_jobs=2, cv=5, return_train_score=True)

fu_pipe = make_pipeline(fu, scaler, model)
fu_pipe.fit(X_train, y_train)

print(cross_val_score(fu_pipe, X_train, y_train, cv=5).mean())
print(fu_pipe.score(X_train, y_train))
print(fu_pipe.score(X_test, y_test))
print(model.best_params_)



0.8031439143914391
0.8333333333333334
0.8130841121495327
{'metric': 'euclidean', 'n_neighbors': 9, 'weights': 'uniform'}




In [88]:
lr = LogisticRegressionCV(cv=5)
lr_params = {'penalty': ['l1', 'l2'],
             'solver': ['liblinear']}

model = GridSearchCV(lr, lr_params, n_jobs=2, cv=5, return_train_score=True)

fu_pipe = make_pipeline(fu, scaler, model)
fu_pipe.fit(X_train, y_train)

print(cross_val_score(fu_pipe, X_train, y_train, cv=5).mean())
print(fu_pipe.score(X_train, y_train))
print(fu_pipe.score(X_test, y_test))
print(model.best_params_)



0.787082108210821
0.8072289156626506
0.8130841121495327
{'penalty': 'l1', 'solver': 'liblinear'}




#### Use different scaling

In [82]:
scaler = MinMaxScaler()

fu_pipe = make_pipeline(fu, scaler, model)
fu_pipe.fit(X_train, y_train)

print(cross_val_score(fu_pipe, X_train, y_train, cv=5).mean())
print(fu_pipe.score(X_train, y_train))
print(fu_pipe.score(X_test, y_test))



0.787082108210821
0.8072289156626506
0.8130841121495327




#### Binarize predictor variables

In [89]:
X_train.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
416,526,3,"Farrell, Mr. James",male,40.5,0,0,7.75,Q
111,142,3,"Nysten, Miss. Anna Sofia",female,22.0,0,0,7.75,S
643,808,3,"Pettersson, Miss. Ellen Natalia",female,18.0,0,0,7.775,S
370,464,2,"Milling, Mr. Jacob Christian",male,48.0,0,0,13.0,S
59,80,3,"Dowdell, Miss. Elizabeth",female,30.0,0,0,12.475,S


In [None]:
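# one possible completion of this exercise; the `threshold` parameter is an assumption
class CustomBinarizer(BaseEstimator, TransformerMixin):
    def __init__(self, threshold=0):
        self.threshold = threshold

    def fit(self, X, *args):
        return self

    def transform(self, X, *args):
        X = np.asarray(X)
        # numeric input: 1 if the value is above the threshold, else 0
        if np.issubdtype(X.dtype, np.number):
            return (X > self.threshold).astype(int).reshape(len(X), -1)
        # non-numeric input: dummy-code the labels as before
        X = LabelBinarizer().fit(X).transform(X)
        if X.shape[1] > 1:
            return X[:, 1:]
        else:
            return X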

In [92]:
LabelBinarizer().fit_transform(X_train['Fare'])

ValueError: Unknown label type: (416      7.7500
111      7.7500
643      7.7750
         ...   
303    227.5250
189     26.0000
Name: Fare, Length: 498, dtype: float64,)

#### Use polynomial features

In [74]:
poly = PolynomialFeatures()

fu_pipe = make_pipeline(fu, poly, scaler, model)
fu_pipe.fit(X_train, y_train)

print(cross_val_score(fu_pipe, X_train, y_train, cv=5).mean())
print(fu_pipe.score(X_train, y_train))
print(fu_pipe.score(X_test, y_test))



0.8031843184318432
0.8393574297188755
0.8130841121495327




## Conclusions

Now we can combine different preprocessing steps and models into a single pipeline, which makes our code better suited to production environments. We can even create our own classes that fit into the sklearn framework.

## Additional resources

- [Sklearn pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline)
- [Sklearn feature union](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html)
- [Sklearn preprocessing](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)
- [Create your own estimator](http://danielhnyk.cz/creating-your-own-estimator-scikit-learn/)