# Pipelines
**`sklearn.pipeline.Pipeline`**

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

Pipeline of transforms with a final estimator.

Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be transformers, that is, they must implement `fit` and `transform` methods. The final estimator only needs to implement `fit`.



A pipeline serves several purposes:

* **Convenience and encapsulation:** You only have to call `fit` and `predict` once on your data to fit a whole sequence of estimators.
* **Joint parameter selection:** You can grid search over parameters of all estimators in the pipeline at once.
* **Safety:** Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.

All estimators in a pipeline, except the last one, must be transformers (i.e. must have a transform method). The last estimator may be any type (transformer, classifier, etc.).
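
The safety point can be illustrated with a small sketch. The synthetic data and the `Ridge` model below are assumptions chosen for illustration and are not part of this notebook. Because the whole pipeline is passed to `cross_val_score`, the scaler is refit on the training part of each fold only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative only).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

demo_pipe = Pipeline([('scaler', StandardScaler()), ('ridge', Ridge())])

# cross_val_score clones and refits the whole pipeline for each fold,
# so the scaler never sees the held-out samples of that fold.
scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=5)
print(scores)
```

Scaling the full dataset first and cross-validating only the model would let held-out statistics influence the scaler, which is the leakage the pipeline avoids.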

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
def rmse(y_true, y_pred):
    # Root mean squared error (MSE is symmetric in its two arguments)
    return np.sqrt(mean_squared_error(y_true, y_pred))


**California Housing Dataset**

Features:
* MedInc - median income in block
* HouseAge - median house age in block
* AveRooms - average number of rooms
* AveBedrms - average number of bedrooms
* Population - block population
* AveOccup - average house occupancy
* Latitude - house block latitude
* Longitude - house block longitude

Target: housing prices.

In [None]:
from sklearn.datasets import fetch_california_housing
import pandas as pd
bunch = fetch_california_housing()
df = pd.DataFrame(bunch['data'], columns=bunch['feature_names'])
df['target'] = bunch['target']
df.describe()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
count,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0
mean,3.870671,28.639486,5.429,1.096675,1425.476744,3.070655,35.631861,-119.569704,2.068558
std,1.899822,12.585558,2.474173,0.473911,1132.462122,10.38605,2.135952,2.003532,1.153956
min,0.4999,1.0,0.846154,0.333333,3.0,0.692308,32.54,-124.35,0.14999
25%,2.5634,18.0,4.440716,1.006079,787.0,2.429741,33.93,-121.8,1.196
50%,3.5348,29.0,5.229129,1.04878,1166.0,2.818116,34.26,-118.49,1.797
75%,4.74325,37.0,6.052381,1.099526,1725.0,3.282261,37.71,-118.01,2.64725
max,15.0001,52.0,141.909091,34.066667,35682.0,1243.333333,41.95,-114.31,5.00001


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
 8   target      20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB


**`train_test_split`**

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score
X = df.drop('target', axis=1)
Y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=42)

## Basic example

**`fit`**

Fit the model

Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.
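
As a rough sketch of what this description means, the snippet below compares a fitted pipeline with the same two steps carried out by hand. The synthetic data and the fixed `random_state` are assumptions added here so that both trees are built identically.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative only).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 2))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1]

# Pipeline version: fit the scaler, transform, then fit the tree.
demo_pipe = Pipeline([('scaler', StandardScaler()),
                      ('dt', DecisionTreeRegressor(random_state=0))])
demo_pipe.fit(X_demo, y_demo)

# Manual version of the same two steps.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)
tree = DecisionTreeRegressor(random_state=0).fit(X_scaled, y_demo)

# Both routes should give the same predictions.
same = np.allclose(demo_pipe.predict(X_demo),
                   tree.predict(scaler.transform(X_demo)))
print(same)
```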

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import StandardScaler
#apply scaler and Decision Tree Regressor
pipe = Pipeline([('scaler', StandardScaler()), ('dt', DecisionTreeRegressor())])

# The pipeline can be used as any other estimator
# and avoids leaking the test set into the train set
pipe.fit(X_train, y_train)


Pipeline(steps=[('scaler', StandardScaler()), ('dt', DecisionTreeRegressor())])

**`predict`**

Apply transforms to the data, and predict with the final estimator

In [None]:
preds = pipe.predict(X_test)
print('R2: ', r2_score(y_test, preds))
print('RMSE: ', rmse(y_test, preds))

R2:  0.6008586234281721
RMSE:  0.7267400469905225


**`get_params`**

Get parameters for this estimator.

Returns the parameters given in the constructor as well as the estimators contained within the steps of the Pipeline.

In [None]:
pipe.get_params()

{'dt': DecisionTreeRegressor(),
 'dt__ccp_alpha': 0.0,
 'dt__criterion': 'mse',
 'dt__max_depth': None,
 'dt__max_features': None,
 'dt__max_leaf_nodes': None,
 'dt__min_impurity_decrease': 0.0,
 'dt__min_impurity_split': None,
 'dt__min_samples_leaf': 1,
 'dt__min_samples_split': 2,
 'dt__min_weight_fraction_leaf': 0.0,
 'dt__random_state': None,
 'dt__splitter': 'best',
 'memory': None,
 'scaler': StandardScaler(),
 'scaler__copy': True,
 'scaler__with_mean': True,
 'scaler__with_std': True,
 'steps': [('scaler', StandardScaler()), ('dt', DecisionTreeRegressor())],
 'verbose': False}

**`make_pipeline`**

The utility function `make_pipeline` is a shorthand for constructing pipelines; it takes a variable number of estimators and returns a pipeline, filling in the names automatically.

In [None]:
from sklearn.pipeline import make_pipeline
make_pipeline(StandardScaler(), DecisionTreeRegressor())

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('decisiontreeregressor', DecisionTreeRegressor())])
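
A small sketch of the automatic naming: each name is the lowercased class name, and when the same class appears more than once the names get a numeric suffix. The particular combination of scalers is an assumption used only to trigger the suffixing.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two StandardScaler steps force make_pipeline to disambiguate the names.
demo_pipe = make_pipeline(StandardScaler(), MinMaxScaler(), StandardScaler())
step_names = [name for name, _ in demo_pipe.steps]
print(step_names)
```

These generated names are the prefixes to use with the `<estimator>__<parameter>` syntax, so explicit names via `Pipeline` can be easier to read when a pipeline has repeated steps.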

In [None]:
pipe = make_pipeline(StandardScaler(), DecisionTreeRegressor())
pipe.fit(X_train, y_train)
preds = pipe.predict(X_test)
print('R2: ', r2_score(y_test, preds))
print('RMSE: ', rmse(y_test, preds))

R2:  0.5979802024147758
RMSE:  0.7293557943132248


In [None]:
pipe.get_params()

{'decisiontreeregressor': DecisionTreeRegressor(),
 'decisiontreeregressor__ccp_alpha': 0.0,
 'decisiontreeregressor__criterion': 'mse',
 'decisiontreeregressor__max_depth': None,
 'decisiontreeregressor__max_features': None,
 'decisiontreeregressor__max_leaf_nodes': None,
 'decisiontreeregressor__min_impurity_decrease': 0.0,
 'decisiontreeregressor__min_impurity_split': None,
 'decisiontreeregressor__min_samples_leaf': 1,
 'decisiontreeregressor__min_samples_split': 2,
 'decisiontreeregressor__min_weight_fraction_leaf': 0.0,
 'decisiontreeregressor__random_state': None,
 'decisiontreeregressor__splitter': 'best',
 'memory': None,
 'standardscaler': StandardScaler(),
 'standardscaler__copy': True,
 'standardscaler__with_mean': True,
 'standardscaler__with_std': True,
 'steps': [('standardscaler', StandardScaler()),
  ('decisiontreeregressor', DecisionTreeRegressor())],
 'verbose': False}

## Pipelines and composite estimators

Examples from documentation: https://scikit-learn.org/stable/modules/compose.html#pipeline

**Accessing steps**

In [None]:
pipe = Pipeline([('scaler', StandardScaler()), ('dt', DecisionTreeRegressor())])
pipe.steps[0]

('scaler', StandardScaler())

In [None]:
pipe[0]

StandardScaler()

In [None]:
pipe['scaler']

StandardScaler()

In [None]:
pipe['dt']

DecisionTreeRegressor()

A sub-pipeline can also be extracted using the slicing notation commonly used for Python Sequences such as lists or strings (although only a step of 1 is permitted). This is convenient for performing only some of the transformations (or their inverse).

In [None]:
pipe[:1]

Pipeline(steps=[('scaler', StandardScaler())])

In [None]:
pipe[-1:]

Pipeline(steps=[('dt', DecisionTreeRegressor())])
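
As a sketch of why slicing is convenient, `pipe[:-1]` can be used to apply only the preprocessing of an already fitted pipeline, or to invert it. The synthetic data is an assumption for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative only).
rng = np.random.RandomState(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(50, 2))
y_demo = X_demo.sum(axis=1)

demo_pipe = Pipeline([('scaler', StandardScaler()),
                      ('dt', DecisionTreeRegressor())])
demo_pipe.fit(X_demo, y_demo)

# demo_pipe[:-1] holds every step except the final estimator and reuses
# the fitted transformers, so it only applies the transformations.
X_scaled = demo_pipe[:-1].transform(X_demo)

# The same sub-pipeline can undo the transformations.
X_back = demo_pipe[:-1].inverse_transform(X_scaled)
print(X_scaled.shape, np.allclose(X_back, X_demo))
```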

## Nested parameters

Parameters of the estimators in the pipeline can be accessed using the `<estimator>__<parameter>` syntax, where `<estimator>` is the name of the step.

In [None]:
pipe

Pipeline(steps=[('scaler', StandardScaler()), ('dt', DecisionTreeRegressor())])

In [None]:
pipe.set_params(dt__max_depth=2)

Pipeline(steps=[('scaler', StandardScaler()),
                ('dt', DecisionTreeRegressor(max_depth=2))])

This is particularly important for grid searches, where the same syntax names the parameters to search over:

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = dict(scaler__with_mean=[True,False],
                    dt__max_depth=[2, 5, 10])
grid_search = GridSearchCV(pipe, param_grid=param_grid, verbose = True)

In [None]:
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 6 candidates, totalling 30 fits


GridSearchCV(estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('dt',
                                        DecisionTreeRegressor(max_depth=2))]),
             param_grid={'dt__max_depth': [2, 5, 10],
                         'scaler__with_mean': [True, False]},
             verbose=True)

In [None]:
grid_search.best_estimator_

Pipeline(steps=[('scaler', StandardScaler()),
                ('dt', DecisionTreeRegressor(max_depth=10))])

In [None]:
grid_search.best_params_

{'dt__max_depth': 10, 'scaler__with_mean': True}
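
Beyond `best_params_`, the fitted search also stores per-candidate scores. The sketch below runs a small search on synthetic data (an assumption for illustration) and reads them from `cv_results_`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative only).
rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1]

demo_pipe = Pipeline([('scaler', StandardScaler()),
                      ('dt', DecisionTreeRegressor(random_state=0))])
demo_grid = GridSearchCV(demo_pipe,
                         param_grid={'dt__max_depth': [1, 3, 6]}, cv=3)
demo_grid.fit(X_demo, y_demo)

# One row per candidate; mean_test_score is the score averaged over folds
# (R^2 here, the default scorer of a regressor).
columns = ['param_dt__max_depth', 'mean_test_score', 'rank_test_score']
results = pd.DataFrame(demo_grid.cv_results_)[columns]
print(results)
print(demo_grid.best_params_, demo_grid.best_score_)
```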

## ColumnTransformer
Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer

Applies transformers to columns of an array or pandas DataFrame.

Many datasets contain features of different types, say text, floats, and dates, where each type of feature requires separate preprocessing or feature extraction steps. Often it is easiest to preprocess data before applying scikit-learn methods, for example using pandas. Processing your data before passing it to scikit-learn might be problematic for one of the following reasons:

* Incorporating statistics from test data into the preprocessors makes cross-validation scores unreliable (known as data leakage), for example in the case of scalers or imputing missing values.
* You may want to include the parameters of the preprocessors in a parameter search.

The ColumnTransformer helps perform different transformations for different columns of the data, within a Pipeline that is safe from data leakage and that can be parametrized. ColumnTransformer works on arrays, sparse matrices, and pandas DataFrames.

In [None]:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer
ct = ColumnTransformer(
     [("norm1", Normalizer(norm='l1'), [0, 1]),
      ("norm2", Normalizer(norm='l2'), slice(2, 4))])
X = np.array([[0., 1., 2., 2.],
               [1., 1., 0., 1.]])
X

array([[0., 1., 2., 2.],
       [1., 1., 0., 1.]])

In [None]:
# Normalizer scales each row of X to unit norm. A separate scaling
# is applied for the two first and two last elements of each
# row independently.
ct.fit_transform(X)


array([[0.        , 1.        , 0.70710678, 0.70710678],
       [0.5       , 0.5       , 0.        , 1.        ]])

**`make_column_transformer`**

Construct a ColumnTransformer from the given transformers.

This is a shorthand for the ColumnTransformer constructor; it does not require, and does not permit, naming the transformers. Instead, they will be given names automatically based on their types.

In [None]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
ct = make_column_transformer(
     (StandardScaler(), ['numerical_column']),
    (OneHotEncoder(), ['categorical_column']))
ct

ColumnTransformer(transformers=[('standardscaler', StandardScaler(),
                                 ['numerical_column']),
                                ('onehotencoder', OneHotEncoder(),
                                 ['categorical_column'])])

In [None]:
X = pd.DataFrame([[0, 1, 6], ['a','b', 'b']], index = ['numerical_column', 'categorical_column']).T
X

Unnamed: 0,numerical_column,categorical_column
0,0,a
1,1,b
2,6,b


In [None]:
ct.fit_transform(X)

array([[-0.88900089,  1.        ,  0.        ],
       [-0.50800051,  0.        ,  1.        ],
       [ 1.3970014 ,  0.        ,  1.        ]])

## More complicated examples

Examples from here: https://scikit-learn.org/stable/modules/compose.html#make-column-transformer

In [None]:
import pandas as pd
X = pd.DataFrame(
     {'city': ['London', 'London', 'Paris', 'Sallisaw'],
      'title': ["His Last Bow", "How Watson Learned the Trick",
                "A Moveable Feast", "The Grapes of Wrath"],
      'expert_rating': [5, 3, 4, 5],
      'user_rating': [4, 5, 4, 3]})
X

Unnamed: 0,city,title,expert_rating,user_rating
0,London,His Last Bow,5,4
1,London,How Watson Learned the Trick,3,5
2,Paris,A Moveable Feast,4,4
3,Sallisaw,The Grapes of Wrath,5,3


For this data, we might want to encode the `city` column as a categorical variable using [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) but apply a [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) to the `title` column. As we might use multiple feature extraction methods on the same column, we give each transformer a unique name, say `city_category` and `title_bow`. By default, the remaining rating columns are ignored (`remainder='drop'`).

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder
column_trans = ColumnTransformer(
     [('city_category', OneHotEncoder(dtype='int'),['city']),
      ('title_bow', CountVectorizer(), 'title')],
     remainder='drop')

column_trans.fit(X)

ColumnTransformer(transformers=[('city_category', OneHotEncoder(dtype='int'),
                                 ['city']),
                                ('title_bow', CountVectorizer(), 'title')])

In the above example, the CountVectorizer expects a 1D array as input, so the column was specified as a string (`'title'`). OneHotEncoder, like most other transformers, expects 2D data, so in that case you need to specify the column as a list of strings (`['city']`).
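
The difference between the two selector forms can be sketched outside of `ColumnTransformer` by indexing the DataFrame directly. This only approximates what is handed to each transformer, and the small `demo` frame is a copy of the two relevant columns made for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

demo = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"]})

# A string selector corresponds to a 1D Series ...
title_1d = demo['title']
# ... while a list selector corresponds to a 2D DataFrame with one column.
city_2d = demo[['city']]

bow = CountVectorizer().fit_transform(title_1d)
onehot = OneHotEncoder().fit_transform(city_2d)
print(title_1d.shape, city_2d.shape, bow.shape, onehot.shape)
```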

In [None]:
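# get_feature_names() was removed in scikit-learn 1.2;
# newer versions provide get_feature_names_out() instead
column_trans.get_feature_names()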

['city_category__x0_London',
 'city_category__x0_Paris',
 'city_category__x0_Sallisaw',
 'title_bow__bow',
 'title_bow__feast',
 'title_bow__grapes',
 'title_bow__his',
 'title_bow__how',
 'title_bow__last',
 'title_bow__learned',
 'title_bow__moveable',
 'title_bow__of',
 'title_bow__the',
 'title_bow__trick',
 'title_bow__watson',
 'title_bow__wrath']

In [None]:
column_trans.transform(X).toarray()

array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0],
       [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1]], dtype=int64)

In [None]:
pd.DataFrame(column_trans.transform(X).toarray(), columns = column_trans.get_feature_names())

Unnamed: 0,city_category__x0_London,city_category__x0_Paris,city_category__x0_Sallisaw,title_bow__bow,title_bow__feast,title_bow__grapes,title_bow__his,title_bow__how,title_bow__last,title_bow__learned,title_bow__moveable,title_bow__of,title_bow__the,title_bow__trick,title_bow__watson,title_bow__wrath
0,1,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,1,0,1,0,0,1,1,1,0
2,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0
3,0,0,1,0,0,1,0,0,0,0,0,1,1,0,0,1


**`make_column_selector`**

Apart from a scalar or a single item list, the column selection can be specified as a list of multiple items, an integer array, a slice, a boolean mask, or with a `make_column_selector`. The `make_column_selector` is used to select columns based on data type or column name:

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_selector
ct = ColumnTransformer([
       ('scale', StandardScaler(),
       make_column_selector(dtype_include=np.number)),
       ('onehot',
       OneHotEncoder(),
       make_column_selector(pattern='city', dtype_include=object))])
res = ct.fit_transform(X)
res

array([[ 0.90453403,  0.        ,  1.        ,  0.        ,  0.        ],
       [-1.50755672,  1.41421356,  1.        ,  0.        ,  0.        ],
       [-0.30151134,  0.        ,  0.        ,  1.        ,  0.        ],
       [ 0.90453403, -1.41421356,  0.        ,  0.        ,  1.        ]])
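
A selector built by `make_column_selector` is a callable that takes a DataFrame and returns the matching column names, which makes it easy to check what it will pick up. The two-row `demo` frame below is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

demo = pd.DataFrame({'city': ['London', 'Paris'],
                     'title': ['His Last Bow', 'A Moveable Feast'],
                     'expert_rating': [5, 4],
                     'user_rating': [4, 4]})

# Select by dtype: every numeric column.
numeric_cols = make_column_selector(dtype_include=np.number)(demo)

# Select by name: the pattern is a regular expression matched
# against the column names.
rating_cols = make_column_selector(pattern='rating')(demo)
city_cols = make_column_selector(pattern='^city$')(demo)
print(numeric_cols, rating_cols, city_cols)
```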

We can keep the remaining rating columns by setting `remainder='passthrough'`. The values are appended to the end of the transformation.

In [None]:
column_trans = ColumnTransformer(
     [('city_category', OneHotEncoder(dtype='int'),['city']),
      ('title_bow', CountVectorizer(), 'title')],
     remainder='passthrough')

column_trans.fit_transform(X)

array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 5, 4],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 3, 5],
       [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 4, 4],
       [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 5, 3]])

The remainder parameter can be set to an estimator to transform the remaining rating columns. The transformed values are appended to the end of the transformation.

In [None]:
X

Unnamed: 0,city,title,expert_rating,user_rating
0,London,His Last Bow,5,4
1,London,How Watson Learned the Trick,3,5
2,Paris,A Moveable Feast,4,4
3,Sallisaw,The Grapes of Wrath,5,3


In [None]:
from sklearn.preprocessing import MinMaxScaler
column_trans = ColumnTransformer(
     [('city_category', OneHotEncoder(), ['city']),
      ('title_bow', CountVectorizer(), 'title')],
     remainder=MinMaxScaler())
# Scaled expert_rating and user_rating (the last two columns)
column_trans.fit_transform(X)[:, -2:]

array([[1. , 0.5],
       [0. , 1. ],
       [0.5, 0.5],
       [1. , 0. ]])

## Pipeline with column transformer

Load [titanic](https://gist.github.com/michhar/2dfd2de0d4f8727f873422c5d959fff5) dataset.

    VARIABLE DESCRIPTIONS:
    survival        Survival
                (0 = No; 1 = Yes)
    pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
    name            Name
    sex             Sex
    age             Age
    sibsp           Number of Siblings/Spouses Aboard
    parch           Number of Parents/Children Aboard
    ticket          Ticket Number
    fare            Passenger Fare
    cabin           Cabin
    embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

    SPECIAL NOTES:
    Pclass is a proxy for socio-economic status (SES)
     1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

    Age is in Years; Fractional if Age less than One (1)
     If the Age is Estimated, it is in the form xx.5

In [None]:
df = pd.read_csv('https://grantmlong.com/data/titanic.csv')
#df = pd.read_csv('titanic.csv')
df.Age = df.Age.fillna(0)
df = df.drop(['Ticket','Name','Cabin', 'PassengerId'], axis = 1)
df = df.dropna()
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  889 non-null    int64  
 1   Pclass    889 non-null    int64  
 2   Sex       889 non-null    object 
 3   Age       889 non-null    float64
 4   SibSp     889 non-null    int64  
 5   Parch     889 non-null    int64  
 6   Fare      889 non-null    float64
 7   Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(2)
memory usage: 62.5+ KB


In [None]:
X = df.drop(['Survived'], axis = 1)
Y = df.Survived
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=42)

In [None]:
X.describe()

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare
count,889.0,889.0,889.0,889.0,889.0
mean,2.311586,23.740349,0.524184,0.382452,32.096681
std,0.8347,17.562609,1.103705,0.806761,49.697504
min,1.0,0.0,0.0,0.0,0.0
25%,2.0,6.0,0.0,0.0,7.8958
50%,3.0,24.0,0.0,0.0,14.4542
75%,3.0,35.0,1.0,0.0,31.0
max,3.0,80.0,8.0,6.0,512.3292


In [None]:
X.head(1)

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,male,22.0,1,0,7.25,S


In [None]:
from sklearn.preprocessing import OrdinalEncoder
ct = ColumnTransformer([
       ('scale', StandardScaler(), make_column_selector(dtype_include=np.number)),
       ('onehot', OneHotEncoder(), ['Sex']),
       ('ordinal', OrdinalEncoder(), ['Embarked'])
        ])

In [None]:
ct.fit(X_train)

ColumnTransformer(transformers=[('scale', StandardScaler(),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x7f7caa116790>),
                                ('onehot', OneHotEncoder(), ['Sex']),
                                ('ordinal', OrdinalEncoder(), ['Embarked'])])

In [None]:
pd.DataFrame(ct.transform(X_train)).head(2)

Unnamed: 0,0,1,2,3,4,5,6,7
0,0.815528,-0.111904,-0.474917,-0.480663,-0.500108,1.0,0.0,2.0
1,-0.386113,1.462037,-0.474917,-0.480663,-0.435393,1.0,0.0,2.0


In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
ct = ColumnTransformer([
       ('scale', StandardScaler(), make_column_selector(dtype_include=np.number)),
       ('onehot', OneHotEncoder(), ['Sex']),
       ('ordinal', OrdinalEncoder(), ['Embarked'])
        ])
pipe = Pipeline([('ct', ct), ('dt', DecisionTreeClassifier())])
pipe.fit(X_train, y_train)
pred = pipe.predict(X_test)
print('Accuracy', accuracy_score(y_test,pred))

Accuracy 0.7757847533632287


## ColumnTransformer + Pipeline + GridSearch

In [None]:
param_grid = dict(dt__max_depth=[2, 5, 10, None])
gs = GridSearchCV(pipe, param_grid=param_grid, verbose = True)
gs.fit(X_train, y_train)
best_params = gs.best_params_
gs.best_params_

Fitting 5 folds for each of 4 candidates, totalling 20 fits


{'dt__max_depth': 5}

In [None]:
pred = gs.predict(X_test)
print('Accuracy', accuracy_score(y_test,pred))

Accuracy 0.7937219730941704


**Setting best parameters**

In [None]:
pipe = Pipeline([('ct', ct), ('dt', DecisionTreeClassifier())])
pipe.set_params(**gs.best_params_)
pipe.fit(X_train, y_train)
pred = pipe.predict(X_test)
print('Accuracy', accuracy_score(y_test,pred))

Accuracy 0.7937219730941704
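
Two related points can be sketched on a small synthetic frame (the `demo` data and its column names are assumptions for illustration). First, parameters of a transformer inside a `ColumnTransformer` inside a `Pipeline` are addressed by chaining names, here `ct__scale__with_mean`. Second, with the default `refit=True`, `GridSearchCV` refits the best candidate on all the data passed to `fit`, so `best_estimator_` can be used directly instead of rebuilding the pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative only): the label depends on 'sex' alone.
rng = np.random.RandomState(0)
demo = pd.DataFrame({'age': rng.uniform(1, 80, size=120),
                     'sex': rng.choice(['male', 'female'], size=120)})
y_demo = (demo['sex'] == 'female').astype(int)

demo_ct = ColumnTransformer([('scale', StandardScaler(), ['age']),
                             ('onehot', OneHotEncoder(), ['sex'])])
demo_pipe = Pipeline([('ct', demo_ct),
                      ('dt', DecisionTreeClassifier(random_state=0))])

# <pipeline step>__<transformer name>__<parameter> for nested parameters.
param_grid = {'ct__scale__with_mean': [True, False],
              'dt__max_depth': [1, 3]}
demo_gs = GridSearchCV(demo_pipe, param_grid=param_grid, cv=3)
demo_gs.fit(demo, y_demo)

# best_estimator_ is already fitted with best_params_.
print(demo_gs.best_params_)
print(demo_gs.best_estimator_.score(demo, y_demo))
```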


In [None]:
np.number

numpy.number