## Pipeline

This notebook shows how to use `Pipeline` from `sklearn.pipeline` to build a pipeline that streamlines preprocessing, training, and prediction for a machine learning model.

[scikit-learn documentation for Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)

In [40]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

We use the Titanic dataset, downloaded from [Kaggle](https://www.kaggle.com/competitions/titanic/data), in this notebook.

In [41]:
df = pd.read_csv('titanic_data.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


We drop the columns `PassengerId`, `Name`, `Ticket`, and `Cabin` from the DataFrame `df` to simplify the analysis.

In [42]:
df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'], inplace=True)
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


In [43]:
X = df.drop(columns='Survived')
y = df['Survived']
X.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,male,22.0,1,0,7.25,S
1,1,female,38.0,1,0,71.2833,C
2,3,female,26.0,0,0,7.925,S
3,1,female,35.0,1,0,53.1,S
4,3,male,35.0,0,0,8.05,S


In [44]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pclass    891 non-null    int64  
 1   Sex       891 non-null    object 
 2   Age       714 non-null    float64
 3   SibSp     891 non-null    int64  
 4   Parch     891 non-null    int64  
 5   Fare      891 non-null    float64
 6   Embarked  889 non-null    object 
dtypes: float64(2), int64(3), object(2)
memory usage: 48.9+ KB


In [45]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [46]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 712 entries, 331 to 102
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pclass    712 non-null    int64  
 1   Sex       712 non-null    object 
 2   Age       572 non-null    float64
 3   SibSp     712 non-null    int64  
 4   Parch     712 non-null    int64  
 5   Fare      712 non-null    float64
 6   Embarked  710 non-null    object 
dtypes: float64(2), int64(3), object(2)
memory usage: 44.5+ KB


We are going to perform the following steps in this notebook:
1. Impute null values in every column, using a strategy that matches the column type. Even if a column in `X_train` has no null values, we might still encounter null values in the test set.
2. Scale `Age` and `Fare` columns.
3. One-hot encode categorical columns `Sex` and `Embarked`.
4. Train a model.
5. Predict from the model.
6. Save the model.
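
To illustrate why step 1 imputes columns that have no nulls in `X_train`, here is a minimal sketch. The toy values and the names `train`, `test`, and `filled` are made up for illustration and are not part of the Titanic data. It fits a `SimpleImputer` on a column without missing values and then applies it to unseen data that does have one.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (made up): the training column has no missing values...
train = pd.DataFrame({"Fare": [10.0, 20.0, 30.0]})
# ...but the unseen data does.
test = pd.DataFrame({"Fare": [np.nan, 40.0]})

imputer = SimpleImputer(strategy="mean")
imputer.fit(train)                # learns the mean (20.0) from the training data only
filled = imputer.transform(test)  # fills the missing test value with that training mean

print(filled.ravel())  # [20. 40.]
```

Without the imputer, a model that cannot handle `NaN` would fail at prediction time on such a row.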

In [47]:
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline

In [48]:
# impute discrete numerical columns with the most frequent value
pipe_discrete = Pipeline(steps=[
    ('si_mode', SimpleImputer(strategy='most_frequent'))
])

# impute continuous numerical columns with the mean, then scale them
pipe_continuous = Pipeline(steps=[
    ('si_mean', SimpleImputer(strategy='mean')),
    ('std_scaler', StandardScaler())
])

# impute categorical columns with the most frequent value, then one-hot encode them
pipe_categorical = Pipeline(steps=[('cat_si', SimpleImputer(strategy='most_frequent')),
                                   ('cat_ohe', OneHotEncoder(sparse_output=False))])

We create a `ColumnTransformer` object that applies each of the pipelines defined in the previous cell to the relevant columns of the dataset.

In [49]:
transformer = ColumnTransformer(transformers=[
    ('dis_transform', pipe_discrete, ['Pclass', 'SibSp', 'Parch']),
    ('cont_transform', pipe_continuous, ['Age', 'Fare']),
    ('cat_transform', pipe_categorical, ['Sex', 'Embarked'])
], remainder='passthrough')

We create a `Pipeline` object that will:
1. Apply the transformations, and
2. Train a model.

In [50]:
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline(steps=[
    ('transformer', transformer),
    ('random_forest', RandomForestClassifier(random_state=42))
])

`Pipeline` makes it easier to train different models and predict outcomes. So we create a function that fits a pipeline and prints its accuracy on the training data and the test data, along with the mean cross-validation accuracy.
We will use this function to quickly compare the performance of different models.

In [51]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

In [52]:
def predict_and_print_accuracy_score(pipeline, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test):
    """Fit the pipeline, then print its accuracy on the train and test sets.
    Also prints the mean 5-fold cross-validation accuracy computed on the full dataset (X, y).
    """
    # fit the model
    pipeline.fit(X_train, y_train)
    # predict outcomes from training set
    y_pred_train = pipeline.predict(X_train)
    # predict outcomes from test set
    y_pred_test = pipeline.predict(X_test)
    # calculate accuracy score of training set
    acc_score_train = accuracy_score(y_train, y_pred_train)
    # calculate accuracy score of test set
    acc_score_test = accuracy_score(y_test, y_pred_test)
    # 5-fold cross-validation on the full dataset; cross_val_score refits clones of the pipeline on each fold
    cv_score = cross_val_score(pipeline, X, y, scoring='accuracy', cv=5)
    print(f'train accuracy score = {acc_score_train}')
    print(f'test accuracy score = {acc_score_test}')
    print(f'cross_val_score = {cv_score.mean()}')
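
As a side note on the cross-validation step, the sketch below runs `cross_val_score` on a small pipeline. It uses synthetic data from `make_classification` rather than the Titanic data, and the names `X_demo`, `y_demo`, and `demo_pipe` are assumptions for this example. It shows that one score is returned per fold and that the pipeline passed in is left unfitted, because scoring happens on clones.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# Each fold refits a fresh clone of the whole pipeline, scaler included
scores = cross_val_score(demo_pipe, X_demo, y_demo, scoring="accuracy", cv=5)
print(scores.shape)  # (5,): one accuracy value per fold
print(scores.mean())

# The original pipeline object was never fitted by cross_val_score
still_unfitted = not hasattr(demo_pipe.named_steps["scale"], "mean_")
print(still_unfitted)  # True
```

Because the preprocessing lives inside the pipeline, its statistics are learned from the training part of each fold only.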

The `pipe` pipeline trains a `RandomForestClassifier`. Let's check its performance using the function that we have created.

In [53]:
predict_and_print_accuracy_score(pipe)

train accuracy score = 0.9803370786516854
test accuracy score = 0.8044692737430168
cross_val_score = 0.8103634423451134


Let's see how `LogisticRegression` performs.

In [54]:
# let's use LogisticRegression
from sklearn.linear_model import LogisticRegression

pipe_lr = Pipeline(steps=[
    ('transformer', transformer),
    ('logistic_regression', LogisticRegression(random_state=42))
])

In [55]:
predict_and_print_accuracy_score(pipe_lr)

train accuracy score = 0.8019662921348315
test accuracy score = 0.8100558659217877
cross_val_score = 0.7867679367271359


Let's see how a `DecisionTreeClassifier` performs.

In [56]:
# let's use DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

pipe_dt = Pipeline(steps=[
    ('transformer', transformer),
    ('decision_tree', DecisionTreeClassifier(random_state=42))
])

In [57]:
predict_and_print_accuracy_score(pipe_dt)

train accuracy score = 0.9803370786516854
test accuracy score = 0.776536312849162
cross_val_score = 0.7676981984809491
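
The three experiments above follow the same pattern, so they could also be written as a loop. The sketch below is one possible way to do that. It uses synthetic data and a plain `StandardScaler` as a stand-in for the notebook's transformer, and the names `candidates`, `preprocess`, and `results` are hypothetical. `clone` gives each pipeline its own unfitted copy of the preprocessing step, so the pipelines do not share one fitted object.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

preprocess = StandardScaler()  # stand-in for the notebook's ColumnTransformer

candidates = {
    "random_forest": RandomForestClassifier(random_state=42),
    "logistic_regression": LogisticRegression(random_state=42),
    "decision_tree": DecisionTreeClassifier(random_state=42),
}

results = {}
for name, model in candidates.items():
    # Same preprocessing for every candidate, only the final estimator changes
    candidate_pipe = Pipeline(steps=[
        ("preprocess", clone(preprocess)),
        ("model", model),
    ])
    candidate_pipe.fit(Xtr, ytr)
    results[name] = candidate_pipe.score(Xte, yte)  # accuracy on the held-out split

for name, acc in results.items():
    print(f"{name}: test accuracy = {acc:.3f}")
```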


By now we have seen how `Pipeline` keeps the code simple and readable.  
Calling `fit` with `X_train` and `y_train` preprocesses the features and trains the model in one step. To predict on the test set, we pass the raw `X_test` to the `predict` method of the `Pipeline` object, which applies the same fitted preprocessing before predicting.

This is not the only benefit of `Pipeline`. We can also save the trained `Pipeline` object to disk, load it somewhere else, and use it right away.  
Let's see how it works:

In [58]:
import pickle

In [59]:
with open("pipe_lr.pkl", "wb") as f:  # the context manager closes the file after writing
    pickle.dump(pipe_lr, f)

We have stored our trained `LogisticRegression` pipeline as `pipe_lr.pkl`.  
We create another notebook named `use_pipeline.ipynb` in the same directory, which will load `pipe_lr.pkl` and use it to predict outcomes.
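
The contents of `use_pipeline.ipynb` are not shown here, so the following is only a sketch of the save-and-load round trip, under stated assumptions. It uses synthetic data, a temporary file named `demo_pipe.pkl`, and the hypothetical names `demo_pipe` and `loaded_pipe` instead of the notebook's `pipe_lr`. It checks that the loaded pipeline gives the same predictions as the original.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data and a small fitted pipeline
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)
demo_pipe = Pipeline(steps=[("scale", StandardScaler()), ("clf", LogisticRegression())])
demo_pipe.fit(X_demo, y_demo)

# Save the fitted pipeline to a temporary location
path = os.path.join(tempfile.mkdtemp(), "demo_pipe.pkl")
with open(path, "wb") as f:
    pickle.dump(demo_pipe, f)

# Load it back, as a separate notebook or script would
with open(path, "rb") as f:
    loaded_pipe = pickle.load(f)

# The loaded object predicts directly, with preprocessing included
same = bool((loaded_pipe.predict(X_demo) == demo_pipe.predict(X_demo)).all())
print(same)  # True
```

Two practical cautions apply. Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code. It is also safest to load the file with the same scikit-learn version that was used to save it.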