# 5.6.3) Training the model using Pipeline and ColumnTransformer

In [4]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.tree import DecisionTreeClassifier

In [6]:
df = pd.read_csv('train.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [7]:
# Dropping less important features
df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'], inplace=True)

In [8]:
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


- **Plan:**
  - Impute missing values using ColumnTransformer and send the output of this as input for the other ColumnTransformer doing one hot encoding for sex and embarked column. Output of this step will than be used for scaling (using ColumnTransformer), than we will do feature selection (out of 10 columns that we will have after the transformation we will try to select some columns (say eight columns)).
  - Finally we will train the model

In [9]:
# train test split:
X_train, x_test, y_train, y_test = train_test_split(df.drop(columns=['Survived']), df['Survived'], test_size=0.2, random_state = 42)

In [10]:
# Imputation Transformer:
trf1 = ColumnTransformer([
    ('impute_age', SimpleImputer(), [2]),    # column index for 'Age'
    ('impute_embarked', SimpleImputer(strategy='most_frequent'), [6])   # column index for 'Embarked'
], remainder = 'passthrough')

- Note: We are providing column index. Why?
  - Because output of our transformation will be a numpy array and will have no column name.
  - So, if we use column name than there may be chance of code breaking in pipeline.

In [22]:
# One Hot Encoding:
trf2 = ColumnTransformer([
    ('ohe_sex_embarked', OneHotEncoder(sparse_output=False, handle_unknown = 'ignore'), [1,6])    # no need of drop first as we are using DecisionTree
], remainder='passthrough')

In [29]:
# Scaling:
trf3 = ColumnTransformer([('scale', MinMaxScaler(), slice(0,10))])   # slice(0,10) - applying scaling on each column

In [30]:
# Feature Selection:
trf4 = SelectKBest(score_func=chi2, k=8)    # k=8 => using top eight columns

In [31]:
# train the model:
trf5 = DecisionTreeClassifier()

In [32]:
# Creating the pipeline:
pipe = Pipeline([
    ('trf1', trf1),       # Syntax is: (transformation_name, transformation_object)
    ('trf2', trf2),  
    ('trf3', trf3),  
    ('trf4', trf4),  
    ('trf5', trf5),  
])

- Pipeline vs make_pipeline:
  - Pipeline requires naming of steps, make_pipeline does not.
  - Same applies for ColumnTransformer vs make_column_transformer.
  - So, alternate syntax: pipe = make_pipeline(trf1, trf2, trf3, trf4, trf5)
- Tutor prefers Pipeline as this will help while using pipe.named_steps due to output as key-value pair (See below)

In [33]:
# training:
pipe.fit(X_train,y_train)   # All the transformations will apply through the pipeline

- Note:
  - Here we are also applying the algorithm as part of trf5. Hence, we are using pipe.fit(). If we had only done imputation, one hot encoding, and scaling than we would have used pipe.fit_transform() [Or, fit() followed by transform()] as this is only data preprocessing but not model training.
  - Also, we are able to see the visual of flow of pipeline through the diagram. If we are not seeing it then we can do the following before fitting:  
    from sklearn import set_config  
    set_config(display='diagram')

In [35]:
# Predicting:
y_pred = pipe.predict(x_test)

In [36]:
# accuracy score:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.6256983240223464

- **Some important learning w.r.to pipeline:**

In [37]:
pipe.named_steps   # gives us a dictionary containing details of all steps our pipeline is following

{'trf1': ColumnTransformer(remainder='passthrough',
                   transformers=[('impute_age', SimpleImputer(), [2]),
                                 ('impute_embarked',
                                  SimpleImputer(strategy='most_frequent'),
                                  [6])]),
 'trf2': ColumnTransformer(remainder='passthrough',
                   transformers=[('ohe_sex_embarked',
                                  OneHotEncoder(handle_unknown='ignore',
                                                sparse_output=False),
                                  [1, 6])]),
 'trf3': ColumnTransformer(transformers=[('scale', MinMaxScaler(), slice(0, 10, None))]),
 'trf4': SelectKBest(k=8, score_func=<function chi2 at 0x000001ACACC85BC0>),
 'trf5': DecisionTreeClassifier()}

In [38]:
pipe.named_steps['trf1']   # accessing a paricular column transformer

In [40]:
pipe.named_steps['trf1'].transformers_   # gives list of tuples for transformtions in a paricular column transformer

[('impute_age', SimpleImputer(), [2]),
 ('impute_embarked', SimpleImputer(strategy='most_frequent'), [6]),
 ('remainder',
  FunctionTransformer(accept_sparse=True, check_inverse=False,
                      feature_names_out='one-to-one'),
  [0, 1, 3, 4, 5])]

In [41]:
pipe.named_steps['trf1'].transformers_[0][1]   # At index [0][1] we have SimpleImputer

In [43]:
pipe.named_steps['trf1'].transformers_[0][1].statistics_     # gives the mean value

array([29.49884615])

In [44]:
pipe.named_steps['trf1'].transformers_[1][1].statistics_     # gives the most frequest value

array(['S'], dtype=object)

- Cross Validation using cross_val_score (To be studied in detail later):

In [46]:
from sklearn.model_selection import cross_val_score
cross_val_score(pipe, X_train, y_train, cv = 5, scoring='accuracy').mean() 

"\nWhat is cross_val_score(pipe, X_train, y_train, cv = 5, scoring='accuracy').mean()?\nThis basically means doing train test split five times (because cv=5) and calculating the mean accuracy of it.\n"

In [None]:
''' What is cross_val_score(pipe, X_train, y_train, cv = 5, scoring='accuracy').mean()?
This basically means doing train test split five times (because cv=5) and calculating the mean accuracy of it.
'''

- Grid Search Using Pipeline (Hyper parameter tuning):
  - Changing setting of an algorith to improve its performance.
  - (To be studied in detail later)

In [50]:
# grid search cv:
params = {
    'trf5__max_depth': [1,2,3,4,5,None] 
        # max_depth is a hyperparameter in Decision Tree, and changing it improves/downgrades the performance of the algorithm.
}

from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_score_)   # gives the best score in a specific depth
print(grid.best_params_)  # gives the depth on which best score was achieved

0.6391214419383433
{'trf5__max_depth': 2}


In [51]:
# Exporting the pipeline:
import pickle
pickle.dump(pipe, open('pipe.pkl', 'wb'))

- No need of taking any transformer object explicitly as all those things is already part of the pipe.
- We will use this model in - 5.6.4) Testing the model trained with pipeline

==============================================================================================================