# Part 4: Model Training Pipeline
In the previous notebook, we settled on a model algorithm after validating it, and we are now ready to formalize the training pipeline from start to finish. The pipeline takes the raw dataset as input and performs both the feature engineering and the model training in a single sequence of steps. Specifically, we will:

- Import the raw dataset from the `data/raw` directory
- Split the data into training and validation datasets
- Use our feature engineering and model algorithm code to build an end-to-end training pipeline
- Save the model as a serialized pickle file

We will use scikit-learn's `RandomForestClassifier`, with the following results carried over from the previous notebook:
1. Best hyperparameters: `{'max_depth': 15, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 75}`
2. Average accuracy score: 82%
3. Average ROC AUC score: 81%

In [1]:
# Importing the necessary Python libraries
import warnings
import numpy as np
import pandas as pd
from datetime import datetime
from category_encoders.one_hot import OneHotEncoder
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Hiding any warnings
warnings.filterwarnings('ignore')

# Adjusting Pandas output
pd.set_option("display.max_columns", None)

In [2]:
# Loading in the training data
df_raw = pd.read_csv('../data/raw/titanic-train-raw.csv')

# Separating the target column ("Survived") from the predictor features
X = df_raw.drop(columns = ['Survived'])
# Keeping y as a 1-D Series, which is the shape scikit-learn classifiers expect
y = df_raw['Survived']
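
The introduction lists a train/validation split as one of the steps. A minimal sketch of that step follows, assuming a stratified 80/20 split; the split ratio, the `random_state`, and the synthetic `df_demo` frame are illustrative assumptions, not values taken from this project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic data (values are made up for illustration)
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    'Fare': rng.uniform(5, 100, size=100),
    'Survived': [1] * 40 + [0] * 60,
})

# Separating the target column from the predictor features
X_demo = df_demo.drop(columns=['Survived'])
y_demo = df_demo['Survived']

# Stratifying on the target keeps the class balance the same in both splits
X_train, X_valid, y_train, y_valid = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(X_train.shape, X_valid.shape)
print(y_train.mean(), y_valid.mean())
```

Stratifying is a design choice rather than a requirement. It mainly helps when one class is much rarer than the other, so that the validation metrics are not skewed by an unlucky split.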

In [3]:
def encode_age(df):
    """Replace the numeric "Age" column with a categorical age group."""

    # Working on a copy so the caller's DataFrame is not mutated
    df = df.copy()

    # Filling any null values with the median age of 28.0
    median_age = 28.0
    df['Age'] = df['Age'].fillna(median_age)

    # Establishing our bin edges and labels
    bin_labels = ['child', 'teen', 'young_adult', 'adult', 'elder']
    bin_values = [-1, 12, 19, 30, 60, 100]

    # Replacing "Age" with its binned category using Pandas cut
    df['Age'] = pd.cut(df['Age'], bins = bin_values, labels = bin_labels)

    return df
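
To make the binning behavior concrete, here is a small standalone sketch that applies the same bin edges and labels to a handful of made-up ages. The `ages` Series is purely illustrative. Note that `pd.cut` uses right-inclusive intervals by default, so an age of exactly 12 falls in `child` and an age of exactly 30 falls in `young_adult`.

```python
import numpy as np
import pandas as pd

# Made-up ages, including a missing value and two values on bin edges
ages = pd.Series([5, 12, 15, 25, 30, 45, 70, np.nan], name='Age')

# Missing ages are filled with the same median used in encode_age
ages_filled = ages.fillna(28.0)

bin_labels = ['child', 'teen', 'young_adult', 'adult', 'elder']
bin_values = [-1, 12, 19, 30, 60, 100]

# Intervals are (-1, 12], (12, 19], (19, 30], (30, 60], (60, 100]
age_bins = pd.cut(ages_filled, bins=bin_values, labels=bin_labels)

print(age_bins.tolist())
# ['child', 'child', 'teen', 'young_adult', 'young_adult', 'adult', 'elder', 'young_adult']
```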


In [4]:
# Wrapping encode_age so that it can be used as a pipeline step
age_binner = FunctionTransformer(encode_age, validate = False)

# Creating the data preprocessor that will perform our feature engineering:
# it one-hot encodes the categorical columns, drops the identifier-like columns,
# and passes all remaining columns through unchanged
data_preprocessor = ColumnTransformer(
    transformers = [
        ('ohe_engineering', OneHotEncoder(use_cat_names = True, handle_unknown = 'ignore'), ['Age', 'Sex', 'Embarked']),
        ('columns_to_drop', 'drop', ['PassengerId', 'Name', 'Ticket', 'Cabin'])
    ],
    remainder = 'passthrough'
)
data_preprocessor
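
The sketch below shows what a `ColumnTransformer` with an encoder, a `'drop'` entry, and `remainder = 'passthrough'` produces on a tiny made-up frame. To keep it self-contained, it swaps in scikit-learn's own `OneHotEncoder` instead of the `category_encoders` one used above, so the details of the encoding may differ. `df_demo` and `demo_preprocessor` are hypothetical names.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up frame with one identifier, one categorical, and one numeric column
df_demo = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Sex': ['male', 'female', 'male'],
    'Fare': [7.25, 71.28, 8.05],
})

demo_preprocessor = ColumnTransformer(
    transformers=[
        # scikit-learn's encoder, used here as a stand-in for category_encoders
        ('ohe', OneHotEncoder(handle_unknown='ignore'), ['Sex']),
        ('columns_to_drop', 'drop', ['PassengerId']),
    ],
    remainder='passthrough',
    sparse_threshold=0,  # always return a dense NumPy array
)

transformed = demo_preprocessor.fit_transform(df_demo)

# Encoded columns come first (female, male), followed by the passthrough column (Fare)
print(transformed.shape)
print(transformed)
```

The output has three columns: two from the one-hot encoding of `Sex` and one passthrough `Fare` column. `PassengerId` is gone because of the `'drop'` entry.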

## Training the Binary Classification Model
With the data cleanup and feature engineering steps defined, we can now give the algorithm a valid dataset to work with.

In [6]:
# Creating the full training pipeline for the binary classification model
binary_classification_pipeline = Pipeline(steps = [
    ('age_engineering', age_binner),
    ('feature_engineering', data_preprocessor),
    ('predictive_modeling', RandomForestClassifier(n_estimators = 75,
                                                   max_depth = 15,
                                                   min_samples_split = 10,
                                                   min_samples_leaf = 1))
])

binary_classification_pipeline
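
The remaining steps from the introduction are fitting the pipeline, checking it on validation data, and saving it as a pickle file. The sketch below walks through those steps under several assumptions: it uses synthetic data from `make_classification` rather than the Titanic dataset, it omits the feature engineering steps, it fixes `random_state` for repeatability, and it writes to a temporary directory instead of a real project path. All names in it (`demo_pipeline`, `model_path`, and so on) are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic binary classification data standing in for the engineered features
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

# Same hyperparameters as above; random_state is added only for repeatability
demo_pipeline = Pipeline(steps=[
    ('predictive_modeling', RandomForestClassifier(n_estimators=75,
                                                   max_depth=15,
                                                   min_samples_split=10,
                                                   min_samples_leaf=1,
                                                   random_state=0)),
])
demo_pipeline.fit(X_tr, y_tr)

# Scoring on the held-out validation split
val_accuracy = accuracy_score(y_va, demo_pipeline.predict(X_va))
val_auc = roc_auc_score(y_va, demo_pipeline.predict_proba(X_va)[:, 1])
print(f'accuracy: {val_accuracy:.3f}, ROC AUC: {val_auc:.3f}')

# Serializing the fitted pipeline and loading it back to confirm the round trip
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / 'demo_pipeline.pkl'
    with open(model_path, 'wb') as f:
        pickle.dump(demo_pipeline, f)
    with open(model_path, 'rb') as f:
        restored_pipeline = pickle.load(f)

same_predictions = np.array_equal(
    demo_pipeline.predict(X_va), restored_pipeline.predict(X_va)
)
print(same_predictions)
```

Pickling the whole pipeline, rather than only the classifier, means the preprocessing steps travel with the model. One caveat: a pickle that references a custom function such as `encode_age` can only be loaded where that function is importable, and pickle files should only be loaded from trusted sources.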