# Titanic dataset: classification

### Description

This notebook sets out an approach for building a survival classifier for the Titanic disaster. Various models are trained and evaluated using data from Kaggle, and the best performer is then deployed for use. This follows on from the *Titanic_EDA.ipynb* notebook, where exploratory data analysis of the Titanic dataset is conducted.

Goal: create a machine learning model that predicts whether an individual survived the Titanic disaster.



**Feature descriptors:**
 - Pclass: ticket class
 - Name: full name of passenger
 - Sex: sex (m/f)
 - Age: age in years
 - SibSp: # of siblings/spouses aboard
 - Parch: # of parents / children aboard the Titanic
 - Ticket: ticket number
 - Fare: passenger fare
 - Cabin: cabin number
 - Embarked: port of embarkation
 
 
**Target:** Survived - whether the individual survived (0 - No, 1 - Yes)

### Prepare environment

**Import libraries**

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, minmax_scale, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector, make_column_transformer

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier

from sklearn.metrics import confusion_matrix, classification_report, auc, roc_auc_score
from sklearn.dummy import DummyClassifier
import joblib

In [2]:
import warnings
warnings.filterwarnings('ignore')

**Load datasets**

In [3]:
train_df = pd.read_csv("../datasets/train.csv")
test_df = pd.read_csv("../datasets/test.csv")

**Explore training data**

See *Titanic_EDA.ipynb* for a more in-depth EDA exercise.

In [4]:
# look at the first 5 rows (the default for head())
train_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


### Feature engineering

Refer to the EDA notebook for a more in-depth analysis of the data and the reasoning behind the feature selection and transformation steps applied below.

**Feature selection**

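The Name and Ticket columns should be dropped, as they contain mostly unique values and appear unlikely to be useful to the model. The Cabin column has a large number of NaN values, so it should also be excluded. The column transformer used below selects only the retained columns, so the remaining ones are dropped implicitly.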

In [5]:
feature_cols = ['PassengerId','Pclass','Name','Sex','Age','SibSp','Parch','Ticket','Fare','Cabin','Embarked']
target_col = 'Survived'

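# keep only the feature columns so the target cannot leak into X
X = train_df[feature_cols].copy()
# 1-D target array, the shape scikit-learn estimators expect
y = train_df[target_col].values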

First, split the data into train and test parts with stratification, so that both parts contain a representative share of each target class. The split is made before any data pre-processing so that any pipeline built from the training data can be checked against the test part of the split.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=324, stratify=y)
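
The effect of `stratify` can be checked on a small synthetic target. The sketch below is illustrative only: `X_demo` and `y_demo` are hypothetical stand-ins, not the Titanic data, and the 60/40 class balance is an assumption chosen for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 60% class 0, 40% class 1
y_demo = np.array([0] * 600 + [1] * 400)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# With stratify, both parts keep (approximately) the positive rate of the full data
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"positive rate: train={train_rate:.3f}, test={test_rate:.3f}")
```

Without `stratify`, the two rates can drift apart by chance, which matters more on a small dataset.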

**Data cleaning and transformations**

There are 177 missing values in the Age column. Because the dataset is small, these rows should not be removed. Instead, use an imputer to replace the missing values with the column mean. Impute missing values in Embarked with a new category, 'UNKNOWN'.

Any fields containing continuous numeric data should be scaled or normalized. Use one-hot encoding (OHE) to convert the categorical variables (Sex, Embarked, Pclass, SibSp) so that each category becomes a feature with values 1/0. Parch has not been included because its categories are not consistent between the training and testing datasets, which causes issues for OHE.
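
One possible way around the inconsistent-category problem, which this notebook does not use, is the encoder's `handle_unknown='ignore'` option. The sketch below runs on hypothetical toy frames (`train_demo`, `new_demo`), in which the value 6 appears only in the new data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: the value 6 appears only in the "new" data
train_demo = pd.DataFrame({"Parch": [0, 1, 2, 0, 1]})
new_demo = pd.DataFrame({"Parch": [0, 2, 6]})

# handle_unknown='ignore' encodes unseen categories as an all-zero row
# instead of raising an error at transform time
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_demo)

encoded = encoder.transform(new_demo).toarray()
print(encoded)
```

The trade-off is that an unseen category carries no information for the model, so this is a robustness measure rather than a way to learn from rare values.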

Combine all these transformations into an sklearn pipeline for repeatability.

In [7]:
# construct pipeline of transformations
feature_pipe = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy='mean'), StandardScaler()), ['Age']),
    (make_pipeline(SimpleImputer(strategy='constant', fill_value='UNKNOWN'),OneHotEncoder()), ['Embarked']),
    (StandardScaler(), ['Fare']),
    (OneHotEncoder(),['Sex','Pclass','SibSp']))

# fit the transformation pipeline on the training data (missing Embarked values are imputed, not removed)
feature_pipe.fit(X_train)

ColumnTransformer(transformers=[('pipeline-1',
                                 Pipeline(steps=[('simpleimputer',
                                                  SimpleImputer()),
                                                 ('standardscaler',
                                                  StandardScaler())]),
                                 ['Age']),
                                ('pipeline-2',
                                 Pipeline(steps=[('simpleimputer',
                                                  SimpleImputer(fill_value='UNKNOWN',
                                                                strategy='constant')),
                                                 ('onehotencoder',
                                                  OneHotEncoder())]),
                                 ['Embarked']),
                                ('standardscaler', StandardScaler(), ['Fare']),
                                ('onehotencoder', OneHotEncoder(),
                 

Apply the pipeline to both the train and test parts of the split. Also apply it to the test dataset from Kaggle to confirm that the pipeline runs on unseen data.

In [8]:
# apply feature pipeline transformations to train and test sets to check no issues
X_train_processed = feature_pipe.transform(X_train)
X_test_processed = feature_pipe.transform(X_test)

In [9]:
# apply the pipeline to the test dataframe to check no issues
test_df_processed = feature_pipe.transform(test_df)
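
A transform that runs without error does not by itself guarantee clean output. In the pipeline above, Fare is scaled without an imputer, and `StandardScaler` passes missing values through as NaN. The sketch below shows this on hypothetical fares; whether the Kaggle test file actually contains missing fares is not established in this notebook and is treated here as an assumption to check.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical fares; the new data has a missing value
train_fares = pd.DataFrame({"Fare": [7.25, 71.28, 7.93, 53.10]})
new_fares = pd.DataFrame({"Fare": [8.05, np.nan]})

scaler = StandardScaler().fit(train_fares)
scaled = scaler.transform(new_fares)  # no error is raised here

# The missing value is passed through as NaN, so check for it explicitly
n_missing = int(np.isnan(scaled).sum())
print(f"NaNs after transform: {n_missing}")
```

If such a check finds NaNs, adding a `SimpleImputer` in front of the scaler for Fare, as is already done for Age, would be one way to handle them.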

### Model training, validation and testing

Follow the training, validation and testing framework:
 - Use 60% of the training data for model training
 - Use 20% for assessing model performance while fine-tuning parameters
 - Hold-out 20% for final evaluation
 
Cross-validation should be applied to the 80% used for training and fine-tuning, via sklearn's GridSearchCV, which also handles hyperparameter tuning. Train and validate each classifier on the training dataset using cross-validation.
Accuracy is used as the evaluation metric, with a dummy classifier as a benchmark.
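
The 60/20/20 proportions follow from holding out 20% and then running 4-fold cross-validation on the remaining 80%. The arithmetic can be sketched as below, using a hypothetical dataset size of 1000 rows and plain `KFold` for simplicity (with an integer `cv`, GridSearchCV stratifies the folds for classifiers, so real fold sizes may differ by a row or two).

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 1000                    # hypothetical dataset size
n_train_val = n_rows * 8 // 10   # rows left after the 20% hold-out

# cv=4 splits the 80% into four equal folds
kfold = KFold(n_splits=4)
fold_sizes = [
    (len(train_idx), len(val_idx))
    for train_idx, val_idx in kfold.split(np.arange(n_train_val))
]

for n_fit, n_val in fold_sizes:
    print(f"fit on {n_fit / n_rows:.0%}, validate on {n_val / n_rows:.0%}")
```

Each of the four rounds fits on three folds (60% of all rows) and validates on the fourth (20%), so every row in the 80% is used for validation exactly once.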

Create a function for consistent training, validation and testing throughout experimentation. It should also report training and test set scores (giving useful information on bias/variance) and save the pipeline to the model repository.

In [10]:
model_repo = '../models'
def model_train_val(model, feature_pipe, params, X_train, y_train, X_test, y_test, cv=4, scoring='accuracy'):
    
    # apply feature transformations
    X_train_processed = feature_pipe.transform(X_train)
    X_test_processed = feature_pipe.transform(X_test)
    
    # run gridsearch with cross-validation to find best estimator
    clf = GridSearchCV(model, param_grid=params, cv=cv, scoring=scoring, return_train_score=True)
    clf.fit(X_train_processed, y_train)
    
    # print the results to screen
    print(f'Best params: {clf.best_params_}')
    print(f'Best CV score: {clf.best_score_}')
    print(f"Training set score: {clf.cv_results_['mean_train_score'][clf.best_index_]:2.3}")
    print(f'Test set score: {clf.best_estimator_.score(X_test_processed,y_test):2.3}')
    
    # construct final pipeline
    predict_pipe = make_pipeline(feature_pipe, clf.best_estimator_)
    
    # save the model in the repository
    joblib.dump(predict_pipe, os.path.join(model_repo, f'pipe_{str(clf.best_estimator_)}.pkl'))
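
Because the saved object bundles the feature transformations with the estimator, it can later be reloaded and applied directly to raw rows. The following is a minimal sketch of that round trip on hypothetical toy data; the column values, the `pipe_demo.pkl` file name and the temporary directory are illustrative assumptions, not the files written by the function above.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for two of the Titanic columns
raw = pd.DataFrame({
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05, 30.00],
    "Sex": ["male", "female", "female", "female", "male", "male"],
})
labels = [0, 1, 1, 1, 0, 0]

features = make_column_transformer(
    (StandardScaler(), ["Fare"]),
    (OneHotEncoder(), ["Sex"]),
)
pipe = make_pipeline(features, LogisticRegression()).fit(raw, labels)

# Save and reload; the file name here is illustrative only
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipe_demo.pkl")
    joblib.dump(pipe, path)
    loaded = joblib.load(path)

# The reloaded pipeline accepts raw, untransformed rows
predictions = loaded.predict(raw)
print(predictions)
```

A pickled pipeline should be loaded with the same scikit-learn version that saved it, since compatibility across versions is not guaranteed.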

#### Dummy classifier

In [11]:
# use a dummy classifier to benchmark the evaluation
params = {}
model_train_val(DummyClassifier(strategy='most_frequent'), feature_pipe, params, X_train, y_train, X_test, y_test)

Best params: {}
Best CV score: 0.6165730337078652
Training set score: 0.617
Test set score: 0.615
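
The benchmark score is simply the share of the majority class, since a `most_frequent` dummy always predicts that class. A small sketch on hypothetical labels (62 negatives, 38 positives; not the Titanic target) illustrates the equivalence.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 62 negatives and 38 positives
y_demo = np.array([0] * 62 + [1] * 38)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy

dummy = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)

# Accuracy of always predicting the majority class
baseline_accuracy = dummy.score(X_demo, y_demo)
majority_share = np.bincount(y_demo).max() / len(y_demo)
print(baseline_accuracy, majority_share)
```

Any real classifier needs to beat this figure to show that it has learned something beyond the class balance.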


#### kNN

In [12]:
# k-Nearest Neighbors
params = {'n_neighbors':range(2,11)}
model_train_val(KNeighborsClassifier(), feature_pipe, params, X_train, y_train, X_test, y_test)

Best params: {'n_neighbors': 5}
Best CV score: 0.803370786516854
Training set score: 0.86
Test set score: 0.827


#### Logistic regression

In [13]:
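# logistic regression
# note: the default lbfgs solver does not support 'l1', so those candidates fail to fit and score NaN;
# newer scikit-learn versions also expect None rather than 'none' for no penalty
params = {'penalty':['none','l1','l2'],'C':[0.01,0.1,1,10]}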
model_train_val(LogisticRegression(), feature_pipe, params, X_train, y_train, X_test, y_test)

Best params: {'C': 0.1, 'penalty': 'l2'}
Best CV score: 0.7907303370786517
Training set score: 0.8
Test set score: 0.832
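
Whether a penalty can be fitted depends on the solver: the default `lbfgs` solver does not support `l1`, whereas `liblinear` supports both `l1` and `l2`. One way to make every grid candidate valid is to fix a compatible solver in the grid. The sketch below is not the configuration used above: it runs on synthetic data from `make_classification` and leaves out the unpenalised option, whose spelling varies between scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the Titanic features)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Both penalties are paired with a solver that supports them
param_grid = {
    "solver": ["liblinear"],
    "penalty": ["l1", "l2"],
    "C": [0.01, 0.1, 1, 10],
}

search = GridSearchCV(LogisticRegression(), param_grid=param_grid, cv=4, scoring="accuracy")
search.fit(X_demo, y_demo)

# Count candidates whose fits failed (their mean score would be NaN)
n_failed = int(np.isnan(search.cv_results_["mean_test_score"]).sum())
print(search.best_params_, n_failed)
```

Checking `cv_results_` for NaN scores is a quick way to spot failed candidates, which matters particularly when warnings are suppressed.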


#### Support Vector Machines

In [14]:
# support vector machines
params = {'kernel':('linear','rbf'),'C':(0.01,0.1,1)}
model_train_val(SVC(), feature_pipe, params, X_train, y_train, X_test, y_test)

Best params: {'C': 1, 'kernel': 'rbf'}
Best CV score: 0.8174157303370787
Training set score: 0.835
Test set score: 0.849


#### Decision trees

In [15]:
# decision tree
params = {'max_depth':range(1,11),'max_features':range(1,10)}
model_train_val(DecisionTreeClassifier(random_state=0),feature_pipe, params, X_train, y_train, X_test, y_test)

Best params: {'max_depth': 5, 'max_features': 6}
Best CV score: 0.806179775280899
Training set score: 0.818
Test set score: 0.816


#### Random forests

In [58]:
# random forests
params = {'n_estimators':range(1,101,10),'max_depth':range(1,11),}
model_train_val(RandomForestClassifier(random_state=0),feature_pipe, params, X_train, y_train, X_test, y_test)

Best params: {'max_depth': 8, 'n_estimators': 41}
Best CV score: 0.8160112359550562
Training set score: 0.909
Test set score: 0.832
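
The scores printed so far can be compared side by side to see which models generalise best. The snippet below copies the rounded figures from the outputs above into a hypothetical `reported` dictionary and computes the gap between training and cross-validation scores as a rough overfitting indicator.

```python
# Scores copied from the outputs above: (CV, training, held-out test)
reported = {
    "Dummy": (0.617, 0.617, 0.615),
    "kNN": (0.803, 0.860, 0.827),
    "Logistic regression": (0.791, 0.800, 0.832),
    "SVM": (0.817, 0.835, 0.849),
    "Decision tree": (0.806, 0.818, 0.816),
    "Random forest": (0.816, 0.909, 0.832),
}

# Gap between training and CV score: larger gaps suggest higher variance
gaps = {name: round(train - cv, 3) for name, (cv, train, _) in reported.items()}

best_cv = max(reported, key=lambda name: reported[name][0])
largest_gap = max(gaps, key=gaps.get)
print(f"best CV score: {best_cv}; largest train/CV gap: {largest_gap}")
```

Model selection should rest on the CV score; the held-out test score is best reserved for a final check on the chosen model.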


#### Gradient boosted trees

In [None]:
# gradient boosted trees
params= {'n_estimators':range(1,101,10),'max_depth':range(1,11)}
model_train_val(GradientBoostingClassifier(random_state=0), feature_pipe, params, X_train, y_train, X_test, y_test)

In [None]:
# using XGBoost
# max_features is a scikit-learn tree parameter, not an XGBoost one, so it is left out of the grid
params= {'n_estimators':range(1,101,10),'max_depth':range(1,11)}
#model_train_val(XGBClassifier(random_state=0,use_label_encoder=False), feature_pipe, params, X_train, y_train, X_test, y_test)