# Titanic dataset: classification

### Description

This notebook sets out an approach for building a survival-classifier predictive model for the Titanic disaster. Various models will be trained and evaluated using data from Kaggle. The best-performer will then be deployed for use. 

Goal: create a machine learning model to generate predictions for whether an individual will survive the titanic disaster.



**Feature descriptors:**
 - Pclass: ticket class
 - Name: full name of passenger
 - Sex: sex (m/f)
 - Age: age in years
 - SibSp: # of siblings/spouses aboard
 - Parch: # of parents / children aboard the Titanic
 - Ticket: ticket number
 - Fare: passenger fare
 - Cabin: cabin number
 - Embarked: port of embarkation
 
 
Target: Survival - whether the individual survived (0 - No, 1 - Yes)

**Import libraries**

In [6]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, minmax_scale
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier

from sklearn.metrics import confusion_matrix, classification_report, auc, roc_auc_score
from sklearn.dummy import DummyClassifier
import joblib

In [37]:
import warnings
warnings.filterwarnings('ignore')

**Load datasets**

In [21]:
train_df = pd.read_csv("../datasets/train.csv")
test_df = pd.read_csv("../datasets/test.csv")

**Explore training data**

see Titanic_EDA.ipynb for a more in-depth EDA exercise

In [3]:
# look at the first 10 rows
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [22]:
# set passenger id as the index
train_df.set_index('PassengerId',inplace=True)
test_df.set_index('PassengerId',inplace=True)

### Feature engineering

The name and ticket columns have been dropped as they contain all unique values and appear unlikely to be useful to the model. The cabin column has a large number of nan values and so this has also been excluded.

Made the same change on the test dataset.

In [None]:
# name includes title which could be extracted using regex. This could also be an indicator (Mr, Mrs, Miss, Master, Dr., Rev.) 
# for example, indicates whether the passenger is married or not.  
#train_df['Title'] = train_df['Name'].str.extract('([a-zA-Z]{2,}[\.]{1})')
#test_df['Title'] = test_df['Name'].str.extract('([a-zA-Z]{2,}[\.]{1})')

In [None]:
# Consider whether to bin fields, like age for example, into categories. Age is unlikley to be useful as a continuous variable. 

**Clean dataset**

There are 177 missing values in the Age column. Due to the small dataset size, we shouldn't remove all these rows. Instead use an imputer to replace missing values with a mean for the column.  

In [23]:
# drop redundant columns in test and train sets
cleaned_train_df = train_df.drop(columns=['Name','Ticket','Cabin']).copy()
cleaned_test_df = test_df.drop(columns=['Name','Ticket','Cabin']).copy()

# impute values where missing for Age and Embarked (train)
imp_age = SimpleImputer(strategy='mean').fit(cleaned_train_df['Age'].values.reshape(-1,1))
cleaned_train_df['Age'] = imp_age.transform(cleaned_train_df['Age'].values.reshape(-1,1))
cleaned_test_df['Age'] = imp_age.transform(cleaned_test_df['Age'].values.reshape(-1,1))

# impute values where missing for Age and Embarked (test)
#imp_embarked = SimpleImputer(strategy='constant',fill_value='UNKNOWN').fit(cleaned_train_df['Embarked'].values.reshape(-1,1))
#cleaned_train_df['Embarked'] = imp_embarked.transform(cleaned_train_df['Embarked'].values.reshape(-1,1))
#cleaned_test_df['Embarked'] = imp_embarked.transform(cleaned_test_df['Embarked'].values.reshape(-1,1))

In the training set, there are two rows now with missing values in the Embarked column. These can be dropped as we will not lose too much data. Similarly, there is a missing value in the Fare column in the test set.

In [26]:
# drop rows where missing values are remaining
cleaned_train_df.dropna(inplace=True)
cleaned_test_df.dropna(inplace=True)

**Transform Categoric data**

Use One hot encoding to convert the categoric variables (sex, embarked, parch, Pclass, SibSp) so each category is a feature (1/0) 

This has to be done for both test and training datasets

In [27]:
# one hot encode categoric variables
train_ohe_df = pd.get_dummies(cleaned_train_df,columns=['Sex','Embarked','Parch','Pclass','SibSp'])
test_ohe_df = pd.get_dummies(cleaned_test_df,columns=['Sex','Embarked','Parch','Pclass','SibSp'])

**Transform Numeric data**

Any fields containing continuous numeric data should be scaled or normalized.

In [28]:
# fit and apply a scaler to the training set for the Age variable
age_scaler = StandardScaler().fit(train_ohe_df['Age'].values.reshape(-1, 1))
train_ohe_df['Age'] = age_scaler.transform(train_ohe_df['Age'].values.reshape(-1, 1))

# fit and apply a scaler to the train set for the Fare variable
fare_scaler = StandardScaler().fit(train_ohe_df['Fare'].values.reshape(-1, 1))
train_ohe_df['Fare'] = age_scaler.transform(train_ohe_df['Fare'].values.reshape(-1, 1))

# apply the scalers to the test set
test_ohe_df['Age'] = age_scaler.transform(test_ohe_df['Age'].values.reshape(-1, 1))
test_ohe_df['Fare'] = fare_scaler.transform(test_ohe_df['Fare'].values.reshape(-1, 1))

**Create feature and target datasets**

In [29]:
target_col = ['Survived']
feature_cols=[f for f in train_ohe_df.columns.to_list() if f not in target_col]

X = train_ohe_df[feature_cols].copy()
y = train_ohe_df[target_col].copy()

X_inf = test_ohe_df[feature_cols].copy()

In [19]:
y.head()

Unnamed: 0_level_0,Survived
PassengerId,Unnamed: 1_level_1
1,0
2,1
3,1
4,1
5,0


### Model training, validation and testing

Follow the training, validation and testing framework
 - Use 60% of the training data for model training
 - Use 20% for assessing model performance while fine-tuning parameters
 - Hold-out 20% for final evaluation
 
Cross-validation should be applied to the 80% for training and fine-tuning using sklearn's GridSearchCV. This will also allow hyperparameter tuning.

In [31]:
# Split the training data into training and test sets for fitting and evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=324, stratify=y)

Train and validate each classifer on the training dataset using cross-validation. Then, score using the evaluation function. 
Accuracy is being used as the evaluation metric, with a dummy classifier to benchmark the evaluation.

In [33]:
# define a dummy classifier to benchmark the evaluation
dummy_model = DummyClassifier(strategy='most_frequent').fit(X_train,y_train)
print(f'Dummy training score: {dummy_model.score(X_train,y_train)}')
print(f'Dummy test score: {dummy_model.score(X_test,y_test)}')

Dummy training score: 0.6174402250351617
Dummy test score: 0.6179775280898876


Create a function for consistent training, validation and testing throughout experimentation. 

In [34]:
model_repo = '../models'
def model_train_val(model, params, X_train, y_train, X_test, y_test, cv=4, scoring='accuracy'):
    clf = GridSearchCV(model, param_grid=params, cv=cv, scoring=scoring, return_train_score=True)
    clf.fit(X_train, y_train.values.ravel())
    print(f'Best params: {clf.best_params_}')
    print(f'Best CV score: {clf.best_score_}')
    print(f"Training set score: {clf.cv_results_['mean_train_score'][clf.best_index_]:2.3}")
    print(f'Test set score: {clf.best_estimator_.score(X_test,y_test):2.3}')
    joblib.dump(clf.best_estimator_,os.path.join(model_repo, f'model_{str(clf.best_estimator_)}.pkl'))

#### kNN

In [38]:
# k-Nearest Neighbors
params = {'n_neighbors':range(2,11)}
model_train_val(KNeighborsClassifier(), params, X_train, y_train, X_test, y_test)

Best params: {'n_neighbors': 7}
Best CV score: 0.8115676379102392
Training set score: 0.836
Test set score: 0.792


#### Logistic regression

In [39]:
# logistic regression
params = {'penalty':['none','l1','l2'],'C':[0.01,0.1,1,10]}
model_train_val(LogisticRegression(), params, X_train, y_train, X_test, y_test)

Best params: {'C': 10, 'penalty': 'l2'}
Best CV score: 0.8059496603821494
Training set score: 0.815
Test set score: 0.803


#### Support Vector Machines

In [40]:
# support vector machines
params = {'kernel':('linear','rbf'),'C':(0.01,0.1,1)}
model_train_val(SVC(), params, X_train, y_train, X_test, y_test)

Best params: {'C': 1, 'kernel': 'linear'}
Best CV score: 0.7862311940582746
Training set score: 0.799
Test set score: 0.798


#### Decision trees

In [None]:
# decision tree
params = {'max_depth':range(1,11),'max_features':range(1,10)}
model_train_val(DecisionTreeClassifier(random_state=0), params, X_train, y_train, X_test, y_test)

#### Random forests

In [None]:
# random forests
params = {'n_estimators':range(1,101,10),'max_depth':range(1,11),}
model_train_val(RandomForestClassifier(random_state=0), params, X_train, y_train, X_test, y_test)

#### Gradient boosted trees

In [None]:
# gradient boosted trees
params= {'n_estimators':range(1,101,10),'max_depth':range(1,11)}
model_train_val(GradientBoostingClassifier(random_state=0), params, X_train, y_train, X_test, y_test)

In [None]:
# using XGBoost
params= {'n_estimators':range(1,101,10),'max_depth':range(1,11),'max_features':range(1,10)}
#model_train_val(XGBClassifier(random_state=0,use_label_encoder=False), params, X_train, y_train, X_test, y_test)

### Model diagnostic

With the exception of logistic regression, the above scores show that the model is overfitting to the training data in each case. Training set scores are higher than test set scores (high variance so the models are too complex).  

Approaches we can take:
 - Use less features
 - Get more training examples (not possible here)
 - Increase regularization
