# Titanic Survival
This notebook explores building a classifier to predict
whether someone would survive the Titanic, given a small dataset and feature
set.

In [4]:
import pandas as pd
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
from functools import partial
from datetime import datetime
import subprocess as sp
%matplotlib inline
from typing import List


## Load Data

In [5]:
train_file = "../data/train.csv"
test_file = "../data/test.csv"


In [6]:
raw_train = pd.read_csv(train_file)
raw_test = pd.read_csv(test_file)


In [7]:
raw_train.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Process features

We will use a subset of fields for our main classifier, but before we get to that we need to deal with missing data and categorical features.


In [8]:
FEATURES = [
    "Pclass",
    "Sex",
    "Age",
    "SibSp",
    "Parch",
    "Fare",
    "Embarked"
]
CATEGORICAL_FIELDS = [
    "Sex",
    "Embarked"
]
LABEL = 'Survived'
CLASS_FEATURES = FEATURES + ['Alone']

In [9]:
for f in FEATURES:
    print(f"Train {f}:\n{raw_train[f].isna().value_counts()}\n")
    

Train Pclass:
False    891
Name: Pclass, dtype: int64

Train Sex:
False    891
Name: Sex, dtype: int64

Train Age:
False    714
True     177
Name: Age, dtype: int64

Train SibSp:
False    891
Name: SibSp, dtype: int64

Train Parch:
False    891
Name: Parch, dtype: int64

Train Fare:
False    891
Name: Fare, dtype: int64

Train Embarked:
False    889
True       2
Name: Embarked, dtype: int64



We can see that both the `Embarked` and `Age` fields have missing data.  The `Embarked` and `Sex` fields are the only categorical features that will need to be encoded.

In [10]:
for f in FEATURES:
    print(f"Test {f}:\n{raw_test[f].isna().value_counts()}\n")
    

Test Pclass:
False    418
Name: Pclass, dtype: int64

Test Sex:
False    418
Name: Sex, dtype: int64

Test Age:
False    332
True      86
Name: Age, dtype: int64

Test SibSp:
False    418
Name: SibSp, dtype: int64

Test Parch:
False    418
Name: Parch, dtype: int64

Test Fare:
False    417
True       1
Name: Fare, dtype: int64

Test Embarked:
False    418
Name: Embarked, dtype: int64



We can see that the test data has no missing `Embarked` values, but in addition to the missing `Age` values it is also missing one `Fare` value.  We're going to handle this using the median value for the field.  This seems fine for this exercise since it's only one instance.
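
The per-feature missing counts printed above can also be collected in a single call. A minimal sketch, assuming a small made-up frame (`toy`) standing in for `raw_train` or `raw_test`:

```python
import pandas as pd

# Toy frame standing in for raw_train; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Fare": [7.25, 71.2833, None, 8.05],
    "Embarked": ["S", "C", None, "S"],
})

# isna() marks missing cells; summing the booleans counts them per column.
missing_counts = toy.isna().sum()
print(missing_counts[missing_counts > 0])
```

Filtering on `missing_counts > 0` keeps only the fields that actually need treatment.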

### Encoding and Missing data

#### Embarked
Here we will fill in the missing Embarked values as well as encode the values to integers. To fill the missing Embarked data, the most common value is going to be used.


In [11]:
def fill_encode_embark(df: pd.DataFrame) -> None:
    """
    This function will replace any missing embark locations with
    the most common one and encode the values as integers.
    The DataFrame is modified in place.
    """
    # fill in the missing embarked with the most common value
    most_common_embark = df['Embarked'].mode()[0]
    df['Embarked'] = df['Embarked'].fillna(most_common_embark)
    # encode the values; codes follow the order of first appearance in this
    # frame, so they are only comparable across frames that share that order
    encoding = {f:i for i,f in enumerate(df['Embarked'].unique())}
    df['Embarked'] = df['Embarked'].map(encoding)
    

In [12]:
# copy the training data to retain for post analysis
encoded_train = raw_train.copy()
fill_encode_embark(encoded_train)
encoded_train.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,0


You can see that the values are now encoded to integers (above).

Let's make sure that the NaN values were filled and encoded as well (below).

In [13]:
raw_train.Embarked.value_counts(dropna=False)


S      644
C      168
Q       77
NaN      2
Name: Embarked, dtype: int64

In [14]:
encoded_train.Embarked.value_counts()


0    646
1    168
2     77
Name: Embarked, dtype: int64

We can see that the 2 records that were NaN now have the 0 (or 'S') value.
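
One caveat: an encoding built from `unique()` depends on the order in which values first appear, so a frame with a different row order could receive different codes. A minimal sketch of one way to keep the codes stable, assuming a fixed mapping (`EMBARKED_CODES`, a hypothetical name) that mirrors the codes seen in the training output above:

```python
import pandas as pd

# Assumed fixed mapping, matching the training output above (S -> 0, C -> 1, Q -> 2).
EMBARKED_CODES = {"S": 0, "C": 1, "Q": 2}

def encode_embarked(df: pd.DataFrame, fill_value: str) -> pd.Series:
    """Fill missing ports with fill_value, then apply the fixed mapping."""
    return df["Embarked"].fillna(fill_value).map(EMBARKED_CODES)

# Toy frames; the second one sees the ports in a different order.
train_toy = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q"]})
test_toy = pd.DataFrame({"Embarked": ["Q", "S", "C"]})

fill = train_toy["Embarked"].mode()[0]  # most common port in the training toy
train_codes = encode_embarked(train_toy, fill)
test_codes = encode_embarked(test_toy, fill)
print(train_codes.tolist(), test_codes.tolist())
```

With a fixed mapping, `Q` receives the same code in both frames regardless of where it first appears.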

#### Sex
Here we will encode the `Sex` values into integers.

In [15]:
sex_mapping = {'female': 0, 'male': 1}
encoded_train['Sex'] = encoded_train['Sex'].map(sex_mapping)
encoded_train.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,C85,1
2,3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.925,,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1,C123,0
4,5,0,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.05,,0


Above we can see that the `Sex` values are now encoded.

#### Age

For the missing `Age` values, we're going to train a Random Forest model to predict the ages.

In [16]:
def missing_clf(df: pd.DataFrame, features: List, label: str) -> RandomForestClassifier:
    """
    This function will train a classifier
    on the rows where the label is present.
    This classifier can be used to fill in
    missing data.
    """
    # an earlier linear regressor raised a LAPACK warning; this suppresses it
    import warnings
    warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")
    
    train_data = df[~df[label].isna()]
    label = train_data[label].astype(int)  # train on integers
    clf = RandomForestClassifier(n_estimators = 250,
                                 max_depth = 3,
                                 bootstrap = False,
                                 oob_score = False
                                )
    clf.fit(train_data[features], label)
    return clf

def age_groups(age):
    """
    This function creates age groups
    """
    if age < 10:
        return 0
    elif 10 <= age < 18:
        return 1
    elif 18 <= age < 26:
        return 2
    elif 26 <= age < 36:
        return 3
    elif 36 <= age < 48:
        return 4
    elif 48 <= age < 56:
        return 5
    else:
        return 6

def predict_encode_age(row: pd.Series, features: List=[], clf: RandomForestClassifier=None) -> int:
    """
    This function returns a passenger's age group, predicting
    the age first when it is missing
    """
    if pd.isnull(row['Age']):
        return age_groups(clf.predict(row[features].values.reshape(1,-1))[0])
    else:
        return age_groups(row['Age'])
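
The chained comparisons in `age_groups` can also be written as a lookup against a sorted list of boundaries. A minimal sketch using the standard library's `bisect`, where `AGE_EDGES` and `age_group` are hypothetical names and the edges are copied from the function above:

```python
from bisect import bisect_right

# Exclusive upper bounds of groups 0..5; ages of 56 and over fall in group 6.
AGE_EDGES = [10, 18, 26, 36, 48, 56]

def age_group(age: float) -> int:
    """Return the index of the age group that contains age."""
    # bisect_right counts how many edges are <= age, which is the group index.
    return bisect_right(AGE_EDGES, age)

print([age_group(a) for a in (2, 14, 22, 35, 38, 54, 80)])
```

Keeping the boundaries in one list makes it easier to adjust the groups later without touching the branching logic.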
    

Let's build a classifier to predict the passenger's age.

In [17]:
age_label = 'Age'
age_features = [f for f in FEATURES if f != age_label]
age_clf = missing_clf(encoded_train, age_features, 'Age')


In [18]:
encoded_train.head(10)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,C85,1
2,3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.925,,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1,C123,0
4,5,0,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.05,,0
5,6,0,3,"Moran, Mr. James",1,,0,0,330877,8.4583,,2
6,7,0,1,"McCarthy, Mr. Timothy J",1,54.0,0,0,17463,51.8625,E46,0
7,8,0,3,"Palsson, Master. Gosta Leonard",1,2.0,3,1,349909,21.075,,0
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",0,27.0,0,2,347742,11.1333,,0
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",0,14.0,1,0,237736,30.0708,,1


In [19]:
# age_predict_partial = partial(predict_age, clf=age_clf)
encoded_train['Age'] = encoded_train.apply(lambda x: predict_encode_age(x,
                                                                        features=age_features,
                                                                        clf=age_clf
                                                                       ),
                                           axis=1
                                          )


In [20]:
encoded_train.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",1,2,1,0,A/5 21171,7.25,,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,4,1,0,PC 17599,71.2833,C85,1
2,3,1,3,"Heikkinen, Miss. Laina",0,3,0,0,STON/O2. 3101282,7.925,,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,3,1,0,113803,53.1,C123,0
4,5,0,3,"Allen, Mr. William Henry",1,3,0,0,373450,8.05,,0
5,6,0,3,"Moran, Mr. James",1,2,0,0,330877,8.4583,,2
6,7,0,1,"McCarthy, Mr. Timothy J",1,5,0,0,17463,51.8625,E46,0
7,8,0,3,"Palsson, Master. Gosta Leonard",1,0,3,1,349909,21.075,,0
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",0,3,0,2,347742,11.1333,,0
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",0,1,1,0,237736,30.0708,,1


**Note:** This classifier is really a placeholder, as its hyperparameters would still need to be tuned.

#### Fare
The `Fare` field is a high-precision one, so we can make it a bit fuzzier by creating groups.

In [21]:
#  here we use qcut to produce groupings based off of quartiles
pd.qcut(encoded_train['Fare'], 4, precision=2).unique()

[(-0.01, 7.91], (31.0, 512.33], (7.91, 14.45], (14.45, 31.0]]
Categories (4, interval[float64]): [(-0.01, 7.91] < (7.91, 14.45] < (14.45, 31.0] < (31.0, 512.33]]

In [22]:
def fare_groups(fare: float):
    """
    This function puts Fares into groups based on
    a defined set of intervals
    """
    if fare < 7.78:
        return 0
    elif 7.78 <= fare < 8.66:
        return 1
    elif 8.66 <= fare < 14.45:
        return 2
    elif 14.45 <= fare < 26.0:
        return 3
    elif 26.0 <= fare < 52.37:
        return 4
    elif 52.37 <= fare < 512.33:
        return 5
    else:
        return 6
        

In [23]:
encoded_train['Fare'] = encoded_train['Fare'].apply(fare_groups)
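
The same grouping can be expressed with `pd.cut` instead of a row-wise `apply`. A minimal sketch, assuming bin edges copied from `fare_groups` (`FARE_EDGES` is a hypothetical name) and the first five training fares as sample input:

```python
import numpy as np
import pandas as pd

# Left-closed bin edges copied from fare_groups; the infinities catch
# everything below 7.78 and everything at or above 512.33.
FARE_EDGES = [-np.inf, 7.78, 8.66, 14.45, 26.0, 52.37, 512.33, np.inf]

fares = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05])  # first five training fares

# right=False makes each bin [low, high), matching the `low <= fare < high` checks.
fare_codes = pd.cut(fares, bins=FARE_EDGES, right=False, labels=False)
print(fare_codes.tolist())
```

Passing `labels=False` returns the integer bin index rather than an interval label, which is the form the classifier expects here.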


#### Alone
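We can add a feature to tell whether someone is traveling alone.  Note that, as coded below, `Alone` is 0 for a passenger with no family aboard and 1 otherwise.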

In [24]:
def is_alone(row: pd.Series):
    """
    This function is used to determine if a passenger was traveling without
    family; it returns 0 when there is no family aboard and 1 otherwise
    """
    family_size = row['SibSp'] + row['Parch']
    if family_size == 0:
        return 0
    else:
        return 1
    

In [25]:
encoded_train['Alone'] = encoded_train.apply(is_alone, axis=1)
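
For larger frames, the same feature can be computed without a row-wise `apply`. A minimal sketch on a made-up frame (`toy`), keeping the same convention as `is_alone` (0 means no family aboard):

```python
import pandas as pd

# Toy frame with the two family-count columns; values are made up.
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 1]})

# 0 when the passenger has no relatives aboard, 1 otherwise (same as is_alone).
toy["Alone"] = ((toy["SibSp"] + toy["Parch"]) > 0).astype(int)
print(toy)
```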

We now have a set of features that we can use for classification:

In [26]:
encoded_train[CLASS_FEATURES].head()


Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Alone
0,3,1,2,1,0,0,0,1
1,1,0,4,1,0,5,1,1
2,3,0,3,0,0,1,0,0
3,1,0,3,1,0,5,0,1
4,3,1,3,0,0,1,0,0


## Predict Survival
With all of the missing and categorical training data dealt with, we can now attempt to classify passenger survival.

In [27]:
def train_survival_clf(df: pd.DataFrame, features: List, label: str):
    """
    This function will train a classifier to predict passenger survival
    """
    clf = RandomForestClassifier(n_estimators=300,
                                 min_samples_leaf=7,
                                 min_samples_split=5,
                                 max_features=0.5,
                                 oob_score=True,
                                 n_jobs=-1,
                                 random_state=42
                                )
#     clf = GradientBoostingClassifier()
    clf.fit(df[features], df[label])
    return clf


In [28]:
survival_clf = train_survival_clf(encoded_train, CLASS_FEATURES, LABEL)


In [29]:
survival_clf.get_params()

{'bootstrap': True,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 0.5,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 7,
 'min_samples_split': 5,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 300,
 'n_jobs': -1,
 'oob_score': True,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

### Validate Model

In [30]:
scores = cross_val_score(survival_clf,
                         encoded_train[CLASS_FEATURES],
                         encoded_train[LABEL],
                         cv=10,
                         scoring='accuracy'
                        )

print(f"Accuracy (95% CI): {round(scores.mean(), 3)} (+/- {round(scores.std() * 2, 3)})")


Accuracy (95% CI): 0.822 (+/- 0.073)


This is looking pretty good so far.
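
For reference, the summary printed above is the mean of the fold accuracies plus or minus two standard deviations, which is a rough normal approximation rather than a formal confidence interval. A minimal sketch of that arithmetic, using hypothetical fold scores (not the notebook's actual per-fold values):

```python
from statistics import fmean, pstdev

# Hypothetical per-fold accuracies; not the notebook's actual fold scores.
fold_scores = [0.80, 0.84, 0.78, 0.86, 0.82]

mean_acc = fmean(fold_scores)
# pstdev is the population standard deviation, matching numpy's default std.
spread = 2 * pstdev(fold_scores)
print(f"Accuracy: {mean_acc:.3f} (+/- {spread:.3f})")
```

A wide spread relative to the mean is a hint that the score depends heavily on which rows land in each fold.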

### Test Prediction
With a classifier trained, we will predict survival on the test data.  This dataset will also require the same encoding and missing data treatment as the training data.

In [31]:
# Embarked fill and encoding
test_data = raw_test.copy()  # retain original since changes are in place
fill_encode_embark(test_data)
# Sex encoding
test_data['Sex'] = test_data['Sex'].map(sex_mapping)
# Age prediction
test_data['Age'] = test_data.apply(lambda x: predict_encode_age(x,
                                                                features=age_features,
                                                                clf=age_clf
                                                               ),
                                   axis=1
                                  )

Lest we forget, we need to handle the missing `Fare` value that we noticed upon inspection of the dataset.

In [32]:
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].median())
test_data['Fare'] = test_data['Fare'].apply(fare_groups)


In [33]:
test_data['Alone'] = test_data.apply(is_alone, axis=1)


To format the output for Kaggle submission, we need a CSV with the `PassengerId` and `Survived` fields.

In [34]:
test_data['Survived'] = survival_clf.predict(test_data[CLASS_FEATURES])


We'll save this data off to a file and submit it using the Kaggle API.

**Note:** to execute this part, you'll have to set up the Kaggle API
(https://github.com/Kaggle/kaggle-api)

In [35]:
fname = "prediction.csv.gz"
test_data[['PassengerId', 'Survived']].to_csv(fname,
                                              index=False,
                                              compression='gzip'
                                             )


In [36]:
msg = f"Prediction submit - {datetime.utcnow().strftime('%Y%m%d-%H:%M UTC')}"
cmd = ['kaggle',
       'competitions',
       'submit',
       'titanic',
       '-f',
       f'{fname}',
       '-m',
       f'"{msg}"'
      ]


In [37]:
sp.check_call(cmd)

0
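
A side note on the command list above: because `check_call` receives a list and no shell is involved, the quote characters in `f'"{msg}"'` are passed to the program literally rather than stripped. A minimal sketch of that behavior, using the Python interpreter itself as a stand-in child process (an assumption for illustration, not the `kaggle` CLI):

```python
import subprocess as sp
import sys

# A tiny child process that echoes its first argument back,
# standing in for the kaggle CLI.
echo = [sys.executable, "-c", "import sys; print(sys.argv[1])"]

# With list arguments there is no shell to strip quotes, so they arrive as data.
quoted = sp.run(echo + ['"my message"'], capture_output=True, text=True).stdout.strip()
plain = sp.run(echo + ["my message"], capture_output=True, text=True).stdout.strip()
print(repr(quoted), repr(plain))
```

Passing the message without the extra quotes is enough when the command is given as a list, since each list element is already delivered as a single argument.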

In [38]:
time.sleep(5)
!kaggle competitions submissions titanic | head -n 5

fileName               date                 description                               status    publicScore  privateScore  
---------------------  -------------------  ----------------------------------------  --------  -----------  ------------  
prediction.csv.gz      2018-11-06 01:11:14  "Prediction submit - 20181106-01:11 UTC"  complete  0.78947      None          
prediction.csv.gz      2018-10-21 21:43:05  "Prediction submit - 20181021-21:43 UTC"  complete  0.78947      None          
prediction.csv.gz      2018-10-21 21:41:41  "Prediction submit - 20181021-21:41 UTC"  complete  0.79904      None          


I seem to have run into a bug when trying to list my submissions:

https://github.com/Kaggle/kaggle-api/issues/108

## Conclusion

Upon manual inspection, this produces an accuracy score of `0.77033`.  This is pretty bad considering we would get a score of `0.7655` if we just classified by gender.  The gap between this score and the cross-validation accuracy of `0.822` also suggests that we've overfit.

But alas, this isn't the point of this exercise.  The point is to build a web API to provide a survival prediction.

In [39]:
sorted(list(zip(CLASS_FEATURES, survival_clf.feature_importances_)), key=lambda x: x[1], reverse=True)

[('Sex', 0.5146319696442846),
 ('Pclass', 0.15976786785150707),
 ('Fare', 0.1141523789061459),
 ('Age', 0.09136945866017947),
 ('SibSp', 0.0486020385153977),
 ('Embarked', 0.03197448406857218),
 ('Parch', 0.022806573310928786),
 ('Alone', 0.01669522904298415)]