# Titanic: Machine Learning from Disaster
This notebook is my solution to the introductory machine learning challenge on kaggle.com. It is meant to highlight the methods I have used and clarify the reasoning behind my choices while building my models.

Start by importing the libraries, loading the data and joining the test and train sets.

In [1]:
import pandas as pd 
import numpy as np
import math
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb
from sklearn.model_selection import cross_val_score

In [2]:
test_set = pd.read_csv('test.csv')
train_set = pd.read_csv('train.csv')
all_df = pd.concat((train_set.drop(columns = 'Survived'), test_set))

## Feature Engineering
The first thing that needs to be done is to check the data for any missing values. The function below presents this information for each of the features in the datasets.

In [3]:
def compute_nan_stats(df):
    numberOfNaN = df.isnull().sum().sort_values(ascending = False)
    percentage = round((numberOfNaN/len(df))*100, 2)
    return pd.concat([numberOfNaN, percentage], axis = 1, keys = ['Number', 'Percentage'])

print(compute_nan_stats(all_df))

             Number  Percentage
Cabin          1014       77.46
Age             263       20.09
Embarked          2        0.15
Fare              1        0.08
Ticket            0        0.00
Parch             0        0.00
SibSp             0        0.00
Sex               0        0.00
Name              0        0.00
Pclass            0        0.00
PassengerId       0        0.00


There are four features with missing values: Cabin, Age, Embarked and Fare. Cabin is missing most of its data, so we will discard it. We have to deal with the other three. Since Embarked is missing only 2 passengers, we can set the embarked value to the port which is the most common. 

In [4]:
all_df['Embarked'].describe()

count     1307
unique       3
top          S
freq       914
Name: Embarked, dtype: object

In [5]:
all_df['Embarked'] = all_df['Embarked'].fillna('S')

The missing fare can be imputed by using the median fare. 

In [6]:
all_df.fillna(value = all_df['Fare'].median(), inplace = True)

The missing age values need to be imputed as they are likely to have an impact on the rate of survival. The function below accomplishes this by imputing random age values drawn from a normal distribution with the mean and standard deviation of age of the training set.

In [7]:
def fill_age(df):
    # We use the train_set statistics for both the train and test sets
    mean_age = df.groupby('Pclass')['Age'].mean()
    std_age = df.groupby('Pclass')['Age'].std()
    agelist = []
    for i, passenger in df.iterrows():
        if math.isnan(passenger['Age']):
            age = round(std_age[passenger['Pclass']] * np.random.randn() + 
                        mean_age[passenger['Pclass']],1)
        else:
            age = passenger['Age']
        agelist.append(age)
    return agelist

all_df['Age'] = fill_age(all_df)

The names of individuals contain information about their social class. We can use this data after extracting this information and categorizing it. 

In [8]:
titles = []
for name in all_df['Name']:
    titles.append(name.split(',')[1].split('.')[0].strip())
title_set = set(titles)
all_df['Title'] = titles
print(title_set)

{'Mme', 'Jonkheer', 'Master', 'Miss', 'Major', 'the Countess', 'Mrs', 'Mlle', 'Capt', 'Dona', 'Lady', 'Col', 'Rev', 'Don', 'Mr', 'Sir', 'Dr', 'Ms'}


We can simplify the titles listed above into three categories - ship staff, 
nobility and commoners - and map each individuals titles into these categories as follows:

In [9]:
title_map = {'Col': 'Staff', 'Mlle': 'Commoner', 'Ms': 'Commoner', 
             'Miss': 'Commoner', 'Lady' : 'Nobility', 'Mr': 'Commoner', 
             'Mrs': 'Commoner', 'Rev': 'Staff', 'Dona': 'Nobility',
             'Capt': 'Staff', 'Sir': 'Nobility', 'the Countess': 'Nobility',
             'Major': 'Staff', 'Mme': 'Commoner', 'Dr': 'Staff', 
             'Don': 'Nobility', 'Master' : 'Commoner', 'Jonkheer': 'Nobility'} 

# Map titles into social class and replace them with dummy variables
all_df['Social Status'] = pd.Series(all_df.Title.map(title_map))
# Check if there are any missing values
all_df['Social Status'] .isnull().sum()

0

The features SibSp and Parch give the total family size when combined, so we create a new feature FamSize.

In [10]:
all_df['FamSize'] = all_df['SibSp'] + all_df['Parch']

All missing data are now filled and new features are added. All that is left to do in terms of feature engineering is removing unnecessary features and converting categorical variables to numerical ones.

In [11]:
all_df.head() # View the dataset as to not make any errors 

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,Social Status,FamSize
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,14.4542,S,Mr,Commoner,1
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,Commoner,1
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,14.4542,S,Miss,Commoner,0
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs,Commoner,1
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,14.4542,S,Mr,Commoner,0


In [12]:
all_df.drop(columns = ['PassengerId', 'Name', 'Ticket', 'SibSp', 'Parch', 
                       'Cabin', 'Title'], inplace = True)
all_df.head()

Unnamed: 0,Pclass,Sex,Age,Fare,Embarked,Social Status,FamSize
0,3,male,22.0,7.25,S,Commoner,1
1,1,female,38.0,71.2833,C,Commoner,1
2,3,female,26.0,7.925,S,Commoner,0
3,1,female,35.0,53.1,S,Commoner,1
4,3,male,35.0,8.05,S,Commoner,0


## Dealing with categorical variables
To accomplish this, both one-hot encoding and label encoding methods will be used. One-hot encoding will be employed for the Embarked feature, since this feature is non-binary and has no hiearchy structure within it that can be used to logically give values to each category. 
Label encoding will be used for the Sex and Social Status features, since the Sex feature is binary and Social Status has an internal hiearachy (Nobility > Staff > Commoners).

In [13]:
all_df['Sex'] = all_df['Sex'].map({'male': 0, 'female': 1}) 
all_df['Social Status'] = all_df['Social Status'].map({'Commoner': 0, 
                                                       'Staff': 1,
                                                       'Nobility': 2}) 
ready_data = pd.get_dummies(all_df)
ready_data.info() # Check to see everything is in order

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 9 columns):
Pclass           1309 non-null int64
Sex              1309 non-null int64
Age              1309 non-null float64
Fare             1309 non-null float64
Social Status    1309 non-null int64
FamSize          1309 non-null int64
Embarked_C       1309 non-null uint8
Embarked_Q       1309 non-null uint8
Embarked_S       1309 non-null uint8
dtypes: float64(2), int64(4), uint8(3)
memory usage: 75.4 KB


In [14]:
# Seperate the data into test and train sets once again 
X_train = ready_data[:891]
Y_train = train_set["Survived"]
X_test = ready_data[891:]

The dataset is now ready for classification algorithms. 

## Modelling
Several methods will be used to model the system and predict the survival rate of the passengers. The models will be tested using a cross-validation metric with 10 folds. Parameters for some models will be identified using sklearn's gridsearchcv method. The function below will be used to present relevant information from grid search. 

In [15]:
def grid_search_results(grid_search_model):
    df = pd.DataFrame.from_dict(grid_search_model.cv_results_['params'])
    df['mean_test_score'] = grid_search_model.cv_results_['mean_test_score']
    df['rank_test_score'] = grid_search_model.cv_results_['rank_test_score']
    print(df)
    return(df)

In [16]:
cv_accuracy = pd.Series() # Average cross-validation score of each method
predictions = pd.DataFrame() # Predictions of each model

#### Method 1: Logistic Regression

In [17]:
model = LogisticRegression()
model.fit(X_train, Y_train)
acc = cross_val_score(model, X_train, Y_train, cv = 10)
cv_accuracy['Logistic Regression'] = acc.mean() * 100
predictions['Logistic Regression'] = model.predict(X_test)

#### Method 2: Gaussian Naive Bayes

In [18]:
model = GaussianNB()
model.fit(X_train, Y_train)
acc = cross_val_score(model, X_train, Y_train, cv = 10)
cv_accuracy['Gaussian Naive Bayes'] = acc.mean() * 100
predictions['Gaussian Naive Bayes'] = model.predict(X_test)

#### Method 3: Linear Support Vector Classifier

In [19]:
model = LinearSVC(dual = False)
model.fit(X_train, Y_train)
acc = cross_val_score(model, X_train, Y_train, cv = 10)
cv_accuracy['Linear SVC'] = acc.mean() * 100
predictions['Linear SVC'] = model.predict(X_test)

#### Method 4: Decision Tree Classifier

In [20]:
model = DecisionTreeClassifier()
model.fit(X_train, Y_train)
acc = cross_val_score(model, X_train, Y_train, cv = 10)
cv_accuracy['Decision Tree'] = acc.mean() * 100
predictions['Decision Tree'] = model.predict(X_test)

#### Method 5: Random Forest Classifier

In [21]:
model = RandomForestClassifier()
parameters = {'n_estimators': [5, 10, 20, 30, 40, 50, 100, 150, 200]}
grid_search = GridSearchCV(model, parameters, cv=10)
grid_search.fit(X_train, Y_train)
# View grid search results
df = grid_search_results(grid_search)

   n_estimators  mean_test_score  rank_test_score
0             5         0.787879                9
1            10         0.796857                8
2            20         0.815937                2
3            30         0.809203                5
4            40         0.805836                7
5            50         0.810325                4
6           100         0.817059                1
7           150         0.806958                6
8           200         0.813692                3


In [22]:
# Prediction and cv score
cv_accuracy['Random Forest'] = df['mean_test_score'].max() * 100
predictions['Random Forest'] = grid_search.predict(X_test)

#### Method 6: Gradient Boosting Classifier

In [23]:
model = GradientBoostingClassifier()
# For this model, the learning_rate and n_estimators parameters 
parameters = {'learning_rate': [0.1, 0.15, 0.2, 0.25], 
              'n_estimators': [100, 150, 200, 250, 300]}
grid_search = GridSearchCV(model, parameters, cv=10)
grid_search.fit(X_train, Y_train)
# View grid search results
df = grid_search_results(grid_search)

    learning_rate  n_estimators  mean_test_score  rank_test_score
0            0.10           100         0.824916               16
1            0.10           150         0.831650                3
2            0.10           200         0.837262                1
3            0.10           250         0.829405                7
4            0.10           300         0.828283               11
5            0.15           100         0.831650                3
6            0.15           150         0.833895                2
7            0.15           200         0.827160               13
8            0.15           250         0.829405                7
9            0.15           300         0.827160               13
10           0.20           100         0.829405                7
11           0.20           150         0.830527                5
12           0.20           200         0.828283               11
13           0.20           250         0.827160               13
14        

In [24]:
# Prediction and cv score
cv_accuracy['Gradient Boosting Classifer'] = df['mean_test_score'].max() * 100
predictions['Gradient Boosting Classifer'] = grid_search.predict(X_test)

### Cross-Validation Results

In [25]:
cv_accuracy.sort_values(ascending = False)

Gradient Boosting Classifer    83.726150
Random Forest                  81.705948
Logistic Regression            79.687436
Linear SVC                     78.339065
Gaussian Naive Bayes           78.246737
Decision Tree                  78.230451
dtype: float64

The cross-validation scores indicate the Gradient Boosting Classifier has outperformed the other methods by more than one percentage point. CV scores however, don't always apply to the test scores. In fact, logistic regression got the best score out of these methods, with 79.904% accuracy, placing at the top 16% at the time of submission.