# Titanic: Machine Learning from Disaster

##### Ben Sharkey

https://www.kaggle.com/c/titanic

The Titanic machine learning challenge is something of a 'rite of passage' for the Kaggle community. It has been attempted by thousands of data scientists and analytics professionals over the last few years.

The challenge is to predict, as accurately as possible, the survival of roughly one third of the passengers in the dataset (418 of 1,309). Survival data for the other two thirds (891 passengers) is provided to build the predictive model.

This notebook outlines my methodology. I've used Python to clean the data and engineer features, and the machine learning package scikit-learn to build models that predict survival from the variables in the training dataset provided.

<img src="https://i.ytimg.com/vi/cMVi953awHQ/maxresdefault.jpg">

### Import and view data

Import required packages.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Load train and test datasets.

In [2]:
train = pd.read_csv('http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv')
test = pd.read_csv('http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv')
combined=[train, test]

View the train dataset.

In [3]:
train.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Let's see how many observations we have for each column, and the data type they have been loaded as.

In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


There are 891 rows, and the Age, Cabin and Embarked columns have missing values.

Let's now view the test dataset.

In [5]:
test.head()

,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [6]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


There are 418 rows, and three of the columns (Age, Fare and Cabin) are missing data.

### Data cleaning and feature engineering

Incomplete columns are: Age, Cabin, Embarked, Fare.

Since Embarked is only missing 2 values, we will fill these with the most frequent Embarked value.

Find the most frequent Embarked value.

In [7]:
train.groupby('Embarked').count()[['PassengerId']]

Embarked,PassengerId
C,168
Q,77
S,644


The most frequent value is 'S', so fill the missing values with 'S'.

In [8]:
for df in combined:
    df['Embarked']=df['Embarked'].fillna(value='S')
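
The most frequent value can also be looked up in code rather than read off the table. This is a minimal sketch on a toy data frame (the values and the name `toy` are made up for illustration, not taken from the Titanic data):

```python
import pandas as pd

# Toy stand-in for the train data frame (illustrative values only).
toy = pd.DataFrame({'Embarked': ['S', 'C', None, 'S', 'Q', None]})

# mode() ignores missing values and returns the most frequent value(s);
# take the first one in case of ties.
most_common = toy['Embarked'].mode()[0]

toy['Embarked'] = toy['Embarked'].fillna(most_common)
print(most_common, toy['Embarked'].tolist())
```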

Since Fare is only missing 1 value (in the test dataset), fill it with the mean fare of the train dataset, rounded to 32.2.

In [9]:
test['Fare']=test['Fare'].fillna(32.2)
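
Instead of hard-coding the mean, it could be computed from the train column. The sketch below assumes two toy data frames, `toy_train` and `toy_test` (hypothetical names with made-up fares), to show the idea:

```python
import pandas as pd

# Toy stand-ins for the train and test data frames (illustrative values only).
toy_train = pd.DataFrame({'Fare': [7.25, 71.28, 8.05, 53.10]})
toy_test = pd.DataFrame({'Fare': [7.83, None, 9.69]})

# Compute the mean on the train data only, then reuse it for the test data.
fare_mean = toy_train['Fare'].mean()
toy_test['Fare'] = toy_test['Fare'].fillna(fare_mean)
print(fare_mean, toy_test['Fare'].tolist())
```

Taking the statistic from the train data only keeps information from the test set out of the preprocessing.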

Next we will need to assign all codes and letters in the test and train data frames to integers. We need to do this so that they can be read by the machine learning algorithms we will be using later.

First let's split out the first letter of the Cabin string.

In [10]:
for df in combined:
    df['Cabin']=df.Cabin.str.extract('([A-Za-z])', expand=False)
    df.groupby('Cabin').count()[['PassengerId']]

Then fill the missing values with a unique value 'Z'.

In [11]:
for df in combined:
    df['Cabin']=df['Cabin'].fillna('Z')
    df.groupby('Cabin').count()[['PassengerId']]

Now assign an integer to each Cabin letter.

In [12]:
for df in combined:
    df['Cabin']=df['Cabin'].map({'A':1,'B':2,'C':3,'D':4,'E':5,'F':6,'G':7,'T':8,'Z':9})
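
To see what the three Cabin steps do together, here is a small sketch on made-up cabin strings (the name `toy` is hypothetical; the mapping is the same one used above):

```python
import pandas as pd

# Toy cabin values, including missing ones and a multi-cabin entry.
toy = pd.DataFrame({'Cabin': ['C85', None, 'E46', 'B57 B59', None]})

# Step 1: keep only the first letter found in the string (the deck).
deck = toy['Cabin'].str.extract('([A-Za-z])', expand=False)

# Step 2: mark missing cabins with the placeholder letter 'Z'.
deck = deck.fillna('Z')

# Step 3: map each letter to an integer code.
codes = deck.map({'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5,
                  'F': 6, 'G': 7, 'T': 8, 'Z': 9})
print(codes.tolist())
```

Note that any letter missing from the dictionary would become NaN after `map`, so the dictionary needs to cover every letter that occurs in both datasets.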

Now for the Embarked column.

In [13]:
train.groupby('Embarked').count()[['PassengerId']]

Embarked,PassengerId
C,168
Q,77
S,646


Assign an integer to each Embarked letter.

In [14]:
for df in combined:
    df['Embarked']=df['Embarked'].map({'C':1,'Q':2,'S':3})

Recode Sex as male=1, female=0, and child=2 for any passenger under 16.

In [15]:
for df in combined:
    df['Sex']=df['Sex'].map({'male':1,'female':0}).astype(int)
    df.loc[(df['Age']<16),'Sex']=2
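
One detail worth checking is what happens to passengers with a missing Age at this point. A quick sketch on toy rows (made-up values, hypothetical name `toy`) suggests they keep their male/female code, because a comparison with NaN evaluates to False:

```python
import numpy as np
import pandas as pd

# Toy rows: an adult, a child, a passenger with unknown age, and a 15-year-old.
toy = pd.DataFrame({'Sex': ['male', 'female', 'male', 'female'],
                    'Age': [30.0, 8.0, np.nan, 15.9]})

toy['Sex'] = toy['Sex'].map({'male': 1, 'female': 0}).astype(int)

# NaN < 16 is False, so rows with a missing age are not recoded as children.
toy.loc[toy['Age'] < 16, 'Sex'] = 2
print(toy['Sex'].tolist())
```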

Drop the Ticket column as it does not appear to contain useful information.

In [16]:
for df in combined:
    df.drop('Ticket',axis=1,inplace=True)

Let's view the train data frame to see what we have so far.

In [17]:
train.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,7.25,9,3
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,71.2833,3,1
2,3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,7.925,9,3
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,53.1,3,3
4,5,0,3,"Allen, Mr. William Henry",1,35.0,0,0,8.05,9,3


It appears that a person's title may influence their chances of survival. Let's create a variable called Title.

Extract the title from Name column.

In [18]:
for df in combined:
    # Raw string so that '\.' is read as a regex escape for a literal dot.
    df['Title'] = df.Name.str.extract(r'([A-Za-z]+)\.',expand=False)
    df[['Title']].describe()
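
As a quick illustration of the pattern, here it is applied to three names from the train data shown earlier. The pattern captures the first run of letters that is immediately followed by a full stop:

```python
import pandas as pd

# Three names taken from the head of the train data above.
names = pd.Series(['Braund, Mr. Owen Harris',
                   'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
                   'Heikkinen, Miss. Laina'])

# Capture the letters directly before the first '.' in each name.
titles = names.str.extract(r'([A-Za-z]+)\.', expand=False)
print(titles.tolist())
```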

Number of passengers with each title in the test and train datasets.

In [19]:
train.groupby('Title').count()[['PassengerId']]

Title,PassengerId
Capt,1
Col,2
Countess,1
Don,1
Dr,7
Jonkheer,1
Lady,1
Major,2
Master,40
Miss,182


In [20]:
test.groupby('Title').count()[['PassengerId']]

Title,PassengerId
Col,2
Dona,1
Dr,1
Master,21
Miss,78
Mr,240
Mrs,72
Ms,1
Rev,2


Survival by each title.

In [21]:
train.groupby('Title').mean().sort_values(by='Survived',ascending=False)[['Survived']]

Title,Survived
Sir,1.0
Countess,1.0
Ms,1.0
Mme,1.0
Lady,1.0
Mlle,1.0
Mrs,0.792
Miss,0.697802
Master,0.575
Col,0.5


Map all titles to one of: Mr, Miss, Mrs, Master, Special.

In [22]:
for df in combined:
    df['Title']=df['Title'].replace(['Mlle','Ms'],'Miss')
    # 'Mme' (Madame) is grouped with 'Master' here; 'Mrs' would be the closer
    # match, but changing it would alter the survival rates shown below.
    df['Title']=df['Title'].replace(['Mme'],'Master')
    df['Title']=df['Title'].replace(['Dr','Rev','Major','Col','Capt','Lady','Jonkheer','Don','Dona','Countess','Sir'],'Special')

Survival rates by new title categories.

In [23]:
train.groupby('Title').mean().sort_values(by='Survived',ascending=False)[['Survived']]

Title,Survived
Mrs,0.792
Miss,0.702703
Master,0.585366
Special,0.347826
Mr,0.156673


Assign titles to integers.

In [24]:
for df in combined:
    df['Title']=df['Title'].map({'Mrs':1,'Miss':2,'Master':3,'Special':4,'Mr':5}).astype(int)

Drop the Name column. PassengerId is kept because it is needed for the submission file.

In [25]:
for df in combined:
    df.drop('Name',axis=1,inplace=True)

Check data types in the dataframe are all floats and integers and check for missing data.

In [26]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null int32
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Cabin          891 non-null int64
Embarked       891 non-null int64
Title          891 non-null int32
dtypes: float64(2), int32(2), int64(7)
memory usage: 69.7 KB


We need to fill the missing ages. Assign the mean of all ages (approximately 29.7) to the missing Age values.

In [27]:
for df in combined:
    df['Age']=df['Age'].fillna(29.7)
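
A single mean ignores the fact that, for example, passengers titled Master are young boys. One possible refinement, which is not what this notebook does, is to fill each missing age with the median age of passengers sharing the same Title code. A sketch on toy rows (made-up values, hypothetical name `toy`):

```python
import numpy as np
import pandas as pd

# Toy rows using the integer Title codes from above (5 = Mr, 3 = Master, 2 = Miss).
toy = pd.DataFrame({'Title': [5, 5, 5, 3, 3, 2],
                    'Age': [30.0, np.nan, 40.0, 4.0, np.nan, 22.0]})

# Median age per title, aligned row by row with the original frame.
group_median = toy.groupby('Title')['Age'].transform('median')

# Only the missing ages are replaced.
toy['Age'] = toy['Age'].fillna(group_median)
print(toy['Age'].tolist())
```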

Check both train and test datasets are complete and numerical.

In [28]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null int32
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Cabin          891 non-null int64
Embarked       891 non-null int64
Title          891 non-null int32
dtypes: float64(2), int32(2), int64(7)
memory usage: 69.7 KB


In [29]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 10 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Sex            418 non-null int32
Age            418 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Fare           418 non-null float64
Cabin          418 non-null int64
Embarked       418 non-null int64
Title          418 non-null int32
dtypes: float64(2), int32(2), int64(6)
memory usage: 29.5 KB


### Relationship between variables and survival

Let's now look at survival rates by different variables.

In [30]:
# Pclass
train[['Survived','Pclass']].groupby('Pclass').mean()

Pclass,Survived
1,0.62963
2,0.472826
3,0.242363


In [31]:
# male/female/child
train[['Survived','Sex']].groupby('Sex').mean()

Sex,Survived
0,0.756458
1,0.163873
2,0.590361


In [32]:
# SibSp
train[['Survived','SibSp']].groupby('SibSp').mean()

SibSp,Survived
0,0.345395
1,0.535885
2,0.464286
3,0.25
4,0.166667
5,0.0
8,0.0


In [33]:
# Parch
train[['Survived','Parch']].groupby('Parch').mean()

Parch,Survived
0,0.343658
1,0.550847
2,0.5
3,0.6
4,0.0
5,0.2
6,0.0


It appears that the chance of survival is higher for passengers in a higher class (a lower Pclass number), for females and children, for SibSp==1, and for Parch between 1 and 3.

### Create predictive models

Create a train/test split from the training data so that we can test the effectiveness of each predictive model. Scikit-learn's train_test_split function randomly splits a dataset.

In [34]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    train.drop('Survived',axis=1), train['Survived'],
    test_size=0.33, random_state=42)
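
To get a feel for the split sizes, the sketch below runs the same call on a dummy array with 891 rows (the arrays `X` and `y` are placeholders, not the Titanic features). With test_size=0.33 the held-out part has 295 rows, which matches the support total in the classification reports below:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data with the same number of rows as the train dataset.
X = np.arange(891).reshape(-1, 1)
y = np.zeros(891, dtype=int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=42)
print(len(X_tr), len(X_te))
```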

Now that our training dataset is split into two, we can try some different machine learning methods and see how well they perform by printing a classification report for each.
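
For reference, the columns of a classification report can be reproduced by hand. This is a small sketch on made-up labels (`y_true` and `y_pred` are toy lists) that computes precision, recall and F1 for class 1 and cross-checks them against scikit-learn:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels: 1 = survived, 0 = did not survive.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)   # of those predicted to survive, how many did
recall = tp / (tp + fn)      # of those who survived, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```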

##### Logistic Regression

In [35]:
from sklearn.linear_model import LogisticRegression

model1=LogisticRegression(random_state=42)

model1.fit(X_train,y_train)

prediction1=model1.predict(X_test)

In [36]:
from sklearn.metrics import classification_report

print(classification_report(y_test,prediction1))

             precision    recall  f1-score   support

          0       0.83      0.87      0.85       175
          1       0.79      0.73      0.76       120

avg / total       0.81      0.81      0.81       295



##### Decision Tree

In [37]:
from sklearn.tree import DecisionTreeClassifier

model2=DecisionTreeClassifier(random_state=42)

model2.fit(X_train,y_train)

prediction2=model2.predict(X_test)

print(classification_report(y_test,prediction2))

             precision    recall  f1-score   support

          0       0.82      0.86      0.84       175
          1       0.78      0.72      0.75       120

avg / total       0.80      0.80      0.80       295



The decision tree performed slightly worse than logistic regression!

##### Random Forest

In [38]:
from sklearn.ensemble import RandomForestClassifier

model3=RandomForestClassifier(n_estimators=1000,max_features=3,oob_score=True,random_state=42)

model3.fit(X_train,y_train)

prediction3=model3.predict(X_test)

print(classification_report(y_test,prediction3))

             precision    recall  f1-score   support

          0       0.83      0.90      0.87       175
          1       0.84      0.73      0.78       120

avg / total       0.83      0.83      0.83       295



The random forest outperformed both logistic regression and decision tree models. Almost there!
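
Since the forest is built with oob_score=True, scikit-learn also stores an out-of-bag accuracy estimate and per-feature importances on the fitted model. The sketch below shows how these could be read, using synthetic data from make_classification rather than the Titanic features (so the numbers it prints say nothing about this notebook's model):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 300 rows, 6 features.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=42)

forest = RandomForestClassifier(n_estimators=200, max_features=3,
                                oob_score=True, random_state=42)
forest.fit(X, y)

# Accuracy estimated on the samples each tree did not see during training.
print(forest.oob_score_)

# Relative importance of each feature; the values sum to 1.
print(forest.feature_importances_)
```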

<img src="http://images2.fanpop.com/images/photos/4300000/Jack-and-Rose-jack-and-rose-4381715-500-281.jpg">

### Create file for Kaggle submission

Now that we have determined the random forest model to be the best performing, we will apply it to the test dataset and generate a csv file to submit to Kaggle.

In [39]:
predictionsub=model3.predict(test)

submission=pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':predictionsub})

Check that output has 418 rows and the headers PassengerId and Survived.

In [40]:
submission.describe()

,PassengerId,Survived
count,418.0,418.0
mean,1100.5,0.34689
std,120.810458,0.476551
min,892.0,0.0
25%,996.25,0.0
50%,1100.5,0.0
75%,1204.75,1.0
max,1309.0,1.0
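
The same checks could be written as explicit assertions, so that a malformed file fails loudly. This is a sketch on a three-row toy frame (`toy_sub` and `expected_rows` are hypothetical names; for the real submission the expected row count would be 418):

```python
import pandas as pd

# Toy stand-in for the submission data frame.
toy_sub = pd.DataFrame({'PassengerId': [892, 893, 894],
                        'Survived': [0, 1, 0]})
expected_rows = 3  # would be 418 for the real test dataset

assert list(toy_sub.columns) == ['PassengerId', 'Survived']
assert len(toy_sub) == expected_rows
assert toy_sub['PassengerId'].is_unique           # one row per passenger
assert toy_sub['Survived'].isin([0, 1]).all()     # predictions are 0 or 1
print('submission looks ok')
```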


All ok. Now create the csv file.

In [41]:
submission.to_csv('titanic.csv',index=False)

The submission scored 0.78469 on Kaggle.

This is in the top 1/3rd of all entrants.

### Ideas for improvement

- create family size category based on Parch and SibSp
- fill missing ages with machine learning predictions
- fill missing fare with machine learning predictions
- further optimise random forest to prevent overfitting
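
As a starting point for the first idea, a family size feature could look something like the sketch below. The column names `FamilySize` and `FamilyCat` and the bin edges are my own assumptions, shown on toy rows:

```python
import pandas as pd

# Toy rows with made-up SibSp and Parch values.
toy = pd.DataFrame({'SibSp': [1, 0, 4, 0],
                    'Parch': [0, 0, 2, 1]})

# Family size counts siblings/spouses, parents/children and the passenger.
toy['FamilySize'] = toy['SibSp'] + toy['Parch'] + 1

# Assumed categories: 0 = alone, 1 = small family (2-4), 2 = large family (5+).
toy['FamilyCat'] = pd.cut(toy['FamilySize'], bins=[0, 1, 4, 20],
                          labels=[0, 1, 2]).astype(int)
print(toy)
```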