# Titanic: Machine Learning from Disaster

##### Ben Sharkey

This is my notebook for the Kaggle Titanic machine learning challenge: https://www.kaggle.com/c/titanic

I've used machine learning in Python to predict survivors based on the variables in the training dataset provided. 

### Import and view data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
# load train and test datasets

train=pd.read_csv('http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv')
test=pd.read_csv('http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv')
combined=[train,test]

In [3]:
# view the train dataset

train.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
train.info()
train.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [5]:
# view the test dataset

test.head()

,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [6]:
test.info()
test.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


,PassengerId,Pclass,Age,SibSp,Parch,Fare
count,418.0,418.0,332.0,418.0,418.0,417.0
mean,1100.5,2.26555,30.27259,0.447368,0.392344,35.627188
std,120.810458,0.841838,14.181209,0.89676,0.981429,55.907576
min,892.0,1.0,0.17,0.0,0.0,0.0
25%,996.25,1.0,21.0,0.0,0.0,7.8958
50%,1100.5,3.0,27.0,0.0,0.0,14.4542
75%,1204.75,3.0,39.0,1.0,0.0,31.5
max,1309.0,3.0,76.0,8.0,9.0,512.3292


### Data cleaning and feature engineering

The incomplete columns are Age, Cabin, Embarked and Fare.

Since Embarked is missing only 2 values (both in the train dataset), fill these with the most frequent value.

In [7]:
# find the most frequent value

train.groupby('Embarked').count()[['PassengerId']]

Embarked,PassengerId
C,168
Q,77
S,644


The most frequent value is 'S', so fill the missing values with 'S'.

In [8]:
for df in combined:
    df['Embarked']=df['Embarked'].fillna(value='S')
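
The most frequent value can also be looked up in code instead of being read off the table. A minimal sketch on a made-up stand-in for the Embarked column (the `toy` frame and its values are assumptions for illustration, not the Kaggle data):

```python
import pandas as pd

# Toy stand-in for the Embarked column (values are made up for illustration)
toy = pd.DataFrame({'Embarked': ['S', 'C', 'S', None, 'Q', 'S', None]})

# value_counts() ignores missing values; idxmax() returns the most frequent label
counts = toy['Embarked'].value_counts()
most_common = counts.idxmax()

# fill the gaps with that label
toy['Embarked'] = toy['Embarked'].fillna(most_common)

print(most_common)
print(toy['Embarked'].isna().sum())
```

This avoids hard-coding 'S' if the input data ever changes.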

In [9]:
# Fare is missing only 1 value (in the test dataset), so fill it with the mean Fare of the train dataset (32.2)

test['Fare']=test['Fare'].fillna(32.2)
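
The constant 32.2 is the rounded mean of Fare in the train dataset. A small sketch of computing the fill value rather than typing it in, using made-up `toy_train` and `toy_test` frames (both are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up fares standing in for the train and test datasets
toy_train = pd.DataFrame({'Fare': [7.25, 71.28, 8.05, 53.10]})
toy_test = pd.DataFrame({'Fare': [7.83, np.nan, 9.69]})

# take the statistic from the training data only, then apply it to the test data
fare_mean = toy_train['Fare'].mean()
toy_test['Fare'] = toy_test['Fare'].fillna(fare_mean)

print(fare_mean)
print(toy_test)
```

Taking the statistic from the train dataset only keeps information from the test set out of the preprocessing.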

In [10]:
# assign Cabin to integers based on first letter then fill missing with a unique integer

# split first letter from Cabin strings

for df in combined:
    df['Cabin']=df.Cabin.str.extract('([A-Za-z])', expand=False)
    df.groupby('Cabin').count()[['PassengerId']]

In [11]:
for df in combined:
    df['Cabin']=df['Cabin'].fillna('Z')
    df.groupby('Cabin').count()[['PassengerId']]

In [12]:
# assign an integer to each Cabin letter

for df in combined:
    df['Cabin']=df['Cabin'].map({'A':1,'B':2,'C':3,'D':4,'E':5,'F':6,'G':7,'T':8,'Z':9})
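
The three Cabin steps (extract the first letter, fill missing values with 'Z', map letters to integers) can be tried end to end on a few made-up cabin strings. This is only a sketch: the `toy` frame and the `Deck`/`DeckCode` column names are assumptions, while the letter-to-integer mapping is the one used above.

```python
import pandas as pd

# Made-up cabin strings, including missing values and a multi-cabin entry
toy = pd.DataFrame({'Cabin': ['C85', None, 'E46', 'B57 B59', None]})

deck_codes = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7, 'T': 8, 'Z': 9}

# first letter found in the string; missing cabins become the placeholder 'Z'
toy['Deck'] = toy['Cabin'].str.extract('([A-Za-z])', expand=False).fillna('Z')

# any letter that is absent from the dictionary would become NaN here
toy['DeckCode'] = toy['Deck'].map(deck_codes)

print(toy)
```

Note that the integer codes impose an ordering on the cabin letters, which tree-based models tolerate better than linear ones.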

In [13]:
# check Embarked column

train.groupby('Embarked').count()[['PassengerId']]

Embarked,PassengerId
C,168
Q,77
S,646


In [14]:
# assign an integer to each Embarked letter

for df in combined:
    df['Embarked']=df['Embarked'].map({'C':1,'Q':2,'S':3})

In [15]:
# encode Sex as male=1, female=0, then mark passengers younger than 16 as child=2

for df in combined:
    df['Sex']=df['Sex'].map({'male':1,'female':0}).astype(int)
    df.loc[(df['Age']<16),'Sex']=2
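
A small sketch of how this encoding behaves, on made-up rows (the `toy` frame and the `Group` column name are assumptions). A comparison against a missing Age evaluates to False, so a passenger without a recorded age keeps the adult code:

```python
import numpy as np
import pandas as pd

# Made-up passengers: an adult male, a girl, a male with unknown age, a 15.9-year-old female
toy = pd.DataFrame({'Sex': ['male', 'female', 'male', 'female'],
                    'Age': [30.0, 8.0, np.nan, 15.9]})

# male=1, female=0
toy['Group'] = toy['Sex'].map({'male': 1, 'female': 0}).astype(int)

# anyone with a known age below 16 becomes child=2; NaN < 16 is False
toy.loc[toy['Age'] < 16, 'Group'] = 2

print(toy)
```

Because this cell runs before the missing ages are filled, children with no recorded age are not marked as children.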

In [16]:
# drop Ticket column as it does not appear to contain useful information

for df in combined:
    df.drop('Ticket',axis=1,inplace=True)

In [17]:
train.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,7.25,9,3
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,71.2833,3,1
2,3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,7.925,9,3
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,53.1,3,3
4,5,0,3,"Allen, Mr. William Henry",1,35.0,0,0,8.05,9,3


In [18]:
# extract title from Name column

for df in combined:
    df['Title'] = df.Name.str.extract(r'([A-Za-z]+)\.',expand=False)
    df[['Title']].describe()
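
The pattern captures the first run of letters that is immediately followed by a period, which in this dataset's "Surname, Title. Given names" layout is the title. A quick check on names taken from the tables above (the `names` and `titles` variables are illustrative):

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
    'Kelly, Mr. James',
])

# raw string so that \. is passed to the regex engine unchanged
titles = names.str.extract(r'([A-Za-z]+)\.', expand=False)

print(titles.tolist())  # ['Mr', 'Miss', 'Mrs', 'Mr']
```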

In [19]:
# number of passengers with each title in test and train datasets

train.groupby('Title').count()[['PassengerId']]

Title,PassengerId
Capt,1
Col,2
Countess,1
Don,1
Dr,7
Jonkheer,1
Lady,1
Major,2
Master,40
Miss,182


In [20]:
test.groupby('Title').count()[['PassengerId']]

Title,PassengerId
Col,2
Dona,1
Dr,1
Master,21
Miss,78
Mr,240
Mrs,72
Ms,1
Rev,2


In [21]:
# survival by each title

train.groupby('Title')[['Survived']].mean().sort_values(by='Survived',ascending=False)

Title,Survived
Sir,1.0
Countess,1.0
Ms,1.0
Mme,1.0
Lady,1.0
Mlle,1.0
Mrs,0.792
Miss,0.697802
Master,0.575
Col,0.5


In [22]:
# map all titles to one of: Mr, Miss, Mrs, Master, Special

for df in combined:
    df['Title']=df['Title'].replace(['Mlle','Ms'],'Miss')
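    # note: 'Mme' (Madame) is the French equivalent of 'Mrs'; it is grouped with 'Master' here,
    # and the survival rates and model results below reflect that grouping
    df['Title']=df['Title'].replace(['Mme'],'Master')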
    df['Title']=df['Title'].replace(['Dr','Rev','Major','Col','Capt','Lady','Jonkheer','Don','Dona','Countess','Sir'],'Special')

In [23]:
# survival rates by new title categories

train.groupby('Title')[['Survived']].mean().sort_values(by='Survived',ascending=False)

Title,Survived
Mrs,0.792
Miss,0.702703
Master,0.585366
Special,0.347826
Mr,0.156673


In [24]:
# assign titles to integers

for df in combined:
    df['Title']=df['Title'].map({'Mrs':1,'Miss':2,'Master':3,'Special':4,'Mr':5}).astype(int)
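
`map` turns any title missing from the dictionary into NaN, and `.astype(int)` then raises an error. A minimal sketch of checking coverage before casting, on a made-up list of titles (the `toy_titles` series and `title_codes` name are assumptions; the codes are the ones used above):

```python
import pandas as pd

title_codes = {'Mrs': 1, 'Miss': 2, 'Master': 3, 'Special': 4, 'Mr': 5}

# Made-up titles standing in for the consolidated Title column
toy_titles = pd.Series(['Mr', 'Mrs', 'Special', 'Miss', 'Master', 'Mr'])

# titles present in the data but absent from the mapping
unmapped = set(toy_titles.unique()) - set(title_codes)
print(unmapped)

# safe to cast only when nothing is unmapped
codes = toy_titles.map(title_codes).astype(int)
print(codes.tolist())
```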

In [25]:
# drop Name column (PassengerId is kept because it is needed for the submission file)

for df in combined:
    df.drop('Name',axis=1,inplace=True)

In [26]:
# check data types in dataframe are all floats and integers and check for missing data

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null int32
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Cabin          891 non-null int64
Embarked       891 non-null int64
Title          891 non-null int32
dtypes: float64(2), int32(2), int64(7)
memory usage: 69.7 KB


In [27]:
# for a first run, fill missing Age values with the mean age of the train dataset (29.7)

# a later refinement is to fill missing ages with machine learning predictions

for df in combined:
    df['Age']=df['Age'].fillna(29.7)
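
One way the machine learning refinement could look is to train a regressor on the rows where Age is known and predict it for the rows where it is missing. This is a sketch under assumptions, not what this notebook ran: the data is synthetic, and the choice of `RandomForestRegressor` and of the four predictor columns is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: Age loosely depends on Pclass, plus noise
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'Pclass': rng.integers(1, 4, n),
    'SibSp': rng.integers(0, 3, n),
    'Parch': rng.integers(0, 3, n),
    'Fare': rng.uniform(5, 100, n),
})
toy['Age'] = 20 + 5 * (3 - toy['Pclass']) + rng.normal(0, 3, n)

# knock out 40 ages to imitate the missing values
toy.loc[rng.choice(n, 40, replace=False), 'Age'] = np.nan

features = ['Pclass', 'SibSp', 'Parch', 'Fare']
known = toy['Age'].notna()

# fit on rows with a known age, predict for the rest
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(toy.loc[known, features], toy.loc[known, 'Age'])
toy.loc[~known, 'Age'] = reg.predict(toy.loc[~known, features])

print(toy['Age'].isna().sum())
```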

In [28]:
# check both train and test datasets are complete and numerical

train.info()
train.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null int32
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Cabin          891 non-null int64
Embarked       891 non-null int64
Title          891 non-null int32
dtypes: float64(2), int32(2), int64(7)
memory usage: 69.7 KB


,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,Title
count,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,0.789001,29.699293,0.523008,0.381594,32.204208,7.716049,2.536476,3.698092
std,257.353842,0.486592,0.836071,0.594291,13.002015,1.102743,0.806057,49.693429,2.460739,0.791503,1.622104
min,1.0,0.0,1.0,0.0,0.42,0.0,0.0,0.0,1.0,1.0,1.0
25%,223.5,0.0,2.0,0.0,22.0,0.0,0.0,7.9104,9.0,2.0,2.0
50%,446.0,0.0,3.0,1.0,29.7,0.0,0.0,14.4542,9.0,3.0,5.0
75%,668.5,1.0,3.0,1.0,35.0,1.0,0.0,31.0,9.0,3.0,5.0
max,891.0,1.0,3.0,2.0,80.0,8.0,6.0,512.3292,9.0,3.0,5.0


In [29]:
test.info()
test.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 10 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Sex            418 non-null int32
Age            418 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Fare           418 non-null float64
Cabin          418 non-null int64
Embarked       418 non-null int64
Title          418 non-null int32
dtypes: float64(2), int32(2), int64(6)
memory usage: 29.5 KB


,PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,Title
count,418.0,418.0,418.0,418.0,418.0,418.0,418.0,418.0,418.0,418.0
mean,1100.5,2.26555,0.744019,30.154785,0.447368,0.392344,35.618989,7.758373,2.401914,3.629187
std,120.810458,0.841838,0.586846,12.636659,0.89676,0.981429,55.840752,2.443901,0.854496,1.673266
min,892.0,1.0,0.0,0.17,0.0,0.0,0.0,1.0,1.0,1.0
25%,996.25,1.0,0.0,23.0,0.0,0.0,7.8958,9.0,2.0,2.0
50%,1100.5,3.0,1.0,29.7,0.0,0.0,14.4542,9.0,3.0,5.0
75%,1204.75,3.0,1.0,35.75,1.0,0.0,31.5,9.0,3.0,5.0
max,1309.0,3.0,2.0,76.0,8.0,9.0,512.3292,9.0,3.0,5.0


### Relationship between variables and survival

Average survival rate by category:

In [30]:
# Pclass
train[['Survived','Pclass']].groupby('Pclass').mean()

Pclass,Survived
1,0.62963
2,0.472826
3,0.242363


In [31]:
# male/female/child
train[['Survived','Sex']].groupby('Sex').mean()

Sex,Survived
0,0.756458
1,0.163873
2,0.590361


In [32]:
# SibSp
train[['Survived','SibSp']].groupby('SibSp').describe()

SibSp,,Survived
0,count,608.0
0,mean,0.345395
0,std,0.475888
0,min,0.0
0,25%,0.0
0,50%,0.0
0,75%,1.0
0,max,1.0
1,count,209.0
1,mean,0.535885


In [33]:
# Parch
train[['Survived','Parch']].groupby('Parch').describe()

Parch,,Survived
0,count,678.0
0,mean,0.343658
0,std,0.475279
0,min,0.0
0,25%,0.0
0,50%,0.0
0,75%,1.0
0,max,1.0
1,count,118.0
1,mean,0.550847


It appears that an increased chance of survival is indicated by: a higher passenger class (Pclass 1 rather than 3), being female, SibSp==1 rather than 0, and Parch==1 rather than 0.
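
When comparing survival rates across categories, it helps to see the group size next to the mean, since a rate based on a handful of passengers is less reliable. A minimal sketch on made-up rows (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Made-up passengers: survival flag and passenger class
toy = pd.DataFrame({'Survived': [1, 0, 1, 1, 0, 0, 1, 0],
                    'Pclass':   [1, 1, 1, 2, 2, 3, 3, 3]})

# survival rate and number of passengers per class in one table
summary = toy.groupby('Pclass')['Survived'].agg(['mean', 'count'])

print(summary)
```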

### Create predictive models

In [34]:
# create a train/test split from the training data

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(train.drop('Survived',axis=1), train['Survived'], test_size=0.33, random_state=42)

##### Logistic Regression

In [35]:
from sklearn.linear_model import LogisticRegression

model1=LogisticRegression(random_state=42)

model1.fit(X_train,y_train)

prediction1=model1.predict(X_test)

In [36]:
from sklearn.metrics import classification_report

print(classification_report(y_test,prediction1))

             precision    recall  f1-score   support

          0       0.83      0.87      0.85       175
          1       0.79      0.73      0.76       120

avg / total       0.81      0.81      0.81       295
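
The precision and recall columns of the report come from the confusion matrix: precision is TP / (TP + FP) and recall is TP / (TP + FN) for the class in question. A small sketch on made-up labels (`y_true` and `y_pred` here are illustrative, not the notebook's predictions):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up true labels and predictions for ten passengers
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 1]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]

# for binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of those predicted to survive, how many did
recall = tp / (tp + fn)     # of those who survived, how many were found

print(tn, fp, fn, tp)
print(precision, recall)
```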



##### Decision Tree

In [37]:
from sklearn.tree import DecisionTreeClassifier

model2=DecisionTreeClassifier(random_state=42)

model2.fit(X_train,y_train)

prediction2=model2.predict(X_test)

print(classification_report(y_test,prediction2))

             precision    recall  f1-score   support

          0       0.82      0.86      0.84       175
          1       0.78      0.72      0.75       120

avg / total       0.80      0.80      0.80       295



In [38]:
# the decision tree performed slightly worse than logistic regression (average f1-score 0.80 versus 0.81)
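
A difference of 0.01 on a single split of 295 rows is small, so it may be worth comparing the models with cross-validation. The sketch below is illustrative only: it uses synthetic data from `make_classification` in place of the Titanic features, and 5 folds is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature matrix and the Survived column
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

models = {
    'logistic': LogisticRegression(max_iter=1000),
    'tree': DecisionTreeClassifier(random_state=42),
}

# accuracy on each of 5 folds, per model
cv_scores = {name: cross_val_score(m, X, y, cv=5, scoring='accuracy')
             for name, m in models.items()}

for name, s in cv_scores.items():
    print(f'{name}: {s.mean():.3f} +/- {s.std():.3f}')
```

If the fold-to-fold spread is larger than the gap between the means, the ranking of the two models is not well established.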

##### Random Forest

In [39]:
from sklearn.ensemble import RandomForestClassifier

model3=RandomForestClassifier(n_estimators=1000,max_features=3,oob_score=True,random_state=42)

model3.fit(X_train,y_train)

prediction3=model3.predict(X_test)

print(classification_report(y_test,prediction3))

             precision    recall  f1-score   support

          0       0.83      0.90      0.87       175
          1       0.84      0.73      0.78       120

avg / total       0.83      0.83      0.83       295



Random forest outperformed both the logistic regression and decision tree models (average f1-score 0.83 versus 0.81 and 0.80).
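
The forest above is built with `oob_score=True`, which makes scikit-learn store an out-of-bag accuracy estimate in the fitted model's `oob_score_` attribute. A minimal sketch of reading it, together with the feature importances, on synthetic data (the data and the smaller `n_estimators=200` are assumptions to keep the example quick):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training features and labels
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

rf = RandomForestClassifier(n_estimators=200, max_features=3,
                            oob_score=True, random_state=42)
rf.fit(X, y)

# accuracy estimated on the samples each tree did not see during training
print(round(rf.oob_score_, 3))

# importances sum to 1; larger values mean the feature was used for more useful splits
for idx, imp in sorted(enumerate(rf.feature_importances_), key=lambda t: -t[1]):
    print(idx, round(imp, 3))
```

The out-of-bag estimate gives a second opinion on generalisation without giving up any training rows.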

### Create file for Kaggle submission

In [40]:
# create file with PassengerId and Survived based on random forest model

predictionsub=model3.predict(test)

submission=pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':predictionsub})

# check that output has 418 rows and headers

submission.describe()

,PassengerId,Survived
count,418.0,418.0
mean,1100.5,0.34689
std,120.810458,0.476551
min,892.0,0.0
25%,996.25,0.0
50%,1100.5,0.0
75%,1204.75,1.0
max,1309.0,1.0


In [41]:
submission.to_csv('titanic.csv',index=False)

The submission scored 0.78469, which placed it in approximately the top third of all entrants.

### Ideas for improvement

- create family size category based on Parch and SibSp
- fill missing ages with machine learning predictions
- fill missing fare with machine learning predictions
- further optimise random forest to prevent overfitting
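
As a starting point for the first idea, family size can be taken as SibSp + Parch + 1 (the passenger themselves) and then binned into categories. A sketch on made-up rows: the `FamilySize` and `FamilyCat` names and the bin edges (alone, 2 to 4, 5 or more) are assumptions, not values tuned on the data.

```python
import pandas as pd

# Made-up passengers
toy = pd.DataFrame({'SibSp': [1, 0, 3, 0], 'Parch': [0, 0, 2, 1]})

# relatives aboard plus the passenger themselves
toy['FamilySize'] = toy['SibSp'] + toy['Parch'] + 1

# 0 = alone, 1 = family of 2 to 4, 2 = family of 5 or more
toy['FamilyCat'] = pd.cut(toy['FamilySize'], bins=[0, 1, 4, 20],
                          labels=[0, 1, 2]).astype(int)

print(toy)
```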