# Titanic: Machine Learning from Disaster

Competition: https://www.kaggle.com/c/titanic

In this notebook I explore a few engineered features and use a Random Forest to make my predictions.

Many of these features were inspired by this notebook:
https://www.kaggle.com/omanekodie/titanic/titanic-random-forest-82-78/notebook

### Understanding the Question

Given certain characteristics of each passenger, can we predict whether they survived the sinking?
The Titanic data set contains 10 feature columns (x values) we can use to make predictions. The labels (y values) are binary: 1 means the passenger survived and 0 means they did not.

This is therefore a supervised classification problem.
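To give the accuracy numbers later in the notebook something to be compared against, here is a minimal sketch of the simplest possible classifier: always predict the most common label. The labels below are made up for illustration (they are not the Kaggle data), and `majority_baseline_accuracy` is a hypothetical helper name.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Made-up labels for illustration only (1 = survived, 0 = did not survive).
toy_labels = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
baseline = majority_baseline_accuracy(toy_labels)
print(baseline)
```

Running the same helper on `train_df['Survived']` would give the bar any real model has to clear.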

### Getting Started

The data is available from Kaggle at https://www.kaggle.com/c/titanic/data. We use train.csv to build the model, as it contains both the features and the labels. test.csv contains only the features; we must predict its labels to submit to the Kaggle competition.

In [5]:
import pandas as pd
import numpy as np


train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

In [9]:
def get_title(name):
    '''(string) -> string

    Given a name from the dataframe in "Surname, Title. Given names" form,
    return the first word after the comma (normally the title, e.g. 'Mrs.').
    '''
    comma_list = name.split(",")
    front_part_of_name = comma_list[1].split()
    return front_part_of_name[0]

#Test Function
get_title('Beckwith, Mrs. Richard Leonard (Sallie Monypeny)')

'Mrs.'

In [19]:
#Engineer Features
def add_features(df):
    #Family Size: relatives aboard (parents/children + siblings/spouses),
    #not counting the passenger, so 0 means travelling alone
    df['FamilySize'] = df['Parch'] + df['SibSp']
    #Title
    df['Title'] = df['Name'].apply(get_title)
    #Length of Name
    df['NameLength'] = df['Name'].apply(len)
    #First Character of Ticket (a letter or a digit)
    df['TicketFirstChar'] = df['Ticket'].apply(lambda x: str(x)[0])
    return df

train_df = add_features(train_df)
test_df = add_features(test_df)

train_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | FamilySize | Title | NameLength | TicketFirstChar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 1 | Mr. | 23 | A |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 | Mrs. | 51 | P |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 0 | Miss. | 22 | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | 1 | Mrs. | 44 | 1 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S | 0 | Mr. | 24 | 3 |
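The same four features can also be computed with pandas' vectorized string methods instead of `apply`. This is only a sketch of an alternative, not what the notebook runs: `add_features_vectorized` is a hypothetical name, and the two sample rows are copied from the `head()` output above. It returns a new frame rather than modifying its input.

```python
import pandas as pd

# Two rows copied from the head() output above.
sample = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina'],
    'SibSp': [1, 0],
    'Parch': [0, 0],
    'Ticket': ['A/5 21171', 'STON/O2. 3101282'],
})


def add_features_vectorized(df):
    """Return a copy of df with the four engineered columns added."""
    return df.assign(
        FamilySize=df['Parch'] + df['SibSp'],
        # First whitespace-separated token after the comma, as in get_title.
        Title=df['Name'].str.split(',').str[1].str.split().str[0],
        NameLength=df['Name'].str.len(),
        TicketFirstChar=df['Ticket'].astype(str).str[0],
    )


featured = add_features_vectorized(sample)
print(featured[['FamilySize', 'Title', 'NameLength', 'TicketFirstChar']])
```

The values for these two passengers should match the corresponding rows of the table above.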


In [13]:
#Look at Survival rate by Title
train_df['Survived'].groupby(train_df['Title']).mean()

Title
Capt.        0.000000
Col.         0.500000
Don.         0.000000
Dr.          0.428571
Jonkheer.    0.000000
Lady.        1.000000
Major.       0.500000
Master.      0.575000
Miss.        0.697802
Mlle.        1.000000
Mme.         1.000000
Mr.          0.156673
Mrs.         0.792000
Ms.          1.000000
Rev.         0.000000
Sir.         1.000000
the          1.000000
Name: Survived, dtype: float64
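The `the` row in this output is a side effect of `get_title` returning the first word after the comma. It most likely comes from a name of the form `Rothes, the Countess. of (...)`. Below is a sketch of a regex-based variant that skips a leading "the"; `get_title_regex` and `TITLE_PATTERN` are hypothetical names, and this variant is not what produced the numbers above.

```python
import re


def get_title(name):
    # Same logic as the notebook's function: first token after the comma.
    return name.split(",")[1].split()[0]


# Hypothetical variant: after the comma, skip an optional "the" and
# capture the first run of letters that ends in a period.
TITLE_PATTERN = re.compile(r",\s*(?:the\s+)?([A-Za-z]+\.)")


def get_title_regex(name):
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None


countess = 'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)'
beckwith = 'Beckwith, Mrs. Richard Leonard (Sallie Monypeny)'

print(get_title(countess), get_title_regex(countess))
print(get_title(beckwith), get_title_regex(beckwith))
```

For ordinary names both functions agree; they differ only where the first word after the comma is not itself the title.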

In [14]:
#Look at Survival rate by Family Size
train_df['Survived'].groupby(train_df['FamilySize']).mean()

FamilySize
0     0.303538
1     0.552795
2     0.578431
3     0.724138
4     0.200000
5     0.136364
6     0.333333
7     0.000000
10    0.000000
Name: Survived, dtype: float64
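A group mean on its own does not show how many passengers it is based on, and rates of exactly 0 or 1 in the tables above probably come from very small groups. A minimal sketch of reporting the count next to the mean, using made-up rows rather than the real data:

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'FamilySize': [0, 0, 0, 0, 1, 1, 7],
    'Survived':   [0, 0, 1, 0, 1, 1, 0],
})

# One row per FamilySize value: survival rate and number of passengers.
summary = toy.groupby('FamilySize')['Survived'].agg(['mean', 'count'])
print(summary)
```

On the real data the equivalent call would be `train_df.groupby('FamilySize')['Survived'].agg(['mean', 'count'])`.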

In [24]:
#Look at Survival rate by NameLength
#pd.qcut(series, q) splits the series into q quantile bins of roughly equal size
train_df['Survived'].groupby(pd.qcut(train_df['NameLength'], 5)).mean()

NameLength
[12, 19]    0.220588
(19, 23]    0.301282
(23, 27]    0.319797
(27, 32]    0.442424
(32, 82]    0.674556
Name: Survived, dtype: float64
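To illustrate why `pd.qcut` is used here rather than `pd.cut`: `qcut` picks bin edges from quantiles so each bin holds about the same number of rows, while `cut` makes bins of equal width. A small sketch on made-up values with one outlier:

```python
import pandas as pd

# Made-up values with a single large outlier.
lengths = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Quantile bins: roughly the same number of values in each bin.
quantile_counts = pd.qcut(lengths, 5).value_counts(sort=False).tolist()

# Equal-width bins: the outlier stretches the range, so almost
# everything lands in the first bin.
width_counts = pd.cut(lengths, 5).value_counts(sort=False).tolist()

print(quantile_counts)
print(width_counts)
```

With a long-tailed column like `NameLength`, equal-sized bins keep every group large enough for its survival rate to mean something.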

In [25]:
#Look at Survival rate by Ticket First Character
train_df['Survived'].groupby(train_df['TicketFirstChar']).mean()

TicketFirstChar
1    0.630137
2    0.464481
3    0.239203
4    0.200000
5    0.000000
6    0.166667
7    0.111111
8    0.000000
9    1.000000
A    0.068966
C    0.340426
F    0.571429
L    0.250000
P    0.646154
S    0.323077
W    0.153846
Name: Survived, dtype: float64

### Prepare Data for ML

Now that I am happy with these new features (survival rates differ noticeably across the values of each one, so they all seem to add value), it is time to fill in the missing values and convert all of the data to numerical form.

In [27]:
#Features To Use
feature_cols = ['Pclass', 'Sex', 'Age', 'Fare', 'Embarked', 'FamilySize', 'Title', 'NameLength', 'TicketFirstChar']
train_df[feature_cols].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
Pclass             891 non-null int64
Sex                891 non-null object
Age                714 non-null float64
Fare               891 non-null float64
Embarked           889 non-null object
FamilySize         891 non-null int64
Title              891 non-null object
NameLength         891 non-null int64
TicketFirstChar    891 non-null object
dtypes: float64(2), int64(3), object(4)
memory usage: 62.7+ KB


In [29]:
test_df[feature_cols].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 9 columns):
Pclass             418 non-null int64
Sex                418 non-null object
Age                332 non-null float64
Fare               417 non-null float64
Embarked           418 non-null object
FamilySize         418 non-null int64
Title              418 non-null object
NameLength         418 non-null int64
TicketFirstChar    418 non-null object
dtypes: float64(2), int64(3), object(4)
memory usage: 29.5+ KB


In [60]:
#Fill in NaN's -train & test
#Fill using Median age from both test and train sets
merged_age = pd.concat([train_df['Age'], test_df['Age']])
median_age = merged_age.median()
print(median_age)
train_df['Age'] = train_df['Age'].fillna(median_age)
test_df['Age'] = test_df['Age'].fillna(median_age)
#Fill using mode of Embarked
train_df['Embarked'] = train_df['Embarked'].fillna('S')
#Fill using median Fare from the training set
test_df['Fare'] = test_df['Fare'].fillna(train_df['Fare'].median())

print(train_df[feature_cols].isnull().sum())
print(test_df[feature_cols].isnull().sum())

28.0
Pclass             0
Sex                0
Age                0
Fare               0
Embarked           0
FamilySize         0
Title              0
NameLength         0
TicketFirstChar    0
dtype: int64
Pclass             0
Sex                0
Age                0
Fare               0
Embarked           0
FamilySize         0
Title              0
NameLength         0
TicketFirstChar    0
dtype: int64
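The cell above computes the age median from the train and test sets together. A common alternative is to learn the fill value from the training rows only and reuse it on the test rows. Here is a sketch of that using scikit-learn's `SimpleImputer` on made-up ages; it is not what this notebook does, and I have not checked whether it changes the score.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up ages for illustration only.
toy_train = pd.DataFrame({'Age': [22.0, 38.0, np.nan, 35.0]})
toy_test = pd.DataFrame({'Age': [np.nan, 26.0]})

# Learn the median from the training rows only...
imputer = SimpleImputer(strategy='median')
imputer.fit(toy_train[['Age']])

# ...then apply that same value to both sets.
train_filled = imputer.transform(toy_train[['Age']])
test_filled = imputer.transform(toy_test[['Age']])

print(imputer.statistics_)
print(test_filled.ravel())
```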


In [61]:
#Encode all data to numerical
from sklearn.preprocessing import LabelEncoder

cols_to_encode = ['Sex', 'Embarked', 'Title', 'TicketFirstChar']
for col in cols_to_encode:
    le = LabelEncoder()
    #Merge test and train set to fit label encoder
    merged = pd.concat([train_df[col], test_df[col]])
    le.fit(merged)
    #Transform
    train_df[col] = le.transform(train_df[col])
    test_df[col] = le.transform(test_df[col])
    
train_df[feature_cols].head()

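| | Pclass | Sex | Age | Fare | Embarked | FamilySize | Title | NameLength | TicketFirstChar |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 22.0 | 7.25 | 2 | 1 | 12 | 23 | 9 |
| 1 | 1 | 0 | 38.0 | 71.2833 | 0 | 1 | 13 | 51 | 13 |
| 2 | 3 | 0 | 26.0 | 7.925 | 2 | 0 | 9 | 22 | 14 |
| 3 | 1 | 0 | 35.0 | 53.1 | 2 | 1 | 13 | 44 | 0 |
| 4 | 3 | 1 | 35.0 | 8.05 | 2 | 0 | 12 | 24 | 2 |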
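To show what `LabelEncoder` is doing to a column, here is a small sketch on the `Embarked` values, assuming the column contains only the three port codes C, Q and S. Codes are assigned in sorted order of the labels, which matches the table above (S became 2, C became 0). It also shows why the encoder is fitted on train and test together: a label it has not seen raises an error at transform time.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['S', 'C', 'Q', 'S'])

# classes_ holds the labels in sorted order; the position is the code.
mapping = {label: code for code, label in enumerate(le.classes_)}
encoded = le.transform(['S', 'C']).tolist()
print(mapping)
print(encoded)

# A label that was not present during fit cannot be transformed.
try:
    le.transform(['X'])
    unseen_label_accepted = True
except ValueError:
    unseen_label_accepted = False
print(unseen_label_accepted)
```

Note that the integer codes carry an ordering (C < Q < S) that has no real meaning; tree-based models such as the Random Forest used below can usually cope with that.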


In [92]:
#Extract Data
feature_cols = ['Pclass', 'Sex', 'Age', 'Fare', 'Embarked', 'FamilySize', 'Title', 'NameLength', 'TicketFirstChar']
X_train = train_df[feature_cols].values
X_test = test_df[feature_cols].values
Y_train = train_df['Survived'].values

### Test Random Forest Algorithm

In [93]:
#RandomForest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

#Initialize Model
rf = RandomForestClassifier(n_estimators=100, min_samples_split=6, n_jobs=-1)
#Create KFold (no shuffling, so the folds are contiguous blocks and random_state is not needed)
kfold = KFold(n_splits=10)
cross_val_results = cross_val_score(rf, X_train, Y_train, cv=kfold, scoring='accuracy')
print(cross_val_results.mean())

0.835093632959
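Because the forest above is not seeded, rerunning the cell will give a slightly different score each time. Below is a sketch of a reproducible version that also reports the spread across folds. It runs on synthetic data from `make_classification` (not the Titanic features), and the use of `StratifiedKFold` with shuffling is my choice for the sketch, not something the notebook does.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X_train / Y_train (9 features, binary target).
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=0)

# Seed both the model and the fold assignment so reruns give the same numbers.
model = RandomForestClassifier(n_estimators=50, min_samples_split=6, random_state=0)
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=5)

scores = cross_val_score(model, X_demo, y_demo, cv=folds, scoring='accuracy')
print("%.3f +/- %.3f" % (scores.mean(), scores.std()))
```

The standard deviation gives a rough idea of how large a difference between two settings has to be before it is worth trusting.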


In [94]:
#Grid Search 
from sklearn.model_selection import GridSearchCV

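#A wider grid; note that it is overwritten by the assignment below,
#so only min_samples_split is actually searched
param_grid = {
    'max_depth' : [50, 75, 100],
    'min_samples_leaf' :[1, 2, 3, 5],
    'min_samples_split' :[2, 4, 8]
}

param_grid = {'min_samples_split' :[2, 4, 6, 8, 10, 14, 20]}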

rf100 = RandomForestClassifier(n_estimators=100, n_jobs=-1)
grid_search = GridSearchCV(estimator=rf100, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, Y_train)

print("Best Score: %f" % grid_search.best_score_)
print(grid_search.best_params_)

Best Score: 0.838384
{'min_samples_split': 6}
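The best score alone hides how close the other settings were. Here is a sketch of looking at every candidate through `cv_results_`, again on synthetic data rather than the Titanic features, with a smaller grid and forest so it runs quickly:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / Y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=0)

grid = {'min_samples_split': [2, 6, 10]}
search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=30, random_state=0),
    param_grid=grid,
    cv=5,
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

# One row per candidate: mean and spread of the accuracy across the 5 folds.
results = pd.DataFrame(search.cv_results_)[
    ['param_min_samples_split', 'mean_test_score', 'std_test_score']
]
print(results.sort_values('mean_test_score', ascending=False))
```

If the top few rows are within one standard deviation of each other, the choice between them is probably down to noise.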


In [95]:
#Train Final Model
rf = RandomForestClassifier(n_estimators=2000, min_samples_split=6, n_jobs=-1)
rf.fit(X_train, Y_train)
predictions = rf.predict(X_test)

In [96]:
#Look at Feature Importance
#Pair each importance with its feature name (list() so this also works on Python 3)
importances = list(zip(rf.feature_importances_, feature_cols))
importances = pd.DataFrame(importances)
importances

| | 0 | 1 |
|---|---|---|
| 0 | 0.081978 | Pclass |
| 1 | 0.228542 | Sex |
| 2 | 0.125681 | Age |
| 3 | 0.154749 | Fare |
| 4 | 0.026225 | Embarked |
| 5 | 0.06434 | FamilySize |
| 6 | 0.114889 | Title |
| 7 | 0.140349 | NameLength |
| 8 | 0.063248 | TicketFirstChar |
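The table is easier to read when sorted and indexed by feature name. A small sketch using the numbers printed above (copied by hand, so this block does not need the fitted model); with the real model the Series would be built as `pd.Series(rf.feature_importances_, index=feature_cols)`.

```python
import pandas as pd

# Importances copied from the output above.
importance_by_feature = pd.Series({
    'Pclass': 0.081978,
    'Sex': 0.228542,
    'Age': 0.125681,
    'Fare': 0.154749,
    'Embarked': 0.026225,
    'FamilySize': 0.064340,
    'Title': 0.114889,
    'NameLength': 0.140349,
    'TicketFirstChar': 0.063248,
})

ranked = importance_by_feature.sort_values(ascending=False)
print(ranked)
print(ranked.sum())
```

The values sum to roughly 1, since a Random Forest's feature importances are normalized. `Sex` is the most important feature, followed by `Fare` and the engineered `NameLength`.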


### Create Submission

In [97]:
test_df['Survived'] = predictions
output_df = test_df[['PassengerId', 'Survived']]
output_df.to_csv("titanic_random_forest.csv", index=False)
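Before uploading, it can be worth checking the file's shape. The sketch below assumes the submission should have exactly the two columns written above, one row per test passenger (418 according to the `info()` output earlier), and only 0/1 predictions. `check_submission` is a hypothetical helper, shown here on a made-up three-row frame.

```python
import pandas as pd


def check_submission(df, expected_rows):
    """Return a list of problems found; an empty list means the frame looks fine."""
    if list(df.columns) != ['PassengerId', 'Survived']:
        return ['unexpected columns: %s' % list(df.columns)]
    problems = []
    if len(df) != expected_rows:
        problems.append('expected %d rows, got %d' % (expected_rows, len(df)))
    if not df['Survived'].isin([0, 1]).all():
        problems.append('Survived must contain only 0 or 1')
    if df['PassengerId'].duplicated().any():
        problems.append('duplicate PassengerId values')
    return problems


# Made-up frames for illustration only.
good = pd.DataFrame({'PassengerId': [892, 893, 894], 'Survived': [0, 1, 0]})
bad = pd.DataFrame({'PassengerId': [892, 892, 894], 'Survived': [0, 2, 0]})

print(check_submission(good, expected_rows=3))
print(check_submission(bad, expected_rows=3))
```

For the real file the call would be `check_submission(output_df, expected_rows=418)`.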