## Titanic Survival Exploration using Sklearn's Decision Trees
In this notebook, we explore the Titanic dataset and fit a decision tree that predicts which passengers survived.
We start by loading the dataset and displaying its first few rows.

The data we will be working with has the following features:
* Survived –– Outcome of survival (0 = No, 1 = Yes)
* Pclass –– Socio-economic class (1 = Upper, 2 = Middle, 3 = Lower)
* Name –– Name of passenger
* Sex –– Sex of passenger
* Age –– Age of passenger (some are `NaN`)
* SibSp –– Number of siblings and spouses of the passenger aboard
* Parch –– Number of parents and children of the passenger aboard
* Ticket –– Passenger's ticket number
* Fare –– Fare paid by passenger
* Cabin –– Cabin number (some are `NaN`)
* Embarked –– Port of embarkation of passenger (C = Cherbourg, Q = Queenstown, S = Southampton)
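
Because `Age` and `Cabin` contain `NaN` values, it can help to count the missing entries per column before modeling. The following is a minimal sketch on a small made-up frame; the `sample` values are invented for illustration and are not read from the dataset file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the missing-value pattern described above
sample = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, 'C123'],
    'Fare': [7.25, 71.2833, 7.925, 53.1],
})

# isna() flags missing cells; sum() counts them column by column
missing_counts = sample.isna().sum()
print(missing_counts)
```

The same `isna().sum()` call should work on the full `data` frame once it is loaded.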



In [3]:
# import relevant libs
import pandas as pd
import numpy as np
from IPython.display import display

# Render matplotlib plots inline in the notebook
%matplotlib inline

# Seed Python's random module (scikit-learn uses its own `random_state` arguments)
import random
random.seed(42)

# load the dataset 
data = pd.read_csv('titanic_data_train.csv')
display(data.head())

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


Since we want to predict each passenger's survival outcome, we remove the `Survived` column from the dataset and store it separately as `outcomes`. These labels serve as the targets for training and, later, for measuring the accuracy of the predictions.

In [5]:
# remove `Survived` feature, and store it as outcomes
outcomes = data['Survived']
features = data.drop('Survived', axis=1)

display(features.head())


| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


#### 2. Prepare the data
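We will one-hot encode the categorical features and fill any missing values (`NaN`) with zeros. Note that `pd.get_dummies` creates one column per distinct value, so free-text columns such as `Name` and `Cabin` expand into many columns.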

In [10]:
features = pd.get_dummies(features)
features = features.fillna(0.0)
display(features.head())

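| | PassengerId | Pclass | Age | SibSp | Parch | Fare | Name_Abbing, Mr. Anthony | Name_Abbott, Mr. Rossmore Edward | Name_Abbott, Mrs. Stanton (Rosa Hunt) | Name_Abelson, Mr. Samuel | ... | Cabin_F G73 | Cabin_F2 | Cabin_F33 | Cabin_F38 | Cabin_F4 | Cabin_G6 | Cabin_T | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | 22.0 | 1 | 0 | 7.25 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 3 | 3 | 26.0 | 0 | 0 | 7.925 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 4 | 1 | 35.0 | 1 | 0 | 53.1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 5 | 3 | 35.0 | 0 | 0 | 8.05 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |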
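
To see why the encoded table is so wide, the sketch below one-hot encodes a tiny made-up frame twice: once with a free-text name column and once without it. The frame `toy` and its values are hypothetical, and dropping `Name` is only one possible alternative, not what this notebook does.

```python
import pandas as pd

# Hypothetical three-row frame; the names are placeholders, not real passengers
toy = pd.DataFrame({
    'Pclass': [3, 1, 3],
    'Sex': ['male', 'female', 'female'],
    'Name': ['Passenger A', 'Passenger B', 'Passenger C'],
})

# Encoding everything: every distinct name becomes its own column
encoded_all = pd.get_dummies(toy)

# Dropping the free-text column first keeps the feature count small
encoded_small = pd.get_dummies(toy.drop(columns=['Name']))

print(encoded_all.shape[1], 'columns with Name')
print(encoded_small.shape[1], 'columns without Name')
print(list(encoded_small.columns))
```

Numeric columns such as `Pclass` pass through unchanged; only string-valued columns are expanded.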


#### 3. Train the model
We will split the data into training (80%) and testing (20%) sets and fit the model on the training set.

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    features, outcomes, test_size=0.2, random_state=42)

display(X_train.head())


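| | PassengerId | Pclass | Age | SibSp | Parch | Fare | Name_Abbing, Mr. Anthony | Name_Abbott, Mr. Rossmore Edward | Name_Abbott, Mrs. Stanton (Rosa Hunt) | Name_Abelson, Mr. Samuel | ... | Cabin_F G73 | Cabin_F2 | Cabin_F33 | Cabin_F38 | Cabin_F4 | Cabin_G6 | Cabin_T | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 331 | 332 | 1 | 45.5 | 0 | 0 | 28.5 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 733 | 734 | 2 | 23.0 | 0 | 0 | 13.0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 382 | 383 | 3 | 32.0 | 0 | 0 | 7.925 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 704 | 705 | 3 | 26.0 | 1 | 0 | 7.8542 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 813 | 814 | 3 | 6.0 | 4 | 2 | 31.275 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |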


In [12]:
from sklearn.tree import DecisionTreeClassifier
# Train the model; you can experiment with the hyperparameters (max_depth, min_samples_leaf, etc.)
model = DecisionTreeClassifier(max_depth=10, min_samples_leaf=5)
model.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=5, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
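
One way to experiment with the hyperparameters is to fit several trees and compare training and test accuracy side by side. The sketch below does this on synthetic data from `make_classification` so that it runs on its own. The data, the names `X_tr`, `X_te`, and `results`, and the choice of depths are assumptions for illustration; the numbers it prints say nothing about the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Titanic features)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

results = {}
for depth in (2, 5, 10):
    clf = DecisionTreeClassifier(max_depth=depth, min_samples_leaf=5, random_state=42)
    clf.fit(X_tr, y_tr)
    # score() returns mean accuracy on the given data
    results[depth] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print('max_depth =', depth, 'train/test accuracy =', results[depth])
```

A training accuracy well above the test accuracy would suggest the tree is overfitting, in which case a smaller `max_depth` or a larger `min_samples_leaf` may generalize better.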

#### 4. Predict and test the model's accuracy

In [13]:
# make predictions
y_predictions = model.predict(X_test)

# calculate accuracy
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, y_predictions)
print('The test accuracy is', acc)

The test accuracy is 0.854748603352
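
Accuracy here is the fraction of test passengers whose predicted outcome matches the true outcome. As a small check of that definition, the sketch below computes it by hand on made-up labels and compares the result with `accuracy_score`. The arrays `y_true` and `y_pred` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: 4 of the 5 predictions match the true values
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Fraction of positions where prediction and truth agree
manual_acc = np.mean(y_true == y_pred)
sklearn_acc = accuracy_score(y_true, y_pred)

print(manual_acc, sklearn_acc)
```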
