# Titanic Survival Exploration w/ Decision Trees

In [None]:
import numpy as np
import pandas as pd
#from IPython import display # Allows the use of display() for DataFrames

# Pretty display for notebooks:
%matplotlib inline 

In [None]:
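# Setting random seeds
# (scikit-learn falls back to NumPy's global generator when random_state is not set)
import random
random.seed(42)
np.random.seed(42)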

In [7]:
# Loading the dataset
full_data = pd.read_csv('titanic_data.csv')
full_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Recall that these are the various features present for each passenger on the ship:

  * **Survived**: Outcome of survival (0 = No; 1 = Yes)
  * **Pclass**: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
  * **Name**: Name of passenger
  * **Sex**: Sex of the passenger
  * **Age**: Age of the passenger (Some entries contain NaN)
  * **SibSp**: Number of siblings and spouses of the passenger aboard
  * **Parch**: Number of parents and children of the passenger aboard
  * **Ticket**: Ticket number of the passenger
  * **Fare**: Fare paid by the passenger
  * **Cabin**: Cabin number of the passenger (Some entries contain NaN)
  * **Embarked**: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)

Since we're interested in the outcome of survival for each passenger or crew member, we can remove the **Survived** feature from this dataset and store it as its own separate variable *outcomes*. We will use these outcomes as our prediction targets.


In [8]:
# Storing the 'Survived' feature in a new variable and removing it from the original dataset
outcomes = full_data['Survived']
features_raw = full_data.drop('Survived', axis=1) # New dataset with 'Survived' removed
features_raw.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


The very same sample of the RMS Titanic data now shows the **Survived** feature removed from the DataFrame. Note that *features_raw* (the passenger data) and *outcomes* (the survival outcomes) are now paired: for any passenger *features_raw.loc[i]*, the survival outcome is *outcomes[i]*.
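The pairing works because both objects keep the same row index. A minimal sketch on a made-up three-row frame (the `toy` data and the `toy_*` names are hypothetical, not taken from `titanic_data.csv`):

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
})

toy_outcomes = toy["Survived"]               # target column as a Series
toy_features = toy.drop("Survived", axis=1)  # returns a new frame; `toy` itself is unchanged

# Both objects keep the original row index, so label i refers to the same passenger
assert toy_features.index.equals(toy_outcomes.index)
print(toy_features.loc[1].to_dict(), "->", toy_outcomes[1])
```

Because `drop` returns a new DataFrame, the original frame still contains the `Survived` column afterwards.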

**Preprocessing the data**

Now, let's do some data preprocessing. First, we'll remove the names of the passengers, and then one-hot encode the features.

[One-Hot encoding](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f) converts categorical data into numerical data. Each distinct value of a categorical feature becomes its own new column, which holds 1 if the row has that value and 0 otherwise (e.g. Queenstown port or not Queenstown port).
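As a small illustration of the idea, here is a sketch that one-hot encodes a made-up `Embarked` column (the `ports` frame is hypothetical and only mimics the port codes used in the dataset):

```python
import pandas as pd

# Hypothetical toy column, for illustration only
ports = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# dtype=int asks for 0/1 integers (newer pandas versions default to booleans)
encoded = pd.get_dummies(ports, dtype=int)
print(encoded)
```

Each row ends up with exactly one 1 across the `Embarked_C`, `Embarked_Q` and `Embarked_S` columns.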

In [9]:
features_no_names = features_raw.drop(['Name'], axis=1) # Removing the names
features = pd.get_dummies(features_no_names) # One-hot encoding
features = features.fillna(0.0) # Filling missing values (NaN) with zeroes
features.head()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Ticket_110152,Ticket_110413,...,Cabin_F G73,Cabin_F2,Cabin_F33,Cabin_F38,Cabin_F4,Cabin_G6,Cabin_T,Embarked_C,Embarked_Q,Embarked_S
0,1,3,22.0,1,0,7.25,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
1,2,1,38.0,1,0,71.2833,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,3,26.0,0,0,7.925,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,4,1,35.0,1,0,53.1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,5,3,35.0,0,0,8.05,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1


**Training the model**

First, let's split the data into training and testing sets. Then we'll train the model on the training set.
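To get a feel for what a 20% test split does to the row counts, here is a rough sketch on placeholder arrays. The row count of 891 is an assumption (the notebook never prints the shape of `full_data`), and `X_demo` / `y_demo` are made-up stand-ins for the real features and outcomes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 891  # assumed row count; not printed in the notebook
X_demo = np.arange(n_rows * 2).reshape(n_rows, 2)  # placeholder features
y_demo = np.arange(n_rows) % 2                     # placeholder labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(len(X_tr), len(X_te))
```

Fixing `random_state` makes the shuffle repeatable, so re-running the split yields the same partition.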

In [10]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, outcomes, test_size=0.2, random_state=42)

In [11]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

DecisionTreeClassifier()

**Testing the model**

In [12]:
# Making predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculating the accuracy
from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('Training accuracy:', train_accuracy)
print('Test accuracy:', test_accuracy)

Training accuracy: 1.0
Test accuracy: 0.8156424581005587


**Improving the model**

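The training accuracy is a perfect 1.0, while the testing accuracy is only about 0.82. A gap like that suggests the tree is overfitting: it has memorized the training set rather than learned patterns that generalize.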
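The same pattern can be reproduced on synthetic data, which may help show that it comes from the unconstrained tree rather than from anything specific to the Titanic file. The sketch below is an illustration under assumptions: the data comes from `make_classification`, and all names (`X_syn`, `scores`, and so on) are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; flip_y=0.2 assigns random labels to 20% of the samples
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

scores = {}
for depth in (None, 3):  # None = keep splitting until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(Xs_train, ys_train)
    scores[depth] = (tree.score(Xs_train, ys_train), tree.score(Xs_test, ys_test))
    print(depth, scores[depth])
```

The unconstrained tree fits the noisy training labels perfectly, which is the memorization the accuracy gap points to; the depth-limited tree cannot do that.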

We should tune the model! Let's train a new tree and set some hyperparameters that limit its growth, aiming to improve the testing accuracy, such as:

  * *max_depth*
  * *min_samples_leaf*
  * *min_samples_split*

There are different approaches: using your intuition, trial and error, or, even better, Grid Search!
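A trial-and-error search can be made a bit more systematic by scoring each candidate with cross-validation on the training data, so the test set is not used to pick the values. The following is a sketch under assumptions: it runs on synthetic data, and the candidate depths and the fixed `min_samples_leaf=6` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, flip_y=0.1, random_state=1
)

cv_scores = {}
for depth in (2, 4, 8, 16):  # arbitrary candidate values
    candidate = DecisionTreeClassifier(
        max_depth=depth, min_samples_leaf=6, random_state=1
    )
    # Mean accuracy over 3 cross-validation folds
    cv_scores[depth] = cross_val_score(candidate, X_syn, y_syn, cv=3).mean()

best_depth = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_depth)
```

Grid Search automates exactly this kind of loop over every combination of the listed values.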


In [13]:
# help(model.tree_)

In [None]:
print('Depth:', model.tree_.max_depth) # depth before tuning the model

Depth: 50


In [None]:
# Tuning the model (trial and error approach)
model = DecisionTreeClassifier(max_depth=16, min_samples_leaf=6)
model.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=16, min_samples_leaf=6)

In [None]:
print('Depth:', model.tree_.max_depth) # depth after tuning the model 

Depth: 16
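Besides reading `tree_.max_depth`, a fitted tree can report its size through the public `get_depth()` and `get_n_leaves()` methods. A small sketch on synthetic data (the `shallow` model and the data are made up for this example):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=3)
shallow = DecisionTreeClassifier(max_depth=3, random_state=3).fit(X_syn, y_syn)

print('Depth:', shallow.get_depth())      # same value as shallow.tree_.max_depth
print('Leaves:', shallow.get_n_leaves())  # number of leaf nodes in the fitted tree
```

Note that `max_depth` is only an upper bound: the fitted tree can be shallower if the other stopping rules kick in first.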


In [None]:
# Testing it again:
# Making predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculating the accuracy
from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('Training accuracy:', train_accuracy)
print('Test accuracy:', test_accuracy)

Training accuracy: 0.8820224719101124
Test accuracy: 0.8603351955307262


In [None]:
# Tuning the model (Grid Search approach)
from sklearn.model_selection import GridSearchCV 
params = {
    'criterion' : ['gini', 'entropy'],
    'max_depth' : range(1, 100),
    'min_samples_split' : range(2, 10),
    'min_samples_leaf' : range(1, 10)
}
grid_search = GridSearchCV(model, param_grid=params, verbose=1, cv=3)
grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 14256 candidates, totalling 42768 fits


GridSearchCV(cv=3,
             estimator=DecisionTreeClassifier(max_depth=16, min_samples_leaf=6),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': range(1, 100),
                         'min_samples_leaf': range(1, 10),
                         'min_samples_split': range(2, 10)},
             verbose=1)
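The candidate count in the log is simply the product of the number of values tried for each parameter, and the fit count multiplies that by the number of folds. A quick check of the arithmetic (the name `param_grid_demo` is made up so it does not overwrite `params`):

```python
from math import prod

param_grid_demo = {
    'criterion': ['gini', 'entropy'],   # 2 values
    'max_depth': range(1, 100),         # 99 values
    'min_samples_split': range(2, 10),  # 8 values
    'min_samples_leaf': range(1, 10),   # 9 values
}
n_candidates = prod(len(values) for values in param_grid_demo.values())
n_fits = n_candidates * 3  # cv=3 folds per candidate
print(n_candidates, n_fits)  # 14256 42768
```

Because the count grows multiplicatively, trimming one range (for example a coarser set of `max_depth` values) shrinks the search considerably.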

In [None]:
print(grid_search.best_params_)
print(grid_search.best_estimator_)

{'criterion': 'gini', 'max_depth': 7, 'min_samples_leaf': 6, 'min_samples_split': 3}
DecisionTreeClassifier(max_depth=7, min_samples_leaf=6, min_samples_split=3)


In [None]:
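# Testing it again.
# Note: GridSearchCV fits clones of the estimator, so `model` is still the hand-tuned
# tree from above; the refit winner is stored in grid_search.best_estimator_.
# Predicting with `model` therefore reproduces the earlier scores.
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)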

# Calculating the accuracy
from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('Training accuracy:', train_accuracy)
print('Test accuracy:', test_accuracy)

Training accuracy: 0.8820224719101124
Test accuracy: 0.8603351955307262
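To score the grid search's winner rather than `model`, predictions would have to come from `grid_search.best_estimator_`. The sketch below shows that pattern on synthetic stand-in data with a deliberately small grid; the data, the grid values and the names `base`, `search` and `best_model` are assumptions made for this example, so its scores say nothing about the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, flip_y=0.1, random_state=2
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=2
)

base = DecisionTreeClassifier(random_state=2)  # never fitted directly
small_grid = {'max_depth': [2, 4, 8], 'min_samples_leaf': [1, 6]}
search = GridSearchCV(base, param_grid=small_grid, cv=3)
search.fit(Xs_train, ys_train)

# The search fits clones, so `base` is still unfitted; the winner lives here:
best_model = search.best_estimator_
print(hasattr(base, 'tree_'), search.best_params_)
print('Training accuracy:', accuracy_score(ys_train, best_model.predict(Xs_train)))
print('Test accuracy:', accuracy_score(ys_test, best_model.predict(Xs_test)))
```

In this notebook the equivalent calls would be `grid_search.best_estimator_.predict(X_train)` and `grid_search.best_estimator_.predict(X_test)`; calling `grid_search.predict(...)` delegates to the same refit estimator.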


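* As we can see, the training and test accuracy are more balanced now: about 0.88 and 0.86 for the tuned tree, compared with 1.0 and about 0.82 before tuning. The smaller gap means the model overfits less.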