# Titanic Survival Exploration

One of the most infamous and tragic shipwrecks in history was the sinking of the RMS Titanic. According to the survivors and the available evidence, one of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this notebook, we will use machine learning techniques to analyze the Titanic dataset and predict which passengers were most likely to survive the sinking. Using sklearn, we will try several algorithms: decision trees, support vector machines (SVM), k-nearest neighbors (KNN), logistic regression, and Gaussian naive Bayes.

Let's start with a decision tree. First, we load the dataset and display some of its rows.

In [1]:
import numpy as np
import pandas as pd
from IPython.display import display # allow the use of display() for DataFrames

# render matplotlib plots inline in the notebook
%matplotlib inline

import random
random.seed(42) # seed Python's built-in random module (sklearn calls below take their own random_state)

# load the dataset
in_file = 'titanic_data.csv'
full_data = pd.read_csv(in_file)

# print the first few entries of the RMS Titanic data
display(full_data.head())

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


These are the various features present for each passenger on the ship:
- **Survived**: Outcome of survival (0 = No, 1 = Yes)
- **Pclass**: Socio-economic class (1 = Upper class, 2 = Middle class, 3 = Lower class)
- **Name**: Name of passenger
- **Sex**: Sex of the passenger
- **Age**: Age of the passenger (Some entries contain `NaN`)
- **SibSp**: Number of siblings and spouses of the passenger aboard
- **Parch**: Number of parents and children of the passenger aboard
- **Ticket**: Ticket number of the passenger
- **Fare**: Fare paid by the passenger
- **Cabin**: Cabin number of the passenger (Some entries contain `NaN`)
- **Embarked**: Port of embarkation of the passenger (C = Cherbourg, Q = Queenstown, S = Southampton)

Since we're interested in the survival outcome of each passenger, we can remove the **Survived** feature from this dataset and store it as its own separate variable, `outcomes`. We will use these outcomes as our prediction targets.

In [5]:
# save the feature 'Survived' in a new variable and remove it from the dataset
outcomes = full_data['Survived']
features_raw = full_data.drop('Survived', axis = 1)

# print the first few entries of the dataset with 'Survived' removed
display(features_raw.head())

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


The **Survived** feature has now been removed from the DataFrame. `features_raw` (the passenger data) and `outcomes` (the survival outcomes) are now *paired*: for any passenger `features_raw.loc[i]`, the survival outcome is `outcomes[i]`.
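
The pairing described above can be checked on a small stand-in table. This is only a sketch: the `toy_*` names and the three rows are made up for illustration and are not part of the Titanic file.

```python
import pandas as pd

# Toy stand-in for the Titanic table (values are invented for illustration)
toy = pd.DataFrame({
    'Survived': [0, 1, 1],
    'Pclass': [3, 1, 3],
    'Sex': ['male', 'female', 'female'],
})

# Same split as in the notebook: targets in one object, features in another
toy_outcomes = toy['Survived']
toy_features = toy.drop('Survived', axis=1)

# Both objects keep the original row index, so label i refers to the same passenger
i = 1
print(toy_features.loc[i].to_dict(), '->', toy_outcomes[i])
```

Because `drop` returns a new DataFrame with the index unchanged, the two objects stay aligned as long as neither is reordered or filtered on its own.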

## Preprocessing the data

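Now, let's do some data preprocessing. First, we'll one-hot encode the categorical features. Note that `pd.get_dummies` encodes every text column, including **Name**, **Ticket**, and **Cabin**, so the result contains one column per distinct value, as the `Name_...` and `Cabin_...` columns in the output show.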

In [7]:
features = pd.get_dummies(features_raw)
features.head()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,"Name_Abbing, Mr. Anthony","Name_Abbott, Mr. Rossmore Edward","Name_Abbott, Mrs. Stanton (Rosa Hunt)","Name_Abelson, Mr. Samuel",...,Cabin_F G73,Cabin_F2,Cabin_F33,Cabin_F38,Cabin_F4,Cabin_G6,Cabin_T,Embarked_C,Embarked_Q,Embarked_S
0,1,3,22.0,1,0,7.25,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,2,1,38.0,1,0,71.2833,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,3,26.0,0,0,7.925,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,4,1,35.0,1,0,53.1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,5,3,35.0,0,0,8.05,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
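
To make the effect of one-hot encoding concrete, here is a minimal sketch on a toy frame. The rows and the single-letter names are hypothetical; the point is only that a text column with a unique value per row turns into one column per row.

```python
import pandas as pd

# Toy frame: one numeric column, one low-cardinality text column,
# and one text column that is unique per row (like passenger names)
toy = pd.DataFrame({
    'Pclass': [3, 1, 3],
    'Sex': ['male', 'female', 'female'],
    'Name': ['A', 'B', 'C'],  # hypothetical names
})

encoded = pd.get_dummies(toy)

# Numeric columns pass through; each distinct text value gets its own indicator column
n_name_columns = sum(col.startswith('Name_') for col in encoded.columns)
print(encoded.columns.tolist())
print('Indicator columns created from Name:', n_name_columns)
```

Indicator columns that are set for exactly one training row cannot generalize to unseen passengers, which may be worth keeping in mind when reading the accuracies later in this notebook.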


Next, we'll fill in any missing values (`NaN`) with zeros.

In [8]:
features = features.fillna(0.0)
display(features.head())

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,"Name_Abbing, Mr. Anthony","Name_Abbott, Mr. Rossmore Edward","Name_Abbott, Mrs. Stanton (Rosa Hunt)","Name_Abelson, Mr. Samuel",...,Cabin_F G73,Cabin_F2,Cabin_F33,Cabin_F38,Cabin_F4,Cabin_G6,Cabin_T,Embarked_C,Embarked_Q,Embarked_S
0,1,3,22.0,1,0,7.25,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,2,1,38.0,1,0,71.2833,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,3,26.0,0,0,7.925,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,4,1,35.0,1,0,53.1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,5,3,35.0,0,0,8.05,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
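
Before filling, it can help to see where the missing values are. The sketch below uses a made-up two-column frame (not the real data) to count `NaN` entries per column and then applies the same `fillna(0.0)` call.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing age (values are illustrative only)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Fare': [7.25, 71.2833, 7.925],
})

# Count missing entries per column before changing anything
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Same strategy as the notebook: replace every NaN with 0.0
filled = toy.fillna(0.0)
print(filled)
```

One consequence of this choice is that a passenger with an unknown age becomes indistinguishable from one whose age is recorded as 0.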


## Training the model

Now that the data has been preprocessed, it is ready for training a model in sklearn. First, let's split the data into training and testing sets. Then we'll train the model on the training set.


In [9]:
# split the data into training and testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, outcomes, test_size=0.2, random_state=42)
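
As a rough illustration of what this split does, the sketch below applies the same call to placeholder arrays. The row count of 891 is an assumption (it is not stated in this notebook), chosen because the accuracies reported later are consistent with a 179-row test set (for example, 145/179 ≈ 0.8101).

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 891  # assumption: number of rows in the data file
toy_X = np.arange(n_rows).reshape(-1, 1)  # placeholder features
toy_y = np.arange(n_rows) % 2             # placeholder labels

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42)
print(len(toy_X_train), len(toy_X_test))

# With a fixed random_state, repeating the call gives the same partition
_, toy_X_test_again, _, _ = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42)
print(np.array_equal(toy_X_test, toy_X_test_again))
```

Fixing `random_state` is what makes the results in the following cells reproducible from run to run.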

In [10]:
# import the classifier from sklearn
from sklearn.tree import DecisionTreeClassifier

# define the classifier, and fit it to the data
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

## Testing the model
Now, let's calculate the accuracy over both the training and the testing set. Let's see how our model does.

In [11]:
# make predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# calculate the accuracy
from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)

The training accuracy is 1.0
The test accuracy is 0.8100558659217877
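
For reference, accuracy is simply the fraction of predictions that match the true labels. A minimal sketch with made-up labels shows that `accuracy_score` agrees with the manual calculation:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Illustrative labels: 4 of the 5 predictions are correct
toy_true = np.array([0, 1, 1, 0, 1])
toy_pred = np.array([0, 1, 0, 0, 1])

manual_accuracy = (toy_true == toy_pred).mean()
sklearn_accuracy = accuracy_score(toy_true, toy_pred)
print(manual_accuracy, sklearn_accuracy)
```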


## Improving the model

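The model reaches a perfect training accuracy (1.0) but a lower testing accuracy (about 81%), which indicates that it is overfitting.

To improve the testing accuracy, let's specify some parameters when training a new model:

* `max_depth`: the maximum depth of the tree
* `min_samples_leaf`: the minimum number of samples required at a leaf node
* `min_samples_split`: the minimum number of samples required to split an internal node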


In [41]:
# define a new classifier with parameters and train the model
model = DecisionTreeClassifier(max_depth=6, min_samples_leaf=5, min_samples_split=6)
model.fit(X_train, y_train)

# make predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# calculate accuracies
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)

The training accuracy is 0.8735955056179775
The test accuracy is 0.8547486033519553
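
To see how these limits constrain the tree, the sketch below compares an unconstrained tree with a constrained one. It uses synthetic data from `make_classification` (an assumption for illustration, not the Titanic features), so the numbers it prints say nothing about this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy classification data (illustrative only)
toy_X, toy_y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.1, random_state=0)

unconstrained = DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)
constrained = DecisionTreeClassifier(
    max_depth=6, min_samples_leaf=5, min_samples_split=6, random_state=0
).fit(toy_X, toy_y)

# The unconstrained tree keeps splitting until every training point is classified correctly
print('unconstrained:', unconstrained.get_depth(), 'levels,',
      unconstrained.get_n_leaves(), 'leaves, train acc',
      unconstrained.score(toy_X, toy_y))
print('constrained:  ', constrained.get_depth(), 'levels,',
      constrained.get_n_leaves(), 'leaves, train acc',
      constrained.score(toy_X, toy_y))

# Smallest number of training samples that ends up in any leaf of the constrained tree
is_leaf = constrained.tree_.children_left == -1
smallest_leaf = constrained.tree_.n_node_samples[is_leaf].min()
print('smallest leaf holds', smallest_leaf, 'samples')
```

A tree that cannot isolate individual training points has to settle for broader rules, which is why its training accuracy drops while its test accuracy can improve.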


Let's tune the parameters and try to improve the test accuracy even further.

In [42]:
# define a new model by tweaking parameters and fit it to the data
model = DecisionTreeClassifier(max_depth=10, min_samples_leaf=6, min_samples_split=10)
model.fit(X_train, y_train)

# make predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# calculate accuracies
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)

The training accuracy is 0.8820224719101124
The test accuracy is 0.8603351955307262


After tuning the parameters, the test accuracy improved to just above 86%.
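
The manual tuning above could also be automated. The following is a sketch, not the method used in this notebook: it runs sklearn's `GridSearchCV` over the same three parameters on synthetic data, and the grid values are arbitrary choices. It selects parameters by cross-validation on the training set, so the test set is only used once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (illustrative only)
toy_X, toy_y = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=0)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42)

# Candidate values for the three parameters (arbitrary choices)
param_grid = {
    'max_depth': [4, 6, 8, 10],
    'min_samples_leaf': [2, 4, 6],
    'min_samples_split': [2, 6, 10],
}

# 5-fold cross-validation on the training set for every combination
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')
search.fit(toy_X_train, toy_y_train)

grid_test_accuracy = search.best_estimator_.score(toy_X_test, toy_y_test)
print('Best parameters:', search.best_params_)
print('Test accuracy of the best model:', grid_test_accuracy)
```

Choosing parameters by repeatedly checking the test accuracy, as done by hand above, tends to make that accuracy an optimistic estimate; cross-validation is one common way to avoid this.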

Let's find out how an SVM performs on this Titanic dataset.

In [55]:
# import the svm classifier from sklearn
from sklearn.svm import SVC

# define the classifier, and fit it to the data
model = SVC(kernel = 'linear')
model.fit(X_train, y_train)

# make predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# calculate accuracies
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)

The training accuracy is 0.9985955056179775
The test accuracy is 0.8212290502793296


The SVM gave a lower test accuracy (about 82%) than the tuned decision tree, and it was also slower to train.

Now, let's change the SVC's kernel to `'rbf'` and set the regularization parameter `C`. Let's see what effect this has on the accuracy.

In [60]:
# define the classifier, and fit it to the data
model = SVC(kernel = 'rbf', C=1000)
model.fit(X_train, y_train)

# make predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# calculate accuracies
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)



The training accuracy is 1.0
The test accuracy is 0.6256983240223464


With the `'rbf'` kernel and `C=1000`, the training accuracy rises to 1.0 while the test accuracy drops to about 63%, so this model overfits much more than the linear SVM.

Let's try a new model using another machine learning algorithm, k-nearest neighbors (KNN).
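
The RBF kernel is computed from distances between samples, so features with large numeric ranges can dominate it. Whether that explains the result above is not established here. As a sketch under that assumption, the code below compares an RBF SVC with and without standardization on synthetic data in which one feature has been rescaled on purpose.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data; one feature is inflated to mimic a column with a large range (assumption)
toy_X, toy_y = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=0)
toy_X[:, 0] *= 100
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42)

# Same classifier settings as in the notebook, fitted on the raw features
raw_svc = SVC(kernel='rbf', C=1000).fit(toy_X_train, toy_y_train)

# Same classifier, but each feature is standardized first inside a pipeline
scaled_svc = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1000))
scaled_svc.fit(toy_X_train, toy_y_train)

raw_test_accuracy = raw_svc.score(toy_X_test, toy_y_test)
scaled_test_accuracy = scaled_svc.score(toy_X_test, toy_y_test)
print('raw features:   ', raw_test_accuracy)
print('scaled features:', scaled_test_accuracy)
```

Putting the scaler inside the pipeline means it is fitted on the training data only and then applied unchanged to the test data.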

In [82]:
# define a new model using K neighbors classifier 
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors = 5)

# fit data to the model
neigh.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [83]:
# make predictions with the KNN classifier (neigh)
y_train_pred = neigh.predict(X_train)
y_test_pred = neigh.predict(X_test)

# calculate accuracies
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)

The training accuracy is 1.0
The test accuracy is 0.6256983240223464


Note that these recorded figures are identical to the RBF SVC result: they were produced when the predictions came from `model` (still the SVC at that point) rather than from `neigh`, so they do not measure the KNN classifier. Rerunning the cell with `neigh.predict`, as written above, gives the actual KNN accuracies for `n_neighbors = 5`.

Next, let's change the model's `n_neighbors` parameter and use sklearn's `score` method to calculate the accuracies.

In [107]:
# define a new model using K neighbors classifier 
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors = 8)

# fit data to the model
neigh.fit(X_train, y_train)

# calculate accuracies
train_acc = neigh.score(X_train, y_train)
test_acc = neigh.score(X_test, y_test)

print('The training accuracy is', train_acc)
print('The test accuracy is', test_acc)

The training accuracy is 0.6853932584269663
The test accuracy is 0.6703910614525139


With `n_neighbors = 8`, the training accuracy (about 69%) and the test accuracy (about 67%) are close together, so this model is not overfitting. Its test accuracy is still well below that of the decision tree, however.

Next, let's define a new model using logistic regression.
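
Rather than trying one value of `n_neighbors` at a time, several values can be compared in a loop. The sketch below does this with 5-fold cross-validation on synthetic data; the candidate values are arbitrary and the data is not the Titanic set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (illustrative only)
toy_X, toy_y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0)

# Mean cross-validated accuracy for each candidate number of neighbors
mean_scores = {}
for k in [1, 3, 5, 8, 12]:
    knn = KNeighborsClassifier(n_neighbors=k)
    mean_scores[k] = cross_val_score(knn, toy_X, toy_y, cv=5).mean()

best_k = max(mean_scores, key=mean_scores.get)
for k, score in mean_scores.items():
    print(f'n_neighbors={k:2d}  mean CV accuracy={score:.3f}')
print('Best n_neighbors on this toy data:', best_k)
```

Small values of `n_neighbors` tend to fit the training data very closely, while larger values smooth the decision boundary.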

In [92]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

# fit data to the model
logreg.fit(X_train, y_train)

# calculate accuracies
train_acc = logreg.score(X_train, y_train)
test_acc = logreg.score(X_test, y_test)

print('The training accuracy is', train_acc)
print('The test accuracy is', test_acc)


The training accuracy is 0.9213483146067416
The test accuracy is 0.7932960893854749




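Logistic regression achieved a higher test accuracy (about 79%) than KNN, although the gap between its training accuracy (about 92%) and its test accuracy suggests some overfitting.

Now, let's check how Gaussian naive Bayes performs on this dataset.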

In [110]:
# define a new model using naive bayes
from sklearn.naive_bayes import GaussianNB
gaussian = GaussianNB()

# fit data to the model
gaussian.fit(X_train, y_train)

# calculate accuracies
train_acc = gaussian.score(X_train, y_train)
test_acc = gaussian.score(X_test, y_test)

print('The training accuracy is', train_acc)
print('The test accuracy is', test_acc)

The training accuracy is 1.0
The test accuracy is 0.5027932960893855


Gaussian naive Bayes comes nowhere close to the other algorithms: it has the lowest test accuracy (about 50%), and its perfect training accuracy is a clear sign of overfitting.
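
The comparisons in this notebook repeat the same fit-and-score steps for each algorithm. As a sketch, those steps can be collected in one loop. The code below uses synthetic data and illustrative settings (`max_iter=1000` for logistic regression is an added assumption), so its ranking does not reproduce the Titanic results above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (illustrative only)
toy_X, toy_y = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=0)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42)

# The same families of models tried in this notebook
candidates = {
    'decision tree': DecisionTreeClassifier(
        max_depth=10, min_samples_leaf=6, min_samples_split=10, random_state=42),
    'linear SVM': SVC(kernel='linear'),
    'KNN (k=8)': KNeighborsClassifier(n_neighbors=8),
    'logistic regression': LogisticRegression(max_iter=1000),
    'Gaussian naive Bayes': GaussianNB(),
}

# Fit each model and record (training accuracy, test accuracy)
results = {}
for name, clf in candidates.items():
    clf.fit(toy_X_train, toy_y_train)
    results[name] = (clf.score(toy_X_train, toy_y_train),
                     clf.score(toy_X_test, toy_y_test))

# Print the models from highest to lowest test accuracy
for name, (train_acc, test_acc) in sorted(
        results.items(), key=lambda item: item[1][1], reverse=True):
    print(f'{name:22s} train={train_acc:.3f}  test={test_acc:.3f}')
```

Keeping both accuracies side by side makes it easy to spot models whose training accuracy is far above their test accuracy.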