# Titanic Survival Exploration

One of the most infamous and tragic shipwrecks in history was the sinking of the RMS Titanic. According to the survivors and the available evidence, one of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

In this notebook, we use machine learning techniques to analyse the Titanic dataset and predict which passengers were most likely to survive the sinking. Using sklearn, we will apply several algorithms to the prediction: decision trees, k-nearest neighbors, and random forests.

We start with the decision tree. First, we load the dataset and display its first few rows.

In [1]:
import numpy as np
import pandas as pd
from IPython.display import display # allow the use of display() for DataFrames

# render matplotlib plots inline in the notebook
%matplotlib inline

import random
random.seed(42) # seed Python's built-in random module for reproducibility

# load the dataset
in_file = 'titanic_data.csv'
full_data = pd.read_csv(in_file)

# print the first few entries of the RMS Titanic data
display(full_data.head())

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


These are the various features present for each passenger on the ship:
- **Survived**: Outcome of survival (0 = No, 1 = Yes)
- **Pclass**: Socio-economic class (1 = Upper class, 2 = Middle class, 3 = Lower class)
- **Name**: Name of passenger
- **Sex**: Sex of the passenger
- **Age**: Age of the passenger (Some entries contain `NaN`)
- **SibSp**: Number of siblings and spouses of the passenger aboard
- **Parch**: Number of parents and children of the passenger aboard
- **Ticket**: Ticket number of the passenger
- **Fare**: Fare paid by the passenger
- **Cabin**: Cabin number of the passenger (Some entries contain `NaN`)
- **Embarked**: Port of embarkation of the passenger (C = Cherbourg, Q = Queenstown, S = Southampton)
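
Since **Age** and **Cabin** contain `NaN` entries, it can help to count the missing values per column before modelling. The following is a minimal sketch, assuming a small hand-made frame called `sample` (a hypothetical stand-in that copies three columns from the five rows displayed above, not the full dataset). On the real data, the same call would presumably be `full_data.isnull().sum()`.

```python
import numpy as np
import pandas as pd

# hypothetical sample: three columns copied from the five rows shown above
sample = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, 'C123', np.nan],
    'Embarked': ['S', 'C', 'S', 'S', 'S'],
})

# count the missing (NaN) entries in each column
missing_counts = sample.isnull().sum()
print(missing_counts)
```

In these five rows only **Cabin** has gaps; the full dataset also has missing **Age** values, as noted in the list above.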

Since we're interested in the survival outcome of each passenger, we can remove the **Survived** feature from the dataset and store it in a separate variable, `outcomes`. We will use these outcomes as our prediction targets.

In [5]:
# save the feature 'Survived' in a new variable and remove it from the dataset
outcomes = full_data['Survived']
features_raw = full_data.drop('Survived', axis = 1)

# print the first few entries of the dataset with 'Survived' removed
display(features_raw.head())

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


The **Survived** feature has now been removed from the DataFrame. `features_raw` (the passenger data) and `outcomes` (the survival outcomes) are now *paired*: for any passenger `features_raw.loc[i]`, the survival outcome is `outcomes[i]`.
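
The pairing holds because both objects keep the row index of the original DataFrame. Here is a small sketch of that idea on a toy frame; the names `sample`, `sample_features`, and `sample_outcomes` are illustrative assumptions, not variables from this notebook.

```python
import pandas as pd

# hypothetical toy data with the same shape of problem: features plus a 'Survived' label
sample = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Survived': [0, 1, 1],
    'Sex': ['male', 'female', 'female'],
})

# split the label from the features, as done above for the full dataset
sample_outcomes = sample['Survived']
sample_features = sample.drop('Survived', axis=1)

# both objects share the original row index, so row i in one matches row i in the other
assert sample_features.index.equals(sample_outcomes.index)

i = 1
print(sample_features.loc[i, 'PassengerId'], sample_outcomes.loc[i])
```

Note that the pairing depends on the index labels: if one of the two objects is later filtered, shuffled, or re-indexed without the other, the rows no longer line up.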