# Titanic: Machine Learning from Disaster
---

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Load the dataset
data = pd.read_csv("./data/train.csv")

## Exploring the data

In [3]:
display(data.head())

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
train_size = data.shape[0]
percent_positive = data[data['Survived'] == 1].shape[0] / train_size
percent_negative = data[data['Survived'] == 0].shape[0] / train_size

In [5]:
print("Number of training examples: {}".format(train_size))
print("Positive examples: {:2.0f}%".format(percent_positive*100))
print("Negative examples: {:2.0f}%".format(percent_negative*100))

Number of training examples: 891
Positive examples: 38%
Negative examples: 62%


**Note**: The classes are moderately imbalanced (62% negative vs. 38% positive), so accuracy alone can be misleading as a metric. We'll use the F1 score instead.
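
To see why accuracy can mislead here, the sketch below scores a trivial baseline that always predicts "did not survive". The label counts are reconstructed from the rounded percentages printed above (an approximation, not the exact counts in `train.csv`), and `f1_score_binary` is a hypothetical helper written for this illustration, not a library function.

```python
import numpy as np


def f1_score_binary(y_true, y_pred):
    """F1 score for the positive class (label 1); 0.0 when it is undefined."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return 0.0
    return float(2 * tp / denominator)


# Assumption: counts approximated from the rounded 38% / 62% split above
n_total = 891
n_positive = round(0.38 * n_total)
y_true = np.array([1] * n_positive + [0] * (n_total - n_positive))

# Baseline that never predicts survival
y_always_negative = np.zeros(n_total, dtype=int)

baseline_accuracy = float(np.mean(y_true == y_always_negative))
baseline_f1 = f1_score_binary(y_true, y_always_negative)

print("Baseline accuracy: {:.2f}".format(baseline_accuracy))
print("Baseline F1: {:.2f}".format(baseline_f1))
```

The baseline reaches roughly 62% accuracy while its F1 score for the positive class is 0, because it never identifies a single survivor.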

In [6]:
# Correlate numeric columns only; pandas >= 2.0 raises on text columns otherwise
data.corr(numeric_only=True)

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
PassengerId,1.0,-0.005007,-0.035144,0.036847,-0.057527,-0.001652,0.012658
Survived,-0.005007,1.0,-0.338481,-0.077221,-0.035322,0.081629,0.257307
Pclass,-0.035144,-0.338481,1.0,-0.369226,0.083081,0.018443,-0.5495
Age,0.036847,-0.077221,-0.369226,1.0,-0.308247,-0.189119,0.096067
SibSp,-0.057527,-0.035322,0.083081,-0.308247,1.0,0.414838,0.159651
Parch,-0.001652,0.081629,0.018443,-0.189119,0.414838,1.0,0.216225
Fare,0.012658,0.257307,-0.5495,0.096067,0.159651,0.216225,1.0
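
One way to read the table is to rank the features by the absolute value of their correlation with `Survived`. The sketch below copies the `Survived` row from the output above into a Series (`survived_corr` is a name introduced for this illustration). On the full DataFrame the same Series could presumably be obtained with `data.corr(numeric_only=True)["Survived"].drop("Survived")`, assuming a pandas version that supports `numeric_only`.

```python
import pandas as pd

# Values copied from the "Survived" row of the correlation table above
survived_corr = pd.Series({
    "PassengerId": -0.005007,
    "Pclass": -0.338481,
    "Age": -0.077221,
    "SibSp": -0.035322,
    "Parch": 0.081629,
    "Fare": 0.257307,
})

# Sort by magnitude so strong negative correlations rank as high as positive ones
ranked = survived_corr.sort_values(key=abs, ascending=False)
print(ranked)
```

`Pclass` and `Fare` stand out, while `PassengerId` ends up last with a correlation close to zero.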


---
## Removing names and passenger ID

We assume that passengers' names and IDs are irrelevant to their chance of survival.

The correlation table above supports this for the passenger ID: its Pearson correlation with the target is only $-0.005$, so there is no meaningful linear relationship with `Survived`. `Name` is a text column and does not appear in the table, so dropping it rests on the assumption alone.
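
A near-zero Pearson correlation only rules out a linear relationship, which is worth keeping in mind before discarding a feature on that basis alone. The toy example below (synthetic data, unrelated to the Titanic set) shows a feature that fully determines the target yet has a correlation of essentially zero.

```python
import numpy as np
import pandas as pd

# Synthetic data: y is completely determined by x, but not linearly
x = np.arange(-5, 6)  # symmetric around zero: -5, ..., 5
toy = pd.DataFrame({"x": x, "y": x ** 2})

# Series.corr uses the Pearson method by default
pearson_r = toy["x"].corr(toy["y"])
print("Pearson correlation between x and y: {:.3f}".format(abs(pearson_r)))
```

For an arbitrary identifier such as `PassengerId` this caveat matters little, but it is a reason to be more careful with features like `Age`.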

In [7]:
# Drop Name and PassengerId columns
data = data.drop(columns=["PassengerId", "Name"])
data.head(1)

,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,3,male,22.0,1,0,A/5 21171,7.25,,S
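
Because the cell above reassigns `data`, running it a second time raises a `KeyError`: the columns are already gone. Below is a minimal sketch of a re-runnable variant using `errors="ignore"`, shown on a small frame built from the first rows displayed earlier (`sample` and `to_drop` are names introduced here for illustration).

```python
import pandas as pd

# Small stand-in for the real DataFrame, using rows from the head() output
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",
        "Heikkinen, Miss. Laina",
    ],
    "Sex": ["male", "female", "female"],
})

to_drop = ["PassengerId", "Name"]

# errors="ignore" skips labels that are already missing instead of raising
sample = sample.drop(columns=to_drop, errors="ignore")
sample = sample.drop(columns=to_drop, errors="ignore")  # second run is a no-op

print(sample.columns.tolist())
```

The trade-off is that `errors="ignore"` also hides typos in column names, so it suits exploratory notebooks better than production pipelines.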
