# Can We Predict Whether Someone Survived The Titanic Sinking?

The sinking of the Titanic is one of the most well-known events in human history. Given its pop-culture significance and historical relevance, thousands if not millions of people have likely asked the question: "Would I survive the Titanic?" While we have no way to go back in time and see for ourselves, we thankfully have the next best thing: data. More specifically, data on the passengers. We know who they were, where they stayed and, most importantly, whether they survived. Let's take a look at this historical demographic battle royale and see what kind of person was most likely to survive the Titanic.

In [76]:
import numpy as np
import pandas as pd 

Before we start any actual analysis, it is a good idea to get a grasp on the data we have.

In [77]:
train_data = pd.read_csv('train.csv')
train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


There are 12 categories provided to us. It is a good idea to write down exactly what they each represent.

PassengerId - The ID of the passenger

Pclass - The ticket class of the passenger. Quality runs from best to worst: 1, 2, 3

Name - The name of the passenger

Sex - The sex of the passenger, male or female

Age - The age of the passenger

SibSp - The number of siblings or spouses the passenger had on board

Parch - The number of parents or children the passenger had on board

Ticket - The ticket number of the passenger

Fare - The price paid for the ticket

Cabin - The cabin they stayed in

Embarked - The port from which they embarked: C = Cherbourg, Q = Queenstown, S = Southampton

The standout category is Survived.

Survived - Whether or not they survived. 1 for Yes, 0 for No.

Our entire model will be built around predicting whether a passenger gets a 1 or a 0.
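
Because the target is coded as 0 and 1, a couple of quick numbers fall out of it almost for free. The snippet below is only a sketch on made-up labels (the `survived` series is hypothetical, not the real column): it shows that the mean of a 0/1 column is the share of survivors, and what accuracy you would get by always guessing the more common outcome.

```python
import pandas as pd

# Made-up labels for illustration only; not the real Survived column.
survived = pd.Series([0, 1, 1, 1, 0, 0, 0, 1, 0, 0], name="Survived")

# Because the target is coded 0/1, its mean is the share of survivors.
survival_rate = survived.mean()

# Accuracy of always guessing the more common class.
baseline_accuracy = max(survival_rate, 1 - survival_rate)

print(survival_rate, baseline_accuracy)
```

On the real data the same two lines could be run on `train_data['Survived']`; any model worth keeping should do better than that baseline.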

Now that we understand the data, let's try to clean it up. First, let's get rid of any non-essential categories. The column "PassengerId" stands out as not particularly useful, as it only serves as a sort of built-in index for each passenger. Therefore it should be dropped.

In [78]:
train_data = train_data.drop('PassengerId', axis=1)

Now that the obvious irrelevancies have been cleared, it's time to look closer at the values in the data frame to see if we can actually use them. A category may be relevant but have so many null values that it has no practical use from a model-building standpoint.

In [79]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 76.7+ KB
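
To make the gaps easier to compare, the non-null counts can be turned into missing counts and shares. This is a small sketch that simply re-types the three incomplete columns from the `info()` output above; the variable names are my own.

```python
import pandas as pd

# Non-null counts copied from the info() output; the frame has 891 rows.
n_rows = 891
non_null = pd.Series({"Age": 714, "Cabin": 204, "Embarked": 889})

# Missing entries per column, and their share of all rows.
missing = n_rows - non_null
report = pd.DataFrame({
    "missing": missing,
    "share": (missing / n_rows).round(3),
})

print(report)
```

On the live frame, `train_data.isna().sum()` should give the same missing counts without any re-typing.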


From the look of things, the category with the most null entries is Cabin. On its own, the cabin is highly relevant to a person's chances of survival. Each Cabin value is made up of a letter and a number. While the number is not particularly significant, the letter is very important, as it tells us the deck (floor) the person was staying on. A person on deck A, near the upper decks and lifeboats, plausibly had a much better chance of surviving than someone on deck G, who had to climb six floors to reach the boats. However, the high number of nulls could make the column unusable. Throwing away such a valuable piece of information doesn't seem like the right move, even though it is only available for about 1/4 of cases, especially since we haven't yet seen how strongly it correlates with survival. So for now, 'Cabin' lives to see another day.
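
One way to eventually check how the deck letter relates to survival is to group by it and look at the share of survivors in each group. The rows below are invented purely to show the mechanics (the `toy` frame and its `Deck` column are hypothetical), so the numbers themselves mean nothing.

```python
import pandas as pd

# Made-up rows for illustration only; they are NOT taken from train.csv.
toy = pd.DataFrame({
    "Deck": ["A", "A", "C", "C", "G", "G", None, None],
    "Survived": [1, 1, 1, 0, 0, 1, 0, 0],
})

# Share of survivors and group size per deck letter.
# dropna=False keeps the rows with no known deck as their own group.
by_deck = toy.groupby("Deck", dropna=False)["Survived"].agg(["mean", "count"])

print(by_deck)
```

Keeping the "unknown" group visible matters here, since most passengers have no cabin recorded at all.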

Now that we have dropped the column that is not useful, we should make the rest of the data more usable. We can do this by converting the remaining non-numeric columns to numbers. We can start with Sex, as it is a binary category in this data: either male or female. We can make male a 0 and female a 1.

In [80]:
train_data['Sex'] = train_data['Sex'].astype(str)

sex_mapping = {'male': 0, 'female': 1}  
train_data['Sex'] = train_data['Sex'].map(sex_mapping)

train_data.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C
2,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S
4,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S
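
One thing to keep in mind with `map` and a dictionary is that any value without a matching key silently becomes NaN. The sketch below uses made-up labels (the `labels` series is hypothetical) to show a quick way of listing whatever failed to map.

```python
import pandas as pd

sex_mapping = {"male": 0, "female": 1}

# Made-up labels; "Female" (capitalised) is deliberately not a key in the mapping.
labels = pd.Series(["male", "female", "Female", "male"])
encoded = labels.map(sex_mapping)

# Series.map leaves NaN wherever the value has no key in the dict,
# so the original labels at those positions are the ones that failed.
unmapped = labels[encoded.isna()].unique().tolist()

print(unmapped)
```

In our data every row is either `male` or `female`, so this list would be expected to come back empty.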


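Next we can turn Embarked into numeric values too: C = 0, Q = 1, S = 2. Two passengers have no port recorded; they come out of the mapping as NaN, which is why the column is displayed as floats (2.0 rather than 2).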

In [81]:
train_data['Embarked'] = train_data['Embarked'].astype(str)

port_mapping = {'C': 0, 'Q': 1, 'S': 2 }  
train_data['Embarked'] = train_data['Embarked'].map(port_mapping)

train_data.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,2.0
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,0.0
2,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,2.0
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,2.0
4,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,2.0
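
The `2.0` and `0.0` in the Embarked column come from the missing ports: a plain integer column cannot hold NaN, so pandas falls back to floats. If whole numbers are preferred, one option is pandas' nullable `Int64` dtype. The sketch below uses a made-up `ports` series rather than the real column.

```python
import pandas as pd

port_mapping = {"C": 0, "Q": 1, "S": 2}

# Made-up column with one missing port, mimicking the gaps in Embarked.
ports = pd.Series(["S", "C", None, "Q"])

# The missing entry maps to NaN, which forces the result to float64.
as_float = ports.map(port_mapping)

# The nullable integer dtype keeps whole numbers and shows the gap as <NA>.
as_int = ports.map(port_mapping).astype("Int64")

print(as_float.dtype, as_int.dtype)
```

Whether to keep the floats or switch to `Int64` is mostly a matter of taste at this stage; the missing ports still have to be dealt with either way.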


Now we come back to Cabin. This will be a two-step process. First we will strip Cabin down to its letter, which represents the deck (floor), and from there we will map each letter to a number: A = 0, B = 1, and so on up to G = 6.
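
Before running this on the full data, the two steps can be sketched on a handful of made-up cabin strings. The `cabins` series and the `deck_mapping` name are assumptions for illustration; the regular expression is the same one used in the next cell.

```python
import pandas as pd

# Made-up cabin strings for illustration; None stands in for a missing cabin.
cabins = pd.Series(["C85", None, "C123", "E46", "G6", "B57 B59"])

# Step 1: keep the first run of letters, i.e. the deck.
deck = cabins.str.extract(r"([A-Za-z]+)", expand=False)

# Step 2: A = 0, B = 1, ... G = 6.
# Missing cabins and any letter outside A-G stay NaN.
deck_mapping = {letter: i for i, letter in enumerate("ABCDEFG")}
deck_number = deck.map(deck_mapping)

print(deck_number.tolist())
```

Note that a multi-cabin entry such as `"B57 B59"` is reduced to its first deck letter, which is a simplification.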

In [84]:
train_data['Cabin'] = train_data['Cabin'].str.extract('([A-Za-z]+)', expand=False)
train_data

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.2500,,2.0
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C,0.0
2,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.9250,,2.0
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1000,C,2.0
4,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.0500,,2.0
...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,"Montvila, Rev. Juozas",0,27.0,0,0,211536,13.0000,,2.0
887,1,1,"Graham, Miss. Margaret Edith",1,19.0,0,0,112053,30.0000,B,2.0
888,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",1,,1,2,W./C. 6607,23.4500,,2.0
889,1,1,"Behr, Mr. Karl Howell",0,26.0,0,0,111369,30.0000,C,0.0
