# Titanic: Machine Learning from Disaster

## Prompt from site:
The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).

## Question
The question for this competition is "what sorts of people were more likely to survive?"

We'll train a machine learning model on a labeled subset of the passenger data, then use it to predict which individuals in a separate test set survived the wreck. Both sets are provided by the competition.

For each individual, the answer to "did this person survive?" is yes or no. Because the response is categorical, this is a classification problem, so we'll use a classification model, likely some form of decision tree.
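
To make the classification framing concrete, here is a minimal sketch of fitting a decision tree with scikit-learn. It is not the final model: the five rows are taken from the first passengers in train.csv, the `female` column assumes sex has already been encoded as 0/1, and `max_depth` and `random_state` are illustrative choices rather than tuned values.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy rows taken from the first five passengers in train.csv;
# a real model would be fit on the full training set.
toy = pd.DataFrame({
    'Pclass':   [3, 1, 3, 1, 3],
    'female':   [0, 1, 1, 1, 0],  # assumed 0/1 encoding of Sex
    'Age':      [22.0, 38.0, 26.0, 35.0, 35.0],
    'Fare':     [7.25, 71.2833, 7.925, 53.1, 8.05],
    'Survived': [0, 1, 1, 1, 0],
})

X = toy.drop(columns='Survived')
y = toy['Survived']

# max_depth and random_state are illustrative, not tuned.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Each prediction is a class label: 0 (did not survive) or 1 (survived).
predictions = tree.predict(X)
print(predictions)
```

Predicting on the rows used for fitting only shows the mechanics; an honest accuracy estimate needs held-out data.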

## Data
The data is provided by the Kaggle website as two CSV files: test.csv and train.csv. We'll start by reviewing and visualizing the train.csv data.
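
As a first look at the data, it can help to check its shape and which columns are not numeric, since those need encoding or dropping before modeling. The sketch below uses three rows in the train.csv layout, read from an in-memory string as a stand-in for the real file; `sample_csv` is a hypothetical name.

```python
import io
import pandas as pd

# Stand-in for train.csv: three rows in the competition's column layout.
sample_csv = io.StringIO(
    'PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n'
    '1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S\n'
    '3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S\n'
    '4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S\n'
)
sample = pd.read_csv(sample_csv)

# Number of rows and columns.
n_rows, n_cols = sample.shape

# Columns that are not numeric and so cannot be fed to most models as-is.
non_numeric = sample.select_dtypes(exclude='number').columns.tolist()
print(n_rows, n_cols, non_numeric)
```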

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Location of the train.csv data
train = '/Users/Brandon/Documents/GitHub/Data-Science-Projects/Kaggle Competition Projects/Titanic Machine Learning from Disaster/Data/train.csv'

# create dataframe and view first few rows
df_train=pd.read_csv(train)
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Replacing categorical data with numerical representations:

In [3]:
# dtype=int keeps the 0/1 output shown below; newer pandas versions default to booleans
dummy_sex = pd.get_dummies(df_train['Sex'], dtype=int)
df_train = pd.concat([df_train, dummy_sex], axis=1)
df_train.drop('Sex', axis=1, inplace=True)
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,female,male
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,,S,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,C,1,0
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,,S,1,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,C123,S,1,0
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,,S,0,1


In [4]:
# dtype=int keeps the 0/1 output shown below; newer pandas versions default to booleans
dummy_embarked = pd.get_dummies(df_train['Embarked'], dtype=int)
df_train = pd.concat([df_train, dummy_embarked], axis = 1)
df_train.drop('Embarked', axis=1, inplace=True)
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,female,male,C,Q,S
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,,0,1,0,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,1,0,1,0,0
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,,1,0,0,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,C123,1,0,0,0,1
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,,0,1,0,0,1
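
The two encoding cells above can also be written as a single call. This is a sketch on a toy frame, not a replacement for the cells: passing `columns=` to `pd.get_dummies` encodes several fields at once and drops the originals, but it prefixes the new column names (`Sex_female`, `Embarked_C`, and so on), which differs from the unprefixed names used in this notebook.

```python
import pandas as pd

# Toy frame with the two categorical fields encoded in this notebook.
toy = pd.DataFrame({
    'Survived': [0, 1, 1],
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'S'],
})

# columns= encodes both fields in one call and removes the originals;
# dtype=int gives 0/1 values instead of booleans.
encoded = pd.get_dummies(toy, columns=['Sex', 'Embarked'], dtype=int)
print(encoded.columns.tolist())
```

Because `female` and `male` always sum to 1, one of the pair is redundant; `drop_first=True` is one way to keep only a single indicator per field.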


PassengerId has little or no correlation with survival, so let's drop it for now. The cell below also drops Ticket and Cabin.

In [5]:
# Drop all three columns in a single call
df_train.drop(columns=['PassengerId', 'Ticket', 'Cabin'], inplace=True)

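Renaming the columns to make them more readable. In the Kaggle data, SibSp counts siblings and spouses aboard and Parch counts parents and children aboard, so the names used below are shorthand.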

In [6]:
df_train.rename(columns={
    'Pclass': 'Passenger Class',
    'SibSp': 'Siblings Onboard',
    'Parch': 'Parents Onboard',
    'C': 'Cherbourg',
    'Q': 'Queenstown',
    'S': 'Southampton',
}, inplace=True)
df_train.head()

Unnamed: 0,Survived,Passenger Class,Name,Age,Siblings Onboard,Parents Onboard,Fare,female,male,Cherbourg,Queenstown,Southampton
0,0,3,"Braund, Mr. Owen Harris",22.0,1,0,7.25,0,1,0,0,1
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,71.2833,1,0,1,0,0
2,1,3,"Heikkinen, Miss. Laina",26.0,0,0,7.925,1,0,0,0,1
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,53.1,1,0,0,0,1
4,0,3,"Allen, Mr. William Henry",35.0,0,0,8.05,0,1,0,0,1


In [7]:
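# Name is still a text column; numeric_only=True (pandas 1.5+) excludes it explicitly
df_train.corr(numeric_only=True)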

Unnamed: 0,Survived,Passenger Class,Age,Siblings Onboard,Parents Onboard,Fare,female,male,Cherbourg,Queenstown,Southampton
Survived,1.0,-0.338481,-0.077221,-0.035322,0.081629,0.257307,0.543351,-0.543351,0.16824,0.00365,-0.15566
Passenger Class,-0.338481,1.0,-0.369226,0.083081,0.018443,-0.5495,-0.1319,0.1319,-0.243292,0.221009,0.08172
Age,-0.077221,-0.369226,1.0,-0.308247,-0.189119,0.096067,-0.093254,0.093254,0.036261,-0.022405,-0.032523
Siblings Onboard,-0.035322,0.083081,-0.308247,1.0,0.414838,0.159651,0.114631,-0.114631,-0.059528,-0.026354,0.070941
Parents Onboard,0.081629,0.018443,-0.189119,0.414838,1.0,0.216225,0.245489,-0.245489,-0.011069,-0.081228,0.063036
Fare,0.257307,-0.5495,0.096067,0.159651,0.216225,1.0,0.182333,-0.182333,0.269335,-0.117216,-0.166603
female,0.543351,-0.1319,-0.093254,0.114631,0.245489,0.182333,1.0,-1.0,0.082853,0.074115,-0.125722
male,-0.543351,0.1319,0.093254,-0.114631,-0.245489,-0.182333,-1.0,1.0,-0.082853,-0.074115,0.125722
Cherbourg,0.16824,-0.243292,0.036261,-0.059528,-0.011069,0.269335,0.082853,-0.082853,1.0,-0.148258,-0.778359
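Queenstown,0.00365,0.221009,-0.022405,-0.026354,-0.081228,-0.117216,0.074115,-0.074115,-0.148258,1.0,-0.496624
Southampton,-0.15566,0.08172,-0.032523,0.070941,0.063036,-0.166603,-0.125722,0.125722,-0.778359,-0.496624,1.0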
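
The full matrix is hard to scan, so one option is to pull out only the correlations with Survived and rank them by strength. The sketch below copies the Survived values from the output above into a Series instead of recomputing them; `survived_corr` and `ranked` are hypothetical names.

```python
import pandas as pd

# Correlations with Survived, copied from the matrix above.
survived_corr = pd.Series({
    'Passenger Class': -0.338481,
    'Age': -0.077221,
    'Siblings Onboard': -0.035322,
    'Parents Onboard': 0.081629,
    'Fare': 0.257307,
    'female': 0.543351,
    'male': -0.543351,
    'Cherbourg': 0.16824,
    'Queenstown': 0.00365,
    'Southampton': -0.15566,
})

# Sort by absolute value so strong negative correlations rank
# alongside strong positive ones.
ranked = survived_corr.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

On the notebook's DataFrame, the same Series could be obtained from the correlation matrix by selecting its `'Survived'` column and dropping the `'Survived'` entry.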


In [8]:
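# Fill missing ages with the mean age (mean() skips NaN by default).
# Assigning the result back avoids in-place modification of a column selection,
# which newer pandas versions warn about and may not apply to df_train.
avg_age = df_train['Age'].mean()
df_train['Age'] = df_train['Age'].fillna(avg_age)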
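
Mean imputation hides which ages were originally missing. A possible variation, sketched below on a toy column, records that information in an indicator before filling. The `Age Missing` column is a hypothetical addition and is not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy column with one missing age.
ages = pd.DataFrame({'Age': [22.0, 38.0, np.nan, 35.0]})

# Flag rows that were missing before filling (hypothetical extra feature).
ages['Age Missing'] = ages['Age'].isnull().astype(int)

# mean() ignores NaN, so the average comes from the three known ages.
avg_age = ages['Age'].mean()
ages['Age'] = ages['Age'].fillna(avg_age)
print(ages)
```

Using `median()` in place of `mean()` is a common alternative when the age distribution is skewed.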

In [9]:
missing_data = df_train.isnull()
missing_data.head(5)

Unnamed: 0,Survived,Passenger Class,Name,Age,Siblings Onboard,Parents Onboard,Fare,female,male,Cherbourg,Queenstown,Southampton
0,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False


In [11]:
# for column in missing_data.columns.values.tolist():
#     print(column)
#     print(missing_data[column].value_counts())
#     print('')
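
A shorter way to get the same per-column picture as the commented-out loop is to sum the boolean frame, which counts missing values in each column. The sketch below uses a toy frame; on the notebook's data it would be applied to `df_train`.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing age.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Fare': [7.25, 71.2833, 7.925],
})

# isnull() returns a boolean frame; summing counts the True values per column.
missing_counts = toy.isnull().sum()
print(missing_counts)
```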