Titanic Machine Learning Data Trainer
- In this notebook, we will analyze the training data and use it to predict the outcomes for the passengers in the test set. Given the data we have, can we accurately predict whether the rest of the passengers lived or died?
- First, we will explore the data manually. Do any attributes (gender, ticket class, age, etc.) show a reasonably strong correlation with survival rate?

Data Information

(1) train.csv
- train.csv contains the details of a subset of the passengers on board (891 passengers, to be exact -- where each passenger gets a different row in the table).

- The values in the second column ("Survived") can be used to determine whether each passenger survived or not:
- if it's a "1", the passenger survived.
- if it's a "0", the passenger died.

(2) test.csv
- Using the patterns you find in train.csv, you have to predict whether the other 418 passengers on board (in test.csv) survived.

- Note that test.csv does not have a "Survived" column - this information is hidden from you, and how well you do at predicting these hidden values will determine how highly you score in the competition!

(3) gender_submission.csv
- The gender_submission.csv file is provided as an example that shows how you should structure your predictions. It predicts that all female passengers survived, and all male passengers died. Your hypotheses regarding survival will probably be different, which will lead to a different submission file. But, just like this file, your submission should have:

- a "PassengerId" column containing the IDs of each passenger from test.csv.
- a "Survived" column (that you will create!) with a "1" for the rows where you think the passenger survived, and a "0" where you predict that the passenger died.
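The two requirements above can be checked mechanically before uploading. The following is only a minimal sketch: `check_submission` is a hypothetical helper (not part of the competition or of any library), the passenger IDs are made up, and it assumes the columns appear in the order listed above.

```python
import csv
import io

def check_submission(csv_text, expected_ids):
    """Return a list of problems found in a submission; an empty list means it looks fine."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # The header must contain exactly the two expected columns
    if reader.fieldnames != ["PassengerId", "Survived"]:
        problems.append(f"unexpected header: {reader.fieldnames}")
        return problems
    rows = list(reader)
    # Every passenger from the test set must appear, in the same order
    ids = [int(row["PassengerId"]) for row in rows]
    if ids != list(expected_ids):
        problems.append("PassengerId values do not match test.csv")
    # Predictions must be 0 (died) or 1 (survived)
    if any(row["Survived"] not in ("0", "1") for row in rows):
        problems.append("Survived must be 0 or 1")
    return problems

# Made-up three-row submissions, for illustration only
good = "PassengerId,Survived\n101,0\n102,1\n103,1\n"
bad = "PassengerId,Survived\n101,0\n102,2\n103,1\n"
good_problems = check_submission(good, [101, 102, 103])
bad_problems = check_submission(bad, [101, 102, 103])
print(good_problems)
print(bad_problems)
```

In a real run, `expected_ids` would come from the "PassengerId" column of test.csv.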



In [23]:
import numpy as np
import pandas as pd
import os
import csv
from sklearn.ensemble import RandomForestClassifier

In [19]:
# Load the test set (the passengers whose outcome we need to predict)
test_data = pd.read_csv("test.csv")
test_data.head()




In [20]:
train_data = pd.read_csv("train.csv")
train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [53]:
# "Survived" is 0 or 1, so sum / count gives the fraction of survivors
women = train_data.loc[train_data.Sex == 'female', "Survived"]
rate_women = sum(women)/len(women)

men = train_data.loc[train_data.Sex == 'male', "Survived"]
rate_men = sum(men)/len(men)

print("Fraction of men who survived:", rate_men)
print("Fraction of women who survived:", rate_women)

Fraction of men who survived: 0.18890814558058924
Fraction of women who survived: 0.7420382165605095
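The same per-group rate can be computed in a single step with `groupby`, which scales more easily to other columns such as "Pclass". This is just a sketch on a tiny made-up sample that reuses the column names from train.csv; the values are not real passengers.

```python
import pandas as pd

# Tiny made-up sample with the same column names as train.csv
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Survived": [0, 1, 1, 0, 0, 1],
})

# The mean of a 0/1 column is the fraction of rows equal to 1
rates = sample.groupby("Sex")["Survived"].mean()
print(rates)
```

On the real data, replacing `sample` with `train_data` should reproduce the two rates printed above.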


Given the rates above (about 19% for men versus about 74% for women), it is clear that gender is strongly correlated with survival. Let's dig a little deeper and see whether age shows a significant correlation as well.

In [54]:
# All passengers under 18, regardless of sex
child = train_data.loc[train_data.Age < 18, "Survived"]
rate_child = sum(child)/len(child)

print("Fraction of children (under 18) who survived:", rate_child)

Fraction of children (under 18) who survived: 0.5398230088495575
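One caveat with age filters like the one above: a comparison against a missing value (NaN) evaluates to False, so any passenger whose Age is missing is silently left out of both `Age < 18` and `Age >= 18`. The sketch below illustrates this on a made-up sample; whether it matters for the real data depends on how many ages are missing, which `train_data["Age"].isna().sum()` would show.

```python
import numpy as np
import pandas as pd

# Made-up sample in which two of the five ages are missing
sample = pd.DataFrame({
    "Age": [4.0, 16.0, np.nan, 30.0, np.nan],
    "Survived": [1, 0, 1, 0, 1],
})

n_under_18 = int((sample["Age"] < 18).sum())   # NaN < 18 is False
n_adult = int((sample["Age"] >= 18).sum())     # NaN >= 18 is also False
n_missing = int(sample["Age"].isna().sum())

# The two age groups do not add up to the full sample
print(n_under_18, n_adult, n_missing, len(sample))
```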


With a survival rate of ~54% for children (both male and female) under the age of 18, this does not tell us much on its own. However, I have a feeling that this rate could be skewed by males close to that age threshold (i.e. teenage boys). I could see a world where boys old enough to no longer be considered "children" may not have had priority spots on the lifeboats. Let's dig into this a bit further, and see if we can uncover any trends when considering age and gender together:

In [60]:
# Combine both conditions into one boolean mask (avoids pandas' reindexing warning).
# Note that the age cutoffs are cumulative: "under 12" includes the under-5 group, and so on.
child_male_sub_5 = train_data.loc[(train_data.Age < 5) & (train_data.Sex == 'male'), "Survived"]
rate_child_male_sub_5 = sum(child_male_sub_5)/len(child_male_sub_5)

child_male_sub_12 = train_data.loc[(train_data.Age < 12) & (train_data.Sex == 'male'), "Survived"]
rate_child_male_sub_12 = sum(child_male_sub_12)/len(child_male_sub_12)

child_male_sub_18 = train_data.loc[(train_data.Age < 18) & (train_data.Sex == 'male'), "Survived"]
rate_child_male_sub_18 = sum(child_male_sub_18)/len(child_male_sub_18)

print("Fraction of boys who survived (under 5 years old):", rate_child_male_sub_5)
print("Fraction of boys who survived (under 12 years old):", rate_child_male_sub_12)
print("Fraction of boys who survived (under 18 years old):", rate_child_male_sub_18)

Fraction of boys who survived (under 5 years old): 0.6521739130434783
Fraction of boys who survived (under 12 years old): 0.5555555555555556
Fraction of boys who survived (under 18 years old): 0.39655172413793105


Given these survival rates, it is evident that the youngest boys (babies and toddlers) had a much higher chance of survival than older boys: the rate falls steadily as the age cutoff rises. Let's take a look at these same rates for girls:

In [61]:
# Same cumulative age cutoffs as for the boys, using a single combined boolean mask
child_female_sub_5 = train_data.loc[(train_data.Age < 5) & (train_data.Sex == 'female'), "Survived"]
rate_child_female_sub_5 = sum(child_female_sub_5)/len(child_female_sub_5)

child_female_sub_12 = train_data.loc[(train_data.Age < 12) & (train_data.Sex == 'female'), "Survived"]
rate_child_female_sub_12 = sum(child_female_sub_12)/len(child_female_sub_12)

child_female_sub_18 = train_data.loc[(train_data.Age < 18) & (train_data.Sex == 'female'), "Survived"]
rate_child_female_sub_18 = sum(child_female_sub_18)/len(child_female_sub_18)

print("Fraction of girls who survived (under 5 years old):", rate_child_female_sub_5)
print("Fraction of girls who survived (under 12 years old):", rate_child_female_sub_12)
print("Fraction of girls who survived (under 18 years old):", rate_child_female_sub_18)

Fraction of girls who survived (under 5 years old): 0.7058823529411765
Fraction of girls who survived (under 12 years old): 0.59375
Fraction of girls who survived (under 18 years old): 0.6909090909090909


Although the rate dips when girls aged 5-11 are added to the group, it is notable that the survival rate of female babies and toddlers is very similar to the overall rate for girls under the age of 18. In other words, age differences among female children did not have the same impact on survival rate that they had among male children.

Overall, it is clear that a passenger's age is significantly correlated with survival, especially when gender and age are considered together. We will want to keep that in mind when training our model.
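One way to look at age and gender together is to split passengers into non-overlapping age bands and compute the rate per band and sex, so that each passenger counts towards exactly one band. The sketch below uses `pd.cut` on a small made-up sample; the band edges (5, 12, 18) mirror the cutoffs used above, and the "AgeBand" column name is an assumption introduced only for illustration.

```python
import pandas as pd

# Made-up sample with the same column names as train.csv
sample = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female", "male", "female"],
    "Age": [3, 8, 15, 2, 9, 16, 4, 17],
    "Survived": [1, 1, 0, 1, 0, 1, 0, 1],
})

# right=False makes each band include its lower edge and exclude its upper edge
sample["AgeBand"] = pd.cut(
    sample["Age"], bins=[0, 5, 12, 18], labels=["0-4", "5-11", "12-17"], right=False
)

# Survival rate for every (age band, sex) pair, laid out as a small table
band_rates = (
    sample.groupby(["AgeBand", "Sex"], observed=True)["Survived"].mean().unstack()
)
print(band_rates)
```

A band column like this could also be passed to the model as a feature, although that would first require deciding how to handle passengers with a missing age.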

In [26]:
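# Target: 1 = survived, 0 = died
y = train_data["Survived"]

# Features used for this first model (note that Age is not included here)
features = ["Pclass", "Sex", "SibSp", "Parch"]

# One-hot encode the categorical "Sex" column; numeric columns pass through unchanged
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

# Fixed random_state keeps the results reproducible
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)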

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!
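One thing to watch when calling `pd.get_dummies` on the training and test sets separately: if a category appears in only one of the two files, the resulting frames end up with different columns and `model.predict` will reject the test frame. With the four features used above this is unlikely to be an issue, but it may matter if more categorical columns are added later. A minimal sketch of one possible guard, using made-up values for the "Embarked" column:

```python
import pandas as pd

# Made-up frames: the category "C" and "Q" appear only in the training data
train = pd.DataFrame({"Pclass": [1, 3, 2], "Embarked": ["S", "C", "Q"]})
test = pd.DataFrame({"Pclass": [3, 1], "Embarked": ["S", "S"]})

X = pd.get_dummies(train)
X_test = pd.get_dummies(test)

# Give the test frame exactly the training columns, in the same order;
# dummy columns missing from the test data are filled with 0
X_test_aligned = X_test.reindex(columns=X.columns, fill_value=0)

print(list(X.columns))
print(list(X_test.columns))
print(list(X_test_aligned.columns))
```

Reindexing against the training columns also drops any dummy column that exists only in the test data, which is usually what a model fitted on the training columns needs.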
