Titanic Machine Learning Data Trainer
- In this notebook, we will analyze the training data and attempt to predict the outcomes for the passengers in the test set. Given the data we have, can we accurately predict whether the rest of the people lived or died?
- First, we will want to do manual research on the data we have. Are there any trends (gender, ticket class, age, etc.) that represent a strong-ish correlation with survival rate?

Data Information

(1) train.csv
- train.csv contains the details of a subset of the passengers on board (891 passengers, to be exact -- where each passenger gets a different row in the table).

- The values in the second column ("Survived") can be used to determine whether each passenger survived or not:
- if it's a "1", the passenger survived.
- if it's a "0", the passenger died.

(2) test.csv
- Using the patterns you find in train.csv, you have to predict whether the other 418 passengers on board (in test.csv) survived.

- Note that test.csv does not have a "Survived" column - this information is hidden from you, and how well you do at predicting these hidden values will determine how highly you score in the competition!

(3) gender_submission.csv
- The gender_submission.csv file is provided as an example that shows how you should structure your predictions. It predicts that all female passengers survived, and all male passengers died. Your hypotheses regarding survival will probably be different, which will lead to a different submission file. But, just like this file, your submission should have:

- a "PassengerId" column containing the IDs of each passenger from test.csv.
- a "Survived" column (that you will create!) with a "1" for the rows where you think the passenger survived, and a "0" where you predict that the passenger died.
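As a minimal sketch of that structure, the snippet below builds a table in the same shape as gender_submission.csv. The `sample_test` frame is a hand-typed stand-in for test.csv (only three rows and two columns), and the rule applied is the one the example file uses: female passengers are predicted to survive, male passengers to die.

```python
import pandas as pd

# Hand-typed stand-in for test.csv (only the columns needed here)
sample_test = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Sex": ["male", "female", "male"],
})

# Same rule as gender_submission.csv: 1 for female passengers, 0 for male
submission = pd.DataFrame({
    "PassengerId": sample_test["PassengerId"],
    "Survived": (sample_test["Sex"] == "female").astype(int),
})

# index=False keeps the file to exactly the two required columns
print(submission.to_csv(index=False))
```

Swapping the `Survived` column for a model's predictions gives a file in the same format.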



In [None]:
import numpy as np
import pandas as pd
import os
import csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam

In [2]:
test_data = pd.read_csv("test.csv")
test_data.head()


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [3]:
train_data = pd.read_csv("train.csv")
train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
#Here, we are looking at the survival rate of each class
first_class = train_data.loc[train_data.Pclass == 1]["Survived"]
rate_first = sum(first_class)/len(first_class)

second_class = train_data.loc[train_data.Pclass == 2]["Survived"]
rate_second = sum(second_class)/len(second_class)

third_class = train_data.loc[train_data.Pclass == 3]["Survived"]
rate_third = sum(third_class)/len(third_class)

print("% of 1st class passengers that survived:", rate_first)
print("% of 2nd class passengers that survived:", rate_second)
print("% of 3rd class passengers that survived:", rate_third)

% of 1st class passengers that survived: 0.6296296296296297
% of 2nd class passengers that survived: 0.47282608695652173
% of 3rd class passengers that survived: 0.24236252545824846
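The three per-class calculations can also be written as one grouped mean. The sketch below uses a small hand-made frame (`toy` is hypothetical, not the Kaggle data) so that it runs on its own; on the real data the equivalent call would be `train_data.groupby("Pclass")["Survived"].mean()`.

```python
import pandas as pd

# Hypothetical data: two 1st class, two 2nd class, four 3rd class passengers
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Because Survived is 0/1, its mean within a group is that group's survival rate
rate_by_class = toy.groupby("Pclass")["Survived"].mean()
print(rate_by_class)
```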


In [5]:
#Now, we want to see the rate of survival from each port to see if there is some connection there
c_port = train_data.loc[train_data.Embarked == 'C']["Survived"]
rate_cport = sum(c_port)/len(c_port)

q_port = train_data.loc[train_data.Embarked == 'Q']["Survived"]
rate_qport = sum(q_port)/len(q_port)

s_port = train_data.loc[train_data.Embarked == 'S']["Survived"]
rate_sport = sum(s_port)/len(s_port)

print("% of C port passengers that survived:", rate_cport)
print("% of Q port passengers that survived:", rate_qport)
print("% of S port passengers that survived:", rate_sport)

% of C port passengers that survived: 0.5535714285714286
% of Q port passengers that survived: 0.38961038961038963
% of S port passengers that survived: 0.33695652173913043


In [6]:
#Now we will start to look at fare prices.
#We will try to filter out some variables that might affect the price. It seems like Parch and SibSp are strong determining factors in the price, so we want to take them out of the equation for now.
#Here, we are looking at how many distinct values there are in both SibSp and Parch
num_dist_parch = train_data.Parch.nunique(dropna=True)
num_dist_sibsp = train_data.SibSp.nunique(dropna=True)


#Now we want to look at the fare prices of people whose SibSp and Parch are both 0.
#Both conditions are combined into a single mask to avoid chained boolean indexing.
NaN_SibSp_Parch = train_data.loc[(train_data['Parch'] == 0) & (train_data['SibSp'] == 0)]
cond1 = train_data['Pclass'] == 1
cond2 = train_data['Pclass'] == 2
cond3 = train_data['Pclass'] == 3
cond4 = train_data['Parch'] == 0
cond5 = train_data['SibSp'] == 0


combined_cond1 = cond1 & cond4 & cond5
filtered_cond1 = train_data[combined_cond1]
first_avg = filtered_cond1['Fare'].mean()


combined_cond2 = cond2 & cond4 & cond5
filtered_cond2 = train_data[combined_cond2]
second_avg = filtered_cond2['Fare'].mean()


combined_cond3 = cond3 & cond4 & cond5
filtered_cond3 = train_data[combined_cond3]
third_avg = filtered_cond3['Fare'].mean()

print("Average first class ticket without Parch and SibSp:", first_avg)
print("Average second class ticket without Parch and SibSp:", second_avg)
print("Average third class ticket without Parch and SibSp:", third_avg)

Average first class ticket without Parch and SibSp: 63.67251376146789
Average second class ticket without Parch and SibSp: 14.06610576923077
Average third class ticket without Parch and SibSp: 9.272051851851854
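Each condition above is a boolean Series, and combining them with `&` (with every comparison wrapped in parentheses) selects the rows that satisfy all of them in one indexing step. A minimal sketch on a hypothetical `toy` frame, which also folds the three per-class averages into a single `groupby`:

```python
import pandas as pd

# Hypothetical passengers; the values are illustrative only
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "SibSp":  [0, 1, 0, 0, 0],
    "Parch":  [0, 0, 0, 2, 0],
    "Fare":   [60.0, 80.0, 13.0, 20.0, 8.0],
})

# One mask for "no siblings/spouses and no parents/children aboard"
travelling_alone = (toy["SibSp"] == 0) & (toy["Parch"] == 0)

# Average fare per class among the rows selected by the mask
avg_fare_alone = toy.loc[travelling_alone].groupby("Pclass")["Fare"].mean()
print(avg_fare_alone)
```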




In [7]:
#The average fares per class for passengers without Parch and SibSp (about $64, $14 and $9, from the previous cell) give us rough thresholds for fare bands.
high_rate = train_data.loc[train_data.Fare >= 64]["Survived"]
rate_high = sum(high_rate)/len(high_rate)


#Combine both bounds into a single mask instead of chaining two boolean filters
med_fare = train_data.loc[(train_data.Fare < 64) & (train_data.Fare >= 14)]["Survived"]
rate_med = sum(med_fare)/len(med_fare)


#Note: a fare of exactly $14 is counted in both the medium and the low band
low_fare = train_data.loc[train_data.Fare <= 14]["Survived"]
rate_low = sum(low_fare)/len(low_fare)


print("% of people who survived and paid over $64:", rate_high)
print("% of people who survived and paid between $14 and $64:",  rate_med)
print("% of people who survived and paid below $14:", rate_low)

% of people who survived and paid over $64: 0.6864406779661016
% of people who survived and paid between $14 and $64: 0.44510385756676557
% of people who survived and paid below $14: 0.2540045766590389
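One way to express fare bands without repeating filters is `pd.cut`. The sketch below is an illustration under stated assumptions rather than the notebook's method: it reuses the $14 and $64 thresholds but makes the bands half-open ([0, 14), [14, 64), [64, ∞)), so that every fare falls into exactly one band. The `toy` frame and the band labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical fares, including values that sit exactly on a threshold
toy = pd.DataFrame({
    "Fare":     [7.25, 14.0, 30.0, 64.0, 100.0, 8.05],
    "Survived": [0,    1,    0,    1,    1,     0],
})

# right=False makes each bin include its left edge and exclude its right edge
bands = pd.cut(
    toy["Fare"],
    bins=[0, 14, 64, np.inf],
    labels=["low", "medium", "high"],
    right=False,
)

# Survival rate within each fare band
rate_by_band = toy.groupby(bands, observed=True)["Survived"].mean()
print(rate_by_band)
```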




In [8]:
condi1 = train_data['Embarked'] == 'C'
condi2 = train_data['Embarked'] == 'Q'
condi3 = train_data['Embarked'] == 'S'
condi4 = train_data['Parch'] == 0
condi5 = train_data['SibSp'] == 0


combined_condi1 = condi1 & condi4 & condi5
filtered_condi1 = train_data[combined_condi1]
c_port_avg = filtered_condi1['Fare'].mean()


combined_condi2 = condi2 & condi4 & condi5
filtered_condi2 = train_data[combined_condi2]
q_port_avg = filtered_condi2['Fare'].mean()


combined_condi3 = condi3 & condi4 & condi5
filtered_condi3 = train_data[combined_condi3]
s_port_avg = filtered_condi3['Fare'].mean()


print(c_port_avg, q_port_avg, s_port_avg)


49.756618823529415 8.382970175438595 16.641684223918585


In [9]:
#Survival rate by port for passengers with no siblings/spouses aboard (SibSp == 0).
#Only SibSp is filtered here; Parch is not, unlike the fare averages above.
c_port_rate = train_data.loc[(train_data['Embarked'] == 'C') & (train_data['SibSp'] == 0)]["Survived"]
rate_c_port = sum(c_port_rate)/len(c_port_rate)


q_port_rate = train_data.loc[(train_data['Embarked'] == 'Q') & (train_data['SibSp'] == 0)]["Survived"]
rate_q_port = sum(q_port_rate)/len(q_port_rate)


s_port_rate = train_data.loc[(train_data['Embarked'] == 'S') & (train_data['SibSp'] == 0)]["Survived"]
rate_s_port = sum(s_port_rate)/len(s_port_rate)




print(rate_c_port, rate_q_port, rate_s_port)

0.47706422018348627 0.3898305084745763 0.3036529680365297




In [84]:
women = train_data.loc[train_data.Sex == 'female']["Survived"]
rate_women = sum(women)/len(women)

men = train_data.loc[train_data.Sex == 'male']["Survived"]
rate_men = sum(men)/len(men)

print("% of men who survived:", rate_men)
print("% of women who survived:", rate_women)

% of men who survived: 0.18890814558058924
% of women who survived: 0.7420382165605095


Given the above rates for men and women, it is clear that gender is strongly correlated with survival. Let's dive a little deeper into this and see if age has any significant correlation.

In [85]:
child = train_data.loc[train_data.Age < 18]["Survived"]
rate_child = sum(child)/len(child)

print("% of children who survived:", rate_child)

% of children who survived: 0.5398230088495575


With a survival rate of ~54% for children (both male and female) under the age of 18, this does not tell us much on its own. However, I have a feeling that this rate could be skewed by males who were closer to that age threshold (i.e. teenage boys). I could see a world where boys who were old enough to not be considered "children" may not have had priority spots on the lifeboats. Let's dig into this a bit further, and see if we can uncover any trends when considering age and gender:

In [89]:
#Combine the age and sex conditions into one mask instead of chaining two boolean filters
child_male_sub_5 = train_data.loc[(train_data.Age < 5) & (train_data.Sex == 'male')]["Survived"]
rate_child_male_sub_5 = sum(child_male_sub_5)/len(child_male_sub_5)

child_male_sub_12 = train_data.loc[(train_data.Age < 12) & (train_data.Sex == 'male')]["Survived"]
rate_child_male_sub_12 = sum(child_male_sub_12)/len(child_male_sub_12)

child_male_sub_18 = train_data.loc[(train_data.Age < 18) & (train_data.Sex == 'male')]["Survived"]
rate_child_male_sub_18 = sum(child_male_sub_18)/len(child_male_sub_18)

print("% of boys who survived (under 5 years old):", rate_child_male_sub_5)
print("% of boys who survived (under 12 years old):", rate_child_male_sub_12)
print("% of boys who survived (under 18 years old):", rate_child_male_sub_18)

% of boys who survived (under 5 years old): 0.6521739130434783
% of boys who survived (under 12 years old): 0.5555555555555556
% of boys who survived (under 18 years old): 0.39655172413793105




Given these survival rates, it is evident that younger boys that could be considered babies/infants had a much higher chance of survival than older boys. Let's take a look at these same rates for girls:

In [95]:
#Combine the age and sex conditions into one mask instead of chaining two boolean filters
child_female_sub_5 = train_data.loc[(train_data.Age < 5) & (train_data.Sex == 'female')]["Survived"]
rate_child_female_sub_5 = sum(child_female_sub_5)/len(child_female_sub_5)

child_female_sub_12 = train_data.loc[(train_data.Age < 12) & (train_data.Sex == 'female')]["Survived"]
rate_child_female_sub_12 = sum(child_female_sub_12)/len(child_female_sub_12)

child_female_sub_18 = train_data.loc[(train_data.Age < 18) & (train_data.Sex == 'female')]["Survived"]
rate_child_female_sub_18 = sum(child_female_sub_18)/len(child_female_sub_18)

print("% of girls who survived (under 5 years old):", rate_child_female_sub_5)
print("% of girls who survived (under 12 years old):", rate_child_female_sub_12)
print("% of girls who survived (under 18 years old):", rate_child_female_sub_18)

% of girls who survived (under 5 years old): 0.7058823529411765
% of girls who survived (under 12 years old): 0.59375
% of girls who survived (under 18 years old): 0.6909090909090909




Although there is a strange dip in the survival rate once girls aged 5-11 are included, it is notable that the survival rate of female babies/infants is very similar to the overall rate of girls under the age of 18. This goes to show that age differences among female children did not have the same impact on survival rate that they had among male children.

Overall, it is clear that the age of the passenger has a significant correlation with the survival rate, especially when looking at gender and age together. We will want to keep that in mind when training our model.
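The thresholds used above are cumulative (under 5, under 12, under 18), so each group contains the previous one. As a hedged alternative, the sketch below splits ages into non-overlapping bands and computes a rate for every sex and band combination in one pass. The `toy` data and the band labels are hypothetical, and rows with a missing age would be left out of every band.

```python
import pandas as pd

# Hypothetical children; the values are illustrative only
toy = pd.DataFrame({
    "Sex":      ["male", "male", "male", "female", "female", "female"],
    "Age":      [3.0, 10.0, 16.0, 4.0, 9.0, 17.0],
    "Survived": [1, 1, 0, 1, 0, 1],
})

# Half-open bands: [0, 5), [5, 12), [12, 18)
age_band = pd.cut(
    toy["Age"],
    bins=[0, 5, 12, 18],
    labels=["under 5", "5 to 11", "12 to 17"],
    right=False,
)

# Survival rate for each (sex, age band) pair
rates = toy.groupby(["Sex", age_band], observed=True)["Survived"].mean()
print(rates)
```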

In [11]:
train_data['Fare'] = train_data['Fare'].astype('float64')

print(train_data.isna().sum())
print((train_data == float('inf')).sum())
print((train_data == float('-inf')).sum())
max_fare = train_data['Fare'].max()
print(max_fare)

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64
512.3292


In [10]:
y = train_data["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch", "Embarked"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('results-random-forest.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!


After submitting this results CSV file to Kaggle, we discovered that our Random Forest Classifier model successfully predicted the survival of 77.5% of our test data. While this is reasonably accurate, ideally we'd like to increase that accuracy rate by a bit. Let's try creating a sequential model using Keras and see how the results compare.
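Kaggle scores this competition by accuracy, i.e. the fraction of passengers whose outcome is predicted correctly. The same metric can be estimated locally by holding back some labelled rows from train.csv and comparing predictions against them. The sketch below shows only the metric itself, on hypothetical label and prediction arrays:

```python
import numpy as np

# Hypothetical true labels and model predictions for eight passengers
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Accuracy is the share of positions where prediction and label agree
accuracy = (y_true == y_pred).mean()
print(f"accuracy: {accuracy:.3f}")
```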

In [162]:
keras_train_data = pd.read_csv("train.csv")
keras_test_data = pd.read_csv("test.csv")

# drop the fields we don't care about from the data
drop_fields = ['Ticket', 'Cabin', 'Name']
def clean_data(data, droppable):
    # Drop the columns in place, but only if every one of them is present
    if pd.Series(droppable).isin(data.columns).all():
        data.drop(columns=droppable, inplace=True)
    return data

clean_data(keras_train_data, drop_fields)
clean_data(keras_test_data, drop_fields)

keras_train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,0,3,male,22.0,1,0,7.25,S
1,2,1,1,female,38.0,1,0,71.2833,C
2,3,1,3,female,26.0,0,0,7.925,S
3,4,1,1,female,35.0,1,0,53.1,S
4,5,0,3,male,35.0,0,0,8.05,S


Now that we've cleaned our data, let's convert the categorical variables to numeric values. Our categorical variables are Sex and Embarked (both nominal). We'll use the get_dummies() function for this.
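To make the encoding concrete, here is a small sketch on a hypothetical three-row frame. With `drop_first=True`, the first category of each variable (in sorted order) is dropped and serves as the baseline, so `Sex` becomes a single `Sex_male` column and `Embarked` becomes `Embarked_Q` and `Embarked_S`.

```python
import pandas as pd

# Hypothetical rows covering every category of Sex and Embarked
toy = pd.DataFrame({
    "Pclass":   [3, 1, 3],
    "Sex":      ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# One indicator column per category, minus the first category of each variable
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True)
print(list(encoded.columns))
```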

In [163]:
def convert_features(data, features):
    if pd.Series(features).isin(data.columns).all():
        data = pd.get_dummies(data, columns=features, drop_first=True)
    return data

nominal_features = ['Sex', 'Embarked']
keras_train_data = convert_features(keras_train_data, nominal_features)
keras_test_data = convert_features(keras_test_data, nominal_features)

# fill null values if any exist
keras_train_data.fillna(keras_train_data.mean(), inplace=True)
keras_test_data.fillna(keras_test_data.mean(), inplace=True)

# make sure the columns are aligned
keras_test_data = keras_test_data[keras_train_data.columns.drop('Survived')]

keras_test_data = keras_test_data.astype(np.float32)
keras_train_data = keras_train_data.astype(np.float32)

print(f'keras_train_data shape: {keras_train_data.shape}')
print(f'keras_test_data shape: {keras_test_data.shape}')

keras_train_data.head()


keras_train_data shape: (891, 10)
keras_test_data shape: (418, 9)


Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_Q,Embarked_S
0,1.0,0.0,3.0,22.0,1.0,0.0,7.25,1.0,0.0,1.0
1,2.0,1.0,1.0,38.0,1.0,0.0,71.283302,0.0,0.0,0.0
2,3.0,1.0,3.0,26.0,0.0,0.0,7.925,0.0,0.0,1.0
3,4.0,1.0,1.0,35.0,1.0,0.0,53.099998,0.0,0.0,1.0
4,5.0,0.0,3.0,35.0,0.0,0.0,8.05,1.0,0.0,1.0
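Selecting the test columns by the training column names works here because both files contain the same categories; if a category were absent from one file, that selection would raise a `KeyError`. One possible safeguard, shown as a sketch on hypothetical data, is `DataFrame.reindex`, which reorders the columns and fills any missing indicator column with zeros:

```python
import pandas as pd

# Column order taken from a hypothetical encoded training set
train_cols = ["Pclass", "Sex_male", "Embarked_Q", "Embarked_S"]

# Hypothetical encoded test frame in which nobody embarked at Q
test_encoded = pd.DataFrame({
    "Embarked_S": [1, 0],
    "Pclass":     [3, 1],
    "Sex_male":   [1, 0],
})

# Match the training layout; the missing Embarked_Q column is created as all zeros
aligned = test_encoded.reindex(columns=train_cols, fill_value=0)
print(aligned)
```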


In [167]:
X_train = keras_train_data.drop('Survived', axis=1)
y_train = keras_train_data['Survived']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(keras_test_data)

X_train, X_val, y_train, y_val = train_test_split(X_scaled, y_train, test_size=0.2, random_state=222)

model = Sequential([
    Dense(256, input_dim=X_train.shape[1], activation='relu'),
    Dropout(0.5),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(32, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=5, epochs=25, validation_split=0.2)
predictions = (model.predict(X_test_scaled) > 0.5).astype(int).flatten()

# Prepare submission file
submission = pd.DataFrame({'PassengerId': test_data['PassengerId'].astype('int32'), 'Survived': predictions})
submission.to_csv('results-keras.csv', index=False)

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25
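The scaling step in the cell above fits `StandardScaler` on the training features and reuses those statistics for the test features. The sketch below reproduces that arithmetic by hand with NumPy on hypothetical numbers, to show that the test rows are shifted and scaled by the training mean and standard deviation rather than their own.

```python
import numpy as np

# Hypothetical feature matrices (rows are passengers, columns are features)
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # same shift and scale as the training data

print(train_scaled)
print(test_scaled)
```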


In [None]:
# test.csv has no "Survived" labels, so evaluate on the held-out validation split instead
loss, accuracy = model.evaluate(X_val, y_val)

print(f'Model Accuracy: {accuracy}')
print(f'Model Loss: {loss}')