<h1>Predicting the Survival of Passengers on the RMS <i>Titanic</i> Using Demographic Data</h1>
<br>
<b>Benjamin Hoover</b>
<br>
<img src="titanic.jpg" width="600" height="450" align="left"></img>

<br clear="all">

This lab compares three machine learning models (Decision Tree, Random Forest, and K-Nearest Neighbors) on data about Titanic passengers. It then uses the most accurate one to determine which of sex, age (child or adult), and class most influenced a passenger's survival, and which combinations most likely resulted in survival or death.

In [174]:
# Import a CSV file and convert it to a NumPy array
import numpy as np
import statistics

# Load the CSV file as a NumPy matrix
# titanic.csv columns: Class, Age, Sex, Survived
# titanic2.csv (reordered, loaded in Part 3): Class, Sex, Age, Survived
dataset = np.loadtxt("titanic.csv", delimiter=",")

# Separate the features (first three columns) from the target (Survived)
X = dataset[:, 0:3]
y = dataset[:, 3]

# Number of random train/test splits to average over
tests = 1000

<h3>Part 1: Which machine learning algorithm (DecisionTree, KNeighbors, or RandomForest) is most accurate for the dataset? What are their respective accuracies?</h3>

This part measures each algorithm's accuracy by averaging over 1000 tests, each using a random test set made up of 25% of the total data, and then picks the most accurate algorithm.

In [175]:
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

avg = []
for _ in range(tests):
    # Hold out a random 25% of the data for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
    my_classifier = tree.DecisionTreeClassifier()
    my_classifier.fit(X_train, y_train)
    predictions = my_classifier.predict(X_test)
    avg.append(accuracy_score(y_test, predictions))
print("DecisionTree has an accuracy of", str(statistics.mean(avg) * 100) + "%")

DecisionTree has an accuracy of 78.7869328494%


In [176]:
from sklearn import neighbors

avg = []
for _ in range(tests):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
    my_classifier = neighbors.KNeighborsClassifier()
    my_classifier.fit(X_train, y_train)
    predictions = my_classifier.predict(X_test)
    avg.append(accuracy_score(y_test, predictions))
print("KNeighbors has an accuracy of", str(statistics.mean(avg) * 100) + "%")

KNeighbors has an accuracy of 75.8215970962%


In [177]:
from sklearn import ensemble

avg = []
for _ in range(tests):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
    my_classifier = ensemble.RandomForestClassifier()
    my_classifier.fit(X_train, y_train)
    predictions = my_classifier.predict(X_test)
    avg.append(accuracy_score(y_test, predictions))
print("RandomForest has an accuracy of", str(statistics.mean(avg) * 100) + "%")

RandomForest has an accuracy of 78.7362976407%


Based on the average of 1000 tests, the Decision Tree classifier scored highest. Its lead over Random Forest (78.79% versus 78.74%) is only about 0.05 percentage points, while both are roughly 3 percentage points more accurate than K-Nearest Neighbors (75.82%).

Because each run draws new random splits, Random Forest will sometimes come out ahead of Decision Tree.
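One rough way to judge whether a gap between two classifiers exceeds run-to-run variation is to compare it with the standard error of each mean. The sketch below is an illustration, not part of the original lab: `summarize` and `differ_clearly` are hypothetical helpers, the threshold of two combined standard errors is an assumed rule of thumb, and the accuracy lists are made up. Because the random splits reuse the same passengers, the runs are not fully independent, so the result is only a rough guide.

```python
import math
import statistics


def summarize(accuracies):
    """Return (mean, standard error of the mean) for a list of accuracy scores."""
    mean = statistics.mean(accuracies)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, sem


def differ_clearly(acc_a, acc_b, z=2.0):
    """Rough check: is the gap between the means larger than z combined standard errors?"""
    mean_a, sem_a = summarize(acc_a)
    mean_b, sem_b = summarize(acc_b)
    combined = math.sqrt(sem_a ** 2 + sem_b ** 2)
    return abs(mean_a - mean_b) > z * combined


# Made-up accuracy lists, for illustration only (not the lab's results).
tree_acc = [0.79, 0.78, 0.80, 0.77, 0.79]
forest_acc = [0.78, 0.79, 0.80, 0.78, 0.78]
knn_acc = [0.70, 0.71, 0.69, 0.70, 0.72]

print(differ_clearly(tree_acc, forest_acc))
print(differ_clearly(tree_acc, knn_acc))
```

In the lab itself, the `avg` list built in each cell could be passed to `summarize` before it is overwritten by the next cell.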

<h3>Part 2: Using the most accurate algorithm, who would have survived?</h3>

This part uses the best algorithm (Decision Tree) to determine whether the following imaginary people would have survived:

a. An adult male in the 3rd class

b. An adult female in 1st class

c. A female child in 1st class

In [178]:
# Train on the full dataset; feature order is [class, age, sex]
my_classifier = tree.DecisionTreeClassifier()
my_classifier.fit(X, y)
print(my_classifier.predict([[3, 1, 1]]))
print("An adult male in the 3rd class would have most likely died.")

[ 0.]
An adult male in the 3rd class would have most likely died.


In [179]:
print(my_classifier.predict([[1, 1, 0]]))
print("An adult female in 1st class would have most likely survived.")

[ 1.]
An adult female in 1st class would have most likely survived.


In [180]:
print(my_classifier.predict([[1, 0, 0]]))
print("A female child in 1st class would have most likely survived.")

[ 1.]
A female child in 1st class would have most likely survived.


As shown, the adult male in 3rd class would have most likely died, while the 1st class female adult and child would have most likely survived. These results seem unsurprising, considering that "women and children" were supposedly prioritized for spots in lifeboats.
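The three `predict` calls above pass bare numbers, which makes the encoding easy to misread. A small helper can make the intent explicit. This is only a sketch: `encode_passenger` is a hypothetical name, and the encoding (age 1 for adult and 0 for child, sex 1 for male and 0 for female) is an assumption inferred from the calls and their printed descriptions, not something stated in the data file.

```python
def encode_passenger(passenger_class, is_adult, is_male):
    """Build a [class, age, sex] feature row in the column order of titanic.csv.

    Assumed encoding, inferred from the predict calls in this part:
    age is 1 for an adult and 0 for a child; sex is 1 for male and 0 for female.
    """
    return [passenger_class, int(is_adult), int(is_male)]


print(encode_passenger(3, is_adult=True, is_male=True))     # adult male, 3rd class
print(encode_passenger(1, is_adult=True, is_male=False))    # adult female, 1st class
print(encode_passenger(1, is_adult=False, is_male=False))   # female child, 1st class
```

A row built this way could then be passed to the trained model as `my_classifier.predict([encode_passenger(3, True, True)])`.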

<h3>Part 3: What is the most important determining feature in the dataset?</h3>

This part of the lab determines the most influential feature in the dataset. It does this by measuring the accuracy of the algorithm's predictions when it is given only 1 or 2 of the features. The feature that yields the highest accuracy on its own is taken to be the most influential.

In [181]:
#control test
resultdict = {}
X = dataset[:,0:3]
y = dataset[:,3]
avg = []
for _ in range(tests):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
    my_classifier = tree.DecisionTreeClassifier()
    my_classifier.fit(X_train, y_train)
    predictions = my_classifier.predict(X_test)
    avg.append(accuracy_score(y_test, predictions))
avgFinal = statistics.mean(avg)
resultdict["control"] = avgFinal

In [182]:
#class and age
X = dataset[:,0:2]
y = dataset[:,3]
avg = []
for _ in range(tests):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
    my_classifier = tree.DecisionTreeClassifier()
    my_classifier.fit(X_train, y_train)
    predictions = my_classifier.predict(X_test)
    avg.append(accuracy_score(y_test, predictions))
avgFinal = statistics.mean(avg)
resultdict["class and age"] = avgFinal

In [183]:
#age and sex
X = dataset[:,1:3]
y = dataset[:,3]
avg = []
for _ in range(tests):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
    my_classifier = tree.DecisionTreeClassifier()
    my_classifier.fit(X_train, y_train)
    predictions = my_classifier.predict(X_test)
    avg.append(accuracy_score(y_test, predictions))
avgFinal = statistics.mean(avg)
resultdict["age and sex"] = avgFinal

In [184]:
#reordered csv: Class, Sex, Age, Survived
# Load the reordered CSV file as a NumPy matrix
dataset2 = np.loadtxt("titanic2.csv", delimiter=",")

In [185]:
#class and sex
X = dataset2[:,0:2]
y = dataset2[:,3]
avg = []
for _ in range(tests):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
    my_classifier = tree.DecisionTreeClassifier()
    my_classifier.fit(X_train, y_train)
    predictions = my_classifier.predict(X_test)
    avg.append(accuracy_score(y_test, predictions))
avgFinal = statistics.mean(avg)
resultdict["class and sex"] = avgFinal

In [186]:
#Class
#TITANIC.csv = Class, Age, Sex, Survived
X = dataset[:,0:1]
y = dataset[:,3]
avg = []
for _ in range(tests):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
    my_classifier = tree.DecisionTreeClassifier()
    my_classifier.fit(X_train, y_train)
    predictions = my_classifier.predict(X_test)
    avg.append(accuracy_score(y_test, predictions))
avgFinal = statistics.mean(avg)
resultdict["class"] = avgFinal

In [187]:
#age
X = dataset[:,1:2]
y = dataset[:,3]
avg = []
for _ in range(tests):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
    my_classifier = tree.DecisionTreeClassifier()
    my_classifier.fit(X_train, y_train)
    predictions = my_classifier.predict(X_test)
    avg.append(accuracy_score(y_test, predictions))
avgFinal = statistics.mean(avg)
resultdict["age"] = avgFinal

In [188]:
#sex
X = dataset[:,2:3]
y = dataset[:,3]
avg = []
for _ in range(tests):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
    my_classifier = tree.DecisionTreeClassifier()
    my_classifier.fit(X_train, y_train)
    predictions = my_classifier.predict(X_test)
    avg.append(accuracy_score(y_test, predictions))
avgFinal = statistics.mean(avg)
resultdict["sex"] = avgFinal

In [189]:
for resultkey in resultdict.keys():
    resultdict[resultkey] = str(resultdict[resultkey] * 100) + "%"
print("Accuracy based on the following demographics:", resultdict)

Accuracy based on the following demographics: {'sex': '77.6773139746%', 'age and sex': '77.3557168784%', 'class and age': '72.4578947368%', 'class and sex': '78.2878402904%', 'control': '78.8711433757%', 'age': '67.6588021779%', 'class': '71.3626134301%'}


As the accuracies show, sex was the best single indicator of a passenger's survival, with an accuracy of 77.7%, only slightly below the 78.9% achieved using all 3 demographics. Class was the 2nd most important feature, with an accuracy of 71.4%, followed by age at 67.7%. The most effective pair of features was class and sex, with an accuracy of 78.3%, which is unsurprising given that age was a much weaker indicator: adding age to sex (77.4%) did not improve on sex alone.
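The per-subset cells above repeat the same loop for each feature combination. As a sketch of a more compact alternative, the loop can be wrapped in a helper and driven by `itertools.combinations`. The names `mean_accuracy`, `accuracy_by_subset`, and `FEATURES` are hypothetical, the default of 20 tests is chosen only to keep the example quick, and the data below is random stand-in data rather than the Titanic file, so its accuracies carry no meaning.

```python
from itertools import combinations
import statistics

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

FEATURES = ["class", "age", "sex"]  # assumed column order of titanic.csv


def mean_accuracy(X, y, tests=20):
    """Average decision-tree accuracy over `tests` random 75/25 splits."""
    scores = []
    for _ in range(tests):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
        model = DecisionTreeClassifier().fit(X_train, y_train)
        scores.append(accuracy_score(y_test, model.predict(X_test)))
    return statistics.mean(scores)


def accuracy_by_subset(dataset, tests=20):
    """Map each non-empty feature subset to its mean accuracy."""
    results = {}
    y = dataset[:, 3]
    for size in (1, 2, 3):
        for cols in combinations(range(3), size):
            label = " and ".join(FEATURES[c] for c in cols)
            # A list index keeps X two-dimensional and allows non-adjacent columns.
            results[label] = mean_accuracy(dataset[:, list(cols)], y, tests)
    return results


# Synthetic stand-in data (random, NOT the Titanic file), only to make the sketch runnable.
rng = np.random.default_rng(0)
demo = np.column_stack([
    rng.integers(1, 4, 200),  # class
    rng.integers(0, 2, 200),  # age
    rng.integers(0, 2, 200),  # sex
    rng.integers(0, 2, 200),  # survived
]).astype(float)

for label, acc in accuracy_by_subset(demo).items():
    print(label, round(acc, 3))
```

Because NumPy accepts a list of column indices, an expression such as `dataset[:, [0, 2]]` selects class and sex directly, so a reordered copy of the CSV may not be needed for that combination.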

<h3>Conclusion</h3>

In this lab, I found that the DecisionTree classifier was generally the most accurate classifier for this data set, though only marginally ahead of RandomForest. I also found that sex was the most influential feature in the dataset, followed by class, then age. An adult male in 3rd class would most likely have died, while 1st class female adults and children would most likely have survived. These results make sense given the "women and children" prioritization in the evacuation, as well as the fact that sex was the most important determining feature. I was, however, surprised that age was the least important determining feature of the set, considering children were supposedly prioritized. This could be because age only differentiated between children and adults, not among adults, and therefore was not a useful differentiator for the majority of passengers.

I am still wondering how age affected survival: were older people a lower priority? The dataset only differentiates between adults and children, so these more in-depth questions cannot be answered. I am also curious whether gender mattered for children, or only for adults. Additionally, it would be interesting to see the survival rates for each group of people. I suspect that 1st class female children would have the highest survival rate, but it would be good to have data to verify this. While this additional information does not require machine learning, it would help give a better understanding of who survived on the Titanic.
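The per-group survival rates mentioned above could be computed with a simple tally. The following is a minimal sketch, assuming each row has the form (class, age, sex, survived) as in titanic.csv. `survival_rates` is a hypothetical helper, and the sample rows are made up for illustration rather than taken from the dataset.

```python
from collections import defaultdict


def survival_rates(rows):
    """Map each (class, age, sex) group to its fraction of survivors.

    Each row is assumed to be (class, age, sex, survived), matching titanic.csv.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [survivors, total]
    for passenger_class, age, sex, survived in rows:
        group = (int(passenger_class), int(age), int(sex))
        counts[group][0] += int(survived)
        counts[group][1] += 1
    return {group: s / n for group, (s, n) in counts.items()}


# Made-up rows for illustration only, not taken from the dataset.
sample = [
    (1, 0, 0, 1),
    (1, 1, 0, 1),
    (1, 1, 0, 0),
    (3, 1, 1, 0),
    (3, 1, 1, 0),
    (3, 1, 1, 1),
]

for group, rate in sorted(survival_rates(sample).items()):
    print(group, round(rate, 2))
```

Since iterating over a two-dimensional NumPy array yields its rows, calling `survival_rates(dataset)` on the array loaded at the start of the lab should work the same way.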