# Preliminary Steps

Remember to **SAVE A COPY** of this notebook (File -> Save a copy in Drive) before making any edits!

**The Challenge**

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).

Link to the project: https://www.kaggle.com/competitions/titanic/overview

In [None]:
import pandas as pd
import numpy as np

In [None]:
! pip install kaggle

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
! mkdir -p ~/.kaggle

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!cp /content/drive/MyDrive/kaggle.json ~/.kaggle/kaggle.json

In [None]:
! chmod 600 ~/.kaggle/kaggle.json

In [None]:
! kaggle competitions download -c titanic

Downloading titanic.zip to /content
  0% 0.00/34.1k [00:00<?, ?B/s]
100% 34.1k/34.1k [00:00<00:00, 1.90MB/s]


In [None]:
! unzip titanic.zip

Archive:  titanic.zip
  inflating: gender_submission.csv   
  inflating: test.csv                
  inflating: train.csv               


In [None]:
testtbl = pd.read_csv("test.csv")
testtbl.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [None]:
traintbl = pd.read_csv("train.csv")
traintbl.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Above, I loaded the two datasets into tables: "testtbl" holds test.csv, which has no "Survived" column, and "traintbl" holds train.csv, which includes it.

In [None]:
women = traintbl.loc[traintbl.Sex == 'female']["Survived"]
rate_women = sum(women)/len(women)

print("% of women who survived:", rate_women)

% of women who survived: 0.7420382165605095


In [None]:
men = traintbl.loc[traintbl.Sex == 'male']["Survived"]
rate_men = sum(men)/len(men)

print("% of men who survived:", rate_men)

% of men who survived: 0.18890814558058924


Above, I found the share of women (about 74%) and of men (about 19%) in the training data who survived. The difference is noticeable. Note that the printed values are fractions between 0 and 1, not percentages.
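The same comparison can be made in one step with `groupby`. The sketch below uses a small made-up table (`toy`, not the Kaggle data) that shares the `Sex` and `Survived` column names; on the real data, `traintbl` would take its place.

```python
import pandas as pd

# Hypothetical toy data with the same column names as train.csv.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the fraction of 1s, i.e. the survival rate.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()

# Multiply by 100 to report a true percentage.
for sex, rate in rate_by_sex.items():
    print(f"{sex}: {rate * 100:.1f}% survived")
```

Grouping avoids writing one filter per category, which helps for columns with more than two values such as `Pclass` or `Embarked`.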

In [None]:
older = traintbl.loc[traintbl.Age > 50]["Survived"]
rate_older = sum(older)/len(older)

print("% of older people who survived:", rate_older)

% of older people who survived: 0.34375


In [None]:
younger = traintbl.loc[traintbl.Age <= 50]['Survived']
rate_younger = sum(younger)/len(younger)

print("% of younger people who survived:", rate_younger)

% of younger people who survived: 0.4123076923076923


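Next, I wanted to see whether a passenger's age made a difference in survival rate, so I compared "older" passengers (over 50) with "younger" passengers (50 or under). The difference was modest: about 34% of older passengers survived, compared with about 41% of younger passengers. Passengers with a missing age fall into neither group, because both comparisons evaluate to false for a missing value.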
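A single cut at 50 may hide differences between finer age groups. As a rough sketch, `pd.cut` can split ages into several bins at once; the bin edges and labels below are my own assumptions, and `toy` is made-up data rather than the Kaggle table.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real analysis would use traintbl instead.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 54.0, 2.0, 65.0, 50.0],
    "Survived": [0, 1, 1, 0, 1, 0, 1],
})

# Count the rows that no age comparison can place in a group.
n_missing = toy["Age"].isna().sum()

# Assumed bin edges, right-inclusive: (0, 16], (16, 50], (50, 120].
age_group = pd.cut(toy["Age"], bins=[0, 16, 50, 120],
                   labels=["child", "adult", "older"])

# observed=False keeps every age group in the result, even an empty one.
summary = toy.groupby(age_group, observed=False)["Survived"].agg(["mean", "count"])
print(summary)
print("Rows with missing Age:", n_missing)
```

The `count` column matters as much as the `mean`: a survival rate computed from only a handful of passengers is not a reliable estimate.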

In [None]:
newtbl = traintbl.copy()
newsex = pd.get_dummies(newtbl['Sex'])
newembarked = pd.get_dummies(newtbl['Embarked'])
newtbl = newtbl.drop(columns=["Sex", "Embarked", "Name", "Ticket", "Cabin"]).join(newsex).join(newembarked)

In [None]:
newtbl

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,female,male,C,Q,S
0,1,0,3,22.0,1,0,7.2500,0,1,0,0,1
1,2,1,1,38.0,1,0,71.2833,1,0,1,0,0
2,3,1,3,26.0,0,0,7.9250,1,0,0,0,1
3,4,1,1,35.0,1,0,53.1000,1,0,0,0,1
4,5,0,3,35.0,0,0,8.0500,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,27.0,0,0,13.0000,0,1,0,0,1
887,888,1,1,19.0,0,0,30.0000,1,0,0,0,1
888,889,0,3,,1,2,23.4500,1,0,0,0,1
889,890,1,1,26.0,0,0,30.0000,0,1,1,0,0


In [None]:
newtesttbl = testtbl.copy()
newtestsex = pd.get_dummies(newtesttbl['Sex'])
newtestembarked = pd.get_dummies(newtesttbl['Embarked'])
newtesttbl = newtesttbl.drop(columns=["Sex", "Embarked", "Name", "Ticket", "Cabin"]).join(newtestsex).join(newtestembarked)

In [None]:
newtesttbl

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,female,male,C,Q,S
0,892,3,34.5,0,0,7.8292,0,1,0,1,0
1,893,3,47.0,1,0,7.0000,1,0,0,0,1
2,894,2,62.0,0,0,9.6875,0,1,0,1,0
3,895,3,27.0,0,0,8.6625,0,1,0,0,1
4,896,3,22.0,1,1,12.2875,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,,0,0,8.0500,0,1,0,0,1
414,1306,1,39.0,0,0,108.9000,1,0,1,0,0
415,1307,3,38.5,0,0,7.2500,0,1,0,0,1
416,1308,3,,0,0,8.0500,0,1,0,0,1


Above, I did some data cleaning. I dropped "Name", "Ticket", and "Cabin", which I did not expect to help predict survival in their raw form. I also replaced the categorical columns "Sex" and "Embarked" with 0/1 indicator columns ("female", "male", "C", "Q", "S"), one per category, so that the model can use them.
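Encoding the training and test tables separately works here because both contain every category. If one table lacked a category, the two would end up with different columns. The sketch below shows one way to guard against that, using made-up frames (`toy_train`, `toy_test`); note that `get_dummies` with `columns=` produces prefixed names such as `Sex_female`, unlike the unprefixed names used in this notebook.

```python
import pandas as pd

# Hypothetical toy frames standing in for traintbl and testtbl.
toy_train = pd.DataFrame({"Sex": ["male", "female"], "Embarked": ["S", "C"]})
toy_test = pd.DataFrame({"Sex": ["female", "female"], "Embarked": ["Q", "S"]})

# dtype=int gives 0/1 columns regardless of the pandas version's default.
train_encoded = pd.get_dummies(toy_train, columns=["Sex", "Embarked"], dtype=int)
test_encoded = pd.get_dummies(toy_test, columns=["Sex", "Embarked"], dtype=int)

# Give the test set exactly the training columns: categories missing from
# the test set become all-zero columns, and unseen ones are dropped.
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(test_encoded.columns))
```

Dropping an unseen category, as `reindex` does here for `Embarked_Q`, discards information, so it is worth checking which columns differ before aligning.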

In [None]:
from sklearn.ensemble import RandomForestClassifier

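# Target: 1 if the passenger survived, 0 otherwise.
y = newtbl["Survived"]

# These columns are already numeric after the one-hot encoding above,
# so they can be passed to the model as they are.
features = ["Pclass", "female", "male", "SibSp", "Parch", "C", "Q", "S"]
X = newtbl[features]
X_test = newtesttbl[features]

# A fixed random_state makes the forest reproducible between runs.
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

# One row per test passenger: PassengerId and the predicted Survived value.
output = pd.DataFrame({'PassengerId': testtbl.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)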
print("Your submission was successfully saved!")

Your submission was successfully saved!


Above, I created a machine learning model called a random forest. The model consists of many decision "trees" (100 here, each at most 5 levels deep). Each tree uses the features I selected (passenger class, sex, number of siblings/spouses and parents/children aboard, and port of embarkation) to predict whether a passenger survived, and the forest combines the trees' predictions into a final answer.
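Because test.csv has no "Survived" column, accuracy cannot be measured on it locally. One common way to get an estimate before submitting is k-fold cross-validation on the training data. The sketch below runs it on synthetic data (`X_toy`, `y_toy`, and the rule that generates the labels are all made up); with the real tables, `X` and `y` would be passed instead.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training features and labels (not the Kaggle data).
rng = np.random.default_rng(0)
n = 200
X_toy = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "female": rng.integers(0, 2, size=n),
})
# Made-up rule so there is a signal to learn: the label equals `female`,
# flipped for roughly 10% of rows.
noise = rng.random(n) < 0.1
y_toy = pd.Series(np.where(noise, 1 - X_toy["female"], X_toy["female"]))

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)

# 5-fold cross-validation: train on four fifths, score on the held-out fifth.
scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", scores.mean().round(3))
```

The spread between fold accuracies gives a sense of how much the score could vary on unseen passengers.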

In [None]:
results = pd.read_csv("submission.csv")
results

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,0
...,...,...
413,1305,0
414,1306,1
415,1307,0
416,1308,0


Above are the results of my predictive model, which predicted passenger survival with 77.3% accuracy. For each of the 418 passengers in the test data, the table lists the "PassengerId" and, under the "Survived" column, 1 if the model predicts that the passenger survived and 0 otherwise.
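Before uploading, it can help to check that the file has the expected shape. The helper below, `check_submission`, is a hypothetical function of my own and not part of Kaggle's tooling; it assumes the two-column layout and the 418 rows described above.

```python
import pandas as pd

def check_submission(df, expected_rows=418):
    """Return a list of problems found in a submission table (empty if none)."""
    problems = []
    if list(df.columns) != ["PassengerId", "Survived"]:
        problems.append(f"unexpected columns: {list(df.columns)}")
        return problems
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    if df["PassengerId"].duplicated().any():
        problems.append("duplicate PassengerId values")
    if not df["Survived"].isin([0, 1]).all():
        problems.append("Survived must contain only 0 or 1")
    return problems

# Hypothetical three-row table, checked against expected_rows=3.
toy = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
print(check_submission(toy, expected_rows=3))
```

On the real file, the call would be `check_submission(pd.read_csv("submission.csv"))`, where an empty list means no problems were found.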