In [None]:
%%capture
!pip install pandas scikit-learn numpy matplotlib torch
!kaggle competitions download titanic
!unzip -o titanic.zip

In [None]:
import pandas as pd

# Model I: Birkenhead Model

To get things going, we can build an easy first-pass model based on the knowledge that women and children were prioritized for evacuation under the [Birkenhead drill](https://en.wikipedia.org/wiki/Women_and_children_first). Simply put, if a passenger is female or under the age of 13, the model predicts that they survive.

In [None]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
train.sample(5)

In [None]:
child_age = 12.0

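# Keep only the sex, age, and survival columns
reduced = train[["Sex", "Age", "Survived"]].copy()
# Encode both features as 0/1 flags: female, and aged child_age or younger.
# Missing ages compare as False, so they are treated as "not a child".
reduced.Sex = (reduced.Sex == "female").astype(int)
reduced.Age = (reduced.Age <= child_age).astype(int)
# Predict survival if either flag is set
reduced["predicted"] = reduced.Sex | reduced.Age
train_acc = (reduced.predicted == reduced.Survived).mean()
# Baseline for comparison: predict survival from sex alone
sex_only = (reduced.Sex == reduced.Survived).mean()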
print(f"Birkenhead drill train accuracy: {train_acc * 100:.2f}%")
print(f"Sex only train accuracy: {sex_only * 100:.2f}%")

# Write out the submission
bh_test = test[["PassengerId", "Sex", "Age"]].copy()
bh_test.Sex = (bh_test.Sex == "female").astype(int)
bh_test.Age = (bh_test.Age <= child_age).astype(int)
bh_test["Survived"] = bh_test.Sex | bh_test.Age
bh_test.to_csv("birkenhead.csv", columns=["PassengerId", "Survived"], index=False)

In [None]:
#!kaggle competitions submit titanic -f birkenhead.csv -m "Naive Model"

Not bad! For such a simple model, we are at nearly 80% accuracy on the training set. Adjusting for children didn't make much of a difference, but it did give us an extra half a percent or so.
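As a side note, the rule used above can be packaged as a small helper so that the train and test frames go through the same logic. This is only a sketch: the function name `birkenhead_predict` and the toy rows are made up for illustration and are not part of the original notebook. It also makes explicit how missing ages behave under this rule.

```python
import numpy as np
import pandas as pd


def birkenhead_predict(df: pd.DataFrame, child_age: float = 12.0) -> pd.Series:
    """Return 1 for females and for passengers aged child_age or younger, else 0."""
    is_female = df["Sex"] == "female"
    # Comparisons against NaN evaluate to False, so a passenger with a
    # missing age is only predicted to survive if they are female.
    is_child = df["Age"] <= child_age
    return (is_female | is_child).astype(int)


# Toy rows (invented for illustration, not taken from train.csv)
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "male"],
    "Age": [30.0, 40.0, 8.0, np.nan],
})
toy_pred = birkenhead_predict(toy)
print(toy_pred.tolist())
```

The last toy row shows the one subtle case: a male passenger with an unknown age is predicted not to survive.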

# Data Preparation

Before trying out any more sophisticated models, let's clean up our data to get the most out of what we have. Several variables may have missing entries, and there are string fields that we could convert into categorical columns, as we did for sex in our naive model.
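To make the two cleanup tasks concrete, here is a minimal sketch on a made-up toy frame: counting missing entries per column, filling them, and turning string fields into numeric columns. The fill strategies (median for `Age`, most frequent value for `Embarked`) are assumptions chosen for illustration, not necessarily the choices made later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (invented values) mimicking a few Titanic columns
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Embarked": ["S", "C", None, "Q"],
})

# Count the missing entries in each column
missing = toy.isna().sum()
print(missing)

prepared = toy.copy()
# Assumed strategy: fill numeric gaps with the median,
# and categorical gaps with the most frequent value
prepared["Age"] = prepared["Age"].fillna(prepared["Age"].median())
prepared["Embarked"] = prepared["Embarked"].fillna(prepared["Embarked"].mode()[0])

# Binary flag for sex (as in the naive model), one-hot columns for the port
prepared["Sex"] = (prepared["Sex"] == "female").astype(int)
prepared = pd.get_dummies(prepared, columns=["Embarked"], dtype=int)
print(prepared)
```

Running `isna().sum()` on the real `train` and `test` frames is a quick way to see which columns actually need this treatment.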