This notebook summarizes the results of using logistic regression to predict whether a passenger on board the Titanic survived, given their socioeconomic class, age, sex, etc. The analysis shows that a passenger's gender, age, and socioeconomic class are the three most important factors (in descending order). The decision boundary between survived and not survived is potentially highly nonlinear due to the nature of the data. A simple linear function (with the inputs as is) gives slightly less than 80% accuracy on a validation set that consists of 40% of the total training data. A 2nd-order polynomial function improves the accuracy to around 83%. In the model, missing ages were filled by mapping the title (extracted from the name) to the median age of the title group.
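To illustrate the linear versus 2nd-order comparison, here is a minimal sketch on synthetic data. It is not the notebook's actual model: it assumes a plain regularized logistic regression fitted with `scipy.optimize.minimize`, and the helper names (`poly2`, `fit_logistic`, `accuracy`) and the toy circular boundary are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit


def poly2(X):
    """Append all 2nd-order terms (squares and pairwise products) to X."""
    n = X.shape[1]
    cross = [X[:, i] * X[:, j] for i in range(n) for j in range(i, n)]
    return np.column_stack([X] + cross)


def fit_logistic(X, y, reg=1e-3):
    """Fit weights by minimizing the L2-regularized logistic loss."""
    Xb = np.column_stack([np.ones(len(X)), X])  # prepend a bias column

    def loss(w):
        z = Xb @ w
        # log(1 + exp(z)) - y*z is the per-sample negative log-likelihood
        return np.mean(np.logaddexp(0.0, z) - y * z) + reg * np.sum(w[1:] ** 2)

    def grad(w):
        g = Xb.T @ (expit(Xb @ w) - y) / len(y)
        g[1:] += 2 * reg * w[1:]  # the bias term is not regularized
        return g

    return minimize(loss, np.zeros(Xb.shape[1]), jac=grad, method="BFGS").x


def accuracy(w, X, y):
    """Fraction of samples whose predicted class matches the label."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return np.mean((Xb @ w > 0) == (y == 1))


# Synthetic data with a circular (nonlinear) class boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0).astype(float)

# 60% training / 40% validation split, mirroring the split described above.
split = int(0.6 * len(X))
X_tr, y_tr, X_va, y_va = X[:split], y[:split], X[split:], y[split:]

acc_linear = accuracy(fit_logistic(X_tr, y_tr), X_va, y_va)
acc_poly = accuracy(fit_logistic(poly2(X_tr), y_tr), poly2(X_va), y_va)
print(acc_linear, acc_poly)
```

On this toy problem the gap between the two models is much larger than on the Titanic data, because the boundary is purely quadratic by construction.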

In [1]:
import scipy.optimize as scop
import titanic_project as titan
import numpy as np
import matplotlib.pyplot as plt

features = ['Survived', 'Pclass', 'Age', 'Sex', 'SibSp', 'Parch', 'Fare','Embarked','Name']
datain = titan.readdata('train.csv', 0.6, 0.6, features)
datain['train'].describe()

,Survived,Pclass,Age,Sex,SibSp,Parch,Fare,Embarked
count,535.0,535.0,424.0,535.0,535.0,535.0,535.0,535.0
mean,0.386916,2.325234,29.266509,-0.256075,0.549533,0.370093,31.716035,1.366355
std,0.4875,0.829477,14.434267,0.967562,1.125494,0.767702,47.259528,0.676234
min,0.0,1.0,0.75,-1.0,0.0,0.0,0.0,-3.0
25%,0.0,2.0,20.875,-1.0,0.0,0.0,7.925,1.0
50%,0.0,3.0,28.0,-1.0,0.0,0.0,14.4542,1.0
75%,1.0,3.0,37.25,1.0,1.0,0.0,29.8854,2.0
max,1.0,3.0,71.0,1.0,8.0,5.0,512.3292,3.0


In [2]:
datain['train'].dropna(how='any').corr()

,Survived,Pclass,Age,Sex,SibSp,Parch,Fare,Embarked
Survived,1.0,-0.28566,-0.094232,0.559736,-0.029968,0.101782,0.210668,0.074648
Pclass,-0.28566,1.0,-0.359838,-0.140481,0.085795,0.008385,-0.587926,-0.094167
Age,-0.094232,-0.359838,1.0,-0.091064,-0.353861,-0.238634,0.083645,0.018236
Sex,0.559736,-0.140481,-0.091064,1.0,0.069462,0.167201,0.183197,0.061273
SibSp,-0.029968,0.085795,-0.353861,0.069462,1.0,0.414155,0.167069,0.021752
Parch,0.101782,0.008385,-0.238634,0.167201,0.414155,1.0,0.252656,-0.072881
Fare,0.210668,-0.587926,0.083645,0.183197,0.167069,0.252656,1.0,0.142913
Embarked,0.074648,-0.094167,0.018236,0.061273,0.021752,-0.072881,0.142913,1.0


The correlation table shows that the most important features are gender (Sex) and socioeconomic class (Pclass and, to some extent, Fare; these two features are strongly correlated with each other, r ≈ -0.59). Where the passenger embarked and the sibling/spouse count (SibSp) do not matter much. Age is not insignificant; however, it is missing in a substantial share of the training records, as shown here: the ratio of missing to recorded ages is about 0.26 (111 missing versus 424 recorded, i.e. about 21% of the 535 training records). All other features contain no NaN values in the training set.

In [24]:
datain['train'].isnull().sum() * 1.0 / datain['train'].count()

Survived    0.000000
Pclass      0.000000
Age         0.261792
Sex         0.000000
SibSp       0.000000
Parch       0.000000
Fare        0.000000
Embarked    0.000000
Name        0.000000
dtype: float64
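
Note that `count()` excludes NaN entries, so the expression above yields the ratio of missing to non-missing values rather than the share of all rows. A minimal sketch on a hypothetical toy frame shows both quantities; `isnull().mean()` gives the fraction of all records that are missing.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): 2 of the 5 ages are missing.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, np.nan, 41.0],
    "Fare": [7.25, 71.28, 8.05, 53.10, 8.46],
})

# Ratio of missing to non-missing values, as computed in the cell above.
missing_to_present = df.isnull().sum() / df.count()

# Fraction of all rows that are missing.
missing_fraction = df.isnull().mean()

print(missing_to_present["Age"])  # 2 missing / 3 present
print(missing_fraction["Age"])    # 2 missing / 5 rows
```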

Given the potential importance of Age, I attempt to relate it to other parameters to improve the final model. The correlation table shows that Age correlates moderately with Pclass, SibSp, and Parch (|r| between about 0.24 and 0.36). After examining the names, I find that the title in a name, e.g., Mr, could also be used to predict age, because each title group seems to have a distinct median age.

In [6]:
# Map titles to integer codes (Mme shares a code with Mrs); any other title becomes 4.
tmap = {' Mr':0, ' Mrs':1, ' Miss':2, ' Mme':1, ' Master':3}
title = datain['train'].Name.apply(titan.name2title).map(tmap).fillna(4).astype(int)
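
The implementation of `titan.name2title` is not shown here. As a rough sketch, assuming names follow the "Surname, Title. Given names" pattern and that the title keeps its leading space (as the keys of `tmap` suggest), a hypothetical stand-in `extract_title` could look like this:

```python
import pandas as pd


def extract_title(name):
    """Return the text between the first comma and the following period.

    Hypothetical stand-in for titan.name2title; assumes every name
    contains a comma followed by a title ending in a period.
    """
    after_comma = name.split(",", 1)[1]
    return after_comma.split(".", 1)[0]  # keeps the leading space, e.g. ' Mr'


# Sample names in the assumed format.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Uruchurtu, Don. Manuel E",
])

tmap = {' Mr': 0, ' Mrs': 1, ' Miss': 2, ' Mme': 1, ' Master': 3}

# Titles missing from tmap (here ' Don') map to NaN and fall into group 4.
codes = names.apply(extract_title).map(tmap).fillna(4).astype(int)
print(codes.tolist())  # [0, 1, 2, 3, 4]
```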

In [22]:
ind = datain['train'].Age.notnull()
np.corrcoef(datain['train'].Age[ind], title[ind])

array([[ 1.        , -0.31492767],
       [-0.31492767,  1.        ]])

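The correlation coefficient matrix shows that Age and the extracted title code are moderately correlated (r ≈ -0.31). Because the title codes are arbitrary integer labels, this coefficient is only a rough indicator of the relationship. The first model I build estimates Age from Pclass, SibSp, Parch, and title.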
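
The summary at the top mentions filling missing ages with the median age of the passenger's title group. A minimal sketch of that imputation step follows. The toy data and the column name `title` are hypothetical, and this is not the Age model on Pclass, SibSp, Parch, and title.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): integer title codes and ages with some gaps.
df = pd.DataFrame({
    "title": [0, 0, 0, 2, 2, 3, 3, 0, 2, 3],
    "Age": [30.0, 40.0, np.nan, 20.0, 24.0, 4.0, 6.0, 35.0, np.nan, np.nan],
})

# Median of the known ages within each title group (NaN values are skipped).
median_by_title = df.groupby("title")["Age"].median()

# Replace each missing age with the median of that passenger's title group.
filled = df["Age"].fillna(df["title"].map(median_by_title))

print(median_by_title.to_dict())
print(filled.tolist())
```

Using a per-group median rather than a single global median keeps, for example, children with the title Master from being assigned an adult age.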