## Kaggle Challenge Notebook - Titanic

This notebook is a scratch pad for testing code and visualizing data for the Titanic data science challenge. It does not need to be submitted; it exists only as an aid and to document decisions.

#### Challenge Definition

The Titanic challenge is recommended as a simple introduction to the Kaggle platform. The goal is to analyze and engineer the data, then develop a model that predicts which passengers were likely to survive.

The submission consists of a single CSV file with the PassengerId and the prediction for Survived.
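As a rough sketch of what such a file looks like, the snippet below builds a two-column frame and serializes it with pandas. The predictions are placeholder values, and the `submission.csv` file name in the final comment is an assumption, not something taken from this project.

```python
import io

import pandas as pd

# Placeholder IDs and predictions, for illustration only.
passenger_ids = [892, 893, 894]
predictions = [0, 1, 0]

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})

# index=False keeps the pandas row index out of the file.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# To write to disk instead (the file name is an assumption):
# submission.to_csv("submission.csv", index=False)
```

Writing to an in-memory buffer first makes it easy to inspect the header row before creating a file.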

In [1]:
from src.titanic import get_data
from src.titanic import clean_data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
train_filename = 'data/train.csv'
test_filename = 'data/test.csv'
train_df, test_df = get_data(train_filename, test_filename)
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
test_df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [4]:
train_df.info(), test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float

(None, None)

In [5]:
train_df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [6]:
#train_df['Cabin'].unique(return_counts=True)
#np.unique(train_df['Cabin'], return_counts=True)
train_df.describe(include=['O'])

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Pinsky, Mrs. (Rosa)",male,347082,G6,S
freq,1,577,7,4,644


## Some initial observations of the data

- PassengerId - will have no bearing on Survived and can be discarded without further analysis.
- Pclass - has 3 values (none missing) for 1st, 2nd, and 3rd class. It appears to be a good predictor of survival.
- Name - names should have no bearing on survival, but the title in each name could indicate status and so help predict survival. However, 93% of the titles simply indicate gender and say nothing about status. A name may also have no title, and the test data may contain titles not seen during training. Remove Name for now and come back to it if the models don't perform well.
- Sex - is a very strong predictor of survival. This data needs to be converted to binary values.
- Cabin - 77% of the values are missing. There could conceivably be useful information in the cabin name indicating the part of the ship someone was located in, but there are too many missing values to estimate.
- Ticket - has 681 unique values (none missing) across 891 rows, too many to be useful as a category, so it can be discarded without further analysis.
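The missing-value percentages and unique counts quoted above can be checked with two short pandas calls. This is a minimal sketch on a tiny made-up frame (`toy_df` is a hypothetical stand-in, not the real training data); the same calls should work on `train_df`.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for train_df; values are illustrative only.
toy_df = pd.DataFrame({
    "Pclass": [1, 3, 3, 2],
    "Cabin": ["C85", np.nan, np.nan, np.nan],
    "Ticket": ["PC 17599", "A/5 21171", "373450", "240276"],
})

# isnull() gives a boolean frame; the column-wise mean is the fraction missing.
missing_fraction = toy_df.isnull().mean()

# nunique() counts distinct non-missing values per column.
unique_counts = toy_df.nunique()

print(missing_fraction)
print(unique_counts)
```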

In [7]:
#plt.scatter(train_df['Pclass'], train_df['Survived'])
train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)

Unnamed: 0,Pclass,Survived
0,1,0.62963
1,2,0.472826
2,3,0.242363


In [8]:
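# Note: temp_df is an alias of train_df, not a copy, so 'Title' is also
# added to train_df (it is dropped again in cell 10).
temp_df = train_df
# Raw string so that '\.' is not treated as an invalid escape sequence.
temp_df['Title'] = train_df.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)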
pd.crosstab(temp_df['Title'], temp_df['Sex'])

Sex,female,male
Title,Unnamed: 1_level_1,Unnamed: 2_level_1
Capt,0,1
Col,0,2
Countess,1,0
Don,0,1
Dr,1,6
Jonkheer,0,1
Lady,1,0
Major,0,2
Master,0,40
Miss,182,0


In [9]:
train_df[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)

Unnamed: 0,Sex,Survived
0,female,0.742038
1,male,0.188908


In [10]:
train_df = train_df.drop(['Title'], axis=1)

## Initial modifications to the data

- Drop PassengerId, Name, Cabin and Ticket
- Change Sex to binary values
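One compact way to express the Sex encoding is `Series.map` with a dictionary, which also yields a numeric column instead of an object column holding integers. This is only a sketch on a hypothetical `toy_df`; the 1/0 coding matches the one used in this notebook.

```python
import pandas as pd

# Made-up rows standing in for the real data.
toy_df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Same coding as used in this notebook: male -> 1, female -> 0.
sex_codes = {"male": 1, "female": 0}
toy_df["Sex"] = toy_df["Sex"].map(sex_codes)

print(toy_df["Sex"].tolist())
print(toy_df["Sex"].dtype)
```

Any value that is not a key of the dictionary becomes NaN, so an unexpected spelling would show up as a missing value rather than pass through silently.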

In [11]:
train_df = train_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
train_df.loc[train_df['Sex']=='male', 'Sex'] = 1
train_df.loc[train_df['Sex']=='female', 'Sex'] = 0
test_df = test_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
test_df.loc[test_df['Sex']=='male', 'Sex'] = 1
test_df.loc[test_df['Sex']=='female', 'Sex'] = 0

In [12]:
train_df.head(), test_df.head()

(   Survived  Pclass  Sex   Age  SibSp  Parch     Fare Embarked
 0         0       3    1  22.0      1      0   7.2500        S
 1         1       1    0  38.0      1      0  71.2833        C
 2         1       3    0  26.0      0      0   7.9250        S
 3         1       1    0  35.0      1      0  53.1000        S
 4         0       3    1  35.0      0      0   8.0500        S,
    Pclass  Sex   Age  SibSp  Parch     Fare Embarked
 0       3    1  34.5      0      0   7.8292        Q
 1       3    0  47.0      1      0   7.0000        S
 2       2    1  62.0      0      0   9.6875        Q
 3       3    1  27.0      0      0   8.6625        S
 4       3    0  22.0      1      1  12.2875        S)

## Observations on Age

- Age - 20% of the data points are missing. I will attempt to estimate the missing values from their correlation with other columns.
- The resulting data needs to be banded so we have fewer values to classify on.
- Sex is not well correlated with age. Except for the youngest band (about half male) and the oldest band (all male), men represented ~65% of the passengers in every age band.
- SibSp shows the same pattern: it is roughly constant across the middle age bands and differs only for the youngest and oldest bands.
- Parch is highest among the adult bands in the 32 to 48 age band, and Pclass trends from 1st class to 3rd class as you go from older to younger age groups. I will use both of these to estimate age.
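The estimation rule described in the last bullet can be sketched as a single ordered set of conditions with `np.select`. The fill values (40, 38, 30, 26) are the ones this notebook assigns in its fill-in cells; `toy_df` is a hypothetical stand-in for the real frames.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real data.
toy_df = pd.DataFrame({
    "Age":    [22.0, np.nan, np.nan, np.nan, np.nan],
    "Parch":  [0,    2,      0,      0,      0],
    "Pclass": [3,    3,      1,      2,      3],
})

# Conditions are checked in order; the first match wins.
conditions = [
    toy_df["Parch"] > 0,
    toy_df["Pclass"] == 1,
    toy_df["Pclass"] == 2,
    toy_df["Pclass"] == 3,
]
fill_values = [40, 38, 30, 26]
estimated_age = pd.Series(
    np.select(conditions, fill_values, default=np.nan), index=toy_df.index
)

# Only rows with a missing Age take the estimate; known ages are kept.
toy_df["Age"] = toy_df["Age"].fillna(estimated_age)
print(toy_df["Age"].tolist())
```

Keeping the rules in one place makes it easier to apply exactly the same logic to both the training and test frames.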

In [13]:
train_df['Age Band'] = pd.cut(train_df['Age'], 5)
train_df[['Age Band', 'Survived']].groupby(['Age Band'], as_index=False).mean().sort_values(by='Age Band', ascending=False)

Unnamed: 0,Age Band,Survived
4,"(64.084, 80.0]",0.090909
3,"(48.168, 64.084]",0.434783
2,"(32.252, 48.168]",0.404255
1,"(16.336, 32.252]",0.369942
0,"(0.34, 16.336]",0.55


In [14]:
train_df[['Age Band', 'SibSp', 'Parch', 'Pclass', 'Sex']].groupby(['Age Band'], as_index=False).mean().sort_values(by='Age Band', ascending=False)

Unnamed: 0,Age Band,SibSp,Parch,Pclass,Sex
4,"(64.084, 80.0]",0.090909,0.181818,1.727273,1.0
3,"(48.168, 64.084]",0.333333,0.289855,1.507246,0.652174
2,"(32.252, 48.168]",0.367021,0.468085,2.005319,0.638298
1,"(16.336, 32.252]",0.33526,0.242775,2.416185,0.653179
0,"(0.34, 16.336]",1.57,1.14,2.61,0.51


In [15]:
train_df[['Pclass', 'Age']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Pclass', ascending=False)

Unnamed: 0,Pclass,Age
2,3,25.14062
1,2,29.87763
0,1,38.233441


In [16]:
train_df.Parch.value_counts()

0    678
1    118
2     80
5      5
3      5
4      4
6      1
Name: Parch, dtype: int64

In [17]:
train_df.Pclass.value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

## Filling in missing Age values

In [18]:
train_df.loc[(train_df.Age.isnull()) & (train_df.Parch > 0), 'Age'] = 40
test_df.loc[(test_df.Age.isnull()) & (test_df.Parch > 0), 'Age'] = 40

In [19]:
train_df.loc[(train_df.Age.isnull()) & (train_df.Pclass == 1), 'Age'] = 38
test_df.loc[(test_df.Age.isnull()) & (test_df.Pclass == 1), 'Age'] = 38

In [20]:
train_df.loc[(train_df.Age.isnull()) & (train_df.Pclass == 2), 'Age'] = 30
test_df.loc[(test_df.Age.isnull()) & (test_df.Pclass == 2), 'Age'] = 30

In [21]:
train_df.loc[(train_df.Age.isnull()) & (train_df.Pclass == 3), 'Age'] = 26
test_df.loc[(test_df.Age.isnull()) & (test_df.Pclass == 3), 'Age'] = 26

In [22]:
train_df[train_df['Age'].isnull()]

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Age Band


#### Replace Age values with ordinals based on age bands, then drop age bands

In [23]:
train_df.loc[train_df['Age'] <= 16, 'Age'] = 0
train_df.loc[(train_df['Age'] > 16) & (train_df['Age'] <= 32), 'Age'] = 1
train_df.loc[(train_df['Age'] > 32) & (train_df['Age'] <= 48), 'Age'] = 2
train_df.loc[(train_df['Age'] > 48) & (train_df['Age'] <= 64), 'Age'] = 3
train_df.loc[train_df['Age'] > 64, 'Age'] = 4
test_df.loc[test_df['Age'] <= 16, 'Age'] = 0
test_df.loc[(test_df['Age'] > 16) & (test_df['Age'] <= 32), 'Age'] = 1
test_df.loc[(test_df['Age'] > 32) & (test_df['Age'] <= 48), 'Age'] = 2
test_df.loc[(test_df['Age'] > 48) & (test_df['Age'] <= 64), 'Age'] = 3
test_df.loc[test_df['Age'] > 64, 'Age'] = 4
train_df.head(), test_df.head()

(   Survived  Pclass  Sex  Age  SibSp  Parch     Fare Embarked  \
 0         0       3    1  1.0      1      0   7.2500        S   
 1         1       1    0  2.0      1      0  71.2833        C   
 2         1       3    0  1.0      0      0   7.9250        S   
 3         1       1    0  2.0      1      0  53.1000        S   
 4         0       3    1  2.0      0      0   8.0500        S   
 
            Age Band  
 0  (16.336, 32.252]  
 1  (32.252, 48.168]  
 2  (16.336, 32.252]  
 3  (32.252, 48.168]  
 4  (32.252, 48.168]  ,    Pclass  Sex  Age  SibSp  Parch     Fare Embarked
 0       3    1  2.0      0      0   7.8292        Q
 1       3    0  2.0      1      0   7.0000        S
 2       2    1  3.0      0      0   9.6875        Q
 3       3    1  1.0      0      0   8.6625        S
 4       3    0  1.0      1      1  12.2875        S)

In [24]:
train_df = train_df.drop(['Age Band'], axis=1)
train_df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,1,1.0,1,0,7.25,S
1,1,1,0,2.0,1,0,71.2833,C
2,1,3,0,1.0,0,0,7.925,S
3,1,1,0,2.0,1,0,53.1,S
4,0,3,1,2.0,0,0,8.05,S
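The manual banding in cell 23 could also be written with `pd.cut` and explicit bin edges. This is a sketch under the assumption that the rounded edges 16, 32, 48 and 64 are the intended boundaries; the sample ages are illustrative.

```python
import numpy as np
import pandas as pd

ages = pd.Series([0.42, 16.0, 22.0, 34.5, 62.0, 80.0])

# Right-closed bins: (-inf, 16], (16, 32], (32, 48], (48, 64], (64, inf)
bin_edges = [-np.inf, 16, 32, 48, 64, np.inf]

# labels=False returns the integer bin index (0..4) instead of an interval.
age_ordinal = pd.cut(ages, bins=bin_edges, labels=False)
print(age_ordinal.tolist())
```

Because the result goes into a new Series, there is no risk of an already-banded value being matched again by a later condition.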


## Observations on Fare

- The chance of survival above a certain fare level appears to be between 2/3 and 3/4. Below that fare level the chance of survival drops precipitously.
- I will select a threshold of $75 and create two ordinal values: 0 for fares of 75 or less and 1 for fares above 75.
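A minimal sketch of the threshold as a vectorized comparison follows. The sample fares are illustrative; the trailing NaN is there because the `info()` output above lists 417 non-null Fare values for the test set, so a missing fare may need to be handled.

```python
import numpy as np
import pandas as pd

fares = pd.Series([7.25, 71.2833, 75.0, 83.475, np.nan])

# Comparisons with NaN are False, so a missing fare would silently become 0
# unless it is masked back out afterwards.
fare_flag = (fares > 75).astype(float).where(fares.notna())
print(fare_flag.tolist())
```

Whether a missing fare should stay missing or be filled with a default is a separate decision; the sketch only keeps it visible.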

In [25]:
temp_df = train_df.copy()
temp_df['Fare Band'] = pd.cut(temp_df['Fare'], 7)
temp_df[['Fare Band', 'Survived']].groupby(['Fare Band'], as_index=False).mean().sort_values(by='Fare Band', ascending=False)

Unnamed: 0,Fare Band,Survived
6,"(439.139, 512.329]",1.0
5,"(365.949, 439.139]",
4,"(292.76, 365.949]",
3,"(219.57, 292.76]",0.615385
2,"(146.38, 219.57]",0.733333
1,"(73.19, 146.38]",0.732394
0,"(-0.512, 73.19]",0.33967


#### Change Fares to ordinal values

In [26]:
train_df.loc[train_df['Fare'] <= 75, 'Fare'] = 0
train_df.loc[train_df['Fare'] > 75, 'Fare'] = 1
test_df.loc[test_df['Fare'] <= 75, 'Fare'] = 0
test_df.loc[test_df['Fare'] > 75, 'Fare'] = 1
train_df.head(), test_df.head()

(   Survived  Pclass  Sex  Age  SibSp  Parch  Fare Embarked
 0         0       3    1  1.0      1      0   0.0        S
 1         1       1    0  2.0      1      0   0.0        C
 2         1       3    0  1.0      0      0   0.0        S
 3         1       1    0  2.0      1      0   0.0        S
 4         0       3    1  2.0      0      0   0.0        S,
    Pclass  Sex  Age  SibSp  Parch  Fare Embarked
 0       3    1  2.0      0      0   0.0        Q
 1       3    0  2.0      1      0   0.0        S
 2       2    1  3.0      0      0   0.0        Q
 3       3    1  1.0      0      0   0.0        S
 4       3    0  1.0      1      1   0.0        S)

## Observations on Embarked

- Embarked could be a useful predictor of Survived: passengers who embarked at C had a noticeably higher survival rate than those who embarked at Q or S.

In [27]:
train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)

Unnamed: 0,Embarked,Survived
0,C,0.553571
1,Q,0.38961
2,S,0.336957


#### Change Embarked to ordinal values

- There are two missing values in the training set and none in the test set. 72% of all passengers embarked at 'S', so the two missing values are simply set to 'S' (coded as 0).
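The fill-then-encode step can be sketched in one chain. The codes S=0, Q=1, C=2 are the ones used in this notebook; the sample series is made up, and using `mode()` to find the most common port is an illustrative choice rather than what the cells below do.

```python
import numpy as np
import pandas as pd

embarked = pd.Series(["S", "C", np.nan, "Q", "S"])

# Fill missing ports with the most common one, then map to the notebook's codes.
most_common_port = embarked.mode().iloc[0]
port_codes = {"S": 0, "Q": 1, "C": 2}
embarked_code = embarked.fillna(most_common_port).map(port_codes)
print(embarked_code.tolist())
```

Filling before mapping avoids a column that mixes integer codes with NaN.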

In [28]:
train_df.loc[train_df['Embarked'] == 'S', 'Embarked'] = 0
train_df.loc[train_df['Embarked'] == 'Q', 'Embarked'] = 1
train_df.loc[train_df['Embarked'] == 'C', 'Embarked'] = 2
test_df.loc[test_df['Embarked'] == 'S', 'Embarked'] = 0
test_df.loc[test_df['Embarked'] == 'Q', 'Embarked'] = 1
test_df.loc[test_df['Embarked'] == 'C', 'Embarked'] = 2
train_df.head(), test_df.head()

(   Survived  Pclass  Sex  Age  SibSp  Parch  Fare Embarked
 0         0       3    1  1.0      1      0   0.0        0
 1         1       1    0  2.0      1      0   0.0        2
 2         1       3    0  1.0      0      0   0.0        0
 3         1       1    0  2.0      1      0   0.0        0
 4         0       3    1  2.0      0      0   0.0        0,
    Pclass  Sex  Age  SibSp  Parch  Fare  Embarked
 0       3    1  2.0      0      0   0.0         1
 1       3    0  2.0      1      0   0.0         0
 2       2    1  3.0      0      0   0.0         1
 3       3    1  1.0      0      0   0.0         0
 4       3    0  1.0      1      1   0.0         0)

In [29]:
train_df['Embarked'].value_counts()

0    644
2    168
1     77
Name: Embarked, dtype: int64

In [30]:
train_df.loc[train_df.Embarked.isnull(), 'Embarked'] = 0
train_df.describe()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
count,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,0.383838,2.308642,0.647587,1.345679,0.523008,0.381594,0.108866,0.463524
std,0.486592,0.836071,0.47799,0.824847,1.102743,0.806057,0.311647,0.791503
min,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0
50%,0.0,3.0,1.0,1.0,0.0,0.0,0.0,0.0
75%,1.0,3.0,1.0,2.0,1.0,0.0,0.0,1.0
max,1.0,3.0,1.0,4.0,8.0,6.0,1.0,2.0


## Develop Models

The following models can be used for a supervised regression/classification problem. I will test and compare each one for this competition using cross-validation.

- Logistic Regression
- KNN or k-Nearest Neighbors
- Support Vector Machines
- Naive Bayes classifier
- Decision Tree
- Random Forest
- Perceptron
- Artificial neural network
- RVM or Relevance Vector Machine
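A comparison of this kind could be sketched with scikit-learn's `cross_val_score`. The snippet assumes scikit-learn is installed, uses synthetic data rather than the prepared Titanic features, and covers only three of the listed models; the hyperparameters are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features and labels (not Titanic data).
rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(200, 5)).astype(float)
y = (X[:, 0] + rng.normal(0, 1, size=200) > 1.5).astype(int)

candidate_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "k-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# 5-fold cross-validation; for classifiers the default score is accuracy.
cv_scores = {}
for name, model in candidate_models.items():
    scores = cross_val_score(model, X, y, cv=5)
    cv_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Swapping `X` and `y` for the prepared training features and the Survived column would give a like-for-like comparison on the real data.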