Introduction
=========

Hi all, this is my first Kaggle kernel and really one of the first data science problems I've tackled by myself. We'll go through some basic data wrangling, then explore a bunch of simple learning methods and see how they stack up against each other. This kernel aims to be primarily educational for any other beginner Kagglers looking to get started!

Prepping the Data
================

Loading
----------

In [1]:
import pandas as pd

fr_train = pd.read_csv('../input/train.csv')
print(fr_train.shape)
fr_train.head()

(891, 12)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


The `head` function can be used on a dataframe to view the first n rows; pandas sets a default of n = 5. It's a useful way to get a quick peek at the data you're working with, in particular whether a feature is categorical or numerical, as well as its typical values. Right off the bat we can spot an idiosyncrasy in our dataset, namely that we seem to be missing a lot of values in our `Cabin` feature. Dealing with missing data is an important step in any machine learning pipeline, and we'll spend a bit of time on it a little later on. But first, let's continue exploring some basic properties of our dataset.

In [2]:
fr_train.describe(include='all')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891,891.0,204,889
unique,,,,891,2,,,,681,,147,3
top,,,,"Maenpaa, Mr. Matti Alexanteri",male,,,,CA. 2343,,B96 B98,S
freq,,,,1,577,,,,7,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,


The pandas `describe()` function can be used to summarise basic statistical properties of each feature in a dataframe. Let's run through what we see here.

First of all, the mean of `Survived` is 0.38, meaning only 38% of the passengers in this dataset survived. Naturally this means we have more data pertaining to passengers who died, so it wouldn't be too surprising if our model ends up being better at predicting who dies than who survives, but this is something we'll have to evaluate later.

There are also more males than females in this dataset by a fairly significant margin: 577 of the 891 total records. Most interestingly, take a look at `Cabin`. The `describe()` function excludes missing values from its analysis, and we only have 204 `Cabin` values from the full 891 rows of data. That's a lot of missing data. Note that `Age` and `Embarked` are missing values as well, but not nearly as many.
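
To make the point about `count` concrete, here is a minimal sketch on a made-up frame (the `toy` dataframe and its values are invented for illustration and are not part of the Titanic data). It shows that the `count` row only tallies non-missing entries, so the number of missing values is the row total minus that count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with some missing entries
toy = pd.DataFrame({
    'age': [22.0, np.nan, 35.0, np.nan],
    'cabin': ['C85', None, None, None],
})

# The 'count' row of describe() ignores missing values
summary = toy.describe(include='all')
print(summary.loc['count'])

# Missing entries per column = total rows minus the non-missing count
n_missing = len(toy) - toy.count()
print(n_missing)
```

The same subtraction applied to the output above gives 891 - 204 = 687 missing `Cabin` values.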

I like to quickly make helper functions which give me neat, explicit output about what I'm concerned with. Since we're about to deal with a lot of missing data I'm going to make a short, simple function which summarises the missing data in any generic dataframe.

In [3]:
def naSummary(df):
    nrow, ncol = df.shape
    na_count = df.isnull().sum()
    na_pc = na_count.divide(nrow)
    print(pd.DataFrame({'NA Count': na_count, 'NA %': na_pc}))
    
naSummary(fr_train)

                 NA %  NA Count
PassengerId  0.000000         0
Survived     0.000000         0
Pclass       0.000000         0
Name         0.000000         0
Sex          0.000000         0
Age          0.198653       177
SibSp        0.000000         0
Parch        0.000000         0
Ticket       0.000000         0
Fare         0.000000         0
Cabin        0.771044       687
Embarked     0.002245         2


Finally, let's convert our categorical data to be explicitly categorical according to pandas.

In [4]:
ctgs = ['Survived', 'Pclass', 'Embarked', 'Sex']

for ctg in ctgs:
    fr_train[ctg] = fr_train[ctg].astype('category')
    
fr_train.dtypes

PassengerId       int64
Survived       category
Pclass         category
Name             object
Sex            category
Age             float64
SibSp             int64
Parch             int64
Ticket           object
Fare            float64
Cabin            object
Embarked       category
dtype: object
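
For anyone curious what the `category` dtype changes under the hood, here is a small sketch on an invented column (the `toy` frame is hypothetical). A categorical column stores the set of distinct levels once, plus one integer code per row pointing into that set.

```python
import pandas as pd

# Hypothetical toy column, only for illustration
toy = pd.DataFrame({'Pclass': [3, 1, 3, 2]})
toy['Pclass'] = toy['Pclass'].astype('category')

# The distinct levels, sorted
levels = toy['Pclass'].cat.categories.tolist()
# One integer code per row, indexing into the levels
codes = toy['Pclass'].cat.codes.tolist()

print(levels)  # [1, 2, 3]
print(codes)   # [2, 0, 2, 1]
```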

Missing Data Imputation
----------------------

There are essentially two strategies you can employ when you have to deal with missing data. You can throw out the entries with missing values, or you can replace the missing values with your best guess. This latter process is called imputation. Throwing data out is really simple to do, but the downside is that you'll have less data to train your model on. Strategies for imputation can range from very easy to quite complex. It's an extensive topic in and of itself, and you should spend some time reading up on it.
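
As a rough sketch of the two strategies side by side, consider the made-up frame below (the `toy` data is invented, and mean imputation is just one simple choice of "best guess" picked for this example).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing age
toy = pd.DataFrame({'age': [22.0, np.nan, 30.0, 38.0],
                    'fare': [7.25, 8.05, 53.1, 71.28]})

# Strategy 1: throw out rows containing any missing value
dropped = toy.dropna()

# Strategy 2: impute, here with the column mean as a very simple best guess
imputed = toy.fillna({'age': toy['age'].mean()})

print(len(dropped), len(imputed))  # 3 4
print(imputed['age'].tolist())     # [22.0, 30.0, 30.0, 38.0]
```

Dropping loses a whole row, including its perfectly good `fare` value, while imputing keeps every row at the cost of a guessed `age`.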

### Embarked

Let's go from easy to hard and start off with the `Embarked` feature, which is only missing two values. This is a trivial amount of missing data, so we'll just use a really simple 'most frequent' imputation, which is exactly what it sounds like: replace the missing value with the most frequent level for that feature.

In [5]:
# value_counts returns the count of each level in the feature sorted in descending order by default
embarked_mcl = fr_train['Embarked'].value_counts().index[0]
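# Assign the result back rather than using inplace=True on a selected column,
# which recent pandas versions warn about
fr_train['Embarked'] = fr_train['Embarked'].fillna(embarked_mcl)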

naSummary(fr_train)

                 NA %  NA Count
PassengerId  0.000000         0
Survived     0.000000         0
Pclass       0.000000         0
Name         0.000000         0
Sex          0.000000         0
Age          0.198653       177
SibSp        0.000000         0
Parch        0.000000         0
Ticket       0.000000         0
Fare         0.000000         0
Cabin        0.771044       687
Embarked     0.000000         0


### Age

`Age` is a bit harder, since we're missing a non-trivial number of values. We're going to attempt a random regression imputation. This involves developing a regression model between `Age` and a set of predictors from the dataset, then adding a residual term to the prediction in order to reintroduce randomness to the imputed values. Effectively this is another machine learning problem within our bigger Titanic machine learning problem!
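
Before applying it to `Age`, here is a minimal sketch of the idea on synthetic data. Everything in it is an assumption made for illustration: the `toy` frame, the single predictor `x`, and in particular the choice to draw the residual term from a normal distribution with the standard deviation of the fitted residuals (resampling the observed residuals would be another reasonable option).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on x plus noise, and 20% of y is missing
n = 200
x = rng.uniform(0, 10, size=n)
y = 3.0 * x + rng.normal(0, 2.0, size=n)
toy = pd.DataFrame({'x': x, 'y': y})
toy.loc[toy.sample(frac=0.2, random_state=0).index, 'y'] = np.nan

known = toy[toy['y'].notnull()]
missing = toy[toy['y'].isnull()]

# 1. Fit the regression on rows where the target is observed
model = LinearRegression().fit(known[['x']], known['y'])

# 2. Estimate the residual spread from those same rows
residuals = known['y'] - model.predict(known[['x']])
resid_sd = residuals.std()

# 3. Predict the missing targets and add a random residual to each
noise = rng.normal(0, resid_sd, size=len(missing))
imputed = toy.copy()
imputed.loc[missing.index, 'y'] = model.predict(missing[['x']]) + noise

print(imputed['y'].isnull().sum())  # no missing values remain
```

Without step 3, every imputed value would sit exactly on the regression line, which understates the spread of the real data.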

We're going to need features, so we can do a bit of feature engineering here. In particular I want to extract the title of each passenger from the `Name` feature. A look at the data shows that the title "Master" seems to be reserved for males under the age of 13. I'd also expect there to be a correlation between the title "Miss" and younger females.

In [6]:
fr_train['Title'] = (fr_train.apply(lambda passenger: passenger['Name'].split(',')[1].split()[0][:-1], axis=1)).astype('category')
fr_train.head(n=10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,Mr
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,Mr
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,Master
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,Mrs
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,Mrs


So far it looks good! Let's use `describe()` to get a better look, but first let's move `Title` to before `Name` so the dataframe is visually cleaner when we display it.

In [7]:
cols = fr_train.columns.tolist()
cols = cols[0:3] + cols[-1:] + cols[3:-1]
fr_train = fr_train[cols]

fr_train.describe(include='all')

Unnamed: 0,PassengerId,Survived,Pclass,Title,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,891,714.0,891.0,891.0,891,891.0,204,891
unique,,2.0,3.0,17,891,2,,,,681,,147,3
top,,0.0,3.0,Mr,"Maenpaa, Mr. Matti Alexanteri",male,,,,CA. 2343,,B96 B98,S
freq,,549.0,491.0,517,1,577,,,,7,,4,646
mean,446.0,,,,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,,,,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,,,,,,0.42,0.0,0.0,,0.0,,
25%,223.5,,,,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,,,,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,,,,,,38.0,1.0,0.0,,31.0,,


Seventeen unique values, let's display them all.

In [8]:
fr_train['Title'].value_counts()

Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Col           2
Mlle          2
Major         2
Jonkheer      1
Don           1
th            1
Lady          1
Sir           1
Mme           1
Ms            1
Capt          1
Name: Title, dtype: int64
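
The odd `th` entry looks like a side effect of taking only the first whitespace-separated token after the comma: a multi-word title such as "the Countess" gets cut down to `the`, and dropping the last character leaves `th`. The sketch below compares the splitting approach with a regular expression that captures everything between the comma and the first full stop. The third name is my assumption of what such an entry looks like, and `by_split` and `by_regex` are names made up for this example.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    # Assumed example of a name with a multi-word title
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
])

# Splitting approach: first token after the comma, minus its last character
by_split = names.apply(lambda name: name.split(',')[1].split()[0][:-1])

# Alternative: capture everything between the comma and the first full stop
by_regex = names.str.extract(r',\s*([^.]+)\.', expand=False)

print(by_split.tolist())  # ['Mr', 'Miss', 'th']
print(by_regex.tolist())  # ['Mr', 'Miss', 'the Countess']
```

Either way the rare title ends up merged into a common level, so the splitting approach is good enough for our purposes.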

We'll merge the rarer titles into the common ones so that we only deal with the following subset of titles:

* Mr
* Mrs
* Miss
* Master

We'll use the following mappings.

* Dr, Rev, Major, Col, Jonkheer, Don, Sir, Capt -> Mr
* Mlle, Ms -> Miss
* Lady, Mme, th -> Mrs

I think there would be some amount of predictive power to these titles if we had more data, but as it stands these honorifics just don't have enough entries to act as standalone features, so we're merging them with the most appropriate alternate level.

In [10]:
# Map each rare title onto one of the four common levels.
# isin() builds a boolean row mask; .loc assigns the new level to those rows.
mr_alias = ['Dr', 'Rev', 'Major', 'Col', 'Jonkheer', 'Don', 'Sir', 'Capt']
fr_train.loc[fr_train['Title'].isin(mr_alias), 'Title'] = 'Mr'

# 'th' is grouped with Mrs, matching the mapping listed above
mrs_alias = ['Lady', 'Mme', 'th']
fr_train.loc[fr_train['Title'].isin(mrs_alias), 'Title'] = 'Mrs'

miss_alias = ['Mlle', 'Ms']
fr_train.loc[fr_train['Title'].isin(miss_alias), 'Title'] = 'Miss'

fr_train['Title'] = fr_train['Title'].cat.remove_unused_categories()
fr_train['Title'].value_counts()

fr_train.to_csv('../input/treated_train.csv')

Great! Now that's all done and we have what's hopefully a useful, additional feature for our `Age` random regression imputation.

In [None]:
fr_train.dtypes

In [None]:
# Exclude passengers with missing ages
cl_train = fr_train[fr_train['Age'].notnull()]

# One hot encode categorical variables
cl_train = cl_train[['Survived', 'Pclass', 'Title', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
ctgs = cl_train.select_dtypes(include=['category']).columns.tolist()
cl_train = pd.get_dummies(cl_train, columns=ctgs)

import numpy as np
from sklearn import linear_model
lm = linear_model.LinearRegression()

X = cl_train.drop('Age', axis=1)
Y = cl_train['Age']

lm.fit(X, Y)
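
To see what the one-hot encoding step in the cell above does, here is a small sketch on an invented frame (the `toy` data is hypothetical). Each categorical column is replaced by one indicator column per level, while numerical columns pass through untouched.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numerical column
toy = pd.DataFrame({
    'Sex': pd.Categorical(['male', 'female', 'male']),
    'Age': [22.0, 38.0, 35.0],
})

# Pick out the categorical columns, then expand each into indicator columns
ctg_cols = toy.select_dtypes(include=['category']).columns.tolist()
encoded = pd.get_dummies(toy, columns=ctg_cols)

print(encoded.columns.tolist())  # ['Age', 'Sex_female', 'Sex_male']
```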

In [None]:
# The coefficients
print('Coefficients: \n', lm.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % np.mean((lm.predict(X) - Y) ** 2))
# score() returns the coefficient of determination (R^2): 1 is perfect prediction
print('Variance score: %.2f' % lm.score(X, Y))
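
As a sanity check on the two metrics printed above, the sketch below computes them by hand on synthetic data and compares them with scikit-learn's own functions. The data, `X_toy` and `y_toy`, are invented for this example. The mean squared error is the average squared residual, and the value returned by `score()` is R^2, that is, one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(1)

# Synthetic regression problem with two predictors
X_toy = rng.uniform(0, 10, size=(50, 2))
y_toy = X_toy @ np.array([2.0, -1.0]) + rng.normal(0, 1.0, size=50)

model = LinearRegression().fit(X_toy, y_toy)
pred = model.predict(X_toy)

# Mean squared error: the average squared residual
manual_mse = np.mean((pred - y_toy) ** 2)

# R^2: one minus residual sum of squares over total sum of squares
ss_res = np.sum((y_toy - pred) ** 2)
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print(np.isclose(manual_mse, mean_squared_error(y_toy, pred)))
print(np.isclose(manual_r2, model.score(X_toy, y_toy)))
```

Note that both numbers here are measured on the same rows the model was fitted to, so they describe the fit rather than performance on unseen data.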