## Titanic: Machine Learning from Disaster

The Kaggle contest can be found [here](https://www.kaggle.com/c/titanic).

## 1. Problem Statement

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. Because there were too few lifeboats, not everyone on board was able to survive. The likelihood of survival was found to depend not only on luck, but also on age, gender, social status, etc.

In this challenge, we are given a binary classification problem: from a given set of people, predict who survived the shipwreck.

**Input**: a set of people and various properties such as age, gender, etc.

**Output**: whether or not each person survived

## 2. Gathering Data

Data has been provided by Kaggle.

[Training data](data/train.csv)

[Test data](data/test.csv)

[Gender Submission](data/gender_submission.csv)

## 3. Preparation

### 3.1 Imports

In [1]:
import pandas as pd         # data processing and analysis modeled after R dataframes with SQL-like features
import matplotlib           # scientific and publication-ready visualization
import numpy as np          # foundational package for scientific computing
import scipy as sp          # scientific computing and advanced mathematics
import IPython              # interactive shell that backs Jupyter notebooks
from IPython import display # pretty printing of dataframes in Jupyter notebook
import sklearn              # collection of machine learning algorithms

# Model Algorithms
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process

# Model Helpers
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics

# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
from pandas.plotting import scatter_matrix   # pandas.tools.plotting was removed in newer pandas releases

# Visualization Defaults

# show plots in Jupyter Notebook browser
%matplotlib inline
mpl.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 12,8

### 3.2 Loading Data

In [2]:
data_raw_training = pd.read_csv('data/train.csv')
data_raw_test = pd.read_csv('data/test.csv')

data_to_clean = [data_raw_training, data_raw_test]       # datasets to clean

for dataset in data_to_clean:
    print(dataset.info())
    print(40*"=")

data_raw_training.sample(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null 

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 84 | 85 | 1 | 2 | Ilett, Miss. Bertha | female | 17.0 | 0 | 0 | SO/C 14885 | 10.5 | NaN | S |
| 465 | 466 | 0 | 3 | Goncalves, Mr. Manuel Estanslas | male | 38.0 | 0 | 0 | SOTON/O.Q. 3101306 | 7.05 | NaN | S |
| 310 | 311 | 1 | 1 | Hays, Miss. Margaret Bechstein | female | 24.0 | 0 | 0 | 11767 | 83.1583 | C54 | C |
| 125 | 126 | 1 | 3 | Nicola-Yarred, Master. Elias | male | 12.0 | 1 | 0 | 2651 | 11.2417 | NaN | C |
| 460 | 461 | 1 | 1 | Anderson, Mr. Harry | male | 48.0 | 0 | 0 | 19952 | 26.55 | E12 | S |
| 321 | 322 | 0 | 3 | Danoff, Mr. Yoto | male | 27.0 | 0 | 0 | 349219 | 7.8958 | NaN | S |
| 798 | 799 | 0 | 3 | Ibrahim Shawah, Mr. Yousseff | male | 30.0 | 0 | 0 | 2685 | 7.2292 | NaN | C |
| 633 | 634 | 0 | 1 | Parr, Mr. William Henry Marsh | male | NaN | 0 | 0 | 112052 | 0.0 | NaN | S |
| 22 | 23 | 1 | 3 | McGowan, Miss. Anna "Annie" | female | 15.0 | 0 | 0 | 330923 | 8.0292 | NaN | Q |
| 95 | 96 | 0 | 3 | Shorney, Mr. Charles Joseph | male | NaN | 0 | 0 | 374910 | 8.05 | NaN | S |


### 3.3 Data Cleaning

#### 3.3.1 Correcting

In this section, we identify incorrect values (e.g. age = 800) and correct them. Our strategy is to set incorrect values to `NaN` and find a suitable replacement for them in the completion section.

In [3]:
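for df in data_to_clean:
    # incorrect sex: anything outside the accepted labels becomes NaN
    df.loc[~df['Sex'].isin(["male", "female", "m", "f"]), "Sex"] = np.nan

    # incorrect pclass: valid classes are 1, 2 and 3
    df.loc[(df.Pclass <= 0) | (df.Pclass >= 4), "Pclass"] = np.nan

    # incorrect ages: non-positive or 100 and above
    df.loc[(df.Age <= 0) | (df.Age >= 100), "Age"] = np.nan

    # incorrect ports: valid ports are S, C and Q
    # (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
    df.loc[~df.Embarked.isin(['S', 'C', 'Q']), "Embarked"] = np.nan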
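To see what these masks do, here is a minimal sketch on a made-up frame. The rows in `toy` are invented for illustration and are not taken from the Kaggle data; `Pclass` is stored as a float here only so that it can hold `NaN` without a dtype change.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; not from the Kaggle data.
toy = pd.DataFrame({
    "Sex": ["male", "female", "unknown"],
    "Pclass": [1.0, 5.0, 3.0],
    "Age": [22.0, 800.0, 35.0],
    "Embarked": ["S", "X", "Q"],
})

# Same masks as the cell above: anything outside the valid set becomes NaN.
toy.loc[~toy["Sex"].isin(["male", "female", "m", "f"]), "Sex"] = np.nan
toy.loc[(toy["Pclass"] <= 0) | (toy["Pclass"] >= 4), "Pclass"] = np.nan
toy.loc[(toy["Age"] <= 0) | (toy["Age"] >= 100), "Age"] = np.nan
toy.loc[~toy["Embarked"].isin(["S", "C", "Q"]), "Embarked"] = np.nan

# Count how many values were invalidated in each column.
invalid_counts = toy.isnull().sum()
print(invalid_counts)
```

Each column in this toy frame contains exactly one invalid entry, so every count comes out as 1. Running the same `isnull().sum()` on the real frames before and after the correction step would show whether anything was actually changed.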

#### 3.3.2 Completing

Our initial observation shows that there are several missing values in the data. To proceed, we can choose one of the following strategies:

1. Ignore NULL values
2. Delete rows with NULL values
3. Delete columns with missing values
4. Fill in the NULL values with reasonable alternatives
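The four strategies above can be sketched with pandas as follows. The frame `toy` and the variable names are hypothetical and exist only to show the effect of each option.

```python
import numpy as np
import pandas as pd

# Made-up frame for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 58.0],
    "Cabin": [np.nan, np.nan, "C85", np.nan],
    "Fare": [7.25, 71.28, 8.05, 26.55],
})

# 1. Ignore: leave the frame untouched (many estimators will reject NaN).
ignored = toy

# 2. Delete rows that contain any missing value.
rows_dropped = toy.dropna(axis=0)

# 3. Delete columns that contain any missing value.
cols_dropped = toy.dropna(axis=1)

# 4. Fill missing values with a reasonable alternative (here: the median age).
filled = toy.assign(Age=toy["Age"].fillna(toy["Age"].median()))

print(rows_dropped.shape, cols_dropped.shape, filled["Age"].tolist())
```

In this toy frame, dropping rows keeps only the single complete row and dropping columns keeps only `Fare`, which illustrates how much data the deletion strategies can cost when missing values are common.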

In the training data, the columns with missing values are `Age`, `Cabin`, and `Embarked`.
In the test data, the columns with missing values are `Age`, `Fare`, and `Cabin`.

`Fare` is missing in only one row, so it is simpler to drop that row than to try to replace the value.

The vast majority of `Cabin` values are missing. It makes more sense to drop the attribute altogether than to try to infer it from the small sample.

For `Age`, we can replace missing values with the mean, the median, the mode, or a randomized value based on the standard deviation. To keep things simple, let's stick with the median.
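As a rough comparison of these options, the sketch below fills the same made-up `age` series four ways. The random variant is one possible reading of "randomized standard deviation" (a uniform draw within one standard deviation of the mean); that interpretation, the seed, and all variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration only.
age = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 22.0])

filled_mean = age.fillna(age.mean())
filled_median = age.fillna(age.median())
filled_mode = age.fillna(age.mode()[0])   # mode() returns a Series; take the first mode

# Random fill (assumed interpretation): draw each missing age uniformly
# from the interval [mean - std, mean + std].
rng = np.random.default_rng(seed=0)       # seeded so the sketch is repeatable
low, high = age.mean() - age.std(), age.mean() + age.std()
missing = age.isnull()
filled_random = age.copy()
filled_random.loc[missing] = rng.uniform(low, high, size=int(missing.sum()))

print(filled_mean.iloc[2], filled_median.iloc[2], filled_mode.iloc[2])
```

The median is less sensitive to extreme ages than the mean, which is one reason it is a common default for skewed numeric columns.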

> Keep in mind that more complexity may be introduced by contextualizing the replacement values. For example, we could replace NULL age with the *mean age of passengers from a specific port*.
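A minimal sketch of that contextual replacement, assuming a frame with `Embarked` and `Age` columns like the Titanic data. The values in `toy` are made up, and `port_mean_age` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Embarked": ["S", "S", "S", "C", "C", "Q"],
    "Age": [20.0, 40.0, np.nan, 50.0, np.nan, 25.0],
})

# Mean age per port, broadcast back to every row of that port.
port_mean_age = toy.groupby("Embarked")["Age"].transform("mean")

# Fill each missing age with the mean age of that passenger's own port.
toy["Age"] = toy["Age"].fillna(port_mean_age)

print(toy)
```

Here the missing age for port `S` becomes 30.0 and the one for port `C` becomes 50.0, instead of both receiving a single global value.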

For `Embarked`, we replace missing values with the mode, because the median and mean are not defined for categorical values.

In [4]:
for idx, df in enumerate(data_to_clean):
    # drop Cabin: most of its values are missing
    df.drop(columns='Cabin', inplace=True)

    # replace missing Age w/ median (assign back instead of a chained inplace fillna)
    df['Age'] = df['Age'].fillna(df['Age'].median())

    # replace missing Embarked w/ mode
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

    # drop the row with missing Fare and store the filtered frame back in the list
    # (the original data_raw_* frame still contains that row)
    df = df[df.Fare.notnull()]
    data_to_clean[idx] = df

    # sum null values along columns
    print(df.isnull().sum(axis=0))
    print(40*"=")

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64
PassengerId    0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64


### 3.4 Feature Engineering

### 3.5 Formatting

