## Titanic: Machine Learning from Disaster

The Kaggle contest can be found [here](https://www.kaggle.com/c/titanic).

## 1. Problem Statement

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. Because there were too few lifeboats, not everyone on board could survive. The likelihood of survival depended not only on luck, but also on factors such as age, gender, and social status.

In this challenge, we are given a binary classification problem: from a given set of people, predict who survived the shipwreck.

**Input**: a set of people and various properties such as age, gender, etc.

**Output**: whether or not each person survived

## 2. Gathering Data

Data has been provided by Kaggle.

[Training data](data/train.csv)

[Test data](data/test.csv)

[Gender Submission](data/gender_submission.csv)

## 3. Preparation

### 3.1 Imports

In [1]:
import pandas as pd         # data processing and analysis, modeled after R dataframes with SQL-like features
import matplotlib           # scientific and publication-ready visualization
import numpy as np          # foundational package for scientific computing
import scipy as sp          # scientific computing and advanced mathematics
import IPython
from IPython import display # pretty printing of dataframes in Jupyter notebook
import sklearn              # collection of machine learning algorithms

# Model Algorithms
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process

# Model Helpers
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics

# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
from pandas.plotting import scatter_matrix  # pandas.tools.plotting was removed in newer pandas releases

# Visualization Defaults

# show plots in Jupyter Notebook browser
%matplotlib inline
mpl.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 12,8

### 3.2 Loading Data

In [2]:
data_raw_training = pd.read_csv('data/train.csv')
data_raw_test = pd.read_csv('data/test.csv')

data_to_clean = [data_raw_training, data_raw_test]       # datasets to clean

print(data_raw_training.info())
data_raw_training.sample(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 132 | 133 | 0 | 3 | Robins, Mrs. Alexander A (Grace Charity Laury) | female | 47.0 | 1 | 0 | A/5. 3337 | 14.5 | NaN | S |
| 80 | 81 | 0 | 3 | Waelens, Mr. Achille | male | 22.0 | 0 | 0 | 345767 | 9.0 | NaN | S |
| 23 | 24 | 1 | 1 | Sloper, Mr. William Thompson | male | 28.0 | 0 | 0 | 113788 | 35.5 | A6 | S |
| 150 | 151 | 0 | 2 | Bateman, Rev. Robert James | male | 51.0 | 0 | 0 | S.O.P. 1166 | 12.525 | NaN | S |
| 619 | 620 | 0 | 2 | Gavey, Mr. Lawrence | male | 26.0 | 0 | 0 | 31028 | 10.5 | NaN | S |
| 39 | 40 | 1 | 3 | Nicola-Yarred, Miss. Jamila | female | 14.0 | 1 | 0 | 2651 | 11.2417 | NaN | C |
| 118 | 119 | 0 | 1 | Baxter, Mr. Quigg Edmond | male | 24.0 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C |
| 713 | 714 | 0 | 3 | Larsson, Mr. August Viktor | male | 29.0 | 0 | 0 | 7545 | 9.4833 | NaN | S |
| 706 | 707 | 1 | 2 | Kelly, Mrs. Florence "Fannie" | female | 45.0 | 0 | 0 | 223596 | 13.5 | NaN | S |
| 495 | 496 | 0 | 3 | Yousseff, Mr. Gerious | male | NaN | 0 | 0 | 2627 | 14.4583 | NaN | C |
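
The `info()` output above shows that `Age` (714 of 891 non-null), `Cabin` (204 of 891) and `Embarked` (889 of 891) contain missing values. A small helper can make those gaps explicit as counts and percentages. The following is only a sketch: `missing_summary` is a hypothetical helper name, and `demo` is a tiny stand-in frame built from three of the sampled rows rather than the full training set.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return missing-value counts and percentages per column, worst first.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    missing = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": missing,
        "percent": (100 * missing / len(df)).round(1),
    })
    # Keep only columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Stand-in data: passengers 24, 81 and 496 from the sample above
demo = pd.DataFrame({
    "PassengerId": [24, 81, 496],
    "Age": [28.0, 22.0, np.nan],
    "Cabin": ["A6", None, None],
    "Embarked": ["S", "S", "C"],
})

print(missing_summary(demo))
```

In the notebook itself, the same call would presumably be `missing_summary(data_raw_training)` and `missing_summary(data_raw_test)`.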


### 3.3 Data Cleaning

#### 3.3.1 Correcting

In this section, we identify incorrect values (e.g., age = 800) and correct them. Our strategy is to set incorrect values to `NaN` and find suitable replacements for them in the completion section.

In [5]:
# incorrect sex: blank only the Sex cell, not the entire row
data_raw_training.loc[~data_raw_training['Sex'].isin(["male", "female", "m", "f"]), "Sex"] = np.nan

# incorrect pclass (valid classes are 1, 2 and 3)
data_raw_training.loc[(data_raw_training.Pclass <= 0) | (data_raw_training.Pclass >= 4), "Pclass"] = np.nan

# incorrect ages
data_raw_training.loc[(data_raw_training.Age <= 0) | (data_raw_training.Age >= 100), "Age"] = np.nan
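
The cell above only touches the training set, while `data_to_clean` holds both datasets. One possible way to apply the same rules to every dataset is to wrap them in a function and loop over the list. This is a minimal sketch, not the notebook's own method: `mark_invalid_as_nan` is a hypothetical name, and the `demo` frame with its deliberately invalid row is made up for illustration.

```python
import numpy as np
import pandas as pd


def mark_invalid_as_nan(df):
    """Set implausible Sex, Pclass and Age values to NaN, in place.

    Hypothetical helper; thresholds mirror the cell above.
    """
    # Cast first so the integer Pclass column can hold NaN
    df["Pclass"] = df["Pclass"].astype("float64")

    # Only the offending cell is blanked, never the whole row
    df.loc[~df["Sex"].isin(["male", "female", "m", "f"]), "Sex"] = np.nan
    df.loc[(df["Pclass"] <= 0) | (df["Pclass"] >= 4), "Pclass"] = np.nan
    df.loc[(df["Age"] <= 0) | (df["Age"] >= 100), "Age"] = np.nan
    return df


# Made-up rows: the middle one is invalid in every checked column
demo = pd.DataFrame({
    "Sex": ["male", "unknown", "female"],
    "Pclass": [3, 7, 1],
    "Age": [22.0, 800.0, 47.0],
    "Fare": [9.0, 10.5, 14.5],
})

# In the notebook, the list would be data_to_clean instead of [demo]
for dataset in [demo]:
    mark_invalid_as_nan(dataset)

print(demo)
```

Because the function modifies each frame in place, the cleaned values remain visible through `data_raw_training` and `data_raw_test` after the loop.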
