### What is this notebook for?

This is my first foray into machine learning (apart from when I trained a useless model on customer feedback last year).

As appears to be tradition, I'll be attempting the Titanic prediction competition, hosted on Kaggle: https://www.kaggle.com/c/titanic

From the blurb:

_The Challenge_

_The sinking of the Titanic is one of the most infamous shipwrecks in history._

_On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew._

_While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others._

_In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc)_

I'm going to split this notebook into a number of sections:

- Get Data
- EDA
- Feature engineering
- Modelling
- Results


tl;dr

<img src="files/ralph.png">



### Prepare the environment

Let's get my most commonly used libraries ready. These are my go-to libraries for maths, data wrangling, and plotting.

In [6]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
%matplotlib inline

Ok, now let's import the libraries that will do all of the actual ~~magic~~ machine learning for us.

In [5]:
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold, StratifiedKFold, train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.svm import SVR, SVC
from sklearn.metrics import r2_score, mean_squared_error, confusion_matrix, classification_report, accuracy_score

### Get Data

Now we need to import the data. I've downloaded it into its own folder called `titanic`, which sits in the current working directory for this repo on my local machine.

In [53]:
import os
path = os.getcwd()+'/titanic/'

In [56]:
test = pd.read_csv(path+'test.csv')
train = pd.read_csv(path+'train.csv')
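If the files aren't where I think they are, `pd.read_csv` fails with a fairly long traceback. Here's a rough sketch of a small loader that checks for both files first. The function name `load_titanic` is my own invention for illustration, and it assumes the files are named `train.csv` and `test.csv` as above.

```python
from pathlib import Path

import pandas as pd


def load_titanic(data_dir):
    """Read train.csv and test.csv from data_dir, failing early if either is missing.

    Hypothetical helper: the name and behaviour are illustrative only.
    """
    data_dir = Path(data_dir)
    frames = {}
    for name in ("train", "test"):
        csv_path = data_dir / f"{name}.csv"
        if not csv_path.is_file():
            raise FileNotFoundError(f"Expected to find {csv_path}")
        frames[name] = pd.read_csv(csv_path)
    return frames["train"], frames["test"]


# Usage, assuming the same folder layout as in this notebook:
# train, test = load_titanic(Path.cwd() / "titanic")
```

Using `pathlib` also avoids hard-coding the `/` separator, which is a small portability win rather than a necessity.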

In [57]:
test.shape

(418, 11)

In [58]:
train.shape

(891, 12)

I'm also going to convert the column names to lower case.

In [82]:
test.columns = test.columns.str.lower()
train.columns = train.columns.str.lower()

Ok, now we have two pandas dataframes called `test` and `train`. We're not going to touch `test` until we start to do some modelling, so let's proceed to some EDA on `train`.
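The shapes above differ by one column (12 vs 11). Here's a minimal sketch of how to check which column `test` lacks. It uses two toy frames with made-up values (`toy_train` and `toy_test` are hypothetical names) rather than the real files.

```python
import pandas as pd

# Toy stand-ins for the real frames: made-up values, a subset of the real column names.
toy_train = pd.DataFrame({"passengerid": [1, 2], "survived": [0, 1], "pclass": [3, 1]})
toy_test = pd.DataFrame({"passengerid": [892, 893], "pclass": [3, 3]})

# Index.difference returns the labels present in the first index but not the second.
only_in_train = toy_train.columns.difference(toy_test.columns)
print(list(only_in_train))
```

On the real data the same call would be `train.columns.difference(test.columns)`. Going by the column listings in this notebook, I'd expect it to return only `survived`, i.e. the target we're trying to predict.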

### EDA

These are the descriptions of what each column in the dataframe means:

|  Variable | Definition | Key |
| :--- | :--- | :--- |
|  survived | Survival | 0 = No, 1 = Yes |
|  pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
|  sex | Sex |  |
|  age | Age in years |  |
|  sibsp | # of siblings / spouses aboard the Titanic |  |
|  parch | # of parents / children aboard the Titanic |  |
|  ticket | Ticket number |  |
|  fare | Passenger fare |  |
|  cabin | Cabin number |  |
|  embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

#### How clean is the data?

The first thing I want to know is how clean this data is. Is there missing data?

In [77]:
test

|  | passengerid | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
|  0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
|  1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
|  2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
|  3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
|  4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
|  ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
|  413 | 1305 | 3 | Spector, Mr. Woolf | male | NaN | 0 | 0 | A.5. 3236 | 8.0500 | NaN | S |
|  414 | 1306 | 1 | Oliva y Ocana, Dona. Fermina | female | 39.0 | 0 | 0 | PC 17758 | 108.9000 | C105 | C |
|  415 | 1307 | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | NaN | S |
|  416 | 1308 | 3 | Ware, Mr. Frederick | male | NaN | 0 | 0 | 359309 | 8.0500 | NaN | S |


In [94]:
train.columns

Index(['passengerid', 'survived', 'pclass', 'name', 'sex', 'age', 'sibsp',
       'parch', 'ticket', 'fare', 'cabin', 'embarked'],
      dtype='object')

In [91]:
train.describe()

|  | passengerid | survived | pclass | age | sibsp | parch | fare |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
|  count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
|  mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
|  std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
|  min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
|  25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
|  50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
|  75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
|  max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


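We have 12 columns here, but only 7 show up in the output of `train.describe()`, because by default it only summarises numeric columns. The other 5 (`name`, `sex`, `ticket`, `cabin`, and `embarked`) hold text. We're going to need to one-hot encode (ohe) the categorical ones before we can model with them.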
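To make the numeric vs. text split concrete, here's a small sketch on a toy frame (made-up rows, hypothetical name `toy`). It lists the columns that `describe()` skips and then one-hot encodes them with `pd.get_dummies`. Encoding only `sex` and `embarked` is my assumption for the example: free-text columns like `name` would produce one dummy column per distinct value, so they probably need different handling.

```python
import pandas as pd

# Toy frame with one numeric and two text columns (made-up rows).
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0],
    "sex": ["male", "female", "female"],
    "embarked": ["S", "C", "S"],
})

# By default describe() only summarises the numeric columns.
described = toy.describe().columns.tolist()

# The columns it skipped are the non-numeric ones.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

# One-hot encode: each category becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=non_numeric)

print(described)
print(non_numeric)
print(encoded.columns.tolist())
```

Each category value gets its own column (`sex_female`, `sex_male`, and so on), so the encoded frame contains only numeric or boolean columns that the scikit-learn models imported earlier can work with.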

In [96]:
train.count()

passengerid    891
survived       891
pclass         891
name           891
sex            891
age            714
sibsp          891
parch          891
ticket         891
fare           891
cabin          204
embarked       889
dtype: int64

There are missing values in `age`, `cabin`, and `embarked`.

In [99]:
train.age.isna().value_counts(normalize=True)

False    0.801347
True     0.198653
Name: age, dtype: float64

In [103]:
train.cabin.isna().value_counts(normalize=True)

True     0.771044
False    0.228956
Name: cabin, dtype: float64

In [89]:
train.embarked.isna().value_counts()

False    889
True       2
Name: embarked, dtype: int64
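Checking one column at a time gets repetitive, so here's a sketch that puts the missing-value count and share for every column into a single table. It runs on a toy frame with made-up values (`toy` and `missing` are hypothetical names). On the real data you would swap `toy` for `train`.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two of its three columns (made-up values).
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "cabin": [None, "C85", None, None],
    "embarked": ["S", "C", "S", "Q"],
})

# isna() gives a boolean frame; sum() counts the True values per column
# and mean() gives their share.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "share_missing": toy.isna().mean(),
})

# Keep only the columns with gaps, worst first.
missing = missing[missing["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(missing)
```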