# Titanic ML model

This notebook walks through my process of building an ML model to predict who died and who survived the sinking of the Titanic.

In [4]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

In [5]:
data = pd.read_csv("/Users/pedro/github/intro-statistical-learning/data/titanic/train.csv")
data.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


# Stratified sampling

During data exploration, passenger class (`Pclass`) seemed to be a rather influential feature, so I want to make sure each class is represented proportionally in both the training and test sets.

In [7]:
# split() yields positional indices, so select the rows with iloc
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=69)
for train_index, test_index in split.split(data, data.Pclass):
    strat_train_set = data.iloc[train_index]
    strat_test_set = data.iloc[test_index]

Let's check whether the proportions were maintained:

In [8]:
strat_test_set.Pclass.value_counts() / len(strat_test_set)

3    0.553073
1    0.240223
2    0.206704
Name: Pclass, dtype: float64

In [11]:
strat_train_set.Pclass.value_counts() / len(strat_train_set)

3    0.550562
1    0.242978
2    0.206461
Name: Pclass, dtype: float64

In [12]:
# Original proportions in the full dataset
data.Pclass.value_counts() / len(data)

3    0.551066
1    0.242424
2    0.206510
Name: Pclass, dtype: float64

Yes! All good!
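The three separate checks above can also be collapsed into one side-by-side table. The following is a minimal sketch, assuming a small synthetic frame (`toy`) in place of the real data; the helper name `class_proportions` and the `test_error_pct` column are hypothetical names introduced only for this illustration.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the Titanic data (an assumption, for illustration only).
toy = pd.DataFrame({"Pclass": [3] * 55 + [1] * 24 + [2] * 21})

# Same kind of stratified split as above, on the toy frame.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(split.split(toy, toy["Pclass"]))
toy_train, toy_test = toy.iloc[train_idx], toy.iloc[test_idx]


def class_proportions(frame: pd.DataFrame, column: str = "Pclass") -> pd.Series:
    """Return the share of rows that fall in each class of `column`."""
    return frame[column].value_counts(normalize=True)


# One row per class, one column per subset.
comparison = pd.DataFrame({
    "overall": class_proportions(toy),
    "train": class_proportions(toy_train),
    "test": class_proportions(toy_test),
}).sort_index()

# Relative deviation of the test set from the overall proportions, in percent.
comparison["test_error_pct"] = 100 * (comparison["test"] / comparison["overall"] - 1)
print(comparison)
```

Using `value_counts(normalize=True)` is equivalent to dividing the counts by the length of the frame, as done in the cells above.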

In [13]:
# Work on a copy of the training set so we can manipulate it freely
df = strat_train_set.copy()

In [23]:
# Share of rows with a missing Age; I may have to come back and drop them
df.Age.isna().sum() / len(df)

0.20224719101123595
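Roughly 20% of the ages are missing, and other columns may have gaps as well. A quick way to see the missing share for every column at once is `isna().mean()`. The sketch below uses a tiny synthetic frame (`toy`, an assumption made for illustration), not the real training set.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the training set, with a few gaps on purpose.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", None, "S", "S"],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05],
})

# isna() marks missing cells as True; the column mean is the missing fraction.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Running the same expression on `df` would list every column that needs imputation before encoding.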

# Transformations
Now let's deal with NaNs and encode our ordinal and categorical variables so that we can later try different algorithms on any of our features.

In [67]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# First, create the pipeline for numerical variables
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler())
])

# Now create the full pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

num_att = ['Age', 'SibSp', 'Parch', 'Fare']
cat_att = ['Sex', 'Embarked']
ord_att = ['Pclass']


full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_att),
    ('cat', OneHotEncoder(), cat_att),
    ('ord', OrdinalEncoder(), ord_att)
])

df_prepd = full_pipeline.fit_transform(df)

ValueError: Input contains NaN
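A likely cause of this error is that only the numerical columns pass through an imputer, while `Embarked` has a couple of missing values in the Kaggle training file and reaches `OneHotEncoder` unimputed. Whether the encoder rejects NaN depends on the scikit-learn version. One possible fix, shown here as a sketch rather than the final pipeline, is to give the categorical columns their own small pipeline with a `most_frequent` imputer. The frame `toy` and the names `cat_pipeline` and `toy_prepd` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Synthetic rows with a missing Age and a missing Embarked (illustration only).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "SibSp": [1, 1, 0, 1],
    "Parch": [0, 0, 2, 0],
    "Fare": [7.25, 71.28, 7.93, 53.10],
    "Sex": ["male", "female", "female", "female"],
    "Embarked": ["S", "C", np.nan, "S"],
    "Pclass": [3, 1, 3, 1],
})

num_att = ["Age", "SibSp", "Parch", "Fare"]
cat_att = ["Sex", "Embarked"]
ord_att = ["Pclass"]

# Numerical columns: fill gaps with the median, then standardize.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_att),
    ("cat", cat_pipeline, cat_att),
    ("ord", OrdinalEncoder(), ord_att),
])

toy_prepd = full_pipeline.fit_transform(toy)
print(toy_prepd.shape)
```

Imputing with the most frequent port is only one choice; dropping the affected rows or adding an explicit "missing" category would work as well.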

In [55]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the Sex column on its own

cat_attributes = ['Sex']

pipeline = ColumnTransformer([
    ('cat', OneHotEncoder(), cat_attributes)
])

df_prepd = pipeline.fit_transform(df)

In [56]:
df_prepd

array([[0., 1.],
       [0., 1.],
       [0., 1.],
       ...,
       [1., 0.],
       [0., 1.],
       [0., 1.]])
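The array alone does not say which column stands for which category. `OneHotEncoder` stores the categories in sorted order in its `categories_` attribute, so the fitted encoder can be queried through the `ColumnTransformer`. A minimal sketch on a three-row synthetic frame (`toy`, an assumption for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Tiny synthetic frame standing in for the training set.
toy = pd.DataFrame({"Sex": ["male", "female", "male"]})

pipeline = ColumnTransformer([
    ("cat", OneHotEncoder(), ["Sex"]),
])
encoded = pipeline.fit_transform(toy)

# The fitted encoder is available by the name given in the ColumnTransformer.
fitted_encoder = pipeline.named_transformers_["cat"]
column_labels = [str(c) for c in fitted_encoder.categories_[0]]
print(column_labels)  # ['female', 'male']
```

With this ordering, a row such as `[0., 1.]` in the output above corresponds to a male passenger and `[1., 0.]` to a female passenger.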