# Titanic ML model

This notebook will guide you through my process of creating an ML model to predict which passengers died and which survived the sinking of the Titanic.

In [278]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

In [5]:
data = pd.read_csv("/Users/pedro/github/intro-statistical-learning/data/titanic/train.csv")
submission_test_set = pd.read_csv("/Users/pedro/github/intro-statistical-learning/data/titanic/test.csv")
data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


# Stratified sampling

During data exploration, passenger class (Pclass) seemed to be a rather influential feature, so I want to make sure its classes are represented proportionally in both my training and test sets.

In [7]:
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=69)
for train_index, test_index in split.split(data, data.Pclass):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]

Let's check whether the proportions were maintained.

In [8]:
strat_test_set.Pclass.value_counts() / len(strat_test_set)

3    0.553073
1    0.240223
2    0.206704
Name: Pclass, dtype: float64

In [11]:
strat_train_set.Pclass.value_counts() / len(strat_train_set)

3    0.550562
1    0.242978
2    0.206461
Name: Pclass, dtype: float64

In [12]:
#Original proportions
data.Pclass.value_counts() / len(data)

3    0.551066
1    0.242424
2    0.206510
Name: Pclass, dtype: float64

Yes! All good!
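
The three checks above can also be collected into a single table. Here is a minimal sketch on a toy data set (the toy DataFrame and its values are made up for illustration and are not the real Titanic file). It relies on value_counts(normalize=True), which returns shares directly instead of dividing by the length by hand.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy stand-in for the Titanic data (hypothetical values, not the real file)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Pclass": rng.choice([1, 2, 3], size=500, p=[0.24, 0.21, 0.55]),
    "Survived": rng.integers(0, 2, size=500),
})

# Same kind of split as above: one stratified 80/20 split on Pclass
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(split.split(toy, toy["Pclass"]))
toy_train, toy_test = toy.iloc[train_idx], toy.iloc[test_idx]

# One table with the class shares of each set side by side
proportions = pd.DataFrame({
    "full": toy["Pclass"].value_counts(normalize=True),
    "train": toy_train["Pclass"].value_counts(normalize=True),
    "test": toy_test["Pclass"].value_counts(normalize=True),
}).sort_index()
print(proportions)
```

If the stratification worked, the three columns should differ only by rounding.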

In [116]:
#We create a copy of our training set to manipulate it as we wish
df = strat_train_set.copy()
df

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 456 | 457 | 0 | 1 | Millet, Mr. Francis Davis | male | 65.00 | 0 | 0 | 13509 | 26.5500 | E38 | S |
| 494 | 495 | 0 | 3 | Stanley, Mr. Edward Roland | male | 21.00 | 0 | 0 | A/4 45380 | 8.0500 | | S |
| 611 | 612 | 0 | 3 | Jardin, Mr. Jose Neto | male | | 0 | 0 | SOTON/O.Q. 3101305 | 7.0500 | | S |
| 136 | 137 | 1 | 1 | Newsom, Miss. Helen Monypeny | female | 19.00 | 0 | 2 | 11752 | 26.2833 | D47 | S |
| 850 | 851 | 0 | 3 | Andersson, Master. Sigvard Harald Elias | male | 4.00 | 4 | 2 | 347082 | 31.2750 | | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 281 | 282 | 0 | 3 | Olsson, Mr. Nils Johan Goransson | male | 28.00 | 0 | 0 | 347464 | 7.8542 | | S |
| 303 | 304 | 1 | 2 | Keane, Miss. Nora A | female | | 0 | 0 | 226593 | 12.3500 | E101 | Q |
| 882 | 883 | 0 | 3 | Dahlberg, Miss. Gerda Ulrika | female | 22.00 | 0 | 0 | 7552 | 10.5167 | | S |
| 378 | 379 | 0 | 3 | Betros, Mr. Tannous | male | 20.00 | 0 | 0 | 2648 | 4.0125 | | C |


# Transformations
Now let's deal with NaNs, scaling, and encoding our ordinal and categorical variables so that we can later test out different algorithms with any of our features.

This next block is a demonstration of how encoders are used. However, we will group all our encoding steps in a pipeline (see the Pipeline encoding block).

In [171]:
from sklearn.preprocessing import OneHotEncoder

df_cat = df[['Sex']]

cat_encoder = OneHotEncoder()
sex_1hot = cat_encoder.fit_transform(df_cat)

print(sex_1hot.shape)
cat_encoder.categories_
# The result is a sparse matrix of shape (712, 2), with female and male as the columns

(712, 2)


[array(['female', 'male'], dtype=object)]

In [196]:
# I'm trying out imputation on a categorical column
from sklearn.impute import SimpleImputer

imp = SimpleImputer(strategy='most_frequent')
embarked_imp = imp.fit_transform(df[['Embarked']])
type(embarked_imp)

# But how do I integrate the ndarray back into my df?

numpy.ndarray
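
One possible answer to the question in the cell above, as a sketch on a made-up toy column (toy, toy_filled and toy_rebuilt are hypothetical names, and the values are invented). The imputer returns an array of shape (n, 1), so it can either be flattened and assigned back to the column, or wrapped in a new DataFrame that reuses the original index.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with missing ports and a non-default index (hypothetical data)
toy = pd.DataFrame(
    {"Embarked": ["S", "C", np.nan, "S", "Q", np.nan]},
    index=[10, 11, 12, 13, 14, 15],
    dtype=object,
)

imp = SimpleImputer(strategy="most_frequent")
imputed = imp.fit_transform(toy[["Embarked"]])  # ndarray of shape (6, 1)

# Option 1: overwrite the column; ravel() flattens (n, 1) to (n,)
toy_filled = toy.copy()
toy_filled["Embarked"] = imputed.ravel()

# Option 2: rebuild a DataFrame, reusing the original index and column name
toy_rebuilt = pd.DataFrame(imputed, columns=["Embarked"], index=toy.index)

print(toy_filled)
```

Passing index=toy.index matters in the second option: without it the new frame gets a fresh 0..n-1 index and no longer lines up with the original rows.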

### Pipeline encoding

In [197]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

#First we create the pipeline for numerical variables
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler())
])

#Now we create the full pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

num_att = ['Age', 'SibSp', 'Parch', 'Fare']
cat_att = ['Sex']
ord_att = ['Pclass']


full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_att),
    #('cat_imp', SimpleImputer(missing_values = np.nan, strategy='most_frequent'), cat_att),
    ('cat', OneHotEncoder(), cat_att),
    ('ord', OrdinalEncoder(), ord_att)
])

df_prepd = full_pipeline.fit_transform(df)

# Embarked is left out for now because its NaN values could not be encoded.
# Follow the link for info on how to impute categorical variables with the most frequent value
# https://stackoverflow.com/questions/25239958/impute-categorical-missing-values-in-scikit-learn
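
One way to bring Embarked in, sketched under assumptions: give the categorical columns their own small pipeline that fills in the most frequent value before one-hot encoding. The toy data, the cat_pipeline and preprocess names, and the sparse_threshold=0 setting (used here only to force a dense array) are my additions for illustration and are not part of the pipeline above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Toy rows in the shape of the Titanic data (hypothetical values)
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 4.0, 28.0],
    "Fare": [7.25, 71.28, 7.92, 53.1, 31.27, 7.85],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Embarked": ["S", "C", "S", np.nan, "S", "Q"],
    "Pclass": [3, 1, 3, 1, 3, 2],
}).astype({"Sex": object, "Embarked": object})  # keep strings as plain objects

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Impute the most frequent category first, then one-hot encode
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", num_pipeline, ["Age", "Fare"]),
    ("cat", cat_pipeline, ["Sex", "Embarked"]),
    ("ord", OrdinalEncoder(), ["Pclass"]),
], sparse_threshold=0)  # always return a dense ndarray

prepared = preprocess.fit_transform(toy)
print(prepared.shape)
```

The output has 2 numeric columns, 2 for Sex, 3 for Embarked and 1 for Pclass, and no NaN is left in it.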


In [264]:
type(df_prepd)

numpy.ndarray

# Modeling

In [213]:
from sklearn.ensemble import RandomForestClassifier

forest_classifier = RandomForestClassifier()
forest_classifier.fit(df_prepd, df.Survived)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

## Let's cross validate our model using only the training set folded in 5

So the set is split into 5 subsets (folds). Each fold in turn is held out while the model is trained on the other 4, and we calculate how accurate the predictions on the held-out fold are (as the fraction of correct classifications).

cv defines the number of folds. I went for 5 because it's not a huge data set and it's the default in recent scikit-learn versions, but I'm not sure what the best number is.
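
To make the fold mechanics concrete, here is a rough sketch of what cross_val_score does under the hood, written as an explicit loop. It runs on synthetic data from make_classification rather than the Titanic set, and the names folds, manual_scores and auto_scores are mine. The random_state is fixed only so that both routes build identical forests.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data with 7 features (not the Titanic set)
X, y = make_classification(n_samples=300, n_features=7, random_state=0)
folds = StratifiedKFold(n_splits=5)

# Manual version: train on 4 folds, score on the held-out fold
manual_scores = []
for train_idx, val_idx in folds.split(X, y):
    fold_model = RandomForestClassifier(n_estimators=10, random_state=0)
    fold_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))

# Same thing in one call, using the same fold definition
clf = RandomForestClassifier(n_estimators=10, random_state=0)
auto_scores = cross_val_score(clf, X, y, scoring="accuracy", cv=folds)

print(np.round(manual_scores, 3))
print(np.round(auto_scores, 3))
```

Each sample ends up in the held-out fold exactly once, so every score is measured on data the fold's model never saw.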

In [239]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(forest_classifier, df_prepd, df.Survived,
                        scoring = 'accuracy', cv = 5)

# Follow this link to check out other scoring methods you can use with cross validation
# https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

In [243]:
print('Scores: ', scores, '\n Mean: ', scores.mean(), '\n Standard Deviation: ', scores.std())

Scores:  [0.81118881 0.87412587 0.76923077 0.80985915 0.78014184] 
 Mean:  0.8089092906893327 
 Standard Deviation:  0.03650000621016738


The model seems quite promising, but we should compare these numbers against other models or different hyperparameter settings.
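
Such a comparison could look roughly like this. It is only a sketch on synthetic data, and the choice of LogisticRegression as the second candidate is my assumption rather than something this notebook tests; the point is just that the same cross_val_score call can be reused for every candidate.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the Titanic set)
X, y = make_classification(n_samples=300, n_features=7, random_state=0)

# Candidate models to compare (the second one is an arbitrary example)
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=10, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

summary = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, scoring="accuracy", cv=5)
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```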

I can save this model using joblib.

In [254]:
import joblib

joblib.dump(forest_classifier, 'forest_classifier_titanic_woEmbark.pkl')

['forest_classifier_titanic_woEmbark.pkl']
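
Loading the model back is the mirror image, with joblib.load. A small sketch with a toy model and a temporary file (the file name toy_forest.pkl and the synthetic data are placeholders of mine, not the file saved above):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy model on synthetic data (not the Titanic forest)
X, y = make_classification(n_samples=100, n_features=7, random_state=0)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_forest.pkl")
    joblib.dump(clf, path)        # save the fitted model
    restored = joblib.load(path)  # load it back as a new object

# The restored model should predict exactly like the original
same_predictions = np.array_equal(clf.predict(X), restored.predict(X))
print(same_predictions)
```

Keep in mind that the saved file holds only the classifier. The fitted preprocessing pipeline would have to be saved as well to score new raw data later.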

In [258]:
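# Here's our test set prepared for this model, but that should come much later.
# Use transform (not fit_transform) so the test set is imputed and scaled
# with the statistics learned from the training set.
test_prepd = full_pipeline.transform(strat_test_set)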

# Fine-Tuning

In [262]:
from sklearn.model_selection import GridSearchCV
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

for_class_2 = RandomForestClassifier()

grid_search = GridSearchCV(for_class_2, param_grid, cv=5,
                          scoring='accuracy', return_train_score = True)

grid_search.fit(df_prepd, df.Survived)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=None,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid

In [263]:
grid_search.best_params_


{'max_features': 6, 'n_estimators': 10}

In [268]:
results = grid_search.cv_results_
for x, y in zip(results['mean_test_score'], results['params']):
    print(x, y)

0.776685393258427 {'max_features': 2, 'n_estimators': 3}
0.8103932584269663 {'max_features': 2, 'n_estimators': 10}
0.8146067415730337 {'max_features': 2, 'n_estimators': 30}
0.7808988764044944 {'max_features': 4, 'n_estimators': 3}
0.8103932584269663 {'max_features': 4, 'n_estimators': 10}
0.8089887640449438 {'max_features': 4, 'n_estimators': 30}
0.8061797752808989 {'max_features': 6, 'n_estimators': 3}
0.8202247191011236 {'max_features': 6, 'n_estimators': 10}
0.8160112359550562 {'max_features': 6, 'n_estimators': 30}
0.773876404494382 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
0.7963483146067416 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
0.7879213483146067 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
0.8061797752808989 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
0.8047752808988764 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
0.7991573033707865 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}
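
The same results are easier to scan as a sorted table, and the refitted winner is available directly as best_estimator_. A hedged sketch on synthetic data with a smaller grid than the one above (search, ranking and best_model are names I made up):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a reduced grid, for illustration only
X, y = make_classification(n_samples=200, n_features=7, random_state=0)
param_grid = {"n_estimators": [3, 10], "max_features": [2, 4]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=3, scoring="accuracy")
search.fit(X, y)

# cv_results_ is a dict of arrays, so it converts cleanly to a DataFrame
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
ranking = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")
print(ranking)

# With the default refit=True, the best combination is already refitted
# on the whole training data and ready to use
best_model = search.best_estimator_
```

Looking at std_test_score next to the mean helps to judge whether the gap between the top combinations is larger than the fold-to-fold noise.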


In [276]:
# Refit with the n_estimators value found above (max_features is left at its default here)
model = RandomForestClassifier(n_estimators=10)

model.fit(df_prepd, df.Survived)
predictions = model.predict(test_prepd)

(strat_test_set.Survived==predictions).sum()/len(strat_test_set)

0.7541899441340782

The model was able to predict the held-out test set with about 75% accuracy, a few points below the roughly 82% it reached in cross-validation. Not bad!
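
Accuracy alone does not say which kind of mistakes the model makes. As a sketch, sklearn.metrics offers accuracy_score (the same number as the manual sum above) and confusion_matrix; the labels below are invented toy values, not the actual predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented labels for illustration (0 = died, 1 = survived)
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]

acc = accuracy_score(y_true, y_pred)  # fraction of correct predictions
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print(acc)
print(cm)
```

In the matrix, the off-diagonal cells separate passengers wrongly predicted to survive from those wrongly predicted to die.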

# Final fit

Now we have to fit our model to all the data that is available to us, run the submission test set provided by Kaggle through the pipeline, and run predictions on it. Finally, for our submission we have to create a CSV file with two columns (PassengerId and Survived).

In [286]:
#Gather data
full_train = data.copy()

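# Fit the pipeline on the full training data, then apply the same fitted
# transformation to the submission set (transform, not fit_transform)
full_train_prepd = full_pipeline.fit_transform(full_train)
submission_set_prepd = full_pipeline.transform(submission_test_set)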

#Fit model
model.fit(full_train_prepd, full_train.Survived)
preds = model.predict(submission_set_prepd)
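
An alternative that makes it hard to mix up fit and transform is to chain preprocessing and model into one Pipeline, so that fit only ever sees training data and predict reuses what was learned. This is a sketch on invented toy rows with a reduced set of columns; the names train, test, preprocess and clf are mine and not the objects used above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented toy rows (not the real Kaggle files)
train = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, np.nan, 54.0, 2.0, 27.0],
    "Fare": [7.25, 71.28, 7.92, 53.1, 8.05, 51.86, 21.07, 11.13],
    "Sex": ["male", "female", "female", "female", "male", "male", "male", "female"],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1],
})
test = pd.DataFrame({
    "Age": [34.5, 47.0, np.nan],
    "Fare": [7.83, np.nan, 9.69],
    "Sex": ["male", "female", "male"],
})

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", num_pipeline, ["Age", "Fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
])

# Preprocessing and classifier as a single estimator
clf = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=10, random_state=0)),
])
clf.fit(train.drop(columns="Survived"), train["Survived"])
preds = clf.predict(test)  # test rows are only transformed, never fitted on

# The medians used for imputation come from the training rows alone
medians = clf.named_steps["preprocess"].named_transformers_["num"] \
             .named_steps["imputer"].statistics_
print(medians, preds)
```

A single pipeline object like this can also be passed to cross_val_score or GridSearchCV, which then refit the preprocessing inside every fold.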

In [314]:
#Put together in a Data Frame
submission_df = pd.DataFrame({'PassengerId': submission_test_set.PassengerId,
                             'Survived': preds})
submission_df

| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 0 |
| 2 | 894 | 0 |
| 3 | 895 | 1 |
| 4 | 896 | 0 |
| ... | ... | ... |
| 413 | 1305 | 0 |
| 414 | 1306 | 1 |
| 415 | 1307 | 0 |
| 416 | 1308 | 0 |


### Now we just have to save our DF as a csv file and we're ready to upload!

In [317]:
submission_df.to_csv('submission1.csv', index=False)