<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/fd/RMS_Titanic_3.jpg/1024px-RMS_Titanic_3.jpg" width=400/>



In this notebook we review the Titanic survival prediction problem.

The following topics are covered:
 - matplotlib/pandas basics
 - using sklearn for a classification problem (logistic regression, random forest, SVM)
 - using cross-validation for hyperparameter tuning
 - mixing different classifiers
 
The dataset can be downloaded from:
https://www.kaggle.com/francksylla/titanic-machine-learning-from-disaster

# Data loading

Download the data archive from GitHub with curl and unzip it.

In [None]:
!curl -L https://github.com/broutonlab/deep-learning-course/raw/mmcs-2020-fall/week02-ml_basics/data.zip > data.zip
!unzip ./data.zip -d ./titanic  

# Let's review the data

Import the modules used in this notebook.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
# Import the submodule explicitly so that sklearn.metrics.accuracy_score is available below
import sklearn.metrics
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [None]:
data = pd.read_csv("titanic/train.csv")
data.head()

In [None]:
print("Size of dataset: {}".format(len(data)))
print("Number of survived: {}".format(len(data[data.Survived == 1])))

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
data = data.fillna(0)
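
Filling every missing value with 0 also sets unknown ages to 0, which shifts the age distribution toward infants. A minimal sketch of a per-column alternative is shown below on a small made-up frame (the `toy` DataFrame and its values are illustrative assumptions, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the kind of gaps found in train.csv.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0, np.nan],
    "Cabin": ["C85", None, None, "E46", None],
})

# Numeric column: fill gaps with the median of the observed values.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
# Categorical column: fill gaps with an explicit "Unknown" label.
toy["Cabin"] = toy["Cabin"].fillna("Unknown")

print(toy)
```

Whether the median, the mean, or a model-based estimate works best depends on the data; the rest of the notebook continues with the zero-filled frame.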

In [None]:
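fig, ax = plt.subplots(2, 1, figsize=(5, 5))

# distplot is deprecated in recent seaborn releases; histplot is its replacement
sns.histplot(data['Age'], kde=True, ax=ax[0])
sns.histplot(data['Pclass'], discrete=True, ax=ax[1])
fig.tight_layout()
plt.show()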

In [None]:
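# Size each sector by passenger count; summing PassengerId would give meaningless sector sizes
fig = px.sunburst(data.assign(Count=1), path=['Sex', 'Survived'], values='Count')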
fig.show()

# Data preparation

In [None]:
data.Sex = data.Sex.map({'female':0,'male':1})

Let's check that the dataset contains no NaNs.

In [None]:
data.isna().any()

Everything looks fine. Let's create the feature and label tables for model training.

In [None]:
features = data[['Sex', 'Age', "SibSp", "Parch", "Fare"]].values

In [None]:
train_X, test_X, train_y, test_y = train_test_split(features, data['Survived'].tolist(), test_size=0.33)
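
Since the two classes are not equally frequent, a purely random split can leave the train and test parts with different survival rates, and the result changes on every run. A small sketch of a reproducible, stratified split follows; the arrays `X_demo` and `y_demo` are synthetic stand-ins, and the seed value is an arbitrary choice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and labels: 40% positives.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 40 + [0] * 60)

# stratify keeps the class ratio equal in both parts;
# random_state fixes the shuffle so the split is repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

print(y_tr.mean(), y_te.mean())
```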

# Model training

We want to train a binary classification model. Let's start with logistic regression.
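
Logistic regression turns a weighted sum of the features into a probability with the sigmoid function. The rough sketch below reproduces sklearn's predicted probabilities by hand on a tiny synthetic problem (the data and the helper `sigmoid` are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic problem: one feature, label is 1 when the feature is large.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X_demo, y_demo)

def sigmoid(z):
    """Map a real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Manual probability of class 1: sigmoid(w * x + b).
z = X_demo @ model.coef_.ravel() + model.intercept_[0]
manual = sigmoid(z)

# Compare with the probabilities returned by sklearn.
print(np.allclose(manual, model.predict_proba(X_demo)[:, 1]))
```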

<img src="https://miro.medium.com/max/770/1*RqXFpiNGwdiKBWyLJc_E7g.png" width=600/>


In [None]:
clf_log_reg = LogisticRegression()

In [None]:
clf_log_reg.fit(train_X, train_y)

clf_log_reg.coef_

In [None]:
preds_logr_reg = clf_log_reg.predict_proba(test_X)[:,1]

In [None]:
# accuracy_score expects (y_true, y_pred)
sklearn.metrics.accuracy_score(test_y, preds_logr_reg > 0.5)

In [None]:
clf_rf = RandomForestClassifier(n_estimators=100, max_depth=5)

In [None]:
clf_rf.fit(train_X, train_y)
preds_rf = clf_rf.predict_proba(test_X)[:,1]
sklearn.metrics.accuracy_score(test_y, preds_rf > 0.5)

# Mixing of classifiers

In [None]:
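alphas = np.linspace(0, 1, 30)
acc_list = []
for alpha in alphas:
    # alpha is the weight of logistic regression, (1 - alpha) the weight of random forest
    preds = alpha * preds_logr_reg + (1 - alpha) * preds_rf
    acc = sklearn.metrics.accuracy_score(test_y, preds > 0.5)
    acc_list.append(acc)
plt.plot(alphas, acc_list)
plt.xlabel("alpha (weight of logistic regression)")
plt.ylabel("test accuracy")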
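
The loop above evaluates each mixing weight on the test set, so the best accuracy it finds is optimistically biased. One possible alternative is to choose the weight on out-of-fold predictions computed from the training data only. The sketch below uses synthetic data from `make_classification` as a stand-in for `train_X` / `train_y`; the model settings are arbitrary assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the training part of the data.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

# Out-of-fold probabilities: every row is scored by a model that never saw it.
p_lr = cross_val_predict(LogisticRegression(max_iter=1000), X_demo, y_demo,
                         cv=5, method="predict_proba")[:, 1]
p_rf = cross_val_predict(RandomForestClassifier(n_estimators=50, random_state=0),
                         X_demo, y_demo, cv=5, method="predict_proba")[:, 1]

# Pick the weight with the best out-of-fold accuracy.
alphas = np.linspace(0, 1, 11)
scores = [accuracy_score(y_demo, (a * p_lr + (1 - a) * p_rf) > 0.5) for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]
print(best_alpha, max(scores))
```

The chosen weight can then be applied once to the test predictions to get an unbiased accuracy estimate.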

# Tuning of hyperparameters

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
param_grid = { 
    'n_estimators': [50, 100, 200],
    'max_depth' : [4, 7, 9, 13],
    'criterion' :['gini', 'entropy']
}

CV_rfc = GridSearchCV(estimator=RandomForestClassifier(), param_grid=param_grid, cv=5)
CV_rfc.fit(train_X, train_y)

In [None]:
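# GridSearchCV refits the best model on the whole training set by default (refit=True),
# so no extra fit call is needed
print(CV_rfc.best_params_)
CV_rfc.best_estimator_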

In [None]:
preds_rf_tuned = CV_rfc.best_estimator_.predict_proba(test_X)[:,1]
sklearn.metrics.accuracy_score(test_y, preds_rf_tuned > 0.5)
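
Besides the best estimator, the fitted search object keeps the cross-validation score of every parameter combination in `cv_results_`, which helps to see how sensitive the model is to each setting. A short sketch on synthetic data (the grid and dataset below are deliberately small assumptions, not the ones used above):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"max_depth": [2, 4], "criterion": ["gini", "entropy"]},
    cv=3,
)
search.fit(X_demo, y_demo)

# One row per parameter combination, best combination first.
results = pd.DataFrame(search.cv_results_)[
    ["param_max_depth", "param_criterion",
     "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)
print(search.best_params_, search.best_score_)
```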

# Home task

- **3 points**: tune another classifier (e.g. XGBoost).
- **7 points**: mix XGBoost, random forest and logistic regression, and find the optimal proportion between them. Don't forget to use cross-validation.
- **20 points**: build a model that places in the top 20% of the leaderboard on the unseen dataset here: https://www.kaggle.com/c/titanic/leaderboard.