<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/fd/RMS_Titanic_3.jpg/1024px-RMS_Titanic_3.jpg" width=400/>



In this notebook we will review the Titanic problem.

The following topics will be covered:
 - matplotlib/pandas basics
 - using sklearn for a classification problem (logistic regression, random forest, SVM)
 - using cross-validation for hyperparameter tuning
 - mixing different classifiers
 
The dataset can be downloaded from here:
https://www.kaggle.com/francksylla/titanic-machine-learning-from-disaster

# Authorization

We use the MLSage API for remote checking of your solution. You can skip this block if you do not want your results to be checked.

Authorization page: http://178.62.239.103:3004/app/sign-up

First, install the MLSage client API to simplify communication:

In [None]:
!pip install git+https://github.com/myurushkin/mlsage-pyclient

In [None]:
import getpass
import mlsage_pyclient

login = 'm.yurushkin@gmail.com'
password = getpass.getpass(prompt="MLSage password:")
course = 1
lessonId=1
token = mlsage_pyclient.signIn(login, password)

In [None]:
print(mlsage_pyclient.getSolutions(token))
print(mlsage_pyclient.sendSolution(token, courseId=1, lessonId=3, exerciseId=1, data={"data": 1}))

# Basics

In [None]:
import numpy as np
import pandas as pd
import sklearn.metrics  # explicit import so sklearn.metrics.accuracy_score is available below
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [None]:
data = pd.read_csv("titanic/train.csv")
data.head()

In [None]:
print("Size of dataset: {}".format(len(data)))
print("Number of survivors: {}".format(len(data[data.Survived == 1])))
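
The counts above can be broken down by group with `groupby`. The sketch below uses a tiny hand-made frame (the `toy` values are invented for illustration, not taken from the dataset) that mirrors the `Sex` and `Survived` columns; on the real data the same call would be made on `data`.

```python
import pandas as pd

# Toy stand-in for the Titanic frame; values are made up for illustration.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate.
survival_rate = toy.groupby("Sex")["Survived"].mean()
print(survival_rate)
```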

In [None]:
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
plt.title("Age distribution")
data["Age"].plot.hist()

In [None]:
data["Pclass"].plot.hist()

In [None]:
data.Sex = data.Sex.map({'female':0,'male':1})

In [None]:
data = data.fillna(0)

In [None]:
data.isna().any()
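
Filling every gap with 0 is simple, but for a column like `Age` a zero is a real (and unlikely) value. One common alternative, shown here only as an option and not as what this notebook does, is to fill numeric gaps with the column median. The frame `toy` is invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Fare": [7.25, 71.28, np.nan, 8.05],
})

# Count missing values per column before filling.
missing_per_column = toy.isna().sum()

# Fill each numeric column with its own median (NaNs are skipped when computing it).
filled = toy.fillna(toy.median(numeric_only=True))

print(missing_per_column)
print(filled)
```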

In [None]:
features = data[['Sex', 'Age', "SibSp", "Parch", "Fare"]].values

In [None]:
train_X, test_X, train_y, test_y = train_test_split(features, data['Survived'].tolist(), test_size=0.2)

In [None]:
clf_log_reg = LogisticRegression()

In [None]:
clf_log_reg.fit(train_X, train_y)

clf_log_reg.coef_

In [None]:
preds_logr_reg = clf_log_reg.predict_proba(test_X)[:,1]

In [None]:
# accuracy_score expects (y_true, y_pred); probabilities are thresholded at 0.5
sklearn.metrics.accuracy_score(test_y, preds_logr_reg > 0.5)

In [None]:
clf_rf = RandomForestClassifier(n_estimators=100, max_depth=13)

In [None]:
clf_rf.fit(train_X, train_y)
preds_rf = clf_rf.predict_proba(test_X)[:,1]
sklearn.metrics.accuracy_score(test_y, preds_rf > 0.5)

# Mixing of classifiers

In [None]:
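alphas = np.linspace(0, 1, 10)
acc_list = []
for alpha in alphas:
    # Weighted average of the two models' predicted probabilities
    preds = alpha * preds_logr_reg + (1 - alpha) * preds_rf
    acc = sklearn.metrics.accuracy_score(test_y, preds > 0.5)
    acc_list.append(acc)
plt.plot(alphas, acc_list)
plt.xlabel("alpha (weight of logistic regression)")
plt.ylabel("accuracy")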

# Tuning of hyperparameters

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
param_grid = { 
    'n_estimators': [50, 100, 200],
    'max_depth' : [4, 7, 9, 13],
    'criterion' :['gini', 'entropy']
}

CV_rfc = GridSearchCV(estimator=RandomForestClassifier(), param_grid=param_grid, cv=5)
CV_rfc.fit(train_X, train_y)

In [None]:
# GridSearchCV refits the best estimator on the whole training set by default
# (refit=True), so no additional fit call is needed.
CV_rfc.best_estimator_

In [None]:
preds_rf_tuned = CV_rfc.best_estimator_.predict_proba(test_X)[:,1]
sklearn.metrics.accuracy_score(test_y, preds_rf_tuned > 0.5)
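
Beyond `best_estimator_`, a fitted `GridSearchCV` also exposes the winning parameter combination and its mean cross-validated score. The sketch below is a minimal, self-contained illustration on synthetic data from `make_classification` (an assumption made so it runs without the Titanic files) with a deliberately small grid named `small_grid`; the attribute names are the same for `CV_rfc`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data so the example runs on its own.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

small_grid = {"n_estimators": [10, 20], "max_depth": [2, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=3)
search.fit(X, y)

print(search.best_params_)  # parameter combination with the best mean CV score
print(search.best_score_)   # mean cross-validated accuracy of that combination

# One row per parameter combination, best first.
results = pd.DataFrame(search.cv_results_).sort_values("rank_test_score")
print(results[["params", "mean_test_score", "std_test_score"]])
```

Comparing `best_score_` with the accuracy on the held-out test split gives a rough sense of whether the search overfitted to the validation folds.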

# Home task

- **3 points**: try tuning another classifier (e.g. xgboost).
- **7 points**: try to mix xgboost, random forest and logistic regression. Find the optimal proportion between them. Don't forget to use cross-validation.
- **20 points**: create a model that places you in the top 20% of data scientists on the unseen dataset here: https://www.kaggle.com/c/titanic/leaderboard.
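
For the mixing task, one possible approach (a sketch, not a reference solution) is to collect out-of-fold probabilities with `cross_val_predict` and then scan weight triples that sum to 1. To stay self-contained, the code below uses synthetic data and substitutes sklearn's `GradientBoostingClassifier` for xgboost; both substitutions are assumptions for illustration, and the helper name `oof_proba` is hypothetical.

```python
import itertools
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data so the example runs on its own.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

def oof_proba(model):
    """Out-of-fold probability of class 1 from 5-fold cross-validation."""
    return cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]

probas = [
    oof_proba(LogisticRegression(max_iter=1000)),
    oof_proba(RandomForestClassifier(n_estimators=50, random_state=0)),
    oof_proba(GradientBoostingClassifier(random_state=0)),  # stand-in for xgboost
]

# Scan weights (w_lr, w_rf, w_gb) on a grid with step 0.1 that sum to 1.
best_acc, best_weights = -1.0, None
for i, j in itertools.product(range(11), repeat=2):
    if i + j > 10:
        continue
    weights = (i / 10, j / 10, (10 - i - j) / 10)
    blend = sum(w * p for w, p in zip(weights, probas))
    acc = accuracy_score(y, (blend > 0.5).astype(int))
    if acc > best_acc:
        best_acc, best_weights = acc, weights

print(best_weights, best_acc)
```

Because the weights are chosen on the same out-of-fold predictions used to score them, `best_acc` is somewhat optimistic; a separate held-out split would give a fairer estimate.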