# Choosing a model for the Titanic competition
*Anders Poirel - 14-10-2019*

We've seen how to prepare the data for the Titanic dataset and are ready to fit some models. 
Import the data we pre-processed last time:

In [None]:
import pandas as pd
import numpy as np

train = pd.read_csv('train_clean.csv')
test = pd.read_csv('test_clean.csv')

In [None]:
train.head()

The dataset is already clean, but some pre-processing remains. You won't be using the `Name`, `Ticket` and `Title` features, and the code below also drops `Embarked`. You need to create dummy variables for `Sex`. Finally, while you might not be interested in the specific cabin number, whether a passenger has a cabin at all is likely to be informative.

In [None]:
def has_cabin(x):
    if pd.isna(x):
        return 0
    else:
        return 1

def pre_process(dataset):
    dataset = pd.get_dummies(dataset, columns = ['Sex'], drop_first = True)
    dataset['HasCabin'] = dataset['Cabin'].apply(has_cabin)
    dataset.drop(['Embarked', 'Cabin', 'Name', 'Ticket', 'Title'], inplace = True, axis = 1)
    return dataset

In [None]:
train = pre_process(train)
X_test = pre_process(test)

y_train = train['Survived']
X_train = train.drop('Survived', axis = 1)
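To see what the two column transformations in `pre_process` do, here is a minimal sketch on a made-up three-row frame. The values are invented for illustration and are not taken from the Titanic data; the `notna` call is an assumed vectorized equivalent of `has_cabin`, not the notebook's own code.

```python
import pandas as pd

# Toy frame with invented values, only to illustrate the two transformations
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Cabin': ['C85', None, 'E46'],
})

# drop_first=True keeps a single indicator column, Sex_male;
# Sex_female is dropped because it is implied by Sex_male
toy = pd.get_dummies(toy, columns=['Sex'], drop_first=True)

# Vectorized equivalent of applying has_cabin row by row
toy['HasCabin'] = toy['Cabin'].notna().astype(int)

print(toy)
```

Dropping one of the two dummy columns avoids feeding the model two perfectly correlated features.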

## Logistic Regression

We adapt regression so that the output values fall between 0 and 1, and interpret them as the probability that a point belongs to the positive category. One function that achieves this is the logistic function (see ISL [1]):
$$p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_n X_n}}{1 + e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_n X_n}} $$

We then seek the coefficients that maximize the likelihood function:
$$(\hat{\beta}_0, \ldots, \hat{\beta}_n) = \operatorname{argmax}_{\beta_0, \ldots, \beta_n} \prod_{i: y_i =1}p(x_i) \prod_{i: y_i = 0}(1-p(x_i))$$

To make predictions, we classify points with predicted probability below 0.5 as 0 and those with predicted probability above 0.5 as 1.
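As a small worked example of the formula and the 0.5 threshold, the sketch below evaluates the logistic function for one point. The coefficients and feature values are hypothetical, chosen only for illustration. It uses the form $1/(1+e^{-z})$, which is algebraically equal to $e^{z}/(1+e^{z})$.

```python
import math

def logistic(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients and feature values, chosen only for illustration
beta_0 = -1.0
betas = [0.8, -0.5]
x = [2.0, 1.0]

# Linear predictor: beta_0 + beta_1 * x_1 + ... + beta_n * x_n
z = beta_0 + sum(b * x_j for b, x_j in zip(betas, x))

p = logistic(z)               # predicted probability of the positive class
label = 1 if p > 0.5 else 0   # threshold at 0.5

print(round(p, 3), label)
```

Here the linear predictor is about 0.1, so the predicted probability is slightly above 0.5 and the point is assigned to class 1.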

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

The following syntax can be reused across most of the `sklearn` and `keras` APIs (this is the `estimator` interface):

In [None]:
model_1 = LogisticRegression()
model_1.fit(X_train, y_train)

### Evaluating model performance

To evaluate the performance of our model we will be using the accuracy metric. As a reminder, the accuracy of an estimate $\mathbf{\hat{y}}$ on a classification task is defined as:
$$\text{Accuracy}(\mathbf{y}, \mathbf{\hat{y}}) = \frac{1}{n}\sum_{i =1}^{n} \mathbb{1}(y_i = \hat{y}_i)$$
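The formula can be checked by hand on a few labels. The sketch below is a plain-Python illustration with invented labels; the helper name `accuracy` is hypothetical and is not part of scikit-learn, whose equivalent is `accuracy_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Invented labels for illustration
y_true = [1, 0, 1, 1, 0]
y_hat = [1, 0, 0, 1, 1]

# 3 of the 5 predictions match the true labels
print(accuracy(y_true, y_hat))
```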

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
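# The test set has no Survived labels, so score on the training data for now
# (this estimate is optimistic, since the model has already seen these rows)
y_pred = model_1.predict(X_train)
accuracy_score(y_train, y_pred)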
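Since labels are only available for the training data, one common way to estimate accuracy on unseen passengers is to hold out part of the labeled rows before fitting. The sketch below shows this with `train_test_split` on synthetic data from `make_classification`, which stands in for the pre-processed Titanic features; the variable names and the 80/20 split are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the pre-processed Titanic features and labels
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

# Hold out 20% of the labeled rows; stratify keeps the class balance similar
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_fit, y_fit)

# Accuracy on rows the model has not seen during fitting
val_accuracy = accuracy_score(y_val, model.predict(X_val))
print(val_accuracy)
```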

## Decision Tree

Find the documentation for the API [here](https://scikit-learn.org/stable/modules/tree.html). 
Refer to ISL [2] for more information on decision trees.

Before defining and fitting the model, you will want to read up on the different hyperparameters the model takes in the scikit-learn documentation. 

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

In [None]:
# Try fitting the model on your own and making predictions.
# Store the fitted model in a variable named model2, which the cells below use.

# Your code here


Now we will try to get a better estimate of how the model performs using cross-validation, which you may remember from our basic statistics workshop (otherwise, refer to ISL [3]).

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model2, X_train, y_train, cv = 10, scoring = 'accuracy')

In [None]:
np.mean(scores)
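To make it clearer what `cross_val_score` does, here is a rough sketch of the same procedure written as an explicit loop. It uses synthetic data and 5 folds to keep it short, and it assumes stratified folds, which is what scikit-learn uses for a classifier when `cv` is an integer.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook would use X_train and y_train instead
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
base_model = DecisionTreeClassifier(max_depth=3, random_state=0)

manual_scores = []
folds = StratifiedKFold(n_splits=5)
for fit_idx, val_idx in folds.split(X, y):
    fold_model = clone(base_model)            # fresh, unfitted copy per fold
    fold_model.fit(X[fit_idx], y[fit_idx])    # train on 4 of the 5 folds
    y_hat = fold_model.predict(X[val_idx])    # predict the held-out fold
    manual_scores.append(accuracy_score(y[val_idx], y_hat))

library_scores = cross_val_score(base_model, X, y, cv=5, scoring='accuracy')
print(np.mean(manual_scores), np.mean(library_scores))
```

Each row is used for validation exactly once, so the mean score is less dependent on one particular split than a single hold-out estimate.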

Use this to compare the performance of your decision tree and logistic regression model. 

In [None]:
# Your code here

You could also try playing around with the hyperparameters of the model to see if you can improve the cross-validation score.

In [None]:
import graphviz

The resulting decision tree can be visualized as follows:

In [None]:
dot_data = tree.export_graphviz(model2, out_file = None)
graph = graphviz.Source(dot_data)
graph
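If Graphviz is not installed, a plain-text view of the tree is a possible alternative. The sketch below uses `export_text` from `sklearn.tree` on a small tree fitted to synthetic data; the feature names are hypothetical placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with hypothetical feature names
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
feature_names = ['f0', 'f1', 'f2', 'f3']

# A shallow tree keeps the printed rules readable
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# One indented line per split or leaf
rules = export_text(small_tree, feature_names=feature_names)
print(rules)
```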

## Hyperparameter tuning

### Difference between model parameters and hyperparameters
Parameters are the values that are calculated during model training, e.g. the weights $\beta_0, \ldots, \beta_n$ in logistic regression. 

Hyperparameters, on the other hand, are values that determine how exactly a training algorithm is carried out, for instance the maximum depth of a decision tree, or the number of features the decision tree is allowed to consider at each split.
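In scikit-learn the distinction shows up in where the values live: hyperparameters are passed to the constructor, while learned parameters appear as attributes ending in an underscore after `fit`. A minimal sketch on synthetic data, with hyperparameter values picked arbitrarily for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Hyperparameters: chosen by us, passed to the constructor before training
clf = LogisticRegression(C=1.0, max_iter=1000)
tree_clf = DecisionTreeClassifier(max_depth=3, max_features='sqrt', random_state=0)
print(tree_clf.get_params()['max_depth'])   # the value we set

# Parameters: learned from the data, available only after fit
clf.fit(X, y)
print(clf.intercept_)   # beta_0
print(clf.coef_)        # beta_1, ..., beta_n
```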

### Cross-validation search

Earlier, you might have tried to improve your model by modifying its hyperparameters. A more systematic way of doing so is to specify a set of hyperparameter values to explore, and find the best-performing ones by cross-validating the model for each combination.

In [None]:
from sklearn.model_selection import GridSearchCV

params = {'criterion' : ['gini', 'entropy'],
          'max_depth' : [3,5,10],
          'max_features' : ['sqrt', 'log2'],
          'min_samples_leaf' : [1,2,5]
          }
tuned_model = GridSearchCV(estimator = DecisionTreeClassifier(),
                           scoring = 'accuracy',
                           param_grid = params,
                           cv = 10,
                           n_jobs = -1)

In [None]:
tuned_model.fit(X_train, y_train)
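It helps to know how much work a grid search involves before running it. The sketch below enumerates the combinations in the grid above using only the standard library; it illustrates the counting and is not how `GridSearchCV` is implemented internally.

```python
from itertools import product

# Same grid and number of folds as in the cell above
params = {'criterion': ['gini', 'entropy'],
          'max_depth': [3, 5, 10],
          'max_features': ['sqrt', 'log2'],
          'min_samples_leaf': [1, 2, 5]}
cv = 10

# Every combination of one value per hyperparameter
combinations = [dict(zip(params, values)) for values in product(*params.values())]

n_combinations = len(combinations)   # 2 * 3 * 2 * 3 = 36
n_fits = n_combinations * cv         # 360 cross-validation fits

print(n_combinations, n_fits)
print(combinations[0])
```

The number of fits grows multiplicatively with every hyperparameter added, which is why grids are usually kept small.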

You can access the best set of parameters after a grid search as follows:

In [None]:
tuned_model.best_params_

You can now use these parameters to fit a definitive model on the entire dataset. Alternatively, you could also use them as the starting point of a new grid search to see if you can eke out even more performance.

You can see how well the best set of parameters performs using:

In [None]:
tuned_model.best_score_

Compare this to the cross-validation score of the untuned model.
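One possible way to set up that comparison is sketched below on synthetic data, with a deliberately small grid and 5 folds so that it runs quickly. All names and values here are illustrative assumptions; the notebook's own grid and data would replace them.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Baseline: default hyperparameters, scored with the same 5-fold scheme
untuned = DecisionTreeClassifier(random_state=0)
untuned_score = cross_val_score(untuned, X, y, cv=5, scoring='accuracy').mean()

# Small grid that includes the default max_depth=None
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={'max_depth': [2, 3, 5, None]},
                      scoring='accuracy', cv=5)
search.fit(X, y)

print(untuned_score, search.best_score_, search.best_params_)

# With the default refit=True, best_estimator_ is already refitted on all of X, y
predictions = search.best_estimator_.predict(X[:5])
```

Because the grid contains the default setting, the best cross-validation score cannot be lower than the untuned one here. Keep in mind that the best score was selected on the same folds, so it is a somewhat optimistic estimate.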

## Random Forests & Ensemble Models

Find the documentation for Random Forests [here]().
You can read more about the theory behind the random forest algorithm in ISL [4]. A random forest is an ensemble model, which means that it combines the predictions of several simpler models.

Here, we build a number of decision trees, each in a slightly different way, and then combine them by majority voting, i.e. we classify an instance according to how the majority of the decision trees we built would classify it. 
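The idea can be sketched by hand with a few lines of code. This is a simplified illustration on synthetic data, not scikit-learn's implementation: each tree is trained on a bootstrap sample and considers a random subset of features at each split, and predictions are combined by a hard majority vote. Note that `RandomForestClassifier` itself averages the trees' predicted probabilities rather than counting votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
rng = np.random.default_rng(0)

trees = []
for i in range(25):  # odd number of trees, so a binary vote cannot tie
    # Bootstrap sample: draw n row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # max_features='sqrt' makes each split consider a random subset of features
    t = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    trees.append(t.fit(X[idx], y[idx]))

# One row of 0/1 votes per tree, one column per instance
votes = np.array([t.predict(X) for t in trees])

# Majority vote: predict 1 when more than half of the trees vote 1
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)

# Agreement with the labels the trees were trained on (an optimistic figure)
print((forest_pred == y).mean())
```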



In [None]:
from sklearn.ensemble import RandomForestClassifier 

In [None]:
# Try tuning a random forest model on your own
# your code here

## References

[1] *An Introduction to Statistical Learning*, 4.3 Logistic Regression, pp. 130-137

[2] *An Introduction to Statistical Learning*, 8.1 The Basics of Decision Trees, pp. 303-314

[3] *An Introduction to Statistical Learning*, 5.1 Cross-Validation, pp. 176-183

[4] *An Introduction to Statistical Learning*, 8.2.1 Bagging & 8.2.3 Random Forests, pp. 316-320