> **Note:** In most sessions you will be solving exercises posed in a Jupyter notebook that looks like this one. Because you are cloning a GitHub repository that only we can push to, you should **NEVER EDIT** any of the files you pull from GitHub. Instead, either make a new notebook and write your solutions there, or **make a copy of this notebook and save it somewhere else** on your computer, outside the `sds` folder that you cloned, and write your answers in the copy. If you edit the notebook you pulled from GitHub, those edits (possibly your solutions to the exercises) may be overwritten and lost the next time you pull from GitHub. This is important, so don't hesitate to ask if it is unclear.

In [1]:
import matplotlib.pyplot as plt 
import pandas as pd 
import numpy as np
import seaborn as sns 

# Exercise Set 14: Classification Trees

*Afternoon, August 21, 2018*

In Exercise Set 14 we will explore some of the most widely used tree-based classifiers, beginning with simple decision tree classifiers.


> **Ex. 14.1.1:** In this exercise set we will work with the titanic dataset. Run the code below to get a variable **data**. Notice the last line of the code - what does it do?
>
>```python
>rawdata = sns.load_dataset('titanic').sample(frac=1, random_state=3)
>
># Feature creation
>rawdata['male'] = (rawdata['sex'] == 'male').astype(int)
>rawdata['alone'] = rawdata['alone'].astype(int)
>rawdata['adult_male'] = rawdata['adult_male'].astype(int)
>```
> The following piece of code (spread over multiple lines) generates dummies for all of the categorical variables.
>```python
>data = pd.get_dummies(rawdata,
>                      columns=['class', 'sibsp', 'parch', 'deck'],
>                      drop_first=True)\
>         .drop(['pclass', 'sex', 'embarked', 'who', 'embark_town', 'alive'], axis=1)
>```
> Read about decision-tree classification [here](http://scikit-learn.org/stable/modules/tree.html#classification). Try to explain why decision tree classification can _only_ produce splits orthogonal to one of the axes. 
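
One way to see the axis-orthogonal behaviour is to fit a tree on toy data and inspect its split rules. The sketch below is illustrative only: the data and the names `X_toy`, `y_toy` and `toy_tree` are made up, and the true class boundary is deliberately diagonal. Every split the fitted tree reports still tests a single feature against a threshold.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical): the true class boundary is the diagonal x0 + x1 > 1.
rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 1, size=(200, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 1).astype(int)

toy_tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Internal nodes store a feature index >= 0; leaves are marked with a negative index.
is_split = toy_tree.tree_.feature >= 0
split_features = toy_tree.tree_.feature[is_split]
split_thresholds = toy_tree.tree_.threshold[is_split]

# Each rule has the form "x_f <= t": one feature, one threshold.
for f, t in zip(split_features, split_thresholds):
    print(f"split on feature x{f} at threshold {t:.3f}")
```

The tree approximates the diagonal with a staircase of such single-feature thresholds.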

In [2]:
# [Answer to Ex. 14.1.1]

> **Ex. 14.1.2:** Use `data` to create a numpy array **X** and an array **y**. **y** should contain the variable called `'survived'`, and **X** should contain all the other variables. Remove all rows containing missing values beforehand.
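
As a rough sketch of the pattern (drop incomplete rows first, then separate target and features), here is the same idea on a small made-up frame. The frame `toy` and its columns are hypothetical stand-ins, not the real `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`; the values are illustrative.
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "age": [22.0, np.nan, 35.0, 54.0],
    "fare": [7.25, 71.28, 8.05, 51.86],
})

toy_clean = toy.dropna()                            # drop rows with any missing value
toy_y = toy_clean["survived"].values                # target as a 1-D array
toy_X = toy_clean.drop("survived", axis=1).values   # remaining columns as a 2-D array

print(toy_X.shape, toy_y.shape)
```

Dropping missing values before splitting keeps the rows of the feature array and the target array aligned.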

In [4]:
# [Answer to Ex. 14.1.2]

> **Ex. 14.1.3**: Train a decision tree classifier on **all** of the titanic data.
>
>* If your feature matrix is `X` and your target array is `y` you can do this by instantiating a model like:
>
        from sklearn.tree import DecisionTreeClassifier
        model = DecisionTreeClassifier()
        model.fit(X, y)  # <--- This is the training/fitting/learning step
>       
>Write four functions that count the number of
> - true positives, in a function called `TP`;
> - true negatives, in a function called `TN`;
> - false positives, in a function called `FP`;
> - false negatives, in a function called `FN`.
>
> All of these functions should take three arguments: the actual target array `y`, the feature matrix `X`, and a fitted model object (e.g. `fittedModel = DecisionTreeClassifier().fit(X, y)`).
>
>> _Hint 1:_ use the function `np.where` to compare y and the predicted values. For example `y + prediction == 2` is true only for the True Positives.
>
>> _Hint 2:_ You can check whether your result is correct by summing the results of all four functions and checking that the sum equals the number of observations.
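
To make Hint 1 concrete, here is a small sketch of the true-positive comparison on made-up arrays. The names `y_true` and `y_pred` and their values are hypothetical; in the exercise the predictions would come from the fitted model.

```python
import numpy as np

# Hypothetical labels and predictions, both coded 0/1.
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])

# y_true + y_pred equals 2 only where both are 1, i.e. at the true positives.
tp_mask = (y_true + y_pred) == 2
tp_index = np.where(tp_mask)[0]   # positions of the true positives
n_tp = int(tp_mask.sum())         # number of true positives

print(tp_index, n_tp)
```

The other three counts follow from analogous comparisons of the two arrays.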

In [6]:
# [Answer to Ex. 14.1.3]

> **Ex. 14.1.4:** Combine the four functions you defined above to write your own accuracy function, which calculates
$$
ACC = \frac{TP + TN}{TP+ TN + FP + FN}
$$
>
> Test your model using this accuracy function. Report the accuracy of your model on the same data that you trained the model on.
>
>> _Note:_ We split the calculation of accuracy into these four components because we can then easily calculate other scores, such as _precision, recall and F1_.
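
For reference, the standard definitions of these scores in terms of the same four counts are:

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$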

In [None]:
# [Answer to Ex. 14.1.4]

> **Ex. 14.1.5**: So far our model has been achieving an unbelievably high accuracy of 100% correct classification.
>
>1. Why did you get such a high accuracy in the previous exercises?
>2. Split your data into a test and training set of equal size. Train the model only on the training set and report its accuracy on both the training and test sets.
>> Hint: `from sklearn.model_selection import train_test_split` is an easy way to split your data.
>3. Comment on the difference you observe.
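
A minimal sketch of an equal-sized split, using synthetic stand-in data rather than the titanic arrays. The names `X_demo`, `y_demo` and `demo_tree` are hypothetical, and the noisy labels are chosen only so that train and test scores can differ.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical): 100 rows, 3 features, noisy labels.
rng = np.random.RandomState(1)
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=1.0, size=100) > 0).astype(int)

# test_size=0.5 gives two halves of equal size; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=1)

# Fit on the training half only, then score both halves.
demo_tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
print("train accuracy:", demo_tree.score(X_train, y_train))
print("test accuracy: ", demo_tree.score(X_test, y_test))
```

The built-in `score` method is used here only for brevity; the exercise asks for your own accuracy function.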

In [None]:
# [Answer to Ex. 14.1.5]

> **Ex. 14.1.6:** Read about the random forest classifier in [the documentation](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) - how does the RF learner relate to decision trees? Train a random forest classifier, and print its train-accuracy and test-accuracy. How does it compare to the decision tree?

In [None]:
# [Answer to Ex. 14.1.6]

> **Ex. 14.1.7:** Random forests have a number of hyperparameters that we can tune. Often we just use the default settings, as they have been shown to work well. Sometimes, however, we want to make sure we grow the right number of trees in the forest. Try varying `n_estimators` and plot the resulting curve.

>> *Hint:* try the `validation_curve` which we learned about in Ex. 13.1.4.
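
A sketch of how `validation_curve` could be called for `n_estimators`, assuming synthetic data from `make_classification` as a stand-in for the titanic arrays. The grid `n_trees`, the choice of `cv=3` and the variable names are illustrative assumptions, not recommended settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Synthetic stand-in data (hypothetical), not the titanic data.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

n_trees = [5, 10, 25, 50]   # illustrative grid of forest sizes
train_scores, test_scores = validation_curve(
    RandomForestClassifier(random_state=0), X_demo, y_demo,
    param_name="n_estimators", param_range=n_trees, cv=3)

# Each result has one row per parameter value and one column per CV fold.
mean_train = train_scores.mean(axis=1)
mean_test = test_scores.mean(axis=1)

for n, tr, te in zip(n_trees, mean_train, mean_test):
    print(f"n_estimators={n:3d}  train={tr:.3f}  test={te:.3f}")
```

Plotting `mean_train` and `mean_test` against `n_trees` with `plt.plot` gives the curve the exercise asks for.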



In [None]:
# [Answer to Ex. 14.1.7]

> **Ex. 14.1.8:** Often we want to see the learning curve, which shows model scores for test and train data with increasing sample sizes. Try making this plot. Be sure to use the value of `n_estimators` that you found useful.

> A balanced model should show a declining train score and an increasing test score as more observations are added. Is this the case?

>> *Hint:* sklearn has a built-in `learning_curve` function in `sklearn.model_selection`.
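
A learning curve can be computed with `learning_curve` from `sklearn.model_selection`. The sketch below again assumes synthetic stand-in data; `n_estimators=25`, `cv=3` and the variable names are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in data (hypothetical), not the titanic data.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Fractions of the available training data to use; 25 trees is an arbitrary choice.
sizes, lc_train_scores, lc_test_scores = learning_curve(
    RandomForestClassifier(n_estimators=25, random_state=0), X_demo, y_demo,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=3)

# `sizes` holds the absolute number of training rows used at each step.
for n, tr, te in zip(sizes, lc_train_scores.mean(axis=1), lc_test_scores.mean(axis=1)):
    print(f"n_train={n:4d}  train={tr:.3f}  test={te:.3f}")
```

Plotting the two mean scores against `sizes` shows whether the gap between train and test narrows as observations are added.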



In [None]:
# [Answer to Ex. 14.1.8]

> **Ex. 14.1.9 (BONUS):** We want to reduce the overfitting in our random forest. Read through the random forest documentation and check which hyperparameters may help mitigate overfitting. You can use validation curves to search over these. Then try making a new learning curve.

>> *Hint 1:* Can the number of features sampled for each split also be considered a hyperparameter?

>> *Hint 2:* Can the depth of trees be considered a hyperparameter? 

>> *Hint 3:* sklearn has the built-in `learning_curve` in model_selection.

In [None]:
# [Answer to Ex. 14.1.9 BONUS]