# 04 - Decision Trees

The goal of this exercise is to develop an understanding of how to implement a decision tree.

<div class="alert alert-block alert-info">
To solve this notebook, you need the knowledge from the previous notebook. If you have problems solving it, take another look at last week's notebooks.

It's also recommended to read chapter 7 of the book in advance.
</div>

**Task**: In this exercise, we use a popular dataset to predict whether a patient has heart disease, based on a set of medical measurements.

In [None]:
# Run this cell to import the following modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

<h2 style="color:blue" align="left">Load and preprocess data</h2>

In the first step, we need to load the dataset. If you are interested in the meaning of each feature, have a look at the description of this dataset on the [UCI site](https://archive.ics.uci.edu/ml/datasets/statlog+(heart)).

In [None]:
# The columns are separated by whitespace; sep=r'\s+' replaces the deprecated delim_whitespace=True
dataset = pd.read_csv('dataset/heart.dat', sep=r'\s+')
dataset.head()

In [None]:
dataset.info()

In [None]:
pd.plotting.scatter_matrix(dataset, alpha=0.2, figsize=(12,12));

The dataset is complete and contains only numerical values, so we can proceed directly to the train-test split.

In [None]:
from sklearn.model_selection import train_test_split
X = dataset.drop('target', axis=1)
y = dataset['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=dataset['target'])
X_train.shape, X_test.shape, y_train.shape, y_test.shape

<div class="alert alert-block alert-info">
Decision trees and ensemble methods built on them, such as random forests, do not require feature scaling: each split compares a single feature against a threshold, so the result is not sensitive to the scale of the features.
</div>
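
The note above can be checked with a small experiment. The following is a minimal sketch on synthetic data (the data set, the variable names and the scaling factors are assumptions made for illustration, they are not part of the exercise): one tree is fitted on the raw features, a second one on features that were rescaled column by column, and their predictions on held-out rows are compared.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration, not the heart dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

# Rescale every column by a different constant factor.
# Powers of two keep the rescaling exact in floating point.
factors = np.array([1024.0, 0.125, 64.0, 2.0])
X_rescaled = X_demo * factors

# Fit one tree on the raw features and one on the rescaled features
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_demo[:150], y_demo[:150])
tree_rescaled = DecisionTreeClassifier(random_state=0).fit(X_rescaled[:150], y_demo[:150])

# Compare the predictions of both trees on the held-out rows
same_predictions = np.array_equal(
    tree_raw.predict(X_demo[150:]), tree_rescaled.predict(X_rescaled[150:])
)
print(same_predictions)
```

Because rescaling a column does not change the order of its values, both trees should choose equivalent splits and therefore agree on the held-out rows.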

<h2 style="color:blue" align="left">Build and evaluate the tree</h2>

Now that we have prepared the data, we can start to grow the tree. For this we use scikit-learn's built-in `DecisionTreeClassifier` class.

In [None]:
from sklearn.tree import DecisionTreeClassifier

<div class="alert alert-block alert-success"><b>Task</b><br> 
Create an instance of a DecisionTreeClassifier and save it in the variable tree_clf. Then fit the model using the training data set. Set the parameter random_state to 42 to get comparable results.
</div>
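
If the scikit-learn estimator API is unfamiliar, the general pattern can be sketched on synthetic data first. Everything with a `demo` prefix below is a hypothetical name chosen for this illustration; the sketch is not the solution for the heart dataset, it only shows the usual create, fit and inspect steps.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration, not the heart dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Without any limits the tree keeps splitting until every leaf is pure
demo_clf = DecisionTreeClassifier(random_state=0)
demo_clf.fit(X_demo_train, y_demo_train)

# Inspect the size of the fitted tree and its accuracy on both splits
print('depth:', demo_clf.get_depth())
print('leaves:', demo_clf.get_n_leaves())
print('train accuracy:', demo_clf.score(X_demo_train, y_demo_train))
print('test accuracy:', demo_clf.score(X_demo_test, y_demo_test))
```

A gap between the training accuracy and the test accuracy of an unrestricted tree is a typical sign of overfitting.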

In [None]:
tree_clf = None
# Write Your Code Here


You can use the function defined below to visualize the fully grown tree.

In [None]:
from sklearn import tree
def plot_decision_tree(dec_tree, feature_names, class_names, filename=None):
    """Plot a fitted decision tree and optionally save it as '<filename>.png'."""
    # The figure size grows with the depth of the tree;
    # dpi=300 makes the image sharper than the default
    depth = dec_tree.tree_.max_depth
    fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(depth*2, depth*2), dpi=300)
    tree.plot_tree(dec_tree,
                   feature_names=feature_names,
                   class_names=class_names,
                   filled=True,
                   ax=ax)
    # Save the figure only if a filename was passed
    if filename is not None:
        fig.savefig(str(filename) + '.png')

<div class="alert alert-block alert-success"><b>Task</b><br> 
Use the function plot_decision_tree() to plot the tree you created in the previous task. If the output is too small, you can pass a filename as the fourth argument to save the figure as a PNG file in the current directory.
</div>
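
For deep trees even a large figure can be hard to read. As a complement to the plot, scikit-learn's `export_text` prints the learned rules as indented text. The sketch below uses synthetic data and hypothetical feature names (`f0` to `f3`), and limits the depth to 2 only to keep the output short.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (an assumption for illustration, not the heart dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
demo_feature_names = ['f0', 'f1', 'f2', 'f3']  # hypothetical feature names

# A shallow tree keeps the printed rule set short
demo_clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Each line is one test (feature <= threshold) or one leaf (class: ...)
tree_rules = export_text(demo_clf, feature_names=demo_feature_names)
print(tree_rules)
```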

In [None]:
feature_names = list(X.columns)
class_names = ['no heart disease', 'heart disease']
# Write Your Code Here


In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

<div class="alert alert-block alert-success"><b>Task</b><br> 
Use the confusion matrix and the accuracy score to evaluate the performance of your model. Evaluate the model on both the training and the test set. How do you assess the results?
</div>
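
To see how the two metrics relate, here is a small worked example with hand-made labels (the labels are invented for illustration only). In scikit-learn's confusion matrix the rows are the true classes and the columns are the predicted classes, and the accuracy is the sum of the diagonal divided by the number of samples.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made labels for illustration only (0 = negative, 1 = positive)
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true_demo, y_pred_demo)
acc = accuracy_score(y_true_demo, y_pred_demo)

# Rows: true class, columns: predicted class
print(cm)
# [[2 1]
#  [2 3]]

# Unpack the four cells of the binary confusion matrix
tn, fp, fn, tp = cm.ravel()

# Accuracy = (tn + tp) / number of samples = (2 + 3) / 8
print(acc)
# 0.625
```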

In [None]:
# Write Your Code Here


_Assess The Model Here_



<h2 style="color:blue" align="left">Regularization</h2>

If the model tends to overfit, the influence of `max_depth` and `ccp_alpha` should be examined.

### Max depth

The parameter `max_depth` limits the maximum depth of the tree.

<div class="alert alert-block alert-success"><b>Task</b><br> 
Use a for loop to create multiple trees with different depths. The values for max_depth to be examined are in the variable max_depths. Store the accuracy score of the training set and the test set in the variables provided. Then use the code in the next cell to visualize the results.
</div>
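
The general pattern of such a sweep can be sketched on synthetic data. All names below (`X_tr`, `demo_train_scores` and so on) are hypothetical and chosen for this illustration; the sketch deliberately does not use the heart dataset or the variables of the task.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration, not the heart dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# The depth of the unrestricted tree is the upper end of the sweep
full_depth = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).get_depth()

demo_train_scores, demo_test_scores = [], []
for depth in range(1, full_depth + 1):
    # Fit a fresh tree for every depth limit and record both accuracies
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    demo_train_scores.append(clf.score(X_tr, y_tr))
    demo_test_scores.append(clf.score(X_te, y_te))

for depth, (tr, te) in enumerate(zip(demo_train_scores, demo_test_scores), start=1):
    print(f'max_depth={depth}: train={tr:.3f}, test={te:.3f}')
```

Typically the training accuracy keeps rising with the depth, while the test accuracy flattens or drops once the tree starts to overfit.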

In [None]:
train_accuracies = []
test_accuracies = []
max_depths = range(1, tree_clf.tree_.max_depth+1)
# Write Your Code Here


In [None]:
# Plot accuracies vs. max depth
plt.plot(max_depths, train_accuracies, marker='o', label='Train Accuracy', drawstyle='steps-post')
plt.plot(max_depths, test_accuracies, marker='o', label='Test Accuracy', drawstyle='steps-post')
plt.title('Accuracies of regularised Decision Tree depending on max depth')
plt.xlabel('Max depth')
plt.ylabel('Accuracy')
#plt.ylim([0,1])
#plt.axhline(y=1, color='black', linestyle='-')
plt.legend(loc='lower right');

### Pruning

Another way to regularize a tree is the parameter `ccp_alpha`, which controls minimal cost-complexity pruning. The higher the value of `ccp_alpha`, the more branches are pruned.
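
The effect of `ccp_alpha` on the size of the tree can be sketched on synthetic data (the data and the `demo` names are assumptions for illustration). The sketch counts the nodes that remain for each value and also calls `cost_complexity_pruning_path`, which returns the effective alpha values at which the fully grown tree would lose another subtree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration, not the heart dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

demo_alphas = np.linspace(0, 0.1, 11)
node_counts = []
for alpha in demo_alphas:
    # The tree is grown fully first and then pruned according to ccp_alpha
    clf = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_demo, y_demo)
    node_counts.append(clf.tree_.node_count)

for alpha, n_nodes in zip(demo_alphas, node_counts):
    print(f'ccp_alpha={alpha:.2f} -> {n_nodes} nodes')

# Effective alphas at which the fully grown tree would lose another subtree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_demo, y_demo)
print(path.ccp_alphas[:5])
```

The values in `path.ccp_alphas` can serve as a data-driven alternative to an evenly spaced grid when choosing candidates for `ccp_alpha`.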

<div class="alert alert-block alert-success"><b>Task</b><br> 
Use a for loop to create multiple trees with different values of ccp_alpha. The values for ccp_alpha to be examined are in the variable ccp_alphas. Store the accuracy score of the training set and the test set in the variables provided. Then use the code in the next cell to visualize the results.
</div>

In [None]:
train_accuracies = []
test_accuracies = []
ccp_alphas = np.linspace(0,0.1,11)
# Write Your Code Here


In [None]:
# Plot accuracies vs. Alpha
plt.plot(ccp_alphas, train_accuracies, marker='o', label="Train Accuracy", drawstyle="steps-post")
plt.plot(ccp_alphas, test_accuracies, marker='o', label="Test Accuracy", drawstyle="steps-post")
plt.title('Accuracies of regularised Decision Tree depending on Alpha')
plt.xlabel('Alpha')
plt.ylabel('Accuracy')
#plt.ylim([0,1])
plt.legend(loc='lower left');