# Decision Trees and Random Forests

Before we get going, we collect all the necessary imports here (so that we are through with them...).

In [None]:
!pip install graphviz

In [None]:
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier
from ipywidgets import interactive
from graphviz import Source
from IPython.display import display, SVG

import graphviz 
import os
import pandas as pd
import warnings

warnings.filterwarnings('ignore')

We would also like to use a function *plot_confusion_matrix*, which cannot be easily imported. So we put it right here and come back to it later.

In [None]:
## do not change anything below ##
# Source: http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
import numpy as np
import matplotlib.pyplot as plt
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
def plot_tree(train_features: pd.DataFrame, train_target: pd.DataFrame, feature_names: list,
              split: str, depth: int, min_split: float, min_leaf: float=0.2):
    # Use a classifier, since the targets in this notebook are class labels.
    model = tree.DecisionTreeClassifier(random_state=0,
                                        splitter=split,
                                        max_depth=depth,
                                        min_samples_split=min_split,
                                        min_samples_leaf=min_leaf)
    model.fit(train_features, train_target)
    graph = Source(tree.export_graphviz(model,
                                        out_file=None,
                                        feature_names=feature_names,
                                        filled=True))
    display(SVG(graph.pipe(format='svg')))
    return model

Finally, the tree visualisations need the Graphviz software itself, in addition to the Python package of the same name. On Windows, download the installer from http://www.graphviz.org/.
The Python package is installed via *pip install graphviz* (this is what the first cell does).

That's it with the preparatory work. Now we can really step into business!

## Decision Trees

### Load and investigate the Iris data

Let's load the Iris data set, which is already included in Scikit Learn. 

In [None]:
iris = load_iris()

Biologists have been busy measuring some key characteristics of the Iris flowers - the **length** and **width** of two characteristic flower leaves (called 'petal' and 'sepal'). These **four numbers** are our "*features*", contained in iris.data.

Furthermore, biologists have used their **expertise** to identify the Iris flowers. These "*labels*" are contained in iris.target.

First, check that both datasets have the same size.

In [None]:
print(iris.data.shape)
print(iris.target.shape)

Let's take a look at what labels are available!

In [None]:
set(iris.target)

Hmm - it seems to contain only 'boring' class numbers. But never mind - the data set also knows the scientific names of the Iris flowers.

In [None]:
class_names = iris.target_names
print(class_names)

And here are the full names of the four features (we will need them later).

In [None]:
feature_names = iris.feature_names
print(feature_names)

Okay, now we have a rough feeling of the data :-).  

### Define training and test data

The second step is to **split** the 150 samples into **training data** and **test data**. This is crucial in any machine learning scheme.

It is customary to use roughly 80 percent of the data for training purposes and 20 percent for testing.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=0)
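To see what such a split produces, the following minimal sketch prints the sizes of both parts and the number of samples per class. The `stratify` argument is an addition that is not used in the cell above; it is shown here only as one way to keep the class proportions equal in both parts.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Assumption: stratify=iris.target is an optional extra, not part of the
# original cell. It keeps the three classes equally represented.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0,
    stratify=iris.target)

print(X_train.shape, X_test.shape)   # number of samples and features
print(np.bincount(y_train))          # training samples per class
print(np.bincount(y_test))           # test samples per class
```

Without `stratify`, the class counts in the test set may be slightly unbalanced, which is usually harmless for a data set as evenly distributed as Iris.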

So the task is: Find a model to **derive the flower type from the 4 features** (the sizes of the petal and sepal leaves).

### Define and train model: decision tree

Our first attempt for such a model is a "decision tree". Fortunately, this can be found fully programmed and tested in the sklearn lib. So we only need to set some characteristic numbers of the model (*max_depth* in this case). 

In [None]:
clf = tree.DecisionTreeClassifier(max_depth=2)

Now we use our training data set to train the model. The test data set is held back for the evaluation.

In [None]:
clf.fit(X_train, y_train)
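The effect of *max_depth* can be made visible by fitting trees with different depth limits and comparing their size. This is only a sketch: the chosen depth values and `random_state=0` are assumptions for illustration, and the full Iris data is used for simplicity.

```python
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()

shapes = {}
for depth in (1, 2, 3, None):  # None means: no depth limit
    model = tree.DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(iris.data, iris.target)
    # Record the actual depth and the number of leaves of the fitted tree.
    shapes[depth] = (model.get_depth(), model.get_n_leaves())
    print(depth, shapes[depth])
```

A deeper tree can separate the training data more finely, but it also becomes harder to read and more prone to overfitting.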

### Try out the model

Next, we **try out our model**: Given the leaf sizes, the model estimates which Iris flower it is. We then compare this "*prediction*" 
to the *labels* (the 'truth').

We do this **independently for the test and training data**.

In [None]:
pred_train = clf.predict(X_train)
pred_test = clf.predict(X_test)
print(accuracy_score(y_train, pred_train))
print(accuracy_score(y_test, pred_test))

Okay, the numbers are not bad, and they are similar. Great!

Next we do some statistics on our model results. The first thing is cross validation, which tries to answer the question: "When we use **different (random) sets of training data**, how well do we 'forecast' the correct result **on average**?" 

Cross validation, too, is already implemented and tested in sklearn. So the code is short. 

In [None]:
clf = tree.DecisionTreeClassifier(max_depth=2)
predicted = cross_val_predict(clf, iris.data, iris.target, cv=20)
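To look at the individual folds rather than the pooled predictions, `cross_val_score` from sklearn returns one accuracy value per fold. The sketch below assumes 5 folds and `random_state=0`, both chosen only for illustration.

```python
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

iris = load_iris()
clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0)

# One accuracy value per fold; each fold serves once as the test set.
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print(scores)
print("mean:", scores.mean(), "std:", scores.std())
```

The spread of the per-fold values gives a feeling for how much the result depends on the particular choice of training data.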

Now calculate the '*confusion matrix*' and plot it. This matrix brings together the true and the predicted classification of the Iris flower. Here we can use the function stated in the beginning.

In [None]:
cnf_matrix = confusion_matrix(iris.target, predicted)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix, without normalization')
# Plot normalized confusion matrix
# plt.figure()
# plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
#                       title='Normalized confusion matrix')
plt.show()

What this matrix tells us is: All 50 'setosa' flowers have been correctly predicted by our model. Great! 

But from the 50 'versicolor' flowers, some have been falsely predicted as 'virginica' flowers. For the 50 'virginica' flowers, the same goes vice versa. 

We can boil down this matrix into a single number, the *accuracy score*: 

In [None]:
accuracy_score(iris.target, predicted)

This is similar to the accuracy score of 0.966 found up above (which again is a good sign). 
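The link between the confusion matrix and the accuracy score can be checked on a small made-up example. The six labels below are invented for illustration and have nothing to do with the Iris data: the correct predictions sit on the diagonal of the matrix, so the accuracy is the diagonal sum divided by the total count.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only (three classes, six samples).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted
print(cm)

# Correct predictions are on the diagonal.
acc_from_cm = np.trace(cm) / cm.sum()
print(acc_from_cm, accuracy_score(y_true, y_pred))
```

Both printed numbers agree, since five of the six made-up samples are classified correctly.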

Mind that the accuracy score is a *random number*: it depends on how the data happens to be split and on the random choices the decision tree makes during training (no `random_state` is set for the classifier above). So your neighbours may well find a different number. Try repeating the previous steps: you may find that the confusion matrix and the accuracy score are different.
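If reproducible numbers are wanted, the randomness of the tree can be fixed with the `random_state` parameter. The following sketch uses a hypothetical helper `run` and the seed 0, neither of which appears in the cells above, to show that two runs with the same seed give identical predictions.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict

iris = load_iris()

def run(seed):
    """Hypothetical helper: cross-validated predictions for a given seed."""
    clf = tree.DecisionTreeClassifier(max_depth=2, random_state=seed)
    return cross_val_predict(clf, iris.data, iris.target, cv=20)

first = run(0)
second = run(0)
print(np.array_equal(first, second))  # same seed, same predictions
```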

### Visualize model

One of the good things about the decision tree is that it is easy to understand. And easy to visualise. Try this:

In [None]:
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'  #Make sure that the dot binary is in the path.
clf.fit(iris.data, iris.target)
dot_data = tree.export_graphviz(clf, out_file=None, 
                         feature_names=iris.feature_names,  
                         class_names=iris.target_names,  
                         filled=True, rounded=True,  
                         special_characters=True)
graphviz.Source(dot_data) 

Take some time to understand this graph.
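If Graphviz is not available, the same tree can also be printed as plain text with `tree.export_text`. This is an alternative to the plot above, not a replacement for it; `random_state=0` is an assumption added for reproducibility.

```python
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Each indented line is one decision; the innermost lines name the class.
rules = tree.export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Reading the text from top to bottom corresponds to following the graph from the root to the leaves.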

As stated above, the decision tree model has (like any model) some model parameters that can and need to be tuned. The tuning to yield the best results depends on the task given - there is no general law for this setting.

Try some alternative values and see how the design of the model changes.

In [None]:
plot_tree_partial = lambda split, depth, min_split, min_leaf: plot_tree(X_train, y_train, feature_names, split, depth, min_split, min_leaf)

inter = interactive(plot_tree_partial, 
                    split = ["best", "random"], 
                    depth=[1,2,3,4], 
                    min_split=(0.1,1), 
                    min_leaf=(0.1,0.5))
display(inter)

## Random Forest

Now let's experiment with a random forest as the model of choice.

### Define and train model: random forest

Again, the code is readily available in the sklearn package. Try this. 

In [None]:
rf = RandomForestClassifier()

### Try out the model 

Again, we use cross validation to investigate the quality of the model prediction. So first calculate this prediction.

In [None]:
predicted_rf = cross_val_predict(rf, iris.data, iris.target, cv=20)

Plot the confusion matrix and compute the accuracy.

In [None]:
cnf_matrix = confusion_matrix(iris.target, predicted_rf)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix, without normalization')
# # Plot normalized confusion matrix
# plt.figure()
# plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
#                       title='Normalized confusion matrix')
plt.show()

In [None]:
accuracy_score(iris.target, predicted_rf)

Do these numbers show a higher accuracy than for the decision tree model? They do not necessarily need to. You might discuss with your tutor why this is the case. 
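For a side-by-side comparison, both models can be evaluated with the same cross validation. This is a minimal sketch: the 5 folds, `n_estimators=100` and `random_state=0` are assumptions for illustration, and other settings may well give a different ranking.

```python
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()

models = {
    "decision tree": tree.DecisionTreeClassifier(max_depth=2, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    # Mean accuracy over the folds for each model.
    scores = cross_val_score(model, iris.data, iris.target, cv=5)
    results[name] = scores.mean()
    print(f"{name}: {results[name]:.3f}")
```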

### Some more insights

Anyway, the random forest allows us a little more insight into the dependency between features and labels. For instance, we may find out which of the four features has the largest impact on the overall prediction. This is done with an *importance plot*.

In [None]:
rf.fit(iris.data, iris.target)

f, ax = plt.subplots(figsize=(7, 5))
ax.bar(range(len(rf.feature_importances_)), rf.feature_importances_)
ax.set_title("Feature Importances")
plt.xticks(range(len(rf.feature_importances_)), iris.feature_names)
plt.show()
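The same information can be printed as a ranked list, which is sometimes easier to read than the bar chart. In this sketch, `n_estimators=100` and `random_state=0` are assumptions; the importances of a fitted forest are normalized so that they sum to 1.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(iris.data, iris.target)

# Pair each feature name with its importance and sort in descending order.
ranking = sorted(zip(iris.feature_names, rf.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, importance in ranking:
    print(f"{name:20s} {importance:.3f}")
```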

## Exercise: The Wine Data Set

So now it is over to you to do some exercises.

### Load and investigate the Wine data

Let's start by loading the data set. It is already included in Scikit Learn, and has been imported at the beginning.

In [None]:
wine = load_wine()
print(wine.DESCR)

Find out the dimensions and names of the features and labels.

In [None]:
print(wine.data.shape)
print(wine.feature_names)
print(wine.target.shape)
print(wine.target_names)

Let's sum up these data in prose:
<p>    - The data consists of 178 samples of wine, each described by 13 chemical characteristics, the *features*.</p>
<p>    - For each sample, there is a single *label* out of three classes (which may be thought of as red, white and rosé wine).</p>

### Exercise: Try out models on wine data

Your task is: 
Investigate the dataset using decision trees and random forests. Experiment with different hyperparameters. What is the best accuracy you can get with decision trees? What is the best accuracy you can get with random forests? Which *feature* has the largest impact on the attribution to the target?

1. Split the dataset into a training set and a test set

2. Initialize the first Model - a Decision Tree

3. Fit the model with the training data. The model needs the x data with the features and the target data with the 3 classes

4. Use the trained model to predict the target values for the training set and for the test set

5. Compare the actual values with the predicted values by calculating the accuracy score for the training and test set

6. Plot a confusion matrix to get a more detailed view on the predictions your model made

7. Use graphviz to visualize the tree built from your model
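As a possible starting point, steps 1 to 5 can be sketched as follows. This is not a reference solution: `max_depth=3`, `test_size=0.2` and `random_state=0` are arbitrary assumptions that you are encouraged to change.

```python
from sklearn import tree
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

wine = load_wine()

# Step 1: split into training set and test set.
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=0)

# Steps 2 and 3: initialize the decision tree and fit it on the training data.
clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Steps 4 and 5: predict for both sets and compare with the true labels.
train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```

Steps 6 and 7 can reuse *plot_confusion_matrix* and the Graphviz export from the Iris part of this notebook.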

Try to change the parameters to get better results. Common changes could be:
* Increase the number of training samples (ATTENTION: Think of overfitting ;))
* Use cross-validation to evaluate your model
* Only use specific features
* Use another model, e.g. a random forest
