## Chapter 6 - Decision Trees

### Training and Visualising a Tree

We use the iris dataset as a running example. Consider the following tree, which uses petal length and petal width to predict the species of an iris flower.

The tree consists of a series of splitting rules, applied from the top down. For a new sample, we start at the top of the tree. At the first split, if `petal length <= 2.45` the sample goes to the left branch and is predicted to be Setosa. If `petal length > 2.45` it goes to the right branch, where it is split again on the `petal width` feature. Overall, the tree segments the flowers into 3 regions of the prediction space.

The top node is the root node. Here it checks whether the petal length is at most 2.45 cm. Nodes with no child nodes are terminal nodes, or leaf nodes. Nodes that have child nodes are internal nodes. The segments of the tree that connect the nodes are branches.

Because `petal length` appears in the root node, the tree suggests that it is the most informative single feature for determining the type of iris.

<img src="tree1.png" width="300" />

Scikit-learn uses the Classification and Regression Tree (CART) algorithm to train Decision Trees. Given $n$ training examples $x_1, \cdots, x_i, \cdots, x_n \in \mathbb R^p$, each with $p$ features, and their associated class labels $y_1, \cdots, y_i, \cdots, y_n$, where each label is one of $K$ classes, i.e. $y_i \in \{1,\cdots, K\}$, the general steps are:

1. Split the prediction space into $J$ distinct and non-overlapping regions, $R_1, \cdots, R_m, \cdots, R_J$. 
2. For a new observation that falls into region $R_m$, $m \in \{1,\cdots, J\}$, the predicted class is the most commonly occurring class among the training observations in that region (the proportion of each class in the region can be reported as well).

For the iris example, we obtain 3 regions, from left to right $R_1, R_2, R_3$. The predicted classes for these regions are setosa, versicolor and virginica respectively. So for a new observation $x^*$, if $x^* \in R_1$ then its predicted class is setosa.

To construct the regions $R_m$, we look for regions that maximise class purity, i.e. minimise impurity. Two common impurity measures are Gini impurity and entropy.

Given a region $R_m$ containing $n_{R_m}$ observations, the Gini impurity (or Gini index) is calculated as:
$$G_{R_m,\text{Gini}} = 1-\sum_{k=1}^{K}p_{R_m,k}^2$$
where $p_{R_m,k}$ is the ratio of class $k$ instances among all the training instances in the region $R_m$. It can also be shown that the expression is the same as $\sum_{k=1}^{K}p_{R_m,k}(1 - p_{R_m,k})$. If all the $p_{R_m,k}$ are close to zero or one, then the Gini index takes on a small value. A small value therefore indicates that a node consists predominantly of observations from a single class.
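
The equivalence of the two forms follows in one line, using only the fact that the class proportions in a region sum to one, $\sum_{k=1}^{K}p_{R_m,k} = 1$:

$$\sum_{k=1}^{K}p_{R_m,k}(1 - p_{R_m,k}) = \sum_{k=1}^{K}p_{R_m,k} - \sum_{k=1}^{K}p_{R_m,k}^2 = 1-\sum_{k=1}^{K}p_{R_m,k}^2 = G_{R_m,\text{Gini}}$$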

In the tree context, the region $R_m$ can also mean the node $m$. 

For a pure node with $[50,0,0]$ members from each class, the Gini score is $1-\begin{pmatrix}\frac{0}{50}\end{pmatrix}^2-\begin{pmatrix}\frac{0}{50}\end{pmatrix}^2-\begin{pmatrix}\frac{50}{50}\end{pmatrix}^2= 0$ 

For a node of $54$ with $[0,49,5]$ members from each class, the Gini score is $1-\begin{pmatrix}\frac{0}{54}\end{pmatrix}^2-\begin{pmatrix}\frac{49}{54}\end{pmatrix}^2-\begin{pmatrix}\frac{5}{54}\end{pmatrix}^2= 0.168$



Alternatively, entropy (also called cross-entropy) is calculated as:
$$G_{R_m,\text{Entropy}} = -\sum_{k=1, p_{R_m,k}\neq 0}^K p_{R_m,k}\log p_{R_m,k}$$

The condition $p_{R_m,k}\neq 0$ under the summation means that all classes $k$ with no instances in the region $R_m$ are omitted. This removes every term that would otherwise contain $\log 0$. The worked examples in these notes use the natural logarithm. The entropy is near zero if all the $p_{R_m,k}$ are close to zero or one. Hence, like the Gini index, the entropy takes a small value if the node is pure.

For a pure node with $[50,0,0]$ members from each class, the entropy is $-\frac{50}{50} \log\begin{pmatrix}\frac{50}{50}\end{pmatrix}= 0$

For a node of $54$ with $[0,49,5]$ members from each class, the entropy is $-\begin{pmatrix}\frac{49}{54}\end{pmatrix}\log\begin{pmatrix}\frac{49}{54}\end{pmatrix}-\begin{pmatrix}\frac{5}{54}\end{pmatrix}\log\begin{pmatrix}\frac{5}{54}\end{pmatrix}= 0.308$
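
Both impurity measures can be wrapped in small helper functions that take the per-class instance counts of a node. This is only a sketch: the names `class_proportions`, `gini` and `entropy` are made up for illustration and are not part of scikit-learn, and the entropy uses the natural logarithm to match the worked values above.

```python
import numpy as np

def class_proportions(counts):
    """Turn per-class instance counts for a node into proportions p_k."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def gini(counts):
    """Gini impurity: 1 - sum_k p_k^2."""
    p = class_proportions(counts)
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    """Entropy with the natural log, skipping classes with no instances."""
    p = class_proportions(counts)
    p = p[p > 0]  # drop empty classes so log(0) is never evaluated
    return -np.sum(p * np.log(p))

pure_node = [50, 0, 0]
mixed_node = [0, 49, 5]
print(gini(pure_node), gini(mixed_node))
print(entropy(pure_node), entropy(mixed_node))
```

For the mixed node the two printed values should round to the 0.168 and 0.308 computed by hand above.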



To grow the tree, CART first splits the training set into two subsets using a single feature $k$ and a threshold $t_k$ (here $k$ indexes a feature, not a class). It searches for the feature-threshold pair $(k,t_k)$ that produces the purest subsets. The cost function to minimise is:

$$J(k,t_k) = \frac{m_{\text{left}}}{m}G_\text{left} + \frac{m_{\text{right}}}{m}G_\text{right}$$

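where $G_{\text{left}}$ and $G_{\text{right}}$ are the impurities of the left and right subsets, $m_{\text{left}}$ and $m_{\text{right}}$ are the numbers of instances in the left and right subsets, and $m$ is the total number of instances in the node being split. $G$ can be either Gini impurity or entropy.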

Once CART has split the training set in two, it splits each subset in the same way, recursively, until it reaches the maximum depth or cannot find a split that further reduces impurity.
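
The search for a single split can be sketched as an exhaustive loop over features and candidate thresholds. This is a simplified illustration of the cost function $J(k,t_k)$, not scikit-learn's actual implementation: the function names are hypothetical, the candidate thresholds are assumed to be midpoints between consecutive distinct feature values, and the toy data is invented for the example.

```python
import numpy as np

def gini_from_labels(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_cost(x, y, threshold):
    """Weighted impurity J of splitting on one feature at `threshold`."""
    left = x <= threshold
    right = ~left
    m = len(y)
    return (left.sum() / m) * gini_from_labels(y[left]) \
        + (right.sum() / m) * gini_from_labels(y[right])

def best_split(X, y):
    """Return (feature index, threshold, cost) of the lowest-cost split."""
    best = (None, None, np.inf)
    for k in range(X.shape[1]):
        values = np.unique(X[:, k])                  # sorted distinct values
        thresholds = (values[:-1] + values[1:]) / 2  # midpoints (assumption)
        for t in thresholds:
            cost = split_cost(X[:, k], y, t)
            if cost < best[2]:
                best = (k, t, cost)
    return best

# Toy data: feature 0 separates the two classes perfectly, feature 1 does not.
X_toy = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]])
y_toy = np.array([0, 0, 1, 1])
k, t, cost = best_split(X_toy, y_toy)
print(k, t, cost)
```

A full tree builder would call `best_split` again on each of the two resulting subsets until a stopping condition such as the maximum depth is met.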

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz

from graphviz import Source # For creating the visualisation of the decision tree

In [6]:
# Ingest
iris_dataset = datasets.load_iris()
X = pd.DataFrame(iris_dataset['data'], columns=iris_dataset['feature_names'])
y = pd.Series(iris_dataset['target'])

In [9]:
# For testing
display(X.head())
display(y.head())

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


0    0
1    0
2    0
3    0
4    0
dtype: int64

In [10]:
# Train
tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X[['petal length (cm)','petal width (cm)']], y)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=2, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [11]:
# Visualise tree
graph = Source(export_graphviz(tree_clf, out_file=None, 
                               feature_names=iris_dataset['feature_names'][2:], class_names=iris_dataset['target_names']))
graph.view()

'Source.gv.pdf'
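
If Graphviz is not available, the same splitting rules can be printed as plain text with `export_text` from `sklearn.tree`. The sketch below refits the tree so that it is self-contained. The `random_state=0` argument is an assumption added here for repeatability and is not used in the cell above. Note that the root split may be reported on either petal feature, since both can separate setosa perfectly.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_petal = iris.data[:, 2:]  # petal length and petal width only

# random_state is fixed only to make the printed rules repeatable (assumption)
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X_petal, iris.target)

# One line per node: internal nodes show the test, leaves show the class
rules = export_text(clf, feature_names=list(iris.feature_names[2:]))
print(rules)
```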

### Making Predictions

Class purity can be measured by Gini impurity or entropy. For a node $i$, Gini impurity is calculated as:
$$G_i = 1-\sum_{k=1}^K p_{i,k}^2$$
where $p_{i,k}$ is the ratio of class $k$ instances among all the training instances in the node $i$, and $K$ is the number of classes.

For a pure node with $[50,0,0]$ members from each class, the Gini score is $1-\begin{pmatrix}\frac{0}{50}\end{pmatrix}^2-\begin{pmatrix}\frac{0}{50}\end{pmatrix}^2-\begin{pmatrix}\frac{50}{50}\end{pmatrix}^2= 0$ 

For a node of $54$ with $[0,49,5]$ members from each class, the Gini score is $1-\begin{pmatrix}\frac{0}{54}\end{pmatrix}^2-\begin{pmatrix}\frac{49}{54}\end{pmatrix}^2-\begin{pmatrix}\frac{5}{54}\end{pmatrix}^2= 0.168$

Another measure is entropy. It is calculated as
$$H_i = -\sum_{k=1, p_{i,k}\neq 0}^K p_{i,k}\log p_{i,k}$$

The condition $p_{i,k}\neq 0$ means that all classes $k$ with no instances in the node $i$ are omitted, so that $\log 0$ never appears.

For a pure node with $[50,0,0]$ members from each class, the entropy is $-\frac{50}{50} \log\begin{pmatrix}\frac{50}{50}\end{pmatrix}= 0$

For a node of $54$ with $[0,49,5]$ members from each class, the entropy is $-\begin{pmatrix}\frac{49}{54}\end{pmatrix}\log\begin{pmatrix}\frac{49}{54}\end{pmatrix}-\begin{pmatrix}\frac{5}{54}\end{pmatrix}\log\begin{pmatrix}\frac{5}{54}\end{pmatrix}= 0.308$

In [6]:
#Gini Calculations
print(1-(0/50)**2-(0/50)**2-(50/50)**2)
print(1-(0/54)**2-(49/54)**2-(5/54)**2)

0.0
0.1680384087791495


In [7]:
#Entropy Calculations
print(-(50/50)*np.log(50/50))
print(-(49/54)*np.log(49/54)-(5/54)*np.log(5/54))

-0.0
0.30849545083110386


### Estimating Class Probabilities

A decision tree can also estimate the probability that an instance belongs to a particular class $k$. First it traverses the tree to find the leaf node for the instance. Then it returns the ratio of training instances of class $k$ in that leaf node.

For the sample below (petal length 5 cm, petal width 1.5 cm), the leaf node it reaches contains 49 training instances of class 1 (versicolor) out of 54, so the predicted probability for class 1 is $\frac{49}{54}=0.907$.

In [8]:
new_s = [[5,1.5]]
print('prediction : ', tree_clf.predict(new_s))
print('proba : ' , tree_clf.predict_proba(new_s))

prediction :  [1]
proba :  [[0.         0.90740741 0.09259259]]
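
The link between `predict_proba` and the leaf's class ratios can be checked by hand. The sketch below uses `apply` to find the leaf a sample lands in and reads that leaf's class distribution from the fitted tree's `tree_.value` array. Depending on the scikit-learn version, `tree_.value` holds per-class counts or fractions, so the values are normalised explicitly. The tree is refitted to keep the example self-contained, and `random_state=0` is an assumption not present in the cells above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_petal = iris.data[:, 2:]  # petal length and petal width only
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_petal, iris.target)

sample = np.array([[5.0, 1.5]])
leaf_id = clf.apply(sample)[0]            # index of the leaf the sample reaches
leaf_value = clf.tree_.value[leaf_id][0]  # per-class counts or fractions in that leaf
manual_proba = leaf_value / leaf_value.sum()

print(manual_proba)
print(clf.predict_proba(sample)[0])
```

The two printed rows should agree, which shows that the predicted probabilities are simply the class ratios of the training instances in the leaf.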
