## Chapter 6 - Decision Trees

### Training and Visualising a Tree

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz

from graphviz import Source # For creating the visualisation of the decision tree

In [2]:
# Ingest
iris_dataset = datasets.load_iris()
X = pd.DataFrame(iris_dataset['data'], columns=iris_dataset['feature_names'])
y = pd.Series(iris_dataset['target'])

In [3]:
# Train a depth-2 tree on the last two features only (petal length and petal width)
tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X.iloc[:, 2:], y)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=2, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [5]:
# Visualise tree: export the fitted tree as DOT source, then render it with graphviz
dot_data = export_graphviz(tree_clf, out_file=None,
                           feature_names=iris_dataset['feature_names'][2:],
                           class_names=iris_dataset['target_names'])
graph = Source(dot_data)
graph.view()

'Source.gv.pdf'

### Making Predictions

To classify a new flower with the tree, start at the root node. The root checks whether the petal length is smaller than 2.45 cm. If it is, move left to the child node. This child is a leaf node, and its prediction is Setosa.

If another flower has a petal length greater than 2.45 cm, move right instead. This child node is not a leaf, so it asks a further question: is the petal width smaller than 1.75 cm? If so, move left to a leaf node that predicts Versicolor. If not, move right to a leaf node that predicts Virginica.
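
The traversal described above can be sketched as two nested tests. This is a hand-written illustration, not scikit-learn code: the function name `classify_iris` and the string labels are hypothetical, the thresholds are the ones quoted in the text, and the comparison is assumed to be `<=`, which is how scikit-learn displays its split conditions.

```python
def classify_iris(petal_length_cm, petal_width_cm):
    """Walk the depth-2 tree described in the text (illustrative sketch)."""
    # Root node: test petal length.
    if petal_length_cm <= 2.45:
        return "setosa"        # left child is a leaf
    # Right child is an internal node: test petal width.
    if petal_width_cm <= 1.75:
        return "versicolor"    # left leaf
    return "virginica"         # right leaf


print(classify_iris(1.4, 0.2))
print(classify_iris(5.0, 1.5))
print(classify_iris(5.0, 2.0))
```

Each prediction needs at most two comparisons, one per level of the tree.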

The impurity of a node can be measured by Gini impurity or entropy. The Gini impurity of node $i$ is calculated as:
$$G_i = 1-\sum_{k=1}^{n} p_{i,k}^2$$
where $n$ is the number of classes and $p_{i,k}$ is the proportion of class $k$ instances among all the training instances in node $i$.

For a pure node with $[50,0,0]$ members from each class, the Gini score is $1-\left(\frac{50}{50}\right)^2-\left(\frac{0}{50}\right)^2-\left(\frac{0}{50}\right)^2 = 0$

For a node of $54$ instances with $[0,49,5]$ members from each class, the Gini score is $1-\left(\frac{0}{54}\right)^2-\left(\frac{49}{54}\right)^2-\left(\frac{5}{54}\right)^2 \approx 0.168$
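
As a small generalisation of the worked examples, the Gini formula can be wrapped in a helper that takes the per-class counts of a node. The function name `gini` is an assumption made for illustration; it is not a scikit-learn function.

```python
def gini(counts):
    """Gini impurity of a node, given its per-class instance counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("node has no instances")
    # 1 minus the sum of squared class proportions
    return 1.0 - sum((c / total) ** 2 for c in counts)


print(gini([50, 0, 0]))               # pure node
print(round(gini([0, 49, 5]), 3))     # 0.168
print(round(gini([50, 50, 50]), 3))   # 0.667, three equally represented classes
```

The last call shows the other extreme: a node holding all three iris classes in equal numbers is as impure as a three-class node can be.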

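Another impurity measure is entropy. It is calculated as
$$H_i = -\sum_{k=1,\ p_{i,k}\neq 0}^{n} p_{i,k}\log p_{i,k}$$

The condition $p_{i,k}\neq 0$ means that any class $k$ with no instances in node $i$ is omitted from the sum. Here $\log$ is the natural logarithm, matching the `np.log` calls in the calculation cell.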

For a pure node with $[50,0,0]$ members from each class, the entropy is $-\frac{50}{50}\log\left(\frac{50}{50}\right) = 0$

For a node of $54$ instances with $[0,49,5]$ members from each class, the entropy is $-\left(\frac{49}{54}\right)\log\left(\frac{49}{54}\right)-\left(\frac{5}{54}\right)\log\left(\frac{5}{54}\right) \approx 0.308$
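
The same calculation can be sketched as a helper that skips empty classes, as the formula requires. The function name `entropy` and its `base` parameter are assumptions for illustration. The default base is $e$, to match the worked values above; passing `base=2` gives the result in bits instead.

```python
import math


def entropy(counts, base=math.e):
    """Entropy of a node, given its per-class instance counts."""
    total = sum(counts)
    # Classes with zero instances are skipped (the p != 0 condition).
    probs = [c / total for c in counts if c > 0]
    return sum(-p * math.log(p, base) for p in probs)


print(entropy([50, 0, 0]))                  # pure node
print(round(entropy([0, 49, 5]), 3))        # 0.308 (natural log)
print(round(entropy([0, 49, 5], base=2), 3))  # about 0.445 (log base 2)
```

Changing the base only rescales the values by a constant factor, so it does not change which of two nodes is purer.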

In [6]:
# Gini calculations
print(1-(0/50)**2-(0/50)**2-(50/50)**2)  # pure node
print(1-(0/54)**2-(49/54)**2-(5/54)**2)  # node with class counts [0, 49, 5]

0.0
0.1680384087791495


In [7]:
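# Entropy calculations (natural log; classes with zero instances are omitted)
print(-(50/50)*np.log(50/50))  # pure node
print(-(49/54)*np.log(49/54)-(5/54)*np.log(5/54))  # node with class counts [0, 49, 5]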

-0.0
0.30849545083110386


### Estimating Class Probabilities

A decision tree can also estimate the probability that an instance belongs to a particular class $k$. It first traverses the tree to find the leaf node for the instance, then returns the proportion of training instances of class $k$ in that leaf.

For a flower with a petal length of 5 cm and a petal width of 1.5 cm, the leaf reached contains 49 instances of class 1 (Versicolor) out of 54, so the predicted probability is $\frac{49}{54}\approx 0.907$

In [8]:
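# New sample: petal length 5 cm, petal width 1.5 cm (same columns as the training data)
new_s = pd.DataFrame([[5, 1.5]], columns=X.columns[2:])
print('prediction : ', tree_clf.predict(new_s))
print('proba : ' , tree_clf.predict_proba(new_s))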

prediction :  [1]
proba :  [[0.         0.90740741 0.09259259]]
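
The probabilities printed above can be reproduced by hand from the class counts of the leaf. The sketch below assumes the leaf holds `[0, 49, 5]` training instances, as stated in the text; the helper name `leaf_probabilities` is hypothetical.

```python
def leaf_probabilities(counts):
    """Class probabilities of a leaf: each class count divided by the leaf size."""
    total = sum(counts)
    return [c / total for c in counts]


leaf_counts = [0, 49, 5]  # class counts in the leaf, taken from the text
proba = leaf_probabilities(leaf_counts)

# The predicted class is the one with the highest estimated probability.
predicted_class = max(range(len(proba)), key=proba.__getitem__)

print([round(p, 3) for p in proba])  # [0.0, 0.907, 0.093]
print(predicted_class)               # 1
```

Every instance that lands in the same leaf receives the same probability estimates, however close it is to the split thresholds.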
