# Decision Trees
---

## Problem setting
Find the relationship between the features of a data point and its corresponding class label. A binary decision tree has only two class labels, such as pass/fail, win/lose, alive/dead, healthy/sick, or 0/1. Given the features of a data point, the binary decision tree model should output a predicted probability that the point belongs to class 0 or class 1.

# Data set and objective:
---
![image.png](resources/dt_dataset.png)

### Objective: 

Find feature splits such that every leaf node of the decision tree contains data points of a single class.

## Basic algorithm:
1. Define a metric that measures the "importance" of splitting on a certain feature.
2. Apply this metric to every feature and split on the feature with the highest value.
3. If the leaf nodes do not yet contain a single label each, repeat step 2 on the data in every impure leaf.
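The three steps above can be sketched as a short recursive function. This is a minimal illustration, not a production implementation: it assumes categorical features, uses information gain (introduced below) as the importance metric, and runs on a tiny made-up dataset. The names `build_tree`, `rows`, and the features `"a"` and `"b"` are hypothetical and do not come from the data set in this notebook.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Step 1: the metric. Label entropy minus the weighted entropy after the split."""
    n = len(labels)
    remainder = 0.0
    for value in set(row[feature] for row in rows):
        subset = [y for row, y in zip(rows, labels) if row[feature] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def build_tree(rows, labels, features):
    """Steps 2 and 3: split on the best feature, then recurse into impure leaves."""
    # Stop when the node is pure or no features are left; predict the majority label.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    remaining = [f for f in features if f != best]
    children = {}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        children[value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining
        )
    return (best, children)


# Hypothetical toy data: feature "a" determines the label, feature "b" is noise.
rows = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
labels = ["no", "no", "yes", "yes"]
tree = build_tree(rows, labels, ["a", "b"])
print(tree)
```

On this toy data the tree splits once on `"a"` and both children are pure, so the recursion stops after a single split.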

### Note: The decision tree algorithm is easily extensible to more than 2 classes. No changes have to be made, as long as each leaf node contains only 1 class after the tree has been built.

## Which metric should measure the importance of a feature?

## Information Gain
---
The formula for information gain is the following:
$Information\_Gain(\text{feature a},Y)=H(Y)-H(Y|\text{feature a})$, or

$Information\_Gain$
$=(\text{Entropy of distribution before the splitting})$
$-(\text{Entropy of distribution after observing the feature value of feature a})$

### Entropy of 1 random variable
For the discrete case, the entropy of a random variable is defined as $H(\mathbf Y)=-\sum_{i=1}^{K}P(Y_i) \log_2(P(Y_i))$, where $Y$ is the random variable of classes, $K$ is the number of different classes, and $P(Y_i)$ is the probability (in this case the relative frequency) of class $i$. Entropy measures the impurity (disorder, randomness) of the probability distribution of a random variable: it is 0 when every data point has the same class and largest when all classes are equally likely. For example, taking the label "play golf" as our random variable, with its class frequencies as a discrete probability distribution, we get the following entropy value:
$Entropy:=H(\mathbf Y)=-\frac{4}{6}\log_2(\frac{4}{6})-\frac{2}{6}\log_2(\frac{2}{6})=0.918$  

![](resources/h.png)

In [4]:
import numpy as np 
import matplotlib.pyplot as plt

In [3]:
# Example 1: Entropy of the tree above: 4 out of 6 labels are yes, 2 out of 6 are no
label_entropy = -(4/6)*np.log2((4/6))-(2/6)*np.log2((2/6))
print(label_entropy)

0.9182958340544896


In [6]:
# Example 2: Say 59 of 60 labels are "yes"
-(59/60)*np.log2((59/60))-(1/60)*np.log2((1/60)) # <- low entropy, meaning low information content(randomness) in the distribution

0.1222915970693747

In [7]:
# Example 3: Say 30 of 60 labels are "yes"
-(30/60)*np.log2((30/60))-(30/60)*np.log2((30/60)) # <- high entropy, meaning high information content(randomness) in the distribution

1.0
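The three examples above repeat the same formula with different counts, so they can be folded into one helper. The sketch below is one possible way to write it; the function name `entropy` and its count-based signature are assumptions made for illustration, and it uses the standard library `math.log2` instead of NumPy so that it runs on its own.

```python
from math import log2


def entropy(counts):
    """Entropy (in bits) of a class distribution given as raw class counts."""
    total = sum(counts)
    # Classes with a count of 0 are skipped, following the convention 0 * log2(0) = 0.
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


print(entropy([4, 2]))    # Example 1: 4 "yes" and 2 "no"
print(entropy([59, 1]))   # Example 2: 59 of 60 labels are "yes"
print(entropy([30, 30]))  # Example 3: 30 of 60 labels are "yes"
```

Because the helper takes a list of counts, it also works unchanged for more than two classes.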

### Conditional entropy of a random variable, given a distinct feature value:
Definition of conditional entropy:
$H(\mathbf Y| \text{feature a})=\sum_{v\in vals(a)}P_a(v)H(S_a(v))$,

where:
- $a$ is a specific feature (e.g. "Outlook"),
- $vals(a)$ is the set of unique values of feature $a$,
- $S_a(v)=\{x\in D|x_a=v\}$ is the set of data points in $D$ whose value for feature $a$ is equal to $v$,
- $P_a(v)=\frac{|S_a(v)|}{|D|}$ is the probability of feature $a$ having the value $v$,
- $H(S_a(v))$ is the entropy over the labels induced by the set $S_a(v)$.

In information theory, the conditional entropy quantifies the amount of information needed to describe the outcome of a random variable $Y$ given that the value of another random variable $\text{feature a}$ is known.
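The definition can be turned into a small function that takes, for each feature value $v$, the class counts inside $S_a(v)$. This is a sketch under assumptions: the names `entropy` and `conditional_entropy` and the count-tuple input format are chosen here for illustration. The example counts are the ones that appear in the "Outlook" calculation in this notebook (one pure group of 2 and two groups split 1/1).

```python
from math import log2


def entropy(counts):
    """Entropy (in bits) of a class distribution given as raw class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def conditional_entropy(groups):
    """Weighted average of the label entropy inside each feature-value group.

    `groups` holds one tuple of class counts per feature value v, i.e. the
    class counts of S_a(v); len(D) is the sum of all counts.
    """
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * entropy(g) for g in groups)


# Class counts per "Outlook" value: sunny, rainy, cloudy.
outlook = [(2, 0), (1, 1), (1, 1)]
print(conditional_entropy(outlook))  # approximately 0.667
```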

### Conditional entropy of the label given feature 1:
Let $X$ be the random variable for feature "Outlook".
![](resources/d1.png)
$H(\mathbf Y|Outlook)$

$=P_{Outlook}(sunny)\cdot H(S_{Outlook}(sunny))$

$+P_{Outlook}(rainy)\cdot H(S_{Outlook}(rainy))$

$+P_{Outlook}(cloudy)\cdot H(S_{Outlook}(cloudy))$

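$=\frac{2}{6}(-\frac{2}{2}\log_2(\frac{2}{2})-\frac{0}{2}\log_2(\frac{0}{2}))+\frac{2}{6}(-\frac{1}{2}\log_2(\frac{1}{2})-\frac{1}{2}\log_2(\frac{1}{2}))+\frac{2}{6}(-\frac{1}{2}\log_2(\frac{1}{2})-\frac{1}{2}\log_2(\frac{1}{2}))$

Here we use the convention $0\cdot \log_2(0)=0$, so a pure group has entropy 0 and drops out of the sum. The code omits such terms, because `0*np.log2(0)` evaluates to `nan`.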

In [8]:
# The "sunny" group is pure (entropy 0), so only "rainy" and "cloudy" contribute.
conditional_entropy_feature_1 = (2/6)*(-0.5*np.log2(0.5)-0.5*np.log2(0.5))+(2/6)*(-0.5*np.log2(0.5)-0.5*np.log2(0.5))
print(conditional_entropy_feature_1)

0.6666666666666666


### Conditional entropy of the label given feature 2:
Let $X$ be the random variable for feature "Temperature".
![](resources/d2.png)
$H(\mathbf Y|Temperature)$

$=P_{Temperature}(cool)\cdot H(S_{Temperature}(cool))$

$+P_{Temperature}(hot)\cdot H(S_{Temperature}(hot))$

$=\frac{2}{6}(-\frac{2}{2}\log_2(\frac{2}{2})-\frac{0}{2}\log_2(\frac{0}{2}))+\frac{4}{6}(-\frac{2}{4}\log_2(\frac{2}{4})-\frac{2}{4}\log_2(\frac{2}{4}))$

In [9]:
# The "cool" group is pure (entropy 0), so only "hot" contributes.
conditional_entropy_feature_2 = (4/6)*(-2*(2/4)*np.log2((2/4)))

print(conditional_entropy_feature_2)

0.6666666666666666


### Conditional entropy of the label given feature 3:
Let $X$ be the random variable for feature "Wind".
![](resources/d3.png)
$H(\mathbf Y|Wind)$

$=P_{Wind}(strong)\cdot H(S_{Wind}(strong))+P_{Wind}(weak)\cdot H(S_{Wind}(weak))$

$=\frac{3}{6}(-\frac{1}{3}\log_2(\frac{1}{3})-\frac{2}{3}\log_2(\frac{2}{3}))+\frac{3}{6}(-\frac{3}{3}\log_2(\frac{3}{3})-\frac{0}{3}\log_2(\frac{0}{3}))$

In [10]:
# The "weak" group is pure (entropy 0), so only "strong" contributes.
conditional_entropy_feature_3 = (3/6)*(-(1/3)*np.log2((1/3))-(2/3)*np.log2((2/3)))
print(conditional_entropy_feature_3)

0.4591479170272448


## Information Gain
Next, we compute the Information gain:

$Information\_Gain(\text{feature a},Y)=H(Y)-H(Y|\text{feature a})$, 

using the label entropy $H(Y)$ and the conditional entropy of the label given $\text{feature a}$, computed for each feature in turn.

$Information\_Gain=(\text{Entropy of distribution before the splitting})-(\text{Entropy of distribution after observing the feature value of feature a})$
https://en.wikipedia.org/wiki/Information_gain_in_decision_trees

In [18]:
information_gain_feat_1 = label_entropy - conditional_entropy_feature_1  
information_gain_feat_2 = label_entropy - conditional_entropy_feature_2
information_gain_feat_3 = label_entropy - conditional_entropy_feature_3  

information_gain_feat_1, information_gain_feat_2, information_gain_feat_3

(0.2516291673878229, 0.2516291673878229, 0.4591479170272448)

#### $\rightarrow$ Use feature 3 ("Wind") as the splitting node, since it has the highest information gain, and repeat the process on the data points in each resulting branch.
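The feature selection above can also be done in one pass over all features. The sketch below works from the per-value class counts that appear in the three conditional-entropy formulas. The tuple order (yes, no) is inferred from those formulas together with the 4 "yes" / 2 "no" label split, and the names `splits` and `information_gain` are assumptions made for this illustration.

```python
from math import log2


def entropy(counts):
    """Entropy (in bits) of a class distribution given as raw class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(groups):
    """H(Y) - H(Y | feature), from one (yes, no) count tuple per feature value."""
    n = sum(sum(g) for g in groups)
    # Summing the groups column-wise recovers the overall label counts, here (4, 2).
    label_counts = [sum(col) for col in zip(*groups)]
    conditional = sum(sum(g) / n * entropy(g) for g in groups)
    return entropy(label_counts) - conditional


# (yes, no) counts per feature value, read off the formulas in this notebook.
splits = {
    "Outlook": [(2, 0), (1, 1), (1, 1)],   # sunny, rainy, cloudy
    "Temperature": [(2, 0), (2, 2)],       # cool, hot
    "Wind": [(1, 2), (3, 0)],              # strong, weak
}

gains = {name: information_gain(groups) for name, groups in splits.items()}
best = max(gains, key=gains.get)
print(gains)
print(best)  # Wind
```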

## 2. Gini Index:
An alternative metric for deciding which feature to split on. Like entropy, it measures how mixed the class labels in a node are, but it does so without a logarithm.
https://medium.com/analytics-steps/understanding-the-gini-index-and-information-gain-in-decision-trees-ab4720518ba8
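As a rough sketch of how this metric could be applied to the same example, assume the common definition of Gini impurity, $1-\sum_{i=1}^{K}P(Y_i)^2$, and choose the split with the lowest weighted impurity. The function names `gini` and `weighted_gini` are hypothetical, and the (yes, no) count tuples are the ones inferred from the conditional-entropy formulas earlier in this notebook.

```python
def gini(counts):
    """Gini impurity of a class distribution given as raw class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_gini(groups):
    """Size-weighted average Gini impurity over the groups produced by a split."""
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * gini(g) for g in groups)


# (yes, no) counts per feature value, as inferred from the formulas above.
splits = {
    "Outlook": [(2, 0), (1, 1), (1, 1)],   # sunny, rainy, cloudy
    "Temperature": [(2, 0), (2, 2)],       # cool, hot
    "Wind": [(1, 2), (3, 0)],              # strong, weak
}

print(gini((4, 2)))  # impurity of the labels before any split, 4/9
scores = {name: weighted_gini(groups) for name, groups in splits.items()}
best = min(scores, key=scores.get)  # lower impurity is better
print(scores)
print(best)  # Wind
```

On this example the Gini criterion picks the same feature as information gain. In scikit-learn, `DecisionTreeClassifier` uses `criterion='gini'` by default.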

## Train a Decision Tree using scikit-learn

### 0. Do the imports

In [27]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

from sklearn.datasets import load_wine
import sklearn.tree as tree
from sklearn.tree import DecisionTreeClassifier

### 1. Load the data

In [28]:
X, Y = load_wine(return_X_y=True)

In [30]:
Y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2])

In [31]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42) 

### 2. Build the model

In [32]:
tree_classifier = DecisionTreeClassifier(criterion='entropy')

In [33]:
tree_classifier

DecisionTreeClassifier(criterion='entropy')

### 3. Train the model

In [34]:
tree_classifier.fit(X_train, Y_train) 

DecisionTreeClassifier(criterion='entropy')

### 4. Evaluate the model

In [35]:
tree_classifier.score(X_train, Y_train)

1.0

In [36]:
tree_classifier.score(X_test, Y_test)

0.9166666666666666
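The perfect training score next to a lower test score suggests that the fully grown tree fits the training data more closely than it generalizes. The test score itself comes from a single split of only 36 samples, so it is a noisy estimate. One way to get a more stable estimate is k-fold cross-validation with `cross_val_score`, which is imported above but not used. The sketch below assumes 5 folds and fixes `random_state=0` for reproducibility; both choices are assumptions made for this illustration, not settings from the notebook.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, Y = load_wine(return_X_y=True)

# Same criterion as above; random_state is fixed only to make the run repeatable.
tree_classifier = DecisionTreeClassifier(criterion="entropy", random_state=0)

# For a classifier, an integer cv uses stratified folds, so each fold keeps
# roughly the same class proportions as the full data set.
scores = cross_val_score(tree_classifier, X, Y, cv=5)
print(scores)
print(scores.mean(), scores.std())
```

The mean of the fold scores gives an accuracy estimate that depends less on one particular split, and the standard deviation indicates how much it varies between folds.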