<a href="https://colab.research.google.com/github/changsin/AI/blob/main/08.4.decision_tree.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Decision Tree

This notebook works through the decision tree section (8.4) of Ertel's Artificial Intelligence.

## Entropy

The entropy is calculated according to Shannon's formula:

$$ H(D) = H(p) = H(p_1, \ldots, p_n) = -\sum_{i=1}^{n} p_i \log_2 p_i $$

Applied to the skiing example in the textbook, where 6 of the 11 examples are positive and 5 are negative, we calculate:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

In [None]:
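p1 = 5/11
p2 = 6/11

probs = np.array([p1, p2])

def calculate_entropy(probs):
  """Return the Shannon entropy (in bits) of a discrete distribution."""
  entropy = 0
  for p in probs:
    # Skip p == 0: the term 0*log2(0) is taken to be 0
    if p != 0:
      entropy += p*np.log2(p)

  return -entropy

# Direct evaluation of the formula, for comparison with the function
H = -(p1*np.log2(p1)) - (p2*np.log2(p2))
H

calculate_entropy(probs)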

0.9940302114769565

In [None]:
calculate_entropy(np.array([2/7, 5/7]))

0.863120568566631

In [None]:
calculate_entropy(np.array([4/4, 0/4]))

-0.0

A subset whose examples all belong to one class has zero entropy. The value prints as `-0.0` only because `calculate_entropy` returns the negated sum, and negating a floating-point zero gives negative zero; it compares equal to `0.0`.

In [None]:
calculate_entropy(np.array([6/11, 5/11]))

0.9940302114769565
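
The calls above pass probabilities that were worked out by hand. As a minimal sketch, the same formula can be applied directly to class counts. The helper name `entropy_from_counts` is an assumption made for this illustration and is not part of the notebook; it uses only the standard library so that it runs on its own.

```python
import math

def entropy_from_counts(counts):
    """Shannon entropy in bits, computed from raw class counts (hypothetical helper)."""
    total = sum(counts)
    # Classes with a zero count are skipped (0 * log2(0) is taken to be 0)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

h_full = entropy_from_counts([6, 5])   # whole skiing dataset: 6 positive, 5 negative
h_pure = entropy_from_counts([4, 0])   # every example in the same class
h_even = entropy_from_counts([1, 1])   # evenly split between two classes
print(h_full, h_pure, h_even)
```

For two classes the entropy ranges from 0 (a pure subset) to 1 bit (an even split), so the skiing dataset at about 0.994 is close to maximally mixed.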

## Information Gain

For a dataset $D$, the information content is defined as:

$$ I(D) = 1 - H(D) $$

The information gain obtained by splitting $D$ on an attribute $A$ into subsets $D_1, \ldots, D_n$ is defined as:

$$ G(D, A) = \sum_{i=1}^{n}\frac{|D_i|}{|D|}I(D_i) - I(D) $$

or, rewritten in terms of entropy:

$$ G(D, A) = H(D) - \sum_{i=1}^{n}\frac{|D_i|}{|D|}H(D_i) $$
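
The step from the first form to the second can be spelled out as follows. It relies only on the definition $I(D) = 1 - H(D)$ and on the fact that the subsets $D_i$ partition $D$, so their weights sum to 1:

$$
\begin{aligned}
G(D, A) &= \sum_{i=1}^{n}\frac{|D_i|}{|D|}\bigl(1 - H(D_i)\bigr) - \bigl(1 - H(D)\bigr) \\
        &= \sum_{i=1}^{n}\frac{|D_i|}{|D|} - \sum_{i=1}^{n}\frac{|D_i|}{|D|}H(D_i) - 1 + H(D) \\
        &= H(D) - \sum_{i=1}^{n}\frac{|D_i|}{|D|}H(D_i)
\end{aligned}
$$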

Applied to our skiing example, for the choice of snow distribution, we get:

In [None]:
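D = 11              # total number of examples
D_snow_little = 4   # examples in the first branch (4 positive, 0 negative)
D_snow_big = 7      # examples in the second branch (2 positive, 5 negative)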
H_D = calculate_entropy(np.array([6/11, 5/11]))
H_snow_little = calculate_entropy(np.array([4/4, 0/4]))
H_snow_big = calculate_entropy(np.array([2/7, 5/7]))
H_snow_big

0.863120568566631

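0.863 is the entropy of the larger branch of the snow distribution split (7 examples, 2 positive and 5 negative). The smaller branch (4 examples, all positive) is pure and has entropy 0.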
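
To turn the two branch entropies into an information gain, each one is weighted by the fraction of examples that falls into its branch, and the weighted sum is subtracted from the entropy of the full dataset. The following is a small standalone sketch of that arithmetic; the helper `binary_entropy` is a name chosen for this example, and the counts are the ones used in the cell above.

```python
import math

def binary_entropy(pos, neg):
    """Entropy in bits of a subset with `pos` positive and `neg` negative examples."""
    total = pos + neg
    return sum(-(c / total) * math.log2(c / total) for c in (pos, neg) if c > 0)

# (positive, negative) counts per branch of the snow distribution split
branches = [(4, 0), (2, 5)]
n = sum(pos + neg for pos, neg in branches)   # 11 examples in total

# Weight each branch's entropy by the share of examples it receives
split_entropy = sum((pos + neg) / n * binary_entropy(pos, neg) for pos, neg in branches)

# Gain = entropy before the split minus weighted entropy after it
gain = binary_entropy(6, 5) - split_entropy
print(round(split_entropy, 3), round(gain, 3))
```

The pure branch contributes nothing, so the weighted entropy after the split is just 7/11 of 0.863.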

In [None]:
def calculate_information_gain(dataset_count, initial_entropy, subs):
  """
  Return the information gain of splitting a dataset into sub-branches.

  dataset_count: total number of entries in the dataset
  initial_entropy: the entropy of the dataset before the split
  subs: a list of [positive, negative] counts, one per sub-branch,
    e.g., [[4, 0], [2, 5]]
  """
  sub_entropy = 0

  for sub in subs:
    sub_counts = np.array(sub)
    sub_total = np.sum(sub_counts)
    # Class probabilities within this sub-branch
    sub_probs = sub_counts / sub_total

    entropy = calculate_entropy(sub_probs)
    print("subentropy", entropy)

    # Weight each sub-branch by its share of the dataset
    sub_entropy += (sub_total/dataset_count)*entropy

  return initial_entropy - sub_entropy

## How to build a decision tree with information gain metrics

Using the information gain formula, we can now compare the candidate attributes and pick the best one to split on. The skiing example has three attributes, so we calculate the information gain for each of them in turn.



### A1: Snow distribution

In [None]:
H_D = calculate_entropy(np.array([6/11, 5/11]))
sub_snow = [[4, 0],
            [2, 5]]
calculate_information_gain(11, H_D, sub_snow)

subentropy -0.0
subentropy 0.863120568566631


0.44477166784364586

### A2: Weekend

In [None]:
sub_weekend = [[5, 2],
               [1, 3]]
calculate_information_gain(11, H_D, sub_weekend)

subentropy 0.863120568566631
subentropy 0.8112781244591328


0.14976144076759756

### A3: Sun

In [None]:
sub_sun = [[5, 3],
           [1, 2]]
calculate_information_gain(11, H_D, sub_sun)

subentropy 0.9544340029249649
subentropy 0.9182958340544896


0.0494520727893939

Snow distribution yields by far the largest information gain (about 0.445, versus 0.150 for weekend and 0.049 for sun), so it is the first question to ask: the attribute to branch on at the root of the tree.
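
The comparison can also be done programmatically. The sketch below is a standalone illustration rather than the notebook's own code: the function names and the dictionary keys are chosen here, and the branch counts are copied from the cells above. The parent entropy is recovered by summing the class counts over the branches.

```python
import math

def entropy(counts):
    """Shannon entropy in bits from class counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(subs):
    """Gain of a split, given [positive, negative] counts for each branch."""
    n = sum(sum(sub) for sub in subs)
    # Class totals of the unsplit dataset, obtained by summing over branches
    parent = [sum(col) for col in zip(*subs)]
    return entropy(parent) - sum(sum(sub) / n * entropy(sub) for sub in subs)

# Branch counts per attribute, as used in the cells above
splits = {
    "snow": [[4, 0], [2, 5]],
    "weekend": [[5, 2], [1, 3]],
    "sun": [[5, 3], [1, 2]],
}

gains = {name: information_gain(subs) for name, subs in splits.items()}
best = max(gains, key=gains.get)   # attribute with the highest gain
print(best, gains)
```

To grow the full tree, the same selection would be repeated on each impure branch using only the examples that reach it, which requires the per-example data from the textbook rather than the aggregate counts shown here.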