## Entropy
$$
H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i
$$
For example:
1. DatasetA: {ABCDEFGH}. Each of the 8 symbols has probability $\frac{1}{8}$:
$$
H(X) = -\frac{1}{8} \times \log_2 \frac{1}{8} \times 8 = 3
$$
2. DatasetB: {AAAABBCD}. A has probability $\frac{1}{2}$, B has $\frac{1}{4}$, and C and D each have $\frac{1}{8}$:
$$
H(X) = (-\frac{1}{2} \times \log_2 \frac{1}{2}) + (-\frac{1}{4} \times \log_2 \frac{1}{4}) + (-\frac{1}{8} \times \log_2 \frac{1}{8} \times 2) = 1.75
$$


    import math
    math.log(8, 2)  # 3.0

    # or
    from math import log
    log(8, 2)  # 3.0
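
The two dataset examples can be checked with a short script. This is a minimal sketch: the helper name `entropy` is an assumption made for illustration, and it estimates each $p_i$ as the relative frequency of a symbol in the dataset.

```python
from collections import Counter
from math import log2


def entropy(symbols):
    """Shannon entropy (in bits) of the empirical distribution of `symbols`."""
    counts = Counter(symbols)
    total = len(symbols)
    # p_i is the relative frequency of each distinct symbol.
    return -sum((c / total) * log2(c / total) for c in counts.values())


print(entropy("ABCDEFGH"))  # 3.0
print(entropy("AAAABBCD"))  # 1.75
```

DatasetA is uniform over 8 symbols, so it reaches the maximum of 3 bits. DatasetB is skewed toward A, so its entropy is lower.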

## Information Gain

$$
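IG(D, A) = H(D) - H(D|A)
$$
where $H(D|A)$ is the conditional entropy of the dataset $D$ given feature $A$.

Expanding the conditional entropy, where feature $A$ splits $D$ into subsets $D_1, \dots, D_k$:
$$
IG(D, A) = H(D) - \sum_{i=1}^{k} \frac{|D_i|}{|D|} \, H(D_i)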
$$


### Example on IG:

- feature = [x, x, y, x, y, x]
- target = [A, A, B, A, B, B]

Grouping the target labels by feature value:

- x: AAAB
- y: BB

The entropy $H(D)$ depends only on the target labels, not on the feature.

##### Calculate the entropy

The target contains 3 A's and 3 B's, so each class has probability $\frac{1}{2}$:

$$
H(D) = -\frac{1}{2} \times \log_2(\frac{1}{2}) \times 2 = 1
$$

---
##### Calculate the entropy of the x subset

Consider only the samples where the feature is x (AAAB), ignoring the y samples:

$$
H(D_x) = (-\frac{1}{4} \times \log_2(\frac{1}{4})) + (-\frac{3}{4} \times \log_2(\frac{3}{4})) \approx 0.81
$$

---
##### Calculate the entropy of the y subset

Consider only the samples where the feature is y (BB), ignoring the x samples. Every label is B, so the subset is pure:

$$
H(D_y) = -1 \times \log_2(1) = 0
$$

---
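##### Calculate the overall conditional entropy
The conditional entropy is the weighted sum of the subset entropies, where each weight is the fraction of samples that fall into that subset:

$$
H(D|A) = \frac{4}{6} \times H(D_x) + \frac{2}{6} \times H(D_y) = \frac{4}{6} \times 0.81 + \frac{2}{6} \times 0 = 0.54
$$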

---
##### Calculate the information gain

$$
IG = 1 - 0.54 = 0.46
$$
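
The worked example can be reproduced in code. This is a minimal sketch: the function names `entropy` and `information_gain` are assumptions made for illustration, and the x/y feature values and A/B labels are written as strings.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(feature, target):
    """IG = H(D) - sum(|D_i| / |D| * H(D_i)) over the subsets induced by `feature`."""
    groups = defaultdict(list)
    for value, label in zip(feature, target):
        groups[value].append(label)
    total = len(target)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(target) - conditional


feature = ["x", "x", "y", "x", "y", "x"]
target = ["A", "A", "B", "A", "B", "B"]
print(round(information_gain(feature, target), 3))  # 0.459
```

The script keeps full precision in the intermediate steps, so it prints 0.459, which rounds to the 0.46 obtained by hand.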

### Construction of ID3 Decision Tree
1. Calculate the information gain for each feature.
2. Use the feature with the largest IG as a node in the decision tree.
3. Split the dataset into subsets, one per value of that feature.
4. Repeat steps 1-3 on each subset with the remaining features, until a subset is pure or no features remain.

Each level of the ID3 decision tree performs a local ranking of features by information gain and selects the feature with the highest gain.
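
The construction procedure can be sketched as a short recursive function. This is an illustrative sketch rather than a full ID3 implementation: the function names, the dict-based tree representation, and the toy `outlook`/`windy` data are all assumptions made up for the example. It handles categorical features only, does no pruning, and falls back to the majority label when no features remain.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[feature]].append(label)
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def id3(rows, labels, features):
    # Leaf: the subset is pure, or no features are left (use the majority label).
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: pick the feature with the largest information gain.
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    remaining = [f for f in features if f != best]
    # Split the data into one subset per value of the chosen feature.
    partitions = defaultdict(lambda: ([], []))
    for row, label in zip(rows, labels):
        sub_rows, sub_labels = partitions[row[best]]
        sub_rows.append(row)
        sub_labels.append(label)
    # Recurse on each subset with the remaining features.
    return {best: {value: id3(r, l, remaining) for value, (r, l) in partitions.items()}}


# Hypothetical toy data, invented for this example.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "sunny", "windy": "yes"},
]
labels = ["play", "play", "play", "stay", "play"]

tree = id3(rows, labels, ["outlook", "windy"])
print(tree)
# {'outlook': {'sunny': 'play', 'rainy': {'windy': {'no': 'play', 'yes': 'stay'}}}}
```

Here `outlook` has the larger gain at the root, so it becomes the first node. The sunny branch is already pure, and only the rainy branch needs a further split on `windy`.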

### C4.5 Decision Tree
One major limitation of ID3 is that Information Gain has a strong bias toward features with many distinct values.

Example:

If a feature has 20 unique values and almost every sample falls into its own tiny subset,

→ Information Gain becomes artificially large

→ ID3 is tricked into choosing this feature even though it doesn’t actually help classification.

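To fix this, C4.5 divides the information gain by the split information. Split information is the entropy of the feature's own value distribution, so it grows as a feature splits the data into more, smaller subsets:
$$
SplitInformation(A) = -\sum_{i=1}^{k} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}
$$

$$
GainRatio(A) = \frac{IG(A)}{SplitInformation(A)}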
$$
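
The effect of the correction can be seen on the small x/y example from the Information Gain section. This is a minimal sketch: the function names and the ID-like feature `id_feature` are assumptions made for illustration, and split information is taken to be the entropy of the feature's own value distribution.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(values):
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def information_gain(feature, target):
    groups = defaultdict(list)
    for value, label in zip(feature, target):
        groups[value].append(label)
    total = len(target)
    return entropy(target) - sum(len(g) / total * entropy(g) for g in groups.values())


def gain_ratio(feature, target):
    # Split information: entropy of the feature's own values, not of the target.
    split_info = entropy(feature)
    if split_info == 0:
        return 0.0  # a feature with a single value cannot split the data
    return information_gain(feature, target) / split_info


target = ["A", "A", "B", "A", "B", "B"]
feature = ["x", "x", "y", "x", "y", "x"]
id_feature = [1, 2, 3, 4, 5, 6]  # hypothetical ID-like feature: one value per sample

print(round(information_gain(feature, target), 3))     # 0.459
print(round(information_gain(id_feature, target), 3))  # 1.0
print(round(gain_ratio(feature, target), 3))           # 0.5
print(round(gain_ratio(id_feature, target), 3))        # 0.387
```

Information gain alone prefers the ID-like feature (1.0 against 0.459). Gain ratio reverses the ranking (0.5 against 0.387), because the ID-like feature pays a split information of $\log_2 6 \approx 2.58$.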
