## Entropy
$$
H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i
$$
For example: 
1. DatasetA: {ABCDEFGH}
$$
H(X) = -\frac{1}{8} \times \log_2 \frac{1}{8} \times 8
= 3
$$
2. DatasetB: {AAAABBCD}
$$
H(X) = (-\frac{1}{2} \times \log_2 \frac{1}{2}) + (-\frac{1}{4} \times \log_2 \frac{1}{4}) + (-\frac{1}{8} \times \log_2 \frac{1}{8} \times 2)
= 0.5 + 0.5 + 0.75 = 1.75
$$


In [1]:
import math
math.log(8, 2)
# or
from math import log
log(8, 2)

3.0
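
The two hand calculations above can be checked with a small helper. This is only a sketch: the function name `entropy` is chosen here for illustration, and it assumes the probabilities are the empirical frequencies of each symbol in the dataset.

```python
from collections import Counter
from math import log2

def entropy(symbols):
    """Shannon entropy (in bits) of the empirical distribution of `symbols`."""
    counts = Counter(symbols)
    total = len(symbols)
    # Sum over distinct symbols, not over every element of the dataset
    return -sum((c / total) * log2(c / total) for c in counts.values())

print(entropy("ABCDEFGH"))  # 3.0
print(entropy("AAAABBCD"))  # 1.75
```

Note that each distinct symbol contributes one term to the sum, so A (probability 1/2) and B (probability 1/4) appear once each, while C and D (probability 1/8 each) contribute two identical terms.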

## Information Gain

$$
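IG(D, A) = H(D) - H(D|A)
$$
where $H(D|A)$ is the conditional entropy of $D$ given feature $A$.

Expanding the conditional entropy over the $k$ subsets $D_i$ produced by splitting on $A$: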
$$
IG(D, A) = H(D) - \sum_{i=1}^{k} \frac{|D_i|}{|D|} \, H(D_i)
$$


### Example on IG:

feature = [x, x, y, x, y, x]
target = [A, A, B, A, B, B]

Grouping the targets by feature value:

- x: AAAB
- y: BB

$H(D)$ depends only on the distribution of the target, not on the feature.

Calculate the entropy of the target:

$$
H(D) = -\frac{1}{2} \times \log_2(\frac{1}{2}) \times 2 = 1
$$
---
Calculate the entropy of the subset where the feature is x

Consider only the samples with feature = x (ignore the y samples):

Entropy of the x subset:
$$
H(D_x) = (-\frac{1}{4} \times \log_2(\frac{1}{4})) + (-\frac{3}{4} \times \log_2(\frac{3}{4})) \approx 0.81
$$
---
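Calculate the entropy of the subset where the feature is y

Consider only the samples with feature = y (ignore the x samples):

Entropy of the y subset (both targets are B, so the subset is pure):
$$
H(D_y) = 0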
$$

---
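##### Calculate the overall conditional entropy
The conditional entropy is the weighted sum of the two subset entropies computed above, where each weight is the fraction of samples that fall in that subset: $H(D|A) = w_x \times H(D_x) + w_y \times H(D_y)$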

$$
\frac{4}{6} \times 0.81 + \frac{2}{6} \times 0 = 0.54
$$

---
##### Calculate the information gain

$$
IG = 1 - 0.54 = 0.46
$$
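
The worked example can be reproduced in a few lines of Python. This is a minimal sketch: the helper names `entropy` and `information_gain` are made up for illustration, and the feature and target values are written as strings.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(feature, target):
    """H(D) minus the weighted entropy of the subsets induced by `feature`."""
    groups = defaultdict(list)
    for f, t in zip(feature, target):
        groups[f].append(t)  # collect the targets seen for each feature value
    total = len(target)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(target) - conditional

feature = ["x", "x", "y", "x", "y", "x"]
target = ["A", "A", "B", "A", "B", "B"]
print(round(information_gain(feature, target), 2))  # 0.46
```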

### Construction of ID3 Decision Tree
1. Calculate the information gain of each feature.
2. Use the feature with the largest IG as a node in the decision tree.
3. Split the dataset into subsets, one per value of that feature.
4. Repeat steps 1-3 on each subset with the remaining features.

Each level of the ID3 decision tree performs a local ranking of features by information gain and selects the feature with the highest gain.
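
The construction steps can be sketched as a short recursive function. This is an illustrative sketch rather than a reference implementation: the nested-dict tree representation, the helper names, and the toy data are all assumptions made for this example, and it stops when a node is pure or no features remain.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def split_by(rows, labels, feature):
    """Group rows and labels by the row's value for `feature`."""
    groups = defaultdict(lambda: ([], []))
    for row, label in zip(rows, labels):
        groups[row[feature]][0].append(row)
        groups[row[feature]][1].append(label)
    return groups

def information_gain(rows, labels, feature):
    groups = split_by(rows, labels, feature)
    conditional = sum(len(ls) / len(labels) * entropy(ls) for _, ls in groups.values())
    return entropy(labels) - conditional

def id3(rows, labels, features):
    # Leaf: the node is pure or no features remain, so return the majority label
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Greedy local choice: the feature with the highest information gain
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    remaining = [f for f in features if f != best]
    children = {
        value: id3(sub_rows, sub_labels, remaining)
        for value, (sub_rows, sub_labels) in split_by(rows, labels, best).items()
    }
    return {best: children}

# Toy data, invented purely for illustration
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]
tree = id3(rows, labels, ["outlook", "windy"])
print(tree)  # {'outlook': {'sunny': 'play', 'rainy': 'stay'}}
```

Here `outlook` separates the labels perfectly (IG = 1) while `windy` tells us nothing (IG = 0), so the tree needs a single split.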

### C4.5 Decision Tree
One major limitation of ID3 is that Information Gain has a strong bias toward features with many distinct values.

Example:

If a feature has 20 unique values and almost every sample falls into its own tiny subset,

→ Information Gain becomes artificially large

→ ID3 is tricked into choosing this feature even though it doesn’t actually help classification.

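To fix this, C4.5 uses the gain ratio: 
$$
GainRatio(A) = \frac{IG(A)}{SplitInformation(A)}
$$
That is, gain ratio = information gain $\times$ penalty term, where the penalty term $\frac{1}{SplitInformation(A)}$ generally shrinks as the feature splits the data into more, smaller subsets.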
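
A small sketch of how the penalty works in practice. The notes above do not define $SplitInformation(A)$; the code assumes the usual C4.5 definition, $-\sum_i \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$, which is simply the entropy of the feature's own value distribution. The helper names and the `row_id` feature are made up for illustration.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(values):
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())

def information_gain(feature, target):
    groups = defaultdict(list)
    for f, t in zip(feature, target):
        groups[f].append(t)
    conditional = sum(len(g) / len(target) * entropy(g) for g in groups.values())
    return entropy(target) - conditional

def gain_ratio(feature, target):
    # Assumed definition: SplitInformation = entropy of the feature values
    split_info = entropy(feature)
    if split_info == 0:
        return 0.0  # a single-valued feature cannot split the data
    return information_gain(feature, target) / split_info

target = ["A", "A", "B", "A", "B", "B"]
useful = ["x", "x", "y", "x", "y", "x"]
row_id = ["r1", "r2", "r3", "r4", "r5", "r6"]  # unique per sample, useless for prediction

for name, feat in [("useful", useful), ("row_id", row_id)]:
    print(name, round(information_gain(feat, target), 3), round(gain_ratio(feat, target), 3))
```

Information gain alone ranks `row_id` first (every subset is trivially pure), while the gain ratio ranks `useful` first because `row_id` pays a penalty of $\log_2 6$ for its six-way split.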


### CART Decision Tree
Classification and Regression Tree
- Classification: choose the split with the minimum Gini coefficient
- Regression: choose the split with the minimum squared error
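
For the regression criterion, a rough sketch of how a single split could be chosen on one numeric feature. The helper names, the midpoint-threshold strategy, and the toy data are assumptions for illustration, not a description of any particular library's implementation.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_threshold(xs, ys):
    """Try midpoints between consecutive sorted x values; return (threshold, total SSE)."""
    pairs = sorted(zip(xs, ys))
    best = None
    for (x1, _), (x2, _) in zip(pairs, pairs[1:]):
        if x1 == x2:
            continue  # no threshold fits between equal x values
        threshold = (x1 + x2) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best

# Toy data, invented for illustration: two clearly separated clusters
xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
print(best_threshold(xs, ys))  # (6.5, 0.0)
```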

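#### Gini Coefficient: 
The probability that two samples selected at random (with replacement) from the dataset belong to different classes. In the decision-tree literature this quantity is also called Gini impurity.

Therefore, the smaller the Gini coefficient, the purer the dataset.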
$$
Gini =  \sum_{i=1}^k \sum_{j \ne i} p_i p_j = 1 - \sum_{i=1}^{k} p_i^2
$$

For example: 
10 balls, 10 are red

$Gini(D) = 1 - 1^2 = 0$

10 balls, 5 are red, 5 are blue

$Gini(D) = 1 - 0.5^2 - 0.5^2 = 0.5$

10 balls, 5 are red, 3 are blue, 2 are green

$Gini(D) = 1 - 0.5^2 - 0.3^2 - 0.2^2 = 0.62$
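
The three ball examples can be checked with a short helper. This is a sketch; the function name `gini` is chosen here for illustration and it uses the $1 - \sum p_i^2$ form of the formula.

```python
from collections import Counter

def gini(labels):
    """1 minus the sum of squared class frequencies."""
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

print(gini(["red"] * 10))                                          # 0.0
print(gini(["red"] * 5 + ["blue"] * 5))                            # 0.5
print(round(gini(["red"] * 5 + ["blue"] * 3 + ["green"] * 2), 2))  # 0.62
```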

### Gini index for a split

$$
Gini\_index = \sum_{m=1}^{M} \frac{N_m}{N} \, Gini(D_m)
$$

Example: 
feature1: housing?: [y, n, n, y, n, n, y, n, n, n]

label: loan?:       [no, no, no, no, yes, no, no, yes, no, yes]


In [6]:
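# housing = y: 3 samples, 0 "yes" and 3 "no" for loan
gini_housing = 1 - (0/3)**2 - (3/3)**2
# housing = n: 7 samples, 3 "yes" and 4 "no" for loan
gini_no_housing = 1 - (3/7)**2 - (4/7)**2
# weight each subset's Gini by its share of the 10 samples
gini_index = gini_housing * (3/10) + gini_no_housing * (7/10)
print(f'gini housing is {gini_housing:.4f}\ngini no housing is {gini_no_housing:.4f}\ngini index for this feature is {gini_index:.4f}')

gini housing is 0.0000
gini no housing is 0.4898
gini index for this feature is 0.3429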
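
The same calculation can be generalized so that the subsets and weights are derived from the data instead of being counted by hand. This is a sketch under the split formula above; the function names `gini` and `gini_index` are chosen for illustration.

```python
from collections import Counter, defaultdict

def gini(labels):
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_index(feature, labels):
    """Weighted average of the Gini of each subset induced by `feature`."""
    groups = defaultdict(list)
    for f, label in zip(feature, labels):
        groups[f].append(label)
    n = len(labels)
    return sum(len(g) / n * gini(g) for g in groups.values())

housing = ["y", "n", "n", "y", "n", "n", "y", "n", "n", "n"]
loan = ["no", "no", "no", "no", "yes", "no", "no", "yes", "no", "yes"]
print(round(gini_index(housing, loan), 4))  # 0.3429
```

CART would compute this value for every candidate split and keep the one with the smallest result.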


| Algorithm | Splitting Criterion                 | Handles Continuous Features? | Handles Missing Values? | Tree Type        | Pruning Method            | Notes / Characteristics |
|-----------|-------------------------------------|-------------------------------|--------------------------|-------------------|----------------------------|--------------------------|
| **ID3**   | Information Gain (Entropy)          | No                         | No                    | Multi-way splits | No pruning (original ID3) | Simple; biased toward high-cardinality features |
| **C4.5**  | Gain Ratio (IG / SplitInfo)         | Yes                        | Yes                   | Multi-way splits | Yes (pessimistic pruning) | Fixes IG bias; more robust; can handle real-valued data |
| **CART**  | Gini Index (classification) or MSE (regression) | Yes             | Yes                   | Binary splits     | Cost-complexity pruning   | Used in scikit-learn; supports regression trees |


### You can actually see the decision tree

In [10]:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# `model` is assumed to be an already-fitted sklearn Pipeline whose
# decision-tree step is named 'tree_model'
tree_model = model.named_steps['tree_model']
plt.figure(figsize=(30, 20))
# draw at most the top 10 levels; filled=True colors the nodes
plot_tree(tree_model, filled=True, max_depth=10)
plt.show()

## Pruning
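- A regularization technique that prevents a decision tree from overfitting.

1. Post-pruning: grow the full tree, then remove branches that add little, e.g. `DecisionTreeClassifier(ccp_alpha=0.01)`

2. Pre-pruning: stop the tree from growing too far in the first place, e.g. `DecisionTreeClassifier(max_depth=3)`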
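
A minimal sketch comparing the two pruning styles against an unpruned tree. The Iris dataset and `random_state=0` are assumptions chosen only to make the example self-contained and repeatable; the `ccp_alpha` and `max_depth` values are the ones listed above, not tuned settings.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumed example data: any labeled dataset would do
X, y = load_iris(return_X_y=True)

full = DecisionTreeClassifier(random_state=0).fit(X, y)                  # no pruning
pre = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)      # pre-pruning
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)  # post-pruning

for name, clf in [("unpruned", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "depth:", clf.get_depth(), "leaves:", clf.get_n_leaves())
```

In practice both parameters are usually chosen by cross-validation rather than fixed in advance.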