<a href="https://colab.research.google.com/github/aaronmat1905/MLdiaries/blob/main/DecisionTrees_ID3Algorithm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Decision Trees **(August 2025)**
## **ID3 Algorithm**
The ID3 (Iterative Dichotomiser 3) algorithm is a classic algorithm used to build Decision Trees in Machine Learning.



*Introduced by Ross Quinlan in 1986*

How does ID3 Work? 

**High Level Overview**

ID3 builds a decision tree by:
1. Selecting the **best attribute** to split data at each step
2. **Recursively partitioning** the dataset into subsets based on that attribute
3. Continuing until all data points in a subset belong to the same class, or **no attributes remain**

**Steps**
1. Calculate the **Entropy** of the dataset.
2. For each attribute, calculate the **Information Gain** obtained by splitting on it.
3. *Select* the attribute with the **maximum Information Gain** as the decision node.
4. **Create child nodes** for each possible value of the chosen attribute.
5. ***Repeat*** the process recursively for each subset until one of the following holds:
   - All samples in the subset belong to the same class → create a leaf with that class
   - No attributes remain → use the majority class of the subset
   - The subset is empty → assign the majority class of the parent
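
The steps above can be sketched as a short recursive function. This is a minimal illustration, not the notebook's final implementation: the names `id3_sketch`, `_entropy`, `_info_gain` and the `domains` argument (the possible values of each attribute, taken from the full dataset) are assumptions introduced here, and the target is assumed to sit in the last column.

```python
import numpy as np
from collections import Counter


def _entropy(labels):
    # Shannon entropy (base 2) of a 1-D array of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def _info_gain(rows, attr):
    # Entropy of the labels minus the weighted entropy after splitting on attr
    labels = rows[:, -1]
    remainder = 0.0
    for v in np.unique(rows[:, attr]):
        subset = labels[rows[:, attr] == v]
        remainder += len(subset) / len(labels) * _entropy(subset)
    return _entropy(labels) - remainder


def id3_sketch(rows, attributes, domains, parent_majority=None):
    """Return a nested dict {attr: {value: subtree_or_label}} or a class label."""
    if len(rows) == 0:                      # empty subset -> parent's majority class
        return parent_majority
    labels = rows[:, -1].tolist()
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:               # pure subset -> leaf
        return majority
    if not attributes:                      # no attributes left -> majority class
        return majority
    # Steps 1-3: pick the attribute with the highest information gain
    best = max(attributes, key=lambda a: _info_gain(rows, a))
    remaining = [a for a in attributes if a != best]
    # Steps 4-5: one child per possible value, built recursively
    return {best: {v: id3_sketch(rows[rows[:, best] == v], remaining, domains, majority)
                   for v in domains[best]}}


# Hypothetical toy data: two feature columns, target in the last column
toy = np.array([[1, 1, "no"], [0, 0, "yes"], [1, 0, "no"]])
attrs = [0, 1]
domains = {a: np.unique(toy[:, a]).tolist() for a in attrs}
tree = id3_sketch(toy, attrs, domains)
print(tree)  # {0: {'0': 'yes', '1': 'no'}}
```

Because the array mixes integers and strings, NumPy stores every cell as a string, which is why the branch values appear as `'0'` and `'1'`.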

In [1]:
import numpy as np
from collections import Counter

## **Entropy**
Entropy is a measure of impurity or uncertainty in a dataset.
- If a dataset has all samples from a single class, entropy is 0
- If a dataset has an even mix of classes, entropy is at its **maximum**, because uncertainty is greatest

For categorical data, it tells you how diverse the categories are:\
 *Are there many different categories in roughly equal proportions? If yes, predicting the category of a sample is maximally uncertain.*


Its formula is given by:

$$
H(S) = - \sum_{i=1}^{c} p_i \log_2 p_i
$$

Where:

**$H(S)$**: Entropy of the dataset $S$

**$c$**: Number of unique categories/classes

**$p_i$**: Proportion of samples belonging to class $i$ *(the probability of class $i$ in dataset $S$)*

**For a Dataset, How do you calculate the Entropy?**
1. Compute class probabilities
- Find the probability of each class in the dataset.
$$
P(X) = \frac{\text{number of times } X \text{ occurs}}{\text{number of data points}}
$$
2. Plug it into the formula
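
As a small illustration of these two steps, the sketch below computes the class probabilities with `collections.Counter` and plugs them into the formula. The label list is a hypothetical target column chosen for this example.

```python
from collections import Counter
from math import log2

labels = ["no", "yes", "no"]  # hypothetical target column

# Step 1: class probabilities = class count / number of data points
counts = Counter(labels)
total = len(labels)
probs = {cls: n / total for cls, n in counts.items()}

# Step 2: plug the probabilities into H(S) = -sum(p_i * log2(p_i))
h = -sum(p * log2(p) for p in probs.values())

print(probs)
print(round(h, 4))  # 0.9183
```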

In [2]:
# Dry run of Entropy:
data = np.array([
    [1, 1, "no"],
    [0, 0, "yes"],
    [1, 0, "no"]
])

# Getting the count of the unique elements.
# Note: np.unique is applied to the whole array here, so the feature values
# ('0', '1') are counted together with the labels ('no', 'yes')
values, count = np.unique(data, return_counts = True)
print(f"Unique Values: {values}\tData Type: {type(values)}")
print(f"Frequencies of Unique Values: {count}\tData Type: {type(count)}")

# Calculating the probabilities/fractions of all the unique elements present
probabilities = count/count.sum()
print(f"Probabilities: {probabilities}\tData Type: {type(probabilities)}")

# Entropy:
entropy_demo = - np.sum(probabilities* np.log2(probabilities))
print(f"\n\nEntropy of the Demo Dataset: {entropy_demo}")

Unique Values: ['0' '1' 'no' 'yes']	Data Type: <class 'numpy.ndarray'>
Frequencies of Unique Values: [3 3 2 1]	Data Type: <class 'numpy.ndarray'>
Probabilities: [0.33333333 0.33333333 0.22222222 0.11111111]	Data Type: <class 'numpy.ndarray'>


Entropy of the Demo Dataset: 1.8910611120726526


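**Interpreting the value of an Entropy**\
A value of 1.891 is close to the maximum of $\log_2 4 = 2$ for four distinct values, so it signifies high impurity: the values are spread fairly evenly. An entropy value greater than 1 is normal whenever there are more than two classes.
Note that this dry run counts every cell of the array, so the four "classes" are `'0'`, `'1'`, `'no'` and `'yes'`. When building a tree, entropy is computed on the target column only (`data[:, -1]`).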

**Notes**
- If all samples belong to the same class: Entropy = 0
- If the samples are evenly distributed across 4 classes: Entropy = $\log_2 4$ = 2
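
These two limits can be checked numerically. The sketch below uses a helper named `label_entropy` and three made-up label arrays, all of which are assumptions for illustration only.

```python
import numpy as np


def label_entropy(labels):
    # Shannon entropy (base 2) of a 1-D array of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


pure = np.array(["yes", "yes", "yes", "yes"])                  # one class only
uniform4 = np.array(["a", "b", "c", "d"])                      # four equally likely classes
skewed4 = np.array(["a", "a", "a", "a", "a", "b", "c", "d"])   # four classes, uneven mix

print(label_entropy(pure))      # prints -0.0, which equals 0
print(label_entropy(uniform4))  # 2.0
print(label_entropy(skewed4))   # strictly between 0 and 2
```

An uneven distribution always scores below the maximum for its number of classes, which is why the skewed array lands under 2.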

In [3]:
# Defining the function:
def entropy(data):
  "Calculates the Entropy of the values in data (pass a 1-D label column) | Input: np.array | Output: float"
  values, count = np.unique(data, return_counts = True)  # frequency of each unique value
  probabilities = count/count.sum()                      # relative frequencies
  return - np.sum(probabilities*np.log2(probabilities))

## **Average Information**: Expected Entropy after split on attribute A
The **Expected Entropy** measures how impure the dataset remains after it has been split into subsets on a given attribute.

*The average information of an attribute is the weighted sum of entropies of the subsets created by splitting the dataset on that attribute, representing the expected impurity after the split.*

***"How much uncertainty remains about the target variable after we know the value of the attribute A?"*** [When we split on attribute A]


___
Formally,
$$
Info_A(S) = \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \cdot H(S_v)
$$
Where:
- $S$ is the full dataset
- $A$ is the attribute we are evaluating
- $S_v$ is the subset of $S$ where $A$ has value $v$
- $|S_v|/|S|$ is the weight (the proportion of samples with the value $v$)

When we split the dataset using attribute $A$, each branch/subset still has some impurity left. Because the subsets differ in size, we take a weighted average of their entropies. **Hence, this value tells us the expected disorder if we split on attribute $A$.**

**Why does it matter?**
- If an attribute produces **pure subsets** → Average Information is low → less impurity remains.
- If it produces **mixed subsets** → Average Information is high → more impurity remains.
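
To make the contrast concrete, here is a small sketch with a made-up dataset in which column 0 separates the classes perfectly and column 1 does not. The helper names `label_entropy` and `expected_entropy` and the data are assumptions for illustration; the target is assumed to be the last column.

```python
import numpy as np


def label_entropy(labels):
    # Shannon entropy (base 2) of a 1-D array of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def expected_entropy(rows, attr):
    # Weighted sum of the label entropies of the subsets created by attr
    labels = rows[:, -1]
    total = 0.0
    for v in np.unique(rows[:, attr]):
        subset = labels[rows[:, attr] == v]
        total += len(subset) / len(labels) * label_entropy(subset)
    return total


toy = np.array([
    ["a", "x", "yes"],
    ["a", "y", "yes"],
    ["b", "x", "no"],
    ["b", "y", "no"],
])

print(expected_entropy(toy, 0))  # 0.0 -> both subsets are pure
print(expected_entropy(toy, 1))  # 1.0 -> both subsets are an even mix
```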

In [4]:
## Average Information: Dry Run
attribute = 1 # Column index of the attribute whose average information we want
total_samples = len(data)

# Finding the Unique values of the attributes and their frequencies
vals, counts = np.unique(data[:, attribute], return_counts=True)
avg_info = 0.0

# Iterate through each unique value of the attribute
for v, count in zip(vals, counts):
  subset = data[data[:,attribute]==v]
  subset_entropy = entropy(subset[:, -1])
  avg_info += (count/total_samples)*subset_entropy

avg_info

np.float64(0.6666666666666666)

What do these lines do?
- `subset` keeps only the rows where the attribute equals `v`
- `subset[:, -1]` takes the last column: the target column
- `entropy(...)` then calculates the entropy of that target distribution
- **Weighting**: multiply that entropy by the fraction `count/total_samples`
- **Accumulate**: sum these weighted entropies into `avg_info`


*In essence, even after splitting on attribute 1, the subsets are not perfectly pure: on average about 0.667 bits of uncertainty about the class label remain. Note that this is a very small dataset, used only for illustration.*

In [5]:
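def average_information(data, attribute):
  "Expected entropy of the target (last column) after splitting on attribute | Input: np.array, int | Output: float"
  values, counts = np.unique(data[:, attribute], return_counts=True)
  total_samples = len(data)
  avg_info = 0.0
  for v, count in zip(values, counts):
    subset = data[data[:, attribute]==v]
    # Entropy is computed on the target column only, as in the dry run
    subset_entropy = entropy(subset[:, -1])
    avg_info += (count/total_samples)* subset_entropy
  return avg_info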

## **Information Gain**

Information Gain measures how much an attribute reduces uncertainty about the target variable.

$$
IG(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \cdot H(S_v)
$$
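
As a worked example, take the three-row demo dataset defined earlier, whose labels are (no, yes, no), and compute $H(S)$ over the target column only. Values are rounded to three decimals.

$$
H(S) = -\left(\tfrac{2}{3}\log_2\tfrac{2}{3} + \tfrac{1}{3}\log_2\tfrac{1}{3}\right) \approx 0.918
$$

For attribute 1 (column values 1, 0, 0), the subset with value 0 holds the labels (yes, no) and has entropy 1, while the subset with value 1 holds only (no) and has entropy 0:

$$
IG(S, A_1) = 0.918 - \left(\tfrac{2}{3}\cdot 1 + \tfrac{1}{3}\cdot 0\right) \approx 0.252
$$

For attribute 0 (column values 1, 0, 1), both subsets are pure, so nothing is subtracted:

$$
IG(S, A_0) = 0.918 - \left(\tfrac{2}{3}\cdot 0 + \tfrac{1}{3}\cdot 0\right) = 0.918
$$

Attribute 0 therefore has the larger gain on this dataset.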


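**Difference between Information Gain and Average Information**

Average Information, $Info_A(S)$, is the impurity expected to *remain* after splitting on $A$, whereas Information Gain is the impurity *removed* by the split: $IG(S, A) = H(S) - Info_A(S)$. For a given dataset $H(S)$ is fixed, so the attribute with the lowest Average Information is also the one with the highest Information Gain.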

In [7]:
def information_gain(data, attribute):
  # Entropy of the target column (last column) before the split
  dataset_entropy = entropy(data[:, -1])
  avg_info = average_information(data, attribute)
  return round(dataset_entropy-avg_info, 4)

In [None]:
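def get_selected_attribute(data):
  "Returns the gain of every attribute and the index of the best one | Input: np.array (target in last column) | Output: (dict, int)"
  n_attributes = data.shape[1]-1  # last column is the target
  gains = {}
  for attr in range(n_attributes):
    gains[attr] = information_gain(data, attr)
  best_attr = max(gains, key = gains.get)  # ties go to the lowest column index
  return gains, best_attr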
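
A self-contained sketch of the same selection logic, run on the three-row demo data, is shown below. The helpers `label_entropy` and `gain` are stand-ins written for this example and assume that entropy is taken over the target column only.

```python
import numpy as np


def label_entropy(labels):
    # Shannon entropy (base 2) of a 1-D array of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def gain(rows, attr):
    # H(S) minus the weighted entropy of the subsets created by attr
    labels = rows[:, -1]
    remainder = 0.0
    for v in np.unique(rows[:, attr]):
        subset = labels[rows[:, attr] == v]
        remainder += len(subset) / len(labels) * label_entropy(subset)
    return label_entropy(labels) - remainder


demo = np.array([[1, 1, "no"], [0, 0, "yes"], [1, 0, "no"]])

# Gain of every feature column (all columns except the last)
gains = {a: round(gain(demo, a), 4) for a in range(demo.shape[1] - 1)}
best = max(gains, key=gains.get)

print(gains)  # {0: 0.9183, 1: 0.2516}
print(best)   # 0
```

Column 0 wins because both of its subsets are pure, so the split removes all of the uncertainty about the label.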