## **Session 07 / 13-Nov-2025**

### ***Import libraries: First, we import the libraries required for today's exercise, `NumPy` and `pandas`.***

In [3]:
import numpy as np
import pandas as pd

### ***Create a dataset: We create a dataset with three features (Gender, Age, and Sport) using a dictionary.***

In [4]:
data = {"Gender": ["Man", "Woman", "Other", "Other", "Man", "Other", "Woman", "Man", "Woman"], 
        "Age": [9, 10, 28, 27, 18, 40, 17, 22, 32],
        "Sport": ["Tennis", "Football", "Football", "Football", "Swiming", "Football", "Basedball", "Swiming", "Tennis"]
        }

### ***Create and display DataFrame: We convert the dictionary into a pandas DataFrame and display it.***

In [5]:
df = pd.DataFrame(data)
df

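| | Gender | Age | Sport |
|---|---|---|---|
| 0 | Man | 9 | Tennis |
| 1 | Woman | 10 | Football |
| 2 | Other | 28 | Football |
| 3 | Other | 27 | Football |
| 4 | Man | 18 | Swiming |
| 5 | Other | 40 | Football |
| 6 | Woman | 17 | Basedball |
| 7 | Man | 22 | Swiming |
| 8 | Woman | 32 | Tennis |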


### ***Create class mapping: Based on the Sport feature, we define a numeric class mapping for the dataset.***

In [6]:
class_mapping = {"Football": 0, "Tennis": 1, "Swiming": 2, "Basedball": 3}

### ***Add the encoded class column: We add the new encoded class column to the DataFrame and display the result.***

In [7]:
df["Class"] = df["Sport"].map(class_mapping)
df

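| | Gender | Age | Sport | Class |
|---|---|---|---|---|
| 0 | Man | 9 | Tennis | 1 |
| 1 | Woman | 10 | Football | 0 |
| 2 | Other | 28 | Football | 0 |
| 3 | Other | 27 | Football | 0 |
| 4 | Man | 18 | Swiming | 2 |
| 5 | Other | 40 | Football | 0 |
| 6 | Woman | 17 | Basedball | 3 |
| 7 | Man | 22 | Swiming | 2 |
| 8 | Woman | 32 | Tennis | 1 |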


### ***`Entropy`***
### ***Entropy measures the amount of disorder or uncertainty within a set of labels. It is 0 when all samples belong to the same class (a pure set) and high when the classes are equally mixed.***
### ***The function works by computing the probability of each class and applying the standard entropy formula, `H = -sum(p_i * log2(p_i))`:***

In [8]:
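def calculate_entropy(labels):
    # An empty set carries no uncertainty.
    if len(labels) == 0:
        return 0
    # Count how often each distinct class label occurs.
    _, counts = np.unique(labels, return_counts=True)
    # Convert the counts to class probabilities.
    p = counts / len(labels)
    # Shannon entropy in bits (base-2 logarithm).
    entropy = -np.sum(p * np.log2(p))
    return entropy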
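
As a quick sanity check of the two extremes described above, the sketch below evaluates the function on three small label sets. The arrays `pure`, `balanced`, and `four_way` are illustrative inputs, not part of the session's dataset, and the function is repeated only so the snippet runs on its own.

```python
import numpy as np

def calculate_entropy(labels):
    # Same logic as the notebook cell, repeated so this snippet is self-contained.
    if len(labels) == 0:
        return 0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / len(labels)
    return -np.sum(p * np.log2(p))

# Illustrative label sets (assumed for this example only).
pure = np.array([0, 0, 0, 0])       # a single class
balanced = np.array([0, 1, 0, 1])   # two classes, 50% each
four_way = np.array([0, 1, 2, 3])   # four classes, 25% each

for name, labels in [("pure", pure), ("balanced", balanced), ("four_way", four_way)]:
    print(name, calculate_entropy(labels))
```

The pure set evaluates to zero (NumPy may display it as `-0.0` because of the leading minus sign), the balanced two-class set to 1 bit, and the evenly mixed four-class set to 2 bits.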

### ***`Gini Impurity`***
### ***Gini impurity is another measure of how mixed the labels in a dataset are.***
### ***It evaluates how often a randomly chosen sample would be incorrectly labeled if it were classified randomly according to the class distribution. A pure set has Gini = 0, and mixed sets have higher Gini values.***

In [9]:
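def calculate_gini(labels):
    # An empty set is treated as pure.
    if len(labels) == 0:
        return 0
    # Count how often each distinct class label occurs.
    _, counts = np.unique(labels, return_counts=True)
    # Convert the counts to class probabilities.
    p = counts / len(labels)
    # Gini impurity: one minus the sum of squared class probabilities.
    gini = 1 - np.sum(p**2)
    return gini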
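
The same kind of check can be sketched for Gini impurity. The label arrays below are illustrative inputs, not the session's data, and the function is repeated so the snippet runs on its own.

```python
import numpy as np

def calculate_gini(labels):
    # Same logic as the notebook cell, repeated so this snippet is self-contained.
    if len(labels) == 0:
        return 0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / len(labels)
    return 1 - np.sum(p**2)

# Illustrative label sets (assumed for this example only).
pure = np.array([0, 0, 0, 0])       # a single class
balanced = np.array([0, 1, 0, 1])   # two classes, 50% each
four_way = np.array([0, 1, 2, 3])   # four classes, 25% each

print(calculate_gini(pure))      # 0.0
print(calculate_gini(balanced))  # 0.5
print(calculate_gini(four_way))  # 0.75
```

With `k` equally frequent classes the value is `1 - 1/k`, so unlike entropy it always stays below 1.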

### ***`Information Gain (IG)`***
### ***Information Gain measures how much the entropy decreases after splitting the dataset on a particular feature. It follows the formula `IG = Entropy(before split) - Weighted Entropy(after split)`, where each subset's entropy is weighted by its share of the samples.***
### ***A higher IG value indicates a better feature for splitting.***

In [10]:
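def calculate_ig(ini_labels, subset_labels):
    # Entropy of the full label set before splitting.
    ini_entropy = calculate_entropy(ini_labels)
    w_entropy = 0
    total_size = len(ini_labels)
    for subset in subset_labels:
        # Weight each subset's entropy by its share of the samples.
        sub_size = len(subset)
        sub_entropy = calculate_entropy(subset)
        w_entropy += (sub_size / total_size) * sub_entropy
    # The gain is the reduction in entropy achieved by the split.
    return ini_entropy - w_entropy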
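
To see what high and low values of IG look like, the sketch below compares a split that separates the classes perfectly with one that leaves every subset as mixed as the original. The toy labels and both splits are assumptions made for illustration, and the two functions are repeated so the snippet runs on its own.

```python
import numpy as np

def calculate_entropy(labels):
    if len(labels) == 0:
        return 0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / len(labels)
    return -np.sum(p * np.log2(p))

def calculate_ig(ini_labels, subset_labels):
    ini_entropy = calculate_entropy(ini_labels)
    total_size = len(ini_labels)
    w_entropy = 0
    for subset in subset_labels:
        w_entropy += (len(subset) / total_size) * calculate_entropy(subset)
    return ini_entropy - w_entropy

# Toy label set with two classes, 50% each (entropy = 1 bit).
labels = np.array([0, 0, 1, 1])

# A perfect split: every subset is pure, so all of the entropy is removed.
perfect_split = [np.array([0, 0]), np.array([1, 1])]

# An uninformative split: each subset is as mixed as the original set.
useless_split = [np.array([0, 1]), np.array([0, 1])]

ig_perfect = calculate_ig(labels, perfect_split)
ig_useless = calculate_ig(labels, useless_split)
print(ig_perfect)  # 1.0
print(ig_useless)  # 0.0
```

The perfect split recovers the full initial entropy as gain, while the uninformative split gains nothing.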

### ***`Gini Gain`***
### ***Gini Gain is the same concept as Information Gain, but it uses Gini impurity instead of entropy.***

In [11]:
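def calculate_gg(ini_labels, subset_labels):
    # Gini impurity of the full label set before splitting.
    ini_gini = calculate_gini(ini_labels)
    w_gini = 0
    total_size = len(ini_labels)
    for subset in subset_labels:
        # Weight each subset's impurity by its share of the samples.
        sub_size = len(subset)
        sub_gini = calculate_gini(subset)
        w_gini += (sub_size / total_size) * sub_gini
    # The gain is the reduction in Gini impurity achieved by the split.
    return ini_gini - w_gini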

### ***Convert class column to NumPy array: We convert the class column into a NumPy array.***

In [12]:
ini_labels = df["Class"].to_numpy()
ini_labels

array([1, 0, 0, 0, 2, 0, 3, 2, 1])

### ***Split labels based on Gender: We split the class labels into three subsets based on the Gender feature: Man, Woman, and Other.***

In [13]:
labels_man = df[df["Gender"] == "Man"]["Class"].to_numpy()
labels_woman = df[df["Gender"] == "Woman"]["Class"].to_numpy()
labels_other = df[df["Gender"] == "Other"]["Class"].to_numpy()
labels_man, labels_woman, labels_other

(array([1, 2, 2]), array([0, 3, 1]), array([0, 0, 0]))

### ***Compute IG and Gini Gain for Gender: We calculate the Information Gain and Gini Gain for the Gender feature.***

In [14]:
ig_Gender = calculate_ig(ini_labels, [labels_man, labels_woman, labels_other])
gg_Gender = calculate_gg(ini_labels, [labels_man, labels_woman, labels_other])
print(f"ig_Gender: {ig_Gender:.2f} \ngg_Gender: {gg_Gender:.2f}")

ig_Gender: 1.00 
gg_Gender: 0.32
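
To make the arithmetic behind `ig_Gender` visible, the following sketch prints each Gender subset's weight and entropy before combining them. It rebuilds only the Gender and Class columns from the DataFrame above and uses `groupby` instead of the three hand-written filters, which is a convenience assumed here rather than the approach used in the session.

```python
import numpy as np
import pandas as pd

def calculate_entropy(labels):
    if len(labels) == 0:
        return 0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / len(labels)
    return -np.sum(p * np.log2(p))

# Gender and Class columns copied from the DataFrame shown earlier.
df = pd.DataFrame({
    "Gender": ["Man", "Woman", "Other", "Other", "Man", "Other", "Woman", "Man", "Woman"],
    "Class": [1, 0, 0, 0, 2, 0, 3, 2, 1],
})

parent_entropy = calculate_entropy(df["Class"].to_numpy())

weighted_entropy = 0.0
for gender, group in df.groupby("Gender")["Class"]:
    weight = len(group) / len(df)
    # Adding 0.0 turns a possible -0.0 into 0.0 so the printout is tidy.
    h = calculate_entropy(group.to_numpy()) + 0.0
    weighted_entropy += weight * h
    print(f"{gender:<6} weight = {weight:.2f}  entropy = {h:.2f}")

info_gain = parent_entropy - weighted_entropy
print(f"parent entropy:   {parent_entropy:.2f}")
print(f"weighted entropy: {weighted_entropy:.2f}")
print(f"information gain: {info_gain:.2f}")
```

The parent entropy is about 1.84 bits and the weighted entropy after the split is about 0.83 bits, which gives the gain of about 1.00 reported above. The "Other" subset is pure, so it contributes nothing to the weighted entropy.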


### ***Create age-based split: We split the dataset into two groups based on an age threshold `t = 18`.***

In [17]:
t = 18
labels_age_min_18 = df[df["Age"] <= t]["Class"].to_numpy()
labels_age_max_18 = df[df["Age"] > t]["Class"].to_numpy()
labels_age_min_18, labels_age_max_18

(array([1, 0, 2, 3]), array([0, 0, 0, 2, 1]))

### ***Compute IG & GG for Age: We calculate the Information Gain and Gini Gain for the Age feature.***

In [16]:
ig_age = calculate_ig(ini_labels, [labels_age_min_18, labels_age_max_18])
gg_age = calculate_gg(ini_labels, [labels_age_min_18, labels_age_max_18])
print(f"ig_age: {ig_age:.2f} \ngg_age: {gg_age:.2f}")

ig_age: 0.19 
gg_age: 0.05
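
The value `t = 18` is only one possible cut point for a numeric feature. As a hedged sketch, the code below scores every observed age as a candidate threshold using the same `Age <= t` rule. The helper `ig_for_threshold` and the choice of candidates are assumptions made for illustration and were not part of the session.

```python
import numpy as np

def calculate_entropy(labels):
    if len(labels) == 0:
        return 0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / len(labels)
    return -np.sum(p * np.log2(p))

def calculate_ig(ini_labels, subset_labels):
    ini_entropy = calculate_entropy(ini_labels)
    total_size = len(ini_labels)
    w_entropy = 0
    for subset in subset_labels:
        w_entropy += (len(subset) / total_size) * calculate_entropy(subset)
    return ini_entropy - w_entropy

# Age and Class columns copied from the DataFrame shown earlier.
ages = np.array([9, 10, 28, 27, 18, 40, 17, 22, 32])
labels = np.array([1, 0, 0, 0, 2, 0, 3, 2, 1])

def ig_for_threshold(t):
    # Hypothetical helper: IG of the binary split "Age <= t" versus "Age > t".
    return calculate_ig(labels, [labels[ages <= t], labels[ages > t]])

# Candidates: every observed age except the largest, which would leave
# the "Age > t" group empty.
candidates = np.unique(ages)[:-1]
gains = {int(t): ig_for_threshold(t) for t in candidates}

for t, gain in gains.items():
    print(f"t = {t:2d}  IG = {gain:.2f}")

best_t = max(gains, key=gains.get)
print("threshold with the highest IG:", best_t)
```

The entry for `t = 18` reproduces the 0.19 computed above. Other thresholds may score higher or lower, so a full tree builder would normally compare them before fixing the split.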


### ***`Final Summary Note`***

### ***Main Idea: For each feature in the dataset, we evaluate how effective it is at splitting the data by measuring how much impurity is reduced after the split. The overall process is:***
### -  ***Split the dataset based on the chosen feature.***
### -  ***Calculate the Information Gain (IG) or Gini Gain for that split.***
### -  ***Select the feature with the highest IG or Gini Gain as the root node of the decision tree.***

### ***This is the core logic behind building a decision tree: choose the feature that produces the most “pure” subsets after splitting.***
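
The three steps above can be sketched as one short loop over the candidate splits. The snippet rebuilds the relevant columns so it runs on its own. The `splits` dictionary and the use of `groupby` are illustrative choices, and only the two splits examined in this session (Gender, and Age at `t = 18`) are compared.

```python
import numpy as np
import pandas as pd

def calculate_entropy(labels):
    if len(labels) == 0:
        return 0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / len(labels)
    return -np.sum(p * np.log2(p))

def calculate_ig(ini_labels, subset_labels):
    ini_entropy = calculate_entropy(ini_labels)
    total_size = len(ini_labels)
    w_entropy = 0
    for subset in subset_labels:
        w_entropy += (len(subset) / total_size) * calculate_entropy(subset)
    return ini_entropy - w_entropy

# Columns copied from the DataFrame shown earlier.
df = pd.DataFrame({
    "Gender": ["Man", "Woman", "Other", "Other", "Man", "Other", "Woman", "Man", "Woman"],
    "Age": [9, 10, 28, 27, 18, 40, 17, 22, 32],
    "Class": [1, 0, 0, 0, 2, 0, 3, 2, 1],
})
labels = df["Class"].to_numpy()

# Step 1: split the labels by each candidate feature.
splits = {
    "Gender": [g.to_numpy() for _, g in df.groupby("Gender")["Class"]],
    "Age <= 18": [
        df.loc[df["Age"] <= 18, "Class"].to_numpy(),
        df.loc[df["Age"] > 18, "Class"].to_numpy(),
    ],
}

# Step 2: compute the Information Gain of each split.
gains = {name: calculate_ig(labels, subsets) for name, subsets in splits.items()}
for name, gain in gains.items():
    print(f"{name}: IG = {gain:.2f}")

# Step 3: pick the split with the highest gain as the root node.
best = max(gains, key=gains.get)
print("root split:", best)
```

With the values computed in this session (1.00 for Gender against 0.19 for Age), Gender is selected as the root split. Swapping `calculate_ig` for `calculate_gg` would apply the same procedure with Gini Gain.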