The Gini index (Gini impurity) measures how **mixed** the class labels are at a node in a decision tree, and it is used to choose splits that make nodes as pure (homogeneous) as possible.



### Intuition and probability view

- A node is **pure** if all samples belong to the same class; its Gini index is 0 (no impurity).
- A node is **impure** if labels are mixed; the closer the class proportions are to equal, the higher the impurity. The maximum is $ 1 - 1/k $ for $ k $ equally frequent classes: 0.5 in the balanced binary case, approaching (but never reaching) 1 as the number of classes grows.

Probabilistic interpretation:

- Imagine picking a random sample from the node.  
- Then imagine assigning its label **at random**, but using the observed class probabilities in that node.  
- The **Gini impurity** is the probability that this sample would be misclassified in that random process.
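
The random process above can be checked empirically. The following is a minimal sketch, not a definitive implementation: the helper names `simulate_gini` and `exact_gini`, the trial count, and the seed are all illustrative assumptions. It draws a sample and an independent random guess from the same node, counts how often they disagree, and compares that rate with the value computed directly from the class proportions.

```python
import random
from collections import Counter

def simulate_gini(labels, trials=100_000, seed=0):
    """Estimate Gini impurity by simulating the random-labeling process."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    mistakes = 0
    for _ in range(trials):
        true_label = rng.choice(labels)     # pick a random sample from the node
        guessed_label = rng.choice(labels)  # guess a label with the node's class proportions
        if guessed_label != true_label:
            mistakes += 1
    return mistakes / trials

def exact_gini(labels):
    """Gini impurity computed directly from class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# A node with 3 "Yes" and 2 "No" labels (hypothetical example node)
labels = ["Yes", "Yes", "Yes", "No", "No"]
simulated = simulate_gini(labels)
exact = exact_gini(labels)
print("Simulated:", simulated)
print("Exact:    ", exact)
```

With enough trials the simulated rate should land close to the exact value; the two agree only approximately because the simulation is random.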


Formal definition for a node with $ k $  classes:

- Let $ p_i $  be the proportion of samples of class $ i $  in that node.  
- The Gini index is  

  $$ 
  G = 1 - \sum_{i=1}^{k} p_i^2
  $$  

  where $ \sum_{i=1}^{k} p_i = 1 $ .

Key properties:

- If all samples are one class: one $ p_i = 1 $ , others 0 → $ G = 1 - 1^2 = 0 $  (perfectly pure).
- For a binary node with $ p $  and $ 1-p $ : $ G = 1 - (p^2 + (1-p)^2) $ , with a maximum of 0.5 at $ p = 0.5 $ .
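
The binary maximum follows from a short simplification, shown here as a worked step using the same symbols as above:

$$ 
G(p) = 1 - p^2 - (1-p)^2 = 2p(1-p), \qquad \frac{dG}{dp} = 2 - 4p = 0 \;\Rightarrow\; p = \tfrac{1}{2}, \quad G\left(\tfrac{1}{2}\right) = \tfrac{1}{2}
$$

The same reasoning extends to $ k $ classes: with all $ p_i = 1/k $, the index is $ G = 1 - k \cdot (1/k)^2 = 1 - 1/k $, which is the largest value it can take.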


### Role in decision trees

Decision trees use Gini impurity as a **splitting criterion**:

- At a node, the algorithm considers possible splits on each feature.  
- For each candidate split, it computes the **weighted average** Gini of the child nodes.  
- The split that produces the **lowest weighted Gini** (or equivalently, **highest Gini gain**) is chosen.


Weighted Gini after a split:

- Suppose a node with $ N $  samples is split into left and right child nodes with $ N_L $  and $ N_R $  samples.  
- Their impurities are $ G_L $  and $ G_R $ .  
- The weighted Gini of the split is [5]
  $$ 
  G_{\text{split}} = \frac{N_L}{N} G_L + \frac{N_R}{N} G_R
  $$



Gini gain (how much impurity is reduced):

- Let $ G_{\text{parent}} $  be the impurity before splitting.  
- Gini gain is [1][5]
  $$ 
  \Delta G = G_{\text{parent}} - G_{\text{split}}
  $$

- Higher $ \Delta G $  means the split is better (it cleans up the node more).  
- Practically, an implementation can either **minimize** $ G_{\text{split}} $ or **maximize** $ \Delta G $; because $ G_{\text{parent}} $ is fixed for a given node, both select the same split.

The feature (and threshold) that gives the **smallest weighted Gini** after the split becomes the decision at that node (e.g., the root feature for the first split).
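
For a numeric feature, choosing "the threshold" means trying several candidate cut points. The sketch below illustrates one way to do that, under stated assumptions: candidates are taken as midpoints between consecutive distinct values (a common convention, not something the text above prescribes), and the names `gini`, `best_threshold`, `ages`, and `bought` as well as the data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini, gini_gain) for the best split `value <= threshold`.

    Returns None when the feature has fewer than two distinct values.
    """
    n = len(labels)
    parent = gini(labels)
    distinct = sorted(set(values))
    # Assumption: candidate thresholds are midpoints between consecutive distinct values
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

    best = None
    for t in candidates:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        g_split = len(left) / n * gini(left) + len(right) / n * gini(right)
        if best is None or g_split < best[1]:
            best = (t, g_split, parent - g_split)
    return best

# Hypothetical numeric feature and binary labels
ages = [22, 25, 30, 35, 40, 52]
bought = ["No", "No", "No", "Yes", "Yes", "Yes"]

threshold, g_split, gain = best_threshold(ages, bought)
print(threshold, g_split, gain)  # 32.5 0.0 0.5
```

Here the cut at 32.5 separates the two classes completely, so the weighted Gini drops to 0 and the gain equals the parent impurity of 0.5.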


### Gini calculation example

Dataset (binary labels “Yes” and “No”):

| Instance | Feature 1 | Feature 2 | Label |
|----------|-----------|-----------|-------|
| 1        | 0         | 1         | Yes   |
| 2        | 1         | 0         | No    |
| 3        | 0         | 1         | Yes   |
| 4        | 1         | 1         | Yes   |
| 5        | 0         | 0         | No    |

1. **Count labels at the root node**:

- Total instances $ N = 5 $ .  
- Yes: 3 (instances 1, 3, 4) → $ p_{\text{Yes}} = 3/5 $ .  
- No: 2 (instances 2, 5) → $ p_{\text{No}} = 2/5 $ .

2. **Compute root Gini**:

$$ 
G_{\text{root}} = 1 - \left( \left(\frac{3}{5}\right)^2 + \left(\frac{2}{5}\right)^2 \right)
= 1 - \left( \frac{9}{25} + \frac{4}{25} \right)
= 1 - \frac{13}{25}
= \frac{12}{25} = 0.48
$$

So the root node impurity is 0.48; the labels are somewhat mixed but not maximally so.


#### Gini for Feature 1

Feature 1 has values 0 and 1.

##### Node: Feature 1 = 0

Instances: 1, 3, 5.

- Labels: Yes, Yes, No.  
- Total: 3.  
- $ p_{\text{Yes}} = 2/3 $ , $ p_{\text{No}} = 1/3 $ .

Gini:

$$ 
G_{F1=0} = 1 - \left( \left(\frac{2}{3}\right)^2 + \left(\frac{1}{3}\right)^2 \right)
= 1 - \left( \frac{4}{9} + \frac{1}{9} \right)
= 1 - \frac{5}{9}
= \frac{4}{9} \approx 0.444
$$

##### Node: Feature 1 = 1

Instances: 2, 4.

- Labels: No, Yes.  
- Total: 2.  
- $ p_{\text{Yes}} = 1/2 $ , $ p_{\text{No}} = 1/2 $ .

Gini:

$$ 
G_{F1=1} = 1 - \left( \left(\frac{1}{2}\right)^2 + \left(\frac{1}{2}\right)^2 \right)
= 1 - \left( \frac{1}{4} + \frac{1}{4} \right)
= 1 - \frac{1}{2}
= \frac{1}{2} = 0.5
$$

##### Weighted Gini after splitting on Feature 1

Node counts:

- For Feature 1 = 0: 3 samples.  
- For Feature 1 = 1: 2 samples.  
- Total: 5.

Weighted Gini:

$$ 
G_{\text{split, F1}} = \frac{3}{5} \cdot \frac{4}{9} + \frac{2}{5} \cdot \frac{1}{2}
= \frac{4}{15} + \frac{3}{15} = \frac{7}{15} \approx 0.4667
$$

Thus:

- Root node Gini: 0.48.  
- After splitting on Feature 1: 0.4667 (slightly cleaner overall).

Gini gain for this split:

$$ 
\Delta G_{\text{F1}} = 0.48 - 0.4667 = 0.0133
$$

So Feature 1 slightly improves purity compared to the unsplit node.

The same procedure is repeated for **Feature 2**:

- Compute Gini for Feature 2 = 0 and Feature 2 = 1 nodes.  
- Compute their weighted Gini.  
- Compare with 0.4667:  
  - If Feature 2’s weighted Gini is **lower**, Feature 2 is better for the root.  
  - If it is **higher**, Feature 1 is better.  
  - The feature with the **lowest weighted Gini** becomes the root feature.
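
Carrying the procedure out on the table above, as a worked sketch with the same formulas: Feature 2 = 0 covers instances 2 and 5 (No, No), and Feature 2 = 1 covers instances 1, 3, and 4 (Yes, Yes, Yes).

$$ 
G_{F2=0} = 1 - \left( 0^2 + 1^2 \right) = 0, \qquad G_{F2=1} = 1 - \left( 1^2 + 0^2 \right) = 0
$$

$$ 
G_{\text{split, F2}} = \frac{2}{5} \cdot 0 + \frac{3}{5} \cdot 0 = 0, \qquad \Delta G_{\text{F2}} = 0.48 - 0 = 0.48
$$

Both child nodes are pure, so Feature 2 has the lower weighted Gini and would be chosen for the root of this toy dataset.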


## Python implementation (from basics)

Below is a minimal, beginner-friendly implementation of Gini impurity and the weighted Gini calculation for a split, using only core Python.

```python
# Representing the toy dataset

data = [
    {"instance": 1, "feature1": 0, "feature2": 1, "label": "Yes"},
    {"instance": 2, "feature1": 1, "feature2": 0, "label": "No"},
    {"instance": 3, "feature1": 0, "feature2": 1, "label": "Yes"},
    {"instance": 4, "feature1": 1, "feature2": 1, "label": "Yes"},
    {"instance": 5, "feature1": 0, "feature2": 0, "label": "No"},
]
```

```python
# Function to compute Gini impurity

from collections import Counter

def gini_index(labels):
    """
    labels: list of class labels at a node, e.g. ["Yes", "No", "Yes"]
    returns: Gini impurity between 0 and 1
    """
    n = len(labels)
    if n == 0:
        return 0.0

    counts = Counter(labels)
    # compute sum of squared probabilities
    sum_p2 = 0.0
    for count in counts.values():
        p = count / n
        sum_p2 += p ** 2

    return 1.0 - sum_p2
```

- `Counter` counts the occurrences of each label.  
- Each probability $ p_i $ is `count / n`.  
- The function applies the formula $ 1 - \sum p_i^2 $.

```python
# Root Gini with this function

root_labels = [row["label"] for row in data]
g_root = gini_index(root_labels)
print("Root Gini:", g_root)  # expected 0.48
```

Output:

```
Root Gini: 0.48
```


```python
# Splitting on Feature 1 and computing weighted Gini

# Split data into two groups based on feature1
left = [row for row in data if row["feature1"] == 0]
right = [row for row in data if row["feature1"] == 1]

left_labels = [row["label"] for row in left]
right_labels = [row["label"] for row in right]

g_left = gini_index(left_labels)   # expected ≈ 0.444
g_right = gini_index(right_labels) # expected = 0.5

n_total = len(data)
n_left = len(left)
n_right = len(right)

weighted_gini_f1 = (n_left / n_total) * g_left + (n_right / n_total) * g_right

print("Gini(F1=0):", g_left)
print("Gini(F1=1):", g_right)
print("Weighted Gini after split on F1:", weighted_gini_f1)
```

Output:

```
Gini(F1=0): 0.4444444444444444
Gini(F1=1): 0.5
Weighted Gini after split on F1: 0.4666666666666667
```


This code reproduces the calculations:

- Gini at root = 0.48.  
- Gini for Feature 1 = 0 node ≈ 0.444.  
- Gini for Feature 1 = 1 node = 0.5.  
- Weighted Gini ≈ 0.4667.

```python
# General function to evaluate a split into any number of groups

def weighted_gini(groups):
    """
    groups: list of groups, where each group is a list of labels.
            e.g. groups = [left_labels, right_labels]
    returns: weighted Gini of the split
    """
    total = sum(len(g) for g in groups)
    if total == 0:
        return 0.0

    result = 0.0
    for labels in groups:
        weight = len(labels) / total
        result += weight * gini_index(labels)

    return result

groups = [left_labels, right_labels]
print("Weighted Gini:", weighted_gini(groups))
```

Output:

```
Weighted Gini: 0.4666666666666667
```
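
Building on the two functions above, the comparison between features can be automated. The following self-contained sketch repeats `gini_index` and `weighted_gini` so it runs on its own; the helpers `split_by_feature` and `rank_features` are hypothetical names introduced here for illustration. It groups labels by each feature's values and ranks the features by weighted Gini.

```python
from collections import Counter, defaultdict

def gini_index(labels):
    """Gini impurity of a list of labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(groups):
    """Size-weighted average Gini over a list of label groups."""
    total = sum(len(g) for g in groups)
    if total == 0:
        return 0.0
    return sum(len(g) / total * gini_index(g) for g in groups)

def split_by_feature(rows, feature, target="label"):
    """Group target labels by the value each row has for `feature`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    return list(groups.values())

def rank_features(rows, features, target="label"):
    """Return (feature, weighted_gini, gini_gain) tuples, best split first."""
    parent = gini_index([row[target] for row in rows])
    scores = []
    for feature in features:
        g_split = weighted_gini(split_by_feature(rows, feature, target))
        scores.append((feature, g_split, parent - g_split))
    return sorted(scores, key=lambda score: score[1])  # lowest weighted Gini first

data = [
    {"instance": 1, "feature1": 0, "feature2": 1, "label": "Yes"},
    {"instance": 2, "feature1": 1, "feature2": 0, "label": "No"},
    {"instance": 3, "feature1": 0, "feature2": 1, "label": "Yes"},
    {"instance": 4, "feature1": 1, "feature2": 1, "label": "Yes"},
    {"instance": 5, "feature1": 0, "feature2": 0, "label": "No"},
]

ranking = rank_features(data, ["feature1", "feature2"])
for feature, g_split, gain in ranking:
    print(f"{feature}: weighted Gini = {g_split:.4f}, gain = {gain:.4f}")
# feature2: weighted Gini = 0.0000, gain = 0.4800
# feature1: weighted Gini = 0.4667, gain = 0.0133
```

On this toy dataset Feature 2 ranks first because both of its branches are pure. Grouping by value rather than by a fixed left/right pair means the same sketch also handles categorical features with more than two values.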


## Summary of what Gini tells you

- Gini impurity is a **number from 0 up to $ 1 - 1/k $** for $ k $ classes (so always below 1; at most 0.5 for binary labels) that quantifies how mixed the labels are at a node.
- A node with Gini = 0 is perfectly **pure** (all labels identical).  
- Decision trees repeatedly split nodes to **reduce** Gini, aiming for purer child nodes.  
- For each feature, a split’s quality is measured via **weighted average Gini** of the resulting branches; the feature with the **lowest weighted Gini** becomes the splitting feature at that node.

This connects directly to overfitting: aggressively minimizing impurity with deep trees can yield extremely pure leaves (very low Gini and low training error) while harming generalization to new data.

Sources:

- [1](https://victorzhou.com/blog/gini-impurity/)
- [2](https://www.geeksforgeeks.org/machine-learning/gini-impurity-and-entropy-in-decision-tree-ml/)
- [3](https://www.baeldung.com/cs/impurity-entropy-gini-index)
- [4](https://en.wikipedia.org/wiki/Decision_tree_learning)
- [5](https://www.learndatasci.com/glossary/gini-impurity/)
- [6](https://courses.cs.washington.edu/courses/cse416/20su/files/section/section04/gini-impurity.pdf)
- [7](https://blog.quantinsti.com/gini-index/)
- [8](https://www.youtube.com/watch?v=u4IxOk2ijSs)