# 📘 Decision Tree Splits, Entropy, Gini, and Best Threshold Selection
*(Complete Explanation + Formulas + Step-by-Step Example + Python Code)*

## 1. Impurity Measures

### **Entropy**
$H = -\sum_{k} p_k \log_2 p_k$

### **Gini Impurity**
$G = 1 - \sum_{k} p_k^2$

Here $p_k$ is the proportion of samples in the node that belong to class $k$.
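As a quick sanity check on the two formulas, here is a minimal sketch that evaluates them directly from a list of class probabilities. The helper names `entropy_from_probs` and `gini_from_probs` are illustrative only and do not come from any library; zero probabilities are skipped, following the usual convention $0 \log_2 0 = 0$.

```python
import math

def entropy_from_probs(probs):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def gini_from_probs(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probs)

# A pure node has zero impurity under both measures.
print(entropy_from_probs([1.0, 0.0]), gini_from_probs([1.0, 0.0]))

# A balanced binary node is maximally impure: entropy 1 bit, Gini 0.5.
print(entropy_from_probs([0.5, 0.5]), gini_from_probs([0.5, 0.5]))
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, which is why either can serve as a split criterion.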

---

## 2. Impurity After a Split  

For a parent node with $N$ samples, split into left (size $N_L$) and right (size $N_R$):

$\text{Impurity}_{\text{after}} = \frac{N_L}{N} \cdot \text{Impurity}(L) + \frac{N_R}{N} \cdot \text{Impurity}(R)$

### **Information Gain**
$\text{IG} = H_{\text{parent}} - H_{\text{after}}$

### **Gini Reduction**
$\Delta G = G_{\text{parent}} - G_{\text{after}}$

---

## 3. Numerical Example (Binary Classification)

Dataset: 10 samples  
Classes: A = 6, B = 4

### **Parent Node**

$p_A = 0.6,\quad p_B = 0.4$

**Entropy:**
$H_{\text{parent}} = -0.6\log_2 0.6 - 0.4\log_2 0.4 = 0.97095$

**Gini:**
$G_{\text{parent}} = 1 - (0.6^2 + 0.4^2) = 0.48$

---

### Candidate Split  
Left child: A=4, B=1  
Right child: A=2, B=3  

### **Left Child**
$p_A = 0.8,\quad p_B = 0.2$

$H_L = -0.8\log_2 0.8 - 0.2\log_2 0.2 = 0.7219$

$G_L = 1 - (0.8^2 + 0.2^2) = 0.32$

### **Right Child**
$p_A = 0.4,\quad p_B = 0.6$

$H_R = -0.4\log_2 0.4 - 0.6\log_2 0.6 = 0.97095$

$G_R = 1 - (0.4^2 + 0.6^2) = 0.48$

### **Weighted Child Impurity (Entropy)**

Each child holds 5 of the 10 samples, so both weights are $N_L/N = N_R/N = 0.5$.

$H_{\text{after}} = 0.5 \cdot 0.7219 + 0.5 \cdot 0.97095 = 0.8464$

**Information Gain:**

$\text{IG} = H_{\text{parent}} - H_{\text{after}} = 0.97095 - 0.8464 = 0.1245$

### **Weighted Child Impurity (Gini)**

$G_{\text{after}} = 0.5 \cdot 0.32 + 0.5 \cdot 0.48 = 0.40$

**Gini Reduction:**

$\Delta G = 0.48 - 0.40 = 0.08$
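The arithmetic in this example can be reproduced with a short script. The sketch below works from raw class counts rather than label arrays; the function names (`entropy`, `gini`, `impurity_decrease`) are chosen for illustration and are an assumption, not a fixed API.

```python
import math

def entropy(counts):
    """Entropy in bits from raw class counts (zero counts are skipped)."""
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

def gini(counts):
    """Gini impurity from raw class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def impurity_decrease(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n, n_left, n_right = sum(parent), sum(left), sum(right)
    after = (n_left / n) * impurity(left) + (n_right / n) * impurity(right)
    return impurity(parent) - after

# Class counts as (A, B), taken from the example above.
parent, left, right = (6, 4), (4, 1), (2, 3)

info_gain = impurity_decrease(parent, left, right, entropy)
gini_drop = impurity_decrease(parent, left, right, gini)
print(f"Information gain: {info_gain:.4f}")
print(f"Gini reduction:   {gini_drop:.4f}")
```

The printed values should agree with the hand calculation to four decimal places.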

---

## 4. Continuous Feature Example

Feature: Temperature  
Target: Play (Yes/No)

Sorted values:  
10, 14, 15, 18, 20, 22, 25

Labels (same order):  
No, No, Yes, Yes, Yes, No, Yes

### **Candidate split thresholds** (midpoints):

- 12  
- 14.5  
- 16.5  
- 19  
- 21  
- 23.5  

Example threshold $T = 16.5$:

Left (≤ 16.5): No, No, Yes  
Right (> 16.5): Yes, Yes, No, Yes  

### **Left Node Gini**

$p_{Yes} = 1/3,\quad p_{No} = 2/3$

$G_L = 1 - (1/3)^2 - (2/3)^2 = 4/9 \approx 0.4444$

### **Right Node Gini**

$p_{Yes} = 3/4,\quad p_{No} = 1/4$

$G_R = 1 - (3/4)^2 - (1/4)^2 = 3/8 = 0.375$

### **Weighted Gini**

$G_{\text{after}} = \frac{3}{7} \cdot \frac{4}{9} + \frac{4}{7} \cdot \frac{3}{8} \approx 0.4048$

Repeat this calculation for every candidate threshold. The best threshold is the one with the **minimum weighted impurity after splitting**.
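To compare all six candidates rather than only $T = 16.5$, the weighted Gini can be computed in a loop. This is a minimal sketch under one assumption: the labels are assigned to the sorted temperatures in the order implied by the left/right lists above (No, No, Yes, Yes, Yes, No, Yes). The variable names `temps`, `play`, and `scores` are illustrative.

```python
import numpy as np

def gini(labels):
    """Gini impurity of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Sorted temperatures and the labels assumed to go with them.
temps = np.array([10, 14, 15, 18, 20, 22, 25])
play = np.array(["No", "No", "Yes", "Yes", "Yes", "No", "Yes"])

# Midpoints between adjacent sorted values are the candidate thresholds.
thresholds = (temps[:-1] + temps[1:]) / 2

scores = {}
for t in thresholds:
    left, right = play[temps <= t], play[temps > t]
    # Size-weighted average of the two child impurities.
    scores[float(t)] = (len(left) * gini(left) + len(right) * gini(right)) / len(play)

for t, s in scores.items():
    print(f"T = {t:>5}: weighted Gini = {s:.4f}")

best_t = min(scores, key=scores.get)
print("Best threshold:", best_t)
```

With these labels the scan selects $T = 14.5$, where the left node contains only "No" samples and is therefore pure.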

---

## 5. Python Code: Best Split for a 1-D Feature  
(Entropy or Gini)


In [10]:
import numpy as np

In [11]:
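def entropy(y):
    """Entropy (in bits) of the class labels in y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()  # class proportions; never zero, so log2 is safe
    return -(p * np.log2(p)).sum()

def gini(y):
    """Gini impurity of the class labels in y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p**2)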

In [20]:
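# Example usage:
# Note: X is unsorted and differs from Section 4 (23 in place of 18),
# so the candidate thresholds below differ from the ones listed there.
X = np.array([10, 14, 15, 23, 20, 22, 25])
y = np.array(["No","No","Yes","Yes","Yes","No","Yes"])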
# choose impurity function
impurity_fn = gini


In [24]:
# sort by feature
idx = np.argsort(X)
print(idx)
X_sorted = X[idx]
y_sorted = y[idx]

# candidate thresholds = midpoints between adjacent sorted values
# (all values are distinct here; with repeated values, apply np.unique first)
thresholds = (X_sorted[:-1] + X_sorted[1:]) / 2
best_score = float("inf")
best_threshold = None
print(thresholds)


[0 1 2 4 5 3 6]
[12.  14.5 17.5 21.  22.5 24. ]


In [23]:
# inspect how the two slices pair up adjacent sorted values
print(X)
print(X_sorted)
print(X_sorted[:-1])
print(X_sorted[1:])

[10 14 15 23 20 22 25]
[10 14 15 20 22 23 25]
[10 14 15 20 22 23]
[14 15 20 22 23 25]


In [25]:
for t in thresholds:
    left = y_sorted[X_sorted <= t]
    right = y_sorted[X_sorted > t]

    if len(left) == 0 or len(right) == 0:
        continue  # skip invalid splits

    # weighted impurity after split: each child weighted by its share of samples
    score = ((len(left) / len(y)) * impurity_fn(left)
             + (len(right) / len(y)) * impurity_fn(right))

    if score < best_score:
        best_score = score
        best_threshold = t


In [27]:
print("Best threshold:", best_threshold)
print("Minimum weighted Gini:", best_score)

Best threshold: 14.5
Minimum weighted Gini: 0.22857142857142845
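The cells above can be folded into a single reusable function. The following is one possible sketch, not a definitive implementation: the name `best_split` and its return convention are assumptions made for illustration. It uses `np.unique` so that repeated feature values do not generate redundant thresholds.

```python
import numpy as np

def gini(y):
    """Gini impurity of the class labels in y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, impurity_fn=gini):
    """Return (threshold, weighted impurity) of the best split, or (None, inf)."""
    X, y = np.asarray(X), np.asarray(y)
    values = np.unique(X)  # sorted, duplicates removed
    thresholds = (values[:-1] + values[1:]) / 2
    best_threshold, best_score = None, float("inf")
    for t in thresholds:
        mask = X <= t
        left, right = y[mask], y[~mask]
        # Size-weighted average of the two child impurities.
        score = (len(left) * impurity_fn(left)
                 + len(right) * impurity_fn(right)) / len(y)
        if score < best_score:
            best_threshold, best_score = float(t), float(score)
    return best_threshold, best_score

# Same data as the cells above; no manual sorting is needed.
X = np.array([10, 14, 15, 23, 20, 22, 25])
y = np.array(["No", "No", "Yes", "Yes", "Yes", "No", "Yes"])
print(best_split(X, y))

# With repeated feature values, only midpoints between distinct values are tried.
print(best_split([1, 1, 2, 2], ["a", "a", "b", "b"]))
```

Because midpoints are taken between distinct values, both children are always non-empty, so the empty-split check from the loop above is not needed here. If the feature has a single distinct value, no split exists and the function returns `(None, inf)`.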
