## Weighted entropy 

Weighted entropy is how a decision tree scores a candidate split. It computes the entropy of each child node, weights each one by the fraction of samples that land there, sums the results, and compares that sum with the parent’s entropy to see how much uncertainty the split removed.

#### Recap: entropy and nodes

- **Entropy** at a node measures how uncertain the class label is at that node.  
  - Mixed classes → high entropy (more uncertainty).  
  - Mostly one class → low entropy (less uncertainty).
- At the **root**, all training samples are together, so entropy is usually high.  
- As you move down the tree and ask more questions, nodes generally become purer and entropy goes down.
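To make the recap concrete, here is a minimal sketch of node entropy computed from per-class sample counts. The function name `entropy` and the example counts are illustrative assumptions, and base-2 logarithms are assumed because they are consistent with the root value of 1.58 used later in this section.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a node, given its per-class sample counts."""
    total = sum(counts)
    # Empty classes are skipped: a class with zero samples contributes nothing.
    return sum((c / total) * log2(total / c) for c in counts if c > 0)

mixed = entropy([50, 50, 50])  # three equally common classes: high entropy
skewed = entropy([48, 1, 1])   # mostly one class: low entropy
pure = entropy([50, 0, 0])     # a single class: zero entropy

print(round(mixed, 3), round(skewed, 3), round(pure, 3))
```

The evenly mixed node scores highest, the skewed node scores much lower, and the pure node scores zero, matching the bullets above.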

#### What is *weighted* entropy (WS)?

Entropy by itself ignores how many samples are in a node. Weighted entropy fixes that by scaling entropy by the node’s size relative to the whole dataset.

- Let a node contain some number of samples (say 54 out of 150).  
- Its **weighted entropy** (WS) is:  
  - WS = (fraction of all samples in this node) × (entropy of this node).

Examples:

- Node with 54 samples, entropy = 0.445:  
  - Fraction of samples: 54 / 150.  
  - WS = (54/150) × 0.445 ≈ 0.16.  
- Terminal node with 3 samples, entropy = 0.918:  
  - Fraction of samples: 3 / 150.  
  - WS = (3/150) × 0.918 ≈ 0.018.
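The two examples above can be checked with a few lines of Python. This is only a sketch; `weighted_entropy` is a hypothetical helper name, and the entropies are taken as given from the text rather than recomputed.

```python
def weighted_entropy(node_entropy, node_size, total_size):
    """Scale a node's entropy by its share of the whole dataset."""
    return (node_size / total_size) * node_entropy

ws_internal = weighted_entropy(0.445, node_size=54, total_size=150)
ws_leaf = weighted_entropy(0.918, node_size=3, total_size=150)

print(f"{ws_internal:.3f} {ws_leaf:.3f}")
```

Although the 3-sample node has the higher entropy, its weighted entropy is roughly nine times smaller, because it covers only 2% of the data.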

Key intuition:

- Big nodes count more than tiny nodes.  
- A large node becoming “tidier” (lower entropy) is more important than a tiny node getting cleaner, because it affects more of the data.

At the **root**:

- The root contains all samples, so fraction = 150 / 150 = 1.  
- Therefore WS(root) = entropy(root) ≈ 1.58 (three equally common classes give log₂ 3 ≈ 1.58 bits).

#### Using weighted entropy to measure a split

For **one split rule** (e.g., `petal_length ≤ 2.45`), you have:

- One **parent** node (before the question).  
- Two **child** nodes (after the question: “Yes” branch and “No” branch).

Steps:

1. Compute entropy of each child node.  
2. Compute WS for each child = (size fraction) × (its entropy).  
3. Add child WS values to get the **total child WS**.  
4. Compare parent WS and total child WS:
   - **ΔWS = (parent WS) − (sum of child WS)**.  

Interpretation:

- ΔWS is the **reduction in weighted entropy** – how much uncertainty was removed by asking that question.
- Higher ΔWS means a better split (you learned more and made nodes purer overall).
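The four steps can be sketched as a small function. The names `entropy`, `weighted_entropy`, and `delta_ws` are illustrative, not a library API. The counts are assumptions chosen for illustration: a 100-sample parent (50 samples of each of two classes) inside a 150-sample dataset, split into children of 54 and 46 samples, where the 54-sample child's class counts (49 and 5) reproduce the entropy of 0.445 quoted earlier.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) from per-class sample counts."""
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts if c > 0)

def weighted_entropy(counts, dataset_size):
    """Entropy of a node scaled by its share of the whole dataset."""
    return (sum(counts) / dataset_size) * entropy(counts)

def delta_ws(parent_counts, children_counts, dataset_size):
    """Parent WS minus the summed WS of the children."""
    parent_ws = weighted_entropy(parent_counts, dataset_size)
    child_ws = sum(weighted_entropy(c, dataset_size) for c in children_counts)
    return parent_ws - child_ws

# Non-root parent, so its own fraction is 100/150 rather than 1.
gain = delta_ws([50, 50], [[49, 5], [1, 45]], dataset_size=150)
print(round(gain, 2))
```

Because every node is weighted against the whole dataset, ΔWS values computed at different depths of the tree are directly comparable.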

#### Example: `petal_length ≤ 2.45`

Root:

- WS(root) = 1.58 (since it has all 150 samples).  

Split: `petal_length ≤ 2.45`:

- Left child: 50 samples, all setosa → entropy = 0 → WS(left) = 0.  
- Right child: 100 samples, evenly mixed versicolor/virginica → entropy = 1.0 → WS(right) = (100/150) × 1.0 ≈ 0.67.  

Total child weighted entropy:

- Sum = 0 + 0.67 = 0.67.  

Reduction:

- ΔWS = 1.58 − 0.67 = 0.91.  

Meaning:

- This split reduces uncertainty by 0.91, a large drop.  
- It creates a perfectly pure left node and a right node that holds only two of the three classes, so it’s a very strong first rule.
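The numbers in this example can be reproduced from class labels alone. The sketch below assumes the standard Iris class balance (50 samples per species) and stands in for the `petal_length ≤ 2.45` rule by filtering on the label, since that rule isolates exactly the setosa samples; no feature values are used.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum((k / n) * log2(n / k) for k in Counter(labels).values())

root = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
left = [y for y in root if y == "setosa"]   # samples the rule sends left
right = [y for y in root if y != "setosa"]  # everything else

n = len(root)
ws_root = entropy(root)                     # fraction is 150/150 = 1
ws_left = len(left) / n * entropy(left)
ws_right = len(right) / n * entropy(right)
delta = ws_root - (ws_left + ws_right)

print(f"{ws_root:.2f} {ws_left:.2f} {ws_right:.2f} {delta:.3f}")
```

Computed without intermediate rounding, the reduction comes out at about 0.918; the 0.91 in the text results from subtracting the already rounded values 1.58 and 0.67.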

#### How the algorithm uses ΔWS in practice

Putting it into the decision tree generation algorithm:

1. At a node, list candidate questions like:
   - `petal_length ≥ 4.0?`  
   - `sepal_length ≥ 0.5?`  
   - `petal_width ≥ 0.8?`  
2. For each candidate:
   - Simulate splitting the data.  
   - Compute child entropies and WS.  
   - Compute ΔWS (reduction in weighted entropy).  
3. Choose the rule with the **largest ΔWS** (greatest drop in uncertainty).  
4. Apply that split, creating child nodes.  
5. Repeat recursively on each child node until stopping criteria are met.

This is a **greedy** strategy:

- At each step, it chooses the best split *locally* (largest ΔWS at that node).  
- It does not look ahead to future levels of the tree, so the finished tree is not guaranteed to be the best possible one, but in practice this local rule works very well.
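The selection step (steps 1 to 3) can be sketched as an exhaustive search over features and thresholds. This is a simplified illustration, not how any particular library is implemented: the function name `best_split`, the choice of midpoints between consecutive distinct values as candidate thresholds, the `≤` direction of the rule, and the six-row toy dataset are all assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum((k / n) * log2(n / k) for k in Counter(labels).values())

def best_split(rows, labels, total_size):
    """Return (gain, feature_index, threshold) for the largest ΔWS at this node."""
    parent_ws = len(labels) / total_size * entropy(labels)
    best = None
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            child_ws = sum(len(part) / total_size * entropy(part)
                           for part in (left, right))
            gain = parent_ws - child_ws
            # Strict comparison: on a tie, the first candidate found is kept.
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best

# Toy data (made up): two numeric features per row, three classes.
rows = [(1.4, 0.2), (1.5, 0.3), (4.5, 1.5), (4.7, 1.4), (5.8, 2.2), (6.0, 2.5)]
labels = ["a", "a", "b", "b", "c", "c"]

gain, feature, threshold = best_split(rows, labels, total_size=len(rows))
print(feature, threshold, round(gain, 3))
```

A full tree builder would apply the winning rule, then call `best_split` again on each child (steps 4 and 5), passing the same `total_size` so that gains stay weighted against the whole dataset. In this toy data several candidates tie for the largest ΔWS, and the sketch simply keeps the first one it encounters.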

#### Big picture

- **Entropy** measures impurity / uncertainty at a node.  
- **Weighted entropy (WS)** combines entropy with node size, so big nodes count more.  
- **ΔWS** (information gain) measures how much a split reduces overall uncertainty.  
- At each step, the decision tree **chooses the split with the largest ΔWS**, i.e., where it learns the most and creates the cleanest children overall.  

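This is essentially how scikit‑learn and similar libraries select decision tree splits when entropy is the chosen criterion (in scikit‑learn, `criterion="entropy"`; the default is Gini impurity). It also explains why, in the Iris example, `petal_length ≤ 2.45` is chosen as the first, strongest split. The rule `petal_width ≤ 0.8` separates setosa equally well, which is why some fitted trees show that rule at the root instead.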
