## Entropy

Entropy in decision trees is a way to **measure how mixed or uncertain the class labels are at a node**, and the tree uses this to decide which split is best.

#### Decision tree basics and nodes

A decision tree is built **top‑down**, starting from a single root node that contains all the training data.

- At each step, the algorithm:
  - Looks at all features (like `petal_length`, `petal_width` in Iris).
  - Tries possible split values (for numeric features, thresholds like “≥ 2.0”).
  - Chooses the split that best separates the classes according to an impurity measure such as entropy.
- The root node is the starting point; deeper levels correspond to more detailed decisions about the data.

Two special types of nodes:

- **Pure node**: All samples in that node have the same class label (e.g., all are “setosa”).  
- **Unsplittable node**: Contains samples from different classes that are **exactly overlapping in feature space**, so no split on available features can separate them (e.g., multiple flowers with identical petal length and width but different species).

The algorithm stops when:

- All nodes are pure, or
- Nodes are unsplittable, or
- Some stopping rule is met (max depth, min samples per leaf, etc.).
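
The first two stopping conditions can be checked directly from the samples in a node. The following is a minimal sketch rather than how any particular library does it: the helper names `is_pure` and `is_unsplittable`, and the representation of a node as parallel lists of feature tuples and labels, are assumptions made for illustration. It only covers the simplest unsplittable case, where every sample in the node has the same feature values.

```python
def is_pure(labels):
    """True when every sample in the node carries the same class label."""
    return len(set(labels)) <= 1


def is_unsplittable(features, labels):
    """True when classes are mixed but all feature vectors are identical,
    so no threshold on any feature can separate the samples."""
    return not is_pure(labels) and len(set(features)) == 1


# Hypothetical node: two flowers with identical (petal_length, petal_width)
# but different species.
overlap_features = [(4.8, 1.8), (4.8, 1.8)]
overlap_labels = ["versicolor", "virginica"]

print(is_pure(["setosa", "setosa", "setosa"]))            # True
print(is_unsplittable(overlap_features, overlap_labels))  # True
```

The third stopping condition (max depth, min samples per leaf) depends on user-chosen settings, so it is left out of the sketch.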

#### Intuitive look at “best split”

In the Iris example, think of a 2D scatter plot with:

- X‑axis: `petal_length`.  
- Y‑axis: `petal_width`.  
- Points colored by species.

Possible splits are straight **horizontal or vertical lines**:

- Example 1: `petal_width < 1.5` vs `≥ 1.5`.  
  - The `≥ 1.5` side holds mostly virginica plus some others (e.g., 0 setosa, 15 versicolor, 49 virginica); the remaining samples fall on the `< 1.5` side.  
  - This split is **somewhat useful** because it gathers many virginica together, but it still mixes classes.[6][2]
- Example 2: `petal_length ≥ 4.0`.  
  - This may put all virginicas into one node, which feels **better** because the node is more homogeneous.[6]
- Example 3: `petal_width ≥ 0.5`.  
  - This keeps most setosa points together (e.g., 48 of them on the `< 0.5` side), concentrating that class in a single node.[6]
- Example 4: `petal_width ≥ 0.8`.  
  - This line can practically separate setosa entirely from the others, creating a very pure node for that class.[6]

From these examples:

- Different candidate lines (splits) give different **purities** in the child nodes.
- Some splits “feel” better because one side is mostly a single class.
- The goal of the algorithm is to **automate** this intuitive search: try many possible splits and pick the one that makes the resulting nodes as pure as possible.

Entropy provides the **numeric score** for “how good” a node or a split is.

#### What is entropy at a node?

Entropy is a number that measures **uncertainty, randomness, or impurity** in the class labels of a node.

- If all samples in a node belong to **one** class:
  - The node is fully predictable.
  - Entropy is **0** (perfectly pure, no uncertainty).
- If the node has a **mix** of classes with similar proportions:
  - The node is hard to predict.
  - Entropy is **high**.

To compute entropy for a node:

1. Look at all samples in that node.  
2. Count how many belong to each class.  
3. Convert counts to proportions $ p_C $ (e.g., 34 out of 110 is about 0.31 for class 0).  
4. Plug these proportions into the entropy formula, $ H = -\sum_C p_C \log_2 p_C $ (you don’t need the exact math for intuition).
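
The four steps above can be sketched in a few lines of Python. This is an illustrative helper only (the name `entropy_from_counts` is an assumption, not part of any library), and it assumes base-2 logarithms so that the result is measured in bits.

```python
import math


def entropy_from_counts(counts):
    """Entropy (in bits) of a node, given the number of samples per class."""
    total = sum(counts)
    # Steps 1-3: turn counts into proportions, skipping empty classes
    # (a class with p = 0 contributes nothing to the sum).
    proportions = [c / total for c in counts if c > 0]
    # Step 4: sum of p * log2(1/p), which equals -sum(p * log2(p)).
    return sum(p * math.log2(1 / p) for p in proportions)


print(round(entropy_from_counts([34, 36, 40]), 2))  # 1.58
print(round(entropy_from_counts([10, 0, 0]), 2))    # 0.0
```

Writing each term as `p * log2(1 / p)` instead of `-p * log2(p)` is a small design choice: both are mathematically identical, but the first form returns `0.0` rather than `-0.0` for a pure node.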


Example from the Iris dataset:

- At the root:
  - Class 0: 34 samples → $ p_0 ≈ 0.31 $.  
  - Class 1: 36 samples → $ p_1 ≈ 0.33 $.  
  - Class 2: 40 samples → $ p_2 ≈ 0.36 $.  
- Using the formula, the entropy of this node is **about 1.58**.
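
Written out with the root counts, and assuming base-2 logarithms (entropy in bits), the calculation looks like this:

$$
H_{\text{root}} = -\left(\tfrac{34}{110}\log_2\tfrac{34}{110} + \tfrac{36}{110}\log_2\tfrac{36}{110} + \tfrac{40}{110}\log_2\tfrac{40}{110}\right) \approx 0.524 + 0.527 + 0.531 \approx 1.58
$$

Each class contributes roughly the same amount, which is what a nearly even mix looks like in the formula.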

Interpretation:

- The three classes are present in similar proportions.
- If you see a random sample at this root, you are quite uncertain which class it belongs to.
- This high entropy reflects that uncertainty.

Another node in the tree (left child):

- Class 0: 31 samples.  
- Class 1: 4 samples.  
- Class 2: 1 sample.  
- Proportions:
  - $ p_0 ≈ 0.86 $.  
  - $ p_1 ≈ 0.11 $.  
  - $ p_2 ≈ 0.028 $.  
- Using the formula, entropy is about **0.68**.
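
The same calculation for this node, again assuming base-2 logarithms, can be written out as:

$$
H_{\text{left}} = -\left(\tfrac{31}{36}\log_2\tfrac{31}{36} + \tfrac{4}{36}\log_2\tfrac{4}{36} + \tfrac{1}{36}\log_2\tfrac{1}{36}\right) \approx 0.186 + 0.352 + 0.144 \approx 0.68
$$

The dominant class contributes only about 0.19; most of the remaining entropy comes from the few samples of the two minority classes.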

Interpretation:

- This node is **much more dominated** by class 0 (e.g., setosa).  
- A sample reaching this node is very likely to be class 0.  
- The node is more predictable, so entropy is lower.

Some key numeric patterns:

- If all samples are in **one class** → entropy = 0 (pure node).  
- If data are **evenly split between 2 classes** (50–50) → entropy = 1 (maximum uncertainty for binary classification).
- If data are **evenly split between 3 classes** (about 1/3 each) → entropy ≈ 1.58.
- More generally, if there are C classes and they are evenly split, entropy = log₂(C). This grows with the number of equally likely classes, reflecting increasing uncertainty.
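
These patterns are easy to check numerically. The snippet below is a small sketch (the helper name `uniform_entropy` is hypothetical) that sums the entropy terms for an even split across C classes and prints the result next to log₂(C).

```python
import math


def uniform_entropy(num_classes):
    """Entropy (in bits) when samples are split evenly across num_classes."""
    p = 1 / num_classes
    return sum(p * math.log2(1 / p) for _ in range(num_classes))


# Each row shows C, the summed entropy, and log2(C) for comparison.
for c in (1, 2, 3, 4):
    print(c, round(uniform_entropy(c), 2), round(math.log2(c), 2))
```

The two numeric columns agree on every row, including 0 for a single class, 1 for two classes, and about 1.58 for three.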

#### How entropy changes as you go down the tree

As you move from the root down to leaves:

- At the **root**, the class distribution is very mixed → entropy is high.  
- After a good split, each child node has more skewed distributions:
  - One child might be mostly setosa.
  - Another might be a mix of versicolor and virginica.  
- Entropy **decreases** in nodes where one class becomes more likely than the others.

Eventually:

- At a **leaf node** where all samples belong to the same class (e.g., all setosa), entropy becomes **0.0** (no uncertainty).
- In the Iris tree you built earlier, if you check the entropy values on the visualization, you will see:
  - High entropy near the top.
  - Lower entropy near the bottom.
  - Zero entropy at pure leaves.

So entropy acts like a **“purity meter”** along the path from root to leaf:

- High value → messy / unpredictable node.  
- Low value → clean / predictable node.

#### Why entropy is useful for splitting (information idea)

Even without explicit formulas for information gain, the key idea is:

- **Goal at each split**: make child nodes that are **more pure** (lower entropy) than the parent.
- For a candidate split on a feature:
  - Compute entropy of the left child node.
  - Compute entropy of the right child node.
  - Take a **weighted average** based on how many samples go left vs right.
- A split is considered **better** if:
  - The weighted average entropy of its children is **smaller**.  
  - That means the split produced cleaner, more homogeneous nodes.

Conceptually:

- **Information gain** = “how much uncertainty was removed by this split.”  
- High information gain (big drop in entropy) means that feature choice and threshold did a good job in organizing the data.
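
The weighted average and the resulting information gain can be sketched as follows. The function names are illustrative, not a library API. The class counts are assumptions: they take the full 150-sample Iris dataset (50 flowers per species), the `petal_width ≥ 1.5` counts quoted earlier, and a `petal_width ≥ 0.8` split that isolates setosa completely.

```python
import math


def entropy(counts):
    """Entropy (in bits) from per-class sample counts."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted average of the child entropies."""
    n = sum(parent)
    weighted = (sum(left) / n) * entropy(left) + (sum(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Assumed counts in the order (setosa, versicolor, virginica).
# "left" is the side below the threshold, "right" the side at or above it.
parent = [50, 50, 50]
gain_width_15 = information_gain(parent, left=[50, 35, 1], right=[0, 15, 49])
gain_width_08 = information_gain(parent, left=[50, 0, 0], right=[0, 50, 50])

print(round(gain_width_15, 2), round(gain_width_08, 2))  # 0.64 0.92
```

Under these assumed counts, the split at 0.8 removes more uncertainty than the split at 1.5, which matches the intuition that isolating one class entirely is the better first question.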

In practice:

- For each feature and split value, the algorithm computes the child-node entropies and picks the split with the **largest drop** from parent entropy to child weighted average.  
- This is how the tree automatically finds splits like “petal_width ≥ 0.8” that nearly isolate a single class (e.g., setosa) in one branch.
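
A minimal sketch of that search for a single numeric feature is shown below. Several things here are assumptions for illustration: the function name `best_threshold`, the use of midpoints between adjacent distinct values as candidate thresholds (a common choice, not the only one), and the toy data, which is made up and not taken from the real Iris measurements. A full tree builder would repeat this for every feature and keep the overall winner.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Try midpoints between sorted distinct values; return (threshold, gain)."""
    n = len(labels)
    parent_entropy = entropy(labels)
    best = (None, 0.0)
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x < threshold]
        right = [y for x, y in zip(values, labels) if x >= threshold]
        weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        gain = parent_entropy - weighted
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical toy data, not real Iris measurements.
petal_width = [0.2, 0.4, 0.6, 1.0, 1.3, 1.5, 1.8, 2.1]
species = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 2

threshold, gain = best_threshold(petal_width, species)
print(round(threshold, 2), round(gain, 2))  # 0.8 0.95
```

On this toy data the search lands on the threshold that puts all setosa samples on one side, because that split gives the largest drop in weighted entropy.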

#### Putting it all together

- A decision tree keeps **asking questions** (“Is petal_width ≥ 0.8?”, “Is petal_length ≥ 4.0?”, etc.) to split data into more homogeneous groups.
- **Entropy** is the number the tree uses to measure how “mixed” the labels are at each node: high means mixed and uncertain, low means dominated by one class and predictable.
- At each step, the algorithm:
  - Tries many possible splits.
  - For each split, looks at how much the entropy of the child nodes drops compared to the parent.
  - Picks the split that makes child nodes as pure (low entropy) as possible.
- As you go down the tree:
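  - Entropy generally decreases along each path: the weighted average entropy of the children never exceeds the parent’s entropy, although a single child can occasionally be more mixed than its parent.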
  - Pure leaves have entropy 0 and correspond to confident predictions.

This is why node entropy is central to top‑down decision tree construction: it gives a **rigorous, numerical way** to define “best split,” instead of relying only on visual intuition about class boundaries.

Sources:

- [1](https://www.educative.io/answers/how-is-entropy-used-to-build-a-decision-tree)
- [2](https://www.geeksforgeeks.org/data-science/how-to-calculate-entropy-in-decision-tree/)
- [3](https://sebastianraschka.com/faq/docs/decisiontree-error-vs-entropy.html)
- [4](https://100daysofml.github.io/Week_06/Lesson_29.html)
- [5](https://en.wikipedia.org/wiki/Information_gain_(decision_tree))
- [6](https://towardsdatascience.com/decision-trees-explained-entropy-information-gain-gini-index-ccp-pruning-4d78070db36c/)
- [7](https://www.geeksforgeeks.org/machine-learning/gini-impurity-and-entropy-in-decision-tree-ml/)
- [8](https://www.baeldung.com/cs/impurity-entropy-gini-index)
- [9](https://community.deeplearning.ai/t/explanation-of-the-formula-for-information-gain-in-the-decision-nodes/540497)
- [10](https://www.reddit.com/r/learnmachinelearning/comments/1e4nnb2/what_actually_the_entropy_in_decision_trees/)