# Decision Tree

### **Introduction**

Decision trees are among the most **intuitive** classifiers: they are **easy** to understand and to construct.

**Main application:** Database Mining  
**Key Limitation:** Prone to overfitting (without proper regularization)

**Structure**

- **Leaves (Terminal Nodes):**
    - Represent the target variable
    - Each leaf represents a class label

- **Internal Nodes:**
    - Test an attribute's value, with branches for possible outcomes.

- **Edges**
    - Connect nodes to children based on possible attribute test results.

**Decision Tree Learning**

**Goal:** Construct a decision tree from training data to classify new instances.

Common Algorithms:
- **ID3:**
- **C4.5:** 

**Key Characteristics:**
- Primarily used for **classification** (though regression variants exist).
- Usually operate on **discrete domains**

**Computational Complexity**
- Difference in **Feature Space**
    - Algorithms such as Linear Regression, SVM, etc.:
        - Operate in **continuous feature** spaces where optimization is often convex
    - Decision Tree:
        - Work with **discrete structures** (tree configurations), making the search space combinatorial.

- Complexity:
    - **NP-Completeness**:
        - Finding the globally optimal decision tree is NP-complete.
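        - Reason: The space of candidate trees is discrete and grows combinatorially, so it cannot be searched exhaustively or optimized with gradient-style methods.
    - Practical Solution:
        - Use **greedy top-down recursion** with **heuristic** splitting criteria (e.g., information gain).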

**Learning Strategy:**

- **Top-Down** Induction:
    - Start at the root, choose the attribute that **best splits the data** (e.g., maximizes information gain).
    - **Recursively** split subsets until:
        - All samples in a node belong to one class (**pure**).
        - **No further informative** splits exist.

- Attribute Usage:
    - Each **discrete** attribute is tested **only once** per path from root to leaf.
    - **Continuous** attributes may be **reused with different thresholds** (e.g., "Age > 30" vs. "Age > 50").

**Limitations**
- **Sub-optimality:** Greedy search may miss better trees deeper in the search space.
- **Overfitting:** Deep trees memorize training data

### **ID3**

**Constructing a Basic Decision Tree**

**Objective:** Build a **simple**, **compact** tree with **few nodes**.

**Key Principles:**
- **Simplicity:** Prefer shorter trees to avoid overfitting.
- **Homogeneity-Driven:** Splits aim to create pure subsets (all samples in a node share the same class).

**Properties**

- **Root Attribute Selection:**
    - Criterion: Choose the attribute that best separates the data into homogeneous groups.
    - Use a metric such as *Information Gain* or *Gini Impurity* *(both discussed later)*

- **Forming Descendant:**
    - **Rule:** For each value of the selected attribute $A$:
        - **Create a branch:** One per unique value of $A$
        - **Partition Data:** $S_v = \{x \in S | A(x) = v\}$
        - **Recursion:** Apply ID3 to $S_v$ with remaining attributes

- **Stop Condition:**
    - All samples in a node belong to the same class.
    - No attributes left to split.
    - Minimum samples per node reached (hyperparameter).

- **Guaranteed Consistency:**
    - ID3 will always find a decision tree that **perfectly fits** any **conflict-free** training set.
    - **No training** error in **conflict-free** training set.
    - *Conflict-free:* Identical feature vectors must have identical class labels.

- **Limitations**
    - Does not necessarily find the simplest tree (the one with the minimum number of nodes)
    - Makes **locally optimal** decisions at each node
    - No backtracking
    - Finding the globally minimal tree is **NP-complete**

> Function FindTree(S,A)  
> &nbsp;&nbsp;&nbsp;&nbsp;If empty(A) or all labels of the samples in $S$ are the same  
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;status $\leftarrow$ leaf  
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;class $\leftarrow$ most common class in the labels of $S$  
> &nbsp;&nbsp;&nbsp;&nbsp;else  
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;status $\leftarrow$ internal  
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(a, t) $\leftarrow$ bestAttribute(S,A)  
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;LeftNode $\leftarrow$ FindTree(S($a \gt t$), $A - \{a\}$)  
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;RightNode $\leftarrow$ FindTree(S($a \le t$), $A - \{a\}$)  
> &nbsp;&nbsp;&nbsp;&nbsp;end if  
> end function

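Where:
- Function input: S (Samples), A (Attributes)
- *bestAttribute* returns the best attribute test: the attribute $a$ together with its threshold $t$
- S($a \gt t$) and S($a \le t$) are the samples of S that pass or fail the test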

More Advanced Pseudocode:

>ID3 (Examples,Target_Attribute,Attributes)  
>&nbsp;&nbsp;&nbsp;&nbsp;Create a root node for the tree  
>&nbsp;&nbsp;&nbsp;&nbsp;If all examples are positive:  
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return the single-node, with label = +  
>&nbsp;&nbsp;&nbsp;&nbsp;If all examples are negative:  
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return the single-node, with label = -  
>&nbsp;&nbsp;&nbsp;&nbsp;If the set of predicting attributes is empty then:  
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return the single-node, with label = most common value of the target attribute in the examples  
>&nbsp;&nbsp;&nbsp;&nbsp;else:  
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;A = The Attribute that best classifies examples  
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Testing attribute for Root = A  
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;for each possible value, $v_i$ of A:  
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Add a new tree branch below A, corresponding to the test A = $v_i$  
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Let Examples($v_i$) be the subset of examples that have the value $v_i$ for A  
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if Examples($v_i$) is empty then:  
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;below this new branch add a leaf node with label = most common target value in the examples  
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;else  
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;below this new branch add subtree ID3(Examples($v_i$), Target_Attribute, Attributes – {A})  
>&nbsp;&nbsp;&nbsp;&nbsp;return Root  
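
The multiway ID3 pseudocode can be sketched in Python as follows. This is a minimal illustration for discrete attributes only, not a reference implementation: the function names (`entropy`, `information_gain`, `id3`, `predict`), the tuple-based tree layout, and the toy data are all assumptions made for this example. It branches only on attribute values that actually occur in a node, so the "Examples($v_i$) is empty" case never arises, and `predict` assumes every value it meets was seen during training.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy of the labels minus the weighted entropy after splitting on attr."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def id3(rows, labels, attributes):
    """Return a class label (leaf) or a tuple (attribute, {value: subtree})."""
    if len(set(labels)) == 1:          # pure node
        return labels[0]
    if not attributes:                 # no attribute left to test
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = id3([rows[i] for i in idx],
                              [labels[i] for i in idx], remaining)
    return (best, branches)


def predict(tree, row):
    """Follow the branches matching row until a leaf is reached."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[row[attr]]
    return tree


# Hypothetical toy data, made up for this example.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "overcast", "windy": "no"},
    {"outlook": "overcast", "windy": "yes"},
]
labels = ["play", "play", "play", "stay", "play", "play"]

tree = id3(rows, labels, ["outlook", "windy"])
print(tree)
```

Because the toy data is conflict-free, the resulting tree classifies every training row correctly, which matches the consistency property stated earlier.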

### **Metrics**

#### **Introduction**

To determine the optimal attribute for splitting, we evaluate candidates using **splitting heuristics**.

The two most common metrics are:
- **Information Gain**
- **Gini Impurity**

Each metric is computed on every subset produced by a candidate split, and the per-subset values are combined into a single score for the split, typically as an average weighted by subset size.

#### **Information Gain**

##### **Entropy**

Entropy measures the uncertainty or unpredictability in a specific distribution.

**Formula:**
$$H(X) = -\sum_{x_i \in X} p(x_i) \log{p(x_i)}$$

**Information theory:**
- **Expected Code Length:**
    - $H(X)$ is the average number of bits needed to encode one randomly drawn value of $X$.
    - *How many yes/no questions (bits) do we need, on average, to guess the value of $X$?*

- **Optimal Encoding:**
    - The most efficient code assigns $-\log_2{p(x_i)}$ bits to encode outcome $x_i$
    - Rare events should be assigned longer codes (because they happen less often).
    - Common events get shorter codes.

- **Uncertainty Relationship:**
    - Higher entropy $\rightarrow$ More uncertainty $\rightarrow$ Harder to predict outcomes.
    - Lower entropy $\rightarrow$ Less uncertainty $\rightarrow$ Easier to predict outcomes.

**Example: Entropy for a Boolean variable**

<div style="text-align:center">
  <img src="images/entropyOnaGrapah.png" alt="Entropy for a Boolean">
</div>

Where $p = P(X = 1)$ and, by convention, $0 \log_2{0} = 0$:
- $$H(p = 0) = -0 \log_2{0} - 1 \log_2{1} = 0$$
- $$H(p = 1) = -1 \log_2{1} - 0 \log_2{0} = 0$$
- $$H(p = 0.5) = -0.5 \log_2{\frac{1}{2}} - 0.5 \log_2{\frac{1}{2}} = 1$$
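
The Boolean case can be checked numerically with a few lines of Python. This is a small sketch; the function name `binary_entropy` is chosen for this example, and zero-probability outcomes are skipped to implement the convention $0 \log_2{0} = 0$.

```python
from math import log2


def binary_entropy(p):
    """Entropy in bits of a Boolean variable with P(X = 1) = p."""
    # Skip zero-probability outcomes: 0 * log2(0) is treated as 0.
    return sum(-q * log2(q) for q in (p, 1 - p) if q > 0)


for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p = {p:.1f}  H = {binary_entropy(p):.3f}")
```

The curve is symmetric around $p = 0.5$, where it peaks at 1 bit, and drops to 0 at both ends.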

**Conditional Entropy**

Conditional entropy $H(Y | X)$ measures the average uncertainty in $Y$ given knowledge of $X$:
$$H(Y | X) = E_{x \sim X} [H(Y | X = x)]$$
Where:
$H(Y | X = x)$ is the entropy of $Y$ for a fixed value $X = x$

Entropy for a fixed $X = x$:
$$H(Y | X = x) = -\sum_{y \in Y} p(y | x) \log{p(y | x)}$$

Take Expectation Over $X$:
$$H(Y | X) = \sum_{x \in X} p(x) H(Y | X = x)$$

Substitute $H(Y | X = x)$:
$$H(Y | X) = - \sum_{x \in X} p(x) \sum_{y \in Y} p(y | x) \log{p(y | x)}$$

Since $p(x, y) = p(x) p(y|x)$:
$$- \sum_{x \in X} \sum_{y \in Y} p(x, y) \log{\frac{p(x, y)}{p(x)}}$$
$$= \sum_{x \in X} \sum_{y \in Y} p(x, y) \log{\frac{p(x)}{p(x, y)}}$$
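
As a quick numerical check, the expectation form and the joint-distribution form of $H(Y | X)$ can be compared in Python. The joint distribution below is hypothetical and purely illustrative, as are the names `joint`, `p_x`, and `h_given`.

```python
from math import log2

# Hypothetical joint distribution p(x, y); the numbers are illustrative only.
joint = {
    ("sunny", "yes"): 0.30, ("sunny", "no"): 0.10,
    ("rainy", "yes"): 0.15, ("rainy", "no"): 0.45,
}

# Marginal p(x), obtained by summing the joint over y.
p_x = {}
for (x, _), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p


def h_given(x):
    """H(Y | X = x): entropy of the conditional distribution p(y | x)."""
    return sum(-(p / p_x[x]) * log2(p / p_x[x])
               for (xi, _), p in joint.items() if xi == x and p > 0)


# Form 1: expectation over X of the per-value entropies.
expectation_form = sum(p_x[x] * h_given(x) for x in p_x)

# Form 2: double sum over the joint distribution, using log(p(x) / p(x, y)).
joint_form = sum(p * log2(p_x[x] / p) for (x, _), p in joint.items() if p > 0)

print(expectation_form, joint_form)
```

Both expressions give the same value up to floating-point rounding, as the derivation predicts.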

##### **Information Gain (IG)**

- **Definition:** Information Gain (IG) **quantifies the reduction in uncertainty** (entropy) of the target variable $Y$ after splitting samples $S$ by attribute $A$.

- **Input:** Samples S, Attribute A.
- **Output:** Numerical value representing the utility of splitting on A.

**Formula:**
$$\text{Gain} (S, A) = H_S(Y) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H_{S_v}(Y)$$

Where:
- $A$: Variable used to split samples
- $Y$: Target variable
- $S$: Samples

**Formula Breakdown:**

- $H_S(Y)$: Entropy of $Y$ before splitting (baseline uncertainty).

- $H_{S_v}(Y)$: Entropy of $Y$ within the subset $S_v$ created by the split (lower means purer).

- $\sum_{v \in Values(A)} \frac{|S_v|}{|S|} H_{S_v}(Y)$: Weighted average uncertainty of subsets.

**Goal:**
- **Maximize IG:** Choose splits that most reduce uncertainty in $Y$.
    - **High IG:** Attribute A effectively separates classes.
    - **Low IG:** Attribute A provides little predictive power.
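
A worked example of the gain formula, as a Python sketch. The helper names `entropy` and `gain` are chosen for this example, and the 14-sample `outlook`/`play` data is illustrative (9 "yes" and 5 "no" labels split over three attribute values).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def gain(values, labels):
    """Gain(S, A) for one attribute: values[i] is A's value for sample i."""
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


# Illustrative data: sunny -> 2 yes / 3 no, overcast -> 4 yes, rain -> 3 yes / 2 no.
outlook = ["sunny"] * 5 + ["overcast"] * 4 + ["rain"] * 5
play = ["yes"] * 2 + ["no"] * 3 + ["yes"] * 4 + ["yes"] * 3 + ["no"] * 2

print(f"H_S(Y)           = {entropy(play):.3f}")
print(f"Gain(S, outlook) = {gain(outlook, play):.3f}")
```

Here the baseline entropy is about 0.940 bits, the "overcast" subset is pure and contributes nothing to the weighted sum, and the gain comes out at roughly 0.247 bits.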

##### **Mutual Information**

The expected reduction in entropy of $Y$ caused by knowing $X$:
$$I(X, Y) = H(Y) - H(Y | X)$$

Marginal Entropy:
$$H(Y) = -\sum_{j} p(y_j) \log{p(y_j)}$$

Conditional Entropy:
$$H(Y | X) = -\sum_{i} \sum_{j} p(x_i, y_j) \log{p(y_j | x_i)}$$
$$= \sum_{i} \sum_{j} p(x_i, y_j) \log{\frac{p(x_i)}{p(x_i, y_j)}}$$

Subtract Them:
$$H(Y) - H(Y | X) = -\sum_{j} p(y_j) \log{p(y_j)} + \sum_{i} \sum_{j} p(x_i, y_j) \log{p(y_j | x_i)}$$

Since (Law of Total Probability):
$$p(y_j) = \sum_{i}p(x_i, y_j)$$

Substitute into the Equation:
$$H(Y) - H(Y | X) = \sum_{j} \left(\sum_{i} p(x_i, y_j) \right) \log{\frac{1}{p(y_j)}} + \sum_{i} \sum_{j} p(x_i, y_j) \log{\frac{p(x_i, y_j)}{p(x_i)}}$$

Combine the Sums:
$$I(X, Y) = H(Y) - H(Y | X) = \sum_{i} \sum_{j} p(x_i, y_j) \log{\frac{p(x_i, y_j)}{p(x_i)p(y_j)}}$$

**Key Properties:**
- Marginal Entropy
- Conditional Entropy
- Mutual Information

**Marginal Entropy $H(Y)$:**  
- **Definition:**  
Measures the uncertainty (or disorder) in the target variable $Y$

- **Formula:**
$$H(Y) = -\sum_{j} p(y_j) \log{p(y_j)}$$

- **Interpretation:**  
High $H(Y)$: Labels are evenly distributed (harder to predict).  
Low $H(Y)$: One label dominates (easier to predict).

**Conditional Entropy $H(Y|X)$**  
- **Definition:**  
Expected entropy of $Y$ after splitting the data based on attribute $X$.

- **Formula:**
    $$H(Y | X) = \sum_{x \in X} p(x) H(Y | X = x)$$
    Where:
    $$H(Y | X = x) = -\sum_{y \in Y} p(y | x) \log{p(y | x)}$$

- **Interpretation:**  
Averages the entropy of $Y$ across all subsets created by splitting on $X$.  
Low $H(Y | X)$: $X$ creates homogeneous subsets (good split).

**Mutual Information $I(X, Y)$**  
- **Formula:**
    $$I(X, Y) = H(Y) - H(Y | X)$$
    Quantifies how much $X$ reduces uncertainty about $Y$.

- **Key Implications:**
    - **Independence Case $I(X, Y) = 0$:**   
        - Condition:
            $$I(X, Y) = 0 \iff p(X, Y) = p(X) p(Y)$$
            Occurs when the logarithm term equals zero (since $\log{1}=0$).
        - Interpretation:
            - $X$ and $Y$ are **independent**.
            - Knowing $X$ provides **no information** about $Y$

    - **Dependence Case $I(X, Y) \gt 0$**  
        - Interpretation:
            - $X$ **reduces uncertainty** about $Y$
            - Higher $I(X, Y)$ $\rightarrow$ Stronger dependence
            - In **Feature Selection**: Prefer higher mutual information.
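
The independence and dependence cases can be illustrated with a short Python sketch. The function name `mutual_information` and both joint distributions are assumptions made for this example; the function evaluates $\sum_{i}\sum_{j} p(x_i, y_j) \log_2{\frac{p(x_i, y_j)}{p(x_i) p(y_j)}}$ directly.

```python
from math import log2


def mutual_information(joint):
    """I(X, Y) in bits from a dict {(x, y): p(x, y)}."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return sum(p * log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)


# Independent case: the joint is built as the product of the marginals.
independent = {(x, y): px * py
               for x, px in (("a", 0.5), ("b", 0.5))
               for y, py in (("0", 0.25), ("1", 0.75))}

# Fully dependent case: Y is determined by X.
dependent = {("a", "0"): 0.5, ("b", "1"): 0.5}

print(mutual_information(independent))
print(mutual_information(dependent))
```

In the independent case every ratio inside the logarithm is 1, so the sum vanishes. In the dependent case knowing $X$ removes the full 1 bit of uncertainty in $Y$.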

**Selecting the Best Attribute for Splitting**

Identify the attribute that maximizes Information Gain (IG) or Mutual Information (MI) when splitting a set of samples $S$.  
This ensures the most significant reduction in uncertainty (entropy) about the target variable $Y$

**Recall: Mutual Information**
$$\text{Gain}(S, X_i) = H_S(Y) - H_S(Y | X_i)$$

**Optimal Attribute Selection:**  
Choose the $j$-th attribute as the test at this node, where:
$$j = \argmax_{i \in attributes} \text{Gain}(S, X_i)$$
Equivalently:
$$j = \argmax_{i \in attributes} H_S(Y) - H_S(Y | X_i)$$

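- **Single-Node Context:**
    - Within one node, $H_S(Y)$ is the same for every candidate attribute, so maximizing the gain is equivalent to minimizing the conditional entropy:
    $$j = \argmin_{i \in attributes} H_S(Y | X_i)$$

- **Cross-Node Comparison:**
    - When comparing splits across different nodes (with different sample sets $S$), $H_S(Y)$ varies.
    - Comparing $H_S(Y | X_i)$ directly is then invalid; compare the gains instead.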
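
The equivalence between the two selection rules inside a single node can be sketched in Python. The node data and the names `columns`, `by_gain`, and `by_cond` are hypothetical, chosen only for this illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def conditional_entropy(values, labels):
    """H_S(Y | X_i): weighted entropy of the labels within each value of X_i."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return sum(len(g) / len(labels) * entropy(g) for g in groups.values())


# Hypothetical node: attribute names and values are made up for illustration.
columns = {
    "outlook": ["sunny", "sunny", "rainy", "rainy", "overcast", "overcast"],
    "windy":   ["no", "yes", "no", "yes", "no", "yes"],
}
labels = ["yes", "yes", "no", "no", "yes", "yes"]

base = entropy(labels)  # H_S(Y), identical for every candidate in this node
by_gain = max(columns, key=lambda a: base - conditional_entropy(columns[a], labels))
by_cond = min(columns, key=lambda a: conditional_entropy(columns[a], labels))
print(by_gain, by_cond)
```

Since `base` is a constant offset within the node, both rules pick the same attribute.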

#### **Gini Impurity**

- Gini impurity is a **measure of impurity** or randomness used in decision tree algorithms.

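- It quantifies how mixed the class labels in a node are: the probability that a randomly drawn sample would be mislabeled if it were labeled at random according to the node's class distribution.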

- **Lower Gini** impurity $\rightarrow$ a more **homogeneous** node $\rightarrow$ most data points belong to the same class.

**Formula:**
$$G(X) = 1 - \sum_{x_i \in X} p(x_i)^2$$

<div style="text-align:center">
  <img src="images/gini.png" alt="Gini">
</div>
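
A minimal Python sketch of the Gini formula, together with the size-weighted average used to score a split. The function names `gini` and `weighted_gini` and the toy label lists are assumptions for this example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum(p_i^2) of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_gini(subsets):
    """Size-weighted average Gini impurity of the subsets produced by a split."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * gini(s) for s in subsets)


print(gini(["a", "a", "a", "a"]))                         # pure node
print(gini(["a", "a", "b", "b"]))                         # evenly mixed binary node
print(weighted_gini([["a", "a", "a"], ["a", "b", "b"]]))  # score of a two-way split
```

For two classes the impurity ranges from 0 (pure) to 0.5 (evenly mixed), and a candidate split is better the lower its weighted impurity is.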

### **Overfitting**

#### **Introduction**

**Optimal Decision Trees: Balancing Size and Generalization**

**Smaller trees are preferred**
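- There are fewer short hypotheses than long ones, so a short tree that fits the data is less likely to do so by coincidence (Occam's razor)
- Smaller trees have lower variance
- Smaller trees carry a lower risk of overfitting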

**Overfitting Problem In ID3**

ID3 perfectly classifies any conflict-free training set, which means it also fits the noise and coincidental regularities in that data.

**Causes**:
- **Small Data:**  
    With few samples, splits are chosen from unreliable statistics and may reflect coincidental patterns.

- **Noisy Data:**  
    Trees fit irrelevant outliers or mislabeled examples.

- **Irrelevant Attributes:**  
    ID3 may split on non-predictive features, creating useless branches.

Given:
- Hypothesis space $H$: decision trees
- Training (empirical) error of $h \in H: error_{train}(h)$
- Expected error of $h \in H: error_{true}(h)$

$h$ overfits training data if there is a $h' \in H$ such that:
- $error_{train}(h) \lt error_{train}(h')$
- $error_{true}(h) \gt error_{true}(h')$

<div style="text-align:center">
  <img src="images/overfittingDecisionTree.png" alt="Overfitting">
</div>

**Avoiding Overfit**

- **Early Stopping**
- **Pruning**

#### **Early Stopping**

**Stop growing** when the data split is not statistically significant.

#### **Pruning**

**Post-pruning** is a technique to refine decision trees by removing branches that overfit training data while preserving predictive accuracy.

Unlike pre-pruning (early stopping), it **first grows** the full tree and then **prunes it backward**.

It usually yields **better results** in practice than pre-pruning (early stopping).

**Training Process:**

- **Split Data**  
    Divide the dataset into:
    - **Training Set:** Used to build the full tree
    - **Validation Set:** Used to evaluate pruning decisions

- **Construct Full Tree**  
    Grow the tree until:
    - All leaves are **pure**
    - **No attributes** left

- **Iterative Pruning**  
    Repeat until further pruning is harmful:
    - **Evaluate impact on validation set** when pruning sub-tree rooted at each node
        - **Temporarily remove** sub-tree rooted at node
        - Replace it with a **leaf labeled** with the current majority class at that node
        - **Measure** and record error on validation set
    - **Greedily remove** the one that **most improves** validation set accuracy (if any).
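
The iterative pruning loop can be sketched in Python as follows. This is a sketch under stated assumptions, not a definitive implementation: the nested-dict tree layout (internal nodes carry `attr`, `branches`, and the training-set `majority` class; leaves carry `label`), the helper names, and the toy tree and validation data are all hypothetical.

```python
import copy


def predict(node, row):
    """Walk the tree; fall back to the node's majority class for unseen values."""
    while "attr" in node:
        child = node["branches"].get(row[node["attr"]])
        if child is None:
            return node["majority"]
        node = child
    return node["label"]


def accuracy(tree, rows, labels):
    return sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(rows)


def internal_nodes(node):
    """Yield every internal (non-leaf) node of the tree."""
    if "attr" in node:
        yield node
        for child in node["branches"].values():
            yield from internal_nodes(child)


def prune(tree, val_rows, val_labels):
    """Greedy pruning guided by validation accuracy; returns a pruned copy."""
    tree = copy.deepcopy(tree)
    while True:
        best_acc, best_node = accuracy(tree, val_rows, val_labels), None
        for node in internal_nodes(tree):
            saved = dict(node)
            node.clear()
            node["label"] = saved["majority"]   # temporarily replace by a leaf
            acc = accuracy(tree, val_rows, val_labels)
            node.clear()
            node.update(saved)                  # restore the sub-tree
            if acc > best_acc:
                best_acc, best_node = acc, node
        if best_node is None:                   # no pruning step helps any more
            return tree
        majority = best_node["majority"]
        best_node.clear()
        best_node["label"] = majority           # make the best pruning permanent


# Hypothetical full tree whose "windy" test fits a noisy training example.
tree = {
    "attr": "outlook", "majority": "play",
    "branches": {
        "sunny": {"label": "stay"},
        "rainy": {"attr": "windy", "majority": "play",
                  "branches": {"no": {"label": "play"}, "yes": {"label": "stay"}}},
    },
}
val_rows = [
    {"outlook": "sunny", "windy": "no"}, {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "yes"}, {"outlook": "rainy", "windy": "no"},
]
val_labels = ["stay", "stay", "play", "play"]

pruned = prune(tree, val_rows, val_labels)
print(pruned)
```

In this toy case, collapsing the "windy" sub-tree raises validation accuracy, while collapsing the root would lower it, so only the sub-tree is replaced by a leaf.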

### **C4.5**

*(Extension of ID3 with rule-based pruning and handling of continuous attributes)*

Decision tree = a set of rules

- Each path from root to leaf represents a conjunctive rule (AND conditions).
- All leaves with class $Y = i$ form a disjunctive rule (OR of paths).

**Training Process:**
- Learns a tree from data (like ID3, prone to overfitting)  
- Convert the tree into the equivalent set of rules
- Prune (generalize) each rule by removing any precondition whose removal improves the rule's estimated accuracy  
- Sort the pruned rules by their estimated accuracy  
- Consider them in this sequence when classifying new instances  
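
The tree-to-rules conversion step can be sketched in Python. Only the conversion is shown, not the rule pruning or sorting; the nested-dict tree layout, the function name `tree_to_rules`, and the toy tree are assumptions made for this example.

```python
def tree_to_rules(node, conditions=()):
    """Return a list of (conditions, label) pairs, one per root-to-leaf path."""
    if "label" in node:                      # leaf: the path so far is one rule
        return [(conditions, node["label"])]
    rules = []
    for value, child in node["branches"].items():
        rules += tree_to_rules(child, conditions + ((node["attr"], value),))
    return rules


# Hypothetical tree: internal nodes have "attr" and "branches", leaves have "label".
tree = {
    "attr": "outlook",
    "branches": {
        "sunny": {"label": "stay"},
        "rainy": {"attr": "windy",
                  "branches": {"no": {"label": "play"}, "yes": {"label": "stay"}}},
    },
}

rules = tree_to_rules(tree)
for conditions, label in rules:
    body = " AND ".join(f"{attr} = {value}" for attr, value in conditions)
    print(f"IF {body} THEN {label}")
```

Each rule is a conjunction of the tests along one path. Once the rules are separated like this, a precondition can be dropped from one rule without affecting the others.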

**C4.5 vs. ID3**

**In ID3 (Inflexible Pruning):**
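- **Tree-Structured Pruning:**  
    Pruning a decision node (e.g., the test on "Outlook") removes that test, together with the whole subtree below it, from **every path** that passes through the node.
    - **Limitation:** Overly rigid: the test is either kept for all of those paths or dropped for all of them.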

**In C4.5 (Flexible Rule-Based Pruning):**
- **Per-Rule Condition Removal:**  
    Each rule (derived from a root-to-leaf path) is pruned **independently**.
    - **Advantage:** Preserves informative attributes where they matter, improving generalization.

### **Continuous Attributes**


For **continuous attributes**, we must:
- **Define candidate thresholds** to discretize the feature.
    - A candidate threshold lies between two adjacent values (in sorted order) whose **class labels differ**.

- **Evaluate each threshold** using Information Gain (or Gini Impurity).

- **Select the optimal threshold** that maximizes separation between classes *(highest IG)*.
    - *Unlike discrete* attributes, continuous features can be reused with different thresholds in other branches.
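
The threshold search can be sketched in Python. Several choices here are assumptions for illustration: candidates are placed at the midpoint between neighbouring values, the test has the form `x > t`, the attribute values are assumed to be distinct, and the function name `best_threshold` and the temperature data are made up.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) maximizing information gain for a test x > t."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_t, best_gain = None, 0.0
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 == y2 or v1 == v2:
            continue                      # only boundaries where the class changes
        t = (v1 + v2) / 2                 # midpoint between neighbouring values
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        weighted = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(pairs)
        if base - weighted > best_gain:
            best_t, best_gain = t, base - weighted
    return best_t, best_gain


# Illustrative data: the class changes between 48 and 60, and between 80 and 90.
temperature = [40, 48, 60, 72, 80, 90]
play = ["no", "no", "yes", "yes", "yes", "no"]

threshold, gain = best_threshold(temperature, play)
print(threshold, round(gain, 3))
```

Only two candidates (54 and 85) need to be evaluated instead of every gap between values, and the first one yields the higher gain.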

### **Summary**

- **Simple**

- **Interpretable**

- Requires **little data preparation**

- Handles both **numerical & categorical** data

- **Time efficient** $\rightarrow$ Can be used on **large datasets**

- **Robust:** Performs well even if its assumptions are somewhat violated

- **Issue: Bias Towards High-Cardinality Attributes**
    - Information Gain tends to favor attributes with many unique values, even if they’re non-predictive.

    - Attributes with many values can artificially split data into tiny, pure subsets.

    - Example: *Visit Date* Attribute
        - 100 patients, each with a unique visit date.
        - Splitting on the date puts each patient in their own trivially pure subset, so IG reaches its maximum value $H_S(Y)$, yet the split cannot generalize to patients with new visit dates.

- **High Variance:** Small changes in the training data can produce very different trees, so unpruned trees tend to overfit
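
The bias towards high-cardinality attributes can be reproduced with a short Python sketch of the *Visit Date* example. The 100 patients, their outcomes, and the `fever` attribute are hypothetical data generated for this illustration, and the helper names are chosen for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def gain(values, labels):
    """Information gain of splitting the labels by the given attribute values."""
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


# 100 hypothetical patients: every fourth one is sick.
outcome = ["sick" if i % 4 == 0 else "healthy" for i in range(100)]
visit_date = [f"day-{i}" for i in range(100)]                     # unique per patient
fever = ["yes" if i % 4 in (0, 1) else "no" for i in range(100)]  # partly predictive

print(f"H_S(Y)            = {entropy(outcome):.3f}")
print(f"Gain(visit_date)  = {gain(visit_date, outcome):.3f}")
print(f"Gain(fever)       = {gain(fever, outcome):.3f}")
```

The unique visit date achieves the maximum possible gain, equal to the full entropy of the outcome, even though it says nothing about future patients. The genuinely informative `fever` attribute scores lower.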