# Ensemble Learning

### **Introduction**

**Definition:**  
Ensemble learning combines multiple hypotheses (models) to form a (hopefully) better-performing hypothesis.  

**Ensemble vs. Multiple Classifier**
- **Ensemble:**  
    Uses the **same base learner** to generate hypotheses.

- **Multiple Classifier (Broader term):**  
    Combines hypotheses from **different base learners**.

**Strong Learner vs. Weak (Simple) Learner**
- **Strong Learner:**
    - A classifier that achieves **low classification error** (high accuracy).

<br>

- **Weak Learner:**  
    - A classifier that performs slightly better than random guessing.
    - Two common Types:
        - **Low Variance, High Bias**
        - **High Variance, Low Bias**

**Ensemble Methods**

- **Parallel Methods**  
    Weak learners are trained **independently** and in **parallel**.
    - Requires **low-bias** weak learners
    - **Reduces variance** in predictions

<br>

- **Sequential Methods**  
    Weak learners are trained **consecutively**, with each new learner correcting the errors of the previous ones.
    - Requires **low-variance** weak learners
    - **Reduces bias** and improves model accuracy
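
The variance-reduction claim for parallel methods can be illustrated with a small simulation. This is only a sketch under an idealized assumption: each "learner" is modeled as the true value plus independent zero-mean noise (the function names and noise model are hypothetical, not part of the notes). Real bagged models are correlated, so the reduction is smaller in practice.

```python
import random
import statistics

def noisy_estimate(rng, truth=1.0, noise=1.0):
    """One hypothetical low-bias, high-variance learner: truth plus zero-mean noise."""
    return truth + rng.gauss(0.0, noise)

def averaged_estimate(rng, M):
    """Parallel ensemble: average M independent noisy estimates."""
    return sum(noisy_estimate(rng) for _ in range(M)) / M

rng = random.Random(0)
single = [noisy_estimate(rng) for _ in range(2000)]
ensemble = [averaged_estimate(rng, M=25) for _ in range(2000)]

# With independent learners the variance shrinks roughly by a factor of M,
# while the mean (and therefore the bias) stays the same.
var_single = statistics.pvariance(single)
var_ensemble = statistics.pvariance(ensemble)
print(f"single: {var_single:.3f}  ensemble of 25: {var_ensemble:.3f}")
```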

**Ensemble Learning Methods**

- **Bagging**
    - Parallel
    - To decrease variance
    - Random Forest

<br>

- **Boosting**
    - Sequential
    - To decrease the bias (enhance capabilities)
    - AdaBoost, XGBoost

### **Bagging**

#### **Introduction**

Bagging: **B**ootstrap **agg**regat**ing**

**Definition:**  
A parallel ensemble method that reduces variance by combining multiple weak learners trained on bootstrapped datasets.

**Key Concepts:**
- **Bootstrap Resampling:**
    - Creates multiple datasets by sampling **with replacement** from the original training data.
    - Each dataset has the same size as the original but may contain duplicates.

- **Aggregation:**
    - **Classification:** Majority voting.
    - **Regression:** Averaging predictions.

- **Works Best When:**
    - Base learners are **unstable** (high variance, low bias).
    - **Low error correlation** between models (diverse predictions).
    - Base learners are **nearly unbiased**.

**Bagging Algorithm**

**Input:**
- $M$: Required ensemble size
- $D = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)}) \}$

**Steps:**
> for $t = 1$ to $M$ do:  
> &nbsp;&nbsp;&nbsp;&nbsp;Build a dataset $D_t$ by sampling $N$ items, randomly with replacement from $D$ (Bootstrap Resampling)  
> &nbsp;&nbsp;&nbsp;&nbsp;Train model $h_t$ using $D_t$ and add it to ensemble  
> end for  
> $H(x) = \text{sign}(\sum_{t=1}^M h_t(x))$  

*Aggregate models by voting for classification or by averaging for regression*
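
The bagging loop above can be sketched in plain Python. The base learner here is a hypothetical 1-D threshold rule chosen only to keep the example short; the function names (`bootstrap_sample`, `train_stump`, `bagging`, `predict`) and the toy data are assumptions for illustration.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

def train_stump(data):
    """Hypothetical base learner: best 1-D rule 'predict +1 if x > thr'."""
    xs = sorted({x for x, _ in data})
    # Candidate thresholds: below all values, midpoints, and above all values.
    candidates = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]

    def errors(thr):
        return sum((1 if x > thr else -1) != y for x, y in data)

    best = min(candidates, key=errors)
    return lambda x: 1 if x > best else -1

def bagging(data, M, seed=0):
    """Train M base learners, each on its own bootstrap dataset D_t."""
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(data, rng)) for _ in range(M)]

def predict(ensemble, x):
    """Majority vote; for labels in {-1, +1} this matches sign(sum of votes)."""
    votes = Counter(h(x) for h in ensemble)
    return votes.most_common(1)[0][0]

data = [(0.5, -1), (1.0, -1), (1.5, -1), (2.5, 1), (3.0, 1), (3.5, 1)]
ensemble = bagging(data, M=11)
print(predict(ensemble, 0.0), predict(ensemble, 4.0))
```

An odd ensemble size avoids ties in the vote for binary labels.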

#### **Random Forest**

**Bagging on Decision Trees**

Why Decision Trees for Bagging?
- **Simple** and **Interpretable**
- Handles both **numerical & categorical** data
- **Robust** to outliers
- **Low Bias**
- Requires **little data preparation**

However, they have **high variance**.

Why are DTs good candidates for ensembles?
- Averaging many **unbiased but high-variance** trees reduces overall variance.
- Bias remains low; ensemble accuracy improves.

**Random Forest**
- **Random Feature Subsets:**  
    - At each split, only $m$ out of $d$ features are considered ($m \leq \sqrt{d}$).
    - **Goal**: Further decorrelate trees to reduce error correlation.

- **Forest**
    - Use many decision trees

**Random Forest Algorithm**

**Input:**
- $T$ number of trees
- $m$ number of variables used to split each node

> for $t = 1$ to $T$ do  
> &nbsp;&nbsp;&nbsp;&nbsp;Draw a bootstrap dataset  
> &nbsp;&nbsp;&nbsp;&nbsp;Learn a tree on this dataset, where at each node:  
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Select $m$ features randomly out of $d$ features as candidates for splitting  
> end for  
> Output:  
> &nbsp;&nbsp;&nbsp;&nbsp;Regression: average of the outputs  
> &nbsp;&nbsp;&nbsp;&nbsp;Classification: majority voting  

Where usually $m \le \sqrt{d}$
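
The per-split feature sampling can be sketched as follows. The helper name `candidate_features` and the default $m = \lfloor\sqrt{d}\rfloor$ are assumptions for illustration; libraries expose this choice as a tunable parameter.

```python
import math
import random

def candidate_features(d, rng, m=None):
    """Pick m distinct feature indices out of d for one split.

    Defaults to m = floor(sqrt(d)), a common choice (assumed here).
    """
    if m is None:
        m = max(1, math.isqrt(d))
    return sorted(rng.sample(range(d), m))

rng = random.Random(0)
d = 16
# A fresh subset is drawn at every split, so different nodes see different features.
for node in range(3):
    print(f"split {node}: candidates {candidate_features(d, rng)}")
```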

### **Boosting**

#### **Introduction**

Sequentially combine **multiple weak learners** (each slightly better than random guessing) to form a **strong learner**.

**Key Principles:**
- **Complementary Weak Learners:**
    - Each learner specializes in different subsets of the data.
    - Later learners focus on **correcting errors of previous** ones.

- **Weighted Voting:**
    - Strong learner = weighted sum of weak learners.
    - More reliable learners get higher weights ($\alpha_i$).

**Ensemble Model:**
$$H_m(x) = \alpha_1 h(x; \theta_1) + \ldots + \alpha_m h(x; \theta_m), \quad \alpha_t \ge 0$$
$$H_m(x) = \alpha_1 h_1(x) + \ldots + \alpha_m h_m(x)$$
Where:
- $h_t(x)$: Weak learner at step $t$.
- $\alpha_t$: Weight (higher for more accurate learners).

**Prediction Rule:**
$$\hat{y} = \text{sign}(H_m(x))$$

**Decision stumps**

**Definition:**  
A **decision stump** is a single-level decision tree (one feature + threshold).

<br>

**Model Form:**
$$h(x;\theta) = \text{sign}(w_1 x_k - w_0), \quad \theta = \{k, w_1, w_0\}$$
Where
- $k$: Feature index.
- $w_1, w_0$: Weight and threshold parameters.

<br>

**Why Use Stumps?**
- Simple (low computational cost).
- High bias, low variance (ideal for boosting).
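
As a minimal sketch, the stump's model form translates directly into code. The convention $\text{sign}(0) = +1$ is an assumption, since the notes do not specify that case.

```python
def sign(z):
    """Sign with the convention sign(0) = +1 (an assumption)."""
    return 1 if z >= 0 else -1

def stump_predict(x, k, w1, w0):
    """h(x; theta) = sign(w1 * x[k] - w0) with theta = {k, w1, w0}."""
    return sign(w1 * x[k] - w0)

x = [0.2, 3.0, -1.5]
print(stump_predict(x, k=1, w1=1.0, w0=2.0))    # 3.0 - 2.0 > 0, so +1
print(stump_predict(x, k=1, w1=-1.0, w0=-2.0))  # flipped polarity: -3.0 + 2.0 < 0, so -1
```

A negative $w_1$ flips which side of the threshold is labeled $+1$.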

#### **AdaBoost**

##### **Core Concepts**

**AdaBoost = Adaptive Boosting**

- Sequential production of classifiers
    - Iteratively add the classifier whose addition will be most helpful.

- Each classifier is **dependent** on the previous ones.
    - Focuses on the previous ones’ error.

- Represent the importance of each sample by assigning **weights** to them.
    - **Correct** classification $\rightarrow$ **smaller** weights
    - **Misclassified** samples $\rightarrow$ **larger** weights


##### **AdaBoost Algorithm**

**Definitions**:
- $w_m^{(i)}$: Weighting coefficient of data point $i$ in iteration $m$
- $\alpha_m$: Weighting coefficient of the $m$-th base classifier in the final ensemble
- $\epsilon_m$: Weighted error rate of $m$-th base classifier

**Steps**:
> Initialize data weights $w_1^{(i)} = \frac{1}{N}$ for all $N$ samples  
> for $m=1$ to $M$ do:  
> &nbsp;&nbsp;&nbsp;&nbsp;Find a classifier $h_m(x)$ by minimizing the weighted error function:  
> $$J_m = \sum_{i = 1}^N w_m^{(i)} I \left( y^{(i)} \neq h_m( x^{(i)} ) \right)$$
> &nbsp;&nbsp;&nbsp;&nbsp;Find the weighted error of $h_m(x)$:
> $$\epsilon_m = \frac{\sum_{i = 1}^N w_m^{(i)} I \left( y^{(i)} \neq h_m( x^{(i)} ) \right)}{\sum_{i = 1}^N w_m^{(i)}}$$
> &nbsp;&nbsp;&nbsp;&nbsp;New component is assigned votes based on its error:
> $$\alpha_m = \ln(\frac{1 - \epsilon_m}{\epsilon_m})$$
> &nbsp;&nbsp;&nbsp;&nbsp;Update the normalized weights:
> $$w_{m+1}^{(i)} = w_m^{(i)} e^{\alpha_m I(y^{(i)} \neq h_m( x^{(i)}))}$$
> end for  
> Combined classifier:  
> $$\hat{y} = \text{sign}(H_M(x))$$  
> Where:  
> $$H_M(x) = \frac{1}{2} \sum_{m=1}^M \alpha_m h_m(x)$$  

**Ensemble Model:**
$$H_M(x) = \frac{1}{2} [\alpha_1 h_1(x) + \ldots + \alpha_M h_M(x)]$$
where:
- $h_m(x)$ is the $m^{th}$ weak learner (typically a decision stump)
- $\alpha_m$ is the learner's weight (confidence)
- The $\frac{1}{2}$ factor simplifies the relationship between $\alpha_m$ and the loss function

<br>

**Optimization Objective:**  
AdaBoost implicitly minimizes the exponential loss:
$$\text{Loss}(y, H_M(x)) = e^{-yH_M(x)}$$
where the prediction is $\hat{y} = \text{sign}(H_M(x))$.

<br>

**Key Properties of Exponential Loss:**
- **Differentiability:**
    - Smooth and convex, enabling efficient optimization

- **Margin Maximization:**
    - Heavily penalizes misclassifications ($\mathcal{L} \to \infty$ when $yH(x) \to -\infty$)
    - Still assigns non-zero loss to correctly classified examples near the decision boundary ($yH(x)$ slightly positive)

<div style="text-align:center">
  <img src="../assets/exponential_loss_function.png" alt="Exponential Loss Function">
</div>
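
The two properties can be seen numerically by comparing the exponential loss with the 0/1 loss at a few margins $yH(x)$. This is a small illustrative sketch; the chosen margin values are arbitrary.

```python
import math

def exponential_loss(margin):
    """exp(-y * H(x)), written in terms of the margin y * H(x)."""
    return math.exp(-margin)

def zero_one_loss(margin):
    """1 if misclassified (negative margin), else 0."""
    return 1.0 if margin < 0 else 0.0

for margin in (-2.0, -0.5, 0.5, 2.0):
    print(f"margin={margin:+.1f}  "
          f"exp={exponential_loss(margin):.3f}  "
          f"0/1={zero_one_loss(margin):.0f}")
```

The exponential loss is an upper bound on the 0/1 loss at every margin, which is why driving it down also drives down the training error.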

##### **Finding the Optimal Weak Learner $h_m$**

**Objective:**
At iteration $m$, minimize:
$$E = \sum_{i=1}^N \exp\left(-y^{(i)}H_m(x^{(i)})\right)$$

**Current Ensemble:**
$$H_m(x) = H_{m-1}(x) + \tfrac{1}{2}\alpha_m h_m(x)$$

---

**Step 1: Rewrite Loss Function**

Substitute $H_m(x)$ into $E$:
$$
E = \sum_{i=1}^N \exp\left(-y^{(i)}\left[H_{m-1}(x^{(i)}) + \tfrac{1}{2}\alpha_m h_m(x^{(i)})\right]\right) \\
= \sum_{i=1}^N \underbrace{\exp\left(-y^{(i)}H_{m-1}(x^{(i)})\right)}_{w_m^{(i)}} \cdot \exp\left(-\tfrac{1}{2}\alpha_m y^{(i)}h_m(x^{(i)})\right)
$$

where $w_m^{(i)}$ are the sample weights from previous iterations.

---

**Step 2: Analyze Cases**

For each sample:

1. **Correct Classification** ($y^{(i)} = h_m(x^{(i)})$):
   $$\exp\left(-\tfrac{1}{2}\alpha_m\right)$$

2. **Incorrect Classification** ($y^{(i)} \neq h_m(x^{(i)})$):
   $$\exp\left(\tfrac{1}{2}\alpha_m\right)$$

Thus:
$$
E = \exp\left(-\tfrac{\alpha_m}{2}\right)\sum_{\text{correct}} w_m^{(i)} + \exp\left(\tfrac{\alpha_m}{2}\right)\sum_{\text{incorrect}} w_m^{(i)}
$$

---

**Step 3: Simplify Expression**

Add and subtract $\exp(\frac{-\alpha_m}{2})\sum_{\text{incorrect}} w_m^{(i)}$:

$$
E = \exp\left(-\tfrac{\alpha_m}{2}\right)\left[\sum_{\text{correct}} w_m^{(i)} + \sum_{\text{incorrect}} w_m^{(i)}\right] + \left[\exp\left(\tfrac{\alpha_m}{2}\right) - \exp\left(-\tfrac{\alpha_m}{2}\right)\right]\sum_{\text{incorrect}} w_m^{(i)}
$$

Which simplifies to:
$$
E = \left(\exp(\frac{\alpha_m}{2}) - \exp(-\frac{\alpha_m}{2})\right) W_{\text{incorrect}} + \exp(-\frac{\alpha_m}{2}) W_{\text{total}}
$$
where:
- $W_{\text{incorrect}} = \sum_{y^{(i)} \neq h_m(x^{(i)})} w_m^{(i)}$
- $W_{\text{total}} = \sum_{i=1}^N w_m^{(i)}$

---

**Key Insight**

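Since $W_{\text{total}}$ does not depend on $h_m$ and the coefficient $e^{\alpha_m/2} - e^{-\alpha_m/2}$ is positive for $\alpha_m > 0$, minimizing $E$ with respect to $h_m$ reduces to minimizing: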
$$
J_m = W_{\text{incorrect}} = \sum_{i=1}^N w_m^{(i)} I(y^{(i)} \neq h_m(x^{(i)}))
$$

**Find** $h_m(x)$ **that minimizes** $J_m$.

##### **Finding the Optimal Classifier Weight $\alpha_m$**

We need to find the weight $\alpha_m$ that minimizes the exponential loss:
$$E = (e^{\frac{\alpha_m}{2}} - e^{\frac{-\alpha_m}{2}}) \sum_{y^{(i)} \neq h_m(x^{(i)})} w_m^{(i)} +  e^{\frac{-\alpha_m}{2}} \sum_{i=1}^N w_m^{(i)}$$

**Differentiate the Loss Function:**
$$\frac{\partial E}{\partial \alpha_m} = \frac{1}{2}\left(e^{\frac{\alpha_m}{2}} + e^{\frac{-\alpha_m}{2}}\right) \sum_{y^{(i)} \neq h_m(x^{(i)})} w_m^{(i)} - \frac{1}{2} e^{\frac{-\alpha_m}{2}} \sum_{i=1}^N w_m^{(i)} = 0$$

**Simplify the Equation**:
$$\frac{e^{\frac{-\alpha_m}{2}}}{e^{\frac{\alpha_m}{2}} + e^{\frac{-\alpha_m}{2}}} = \frac{\sum_{y^{(i)} \neq h_m(x^{(i)})} w_m^{(i)}}{\sum_{i=1}^N w_m^{(i)}}$$

Divide the numerator and denominator of the left-hand side by $e^{\frac{-\alpha_m}{2}}$ and introduce the weighted error rate $\epsilon_m$:
$$\frac{1}{1 + e^{\alpha_m}} = \epsilon_m$$

**Solve for $\alpha_m$:**
$$1 = \epsilon_m + \epsilon_m e^{\alpha_m} \implies e^{\alpha_m} = \frac{1 - \epsilon_m}{\epsilon_m}$$
As a result:
$$\alpha_m = \ln \frac{1 - \epsilon_m}{\epsilon_m}$$

**Final Result**  
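The optimal weight for classifier $h_m$ is:

$$
\alpha_m = \ln\left(\frac{1 - \epsilon_m}{\epsilon_m}\right)
$$

Because the ensemble adds $\tfrac{1}{2}\alpha_m h_m(x)$, the effective coefficient on $h_m$ is $\tfrac{1}{2}\ln\frac{1 - \epsilon_m}{\epsilon_m}$. Here the weighted error rate is: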

$$
\epsilon_m = \frac{\sum_{i=1}^N w_m^{(i)} I(y^{(i)} \neq h_m(x^{(i)}))}{\sum_{i=1}^N w_m^{(i)}}
$$
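
As a sanity check, the closed form can be compared against a brute-force minimization of $E(\alpha_m)$. This sketch assumes the convention used in this derivation, $H_m = H_{m-1} + \tfrac{1}{2}\alpha_m h_m$, under which the minimizer is $\ln\frac{1-\epsilon_m}{\epsilon_m}$. The weight totals are illustrative values.

```python
import math

def exp_loss(alpha, w_incorrect, w_total):
    """E(alpha) = (e^{a/2} - e^{-a/2}) * W_incorrect + e^{-a/2} * W_total."""
    return ((math.exp(alpha / 2) - math.exp(-alpha / 2)) * w_incorrect
            + math.exp(-alpha / 2) * w_total)

w_incorrect, w_total = 0.2, 1.0          # illustrative values, so epsilon = 0.2
epsilon = w_incorrect / w_total
alpha_closed_form = math.log((1 - epsilon) / epsilon)

# Brute-force search over a fine grid of alpha values in [0, 5].
grid = [i / 1000 for i in range(5001)]
alpha_grid = min(grid, key=lambda a: exp_loss(a, w_incorrect, w_total))

print(f"closed form: {alpha_closed_form:.4f}  grid search: {alpha_grid:.4f}")
```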

##### **Deriving the Sample Weight Update Rule**

**Objective:**  
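By the definition of the sample weights used in the loss above, the weights for the next iteration are:
$$w_{m+1}^{(i)} = e^{-y^{(i)}H_m(x^{(i)})}$$

<br>

**Express Current Ensemble:**  
Substituting $H_m(x) = H_{m-1}(x) + \tfrac{1}{2}\alpha_m h_m(x)$ and using $w_m^{(i)} = e^{-y^{(i)}H_{m-1}(x^{(i)})}$:
$$w_{m+1}^{(i)} = w_m^{(i)} e^{-\frac{1}{2}\alpha_m y^{(i)} h_m(x^{(i)})}$$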

<br>

**Transform the Indicator:**  
We use a simple identity, valid for $y^{(i)}, h_m(x^{(i)}) \in \{-1, +1\}$:
$$y^{(i)}h_m(x^{(i)}) = 1 - 2I(y^{(i)} \neq h_m(x^{(i)}))$$
**Verification:**
- When **correct** ($y^{(i)} = h_m(x^{(i)})$):  
    $1 = 1 - 2(0)$

- When **incorrect** ($y^{(i)} \neq h_m(x^{(i)})$):  
    $-1 = 1 - 2(1)$

<br>

**Substitute into Weight Update**
$$w_{m+1}^{(i)} = w_m^{(i)} e^{-\frac{1}{2} \alpha_m \left(1 - 2I ( y^{(i)} \neq h_m(x^{(i)}) ) \right)}$$
$$ = w_m^{(i)} e^{-\frac{1}{2} \alpha_m} e^{ \alpha_m \left(I ( y^{(i)} \neq h_m(x^{(i)}) ) \right)}$$

<br>

**Final Update Rule**  
Since $e^{-\frac{\alpha_m}{2}}$ is constant across all samples, we can normalize it away, leaving:
$$w_{m+1}^{(i)} = w_m^{(i)} e^{ \alpha_m \left(I ( y^{(i)} \neq h_m(x^{(i)}) ) \right)}$$
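
A short numeric check shows that dropping the common factor $e^{-\alpha_m/2}$ leaves the normalized weights unchanged. The four-sample setup below is illustrative only.

```python
import math

def normalize(ws):
    """Rescale weights so they sum to 1."""
    total = sum(ws)
    return [w / total for w in ws]

w = [0.25, 0.25, 0.25, 0.25]
mistakes = [0, 1, 0, 0]          # I(y != h_m(x)) for each sample (illustrative)
epsilon = sum(wi * mi for wi, mi in zip(w, mistakes)) / sum(w)
alpha = math.log((1 - epsilon) / epsilon)

# Full update, keeping the common factor exp(-alpha / 2).
full = [wi * math.exp(-alpha / 2) * math.exp(alpha * mi) for wi, mi in zip(w, mistakes)]
# Simplified update, dropping the common factor.
short = [wi * math.exp(alpha * mi) for wi, mi in zip(w, mistakes)]

print(normalize(full))
print(normalize(short))
```

After normalization the misclassified sample carries half of the total weight, so the classifier just added has weighted error $\tfrac{1}{2}$ on the new weights.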

##### **Summary**

**AdaBoost Algorithm Summary**

**Initialization:**
1. For $i = 1$ to $N$:
   - Initialize data weights: $w_1^{(i)} = \frac{1}{N}$

**Main Loop (for $m = 1$ to $M$):**
- **Find classifier** $h_m(x)$:  
   Minimize the weighted error function $J_m$

- **Compute normalized error**:  
   $\epsilon_m =$ weighted error of $h_m(x)$

- **Calculate component weight**:  
   $\alpha_m = \ln\left(\frac{1-\epsilon_m}{\epsilon_m}\right)$

- **Update example weights**:  
   $w_{m+1}^{(i)} = w_m^{(i)} \cdot e^{\alpha_m \cdot \mathbb{I}(y^{(i)} \neq h_m(x^{(i)}))}$  
   (then normalize)

**Final Classifier:**
$\hat{y} = \text{sign}(H_M(x))$  
where:  
$H_M(x) = \frac{1}{2} \sum_{m=1}^M \alpha_m h_m(x)$
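
The summary can be turned into a compact sketch using decision stumps as base learners. This is a minimal illustration, not a reference implementation: the stump parameterization `(k, thr, polarity)`, the clipping of $\epsilon_m$ to avoid $\log 0$, and the toy data are all assumptions added here.

```python
import math

def stump_predict(x, k, thr, polarity):
    """Decision stump: +polarity if x[k] >= thr, otherwise -polarity."""
    return polarity if x[k] >= thr else -polarity

def fit_stump(X, y, w):
    """Return the stump (k, thr, polarity) with the smallest weighted error J_m."""
    best, best_err = None, float("inf")
    for k in range(len(X[0])):
        values = sorted({x[k] for x in X})
        # Thresholds: one below all values, then midpoints between neighbours.
        thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
        for thr in thresholds:
            for polarity in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(xi, k, thr, polarity) != yi)
                if err < best_err:
                    best, best_err = (k, thr, polarity), err
    return best, best_err

def adaboost(X, y, M):
    n = len(X)
    w = [1.0 / n] * n                            # w_1^(i) = 1/N
    ensemble = []                                # list of (alpha_m, stump)
    for _ in range(M):
        stump, j_m = fit_stump(X, y, w)
        eps = j_m / sum(w)                       # weighted error rate
        eps = min(max(eps, 1e-10), 1 - 1e-10)    # added safeguard against log(0)
        alpha = math.log((1 - eps) / eps)
        ensemble.append((alpha, stump))
        # Up-weight misclassified samples, then normalize.
        w = [wi * math.exp(alpha * (stump_predict(xi, *stump) != yi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """sign(H_M(x)); the 1/2 factor does not change the sign."""
    score = 0.5 * sum(alpha * stump_predict(x, *stump) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D data that no single stump can separate (labels: +, +, -, -, +, +).
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost(X, y, M=3)
train_errors = sum(predict(model, xi) != yi for xi, yi in zip(X, y))
print("training errors:", train_errors)
```

On this toy set every single stump misclassifies at least two points, yet three boosted stumps fit all six.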

**Behavior**

- The exponential loss on the training set goes down with each added learner
- The training error of $H$ tends to go down
- The weighted error $\epsilon_m$ tends to go up in later rounds $\rightarrow$ the votes $\alpha_m$ tend to go down

**Base Learners**

- Decision Stumps
- Decision Trees
- Multi-Layer Neural Network

#### **XGBoost**

**Objective Function: Training Loss + Regularization**

The objective function in machine learning typically consists of two components:
$$obj(\theta) = L(\theta) + \Omega(\theta)$$
Where:
- $L(\theta)$: Training Loss Function  
    Measures how well the model fits the training data (predictive performance)

- $\Omega(\theta)$: Regularization Term  
    Controls model complexity to prevent overfitting

**CART: Classification And Regression Tree**

Key characteristics of CART models:
- Members are classified into different leaves
- Each leaf is assigned a real-valued score (unlike standard decision trees which only provide class decisions)
- Provides richer interpretation beyond simple classification

**Tree Boosting Objective:**

The objective function for tree boosting:
$$J(w) = \sum_{i=1}^n l(y_i, {\hat{y}_i}^{(t)}) + \sum_{k=1}^t w(f_k)$$
Where:
- $t$: Number of trees
- $f_k$: $k$-th tree in the ensemble
- $w(f_k)$: Complexity (regularization) of the $k$-th tree

**Additive Training:**

The boosting approach:
- Keep existing trees fixed
- Add one new tree at each iteration

Prediction at step $t$:
$${\hat{y}_i}^{(t)} = \sum_{k=1}^t f_k(x_i) = {\hat{y}_i}^{(t-1)} + f_t(x_i)$$

**Optimize the objective**

**Mean Squared Error (MSE) Case**

If we consider using mean squared error (MSE) as our loss function, the objective at step $t$ becomes:
$$obj^{(t)} = \sum_{i=1}^n \left( y_i - ( {\hat{y}_i}^{(t-1)} + f_t(x_i) ) \right)^2 + \sum_{k=1}^t w(f_k)$$

Expansion of Training Loss:
$$\left( y_i - ( {\hat{y}_i}^{(t-1)} + f_t(x_i) ) \right)^2 = 2 \left( {\hat{y}_i}^{(t-1)} - y_i \right) f_t(x_i) + f_t(x_i)^2 + \text{constant}$$

Resulting objective
$$obj^{(t)} = \sum_{i=1}^n \left[ 2 \left( {\hat{y}_i}^{(t-1)} - y_i \right) f_t(x_i) + f_t(x_i)^2 \right] + w(f_t) + \text{constant}$$

**Key Components:**
- **First-order term:** $2(\hat{y}_i^{(t-1)} - y_i)f_t(x_i)$ (the coefficient is $-2$ times the **residual** $y_i - \hat{y}_i^{(t-1)}$)
- **Second-order term:** $f_t(x_i)^2$

**General Loss Functions**

For non-MSE losses (e.g., logistic loss), we don’t have a nice quadratic form naturally. So, XGBoost uses a *second-order Taylor expansion of the loss function*:
$$obj^{(t)} \approx \sum_{i=1}^n  \left[ l( y_i, {\hat{y}_i}^{(t-1)} ) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + w(f_t) + \text{constant}$$

where:
$$g_i = \partial_{ {\hat{y}_i}^{(t-1)} } l(y_i, {\hat{y}_i}^{(t-1)}) \quad \text{gradient}$$
$$h_i = \partial^2_{ {\hat{y}_i}^{(t-1)} } l(y_i, {\hat{y}_i}^{(t-1)}) \quad \text{Hessian}$$

Simplified objective
$$\sum_{i=1}^n \left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + w(f_t)$$

**Advantage:** This formulation enables XGBoost to support arbitrary loss functions through gradient ($g_i$) and Hessian ($h_i$) computations.
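
To make this concrete, here is a sketch of the per-sample gradient and Hessian for two losses. The squared-error case matches the MSE expansion above ($g_i = 2(\hat{y}_i - y_i)$, $h_i = 2$). The logistic case assumes labels in $\{0, 1\}$ and a raw-score (logit) prediction, which is a choice made for this example.

```python
import math

def squared_error_grad_hess(y, y_pred):
    """l = (y - y_pred)^2  ->  g = 2 (y_pred - y), h = 2."""
    return 2.0 * (y_pred - y), 2.0

def logistic_grad_hess(y, y_pred):
    """Logistic loss with y in {0, 1} and y_pred a raw score (logit).

    l = -[y log p + (1 - y) log(1 - p)] with p = sigmoid(y_pred),
    which gives g = p - y and h = p (1 - p).
    """
    p = 1.0 / (1.0 + math.exp(-y_pred))
    return p - y, p * (1.0 - p)

print(squared_error_grad_hess(y=3.0, y_pred=2.5))   # (-1.0, 2.0)
print(logistic_grad_hess(y=1, y_pred=0.0))          # (-0.5, 0.25)
```

Only these two numbers per sample enter the tree-building step, so swapping the loss means swapping this one function.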

**Regularization Term**

Definition of the tree $f(x)$:
$$f_t(x) = w_{q(x)}, \quad w \in R^T, \quad q: R^d \rightarrow \{1, 2, \ldots, T \}$$
Where:
- $w$: Vector of scores on leaves
- $q$: Function assigning each data point to the corresponding leaf
- $T$: Number of leaves

Complexity Definition:
$$w(f) = \eta T + \frac{1}{2} \lambda \sum_{j=1}^T w_j^2$$

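*There is more than one way to define complexity but this one works well in practice. Here $w(f)$ denotes the complexity of tree $f$, while $w_j$ is the score of leaf $j$. The XGBoost documentation writes the complexity as $\Omega(f)$ and the per-leaf penalty (here $\eta$) as $\gamma$.*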

**The Structure Score**

The objective value with the $t$-th tree is:
$$obj^{(t)} \approx \sum_{i=1}^n \left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \eta T + \frac{1}{2} \lambda \sum_{j=1}^T w_j^2$$

Change the index of the summation because all the data points on the same leaf get the same score.  
Let $I_j = \{i \mid q(x_i) = j\}$ be the set of indices assigned to leaf $j$:
$$obj^{(t)} = \sum_{j=1}^T \left[(\sum_{i \in I_j} g_i)w_j + \frac{1}{2}(\sum_{i \in I_j} h_i + \lambda) w_j^2 \right] + \eta T$$

Notation:
$$G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i$$

Final Form:
$$obj^{(t)} = \sum_{j=1}^T \left[G_j w_j + \frac{1}{2}(H_j + \lambda) w_j^2 \right] + \eta T$$

**Optimal Solution**

Objective function:
$$obj^{(t)} = \sum_{j=1}^T \left[G_j w_j + \frac{1}{2}(H_j + \lambda) w_j^2 \right] + \eta T$$

**Optimal Leaf Weights**
$$w^*_j = - \frac{G_j}{H_j + \lambda}$$

**Optimal Objective Value:**
$$\text{obj}^* = -\frac{1}{2} \sum_{j=1}^T \frac{G_j^2}{H_j + \lambda} + \eta T$$

**Interpretation:** $obj^*$ serves as a quality measure for tree structure $q(x)$, combining both:
- Predictive power (through $G_j$ and $H_j$)
- Model complexity (through $\eta$ and $\lambda$)

This score functions similarly to impurity measures in decision trees, but with built-in complexity control.
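
A small worked example of the optimal leaf weights and the structure score follows. Substituting $w^*_j = -G_j/(H_j + \lambda)$ into the per-leaf quadratic gives $-\tfrac{1}{2} G_j^2/(H_j + \lambda)$ for each leaf, which is what `structure_score` computes. The per-leaf sums and the values of $\lambda$ and $\eta$ are illustrative.

```python
def leaf_weight(G, H, lam):
    """w*_j = -G_j / (H_j + lambda)."""
    return -G / (H + lam)

def structure_score(leaves, lam, eta):
    """obj* = -1/2 * sum_j G_j^2 / (H_j + lambda) + eta * T.

    leaves is a list of (G_j, H_j) pairs, one per leaf.
    """
    return -0.5 * sum(G * G / (H + lam) for G, H in leaves) + eta * len(leaves)

def objective(leaves, weights, lam, eta):
    """sum_j [G_j w_j + 1/2 (H_j + lambda) w_j^2] + eta * T for given leaf weights."""
    return sum(G * w + 0.5 * (H + lam) * w * w
               for (G, H), w in zip(leaves, weights)) + eta * len(leaves)

# Illustrative per-leaf sums of gradients and Hessians for a tree with T = 2 leaves.
leaves = [(-4.0, 3.0), (2.0, 1.0)]
lam, eta = 1.0, 0.5

weights = [leaf_weight(G, H, lam) for G, H in leaves]   # [1.0, -1.0]
score = structure_score(leaves, lam, eta)               # -0.5 * (4 + 2) + 1 = -2.0
print(weights, score, objective(leaves, weights, lam, eta))
```

Evaluating the objective at the optimal weights reproduces the structure score, and any other weights give a larger value.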

**Learning the Tree Structure**

While enumerating all possible trees is computationally infeasible, we can efficiently build trees by:
- Optimizing one level at a time
- Using the **Gain** metric to evaluate split quality

**Gain Formula:**
$$Gain = \frac{1}{2} \left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \eta$$

**Key Terms:**
- $G_L$ and $G_R$: Sum of gradients for the left child and the right child nodes after the split
- $H_L$ and $H_R$: Sum of Hessians for the left child and the right child nodes
- $\lambda$: L2 regularization strength on the leaf weights
- $\eta$: Complexity penalty per additional leaf

**Intuition Behind the Gain Formula:**
- Score after splitting:
    $$\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}$$
- Score before splitting:
    $$\frac{(G_R + G_L)^2}{H_L + H_R + \lambda}$$

**Role of Regularization:**
- $\lambda$: Smooths leaf weights
    - Prevents overfitting by shrinking leaf weights
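- $\eta$: Minimum gain threshold
    - Acts as early stopping criterion
    - If the loss reduction from a split (the bracketed term times $\frac{1}{2}$) is below $\eta$, then $\text{Gain} \lt 0$ and XGBoost does not keep the split (prunes the branch)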

**Implementation Note:** XGBoost evaluates all possible splits at each level and selects the one with maximum Gain, continuing recursively until stopping conditions are met.
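
The split search for a single feature can be sketched as a left-to-right scan that accumulates $G_L$ and $H_L$. This is a simplified illustration of the exact greedy idea, assuming the samples are already sorted by the feature. The function names and toy gradients are hypothetical.

```python
def split_gain(G_L, H_L, G_R, H_R, lam, eta):
    """Gain = 1/2 [G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - (G_L+G_R)^2/(H_L+H_R+lam)] - eta."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R)) - eta

def best_split(g, h, lam, eta):
    """Scan split positions of samples already sorted by one feature.

    g, h: per-sample gradients and Hessians in feature order.
    Returns (best_gain, index), where samples [0, index) go to the left child.
    """
    G, H = sum(g), sum(h)
    G_L = H_L = 0.0
    best_gain, best_index = float("-inf"), None
    for i in range(1, len(g)):
        G_L += g[i - 1]
        H_L += h[i - 1]
        gain = split_gain(G_L, H_L, G - G_L, H - H_L, lam, eta)
        if gain > best_gain:
            best_gain, best_index = gain, i
    return best_gain, best_index

# Squared-error gradients g_i = 2 (y_pred - y) with y_pred = 0 and targets (1, 1, -1, -1).
g = [-2.0, -2.0, 2.0, 2.0]
h = [2.0, 2.0, 2.0, 2.0]
gain, index = best_split(g, h, lam=1.0, eta=0.5)
print(gain, index)
```

Here the best split separates the two positive-target samples from the two negative-target ones. A negative best gain would mean no split is worth its complexity penalty.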