# Ensemble Learning: Bagging, Boosting, and Stacking

## 1. Bagging (Bootstrap Aggregating)

**Definition:**
Bagging is an **ensemble learning technique** aimed at reducing the **variance** of a model by training multiple base learners on different **bootstrap samples** (sampling with replacement) of the dataset, and then aggregating their predictions.

**Mathematical Formulation:**
Given a dataset:
$$
D = \{(x_i, y_i)\}_{i=1}^n
$$

We create \( B \) bootstrap samples:
$$
D_1, D_2, \dots, D_B
$$

For each sample, we train a base learner \( h_b(x) \). The final prediction is:

- **Regression:**
$$
\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} h_b(x)
$$

- **Classification:**
$$
\hat{f}_{bag}(x) = \text{majority\_vote}\{h_1(x), h_2(x), \dots, h_B(x)\}
$$
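
The aggregation above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names (`bootstrap_sample`, `fit_stump`, `bagging_fit`, `bagging_predict`) are hypothetical, the data are assumed to be one-dimensional with labels in {0, 1}, and a threshold stump stands in for the base learner purely for brevity.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]


def fit_stump(data):
    """Fit a 1-D threshold classifier (illustrative base learner).

    data is a list of (x, y) pairs with y in {0, 1}.
    Returns a function h(x) -> 0 or 1.
    """
    xs = sorted({x for x, _ in data})
    # Candidate thresholds: midpoints between consecutive distinct x values.
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])] or [xs[0]]
    best = None
    for thr in candidates:
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= thr else right) != y for x, y in data)
            if best is None or errors < best[0]:
                best = (errors, thr, left, right)
    _, thr, left, right = best
    return lambda x: left if x <= thr else right


def bagging_fit(data, B, seed=0):
    """Train B base learners, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(data, rng)) for _ in range(B)]


def bagging_predict(learners, x):
    """Aggregate the base learners' predictions by majority vote."""
    votes = Counter(h(x) for h in learners)
    return votes.most_common(1)[0][0]
```

Using an odd `B` avoids tied votes in the binary case. For regression, `bagging_predict` would return the mean of the base predictions instead of the most common label.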

**Key Insight:**
Bagging helps most with **high-variance, low-bias** models such as deep decision trees.
The best-known example is **Random Forest**, which additionally considers only a random subset of features at each split.
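
A standard way to see why averaging reduces variance is the following worked identity. It assumes (a simplification not stated above) that the \( B \) base predictions at a fixed point \( x \) are identically distributed with variance \( \sigma^2 \) and pairwise correlation \( \rho \):

$$
\mathrm{Var}\left( \frac{1}{B} \sum_{b=1}^{B} h_b(x) \right) = \rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2
$$

As \( B \) grows, the second term vanishes and the variance approaches \( \rho \sigma^2 \). Under this assumption, adding more learners only helps up to a point, and lowering the correlation \( \rho \) between learners is what random feature selection is meant to achieve.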

---

## 2. Boosting

**Definition:**
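Boosting is an **iterative (sequential) ensemble technique** in which each new weak learner concentrates on what the previous learners got wrong (misclassified samples in AdaBoost, residuals or gradients in gradient boosting). Its primary goal is to reduce **bias**; variance often falls as well in practice, but that is a secondary effect.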

**AdaBoost Intuition:**
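For binary classification with labels \( y_i \in \{-1, +1\} \) (weak learners output the same two values), start with equal weights for all samples: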
$$
w_i = \frac{1}{n}
$$

At iteration \( t \), train a weak learner \( h_t(x) \) minimizing the **weighted error**:
$$
\epsilon_t = \sum_{i=1}^n w_i \mathbf{1}(h_t(x_i) \neq y_i)
$$

Compute the weight of the learner (positive whenever \( \epsilon_t < 1/2 \), i.e. the learner beats random guessing):
$$
\alpha_t = \frac{1}{2} \ln\frac{1-\epsilon_t}{\epsilon_t}
$$

Update sample weights:
$$
w_i \leftarrow w_i \cdot e^{-\alpha_t y_i h_t(x_i)}
$$
Normalize \( w_i \) so that the weights sum to 1.
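
As a worked example of one round (the value \( \epsilon_t = 0.2 \) is chosen purely for illustration, with labels and predictions in \( \{-1, +1\} \)):

$$
\alpha_t = \frac{1}{2} \ln\frac{1 - 0.2}{0.2} = \frac{1}{2} \ln 4 = \ln 2 \approx 0.693
$$

Correctly classified samples have \( y_i h_t(x_i) = +1 \), so their weights are multiplied by \( e^{-\alpha_t} = 1/2 \); misclassified samples have \( y_i h_t(x_i) = -1 \), so their weights are multiplied by \( e^{\alpha_t} = 2 \). The total weight before normalization is

$$
0.8 \cdot \tfrac{1}{2} + 0.2 \cdot 2 = 0.4 + 0.4 = 0.8
$$

so after normalization the misclassified samples carry \( 0.4 / 0.8 = 0.5 \) of the total weight. The next weak learner therefore cannot do well by repeating the same mistakes.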

Final classifier:
$$
H(x) = \text{sign}\Big( \sum_{t=1}^T \alpha_t h_t(x) \Big)
$$
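
The AdaBoost steps above can be put together as a small from-scratch sketch. It is an illustration under assumptions rather than a production implementation: the function names are hypothetical, the weak learner is assumed to be a single-feature threshold stump found by exhaustive search, and the weighted error is clipped slightly away from 0 and 1 to keep the logarithm finite.

```python
import math


def fit_weighted_stump(X, y, w):
    """Exhaustively search for the stump with the lowest weighted error.

    A stump predicts s if x[j] <= thr else -s, with s in {+1, -1}.
    Returns (weighted_error, j, thr, s).
    """
    best = None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for s in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if (s if xi[j] <= thr else -s) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, s)
    return best


def stump_predict(stump, x):
    _, j, thr, s = stump
    return s if x[j] <= thr else -s


def adaboost_fit(X, y, T):
    """Run T rounds of AdaBoost; labels y must be in {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n                      # equal initial weights
    ensemble = []
    for _ in range(T):
        stump = fit_weighted_stump(X, y, w)
        # Clip epsilon_t so that alpha_t stays finite (an added safeguard).
        eps = min(max(stump[0], 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, stump))
        # w_i <- w_i * exp(-alpha_t * y_i * h_t(x_i)), then normalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_predict(ensemble, x):
    """H(x) = sign(sum_t alpha_t * h_t(x)); a score of 0 maps to +1."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1
```

On a one-dimensional "interval" problem (positive labels only in the middle of the range), no single stump can separate the classes, but a weighted vote of a few stumps can, which is the point of combining weak learners.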

**Gradient Boosting:**
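Instead of reweighting samples explicitly, each new model is fit to the **negative gradient of the loss function** with respect to the current ensemble's predictions (for squared-error loss this is simply the residual). This makes boosting a form of gradient descent in function space.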
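
A minimal sketch of this idea for regression follows. It assumes squared-error loss \( \tfrac{1}{2}(y - F(x))^2 \), whose negative gradient with respect to the prediction \( F(x) \) is the residual \( y - F(x) \), and it assumes one-dimensional inputs with at least two distinct values. The function names and the `learning_rate` (shrinkage) parameter are illustrative choices, not taken from any particular library.

```python
def fit_regression_stump(xs, targets):
    """Least-squares stump on 1-D inputs: returns (thr, left_mean, right_mean)."""
    best = None
    for thr in sorted(set(xs))[:-1]:       # drop the max so both sides are non-empty
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - lm) ** 2 for t in left)
               + sum((t - rm) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    _, thr, lm, rm = best
    return thr, lm, rm


def stump_value(stump, x):
    thr, lm, rm = stump
    return lm if x <= thr else rm


def gradient_boost_fit(xs, ys, n_rounds, learning_rate):
    """Fit an additive model F(x) = f0 + learning_rate * sum of stumps."""
    f0 = sum(ys) / len(ys)                 # initial constant prediction
    preds = [f0] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of squared-error loss = current residuals.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_regression_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_value(stump, x)
                 for p, x in zip(preds, xs)]
    return f0, learning_rate, stumps


def gradient_boost_predict(model, x):
    f0, learning_rate, stumps = model
    return f0 + learning_rate * sum(stump_value(s, x) for s in stumps)
```

With a different loss, only the line computing `residuals` would change: it would compute that loss's negative gradient at the current predictions instead.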

**Key Insight:**
Boosting is most effective at **reducing bias**, but it can overfit, especially on noisy data, unless it is regularized (e.g., shrinkage, early stopping).

Popular algorithms: **AdaBoost, Gradient Boosting, XGBoost, LightGBM.**

---

## 3. Stacking (Stacked Generalization)

**Definition:**
Stacking is a **meta-learning technique** where predictions from multiple diverse models are combined using another model called the **meta-learner**.

**Process:**
1. Train base learners \( h_1, h_2, \dots, h_k \) on the training data.
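2. Generate a meta-feature matrix \( Z \) whose \( i \)-th row holds the base learners' predictions for \( x_i \), ideally out-of-fold predictions so that no learner predicts a sample it was trained on:
$$
z_i = [h_1(x_i), h_2(x_i), \dots, h_k(x_i)]
$$
3. Train a meta-learner \( g \) on \( (Z, y) \). The final prediction for a new input \( x \) is:
$$
\hat{y} = g\big(h_1(x), h_2(x), \dots, h_k(x)\big)
$$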
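
The process above can be sketched as follows, using out-of-fold predictions for the meta-features. Everything here is an illustrative assumption: the function names, the two toy base learners (a least-squares line and a 1-nearest-neighbour regressor on one-dimensional inputs), the interleaved fold assignment, and a meta-learner that is a no-intercept linear combination of exactly two base predictions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x


def fit_nearest(xs, ys):
    """1-nearest-neighbour regressor (first match wins on ties)."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def out_of_fold_features(xs, ys, fitters, n_folds):
    """Build Z so that row i only uses models that never saw sample i."""
    n = len(xs)
    Z = [[0.0] * len(fitters) for _ in range(n)]
    for fold in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == fold]
        train = [i for i in range(n) if i % n_folds != fold]
        for k, fit in enumerate(fitters):
            h = fit([xs[i] for i in train], [ys[i] for i in train])
            for i in held_out:
                Z[i][k] = h(xs[i])
    return Z


def fit_meta_weights(Z, ys):
    """Least-squares weights for two meta-features via 2x2 normal equations."""
    a11 = sum(z[0] * z[0] for z in Z)
    a12 = sum(z[0] * z[1] for z in Z)
    a22 = sum(z[1] * z[1] for z in Z)
    b1 = sum(z[0] * y for z, y in zip(Z, ys))
    b2 = sum(z[1] * y for z, y in zip(Z, ys))
    det = a11 * a22 - a12 * a12
    return (a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det


def stacking_fit(xs, ys, fitters, n_folds=5):
    """Train the meta-learner on out-of-fold Z, then refit base learners on all data."""
    Z = out_of_fold_features(xs, ys, fitters, n_folds)
    weights = fit_meta_weights(Z, ys)
    base = [fit(xs, ys) for fit in fitters]
    return lambda x: sum(w * h(x) for w, h in zip(weights, base))
```

The nearest-neighbour learner shows why the out-of-fold step matters: evaluated on its own training data it reproduces every label exactly, so a meta-learner trained on in-sample predictions would trust it far more than its performance on unseen data justifies.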

**Key Insight:**
Stacking leverages **model diversity** and allows the meta-learner to learn the optimal combination of models.

It requires careful **cross-validation** to avoid overfitting and data leakage.

---

## 4. Key Differences

| Technique | Training Type | Goal | Example Models |
|-----------|---------------|------|----------------|
| Bagging   | Parallel (independent learners) | Reduce variance | Random Forest |
| Boosting  | Sequential (each learner corrects previous errors) | Reduce bias (primarily) | AdaBoost, XGBoost |
| Stacking  | Parallel base learners + meta-model | Learn how best to combine models | Heterogeneous base models with a linear or logistic-regression meta-learner |

---

## 5. Theoretical Insights

- **Bias-Variance Trade-off:**
  - Bagging: Primarily reduces variance.
  - Boosting: Primarily reduces bias; variance may also drop, but overfitting is possible without regularization.
  - Stacking: Exploits strengths of heterogeneous models.

- **Generalization:**
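  Stacking tends to help most when the base models are diverse (their errors are only weakly correlated) and the meta-learner is kept simple and trained on out-of-fold predictions. It is not guaranteed to beat a well-tuned single ensemble.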

---

## References
- Breiman, L. (1996). *Bagging Predictors*. Machine Learning, 24(2), 123–140.
- Freund, Y., & Schapire, R. E. (1997). *A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting*. Journal of Computer and System Sciences, 55(1), 119–139.
- Wolpert, D. H. (1992). *Stacked Generalization*. Neural Networks, 5(2), 241–259.
