# Boosting in Machine Learning

## 1. Introduction

Boosting is an **ensemble learning method** that combines multiple *weak learners* (models that perform slightly better than random guessing) into a single strong learner.

- Unlike Bagging (which trains learners independently on bootstrap samples), **Boosting trains models sequentially**.  
- Each new model focuses on the **errors (misclassified data points)** made by the previous ones.  
- The final prediction is a weighted majority vote (classification) or a weighted sum (regression) over all learners.
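As a minimal sketch of the combination step, assuming binary labels encoded as -1/+1 and made-up learner weights, the two rules could look like this. The function names `weighted_vote` and `weighted_sum` are illustrative and not taken from any library.

```python
def weighted_sum(predictions, alphas):
    """Weighted sum of learner outputs (the regression-style combination)."""
    return sum(a * p for a, p in zip(alphas, predictions))


def weighted_vote(predictions, alphas):
    """Weighted majority vote over -1/+1 predictions.

    A score of exactly zero is resolved to +1 here; that tie rule is an
    arbitrary choice made for this sketch.
    """
    return 1 if weighted_sum(predictions, alphas) >= 0 else -1


# Three hypothetical learners vote on a single sample.
votes = [+1, -1, -1]
alphas = [1.2, 0.4, 0.5]   # illustrative learner weights
label = weighted_vote(votes, alphas)
```

Here the single learner with the largest weight outvotes the two weaker ones, which an unweighted majority vote would not allow.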

---

## 2. Key Idea

1. Train a weak learner on the data.
2. Evaluate errors.
3. Increase the weights of misclassified samples so the next learner focuses more on them.
4. Repeat this process for multiple rounds.
5. Combine all weak learners into a final strong model.

---

## 3. Boosting Algorithm (General)

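The procedure below is the discrete AdaBoost form of boosting for binary classification. Given a training dataset $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$ with labels $y_i \in \{-1, +1\}$: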

1. Initialize weights:
   $$
   w_i = \frac{1}{N}, \quad i = 1, 2, \dots, N
   $$
   Each data point has equal weight initially.

2. For $m = 1$ to $M$ (number of weak learners):
   - Train weak learner $h_m(x) \in \{-1, +1\}$ using weights $w_i$.
   - Compute weighted error:
     $$
     \epsilon_m = \frac{\sum_{i=1}^N w_i \cdot I(y_i \neq h_m(x_i))}{\sum_{i=1}^N w_i}
     $$
   - Compute learner weight (positive whenever $\epsilon_m < 0.5$, i.e. the learner beats random guessing):
     $$
     \alpha_m = \frac{1}{2} \ln \left(\frac{1 - \epsilon_m}{\epsilon_m}\right)
     $$
   - Update sample weights:
     $$
     w_i \leftarrow w_i \cdot \exp\big(-\alpha_m y_i h_m(x_i)\big)
     $$
   - Normalize $w_i$ so that $\sum w_i = 1$.

3. Final model (strong learner):
   $$
   H(x) = \text{sign}\left(\sum_{m=1}^M \alpha_m h_m(x)\right)
   $$
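The steps above can be sketched in plain Python. This is a minimal illustration under several assumptions, not a reference implementation: labels are encoded as -1/+1, the weak learner is a threshold decision stump found by exhaustive search, and the helper names (`fit_stump`, `stump_predict`, `adaboost_fit`, `adaboost_predict`) and the toy dataset are made up for this example.

```python
import math


def stump_predict(stump, x):
    """Predict `polarity` if x[feature] <= threshold, else -polarity."""
    feature, threshold, polarity = stump
    return polarity if x[feature] <= threshold else -polarity


def fit_stump(X, y, w):
    """Exhaustively search for the stump with the lowest weighted error."""
    best_stump, best_err = None, float("inf")
    for feature in range(len(X[0])):
        for threshold in sorted({x[feature] for x in X}):
            for polarity in (+1, -1):
                stump = (feature, threshold, polarity)
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(stump, xi) != yi)
                if err < best_err:
                    best_stump, best_err = stump, err
    return best_stump, best_err


def adaboost_fit(X, y, M=3):
    n = len(X)
    w = [1.0 / n] * n                    # step 1: uniform sample weights
    model = []                           # list of (alpha_m, h_m) pairs
    for _ in range(M):
        stump, err = fit_stump(X, y, w)
        eps = err / sum(w)               # weighted error epsilon_m
        if eps >= 0.5:                   # no better than chance: stop early
            break
        eps = max(eps, 1e-10)            # guard against log of infinity for a perfect stump
        alpha = 0.5 * math.log((1 - eps) / eps)
        model.append((alpha, stump))
        # Misclassified points (y * h = -1) are scaled up, correct ones down.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]     # normalize so the weights sum to 1
    return model


def adaboost_predict(model, x):
    """H(x) = sign(sum_m alpha_m * h_m(x)); a zero score maps to +1 here."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in model)
    return 1 if score >= 0 else -1


# Toy 1-D dataset (illustrative): no single threshold separates the classes.
X = [[float(i)] for i in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

model = adaboost_fit(X, y, M=3)
train_preds = [adaboost_predict(model, x) for x in X]
```

On this toy set the best single stump still misclassifies three points, while the weighted combination of three stumps fits all ten training points. The early stop at $\epsilon_m \geq 0.5$ and the clamp on $\epsilon_m$ are safeguards added for the sketch; they are not part of the formulas above.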

---

## 4. Types of Boosting

### 4.1 AdaBoost (Adaptive Boosting)
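- The first practical and most widely known boosting algorithm (Freund and Schapire, 1990s); earlier boosting schemes existed but were not adaptive.
- Typically uses decision stumps (1-level decision trees) as weak learners, although any learner that accepts sample weights can be used.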
- Misclassified samples get higher weights at each step.

### 4.2 Gradient Boosting
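- Instead of adjusting sample weights, it fits each new learner to the **residual errors** of the current ensemble. More generally, the targets are the negative gradients of the loss with respect to the current predictions (often called pseudo-residuals).
- This amounts to gradient descent in function space: each learner is a step that reduces a differentiable loss function.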

Mathematical idea:
$$
F_{m}(x) = F_{m-1}(x) + \eta h_m(x)
$$
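where $F_{m-1}$ is the ensemble after $m-1$ rounds, $\eta$ is the learning rate, and $h_m$ is the weak learner trained on the residuals of $F_{m-1}$. For squared-error loss $\tfrac{1}{2}(y - F(x))^2$, the negative gradient is exactly the residual $y - F_{m-1}(x)$.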
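The update rule can be illustrated with a small pure-Python sketch for regression. It assumes squared-error loss (so the fitting targets are plain residuals), one-dimensional inputs, and regression stumps as weak learners. The names `fit_regression_stump`, `gradient_boost_fit`, and `mse`, as well as the toy data, are hypothetical.

```python
def fit_regression_stump(xs, targets):
    """Fit a single-split stump to `targets` by minimizing squared error."""
    best = None
    # Skip the largest value: splitting there would leave the right side empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def gradient_boost_fit(xs, ys, M=50, eta=0.1):
    """Build F_M(x) = F_0 + eta * sum_m h_m(x) for squared-error loss."""
    f0 = sum(ys) / len(ys)               # F_0: constant prediction (the mean)
    preds = [f0] * len(xs)
    learners = []
    for _ in range(M):
        # Residuals are the negative gradient of 1/2 * (y - F(x))^2.
        residuals = [y - p for y, p in zip(ys, preds)]
        h = fit_regression_stump(xs, residuals)
        learners.append(h)
        # F_m(x) = F_{m-1}(x) + eta * h_m(x)
        preds = [p + eta * h(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + eta * sum(h(x) for h in learners)


def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data (illustrative): y = x^2 sampled on a grid.
xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]

baseline = lambda x: sum(ys) / len(ys)   # constant model F_0 only
boosted = gradient_boost_fit(xs, ys, M=50, eta=0.1)

baseline_mse = mse(baseline, xs, ys)
boosted_mse = mse(boosted, xs, ys)
```

With a small $\eta$, each round removes only part of the remaining residual, so more rounds are needed, but no single stump dominates the final model. A different loss would only change how the `residuals` line is computed.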

### 4.3 XGBoost (Extreme Gradient Boosting)
- Optimized version of Gradient Boosting.
- Features: regularization, parallelization, tree pruning, handling missing values.
- Very popular in Kaggle competitions.

### 4.4 LightGBM & CatBoost
- Gradient boosting frameworks with different emphases: LightGBM targets **training speed and large datasets**, while CatBoost targets native handling of **categorical features**.

---

## 5. Advantages & Disadvantages

**Advantages**
- Often more accurate than a single model of the same type.
- Works well with weak learners (like decision stumps).
- Primarily reduces bias, since each round corrects systematic errors left by the previous rounds.

**Disadvantages**
- Sensitive to noisy data and outliers (because it gives higher weight to hard points).
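- Training is sequential, so it is harder to parallelize and usually slower to train than bagging methods.
- Can overfit if too many learners are used, especially on noisy data.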

---

## 6. Summary

- **Boosting** combines weak learners sequentially, with each focusing on previous mistakes.
- **AdaBoost** → adjusts sample weights.
- **Gradient Boosting** → fits residuals using gradient descent.
- **XGBoost, LightGBM, CatBoost** → modern, optimized boosting frameworks.

Boosting is one of the most effective ML techniques in practice and is widely used in machine learning competitions and production systems.
