## **Gradient Boosting Classifier: Overview**

Gradient Boosting is a machine learning technique that builds a strong classifier by combining many weak learners (typically shallow decision trees) **in sequence**.
At each stage, a new tree is fit to the errors of the model built so far.

---

### **1. Stage-wise Tree Boosting**

* **How it works:**

  * The model starts with a constant prediction (for example, the mean of the targets).
  * At each stage, a new tree is trained to predict the “residuals”: the gap between the true values and the current model's predictions.
  * Trees are added one at a time; earlier trees are never changed, and each new tree corrects the mistakes of the ensemble built so far.
* **Formula (simplified):**
  After $m$ stages (trees), the prediction is:

  $$
  F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)
  $$

  where $\eta$ is the learning rate and $h_m(x)$ is the prediction from the $m$-th tree.
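
The update formula above can be sketched in plain Python. This is a minimal illustration, not a library API: the names `boosted_prediction`, `initial`, `trees`, and `eta` are hypothetical, and each tree is modeled as an ordinary function that maps `x` to a correction value.

```python
def boosted_prediction(x, initial, trees, eta):
    """Return the initial guess plus eta times each tree's output, in order."""
    prediction = initial
    for tree in trees:
        prediction += eta * tree(x)
    return prediction


# One hypothetical stump: pushes small x down and large x up.
def stump(x):
    return -0.5 if x < 3 else 0.5


print(boosted_prediction(2, initial=0.5, trees=[stump], eta=0.1))
print(boosted_prediction(4, initial=0.5, trees=[stump], eta=0.1))
```

With an empty list of trees the function simply returns the initial guess, which mirrors the model before any boosting stage has run.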

---

### **2. Learning Rate**

* The **learning rate** $\eta$ (also called “shrinkage”) is a number between 0 and 1 that **scales each new tree’s prediction before it is added to the overall model**.
* **Low learning rate:**

  * Each tree has less influence.
  * Needs more trees to fit well, but reduces the risk of overfitting.
* **High learning rate:**

  * Each tree has more influence.
  * Model can learn faster, but may overfit.
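
The speed side of this trade-off can be illustrated with a toy calculation. The sketch below assumes an idealized case in which every tree predicts the current residual exactly, so each stage removes a fraction `eta` of the remaining error. The function name, starting residual, and tolerance are hypothetical choices for illustration, and the sketch says nothing about overfitting itself.

```python
def stages_until_close(eta, start_residual=0.5, tolerance=0.05):
    """Count boosting stages until the remaining residual drops below tolerance.

    Assumes every tree predicts the current residual exactly, so each stage
    removes a fraction eta of whatever error is left.
    """
    residual = start_residual
    stages = 0
    while abs(residual) >= tolerance:
        residual -= eta * residual
        stages += 1
    return stages


for eta in (0.1, 0.5):
    print(f"eta={eta}: {stages_until_close(eta)} stages")
```

Under these assumptions the loop needs 22 stages at a learning rate of 0.1 and only 4 at 0.5, which is why a low learning rate is usually paired with a larger number of trees.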

---

### **3. Overfitting Control**

Gradient boosting is **prone to overfitting** if not controlled.
**Common strategies:**

* **Use a low learning rate** (common values: 0.01 to 0.2)
* **Limit the number of trees (n\_estimators)**
* **Restrict tree depth** (e.g., max\_depth=3)
* **Use subsampling** (fit each tree on a random subset of the training rows)
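
As a rough sketch of the subsampling idea, the snippet below draws a fresh random subset of row indices for each stage using only the standard library. The helper name `subsample_indices`, the 80% fraction, and the fixed seed are assumptions made for illustration; real libraries expose this through their own parameters.

```python
import random


def subsample_indices(n_samples, fraction, rng):
    """Pick a random subset of row indices (without replacement) for one tree."""
    k = max(1, round(fraction * n_samples))
    return sorted(rng.sample(range(n_samples), k))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
for stage in range(3):
    rows = subsample_indices(n_samples=10, fraction=0.8, rng=rng)
    print(f"stage {stage + 1}: fit the tree on rows {rows}")
```

Because each tree sees a slightly different sample, no single tree can memorize the full training set, which tends to reduce variance.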

---

## **Summary Table**

| Concept          | Description                                                |
| ---------------- | ---------------------------------------------------------- |
| Stage-wise       | Trees are added one at a time, correcting previous errors  |
| Learning Rate    | Controls the impact of each tree (lower is safer, slower)  |
| Overfitting Ctrl | Fewer trees, shallow trees, low learning rate, subsampling |

---

## **Mini Example**

Suppose we want to classify emails as spam/not spam:

1. **Stage 1:** Start with a baseline guess.
2. **Stage 2:** Fit a small tree to the errors (residuals).
3. **Stage 3:** Fit another tree to the new errors, and so on.
4. **Learning rate:** If set to 0.1, each new tree’s correction is multiplied by 0.1 before it is added to the model.


---

## **Worked Example: Dataset**

| Sample | X | True Class (y) |
| ------ | - | -------------- |
| 1      | 1 | 0              |
| 2      | 2 | 0              |
| 3      | 3 | 1              |
| 4      | 4 | 1              |

The task is to predict the class $y$ (0 or 1) from the single feature $X$ (binary classification).

---

### **Step 1: Initial Prediction**

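We start with a constant guess, here the average of the target values. (Treating the 0/1 labels as numbers keeps the arithmetic simple; classifiers trained with log loss typically start from the log-odds of the positive class instead.)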

$$
F_0(x) = \text{mean}(y) = \frac{0 + 0 + 1 + 1}{4} = 0.5
$$

So, for every $x$, $F_0(x) = 0.5$.

---

### **Step 2: Compute Residuals**

Each residual is the true value minus the current prediction, $y - F_0(x)$:

| Sample | True y | F₀(x) | Residual (y - F₀(x)) |
| ------ | ------ | ----- | -------------------- |
| 1      | 0      | 0.5   | -0.5                 |
| 2      | 0      | 0.5   | -0.5                 |
| 3      | 1      | 0.5   | +0.5                 |
| 4      | 1      | 0.5   | +0.5                 |
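
The word “gradient” comes from reading these residuals as a negative gradient. A short derivation, assuming a squared-error loss (an assumption chosen here because it reproduces the residuals in the table):

$$
L(y, F) = \tfrac{1}{2}\,(y - F)^2
\quad\Longrightarrow\quad
-\frac{\partial L}{\partial F} = y - F
$$

So fitting a tree to $y - F_0(x)$ moves the prediction in the direction that reduces this loss fastest. Other losses swap in a different gradient, while the stage-wise procedure stays the same.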

---

### **Step 3: Fit First Tree on Residuals**

Suppose a depth-1 decision tree (a stump) fit to these residuals predicts:

$$
h_1(x) = \begin{cases}
-0.5 & \text{if } x < 3 \\
+0.5 & \text{if } x \geq 3
\end{cases}
$$
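
One way such a stump could be found is a brute-force search over split points. The sketch below assumes a squared-error split criterion and midpoint thresholds, and the helper `fit_stump` is hypothetical. On this dataset it picks the threshold 2.5, which separates the samples exactly as the rule $x < 3$ does.

```python
def fit_stump(xs, residuals):
    """Find the split threshold that minimises squared error on the residuals.

    Returns (threshold, left_value, right_value); the stump predicts
    left_value when x < threshold and right_value otherwise.
    Requires at least two distinct x values.
    """
    best = None
    points = sorted(set(xs))
    # Candidate thresholds: midpoints between neighbouring distinct x values.
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = sum((r - left_value) ** 2 for r in left) + sum(
            (r - right_value) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


print(fit_stump([1, 2, 3, 4], [-0.5, -0.5, 0.5, 0.5]))  # (2.5, -0.5, 0.5)
```

Each leaf value is the mean residual of the samples that fall into it, which is the best constant prediction under squared error.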

---

### **Step 4: Update Prediction (Use Learning Rate η = 0.1)**

$$
F_1(x) = F_0(x) + \eta \cdot h_1(x)
$$

For $x = 2$:

$$
F_1(2) = 0.5 + 0.1 \times (-0.5) = 0.5 - 0.05 = 0.45
$$

For $x = 4$:

$$
F_1(4) = 0.5 + 0.1 \times (0.5) = 0.5 + 0.05 = 0.55
$$

---

### **Step 5: Continue Boosting**

* Compute new residuals using $F_1(x)$: $0 - 0.45 = -0.45$ for samples 1 and 2, and $1 - 0.55 = +0.45$ for samples 3 and 4
* Fit the next tree on these new residuals
* Update prediction: $F_2(x) = F_1(x) + \eta \cdot h_2(x)$
* Repeat for more stages
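
The repeated stages can be sketched as a short loop. This is an illustrative implementation under stated assumptions, not a reference one: it uses squared-error residuals, a brute-force stump search, and hypothetical helper names (`fit_stump`, `boost`).

```python
def fit_stump(xs, targets):
    """Best single-split stump under squared error: (threshold, left, right)."""
    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, targets) if x < t]
        right = [r for x, r in zip(xs, targets) if x >= t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]


def boost(xs, ys, eta, n_stages):
    """Run n_stages of residual fitting; return the final predictions."""
    preds = [sum(ys) / len(ys)] * len(ys)  # F_0: mean of the targets
    for _ in range(n_stages):
        residuals = [y - f for y, f in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        # F_m(x) = F_{m-1}(x) + eta * h_m(x)
        preds = [f + eta * (lv if x < t else rv) for x, f in zip(xs, preds)]
    return preds


X = [1, 2, 3, 4]
y = [0, 0, 1, 1]
for stages in (1, 2, 10, 50):
    print(stages, [round(f, 3) for f in boost(X, y, eta=0.1, n_stages=stages)])
```

On this dataset every stump fits the residuals exactly, so the gap between prediction and label shrinks by a factor of 0.9 per stage and the predictions drift steadily toward 0 and 1.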

---

## **Worked Example Summary**

| Sample | True y | F₀(x) | h₁(x) | F₁(x) |
| ------ | ------ | ----- | ----- | ----- |
| 1      | 0      | 0.5   | -0.5  | 0.45  |
| 2      | 0      | 0.5   | -0.5  | 0.45  |
| 3      | 1      | 0.5   | +0.5  | 0.55  |
| 4      | 1      | 0.5   | +0.5  | 0.55  |
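
To turn these scores into class labels, one simple option is to cut at 0.5. The threshold and the helper name `to_class` are assumptions for illustration; the walk-through itself does not specify a decision rule.

```python
def to_class(score, threshold=0.5):
    """Map a boosted score to a class label: 1 if score >= threshold, else 0."""
    return 1 if score >= threshold else 0


F1 = [0.45, 0.45, 0.55, 0.55]  # scores after one boosting stage
print([to_class(f) for f in F1])  # [0, 0, 1, 1]
```

After a single stage the scores already fall on the correct side of the threshold for all four samples, even though they are still far from 0 and 1.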

---

**In practice**, more trees (stages) and possibly deeper trees are added, but the logic remains the same:

* Start with a simple prediction
* Add trees that fit the residuals
* Scale each tree by the learning rate


The first boosting stage in plain Python:

```python
# Dataset
X = [1, 2, 3, 4]
y = [0, 0, 1, 1]

# Learning rate
eta = 0.1

# Step 1: Initial prediction (mean of y)
F0 = sum(y) / len(y)  # 0.5
F = [F0] * len(X)
print("Step 1: Initial prediction (F0):", F)

# Step 2: Residuals (true y - current prediction)
residuals = [yi - fi for yi, fi in zip(y, F)]
print("Step 2: Residuals:", residuals)

# Step 3: First tree (manual stump)
# Predict -0.5 for x < 3, +0.5 for x >= 3
h1 = [-0.5 if xi < 3 else 0.5 for xi in X]
print("Step 3: First tree predictions:", h1)

# Step 4: Update predictions with learning rate
F1 = [fi + eta * hi for fi, hi in zip(F, h1)]
print("Step 4: Updated predictions (F1):", F1)
```

Output:

```
Step 1: Initial prediction (F0): [0.5, 0.5, 0.5, 0.5]
Step 2: Residuals: [-0.5, -0.5, 0.5, 0.5]
Step 3: First tree predictions: [-0.5, -0.5, 0.5, 0.5]
Step 4: Updated predictions (F1): [0.45, 0.45, 0.55, 0.55]
```
