# W10 : Gradient Boosting for Regression

### **Problem Setup**

We have a dataset:
$$
D = \{(x_i, y_i)\}_{i=1}^n, \quad x_i \in \mathbb{R}^d, \; y_i \in \mathbb{R}
$$
Our goal is to learn a function $F(x)$ that minimizes an empirical risk (loss function):

$$
\min_{F} \; \mathcal{L} = \sum_{i=1}^{n} L(y_i, F(x_i))
$$

where $L(y, F(x))$ is a differentiable, typically convex, per-sample loss (e.g., squared error).


### **Step 1: Initialization**

We start with an initial model $F_0(x)$: the constant prediction that minimizes the total loss:

$$
F_0(x) = \arg\min_{c} \sum_{i=1}^n L(y_i, c)
$$

For **squared loss** $L(y, F(x)) = \frac{1}{2}(y - F(x))^2$,
this simplifies to the mean of the targets:
$$
F_0(x) = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
$$
which matches your step:
$$
y_{\text{train}_{p_0}} = \frac{1}{n}\sum_{i=1}^{n} y_{\text{train}_i}, \quad y_{\text{test}_{p_0}} = y_{\text{train}_{p_0}}
$$
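
As a quick sanity check of the claim that the mean is the best constant under squared loss, here is a minimal sketch in plain Python. The target values and the helper `squared_loss` are made up for illustration; they are not part of the original notes.

```python
def squared_loss(y, c):
    """Total squared loss sum_i 0.5 * (y_i - c)^2 for a constant prediction c."""
    return sum(0.5 * (yi - c) ** 2 for yi in y)

# Hypothetical training targets, chosen only for illustration.
y_train = [3.0, 5.0, 8.0, 12.0]

# F_0: the mean of the targets.
f0 = sum(y_train) / len(y_train)

# Compare the loss at the mean against a few nearby constants.
candidates = [f0 - 1.0, f0 - 0.1, f0, f0 + 0.1, f0 + 1.0]
losses = [squared_loss(y_train, c) for c in candidates]
best = candidates[losses.index(min(losses))]

print("F_0 =", f0)
print("best candidate is the mean:", best == f0)
```

The same constant is used as the initial prediction for both the training and the test set.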

### **Step 2: Compute Residuals (Pseudo-Residuals)**

At iteration $m$, compute the **negative gradient** of the loss function with respect to the prediction, evaluated at the current model $F_{m-1}(x_i)$:
$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x_i)=F_{m-1}(x_i)}$$

For **squared loss**, this becomes:
$$r_{im} = y_i - F_{m-1}(x_i)$$
This is exactly your residual formula:
$$
r_0 = y_{\text{train}} - y_{\text{train}_{p_0}}
$$
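
The sketch below checks numerically that, for squared loss, the negative gradient really is the plain residual. The data and the helper names (`loss`, `negative_gradient`) are assumptions introduced for this example; the gradient is estimated with a central finite difference rather than derived symbolically.

```python
def loss(y, f):
    """Per-sample squared loss 0.5 * (y - f)^2."""
    return 0.5 * (y - f) ** 2

def negative_gradient(y, f, eps=1e-6):
    """Central finite-difference estimate of -dL/dF at the current prediction f."""
    return -(loss(y, f + eps) - loss(y, f - eps)) / (2 * eps)

# Hypothetical training targets, chosen only for illustration.
y_train = [3.0, 5.0, 8.0, 12.0]

# Current model F_0: the mean of the targets.
f_prev = sum(y_train) / len(y_train)

# Closed-form pseudo-residuals for squared loss: r_i = y_i - F_0(x_i).
residuals = [yi - f_prev for yi in y_train]

# The same quantity estimated numerically from the loss.
numeric = [negative_gradient(yi, f_prev) for yi in y_train]

print(residuals)
print(numeric)
```

Because $F_0$ is the mean, the residuals of this first round sum to zero.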

### **Step 3: Fit a Weak Learner**

Train a weak learner (e.g., a regression tree) $h_m(x)$ to predict the residuals $r_{im}$:

$$
h_m(x_i) \approx r_{im}, \quad i = 1, \dots, n
$$

In other words, the learner approximates the direction of **steepest descent** in function space (the negative gradient).
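
To make "fit a weak learner to the residuals" concrete, here is a minimal sketch of a depth-1 regression tree (a stump) on one-dimensional inputs, written in plain Python. The functions `fit_stump` and `predict_stump` and the toy data are hypothetical; a real pipeline would more likely use a library tree implementation.

```python
def fit_stump(x, r):
    """Fit a depth-1 regression tree to residuals r on 1-D inputs x.

    Returns (threshold, left_value, right_value) minimising the squared
    error between the stump's output and r. Assumes x contains at least
    two distinct values.
    """
    best = None
    xs = sorted(set(x))
    # Candidate thresholds: midpoints between consecutive distinct x values.
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        # Each leaf predicts the mean residual of its samples.
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        sse = (sum((ri - lv) ** 2 for ri in left)
               + sum((ri - rv) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    _, t, lv, rv = best
    return t, lv, rv

def predict_stump(stump, xi):
    """Evaluate a fitted stump at a single input xi."""
    t, lv, rv = stump
    return lv if xi <= t else rv

# Hypothetical inputs and first-round residuals, for illustration only.
x_train = [1.0, 2.0, 3.0, 4.0]
r = [-4.0, -2.0, 1.0, 5.0]

stump = fit_stump(x_train, r)
print(stump)
```

The stump cannot reproduce the residuals exactly; it only captures their coarse direction, which is what makes it a weak learner.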


### **Step 4: Compute Step Size (Line Search)**

Optionally, compute the optimal multiplier $\gamma_m$ for the weak learner to minimize the overall loss:

$$
\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i))
$$

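For squared loss, when $h_m$ is a regression tree fit by least squares (each leaf predicts the mean residual of its samples), the optimal multiplier is:
$$
\gamma_m = 1
$$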
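
A short derivation shows where this value comes from. It is a sketch that assumes $h_m$ is not identically zero on the training set, and it uses the shorthand $r_i = y_i - F_{m-1}(x_i)$ and $h_i = h_m(x_i)$, introduced here for brevity. Under squared loss the line search is a one-dimensional quadratic in $\gamma$:

$$
\frac{d}{d\gamma} \sum_{i=1}^{n} \frac{1}{2}(r_i - \gamma h_i)^2 = -\sum_{i=1}^{n} h_i (r_i - \gamma h_i) = 0
\quad\Longrightarrow\quad
\gamma_m = \frac{\sum_{i=1}^{n} r_i h_i}{\sum_{i=1}^{n} h_i^2}
$$

If $h_m$ is a regression tree fit by least squares, each leaf $R_j$ predicts the mean $b_j$ of the residuals of its $n_j$ samples, so within every leaf:

$$
\sum_{i \in R_j} r_i h_i = b_j \sum_{i \in R_j} r_i = n_j b_j^2 = \sum_{i \in R_j} h_i^2
$$

Summing over the leaves makes the numerator and denominator equal, which gives $\gamma_m = 1$. For a weak learner that is not an exact least-squares fit, the ratio can differ from 1.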

### **Step 5: Update the Model**

Update the model by adding the scaled weak learner to the ensemble:

$$
F_m(x) = F_{m-1}(x) + \alpha \gamma_m h_m(x)
$$

where $\alpha \in (0, 1]$ is the **learning rate** (shrinkage) that controls the contribution of each weak learner.

Your equivalent formula:
$$
y_{\text{train}_{p_1}} = y_{\text{train}_{p_0}} + \alpha f_0(X_{\text{train}})
$$
$$
y_{\text{test}_{p_1}} = y_{\text{test}_{p_0}} + \alpha f_0(X_{\text{test}})
$$
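
The update can be traced by hand on a toy example. In this sketch the weak learner `f0_learner` is a hard-coded, hypothetical stump standing in for $f_0$, and the data and the value of `alpha` are made up for illustration.

```python
alpha = 0.1  # learning rate (illustrative value)

def f0_learner(x):
    """Hypothetical weak learner f_0: predicts -3 for x <= 2.5, else +3."""
    return -3.0 if x <= 2.5 else 3.0

def total_loss(y, preds):
    """Total squared loss sum_i 0.5 * (y_i - pred_i)^2."""
    return sum(0.5 * (yi - pi) ** 2 for yi, pi in zip(y, preds))

# Hypothetical data, for illustration only.
x_train = [1.0, 2.0, 3.0, 4.0]
y_train = [3.0, 5.0, 8.0, 12.0]
x_test = [1.5, 3.5]

# Step 1: both sets start from the training mean.
mean_y = sum(y_train) / len(y_train)
y_train_p0 = [mean_y] * len(x_train)
y_test_p0 = [mean_y] * len(x_test)

# Step 5: add the shrunken weak-learner output to the previous predictions.
y_train_p1 = [p + alpha * f0_learner(x) for p, x in zip(y_train_p0, x_train)]
y_test_p1 = [p + alpha * f0_learner(x) for p, x in zip(y_test_p0, x_test)]

print(y_train_p1, y_test_p1)
print(total_loss(y_train, y_train_p0), total_loss(y_train, y_train_p1))
```

With a small `alpha` each round moves the predictions only a short distance toward the targets, which is why many rounds are normally used.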

### **Step 6: Repeat**

Repeat steps 2–5 for $M$ boosting rounds, giving the final model:

$$
F_M(x) = F_0(x) + \alpha \sum_{m=1}^{M} \gamma_m h_m(x)
$$

### **For Squared Loss (Regression)**

If $L(y, F(x)) = \frac{1}{2}(y - F(x))^2$, then:

* $r_{im} = y_i - F_{m-1}(x_i)$
* $\gamma_m = 1$
* Final model:
  $$
  F_M(x) = \bar{y} + \alpha \sum_{m=1}^{M} h_m(x)
  $$
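
Putting the steps together for squared loss, the whole procedure can be sketched in plain Python as follows. This is an illustrative toy, not a reference implementation: the weak learner is a one-dimensional stump, and all function names (`fit_stump`, `predict_stump`, `gradient_boost`, `predict`, `mse`), the data, and the default hyperparameters are assumptions made for this example.

```python
def fit_stump(x, r):
    """Least-squares depth-1 tree on 1-D inputs; needs >= 2 distinct x values."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lv = sum(left) / len(left)    # leaf value = mean residual
        rv = sum(right) / len(right)
        sse = (sum((ri - lv) ** 2 for ri in left)
               + sum((ri - rv) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    _, t, lv, rv = best
    return t, lv, rv

def predict_stump(stump, xi):
    t, lv, rv = stump
    return lv if xi <= t else rv

def gradient_boost(x, y, n_rounds=100, alpha=0.1):
    """Squared-loss gradient boosting with gamma_m = 1 and learning rate alpha."""
    f0 = sum(y) / len(y)                  # Step 1: constant initial model
    preds = [f0] * len(y)
    stumps = []
    for _ in range(n_rounds):             # Step 6: repeat M times
        residuals = [yi - pi for yi, pi in zip(y, preds)]   # Step 2
        stump = fit_stump(x, residuals)                     # Step 3
        stumps.append(stump)
        # Steps 4-5: gamma_m = 1 for a least-squares stump, then shrink by alpha.
        preds = [pi + alpha * predict_stump(stump, xi)
                 for pi, xi in zip(preds, x)]
    return f0, alpha, stumps

def predict(model, xi):
    """F_M(x) = mean(y) + alpha * sum_m h_m(x)."""
    f0, alpha, stumps = model
    return f0 + alpha * sum(predict_stump(s, xi) for s in stumps)

def mse(model, x, y):
    return sum((yi - predict(model, xi)) ** 2 for xi, yi in zip(x, y)) / len(y)

# Hypothetical 1-D regression data, for illustration only.
x_train = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_train = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]

baseline_model = gradient_boost(x_train, y_train, n_rounds=0)
boosted_model = gradient_boost(x_train, y_train, n_rounds=100, alpha=0.1)

print("MSE with F_0 only:", mse(baseline_model, x_train, y_train))
print("MSE after 100 rounds:", mse(boosted_model, x_train, y_train))
```

The comparison here is on training data only, so it shows that the loss is being driven down, not that the model generalizes; choosing $M$ and $\alpha$ in practice would need a held-out set.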


### **Summary Table**

| Step | Mathematical Operation                                                   | Description                       |
| ---- | ------------------------------------------------------------------------ | --------------------------------- |
| 1    | $F_0(x) = \arg\min_c \sum L(y_i, c)$                                   | Initialize model with constant    |
| 2    | $r_{im} = -\left[\frac{\partial L}{\partial F(x_i)}\right]_{F=F_{m-1}}$ | Compute pseudo-residuals          |
| 3    | $h_m(x) \approx r_{im}$                                                | Fit weak learner                  |
| 4    | $\gamma_m = \arg\min_\gamma \sum L(y_i, F_{m-1}(x_i)+\gamma h_m(x_i))$ | Compute step size                 |
| 5    | $F_m(x) = F_{m-1}(x) + \alpha \gamma_m h_m(x)$                         | Update model                      |
| 6    | Repeat                                                                   | Until convergence or $M$ rounds |