
# 📘 XGBoost Regression
---

# 1️⃣ Problem Setup

We are solving a **regression problem using XGBoost**.

### Dataset
| CGPA | Package (LPA) |
|------|--------------|
| 6.7  | 4.5 |
| 9.0  | 11 |
| 7.5  | 6 |
| 5.0  | 8 |

Goal:
> Predict the package (in LPA) from CGPA

---

# 2️⃣ Big Idea

XGBoost = Gradient Boosting + Improvements

So the regression flow is the SAME as in Gradient Boosting:
1. Start with base model (mean)
2. Compute residuals
3. Train tree on residuals
4. Add prediction
5. Repeat

BUT:
> The trees are built DIFFERENTLY (XGBoost chooses splits with a similarity score and gain)

---

# 3️⃣ Stage 1: Base Model

Base estimator = **Mean of targets**

Mean = (4.5 + 11 + 6 + 8) / 4 = 7.375, rounded down to **7.3** to keep the arithmetic simple

So prediction for ALL students:
```
ŷ = 7.3
```
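
As a quick check, the base prediction can be computed directly. This is only a minimal sketch; the variable names are illustrative and not part of the original notes. The exact mean is 7.375, while the worked example uses the rounded value 7.3.

```python
# Targets (package in LPA) from the dataset table.
packages = [4.5, 11, 6, 8]

# Stage 1: the base model predicts the mean target for every student.
base_prediction = sum(packages) / len(packages)
print(base_prediction)  # 7.375 (the notes round this down to 7.3)
```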

---

# 4️⃣ Residuals (Pseudo-residuals)

Residual = Actual − Prediction

| CGPA | Actual | Pred | Residual |
|------|--------|------|---------|
| 6.7 | 4.5 | 7.3 | -2.8 |
| 9.0 | 11 | 7.3 | 3.7 |
| 7.5 | 6 | 7.3 | -1.3 |
| 5.0 | 8 | 7.3 | 0.7 |
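
The residual column can be reproduced with a few lines of Python. A minimal sketch, assuming the rounded base prediction of 7.3; the names `data` and `residuals` are illustrative.

```python
# Dataset from the notes: (CGPA, package in LPA).
data = [(6.7, 4.5), (9.0, 11), (7.5, 6), (5.0, 8)]
base_prediction = 7.3  # rounded mean used throughout the notes

# Residual = actual - prediction; rounding absorbs floating-point noise.
residuals = [round(actual - base_prediction, 2) for _, actual in data]
print(residuals)  # [-2.8, 3.7, -1.3, 0.7]
```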

---

# 5️⃣ Build First XGBoost Tree

We now train a tree on:
```
Input: CGPA
Target: Residuals
```

---

# 6️⃣ XGBoost Special Concepts

## 6.1 Similarity Score (Node Score)

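It plays the role of the split criterion in ordinary decision trees:
- Gini (classification)
- Variance (regression)

Unlike those impurity measures, a higher similarity score is better.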

### Formula
```
Similarity Score = (Sum of residuals)² / (Number of residuals + λ)
```

Where:
- λ = regularization parameter
- Assume λ = 0 (simplification)
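
The formula translates directly into a small function. A sketch only: `similarity_score` and its `lam` argument are hypothetical names chosen here, and the residual lists are taken from the worked example. It also shows what a non-zero λ would do.

```python
def similarity_score(residuals, lam=0.0):
    """(sum of residuals)^2 / (number of residuals + lambda)."""
    return sum(residuals) ** 2 / (len(residuals) + lam)

# Residuals with the same sign reinforce each other -> high score.
print(round(similarity_score([-2.8, -1.3]), 3))           # 8.405
# Residuals with opposite signs cancel out -> low score.
print(round(similarity_score([-2.8, 3.7]), 3))            # 0.405
# A larger lambda shrinks the score.
print(round(similarity_score([-2.8, -1.3], lam=1.0), 3))  # 5.603
```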

---

# 7️⃣ Root Node Score

Residuals:
```
[-2.8, 3.7, -1.3, 0.7]
```

Sum = 0.3

Similarity:
```
(0.3)² / 4 = 0.0225 ≈ 0.02
```

This is the parent score.

---

# 8️⃣ Find Split Points

Sort CGPA:
```
5.0, 6.7, 7.5, 9.0
```

Candidate splits (midpoints between adjacent sorted values):
```
5.85, 7.1, 8.25
```

We test ALL splits.
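
The candidate thresholds are the midpoints of neighbouring CGPA values after sorting. A minimal sketch with illustrative variable names:

```python
cgpa = [6.7, 9.0, 7.5, 5.0]

# Sort the distinct feature values, then take the midpoint of each adjacent pair.
values = sorted(set(cgpa))
candidates = [round((lo + hi) / 2, 2) for lo, hi in zip(values, values[1:])]
print(candidates)  # [5.85, 7.1, 8.25]
```

The rounding is there only to hide floating-point noise in the printed values.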

---

# 9️⃣ Split 1: CGPA < 5.85

Left:
```
[0.7]
```
Right:
```
[-2.8, -1.3, 3.7]
```

### Scores

Left:
```
0.7² / 1 = 0.49
```

Right:
```
(-2.8 - 1.3 + 3.7)² / 3 = (-0.4)² / 3 ≈ 0.05
```

### Gain
```
0.49 + 0.05 − 0.02 = 0.52
```
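
The gain for a single threshold can be checked in code. A sketch under the notes' assumptions (λ = 0, residuals from the rounded mean 7.3); `split_gain` is a hypothetical helper, not a library function.

```python
def similarity_score(residuals, lam=0.0):
    return sum(residuals) ** 2 / (len(residuals) + lam)

def split_gain(cgpa, residuals, threshold, lam=0.0):
    """Gain = left score + right score - parent score for `CGPA < threshold`."""
    left = [r for x, r in zip(cgpa, residuals) if x < threshold]
    right = [r for x, r in zip(cgpa, residuals) if x >= threshold]
    # With lam = 0, both sides must be non-empty; midpoint thresholds guarantee that.
    return (similarity_score(left, lam) + similarity_score(right, lam)
            - similarity_score(residuals, lam))

cgpa = [6.7, 9.0, 7.5, 5.0]
residuals = [-2.8, 3.7, -1.3, 0.7]
print(round(split_gain(cgpa, residuals, 5.85), 2))  # 0.52
```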

---

# 🔟 Split 2: CGPA < 7.1

Left:
```
[0.7, -2.8]
```
Right:
```
[-1.3, 3.7]
```

Scores:
```
Left  = (0.7 - 2.8)² / 2 ≈ 2.20
Right = (-1.3 + 3.7)² / 2 = 2.88
```

Gain:
```
2.20 + 2.88 − 0.02 = 5.06
```

---

# 1️⃣1️⃣ Split 3: CGPA < 8.25

Left:
```
[0.7, -2.8, -1.3]
```
Right:
```
[3.7]
```

Scores:
```
Left  = (0.7 - 2.8 - 1.3)² / 3 ≈ 3.85
Right = 3.7² / 1 = 13.69
```

Gain:
```
3.85 + 13.69 − 0.02 = 17.52
```

✅ Best split = **8.25** (highest gain)
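
The comparison of all three splits can be automated as an exact greedy search over the midpoints. A sketch only, assuming λ = 0 by default; `best_split` is a hypothetical helper name.

```python
def similarity_score(residuals, lam=0.0):
    return sum(residuals) ** 2 / (len(residuals) + lam)

def best_split(points, lam=0.0):
    """points: list of (cgpa, residual). Returns (threshold, gain) with the highest gain."""
    points = sorted(points)
    parent = similarity_score([r for _, r in points], lam)
    best = None
    for i in range(1, len(points)):
        if points[i - 1][0] == points[i][0]:
            continue  # no threshold fits between equal feature values
        threshold = (points[i - 1][0] + points[i][0]) / 2
        left = [r for _, r in points[:i]]
        right = [r for _, r in points[i:]]
        gain = similarity_score(left, lam) + similarity_score(right, lam) - parent
        if best is None or gain > best[1]:
            best = (threshold, gain)
    return best

points = [(6.7, -2.8), (9.0, 3.7), (7.5, -1.3), (5.0, 0.7)]
threshold, gain = best_split(points)
print(round(threshold, 2), round(gain, 2))  # 8.25 17.52
```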

---

# 1️⃣2️⃣ Second Level Split

We now split the LEFT node further. The right node holds a single residual, so it cannot be split.

Remaining splits:
- 5.85
- 7.1

After gain calculation (≈ 5.04 for 5.85 vs ≈ 0.04 for 7.1):
> Best second split = 5.85

Final tree structure:

```
          CGPA < 8.25
          /         \
   CGPA < 5.85      3.7
     /      \
   0.7     -2.05
```
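
The gain calculation for the second level is not spelled out in the notes; the sketch below fills it in under the same assumptions (λ = 0, residuals from the rounded mean). The names are illustrative.

```python
def similarity_score(residuals, lam=0.0):
    return sum(residuals) ** 2 / (len(residuals) + lam)

# Left node after the root split (CGPA < 8.25), ordered by CGPA: 5.0, 6.7, 7.5.
node = [0.7, -2.8, -1.3]
parent = similarity_score(node)

# Splitting after the first k sorted residuals corresponds to thresholds 5.85 and 7.1.
gains = {}
for threshold, k in [(5.85, 1), (7.1, 2)]:
    gains[threshold] = similarity_score(node[:k]) + similarity_score(node[k:]) - parent
    print(threshold, round(gains[threshold], 2))
# 5.85 5.04
# 7.1 0.04
```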

---

# 1️⃣3️⃣ Leaf Output Formula

### Leaf Value
```
Output = Sum of residuals / (Number of residuals + λ)
```

With λ = 0 → the output is the average residual in the leaf

---

# 1️⃣4️⃣ Leaf Outputs

| Leaf Residuals | Output |
|---------------|--------|
| [0.7] | 0.7 |
| [-2.8, -1.3] | -2.05 |
| [3.7] | 3.7 |
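
The table can be reproduced with a one-line function. A sketch: `leaf_output` is a hypothetical name, and the second printed column (λ = 1) is an extra illustration of how regularization would shrink the leaf values; it is not used in the worked example.

```python
def leaf_output(residuals, lam=0.0):
    """Sum of residuals / (number of residuals + lambda)."""
    return sum(residuals) / (len(residuals) + lam)

leaves = {
    "CGPA < 5.85": [0.7],
    "5.85 <= CGPA < 8.25": [-2.8, -1.3],
    "CGPA >= 8.25": [3.7],
}
for name, res in leaves.items():
    # leaf name, output with lambda = 0, output with lambda = 1
    print(name, round(leaf_output(res), 2), round(leaf_output(res, lam=1.0), 2))
# CGPA < 5.85 0.7 0.35
# 5.85 <= CGPA < 8.25 -2.05 -1.37
# CGPA >= 8.25 3.7 1.85
```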

---

# 1️⃣5️⃣ Stage 2 Model

Combined model:
```
Prediction = Mean + η × Tree Output
```

Where:
- η = learning rate
- Assume η = 0.3

---

# 1️⃣6️⃣ Predictions

Example: CGPA = 6.7

Falls into:
```
CGPA < 8.25 ✔
CGPA < 5.85 ❌ → leaf = -2.05
```

Prediction:
```
7.3 + 0.3 × (-2.05) = 6.685 ≈ 6.69
```
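
The same lookup can be written as nested conditions. A minimal sketch of the stage 2 model, assuming the rounded mean 7.3 and η = 0.3 from the notes; the function names are illustrative.

```python
BASE = 7.3  # rounded mean from Stage 1
ETA = 0.3   # learning rate assumed in the notes

def tree_output(cgpa):
    """First tree from the notes, written as nested conditions."""
    if cgpa < 8.25:
        if cgpa < 5.85:
            return 0.7
        return -2.05
    return 3.7

def predict(cgpa):
    return BASE + ETA * tree_output(cgpa)

print(round(predict(6.7), 3))  # 6.685
```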

---

# 1️⃣7️⃣ Residuals After Stage 2

| CGPA | Actual | New Pred | New Residual |
|------|--------|----------|--------------|
| 6.7 | 4.5 | 6.69 | -2.19 |
| 9.0 | 11 | 8.41 | 2.59 |
| 7.5 | 6 | 6.69 | -0.69 |
| 5.0 | 8 | 7.51 | 0.49 |

Residuals moved closer to **0** ✅
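
One way to quantify "closer to 0" is the sum of squared residuals before and after the first tree. A sketch using the notes' numbers (rounded mean 7.3, η = 0.3); the variable names are illustrative.

```python
data = [(6.7, 4.5), (9.0, 11), (7.5, 6), (5.0, 8)]
base, eta = 7.3, 0.3
# Leaf output of the first tree for each training point (same order as `data`).
tree_out = [-2.05, 3.7, -2.05, 0.7]

old = [y - base for _, y in data]
new = [y - (base + eta * t) for (_, y), t in zip(data, tree_out)]

# Sum of squared residuals before and after adding the first tree.
sse_old = sum(r * r for r in old)
sse_new = sum(r * r for r in new)
print(round(sse_old, 2), round(sse_new, 2))  # 23.71 12.19
```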

---

# 1️⃣8️⃣ Stage 3+

Repeat:
1. Train a new tree on the new residuals
2. Add it to the model

Final model:
```
F(x) = Mean + ηT1 + ηT2 + ηT3 + ...
```
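
Putting the stages together, the whole loop can be sketched in plain Python. This is a simplified illustration of the procedure in these notes, not the XGBoost library's implementation. Assumptions made here: one feature, exact greedy splits, a hypothetical `max_depth` of 2, and the exact mean (7.375) as the base, so the numbers differ slightly from the tables. All function names are made up for this sketch.

```python
def similarity(res, lam):
    # (sum of residuals)^2 / (count + lambda)
    return sum(res) ** 2 / (len(res) + lam)

def build_tree(points, depth, lam):
    """points: (cgpa, residual) pairs sorted by cgpa. Returns a dict node or a float leaf."""
    res = [r for _, r in points]
    leaf = sum(res) / (len(res) + lam)
    if depth == 0 or len(points) < 2:
        return leaf
    best_gain, best_i = 0.0, None
    for i in range(1, len(points)):
        if points[i - 1][0] == points[i][0]:
            continue  # cannot split between equal feature values
        gain = similarity(res[:i], lam) + similarity(res[i:], lam) - similarity(res, lam)
        if gain > best_gain:
            best_gain, best_i = gain, i
    if best_i is None:
        return leaf  # no split improves on the parent
    return {
        "threshold": (points[best_i - 1][0] + points[best_i][0]) / 2,
        "left": build_tree(points[:best_i], depth - 1, lam),
        "right": build_tree(points[best_i:], depth - 1, lam),
    }

def tree_predict(tree, x):
    # Walk down until a leaf (a plain float) is reached.
    while isinstance(tree, dict):
        tree = tree["left"] if x < tree["threshold"] else tree["right"]
    return tree

def fit(xs, ys, n_trees=50, eta=0.3, lam=0.0, max_depth=2):
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    trees = []
    for _ in range(n_trees):
        # Each new tree is trained on the current residuals.
        points = sorted((x, y - p) for x, y, p in zip(xs, ys, preds))
        tree = build_tree(points, max_depth, lam)
        trees.append(tree)
        preds = [p + eta * tree_predict(tree, x) for x, p in zip(xs, preds)]
    return {"base": base, "eta": eta, "trees": trees}

def predict(model, x):
    return model["base"] + sum(model["eta"] * tree_predict(t, x) for t in model["trees"])

xs = [6.7, 9.0, 7.5, 5.0]
ys = [4.5, 11, 6, 8]
model = fit(xs, ys)
for x, y in zip(xs, ys):
    print(x, y, round(predict(model, x), 2))
```

With more trees the training predictions approach the actual packages, which is the "repeat" step in action.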

---

# 1️⃣9️⃣ Exact vs Approx Algorithm

### Exact Greedy
- Try all splits
- Accurate
- Slow
- Used for small datasets

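### Approximate (Histogram-based)
- Bin feature values and try only the bin boundaries as splits
- Faster
- Used for large datasets (in XGBoost: `tree_method="approx"` or `"hist"`)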
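
The difference in work can be illustrated by counting candidate thresholds. A rough sketch only: the dataset is randomly generated for illustration, and equal-width bins are an assumption made here for simplicity (XGBoost is understood to build its bins from quantiles instead). Both helper names are hypothetical.

```python
import random

def exact_candidates(values):
    """Midpoints between every pair of adjacent distinct values."""
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

def binned_candidates(values, n_bins=8):
    """Interior edges of equal-width bins between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * k for k in range(1, n_bins)]

random.seed(0)
# Hypothetical larger dataset: 10,000 CGPA values between 5 and 10.
cgpa = [random.uniform(5.0, 10.0) for _ in range(10_000)]

print(len(exact_candidates(cgpa)))   # at most 9,999 thresholds to evaluate
print(len(binned_candidates(cgpa)))  # 7 thresholds to evaluate
```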

---

# 2️⃣0️⃣ Important Formulas

## Similarity Score
```
(Sum residuals)² / (n + λ)
```

## Gain
```
LeftScore + RightScore − ParentScore
```

## Leaf Output
```
Sum residuals / (n + λ)
```

## Final Model
```
F(x) = Base + Σ (learning_rate × tree_output)
```

---

# 2️⃣1️⃣ Key Insights

✅ XGBoost = Gradient boosting with:
- Regularization (λ in the similarity score and leaf output)
- Gain-based split selection
- Fast split finding (exact or histogram-based)

✅ Trees predict:
> Residuals, NOT labels

✅ Residuals → 0 ⇒ Perfect fit on the training data

---