# Ensemble Learning

Reference:
- *The Elements of Statistical Learning* (Hastie, Tibshirani, and Friedman)

Two basic ensemble methods are:
- Bagging: trains a set of individual models in parallel. Each model is trained on a random subset (a bootstrap sample) of the data. The final prediction is the average of the individual predictions for regression, or a majority vote for classification.
- Boosting: trains a set of individual models sequentially. Each model learns from the mistakes made by the previous models.

## Bagging 
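
As a minimal sketch of bagging for regression, the code below trains several one-split regression stumps, each on its own bootstrap sample, and averages their predictions. The helper names (`fit_regression_stump`, `bagging_fit`, `bagging_predict`), the stump as base learner, and the toy data are assumptions made for illustration only.

```python
import numpy as np


def fit_regression_stump(x, y):
    """Fit a one-split regression stump on a 1-D feature by least squares."""
    best_sse, best = np.inf, (x[0], y.mean(), y.mean())
    for t in np.unique(x)[:-1]:  # the largest value cannot split the data
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best = sse, (t, left.mean(), right.mean())
    return best  # (threshold, left value, right value)


def bagging_fit(x, y, n_models=25, seed=0):
    """Train each stump on its own bootstrap sample (drawn with replacement)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):  # models are independent, so this could run in parallel
        idx = rng.integers(0, len(x), size=len(x))
        models.append(fit_regression_stump(x[idx], y[idx]))
    return models


def bagging_predict(models, x):
    """Average the individual predictions (regression)."""
    preds = [np.where(x <= t, left, right) for t, left, right in models]
    return np.mean(preds, axis=0)


x_train = np.linspace(0.0, 1.0, 50)
y_train = (x_train > 0.5).astype(float)  # a noiseless step function
models = bagging_fit(x_train, y_train)
y_hat = bagging_predict(models, x_train)
print(y_hat[0], y_hat[-1])
```

For classification, the averaging step in `bagging_predict` would be replaced by a majority vote over the predicted labels.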

## Boosting


**Framework of boosting**:
- obtain the first classifier $f_1(x)$
- find another classifier $f_2(x)$ to help $f_1(x)$
  - if $f_2(x)$ is similar to $f_1(x)$, it will not help much; it should be complementary
  - the key question is how to obtain such an $f_2(x)$
- obtain the second classifier $f_2(x)$, and so on
- finally, combine all the classifiers

The classifiers are learned sequentially.


**How to obtain different classifiers**
- re-sample training data to form a new set
- re-weight training data to form a new set
- in a real implementation, there is no need to build a new set: it is enough to weight each sample's term in the cost/objective function
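
To illustrate why re-weighting and re-sampling play the same role, here is a small sketch with a weighted 0-1 loss. The function `weighted_error` and the toy labels are hypothetical; the point is only that giving a sample weight 3 has the same effect on the objective as including it three times.

```python
import numpy as np


def weighted_error(y_true, y_pred, weights):
    """Weighted 0-1 loss: share of the total weight on misclassified samples."""
    y_true, y_pred, weights = map(np.asarray, (y_true, y_pred, weights))
    return float(np.sum(weights * (y_true != y_pred)) / np.sum(weights))


y_true = np.array([1, 1, -1, -1])
y_pred = np.array([1, -1, -1, -1])  # the second sample is misclassified

# Re-weighting: put weight 3 on the misclassified sample.
weights = np.array([1.0, 3.0, 1.0, 1.0])
reweighted_error = weighted_error(y_true, y_pred, weights)

# Re-sampling: repeat the misclassified sample three times, uniform weights.
counts = np.array([1, 3, 1, 1])
resampled_error = weighted_error(
    np.repeat(y_true, counts), np.repeat(y_pred, counts), np.ones(counts.sum())
)

print(weighted_error(y_true, y_pred, np.ones(4)))  # 0.25 with uniform weights
print(reweighted_error, resampled_error)           # both 0.5
```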

### AdaBoost Trees

A typical boosting algorithm is AdaBoost (Adaptive Boosting). AdaBoost works by weighting the observations, putting more weight on difficult-to-classify instances and less on those already handled well. New classifiers are added sequentially, each focusing its training on the more difficult patterns.

- **Idea**: train $f_2(x)$ on a new training set on which $f_1(x)$ fails
- **How to find a new training set on which $f_1(x)$ fails?**
    - if $x_i$ is misclassified by $f_1$:
      - $w_{i,2} = w_{i, 1} \cdot d_1$
    - if $x_i$ is correctly classified:
      - $w_{i,2} = w_{i, 1} / d_1$
    - $d_1$ should be greater than 1; how to find $d_1$?


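The weighted error rate of $f_1$ under the initial weights $w_{i, 1}$ is

$$
err_1 = \frac{\sum_{i=1}^N w_{i, 1} I(y_i \neq f_1(x_i))}{\sum_{i=1}^N w_{i, 1}}
$$

and $f_1$ is assumed to do better than random guessing:

$$
err_1 < 0.5
$$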


Change the sample weights from $w_{i, 1}$ to $w_{i, 2}$ such that $f_1$ is no better than random guessing under the new weights:

$$
\frac{\sum_{i=1}^N w_{i, 2} I(y_i \neq f_1(x_i))}{\sum_{i=1}^N w_{i, 2}} = 0.5
$$

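We can calculate $d_1$ from this condition. With the update

- misclassified: $w_{i, 2} = w_{i, 1} d_1 = w_{i, 1} e^{\alpha_1}$
- otherwise: $w_{i, 2} = w_{i, 1} / d_1 = w_{i, 1} e^{-\alpha_1}$

the condition requires the total new weight of the misclassified samples to equal that of the correctly classified ones, i.e. $err_1 \, d_1 = (1 - err_1) / d_1$. This gives $d_1 = \sqrt{\frac{1-err_1}{err_1}}$ and $\alpha_1 = \ln d_1$. Since $err_1 < 0.5$, we get $d_1 > 1$.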
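
A quick numerical check of this reweighting step, as a sketch: the function name `reweight` and the toy setup (10 samples with uniform weights, 2 of them misclassified) are assumptions for illustration. After the update, the weighted error of the same classifier should come out at 0.5.

```python
import numpy as np


def reweight(weights, misclassified):
    """One reweighting step; returns (new_weights, d, alpha)."""
    weights = np.asarray(weights, dtype=float)
    misclassified = np.asarray(misclassified, dtype=bool)
    err = weights[misclassified].sum() / weights.sum()  # weighted error, assumed < 0.5
    d = np.sqrt((1.0 - err) / err)
    alpha = np.log(d)
    # multiply the weights of misclassified samples by d, divide the others by d
    new_weights = np.where(misclassified, weights * d, weights / d)
    return new_weights, d, alpha


w1 = np.ones(10)
miss = np.array([True, True] + [False] * 8)  # f_1 gets 2 of 10 wrong: err_1 = 0.2
w2, d1, alpha1 = reweight(w1, miss)
new_err = w2[miss].sum() / w2.sum()
print(d1, alpha1, new_err)  # approximately 2.0, 0.693, 0.5
```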



![adboost](./resources/imgs/adaboost.png)
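
The whole AdaBoost loop can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: the weak learner is a hand-written decision stump on a single feature, labels are in $\{-1, +1\}$, and the names `fit_weighted_stump`, `adaboost_fit`, and `adaboost_predict` as well as the toy data are hypothetical.

```python
import numpy as np


def fit_weighted_stump(x, y, w):
    """Pick the threshold/sign pair with the lowest weighted error on a 1-D feature."""
    best = (x[0], 1, np.inf)
    for t in np.unique(x):
        for sign in (1, -1):
            pred = np.where(x <= t, sign, -sign)
            err = w[pred != y].sum() / w.sum()
            if err < best[2]:
                best = (t, sign, err)
    return best  # (threshold, sign, weighted error)


def adaboost_fit(x, y, n_rounds=3):
    w = np.full(len(y), 1.0 / len(y))  # start from uniform weights
    ensemble = []
    for _ in range(n_rounds):
        t, sign, err = fit_weighted_stump(x, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)   # guard against division by zero
        alpha = 0.5 * np.log((1 - err) / err)  # alpha = ln d, d = sqrt((1 - err) / err)
        pred = np.where(x <= t, sign, -sign)
        w = w * np.exp(-alpha * y * pred)      # times e^alpha if wrong, e^-alpha if right
        ensemble.append((alpha, t, sign))
    return ensemble


def adaboost_predict(ensemble, x):
    # combine the classifiers: sign of the alpha-weighted sum of their votes
    score = sum(alpha * np.where(x <= t, sign, -sign) for alpha, t, sign in ensemble)
    return np.sign(score)


x_train = np.arange(10)
y_train = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])
ensemble = adaboost_fit(x_train, y_train, n_rounds=3)
train_accuracy = np.mean(adaboost_predict(ensemble, x_train) == y_train)
print(train_accuracy)
```

No single stump can separate this toy set, but the weighted combination of three stumps classifies every training point correctly.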


AdaBoost is equivalent to forward stagewise additive modeling with an exponential loss.
Since the exponential loss is sensitive to outliers, AdaBoost has been empirically observed to be quite sensitive to noisy data and outliers.
Typical classification loss functions such as cross-entropy and hinge loss are more robust to outliers.
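
To make the sensitivity concrete, the sketch below evaluates three losses as functions of the margin $m = y f(x)$ with $y \in \{-1, +1\}$. The margin formulation and the function names are assumptions for illustration; `logistic_loss` is the binary cross-entropy written in terms of the margin. A badly misclassified point (large negative margin) is penalized exponentially by the exponential loss but only about linearly by the other two.

```python
import math


def exponential_loss(margin):
    return math.exp(-margin)


def logistic_loss(margin):
    # binary cross-entropy expressed through the margin
    return math.log1p(math.exp(-margin))


def hinge_loss(margin):
    return max(0.0, 1.0 - margin)


for margin in (2.0, 0.0, -2.0, -5.0):
    print(
        f"margin={margin:+.1f}  "
        f"exp={exponential_loss(margin):8.3f}  "
        f"logistic={logistic_loss(margin):6.3f}  "
        f"hinge={hinge_loss(margin):5.3f}"
    )
```

At a margin of -5, the exponential loss is roughly 148, while the logistic loss is about 5 and the hinge loss is 6, so a single outlier dominates the exponential objective far more strongly.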

### Gradient Boosting Trees



![gbt](./resources/imgs/gradient-boosting-tree.png)

```python
# Gradient boosting for regression with squared-error loss (a minimal sketch).
# Assumes NumPy and scikit-learn; DecisionTreeRegressor serves as the weak learner.
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_gradient_boosting(X, y, M=100, learning_rate=0.1, max_depth=3):
    y = np.asarray(y, dtype=float)

    # Step 1: initialize the model with a constant, here the mean of the target
    f0 = float(np.mean(y))
    pred = np.full(len(y), f0)
    trees = []

    # Steps 2-3: run M boosting iterations
    for m in range(M):
        # Step 4: pseudo-residuals = negative gradient of the loss with respect
        # to the current predictions; for L = (y - F)^2 / 2 this is y - F
        residuals = y - pred

        # Step 5: fit a weak learner (a shallow tree) to the pseudo-residuals
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        trees.append(tree)

        # Step 6: add the weak learner's predictions, scaled by the learning rate
        pred = pred + learning_rate * tree.predict(X)

    # Step 7: return the pieces that make up the final boosted model
    return f0, trees


def predict_gradient_boosting(X, f0, trees, learning_rate=0.1):
    # The constant plus the scaled contributions of all trees
    return f0 + learning_rate * sum(tree.predict(X) for tree in trees)
```
