

## Boosting

Boosting is an **ensemble meta-algorithm** that combines multiple **weak learners** sequentially to create a single, highly accurate **strong learner**.

* **Weak Learner:** A model (usually a simple decision tree) whose performance is only slightly better than random guessing (e.g., an accuracy of 51% in a binary classification).
* **Strong Learner:** The final aggregated model, which can in principle be made arbitrarily accurate and is strongly correlated with the true classification.
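
As a rough illustration of the gap between a weak learner and a boosted one, the sketch below compares a single decision stump with an AdaBoost ensemble of stumps. The two-moons dataset, the noise level, the split, and the number of estimators are arbitrary choices made for this example, not values taken from the text above.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, non-linearly separable data (illustrative choice).
X, y = make_moons(n_samples=1000, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Weak learner: a single depth-1 tree (decision stump).
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_train, y_train)

# Boosted ensemble: AdaBoost's default base estimator is also a depth-1 tree.
boosted = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

stump_train_acc = stump.score(X_train, y_train)
boosted_train_acc = boosted.score(X_train, y_train)
print(f"stump   train={stump_train_acc:.3f} test={stump.score(X_test, y_test):.3f}")
print(f"boosted train={boosted_train_acc:.3f} test={boosted.score(X_test, y_test):.3f}")
```

The exact numbers depend on the data and the scikit-learn version; the point is only the comparison between the two rows.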

***

## The Sequential Boosting Process

Boosting is fundamentally a sequential process where each new model is built to compensate for the weaknesses of the combined previous models.

| Step | Action | Statistical Goal |
| :--- | :--- | :--- |
| **1. Initialization** | A weak base learner is trained on the initial dataset. | Establish a baseline prediction. |
| **2. Prioritize Errors (Data Weights)** | The model identifies instances it **misclassified**. These misclassified data points are assigned **higher weights**, making them more "important" for the next learner. | Focus the subsequent learner on the hardest-to-classify samples. |
| **3. Sequential Training** | A new weak learner is trained on the now-reweighted dataset, forced to focus on the previously misclassified points. | Reduce the systematic error (Bias) by specializing the new model. |
| **4. Reward Accuracy (Learner Weights)** | Each new weak learner is assigned a **vote weight** based on its accuracy on the (weighted) training data. **More accurate** learners are given **larger weights**. | Ensure the most reliable models contribute the most to the final prediction. |
| **5. Aggregation** | The process continues until a desired performance level or a set number of learners is reached. The final strong learner aggregates the predictions from all weak learners using their calculated vote weights. | Produce a final robust, low-bias prediction. |

***

## Key Clarifications: Weights and Ensembles

### 1. The Role of Weights

| Item | What is Weighted? | Goal |
| :--- | :--- | :--- |
| **Data Weights** | The individual data points in the training set. | **Direct the Learning:** Emphasize misclassified samples for the *next* sequential model to correct. |
| **Learner Weights** | The final prediction (or vote) of each weak model. | **Combine the Votes:** Give more influence to the *more accurate* models in the final ensemble prediction. |
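
As one concrete instance of these two kinds of weights, the update rules of discrete AdaBoost can be written as follows. This assumes labels $y_i \in \{-1, +1\}$ and data weights $w_i$ that sum to 1; the symbols $\varepsilon_m$, $\alpha_m$, $h_m$ and $H$ are notation chosen for this sketch rather than taken from the tables above, and other boosting variants use different rules.

$$
\varepsilon_m = \sum_{i} w_i \,\mathbb{1}[h_m(x_i) \neq y_i], \qquad
\alpha_m = \tfrac{1}{2} \ln\frac{1 - \varepsilon_m}{\varepsilon_m}
$$

$$
w_i \leftarrow \frac{w_i \exp(-\alpha_m y_i h_m(x_i))}{\sum_j w_j \exp(-\alpha_m y_j h_m(x_j))}, \qquad
H(x) = \operatorname{sign}\Big(\sum_m \alpha_m h_m(x)\Big)
$$

Here $\alpha_m$ is the learner weight: it is positive whenever $\varepsilon_m < 0.5$ and grows as the weighted error shrinks. The exponential factor is the data-weight update: a misclassified point ($y_i h_m(x_i) = -1$) is multiplied by $e^{\alpha_m}$ and a correctly classified one by $e^{-\alpha_m}$ before renormalization.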

### 2. Boosting vs. Stacking/Blending

Stacking and blending are sometimes described as forms of boosting, but that is **incorrect**. They are distinct ensemble methods:

* **Boosting:** **Sequential** ensemble method where models are trained one after another, correcting previous errors. (e.g., AdaBoost, Gradient Boosting, XGBoost).
* **Bagging:** **Parallel** ensemble method where models are trained independently on different bootstrap samples of the data. (e.g., Random Forest).
* **Stacking/Blending:** **Layered** ensemble method where multiple diverse models are trained first, and then a final meta-model (or blender) is trained to combine their predictions.
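
To make the three families concrete, here is a minimal sketch that fits one representative of each with scikit-learn. The dataset (`load_breast_cancer`), the base models inside the stack, and all hyperparameters are illustrative assumptions, and the printed scores are not meant as a benchmark.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

models = {
    # Sequential: each stump is fit on data re-weighted by the previous ones.
    "boosting (AdaBoost)": AdaBoostClassifier(n_estimators=50, random_state=0),
    # Parallel: independent trees on bootstrap samples.
    "bagging (Random Forest)": RandomForestClassifier(n_estimators=50, random_state=0),
    # Layered: a logistic-regression meta-model combines diverse base models.
    "stacking": StackingClassifier(
        estimators=[
            ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
            ("knn", KNeighborsClassifier(n_neighbors=5)),
        ],
        final_estimator=LogisticRegression(),
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.3f}")
```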

## Adaptive Boosting

**AdaBoost** (Adaptive Boosting) is a foundational and highly effective boosting algorithm. Its primary goal is to **reduce bias** by training a sequence of simple, weak models that focus on correcting the errors made by their predecessors.

***

## The Iterative Training Process

AdaBoost relies on **re-weighting** both the data and the models at each step of the process:

1.  **Initialization:** All observations (data points) in the training set are assigned an **equal weight**.
2.  **Build Weak Learner:** A simple predictive model, often a decision tree with maximum depth of one (a **decision stump**), is built on the currently weighted dataset.
3.  **Re-Weight Data (Prioritize Errors):** The algorithm identifies all **misclassified observations**. The weights of these misclassified points are **increased**, making them more influential in the training of the next model. Conversely, the weights of correctly classified points are decreased.
4.  **Calculate Learner Weight (Reward Accuracy):** The current weak model is assigned a **vote weight** based on its weighted error rate on the training data. Models with lower error are given a **larger weight** in the final ensemble prediction.
5.  **Iteration:** Steps 2 through 4 are **repeated** until a predefined maximum number of estimators is reached, or until the overall error rate stabilizes or falls below a certain threshold.
6.  **Aggregation:** The final prediction is a **weighted majority vote** of all the weak learners, using the weights calculated in Step 4.
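
The six steps above can be sketched from scratch in a few lines. This is a simplified discrete AdaBoost for labels in {-1, +1}, written for illustration only: the helper names `fit_adaboost` and `predict_adaboost`, the number of rounds, and the stopping rule are assumptions of this sketch, and it is not how scikit-learn implements `AdaBoostClassifier` internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_adaboost(X, y, n_rounds=20):
    """Discrete AdaBoost for labels in {-1, +1}; returns (stumps, alphas)."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # Step 1: equal weights
    stumps, alphas = [], []
    for _ in range(n_rounds):                    # Step 5: iterate
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)         # Step 2: fit on weighted data
        pred = stump.predict(X)
        err = np.sum(w * (pred != y))            # weighted error (w sums to 1)
        if err >= 0.5:                           # no better than chance: stop
            break
        err = max(err, 1e-10)                    # avoid division by zero
        alpha = 0.5 * np.log((1.0 - err) / err)  # Step 4: vote weight
        w = w * np.exp(-alpha * y * pred)        # Step 3: up-weight mistakes
        w /= w.sum()                             # renormalize to sum to 1
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas


def predict_adaboost(stumps, alphas, X):
    """Step 6: sign of the alpha-weighted sum of stump votes."""
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.where(votes >= 0, 1, -1)


X, y01 = make_classification(n_samples=1000, n_features=4,
                             n_informative=2, n_redundant=0,
                             random_state=0, shuffle=False)
y = 2 * y01 - 1                                  # map {0, 1} -> {-1, +1}
stumps, alphas = fit_adaboost(X, y)
train_acc = np.mean(predict_adaboost(stumps, alphas, X) == y)
print(f"{len(stumps)} stumps, training accuracy = {train_acc:.3f}")
```

Multiplying by `exp(-alpha * y * pred)` handles both directions of the re-weighting at once: the exponent is positive for misclassified points and negative for correctly classified ones.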

***

## Key Characteristics

* **Weak Learners:** AdaBoost uses simple learners, typically a **decision stump** (a decision tree with a $\text{max\_depth}$ of $1$). These models are **high-bias** and **low-variance**, which helps prevent the overall powerful ensemble from becoming too complex and overfitting.
* **Focus on Bias:** The aggressive re-weighting of data effectively forces the ensemble to specialize in the samples that are difficult to classify, rapidly driving down the systematic error, or **bias**, of the model.

```python
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification

# Synthetic binary classification data: 4 features, 2 of them informative
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

# The default base estimator is a decision stump (max_depth=1)
clf = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)
clf.predict([X[0]])
clf.score(X, y)  # mean accuracy on the training data
```

```
0.983
```
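
To see how the training accuracy evolves as learners are added, one option is `staged_score`, which yields the ensemble's score after each boosting round. The sketch below reuses the same synthetic dataset; the checkpoints printed are an arbitrary choice, and the exact values depend on the scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
clf = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)

# One accuracy value per boosting round (fewer than 100 if boosting stops early).
staged_acc = list(clf.staged_score(X, y))

checkpoints = [n for n in (1, 10, 50, 100) if n <= len(staged_acc)]
for n_rounds in checkpoints:
    print(f"{n_rounds:3d} learners: training accuracy = {staged_acc[n_rounds - 1]:.3f}")
```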



## Gradient Boosting

**Gradient Boosting** is a sequential ensemble technique that builds models (typically decision trees) not on the target variable ($y$) directly, but on the **residual errors** (or "pseudo-residuals") of the previous combined predictions. Its primary goal is to minimize a loss function by stepping along the negative gradient of that loss, much like gradient descent in optimization.
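
In symbols, and assuming a differentiable loss $L(y, F(x))$, the pseudo-residual for sample $i$ at round $m$ and the resulting ensemble update can be sketched as follows. The names $L$ and $\nu$ (the learning rate) are notation introduced here; $F_{m-1}$, $h_m$ and $r_i$ follow the usage in this section.

$$
r_i = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}, \qquad
F_m(x) = F_{m-1}(x) + \nu \, h_m(x)
$$

For squared-error loss $L = \tfrac{1}{2}(y - F(x))^2$ this reduces to $r_i = y_i - F_{m-1}(x_i)$, the ordinary residual, which is why "residual" and "negative gradient" are often used interchangeably in this context.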

### The Iterative Process

1.  **Initial Model Deployment:** An initial, simple model ($F_0(x)$), often a constant value (like the mean of the target variable), is used to make predictions across the entire dataset.
2.  **Calculate Residuals (Errors):** The algorithm calculates the **residual error** (or, more generally, the negative loss gradient) between the true values ($y$) and the current ensemble's predictions ($F_{m-1}(x)$). For squared-error loss the residual equals the negative gradient of the loss with respect to the prediction, that is, the direction of steepest descent.
3.  **Train a New Weak Learner:** A new weak learner ($h_m(x)$), usually a shallow decision tree, is trained. However, it's trained to predict the **residuals ($r_i$)** from the previous step, *not* the original target variable ($y$).
4.  **Merge Predictions (Update Ensemble):** The new weak learner's prediction ($h_m(x)$) is scaled by a learning rate and then **added** to the previous ensemble's prediction ($F_{m-1}(x)$) to create the new, improved ensemble prediction ($F_m(x)$).
   
5.  **Compute New Residuals:** New residuals are computed based on the improved predictions from the updated ensemble $F_m(x)$.
6.  **Repeat:** The process is repeated iteratively until the error function stops improving, or a pre-defined limit of estimators is reached.
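
The loop above can be sketched from scratch for regression with squared-error loss, where the pseudo-residuals are simply `y - prediction`. The toy sine data, the tree depth, the learning rate, and the number of rounds are assumptions made for this illustration; scikit-learn's `GradientBoostingRegressor` is considerably more elaborate.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem (illustrative choice).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
n_rounds = 100

f0 = y.mean()                                   # Step 1: constant initial model F_0
prediction = np.full_like(y, f0)
trees, mse_history = [], []
for _ in range(n_rounds):                       # Step 6: repeat
    residuals = y - prediction                  # Steps 2 and 5: current residuals
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # Step 3
    prediction += learning_rate * tree.predict(X)                # Step 4
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))


def predict(X_new):
    """F_M(x) = F_0 + learning_rate * sum of the trees' predictions."""
    return f0 + learning_rate * sum(t.predict(X_new) for t in trees)


print(f"MSE after 1 round:  {mse_history[0]:.4f}")
print(f"MSE after {n_rounds} rounds: {mse_history[-1]:.4f}")
```

Each tree only ever sees the residuals as its target, never `y` itself; the original target enters solely through the residual computation.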

### Key Distinction from AdaBoost

* **AdaBoost:** Adjusts **data weights** to focus the next learner.
* **Gradient Boosting:** Adjusts the **target variable** for the next learner (it sets the target to the residual error), forcing the new model to explicitly learn what the previous ensemble missed.

```python
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification

# Synthetic binary classification data: 4 features, 2 of them informative
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=42, shuffle=False)

# 100 depth-1 trees, each added with a learning rate of 1.0 (no shrinkage)
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
                                 max_depth=1, random_state=42).fit(X, y)
clf.predict([X[0]])
clf.score(X, y)  # mean accuracy on the training data
```

```
0.936
```

## XGBoost (Extreme Gradient Boosting)

* https://www.youtube.com/watch?v=OtD8wVaFm6E
* Fast, optimized implementation of gradient boosting
* Uses regularization techniques (L1 and L2 penalties) to reduce overfitting
* Improves overall performance
* Uses parallel processing
* Customizable optimization objectives and evaluation criteria
* Built-in routine to handle missing values

```python
# Requires: pip install xgboost
# https://xgboost.readthedocs.io/en/stable/parameter.html
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Load the breast cancer dataset into a DataFrame with a 'class' target column
cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['class'] = cancer.target

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('class', axis=1), df['class'], test_size=0.2, random_state=42)

# learning_rate and random_state are the scikit-learn-style names for eta and seed
model = XGBClassifier(booster='gbtree', learning_rate=0.3, max_depth=6,
                      random_state=42).fit(X_train, y_train)
predictions = model.predict(X_test)
print(accuracy_score(y_test, predictions))  # accuracy_score expects (y_true, y_pred)
```

```
0.956140350877193
```


## LightGBM

* https://lightgbm.readthedocs.io/en/stable/
* Supports parallel, distributed, and GPU learning

## CatBoost

* https://catboost.ai/
* https://catboost.ai/en/docs/concepts/python-quickstart

## When to Use Which

* https://neptune.ai/blog/when-to-choose-catboost-over-xgboost-or-lightgbm