# Boosting

Boosting (originally called hypothesis boosting) refers to ensemble methods that combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor. Many boosting methods exist, but by far the most popular are **AdaBoost (short for Adaptive Boosting)** and **Gradient Boosting**.

## AdaBoost

AdaBoost is best explained in a classification setting. It aims to create an ensemble classifier $H_T(x) = \sum_{t=1}^T \alpha_t h_t(x)$, where $h_t$ is a weak classifier, $T$ is the number of classifiers in the ensemble, and $\alpha_t$ is the weight of the $t$-th classifier. Note that scikit-learn's `AdaBoostClassifier` uses a decision tree of depth 1 (a decision stump) as its default weak classifier (see the scikit-learn documentation).

The ensemble classifier is built iteratively. In each iteration $t$, a new classifier $h_t$ is trained and used to make predictions on the training set. The "correctness" of the new classifier then determines its weight $\alpha_t$ in the ensemble. We calculate the model weight $\alpha_t$ as follows, where $Acc_t$ is the accuracy of the $t$-th classifier on the training set, measured with respect to the current sample weights (*):

$\alpha_t = \log \Big(\frac{Acc_t}{1-Acc_t} \Big)$

Afterwards, we update the sample weights of the **misclassified samples** by multiplying their current weight by the factor $\frac{Acc_t}{1-Acc_t}$. For a classifier that is better than chance ($Acc_t > 0.5$), this factor is greater than 1. A sample's weight denotes the chance that the sample is drawn from the training set: samples with a higher weight are more likely to be drawn from the training sample "distribution". Because we only increase the weights of the misclassified samples, the next classifier is "guided" to focus on them.

(*) See the section "Log-odds function" below to understand what this formula represents.
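The update described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not scikit-learn's implementation: the function name `boosting_round` and the toy data are hypothetical, $Acc_t$ is assumed to be the weighted accuracy, and the renormalization step is an added assumption so that the weights can still be read as sampling probabilities.

```python
import math

def boosting_round(sample_weights, is_correct):
    """One AdaBoost-style update (hypothetical helper).

    sample_weights: current weights, assumed to sum to 1.
    is_correct: one boolean per sample, True where the classifier was right.
    Assumes the weighted accuracy lies strictly between 0 and 1.
    """
    # Weighted accuracy: total weight of the correctly classified samples.
    acc = sum(w for w, ok in zip(sample_weights, is_correct) if ok)
    factor = acc / (1.0 - acc)   # Acc_t / (1 - Acc_t)
    alpha = math.log(factor)     # model weight alpha_t

    # Scale only the misclassified samples, then renormalize (assumption)
    # so that the weights again sum to 1.
    updated = [w if ok else w * factor
               for w, ok in zip(sample_weights, is_correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Hypothetical toy data: 5 equally weighted samples, the last one misclassified.
weights = [0.2] * 5
correct = [True, True, True, True, False]
alpha, new_weights = boosting_round(weights, correct)
print(alpha, new_weights)
```

In this toy case the single misclassified sample ends up with a much larger share of the total weight than each correctly classified sample, which is what steers the next classifier towards it.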

### Log-odds function

We refer to the function $\log\big(\frac{p(x)}{1-p(x)}\big)$, where $p(x)$ is a probability, as the *log-odds* function.

Note that this function is unbounded: it tends to $-\infty$ as $p(x) \to 0$ and to $+\infty$ as $p(x) \to 1$.

![log_odds_function](imgs/log_odds_function.png)

![Log_odds](imgs/log_odds.png)
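To get a feel for the shape of the curve, the function can be evaluated at a few probabilities. The helper name `log_odds` and the chosen values of `p` are illustrative assumptions. The sketch shows that the value is 0 at `p = 0.5`, that it is mirrored around that point, and that it keeps growing as `p` approaches 1.

```python
import math

def log_odds(p):
    """Log-odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

# The value at 1 - p is the negative of the value at p.
for p in (0.5, 0.9, 0.99, 0.999999):
    print(f"p={p}: log-odds={log_odds(p):.3f}, mirrored={log_odds(1 - p):.3f}")
```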

**Consider the following example:**
    
Let's assume that we have three models that were trained to solve a binary classification problem. The first model is wrong 50% of the time, the second model is wrong 95% of the time, and the third model is right 97% of the time. In other words, the accuracy of the first, second and third model is 0.5, 0.05 and 0.97, respectively.

We want to build a strong model whose prediction is obtained by a weighted vote of the three models. To each model we therefore assign a score that determines how much its vote counts in the final vote. The question is how to assign these scores.

Obviously, the third model is very reliable because it almost always predicts the class correctly, so we assign it a high positive score. Of the other two, the second model is the more useful one: it is almost always wrong, so we can simply invert its predictions to be correct most of the time. We therefore assign it a high negative score. The first model serves no purpose because its predictions are no better than random guessing, so its score should be 0.

This is exactly how the model weight $\alpha_t$ is assigned in AdaBoost. Models that are mostly right (high accuracy) receive a high positive weight, and models that are mostly wrong (low accuracy) receive a high negative weight.
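As a quick check of the example above, the scores of the three models can be computed with the log-odds formula and used in a weighted vote. The sketch assumes that class labels are encoded as +1 and -1 and that a tie is resolved in favour of +1. Both choices, as well as the helper `weighted_vote` and the votes themselves, are illustrative assumptions.

```python
import math

# Accuracies of the three models from the example above.
accuracies = [0.5, 0.05, 0.97]
scores = [math.log(acc / (1.0 - acc)) for acc in accuracies]

def weighted_vote(votes, scores):
    """Combine +1/-1 votes into a final +1/-1 prediction (hypothetical helper)."""
    total = sum(s * v for s, v in zip(scores, votes))
    return 1 if total >= 0 else -1

# Hypothetical case: models 1 and 2 vote +1, model 3 votes -1.
prediction = weighted_vote([+1, +1, -1], scores)
print(scores, prediction)
```

The first model's score is 0, so its vote is ignored. The second model's negative score flips its +1 vote, so it ends up agreeing with the reliable third model, and the final prediction is -1.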

### Example

In [None]:
from sklearn.datasets import make_moons
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Sample points from the moon dataset
x, y = make_moons(n_samples=500, noise=0.30, random_state=42)

In [None]:
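# Train an AdaBoost classifier on the default depth-1 trees; n_estimators=200 is an arbitrary, untuned choice
classifier = AdaBoostClassifier(n_estimators=200, random_state=42).fit(x, y)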

In [None]:
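def plot_decision_boundary(clf, X, y, alpha=1.0):
    """Plot the decision regions of a fitted classifier over the 2-D data X, y."""
    # Plot limits: [x1_min, x1_max, x2_min, x2_max]
    axes = [-1.5, 2.4, -1, 1.5]
    # Evaluate the classifier on a 100 x 100 grid covering the plot area
    x1, x2 = np.meshgrid(np.linspace(axes[0], axes[1], 100),
                         np.linspace(axes[2], axes[3], 100))
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)

    # Filled regions per predicted class, plus the boundary contour
    plt.contourf(x1, x2, y_pred, alpha=0.3 * alpha, cmap='Wistia')
    plt.contour(x1, x2, y_pred, cmap="Greys", alpha=0.8 * alpha)
    colors = ["#78785c", "#c47b27"]
    markers = ("o", "^")

    # Overlay the data points, one color and marker per class
    for idx in (0, 1):
        plt.plot(X[:, 0][y == idx], X[:, 1][y == idx],
                 color=colors[idx], marker=markers[idx], linestyle="none")

    plt.axis(axes)
    plt.xlabel(r"$x_1$")
    plt.ylabel(r"$x_2$", rotation=0)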

In [None]:
fig = plt.figure()
plot_decision_boundary(classifier, x, y)
plt.title("AdaBoost Classifier")
plt.show()

## AdaBoost for Regression

AdaBoost for regression is not discussed here. See https://dafriedman97.github.io/mlbook/content/c6/s2/boosting.html for further details.