# CS-5600/6600 Lecture 8 - Boosting and Stacking

**Instructor: Dylan Zwick**

*Weber State University*

Reference: [Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow](https://www.oreilly.com/library/view/hands-on-machine-learning/9781098125967/) by Aurélien Géron - [Ensemble Learning and Random Forests](https://github.com/ageron/handson-ml3/blob/main/07_ensemble_learning_and_random_forests.ipynb)

Today, we're going to continue along the path we started with the random forest, and investigate some other approaches for turning a "forest of stumps" into a good predictive model.

<center>
  <img src="https://drive.google.com/uc?export=view&id=1DOKeu75laBC7MeLEZGr0mnFWL88hVcR-" alt="Forest of Stumps">
</center>

First, let's grab the libraries we'll want to use:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

from sklearn.ensemble import AdaBoostClassifier

from sklearn.tree import DecisionTreeRegressor

from sklearn.ensemble import GradientBoostingRegressor

from sklearn.ensemble import StackingClassifier

The idea behind a random forest is to take a bunch of weak classifiers - typically shallow decision trees with only a few layers (a tree with just two layers, meaning a single split, is known as a decision "stump") - and combine them into an impressively performant ensemble.

Well, the idea behind "boosting" is similar, except that the trees are not "grown" in parallel, but are instead produced sequentially - each tree attempts to correct the errors of its predecessors. This process is called *boosting*. There are many boosting methods, and today we'll look at two of the most popular - [*AdaBoost*](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) (short for *adaptive boosting*), and *gradient boosting*.

### AdaBoost

The idea behind AdaBoost is that each predictor pays more attention (gives more weight) to the training instances its predecessors got wrong. The basic approach here is:

1. Train a base classifier, and use it to make predictions.
2. Increase the importance of the instances the base classifier got wrong, and train another classifier.
3. Take a weighted combination of these classifiers, weighted by their overall performance. This is our new ensemble.
4. Repeat.

At each stage the ensemble should get better, and the additional classifiers should focus more and more on the "harder" cases.

Let's take a look at how this works for some "moons" data.

In [None]:
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
def plot_2d_data(X, y):
    # Separate the data based on binary labels
    class_0 = X[y == 0]
    class_1 = X[y == 1]

    #Assign colors and markers
    colors = ["#78785c", "#c47b27"]
    markers = ("o", "^")

    # Create a scatter plot
    plt.figure(figsize=(8, 6))
    plt.scatter(class_0[:, 0], class_0[:, 1], color=colors[0], marker=markers[0])
    plt.scatter(class_1[:, 0], class_1[:, 1], color=colors[1], marker=markers[1])

    # Add labels and title
    plt.xlabel(r"$x_1$")
    plt.ylabel(r"$x_2$", rotation=0)
    plt.title('Moons Data')

    # Show the plot
    plt.show()

In [None]:
plot_2d_data(X_train, y_train)

Let's see how this type of boosting approach can work on this moons data. We'll run through five iterations using an SVM classifier with an RBF kernel (don't worry about what this means right now, just understand it's not a decision stump). The second plot uses the same approach, just with a smaller learning rate (which means the incorrect instances aren't boosted as much). Don't concern yourself with the specifics of how this is implemented right now - we'll get to that soon.

In [None]:
def plot_decision_boundary(clf, X, y, alpha=1.0):
    axes=[-1.5, 2.4, -1, 1.5]
    x1, x2 = np.meshgrid(np.linspace(axes[0], axes[1], 100),
                         np.linspace(axes[2], axes[3], 100))
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)

    plt.contourf(x1, x2, y_pred, alpha=0.3 * alpha, cmap='Wistia')
    plt.contour(x1, x2, y_pred, cmap="Greys", alpha=0.8 * alpha)
    colors = ["#78785c", "#c47b27"]
    markers = ("o", "^")
    for idx in (0, 1):
        plt.plot(X[:, 0][y == idx], X[:, 1][y == idx],
                 color=colors[idx], marker=markers[idx], linestyle="none")
    plt.axis(axes)
    plt.xlabel(r"$x_1$")
    plt.ylabel(r"$x_2$", rotation=0)

In [None]:
m = len(X_train)

fig, axes = plt.subplots(ncols=2, figsize=(10, 4), sharey=True)
for subplot, learning_rate in ((0, 1), (1, 0.5)):
    sample_weights = np.ones(m) / m
    plt.sca(axes[subplot])
    for i in range(5):
        svm_clf = SVC(C=0.2, gamma=0.6, random_state=42)
        svm_clf.fit(X_train, y_train, sample_weight=sample_weights * m)
        y_pred = svm_clf.predict(X_train)

        error_weights = sample_weights[y_pred != y_train].sum()
        r = error_weights / sample_weights.sum()
        alpha = learning_rate * np.log((1 - r) / r)
        sample_weights[y_pred != y_train] *= np.exp(alpha)
        sample_weights /= sample_weights.sum()

        plot_decision_boundary(svm_clf, X_train, y_train, alpha=0.4)
        plt.title(f"learning_rate = {learning_rate}")
    if subplot == 0:
        plt.text(-0.75, -0.95, "1", fontsize=16)
        plt.text(-1.05, -0.95, "2", fontsize=16)
        plt.text(1.0, -0.95, "3", fontsize=16)
        plt.text(-1.45, -0.5, "4", fontsize=16)
        plt.text(1.36,  -0.95, "5", fontsize=16)
    else:
        plt.ylabel("")

plt.show()

OK, now let's take a deeper look at the AdaBoost algorithm.

We'll say our instances are indexed by $(i)$, and each instance has a corresponding weight $w^{(i)}$. These weights are *normalized*, which means they all add up to $1$, and initially they're all set to be the same. So, if there are $m$ instances in our data, the initial weights are set to $1/m$.

We train our initial predictor with these initial weights, and compute its weighted error rate on the training data. The weighted error rate of the $j$th predictor, $r_{j}$, is defined as:

<center>
  $\displaystyle r_{j} = \sum_{\substack{i = 1 \\ \hat{y}_{j}^{(i)} \neq y^{(i)}}}^{m} w^{(i)}$ where $\hat{y}_{j}^{(i)}$ is the $j$th predictor's prediction for the $i$th instance.
</center>

The predictor's weight $\alpha_{j}$ is then computed as follows, where $\eta$ is the learning rate hyperparameter:

<center>
  $\displaystyle \alpha_{j} = \eta \log{\left(\frac{1-r_{j}}{r_{j}}\right)}$
</center>

Based on the predictor's weight, the instance weights are then updated according to the update rule:

<center>
  for $i = 1,2, \ldots, m$

  <br>

  $\displaystyle w^{(i)} \leftarrow \left\{\begin{array}{cc} w^{(i)} & \hat{y}_{j}^{(i)} = y^{(i)} \\ w^{(i)}e^{\alpha_{j}} & \hat{y}_{j}^{(i)} \neq y^{(i)}\end{array}\right.$
</center>

Finally, all weights are normalized (divided by $\sum_{i = 1}^{m} w^{(i)}$) and the next predictor is trained.

The algorithm stops when the desired number of predictors is reached, or a perfect predictor has been found.
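To get a feel for these formulas, here's one round of the update worked through on a toy example. The five labels and predictions are made up purely for illustration (they aren't from the moons data), and we assume $\eta = 1$.

```python
import numpy as np

eta = 1.0                              # learning rate (assumed for this toy example)
w = np.full(5, 1 / 5)                  # five instances, equal initial weights 1/m
y_true = np.array([0, 1, 1, 0, 1])     # made-up labels
y_pred = np.array([0, 0, 1, 0, 1])     # made-up predictions: only instance 1 is wrong

wrong = y_pred != y_true
r = w[wrong].sum()                     # weighted error rate r_j
alpha = eta * np.log((1 - r) / r)      # predictor weight alpha_j
w[wrong] *= np.exp(alpha)              # boost only the misclassified instances
w /= w.sum()                           # normalize so the weights sum to 1 again

print(f"r = {r:.3f}, alpha = {alpha:.3f}")
print("updated weights:", w)
```

Here $r = 0.2$ and $\alpha = \log 4 \approx 1.386$, so the single misclassified instance ends up with weight $0.5$ while the other four drop to $0.125$ each. In other words, with $\eta = 1$ the next predictor sees the previous mistakes as half of the total weight.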

But how are these predictions made? Well, AdaBoost simply computes the predictions of all the predictors and weights them using the predictor weights $\alpha_{j}$. The predicted class is the one that receives the majority of the weighted votes.

<center>
  $\displaystyle \hat{y}(\textbf{x}) = \underset{k}{\operatorname{argmax}} \sum_{\substack{j = 1 \\ \hat{y}_{j}(\textbf{x}) = k}}^{N} \alpha_{j}$ where $N$ is the number of predictors.
</center>
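The weighted vote can be sketched in a few lines of NumPy. The predictor weights and predictions below are invented for illustration (three hypothetical predictors, four points), chosen so that the weighted vote and a plain majority vote disagree.

```python
import numpy as np

# Hypothetical predictor weights and class predictions (rows = predictors, columns = points)
alphas = np.array([1.4, 0.6, 0.7])
preds = np.array([[0, 1, 1, 0],
                  [1, 1, 0, 0],
                  [1, 0, 0, 1]])

n_classes = 2
votes = np.zeros((n_classes, preds.shape[1]))
for k in range(n_classes):
    # Each predictor that voted for class k contributes its weight alpha_j
    votes[k] = ((preds == k) * alphas[:, None]).sum(axis=0)

y_hat = votes.argmax(axis=0)                            # weighted vote
majority = (preds.mean(axis=0) > 0.5).astype(int)       # unweighted vote, for comparison

print("weighted vote:", y_hat)
print("simple majority:", majority)
```

On the first and third points the strongest predictor is outvoted two to one, but its weight of 1.4 beats the other two combined (1.3), so the weighted vote follows it.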

[Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) has a multiclass version of AdaBoost called SAMME (for *Stagewise Additive Modeling using a Multiclass Exponential loss function*). When there are just two classes, SAMME is equivalent to AdaBoost. Scikit-Learn has also offered a variant called SAMME.R for predictors that can estimate class probabilities; it relies on those probabilities and usually performs better than using class predictions alone. SAMME.R has been deprecated in recent Scikit-Learn releases, so we'll stick with SAMME here.

Let's put together an AdaBoost classifier using 30 decision "stumps", and fit it to our moons data.

In [None]:
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=30, algorithm="SAMME",
    learning_rate=0.5, random_state=42)
ada_clf.fit(X_train, y_train);

In [None]:
plot_decision_boundary(ada_clf, X_train, y_train)

Not bad. If our ensemble is overfitting the training set, we can try reducing the number of estimators.
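One way to pick the number of estimators, sketched here under a few assumptions, is to look at the accuracy of the partial ensembles on held-out data using `staged_score`. The block rebuilds the moons data under new (hypothetical) variable names so it doesn't overwrite anything above, treats the held-out split as a validation set, and leaves the *algorithm* argument at the library default.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same moons recipe as above, rebuilt under new names so nothing is overwritten
X_m, y_m = make_moons(n_samples=500, noise=0.30, random_state=42)
X_m_fit, X_m_val, y_m_fit, y_m_val = train_test_split(X_m, y_m, random_state=42)

ada_demo = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=30,
    learning_rate=0.5, random_state=42)
ada_demo.fit(X_m_fit, y_m_fit)

# staged_score yields the ensemble's accuracy after 1, 2, ..., n stumps
fit_scores = np.array(list(ada_demo.staged_score(X_m_fit, y_m_fit)))
val_scores = np.array(list(ada_demo.staged_score(X_m_val, y_m_val)))

best_n = int(val_scores.argmax()) + 1
print(f"best held-out accuracy {val_scores.max():.3f} with {best_n} stumps")
```

If the accuracy on the fitting data keeps climbing while the held-out accuracy flattens or drops, that's a sign the extra stumps are overfitting.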

### Gradient Boosting

A variant on the boosting approach is *gradient boosting*. As with AdaBoost, gradient boosting applies a sequence of models that attempt to correct the errors of the previous models. However, instead of tweaking the instance weights at every iteration, gradient boosting tries to fit the new predictor to the *residual errors* made by the previous predictors.

We can go through a simple regression (numeric prediction) example, using decision trees as the base predictors. These are sometimes called *gradient boosting regression trees*, or GBRTs.

First, let's generate some noisy quadratic data that we can fit.

In [None]:
np.random.seed(42)
X = np.random.rand(100,1) - 0.5
y = 3 * X[:,0]**2 + 0.05 * np.random.randn(100)  # y = 3x² + Gaussian noise

tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
# Note: fit() expects X to be a 2-dimensional matrix, which is why we set
# X = np.random.rand(100,1) and not np.random.rand(100)
tree_reg1.fit(X, y);

Alright, this gives us a bunch of predictions, and each prediction will have an error (if it's perfect, the error is $0$). We'll produce the vector of residual errors, and then try to build a model to predict them.

In [None]:
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=43)
tree_reg2.fit(X, y2);

We can do this again...

In [None]:
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=44)
tree_reg3.fit(X, y3);

Finally, we can sum the predictions of our three trees to predict the value of the function at some given points. For example, at the points $-0.4, 0, 0.5$.

In [None]:
X_new = np.array([[-0.4], [0.], [0.5]])
sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

We can take a look at how these models do on predicting our noisy data:

In [None]:
def plot_predictions(regressors, X, y, axes, style,
                     label=None, data_style="b.", data_label=None):
    x1 = np.linspace(axes[0], axes[1], 500)
    y_pred = sum(regressor.predict(x1.reshape(-1, 1))
                 for regressor in regressors)
    plt.plot(X[:, 0], y, data_style, label=data_label)
    plt.plot(x1, y_pred, style, linewidth=2, label=label)
    if label or data_label:
        plt.legend(loc="upper center")
    plt.axis(axes)

plt.figure(figsize=(11, 11))

plt.subplot(3, 2, 1)
plot_predictions([tree_reg1], X, y, axes=[-0.5, 0.5, -0.2, 0.8], style="g-",
                 label="$h_1(x_1)$", data_label="Training set")
plt.ylabel("$y$  ", rotation=0)
plt.title("Residuals and tree predictions")

plt.subplot(3, 2, 2)
plot_predictions([tree_reg1], X, y, axes=[-0.5, 0.5, -0.2, 0.8], style="r-",
                 label="$h(x_1) = h_1(x_1)$", data_label="Training set")
plt.title("Ensemble predictions")

plt.subplot(3, 2, 3)
plot_predictions([tree_reg2], X, y2, axes=[-0.5, 0.5, -0.4, 0.6], style="g-",
                 label="$h_2(x_1)$", data_style="k+",
                 data_label="Residuals: $y - h_1(x_1)$")
plt.ylabel("$y$  ", rotation=0)

plt.subplot(3, 2, 4)
plot_predictions([tree_reg1, tree_reg2], X, y, axes=[-0.5, 0.5, -0.2, 0.8],
                  style="r-", label="$h(x_1) = h_1(x_1) + h_2(x_1)$")

plt.subplot(3, 2, 5)
plot_predictions([tree_reg3], X, y3, axes=[-0.5, 0.5, -0.4, 0.6], style="g-",
                 label="$h_3(x_1)$", data_style="k+",
                 data_label="Residuals: $y - h_1(x_1) - h_2(x_1)$")
plt.xlabel("$x_1$")
plt.ylabel("$y$  ", rotation=0)

plt.subplot(3, 2, 6)
plot_predictions([tree_reg1, tree_reg2, tree_reg3], X, y,
                 axes=[-0.5, 0.5, -0.2, 0.8], style="r-",
                 label="$h(x_1) = h_1(x_1) + h_2(x_1) + h_3(x_1)$")
plt.xlabel("$x_1$")
plt.show()

Scikit-Learn provides a *GradientBoostingRegressor* class so we don't need to do this by hand. For example, the code below will create a predictor equivalent to the one we just built in three steps.

In [None]:
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3,
                                 learning_rate=1.0, random_state=42)
gbrt.fit(X, y);

In [None]:
plot_predictions([gbrt], X, y, axes=[-0.5, 0.5, -0.2, 0.8], style="r-",
                 label="GBRT Predictions", data_label="Training set")
plt.title("GradientBoostingRegressor")
plt.show()
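If you'd like to check how closely the hand-built ensemble and *GradientBoostingRegressor* agree, a rough sketch is below. It generates its own noisy quadratic data under new (hypothetical) names, with a different random generator than above, so the numbers won't match the earlier cells exactly. One thing worth knowing: with the default squared-error loss, *GradientBoostingRegressor* starts from a constant prediction (the mean of the targets) and then adds its trees, so the fitted objects aren't structured the same way even if their predictions line up.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

# Fresh noisy quadratic data (illustrative, not the exact arrays used above)
rng = np.random.default_rng(42)
X_q = rng.random((100, 1)) - 0.5
y_q = 3 * X_q[:, 0] ** 2 + 0.05 * rng.standard_normal(100)

# Manual version: each tree is fit to what the previous trees left unexplained
trees, residual = [], y_q.copy()
for _ in range(3):
    tree = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X_q, residual)
    residual = residual - tree.predict(X_q)
    trees.append(tree)

# Library version with the same depth, number of trees, and learning rate
gbrt_demo = GradientBoostingRegressor(max_depth=2, n_estimators=3,
                                      learning_rate=1.0, random_state=42)
gbrt_demo.fit(X_q, y_q)

X_grid = np.linspace(-0.5, 0.5, 200).reshape(-1, 1)
manual_pred = sum(t.predict(X_grid) for t in trees)
gbrt_pred = gbrt_demo.predict(X_grid)
max_diff = np.abs(manual_pred - gbrt_pred).max()
print(f"largest gap between the two ensembles: {max_diff:.2e}")
```

A gap at the level of floating-point noise means the two ensembles make the same predictions; anything noticeably larger would suggest the trees split differently somewhere.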

Now, instead of specifying the number of estimators, you can just keep adding estimators until your model stops improving. This is usually best to do with a small learning rate (so additional models don't change the overall model much).

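For example, in the code below we allow up to 500 estimators, with a learning rate of 0.05 (so each new tree is only 1/20 as impactful as in our learning rate 1 models above). Setting *n_iter_no_change* to 10 tells Scikit-Learn to hold out part of the training data as a validation set (10% by default, controlled by *validation_fraction*) and to stop adding new models once 10 in a row bring no improvement on it.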

In [None]:
gbrt_best = GradientBoostingRegressor(
    max_depth=2, learning_rate=0.05, n_estimators=500,
    n_iter_no_change=10, random_state=42)
gbrt_best.fit(X, y);

In [None]:
gbrt_best.n_estimators_

So, while it could have included up to 500 models, it stopped at 92. What do we mean by no improvement? Well, an improvement smaller than the *tol* hyperparameter, which defaults to $0.0001$, doesn't count.
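To see the idea behind early stopping more explicitly, here's a sketch that does a simplified version by hand: train all 500 trees, measure the validation error after each stage with `staged_predict`, and see where it bottoms out. This is not exactly what *n_iter_no_change* does (that stops during training once it runs out of patience, whereas this picks the minimum after the fact), and the data and variable names are made up for this example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Illustrative noisy quadratic data, split into training and validation sets
rng = np.random.default_rng(0)
X_es = rng.random((200, 1)) - 0.5
y_es = 3 * X_es[:, 0] ** 2 + 0.05 * rng.standard_normal(200)
X_es_train, X_es_val, y_es_train, y_es_val = train_test_split(
    X_es, y_es, test_size=0.1, random_state=42)

# No early stopping here: all 500 trees get trained
gbrt_full = GradientBoostingRegressor(max_depth=2, learning_rate=0.05,
                                      n_estimators=500, random_state=42)
gbrt_full.fit(X_es_train, y_es_train)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 500 trees
val_errors = [mean_squared_error(y_es_val, pred)
              for pred in gbrt_full.staged_predict(X_es_val)]
best_n_trees = int(np.argmin(val_errors)) + 1
print(f"validation MSE is lowest with {best_n_trees} trees")
```

The built-in early stopping has the advantage of not training the trees you end up throwing away.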

We can graph both the ensemble predictors we've constructed with the code below:

In [None]:
fig, axes = plt.subplots(ncols=2, figsize=(10, 4), sharey=True)

plt.sca(axes[0])
plot_predictions([gbrt], X, y, axes=[-0.5, 0.5, -0.1, 0.8], style="r-",
                 label="Ensemble predictions")
plt.title(f"learning_rate={gbrt.learning_rate}, "
          f"n_estimators={gbrt.n_estimators_}")
plt.xlabel("$x_1$")
plt.ylabel("$y$", rotation=0)

plt.sca(axes[1])
plot_predictions([gbrt_best], X, y, axes=[-0.5, 0.5, -0.1, 0.8], style="r-")
plt.title(f"learning_rate={gbrt_best.learning_rate}, "
          f"n_estimators={gbrt_best.n_estimators_}")
plt.xlabel("$x_1$")

plt.show()

### Stacking

The last ensemble method we'll discuss is *stacking*, which is kind of a meta-ensemble. It's based on a simple idea: instead of using something like hard voting to make the prediction, train a model to perform this aggregation and decide the weights of the various predictors.

Basically, a final model sits on top of all the rest and blends them.
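Here's a rough sketch of what that blending looks like when done by hand. The key trick is to train the blender on *out-of-fold* predictions (via `cross_val_predict`), so it never sees predictions a base model made on data that model was trained on. The logistic regression blender and the variable names are choices made for this illustration only.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.svm import SVC

# Moons data rebuilt under new names so nothing above is overwritten
X_s, y_s = make_moons(n_samples=500, noise=0.30, random_state=42)
X_s_train, X_s_test, y_s_train, y_s_test = train_test_split(X_s, y_s, random_state=42)

base_models = [LogisticRegression(random_state=42),
               RandomForestClassifier(random_state=42),
               SVC(probability=True, random_state=42)]

# Out-of-fold P(class 1) from each base model becomes one input column for the blender
meta_train = np.column_stack([
    cross_val_predict(model, X_s_train, y_s_train, cv=5, method="predict_proba")[:, 1]
    for model in base_models])

# The blender learns how much to trust each base model (assumed: logistic regression)
blender = LogisticRegression().fit(meta_train, y_s_train)

# Refit the base models on the full training set, then blend their test predictions
for model in base_models:
    model.fit(X_s_train, y_s_train)
meta_test = np.column_stack([model.predict_proba(X_s_test)[:, 1] for model in base_models])
blend_score = blender.score(meta_test, y_s_test)
print(f"hand-rolled stacking accuracy: {blend_score:.3f}")
```

With a linear blender like this one, the learned coefficients (`blender.coef_`) give a rough picture of how heavily each base model is weighted.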

Scikit-Learn provides a *StackingClassifier* for doing this. The one below uses three estimators, and then a random forest classifier to blend them.

In [None]:
stacking_clf = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('svc', SVC(probability=True, random_state=42))
    ],
    final_estimator=RandomForestClassifier(random_state=42),
    cv=5  # number of cross-validation folds
)
stacking_clf.fit(X_train, y_train);

How does it do?

In [None]:
stacking_clf.score(X_test, y_test)

This is a bit better than the voting classifier we saw in an earlier lecture, but not much. The difference is 92.8% vs 92%.

### References



* [Original AdaBoost paper](https://www.sciencedirect.com/science/article/pii/S002200009791504X)
* [AdaBoost video](https://youtu.be/LsK-xG1cLYA?si=JT2TELEsFZtC90eI)
* [Gradient boost videos](https://youtu.be/3CC4N4z3GJc?si=n9_38p5GHjKrvp5m)

