# Gradient boosted trees

## References

* [Friedman 2001, Greedy Function Approximation: A Gradient Boosting Machine](https://www.jstor.org/stable/2699986)

## The core algorithm

### Training

1. select a `loss(y_true, y_pred)`, e.g. least squares for regression
2. make a constant baseline estimate of `y_pred`, e.g. the average of `y_true`
3. iterate:
    * calculate the gap `gap(y_pred, y_true)`, i.e. the negative gradient of the loss with respect to `y_pred`, to obtain a new `y_true` that only contains the bits we got wrong so far
    * fit a model (tree) to predict the new `y_true`, using loss-optimal leaf weights
    * store the model and add its predictions to `y_pred`
    * stop once the new `y_true` is pretty much all zero
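
The loop above can be sketched in plain Python for the least squares case, where the gap is simply the residual `y_true - y_pred`. This is a minimal illustration, not the implementation in `random_tree_models`: the helpers `fit_stump`, `fit_boosting` and `predict` are hypothetical names, each "tree" is a single split on one feature, and the tolerance `tol` is an assumed stopping threshold.

```python
from statistics import mean


def fit_stump(x, target):
    """Fit a one-split regression tree (stump) to `target` by least squares."""
    best = None
    # candidate thresholds: every distinct x except the largest,
    # so that both sides of the split are non-empty
    for threshold in sorted(set(x))[:-1]:
        left = [t for xi, t in zip(x, target) if xi <= threshold]
        right = [t for xi, t in zip(x, target) if xi > threshold]
        left_value, right_value = mean(left), mean(right)
        sse = sum((t - left_value) ** 2 for t in left)
        sse += sum((t - right_value) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda xi: left_value if xi <= threshold else right_value


def fit_boosting(x, y, n_trees=50, tol=1e-3):
    baseline = mean(y)  # constant baseline estimate
    y_pred = [baseline] * len(y)
    trees = []
    for _ in range(n_trees):
        # for least squares the gap is the residual
        gap = [yi - pi for yi, pi in zip(y, y_pred)]
        if max(abs(g) for g in gap) < tol:
            break  # nothing left to correct
        tree = fit_stump(x, gap)
        trees.append(tree)  # store the model
        y_pred = [pi + tree(xi) for xi, pi in zip(x, y_pred)]
    return baseline, trees


def predict(baseline, trees, x):
    return [baseline + sum(tree(xi) for tree in trees) for xi in x]


x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 9.0]
baseline, trees = fit_boosting(x, y)
y_hat = predict(baseline, trees, x)
```

Each stump only removes part of the remaining gap, so several boosts are needed before the residuals fall below `tol`.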


So at the end you have a baseline estimate and a bunch of models / boosts correcting the prediction for each observation. 

For a more formal version see [_Algorithm 1 Gradient_Boost_ in Friedman 2001](https://www.jstor.org/stable/2699986).

In `nbs/xgboost.ipynb` the math is spelled out in a bit more detail, using the notation of the paper [_XGBoost: A Scalable Tree Boosting System_ by Chen and Guestrin 2016](http://arxiv.org/abs/1603.02754).

### Inference

1. collect constant baseline estimate
2. compute corrections with each tree / boost
3. sum baseline estimate and model predictions
4. (transform above sum, e.g. for binary classification)
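
The four inference steps can be sketched as one small function. The names here are hypothetical: `predict` is not the library's API, the lambdas stand in for fitted trees, and `transform` is an assumed optional hook for step 4.

```python
import math


def predict(baseline, trees, x, transform=None):
    """Sum the baseline and every tree's correction, then optionally transform."""
    raw = baseline + sum(tree(x) for tree in trees)
    return raw if transform is None else transform(raw)


# hypothetical stand-ins for fitted trees: each maps an observation to a correction
trees = [
    lambda x: 0.5 if x > 0.0 else -0.5,
    lambda x: 0.2 if x > 1.0 else -0.1,
]
baseline = 0.1

# regression style: no transform, just 0.1 + 0.5 + 0.2
raw = predict(baseline, trees, 2.0)

# binary classification style: squash the sum into the range 0 to 1
prob = predict(
    baseline, trees, 2.0, transform=lambda f: 1.0 / (1.0 + math.exp(-2.0 * f))
)
```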

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import sklearn.datasets as sk_datasets

from random_tree_models.decisiontree.visualize import show_tree
import random_tree_models.gradientboostedtrees as gbtree
from random_tree_models.scoring import MetricNames

In [None]:
rng = np.random.RandomState(42)

## Classification

Brace yourself. We are looking at Algorithm 5 (L2_TreeBoost), the two-class algorithm in Friedman 2001, Greedy Function Approximation: A Gradient Boosting Machine.

To use boosting for classification, Friedman maps a binary target `y` to -1 and 1 and uses the negative binomial log-likelihood as the loss. This induces a freaky `dy` in a continuous space that is confined neither to the values -1 and 1 nor to the range 0 to 1.

The baseline estimate is: $0.5 \log \frac{P(y=1)}{P(y=-1)} = 0.5 \log \frac{\text{mean}(y==1)}{\text{mean}(y==-1)}$
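
As a small worked example of this formula, the sketch below computes the baseline from a list of labels. The function name `baseline_estimate` is hypothetical, and the labels are assumed to be already mapped to -1 and 1 with both classes present.

```python
import math


def baseline_estimate(y):
    """Half the log-odds of the positive class; y holds labels in {-1, 1}."""
    p_pos = sum(1 for yi in y if yi == 1) / len(y)
    p_neg = 1.0 - p_pos
    return 0.5 * math.log(p_pos / p_neg)


y = [1, 1, 1, -1]  # three positives, one negative
f0 = baseline_estimate(y)  # 0.5 * log(0.75 / 0.25) = 0.5 * log(3)
f0_balanced = baseline_estimate([1, -1])  # balanced classes give a baseline of 0
```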

The used negative binomial log-likelihood loss is: $\text{loss} = \log\left(1+\text{exp}(-2 \cdot y \cdot \text{estimate})\right)$

Hence the target for the next model is the negative derivative of the loss with respect to the estimate of each observation: $dy = -\frac{d\text{loss}}{d\text{estimate}} = \frac{2 \cdot y}{1 + \exp(2 \cdot y \cdot \text{estimate})}$

To clarify: $y$ is -1 or 1. The (baseline) $\text{estimate}$ is something between $-\infty$ and $\infty$, while $dy$ is a continuous value between -2 and 2.
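
A quick numerical check can make the relation between the loss and `dy` concrete. The sketch below compares the closed-form `dy` with a central finite difference of the loss and records the range of `dy`. The helper names and the grid of estimates are illustrative choices.

```python
import math


def loss(y, estimate):
    """Negative binomial log-likelihood for y in {-1, 1}."""
    return math.log(1.0 + math.exp(-2.0 * y * estimate))


def dy(y, estimate):
    """Closed-form target for the next tree."""
    return 2.0 * y / (1.0 + math.exp(2.0 * y * estimate))


def numeric_slope(y, estimate, eps=1e-6):
    """Central finite-difference approximation of d loss / d estimate."""
    return (loss(y, estimate + eps) - loss(y, estimate - eps)) / (2.0 * eps)


estimates = [-3.0, -0.5, 0.0, 0.5, 3.0]
pairs = [(y, f) for y in (-1, 1) for f in estimates]

# dy should equal the negative slope of the loss, so dy + slope should vanish
max_mismatch = max(abs(dy(y, f) + numeric_slope(y, f)) for y, f in pairs)

# largest magnitude of dy on the grid; it never reaches 2
max_abs_dy = max(abs(dy(y, f)) for y, f in pairs)
```

`dy` has the same sign as `y`, is close to zero for observations that are already confidently correct, and approaches a magnitude of 2 for confidently wrong ones.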

This `dy` is what a model is trying to predict and what gets updated for the next model. Since our models here are regression decision trees, each leaf contains an update to the estimate.

To compute the final estimate, add our baseline estimate and the leaf values of all $n$ models we have trained: $\text{estimate} = \text{baseline estimate} + dy_1 + dy_2 + \dots + dy_n$

Then we have to map back to the space of probabilities (0 to 1) for this to be useful, using: $P(y=1) = \frac{1}{1 + \exp(-2 \cdot \text{estimate})}$
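
To illustrate this mapping, the sketch below converts estimates to probabilities and back. Both function names are hypothetical. The inverse is half the log-odds, the same form as the baseline estimate.

```python
import math


def to_probability(estimate):
    """Map an estimate on the real line to P(y=1) between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-2.0 * estimate))


def to_estimate(p):
    """Inverse mapping: half the log-odds of P(y=1)."""
    return 0.5 * math.log(p / (1.0 - p))


p_neutral = to_probability(0.0)  # an estimate of 0 means both classes are equally likely
p_confident = to_probability(5.0)  # a large positive estimate is close to 1
round_trip = to_probability(to_estimate(0.75))  # recovers 0.75 up to rounding
```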

In [None]:
X, y = sk_datasets.make_classification(
    n_samples=1_000,
    n_features=2,
    n_classes=2,
    n_redundant=0,
    class_sep=2,
    random_state=rng,
)
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, alpha=0.3);

In [None]:
model = gbtree.GradientBoostedTreesClassifier(
    measure_name=MetricNames.friedman_binary_classification, max_depth=4
)

In [None]:
model.fit(X, y)

In [None]:
show_tree(model.trees_[0])

In [None]:
y_prob = model.predict_proba(X)
y_prob[:5]

In [None]:
x0 = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
x1 = np.linspace(X[:, 1].min(), X[:, 1].max(), 100)
X0, X1 = np.meshgrid(x0, x1)
X_plot = np.array([X0.ravel(), X1.ravel()]).T

In [None]:
y_prob = model.predict_proba(X_plot)[:, 1]
y_prob[:5]

In [None]:
fig, ax = plt.subplots()
im = ax.pcolormesh(X0, X1, y_prob.reshape(X0.shape), alpha=0.2)
fig.colorbar(im, ax=ax)
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, ax=ax, alpha=0.3)
plt.show()

## Regression

Algorithm 2 (LS_Boost) is used here. Since it is pretty much what is stated in _The core algorithm_ above, an explanation is omitted here.

In [None]:
X, y, coefs = sk_datasets.make_regression(
    n_samples=1_000, n_features=2, n_targets=1, coef=True, random_state=rng
)
sns.scatterplot(x=X[:, 0], y=y, alpha=0.3)

In [None]:
model = gbtree.GradientBoostedTreesRegressor(
    measure_name=MetricNames.variance, max_depth=2
)

In [None]:
model.fit(X, y)

In [None]:
show_tree(model.trees_[0])

In [None]:
x0 = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
x1 = np.linspace(X[:, 1].min(), X[:, 1].max(), 100)
X0, X1 = np.meshgrid(x0, x1)
X_plot = np.array([X0.ravel(), X1.ravel()]).T

In [None]:
y_pred = model.predict(X_plot)
y_pred[:5]

In [None]:
fig, axs = plt.subplots(nrows=2, figsize=(12, 6))

ax = axs[0]
sns.scatterplot(x=X_plot[:, 0], y=y_pred, ax=ax, alpha=0.1, label="prediction")

ax = axs[1]
sns.scatterplot(x=X_plot[:, 1], y=y_pred, ax=ax, alpha=0.1, label="prediction")

plt.tight_layout()

In [None]:
fig, ax = plt.subplots()
im = ax.pcolormesh(X0, X1, y_pred.reshape(X0.shape), alpha=0.2)
fig.colorbar(im, ax=ax)
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, ax=ax, alpha=0.3)
plt.show()

In [None]:
y_pred = model.predict(X)

fig, axs = plt.subplots(nrows=2, figsize=(12, 6))

ax = axs[0]
sns.scatterplot(x=X[:, 0], y=y_pred, ax=ax, alpha=0.1, label="prediction")
sns.scatterplot(x=X[:, 0], y=y, ax=ax, alpha=0.1, label="actual")

ax = axs[1]
sns.scatterplot(x=X[:, 1], y=y_pred, ax=ax, alpha=0.1, label="prediction")
sns.scatterplot(x=X[:, 1], y=y, ax=ax, alpha=0.1, label="actual")

plt.tight_layout()