# Boosting

- Output: $\hat y$ (-1 or 1)
- Input: $x$
- Learn an ensemble model:
  - Classifiers: $f_1(x), f_2(x), ..., f_T(x)$
  - Weights: $w_1, w_2, ..., w_T$
- Prediction:
  $$
  \hat y = \operatorname{sign}\left(\sum_{t=1}^T w_t f_t(x)\right)
  $$
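
The weighted vote above can be sketched in a few lines of NumPy. This is only an illustration: the three threshold classifiers, their weights, and the helper name `ensemble_predict` are made up for the example, and breaking ties toward +1 is an assumption.

```python
import numpy as np


def ensemble_predict(classifiers, weights, x):
    """Return sign(sum_t w_t * f_t(x)) for a batch of inputs x."""
    votes = np.array([f(x) for f in classifiers])  # shape (T, N), entries in {-1, +1}
    score = weights @ votes                        # weighted sum over the T classifiers
    return np.where(score >= 0, 1, -1)             # ties go to +1 (an assumption)


# three hypothetical threshold classifiers on a scalar input
classifiers = [
    lambda x: np.where(x > 0.0, 1, -1),
    lambda x: np.where(x > 1.0, 1, -1),
    lambda x: np.where(x > 2.0, 1, -1),
]
weights = np.array([0.5, 0.3, 0.4])

x = np.array([-1.0, 0.5, 1.5, 3.0])
y_hat = ensemble_predict(classifiers, weights, x)
print(y_hat)
```

For `x = 0.5` only the first classifier votes +1, so the score is 0.5 - 0.3 - 0.4 = -0.2 and the ensemble predicts -1, even though the highest-weighted classifier disagrees.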

### Gradient Boosting

**Each tree is typically built with 8 to 32 leaves.**

- Input: 
  - Dataset {$(x_1, y_1),(x_2, y_2), ...,(x_n, y_n)$} 
  - Differentiable loss function $L(y_i, F(x))$
  - Learning rate: $\nu$
- Initialize the model with a constant value: 
  $$
  F_0(x) = \mathop{\arg\min}_{\gamma}\sum_{i=1}^nL(y_i, \gamma)
  $$
- for m = 1, ..., M (number of trees, typically $M \geq 100$)
  - Compute the pseudo-residuals $r_{i, m} = - \left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$ for i = 1, 2, ..., n
  - Fit a regression tree to {$(x_1, r_{1, m}), (x_2, r_{2, m}), ..., (x_n, r_{n, m})$}; its leaves define the terminal regions $R_{j,m}$ for $j = 1, ..., J_m$, where $J_m$ is the total number of leaves
  - For each leaf, compute $\gamma_{j,m} = \mathop{\arg\min}_{\gamma}\sum_{x_i \in R_{j, m}}L(y_i, F_{m-1}(x_i) + \gamma)$
  - Update $F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m}\gamma_{j,m}I(x\in R_{j, m})$
- Output $F_M(x)$
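
The loop above can be traced end to end with a deliberately small sketch. It assumes the squared-error loss $\frac12(y - F)^2$, a single scalar feature, and two-leaf trees (stumps) as the weak learner. The helper names `fit_stump`, `predict_stump`, `gradient_boost`, and `predict` are illustrative and not part of the implementation later in this notebook.

```python
import numpy as np


def fit_stump(x, r):
    """Fit a two-leaf regression tree to targets r on a 1-D feature x."""
    levels = np.unique(x)
    thresholds = (levels[:-1] + levels[1:]) / 2  # midpoints between sorted unique values
    best = None
    for t in thresholds:
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, gamma_left, gamma_right = best
    return t, gamma_left, gamma_right


def predict_stump(stump, x):
    t, gamma_left, gamma_right = stump
    return np.where(x <= t, gamma_left, gamma_right)


def gradient_boost(x, y, M=100, nu=0.1):
    F0 = y.mean()  # argmin over gamma of sum 1/2 (y_i - gamma)^2
    F = np.full_like(y, F0, dtype=float)
    stumps = []
    for m in range(M):
        r = y - F                 # pseudo-residuals for the squared-error loss
        stump = fit_stump(x, r)   # for this loss the leaf means are the gamma_{j,m}
        F = F + nu * predict_stump(stump, x)
        stumps.append(stump)
    return F0, stumps


def predict(F0, stumps, x, nu=0.1):
    return F0 + nu * sum(predict_stump(s, x) for s in stumps)


rng = np.random.default_rng(0)
x = np.linspace(0, 6, 80)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)

F0, stumps = gradient_boost(x, y, M=100, nu=0.1)
mse_start = np.mean((y - F0) ** 2)
mse_end = np.mean((y - predict(F0, stumps, x)) ** 2)
print(mse_start, mse_end)
```

With squared error the leaf value $\gamma_{j,m}$ is just the mean residual in the leaf, which is why the stump's own leaf means can be reused directly. Other losses need a separate per-leaf minimization.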

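How the details differ by problem type:
- Regression:
  - Learning rate: $\nu = 0.1$
  - Loss function: $L(y_i, F(x_i)) = \frac12(y_i - F(x_i))^2$, so the pseudo-residual is $r_{i,m} = y_i - F_{m-1}(x_i)$
  - Compute $\gamma_{j,m} = \frac{1}{|R_{j,m}|}\sum_{x_i \in R_{j,m}} r_{i,m}$ (the mean of the residuals in the leaf)
- Classification:
  - Learning rate: $\nu = 0.8$
  - Loss function: $L(y_i, F(x_i)) = -\left(y_i\log p_i + (1-y_i)\log(1-p_i)\right)$ with $y_i \in \{0, 1\}$, where $F(x_i)$ is the predicted log-odds and $p_i = \frac{1}{1 + e^{-F(x_i)}}$ is the predicted probability
  - Compute $\gamma_{j,m}$: there is no closed form, so use a second-order Taylor expansion of the loss
    $$
    L(y_i, F_{m-1}(x_i) + \gamma) \approx L(y_i, F_{m-1}(x_i)) + \frac{\partial L(y_i, F_{m-1}(x_i))}{\partial F}\gamma + \frac12\frac{\partial^2 L(y_i, F_{m-1}(x_i))}{\partial F^2}\gamma^2
    $$
    Minimizing over $\gamma$ and simplifying:
    $$
    \gamma_{j,m} = \frac{\sum_{x_i \in R_{j, m}} r_{i,m}}{\sum_{x_i \in R_{j, m}} p_{i,m-1} (1 - p_{i,m-1})}
    $$
    where $p_{i,m-1}$ is the probability that $F_{m-1}$ predicts for $x_i$.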
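
As a sanity check on the leaf-value formulas, the sketch below compares the Taylor-based $\gamma$ for the log loss against a brute-force grid search over $\gamma$, and evaluates the regression case on a tiny leaf. It assumes that $F$ holds log-odds and that probabilities come from the sigmoid of $F$. The numbers in the leaf are made up for illustration.

```python
import numpy as np


def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))


def log_loss(y, f):
    """Sum of binary log losses for labels y in {0, 1} and log-odds f."""
    p = sigmoid(f)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))


# samples that fall into one hypothetical leaf R_{j,m}
y = np.array([1, 1, 0, 1])
F_prev = np.array([0.2, -0.1, 0.4, 0.0])  # log-odds from F_{m-1}
p_prev = sigmoid(F_prev)

residuals = y - p_prev                    # negative gradient of the log loss
gamma_newton = residuals.sum() / np.sum(p_prev * (1 - p_prev))

# brute-force argmin over a fine grid of gamma values, for comparison
grid = np.linspace(-3, 3, 6001)
losses = np.array([log_loss(y, F_prev + g) for g in grid])
gamma_grid = grid[losses.argmin()]

# regression leaf: the minimizer of sum 1/2 (y - F - gamma)^2 is the mean residual
y_reg = np.array([3.0, 5.0, 4.0])
F_reg = np.array([2.5, 4.0, 4.5])
gamma_reg = np.mean(y_reg - F_reg)

print(gamma_newton, gamma_grid, gamma_reg)
```

The Taylor-based value is a single Newton step, so it approximates the exact minimizer rather than matching it. In this example it still lowers the leaf's loss compared with $\gamma = 0$.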

### Defining the [decision tree](./0-Decision-Tree.ipynb#%E5%86%B3%E7%AD%96%E6%A0%91)

In [1]:
import numpy as np


class Node:
    def __init__(self, left, right, rule):
        self.left = left
        self.right = right
        self.feature = rule[0]
        self.threshold = rule[1]


class Leaf:
    def __init__(self, value):
        self.value = value


class DecisionTree:
    def __init__(
        self,
        classifier=True,
        max_depth=None,
        n_feats=None,
        criterion="entropy",
        seed=None,
    ):
        if seed is not None:
            np.random.seed(seed)

        self.depth = 0
        self.root = None

        self.n_feats = n_feats
        self.criterion = criterion
        self.classifier = classifier
        self.max_depth = max_depth if max_depth else np.inf

        if not classifier and criterion in ["gini", "entropy"]:
            raise ValueError(
                "{} is a valid criterion only when classifier = True.".format(criterion)
            )
        if classifier and criterion == "mse":
            raise ValueError("`mse` is a valid criterion only when classifier = False.")

    def fit(self, X, Y):
        self.n_classes = max(Y) + 1 if self.classifier else None
        self.n_feats = X.shape[1] if not self.n_feats else min(self.n_feats, X.shape[1])
        self.root = self._grow(X, Y)

    def predict(self, X):
        return np.array([self._traverse(x, self.root) for x in X])

    def predict_class_probs(self, X):
        assert self.classifier, "`predict_class_probs` undefined for classifier = False"
        return np.array([self._traverse(x, self.root, prob=True) for x in X])

    def _grow(self, X, Y, cur_depth=0):
        # if all labels are the same, return a leaf
        if len(set(Y)) == 1:
            if self.classifier:
                prob = np.zeros(self.n_classes)
                prob[Y[0]] = 1.0
            return Leaf(prob) if self.classifier else Leaf(Y[0])

        # if this node sits at max_depth, return a leaf
        if cur_depth >= self.max_depth:
            v = np.mean(Y, axis=0)
            if self.classifier:
                v = np.bincount(Y, minlength=self.n_classes) / len(Y)
            return Leaf(v)

        N, M = X.shape
        # track depth per branch; a shared counter would cap the total number of splits
        cur_depth += 1
        self.depth = max(self.depth, cur_depth)
        feat_idxs = np.random.choice(M, self.n_feats, replace=False)

        # greedily select the best split according to `criterion`
        feat, thresh = self._segment(X, Y, feat_idxs)
        l = np.argwhere(X[:, feat] <= thresh).flatten()
        r = np.argwhere(X[:, feat] > thresh).flatten()

        # grow the children that result from the split
        left = self._grow(X[l, :], Y[l], cur_depth)
        right = self._grow(X[r, :], Y[r], cur_depth)
        return Node(left, right, (feat, thresh))

    def _segment(self, X, Y, feat_idxs):
        best_gain = -np.inf
        split_idx, split_thresh = None, None
        for i in feat_idxs:
            vals = X[:, i]
            levels = np.unique(vals)
            thresholds = (levels[:-1] + levels[1:]) / 2
            gains = np.array([self._impurity_gain(Y, t, vals) for t in thresholds])

            if gains.max() > best_gain:
                split_idx = i
                best_gain = gains.max()
                split_thresh = thresholds[gains.argmax()]

        return split_idx, split_thresh

    def _impurity_gain(self, Y, split_thresh, feat_values):
        if self.criterion == "entropy":
            loss = entropy
        elif self.criterion == "gini":
            loss = gini
        elif self.criterion == "mse":
            loss = mse

        parent_loss = loss(Y)

        # generate split
        left = np.argwhere(feat_values <= split_thresh).flatten()
        right = np.argwhere(feat_values > split_thresh).flatten()

        if len(left) == 0 or len(right) == 0:
            return 0

        # compute the weighted avg. of the loss for the children
        n = len(Y)
        n_l, n_r = len(left), len(right)
        e_l, e_r = loss(Y[left]), loss(Y[right])
        child_loss = (n_l / n) * e_l + (n_r / n) * e_r

        # impurity gain is difference in loss before vs. after split
        ig = parent_loss - child_loss
        return ig

    def _traverse(self, X, node, prob=False):
        if isinstance(node, Leaf):
            if self.classifier:
                return node.value if prob else node.value.argmax()
            return node.value
        if X[node.feature] <= node.threshold:
            return self._traverse(X, node.left, prob)
        return self._traverse(X, node.right, prob)


def mse(y):
    return np.mean((y - np.mean(y)) ** 2)


def entropy(y):
    hist = np.bincount(y)
    ps = hist / np.sum(hist)
    return -np.sum([p * np.log2(p) for p in ps if p > 0])


def gini(y):
    hist = np.bincount(y)
    N = np.sum(hist)
    return 1 - sum([(i / N) ** 2 for i in hist])
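
To make the split search concrete, here is a small self-contained sketch of the same idea on toy data: candidate thresholds are the midpoints between sorted unique feature values, and the chosen split is the one with the largest impurity gain. The standalone `impurity_gain` function and the data are illustrative only. They mirror the `mse` criterion of the tree but do not use the class itself.

```python
import numpy as np


def mse(y):
    return np.mean((y - np.mean(y)) ** 2)


def impurity_gain(y, feat_values, thresh, loss=mse):
    """Parent loss minus the size-weighted average loss of the two children."""
    left, right = y[feat_values <= thresh], y[feat_values > thresh]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    n = len(y)
    child_loss = len(left) / n * loss(left) + len(right) / n * loss(right)
    return loss(y) - child_loss


# toy 1-D regression data with an obvious jump between x = 3 and x = 10
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0])
y = np.array([1.0, 1.2, 0.8, 5.0, 5.2])

levels = np.unique(x)
thresholds = (levels[:-1] + levels[1:]) / 2  # midpoints: 1.5, 2.5, 6.5, 10.5
gains = np.array([impurity_gain(y, x, t) for t in thresholds])
best_thresh = thresholds[gains.argmax()]
print(best_thresh)
```

The threshold 6.5 wins because it separates the low targets from the high ones, leaving both children with very small variance.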

### Loss functions

In [2]:
import numpy as np

#######################################################################
#                           Base Estimators                           #
#######################################################################


class ClassProbEstimator:
    def fit(self, X, y):
        self.class_prob = y.sum() / len(y)

    def predict(self, X):
        pred = np.empty(X.shape[0], dtype=np.float64)
        pred.fill(self.class_prob)
        return pred


class MeanBaseEstimator:
    def fit(self, X, y):
        self.avg = np.mean(y)

    def predict(self, X):
        pred = np.empty(X.shape[0], dtype=np.float64)
        pred.fill(self.avg)
        return pred


#######################################################################
#                           Loss Functions                            #
#######################################################################


class MSELoss:
    def __call__(self, y, y_pred):
        return np.mean((y - y_pred) ** 2)

    def base_estimator(self):
        return MeanBaseEstimator()

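    def grad(self, y, y_pred):
        # gradient of the mean squared error wrt y_pred (note the 1 / N factor)
        return -2 / len(y) * (y - y_pred)

    def line_search(self, y, y_pred, h_pred):
        # closed-form step w minimizing sum((y - y_pred - w * h_pred) ** 2)
        Lp = np.sum((y - y_pred) * h_pred)
        Lpp = np.sum(h_pred * h_pred)

        # h_pred is all zeros (e.g. the residuals are already fit perfectly),
        # so any step gives the same prediction; use 1
        return 1 if Lpp == 0 else Lp / Lpp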


class CrossEntropyLoss:
    def __call__(self, y, y_pred):
        eps = np.finfo(float).eps
        return -np.sum(y * np.log(y_pred + eps))

    def base_estimator(self):
        return ClassProbEstimator()

    def grad(self, y, y_pred):
        eps = np.finfo(float).eps
        return -y * 1 / (y_pred + eps)

    def line_search(self, y, y_pred, h_pred):
        raise NotImplementedError
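
The MSE gradient and the closed-form line search can be checked numerically. The sketch below re-declares both as standalone functions (so it runs on its own), compares the analytic gradient against central finite differences, and shows the step that the line search picks when a learner reproduces the negative gradient exactly. The random data is made up for the check.

```python
import numpy as np


def mse_loss(y, y_pred):
    return np.mean((y - y_pred) ** 2)


def mse_grad(y, y_pred):
    return -2 / len(y) * (y - y_pred)


def line_search(y, y_pred, h_pred):
    """Step w minimizing sum((y - y_pred - w * h_pred) ** 2)."""
    return np.sum((y - y_pred) * h_pred) / np.sum(h_pred * h_pred)


rng = np.random.default_rng(0)
y = rng.normal(size=6)
y_pred = rng.normal(size=6)

# central finite differences, one coordinate of y_pred at a time
eps = 1e-6
numeric = np.zeros_like(y_pred)
for i in range(len(y_pred)):
    bump = np.zeros_like(y_pred)
    bump[i] = eps
    numeric[i] = (mse_loss(y, y_pred + bump) - mse_loss(y, y_pred - bump)) / (2 * eps)
analytic = mse_grad(y, y_pred)

# a learner that predicts the negative gradient exactly gets step N / 2
h_pred = -analytic
step = line_search(y, y_pred, h_pred)
print(step)
```

Because the gradient carries a `1 / N` factor, a constant step of 1 moves the prediction only a fraction `2 / N` of the way toward the targets. The line search compensates with a step of `N / 2`, which here recovers `y` exactly.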


### Algorithm implementation

In [3]:
def to_one_hot(labels, n_classes=None):
    if labels.ndim > 1:
        raise ValueError("labels must have dimension 1, but got {}".format(labels.ndim))

    N = labels.size
    n_cols = np.max(labels) + 1 if n_classes is None else n_classes
    one_hot = np.zeros((N, n_cols))
    one_hot[np.arange(N), labels] = 1.0
    return one_hot


class GradientBoostedDecisionTree:
    def __init__(
        self,
        n_iter,
        max_depth=None,
        classifier=True,
        learning_rate=1,
        loss="crossentropy",
        step_size="constant",
    ):
        r"""
        A gradient boosted ensemble of decision trees.

        Notes
        -----
        Gradient boosted machines (GBMs) fit an ensemble of `m` weak learners such that:

        .. math::

            f_m(X) = b(X) + \eta w_1 g_1 + \ldots + \eta w_m g_m

        where `b` is a fixed initial estimate for the targets, :math:`\eta` is
        a learning rate parameter, and :math:`w_{\cdot}` and :math:`g_{\cdot}`
        denote the weights and learner predictions for subsequent fits.

        We fit each `w` and `g` iteratively using a greedy strategy so that at each
        iteration `i`,

        .. math::

            w_i, g_i = \arg \min_{w_i, g_i} L(Y, f_{i-1}(X) + w_i g_i)

        On each iteration we fit a new weak learner to predict the negative
        gradient of the loss with respect to the previous prediction, :math:`f_{i-1}(X)`.
        We then scale the predictions of this weak learner, :math:`g_i`, by a
        weight, :math:`w_i`, and the learning rate, :math:`\eta`, to compute the
        amount to adjust the predictions of our model at the previous iteration,
        :math:`f_{i-1}(X)`:

        .. math::

            f_i(X) := f_{i-1}(X) + \eta w_i g_i

        Parameters
        ----------
        n_iter : int
            The number of iterations / weak estimators to use when fitting each
            dimension / class of `Y`.
        max_depth : int
            The maximum depth of each decision tree weak estimator. Default is
            None.
        classifier : bool
            Whether `Y` contains class labels or real-valued targets. Default
            is True.
        learning_rate : float
            Value in [0, 1] controlling the amount each weak estimator
            contributes to the overall model prediction. Sometimes known as the
            `shrinkage parameter` in the GBM literature. Default is 1.
        loss : {'crossentropy', 'mse'}
            The loss to optimize for the GBM. Default is 'crossentropy'.
        step_size : {"constant", "adaptive"}
            How to choose the weight for each weak learner. If "constant", use
            a fixed weight of 1 for each learner. If "adaptive", use a step
            size computed via line-search on the current iteration's loss
            (only implemented for the 'mse' loss). Default is 'constant'.
        """
        self.loss = loss
        self.weights = None
        self.learners = None
        self.out_dims = None
        self.n_iter = n_iter
        self.base_estimator = None
        self.max_depth = max_depth
        self.step_size = step_size
        self.classifier = classifier
        self.learning_rate = learning_rate

    def fit(self, X, Y):
        """
        Fit the gradient boosted decision trees on a dataset.

        Parameters
        ----------
        X : :py:class:`ndarray <numpy.ndarray>` of shape (N, M)
            The training data of `N` examples, each with `M` features
        Y : :py:class:`ndarray <numpy.ndarray>` of shape (N,)
            An array of integer class labels for each example in `X` if
            ``self.classifier = True``, otherwise the set of target values for
            each example in `X`.
        """
        if self.loss == "mse":
            loss = MSELoss()
        elif self.loss == "crossentropy":
            loss = CrossEntropyLoss()
        else:
            raise ValueError("Unrecognized loss: {}".format(self.loss))

        # convert Y to one_hot if not already
        if self.classifier:
            Y = to_one_hot(Y.flatten())
        else:
            Y = Y.reshape(-1, 1) if len(Y.shape) == 1 else Y

        N, M = X.shape
        self.out_dims = Y.shape[1]
        self.learners = np.empty((self.n_iter, self.out_dims), dtype=object)
        self.weights = np.ones((self.n_iter, self.out_dims))
        self.weights[1:, :] *= self.learning_rate

        # fit the base estimator
        Y_pred = np.zeros((N, self.out_dims))
        for k in range(self.out_dims):
            t = loss.base_estimator()
            t.fit(X, Y[:, k])
            Y_pred[:, k] += t.predict(X)
            self.learners[0, k] = t

        # incrementally fit each learner on the negative gradient of the loss
        # wrt the previous fit (pseudo-residuals)
        for i in range(1, self.n_iter):
            for k in range(self.out_dims):
                y, y_pred = Y[:, k], Y_pred[:, k]
                neg_grad = -1 * loss.grad(y, y_pred)

                # use MSE as the surrogate loss when fitting to negative gradients
                t = DecisionTree(
                    classifier=False, max_depth=self.max_depth, criterion="mse"
                )

                # fit current learner to negative gradients
                t.fit(X, neg_grad)
                self.learners[i, k] = t

                # compute step size and weight for the current learner
                step = 1.0
                h_pred = t.predict(X)
                if self.step_size == "adaptive":
                    step = loss.line_search(y, y_pred, h_pred)

                # update weights and our overall prediction for Y
                self.weights[i, k] *= step
                Y_pred[:, k] += self.weights[i, k] * h_pred

    def predict(self, X):
        """
        Use the trained model to classify or predict the examples in `X`.

        Parameters
        ----------
        X : :py:class:`ndarray <numpy.ndarray>` of shape `(N, M)`
            The data to predict on: `N` examples, each with `M` features

        Returns
        -------
        preds : :py:class:`ndarray <numpy.ndarray>` of shape `(N,)`
            The integer class labels predicted for each example in `X` if
            ``self.classifier = True``, otherwise the predicted target values.
        """
        Y_pred = np.zeros((X.shape[0], self.out_dims))
        for i in range(self.n_iter):
            for k in range(self.out_dims):
                Y_pred[:, k] += self.weights[i, k] * self.learners[i, k].predict(X)

        if self.classifier:
            Y_pred = Y_pred.argmax(axis=1)

        return Y_pred
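
The bookkeeping in `fit` and `predict` (one column of learners per class, a weight matrix whose first row belongs to the base estimator, and an argmax over accumulated scores) can be looked at in isolation. The sketch below uses a hypothetical `ConstantLearner` stand-in with made-up outputs instead of fitted trees, so it shows only the aggregation, not the training.

```python
import numpy as np


def to_one_hot(labels, n_classes=None):
    n_cols = np.max(labels) + 1 if n_classes is None else n_classes
    one_hot = np.zeros((labels.size, n_cols))
    one_hot[np.arange(labels.size), labels] = 1.0
    return one_hot


class ConstantLearner:
    """Hypothetical stand-in for a fitted weak learner: always predicts `value`."""

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(X.shape[0], self.value)


n_iter, out_dims, learning_rate = 3, 2, 0.5
Y = to_one_hot(np.array([0, 1, 1]))  # one column per class

# row 0 is the base estimator (weight 1); later rows are scaled by the learning rate
weights = np.ones((n_iter, out_dims))
weights[1:, :] *= learning_rate

learners = np.empty((n_iter, out_dims), dtype=object)
values = [[0.4, 0.6], [0.2, -0.2], [0.4, -0.4]]  # made-up per-learner outputs
for i in range(n_iter):
    for k in range(out_dims):
        learners[i, k] = ConstantLearner(values[i][k])

X = np.zeros((4, 5))  # 4 examples, 5 features (contents unused by the stand-ins)
Y_pred = np.zeros((X.shape[0], out_dims))
for i in range(n_iter):
    for k in range(out_dims):
        Y_pred[:, k] += weights[i, k] * learners[i, k].predict(X)

labels = Y_pred.argmax(axis=1)  # class with the largest accumulated score
print(Y_pred[0], labels)
```

Class 0 accumulates 0.4 + 0.5 * 0.2 + 0.5 * 0.4 = 0.7 and class 1 accumulates 0.6 - 0.1 - 0.2 = 0.3, so every example is assigned class 0 even though the base estimator alone favored class 1.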
