# Boosting

- Output: $\hat y$ (-1 or 1)
- Input: $x$
- Learn an ensemble model:
  - Classifiers: $f_1(x), f_2(x), ..., f_T(x)$
  - Weights: $w_1, w_2, ..., w_T$
- Prediction:
  $$
  \hat y = \operatorname{sign}\left(\sum_{t=1}^T w_t f_t(x)\right)
  $$
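The weighted vote above can be sketched in a few lines of Python. The three threshold classifiers and their weights are hypothetical values chosen only for illustration, and breaking a tie (score of exactly 0) toward +1 is an assumption the notes do not specify.

```python
def ensemble_predict(classifiers, weights, x):
    """Return sign(sum_t w_t * f_t(x)) for a single input x."""
    score = sum(w * f(x) for f, w in zip(classifiers, weights))
    return 1 if score >= 0 else -1  # tie at 0 goes to +1 (assumption)

# Hypothetical classifiers on a scalar input, each returning -1 or 1
classifiers = [
    lambda x: 1 if x > 0.0 else -1,
    lambda x: 1 if x > 2.0 else -1,
    lambda x: 1 if x < 5.0 else -1,
]
weights = [0.9, 0.4, 0.2]  # hypothetical weights w_1, w_2, w_3

print(ensemble_predict(classifiers, weights, 1.0))
print(ensemble_predict(classifiers, weights, -1.0))
```

For `x = 1.0` the votes are +1, -1, +1, so the score is 0.9 - 0.4 + 0.2 = 0.7 and the prediction is +1. The classifier with the largest weight dominates the vote.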

### AdaBoost algorithm
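- Initialize the weight of each data point: $\alpha_i = \frac1N$
  - When $f_t(x_i)$ misclassifies $x_i$, $\alpha_i$ is increased
  - Weighted error (the total weight of the misclassified points):
    $$
    weightError(f_t) = \sum_{i=1}^N\alpha_i I(f_t(x_i) \neq y_i)
    $$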
- For t = 1, ..., T
  - Learn $f_t(x)$: choose the feature (split) with the smallest weighted classification error
  - Compute the coefficient $w_t$
    $$
    w_t = \frac12\ln\left(\frac{1 - weightError(f_t)}{weightError(f_t)}\right)
    $$
  - Update $\alpha_i$
    $$
    \alpha_i = \begin{cases}\alpha_i e^{-w_t}, & f_t(x_i) = y_i\\\alpha_i e^{w_t}, & f_t(x_i) \neq y_i\end{cases}
    $$
  - Normalize $\alpha_i$ (this keeps the weights from blowing up or vanishing numerically)
    $$
    \alpha_i = \frac{\alpha_i}{\sum_{j=1}^N\alpha_j}
    $$
- Prediction:
  $$
  \hat y = \operatorname{sign}\left(\sum_{t=1}^T w_t f_t(x)\right)
  $$
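The loop above can be sketched with NumPy as follows. This is a minimal illustration, not a reference implementation: it assumes the weak learners are decision stumps (one feature, one threshold), and the helper names `fit_stump`, `stump_predict`, `adaboost_fit`, and `adaboost_predict` are made up for this example. The clipping of the error is an added safeguard against `log(0)` that the notes do not mention.

```python
import numpy as np

def stump_predict(X, j, thr, polarity):
    """Predict -polarity where X[:, j] <= thr, and polarity elsewhere."""
    return np.where(X[:, j] <= thr, -polarity, polarity)

def fit_stump(X, y, alpha):
    """Search all features and thresholds for the lowest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = stump_predict(X, j, thr, polarity)
                err = alpha[pred != y].sum()  # weighted error
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def adaboost_fit(X, y, T=10, eps=1e-10):
    n = len(y)
    alpha = np.full(n, 1.0 / n)               # alpha_i = 1/N
    model = []
    for _ in range(T):
        err, j, thr, polarity = fit_stump(X, y, alpha)
        err = np.clip(err, eps, 1 - eps)      # avoid log(0) (added safeguard)
        w = 0.5 * np.log((1 - err) / err)     # coefficient w_t
        pred = stump_predict(X, j, thr, polarity)
        # y * pred is +1 when correct and -1 when wrong, so this applies
        # e^{-w_t} to correct points and e^{w_t} to misclassified points
        alpha = alpha * np.exp(-w * y * pred)
        alpha /= alpha.sum()                  # normalize
        model.append((w, j, thr, polarity))
    return model

def adaboost_predict(model, X):
    score = sum(w * stump_predict(X, j, thr, p) for w, j, thr, p in model)
    return np.where(score >= 0, 1, -1)

# Toy data (hypothetical): labels follow the pattern + + - - + +
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1, 1])
model = adaboost_fit(X, y, T=3)
print(adaboost_predict(model, X))
```

No single stump can separate this toy pattern, since every stump misclassifies at least two points. The weighted vote of three stumps should be able to, which is the point of the reweighting step.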


### Gradient Boosting

**Each tree is typically built with 8 to 32 leaf nodes.**

- Input:
  - Dataset $\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$
  - A differentiable loss function $L(y_i, F(x))$
  - Learning rate: $\nu$
- Initialize the model with a constant value:
  $$
  F_0(x) = \mathop{\arg\min}_{\gamma}\sum_{i=1}^nL(y_i, \gamma)
  $$
- For m = 1, ..., M (M is the number of trees, typically $\geq$ 100)
  - Compute the pseudo-residuals $r_{i, m} = - \left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$ for i = 1, 2, ..., n
  - Fit a regression tree to $\{(x_1, r_{1, m}), (x_2, r_{2, m}), ..., (x_n, r_{n, m})\}$; its leaves define the terminal regions $R_{j,m}$ for $j = 1, ..., J_m$, where $J_m$ is the number of leaves
  - For each leaf, compute $\gamma_{j,m} = \mathop{\arg\min}_{\gamma}\sum_{x_i \in R_{j, m}}L(y_i, F_{m-1}(x_i) + \gamma)$
  - Update $F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m}\gamma_{j,m}I(x\in R_{j, m})$
- Output $F_M(x)$
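As a rough illustration of the loop above, the sketch below runs gradient boosting with the squared loss $\frac12(y - F)^2$ on 1-D inputs. It makes several simplifying assumptions that are not part of the notes: each tree is a two-leaf stump, the input must contain at least two distinct values, and the function names are invented for this example. With squared loss the pseudo-residual is simply $y_i - F_{m-1}(x_i)$, and the optimal leaf value is the mean residual in that leaf.

```python
import numpy as np

def fit_regression_stump(x, r):
    """Fit a two-leaf regression tree to residuals r on 1-D inputs x.
    Assumes x has at least two distinct values.
    Returns (threshold, left_value, right_value)."""
    best = None
    xs = np.unique(x)
    for thr in (xs[:-1] + xs[1:]) / 2:        # midpoints between sorted values
        left, right = r[x <= thr], r[x > thr]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            # For squared loss, gamma_{j,m} is the mean residual of the leaf
            best = (sse, thr, left.mean(), right.mean())
    return best[1:]

def gradient_boost_fit(x, y, M=100, nu=0.1):
    f0 = y.mean()                 # F_0: the constant minimizing squared loss
    F = np.full(len(y), f0)
    stumps = []
    for _ in range(M):
        r = y - F                 # pseudo-residuals (negative gradient)
        thr, g_left, g_right = fit_regression_stump(x, r)
        F = F + nu * np.where(x <= thr, g_left, g_right)
        stumps.append((thr, g_left, g_right))
    return f0, nu, stumps

def gradient_boost_predict(model, x):
    f0, nu, stumps = model
    F = np.full(len(x), f0)
    for thr, g_left, g_right in stumps:
        F = F + nu * np.where(x <= thr, g_left, g_right)
    return F

# Toy data (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.2, 0.9, 3.0, 3.1, 2.9])
model = gradient_boost_fit(x, y, M=100, nu=0.1)
mse_initial = ((y - y.mean()) ** 2).mean()
mse_final = ((y - gradient_boost_predict(model, x)) ** 2).mean()
print(mse_initial, mse_final)
```

Each round only moves the prediction a fraction $\nu$ of the way toward the residuals, which is why many small trees are needed.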

Differences by problem type:
- Regression:
  - Learning rate: $\nu = 0.1$
  - Loss function: $L(y_i, F(x_i)) = \frac12(y_i - F(x_i))^2$
  - Compute $\gamma_{j,m} = \frac{1}{|R_{j,m}|}\sum_{x_i \in R_{j,m}} r_{i,m}$ (the mean of the residuals in the leaf)
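- Classification:
  - Learning rate: $\nu = 0.8$
  - Loss function: $L(y_i, F(x_i)) = -\left(y_i\log p_i + (1-y_i)\log(1-p_i)\right)$, where $y_i \in \{0, 1\}$, $F(x_i)$ is the predicted log-odds, and $p_i = \frac{1}{1 + e^{-F(x_i)}}$ is the predicted probability
  - Compute $\gamma_{j,m}$: there is no closed form, so the loss is approximated with a second-order Taylor expansion
    $$
    L(y_i, F_{m-1}(x_i) + \gamma) \approx L(y_i, F_{m-1}(x_i)) + \frac{\partial L(y_i, F_{m-1}(x_i))}{\partial F}\gamma + \frac12\frac{\partial^2 L(y_i, F_{m-1}(x_i))}{\partial F^2}\gamma^2
    $$
    Setting the derivative with respect to $\gamma$ to zero and simplifying, with $p_i$ evaluated at $F_{m-1}(x_i)$ and $r_{i,m} = y_i - p_i$:
    $$
    \gamma_{j,m} = \frac{\sum_{x_i \in R_{j, m}} r_{i,m}}{\sum_{x_i \in R_{j, m}} p_i (1 - p_i)}
    $$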
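The leaf value for classification can be checked numerically. The sketch below assumes that $F_{m-1}(x_i)$ is a log-odds score and that the predicted probability is its sigmoid, so the pseudo-residual is $y_i - p_i$ and the second derivative of the log loss is $p_i(1 - p_i)$. The function names and the sample values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def leaf_gamma(y_leaf, F_prev_leaf):
    """Leaf value for log loss: sum(residuals) / sum(p * (1 - p)).
    y_leaf holds 0/1 labels, F_prev_leaf holds previous log-odds scores."""
    p = sigmoid(F_prev_leaf)          # previous predicted probabilities
    residuals = y_leaf - p            # pseudo-residuals r_i = y_i - p_i
    return residuals.sum() / (p * (1.0 - p)).sum()

# Three samples in one leaf, all starting from log-odds 0 (p = 0.5)
y_leaf = np.array([1.0, 1.0, 0.0])
F_prev = np.zeros(3)
gamma = leaf_gamma(y_leaf, F_prev)
print(gamma)
```

Here the residuals are 0.5, 0.5, and -0.5, and each $p_i(1 - p_i)$ is 0.25, so $\gamma = 0.5 / 0.75 = 2/3$. The leaf pushes the log-odds upward because two of its three samples are positive.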

### XGBoost

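Each iteration adds a new tree that fits the residuals left by the previous iteration, so that the prediction moves closer to the true value.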

- Objective: $Obj^{(t)} = \sum_{i=1}^nl(y_i, \hat y_i^{(t)}) + \sum_{i=1}^t\Omega(f_i)$, where $\hat y_i^{(t)} = \hat y_i^{(t-1)} + f_t(x_i)$
  - Apply the second-order Taylor expansion: $f(x + \Delta x) \approx f(x) + f'(x)\Delta x + \frac12f''(x)\Delta x^2$
  $$
  \begin{align*}
  &\text{Let:} \\
  &g_i = \partial_{\hat y^{(t-1)}}l(y_i, \hat y_i^{(t-1)}) \\
  &h_i = \partial^2_{\hat y^{(t-1)}}l(y_i, \hat y_i^{(t-1)}) \\
  &\text{Then:} \\
  &Obj^{(t)} \approx \sum_{i=1}^n\left[l(y_i, \hat y_i^{(t-1)}) + g_i f_t(x_i) + \frac12h_if_t^2(x_i)\right] + \Omega(f_t) + \text{const} \\
  &\text{Dropping the terms that do not depend on } f_t: \\
  &Obj^{(t)} \approx \sum_{i=1}^n\left[ g_i f_t(x_i) + \frac12h_if_t^2(x_i)\right] + \Omega(f_t) \\
  &\text{Define } f_t(x): \\
  &f_t(x) = w_{q(x)} \quad (T: \text{number of leaves},\; w \in \mathbb{R}^T: \text{leaf weights},\; q(x): \text{index of the leaf that } x \text{ falls into}) \\
  &\text{Define the regularization term:} \\
  &\Omega(f_t) = \gamma T + \frac12\lambda\sum_{j=1}^Tw_j^2 \\
  &\text{Let:} \\
  &I_j = \{i\;|\;q(x_i) = j\} \\
  &\text{Then:} \\
  &Obj^{(t)} \approx \sum_{j=1}^T\left[\left(\sum_{i\in I_j}g_i\right)w_j + \frac12\left(\sum_{i\in I_j}h_i + \lambda\right)w_j^2\right] + \gamma T \\
  &\text{Let:} \\
  &G_j = \sum_{i \in I_j}g_i \\
  &H_j = \sum_{i \in I_j}h_i \\
  &\text{Then:} \\
  &Obj^{(t)} \approx \sum_{j=1}^T\left[G_jw_j + \frac12(H_j + \lambda)w_j^2\right] + \gamma T \\
  &\text{Setting } \partial Obj^{(t)} / \partial w_j = G_j + (H_j + \lambda)w_j = 0 \text{ gives the minimum:} \\
  &w_j^* = -\frac{G_j}{H_j + \lambda} \\
  &Obj^* = -\frac12\sum_{j=1}^T\frac{G_j^2}{H_j + \lambda} + \gamma T \\
  &\text{When one node is split into two leaves (left } L \text{ and right } R\text{), the objective decreases by:} \\
  &Gain = \frac{1}{2} \left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma \\
  &\text{Conclusion: the split with the largest Gain is the best split}
  \end{align*}
  $$
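The closed-form leaf weight and the split gain from the derivation above translate directly into code. The sketch below is only a numeric check of those two formulas with hypothetical values of $G$, $H$, $\lambda$, and $\gamma$. The function names are invented here and are not part of the XGBoost library API.

```python
def leaf_weight(G, H, lam):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -G / (H + lam)

def leaf_score(G, H, lam):
    """The G^2 / (H + lambda) term that appears in Obj* and in Gain."""
    return G * G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Reduction of the objective when a node is split into left and right leaves."""
    return 0.5 * (leaf_score(G_L, H_L, lam)
                  + leaf_score(G_R, H_R, lam)
                  - leaf_score(G_L + G_R, H_L + H_R, lam)) - gamma

# Hypothetical gradient statistics for the two candidate children
G_L, H_L = 4.0, 2.0
G_R, H_R = -6.0, 3.0
lam, gamma = 1.0, 0.5

print(leaf_weight(G_L, H_L, lam))
print(split_gain(G_L, H_L, G_R, H_R, lam, gamma))
```

With these numbers the left leaf weight is $-4/3$, and the gain is $\frac12(16/3 + 9 - 2/3) - 0.5 = 19/3 \approx 6.33$. A positive gain means the split lowers the objective even after paying the per-leaf penalty $\gamma$.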

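Split-finding algorithm:
1. For each node, enumerate all features
    1. Sort the data by the value of each feature
    2. Use a linear scan over the sorted values to find the best split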
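The steps above can be sketched as an exhaustive search over one node. This is a simplified illustration assuming dense NumPy arrays, with per-sample gradients `g` and second derivatives `h` already computed. The function name `best_split` and the choice of the midpoint between neighbouring values as the threshold are assumptions made for this example.

```python
import numpy as np

def best_split(X, g, h, lam=1.0, gamma=0.0):
    """Return (gain, feature index, threshold) of the highest-gain split,
    or None if no feature has two distinct values."""
    n, d = X.shape
    G, H = g.sum(), h.sum()
    parent = G * G / (H + lam)
    best = None
    for j in range(d):                        # enumerate every feature
        order = np.argsort(X[:, j])           # sort the data by this feature
        G_L = H_L = 0.0
        for k in range(n - 1):                # linear scan over sorted samples
            i, nxt = order[k], order[k + 1]
            G_L += g[i]                       # running sums for the left child
            H_L += h[i]
            if X[i, j] == X[nxt, j]:
                continue                      # cannot split between equal values
            G_R, H_R = G - G_L, H - H_L
            gain = 0.5 * (G_L**2 / (H_L + lam)
                          + G_R**2 / (H_R + lam) - parent) - gamma
            if best is None or gain > best[0]:
                best = (gain, j, (X[i, j] + X[nxt, j]) / 2)
    return best

# Toy data (hypothetical): squared loss with initial prediction 0.5,
# so g_i = 0.5 - y_i and h_i = 1 for labels y = [0, 0, 1, 1]
X = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]])
g = np.array([0.5, 0.5, -0.5, -0.5])
h = np.ones(4)
result = best_split(X, g, h, lam=1.0, gamma=0.0)
print(result)
```

Because the left-child sums are updated incrementally, each feature costs one sort plus one pass over the data. On this toy input the best split is on feature 0 at threshold 2.5, which separates the positive gradients from the negative ones.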