Tree-Math

Machine learning study notes containing the math behind all the mainstream tree-based machine learning models, covering basic decision tree models (ID3, C4.5, CART), boosted models (GBM, AdaBoost, XGBoost, LightGBM), and bagging models (Bagging Tree, Random Forest, ExtraTrees).

Boosting Tree Models

GBM (Gradient Boosting Machine)

One Sentence Summary:
Continuously add weak base learners that approximate the negative gradient, so as to decrease the total loss.

  • a. Difference between AdaBoost & GBM
    AdaBoost uses the exponential loss, which grows exponentially for negative margins and therefore makes the model more sensitive to outliers. GBM, by contrast, allows more robust loss functions, as long as the loss function is continuously differentiable.

    | Models   | Method to correct previous errors |
    | -------- | --------------------------------- |
    | AdaBoost | Increase the weights of incorrectly classified samples and decrease the weights of correctly classified samples. |
    | GBM      | Use the negative gradient as an indicator of the errors made by the previous base learners, and fit the next base learner to approximate this negative gradient. |
  • b. Negative gradient in GBM

    Recall that in AdaBoost we mentioned Forward Stagewise Additive Modeling. Suppose we are now at iteration m:

    f_m(x) = f_{m-1}(x) + b_m(x)

    As in AdaBoost, we want to choose b_m(x) so as to reduce the total loss:

    min_{b_m} Σ_{i=1}^{N} L(y_i, f_{m-1}(x_i) + b_m(x_i))

    But here the problem is different from the AdaBoost case. In AdaBoost we know the loss function exactly (the exponential loss), so we can find the exact optimal b_m(x). In GBM, however, we want to be able to handle any loss function as long as it is differentiable. So we adopt an idea similar to gradient descent and choose b_m(x) to approximate the direction of the negative gradient:

    b_m(x_i) ≈ -η * ∂L(y_i, f_{m-1}(x_i)) / ∂f_{m-1}(x_i)

    where η is a parameter similar to the learning rate, and the minus sign means we move along the negative gradient.
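
    For example, with the square loss L(y, f(x)) = (1/2) * (y - f(x))^2, the negative gradient evaluated at f_{m-1} is y_i - f_{m-1}(x_i), i.e. exactly the residual of the current model. So for the square loss, fitting the negative gradient simply means fitting the residuals.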

  • c. GBM algorithm
    Model Input: Dataset D = {(x_1, y_1), ..., (x_N, y_N)}, y_i ∈ {-1, 1}
    Model Output: Final classifier/regressor: f_M(x)

    Steps:

    (1) Initialization:

    f_0(x) = argmin_{c} Σ_{i=1}^{N} L(y_i, c)

    (2) for m in 1,2,3,..., M:

    • compute the negative gradient:
      r_{i,m} = -[ ∂L(y_i, f(x_i)) / ∂f(x_i) ] evaluated at f = f_{m-1}

    • Fit a new tree by minimizing the square loss:
      b_m = argmin_{b} Σ_{i=1}^{N} (r_{i,m} - b(x_i))^2

    • Use linear search to find the best step (very similar to the learning rate concept in SGD):

      η_m = argmin_{η} Σ_{i=1}^{N} L(y_i, f_{m-1}(x_i) + η * b_m(x_i))

    • update the function f_m(x):

      f_m(x) = f_{m-1}(x) + η_m * b_m(x)

    (3) Output the final model:

    f_M(x) = f_0(x) + Σ_{m=1}^{M} η_m * b_m(x)
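
    The steps above map almost line-by-line onto code. Below is a minimal from-scratch sketch (not the notes' own implementation): it assumes the square loss, uses sklearn's DecisionTreeRegressor as the base learner b_m(x), and does the line search for η_m over a simple grid; the function names are only for illustration.

    ```python
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gbm_fit(X, y, M=100, max_depth=2):
        """Generic GBM loop: fit each base learner to the negative gradient."""
        loss = lambda y, f: 0.5 * (y - f) ** 2      # square loss (assumed here)
        neg_grad = lambda y, f: y - f               # -dL/df for the square loss

        f0 = y.mean()                               # (1) initialization: argmin_c sum L(y_i, c)
        f = np.full(len(y), f0)
        trees, etas = [], []
        for m in range(M):                          # (2) boosting iterations
            r = neg_grad(y, f)                      # negative gradient r_{i,m}
            tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)  # fit b_m to r
            b = tree.predict(X)
            # line search for eta_m (grid search here; the square loss has a closed form)
            grid = np.linspace(0.01, 1.0, 100)
            eta = min(grid, key=lambda e: loss(y, f + e * b).sum())
            f = f + eta * b                         # update f_m(x)
            trees.append(tree)
            etas.append(eta)
        return f0, trees, etas

    def gbm_predict(X, f0, trees, etas):
        # (3) final model: f_M(x) = f_0(x) + sum_m eta_m * b_m(x)
        return f0 + sum(eta * tree.predict(X) for tree, eta in zip(trees, etas))
    ```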

  • d. GBM regression tree algorithm

    Model Input: Dataset D = {(x_1, y_1), ..., (x_N, y_N)}, y_i ∈ R
    Model Output: Final regressor: f_M(x)
    Loss Function: Square Loss

    Steps:

    (1) Initialization:

    f_0(x) = argmin_{c} Σ_{i=1}^{N} L(y_i, c) = (1/N) Σ_{i=1}^{N} y_i

    (2) for m in 1,2,3,..., M:

    • compute the negative gradient:

      r_{i,m} = -∂L(y_i, f_{m-1}(x_i)) / ∂f_{m-1}(x_i) = y_i - f_{m-1}(x_i)

    • Fit a new CART tree to the negative gradient by minimizing the square loss; suppose the tree splits the input space into J disjoint regions R_{j,m}:

      b_m = argmin_{b} Σ_{i=1}^{N} (r_{i,m} - b(x_i))^2, which gives the regions R_{j,m}, j = 1, ..., J

    • Instead of using a line search to find one optimal step η_m for the whole tree, we find an optimal constant γ_{j,m} for each leaf region individually, which gives better results:

      γ_{j,m} = argmin_{γ} Σ_{x_i ∈ R_{j,m}} L(y_i, f_{m-1}(x_i) + γ), which for the square loss is the average residual in R_{j,m}

    • update the function f_m(x):

      f_m(x) = f_{m-1}(x) + Σ_{j=1}^{J} γ_{j,m} * I(x ∈ R_{j,m})

    (3) So we will output our final model f_M(x):

    f_M(x) = f_0(x) + Σ_{m=1}^{M} Σ_{j=1}^{J} γ_{j,m} * I(x ∈ R_{j,m})
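
    A minimal sketch of this regression algorithm (assuming the square loss and using sklearn's DecisionTreeRegressor as the CART learner; the helper names are ours, not from any library). For the square loss the leaf-wise optimum γ_{j,m} is just the mean residual in each leaf, but the per-leaf step is written out explicitly because that is the part that changes for other losses.

    ```python
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gbm_regression_fit(X, y, M=100, max_depth=3):
        f0 = y.mean()                        # (1) f_0(x) = mean(y) for the square loss
        f = np.full(len(y), f0)
        trees, gammas = [], []
        for m in range(M):                   # (2) boosting iterations
            r = y - f                        # negative gradient = residuals for the square loss
            tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
            leaf = tree.apply(X)             # leaf index (region R_{j,m}) of every sample
            # gamma_{j,m}: per-leaf optimal constant; for the square loss it is the mean residual
            gamma = {j: r[leaf == j].mean() for j in np.unique(leaf)}
            f = f + np.array([gamma[j] for j in leaf])    # update f_m(x)
            trees.append(tree)
            gammas.append(gamma)
        return f0, trees, gammas

    def gbm_regression_predict(X, f0, trees, gammas):
        # (3) f_M(x) = f_0(x) + sum_m sum_j gamma_{j,m} * I(x in R_{j,m})
        pred = np.full(X.shape[0], f0, dtype=float)
        for tree, gamma in zip(trees, gammas):
            pred += np.array([gamma[j] for j in tree.apply(X)])
        return pred
    ```

    Because γ_{j,m} here equals the tree's own leaf prediction, this coincides with simply adding tree.predict(X); the explicit per-leaf update only starts to matter for non-square losses such as the deviance loss below.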

  • e. GBM classification tree algorithm

    Model Input: Dataset D = {(x_1, y_1), ..., (x_N, y_N)}, y_i ∈ {0, 1}
    Model Output: Final classifier: f_M(x)
    Loss Function: Deviance Loss

    Steps:

    (1) Initialization:

    • What is the loss function (deviance loss function):

      L(y_i, f(x_i)) = -[ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ], where p_i = sigmoid(f(x_i)) = 1 / (1 + e^{-f(x_i)})

    • So at each step we use sigmoid(f_m(x)) to approximate the probability, which means the iterations are approximating the log-odds log(p / (1 - p)).

      Suppose we are now at step 0, and we want a constant f_0(x) that minimizes the loss function during initialization:

      f_0(x) = argmin_{c} Σ_{i=1}^{N} L(y_i, c)

    • We find this f0 by setting the derivative to be 0:

      ∂[ Σ_{i=1}^{N} L(y_i, c) ] / ∂c = Σ_{i=1}^{N} (sigmoid(c) - y_i) = 0  ⟹  p_0 = sigmoid(c) = (1/N) Σ_{i=1}^{N} y_i

    • So after computing p0, we can compute the constant f0(x):

      f_0(x) = log(p_0 / (1 - p_0))

    (2) for m in 1,2,3,..., M:

    p_{m-1}(x_i) = sigmoid(f_{m-1}(x_i)) = 1 / (1 + e^{-f_{m-1}(x_i)})

    • compute the negative gradient:

      r_{i,m} = -∂L(y_i, f_{m-1}(x_i)) / ∂f_{m-1}(x_i) = y_i - p_{m-1}(x_i)

    • Fit a new CART tree to the negative gradient by minimizing the square loss; suppose the tree splits the input space into J disjoint regions R_{j,m}:

      b_m = argmin_{b} Σ_{i=1}^{N} (r_{i,m} - b(x_i))^2, which gives the regions R_{j,m}, j = 1, ..., J

    • Instead of using a line search to find one optimal step η_m for the whole tree, we find an optimal constant γ_{j,m} for each leaf region individually, which gives better results:

      γ_{j,m} = argmin_{γ} Σ_{x_i ∈ R_{j,m}} L(y_i, f_{m-1}(x_i) + γ) ≈ [ Σ_{x_i ∈ R_{j,m}} r_{i,m} ] / [ Σ_{x_i ∈ R_{j,m}} p_{m-1}(x_i) (1 - p_{m-1}(x_i)) ]

      (the deviance loss has no closed-form minimizer, so the usual one-step Newton approximation is shown on the right)

    • update the function f_m(x):

      f_m(x) = f_{m-1}(x) + Σ_{j=1}^{J} γ_{j,m} * I(x ∈ R_{j,m})

    (3) So we will output our final model f_M(x) and final predicted probability p_M(x):

    f_M(x) = f_0(x) + Σ_{m=1}^{M} Σ_{j=1}^{J} γ_{j,m} * I(x ∈ R_{j,m}),   p_M(x) = sigmoid(f_M(x)) = 1 / (1 + e^{-f_M(x)})
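
    A minimal sketch of this classification algorithm, under the same assumptions as the regression sketch above (0/1 labels, sklearn's DecisionTreeRegressor for the CART step, helper names of our own); the per-leaf γ_{j,m} uses the one-step Newton approximation given above.

    ```python
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def sigmoid(f):
        return 1.0 / (1.0 + np.exp(-f))

    def gbm_classification_fit(X, y, M=100, max_depth=3):
        # (1) initialization: f_0 = log(p_0 / (1 - p_0)), p_0 = mean(y); assumes both classes are present
        p0 = y.mean()
        f0 = np.log(p0 / (1 - p0))
        f = np.full(len(y), f0)
        trees, gammas = [], []
        for m in range(M):                   # (2) boosting iterations
            p = sigmoid(f)
            r = y - p                        # negative gradient of the deviance loss
            tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
            leaf = tree.apply(X)             # leaf index (region R_{j,m}) of every sample
            gamma = {}
            for j in np.unique(leaf):
                mask = leaf == j
                denom = (p[mask] * (1 - p[mask])).sum()
                gamma[j] = r[mask].sum() / max(denom, 1e-12)   # one-step Newton estimate
            f = f + np.array([gamma[j] for j in leaf])         # update f_m(x)
            trees.append(tree)
            gammas.append(gamma)
        return f0, trees, gammas

    def gbm_classification_predict_proba(X, f0, trees, gammas):
        # (3) p_M(x) = sigmoid(f_M(x))
        f = np.full(X.shape[0], f0, dtype=float)
        for tree, gamma in zip(trees, gammas):
            f += np.array([gamma[j] for j in tree.apply(X)])
        return sigmoid(f)
    ```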

Scikit-learn Application

GradientBoostingRegressor:
class sklearn.ensemble.GradientBoostingRegressor(loss='ls', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, alpha=0.9, verbose=0, max_leaf_nodes=None, warm_start=False, presort='auto', validation_fraction=0.1, n_iter_no_change=None, tol=0.0001)

The base learners of GBR are regression trees, which are discussed in the CART decision tree document, so only the hyperparameters of GBR itself are introduced below. Please refer to the previous document for more information about the CART decision tree parameters. A short usage sketch follows the parameter list.

  • loss : {'ls', 'lad', 'huber', 'quantile'}, optional (default='ls')
    The loss function to be optimized. The default 'ls' refers to least squares regression, which is also the loss used in our math derivation. Please refer to Prince Grover's blog for an introduction to the different loss functions.

  • learning_rate : float, optional (default=0.1)
    The learning rate shrinks the contribution of each tree by learning_rate. Using the same example as in AdaBoost: suppose the previous prediction is f_{m-1} = 1, the learning rate is 0.1, and the next tree's correction is 0.5; then the updated prediction is f_m = 1 + 0.1 * 0.5 = 1.05. A lower learning rate often generalizes better, but it also requires a larger number of estimators.

  • n_estimators : int (default=100)
    The number of estimators in the model; it corresponds to M in the formulas above. Gradient boosting is fairly robust to over-fitting, so a large number usually results in better performance.

  • subsample : float, optional (default=1.0)
    The fraction of samples randomly selected for fitting each weak base learner. subsample interacts with the parameter n_estimators. Choosing subsample < 1.0 leads to a reduction of variance and an increase in bias. A value around 0.8 generally works well.

  • init : estimator or 'zero', optional (default=None)
    An estimator object used to compute the initial predictions f_0(x). If 'zero', the initial raw predictions are set to zero. By default a DummyEstimator is used, predicting either the average target value (for loss='ls') or a quantile for the other losses.

  • alpha : float (default=0.9)
    The alpha-quantile of the Huber loss function and the quantile loss function. Only used if loss='huber' or loss='quantile'.

  • verbose : int, default: 0
    Enable verbose output. Default doesn't generate output. If 1 then it prints progress and performance once in a while (the more trees the lower the frequency). If greater than 1 then it prints progress and performance for every tree.

  • warm_start : bool, default: False
    When warm_start is true, the existing fitted model attributes are used to initialise the new model in a subsequent call to fit. In other words, we can add more trees to an already trained model.

  • validation_fraction : float, optional, default 0.1
    The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if n_iter_no_change is set to an integer.

  • n_iter_no_change : int, default None
    n_iter_no_change is used to decide whether early stopping should terminate training when the validation score is not improving. By default it is set to None to disable early stopping. If set to an integer, a validation_fraction share of the training data is set aside as a validation set, and training terminates when the validation score has not improved in any of the previous n_iter_no_change iterations.

  • tol : float, optional, default 1e-4
    Tolerance for the early stopping. When the loss is not improving by at least tol for n_iter_no_change iterations (if set to a number), the training stops.
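
A short usage sketch combining the parameters above (the toy dataset and the specific values are only illustrative; the parameter names follow the signature quoted above, while newer scikit-learn versions rename loss='ls' to loss='squared_error'):

```python
from sklearn.datasets import make_regression   # toy data, for illustration only
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbr = GradientBoostingRegressor(
    loss='ls',                 # square loss, as in the derivation above
    learning_rate=0.1,
    n_estimators=500,
    subsample=0.8,             # stochastic gradient boosting
    max_depth=3,
    validation_fraction=0.1,   # only used because n_iter_no_change is set
    n_iter_no_change=10,       # early stopping
    tol=1e-4,
    random_state=0,
)
gbr.fit(X_train, y_train)
print("R^2 on the test set:", gbr.score(X_test, y_test))
print("number of trees actually fitted:", gbr.n_estimators_)
```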

Reference

  1. Friedman J H. Greedy function approximation: a gradient boosting machine[J]. Annals of statistics, 2001: 1189-1232.
  2. Friedman J H. Stochastic gradient boosting[J]. Computational statistics & data analysis, 2002, 38(4): 367-378.
  3. Hang Li. Statistical Learning Method[M]. Tsinghua University Press, 2019. [Chinese]
  4. Zhihua Zhou. Machine Learning[M]. Tsinghua University Press, 2018. [Chinese]
  5. Mason L, Baxter J, Bartlett P L, et al. Boosting algorithms as gradient descent[C]//Advances in neural information processing systems. 2000: 512-518.
  6. Wikipedia contributors. Gradient boosting. Wikipedia, The Free Encyclopedia. October 21, 2019, 23:33 UTC. Available at: https://en.wikipedia.org/w/index.php?title=Gradient_boosting&oldid=922411214. Accessed November 11, 2019.
  7. https://towardsdatascience.com/understanding-gradient-boosting-machines-9be756fe76ab
  8. https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d
  9. https://zhuanlan.zhihu.com/p/38329631 [Chinese]
  10. https://zhuanlan.zhihu.com/p/43940320 [Chinese]
  11. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
  12. https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/