Tree-Math

Machine learning study notes covering the math behind all the mainstream tree-based machine learning models: basic decision tree models (ID3, C4.5, CART), boosted models (GBM, AdaBoost, XGBoost, LightGBM), and bagging models (Bagging Tree, Random Forest, ExtraTrees).

Boosting Tree Models

XGboost (Extreme Gradient Boosting)

One Sentence Summary:
Continuously add weak base learners that approximate a richer objective, built from both the negative gradient and the second derivative of the loss, so that each step finds a more accurate direction in which to reduce the loss.

  • a. Difference between GBM & XGboost
    The most important difference is that GBM only uses the first-derivative information to find the best direction to reduce the loss, while XGboost uses both the first and second derivatives, so XGboost tends to produce more accurate results.

    | GBM (GBDT) | XGboost |
    | --- | --- |
    | Only uses the first derivative to find the best base learner at each stage | Uses both the first and second derivatives |
    | No regularization term in the loss function (in the initial version) | Adds a regularization term to the loss function |
    | Uses MSE as the scorer to find the best base learner (regression) | Uses a better scorer that takes overfitting into consideration |
    | Doesn't support sparse datasets | Directly supports sparse datasets |
    | Uses pre-pruning to stop overfitting | Uses post-pruning to stop overfitting, which also better prevents under-fitting |
  • b. How to find the best direction to reduce loss in XGboost

    Recall that in the previous section we mentioned Forward Stagewise Additive Modeling. The final output f_M(x) is as below:

    $$f_M(x) = \sum_{m=1}^{M} G_m(x)$$

    Suppose now we are in step m, and we use G_m(x) to denote the new base learner of this step, so that:

    $$f_m(x) = f_{m-1}(x) + G_m(x)$$

    Since all previous m-1 base learners are fixed, our loss at step m is as below (the regularization terms of the previous learners are constants and are dropped):

    $$Loss = \sum_{i=1}^{N} L\big(y_i,\ f_{m-1}(x_i) + G_m(x_i)\big) + \Omega(G_m)$$

    where Ω(G_m) is the regularization term defined below: J is the number of final leaf nodes in the base learner G_m(x), and b_j is the output value at leaf node j:

    $$\Omega(G_m) = \gamma J + \frac{1}{2}\lambda \sum_{j=1}^{J} b_j^{2}$$

    Using a second-order Taylor expansion of the loss function at f_{m-1}(x_i), we will have:

    $$Loss \approx \sum_{i=1}^{N}\Big[L\big(y_i, f_{m-1}(x_i)\big) + g_i\,G_m(x_i) + \frac{1}{2} h_i\,G_m^{2}(x_i)\Big] + \Omega(G_m)$$

    where $g_i = \partial L(y_i, f_{m-1}(x_i)) / \partial f_{m-1}(x_i)$ is the first derivative and $h_i = \partial^2 L(y_i, f_{m-1}(x_i)) / \partial f_{m-1}(x_i)^2$ is the second derivative.

    Recall that here G_m(x) is just a CART decision tree that splits the space into J final leaf areas, each area R_j with predicted value b_j:

    $$G_m(x) = \sum_{j=1}^{J} b_j\, I(x \in R_j)$$

    We can simplify the above term by dropping the constant $L(y_i, f_{m-1}(x_i))$ part and grouping the samples by the leaf they fall into:

    $$Loss \approx \sum_{j=1}^{J}\Big[\big(\textstyle\sum_{i \in R_j} g_i\big)\, b_j + \frac{1}{2}\big(\textstyle\sum_{i \in R_j} h_i + \lambda\big)\, b_j^{2}\Big] + \gamma J$$

    So our target right now is to reduce the loss function above by finding the best tree structure (the areas R_j) of G_m(x):

    $$G_m^{*}(x) = \arg\min_{G_m}\ \sum_{j=1}^{J}\Big[\big(\textstyle\sum_{i \in R_j} g_i\big)\, b_j + \frac{1}{2}\big(\textstyle\sum_{i \in R_j} h_i + \lambda\big)\, b_j^{2}\Big] + \gamma J$$

    After finding the best structure of the base learner G_m(x), we continue to find the best b_j by setting the derivative with respect to b_j to zero:

    $$b_j^{*} = -\frac{\sum_{i \in R_j} g_i}{\sum_{i \in R_j} h_i + \lambda}$$

    So the minimal loss (the structure score of the tree) is as below; a small numeric sketch follows at the end of this item:

    $$Loss^{*} = -\frac{1}{2}\sum_{j=1}^{J} \frac{\big(\sum_{i \in R_j} g_i\big)^{2}}{\sum_{i \in R_j} h_i + \lambda} + \gamma J$$
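
To make the leaf-weight and structure-score formulas above concrete, here is a minimal numeric sketch in Python. The g_i / h_i values and the two leaves R1, R2 are made up purely for illustration; lam and gamma play the same roles as λ and γ in the formulas above.

```python
import numpy as np

# Hypothetical first/second derivatives (g_i, h_i) of six samples,
# already grouped by the leaf R_j they fall into (illustration only).
leaves = {
    "R1": {"g": np.array([-1.2, -0.8, -1.0]), "h": np.array([1.0, 1.0, 1.0])},
    "R2": {"g": np.array([0.9, 1.1, 1.3]),    "h": np.array([1.0, 1.0, 1.0])},
}
lam, gamma = 1.0, 0.1  # L2 penalty on leaf weights, per-leaf penalty

loss = gamma * len(leaves)
for name, leaf in leaves.items():
    G, H = leaf["g"].sum(), leaf["h"].sum()
    b_star = -G / (H + lam)            # optimal leaf weight b_j* = -G_j / (H_j + lambda)
    loss += -0.5 * G ** 2 / (H + lam)  # this leaf's contribution to the minimal loss
    print(f"{name}: G_j={G:.2f}, H_j={H:.2f}, b_j*={b_star:.3f}")
print(f"minimal loss (structure score): {loss:.3f}")
```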

  • c. XGboost algorithm
    We use a regression tree as the example.
    Model Input: Dataset D: D = {(x_1, y_1), ..., (x_N, y_N)}, y_i ∈ R
    Model Output: Final regressor f_M(x)

    Steps:

    (1) Initialization:

    $$f_0(x) = \arg\min_{c} \sum_{i=1}^{N} L(y_i, c)$$

    (2) for m in 1,2,3,..., M:

    • compute the first derivative (gradient) of the loss for each sample:

    $$g_i = \frac{\partial L\big(y_i, f_{m-1}(x_i)\big)}{\partial f_{m-1}(x_i)}, \quad i = 1, 2, \dots, N$$

    • compute the second derivative of the loss for each sample:

    $$h_i = \frac{\partial^{2} L\big(y_i, f_{m-1}(x_i)\big)}{\partial f_{m-1}(x_i)^{2}}, \quad i = 1, 2, \dots, N$$

    • Fit a new decision tree by minimizing the loss function with the regularization term:

    $$G_m(x) = \arg\min_{G} \sum_{i=1}^{N}\Big[g_i\, G(x_i) + \frac{1}{2} h_i\, G^{2}(x_i)\Big] + \Omega(G)$$

    • update the function f_m(x):

      $$f_m(x) = f_{m-1}(x) + G_m(x)$$

    (3) So we will output our final model f_M(x), shown below; a runnable sketch of this whole loop follows at the end of this item:

    $$f_M(x) = f_0(x) + \sum_{m=1}^{M} G_m(x)$$
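
As a sanity check of the algorithm above, here is a minimal runnable sketch of the boosting loop for squared-error loss. It uses scikit-learn's DecisionTreeRegressor as the base learner rather than XGBoost's own tree builder, and it omits the regularization term: fitting the targets -g_i/h_i with sample weights h_i minimizes the second-order objective up to a constant. The synthetic dataset and the learning-rate shrinkage are illustrative additions, not part of the derivation above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# Squared-error loss L = 1/2 (y - f)^2, so g_i = f - y and h_i = 1.
def grad(y, f):
    return f - y

def hess(y, f):
    return np.ones_like(y)

M, lr = 50, 0.1
f = np.full_like(y, y.mean())             # step (1): initialize with the best constant
trees = []
for m in range(M):                        # step (2): boosting iterations
    g, h = grad(y, f), hess(y, f)
    # Fitting targets -g/h with sample weights h minimizes
    # sum_i [ g_i * G(x_i) + 1/2 * h_i * G(x_i)^2 ] up to a constant
    # (regularization omitted in this sketch).
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, -g / h, sample_weight=h)
    f = f + lr * tree.predict(X)          # update f_m(x) = f_{m-1}(x) + lr * G_m(x)
    trees.append(tree)

print("final training MSE:", np.mean((y - f) ** 2))  # step (3): f_M(x) is the sum
```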

  • d. More details on how to find the best split inside each base learner
    XGboost offers several split-finding algorithms to find each split in each base learner; three of them are introduced below.

    • Method One: Exact Greedy Algorithm

      In each split of each base learner, our gain is as below:

      $$Gain = \frac{1}{2}\left[\frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda} - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda}\right] - \gamma$$

      where $G_L = \sum_{i \in left} g_i$, $H_L = \sum_{i \in left} h_i$, and $G_R$, $H_R$ are defined analogously for the right child.

      So in fact, we just need to find the feature and split point that result in the maximum gain, and this is exactly the core of the exact greedy algorithm. The pseudo code is Algorithm 1 (Exact Greedy Algorithm for Split Finding) in the original paper; a small sketch follows below.
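
Below is a minimal Python sketch of exact greedy split finding over a single feature column, using the gain formula above. The function name, the toy data, and the default lam/gamma values are illustrative, not from the paper; the full algorithm repeats this scan for every feature and keeps the overall best split.

```python
import numpy as np

def best_split_exact(x, g, h, lam=1.0, gamma=0.0):
    """Scan every distinct split point of one feature column x and return
    the threshold with the largest gain (exact greedy, single feature)."""
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()
    G_L = H_L = 0.0
    best_gain, best_thr = -np.inf, None
    for i in range(len(x) - 1):
        G_L += g[i]
        H_L += h[i]
        if x[i] == x[i + 1]:           # cannot split between equal feature values
            continue
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L ** 2 / (H_L + lam) + G_R ** 2 / (H_R + lam)
                      - G ** 2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_thr = gain, (x[i] + x[i + 1]) / 2
    return best_thr, best_gain

# Toy usage with made-up gradients/hessians (squared-error style: h_i = 1).
rng = np.random.default_rng(1)
x = rng.normal(size=100)
g = np.where(x > 0, -1.0, 1.0) + 0.1 * rng.normal(size=100)
h = np.ones(100)
print(best_split_exact(x, g, h))
```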

    • Method Two: Approximate Algorithm using Weighted Quantile Sketch

      The exact greedy algorithm above is too slow on large datasets, because we need to go through every possible feature and split point and compute the gain for each combination. Suppose there are k features and n samples; then we have to compute around k*(n-1) gains. We can improve this by splitting on predefined percentile buckets instead of on every possible data point.

      Here we use the second derivative as the weight to create the percentile buckets.

      Suppose D_k is the multiset that represents the kth feature values and the second-order gradient statistics of all training samples:

      $$D_k = \{(x_{1k}, h_1),\ (x_{2k}, h_2),\ \dots,\ (x_{nk}, h_n)\}$$

      Then we define the rank function r_k below, based on the weighted second derivative, to compute percentiles:

      $$r_k(z) = \frac{1}{\sum_{(x, h) \in D_k} h} \sum_{(x, h) \in D_k,\ x < z} h$$

      After defining the rank function, we need to define one more parameter ε to specify the size of our buckets. The larger ε is, the larger (coarser) the buckets and percentiles we are using. For example, if ε is 0.25, we get roughly 1/ε = 4 buckets, so the candidates are very similar to the second-derivative-weighted quartiles. The candidate split points {s_{k1}, s_{k2}, ..., s_{kl}} are chosen such that:

      $$\big|r_k(s_{k,j}) - r_k(s_{k,j+1})\big| < \varepsilon, \qquad s_{k1} = \min_i x_{ik}, \quad s_{kl} = \max_i x_{ik}$$

      After computing this, we just need to loop through all l candidate split points and find the one that gives the maximal gain. The pseudo code is Algorithm 2 (Approximate Algorithm for Split Finding) in the original paper; a small sketch of the candidate-proposal step follows below.
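
Here is a simplified Python sketch of the candidate-proposal step: it computes split candidates from the second-derivative-weighted rank function r_k for one feature. It is a rough, global approximation for illustration, not the actual weighted quantile sketch data structure used inside XGBoost.

```python
import numpy as np

def weighted_quantile_candidates(x, h, eps=0.25):
    """Propose candidate split points for one feature so that consecutive
    candidates differ in the h-weighted rank r_k by roughly eps."""
    order = np.argsort(x)
    x, h = x[order], h[order]
    rank = np.cumsum(h) / h.sum()       # r_k evaluated just after each sample
    candidates = [x[0]]                 # s_{k1} = min feature value
    last_rank = 0.0
    for xi, ri in zip(x, rank):
        if ri - last_rank >= eps:       # next bucket boundary reached
            candidates.append(xi)
            last_rank = ri
    if candidates[-1] != x[-1]:
        candidates.append(x[-1])        # s_{kl} = max feature value
    return np.array(candidates)

# With eps = 0.25 and uniform h this reduces to (roughly) the quartiles.
x = np.random.default_rng(2).normal(size=1000)
h = np.ones_like(x)
print(weighted_quantile_candidates(x, h, eps=0.25))
```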

    • Method Three: Sparsity-aware Split Finding

      Most tree algorithms before XGboost cannot handle datasets with missing values, so we have to spend a lot of time filling in missing values before feeding the dataset into the model. XGboost uses one simple idea to provide sparsity support: collect statistics only over the non-missing samples when creating buckets & split points, then check which side of the split sending the missing-value samples to gives the maximal gain, and use that side as the default direction. The pseudo code is Algorithm 3 (Sparsity-aware Split Finding) in the original paper; a small sketch follows below.
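
Below is a small Python sketch of the default-direction idea for one already-chosen split threshold: the gradient statistics are accumulated only over non-missing samples, and the missing-value samples are tried on both sides to see which direction gives the larger gain. The function names, the threshold, and the toy data are illustrative only.

```python
import numpy as np

def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    G, H = G_L + G_R, H_L + H_R
    return 0.5 * (G_L ** 2 / (H_L + lam) + G_R ** 2 / (H_R + lam)
                  - G ** 2 / (H + lam)) - gamma

def best_default_direction(x, g, h, threshold, lam=1.0, gamma=0.0):
    """Decide whether samples with missing feature values should go to the
    left or right child of a fixed split, by trying both directions."""
    missing = np.isnan(x)
    left = ~missing & (x < threshold)
    right = ~missing & (x >= threshold)
    G_L, H_L = g[left].sum(), h[left].sum()
    G_R, H_R = g[right].sum(), h[right].sum()
    G_m, H_m = g[missing].sum(), h[missing].sum()

    gain_left = split_gain(G_L + G_m, H_L + H_m, G_R, H_R, lam, gamma)   # missing -> left
    gain_right = split_gain(G_L, H_L, G_R + G_m, H_R + H_m, lam, gamma)  # missing -> right
    return ("left", gain_left) if gain_left >= gain_right else ("right", gain_right)

# Toy usage with made-up data containing missing (NaN) feature values.
rng = np.random.default_rng(3)
x = rng.normal(size=200)
x[rng.random(200) < 0.2] = np.nan
g = np.where(np.nan_to_num(x) > 0, -1.0, 1.0)
h = np.ones_like(g)
print(best_default_direction(x, g, h, threshold=0.0))
```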

  • e. System Design of XGboost

    XGboost also has excellent system design: Column Block for Parallel Learning, Cache-aware Access and Blocks for Out-of-core Computation.

    Before XGboost, parallel learning could only be achieved with Random Forest, since each tree inside a Random Forest is independent. But in traditional GBM, each new base learner is built on the results of the previous base learners, so we cannot build base learners in parallel.

    XGboost achieves parallel learning within each tree by storing data in a compressed column (CSC) format, computing the gradients & second derivatives for each feature in parallel, and searching for the best split points of different features in parallel. For example, assume there are 4 features in our dataset and it costs us 10 seconds to find the best split point on one feature; then the traditional method checks the features one by one and spends 40 seconds to find the best feature & corresponding split point, while XGboost only costs about 10 seconds.

    We won't introduce much of the rest of the system design here, because it is more related to the computer-systems domain.

XGBoost Application

XGBRegressor:
class xgboost.XGBRegressor(max_depth=3, learning_rate=0.1, n_estimators=100, verbosity=1, objective='reg:squarederror', booster='gbtree', tree_method='auto', n_jobs=1, gamma=0, min_child_weight=1, max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, colsample_bynode=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, random_state=0, missing=None, num_parallel_tree=1, importance_type='gain', **kwargs)

  • max_depth : int, (default=6)
    The maximum depth of base tree learners, i.e. Gm(x). This parameter is used to control overfitting and should be tuned using cross validation.

  • learning_rate : float, (default=0.1)
    Stepsize shrinkage used to prevent overfitting, the same as AdaBoost and GBM.

  • n_estimators : int, (default=100)
    The maximum number of trees to fit. It corresponds to M in the formula.

  • objective : string, (default='reg:squarederror')
    It specifies the learning task and the related learning objective. Represents the loss function to be minimized in the formula. For classification problems (XGBClassifier), possible objectives are 'binary:logistic', 'multi:softmax', 'multi:softprob'.

  • booster : string, (default='gbtree')
    It specifies which booster to use in every iteration. 'gbtree' and 'dart' are tree-based boosters, while 'gblinear' uses linear base learners. Please refer to here for more info about the DART booster. Basically, it randomly drops out some learners to prevent overfitting and therefore achieves better results in some situations.

  • tree_method : string, (default='auto')
    It determines the tree construction algorithm used to find the best split in every iteration. Possible choices are 'auto', 'exact', 'approx', 'hist', 'gpu_hist'. As introduced above, 'exact' refers to Method One: Exact Greedy Algorithm, 'approx' refers to Method Two: Approximate Algorithm, 'hist' refers to the fast histogram-optimized approximate greedy algorithm, 'gpu_hist' is a GPU implementation of the hist algorithm, and 'auto' uses a heuristic to choose the fastest method. In this article, we only introduce 'exact' and 'approx'; you can refer to LightGBM for more info on 'hist'.

  • n_jobs : int, (default=1)
    The number of parallel threads used to run XGBoost. You could tune this parameter to efficiently use all your CPU cores and improve the training process. Here is a blog introducing how to tune multithreading support in python.

  • gamma : float, (default=0)
    Gamma specifies the minimum loss reduction required to make a split. In the splitting algorithm, a leaf node will be split only when the resulting split gives a loss reduction larger than gamma. A larger gamma value makes the model more conservative by requiring a higher loss reduction. It can be tuned using CV.

  • min_child_weight : int, (default=1)
    The minimum sum of weights (hessian) of all observations required in a child. It's very similar to 'min_samples_leaf' in the CART decision tree, except that it uses the sum of hessian weights instead of the number of instances. Therefore this parameter is also used to control overfitting and can be tuned using CV. To better understand how it works, you could read this answer.

  • max_delta_step : int, (default=0)
    The maximum delta step we allow each tree's weight estimation to be. If the value is set to 0, there is no constraint. If it is set to a positive value, it can help make the update step more conservative. Usually this parameter is not needed, but it might help in logistic regression when the classes are extremely imbalanced. Setting it to a value of 1-10 might help control the update. More explanation can be found here.

  • subsample : float, (default=1.0)
    Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost would randomly sample half of the training data prior to growing trees. It works very similar to subsample in GBM.

  • colsample_bytree, colsample_bylevel, colsample_bynode : float, (default=1.0)
    This is a family of parameters for subsampling of columns.

    colsample_bytree is similar to max_features in the CART decision tree. It denotes the subsample ratio of columns when constructing each tree. Say we set colsample_bytree to 0.8 and have 10 features in total; then when constructing each tree, only eight features will be randomly selected, and the best feature will be picked among those eight.

    colsample_bylevel is the subsample ratio of columns for each level. Subsampling occurs once for every new depth level reached in a tree. Columns are subsampled from the set of columns chosen for the current tree.

    colsample_bynode is the subsample ratio of columns for each node (split). Subsampling occurs once every time a new split is evaluated. Columns are subsampled from the set of columns chosen for the current level.

    These three parameters work cumulatively. Suppose we have 64 features and the combination is {colsample_bytree=0.5, colsample_bylevel=0.5, colsample_bynode=0.5}. Then the process of randomly selecting features goes like this: each tree randomly samples 32 of the 64 columns, each new level samples 16 of those 32 columns, and each split (node) samples 8 of those 16 columns.

  • reg_alpha : float, (default=0)
    L1 regularization term on weights. Increasing this value will make model more conservative.

  • reg_lambda : float, (default=1)
    L2 regularization term on weights. Increasing this value will make model more conservative.

  • scale_pos_weight :float, (default=1)
    Control the balance of positive and negative weights, useful for unbalanced classes. A typical value to consider: sum(negative instances) / sum(positive instances). See this article for more information.

  • base_score : float, (default=0.5)
    The initial prediction score of all instances, global bias. Theoretically, base_score won't affect the final result as long as you have appropriate learning rate and enough learning steps. See the developer's answer here.

  • missing : float, optional (default=None) Value in the data which should be treated as a missing value. If None, it defaults to np.nan.

  • num_parallel_tree : int, (default=1) Number of parallel trees constructed during each iteration. Used for boosting random forest.

  • importance_type : string, (default='gain')
    The feature importance type for the feature_importances_ property: either 'gain', 'weight', 'cover', 'total_gain' or 'total_cover'. It is defined only when a decision tree model (gbtree) is chosen as the base learner. See this article for more info about the different types of importance.

    'weight': the number of times a feature is used to split the data across all trees.
    'gain': the average gain across all splits the feature is used in.
    'cover': the average coverage across all splits the feature is used in.
    'total_gain': the total gain across all splits the feature is used in.
    'total_cover': the total coverage across all splits the feature is used in.
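
Putting the main parameters together, here is a minimal usage sketch of XGBRegressor on synthetic data. The dataset and the specific hyperparameter values are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Synthetic regression data for illustration.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 8))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBRegressor(
    n_estimators=200,       # M: number of boosting rounds
    max_depth=4,            # depth of each base learner G_m(x)
    learning_rate=0.1,      # shrinkage applied to each G_m(x)
    subsample=0.8,          # row subsampling ratio per tree
    colsample_bytree=0.8,   # column subsampling ratio per tree
    reg_lambda=1.0,         # L2 penalty on leaf weights (lambda above)
    gamma=0.0,              # minimum gain required to make a split
    tree_method="hist",     # histogram-based approximate split finding
)
model.fit(X_train, y_train)

print("test MSE:", mean_squared_error(y_test, model.predict(X_test)))
print("feature importances:", model.feature_importances_)
```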

Reference

  1. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016: 785-794.
  2. Nielsen D. Tree Boosting With XGBoost: Why Does XGBoost Win "Every" Machine Learning Competition? Master's thesis, NTNU, 2016.
  3. https://github.com/dmlc/xgboost
  4. https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/
  5. https://towardsdatascience.com/xgboost-mathematics-explained-58262530904a
  6. https://zhuanlan.zhihu.com/p/46683728 [Chinese]
  7. https://zhuanlan.zhihu.com/p/40129825 [Chinese]
  8. https://xgboost.readthedocs.io/en/latest/parameter.html
  9. https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
  10. https://xgboost.readthedocs.io/en/latest/python/python_api.html
  11. https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html
  12. https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst