### Leaf-wise (Best-first) Tree Growth

Most decision tree learning algorithms grow trees level-wise (depth-wise), as in the following image:

<img src='../../Other/img/level-wise.png' >

LightGBM grows trees leaf-wise (best-first): at each step it splits the leaf with the largest loss reduction (max delta loss). Holding #leaf fixed, leaf-wise algorithms tend to achieve lower loss than level-wise algorithms.

Leaf-wise growth may cause over-fitting when #data is small, so LightGBM includes the max_depth parameter to limit tree depth. However, trees still grow leaf-wise even when max_depth is specified.


<img src='../../Other/img/leaf-wise.png'>
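
The best-first strategy described above can be sketched in a few lines of plain Python. This is only an illustration, not LightGBM's implementation: it assumes a single numeric feature, squared-error loss, and hypothetical helper names (`sse`, `best_split`, `grow_leaf_wise`). A heap keeps the splittable leaves ordered by loss reduction, and the optional `max_depth` argument only stops leaves from being split further; the growth order stays leaf-wise.

```python
import heapq
from itertools import count


def sse(ys):
    """Sum of squared errors around the mean (loss of a constant-prediction leaf)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(samples):
    """Return (loss_reduction, left, right) for the best threshold on x, or None."""
    samples = sorted(samples)
    parent_loss = sse([y for _, y in samples])
    best = None
    for i in range(1, len(samples)):
        if samples[i - 1][0] == samples[i][0]:
            continue  # cannot split between identical x values
        left, right = samples[:i], samples[i:]
        gain = parent_loss - sse([y for _, y in left]) - sse([y for _, y in right])
        if best is None or gain > best[0]:
            best = (gain, left, right)
    return best


def grow_leaf_wise(samples, num_leaves, max_depth=None):
    """Grow a tree best-first; return the final leaves as (depth, samples) pairs."""
    tie = count()  # tie-breaker so the heap never compares sample lists
    leaves = []    # leaves that cannot (or may not) be split further
    heap = []      # splittable leaves, largest loss reduction first

    def push(leaf_samples, depth):
        split = None
        if max_depth is None or depth < max_depth:
            split = best_split(leaf_samples)
        if split is None or split[0] <= 0:
            leaves.append((depth, leaf_samples))
        else:
            heapq.heappush(heap, (-split[0], next(tie), depth, split))

    push(samples, 0)
    # Each iteration replaces one leaf by two, so the leaf count grows by one.
    while heap and len(leaves) + len(heap) < num_leaves:
        _, _, depth, (_, left, right) = heapq.heappop(heap)
        push(left, depth + 1)
        push(right, depth + 1)

    # Leaves still on the heap were never split; they stay as leaves.
    leaves.extend((depth, l + r) for _, _, depth, (_, l, r) in heap)
    return leaves


data = [(float(i), y) for i, y in enumerate([0, 0, 0, 0, 10, 10, 10, 20])]
leaves = grow_leaf_wise(data, num_leaves=3)
```

With `num_leaves=3` the sketch splits the root first and then only the child that still has loss to remove, so the resulting tree is deliberately unbalanced.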


### Optimal Split for Categorical Features

It is common to represent categorical features with one-hot encoding (for example, when using XGBoost), but this approach is suboptimal for tree learners. Particularly for high-cardinality categorical features, a tree built on one-hot features tends to be unbalanced and needs to grow very deep to achieve good accuracy.

Instead of one-hot encoding, the optimal solution is to split on a categorical feature by partitioning its categories into 2 subsets. If the feature has k categories, there are 2^(k-1) - 1 possible partitions, which is too many to enumerate for large k. For regression trees, however, there is an efficient solution: finding the optimal partition takes about O(k * log(k)) time.

The basic idea is to sort the categories according to the training objective at each split. More specifically, LightGBM sorts the histogram (for a categorical feature) according to its accumulated values (sum_gradient / sum_hessian) and then finds the best split on the sorted histogram.
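
The sort-then-scan idea can be sketched as follows. This is a simplified illustration under stated assumptions, not LightGBM's code: the function name `best_categorical_split` is hypothetical, every category is assumed to have a positive hessian sum, and the gain uses the unregularized score G^2 / H.

```python
def best_categorical_split(stats):
    """stats maps category -> (sum_gradient, sum_hessian) for one leaf.

    Returns (gain, left_categories) for the best split found by scanning
    the categories in order of sum_gradient / sum_hessian.
    """
    # Sort the k categories by their gradient/hessian ratio: O(k log k).
    order = sorted(stats, key=lambda c: stats[c][0] / stats[c][1])

    total_g = sum(g for g, _ in stats.values())
    total_h = sum(h for _, h in stats.values())

    def score(g, h):
        # Unregularized leaf score (a simplifying assumption of this sketch).
        return g * g / h

    best_gain, best_left = 0.0, None
    left_g = left_h = 0.0
    # Only the k - 1 prefixes of the sorted order are checked.
    for i, cat in enumerate(order[:-1]):
        g, h = stats[cat]
        left_g += g
        left_h += h
        gain = (score(left_g, left_h)
                + score(total_g - left_g, total_h - left_h)
                - score(total_g, total_h))
        if gain > best_gain:
            best_gain, best_left = gain, set(order[: i + 1])
    return best_gain, best_left


stats = {"a": (-4.0, 2.0), "b": (3.0, 1.0), "c": (-3.0, 2.0), "d": (5.0, 2.0)}
gain, left = best_categorical_split(stats)
```

For the four example categories the scan evaluates three candidate splits instead of all seven possible partitions.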

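### Histogram Algorithm

The basic idea of the histogram algorithm is to first discretize the continuous floating-point feature values into k integers and to construct a histogram of width k. While traversing the data, statistics are accumulated in the histogram, using each discretized value as the index. After a single pass over the data, the histogram holds all the required statistics, and the best split point is then found by scanning the histogram's discrete values. With XGBoost's exact (pre-sorted) method, all candidate values have to be traversed, whereas here only the k histogram values need to be scanned.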

<img src='../../Other/img/直方图算法.png'>
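
A minimal sketch of the two phases (one pass to fill the histogram, then a scan over the bins) might look like this. It assumes equal-width bins and an unregularized gain, which is a simplification chosen for the example rather than LightGBM's actual binning; the function names are hypothetical.

```python
def build_histogram(values, grads, hess, num_bins):
    """Bucket one feature into equal-width bins and accumulate per-bin sums."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0  # fall back to 1.0 for constant features
    # Discretize: each value becomes an integer bin index in [0, num_bins - 1].
    bins = [min(int((v - lo) / width), num_bins - 1) for v in values]
    hist = [[0.0, 0.0] for _ in range(num_bins)]
    for b, g, h in zip(bins, grads, hess):  # single O(#data) pass
        hist[b][0] += g
        hist[b][1] += h
    return hist, bins


def best_bin_split(hist):
    """Scan the bin boundaries in O(#bins); return (gain, last_left_bin)."""
    total_g = sum(g for g, _ in hist)
    total_h = sum(h for _, h in hist)
    best = (0.0, None)
    left_g = left_h = 0.0
    for b in range(len(hist) - 1):
        left_g += hist[b][0]
        left_h += hist[b][1]
        right_g, right_h = total_g - left_g, total_h - left_h
        if left_h == 0 or right_h == 0:
            continue  # one side would be empty
        gain = left_g**2 / left_h + right_g**2 / right_h - total_g**2 / total_h
        if gain > best[0]:
            best = (gain, b)
    return best


values = [0.1, 0.2, 0.3, 0.4, 2.1, 2.2, 2.3, 2.4]
grads = [-1, -1, -1, -1, 1, 1, 1, 1]
hess = [1] * 8
hist, bins = build_histogram(values, grads, hess, num_bins=4)
gain, split_bin = best_bin_split(hist)
```

Note that the split search never touches the raw data again: after the histogram is built, only `num_bins - 1` boundaries are evaluated.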


LightGBM uses histogram-based algorithms, which bucket continuous feature (attribute) values into discrete bins. This speeds up training and reduces memory usage. Advantages of histogram-based algorithms include the following:

* Reduced cost of calculating the gain for each split
    * Pre-sort-based algorithms have time complexity O(#data)
    * Computing the histogram has time complexity O(#data), but this involves only a fast sum-up operation. Once the histogram is constructed, a histogram-based algorithm has time complexity O(#bins), and #bins is far smaller than #data.

* Use histogram subtraction for further speedup
    * To get the histograms of one leaf in a binary tree, subtract the histograms of its neighbor (sibling) from those of its parent
    * So histograms need to be constructed for only one leaf (the one with smaller #data than its neighbor). The histograms of its neighbor then follow from histogram subtraction at small cost (O(#bins))
    <img src='../../Other/img/直方图差加速.png'> 


* Reduce memory usage
    * Replaces continuous values with discrete bins. If #bins is small, a small data type, e.g. uint8_t, can be used to store the training data
    * No need to store additional information for pre-sorting feature values

* Reduce communication cost for distributed learning
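
The histogram subtraction trick and the compact storage of binned data from the list above can be illustrated with a small sketch. The helper names (`accumulate_histogram`, `subtract_histograms`) and the toy data are assumptions made for this example; `array("B", ...)` stands in for a uint8_t buffer.

```python
from array import array


def accumulate_histogram(bin_ids, grads, hess, rows, num_bins):
    """Accumulate (sum_gradient, sum_hessian) per bin over the given rows."""
    hist = [[0.0, 0.0] for _ in range(num_bins)]
    for r in rows:
        hist[bin_ids[r]][0] += grads[r]
        hist[bin_ids[r]][1] += hess[r]
    return hist


def subtract_histograms(parent, child):
    """Sibling histogram = parent - child, bin by bin: O(#bins), no data pass."""
    return [[pg - cg, ph - ch] for (pg, ph), (cg, ch) in zip(parent, child)]


# With at most 256 bins, each binned feature value fits in one unsigned byte.
bin_ids = array("B", [0, 1, 1, 2, 3, 3, 0, 2])
grads = [1.0, -2.0, 0.5, 3.0, -1.0, 2.0, -0.5, 1.5]
hess = [1.0] * 8

small_rows = [0, 3, 6]        # the child with fewer rows: scanned directly
large_rows = [1, 2, 4, 5, 7]  # its sibling: obtained by subtraction

parent_hist = accumulate_histogram(bin_ids, grads, hess, range(8), 4)
small_hist = accumulate_histogram(bin_ids, grads, hess, small_rows, 4)
large_hist = subtract_histograms(parent_hist, small_hist)
```

Only the three rows of the smaller child are visited after the split; the larger child's histogram costs one subtraction per bin. In real floating-point data the subtracted result can differ from a direct accumulation by rounding error, which this toy data avoids.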