[New Feature] Fast Histogram Optimized Grower, 8x to 10x Speedup #1950

Closed
hcho3 opened this Issue Jan 9, 2017 · 20 comments

hcho3 (Contributor) commented Jan 9, 2017

This issue is to document a series of recent improvements in the tree grower. Related: #1673.

Faster training via feature binning

Histogram aggregation is a major computational bottleneck in tree growing. We introduce a new tree-growing method, hist, where only a subset of possible split values is considered. As in FastBDT and LightGBM, continuous features are bucketed into discrete bins. Histogram accumulation then becomes faster because it requires fewer index manipulations.
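
To illustrate the binning idea, here is a minimal sketch of quantile-based bucketing of one continuous feature. The helper make_bin_edges and the sample data are hypothetical, not xgboost internals:

```python
import numpy as np

def make_bin_edges(x, max_bin=256):
    """Quantile-based bin boundaries for one continuous feature (illustrative only)."""
    quantiles = np.linspace(0, 1, max_bin + 1)[1:-1]   # interior quantile points
    return np.unique(np.quantile(x, quantiles))

x = np.random.default_rng(0).normal(size=10_000)       # one continuous feature
edges = make_bin_edges(x, max_bin=256)
bin_ids = np.searchsorted(edges, x)                     # raw values replaced by small integer bin IDs
```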

How does the new method differ from tree_method=approx? The existing approximate splitting method in xgboost also buckets continuous features into discrete bins to speed up training. However, the approx method generates a new set of bins for each iteration, whereas the hist method re-uses the bins over multiple iterations.

The latter approach enables additional optimizations that are not possible for the former, as follows:

  • Cache of the bins: replace continuous feature values in the training data with bin IDs and cache the resulting data structure.
  • Histogram subtraction trick: to compute the histogram for a node, we simply take the difference between the histograms of its parent and its sibling (see the sketch after this list).
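
A minimal sketch of the subtraction trick, assuming a toy build_histogram helper (hypothetical, not an xgboost internal) that accumulates per-bin gradient sums:

```python
import numpy as np

def build_histogram(bin_ids, gradients, n_bins):
    """Accumulate the gradient sum of each bin over the rows in a node (toy helper)."""
    hist = np.zeros(n_bins)
    for b, g in zip(bin_ids, gradients):
        hist[b] += g
    return hist

n_bins = 4
bin_ids = np.array([0, 1, 1, 2, 3, 3])              # binned feature values of the parent's rows
grads = np.array([0.5, -0.2, 0.1, 0.4, -0.3, 0.2])  # per-row gradients
left = np.array([0, 1, 2])                          # rows routed to the left child

parent_hist = build_histogram(bin_ids, grads, n_bins)
left_hist = build_histogram(bin_ids[left], grads[left], n_bins)
right_hist = parent_hist - left_hist                # no scan over the right child's rows
```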

Besides the above improvements, some other highlights:

  • Efficient handling of sparse matrices: as elsewhere in xgboost, sparse matrices are supported natively, with effective speedups for mixed sparse+dense matrices.
  • Extensibility to other existing features of xgboost, such as monotonic constraints and language bindings.

How to use

Simply set tree_method to hist. You may also want to set max_bin, the (maximum) number of discrete bins used to bucket continuous features. By default, max_bin is set to 256. Increasing this number improves the optimality of splits at the cost of higher computation time.
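
For example, with the Python package (the data file and the remaining parameters below are placeholders rather than recommendations):

```python
import xgboost as xgb

dtrain = xgb.DMatrix('train.libsvm')    # placeholder training data
params = {
    'objective': 'binary:logistic',
    'tree_method': 'hist',              # enable the fast histogram grower
    'max_bin': 256,                     # (maximum) number of discrete bins per feature
    'eta': 0.1,
    'max_depth': 8,
}
bst = xgb.train(params, dtrain, num_boost_round=300)
```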

Flexible tree growing policies

The existing tree grower in xgboost grows a tree in a depth-wise fashion, executing all splits at one level before any splits at the next. The new grower lets you control the way new nodes are added to the tree:

  • grow_policy=depthwise (default): split at nodes closest to the root, i.e. grow depth-wise.
  • grow_policy=lossguide: split at nodes with the highest loss change. This behavior mimics that of LightGBM.

It has been reported that the lossguide policy often results in faster convergence of the loss, though there is also a risk of over-fitting (see the preliminary results below).

How to use

For now, only the hist grower supports multiple growing policies; we may extend the support to other growers in the future. So you should set tree_method to hist. The grow_policy parameter lets you select the growing policy; if unspecified, it defaults to depthwise. Two related parameters are relevant (a configuration sketch follows the list):

  • max_leaves: maximum number of nodes to be added. Relevant only for the lossguide policy.
  • max_depth: maximum depth. Behaves as usual when grow_policy is set to depthwise. If grow_policy is lossguide, max_depth may be set to zero, indicating that there is no limit on depth.
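
A sketch of a lossguide configuration with the Python package, roughly mirroring the benchmark settings below (the data file is a placeholder):

```python
import xgboost as xgb

dtrain = xgb.DMatrix('train.libsvm')    # placeholder training data
params = {
    'objective': 'binary:logistic',
    'tree_method': 'hist',
    'grow_policy': 'lossguide',
    'max_depth': 0,                     # no limit on depth under lossguide
    'max_leaves': 255,                  # cap the number of leaves instead
    'eta': 0.1,
    'min_child_weight': 100,
}
bst = xgb.train(params, dtrain, num_boost_round=300)
```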

Benchmark

Configurations

  • CPU: Intel(R) Core(TM) i7-6850K CPU @ 3.60GHz, 6 physical cores
  • Memory: 48 GB
  • Training parameters: the evaluation metric is turned on during the benchmark, since that is how users commonly run the code
    • exact: tree_method=exact, eta=0.1, gamma=1.0, max_depth=8, min_child_weight=100
    • approx: tree_method=approx, eta=0.1, gamma=1.0, max_depth=8, min_child_weight=100
    • hist + depthwise: tree_method=hist, grow_policy=depthwise, eta=0.1, gamma=1.0, max_depth=8, min_child_weight=100
    • hist + lossguide: tree_method=hist, grow_policy=lossguide, eta=0.1, gamma=1.0, max_depth=0, max_leaves=255, min_child_weight=100
    • LightGBM: learning_rate=0.1, max_bin=255, num_leaves=255, min_sum_hessian_in_leaf=100, min_data_in_leaf=0

Model accuracy

For allstate and higgs, we plot AUC (Area Under Curve) on the validation set. For yahoo, we plot NDCG (Normalized Discounted Cumulative Gain) at 10.
[Figure: validation AUC / NDCG@10 over iterations]

Notice that favoring splits with the highest loss change leads to overfitting on allstate, which is closer to a real-world insurance dataset, whereas the validation curve is more consistent on higgs, which consists of simulated data points.
We tried different options such as min_child_weight, but they did not move the curves much. This suggests that depthwise is more conservative and more robust to overfitting, and should be used when the training and validation distributions may not be identical, or when overfitting is a problem on the dataset.

Single-threaded performance

Keep in mind that the tree growing policy of LightGBM is equivalent to setting grow_policy=lossguide in xgboost, so the LightGBM column should be compared with the hist + lossguide column.

Per-iteration runtime (seconds)
The average was taken over 300 iterations.

| Dataset  | exact | approx | hist + depthwise | hist + lossguide | LightGBM |
|----------|-------|--------|------------------|------------------|----------|
| allstate | 26.47 | 14.34  | 3.82             | 5.24             | 5.41     |
| higgs    | 61.38 | 25.94  | 6.17             | 6.44             | 5.41     |
| yahoo    | 13.39 | 7.32   | 1.37             | 1.75             | 2.24     |

Cumulative runtime over 300 iterations
[Figure: cumulative single-threaded runtime over 300 iterations]

Multi-threaded performance (not yet fully optimized)

The parallel implementation of tree_method=hist still needs more work for the time being.

Per-iteration runtime (seconds)

| Dataset  | exact | approx | hist + depthwise | hist + lossguide | LightGBM |
|----------|-------|--------|------------------|------------------|----------|
| allstate | 7.50  | 4.21   | 2.39             | 2.94             | 3.26     |
| higgs    | 15.55 | 6.96   | 3.38             | 3.37             | 3.07     |
| yahoo    | 2.68  | 1.46   | 0.66             | 0.96             | 0.53     |

Cumulative runtime over 300 iterations
[Figure: cumulative multi-threaded runtime over 300 iterations]

wxchan (Contributor) commented Jan 9, 2017

Do you have time to also run a performance test on LightGBM to compare with the new feature?

hcho3 (Contributor) commented Jan 9, 2017

@wxchan I will get to it soon.

wxchan (Contributor) commented Jan 9, 2017

@hcho3 OK, thanks, looking forward to it.

tqchen changed the title from [New Feature] Fast Histogram Optimized Grower to [New Feature] Fast Histogram Optimized Grower, 8x to 10x speedup Jan 9, 2017

tqchen changed the title from [New Feature] Fast Histogram Optimized Grower, 8x to 10x speedup to [New Feature] Fast Histogram Optimized Grower, 8x to 10x Speedup Jan 9, 2017

tqchen (Member) commented Jan 9, 2017

@hetong007 @slundberg @terrytangyuan @CodingCat @khotilov @phunterlau we could use more test cases, examples, and tutorials on the language bindings.

hetong007 (Member) commented Jan 9, 2017

This is exciting! However, I will be traveling in January, so I may not start working on this until February. It would be great if you guys @khotilov and @terrytangyuan have time in the meantime.

RAMitchell (Member) commented Jan 9, 2017

Very nice work. I was planning a multi-GPU algorithm using these methods; this PR makes it a lot easier to implement. @hcho3 @tqchen did you want to collaborate on a new GPU algorithm?

Do the benchmarks above refer to the time for a single boosting iteration?

CodingCat (Member) commented Jan 10, 2017

Awesome! Will start working on it over the weekend.

terrytangyuan (Member) commented Jan 10, 2017

Very cool, @hcho3! I won't be able to get to it soon but will post here when I start the work.

Liam0205 (Contributor) commented Jan 11, 2017

Brilliant explanation! Thanks @hcho3!

tqchen added the enhancement label Jan 11, 2017

hcho3 (Contributor) commented Jan 11, 2017

@wxchan We've added performance numbers for LightGBM. Parallel performance for our new updater is not so good yet, so we are working to improve it.

wxchan (Contributor) commented Jan 11, 2017

@hcho3 How many iterations did you run? You can see a LightGBM log here. It becomes relatively faster as the number of iterations grows (200s for 200 iterations, 310s for 500).

hcho3 (Contributor) commented Jan 11, 2017

@wxchan Good point. The data in the table represents the first dozen iterations or so. Let us post a graph of cumulative runtime vs. number of iterations.

tqchen (Member) commented Jan 13, 2017

A reminder: PR #1940 is now merged.

hcho3 (Contributor) commented Jan 15, 2017

@wxchan We've added performance results over 300 iterations.

khotilov (Member) commented Jan 17, 2017

I ran a quick speed comparison of the old colmaker updater against the new fast_histmaker one using one of my datasets: dense data with some missingness, about 100K x 1000 in size, with a rare and noisy signal. I ran 1000 simple decision stumps with 0.5 subsampling on a 16-core server. In addition to the elapsed-time speedup factor, the table also shows ratios of user times relative to the single-thread user time, as well as ratios of the single-thread elapsed time to elapsed times. The significantly worse scaling of hist with nthread indicates that its parallel implementation might indeed have some room for improvement.

| nthread | speedup elapsed exact/hist | user time n-th/1-th (exact) | user time n-th/1-th (hist) | elapsed time 1-th/n-th (exact) | elapsed time 1-th/n-th (hist) |
|---------|----------------------------|-----------------------------|----------------------------|--------------------------------|-------------------------------|
| 1       | x3.03                      | x1                          | x1                         | x1                             | x1                            |
| 2       | x1.92                      | x1.004                      | x1.47                      | x1.98                          | x1.25                         |
| 8       | x1.48                      | x1.2                        | x2.0                       | x6.5                           | x3.2                          |
| 15      | x1.27                      | x1.5                        | x3.0                       | x9.2                           | x3.8                          |
hcho3 (Contributor) commented Jan 17, 2017

@khotilov Thanks a lot for the detailed report. We are currently working hard to improve the parallel implementation. Stay tuned for more updates.

CodingCat (Member) commented May 14, 2017

Does fast histogram really work with Rabit? I didn't find any Rabit operation in the current implementation.

tqchen (Member) commented May 14, 2017

@CodingCat I double-checked; we will need to add the allreduce call to the end of the BuildHist function. @hcho3 https://github.com/dmlc/xgboost/blob/master/src/common/hist_util.cc

holyglenn commented Jun 2, 2017

I wonder if the optimization to the parallel version has been completed?

hcho3 (Contributor) commented Jul 5, 2017

@holyglenn I've submitted a patch to improve multithreaded scaling.

tqchen removed the enhancement label Jul 4, 2018

tqchen closed this Jul 4, 2018
