
Performance optimizations for CPUs [part 2] #4278

Closed
wants to merge 42 commits

Conversation

SmirnovEgorRu (Contributor)

This PR contains performance improvements for multi-core CPUs. Most of these optimizations have been taken from Intel® DAAL and adapted for XGBoost. The changes are as follows:

  1. Added nested parallelism over tree nodes for grow_policy=depthwise on CPU. For this, QuantileHistMaker has been reworked to be thread-safe. It provides better scaling with the number of hardware cores and the number of levels in the tree (a minimal sketch is shown after this list).
  2. Additional parallelism with grow_policy=lossguide.
  3. General optimizations of the ApplySplit, EvaluateSplit, BuildHist, and InitData functions.
  4. Moved computation of gradient statistics from InitNewNode into BuildHist, where all the required data is already in hot cache. This gives a good boost for sparse data.
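As a rough illustration of the nested parallelism in item 1 (this is a hypothetical sketch, not the actual QuantileHistMaker code; `BuildHistForRows`, `BuildHistForLevel`, and `kRowsPerBlock` are made-up names), the idea is an outer OpenMP loop over the nodes of one tree level combined with an inner loop over row blocks inside each node's histogram build:

```cpp
// Hypothetical sketch of nested parallelism over the nodes of one tree level.
// Assumes each node accumulates into its own histogram buffer, so distinct
// nodes can be processed concurrently (the thread-safety rework mentioned above).
#include <omp.h>
#include <algorithm>
#include <cstdio>
#include <vector>

constexpr int kRowsPerBlock = 1024;  // made-up block size for the inner loop

// Placeholder for per-node histogram accumulation over rows [begin, end).
void BuildHistForRows(int node_id, int begin, int end) {
  std::printf("node %d: rows [%d, %d) on thread %d\n",
              node_id, begin, end, omp_get_thread_num());
}

void BuildHistForLevel(const std::vector<int>& level_nodes,
                       const std::vector<int>& rows_in_node) {
  // Allow two levels of parallel regions (older runtimes may also need
  // omp_set_nested(1), which is deprecated in OpenMP 5.0).
  omp_set_max_active_levels(2);

  // Outer level: one task per node of the current tree level.
  #pragma omp parallel for schedule(dynamic)
  for (int i = 0; i < static_cast<int>(level_nodes.size()); ++i) {
    const int node_id = level_nodes[i];
    const int n_rows = rows_in_node[i];
    const int n_blocks = (n_rows + kRowsPerBlock - 1) / kRowsPerBlock;

    // Inner level: split this node's rows into blocks.
    #pragma omp parallel for schedule(static)
    for (int block = 0; block < n_blocks; ++block) {
      const int begin = block * kRowsPerBlock;
      const int end = std::min(n_rows, begin + kRowsPerBlock);
      BuildHistForRows(node_id, begin, end);
    }
  }
}

int main() {
  BuildHistForLevel(/*level_nodes=*/{0, 1, 2, 3},
                    /*rows_in_node=*/{4096, 4096, 2048, 2048});
  return 0;
}
```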

Performance measurements and more details will follow a bit later.

@RAMitchell (Member)

Thanks @SmirnovEgorRu for the PR. My first thought is that this PR is too large for us to review comprehensively all at once. I think it would be better for you to isolate the key improvements and stage them into a series of PRs; for example, your four bullet points could each be a PR.

This algorithm is heavily used so I would be uncomfortable making sweeping changes in one go.

@CodingCat (Member)

Echoing @RAMitchell: can we also have some benchmark results here?

@Laurae2 (Contributor) commented Mar 21, 2019

Just to give some numbers: it seems @SmirnovEgorRu's PR improves the scalability of the fast histogram method a bit when using many threads. It may come at the cost of single-threaded performance (in the worst-case scenario, I think).

@SmirnovEgorRu Which compiler are you using and what would be a proper setup to test the performance improvements?

Some results using the same setup as #3810 (comment). This is probably not the optimal scenario for this PR to shine; I think it might be the worst combination possible: fully dense 50M x 100, 50 rounds, depth 8, 255 leaves, depthwise grow_policy.

Commit 47edfa2 (this PR): [timing screenshot]

Commit 5f151c5 (old, first improvement by @SmirnovEgorRu): [timing screenshot]

Commit a2dc929 (for reference, before any improvement): [timing screenshot]

Some comparison values (time + scaling) using only the training time (from end of 1st iteration to last):

| Threads | 47edfa2 (this PR) | 5f151c5 (old) | a2dc929 (older) |
| --- | --- | --- | --- |
| 1 | 2624.93s (100%) | 2450.60s (100%) | 2109.41s (100%) |
| 2 | 1574.67s (167%) | 1417.72s (173%) | 1397.04s (151%) |
| 3 | 1173.21s (224%) | 1124.53s (218%) | 1150.50s (183%) |
| 4 | 978.45s (268%) | 892.52s (275%) | 895.65s (236%) |
| 6 | 693.01s (379%) | 666.02s (368%) | 680.57s (310%) |
| 9 | 545.38s (481%) | 544.25s (450%) | 603.88s (349%) |
| 18 | 412.61s (636%) | 436.25s (562%) | 434.40s (486%) |
| 36 | 375.63s (699%) | 380.01s (645%) | 406.73s (519%) |

@SmirnovEgorRu (Contributor, Author)

Hi @RAMitchell @CodingCat @Laurae2, sorry for the long delay.

Regarding performance: I have tuned this further, including scalability across cores. Here is some of the data I obtained:

| Data set | XGBoost 0.81 | XGBoost 0.82 | This version | Intel DAAL 2019u3 |
| --- | --- | --- | --- | --- |
| higgs1m | 19828 | 11433 | 3698 | 3071 |
| higgs10m | 93442 | 82989 | 31836 | 30182 |
| mslr (5 classes) | 379113 | 216402 | 134851 | 48640 |

HW: Intel Xeon Platinum 8180 @ 2.5 GHz, 2 sockets, 28 cores per socket.
Parameters: 200 iterations, hist method with 256 bins, depthwise grow policy with a maximum depth of 6, learning rate = 0.3. Objective: hinge for higgs, softmax for mslr.
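For reference, here is a minimal sketch of how the parameters above map onto the XGBoost C API (this is not the benchmark code from this thread; the input file name is a placeholder, and the hinge objective shown is the one quoted for higgs):

```cpp
#include <xgboost/c_api.h>

int main() {
  DMatrixHandle dtrain;
  XGDMatrixCreateFromFile("higgs1m.libsvm", /*silent=*/1, &dtrain);  // placeholder path

  BoosterHandle booster;
  DMatrixHandle cache[] = {dtrain};
  XGBoosterCreate(cache, 1, &booster);

  // The benchmark parameters quoted above.
  XGBoosterSetParam(booster, "tree_method", "hist");
  XGBoosterSetParam(booster, "max_bin", "256");
  XGBoosterSetParam(booster, "grow_policy", "depthwise");
  XGBoosterSetParam(booster, "max_depth", "6");
  XGBoosterSetParam(booster, "learning_rate", "0.3");
  XGBoosterSetParam(booster, "objective", "binary:hinge");  // "multi:softmax" for mslr

  for (int iter = 0; iter < 200; ++iter) {
    XGBoosterUpdateOneIter(booster, iter, dtrain);
  }

  XGBoosterFree(booster);
  XGDMatrixFree(dtrain);
  return 0;
}
```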

At the moment I see an issue with the multi-class case: most of the time is spent in prediction rather than training. I know how to fix it and bring it closer to Intel DAAL performance, but probably as a next step. For regression and binary classification I see a good speedup.

As for splitting this into separate PRs, I will do that.

SmirnovEgorRu mentioned this pull request on May 2, 2019
@hcho3 (Collaborator) commented May 25, 2019

Closing this in favor of #4433

hcho3 closed this on May 25, 2019