-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance optimizations for CPUs [part 2] #4278
Conversation
Thanks @SmirnovEgorRu for the PR. My first thoughts are that this PR is too large for us to comprehensively review all at once. I think it would be better for you to isolate key improvements and stage them into a series of PRs - for example your 4 bullet points could each be a PR. This algorithm is heavily used so I would be uncomfortable making sweeping changes in one go. |
echo on @RAMitchell , can we also have some benchmark results here? |
Just for some numbers, it seems @SmirnovEgorRu 's PR improves a bit the scalability of fast histogram when using many threads. It may come at singlethreaded performance (in the worst scenario, I think). @SmirnovEgorRu Which compiler are you using and what would be a proper setup to test the performance improvements? Some results using the same setup as #3810 (comment) - this is probably not the optimal scenario for this PR to shine I think it might be the worst combination possible (full dense 50M x 100), 50 rounds, 8 depth, 255 leaves, depthwise grow_policy): Commit 47edfa2 (this PR): Commit 5f151c5 (old, 1st improvement @SmirnovEgorRu ): Commit a2dc929 (for reference, before any improvement): Some comparison values (time + scaling) using only the training time (from end of 1st iteration to last):
|
Hi @RAMitchell @CodingCat @Laurae2 About performance. I tuned this additionally, include scalability by cores. Some data obtained by me:
HW: Intel Xeon Platinum 8180 @2.5Hz, 2 sockets, 28 cores per socket. At the moment I see an issue with multi-class case - prediction spends a lot of time, not training. I know how to fix it and make it closer to Intel DAAL performance, but probably as the next step. What about separate PR - I will do this. |
Closing this in favor of #4433 |
This PR should contains performance improvement for multi-core CPUs. Most of these optimizations have been taken from Intel® DAAL and adopted for XGBoost. Here are following changes:
Performance measurements and more details will be provided by me a little bit later.