
Performance optimizations for CPUs [part 2] #4278

Closed
wants to merge 42 commits

Conversation

SmirnovEgorRu (Contributor)

This PR contains performance improvements for multi-core CPUs. Most of these optimizations have been taken from Intel® DAAL and adapted for XGBoost. The changes are as follows:

  1. Added nested parallelism over tree nodes for grow_policy=depthwise on CPU. For this, QuantileHistMaker has been reworked to be thread-safe. It provides better scaling with the number of hardware cores and the number of levels in the tree (a minimal sketch is shown after this list).
  2. Additional parallelism with grow_policy=lossguide.
  3. General optimizations of the ApplySplit, EvaluateSplit, BuildHist, and InitData functions.
  4. Moved computation of gradient statistics from InitNewNode into BuildHist, where all the required data is already in hot cache. This gives a good boost for sparse data.
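As a rough illustration of the nested parallelism in item 1 (this is a hypothetical sketch, not the actual QuantileHistMaker code; `BuildHistForRows`, `BuildHistForLevel`, and `kRowsPerBlock` are made-up names), the idea is an outer OpenMP loop over the nodes of one tree level combined with an inner loop over row blocks inside each node's histogram build:

```cpp
// Hypothetical sketch of nested parallelism over the nodes of one tree level.
// Assumes each node accumulates into its own histogram buffer, so distinct
// nodes can be processed concurrently (the thread-safety rework mentioned above).
#include <omp.h>
#include <algorithm>
#include <cstdio>
#include <vector>

constexpr int kRowsPerBlock = 1024;  // made-up block size for the inner loop

// Placeholder for per-node histogram accumulation over rows [begin, end).
void BuildHistForRows(int node_id, int begin, int end) {
  std::printf("node %d: rows [%d, %d) on thread %d\n",
              node_id, begin, end, omp_get_thread_num());
}

void BuildHistForLevel(const std::vector<int>& level_nodes,
                       const std::vector<int>& rows_in_node) {
  // Allow two levels of parallel regions (older runtimes may also need
  // omp_set_nested(1), which is deprecated in OpenMP 5.0).
  omp_set_max_active_levels(2);

  // Outer level: one task per node of the current tree level.
  #pragma omp parallel for schedule(dynamic)
  for (int i = 0; i < static_cast<int>(level_nodes.size()); ++i) {
    const int node_id = level_nodes[i];
    const int n_rows = rows_in_node[i];
    const int n_blocks = (n_rows + kRowsPerBlock - 1) / kRowsPerBlock;

    // Inner level: split this node's rows into blocks.
    #pragma omp parallel for schedule(static)
    for (int block = 0; block < n_blocks; ++block) {
      const int begin = block * kRowsPerBlock;
      const int end = std::min(n_rows, begin + kRowsPerBlock);
      BuildHistForRows(node_id, begin, end);
    }
  }
}

int main() {
  BuildHistForLevel(/*level_nodes=*/{0, 1, 2, 3},
                    /*rows_in_node=*/{4096, 4096, 2048, 2048});
  return 0;
}
```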

Performance measurements and more details will follow a bit later.

@RAMitchell (Member)

Thanks @SmirnovEgorRu for the PR. My first thought is that this PR is too large for us to review comprehensively all at once. I think it would be better for you to isolate the key improvements and stage them into a series of PRs; for example, your four bullet points could each be a PR.

This algorithm is heavily used so I would be uncomfortable making sweeping changes in one go.

@CodingCat (Member)

Echoing @RAMitchell: can we also have some benchmark results here?

@Laurae2 (Contributor) commented Mar 21, 2019

Just to give some numbers: it seems @SmirnovEgorRu's PR improves the scalability of the fast histogram method a bit when using many threads. It may come at the cost of single-threaded performance (in the worst-case scenario, I think).

@SmirnovEgorRu Which compiler are you using and what would be a proper setup to test the performance improvements?

Some results using the same setup as #3810 (comment). This is probably not the optimal scenario for this PR to shine; I think it might be the worst combination possible: fully dense 50M x 100, 50 rounds, depth 8, 255 leaves, depthwise grow_policy.

Commit 47edfa2 (this PR): [timing screenshot]

Commit 5f151c5 (old, first improvement by @SmirnovEgorRu): [timing screenshot]

Commit a2dc929 (for reference, before any improvement): [timing screenshot]

Some comparison values (time + scaling) using only the training time (from end of 1st iteration to last):

| Threads | 47edfa2 (this PR) | 5f151c5 (old) | a2dc929 (older) |
| --- | --- | --- | --- |
| 1 | 2624.93s (100%) | 2450.60s (100%) | 2109.41s (100%) |
| 2 | 1574.67s (167%) | 1417.72s (173%) | 1397.04s (151%) |
| 3 | 1173.21s (224%) | 1124.53s (218%) | 1150.50s (183%) |
| 4 | 978.45s (268%) | 892.52s (275%) | 895.65s (236%) |
| 6 | 693.01s (379%) | 666.02s (368%) | 680.57s (310%) |
| 9 | 545.38s (481%) | 544.25s (450%) | 603.88s (349%) |
| 18 | 412.61s (636%) | 436.25s (562%) | 434.40s (486%) |
| 36 | 375.63s (699%) | 380.01s (645%) | 406.73s (519%) |

@SmirnovEgorRu (Contributor, Author)

Hi @RAMitchell @CodingCat @Laurae2, sorry for the long delay.

Regarding performance: I have tuned this further, including scalability across cores. Here is some of the data I obtained:

| Data set | XGBoost 0.81 | XGBoost 0.82 | This version | Intel DAAL 2019u3 |
| --- | --- | --- | --- | --- |
| higgs1m | 19828 | 11433 | 3698 | 3071 |
| higgs10m | 93442 | 82989 | 31836 | 30182 |
| mslr (5 classes) | 379113 | 216402 | 134851 | 48640 |

HW: Intel Xeon Platinum 8180 @ 2.5 GHz, 2 sockets, 28 cores per socket.
Parameters: 200 iterations, hist method with 256 bins, depthwise grow policy with a maximum depth of 6, learning rate = 0.3. Objective: hinge for higgs, softmax for mslr.
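For reference, here is a minimal sketch of how the parameters above map onto the XGBoost C API (this is not the benchmark code from this thread; the input file name is a placeholder, and the hinge objective shown is the one quoted for higgs):

```cpp
#include <xgboost/c_api.h>

int main() {
  DMatrixHandle dtrain;
  XGDMatrixCreateFromFile("higgs1m.libsvm", /*silent=*/1, &dtrain);  // placeholder path

  BoosterHandle booster;
  DMatrixHandle cache[] = {dtrain};
  XGBoosterCreate(cache, 1, &booster);

  // The benchmark parameters quoted above.
  XGBoosterSetParam(booster, "tree_method", "hist");
  XGBoosterSetParam(booster, "max_bin", "256");
  XGBoosterSetParam(booster, "grow_policy", "depthwise");
  XGBoosterSetParam(booster, "max_depth", "6");
  XGBoosterSetParam(booster, "learning_rate", "0.3");
  XGBoosterSetParam(booster, "objective", "binary:hinge");  // "multi:softmax" for mslr

  for (int iter = 0; iter < 200; ++iter) {
    XGBoosterUpdateOneIter(booster, iter, dtrain);
  }

  XGBoosterFree(booster);
  XGDMatrixFree(dtrain);
  return 0;
}
```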

At the moment I see an issue with the multi-class case: most of the time is spent in prediction rather than training. I know how to fix it and bring it closer to Intel DAAL performance, but probably as a next step. For regression and binary classification I see a good speedup.

As for splitting this into separate PRs, I will do that.

SmirnovEgorRu mentioned this pull request on May 2, 2019
@hcho3 (Collaborator) commented May 25, 2019

Closing this in favor of #4433

hcho3 closed this on May 25, 2019