Optimized ApplySplit and UpdatePredictCache functions on CPU #5244

SmirnovEgorRu · 2020-01-29T01:23:45Z

This PR includes the changes from #5156 and will be rebased once that PR is merged.
It achieves the same performance as before #5104 was reverted:

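| higgs1m | ApplySplit | EvaluateSplit | BuildHist | SyncHistogram | Prediction |
|---|---|---|---|---|---|
| Master | 33 | 29 | 90 | 26 | 3 |
| Before reverting | 3.7 | 3.5 | 6.2 | 0.0 | 1.6 |
| This PR | 3.65 | 1.3975 | 6.75 | 1.86 | 0.74 |

| airline-ohe | ApplySplit | EvaluateSplit | BuildHist | SyncHistogram | Prediction |
|---|---|---|---|---|---|
| Master | 26 | 27 | 67 | 12 | 2 |
| Before reverting | 9.0 | 6.1 | 28.8 | 0.0 | 0.7 |
| This PR | 4.579 | 0.75 | 29.05 | 0.669 | 0.352 |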

hcho3 · 2020-01-29T03:17:16Z

@SmirnovEgorRu Can you try running this with the URL dataset too?

SmirnovEgorRu · 2020-01-29T17:23:34Z

@hcho3, sure:

| Branch | Time of Update, sec |
|---|---|
| master | 139.3 |
| this PR | 29.1 |

Memory consumption is also similar to before (21231488 KB).
HW: Xeon E5-2680 v3 @ 2.50GHz, 2 sockets, 14 cores per socket

trivialfis · 2020-01-31T17:51:40Z

@SmirnovEgorRu Be careful with the change of author in git commit. ;-)

SmirnovEgorRu · 2020-02-01T21:14:59Z

I collected the same measurements as for the previous PR.

Higgs dataset:

| nthreads (w/ 1000 trees) | 1 | 8 | 24 | 48 | 96 |
|---|---|---|---|---|---|
| This PR, sec | 207.7 | 44.3 | 22.2 | 18.1 | 26.5 |
| Master, sec | 280.9 | 87.4 | 48.4 | 44.8 | 45.3 |
| Memory PR, KB | 1245208 | 1284828 | 1293456 | 1292076 | 1302464 |
| Memory master, KB | 1241536 | 1272884 | 1294728 | 1300284 | 1314112 |
| LogLoss this PR | 0.525144 | 0.525144 | 0.525144 | 0.525144 | 0.525144 |
| LogLoss master | 0.525144 | 0.525144 | 0.525144 | 0.525144 | 0.525144 |

| niter (w/ 48 threads) | 50 | 200 | 500 | 1000 |
|---|---|---|---|---|
| This PR, sec | 2.0 | 4.8 | 9.9 | 18.2 |
| Master, sec | 3.5 | 10.7 | 22.7 | 44.8 |
| Memory PR, KB | 1292728 | 1292756 | 1293332 | 1299388 |
| Memory master, KB | 1289544 | 1301496 | 1293892 | 1300076 |
| LogLoss this PR | 0.566303 | 0.543322 | 0.531112 | 0.525144 |
| LogLoss master | 0.566303 | 0.543322 | 0.531112 | 0.525144 |

Parameters:

{'alpha': 0.9, 'max_bin': 256, 'scale_pos_weight': 2,
 'learning_rate': 0.1, 'reg_lambda': 1, 'min_child_weight': 0,
 'max_depth': 8, 'max_leaves': 2**8, 'objective': 'binary:logistic'}

Airline dataset:

| nthreads (w/ 1000 trees) | 1 | 8 | 24 | 48 | 96 |
|---|---|---|---|---|---|
| This PR, sec | 953.1 | 187.9 | 83.3 | 60.1 | 64.3 |
| Master, sec | 953.7 | 211.4 | 105.0 | 94.6 | 87.1 |
| Memory PR, KB | 25387904 | 25549072 | 25562984 | 25567208 | 25569168 |
| Memory master, KB | 25388064 | 25548864 | 25562324 | 25567528 | 25572872 |
| LogLoss this PR | 0.461403 | 0.461403 | 0.461403 | 0.461403 | 0.461403 |
| LogLoss master | 0.461403 | 0.461403 | 0.461403 | 0.461403 | 0.461403 |

| niter (w/ 48 threads) | 50 | 200 | 500 | 1000 |
|---|---|---|---|---|
| This PR, sec | 27.2872 | 37.7223 | 52.9934 | 73.568467 |
| Master, sec | 28.8187 | 40.0345 | 60.4123 | 94.6476 |
| Memory PR, KB | 25566544 | 25566472 | 25566640 | 25566340 |
| Memory master, KB | 25567788 | 25561672 | 25567108 | 25567068 |
| LogLoss this PR | 0.478638 | 0.469505 | 0.465152 | 0.461403 |
| LogLoss master | 0.478638 | 0.469505 | 0.465152 | 0.461403 |

Parameters:

{'alpha': 0.9, 'max_bin': 256, 'scale_pos_weight': 2,
 'learning_rate': 0.1, 'reg_lambda': 1, 'min_child_weight': 0,
 'max_depth': 8, 'max_leaves': 2**8, 'objective': 'binary:logistic'}

URL dataset:

| nthreads (10 iter) | 8 | 24 | 48 | 96 |
|---|---|---|---|---|
| This PR, sec | 58.7 | 40.7 | 43.1 | 49.9 |
| Master, sec | 60.7 | 41.4 | 43.8 | 51.2 |
| Memory PR, GB | 18.84 | 20.17 | 22.25 | 26.30 |
| Memory master, GB | 18.84 | 20.16 | 22.23 | 26.28 |

Parameters:

{'max_depth': 6, 'tree_method': 'hist'}

Distributed mode on Mortgage data set:

I used a local cluster to test the performance of the distributed case.

| Mortgage 2000Q1 | 2 workers, 24 threads per worker | 48 workers, 2 threads per worker |
|---|---|---|
| Master, sec | 273.7 | 198.0 |
| This PR, sec | 253.0 | 193.5 |

| Mortgage 2000Q1 | 2 workers, 24 threads per worker | 48 workers, 2 threads per worker |
|---|---|---|
| Master, rmse | 9.33264 | 9.31267 |
| This PR, rmse | 9.33264 | 9.31267 |

Extended list of benchmarks:

| Dataset | higgs1m | Letters | Airline-ohe | MNIST | MSRank-30K | Mortgage |
|---|---|---|---|---|---|---|
| Before reverting, sec | 15.5 | 10.3 | 55.3 | 69.5 | 99.1 | 18.3 |
| Current PR, sec | 14.7 | 10.1 | 59.4 | 80.5 | 111.6 | 19.0 |
| Master, sec | 40.2 | 15.5 | 91.0 | 97.1 | 180.4 | 37.9 |
| Gain this PR vs. master | 2.7 | 1.5 | 1.5 | 1.2 | 1.6 | 2.0 |

| Dataset | higgs1m | Letters | Airline-ohe | MNIST | MSRank-30K | Mortgage |
|---|---|---|---|---|---|---|
| LogLoss\RMSE, this PR | 0.525167381247577 | 0.016770285209655 | 0.461402758989979 | 0.072397198881340 | 0.802009880542755 | 0.096547365188599 |
| LogLoss\RMSE, master | 0.525167381247577 | 0.016770285209655 | 0.461402758989979 | 0.072397198881340 | 0.802009880542755 | 0.096547365188599 |

OMP env: OMP_NUM_THREADS=48 OMP_PLACES={0}:96:1

HW

AWS c5.metal, CLX 8275 @ 3.0GHz, 24 cores per socket, 2 sockets, HT: on, 96 threads in total

SmirnovEgorRu · 2020-02-02T01:33:15Z

@hcho3, I have a question related to CI.
I see one Python test failing in CI with std::bad_alloc. I tried to reproduce it locally, but all tests pass and peak memory usage is only ~600 MB. I suspect it is a sporadic problem; this test also passed before my small code refactoring. Could you please restart it?

Also, as you can see above, I tested many things: scaling with the number of threads, scaling with niter, memory usage, different data sets (both dense and sparse), distributed mode, and accuracy.

I see performance improvements in many cases and no performance degradation; memory consumption is even slightly lower in many cases, and there is no accuracy loss in any case. I would also like to propose merging this into the 1.0 release branch: I see a 1.8x improvement on average across data sets, which would make a good feature for the major release. I understand that the release branch has already been created, but I tried to cover all the problematic areas in my additional tests, and I don't see any issues that could affect the product (assuming the CI failure described above is not a real issue).
What is your opinion? If I need to check anything else, I'm ready to do so.

| Dataset | higgs1m | Letters | Airline-ohe | MNIST | MSRank-30K | Mortgage | Average gain |
|---|---|---|---|---|---|---|---|
| Gain this PR vs. master | 2.7 | 1.5 | 1.5 | 1.2 | 1.6 | 2.0 | 1.8 |
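
The gain figures appear to be the ratio of master time to PR time from the extended benchmark table. A minimal sketch that recomputes them under that assumption (the names `kMasterSec`, `kThisPrSec`, `Gain` and `AverageGain` are made up for illustration):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Benchmark times in seconds, copied from the "Extended list of benchmarks"
// table, in the order: higgs1m, Letters, Airline-ohe, MNIST, MSRank-30K, Mortgage.
constexpr std::array<double, 6> kMasterSec = {40.2, 15.5, 91.0, 97.1, 180.4, 37.9};
constexpr std::array<double, 6> kThisPrSec = {14.7, 10.1, 59.4, 80.5, 111.6, 19.0};

// Assumption: "gain" means master time divided by PR time.
inline double Gain(std::size_t dataset) {
  assert(dataset < kMasterSec.size());
  return kMasterSec[dataset] / kThisPrSec[dataset];
}

// Arithmetic mean of the per-dataset gains.
inline double AverageGain() {
  double sum = 0.0;
  for (std::size_t i = 0; i < kMasterSec.size(); ++i) {
    sum += Gain(i);
  }
  return sum / static_cast<double>(kMasterSec.size());
}
```

With these inputs `Gain(0)` is roughly 2.73 and `AverageGain()` roughly 1.77, consistent with the rounded 2.7 and 1.8 reported in the table.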

hcho3 · 2020-02-02T01:40:23Z

No, it would take a fair amount of time for us to review this PR, and it seems risky to approve this large magnitude PR so close to 1.0. Let us include it in the next 1.1 release. We plan to make a new release every 2 months or so.

@trivialfis After 1.0, I am thinking of preparing 1.1 in about 6 weeks. There are a few fixes that I’d like to see. WDYT?

trivialfis · 2020-02-05T18:39:21Z

@hcho3 Sorry, missed this one. Agreed. Let me know if I can help.

Briefly looked into this PR, need some more time to understand the changes.

* Remove SimpleArray as it's only used in column matrix, and resize is only called once per tree.
* Reduce the number of parameters, specifically by computing prefetching at compile time.

trivialfis

Performance improvement is great! Thanks for your wonderful work on hist algorithm optimization. Here are a few things about code structure, please see inlined comments too.

* DMatrix should rarely be a parameter of the builder, as the builder uses only the gradient index. So I believe most of the pointers/references to DMatrix are unused variables, or are used only for accessing meta info. Please remove them.
* Please reduce the use of ibegin, iend and offset as function parameters. You can pass a named Span, or the Range1d you defined, as a parameter and then expand the pointers inside the function scope. That makes it clearer what those parameters point to.
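
As a rough illustration of the second point, compare a signature that takes raw begin/end pointers with one that takes a single named view. `RowSpan`, `SumRowsRaw` and `SumRows` below are hypothetical stand-ins, not XGBoost's `Span` or `Range1d`:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical minimal view over a contiguous block of row indices.
struct RowSpan {
  const std::size_t* data;
  std::size_t size;
};

// Before: the caller has to know that the two pointers delimit one range.
inline std::size_t SumRowsRaw(const std::size_t* ibegin, const std::size_t* iend) {
  std::size_t sum = 0;
  for (const std::size_t* it = ibegin; it != iend; ++it) {
    sum += *it;
  }
  return sum;
}

// After: one named parameter; the pointers are expanded inside the function.
inline std::size_t SumRows(RowSpan row_indices) {
  assert(row_indices.data != nullptr || row_indices.size == 0);
  const std::size_t* ibegin = row_indices.data;
  const std::size_t* iend = row_indices.data + row_indices.size;
  return SumRowsRaw(ibegin, iend);
}
```

The inner loop is unchanged; only the function boundary differs, so the call site documents which buffer the range belongs to.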

src/common/hist_util.cc

src/common/hist_util.h

src/tree/updater_quantile_hist.cc

mli · 2020-02-15T20:44:03Z

Codecov Report

Merging #5244 into master will increase coverage by 2.10%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #5244      +/-   ##
==========================================
+ Coverage   81.66%   83.76%   +2.10%     
==========================================
  Files          11       11              
  Lines        2389     2409      +20     
==========================================
+ Hits         1951     2018      +67     
+ Misses        438      391      -47

| Impacted Files | Coverage Δ | |
|---|---|---|
| python-package/xgboost/libpath.py | `55.55% <0.00%> (-4.45%)` | ⬇️ |
| python-package/xgboost/sklearn.py | `90.88% <0.00%> (+0.96%)` | ⬆️ |
| python-package/xgboost/dask.py | `90.30% <0.00%> (+2.67%)` | ⬆️ |
| python-package/xgboost/tracker.py | `93.97% <0.00%> (+15.66%)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update fe8d72b...3df344f.

SmirnovEgorRu · 2020-02-18T07:36:40Z

@trivialfis, thank you very much for the review.
I have applied your comments, e.g. replaced ibegin and iend with a Span in hist building.

SmirnovEgorRu · 2020-02-19T21:50:27Z

@trivialfis, @hcho3, are any further changes required?

trivialfis · 2020-02-19T22:25:06Z

I still need to understand the multiply-by-2 expression and try to make it obvious. Will keep you posted.

src/common/row_set.h

src/common/hist_util.cc

src/tree/updater_quantile_hist.cc

src/tree/updater_quantile_hist.h

tests/cpp/common/test_partition_builder.cc

SmirnovEgorRu · 2020-02-24T16:30:09Z

@trivialfis, I added an explanation in the code.

SmirnovEgorRu · 2020-02-24T21:26:21Z

One CI step failed, but the error is:

urllib.error.URLError: <urlopen error [Errno 54] Connection reset by peer>

Could we restart this?

codecov-io · 2020-02-24T22:24:32Z

Codecov Report

Merging #5244 into master will decrease coverage by 0.01%.
The diff coverage is 88.57%.

@@            Coverage Diff             @@
##           master    #5244      +/-   ##
==========================================
- Coverage   83.76%   83.75%   -0.02%     
==========================================
  Files          11       11              
  Lines        2409     2413       +4     
==========================================
+ Hits         2018     2021       +3     
- Misses        391      392       +1

| Impacted Files | Coverage Δ | |
|---|---|---|
| python-package/xgboost/dask.py | `90.3% <ø> (ø)` | ⬆️ |
| python-package/xgboost/sklearn.py | `90.88% <100%> (ø)` | ⬆️ |
| python-package/xgboost/libpath.py | `55.55% <33.33%> (ø)` | ⬆️ |
| python-package/xgboost/__init__.py | `86.36% <0%> (-2.53%)` | ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 208d251...e973ecc.

SmirnovEgorRu · 2020-02-27T11:36:47Z

@trivialfis @RAMitchell, do you have any new comments/concerns?

RAMitchell

LGTM. Defer to others if they have any issues.

RAMitchell · 2020-02-27T20:18:10Z

src/tree/updater_quantile_hist.cc

+
+  const uint32_t missing_val = std::numeric_limits<uint32_t>::max();
+
+  for (auto rid : rid_span) {


Just so you are aware, iterating through a Span implies boundary checks and may be slower than a standard for loop and accessing memory with pointers.

If this section is not performance critical you should absolutely iterate over span. I try to avoid optimising things that do not have visible effect on runtime.

SmirnovEgorRu · 2020-02-29

I tested performance again and did not observe any regression versus the previous version.
So let's keep it as is (the additional checks do no harm if they don't affect performance).
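
To make the trade-off concrete, here is a minimal sketch with a hypothetical `CheckedSpan` whose element access validates the index on every call, next to an equivalent raw-pointer loop. It is not XGBoost's `Span`; it only illustrates where the extra checks come from.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical view whose element access is bounds-checked.
class CheckedSpan {
 public:
  CheckedSpan(const std::uint32_t* data, std::size_t size)
      : data_(data), size_(size) {
    assert(data != nullptr || size == 0);
  }
  std::size_t size() const { return size_; }
  const std::uint32_t* data() const { return data_; }
  std::uint32_t at(std::size_t i) const {
    if (i >= size_) throw std::out_of_range("CheckedSpan index out of range");
    return data_[i];
  }

 private:
  const std::uint32_t* data_;
  std::size_t size_;
};

// Checked traversal: one bounds test per element.
inline std::uint64_t SumChecked(const CheckedSpan& s) {
  std::uint64_t sum = 0;
  for (std::size_t i = 0; i < s.size(); ++i) {
    sum += s.at(i);
  }
  return sum;
}

// Unchecked traversal: the pointers are extracted once, with no per-element test.
inline std::uint64_t SumUnchecked(const CheckedSpan& s) {
  std::uint64_t sum = 0;
  const std::uint32_t* p = s.data();
  const std::uint32_t* end = p + s.size();
  for (; p != end; ++p) {
    sum += *p;
  }
  return sum;
}
```

Both functions return the same sum. Whether the per-element check is measurable depends on the compiler and the loop body, which is why measuring, as done in the reply above, is the sensible way to decide.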

RAMitchell · 2020-02-27T20:21:01Z

src/common/hist_util.cc

+                           GHistRow hist) {
+  const size_t size = row_indices.Size();
+  const size_t* rid = row_indices.begin;
+  const float* pgh = reinterpret_cast<const float*>(gpair.data());


pgh is not meaningful: https://google.github.io/styleguide/cppguide.html#General_Naming_Rules
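
One way to act on this comment, as a sketch only: give the flattened gradient-pair pointer a descriptive name and make the stride of 2 explicit. The `GradPair` struct and every name below are illustrative assumptions, not the code merged in this PR; the sketch assumes each pair stores a gradient followed by a hessian as two consecutive floats.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical gradient pair: two consecutive floats.
struct GradPair {
  float grad;
  float hess;
};
static_assert(sizeof(GradPair) == 2 * sizeof(float),
              "pair must be two packed floats");

// Sums gradients and hessians of the selected rows through a flat float view.
inline void SumGradPairs(const std::vector<GradPair>& gpair,
                         const std::vector<std::size_t>& row_indices,
                         double* grad_sum, double* hess_sum) {
  // Each pair occupies kFloatsPerPair floats, hence the multiplication by 2.
  constexpr std::size_t kFloatsPerPair = 2;
  // Same kind of cast as in the diff above, with a descriptive name.
  const float* gpair_flat = reinterpret_cast<const float*>(gpair.data());
  *grad_sum = 0.0;
  *hess_sum = 0.0;
  for (std::size_t rid : row_indices) {
    assert(rid < gpair.size());
    const std::size_t idx = kFloatsPerPair * rid;
    *grad_sum += gpair_flat[idx];      // gradient of row rid
    *hess_sum += gpair_flat[idx + 1];  // hessian of row rid
  }
}
```

A named constant such as `kFloatsPerPair` also documents why row indices are doubled when the pairs are addressed as a flat float array.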

trivialfis

Overall looks good to me, thanks! Just a suggestion for the future, we might want to do some refactoring to this monolithic updater, like splitting up the builder into different components for loss guided, depthwise, an independent component for row partitioning etc. I want to add extra functionality to both gpu hist and hist, having nicer code would make our lives easier. ;-)

SmirnovEgorRu · 2020-02-29T20:11:05Z

@trivialfis, thank you!
Yes, the builder is currently too large. Let's think about how it can be refactored in the future :)

SmirnovEgorRu requested a review from hcho3 January 29, 2020 01:23

This was referenced Jan 29, 2020

Optimized BuildHist function #5156

Merged

CPU optimizations - 'hist' method #5104

Closed

SmirnovEgorRu force-pushed the apply_spliy_opt_2 branch from 4d18012 to ff3ae2c Compare January 31, 2020 16:17

SmirnovEgorRu force-pushed the apply_spliy_opt_2 branch 2 times, most recently from d0f8c3f to 4283e7d Compare February 1, 2020 18:35

Optimized ApplySplit and UpdatePredictCache functions on CPU

0c1df2b

SmirnovEgorRu force-pushed the apply_spliy_opt_2 branch from 4283e7d to 0c1df2b Compare February 1, 2020 20:31

trivialfis added 2 commits February 15, 2020 18:15

Small cleanup.

1c8ce68

* Remove SimpleArray as it's only used in column matrix, and resize is only called once per tree.
* Reduce the number of parameters, specifically by computing prefetching at compile time.

Make enumerate split const function.

f8faa6a

trivialfis requested changes Feb 15, 2020

Lint.

3df344f

trivialfis reviewed Feb 15, 2020

src/tree/updater_quantile_hist.cc

SmirnovEgorRu added 2 commits February 18, 2020 10:29

apply review comments

208d251

Merge branch 'master' into apply_spliy_opt_2

98b2004

SmirnovEgorRu requested a review from trivialfis February 18, 2020 07:42

SmirnovEgorRu added 2 commits February 18, 2020 19:00

fix C++ tests according to new API

1cc2967

Merge branch 'apply_spliy_opt_2' of https://github.com/SmirnovEgorRu/…

350fd6b

…xgboost into apply_spliy_opt_2

RAMitchell reviewed Feb 19, 2020

ShvetsKS mentioned this pull request Feb 21, 2020

Reducing memory consumption for 'hist' method on CPU #5334

Merged

hcho3 mentioned this pull request Feb 21, 2020

[Roadmap] 1.1.0 Roadmap #5337

Closed

12 tasks

applying review comments

e973ecc

RAMitchell approved these changes Feb 27, 2020

trivialfis approved these changes Feb 28, 2020

trivialfis merged commit 1b97eaf into dmlc:master Feb 29, 2020

SmirnovEgorRu mentioned this pull request Mar 13, 2020

Hist 10x slower than Exact #5405

Closed

lock bot locked as resolved and limited conversation to collaborators Jun 24, 2020
