Optimizations of pre-processing for 'hist' tree method #4310

SmirnovEgorRu · 2019-03-29T11:52:13Z

This commit is a part of this pull-request.
Performance before this commit:

Data Set	First iteration	50 iterations
Higgs1m, ms	900	3889
Higgs10m, ms	8923	33638
9m x 45, ms	6093	19389

Performance results for these changes:

Data Set	First iteration	50 iterations
Higgs1m, ms	672	3502
Higgs10m, ms	7834	30121
9m x 45, ms	4369	15960

HW: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 2 sockets, 14 cores per socket.

CodingCat · 2019-03-30T01:22:49Z

src/tree/updater_quantile_hist.cc

      for (size_t i = 0; i < info.num_row_; ++i) {
        if (gpair[i].GetHess() >= 0.0f && coin_flip(rnd)) {
-          row_indices.push_back(i);
+          p_row_indices[j++] = i;


why adding values to the underlying array directly is performance friendly?

There are 2 points:

push_back has at least 1 "if" to check current capacity with new size, even if reserve() was called. It leads to additional branching and performance slowdown.

constructions like this:

if (a[i] > c) b[j++] = a[i];

Can be easily vectorized by AVX-512 instructions - vcompressps/vcompresspd and some mask-instructions for comparison, even with compiler auto-vectorization. I know, that now XGBoost from a box supports only sse4.2, however AVX-512 will be possibly available in the future.
With push_back() it doesn't work.

I am very interested in trying to recompile with AVX-512 and testing it in a real world

SmirnovEgorRu · 2019-03-30T11:34:31Z

I see failed tests for R. But logs contain these messages:
Error in download.file(url, destfile, method, mode = "wb", ...) :
cannot open URL 'http://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/gh_1.0.1.tgz'
ERROR: dependencies ‘callr’, ‘cli’, ‘digest’, ‘httr’, ‘jsonlite’, ‘memoise’, ‘pkgbuild’, ‘pkgload’, ‘rcmdcheck’, ‘remotes’, ‘sessioninfo’ are not available for package ‘devtools’

Looks like these errors don't depend on my commit. May I take that tests are passed?

CodingCat · 2019-03-30T11:45:04Z

I have retriggered the test, should pass this time

SmirnovEgorRu · 2019-03-30T12:52:27Z

@CodingCat thanks. Looks like it is okay now.

hcho3

@SmirnovEgorRu Thanks for your contribution. I made a first pass and overall it looks good. I notice that you are using pre-fetching and blocking quite often (e.g. lots of reference to block). Is there a reference I can look at as to why it improves performance? I'd like to learn more about this technique, so that in a future pull request, I can propose a suitable abstraction to make this kind of optimization easy. (I noticed that the optimized code is not as readable, due to heavy use of loop variables and temporary buffers. A suitable abstraction should ideally make the code easy to inspect.)

cc @trivialfis

hcho3 · 2019-04-03T20:08:05Z

src/common/column_matrix.h

@@ -71,7 +71,7 @@ class ColumnMatrix {
  // construct column matrix from GHistIndexMatrix
  inline void Init(const GHistIndexMatrix& gmat,
                   double  sparse_threshold) {
-    const auto nfeature = static_cast<bst_uint>(gmat.cut.row_ptr.size() - 1);
+    const int32_t nfeature = static_cast<int32_t>(gmat.cut.row_ptr.size() - 1);


@SmirnovEgorRu Why the change from bst_uint to int32_t?

Windows requires that integer type for count inside "#pragma omp paralell for" should have signed type.
Before nfeature wasn't used for parallel loop, but now it is used. So it was changed from unsigned to signed

hcho3 · 2019-04-03T20:12:46Z

src/common/row_set.h

@@ -59,7 +59,6 @@ class RowSetCollection {
  }
  // clear up things
  inline void Clear() {
-    row_indices_.clear();


I think we still need this line. Clearing a RowSetCollection should also clear all row indices stores within.

I did some measurements with my all optimizations and InitData() function spends ~15% of tree building time.
This block of code spends all of this time:

for (size_t i = 0; i < info.num_row_; ++i) { if (gpair[i].GetHess() >= 0.0f) { row_indices.push_back(i);

There are 2 reasons: std::vector - not performance friendly object, push_back contains at least one "if" to check current size. Also, this code can't be parallelized and it is sequential.

I have changed this code to use C-style partition:

j = 0; for (size_t i = 0; i < info.num_row_; ++i) { if (gpair[i].GetHess() >= 0.0f) { p_row_indices[j++] = i;

But in this case I need to do row_indices_->resize(num_row) each time after row_indices_->clear(). But resize(num_rows) - very expensive operations if current size=0, it will fill data inside by default value (in our case - 0) by not performance manner. So, I deleted clear() and use resize(nrows) only one time.

After my changes InitData() spends less then 1% now.

P.S. std::vector - really not performance friendly :)

hcho3 · 2019-04-03T20:31:59Z

src/common/hist_util.cc

+  } else {
+    for (const auto &batch : p_fmat->GetRowBatches()) {
+      const size_t size = batch.Size();
+      const size_t block_size = 512;


What's the rule for choosing this number (512)? Is this number hardware-dependent?

No, but this value influences on size of allocated buffer: it should not be too large from the one point of view and be not too small (useful work of omp-task should be more than omp-overhead) from the second point of view.

By experiments 512 have been chosen as optimal.

hcho3 · 2019-04-03T22:19:13Z

src/common/hist_util.cc

+  size_t rbegin = 0;
+  size_t prev_sum = 0;
+
+  for (const auto &batch : p_fmat->GetRowBatches()) {


Note to @trivialfis, @CodingCat: we are not really sure if 'hist' method works when external memory is enabled (i.e. GetRowBatches() returns more than one batch). We will need to test 'hist' against external memory in a future pull request.

@hcho3 Absolutely agree.

AFAICT, it doesn't work in distributed setting, even after @trivialfis 's fix in 0.82

hcho3 · 2019-04-03T22:19:42Z

src/common/hist_util.cc

    index.resize(row_ptr.back());

    CHECK_GT(cut.cut.size(), 0U);
-    CHECK_EQ(cut.row_ptr.back(), cut.cut.size());


@SmirnovEgorRu Why are we removing this line?

Now row_ptr is allocated to final length from the start. So if we have a lot of batches - back() for not final batch will return some trash.

hcho3 · 2019-04-03T22:31:21Z

src/tree/updater_quantile_hist.cc

-        if (gpair[i].GetHess() >= 0.0f) {
-          row_indices.push_back(i);
+      MemStackAllocator<int, 128> buff(this->nthread_);
+      int* p_buff = buff.Get();


Suggested change

int* p_buff = buff.Get();

bool* p_buff = buff.Get();

committed, thanks

hcho3 · 2019-04-03T22:31:23Z

src/tree/updater_quantile_hist.cc

-      for (size_t i = 0; i < info.num_row_; ++i) {
-        if (gpair[i].GetHess() >= 0.0f) {
-          row_indices.push_back(i);
+      MemStackAllocator<int, 128> buff(this->nthread_);


Suggested change

MemStackAllocator<int, 128> buff(this->nthread_);

MemStackAllocator<bool, 128> buff(this->nthread_);

@SmirnovEgorRu Did you intend to use int to avoid false sharing? Or can we use bool instead?

False sharing should not affect this place

hcho3 · 2019-04-03T22:31:40Z

src/tree/updater_quantile_hist.cc

+
+      bool has_heg_hess = false;
+      for (size_t tid = 0; tid < this->nthread_; ++tid) {
+        if (p_buff[tid] == true) {


Suggested change

if (p_buff[tid] == true) {

if (p_buff[tid]) {

hcho3 · 2019-04-03T22:33:16Z

src/tree/updater_quantile_hist.cc

+        }
+      }
+
+      bool has_heg_hess = false;


Suggested change

bool has_heg_hess = false;

bool has_neg_hess = false;

hcho3 · 2019-04-03T22:44:11Z

src/tree/updater_quantile_hist.cc

+      if (has_heg_hess) {
+        size_t j = 0;
+        for (size_t i = 0; i < info.num_row_; ++i) {
+          if (gpair[i].GetHess() >= 0.0f) {


Note to @trivialfis, @RAMitchell: I wonder if we should just forbid negative Hessian values. All the non-negativity checks in the updater codebase are not only eye-sores but also reduces parallelism.

SmirnovEgorRu · 2019-04-04T08:26:17Z

@hcho3
Yes, blocking is common practice, even if you don't do it explicitly - omp will do blocking for you implicitly.
On practice prefetch is not needed for modern CPUs, only in some places when order of memory-access is not trivial. In case of BuildHist it helps, but don't see when it can be applied more.

trivialfis · 2019-04-06T16:50:43Z

Sorry for the delay, will review shortly.

trivialfis

This PR is very edifying, thanks a lot! Some notes and questions are inlined in the comment. We will need to make more adjustments for the code structure in the future. Optimization is awesome(big thanks again), but I usually try to measure for how many time I need to spend on understanding each line of code (time spent on reading / number of lines), I think this is getting higher and higher.

@SmirnovEgorRu For avx-512 optimization, is function multi-versioning applicable here in the future? I should try to learn more about it.

@hcho3 I'm not sure about the implication of forbidding neg hess.

trivialfis · 2019-04-08T08:14:32Z

src/common/hist_util.h

+    n_ = n;
+  }
+
+  T& operator[](size_t idx) {


What is the advantage of using SimpleArray over std::vector? We still have one extra indirection each time we index the array.

@SmirnovEgorRu I'm wondering about this too, SimpleArray doens't use dynamic table to amortize the allocation/deallocation time, can you add a note that under what size it's more efficient than dynamic table? Or it's for some other reasons?

trivialfis · 2019-04-08T08:36:03Z

src/common/hist_util.cc

-    size_t group_ind = 0;
-    if (use_group_ind) {
-      group_ind = this->SearchGroupIndFromBaseRow(group_ptr, batch.base_rowid);
+  if (use_group_ind) {


@hcho3 We need to change the name of these Init functions.

trivialfis · 2019-04-08T08:38:51Z

src/common/hist_util.cc

-              sketchs[entry.index].Push(entry.fvalue, info.GetWeight(w_idx));
+          #pragma omp barrier
+          #pragma omp for schedule(static)
+          for (int32_t icol = 0; icol < static_cast<int32_t>(ncol); ++icol) {


The number of columns is usually small, ~3000 is quite an extreme case, ~120 might be the more common case. Is it preferred to use OMP here?

OMP must be here :)
Each OMP task will push around 512*nthreads/ncols elements, it is enough to have gain from OMP. Also, OMP threads have already created on 117 line.

trivialfis · 2019-04-08T08:41:53Z

src/common/hist_util.cc

+      const size_t block_size_iter = block_size * nthread;
+      const size_t n_blocks = size / block_size_iter + !!(size % block_size_iter);
+
+      std::vector<std::vector<std::pair<float, float>>> buff(nthread);


Maybe add a note like buffer for sketches to increase the cache usage or for some other reasons? It's sometime easy to forget the pain when we are new comers. ;-)

Buffs are not needed for better cache locality.
It requires for collecting local features-values by each thread and after push them to global queue without conflicts

trivialfis · 2019-04-08T08:43:19Z

src/common/hist_util.cc

+  size_t rbegin = 0;
+  size_t prev_sum = 0;
+
+  for (const auto &batch : p_fmat->GetRowBatches()) {


@hcho3 Absolutely agree.

trivialfis · 2019-04-08T08:51:26Z

src/tree/updater_quantile_hist.h

@@ -28,6 +28,38 @@
 #include "../common/column_matrix.h"

 namespace xgboost {
+
+template<typename T, size_t MaxStackSize>
+class MemStackAllocator {


As a side note, the allocator is actually used as some sort of memory buffer in this PR (I haven't gone through the big PR yet). So can we make it a std style allocator and uses something like std::vector<int, StackAllocator> instead? Or we can implement our own data structure that can switch allocators at runtime so that Your SimpleArray and this StackAllocator can be organized into a more appropriate structure in the future?

std::vector - very not efficient, but probably in this case it will work.
if you come back tbb - they have std-style ScalableAllocator which can be useful here.

RAMitchell · 2019-04-10T01:38:56Z

@SmirnovEgorRu I have a slightly off topic question: for your experiments do you use hyper threading? In my work I found that xgboost can perform very poorly and with high variance when using all threads (default nthreads setting).

This is my CPU:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                80
On-line CPU(s) list:   0-79
Thread(s) per core:    2
Core(s) per socket:    20
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
Stepping:              1
CPU MHz:               1656.875
CPU max MHz:           3600.0000
CPU min MHz:           1200.0000
BogoMIPS:              4391.55
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              51200K
NUMA node0 CPU(s):     0-19,40-59
NUMA node1 CPU(s):     20-39,60-79

I am wondering if this is a problem that affects others and if so how can we solve it.

@Laurae2 did you also experience this in your xgboost experiments?

Laurae2 · 2019-04-10T15:25:52Z

@RAMitchell xgboost fast histogram has poor scalability, therefore hyperthreads is beneficial only with a small number of threads (such as on a small desktop or a laptop). For instance on a large enough dataset, my i7-7700HQ gets a 40% performance boost from hyperthreads, but my Dual E5-2699v4 (and my Dual Xeon 6154) gets from 50% slowdown to 2x slowdown (the bigger the dataset, the smaller the slowdown).

I recommend (@SmirnovEgorRu also recommends) to disable hyperthreading. Another solution is to use OpenMP environment parameters to pin threads to specific cores when using xgboost with many threads.

Also note if using gcc or icc. I noticed icc provides the same performance as gcc for xgboost for gcc recent versions (before @SmirnovEgorRu performance improvements). For later versions of xgboost, I did not test on icc. More info here: https://sites.google.com/view/lauraepp/benchmarks/intel-compiler-may-2018

Co-Authored-By: SmirnovEgorRu <egor.smirnov@intel.com>

SmirnovEgorRu · 2019-04-11T12:08:47Z

@RAMitchell, I agree with @Laurae2, in case of XGBoost better to disable HyperThreading.
I have already written it here

However, after all my changes (#4278) I observed that HyperThreading is able to provide improvements for some data sets, but need investigate this fact,

codecov-io · 2019-04-11T13:00:34Z

Codecov Report

Merging #4310 into master will increase coverage by 0.19%.
The diff coverage is 78.37%.

@@            Coverage Diff             @@
##           master    #4310      +/-   ##
==========================================
+ Coverage   67.82%   68.01%   +0.19%     
==========================================
  Files         132      132              
  Lines       12201    12363     +162     
==========================================
+ Hits         8275     8409     +134     
- Misses       3926     3954      +28

Impacted Files	Coverage Δ
src/common/row_set.h	`37.5% <ø> (-1.28%)`	⬇️
src/common/column_matrix.h	`94.8% <100%> (+0.06%)`	⬆️
src/common/hist_util.h	`84.9% <100%> (+4.9%)`	⬆️
src/tree/updater_quantile_hist.cc	`34.52% <69.44%> (+2.32%)`	⬆️
src/common/hist_util.cc	`48.39% <77.38%> (+3.13%)`	⬆️
src/tree/updater_quantile_hist.h	`62.16% <80%> (+6.6%)`	⬆️
src/tree/updater_basemaker-inl.h	`63.59% <0%> (-0.18%)`	⬇️
src/data/sparse_page_source.cc	`85.08% <0%> (ø)`	⬆️
src/data/data.cc	`81.49% <0%> (ø)`	⬆️
tests/cpp/predictor/test_cpu_predictor.cc	`100% <0%> (ø)`	⬆️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7ea5b77...866b681. Read the comment docs.

SmirnovEgorRu · 2019-04-15T16:19:47Z

@hcho3 @CodingCat @trivialfis,
Hi, do I need to do something else for this PR?

CodingCat · 2019-04-15T16:24:31Z

LGTM

hcho3

LGTM. A future PR should forbid negative Hessian values and simplify preprocessing code here.

trivialfis

LGTM. Let's move on. ;-)

SmirnovEgorRu added 3 commits March 29, 2019 14:35

oprimizations for pre-processing

55a2ed1

code cleaning

72709f1

code cleaning

3c88125

CodingCat reviewed Mar 30, 2019

View reviewed changes

hcho3 requested a review from trivialfis April 3, 2019 20:06

hcho3 reviewed Apr 3, 2019

View reviewed changes

code cleaning after review

3bd0163

trivialfis reviewed Apr 8, 2019

View reviewed changes

Apply suggestions from code review

866b681

Co-Authored-By: SmirnovEgorRu <egor.smirnov@intel.com>

hcho3 approved these changes Apr 15, 2019

View reviewed changes

trivialfis approved these changes Apr 16, 2019

View reviewed changes

hcho3 merged commit 711397d into dmlc:master Apr 17, 2019

SmirnovEgorRu mentioned this pull request May 2, 2019

CPU optimizations #4433

Closed

hcho3 mentioned this pull request May 6, 2019

XGBoost 0.90 Roadmap #4389

Closed

18 tasks

This was referenced May 17, 2019

[RFC] Version 0.90 release candidate #4475

Merged

[RFC] Possible DMatrix refactor #4354

Closed

This was referenced Jul 3, 2019

MultiThreading not working properly with version 0.71.2 #3543

Closed

Call for contribution: improve multi-core CPU performance of 'hist' #3810

Closed

SmirnovEgorRu mentioned this pull request Aug 15, 2019

Multicore scalability of the Histogram-based GBDT scikit-learn/scikit-learn#14306

Open

lock bot locked as resolved and limited conversation to collaborators Sep 20, 2019

	MemStackAllocator<int, 128> buff(this->nthread_);
	MemStackAllocator<bool, 128> buff(this->nthread_);

Optimizations of pre-processing for 'hist' tree method #4310

Optimizations of pre-processing for 'hist' tree method #4310

Conversation

SmirnovEgorRu commented Mar 29, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

SmirnovEgorRu commented Mar 30, 2019

CodingCat commented Mar 30, 2019

SmirnovEgorRu commented Mar 30, 2019

hcho3 left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

SmirnovEgorRu commented Apr 4, 2019

trivialfis commented Apr 6, 2019

trivialfis left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

RAMitchell commented Apr 10, 2019

Laurae2 commented Apr 10, 2019

SmirnovEgorRu commented Apr 11, 2019

codecov-io commented Apr 11, 2019

Codecov Report

SmirnovEgorRu commented Apr 15, 2019

CodingCat commented Apr 15, 2019

hcho3 left a comment

Choose a reason for hiding this comment

trivialfis left a comment

Choose a reason for hiding this comment

hcho3 left a comment •

edited

Loading

trivialfis left a comment •

edited

Loading