Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimizations of pre-processing for 'hist' tree method #4310

Merged
merged 5 commits into from
Apr 17, 2019
Merged

Optimizations of pre-processing for 'hist' tree method #4310

merged 5 commits into from
Apr 17, 2019

Conversation

SmirnovEgorRu
Copy link
Contributor

This commit is a part of this pull-request.
Performance before this commit:

Data Set First iteration 50 iterations
Higgs1m, ms 900 3889
Higgs10m, ms 8923 33638
9m x 45, ms 6093 19389

Performance results for these changes:

Data Set First iteration 50 iterations
Higgs1m, ms 672 3502
Higgs10m, ms 7834 30121
9m x 45, ms 4369 15960

HW: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 2 sockets, 14 cores per socket.

for (size_t i = 0; i < info.num_row_; ++i) {
if (gpair[i].GetHess() >= 0.0f && coin_flip(rnd)) {
row_indices.push_back(i);
p_row_indices[j++] = i;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why adding values to the underlying array directly is performance friendly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 2 points:

  1. push_back has at least 1 "if" to check current capacity with new size, even if reserve() was called. It leads to additional branching and performance slowdown.
  2. constructions like this:
if (a[i] > c)
   b[j++] = a[i];

Can be easily vectorized by AVX-512 instructions - vcompressps/vcompresspd and some mask-instructions for comparison, even with compiler auto-vectorization. I know, that now XGBoost from a box supports only sse4.2, however AVX-512 will be possibly available in the future.
With push_back() it doesn't work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am very interested in trying to recompile with AVX-512 and testing it in a real world

@SmirnovEgorRu
Copy link
Contributor Author

I see failed tests for R. But logs contain these messages:
Error in download.file(url, destfile, method, mode = "wb", ...) :
cannot open URL 'http://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/gh_1.0.1.tgz'
ERROR: dependencies ‘callr’, ‘cli’, ‘digest’, ‘httr’, ‘jsonlite’, ‘memoise’, ‘pkgbuild’, ‘pkgload’, ‘rcmdcheck’, ‘remotes’, ‘sessioninfo’ are not available for package ‘devtools’

Looks like these errors don't depend on my commit. May I take that tests are passed?

@CodingCat
Copy link
Member

I have retriggered the test, should pass this time

@SmirnovEgorRu
Copy link
Contributor Author

@CodingCat thanks. Looks like it is okay now.

@hcho3 hcho3 requested a review from trivialfis April 3, 2019 20:06
Copy link
Collaborator

@hcho3 hcho3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SmirnovEgorRu Thanks for your contribution. I made a first pass and overall it looks good. I notice that you are using pre-fetching and blocking quite often (e.g. lots of reference to block). Is there a reference I can look at as to why it improves performance? I'd like to learn more about this technique, so that in a future pull request, I can propose a suitable abstraction to make this kind of optimization easy. (I noticed that the optimized code is not as readable, due to heavy use of loop variables and temporary buffers. A suitable abstraction should ideally make the code easy to inspect.)

cc @trivialfis

@@ -71,7 +71,7 @@ class ColumnMatrix {
// construct column matrix from GHistIndexMatrix
inline void Init(const GHistIndexMatrix& gmat,
double sparse_threshold) {
const auto nfeature = static_cast<bst_uint>(gmat.cut.row_ptr.size() - 1);
const int32_t nfeature = static_cast<int32_t>(gmat.cut.row_ptr.size() - 1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SmirnovEgorRu Why the change from bst_uint to int32_t?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Windows requires that integer type for count inside "#pragma omp paralell for" should have signed type.
Before nfeature wasn't used for parallel loop, but now it is used. So it was changed from unsigned to signed

@@ -59,7 +59,6 @@ class RowSetCollection {
}
// clear up things
inline void Clear() {
row_indices_.clear();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still need this line. Clearing a RowSetCollection should also clear all row indices stores within.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some measurements with my all optimizations and InitData() function spends ~15% of tree building time.
This block of code spends all of this time:

for (size_t i = 0; i < info.num_row_; ++i) {
  if (gpair[i].GetHess() >= 0.0f) {
    row_indices.push_back(i);

There are 2 reasons: std::vector - not performance friendly object, push_back contains at least one "if" to check current size. Also, this code can't be parallelized and it is sequential.

I have changed this code to use C-style partition:

j = 0;
for (size_t i = 0; i < info.num_row_; ++i) {
  if (gpair[i].GetHess() >= 0.0f) {
    p_row_indices[j++] = i;

But in this case I need to do row_indices_->resize(num_row) each time after row_indices_->clear(). But resize(num_rows) - very expensive operations if current size=0, it will fill data inside by default value (in our case - 0) by not performance manner. So, I deleted clear() and use resize(nrows) only one time.

After my changes InitData() spends less then 1% now.

P.S. std::vector - really not performance friendly :)

} else {
for (const auto &batch : p_fmat->GetRowBatches()) {
const size_t size = batch.Size();
const size_t block_size = 512;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the rule for choosing this number (512)? Is this number hardware-dependent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, but this value influences on size of allocated buffer: it should not be too large from the one point of view and be not too small (useful work of omp-task should be more than omp-overhead) from the second point of view.

By experiments 512 have been chosen as optimal.

size_t rbegin = 0;
size_t prev_sum = 0;

for (const auto &batch : p_fmat->GetRowBatches()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to @trivialfis, @CodingCat: we are not really sure if 'hist' method works when external memory is enabled (i.e. GetRowBatches() returns more than one batch). We will need to test 'hist' against external memory in a future pull request.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hcho3 Absolutely agree.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT, it doesn't work in distributed setting, even after @trivialfis 's fix in 0.82

index.resize(row_ptr.back());

CHECK_GT(cut.cut.size(), 0U);
CHECK_EQ(cut.row_ptr.back(), cut.cut.size());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SmirnovEgorRu Why are we removing this line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now row_ptr is allocated to final length from the start. So if we have a lot of batches - back() for not final batch will return some trash.

if (gpair[i].GetHess() >= 0.0f) {
row_indices.push_back(i);
MemStackAllocator<int, 128> buff(this->nthread_);
int* p_buff = buff.Get();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int* p_buff = buff.Get();
bool* p_buff = buff.Get();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

committed, thanks

for (size_t i = 0; i < info.num_row_; ++i) {
if (gpair[i].GetHess() >= 0.0f) {
row_indices.push_back(i);
MemStackAllocator<int, 128> buff(this->nthread_);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
MemStackAllocator<int, 128> buff(this->nthread_);
MemStackAllocator<bool, 128> buff(this->nthread_);

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SmirnovEgorRu Did you intend to use int to avoid false sharing? Or can we use bool instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

False sharing should not affect this place


bool has_heg_hess = false;
for (size_t tid = 0; tid < this->nthread_; ++tid) {
if (p_buff[tid] == true) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (p_buff[tid] == true) {
if (p_buff[tid]) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commited

}
}

bool has_heg_hess = false;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
bool has_heg_hess = false;
bool has_neg_hess = false;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

committed

if (has_heg_hess) {
size_t j = 0;
for (size_t i = 0; i < info.num_row_; ++i) {
if (gpair[i].GetHess() >= 0.0f) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to @trivialfis, @RAMitchell: I wonder if we should just forbid negative Hessian values. All the non-negativity checks in the updater codebase are not only eye-sores but also reduces parallelism.

@SmirnovEgorRu
Copy link
Contributor Author

@hcho3
Yes, blocking is common practice, even if you don't do it explicitly - omp will do blocking for you implicitly.
On practice prefetch is not needed for modern CPUs, only in some places when order of memory-access is not trivial. In case of BuildHist it helps, but don't see when it can be applied more.

@trivialfis
Copy link
Member

Sorry for the delay, will review shortly.

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is very edifying, thanks a lot! Some notes and questions are inlined in the comment. We will need to make more adjustments for the code structure in the future. Optimization is awesome(big thanks again), but I usually try to measure for how many time I need to spend on understanding each line of code (time spent on reading / number of lines), I think this is getting higher and higher.

@SmirnovEgorRu For avx-512 optimization, is function multi-versioning applicable here in the future? I should try to learn more about it.

@hcho3 I'm not sure about the implication of forbidding neg hess.

n_ = n;
}

T& operator[](size_t idx) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the advantage of using SimpleArray over std::vector? We still have one extra indirection each time we index the array.

@SmirnovEgorRu I'm wondering about this too, SimpleArray doens't use dynamic table to amortize the allocation/deallocation time, can you add a note that under what size it's more efficient than dynamic table? Or it's for some other reasons?

size_t group_ind = 0;
if (use_group_ind) {
group_ind = this->SearchGroupIndFromBaseRow(group_ptr, batch.base_rowid);
if (use_group_ind) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hcho3 We need to change the name of these Init functions.

sketchs[entry.index].Push(entry.fvalue, info.GetWeight(w_idx));
#pragma omp barrier
#pragma omp for schedule(static)
for (int32_t icol = 0; icol < static_cast<int32_t>(ncol); ++icol) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The number of columns is usually small, ~3000 is quite an extreme case, ~120 might be the more common case. Is it preferred to use OMP here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OMP must be here :)
Each OMP task will push around 512*nthreads/ncols elements, it is enough to have gain from OMP. Also, OMP threads have already created on 117 line.

const size_t block_size_iter = block_size * nthread;
const size_t n_blocks = size / block_size_iter + !!(size % block_size_iter);

std::vector<std::vector<std::pair<float, float>>> buff(nthread);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a note like buffer for sketches to increase the cache usage or for some other reasons? It's sometime easy to forget the pain when we are new comers. ;-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Buffs are not needed for better cache locality.
It requires for collecting local features-values by each thread and after push them to global queue without conflicts

size_t rbegin = 0;
size_t prev_sum = 0;

for (const auto &batch : p_fmat->GetRowBatches()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hcho3 Absolutely agree.

@@ -28,6 +28,38 @@
#include "../common/column_matrix.h"

namespace xgboost {

template<typename T, size_t MaxStackSize>
class MemStackAllocator {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a side note, the allocator is actually used as some sort of memory buffer in this PR (I haven't gone through the big PR yet). So can we make it a std style allocator and uses something like std::vector<int, StackAllocator> instead? Or we can implement our own data structure that can switch allocators at runtime so that Your SimpleArray and this StackAllocator can be organized into a more appropriate structure in the future?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::vector - very not efficient, but probably in this case it will work.
if you come back tbb - they have std-style ScalableAllocator which can be useful here.

@RAMitchell
Copy link
Member

@SmirnovEgorRu I have a slightly off topic question: for your experiments do you use hyper threading? In my work I found that xgboost can perform very poorly and with high variance when using all threads (default nthreads setting).

This is my CPU:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                80
On-line CPU(s) list:   0-79
Thread(s) per core:    2
Core(s) per socket:    20
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
Stepping:              1
CPU MHz:               1656.875
CPU max MHz:           3600.0000
CPU min MHz:           1200.0000
BogoMIPS:              4391.55
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              51200K
NUMA node0 CPU(s):     0-19,40-59
NUMA node1 CPU(s):     20-39,60-79

I am wondering if this is a problem that affects others and if so how can we solve it.

@Laurae2 did you also experience this in your xgboost experiments?

@Laurae2
Copy link
Contributor

Laurae2 commented Apr 10, 2019

@RAMitchell xgboost fast histogram has poor scalability, therefore hyperthreads is beneficial only with a small number of threads (such as on a small desktop or a laptop). For instance on a large enough dataset, my i7-7700HQ gets a 40% performance boost from hyperthreads, but my Dual E5-2699v4 (and my Dual Xeon 6154) gets from 50% slowdown to 2x slowdown (the bigger the dataset, the smaller the slowdown).

I recommend (@SmirnovEgorRu also recommends) to disable hyperthreading. Another solution is to use OpenMP environment parameters to pin threads to specific cores when using xgboost with many threads.

Also note if using gcc or icc. I noticed icc provides the same performance as gcc for xgboost for gcc recent versions (before @SmirnovEgorRu performance improvements). For later versions of xgboost, I did not test on icc. More info here: https://sites.google.com/view/lauraepp/benchmarks/intel-compiler-may-2018

Co-Authored-By: SmirnovEgorRu <egor.smirnov@intel.com>
@SmirnovEgorRu
Copy link
Contributor Author

@RAMitchell, I agree with @Laurae2, in case of XGBoost better to disable HyperThreading.
I have already written it here

However, after all my changes (#4278) I observed that HyperThreading is able to provide improvements for some data sets, but need investigate this fact,

@codecov-io
Copy link

Codecov Report

Merging #4310 into master will increase coverage by 0.19%.
The diff coverage is 78.37%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4310      +/-   ##
==========================================
+ Coverage   67.82%   68.01%   +0.19%     
==========================================
  Files         132      132              
  Lines       12201    12363     +162     
==========================================
+ Hits         8275     8409     +134     
- Misses       3926     3954      +28
Impacted Files Coverage Δ
src/common/row_set.h 37.5% <ø> (-1.28%) ⬇️
src/common/column_matrix.h 94.8% <100%> (+0.06%) ⬆️
src/common/hist_util.h 84.9% <100%> (+4.9%) ⬆️
src/tree/updater_quantile_hist.cc 34.52% <69.44%> (+2.32%) ⬆️
src/common/hist_util.cc 48.39% <77.38%> (+3.13%) ⬆️
src/tree/updater_quantile_hist.h 62.16% <80%> (+6.6%) ⬆️
src/tree/updater_basemaker-inl.h 63.59% <0%> (-0.18%) ⬇️
src/data/sparse_page_source.cc 85.08% <0%> (ø) ⬆️
src/data/data.cc 81.49% <0%> (ø) ⬆️
tests/cpp/predictor/test_cpu_predictor.cc 100% <0%> (ø) ⬆️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7ea5b77...866b681. Read the comment docs.

@SmirnovEgorRu
Copy link
Contributor Author

@hcho3 @CodingCat @trivialfis,
Hi, do I need to do something else for this PR?

@CodingCat
Copy link
Member

LGTM

Copy link
Collaborator

@hcho3 hcho3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. A future PR should forbid negative Hessian values and simplify preprocessing code here.

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Let's move on. ;-)

@lock lock bot locked as resolved and limited conversation to collaborators Sep 20, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants