
CPU optimizations #4433

Closed
wants to merge 57 commits into from

Conversation

SmirnovEgorRu
Contributor

This PR contains the changes from #4278 without the pre-processing optimizations (those have already been merged in #4310).

In addition, I refactored the code to make it simpler. In the previous version I used OpenMP tasks, but they are not supported in MSVS by default, so I replaced them with a simple omp parallel for.

The code changes look massive, but it is hard to split them into smaller parts.
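
For reference, a minimal sketch of the pattern described above (illustrative only; the block type and the per-block work are placeholders, not the PR's code):

```cpp
#include <vector>

// Instead of spawning one OpenMP task per block (unsupported by MSVC's
// OpenMP 2.0 runtime), iterate over the blocks with a plain parallel for.
void ProcessBlocks(std::vector<std::vector<float>>* blocks) {
  const int n_blocks = static_cast<int>(blocks->size());  // MSVC needs a signed loop index
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < n_blocks; ++i) {
    for (float& v : (*blocks)[i]) {
      v *= 2.0f;  // stand-in for the real per-block histogram work
    }
  }
}
```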

@trivialfis
Member

trivialfis commented May 6, 2019

It seems this is based on an old commit. Is it possible to make a clean rebase?

@hcho3
Collaborator

hcho3 commented May 6, 2019

I'll review this pull request after the 0.90 release.

@hcho3
Collaborator

hcho3 commented May 23, 2019

@SmirnovEgorRu Looks like this pull request is breaking some distributed training tests: https://xgboost-ci.net/blue/organizations/jenkins/xgboost/detail/PR-4433/7/pipeline#step-145-log-581

@SmirnovEgorRu
Contributor Author

@hcho3 Yes, now I want to understand how distributed mode works. Do we have any materials about this?
Potentially, my changes should optimize distributed mode too, but the code would need changes.

@hcho3
Collaborator

hcho3 commented May 24, 2019

@SmirnovEgorRu The XGBoost paper has some details about distributed training. Roughly speaking, every occurrence of AllReduce causes synchronization of all nodes.
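
For context, a minimal sketch of that synchronization point, assuming the rabit Allreduce primitive XGBoost uses for distributed training (the helper below is illustrative, not code from this PR):

```cpp
#include <rabit/rabit.h>
#include <vector>

void SyncHistogram(std::vector<double>* hist) {
  // Every worker contributes its local sums; after the call each worker holds
  // the element-wise total. All workers must reach this call, so a worker that
  // skips (or adds) an Allreduce will hang or corrupt distributed training.
  rabit::Allreduce<rabit::op::Sum>(hist->data(), hist->size());
}
```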

@hcho3
Collaborator

hcho3 commented May 24, 2019

Let me look at this over the weekend and see why it's breaking distributed training.

@trivialfis
Member

trivialfis commented May 25, 2019

Hi @SmirnovEgorRu @hcho3 @RAMitchell, I want to revert part of #4310. That PR improved performance by allocating a per-thread buffer, specifically:

buff[tid].resize(block_size * ncol);

This makes it very hard to load sparse datasets. For example, the first line of a concatenated URL dataset from #2326 has a largest index of 3224681 but only 108 non-zero columns. On a 16-thread machine, this loop will allocate 16 * 3224681 * 512 * 2 * sizeof(float) bytes (196.8189 GB), which means my machine won't even be able to load a single line of data. And just for future reference, allocating memory based on the feature index might not be a good idea. :-(
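
To make the scale concrete, a back-of-envelope sketch of the allocation quoted above (the numbers come from this comment; the assumption that each entry is a gradient/hessian float pair is mine):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  const std::uint64_t nthread    = 16;       // threads on the machine
  const std::uint64_t ncol       = 3224681;  // largest feature index in the URL dataset
  const std::uint64_t block_size = 512;      // rows handled per block
  const std::uint64_t bytes = nthread * ncol * block_size * 2 * sizeof(float);
  std::printf("%.1f GiB\n", bytes / (1024.0 * 1024.0 * 1024.0));  // ~196.8 GiB
  return 0;
}
```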

@SmirnovEgorRu
Contributor Author

@hcho3, looks like all issues are resolved, including distributed mode.

@trivialfis
Member

trivialfis left a comment

Thanks for the PR. We need to do some benchmarks on different datasets. But before that, please:

  1. make a clean rebase, and
  2. add unit test cases for new/changed functions to show their correctness.

Performance is good, but please don't achieve it at the expense of maintainability. There are many improvements pending for the hist algorithm, so please review your PR first and see if there's a cleaner way of doing things. ;-)

I will add a detailed review once the code becomes cleaner.

};

namespace tree {
class SplitEvaluator;
Member

Is this used somewhere in this file?

row_ptr_[nid] = data_.size();
data_.resize(data_.size() + nbins_);
if (data_arr_[nid] == nullptr) {
  data_arr_[nid] = new std::vector<GradStatHist>;
Member

Why do you need to use new? I would try to avoid raw new and delete altogether.
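
A minimal sketch of that suggestion, assuming data_arr_ could instead hold std::unique_ptr (GradStatHist below is a stand-in for the PR's struct, not its actual definition):

```cpp
#include <cstddef>
#include <memory>
#include <vector>

struct GradStatHist { double sum_grad{0.0}; double sum_hess{0.0}; };  // stand-in

std::vector<std::unique_ptr<std::vector<GradStatHist>>> data_arr_;

void EnsureNodeHist(std::size_t nid) {
  if (data_arr_[nid] == nullptr) {
    data_arr_[nid] = std::make_unique<std::vector<GradStatHist>>();  // no raw new/delete
  }
}
```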

}

void BuildBlockHist(const std::vector<GradientPair>& gpair,
const RowSetCollection::Elem row_indices,
Member

The indentation is incorrect.

@@ -206,7 +214,8 @@ class QuantileHistMock : public QuantileHistMaker {
}

/* Now compare against result given by EvaluateSplit() */
RealImpl::EvaluateSplit(0, gmat, hist_, *(*dmat), tree);
EvaluateSplitsBatch(nodes, gmat, **dmat, hist_is_init, hist_buffers, const_cast<RegTree*>(&tree));
Member

Why is a const_cast necessary?

@@ -15,54 +15,23 @@
#include <vector>
#include <string>
#include <queue>
#include <deque>
Member

I couldn't find where you use deque.

const uint32_t idx_bin = 2*index[j];
const size_t idx_gh = 2*rid[i];
grad_stat.sum_grad += pgh[idx_gh];
grad_stat.sum_hess += pgh[idx_gh+1];
Member

Please give some notes about memory layout.
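
For reference, my reading of the layout these indices assume (inferred from the snippet above, not confirmed by the PR):

```cpp
// Assumed layout: gradient pairs are viewed as a flat float array, so row r's
// gradient sits at pgh[2*r] and its hessian at pgh[2*r + 1]; likewise each
// histogram bin occupies two consecutive values, hence idx_bin = 2 * index[j].
#include <cstddef>
#include <vector>

inline float Grad(const std::vector<float>& pgh, std::size_t row) { return pgh[2 * row]; }
inline float Hess(const std::vector<float>& pgh, std::size_t row) { return pgh[2 * row + 1]; }
```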

*/
using GHistRow = Span<tree::GradStats>;
struct GradStatHist {
Member

So can we remove GradStats?

global_start = dmlc::GetTime();
}

inline void EndPerfMonitor() {
Member

This is from an old commit. It's no longer here. Could you make a clean rebase?

const size_t n_local_histograms = std::min(nthread, n_local_blocks);

for (size_t j = 0; j < n_local_blocks; ++j) {
task_nid.push_back(nid);
Member

Can this be done by one resize?
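
A minimal sketch of that suggestion (illustrative, reusing the names from the snippet above), assuming every new entry receives the same nid:

```cpp
#include <cstddef>
#include <vector>

void AppendBlockTasks(std::vector<int>* task_nid, int nid, std::size_t n_local_blocks) {
  // One resize with a fill value replaces n_local_blocks push_back calls.
  task_nid->resize(task_nid->size() + n_local_blocks, nid);
}
```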

hist_is_init->resize(nodes.size());

// input data for tasks
int32_t n_tasks = 0;
Member

What task?

rongou and others added 18 commits June 2, 2019 16:45
2 additional digits are not needed to guarantee that casting the decimal representation will result in the same float, see #3980 (comment)
* Make AUCPR work with multiple query groups

* Check AUCPR <= 1.0 in distributed setting
…per-group (#4216)

* In AUC and AUCPR metrics, detect whether weights are per-instance or per-group

* Fix C++ style check

* Add a test for weighted AUC
* Change memory dump size in R test.
Add tutorial on missing values and how to handle those within XGBoost.
* Add `BUILD_WITH_SHARED_NCCL` to CMake.
…xplicitly given in XGBoost4J-Spark (#4446)

* Automatically set maximize_evaluation_metrics if not explicitly given.

* When custom_eval is set, require maximize_evaluation_metrics.

* Update documents on early stop in XGBoost4J-Spark.

* Fix code error.
* [CI] Upgrade to Spark 2.4.2

* Pass Spark version to build script

* Allow multiple --build-arg in ci_build.sh

* Fix syntax

* Fix container name

* Update pom.xml

* Fix container name

* Update Jenkinsfile

* Update pom.xml

* Update Dockerfile.jvm_cross
* mgpu prediction using explicit sharding
* [CI] Build XGBoost wheels with CUDA 9.0

* Do not call archiveArtifacts for 8.0 wheel
* Fix #4462: Use /MT flag consistently for MSVC target

* First attempt at Windows CI

* Distinguish stages in Linux and Windows pipelines

* Try running CMake in Windows pipeline

* Add build step
SmirnovEgorRu and others added 26 commits June 2, 2019 16:45
* Fix dask API sphinx docstrings

* Update GPU docs page
* Only define `gpu_id` and `n_gpus` in `LearnerTrainParam`
* Pass LearnerTrainParam through XGBoost via a factory method.
* Disable all GPU usage when GPU-related parameters are not specified (fixes XGBoost choosing GPU over-aggressively).
* Test learner train param io.
* Fix gpu pickling.
* - training with external memory, part 1 of 2
   - this PR focuses on computing the quantiles using multiple GPUs on a
     dataset that uses the external cache capabilities
   - there will be a follow-up PR soon after this that will support creation
     of histogram indices on large datasets as well
   - both of these changes are required to support training with external memory
   - the sparse pages in dmatrix are taken in batches and the cut matrices
     are incrementally built
   - also snuck in some (perf) changes related to sketch aggregation amongst multiple
     features across multiple sparse page batches: instead of aggregating the summary
     inside each device and merging later, it is aggregated in place when the device
     is working on different rows but the same feature
* simplify the config.h file

* revise config.h

* revised config.h

* revise format

* revise format issues

* revise whitespace issues

* revise whitespace namespace format issues

* revise namespace format issues

* format issues

* format issues

* format issues

* format issues

* Revert submodule changes

* minor change

* Update src/common/config.h

Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>

* address format issue from trivialfis

* Use correct cub submodule
…4519)

* Smarter choice of histogram construction for distributed gpu_hist

* Limit omp team size in ExecuteShards
@lock lock bot locked as resolved and limited conversation to collaborators Aug 31, 2019