
Fix memory usage of device sketching #5407

Merged: 3 commits, Mar 14, 2020

Conversation

RAMitchell (Member):

This PR resolves a bug where sketching could exceed device memory for large datasets.

I added explicit tests for the memory usage of sketching, plus a smarter batch size calculation that can use up to 80% of the available device memory.

@sriramch, can you please verify?

sriramch (Contributor):

@RAMitchell - I was able to build and run this with the following patch. I'll also provide some comments shortly:

```diff
diff --git a/include/xgboost/data.h b/include/xgboost/data.h
index 0995530..7628093 100644
--- a/include/xgboost/data.h
+++ b/include/xgboost/data.h
@@ -169,22 +169,18 @@ struct BatchParam {
   int gpu_id;
   /*! \brief Maximum number of bins per feature for histograms. */
   int max_bin { 0 };
-  /*! \brief Number of rows in a GPU batch, used for finding quantiles on GPU. */
-  int gpu_batch_nrows;
   /*! \brief Page size for external memory mode. */
   size_t gpu_page_size;
   BatchParam() = default;
-  BatchParam(int32_t device, int32_t max_bin, int32_t gpu_batch_nrows,
+  BatchParam(int32_t device, int32_t max_bin,
              size_t gpu_page_size = 0) :
       gpu_id{device},
       max_bin{max_bin},
-      gpu_batch_nrows{gpu_batch_nrows},
       gpu_page_size{gpu_page_size}
   {}
   inline bool operator!=(const BatchParam& other) const {
     return gpu_id != other.gpu_id ||
         max_bin != other.max_bin ||
-        gpu_batch_nrows != other.gpu_batch_nrows ||
         gpu_page_size != other.gpu_page_size;
   }
 };
```
  • I was able to train large datasets with this fix (~180M instances on a single GPU).
  • This is on par with a version that reverts everything from head up to and including "Sketching from adapters" #5365.
  • The peak memory usage with this version is slightly higher for much larger datasets - not something to be concerned about (a few tens of MB more than earlier).
  • Training 100M instances resulted in 50% higher peak memory usage than earlier (12.2 GB vs. 8.2 GB). This should be OK as well, since the sketcher now uses a large part of the available memory to process more instances at a time, and that memory should be returned once sketching is over.
  • The training times were comparable as well.
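The "use up to 80% of available memory" batch sizing discussed above can be sketched as follows. This is a minimal host-side sketch, not the PR's actual implementation; the names `BytesPerRow` and `SketchBatchRows`, and the per-row cost model, are illustrative assumptions.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical estimate of the device bytes needed to sketch one row:
// each valid entry is copied into temporary storage and sorted, so the
// cost scales with the number of (feature, value) entries per row.
std::size_t BytesPerRow(std::size_t entries_per_row, std::size_t bytes_per_entry) {
  return entries_per_row * bytes_per_entry;
}

// Pick the largest batch (in rows) that fits within a fixed fraction of
// the currently available device memory, clamped to [1, total_rows].
std::size_t SketchBatchRows(std::size_t available_bytes, double memory_fraction,
                            std::size_t bytes_per_row, std::size_t total_rows) {
  std::size_t budget =
      static_cast<std::size_t>(available_bytes * memory_fraction);
  std::size_t rows = budget / std::max<std::size_t>(bytes_per_row, 1);
  // Always make some progress, but never plan past the end of the data.
  return std::min(std::max<std::size_t>(rows, 1), total_rows);
}
```

With this shape, a dataset that fits entirely within the budget is processed in a single batch, and anything larger is split into the fewest batches that stay under the cap.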

```diff
@@ -208,7 +221,8 @@ void ExtractWeightedCuts(int device, Span<SketchEntry> cuts,
 void ProcessBatch(int device, const SparsePage& page, size_t begin, size_t end,
                   SketchContainer* sketch_container, int num_cuts,
                   size_t num_columns) {
-  dh::XGBCachingDeviceAllocator<char> alloc;
+  dh::XGBCachingDeviceAllocator<char> caching_alloc;
+  dh::XGBDeviceAllocator<char> alloc;
```
sriramch (Contributor):

Should we get rid of this and use the caching_alloc throughout?

RAMitchell (Member, Author):
Yes

src/common/hist_util.cu (outdated; resolved)
```diff
@@ -385,7 +401,7 @@ void ProcessBatch(AdapterT* adapter, size_t begin, size_t end, float missing,
   size_t num_valid = host_column_sizes_scan.back();

   // Copy current subset of valid elements into temporary storage and sort
-  thrust::device_vector<Entry> sorted_entries(num_valid);
+  dh::device_vector<Entry> sorted_entries(num_valid);
```
sriramch (Contributor):

I think a caching_device_vector can be used everywhere. What is precluding its usage?

RAMitchell (Member, Author):

Yep, that should be used everywhere here. Allocations larger than 1 GB will just be standard allocations anyway. The only danger of caching_device_vector is that it does not default-initialise memory.
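The default-initialisation caveat can be illustrated with a minimal host-side caching-allocator sketch. This is an illustration of the general pattern, not the `dh::` implementation: a freed block is parked in a pool and later handed back with its old contents intact, so memory obtained through the cache must not be assumed to be zeroed.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Minimal caching allocator: Free() parks blocks in a pool keyed by size,
// and a later Allocate() of the same size reuses a pooled block *without*
// clearing it. Pooled blocks are intentionally never deleted in this sketch.
class CachingPool {
 public:
  char* Allocate(std::size_t n) {
    auto it = pool_.find(n);
    if (it != pool_.end() && !it->second.empty()) {
      char* p = it->second.back();
      it->second.pop_back();
      return p;  // reused block: still holds its previous contents
    }
    return new char[n]();  // fresh block: value-initialised to zero
  }
  void Free(char* p, std::size_t n) { pool_[n].push_back(p); }

 private:
  std::map<std::size_t, std::vector<char*>> pool_;
};
```

Allocating, writing, freeing, and then allocating the same size again returns the same block with its stale contents, which is exactly why a vector built on a caching allocator cannot rely on default initialisation.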

```cpp
bool debug_synchronize;
// declare parameters
DMLC_DECLARE_PARAMETER(GPUHistMakerTrainParam) {
  DMLC_DECLARE_FIELD(single_precision_histogram).set_default(false).describe(
      "Use single precision to build histograms.");
  DMLC_DECLARE_FIELD(deterministic_histogram).set_default(true).describe(
      "Pre-round the gradient for obtaining deterministic gradient histogram.");
  DMLC_DECLARE_FIELD(gpu_batch_nrows)
```
sriramch (Contributor):

Are there existing use-cases that use this? It looks like the (breaking) new behavior is to auto-deduce, and I'm wondering if there are configs that use -1 to pull everything in one shot as opposed to looping (with perhaps better latencies).

RAMitchell (Member, Author):

The current implementation will use up to 80% of the available memory, so the 'do everything in one batch' approach would only be slightly better in the case where more than 80% of memory is needed. The current autodetect behaviour is able to use more of the available memory than the old implementation, and should have lower latencies.

The gpu_batch_nrows parameter was never documented, so we have no commitment to support it; I don't think we use it anywhere apart from maybe testing.
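The one-shot-versus-looping trade-off above comes down to how the row range is partitioned. A minimal sketch of the batching loop, with `PlanBatches` as a hypothetical stand-in for the PR's actual batch planning (each range would be handed to a ProcessBatch-style call):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Split [0, total_rows) into consecutive [begin, end) ranges of at most
// batch_rows rows each; with batch_rows >= total_rows this degenerates to
// a single one-shot batch.
std::vector<std::pair<std::size_t, std::size_t>> PlanBatches(
    std::size_t total_rows, std::size_t batch_rows) {
  std::vector<std::pair<std::size_t, std::size_t>> ranges;
  for (std::size_t begin = 0; begin < total_rows; begin += batch_rows) {
    std::size_t end = std::min(begin + batch_rows, total_rows);
    ranges.emplace_back(begin, end);
  }
  return ranges;
}
```

When the auto-deduced batch size already covers the whole dataset, the loop runs exactly once, so the removed `-1` (one-shot) configuration is effectively subsumed by autodetection whenever the data fits in the memory budget.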

RAMitchell (Member, Author):

@trivialfis, can I get a review please?

codecov-io:

Codecov Report

Merging #5407 into master will not change coverage. The diff coverage is n/a.

```
@@           Coverage Diff           @@
##           master    #5407   +/-   ##
=======================================
  Coverage   84.07%   84.07%
=======================================
  Files          11       11
  Lines        2411     2411
=======================================
  Hits         2027     2027
  Misses        384      384
```

Last update 3ad4333...5e4255c.

RAMitchell merged commit b745b7a into dmlc:master on Mar 14, 2020.
The lock bot locked this conversation as resolved and limited it to collaborators on Jun 24, 2020.