
Fix memory usage of device sketching #5407

Merged: 3 commits, Mar 14, 2020

Conversation

RAMitchell (Member):

This PR resolves a bug where sketching could exceed device memory for large datasets.

I added explicit tests for the memory usage of sketching, plus a smarter batch size calculation that can use up to 80% of the available device memory.

@sriramch, can you please verify?

sriramch (Contributor):

@RAMitchell - I was able to build and run this with the following patch. I'll also provide some comments shortly:

```diff
diff --git a/include/xgboost/data.h b/include/xgboost/data.h
index 0995530..7628093 100644
--- a/include/xgboost/data.h
+++ b/include/xgboost/data.h
@@ -169,22 +169,18 @@ struct BatchParam {
   int gpu_id;
   /*! \brief Maximum number of bins per feature for histograms. */
   int max_bin { 0 };
-  /*! \brief Number of rows in a GPU batch, used for finding quantiles on GPU. */
-  int gpu_batch_nrows;
   /*! \brief Page size for external memory mode. */
   size_t gpu_page_size;
   BatchParam() = default;
-  BatchParam(int32_t device, int32_t max_bin, int32_t gpu_batch_nrows,
+  BatchParam(int32_t device, int32_t max_bin,
              size_t gpu_page_size = 0) :
       gpu_id{device},
       max_bin{max_bin},
-      gpu_batch_nrows{gpu_batch_nrows},
       gpu_page_size{gpu_page_size}
   {}
   inline bool operator!=(const BatchParam& other) const {
     return gpu_id != other.gpu_id ||
         max_bin != other.max_bin ||
-        gpu_batch_nrows != other.gpu_batch_nrows ||
         gpu_page_size != other.gpu_page_size;
   }
 };
```
  • I was able to train large datasets with this fix (~180M instances on a single GPU).
  • This is on par with a version that reverts everything from head up to and including "Sketching from adapters" #5365.
  • The peak memory usage with this version is slightly higher for much larger datasets - not something to be concerned about (a few tens of MB more than earlier).
  • Training 100M instances resulted in 50% higher peak memory usage than earlier (12.2 GB vs. 8.2 GB). This should be OK as well, since the sketcher now uses a large part of the available memory to process more instances at a time, and that memory should be returned once sketching is over.
  • The training times were comparable as well.
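The "use up to 80% of available memory" batch sizing discussed above can be sketched as follows. This is a minimal host-side sketch, not the PR's actual implementation; the names `BytesPerRow` and `SketchBatchRows`, and the per-row cost model, are illustrative assumptions.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical estimate of the device bytes needed to sketch one row:
// each valid entry is copied into temporary storage and sorted, so the
// cost scales with the number of (feature, value) entries per row.
std::size_t BytesPerRow(std::size_t entries_per_row, std::size_t bytes_per_entry) {
  return entries_per_row * bytes_per_entry;
}

// Pick the largest batch (in rows) that fits within a fixed fraction of
// the currently available device memory, clamped to [1, total_rows].
std::size_t SketchBatchRows(std::size_t available_bytes, double memory_fraction,
                            std::size_t bytes_per_row, std::size_t total_rows) {
  std::size_t budget =
      static_cast<std::size_t>(available_bytes * memory_fraction);
  std::size_t rows = budget / std::max<std::size_t>(bytes_per_row, 1);
  // Always make some progress, but never plan past the end of the data.
  return std::min(std::max<std::size_t>(rows, 1), total_rows);
}
```

With this shape, a dataset that fits entirely within the budget is processed in a single batch, and anything larger is split into the fewest batches that stay under the cap.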

```diff
@@ -208,7 +221,8 @@ void ExtractWeightedCuts(int device, Span<SketchEntry> cuts,
 void ProcessBatch(int device, const SparsePage& page, size_t begin, size_t end,
                   SketchContainer* sketch_container, int num_cuts,
                   size_t num_columns) {
-  dh::XGBCachingDeviceAllocator<char> alloc;
+  dh::XGBCachingDeviceAllocator<char> caching_alloc;
+  dh::XGBDeviceAllocator<char> alloc;
```
sriramch (Contributor):

Should we get rid of this and use the caching_alloc throughout?

RAMitchell (Member, Author):
Yes

src/common/hist_util.cu (outdated; resolved)
```diff
@@ -385,7 +401,7 @@ void ProcessBatch(AdapterT* adapter, size_t begin, size_t end, float missing,
   size_t num_valid = host_column_sizes_scan.back();

   // Copy current subset of valid elements into temporary storage and sort
-  thrust::device_vector<Entry> sorted_entries(num_valid);
+  dh::device_vector<Entry> sorted_entries(num_valid);
```
sriramch (Contributor):

I think a caching_device_vector can be used everywhere. What is precluding its usage?

RAMitchell (Member, Author):

Yep, that should be used everywhere here. Allocations larger than 1 GB will just be standard allocations anyway. The only danger of caching_device_vector is that it does not default-initialise memory.
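The default-initialisation caveat can be illustrated with a minimal host-side caching-allocator sketch. This is an illustration of the general pattern, not the `dh::` implementation: a freed block is parked in a pool and later handed back with its old contents intact, so memory obtained through the cache must not be assumed to be zeroed.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Minimal caching allocator: Free() parks blocks in a pool keyed by size,
// and a later Allocate() of the same size reuses a pooled block *without*
// clearing it. Pooled blocks are intentionally never deleted in this sketch.
class CachingPool {
 public:
  char* Allocate(std::size_t n) {
    auto it = pool_.find(n);
    if (it != pool_.end() && !it->second.empty()) {
      char* p = it->second.back();
      it->second.pop_back();
      return p;  // reused block: still holds its previous contents
    }
    return new char[n]();  // fresh block: value-initialised to zero
  }
  void Free(char* p, std::size_t n) { pool_[n].push_back(p); }

 private:
  std::map<std::size_t, std::vector<char*>> pool_;
};
```

Allocating, writing, freeing, and then allocating the same size again returns the same block with its stale contents, which is exactly why a vector built on a caching allocator cannot rely on default initialisation.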

```cpp
bool debug_synchronize;
// declare parameters
DMLC_DECLARE_PARAMETER(GPUHistMakerTrainParam) {
  DMLC_DECLARE_FIELD(single_precision_histogram).set_default(false).describe(
      "Use single precision to build histograms.");
  DMLC_DECLARE_FIELD(deterministic_histogram).set_default(true).describe(
      "Pre-round the gradient for obtaining deterministic gradient histogram.");
  DMLC_DECLARE_FIELD(gpu_batch_nrows)
```
sriramch (Contributor):

Are there existing use-cases that use this? It looks like the (breaking) new behavior is to auto-deduce, and I'm wondering if there are configs that use -1 to pull everything in one shot as opposed to looping (with perhaps better latencies).

RAMitchell (Member, Author):

The current implementation will use up to 80% of the available memory, so the 'do everything in one batch' approach would only be slightly better in the case where more than 80% of memory is needed. The current autodetect behaviour is able to use more of the available memory than the old implementation, and should have lower latencies.

The gpu_batch_nrows parameter was never documented, so we have no commitment to support it; I don't think we use it anywhere apart from maybe testing.
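The one-shot-versus-looping trade-off above comes down to how the row range is partitioned. A minimal sketch of the batching loop, with `PlanBatches` as a hypothetical stand-in for the PR's actual batch planning (each range would be handed to a ProcessBatch-style call):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Split [0, total_rows) into consecutive [begin, end) ranges of at most
// batch_rows rows each; with batch_rows >= total_rows this degenerates to
// a single one-shot batch.
std::vector<std::pair<std::size_t, std::size_t>> PlanBatches(
    std::size_t total_rows, std::size_t batch_rows) {
  std::vector<std::pair<std::size_t, std::size_t>> ranges;
  for (std::size_t begin = 0; begin < total_rows; begin += batch_rows) {
    std::size_t end = std::min(begin + batch_rows, total_rows);
    ranges.emplace_back(begin, end);
  }
  return ranges;
}
```

When the auto-deduced batch size already covers the whole dataset, the loop runs exactly once, so the removed `-1` (one-shot) configuration is effectively subsumed by autodetection whenever the data fits in the memory budget.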

RAMitchell (Member, Author):

@trivialfis, can I get a review please?

codecov-io:

Codecov Report

Merging #5407 into master will not change coverage. The diff coverage is n/a.

```
@@           Coverage Diff           @@
##           master    #5407   +/-   ##
=======================================
  Coverage   84.07%   84.07%
=======================================
  Files          11       11
  Lines        2411     2411
=======================================
  Hits         2027     2027
  Misses        384      384
```

Last update 3ad4333...5e4255c.

RAMitchell merged commit b745b7a into dmlc:master on Mar 14, 2020.
The lock bot locked this conversation as resolved and limited it to collaborators on Jun 24, 2020.