Fix memory usage of device sketching #5407
Conversation
@RAMitchell - i was able to build and run this with the following patch. i'll also provide some comments shortly:
src/common/hist_util.cu
Outdated
@@ -208,7 +221,8 @@ void ExtractWeightedCuts(int device, Span<SketchEntry> cuts,
 void ProcessBatch(int device, const SparsePage& page, size_t begin, size_t end,
                   SketchContainer* sketch_container, int num_cuts,
                   size_t num_columns) {
-  dh::XGBCachingDeviceAllocator<char> alloc;
+  dh::XGBCachingDeviceAllocator<char> caching_alloc;
+  dh::XGBDeviceAllocator<char> alloc;
should we get rid of this and use the caching_alloc throughout?
Yes
src/common/hist_util.cu
Outdated
@@ -385,7 +401,7 @@ void ProcessBatch(AdapterT* adapter, size_t begin, size_t end, float missing,
   size_t num_valid = host_column_sizes_scan.back();

   // Copy current subset of valid elements into temporary storage and sort
-  thrust::device_vector<Entry> sorted_entries(num_valid);
+  dh::device_vector<Entry> sorted_entries(num_valid);
i think a caching_device_vector can be used everywhere. what is precluding its usage?
Yep, that should be used everywhere here. Allocations larger than 1GB will just be standard allocations anyway. The only danger of caching_device_vector is that it does not default-initialise memory.
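To make the "does not default-initialise" caveat concrete, here is a minimal host-side sketch of the caching-allocator idea behind a caching device allocator. This is an illustrative analogue, not xgboost's actual dh::XGBCachingDeviceAllocator: freed blocks are parked in a free list keyed by size and handed straight back on the next same-sized request, so a recycled block still contains whatever bytes it held before.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>

// Hypothetical caching allocator (host memory stands in for device memory).
// A cache hit returns a previously used block *without* clearing it, which is
// exactly why a caching_device_vector cannot be assumed to be zero-filled.
class CachingAllocator {
  std::multimap<std::size_t, void*> free_blocks_;  // size -> reusable block

 public:
  void* Allocate(std::size_t bytes) {
    auto it = free_blocks_.find(bytes);
    if (it != free_blocks_.end()) {  // cache hit: reuse, no initialisation
      void* p = it->second;
      free_blocks_.erase(it);
      return p;
    }
    return std::malloc(bytes);       // cache miss: real allocation
  }

  void Deallocate(void* p, std::size_t bytes) {
    free_blocks_.emplace(bytes, p);  // keep the block for later reuse
  }

  ~CachingAllocator() {
    for (auto& kv : free_blocks_) std::free(kv.second);
  }
};
```

The upshot for the review comment above: switching everything to the caching allocator is fine as long as no code path relies on freshly allocated memory being zeroed.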
  bool debug_synchronize;
  // declare parameters
  DMLC_DECLARE_PARAMETER(GPUHistMakerTrainParam) {
    DMLC_DECLARE_FIELD(single_precision_histogram).set_default(false).describe(
        "Use single precision to build histograms.");
    DMLC_DECLARE_FIELD(deterministic_histogram).set_default(true).describe(
        "Pre-round the gradient for obtaining deterministic gradient histogram.");
    DMLC_DECLARE_FIELD(gpu_batch_nrows)
are there existing use-cases that use this? it looks like the (breaking) new behavior is to auto-deduce and i'm wondering if there are configs that use -1 to pull everything in one shot as opposed to looping (with perhaps better latencies).
The current implementation will use up to 80% of available memory, so the 'do everything in one batch' approach would only be slightly better in the case where >80% of memory is used. The current autodetect behaviour is able to use more of the available memory than the old implementation and should have faster latencies.
The gpu_batch_nrows parameter was never documented, so we have no commitment to support it; I don't think we use it anywhere apart from maybe testing.
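The 80%-of-available-memory batch sizing described above can be sketched roughly as follows. The function name, signature, and clamping behaviour are illustrative assumptions, not xgboost's actual implementation; only the 80% budget fraction comes from the discussion.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical batch-size helper: how many rows fit in a fraction (default
// 80%) of the currently free device memory, clamped to [1, total_rows].
std::size_t SketchBatchRows(std::size_t free_bytes, std::size_t bytes_per_row,
                            std::size_t total_rows, double fraction = 0.8) {
  auto budget = static_cast<std::size_t>(free_bytes * fraction);
  std::size_t rows = budget / bytes_per_row;  // rows fitting in the budget
  // Always process at least one row, and never more than the dataset has.
  return std::min(std::max<std::size_t>(rows, 1), total_rows);
}
```

With this shape, small datasets still run in a single batch (rows is clamped to total_rows), while large datasets are looped over in budget-sized chunks instead of relying on a user-tuned gpu_batch_nrows.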
@trivialfis can I get a review please.
Codecov Report

@@           Coverage Diff           @@
##           master    #5407   +/-   ##
=======================================
  Coverage   84.07%   84.07%
=======================================
  Files          11       11
  Lines        2411     2411
=======================================
  Hits         2027     2027
  Misses        384      384

Continue to review full report at Codecov.
This PR resolves a bug where, for large datasets, device sketching could exceed available device memory.
I added explicit tests for the memory usage of sketching, and a smarter batch-size calculation that can use up to 80% of the available memory.
@sriramch can you please verify.