-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix external memory, gpu_hist and subsampling combination bug. (#7476) #7481
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix, removing a state is great! One question is inlined in comment.
Could you please add a test here:
def run_data_iterator( |
|
||
// Compact the ELLPACK pages into the single sample page. | ||
thrust::fill(dh::tbegin(page_->gidx_buffer), dh::tend(page_->gidx_buffer), 0); | ||
for (auto& batch : dmat->GetBatches<EllpackPage>(batch_param_)) { | ||
for (auto& batch : batch_iterator) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible that the first page is being duplicated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean the new constructor called with first_page data and then again using the same page in the iterator?
If that is the question, then I think the constructor only uses first_page to get some info on how to reserve memory for page_. Later, in the iterator, the real processing is performed so first_page must appear again there.
- This commit refers to the suggestion dmlc#7481 (review) - Adds a test that accompanies the fix dmlc#7476, the test segfaults before the commit dmlc#7481.
I've added the test you suggested. Just copied the existing code from run_data_iterator test into a new test, and added subsample value less than 1. Cheers! |
Let me take a closer look and fix the test |
|
Thank you for the nice work. I will get back to this tomorrow. I don't have access to my device at the moment. |
Ok, now I understand what you meant when you told me to add test. Nice work, thank you. I just added the relaxing of rtol when subsample < 1.0. This is the main reason for test fail. I added scaled rtol, lets see if it will pass. |
Will take a closer look. The quantile is created differently with batched data, so testing prediction result is not reliable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fix itself looks good. Will merge once all tests are green. Handling floating-point errors in these kinds of tests are always messy...
Thanks for the good work!
Opened an issue for the major cause of floating-point error: #7488 Will follow up next week. (I'm currently out of the office, sorry). |
…7476) - The error happens because when reading from external memory the batch is reassembled for every new iteration. The variable `original_page_` is initialized from the first batch, when the constructor of `GradiendBasedSample` is called. After iterating through the batches the original memory is not accessible, so when trying to access the memory pointed by `original_page_` causes an error. - The solution is instead of accessing data from the `original_page_`, to access the data from the first page of the available batch. fix dmlc#7476
- This commit refers to the suggestion dmlc#7481 (review) - Adds a test that accompanies the fix dmlc#7476, the test segfaults before the commit dmlc#7481.
- This test fails when run on CPU, so only on GPU should be run. - Added sampling method testing.
da40315
to
88d0bff
Compare
Thanks for the fix and sorry for the long delay. |
The error happens because when reading from external memory the batch is
reassembled for every new iteration. The variable
original_page_
isinitialized from the first batch, when the constructor of
GradiendBasedSample
is called. After iterating through the batches the original memory is not
accessible, so when trying to access the memory pointed by
original_page_
causes an error.
The solution is instead of accessing data from the
original_page_
, to accessthe data from the first page of the available batch.
fix #7476