
Fix external memory, gpu_hist and subsampling combination bug. (#7476) #7481

Merged (12 commits) on Dec 24, 2021

Conversation

GinkoBalboa (Contributor) commented:

  • The error happens because, when reading from external memory, the batch is
    reassembled for every new iteration. The variable `original_page_` is
    initialized from the first batch, when the constructor of `GradientBasedSampler`
    is called. After iterating through the batches, the original memory is no
    longer accessible, so trying to access the memory pointed to by
    `original_page_` causes an error.

  • The solution is, instead of accessing data through `original_page_`, to access
    the data from the first page of the currently available batch.
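The dangling-reference pattern described above can be sketched with a minimal Python mock (the names `make_batches`, `BuggySampler`, and `FixedSampler` are invented for illustration; xgboost's actual code is C++, where the stale access is a use-after-free rather than a merely stale object):

```python
def make_batches(n_pages=3, rows_per_page=4):
    """Mimic external memory: pages are rebuilt on every call, so objects
    from a previous iteration no longer back any live batch."""
    return [[float(p * rows_per_page + r) for r in range(rows_per_page)]
            for p in range(n_pages)]

class BuggySampler:
    """Caches a reference to the first page at construction, like `original_page_`."""
    def __init__(self, batches):
        self.original_page = batches[0]

class FixedSampler:
    """Reads the first page of whatever batch is currently live."""
    def first_page_sum(self, batches):
        return sum(batches[0])

batches = make_batches()
buggy = BuggySampler(batches)
batches = make_batches()  # external memory reassembles the batches for a new iteration
# The cached object no longer belongs to any live batch; in C++ that
# memory has been freed, so dereferencing `original_page_` crashes.
assert buggy.original_page is not batches[0]
safe = FixedSampler().first_page_sum(batches)  # reads the current batch: always valid
```

The fix in the PR follows the second pattern: no state is kept across iterations, so there is nothing to go stale.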

fix #7476

@trivialfis (Member) left a comment:

Thanks for the fix, removing a state is great! One question is inlined in the comments below.

Could you please add a test here:

`def run_data_iterator(`

so that we won't make the same mistake again in the future. Thanks for the nice work!

```diff
 // Compact the ELLPACK pages into the single sample page.
 thrust::fill(dh::tbegin(page_->gidx_buffer), dh::tend(page_->gidx_buffer), 0);
-for (auto& batch : dmat->GetBatches<EllpackPage>(batch_param_)) {
+for (auto& batch : batch_iterator) {
```
@trivialfis (Member) commented on Nov 24, 2021:

Is it possible that the first page is being duplicated?

GinkoBalboa (Contributor, Author) replied:

Do you mean that the new constructor is called with the first_page data and then the same page is used again in the iterator?

If that is the question, then I think the constructor only uses first_page to get some info on how to reserve memory for page_. The real processing is performed later, in the iterator, so first_page must appear there again.
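This division of labor can be sketched as follows (a minimal mock; `PageProcessor` is an invented name, not xgboost's class): the constructor only sizes a buffer from the first page without processing it, and the iteration then visits every page, the first one included, exactly once:

```python
class PageProcessor:
    """Sketch: __init__ inspects the first page only for sizing;
    process() then walks all pages, so no page's work is duplicated."""
    def __init__(self, pages):
        self.buffer = [0.0] * len(pages[0])  # sizing info only, no processing
        self.processed = []

    def process(self, pages):
        for page in pages:  # the first page appears here again, once
            self.processed.append(sum(page))
        return self.processed

pages = [[1.0, 2.0], [3.0, 4.0]]
p = PageProcessor(pages)
result = p.process(pages)
# each page contributes exactly one entry: the first page is read twice
# (once for sizing, once for processing) but processed only once
```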

GinkoBalboa added a commit to GinkoBalboa/xgboost that referenced this pull request on Nov 24, 2021:
- This commit refers to the suggestion in
  dmlc#7481 (review).

- Adds a test that accompanies the fix for dmlc#7476; the
  test segfaults before the commit dmlc#7481.
@GinkoBalboa (Contributor, Author) commented:

> Thanks for the fix, removing a state is great! One question is inlined in comment.
>
> Could you please add a test here:
>
> `def run_data_iterator(`
>
> so that we won't make the same mistake again in the future. Thanks for the nice work!

I've added the test you suggested. I just copied the existing code from the run_data_iterator test into a new test and added a subsample value less than 1.
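A toy stand-in for what such a test exercises (pure Python with invented names; the real test drives xgboost's external-memory data iterator with `gpu_hist` on a CUDA device): iterate over batches, as an external-memory reader would, and keep a `subsample` fraction of rows in each iteration:

```python
import random

def run_subsampled_batches(n_batches=4, rows=256, subsample=0.5, seed=7):
    """Toy stand-in for the added test: walk over per-iteration batches
    and keep roughly a `subsample` fraction of the rows in each."""
    rng = random.Random(seed)
    kept = total = 0
    for _ in range(n_batches):
        batch = [rng.gauss(0.0, 1.0) for _ in range(rows)]
        mask = [rng.random() < subsample for _ in batch]
        kept += sum(mask)
        total += rows
    return kept / total

frac = run_subsampled_batches()
assert abs(frac - 0.5) < 0.1  # the kept fraction tracks the requested ratio
```

Before the fix, the combination this models (external-memory batches plus subsampling) is exactly what segfaulted in the C++ code.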

Cheers!

@trivialfis (Member) commented:

Let me take a closer look and fix the test.

@GinkoBalboa (Contributor, Author) commented:

  • I put the test into a separate function to pinpoint where exactly it fails.
  • I merged the latest changes, because I was thinking the new CI changes might make a difference.
  • After the last run I noticed the failure was caused by larger differences between predicted and expected values, so I increased the relative tolerance from 1e-2 to 5e-2. I think the error is larger because we use subsampling. I hope the test will pass this time.
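For intuition, here is how a roughly 3% prediction difference behaves under the two tolerances (illustrative numbers only; the real test compares arrays, typically with `numpy.testing.assert_allclose`, whose tolerance formula differs slightly from `math.isclose`'s symmetric one):

```python
import math

predicted, expected = 1.030, 1.000  # about a 3% relative difference
within_1pct = math.isclose(predicted, expected, rel_tol=1e-2)
within_5pct = math.isclose(predicted, expected, rel_tol=5e-2)
assert not within_1pct  # rejected at the old tolerance of 1e-2
assert within_5pct      # accepted at the relaxed tolerance of 5e-2
```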

@trivialfis (Member) commented:

Thank you for the nice work. I will get back to this tomorrow. I don't have access to my device at the moment.

@GinkoBalboa (Contributor, Author) commented:

Ok, now I understand what you meant when you told me to add the test. Nice work, thank you.

I just added relaxing of the rtol when subsample < 1.0, since this is the main reason for the test failure. I added a scaled rtol; let's see if it passes.
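A hypothetical sketch of such scaling (the helper name and the factor are assumptions for illustration, not the PR's actual code):

```python
def scaled_rtol(subsample, base_rtol=1e-2, factor=5.0):
    """Hypothetical helper: relax the relative tolerance when
    subsampling (subsample < 1.0) adds stochastic noise."""
    return base_rtol * factor if subsample < 1.0 else base_rtol

assert scaled_rtol(1.0) == 1e-2                 # no subsampling: strict tolerance
assert abs(scaled_rtol(0.5) - 5e-2) < 1e-12    # subsampling: relaxed tolerance
```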

@trivialfis (Member) commented:

Will take a closer look. The quantiles are created differently with batched data, so testing the prediction result is not reliable.

@trivialfis (Member) left a comment:
The fix itself looks good. Will merge once all tests are green. Handling floating-point errors in these kinds of tests is always messy...

Thanks for the good work!

@trivialfis (Member) commented:

Opened an issue for the major cause of the floating-point error: #7488. Will follow up next week. (I'm currently out of the office, sorry.)

GinkoBalboa and others added 11 commits December 24, 2021 08:21
…7476)

- The error happens because, when reading from external memory, the batch is
  reassembled for every new iteration. The variable `original_page_` is
  initialized from the first batch, when the constructor of `GradientBasedSampler`
  is called. After iterating through the batches, the original memory is no
  longer accessible, so trying to access the memory pointed to by
  `original_page_` causes an error.

- The solution is, instead of accessing data through `original_page_`, to access
  the data from the first page of the currently available batch.

fix dmlc#7476
- This commit refers to the suggestion in
  dmlc#7481 (review).

- Adds a test that accompanies the fix for dmlc#7476; the
  test segfaults before the commit dmlc#7481.
- This test fails when run on CPU, so it should
  be run only on GPU.

- Added sampling method testing.
@trivialfis trivialfis merged commit 29bfa94 into dmlc:master Dec 24, 2021
@trivialfis (Member) commented:

Thanks for the fix and sorry for the long delay.

Successfully merging this pull request may close these issues.

External memory, gpu_hist and subsampling combination bug
2 participants