Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

External memory, gpu_hist and subsampling combination bug #7476

Closed
GinkoBalboa opened this issue Nov 23, 2021 · 4 comments · Fixed by #7481
Closed

External memory, gpu_hist and subsampling combination bug #7476

GinkoBalboa opened this issue Nov 23, 2021 · 4 comments · Fixed by #7481

Comments

@GinkoBalboa
Copy link
Contributor

Hi,

I've detected that xgboost segfaults when I try to use subsampling with external memory and gpu_hist as a parameter.

Segfault happens when the number of batches is greater than 1 and subsampling is less than 1.0.

Here I send the script that causes the segfault.

I have a solution for the bug, a pull request follows quickly.

external_memory_3.zip

Best Regards

GinkoBalboa added a commit to GinkoBalboa/xgboost that referenced this issue Nov 23, 2021
…7476)

- The error happens because when reading from external memory the batch is
  reassembled for every new iteration. The variable `original_page_` is
  initialized from the first batch, when the constructor of `GradiendBasedSample`
  is called. After iterating through the batches the original memory is not
  accessible, so when trying to access the memory pointed by `original_page_`
  causes an error.

- The solution is instead of accessing data from the `original_page_`, to access
  the data from the first page of the available batch.

fix dmlc#7476
GinkoBalboa added a commit to GinkoBalboa/xgboost that referenced this issue Nov 23, 2021
…7476)

- The error happens because when reading from external memory the batch is
  reassembled for every new iteration. The variable `original_page_` is
  initialized from the first batch, when the constructor of `GradiendBasedSample`
  is called. After iterating through the batches the original memory is not
  accessible, so when trying to access the memory pointed by `original_page_`
  causes an error.

- The solution is instead of accessing data from the `original_page_`, to access
  the data from the first page of the available batch.

fix dmlc#7476
@trivialfis
Copy link
Member

Thanks for sharing! Ping me when you have the fix. ;-)

@trivialfis
Copy link
Member

Might be tricky. One needs to make a copy of the original page.

@trivialfis
Copy link
Member

I think we can obtain the shared ptr via:

    std::shared_ptr<EllpackPage const> page =
        dmat->GetBatches<EllpackPage>(batch_param).begin().Page();

But I'm curious about your solution. ;-)

@GinkoBalboa
Copy link
Contributor Author

Hi,

Here is my solution. Regarding the original page, the memory that the original_page_ points to is no longer accesible after iterating. In that manner, I think a copy of that pointer is not needed.

Thank you for the suggestion.

Best Regards

GinkoBalboa added a commit to GinkoBalboa/xgboost that referenced this issue Nov 24, 2021
- This commit refers to the suggestion
  dmlc#7481 (review)

- Adds a test that accompanies the fix dmlc#7476, the
  test segfaults before the commit dmlc#7481.
trivialfis pushed a commit to GinkoBalboa/xgboost that referenced this issue Dec 24, 2021
…7476)

- The error happens because when reading from external memory the batch is
  reassembled for every new iteration. The variable `original_page_` is
  initialized from the first batch, when the constructor of `GradiendBasedSample`
  is called. After iterating through the batches the original memory is not
  accessible, so when trying to access the memory pointed by `original_page_`
  causes an error.

- The solution is instead of accessing data from the `original_page_`, to access
  the data from the first page of the available batch.

fix dmlc#7476
trivialfis pushed a commit to GinkoBalboa/xgboost that referenced this issue Dec 24, 2021
- This commit refers to the suggestion
  dmlc#7481 (review)

- Adds a test that accompanies the fix dmlc#7476, the
  test segfaults before the commit dmlc#7481.
trivialfis added a commit that referenced this issue Dec 24, 2021
…7481)

Instead of accessing data from the `original_page_`, access the data from the first page of the available batch.

fix #7476

Co-authored-by: jiamingy <jm.yuan@outlook.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants