[BLOCKING] Column batches in SparsePage are not merged properly #4130
Comments
FWIW, the issue seems to be caused by a recent change. Version 0.81 segfaults on my data with 200K rows x 1000 sparse columns, but 0.72 works with ~150MM rows x 3000 sparse columns.
@hcho3 Hi, are you working on this? If not, I will try to look into it.
@trivialfis I've got other tasks to take care of. I can look at it in a day or two.
I think the bug was introduced in #3395, and went unnoticed due to the lack of tests back then. That PR removed […]. This is the same bug as #4037, but easier to reproduce. I think we should come up with a solution before publishing the next release, otherwise external memory will be completely broken. @hcho3 @RAMitchell @CodingCat WDYT?
I think it's critical to fix... and I am very curious why we didn't see this in production.
I concur. We should fix this bug.
@trivialfis I would like to understand this better: why is supporting only CSR the root cause of both the failure of @hcho3's test and the issue in #4037?
@CodingCat Because the merging logic for CSR doesn't work with transposed columns. Does my explanation in the example make sense to you?
@hcho3 I kind of understand this part now. @trivialfis where are we on this? If we cannot come up with an elegant solution in a short time, would a quick-and-dirty fix, like adding a flag on Push() to distinguish row pages from column pages, work?
Sorry, I've been occupied by other things lately (about to graduate). Will try fixing it tomorrow.
I was playing around with external memory and found that the `approx` algorithm fails for a 1666667x5 matrix (produced with `CreateBigTestData`).

Reproduction: Add the following test to `tests/cpp/test_learner.cc`:

Then compile Google tests with Address Sanitizer enabled:

Now running `./testxgboost --gtest_filter=Learner.CheckMultiBatch` will cause a heap overflow exception.

Diagnosis. The current logic for merging SparsePage objects is not correct for column pages. In this example, we have two sparse pages, whose dimensions are 1099080x5 and 567587x5 respectively (1099080 + 567587 = 1666667 rows in total). The `approx` updater calls the `GetSortedColumnBatches()` method, which first transposes the sparse pages (hence the dimensions become 5x1099080 and 5x567587) and then merges the two pages via the `Push()` method. The `Push()` method is not aware that the transposed pages represent column batches, so it appends the second page's columns as if they were additional rows and incorrectly produces a page of size 10x1099080 instead of the expected 5x1666667.

You can add the following diagnostic outputs:
Proposed fix. Write special logic to merge column batches properly.
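To make the difference concrete, here is a minimal sketch of the two merge strategies on a toy CSR-style container. `ToyPage`, `PushRows` and `PushColumns` are hypothetical names made up for illustration; this is not xgboost's `SparsePage` or its `Push()` method, and the toy pages store only values (no row indices). It only shows why appending offsets is fine for row pages but doubles the column count for transposed pages, and what a column-aware merge would have to do instead.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy CSR-style page (hypothetical; NOT xgboost's SparsePage).
// Segment i holds data[offset[i]] .. data[offset[i + 1] - 1].
// A segment is a row in a row page and a column in a transposed page.
struct ToyPage {
  std::vector<std::size_t> offset{0};
  std::vector<float> data;
  std::size_t Size() const { return offset.size() - 1; }
};

// Row-style merge: append the segments of `batch` after those of `page`.
// Correct for row pages. Applied to column pages, it yields
// 2 * n_columns segments instead of n_columns longer ones.
void PushRows(ToyPage* page, const ToyPage& batch) {
  const std::size_t top = page->offset.back();
  page->data.insert(page->data.end(), batch.data.begin(), batch.data.end());
  for (std::size_t i = 1; i < batch.offset.size(); ++i) {
    page->offset.push_back(top + batch.offset[i]);
  }
}

// Column-style merge: concatenate segment i of `batch` onto segment i
// of `page`, so the number of segments (columns) stays the same.
void PushColumns(ToyPage* page, const ToyPage& batch) {
  if (page->Size() == 0) {
    *page = batch;
    return;
  }
  assert(page->Size() == batch.Size());  // both pages must have the same columns
  ToyPage merged;
  for (std::size_t i = 0; i < page->Size(); ++i) {
    merged.data.insert(merged.data.end(),
                       page->data.begin() + page->offset[i],
                       page->data.begin() + page->offset[i + 1]);
    merged.data.insert(merged.data.end(),
                       batch.data.begin() + batch.offset[i],
                       batch.data.begin() + batch.offset[i + 1]);
    merged.offset.push_back(merged.data.size());
  }
  *page = std::move(merged);
}
```

With two 2-column pages, `PushRows` gives a page with 4 segments (the analogue of the 10x1099080 result above), while `PushColumns` keeps 2 segments whose lengths are the sums of the inputs. A real fix would also have to shift the row indices of the second batch by the number of rows in the first, which this toy omits.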
@RAMitchell @trivialfis