Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BLOCKING] Column batches in SparsePage are not merged properly #4130

Closed
hcho3 opened this issue Feb 12, 2019 · 11 comments · Fixed by #4193
Closed

[BLOCKING] Column batches in SparsePage are not merged properly #4130

hcho3 opened this issue Feb 12, 2019 · 11 comments · Fixed by #4193

Comments

@hcho3
Copy link
Collaborator

hcho3 commented Feb 12, 2019

I was playing around with external memory and found that approx algorithm fails for a 1666667x5 matrix (produced with CreateBigTestData).

Re-production: Add the following test to tests/cpp/test_learner.cc:

TEST(Learner, CheckMultiBatch) {
  using arg = std::pair<std::string, std::string>;

  // Create sufficiently large data to make two row pages
  dmlc::TemporaryDirectory tempdir;
  const std::string tmp_file = tempdir.path + "/big.libsvm";
  CreateBigTestData(tmp_file, 5000000);

  std::shared_ptr<DMatrix> dmat(xgboost::DMatrix::Load(
    tmp_file + "#" + tmp_file + ".cache", true, false));
  EXPECT_TRUE(FileExists(tmp_file + ".cache.row.page"));
  EXPECT_FALSE(dmat->SingleColBlock());

  size_t num_row = dmat->Info().num_row_;
  std::vector<bst_float> labels(num_row);
  for (size_t i = 0; i < num_row; ++i) {
    labels[i] = i % 2;
  }
  dmat->Info().SetInfo("label", labels.data(), DataType::kFloat32, num_row);

  std::vector<std::shared_ptr<DMatrix>> mat{dmat};
  auto learner = std::unique_ptr<Learner>(Learner::Create(mat));
  learner->Configure({arg{"objective", "binary:logistic"}});
  learner->InitModel();
  learner->UpdateOneIter(0, dmat.get());
}

Then compile Google tests with Address sanitizer enabled:

cmake .. -DGOOGLE_TEST=ON -DGTEST_ROOT=$PWD/../gtest/     \
         -DUSE_SANITIZER=ON -DENABLED_SANITIZERS=address -DCMAKE_BUILD_TYPE=Debug

Now running ./testxgboost --gtest_filter=Learner.CheckMultiBatch will cause Heap Overflow exception.

Diagnosis. The current logic for merging SparsePage objects is not correct for column pages. In this example, we have two sparse pages, whose dimensions are 1099080x5 and 567587x5 respectively. The approx updater calls GetSortedColumnBatches() method, which first transposes the sparse pages (hence the dimensions become 5x1099080 and 5x567587) and then merge the two pages via the Push() method. The Push() method is not aware that the transposed pages represent column batches, so the method incorrectly produces a page of size 10x1099080.

You can add the following diagnostic outputs:

diff --git a/include/xgboost/data.h b/include/xgboost/data.h
index e2d800ca..18aa2bb7 100644
--- a/include/xgboost/data.h
+++ b/include/xgboost/data.h
@@ -271,6 +271,7 @@ class SparsePage {
    * \param batch the row page
    */
   inline void Push(const SparsePage &batch) {
+    std::cout << "Push() before: Size() = " << Size() << std::endl;
     auto& data_vec = data.HostVector();
     auto& offset_vec = offset.HostVector();
     const auto& batch_offset_vec = batch.offset.HostVector();
@@ -285,6 +286,7 @@ class SparsePage {
     for (size_t i = 0; i < batch.Size(); ++i) {
       offset_vec[i + begin] = top + batch_offset_vec[i + 1];
     }
+    std::cout << "Push() after: Size() = " << Size() << std::endl;
   }
   /*!
    * \brief Push one instance into page
diff --git a/src/tree/updater_basemaker-inl.h b/src/tree/updater_basemaker-inl.h
index 1a8238e7..3125b419 100644
--- a/src/tree/updater_basemaker-inl.h
+++ b/src/tree/updater_basemaker-inl.h
@@ -46,9 +46,11 @@ class BaseMaker: public TreeUpdater {
                 -std::numeric_limits<bst_float>::max());
       // start accumulating statistics
       for (const auto &batch : p_fmat->GetSortedColumnBatches()) {
+        std::cout << "batch.Size() = " << batch.Size() << std::endl;
         for (bst_uint fid = 0; fid < batch.Size(); ++fid) {
           auto c = batch[fid];
           if (c.size() != 0) {
+            CHECK_LT(fid * 2, fminmax_.size());
             fminmax_[fid * 2 + 0] =
                 std::max(-c[0].fvalue, fminmax_[fid * 2 + 0]);
             fminmax_[fid * 2 + 1] =

Proposed fix. Write a special logic to properly merge column batches.

@RAMitchell @trivialfis

@trivialfis
Copy link
Member

@hcho3 This might be related to #4037 .

@jjtav
Copy link

jjtav commented Feb 13, 2019

FWIW, the issue seems to be caused by a recent change. The 0.81 encounters segfaults on my data with 200K rows x 1000 sparse columns but 0.72 works with ~150MM rows x 3000 sparse columns.

@trivialfis
Copy link
Member

@hcho3 Hi, are you working on this? If not I will try to look into it.

@hcho3
Copy link
Collaborator Author

hcho3 commented Feb 18, 2019

@trivialfis I've got other task to take care of. I can look at it in a day or two.

@trivialfis
Copy link
Member

trivialfis commented Feb 21, 2019

I think the bug was introduced in #3395 due to lack of tests back then. That PR removed ColPageIter and made SparsePage exclusive to csr format. We need to somehow reintroduce csc format back to SparsePage.

This is the same bug with #4037 , but easier to reproduce. I think we should come out of a solution before publishing the next release, otherwise external memory will be completely broken.

@hcho3 @RAMitchell @CodingCat WDYT?

@CodingCat
Copy link
Member

I think it's critical to fix...and I am very curious why we didn't see this in production....

@hcho3
Copy link
Collaborator Author

hcho3 commented Feb 21, 2019

I concur. We should fix this bug.

@CodingCat
Copy link
Member

@trivialfis I would like to have a better understanding on this...why only support CSR is the root cause leading to the failure of @hcho3's test and also the issue in #4037 ?

@hcho3 hcho3 changed the title Column batches in SparsePage are not merged properly [BLOCKING] Column batches in SparsePage are not merged properly Feb 22, 2019
@hcho3
Copy link
Collaborator Author

hcho3 commented Feb 22, 2019

@CodingCat Because the merging logic for CSR doesn't work with transposed columns. Does my explanation in the example make sense to you?

@CodingCat
Copy link
Member

@hcho3 I am kind of understand this part now

@trivialfis where we are on this? if we cannot come up with an elegant solution in a short time, does a quick-and-dirty fix like adding a flag on Push() to distinguish row/column page work?

@trivialfis
Copy link
Member

Sorry I'm occupied by other things lately (about to graduate). Will try fixing it tomorrow.

@lock lock bot locked as resolved and limited conversation to collaborators May 30, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants