Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-15052: [C++][Parquet] Fix DELTA_BINARY_PACKED decoder when reading only one value #15124

Merged
merged 12 commits into from Jan 4, 2023

Conversation

mapleFU
Copy link
Member

@mapleFU mapleFU commented Dec 30, 2022

This patch trying to fix #15052 . The problem is mentioned here: #15052 (comment)

When read 1 value, DeltaBitPackDecoder will not call InitBlock, causing it always read last_value_.

Seems the problem is introduced in #10627 and amol-@d982bed

I will add some test tonight

@github-actions
Copy link

@github-actions
Copy link

⚠️ GitHub issue #15052 has been automatically assigned in GitHub to PR creator.

@github-actions
Copy link

⚠️ GitHub issue #15052 has no components, please add labels for components.

@github-actions
Copy link

⚠️ GitHub issue #15052 has no components, please add labels for components.

Copy link
Member

@rok rok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this @mapleFU! This is a valuable fix.

Comment on lines 2483 to 2489
buffer[i++] = last_value_;
if (ARROW_PREDICT_FALSE(i == max_values)) break;
if (ARROW_PREDICT_FALSE(i == max_values)) {
if (i != static_cast<int>(total_value_count_)) {
InitBlock();
}
break;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this difficult to read, could you add a comment about what this logic is doing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, the logic is a bit trickey here, I will try to explain it. I'm bad in English, so maybe you can edit it directly if you can explain it better.

cpp/src/parquet/encoding.cc Show resolved Hide resolved
@mapleFU
Copy link
Member Author

mapleFU commented Dec 30, 2022

@tachyonwill I think this patch might fix DELTA_BINARY_PACKED issue, mind take a look?

cpp/src/parquet/encoding.cc Outdated Show resolved Hide resolved
Change the comments for fixing

Co-authored-by: Rok Mihevc <rok@mihevc.org>
@mapleFU
Copy link
Member Author

mapleFU commented Dec 31, 2022

Retrigger CI because I think the fail is not caused by me, but still it looks like a bug: https://github.com/apache/arrow/actions/runs/3808810467/jobs/6479699758

By the way, is there any method that I can retrigger the CI?

@mapleFU
Copy link
Member Author

mapleFU commented Dec 31, 2022

@pitrou Mind take a look here?

@pitrou
Copy link
Member

pitrou commented Dec 31, 2022

@pitrou Mind take a look here?

In a few days certainly :-)

@rok
Copy link
Member

rok commented Jan 2, 2023

By the way, is there any method that I can retrigger the CI?

Once your first PR is merged CI will start automatically.

cpp/src/parquet/encoding.cc Outdated Show resolved Hide resolved
@mapleFU mapleFU requested a review from pitrou January 3, 2023 11:55
@mapleFU
Copy link
Member Author

mapleFU commented Jan 3, 2023

oops, I got a

=================================== FAILURES ===================================
______________________ [doctest] pyarrow._fs.PyFileSystem ______________________
EXAMPLE LOCATION UNKNOWN, not showing all tests of that example
??? >>> gfs = github.GithubFileSystem('apache', 'arrow', sha='ec51aec4d15035f4d9d6a1c4346d0a2b9a37fb75')
UNEXPECTED EXCEPTION: HTTPError('403 Client Error: rate limit exceeded for url: https://api.github.com/repos/apache/arrow/git/trees/ec51aec4d15035f4d9d6a1c4346d0a2b9a37fb75')
Traceback (most recent call last):
  File "/opt/conda/envs/arrow/lib/python3.9/doctest.py", line 1334, in __run
    exec(compile(example.source, filename, "single",
  File "<doctest pyarrow._fs.PyFileSystem[1]>", line 1, in <module>
  File "/opt/conda/envs/arrow/lib/python3.9/site-packages/fsspec/spec.py", line 76, in __call__
    obj = super().__call__(*args, **kwargs)
  File "/opt/conda/envs/arrow/lib/python3.9/site-packages/fsspec/implementations/github.py", line 56, in __init__
    self.ls("")
  File "/opt/conda/envs/arrow/lib/python3.9/site-packages/fsspec/implementations/github.py", line 158, in ls
    r.raise_for_status()
  File "/opt/conda/envs/arrow/lib/python3.9/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: rate limit exceeded for url: https://api.github.com/repos/apache/arrow/git/trees/ec51aec4d15035f4d9d6a1c4346d0a2b9a37fb75
/opt/conda/envs/arrow/lib/python3.9/site-packages/pyarrow/_fs.cpython-39-x86_64-linux-gnu.so:None: UnexpectedException
=============================== warnings summary ===============================
opt/conda/envs/arrow/lib/python3.9/site-packages/pytest_cython/plugin.py:57
  /opt/conda/envs/arrow/lib/python3.9/site-packages/pytest_cython/plugin.py:57: PytestRemovedIn8Warning: The (fspath: py.path.local) argument to DoctestModule is deprecated. Please use the (path: pathlib.Path) argument instead.
  See https://docs.pytest.org/en/latest/deprecations.html#fspath-argument-for-node-constructors-replaced-with-pathlib-path
    return DoctestModule.from_parent(parent, fspath=path)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
SKIPPED [2] opt/conda/envs/arrow/lib/python3.9/site-packages/_pytest/doctest.py:455: all tests skipped by +SKIP option
============= 1 failed, 223 passed, 2 skipped, 1 warning in 8.70s ==============

Seems that maybe we should rerun it?

@pitrou
Copy link
Member

pitrou commented Jan 3, 2023

oops, I got a

[...]

Seems that maybe we should rerun it?

No worries, this is unrelated to his PR.

@mapleFU
Copy link
Member Author

mapleFU commented Jan 3, 2023

Well, it would actually be simpler without adding a separate test method... and it would exercise the code more thoroughly as well.

So how can I organize that code? I didn't make it clear. Is it like:

case 1

template <typename Type>
class TestEncodingBase : public ::testing::Test {
 public:
...
 private:
   std::vector<int> small_read_batch_size_;
};

and

TYPED_TEST(TestDeltaBitPackEncoding, SmallBatchRoundTrip) {
  // .. set draws_ 1, 3
  ASSERT_NO_FATAL_FAILURE(this->Execute(1, 1));
  ASSERT_NO_FATAL_FAILURE(this->Execute(1, 2));
  ASSERT_NO_FATAL_FAILURE(this->Execute(2, 2));
  ASSERT_NO_FATAL_FAILURE(this->ExecuteSteps(10, 10));
  ASSERT_NO_FATAL_FAILURE(this->ExecuteSteps(10, 10));
  ASSERT_NO_FATAL_FAILURE(this->ExecuteSteps(1, 10));
}

Or

case 2

template <typename Type>
class TestDeltaBitPackEncoding : public TestEncodingBase<Type> {
 public:
  using c_type = typename Type::c_type;
  static constexpr int TYPE = Type::type_num;
  static constexpr size_t ROUND_TRIP_TIMES = 3;

  void InitReadBatchSizedData(...);
};

Can you help me make it clear? @pitrou

@pitrou
Copy link
Member

pitrou commented Jan 3, 2023

You could have something like:

  const std::vector<int> kBatchSizes = {...};

  void CheckRoundtrip() override {
    auto encoder =
        MakeTypedEncoder<Type>(Encoding::DELTA_BINARY_PACKED, false, descr_.get());
    auto decoder = MakeTypedDecoder<Type>(Encoding::DELTA_BINARY_PACKED, descr_.get());
    for (size_t i = 0; i < ROUND_TRIP_TIMES; ++i) {
      for (const auto batch_size : kBatchSizes} {
        ...

@mapleFU
Copy link
Member Author

mapleFU commented Jan 3, 2023

Seems that would causing our testing run a long time, should I make kBatchSizes configurable, and just using another Execute? Like:

template <typename Type>
class TestDeltaBitPackEncoding : public TestEncodingBase<Type> {
 public:
  using c_type = typename Type::c_type;
  static constexpr int TYPE = Type::type_num;
  static constexpr size_t ROUND_TRIP_TIMES = 3;

  void InitReadBatchSizedData(...);
};

  void CheckRoundtrip() override {
    ...
    auto read_batch_sizes = extra_read_batch_sizes;
    read_batch_sizes.push_back(num_values_);
    for (size_t i = 0; i < ROUND_TRIP_TIMES; ++i) {
      for (int read_batch_size: read_batch_sizes) {
        encoder->Put(draws_, num_values_);
        encode_buffer_ = encoder->FlushValues();

        decoder->SetData(num_values_, encode_buffer_->data(),
                         static_cast<int>(encode_buffer_->size()));

        int values_decoded_sum = 0;
        while (values_decoded_sum < num_values_) {
          int values_decoded =
              decoder->Decode(decode_buf_ + values_decoded_sum, read_batch_size);
          values_decoded_sum += values_decoded;
        }
        ASSERT_EQ(num_values_, values_decoded_sum);
        ASSERT_NO_FATAL_FAILURE(VerifyResults<c_type>(decode_buf_, draws_, num_values_));
      }
    }
  }

and

void ExecuteSteps(int nvalues, int repeats, std::vector<int> extra_read_batch_size) {
  this->InitReadBatchSizedData(nvalues, repeats, extra_read_batch_size);
  this->CheckRoundtrip();
}

I'm arguing this, because when I set const std::vector<int> EXTRA_READ_BATCH_SIZES = {1, 10}; and make CheckRoundTrip run it every time, the testing will spend 5 minutes in debug mode, which makes me mad :( ExecuteSteps will only finished in one second

@mapleFU
Copy link
Member Author

mapleFU commented Jan 4, 2023

@pitrou comment fixed and merged your patch which could make test faster. Mind take a look?

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@pitrou pitrou changed the title GH-15052: [C++][Parquet] Fixing DELTA_BINARY_PACKED when reading only 1 GH-15052: [C++][Parquet] Fix DELTA_BINARY_PACKED decoder when reading only one value Jan 4, 2023
@pitrou pitrou merged commit 53d73f8 into apache:master Jan 4, 2023
@mapleFU mapleFU deleted the parquet/fix-gh-15052 branch January 4, 2023 15:12
@ursabot
Copy link

ursabot commented Jan 5, 2023

Benchmark runs are scheduled for baseline = ec9a8a3 and contender = 53d73f8. 53d73f8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️1.11% ⬆️0.0%] test-mac-arm
[Finished ⬇️12.24% ⬆️0.0%] ursa-i9-9960x
[Failed ⬇️2.86% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 53d73f8f ec2-t3-xlarge-us-east-2
[Failed] 53d73f8f test-mac-arm
[Finished] 53d73f8f ursa-i9-9960x
[Failed] 53d73f8f ursa-thinkcentre-m75q
[Finished] ec9a8a32 ec2-t3-xlarge-us-east-2
[Failed] ec9a8a32 test-mac-arm
[Finished] ec9a8a32 ursa-i9-9960x
[Failed] ec9a8a32 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented Jan 5, 2023

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

EpsilonPrime pushed a commit to EpsilonPrime/arrow that referenced this pull request Jan 5, 2023
…eading only one value (apache#15124)

This patch trying to fix apache#15052 . The problem is mentioned here: apache#15052 (comment)

When read 1 value, DeltaBitPackDecoder will not call `InitBlock`, causing it always read `last_value_`.

Seems the problem is introduced in apache#10627 and amol-@d982bed

I will add some test tonight
* Closes: apache#15052

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
EpsilonPrime pushed a commit to EpsilonPrime/arrow that referenced this pull request Jan 5, 2023
…eading only one value (apache#15124)

This patch trying to fix apache#15052 . The problem is mentioned here: apache#15052 (comment)

When read 1 value, DeltaBitPackDecoder will not call `InitBlock`, causing it always read `last_value_`.

Seems the problem is introduced in apache#10627 and amol-@d982bed

I will add some test tonight
* Closes: apache#15052

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request Jan 9, 2023
…eading only one value (apache#15124)

This patch trying to fix apache#15052 . The problem is mentioned here: apache#15052 (comment)

When read 1 value, DeltaBitPackDecoder will not call `InitBlock`, causing it always read `last_value_`.

Seems the problem is introduced in apache#10627 and amol-@d982bed

I will add some test tonight
* Closes: apache#15052

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][Parquet] Getting only 0 when reading DELTA_BINARY_PACKED
4 participants