New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-15052: [C++][Parquet] Fix DELTA_BINARY_PACKED decoder when reading only one value #15124
Conversation
|
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this @mapleFU! This is a valuable fix.
cpp/src/parquet/encoding.cc
Outdated
buffer[i++] = last_value_; | ||
if (ARROW_PREDICT_FALSE(i == max_values)) break; | ||
if (ARROW_PREDICT_FALSE(i == max_values)) { | ||
if (i != static_cast<int>(total_value_count_)) { | ||
InitBlock(); | ||
} | ||
break; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this difficult to read, could you add a comment about what this logic is doing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, the logic is a bit trickey here, I will try to explain it. I'm bad in English, so maybe you can edit it directly if you can explain it better.
31f7558
to
0abb863
Compare
@tachyonwill I think this patch might fix |
Change the comments for fixing Co-authored-by: Rok Mihevc <rok@mihevc.org>
Retrigger CI because I think the fail is not caused by me, but still it looks like a bug: https://github.com/apache/arrow/actions/runs/3808810467/jobs/6479699758 By the way, is there any method that I can retrigger the CI? |
642d779
to
a019574
Compare
@pitrou Mind take a look here? |
In a few days certainly :-) |
Once your first PR is merged CI will start automatically. |
oops, I got a
Seems that maybe we should rerun it? |
No worries, this is unrelated to his PR. |
So how can I organize that code? I didn't make it clear. Is it like: case 1template <typename Type>
class TestEncodingBase : public ::testing::Test {
public:
...
private:
std::vector<int> small_read_batch_size_;
}; and TYPED_TEST(TestDeltaBitPackEncoding, SmallBatchRoundTrip) {
// .. set draws_ 1, 3
ASSERT_NO_FATAL_FAILURE(this->Execute(1, 1));
ASSERT_NO_FATAL_FAILURE(this->Execute(1, 2));
ASSERT_NO_FATAL_FAILURE(this->Execute(2, 2));
ASSERT_NO_FATAL_FAILURE(this->ExecuteSteps(10, 10));
ASSERT_NO_FATAL_FAILURE(this->ExecuteSteps(10, 10));
ASSERT_NO_FATAL_FAILURE(this->ExecuteSteps(1, 10));
} Or case 2template <typename Type>
class TestDeltaBitPackEncoding : public TestEncodingBase<Type> {
public:
using c_type = typename Type::c_type;
static constexpr int TYPE = Type::type_num;
static constexpr size_t ROUND_TRIP_TIMES = 3;
void InitReadBatchSizedData(...);
}; Can you help me make it clear? @pitrou |
You could have something like: const std::vector<int> kBatchSizes = {...};
void CheckRoundtrip() override {
auto encoder =
MakeTypedEncoder<Type>(Encoding::DELTA_BINARY_PACKED, false, descr_.get());
auto decoder = MakeTypedDecoder<Type>(Encoding::DELTA_BINARY_PACKED, descr_.get());
for (size_t i = 0; i < ROUND_TRIP_TIMES; ++i) {
for (const auto batch_size : kBatchSizes} {
... |
Seems that would causing our testing run a long time, should I make template <typename Type>
class TestDeltaBitPackEncoding : public TestEncodingBase<Type> {
public:
using c_type = typename Type::c_type;
static constexpr int TYPE = Type::type_num;
static constexpr size_t ROUND_TRIP_TIMES = 3;
void InitReadBatchSizedData(...);
};
void CheckRoundtrip() override {
...
auto read_batch_sizes = extra_read_batch_sizes;
read_batch_sizes.push_back(num_values_);
for (size_t i = 0; i < ROUND_TRIP_TIMES; ++i) {
for (int read_batch_size: read_batch_sizes) {
encoder->Put(draws_, num_values_);
encode_buffer_ = encoder->FlushValues();
decoder->SetData(num_values_, encode_buffer_->data(),
static_cast<int>(encode_buffer_->size()));
int values_decoded_sum = 0;
while (values_decoded_sum < num_values_) {
int values_decoded =
decoder->Decode(decode_buf_ + values_decoded_sum, read_batch_size);
values_decoded_sum += values_decoded;
}
ASSERT_EQ(num_values_, values_decoded_sum);
ASSERT_NO_FATAL_FAILURE(VerifyResults<c_type>(decode_buf_, draws_, num_values_));
}
}
} and
I'm arguing this, because when I set |
@pitrou comment fixed and merged your patch which could make test faster. Mind take a look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Benchmark runs are scheduled for baseline = ec9a8a3 and contender = 53d73f8. 53d73f8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
['Python', 'R'] benchmarks have high level of regressions. |
…eading only one value (apache#15124) This patch trying to fix apache#15052 . The problem is mentioned here: apache#15052 (comment) When read 1 value, DeltaBitPackDecoder will not call `InitBlock`, causing it always read `last_value_`. Seems the problem is introduced in apache#10627 and amol-@d982bed I will add some test tonight * Closes: apache#15052 Lead-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Co-authored-by: mwish <1506118561@qq.com> Co-authored-by: Rok Mihevc <rok@mihevc.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
…eading only one value (apache#15124) This patch trying to fix apache#15052 . The problem is mentioned here: apache#15052 (comment) When read 1 value, DeltaBitPackDecoder will not call `InitBlock`, causing it always read `last_value_`. Seems the problem is introduced in apache#10627 and amol-@d982bed I will add some test tonight * Closes: apache#15052 Lead-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Co-authored-by: mwish <1506118561@qq.com> Co-authored-by: Rok Mihevc <rok@mihevc.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
…eading only one value (apache#15124) This patch trying to fix apache#15052 . The problem is mentioned here: apache#15052 (comment) When read 1 value, DeltaBitPackDecoder will not call `InitBlock`, causing it always read `last_value_`. Seems the problem is introduced in apache#10627 and amol-@d982bed I will add some test tonight * Closes: apache#15052 Lead-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Co-authored-by: mwish <1506118561@qq.com> Co-authored-by: Rok Mihevc <rok@mihevc.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
This patch trying to fix #15052 . The problem is mentioned here: #15052 (comment)
When read 1 value, DeltaBitPackDecoder will not call
InitBlock
, causing it always readlast_value_
.Seems the problem is introduced in #10627 and amol-@d982bed
I will add some test tonight