Skip to content

Commit

Permalink
GH-38432: [C++][Parquet] Try to fix performance regression in the Dic…
Browse files Browse the repository at this point in the history
…tByteArrayDecoderImpl (#38784)

### Rationale for this change

Do some changes mentioned in #38432

I believe this might fix #38577

Problem1:

The `BinaryHelper` might call `Prepare()` and `Prepare(estimated-output-binary-length)` for data. This might because:

1. For Plain Encoding ByteArray, the `len_` is similar to the data-page size, so `Reserve` is related.
2. For Dict Encoding. The Data Page is just a RLE encoding Page, it's `len_` might didn't directly related to output-binary. 

Problem2:

`Prepare` using `::arrow::kBinaryMemoryLimit` as min-value, we should use `this->chunk_space_remaining_`.

Problem3:

`std::optional<int64_t>` is hard to optimize for some compilers

### What changes are included in this PR?

Mention the behavior of BinaryHelper. And trying to fix it.

### Are these changes tested?

No

### Are there any user-facing changes?

Regression fixes

* Closes: #38432

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
  • Loading branch information
3 people committed Nov 27, 2023
1 parent b0e1f74 commit 6070815
Showing 1 changed file with 30 additions and 6 deletions.
36 changes: 30 additions & 6 deletions cpp/src/parquet/encoding.cc
Original file line number Diff line number Diff line change
Expand Up @@ -1196,24 +1196,40 @@ struct ArrowBinaryHelper<ByteArrayType> {
chunk_space_remaining_(::arrow::kBinaryMemoryLimit -
acc_->builder->value_data_length()) {}

// Prepare will reserve the number of entries remaining in the current chunk.
// If estimated_data_length is provided, it will also reserve the estimated data length,
// and the caller should better call `UnsafeAppend` instead of `Append` to avoid
// double-checking the data length.
Status Prepare(std::optional<int64_t> estimated_data_length = {}) {
RETURN_NOT_OK(acc_->builder->Reserve(entries_remaining_));
if (estimated_data_length.has_value()) {
RETURN_NOT_OK(acc_->builder->ReserveData(
std::min<int64_t>(*estimated_data_length, ::arrow::kBinaryMemoryLimit)));
std::min<int64_t>(*estimated_data_length, this->chunk_space_remaining_)));
}
return Status::OK();
}

Status PrepareNextInput(int64_t next_value_length) {
if (ARROW_PREDICT_FALSE(!CanFit(next_value_length))) {
// This element would exceed the capacity of a chunk
RETURN_NOT_OK(PushChunk());
RETURN_NOT_OK(acc_->builder->Reserve(entries_remaining_));
}
return Status::OK();
}

// If estimated_remaining_data_length is provided, it will also reserve the estimated
// data length, and the caller should better call `UnsafeAppend` instead of
// `Append` to avoid double-checking the data length.
Status PrepareNextInput(int64_t next_value_length,
std::optional<int64_t> estimated_remaining_data_length = {}) {
int64_t estimated_remaining_data_length) {
if (ARROW_PREDICT_FALSE(!CanFit(next_value_length))) {
// This element would exceed the capacity of a chunk
RETURN_NOT_OK(PushChunk());
RETURN_NOT_OK(acc_->builder->Reserve(entries_remaining_));
if (estimated_remaining_data_length.has_value()) {
if (estimated_remaining_data_length) {
RETURN_NOT_OK(acc_->builder->ReserveData(
std::min<int64_t>(*estimated_remaining_data_length, chunk_space_remaining_)));
std::min<int64_t>(estimated_remaining_data_length, chunk_space_remaining_)));
}
}
return Status::OK();
Expand Down Expand Up @@ -1271,8 +1287,10 @@ struct ArrowBinaryHelper<FLBAType> {
return acc_->Reserve(entries_remaining_);
}

Status PrepareNextInput(int64_t next_value_length) { return Status::OK(); }

Status PrepareNextInput(int64_t next_value_length,
std::optional<int64_t> estimated_remaining_data_length = {}) {
int64_t estimated_remaining_data_length) {
return Status::OK();
}

Expand Down Expand Up @@ -1915,6 +1933,9 @@ class DictByteArrayDecoderImpl : public DictDecoderImpl<ByteArrayType>,
int32_t indices[kBufferSize];

ArrowBinaryHelper<ByteArrayType> helper(out, num_values);
// The `len_` in the ByteArrayDictDecoder is the total length of the
// RLE/Bit-pack encoded data size, so, we cannot use `len_` to reserve
// space for binary data.
RETURN_NOT_OK(helper.Prepare());

auto dict_values = reinterpret_cast<const ByteArray*>(dictionary_->data());
Expand Down Expand Up @@ -1983,7 +2004,10 @@ class DictByteArrayDecoderImpl : public DictDecoderImpl<ByteArrayType>,
int values_decoded = 0;

ArrowBinaryHelper<ByteArrayType> helper(out, num_values);
RETURN_NOT_OK(helper.Prepare(len_));
// The `len_` in the ByteArrayDictDecoder is the total length of the
// RLE/Bit-pack encoded data size, so, we cannot use `len_` to reserve
// space for binary data.
RETURN_NOT_OK(helper.Prepare());

auto dict_values = reinterpret_cast<const ByteArray*>(dictionary_->data());

Expand Down

0 comments on commit 6070815

Please sign in to comment.