[C++][Parquet] DeltaLengthByteArrayEncoder::Put may reserve too much space #35665

Closed
pitrou opened this issue May 18, 2023 · 6 comments · Fixed by #35670

Comments

@pitrou
Member

pitrou commented May 18, 2023

Describe the bug, including details regarding any error messages, version, and platform.

DeltaLengthByteArrayEncoder::Put passes the total desired capacity (encoded_size_) to BufferBuilder::Reserve. However, BufferBuilder::Reserve expects the additional desired capacity. This will therefore request much more memory than desired, and perhaps reallocate more often than expected.

template <typename DType>
void DeltaLengthByteArrayEncoder<DType>::Put(const T* src, int num_values) {
  if (num_values == 0) {
    return;
  }
  constexpr int kBatchSize = 256;
  std::array<int32_t, kBatchSize> lengths;
  for (int idx = 0; idx < num_values; idx += kBatchSize) {
    const int batch_size = std::min(kBatchSize, num_values - idx);
    for (int j = 0; j < batch_size; ++j) {
      const int32_t len = src[idx + j].len;
      if (AddWithOverflow(encoded_size_, len, &encoded_size_)) {
        throw ParquetException("excess expansion in DELTA_LENGTH_BYTE_ARRAY");
      }
      lengths[j] = len;
    }
    length_encoder_.Put(lengths.data(), batch_size);
  }
  PARQUET_THROW_NOT_OK(sink_.Reserve(encoded_size_));
  for (int idx = 0; idx < num_values; idx++) {
    sink_.UnsafeAppend(src[idx].ptr, src[idx].len);
  }
}
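
To make the over-reservation concrete, here is a minimal, self-contained sketch. ToySink is a hypothetical stand-in that only mimics the additional-bytes semantics described above for BufferBuilder::Reserve; it is not Arrow's class, and PutReservingTotal / PutReservingBatch are illustrative names. Overflow checks are left out for brevity. The second function shows one possible way to avoid the problem, not necessarily what an actual fix would look like.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>
#include <vector>

// Hypothetical sink (NOT Arrow's BufferBuilder): Reserve() takes the number of
// *additional* bytes wanted beyond the current length.
class ToySink {
 public:
  void Reserve(std::int64_t additional) {
    assert(additional >= 0);
    const std::int64_t needed = length_ + additional;
    if (needed > capacity()) data_.resize(static_cast<std::size_t>(needed));
  }
  // Appends without growing; the caller must have reserved enough space.
  void UnsafeAppend(const char* ptr, std::int64_t len) {
    assert(len >= 0 && length_ + len <= capacity());
    if (len > 0) {
      std::memcpy(data_.data() + length_, ptr, static_cast<std::size_t>(len));
    }
    length_ += len;
  }
  std::int64_t length() const { return length_; }
  std::int64_t capacity() const { return static_cast<std::int64_t>(data_.size()); }

 private:
  std::vector<char> data_;
  std::int64_t length_ = 0;
};

using Values = std::vector<std::string_view>;

// Pattern from the issue: the running total is passed to Reserve(), so every
// call after the first asks for more space than it will write.
void PutReservingTotal(ToySink& sink, std::int64_t& encoded_size, const Values& values) {
  for (std::string_view v : values) {
    encoded_size += static_cast<std::int64_t>(v.size());
  }
  sink.Reserve(encoded_size);
  for (std::string_view v : values) {
    sink.UnsafeAppend(v.data(), static_cast<std::int64_t>(v.size()));
  }
}

// One possible alternative: sum this call's bytes separately and reserve only those.
void PutReservingBatch(ToySink& sink, std::int64_t& encoded_size, const Values& values) {
  std::int64_t batch_bytes = 0;
  for (std::string_view v : values) {
    batch_bytes += static_cast<std::int64_t>(v.size());
  }
  encoded_size += batch_bytes;
  sink.Reserve(batch_bytes);
  for (std::string_view v : values) {
    sink.UnsafeAppend(v.data(), static_cast<std::int64_t>(v.size()));
  }
}
```

With two calls of 12 bytes each, the first variant ends up with a capacity of 36 bytes for 24 bytes of data, while the second ends up with exactly 24.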

Component(s)

C++

@pitrou
Member Author

pitrou commented May 18, 2023

@felipecrv Do you want to take a look at this?

Also @rok FYI.

@pitrou
Member Author

pitrou commented May 18, 2023

It would be a good idea to look for similar mistakes in other Parquet encoders/decoders.

@mapleFU
Member

mapleFU commented May 18, 2023

Maybe I can take a look and fix them.

@pitrou
Member Author

pitrou commented May 18, 2023

@mapleFU Feel free to.

@rok
Member

rok commented May 18, 2023

Thanks for catching this @pitrou! I seem to have mixed up Resize and Reserve.

@mapleFU
Member

mapleFU commented May 18, 2023

(Actually, it's a bit misleading. The C++ std::vector<T>::reserve interface takes the total capacity, which is how rok used it. But our Arrow code treats the argument as incremental...)
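
For comparison, a small sketch of the standard-library side of this mix-up. std::vector<T>::reserve takes the desired total capacity, so asking for "N more elements" has to be spelled out by adding the current size. ReserveAdditional is a hypothetical helper name used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: expresses "make room for `additional` more elements"
// on top of std::vector's total-capacity reserve().
template <typename T>
void ReserveAdditional(std::vector<T>& v, std::size_t additional) {
  assert(additional <= v.max_size() - v.size());  // guard against overflow
  v.reserve(v.size() + additional);
}
```

Calling v.reserve(4) on a vector that already holds 10 elements requests nothing new, whereas ReserveAdditional(v, 4) guarantees a capacity of at least 14.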

pitrou added a commit that referenced this issue May 18, 2023
… much space (#35670)

### Rationale for this change

`BufferBuilder::Reserve` differs from `std::vector<T>::reserve`: its argument refers to additional bytes, not to the total capacity.

### What changes are included in this PR?

Avoid allocating too much memory space for the encoding sink buffer.

### Are these changes tested?

No. The change may affect runtime performance but not correctness.

### Are there any user-facing changes?

No.

* Closes: #35665

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
pitrou added this to the 13.0.0 milestone May 18, 2023
raulcd changed the title from "DeltaLengthByteArrayEncoder::Put may reserve too much space" to "[C++][Parquet] DeltaLengthByteArrayEncoder::Put may reserve too much space" Jun 1, 2023