PARQUET-492: [C++][Parquet] Basic support for reading DELTA_BYTE_ARRAY data. #10978
Conversation
…YTE_ARRAY and DELTA_BYTE_ARRAY. TODO: read corrupted files written with bug(PARQUET-246)
Thank you very much. I haven't read everything in detail but here are some things that stand out.
cpp/src/parquet/encoding.cc (Outdated)

    prefix_len_offset_ = 0;
    num_valid_values_ = num_prefix;

    suffix_decoder_.SetData(num_values, decoder_);
Here as well, it's weird to have the same BitReader shared by the two decoders.
The spec says:
"This is stored as a sequence of delta-encoded prefix lengths (DELTA_BINARY_PACKED), followed by the suffixes encoded as delta length byte arrays (DELTA_LENGTH_BYTE_ARRAY)."
This means you should ideally compute the position of the encoded suffixes and then call suffix_decoder_.SetData with the right (data, len) values.
I think we cannot get the end position of the DELTA_BINARY_PACKED data unless we have read and initialized all the blocks distributed in that data space, because in DELTA_BINARY_PACKED encoding the bit width of each miniblock is stored inside the blocks rather than in the header.
Do you mean that a method like DeltaBitPackDecoder::InitAllBlocks should be added? In this method we could initialize all the blocks in the DELTA_BINARY_PACKED data, compute the total byte count as the return value, and finally reset decoder_ for the follow-up DeltaBitPackDecoder::Decode calls.
I don't know how it should be done exactly, but as I said, Decode(T* buffer, int max_values) can be called incrementally (and you're already taking care to support that in the DeltaBitPackDecoder). You can compute the total bytes upfront in Init, or you can do it lazily...
Ideally, we would have unit tests for incremental decoding...
Or, as an alternative, you can raise an NYI when max_values is not equal to total_value_count_.
> Here as well, it's weird to have the same BitReader shared by the two decoders. The spec says:
> "This is stored as a sequence of delta-encoded prefix lengths (DELTA_BINARY_PACKED), followed by the suffixes encoded as delta length byte arrays (DELTA_LENGTH_BYTE_ARRAY)."
> This means you should ideally compute the position of the encoded suffixes and then call suffix_decoder_.SetData with the right (data, len) values.
I agree that this approach would make the code in DeltaByteArrayDecoder more straightforward. However, the current implementation also works when max_values is not equal to num_valid_values_. I added a unit test for incremental decoding and some notes :)
Maybe I should raise an NYI instead of computing the start position of the encoded suffixes upfront?
I'm not sure I understand your question. Did you check that your incremental test passes with a smaller max_values?
Yes. In a local test I printed max_values and num_valid_values_ in DeltaByteArrayDecoder::GetInternal: max_values is close to 100, while num_valid_values_ is close to 1000 at first.
Thanks for the update again.
- Can you look at the following comments?
- Can you rebase on a more recent git master?
    suffix_decoder_.SetDecoder(num_values, decoder_);

    // TODO: read corrupted files written with bug(PARQUET-246). last_value_ should be set
    // to last_value_in_previous_page_ when decoding a new page(except the first page)
Do you intend to work on this? If so, can you open a new JIRA for it?
Yes, I'm willing to solve this problem, but I'm not sure yet how to solve it. I opened a new JIRA here.
@shanhuuang Sorry, I had missed that you had requested another review! I'll try to do that soon.
LGTM. Sorry for the delay and thank you very much for working on this!
I'll wait for CI now.
Benchmark runs are scheduled for baseline = b97965a and contender = 00c94e0. 00c94e0 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Can the list of supported codecs at https://arrow.apache.org/docs/cpp/parquet.html#encodings be updated?
It is on git master, but unfortunately only docs for the released versions are published online.
OK, thanks @pitrou. Will all three DELTA_ codecs be in the next release?
I have no idea if @shanhuuang intends to implement
Got it, thanks again.