
PARQUET-492: [C++][Parquet] Basic support for reading DELTA_BYTE_ARRAY data. #10978

Closed (12 commits)

Conversation

shanhuuang (Contributor):

  1. Add "Advance" method in arrow::BitUtil::BitReader.
  2. Basic implement of DeltaLengthByteArrayDecoder and DeltaByteArrayDecoder.
  3. Test case of DeltaByteArrayDecoder, which relies on DeltaLengthByteArrayDecoder.

TODO:

  1. Read corrupted files written with the bug described in PARQUET-246.
  2. Add a test case for DeltaLengthByteArrayDecoder, if possible.
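
For context, DELTA_BYTE_ARRAY reconstructs each value by reusing a prefix of the previous value and appending a freshly decoded suffix. A minimal standalone sketch of that reconstruction loop (illustrative only, not the PR's actual code):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Rebuild DELTA_BYTE_ARRAY values from already-decoded components:
// prefix_lengths come from a DELTA_BINARY_PACKED section, suffixes from a
// DELTA_LENGTH_BYTE_ARRAY section.
std::vector<std::string> ReconstructDeltaByteArray(
    const std::vector<int>& prefix_lengths,
    const std::vector<std::string>& suffixes) {
  std::vector<std::string> values;
  values.reserve(prefix_lengths.size());
  std::string previous;
  for (std::size_t i = 0; i < prefix_lengths.size(); ++i) {
    // Each value shares its first prefix_lengths[i] bytes (assumed
    // non-negative and <= previous.size()) with the previous value.
    values.push_back(previous.substr(0, prefix_lengths[i]) + suffixes[i]);
    previous = values.back();
  }
  return values;
}
```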

pitrou (Member) left a comment:

Thank you very much. I haven't read everything in detail but here are some things that stand out.

Resolved review threads:
- cpp/src/arrow/util/bit_stream_utils.h
- cpp/src/parquet/arrow/arrow_reader_writer_test.cc (two threads)
- cpp/src/parquet/encoding.cc (two threads)
prefix_len_offset_ = 0;
num_valid_values_ = num_prefix;

suffix_decoder_.SetData(num_values, decoder_);

Member:

Here as well, it's weird to have the same BitReader shared by the two decoders.

The spec says:

> This is stored as a sequence of delta-encoded prefix lengths (DELTA_BINARY_PACKED), followed by the suffixes encoded as delta length byte arrays (DELTA_LENGTH_BYTE_ARRAY).

This means you should ideally compute the position of the encoded suffixes and then call suffix_decoder_.SetData with the right (data, len) values.
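
A minimal sketch of that suggestion, assuming the prefix-length decoder could report how many bytes its DELTA_BINARY_PACKED section occupies (InitAndGetEncodedSize and the stub types below are hypothetical, not the actual Arrow API):

```cpp
#include <cstdint>

// Stand-ins for the real decoders, just to make the sketch self-contained.
struct PrefixLenDecoder {
  // Hypothetical: scan the DELTA_BINARY_PACKED section and return its size.
  int InitAndGetEncodedSize(const uint8_t* data, int len);
};
struct SuffixDecoder {
  void SetData(int num_values, const uint8_t* data, int len);
};

// Slice the page buffer so each sub-decoder gets its own (data, len) range
// instead of sharing one BitReader.
void SetDeltaByteArrayData(PrefixLenDecoder& prefix_len_decoder,
                           SuffixDecoder& suffix_decoder,
                           int num_values, const uint8_t* data, int len) {
  // Learn where the prefix-length section ends...
  int prefix_bytes = prefix_len_decoder.InitAndGetEncodedSize(data, len);
  // ...and hand the suffix decoder only the bytes that follow it.
  suffix_decoder.SetData(num_values, data + prefix_bytes, len - prefix_bytes);
}
```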

shanhuuang (author):

I think we cannot get the end position of the DELTA_BINARY_PACKED data unless we have read and initialized all the blocks it contains, because in DELTA_BINARY_PACKED encoding the bit width of each miniblock is stored inside the blocks rather than in the header.

Do you mean that a method like DeltaBitPackDecoder::InitAllBlocks should be added? In this method we would initialize all the blocks in the DELTA_BINARY_PACKED data, compute the total byte count as the return value, and finally reset decoder_ for the follow-up DeltaBitPackDecoder::Decode calls.
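
For concreteness, such a method could scan the section block by block, following the DELTA_BINARY_PACKED layout in the Parquet format spec. A rough standalone sketch (hypothetical names, minimal error handling; not the PR's code):

```cpp
#include <cstdint>
#include <stdexcept>

// Read an unsigned LEB128 varint, advancing *pos.
static uint64_t ReadUleb128(const uint8_t* data, int len, int* pos) {
  uint64_t value = 0;
  int shift = 0;
  while (*pos < len) {
    uint8_t byte = data[(*pos)++];
    value |= static_cast<uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) return value;
    shift += 7;
  }
  throw std::runtime_error("truncated varint");
}

// Return the total byte length of a DELTA_BINARY_PACKED section.
int DeltaBinaryPackedEncodedSize(const uint8_t* data, int len) {
  int pos = 0;
  // Header: <block size> <miniblocks per block> <total value count> <first value>.
  uint64_t block_size = ReadUleb128(data, len, &pos);
  uint64_t miniblocks_per_block = ReadUleb128(data, len, &pos);
  uint64_t total_count = ReadUleb128(data, len, &pos);
  ReadUleb128(data, len, &pos);  // first value (zigzag); only its size matters here
  uint64_t values_per_miniblock = block_size / miniblocks_per_block;

  // The first value lives in the header; the blocks hold the remaining deltas.
  uint64_t remaining = total_count > 0 ? total_count - 1 : 0;
  while (remaining > 0) {
    ReadUleb128(data, len, &pos);  // min delta (zigzag)
    // Bit widths for every miniblock are always present in the block header.
    int widths_pos = pos;
    pos += static_cast<int>(miniblocks_per_block);
    uint64_t values_in_block = remaining < block_size ? remaining : block_size;
    // Only miniblocks that actually hold values are written (last one padded).
    uint64_t used =
        (values_in_block + values_per_miniblock - 1) / values_per_miniblock;
    for (uint64_t i = 0; i < used; ++i) {
      pos += static_cast<int>(data[widths_pos + i] * values_per_miniblock / 8);
    }
    remaining -= values_in_block;
  }
  return pos;
}
```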

Member:

I don't know exactly how it should be done, but as I said, Decode(T* buffer, int max_values) can be called incrementally (and you're already taking care to support that in DeltaBitPackDecoder). You can compute the total byte count upfront in Init, or you can do it lazily...

Ideally, we would have unit tests for incremental decoding...
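
Such a test could decode in small chunks and compare the concatenation against a one-shot decode. A hedged sketch of the chunked half (Decoder stands in for whatever decoder type is under test; real decoders may signal exhaustion differently):

```cpp
#include <vector>

// Pull values out of a decoder in small chunks, the way an incremental
// decoding test might, and collect everything that was decoded.
template <typename T, typename Decoder>
std::vector<T> DecodeInChunks(Decoder& decoder, int chunk_size) {
  std::vector<T> out;
  std::vector<T> buf(chunk_size);
  int n;
  // Assume Decode(T*, int) returns how many values were produced; 0 means done.
  while ((n = decoder.Decode(buf.data(), chunk_size)) > 0) {
    out.insert(out.end(), buf.begin(), buf.begin() + n);
  }
  return out;
}
```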

Member:

Or as an alternative, you can raise an NYI when max_values is not equal to total_value_count_.
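
The guard itself would be one line; a hedged sketch with illustrative surrounding names (ParquetException::NYI is a real helper in parquet/exception.h):

```cpp
#include "parquet/exception.h"

// Illustrative: refuse partial decodes until incremental decoding is supported.
void CheckFullDecode(int max_values, int total_value_count) {
  if (max_values != total_value_count) {
    parquet::ParquetException::NYI("partial decode of DELTA_BYTE_ARRAY");
  }
}
```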

shanhuuang (author):

> Here as well, it's weird to have the same BitReader shared by the two decoders.
>
> The spec says:
>
> > This is stored as a sequence of delta-encoded prefix lengths (DELTA_BINARY_PACKED), followed by the suffixes encoded as delta length byte arrays (DELTA_LENGTH_BYTE_ARRAY).
>
> This means you should ideally compute the position of the encoded suffixes and then call suffix_decoder_.SetData with the right (data, len) values.

I agree that this approach would make the code in DeltaByteArrayDecoder more straightforward. However, the current implementation also works when max_values is not equal to num_valid_values_. I added a unit test for incremental decoding and some notes :)

Or maybe I should raise an NYI and compute the start position of the encoded suffixes upfront?

Member:

I'm not sure I understand your question. Did you check that your incremental test passes with a smaller max_values?

shanhuuang (author):

Yes. In a local test I printed max_values and num_valid_values_ in DeltaByteArrayDecoder::GetInternal: max_values is close to 100, while num_valid_values_ starts out close to 1000.

pitrou (Member) left a comment:

Thanks again for the update.

  1. Can you look at the following comments?
  2. Can you rebase on a more recent git master?

Resolved review threads:
- cpp/src/parquet/arrow/arrow_reader_writer_test.cc (two threads)
- cpp/src/arrow/util/bit_stream_utils.h
- cpp/src/parquet/encoding.cc
suffix_decoder_.SetDecoder(num_values, decoder_);

// TODO: read corrupted files written with bug (PARQUET-246). last_value_ should be set
// to last_value_in_previous_page_ when decoding a new page (except the first page).

Member:

Do you intend to work on this? If so, can you open a new JIRA for it?

shanhuuang (author):

Yes, I'm willing to work on this problem, but I'm not sure yet how to solve it. I opened a new JIRA here.

@shanhuuang shanhuuang requested a review from pitrou October 8, 2021 02:25

pitrou commented Nov 8, 2021

@shanhuuang Sorry, I had missed that you had requested another review! I'll try to do that soon.

pitrou (Member) left a comment:

LGTM. Sorry for the delay and thank you very much for working on this!


pitrou commented Nov 8, 2021

I'll wait for CI now.

@pitrou pitrou closed this in 00c94e0 Nov 8, 2021

ursabot commented Nov 8, 2021

Benchmark runs are scheduled for baseline = b97965a and contender = 00c94e0. 00c94e0 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️1.03% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️1.25% ⬆️0.04%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

martindurant:

Can the list of supported codecs at https://arrow.apache.org/docs/cpp/parquet.html#encodings be updated?


pitrou commented Dec 8, 2021

It is updated on git master, but unfortunately only the docs for released versions are published online.

martindurant:

OK, thanks @pitrou. Will all three DELTA_ codecs be in the next release?


pitrou commented Dec 8, 2021

I have no idea if @shanhuuang intends to implement DELTA_LENGTH_BYTE_ARRAY. Also, they are only implemented on the read side (I don't know if writing will be done, nor when).

martindurant:

Got it, thanks again.
