
PARQUET-492: [C++][Parquet] Basic support for reading DELTA_BYTE_ARRAY data. #10978

Closed (12 commits)

Conversation

shanhuuang (Contributor):

  1. Add "Advance" method in arrow::BitUtil::BitReader.
  2. Basic implement of DeltaLengthByteArrayDecoder and DeltaByteArrayDecoder.
  3. Test case of DeltaByteArrayDecoder, which relies on DeltaLengthByteArrayDecoder.

TODO:

  1. Read corrupted files written with the bug described in PARQUET-246.
  2. Add a test case for DeltaLengthByteArrayDecoder, if possible.
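
For context, DELTA_BYTE_ARRAY reconstructs each value by reusing a prefix of the previous value and appending a freshly decoded suffix. A minimal standalone sketch of that reconstruction loop (illustrative only, not the PR's actual code):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Rebuild DELTA_BYTE_ARRAY values from already-decoded components:
// prefix_lengths come from a DELTA_BINARY_PACKED section, suffixes from a
// DELTA_LENGTH_BYTE_ARRAY section.
std::vector<std::string> ReconstructDeltaByteArray(
    const std::vector<int>& prefix_lengths,
    const std::vector<std::string>& suffixes) {
  std::vector<std::string> values;
  values.reserve(prefix_lengths.size());
  std::string previous;
  for (std::size_t i = 0; i < prefix_lengths.size(); ++i) {
    // Each value shares its first prefix_lengths[i] bytes (assumed
    // non-negative and <= previous.size()) with the previous value.
    values.push_back(previous.substr(0, prefix_lengths[i]) + suffixes[i]);
    previous = values.back();
  }
  return values;
}
```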

pitrou (Member) left a comment:

Thank you very much. I haven't read everything in detail but here are some things that stand out.

Resolved review threads:
- cpp/src/arrow/util/bit_stream_utils.h
- cpp/src/parquet/arrow/arrow_reader_writer_test.cc (two threads)
- cpp/src/parquet/encoding.cc (two threads)
prefix_len_offset_ = 0;
num_valid_values_ = num_prefix;

suffix_decoder_.SetData(num_values, decoder_);

Member:

Here as well, it's weird to have the same BitReader shared by the two decoders.

The spec says:

> This is stored as a sequence of delta-encoded prefix lengths (DELTA_BINARY_PACKED), followed by the suffixes encoded as delta length byte arrays (DELTA_LENGTH_BYTE_ARRAY).

This means you should ideally compute the position of the encoded suffixes and then call suffix_decoder_.SetData with the right (data, len) values.
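
A minimal sketch of that suggestion, assuming the prefix-length decoder could report how many bytes its DELTA_BINARY_PACKED section occupies (InitAndGetEncodedSize and the stub types below are hypothetical, not the actual Arrow API):

```cpp
#include <cstdint>

// Stand-ins for the real decoders, just to make the sketch self-contained.
struct PrefixLenDecoder {
  // Hypothetical: scan the DELTA_BINARY_PACKED section and return its size.
  int InitAndGetEncodedSize(const uint8_t* data, int len);
};
struct SuffixDecoder {
  void SetData(int num_values, const uint8_t* data, int len);
};

// Slice the page buffer so each sub-decoder gets its own (data, len) range
// instead of sharing one BitReader.
void SetDeltaByteArrayData(PrefixLenDecoder& prefix_len_decoder,
                           SuffixDecoder& suffix_decoder,
                           int num_values, const uint8_t* data, int len) {
  // Learn where the prefix-length section ends...
  int prefix_bytes = prefix_len_decoder.InitAndGetEncodedSize(data, len);
  // ...and hand the suffix decoder only the bytes that follow it.
  suffix_decoder.SetData(num_values, data + prefix_bytes, len - prefix_bytes);
}
```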

shanhuuang (author):

I think we cannot get the end position of the DELTA_BINARY_PACKED data unless we have read and initialized all the blocks it contains, because in DELTA_BINARY_PACKED encoding the bit width of each miniblock is stored inside the blocks rather than in the header.

Do you mean that a method like DeltaBitPackDecoder::InitAllBlocks should be added? In this method we would initialize all the blocks in the DELTA_BINARY_PACKED data, compute the total byte count as the return value, and finally reset decoder_ for the follow-up DeltaBitPackDecoder::Decode calls.
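
For concreteness, such a method could scan the section block by block, following the DELTA_BINARY_PACKED layout in the Parquet format spec. A rough standalone sketch (hypothetical names, minimal error handling; not the PR's code):

```cpp
#include <cstdint>
#include <stdexcept>

// Read an unsigned LEB128 varint, advancing *pos.
static uint64_t ReadUleb128(const uint8_t* data, int len, int* pos) {
  uint64_t value = 0;
  int shift = 0;
  while (*pos < len) {
    uint8_t byte = data[(*pos)++];
    value |= static_cast<uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) return value;
    shift += 7;
  }
  throw std::runtime_error("truncated varint");
}

// Return the total byte length of a DELTA_BINARY_PACKED section.
int DeltaBinaryPackedEncodedSize(const uint8_t* data, int len) {
  int pos = 0;
  // Header: <block size> <miniblocks per block> <total value count> <first value>.
  uint64_t block_size = ReadUleb128(data, len, &pos);
  uint64_t miniblocks_per_block = ReadUleb128(data, len, &pos);
  uint64_t total_count = ReadUleb128(data, len, &pos);
  ReadUleb128(data, len, &pos);  // first value (zigzag); only its size matters here
  uint64_t values_per_miniblock = block_size / miniblocks_per_block;

  // The first value lives in the header; the blocks hold the remaining deltas.
  uint64_t remaining = total_count > 0 ? total_count - 1 : 0;
  while (remaining > 0) {
    ReadUleb128(data, len, &pos);  // min delta (zigzag)
    // Bit widths for every miniblock are always present in the block header.
    int widths_pos = pos;
    pos += static_cast<int>(miniblocks_per_block);
    uint64_t values_in_block = remaining < block_size ? remaining : block_size;
    // Only miniblocks that actually hold values are written (last one padded).
    uint64_t used =
        (values_in_block + values_per_miniblock - 1) / values_per_miniblock;
    for (uint64_t i = 0; i < used; ++i) {
      pos += static_cast<int>(data[widths_pos + i] * values_per_miniblock / 8);
    }
    remaining -= values_in_block;
  }
  return pos;
}
```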

Member:

I don't know exactly how it should be done, but as I said, Decode(T* buffer, int max_values) can be called incrementally (and you're already taking care to support that in DeltaBitPackDecoder). You can compute the total byte count upfront in Init, or you can do it lazily...

Ideally, we would have unit tests for incremental decoding...
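
Such a test could decode in small chunks and compare the concatenation against a one-shot decode. A hedged sketch of the chunked half (Decoder stands in for whatever decoder type is under test; real decoders may signal exhaustion differently):

```cpp
#include <vector>

// Pull values out of a decoder in small chunks, the way an incremental
// decoding test might, and collect everything that was decoded.
template <typename T, typename Decoder>
std::vector<T> DecodeInChunks(Decoder& decoder, int chunk_size) {
  std::vector<T> out;
  std::vector<T> buf(chunk_size);
  int n;
  // Assume Decode(T*, int) returns how many values were produced; 0 means done.
  while ((n = decoder.Decode(buf.data(), chunk_size)) > 0) {
    out.insert(out.end(), buf.begin(), buf.begin() + n);
  }
  return out;
}
```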

Member:

Or as an alternative, you can raise an NYI when max_values is not equal to total_value_count_.
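
The guard itself would be one line; a hedged sketch with illustrative surrounding names (ParquetException::NYI is a real helper in parquet/exception.h):

```cpp
#include "parquet/exception.h"

// Illustrative: refuse partial decodes until incremental decoding is supported.
void CheckFullDecode(int max_values, int total_value_count) {
  if (max_values != total_value_count) {
    parquet::ParquetException::NYI("partial decode of DELTA_BYTE_ARRAY");
  }
}
```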

shanhuuang (author):

> Here as well, it's weird to have the same BitReader shared by the two decoders.
>
> The spec says:
>
> > This is stored as a sequence of delta-encoded prefix lengths (DELTA_BINARY_PACKED), followed by the suffixes encoded as delta length byte arrays (DELTA_LENGTH_BYTE_ARRAY).
>
> This means you should ideally compute the position of the encoded suffixes and then call suffix_decoder_.SetData with the right (data, len) values.

I agree that this approach would make the code in DeltaByteArrayDecoder more straightforward. However, the current implementation also works when max_values is not equal to num_valid_values_. I added a unit test for incremental decoding and some notes :)

Or maybe I should raise an NYI and compute the start position of the encoded suffixes upfront?

Member:

I'm not sure I understand your question. Did you check that your incremental test passes with a smaller max_values?

shanhuuang (author):

Yes. In a local test I printed max_values and num_valid_values_ in DeltaByteArrayDecoder::GetInternal: max_values is close to 100, while num_valid_values_ starts out close to 1000.

pitrou (Member) left a comment:

Thanks again for the update.

  1. Can you look at the following comments?
  2. Can you rebase on a more recent git master?

Resolved review threads:
- cpp/src/parquet/arrow/arrow_reader_writer_test.cc (two threads)
- cpp/src/arrow/util/bit_stream_utils.h
- cpp/src/parquet/encoding.cc
suffix_decoder_.SetDecoder(num_values, decoder_);

// TODO: read corrupted files written with bug (PARQUET-246). last_value_ should be set
// to last_value_in_previous_page_ when decoding a new page (except the first page).

Member:

Do you intend to work on this? If so, can you open a new JIRA for it?

shanhuuang (author):

Yes, I'm willing to work on this problem, but I'm not sure yet how to solve it. I opened a new JIRA here.

@shanhuuang shanhuuang requested a review from pitrou October 8, 2021 02:25

pitrou commented Nov 8, 2021

@shanhuuang Sorry, I had missed that you had requested another review! I'll try to do that soon.

pitrou (Member) left a comment:

LGTM. Sorry for the delay and thank you very much for working on this!


pitrou commented Nov 8, 2021

I'll wait for CI now.

@pitrou pitrou closed this in 00c94e0 Nov 8, 2021

ursabot commented Nov 8, 2021

Benchmark runs are scheduled for baseline = b97965a and contender = 00c94e0. 00c94e0 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️1.03% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️1.25% ⬆️0.04%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

martindurant:

Can the list of supported codecs at https://arrow.apache.org/docs/cpp/parquet.html#encodings be updated?


pitrou commented Dec 8, 2021

It is updated on git master, but unfortunately only the docs for released versions are published online.

martindurant:

OK, thanks @pitrou. Will all three DELTA_ codecs be in the next release?


pitrou commented Dec 8, 2021

I have no idea if @shanhuuang intends to implement DELTA_LENGTH_BYTE_ARRAY. Also, they are only implemented on the read side (I don't know if writing will be done, nor when).

martindurant:

Got it, thanks again.
