
GH-38271: [C++][Parquet] Support reading parquet files with multiple gzip members #38272

Merged: 5 commits merged into apache:main on Nov 29, 2023

Conversation

@amassalha (Contributor) commented Oct 15, 2023

What changes are included in this PR?

Adding support in GZipCodec to decompress concatenated gzip members (see the sketch after this description)

Are these changes tested?

test is attached

Are there any user-facing changes?

no
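
For context, the core mechanism is roughly the following (a minimal sketch, not the actual patch; the helper name GzipDecompressAll is hypothetical): zlib's inflate() returns Z_STREAM_END at the end of the first gzip member, so a decoder must call inflateReset() and keep inflating while compressed input remains.

// Minimal sketch: decompress a buffer holding one or more concatenated
// gzip members by resetting zlib's inflate state whenever one member
// ends while input remains. Hypothetical helper, not the actual patch.
#include <zlib.h>

#include <cstdint>
#include <stdexcept>

int64_t GzipDecompressAll(const uint8_t* input, int64_t input_len,
                          uint8_t* output, int64_t output_len) {
  z_stream stream{};
  // windowBits = 15 + 16 tells zlib to expect a gzip header/trailer.
  if (inflateInit2(&stream, 15 + 16) != Z_OK) {
    throw std::runtime_error("inflateInit2 failed");
  }
  stream.next_in = const_cast<Bytef*>(reinterpret_cast<const Bytef*>(input));
  stream.avail_in = static_cast<uInt>(input_len);
  stream.next_out = reinterpret_cast<Bytef*>(output);
  stream.avail_out = static_cast<uInt>(output_len);

  int ret;
  while (true) {
    ret = inflate(&stream, Z_FINISH);
    if (ret != Z_STREAM_END) break;   // error, or output buffer exhausted
    if (stream.avail_in == 0) break;  // all members consumed
    // A member ended but input remains: reset per-member state and keep
    // going; next_in/avail_in/next_out/avail_out carry over untouched.
    if (inflateReset(&stream) != Z_OK) {
      ret = Z_STREAM_ERROR;
      break;
    }
  }
  const int64_t total_out = output_len - static_cast<int64_t>(stream.avail_out);
  inflateEnd(&stream);
  if (ret != Z_STREAM_END) {
    throw std::runtime_error("corrupt or truncated gzip data");
  }
  return total_out;
}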

@github-actions (bot)

⚠️ GitHub issue #38271 has been automatically assigned in GitHub to PR creator.

@wgtmac (Member) left a comment

Thanks for the contribution! Could you please add a round-trip test for the GZipCodec, which will fail decoding without your fix?

@wgtmac changed the title GH-38271: [C++,Parquet] Support reading parquet files with multiple gzip members GH-38271: [C++][Parquet] Support reading parquet files with multiple gzip members Oct 16, 2023
@github-actions bot added the 'awaiting committer review' label and removed the 'awaiting review' label Oct 16, 2023
@mapleFU (Member) commented Oct 16, 2023

@amassalha How was the gzip-streaming parquet file generated? This patch looks generally OK, but I want to know how other systems support files like this.

@amassalha (Contributor, Author) commented Oct 16, 2023

@mapleFU To generate the gzip stream we use our internal high-performance gzip implementation, which fully adheres to the gzip RFC.
By the way, the file I added can be read successfully using Spark (tested with spark-shell).
@wgtmac Round-trip test added.
(patches updated)

@amassalha requested a review from wgtmac October 16, 2023 10:44
@mapleFU (Member) commented Oct 17, 2023

Yeah, I know it's allowed by the protocol. I will investigate whether arrow-rs and parquet-mr can write or read files like this.

Also, do you know which systems could generate parquet files with multiple gzip members?

@wgtmac (Member) commented Oct 18, 2023

I agree with @mapleFU. Do you have an example parquet file with multiple gzip members on hand? We need to make sure parquet-mr can decompress it successfully. If you don't mind, please upload one here and I can verify it on my end.

@amassalha (Contributor, Author) commented Oct 18, 2023

@wgtmac The parquet file is attached to the issue:
#38271

I also tried pushing it to the parquet-testing repo (https://github.com/apache/parquet-testing), but somehow I couldn't create a pull request (that's what is causing the failures in this run; it can't find the file). Any suggestions on that? Is parquet-testing also open for pull requests?

@mapleFU I'm not familiar with any other system (than ours) currently.

@mapleFU (Member) commented Oct 18, 2023

The CI failure might be related to the lint problem. You can run clang-format -i on your edited files.

How did you generate that parquet file?

@amassalha (Contributor, Author)

> we use our internal high-performance gzip implementation which fully adheres to the gzip RFC.

As I mentioned, we use our internal high-performance gzip implementation, which fully adheres to the gzip RFC. (I can't give more details; I work at a startup.)

@mapleFU (Member) commented Oct 18, 2023

https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L496-L498

Parquet also has two versions of LZ4 in the standard. I have no issue with the gzip standard itself, but maybe I need to make sure this kind of file is supported.

From the standard (https://github.com/apache/parquet-format/blob/master/Compression.md#gzip), the link is authoritative. Let me go through the parquet-mr implementation tonight to make this clear.

@mapleFU (Member) commented Oct 18, 2023

@amassalha I've tried the current implementation, and it seems we're able to read concatenated_gzip_members.zip, which contains 513 rows. Can you confirm that?

@amassalha (Contributor, Author)

> @amassalha I've tried the current implementation, and it seems we're able to read concatenated_gzip_members.zip, which contains 513 rows. Can you confirm that?

Read with the correct values? Because when I tried using "ReadBatch" it returned "0" instead of the last number (513).
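
A minimal repro along these lines would show the symptom (a sketch; the file path and the exact reader calls are assumptions, not code from this thread):

// Sketch: read long_col from the attached file with Parquet C++'s
// low-level column reader. Before the fix, values past the first gzip
// member came back as 0. File path is an assumption; adjust as needed.
#include <parquet/api/reader.h>

#include <cstdint>
#include <iostream>
#include <memory>
#include <vector>

int main() {
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile("concatenated_gzip_members.parquet");
  auto int64_reader = std::static_pointer_cast<parquet::Int64Reader>(
      reader->RowGroup(0)->Column(0));

  std::vector<int64_t> values(513);
  std::vector<int16_t> def_levels(513);
  int64_t values_read = 0;
  int64_reader->ReadBatch(513, def_levels.data(), /*rep_levels=*/nullptr,
                          values.data(), &values_read);
  if (values_read > 0) {
    // Expected 513; returned 0 before the fix.
    std::cout << "last value: " << values[values_read - 1] << std::endl;
  }
  return 0;
}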

@mapleFU (Member) commented Oct 18, 2023

Ah, you're right!

I've checked the file, reading it with arrow-rs:

 arrow-rs git:(master) ✗ .//target/debug/parquet-read /Users/fuxuwei/Downloads/concatenated_gzip_members_2.parquet 
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: General("Actual decompressed size doesn't match the expected one (4099 vs 4107)")', parquet/src/record/reader.rs:583:36
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@wgtmac Would you mind also checking it with parquet-mr?

@wgtmac (Member) commented Oct 18, 2023

parquet-mr seems to be happy with the file.

➜  Downloads parquet-cli meta concatenated_gzip_members.parquet

File path:  concatenated_gzip_members.parquet
Created by: null
Properties: (none)
Schema:
message root {
  optional int64 long_col (INTEGER(64,false));
}


Row group 0:  count: 513  2.86 B records  start: 4  total(compressed): 1.433 kB total(uncompressed):4.058 kB
--------------------------------------------------------------------------------
          type      encodings count     avg size   nulls   min / max
long_col  INT64     G RB_     513       2.86 B             "1" / "513"

➜  Downloads parquet-cli cat concatenated_gzip_members.parquet
{"long_col": 1}
{"long_col": 2}
{"long_col": 3}
{"long_col": 4}
{"long_col": 5}
...
{"long_col": 511}
{"long_col": 512}
{"long_col": 513}

@mapleFU (Member) commented Oct 18, 2023

Yeah, it seems this is a bug we need to fix; thanks @amassalha! I'll check out why arrow-rs cannot open a file like that.

@amassalha (Contributor, Author)

> Yeah, it seems this is a bug we need to fix; thanks @amassalha! I'll check out why arrow-rs cannot open a file like that.

Great, thank you.

@@ -368,6 +368,41 @@ TEST_P(CodecTest, CodecRoundtrip) {
}
}

TEST(CodecTest, CodecRoundtripGzipMembers) {
int sizes[] = {0, 10000, 100000};

nit: use std::array or std::vector instead of raw array.

std::unique_ptr<Codec> c1;
ASSERT_OK_AND_ASSIGN(c1, Codec::Create(Compression::GZIP));

for (int data_half_size : sizes) {

Suggested change
for (int data_half_size : sizes) {
for (int data_half_size : {0, 10000, 100000}) {


Or simply do this.

Comment on lines 374 to 375
std::unique_ptr<Codec> c1;
ASSERT_OK_AND_ASSIGN(c1, Codec::Create(Compression::GZIP));

Suggested change
std::unique_ptr<Codec> c1;
ASSERT_OK_AND_ASSIGN(c1, Codec::Create(Compression::GZIP));
ASSERT_OK_AND_ASSIGN(auto gzip_codec, Codec::Create(Compression::GZIP));


Let's name it more meaningfully.

Comment on lines 388 to 391
ASSERT_OK_AND_ASSIGN(actual_size_p1, c1->Compress(data_half.size(), data_half.data(),
max_compressed_len_half, compressed.data()));
ASSERT_OK_AND_ASSIGN(actual_size_p2, c1->Compress(data_half.size(), data_half.data(),
max_compressed_len_half, compressed.data()+actual_size_p1));

Suggested change
ASSERT_OK_AND_ASSIGN(actual_size_p1, c1->Compress(data_half.size(), data_half.data(),
max_compressed_len_half, compressed.data()));
ASSERT_OK_AND_ASSIGN(actual_size_p2, c1->Compress(data_half.size(), data_half.data(),
max_compressed_len_half, compressed.data()+actual_size_p1));
ASSERT_OK_AND_ASSIGN(int64_t actual_size_p1, gzip_codec->Compress(data_half.size(), data_half.data(),
max_compressed_len_half, compressed.data()));
ASSERT_OK_AND_ASSIGN(int64_t actual_size_p2, gzip_codec->Compress(data_half.size(), data_half.data(),
max_compressed_len_half, compressed.data() + actual_size_p1));

We can simply do this.

compressed.resize(actual_size_p1 + actual_size_p2);

// Decompress the concatenated data
std::vector<uint8_t> decompressed(data_half_size*2);

Suggested change
std::vector<uint8_t> decompressed(data_half_size*2);
std::vector<uint8_t> decompressed(data_half_size * 2);

ASSERT_OK_AND_ASSIGN(auto actual_decompressed_size,
                     c1->Decompress(compressed.size(), compressed.data(),
                                    decompressed.size(), decompressed.data()));

ASSERT_EQ(data_half.size()*2, actual_decompressed_size);

Suggested change
ASSERT_EQ(data_half.size()*2, actual_decompressed_size);
ASSERT_EQ(data_half.size() * 2, actual_decompressed_size);

@mapleFU (Member) commented Oct 18, 2023

I found that arrow-cpp can read the file because we don't check the decompressed size:

  // Decompress the values
  PARQUET_ASSIGN_OR_THROW(auto decompress_len, decompressor_->Decompress(
      compressed_len - levels_byte_len, page_buffer->data() + levels_byte_len,
      uncompressed_len - levels_byte_len,
      decompression_buffer_->mutable_data() + levels_byte_len));
  if (decompress_len != uncompressed_len - levels_byte_len) {
    throw ParquetException("Expected " + std::to_string(uncompressed_len - levels_byte_len) +
                           " bytes but decompressed " + std::to_string(decompress_len));
  }

@wgtmac @pitrou Do you think we need to add the check above?

@pitrou (Member) commented Oct 18, 2023

@mapleFU Yes, I think it would be good regardless of whether we decide to support this feature or not.

@mapleFU (Member) commented Nov 15, 2023

@amassalha

  1. PARQUET-2369: Add new test file concatenated_gzip_members.parquet (parquet-testing#41) is merged now, so we can update the testing submodule.
  2. Also, I've updated the decompression check. Previously, when you implemented this, the final value would be 0; now, without this patch, reading a file like this throws an exception.

@pitrou (Member) left a comment

Thanks for the update! Here are two more comments.
Also, can you fix the C++ lint failures (see CI results)?

linting

review fixes

code review fixes2

code review fixes

Update testing

Update parquet-testing

flint fixes

Update compression_test.cc
@amassalha (Contributor, Author) commented Nov 22, 2023

Thanks for the comments, all.
@wgtmac I think it's your call now (all tests passed and I fixed the review comments).

@amassalha (Contributor, Author) commented Nov 22, 2023

I see a timeout failure in s3fs; I don't see how this is related to gzip. Do you have a flaky / non-deterministic bug?
(It passed before the review changes.)

@mapleFU (Member) commented Nov 22, 2023

Don't worry, I'll rerun it.

@amassalha (Contributor, Author)

> Don't worry, I'll rerun it.

Great, thanks.

@amassalha (Contributor, Author)

@wgtmac any comments from your side?

@wgtmac (Member) commented Nov 26, 2023

I don't have any concerns. It seems that you need to rebase it to get it merged.

@amassalha (Contributor, Author) commented Nov 27, 2023

@wgtmac I rebased and got the same random s3fs failure. Should we just rerun, or what?

@amassalha (Contributor, Author)

@mapleFU how can I rerun this?

@pitrou (Member) commented Nov 27, 2023

@amassalha There is no need to re-run, these failures are unrelated.

@pitrou (Member) commented Nov 27, 2023

@wgtmac Would you like to review this again? Otherwise, I will just merge this PR.

@wgtmac (Member) commented Nov 27, 2023

> @wgtmac Would you like to review this again? Otherwise, I will just merge this PR.

Thanks for notifying me! Feel free to merge this.

@amassalha (Contributor, Author)

@wgtmac @pitrou can we finish this before another rebase is needed?

@mapleFU (Member) commented Nov 29, 2023

I'll merge it

@mapleFU merged commit 1d904d6 into apache:main Nov 29, 2023
32 of 35 checks passed
@mapleFU removed the 'awaiting committer review' label Nov 29, 2023
@amassalha (Contributor, Author)

thanks!


After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 1d904d6.

There were 5 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 6 possible false positives for unstable benchmarks that are known to sometimes produce them.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
GH-38271: [C++][Parquet] Support reading parquet files with multiple gzip members (apache#38272)

### What changes are included in this PR?
Adding support in GZipCodec to decompress concatenated gzip members

### Are these changes tested?
test is attached

### Are there any user-facing changes?
no

* Closes: apache#38271

Lead-authored-by: amassalha <amassalha@speedata.io>
Co-authored-by: Atheel Massalha <amassalha@speedata.io>
Signed-off-by: mwish <maplewish117@gmail.com>