
GH-38271: [C++][Parquet] Support reading parquet files with multiple gzip members #38272

Merged: 5 commits merged into apache:main on Nov 29, 2023

Conversation

@amassalha (Contributor) commented Oct 15, 2023

What changes are included in this PR?

Adding support in GZipCodec to decompress concatenated gzip members (see the sketch after this description)

Are these changes tested?

test is attached

Are there any user-facing changes?

no
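
For context, the core mechanism is roughly the following (a minimal sketch, not the actual patch; the helper name GzipDecompressAll is hypothetical): zlib's inflate() returns Z_STREAM_END at the end of the first gzip member, so a decoder must call inflateReset() and keep inflating while compressed input remains.

// Minimal sketch: decompress a buffer holding one or more concatenated
// gzip members by resetting zlib's inflate state whenever one member
// ends while input remains. Hypothetical helper, not the actual patch.
#include <zlib.h>

#include <cstdint>
#include <stdexcept>

int64_t GzipDecompressAll(const uint8_t* input, int64_t input_len,
                          uint8_t* output, int64_t output_len) {
  z_stream stream{};
  // windowBits = 15 + 16 tells zlib to expect a gzip header/trailer.
  if (inflateInit2(&stream, 15 + 16) != Z_OK) {
    throw std::runtime_error("inflateInit2 failed");
  }
  stream.next_in = const_cast<Bytef*>(reinterpret_cast<const Bytef*>(input));
  stream.avail_in = static_cast<uInt>(input_len);
  stream.next_out = reinterpret_cast<Bytef*>(output);
  stream.avail_out = static_cast<uInt>(output_len);

  int ret;
  while (true) {
    ret = inflate(&stream, Z_FINISH);
    if (ret != Z_STREAM_END) break;   // error, or output buffer exhausted
    if (stream.avail_in == 0) break;  // all members consumed
    // A member ended but input remains: reset per-member state and keep
    // going; next_in/avail_in/next_out/avail_out carry over untouched.
    if (inflateReset(&stream) != Z_OK) {
      ret = Z_STREAM_ERROR;
      break;
    }
  }
  const int64_t total_out = output_len - static_cast<int64_t>(stream.avail_out);
  inflateEnd(&stream);
  if (ret != Z_STREAM_END) {
    throw std::runtime_error("corrupt or truncated gzip data");
  }
  return total_out;
}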

@github-actions (bot)

⚠️ GitHub issue #38271 has been automatically assigned in GitHub to PR creator.

@wgtmac (Member) left a comment

Thanks for the contribution! Could you please add a round-trip test for the GZipCodec, which will fail decoding without your fix?

@wgtmac changed the title GH-38271: [C++,Parquet] Support reading parquet files with multiple gzip members GH-38271: [C++][Parquet] Support reading parquet files with multiple gzip members Oct 16, 2023
@github-actions bot added the 'awaiting committer review' label and removed the 'awaiting review' label Oct 16, 2023
@mapleFU (Member) commented Oct 16, 2023

@amassalha How was the gzip-streaming parquet file generated? This patch looks generally OK, but I want to know how other systems support files like this.

@amassalha (Contributor, Author) commented Oct 16, 2023

@mapleFU To generate the gzip stream we use our internal high-performance gzip implementation, which fully adheres to the gzip RFC.
By the way, the file I added can be read successfully using Spark (tested with spark-shell).
@wgtmac Round-trip test added.
(patches updated)

@amassalha requested a review from wgtmac October 16, 2023 10:44
@mapleFU (Member) commented Oct 17, 2023

Yeah, I know it's allowed by the protocol. I will investigate whether arrow-rs and parquet-mr can write or read files like this.

Also, do you know which systems could generate parquet files with multiple gzip members?

@wgtmac (Member) commented Oct 18, 2023

I agree with @mapleFU. Do you have an example parquet file with multiple gzip members on hand? We need to make sure parquet-mr can decompress it successfully. If you don't mind, please upload one here and I can verify it on my end.

@amassalha (Contributor, Author) commented Oct 18, 2023

@wgtmac The parquet file is attached to the issue:
#38271

I also tried pushing it to the parquet-testing repo (https://github.com/apache/parquet-testing), but somehow I couldn't create a pull request (that's what is causing the failures in this run; it can't find the file). Any suggestions on that? Is parquet-testing also open for pull requests?

@mapleFU I'm not familiar with any other system (than ours) currently.

@mapleFU (Member) commented Oct 18, 2023

The CI failure might be related to the lint problem. You can run clang-format -i on your edited files.

How did you generate that parquet file?

@amassalha (Contributor, Author)

> we use our internal high-performance gzip implementation which fully adheres to the gzip RFC.

As I mentioned, we use our internal high-performance gzip implementation, which fully adheres to the gzip RFC. (I can't give more details; I work at a startup.)

@mapleFU (Member) commented Oct 18, 2023

https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L496-L498

Parquet also has two versions of LZ4 in the standard. I have no issue with the gzip standard itself, but maybe I need to make sure this kind of file is supported.

From the standard (https://github.com/apache/parquet-format/blob/master/Compression.md#gzip), the link is authoritative. Let me go through the parquet-mr implementation tonight to make this clear.

@mapleFU (Member) commented Oct 18, 2023

@amassalha I've tried the current implementation, and it seems we're able to read concatenated_gzip_members.zip, which contains 513 rows. Can you confirm that?

@amassalha (Contributor, Author)

> @amassalha I've tried the current implementation, and it seems we're able to read concatenated_gzip_members.zip, which contains 513 rows. Can you confirm that?

Read with the correct values? Because when I tried using "ReadBatch" it returned "0" instead of the last number (513).
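
A minimal repro along these lines would show the symptom (a sketch; the file path and the exact reader calls are assumptions, not code from this thread):

// Sketch: read long_col from the attached file with Parquet C++'s
// low-level column reader. Before the fix, values past the first gzip
// member came back as 0. File path is an assumption; adjust as needed.
#include <parquet/api/reader.h>

#include <cstdint>
#include <iostream>
#include <memory>
#include <vector>

int main() {
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile("concatenated_gzip_members.parquet");
  auto int64_reader = std::static_pointer_cast<parquet::Int64Reader>(
      reader->RowGroup(0)->Column(0));

  std::vector<int64_t> values(513);
  std::vector<int16_t> def_levels(513);
  int64_t values_read = 0;
  int64_reader->ReadBatch(513, def_levels.data(), /*rep_levels=*/nullptr,
                          values.data(), &values_read);
  if (values_read > 0) {
    // Expected 513; returned 0 before the fix.
    std::cout << "last value: " << values[values_read - 1] << std::endl;
  }
  return 0;
}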

@mapleFU (Member) commented Oct 18, 2023

Ah, you're right!

I've checked the file, reading it with arrow-rs:

 arrow-rs git:(master) ✗ .//target/debug/parquet-read /Users/fuxuwei/Downloads/concatenated_gzip_members_2.parquet 
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: General("Actual decompressed size doesn't match the expected one (4099 vs 4107)")', parquet/src/record/reader.rs:583:36
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@wgtmac Would you mind also checking it with parquet-mr?

@wgtmac (Member) commented Oct 18, 2023

parquet-mr seems to be happy with the file.

➜  Downloads parquet-cli meta concatenated_gzip_members.parquet

File path:  concatenated_gzip_members.parquet
Created by: null
Properties: (none)
Schema:
message root {
  optional int64 long_col (INTEGER(64,false));
}


Row group 0:  count: 513  2.86 B records  start: 4  total(compressed): 1.433 kB total(uncompressed):4.058 kB
--------------------------------------------------------------------------------
          type      encodings count     avg size   nulls   min / max
long_col  INT64     G RB_     513       2.86 B             "1" / "513"

➜  Downloads parquet-cli cat concatenated_gzip_members.parquet
{"long_col": 1}
{"long_col": 2}
{"long_col": 3}
{"long_col": 4}
{"long_col": 5}
...
{"long_col": 511}
{"long_col": 512}
{"long_col": 513}

@mapleFU (Member) commented Oct 18, 2023

Yeah, it seems this is a bug we need to fix; thanks @amassalha! I'll check out why arrow-rs cannot open a file like that.

@amassalha (Contributor, Author)

> Yeah, it seems this is a bug we need to fix; thanks @amassalha! I'll check out why arrow-rs cannot open a file like that.

Great, thank you.

@@ -368,6 +368,41 @@ TEST_P(CodecTest, CodecRoundtrip) {
}
}

TEST(CodecTest, CodecRoundtripGzipMembers) {
int sizes[] = {0, 10000, 100000};

nit: use std::array or std::vector instead of raw array.

std::unique_ptr<Codec> c1;
ASSERT_OK_AND_ASSIGN(c1, Codec::Create(Compression::GZIP));

for (int data_half_size : sizes) {

Suggested change
for (int data_half_size : sizes) {
for (int data_half_size : {0, 10000, 100000}) {


Or simply do this.

Comment on lines 374 to 375
std::unique_ptr<Codec> c1;
ASSERT_OK_AND_ASSIGN(c1, Codec::Create(Compression::GZIP));

Suggested change
std::unique_ptr<Codec> c1;
ASSERT_OK_AND_ASSIGN(c1, Codec::Create(Compression::GZIP));
ASSERT_OK_AND_ASSIGN(auto gzip_codec, Codec::Create(Compression::GZIP));


Let's name it more meaningfully.

Comment on lines 388 to 391
ASSERT_OK_AND_ASSIGN(actual_size_p1, c1->Compress(data_half.size(), data_half.data(),
max_compressed_len_half, compressed.data()));
ASSERT_OK_AND_ASSIGN(actual_size_p2, c1->Compress(data_half.size(), data_half.data(),
max_compressed_len_half, compressed.data()+actual_size_p1));

Suggested change
ASSERT_OK_AND_ASSIGN(actual_size_p1, c1->Compress(data_half.size(), data_half.data(),
max_compressed_len_half, compressed.data()));
ASSERT_OK_AND_ASSIGN(actual_size_p2, c1->Compress(data_half.size(), data_half.data(),
max_compressed_len_half, compressed.data()+actual_size_p1));
ASSERT_OK_AND_ASSIGN(int64_t actual_size_p1, gzip_codec->Compress(data_half.size(), data_half.data(),
max_compressed_len_half, compressed.data()));
ASSERT_OK_AND_ASSIGN(int64_t actual_size_p2, gzip_codec->Compress(data_half.size(), data_half.data(),
max_compressed_len_half, compressed.data() + actual_size_p1));

We can simply do this.

compressed.resize(actual_size_p1 + actual_size_p2);

// Decompress the concatenated data
std::vector<uint8_t> decompressed(data_half_size*2);

Suggested change
std::vector<uint8_t> decompressed(data_half_size*2);
std::vector<uint8_t> decompressed(data_half_size * 2);

ASSERT_OK_AND_ASSIGN(auto actual_decompressed_size,
                     c1->Decompress(compressed.size(), compressed.data(),
                                    decompressed.size(), decompressed.data()));

ASSERT_EQ(data_half.size()*2, actual_decompressed_size);

Suggested change
ASSERT_EQ(data_half.size()*2, actual_decompressed_size);
ASSERT_EQ(data_half.size() * 2, actual_decompressed_size);

@mapleFU (Member) commented Oct 18, 2023

I found that arrow-cpp can read the file because we don't check the decompressed size:

  // Decompress the values
  PARQUET_ASSIGN_OR_THROW(auto decompress_len, decompressor_->Decompress(
      compressed_len - levels_byte_len, page_buffer->data() + levels_byte_len,
      uncompressed_len - levels_byte_len,
      decompression_buffer_->mutable_data() + levels_byte_len));
  if (decompress_len != uncompressed_len - levels_byte_len) {
    throw ParquetException("Expected " + std::to_string(uncompressed_len - levels_byte_len) +
                           " bytes but decompressed " + std::to_string(decompress_len));
  }

@wgtmac @pitrou Do you think we need to add the check above?

@pitrou (Member) commented Oct 18, 2023

@mapleFU Yes, I think it would be good regardless of whether we decide to support this feature or not.

@mapleFU (Member) commented Nov 15, 2023

@amassalha

  1. PARQUET-2369: Add new test file concatenated_gzip_members.parquet (parquet-testing#41) is merged now, so we can update the testing submodule.
  2. Also, I've updated the decompression check. Previously, when you implemented this, the final value would be 0; now, without this patch, reading a file like this throws an exception.

@pitrou (Member) left a comment

Thanks for the update! Here are two more comments.
Also, can you fix the C++ lint failures (see CI results)?

linting

review fixes

code review fixes2

code review fixes

Update testing

Update parquet-testing

flint fixes

Update compression_test.cc
@amassalha (Contributor, Author) commented Nov 22, 2023

Thanks for the comments, all.
@wgtmac I think it's your call now (all tests passed and I fixed the review comments).

@amassalha (Contributor, Author) commented Nov 22, 2023

I see a timeout failure in s3fs; I don't see how this is related to gzip. Do you have a flaky / non-deterministic bug?
(It passed before the review changes.)

@mapleFU (Member) commented Nov 22, 2023

Don't worry, I'll rerun it.

@amassalha (Contributor, Author)

> Don't worry, I'll rerun it.

Great, thanks.

@amassalha (Contributor, Author)

@wgtmac any comments from your side?

@wgtmac (Member) commented Nov 26, 2023

I don't have any concerns. It seems that you need to rebase it to get it merged.

@amassalha (Contributor, Author) commented Nov 27, 2023

@wgtmac I rebased and got the same random s3fs failure. Should we just rerun, or what?

@amassalha (Contributor, Author)

@mapleFU how can I rerun this?

@pitrou (Member) commented Nov 27, 2023

@amassalha There is no need to re-run, these failures are unrelated.

@pitrou (Member) commented Nov 27, 2023

@wgtmac Would you like to review this again? Otherwise, I will just merge this PR.

@wgtmac (Member) commented Nov 27, 2023

> @wgtmac Would you like to review this again? Otherwise, I will just merge this PR.

Thanks for notifying me! Feel free to merge this.

@amassalha (Contributor, Author)

@wgtmac @pitrou can we finish this before another rebase is needed?

@mapleFU (Member) commented Nov 29, 2023

I'll merge it

@mapleFU merged commit 1d904d6 into apache:main Nov 29, 2023
32 of 35 checks passed
@mapleFU removed the 'awaiting committer review' label Nov 29, 2023
@amassalha (Contributor, Author)

thanks!


After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 1d904d6.

There were 5 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 6 possible false positives for unstable benchmarks that are known to sometimes produce them.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
GH-38271: [C++][Parquet] Support reading parquet files with multiple gzip members (apache#38272)

### What changes are included in this PR?
Adding support in GZipCodec to decompress concatenated gzip members

### Are these changes tested?
test is attached

### Are there any user-facing changes?
no

* Closes: apache#38271

Lead-authored-by: amassalha <amassalha@speedata.io>
Co-authored-by: Atheel Massalha <amassalha@speedata.io>
Signed-off-by: mwish <maplewish117@gmail.com>