Decompression fails for files written by arrow2.rs/parquet2.rs #2882

FauxFaux · 2022-01-07T17:37:34Z

What happens?

duckdb cannot read my parquet files. They are readable by pyarrow, arrow2.rs and arrow.rs.

select * from parquet_scan(['7-set.snappy.arrow2.parquet']);
Error: Decompression failure

To Reproduce

Unpack pq.zip.

select * from parquet_scan(['7-set.snappy.arrow2.parquet']);

Rambling

Initially I thought this was an arrow2 bug, but the files appear to be acceptable to all other tooling, so I'm back to thinking it must be a bug here.
The arrow2 file can be regenerated by following the arrow2 bug report mentioned above. The zip file also includes a file generated by duckdb itself, which is readable by all four tools. Its generation steps are below, though that file is not the one showing the bug:

create table foo (c1 int null);
insert into foo select * from generate_series(0,7);
copy foo to '7-set.snappy.duckdb.parquet' (format 'parquet', codec 'snappy');

If you create a much larger file and debug duckdb, it looks very much like it's trying to decompress the validity map, which is a huge run of 0xff bytes. It is completely unclear to me from the code why it would be trying to do this, i.e. why it would think uncompressed header data is a compressed page.
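
For anyone following along, the suspicion above can be illustrated with a small sketch. In Parquet's V2 data pages, the repetition and definition levels sit uncompressed at the front of the page, and only the values after them are compressed. A reader that hands the whole page to the decompressor would trip over exactly this kind of 0xff run. The class and function names below are hypothetical, zlib stands in for snappy so the sketch runs with the standard library only, and none of this is duckdb's actual code.

```python
import zlib
from dataclasses import dataclass


@dataclass
class DataPageHeaderV2:
    # Field names mirror the Parquet Thrift definition of DataPageHeaderV2;
    # the class itself is a hypothetical stand-in for a parsed header.
    repetition_levels_byte_length: int
    definition_levels_byte_length: int
    is_compressed: bool = True


def split_v2_page(header, page_bytes, decompress):
    """Split a V2 data page body into (rep_levels, def_levels, values).

    The levels are returned as-is because V2 stores them uncompressed;
    only the trailing values section is passed to `decompress`.
    """
    rep_end = header.repetition_levels_byte_length
    def_end = rep_end + header.definition_levels_byte_length
    if def_end > len(page_bytes):
        raise ValueError("level lengths exceed page size")
    rep_levels = page_bytes[:rep_end]
    def_levels = page_bytes[rep_end:def_end]
    values = page_bytes[def_end:]
    if header.is_compressed:
        values = decompress(values)
    return rep_levels, def_levels, values


# Toy page: a run of 0xff bytes standing in for the definition levels,
# followed by eight little-endian int32 values compressed with zlib
# (an assumption made for portability; the real files use snappy).
toy_def_levels = b"\xff" * 4
raw_values = b"".join(i.to_bytes(4, "little") for i in range(8))
page = toy_def_levels + zlib.compress(raw_values)
header = DataPageHeaderV2(
    repetition_levels_byte_length=0,
    definition_levels_byte_length=len(toy_def_levels),
)

rep, dl, vals = split_v2_page(header, page, zlib.decompress)

# Decompressing the whole page, levels included, fails instead.
try:
    zlib.decompress(page)
    naive_failed = False
except zlib.error:
    naive_failed = True
```

Splitting on the two byte lengths from the header before decompressing is the only V2-specific step here; a V1 page compresses levels and values together, so the whole page goes through the decompressor.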

Environment (please complete the following information):

OS: Linux (Ubuntu 20.04 LTS)
DuckDB Version: v0.3.1 88aa81c (also tried master/HEAD)
DuckDB Client: duckdb cli, also parquetcli.

hannes · 2022-01-08T06:03:44Z

I had some code in an unfinished branch that also included some fixes to reading Parquet V2 pages. Can you maybe try it out? I've opened a PR, #2885.

The file you sent works for me in that branch ^^

hannes · 2022-01-14T15:18:09Z

@FauxFaux ?

FauxFaux · 2022-01-18T14:40:56Z

I wasn't able to try your branch at the time as my original files were in zstd, and I had trouble building the rebased branch. I'll have another go with the now-merged code.

Arttii · 2022-04-19T12:15:09Z

I am hitting this as well for a snappy-compressed parquet file with the latest version from PyPI. Recreating the file with the DataPage version set to 1 fixes it, though.

hannes closed this as completed Dec 6, 2022
