Decompression fails for files written by arrow2.rs/parquet2.rs #2882

Closed
FauxFaux opened this issue Jan 7, 2022 · 4 comments

@FauxFaux

FauxFaux commented Jan 7, 2022

What happens?

duckdb cannot read my parquet files. They are readable by pyarrow, arrow2.rs and arrow.rs.

select * from parquet_scan(['7-set.snappy.arrow2.parquet']);
Error: Decompression failure

To Reproduce

  1. Unpack pq.zip.
  2. Run: select * from parquet_scan(['7-set.snappy.arrow2.parquet']);

Rambling

  • Initially I thought this was an arrow2 bug, but it appears the files are acceptable to all other tooling, so I'm back to thinking it must be a bug here.

  • The arrow2 file can be generated by following the above bug report. The zip file also includes a file generated by duckdb itself, which is readable by all four tools and so does not show the bug. It was generated with:

create table foo (c1 int null);
insert into foo select * from generate_series(0,7);
copy foo to '7-set.snappy.duckdb.parquet' (format 'parquet', codec 'snappy');
  • If you create a much larger file and debug duckdb, it looks very much like it's trying to decompress the validity map: a huge run of 0xff bytes. It is completely unclear to me from the code why it would try to do this, or why it would treat uncompressed header data as a compressed page.
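
For context on the point above: in the Parquet format, a Data Page V2 stores the repetition and definition levels uncompressed at the start of the page body, and only the values after them are compressed. A V1 page compresses levels and values together. A reader that hands a whole V2 body to the decompressor would therefore hit the raw level bytes first. The sketch below illustrates only that layout mismatch. It is not DuckDB's code, the function names are hypothetical, and zlib stands in for Snappy so that it runs with the standard library alone.

```python
import zlib
from typing import Tuple


def build_v2_page_body(def_levels: bytes, values: bytes) -> bytes:
    """V2-style layout: raw (uncompressed) level bytes, then compressed values."""
    return def_levels + zlib.compress(values)


def read_v2_page_body(body: bytes, rep_len: int, def_len: int) -> Tuple[bytes, bytes]:
    """Skip the uncompressed levels, then decompress only the remaining bytes."""
    levels_len = rep_len + def_len
    levels = body[:levels_len]
    values = zlib.decompress(body[levels_len:])
    return levels, values


def read_as_v1_page_body(body: bytes) -> bytes:
    """V1-style read: treat the entire body as one compressed buffer."""
    return zlib.decompress(body)


# A run of 0xff bytes standing in for the level data of a column with no nulls.
def_levels = b"\xff" * 16
values = bytes(range(8))
body = build_v2_page_body(def_levels, values)

# Reading with the V2 layout recovers both parts.
levels, decoded = read_v2_page_body(body, rep_len=0, def_len=len(def_levels))

# Reading the same bytes with the V1 assumption feeds 0xff bytes to the
# decompressor, which rejects them.
try:
    read_as_v1_page_body(body)
    v1_style_failed = False
except zlib.error:
    v1_style_failed = True
```

In a real file the two level lengths come from the V2 page header, which is why a reader has to consult the header before deciding where the compressed region starts.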

Environment (please complete the following information):

  • OS: Linux (Ubuntu 20.04 LTS)
  • DuckDB Version: v0.3.1 88aa81c (also tried master/HEAD)
  • DuckDB Client: duckdb cli, also parquetcli.
@hannes
Member

hannes commented Jan 8, 2022

I had some code in an unfinished branch that also included some fixes for reading Parquet V2 pages. Could you try it out? I've opened a PR, #2885.

The file you sent works for me in that branch ^^

@hannes
Member

hannes commented Jan 14, 2022

@FauxFaux ?

@FauxFaux
Author

I wasn't able to try your branch at the time as my original files were in zstd, and I had trouble building the rebased branch. I'll have another go with the now-merged code.

@Arttii

Arttii commented Apr 19, 2022

I am hitting this as well for a Snappy-compressed parquet file with the latest version from PyPI. Recreating the file with the DataPage version set to 1 fixes it, though.
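
One possible way to apply that workaround, assuming the file is rewritten with pyarrow (the comment does not say which writer was used): `pyarrow.parquet.write_table` accepts a `data_page_version` argument. The helper name and the output filename below are hypothetical.

```python
# Writer options for the workaround: Snappy compression with Data Page V1.
WRITE_OPTIONS = {"compression": "snappy", "data_page_version": "1.0"}


def rewrite_with_v1_pages(src: str, dst: str) -> None:
    """Read a Parquet file and write it back out using V1 data pages."""
    # pyarrow is a third-party package; it is imported here so the options
    # above can be inspected without it installed.
    import pyarrow.parquet as pq

    table = pq.read_table(src)
    pq.write_table(table, dst, **WRITE_OPTIONS)


# Example call (the output name is made up):
# rewrite_with_v1_pages("7-set.snappy.arrow2.parquet", "7-set.snappy.v1.parquet")
```

This only sidesteps the problem on the writer side. Files that already use V2 pages still need a reader that handles them.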
