Decompression fails for files written by arrow2.rs/parquet2.rs #2882
Comments
I had some code in an unfinished branch that also included some fixes to reading Parquet V2 pages. Can you maybe try it out? I've opened a PR: #2885. The file you sent works for me in that branch ^^
I wasn't able to try your branch at the time, as my original files were zstd-compressed and I had trouble building the rebased branch. I'll have another go with the now-merged code.
I am hitting this as well for a Snappy-compressed Parquet file with the latest version from PyPI. Recreating the file with the DataPage version set to 1 fixes it, though.
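For reference, a minimal sketch of that workaround, assuming the file is rewritten with pyarrow (the `data_page_version` argument of `pyarrow.parquet.write_table` selects between the V1 and V2 data page formats; the file name here is hypothetical):

```python
import pyarrow.parquet as pq

# Workaround described above: read the failing file and rewrite it
# with V1 data pages. data_page_version="2.0" is the combination that
# produces the pages duckdb chokes on.
table = pq.read_table("written_by_arrow2.parquet")  # hypothetical file name
pq.write_table(table, "rewritten_v1.parquet",
               compression="snappy", data_page_version="1.0")
```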
What happens?
duckdb cannot read my parquet files. They are readable by pyarrow, arrow2.rs and arrow.rs.

To Reproduce
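The original report attaches the failing files rather than generation code; a hypothetical reconstruction of the failure, assuming the duckdb Python client and a pyarrow-generated file with V2 data pages, would look something like this:

```python
import duckdb
import pyarrow as pa
import pyarrow.parquet as pq

# Write a file with compressed V2 data pages, as arrow2.rs/parquet2.rs do.
table = pa.table({"x": [1, None, 3]})
pq.write_table(table, "repro.parquet",
               compression="zstd", data_page_version="2.0")

# Reading it back with duckdb fails with a decompression error.
con = duckdb.connect()
print(con.execute("SELECT * FROM 'repro.parquet'").fetchall())
```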
Rambling
Initially I thought this was an arrow2 bug, but it appears the files are acceptable to all other tooling, so I'm back to thinking it must be a bug here.
The arrow2 file can be generated by following the above bug report. Also included in the zip file is the file generated by duckdb itself (generation steps), which is readable by all four tools, but that one is not a bug. In duckdb, it looks very much like it's trying to decompress the validity map: a huge run of 0xff bytes. It is completely unclear to me from the code why it would be trying to do this, i.e. why it would treat uncompressed header data as a compressed page.
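For anyone following along, a sketch of how uncompressed data can end up in front of the decompressor (this is the DataPageV2 layout as the Parquet spec defines it, not duckdb's actual reader code): in a V2 page the repetition and definition levels are stored uncompressed ahead of the values, so a reader must slice them off before invoking the codec.

```python
def split_v2_page_body(body: bytes,
                       rep_levels_byte_len: int,
                       def_levels_byte_len: int):
    """Split a DataPageV2 body per the Parquet spec.

    The two lengths come from the page's DataPageHeaderV2 thrift struct
    (repetition_levels_byte_length / definition_levels_byte_length).
    """
    levels_end = rep_levels_byte_len + def_levels_byte_len
    levels = body[:levels_end]    # always stored uncompressed
    values = body[levels_end:]    # only this region is compressed
    return levels, values
```

If the whole body goes to the codec instead, the first thing the decompressor sees is the raw levels, which for an all-valid column can be exactly the long run of 0xff bytes described above.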
Environment (please complete the following information):

duckdb cli, also parquet-cli.