BUG: reading boolean column with RLE encoding gives wrong values #884
Comments
Yes, we very probably assume bools should be bit-packed. Is there really any case in which RLE achieves better packing or read performance? https://parquet.apache.org/docs/file-format/data-pages/encodings/
The last point was certainly not there the last time I looked at the spec, but it might have been a while.
The website is newer, but in the actual spec document this was added 6 years ago: apache/parquet-format#79. I am not aware of an extensive performance comparison, but the discussion on the Arrow side mentioned that several other Parquet implementations (e.g. parquet-rs) already default to RLE for boolean values.
I can't tell from the docs: is RLE bool encoding restricted to V2 data pages?
AFAIK encoding options are completely separate from V1 vs V2 data pages.
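On the question above of whether RLE can ever pack booleans better than bit-packing, here is a rough, self-contained sketch. It is not fastparquet's or pyarrow's code. The function names are hypothetical, and the encoder is deliberately simplified: it emits only the RLE runs of the hybrid format at bit width 1 (a ULEB128 header of `run_length << 1`, then one value byte) and ignores the bit-packed runs and any page-level length prefix that the full format defines.

```python
import math


def plain_bitpacked_size(n_values: int) -> int:
    """Bytes used when every boolean takes exactly one bit (PLAIN-style packing)."""
    return math.ceil(n_values / 8)


def _uleb128(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def rle_encode_bools(values) -> bytes:
    """Simplified encoder: every run becomes one RLE run (bit width 1)."""
    out = bytearray()
    i, n = 0, len(values)
    while i < n:
        j = i
        while j < n and values[j] == values[i]:
            j += 1
        out += _uleb128((j - i) << 1)      # low bit 0 marks an RLE run
        out.append(1 if values[i] else 0)  # the repeated value, padded to one byte
        i = j
    return bytes(out)


def rle_decode_bools(buf: bytes) -> list:
    """Decode the output of rle_encode_bools (RLE runs only)."""
    values, pos = [], 0
    while pos < len(buf):
        header, shift = 0, 0
        while True:
            byte = buf[pos]
            pos += 1
            header |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                break
        if header & 1:
            raise NotImplementedError("bit-packed runs are outside this sketch")
        values.extend([bool(buf[pos])] * (header >> 1))
        pos += 1
    return values


if __name__ == "__main__":
    long_run = [True] * 10_000
    alternating = [i % 2 == 0 for i in range(10_000)]
    for name, vals in (("long run", long_run), ("alternating", alternating)):
        encoded = rle_encode_bools(vals)
        assert rle_decode_bools(encoded) == vals
        print(name, "bit-packed:", plain_bitpacked_size(len(vals)), "RLE-only:", len(encoded))
```

Under these assumptions, 10,000 identical values take 4 bytes as a single RLE run versus 1,250 bytes bit-packed, while strictly alternating values blow up to 20,000 bytes when forced into RLE runs. That worst case is why the real hybrid format can fall back to bit-packed runs. The sketch says nothing about read performance.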
Describe the issue:
When reading a Parquet file that contains a boolean column written by pyarrow with the RLE encoding, fastparquet silently returns garbage values for that column.
Such a roundtrip started to fail on the pyarrow side because we switched to using RLE by default for boolean values (apache/arrow#36955), which is planned for the upcoming 14.0 release.
Minimal Complete Verifiable Example: