BUG: reading boolean column with RLE encoding gives wrong values #884

Closed
jorisvandenbossche opened this issue Sep 25, 2023 · 4 comments · Fixed by #885

@jorisvandenbossche
Member

Describe the issue:

When reading a Parquet file written by pyarrow in which a boolean column uses the RLE encoding, fastparquet silently returns garbage values for that column.

Such a roundtrip started to fail on the pyarrow side because we switched to using RLE by default for boolean values (apache/arrow#36955), a change planned for the upcoming 14.0 release.

Minimal Complete Verifiable Example:

In [1]: df = pd.DataFrame({"col": [True, False, True]})

In [2]: df.to_parquet("test_bool_pa13_plain.parquet", engine="pyarrow", column_encoding={"col": "PLAIN"}, use_dictionary=False)

In [3]: df.to_parquet("test_bool_pa13_rle.parquet", engine="pyarrow", column_encoding={"col": "RLE"}, use_dictionary=False)

In [4]: pd.read_parquet("test_bool_pa13_plain.parquet", engine="fastparquet")

In [5]: pd.read_parquet("test_bool_pa13_plain.parquet", engine="fastparquet")
Out[5]: 
     col
0   True
1  False
2   True

In [6]: pd.read_parquet("test_bool_pa13_rle.parquet", engine="fastparquet")
Out[6]: 
     col
0  False
1  False
2  False

In [7]: import fastparquet; fastparquet.__version__
Out[7]: '2023.8.0'

In [8]: import pyarrow as pa; pa.__version__
Out[8]: '13.0.0'
@martindurant
Member

Yes, we very probably assume bools should be bit-packed. Is there really any case in which RLE achieves better packing or read performance?

https://parquet.apache.org/docs/file-format/data-pages/encodings/

Note that the RLE encoding method is only supported for the following types of data:

Repetition and definition levels
Dictionary indices
Boolean values in data pages, as an alternative to PLAIN encoding

The last point was certainly not there the last time I looked at the spec, but it might have been a while.
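
To make the comparison concrete, here is a minimal sketch of decoding boolean values stored with the RLE / bit-packed hybrid encoding at bit width 1, as I read the spec page linked above. It is not fastparquet's or pyarrow's code: the function names `read_uvarint` and `decode_rle_bools` are hypothetical, and it assumes the page body starts with the 4-byte little-endian length prefix and contains no definition or repetition levels.

```python
import struct


def read_uvarint(buf, pos):
    """Read an unsigned LEB128 varint from buf at pos; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_rle_bools(page, num_values):
    """Decode booleans from an RLE / bit-packed hybrid body (bit width 1).

    Assumption: `page` begins with a 4-byte little-endian length prefix.
    """
    (length,) = struct.unpack_from("<I", page, 0)
    pos, end = 4, 4 + length
    out = []
    while pos < end and len(out) < num_values:
        header, pos = read_uvarint(page, pos)
        if header & 1:
            # Bit-packed run: (header >> 1) groups of 8 values.
            # At bit width 1 each group is one byte, least significant bit first.
            for _ in range(header >> 1):
                byte = page[pos]
                pos += 1
                out.extend(bool((byte >> i) & 1) for i in range(8))
        else:
            # RLE run: (header >> 1) repetitions of one value stored in one byte.
            count = header >> 1
            value = bool(page[pos] & 1)
            pos += 1
            out.extend([value] * count)
    # A bit-packed run is padded to a multiple of 8 values, so trim the tail.
    return out[:num_values]


# [True, False, True] as one bit-packed group: header 0x03, payload 0b00000101.
packed = struct.pack("<I", 2) + bytes([0x03, 0x05])
print(decode_rle_bools(packed, 3))  # [True, False, True]

# 1000 x True as a single RLE run: header varint 2000 (0xD0 0x0F), value 0x01.
run = struct.pack("<I", 3) + bytes([0xD0, 0x0F, 0x01])
print(len(run), sum(decode_rle_bools(run, 1000)))  # 7 1000
```

Under these assumptions, a run of 1000 identical booleans takes 7 bytes including the length prefix, while PLAIN bit-packing at one bit per value needs 125 bytes. So long runs of the same value are the case where RLE packs better; for short or alternating data the two are roughly equivalent.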

@jorisvandenbossche
Member Author

> The last point was certainly not there the last time I looked at the spec, but it might have been a while.

The website is newer, but in the actual doc page it was added 6 years ago: apache/parquet-format#79

I am not aware of an extensive performance comparison, but in the discussion on the Arrow side it was mentioned that several other Parquet implementations (e.g. parquet-rs) already default to RLE for boolean values.

@martindurant
Member

I can't tell from the docs: is RLE bool encoding restricted to V2 data pages?

@jorisvandenbossche
Member Author

AFAIK encoding options are completely separate from V1 vs V2 data pages
