Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow RLE for bools in v1 pages #885

Merged
merged 1 commit into from
Sep 28, 2023
Merged

Conversation

martindurant
Copy link
Member

Fixes #884

@jorisvandenbossche , could you please try with this?

Interestingly, there seems to have been no issue with V2 pages. I wonder why the push to change the defaults (and indeed the allowed encodings) for a version that is long superceded?

@martindurant
Copy link
Member Author

I don't think fastparquet will follow suit and write bools as RLE...

@martindurant martindurant merged commit df43dac into dask:main Sep 28, 2023
20 checks passed
@martindurant martindurant deleted the rle-bool-v1 branch September 28, 2023 14:59
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Sep 29, 2023

It might be worth adding a test for this .. as it doesn't seem to be fixing it for me (well, almost, but values are still inverted). Using the files from the issue:

In [1]: pd.read_parquet("test_bool_pa13_plain.parquet", engine="fastparquet")
Out[1]: 
     col
0   True
1  False
2   True

In [2]: pd.read_parquet("test_bool_pa13_rle.parquet", engine="fastparquet")
Out[2]: 
     col
0  False
1   True
2  False

In [3]: pd.read_parquet("test_bool_pa13_rle.parquet", engine="pyarrow")
Out[3]: 
     col
0   True
1  False
2   True

In [4]: import fastparquet; fastparquet.__version__
Out[4]: '2023.8.1.dev2'

@martindurant
Copy link
Member Author

Are you certain?

In [17]: pd.read_parquet("test_bool_pa13_rle.parquet", engine="fastparquet").equals(pd.read_parquet("test_bool_pa13_rle.parquet", engine="pyarrow"))
Out[17]: True

In [18]: pd.read_parquet("test_bool_pa13_rle.parquet", engine="fastparquet").equals(pd.read_parquet("test_bool_pa13_plain.parquet", engine="pyarrow"))
Out[18]: True

In [19]: pd.read_parquet("test_bool_pa13_plain.parquet", engine="fastparquet").equals(pd.read_parquet("test_bool_pa13_plain.parquet", engine="pyarrow"))
Out[19]: True

@jorisvandenbossche
Copy link
Member

Well, you can see what I printed above. That was after doing a pip install git+https://github.com/dask/fastparquet, and didn't further investigate except from what I printed above.

You can check if the files you created are actually using RLE encoding with pq.read_metadata("test_bool_pa13_rle.parquet").row_group(0).column(0).encodings.

The files I was using: test_bool_pa13.zip

@emkornfield
Copy link

Interestingly, there seems to have been no issue with V2 pages. I wonder why the push to change the defaults (and indeed the allowed encodings) for a version that is long superceded?

V2 Data pages are actually in a weird state, where most of the benefits. Last I checked, I think Arrow still had some bugs for them. There is also a somewhat stalled effort to try to standardize expectations around the format: apache/parquet-format#164

@wgtmac
Copy link

wgtmac commented Oct 7, 2023

Interestingly, there seems to have been no issue with V2 pages. I wonder why the push to change the defaults (and indeed the allowed encodings) for a version that is long superceded?

V2 Data pages are actually in a weird state, where most of the benefits. Last I checked, I think Arrow still had some bugs for them. There is also a somewhat stalled effort to try to standardize expectations around the format: apache/parquet-format#164

Did you remember any outstanding bug related to data page v2 in the parquet-cpp?

@emkornfield
Copy link

Interestingly, there seems to have been no issue with V2 pages. I wonder why the push to change the defaults (and indeed the allowed encodings) for a version that is long superceded?

V2 Data pages are actually in a weird state, where most of the benefits. Last I checked, I think Arrow still had some bugs for them. There is also a somewhat stalled effort to try to standardize expectations around the format: apache/parquet-format#164

Did you remember any outstanding bug related to data page v2 in the parquet-cpp?

Sorry for the late reply, doing a quick audit appears that they might all be fixed (I think row count might have been the last one).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: reading boolean column with RLE encoding gives wrong values
4 participants