GH-37312: [Python][Docs] Update Python docstrings to reflect new parquet encoding option #38070
cc @rok @jorisvandenbossche @pitrou Is this OK, or should I add more documentation about this?
Only if "use_dictionary" and "use_byte_stream_split" is False,
the following encodings are supported.
Currently supported values: {'PLAIN', 'BYTE_STREAM_SPLIT',
'DELTA_BINARY_PACKED', 'DELTA_LENGTH_BYTE_ARRAY', 'DELTA_BYTE_ARRAY'}.
Strictly speaking, "RLE" is also allowed, but since that is now the default and only valid for booleans, there is probably no use in actually specifying it.
🤔 As we mentioned in https://issues.apache.org/jira/browse/PARQUET-2222, I don't think exposing RLE is such a good idea...
But so then we should revert #36955, since that enabled RLE by default for boolean columns?
RLE is enabled by default on page V2; however, writing page V2 is not recommended.
The page V2 format is great, but most Parquet implementations have not reached agreement on it. As a result, page V2 is unstable for now.
RLE for booleans is enabled by default on page V2, which I think is OK there, but I don't think it is a good idea to enable it by default in general.
As this blog post says (https://blog.getdaft.io/p/working-with-the-apache-parquet-file), "Parquet V2" is an ambiguous name. Although we (arrow and arrow-rs) use format 2.x and some of its properties, most implementations can still decode page V1.
So I think that if users know what they are doing, it is OK to expose RLE to them; however, I think we can just hide it here until PARQUET-2222 reaches a conclusion.
I would propose to move this discussion elsewhere (new issue? or the original issue of #36882)
RLE is enabled by default on page V2; however, writing page V2 is not recommended.
No, we enabled it by default also for DataPage V1 (we have a separate "parquet v2" which is enabled by default, and we decided to write this encoding when that is enabled)
Moved to #36882
You're right, I've added RLE here
Oh, in the end I decided to remove RLE. Let's just leave it as a hack until we make clear how it could be used.
Looks good!
No idea why lint failed...
/arrow/python/pyarrow/parquet/core.py:826:78: W291 trailing whitespace
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Thanks!
python/pyarrow/parquet/core.py
Outdated
Specify if we should use dictionary encoding in general or only for
some columns.
compression : str or dict
When encoding the column, if the dictionary size is too large, the
column will fallback to fallback encoding. Specially, ``BOOLEAN`` type
Is this fallback encoding always PLAIN?
Yes. The fallback method is even called FallbackToPlainEncoding(), but we may switch to other encodings in the future (arrow-rs and parquet-mr allow some different encodings in format v2).
You can refer to #15165. I changed the wording here to PLAIN, and if that issue gets supported, we can change the wording here again.
@pitrou @jorisvandenbossche Would you mind taking a look? This is just a 15-line doc change.
Thanks for the reminder, looking good!
After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit a5043e7. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.
…w parquet encoding option (apache#38070) ### Rationale for this change Since parquet C++ has complete all encoding, we can publish this in Python doc. ### What changes are included in this PR? Add encoding in document. ### Are these changes tested? No ### Are there any user-facing changes? No * Closes: apache#37312 Lead-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: mwish <1506118561@qq.com> Co-authored-by: Rok Mihevc <rok@mihevc.org> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Rationale for this change
Since Parquet C++ has completed support for all encodings, we can publish this in the Python docs.
What changes are included in this PR?
Add the encodings to the documentation.
Are these changes tested?
No
Are there any user-facing changes?
No