GH-37312: [Python][Docs] Update Python docstrings to reflect new parquet encoding option #38070

mapleFU · 2023-10-06T11:39:38Z

Rationale for this change

Since Parquet C++ now implements all of the encodings, we can document them in the Python docs.

What changes are included in this PR?

Add the supported encodings to the documentation.

Are these changes tested?

No

Are there any user-facing changes?

No

Closes: [Python][Docs] Update Python docstrings to reflect new parquet encoding option #37312

github-actions · 2023-10-06T11:40:08Z

⚠️ GitHub issue #37312 has been automatically assigned in GitHub to PR creator.

mapleFU · 2023-10-06T11:40:37Z

cc @rok @jorisvandenbossche @pitrou Is this OK, or should I add more documentation about this?

jorisvandenbossche · 2023-10-06T11:58:27Z

python/pyarrow/parquet/core.py

+    Only if "use_dictionary" and "use_byte_stream_split" is False, 
+    the following encodings are supported.
+    Currently supported values: {'PLAIN', 'BYTE_STREAM_SPLIT', 
+    'DELTA_BINARY_PACKED', 'DELTA_LENGTH_BYTE_ARRAY', 'DELTA_BYTE_ARRAY'}.


Strictly speaking, "RLE" is also allowed, but since that is now the default and only valid for booleans, there is probably no use in actually specifying it.

🤔 As we mentioned in https://issues.apache.org/jira/browse/PARQUET-2222, I don't think exposing RLE is such a good idea...

But so then we should revert #36955, since that enabled RLE by default for boolean columns?

RLE is enabled by default on Page V2; however, writing Page V2 is not recommended.

The Page V2 format is great; however, most Parquet implementations have not reached agreement on it. As a result, Page V2 is unstable for now.

RLE for booleans is enabled by default on Page V2, and I think that's OK there, but I don't think it is a good idea to enable it by default otherwise.

As this blog post says (https://blog.getdaft.io/p/working-with-the-apache-parquet-file), "Parquet V2" is an ambiguous name. Although we (arrow and arrow-rs) use format 2.x and some of its properties, most implementations can still decode Page V1.

So I think that if users know what they are doing, it is OK to expose RLE to them; however, I think we can just hide it here until PARQUET-2222 reaches a conclusion.

I would propose to move this discussion elsewhere (new issue? or the original issue of #36882)

RLE is enabled by default on Page V2; however, writing Page V2 is not recommended.

No, we enabled it by default also for DataPage V1 (we have a separate "parquet v2" which is enabled by default, and we decided to write this encoding when that is enabled)

Moved to #36882

You're right, I've added RLE here

In the end I decided to remove RLE. Let's just leave it as a hack until we make clear how it could be used.

python/pyarrow/parquet/core.py

rok

Looks good!

mapleFU · 2023-10-09T09:31:34Z

No idea why lint failed...

rok · 2023-10-09T09:39:12Z

/arrow/python/pyarrow/parquet/core.py:826:78: W291 trailing whitespace

python/pyarrow/parquet/core.py

Co-authored-by: Rok Mihevc <rok@mihevc.org>

mapleFU · 2023-10-09T10:44:18Z

Thanks!

jorisvandenbossche · 2023-10-10T10:01:38Z

python/pyarrow/parquet/core.py

    Specify if we should use dictionary encoding in general or only for
    some columns.
-compression : str or dict
+    When encoding the column, if the dictionary size is too large, the
+    column will fallback to fallback encoding. Specially, ``BOOLEAN`` type


Is this fallback encoding always PLAIN?

Yes. The fallback method is even called FallbackToPlainEncoding(), but we may switch to other encodings in the future (arrow-rs and parquet-mr allow some different encodings in format v2).

You can refer to #15165

I changed the wording here to PLAIN; if that issue is implemented, we can update the wording again.
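
To make the fallback behaviour discussed above concrete, here is a toy pure-Python model. It is not Parquet C++ code: the function name `encode_column`, the page lists, and the limit expressed as a number of dictionary entries (rather than a size in bytes) are all simplifying assumptions for illustration.

```python
def encode_column(pages, max_dict_entries):
    """Toy model of dictionary encoding with fallback to PLAIN.

    Pages are dictionary-encoded until adding a page would push the
    dictionary past max_dict_entries; that page and every later page
    are then emitted as PLAIN values.
    """
    dictionary = {}  # value -> index in the dictionary
    encoded = []
    fell_back = False
    for page in pages:
        if not fell_back:
            # Tentatively extend the dictionary with this page's values.
            candidate = dict(dictionary)
            for value in page:
                candidate.setdefault(value, len(candidate))
            if len(candidate) > max_dict_entries:
                fell_back = True  # the fallback is sticky in this model
            else:
                dictionary = candidate
        if fell_back:
            encoded.append(("PLAIN", list(page)))
        else:
            encoded.append(("RLE_DICTIONARY", [dictionary[v] for v in page]))
    return list(dictionary), encoded


dict_values, encoded_pages = encode_column(
    [["a", "b", "a"], ["b", "c", "d", "e"], ["a", "a"]],
    max_dict_entries=3,
)
print(dict_values)
print(encoded_pages)
```

In this sketch the first page stays dictionary-encoded, while the second page overflows the limit, so it and the third page are written as PLAIN.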

mapleFU · 2023-10-11T02:10:17Z

@pitrou @jorisvandenbossche Would you mind taking a look? This is just a 15-line doc change.

jorisvandenbossche

Thanks for the reminder, looking good!

conbench-apache-arrow · 2023-10-20T15:36:20Z

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit a5043e7.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

…w parquet encoding option (apache#38070) ### Rationale for this change Since parquet C++ has complete all encoding, we can publish this in Python doc. ### What changes are included in this PR? Add encoding in document. ### Are these changes tested? No ### Are there any user-facing changes? No * Closes: apache#37312 Lead-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: mwish <1506118561@qq.com> Co-authored-by: Rok Mihevc <rok@mihevc.org> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

Init doc for encoding

1161cee

jorisvandenbossche reviewed Oct 6, 2023

rok approved these changes Oct 6, 2023

mapleFU added 4 commits October 6, 2023 22:56

resolve comment

814e041

Merge branch 'main' into parquet-doc/add-doc-for-encoding

99d26cf

Merge branch 'main' into parquet-doc/add-doc-for-encoding

6a95840

try to fix lint

bdf82ae

rok reviewed Oct 9, 2023

Update python/pyarrow/parquet/core.py

d38e083

Co-authored-by: Rok Mihevc <rok@mihevc.org>

mapleFU requested a review from jorisvandenbossche October 9, 2023 10:44

jorisvandenbossche mentioned this pull request Oct 9, 2023

[C++][Parquet] Use RLE for boolean type by default when parquet version is 2.x #36882

Closed

add RLE as encoding type

ecec908

jorisvandenbossche reviewed Oct 10, 2023

resolve comment and remove 'RLE'

f2869f3

mapleFU requested a review from jorisvandenbossche October 11, 2023 02:09

jorisvandenbossche approved these changes Oct 17, 2023

jorisvandenbossche merged commit a5043e7 into apache:main Oct 17, 2023
14 checks passed

mapleFU deleted the parquet-doc/add-doc-for-encoding branch October 17, 2023 13:35
