GH-37312: [Python][Docs] Update Python docstrings to reflect new parquet encoding option #38070

Merged

mapleFU
Member

@mapleFU mapleFU commented Oct 6, 2023

Rationale for this change

Since Parquet C++ now implements all the encodings, we can document them in the Python docs.

What changes are included in this PR?

Add the supported encodings to the documentation.

Are these changes tested?

No

Are there any user-facing changes?

No

@github-actions

github-actions bot commented Oct 6, 2023

⚠️ GitHub issue #37312 has been automatically assigned in GitHub to PR creator.

@mapleFU
Member Author

mapleFU commented Oct 6, 2023

cc @rok @jorisvandenbossche @pitrou Is this OK, or should I add more documentation about this?

Only if "use_dictionary" and "use_byte_stream_split" is False,
the following encodings are supported.
Currently supported values: {'PLAIN', 'BYTE_STREAM_SPLIT',
'DELTA_BINARY_PACKED', 'DELTA_LENGTH_BYTE_ARRAY', 'DELTA_BYTE_ARRAY'}.
Member

Strictly speaking, "RLE" is also allowed, but since that is now the default and only valid for booleans, there is probably no use in actually specifying it.

Member Author

🤔 As we mentioned in https://issues.apache.org/jira/browse/PARQUET-2222, I don't think exposing RLE is such a good idea...

Member

But then shouldn't we revert #36955, since that enabled RLE by default for boolean columns?

Member Author

RLE is enabled by default on Page V2; however, writing Page V2 is not recommended.

The Page V2 format is great; however, most Parquet implementations have not reached agreement on it. As a result, Page V2 is currently unstable.

RLE for booleans is enabled by default on Page V2, which I think is fine there, but I don't think it is a good idea to enable it by default in general.

Member Author

As this blog post says (https://blog.getdaft.io/p/working-with-the-apache-parquet-file), "Parquet V2" is an ambiguous name. Although we (arrow and arrow-rs) use format 2.x and some of its properties, most implementations can still decode Page V1.

So I think that if users know what they are doing, it is OK to expose RLE to them; however, I think we can just hide it here until PARQUET-2222 reaches a conclusion.

Member

I would propose to move this discussion elsewhere (new issue? or the original issue of #36882)

> RLE is enabled by default on Page V2; however, writing Page V2 is not recommended.

No, we enabled it by default also for DataPage V1 (we have a separate "parquet v2" which is enabled by default, and we decided to write this encoding when that is enabled)

Member

Moved to #36882

Member Author

You're right, I've added RLE here

Member Author

Oh, in the end I decided to remove RLE. Let's leave it as a hack until we make clear how it could be used.

Member

@rok rok left a comment

Looks good!

@mapleFU
Member Author

mapleFU commented Oct 9, 2023

No idea why lint failed...

@rok
Member

rok commented Oct 9, 2023

/arrow/python/pyarrow/parquet/core.py:826:78: W291 trailing whitespace

Co-authored-by: Rok Mihevc <rok@mihevc.org>
@mapleFU
Member Author

mapleFU commented Oct 9, 2023

Thanks!

Specify if we should use dictionary encoding in general or only for
some columns.
compression : str or dict
When encoding the column, if the dictionary size is too large, the
column will fallback to fallback encoding. Specially, ``BOOLEAN`` type
Member

Is this fallback encoding always PLAIN?

Member Author

Yes. The fallback method is even called FallbackToPlainEncoding(), but we may switch to other encodings in the future (arrow-rs and parquet-mr allow some different encodings in format v2).

Member Author

You can refer to #15165

I changed the wording here to PLAIN; if that issue gets implemented, we can update the wording.
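
To illustrate the fallback behaviour discussed in this thread, here is a toy model in plain Python. It is not the Parquet C++ implementation: the function name `encode_pages`, the page size, and the entry-count limit are all hypothetical. The sketch also assumes that pages written before the limit is exceeded stay dictionary-encoded and that only later pages switch to PLAIN.

```python
def encode_pages(values, page_size=4, max_dictionary_entries=3):
    """Toy model: dictionary-encode pages until the dictionary grows too large,
    then write every remaining page with PLAIN encoding."""
    dictionary = {}  # maps each distinct value to its dictionary index
    pages = []       # list of (encoding name, page payload) tuples
    fell_back = False

    for start in range(0, len(values), page_size):
        page = values[start:start + page_size]
        if fell_back:
            # After the fallback, values are stored as-is.
            pages.append(("PLAIN", list(page)))
            continue

        # Register unseen values, then store indices instead of values.
        for value in page:
            dictionary.setdefault(value, len(dictionary))
        pages.append(("RLE_DICTIONARY", [dictionary[value] for value in page]))

        # Assumption of this sketch: the limit is checked after a page is
        # written, and it counts entries rather than bytes.
        if len(dictionary) > max_dictionary_entries:
            fell_back = True

    return list(dictionary), pages


# Low-cardinality values first, then many distinct values that overflow the limit.
column = ["a", "b", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
dictionary, pages = encode_pages(column)
page_encodings = [encoding for encoding, _ in pages]
```

With this input the first two pages are dictionary-encoded and the last one falls back to PLAIN, which mirrors the idea behind the docstring wording under review.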

@mapleFU
Member Author

mapleFU commented Oct 11, 2023

@pitrou @jorisvandenbossche Would you mind taking a look? This is just a 15-line doc change.

Member

@jorisvandenbossche jorisvandenbossche left a comment

Thanks for the reminder, looking good!

@jorisvandenbossche jorisvandenbossche merged commit a5043e7 into apache:main Oct 17, 2023
14 checks passed
@mapleFU mapleFU deleted the parquet-doc/add-doc-for-encoding branch October 17, 2023 13:35
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit a5043e7.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 23, 2023
…w parquet encoding option (apache#38070)

### Rationale for this change

Since Parquet C++ now implements all the encodings, we can document them in the Python docs.

### What changes are included in this PR?

Add the supported encodings to the documentation.

### Are these changes tested?

No

### Are there any user-facing changes?

No

* Closes: apache#37312

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024