Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-32863: [C++][Parquet] Add DELTA_BYTE_ARRAY encoder to Parquet writer #14341

Merged
merged 78 commits into from
Aug 21, 2023

Conversation

rok
Copy link
Member

@rok rok commented Oct 7, 2022

This is to add DELTA_BYTE_ARRAY encoder.

@github-actions
Copy link

github-actions bot commented Oct 7, 2022

@rok
Copy link
Member Author

rok commented Oct 7, 2022

This is not really review ready yet and needs #14191 and #14293 to merge first.

@rok
Copy link
Member Author

rok commented Jan 26, 2023

note: this should support BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY. PARQUET-2231

@rok rok changed the title ARROW-17619: [C++][Parquet] Add DELTA_BYTE_ARRAY encoder to Parquet writer GH-32863: [C++][Parquet] Add DELTA_BYTE_ARRAY encoder to Parquet writer Feb 23, 2023
@github-actions
Copy link

@mapleFU
Copy link
Member

mapleFU commented Feb 28, 2023

(Can this patch go ahead now?)

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though this patch is still draft, I run it for fun. Seems that here is some issue. May that helps

cpp/src/parquet/encoding.cc Outdated Show resolved Hide resolved
cpp/src/parquet/encoding.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting review Awaiting review awaiting committer review Awaiting committer review awaiting changes Awaiting changes and removed awaiting review Awaiting review awaiting committer review Awaiting committer review labels Mar 1, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Mar 1, 2023
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Mar 1, 2023
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Aug 21, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Aug 21, 2023
@pitrou
Copy link
Member

pitrou commented Aug 21, 2023

I'm going to merge this now.

@pitrou pitrou merged commit 94bd0d2 into apache:main Aug 21, 2023
30 of 33 checks passed
@pitrou pitrou removed the awaiting change review Awaiting change review label Aug 21, 2023
@github-actions github-actions bot added the awaiting committer review Awaiting committer review label Aug 21, 2023
@mapleFU
Copy link
Member

mapleFU commented Aug 21, 2023

Thats a long time, bravo!

@rok
Copy link
Member Author

rok commented Aug 21, 2023

Thanks all for helping this along! I'm very happy we got this in!

@anjakefala
Copy link
Collaborator

Congratulations, @rok!!!

@AlenkaF
Copy link
Member

AlenkaF commented Aug 22, 2023

Great work everybody, congrats!!

@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 94bd0d2.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

@jorisvandenbossche
Copy link
Member

We should probably update the python docstring as well:

column_encoding : string or dict, default None
Specify the encoding scheme on a per column basis.
Currently supported values: {'PLAIN', 'BYTE_STREAM_SPLIT'}.
Certain encodings are only compatible with certain data types.
Please refer to the encodings section of `Reading and writing Parquet
files <https://arrow.apache.org/docs/cpp/parquet.html#encodings>`_.

(which was clearly already outdated before this PR as well! ;))

@pitrou
Copy link
Member

pitrou commented Aug 22, 2023

Feel free to open a PR :-)

@rok
Copy link
Member Author

rok commented Aug 22, 2023

I created an issue :) #37312

Comment on lines -1342 to -1343
RETURN_NOT_OK(helper.builder->ReserveData(
std::min<int64_t>(len_, helper.chunk_space_remaining)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why previously here ReserveData for min(len_, helper.chunk_space_remaining) here, wouldn't len_ be too large @pitrou

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand: are you commented on the removed code? I'd rather not try to understand code that was removed months ago...

Copy link
Member

@jorisvandenbossche jorisvandenbossche Nov 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #38437 for context, where this code is being added back partially

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't understand them T_T.

Just confused by this change, so I tried to understand the origin code and find out why this cause the regression, what should I do to fix it

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Nov 8, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…t writer (apache#14341)

This is to add DELTA_BYTE_ARRAY encoder.
* Closes: apache#32863

Lead-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Rok <rok@mihevc.org>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][Parquet] Add DELTA_BYTE_ARRAY encoder to Parquet writer
9 participants