GH-35859: [Python] Actually change the default row group size to 1Mi #36012

@westonpace westonpace commented Jun 9, 2023

Rationale for this change

In #34280 the default row group size was changed to 1Mi. However, this was accidentally reverted (for Python, but not C++) in #34435.

The problem is that there are two separate settings: an "absolute max row group size for the writer" and a "row group size to use for this table". The pyarrow user is unable to set the former.

The behavior in pyarrow was previously: "If no value is given in the call to write_table, then don't specify anything and let the absolute max apply."

The first fix changed the absolute max to 1Mi. However, this made it impossible for the user to use a larger row group size. The second fix changed the absolute max back to 64Mi. However, this meant the default didn't change.
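The interaction between the two settings can be modeled in a few lines of plain Python. This is only a sketch of the behavior described above, not the actual Arrow code: `old_effective_size` is a hypothetical helper, and it assumes the writer simply caps the per-table size at its absolute max.

```python
MI = 1024 * 1024  # "Mi" as used in this PR: 2**20 rows


def old_effective_size(num_rows, requested, absolute_max):
    """Model of the pre-PR pyarrow behavior (hypothetical helper).

    With no requested size, the whole table length is used and the
    writer's absolute max is the only thing that limits it.
    """
    size = num_rows if requested is None else requested
    return min(size, absolute_max)


num_rows = 2 * MI

# Original state: absolute max of 64Mi, so the default is one big row group.
original_default = old_effective_size(num_rows, None, 64 * MI)
# First fix: absolute max lowered to 1Mi, so the default becomes 1Mi ...
first_fix_default = old_effective_size(num_rows, None, 1 * MI)
# ... but an explicit request for a larger row group is capped as well.
first_fix_explicit = old_effective_size(num_rows, 2 * MI, 1 * MI)
# Second fix: absolute max back at 64Mi, so the default regresses.
second_fix_default = old_effective_size(num_rows, None, 64 * MI)
```

In this model neither fix can satisfy both requirements at once, because the default and the upper limit are controlled by the same value.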

What changes are included in this PR?

This change leaves the absolute max at 64Mi. However, if the user does not specify a row group size, we no longer "just use the table size" and instead use 1Mi.
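As a rough illustration of the new default, the following pure-Python sketch computes the row group lengths a table would be split into. The helper name `row_group_lengths` is hypothetical, and the 64Mi absolute max is left out for brevity; only the `min(num_rows, 1024*1024)` rule is taken from this PR.

```python
def row_group_lengths(num_rows, row_group_size=None):
    """Return the lengths of the row groups a table would be split into.

    Hypothetical helper: None or -1 means "use the default", which is
    min(num_rows, 1Mi) after this PR instead of num_rows.
    """
    if row_group_size is None or row_group_size == -1:
        row_group_size = min(num_rows, 1024 * 1024)
    if num_rows == 0:
        return []
    return [
        min(row_group_size, num_rows - start)
        for start in range(0, num_rows, row_group_size)
    ]


# No size given: a 2.5M-row table is split into 1Mi-row groups.
default_split = row_group_lengths(2_500_000)
# An explicit larger size is still honored.
explicit_split = row_group_lengths(2_500_000, 2_000_000)
# A small table still produces a single row group.
small_split = row_group_lengths(1000)
```

Taking the minimum with `num_rows` keeps small tables in a single row group whose size matches the table.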

Are these changes tested?

Yes, a unit test was added.

Are there any user-facing changes?

Yes, the default row group size now truly changes to 1Mi. This change was already announced as part of #34280.

@@ -1767,7 +1767,7 @@ cdef class ParquetWriter(_Weakrefable):
             int64_t c_row_group_size

         if row_group_size is None or row_group_size == -1:
-            c_row_group_size = ctable.num_rows()
+            c_row_group_size = min(ctable.num_rows(), 1024*1024)
Member

Should we declare a constant rather than using 1024*1024 directly?

Member Author

I'll defer to @jorisvandenbossche only because I don't actually know what a constant should look like in this file (style-wise). I can't find any good examples.

Member

Either way is fine for me, but it would look like the following (defined at the top of the file, after the imports):

_DEFAULT_BATCH_SIZE = 2**17
_DEFAULT_BATCH_READAHEAD = 16
_DEFAULT_FRAGMENT_READAHEAD = 4

(but for something that is not reused multiple times, it is less worth it, I think)

Member Author

OK, I have to fix a lint issue anyway, so I'll add a constant real quick for readability.

@github-actions github-actions bot added the "awaiting merge" and "awaiting changes" labels and removed the "awaiting committer review" and "awaiting merge" labels Jun 13, 2023
@westonpace westonpace force-pushed the bugfix/GH-35859--default-1Mi-row-group-not-working branch from e82c78d to 3abca31 Compare June 22, 2023 19:28
@github-actions github-actions bot added the "awaiting change review" and "awaiting changes" labels and removed the "awaiting changes" and "awaiting change review" labels Jun 22, 2023
@github-actions github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label Jun 22, 2023
@westonpace westonpace merged commit 0d1f723 into apache:main Jun 22, 2023
11 checks passed
@conbench-apache-arrow

Conbench analyzed the 6 benchmark runs on commit 0d1f7234.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Successfully merging this pull request may close these issues.

[Python] New row_group_size default of 1 Mi not working