GH-35859: [Python] Actually change the default row group size to 1Mi #36012
Conversation
python/pyarrow/_parquet.pyx (outdated)

@@ -1767,7 +1767,7 @@ cdef class ParquetWriter(_Weakrefable):
        int64_t c_row_group_size

        if row_group_size is None or row_group_size == -1:
-            c_row_group_size = ctable.num_rows()
+            c_row_group_size = min(ctable.num_rows(), 1024*1024)
Should we declare a constant rather than write 1024*1024 directly?
I'll defer to @jorisvandenbossche, only because I don't actually know what a constant should look like in this file (style-wise). I can't find any good examples.
Either way is fine for me, but then it would look like the following (defined at the top of the file, after the imports):
arrow/python/pyarrow/_dataset.pyx, lines 42 to 46 (at 2ce4a38):

_DEFAULT_BATCH_SIZE = 2**17
_DEFAULT_BATCH_READAHEAD = 16
_DEFAULT_FRAGMENT_READAHEAD = 4
(but for something that is not reused multiple times, it is less worth it, I think)
Ok, I have to fix a lint issue anyway, so I'll add a constant real quick for readability.
Force-pushed "…efault max to 1Mi" from e82c78d to 3abca31.
Conbench analyzed the 6 benchmark runs on this commit. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
Rationale for this change
In #34280 the default row group size was changed to 1Mi. However, this was accidentally reverted (for Python, but not C++) in #34435.

The problem is that there are two settings: an "absolute max row group size for the writer" and a "row group size to use for this table". The pyarrow user is unable to set the former property.

The previous behavior in pyarrow was: "If no value is given in the call to write_table, then don't specify anything and let the absolute max apply."

The first fix changed the absolute max to 1Mi. However, this made it impossible for the user to use a larger row group size. The second fix changed the absolute max back to 64Mi, which meant the default never actually changed.
What changes are included in this PR?
This change leaves the absolute max at 64Mi. However, if the user does not specify a row group size, we no longer default to the table size; instead we use 1Mi (or the table size, whichever is smaller).
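The resulting behavior can be modeled with a short sketch. The function `expected_row_groups` below is a hypothetical model of the writer's splitting logic (it is not a pyarrow API); it assumes rows are split greedily into groups of the chosen size.

```python
import math

MAX_ROW_GROUP = 1024 * 1024   # the new default cap: 1Mi rows
ABSOLUTE_MAX = 64 * 1024 * 1024  # the writer's absolute max stays at 64Mi

def expected_row_groups(num_rows, row_group_size=None):
    """Model how many row groups write_table would produce.

    With no explicit size, the default is min(num_rows, 1Mi); an explicit
    size is honored up to the writer's 64Mi absolute maximum.
    """
    if row_group_size is None:
        row_group_size = min(num_rows, MAX_ROW_GROUP)
    row_group_size = min(row_group_size, ABSOLUTE_MAX)
    return math.ceil(num_rows / row_group_size)
```

Under this model, a 3Mi-row table written with the defaults now lands in three row groups instead of one, while a user passing an explicit larger size still gets a single group.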
Are these changes tested?
Yes, a unit test was added.
Are there any user-facing changes?
Yes, the default row group size now truly changes to 1Mi. This change was already announced as part of #34280