GH-35859: [Python] Actually change the default row group size to 1Mi #36012
Conversation
python/pyarrow/_parquet.pyx (outdated)

@@ -1767,7 +1767,7 @@ cdef class ParquetWriter(_Weakrefable):
        int64_t c_row_group_size

        if row_group_size is None or row_group_size == -1:
-            c_row_group_size = ctable.num_rows()
+            c_row_group_size = min(ctable.num_rows(), 1024*1024)
Should we declare a constant rather than write 1024*1024 directly?
I'll defer to @jorisvandenbossche, only because I don't actually know what a constant should look like in this file (style-wise). I can't find any good examples.
Either way is fine for me, but then it would look like the following (defined at the top of the file, after the imports):
arrow/python/pyarrow/_dataset.pyx, lines 42 to 46 (at 2ce4a38):

_DEFAULT_BATCH_SIZE = 2**17
_DEFAULT_BATCH_READAHEAD = 16
_DEFAULT_FRAGMENT_READAHEAD = 4
(but for something that is not reused multiple times, it is less worth it, I think)
Ok, I have to fix a lint issue anyway, so I'll add a constant real quick for readability.
Force-pushed "…efault max to 1Mi" from e82c78d to 3abca31.
Conbench analyzed the 6 benchmark runs on this commit. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
Rationale for this change
In #34280 the default row group size was changed to 1Mi. However, this was accidentally reverted (for Python, but not C++) in #34435.

The problem is that there are two settings: an "absolute max row group size for the writer" and a "row group size to use for this table". The pyarrow user is unable to set the former property.

The previous behavior in pyarrow was: "If no value is given in the call to write_table, then don't specify anything and let the absolute max apply."

The first fix changed the absolute max to 1Mi. However, this made it impossible for the user to use a larger row group size. The second fix changed the absolute max back to 64Mi, which meant the default never actually changed.
What changes are included in this PR?
This change leaves the absolute max at 64Mi. However, if the user does not specify a row group size, we no longer default to the table size; instead we use 1Mi (or the table size, whichever is smaller).
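The resulting behavior can be modeled with a short sketch. The function `expected_row_groups` below is a hypothetical model of the writer's splitting logic (it is not a pyarrow API); it assumes rows are split greedily into groups of the chosen size.

```python
import math

MAX_ROW_GROUP = 1024 * 1024   # the new default cap: 1Mi rows
ABSOLUTE_MAX = 64 * 1024 * 1024  # the writer's absolute max stays at 64Mi

def expected_row_groups(num_rows, row_group_size=None):
    """Model how many row groups write_table would produce.

    With no explicit size, the default is min(num_rows, 1Mi); an explicit
    size is honored up to the writer's 64Mi absolute maximum.
    """
    if row_group_size is None:
        row_group_size = min(num_rows, MAX_ROW_GROUP)
    row_group_size = min(row_group_size, ABSOLUTE_MAX)
    return math.ceil(num_rows / row_group_size)
```

Under this model, a 3Mi-row table written with the defaults now lands in three row groups instead of one, while a user passing an explicit larger size still gets a single group.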
Are these changes tested?
Yes, a unit test was added.
Are there any user-facing changes?
Yes, the default row group size now truly changes to 1Mi. This change was already announced as part of #34280