Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Python] Clarify meaning of "row_group_size" and change default to something more reasonable #34280

Closed
westonpace opened this issue Feb 21, 2023 · 2 comments · Fixed by #34281
Assignees
Labels
Breaking Change Includes a breaking change to the API Component: C++ Type: enhancement
Milestone

Comments

@westonpace
Copy link
Member

Describe the enhancement requested

The row_group_size property in pyarrow.parquet.write_table is described as:

Maximum size of each written row group. If None, the row group size will be the minimum of the Table size and 64 * 1024 * 1024.

This limit is in # of rows but that is not obvious from the description. Furthermore, 64Mi is an extremely high limit for row group size. I believe it is perhaps based on the Java implementation, however the Java implementation treats this number as "bytes" and not "rows" (64MiB row groups is very reasonable).

Perhaps the best solution would be to add support for a limit in terms of bytes. In the meantime, I think we should lower the default limit to 1Mi rows.

Component(s)

C++

@mapleFU
Copy link
Member

mapleFU commented Feb 22, 2023

Hi westonpace, maybe it's not related to your patch. But I found the usage of row_group_size may confusing: #33652

Maybe you can take a look?

@westonpace
Copy link
Member Author

It's not related to the patch but it is something we want to be able to support in #21090 . Thank you for helping me connect these things.

westonpace added a commit that referenced this issue Feb 27, 2023
…default to 1Mi (#34281)

BREAKING CHANGE: Changes the default row group size when writing parquet files.
* Closes: #34280

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
@westonpace westonpace added this to the 12.0.0 milestone Feb 27, 2023
@wjones127 wjones127 added the Breaking Change Includes a breaking change to the API label Apr 26, 2023
westonpace added a commit that referenced this issue Jun 22, 2023
…36012)

### Rationale for this change

In #34280 the default row group size was changed to 1Mi.  However, this was accidentally reverted (for python, but not C++) in #34435 

The problem is that there is both an "absolute max row group size for the writer" and a "row group size to use for this table"  The pyarrow user is unable to set the former property.

The behavior in pyarrow was previously "If no value is given in the call to write_table then don't specify anything and let the absolute max apply"

The first fix changed the absolute max to 1Mi.  However, this made it impossible for the user to use a larger row group size.  The second fix changed the absolute max back to 64Mi.  However, this meant the default didn't change.

### What changes are included in this PR?

This change leaves the absolute max at 64Mi.  However, if the user does not specify a row group size, we no longer "just use the table size" and instead use 1Mi.

### Are these changes tested?

Yes, a unit test was added.

### Are there any user-facing changes?

Yes, the default row group size now truly changes to 1Mi.  This change was already announced as part of #34280
* Closes: #35859

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Breaking Change Includes a breaking change to the API Component: C++ Type: enhancement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants