
[Python] New row_group_size default of 1 Mi not working #35859

Closed
jonashaag opened this issue May 31, 2023 · 1 comment · Fixed by #36012
@jonashaag (Contributor) commented May 31, 2023

Describe the bug, including details regarding any error messages, version, and platform.

The release notes of Arrow 12 say that the default row group size has been lowered from 64 Mi to 1 Mi. But here's what's happening in practice (PyArrow from conda-forge):

In [1]: import pyarrow.parquet as pq

In [2]: import pyarrow as pa

In [3]: pa.__version__
Out[3]: '12.0.0'

In [12]: pq.write_table(pa.Table.from_pydict({"x": [42] * (64 * 1024 * 1024 + 1)}), "/tmp/x")

In [13]: pq.read_metadata("/tmp/x")
Out[13]:
<pyarrow._parquet.FileMetaData object at 0x7fa6b81eaa70>
  created_by: parquet-cpp-arrow version 12.0.0
  num_columns: 1
  num_rows: 67108865
  num_row_groups: 2  # <=============
  format_version: 2.6
  serialized_size: 485

In [14]: pq.write_table(pa.Table.from_pydict({"x": [42] * (64 * 1024 * 1024 + 0)}), "/tmp/x")

In [15]: pq.read_metadata("/tmp/x")
Out[15]:
<pyarrow._parquet.FileMetaData object at 0x7fa6b823d490>
  created_by: parquet-cpp-arrow version 12.0.0
  num_columns: 1
  num_rows: 67108864
  num_row_groups: 1  # <=============
  format_version: 2.6
  serialized_size: 376

Component(s)

Python

@jonashaag jonashaag changed the title New row_group default of 1024 * 1024 not working New row_group_size default of 1 Mi not working May 31, 2023
@westonpace (Member) commented:
This is embarrassing, and shame on me for not writing better regression tests.

#34281 changed the default for C++ and Python, but it was too strict and it wasn't possible (via Python) to go past the default.

#34435 restored the ability to go past the default, but it looks like it changed the default for PyArrow in the process.

I'll put in a fix soon.

@westonpace westonpace added the Priority: Blocker Marks a blocker for the release label Jun 8, 2023
@westonpace westonpace changed the title New row_group_size default of 1 Mi not working [Python] New row_group_size default of 1 Mi not working Jun 8, 2023
@westonpace westonpace self-assigned this Jun 8, 2023
westonpace added a commit that referenced this issue Jun 22, 2023
…36012)

### Rationale for this change

In #34280 the default row group size was changed to 1 Mi. However, this was accidentally reverted (for Python, but not C++) in #34435.

The problem is that there is both an "absolute max row group size for the writer" and a "row group size to use for this table". The PyArrow user is unable to set the former property.

The behavior in PyArrow was previously: if no value is given in the call to `write_table`, don't specify anything and let the absolute max apply.

The first fix changed the absolute max to 1 Mi. However, this made it impossible for the user to use a larger row group size. The second fix changed the absolute max back to 64 Mi. However, this meant the default didn't change.

### What changes are included in this PR?

This change leaves the absolute max at 64 Mi. However, if the user does not specify a row group size, we no longer "just use the table size" and instead use 1 Mi.

### Are these changes tested?

Yes, a unit test was added.

### Are there any user-facing changes?

Yes, the default row group size now truly changes to 1Mi.  This change was already announced as part of #34280
* Closes: #35859

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
@westonpace westonpace added this to the 13.0.0 milestone Jun 22, 2023