[Python] Configure size of data pages in pyarrow.parquet.write_table #18035

Closed
asfimport opened this issue Jan 30, 2018 · 4 comments

Comments

@asfimport

It would be useful to be able to set the size of data pages (within Parquet column chunks) from Python. The current default is 1 MiB, set at https://github.com/apache/parquet-cpp/blob/0875e43010af485e1c0b506d77d7e0edc80c66cc/src/parquet/properties.h#L81. Lowering it could be useful in situations that need more granular access.

We should provide this value as a parameter to pyarrow.parquet.write_table.

Reporter: Wes McKinney / @wesm
Assignee: Wes McKinney / @wesm

PRs and other links:

Note: This issue was originally created as ARROW-2057. Please see the migration documentation for further details.

@asfimport

Even Oldridge:
RAPIDS.AI recently implemented a Parquet reader that loads data onto the GPU. According to the developer, the optimal page size for GPUs is much smaller than the 1M default and should be set closer to 256K. My current workflow uses pyarrow for the Parquet write, and I'd love to be able to specify this.
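
For a rough sense of what the two page sizes mentioned here mean for granularity, the arithmetic can be sketched as below. The 64 MiB column chunk is a made-up figure for illustration, and `approx_page_count` is a hypothetical helper, not part of pyarrow. Real writers treat the page size as an approximate target, so actual page counts will differ.

```python
import math

MIB = 1024 * 1024
KIB = 1024


def approx_page_count(chunk_bytes: int, page_bytes: int) -> int:
    """Estimate how many data pages a column chunk splits into.

    Hypothetical helper: assumes every page is filled exactly to the
    target size, which real Parquet writers only approximate.
    """
    return math.ceil(chunk_bytes / page_bytes)


# Assumed chunk size, chosen only to make the comparison concrete.
chunk = 64 * MIB

default_pages = approx_page_count(chunk, 1 * MIB)   # 1 MiB default
small_pages = approx_page_count(chunk, 256 * KIB)   # 256 KiB, as suggested above

print(default_pages, small_pages)
```

Under these assumptions, a quarter-size page gives four times as many independently readable units per column chunk.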

@asfimport

Wes McKinney / @wesm:
Thanks for the context. Would you like to submit a pull request?

@asfimport

Even Oldridge:
I'm not confident that I could implement this; I'm new to Parquet and not comfortable enough with the C++ required, but I'll bring it up with the team that developed the Parquet reader.

@asfimport

Wes McKinney / @wesm:
Issue resolved by pull request 4597
#4597

@asfimport asfimport added this to the 0.14.0 milestone Jan 11, 2023