-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Python] Make CacheOptions
configurable from Python
#36441
Comments
CacheOptions
configurable from Python CacheOptions
configurable from Python
It seems reasonable. |
The CacheOptions is also already explicitly exposed in the All the other options of In the ParquetFragmentScanOptions, currently only |
take |
I've opened a PR for review to address this #36627. I don't have much experience with C++ or Cython and this is my first contribution to arrow so hopefully what I've done makes sense. |
…pacheGH-36441' of github.com:Tom-Newton/arrow into tomnewton/make_cache_options_configurable_from_python/apacheGH-36441
…pacheGH-36441' of github.com:Tom-Newton/arrow into tomnewton/make_cache_options_configurable_from_python/apacheGH-36441
### Rationale for this change Resolves: #36441 ### What changes are included in this PR? - Add python bindings for `CacheOptions` from the C++ side. - Allow setting `cache_options` on `ParquetFragmentScanOptions` from the python side. - Adjust some of the comments on `CacheOptions` ### Are these changes tested? Yes. I added python side tests for these newly available configs similar to other configs. I have not added an integration test that ensures setting the configs on the python side leads to correctly using them on the C++ side. ### Are there any user-facing changes? Yes. The are new configs available on the python side but the defaults are unchanged. I've added/updated docstrings where relevant. * Closes: #36441 Lead-authored-by: Thomas Newton <thomas.w.newton@gmail.com> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…apache#36627) ### Rationale for this change Resolves: apache#36441 ### What changes are included in this PR? - Add python bindings for `CacheOptions` from the C++ side. - Allow setting `cache_options` on `ParquetFragmentScanOptions` from the python side. - Adjust some of the comments on `CacheOptions` ### Are these changes tested? Yes. I added python side tests for these newly available configs similar to other configs. I have not added an integration test that ensures setting the configs on the python side leads to correctly using them on the C++ side. ### Are there any user-facing changes? Yes. The are new configs available on the python side but the defaults are unchanged. I've added/updated docstrings where relevant. * Closes: apache#36441 Lead-authored-by: Thomas Newton <thomas.w.newton@gmail.com> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…apache#36627) ### Rationale for this change Resolves: apache#36441 ### What changes are included in this PR? - Add python bindings for `CacheOptions` from the C++ side. - Allow setting `cache_options` on `ParquetFragmentScanOptions` from the python side. - Adjust some of the comments on `CacheOptions` ### Are these changes tested? Yes. I added python side tests for these newly available configs similar to other configs. I have not added an integration test that ensures setting the configs on the python side leads to correctly using them on the C++ side. ### Are there any user-facing changes? Yes. The are new configs available on the python side but the defaults are unchanged. I've added/updated docstrings where relevant. * Closes: apache#36441 Lead-authored-by: Thomas Newton <thomas.w.newton@gmail.com> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Describe the enhancement requested
When doing parquet reads from high latency storage e.g. S3 or Azure blob storage its beneficial to use
fragment_scan_options=ParquetFragmentScanOptions(pre_buffer=True)
.As far as I can tell
pre_buffer=True
does 2 important things:CacheOptions
, in particularhole_size_limit
andrange_size_limit
.In my use-case I have found that while point 1 is very beneficial for performance point 2 is actually detrimental. Point 1 though is still enough to make
pre_buffer=True
a net advantage. I found that settingrange_size_limit
to a much smaller value (in combination withpyarrow.set_io_thread_count(<large number>)
) can give me ~6X performance improvement on my usecase because it increases the parallelism. I would therefore like to make these parameters configurable from python.If you are interested in my use-case I'm loading a single parquet file from US East Azure blob storage to a workstation in the UK. Probably the current defaults would be sensible if it wasn't for the trans-Atlantic latency. (Pyarrow does not natively support Azure yet but I'm making use of #12914 even though its not in an official release yet)
I can make a PR that implements this if others agree that its a sensible idea.
Component(s)
Python
The text was updated successfully, but these errors were encountered: