Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] Make CacheOptions configurable from Python #36441

Closed
Tom-Newton opened this issue Jul 3, 2023 · 5 comments · Fixed by #36627
Closed

[Python] Make CacheOptions configurable from Python #36441

Tom-Newton opened this issue Jul 3, 2023 · 5 comments · Fixed by #36627

Comments

@Tom-Newton
Copy link
Contributor

Describe the enhancement requested

When doing parquet reads from high latency storage e.g. S3 or Azure blob storage its beneficial to use fragment_scan_options=ParquetFragmentScanOptions(pre_buffer=True).

As far as I can tell pre_buffer=True does 2 important things:

  1. It reads all the parquet metadata at the beginning then makes all the data block reads.
  2. It coalesces adjacent data blocks so that it can read the data in a smaller number of larger reads. Coalescing is done based on CacheOptions, in particular hole_size_limit and range_size_limit.

In my use-case I have found that while point 1 is very beneficial for performance point 2 is actually detrimental. Point 1 though is still enough to make pre_buffer=True a net advantage. I found that setting range_size_limit to a much smaller value (in combination with pyarrow.set_io_thread_count(<large number>)) can give me ~6X performance improvement on my usecase because it increases the parallelism. I would therefore like to make these parameters configurable from python.

If you are interested in my use-case I'm loading a single parquet file from US East Azure blob storage to a workstation in the UK. Probably the current defaults would be sensible if it wasn't for the trans-Atlantic latency. (Pyarrow does not natively support Azure yet but I'm making use of #12914 even though its not in an official release yet)

I can make a PR that implements this if others agree that its a sensible idea.

Component(s)

Python

@Tom-Newton Tom-Newton changed the title Make CacheOptions configurable from Python [Python] Make CacheOptions configurable from Python Jul 3, 2023
@pitrou
Copy link
Member

pitrou commented Jul 5, 2023

@lidavidm
Copy link
Member

lidavidm commented Jul 5, 2023

It seems reasonable.

@jorisvandenbossche
Copy link
Member

The CacheOptions is also already explicitly exposed in the IpcFragmentScanOptions (on the C++ side, this class it not yet exposed in Python).

All the other options of ArrowReaderProperties are exposed in the pyarrow equivalent (in ParquetReader in _parquet.pyx), so indeed seems to make sense to expose the cache options as well.

In the ParquetFragmentScanOptions, currently only pre_buffer is exposed.

@Tom-Newton
Copy link
Contributor Author

take

@Tom-Newton
Copy link
Contributor Author

I've opened a PR for review to address this #36627.

I don't have much experience with C++ or Cython and this is my first contribution to arrow so hopefully what I've done makes sense.

Tom-Newton added a commit to Tom-Newton/arrow that referenced this issue Dec 1, 2023
…pacheGH-36441' of github.com:Tom-Newton/arrow into tomnewton/make_cache_options_configurable_from_python/apacheGH-36441
Tom-Newton added a commit to Tom-Newton/arrow that referenced this issue Dec 6, 2023
…pacheGH-36441' of github.com:Tom-Newton/arrow into tomnewton/make_cache_options_configurable_from_python/apacheGH-36441
jorisvandenbossche added a commit that referenced this issue Dec 14, 2023
### Rationale for this change
Resolves: #36441

### What changes are included in this PR?
- Add python bindings for `CacheOptions` from the C++ side. 
- Allow setting `cache_options` on `ParquetFragmentScanOptions` from the python side. 
- Adjust some of the comments on `CacheOptions`

### Are these changes tested?
Yes. I added python side tests for these newly available configs similar to other configs. I have not added an integration test that ensures setting the configs on the python side leads to correctly using them on the C++ side. 

### Are there any user-facing changes?
Yes. The are new configs available on the python side but the defaults are unchanged. I've added/updated docstrings where relevant. 

* Closes: #36441

Lead-authored-by: Thomas Newton <thomas.w.newton@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche jorisvandenbossche added this to the 15.0.0 milestone Dec 14, 2023
clayburn pushed a commit to clayburn/arrow that referenced this issue Jan 23, 2024
…apache#36627)

### Rationale for this change
Resolves: apache#36441

### What changes are included in this PR?
- Add python bindings for `CacheOptions` from the C++ side. 
- Allow setting `cache_options` on `ParquetFragmentScanOptions` from the python side. 
- Adjust some of the comments on `CacheOptions`

### Are these changes tested?
Yes. I added python side tests for these newly available configs similar to other configs. I have not added an integration test that ensures setting the configs on the python side leads to correctly using them on the C++ side. 

### Are there any user-facing changes?
Yes. The are new configs available on the python side but the defaults are unchanged. I've added/updated docstrings where relevant. 

* Closes: apache#36441

Lead-authored-by: Thomas Newton <thomas.w.newton@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…apache#36627)

### Rationale for this change
Resolves: apache#36441

### What changes are included in this PR?
- Add python bindings for `CacheOptions` from the C++ side. 
- Allow setting `cache_options` on `ParquetFragmentScanOptions` from the python side. 
- Adjust some of the comments on `CacheOptions`

### Are these changes tested?
Yes. I added python side tests for these newly available configs similar to other configs. I have not added an integration test that ensures setting the configs on the python side leads to correctly using them on the C++ side. 

### Are there any user-facing changes?
Yes. The are new configs available on the python side but the defaults are unchanged. I've added/updated docstrings where relevant. 

* Closes: apache#36441

Lead-authored-by: Thomas Newton <thomas.w.newton@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants