Add prefetcher reader for standard buckets (#795)
Conversation
Codecov Report

❌ Patch coverage is

@@            Coverage Diff             @@
##             main     #795      +/-   ##
==========================================
+ Coverage   75.98%   79.81%   +3.83%
==========================================
  Files          14       16       +2
  Lines        2665     3042     +377
==========================================
+ Hits         2025     2428     +403
+ Misses        640      614      -26
gcsfs/core.py (Outdated)

    ) or os.environ.get("use_prefetch_reader", False)
    if use_prefetch_reader:
        max_prefetch_size = kwargs.get("max_prefetch_size", None)
        concurrency = kwargs.get("concurrency", 4)
let's call this prefetcher_concurrency
There was a problem hiding this comment.
I think concurrency is the better term here: _cat_file also uses this parameter when the prefetcher engine is disabled, so it is not prefetcher-specific.
gcsfs/core.py (Outdated)

        await asyncio.gather(*tasks, return_exceptions=True)
        raise e

    async def _cat_file(self, path, start=None, end=None, concurrency=4, **kwargs):
let's also move this to a constant, namely DEFAULT_PREFETCHER_CONCURRENCY
Aren't we mixing concerns here? _cat_file does not necessarily use the prefetcher at all. Indeed, why is the prefetcher an option, when this is a single blob read, not a sequential one?
> let's also move this to constant namely DEFAULT_PREFETCHER_CONCURRENCY

Introduced a default in zb_hns_util.py named DEFAULT_CONCURRENCY.
> Aren't we mixing concerns here? _cat_file does not necessarily use the prefetcher at all. Indeed, why is prefetcher an option, when this is a single blob read, not sequential?

_cat_file doesn't actually do any prefetching. We are just completing the existing call path (GCSFile -> cache -> GCSFile._fetch -> GCSFileSystem._cat_file). While it does not prefetch, it will now fetch chunks concurrently if the requested size is under 5MB.
Resonating with what Martin is saying, I also feel the prefetching and concurrency logic is being integrated directly into the GCSFile core functionality (_cat_file); adding custom arguments and environment variables is a side effect of mixing these concerns. Should we consider implementing this as an fsspec cache implementation rather than modifying the core _cat_file logic? wdyt @martindurant?
I disagree that this should be implemented as a cache! Prefetching and caching should be two different components. I think moving forward we should have both caching and prefetching, just like the kernel.
Repetitive access should be cached (this will help significantly when reading header/footer-based files), while prefetching is where the engine fetches, in the background, the data the user will request next.
There's a switch to turn prefetching off as per the user's wish.
Regarding the change in _cat_file: the current code path still points to the previously written code, so I don't see a problem here, and we need concurrency in _cat_file to support large-file non-streaming downloads.
44d34f3 to 5919b67 (force-push)
gcsfs/core.py (Outdated)

    use_prefetch_reader = kwargs.get(
        "use_experimental_adaptive_prefetching", False
    ) or os.environ.get(
        "use_experimental_adaptive_prefetching", "false"
let's capitalize this, to be consistent with our naming convention for other env variables
Can someone please explain why we cannot use normal cache_type= and cache_options= alone rather than having to invent a set of new environment variables (not to mention the extra kwargs)?
Thanks for the feedback, Martin! I definitely see the logic in wanting to keep the cache_options namespace clean.
This was initially implemented as a cache, but we later chose to keep the prefetcher separate because it addresses a different concern than our standard caches: while cache_type defines how we persist data locally, the prefetcher is focused on when we request data from the network, based on sequential read streaks.
If we merge this into a cache option, we lose the ability to use it in tandem with the existing, robust fsspec caches. We wanted a solution where a customer could get 'smarter' fetches without having to re-architect their current caching strategy.
Regarding the environment variables and kwargs: I completely agree it's a bit of noise. We introduced the feature flag (use_experimental_adaptive_prefetching) mainly to keep it safe and opt-in while it's in this experimental stage.
Additionally, the concurrent downloading logic in _cat_file is a fundamental speed improvement that we felt should benefit the core file-fetching process regardless of the caching or prefetching state.
Open to your thoughts on whether there's a better way to expose this 'fetch-level' optimization without cluttering the API!
> Additionally, the concurrent downloading logic in _cat_file is a fundamental speed improvement that we felt should benefit the core file-fetching process regardless of the caching or prefetching state.

Agreed on this - it's completely decoupled functionality and could be split into a separate PR (essentially the same as fsspec/s3fs#1007).
I think you may be fixing the wrong problem. The source of the trouble is that the File API is sync/blocking, and so all the cache implementations are too (_fetch() calls sync on the filesystem for async impls), but you want to run async code.
I might be wrong, though - the cache API does see all the reads of the file, and you need to cross the sync/async bridge eventually either way.
@martindurant I didn't quite catch the meaning of your last comment - could you please elaborate?
Regarding why we need a separate environment variable, and why it can't go in cache_options: the reason it isn't included there yet is that the existing fsspec caches don't accept *args or **kwargs.
The cache gets initialised with cache_options here; if a user passes an unsupported option like enable_prefetch_engine in cache_options, the cache raises an unexpected-argument error. I plan to open a separate PR to allow caches to accept arbitrary arguments, and I've already left a comment in this PR tracking that.
Finally, as for why this is implemented as a prefetch engine rather than a cache, that architectural decision is already under discussion in another thread.
    # there currently causes instantiation errors. We are holding off on introducing
    # them as explicit keyword arguments to ensure existing user workloads are not
    # disrupted. This will be refactored once the upstream `fsspec` changes are merged.
    use_prefetch_reader = kwargs.get(
let's only do the env variable for the flag, not the kwarg
I disagree! This means when we want to surface the arguments, we'll have to support both and decide precedence.
+1, I would also like to keep both the argument and the environment variable so that users can disable the prefetch engine if they wish.
martindurant left a comment
The main and I think overriding reason to have the prefetcher be a cache_type, is that we want to make it accessible from other (async) filesystem implementations eventually. This will show that this collaboration isn't just good for GCS, but for the community at large.
In addition, keeping to the established pattern will help long-term maintainability.
    ):
        """Simple one-shot, or concurrent get of file data"""
        if concurrency > 1:
            return await self._cat_file_concurrent(
This should work for concurrency==1 too, instead of having two separate methods.
That is correct. However, in a follow-up CL, the concurrent method will use zero-copy. Therefore, this call is necessary because _cat_file_concurrent will fetch data differently moving forward, and we want to avoid shifting our entire workload to that new path all at once.
> we want to avoid shifting our entire workload to that new path all at once.

Why? Having one code path to maintain should be better, unless you anticipate some problem.
Furthermore, join() does not copy when not necessary:

    >>> x is b"".join([x])
    True
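The single-element join() identity noted above can be checked directly; a minimal sketch of this CPython-specific behavior:

```python
x = b"hello world"

# Single-element join of an exact bytes object returns it as-is: no copy.
assert x is b"".join([x])

# With multiple parts, join necessarily builds a new buffer.
parts = [b"hello", b" ", b"world"]
joined = b"".join(parts)
print(joined == x)   # equal content
print(joined is x)   # but a distinct object
```

So a concurrency==1 call going through the "concurrent" path would not pay a copy for the trivial single-part case.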
Personally, I don't have any issue pointing this to the same code path, and I do not anticipate any problems with the new implementation.
However, from an organizational standpoint, we need to keep this feature strictly behind a flag to ensure the merge has no immediate impact, and hence want to keep these changes isolated so that when we later introduce zero-copy to the concurrent path, users who opt out of the flag will still safely default to the old behavior.
Once we make the new behavior the default, we will consolidate the code and remove this method. I have already added a comment in the code for the same.
        for i in range(concurrency):
            offset = start + (i * part_size)
            actual_size = (
                part_size if i < concurrency - 1 else total_size - (i * part_size)
            )
            tasks.append(
                asyncio.create_task(
                    self._cat_file_sequential(
                        path, start=offset, end=offset + actual_size, **kwargs
                    )
                )
            )

        try:
            results = await asyncio.gather(*tasks)
This is just gather(*[...]); I don't think you need to write out the loop. Also, you don't need create_task(), gather() does that automatically if given coroutines.
The reason to keep asyncio.create_task() is that we need explicit Task objects to manually cancel them in the except block if a failure occurs. If we were on Python 3.11+, we could drop this and use asyncio.TaskGroup to handle the cancellation automatically, but gather doesn't do that natively.
I'm also going to stick with the explicit for loop. Packing the start/end offset calculations into a list comprehension makes the block too dense, so the explicit loop is better for readability.
The code if I remove the loop:
tasks = [
asyncio.create_task(
self._cat_file_sequential(
path,
start=start + (i * part_size),
end=start + (i * part_size) + (part_size if i < concurrency - 1 else total_size - (i * part_size)),
**kwargs
)
)
for i in range(concurrency)
]
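As a self-contained illustration of the cancellation argument (`fetch_part` and `cat_concurrent` are hypothetical stand-ins, not the PR's methods): gather() propagates the first failure but does not cancel the surviving sibling tasks, which is why explicit Task handles are kept.

```python
import asyncio

async def fetch_part(i: int) -> bytes:
    # Hypothetical stand-in for _cat_file_sequential: part 2 fails fast,
    # the others would linger unless explicitly cancelled.
    if i == 2:
        raise IOError(f"part {i} failed")
    await asyncio.sleep(10)
    return b"x" * i

async def cat_concurrent(concurrency: int = 4) -> bytes:
    # Keep explicit Task handles so the except block can cancel siblings.
    tasks = [asyncio.create_task(fetch_part(i)) for i in range(concurrency)]
    try:
        results = await asyncio.gather(*tasks)
        return b"".join(results)
    except Exception:
        for t in tasks:
            t.cancel()
        # Let the cancellations settle before re-raising the original error.
        await asyncio.gather(*tasks, return_exceptions=True)
        raise

try:
    asyncio.run(cat_concurrent())
except IOError as e:
    print(e)
```

Without the cancel loop, the sleeping tasks would only die when the event loop shuts down; with it, the failure path returns promptly.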
        try:
            return self.gcsfs.cat_file(self.path, start=start, end=end)
        if self._prefetch_engine:
            return self._prefetch_engine._fetch(start=start, end=end)
I really don't see why you would have a standard fsspec cacher overlaid on the prefetcher. The only one it might work with is "readahead", but the prefetcher actually does all of that functionality and more, no?
The primary reason to position the prefetcher below cache rather than integrating it directly is to allow other caches to benefit from the prefetched data. Furthermore, we will always retain the ability to enable or disable the prefetch logic as needed.
This approach mirrors standard OS kernel architecture, which maintains the page cache (which can be bypassed) and the read-ahead prefetching mechanism as distinct, decoupled entities.
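To make that layering concrete, a minimal hypothetical sketch of "cache above, prefetcher below" (the class names and `fetch`/`read` signatures are invented for illustration, not the PR's API): repeated reads are absorbed by the cache layer, while the lower engine remains the single place where network fetches, and eventually prefetches, would happen.

```python
class PrefetchEngine:
    """Hypothetical lower layer: serves reads from the backend and is
    where read-ahead logic would live."""
    def __init__(self, backend_read):
        self.backend_read = backend_read

    def fetch(self, start: int, end: int) -> bytes:
        return self.backend_read(start, end)

class BlockCache:
    """Hypothetical upper layer: caches repeated ranges, delegates misses."""
    def __init__(self, lower: PrefetchEngine):
        self.lower = lower
        self.store = {}

    def read(self, start: int, end: int) -> bytes:
        key = (start, end)
        if key not in self.store:
            self.store[key] = self.lower.fetch(start, end)
        return self.store[key]

data = bytes(range(256))
calls = []
def backend(start, end):
    calls.append((start, end))   # count real fetches
    return data[start:end]

stack = BlockCache(PrefetchEngine(backend))
stack.read(0, 16)
stack.read(0, 16)                # repeat served by the cache layer
print(len(calls))                # backend touched only once
```

This is the "frequent visits home" case: the cache layer absorbs the repeats, and any prefetch logic in the lower layer stays independent of it.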
Well, we can do some experiments if you like, with read patterns like:
Some backtracking: 1,2,3,4,2,5,6,7,5...
Frequent visits home: 1,2,1,3,1,4,1,5...
but I strongly suspect that the first one would behave just like readahead (though better, because of prefetching) and the second would be better with type "first", where the prefetcher doesn't help at all.
You're exactly right that the prefetcher doesn't actively help with the 'frequent visits home' pattern, but crucially, it doesn't backfire either. Because the prefetcher only triggers after detecting a threshold of sequential reads (which we can configure to be 2, 3, or more), it simply stays out of the way during non-sequential access. This is why purely random workloads perform exactly the same whether prefetching is enabled or disabled, as reflected in the benchmarks.
Regarding the 'frequent visits home' pattern specifically: identifying and serving repeatedly accessed blocks (like block 1) is entirely the job of a cache, not a prefetcher. This pattern is actually a perfect example of why decoupling the two is so valuable. Layering them allows the cache to handle the repetitive hits, while the prefetcher handles the sequential scans.
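A minimal sketch of the threshold-based trigger described above (the class name and fields are hypothetical, not the PR's implementation): the detector only reports "prefetch" after a configurable number of consecutive sequential reads, so random or backtracking access leaves it dormant.

```python
class StreakDetector:
    """Hypothetical sketch: trigger prefetch only after `threshold`
    consecutive sequential reads; anything else resets the streak."""
    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self.expected_offset = None
        self.streak = 0

    def observe(self, start: int, end: int) -> bool:
        # A read is "sequential" if it begins where the last one ended.
        if start == self.expected_offset:
            self.streak += 1
        else:
            self.streak = 0
        self.expected_offset = end
        return self.streak >= self.threshold

d = StreakDetector(threshold=2)
print(d.observe(0, 100))    # first read: no streak yet
print(d.observe(100, 200))  # streak building
print(d.observe(200, 300))  # threshold reached: prefetch
print(d.observe(50, 80))    # backtrack resets the streak
```

Under this model, the "frequent visits home" pattern never reaches the threshold, which matches the claim that the prefetcher stays out of the way there.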
b7edd3c to 9065d67 (force-push)
Please see https://github.com/ankitaluthra1/gcsfs/blob/regional/docs/source/prefetcher.rst. I would also like to highlight that going forward, we'll have cache_type="none" as the default.
Thanks for the writeup. I am reading through, but today is a holiday here, and I will have limited time.
I agree that this is exactly what makes sense, and it seems to me to back up my suggestion that this is really a cache type. :)
@martindurant, while we're debating over
Well, that is material and disappointing, especially since Buffers declare whether they are mutable or not (all slices of memoryviews of bytes are readonly).
Agreed, so please split this PR into three distinct pieces so that we can get the important things in first:
Review of the documentation: I found it rather long and wordy for what it is; concision is important if we want to get the detail across. I have the following suggestions.
I still think this suggests we need a better join(), not a better slice, and also that this is something we can contribute to CPython, since buffers cannot change when they say they are readonly, and this is something we can check at runtime.
Responding to all 4 open discussions here:

_fast_slice
I agree with you that this is fundamentally a Python issue and the long-term fix belongs upstream. Ideally, we shouldn't have to monkey-patch or work around this locally. However, we are choosing this approach over waiting on a CPython contribution for the following reasons: If you're convinced that this is an issue in Python, and

Split the PR
I understand this PR covers two distinct logical components (concurrency and prefetching) and is on the larger side. However, we already have good momentum and discussion from multiple reviewers on this branch. The concurrency code itself is quite small and limited to

Keeping it as cache
I actually don't agree with implementing this strictly as a cache. @martindurant, for the future, I see the cache and prefetcher as two fundamentally distinct components that will eventually work together for optimal performance:
I strongly prefer not to conflate the two. While this mental model might not perfectly map to existing fsspec caches right now, we plan to contribute our own cache implementation down the line. I will run benchmarks against other caches to demonstrate the performance difference; the comparison for the readahead cache is already in the documentation.

Documentation
I'm fine with this. @jasha26, what are your thoughts on dropping the Linux kernel comparison from the prefetcher docs?
As explained above, I'm still not aligned with forcing the prefetcher into the cache_type architecture. I'm addressing the rest of your documentation comments.
Here are the numbers for sequential and random workloads with different caches; the prefetcher seems to perform well for both pure sequential and pure random workloads. I still do not understand the problem with introducing the prefetcher as an engine over the cache, if it is controlled by a flag that is false by default.
If your concern is that prefetching might perform worse than the existing caches with specific access patterns, I further ran a probabilistic benchmark executing read operations for 30 seconds. The test randomly alternated between sequential reads and random seeks with different probabilities, using random sizes ranging from 64KB to 100MB. The results show that the prefetcher outperforms all existing caches. Performance under concurrency would naturally be better since no prefetching is involved for this random test. Additionally, prefetching will be disabled by default. The (100% sequential, 0% seek) and (0% sequential, 100% seek) benchmarks are covered in my previous comment. Mixes benchmarked: 50% sequential / 50% seek, 30% / 70%, 70% / 30%, 90% / 10%, and 10% / 90%.
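For reference, a hedged sketch of how such a probabilistic workload could be generated (the function name and exact sampling scheme are assumptions for illustration; the PR's actual benchmark harness is not shown in this thread):

```python
import random

def gen_workload(p_seq: float, n_ops: int, file_size: int, seed: int = 0):
    """Hypothetical sketch of the mixed workload: with probability p_seq
    continue sequentially from the last offset, otherwise seek randomly.
    Read sizes are drawn uniformly from 64KB to 100MB."""
    rng = random.Random(seed)
    ops, offset = [], 0
    for _ in range(n_ops):
        size = rng.randint(64 * 1024, 100 * 1024 * 1024)
        # Seek if the coin says so, or if a sequential read would run
        # past the end of the file.
        if rng.random() >= p_seq or offset + size > file_size:
            offset = rng.randrange(max(1, file_size - size))
        ops.append((offset, offset + size))
        offset += size
    return ops

ops = gen_workload(p_seq=0.5, n_ops=5, file_size=1 << 30)
print(len(ops))  # 5
```

Replaying such a list of (start, end) ranges against each cache configuration is one way to reproduce the mix ratios quoted above.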
/gcbrun
xref: python/cpython#148323
Description generated by AI
Asynchronous Background Prefetcher
A new BackgroundPrefetcher class has been implemented in gcsfs/prefetcher.py. This component is designed to:
Core Refactoring for Concurrency
The file-fetching logic in gcsfs/core.py has been refactored to enable parallel downloads: