
Add prefetcher reader for standard buckets. #795

Merged
martindurant merged 8 commits into fsspec:main from ankitaluthra1:regional
Apr 10, 2026

Conversation

@googlyrahman
Contributor

Description generated by AI

Asynchronous Background Prefetcher

A new BackgroundPrefetcher class has been implemented in gcsfs/prefetcher.py. This component is designed to:

  • Proactively Fetch Data: it spawns a background producer task that fetches sequential blocks of data before they are explicitly requested.
  • Adaptive Blocksize: the engine dynamically adjusts its blocksize based on the history of requested read sizes.
  • Sequential Streak Detection: prefetching is triggered only after a "streak" of sequential reads is detected.
  • Optimized Slicing: uses ctypes for a fast, low-overhead slice implementation (_fast_slice) to manage internal buffers.
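
As a rough illustration of the adaptive-blocksize idea, a running average of recent read sizes can drive the block choice. The class name RunningAverageTracker appears later in this review; the fields, bounds, and scaling rule below are purely illustrative, not the PR's actual code:

```python
class RunningAverageTracker:
    """Track recent read sizes to pick an adaptive prefetch blocksize.

    Illustrative sketch only; the real gcsfs/prefetcher.py implementation
    may differ in fields, bounds, and scaling policy.
    """

    def __init__(self, min_block=256 * 1024, max_block=16 * 1024 * 1024):
        self.min_block = min_block
        self.max_block = max_block
        self.total = 0
        self.count = 0

    def record(self, nbytes):
        # Remember each application-level read size
        self.total += nbytes
        self.count += 1

    def blocksize(self):
        # Fetch a few average reads ahead, clamped to sane bounds
        if self.count == 0:
            return self.min_block
        avg = self.total // self.count
        return max(self.min_block, min(self.max_block, avg * 4))
```

The point is only that the blocksize grows with observed read sizes and stays bounded, which matches the "history of requested read sizes" description above.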

Core Refactoring for Concurrency

The file-fetching logic in gcsfs/core.py has been refactored to enable parallel downloads:

  • _cat_file Decomposition: _cat_file is now split into _cat_file_sequential and _cat_file_concurrent.
  • Threshold-Based Routing: concurrent fetching is used automatically when the requested data size exceeds MIN_CHUNK_SIZE_FOR_CONCURRENCY (defaulting to 5MB) and multiple concurrency slots are requested.
  • Integration: the GCSFile object now optionally initializes the _prefetch_engine when the use_prefetch_reader flag is provided.
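
Under the stated threshold rule, the routing decision reduces to a small predicate. This is a sketch: only the constant name comes from the PR description; the helper name is hypothetical:

```python
# 5 MB default, per the PR description; the real default may be configurable.
MIN_CHUNK_SIZE_FOR_CONCURRENCY = 5 * 1024 * 1024

def should_use_concurrent(request_size, concurrency):
    # Concurrent fetching only pays off when the payload is large enough
    # to amortize per-request overhead and multiple slots were requested.
    return concurrency > 1 and request_size > MIN_CHUNK_SIZE_FOR_CONCURRENCY
```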

@googlyrahman googlyrahman changed the title Add prefetcher engine for regional buckets. Add prefetcher reader for regional buckets. Mar 30, 2026
@codecov

codecov bot commented Mar 30, 2026

Codecov Report

❌ Patch coverage is 97.31903% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.81%. Comparing base (e70bc65) to head (242279a).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
gcsfs/core.py 87.50% 7 Missing ⚠️
gcsfs/prefetcher.py 99.04% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #795      +/-   ##
==========================================
+ Coverage   75.98%   79.81%   +3.83%     
==========================================
  Files          14       16       +2     
  Lines        2665     3042     +377     
==========================================
+ Hits         2025     2428     +403     
+ Misses        640      614      -26     


@googlyrahman googlyrahman marked this pull request as ready for review March 30, 2026 07:38
@googlyrahman googlyrahman changed the title Add prefetcher reader for regional buckets. Add prefetcher reader for standard buckets. Mar 30, 2026
gcsfs/core.py Outdated
) or os.environ.get("use_prefetch_reader", False)
if use_prefetch_reader:
max_prefetch_size = kwargs.get("max_prefetch_size", None)
concurrency = kwargs.get("concurrency", 4)
Contributor

let's call this prefetcher_concurrency

Contributor Author

I think concurrency is the better term here: _cat_file also uses this concurrency parameter when the prefetcher engine is disabled, so it is not specific to the prefetcher.

gcsfs/core.py Outdated
await asyncio.gather(*tasks, return_exceptions=True)
raise e

async def _cat_file(self, path, start=None, end=None, concurrency=4, **kwargs):
Contributor

let's also move this to constant namely DEFAULT_PREFETCHER_CONCURRENCY

Member

Aren't we mixing concerns here? _cat_file does not necessarily use the prefetcher at all. Indeed, why is prefetcher an option, when this is a single blob read, not sequential?

Contributor Author

let's also move this to constant namely DEFAULT_PREFETCHER_CONCURRENCY

Introduced a default in zb_hns_util.py named as DEFAULT_CONCURRENCY

Aren't we mixing concerns here? _cat_file does not necessarily use the prefetcher at all. Indeed, why is prefetcher an option, when this is a single blob read, not sequential?

_cat_file doesn't actually do any prefetching. We are just completing the existing call path (GCSFile -> cache -> GCSFile._fetch -> GCSFileSystem._cat_file). While it does not prefetch, it will now fetch chunks concurrently when the requested size exceeds 5MB.

Collaborator

Resonating with what Martin is saying, I also feel the prefetching and concurrency logic is being integrated directly into the GCSFile core functionality (_cat_file); the custom arguments and environment variables are a side effect of mixing these concerns. Should we consider implementing this as an fsspec cache implementation rather than modifying the core _cat_file logic, wdyt @martindurant?

Contributor Author

I disagree that this should be implemented as a cache! Prefetching and caching should be two different components. Moving forward we should have both caching and prefetching, just like the kernel.

Repetitive accesses should be cached (this helps significantly when reading header/footer-based files), while the prefetcher fetches in the background the data the user will request next.

There's a switch to turn prefetching off as the user wishes.

Regarding the change in _cat_file: the current code path still points to the previously written code, so I don't see a problem here, and we need concurrency in _cat_file to support large-file non-streaming downloads.

@googlyrahman googlyrahman force-pushed the regional branch 6 times, most recently from 44d34f3 to 5919b67 Compare March 31, 2026 21:50
gcsfs/core.py Outdated
use_prefetch_reader = kwargs.get(
"use_experimental_adaptive_prefetching", False
) or os.environ.get(
"use_experimental_adaptive_prefetching", "false"
Contributor

let's capitalize this, to be consistent with other env variable naming convention

Member

Can someone please explain why we cannot use normal cache_type= and cache_options= alone rather than having to invent a set of new environment variables (not to mention the extra kwargs)?

Contributor

@jasha26 Apr 2, 2026

Thanks for the feedback, Martin! I definitely see the logic in wanting to keep the cache_options namespace clean.

While this was initially implemented as a cache, we later chose to keep the prefetcher separate because it addresses a different concern than our standard caches. While cache_type defines how we persist data locally, this prefetcher is focused on when we request data from the network, based on sequential read streaks.

If we merge this into a cache option, we lose the ability to use it in tandem with the existing, robust fsspec caches. We wanted a solution where a customer could get 'smarter' fetches without having to re-architect their current caching strategy.

Regarding the environment variables and kwargs: I completely agree it's a bit of 'noise.' We introduced the feature flag (use_experimental_adaptive_prefetching) mainly to keep it safe and opt-in while it's in this experimental stage.

Additionally, the concurrent downloading logic in _cat_file is a fundamental speed improvement that we felt should benefit the core file-fetching process regardless of the caching or prefetching state.

Open to your thoughts on if there's a better way to expose this 'fetch-level' optimization without cluttering the API!

Member

Additionally, the concurrent downloading logic in _cat_file is a fundamental speed improvement that we felt should benefit the core file-fetching process regardless of the caching or prefetching state.

Agreed on this - it's completely decoupled functionality, could be split into a separate PR (essentially the same as fsspec/s3fs#1007 )

Member

@martindurant Apr 2, 2026

I think you may be fixing the wrong problem. The source of the trouble is that the File API is sync/blocking, and so all the cache implementations are too (_fetch() calls sync on the filesystem for async impls), but you want to run async code.
I think it might be wrong, though - the cache API does see all the reads of the file, and you need to cross the sync/async bridge eventually either way.

Contributor Author

@martindurant I didn't quite catch the meaning of your last comment - could you please elaborate?

Regarding why we need a separate environment variable, and why it can't come in cache_options: the existing fsspec caches don't accept *args or **kwargs.

Cache gets initialised with cache_options here. If a user passes an unsupported option like enable_prefetch_engine in cache_options, the cache raises an unexpected-argument error. I plan to open a separate PR to allow caches to accept arbitrary arguments, and I've already left a comment in this PR tracking that.

Finally, as for why this is implemented as a prefetch engine rather than a cache, that architectural decision is already in discussion in another thread.

# there currently causes instantiation errors. We are holding off on introducing
# them as explicit keyword arguments to ensure existing user workloads are not
# disrupted. This will be refactored once the upstream `fsspec` changes are merged.
use_prefetch_reader = kwargs.get(
Contributor

let's only do env variable for flag and not kwargs

Member

I disagree! This means when we want to surface the arguments, we'll have to support both and decide precedence.

Contributor Author

+1, I would also like to keep both the argument and the environment variable so that users can disable the prefetch engine if they wish.


Member

@martindurant left a comment

The main and I think overriding reason to have the prefetcher be a cache_type, is that we want to make it accessible from other (async) filesystem implementations eventually. This will show that this collaboration isn't just good for GCS, but for the community at large.
In addition, keeping to the established pattern will help long-term maintainability.

):
"""Simple one-shot, or concurrent get of file data"""
if concurrency > 1:
return await self._cat_file_concurrent(
Member

Should work for concurrency == 1 too, instead of having two separate methods

Contributor Author

That is correct. However, in a follow-up CL, the concurrent method will use zero-copy. Therefore, this call is necessary because _cat_file_concurrent will fetch data differently moving forward, and we want to avoid shifting our entire workload to that new path all at once.

Member

we want to avoid shifting our entire workload to that new path all at once.

Why? Having one code path to maintain should be better, unless you anticipate some problem.

Furthermore, join() does not copy when not necessary:

>>> x is b"".join([x])
True
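
Expanding on that check with a self-contained snippet (nothing here is from the PR itself): slicing a bytes object copies, slicing a memoryview does not, and a single-element join of an exact bytes object returns the original object:

```python
data = bytes(range(256)) * 4096  # ~1 MiB payload

# Slicing a memoryview yields a read-only view over the same buffer,
# whereas slicing the bytes object directly would copy the range.
view = memoryview(data)[100:200]
assert view.readonly             # slices of bytes views are read-only
assert view.obj is data          # the view still references the original buffer
assert bytes(view) == data[100:200]

# join() skips the copy when given a single exact-bytes element.
assert b"".join([data]) is data
```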

Contributor Author

@googlyrahman Apr 4, 2026

Personally, I don't have any issue pointing this to the same code path, and I do not anticipate any problems with the new implementation.

However, from an organizational standpoint, we need to keep this feature strictly behind a flag to ensure the merge has no immediate impact, and hence want to keep these changes isolated so that when we later introduce zero-copy to the concurrent path, users who opt out of the flag will still safely default to the old behavior.

Once we make the new behavior the default, we will consolidate the code and remove this method. I have already added a comment in the code for the same.

Comment on lines +1206 to +1220
for i in range(concurrency):
offset = start + (i * part_size)
actual_size = (
part_size if i < concurrency - 1 else total_size - (i * part_size)
)
tasks.append(
asyncio.create_task(
self._cat_file_sequential(
path, start=offset, end=offset + actual_size, **kwargs
)
)
)

try:
results = await asyncio.gather(*tasks)
Member

This is just gather(*[...]); I don't think you need to write out the loop. Also, you don't need create_task(), gather() does that automatically if given coroutines.

Contributor Author

@googlyrahman Apr 3, 2026

The reason to keep asyncio.create_task() is that we need explicit Task objects to manually cancel them in the except block if a failure occurs. If we were on Python 3.11+, we could definitely drop this and use asyncio.TaskGroup to handle the cancellation automatically, but gather doesn't do that natively.

I'm also going to stick with the explicit for loop. Packing the start/end offset calculations into a list comprehension makes that block too dense, so the explicit loop is necessary here for readability.

The code if I remove the loop:

tasks = [
    asyncio.create_task(
         self._cat_file_sequential(
              path,
              start=start + (i * part_size),
              end=start + (i * part_size) + (part_size if i < concurrency - 1 else total_size - (i * part_size)),
              **kwargs
         )
    )
    for i in range(concurrency)
]


try:
return self.gcsfs.cat_file(self.path, start=start, end=end)
if self._prefetch_engine:
return self._prefetch_engine._fetch(start=start, end=end)
Member

I really don't see why you would have a standard fsspec cacher overlaid on the prefetcher. The only one it might work with is "readahead", but the prefetcher actually does all of that functionality and more, no?

Contributor Author

@googlyrahman Apr 3, 2026

The primary reason to position the prefetcher below cache rather than integrating it directly is to allow other caches to benefit from the prefetched data. Furthermore, we will always retain the ability to enable or disable the prefetch logic as needed.

This approach mirrors standard OS kernel architecture, which maintains the page cache (which can be bypassed) and the read-ahead prefetching mechanism as distinct, decoupled entities.

Member

Well we can do some experiments if you like, read patterns like:

Some backtracking: 1,2,3,4,2,5,6,7,5...
Frequent visits home: 1,2,1,3,1,4,1,5...

but I strongly suspect that the first one would behave just like readahead (but better because of prefetching) and the second would be better with type "first" and the prefetcher doesn't help at all.

Contributor Author

You're exactly right that the prefetcher doesn't actively help with the 'frequent visits home' pattern, but crucially, it doesn't backfire either. Because the prefetcher only triggers after detecting a threshold of sequential reads (which we can configure to be 2, 3, or more), it simply stays out of the way during non-sequential access. This is why purely random workloads perform exactly the same whether prefetching is enabled or disabled, as reflected in the benchmarks.

Regarding the 'frequent visits home' pattern specifically: identifying and serving repeatedly accessed blocks (like block 1) is entirely the job of a cache, not a prefetcher. This pattern is actually a perfect example of why decoupling the two is so valuable. Layering them allows the cache to handle the repetitive hits, while the prefetcher handles the sequential scans.
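
The claim that a streak threshold keeps the prefetcher out of the way for non-sequential access can be checked with a toy counter. This is purely illustrative; the threshold value and function name are made up:

```python
STREAK_THRESHOLD = 3  # hypothetical; the PR says this is configurable

def count_triggers(blocks, threshold=STREAK_THRESHOLD):
    """Count how often a streak of sequential block reads would fire prefetch."""
    streak, triggers, prev = 0, 0, None
    for b in blocks:
        # Extend the streak only on an exactly-next block, else reset it
        streak = streak + 1 if prev is not None and b == prev + 1 else 1
        if streak >= threshold:
            triggers += 1
        prev = b
    return triggers

# "Some backtracking" keeps re-forming streaks, so prefetch still fires:
count_triggers([1, 2, 3, 4, 2, 5, 6, 7, 5])   # -> 3
# "Frequent visits home" never builds a streak, so prefetch stays off:
count_triggers([1, 2, 1, 3, 1, 4, 1, 5])      # -> 0
```

This mirrors the argument above: the second pattern is a cache's job, and the streak detector simply never activates on it.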

@googlyrahman googlyrahman force-pushed the regional branch 2 times, most recently from b7edd3c to 9065d67 Compare April 2, 2026 17:22
@googlyrahman
Contributor Author

googlyrahman commented Apr 3, 2026

please see https://github.com/ankitaluthra1/gcsfs/blob/regional/docs/source/prefetcher.rst. I would also like to highlight that, going forward, we'll have cache_type="none" as the default

@martindurant
Member

Thanks for the writeup. I am reading through, but today is a holiday here, and I will have limited time.

@martindurant
Member

we'll have cache_type="none" as default

I agree that this is exactly what makes sense, and seems to me to back up my suggestion that this is really a cache type. :)

@googlyrahman
Contributor Author

@martindurant, while we're debating _fast_slice, I think it would be good to review the rest of the code; _fast_slice only helps with inconsistent reads. We would really like to get this PR reviewed and merged as a priority, if that's possible.

@martindurant
Member

a built-in optimization to release the GIL for large payloads via GIL_THRESHOLD, if you look at the source code for join, its fast-path strictly requires bytes

Well that is material and disappointing - especially since Buffers declare whether they are mutable or not (all slices of memoryviews of bytes are readonly).

@martindurant
Member

while we're debating over _fast_slice, I think it would be good to review the rest of the code. _fast_slice only helps with inconsistent reads. We would really like to get this PR reviewed and merged as a priority, if that's possible.

Agreed, so please split this PR into three distinct pieces so that we can get the important things in first:

  • concurrency in _cat_file (which doesn't depend on prefetcher/cache issues at all, as discussed)
  • the prefetcher/cache implementation, implemented only on the cache_type entrypoint (as we agreed, the process only makes sense with a readahead policy, so the cache and prefetch ideas are intimately connected). It is best if all of the kwargs are explicitly written out in the signature
  • fast-slicing/join etc

@martindurant
Member

Review on the documentation.

I found the document rather long and wordy for what it is - concision is important if we want to get the detail across. I have the following suggestions.

  • Remove all comparisons to the linux kernel. You may mention it as inspiration (just once), but there is no need to describe all the ways in which the implementation here is not like the kernel. Better, just say what the implementation is.
  • I don't find the diagrams particularly useful. An API summary of the classes after the text description would do. I think we talked about an animation (or stepped diagram) showing the read location and pending/arrived buffers; I think that would be a clearer picture of what's actually going on. For example, download_size = sequential_streak * io_size is far more expressive than the current text in Scaling Strategy, where it is buried.
  • Remove extra names: RunningAverageTracker is a better label for this thing than "the brain".
  • Moving back to a cache_type architecture, most of the "interaction with GCSFile" section becomes unnecessary. Passing cache_type is all that is needed (although a default_cache_type argument should be made available for the filesystem class init, as for s3fs).
  • threads are mentioned in a few places, but I think most readers would be confused by what is happening in which thread. main thread, fsspec async thread, prefetch IO (same as fsspec thread?), memcopy/offload worker thread. What about in the case of multiple File objects with prefetchers in one user or multiple user threads?

@martindurant
Member

a built-in optimization to release the GIL for large payloads via GIL_THRESHOLD, if you look at the source code for join, its fast-path strictly requires bytes

Well that is material and disappointing - especially since Buffers declare whether they are mutable or not (all slices of memoryviews of bytes are readonly).

I still think this suggests we need a better join(), not a better slice, and also that this is something we can contribute to CPython, since buffers cannot change when they say they are readonly, and this is something we can check at runtime.

@googlyrahman
Contributor Author

Responding to all 4 open discussions here:

_fast_slice

Well that is material and disappointing - especially since Buffers declare whether they are mutable or not (all slices of memoryviews of bytes are readonly).
I still think this suggests we need a better join(), not a better slice, and also that this is something we can contribute to CPython, since buffers cannot change when they say they are readonly, and this is something we can check at runtime.

I agree with you that this is fundamentally a Python issue and the long-term fix belongs upstream. Ideally, we shouldn't have to monkey-patch or work around this locally. However, we are choosing this approach over waiting on a CPython contribution for the following reasons:

  • Even if fixed upstream, the performance gains would only be available in future Python releases, leaving older versions unaffected.
  • Our goal is for customers to see performance benefits simply by setting an environment variable. Requiring a Python version upgrade breaks that frictionless experience.
  • Upstreaming takes time, and I am not highly familiar with the CPython contribution process. _fast_slice provides the fastest immediate workaround with perf benefits.
  • We plan to use ctypes extensively in zonal anyway due to Python's limitations, so relying on it here aligns with our current mental model.
  • A 4-line local fix ships faster and delivers immediate performance improvements across all supported Python versions. I prefer this as a short-term solution while we pursue an upstream fix long-term.
  • We likely won't need _fast_slice once zero-copy is integrated (though that will also rely on ctypes.memmove). This will address your slicing copy concern as well.

If you're convinced that this is an issue in Python, and _fast_slice is just a short-term workaround until a fix is merged and our user base is on a newer Python version, we can keep _fast_slice for now and simultaneously open an issue in the Python repository for the long-term solution.


Split the PR

Agreed, so please split this PR into three distinct pieces so that we can get the important things in first:

I understand this PR covers two distinct logical components (concurrency and prefetching) and is on the larger side. However, we already have good momentum and discussion from multiple reviewers on this branch. The concurrency code itself is quite small and limited to core.py and one function. Since both features are closely tied to our performance goals, and given the review already done on this PR, I'd request that we keep them together and review them collectively here.


Keeping it as cache

the prefetcher/cache implementation, implemented only on the cache_type entrypoint (as we agreed, the process only makes sense with a readahead policy, so the cache and prefetch ideas are intimately connected). It is best if all of the kwargs are explicitly written out in the signature

I actually don't agree with implementing this strictly as a cache. @martindurant, going forward, I see the cache and the prefetcher as two fundamentally distinct components that will eventually work together for optimal performance:

  • The cache should handle repetitive access to specific sections (like headers/footers) without fetching anything extra.
  • The prefetcher should be strictly responsible for proactively fetching the next chunk based on sequential read patterns.

I strongly prefer not to conflate the two. While this mental model might not perfectly map to existing fsspec caches right now, we plan to contribute our own cache implementation down the line. I will run benchmarks against other caches to demonstrate the performance difference; the comparison for the readahead cache is already in the documentation.


Documentation

Regarding documentation comments

Remove all comparisons to the linux kernel. You may mention it as inspiration (just once), but there is no need to describe all the ways in which the implementation here is not like the kernel. Better, just say what the implementation is.

I'm fine with this. @jasha26, what are your thoughts on dropping the Linux kernel comparison from the prefetcher docs?

Moving back to a cache_type architecture, most of the "interaction with GCSFile" section becomes unnecessary. Passing cache_type is all that is needed (although a default_cache_type argument should be made available for the filesystem class init, as for s3fs).

As explained above, I'm still not aligned with forcing the prefetcher into the cache_type architecture.

I'm addressing the rest of your documentation comments.

@googlyrahman
Contributor Author

Here are the numbers for sequential and random workloads with different caches; the prefetcher performs well for both pure sequential and random workloads.

I still do not understand the problem with introducing the prefetcher as an engine over the cache, when it is controlled by a flag that is false by default.

Benchmarking GCS Read Performance (Best of 3 runs)
Path: rahman-bucket/20gb-file
Modes: sequential, random
Processes: 1 (with 1 threads/process)
Duration per test: 30 seconds
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
processes | io size (MB)    | pattern    | cache type   | default throughput   | concurrency     | prefetcher+concurrency    | default mem     | concurrency mem   | prefetcher mem 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1         | 0.06            | sequential | none         | 1.68                 | 1.68            | 216.05                    | 150.17          | 149.88            | 239.38         
1         | 0.06            | sequential | readahead    | 82.40                | 90.81           | 698.96                    | 150.53          | 149.78            | 362.56         
1         | 0.06            | sequential | blockcache   | 87.70                | 93.02           | 685.53                    | 283.01          | 288.34            | 532.45         
1         | 0.06            | sequential | background   | 77.91                | 93.86           | 708.13                    | 282.87          | 290.36            | 525.93         
1         | 0.06            | sequential | first        | 1.86                 | 1.81            | 196.57                    | 149.90          | 149.88            | 240.61         
1         | 0.06            | sequential | mmap         | 85.90                | 77.39           | 387.63                    | 2798.84         | 2540.77           | 12822.79       
1         | 0.06            | sequential | bytes        | 79.04                | 90.43           | 626.45                    | 149.63          | 155.50            | 362.12         
1         | 1.00            | sequential | none         | 24.99                | 25.68           | 627.97                    | 149.96          | 149.65            | 359.25         
1         | 1.00            | sequential | readahead    | 101.96               | 117.30          | 717.32                    | 149.77          | 150.28            | 381.48         
1         | 1.00            | sequential | blockcache   | 87.12                | 98.80           | 690.63                    | 284.54          | 291.89            | 531.04         
1         | 1.00            | sequential | background   | 90.83                | 98.93           | 726.88                    | 285.45          | 293.58            | 530.31         
1         | 1.00            | sequential | first        | 26.13                | 26.48           | 621.12                    | 149.89          | 149.65            | 368.50         
1         | 1.00            | sequential | mmap         | 87.28                | 91.33           | 418.61                    | 2844.79         | 2986.48           | 14042.47       
1         | 1.00            | sequential | bytes        | 82.78                | 89.86           | 633.42                    | 150.16          | 157.76            | 367.50         
1         | 16.00           | sequential | none         | 151.91               | 206.98          | 679.55                    | 172.29          | 190.18            | 409.20         
1         | 16.00           | sequential | readahead    | 153.29               | 209.94          | 615.98                    | 215.39          | 241.95            | 502.20         
1         | 16.00           | sequential | blockcache   | 83.55                | 89.94           | 654.67                    | 348.78          | 344.59            | 580.84         
1         | 16.00           | sequential | background   | 83.76                | 94.51           | 722.35                    | 347.34          | 343.62            | 575.42         
1         | 16.00           | sequential | first        | 152.02               | 235.47          | 727.88                    | 177.47          | 213.55            | 431.75         
1         | 16.00           | sequential | mmap         | 133.86               | 190.81          | 439.98                    | 4391.33         | 6254.80           | 14870.61       
1         | 16.00           | sequential | bytes        | 138.92               | 205.73          | 202.89                    | 184.43          | 221.42            | 220.68         
1         | 100.00          | sequential | none         | 229.39               | 431.20          | 460.65                    | 415.08          | 513.57            | 685.46         
1         | 100.00          | sequential | readahead    | 179.33               | 289.17          | 276.14                    | 605.15          | 645.19            | 865.71         
1         | 100.00          | sequential | blockcache   | 83.07                | 87.27           | 605.60                    | 476.90          | 473.35            | 709.85         
1         | 100.00          | sequential | background   | 85.55                | 94.08           | 641.21                    | 461.36          | 476.56            | 690.65         
1         | 100.00          | sequential | first        | 206.91               | 401.11          | 498.39                    | 421.57          | 535.19            | 733.51         
1         | 100.00          | sequential | mmap         | 157.08               | 284.05          | 220.88                    | 5355.21         | 9615.83           | 7734.38        
1         | 100.00          | sequential | bytes        | 183.50               | 372.93          | 373.17                    | 532.65          | 549.83            | 544.48         
1         | 0.06            | random     | none         | 1.77                 | 1.73            | 1.77                      | 150.23          | 149.64            | 150.06         
1         | 0.06            | random     | readahead    | 1.12                 | 1.30            | 1.33                      | 150.12          | 149.95            | 155.24         
1         | 0.06            | random     | blockcache   | 1.21                 | 1.27            | 1.27                      | 283.16          | 288.20            | 306.55         
1         | 0.06            | random     | background   | 1.13                 | 1.22            | 0.96                      | 289.13          | 309.74            | 307.14         
1         | 0.06            | random     | first        | 1.87                 | 1.83            | 1.83                      | 149.63          | 150.12            | 149.91         
1         | 0.06            | random     | mmap         | 1.21                 | 1.32            | 1.31                      | 2962.95         | 3205.63           | 3177.78        
1         | 0.06            | random     | bytes        | 1.19                 | 1.29            | 1.34                      | 150.05          | 150.46            | 150.19         
1         | 1.00            | random     | none         | 26.00                | 25.60           | 25.40                     | 149.86          | 150.20            | 150.26         
1         | 1.00            | random     | readahead    | 17.32                | 20.56           | 20.13                     | 150.23          | 150.16            | 150.56         
1         | 1.00            | random     | blockcache   | 16.37                | 18.36           | 14.97                     | 286.67          | 291.99            | 316.14         
1         | 1.00            | random     | background   | 16.58                | 17.55           | 11.54                     | 297.12          | 312.61            | 413.60         
1         | 1.00            | random     | first        | 25.13                | 25.04           | 24.93                     | 150.63          | 150.18            | 150.53         
1         | 1.00            | random     | mmap         | 16.74                | 19.86           | 18.76                     | 2939.65         | 3510.95           | 3316.24        
1         | 1.00            | random     | bytes        | 16.94                | 20.07           | 20.57                     | 150.22          | 149.84            | 149.94         
1         | 16.00           | random     | none         | 147.06               | 210.96          | 211.69                    | 172.29          | 192.88            | 204.70         
1         | 16.00           | random     | readahead    | 128.26               | 185.78          | 185.84                    | 226.55          | 269.53            | 251.01         
1         | 16.00           | random     | blockcache   | 70.22                | 75.01           | 72.22                     | 352.53          | 341.61            | 396.61         
1         | 16.00           | random     | background   | 71.39                | 78.82           | 76.45                     | 348.84          | 357.98            | 401.70         
1         | 16.00           | random     | first        | 153.31               | 226.32          | 260.15                    | 173.59          | 207.68            | 212.64         
1         | 16.00           | random     | mmap         | 132.43               | 195.13          | 197.75                    | 5056.42         | 6930.05           | 6943.20        
1         | 16.00           | random     | bytes        | 138.98               | 213.67          | 221.15                    | 184.71          | 214.33            | 220.97         
1         | 100.00          | random     | none         | 222.12               | 398.74          | 394.62                    | 412.17          | 512.51            | 523.11         
1         | 100.00          | random     | readahead    | 213.33               | 362.26          | 335.73                    | 531.33          | 645.97            | 659.40         
1         | 100.00          | random     | blockcache   | 70.60                | 76.75           | 186.01                    | 487.53          | 475.97            | 599.38         
1         | 100.00          | random     | background   | 75.37                | 78.41           | 211.78                    | 478.44          | 440.82            | 686.96         
1         | 100.00          | random     | first        | 203.17               | 417.30          | 348.70                    | 412.88          | 555.05            | 507.64         
1         | 100.00          | random     | mmap         | 173.54               | 297.69          | 353.30                    | 5554.14         | 7978.68           | 8984.27        
1         | 100.00          | random     | bytes        | 195.71               | 357.95          | 374.41                    | 517.89          | 553.16            | 565.44         

@googlyrahman
Contributor Author

If your concern is that prefetching might perform worse than the existing caches under specific access patterns, I further ran a probabilistic benchmark executing read operations for 30 seconds. Each test randomly alternated between sequential reads and random seeks with a given probability, using payload sizes drawn randomly from 64KB to 100MB. The results show that the prefetcher outperforms all of the existing caches. Plain concurrency naturally comes out ahead on the heavily random mixes, since prefetching cannot help there. Additionally, prefetching will be disabled by default.

The (100% sequential, 0% seek) and (0% sequential, 100% seek) benchmarks are covered in my previous comment.
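For reference, the probabilistic workload described above can be sketched roughly as below. This is a minimal illustration, not the actual benchmark script: the helper name, parameters, and the in-memory file used in the demo are all assumptions; the real runs read `rahman-bucket/20gb-file` through `gcsfs` with the various cache types.

```python
import io
import random
import time


def mixed_read_benchmark(f, file_size, weights=(0.5, 0.5),
                         min_size=64 * 1024, max_size=1024 * 1024,
                         duration=1.0, seed=0):
    """Alternate between sequential reads and random seeks with the given
    (sequential, random) weights, reading a randomly sized payload each
    iteration, and return the achieved throughput in MB/s."""
    rng = random.Random(seed)
    total = 0
    start = time.monotonic()
    while time.monotonic() - start < duration:
        mode = rng.choices(["sequential", "random"], weights=weights)[0]
        size = rng.randint(min_size, max_size)
        if mode == "random":
            # Jump to a random offset that still leaves `size` bytes to read.
            f.seek(rng.randrange(0, max(1, file_size - size)))
        data = f.read(size)
        if not data:
            # Sequential reads ran past EOF: wrap around and keep going.
            f.seek(0)
            continue
        total += len(data)
    elapsed = time.monotonic() - start
    return total / elapsed / 2**20


# Demo against an in-memory file; in the real benchmark this would be
# fs.open("rahman-bucket/20gb-file", "rb") with the cache type under test.
size = 8 * 2**20
f = io.BytesIO(bytes(size))
mbps = mixed_read_benchmark(f, size, weights=(0.3, 0.7), duration=0.2)
print(f"{mbps:.1f} MB/s")
```

Sweeping `weights` over (1.0, 0.0), (0.9, 0.1), (0.7, 0.3), (0.5, 0.5), (0.3, 0.7), (0.1, 0.9), and (0.0, 1.0) reproduces the mix of access patterns reported in the tables below.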

50% sequential, 50% seek

Benchmarking Pure Probabilistic GCS Read Performance (Best of 3 runs)
Path: rahman-bucket/20gb-file
Modes Pool: ['sequential', 'random'] (Weights: Uniform)
Payload Size Range: 0.06MB to 100.00MB (Randomly selected per read)
Processes: 1 (with 1 threads/process)
Duration per test: 30 seconds
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
processes | io size range   | pattern    | cache type   | default throughput   | concurrency     | prefetcher+concurrency    | default mem     | concurrency mem   | prefetcher mem 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1         | 64K-100M        | mixed      | none         | 188.41               | 385.30          | 247.54                    | 426.50          | 535.46            | 817.86         
1         | 64K-100M        | mixed      | readahead    | 198.75               | 330.73          | 248.92                    | 607.46          | 640.81            | 988.20         
1         | 64K-100M        | mixed      | blockcache   | 83.20                | 94.34           | 265.25                    | 442.09          | 428.56            | 677.76         
1         | 64K-100M        | mixed      | background   | 87.71                | 98.07           | 258.70                    | 496.65          | 439.92            | 705.32         
1         | 64K-100M        | mixed      | first        | 193.57               | 414.54          | 269.40                    | 432.23          | 480.20            | 704.06         
1         | 64K-100M        | mixed      | mmap         | 172.30               | 320.92          | 244.23                    | 5287.50         | 9200.60           | 7449.22        
1         | 64K-100M        | mixed      | bytes        | 210.25               | 363.25          | 375.41                    | 569.38          | 588.05            | 562.96 

30% sequential, 70% seek

Benchmarking Pure Probabilistic GCS Read Performance (Best of 3 runs)
Path: rahman-bucket/20gb-file
Modes Pool: ['sequential', 'random'] (Weights: [0.3, 0.7])
Payload Size Range: 0.06MB to 100.00MB (Randomly selected per read)
Processes: 1 (with 1 threads/process)
Duration per test: 30 seconds
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
processes | io size range   | pattern    | cache type   | default throughput   | concurrency     | prefetcher+concurrency    | default mem     | concurrency mem   | prefetcher mem 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1         | 64K-100M        | mixed      | none         | 238.35               | 385.15          | 296.57                    | 411.24          | 458.50            | 737.23         
1         | 64K-100M        | mixed      | readahead    | 175.40               | 337.67          | 272.64                    | 570.00          | 591.68            | 870.39         
1         | 64K-100M        | mixed      | blockcache   | 86.21                | 96.63           | 204.86                    | 468.94          | 421.15            | 667.37         
1         | 64K-100M        | mixed      | background   | 91.01                | 95.03           | 200.70                    | 464.49          | 435.68            | 661.07         
1         | 64K-100M        | mixed      | first        | 234.87               | 412.07          | 292.85                    | 389.13          | 450.06            | 683.10         
1         | 64K-100M        | mixed      | mmap         | 196.51               | 350.15          | 237.52                    | 6270.89         | 8946.45           | 7372.57        
1         | 64K-100M        | mixed      | bytes        | 172.19               | 354.86          | 354.80                    | 475.48          | 544.79            | 570.45 

70% sequential, 30% seek

Benchmarking Pure Probabilistic GCS Read Performance (Best of 3 runs)
Path: rahman-bucket/20gb-file
Modes Pool: ['sequential', 'random'] (Weights: [0.7, 0.3])
Payload Size Range: 0.06MB to 100.00MB (Randomly selected per read)
Processes: 1 (with 1 threads/process)
Duration per test: 30 seconds
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
processes | io size range   | pattern    | cache type   | default throughput   | concurrency     | prefetcher+concurrency    | default mem     | concurrency mem   | prefetcher mem 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1         | 64K-100M        | mixed      | none         | 213.71               | 411.54          | 267.15                    | 422.75          | 471.50            | 749.57         
1         | 64K-100M        | mixed      | readahead    | 177.18               | 351.74          | 252.61                    | 627.62          | 605.20            | 938.20         
1         | 64K-100M        | mixed      | blockcache   | 92.05                | 104.34          | 327.43                    | 497.26          | 491.27            | 691.04         
1         | 64K-100M        | mixed      | background   | 87.63                | 105.37          | 314.72                    | 482.64          | 502.89            | 739.58         
1         | 64K-100M        | mixed      | first        | 242.20               | 421.95          | 277.06                    | 405.77          | 424.25            | 714.46         
1         | 64K-100M        | mixed      | mmap         | 209.46               | 371.35          | 258.25                    | 5701.29         | 9356.32           | 7596.39        
1         | 64K-100M        | mixed      | bytes        | 189.74               | 341.48          | 288.01                    | 526.69          | 565.18            | 554.11      

90% sequential, 10% seek

Benchmarking Pure Probabilistic GCS Read Performance (Best of 3 runs)
Path: rahman-bucket/20gb-file
Modes Pool: ['sequential', 'random'] (Weights: [0.9, 0.1])
Payload Size Range: 0.06MB to 100.00MB (Randomly selected per read)
Processes: 1 (with 1 threads/process)
Duration per test: 30 seconds
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
processes | io size range   | pattern    | cache type   | default throughput   | concurrency     | prefetcher+concurrency    | default mem     | concurrency mem   | prefetcher mem 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1         | 64K-100M        | mixed      | none         | 188.66               | 350.62          | 319.71                    | 419.74          | 503.00            | 889.20         
1         | 64K-100M        | mixed      | readahead    | 172.27               | 332.19          | 270.28                    | 662.27          | 729.73            | 905.65         
1         | 64K-100M        | mixed      | blockcache   | 87.21                | 89.94           | 468.12                    | 442.75          | 470.15            | 721.55         
1         | 64K-100M        | mixed      | background   | 86.60                | 96.95           | 470.47                    | 444.08          | 487.85            | 734.70         
1         | 64K-100M        | mixed      | first        | 182.30               | 393.94          | 319.02                    | 397.27          | 528.46            | 735.02         
1         | 64K-100M        | mixed      | mmap         | 195.35               | 275.87          | 253.81                    | 4692.38         | 7987.88           | 6809.74        
1         | 64K-100M        | mixed      | bytes        | 145.76               | 322.12          | 312.12                    | 500.68          | 645.19            | 540.38 

10% sequential, 90% seek

Benchmarking Pure Probabilistic GCS Read Performance (Best of 3 runs)
Path: rahman-bucket/20gb-file
Modes Pool: ['sequential', 'random'] (Weights: [0.1, 0.9])
Payload Size Range: 0.06MB to 100.00MB (Randomly selected per read)
Processes: 1 (with 1 threads/process)
Duration per test: 30 seconds
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
processes | io size range   | pattern    | cache type   | default throughput   | concurrency     | prefetcher+concurrency    | default mem     | concurrency mem   | prefetcher mem 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1         | 64K-100M        | mixed      | none         | 209.96               | 336.99          | 314.57                    | 429.36          | 502.16            | 473.85         
1         | 64K-100M        | mixed      | readahead    | 176.74               | 315.32          | 259.38                    | 604.33          | 676.70            | 868.23         
1         | 64K-100M        | mixed      | blockcache   | 77.60                | 83.51           | 156.53                    | 438.90          | 436.17            | 599.39         
1         | 64K-100M        | mixed      | background   | 69.82                | 71.91           | 154.31                    | 441.64          | 422.99            | 666.27         
1         | 64K-100M        | mixed      | first        | 206.98               | 379.02          | 309.85                    | 413.13          | 443.55            | 551.02         
1         | 64K-100M        | mixed      | mmap         | 172.05               | 280.41          | 256.43                    | 5710.51         | 8282.66           | 7596.24        
1         | 64K-100M        | mixed      | bytes        | 144.82               | 265.41          | 310.36                    | 495.11          | 545.33            | 637.43         

@martindurant
Member

/gcbrun

@martindurant
Member

xref: python/cpython#148323

@martindurant martindurant merged commit 5596df9 into fsspec:main Apr 10, 2026
9 checks passed