Skip to content

Conversation

lwfitzgerald
Copy link
Contributor

@lwfitzgerald lwfitzgerald commented Sep 22, 2025

Rationale for this change

FsspecFileIO.get_fs can be called by multiple threads when ExecutorFactory is used (for example by DataScan.plan_files).

The base class of fsspec filesystem objects, fsspec.spec.AbstractFileSystem, internally caches instances through the fsspec.spec._Cached metaclass. The caching key used includes threading.get_ident(), making entries thread-local:
https://github.com/fsspec/filesystem_spec/blob/f84b99f0d1f079f990db1a219b74df66ab3e7160/fsspec/spec.py#L71

The FsspecFileIO.get_fs LRU cache (around FsspecFileIO._get_fs) breaks the thread-locality of the filesystem instances as it will return the same instance for different threads.

One consequence of this is that for s3fs.S3FileSystem, HTTP connection pooling no longer occurs per thread (as is normal with aiobotocore), as the aiobotocore client object (containing the aiohttp.ClientSession) is stored on the s3fs.S3FileSystem.

This change addresses this by making the FsspecFileIO.get_fs cache thread-local.

Are these changes tested?

Tested locally. Unit test included covering the caching behaviour.

Are there any user-facing changes?

Yes - S3 HTTP connection pooling now occurs per-thread, matching the behaviour of aiobotocore when it used in the recommended way with an event loop per thread.

@lwfitzgerald lwfitzgerald marked this pull request as ready for review September 22, 2025 10:24
Copy link

@AlexWendland AlexWendland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of the local self._thread_locals variable ensure the lru_cache is distinct to each thread. This enables per-thread connection pooling of HTTP requests. This has been correctly removed and added back upon serialisation/deserialisation.

@lwfitzgerald lwfitzgerald force-pushed the fix/fsspec-cache-thread-safety branch 4 times, most recently from 0450687 to 308dee2 Compare September 22, 2025 14:27
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM Thanks for the detailed PR + test :)

`FsspecFileIO.get_fs` can be called by multiple threads when `ExecutorFactory`
is used (for example by `DataScan.plan_files`).

The base class of `fsspec` filesystem objects, `fsspec.spec.AbstractFileSystem`,
internally caches instances through the `fsspec.spec._Cached` metaclass.
The caching key used includes `threading.get_ident()`, making entries
thread-local:
https://github.com/fsspec/filesystem_spec/blame/f84b99f0d1f079f990db1a219b74df66ab3e7160/fsspec/spec.py#L71

The `FsspecFileIO.get_fs` LRU cache (around `FsspecFileIO._get_fs`) breaks
the thread-locality of the filesystem instances as it will return the
same instance for different threads.

One consequence of this is that for `s3fs.S3FileSystem`, HTTP connection
pooling no longer occurs per thread (as is normal with `aiobotocore`), as
the `aiobotocore` client object (containing the `aiohttp.ClientSession`)
is stored on the `s3fs.S3FileSystem`.

This change addresses this by making the `FsspecFileIO.get_fs` cache
thread-local.
@lwfitzgerald lwfitzgerald force-pushed the fix/fsspec-cache-thread-safety branch from 308dee2 to 9d2354d Compare September 22, 2025 14:57
@lwfitzgerald
Copy link
Contributor Author

Removed some unnecessary events ^

lwfitzgerald added a commit to man-group/iceberg-python that referenced this pull request Sep 22, 2025
The existing S3 remote signing hook function (`s3v4_rest_signer`) uses
`requests.post` to submit `POST` requests to the REST signing endpoint,
but this internally creates a new `requests.Session` for every request,
preventing any reuse of connections.

In my profiling I saw this add overhead from repeated loading of CA certs
and reestablishing of TLS connections.

This change makes the signing function a callable object that wraps a
`request.Session`, using this for `POST`ing, therefore achieving
connection reuse.

Signer callables are stored on the hook internals of the `aiobotocore`
client inside the `s3fs.S3FileSystem` instance, so use and lifetime will
match that of those instances.

The `s3fs.S3FileSystem` instances are guaranteed thread-local since:
apache#2495.
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lwfitzgerald for working on this 🙌 This looks great!

@Fokko Fokko merged commit 6b07b4a into apache:main Sep 22, 2025
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants