-
Notifications
You must be signed in to change notification settings - Fork 359
Fix FsspecFileIO.get_fs
thread safety
#2495
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of the local self._thread_locals
variable ensure the lru_cache is distinct to each thread. This enables per-thread connection pooling of HTTP requests. This has been correctly removed and added back upon serialisation/deserialisation.
0450687
to
308dee2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM Thanks for the detailed PR + test :)
`FsspecFileIO.get_fs` can be called by multiple threads when `ExecutorFactory` is used (for example by `DataScan.plan_files`). The base class of `fsspec` filesystem objects, `fsspec.spec.AbstractFileSystem`, internally caches instances through the `fsspec.spec._Cached` metaclass. The caching key used includes `threading.get_ident()`, making entries thread-local: https://github.com/fsspec/filesystem_spec/blame/f84b99f0d1f079f990db1a219b74df66ab3e7160/fsspec/spec.py#L71 The `FsspecFileIO.get_fs` LRU cache (around `FsspecFileIO._get_fs`) breaks the thread-locality of the filesystem instances as it will return the same instance for different threads. One consequence of this is that for `s3fs.S3FileSystem`, HTTP connection pooling no longer occurs per thread (as is normal with `aiobotocore`), as the `aiobotocore` client object (containing the `aiohttp.ClientSession`) is stored on the `s3fs.S3FileSystem`. This change addresses this by making the `FsspecFileIO.get_fs` cache thread-local.
308dee2
to
9d2354d
Compare
Removed some unnecessary events ^ |
The existing S3 remote signing hook function (`s3v4_rest_signer`) uses `requests.post` to submit `POST` requests to the REST signing endpoint, but this internally creates a new `requests.Session` for every request, preventing any reuse of connections. In my profiling I saw this add overhead from repeated loading of CA certs and reestablishing of TLS connections. This change makes the signing function a callable object that wraps a `request.Session`, using this for `POST`ing, therefore achieving connection reuse. Signer callables are stored on the hook internals of the `aiobotocore` client inside the `s3fs.S3FileSystem` instance, so use and lifetime will match that of those instances. The `s3fs.S3FileSystem` instances are guaranteed thread-local since: apache#2495.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @lwfitzgerald for working on this 🙌 This looks great!
Rationale for this change
FsspecFileIO.get_fs
can be called by multiple threads whenExecutorFactory
is used (for example byDataScan.plan_files
).The base class of
fsspec
filesystem objects,fsspec.spec.AbstractFileSystem
, internally caches instances through thefsspec.spec._Cached
metaclass. The caching key used includesthreading.get_ident()
, making entries thread-local:https://github.com/fsspec/filesystem_spec/blob/f84b99f0d1f079f990db1a219b74df66ab3e7160/fsspec/spec.py#L71
The
FsspecFileIO.get_fs
LRU cache (aroundFsspecFileIO._get_fs
) breaks the thread-locality of the filesystem instances as it will return the same instance for different threads.One consequence of this is that for
s3fs.S3FileSystem
, HTTP connection pooling no longer occurs per thread (as is normal withaiobotocore
), as theaiobotocore
client object (containing theaiohttp.ClientSession
) is stored on thes3fs.S3FileSystem
.This change addresses this by making the
FsspecFileIO.get_fs
cache thread-local.Are these changes tested?
Tested locally. Unit test included covering the caching behaviour.
Are there any user-facing changes?
Yes - S3 HTTP connection pooling now occurs per-thread, matching the behaviour of
aiobotocore
when it used in the recommended way with an event loop per thread.