
google.auth.exceptions.RefreshError with excessive concurrent requests. #71

Open
asford opened this issue Feb 14, 2018 · 5 comments

@asford
Collaborator

asford commented Feb 14, 2018

gcsfs propagates a google.auth.exceptions.RefreshError when executing many concurrent requests from a single node using the google_default credentials class. This is likely due to an excessive number of repeated requests to the internal metadata service, a known bug in the external library tracked at googleapis/google-auth-library-python#211.

Anecdotally, I've primarily observed this in dask.distributed workers and believe it may occur due to the way GCSFile objects are distributed. It primarily occurs when a large number of small files are being read from storage and many worker threads are performing concurrent reads. I believe each GCSFile serialized in a dask task then instantiates a separate GCSFileSystem, resolves credentials, and opens a session.

If this is the case, it would be preferable to store a fixed set of AuthenticatedSession handles, ideally via a cache on the GCSFileSystem class, and dispatch to an auth-method-specific session in the GCSFileSystem._connect_* connection functions.

As a more specific solution, google.auth.exceptions.RefreshError (or its base class) could be added to the retrying exception list in _call; however, this may mask legitimate authentication errors. The credentials should therefore probably be "tested" during session initialization via some call that does not retry on this error. This may be as simple as calling session.credentials.refresh or performing a single authenticated request after session initialization.
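
As a rough sketch of the shape this could take (the retry loop and the RETRIABLE_EXCEPTIONS tuple below are illustrative assumptions, not the actual _call implementation):

import random
import time

import google.auth
import google.auth.exceptions
import google.auth.transport.requests
import requests.exceptions

# Hypothetical retryable-exception list; the real _call keeps its own.
RETRIABLE_EXCEPTIONS = (
    requests.exceptions.ConnectionError,
    requests.exceptions.Timeout,
    google.auth.exceptions.RefreshError,  # proposed addition
)

def call_with_retries(func, *args, retries=5, **kwargs):
    # Retry transient failures, including metadata-server refresh errors.
    for attempt in range(retries):
        try:
            return func(*args, **kwargs)
        except RETRIABLE_EXCEPTIONS:
            if attempt == retries - 1:
                raise
            time.sleep(0.1 * 2 ** attempt + random.random() * 0.1)

# "Test" the credentials once at session setup, outside the retry path, so a
# genuinely invalid credential fails fast instead of being retried.
credentials, _ = google.auth.default()
credentials.refresh(google.auth.transport.requests.Request())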

asford added a commit to asford/gcsfs that referenced this issue Feb 14, 2018
Retry on requests failing due to `google.auth.exceptions.RefreshError`,
partial resolution of fsspec#71.
@martindurant
Member

When the scheduler farms out tasks to workers, they indeed all individually authenticate themselves (unless you explicitly pass a token, which is unsafe and generally still needs to be refreshed). This should only happen the first time, after which the existing instance should be reused. I am including this for information only.

@martindurant
Member

Can you think of a case where RefreshError should not lead to a retry? I am thinking that this is not the error you would get if the call is processed completely by the server but the auth details are somehow invalid.

@asford
Collaborator Author

asford commented Feb 14, 2018

I haven't had a chance to read through the google.auth documentation or code deeply enough to understand whether this error is thrown for invalid refresh requests or only for temporary errors during token refresh. This would probably require some grepping.

@martindurant to your comment above, do you mean "this should only happen the first time" as in "this is the currently implemented behavior of the system" or "this behavior should be implemented"? I believe your comment in pangeo-data/pangeo#112, using __setstate__ to detect whether a cached session can be used, is the right starting place, though it probably requires some logic updates to make sure the correct auth method is being used.
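
For what it's worth, here is a minimal sketch of that idea (CachingFileSystem, _connect, and the parameter names are purely illustrative, not gcsfs internals):

import threading

import requests

class CachingFileSystem:
    # Illustrative only: share one authenticated session per set of auth parameters.
    _sessions = {}
    _lock = threading.Lock()

    def __init__(self, project, token=None):
        self.project = project
        self.token = token
        self.session = self._get_session()

    def _connect(self):
        # Stand-in for gcsfs's credential resolution; a real implementation
        # would dispatch to the appropriate _connect_* method here.
        return requests.Session()

    def _get_session(self):
        key = (self.project, self.token)
        with self._lock:
            if key not in self._sessions:
                self._sessions[key] = self._connect()
            return self._sessions[key]

    def __getstate__(self):
        # The live session is not picklable; drop it before serialization.
        state = self.__dict__.copy()
        state.pop("session", None)
        return state

    def __setstate__(self, state):
        # When dask unpickles this object on a worker, reuse the cached
        # session for these parameters instead of authenticating again.
        self.__dict__.update(state)
        self.session = self._get_session()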

@skeller88

skeller88 commented Oct 21, 2019

Is gcsfs thread-safe? A dask worker could be running multiple threads. For example:

import dask
import gcsfs
import imageio

fs = gcsfs.GCSFileSystem(project='project_name')

def read_from_gcs(filename):
    r = fs.cat(filename)  # bytes of the object in GCS
    return imageio.core.asarray(imageio.imread(r, 'TIFF'))

delayed_read = dask.delayed(read_from_gcs, pure=True)

@martindurant
Member

martindurant commented Oct 21, 2019

gcsfs is used routinely with Dask, but does not guarantee thread-safety. Specifically, if you use the same set of parameters when instantiating (which would be true for your example), only one instance is created and shared, so only one auth request is sent. However, the underlying requests library is almost, but not entirely, thread-safe: apparently it is possible for connections to be dropped if a pool fills up, but that case seems very unlikely in this kind of use (and should be covered by internal retries).

Directory listings could also potentially fall out of sync, but the code aggressively purges the cache when writing, and in the dask scenario, listings are usually done just once in the client.
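
For example, this is roughly what the instance sharing looks like in practice (exact caching behaviour may vary with the gcsfs version):

import gcsfs

fs1 = gcsfs.GCSFileSystem(project='project_name')
fs2 = gcsfs.GCSFileSystem(project='project_name')

# Same parameters -> the same shared instance, so only one auth request is made.
assert fs1 is fs2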
