
[Bug] OAuth Bearer tokens aren't refreshed correctly #686

Closed
iain-macdonald opened this issue Jun 23, 2023 · 4 comments

Labels
bug Something isn't working

Comments

@iain-macdonald
Contributor

Description

Hi SOCI-Snapshotter maintainers,

I'm reporting a buggy interaction between the soci-snapshotter fs/remote/resolver.go code that resolves remote resources and the containerd remotes/docker/authorizer.go code that authenticates remote container repository requests. The code in authorizer.go doesn't keep track of OAuth token expiration times; it relies on the caller to report failed requests via AddResponses, which it uses to determine when to invalidate OAuth tokens. The logic for doing that lives in invalidAuthorization, which invalidates tokens only if there is an "error" in the auth challenge and the two most recently provided HTTP Responses are for the same HTTP method and URL. In other words, the client has to receive two successive authorization failures for the same URL, and only then will the authorizer refresh its cached auth token.
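
For reference, that check behaves roughly like the paraphrase below. This is a simplified sketch based on my reading of authorizer.go, not the actual containerd source, so names and details may differ:

```go
package docker

import "net/http"

// Simplified paraphrase of the invalidation check described above: a cached
// token is treated as invalid only when the auth challenge carries an "error"
// AND the two most recent responses were failures for the same method and URL.
func invalidAuthorization(challengeError string, responses []*http.Response) bool {
	if challengeError == "" {
		return false
	}
	n := len(responses)
	// With a single response (the n == 1 case), or with two responses for
	// different requests, the cached token is still considered valid.
	if n == 1 || (n > 1 && !sameRequest(responses[n-2].Request, responses[n-1].Request)) {
		return false
	}
	return true
}

func sameRequest(a, b *http.Request) bool {
	return a.Method == b.Method && a.URL.String() == b.URL.String()
}
```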

However, the soci-snapshotter code that provides HTTP Responses to the authorizer only ever provides the most recent HTTP Response, so the logic above is never exercised; it always hits the n == 1 condition in invalidAuthorization. This means that once a remote resolver is created, the OAuth bearer tokens used for remote authorization are never refreshed, and if/when they expire, parts of the container image that haven't yet been fetched can never be retrieved by the snapshotter. We encountered this bug running the snapshotter in production at BuildBuddy: for container images streamed to remote executors, if the test running in the container ran slowly enough, parts of the image became unretrievable, causing Input/output errors in those containers. We suspect this issue is exacerbated by image streaming because the lifetime of an image pull is much longer during streaming than during a conventional image pull operation.

I've opened a PR in containerd to track token expiration times and invalidate tokens that have expired, but even if that's merged there are still cases where this can happen. Specifically, if the remote revokes the access token before its expiration time, or if there's a race between local and remote token invalidation (e.g. the client issues a request one nanosecond before the token expires), the snapshotter could theoretically still get into this state, though it's much less likely. If the containerd authorizer does indeed expect two successive authorization failures (I'm not sure why one isn't sufficient; I'll ask), then the snapshotter's remote resolver should retry remote requests twice instead of just once in RoundTrip and provide both responses to the authorizer.
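
For illustration, here's a rough sketch of what that two-retry flow could look like in a RoundTripper. The wrapper type and field names are hypothetical (this is not the actual resolver.go code); it only assumes containerd's docker.Authorizer interface (Authorize/AddResponses) and idempotent GET requests:

```go
package remote

import (
	"net/http"

	"github.com/containerd/containerd/remotes/docker"
)

// authRoundTripper is a hypothetical wrapper sketching the proposed fix: on an
// authorization failure it retries the request and hands the authorizer the
// full list of failed responses, so two consecutive failures for the same URL
// can be observed and the cached token refreshed.
type authRoundTripper struct {
	inner http.RoundTripper
	auth  docker.Authorizer
}

func (t *authRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	var failed []*http.Response
	var resp *http.Response
	// One initial attempt plus two authorized retries, per the suggestion above.
	for attempt := 0; attempt < 3; attempt++ {
		if err := t.auth.Authorize(req.Context(), req); err != nil {
			return nil, err
		}
		var err error
		resp, err = t.inner.RoundTrip(req)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusUnauthorized {
			return resp, nil
		}
		// Accumulate every failed response instead of only the most recent one,
		// so the authorizer's invalidation heuristic is actually exercised.
		failed = append(failed, resp)
		if err := t.auth.AddResponses(req.Context(), failed); err != nil {
			// The authorizer can't recover (e.g. the challenge carried an
			// error); give up and return the failed response.
			return resp, nil
		}
		if attempt < 2 {
			resp.Body.Close() // not returning this response; retry with a fresh token
		}
	}
	return resp, nil
}
```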

I'm planning to make this change in our forked version of the snapshotter, let me know if you'd like me to push this change upstream as well.

Thanks!

Steps to reproduce the bug

This issue is easier to reproduce with images that have several large layers, because those are more likely to contain portions of one or more layers that aren't accessed during container startup. We observed it with docker://pivotalrabbitmq/rabbitmq-server-buildenv@sha256:d8d23b68607129345df9b65e9d2be97e49d37bf7daf503548a65e02a46ed6d58. Here are repro steps using that image:

  1. Pull the image, index it, and push the artifacts to your local registry following the instructions in getting-started.
  2. Clear your local machine's state so the image must be retrieved remotely.
  3. Start an interactive terminal in a container streamed from your local registry.
  4. Run sleep x, where x is your local registry's OAuth bearer token expires_in value (one way to check this is sketched just after these steps).
  5. Run a command that will pull parts of the image that haven't been pulled yet, e.g. ls -laR / > /dev/null or grep -R a / > /dev/null
  6. Note the Input/output errors; these are due to the authorization failures described above.
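
For step 4, one way to see the expires_in your registry hands out is to query its token endpoint directly. This is purely an illustrative sketch: the endpoint URL, service, and scope below are placeholders for whatever your local registry advertises in the WWW-Authenticate header of an unauthenticated request.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Fetches a token from a registry token endpoint and prints its expires_in,
// which is how long to sleep in step 4. The URL, service, and scope are
// placeholders; substitute the values your local registry advertises.
func main() {
	resp, err := http.Get("http://localhost:5000/token?service=localhost:5000&scope=repository:rabbitmq-server-buildenv:pull")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var token struct {
		ExpiresIn int `json:"expires_in"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&token); err != nil {
		panic(err)
	}
	fmt.Printf("token expires_in: %d seconds\n", token.ExpiresIn)
}
```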

Sorry I can't provide more specific instructions; we use the snapshotter with Podman, for which I have a very reliable reproduction, but I haven't tried with containerd.

Describe the results you expected

When the snapshotter streams container images to containers running for longer than the remote repository's OAuth bearer token expires_in value, the snapshotter should refresh its OAuth bearer token so the container image is still retrievable after the original OAuth token has expired.

Host information

  1. OS: Ubuntu 22.04
  2. Snapshotter Version: Head
  3. Containerd Version: 1.7.1

Any additional context or information about the bug

No response

@Kern--
Contributor

Kern-- commented Jun 27, 2023

Thanks for the report. We would definitely be interested in this fix (and it probably also affects stargz).

We can look into writing an integration test to verify it too.

@Kern--
Contributor

Kern-- commented Jun 30, 2023

In addition to fixing the current flow, I think we want to take a look at the reauthorization flow and make sure it plays well with the redirect work we've been looking at. I think that means that we'd like to take the patch, but keep this open to look a little deeper at this code path and see if there are other things that we need to do.

@Kern--
Contributor

Kern-- commented Jul 31, 2023

Hey @iain-macdonald. We would like to merge this fix. Would you be interested in sending the change from your fork as a squashed commit with a DCO?

@sparr
Contributor

sparr commented Aug 9, 2023

PR merged

@sparr sparr closed this as completed Aug 9, 2023