
[Bug] OAuth Bearer tokens aren't refreshed correctly #686

Closed
iain-macdonald opened this issue Jun 23, 2023 · 4 comments

Labels
bug Something isn't working

Comments

@iain-macdonald
Contributor

Description

Hi SOCI-Snapshotter maintainers,

I'm reporting a buggy interaction between the soci-snapshotter fs/remote/resolver.go code that resolves remote resources and the containerd remotes/docker/authorizer.go code that authenticates remote container repository requests. The code in authorizer.go doesn't keep track of OAuth token expiration times; it relies on the caller to report failed requests via AddResponses, which it uses to determine when to invalidate OAuth tokens. The logic for doing that lives in invalidAuthorization, which invalidates tokens only if there is an "error" in the auth challenge and the two most recently provided HTTP Responses are for the same HTTP method and URL. In other words, the client has to receive two successive authorization failures for the same URL, and only then will the authorizer refresh its cached auth token.
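
For reference, that check behaves roughly like the paraphrase below. This is a simplified sketch based on my reading of authorizer.go, not the actual containerd source, so names and details may differ:

```go
package docker

import "net/http"

// Simplified paraphrase of the invalidation check described above: a cached
// token is treated as invalid only when the auth challenge carries an "error"
// AND the two most recent responses were failures for the same method and URL.
func invalidAuthorization(challengeError string, responses []*http.Response) bool {
	if challengeError == "" {
		return false
	}
	n := len(responses)
	// With a single response (the n == 1 case), or with two responses for
	// different requests, the cached token is still considered valid.
	if n == 1 || (n > 1 && !sameRequest(responses[n-2].Request, responses[n-1].Request)) {
		return false
	}
	return true
}

func sameRequest(a, b *http.Request) bool {
	return a.Method == b.Method && a.URL.String() == b.URL.String()
}
```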

However, the soci-snapshotter code that provides HTTP Responses to the authorizer only ever provides the most recent HTTP Response, so the logic above is never exercised; it always hits the n == 1 condition in invalidAuthorization. This means that once a remote resolver is created, the OAuth bearer tokens used for remote authorization are never refreshed, and if/when they expire, parts of the container image that haven't yet been fetched can never be retrieved by the snapshotter. We encountered this bug running the snapshotter in production at BuildBuddy: for container images streamed to remote executors, if the test running in the container ran slowly enough, parts of the image became unretrievable, causing Input/output errors in those containers. We suspect this issue is exacerbated by image streaming because the lifetime of an image pull is much longer during streaming than during a conventional image pull operation.

I've opened a PR in containerd to track token expiration times and invalidate tokens that have expired, but even if that's merged there are still cases where this can happen. Specifically, if the remote revokes the access token before its expiration time, or if there's a race between local and remote token invalidation (e.g. the client issues a request one nanosecond before the token expires), the snapshotter could theoretically still get into this state, though it's much less likely. If the containerd authorizer does indeed expect two successive authorization failures (I'm not sure why one isn't sufficient; I'll ask), then the snapshotter's remote resolver should retry remote requests twice instead of just once in RoundTrip and provide both responses to the authorizer.
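
For illustration, here's a rough sketch of what that two-retry flow could look like in a RoundTripper. The wrapper type and field names are hypothetical (this is not the actual resolver.go code); it only assumes containerd's docker.Authorizer interface (Authorize/AddResponses) and idempotent GET requests:

```go
package remote

import (
	"net/http"

	"github.com/containerd/containerd/remotes/docker"
)

// authRoundTripper is a hypothetical wrapper sketching the proposed fix: on an
// authorization failure it retries the request and hands the authorizer the
// full list of failed responses, so two consecutive failures for the same URL
// can be observed and the cached token refreshed.
type authRoundTripper struct {
	inner http.RoundTripper
	auth  docker.Authorizer
}

func (t *authRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	var failed []*http.Response
	var resp *http.Response
	// One initial attempt plus two authorized retries, per the suggestion above.
	for attempt := 0; attempt < 3; attempt++ {
		if err := t.auth.Authorize(req.Context(), req); err != nil {
			return nil, err
		}
		var err error
		resp, err = t.inner.RoundTrip(req)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusUnauthorized {
			return resp, nil
		}
		// Accumulate every failed response instead of only the most recent one,
		// so the authorizer's invalidation heuristic is actually exercised.
		failed = append(failed, resp)
		if err := t.auth.AddResponses(req.Context(), failed); err != nil {
			// The authorizer can't recover (e.g. the challenge carried an
			// error); give up and return the failed response.
			return resp, nil
		}
		if attempt < 2 {
			resp.Body.Close() // not returning this response; retry with a fresh token
		}
	}
	return resp, nil
}
```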

I'm planning to make this change in our forked version of the snapshotter, let me know if you'd like me to push this change upstream as well.

Thanks!

Steps to reproduce the bug

This issue is easier to reproduce with images that have several large layers, because those are more likely to contain portions of one or more layers that aren't accessed during container startup. We observed it with docker://pivotalrabbitmq/rabbitmq-server-buildenv@sha256:d8d23b68607129345df9b65e9d2be97e49d37bf7daf503548a65e02a46ed6d58. Here are repro steps using that image:

  1. Pull the image, index it, and push the artifacts to your local registry following the instructions in getting-started.
  2. Clear your local machine's state so the image must be retrieved remotely.
  3. Start an interactive terminal in a container streamed from your local registry.
  4. Run sleep x, where x is your local registry's OAuth bearer token expires_in value (one way to check this is sketched just after these steps).
  5. Run a command that will pull parts of the image that haven't been pulled yet, e.g. ls -laR / > /dev/null or grep -R a / > /dev/null
  6. Note the Input/output errors; these are due to the authorization failures described above.
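
For step 4, one way to see the expires_in your registry hands out is to query its token endpoint directly. This is purely an illustrative sketch: the endpoint URL, service, and scope below are placeholders for whatever your local registry advertises in the WWW-Authenticate header of an unauthenticated request.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Fetches a token from a registry token endpoint and prints its expires_in,
// which is how long to sleep in step 4. The URL, service, and scope are
// placeholders; substitute the values your local registry advertises.
func main() {
	resp, err := http.Get("http://localhost:5000/token?service=localhost:5000&scope=repository:rabbitmq-server-buildenv:pull")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var token struct {
		ExpiresIn int `json:"expires_in"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&token); err != nil {
		panic(err)
	}
	fmt.Printf("token expires_in: %d seconds\n", token.ExpiresIn)
}
```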

Sorry I can't provide more specific instructions; we use the snapshotter with Podman, for which I have a very reliable reproduction, but I haven't tried with containerd.

Describe the results you expected

When the snapshotter streams container images to containers running for longer than the remote repository's OAuth bearer token expires_in value, the snapshotter should refresh its OAuth bearer token so the container image is still retrievable after the original OAuth token has expired.

Host information

  1. OS: Ubuntu 22.04
  2. Snapshotter Version: Head
  3. Containerd Version: 1.7.1

Any additional context or information about the bug

No response

@Kern--
Contributor

Kern-- commented Jun 27, 2023

Thanks for the report. We would definitely be interested in this fix (and it probably also affects stargz).

We can look into writing an integration test to verify it too.

@Kern--
Contributor

Kern-- commented Jun 30, 2023

In addition to fixing the current flow, I think we want to take a look at the reauthorization flow and make sure it plays well with the redirect work we've been looking at. I think that means that we'd like to take the patch, but keep this open to look a little deeper at this code path and see if there are other things that we need to do.

@Kern--
Contributor

Kern-- commented Jul 31, 2023

Hey @iain-macdonald. We would like to merge this fix. Would you be interested in sending the change from your fork as a squashed commit with a DCO?

@sparr
Contributor

sparr commented Aug 9, 2023

PR merged

@sparr sparr closed this as completed Aug 9, 2023