Description
Hi SOCI-Snapshotter maintainers,
I'm reporting a buggy interaction between the soci-snapshotter `fs/remote/resolver.go` code that resolves remote resources and the containerd `remotes/docker/authorizer.go` code that authenticates remote container repository requests. The code in `authorizer.go` doesn't keep track of OAuth token expiration times; instead, it relies on the caller to report failed requests via `AddResponses`, which it uses to decide when to invalidate OAuth tokens. That logic lives in `invalidAuthorization`, which invalidates a token only if there is an "error" in the auth challenge and the two most recently provided HTTP responses are for the same HTTP method and URL. In other words, the client has to receive two successive authorization failures for the same URL, and only then will the authorizer refresh its cached auth token.
However, the soci-snapshotter code that provides HTTP responses to the authorizer only ever provides the most recent HTTP response, so the logic above is never exercised. Instead, it always hits the `n == 1` condition in `invalidAuthorization`. This means that once a remote resolver is created, the OAuth bearer tokens used for remote authorization will never be refreshed, and if/when they expire, parts of the container image that haven't been fetched yet can never be retrieved by the snapshotter. We encountered this bug running the snapshotter in production at BuildBuddy: for container images streamed to remote executors, if the test running in the container ran slowly enough, parts of the image became unretrievable, causing Input/output errors in those containers. We suspect image streaming exacerbates the issue because the lifetime of an image pull is much longer during streaming than during a conventional image pull.
I've opened a PR in containerd to track token expiration times and invalidate tokens that have expired, but even if that's merged there are still cases where this can happen. Specifically, if the remote revokes the access token before its expiration time, or if there's a race between local and remote token invalidation (e.g. the client issues a request one nanosecond before the token expires), then the snapshotter could still get into this state, though it's much less likely. If the containerd authorizer does indeed expect two successive authorization failures (I'm not sure why one is insufficient; I'll ask), then the snapshotter's remote resolver should retry remote requests twice instead of just once in `RoundTrip` and provide both responses to the authorizer.
I'm planning to make this change in our forked version of the snapshotter; let me know if you'd like me to push it upstream as well.
Thanks!
Steps to reproduce the bug
This issue is more reproducible with images with several large layers, because there are more likely to be portions of one or more layers that aren't accessed during container startup. We observed it with `docker://pivotalrabbitmq/rabbitmq-server-buildenv@sha256:d8d23b68607129345df9b65e9d2be97e49d37bf7daf503548a65e02a46ed6d58`. Here are repro steps using that image:
Pull the image, index it, and push the artifacts to your local registry following the instructions in getting-started.
Clear your local machine's state so the image must be retrieved remotely.
Start an interactive terminal in a container streamed from your local registry.
Run `sleep x`, where `x` is at least your local registry's OAuth bearer token `expires_in` value (in seconds).
Run a command that will pull parts of the image that haven't been pulled yet, e.g. `ls -laR / > /dev/null` or `grep -R a / > /dev/null`.
Note the Input/output errors; these are due to the token-expiration failures I've described above.
Sorry I can't provide more specific instructions; we use the snapshotter with Podman, for which I have a very reliably reproducible case, but I haven't tried with containerd.
Describe the results you expected
When the snapshotter streams container images to containers that run for longer than the remote repository's OAuth bearer token `expires_in` value, the snapshotter should refresh its OAuth bearer token so the container image is still retrievable after the original token has expired.
Host information
OS: Ubuntu 22.04
Snapshotter Version: Head
Containerd Version: 1.7.1
Any additional context or information about the bug
No response
In addition to fixing the current flow, I think we want to take a look at the reauthorization flow and make sure it plays well with the redirect work we've been looking at. I think that means that we'd like to take the patch, but keep this open to look a little deeper at this code path and see if there are other things that we need to do.