Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fallback to containerd if we are unable to fetch SOCI artifacts #1035

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

turan18
Copy link
Contributor

@turan18 turan18 commented Jan 16, 2024

If we know that we cannot lazy load an image (eg: SOCI index does not exist for an image), we should fallback to the underlying runtime to do the fetching/unpacking of all layers.

Issue #, if available:

Fixes: #1034

Description of changes:

Testing performed:

make test && make integration

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

If we know that we cannot lazy load an image (eg: SOCI index does not
exist for an image), we should fallback to the underlying runtime to
do the fetching/unpacking of all layers.

Signed-off-by: Yasin Turan <turyasin@amazon.com>
@turan18 turan18 marked this pull request as ready for review January 16, 2024 21:25
@turan18 turan18 requested a review from a team as a code owner January 16, 2024 21:25
fs/fs.go Show resolved Hide resolved
Copy link
Contributor

@sondavidb sondavidb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, from what I understand since we currently just use all local mounts in this use case, it makes sense to defer to container runtime.

Definitely not in the scope of this PR, but wonder if we'd ever want behavior where we would want a toggle to attempt to auto-generate a SOCI index if one isn't found. We need a SOCI index to do any work so it might be something nice to have. Just a thought for now though.

fs/fs.go Show resolved Hide resolved
@dims
Copy link
Member

dims commented Jan 20, 2024

This is probably useful even when an image has been already downloaded before soci-snapshotter was enabled, right?

Copy link
Contributor

@austinvazquez austinvazquez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the changes LGTM. A few questions I had:

  1. Are we able to mock the ErrUnableToLazyLoad returned from the filesystem mount in snapshotter unit tests to validate a fallback occurs?
  2. Wondering if we should emit a metric on this case where SOCI artifacts exist but we failed to pull them. My thought is it might not be worth tracking because we have the log. Curious if you had some thoughts on it.
  3. Just want to validate an integration test here is difficult because we'd need a method for pulling an image with SOCI artifacts but an error occurs while pulling. e.g. network error.

@sondavidb
Copy link
Contributor

Just want to validate an integration test here is difficult because we'd need a method for pulling an image with SOCI artifacts but an error occurs while pulling. e.g. network error.

We do a similar thing in TestNetworkRetry, where we pull an image from a repo, run the image with SOCI, attempt to do run a command, then block network access to the registry and try again. TestNetworkRetry specifically is not implemented correctly, though, and I'm not sure how we would simulate this, but the idea is there.

Does this need to be an integration test though? Is there any way we can make it a unit test, e.g. by calling Mount with a bunch of local mounts?

Copy link
Contributor

@Kern-- Kern-- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Prepare is in need of some rethinking/refactoring to make it more clear what the logic actually is and how that logic appears in logs.

For example, we now have both skipLazyLoadingImage and o.skipRemoteSnapshotPrepare. If we skip lazy loading an image, we will log "failed to prepare remote snapshot" and "failed to prepare snapshot; deferring to container runtime" for every layer which doesn't really convey what's actually going on.

@sondavidb
Copy link
Contributor

Should we also fall back if a bad SOCI index digest is passed in? Right now, pulling while specifying a bad index digest ends up having a similar issue, so unless we want to switch between SOCI indexes for different layers of the same image (which I really don't think we do), it might be a nice thing to add.

@turan18
Copy link
Contributor Author

turan18 commented Feb 5, 2024

This is probably useful even when an image has been already downloaded before soci-snapshotter was enabled, right?

@dims Sorry for the delay. All this change will do is ensure that we defer back to containerd earlier when we know we can't lazy load the image. If the image has already been downloaded and the blob content exists in the containerd content store, this will ensure that SOCI won't blindly attempt to sequentially fetch each layer again.

@turan18
Copy link
Contributor Author

turan18 commented Feb 5, 2024

Should we also fall back if a bad SOCI index digest is passed in? Right now, pulling while specifying a bad index digest ends up having a similar issue, so unless we want to switch between SOCI indexes for different layers of the same image (which I really don't think we do), it might be a nice thing to add.

We defer back to containerd anytime we can't fetch SOCI artifacts, whether it's because they don't exist (a bad digest) or if there were network issues.

@turan18
Copy link
Contributor Author

turan18 commented Feb 27, 2024

TODO:

  1. Do not emit mount failure metric if the err is no referrers (eg: the index does not exist)
  2. Add integration test

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] Fallback to containerd if we cannot lazy load image
5 participants