
fs: Check connection only when image isn't fully cached #1584

Merged (1 commit) Feb 28, 2024

Conversation

@ktock (Member) commented Feb 23, 2024

Fixes: #1583

When the layer is fully cached on the node, no registry connection is made, so the connectivity check can be skipped.
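The behavior change can be sketched as a simple guard (a minimal shell illustration with invented names; the actual change lives in the snapshotter's Go code):

```shell
#!/bin/sh
# Illustrative sketch only: a layer that is fully cached locally never
# triggers a remote fetch, so the registry connectivity check is skipped.

# Pretend cache lookup: a layer counts as fully cached if a marker exists.
layer_fully_cached() {
    [ -e "$1.cached" ]
}

# Stand-in for the real connectivity check against the registry.
check_connection() {
    echo "checking registry connection for $1"
}

verify_layer() {
    if layer_fully_cached "$1"; then
        echo "skipping connection check for $1 (fully cached)"
    else
        check_connection "$1"
    fi
}

touch layer-a.cached
verify_layer layer-a    # cached: no registry access needed
verify_layer layer-b    # not cached: connectivity is still checked
rm -f layer-a.cached
```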

Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
@AkihiroSuda AkihiroSuda merged commit 803d4f2 into containerd:main Feb 28, 2024
26 checks passed
@ktock ktock deleted the allow-skip-checking branch February 28, 2024 02:15
@jonathanbeber

is any release planned that would include the changes in this PR?

@ktock (Member, Author) commented Mar 11, 2024

Will release it after several ongoing issues are fixed (e.g. #1600, #1599 ...), maybe this week or next.

@ElenaHenderson commented Apr 24, 2024

Hello @ktock,

Will you please confirm whether this fix also applies to these two scenarios, where stargz-snapshotter is restarted during a node reboot?

Once this fix is released, will new pods using stargz images that are already on the nodes (pulled before the stargz-snapshotter restart) be able to deploy?

Scenario 1 > EKS > Stargz image is pulled from private registry with creds that do not expire

  1. Enable stargz-snapshotter on EKS nodes as described in #1107 (comment) (lazy pull from private registry)

  2. Create an image pull secret for the private registry:

kubectl create secret docker-registry registry-creds \
  --docker-server=$PRIVATE_REGISTRY_SERVER \
  --docker-username=$PRIVATE_REGISTRY_USERNAME \
  --docker-password=$PRIVATE_REGISTRY_PASSWORD
  3. Deploy a pod with the stargz image:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: stargz-app
spec:
  imagePullSecrets:
    - name: registry-creds
  containers:
    - name: app
      image: $PRIVATE_REGISTRY_SERVER/estargz:1
EOF

Note that the stargz-app pod deploys without any image pull issues.

  4. Delete the pod:
kubectl delete pod stargz-app
  5. Reboot the EC2 instance or restart stargz-snapshotter on the node:
systemctl restart stargz-snapshotter
Job for stargz-snapshotter.service failed because the control process exited with error code. See "systemctl status stargz-snapshotter.service" and "journalctl -xe" for details.
journalctl -u stargz-snapshotter -l

Apr 23 23:56:15 ip-10-0-2-130.ec2.internal containerd-stargz-grpc[14958]: {"level":"debug","msg":"Waiting for CRI service is started...","time":"2024-04-23T23:56:15.194866142Z"}
Apr 23 23:56:15 ip-10-0-2-130.ec2.internal containerd-stargz-grpc[14958]: {"level":"info","msg":"connected to backend CRI service","time":"2024-04-23T23:56:15.195422179Z"}
Apr 23 23:56:15 ip-10-0-2-130.ec2.internal containerd-stargz-grpc[14958]: {"level":"info","msg":"preparing filesystem mount at mountpoint=/var/lib/containerd-stargz-grpc/snapshotter/snapshots/119/fs","time":"2024-04-23T23:56:15.196665212Z"}
Apr 23 23:56:15 ip-10-0-2-130.ec2.internal containerd-stargz-grpc[14958]: {"level":"debug","mountpoint":"/var/lib/containerd-stargz-grpc/snapshotter/snapshots/119/fs","msg":"resolving","src":"private-regsitry/estargz:1/sha256:1b82fbeab8a04e8548e0708cdbf2ddc35edd3a5aff4ab77a161765d8935bca5a","time":"2024-04-23T23:56:15.196812462Z"}
Apr 23 23:56:15 ip-10-0-2-130.ec2.internal containerd-stargz-grpc[14958]: {"digest":"sha256:1b82fbeab8a04e8548e0708cdbf2ddc35edd3a5aff4ab77a161765d8935bca5a","error":null,"level":"debug","mountpoint":"/var/lib/containerd-stargz-grpc/snapshotter/snapshots/119/fs","msg":"using default handler","ref":"private-regsitry/estargz:1","src":"private-regsitry/estargz:1/sha256:1b82fbeab8a04e8548e0708cdbf2ddc35edd3a5aff4ab77a161765d8935bca5a","time":"2024-04-23T23:56:15.196899102Z"}
Apr 23 23:56:15 ip-10-0-2-130.ec2.internal containerd-stargz-grpc[14958]: {"level":"info","mountpoint":"/var/lib/containerd-stargz-grpc/snapshotter/snapshots/119/fs","msg":"Received status code: 401 Unauthorized. Refreshing creds...","src":"private-regsitry/estargz:1/sha256:1b82fbeab8a04e8548e0708cdbf2ddc35edd3a5aff4ab77a161765d8935bca5a","time":"2024-04-23T23:56:15.349907820Z"}
Apr 23 23:56:15 ip-10-0-2-130.ec2.internal containerd-stargz-grpc[14958]: {"error":"failed to resolve layer \"sha256:1b82fbeab8a04e8548e0708cdbf2ddc35edd3a5aff4ab77a161765d8935bca5a\" from \"private-regsitry/estargz:1\": failed to resolve the blob: failed to resolve the source: cannot resolve layer: failed to redirect (host \"gcr.io\", ref:\"private-regsitry/estargz:1\", digest:\"sha256:1b82fbeab8a04e8548e0708cdbf2ddc35edd3a5aff4ab77a161765d8935bca5a\"): failed to request: failed to fetch anonymous token: unexpected status from GET request to https://gcr.io/v2/token?scope=repository%3Aprivate-regsitry%2Festargz%3Apull\u0026scope=repository%3Aprivate-regsitry%2Fgcr.io%2Festargz%3Apull\u0026service=gcr.io: 403 Forbidden: failed to resolve: failed to resolve target","level":"debug","mountpoint":"/var/lib/containerd-stargz-grpc/snapshotter/snapshots/119/fs","msg":"failed to resolve layer","time":"2024-04-23T23:56:15.650207312Z"}
Apr 23 23:56:15 ip-10-0-2-130.ec2.internal containerd-stargz-grpc[14958]: {"error":"failed to restore remote snapshot: failed to prepare remote snapshot: sha256:19ac957b239fbbf329b0c303a6d3ab6425d96d1556475eb8c12093670f81366a: failed to resolve layer: failed to resolve layer \"sha256:1b82fbeab8a04e8548e0708cdbf2ddc35edd3a5aff4ab77a161765d8935bca5a\" from \"private-regsitry/estargz:1\": failed to resolve the blob: failed to resolve the source: cannot resolve layer: failed to redirect (host \"gcr.io\", ref:\"private-regsitry/estargz:1\", digest:\"sha256:1b82fbeab8a04e8548e0708cdbf2ddc35edd3a5aff4ab77a161765d8935bca5a\"): failed to request: failed to fetch anonymous token: unexpected status from GET request to https://gcr.io/v2/token?scope=repository%3Aprivate-regsitry%2Festargz%3Apull\u0026scope=repository%3Aprivate-regsitry%2Fgcr.io%2Festargz%3Apull\u0026service=gcr.io: 403 Forbidden: failed to resolve: failed to resolve target","level":"fatal","msg":"failed to create new snapshotter","time":"2024-04-23T23:56:15.650305520Z"}
Apr 23 23:56:15 ip-10-0-2-130.ec2.internal systemd[1]: stargz-snapshotter.service: main process exited, code=exited, status=1/FAILURE
Apr 23 23:56:15 ip-10-0-2-130.ec2.internal systemd[1]: Failed to start stargz snapshotter.
Apr 23 23:56:15 ip-10-0-2-130.ec2.internal systemd[1]: Unit stargz-snapshotter.service entered failed state.
Apr 23 23:56:15 ip-10-0-2-130.ec2.internal systemd[1]: stargz-snapshotter.service failed.
Apr 23 23:56:16 ip-10-0-2-130.ec2.internal systemd[1]: stargz-snapshotter.service holdoff time over, scheduling restart.
Apr 23 23:56:16 ip-10-0-2-130.ec2.internal systemd[1]: Stopped stargz snapshotter.
Apr 23 23:56:16 ip-10-0-2-130.ec2.internal systemd[1]: start request repeated too quickly for stargz-snapshotter.service
Apr 23 23:56:16 ip-10-0-2-130.ec2.internal systemd[1]: Failed to start stargz snapshotter.
Apr 23 23:56:16 ip-10-0-2-130.ec2.internal systemd[1]: Unit stargz-snapshotter.service entered failed state.
Apr 23 23:56:16 ip-10-0-2-130.ec2.internal systemd[1]: stargz-snapshotter.service failed.

Note that stargz-snapshotter would not start.

Scenario 2 > k3d > Stargz image is pulled from private registry with creds that do not expire

  1. Create a k3d cluster with stargz-snapshotter:
k3d cluster delete demo
k3d cluster create demo --k3s-arg='--snapshotter=stargz@server:*;agent:*'
  2. Create an image pull secret for the private registry:
kubectl create secret docker-registry registry-creds \
  --docker-server=$PRIVATE_REGISTRY_SERVER \
  --docker-username=$PRIVATE_REGISTRY_USERNAME \
  --docker-password=$PRIVATE_REGISTRY_PASSWORD
  3. Deploy a pod with the stargz image:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: stargz-app
spec:
  imagePullSecrets:
    - name: registry-creds
  containers:
    - name: app
      image: $PRIVATE_REGISTRY_SERVER/estargz:1
EOF

Note that the stargz-app pod deploys without any image pull issues.

  4. Delete the pod:
kubectl delete pod stargz-app
  5. Restart the k3d cluster:
k3d cluster stop demo
k3d cluster start demo

Note that the k3d cluster does not start.

@ktock (Member, Author) commented Apr 24, 2024

@ElenaHenderson Thanks for sharing the logs.

Will you please confirm whether this fix also applies to these two scenarios, where stargz-snapshotter is restarted during a node reboot?

This patch won't fix the restart failure, because stargz-snapshotter doesn't persist its filesystem data cache across restarts, and the CRI-based authentication mode doesn't persist registry creds across restarts either.

Alternatively, there are three possible ways to solve the issue:

  • A. Use another authentication method, such as the dockerconfig-based or kubeconfig-based one, which lets the snapshotter acquire creds while restarting.
  • B. Allow the snapshotter to keep failed snapshots, using the following configuration. You need to manually remove these (possibly empty) images after stargz-snapshotter has started, using ctr image rm <image-name>. See also #901 (Allow manually remove invalid snapshots on restore) for the usage of this configuration.
    [snapshotter]
    allow_invalid_mounts_on_restart = true

  • C. Delete images before restarting the node so that the snapshotter can start from a fresh state.
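For option B, the post-restart cleanup might look like the following sketch (the image name is a placeholder, and -n k8s.io assumes the containerd namespace used by Kubernetes; DRY_RUN=1 only prints the commands so the sketch is safe to try anywhere):

```shell
#!/bin/sh
# Sketch: remove the (possibly empty) images whose snapshots failed to
# restore, then restart the snapshotter. Set DRY_RUN=0 on a real node.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Placeholder image reference; substitute the images that failed to restore.
IMAGE="${PRIVATE_REGISTRY_SERVER:-registry.example.com}/estargz:1"
run ctr -n k8s.io image rm "$IMAGE"
run systemctl restart stargz-snapshotter
```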

Or, if we want to persist filesystem/creds data across restarts, additional patches will be needed.

@ElenaHenderson

@ktock Thank you for your prompt response and solutions.

Solution A (kubeconfig-based authentication) is working for us.
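For reference, the kubeconfig-based keychain is enabled in the snapshotter's config.toml roughly like this (section and key names per the stargz-snapshotter docs; verify against the version you run, and the kubeconfig path below is only an example):

```toml
# /etc/containerd-stargz-grpc/config.toml
[kubeconfig_keychain]
enable_keychain = true
# Example path; point this at a kubeconfig the snapshotter can read at boot.
kubeconfig_path = "/etc/kubernetes/snapshotter/config.conf"
```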

I did run into an issue where stargz-snapshotter fails on the first try after an EC2 reboot but succeeds on the second. This breaks the systemd startup order of stargz-snapshotter > containerd > kubectl and requires manually restarting stargz-snapshotter > containerd > kubectl. I will create a separate issue for this.

Solutions B and C would not work for us because we need valid images across reboots and can't really remove images from the nodes.

Development

Successfully merging this pull request may close these issues.

ImagePullPolicy for private ECR repositories
4 participants