
fs: Check connection only when image isn't fully cached #1584

Merged (1 commit) Feb 28, 2024

Conversation

@ktock (Member) commented Feb 23, 2024

Fixes: #1583

When the layer is fully cached on the node, no registry connection is made, so the connectivity check can be skipped.
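The behavior change can be sketched as a simple guard (a minimal shell illustration with invented names; the actual change lives in the snapshotter's Go code):

```shell
#!/bin/sh
# Illustrative sketch only: a layer that is fully cached locally never
# triggers a remote fetch, so the registry connectivity check is skipped.

# Pretend cache lookup: a layer counts as fully cached if a marker exists.
layer_fully_cached() {
    [ -e "$1.cached" ]
}

# Stand-in for the real connectivity check against the registry.
check_connection() {
    echo "checking registry connection for $1"
}

verify_layer() {
    if layer_fully_cached "$1"; then
        echo "skipping connection check for $1 (fully cached)"
    else
        check_connection "$1"
    fi
}

touch layer-a.cached
verify_layer layer-a    # cached: no registry access needed
verify_layer layer-b    # not cached: connectivity is still checked
rm -f layer-a.cached
```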

Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
@AkihiroSuda AkihiroSuda merged commit 803d4f2 into containerd:main Feb 28, 2024
26 checks passed
@ktock ktock deleted the allow-skip-checking branch February 28, 2024 02:15
@jonathanbeber

is any release planned that would include the changes in this PR?

@ktock (Member, Author) commented Mar 11, 2024

Will release it after several ongoing issues are fixed (e.g. #1600, #1599 ...), maybe this week or next.

@ElenaHenderson commented Apr 24, 2024

Hello @ktock,

Will you please confirm whether this fix also applies to these two scenarios, where stargz-snapshotter is restarted during a node reboot?

Once this fix is released, will new pods using stargz images that are already on the nodes (pulled before the stargz-snapshotter restart) be able to deploy?

Scenario 1 > EKS > Stargz image is pulled from private registry with creds that do not expire

  1. Enable stargz-snapshotter on EKS nodes as described in #1107 (comment) (lazy pull from private registry)

  2. Create an image pull secret for the private registry:

kubectl create secret docker-registry registry-creds \
  --docker-server=$PRIVATE_REGISTRY_SERVER \
  --docker-username=$PRIVATE_REGISTRY_USERNAME \
  --docker-password=$PRIVATE_REGISTRY_PASSWORD
  3. Deploy a pod with the stargz image:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: stargz-app
spec:
  imagePullSecrets:
    - name: registry-creds
  containers:
    - name: app
      image: $PRIVATE_REGISTRY_SERVER/estargz:1
EOF

Note that the stargz-app pod deploys without any image pull issues.

  4. Delete the pod:
kubectl delete pod stargz-app
  5. Reboot the EC2 instance or restart stargz-snapshotter on the node:
systemctl restart stargz-snapshotter
Job for stargz-snapshotter.service failed because the control process exited with error code. See "systemctl status stargz-snapshotter.service" and "journalctl -xe" for details.
journalctl -u stargz-snapshotter -l

Apr 23 23:56:15 ip-10-0-2-130.ec2.internal containerd-stargz-grpc[14958]: {"level":"debug","msg":"Waiting for CRI service is started...","time":"2024-04-23T23:56:15.194866142Z"}
Apr 23 23:56:15 ip-10-0-2-130.ec2.internal containerd-stargz-grpc[14958]: {"level":"info","msg":"connected to backend CRI service","time":"2024-04-23T23:56:15.195422179Z"}
Apr 23 23:56:15 ip-10-0-2-130.ec2.internal containerd-stargz-grpc[14958]: {"level":"info","msg":"preparing filesystem mount at mountpoint=/var/lib/containerd-stargz-grpc/snapshotter/snapshots/119/fs","time":"2024-04-23T23:56:15.196665212Z"}
Apr 23 23:56:15 ip-10-0-2-130.ec2.internal containerd-stargz-grpc[14958]: {"level":"debug","mountpoint":"/var/lib/containerd-stargz-grpc/snapshotter/snapshots/119/fs","msg":"resolving","src":"private-regsitry/estargz:1/sha256:1b82fbeab8a04e8548e0708cdbf2ddc35edd3a5aff4ab77a161765d8935bca5a","time":"2024-04-23T23:56:15.196812462Z"}
Apr 23 23:56:15 ip-10-0-2-130.ec2.internal containerd-stargz-grpc[14958]: {"digest":"sha256:1b82fbeab8a04e8548e0708cdbf2ddc35edd3a5aff4ab77a161765d8935bca5a","error":null,"level":"debug","mountpoint":"/var/lib/containerd-stargz-grpc/snapshotter/snapshots/119/fs","msg":"using default handler","ref":"private-regsitry/estargz:1","src":"private-regsitry/estargz:1/sha256:1b82fbeab8a04e8548e0708cdbf2ddc35edd3a5aff4ab77a161765d8935bca5a","time":"2024-04-23T23:56:15.196899102Z"}
Apr 23 23:56:15 ip-10-0-2-130.ec2.internal containerd-stargz-grpc[14958]: {"level":"info","mountpoint":"/var/lib/containerd-stargz-grpc/snapshotter/snapshots/119/fs","msg":"Received status code: 401 Unauthorized. Refreshing creds...","src":"private-regsitry/estargz:1/sha256:1b82fbeab8a04e8548e0708cdbf2ddc35edd3a5aff4ab77a161765d8935bca5a","time":"2024-04-23T23:56:15.349907820Z"}
Apr 23 23:56:15 ip-10-0-2-130.ec2.internal containerd-stargz-grpc[14958]: {"error":"failed to resolve layer \"sha256:1b82fbeab8a04e8548e0708cdbf2ddc35edd3a5aff4ab77a161765d8935bca5a\" from \"private-regsitry/estargz:1\": failed to resolve the blob: failed to resolve the source: cannot resolve layer: failed to redirect (host \"gcr.io\", ref:\"private-regsitry/estargz:1\", digest:\"sha256:1b82fbeab8a04e8548e0708cdbf2ddc35edd3a5aff4ab77a161765d8935bca5a\"): failed to request: failed to fetch anonymous token: unexpected status from GET request to https://gcr.io/v2/token?scope=repository%3Aprivate-regsitry%2Festargz%3Apull\u0026scope=repository%3Aprivate-regsitry%2Fgcr.io%2Festargz%3Apull\u0026service=gcr.io: 403 Forbidden: failed to resolve: failed to resolve target","level":"debug","mountpoint":"/var/lib/containerd-stargz-grpc/snapshotter/snapshots/119/fs","msg":"failed to resolve layer","time":"2024-04-23T23:56:15.650207312Z"}
Apr 23 23:56:15 ip-10-0-2-130.ec2.internal containerd-stargz-grpc[14958]: {"error":"failed to restore remote snapshot: failed to prepare remote snapshot: sha256:19ac957b239fbbf329b0c303a6d3ab6425d96d1556475eb8c12093670f81366a: failed to resolve layer: failed to resolve layer \"sha256:1b82fbeab8a04e8548e0708cdbf2ddc35edd3a5aff4ab77a161765d8935bca5a\" from \"private-regsitry/estargz:1\": failed to resolve the blob: failed to resolve the source: cannot resolve layer: failed to redirect (host \"gcr.io\", ref:\"private-regsitry/estargz:1\", digest:\"sha256:1b82fbeab8a04e8548e0708cdbf2ddc35edd3a5aff4ab77a161765d8935bca5a\"): failed to request: failed to fetch anonymous token: unexpected status from GET request to https://gcr.io/v2/token?scope=repository%3Aprivate-regsitry%2Festargz%3Apull\u0026scope=repository%3Aprivate-regsitry%2Fgcr.io%2Festargz%3Apull\u0026service=gcr.io: 403 Forbidden: failed to resolve: failed to resolve target","level":"fatal","msg":"failed to create new snapshotter","time":"2024-04-23T23:56:15.650305520Z"}
Apr 23 23:56:15 ip-10-0-2-130.ec2.internal systemd[1]: stargz-snapshotter.service: main process exited, code=exited, status=1/FAILURE
Apr 23 23:56:15 ip-10-0-2-130.ec2.internal systemd[1]: Failed to start stargz snapshotter.
Apr 23 23:56:15 ip-10-0-2-130.ec2.internal systemd[1]: Unit stargz-snapshotter.service entered failed state.
Apr 23 23:56:15 ip-10-0-2-130.ec2.internal systemd[1]: stargz-snapshotter.service failed.
Apr 23 23:56:16 ip-10-0-2-130.ec2.internal systemd[1]: stargz-snapshotter.service holdoff time over, scheduling restart.
Apr 23 23:56:16 ip-10-0-2-130.ec2.internal systemd[1]: Stopped stargz snapshotter.
Apr 23 23:56:16 ip-10-0-2-130.ec2.internal systemd[1]: start request repeated too quickly for stargz-snapshotter.service
Apr 23 23:56:16 ip-10-0-2-130.ec2.internal systemd[1]: Failed to start stargz snapshotter.
Apr 23 23:56:16 ip-10-0-2-130.ec2.internal systemd[1]: Unit stargz-snapshotter.service entered failed state.
Apr 23 23:56:16 ip-10-0-2-130.ec2.internal systemd[1]: stargz-snapshotter.service failed.

Note that stargz-snapshotter would not start.

Scenario 2 > k3d > Stargz image is pulled from private registry with creds that do not expire

  1. Create a k3d cluster with stargz-snapshotter:
k3d cluster delete demo
k3d cluster create demo --k3s-arg='--snapshotter=stargz@server:*;agent:*'
  2. Create an image pull secret for the private registry:
kubectl create secret docker-registry registry-creds \
  --docker-server=$PRIVATE_REGISTRY_SERVER \
  --docker-username=$PRIVATE_REGISTRY_USERNAME \
  --docker-password=$PRIVATE_REGISTRY_PASSWORD
  3. Deploy a pod with the stargz image:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: stargz-app
spec:
  imagePullSecrets:
    - name: registry-creds
  containers:
    - name: app
      image: $PRIVATE_REGISTRY_SERVER/estargz:1
EOF

Note that the stargz-app pod deploys without any image pull issues.

  4. Delete the pod:
kubectl delete pod stargz-app
  5. Restart the k3d cluster:
k3d cluster stop demo
k3d cluster start demo

Note that the k3d cluster does not start.

@ktock (Member, Author) commented Apr 24, 2024

@ElenaHenderson Thanks for sharing the logs.

Will you please confirm whether this fix also applies to these two scenarios, where stargz-snapshotter is restarted during a node reboot?

This patch won't fix the restart failure, because stargz-snapshotter doesn't persist its filesystem data cache across restarts, and the CRI-based authentication mode doesn't persist registry creds across restarts either.

Alternatively, there are three possible ways to solve the issue:

  • A. Use another authentication method, such as the dockerconfig-based or kubeconfig-based one, which lets the snapshotter acquire creds while restarting.
  • B. Allow the snapshotter to keep failed snapshots, using the following configuration. You need to manually remove these (possibly empty) images after stargz-snapshotter has started, using ctr image rm <image-name>. See also #901 (Allow manually remove invalid snapshots on restore) for the usage of this configuration.
    [snapshotter]
    allow_invalid_mounts_on_restart = true

  • C. Delete images before restarting the node so that the snapshotter can start from a fresh state.
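For option B, the post-restart cleanup might look like the following sketch (the image name is a placeholder, and -n k8s.io assumes the containerd namespace used by Kubernetes; DRY_RUN=1 only prints the commands so the sketch is safe to try anywhere):

```shell
#!/bin/sh
# Sketch: remove the (possibly empty) images whose snapshots failed to
# restore, then restart the snapshotter. Set DRY_RUN=0 on a real node.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Placeholder image reference; substitute the images that failed to restore.
IMAGE="${PRIVATE_REGISTRY_SERVER:-registry.example.com}/estargz:1"
run ctr -n k8s.io image rm "$IMAGE"
run systemctl restart stargz-snapshotter
```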

Or, if we want to persist filesystem/creds data across restarts, additional patches will be needed.

@ElenaHenderson

@ktock Thank you for your prompt response and solutions.

Solution A (kubeconfig-based authentication) is working for us.
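For reference, the kubeconfig-based keychain is enabled in the snapshotter's config.toml roughly like this (section and key names per the stargz-snapshotter docs; verify against the version you run, and the kubeconfig path below is only an example):

```toml
# /etc/containerd-stargz-grpc/config.toml
[kubeconfig_keychain]
enable_keychain = true
# Example path; point this at a kubeconfig the snapshotter can read at boot.
kubeconfig_path = "/etc/kubernetes/snapshotter/config.conf"
```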

I did run into an issue where stargz-snapshotter fails on the first try after an EC2 reboot but succeeds on the second. This breaks the systemd startup order of stargz-snapshotter > containerd > kubectl and requires manually restarting stargz-snapshotter > containerd > kubectl. I will create a separate issue for this.

Solutions B and C would not work for us because we need valid images across reboots and can't really remove images from the nodes.

Development

Successfully merging this pull request may close these issues.

ImagePullPolicy for private ECR repositories
4 participants