Pods stuck in ContainerCreating due to pull error unauthorized #2030

Closed · axelczk opened this issue Jul 12, 2022 · 26 comments

axelczk commented Jul 12, 2022

We recently switched our cluster to EKS 1.22 with managed node groups, and since then we sometimes get this error when containers are created. We don't have a fix other than replacing the node where the pod is trying to be scheduled.

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.eu-north-1.amazonaws.com/eks/pause:3.1-eksbuild.1": failed to pull image "602401143452.dkr.ecr.eu-north-1.amazonaws.com/eks/pause:3.1-eksbuild.1": failed to pull and unpack image "602401143452.dkr.ecr.eu-north-1.amazonaws.com/eks/pause:3.1-eksbuild.1": failed to resolve reference "602401143452.dkr.ecr.eu-north-1.amazonaws.com/eks/pause:3.1-eksbuild.1": pulling from host 602401143452.dkr.ecr.eu-north-1.amazonaws.com failed with status code [manifests 3.1-eksbuild.1]: 401 Unauthorized   

I don't know if this is the right place to ask. If it isn't, please tell me where I should post this issue.

axelczk added the bug label Jul 12, 2022
jayanthvn (Contributor) commented Jul 12, 2022

@axelczk - Can you please open a support ticket for this? The support team should be able to check whether there is a permission issue pulling from ECR. It looks like you are getting a 401. This issue doesn't belong to the CNI.

axelczk (Author) commented Jul 13, 2022

I know I'm getting a 401. The real question is why the pull works when the node has just started, but stops working after some hours or days.

I don't know which service is responsible for this.

axelczk closed this as completed Jul 28, 2022
dotsuber commented

Hi, I'm having this exact issue too after upgrading EKS. Is there any solution to it?

juris commented Jan 5, 2023

Just had the same issue and found this ticket.
In my case, the pause image was gone after pruning unused images, and it turns out containerd cannot download it back on its own.
So I had to pull it manually.

I'm using Bottlerocket OS, and it was not that trivial. Here's how to do it.

1. Get your AWS ECR auth token first (I did it with my AWS access key / secret key on a laptop):
aws ecr get-login-password --region <your-region>
2. Log in to the affected instance and get crictl:
cd /tmp
yum install tar -y
curl -fsL -o crictl.tar.gz https://github.com/kubernetes-sigs/cri-tools/releases/download/v1.26.0/crictl-v1.26.0-linux-amd64.tar.gz
tar zxf crictl.tar.gz
chmod u+x crictl
3. Pull the pause image:
./crictl --runtime-endpoint=unix:///.bottlerocket/rootfs/run/dockershim.sock pull --creds "AWS:TOKEN_FROM_STEP_1" XXXXX.dkr.ecr.XXXXXXX.amazonaws.com/eks/pause:3.1-eksbuild.1

Now you have that pause image in place, so pods should be able to start normally.
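
For what it's worth, steps 1 and 3 can be combined so the token never has to be copied by hand. This is only a sketch, assuming the AWS CLI is available where crictl runs, and reusing the same Bottlerocket socket path and placeholder registry from above:

./crictl --runtime-endpoint=unix:///.bottlerocket/rootfs/run/dockershim.sock \
  pull --creds "AWS:$(aws ecr get-login-password --region <your-region>)" \
  XXXXX.dkr.ecr.XXXXXXX.amazonaws.com/eks/pause:3.1-eksbuild.1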

axelczk (Author) commented Jan 5, 2023

> Hi, I'm having this exact issue too after upgrading EKS. Is there any solution to it?

There is an error on their side on the EKS node. You need to add this bootstrap extra arg:
'--pod-infra-container-image=602401143452.dkr.ecr.${var.region}.amazonaws.com/eks/pause:3.1-eksbuild.1'

Using this, the garbage collector will not remove the pause container image and you will not need to pull it again.
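
For context, on the Amazon Linux EKS AMI this flag is usually passed through the bootstrap script's --kubelet-extra-args. A minimal user-data sketch (cluster name and region are placeholders, and note that --pod-infra-container-image has since been deprecated, as discussed further down in this thread):

#!/bin/bash
# Sketch only: pass the pause image to the kubelet via bootstrap extra args.
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--pod-infra-container-image=602401143452.dkr.ecr.<region>.amazonaws.com/eks/pause:3.1-eksbuild.1'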

juris commented Jan 5, 2023

As far as I know, the garbage collector only takes disk space into account. In my case, the server was running out of inodes, so I had to prune images manually.
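
If you suspect the same, a quick check of both disk space and inode usage on the node looks roughly like this (paths assume the default containerd layout; they differ on Bottlerocket):

df -h /var/lib/containerd   # free disk space on the filesystem holding image layers
df -i /var/lib/containerd   # free inodes on the same filesystem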

axelczk (Author) commented Jan 5, 2023

We contacted AWS support on our side, and after days of exchanges and debugging this was the explanation we found: the garbage collector was pruning images on the node and removing the pause image along with other images. I still have the ticket somewhere and can check for the full explanation if necessary.
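
A quick way to confirm this state on an affected node is to check whether the pause image is still present in containerd. A sketch, assuming a containerd-based node with crictl available (it can be installed as in the steps above):

sudo crictl --runtime-endpoint=unix:///run/containerd/containerd.sock images | grep pause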

dotsuber commented

Hi, I'm having the same issue after upgrading EKS to 1.25. Is this solution still valid? I think this feature flag is deprecated.

> Hi, I'm having this exact issue too after upgrading EKS. Is there any solution to it?
>
> There is an error on their side on the EKS node. You need to add this bootstrap extra arg: '--pod-infra-container-image=602401143452.dkr.ecr.${var.region}.amazonaws.com/eks/pause:3.1-eksbuild.1'
>
> Using this, the garbage collector will not remove the pause container image and you will not need to pull it again.

ddl-slevine commented Aug 2, 2023

Having the same issue with an EKS upgrade to 1.24.

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5": failed to pull image "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5": failed to resolve reference "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5": pulling from host 602401143452.dkr.ecr.us-east-1.amazonaws.com failed with status code [manifests 3.5]: 401 Unauthorized

interair commented Aug 10, 2023

Having the same issue in EKS after upgrading to 1.27. Can anyone help me, please?

Aug 10 10:50:06  kubelet[3229]: E0810 10:50:06.304292    3229 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \ with CreatePodSandboxError: \"Failed to create sandbox for pod \\\: rpc error: code = Unknown desc = failed to get sandbox image \\\"602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5\\\": failed to pull image \\\"602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5\\\": failed to pull and unpack image \\\"602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5\\\": failed to resolve reference \\\"602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5\\\": pulling from host 602401143452.dkr.ecr.us-west-2.amazonaws.com failed with status code [manifests 3.5]: 401 Unauthorized\"" 

journalctl -xu kubelet.service --no-pager | grep -i credent

Aug 10 12:31:11 kubelet[3222]: W0810 12:31:11.489570    3222 feature_gate.go:241] Setting GA feature gate KubeletCredentialProviders=true. It will be removed in a future release.
Aug 10 12:31:11 kubelet[3222]: I0810 12:31:11.490173    3222 flags.go:64] FLAG: --image-credential-provider-bin-dir="/etc/eks/image-credential-provider"
Aug 10 12:31:11 kubelet[3222]: I0810 12:31:11.490179    3222 flags.go:64] FLAG: --image-credential-provider-config="/etc/eks/image-credential-provider/config.json"
Aug 10 12:31:11 kubelet[3222]: W0810 12:31:11.490930    3222 feature_gate.go:241] Setting GA feature gate KubeletCredentialProviders=true. It will be removed in a future release.
Aug 10 12:31:11 kubelet[3222]: I0810 12:31:11.490944    3222 feature_gate.go:249] feature gates: &{map[KubeletCredentialProviders:true RotateKubeletServerCertificate:true]}
Aug 10 12:31:11 kubelet[3222]: W0810 12:31:11.495660    3222 feature_gate.go:241] Setting GA feature gate KubeletCredentialProviders=true. It will be removed in a future release.
Aug 10 12:31:11 kubelet[3222]: I0810 12:31:11.495678    3222 feature_gate.go:249] feature gates: &{map[KubeletCredentialProviders:true RotateKubeletServerCertificate:true]}
Aug 10 12:31:11 kubelet[3222]: W0810 12:31:11.495793    3222 feature_gate.go:241] Setting GA feature gate KubeletCredentialProviders=true. It will be removed in a future release.
Aug 10 12:31:11 kubelet[3222]: I0810 12:31:11.495808    3222 feature_gate.go:249] feature gates: &{map[KubeletCredentialProviders:true RotateKubeletServerCertificate:true]}
Aug 10 12:31:17 kubelet[3222]: I0810 12:31:17.568804    3222 provider.go:102] Refreshing cache for provider: *credentialprovider.defaultDockerConfigProvider

ps aux | grep bin/kubelet | grep -v grep

root      3222  2.0  0.6 1812360 104096 ?      Ssl  12:31   3:34 /usr/bin/kubelet --config /etc/kubernetes/kubelet/kubelet-config.json --kubeconfig /var/lib/kubelet/kubeconfig --container-runtime-endpoint unix:///run/containerd/containerd.sock --image-credential-provider-config /etc/eks/image-credential-provider/config.json --image-credential-provider-bin-dir /etc/eks/image-credential-provider --node-ip=xxxxx --v=2 --hostname-override=ip-xxxxx.us-west-2.compute.internal --cloud-provider=external --node-labels=eks.amazonaws.com/nodegroup-image=ami-xxxxx,eks.amazonaws.com/capacityType=SPOT,environment=test,eks.amazonaws.com/nodegroup=testSpot --max-pods=58

cat /etc/eks/image-credential-provider/config.json

{
  "apiVersion": "kubelet.config.k8s.io/v1",
  "kind": "CredentialProviderConfig",
  "providers": [
    {
      "name": "ecr-credential-provider",
      "matchImages": [
        "*.dkr.ecr.*.amazonaws.com",
        "*.dkr.ecr.*.amazonaws.com.cn",
        "*.dkr.ecr-fips.*.amazonaws.com",
        "*.dkr.ecr.*.c2s.ic.gov",
        "*.dkr.ecr.*.sc2s.sgov.gov"
      ],
      "defaultCacheDuration": "12h",
      "apiVersion": "credentialprovider.kubelet.k8s.io/v1"
    }
  ]
}

ls -la /etc/eks/image-credential-provider

drwxr-xr-x 2 root root       56 Jul 28 04:18 .
drwxr-xr-x 5 root root      265 Jul 28 04:18 ..
-rw-r--r-- 1 root root      477 Jul 28 04:15 config.json
-rwxrwxr-x 1 root root 16072704 Jun 30 18:40 ecr-credential-provider

Getting a token works on the node:
aws ecr get-login-password --region us-west-2

Fetching the image via ./crictl works fine from the node.
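
One more check that may help narrow this down: the kubelet credential provider is just an exec plugin, so it can be invoked by hand with a CredentialProviderRequest on stdin to see whether it returns ECR credentials. A sketch, using the binary and apiVersion from the config above and a placeholder image URI:

echo '{"apiVersion":"credentialprovider.kubelet.k8s.io/v1","kind":"CredentialProviderRequest","image":"602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5"}' \
  | sudo /etc/eks/image-credential-provider/ecr-credential-provider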

VikramPunnam commented Sep 11, 2023

Hi @interair,

We are also having the same issue in our environment.

The kubelet is able to pull all the system images (amazon-k8s-cni-init, amazon-k8s-cni) except the pause image, as shown below.

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": failed to pull image "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": failed to resolve reference "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": pulling from host 900889452093.dkr.ecr.ap-south-2.amazonaws.com failed with status code [manifests 3.5]: 401 Unauthorized

Fetching the image via ./crictl is not a feasible solution in a production environment.

Can anyone help me, please?

hamdallahjodah commented

Any updates here? Same issue!

jdn5126 (Contributor) commented Sep 11, 2023

@VikramPunnam @hamdallahjodah @interair @ddl-slevine I am not familiar with this issue, and it is not an issue with the VPC CNI, so I suggest opening an AWS support case to get help. That will be the fastest way to a resolution, and you can share your findings here.

ohrab-hacken commented

I have the same issue when upgrading to 1.29. Some nodes can download the pause image, but some nodes cannot, so all pods on those nodes just hang in the creating state. I don't understand why the pause image gets a 401 only some of the time.

elvishsu66 commented

We also have this issue after upgrading to 1.29. Are there any good hints so I can start digging?

nightmareze1 commented

I have the same issue with EKS 1.29 :(

pjanouse commented

I've observed the same after the v1.29 upgrade today too. I tried replacing an affected compute node with a fresh one and that seems to have helped (at least for a while). So far so good...

nightmareze1 commented

I think the problem happens after 12 hours, when the session token expires. Curiously, the instance where I tested it didn't have any inode or disk-space problems.

cartermckinnon (Member) commented Jan 29, 2024

If this is happening on the official EKS AMI, can you open an issue in our repo so we can look into it? https://github.com/awslabs/amazon-eks-ami

ohrab-hacken commented

The --pod-infra-container-image flag is set on the kubelet. I found that the disk on the node really does fill up after some time, and the kubelet image garbage collector then deletes the pause image. So instead of deleting other images, it deletes the pause image, and once the pause image is gone the node doesn't work.
I also found the reason for the full disk. In my case, I had ttlSecondsAfterFinished: 7200 for Dagster jobs, and they consumed all the disk space. I changed it to ttlSecondsAfterFinished: 120 so jobs are cleaned up more frequently, and we don't have this issue any more.
It's strange because I didn't have this issue on 1.28, and I didn't change any Dagster configuration between version upgrades. My guess is that the kubelet image garbage collector works differently in 1.28 and 1.29.
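
For anyone hitting the same pattern: ttlSecondsAfterFinished is the standard TTL-after-finished setting on the Job spec, so the fix is independent of Dagster. A minimal sketch of a Job that cleans itself up two minutes after completion (name and image are placeholders):

cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: ttl-example                # placeholder name
spec:
  ttlSecondsAfterFinished: 120     # finished Job and its pods are deleted after 2 minutes
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox
        command: ["sh", "-c", "echo done"]
EOF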

jdn5126 (Contributor) commented Jan 30, 2024

@ohrab-hacken --pod-infra-container-image was deprecated in k8s 1.27. As I understand it, the container runtime will prune the image unless it is marked as pinned. From the EKS 1.28 AMI, it does seem like the pause image is not pinned for some reason. @cartermckinnon do you know if it should be?
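
As a side note, whether the runtime considers the sandbox image pinned can be checked directly on a node. A sketch, assuming a recent enough crictl and containerd 1.7+ that expose the pinned field:

sudo crictl inspecti 602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5 | grep -i pinned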

jdn5126 (Contributor) commented Jan 30, 2024

This issue is being discussed at awslabs/amazon-eks-ami#1597

avisaradir commented

Is there any new progress on solving this matter?

jdn5126 (Contributor) commented Jan 31, 2024

> Is there any new progress on solving this matter?

Did you follow the issue I linked to? This issue is in the EKS AMI, not the VPC CNI, so short- and long-term resolutions are being discussed there.

avisaradir commented

> Is there any new progress on solving this matter?
>
> Did you follow the issue I linked to? This issue is in the EKS AMI, not the VPC CNI, so short- and long-term resolutions are being discussed there.

I will take another look at that.

ForbiddenEra commented Mar 14, 2024

Just started running into this today?

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5": failed to pull image "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5": failed to resolve reference "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5": unexpected status from HEAD request to https://602401143452.dkr.ecr.ca-central-1.amazonaws.com/v2/eks/pause/manifests/3.5: 401 Unauthorized

2/3 replicas of my pod deployed; all were scheduled on different nodes, but all nodes are self-managed and running the same AMI. I thought maybe it was only affecting one AZ, so I tried redeploying again, and now only 1/3 worked. Not sure yet if it's only affecting specific nodes or what...

Edit: So, I don't see any pattern with regard to node type, node group, AZ, or specific resources.

Seems to have started a few days ago. It's not really AMI-related. Not sure if it's specifically VPC CNI related either, though it did of course prevent me from updating that plugin.

Doing an instance refresh and/or terminating and re-creating the failing nodes/instances seems to have resolved the issue (for now?) - they were all redeployed with the same AMI and everything. No idea why.
