Pods stuck in ContainerCreating due to pull error unauthorized #2030

Closed · axelczk opened this issue Jul 12, 2022 · 26 comments

axelczk commented Jul 12, 2022

We recently switched our cluster to EKS 1.22 with managed node groups, and since then we sometimes get this error when containers are created. We don't have a fix other than replacing the node where the pod is trying to be scheduled.

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.eu-north-1.amazonaws.com/eks/pause:3.1-eksbuild.1": failed to pull image "602401143452.dkr.ecr.eu-north-1.amazonaws.com/eks/pause:3.1-eksbuild.1": failed to pull and unpack image "602401143452.dkr.ecr.eu-north-1.amazonaws.com/eks/pause:3.1-eksbuild.1": failed to resolve reference "602401143452.dkr.ecr.eu-north-1.amazonaws.com/eks/pause:3.1-eksbuild.1": pulling from host 602401143452.dkr.ecr.eu-north-1.amazonaws.com failed with status code [manifests 3.1-eksbuild.1]: 401 Unauthorized   

I don't know if this is the right place to ask. If it isn't, please tell me where I should post this issue.

axelczk added the bug label Jul 12, 2022
jayanthvn (Contributor) commented Jul 12, 2022

@axelczk - Can you please open a support ticket for this? The support team should be able to check whether there is a permission issue pulling from ECR. It looks like you are getting a 401. This issue doesn't belong to the CNI.

axelczk (Author) commented Jul 13, 2022

I know I'm getting a 401. The real question is why the pull works when the node has just started, but stops working after some hours or days.

I don't know which service is responsible for this.

axelczk closed this as completed Jul 28, 2022
dotsuber commented

Hi, I'm having this exact issue too after upgrading EKS. Is there any solution to it?

juris commented Jan 5, 2023

Just had the same issue and found this ticket.
In my case, the pause image was gone after pruning unused images, and it turns out containerd cannot download it back on its own.
So I had to pull it manually.

I'm using Bottlerocket OS, and it was not that trivial. Here's how to do it.

1. Get your AWS ECR auth token first (I did it with my AWS access key / secret key on a laptop):
aws ecr get-login-password --region <your-region>
2. Log in to the affected instance and get crictl:
cd /tmp
yum install tar -y
curl -fsL -o crictl.tar.gz https://github.com/kubernetes-sigs/cri-tools/releases/download/v1.26.0/crictl-v1.26.0-linux-amd64.tar.gz
tar zxf crictl.tar.gz
chmod u+x crictl
3. Pull the pause image:
./crictl --runtime-endpoint=unix:///.bottlerocket/rootfs/run/dockershim.sock pull --creds "AWS:TOKEN_FROM_STEP_1" XXXXX.dkr.ecr.XXXXXXX.amazonaws.com/eks/pause:3.1-eksbuild.1

Now you have that pause image in place, so pods should be able to start normally.
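
For what it's worth, steps 1 and 3 can be combined so the token never has to be copied by hand. This is only a sketch, assuming the AWS CLI is available where crictl runs, and reusing the same Bottlerocket socket path and placeholder registry from above:

./crictl --runtime-endpoint=unix:///.bottlerocket/rootfs/run/dockershim.sock \
  pull --creds "AWS:$(aws ecr get-login-password --region <your-region>)" \
  XXXXX.dkr.ecr.XXXXXXX.amazonaws.com/eks/pause:3.1-eksbuild.1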

axelczk (Author) commented Jan 5, 2023

> Hi, I'm having this exact issue too after upgrading EKS. Is there any solution to it?

There is an error on their side on the EKS node. You need to add this bootstrap extra arg:
'--pod-infra-container-image=602401143452.dkr.ecr.${var.region}.amazonaws.com/eks/pause:3.1-eksbuild.1'

Using this, the garbage collector will not remove the pause container image and you will not need to pull it again.
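
For context, on the Amazon Linux EKS AMI this flag is usually passed through the bootstrap script's --kubelet-extra-args. A minimal user-data sketch (cluster name and region are placeholders, and note that --pod-infra-container-image has since been deprecated, as discussed further down in this thread):

#!/bin/bash
# Sketch only: pass the pause image to the kubelet via bootstrap extra args.
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--pod-infra-container-image=602401143452.dkr.ecr.<region>.amazonaws.com/eks/pause:3.1-eksbuild.1'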

juris commented Jan 5, 2023

As far as I know, the garbage collector only takes disk space into account. In my case, the server was running out of inodes, so I had to prune images manually.
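
If you suspect the same, a quick check of both disk space and inode usage on the node looks roughly like this (paths assume the default containerd layout; they differ on Bottlerocket):

df -h /var/lib/containerd   # free disk space on the filesystem holding image layers
df -i /var/lib/containerd   # free inodes on the same filesystem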

axelczk (Author) commented Jan 5, 2023

We contacted AWS support on our side, and after days of exchanges and debugging this was the explanation we found: the garbage collector was pruning images on the node and removing the pause image along with other images. I still have the ticket somewhere and can check for the full explanation if necessary.
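
A quick way to confirm this state on an affected node is to check whether the pause image is still present in containerd. A sketch, assuming a containerd-based node with crictl available (it can be installed as in the steps above):

sudo crictl --runtime-endpoint=unix:///run/containerd/containerd.sock images | grep pause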

dotsuber commented

Hi, I'm having the same issue after upgrading EKS to 1.25. Is this solution still valid? I think this feature flag is deprecated.

> Hi, I'm having this exact issue too after upgrading EKS. Is there any solution to it?
>
> There is an error on their side on the EKS node. You need to add this bootstrap extra arg: '--pod-infra-container-image=602401143452.dkr.ecr.${var.region}.amazonaws.com/eks/pause:3.1-eksbuild.1'
>
> Using this, the garbage collector will not remove the pause container image and you will not need to pull it again.

ddl-slevine commented Aug 2, 2023

Having the same issue with an EKS upgrade to 1.24.

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5": failed to pull image "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5": failed to resolve reference "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5": pulling from host 602401143452.dkr.ecr.us-east-1.amazonaws.com failed with status code [manifests 3.5]: 401 Unauthorized

interair commented Aug 10, 2023

Having the same issue in EKS after upgrading to 1.27. Can anyone help me, please?

Aug 10 10:50:06  kubelet[3229]: E0810 10:50:06.304292    3229 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \ with CreatePodSandboxError: \"Failed to create sandbox for pod \\\: rpc error: code = Unknown desc = failed to get sandbox image \\\"602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5\\\": failed to pull image \\\"602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5\\\": failed to pull and unpack image \\\"602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5\\\": failed to resolve reference \\\"602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5\\\": pulling from host 602401143452.dkr.ecr.us-west-2.amazonaws.com failed with status code [manifests 3.5]: 401 Unauthorized\"" 

journalctl -xu kubelet.service --no-pager | grep -i credent

Aug 10 12:31:11 kubelet[3222]: W0810 12:31:11.489570    3222 feature_gate.go:241] Setting GA feature gate KubeletCredentialProviders=true. It will be removed in a future release.
Aug 10 12:31:11 kubelet[3222]: I0810 12:31:11.490173    3222 flags.go:64] FLAG: --image-credential-provider-bin-dir="/etc/eks/image-credential-provider"
Aug 10 12:31:11 kubelet[3222]: I0810 12:31:11.490179    3222 flags.go:64] FLAG: --image-credential-provider-config="/etc/eks/image-credential-provider/config.json"
Aug 10 12:31:11 kubelet[3222]: W0810 12:31:11.490930    3222 feature_gate.go:241] Setting GA feature gate KubeletCredentialProviders=true. It will be removed in a future release.
Aug 10 12:31:11 kubelet[3222]: I0810 12:31:11.490944    3222 feature_gate.go:249] feature gates: &{map[KubeletCredentialProviders:true RotateKubeletServerCertificate:true]}
Aug 10 12:31:11 kubelet[3222]: W0810 12:31:11.495660    3222 feature_gate.go:241] Setting GA feature gate KubeletCredentialProviders=true. It will be removed in a future release.
Aug 10 12:31:11 kubelet[3222]: I0810 12:31:11.495678    3222 feature_gate.go:249] feature gates: &{map[KubeletCredentialProviders:true RotateKubeletServerCertificate:true]}
Aug 10 12:31:11 kubelet[3222]: W0810 12:31:11.495793    3222 feature_gate.go:241] Setting GA feature gate KubeletCredentialProviders=true. It will be removed in a future release.
Aug 10 12:31:11 kubelet[3222]: I0810 12:31:11.495808    3222 feature_gate.go:249] feature gates: &{map[KubeletCredentialProviders:true RotateKubeletServerCertificate:true]}
Aug 10 12:31:17 kubelet[3222]: I0810 12:31:17.568804    3222 provider.go:102] Refreshing cache for provider: *credentialprovider.defaultDockerConfigProvider

ps aux | grep bin/kubelet | grep -v grep

root      3222  2.0  0.6 1812360 104096 ?      Ssl  12:31   3:34 /usr/bin/kubelet --config /etc/kubernetes/kubelet/kubelet-config.json --kubeconfig /var/lib/kubelet/kubeconfig --container-runtime-endpoint unix:///run/containerd/containerd.sock --image-credential-provider-config /etc/eks/image-credential-provider/config.json --image-credential-provider-bin-dir /etc/eks/image-credential-provider --node-ip=xxxxx --v=2 --hostname-override=ip-xxxxx.us-west-2.compute.internal --cloud-provider=external --node-labels=eks.amazonaws.com/nodegroup-image=ami-xxxxx,eks.amazonaws.com/capacityType=SPOT,environment=test,eks.amazonaws.com/nodegroup=testSpot --max-pods=58

cat /etc/eks/image-credential-provider/config.json

{
  "apiVersion": "kubelet.config.k8s.io/v1",
  "kind": "CredentialProviderConfig",
  "providers": [
    {
      "name": "ecr-credential-provider",
      "matchImages": [
        "*.dkr.ecr.*.amazonaws.com",
        "*.dkr.ecr.*.amazonaws.com.cn",
        "*.dkr.ecr-fips.*.amazonaws.com",
        "*.dkr.ecr.*.c2s.ic.gov",
        "*.dkr.ecr.*.sc2s.sgov.gov"
      ],
      "defaultCacheDuration": "12h",
      "apiVersion": "credentialprovider.kubelet.k8s.io/v1"
    }
  ]
}

ls -la /etc/eks/image-credential-provider

drwxr-xr-x 2 root root       56 Jul 28 04:18 .
drwxr-xr-x 5 root root      265 Jul 28 04:18 ..
-rw-r--r-- 1 root root      477 Jul 28 04:15 config.json
-rwxrwxr-x 1 root root 16072704 Jun 30 18:40 ecr-credential-provider

Getting a token works on the node:
aws ecr get-login-password --region us-west-2

Fetching the image via ./crictl works fine from the node.
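
One more check that may help narrow this down: the kubelet credential provider is just an exec plugin, so it can be invoked by hand with a CredentialProviderRequest on stdin to see whether it returns ECR credentials. A sketch, using the binary and apiVersion from the config above and a placeholder image URI:

echo '{"apiVersion":"credentialprovider.kubelet.k8s.io/v1","kind":"CredentialProviderRequest","image":"602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5"}' \
  | sudo /etc/eks/image-credential-provider/ecr-credential-provider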

VikramPunnam commented Sep 11, 2023

Hi @interair,

We are also having the same issue in our environment.

The kubelet is able to pull all the system images (amazon-k8s-cni-init, amazon-k8s-cni) except the pause image, as shown below.

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": failed to pull image "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": failed to resolve reference "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": pulling from host 900889452093.dkr.ecr.ap-south-2.amazonaws.com failed with status code [manifests 3.5]: 401 Unauthorized

Fetching the image via ./crictl is not a feasible solution in a production environment.

Can anyone help me, please?

hamdallahjodah commented

Any updates here? Same issue!

jdn5126 (Contributor) commented Sep 11, 2023

@VikramPunnam @hamdallahjodah @interair @ddl-slevine I am not familiar with this issue, and it is not an issue with the VPC CNI, so I suggest opening an AWS support case to get help. That will be the fastest way to a resolution, and you can share your findings here.

ohrab-hacken commented

I have the same issue when upgrading to 1.29. Some nodes can download the pause image, but some nodes cannot, so all pods on those nodes just hang in the creating state. I don't understand why the pause image gets a 401 only some of the time.

elvishsu66 commented

We also have this issue after upgrading to 1.29. Are there any good hints so I can start digging?

nightmareze1 commented

I have the same issue with EKS 1.29 :(

pjanouse commented

I've observed the same after the v1.29 upgrade today too. I tried replacing an affected compute node with a fresh one and that seems to have helped (at least for a while). So far so good...

nightmareze1 commented

I think the problem happens after 12 hours, when the session token expires. Curiously, the instance where I tested it didn't have any inode or disk-space problems.

cartermckinnon (Member) commented Jan 29, 2024

If this is happening on the official EKS AMI, can you open an issue in our repo so we can look into it? https://github.com/awslabs/amazon-eks-ami

ohrab-hacken commented

The --pod-infra-container-image flag is set on the kubelet. I found that the disk on the node really does fill up after some time, and the kubelet image garbage collector then deletes the pause image. So instead of deleting other images, it deletes the pause image, and once the pause image is gone the node doesn't work.
I also found the reason for the full disk. In my case, I had ttlSecondsAfterFinished: 7200 for Dagster jobs, and they consumed all the disk space. I changed it to ttlSecondsAfterFinished: 120 so jobs are cleaned up more frequently, and we don't have this issue any more.
It's strange because I didn't have this issue on 1.28, and I didn't change any Dagster configuration between version upgrades. My guess is that the kubelet image garbage collector works differently in 1.28 and 1.29.
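
For anyone hitting the same pattern: ttlSecondsAfterFinished is the standard TTL-after-finished setting on the Job spec, so the fix is independent of Dagster. A minimal sketch of a Job that cleans itself up two minutes after completion (name and image are placeholders):

cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: ttl-example                # placeholder name
spec:
  ttlSecondsAfterFinished: 120     # finished Job and its pods are deleted after 2 minutes
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox
        command: ["sh", "-c", "echo done"]
EOF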

jdn5126 (Contributor) commented Jan 30, 2024

@ohrab-hacken --pod-infra-container-image was deprecated in k8s 1.27. As I understand it, the container runtime will prune the image unless it is marked as pinned. From the EKS 1.28 AMI, it does seem like the pause image is not pinned for some reason. @cartermckinnon do you know if it should be?
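
As a side note, whether the runtime considers the sandbox image pinned can be checked directly on a node. A sketch, assuming a recent enough crictl and containerd 1.7+ that expose the pinned field:

sudo crictl inspecti 602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5 | grep -i pinned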

jdn5126 (Contributor) commented Jan 30, 2024

This issue is being discussed at awslabs/amazon-eks-ami#1597

avisaradir commented

Is there any new progress on solving this matter?

jdn5126 (Contributor) commented Jan 31, 2024

> Is there any new progress on solving this matter?

Did you follow the issue I linked to? This issue is in the EKS AMI, not the VPC CNI, so short- and long-term resolutions are being discussed there.

avisaradir commented

> Is there any new progress on solving this matter?
>
> Did you follow the issue I linked to? This issue is in the EKS AMI, not the VPC CNI, so short- and long-term resolutions are being discussed there.

I will take another look at that.

ForbiddenEra commented Mar 14, 2024

Just started running into this today?

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5": failed to pull image "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5": failed to resolve reference "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5": unexpected status from HEAD request to https://602401143452.dkr.ecr.ca-central-1.amazonaws.com/v2/eks/pause/manifests/3.5: 401 Unauthorized

2/3 replicas of my pod deployed; all were scheduled on different nodes, but all nodes are self-managed and running the same AMI. I thought maybe it was only affecting one AZ, so I tried redeploying again, and now only 1/3 worked. Not sure yet if it's only affecting specific nodes or what...

Edit: So, I don't see any pattern with regard to node type, node group, AZ, or specific resources.

Seems to have started a few days ago. It's not really AMI-related. Not sure if it's specifically VPC CNI related either, though it did of course prevent me from updating that plugin.

Doing an instance refresh and/or terminating and re-creating the failing nodes/instances seems to have resolved the issue (for now?) - they were all redeployed with the same AMI and everything. No idea why.
