CNI pod crashloopbackoff due to incorrect instance metadata while fetching ENI #1008

Closed
cshivashankar opened this issue Jun 5, 2020 · 9 comments
Labels
bug priority/P0 Highest priority. Someone needs to actively work on this.

Comments

@cshivashankar

The agent looks up all attached ENIs in EC2 instance metadata before starting the pod. However, in very rare cases like the one I observed yesterday, the instance metadata disagrees with the ENIs that are actually attached. Because of this, IPAMd fails and the pod enters a CrashLoopBackOff state. I have collected logs and can share them privately.
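One way to confirm the discrepancy is to compare the ENI list that instance metadata reports against what the EC2 API says is attached to the instance. A rough sketch, run on the affected node (the instance ID and region below are placeholders):

# ENIs as seen by instance metadata (IMDS) on the node
for mac in $(curl -s http://169.254.169.254/latest/meta-data/network/interfaces/macs/); do
  echo "$mac -> $(curl -s http://169.254.169.254/latest/meta-data/network/interfaces/macs/${mac}interface-id)"
done

# ENIs the EC2 API reports as actually attached to this instance
aws ec2 describe-network-interfaces \
  --filters Name=attachment.instance-id,Values=i-0123456789abcdef0 \
  --query 'NetworkInterfaces[].NetworkInterfaceId' \
  --region us-west-2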

@mogren mogren added bug priority/P0 Highest priority. Someone needs to actively work on this. labels Jun 5, 2020
@InAnimaTe

@cshivashankar can you please share some of your log output? Does it resemble #486 or #950? We're presently having the same issue and are in contact with AWS Support right now.

@cshivashankar
Author

Hi @InAnimaTe, the logs indicated the following:
Timed out waiting for IPAM daemon to start:
When I captured logs with the EKS log collector and analyzed the ipamd logs, they looked like this
(these can be found under var_log/aws-routed-eni in the collected bundle)

{"level":"error","ts":"2020-06-04T08:26:43.180Z","caller":"aws-k8s-agent/main.go:30","msg":"Initialization failure: ipamd init: failed to retrieve attached ENIs info"}

{"level":"info","ts":"2020-06-06T00:39:58.944Z","caller":"runtime/asm_amd64.s:1357","msg":"Synced successfully with APIServer"}
{"level":"info","ts":"2020-06-06T00:39:58.944Z","caller":"k8sapi/discovery.go:216","msg":"Add/Update for CNI pod aws-node-glbkt"}
{"level":"error","ts":"2020-06-06T00:39:58.960Z","caller":"ipamd/ipamd.go:328","msg":"Failed to call ec2:DescribeNetworkInterfaces for [eni-043a9bb8856789abc eni-00000099 eni-00000088]: InvalidNetworkInterfaceID.NotFound: The networkInterface ID 'eni-00000099' does not exist\n\tstatus code: 400, request id: 192e237e-dbb4-46d3-bcc1-1e184ddd19cf"}
{"level":"debug","ts":"2020-06-06T00:39:58.960Z","caller":"ipamd/ipamd.go:328","msg":"Could not find interface: The networkInterface ID 'eni-00000099' does not exist, ID: eni-00000099"}
{"level":"error","ts":"2020-06-06T00:39:59.037Z","caller":"ipamd/ipamd.go:328","msg":"Failed to call ec2:DescribeNetworkInterfaces for [eni-043a9bb8856789abc eni-00000088]: InvalidNetworkInterfaceID.NotFound: The networkInterface ID 'eni-00000088' does not exist\n\tstatus code: 400, request id: 45415af0-835d-49c1-9b60-dc235eeeacd5"}
{"level":"debug","ts":"2020-06-06T00:39:59.037Z","caller":"ipamd/ipamd.go:328","msg":"Could not find interface: The networkInterface ID 'eni-00000088' does not exist, ID: eni-00000088"}
{"level":"debug","ts":"2020-06-06T00:39:59.114Z","caller":"ipamd/ipamd.go:314","msg":"DescribeAllENIs success: ENIs: 1, tagged: 0"}
{"level":"info","ts":"2020-06-06T00:39:59.114Z","caller":"ipamd/ipamd.go:360","msg":"Setting up host network... "}

I checked both the issues you mentioned; it doesn't look like either of them.

@InAnimaTe

Yeah your issue looks different. Thanks for sharing your logs so quickly. Hopefully this one can get nabbed quickly!

@RaspiRepo

RaspiRepo commented Jun 12, 2020

I am having a similar issue (could be the same as mentioned above). Pods are not getting scheduled on those nodes:

orderer-0   1/1     Running             0          78m   100.116.130.31    ip-100-116-130-136.us-west-2.compute.internal   <none>           <none>
raft0-0     0/1     ContainerCreating   0          69m   <none>            ip-100-116-135-44.us-west-2.compute.internal    <none>           <none>
raft1-0     1/1     Running             0          78m   100.116.138.243   ip-100-116-137-235.us-west-2.compute.internal   <none>           <none>
raft2-0     1/1     Running             0          78m   100.116.131.20    ip-100-116-130-136.us-west-2.compute.internal   <none>           <none>
raft3-0     1/1     Running             0          78m   100.116.134.95    ip-100-116-133-195.us-west-2.compute.internal   <none>           <none>

It can be traced back to the AWS CNI pod crashing on that node:

aws-node-42g8r 0/1 Error 6 9m52s 100.116.135.44 ip-100-116-135-44.us-west-2.compute.internal <none> <none>

kubectl describe po aws-node-42g8r

 Type     Reason     Age                   From                                                   Message
  ----     ------     ----                  ----                                                   -------
  Normal   Scheduled  13m                   default-scheduler                                      Successfully assigned kube-system/aws-node-42g8r to ip-100-116-135-44.us-west-2.compute.internal
  Warning  Unhealthy  11m                   kubelet, ip-100-116-135-44.us-west-2.compute.internal  Readiness probe errored: rpc error: code = Unknown desc = container not running (5035fe1baa42897f47183d48d6be83c73cc91aac71e30bfc5dde612f3d1472ab)
  Normal   Pulling    11m (x4 over 13m)     kubelet, ip-100-116-135-44.us-west-2.compute.internal  Pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.1"
  Normal   Pulled     11m (x4 over 13m)     kubelet, ip-100-116-135-44.us-west-2.compute.internal  Successfully pulled image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.1"
  Normal   Created    11m (x4 over 13m)     kubelet, ip-100-116-135-44.us-west-2.compute.internal  Created container aws-node
  Normal   Started    11m (x4 over 13m)     kubelet, ip-100-116-135-44.us-west-2.compute.internal  Started container aws-node
  Warning  BackOff    3m53s (x32 over 12m)  kubelet, ip-100-116-135-44.us-west-2.compute.internal  Back-off restarting failed container

Logs from the crashing pod:

mk:~ sku $ kubectl logs -f aws-node-42g8r
Starting IPAM daemon in the background ... ok.
Checking for IPAM connectivity ...  failed.
Timed out waiting for IPAM daemon to start:

System information:

 Kernel Version:             4.14.177-139.254.amzn2.x86_64
 OS Image:                   Amazon Linux 2
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://19.3.6
 Kubelet Version:            v1.16.8-eks-e16311
 Kube-Proxy Version:         v1.16.8-eks-e16311
ProviderID:                  aws:///us-west-2b/i-0260dbe1061
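"Timed out waiting for IPAM daemon to start" just means the entrypoint script gave up waiting for ipamd to become reachable, so the real error is usually in ipamd's own log on that node. As a quick check (not an official diagnostic; port 61679 is the default introspection port):

# On the affected node: see why ipamd failed to initialize
sudo grep -i "initialization failure" /var/log/aws-routed-eni/ipamd.log | tail -n 5

# If ipamd is actually running, its local introspection endpoint responds
curl -s http://localhost:61679/v1/enis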

@RaspiRepo

Is there any recommended fix for this issue?

@mogren
Contributor

mogren commented Jun 15, 2020

Hi @RaspiRepo, unfortunately not yet. I added a work-around for the issue with the metadata in #1011, but that change has not yet made it into a new release.

@mogren
Contributor

mogren commented Jun 16, 2020

@RaspiRepo, @cshivashankar: I just published a new release candidate, v1.6.3-rc1, that we are currently testing to prepare for the final release. Like all pre-releases, it has been made available in us-west-2 in case you would like to test it out.
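Applying the release's aws-k8s-cni.yaml manifest is the recommended path; as a rough sketch for a test cluster, you could also just swap the image on the existing DaemonSet (same ECR repo as the v1.6.1 image in the events above; verify the tag is available in your region first):

# Point the aws-node DaemonSet at the release candidate image (us-west-2 ECR shown)
kubectl -n kube-system set image daemonset/aws-node \
  aws-node=602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.3-rc1

# Watch the rollout across nodes
kubectl -n kube-system rollout status daemonset/aws-node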

@cshivashankar
Author

Thanks @mogren, I will test and update.

@mogren
Contributor

mogren commented Jun 19, 2020

We just released v1.6.3 with the fix from #1011.
