CNI pod crashloopbackoff due to incorrect instance metadata while fetching ENI #1008

Closed
cshivashankar opened this issue Jun 5, 2020 · 9 comments
Labels
bug priority/P0 Highest priority. Someone needs to actively work on this.

Comments

@cshivashankar

The agent looks up all attached ENIs in EC2 instance metadata before starting the pod. However, in very rare cases like the one I observed yesterday, the instance metadata disagrees with the ENIs that are actually attached. Because of this, IPAMd fails and the pod enters a CrashLoopBackOff state. I have collected logs and can share them privately.
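One way to confirm the discrepancy is to compare the ENI list that instance metadata reports against what the EC2 API says is attached to the instance. A rough sketch, run on the affected node (the instance ID and region below are placeholders):

# ENIs as seen by instance metadata (IMDS) on the node
for mac in $(curl -s http://169.254.169.254/latest/meta-data/network/interfaces/macs/); do
  echo "$mac -> $(curl -s http://169.254.169.254/latest/meta-data/network/interfaces/macs/${mac}interface-id)"
done

# ENIs the EC2 API reports as actually attached to this instance
aws ec2 describe-network-interfaces \
  --filters Name=attachment.instance-id,Values=i-0123456789abcdef0 \
  --query 'NetworkInterfaces[].NetworkInterfaceId' \
  --region us-west-2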

@mogren mogren added bug priority/P0 Highest priority. Someone needs to actively work on this. labels Jun 5, 2020
@InAnimaTe

@cshivashankar can you please share some of your log output? Does it resemble #486 or #950? We're presently having the same issue and are in contact with AWS Support right now.

@cshivashankar
Author

Hi @InAnimaTe, the logs indicated the following:
Timed out waiting for IPAM daemon to start:
When I captured logs with the EKS log collector and analyzed the ipamd logs, they looked like this
(these can be found under var_log/aws-routed-eni in the collected bundle)

{"level":"error","ts":"2020-06-04T08:26:43.180Z","caller":"aws-k8s-agent/main.go:30","msg":"Initialization failure: ipamd init: failed to retrieve attached ENIs info"}

{"level":"info","ts":"2020-06-06T00:39:58.944Z","caller":"runtime/asm_amd64.s:1357","msg":"Synced successfully with APIServer"}
{"level":"info","ts":"2020-06-06T00:39:58.944Z","caller":"k8sapi/discovery.go:216","msg":"Add/Update for CNI pod aws-node-glbkt"}
{"level":"error","ts":"2020-06-06T00:39:58.960Z","caller":"ipamd/ipamd.go:328","msg":"Failed to call ec2:DescribeNetworkInterfaces for [eni-043a9bb8856789abc eni-00000099 eni-00000088]: InvalidNetworkInterfaceID.NotFound: The networkInterface ID 'eni-00000099' does not exist\n\tstatus code: 400, request id: 192e237e-dbb4-46d3-bcc1-1e184ddd19cf"}
{"level":"debug","ts":"2020-06-06T00:39:58.960Z","caller":"ipamd/ipamd.go:328","msg":"Could not find interface: The networkInterface ID 'eni-00000099' does not exist, ID: eni-00000099"}
{"level":"error","ts":"2020-06-06T00:39:59.037Z","caller":"ipamd/ipamd.go:328","msg":"Failed to call ec2:DescribeNetworkInterfaces for [eni-043a9bb8856789abc eni-00000088]: InvalidNetworkInterfaceID.NotFound: The networkInterface ID 'eni-00000088' does not exist\n\tstatus code: 400, request id: 45415af0-835d-49c1-9b60-dc235eeeacd5"}
{"level":"debug","ts":"2020-06-06T00:39:59.037Z","caller":"ipamd/ipamd.go:328","msg":"Could not find interface: The networkInterface ID 'eni-00000088' does not exist, ID: eni-00000088"}
{"level":"debug","ts":"2020-06-06T00:39:59.114Z","caller":"ipamd/ipamd.go:314","msg":"DescribeAllENIs success: ENIs: 1, tagged: 0"}
{"level":"info","ts":"2020-06-06T00:39:59.114Z","caller":"ipamd/ipamd.go:360","msg":"Setting up host network... "}

I checked both the issues you mentioned; it doesn't look like either of them.

@InAnimaTe

Yeah your issue looks different. Thanks for sharing your logs so quickly. Hopefully this one can get nabbed quickly!

@RaspiRepo

RaspiRepo commented Jun 12, 2020

I am having a similar issue (could be the same as mentioned above). Pods are not getting scheduled on those nodes:

orderer-0   1/1     Running             0          78m   100.116.130.31    ip-100-116-130-136.us-west-2.compute.internal   <none>           <none>
raft0-0     0/1     ContainerCreating   0          69m   <none>            ip-100-116-135-44.us-west-2.compute.internal    <none>           <none>
raft1-0     1/1     Running             0          78m   100.116.138.243   ip-100-116-137-235.us-west-2.compute.internal   <none>           <none>
raft2-0     1/1     Running             0          78m   100.116.131.20    ip-100-116-130-136.us-west-2.compute.internal   <none>           <none>
raft3-0     1/1     Running             0          78m   100.116.134.95    ip-100-116-133-195.us-west-2.compute.internal   <none>           <none>

It can be traced back to the AWS CNI pod crashing on that node:

aws-node-42g8r 0/1 Error 6 9m52s 100.116.135.44 ip-100-116-135-44.us-west-2.compute.internal <none> <none>

kubectl describe po aws-node-42g8r

 Type     Reason     Age                   From                                                   Message
  ----     ------     ----                  ----                                                   -------
  Normal   Scheduled  13m                   default-scheduler                                      Successfully assigned kube-system/aws-node-42g8r to ip-100-116-135-44.us-west-2.compute.internal
  Warning  Unhealthy  11m                   kubelet, ip-100-116-135-44.us-west-2.compute.internal  Readiness probe errored: rpc error: code = Unknown desc = container not running (5035fe1baa42897f47183d48d6be83c73cc91aac71e30bfc5dde612f3d1472ab)
  Normal   Pulling    11m (x4 over 13m)     kubelet, ip-100-116-135-44.us-west-2.compute.internal  Pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.1"
  Normal   Pulled     11m (x4 over 13m)     kubelet, ip-100-116-135-44.us-west-2.compute.internal  Successfully pulled image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.1"
  Normal   Created    11m (x4 over 13m)     kubelet, ip-100-116-135-44.us-west-2.compute.internal  Created container aws-node
  Normal   Started    11m (x4 over 13m)     kubelet, ip-100-116-135-44.us-west-2.compute.internal  Started container aws-node
  Warning  BackOff    3m53s (x32 over 12m)  kubelet, ip-100-116-135-44.us-west-2.compute.internal  Back-off restarting failed container

Logs from the crashing pod:

mk:~ sku $ kubectl logs -f aws-node-42g8r
Starting IPAM daemon in the background ... ok.
Checking for IPAM connectivity ...  failed.
Timed out waiting for IPAM daemon to start:

System information:

 Kernel Version:             4.14.177-139.254.amzn2.x86_64
 OS Image:                   Amazon Linux 2
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://19.3.6
 Kubelet Version:            v1.16.8-eks-e16311
 Kube-Proxy Version:         v1.16.8-eks-e16311
ProviderID:                  aws:///us-west-2b/i-0260dbe1061
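"Timed out waiting for IPAM daemon to start" just means the entrypoint script gave up waiting for ipamd to become reachable, so the real error is usually in ipamd's own log on that node. As a quick check (not an official diagnostic; port 61679 is the default introspection port):

# On the affected node: see why ipamd failed to initialize
sudo grep -i "initialization failure" /var/log/aws-routed-eni/ipamd.log | tail -n 5

# If ipamd is actually running, its local introspection endpoint responds
curl -s http://localhost:61679/v1/enis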

@RaspiRepo

Is there any recommended fix for this issue?

@mogren
Contributor

mogren commented Jun 15, 2020

Hi @RaspiRepo, unfortunately not yet. I added a work-around for the issue with the metadata in #1011, but that change has not yet made it into a new release.

@mogren
Contributor

mogren commented Jun 16, 2020

@RaspiRepo, @cshivashankar: I just published a new release candidate, v1.6.3-rc1, that we are currently testing to prepare for the final release. Like all pre-releases, it has been made available in us-west-2 in case you would like to test it out.
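Applying the release's aws-k8s-cni.yaml manifest is the recommended path; as a rough sketch for a test cluster, you could also just swap the image on the existing DaemonSet (same ECR repo as the v1.6.1 image in the events above; verify the tag is available in your region first):

# Point the aws-node DaemonSet at the release candidate image (us-west-2 ECR shown)
kubectl -n kube-system set image daemonset/aws-node \
  aws-node=602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.3-rc1

# Watch the rollout across nodes
kubectl -n kube-system rollout status daemonset/aws-node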

@cshivashankar
Author

Thanks @mogren, I will test and update.

@mogren
Contributor

mogren commented Jun 19, 2020

We just released v1.6.3 with the fix from #1011.
