aws-node pod restarts without any obvious errors #283

Closed
max-rocket-internet opened this issue Jan 9, 2019 · 33 comments

@max-rocket-internet
Contributor

EKS: v1.11.5
CNI: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0
AMI: amazon-eks-node-1.11-v20181210 (ami-0a9006fb385703b54)

E.g.:

$ kubectl -n kube-system get pods -l k8s-app=aws-node
NAME             READY   STATUS    RESTARTS   AGE
aws-node-2t4n4   1/1     Running   1          20h
aws-node-45j6l   1/1     Running   1          3h
aws-node-4mstw   1/1     Running   0          22h
aws-node-95smx   1/1     Running   0          1h
aws-node-9cz4c   1/1     Running   1          1h
aws-node-9nkzt   1/1     Running   0          3h
aws-node-9pfgq   1/1     Running   0          2h
aws-node-cr5ds   1/1     Running   1          1h
aws-node-hhtrt   1/1     Running   2          1h
aws-node-j8brm   1/1     Running   0          6d
aws-node-jvvgc   1/1     Running   1          1h
aws-node-kd7ld   1/1     Running   1          22h
aws-node-mr7dh   1/1     Running   1          1h
aws-node-ntn57   1/1     Running   1          1h
aws-node-tntxp   1/1     Running   1          2h
aws-node-vk6cz   1/1     Running   0          2h
aws-node-vtpz7   1/1     Running   1          4h
aws-node-xm9wz   1/1     Running   1          1h

Even if I describe pod aws-node-hhtrt there are no events. No interesting logs from the pod either, nor from the previous one. I looked in our logging system to get all logs from this pod and there is nothing beyond the normal startup messages. But I did see this message from pod aws-node-9cz4c:

Failed to communicate with K8S Server. Please check instance security groups or http proxy setting	

I tried to run /opt/cni/bin/aws-cni-support.sh on the node with pod aws-node-hhtrt but I get this error:

[root@ip-10-0-25-4 ~]# /opt/cni/bin/aws-cni-support.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1223  100  1223    0     0   1223      0  0:00:01 --:--:--  0:00:01 1194k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   912  100   912    0     0    912      0  0:00:01 --:--:--  0:00:01  890k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   106  100   106    0     0    106      0  0:00:01 --:--:--  0:00:01  103k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    83  100    83    0     0     83      0  0:00:01 --:--:--  0:00:01 83000
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    28  100    28    0     0     28      0  0:00:01 --:--:--  0:00:01 28000
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  6268  100  6268    0     0   6268      0  0:00:01 --:--:--  0:00:01 6121k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to localhost port 10255: Connection refused
@max-rocket-internet
Contributor Author

Hey @liwenwu-amazon,
Do you have any idea about this?

We are now seeing it in a production cluster:

$ kubectl -n kube-system get pods
NAME                                                              READY   STATUS    RESTARTS   AGE
aws-node-52mcx                                                    1/1     Running   1          6m

$ kubectl -n kube-system describe pod aws-node-52mcx
Name:           aws-node-52mcx
Namespace:      kube-system
Node:           ip-10-1-25-141.ec2.internal/10.1.25.141
Start Time:     Tue, 15 Jan 2019 09:54:30 +0100
Labels:         controller-revision-hash=3287759043
                k8s-app=aws-node
                pod-template-generation=2
Annotations:    scheduler.alpha.kubernetes.io/critical-pod:
Status:         Running
...
    State:          Running
      Started:      Tue, 15 Jan 2019 09:55:17 +0100
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 15 Jan 2019 09:54:45 +0100
      Finished:     Tue, 15 Jan 2019 09:55:15 +0100
    Ready:          True
    Restart Count:  1
   ...
Events:
  Type    Reason                 Age                   From                                  Message
  ----    ------                 ----                  ----                                  -------
  Normal  SuccessfulMountVolume  6m54s                 kubelet, ip-10-1-25-141.ec2.internal  MountVolume.SetUp succeeded for volume "cni-bin-dir"
  Normal  SuccessfulMountVolume  6m54s                 kubelet, ip-10-1-25-141.ec2.internal  MountVolume.SetUp succeeded for volume "cni-net-dir"
  Normal  SuccessfulMountVolume  6m54s                 kubelet, ip-10-1-25-141.ec2.internal  MountVolume.SetUp succeeded for volume "dockersock"
  Normal  SuccessfulMountVolume  6m54s                 kubelet, ip-10-1-25-141.ec2.internal  MountVolume.SetUp succeeded for volume "log-dir"
  Normal  SuccessfulMountVolume  6m53s                 kubelet, ip-10-1-25-141.ec2.internal  MountVolume.SetUp succeeded for volume "aws-node-token-cnmp7"
  Normal  Pulling                6m8s (x2 over 6m52s)  kubelet, ip-10-1-25-141.ec2.internal  pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0"
  Normal  Pulled                 6m7s (x2 over 6m41s)  kubelet, ip-10-1-25-141.ec2.internal  Successfully pulled image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0"
  Normal  Created                6m7s (x2 over 6m39s)  kubelet, ip-10-1-25-141.ec2.internal  Created container
  Normal  Started                6m7s (x2 over 6m39s)  kubelet, ip-10-1-25-141.ec2.internal  Started container

@nak3
Contributor

nak3 commented Jan 15, 2019

(I'm sorry if you are waiting for @liwenwu-amazon's reply) but you should collect the following logs:

  $ kubectl -n kube-system logs aws-node-52mcx -p
  $ kubectl -n kube-system get event

First, kubectl logs -p collects the previous container's logs, so if the process in the previous container exited for some reason, we can see the message.
Second, kubectl get event collects all events in the kube-system namespace. You collected describe pod, but that contains only the running pod's events; it is better to check all events in the namespace.

Note that kubectl logs -p cannot collect logs if the previous containers have already been garbage-collected. Also, kubectl get event entries might be gone after 2 hours (depending on the cluster's settings).
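
(If you want to grab the previous logs from every aws-node pod at once, before they are garbage-collected, a rough loop like this should work; it uses the k8s-app=aws-node label selector from earlier in this issue.)

  # Collect previous-container logs from all aws-node pods in one go.
  for p in $(kubectl -n kube-system get pods -l k8s-app=aws-node -o name); do
    echo "=== $p ==="
    kubectl -n kube-system logs "$p" -p --tail=100 2>/dev/null || echo "(no previous container)"
  done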

@max-rocket-internet
Contributor Author

Thanks for the reply @nak3

I already mentioned the logs in the issue text. Again, this is the message from the previous pod:

Failed to communicate with K8S Server. Please check instance security groups or http proxy setting

Getting event data is a good idea, but in this scenario it produces only Normal type events:

$ kubectl -n kube-system get event | grep aws-node
13m         13m          1       aws-node.157a0690cdccdc01                                    DaemonSet                                               Normal    SuccessfulCreate         daemonset-controller                  Created pod: aws-node-b6jrg
13m         13m          1       aws-node-b6jrg.157a0690fe521337                              Pod                                                     Normal    SuccessfulMountVolume    kubelet, ip-10-1-24-47.ec2.internal   MountVolume.SetUp succeeded for volume "cni-bin-dir"
13m         13m          1       aws-node-b6jrg.157a0690fe53507e                              Pod                                                     Normal    SuccessfulMountVolume    kubelet, ip-10-1-24-47.ec2.internal   MountVolume.SetUp succeeded for volume "log-dir"
13m         13m          1       aws-node-b6jrg.157a0690fe53085f                              Pod                                                     Normal    SuccessfulMountVolume    kubelet, ip-10-1-24-47.ec2.internal   MountVolume.SetUp succeeded for volume "cni-net-dir"
13m         13m          1       aws-node-b6jrg.157a0690fe52e142                              Pod                                                     Normal    SuccessfulMountVolume    kubelet, ip-10-1-24-47.ec2.internal   MountVolume.SetUp succeeded for volume "dockersock"
13m         13m          1       aws-node-b6jrg.157a06914d3edae2                              Pod                                                     Normal    SuccessfulMountVolume    kubelet, ip-10-1-24-47.ec2.internal   MountVolume.SetUp succeeded for volume "aws-node-token-cnmp7"
13m         13m          2       aws-node-b6jrg.157a0691a62f4b04                              Pod         spec.containers{aws-node}                   Normal    Pulling                  kubelet, ip-10-1-24-47.ec2.internal   pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0"
13m         13m          2       aws-node-b6jrg.157a06945ded769c                              Pod         spec.containers{aws-node}                   Normal    Pulled                   kubelet, ip-10-1-24-47.ec2.internal   Successfully pulled image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0"
13m         13m          2       aws-node-b6jrg.157a0694d9e113d9                              Pod         spec.containers{aws-node}                   Normal    Created                  kubelet, ip-10-1-24-47.ec2.internal   Created container
13m         13m          2       aws-node-b6jrg.157a0694debaabe7                              Pod         spec.containers{aws-node}                   Normal    Started                  kubelet, ip-10-1-24-47.ec2.internal   Started container

@nak3
Contributor

nak3 commented Jan 15, 2019

I see. (I'm sorry, I missed that you had already checked the previous pod.) How about /var/log/aws-routed-eni/{ipamd.log,plugin.log}? Have you checked those already?

@max-rocket-internet
Contributor Author

max-rocket-internet commented Jan 15, 2019

All good!

Good idea. I had a look at these logs just now and there are a few errors.

For this pod:

Containers:
  aws-node:
....
    Image:          602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0
...
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 15 Jan 2019 11:19:46 +0100
      Finished:     Tue, 15 Jan 2019 11:20:16 +0100

Looking at /var/log/aws-routed-eni/ipamd.log around this time, nothing interesting:

2019-01-15T10:19:46Z [INFO] Starting L-IPAMD v1.3.0  ...
2019-01-15T10:19:46Z [INFO] Testing communication with server
2019-01-15T10:20:18Z [INFO] Starting L-IPAMD v1.3.0  ...
2019-01-15T10:20:18Z [INFO] Testing communication with server
2019-01-15T10:20:18Z [INFO] Running with Kubernetes cluster version: v1.11+. git version: v1.11.5-eks-6bad6d. git tree state: clean. commit: 6bad6d9c768dc0864dab48a11653aa53b5a47043. platform: linux/amd64
2019-01-15T10:20:18Z [INFO] Communication with server successful
2019-01-15T10:20:18Z [INFO] Starting Pod controller

But in /var/log/aws-routed-eni/plugin.log:

2019-01-15T10:19:48Z [INFO] Starting CNI Plugin v1.3.0  ...
2019-01-15T10:19:48Z [INFO] Received CNI add request: ContainerID(2192c36a267d314cac5d81635db4173ad79c802af65c5e3853147dc9bb1c0544) Netns(/proc/5870/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=eu01-stg01-icash-web-668bf5f99b-2k6zt;K8S_POD_INFRA_CONTAINER_ID=2192c36a267d314cac5d81635db4173ad79c802af65c5e3853147dc9bb1c0544) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-01-15T10:19:49Z [INFO] Starting CNI Plugin v1.3.0  ...
2019-01-15T10:19:49Z [INFO] Received CNI del request: ContainerID(2192c36a267d314cac5d81635db4173ad79c802af65c5e3853147dc9bb1c0544) Netns(/proc/5870/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=eu01-stg01-icash-web-668bf5f99b-2k6zt;K8S_POD_INFRA_CONTAINER_ID=2192c36a267d314cac5d81635db4173ad79c802af65c5e3853147dc9bb1c0544) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-01-15T10:19:49Z [INFO] Starting CNI Plugin v1.3.0  ...
2019-01-15T10:19:49Z [INFO] Received CNI add request: ContainerID(0d0d4165483b8565a2718326cfd02b593ddcd43e2234742d640ead8e90731935) Netns(/proc/6153/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=kube-system;K8S_POD_NAME=fluent-bit-np426;K8S_POD_INFRA_CONTAINER_ID=0d0d4165483b8565a2718326cfd02b593ddcd43e2234742d640ead8e90731935) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-01-15T10:19:49Z [INFO] Starting CNI Plugin v1.3.0  ...
2019-01-15T10:19:49Z [INFO] Received CNI del request: ContainerID(0d0d4165483b8565a2718326cfd02b593ddcd43e2234742d640ead8e90731935) Netns(/proc/6153/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=kube-system;K8S_POD_NAME=fluent-bit-np426;K8S_POD_INFRA_CONTAINER_ID=0d0d4165483b8565a2718326cfd02b593ddcd43e2234742d640ead8e90731935) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-01-15T10:19:50Z [INFO] Starting CNI Plugin v1.3.0  ...
2019-01-15T10:19:50Z [INFO] Received CNI del request: ContainerID(2192c36a267d314cac5d81635db4173ad79c802af65c5e3853147dc9bb1c0544) Netns() IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=eu01-stg01-icash-web-668bf5f99b-2k6zt;K8S_POD_INFRA_CONTAINER_ID=2192c36a267d314cac5d81635db4173ad79c802af65c5e3853147dc9bb1c0544) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-01-15T10:19:50Z [ERROR] Error received from DelNetwork grpc call for pod eu01-stg01-icash-web-668bf5f99b-2k6zt namespace default container 2192c36a267d314cac5d81635db4173ad79c802af65c5e3853147dc9bb1c0544: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"

That last error repeats over and over until:

2019-01-15T10:20:20Z [INFO] Starting CNI Plugin v1.3.0  ...
2019-01-15T10:20:20Z [INFO] Received CNI del request: ContainerID(55ea31155c5f2f30c0f4eb71e0c2e4501526e5e79642683d3916434c7501a368) Netns() IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=ingress01-nginx-ingress-controller-tsqr6;K8S_POD_INFRA_CONTAINER_ID=55ea31155c5f2f30c0f4eb71e0c2e4501526e5e79642683d3916434c7501a368) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-01-15T10:20:20Z [ERROR] Failed to process delete request for pod ingress01-nginx-ingress-controller-tsqr6 namespace default container 55ea31155c5f2f30c0f4eb71e0c2e4501526e5e79642683d3916434c7501a368: <nil>
2019-01-15T10:20:20Z [INFO] Starting CNI Plugin v1.3.0  ...

Both errors repeat a few more times and then there is nothing more in this log.

I hope that helps!

@liwenwu-amazon
Contributor

@max-rocket-internet
For Failed to communicate with K8S Server. Please check instance security groups or http proxy setting, you should open a support ticket with the AWS EKS service to see why your worker can NOT communicate with the EKS master through the EKS-owned ENI at that timestamp.

For the error when running /opt/cni/bin/aws-cni-support.sh, please see #285.

For Failed to process delete request for pod ingress01-nginx-ingress-controller-tsqr6 namespace default container 55ea31155c5f2f30c0f4eb71e0c2e4501526e5e79642683d3916434c7501a368: <nil>, please take a look at /var/log/aws-routed-eni/ipamd.log.xxx at that timestamp to see why it failed to process the delete.

@max-rocket-internet
Contributor Author

For Failed to process delete request for pod ingress01-nginx-ingress-controller-tsqr6 namespace default container 55ea31155c5f2f30c0f4eb71e0c2e4501526e5e79642683d3916434c7501a368: <nil>, please take a look at /var/log/aws-routed-eni/ipamd.log.xxx at that timestamp to see why it failed to process the delete.

OK, maybe it's unrelated to this problem then but here it is:

2019-01-15T10:20:20Z [WARN] UnassignIPv4Address: Failed to find pod ingress01-nginx-ingress-controller-tsqr6 namespace default Container 55ea31155c5f2f30c0f4eb71e0c2e4501526e5e79642683d3916434c7501a368
2019-01-15T10:20:20Z [WARN] UnassignIPv4Address: Failed to find pod ingress01-nginx-ingress-controller-tsqr6 namespace default Container

you should open a support TT with AWS EKS service and see why your worker can NOT communicate with EKS master through EKS-Owned ENI at that timestamp

I can do this, but we have 4 different clusters in different regions and they all have the same problem. That's why I thought it was more likely a problem with the CNI or related config than with an AWS service. But OK, I'll open a support ticket as well.

@max-rocket-internet
Contributor Author

2 more things...

  1. Doesn't my log output above show the CNI starting multiple times, since it's printing Starting CNI Plugin v1.3.0 ... over and over?

  2. We see the message Failed to communicate with K8S Server a lot: 241 messages over the last week, but only a few pod restarts. So I think perhaps the two are not related? Can you tell me if that message causes the pod to restart?

@micahhausler
Member

Yeah, this looks like a race condition where aws-node (with its Kubernetes client code) comes up before kube-proxy (and possibly kube-dns/coredns) has set up the kubernetes.default.svc.cluster.local DNS address. If you run the following command on your EKS cluster, you'll see that kube-proxy uses the public endpoint.

$ kubectl get po \
  -n kube-system \
  -l k8s-app=kube-proxy \
  -o jsonpath='{range .items[0].spec.containers[0].command[2]}{@}{"\n"}{end}'

We need to update CNI to have a similar flag so it isn't dependent on DNS or kube-proxy being up.
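
A quick way to see that race from an affected node is to check whether kube-proxy is running and whether the in-cluster kubernetes Service VIP answers yet. This is only a rough sketch: 172.20.0.1 is an assumed ClusterIP (use kubectl get svc kubernetes to find yours) and it assumes the node runs the Docker runtime.

# From the node: is kube-proxy up yet?
docker ps --format '{{.Names}}' | grep kube-proxy || echo "kube-proxy not up yet"

# Can we reach the kubernetes Service VIP? Even an HTTP 401/403 here proves the
# VIP is reachable; connection refused or a timeout suggests kube-proxy has not
# programmed its rules yet.
timeout 5 curl -sk https://172.20.0.1:443/version || echo "service VIP not reachable yet"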

@max-rocket-internet
Contributor Author

Cool, thanks for the update @micahhausler

@max-rocket-internet
Contributor Author

Any ETA for a fix?

@tabern

tabern commented Mar 5, 2019

Duplicate of #282
We are targeting a fix as part of v1.4 (https://github.com/aws/amazon-vpc-cni-k8s/milestone/5)

tabern marked this as a duplicate of #282 Mar 5, 2019
tabern closed this as completed Mar 5, 2019
mogren reopened this Mar 26, 2019
@mogren
Contributor

mogren commented Mar 26, 2019

I think #282 is a different issue, reopening this one. From the digging so far, there seems to be an open issue in Kubernetes where ENV variables are not always set. The operator-framework has a workaround for it that we are looking into.
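
A rough way to check whether you are hitting that gap is to confirm that the standard in-cluster API env vars are actually present in an affected aws-node pod (the pod name below is just an example taken from earlier in this thread):

# Verify the in-cluster API env vars inside an affected aws-node pod.
kubectl -n kube-system exec aws-node-52mcx -- env | grep KUBERNETES_SERVICE

# Expected on a healthy pod (values depend on the cluster), e.g.:
#   KUBERNETES_SERVICE_HOST=172.20.0.1
#   KUBERNETES_SERVICE_PORT=443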

@tiffanyfay
Contributor

Fixed by #364.

@kivagant-ba

JFYI: I ran into the issue, and after upgrading aws-node (amazon-k8s-cni) to 1.4 I got the error from the pod on the same node.

➜ kubectl logs -n kube-system aws-node-xm4dd
=====Starting installing AWS-CNI =========
=====Starting amazon-k8s-agent ===========
ERROR: logging before flag.Parse: E0418 16:22:29.501108      11 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://172.20.0.1:443/api?timeout=32s: dial tcp 172.20.0.1:443: connect: connection timed out)
Failed to communicate with K8S Server. Please check instance security groups or http proxy setting%

Restarting the pod didn't help.

All other nodes were OK and still are. I terminated the broken node to replace it with another one.

@tiffanyfay
Contributor

@kivagant-ba Darn, thanks for letting us know. Any chance you ran https://github.com/aws/amazon-vpc-cni-k8s/blob/master/scripts/aws-cni-support.sh ?

@kivagant-ba

The node is dead now, but if I hit the problem again I'll run the script. Thank you for sharing.

@kivagant-ba

I faced this again.

➜ k logs -n kube-system aws-node-mgf7v
=====Starting installing AWS-CNI =========
=====Starting amazon-k8s-agent ===========
ERROR: logging before flag.Parse: E0603 11:32:43.025918      11 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://172.20.0.1:443/api?timeout=32s: dial tcp 172.20.0.1:443: connect: connection timed out)
Failed to communicate with K8S Server. Please check instance security groups or http proxy setting%
  Warning  FailedCreatePodSandBox  8m15s                kubelet, ip-10-51-31-71.ec2.internal  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "295d63fad14d812c84ddbda1c93be193bf2bab63ac8ef8a599c5190ad885b0e8" network for pod "tables-job-1559561271-sbjmq": NetworkPlugin cni failed to set up pod "tables-job-1559561271-sbjmq_consumer" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "295d63fad14d812c84ddbda1c93be193bf2bab63ac8ef8a599c5190ad885b0e8" network for pod "tables-job-1559561271-sbjmq": NetworkPlugin cni failed to teardown pod "tables-job-1559561271-sbjmq_consumer" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
  Normal   SandboxChanged          3m (x25 over 8m15s)  kubelet, ip-10-51-31-71.ec2.internal  Pod sandbox changed, it will be killed and re-created.
# /opt/cni/bin/aws-cni-support.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to localhost port 61678: Connection refused

@mogren
Contributor

mogren commented Jun 4, 2019

@kivagant-ba The Failed to communicate with K8S Server error is worrying, could the masters be overloaded? How many nodes and pods are you running?

@kivagant-ba

kivagant-ba commented Jun 4, 2019

It's EKS and I don't think the masters are overloaded. The cluster is almost empty and only random nodes periodically experience the issue.

Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.0", GitCommit:"641856db18352033a0d96dbc99153fa3b27298e5", GitTreeState:"clean", BuildDate:"2019-03-26T00:05:06Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.6-eks-d69f1b", GitCommit:"d69f1bf3669bf00b7f4a758e978e0e7a1e3a68f7", GitTreeState:"clean", BuildDate:"2019-02-28T20:26:10Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

Right now we have 6 nodes on the cluster.

@mattparkes

I had this issue (or an extremely similar one) after using eksctl to upgrade aws-node, coredns and kube-proxy in an EKS cluster going from 1.11 to 1.12 and then 1.13.

I did 2 things which ultimately fixed this issue for me in all my clusters (unfortunately I haven't narrowed it down to 1 thing yet):

  1. Restored my manifests for the above 3 Deployments/Daemonsets and then just patched the image field manually (eksctl changes several other fields)
  2. Upgraded to aws-node 1.5.0, which as of the past few days AWS now recommends: https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html

@kivagant-ba

kivagant-ba commented Aug 22, 2019

- image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.5.3
➜ k logs -n kube-system aws-node-c7kv6
====== Installing AWS-CNI ======
====== Starting amazon-k8s-agent ======
ERROR: logging before flag.Parse: E0822 20:13:30.043898       9 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://172.20.0.1:443/api?timeout=32s: dial tcp 172.20.0.1:443: connect: connection timed out)
Failed to communicate with K8S Server. Please check instance security groups or http proxy setting%

Added 7 more nodes at once to the cluster and none of them can start, while the older nodes work fine.

@kivagant-ba

@tiffanyfay, where can I send the CNI logs privately?

@tiffanyfay
Contributor

@kivagant-ba mogren[at]amazon.com and tfj[at]amazon.com please! Thanks.

@kivagant-ba

Before I send this... I'm watching the get pods output right now and I can see the pods starting one after another. ... aaand that's it. It took almost an hour for the oldest of them to finally start.

@mogren
Contributor

mogren commented Aug 22, 2019

@kivagant-ba Also, this PR that got merged into 1.14 seems relevant: kubernetes/kubernetes#70994.

@kivagant-ba

kivagant-ba commented Aug 22, 2019

@tiffanyfay can I contact you using an official AWS support channel?

UPDATE: I created a support request to move the conversation to a more official area.

@cgkades

cgkades commented Sep 25, 2019

We've run into a similar issue, and for us, /etc/cni/10-aws.conflist is missing. Copying that from another node allows us to continue. We've engaged AWS support. Does anyone know how that file gets put on the node?

UPDATE: It looks like CNI version 1.5.3 causes the issue, but if I edit the daemonset to 1.5.1, all the nodes come up.
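
For anyone wanting to try the same workaround, pinning the daemonset image back to 1.5.1 looks roughly like this (a sketch; the registry account/region is the us-west-2 one used elsewhere in this thread, so adjust it for your cluster):

# Roll the aws-node daemonset back to the v1.5.1 image.
kubectl -n kube-system set image daemonset/aws-node \
  aws-node=602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.5.1

# Watch the pods roll and confirm they come back up Ready.
kubectl -n kube-system get pods -l k8s-app=aws-node -w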

@mogren
Contributor

mogren commented Sep 25, 2019

If the issue is in CNI v1.5.3, the problem should be visible in the log. The change to not copy the config until ipamd is running happened in commit f9acdeb and was made to avoid scheduling pods to that node until we have ENIs and IPs available. If ipamd cannot talk to the Kubernetes API server, it will not copy the plugin binary or the config file, and it will not make the node Ready.
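
So when a node gets stuck like this, it is worth checking on the node whether ipamd is up and whether it has installed the config yet. A rough sketch, assuming the default paths and the local port that aws-cni-support.sh tries above:

# Is ipamd answering locally? (61678 is the port the support script curls above.)
curl -s --max-time 5 http://localhost:61678/metrics > /dev/null \
  && echo "ipamd is up" || echo "ipamd not responding yet"

# Has ipamd installed the CNI config and plugin binary yet? (assumed default paths)
ls -l /etc/cni/net.d/10-aws.conflist /opt/cni/bin/aws-cni

# If not, the ipamd log should say why it cannot reach the API server.
tail -n 50 /var/log/aws-routed-eni/ipamd.log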

@ShahafSW

ShahafSW commented Nov 8, 2019

We've run into a similar issue, and for us, /etc/cni/10-aws.conflist is missing. Copying that from another node allows us to continue. We've engaged AWS support. Does anyone know how that file gets put on the node?

UPDATE: It looks like CNI version 1.5.3 causes the issue, but if I edit the daemonset to 1.5.1, all the nodes come up.

This solution worked for me as well!

@aweis89

aweis89 commented Jan 8, 2020

@kivagant-ba Darn, thanks for letting us know. Any chance you ran https://github.com/aws/amazon-vpc-cni-k8s/blob/master/scripts/aws-cni-support.sh ?

I've run into the same issue, and running that script on all masters didn't seem to resolve it:

====== Installing AWS-CNI ======
====== Starting amazon-k8s-agent ======
ERROR: logging before flag.Parse: E0108 17:11:43.934767 9 memcache.go:138] couldn't get current server API group list; will keep using cached value.

@jaypipes
Contributor

jaypipes commented Jan 8, 2020

Hi @aweis89, the support script does not resolve any issues. Rather, it simply gathers information useful during troubleshooting.

Can you let us know what version of the CNI plugin you are deploying and what version of Kubernetes you are using? Thanks much!

@aweis89

aweis89 commented Jan 8, 2020

Hey @jaypipes!

I've tried a couple of different versions, from 1.5.0 up to the one I'm currently using (1.5.5), but the issue persists. The K8S version is 1.14.10. The full log I'm seeing is:

====== Installing AWS-CNI ======
====== Starting amazon-k8s-agent ======
ERROR: logging before flag.Parse: E0108 17:39:47.947983 9 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://100.66.122.254:443/api?timeout=32s: x509: certificate is valid for 100.64.0.1, 127.0.0.1, not 100.66.122.254)
Failed to communicate with K8S Server. Please check instance security groups or http proxy setting

Notably, 100.64.0.1 seems to be the correct IP for the kube-apiserver (as determined from the ClusterIP of the kubernetes service in the default namespace).

Okay, while typing this I managed to find the root cause of the issue. In my case, the issue was that there was a kubernetes SVC in the kube-system namespace as well as one in the default namespace. Apparently the CNI pod first looks for a kubernetes SVC in the kube-system namespace (which is not maintained or updated by the controller). (Additionally, even though the kubernetes SVC in the kube-system namespace was load-balancing to the correct IPs, the cert was only valid for the IP of the kubernetes SVC in the default namespace.) Simply deleting the SVC in the kube-system namespace fixed the issue.

I'm not sure if this is a bug or intended behavior. But it should probably be noted somewhere that an SVC with the name kubernetes in the kube-system namespace will cause the CNI to fail if it's not properly set up to talk to the apiserver (even if there's a properly configured SVC in the default namespace). I think it might make sense to reverse the lookup order for the kubernetes SVC and prioritize the default namespace since that's the one designed to talk to the apiserver per the K8S docs: https://kubernetes.io/docs/tasks/administer-cluster/access-cluster-api/#directly-accessing-the-rest-api-1
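
For anyone hitting the same thing, the check and the fix described above boil down to roughly this (deleting a Service is destructive, so make sure the kube-system copy really is an unmanaged stray before removing it):

# Is there a stray "kubernetes" Service in kube-system alongside the real one in default?
kubectl get svc kubernetes -n default
kubectl get svc kubernetes -n kube-system

# If the kube-system copy exists and nothing manages it, removing it made the
# CNI use the default-namespace Service (and its valid cert) in my case.
kubectl -n kube-system delete svc kubernetes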
