aws-node pod restarts without any obvious errors #283

Closed
max-rocket-internet opened this issue Jan 9, 2019 · 33 comments

@max-rocket-internet
Contributor

EKS: v1.11.5
CNI: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0
AMI: amazon-eks-node-1.11-v20181210 (ami-0a9006fb385703b54)

E.g.:

$ kubectl -n kube-system get pods -l k8s-app=aws-node
NAME             READY   STATUS    RESTARTS   AGE
aws-node-2t4n4   1/1     Running   1          20h
aws-node-45j6l   1/1     Running   1          3h
aws-node-4mstw   1/1     Running   0          22h
aws-node-95smx   1/1     Running   0          1h
aws-node-9cz4c   1/1     Running   1          1h
aws-node-9nkzt   1/1     Running   0          3h
aws-node-9pfgq   1/1     Running   0          2h
aws-node-cr5ds   1/1     Running   1          1h
aws-node-hhtrt   1/1     Running   2          1h
aws-node-j8brm   1/1     Running   0          6d
aws-node-jvvgc   1/1     Running   1          1h
aws-node-kd7ld   1/1     Running   1          22h
aws-node-mr7dh   1/1     Running   1          1h
aws-node-ntn57   1/1     Running   1          1h
aws-node-tntxp   1/1     Running   1          2h
aws-node-vk6cz   1/1     Running   0          2h
aws-node-vtpz7   1/1     Running   1          4h
aws-node-xm9wz   1/1     Running   1          1h

Even if I describe pod aws-node-hhtrt there are no events. No interesting logs from the pod either, nor from the previous one. I looked in our logging system to get all logs from this pod and there is nothing beyond the normal startup messages. But I did see this message from pod aws-node-9cz4c:

Failed to communicate with K8S Server. Please check instance security groups or http proxy setting	

I tried to run /opt/cni/bin/aws-cni-support.sh on the node with pod aws-node-hhtrt but I get this error:

[root@ip-10-0-25-4 ~]# /opt/cni/bin/aws-cni-support.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1223  100  1223    0     0   1223      0  0:00:01 --:--:--  0:00:01 1194k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   912  100   912    0     0    912      0  0:00:01 --:--:--  0:00:01  890k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   106  100   106    0     0    106      0  0:00:01 --:--:--  0:00:01  103k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    83  100    83    0     0     83      0  0:00:01 --:--:--  0:00:01 83000
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    28  100    28    0     0     28      0  0:00:01 --:--:--  0:00:01 28000
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  6268  100  6268    0     0   6268      0  0:00:01 --:--:--  0:00:01 6121k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to localhost port 10255: Connection refused
@max-rocket-internet
Contributor Author

Hey @liwenwu-amazon,
Do you have any idea about this?

We are now seeing it in a production cluster:

$ kubectl -n kube-system get pods
NAME                                                              READY   STATUS    RESTARTS   AGE
aws-node-52mcx                                                    1/1     Running   1          6m

$ kubectl -n kube-system describe pod aws-node-52mcx
Name:           aws-node-52mcx
Namespace:      kube-system
Node:           ip-10-1-25-141.ec2.internal/10.1.25.141
Start Time:     Tue, 15 Jan 2019 09:54:30 +0100
Labels:         controller-revision-hash=3287759043
                k8s-app=aws-node
                pod-template-generation=2
Annotations:    scheduler.alpha.kubernetes.io/critical-pod:
Status:         Running
...
    State:          Running
      Started:      Tue, 15 Jan 2019 09:55:17 +0100
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 15 Jan 2019 09:54:45 +0100
      Finished:     Tue, 15 Jan 2019 09:55:15 +0100
    Ready:          True
    Restart Count:  1
   ...
Events:
  Type    Reason                 Age                   From                                  Message
  ----    ------                 ----                  ----                                  -------
  Normal  SuccessfulMountVolume  6m54s                 kubelet, ip-10-1-25-141.ec2.internal  MountVolume.SetUp succeeded for volume "cni-bin-dir"
  Normal  SuccessfulMountVolume  6m54s                 kubelet, ip-10-1-25-141.ec2.internal  MountVolume.SetUp succeeded for volume "cni-net-dir"
  Normal  SuccessfulMountVolume  6m54s                 kubelet, ip-10-1-25-141.ec2.internal  MountVolume.SetUp succeeded for volume "dockersock"
  Normal  SuccessfulMountVolume  6m54s                 kubelet, ip-10-1-25-141.ec2.internal  MountVolume.SetUp succeeded for volume "log-dir"
  Normal  SuccessfulMountVolume  6m53s                 kubelet, ip-10-1-25-141.ec2.internal  MountVolume.SetUp succeeded for volume "aws-node-token-cnmp7"
  Normal  Pulling                6m8s (x2 over 6m52s)  kubelet, ip-10-1-25-141.ec2.internal  pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0"
  Normal  Pulled                 6m7s (x2 over 6m41s)  kubelet, ip-10-1-25-141.ec2.internal  Successfully pulled image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0"
  Normal  Created                6m7s (x2 over 6m39s)  kubelet, ip-10-1-25-141.ec2.internal  Created container
  Normal  Started                6m7s (x2 over 6m39s)  kubelet, ip-10-1-25-141.ec2.internal  Started container

@nak3
Contributor

nak3 commented Jan 15, 2019

(I'm sorry if you are waiting for @liwenwu-amazon's reply) but you should collect the following logs:

  $ kubectl -n kube-system logs aws-node-52mcx -p
  $ kubectl -n kube-system get event

First, kubectl logs -p collects the previous container's logs, so if the process in the previous container exited for some reason, we can see the message.
Second, kubectl get event collects all events in the kube-system namespace. You collected describe pod, but that contains only the running pod's events; it is better to check all events in the namespace.

Note that kubectl logs -p cannot collect logs if the previous containers have already been garbage-collected. Also, kubectl get event entries might be gone after 2 hours (depending on the cluster's settings).
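
(If you want to grab the previous logs from every aws-node pod at once, before they are garbage-collected, a rough loop like this should work; it uses the k8s-app=aws-node label selector from earlier in this issue.)

  # Collect previous-container logs from all aws-node pods in one go.
  for p in $(kubectl -n kube-system get pods -l k8s-app=aws-node -o name); do
    echo "=== $p ==="
    kubectl -n kube-system logs "$p" -p --tail=100 2>/dev/null || echo "(no previous container)"
  done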

@max-rocket-internet
Contributor Author

Thanks for the reply @nak3

I already mentioned the logs in the issue text. Again, this is the message from the previous pod:

Failed to communicate with K8S Server. Please check instance security groups or http proxy setting

Getting event data is a good idea, but in this scenario it produces only Normal type events:

$ kubectl -n kube-system get event | grep aws-node
13m         13m          1       aws-node.157a0690cdccdc01                                    DaemonSet                                               Normal    SuccessfulCreate         daemonset-controller                  Created pod: aws-node-b6jrg
13m         13m          1       aws-node-b6jrg.157a0690fe521337                              Pod                                                     Normal    SuccessfulMountVolume    kubelet, ip-10-1-24-47.ec2.internal   MountVolume.SetUp succeeded for volume "cni-bin-dir"
13m         13m          1       aws-node-b6jrg.157a0690fe53507e                              Pod                                                     Normal    SuccessfulMountVolume    kubelet, ip-10-1-24-47.ec2.internal   MountVolume.SetUp succeeded for volume "log-dir"
13m         13m          1       aws-node-b6jrg.157a0690fe53085f                              Pod                                                     Normal    SuccessfulMountVolume    kubelet, ip-10-1-24-47.ec2.internal   MountVolume.SetUp succeeded for volume "cni-net-dir"
13m         13m          1       aws-node-b6jrg.157a0690fe52e142                              Pod                                                     Normal    SuccessfulMountVolume    kubelet, ip-10-1-24-47.ec2.internal   MountVolume.SetUp succeeded for volume "dockersock"
13m         13m          1       aws-node-b6jrg.157a06914d3edae2                              Pod                                                     Normal    SuccessfulMountVolume    kubelet, ip-10-1-24-47.ec2.internal   MountVolume.SetUp succeeded for volume "aws-node-token-cnmp7"
13m         13m          2       aws-node-b6jrg.157a0691a62f4b04                              Pod         spec.containers{aws-node}                   Normal    Pulling                  kubelet, ip-10-1-24-47.ec2.internal   pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0"
13m         13m          2       aws-node-b6jrg.157a06945ded769c                              Pod         spec.containers{aws-node}                   Normal    Pulled                   kubelet, ip-10-1-24-47.ec2.internal   Successfully pulled image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0"
13m         13m          2       aws-node-b6jrg.157a0694d9e113d9                              Pod         spec.containers{aws-node}                   Normal    Created                  kubelet, ip-10-1-24-47.ec2.internal   Created container
13m         13m          2       aws-node-b6jrg.157a0694debaabe7                              Pod         spec.containers{aws-node}                   Normal    Started                  kubelet, ip-10-1-24-47.ec2.internal   Started container

@nak3
Contributor

nak3 commented Jan 15, 2019

I see. (I'm sorry, I missed that you had already checked the previous pod.) How about /var/log/aws-routed-eni/{ipamd.log,plugin.log}? Have you checked those already?

@max-rocket-internet
Contributor Author

max-rocket-internet commented Jan 15, 2019

All good!

Good idea. I had a look at these logs just now and there are a few errors.

For this pod:

Containers:
  aws-node:
....
    Image:          602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0
...
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 15 Jan 2019 11:19:46 +0100
      Finished:     Tue, 15 Jan 2019 11:20:16 +0100

Looking at /var/log/aws-routed-eni/ipamd.log around this time, nothing interesting:

2019-01-15T10:19:46Z [INFO] Starting L-IPAMD v1.3.0  ...
2019-01-15T10:19:46Z [INFO] Testing communication with server
2019-01-15T10:20:18Z [INFO] Starting L-IPAMD v1.3.0  ...
2019-01-15T10:20:18Z [INFO] Testing communication with server
2019-01-15T10:20:18Z [INFO] Running with Kubernetes cluster version: v1.11+. git version: v1.11.5-eks-6bad6d. git tree state: clean. commit: 6bad6d9c768dc0864dab48a11653aa53b5a47043. platform: linux/amd64
2019-01-15T10:20:18Z [INFO] Communication with server successful
2019-01-15T10:20:18Z [INFO] Starting Pod controller

But in /var/log/aws-routed-eni/plugin.log:

2019-01-15T10:19:48Z [INFO] Starting CNI Plugin v1.3.0  ...
2019-01-15T10:19:48Z [INFO] Received CNI add request: ContainerID(2192c36a267d314cac5d81635db4173ad79c802af65c5e3853147dc9bb1c0544) Netns(/proc/5870/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=eu01-stg01-icash-web-668bf5f99b-2k6zt;K8S_POD_INFRA_CONTAINER_ID=2192c36a267d314cac5d81635db4173ad79c802af65c5e3853147dc9bb1c0544) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-01-15T10:19:49Z [INFO] Starting CNI Plugin v1.3.0  ...
2019-01-15T10:19:49Z [INFO] Received CNI del request: ContainerID(2192c36a267d314cac5d81635db4173ad79c802af65c5e3853147dc9bb1c0544) Netns(/proc/5870/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=eu01-stg01-icash-web-668bf5f99b-2k6zt;K8S_POD_INFRA_CONTAINER_ID=2192c36a267d314cac5d81635db4173ad79c802af65c5e3853147dc9bb1c0544) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-01-15T10:19:49Z [INFO] Starting CNI Plugin v1.3.0  ...
2019-01-15T10:19:49Z [INFO] Received CNI add request: ContainerID(0d0d4165483b8565a2718326cfd02b593ddcd43e2234742d640ead8e90731935) Netns(/proc/6153/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=kube-system;K8S_POD_NAME=fluent-bit-np426;K8S_POD_INFRA_CONTAINER_ID=0d0d4165483b8565a2718326cfd02b593ddcd43e2234742d640ead8e90731935) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-01-15T10:19:49Z [INFO] Starting CNI Plugin v1.3.0  ...
2019-01-15T10:19:49Z [INFO] Received CNI del request: ContainerID(0d0d4165483b8565a2718326cfd02b593ddcd43e2234742d640ead8e90731935) Netns(/proc/6153/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=kube-system;K8S_POD_NAME=fluent-bit-np426;K8S_POD_INFRA_CONTAINER_ID=0d0d4165483b8565a2718326cfd02b593ddcd43e2234742d640ead8e90731935) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-01-15T10:19:50Z [INFO] Starting CNI Plugin v1.3.0  ...
2019-01-15T10:19:50Z [INFO] Received CNI del request: ContainerID(2192c36a267d314cac5d81635db4173ad79c802af65c5e3853147dc9bb1c0544) Netns() IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=eu01-stg01-icash-web-668bf5f99b-2k6zt;K8S_POD_INFRA_CONTAINER_ID=2192c36a267d314cac5d81635db4173ad79c802af65c5e3853147dc9bb1c0544) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-01-15T10:19:50Z [ERROR] Error received from DelNetwork grpc call for pod eu01-stg01-icash-web-668bf5f99b-2k6zt namespace default container 2192c36a267d314cac5d81635db4173ad79c802af65c5e3853147dc9bb1c0544: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"

That last error repeats over and over until:

2019-01-15T10:20:20Z [INFO] Starting CNI Plugin v1.3.0  ...
2019-01-15T10:20:20Z [INFO] Received CNI del request: ContainerID(55ea31155c5f2f30c0f4eb71e0c2e4501526e5e79642683d3916434c7501a368) Netns() IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=ingress01-nginx-ingress-controller-tsqr6;K8S_POD_INFRA_CONTAINER_ID=55ea31155c5f2f30c0f4eb71e0c2e4501526e5e79642683d3916434c7501a368) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-01-15T10:20:20Z [ERROR] Failed to process delete request for pod ingress01-nginx-ingress-controller-tsqr6 namespace default container 55ea31155c5f2f30c0f4eb71e0c2e4501526e5e79642683d3916434c7501a368: <nil>
2019-01-15T10:20:20Z [INFO] Starting CNI Plugin v1.3.0  ...

Both errors repeat a few more times and then there is nothing more in this log.

I hope that helps!

@liwenwu-amazon
Contributor

@max-rocket-internet
For Failed to communicate with K8S Server. Please check instance security groups or http proxy setting, you should open a support ticket with the AWS EKS service to see why your worker can NOT communicate with the EKS master through the EKS-owned ENI at that timestamp.

For the error when running /opt/cni/bin/aws-cni-support.sh, please see #285.

For Failed to process delete request for pod ingress01-nginx-ingress-controller-tsqr6 namespace default container 55ea31155c5f2f30c0f4eb71e0c2e4501526e5e79642683d3916434c7501a368: <nil>, please take a look at /var/log/aws-routed-eni/ipamd.log.xxx at that timestamp to see why it failed to process the delete.

@max-rocket-internet
Contributor Author

For Failed to process delete request for pod ingress01-nginx-ingress-controller-tsqr6 namespace default container 55ea31155c5f2f30c0f4eb71e0c2e4501526e5e79642683d3916434c7501a368: <nil>, please take a look at /var/log/aws-routed-eni/ipamd.log.xxx at that timestamp to see why it failed to process the delete.

OK, maybe it's unrelated to this problem then but here it is:

2019-01-15T10:20:20Z [WARN] UnassignIPv4Address: Failed to find pod ingress01-nginx-ingress-controller-tsqr6 namespace default Container 55ea31155c5f2f30c0f4eb71e0c2e4501526e5e79642683d3916434c7501a368
2019-01-15T10:20:20Z [WARN] UnassignIPv4Address: Failed to find pod ingress01-nginx-ingress-controller-tsqr6 namespace default Container

you should open a support TT with AWS EKS service and see why your worker can NOT communicate with EKS master through EKS-Owned ENI at that timestamp

I can do this, but we have 4 different clusters in different regions and they all have the same problem. That's why I thought it was more likely a problem with the CNI or related config than with an AWS service. But OK, I'll open a support ticket as well.

@max-rocket-internet
Contributor Author

2 more things...

  1. Doesn't my log output above show the CNI starting multiple times, since it's printing Starting CNI Plugin v1.3.0 ... over and over?

  2. We see the message Failed to communicate with K8S Server a lot: 241 messages over the last week, but only a few pod restarts. So I think perhaps the two are not related? Can you tell me if that message causes the pod to restart?

@micahhausler
Member

Yeah, this looks like a race condition where aws-node (with its Kubernetes client code) comes up before kube-proxy (and possibly kube-dns/coredns) has set up the kubernetes.default.svc.cluster.local DNS address. If you run the following command on your EKS cluster, you'll see that kube-proxy uses the public endpoint.

$ kubectl get po \
  -n kube-system \
  -l k8s-app=kube-proxy \
  -o jsonpath='{range .items[0].spec.containers[0].command[2]}{@}{"\n"}{end}'

We need to update CNI to have a similar flag so it isn't dependent on DNS or kube-proxy being up.
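
A quick way to see that race from an affected node is to check whether kube-proxy is running and whether the in-cluster kubernetes Service VIP answers yet. This is only a rough sketch: 172.20.0.1 is an assumed ClusterIP (use kubectl get svc kubernetes to find yours) and it assumes the node runs the Docker runtime.

# From the node: is kube-proxy up yet?
docker ps --format '{{.Names}}' | grep kube-proxy || echo "kube-proxy not up yet"

# Can we reach the kubernetes Service VIP? Even an HTTP 401/403 here proves the
# VIP is reachable; connection refused or a timeout suggests kube-proxy has not
# programmed its rules yet.
timeout 5 curl -sk https://172.20.0.1:443/version || echo "service VIP not reachable yet"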

@max-rocket-internet
Contributor Author

Cool, thanks for the update @micahhausler

@max-rocket-internet
Contributor Author

Any ETA for a fix?

@tabern

tabern commented Mar 5, 2019

Duplicate of #282
We are targeting a fix as part of v1.4 (https://github.com/aws/amazon-vpc-cni-k8s/milestone/5)

tabern marked this as a duplicate of #282 Mar 5, 2019
tabern closed this as completed Mar 5, 2019
mogren reopened this Mar 26, 2019
@mogren
Contributor

mogren commented Mar 26, 2019

I think #282 is a different issue, reopening this one. From the digging so far, there seems to be an open issue in Kubernetes where ENV variables are not always set. The operator-framework has a workaround for it that we are looking into.
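
A rough way to check whether you are hitting that gap is to confirm that the standard in-cluster API env vars are actually present in an affected aws-node pod (the pod name below is just an example taken from earlier in this thread):

# Verify the in-cluster API env vars inside an affected aws-node pod.
kubectl -n kube-system exec aws-node-52mcx -- env | grep KUBERNETES_SERVICE

# Expected on a healthy pod (values depend on the cluster), e.g.:
#   KUBERNETES_SERVICE_HOST=172.20.0.1
#   KUBERNETES_SERVICE_PORT=443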

@tiffanyfay
Contributor

Fixed by #364.

@kivagant-ba

JFYI: I ran into the issue, and after upgrading aws-node (amazon-k8s-cni) to 1.4 I got the error from the pod on the same node.

➜ kubectl logs -n kube-system aws-node-xm4dd
=====Starting installing AWS-CNI =========
=====Starting amazon-k8s-agent ===========
ERROR: logging before flag.Parse: E0418 16:22:29.501108      11 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://172.20.0.1:443/api?timeout=32s: dial tcp 172.20.0.1:443: connect: connection timed out)
Failed to communicate with K8S Server. Please check instance security groups or http proxy setting%

Restarting the pod didn't help.

All other nodes were OK and still are. I terminated the broken node to replace it with another one.

@tiffanyfay
Contributor

@kivagant-ba Darn, thanks for letting us know. Any chance you ran https://github.com/aws/amazon-vpc-cni-k8s/blob/master/scripts/aws-cni-support.sh ?

@kivagant-ba

The node is dead now, but if I hit the problem again I'll run the script. Thank you for sharing.

@kivagant-ba

I faced this again.

➜ k logs -n kube-system aws-node-mgf7v
=====Starting installing AWS-CNI =========
=====Starting amazon-k8s-agent ===========
ERROR: logging before flag.Parse: E0603 11:32:43.025918      11 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://172.20.0.1:443/api?timeout=32s: dial tcp 172.20.0.1:443: connect: connection timed out)
Failed to communicate with K8S Server. Please check instance security groups or http proxy setting%
  Warning  FailedCreatePodSandBox  8m15s                kubelet, ip-10-51-31-71.ec2.internal  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "295d63fad14d812c84ddbda1c93be193bf2bab63ac8ef8a599c5190ad885b0e8" network for pod "tables-job-1559561271-sbjmq": NetworkPlugin cni failed to set up pod "tables-job-1559561271-sbjmq_consumer" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "295d63fad14d812c84ddbda1c93be193bf2bab63ac8ef8a599c5190ad885b0e8" network for pod "tables-job-1559561271-sbjmq": NetworkPlugin cni failed to teardown pod "tables-job-1559561271-sbjmq_consumer" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
  Normal   SandboxChanged          3m (x25 over 8m15s)  kubelet, ip-10-51-31-71.ec2.internal  Pod sandbox changed, it will be killed and re-created.
# /opt/cni/bin/aws-cni-support.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to localhost port 61678: Connection refused

@mogren
Contributor

mogren commented Jun 4, 2019

@kivagant-ba The Failed to communicate with K8S Server error is worrying, could the masters be overloaded? How many nodes and pods are you running?

@kivagant-ba

kivagant-ba commented Jun 4, 2019

It's EKS and I don't think the masters are overloaded. The cluster is almost empty and only random nodes periodically experience the issue.

Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.0", GitCommit:"641856db18352033a0d96dbc99153fa3b27298e5", GitTreeState:"clean", BuildDate:"2019-03-26T00:05:06Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.6-eks-d69f1b", GitCommit:"d69f1bf3669bf00b7f4a758e978e0e7a1e3a68f7", GitTreeState:"clean", BuildDate:"2019-02-28T20:26:10Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

Right now we have 6 nodes on the cluster.

@mattparkes

I had this issue (or an extremely similar one) after using eksctl to upgrade aws-node, coredns and kube-proxy in an EKS cluster going from 1.11 to 1.12 and then 1.13.

I did 2 things which ultimately fixed this issue for me in all my clusters (unfortunately I haven't narrowed it down to 1 thing yet):

  1. Restored my manifests for the above 3 Deployments/Daemonsets and then just patched the image field manually (eksctl changes several other fields)
  2. Upgraded to aws-node 1.5.0, which as of the past few days AWS now recommends: https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html

@kivagant-ba

kivagant-ba commented Aug 22, 2019

- image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.5.3
➜ k logs -n kube-system aws-node-c7kv6
====== Installing AWS-CNI ======
====== Starting amazon-k8s-agent ======
ERROR: logging before flag.Parse: E0822 20:13:30.043898       9 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://172.20.0.1:443/api?timeout=32s: dial tcp 172.20.0.1:443: connect: connection timed out)
Failed to communicate with K8S Server. Please check instance security groups or http proxy setting%

Added 7 more nodes at once to the cluster and none of them can start, while the older nodes work fine.

@kivagant-ba

@tiffanyfay, where can I send the CNI logs privately?

@tiffanyfay
Contributor

@kivagant-ba mogren[at]amazon.com and tfj[at]amazon.com please! Thanks.

@kivagant-ba

Before I send this... I'm watching the get pods output right now and I can see the pods starting one after another. ... aaand that's it. It took almost an hour for the oldest of them to finally start.

@mogren
Contributor

mogren commented Aug 22, 2019

@kivagant-ba Also, this PR that got merged into 1.14 seems relevant: kubernetes/kubernetes#70994.

@kivagant-ba

kivagant-ba commented Aug 22, 2019

@tiffanyfay can I contact you using an official AWS support channel?

UPDATE: I created a support request to move the conversation to a more official area.

@cgkades

cgkades commented Sep 25, 2019

We've run into a similar issue, and for us, /etc/cni/10-aws.conflist is missing. Copying that from another node allows us to continue. We've engaged AWS support. Does anyone know how that file gets put on the node?

UPDATE: It looks like CNI version 1.5.3 causes the issue, but if I edit the daemonset to 1.5.1, all the nodes come up.
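
For anyone wanting to try the same workaround, pinning the daemonset image back to 1.5.1 looks roughly like this (a sketch; the registry account/region is the us-west-2 one used elsewhere in this thread, so adjust it for your cluster):

# Roll the aws-node daemonset back to the v1.5.1 image.
kubectl -n kube-system set image daemonset/aws-node \
  aws-node=602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.5.1

# Watch the pods roll and confirm they come back up Ready.
kubectl -n kube-system get pods -l k8s-app=aws-node -w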

@mogren
Contributor

mogren commented Sep 25, 2019

If the issue is in CNI v1.5.3, the problem should be visible in the log. The change to not copy the config until ipamd is running happened in commit f9acdeb and was made to avoid scheduling pods to that node until we have ENIs and IPs available. If ipamd cannot talk to the Kubernetes API server, it will not copy the plugin binary or the config file, and it will not make the node Ready.
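
So when a node gets stuck like this, it is worth checking on the node whether ipamd is up and whether it has installed the config yet. A rough sketch, assuming the default paths and the local port that aws-cni-support.sh tries above:

# Is ipamd answering locally? (61678 is the port the support script curls above.)
curl -s --max-time 5 http://localhost:61678/metrics > /dev/null \
  && echo "ipamd is up" || echo "ipamd not responding yet"

# Has ipamd installed the CNI config and plugin binary yet? (assumed default paths)
ls -l /etc/cni/net.d/10-aws.conflist /opt/cni/bin/aws-cni

# If not, the ipamd log should say why it cannot reach the API server.
tail -n 50 /var/log/aws-routed-eni/ipamd.log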

@ShahafSW

ShahafSW commented Nov 8, 2019

We've run into a similar issue, and for us, /etc/cni/10-aws.conflist is missing. Copying that from another node allows us to continue. We've engaged AWS support. Does anyone know how that file gets put on the node?

UPDATE: It looks like CNI version 1.5.3 causes the issue, but if I edit the daemonset to 1.5.1, all the nodes come up.

This solution worked for me as well!

@aweis89

aweis89 commented Jan 8, 2020

@kivagant-ba Darn, thanks for letting us know. Any chance you ran https://github.com/aws/amazon-vpc-cni-k8s/blob/master/scripts/aws-cni-support.sh ?

I've run into the same issue, and running that script on all masters didn't seem to resolve it:

====== Installing AWS-CNI ======
====== Starting amazon-k8s-agent ======
ERROR: logging before flag.Parse: E0108 17:11:43.934767 9 memcache.go:138] couldn't get current server API group list; will keep using cached value.

@jaypipes
Contributor

jaypipes commented Jan 8, 2020

Hi @aweis89, the support script does not resolve any issues. Rather, it simply gathers information useful during troubleshooting.

Can you let us know what version of the CNI plugin you are deploying and what version of Kubernetes you are using? Thanks much!

@aweis89

aweis89 commented Jan 8, 2020

Hey @jaypipes!

I've tried a couple of different versions, from 1.5.0 up to the one I'm currently using (1.5.5), but the issue persists. The K8S version is 1.14.10. The full log I'm seeing is:

====== Installing AWS-CNI ======
====== Starting amazon-k8s-agent ======
ERROR: logging before flag.Parse: E0108 17:39:47.947983 9 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://100.66.122.254:443/api?timeout=32s: x509: certificate is valid for 100.64.0.1, 127.0.0.1, not 100.66.122.254)
Failed to communicate with K8S Server. Please check instance security groups or http proxy setting

Notably, 100.64.0.1 seems to be the correct IP for the kube-apiserver (as determined from the ClusterIP of the kubernetes service in the default namespace).

Okay, while typing this I managed to find the root cause of the issue. In my case, the issue was that there was a kubernetes SVC in the kube-system namespace as well as one in the default namespace. Apparently the CNI pod first looks for a kubernetes SVC in the kube-system namespace (which is not maintained or updated by the controller). (Additionally, even though the kubernetes SVC in the kube-system namespace was load-balancing to the correct IPs, the cert was only valid for the IP of the kubernetes SVC in the default namespace.) Simply deleting the SVC in the kube-system namespace fixed the issue.

I'm not sure if this is a bug or intended behavior. But it should probably be noted somewhere that an SVC with the name kubernetes in the kube-system namespace will cause the CNI to fail if it's not properly set up to talk to the apiserver (even if there's a properly configured SVC in the default namespace). I think it might make sense to reverse the lookup order for the kubernetes SVC and prioritize the default namespace since that's the one designed to talk to the apiserver per the K8S docs: https://kubernetes.io/docs/tasks/administer-cluster/access-cluster-api/#directly-accessing-the-rest-api-1
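
For anyone hitting the same thing, the check and the fix described above boil down to roughly this (deleting a Service is destructive, so make sure the kube-system copy really is an unmanaged stray before removing it):

# Is there a stray "kubernetes" Service in kube-system alongside the real one in default?
kubectl get svc kubernetes -n default
kubectl get svc kubernetes -n kube-system

# If the kube-system copy exists and nothing manages it, removing it made the
# CNI use the default-namespace Service (and its valid cert) in my case.
kubectl -n kube-system delete svc kubernetes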
