aws-node pod restarts without any obvious errors #283
Hey @liwenwu-amazon, We are now seeing it in a production cluster:
(I'm sorry if you are waiting for @liwenwu-amazon's reply) but you should collect the following logs:
First, note:
Thanks for the reply @nak3 I already mentioned the logs in issue text. Again, this is the message from the previous pod:
Getting event data is a good idea but in this scenario it produces only
I see. (I'm sorry, I missed that you mentioned you have already checked the previous pod.) How about
All good! Good idea. I had a look at these logs just now and there are a few errors. For this pod:
Looking at
But in
That last error repeats over and over until:
Both errors repeat a few more times and then there is nothing more in this log. I hope that helps!
@max-rocket-internet for for
OK, maybe it's unrelated to this problem then but here it is:
I can do this but we have 4 different clusters in different regions and they all have the same problem. That's why I thought it was more likely a problem with the CNI or related config than an AWS service. But OK I'll open a support ticket also.
2 more things...
Yeah, this looks like a race condition where aws-node (with its Kubernetes client code) comes up before kube-proxy (and possibly kube-dns/coredns) has set up the
We need to update the CNI to have a similar flag so it isn't dependent on DNS or kube-proxy being up.
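For context on why such a flag helps: in-cluster clients can reach the API server through the KUBERNETES_SERVICE_HOST / KUBERNETES_SERVICE_PORT environment variables that the kubelet injects, without resolving kubernetes.default.svc through cluster DNS. A minimal sketch of the distinction (the IP and port values below are made up for the demo, not taken from any real cluster):

```shell
#!/bin/sh
# Illustrative sketch only: these values are hard-coded to simulate what the
# kubelet injects into a real pod's environment.
KUBERNETES_SERVICE_HOST=10.100.0.1
KUBERNETES_SERVICE_PORT=443

if [ -n "${KUBERNETES_SERVICE_HOST}" ] && [ -n "${KUBERNETES_SERVICE_PORT}" ]; then
  # This path needs no DNS, so it works before kube-proxy/kube-dns are up.
  echo "API server: ${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}"
else
  # This fallback only works once cluster DNS is reachable.
  echo "falling back to DNS: kubernetes.default.svc"
fi
```

If those env vars are not set (the Kubernetes issue mentioned later in this thread), a client that relies on them can still end up stuck on the DNS path.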
Cool, thanks for the update @micahhausler
Any ETA for a fix?
Duplicate of #282
I think #282 is a different issue, so I'm reopening this one. From the digging so far, there seems to be an open issue in Kubernetes where env variables are not always set. The operator-framework has a workaround for it that we are looking into.
Fixed by #364.
JFYI: I ran into the issue, and after upgrading aws-node (amazon-k8s-cni) to 1.4 I got the error from the pod on the same node.
A pod restart didn't help. All other nodes were OK and still are. I terminated the broken node to replace it with another one.
@kivagant-ba Darn, thanks for letting us know. Any chance you ran https://github.com/aws/amazon-vpc-cni-k8s/blob/master/scripts/aws-cni-support.sh ?
The node is dead but if I get the problem again, I'll run the script, thank you for sharing. |
I faced this again.
@kivagant-ba The
It's EKS and I don't think the masters are overloaded. The cluster is almost empty and only random nodes periodically experience the issue.
Right now we have 6 nodes on the cluster.
I had this issue (or an extremely similar one) after using
I did 2 things which ultimately fixed this issue for me in all my clusters (unfortunately I haven't narrowed it down to one thing yet):
Added 7 more nodes at once to the cluster and none of them could start, while the older nodes work fine.
@tiffanyfay, where can I send the CNI logs privately?
@kivagant-ba mogren[at]amazon.com and tfj[at]amazon.com please! Thanks.
Before I send this... I'm watching for the
@kivagant-ba Also, this PR that got merged into 1.14 seems relevant: kubernetes/kubernetes#70994.
@tiffanyfay, can I contact you using an official AWS support channel? Update: I created a support request to move the conversation to a more official area.
We've run into a similar issue, and for us /etc/cni/10-aws.conflist is missing. Copying that file from another node allows us to continue. We've engaged AWS support. Does anyone know how that file gets put on the node? UPDATE: it looks like CNI version 1.5.3 causes the issue, but if I edit the daemonset to use 1.5.1, all the nodes come up.
If the issue is in CNI v1.5.3, the problem should be visible in the log. The change to not copy the config until ipamd is running happened in commit f9acdeb, and was done to avoid scheduling pods to a node until we had ENIs and IPs available. If ipamd cannot talk to the Kubernetes API server, it will not copy the plugin binary or the config file, and will not make the node Ready.
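A quick way to observe the behavior described above is to watch for the config file appearing on the node once ipamd becomes ready. The sketch below is purely illustrative (the check_conf helper is made up, and a temp directory stands in for the directory the CNI actually writes to on a real node):

```shell
#!/bin/sh
# Illustrative sketch only: check_conf is a made-up helper, and the temp
# directory simulates the node's CNI config directory.
check_conf() {
  if [ -f "$1" ]; then
    echo "CNI config present: $1"
  else
    echo "CNI config missing: $1 (ipamd may not have become ready yet)"
  fi
}

dir=$(mktemp -d)
first=$(check_conf "$dir/10-aws.conflist")    # before ipamd is ready: missing
touch "$dir/10-aws.conflist"                  # simulate ipamd copying the config
second=$(check_conf "$dir/10-aws.conflist")   # after: present
echo "$first"
echo "$second"
rm -rf "$dir"
```

On a node stuck NotReady with the config file still missing, this points the investigation at ipamd's connectivity to the API server rather than at the CNI binary itself.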
This solution worked for me as well!
I've run into the same issue, and running that script on all masters didn't seem to resolve the issue:
Hi @aweis89, the support script does not resolve any issues. Rather, it simply gathers information useful during troubleshooting. Can you let us know what version of the CNI plugin you are deploying and what version of Kubernetes you are using? Thanks much!
Hey @jaypipes! I've tried with a couple different versions from

Okay, while typing this I managed to find the root cause of the issue. In my case, the issue was that there was a kubernetes SVC in the kube-system namespace and one in the default namespace. Apparently the CNI pod first looks for a kubernetes SVC in the kube-system namespace (which is not maintained or updated by the controller). (Additionally, even though the kubernetes SVC in the kube-system namespace was load-balancing to the correct IPs, the cert was only valid for the IP of the kubernetes SVC in the default namespace.) Simply deleting the SVC in the kube-system namespace fixed the issue.

I'm not sure if this is a bug or intended behavior, but it should probably be noted somewhere that an SVC named kubernetes in the kube-system namespace will cause the CNI to fail if it's not properly set up to talk to the apiserver (even if there's a properly configured SVC in the default namespace). I think it might make sense to reverse the lookup order for the kubernetes SVC and prioritize the default namespace, since that's the one designed to talk to the apiserver per the K8s docs: https://kubernetes.io/docs/tasks/administer-cluster/access-cluster-api/#directly-accessing-the-rest-api-1
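To make the lookup-order point concrete, here is a toy simulation (not the plugin's actual code): when a Service named kubernetes exists in both namespaces, whichever namespace is searched first wins, so reversing the search order changes which Service the client talks to.

```shell
#!/bin/sh
# Toy simulation only; this is not the CNI's real lookup logic.
# We pretend a Service named "kubernetes" exists in both kube-system
# and default, and return the first namespace searched that has one.
find_kubernetes_svc() {
  for ns in "$@"; do
    if [ "$ns" = "kube-system" ] || [ "$ns" = "default" ]; then
      echo "$ns/kubernetes"
      return 0
    fi
  done
}

current=$(find_kubernetes_svc kube-system default)   # order described above
proposed=$(find_kubernetes_svc default kube-system)  # order suggested in the comment
echo "current:  $current"    # current:  kube-system/kubernetes
echo "proposed: $proposed"   # proposed: default/kubernetes
```

With the current order, the shadowing kube-system Service (with its mismatched cert) is picked; with the proposed order, the controller-maintained default Service would win.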
EKS: v1.11.5
CNI: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0
AMI: amazon-eks-node-1.11-v20181210 (ami-0a9006fb385703b54)
E.g.: even if I describe pod aws-node-hhtrt there are no events. No interesting logs from the pod either, or from the previous pod. I looked in our logging system to get all logs from this pod and there is nothing beyond the normal startup messages. But I did see this message from pod aws-node-9cz4c:
I tried to run /opt/cni/bin/aws-cni-support.sh on the node with pod aws-node-hhtrt but I get this error: