Services are unreachable after node is restarted #27848
Comments
Ok, the bug disappears with Kubernetes v1.27.4. As 1.28 is not officially supported by Cilium, feel free to close the issue or maybe use it as a to-do item for Cilium's K8s 1.28 support.
K8s 1.28 support was added for v1.15 (the current dev cycle, i.e. main).
Ok, I didn't see the commit adding support. I also tried the latest main just now (90a9402) and the problem still persists.
@cilium/loader Could there be a behavioral difference on reboot before and after 68fd9ee?
Yes, in general since 68fd9ee we try to use bpf links to attach the cgroup programs.
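For anyone following along, the link-based attachment pattern being referred to looks roughly like the sketch below, written against the github.com/cilium/ebpf library. This is an illustration under assumptions, not Cilium's actual loader code: the cgroup path, pin path, and attach type are placeholders.

```go
package sketch

import (
	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

// attachAndPin attaches a cgroup program through a bpf_link and pins the
// link to bpffs, so that a later agent run can re-open and update it
// instead of re-attaching from scratch. Paths and attach type below are
// placeholders, not necessarily what Cilium's loader really uses.
func attachAndPin(prog *ebpf.Program) (link.Link, error) {
	l, err := link.AttachCgroup(link.CgroupOptions{
		Path:    "/run/cilium/cgroupv2",        // cgroup2 mount (placeholder)
		Attach:  ebpf.AttachCGroupInet4Connect, // one socket-LB hook, as an example
		Program: prog,
	})
	if err != nil {
		return nil, err
	}
	// The pin keeps the attachment reachable from bpffs independently of
	// this process's file descriptor.
	if err := l.Pin("/sys/fs/bpf/cilium/links/cgroup_connect4"); err != nil {
		l.Close()
		return nil, err
	}
	return l, nil
}
```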
@3u13r Do you see the same changes when running
This is the output of
And this is the output after the restart:
Note that the restarted node's cilium logs include:
@3u13r I might have a reproducer of the symptoms using just a Kind cluster running on a CentOS 9 host... but in my procedure, dropping back to Cilium 1.13.6 doesn't fix the problem, which is different from your experience. Ref: Gist with my testing scenarios, including the Cilium 1.13.6 scenario: My spidey sense tells me my procedure with the Kind node container restarts should be hitting the same underlying problem that yours is... but my Cilium 1.13.6 test failed and yours didn't, which is curious. Is there something racing on node restart that my procedure is just better at triggering?
From the output of @3u13r I don't see any Cilium cgroup BPF progs attached, which explains why services are not reachable.
@jspaleta I think your reproducers trigger a different issue than what @3u13r is seeing. In your case, after a restart the cgroup BPF progs are attached to the wrong cgroup, as @brb has shown in #27900 (comment). From what I could see when debugging, this happens because the

However, in @3u13r's case it looks like we have a bug when updating the bpf links after a restart. My current suspicion is that retrieving the link from bpffs and just updating the program after a restart doesn't work for some reason, because no cgroup progs are attached anymore. But as mentioned before, I don't have a good reproducer for this exact behavior yet 😞
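To make the suspected path a bit more concrete, the retrieve-and-update flow described above looks roughly like this with the github.com/cilium/ebpf library. This is a hedged sketch with a placeholder pin path, not the actual loader code; the open question in this thread is why going through such an update after a restart can leave no cgroup program attached.

```go
package sketch

import (
	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

// updateFromPin re-opens a previously pinned cgroup link and swaps in a
// freshly loaded program. This only illustrates the general
// retrieve-and-update pattern discussed above; the pin path is a
// placeholder and error handling is reduced to the bare minimum.
func updateFromPin(newProg *ebpf.Program) error {
	l, err := link.LoadPinnedLink("/sys/fs/bpf/cilium/links/cgroup_connect4", nil)
	if err != nil {
		// No pin found, e.g. because bpffs came up empty after a reboot;
		// the caller would have to fall back to a fresh AttachCgroup.
		return err
	}
	defer l.Close()
	// Replace the program behind the existing attachment in place.
	return l.Update(newProg)
}
```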
@3u13r bpf links and their pins in bpffs shouldn't actually persist across reboots, so it's strange that we can still pick up the pins and try to update them. Are you doing a proper reboot of the node?
As mentioned, I do a reboot of the host (see the reproduction command in the issue description).
Okay @rgo3, so this one is probably not reproducible just with Kind. Maybe I need to kick my k3s home lab worker node in just the right way...
I'm 99% sure that the K8s upstream fix mentioned in the other issue will also solve this issue. Note that the order of dependencies changed in the commit I bisected, i.e. before that commit the pinning took place before the cilium-agent container and after all the other init-containers.
Looks like there could be multiple issues at play here. 😕 Issue (1): the k8s v1.28 regression, see #27900 (comment). This might be exposing other issues in the loader logic, as @rgo3 pointed out, due to which BPF cgroup programs are not getting attached at all after a node restart?
👍 Could you confirm the bisected commit, please? Thanks!
I would suggest keeping this issue open for the potential issue in the loader logic. (Edit: Sorry, I accidentally deleted my previous comment while editing.)
Hm, I just remembered that we also restart the cilium pod once it is healthy, since there was a connection issue that was fixed by that. I don't think it's needed anymore and it likely skews the behavior/logs. Let me remove that code and see if it makes a difference.
What do you mean? Should I get the cgroup attachment again for the commit (and the one before it)?
What's the bisected commit ID you were referring to? Is it the same as 68fd9ee?
Yes, this is the commit that breaks the behavior on K8s 1.28 for me.
The restart of the cilium pod could explain the log line about an updated link, as it will have created one at initial startup, but it's still a bit odd that it succeeds and then
Okay, just for clarity: I got my k3s cluster into a bad state. I'm going to try the downgrade of Cilium to 1.13.6 on it just to confirm I'm seeing the same issue now. I'll update this comment with more info. Sorry for the noise with the spurious other-issue reproducer.

So I rolled back to Cilium 1.13.6 on my two-node Intel NUC k3s cluster running K8s 1.28, confirmed all pods were up and running and cilium status was in the green, rebooted both my k3s nodes, and the CoreDNS pods didn't come back up... so it's the same symptoms as my Kind cluster reproducer. Do I need to document this further, and what information should I provide? There are a lot of moving parts here; the k3s images are release candidates themselves, so ruling out k3s-RC-specific "type 2 fun" (tm) is not possible.
sysdump attached
I had a look at the logs of the pods again, and on the restarted node the cilium pod has the following warning:
Otherwise, all observations from above regarding the unattached programs still apply.
@3u13r would you be able to test it again with Kubernetes 1.28.2? Thank you
Sorry for the delay. I can confirm that using K8s v1.28.2 fixes the reported problem. Thanks for the help and investigation. |
Is there an existing issue for this?
What happened?
When I install Cilium version >= 1.14.0 and restart a Kubernetes node, the pods on that node cannot reach any service.
Reproduction:
kubectl debug node/<node-name> --image=ubuntu -- bash -c "echo reboot > reboot.sh && chroot /host < reboot.sh"
{"level":"error","ts":"2023-08-31T12:25:29Z","logger":"setup","msg":"unable to start manager","error":"Get \"https://10.96.0.1:443/api?timeout=32s\": dial tcp 10.96.0.1:443: i/o timeout"}
One can also observe the attached traffic flow when doing an nslookup inside the pod on this node:
For comparison, this is how the flow looks on a node that has not yet been restarted:
Cilium Version
1.14.0, 1.14.1, main (90a9402)
Kernel Version
Linux fedora 6.1.45-100.constellation.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Aug 14 17:39:05 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Kubernetes Version
Server Version: version.Info{Major:"1", Minor:"28", GitVersion:"v1.28.0", GitCommit:"855e7c48de7388eb330da0f8d9d2394ee818fb8d", GitTreeState:"clean", BuildDate:"2023-08-15T10:15:54Z", GoVersion:"go1.20.7", Compiler:"gc", Platform:"linux/amd64"}
Sysdump
cilium-sysdump-20230831-142049.zip
Relevant log output
No response
Anything else?
The bug does not happen on 1.13.6 and I bisected the problem to the following commit: 68fd9ee (i.e. this commit is the first in which this error occurs).
When using Kubernetes 1.27 or 1.26 the problem disappears.
Code of Conduct