Cilium 1.14.5+ can break Kubernetes pod connectivity due to the reserved:init label never being removed #30524
Comments
Timo, nice catch! One suggestion would be to bisect between v1.14.4 and v1.14.5, and see if you truly end up in #29248. But the repro doesn't sound like a 5-minute exercise :).
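In case it helps, a minimal sketch of what such a bisect run could look like; the `./repro.sh` script is a placeholder for whatever image build and cluster repro workflow you use:

```sh
# Hypothetical bisect between the first affected and the last known-good tag.
# ./repro.sh is a placeholder: it would build agent/operator images from the
# current checkout, provision a test cluster, and exit non-zero when a pod
# ends up stuck with the reserved:init label.
git clone https://github.com/cilium/cilium.git && cd cilium
git bisect start v1.14.5 v1.14.4   # <bad> <good>
git bisect run ./repro.sh
git bisect reset
```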
@julianwiedmann, thanks for the suggestion. You're right, it might involve some work. I think I'll try to patch out the culprit change first from 1.14.5 and see if I'm lucky with that. If not, I'll start bisecting. Will let you know once I know more.
So I went ahead and patched out #29251 (or, rather, its commits) from 1.14.5, which is the 1.14 backport variant of upstream #29248 that I suspected to be the culprit. I pushed the resulting branch to my fork (the top commit has the reverting patch). I then built the Cilium agent and operator images to test in a custom cluster. In that setup, the RBAC-based reproduction does not work anymore, meaning I could not trigger the bug (similar to how I wasn't able to do so on v1.14.4). I then created 500 clusters and checked for pods with the `reserved:init` label stuck, without finding any affected instance. Just to be safe, I did the counter-example and created several clusters on 1.14.5 (unpatched). Out of 100 cluster creates, I ran into two broken instances right away again. Overall, I believe this confirms that the PR in question is the problematic one.

Worth highlighting that the backporting PR #29251 has an additional commit (endpoint: fix panic in RunMetadataResolver due to send on closed channel) that the original upstream PR did not have. I had planned to run another test where I reverted all commits from #29251 except for that specific one to further distinguish which commit(s) would truly be responsible; unfortunately, the revert does not apply cleanly in that combination. I made no further attempts to resolve the conflicts since I felt I was really stepping out of my territory and potentially making things even worse.

Regardless, I think the above demonstrates that #29251 is related to the reported bug. I can try to dig further into what the change in question did to hopefully understand where the essence of the problem lies, but I might need some pointers/guidance from folks more familiar with the code (possibly @aanm).
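To illustrate what I mean by checking for the label, the check was along these lines (a simplified sketch rather than the exact commands used; it assumes the standard cilium DaemonSet in kube-system with an agent container named cilium-agent):

```sh
# Loop over all Cilium agent pods and list endpoints whose label set still
# contains reserved:init (it shows up in the LABELS column of `cilium endpoint list`).
for pod in $(kubectl -n kube-system get pods -l k8s-app=cilium -o name); do
  echo "== ${pod} =="
  kubectl -n kube-system exec "${pod}" -c cilium-agent -- \
    cilium endpoint list 2>/dev/null | grep 'reserved:init' || true
done
```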
FWIW, here's the output of
Note the line
where, presumably, the API server wasn't immediately ready yet. I thought the final
Is there an existing issue for this?
What happened?
Starting with Cilium 1.14.5, we are occasionally seeing pods in newly provisioned (DigitalOcean managed) Kubernetes clusters completely fail to ingress or egress any traffic. The built-in cilium monitor shows that packets are dropped / rejected on their way into and out of such pods.
We did notice that the affected pods all have the `reserved:init` label set on the Cilium endpoint in addition to the regular pod labels. Also, policy enforcement for both ingress and egress is marked as enabled for such pods according to `cilium endpoint get/list`, even though the policy enforcement mode is set to `default` and no network policies are installed at all -- output from `cilium endpoint list` for an affected CoreDNS instance:

I am assuming that the policy enforcement is a consequence of the `reserved:init` label being set. What is not clear to me is why that special label has not been removed despite all other pod labels having been discovered successfully.
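For completeness, this is roughly how the mismatch can be inspected from inside the agent pod on the affected node (a sketch using the cilium CLI bundled with the agent; the endpoint ID placeholder has to be taken from the list output):

```sh
# Cluster-wide enforcement mode as the agent sees it (expected: default)
cilium config | grep -i policyenforcement
# Per-endpoint view -- the affected endpoint reports ingress/egress policy as enabled
cilium endpoint list
cilium endpoint get <endpoint-id>   # <endpoint-id> from the list output
```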
According to the documentation for the `reserved:init` label, it may be assigned when one of three described conditions holds. We do not use Docker on the Kubernetes nodes and we also don't run in etcd mode, so only a (brief) unavailability of the Kubernetes API server remains as a possible explanation for why the label was added in the first place. FWIW, in none of the cases I have looked at more closely was I able to spot a clear error message in the Cilium agent logs indicating that an API server request may have failed. (If the failures manifest as API watch events not being propagated to the agent in time or with a delay, however, then I imagine it is conceivable that no error logs would be produced.)

So far we have only spotted the problem during the initial provisioning of a cluster. Once a cluster manages to finish bootstrapping without the `reserved:init` label being assigned, the issue does not seem to occur anymore during the remaining runtime of the cluster (or we haven't spotted it yet). I've been speculating whether the API server not being totally stable and settled during the provisioning process may be a contributing factor.

The issue also happens on 1.14.6 but does not seem to manifest in 1.14.4.
I've been trying to reproduce the problem by attempting to surgically disallow access from the Cilium agent to the API server just as far as label discovery is concerned. This turned out to be somewhat tricky since pods won't even come up if access is prohibited too broadly. The closest I have gotten is by modifying the RBAC permissions of the agent to strip the `get` and `watch` permissions from the `pods` resource, essentially splitting the (default) RBAC API group into two separate groups, like this:
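(What follows is a hedged reconstruction of that change rather than the exact manifest applied; the resource list in the default rule and the ClusterRole/DaemonSet names assume a standard install and vary by Cilium version.)

```sh
# Dump the agent ClusterRole, split the rule that grants get/list/watch on core
# resources so that "pods" keeps only "list", then re-apply and restart the agents.
kubectl get clusterrole cilium -o yaml > cilium-clusterrole.yaml

# Edit cilium-clusterrole.yaml so the affected rule becomes two rules, roughly:
#   - apiGroups: [""]
#     resources: ["pods"]
#     verbs: ["list"]                      # get/watch stripped
#   - apiGroups: [""]
#     resources: ["namespaces", "services", "nodes", ...]
#     verbs: ["get", "list", "watch"]

kubectl apply -f cilium-clusterrole.yaml
kubectl -n kube-system rollout restart ds/cilium
```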
After restarting the Cilium agent, newly launched pods received the `reserved:init` label. Even when I re-enable all permissions for the `pods` resource, the label sticks around.

I have to say that I am not super certain if the above is the perfect or even right way to reproduce the bug: I would have expected the CiliumEndpoint objects of deliberately broken pods to not have any regular pod labels at all but only the `reserved:init` label while the RBAC permission is modified. However, when I watch for new CiliumEndpoint objects, I always see both the regular labels and the `reserved:init` label being associated right from the beginning. Chances are, though, that I am missing some details on how the label is handled exactly, so I'm happy to hear feedback on this and apply any repro suggestions folks may have.

One larger-scale reproduction case that does work fairly reliably is to launch 50-60 new clusters in our environment. I'd then usually hit 1-2 broken clusters that show the described symptoms.
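For reference, the CiliumEndpoint watching mentioned above can be done along these lines (a sketch, not the exact command used; it assumes the v2 CiliumEndpoint status layout and that jq is available):

```sh
# Stream CiliumEndpoint objects as they are created/updated and print their
# identity labels; in the broken cases the regular k8s:* labels and
# reserved:init show up together right from the start.
kubectl get ciliumendpoints -A -o json --watch \
  | jq -r 'select(.status.identity != null)
           | "\(.metadata.namespace)/\(.metadata.name): \(.status.identity.labels | join(", "))"'
```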
We're now downgrading to 1.14.4 again as a workaround, but obviously we would want to see the bug addressed so that we can roll forward again. I've been peeking around the code a bit but haven't been able to identify any particular area that would need to be worked on for a fix. Once the root cause is identified and the approach to fixing it becomes apparent, I'm open to trying to submit a PR myself. Appreciate your support on this case.
Cilium Version
v1.14.5 (also tried v1.14.6 which is similarly affected)
Kernel Version
Linux pool-om3sd 6.1.0-17-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64 GNU/Linux
Kubernetes Version
1.29.0
Sysdump
Able to provide on request
Relevant log output
Anything else?
This was initially discussed on Slack: https://cilium.slack.com/archives/C53TG4J4R/p1706254144830909
My best bet as to which change may have introduced a bug into Cilium is #29248 by @aanm. He also mentioned that there's another, currently still open PR #30170 (authored by @oblazek) that is related to the PR referenced above. Mentioning it here as well for the sake of completeness.
Code of Conduct