CiliumInternalIP removed & re-added on cilium-agent restart #27439
@jaredledvina Thanks for the investigation! I'm getting some eyes on this.
I think it's strange that 1.13 is somehow unaffected, because v1.14 should have the same code regarding CiliumNode updates; at the very least it also has #19765.
Anyway, it seems like the offending code could be `pkg/nodediscovery/nodediscovery.go` line 366 (at 14fd579): `mutateNodeResource` reconstructs the CiliumNode instead of trying to apply a diff.
One way to validate this theory would be to have …
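To illustrate the diff point above, here is a minimal, hypothetical sketch of suppressing no-op CiliumNode writes by comparing the rebuilt spec against the stored one. The types and the `updateIfChanged` helper are invented for illustration; they are not Cilium's actual API:

```go
package main

import (
	"fmt"
	"reflect"
)

// Simplified stand-ins for the CiliumNode address fields; the real types
// live in Cilium's v2 API package. Illustration only.
type Address struct {
	Type string
	IP   string
}

type NodeSpec struct {
	Addresses []Address
}

// updateIfChanged models a diff-based update: the object is only written
// back when the freshly rebuilt spec actually differs from the stored one,
// avoiding the remove-then-re-add churn described in this issue.
// Returns true if an update was issued.
func updateIfChanged(stored *NodeSpec, rebuilt NodeSpec, update func(NodeSpec)) bool {
	if reflect.DeepEqual(*stored, rebuilt) {
		return false // no-op: suppress the spurious update
	}
	*stored = rebuilt
	update(rebuilt)
	return true
}

func main() {
	stored := NodeSpec{Addresses: []Address{{Type: "CiliumInternalIP", IP: "10.0.0.1"}}}

	// An agent restart rebuilds the same spec from scratch; with a diff
	// check in place, no update is sent and peers see no churn.
	rebuilt := NodeSpec{Addresses: []Address{{Type: "CiliumInternalIP", IP: "10.0.0.1"}}}

	sent := updateIfChanged(&stored, rebuilt, func(s NodeSpec) {
		fmt.Println("updating CiliumNode:", s.Addresses)
	})
	fmt.Println("update sent:", sent) // Prints: update sent: false
}
```

Whether this is practical inside `mutateNodeResource` depends on what mutates the object between restarts, but comparing before writing is the general shape of the fix being discussed.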
Hey @christarazi, thanks so much for taking a look here!
Sorry for the confusion; I don't currently have a 1.13 install to verify the behaviour on. I also suspect it sees the unnecessary CiliumNode updates just like 1.14 does. That said, the CPU usage increase resulting from those updates, which I've been hunting primarily in 1.11 and 1.12, likely doesn't affect 1.13 & 1.14 because of PR #19765.
@aanm - To make sure I'm being clear, this bug does impact 1.14 from my testing unless something else changed the behavior. |
@jaredledvina but you wrote the following:
@aanm - Sorry for the lack of clarity; that statement was in reference to the resulting CPU usage bug that kicked all of this off. That CPU usage issue likely doesn't affect 1.13 & 1.14. The removal and then re-addition of …
As described in #27590, we still have a race condition between the first CiliumNode update after restart and …
This issue has been automatically marked as stale because it has not had recent activity.
This issue has not seen any activity since it was marked stale. |
Is there an existing issue for this?
What happened?
tl;dr: On cilium-agent start-up, the CiliumNode object is updated to remove the `CiliumInternalIP` entries from `spec.addresses`, which are then immediately added back.

Hello! I'm down a huge rabbit hole but believe I've finally found a reproducible bug present in Cilium 1.11.18, 1.12.11, and 1.14, while not occurring on Cilium 1.10.13. What got me here is a saga, but the issue is that whenever a cilium-agent pod is deleted (noticeable during daemonset rollouts), every cilium-agent updates its CiliumNode object to remove the `CiliumInternalIP` entries and then immediately adds them back.

This is an issue because, downstream, all other nodes in the cluster receive these updates (through the kvstore in our case; it appears present in CRD mode too). When these updates are processed in 1.11 and 1.12, they go through a very expensive `TriggerLabelInjection` method: https://github.com/cilium/cilium/blob/v1.11/pkg/ipcache/metadata.go#L519-L567. From my profiling of the agent during these daemonset rollouts, this can yield +30% more time spent in that code path.

The CPU usage issue is likely not present in 1.13+ as #19765 re-worked most of that code. However, while trying to upgrade to 1.12 this is becoming quite the issue, as agents in larger clusters are getting overwhelmed with CiliumNode updates, causing significant CPU usage increases (i.e. from <100 mcores to >1.9 cores).
I've found this to be the easiest way to reproduce the superfluous updates to the CiliumNode object:

1. Run `kubectl get ciliumnode CILIUM_NODE_HERE -w -o json | jq '.spec.addresses'`
2. `kubectl delete` the cilium-agent pod on that node
3. Observe an update with no `CiliumInternalIP` entries listed, followed by another adding them back

I'll update this issue as I discover more relevant information.
Cilium Version
I've confirmed this behaviour on v1.11.18, v1.12.11, and v1.14.
Kernel Version
5.15 linux-aws
Kubernetes Version
1.24
Sysdump
Unable
Relevant log output
No response
Anything else?
No response
Code of Conduct