
CiliumInternalIP removed & re-added on cilium-agent restart #27439

Closed · jaredledvina opened this issue Aug 10, 2023 · 10 comments · Fixed by #27590
Labels
  • affects/v1.11: This issue affects v1.11 branch
  • affects/v1.12: This issue affects v1.12 branch
  • kind/bug: This is a bug in the Cilium logic.
  • kind/community-report: This was reported by a user in the Cilium community, e.g. via Slack.
  • kind/performance: There is a performance impact of this.
  • sig/agent: Cilium agent related.
  • sig/k8s: Impacts the kubernetes API, or kubernetes -> cilium internals translation layers.
  • stale: The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.

Comments

@jaredledvina

jaredledvina commented Aug 10, 2023

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

tl;dr: On cilium-agent start-up, the CiliumNode object is updated to remove the CiliumInternalIP entries from spec.addresses, and then they are immediately added back.

Hello! I'm down a huge rabbit hole, but I believe I've finally found a reproducible bug present in Cilium 1.11.18, 1.12.11, and 1.14 that does not occur on Cilium 1.10.13. What got me here is a saga, but the issue is that whenever a cilium-agent pod is deleted (noticeable during daemonset rollouts), every cilium-agent updates its CiliumNode object to remove the CiliumInternalIP entries and then immediately adds them back.

This is a problem because all other nodes in the cluster receive these updates (through the kvstore in our case; it appears to be present in CRD mode too). When these updates are processed in 1.11 and 1.12, they go through the very expensive TriggerLabelInjection method (https://github.com/cilium/cilium/blob/v1.11/pkg/ipcache/metadata.go#L519-L567). From my profiling of the agent during these daemonset rollouts, this can result in over 30% more time being spent in that code path.

The CPU usage likely isn't a problem in 1.13+, as #19765 has reworked most of that code. However, while trying to upgrade to 1.12, this is becoming quite the issue: agents in larger clusters are getting overwhelmed with CiliumNode updates, causing significant CPU usage increases (from under 100 millicores to over 1.9 cores).

I've found this to be the easiest way to reproduce the superfluous updates to the CiliumNode object:

  1. In one terminal run: kubectl get ciliumnode CILIUM_NODE_HERE -w -o json | jq '.spec.addresses'
  2. In another, kubectl delete the cilium-agent pod on that node
  3. Wait a bit, and you'll notice one event without the CiliumInternalIP entries listed, followed by another that adds them back (a Go sketch of the same watch follows these steps).
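
For anyone who prefers to watch this programmatically, here is a minimal Go sketch of the same observation. It assumes a kubeconfig path in $KUBECONFIG and the node name as the first argument, and uses the generic dynamic client rather than Cilium's generated clientset, so the field paths are spelled out by hand:

```go
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	nodeName := os.Args[1]

	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// CiliumNode is a cluster-scoped CRD in the cilium.io/v2 API group.
	gvr := schema.GroupVersionResource{Group: "cilium.io", Version: "v2", Resource: "ciliumnodes"}
	w, err := dyn.Resource(gvr).Watch(context.Background(), metav1.ListOptions{
		FieldSelector: "metadata.name=" + nodeName,
	})
	if err != nil {
		panic(err)
	}

	// Print spec.addresses on every event. During an agent restart you should
	// see one event where the CiliumInternalIP entry is missing, immediately
	// followed by one where it is back.
	for ev := range w.ResultChan() {
		obj, ok := ev.Object.(*unstructured.Unstructured)
		if !ok {
			continue
		}
		addrs, _, _ := unstructured.NestedSlice(obj.Object, "spec", "addresses")
		fmt.Printf("%s: %v\n", ev.Type, addrs)
	}
}
```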

I'll update this issue as I discover more relevant information.

Cilium Version

I've confirmed this behaviour on v1.11.18, 1.12.11, and 1.14

Kernel Version

5.15 linux-aws

Kubernetes Version

1.24

Sysdump

Unable

Relevant log output

No response

Anything else?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@jaredledvina jaredledvina added the kind/bug, kind/community-report, and needs/triage labels Aug 10, 2023
@ti-mo ti-mo added the affects/v1.11, affects/v1.12, affects/v1.14, sig/k8s, kind/performance, and sig/agent labels and removed the needs/triage label Aug 14, 2023
@ti-mo
Contributor

ti-mo commented Aug 14, 2023

@jaredledvina Thanks for the investigation! I'm getting some eyes on this.

@christarazi
Member

> The CPU usage likely isn't a problem in 1.13+, as #19765 has reworked most of that code. However, while trying to upgrade to 1.12, this is becoming quite the issue: agents in larger clusters are getting overwhelmed with CiliumNode updates, causing significant CPU usage increases (from under 100 millicores to over 1.9 cores).

I think it's strange that 1.13 is somehow unaffected, because v1.14 should have the same code regarding CiliumNode updates; at the very least, it also has #19765.

@christarazi
Member

Anyway, it seems like the offending code could be `if err := n.mutateNodeResource(nodeResource); err != nil {`. Essentially, mutateNodeResource reconstructs the CiliumNode instead of trying to apply a diff.

One way to validate this theory would be to have mutateNodeResource return a copy of the mutated CiliumNode and perform a DeepEqual against it and the original CiliumNode. If they're not equal, proceed with the update; if they are equal, skip it. One thing to keep in mind is that DeepEqual may return false when the IP addresses are merely in a different order rather than actually differing in content. It's possible that we could sort the IPs to mitigate that, and ordering is my hunch for the culprit.
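
A minimal sketch of that check, using stand-in types rather than the real CiliumNode structs (the NodeAddress type, field names, and needsUpdate helper below are illustrative, not the actual Cilium API):

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// NodeAddress is a stand-in for the entries under spec.addresses.
type NodeAddress struct {
	Type string
	IP   string
}

// sortedCopy returns the addresses in a stable order so that the DeepEqual
// comparison is not defeated by ordering differences alone.
func sortedCopy(addrs []NodeAddress) []NodeAddress {
	out := append([]NodeAddress(nil), addrs...)
	sort.Slice(out, func(i, j int) bool {
		if out[i].Type != out[j].Type {
			return out[i].Type < out[j].Type
		}
		return out[i].IP < out[j].IP
	})
	return out
}

// needsUpdate sketches the proposed check: build the addresses the same way
// mutateNodeResource would, then only issue the Update call if the result
// actually differs from what is already in the CiliumNode.
func needsUpdate(current, mutated []NodeAddress) bool {
	return !reflect.DeepEqual(sortedCopy(current), sortedCopy(mutated))
}

func main() {
	current := []NodeAddress{
		{Type: "CiliumInternalIP", IP: "10.240.0.1"},
		{Type: "InternalIP", IP: "10.0.0.5"},
	}
	// Same contents, different order: no update should be issued.
	mutated := []NodeAddress{
		{Type: "InternalIP", IP: "10.0.0.5"},
		{Type: "CiliumInternalIP", IP: "10.240.0.1"},
	}
	fmt.Println(needsUpdate(current, mutated)) // false
}
```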

@jaredledvina
Author

Hey @christarazi

Thanks so much for taking a look here!

> I think it's strange that 1.13 is somehow unaffected, because v1.14 should have the same code regarding CiliumNode updates; at the very least, it also has #19765.

Sorry for the confusion; I don't currently have a 1.13 install to verify the behavior on. I suspect it sees the unnecessary CiliumNode updates just like 1.14 does. That said, the CPU usage increase caused by those updates, which I've been hunting primarily in 1.11 and 1.12, likely doesn't affect 1.13 & 1.14 because of PR #19765.

@aanm aanm removed the affects/v1.14 label Aug 15, 2023
@jaredledvina
Author

jaredledvina commented Aug 17, 2023

@aanm - To make sure I'm being clear: from my testing, this bug does impact 1.14, unless something else has changed the behavior.

@aanm
Member

aanm commented Aug 18, 2023

@jaredledvina but you wrote the following:

> likely doesn't affect 1.13 & 1.14 because of PR #19765.

@jaredledvina
Author

@aanm - Sorry for the lack of clarity; that statement was in reference to the resulting CPU usage problem that kicked all of this off. That CPU usage issue likely doesn't affect 1.13 & 1.14. The removal and re-addition of the CiliumInternalIP on restart affects all versions. Hemanth has opened a partial fix for this in #27590 against master, which hopefully clarifies the impact here further.

@hemanthmalla
Member

As described in #27590, we still have a race condition between the first CiliumNode update after restart and n.localNode.IPAddresses being populated.
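
A minimal sketch of one way to guard against that race, using stand-in types (the Address type, the CiliumInternalIP constant, and shouldSyncCiliumNode below are illustrative, not the actual agent code):

```go
package main

import "fmt"

// Minimal stand-ins for Cilium's node address types; the real ones live
// under pkg/node, but the names here are illustrative.
type AddressType string

const CiliumInternalIP AddressType = "CiliumInternalIP"

type Address struct {
	Type AddressType
	IP   string
}

// shouldSyncCiliumNode is a hypothetical guard: only publish the CiliumNode
// update once the locally discovered addresses include a CiliumInternalIP,
// so a freshly restarted agent does not first push an address list with the
// CiliumInternalIP missing and then add it back moments later.
func shouldSyncCiliumNode(addrs []Address) bool {
	for _, a := range addrs {
		if a.Type == CiliumInternalIP {
			return true
		}
	}
	return false
}

func main() {
	early := []Address{{Type: "InternalIP", IP: "10.0.0.5"}}
	later := append(early, Address{Type: CiliumInternalIP, IP: "10.240.0.1"})
	fmt.Println(shouldSyncCiliumNode(early)) // false: hold the update
	fmt.Println(shouldSyncCiliumNode(later)) // true: safe to publish
}
```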

@github-actions

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

@github-actions github-actions bot added the stale label Oct 25, 2023

github-actions bot commented Nov 8, 2023

This issue has not seen any activity since it was marked stale.
Closing.

@github-actions github-actions bot closed this as not planned Nov 8, 2023