
CiliumInternalIP removed & re-added on cilium-agent restart #27439

Closed · jaredledvina opened this issue Aug 10, 2023 · 10 comments · Fixed by #27590
Labels
  • affects/v1.11: This issue affects v1.11 branch
  • affects/v1.12: This issue affects v1.12 branch
  • kind/bug: This is a bug in the Cilium logic.
  • kind/community-report: This was reported by a user in the Cilium community, e.g. via Slack.
  • kind/performance: There is a performance impact of this.
  • sig/agent: Cilium agent related.
  • sig/k8s: Impacts the kubernetes API, or kubernetes -> cilium internals translation layers.
  • stale: The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.

Comments

@jaredledvina

jaredledvina commented Aug 10, 2023

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

tl;dr: On cilium-agent start-up, the CiliumNode object is updated to remove the CiliumInternalIP entries from spec.addresses, and then they are immediately added back.

Hello! I'm down a huge rabbit hole, but I believe I've finally found a reproducible bug present in Cilium 1.11.18, 1.12.11, and 1.14 that does not occur on Cilium 1.10.13. What got me here is a saga, but the issue is that whenever a cilium-agent pod is deleted (noticeable during daemonset rollouts), every cilium-agent updates its CiliumNode object to remove the CiliumInternalIP entries and then immediately adds them back.

This is a problem because all other nodes in the cluster receive these updates (through the kvstore in our case; it appears to be present in CRD mode too). When these updates are processed in 1.11 and 1.12, they go through the very expensive TriggerLabelInjection method (https://github.com/cilium/cilium/blob/v1.11/pkg/ipcache/metadata.go#L519-L567). From my profiling of the agent during these daemonset rollouts, this can result in over 30% more time being spent in that code path.

The CPU usage likely isn't a problem in 1.13+, as #19765 has reworked most of that code. However, while trying to upgrade to 1.12, this is becoming quite the issue: agents in larger clusters are getting overwhelmed with CiliumNode updates, causing significant CPU usage increases (from under 100 millicores to over 1.9 cores).

I've found this to be the easiest way to reproduce the superfluous updates to the CiliumNode object:

  1. In one terminal run: kubectl get ciliumnode CILIUM_NODE_HERE -w -o json | jq '.spec.addresses'
  2. In another, kubectl delete the cilium-agent pod on that node
  3. Wait a bit, and you'll notice one event without the CiliumInternalIP entries listed, followed by another that adds them back (a Go sketch of the same watch follows these steps).
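
For anyone who prefers to watch this programmatically, here is a minimal Go sketch of the same observation. It assumes a kubeconfig path in $KUBECONFIG and the node name as the first argument, and uses the generic dynamic client rather than Cilium's generated clientset, so the field paths are spelled out by hand:

```go
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	nodeName := os.Args[1]

	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// CiliumNode is a cluster-scoped CRD in the cilium.io/v2 API group.
	gvr := schema.GroupVersionResource{Group: "cilium.io", Version: "v2", Resource: "ciliumnodes"}
	w, err := dyn.Resource(gvr).Watch(context.Background(), metav1.ListOptions{
		FieldSelector: "metadata.name=" + nodeName,
	})
	if err != nil {
		panic(err)
	}

	// Print spec.addresses on every event. During an agent restart you should
	// see one event where the CiliumInternalIP entry is missing, immediately
	// followed by one where it is back.
	for ev := range w.ResultChan() {
		obj, ok := ev.Object.(*unstructured.Unstructured)
		if !ok {
			continue
		}
		addrs, _, _ := unstructured.NestedSlice(obj.Object, "spec", "addresses")
		fmt.Printf("%s: %v\n", ev.Type, addrs)
	}
}
```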

I'll update this issue as I discover more relevant information.

Cilium Version

I've confirmed this behaviour on v1.11.18, 1.12.11, and 1.14

Kernel Version

5.15 linux-aws

Kubernetes Version

1.24

Sysdump

Unable

Relevant log output

No response

Anything else?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@jaredledvina jaredledvina added the kind/bug, kind/community-report, and needs/triage labels Aug 10, 2023
@ti-mo ti-mo added the affects/v1.11, affects/v1.12, affects/v1.14, sig/k8s, kind/performance, and sig/agent labels and removed the needs/triage label Aug 14, 2023
@ti-mo
Contributor

ti-mo commented Aug 14, 2023

@jaredledvina Thanks for the investigation! I'm getting some eyes on this.

@christarazi
Member

> The CPU usage likely isn't a problem in 1.13+, as #19765 has reworked most of that code. However, while trying to upgrade to 1.12, this is becoming quite the issue: agents in larger clusters are getting overwhelmed with CiliumNode updates, causing significant CPU usage increases (from under 100 millicores to over 1.9 cores).

I think it's strange that 1.13 is somehow unaffected, because v1.14 should have the same code regarding CiliumNode updates; at the very least, it also has #19765.

@christarazi
Member

Anyway, it seems like the offending code could be `if err := n.mutateNodeResource(nodeResource); err != nil {`. Essentially, mutateNodeResource reconstructs the CiliumNode instead of trying to apply a diff.

One way to validate this theory would be to have mutateNodeResource return a copy of the mutated CiliumNode and perform a DeepEqual against it and the original CiliumNode. If they're not equal, proceed with the update; if they are equal, skip it. One thing to keep in mind is that DeepEqual may return false when the IP addresses are merely in a different order rather than actually differing in content. It's possible that we could sort the IPs to mitigate that, and ordering is my hunch for the culprit.
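
A minimal sketch of that check, using stand-in types rather than the real CiliumNode structs (the NodeAddress type, field names, and needsUpdate helper below are illustrative, not the actual Cilium API):

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// NodeAddress is a stand-in for the entries under spec.addresses.
type NodeAddress struct {
	Type string
	IP   string
}

// sortedCopy returns the addresses in a stable order so that the DeepEqual
// comparison is not defeated by ordering differences alone.
func sortedCopy(addrs []NodeAddress) []NodeAddress {
	out := append([]NodeAddress(nil), addrs...)
	sort.Slice(out, func(i, j int) bool {
		if out[i].Type != out[j].Type {
			return out[i].Type < out[j].Type
		}
		return out[i].IP < out[j].IP
	})
	return out
}

// needsUpdate sketches the proposed check: build the addresses the same way
// mutateNodeResource would, then only issue the Update call if the result
// actually differs from what is already in the CiliumNode.
func needsUpdate(current, mutated []NodeAddress) bool {
	return !reflect.DeepEqual(sortedCopy(current), sortedCopy(mutated))
}

func main() {
	current := []NodeAddress{
		{Type: "CiliumInternalIP", IP: "10.240.0.1"},
		{Type: "InternalIP", IP: "10.0.0.5"},
	}
	// Same contents, different order: no update should be issued.
	mutated := []NodeAddress{
		{Type: "InternalIP", IP: "10.0.0.5"},
		{Type: "CiliumInternalIP", IP: "10.240.0.1"},
	}
	fmt.Println(needsUpdate(current, mutated)) // false
}
```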

@jaredledvina
Author

Hey @christarazi

Thanks so much for taking a look here!

> I think it's strange that 1.13 is somehow unaffected, because v1.14 should have the same code regarding CiliumNode updates; at the very least, it also has #19765.

Sorry for the confusion; I don't currently have a 1.13 install to verify the behavior on. I suspect it sees the unnecessary CiliumNode updates just like 1.14 does. That said, the CPU usage increase caused by those updates, which I've been hunting primarily in 1.11 and 1.12, likely doesn't affect 1.13 & 1.14 because of PR #19765.

@aanm aanm removed the affects/v1.14 label Aug 15, 2023
@jaredledvina
Author

jaredledvina commented Aug 17, 2023

@aanm - To make sure I'm being clear: from my testing, this bug does impact 1.14, unless something else has changed the behavior.

@aanm
Member

aanm commented Aug 18, 2023

@jaredledvina but you wrote the following:

> likely doesn't affect 1.13 & 1.14 because of PR #19765.

@jaredledvina
Author

@aanm - Sorry for the lack of clarity; that statement was in reference to the resulting CPU usage problem that kicked all of this off. That CPU usage issue likely doesn't affect 1.13 & 1.14. The removal and re-addition of the CiliumInternalIP on restart affects all versions. Hemanth has opened a partial fix for this in #27590 against master, which hopefully clarifies the impact here further.

@hemanthmalla
Member

As described in #27590, we still have a race condition between the first CiliumNode update after restart and n.localNode.IPAddresses being populated.
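
A minimal sketch of one way to guard against that race, using stand-in types (the Address type, the CiliumInternalIP constant, and shouldSyncCiliumNode below are illustrative, not the actual agent code):

```go
package main

import "fmt"

// Minimal stand-ins for Cilium's node address types; the real ones live
// under pkg/node, but the names here are illustrative.
type AddressType string

const CiliumInternalIP AddressType = "CiliumInternalIP"

type Address struct {
	Type AddressType
	IP   string
}

// shouldSyncCiliumNode is a hypothetical guard: only publish the CiliumNode
// update once the locally discovered addresses include a CiliumInternalIP,
// so a freshly restarted agent does not first push an address list with the
// CiliumInternalIP missing and then add it back moments later.
func shouldSyncCiliumNode(addrs []Address) bool {
	for _, a := range addrs {
		if a.Type == CiliumInternalIP {
			return true
		}
	}
	return false
}

func main() {
	early := []Address{{Type: "InternalIP", IP: "10.0.0.5"}}
	later := append(early, Address{Type: CiliumInternalIP, IP: "10.240.0.1"})
	fmt.Println(shouldSyncCiliumNode(early)) // false: hold the update
	fmt.Println(shouldSyncCiliumNode(later)) // true: safe to publish
}
```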

@github-actions

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

@github-actions github-actions bot added the stale label Oct 25, 2023

github-actions bot commented Nov 8, 2023

This issue has not seen any activity since it was marked stale.
Closing.

@github-actions github-actions bot closed this as not planned Nov 8, 2023