Cilium Identity not getting deleted sometimes in large scale cluster #26514
Found pushUpdate getting triggered for that identity: cilium/pkg/k8s/resource/resource.go, line 240 in c7f6fa5.
pushUpdate -> upsertEvent -> runHeartbeatUpdater(update event) -> markAlive... not sure how pushUpdate is getting triggered for an identity that's being deleted.
@joamaki do you have any thoughts on how this might occur? It seems like the resource is getting an async update while the corresponding identity is being deleted, which causes it to persist and not get properly garbage collected.
I don't quite see how we could get a delete followed by an upsert from [...]. I'd approach this by trying to create a test case in [...]. @pippolo84 would you have time to look into this?
In a large scale cluster, the loop over the listed identities might, at some point, be working with identities that have already been deleted in the meantime. If such an in-memory identity is seen as alive (because a CES or CEP references it), it is recreated in the [...].
But AFAICT a stale entry in the [...] that is written after [...] can only be written by handling a [...].
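The stale-snapshot scenario described above can be sketched as follows (an illustrative model, not Cilium's actual code): the GC pass iterates over identities listed at the start of the pass, so an identity deleted mid-pass can still appear in the snapshot and be marked alive again if a CES/CEP still references it.

```go
package main

import "fmt"

// markAlivePass models one GC pass over a snapshot of listed identities
// (hypothetical function; not the actual Cilium implementation). Any
// snapshot entry still referenced by a CES/CEP is marked alive, even if
// the identity was deleted after the snapshot was taken.
func markAlivePass(snapshot []string, referenced map[string]bool) map[string]bool {
	alive := map[string]bool{}
	for _, id := range snapshot {
		if referenced[id] {
			alive[id] = true // re-creates a stale entry for deleted identities
		}
	}
	return alive
}

func main() {
	// 40097 was listed before being deleted mid-pass, and a stale CEP
	// reference still points at it.
	snapshot := []string{"17682", "40097"}
	referenced := map[string]bool{"17682": true, "40097": true}

	alive := markAlivePass(snapshot, referenced)
	fmt.Println("stale 40097 marked alive:", alive["40097"])
}
```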
@tamilmani1989 would it be possible for you to reproduce it another time?
Before and after the scaling down. Finally, could you also please share the output of [...]? Thanks in advance!
@pippolo84 Thanks for looking. I scaled down the nodes in that cluster and that identity got deleted. I suspect one of the cilium agents is triggering the update for that identity... let me try to repro this again and share the output you requested.
@pippolo84 I'm able to repro the issue, and this time multiple identities are leaked (the identities in the scale-test namespace are all leaked); I shared details for one of those. Let me know if you need any other details. I'm going to keep this cluster untouched.
Cilium config: to repro the issue faster, I reduced gc-interval and heartbeat-interval; otherwise it happens with the default settings as well.
@pippolo84 any update on this? Do you require any other info from the cluster?
@tamilmani1989 thank you very much for the additional details. 🙏
I've tried to simulate a large scale cluster scenario by inserting increasing artificial delays in the [...]. @tamilmani1989 is it possible to collect and share a sysdump from your test cluster? I suspect the leaking is not related to the GC itself, or at least not only to that. Also, looking again at what you've shared, I've noticed that [...].
That is after the handling of the deletion event in the operator: [...]
Though those timestamps might not be reliable, I'm wondering if the identity was deleted as reported by the operator and somehow recreated later.
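One way to sanity-check that timeline is to compare the timestamps from the operator log output shared in this issue (a small hedged helper, keeping in mind the comment above that the timestamps may not be fully reliable):

```go
package main

import (
	"fmt"
	"time"
)

// happenedAfter reports whether event occurred after reference, given two
// RFC 3339 timestamps as they appear in the operator logs. Illustrative
// helper, not part of any Cilium tooling.
func happenedAfter(event, reference string) bool {
	e, err := time.Parse(time.RFC3339Nano, event)
	if err != nil {
		panic(err)
	}
	r, err := time.Parse(time.RFC3339Nano, reference)
	if err != nil {
		panic(err)
	}
	return e.After(r)
}

func main() {
	// Timestamps for identity 40097 from the logs quoted below in this issue.
	gc := "2023-06-27T16:08:11.168553712Z"        // "Garbage collected identity"
	markAlive := "2023-06-27T16:08:11.177079633Z" // "Marking identity alive"

	fmt.Println("marked alive after GC:", happenedAfter(markAlive, gc))
}
```

Run against the quoted logs, the "Marking identity alive" event does come roughly 8.5ms after the "Garbage collected identity" event, consistent with a delete-then-recreate ordering.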
@pippolo84 sure. Is a cilium agent sysdump from 1 node enough? I have 30 nodes running in this cluster.
Let's start with that; it should be enough. Additionally, it might be good to also look for all entries (in all logs) related to the CiliumIdentity under inspection (like [...]).
I tried to reproduce this in a local kind cluster by applying deployments with many replicas and scaling them down and up, but unfortunately no luck yet. @tamilmani1989 would you be able to run a debug version of Cilium in your cluster? I've already prepared this PR, where I added some logs to better understand how identities are managed by both the agent and the operator.
Then, focusing on one of the leaked identities: [...]
That would be incredibly helpful. Thank you in advance for the help 🙏
@pippolo84 sure, I will try to repro this with the debug build and share logs with you.
This could potentially be involved here: #27340. If you have a reliable repro, could you perhaps try a build with this patch and see if it still happens?
I tried to repro with @pippolo84's debug branch but couldn't repro it. @vipul-21 will be working on the repro as I'm going to be away for some days. @joamaki thanks for the heads-up; we will create the repro first and then try your branch.
@pippolo84 I was able to replicate the issue with the patch you mentioned.
@pippolo84 any update on this? @vipul-21 was able to repro with the patch you gave. Meanwhile we will try to repro with @joamaki's patch.
Hi @tamilmani1989, unfortunately I didn't have cycles to investigate more. If you could do another try with that latest patch, it would be great. Let us know the results and we will try to get back to this as soon as possible. Thank you!
Hi @vipul-21, any chance you were able to track down the underlying issue here?
@joestringer Not really. The last update from my end was that the issue was replicated using the patch mentioned above by @pippolo84, but I was not able to track down the reason behind it.
I don't know if there's a canonical way to do this in k8s, but you could imagine writing a little script to iterate through the pod names in [...]. It seems like with Fabio's patch there were additional logs in the most recent reproducer that look like this: [...]
Does that help? And as far as I follow, there was a patch from Jussi which touched this, but we also have no information about whether that helped or not. It seems like we need to either test whether this problem still exists after recent changes in the tree, or create another build with additional logs from cilium-operator to gather more information for further investigation.
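The little script suggested above could be sketched along these lines (a hypothetical helper, not an existing tool): given the namespace recorded on each CiliumIdentity and the set of namespaces that still have running pods, flag identities that should have been garbage collected. In the reproducer below, identity 40097 belongs to new-testing, which has no pods or CEPs left.

```go
package main

import (
	"fmt"
	"sort"
)

// leakedIdentities returns identities whose namespace no longer has any
// running pods (hypothetical helper; inputs would come from e.g.
// "kubectl get ciliumidentity" and "kubectl get pods -A").
func leakedIdentities(idNamespace map[string]string, nsWithPods map[string]bool) []string {
	var leaked []string
	for id, ns := range idNamespace {
		if !nsWithPods[ns] {
			leaked = append(leaked, id)
		}
	}
	sort.Strings(leaked) // deterministic output across map iteration orders
	return leaked
}

func main() {
	ids := map[string]string{"40097": "new-testing", "42123": "kube-system"}
	podsByNS := map[string]bool{"kube-system": true}

	fmt.Println("leaked:", leakedIdentities(ids, podsByNS)) // leaked: [40097]
}
```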
Not actively working on replicating the issue.
Is there an existing issue for this?
What happened?
Created an AKS Azure CNI Powered by Cilium cluster with 1k nodes and 1000 deployments, each with 100 pods. Then scaled the deployments down to 0 and expected the Cilium identities to be removed within a few hours. I noticed a few identities not getting removed even after days. To debug, I added a few more logs in cilium-operator to find where the problem is. I found that identity deletion is triggered and then the identity is marked alive. There are no pods/CEPs running in that namespace, so I'm not sure how the identity is being marked alive. I added a log to find where the identity is being marked alive and found it happens in the runHeartbeatUpdater function:
cilium/operator/identitygc/crd_gc.go
Line 49 in d1162cc
Pulled latest cilium master and added logs on top of it and then created an image out of it.
k get ciliumidentity
NAME NAMESPACE AGE
17682 kube-system 13d
19740 kube-system 13d
1978 kube-system 13d
20776 kube-system 54d
30723 kube-system 13d
3647 kube-system 54d
40097 new-testing 2m49s
42123 kube-system 54d
k get pods -n new-testing
No resources found in new-testing namespace.
k get cep -n new-testing
No resources found in new-testing namespace.
Operator logs:
2023-06-27T16:08:11.157678430Z level=debug msg="Deleting unused identity; marked for deletion at 2023-06-27T16:07:11.142122029Z" identity="&{{ } {40097 f0c65750-fa9e-418a-9c22-05ffb16040e9 526408139 1 2023-06-27 16:05:41 +0000 UTC map[app:new-testing io.cilium.k8s.policy.cluster:default io.cilium.k8s.policy.serviceaccount:default io.kubernetes.pod.namespace:new-testing is-real:true name:real-dep-00541 real-dep-lab-00541-00001:val real-dep-lab-00541-00002:val real-dep-lab-00541-00003:val real-dep-lab-00541-00004:val real-dep-lab-00541-00005:val] map[io.cilium.heartbeat:2023-06-27T16:07:11.142122029Z] [] [] [{cilium-agent Update cilium.io/v2 2023-06-27 16:05:41 +0000 UTC FieldsV1 {"f:metadata":{"f:labels":{".":{},"f:app":{},"f:io.cilium.k8s.policy.cluster":{},"f:io.cilium.k8s.policy.serviceaccount":{},"f:io.kubernetes.pod.namespace":{},"f:is-real":{},"f:name":{},"f:real-dep-lab-00541-00001":{},"f:real-dep-lab-00541-00002":{},"f:real-dep-lab-00541-00003":{},"f:real-dep-lab-00541-00004":{},"f:real-dep-lab-00541-00005":{}}},"f:security-labels":{".":{},"f:k8s:app":{},"f:k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name":{},"f:k8s:io.cilium.k8s.policy.cluster":{},"f:k8s:io.cilium.k8s.policy.serviceaccount":{},"f:k8s:io.kubernetes.pod.namespace":{},"f:k8s:is-real":{},"f:k8s:name":{},"f:k8s:real-dep-lab-00541-00001":{},"f:k8s:real-dep-lab-00541-00002":{},"f:k8s:real-dep-lab-00541-00003":{},"f:k8s:real-dep-lab-00541-00004":{},"f:k8s:real-dep-lab-00541-00005":{}}} } {cilium-operator-generic Update cilium.io/v2 2023-06-27 16:07:11 +0000 UTC FieldsV1 {"f:metadata":{"f:annotations":{".":{},"f:io.cilium.heartbeat":{}}}} }]} map[k8s:app:new-testing k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name:new-testing k8s:io.cilium.k8s.policy.cluster:default k8s:io.cilium.k8s.policy.serviceaccount:default k8s:io.kubernetes.pod.namespace:new-testing k8s:is-real:true k8s:name:real-dep-00541 k8s:real-dep-lab-00541-00001:val k8s:real-dep-lab-00541-00002:val 
k8s:real-dep-lab-00541-00003:val k8s:real-dep-lab-00541-00004:val k8s:real-dep-lab-00541-00005:val]}" subsys=identity-heartbeat
2023-06-27T16:08:11.168553712Z level=debug msg="Garbage collected identity" identity=40097 subsys=identity-heartbeat
2023-06-27T16:08:11.168582313Z level=info msg="CE associated with identity. so mark true" identity=42123 subsys=identity-heartbeat
2023-06-27T16:08:11.168588113Z level=debug msg="Marking identity alive" identity=42123 subsys=identity-heartbeat
2023-06-27T16:08:11.168593213Z level=info msg="CE associated with identity. so mark true" identity=17682 subsys=identity-heartbeat
2023-06-27T16:08:11.168623414Z level=debug msg="Marking identity alive" identity=17682 subsys=identity-heartbeat
2023-06-27T16:08:11.168628814Z level=info msg="CE associated with identity. so mark true" identity=19740 subsys=identity-heartbeat
2023-06-27T16:08:11.168633314Z level=debug msg="Marking identity alive" identity=19740 subsys=identity-heartbeat
2023-06-27T16:08:11.168640214Z level=debug msg="Controller func execution time: 11.514098ms" name=crd-identity-gc subsys=controller uuid=42621a4e-9dc8-4da9-9622-d0c3dac5eef3
2023-06-27T16:08:11.168804718Z level=debug msg="Deleting identity in heartbeat lifesign table" identity=40097 subsys=identity-heartbeat
(added by me) 2023-06-27T16:08:11.177053432Z level=info msg="heartbeat updater triggered" identity=40097 subsys=identity-heartbeat
2023-06-27T16:08:11.177079633Z level=debug msg="Marking identity alive" identity=40097 subsys=identity-heartbeat
Cilium Version
latest cilium (master)
Kernel Version
5.15
Kubernetes Version
1.25.6
Sysdump
No response
Relevant log output
No response
Anything else?
No response
Code of Conduct