
Cilium Identity not getting deleted sometimes in large scale cluster #26514

Closed
tamilmani1989 opened this issue Jun 27, 2023 · 25 comments
Labels
area/operator (Impacts the cilium-operator component), kind/bug (This is a bug in the Cilium logic), kind/community-report (This was reported by a user in the Cilium community, e.g. via Slack), need-more-info (More information is required to further debug or fix the issue), sig/agent (Cilium agent related)

Comments

@tamilmani1989
Contributor

tamilmani1989 commented Jun 27, 2023

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Created an AKS Azure CNI Powered by Cilium cluster with 1k nodes and 1000 deployments, each with 100 pods. Then I scaled the deployments down to 0 and expected the Cilium identities to be removed within a few hours. I noticed that a few identities were not removed even after days. To debug, I added a few more logs in the cilium-operator to find where the problem is. I found that the identity deletion is triggered and is then followed by the identity being marked alive again. There are no pods/CEPs running in that namespace, so I'm not sure how the identity is being marked alive. I added a log to find where the identity is marked alive and found it happens in the runHeartbeatUpdater function:

igc.heartbeatStore.markAlive(event.Object.Name, time.Now())

I pulled the latest cilium master, added logs on top of it, and then built an image from it.

k get ciliumidentity
NAME    NAMESPACE     AGE
17682   kube-system   13d
19740   kube-system   13d
1978    kube-system   13d
20776   kube-system   54d
30723   kube-system   13d
3647    kube-system   54d
40097   new-testing   2m49s
42123   kube-system   54d

k get pods -n new-testing
No resources found in new-testing namespace.

k get cep -n new-testing
No resources found in new-testing namespace.

Operator logs:

2023-06-27T16:08:11.157678430Z level=debug msg="Deleting unused identity; marked for deletion at 2023-06-27T16:07:11.142122029Z" identity="&{{ } {40097 f0c65750-fa9e-418a-9c22-05ffb16040e9 526408139 1 2023-06-27 16:05:41 +0000 UTC map[app:new-testing io.cilium.k8s.policy.cluster:default io.cilium.k8s.policy.serviceaccount:default io.kubernetes.pod.namespace:new-testing is-real:true name:real-dep-00541 real-dep-lab-00541-00001:val real-dep-lab-00541-00002:val real-dep-lab-00541-00003:val real-dep-lab-00541-00004:val real-dep-lab-00541-00005:val] map[io.cilium.heartbeat:2023-06-27T16:07:11.142122029Z] [] [] [{cilium-agent Update cilium.io/v2 2023-06-27 16:05:41 +0000 UTC FieldsV1 {"f:metadata":{"f:labels":{".":{},"f:app":{},"f:io.cilium.k8s.policy.cluster":{},"f:io.cilium.k8s.policy.serviceaccount":{},"f:io.kubernetes.pod.namespace":{},"f:is-real":{},"f:name":{},"f:real-dep-lab-00541-00001":{},"f:real-dep-lab-00541-00002":{},"f:real-dep-lab-00541-00003":{},"f:real-dep-lab-00541-00004":{},"f:real-dep-lab-00541-00005":{}}},"f:security-labels":{".":{},"f:k8s:app":{},"f:k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name":{},"f:k8s:io.cilium.k8s.policy.cluster":{},"f:k8s:io.cilium.k8s.policy.serviceaccount":{},"f:k8s:io.kubernetes.pod.namespace":{},"f:k8s:is-real":{},"f:k8s:name":{},"f:k8s:real-dep-lab-00541-00001":{},"f:k8s:real-dep-lab-00541-00002":{},"f:k8s:real-dep-lab-00541-00003":{},"f:k8s:real-dep-lab-00541-00004":{},"f:k8s:real-dep-lab-00541-00005":{}}} } {cilium-operator-generic Update cilium.io/v2 2023-06-27 16:07:11 +0000 UTC FieldsV1 {"f:metadata":{"f:annotations":{".":{},"f:io.cilium.heartbeat":{}}}} }]} map[k8s:app:new-testing k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name:new-testing k8s:io.cilium.k8s.policy.cluster:default k8s:io.cilium.k8s.policy.serviceaccount:default k8s:io.kubernetes.pod.namespace:new-testing k8s:is-real:true k8s:name:real-dep-00541 k8s:real-dep-lab-00541-00001:val k8s:real-dep-lab-00541-00002:val k8s:real-dep-lab-00541-00003:val k8s:real-dep-lab-00541-00004:val k8s:real-dep-lab-00541-00005:val]}" subsys=identity-heartbeat
2023-06-27T16:08:11.168553712Z level=debug msg="Garbage collected identity" identity=40097 subsys=identity-heartbeat
2023-06-27T16:08:11.168582313Z level=info msg="CE associated with identity. so mark true" identity=42123 subsys=identity-heartbeat
2023-06-27T16:08:11.168588113Z level=debug msg="Marking identity alive" identity=42123 subsys=identity-heartbeat
2023-06-27T16:08:11.168593213Z level=info msg="CE associated with identity. so mark true" identity=17682 subsys=identity-heartbeat
2023-06-27T16:08:11.168623414Z level=debug msg="Marking identity alive" identity=17682 subsys=identity-heartbeat
2023-06-27T16:08:11.168628814Z level=info msg="CE associated with identity. so mark true" identity=19740 subsys=identity-heartbeat
2023-06-27T16:08:11.168633314Z level=debug msg="Marking identity alive" identity=19740 subsys=identity-heartbeat
2023-06-27T16:08:11.168640214Z level=debug msg="Controller func execution time: 11.514098ms" name=crd-identity-gc subsys=controller uuid=42621a4e-9dc8-4da9-9622-d0c3dac5eef3
2023-06-27T16:08:11.168804718Z level=debug msg="Deleting identity in heartbeat lifesign table" identity=40097 subsys=identity-heartbeat
(added by me) 2023-06-27T16:08:11.177053432Z level=info msg="heartbeat updater triggered" identity=40097 subsys=identity-heartbeat
2023-06-27T16:08:11.177079633Z level=debug msg="Marking identity alive" identity=40097 subsys=identity-heartbeat

Cilium Version

latest cilium (master)

Kernel Version

5.15

Kubernetes Version

1.25.6

Sysdump

No response

Relevant log output

No response

Anything else?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@tamilmani1989 tamilmani1989 added kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels Jun 27, 2023
@ldelossa ldelossa added sig/agent Cilium agent related. and removed needs/triage This issue requires triaging to establish severity and next steps. labels Jun 27, 2023
@tamilmani1989
Contributor Author

I found pushUpdate being triggered for that identity 40097:

func (r *resource[T]) pushUpdate(key Key) {

pushUpdate->upsertEvent->runHeartbeatUpdater(update event)->markAlive...

I'm not sure how pushUpdate is being triggered for an identity that's being deleted.

@joestringer joestringer added the affects/v1.14 This issue affects v1.14 branch label Jul 6, 2023
@joestringer
Member

@joamaki do you have any thoughts on how this might occur? It seems like the resource is getting an async update while the corresponding identity is being deleted, which causes it to persist and not get properly garbage collected.

@joamaki
Contributor

joamaki commented Jul 6, 2023

I don't quite see how we could get a delete followed by an upsert from Resource[T], but looking at crd_gc.go we might have a potential race between gc() and runHeartbeatUpdater(), which both manipulate the heartbeatStore. For example, gc() might act based on the latest data and then runHeartbeatUpdater() might follow with an action based on stale data, or the other way around.

I'd approach this by trying to create a test case in gc_test.go that churns through a lot of identities, seeing if this race can be triggered, and then going from there. I would not rule out a bug in Resource[T], but I would start from the assumption that crd_gc.go has a race that causes this. It's likely worth refactoring it to work on top of a single consumer of the event stream to make sure actions are always consistent (e.g. don't use Store at all).
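To make the suspected interleaving concrete, here is a minimal, self-contained Go sketch of the pattern (the names lifesignStore, markAlive and remove are illustrative only, not the actual crd_gc.go types): an upsert event that is processed only after the deletion recreates the lifesign entry the GC just removed.

package main

import (
	"fmt"
	"sync"
	"time"
)

type lifesignStore struct {
	mu        sync.Mutex
	lastAlive map[string]time.Time
}

func (s *lifesignStore) markAlive(id string, t time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.lastAlive[id] = t
}

func (s *lifesignStore) remove(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.lastAlive, id)
}

func main() {
	store := &lifesignStore{lastAlive: map[string]time.Time{"40097": time.Now()}}

	// An upsert event for the identity is already queued before the GC run,
	// as an informer-backed event stream might deliver it.
	events := make(chan string, 1)
	events <- "40097"

	// gc() path: the identity is considered unused, deleted via the API, and
	// its entry is removed from the lifesign store.
	store.remove("40097")

	// runHeartbeatUpdater() path: the stale upsert event is only processed
	// now, and markAlive recreates the entry the GC just removed.
	store.markAlive(<-events, time.Now())

	fmt.Println("lifesign entries left:", len(store.lastAlive)) // 1: the "deleted" identity is back
}

In the real operator the two paths run concurrently, so whether this ordering happens depends on timing, which would be consistent with the issue only showing up at scale.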

@pippolo84 would you have time to look into this?

@pippolo84
Member

I don't quite see how we could get a delete followed by upsert from Resource[T], but looking at crd_gc.go we might have a potential race between gc() and runHeartbeatUpdater() which both manipulate the heartbeatStore, e.g. we might have gc() doing an action based on latest data and then runHeartbeatUpdate() following with an action based on stale data, or other way around.

In a large-scale cluster, the loop over the listed identities might, at some point, be working with identities that have already been deleted in the meantime. When that happens, if the in-memory identity is seen as alive (because a CES or CEP references it), it is recreated in the lastLifeSign map by the call to markAlive in gc.
This produces stale entries in the lastLifeSign map, but all of these should eventually be deleted by the inner heartbeat store GC:

// gc removes all lifesign entries which have exceeded the heartbeat by a large
// amount. This happens when the CiliumIdentity is deleted before the
// CiliumEndpoint that refers to it. In that case, the lifesign entry will
// continue to exist. Remove it once has not been updated for a long time.
func (i *heartbeatStore) gc() {
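For reference, a rough, self-contained sketch of what such a time-based sweep amounts to; the field names below (lastLifesign, heartbeatTimeout) are assumptions made for illustration, not the actual Cilium implementation:

package main

import (
	"fmt"
	"sync"
	"time"
)

type heartbeatStore struct {
	mutex            sync.Mutex
	lastLifesign     map[string]time.Time
	heartbeatTimeout time.Duration
}

// gc drops lifesign entries that have not been refreshed for far longer than
// the heartbeat timeout, i.e. leftovers of identities deleted while a
// CiliumEndpoint still referenced them.
func (i *heartbeatStore) gc() {
	i.mutex.Lock()
	defer i.mutex.Unlock()
	cutoff := time.Now().Add(-2 * i.heartbeatTimeout)
	for id, last := range i.lastLifesign {
		if last.Before(cutoff) {
			delete(i.lastLifesign, id)
		}
	}
}

func main() {
	s := &heartbeatStore{
		lastLifesign:     map[string]time.Time{"40097": time.Now().Add(-time.Hour)},
		heartbeatTimeout: 2 * time.Minute,
	}
	s.gc()
	fmt.Println("stale entries left:", len(s.lastLifesign)) // 0: the hour-old entry was dropped
}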

But AFAICT, having a stale entry in lastLifeSign should not lead to recreating a deleted identity.
There is evidence that the identity has been deleted. The log reports:

2023-06-27T16:08:11.168553712Z level=debug msg="Garbage collected identity" identity=40097 subsys=identity-heartbeat

that is written after clientset.Delete() returned without error. Also, we are receiving a resource.Delete event from Resource[T]. That's because this entry:

2023-06-27T16:08:11.168804718Z level=debug msg="Deleting identity in heartbeat lifesign table" identity=40097 subsys=identity-heartbeat

can be written only by handling a resource.Delete. So, the identity 40097 should have been deleted, at some point.

@pippolo84
Member

@tamilmani1989 would it be possible for you to reproduce it another time?
If so, I'd be curious to see the output of

  • k get ciliumidentity

Before and after the scaling down.
It might be good to also have the output of kubectl get ciliumidentity <name> -n <namespace> -o yaml for the identity you're focusing the analysis on, again before and after the scaling down.

Finally, could you also please share the output of kubectl get cm cilium-config -n kube-system -o yaml? There are some options for the identity GC that might help the analysis.

Thanks in advance!

@aanm aanm added feature/k8s-ingress and removed affects/v1.14 This issue affects v1.14 branch labels Jul 7, 2023
@tamilmani1989
Contributor Author

@pippolo84 Thanks for looking. I scaled down the nodes in that cluster and that identity got deleted. I suspect one of the cilium agents is triggering the update for that identity... let me try to repro this again and share the output you requested.

@tamilmani1989
Contributor Author

tamilmani1989 commented Jul 11, 2023

@pippolo84 I was able to repro the issue, and this time multiple identities are leaked (all the identities in the scale-test namespace are leaked); I've shared details for one of them below. Let me know if you need any other details. I'm going to keep this cluster untouched.

2023-07-11T06:44:47.456109691Z level=debug msg="Garbage collected identity" identity=12314 subsys=identity-heartbeat
2023-07-11T06:44:47.457268499Z level=debug msg="Deleting identity in heartbeat lifesign table" identity=12314 subsys=identity-heartbeat
2023-07-11T06:44:47.466164757Z level=info msg="heartbeat updater triggered" identity=12314 subsys=identity-heartbeat
2023-07-11T06:44:47.466179657Z level=debug msg="Marking identity alive" identity=12314 subsys=identity-heartbeat
~/go/src/github.com/cilium$ k get ciliumidentity
NAME    NAMESPACE     AGE
10195   default       59d
12314   scale-test    105s
16144   scale-test    3m45s
34192   scale-test    3m45s
34413   cilium-test   59d
34685   kube-system   21d
39781   kube-system   21d
41567   scale-test    3m45s
45004   cilium-test   59d
51455   scale-test    105s
56025   scale-test    3m45s
56324   cilium-test   59d
58986   scale-test    3m45s
5940    cilium-test   59d
60858   scale-test    3m45s
63009   kube-system   21d
983     scale-test    105s
$ k get ciliumidentity -n  scale-test 12314 -oyaml
apiVersion: cilium.io/v2
kind: CiliumIdentity
metadata:
  creationTimestamp: "2023-07-11T06:48:47Z"
  generation: 1
  labels:
    app: scale-test
    io.cilium.k8s.policy.cluster: default
    io.cilium.k8s.policy.serviceaccount: default
    io.kubernetes.pod.namespace: scale-test
    is-real: "true"
    name: real-dep-00941
    real-dep-lab-00941-00001: val
    real-dep-lab-00941-00002: val
    real-dep-lab-00941-00003: val
    real-dep-lab-00941-00004: val
    real-dep-lab-00941-00005: val
    shared-lab-00001: val
    shared-lab-00002: val
    shared-lab-00003: val
  name: "12314"
  resourceVersion: "23441966"
  uid: 3baa2be7-f9b9-45d6-b389-0f0a966af77d
security-labels:
  k8s:app: scale-test
  k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name: scale-test
  k8s:io.cilium.k8s.policy.cluster: default
  k8s:io.cilium.k8s.policy.serviceaccount: default
  k8s:io.kubernetes.pod.namespace: scale-test
  k8s:is-real: "true"
  k8s:name: real-dep-00941
  k8s:real-dep-lab-00941-00001: val
  k8s:real-dep-lab-00941-00002: val
  k8s:real-dep-lab-00941-00003: val
  k8s:real-dep-lab-00941-00004: val
  k8s:real-dep-lab-00941-00005: val
  k8s:shared-lab-00001: val
  k8s:shared-lab-00002: val
  k8s:shared-lab-00003: val

Cilium Config:

To repro the issue faster, I reduced the GC and heartbeat intervals (identity-gc-interval, identity-heartbeat-timeout); otherwise it happens with the default settings as well.

 kubectl get cm cilium-config -n kube-system -o yaml

apiVersion: v1
data:
  agent-not-ready-taint-key: node.cilium.io/agent-not-ready
  arping-refresh-period: 30s
  auto-direct-node-routes: "false"
  bpf-lb-external-clusterip: "false"
  bpf-lb-map-max: "65536"
  bpf-lb-mode: snat
  bpf-map-dynamic-size-ratio: "0.0025"
  bpf-policy-map-max: "16384"
  bpf-root: /sys/fs/bpf
  cgroup-root: /run/cilium/cgroupv2
  cilium-endpoint-gc-interval: 5m0s
  cluster-id: "0"
  cluster-name: default
  debug: "true"
  disable-cnp-status-updates: "true"
  disable-endpoint-crd: "false"
  enable-auto-protect-node-port-range: "true"
  enable-bgp-control-plane: "false"
  enable-bpf-clock-probe: "true"
  enable-bpf-masquerade: "false"
  enable-endpoint-health-checking: "false"
  enable-endpoint-routes: "true"
  enable-health-check-nodeport: "true"
  enable-health-checking: "true"
  enable-host-legacy-routing: "true"
  enable-hubble: "false"
  enable-ipv4: "true"
  enable-ipv4-masquerade: "false"
  enable-ipv6: "false"
  enable-ipv6-masquerade: "false"
  enable-k8s-terminating-endpoint: "true"
  enable-l2-neigh-discovery: "true"
  enable-l7-proxy: "false"
  enable-local-node-route: "false"
  enable-local-redirect-policy: "false"
  enable-metrics: "true"
  enable-policy: default
  enable-remote-node-identity: "true"
  enable-session-affinity: "true"
  enable-svc-source-range-check: "true"
  enable-vtep: "false"
  enable-well-known-identities: "false"
  enable-xt-socket-fallback: "true"
  identity-allocation-mode: crd
  identity-gc-interval: 2m0s
  identity-heartbeat-timeout: 2m0s
  install-iptables-rules: "true"
  install-no-conntrack-iptables-rules: "false"
  ipam: delegated-plugin
  kube-proxy-replacement: strict
  kube-proxy-replacement-healthz-bind-address: ""
  local-router-ipv4: 169.254.23.0
  metrics: +cilium_bpf_map_pressure
  monitor-aggregation: medium
  monitor-aggregation-flags: all
  monitor-aggregation-interval: 5s
  node-port-bind-protection: "true"
  nodes-gc-interval: 5m0s
  operator-api-serve-addr: 127.0.0.1:9234
  operator-prometheus-serve-addr: :9963
  preallocate-bpf-maps: "false"
  procfs: /host/proc
  prometheus-serve-addr: :9962
  remove-cilium-node-taints: "true"
  routing-mode: native
  set-cilium-is-up-condition: "true"
  sidecar-istio-proxy-image: cilium/istio_proxy
  synchronize-k8s-nodes: "true"
  tofqdns-dns-reject-response-code: refused
  tofqdns-enable-dns-compression: "true"
  tofqdns-endpoint-max-ip-per-hostname: "50"
  tofqdns-idle-connection-grace-period: 0s
  tofqdns-max-deferred-connection-deletes: "10000"
  tofqdns-min-ttl: "3600"
  tofqdns-proxy-response-max-delay: 100ms
  unmanaged-pod-watcher-interval: "15"
  vtep-cidr: ""
  vtep-endpoint: ""
  vtep-mac: ""
  vtep-mask: ""
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: cilium
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2023-05-12T17:02:09Z"
  labels:
    app.kubernetes.io/managed-by: Helm
    helm.toolkit.fluxcd.io/name: cilium-adapter-helmrelease
    helm.toolkit.fluxcd.io/namespace: 645e6f58e7466200016979e0
  name: cilium-config
  namespace: kube-system
  resourceVersion: "23235573"
  uid: 82f5de40-d2b2-4499-93e6-ebddd7b809ac

@tamilmani1989
Copy link
Contributor Author

@pippolo84 any update on this? Do you require any other info from the cluster?

@pippolo84
Member

@tamilmani1989 thank you very much for the additional details. 🙏
I should have enough cycles to take a deeper look at this early next week. Skimming through the info you reported, I think it should be enough; no need to do anything else for now.

@pippolo84
Member

I've tried to simulate a large-scale cluster scenario by inserting increasing artificial delays in the resource.Upsert event handler. I was able to reproduce the sequence of log messages with a "Marking identity alive" following a "Garbage collected identity", but that did not result in leaked identities, as expected from the analysis of the GC inner workings above.

@tamilmani1989 is it possible to collect and share a sysdump from your test cluster? I suspect that the leak is not related to the GC itself, or at least not only to that.

Also, looking again at what you've shared, I've noticed that k get ciliumidentity -n scale-test 12314 -oyaml is reporting:

creationTimestamp: "2023-07-11T06:48:47Z"

That is after the handling of the deletion event in the operator:

2023-07-11T06:44:47.457268499Z level=debug msg="Deleting identity in heartbeat lifesign table" identity=12314 subsys=identity-heartbeat

Though those timestamps might not be reliable, I'm wondering whether the identity was deleted as reported by the operator and somehow recreated later.

@tamilmani1989
Contributor Author

@pippolo84 sure. Is a cilium-agent sysdump from 1 node enough? I have 30 nodes running in this cluster.

@pippolo84
Member

@pippolo84 sure. Is a cilium-agent sysdump from 1 node enough? I have 30 nodes running in this cluster.

Let's start with that; it should be enough. Additionally, it might be good to also look for all entries (in all logs) related to the CiliumIdentity under inspection (like 12314 in the case above).
Thanks again for the help. 🙏

@tamilmani1989
Contributor Author

@pippolo84
Member

Quick update: after looking at the code together with @aanm and @marseel (thanks to both 🙏 ), we think there might be an issue with the endpoint release codepath in case of high concurrency. I'm going to set up a local cluster to prove that hypothesis.

@pippolo84
Member

pippolo84 commented Aug 4, 2023

I tried to reproduce this in a local kind cluster by applying deployments with many replicas and scaling them down and up, but unfortunately no luck yet.

@tamilmani1989 would you be able to run a debug version of Cilium in your cluster? I've already prepared this PR where I added some logs to better understand how identities are managed by both the agent and the operator.
We would need to try this version in your large scale cluster and then collect:

  • the list of commands used to repro the issue
  • the identities leaked and the namespace

Then, focusing on one of the leaked identities:

  • output of kubectl get -o yaml for that identity
  • the full operator logs
  • the full logs of all the agents where you can find an occurrence of that identity
  • if possible, a sysdump for the node where the agent is recreating/updating that identity

That would be incredibly helpful. Thank you in advance for the help 🙏

@tamilmani1989
Contributor Author

@pippolo84 sure, I will try to repro this with the debug build and share the logs with you.

@joamaki
Contributor

joamaki commented Aug 17, 2023

This could potentially be involved here: #27340. If you have a reliable repro, could you perhaps try a build with this patch and see if it still happens?

@tamilmani1989
Contributor Author

tamilmani1989 commented Aug 18, 2023

I tried to repro with @pippolo84's debug branch but couldn't reproduce it. @vipul-21 will be working on the repro as I'm going to be away for some days. @joamaki Thanks for the heads-up; we will create a repro first and then try your branch.

@vipul-21
Contributor

vipul-21 commented Sep 1, 2023

@pippolo84 I was able to replicate the issue with the patch you mentioned.

  • Steps to reproduce
    I used the same cluster that @tamilmani1989 used: an AKS cluster with Cilium (~980 nodes).
    I created 970 deployments, each with 100 pods, in the namespace new-testing.
    I deleted the namespace after the deployments were done.

  • The leaked identities are in the attached leaked_identities file; the namespace is new-testing.
    One of those identities: 4290

  • output of kubectl get -o yaml for that identity.

apiVersion: cilium.io/v2
kind: CiliumIdentity
metadata:
  creationTimestamp: "2023-08-31T15:40:35Z"
  generation: 1
  labels:
    app: new-testing
    io.cilium.k8s.policy.cluster: default
    io.cilium.k8s.policy.serviceaccount: default
    io.kubernetes.pod.namespace: new-testing
    is-real: "true"
    name: real-dep-00932
    real-dep-lab-00932-00001: val
    real-dep-lab-00932-00002: val
    real-dep-lab-00932-00003: val
    real-dep-lab-00932-00004: val
    real-dep-lab-00932-00005: val
  name: "4290"
  resourceVersion: "746642258"
  uid: f0158eca-a036-420a-abc6-b0e5e188ff3c
security-labels:
  k8s:app: new-testing
  k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name: new-testing
  k8s:io.cilium.k8s.policy.cluster: default
  k8s:io.cilium.k8s.policy.serviceaccount: default
  k8s:io.kubernetes.pod.namespace: new-testing
  k8s:is-real: "true"
  k8s:name: real-dep-00932
  k8s:real-dep-lab-00932-00001: val
  k8s:real-dep-lab-00932-00002: val
  k8s:real-dep-lab-00932-00003: val
  k8s:real-dep-lab-00932-00004: val
  k8s:real-dep-lab-00932-00005: val
  • the full operator logs are attached
  • the full logs of all the agents where an occurrence of that identity can be found: there are 970 agents, so I need to see if I can get hold of the right agents. Is there a way I can get all the cilium-agent logs and grep for the identity? (I can get the sysdump too then.)
    cilium_opertor.txt
    leaked_identities.txt
    cilium_identity_after_delete.txt

@tamilmani1989
Contributor Author

@pippolo84 any update on this? @vipul-21 was able to repro with the patch you gave. Meanwhile, we will try to repro with @joamaki's patch.

@pippolo84
Member

Hi @tamilmani1989, unfortunately I didn't have cycles to investigate further. If you could do another try with that latest patch, it would be great. Let us know the results and we will try to get back to this as soon as possible. Thank you!

@joestringer
Member

Hi @vipul-21 , any chance you were able to track down the underlying issue here?

@vipul-21
Contributor

vipul-21 commented Mar 4, 2024

@joestringer Not really; the last update from my end was that the issue was replicated using the patch mentioned above by @pippolo84, but I was not able to track down the reason behind it.

@joestringer
Member

Is there a way I can get all the cilium-agent logs and grep for the identity ? (Can get the sysdump too then)

I don't know if there's a canonical way to do this in k8s, but you could imagine writing a little script that iterates through the pod names from kubectl -n kube-system get pods -l "k8s-app=cilium" -o jsonpath='{ $..items[*].metadata.name }' and runs kubectl -n kube-system logs $pod --timestamps | grep foo for each one.

It seems like with Fabio's patch there were additional logs in the most recent reproducer that look like this:

level=info msg="Marking identity alive in lastLifesign" identity=4290 subsys=identity-heartbeat
level=info msg="Looked up endpoints using the identity" endpoints="[]" identity=4290 subsys=watchers
level=info msg="Looked up endpoints using the identity" endpoints="[]" identity=4290 subsys=watchers
level=info msg="Marking identity for later deletion" identity=4290 k8sUID=f0158eca-a036-420a-abc6-b0e5e188ff3c subsys=identity-heartbeat
level=debug msg="Updated identity" identity=4290 subsys=identity-heartbeat
level=info msg="Marked identity for later deletion" identity=4290 k8sUID=f0158eca-a036-420a-abc6-b0e5e188ff3c subsys=identity-heartbeat timestamp="2023-08-31T16:01:48.640691252Z"
level=info msg="Marking identity alive in lastLifesign" identity=4290 subsys=identity-heartbeat
level=info msg="Marking identity alive in lastLifesign" identity=4290 subsys=identity-heartbeat

Does that help?

And as I understand it, there was a patch from Jussi which touched this area, but we also have no information about whether that helped or not.

It seems like we need to either test whether this problem still exists after recent changes in the tree, or create another build with additional logs from cilium-operator to gather more information to investigate further.

@joestringer joestringer added need-more-info More information is required to further debug or fix the issue. area/operator Impacts the cilium-operator component and removed feature/k8s-ingress labels Mar 4, 2024
@vipul-21
Contributor

I'm not actively working on replicating the issue.
@joestringer Thanks for the help; I will try to use a script to get the logs from all the cilium agents. Feel free to close the issue, and I will reopen it if I am able to replicate it. (Already synced with @tamilmani1989 on this.)

@joestringer closed this as not planned (won't fix, can't repro, duplicate, stale) on Mar 12, 2024