operator: fix deadlock when running in kvstore mode #24631
When running in kvstore mode, the start hook of the identity GC cell blocks until the kvstore client has been initialized, which is performed by the legacyCell start hook. Given that the identity GC cell was registered first, and there are no explicit dependencies among the two, its start hook was also executed first, causing a deadlock.

This commit changes the order in which the cells are registered as a workaround, until the kvstore is refactored into a proper cell.

The current hooks execution order is the following:

function="gops.registerGopsHooks.func1 (cell.go:44)"
function="client.(*compositeClientset).onStart"
function="cmd.registerOperatorHooks.func1 (root.go:142)"
function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2alpha1.CiliumLoadBalancerIPPool].Start"
function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.Service].Start"
function="*lbipam.LBIPAM.Start"
function="*resource.resource[*k8s.io/api/core/v1.Node].Start"
function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.CiliumNode].Start"
function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.Namespace].Start"
function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.CiliumIdentity].Start"
function="cmd.(*legacyOnLeader).onStart"
function="identitygc.registerGC.func1 (gc.go:107)"

Fixes: b115951 ("operator: Refactor cilium identities GC to a cell")

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Thank you! Do we need to backport this to 1.13?
No, since the PR introducing that cell (#22892) was not backported to 1.13.
/test
/ci-gke
/ci-multicluster
/ci-gke
/ci-multicluster
/ci-awscni
All previous failures were due to the GitHub actions issues of yesterday
/test-runtime Hit known flake #23495
/test-1.25-4.19 Hit known flake: #23236
/test-1.24-5.4 Hit new flake #24643
Job 'Cilium-PR-K8s-1.24-kernel-5.4' failed.
/test-1.24-5.4 Hit known flake #16122
/test-1.24-5.4 Hit known flake #24573
No backport is required, since the original commit is not present in any stable branch.