Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

operator: fix deadlock when running in kvstore mode #24631

Merged
merged 1 commit into from Mar 31, 2023

Conversation

giorio94
Copy link
Member

When running in kvstore mode, the start hook of the identity GC cell blocks until the kvstore client has been initialized, which is performed by the legacyCell start hook. Given that the identity GC cell was registered first, and there are no explicit dependencies among the two, its start hook was also executed first, causing a deadlock.

This PR changes the order in which the cells are registered as a workaround, until the kvstore is refactored into a proper cell.

No backport is required, since the original commit is not present in any stable branch.

operator: fix deadlock when running in kvstore mode

When running in kvstore mode, the start hook of the identity GC cell
blocks until the kvstore client has been initialized, which is performed
by the legacyCell start hook. Given that the identity GC cell was
registered first, and there are no explicit dependencies among the two,
its start hook was also executed first, causing a deadlock.

This commit changes the order in which the cells are registered as a
workaround, until the kvstore is refactored into a proper cell.

The current hooks execution order is the following:

function="gops.registerGopsHooks.func1 (cell.go:44)"
function="client.(*compositeClientset).onStart"
function="cmd.registerOperatorHooks.func1 (root.go:142)"
function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2alpha1.CiliumLoadBalancerIPPool].Start"
function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.Service].Start"
function="*lbipam.LBIPAM.Start"
function="*resource.resource[*k8s.io/api/core/v1.Node].Start"
function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.CiliumNode].Start"
function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.Namespace].Start"
function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.CiliumIdentity].Start"
function="cmd.(*legacyOnLeader).onStart"
function="identitygc.registerGC.func1 (gc.go:107)"

Fixes: b115951 ("operator: Refactor cilium identities GC to a cell")
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. area/operator Impacts the cilium-operator component sig/kvstore Impacts the KVStore package interactions. labels Mar 29, 2023
@giorio94 giorio94 requested a review from a team as a code owner March 29, 2023 13:13
@giorio94 giorio94 requested a review from squeed March 29, 2023 13:13
Copy link
Member

@aanm aanm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Do we need to backport this to 1.13?

@giorio94
Copy link
Member Author

Thank you! Do we need to backport this to 1.13?

No, since the PR introducing that cell (#22892) was not backported to 1.13.

@giorio94
Copy link
Member Author

/test

@aanm
Copy link
Member

aanm commented Mar 29, 2023

Thank you! Do we need to backport this to 1.13?

No, since the PR introducing that cell (#22892) was not backported to 1.13.

@giorio94 in that case, since this bug is not user-facing, the release note label should be release-note/misc.

@aanm aanm added release-note/misc This PR makes changes that have no direct user impact. and removed release-note/bug This PR fixes an issue in a previous release of Cilium. labels Mar 29, 2023
@giorio94
Copy link
Member Author

/ci-gke

@giorio94
Copy link
Member Author

/ci-multicluster

@giorio94
Copy link
Member Author

/ci-gke

@giorio94
Copy link
Member Author

/ci-multicluster

@giorio94
Copy link
Member Author

/ci-awscni

@giorio94
Copy link
Member Author

All previous failures were due to the GitHub actions issues of yesterday

@giorio94
Copy link
Member Author

giorio94 commented Mar 30, 2023

/test-runtime

Hit known flake #23495

@giorio94
Copy link
Member Author

giorio94 commented Mar 30, 2023

/test-1.25-4.19

Hit known flake: #23236

@giorio94
Copy link
Member Author

giorio94 commented Mar 30, 2023

/test-1.24-5.4

Hit new flake #24643

Job 'Cilium-PR-K8s-1.24-kernel-5.4' failed:

Click to show.

Test Name

K8sDatapathConfig Host firewall With VXLAN

Failure Output

FAIL: Failed to reach 10.0.0.59:80 from testclient-host-tmzd4

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.24-kernel-5.4 so I can create one.

@giorio94
Copy link
Member Author

giorio94 commented Mar 30, 2023

/test-1.24-5.4

Hit known flake #16122

@giorio94
Copy link
Member Author

giorio94 commented Mar 31, 2023

/test-1.24-5.4

Hit known flake #24573

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Mar 31, 2023
@julianwiedmann julianwiedmann merged commit b5f1cb7 into cilium:master Mar 31, 2023
43 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/operator Impacts the cilium-operator component kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact. sig/kvstore Impacts the KVStore package interactions.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants