New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
redirectpolicy: prevent startup deadlock #27115
Merged
joamaki
merged 1 commit into
cilium:main
from
bimmlerd:pr/bimmlerd/minimal-lrp-deadlock-fix
Jul 31, 2023
Merged
redirectpolicy: prevent startup deadlock #27115
joamaki
merged 1 commit into
cilium:main
from
bimmlerd:pr/bimmlerd/minimal-lrp-deadlock-fix
Jul 31, 2023
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
maintainer-s-little-helper
bot
added
the
dont-merge/needs-release-note-label
The author needs to describe the release impact of these changes.
label
Jul 27, 2023
bimmlerd
added
kind/bug
This is a bug in the Cilium logic.
area/daemon
Impacts operation of the Cilium daemon.
area/lrp
Impacts Local Redirect Policy.
sig/agent
Cilium agent related.
needs-backport/1.14
This PR / issue needs backporting to the v1.14 branch
and removed
dont-merge/needs-release-note-label
The author needs to describe the release impact of these changes.
labels
Jul 27, 2023
maintainer-s-little-helper
bot
added
the
dont-merge/needs-release-note-label
The author needs to describe the release impact of these changes.
label
Jul 27, 2023
bimmlerd
added
the
release-note/bug
This PR fixes an issue in a previous release of Cilium.
label
Jul 27, 2023
maintainer-s-little-helper
bot
removed
the
dont-merge/needs-release-note-label
The author needs to describe the release impact of these changes.
label
Jul 27, 2023
/test |
joamaki
approved these changes
Jul 31, 2023
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. This fixes the deadlock as now it's getting the store reference in a way that it itself cannot block (as there's the Resource[T] work queue that decouples the two).
This commit prevents a deadlock in startup when local redirect policies are in use. Involved in the deadlock is the node-local pod watcher and the redirectpolicy manager. The pod watcher (synchronously) calls into the manager, in oder to notify it of a pod event. The manager then tries to acquire a reference to the pod store. Due to recent refactoring, however, this leads to a wait cycle: 1. Acquiring the pod store reference blocks on initial synchronisation of the store, by waiting for the `podStoreSet` channel to be closed. 2. This channel is closed by the watcher once it receives a `Sync` event from the local pod resource. Before the resource emits a Sync event, however, it will 'replay' the consumer up to the current snapshot by emitting Upsert events. 3. In the Upsert call, the redirectpolicy manager is called (blocking on 1.), and we have a classical deadlock. To fix this, instead of going through the k8s watcher to acquire a reference to the local pod store, inject the resource directly into the redirect policy manager. This will also block on the store being synchronised, but not by way of the `podStoreSet` channel (closed by the k8s watcher): it instead relies on the Resource[T]-internal tracking. Fixes: 3db8b14 (k8s: Use LocalPodResource in pod watcher) Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
bimmlerd
force-pushed
the
pr/bimmlerd/minimal-lrp-deadlock-fix
branch
from
July 31, 2023 10:59
655137c
to
31f56d5
Compare
/test |
dylandreimerink
added a commit
to dylandreimerink/cilium
that referenced
this pull request
Aug 2, 2023
…lium#27115 CI runs on both were green, but when merged into main they conflicted. Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
dylandreimerink
added a commit
that referenced
this pull request
Aug 2, 2023
CI runs on both were green, but when merged into main they conflicted. Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
sayboras
added
backport-pending/1.14
The backport for Cilium 1.14.x for this PR is in progress.
and removed
needs-backport/1.14
This PR / issue needs backporting to the v1.14 branch
labels
Aug 3, 2023
maintainer-s-little-helper
bot
moved this from Needs backport from main
to Backport pending to v1.14
in 1.14.1
Aug 3, 2023
sayboras
added
backport-done/1.14
The backport for Cilium 1.14.x for this PR is done.
and removed
backport-pending/1.14
The backport for Cilium 1.14.x for this PR is in progress.
labels
Aug 4, 2023
maintainer-s-little-helper
bot
moved this from Backport pending to v1.14
to Backport done to v1.14
in 1.14.1
Aug 4, 2023
2 tasks
1 task
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
area/daemon
Impacts operation of the Cilium daemon.
area/lrp
Impacts Local Redirect Policy.
backport-done/1.14
The backport for Cilium 1.14.x for this PR is done.
kind/bug
This is a bug in the Cilium logic.
release-note/bug
This PR fixes an issue in a previous release of Cilium.
sig/agent
Cilium agent related.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This commit prevents a deadlock in startup when local redirect policies are in use.
Involved in the deadlock is the node-local pod watcher and the redirectpolicy manager. The pod watcher (synchronously) calls into the manager, in oder to notify it of a pod event. The manager then tries to acquire a reference to the pod store. Due to recent refactoring, however, this leads to a wait cycle:
podStoreSet
channel to be closed.Sync
event from the local pod resource. Before the resource emits a Sync event, however, it will 'replay' the consumer up to the current snapshot by emitting Upsert events.To fix this, instead of going through the k8s watcher to acquire a reference to the local pod store, inject the resource directly into the redirect policy manager. This will also block on the store being synchronised, but not by way of the
podStoreSet
channel (closed by the k8s watcher): it instead relies on the Resource[T]-internal tracking.Fixes: #27084