Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

redirectpolicy: prevent startup deadlock #27115

Merged
merged 1 commit into from Jul 31, 2023

Conversation

bimmlerd
Copy link
Member

This commit prevents a deadlock in startup when local redirect policies are in use.

Involved in the deadlock is the node-local pod watcher and the redirectpolicy manager. The pod watcher (synchronously) calls into the manager, in oder to notify it of a pod event. The manager then tries to acquire a reference to the pod store. Due to recent refactoring, however, this leads to a wait cycle:

  1. Acquiring the pod store reference blocks on initial synchronisation of the store, by waiting for the podStoreSet channel to be closed.
  2. This channel is closed by the watcher once it receives a Sync event from the local pod resource. Before the resource emits a Sync event, however, it will 'replay' the consumer up to the current snapshot by emitting Upsert events.
  3. In the Upsert call, the redirectpolicy manager is called (blocking on 1.), and we have a classical deadlock.

To fix this, instead of going through the k8s watcher to acquire a reference to the local pod store, inject the resource directly into the redirect policy manager. This will also block on the store being synchronised, but not by way of the podStoreSet channel (closed by the k8s watcher): it instead relies on the Resource[T]-internal tracking.

Fixes: #27084

Resolve a deadlock on startup when local redirect policies are used.

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 27, 2023
@bimmlerd bimmlerd added kind/bug This is a bug in the Cilium logic. area/daemon Impacts operation of the Cilium daemon. area/lrp Impacts Local Redirect Policy. sig/agent Cilium agent related. needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch and removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Jul 27, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 27, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.14.1 Jul 27, 2023
@bimmlerd bimmlerd added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Jul 27, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 27, 2023
@bimmlerd bimmlerd requested a review from joamaki July 27, 2023 15:33
@bimmlerd
Copy link
Member Author

/test

Copy link
Contributor

@joamaki joamaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. This fixes the deadlock as now it's getting the store reference in a way that it itself cannot block (as there's the Resource[T] work queue that decouples the two).

This commit prevents a deadlock in startup when local redirect policies
are in use.

Involved in the deadlock is the node-local pod watcher and the
redirectpolicy manager. The pod watcher (synchronously) calls into the
manager, in oder to notify it of a pod event. The manager then tries to
acquire a reference to the pod store. Due to recent refactoring,
however, this leads to a wait cycle:

1. Acquiring the pod store reference blocks on initial synchronisation
   of the store, by waiting for the `podStoreSet` channel to be closed.
2. This channel is closed by the watcher once it receives a `Sync` event
   from the local pod resource. Before the resource emits a Sync event,
   however, it will 'replay' the consumer up to the current snapshot by
   emitting Upsert events.
3. In the Upsert call, the redirectpolicy manager is called (blocking on
   1.), and we have a classical deadlock.

To fix this, instead of going through the k8s watcher to acquire a
reference to the local pod store, inject the resource directly into the
redirect policy manager. This will also block on the store being
synchronised, but not by way of the `podStoreSet` channel (closed by the
k8s watcher): it instead relies on the Resource[T]-internal tracking.

Fixes: 3db8b14 (k8s: Use LocalPodResource in pod watcher)

Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
@bimmlerd bimmlerd force-pushed the pr/bimmlerd/minimal-lrp-deadlock-fix branch from 655137c to 31f56d5 Compare July 31, 2023 10:59
@bimmlerd bimmlerd marked this pull request as ready for review July 31, 2023 11:08
@bimmlerd bimmlerd requested review from a team as code owners July 31, 2023 11:08
@bimmlerd
Copy link
Member Author

/test

@joamaki joamaki merged commit 825b91d into cilium:main Jul 31, 2023
58 checks passed
dylandreimerink added a commit to dylandreimerink/cilium that referenced this pull request Aug 2, 2023
…lium#27115

CI runs on both were green, but when merged into main they conflicted.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
dylandreimerink added a commit that referenced this pull request Aug 2, 2023
CI runs on both were green, but when merged into main they conflicted.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
@bimmlerd bimmlerd deleted the pr/bimmlerd/minimal-lrp-deadlock-fix branch August 2, 2023 11:26
@sayboras sayboras mentioned this pull request Aug 3, 2023
11 tasks
@sayboras sayboras added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Aug 3, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.14 in 1.14.1 Aug 3, 2023
@sayboras sayboras added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Aug 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/daemon Impacts operation of the Cilium daemon. area/lrp Impacts Local Redirect Policy. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/agent Cilium agent related.
Projects
No open projects
1.14.1
Backport done to v1.14
Development

Successfully merging this pull request may close these issues.

Cilium LRP causes deadlock on cilium-agent initialization
3 participants