Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clustermesh: introduce circuit breaker in wait for synchronization operations #32671

Merged

Conversation

giorio94
Copy link
Member

Upon agent and operator restart, we need to wait for full clustermesh synchronization in multiple subsystems, to prevent breaking existing cross-cluster connections due to e.g., garbage collection of valid but not yet retrieved entries for a given remote cluster. However, the absence of a timeout controlling this process is problematic as well, as the impossibility of connecting to a remote cluster (e.g., due to a misconfiguration) can cause issues for local communication to the blocked GC operations.

Let's standardize the different wait for synchronization functions to automatically return after a user configurable timeout (tunable via the clustermesh-sync-timeout, and set to 1 minute by default) elapses. This mimics and replaces the already existing timeout used to unblock endpoint regeneration, generalizing it to all the other resources as well. The existing flag is deprecated, but it still takes precedence for consistency with the existing behavior.

Introduce timeout when waiting for the initial synchronization from remote clusters, to avoid blocking forever necessary GC operations in case of clustermesh misconfigurations.

The only reason for that function to return an error is that the parent
context expired, which happens if the agent is being shut down while the
synchronization has not yet completed. Hence, let's just return, rather
than triggering a fatal error.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. area/clustermesh Relates to multi-cluster routing functionality in Cilium. needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels May 22, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.15.6 May 22, 2024
@giorio94 giorio94 added the backport/author The backport will be carried out by the author of the PR. label May 22, 2024
Upon agent and operator restart, we need to wait for full clustermesh
synchronization in multiple subsystems, to prevent breaking existing
cross-cluster connections due to e.g., garbage collection of valid
but not yet retrieved entries for a given remote cluster. However,
the absence of a timeout controlling this process is problematic as
well, as the impossibility of connecting to a remote cluster (e.g.,
due to a misconfiguration) can cause issues for local communication
to the blocked GC operations.

Let's standardize the different wait for synchronization functions
to automatically return after a user configurable timeout (tunable
via the clustermesh-sync-timeout, and set to 1 minute by default)
elapses. This mimics and replaces the already existing timeout used
to unblock endpoint regeneration, generalizing it to all the other
resources as well. The existing flag is deprecated, but it still
takes precedence for consistency with the existing behavior.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 force-pushed the mio/clustermesh-wait-circuit-breaker branch from 3726082 to 09a4124 Compare May 22, 2024 14:29
@giorio94
Copy link
Member Author

/test

@giorio94 giorio94 marked this pull request as ready for review May 22, 2024 14:31
@giorio94 giorio94 requested review from a team as code owners May 22, 2024 14:31
Copy link
Contributor

@lambdanis lambdanis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs good

Copy link
Contributor

@tommyp1ckles tommyp1ckles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

endpoint lgtm

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 28, 2024
@julianwiedmann julianwiedmann added this pull request to the merge queue May 28, 2024
Merged via the queue into cilium:main with commit cc7c27d May 28, 2024
64 checks passed
@giorio94 giorio94 mentioned this pull request May 30, 2024
1 task
@giorio94 giorio94 added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels May 30, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.15 in 1.15.6 May 30, 2024
@giorio94 giorio94 added affects/v1.13 This issue affects v1.13 branch affects/v1.14 This issue affects v1.14 branch labels Jun 3, 2024
@giorio94 giorio94 added the affects/v1.15 This issue affects v1.15 branch label Jun 3, 2024
@github-actions github-actions bot added backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. and removed backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels Jun 3, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Backport pending to v1.15 in 1.15.6 Jun 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects/v1.13 This issue affects v1.13 branch affects/v1.14 This issue affects v1.14 branch affects/v1.15 This issue affects v1.15 branch area/clustermesh Relates to multi-cluster routing functionality in Cilium. backport/author The backport will be carried out by the author of the PR. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
Status: Released
Development

Successfully merging this pull request may close these issues.

None yet

5 participants