clustermesh: introduce circuit breaker in wait for synchronization operations #32671
Merged
julianwiedmann
merged 2 commits into
cilium:main
from
giorio94:mio/clustermesh-wait-circuit-breaker
May 28, 2024
The only reason for that function to return an error is that the parent context expired, which happens if the agent is being shut down while the synchronization has not yet completed. Hence, let's just return, rather than triggering a fatal error.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
giorio94
added
kind/bug
This is a bug in the Cilium logic.
release-note/bug
This PR fixes an issue in a previous release of Cilium.
area/clustermesh
Relates to multi-cluster routing functionality in Cilium.
needs-backport/1.15
This PR / issue needs backporting to the v1.15 branch
labels
May 22, 2024
giorio94
added
the
backport/author
The backport will be carried out by the author of the PR.
label
May 22, 2024
Upon agent and operator restart, we need to wait for full clustermesh synchronization in multiple subsystems, to prevent breaking existing cross-cluster connections due to, e.g., garbage collection of valid but not yet retrieved entries for a given remote cluster. However, the absence of a timeout controlling this process is problematic as well, since the impossibility of connecting to a remote cluster (e.g., due to a misconfiguration) can cause issues for local communication due to the blocked GC operations.

Let's standardize the different wait-for-synchronization functions to automatically return after a user-configurable timeout elapses (tunable via the clustermesh-sync-timeout flag, and set to 1 minute by default). This mimics and replaces the already existing timeout used to unblock endpoint regeneration, generalizing it to all the other resources as well. The existing flag is deprecated, but it still takes precedence for consistency with the existing behavior.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
giorio94
force-pushed
the
mio/clustermesh-wait-circuit-breaker
branch
from
May 22, 2024 14:29
3726082
to
09a4124
Compare
/test
lambdanis
approved these changes
May 22, 2024
docs good
tommyp1ckles
approved these changes
May 28, 2024
endpoint lgtm
YutaroHayakawa
approved these changes
May 28, 2024
maintainer-s-little-helper
bot
added
the
ready-to-merge
This PR has passed all tests and received consensus from code owners to merge.
label
May 28, 2024
giorio94
added
backport-pending/1.15
The backport for Cilium 1.15.x for this PR is in progress.
and removed
needs-backport/1.15
This PR / issue needs backporting to the v1.15 branch
labels
May 30, 2024
maintainer-s-little-helper
bot
moved this from Needs backport from main
to Backport pending to v1.15
in 1.15.6
May 30, 2024
giorio94
added
affects/v1.13
This issue affects v1.13 branch
affects/v1.14
This issue affects v1.14 branch
labels
Jun 3, 2024
github-actions
bot
added
backport-done/1.15
The backport for Cilium 1.15.x for this PR is done.
and removed
backport-pending/1.15
The backport for Cilium 1.15.x for this PR is in progress.
labels
Jun 3, 2024
Labels
affects/v1.13
This issue affects v1.13 branch
affects/v1.14
This issue affects v1.14 branch
affects/v1.15
This issue affects v1.15 branch
area/clustermesh
Relates to multi-cluster routing functionality in Cilium.
backport/author
The backport will be carried out by the author of the PR.
backport-done/1.15
The backport for Cilium 1.15.x for this PR is done.
kind/bug
This is a bug in the Cilium logic.
ready-to-merge
This PR has passed all tests and received consensus from code owners to merge.
release-note/bug
This PR fixes an issue in a previous release of Cilium.