Fix missed clustermesh config change race condition with back-to-back changes #24993
Merged: ldelossa merged 1 commit into cilium:main from giorio94:mio/clustermesh-config-watcher-fix on Apr 20, 2023
Currently, the config watcher used to detect changes in the clustermesh configuration is affected by a possible race condition in case the same file is modified multiple times back-to-back, causing one of the updates to be missed. Specifically, there is a small chance that the file content changes between when it is read and when the watcher is established, hence losing that notification.

This commit fixes this by re-reading the file after having established the watcher, to ensure that we are always processing the most up-to-date version of the file, and that subsequent modifications are guaranteed to be detected.

This issue could realistically occur only in tests, due to the tight timing constraints, and it caused an occasional failure of the TestWatchConfigDirectory test with the following error:

    FAIL: config_test.go:69: ClusterMeshTestSuite.TestWatchConfigDirectory

    config_test.go:178:
        expectChange(c, cm, "cluster2")
    config_test.go:46:
        c.Fatal("timeout while waiting for changed event")
    ... Error: timeout while waiting for changed event

Fixes: 2cdd4ee ("agent: rework clustermesh config watcher")
Fixes: cilium#24491
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
giorio94 added labels on Apr 20, 2023:
  area/CI: Continuous Integration testing issue or flake
  area/clustermesh: Relates to multi-cluster routing functionality in Cilium.
  release-note/misc: This PR makes changes that have no direct user impact.
  needs-backport/1.13: This PR / issue needs backporting to the v1.13 branch
YutaroHayakawa approved these changes on Apr 20, 2023
LGTM with nit.
/test
test-1.27-net-next: the egressgw tests are failing because #24835 has been merged and this PR is not rebased on top of it. Marking as ready to merge, since reviews are in and this change is completely unrelated (the clustermesh suite passed).
giorio94 added the ready-to-merge label on Apr 20, 2023:
  ready-to-merge: This PR has passed all tests and received consensus from code owners to merge.
ldelossa removed the ready-to-merge label on Apr 20, 2023:
  ready-to-merge: This PR has passed all tests and received consensus from code owners to merge.
nbusseneau added the backport-pending/1.13 label and removed the needs-backport/1.13 label on Apr 20, 2023:
  backport-pending/1.13: The backport for Cilium 1.13.x for this PR is in progress.
  needs-backport/1.13: This PR / issue needs backporting to the v1.13 branch
sayboras added the backport-done/1.13 label and removed the backport-pending/1.13 label on Apr 26, 2023:
  backport-done/1.13: The backport for Cilium 1.13.x for this PR is done.
  backport-pending/1.13: The backport for Cilium 1.13.x for this PR is in progress.
Labels
  area/CI: Continuous Integration testing issue or flake
  area/clustermesh: Relates to multi-cluster routing functionality in Cilium.
  backport-done/1.13: The backport for Cilium 1.13.x for this PR is done.
  release-note/misc: This PR makes changes that have no direct user impact.
Currently, the config watcher used to detect changes in the clustermesh configuration is affected by a possible race condition in case the same file is modified multiple times back-to-back, causing one of the updates to be missed. Specifically, there is a small chance that the file content changes between when it is read and when the watcher is established, hence losing that notification. This PR fixes this by re-reading the file after having established the watcher, to ensure that we are always processing the most up-to-date version of the file, and that subsequent modifications are guaranteed to be detected.
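The ordering problem described above can be illustrated with a minimal, self-contained sketch. It is not the actual Cilium watcher code and does not use the real file-notification library: `fakeFile`, `readThenWatch`, `watchThenReread`, and `pending` are hypothetical names introduced only for this illustration, and the `gap` callback simulates a write that lands between the initial read and the watcher registration.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeFile is a hypothetical in-memory stand-in for a config file and its
// change notifications. It is an assumption made for illustration only.
type fakeFile struct {
	mu       sync.Mutex
	content  string
	watchers []chan struct{}
}

// Read returns the current content.
func (f *fakeFile) Read() string {
	f.mu.Lock()
	defer f.mu.Unlock()
	return f.content
}

// Write updates the content and notifies only the watchers that are
// already registered at the time of the write.
func (f *fakeFile) Write(s string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.content = s
	for _, w := range f.watchers {
		select {
		case w <- struct{}{}:
		default: // a notification is already pending
		}
	}
}

// Watch registers a new watcher and returns its notification channel.
func (f *fakeFile) Watch() <-chan struct{} {
	f.mu.Lock()
	defer f.mu.Unlock()
	ch := make(chan struct{}, 1)
	f.watchers = append(f.watchers, ch)
	return ch
}

// readThenWatch models the racy ordering: a write landing in the gap is
// neither reflected in the returned content nor notified.
func readThenWatch(f *fakeFile, gap func()) (string, <-chan struct{}) {
	seen := f.Read()
	gap()
	return seen, f.Watch()
}

// watchThenReread models the fixed ordering: the watcher is established
// first, then the file is read again, so the returned content is current
// and any later write produces a notification.
func watchThenReread(f *fakeFile, gap func()) (string, <-chan struct{}) {
	_ = f.Read() // initial read, as in the racy variant
	gap()
	events := f.Watch()
	return f.Read(), events
}

// pending reports whether a notification is waiting, without blocking.
func pending(ch <-chan struct{}) bool {
	select {
	case <-ch:
		return true
	default:
		return false
	}
}

func main() {
	racy := &fakeFile{content: "v1"}
	seen, events := readThenWatch(racy, func() { racy.Write("v2") })
	fmt.Println(seen, pending(events)) // prints "v1 false": the update is lost

	fixed := &fakeFile{content: "v1"}
	seen, events = watchThenReread(fixed, func() { fixed.Write("v2") })
	fmt.Println(seen, pending(events)) // prints "v2 false": the re-read catches it
}
```

In the racy variant the caller holds stale content and has no event to tell it so; in the fixed variant the re-read returns the latest content, and every write after that point reaches the already-registered watcher.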
This issue could realistically occur only in tests, due to the tight timing constraints (hence I'm adding the misc release note rather than bug), and it caused an occasional failure of the TestWatchConfigDirectory test with the following error:

    FAIL: config_test.go:69: ClusterMeshTestSuite.TestWatchConfigDirectory

    config_test.go:178:
        expectChange(c, cm, "cluster2")
    config_test.go:46:
        c.Fatal("timeout while waiting for changed event")
    ... Error: timeout while waiting for changed event

I've run the test locally with 1K+ repetitions and observed no failures with the fix.
Marking for backport in v1.13 to propagate the fix for the flaky test.
Fixes: #24491