Fix missed clustermesh config change race condition with back-to-back changes #24993

Merged
merged 1 commit into from Apr 20, 2023


giorio94
Member

Currently, the config watcher used to detect changes in the clustermesh configuration is affected by a possible race condition in case the same file is modified multiple times back-to-back, causing one of the updates to be missed. Specifically, there is a small chance that the file content changes between when it is read and when the watcher is established, hence losing that notification. This PR fixes this by re-reading the file after having established the watcher, to ensure that we are always processing the most up-to-date version of the file, and that subsequent modifications are guaranteed to be detected.
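To make the race window concrete, here is a minimal, deterministic sketch of the ordering problem described above. It is not the actual watcher code: `fakeFile`, `observe` and `racingWrite` are hypothetical names introduced only for illustration, and the in-memory "file" merely assumes what the description states, namely that a watcher is notified only about writes that happen after it has been established.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeFile is a hypothetical in-memory stand-in for the watched config file.
// It only notifies about writes that happen after Watch has been called.
type fakeFile struct {
	mu       sync.Mutex
	content  string
	watchers []chan struct{}
}

// Read returns the current content.
func (f *fakeFile) Read() string {
	f.mu.Lock()
	defer f.mu.Unlock()
	return f.content
}

// Write replaces the content and notifies the watchers established so far.
func (f *fakeFile) Write(content string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.content = content
	for _, w := range f.watchers {
		select {
		case w <- struct{}{}:
		default: // a notification is already pending
		}
	}
}

// Watch registers a watcher and returns its notification channel.
func (f *fakeFile) Watch() <-chan struct{} {
	f.mu.Lock()
	defer f.mu.Unlock()
	w := make(chan struct{}, 1)
	f.watchers = append(f.watchers, w)
	return w
}

// observe reads the file, lets racingWrite modify it inside the window
// between the read and the watch, establishes the watch and, if reRead is
// set, reads the file again. It returns the last content it has seen.
func observe(f *fakeFile, reRead bool, racingWrite func()) string {
	seen := f.Read()
	racingWrite() // back-to-back change landing inside the race window
	events := f.Watch()
	if reRead {
		seen = f.Read() // the fix: re-read once the watch is in place
	}
	select {
	case <-events: // process a pending notification, if any
		seen = f.Read()
	default:
	}
	return seen
}

func main() {
	for _, reRead := range []bool{false, true} {
		f := &fakeFile{content: "cluster1"}
		got := observe(f, reRead, func() { f.Write("cluster2") })
		fmt.Printf("reRead=%v observed=%q\n", reRead, got)
	}
}
```

Without the re-read, the sketch reports the stale `cluster1` content and no notification ever arrives for the missed write. With the re-read it reports `cluster2`, and any later write is delivered through the already established watch.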

This issue could realistically occur only in tests, due to the tight timing constraints (hence I'm adding the misc release note rather than bug), and it caused occasional failures of the TestWatchConfigDirectory test with the following error:

FAIL: config_test.go:69: ClusterMeshTestSuite.TestWatchConfigDirectory

config_test.go:178:
    expectChange(c, cm, "cluster2")
config_test.go:46:
    c.Fatal("timeout while waiting for changed event")
... Error: timeout while waiting for changed event

I've run the test locally with 1K+ repetitions and observed no failures with the fix.
Marking for backport in v1.13 to propagate the fix for the flaky test.

Fixes: #24491

Fix missed clustermesh config change race condition with back-to-back changes

Currently, the config watcher used to detect changes in the clustermesh
configuration is affected by a possible race condition in case the same
file is modified multiple times back-to-back, causing one of the updates
to be missed. Specifically, there is a small chance that the file
content changes between when it is read and when the watcher is
established, hence losing that notification. This commit fixes this by
re-reading the file after having established the watcher, to ensure that
we are always processing the most up-to-date version of the file, and
that subsequent modifications are guaranteed to be detected.

This issue could realistically occur only in tests, due to the tight timing
constraints, and it caused occasional failures of the TestWatchConfigDirectory
test with the following error:

FAIL: config_test.go:69: ClusterMeshTestSuite.TestWatchConfigDirectory

config_test.go:178:
    expectChange(c, cm, "cluster2")
config_test.go:46:
    c.Fatal("timeout while waiting for changed event")
... Error: timeout while waiting for changed event

Fixes: 2cdd4ee ("agent: rework clustermesh config watcher")
Fixes: cilium#24491
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 added area/CI Continuous Integration testing issue or flake area/clustermesh Relates to multi-cluster routing functionality in Cilium. release-note/misc This PR makes changes that have no direct user impact. needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels Apr 20, 2023
@giorio94 giorio94 requested a review from a team as a code owner April 20, 2023 07:59
Member

@YutaroHayakawa YutaroHayakawa left a comment

LGTM with nit.

@giorio94
Member Author

/test

@giorio94
Member Author

test-1.27-net-next: the egressgw tests are failing because #24835 has been merged and this PR is not rebased on top of it. Marking as ready to merge, since reviews are in and this change is completely unrelated (the clustermesh suite passed).

@giorio94 giorio94 added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 20, 2023
@ldelossa ldelossa merged commit 106d098 into cilium:main Apr 20, 2023
57 of 58 checks passed
@ldelossa ldelossa removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 20, 2023
@nbusseneau nbusseneau mentioned this pull request Apr 20, 2023
15 tasks
@nbusseneau nbusseneau added backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. and removed needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels Apr 20, 2023
@sayboras sayboras added backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. and removed backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. labels Apr 26, 2023
Labels
area/CI Continuous Integration testing issue or flake area/clustermesh Relates to multi-cluster routing functionality in Cilium. backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. release-note/misc This PR makes changes that have no direct user impact.
Development

Successfully merging this pull request may close these issues.

CI: IntegrationTests: ClusterMeshTestSuite.TestWatchConfigDirectory: timeout while waiting for changed event
5 participants