New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix possible race condition in the clustermesh's users management test #24652
Conversation
The clustermesh users management test was affected by a possible race condition due to the fact that the configuration file was overwritten with 'os.WriteFile'. Since this method first opens the file truncating it and then writes the desired content, we could be notified and read the file in that time interval, hence observing an empty file. This caused all users to be first deleted and then recreated, leading the test to fail. To avoid this issue, we now first write a different file, and then rename it. In this way, the update operation is atomic, and we always observe the correct file content. Failing test output: level=info msg=Invoked duration="34.8µs" function="github.com/cilium/cilium/clustermesh-apiserver.registerUsersManager (users_mgmt.go:69)" subsys=hive level=info msg=Starting subsys=hive level=info msg="Starting managing etcd users based on configuration" file-path=/tmp/clustermesh-config159493219/users.yaml subsys=clustermesh-apiserver level=info msg="Start hook executed" duration="115.098µs" function="*main.usersManager.Start" subsys=hive level=info msg="User successfully configured" subsys=clustermesh-apiserver user=foo level=info msg="User successfully configured" subsys=clustermesh-apiserver user=bar level=info msg="User successfully configured" subsys=clustermesh-apiserver user=qux level=info msg="User successfully removed" subsys=clustermesh-apiserver user=foo level=info msg="User successfully removed" subsys=clustermesh-apiserver user=bar level=info msg="User successfully removed" subsys=clustermesh-apiserver user=qux level=info msg="User successfully configured" subsys=clustermesh-apiserver user=baz level=info msg="User successfully configured" subsys=clustermesh-apiserver user=foo level=info msg="User successfully configured" subsys=clustermesh-apiserver user=qux users_mgmt_test.go:123: Error Trace: /home/runner/work/cilium/cilium/clustermesh-apiserver/users_mgmt_test.go:123 Error: "map[baz:r3 foo:r1 qux:r4]" should have 2 item(s), but has 3 Test: TestUsersManagement level=info msg=Stopping subsys=hive level=info msg="Stopping managing etcd users based on configuration" file-path=/tmp/clustermesh-config159493219/users.yaml subsys=clustermesh-apiserver level=info msg=Closing subsys=clustermesh-apiserver level=info msg="Stop hook executed" duration="27.8µs" function="*main.usersManager.Stop" subsys=hive Fixes: 60f94e4 ("clustermesh-apiserver: implement users management for remote clusters") Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Integration tests completed successfully. I'm not triggering the others, since not relevant for this change (affecting integration tests only). |
Thanks for the fix, I noticed this failure as well. I filed #24653 , now linked in the PR description. |
Apparently it is quite frequent in the CI environment. In my local machine I reproduced it in less than 1/1000 iterations on average. |
|
||
// Update the users config file, and require that changes are propagated | ||
// We first write to a different file and then rename it, to avoid the possible | ||
// race condition caused by truncate + write if we detect the event sufficiently | ||
// fast (i.e., we first read an empty file, and then the expected one). | ||
cfgPath2 := path.Join(tmpdir, "users.yaml.2") | ||
require.NoError(t, os.WriteFile(cfgPath2, []byte(users2), 0600)) | ||
require.NoError(t, os.Rename(cfgPath2, cfgPath)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, this could use github.com/google/renameio/v2
which would make it clear (without the comment) that the file is written atomically and would also take care of automatically cleaning up stale files on error:
// Update the users config file, and require that changes are propagated | |
// We first write to a different file and then rename it, to avoid the possible | |
// race condition caused by truncate + write if we detect the event sufficiently | |
// fast (i.e., we first read an empty file, and then the expected one). | |
cfgPath2 := path.Join(tmpdir, "users.yaml.2") | |
require.NoError(t, os.WriteFile(cfgPath2, []byte(users2), 0600)) | |
require.NoError(t, os.Rename(cfgPath2, cfgPath)) | |
require.NoError(t, renameio.WriteFile(cfgPath, []byte(users2), 0600)) |
Given that os.Rename
is an atomic operation on Linux at least, there's no functional difference of the proposed implementation using renameio
(apart from potentially leaking temporary files in case of an error), so LGTM to fix the test flake.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I didn't know there was already a package for that. As for the file leak there should be no problem, since the temporary directory is entirely deleted at the end of the test (deferred after creation).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant to approve on the previous review 🙈 #24652 (comment)
The clustermesh users management test was affected by a possible race condition due to the fact that the configuration file was overwritten with 'os.WriteFile'. Since this method first opens the file truncating it and then writes the desired content, we could be notified and read the file in that time interval, hence observing an empty file. This caused all users to be first deleted and then recreated, leading the test to fail. To avoid this issue, we now first write a different file, and then rename it. In this way, the update operation is atomic, and we always observe the correct file content.
Failing test output:
Fixes: 60f94e4 ("clustermesh-apiserver: implement users management for remote clusters")
Fixes: #24653