Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix possible race condition in the clustermesh's users management test #24652

Merged
merged 1 commit into from Mar 31, 2023

Conversation

giorio94
Copy link
Member

@giorio94 giorio94 commented Mar 30, 2023

The clustermesh users management test was affected by a possible race condition due to the fact that the configuration file was overwritten with 'os.WriteFile'. Since this method first opens the file truncating it and then writes the desired content, we could be notified and read the file in that time interval, hence observing an empty file. This caused all users to be first deleted and then recreated, leading the test to fail. To avoid this issue, we now first write a different file, and then rename it. In this way, the update operation is atomic, and we always observe the correct file content.

Failing test output:

level=info msg=Invoked duration="34.8µs" function="github.com/cilium/cilium/clustermesh-apiserver.registerUsersManager (users_mgmt.go:69)" subsys=hive level=info msg=Starting subsys=hive
level=info msg="Starting managing etcd users based on configuration" file-path=/tmp/clustermesh-config159493219/users.yaml subsys=clustermesh-apiserver level=info msg="Start hook executed" duration="115.098µs" function="*main.usersManager.Start" subsys=hive level=info msg="User successfully configured" subsys=clustermesh-apiserver user=foo level=info msg="User successfully configured" subsys=clustermesh-apiserver user=bar level=info msg="User successfully configured" subsys=clustermesh-apiserver user=qux level=info msg="User successfully removed" subsys=clustermesh-apiserver user=foo level=info msg="User successfully removed" subsys=clustermesh-apiserver user=bar level=info msg="User successfully removed" subsys=clustermesh-apiserver user=qux level=info msg="User successfully configured" subsys=clustermesh-apiserver user=baz level=info msg="User successfully configured" subsys=clustermesh-apiserver user=foo level=info msg="User successfully configured" subsys=clustermesh-apiserver user=qux
    users_mgmt_test.go:123:
        	Error Trace:	/home/runner/work/cilium/cilium/clustermesh-apiserver/users_mgmt_test.go:123
        	Error:      	"map[baz:r3 foo:r1 qux:r4]" should have 2 item(s), but has 3
        	Test:       	TestUsersManagement
level=info msg=Stopping subsys=hive
level=info msg="Stopping managing etcd users based on configuration" file-path=/tmp/clustermesh-config159493219/users.yaml subsys=clustermesh-apiserver
level=info msg=Closing subsys=clustermesh-apiserver
level=info msg="Stop hook executed" duration="27.8µs" function="*main.usersManager.Stop" subsys=hive

Fixes: 60f94e4 ("clustermesh-apiserver: implement users management for remote clusters")
Fixes: #24653

The clustermesh users management test was affected by a possible race
condition due to the fact that the configuration file was overwritten
with 'os.WriteFile'. Since this method first opens the file truncating
it and then writes the desired content, we could be notified and read
the file in that time interval, hence observing an empty file. This
caused all users to be first deleted and then recreated, leading the
test to fail. To avoid this issue, we now first write a different file,
and then rename it. In this way, the update operation is atomic, and we
always observe the correct file content.

Failing test output:
level=info msg=Invoked duration="34.8µs" function="github.com/cilium/cilium/clustermesh-apiserver.registerUsersManager (users_mgmt.go:69)" subsys=hive
level=info msg=Starting subsys=hive
level=info msg="Starting managing etcd users based on configuration" file-path=/tmp/clustermesh-config159493219/users.yaml subsys=clustermesh-apiserver
level=info msg="Start hook executed" duration="115.098µs" function="*main.usersManager.Start" subsys=hive
level=info msg="User successfully configured" subsys=clustermesh-apiserver user=foo
level=info msg="User successfully configured" subsys=clustermesh-apiserver user=bar
level=info msg="User successfully configured" subsys=clustermesh-apiserver user=qux
level=info msg="User successfully removed" subsys=clustermesh-apiserver user=foo
level=info msg="User successfully removed" subsys=clustermesh-apiserver user=bar
level=info msg="User successfully removed" subsys=clustermesh-apiserver user=qux
level=info msg="User successfully configured" subsys=clustermesh-apiserver user=baz
level=info msg="User successfully configured" subsys=clustermesh-apiserver user=foo
level=info msg="User successfully configured" subsys=clustermesh-apiserver user=qux
    users_mgmt_test.go:123:
        	Error Trace:	/home/runner/work/cilium/cilium/clustermesh-apiserver/users_mgmt_test.go:123
        	Error:      	"map[baz:r3 foo:r1 qux:r4]" should have 2 item(s), but has 3
        	Test:       	TestUsersManagement
level=info msg=Stopping subsys=hive
level=info msg="Stopping managing etcd users based on configuration" file-path=/tmp/clustermesh-config159493219/users.yaml subsys=clustermesh-apiserver
level=info msg=Closing subsys=clustermesh-apiserver
level=info msg="Stop hook executed" duration="27.8µs" function="*main.usersManager.Stop" subsys=hive

Fixes: 60f94e4 ("clustermesh-apiserver: implement users management for remote clusters")
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 added area/CI Continuous Integration testing issue or flake area/clustermesh Relates to multi-cluster routing functionality in Cilium. release-note/misc This PR makes changes that have no direct user impact. ci/flake This is a known failure that occurs in the tree. Please investigate me! labels Mar 30, 2023
@giorio94 giorio94 requested a review from a team as a code owner March 30, 2023 16:37
@giorio94
Copy link
Member Author

Integration tests completed successfully. I'm not triggering the others, since not relevant for this change (affecting integration tests only).

@joestringer
Copy link
Member

Thanks for the fix, I noticed this failure as well. I filed #24653 , now linked in the PR description.

@giorio94
Copy link
Member Author

Thanks for the fix, I noticed this failure as well. I filed #24653 , now linked in the PR description.

Apparently it is quite frequent in the CI environment. In my local machine I reproduced it in less than 1/1000 iterations on average.

Comment on lines +117 to +124

// Update the users config file, and require that changes are propagated
// We first write to a different file and then rename it, to avoid the possible
// race condition caused by truncate + write if we detect the event sufficiently
// fast (i.e., we first read an empty file, and then the expected one).
cfgPath2 := path.Join(tmpdir, "users.yaml.2")
require.NoError(t, os.WriteFile(cfgPath2, []byte(users2), 0600))
require.NoError(t, os.Rename(cfgPath2, cfgPath))
Copy link
Member

@tklauser tklauser Mar 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, this could use github.com/google/renameio/v2 which would make it clear (without the comment) that the file is written atomically and would also take care of automatically cleaning up stale files on error:

Suggested change
// Update the users config file, and require that changes are propagated
// We first write to a different file and then rename it, to avoid the possible
// race condition caused by truncate + write if we detect the event sufficiently
// fast (i.e., we first read an empty file, and then the expected one).
cfgPath2 := path.Join(tmpdir, "users.yaml.2")
require.NoError(t, os.WriteFile(cfgPath2, []byte(users2), 0600))
require.NoError(t, os.Rename(cfgPath2, cfgPath))
require.NoError(t, renameio.WriteFile(cfgPath, []byte(users2), 0600))

Given that os.Rename is an atomic operation on Linux at least, there's no functional difference of the proposed implementation using renameio (apart from potentially leaking temporary files in case of an error), so LGTM to fix the test flake.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I didn't know there was already a package for that. As for the file leak there should be no problem, since the temporary directory is entirely deleted at the end of the test (deferred after creation).

Copy link
Member

@tklauser tklauser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant to approve on the previous review 🙈 #24652 (comment)

@joestringer joestringer merged commit dbb13be into cilium:master Mar 31, 2023
43 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/CI Continuous Integration testing issue or flake area/clustermesh Relates to multi-cluster routing functionality in Cilium. ci/flake This is a known failure that occurs in the tree. Please investigate me! release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI: TestUsersManagement in clustermesh-apiserver tests fail with incorrect number of users
4 participants