Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Address race condition in TestGetIdentity #30885

Merged
merged 2 commits into from
Feb 23, 2024

Conversation

bimmlerd
Copy link
Member

@bimmlerd bimmlerd commented Feb 21, 2024

Please read the commit message of the first commit to understand the race - second commit is a bit of cleanup.

Fixes: #30873
Fixes: #30255

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Feb 21, 2024
@bimmlerd bimmlerd added sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. release-note/misc This PR makes changes that have no direct user impact. sig/agent Cilium agent related. labels Feb 21, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Feb 21, 2024
@bimmlerd bimmlerd added the kind/cleanup This includes no functional changes. label Feb 21, 2024
@bimmlerd bimmlerd force-pushed the pr/bimmlerd/fix-30873 branch 2 times, most recently from d04605f to 5ae54e1 Compare February 21, 2024 14:13
@bimmlerd
Copy link
Member Author

/test

@bimmlerd bimmlerd marked this pull request as ready for review February 21, 2024 14:35
@bimmlerd bimmlerd requested a review from a team as a code owner February 21, 2024 14:35
@bimmlerd
Copy link
Member Author

bimmlerd commented Feb 21, 2024

CI triage (WIP):

@bimmlerd bimmlerd added the kind/bug/CI This is a bug in the testing code. label Feb 22, 2024
Copy link
Member

@margamanterola margamanterola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment nits

pkg/k8s/identitybackend/identity_test.go Outdated Show resolved Hide resolved
pkg/k8s/identitybackend/identity_test.go Outdated Show resolved Hide resolved
Copy link
Contributor

@tommyp1ckles tommyp1ckles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice writeup!

bimmlerd and others added 2 commits February 23, 2024 08:46
TestGetIdentity has been unreliable, even withstanding some previous
attempts at deflaking.

The issue lies in the use of the k8s fake infrastructure: the simple
testing object tracker of client-go does _not_ set the ResourceVersion
for resources created. This interacts badly with the logic of the
client-go reflector's ListAndWatch method, which relies on the resource
version to close the racy window between its List and Watch calls. The
real k8s api-server will replay events which occur after the completion
of List and before the establishment of the Watch, thanks to the
ResourceVersion. The object tracker's Watch implementation, however,
does (and can) not do so, as it doesn't have a resource version to
determine which events it would need to replay.

Notably, the HasSynced method of the informer will return true once the
initial List has succeeded. This isn't a guarantee for the Watch to be
established (and indeed, the reflector establishes the Watch _after_ the
list). This is fine for reality, again thanks to the resource version
and the api-server replaying.

The race, hence, is that the creation of the identities can happen
concurrently to the establishment of the watch (HasSynced guarantees
that it happens _after_ the list), and thus we race the creation of the
"RaceFreeWatcher" in the object tracker. If the watcher is late, it
misses the creation of an identity, and we time out waiting on the wait
group.

To fix this, instead of attempting to wait for the Watch
establishment (which doesn't seem easy, on first glance), just create
the resources _before_ list and watch is started, so that they are
returned in the initial list call.

Prior to this patch, the following commandline typically failed quickly:

while true; do go test ./pkg/k8s/identitybackend -run 'TestGetIdentity' -v -count=1 -timeout=10s || break; done

After this patch, it ran thousands of times reliably.

Co-authored-by: Fabian Fischer <fabian.fischer@isovalent.com>
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
The previous patch explains and fixes a flake, this patch removes some
of the remaining cruft from earlier attempts at fixing said flake, as
well as running the test in parallel (for efficiency).

Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
@bimmlerd
Copy link
Member Author

/test

@bimmlerd
Copy link
Member Author

bimmlerd commented Feb 23, 2024

CI triage ( 😞 )

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Feb 23, 2024
@tklauser tklauser added this pull request to the merge queue Feb 23, 2024
Merged via the queue into cilium:main with commit 1907334 Feb 23, 2024
62 checks passed
@bimmlerd bimmlerd added the needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch label Mar 18, 2024
@bimmlerd bimmlerd added the needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch label Mar 18, 2024
@gandro gandro added the backport/author The backport will be carried out by the author of the PR. label Mar 19, 2024
@gandro
Copy link
Member

gandro commented Mar 19, 2024

There are some non-trivial conflicts on the v1.15 branch, and given this is about races I think it's probably better if I don't try to resolve them without context. Marking as "backport/author"

@gandro gandro mentioned this pull request Mar 19, 2024
21 tasks
@bimmlerd bimmlerd added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Mar 21, 2024
@bimmlerd bimmlerd added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Mar 21, 2024
@github-actions github-actions bot added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels Mar 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/author The backport will be carried out by the author of the PR. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. kind/bug/CI This is a bug in the testing code. kind/cleanup This includes no functional changes. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact. sig/agent Cilium agent related. sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers.
Projects
No open projects
Status: Released
Status: Released
5 participants