bgpv1: avoid object tracker vs informer race #31010

Merged
merged 1 commit into from Mar 5, 2024

Conversation

bimmlerd
Member

The fake k8s testing infrastructure has an unfortunate interaction between real informers and the object tracker used in fake clientsets: the informer uses ListAndWatch to subscribe to the API server, and ListAndWatch relies on the ResourceVersion field to bridge the gap between the List and the Watch. The simple object tracker neither provides resource versioning nor replays existing state in its Watch implementation.

This leaves a race window in which objects can be created after the initial List but before the Watch is established. The informer never observes these objects, so tests are likely to fail.

As a workaround, we ensure that the object tracker has registered a watcher before we start the test in earnest. The only requirement is that we do not create, update, or delete objects in the window between starting the hive (and hence running the informer) and confirming that the watcher is in place. Creating objects before starting the hive is fine, as is creating them after the watcher is confirmed to exist.

Fixes: #31006

@bimmlerd bimmlerd added kind/bug/CI This is a bug in the testing code. release-note/ci This PR makes changes to the CI. sig/agent Cilium agent related. area/bgp labels Feb 27, 2024
@bimmlerd
Member Author

/test

@bimmlerd bimmlerd marked this pull request as ready for review February 27, 2024 15:59
@bimmlerd bimmlerd requested a review from a team as a code owner February 27, 2024 15:59
@harsimran-pabla
Contributor

Thanks @bimmlerd for the thorough explanation. I think this pattern is repeated across Cilium tests that use the fake clientset, for example the BGP component tests.

Does addressing all of these test cases require a wider discussion?

@YutaroHayakawa
Member

I found that controller-runtime has a versionedTracker that implements testing.ObjectTracker with resource versioning, which "looks like" what we want, but I'm not sure whether we can reuse it for our use case.

https://github.com/cilium/cilium/blob/main/vendor/sigs.k8s.io/controller-runtime/pkg/client/fake/client.go#L61-L65

Anyways, thanks for working on this David!

@bimmlerd
Member Author

I think this pattern is repeated across Cilium tests that use the fake clientset, for example the BGP component tests.

Does addressing all of these test cases require a wider discussion?

Yep, see also #30906 and #30885. I'll also cc you on a Slack thread.

I found that controller-runtime has a versionedTracker that implements testing.ObjectTracker with resource versioning, which "looks like" what we want, but I'm not sure whether we can reuse it for our use case.

Nice find. I took a brief look, and unfortunately it doesn't seem to be enough: as far as I can tell, the versionedTracker delegates Watch to the underlying, standard ObjectTracker. Even with versioned resources, this would not solve the problem, because the missing part is replaying the right events to a watcher that subscribes starting at a specific version.

@bimmlerd
Member Author

bimmlerd commented Feb 28, 2024

CI triage:

@bimmlerd
Member Author

bimmlerd commented Feb 28, 2024

/test (rebased to get the fix for the gateway conformance test)

Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
@bimmlerd
Member Author

/test

@bimmlerd
Member Author

Fixed a race in my fix, PTAL 😓 Specifically, I was closing the channel before Watch was actually called on the underlying tracker (I had delegated that call to the default reaction). Fixed by calling Watch in our reaction and only then closing the channel.

I quantified this using the deflaking tips from https://gist.github.com/liggitt/6a3a2217fa5f846b52519acfc0ffece0#running-unit-tests-to-reproduce-flakes and am now confident that it's less broken than before :)

before: 21m30s: 3069 runs so far, 49 failures (1.60%)
after: 21m30s: 3033 runs so far, 0 failures

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Mar 5, 2024
@joestringer joestringer added this pull request to the merge queue Mar 5, 2024
Merged via the queue into cilium:main with commit cfd1790 Mar 5, 2024
62 checks passed
@bimmlerd bimmlerd deleted the pr/bimmlerd/fix-bgpv1-flake branch March 6, 2024 06:52
@YutaroHayakawa YutaroHayakawa added the needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch label Mar 12, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.15.2 Mar 12, 2024
@YutaroHayakawa
Member

YutaroHayakawa commented Mar 12, 2024

Marking as needs-backport to v1.15. I hit this in #31354.

@jrajahalme jrajahalme removed this from Needs backport from main in 1.15.2 Mar 13, 2024
@gandro gandro mentioned this pull request Mar 19, 2024
21 tasks
@gandro gandro added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Mar 19, 2024
@github-actions github-actions bot added backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. and removed backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels Mar 21, 2024
Development

Successfully merging this pull request may close these issues.

CI: Conformance Runtime (ci-runtime): TestDiffUpsertCoalesce: Expected 2 upserted objects
5 participants