bgpv1: avoid object tracker vs informer race #31010

bimmlerd · 2024-02-27T15:41:16Z

The fake k8s testing infrastructure has an unfortunate interaction between real informers and the object tracker used in fake clientsets: the informer uses ListAndWatch to subscribe to the API server, but ListAndWatch relies on the ResourceVersion field to bridge the gap between the List and the Watch. The simple object tracker does not provide resource versioning, nor does its Watch implementation attempt to replay existing state.

The race window, hence, allows for creation of objects after the initial list, but before the establishment of Watch. These objects are not observed by the informer and thus tests are likely to fail.

As a workaround, we ensure that the object tracker has registered a watcher before we start the test in earnest. It is only important that we do not create/update/delete objects in the window between starting the hive (and hence running of the informer) and having ensured that the watcher is in place. Creation prior to starting the hive is okay, as well as after ensuring the watcher exists.
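
To make the window concrete, here is a minimal, stdlib-only sketch of the problem and of the ordering that avoids it. `toyTracker`, `listAndWatch`, `drain`, `racy`, and `safe` are hypothetical names invented for illustration; none of them are client-go APIs, and the toy deliberately ignores namespaces, updates, and deletes.

```go
package main

import (
	"fmt"
	"sync"
)

// toyTracker is a hypothetical stand-in for an unversioned object tracker:
// List returns a snapshot, Watch only delivers events that happen later.
type toyTracker struct {
	mu       sync.Mutex
	objects  []string
	watchers []chan string
}

// Add stores the object and notifies the watchers registered so far.
func (t *toyTracker) Add(name string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.objects = append(t.objects, name)
	for _, w := range t.watchers {
		w <- name // channels are buffered, see Watch
	}
}

// List returns a copy of the current objects.
func (t *toyTracker) List() []string {
	t.mu.Lock()
	defer t.mu.Unlock()
	return append([]string(nil), t.objects...)
}

// Watch registers a watcher. Existing objects are NOT replayed.
func (t *toyTracker) Watch() <-chan string {
	t.mu.Lock()
	defer t.mu.Unlock()
	w := make(chan string, 16)
	t.watchers = append(t.watchers, w)
	return w
}

// listAndWatch mimics the informer: list first, then watch. The inWindow
// hook runs between the two calls, i.e. inside the race window.
func listAndWatch(t *toyTracker, inWindow func()) (seen []string, events <-chan string) {
	seen = t.List()
	inWindow()
	return seen, t.Watch()
}

// drain collects the events currently buffered on ch without blocking.
func drain(ch <-chan string) []string {
	var out []string
	for {
		select {
		case v := <-ch:
			out = append(out, v)
		default:
			return out
		}
	}
}

// racy creates the object after the List but before the Watch: it is lost.
func racy() []string {
	t := &toyTracker{}
	seen, events := listAndWatch(t, func() { t.Add("peer-a") })
	return append(seen, drain(events)...)
}

// safe creates the object only once the watcher is known to be registered.
func safe() []string {
	t := &toyTracker{}
	seen, events := listAndWatch(t, func() {})
	t.Add("peer-a")
	return append(seen, drain(events)...)
}

func main() {
	fmt.Println(racy()) // prints "[]"
	fmt.Println(safe()) // prints "[peer-a]"
}
```

Creating the object before the List would also be fine in this model, since the snapshot would contain it; only the window between the two calls is problematic.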

Fixes: #31006

bimmlerd · 2024-02-27T15:59:14Z

/test

harsimran-pabla · 2024-02-27T18:52:23Z

Thanks @bimmlerd for the thorough explanation. I think this pattern is repeated across tests in Cilium that use the fake clientset, e.g. the BGP component tests.

Does it require a wider discussion to address all these test cases?

YutaroHayakawa · 2024-02-28T05:37:17Z

I found ControllerRuntime has a versionedTracker that implements testing.ObjectTracker with resource versioning, which "looks like" what we want, but I'm not sure if we can reuse it for our use case.

https://github.com/cilium/cilium/blob/main/vendor/sigs.k8s.io/controller-runtime/pkg/client/fake/client.go#L61-L65

Anyways, thanks for working on this David!

bimmlerd · 2024-02-28T08:05:41Z

> I think this pattern is repeated across tests in Cilium that use the fake clientset, e.g. the BGP component tests.
>
> Does it require a wider discussion to address all these test cases?

Yep, see also #30906 and #30885. I'll also cc you on a Slack thread.

> I found ControllerRuntime has a versionedTracker that implements testing.ObjectTracker with resource versioning, which "looks like" what we want, but I'm not sure if we can reuse it for our use case.

Nice find. I took a brief peek, and unfortunately it doesn't seem like this will be enough - the versionedTracker's Watch is delegated to the underlying, standard ObjectTracker, as far as I can tell. This would then not solve the problem, even if the resources are versioned: the missing part is replaying the right information to a watcher which subscribes starting at a specific version.

bimmlerd · 2024-02-28T08:43:54Z

CI triage:

- ci-aks: hit #31009 (CI: Conformance AKS - check-log-errors - EndpointPolicyVisibilityEvent event failed) in https://github.com/cilium/cilium/actions/runs/8067653433/job/22038619283#step:28:195.
- ci-eks: https://github.com/cilium/cilium/actions/runs/8067653691/job/22038621674#step:21:135 and all following tests failed with pods "client2-5668f9f59b-5f9sq" not found. This is very likely because a node was removed by EKS during the test. We don't seem to be very resilient against this.
- ci-gateway-api: hit a persistent failure in https://github.com/cilium/cilium/actions/runs/8067653750/job/22038610100#step:16:52, fixed in #31017 (gateway: Sync up the experimental conformance test), will need a rebase 😢
- ci-ginkgo: failed with level=warning msg="github.com/cilium/cilium/pkg/k8s/resource/resource.go:808: watch of *v2.CiliumNode ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding" subsys=klog for many resources, which seems odd, probably a network blip?

bimmlerd · 2024-02-28T09:39:50Z

/test (rebased to get the fix for the gateway conformance test)

bimmlerd · 2024-02-28T10:07:00Z

CI triage (WIP)

- ci-ingress: https://github.com/cilium/cilium/actions/runs/8078278780/job/22070247578#step:19:306 failed with #30281 (CI: Ingress Conformance Test: An Ingress with mixed path rules should send traffic to the matching backend service where Exact is preferred)

Signed-off-by: David Bimmler <david.bimmler@isovalent.com>

bimmlerd · 2024-02-29T08:56:08Z

/test

bimmlerd · 2024-02-29T09:01:40Z

Fixed a race in my fix, PTAL 😓 Specifically, I was closing the channel before actually calling Watch on the underlying tracker (delegating it to the default reaction). Fixed by calling Watch in our reaction and only then closing the channel.
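
The ordering issue can be sketched in isolation as follows. This is a stdlib-only illustration under stated assumptions, not the code from the PR: `watchFunc`, `signalWhenWatching`, and `registeredWhenReady` are hypothetical names, and `watchFunc` merely stands in for "call Watch on the underlying tracker".

```go
package main

import (
	"fmt"
	"sync"
)

// watchFunc stands in for the underlying tracker's Watch call: invoking it
// registers a watcher and returns its event channel.
type watchFunc func() (<-chan string, error)

// signalWhenWatching wraps watch so that ready is closed only AFTER the
// underlying watcher has been registered. Closing ready first (the earlier,
// buggy ordering) would let the test proceed while the window is still open.
func signalWhenWatching(watch watchFunc, ready chan struct{}) watchFunc {
	var once sync.Once
	return func() (<-chan string, error) {
		ch, err := watch() // register the watcher first...
		if err != nil {
			return nil, err
		}
		once.Do(func() { close(ready) }) // ...and only then signal readiness
		return ch, nil
	}
}

// registeredWhenReady reports whether the watcher was already registered
// at the moment the ready channel fired.
func registeredWhenReady() bool {
	var mu sync.Mutex
	registered := false
	underlying := func() (<-chan string, error) {
		mu.Lock()
		defer mu.Unlock()
		registered = true
		return make(chan string), nil
	}

	ready := make(chan struct{})
	wrapped := signalWhenWatching(underlying, ready)

	go wrapped() // plays the role of the informer starting its watch
	<-ready      // the test blocks here before touching any objects

	mu.Lock()
	defer mu.Unlock()
	return registered
}

func main() {
	fmt.Println("watcher registered before ready fired:", registeredWhenReady())
}
```

The `sync.Once` guards against a second close if the watch is re-established later, which would otherwise panic.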

I quantified this using the deflaking tips from https://gist.github.com/liggitt/6a3a2217fa5f846b52519acfc0ffece0#running-unit-tests-to-reproduce-flakes, and I'm now confident that it's less broken than before :)

before: 21m30s: 3069 runs so far, 49 failures (1.60%)
after: 21m30s: 3033 runs so far, 0 failures

YutaroHayakawa · 2024-03-12T15:35:23Z

Marking as needs-backport to v1.15. I hit this in #31354.

bimmlerd added the kind/bug/CI, release-note/ci, sig/agent, and area/bgp labels Feb 27, 2024

bimmlerd marked this pull request as ready for review February 27, 2024 15:59

bimmlerd requested a review from a team as a code owner February 27, 2024 15:59

bimmlerd requested a review from harsimran-pabla February 27, 2024 15:59

harsimran-pabla approved these changes Feb 27, 2024

bimmlerd force-pushed the pr/bimmlerd/fix-bgpv1-flake branch from 237520b to 91fe546 Compare February 28, 2024 09:31

bimmlerd force-pushed the pr/bimmlerd/fix-bgpv1-flake branch from 91fe546 to 4098bbb Compare February 29, 2024 08:16

bimmlerd requested a review from harsimran-pabla February 29, 2024 09:01

harsimran-pabla approved these changes Feb 29, 2024

maintainer-s-little-helper bot added the ready-to-merge label Mar 5, 2024

joestringer added this pull request to the merge queue Mar 5, 2024

Merged via the queue into cilium:main with commit cfd1790 Mar 5, 2024
62 checks passed

bimmlerd deleted the pr/bimmlerd/fix-bgpv1-flake branch March 6, 2024 06:52

YutaroHayakawa added the needs-backport/1.15 label Mar 12, 2024

maintainer-s-little-helper bot added this to Needs backport from main in 1.15.2 Mar 12, 2024

jrajahalme removed this from Needs backport from main in 1.15.2 Mar 13, 2024

gandro mentioned this pull request in v1.15 Backports 2024-03-19 #31490 (merged) Mar 19, 2024

gandro added the backport-pending/1.15 label and removed the needs-backport/1.15 label Mar 19, 2024

github-actions bot added the backport-done/1.15 label and removed the backport-pending/1.15 label Mar 21, 2024

jrajahalme mentioned this pull request in Prepare for release v1.15.3 #31621 (merged) Mar 26, 2024
