Reduce flakiness of controlplane tests #30906

The control plane test for CNPNodeStatusGC was initially intended to check the behaviour of running with and without GC. With the deprecation of the CNP node status, the GC now amounts to unconditional cleanup. The commit mentioned below had broken the test insofar as it remove the differentiating factor: the configuration. In other words, we were running the same code twice, expecting a different outcome. Ideally, this would have broken the first variant - with GC "disabled", but unfortunately the test was racy in itself. I suspect that with progressing modularisation of the agent and operator, somewhere along the line we lost the guarantee that the GC happens before StartOperator returns. Hence we checked that the CNPs were unchanged while the operator was starting up concurrently - a classic race condition. Since the whole test is obsolete, simply remove it, and leave the second variant of the test around to check that we actually perform the deletion. Fixes: c15f8e4 (Remove skip-cnp-status-startup-clean) Co-authored-by: Fabian Fischer <fabian.fischer@isovalent.com> Signed-off-by: David Bimmler <david.bimmler@isovalent.com>

Rarely, the control plane test panics, due to a send on a closed channel. This can occur in a narrow race window in the filteringWatcher: 1. Stop is called on the child watcher 2. Child watcher calls stop on parent watcher 3. Concurrently, an event is dequeued from the parent result chan, and we enter the filtering logic. 4. The parent result chan is closed, and we close the child event channel 5. The filter is matched, and we attempt to write on the closed channel, which causes the panic. Instead of closing the channel in the Stop method, close the channel from the writing goroutine (as is commonly considered best practice in Go.) Fixes: fa89802 (controlplane: Implement filtering of objects with field selectors) Signed-off-by: David Bimmler <david.bimmler@isovalent.com>

We've recently learned that the fake k8s client set's object tracker do not respect the semantics of the real api-server when it comes to 'Watch': since the object tracker does not care for ResourceVersions, it cannot respect the version from which it ought to replay events. As a result, the default informer (more precisely, its reflector) is racy: it uses a ListAndWatch approach, which relies on this resource version to avoid a race window between the end of list and the beginning of watch. Therefore, all informers used in cilium have a low chance of hitting this race when used with a k8s fake object tracker. This is somewhat known in the k8s community, see for example [1]. However, the upstream response is that one simply shouldn't be using the fake infrastructure to test real informers. Unfortunately, this pattern is used somewhat pervasively inside the cilium tests, specifically so in the controlplane tests. This patch introduces a mechanism which reduces the likelihood of hitting the flake, under the assumption that we do not (often) establish multiple watchers for the same resource. In the following patch, we'll use the new infrastructure to reduce the flakiness of tests. [1]: kubernetes/kubernetes#95372 Co-authored-by: Fabian Fischer <fabian.fischer@isovalent.com> Signed-off-by: David Bimmler <david.bimmler@isovalent.com>

Use the infrastructure introduced in the previous commit to deflake control plane tests which update k8s state after starting the agent. Co-authored-by: Fabian Fischer <fabian.fischer@isovalent.com> Signed-off-by: David Bimmler <david.bimmler@isovalent.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce flakiness of controlplane tests #30906

Reduce flakiness of controlplane tests #30906

Commits on Feb 22, 2024