Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce flakiness of controlplane tests #30906

Merged

Conversation

bimmlerd
Copy link
Member

@bimmlerd bimmlerd commented Feb 22, 2024

The control plane tests were suffering from the same race as found in #30873. Introduce a mechanism to (mostly) prevent it from happening. And while at it, also fix the send on closed channel panic of #30646 and remove an obsolete test.

Fixes: #30892
Fixes: #30807
Fixes: #30646
Fixes: #21682

@bimmlerd bimmlerd added area/CI Continuous Integration testing issue or flake release-note/ci This PR makes changes to the CI. sig/agent Cilium agent related. labels Feb 22, 2024
@bimmlerd bimmlerd force-pushed the pr/bimmlerd/reduce-flakiness-of-controlplane-tests branch from cdc9c8e to c746a87 Compare February 22, 2024 15:26
bimmlerd and others added 2 commits February 22, 2024 16:28
The control plane test for CNPNodeStatusGC was initially intended to
check the behaviour of running with and without GC. With the deprecation
of the CNP node status, the GC now amounts to unconditional cleanup.

The commit mentioned below had broken the test insofar as it remove the
differentiating factor: the configuration. In other words, we were
running the same code twice, expecting a different outcome.

Ideally, this would have broken the first variant - with GC "disabled",
but unfortunately the test was racy in itself. I suspect that with
progressing modularisation of the agent and operator, somewhere along
the line we lost the guarantee that the GC happens before StartOperator
returns. Hence we checked that the CNPs were unchanged while the
operator was starting up concurrently - a classic race condition.

Since the whole test is obsolete, simply remove it, and leave the second
variant of the test around to check that we actually perform the
deletion.

Fixes: c15f8e4 (Remove skip-cnp-status-startup-clean)

Co-authored-by: Fabian Fischer <fabian.fischer@isovalent.com>
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
Rarely, the control plane test panics, due to a send on a closed
channel. This can occur in a narrow race window in the filteringWatcher:

1. Stop is called on the child watcher
2. Child watcher calls stop on parent watcher
3. Concurrently, an event is dequeued from the parent result chan, and
   we enter the filtering logic.
4. The parent result chan is closed, and we close the child event channel
5. The filter is matched, and we attempt to write on the closed channel,
   which causes the panic.

Instead of closing the channel in the Stop method, close the channel
from the writing goroutine (as is commonly considered best practice in
Go.)

Fixes: fa89802 (controlplane: Implement filtering of objects with field selectors)

Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
@bimmlerd bimmlerd force-pushed the pr/bimmlerd/reduce-flakiness-of-controlplane-tests branch from c746a87 to 7c92d90 Compare February 22, 2024 15:28
@bimmlerd bimmlerd marked this pull request as ready for review February 22, 2024 15:30
@bimmlerd bimmlerd requested a review from a team as a code owner February 22, 2024 15:30
bimmlerd and others added 2 commits February 22, 2024 16:42
We've recently learned that the fake k8s client set's object tracker do
not respect the semantics of the real api-server when it comes to
'Watch': since the object tracker does not care for ResourceVersions, it
cannot respect the version from which it ought to replay events. As a
result, the default informer (more precisely, its reflector) is racy: it
uses a ListAndWatch approach, which relies on this resource version to
avoid a race window between the end of list and the beginning of watch.
Therefore, all informers used in cilium have a low chance of hitting
this race when used with a k8s fake object tracker.

This is somewhat known in the k8s community, see for example [1].
However, the upstream response is that one simply shouldn't be using the
fake infrastructure to test real informers. Unfortunately, this pattern
is used somewhat pervasively inside the cilium tests, specifically so in
the controlplane tests.

This patch introduces a mechanism which reduces the likelihood of
hitting the flake, under the assumption that we do not (often) establish
multiple watchers for the same resource. In the following patch, we'll
use the new infrastructure to reduce the flakiness of tests.

[1]: kubernetes/kubernetes#95372

Co-authored-by: Fabian Fischer <fabian.fischer@isovalent.com>
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
Use the infrastructure introduced in the previous commit to deflake
control plane tests which update k8s state after starting the agent.

Co-authored-by: Fabian Fischer <fabian.fischer@isovalent.com>
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
@bimmlerd bimmlerd force-pushed the pr/bimmlerd/reduce-flakiness-of-controlplane-tests branch from 7c92d90 to 6839dc4 Compare February 22, 2024 15:43
@bimmlerd
Copy link
Member Author

/test

@aanm aanm enabled auto-merge February 22, 2024 17:00
@aanm aanm added this pull request to the merge queue Feb 22, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Feb 22, 2024
Merged via the queue into cilium:main with commit 955049a Feb 22, 2024
59 checks passed
@bimmlerd bimmlerd deleted the pr/bimmlerd/reduce-flakiness-of-controlplane-tests branch February 23, 2024 07:36
@bimmlerd bimmlerd added needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Mar 18, 2024
@gandro gandro mentioned this pull request Mar 19, 2024
21 tasks
@gandro gandro added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Mar 19, 2024
@gandro
Copy link
Member

gandro commented Mar 19, 2024

lock.Map does not exist on v1.14 and older. Marking as backport/author

@gandro gandro added the backport/author The backport will be carried out by the author of the PR. label Mar 19, 2024
@bimmlerd bimmlerd added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Mar 21, 2024
@github-actions github-actions bot added backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. and removed backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels Mar 21, 2024
@bimmlerd bimmlerd added backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. and removed needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels Mar 25, 2024
@github-actions github-actions bot added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Mar 25, 2024
@github-actions github-actions bot added backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. and removed backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. labels Apr 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/CI Continuous Integration testing issue or flake backport/author The backport will be carried out by the author of the PR. backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. kind/bug/CI This is a bug in the testing code. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/ci This PR makes changes to the CI. sig/agent Cilium agent related.
Projects
None yet
4 participants