Reduce flakiness of controlplane tests #30906
Merged
aanm
merged 4 commits into
cilium:main
from
bimmlerd:pr/bimmlerd/reduce-flakiness-of-controlplane-tests
Feb 22, 2024
bimmlerd
added
area/CI
Continuous Integration testing issue or flake
release-note/ci
This PR makes changes to the CI.
sig/agent
Cilium agent related.
labels
Feb 22, 2024
bimmlerd
force-pushed
the
pr/bimmlerd/reduce-flakiness-of-controlplane-tests
branch
from
February 22, 2024 15:26
cdc9c8e
to
c746a87
Compare
The control plane test for CNPNodeStatusGC was initially intended to check the behaviour of running with and without GC. With the deprecation of the CNP node status, the GC now amounts to unconditional cleanup.

The commit mentioned below had broken the test insofar as it removed the differentiating factor: the configuration. In other words, we were running the same code twice, expecting a different outcome. Ideally, this would have broken the first variant - with GC "disabled" - but unfortunately the test was racy in itself.

I suspect that with progressing modularisation of the agent and operator, somewhere along the line we lost the guarantee that the GC happens before StartOperator returns. Hence we checked that the CNPs were unchanged while the operator was starting up concurrently - a classic race condition.

Since the whole test is obsolete, simply remove it, and leave the second variant of the test around to check that we actually perform the deletion.

Fixes: c15f8e4 (Remove skip-cnp-status-startup-clean)
Co-authored-by: Fabian Fischer <fabian.fischer@isovalent.com>
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
Rarely, the control plane test panics due to a send on a closed channel. This can occur in a narrow race window in the filteringWatcher:

1. Stop is called on the child watcher.
2. The child watcher calls Stop on the parent watcher.
3. Concurrently, an event is dequeued from the parent result chan, and we enter the filtering logic.
4. The parent result chan is closed, and we close the child event channel.
5. The filter is matched, and we attempt to write on the closed channel, which causes the panic.

Instead of closing the channel in the Stop method, close the channel from the writing goroutine (as is commonly considered best practice in Go).

Fixes: fa89802 (controlplane: Implement filtering of objects with field selectors)
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
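The pattern named in the commit message above (only the sending goroutine closes the channel) can be illustrated with a minimal sketch. This is not the actual filteringWatcher from the Cilium tree: the `event` type, `newFilteringWatcher`, and `collectMatching` are hypothetical names made up for illustration, and the real watcher deals in Kubernetes watch events rather than plain strings.

```go
package main

import (
	"fmt"
	"sync"
)

// event is a hypothetical stand-in for a watch event.
type event struct {
	name string
}

// filteringWatcher is a simplified model (an assumption, not the Cilium
// type): it forwards matching events from a parent channel to its own
// output channel.
type filteringWatcher struct {
	out      chan event
	stop     chan struct{}
	stopOnce sync.Once
}

func newFilteringWatcher(parent <-chan event, match func(event) bool) *filteringWatcher {
	w := &filteringWatcher{
		out:  make(chan event),
		stop: make(chan struct{}),
	}
	go func() {
		// This goroutine is the only sender on w.out, so it is also the
		// only place where w.out is closed. A send can therefore never
		// race with the close.
		defer close(w.out)
		for {
			select {
			case <-w.stop:
				return
			case ev, ok := <-parent:
				if !ok {
					return // parent closed: no more events
				}
				if !match(ev) {
					continue
				}
				select {
				case w.out <- ev:
				case <-w.stop:
					return
				}
			}
		}
	}()
	return w
}

// Stop only signals the forwarding goroutine; it never closes w.out itself.
// It is safe to call more than once.
func (w *filteringWatcher) Stop() {
	w.stopOnce.Do(func() { close(w.stop) })
}

// ResultChan returns the channel on which matching events are delivered.
func (w *filteringWatcher) ResultChan() <-chan event { return w.out }

// collectMatching feeds names through a watcher and returns those forwarded.
func collectMatching(names []string, match func(event) bool) []string {
	parent := make(chan event)
	w := newFilteringWatcher(parent, match)
	go func() {
		defer close(parent)
		for _, n := range names {
			parent <- event{name: n}
		}
	}()
	var got []string
	for ev := range w.ResultChan() {
		got = append(got, ev.name)
	}
	w.Stop()
	return got
}

func main() {
	keep := func(e event) bool { return e.name != "skip" }
	fmt.Println(collectMatching([]string{"a", "skip", "b"}, keep)) // prints [a b]
}
```

In this sketch, Stop merely closes a separate signalling channel, so a concurrent Stop can at worst cause an event to be dropped, never a panic.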
bimmlerd
force-pushed
the
pr/bimmlerd/reduce-flakiness-of-controlplane-tests
branch
from
February 22, 2024 15:28
c746a87
to
7c92d90
Compare
We've recently learned that the fake k8s client set's object tracker does not respect the semantics of the real api-server when it comes to 'Watch': since the object tracker does not care for ResourceVersions, it cannot respect the version from which it ought to replay events.

As a result, the default informer (more precisely, its reflector) is racy: it uses a ListAndWatch approach, which relies on this resource version to avoid a race window between the end of list and the beginning of watch. Therefore, all informers used in cilium have a low chance of hitting this race when used with a k8s fake object tracker.

This is somewhat known in the k8s community, see for example [1]. However, the upstream response is that one simply shouldn't be using the fake infrastructure to test real informers. Unfortunately, this pattern is used somewhat pervasively inside the cilium tests, specifically so in the controlplane tests.

This patch introduces a mechanism which reduces the likelihood of hitting the flake, under the assumption that we do not (often) establish multiple watchers for the same resource. In the following patch, we'll use the new infrastructure to reduce the flakiness of tests.

[1]: kubernetes/kubernetes#95372

Co-authored-by: Fabian Fischer <fabian.fischer@isovalent.com>
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
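The race window described in the commit message above can be made concrete with a toy model. The sketch below does not use client-go or the real fake object tracker: `toyTracker`, `listThenWatch`, and `drain` are hypothetical names, and the model assumes only what the message states, namely that a watch without resource versions sees just the events that occur after it is registered.

```go
package main

import "fmt"

// toyTracker is a hypothetical stand-in for a fake object tracker. It keeps
// no resource versions, so a watcher cannot ask for events to be replayed
// from the point at which a list was taken.
type toyTracker struct {
	objects  []string
	watchers []chan string
}

// Add stores an object and notifies the watchers registered so far.
func (t *toyTracker) Add(name string) {
	t.objects = append(t.objects, name)
	for _, w := range t.watchers {
		w <- name
	}
}

// List returns a copy of the current objects.
func (t *toyTracker) List() []string {
	return append([]string(nil), t.objects...)
}

// Watch registers a watcher that receives only future events. The buffer is
// large enough for this example, so Add never blocks here.
func (t *toyTracker) Watch() <-chan string {
	w := make(chan string, 16)
	t.watchers = append(t.watchers, w)
	return w
}

// listThenWatch mimics a list-and-watch client. The between callback runs in
// the window after the list has finished and before the watch is registered.
func listThenWatch(t *toyTracker, between func()) ([]string, <-chan string) {
	listed := t.List()
	between()
	return listed, t.Watch()
}

// drain collects all currently buffered events without blocking.
func drain(ch <-chan string) []string {
	var out []string
	for {
		select {
		case ev := <-ch:
			out = append(out, ev)
		default:
			return out
		}
	}
}

func main() {
	tracker := &toyTracker{}
	tracker.Add("pod-a")

	// "pod-b" is added inside the race window and is seen by neither the
	// list nor the watch.
	listed, events := listThenWatch(tracker, func() { tracker.Add("pod-b") })
	tracker.Add("pod-c")

	fmt.Println("listed: ", listed)        // prints listed:  [pod-a]
	fmt.Println("watched:", drain(events)) // prints watched: [pod-c]
}
```

With a real api-server, the watch would start from the resource version returned by the list, so the event for "pod-b" would be replayed. In this toy model, avoiding writes until the watcher is registered closes the window; whether the patch works exactly this way is not stated in the message and is only an assumption here.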
Use the infrastructure introduced in the previous commit to deflake control plane tests which update k8s state after starting the agent. Co-authored-by: Fabian Fischer <fabian.fischer@isovalent.com> Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
bimmlerd
force-pushed
the
pr/bimmlerd/reduce-flakiness-of-controlplane-tests
branch
from
February 22, 2024 15:43
7c92d90
to
6839dc4
Compare
/test
aanm
approved these changes
Feb 22, 2024
tklauser
approved these changes
Feb 22, 2024
maintainer-s-little-helper
bot
added
the
ready-to-merge
This PR has passed all tests and received consensus from code owners to merge.
label
Feb 22, 2024
bimmlerd
deleted the
pr/bimmlerd/reduce-flakiness-of-controlplane-tests
branch
February 23, 2024 07:36
bimmlerd
added
needs-backport/1.13
This PR / issue needs backporting to the v1.13 branch
needs-backport/1.14
This PR / issue needs backporting to the v1.14 branch
needs-backport/1.15
This PR / issue needs backporting to the v1.15 branch
labels
Mar 18, 2024
gandro
added
backport-pending/1.15
The backport for Cilium 1.15.x for this PR is in progress.
and removed
needs-backport/1.15
This PR / issue needs backporting to the v1.15 branch
labels
Mar 19, 2024
gandro
added
the
backport/author
The backport will be carried out by the author of the PR.
label
Mar 19, 2024
This was referenced Mar 19, 2024
bimmlerd
added
backport-pending/1.14
The backport for Cilium 1.14.x for this PR is in progress.
and removed
needs-backport/1.14
This PR / issue needs backporting to the v1.14 branch
labels
Mar 21, 2024
github-actions
bot
added
backport-done/1.15
The backport for Cilium 1.15.x for this PR is done.
and removed
backport-pending/1.15
The backport for Cilium 1.15.x for this PR is in progress.
labels
Mar 21, 2024
bimmlerd
added
backport-pending/1.13
The backport for Cilium 1.13.x for this PR is in progress.
and removed
needs-backport/1.13
This PR / issue needs backporting to the v1.13 branch
labels
Mar 25, 2024
github-actions
bot
added
backport-done/1.14
The backport for Cilium 1.14.x for this PR is done.
and removed
backport-pending/1.14
The backport for Cilium 1.14.x for this PR is in progress.
labels
Mar 25, 2024
This was referenced Mar 26, 2024
github-actions
bot
added
backport-done/1.13
The backport for Cilium 1.13.x for this PR is done.
and removed
backport-pending/1.13
The backport for Cilium 1.13.x for this PR is in progress.
labels
Apr 8, 2024
This was referenced Apr 11, 2024
Labels
area/CI
Continuous Integration testing issue or flake
backport/author
The backport will be carried out by the author of the PR.
backport-done/1.13
The backport for Cilium 1.13.x for this PR is done.
backport-done/1.14
The backport for Cilium 1.14.x for this PR is done.
backport-done/1.15
The backport for Cilium 1.15.x for this PR is done.
kind/bug/CI
This is a bug in the testing code.
ready-to-merge
This PR has passed all tests and received consensus from code owners to merge.
release-note/ci
This PR makes changes to the CI.
sig/agent
Cilium agent related.
The control plane tests were suffering from the same race as found in #30873. Introduce a mechanism to (mostly) prevent it from happening. And while at it, also fix the send on closed channel panic of #30646 and remove an obsolete test.
Fixes: #30892
Fixes: #30807
Fixes: #30646
Fixes: #21682