Address race condition in TestGetIdentity #30885

bimmlerd · 2024-02-21T13:57:17Z

Please read the commit message of the first commit to understand the race; the second commit is a bit of cleanup.

Fixes: #30873
Fixes: #30255

bimmlerd · 2024-02-21T14:26:55Z

/test

bimmlerd · 2024-02-21T15:49:54Z

CI triage (WIP):

Travis: control plane test failure; filed #30892 (CI: controlplane: 'Timed out waiting for pre-existing resources to be received') and a PR to fix it.
ci-integration: kvstore failure: Error: Timed out waiting to acquire the kvstore lock https://github.com/cilium/cilium/actions/runs/7990633467/job/21819875607#step:8:176.
ci-integration: hive/job timeout: filed #30927 (CI: ci-runtime: hive/job: TestTimer_ExitOnCloseFnCtx times out) and fixing it in #30929 (job: avoid a race condition in TestTimer_ExitOnCloseFnCtx).

margamanterola

Minor comment nits

pkg/k8s/identitybackend/identity_test.go

tommyp1ckles

nice writeup!

TestGetIdentity has been unreliable, withstanding even some previous attempts at deflaking.

The issue lies in the use of the k8s fake infrastructure: the simple testing object tracker of client-go does _not_ set the ResourceVersion for resources created. This interacts badly with the logic of the client-go reflector's ListAndWatch method, which relies on the resource version to close the racy window between its List and Watch calls. The real k8s api-server will replay events which occur after the completion of List and before the establishment of the Watch, thanks to the ResourceVersion. The object tracker's Watch implementation, however, does not (and cannot) do so, as it doesn't have a resource version to determine which events it would need to replay.

Notably, the HasSynced method of the informer will return true once the initial List has succeeded. This isn't a guarantee that the Watch has been established (and indeed, the reflector establishes the Watch _after_ the List). This is fine in reality, again thanks to the resource version and the api-server replaying.

The race, hence, is that the creation of the identities can happen concurrently with the establishment of the watch (HasSynced only guarantees that it happens _after_ the list), and thus we race the creation of the "RaceFreeWatcher" in the object tracker. If the watcher is late, it misses the creation of an identity, and we time out waiting on the wait group.

To fix this, instead of attempting to wait for the Watch establishment (which doesn't seem easy, at first glance), just create the resources _before_ list and watch are started, so that they are returned in the initial list call.

Prior to this patch, the following command line typically failed quickly:

    while true; do go test ./pkg/k8s/identitybackend -run 'TestGetIdentity' -v -count=1 -timeout=10s || break; done

After this patch, it ran thousands of times reliably.

Co-authored-by: Fabian Fischer <fabian.fischer@isovalent.com>
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
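
The ordering problem described in the commit message above can be illustrated without client-go. The following is a minimal, deterministic sketch and not the code from this PR: `tracker`, `observe` and the object name `identity-1` are hypothetical, and the tracker only imitates the one property that matters here, namely that it keeps no resource version and so cannot replay events to a watcher that registers late.

```go
package main

import "fmt"

// tracker is a hypothetical stand-in for a fake object tracker. It stores
// objects and forwards creations to already-registered watchers, but it has
// no resource version and therefore never replays past events.
type tracker struct {
	objects  []string
	watchers []chan string
}

// Create stores the object and notifies only the watchers that exist now.
func (t *tracker) Create(name string) {
	t.objects = append(t.objects, name)
	for _, w := range t.watchers {
		w <- name
	}
}

// List returns a copy of everything created so far.
func (t *tracker) List() []string {
	return append([]string(nil), t.objects...)
}

// Watch registers a new watcher; it only receives future creations.
func (t *tracker) Watch() <-chan string {
	w := make(chan string, 16)
	t.watchers = append(t.watchers, w)
	return w
}

// observe imitates list-then-watch. The between callback stands in for
// whatever happens in the window after List returns and before Watch is
// established. It returns the set of objects the observer saw.
func observe(t *tracker, between func()) map[string]bool {
	seen := map[string]bool{}
	for _, o := range t.List() {
		seen[o] = true
	}
	between() // the racy window
	w := t.Watch()
	for {
		select {
		case o := <-w:
			seen[o] = true
		default:
			return seen // nothing buffered for this watcher
		}
	}
}

func main() {
	// Problematic ordering: the object is created after List, before Watch.
	racy := &tracker{}
	seen := observe(racy, func() { racy.Create("identity-1") })
	fmt.Println("created in the window, seen:", seen["identity-1"])

	// Ordering used by the fix: the object exists before List is called.
	fixed := &tracker{}
	fixed.Create("identity-1")
	seen = observe(fixed, func() {})
	fmt.Println("created before list, seen:", seen["identity-1"])
}
```

In the first case the object lands in neither the list result nor the watch stream, which corresponds to the missed identity and the wait group timeout. In the second case it is part of the initial list, so the timing of the watch no longer matters.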

The previous patch explains and fixes a flake; this patch removes some of the remaining cruft from earlier attempts at fixing said flake, and runs the test in parallel (for efficiency).

Signed-off-by: David Bimmler <david.bimmler@isovalent.com>

bimmlerd · 2024-02-23T08:18:51Z

/test

bimmlerd · 2024-02-23T12:19:53Z

CI triage ( 😞 )

Travis hit #30927 (CI: ci-runtime: hive/job: TestTimer_ExitOnCloseFnCtx times out), which will be fixed by #30929 (job: avoid a race condition in TestTimer_ExitOnCloseFnCtx); rerunning for now.
ci-runtime hit #30899 (CI: Conformance Runtime (privileged): TestOps: "0" is not greater than "0"), which I'm fixing in #30932 (bandwidth: test: don't unlock OS thread too early); rerunning.
ci-ginkgo:
- E2E Test (1.29, f09-datapath-misc-2): looks like #18447 (CI: Any test: terminating containers are not deleted after timeout)
- E2E Test (1.26, f10-agent-hubble-bandwidth): filed #30934 (CI: Conformance Ginkgo: E2E Test (1.26, f10-agent-hubble-bandwidth): "Failed to create gRPC client" warning); a rather useless warning logged by hubble-relay
ci-runtime (2nd run): filed #30935 (CI: pkg/alibabacloud/eni: ENISuite/TestPrepareIPAllocation is flaky); rerunning again.
ci-runtime (3rd run) hit #30899 (CI: Conformance Runtime (privileged): TestOps: "0" is not greater than "0") again.
ci-eks hit a weird sequence of events, where the context of an endpoint regeneration was cancelled which leads to some error logs - however, the agent recovered from these, so I'm unsure whether it was worth failing the test.
ci-clustermesh:
- Run connectivity test (4, disabled, ipv6, disabled, none, clustermesh, legacy, migration, 255): operator error log: level=error msg="Failed to update lock: Put \"https://[fc00:f853:ccd:e793::3]:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cilium-operator-resource-lock?timeout=5s\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)" subsys=klog (1 occurrences). This seems similar to #23047 (CI: multiple tests, level=error msg="Failed to update lock: etcdserver: request timed out" subsys=klog in operator logs); rerunning.
- Run connectivity test (8, vxlan, dual, ipsec, iptables, clustermesh, cluster, cluster, 255) hit #27762 (CI: Conformance E2E: client-egress-l7-named-port/pod-to-pod: command terminated with exit code 28 (timeout))
- the upgrade/downgrade test hit #28088 (CI: test-conn-disrupt-client failed due to interrupted traffic during upgrade/downgrade) in https://github.com/cilium/cilium/actions/runs/8016451099/job/21898409307#step:42:53
ci-upgrade-ipsec: level=error msg="Failed to update lock: Operation cannot be fulfilled on leases.coordination.k8s.io \"cilium-operator-resource-lock\": the object has been modified; please apply your changes to the latest version and try again" subsys=klog (1 occurrences) in the operator logs again
ci-eks (2nd run): same failure as above, seems to be a real thing?

gandro · 2024-03-19T10:15:47Z

There are some non-trivial conflicts on the v1.15 branch, and given this is about races I think it's probably better if I don't try to resolve them without context. Marking as "backport/author"

maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Feb 21, 2024

bimmlerd added sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. release-note/misc This PR makes changes that have no direct user impact. sig/agent Cilium agent related. labels Feb 21, 2024

maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Feb 21, 2024

bimmlerd added the kind/cleanup This includes no functional changes. label Feb 21, 2024

bimmlerd force-pushed the pr/bimmlerd/fix-30873 branch 2 times, most recently from d04605f to 5ae54e1 Compare February 21, 2024 14:13

bimmlerd marked this pull request as ready for review February 21, 2024 14:35

bimmlerd requested a review from a team as a code owner February 21, 2024 14:35

bimmlerd requested a review from tommyp1ckles February 21, 2024 14:35

bimmlerd added the kind/bug/CI This is a bug in the testing code. label Feb 22, 2024

margamanterola approved these changes Feb 22, 2024

tommyp1ckles approved these changes Feb 23, 2024

bimmlerd and others added 2 commits February 23, 2024 08:46

bimmlerd force-pushed the pr/bimmlerd/fix-30873 branch from 5ae54e1 to 15ba583 Compare February 23, 2024 07:46

maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Feb 23, 2024

tklauser added this pull request to the merge queue Feb 23, 2024

Merged via the queue into cilium:main with commit 1907334 Feb 23, 2024
62 checks passed

This was referenced Feb 26, 2024

CI: test-conn-disrupt-client failed due to interrupted traffic during upgrade/downgrade #28088

Closed

[CI] Cluster Mesh upgrade: no-interrupted-connections: test-conn-disrupt-client failed due to interrupted traffic during downgrade #30964

Closed

bimmlerd mentioned this pull request Feb 28, 2024

bgpv1: avoid object tracker vs informer race #31010

Merged

tommyp1ckles mentioned this pull request Mar 4, 2024

CI: pkg/alibabacloud/eni: ENISuite/TestPrepareIPAllocation is flaky #30935

Closed

bimmlerd added the needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch label Mar 18, 2024

bimmlerd added the needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch label Mar 18, 2024

gandro added the backport/author The backport will be carried out by the author of the PR. label Mar 19, 2024

gandro mentioned this pull request Mar 19, 2024

v1.15 Backports 2024-03-19 #31490

Merged

21 tasks

bimmlerd mentioned this pull request Mar 21, 2024

[v1.15] Author Backport of #30855 (Address race condition in TestGetIdentity) #31541

Merged

1 task

bimmlerd added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Mar 21, 2024

bimmlerd mentioned this pull request Mar 21, 2024

[v1.14] Author Backports for deflaking PRs #31542

Merged

3 tasks

bimmlerd added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Mar 21, 2024

This was referenced Mar 26, 2024

Prepare for release v1.15.3 #31621

Merged

Prepare for release v1.14.9 #31626

Merged
