Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v1.14] Author Backports for deflaking PRs #31542

Merged
merged 6 commits into from
Mar 25, 2024

Conversation

bimmlerd
Copy link
Member

bimmlerd and others added 4 commits March 21, 2024 14:11
[ upstream commit deb0687 ]

[ backporter's note: also picks up the fake handler sync infra ]

TestGetIdentity has been unreliable, even withstanding some previous
attempts at deflaking.

The issue lies in the use of the k8s fake infrastructure: the simple
testing object tracker of client-go does _not_ set the ResourceVersion
for resources created. This interacts badly with the logic of the
client-go reflector's ListAndWatch method, which relies on the resource
version to close the racy window between its List and Watch calls. The
real k8s api-server will replay events which occur after the completion
of List and before the establishment of the Watch, thanks to the
ResourceVersion. The object tracker's Watch implementation, however,
does (and can) not do so, as it doesn't have a resource version to
determine which events it would need to replay.

Notably, the HasSynced method of the informer will return true once the
initial List has succeeded. This isn't a guarantee for the Watch to be
established (and indeed, the reflector establishes the Watch _after_ the
list). This is fine for reality, again thanks to the resource version
and the api-server replaying.

The race, hence, is that the creation of the identities can happen
concurrently to the establishment of the watch (HasSynced guarantees
that it happens _after_ the list), and thus we race the creation of the
"RaceFreeWatcher" in the object tracker. If the watcher is late, it
misses the creation of an identity, and we time out waiting on the wait
group.

To fix this, instead of attempting to wait for the Watch
establishment (which doesn't seem easy, on first glance), just create
the resources _before_ list and watch is started, so that they are
returned in the initial list call.

Prior to this patch, the following commandline typically failed quickly:

while true; do go test ./pkg/k8s/identitybackend -run 'TestGetIdentity' -v -count=1 -timeout=10s || break; done

After this patch, it ran thousands of times reliably.

Co-authored-by: Fabian Fischer <fabian.fischer@isovalent.com>
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
[ upstream commit 1907334 ]

The previous patch explains and fixes a flake, this patch removes some
of the remaining cruft from earlier attempts at fixing said flake, as
well as running the test in parallel (for efficiency).

Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
[ upstream commit 7df437a ]

Rarely, the control plane test panics, due to a send on a closed
channel. This can occur in a narrow race window in the filteringWatcher:

1. Stop is called on the child watcher
2. Child watcher calls stop on parent watcher
3. Concurrently, an event is dequeued from the parent result chan, and
   we enter the filtering logic.
4. The parent result chan is closed, and we close the child event channel
5. The filter is matched, and we attempt to write on the closed channel,
   which causes the panic.

Instead of closing the channel in the Stop method, close the channel
from the writing goroutine (as is commonly considered best practice in
Go.)

Fixes: fa89802 (controlplane: Implement filtering of objects with field selectors)

Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
[ upstream commit ba99d74 ]

[ backporter's note: lock.Map doesn't exist, replace with sync.Map ]

We've recently learned that the fake k8s client set's object tracker do
not respect the semantics of the real api-server when it comes to
'Watch': since the object tracker does not care for ResourceVersions, it
cannot respect the version from which it ought to replay events. As a
result, the default informer (more precisely, its reflector) is racy: it
uses a ListAndWatch approach, which relies on this resource version to
avoid a race window between the end of list and the beginning of watch.
Therefore, all informers used in cilium have a low chance of hitting
this race when used with a k8s fake object tracker.

This is somewhat known in the k8s community, see for example [1].
However, the upstream response is that one simply shouldn't be using the
fake infrastructure to test real informers. Unfortunately, this pattern
is used somewhat pervasively inside the cilium tests, specifically so in
the controlplane tests.

This patch introduces a mechanism which reduces the likelihood of
hitting the flake, under the assumption that we do not (often) establish
multiple watchers for the same resource. In the following patch, we'll
use the new infrastructure to reduce the flakiness of tests.

[1]: kubernetes/kubernetes#95372

Co-authored-by: Fabian Fischer <fabian.fischer@isovalent.com>
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
@bimmlerd bimmlerd added kind/backports This PR provides functionality previously merged into master. backport/1.14 This PR represents a backport for Cilium 1.14.x of a PR that was merged to main. labels Mar 21, 2024
bimmlerd and others added 2 commits March 21, 2024 14:40
[ upstream commit 955049a ]

[ backporter's note: resolve conflicts, also include status_updates_gc.go]

Use the infrastructure introduced in the previous commit to deflake
control plane tests which update k8s state after starting the agent.

Co-authored-by: Fabian Fischer <fabian.fischer@isovalent.com>
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
[ upstream commit fe71a4a ]

[ backporter's note: fix conflicts, lock.Map -> sync.Map ]

I realized that the fix for controlplane tests isn't complete. There is
still a (small) race window: The current watch reaction records a
watcher as established without "handling" the watch itself, i.e. it lets
the default watch reaction actually call 'Watch' on the tracker. This is
racy, as things can happen in the window between recordng and actually
watching.

To fix this, add the recording unconditionally in the existing tracker
augmentation.

Fixes: ba99d74 (controlplane: add mechanism to wait for watchers)

Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
@bimmlerd bimmlerd force-pushed the pr/v1.14-backport-2024-03-21-02-05 branch from d1cff30 to 04d726c Compare March 21, 2024 13:44
@bimmlerd bimmlerd changed the title v1.14 Backports 2024-03-21 [v1.14 Backports 2024-03-21 Mar 21, 2024
@bimmlerd bimmlerd changed the title [v1.14 Backports 2024-03-21 [v1.14] Author Backports for deflaking PRs Mar 21, 2024
@bimmlerd
Copy link
Member Author

/test-backport-1.14

@bimmlerd bimmlerd marked this pull request as ready for review March 22, 2024 14:35
@bimmlerd bimmlerd requested a review from a team as a code owner March 22, 2024 14:35
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Mar 25, 2024
@squeed squeed merged commit 9876e74 into v1.14 Mar 25, 2024
222 checks passed
@squeed squeed deleted the pr/v1.14-backport-2024-03-21-02-05 branch March 25, 2024 15:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.14 This PR represents a backport for Cilium 1.14.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. ready-to-merge This PR has passed all tests and received consensus from code owners to merge.
Projects
No open projects
Status: Released
Development

Successfully merging this pull request may close these issues.

None yet

2 participants