Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

deflake endpointmanager tests #31488

Merged

Conversation

bimmlerd
Copy link
Member

@bimmlerd bimmlerd commented Mar 19, 2024

This series of patches deflakes the endpoint manager tests. Four different flakes were observed when running just the endpointmanager pkg test in a loop, after this PR I achieved 31m55s: 46344 runs so far, 0 failures which implies a high likelihood that any potentially remaining flakes are highly unlikely to occur.

There was a trivial data race in the pressure metrics test, fixed by using an atomic value.

This PR refactors the endpointmanager EP identifier allocation logic away from using a single, global identity pool, as that inherently causes flakes in tests sharing that pool. Specifically, there was a call to mgr.expose which did not check the error - the error was that in the global pool, one of the identities [1, 3, 5, 7] had already been allocated, and hence the exposing failed. Fixed by creating a new allocation pool for each new manager.

Further tests also didn't check expose errors, but these were inconsequential once the global ID pool was removed. Fixed the checks anyway, for good measure.

In addition, TestLookup would fail iff the EP allocated would by chance get ID 1234.

The commits attempt to be somewhat readable in sequence, but the whole change isn't massive either.

Fixes: #26630
Fixes: #28878
Fixes: #27837

@bimmlerd bimmlerd added kind/bug/CI This is a bug in the testing code. kind/cleanup This includes no functional changes. release-note/ci This PR makes changes to the CI. sig/agent Cilium agent related. labels Mar 19, 2024
@bimmlerd bimmlerd force-pushed the pr/bimmlerd/deflake-endpointmanager-tests branch 2 times, most recently from 91fe146 to 6e7f160 Compare March 19, 2024 10:15
@bimmlerd bimmlerd marked this pull request as ready for review March 19, 2024 10:27
@bimmlerd bimmlerd requested review from a team as code owners March 19, 2024 10:27
@bimmlerd
Copy link
Member Author

bimmlerd commented Mar 19, 2024

/test

CI triage:

Copy link
Member

@giorio94 giorio94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

pkg/endpointmanager/idallocator/allocator_test.go Outdated Show resolved Hide resolved
pkg/endpointmanager/idallocator/allocator_test.go Outdated Show resolved Hide resolved
@chancez
Copy link
Contributor

chancez commented Mar 19, 2024

While we're in there, would it be possible to fix the spelling of fakeGague? Not sure if that's gonna make for painful merge conflicts or not, so feel free to ignore if it's gonna make it harder to merge.

@bimmlerd bimmlerd force-pushed the pr/bimmlerd/deflake-endpointmanager-tests branch from 6e7f160 to 5365276 Compare March 21, 2024 12:26
@bimmlerd
Copy link
Member Author

/test

@bimmlerd
Copy link
Member Author

bimmlerd commented Mar 22, 2024

CI triage:

@bimmlerd bimmlerd force-pushed the pr/bimmlerd/deflake-endpointmanager-tests branch from 5365276 to ef8b41a Compare March 22, 2024 12:52
Go's race detector was unhappy with this test due to unserialised
concurrent access to the last value of the fake gauge. Use an atomic
float value instead, to ensure no weirdness can occur, and placate the
race detector.

Race detector warning (mildly edited):

WARNING: DATA RACE
Read at 0x00c000930488 by goroutine 205:
  github.com/cilium/cilium/pkg/endpointmanager.TestPolicyMapPressure.TestPolicyMapPressure.func1.func2()
      cilium/pkg/endpointmanager/policymap_pressure_test.go:27 +0x69
  github.com/stretchr/testify/assert.Eventually.func1()
      cilium/vendor/github.com/stretchr/testify/assert/assertions.go:1902 +0x33

Previous write at 0x00c000930488 by goroutine 203:
  github.com/cilium/cilium/pkg/endpointmanager.(*fakeGague).Set()
      cilium/pkg/endpointmanager/policymap_pressure_test.go:45 +0x30
  github.com/cilium/cilium/pkg/endpointmanager.(*policyMapPressure).update()
      cilium/pkg/endpointmanager/policymap_pressure.go:82 +0x32a
  github.com/cilium/cilium/pkg/endpointmanager.newPolicyMapPressure.func1()
      cilium/pkg/endpointmanager/policymap_pressure.go:57 +0x2e
  github.com/cilium/cilium/pkg/trigger.(*Trigger).waiter()
      cilium/pkg/trigger/trigger.go:201 +0x771
  github.com/cilium/cilium/pkg/trigger.NewTrigger.gowrap1()
      cilium/pkg/trigger/trigger.go:122 +0x33

Goroutine 205 (running) created at:
  github.com/stretchr/testify/assert.Eventually()
      cilium/vendor/github.com/stretchr/testify/assert/assertions.go:1902 +0x3d5
  github.com/stretchr/testify/assert.(*Assertions).Eventually()
      cilium/vendor/github.com/stretchr/testify/assert/assertion_forward.go:319 +0xc7
  github.com/cilium/cilium/pkg/endpointmanager.TestPolicyMapPressure.func1()
      cilium/pkg/endpointmanager/policymap_pressure_test.go:26 +0x2a8
  github.com/cilium/cilium/pkg/endpointmanager.TestPolicyMapPressure()
      cilium/pkg/endpointmanager/policymap_pressure_test.go:30 +0x205

Goroutine 203 (running) created at:
  github.com/cilium/cilium/pkg/trigger.NewTrigger()
      cilium/pkg/trigger/trigger.go:122 +0x36d
  github.com/cilium/cilium/pkg/endpointmanager.newPolicyMapPressure()
      cilium/pkg/endpointmanager/policymap_pressure.go:52 +0x287
  github.com/cilium/cilium/pkg/endpointmanager.TestPolicyMapPressure()
      cilium/pkg/endpointmanager/policymap_pressure_test.go:18 +0x84

Fixes: 28ce005 (endpointmanager: fix bpf policy pressure getting stuck.)

Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
IDPool contains a mutex, passing copies around is a potential footgun. I
don't think we ever used it incorrectly, but I don't see a reason for
all the copying either.

Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
Transform the test from using checkmate to standard Go tests, as it was
not using any of the features anyway.

Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
The endpointmanagers idallocator package was using a package global pool
for its identifier allocation. That's fine for running the agent, but
causes flakes in testing when multiple tests access the same pool. It's
also not idiomatic Go.

This patch makes the local endpoint identifier allocator a struct, and
the next patch will move it into the endpointmanager package itself, as
there is no other consumer.

While at it, also ensure that the RemoveAll method is only called from a
testing context, by taking a testing.TB as an argument. We cannot simply
move the method into the _test.go files, as tests from other packages
use it.

Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
The endpoint manager assumed it was the only consumer of the idallocator
pkg anyway. Having a pkg that only has one consumer is pointless, hence
move it, including tests. This also allows unexporting everything, and
reducing API surface.

Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
@bimmlerd
Copy link
Member Author

/test

Copy link
Member

@dylandreimerink dylandreimerink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Datapath owned files look good to me!

@bimmlerd bimmlerd removed the request for review from aditighag March 25, 2024 09:23
@bimmlerd bimmlerd added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Mar 25, 2024
@julianwiedmann julianwiedmann added this pull request to the merge queue Mar 25, 2024
Merged via the queue into cilium:main with commit e2fbd38 Mar 25, 2024
62 checks passed
@bimmlerd bimmlerd added needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Mar 25, 2024
@sayboras sayboras mentioned this pull request Mar 26, 2024
3 tasks
@sayboras sayboras added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Mar 26, 2024
@github-actions github-actions bot added backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. and removed backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels Mar 28, 2024
@joamaki joamaki mentioned this pull request Apr 2, 2024
8 tasks
@joamaki joamaki added backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. and removed needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels Apr 2, 2024
@joamaki joamaki mentioned this pull request Apr 2, 2024
10 tasks
@joamaki joamaki added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Apr 2, 2024
@bimmlerd bimmlerd deleted the pr/bimmlerd/deflake-endpointmanager-tests branch April 4, 2024 13:30
@github-actions github-actions bot added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. labels Apr 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. kind/bug/CI This is a bug in the testing code. kind/cleanup This includes no functional changes. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/ci This PR makes changes to the CI. sig/agent Cilium agent related.
Projects
None yet
8 participants