Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

alibabacloud/eni: avoid racing node mgr in test #31877

Merged

Conversation

bimmlerd
Copy link
Member

@bimmlerd bimmlerd commented Apr 10, 2024

TestPrepareIPAllocation attempts to check that PerpareIPAllocation produces the expected results. It avoids starting the ipam node manager it constructs, likely trying to avoid starting the background maintenance jobs.

However, manager.Upsert does asynchronously trigger pool maintenance, which results in a automated creation of an ENI with different params than the test expects, and hence assertion failures.

This patch avoids the race condition by explicitly setting the instance API readiness to false, which causes background pool maintenance to be delayed and hence guarantees that the PrepareIPAllocation call runs in the environment expected.

The following was used to reproduce this flake:

go test -c -race ./pkg/alibabacloud/eni && stress -p 50 ./eni.test

The likelihood of hitting this flake approximates 0.02%, hence reproducing requires a reasonably large number of runs, as well as high load on the system to increase the likelihood of the flake (since it does depend on the test being somewhat starved for CPU).

With the above fix, I was able to achieve 40k runs without flaking.

Fixes: #30935

TestPrepareIPAllocation attempts to check that PerpareIPAllocation
produces the expected results. It avoids starting the ipam node manager
it constructs, likely trying to avoid starting the background
maintenance jobs.

However, manager.Upsert _does_ asynchronously trigger pool maintenance,
which results in a automated creation of an ENI with different params
than the test expects, and hence assertion failures.

This patch avoids the race condition by explicitly setting the instance
API readiness to false, which causes background pool maintenance to be
delayed and hence guarantees that the PrepareIPAllocation call runs in
the environment expected.

The following was used to reproduce this flake:

  go test -c -race ./pkg/alibabacloud/eni && stress -p 50 ./eni.test

The likelihood of hitting this flake approximates 0.02%, hence
reproducing requires a reasonably large number of runs, as well as high
load on the system to increase the likelihood of the flake (since it
does depend on the test being somewhat starved for CPU).

Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
@bimmlerd bimmlerd added kind/bug/CI This is a bug in the testing code. release-note/ci This PR makes changes to the CI. sig/ipam IP address management, including cloud IPAM area/alibaba Impacts Alibaba based IPAM. area/ipam Impacts IP address management functionality. labels Apr 10, 2024
@bimmlerd
Copy link
Member Author

/test

@bimmlerd bimmlerd marked this pull request as ready for review April 10, 2024 11:39
@bimmlerd bimmlerd requested a review from a team as a code owner April 10, 2024 11:39
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 10, 2024
@julianwiedmann
Copy link
Member

What's the backport situation for this, are we also seeing the race in stable branches?

@julianwiedmann julianwiedmann added this pull request to the merge queue Apr 12, 2024
Merged via the queue into cilium:main with commit 56ca923 Apr 12, 2024
62 checks passed
@bimmlerd bimmlerd deleted the pr/bimmlerd/fix-alibaba-eni-ipam-flake branch April 15, 2024 06:51
@bimmlerd
Copy link
Member Author

What's the backport situation for this, are we also seeing the race in stable branches?

1.13 looks different, but I repro'd on 1.14 and 1.15. Marking for backport for those.

@bimmlerd bimmlerd added needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Apr 15, 2024
@giorio94 giorio94 mentioned this pull request Apr 15, 2024
3 tasks
@giorio94 giorio94 added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Apr 15, 2024
@giorio94 giorio94 mentioned this pull request Apr 16, 2024
4 tasks
@giorio94 giorio94 added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Apr 16, 2024
@github-actions github-actions bot added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels Apr 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/alibaba Impacts Alibaba based IPAM. area/ipam Impacts IP address management functionality. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. kind/bug/CI This is a bug in the testing code. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/ci This PR makes changes to the CI. sig/ipam IP address management, including cloud IPAM
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI: pkg/alibabacloud/eni: ENISuite/TestPrepareIPAllocation is flaky
4 participants