Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bgpv1: Retry peer checks in NeighborAddDel test to avoid flakes #25641

Merged
merged 1 commit into from May 25, 2023

Conversation

rastislavs
Copy link
Contributor

Fixes flakes due to potential timing issue in the Test_NeighborAddDel test.

In the test we wait for peers to transition into their expected state. However, remote peer's session state does not have to immediately match our session's state (e.g. peer may be already in Established but we still in OpenConfirm until we receive a Keepalive from the peer), so we should retry the check with a timeout if our state of the session does not immediately match.

Fixes: #25637

In the test we wait for peers to transition into their expected state.
However, remote peer's session state does not have to immediately
match our session's state (e.g. peer may be already in Established
but we still in OpenConfirm until we receive a Keepalive from the peer),
so we should retry the check with a timeout if our state of the session
does not immediately match.

Signed-off-by: Rastislav Szabo <rastislav.szabo@isovalent.com>
@rastislavs rastislavs requested a review from a team as a code owner May 24, 2023 10:32
@rastislavs rastislavs requested a review from ldelossa May 24, 2023 10:32
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 24, 2023
@rastislavs rastislavs added the release-note/ci This PR makes changes to the CI. label May 24, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 24, 2023
@rastislavs rastislavs added area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me! labels May 24, 2023
Copy link
Contributor

@harsimran-pabla harsimran-pabla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@rastislavs
Copy link
Contributor Author

rastislavs commented May 24, 2023

/test

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed:

Click to show.

Test Name

K8sDatapathServicesTest Checks N/S loadbalancing With host policy Tests NodePort

Failure Output

FAIL: Can not connect to service "http://192.168.56.11:32520" from outside cluster (1/10)

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-net-next/137/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next so I can create one.

Then please upload the Jenkins artifacts to that issue.

@ldelossa
Copy link
Contributor

Just out of curiosity, if we suspect a flake, was this test ran over multiple trials to see if the flake was removed?

@rastislavs
Copy link
Contributor Author

rastislavs commented May 24, 2023

@ldelossa

Just out of curiosity, if we suspect a flake, was this test ran over multiple trials to see if the flake was removed?

This is a bit tricky, as I could not reproduce the flake when running it locally (not before nor after fix). Somehow it is reproducible only in CI - we can re-run the CI test multiple times before merging.

@ldelossa
Copy link
Contributor

ldelossa commented May 24, 2023

yeah, can totally understand that, @rastislavs - rerunning a few times in CI sounds like a good idea to me, if it doesn't spam our CI. May want to task in public testing channel whether we have a precedent for this. Really not too sure.

@rastislavs
Copy link
Contributor Author

rastislavs commented May 24, 2023

The test-1.26-net-next job is failing due to #25605. All other are green.

It seems that the test-runtime job that was flaking due to this issue has been pretty successful in the past runs, so not sure if re-running it once or twice would even prove anything. Maybe we should just merge this and observe it over a larger time period.

@harsimran-pabla
Copy link
Contributor

It seems that the test-runtime job that was flaking due to this issue has been pretty successful in the past runs, so not sure if re-running it once or twice would even prove anything. Maybe we should just merge this and observe it over a larger time period.

Locally I was able to reproduce same failure few time running like this
PRIVILEGED_TESTS=1 go test -exec 'sudo -E' ./... -count 10 -run Test_NeighborAddDel from bgpv1/test directory.

You can try to validate if this change fixes it.

@rastislavs
Copy link
Contributor Author

Locally I was able to reproduce same failure few time running like this PRIVILEGED_TESTS=1 go test -exec 'sudo -E' ./... -count 10 -run Test_NeighborAddDel from bgpv1/test directory.

You can try to validate if this change fixes it.

@harsimran-pabla Thanks for the tip, I was able to reproduce the issue (without the fix) with some more iterations and it seems that the fix indeed helped as I could not reproduce it with 10x more iterations with the fix.

@rastislavs
Copy link
Contributor Author

As I was able to verify the fix locally and the test-1.26-net-next job is failing due to #25605, I think this is good to go, marking as ready-to-merge.

@rastislavs rastislavs added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 24, 2023
@squeed squeed merged commit ce7f0f0 into cilium:main May 25, 2023
58 of 59 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me! ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/ci This PR makes changes to the CI.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI: Test_NeighborAddDel failure
4 participants