bgpv1: Retry peer checks in NeighborAddDel test to avoid flakes #25641
In the test we wait for peers to transition into their expected state. However, the remote peer's session state does not have to match our session's state immediately (e.g. the peer may already be in Established while we are still in OpenConfirm until we receive a Keepalive from the peer), so we should retry the check with a timeout if our state of the session does not immediately match.

Signed-off-by: Rastislav Szabo <rastislav.szabo@isovalent.com>
Thank you!
/test

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed.
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-net-next/137/
If it is a flake and a GitHub issue doesn't already exist to track it, comment Then please upload the Jenkins artifacts to that issue.
Just out of curiosity, if we suspect a flake, was this test run over multiple trials to see if the flake was removed?
This is a bit tricky, as I could not reproduce the flake when running it locally (neither before nor after the fix). Somehow it is reproducible only in CI. We can re-run the CI test multiple times before merging.
Yeah, can totally understand that, @rastislavs. Rerunning a few times in CI sounds like a good idea to me, if it doesn't spam our CI. May want to ask in the public testing channel whether we have a precedent for this. Really not too sure.
The It seems that the |
Locally I was able to reproduce the same failure a few times running like this
You can try to validate whether this change fixes it.
@harsimran-pabla Thanks for the tip. I was able to reproduce the issue (without the fix) with some more iterations, and it seems that the fix indeed helped, as I could not reproduce it with 10x more iterations with the fix.
As I was able to verify the fix locally and the |
Fixes flakes due to a potential timing issue in the Test_NeighborAddDel test.

In the test we wait for peers to transition into their expected state. However, the remote peer's session state does not have to match our session's state immediately (e.g. the peer may already be in Established while we are still in OpenConfirm until we receive a Keepalive from the peer), so we should retry the check with a timeout if our state of the session does not immediately match.
Fixes: #25637