Make LRP restore test logic robust and optimized #16194
test-only --focus="K8sServicesTest.* LRP" --kernel_version="net-next"
Marking for backport as it failed in one backport PR: #16210
LGTM! Thanks for the fix, I think this is the correct test logic as you described.
As discussed in the community meeting, rationale looks sensible to me. Thanks!
@joamaki Thanks for the review, PTAL.
Force-pushed from e285745 to 07392be.
test-only --focus="K8sServicesTest.* LRP" --kernel_version="net-next"
Test failures - 1.16-netnext: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-net-next/576/console
java.net.SocketTimeoutException: Read timed out. The 1.20-net-next failure looks legit. Marking the PR as draft for now.
Focused test runs passed: https://jenkins.cilium.io/job/Cilium-PR-Tests-Kernel-Focus/219/testReport/Suite-k8s-1/20/
Force-pushed from 07392be to 33e6658.
According to the logs, you also rebased, so let's retrigger the e2e tests 😞
test-gke
test-1.20-4.19
test-1.16-netnext
@pchaigno Why do we need to trigger the e2e tests? The changes are localized to only LRP test cases, and there were no merge conflicts.
Because other changes included in the rebase could cause the test to fail (e.g., changes to the agent). The only case where we can merge without restarting tests is when the diff doesn't include any changes that are tested (e.g., only code comment or commit description changes). That's not the case with a rebase.
Relevant tests have passed. Marking it as ready for merge again.
The goal of the test is to check that `curl` requests to a clusterIP svc endpoint are redirected to both backends once the original svc entry is restored upon LRP removal. The current test logic expects the same backend to be selected for all the client pods simultaneously, which can lengthen the test duration. This doesn't seem right, since backend selection is not exactly deterministic. More importantly, we only need both backends to be selected at least once for each of the client pods.

Flip the order in which we loop over backends and client pods: loop over client pods first, then make `curl` calls until we hit both backends on each client pod. Also, keep state about which backends have already been tested successfully, in order to avoid some of the duplicate `curl` calls and to make the test logic deterministic.

More details - #16154 (comment).
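The reordered loop described above could look roughly like the following. This is a minimal, standalone sketch and not the actual test code from this PR: `curlFunc`, `hitAllBackends`, `roundRobin`, the `maxAttempts` bound, and the pod/backend names are all hypothetical, and the real test issues `curl` from pods through the test helpers rather than through a fake function.

```go
package main

import "fmt"

// curlFunc is a hypothetical stand-in for running curl from a client pod
// and returning the name of the backend that answered.
type curlFunc func(clientPod string) (backend string, err error)

// hitAllBackends loops over client pods first. For each pod it keeps curling
// until every expected backend has answered at least once, or until
// maxAttempts (an assumed bound) is exhausted.
func hitAllBackends(clients, backends []string, maxAttempts int, curl curlFunc) error {
	for _, client := range clients {
		// Per-client state: backends that have not answered yet.
		pending := make(map[string]struct{}, len(backends))
		for _, b := range backends {
			pending[b] = struct{}{}
		}
		for attempt := 0; attempt < maxAttempts && len(pending) > 0; attempt++ {
			got, err := curl(client)
			if err != nil {
				return fmt.Errorf("curl from %s: %w", client, err)
			}
			// Deleting an already-seen or unknown backend is a no-op.
			delete(pending, got)
		}
		if len(pending) > 0 {
			return fmt.Errorf("%s missed %d backend(s) after %d attempts",
				client, len(pending), maxAttempts)
		}
	}
	return nil
}

// roundRobin returns a fake curl that cycles through the backends.
// It exists only to make this sketch runnable.
func roundRobin(backends []string) curlFunc {
	next := 0
	return func(string) (string, error) {
		b := backends[next%len(backends)]
		next++
		return b, nil
	}
}

func main() {
	backends := []string{"be1", "be2"}
	clients := []string{"client1", "client2"}
	err := hitAllBackends(clients, backends, 10, roundRobin(backends))
	fmt.Println("error:", err)
}
```

Tracking pending backends per client pod means a pod stops curling as soon as it has seen both backends, instead of waiting for every pod to land on the same backend at the same time.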
Deferred to a follow-up PR - The LRP test case that checks whether traffic only goes to the local backend doesn't seem reliable: it's possible that the curl request the test made just wasn't redirected to the remote backend by chance. I think the reliable way to validate correctness is to check for a `LocalRedirect` service entry and its corresponding backends in the `cilium service list` output.

Fixes: #16154