test: force restarting of Cilium pods #11613
Conversation
Please set the appropriate release note label.
test-focus K8sFQDNTest.*
test-gke K8sFQDNTest.*
force-pushed from 89f9364 to 706b6e6
test-focus K8sFQDNTest.*
test-gke K8sFQDNTest.*
test-focus K8sFQDNTest.*
force-pushed from 706b6e6 to 73cc925
test-focus K8sFQDNTest.*
test-me-please
force-pushed from 73cc925 to b133801
test-me-please
test-focus K8sFQDNTest.*
force-pushed from b133801 to df79656
test-focus K8sFQDNTest.*
force-pushed from df79656 to fbe86da
test-me-please
retest-runtime
retest-4.19
force-pushed from fbe86da to a3bd2d5
test-me-please
test-gke
This change ensures that Cilium pods are actually restarted in the "Restart Cilium validate that FQDN is still working" test, by repeatedly calling `kill 1` in all Cilium pods, which was the fastest way of restarting a pod that I found. This test has been flaking a lot lately, and the theory is that it was a race between the connectivity test and restarting the pod.

Signed-off-by: Maciej Kwiek <maciej@isovalent.com>
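For illustration, here is a minimal sketch of what "calling `kill 1` in all Cilium pods" amounts to. This is hypothetical, not the PR's actual Ginkgo helper code; it assumes `kubectl` is on PATH and that Cilium runs in the `kube-system` namespace under the `k8s-app=cilium` label:

```go
// Hypothetical sketch of restarting every Cilium pod via `kill 1`.
// Not the PR's actual test helper; kubectl on PATH and the
// k8s-app=cilium label in kube-system are assumptions.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// List the names of all Cilium pods.
	out, err := exec.Command("kubectl", "-n", "kube-system", "get", "pods",
		"-l", "k8s-app=cilium",
		"-o", "jsonpath={.items[*].metadata.name}").Output()
	if err != nil {
		panic(err)
	}
	for _, pod := range strings.Fields(string(out)) {
		// Killing PID 1 makes the kubelet restart the container in place,
		// without deleting the pod or cleanly uninstalling Cilium.
		if err := exec.Command("kubectl", "-n", "kube-system",
			"exec", pod, "--", "kill", "1").Run(); err != nil {
			fmt.Printf("failed to restart %s: %v\n", pod, err)
			continue
		}
		fmt.Printf("sent kill 1 to %s\n", pod)
	}
}
```

Because the pod is restarted in place rather than deleted, none of the clean-uninstall behaviour discussed below (such as removal of BPF maps) is triggered.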
force-pushed from a3bd2d5 to b885df8
test-me-please
@nebril could you elaborate on why exactly the connectivity check is interfering here? Just trying to understand the context better.
@errordeveloper the connectivity check was not interfering; the point of the test is to run the connectivity check while Cilium is recovering, to validate that the DNS cache keeps working while Cilium pods are being restarted.
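In other words, the test has roughly the following shape (a hedged sketch, not the real Ginkgo test; `example.com:80` stands in for the test's FQDN target, and the shell pipeline reuses the restart trick from the sketch above):

```go
// Hedged sketch of the test's shape: probe connectivity in a loop while
// the Cilium pods restart, then report any failed probes.
// example.com:80 is a placeholder for the real FQDN target.
package main

import (
	"fmt"
	"net"
	"os/exec"
	"sync"
	"time"
)

func main() {
	stop := make(chan struct{})
	failures := make(chan error, 128)
	var wg sync.WaitGroup

	// Prober: keep dialing the FQDN target while Cilium recovers.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for {
			select {
			case <-stop:
				return
			default:
			}
			conn, err := net.DialTimeout("tcp", "example.com:80", 2*time.Second)
			if err != nil {
				failures <- err
			} else {
				conn.Close()
			}
			time.Sleep(200 * time.Millisecond)
		}
	}()

	// Restart all Cilium pods (same trick as the sketch above).
	exec.Command("sh", "-c",
		"kubectl -n kube-system get pods -l k8s-app=cilium -o name | "+
			"xargs -I{} kubectl -n kube-system exec {} -- kill 1").Run()

	time.Sleep(10 * time.Second) // allow the pods to come back up
	close(stop)
	wg.Wait()
	close(failures)

	for err := range failures {
		fmt.Println("probe failed during restart:", err)
	}
}
```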
@nebril that sounds like it would add even more racy behaviour; it sounds to me that it would be more reliable to delete the DaemonSet instead, or to taint and drain the nodes.
@errordeveloper AFAIU if we delete the DaemonSet, Cilium pods will uninstall cleanly, deleting the BPF maps. If we drain the nodes, the same applies, and also how would we test the workload running on a drained node?
I thought the opposite was actually the case. You just need to have the right toleration set. It may be the case that a full drain isn't needed, but rather a taint that Cilium doesn't tolerate, followed by deletion of the Cilium pod(s).
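For comparison, that alternative would look roughly like this (hypothetical sketch; `node1` and the taint key are placeholders, and it assumes the Cilium DaemonSet carries no blanket `operator: Exists` toleration that would let it ignore the taint):

```go
// Hypothetical sketch of the taint-based alternative: apply a taint that
// Cilium does not tolerate, then delete the Cilium pod so the DaemonSet
// controller cannot reschedule it on that node. "node1" and the taint
// key are placeholders.
package main

import (
	"log"
	"os/exec"
)

// run executes a kubectl command and aborts on the first failure.
func run(args ...string) {
	if out, err := exec.Command("kubectl", args...).CombinedOutput(); err != nil {
		log.Fatalf("kubectl %v: %v\n%s", args, err, out)
	}
}

func main() {
	// Taint the node so Cilium pods no longer schedule onto it.
	run("taint", "nodes", "node1", "example.io/no-cilium=true:NoSchedule")

	// Delete the Cilium pod on that node; the taint prevents a replacement.
	run("-n", "kube-system", "delete", "pod",
		"-l", "k8s-app=cilium", "--field-selector", "spec.nodeName=node1")
}
```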
@errordeveloper without the Cilium pod scheduled on a node, we end up with a node that doesn't handle networking via our CNI plugin, which is not what we want to test, AFAIU.
@nebril I believe a missing pod will have the same effect as a restarting pod in this case. And just to be clear, my view is that ad-hoc commands are exactly what we should stop using in tests. If this is a hack that fixes another hack, I get that :)
Fixes a race between Cilium being restarted and the connectivity test.