
test: force restarting of Cilium pods #11613

Merged: 1 commit merged into master from pr/fqdn-delete-cilium on May 28, 2020

Conversation

@nebril (Member) commented May 20, 2020

Fixes race between cilium being restarted and connectivity test.

@nebril nebril requested a review from a team as a code owner May 20, 2020 11:25
@maintainer-s-little-helper (bot):

Please set the appropriate release note label.

@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.8.0 May 20, 2020
@nebril nebril marked this pull request as draft May 20, 2020 11:26
@nebril (Member Author) commented May 20, 2020

test-focus K8sFQDNTest.*

@nebril (Member Author) commented May 20, 2020

test-gke K8sFQDNTest.*

@nebril (Member Author) commented May 20, 2020

test-focus K8sFQDNTest.*

@nebril (Member Author) commented May 20, 2020

test-gke K8sFQDNTest.*

@coveralls commented May 20, 2020

Coverage decreased (-0.006%) to 36.874% when pulling b885df8 on pr/fqdn-delete-cilium into 60786b6 on master.

@nebril (Member Author) commented May 20, 2020

test-focus K8sFQDNTest.*

@nebril (Member Author) commented May 21, 2020

test-focus K8sFQDNTest.*

@nebril (Member Author) commented May 21, 2020

test-me-please

@nebril nebril changed the title test: delete Cilium daemonset in fqdn test test: force restarting of Cilium pods May 21, 2020
@nebril (Member Author) commented May 21, 2020

test-me-please

@nebril (Member Author) commented May 21, 2020

test-focus K8sFQDNTest.*

@nebril (Member Author) commented May 21, 2020

test-focus K8sFQDNTest.*

@nebril (Member Author) commented May 22, 2020

test-me-please

@nebril nebril marked this pull request as ready for review May 22, 2020 09:40
@nebril nebril added the release-note/ci This PR makes changes to the CI. label May 22, 2020
@nebril (Member Author) commented May 22, 2020

retest-runtime

@nebril (Member Author) commented May 22, 2020

retest-4.19

Review threads (resolved): test/helpers/kubectl.go, test/k8sT/fqdn.go
@nebril nebril requested a review from tgraf May 22, 2020 14:39
@nebril nebril requested a review from tgraf May 22, 2020 14:39
@nebril (Member Author) commented May 22, 2020

test-me-please

@nebril (Member Author) commented May 25, 2020

test-gke

Review thread (resolved): test/helpers/kubectl.go
This change ensures that Cilium pods are actually restarted in the
"Restart Cilium validate that FQDN is still working" test, by repeatedly
calling `kill 1` in all Cilium pods, which was the fastest way of
restarting a pod I found.

This test has been flaking a lot lately, and the theory is that there was
a race between the connectivity test and restarting the pod.

Signed-off-by: Maciej Kwiek <maciej@isovalent.com>
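
The restart-by-signal approach from the commit message can be sketched as a short loop against the cluster (a hedged sketch, assuming the standard `k8s-app=cilium` label and `kube-system` namespace; this is not the PR's actual helper code):

```sh
# Kill PID 1 (cilium-agent) inside every Cilium pod. The container runtime
# restarts the container in place, so the pod is never rescheduled and its
# BPF maps are not cleaned up -- unlike deleting the DaemonSet.
for pod in $(kubectl -n kube-system get pods -l k8s-app=cilium \
    -o jsonpath='{.items[*].metadata.name}'); do
  kubectl -n kube-system exec "$pod" -- kill 1
done
```

Because the restart happens inside the running pod, the connectivity test can keep running against stable pod names while the agents come back up.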
@nebril (Member Author) commented May 25, 2020

test-me-please

@errordeveloper (Contributor) commented:

@nebril could you elaborate on why exactly the connectivity check is interfering here? Just trying to understand the context better.

@nebril (Member Author) commented May 27, 2020

@errordeveloper the connectivity check was not interfering; the point of the test is to run the connectivity check while Cilium is recovering, to validate that the DNS cache keeps working while Cilium pods restart.

@errordeveloper (Contributor) commented:

@nebril that sounds like it would add even more racy behaviour; it sounds to me like it would be more reliable to delete the daemonset instead, or to taint and drain the nodes.

@nebril (Member Author) commented May 27, 2020

@errordeveloper AFAIU, if we delete the daemonset, Cilium pods will uninstall cleanly, deleting their BPF maps. If we drain the nodes, the same applies; besides, how would we test a workload running on a drained node?

@errordeveloper (Contributor) commented:

> if we delete the daemonset, Cilium pods will uninstall cleanly, deleting bpf maps

I thought the opposite was actually the case.

> how will we test the workload running on a drained node?

You just need to have the right toleration set. It may be the case that a full drain isn't needed, but rather a taint that Cilium doesn't tolerate, followed by deletion of the Cilium pod(s).
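
The taint-based alternative described here might look roughly like the following (the node name and taint key are illustrative, not from the PR):

```sh
# A NoExecute taint evicts pods that lack a matching toleration --
# including the Cilium agent pod -- and the DaemonSet will not
# reschedule it onto the node while the taint is present.
kubectl taint nodes node-1 cilium-test=restart:NoExecute

# The test workload survives eviction only if its pod spec carries a
# matching toleration, e.g.:
#   tolerations:
#   - key: cilium-test
#     operator: Exists
#     effect: NoExecute
```

Removing the taint afterwards would let the DaemonSet reschedule the Cilium pod, completing the "restart".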

@nebril (Member Author) commented May 27, 2020

@errordeveloper without a Cilium pod scheduled on the node, we end up with a node that doesn't handle networking via our CNI plugin, which is not what we want to test, AFAIU.

@errordeveloper (Contributor) commented:

@nebril I believe a missing pod will have the same effect as a restarting pod in this case. And just to be clear, my view is that ad-hoc commands are exactly what we should stop doing in the tests. If this is a hack that fixes another hack, I get that :)

@aanm aanm merged commit b514a1c into master May 28, 2020
1.8.0 automation moved this from In progress to Merged May 28, 2020
@aanm aanm deleted the pr/fqdn-delete-cilium branch May 28, 2020 13:20