
test: Delete DNS pods in AfterAll for datapath tests #16835

Merged: 1 commit merged into cilium:master on Jul 13, 2021

Conversation

@joestringer (Member) commented Jul 8, 2021

Commit a0e7712 ("test: Redeploy DNS after changing endpointRoutes")
didn't go quite far enough: It ensured that between individual tests in
a given file, the DNS pods would be redeployed during the next run if
there were significant enough datapath changes. However, the way it did
this was by storing state within the 'kubectl' variable, which is
recreated in each test file. So if the last test in one CI run enabled
endpoint routes mode, then the DNS pods would not be redeployed to
disable endpoint routes mode as part of the next test.

Fix it by just deleting the overall kube-dns deployment at the end of
any test files which reconfigure this datapath option. I assume that the
next test will set up DNS again if it's not available.

Reported-by: Chris Tarazi <chris@isovalent.com>
Fixes: 0e77127dcd7 ("test: Redeploy DNS after changing endpointRoutes")
Fixes: #16717
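
To make the cleanup described above concrete, here is a minimal, self-contained sketch of the idea: delete the kube-dns deployment once every test in a file has run, so that a later file cannot inherit DNS pods that were wired up for endpoint-routes mode. It uses Ginkgo v2's Ordered/AfterAll and shells out to kubectl purely for illustration; the Cilium test suite provides its own AfterAll wrapper and kubectl helpers, so this is not the code from the PR.

	package datapath_test

	import (
		"fmt"
		"os/exec"
		"testing"

		. "github.com/onsi/ginkgo/v2"
	)

	func TestDatapath(t *testing.T) { RunSpecs(t, "datapath cleanup sketch") }

	var _ = Describe("K8sDatapathConfig (cleanup sketch)", Ordered, func() {
		It("runs the datapath specs", func() {
			// ... the actual datapath tests would live here ...
		})

		// Runs once after every spec in this Ordered container, i.e. at the
		// end of the test file, matching the cleanup described above.
		AfterAll(func() {
			out, err := exec.Command("kubectl", "-n", "kube-system",
				"delete", "deployment", "kube-dns", "--ignore-not-found").CombinedOutput()
			if err != nil {
				// Cleanup failures are logged rather than failing the suite.
				fmt.Fprintf(GinkgoWriter, "kube-dns cleanup failed: %v\n%s\n", err, out)
			}
		})
	})

As the discussion below shows, this first cut turned out to be too aggressive: some tests assume DNS is already deployed, so the final version deletes only the DNS pods.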

@joestringer joestringer requested a review from a team July 8, 2021 20:48
@joestringer joestringer requested a review from a team as a code owner July 8, 2021 20:48
@joestringer joestringer added the release-note/ci This PR makes changes to the CI. label Jul 8, 2021
@maintainer-s-little-helper bot added and then removed the dont-merge/needs-release-note-label (The author needs to describe the release impact of these changes.) label Jul 8, 2021
@christarazi (Member) left a comment

LGTM, simple enough. This is a prime example of something that could easily recur, because someone might not be aware that this must be done when they configure endpoint-routes mode. However, it seems unlikely to happen in the near future, so we don't have to solve this problem now.

@joestringer (Member Author)

test-me-please

@joestringer (Member Author) commented Jul 8, 2021

I manually tried this out by just running:

	CNI_INTEGRATION=minikube K8S_VERSION=1.20 ginkgo --focus="K8sCustom*" -- -cilium.provision=false -cilium.kubeconfig=`echo ~/.kube/config` -cilium.image="quay.io/cilium/cilium-ci" -cilium.operator-image="quay.io/cilium/operator" -cilium.operator-suffix="-ci" -cilium.passCLIEnvironment=true

And sure enough, kube-dns was gone at the end along with Cilium.

I'm not the biggest fan of how the test suite leaves your cluster in a broken state after it completes (rather than returning it to where it started, with whatever apps you previously had deployed), but I think that's already the case, so if we want to fix that we can follow up separately.

@joestringer (Member Author)

Cilium init pods went into CrashLoopBackOff; many of the jobs seem to have the same issue:

https://jenkins.cilium.io/job/Cilium-PR-K8s-1.21-kernel-4.9/1018/testReport/junit/Suite-k8s-1/21/K8sHubbleTest_Hubble_Observe_Test_L3_L4_Flow/

	 kube-system         cilium-5jwhk                       0/1     Init:CrashLoopBackOff   5          4m2s    192.168.36.11   k8s1     <none>           <none>
	 kube-system         cilium-fz9m9                       0/1     Init:CrashLoopBackOff   5          4m2s    192.168.36.12   k8s2     <none>           <none>

@joestringer (Member Author)

cilium-combined-logs-previous.txt shows that it's a problem with the mount init container:

2021-07-09T00:25:29.978691280Z cp: cannot stat '/usr/bin/cilium-mount': No such file or directory
2021-07-09T00:25:29.979226735Z rm: cannot remove '/hostbin/cilium-mount': No such file or directory
Error from server (BadRequest): previous terminated container "clean-cilium-state" in pod "cilium-5jwhk" not found

I'll try rebasing to see whether I just need to pull in #16815 to resolve the issue.

@joestringer (Member Author)

test-me-please

@aditighag (Member)

> cilium-combined-logs-previous.txt shows that it's a problem with the mount init container:
>
> 2021-07-09T00:25:29.978691280Z cp: cannot stat '/usr/bin/cilium-mount': No such file or directory
> 2021-07-09T00:25:29.979226735Z rm: cannot remove '/hostbin/cilium-mount': No such file or directory
> Error from server (BadRequest): previous terminated container "clean-cilium-state" in pod "cilium-5jwhk" not found
>
> I'll try rebasing to see whether I just need to pull in #16815 to resolve the issue.

This init container change was introduced in #16815. It looks like there is a mismatch between the Cilium DaemonSet being used (up to date with master) and the image built from your PR (not up to date). Yeah, please rebase your PR.

@joestringer (Member Author)

Also hit #16846.

@joestringer (Member Author)

Evidently not all of the tests ensure that DNS is deployed before running, so the current version of the fix breaks those tests because they're relying on DNS from previous runs.

Maybe I'll just do a complete Cilium + DNS forced redeploy in the AfterAll of the relevant test suites (K8sCustomCalls, K8sDatapathConfig). It may take a few minutes longer but should do the trick.

Updated commit message:

Commit a0e7712 ("test: Redeploy DNS after changing endpointRoutes")
didn't go quite far enough: It ensured that between individual tests in
a given file, the DNS pods would be redeployed during the next run if
there were significant enough datapath changes. However, the way it did
this was by storing state within the 'kubectl' variable, which is
recreated in each test file. So if the last test in one CI run enabled
endpoint routes mode, then the DNS pods would not be redeployed to
disable endpoint routes mode as part of the next test.

Fix it by redeploying DNS after removing Cilium from the cluster.
Kubernetes will remove the current DNS pods and reschedule them, but
they will not launch until the next test deploys a new version of
Cilium.

Reported-by: Chris Tarazi <chris@isovalent.com>
Fixes: 0e77127dcd7 ("test: Redeploy DNS after changing endpointRoutes")
Related: cilium#16717

Signed-off-by: Joe Stringer <joe@cilium.io>
@joestringer (Member Author)

OK, it turns out Cilium was already getting removed at the end of the tests; we just weren't cleaning up the DNS pods. My previous iteration removed the DNS deployment, and apparently subsequent tests wouldn't ensure it was available again. But if Cilium has been removed, it should be sufficient to simply delete the DNS pods and let them get rescheduled into the cluster (then hopefully the next iteration of Cilium will provision the networking for them).
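
A hedged sketch of that final shape, reusing the illustrative Ginkgo file from the earlier sketch and assuming the conventional k8s-app=kube-dns label on the DNS pods (the real change uses the suite's own helpers), would replace the AfterAll body with something like:

	AfterAll(func() {
		// Cilium is already removed by the existing end-of-file cleanup.
		// Deleting the pods, not the deployment, lets Kubernetes reschedule
		// them; they will not come up until the next test deploys Cilium
		// and provides their networking. The label selector here is an
		// assumption for illustration.
		out, err := exec.Command("kubectl", "-n", "kube-system",
			"delete", "pods", "-l", "k8s-app=kube-dns", "--wait=false").CombinedOutput()
		if err != nil {
			fmt.Fprintf(GinkgoWriter, "DNS pod cleanup failed: %v\n%s\n", err, out)
		}
	})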

@joestringer (Member Author)

test-me-please

@joestringer (Member Author)

Triage:

This is addressing a known flake in the tree and is now only hitting other known flakes. Good to merge.

@joestringer joestringer added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 13, 2021
@kkourt kkourt merged commit c18cfc8 into cilium:master Jul 13, 2021
@joestringer joestringer deleted the submit/fix-16717-again branch July 13, 2021 19:44
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.10.3 Jul 14, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.10.3 Jul 14, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.10 in 1.10.3 Jul 15, 2021
@nbusseneau (Member)

Marking for backport to 1.9 following discussion here: #18031 (comment)

Labels: ready-to-merge, release-note/ci
Projects: 1.10.3 (Backport done to v1.10)
Successfully merging this pull request may close these issues: CI: K8sFQDNTest Restart Cilium validate that FQDN is still working: Error reaching kube-dns before test
8 participants