Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: Only restart KubeDNS if required #11207

Merged
merged 2 commits into from May 22, 2020

Conversation

tgraf
Copy link
Member

@tgraf tgraf commented Apr 28, 2020

Instead of always restarting the kube-dns deployment, split the validation of the installation into a separate function so we can validate the deployment and only restart kube-dns if we have to.

While doing so, replace the validation with a more efficient version that invokes as few kubectl execs within loops as possible and parallelize operations were possible.

Given that Cilium is typically re-deployed before this logic is executed, the slowest path is typically the service plumbing. Allow for some aggressive timeout for the Kubernetes DNS service to be plumbed to avoid restarting it in the common case.

Example output:

STEP: Checking if kube-dns service is plumbed correctly
STEP: Checking if pods have identity
STEP: Checking if DNS can resolve
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-b9dcp: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-b9dcp
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-fm9qm: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-fm9qm
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-mxjvw: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-mxjvw
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-fm9qm: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-fm9qm
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-mxjvw: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-mxjvw
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-b9dcp: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-b9dcp
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-fm9qm: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-fm9qm
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-b9dcp: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-b9dcp
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-mxjvw: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-mxjvw
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-b9dcp: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-b9dcp
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-mxjvw: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-mxjvw
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-fm9qm: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-fm9qm
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-b9dcp: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-b9dcp
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-fm9qm: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-fm9qm
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-mxjvw: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-mxjvw
STEP: Kubernetes DNS is not ready: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-mxjvw
STEP: Restarting Kubernetes DSN (-l k8s-app=kube-dns)
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-mxjvw: unable to find service backend 10.68.1.228:53 in datapath of cilium pod cilium-mxjvw
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-b9dcp: unable to find service backend 10.68.1.228:53 in cilium pod cilium-b9dcp
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-fm9qm: unable to find service backend 10.68.1.228:53 in cilium pod cilium-fm9qm
STEP: Waiting for Kubernetes DNS to become operational
STEP: Checking if deployment is ready
STEP: Checking if kube-dns service is plumbed correctly
STEP: Checking if pods have identity
STEP: Checking if DNS can resolve

Before

19:44:29 STEP: Number of ready Cilium pods: 2
19:44:29 STEP: Installing DNS Deployment
19:44:29 STEP: Restarting all kube-system pods
19:45:15 STEP: Performing Cilium preflight check
19:45:15 STEP: Performing Cilium status preflight check
19:45:16 STEP: Performing Cilium controllers preflight check
19:45:18 STEP: Performing Cilium health check
19:45:23 STEP: Performing Cilium service preflight check
19:45:23 STEP: Performing K8s service preflight check
19:45:23 STEP: Waiting for cilium-operator to be ready
19:45:23 STEP: Waiting for kube-dns to be ready
19:45:23 STEP: Running kube-dns preflight check
19:45:26 STEP: Performing K8s service preflight check
19:45:26 STEP: Making sure all endpoints are in ready state
19:45:28 STEP: Launching cilium monitor on "cilium-m9mlj"

Before: 1min

~1min

After

20:22:11 STEP: Number of ready Cilium pods: 2
20:22:11 STEP: Validating if Kubernetes DNS is deployed
20:22:11 STEP: Checking if deployment is ready
20:22:11 STEP: Checking if kube-dns service is plumbed correctly
20:22:11 STEP: Checking if pods have identity
20:22:11 STEP: Checking if DNS can resolve
20:22:13 STEP: Kubernetes DNS is up and operational
20:22:13 STEP: Performing Cilium preflight check
20:22:13 STEP: Performing Cilium status preflight check
20:22:14 STEP: Performing Cilium controllers preflight check
20:22:15 STEP: Performing Cilium health check
20:22:21 STEP: Performing Cilium service preflight check
20:22:21 STEP: Performing K8s service preflight check
20:22:21 STEP: Waiting for cilium-operator to be ready
20:22:22 STEP: Making sure all endpoints are in ready state
20:22:24 STEP: Launching cilium monitor on "cilium-swqh6"

~10s

Fixes: #10980

@tgraf tgraf added area/CI Continuous Integration testing issue or flake release-note/misc This PR makes changes that have no direct user impact. ci/flake This is a known failure that occurs in the tree. Please investigate me! labels Apr 28, 2020
@tgraf tgraf requested a review from a team as a code owner April 28, 2020 22:47
@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.8.0 Apr 28, 2020
@tgraf tgraf marked this pull request as draft April 28, 2020 22:47
@tgraf
Copy link
Member Author

tgraf commented Apr 28, 2020

test-me-please

@tgraf tgraf force-pushed the pr/tgraf/ci-restart-kubedns-if-needed branch 2 times, most recently from 0f08186 to 38b2ae4 Compare April 29, 2020 08:27
@tgraf
Copy link
Member Author

tgraf commented Apr 29, 2020

test-me-please

@tgraf tgraf force-pushed the pr/tgraf/ci-restart-kubedns-if-needed branch from 38b2ae4 to c06ec3f Compare May 6, 2020 18:21
@tgraf tgraf marked this pull request as ready for review May 6, 2020 18:22
@tgraf
Copy link
Member Author

tgraf commented May 6, 2020

test-me-please

@joestringer
Copy link
Member

Maybe this fixes #10980 ?

@tgraf
Copy link
Member Author

tgraf commented May 6, 2020

Maybe this fixes #10980 ?

Yes, it does. I'll add a fixes tag

@tgraf
Copy link
Member Author

tgraf commented May 7, 2020

Still needs more work:

19:37:16 STEP: Validating if Kubernetes DNS is deployed
19:37:16 STEP: Checking if deployment is ready
19:37:16 STEP: Kubernetes DNS is not ready: unable to retrieve deployment kube-system/kube-dns: Exitcode: 1 
Stdout:
 	 
Stderr:
 	 Error from server (NotFound): deployments.apps "kube-dns" not found
	 

19:37:16 STEP: Restarting Kubernetes DSN (-l k8s-app=kube-dns)
19:37:23 STEP: Waiting for Kubernetes DNS to become operational
19:37:23 STEP: Checking if deployment is ready
19:37:23 STEP: Kubernetes DNS is not ready yet: unable to retrieve deployment kube-system/kube-dns: Exitcode: 1 
Stdout:
 	 
Stderr:
 	 Error from server (NotFound): deployments.apps "kube-dns" not found

@tgraf tgraf marked this pull request as draft May 7, 2020 07:49
@b3a-dev b3a-dev added this to In Progress (Cilium) in CI Force May 8, 2020
@tgraf tgraf force-pushed the pr/tgraf/ci-restart-kubedns-if-needed branch from c06ec3f to 0bac652 Compare May 20, 2020 16:17
@tgraf
Copy link
Member Author

tgraf commented May 20, 2020

test-me-please

@tgraf tgraf marked this pull request as ready for review May 20, 2020 16:25
test/helpers/kubectl.go Outdated Show resolved Hide resolved
test/k8sT/assertionHelpers.go Outdated Show resolved Hide resolved
test/helpers/kubectl.go Show resolved Hide resolved
test/helpers/kubectl.go Outdated Show resolved Hide resolved
@tgraf tgraf force-pushed the pr/tgraf/ci-restart-kubedns-if-needed branch from 0bac652 to 00e22b1 Compare May 20, 2020 20:54
@tgraf
Copy link
Member Author

tgraf commented May 20, 2020

test-me-please

K8s-1.17 and K8s-1.11:

  • Panic in test PodsHaveCiliumIdentity() -> fixed
    K8s-1.18:
  • K8sFQDNTest_Restart_Cilium_validate_that_FQDN_is_still_working

@coveralls
Copy link

coveralls commented May 20, 2020

Coverage Status

Coverage decreased (-0.006%) to 36.861% when pulling 8694b8d on pr/tgraf/ci-restart-kubedns-if-needed into 43ccfc9 on master.

@tgraf tgraf force-pushed the pr/tgraf/ci-restart-kubedns-if-needed branch from 00e22b1 to 2cca7f6 Compare May 21, 2020 12:34
@tgraf
Copy link
Member Author

tgraf commented May 21, 2020

retest-4.19

EDIT: Passed

Copy link
Member

@nebril nebril left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome stuff! Two minor, but nice to have change requests left inline.

test/helpers/kubectl.go Show resolved Hide resolved
test/helpers/kubectl.go Show resolved Hide resolved
Instead of always restarting the kube-dns deployment, split the
validation of the installation into a separate function so we can
validate the deployment and only restart kube-dns if we have to.

While doing so, replace the validation with a more efficient version
that invokes as few kubectl execs within loops as possible and
parallelizes operations were possible.

Given that Cilium is typically re-deployed before this logic is
executed, the slowest path is typically the service plumbing. Allow for
some aggressive timeout for the Kubernetes DNS service to be plumbed to
avoid restarting it in the common case.

Example output:
```
STEP: Checking if kube-dns service is plumbed correctly
STEP: Checking if pods have identity
STEP: Checking if DNS can resolve
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-b9dcp: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-b9dcp
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-fm9qm: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-fm9qm
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-mxjvw: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-mxjvw
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-fm9qm: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-fm9qm
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-mxjvw: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-mxjvw
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-b9dcp: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-b9dcp
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-fm9qm: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-fm9qm
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-b9dcp: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-b9dcp
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-mxjvw: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-mxjvw
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-b9dcp: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-b9dcp
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-mxjvw: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-mxjvw
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-fm9qm: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-fm9qm
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-b9dcp: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-b9dcp
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-fm9qm: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-fm9qm
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-mxjvw: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-mxjvw
STEP: Kubernetes DNS is not ready: ClusterIP 10.71.240.10 not found in service list of cilium pod cilium-mxjvw
STEP: Restarting Kubernetes DSN (-l k8s-app=kube-dns)
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-mxjvw: unable to find service backend 10.68.1.228:53 in datapath of cilium pod cilium-mxjvw
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-b9dcp: unable to find service backend 10.68.1.228:53 in cilium pod cilium-b9dcp
STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-fm9qm: unable to find service backend 10.68.1.228:53 in cilium pod cilium-fm9qm
STEP: Waiting for Kubernetes DNS to become operational
STEP: Checking if deployment is ready
STEP: Checking if kube-dns service is plumbed correctly
STEP: Checking if pods have identity
STEP: Checking if DNS can resolve
```

Signed-off-by: Thomas Graf <thomas@cilium.io>
Instead of restarting all kube-system pods, restart the unmanaged pods.

Signed-off-by: Thomas Graf <thomas@cilium.io>
@tgraf tgraf force-pushed the pr/tgraf/ci-restart-kubedns-if-needed branch from 2cca7f6 to 8694b8d Compare May 21, 2020 18:32
@tgraf tgraf requested a review from nebril May 21, 2020 18:32
@netlify
Copy link

netlify bot commented May 21, 2020

Deploy preview for cilium-docs failed.

Built with commit 8694b8d

https://app.netlify.com/sites/cilium-docs/deploys/5ec6c950db57fb00075abd64

@tgraf
Copy link
Member Author

tgraf commented May 21, 2020

test-me-please

@tgraf tgraf merged commit cc84c1d into master May 22, 2020
1.8.0 automation moved this from In progress to Merged May 22, 2020
CI Force automation moved this from In Progress (Cilium) to Fixed / Done May 22, 2020
@tgraf tgraf deleted the pr/tgraf/ci-restart-kubedns-if-needed branch May 22, 2020 09:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me! release-note/misc This PR makes changes that have no direct user impact.
Projects
No open projects
1.8.0
  
Merged
CI Force
  
Fixed / Done
Development

Successfully merging this pull request may close these issues.

Investigate avoiding redeploying DNS for every DatapathConfiguration test
6 participants