
CI: K8sFQDNTest Restart Cilium validate that FQDN is still working: Error reaching kube-dns before test #16717

Closed
pchaigno opened this issue Jun 30, 2021 · 8 comments · Fixed by #16767 or #16835
Labels
area/CI (Continuous Integration testing issue or flake), ci/flake (This is a known failure that occurs in the tree. Please investigate me!)

@pchaigno
Member

https://jenkins.cilium.io/job/cilium-master-k8s-1.17-kernel-4.9/132/testReport/Suite-k8s-1/17/K8sFQDNTest_Restart_Cilium_validate_that_FQDN_is_still_working/
3205f837_K8sFQDNTest_Restart_Cilium_validate_that_FQDN_is_still_working.zip

Stacktrace

/home/jenkins/workspace/cilium-master-k8s-1.17-kernel-4.9/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:465
Error reaching kube-dns before test: error looking up kube-dns.kube-system.svc.cluster.local from default/app2-5cc5d58844-nv6wr: ;; connection timed out; no servers could be reached

command terminated with exit code 1

Expected
    <*errors.errorString | 0xc0024981b0>: {
        s: "error looking up kube-dns.kube-system.svc.cluster.local from default/app2-5cc5d58844-nv6wr: ;; connection timed out; no servers could be reached\n\ncommand terminated with exit code 1\n",
    }
to be nil
/home/jenkins/workspace/cilium-master-k8s-1.17-kernel-4.9/src/github.com/cilium/cilium/test/k8sT/fqdn.go:89

Standard Output

Cilium pods: [cilium-mxkd7 cilium-zggcq]
Netpols loaded: 
CiliumNetworkPolicies loaded: 
Endpoint Policy Enforcement:
Pod                          Ingress   Egress
grafana-7fd557d749-qs865               
prometheus-d87f8f984-7rqc2             
app1-7b6ddb776f-9q5nl                  
app1-7b6ddb776f-n4vvx                  
app2-5cc5d58844-nv6wr                  
app3-6c7856c5b5-fs6n9                  
coredns-767d4c6dd7-jsfcl               
Cilium agent 'cilium-mxkd7': Status: Ok  Health: Ok Nodes "" ContinerRuntime:  Kubernetes: Ok KVstore: Ok Controllers: Total 36 Failed 0
Cilium agent 'cilium-zggcq': Status: Ok  Health: Ok Nodes "" ContinerRuntime:  Kubernetes: Ok KVstore: Ok Controllers: Total 35 Failed 0

Standard Error

15:21:47 STEP: Running BeforeAll block for EntireTestsuite K8sFQDNTest
15:21:47 STEP: Ensuring the namespace kube-system exists
15:21:47 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium-test-logs")
15:21:47 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium-test-logs") => <nil>
15:21:47 STEP: Installing Cilium
15:21:48 STEP: Waiting for Cilium to become ready
15:22:18 STEP: Validating if Kubernetes DNS is deployed
15:22:18 STEP: Checking if deployment is ready
15:22:18 STEP: Checking if kube-dns service is plumbed correctly
15:22:18 STEP: Checking if DNS can resolve
15:22:18 STEP: Checking if pods have identity
15:22:20 STEP: Kubernetes DNS is up and operational
15:22:20 STEP: Validating Cilium Installation
15:22:20 STEP: Performing Cilium controllers preflight check
15:22:20 STEP: Performing Cilium health check
15:22:20 STEP: Performing Cilium status preflight check
15:22:26 STEP: Performing Cilium service preflight check
15:22:26 STEP: Performing K8s service preflight check
15:22:26 STEP: Cilium is not ready yet: connectivity health is failing: Cluster connectivity is unhealthy on 'cilium-mxkd7': Exitcode: 1 
Err: exit status 1
Stdout:
 	 
Stderr:
 	 Error: Cannot get status/probe: Put "http://%2Fvar%2Frun%2Fcilium%2Fhealth.sock/v1beta/status/probe": dial unix /var/run/cilium/health.sock: connect: no such file or directory
	 
	 command terminated with exit code 1
	 

15:22:26 STEP: Performing Cilium controllers preflight check
15:22:26 STEP: Performing Cilium status preflight check
15:22:26 STEP: Performing Cilium health check
15:22:28 STEP: Performing Cilium service preflight check
15:22:28 STEP: Performing K8s service preflight check
15:22:28 STEP: Performing Cilium controllers preflight check
15:22:28 STEP: Performing Cilium health check
15:22:28 STEP: Performing Cilium status preflight check
15:22:30 STEP: Performing Cilium service preflight check
15:22:30 STEP: Performing K8s service preflight check
15:22:30 STEP: Performing Cilium controllers preflight check
15:22:30 STEP: Performing Cilium status preflight check
15:22:30 STEP: Performing Cilium health check
15:22:31 STEP: Performing Cilium service preflight check
15:22:31 STEP: Performing K8s service preflight check
15:22:31 STEP: Performing Cilium status preflight check
15:22:31 STEP: Performing Cilium controllers preflight check
15:22:31 STEP: Performing Cilium health check
15:22:34 STEP: Performing Cilium service preflight check
15:22:34 STEP: Performing K8s service preflight check
15:22:34 STEP: Performing Cilium controllers preflight check
15:22:34 STEP: Performing Cilium status preflight check
15:22:34 STEP: Performing Cilium health check
15:22:35 STEP: Performing Cilium service preflight check
15:22:35 STEP: Performing K8s service preflight check
15:22:35 STEP: Performing Cilium controllers preflight check
15:22:35 STEP: Performing Cilium health check
15:22:35 STEP: Performing Cilium status preflight check
15:22:38 STEP: Performing Cilium service preflight check
15:22:38 STEP: Performing K8s service preflight check
15:22:38 STEP: Cilium is not ready yet: connectivity health is failing: Cluster connectivity is unhealthy on 'cilium-mxkd7': Exitcode: 1 
Err: exit status 1
Stdout:
 	 
Stderr:
 	 Error: Cannot get status/probe: Put "http://%2Fvar%2Frun%2Fcilium%2Fhealth.sock/v1beta/status/probe": dial unix /var/run/cilium/health.sock: connect: no such file or directory
	 
	 command terminated with exit code 1
	 

15:22:38 STEP: Performing Cilium controllers preflight check
15:22:38 STEP: Performing Cilium status preflight check
15:22:38 STEP: Performing Cilium health check
15:22:40 STEP: Performing Cilium service preflight check
15:22:40 STEP: Performing K8s service preflight check
15:22:40 STEP: Performing Cilium controllers preflight check
15:22:40 STEP: Performing Cilium status preflight check
15:22:40 STEP: Performing Cilium health check
15:22:41 STEP: Performing Cilium service preflight check
15:22:41 STEP: Performing K8s service preflight check
15:22:41 STEP: Performing Cilium status preflight check
15:22:41 STEP: Performing Cilium controllers preflight check
15:22:41 STEP: Performing Cilium health check
15:22:43 STEP: Performing Cilium service preflight check
15:22:43 STEP: Performing K8s service preflight check
15:22:43 STEP: Performing Cilium controllers preflight check
15:22:43 STEP: Performing Cilium status preflight check
15:22:43 STEP: Performing Cilium health check
15:22:45 STEP: Performing Cilium service preflight check
15:22:45 STEP: Performing K8s service preflight check
15:22:45 STEP: Performing Cilium controllers preflight check
15:22:45 STEP: Performing Cilium health check
15:22:45 STEP: Performing Cilium status preflight check
15:22:47 STEP: Performing Cilium service preflight check
15:22:47 STEP: Performing K8s service preflight check
15:22:47 STEP: Performing Cilium controllers preflight check
15:22:47 STEP: Performing Cilium status preflight check
15:22:47 STEP: Performing Cilium health check
15:22:48 STEP: Performing Cilium service preflight check
15:22:48 STEP: Performing K8s service preflight check
15:22:48 STEP: Cilium is not ready yet: connectivity health is failing: Cluster connectivity is unhealthy on 'cilium-mxkd7': Exitcode: 1 
Err: exit status 1
Stdout:
 	 
Stderr:
 	 Error: Cannot get status/probe: Put "http://%2Fvar%2Frun%2Fcilium%2Fhealth.sock/v1beta/status/probe": dial unix /var/run/cilium/health.sock: connect: no such file or directory
	 
	 command terminated with exit code 1
	 

15:22:48 STEP: Performing Cilium controllers preflight check
15:22:48 STEP: Performing Cilium health check
15:22:48 STEP: Performing Cilium status preflight check
15:22:50 STEP: Performing Cilium service preflight check
15:22:50 STEP: Performing K8s service preflight check
15:22:50 STEP: Performing Cilium controllers preflight check
15:22:50 STEP: Performing Cilium status preflight check
15:22:50 STEP: Performing Cilium health check
15:22:51 STEP: Performing Cilium service preflight check
15:22:51 STEP: Performing K8s service preflight check
15:22:51 STEP: Performing Cilium controllers preflight check
15:22:51 STEP: Performing Cilium health check
15:22:51 STEP: Performing Cilium status preflight check
15:22:54 STEP: Performing Cilium service preflight check
15:22:54 STEP: Performing K8s service preflight check
15:22:54 STEP: Performing Cilium status preflight check
15:22:54 STEP: Performing Cilium controllers preflight check
15:22:54 STEP: Performing Cilium health check
15:22:57 STEP: Performing Cilium service preflight check
15:22:57 STEP: Performing K8s service preflight check
15:22:57 STEP: Performing Cilium controllers preflight check
15:22:57 STEP: Performing Cilium status preflight check
15:22:57 STEP: Performing Cilium health check
15:23:00 STEP: Performing Cilium service preflight check
15:23:00 STEP: Performing K8s service preflight check
15:23:00 STEP: Performing Cilium controllers preflight check
15:23:00 STEP: Performing Cilium health check
15:23:00 STEP: Performing Cilium status preflight check
15:23:01 STEP: Performing Cilium service preflight check
15:23:01 STEP: Performing K8s service preflight check
15:23:01 STEP: Cilium is not ready yet: connectivity health is failing: Cluster connectivity is unhealthy on 'cilium-mxkd7': Exitcode: 1 
Err: exit status 1
Stdout:
 	 
Stderr:
 	 Error: Cannot get status/probe: Put "http://%2Fvar%2Frun%2Fcilium%2Fhealth.sock/v1beta/status/probe": dial unix /var/run/cilium/health.sock: connect: no such file or directory
	 
	 command terminated with exit code 1
	 

15:23:01 STEP: Performing Cilium controllers preflight check
15:23:01 STEP: Performing Cilium health check
15:23:01 STEP: Performing Cilium status preflight check
15:23:02 STEP: Performing Cilium service preflight check
15:23:02 STEP: Performing K8s service preflight check
15:23:02 STEP: Performing Cilium controllers preflight check
15:23:02 STEP: Performing Cilium health check
15:23:02 STEP: Performing Cilium status preflight check
15:23:03 STEP: Performing Cilium service preflight check
15:23:03 STEP: Performing K8s service preflight check
15:23:03 STEP: Performing Cilium controllers preflight check
15:23:03 STEP: Performing Cilium status preflight check
15:23:03 STEP: Performing Cilium health check
15:23:05 STEP: Performing Cilium service preflight check
15:23:05 STEP: Performing K8s service preflight check
15:23:05 STEP: Performing Cilium controllers preflight check
15:23:05 STEP: Performing Cilium health check
15:23:05 STEP: Performing Cilium status preflight check
15:23:08 STEP: Performing Cilium service preflight check
15:23:08 STEP: Performing K8s service preflight check
15:23:08 STEP: Performing Cilium controllers preflight check
15:23:08 STEP: Performing Cilium status preflight check
15:23:08 STEP: Performing Cilium health check
15:23:10 STEP: Performing Cilium service preflight check
15:23:10 STEP: Performing K8s service preflight check
15:23:10 STEP: Performing Cilium controllers preflight check
15:23:10 STEP: Performing Cilium status preflight check
15:23:10 STEP: Performing Cilium health check
15:23:11 STEP: Performing Cilium service preflight check
15:23:11 STEP: Performing K8s service preflight check
15:23:11 STEP: Cilium is not ready yet: connectivity health is failing: Cluster connectivity is unhealthy on 'cilium-mxkd7': Exitcode: 1 
Err: exit status 1
Stdout:
 	 
Stderr:
 	 Error: Cannot get status/probe: Put "http://%2Fvar%2Frun%2Fcilium%2Fhealth.sock/v1beta/status/probe": dial unix /var/run/cilium/health.sock: connect: no such file or directory
	 
	 command terminated with exit code 1
	 

15:23:11 STEP: Performing Cilium controllers preflight check
15:23:11 STEP: Performing Cilium health check
15:23:11 STEP: Performing Cilium status preflight check
15:23:13 STEP: Performing Cilium service preflight check
15:23:13 STEP: Performing K8s service preflight check
15:23:13 STEP: Performing Cilium controllers preflight check
15:23:13 STEP: Performing Cilium health check
15:23:13 STEP: Performing Cilium status preflight check
15:23:16 STEP: Performing Cilium service preflight check
15:23:16 STEP: Performing K8s service preflight check
15:23:16 STEP: Performing Cilium status preflight check
15:23:16 STEP: Performing Cilium controllers preflight check
15:23:16 STEP: Performing Cilium health check
15:23:18 STEP: Performing Cilium service preflight check
15:23:18 STEP: Performing K8s service preflight check
15:23:19 STEP: Waiting for cilium-operator to be ready
15:23:19 STEP: WaitforPods(namespace="kube-system", filter="-l name=cilium-operator")
15:23:19 STEP: WaitforPods(namespace="kube-system", filter="-l name=cilium-operator") => <nil>
15:23:19 STEP: Applying demo manifest
15:23:19 STEP: WaitforPods(namespace="default", filter="-l zgroup=testapp")
15:23:29 STEP: WaitforPods(namespace="default", filter="-l zgroup=testapp") => <nil>
FAIL: Error reaching kube-dns before test: error looking up kube-dns.kube-system.svc.cluster.local from default/app2-5cc5d58844-nv6wr: ;; connection timed out; no servers could be reached

command terminated with exit code 1

Expected
    <*errors.errorString | 0xc0024981b0>: {
        s: "error looking up kube-dns.kube-system.svc.cluster.local from default/app2-5cc5d58844-nv6wr: ;; connection timed out; no servers could be reached\n\ncommand terminated with exit code 1\n",
    }
to be nil
===================== TEST FAILED =====================
15:27:29 STEP: Running AfterFailed block for EntireTestsuite K8sFQDNTest
cmd: kubectl get pods -o wide --all-namespaces
Exitcode: 0 
Stdout:
 	 NAMESPACE           NAME                               READY   STATUS    RESTARTS   AGE     IP              NODE   NOMINATED NODE   READINESS GATES
	 cilium-monitoring   grafana-7fd557d749-qs865           1/1     Running   0          72m     10.0.0.90       k8s2   <none>           <none>
	 cilium-monitoring   prometheus-d87f8f984-7rqc2         1/1     Running   0          72m     10.0.0.2        k8s2   <none>           <none>
	 default             app1-7b6ddb776f-9q5nl              2/2     Running   0          4m13s   10.0.1.84       k8s1   <none>           <none>
	 default             app1-7b6ddb776f-n4vvx              2/2     Running   0          4m13s   10.0.1.52       k8s1   <none>           <none>
	 default             app2-5cc5d58844-nv6wr              1/1     Running   0          4m13s   10.0.1.234      k8s1   <none>           <none>
	 default             app3-6c7856c5b5-fs6n9              1/1     Running   0          4m13s   10.0.1.159      k8s1   <none>           <none>
	 kube-system         cilium-mxkd7                       1/1     Running   0          5m45s   192.168.36.11   k8s1   <none>           <none>
	 kube-system         cilium-operator-5f99fccbd8-2dbr9   1/1     Running   0          5m44s   192.168.36.11   k8s1   <none>           <none>
	 kube-system         cilium-operator-5f99fccbd8-dvpg7   1/1     Running   0          5m44s   192.168.36.12   k8s2   <none>           <none>
	 kube-system         cilium-zggcq                       1/1     Running   0          5m44s   192.168.36.12   k8s2   <none>           <none>
	 kube-system         coredns-767d4c6dd7-jsfcl           1/1     Running   0          7m6s    10.0.0.251      k8s2   <none>           <none>
	 kube-system         etcd-k8s1                          1/1     Running   0          75m     192.168.36.11   k8s1   <none>           <none>
	 kube-system         kube-apiserver-k8s1                1/1     Running   0          75m     192.168.36.11   k8s1   <none>           <none>
	 kube-system         kube-controller-manager-k8s1       1/1     Running   0          75m     192.168.36.11   k8s1   <none>           <none>
	 kube-system         kube-proxy-2tcqm                   1/1     Running   0          75m     192.168.36.11   k8s1   <none>           <none>
	 kube-system         kube-proxy-dx6pr                   1/1     Running   0          73m     192.168.36.12   k8s2   <none>           <none>
	 kube-system         kube-scheduler-k8s1                1/1     Running   0          75m     192.168.36.11   k8s1   <none>           <none>
	 kube-system         log-gatherer-kcfvr                 1/1     Running   0          72m     192.168.36.11   k8s1   <none>           <none>
	 kube-system         log-gatherer-qcb6s                 1/1     Running   0          72m     192.168.36.12   k8s2   <none>           <none>
	 kube-system         registry-adder-26hfp               1/1     Running   0          72m     192.168.36.11   k8s1   <none>           <none>
	 kube-system         registry-adder-ldqph               1/1     Running   0          72m     192.168.36.12   k8s2   <none>           <none>
	 
Stderr:
 	 

Fetching command output from pods [cilium-mxkd7 cilium-zggcq]
cmd: kubectl exec -n kube-system cilium-mxkd7 -- cilium service list
Exitcode: 0 
Stdout:
 	 ID   Frontend             Service Type   Backend                   
	 1    10.111.116.89:9090   ClusterIP      1 => 10.0.0.2:9090        
	 2    10.96.66.132:80      ClusterIP      1 => 10.0.1.84:80         
	                                          2 => 10.0.1.52:80         
	 3    10.96.0.1:443        ClusterIP      1 => 192.168.36.11:6443   
	 4    10.96.0.10:53        ClusterIP      1 => 10.0.0.251:53        
	 5    10.96.0.10:9153      ClusterIP      1 => 10.0.0.251:9153      
	 6    10.106.214.25:3000   ClusterIP      1 => 10.0.0.90:3000       
	 7    10.96.66.132:69      ClusterIP      1 => 10.0.1.84:69         
	                                          2 => 10.0.1.52:69         
	 
Stderr:
 	 

cmd: kubectl exec -n kube-system cilium-mxkd7 -- cilium endpoint list
Exitcode: 0 
Stdout:
 	 ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                            IPv6        IPv4         STATUS   
	            ENFORCEMENT        ENFORCEMENT                                                                                                  
	 880        Disabled           Disabled          13212      k8s:appSecond=true                                     fd02::1af   10.0.1.234   ready   
	                                                            k8s:id=app2                                                                             
	                                                            k8s:io.cilium.k8s.policy.cluster=default                                                
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=app2-account                                    
	                                                            k8s:io.kubernetes.pod.namespace=default                                                 
	                                                            k8s:zgroup=testapp                                                                      
	 1129       Disabled           Disabled          2894       k8s:id=app1                                            fd02::1d3   10.0.1.52    ready   
	                                                            k8s:io.cilium.k8s.policy.cluster=default                                                
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=app1-account                                    
	                                                            k8s:io.kubernetes.pod.namespace=default                                                 
	                                                            k8s:zgroup=testapp                                                                      
	 1749       Disabled           Disabled          4          reserved:health                                        fd02::157   10.0.1.221   ready   
	 1868       Disabled           Disabled          2894       k8s:id=app1                                            fd02::16a   10.0.1.84    ready   
	                                                            k8s:io.cilium.k8s.policy.cluster=default                                                
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=app1-account                                    
	                                                            k8s:io.kubernetes.pod.namespace=default                                                 
	                                                            k8s:zgroup=testapp                                                                      
	 2876       Disabled           Disabled          1          k8s:cilium.io/ci-node=k8s1                                                      ready   
	                                                            k8s:node-role.kubernetes.io/master                                                      
	                                                            reserved:host                                                                           
	 3081       Disabled           Disabled          48485      k8s:id=app3                                            fd02::19a   10.0.1.159   ready   
	                                                            k8s:io.cilium.k8s.policy.cluster=default                                                
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=default                                         
	                                                            k8s:io.kubernetes.pod.namespace=default                                                 
	                                                            k8s:zgroup=testapp                                                                      
	 
Stderr:
 	 

cmd: kubectl exec -n kube-system cilium-zggcq -- cilium service list
Exitcode: 0 
Stdout:
 	 ID   Frontend             Service Type   Backend                   
	 1    10.96.0.10:53        ClusterIP      1 => 10.0.0.251:53        
	 2    10.96.0.10:9153      ClusterIP      1 => 10.0.0.251:9153      
	 3    10.106.214.25:3000   ClusterIP      1 => 10.0.0.90:3000       
	 4    10.111.116.89:9090   ClusterIP      1 => 10.0.0.2:9090        
	 5    10.96.66.132:69      ClusterIP      1 => 10.0.1.84:69         
	                                          2 => 10.0.1.52:69         
	 6    10.96.0.1:443        ClusterIP      1 => 192.168.36.11:6443   
	 7    10.96.66.132:80      ClusterIP      1 => 10.0.1.84:80         
	                                          2 => 10.0.1.52:80         
	 
Stderr:
 	 

cmd: kubectl exec -n kube-system cilium-zggcq -- cilium endpoint list
Exitcode: 0 
Stdout:
 	 ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                              IPv6       IPv4         STATUS   
	            ENFORCEMENT        ENFORCEMENT                                                                                                   
	 27         Disabled           Disabled          2560       k8s:app=grafana                                          fd02::1    10.0.0.90    ready   
	                                                            k8s:io.cilium.k8s.policy.cluster=default                                                 
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=default                                          
	                                                            k8s:io.kubernetes.pod.namespace=cilium-monitoring                                        
	 112        Disabled           Disabled          1          k8s:cilium.io/ci-node=k8s2                                                       ready   
	                                                            reserved:host                                                                            
	 2245       Disabled           Disabled          4          reserved:health                                          fd02::81   10.0.0.201   ready   
	 2411       Disabled           Disabled          471        k8s:app=prometheus                                       fd02::7a   10.0.0.2     ready   
	                                                            k8s:io.cilium.k8s.policy.cluster=default                                                 
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=prometheus-k8s                                   
	                                                            k8s:io.kubernetes.pod.namespace=cilium-monitoring                                        
	 3246       Disabled           Disabled          28075      k8s:io.cilium.k8s.policy.cluster=default                 fd02::51   10.0.0.251   ready   
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=coredns                                          
	                                                            k8s:io.kubernetes.pod.namespace=kube-system                                              
	                                                            k8s:k8s-app=kube-dns                                                                     
	 
Stderr:
 	 

===================== Exiting AfterFailed =====================
15:27:50 STEP: Running AfterEach for block EntireTestsuite K8sFQDNTest
15:27:50 STEP: Running AfterEach for block EntireTestsuite
@pchaigno pchaigno added area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me! labels Jun 30, 2021
@pchaigno pchaigno self-assigned this Jun 30, 2021
@pchaigno
Member Author

pchaigno commented Jun 30, 2021

Symptoms

All DNS lookups from default/app2-5cc5d58844-nv6wr seem to hang. We don't see the requests in the CoreDNS logs, so they probably never reached the CoreDNS pod.

As an example, we can trace one request. On the source node:

$ cat ~/Downloads/tmp/cilium-mxkd7-hubble_observe.json | ./hubble observe --from-port 55145 --to-port 53 
Jun 28 15:27:21.649: default/app2-5cc5d58844-nv6wr:55145 <> kube-system/coredns-767d4c6dd7-jsfcl:53 to-overlay FORWARDED (UDP)
Jun 28 15:27:26.651: default/app2-5cc5d58844-nv6wr:55145 <> kube-system/coredns-767d4c6dd7-jsfcl:53 to-overlay FORWARDED (UDP)

On the destination node:

$ cat ~/Downloads/tmp/cilium-zggcq-hubble_observe.json | ./hubble observe --from-port 55145 --to-port 53
Jun 28 15:27:21.656: default/app2-5cc5d58844-nv6wr:55145 -> kube-system/coredns-767d4c6dd7-jsfcl:53 to-endpoint FORWARDED (UDP)
Jun 28 15:27:26.658: default/app2-5cc5d58844-nv6wr:55145 -> kube-system/coredns-767d4c6dd7-jsfcl:53 to-endpoint FORWARDED (UDP)

Monitor aggregation is enabled, so only to-xxx traces are collected.
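
If the packet had been dropped inside the BPF datapath itself, the same dumps should contain a DROPPED verdict for it. As a minimal sketch of that check (assuming the --verdict filter available in recent hubble releases):

# Look for datapath drops of DNS traffic in the destination node's dump;
# no hits would be consistent with the drop happening outside of BPF.
$ cat ~/Downloads/tmp/cilium-zggcq-hubble_observe.json | ./hubble observe --verdict DROPPED --to-port 53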

Datapath Analysis

Endpoint routes are disabled, so these to-endpoint traces were emitted from the cilium_vxlan device after a tail call from bpf_overlay to bpf_lxc. Then, the packet should be redirected to the lxc device and enter the container. Let's confirm that.

We can first get the lxc device for the destination CoreDNS container:

$ jq '.[].status | select(."external-identifiers"."k8s-pod-name" == "coredns-767d4c6dd7-jsfcl").networking' cilium-zggcq-endpoint_list.txt
{
  "addressing": [
    {
      "ipv4": "10.0.0.251",
      "ipv6": "fd02::51"
    }
  ],
  "host-mac": "9a:b7:ea:b0:e5:82",
  "interface-index": 207,
  "interface-name": "lxc3e1dbd546d93",
  "mac": "8a:52:f4:b8:28:b8"
}

We can then check the BPF programs attached to the node:

$ cat bugtool-cilium-zggcq/cmd/bpftool-net-show.md 
Error: Netlink error reporting not supported
xdp:

tc:
cilium_net(9) clsact/ingress bpf_host_cilium_net.o:[to-host]
cilium_host(10) clsact/ingress bpf_host.o:[to-host]
cilium_host(10) clsact/egress bpf_host.o:[from-host]
lxc899cabc800bc(17) clsact/ingress bpf_lxc.o:[from-container]
lxcbea3fce3587e(19) clsact/ingress bpf_lxc.o:[from-container]
lxc3e1dbd546d93(207) clsact/ingress bpf_lxc.o:[from-container]
lxc3e1dbd546d93(207) clsact/egress bpf_lxc.o:[to-container]
cilium_vxlan(212) clsact/ingress bpf_overlay.o:[from-overlay]
cilium_vxlan(212) clsact/egress bpf_overlay.o:[to-overlay]
lxc_health(214) clsact/ingress bpf_lxc.o:[from-container]

flow_dissector:

Here we see that, unlike the other containers, the CoreDNS pod has two BPF programs attached, one at ingress and one at egress. That should only be the case when endpoint routes are enabled. Similarly, in the routing table, it is the only endpoint with a per-endpoint route:

$ grep lxc3e1dbd546d93 bugtool-cilium-zggcq/cmd/ip--4-r.md 
10.0.0.251 dev lxc scope link

Therefore, the DNS packet is passed to the stack. It flows through netfilter and hits the filter table's FORWARD chain:

Chain FORWARD (policy DROP 84 packets, 8322 bytes)
 pkts bytes target     prot opt in     out     source               destination         
 2162 3076K CILIUM_FORWARD  all  --  any    any     anywhere             anywhere             /* cilium-feeder: CILIUM_FORWARD */
   96 13716 KUBE-FORWARD  all  --  any    any     anywhere             anywhere             /* kubernetes forwarding rules */
   48  5280 KUBE-SERVICES  all  --  any    any     anywhere             anywhere             ctstate NEW /* kubernetes service portals */
[...]
Chain CILIUM_FORWARD (1 references)
 pkts bytes target     prot opt in     out     source               destination         
 1204 2923K ACCEPT     all  --  any    cilium_host  anywhere             anywhere             /* cilium: any->cluster on cilium_host forward accept */
    0     0 ACCEPT     all  --  cilium_host any     anywhere             anywhere             /* cilium: cluster->any on cilium_host forward accept (nodeport) */
  864  140K ACCEPT     all  --  lxc+   any     anywhere             anywhere             /* cilium: cluster->any on lxc+ forward accept */
    0     0 ACCEPT     all  --  cilium_net any     anywhere             anywhere             /* cilium: cluster->any on cilium_net forward accept (nodeport) */

Since endpoint routes are disabled in the agent, the rules installed in CILIUM_FORWARD don't match the packet: they assume it will come out of cilium_host. Per FORWARD's default DROP policy, the packet is dropped.
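
A minimal way to confirm this on the destination node is to watch the FORWARD chain's default-policy counters while the lookup is retried. This is only a sketch: the pod name is taken from this run, and the availability of dig inside the app2 image is an assumption.

# On k8s2, note the chain header counters ("policy DROP N packets").
$ sudo iptables -L FORWARD -v -n | head -n 1

# Retry the lookup from the affected pod; it should time out again.
$ kubectl exec -n default app2-5cc5d58844-nv6wr -- dig +time=2 +tries=1 kube-dns.kube-system.svc.cluster.local

# Check the counters again; an increase matching the retried DNS packets
# points at the FORWARD chain's default DROP policy.
$ sudo iptables -L FORWARD -v -n | head -n 1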

Root Cause

This is actually a known limitation of Cilium since #16227: toggling endpoint routes on an existing Cilium installation is not supported, even though we do it in CI. If endpoint routes are enabled or disabled in the agent, the new setting is not reflected in existing endpoints (including the CoreDNS endpoint in our case). We usually work around it by deleting the existing pods so that their routes are reinstalled from scratch. That would be a short-term solution here.
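
That workaround boils down to something like the following sketch; either form recreates the CoreDNS endpoint so that Cilium re-plumbs its route and BPF programs with the agent's current endpoint-routes setting:

# Delete the kube-dns/CoreDNS pods so the ReplicaSet recreates them.
$ kubectl -n kube-system delete pod -l k8s-app=kube-dns

# Or, on clusters where CoreDNS is managed as a Deployment:
$ kubectl -n kube-system rollout restart deployment/coredns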

@pchaigno
Member Author

This would only happen in our 4.9 CI job because of the following condition:

cilium/bpf/bpf_lxc.c

Lines 1149 to 1156 in e6f34c3

#if !defined(ENABLE_ROUTING) && defined(TUNNEL_MODE) && !defined(ENABLE_NODEPORT)
	/* See comment in IPv4 path. */
	ctx_change_type(ctx, PACKET_HOST);
#else
	ifindex = ctx_load_meta(ctx, CB_IFINDEX);
	if (ifindex)
		return redirect_ep(ctx, ifindex, from_host);
#endif /* ENABLE_ROUTING && TUNNEL_MODE && !ENABLE_NODEPORT */

So when ENABLE_NODEPORT is defined (it is only undefined on 4.9), we redirect the packet straight to the lxc device instead of passing it to the stack, and we therefore skip the FORWARD chain entirely. Only on 4.9 does the packet go up the stack and hit the default DROP policy.
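
A quick way to tell which side of this #if a given run takes is to check whether the agent is running with the kube-proxy replacement (and thus ENABLE_NODEPORT) enabled. A sketch, assuming the KubeProxyReplacement line printed by cilium status on recent releases:

# On the 4.9 job this is expected to show the replacement as disabled,
# i.e. ENABLE_NODEPORT is not compiled in.
$ kubectl exec -n kube-system cilium-zggcq -- cilium status --verbose | grep -i kubeproxyreplacement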

@joestringer
Member

I believe that K8sDatapathConfig is the only suite which may toggle endpoint routes mode on/off, and in this particular failure case the K8sDatapathConfig tests ran immediately prior to K8sFQDN. I wonder if we're just not cleaning up the environment thoroughly enough in the AfterAll / BeforeAll steps of one of these two contexts. This could explain why we don't see the failure more often: it requires particular groups of tests to be run in a particular order.

@pchaigno
Member Author

This could explain why we don't see the failure more often - it requires particular groups of tests to be run in a particular order.

Yep, but checking if DNS resolves is the first thing we do after any Cilium deployment AFAIK. So any test without endpoint routes running after a test with endpoint routes should fail.

@joestringer
Member

joestringer commented Jun 30, 2021

Seems like that should be established as part of this path:

DeployCiliumAndDNS(kubectl, ciliumFilename)

...

vm.RedeployKubernetesDnsIfNecessary()

...

err := kub.ValidateKubernetesDNS()

...

if err := kub.KubernetesDNSCanResolve("default", "kubernetes"); err != nil {

However, the failing line comes later than this, so whatever check we did above was functionally different from the actual DNS lookup.

Expect(err).Should(BeNil(), "Error reaching kube-dns before test: %s", err)

EDIT: Yep, we validate DNS from one of the hosts, not from pods:

cilium/test/helpers/kubectl.go

Lines 1747 to 1748 in eb9a5c4

cmd := fmt.Sprintf("dig +short %s @%s", serviceToResolve, kubeDnsService.Spec.ClusterIP)
res := kub.ExecInFirstPod(ctx, LogGathererNamespace, logGathererSelector(false), cmd)

I don't know off-hand how different host DNS resolution is, but it may provide some hints here.

@pchaigno
Member Author

I don't know off-hand how different host DNS resolution is but it may provide some hints here.

Hm. DNS resolution from a host-netns pod should behave the same, as long as it's on a different node than the CoreDNS pod. Once the request reaches the destination node via the tunnel, it's basically indistinguishable from a request from a pod (except from a policy point of view, but we're not concerned with that here).
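
To make the two paths concrete, here is a sketch of both lookups, reusing pod names and the kube-dns ClusterIP from this run:

# The kind of check the preflight validation performs: dig from a
# host-network log-gatherer pod, pointed directly at the kube-dns ClusterIP.
$ kubectl exec -n kube-system log-gatherer-kcfvr -- dig +short kube-dns.kube-system.svc.cluster.local @10.96.0.10

# The failing check: the same lookup from inside the application pod,
# going through the pod's own /etc/resolv.conf.
$ kubectl exec -n default app2-5cc5d58844-nv6wr -- dig +short kube-dns.kube-system.svc.cluster.local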

@joestringer
Member

joestringer commented Jun 30, 2021

I've looked into the following failures and also observed that there are per-endpoint routes for the DNS pod:

https://jenkins.cilium.io/job/cilium-master-k8s-1.21-kernel-4.9/544/testReport/junit/Suite-k8s-1/21/K8sDemosTest_Tests_Star_Wars_Demo/
https://jenkins.cilium.io/job/cilium-master-k8s-1.17-kernel-4.9/129/testReport/junit/Suite-k8s-1/17/K8sDemosTest_Tests_Star_Wars_Demo/
https://jenkins.cilium.io/job/cilium-master-k8s-1.21-kernel-4.9/546/testReport/junit/Suite-k8s-1/21/K8sDemosTest_Tests_Star_Wars_Demo/
https://jenkins.cilium.io/job/cilium-master-k8s-1.21-kernel-4.9/539/testReport/junit/Suite-k8s-1/21/K8sKafkaPolicyTest_Kafka_Policy_Tests_KafkaPolicies/

Here's a one-liner I've been using to establish which endpoints are configured with endpoint-routes mode in a CI sysdump:

$ for dir in $(find test_results/*/*/bugtool-cilium-*/cmd/state/[0-9]* -type d); do base64_decode $(grep BASE64 $dir/ep_config.h) | gron | grep -i -e req -e PodName | norg; done
{"K8sPodName":"xwing-6f56868789-gddzk"}
{"K8sPodName":"spaceship-6567c9b4bd-6gmg6"}
{"K8sPodName":"xwing-6f56868789-h8zzz"}
{"K8sPodName":""}
{"K8sPodName":""}
{"K8sPodName":"deathstar-595989bc5b-g92b4"}
{"K8sPodName":"xwing-6f56868789-cnvmc"}
{"K8sPodName":"spaceship-6567c9b4bd-9hm98"}
{"K8sPodName":"spaceship-6567c9b4bd-btd2x"}
{"K8sPodName":""}
{"K8sPodName":"deathstar-595989bc5b-l6w9h"}
{"K8sPodName":"spaceship-6567c9b4bd-9cxzj"}
{"DatapathConfiguration":{"require-egress-prog":true,"require-routing":false},"K8sPodName":"coredns-755cd654d4-6658j"}
{"K8sPodName":"prometheus-655fb888d7-dv5z4"}
{"K8sPodName":"deathstar-595989bc5b-phz25"}
{"K8sPodName":""}
{"K8sPodName":"grafana-5747bcc8f9-zqjbv"}

EDIT: Oh, and here are some useful pointers to get the repro above to work:

https://github.com/tomnomnom/gron

base64_decode()
{
    echo "$@" | sed -e 's/^.*://' | base64 -di | jq '.'
}
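
Putting the helper and the one-liner together, a self-contained variant that avoids the gron/norg aliases could look like this (a sketch; it assumes the same sysdump layout and BASE64 line format as above):

# For every endpoint state directory in the sysdump, decode the BASE64 blob
# embedded in ep_config.h and print the pod name plus any datapath overrides
# (endpoint-routes mode shows up under DatapathConfiguration).
for dir in $(find test_results/*/*/bugtool-cilium-*/cmd/state/[0-9]* -type d); do
    grep BASE64 "$dir/ep_config.h" | sed -e 's/^.*://' | base64 -di \
        | jq -c '{K8sPodName, DatapathConfiguration}'
done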

@pchaigno pchaigno removed their assignment Jul 2, 2021
@joestringer joestringer self-assigned this Jul 2, 2021
joestringer added a commit that referenced this issue Jul 2, 2021
In general up until now, Cilium has expected endpointRoutes mode to be
set to exactly one value upon deployment and for that value to stay the
same for the remainder of operation. Toggling it can lead to a mix of
endpoints in different datapath modes which is not well covered in CI.

In Github issue #16717 we observed that if the testsuite toggles this
setting then we can end up with kubedns pods remaining in endpoint
routes mode, even though the rest of the daemon (and other pods) are not
configured in this mode. This can lead to connectivity issues in DNS,
and a range of test failures in subsequent tests because DNS is broken.

Longer term to resolve this, we could improve on Cilium to ensure that
users can successfully toggle this setting on or off at runtime and
properly handle this case, or alternatively shift all logic over to
endpoint-routes mode by default and disable the other option.

Given that CI for the master branch is in a poor state due to this issue
today, and that part of the issue is CI reconfiguring the datapath state
of Cilium during the test setup in an unsupported manner, this commit
proposes to force DNS pod redeployment as part of setup any time a test
reconfigures the endpointRoutes mode. This should mitigate the testing
side issue while we mull over the right longer-term solution.

Signed-off-by: Joe Stringer <joe@cilium.io>
joestringer added a commit that referenced this issue Jul 6, 2021
joestringer added a commit that referenced this issue Jul 6, 2021
@pchaigno
Member Author

The fix at #16767 was not sufficient; #16835 should fix it.

@pchaigno pchaigno reopened this Jul 12, 2021
joestringer added a commit to joestringer/cilium that referenced this issue Jul 12, 2021
Commit a0e7712 ("test: Redeploy DNS after changing endpointRoutes")
didn't go quite far enough: It ensured that between individual tests in
a given file, the DNS pods would be redeployed during the next run if
there were significant enough datapath changes. However, the way it did
this was by storing state within the 'kubectl' variable, which is
recreated in each test file. So if the last test in one CI run enabled
endpoint routes mode, then the DNS pods would not be redeployed to
disable endpoint routes mode as part of the next test.

Fix it by redeploying DNS after removing Cilium from the cluster.
Kubernetes will remove the current DNS pods and reschedule them, but
they will not launch until the next test deploys a new version of
Cilium.

Reported-by: Chris Tarazi <chris@isovalent.com>
Fixes: 0e77127dcd7 ("test: Redeploy DNS after changing endpointRoutes")
Related: cilium#16717

Signed-off-by: Joe Stringer <joe@cilium.io>
kkourt pushed a commit that referenced this issue Jul 13, 2021
aanm pushed a commit to pchaigno/cilium that referenced this issue Jul 14, 2021
aanm pushed a commit that referenced this issue Jul 15, 2021
aanm pushed a commit that referenced this issue Jul 15, 2021
aanm pushed a commit that referenced this issue Jul 15, 2021
krishgobinath pushed a commit to krishgobinath/cilium that referenced this issue Oct 20, 2021
@pchaigno pchaigno self-assigned this Nov 28, 2021
nbusseneau pushed a commit that referenced this issue Dec 14, 2021
nbusseneau pushed a commit that referenced this issue Dec 14, 2021
tklauser pushed a commit that referenced this issue Dec 15, 2021
tklauser pushed a commit that referenced this issue Dec 15, 2021