
CI: K8sPolicyTest Multi-node policy test validates ingress CIDR-dependent L4 (FromCIDR + ToPorts) connectivity is restored after importing ingress policy #12596

Closed
pchaigno opened this issue Jul 20, 2020 · 4 comments · Fixed by #14286
Labels
area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me!


@pchaigno
Member

https://jenkins.cilium.io/job/Cilium-PR-K8s-oldest-net-next/1273/testReport/Suite-k8s-1/11/K8sPolicyTest_Multi_node_policy_test_validates_ingress_CIDR_dependent_L4__FromCIDR___ToPorts__connectivity_is_restored_after_importing_ingress_policy/
test_results_Cilium-PR-K8s-oldest-net-next_1273_BDD-Test-PR.zip

The flake has happened 6 times in PRs since the test was added last week. (Test-specific artifacts are always missing when this flake happens.) The two numbers compared below are never off by more than 3. Maybe it's just an issue with cilium monitor in the test? It seems to only happen on net-next.

I'm opening this issue to track the flake until we know more, but it's possible this is another instance of #12555, which was fixed by #12559.

Stacktrace

/home/jenkins/workspace/Cilium-PR-K8s-oldest-net-next/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:514
Monitor output does not show traffic as allowed
Expected
    <int>: 38
to be >=
    <int>: 41
/home/jenkins/workspace/Cilium-PR-K8s-oldest-net-next/src/github.com/cilium/cilium/test/k8sT/Policies.go:1292

Standard Output

Number of "context deadline exceeded" in logs: 0
Number of "level=error" in logs: 0
Number of "level=warning" in logs: 0
Number of "Cilium API handler panicked" in logs: 0
Number of "Goroutine took lock for more than" in logs: 0
No errors/warnings found in logs
Cilium pods: [cilium-gcrfk cilium-l47dx]
Netpols loaded: 
CiliumNetworkPolicies loaded: default::cnp-default-deny-ingress default::cnp-ingress-from-cidr-to-ports 
Endpoint Policy Enforcement:
Pod                           Ingress   Egress
testclient-sbjwr                        
testds-lpfrn                            
testds-wtf56                            
coredns-687db6485c-mkhls                
grafana-5987b75b56-ml2kd                
prometheus-556f8994ff-g9zdb             
test-k8s2-848b6f7864-mlvnl              
testclient-992vk                        
Cilium agent 'cilium-gcrfk': Status: Ok  Health: Ok Nodes "" ContinerRuntime:  Kubernetes: Ok KVstore: Ok Controllers: Total 49 Failed 0
Cilium agent 'cilium-l47dx': Status: Ok  Health: Ok Nodes "" ContinerRuntime:  Kubernetes: Ok KVstore: Ok Controllers: Total 29 Failed 0

Standard Error

11:43:44 STEP: Importing a default deny policy on ingress
11:43:53 STEP: Running cilium monitor in the background
11:43:53 STEP: Importing fromCIDR+toPorts policy on ingress
11:43:56 STEP: Testing that connectivity from outside node is allowed
11:44:06 STEP: Asserting that the expected policy verdict logs are in the monitor output
FAIL: Monitor output does not show traffic as allowed
Expected
    <int>: 38
to be >=
    <int>: 41
=== Test Finished at 2020-07-20T11:44:06Z====
11:44:06 STEP: Running JustAfterEach block for EntireTestsuite K8sPolicyTest
===================== TEST FAILED =====================
11:44:06 STEP: Running AfterFailed block for EntireTestsuite K8sPolicyTest
11:44:06 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium")
11:44:06 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium") => <nil>
11:44:08 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium")
11:44:08 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium") => <nil>
11:44:08 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium-test-logs --field-selector='spec.nodeName!=k8s3'")
11:44:08 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium-test-logs --field-selector='spec.nodeName!=k8s3'") => <nil>
cmd: kubectl get pods -o wide --all-namespaces
Exitcode: 0 
Stdout:
 	 NAMESPACE           NAME                               READY   STATUS    RESTARTS   AGE   IP              NODE   NOMINATED NODE
	 cilium-monitoring   grafana-5987b75b56-ml2kd           1/1     Running   0          21m   10.0.0.201      k8s2   <none>
	 cilium-monitoring   prometheus-556f8994ff-g9zdb        1/1     Running   0          21m   10.0.0.82       k8s2   <none>
	 default             test-k8s2-848b6f7864-mlvnl         2/2     Running   0          2m    10.0.0.240      k8s2   <none>
	 default             testclient-992vk                   1/1     Running   0          2m    10.0.0.84       k8s2   <none>
	 default             testclient-sbjwr                   1/1     Running   0          2m    10.0.1.15       k8s1   <none>
	 default             testds-lpfrn                       2/2     Running   0          2m    10.0.1.196      k8s1   <none>
	 default             testds-wtf56                       2/2     Running   0          2m    10.0.0.183      k8s2   <none>
	 kube-system         cilium-gcrfk                       1/1     Running   0          1m    192.168.36.12   k8s2   <none>
	 kube-system         cilium-l47dx                       1/1     Running   0          1m    192.168.36.11   k8s1   <none>
	 kube-system         cilium-operator-6fd974f587-dmpvq   1/1     Running   0          1m    192.168.36.11   k8s1   <none>
	 kube-system         coredns-687db6485c-mkhls           1/1     Running   0          20m   10.0.0.102      k8s2   <none>
	 kube-system         etcd-k8s1                          1/1     Running   0          23m   192.168.36.11   k8s1   <none>
	 kube-system         kube-apiserver-k8s1                1/1     Running   0          23m   192.168.36.11   k8s1   <none>
	 kube-system         kube-controller-manager-k8s1       1/1     Running   0          23m   192.168.36.11   k8s1   <none>
	 kube-system         kube-scheduler-k8s1                1/1     Running   0          23m   192.168.36.11   k8s1   <none>
	 kube-system         log-gatherer-2v4gn                 1/1     Running   0          21m   192.168.36.12   k8s2   <none>
	 kube-system         log-gatherer-g6br5                 1/1     Running   0          21m   192.168.36.11   k8s1   <none>
	 kube-system         log-gatherer-jj82j                 1/1     Running   0          21m   192.168.36.13   k8s3   <none>
	 kube-system         registry-adder-7vk8j               1/1     Running   0          21m   192.168.36.13   k8s3   <none>
	 kube-system         registry-adder-g7ppt               1/1     Running   0          21m   192.168.36.12   k8s2   <none>
	 kube-system         registry-adder-m26ll               1/1     Running   0          21m   192.168.36.11   k8s1   <none>
	 
Stderr:
 	 

Fetching command output from pods [cilium-gcrfk cilium-l47dx]
11:44:09 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium")
11:44:09 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium-test-logs --field-selector='spec.nodeName!=k8s3'")
11:44:09 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium") => <nil>
11:44:09 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium-test-logs --field-selector='spec.nodeName!=k8s3'") => <nil>
11:44:10 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium-test-logs --field-selector='spec.nodeName!=k8s3'")
11:44:10 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium-test-logs --field-selector='spec.nodeName!=k8s3'") => <nil>
11:44:10 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium")
11:44:11 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium") => <nil>
11:44:13 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium")
11:44:13 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium") => <nil>
11:44:15 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium")
11:44:15 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium") => <nil>
11:44:16 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium")
11:44:16 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium") => <nil>
11:44:16 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium")
11:44:17 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium") => <nil>
11:44:18 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium")
11:44:19 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium") => <nil>
11:44:21 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium")
11:44:21 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium") => <nil>
cmd: kubectl exec -n kube-system cilium-gcrfk -- cilium service list
Exitcode: 0 
Stdout:
 	 ID   Frontend                Service Type   Backend                   
	 1    10.105.69.147:9090      ClusterIP      1 => 10.0.0.82:9090       
	 2    10.96.0.1:443           ClusterIP      1 => 192.168.36.11:6443   
	 3    10.96.0.10:53           ClusterIP      1 => 10.0.0.102:53        
	 4    10.103.65.19:3000       ClusterIP      1 => 10.0.0.201:3000      
	 7    10.111.187.86:80        ClusterIP      1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 8    10.111.187.86:69        ClusterIP      1 => 10.0.0.183:69        
	                                             2 => 10.0.1.196:69        
	 9    10.105.204.100:10080    ClusterIP      1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 10   10.105.204.100:10069    ClusterIP      1 => 10.0.0.183:69        
	                                             2 => 10.0.1.196:69        
	 12   0.0.0.0:30985           NodePort       1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 13   192.168.36.12:30985     NodePort       1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 14   192.168.36.12:31059     NodePort       1 => 10.0.0.183:69        
	                                             2 => 10.0.1.196:69        
	 16   0.0.0.0:31059           NodePort       1 => 10.0.0.183:69        
	                                             2 => 10.0.1.196:69        
	 17   10.106.236.195:10080    ClusterIP      1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 18   10.106.236.195:10069    ClusterIP      1 => 10.0.0.183:69        
	                                             2 => 10.0.1.196:69        
	 19   192.168.36.12:31769     NodePort       1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 21   0.0.0.0:31769           NodePort       1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 22   192.168.36.12:31928     NodePort       1 => 10.0.0.183:69        
	                                             2 => 10.0.1.196:69        
	 24   0.0.0.0:31928           NodePort       1 => 10.0.0.183:69        
	                                             2 => 10.0.1.196:69        
	 25   10.102.239.102:10080    ClusterIP      1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 26   10.102.239.102:10069    ClusterIP      1 => 10.0.0.183:69        
	                                             2 => 10.0.1.196:69        
	 27   192.168.36.12:32355     NodePort       1 => 10.0.0.183:80        
	 28   192.168.36.12:32355/i   NodePort       1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 31   0.0.0.0:32355           NodePort       1 => 10.0.0.183:80        
	 32   0.0.0.0:32355/i         NodePort       1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 33   192.168.36.12:32011     NodePort       1 => 10.0.0.183:69        
	 34   192.168.36.12:32011/i   NodePort       1 => 10.0.0.183:69        
	                                             2 => 10.0.1.196:69        
	 37   0.0.0.0:32011           NodePort       1 => 10.0.0.183:69        
	 38   0.0.0.0:32011/i         NodePort       1 => 10.0.0.183:69        
	                                             2 => 10.0.1.196:69        
	 39   10.108.163.193:10069    ClusterIP      1 => 10.0.0.240:69        
	 40   10.108.163.193:10080    ClusterIP      1 => 10.0.0.240:80        
	 43   192.168.36.12:31008     NodePort       1 => 10.0.0.240:80        
	 44   192.168.36.12:31008/i   NodePort       1 => 10.0.0.240:80        
	 45   0.0.0.0:31008           NodePort       1 => 10.0.0.240:80        
	 46   0.0.0.0:31008/i         NodePort       1 => 10.0.0.240:80        
	 47   0.0.0.0:30308           NodePort       1 => 10.0.0.240:69        
	 48   0.0.0.0:30308/i         NodePort       1 => 10.0.0.240:69        
	 49   192.168.36.12:30308     NodePort       1 => 10.0.0.240:69        
	 50   192.168.36.12:30308/i   NodePort       1 => 10.0.0.240:69        
	 53   10.111.99.143:10080     ClusterIP      1 => 10.0.0.240:80        
	 54   10.111.99.143:10069     ClusterIP      1 => 10.0.0.240:69        
	 55   192.168.36.12:32286     NodePort       1 => 10.0.0.240:80        
	 57   0.0.0.0:32286           NodePort       1 => 10.0.0.240:80        
	 58   0.0.0.0:31692           NodePort       1 => 10.0.0.240:69        
	 60   192.168.36.12:31692     NodePort       1 => 10.0.0.240:69        
	 61   10.98.253.32:80         ClusterIP      1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 63   0.0.0.0:31199           NodePort       1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 64   192.168.36.12:31199     NodePort       1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 65   10.98.62.254:80         ClusterIP      1 => 10.0.0.240:80        
	 66   192.168.36.12:31163     NodePort       1 => 10.0.0.240:80        
	 67   192.168.36.12:31163/i   NodePort       1 => 10.0.0.240:80        
	 70   0.0.0.0:31163           NodePort       1 => 10.0.0.240:80        
	 71   0.0.0.0:31163/i         NodePort       1 => 10.0.0.240:80        
	 72   10.111.66.200:20069     ClusterIP      1 => 10.0.0.183:69        
	                                             2 => 10.0.1.196:69        
	 73   10.111.66.200:20080     ClusterIP      1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 74   192.0.2.233:20069       ExternalIPs    1 => 10.0.0.183:69        
	                                             2 => 10.0.1.196:69        
	 75   192.0.2.233:20080       ExternalIPs    1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 76   192.168.36.12:30466     NodePort       1 => 10.0.0.183:69        
	                                             2 => 10.0.1.196:69        
	 78   0.0.0.0:30466           NodePort       1 => 10.0.0.183:69        
	                                             2 => 10.0.1.196:69        
	 79   192.168.36.12:31511     NodePort       1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 81   0.0.0.0:31511           NodePort       1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 82   192.168.36.12:8080      HostPort       1 => 10.0.0.240:80        
	 84   0.0.0.0:8080            HostPort       1 => 10.0.0.240:80        
	 85   192.168.36.12:6969      HostPort       1 => 10.0.0.240:69        
	 87   0.0.0.0:6969            HostPort       1 => 10.0.0.240:69        
	 
Stderr:
 	 

cmd: kubectl exec -n kube-system cilium-gcrfk -- cilium endpoint list
Exitcode: 0 
Stdout:
 	 ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                              IPv6       IPv4         STATUS   
	            ENFORCEMENT        ENFORCEMENT                                                                                                   
	 3          Disabled           Disabled          11066      k8s:app=prometheus                                       fd00::e8   10.0.0.82    ready   
	                                                            k8s:io.cilium.k8s.policy.cluster=default                                                 
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=prometheus-k8s                                   
	                                                            k8s:io.kubernetes.pod.namespace=cilium-monitoring                                        
	 17         Disabled           Disabled          4          reserved:health                                          fd00::5d   10.0.0.168   ready   
	 255        Disabled           Disabled          1649       k8s:io.cilium.k8s.policy.cluster=default                 fd00::48   10.0.0.102   ready   
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=coredns                                          
	                                                            k8s:io.kubernetes.pod.namespace=kube-system                                              
	                                                            k8s:k8s-app=kube-dns                                                                     
	 543        Enabled            Disabled          14837      k8s:io.cilium.k8s.policy.cluster=default                 fd00::33   10.0.0.240   ready   
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=default                                          
	                                                            k8s:io.kubernetes.pod.namespace=default                                                  
	                                                            k8s:zgroup=test-k8s2                                                                     
	 1390       Enabled            Disabled          9573       k8s:io.cilium.k8s.policy.cluster=default                 fd00::e4   10.0.0.84    ready   
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=default                                          
	                                                            k8s:io.kubernetes.pod.namespace=default                                                  
	                                                            k8s:zgroup=testDSClient                                                                  
	 1430       Disabled           Disabled          1          k8s:cilium.io/ci-node=k8s2                                                       ready   
	                                                            reserved:host                                                                            
	 2545       Enabled            Disabled          776        k8s:io.cilium.k8s.policy.cluster=default                 fd00::ae   10.0.0.183   ready   
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=default                                          
	                                                            k8s:io.kubernetes.pod.namespace=default                                                  
	                                                            k8s:zgroup=testDS                                                                        
	 2586       Disabled           Disabled          30080      k8s:app=grafana                                          fd00::8d   10.0.0.201   ready   
	                                                            k8s:io.cilium.k8s.policy.cluster=default                                                 
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=default                                          
	                                                            k8s:io.kubernetes.pod.namespace=cilium-monitoring                                        
	 
Stderr:
 	 

cmd: kubectl exec -n kube-system cilium-l47dx -- cilium service list
Exitcode: 0 
Stdout:
 	 ID   Frontend                Service Type   Backend                   
	 1    10.96.0.1:443           ClusterIP      1 => 192.168.36.11:6443   
	 2    10.96.0.10:53           ClusterIP      1 => 10.0.0.102:53        
	 3    10.103.65.19:3000       ClusterIP      1 => 10.0.0.201:3000      
	 4    10.105.69.147:9090      ClusterIP      1 => 10.0.0.82:9090       
	 7    10.111.187.86:80        ClusterIP      1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 8    10.111.187.86:69        ClusterIP      1 => 10.0.0.183:69        
	                                             2 => 10.0.1.196:69        
	 9    10.105.204.100:10080    ClusterIP      1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 10   10.105.204.100:10069    ClusterIP      1 => 10.0.0.183:69        
	                                             2 => 10.0.1.196:69        
	 11   192.168.36.11:30985     NodePort       1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 13   0.0.0.0:30985           NodePort       1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 14   192.168.36.11:31059     NodePort       1 => 10.0.0.183:69        
	                                             2 => 10.0.1.196:69        
	 16   0.0.0.0:31059           NodePort       1 => 10.0.0.183:69        
	                                             2 => 10.0.1.196:69        
	 17   10.106.236.195:10080    ClusterIP      1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 18   10.106.236.195:10069    ClusterIP      1 => 10.0.0.183:69        
	                                             2 => 10.0.1.196:69        
	 19   192.168.36.11:31769     NodePort       1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 21   0.0.0.0:31769           NodePort       1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 22   0.0.0.0:31928           NodePort       1 => 10.0.0.183:69        
	                                             2 => 10.0.1.196:69        
	 23   192.168.36.11:31928     NodePort       1 => 10.0.0.183:69        
	                                             2 => 10.0.1.196:69        
	 25   10.102.239.102:10080    ClusterIP      1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 26   10.102.239.102:10069    ClusterIP      1 => 10.0.0.183:69        
	                                             2 => 10.0.1.196:69        
	 27   192.168.36.11:32355     NodePort       1 => 10.0.1.196:80        
	 28   192.168.36.11:32355/i   NodePort       1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 31   0.0.0.0:32355           NodePort       1 => 10.0.1.196:80        
	 32   0.0.0.0:32355/i         NodePort       1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 33   192.168.36.11:32011     NodePort       1 => 10.0.1.196:69        
	 34   192.168.36.11:32011/i   NodePort       1 => 10.0.0.183:69        
	                                             2 => 10.0.1.196:69        
	 37   0.0.0.0:32011           NodePort       1 => 10.0.1.196:69        
	 38   0.0.0.0:32011/i         NodePort       1 => 10.0.0.183:69        
	                                             2 => 10.0.1.196:69        
	 39   10.108.163.193:10069    ClusterIP      1 => 10.0.0.240:69        
	 40   10.108.163.193:10080    ClusterIP      1 => 10.0.0.240:80        
	 43   0.0.0.0:30308           NodePort                                 
	 44   0.0.0.0:30308/i         NodePort       1 => 10.0.0.240:69        
	 45   192.168.36.11:30308     NodePort                                 
	 46   192.168.36.11:30308/i   NodePort       1 => 10.0.0.240:69        
	 47   192.168.36.11:31008     NodePort                                 
	 48   192.168.36.11:31008/i   NodePort       1 => 10.0.0.240:80        
	 51   0.0.0.0:31008           NodePort                                 
	 52   0.0.0.0:31008/i         NodePort       1 => 10.0.0.240:80        
	 53   10.111.99.143:10080     ClusterIP      1 => 10.0.0.240:80        
	 54   10.111.99.143:10069     ClusterIP      1 => 10.0.0.240:69        
	 55   192.168.36.11:32286     NodePort       1 => 10.0.0.240:80        
	 57   0.0.0.0:32286           NodePort       1 => 10.0.0.240:80        
	 59   0.0.0.0:31692           NodePort       1 => 10.0.0.240:69        
	 60   192.168.36.11:31692     NodePort       1 => 10.0.0.240:69        
	 61   10.98.253.32:80         ClusterIP      1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 62   0.0.0.0:31199           NodePort       1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 63   192.168.36.11:31199     NodePort       1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 65   10.98.62.254:80         ClusterIP      1 => 10.0.0.240:80        
	 66   192.168.36.11:31163     NodePort                                 
	 67   192.168.36.11:31163/i   NodePort       1 => 10.0.0.240:80        
	 70   0.0.0.0:31163           NodePort                                 
	 71   0.0.0.0:31163/i         NodePort       1 => 10.0.0.240:80        
	 72   10.111.66.200:20080     ClusterIP      1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 73   10.111.66.200:20069     ClusterIP      1 => 10.0.0.183:69        
	                                             2 => 10.0.1.196:69        
	 74   192.0.2.233:20080       ExternalIPs    1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 75   192.0.2.233:20069       ExternalIPs    1 => 10.0.0.183:69        
	                                             2 => 10.0.1.196:69        
	 76   192.168.36.11:31511     NodePort       1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 78   0.0.0.0:31511           NodePort       1 => 10.0.0.183:80        
	                                             2 => 10.0.1.196:80        
	 79   0.0.0.0:30466           NodePort       1 => 10.0.0.183:69        
	                                             2 => 10.0.1.196:69        
	 80   192.168.36.11:30466     NodePort       1 => 10.0.0.183:69        
	                                             2 => 10.0.1.196:69        
	 
Stderr:
 	 

cmd: kubectl exec -n kube-system cilium-l47dx -- cilium endpoint list
Exitcode: 0 
Stdout:
 	 ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                       IPv6        IPv4         STATUS   
	            ENFORCEMENT        ENFORCEMENT                                                                                             
	 364        Enabled            Disabled          776        k8s:io.cilium.k8s.policy.cluster=default          fd00::1cf   10.0.1.196   ready   
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=default                                    
	                                                            k8s:io.kubernetes.pod.namespace=default                                            
	                                                            k8s:zgroup=testDS                                                                  
	 1959       Disabled           Disabled          1          k8s:cilium.io/ci-node=k8s1                                                 ready   
	                                                            k8s:node-role.kubernetes.io/master                                                 
	                                                            reserved:host                                                                      
	 2013       Disabled           Disabled          4          reserved:health                                   fd00::170   10.0.1.133   ready   
	 2517       Enabled            Disabled          9573       k8s:io.cilium.k8s.policy.cluster=default          fd00::185   10.0.1.15    ready   
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=default                                    
	                                                            k8s:io.kubernetes.pod.namespace=default                                            
	                                                            k8s:zgroup=testDSClient                                                            
	 
Stderr:
 	 

===================== Exiting AfterFailed =====================
11:44:36 STEP: Running AfterEach for block EntireTestsuite K8sPolicyTest Multi-node policy test
11:44:36 STEP: Cleaning up after the test
11:44:36 STEP: Running AfterEach for block EntireTestsuite K8sPolicyTest
11:44:36 STEP: Running AfterEach for block EntireTestsuite

[[ATTACHMENT|3ddb01ec_K8sPolicyTest_Multi-node_policy_test_validates_ingress_CIDR-dependent_L4_(FromCIDR_+_ToPorts)_connectivity_is_restored_after_importing_ingress_policy.zip]]
11:44:36 STEP: Running AfterAll block for EntireTestsuite K8sPolicyTest Multi-node policy test validates ingress CIDR-dependent L4 (FromCIDR + ToPorts)
11:44:37 STEP: Redeploying Cilium with default configuration
11:44:37 STEP: Installing Cilium
11:44:39 STEP: Waiting for Cilium to become ready
11:44:39 STEP: Cilium DaemonSet not ready yet: only 0 of 2 desired pods are ready
11:44:44 STEP: Cilium DaemonSet not ready yet: only 0 of 2 desired pods are ready
11:44:49 STEP: Cilium DaemonSet not ready yet: only 0 of 2 desired pods are ready
11:44:54 STEP: Cilium DaemonSet not ready yet: only 0 of 2 desired pods are ready
11:44:59 STEP: Cilium DaemonSet not ready yet: only 0 of 2 desired pods are ready
11:45:04 STEP: Cilium DaemonSet not ready yet: only 0 of 2 desired pods are ready
11:45:09 STEP: Cilium DaemonSet not ready yet: only 1 of 2 desired pods are ready
11:45:14 STEP: Number of ready Cilium pods: 2
11:45:14 STEP: Validating Cilium Installation
11:45:14 STEP: Performing Cilium controllers preflight check
11:45:14 STEP: Performing Cilium health check
11:45:14 STEP: Performing Cilium status preflight check
11:45:16 STEP: Performing Cilium service preflight check
11:45:16 STEP: Performing K8s service preflight check
11:45:17 STEP: Waiting for cilium-operator to be ready
11:45:17 STEP: WaitforPods(namespace="kube-system", filter="-l name=cilium-operator")
11:45:18 STEP: WaitforPods(namespace="kube-system", filter="-l name=cilium-operator") => <nil>

/cc @christarazi

@pchaigno pchaigno added area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me! labels Jul 20, 2020
@christarazi
Member

Thanks for filing. Yeah, it looks like one potential cause is either what #12559 fixed, or that cilium monitor takes longer than usual to start up, causing us to miss some of the policy verdict logs. However, given that it seems to always be off by 3, it's unlikely that cilium monitor starts up slowly by exactly the same amount of time in each flake instance of this test. I'll keep an eye out.

@gandro
Member

gandro commented Sep 1, 2020

Also observed again in #13025 (comment)

@stale

stale bot commented Nov 1, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Nov 1, 2020
@pchaigno pchaigno removed the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Nov 2, 2020
@borkmann borkmann added pinned These issues are not marked stale by our issue bot. and removed pinned These issues are not marked stale by our issue bot. labels Nov 5, 2020
@pchaigno pchaigno self-assigned this Dec 4, 2020
pchaigno added a commit that referenced this issue Dec 4, 2020
Our fromCIDR+toPorts tests have been flaking for a while. In particular,
the check on the number of policy verdicts found in the output of cilium
monitor sometimes fails. Those failures happen because we read cilium
monitor's output too soon, before it has a chance to drain the ring
buffer (see #12596 for a detailed analysis of the logs).

This commit fixes it by repeatedly checking the output against our
regular expression until it succeeds or times out, similarly to what is
already done in K8sDatapathConfig MonitorAggregation.

Fixes: #12596
Fixes: 94eb491 ("test: Add policy test for ingress FromCIDR+ToPorts")
Fixes: 74be0b2 ("test: fromCIDR+toPorts host policy")
Signed-off-by: Paul Chaignon <paul@cilium.io>
@pchaigno
Member Author

pchaigno commented Dec 4, 2020

#13840 added additional cilium monitor logs to help debug this flake and I was waiting for new failures to continue the investigation. We now have 4 new failures, so let the hunt begin!

Summary of the flake

Our test establishes x1 connections to a pod through a fromCIDR+toPorts policy. Those connections are allowed by the policy. At the same time, cilium monitor --related-to ... is running, and we count the number x2 of policy-verdict events matching the allowed connections. The test fails because x1 != x2.
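For reference, here is a minimal standalone sketch of the kind of counting involved; the regular expression and helper names are illustrative, not the actual code in test/k8sT/Policies.go. Each allowed connection from the outside node should leave exactly one matching policy-verdict event in the captured monitor output.

package main

import (
	"fmt"
	"regexp"
)

// Illustrative pattern only: a loose approximation of the policy-verdict
// lines that `cilium monitor` prints for allowed connections from the
// outside node (192.168.36.13) to the target pod (10.0.1.61:80).
var allowedVerdict = regexp.MustCompile(
	`Policy verdict log: .*action allow.*192\.168\.36\.13:\d+ -> 10\.0\.1\.61:80`)

// countAllowedVerdicts returns x2, the number of allowed connections that
// made it into the captured monitor output.
func countAllowedVerdicts(monitorOutput []byte) int {
	return len(allowedVerdict.FindAll(monitorOutput, -1))
}

func main() {
	// Fake one-line capture, just to exercise the counter.
	sample := []byte("Policy verdict log: ... action allow ... 192.168.36.13:55568 -> 10.0.1.61:80 tcp SYN\n")
	x1 := 41 // example value for the number of connections the test made
	x2 := countAllowedVerdicts(sample)
	fmt.Printf("monitor saw %d of %d allowed connections\n", x2, x1)
}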

Initial idea

To be exact, the test fails because x1 == x2 + 1 or 2. So the number of connections found in cilium monitor's output is always slightly below the expected number. Maybe we are not waiting long enough after cilium monitor starts, and we miss the first few packets?

Finding the missing connections

test-output.log in the test artifacts has Hubble logs for the whole duration of the test (hubble observe --since 4h -o json). So we can compare that against the new cilium monitor output to find the missing connections in the latter. We'll use the source ports as IDs for the connections:

# Type 4 is for TRACE_TO_LXC.
# The time filter is the time the connections started according to the test logs.
$ cat hubble.json | jq 'select(.IP.source=="192.168.36.13") | select(.IP.destination=="10.0.1.61")' | \
    jq 'select(.l4.TCP.flags.SYN==true) | select(.event_type.type==4)' | \
    jq 'select(.time>="2020-12-04T09:49").l4.TCP.source_port' > hubble-srcports.txt
$ grep -Po "(?<=192.168.36.13:)\d+(?= -> 10.0.1.61:80)" cilium-monitor.txt > cilium-monitor-srcports.txt
$ diff hubble-srcports.txt cilium-monitor-srcports.txt 
44,45d43
< 55568
< 55572

We just caught our two missing connections (:tada:); had they been captured, they would appear in cilium monitor's output after all the other connections.

Root cause

So it looks like, instead of starting the connections too quickly after starting cilium monitor, we're actually reading cilium monitor's output too quickly after making the last connections. cilium monitor likely doesn't have time to drain the perf ring buffer. The fact that we close cilium monitor even before reading its output probably doesn't help either.

We need to wait a bit before reading the connections from cilium monitor.

Solution

Surely, this isn't the only test relying on cilium monitor in this way? How are other tests implementing this?

Well, although there are several tests calling cilium monitor, I could find only one that actually tests the output, K8sDatapathConfig MonitorAggregation. That test repeatedly checks the output of cilium monitor until it finds the expected text or times out:

err := helpers.RepeatUntilTrueDefaultTimeout(func() bool {
	monitorOutput = monitorRes.CombineOutput().Bytes()
	return checkMonitorOutput(monitorOutput, egressPktCount, ingressPktCount)
})
Expect(err).To(BeNil(), "monitor aggregation did not result in correct number of TCP notifications\n%s", monitorOutput)

That's the fix #14286 implements.
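Applied to the fromCIDR+toPorts check, the fix amounts to something like the sketch below; apart from helpers.RepeatUntilTrueDefaultTimeout and the Gomega assertion quoted above, the names (including expectedConnections and the reuse of the illustrative allowedVerdict pattern from the earlier sketch) are approximations, not a verbatim copy of #14286. The idea is to keep re-reading the monitor output until the verdict count catches up or the default timeout expires, so a slow drain of the ring buffer no longer fails the test.

var monitorOutput []byte
err := helpers.RepeatUntilTrueDefaultTimeout(func() bool {
	// Re-read everything cilium monitor has written so far.
	monitorOutput = monitorRes.CombineOutput().Bytes()
	// Succeed once all expected allowed connections have shown up.
	return len(allowedVerdict.FindAll(monitorOutput, -1)) >= expectedConnections
})
Expect(err).To(BeNil(), "Monitor output does not show all traffic as allowed\n%s", monitorOutput)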

FAQ

Why doesn't it affect the previous test, which checks connections are denied with the same cilium monitor pattern?
My guess is that each connection takes a lot more time when denied. We are therefore less likely to read cilium monitor's output while connections are still sitting in the ring buffer.

Are there partial connections at the end of the monitor logs?
The first thing I checked was whether we had partial connections at the beginning or end of cilium monitor's output. If we start the connections too soon, we may end up with only the end of a connection; if we interrupt cilium monitor too soon, only its beginning.

On the 4 flakes I checked, that never happens: cilium monitor always starts with a TCP SYN for the first connection and ends with a TCP FIN+ACK followed by a TCP ACK for the last connection. I'm guessing there is some luck involved here, plus the fact that packets for the same connection tend to be slightly grouped together.

pchaigno added a commit that referenced this issue Dec 7, 2020
pchaigno added a commit that referenced this issue Dec 8, 2020
nebril pushed a commit that referenced this issue Dec 9, 2020
joestringer pushed a commit that referenced this issue Dec 15, 2020
joestringer pushed a commit that referenced this issue Dec 15, 2020
jibi pushed a commit that referenced this issue Dec 17, 2020
jibi pushed a commit that referenced this issue Dec 29, 2020
pchaigno added a commit that referenced this issue Jan 6, 2021