
CI: RuntimeFQDNPolicies Validates DNSSEC responses: Cannot connect to "dnssec" #16713

Closed
pchaigno opened this issue Jun 30, 2021 · 6 comments · Fixed by #16769
Labels: area/CI (Continuous Integration testing issue or flake), ci/flake (This is a known failure that occurs in the tree. Please investigate me!)

pchaigno commented Jun 30, 2021

https://jenkins.cilium.io/job/cilium-master-runtime-kernel-4.9/2235/testReport/(root)/Suite-runtime/RuntimeFQDNPolicies_Validates_DNSSEC_responses/
09fe2bc5_RuntimeFQDNPolicies_Validates_DNSSEC_responses.zip

Stacktrace

/home/jenkins/workspace/cilium-master-runtime-kernel-4.9/runtime-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:518
Cannot connect to "dnssec" when it should work
Expected command: sudo docker run --name dnssec --net cilium-net -l id.dnssec --dns=172.17.0.5 --rm docker.io/cilium/dnssec-client:v0.2 world1.dnssec.test 
To succeed, but it failed:
Exitcode: 1 
Err: Process exited with status 1
Stdout:
 	 2021-06-27 17:21:46,536 - root - INFO - Resolved 'world1.dnssec.test' to '172.18.0.2'
	 2021-06-27 17:21:46,539 - urllib3.connectionpool - DEBUG - Starting new HTTP connection (1): 172.18.0.2:80
	 2021-06-27 17:21:51,544 - root - ERROR - Exception HTTPConnectionPool(host='172.18.0.2', port=80): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7ffff6cc1280>, 'Connection to 172.18.0.2 timed out. (connect timeout=5)'))
	 
Stderr:
 	 

/home/jenkins/workspace/cilium-master-runtime-kernel-4.9/runtime-gopath/src/github.com/cilium/cilium/test/runtime/fqdn.go:837

Standard Output

Number of "context deadline exceeded" in logs: 0
Number of "level=error" in logs: 0
Number of "level=warning" in logs: 0
Number of "Cilium API handler panicked" in logs: 0
Number of "Goroutine took lock for more than" in logs: 3
No errors/warnings found in logs

Standard Error

17:21:42 STEP: Setting up policy: /home/vagrant/go/src/github.com/cilium/cilium/test/policy_a4b4ef62.json
17:21:42 STEP: Validate that allow target is working correctly
FAIL: Cannot connect to "dnssec" when it should work
Expected command: sudo docker run --name dnssec --net cilium-net -l id.dnssec --dns=172.17.0.5 --rm docker.io/cilium/dnssec-client:v0.2 world1.dnssec.test 
To succeed, but it failed:
Exitcode: 1 
Err: Process exited with status 1
Stdout:
 	 2021-06-27 17:21:46,536 - root - INFO - Resolved 'world1.dnssec.test' to '172.18.0.2'
	 2021-06-27 17:21:46,539 - urllib3.connectionpool - DEBUG - Starting new HTTP connection (1): 172.18.0.2:80
	 2021-06-27 17:21:51,544 - root - ERROR - Exception HTTPConnectionPool(host='172.18.0.2', port=80): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7ffff6cc1280>, 'Connection to 172.18.0.2 timed out. (connect timeout=5)'))
	 
Stderr:
 	 

=== Test Finished at 2021-06-27T17:21:51Z====
17:21:51 STEP: Running JustAfterEach block for EntireTestsuite RuntimeFQDNPolicies
===================== TEST FAILED =====================
17:21:51 STEP: Running AfterFailed block for EntireTestsuite RuntimeFQDNPolicies
10.15.254.116 app3
10.15.217.92 app2
10.15.231.87 app1
10.15.202.116 httpd3
10.15.72.183 httpd2
10.15.37.32 httpd1
172.17.0.5 bind
172.18.0.7 OutsideHttpd3
172.18.0.6 OutsideHttpd2
172.18.0.5 OutsideHttpd1
172.18.0.4 WorldHttpd3
172.18.0.3 WorldHttpd2
172.18.0.2 WorldHttpd1
172.17.0.4 cilium-etcd
172.17.0.3 cilium-consul
172.17.0.2 registry

cmd: sudo cilium endpoint list
Exitcode: 0 
Stdout:
 	 ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])   IPv6                 IPv4            STATUS   
	            ENFORCEMENT        ENFORCEMENT                                                                                     
	 323        Disabled           Disabled          1          reserved:host                                                      ready   
	 619        Disabled           Disabled          37771      container:app=test            f00d::a0f:0:0:7d00   10.15.254.116   ready   
	                                                            container:id.app3                                                          
	 709        Disabled           Disabled          30456      container:app=test            f00d::a0f:0:0:b165   10.15.231.87    ready   
	                                                            container:id.app1                                                          
	 850        Disabled           Disabled          2394       container:app=test            f00d::a0f:0:0:11b5   10.15.217.92    ready   
	                                                            container:id.app2                                                          
	 1780       Disabled           Disabled          5878       container:app=test            f00d::a0f:0:0:7c72   10.15.202.116   ready   
	                                                            container:id.httpd3                                                        
	                                                            container:id.service1                                                      
	 1873       Disabled           Disabled          4          reserved:health               f00d::a0f:0:0:43fb   10.15.173.219   ready   
	 2182       Disabled           Disabled          49831      container:app=test            f00d::a0f:0:0:e67c   10.15.37.32     ready   
	                                                            container:id.httpd1                                                        
	                                                            container:id.service1                                                      
	 2829       Disabled           Disabled          19286      container:app=test            f00d::a0f:0:0:cc60   10.15.72.183    ready   
	                                                            container:id.httpd2                                                        
	                                                            container:id.service1                                                      
	 
Stderr:
 	 

cmd: sudo cilium policy get
Exitcode: 0 
Stdout:
 	 [
	   {
	     "endpointSelector": {
	       "matchLabels": {
	         "container:id.dnssec": ""
	       }
	     },
	     "egress": [
	       {
	         "toPorts": [
	           {
	             "ports": [
	               {
	                 "port": "53",
	                 "protocol": "ANY"
	               }
	             ],
	             "rules": {
	               "dns": [
	                 {
	                   "matchPattern": "world1.dnssec.test"
	                 }
	               ]
	             }
	           }
	         ]
	       },
	       {
	         "toFQDNs": [
	           {
	             "matchPattern": "world1.dnssec.test"
	           }
	         ]
	       }
	     ],
	     "labels": [
	       {
	         "key": "FQDN test - DNSSEC domain",
	         "source": ""
	       }
	     ]
	   }
	 ]
	 Revision: 34
	 
Stderr:
 	 

===================== Exiting AfterFailed =====================
17:22:01 STEP: Running AfterEach for block EntireTestsuite RuntimeFQDNPolicies
17:22:01 STEP: Running AfterEach for block EntireTestsuite
pchaigno added the area/CI and ci/flake labels on Jun 30, 2021
pchaigno commented Jun 30, 2021

In the artifact, we can see the container keeps trying to send SYNs, but they are denied due to the lack of an allowing policy rule:

[2021-06-27 17:21:46] Policy verdict log: flow 0xbfb7e741 local EP ID 3554, remote ID 16777217, proto 6, egress, action deny, match none, 10.15.38.179:51762 -> 172.18.0.2:80 tcp SYN
[2021-06-27 17:21:47] Policy verdict log: flow 0x6e9afa39 local EP ID 3554, remote ID 16777217, proto 6, egress, action deny, match none, 10.15.38.179:51762 -> 172.18.0.2:80 tcp SYN
[2021-06-27 17:21:49] Policy verdict log: flow 0xc62ead22 local EP ID 3554, remote ID 16777217, proto 6, egress, action deny, match none, 10.15.38.179:51762 -> 172.18.0.2:80 tcp SYN

Given the above, we know it's not the DNS resolution that is failing but the subsequent TCP connection. So the DNS<->IP association should be in the FQDN cache at this point, and we should have a toCIDR entry in the policy map.

The remote identity looks okay. We can't check the BPF map contents because endpoint 3554 (the source endpoint, since the above is an egress policy verdict) was deleted as soon as the command ended (the container is started with docker run --rm).

However, we can see that the requested domain (world1.dnssec.test) is missing from the FQDN cache (cilium-fqdn-cache-list.md). I'm not sure whether that is expected because the pod was deleted, or whether the entry should remain even after the pod is deleted.
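
For a future occurrence, a minimal sketch of the checks that would confirm this reasoning, assuming the container (and hence the endpoint) is still alive when they run; output formats may vary across Cilium versions:

  # Dump the agent's FQDN cache; world1.dnssec.test should map to 172.18.0.2
  sudo cilium fqdn cache list
  # Dump the BPF policy map of the source endpoint (3554 in the verdict logs above);
  # a toCIDR-derived entry for the resolved IP's identity should be present
  sudo cilium bpf policy get 3554
  # List the current selectors and the identities they match
  sudo cilium policy selectors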

Maybe it would be worth removing --rm temporarily to continue debugging this... That wouldn't change anything for log collection, as the Cilium endpoint is deleted once the container stops running anyway.
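
Concretely, that would mean running the reproducer from the failure above with just the --rm flag dropped (a sketch; everything else is unchanged), then cleaning up the container by hand:

  # Same reproducer as above, minus --rm, so the stopped container can be inspected
  sudo docker run --name dnssec --net cilium-net -l id.dnssec --dns=172.17.0.5 docker.io/cilium/dnssec-client:v0.2 world1.dnssec.test
  # Once done debugging, remove the container manually
  sudo docker rm dnssec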

pchaigno commented Jul 2, 2021

Please post here if you hit this again. The next failures should have more information thanks to #16748.

pchaigno commented Jul 2, 2021

This flake happened for the first time on v1.9 with backport PR #16744, which backports #16391. So maybe that last PR introduced this flake?

pchaigno commented Jul 2, 2021

According to DataStudio, this flake started happening on June 24th:

So #16391, which was merged on June 29th, is not to blame here.

https://github.com/cilium/cilium/pulls?q=is%3Apr+is%3Aclosed+merged%3A2021-06-22..2021-06-24+-label%3Akind%2Fbackports+-label%3Aarea%2Fdocumentation lists the PRs merged shortly before June 24th (excluding backports and documentation changes). Of these, 4 have been backported to v1.9.

#16529 is the main suspect here given it touched locking mechanisms in the policy engine. /cc @jrajahalme @aanm

aanm commented Jul 3, 2021

Yep, I also believe it's being caused by #16529. I'll revert it and open a PR without it. Edit: the PR is #16769.
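
For reference, a minimal sketch of what reverting a merged PR looks like; the merge commit hash below is purely hypothetical:

  # Revert the merge commit of #16529 (hash is illustrative), keeping mainline parent 1
  git revert -m 1 0123abcd
  # Push the revert to a branch and open a PR from it
  git push origin HEAD:revert-16529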
