Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: K8sServicesTest Checks service across nodes Checks ClusterIP Connectivity: exit status 42 #16237

Closed
tklauser opened this issue May 20, 2021 · 5 comments
Assignees
Labels
area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me!
Projects

Comments

@tklauser
Copy link
Member

Stacktrace:
/home/jenkins/workspace/Cilium-PR-K8s-1.21-kernel-4.9/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:518
Request from testclient-hgkxc pod to service http://[fd03::44ac]:80 failed
Expected command: kubectl exec -n default testclient-hgkxc -- /bin/bash -c 'fails=""; id=$RANDOM; for i in $(seq 1 10); do if curl --path-as-is -s -D /dev/stderr --fail --connect-timeout 5 --max-time 20 http://[fd03::44ac]:80 -H "User-Agent: cilium-test-$id/$i"; then echo "Test round $id/$i exit code: $?"; else fails=$fails:$id/$i=$?; fi; done; if [ -n "$fails" ]; then echo "failed: $fails"; fi; cnt="${fails//[^:]}"; if [ ${#cnt} -gt 0 ]; then exit 42; fi' 
To succeed, but it failed:
Exitcode: 42 
Err: exit status 42
Stdout:
 	 
	 
	 Hostname: testds-6fklp
	 
	 Pod Information:
	 	-no pod information available-
	 
	 Server values:
	 	server_version=nginx: 1.13.3 - lua: 10008
	 
	 Request Information:
	 	client_address=fd02::1d3
	 	method=GET
	 	real path=/
	 	query=
	 	request_version=1.1
	 	request_scheme=http
	 	request_uri=http://[fd03::44ac]:80/
	 
	 Request Headers:
	 	accept=*/*
	 	host=[fd03::44ac]
	 	user-agent=cilium-test-19679/2
	 
	 Request Body:
	 	-no body in request-
	 
	 Test round 19679/2 exit code: 0
	 
	 
	 Hostname: testds-h5ws9
	 
	 Pod Information:
	 	-no pod information available-
	 
	 Server values:
	 	server_version=nginx: 1.13.3 - lua: 10008
	 
	 Request Information:
	 	client_address=fd02::1d3
	 	method=GET
	 	real path=/
	 	query=
	 	request_version=1.1
	 	request_scheme=http
	 	request_uri=http://[fd03::44ac]:80/
	 
	 Request Headers:
	 	accept=*/*
	 	host=[fd03::44ac]
	 	user-agent=cilium-test-19679/3
	 
	 Request Body:
	 	-no body in request-
	 
	 Test round 19679/3 exit code: 0
	 
	 
	 Hostname: testds-6fklp
	 
	 Pod Information:
	 	-no pod information available-
	 
	 Server values:
	 	server_version=nginx: 1.13.3 - lua: 10008
	 
	 Request Information:
	 	client_address=fd02::1d3
	 	method=GET
	 	real path=/
	 	query=
	 	request_version=1.1
	 	request_scheme=http
	 	request_uri=http://[fd03::44ac]:80/
	 
	 Request Headers:
	 	accept=*/*
	 	host=[fd03::44ac]
	 	user-agent=cilium-test-19679/4
	 
	 Request Body:
	 	-no body in request-
	 
	 Test round 19679/4 exit code: 0
	 
	 
	 Hostname: testds-6fklp
	 
	 Pod Information:
	 	-no pod information available-
	 
	 Server values:
	 	server_version=nginx: 1.13.3 - lua: 10008
	 
	 Request Information:
	 	client_address=fd02::1d3
	 	method=GET
	 	real path=/
	 	query=
	 	request_version=1.1
	 	request_scheme=http
	 	request_uri=http://[fd03::44ac]:80/
	 
	 Request Headers:
	 	accept=*/*
	 	host=[fd03::44ac]
	 	user-agent=cilium-test-19679/5
	 
	 Request Body:
	 	-no body in request-
	 
	 Test round 19679/5 exit code: 0
	 
	 
	 Hostname: testds-h5ws9
	 
	 Pod Information:
	 	-no pod information available-
	 
	 Server values:
	 	server_version=nginx: 1.13.3 - lua: 10008
	 
	 Request Information:
	 	client_address=fd02::1d3
	 	method=GET
	 	real path=/
	 	query=
	 	request_version=1.1
	 	request_scheme=http
	 	request_uri=http://[fd03::44ac]:80/
	 
	 Request Headers:
	 	accept=*/*
	 	host=[fd03::44ac]
	 	user-agent=cilium-test-19679/6
	 
	 Request Body:
	 	-no body in request-
	 
	 Test round 19679/6 exit code: 0
	 
	 
	 Hostname: testds-h5ws9
	 
	 Pod Information:
	 	-no pod information available-
	 
	 Server values:
	 	server_version=nginx: 1.13.3 - lua: 10008
	 
	 Request Information:
	 	client_address=fd02::1d3
	 	method=GET
	 	real path=/
	 	query=
	 	request_version=1.1
	 	request_scheme=http
	 	request_uri=http://[fd03::44ac]:80/
	 
	 Request Headers:
	 	accept=*/*
	 	host=[fd03::44ac]
	 	user-agent=cilium-test-19679/7
	 
	 Request Body:
	 	-no body in request-
	 
	 Test round 19679/7 exit code: 0
	 
	 
	 Hostname: testds-h5ws9
	 
	 Pod Information:
	 	-no pod information available-
	 
	 Server values:
	 	server_version=nginx: 1.13.3 - lua: 10008
	 
	 Request Information:
	 	client_address=fd02::1d3
	 	method=GET
	 	real path=/
	 	query=
	 	request_version=1.1
	 	request_scheme=http
	 	request_uri=http://[fd03::44ac]:80/
	 
	 Request Headers:
	 	accept=*/*
	 	host=[fd03::44ac]
	 	user-agent=cilium-test-19679/8
	 
	 Request Body:
	 	-no body in request-
	 
	 Test round 19679/8 exit code: 0
	 
	 
	 Hostname: testds-h5ws9
	 
	 Pod Information:
	 	-no pod information available-
	 
	 Server values:
	 	server_version=nginx: 1.13.3 - lua: 10008
	 
	 Request Information:
	 	client_address=fd02::1d3
	 	method=GET
	 	real path=/
	 	query=
	 	request_version=1.1
	 	request_scheme=http
	 	request_uri=http://[fd03::44ac]:80/
	 
	 Request Headers:
	 	accept=*/*
	 	host=[fd03::44ac]
	 	user-agent=cilium-test-19679/9
	 
	 Request Body:
	 	-no body in request-
	 
	 Test round 19679/9 exit code: 0
	 
	 
	 Hostname: testds-6fklp
	 
	 Pod Information:
	 	-no pod information available-
	 
	 Server values:
	 	server_version=nginx: 1.13.3 - lua: 10008
	 
	 Request Information:
	 	client_address=fd02::1d3
	 	method=GET
	 	real path=/
	 	query=
	 	request_version=1.1
	 	request_scheme=http
	 	request_uri=http://[fd03::44ac]:80/
	 
	 Request Headers:
	 	accept=*/*
	 	host=[fd03::44ac]
	 	user-agent=cilium-test-19679/10
	 
	 Request Body:
	 	-no body in request-
	 
	 Test round 19679/10 exit code: 0
	 failed: :19679/1=28
	 
Stderr:
 	 HTTP/1.1 200 OK
	 Date: Thu, 20 May 2021 07:16:35 GMT
	 Content-Type: text/plain
	 Transfer-Encoding: chunked
	 Connection: keep-alive
	 Server: echoserver
	 
	 HTTP/1.1 200 OK
	 Date: Thu, 20 May 2021 07:16:35 GMT
	 Content-Type: text/plain
	 Transfer-Encoding: chunked
	 Connection: keep-alive
	 Server: echoserver
	 
	 HTTP/1.1 200 OK
	 Date: Thu, 20 May 2021 07:16:35 GMT
	 Content-Type: text/plain
	 Transfer-Encoding: chunked
	 Connection: keep-alive
	 Server: echoserver
	 
	 HTTP/1.1 200 OK
	 Date: Thu, 20 May 2021 07:16:35 GMT
	 Content-Type: text/plain
	 Transfer-Encoding: chunked
	 Connection: keep-alive
	 Server: echoserver
	 
	 HTTP/1.1 200 OK
	 Date: Thu, 20 May 2021 07:16:35 GMT
	 Content-Type: text/plain
	 Transfer-Encoding: chunked
	 Connection: keep-alive
	 Server: echoserver
	 
	 HTTP/1.1 200 OK
	 Date: Thu, 20 May 2021 07:16:35 GMT
	 Content-Type: text/plain
	 Transfer-Encoding: chunked
	 Connection: keep-alive
	 Server: echoserver
	 
	 HTTP/1.1 200 OK
	 Date: Thu, 20 May 2021 07:16:35 GMT
	 Content-Type: text/plain
	 Transfer-Encoding: chunked
	 Connection: keep-alive
	 Server: echoserver
	 
	 HTTP/1.1 200 OK
	 Date: Thu, 20 May 2021 07:16:35 GMT
	 Content-Type: text/plain
	 Transfer-Encoding: chunked
	 Connection: keep-alive
	 Server: echoserver
	 
	 HTTP/1.1 200 OK
	 Date: Thu, 20 May 2021 07:16:35 GMT
	 Content-Type: text/plain
	 Transfer-Encoding: chunked
	 Connection: keep-alive
	 Server: echoserver
	 
	 command terminated with exit code 42
	 

/home/jenkins/workspace/Cilium-PR-K8s-1.21-kernel-4.9/src/github.com/cilium/cilium/test/k8sT/Services.go:1975

https://jenkins.cilium.io/job/Cilium-PR-K8s-1.21-kernel-4.9/501
ec0e028f_K8sServicesTest_Checks_service_across_nodes_Checks_ClusterIP_Connectivity.zip

Possibly related to: #15103, #13839, #13011, #12690.

@tklauser tklauser added area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me! labels May 20, 2021
@tklauser
Copy link
Member Author

Another occurence: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.21-kernel-4.9/485 (with PR #16175)

@gandro gandro self-assigned this May 20, 2021
@gandro
Copy link
Member

gandro commented May 20, 2021

I'm starting to look into this as part of my CI force duty. Since this issue here nicely links to all possibly related ones (thanks Tobias!), I'm assigning this one to me.

@b3a-dev b3a-dev added this to In Progress (Cilium) in CI Force May 24, 2021
@gandro
Copy link
Member

gandro commented May 25, 2021

In both sysdumps, the curl command fails because it receives an RST. The RST is likely triggered by a timeout due to the IP cache not updated yet.

Hubble query for the two backend IPs on k8s1:

$ cat cilium-8gbzn-hubble_observe.json | hubble observe --print-node-name --ip "fd02::18" --ip "fd02::114"
May 20 07:16:30.398 [k8s1]: [fd02::1d3]:59382 -> default/testds-6fklp:80 to-endpoint FORWARDED (TCP Flags: SYN)
May 20 07:16:30.398 [k8s1]: [fd02::1d3]:59382 <- default/testds-6fklp:80 to-stack FORWARDED (TCP Flags: SYN, ACK)
May 20 07:16:30.398 [k8s1]: default/testds-6fklp:80 <> [fd02::1d3]:59382 Stale or unroutable IP DROPPED (TCP Flags: SYN, ACK)
May 20 07:16:31.420 [k8s1]: default/testds-6fklp:80 <> [fd02::1d3]:59382 Stale or unroutable IP DROPPED (TCP Flags: SYN, ACK)
May 20 07:16:31.428 [k8s1]: default/testds-6fklp:80 <> [fd02::1d3]:59382 Stale or unroutable IP DROPPED (TCP Flags: SYN, ACK)
May 20 07:16:33.437 [k8s1]: default/testds-6fklp:80 <> [fd02::1d3]:59382 Stale or unroutable IP DROPPED (TCP Flags: SYN, ACK)
May 20 07:16:33.443 [k8s1]: default/testds-6fklp:80 <> [fd02::1d3]:59382 Stale or unroutable IP DROPPED (TCP Flags: SYN, ACK)
May 20 07:16:35.403 [k8s1]: default/testclient-hgkxc:59390 -> default/testds-6fklp:80 to-endpoint FORWARDED (TCP Flags: SYN)
May 20 07:16:35.403 [k8s1]: default/testds-6fklp:80 <> default/testclient-hgkxc:59390 to-overlay FORWARDED (TCP Flags: SYN, ACK)
May 20 07:16:35.404 [k8s1]: default/testclient-hgkxc:59390 -> default/testds-6fklp:80 to-endpoint FORWARDED (TCP Flags: ACK)
May 20 07:16:35.404 [k8s1]: default/testclient-hgkxc:59390 -> default/testds-6fklp:80 to-endpoint FORWARDED (TCP Flags: ACK, PSH)
May 20 07:16:35.404 [k8s1]: default/testds-6fklp:80 <> default/testclient-hgkxc:59390 to-overlay FORWARDED (TCP Flags: ACK, PSH)
May 20 07:16:35.404 [k8s1]: default/testclient-hgkxc:59390 -> default/testds-6fklp:80 to-endpoint FORWARDED (TCP Flags: ACK, FIN)
May 20 07:16:35.404 [k8s1]: default/testds-6fklp:80 <> default/testclient-hgkxc:59390 to-overlay FORWARDED (TCP Flags: ACK, FIN)
May 20 07:16:35.405 [k8s1]: default/testclient-hgkxc:59390 -> default/testds-6fklp:80 to-endpoint FORWARDED (TCP Flags: ACK)
[... more successful request with the same pattern and sport != 59382 ...]
May 20 07:16:35.434 [k8s1]: default/testclient-hgkxc:59406 -> default/testds-6fklp:80 to-endpoint FORWARDED (TCP Flags: SYN)
May 20 07:16:35.434 [k8s1]: default/testds-6fklp:80 <> default/testclient-hgkxc:59406 to-overlay FORWARDED (TCP Flags: SYN, ACK)
May 20 07:16:35.434 [k8s1]: default/testclient-hgkxc:59406 -> default/testds-6fklp:80 to-endpoint FORWARDED (TCP Flags: ACK)
May 20 07:16:35.434 [k8s1]: default/testclient-hgkxc:59406 -> default/testds-6fklp:80 to-endpoint FORWARDED (TCP Flags: ACK, PSH)
May 20 07:16:35.434 [k8s1]: default/testds-6fklp:80 <> default/testclient-hgkxc:59406 to-overlay FORWARDED (TCP Flags: ACK, PSH)
May 20 07:16:35.435 [k8s1]: default/testclient-hgkxc:59406 -> default/testds-6fklp:80 to-endpoint FORWARDED (TCP Flags: ACK, FIN)
May 20 07:16:35.435 [k8s1]: default/testds-6fklp:80 <> default/testclient-hgkxc:59406 to-overlay FORWARDED (TCP Flags: ACK, FIN)
May 20 07:16:35.435 [k8s1]: default/testclient-hgkxc:59406 -> default/testds-6fklp:80 to-endpoint FORWARDED (TCP Flags: ACK)
May 20 07:16:37.660 [k8s1]: default/testds-6fklp:80 <> default/testclient-hgkxc:59382 to-overlay FORWARDED (TCP Flags: SYN, ACK)
May 20 07:16:37.661 [k8s1]: default/testclient-hgkxc:59382 -> default/testds-6fklp:80 to-endpoint FORWARDED (TCP Flags: RST)

Last two packets as observed on k8s2

$ cat cilium-mvnqc-hubble_observe.json | hubble observe --print-node-name --ip "fd02::18" --ip "fd02::114"
[...]
May 20 07:16:37.715 [k8s2]: default/testclient-hgkxc:59382 <- default/testds-6fklp:80 to-endpoint FORWARDED (TCP Flags: SYN, ACK)
May 20 07:16:37.715 [k8s2]: default/testclient-hgkxc:59382 <> default/testds-6fklp:80 to-overlay FORWARDED (TCP Flags: RST)

Cilium logs for the source IP (fd02::1d3, i.e. default/testds-6fklp)

$ grep 'fd02::1d3' pod-kube-system-cilium-8gbzn-cilium-agent.log
4252:2021-05-20T07:16:30.398987047Z level=debug msg="stale identity observed" identity=21966 ipAddr="fd02::1d3" oldIdentity=2 subsys=hubble
4278:2021-05-20T07:16:34.049594404Z level=debug msg="Upserting IP into ipcache layer" identity="{21966 custom-resource false}" ipAddr="fd02::1d3" k8sNamespace=default k8sPodName=testclient-hgkxc key=0 namedPorts="map[]" subsys=ipcache
4279:2021-05-20T07:16:34.049600093Z level=debug msg="Daemon notified of IP-Identity cache state change" identity="{21966 custom-resource false}" ipAddr="{fd02::1d3 ffffffffffffffffffffffffffffffff}" modification=Upsert subsys=datapath-ipcache

In both sysdumps the RST packet is happening on the same port on which the first request (above: 59382) was sent.

The first request fails because the IPCache has not been updated yet at that point, so instead of sending the first response packet through the tunnel (to-overlay), it's sent to-stack, which then subsequently fails (because we're not running in direct routing mode).

In summary, timeline on k8s1:

  • 07:16:30.398: First response packet for port 59382 is dropped by kernel due to unroutable IP
  • 07:16:33.443: Last response packet for port 59382 is dropped by kernel due to unroutable IP
  • 07:16:34.049: IPCache Entry for fd02::1d3 is added
  • 07:16:35.403: First successful response (port 59390)
  • 07:16:35.434: Last successful response (port 59406)
  • 07:16:37.661: RST is observed on connection of failed request (port 59382)

It seems the RST is initiated on node k8s2, i.e. the sender. Likely due to curl dropping the connection up after 5 seconds?

@gandro
Copy link
Member

gandro commented May 25, 2021

Looking at the test code, I think it's pretty clear that the RST is caused by the first curl timing out due to the IPCache propagation delay. The first curl is reporting exit code 28, which is OPERATION_TIMEOUTED (note the 28 in 19679/1=28)

A simple fix would be to increase the timeout, but it looks like this is not the only CI test suffering from IPCache propagation delays.

Edit: From the logs, it looks like identity allocation for fd02::1d3 took ~8 seconds (endpoint is created 07:16:26, identity is allocated 07:16:34)

gandro added a commit to gandro/cilium that referenced this issue May 31, 2021
This increases the curl connection timeout from 5 to 15 seconds to avoid
issues with IPCache propagation delay. On Cilium master an 1.10, it
seems that IPCache updates in CI can take up to 4-8 seconds.

CI flakes likely caused by the increased IPCache propagation delay:

 - cilium#13839
 - cilium#14959
 - cilium#15103
 - cilium#16237

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro
Copy link
Member

gandro commented Jun 3, 2021

I assume this is should be fixed by #16381 now.

@gandro gandro closed this as completed Jun 3, 2021
CI Force automation moved this from In Progress (Cilium) to Fixed / Done Jun 3, 2021
@qmonnet qmonnet mentioned this issue Jun 3, 2021
23 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me!
Projects
No open projects
CI Force
  
Fixed / Done
Development

No branches or pull requests

3 participants