Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: ConformanceAKS: client-egress-to-echo-expression-deny: command terminated with exit code 28 #22529

Closed
pchaigno opened this issue Dec 3, 2022 · 3 comments
Assignees
Labels
area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me!

Comments

@pchaigno
Copy link
Member

pchaigno commented Dec 3, 2022

From: https://github.com/cilium/cilium/actions/runs/3609902939/jobs/6083378949
cilium-sysdumps.zip

[.] Action [client-egress-to-echo-expression-deny/pod-to-pod/curl-0: cilium-test/client-784c67ffc4-29bsq (10.0.1.105) -> cilium-test/echo-other-node-f7549999-5659v (10.0.0.172:8080)]
🐛 Executing command [curl -vvv -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --connect-timeout 5 --output /dev/null http://10.0.0.172:8080]
🐛 command "curl -vvv -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --connect-timeout 5 --output /dev/null http://10.0.0.172:8080/" failed as expected: command terminated with exit code 28
🐛 Expecting egress drops for Action curl-1: Drop
[.] Action [client-egress-to-echo-expression-deny/pod-to-pod/curl-1: cilium-test/client-784c67ffc4-29bsq (10.0.1.105) -> cilium-test/echo-same-node-dc6b7fd9f-89p9w (10.0.1.163:8080)]
🐛 Executing command [curl -vvv -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --connect-timeout 5 --output /dev/null http://10.0.1.163:8080]
🐛 command "curl -vvv -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --connect-timeout 5 --output /dev/null http://10.0.1.163:8080/" failed as expected: command terminated with exit code 28
[-] Scenario [client-egress-to-echo-expression-deny/pod-to-pod]
[.] Action [client-egress-to-echo-expression-deny/pod-to-pod/curl-0: cilium-test/client2-67754cb6fb-mzmb2 (10.0.1.204) -> cilium-test/echo-same-node-dc6b7fd9f-89p9w (10.0.1.163:8080)]
🐛 Executing command [curl -vvv -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --connect-timeout 5 --output /dev/null http://10.0.1.163:8080]
[.] Action [client-egress-to-echo-expression-deny/pod-to-pod/curl-1: cilium-test/client2-67754cb6fb-mzmb2 (10.0.1.204) -> cilium-test/echo-other-node-f7549999-5659v (10.0.0.172:8080)]
🐛 Executing command [curl -vvv -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --connect-timeout 5 --output /dev/null http://10.0.0.172:8080]
❌ command "curl -vvv -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --connect-timeout 5 --output /dev/null http://10.0.0.172:8080/" failed: command terminated with exit code 28
@pchaigno pchaigno added area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me! labels Dec 3, 2022
@pchaigno pchaigno self-assigned this Dec 3, 2022
@pchaigno
Copy link
Member Author

pchaigno commented Dec 3, 2022

Tracing the Connection

Client side:

$ cat hubble-flows-cilium-rhxvg-20221203-195510.json | ~/hubble/hubble observe --port 50570 --numeric
Dec  3 19:55:05.766: 10.0.1.204:50570 (ID:18091) <> 10.0.0.172:8080 (ID:22565) from-endpoint FORWARDED (TCP Flags: SYN)
Dec  3 19:55:05.766: 10.0.1.204:50570 (ID:18091) -> 10.0.0.172:8080 (ID:22565) policy-verdict:L3-Only EGRESS ALLOWED (TCP Flags: SYN)
Dec  3 19:55:05.766: 10.0.1.204:50570 (ID:18091) -> 10.0.0.172:8080 (ID:22565) to-stack FORWARDED (TCP Flags: SYN)
Dec  3 19:55:06.766: 10.0.1.204:50570 (ID:18091) <> 10.0.0.172:8080 (ID:22565) from-endpoint FORWARDED (TCP Flags: SYN)
Dec  3 19:55:06.766: 10.0.1.204:50570 (ID:18091) -> 10.0.0.172:8080 (ID:22565) to-stack FORWARDED (TCP Flags: SYN)
Dec  3 19:55:08.782: 10.0.1.204:50570 (ID:18091) <> 10.0.0.172:8080 (ID:22565) from-endpoint FORWARDED (TCP Flags: SYN)
Dec  3 19:55:08.782: 10.0.1.204:50570 (ID:18091) -> 10.0.0.172:8080 (ID:22565) to-stack FORWARDED (TCP Flags: SYN)

SYN+ACK is never received.

Server side:

$ cat hubble-flows-cilium-98mnw-20221203-195510.json | ~/hubble/hubble observe --port 50570 --numeric
Dec  3 19:55:05.767: 10.0.1.204:50570 (ID:18091) <> 10.0.0.172:8080 (ID:22565) from-stack FORWARDED (TCP Flags: SYN)
Dec  3 19:55:05.767: 10.0.1.204:50570 (ID:18091) -> 10.0.0.172:8080 (ID:22565) policy-verdict:L3-Only INGRESS ALLOWED (TCP Flags: SYN)
Dec  3 19:55:05.767: 10.0.1.204:50570 (ID:18091) -> 10.0.0.172:8080 (ID:22565) to-endpoint FORWARDED (TCP Flags: SYN)
Dec  3 19:55:05.767: 10.0.0.172:8080 (ID:22565) <> 10.0.1.204:50570 (ID:18091) from-endpoint FORWARDED (TCP Flags: SYN, ACK)
Dec  3 19:55:06.768: 10.0.1.204:50570 (ID:18091) <> 10.0.0.172:8080 (ID:22565) from-stack FORWARDED (TCP Flags: SYN)
Dec  3 19:55:06.768: 10.0.1.204:50570 (ID:18091) -> 10.0.0.172:8080 (ID:22565) to-endpoint FORWARDED (TCP Flags: SYN)
Dec  3 19:55:06.768: 10.0.0.172:8080 (ID:22565) <> 10.0.1.204:50570 (ID:18091) from-endpoint FORWARDED (TCP Flags: SYN, ACK)
Dec  3 19:55:07.771: 10.0.0.172:8080 (ID:22565) <> 10.0.1.204:50570 (ID:18091) from-endpoint FORWARDED (TCP Flags: SYN, ACK)
Dec  3 19:55:08.783: 10.0.1.204:50570 (ID:18091) <> 10.0.0.172:8080 (ID:22565) from-stack FORWARDED (TCP Flags: SYN)
Dec  3 19:55:08.783: 10.0.1.204:50570 (ID:18091) -> 10.0.0.172:8080 (ID:22565) to-endpoint FORWARDED (TCP Flags: SYN)
Dec  3 19:55:08.783: 10.0.0.172:8080 (ID:22565) <> 10.0.1.204:50570 (ID:18091) from-endpoint FORWARDED (TCP Flags: SYN, ACK)
Dec  3 19:55:10.811: 10.0.0.172:8080 (ID:22565) <> 10.0.1.204:50570 (ID:18091) from-endpoint FORWARDED (TCP Flags: SYN, ACK)
Dec  3 19:55:14.875: 10.0.0.172:8080 (ID:22565) <> 10.0.1.204:50570 (ID:18091) from-endpoint FORWARDED (TCP Flags: SYN, ACK)

SYN+ACK is sent but seems it never leaves the first BPF program as we don't have a to-xxx after the from-endpoint. Most likely reason is that we are missing a packet trace event somewhere in the code.

Looking for Missing Packet Traces

Looking at the code, I found two possibles places where we could be dropping/forwarding this packet without emitting a packet trace event:

Checking the second first as it seems more likely to happen (forwarding case) than the first (vs. corner-case error cases).

Redirect to the Proxy

The second missing packet trace event corresponds to a case where we are sending a packet from the container to our proxy based on information from the CT map. We don't have a packet trace event to confirm, but we can check the CT map content, same as the code does:

$ git grep 50570 cilium-bugtool-cilium-98mnw-20221203-195510/cmd/cilium-bpf-ct-list-global.md 
cilium-bugtool-cilium-98mnw-20221203-195510/cmd/cilium-bpf-ct-list-global.md:TCP IN 10.0.1.204:50570 -> 10.0.0.172:8080 expires=23291 RxPackets=9 RxBytes=711 RxFlagsSeen=0x1b LastRxReport=1681 TxPackets=6 TxBytes=444 TxFlagsSeen=0x12 LastTxReport=1687 Flags=0x0050 [ SeenNonSyn ProxyRedirect ] RevNAT=0 SourceSecurityID=18091 IfIndex=0

And indeed, we have a CT entry flagged with ProxyRedirect. Dumping the CNPs and the cilium policy get output (all from sysdump obviously, you think I'm crazy enough to attempt to reproduce a flake before I have its root cause?!) confirms that there are no L7 policies (DNS included).

We are most likely hitting #17459 once again.

Root Cause

#17459 should have been strongly mitigated (though not fixed) by cilium/cilium-cli#558. But that reordering is a bit fragile because we don't have a good way to enforce it. So let's check that someone didn't partially undo cilium/cilium-cli#558 since then.

And indeed! cilium/cilium-cli@8c772b7 adds a bunch of connectivity tests with L7 redirects before tests without L7 redirects, partially undoing cilium/cilium-cli#558.

pchaigno added a commit to cilium/cilium-cli that referenced this issue Dec 3, 2022
To mitigate the effects of bug
cilium/cilium#17459, commit 7f84818 ("test:
Run tests with UDP proxy redirections last") reordered the tests such
that those with proxy redirections are executed last.

Commit 8c772b7 ("connectivity: Add policy test for name ports")
however partially undid that, causing a new flake in CI. This commit
moves those new L7 ingress tests to the end of the connectivity tests.

Related: cilium/cilium#22529
Fixes: 8c772b7 ("connectivity: Add policy test for name ports")
Signed-off-by: Paul Chaignon <paul@cilium.io>
tklauser pushed a commit to cilium/cilium-cli that referenced this issue Dec 5, 2022
To mitigate the effects of bug
cilium/cilium#17459, commit 7f84818 ("test:
Run tests with UDP proxy redirections last") reordered the tests such
that those with proxy redirections are executed last.

Commit 8c772b7 ("connectivity: Add policy test for name ports")
however partially undid that, causing a new flake in CI. This commit
moves those new L7 ingress tests to the end of the connectivity tests.

Related: cilium/cilium#22529
Fixes: 8c772b7 ("connectivity: Add policy test for name ports")
Signed-off-by: Paul Chaignon <paul@cilium.io>
@pchaigno
Copy link
Member Author

This flake should be fixed by cilium/cilium-cli#1261 once we update to CLI v1.12.12 (not yet released).

@pchaigno
Copy link
Member Author

This was fixed with the new CLI release in #23030.

michi-covalent pushed a commit to michi-covalent/cilium that referenced this issue May 30, 2023
To mitigate the effects of bug
cilium#17459, commit 7f84818 ("test:
Run tests with UDP proxy redirections last") reordered the tests such
that those with proxy redirections are executed last.

Commit 8c772b76d ("connectivity: Add policy test for name ports")
however partially undid that, causing a new flake in CI. This commit
moves those new L7 ingress tests to the end of the connectivity tests.

Related: cilium#22529
Fixes: 8c772b76d ("connectivity: Add policy test for name ports")
Signed-off-by: Paul Chaignon <paul@cilium.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me!
Projects
None yet
Development

No branches or pull requests

1 participant