-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CI: ConformanceAKS: curl succeeded while it should have failed due to incorrect exit code #22162
Comments
Some initial observations: The failing test is:
So curl pod -> pod on the same node with a deny all ingress policy: apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: "all-ingress-deny"
spec:
endpointSelector: {}
ingress:
- {} is being allowed when it should be denied. The 2 pods are:
both running on
Looking at the hubble flows, I can't see anything for that connection:
While looking at the policy map for the
So as Paul suggested we might be looking at a race between policies and requests 🤔 |
BTW, looking at the failed results, we see that not always the same individual tests are failing. For example, in this one the first failure is echo-ingress-l7 whereas in this one the first failure is all-egress-deny. Anyway, yes, this very much looks like a race. The question is why does our code think the policy should be in place when it's not fully there. |
Just pushed a change to the CLI to capture sysdumps right after a test fails: cilium/cilium-cli#1228, hopefully that will give us some better visibility into what's going on 🤞 |
Latest discovery from Jibi on this is that it seems to be a similar issue to #15724 where |
Added some more debug output to curl and reran a bunch of times (in parallel because I don't have forever :p). The expected output in case of a timeout looks like:
The output we get in our case is:
It's as if the curl command had been interrupted before we reached the timeout (despite what the timestamps suggest). For that reason, I'm starting to suspect it's a bug in the way we run curl rather than a bug in curl. Taking all this into account, it would be good to get agent/k8s/Golang eyes on https://github.com/cilium/cilium-cli/blob/1a823dc0a6afca645de32fbd057fff323a329f01/k8s/exec.go#L32 to see if something could explain the above. |
Depending on api version and kubectl version, exit codes from kubectl exec may not be forwarded. That being said, the functionality for forwarding exit codes isn’t very new, I forget what version that was added. Maybe we can get verbose kubectl output (-v=10), it should have json output with the exec response code. |
Not really relevant to the issue, but the timestamps stem from logging in cilium-cli, not from when curl was executed. I think for that we'd need to add |
There's been a similar report for the exact same logic we have in For our case we can't just go for that as we need a way to interrupt the execution of the command after it times out, so I'm giving it a try to this new StreamWithContext method (needs to be backported to our own |
Hopefully, our own fork shouldn't be needed anymore: #22547 Once that's merged in cilium/cilium we'd only need to update the vendored copy to a more recent client-go version providing |
Doesn't look like it's working 😞 diff --git a/k8s/exec.go b/k8s/exec.go
index 2b779686..595f55cd 100644
--- a/k8s/exec.go
+++ b/k8s/exec.go
@@ -12,8 +12,6 @@ import (
corev1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/runtime"
"k8s.io/client-go/tools/remotecommand"
-
- "github.com/cilium/cilium-cli/internal/utils"
)
type ExecResult struct {
@@ -42,7 +40,7 @@ func (c *Client) execInPodWithWriters(ctx context.Context, p ExecParameters, std
req.VersionedParams(&corev1.PodExecOptions{
Command: p.Command,
Container: p.Container,
- Stdin: p.TTY,
+ Stdin: false,
Stdout: true,
Stderr: true,
TTY: p.TTY,
@@ -53,20 +51,9 @@ func (c *Client) execInPodWithWriters(ctx context.Context, p ExecParameters, std
return fmt.Errorf("error while creating executor: %w", err)
}
- var stdin io.ReadCloser
- if p.TTY {
- // CtrlCReader sends Ctrl-C/D sequence if context is cancelled
- stdin = utils.NewCtrlCReader(ctx)
- // Graceful close of stdin once we are done, no Ctrl-C is sent
- // if execution finishes before the context expires.
- defer stdin.Close()
- }
-
- err = exec.Stream(remotecommand.StreamOptions{
- Stdin: stdin,
+ err = exec.StreamWithContext(ctx, remotecommand.StreamOptions{
Stdout: stdout,
Stderr: stderr,
- Tty: p.TTY,
}) Failing test workflow: https://github.com/cilium/cilium/actions/runs/3619110325/jobs/6100967893 Usual error:
And traffic being dropped:
|
It is not as if someone else had already checked... |
The example above actually had a different root cause, explained at cilium/cilium-cli#1260. I've sent cilium/cilium-cli#1541 to try and fix that. Let's take another occurrence that looks like the present issue: https://github.com/cilium/cilium/actions/runs/4806075784 |
When running the connectivity tests in AKS, we sometimes get interrupted commands that don't have any output [1]. Unfortunately, those commands then exit without any error and are therefore considered successful. We think this is caused by connectivity blips between Kubernetes components. This commit adds a check for those inconclusive results. If we see a seemingly successful command with no output, we retry it until we get something conclusive. This works because all our test commands (curl, ping, nslookup) dump something to stdout. 1 - cilium/cilium#22162 Signed-off-by: Paul Chaignon <paul@cilium.io>
When running the connectivity tests in AKS, we sometimes get interrupted commands that don't have any output [1]. Unfortunately, those commands then exit without any error and are therefore considered successful. We think this is caused by connectivity blips between Kubernetes components. This commit adds a check for those inconclusive results. If we see a seemingly successful command with no output, we retry it until we get something conclusive. This works because all our test commands (curl, ping, nslookup) dump something to stdout. 1 - cilium/cilium#22162 Signed-off-by: Paul Chaignon <paul@cilium.io>
When running the connectivity tests in AKS, we sometimes get interrupted commands that don't have any output [1]. Unfortunately, those commands then exit without any error and are therefore considered successful. We think this is caused by connectivity blips between Kubernetes components. This commit adds a check for those inconclusive results. If we see a seemingly successful command with no output, we retry it until we get something conclusive. This works because all our test commands (curl, ping, nslookup) dump something to stdout. 1 - cilium/cilium#22162 Signed-off-by: Paul Chaignon <paul@cilium.io>
When running the connectivity tests in AKS, we sometimes get interrupted commands that don't have any output [1]. Unfortunately, those commands then exit without any error and are therefore considered successful. We think this is caused by connectivity blips between Kubernetes components. This commit adds a check for those inconclusive results. If we see a seemingly successful command with no output, we retry it until we get something conclusive. This works because all our test commands (curl, ping, nslookup) dump something to stdout. 1 - cilium/cilium#22162 Signed-off-by: Paul Chaignon <paul@cilium.io>
When running the connectivity tests in AKS, we sometimes get interrupted commands that don't have any output [1]. Unfortunately, those commands then exit without any error and are therefore considered successful. We think this is caused by connectivity blips between Kubernetes components. This commit adds a check for those inconclusive results. If we see a seemingly successful command with no output, we retry it until we get something conclusive. This works because all our test commands (curl, ping, nslookup) dump something to stdout. 1 - cilium#22162 Signed-off-by: Paul Chaignon <paul@cilium.io>
The workaround for this (cf. cilium/cilium-cli#1543) is used in Cilium's CI so, if anyone hits this flake again, it would be good to know. |
Hit this issue with AKS for Isovalent Enterprise for Cilium in Azure Marketplace. Let me know if I need to open a different issue for this, have the sysdump handy as well. ❌ 1/41 tests failed (1/282 actions), 14 tests skipped, 0 scenarios skipped: |
@amitmavgupta Did you have the error message |
@pchaigno Hi Paul- The test should have failed. CLI version- |
I think I also hit this here:
ping did output something, but it didn't show any successful pings, meaning it likely didn't succeed. PR #26662 against |
Ah, my "fix" doesn't work for ping because there is some output at the beginning even when it hangs. Maybe we can change the ping command we use to remove that first line:
|
When running the connectivity tests in AKS, we sometimes get interrupted commands that don't have any output [1]. Therefore, successful command terminations with an empty output are treated as inconclusive that result in a retry [2]. Unfortunately, the ping command might get stuck by only outputting its header line to the output. `PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.` This commit adds an additional check for the ping commands. If we see an output only containing the header, we treat it as inconclusive result. 1 - cilium/cilium#22162 2 - cilium#1543 Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
When running the connectivity tests in AKS, we sometimes get interrupted commands that don't have any output [1]. Therefore, successful command terminations with an empty output are treated as inconclusive that result in a retry [2]. Unfortunately, the ping command might get stuck by only outputting its header line to the output. `PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.` This commit adds an additional check for the ping commands. If we see an output only containing the header, we treat it as inconclusive result. 1 - cilium/cilium#22162 2 - cilium#1543 Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
When running the connectivity tests in AKS, we sometimes get interrupted commands that don't have any output [1]. Therefore, successful command terminations with an empty output are treated as inconclusive that result in a retry [2]. Unfortunately, the ping command might get stuck by only outputting its header line to the output. `PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.` This commit adds an additional check for the ping commands. If we see an output only containing the header, we treat it as inconclusive result. 1 - cilium/cilium#22162 2 - cilium#1543 Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
When running the connectivity tests in AKS, we sometimes get interrupted commands that don't have any output [1]. Therefore, successful command terminations with an empty output are treated as inconclusive that result in a retry [2]. Unfortunately, the ping command might get stuck by only outputting its header line to the output. `PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.` This commit adds an additional check for the ping commands. If we see an output only containing the header, we treat it as inconclusive result. 1 - cilium/cilium#22162 2 - #1543 Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
When running the connectivity tests in AKS, we sometimes get interrupted commands that don't have any output [1]. Therefore, successful command terminations with an empty output are treated as inconclusive that result in a retry [2]. Unfortunately, the ping command might get stuck by only outputting its header line to the output. `PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.` This commit adds an additional check for the ping commands. If we see an output only containing the header, we treat it as inconclusive result. 1 - cilium/cilium#22162 2 - cilium#1543 Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
No reports since my fix and Marco's followup. Let's close. |
Several curl tests are failing because the commands succeed when we expected them to fail. For example:
Examples:
https://github.com/cilium/cilium/actions/runs/3461307458/jobs/5779401491
https://github.com/cilium/cilium/actions/runs/3457550315/jobs/5771080426
https://github.com/cilium/cilium/actions/runs/3456343637/jobs/5769011681
Sysdumps for the first two examples above:
cilium-sysdump-out.zip.zip
cilium-sysdump-out.zip(1).zip
The text was updated successfully, but these errors were encountered: