
CI: ConformanceAKS: curl succeeded while it should have failed due to incorrect exit code #22162

Closed

pchaigno opened this issue Nov 14, 2022 · 29 comments

Labels: area/CI · area/cli · ci/flake · integration/cloud · sig/agent

@pchaigno (Member)

Several curl tests are failing because the commands succeed when we expected them to fail. For example:

[=] Test [all-ingress-deny]
..
  ℹ️  📜 Applying CiliumNetworkPolicy 'all-ingress-deny' to namespace 'cilium-test'..
  [-] Scenario [all-ingress-deny/pod-to-pod]
  [.] Action [all-ingress-deny/pod-to-pod/curl-0: cilium-test/client2-bb57d9bb4-b5pbz (10.0.1.204) -> cilium-test/echo-other-node-694dd79b46-qzslj (10.0.0.220:8080)]
  [.] Action [all-ingress-deny/pod-to-pod/curl-1: cilium-test/client2-bb57d9bb4-b5pbz (10.0.1.204) -> cilium-test/echo-same-node-865dcc9b58-nz2tf (10.0.1.86:8080)]
  ❌ command "curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --connect-timeout 5 --output /dev/null http://10.0.1.86:8080/" succeeded while it should have failed: 
  ℹ️  curl output:

Examples:
https://github.com/cilium/cilium/actions/runs/3461307458/jobs/5779401491
https://github.com/cilium/cilium/actions/runs/3457550315/jobs/5771080426
https://github.com/cilium/cilium/actions/runs/3456343637/jobs/5769011681

Sysdumps for the first two examples above:
cilium-sysdump-out.zip.zip
cilium-sysdump-out.zip(1).zip

pchaigno added the area/CI, integration/cloud, and ci/flake labels on Nov 14, 2022
jibi self-assigned this on Nov 17, 2022
@jibi (Member) commented Nov 17, 2022

Some initial observations:

The failing test is:

[.] Action [all-ingress-deny/pod-to-pod/curl-1: cilium-test/client2-bb57d9bb4-b5pbz (10.0.1.204) -> cilium-test/echo-same-node-865dcc9b58-nz2tf (10.0.1.86:8080)]
  ❌ command "curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --connect-timeout 5 --output /dev/null http://10.0.1.86:8080/" succeeded while it should have failed: 

So a curl from pod to pod on the same node, with a deny-all-ingress policy:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "all-ingress-deny"
spec:
  endpointSelector: {}
  ingress:
    - {}

is being allowed when it should be denied. (An empty ingress rule matches no peers, so this policy puts every endpoint into default-deny for ingress without allowing anything.)

The 2 pods are:

➜  sysdump-aks rg '10.0.1.204|10.0.1.86' k8s-pods-20221114-132015.txt
cilium-test   client2-bb57d9bb4-b5pbz               1/1     Running   0          12m   10.0.1.204   aks-nodepool1-34344260-vmss000000   <none>           <none>
cilium-test   echo-same-node-865dcc9b58-nz2tf       2/2     Running   0          12m   10.0.1.86    aks-nodepool1-34344260-vmss000000   <none>           <none>

both running on aks-nodepool1-34344260-vmss000000, which hosts the cilium-wmccr agent pod:

kube-system   cilium-wmccr                          1/1     Running   0          13m   10.224.0.4   aks-nodepool1-34344260-vmss000000   <none>           <none>

Looking at the hubble flows, I can't see anything for that connection:

➜  aks-sysdump cat hubble-flows-cilium-* | hubble observe --from-ip 10.0.1.204 --to-ip 10.0.1.86
➜  aks-sysdump

And looking at the policy map for the echo-same-node-865dcc9b58-nz2tf endpoint (ID 186), I don't see anything allowing traffic from the client2-bb57d9bb4-b5pbz one (ID 2329):

➜  aks-sysdump yq '.[][] | select(.metadata.name == "client2-bb57d9bb4-b5pbz") | .status.id' ciliumendpoints-20221114-132015.yaml
2329
➜  aks-sysdump yq '.[][] | select(.metadata.name == "echo-same-node-865dcc9b58-nz2tf") | .status.id' ciliumendpoints-20221114-132015.yaml
186

➜  cmd cat cilium-bpf-policy-get---all---numeric.md | grep 186 -A 50
/sys/fs/bpf/tc/globals/cilium_policy_00186:

POLICY   DIRECTION   IDENTITY   PORT/PROTO   PROXY PORT   BYTES    PACKETS
Allow    Ingress     1          ANY          NONE         726006   8488
Allow    Ingress     3622       ANY          NONE         0        0
Allow    Ingress     3873       ANY          NONE         0        0
Allow    Ingress     23412      ANY          NONE         0        0
Allow    Ingress     57140      ANY          NONE         3315     32
Allow    Egress      2          ANY          NONE         0        0
Allow    Egress      3622       ANY          NONE         0        0
Allow    Egress      3873       ANY          NONE         0        0
Allow    Egress      23412      ANY          NONE         0        0
Allow    Egress      57140      ANY          NONE         2716     4
Allow    Egress      16777218   ANY          NONE         0        0
Allow    Egress      16777243   ANY          NONE         0        0
Allow    Egress      16777244   ANY          NONE         0        0
Allow    Egress      16777245   ANY          NONE         0        0
Allow    Egress      16777246   ANY          NONE         0        0
Allow    Egress      16777247   ANY          NONE         0        0
Allow    Egress      16777248   ANY          NONE         0        0
Allow    Egress      16777249   ANY          NONE         0        0
Allow    Egress      16777250   ANY          NONE         0        0
Allow    Egress      16777251   ANY          NONE         0        0
Allow    Egress      16777252   ANY          NONE         0        0
Allow    Egress      16777253   ANY          NONE         0        0
Allow    Egress      16777254   ANY          NONE         0        0
Allow    Egress      16777255   ANY          NONE         0        0
Allow    Egress      16777256   ANY          NONE         0        0
Allow    Egress      16777257   ANY          NONE         0        0
Allow    Egress      16777258   ANY          NONE         0        0
Allow    Egress      16777259   ANY          NONE         0        0
Allow    Egress      16777260   ANY          NONE         0        0
Allow    Egress      16777261   ANY          NONE         0        0
Allow    Egress      16777262   ANY          NONE         0        0
Allow    Egress      16777263   ANY          NONE         0        0
Allow    Egress      16777264   ANY          NONE         0        0
Allow    Egress      16777265   ANY          NONE         0        0
Allow    Egress      16777266   ANY          NONE         0        0

So, as Paul suggested, we might be looking at a race between policy application and requests 🤔

@margamanterola (Member)

BTW, looking at the failed results, we see that it's not always the same individual tests that fail. For example, in one run the first failure is echo-ingress-l7, whereas in another the first failure is all-egress-deny.

Anyway, yes, this very much looks like a race. The question is why our code thinks the policy is in place when it isn't fully applied yet.

@jibi (Member) commented Nov 23, 2022

Just pushed a change to the CLI to capture sysdumps right after a test fails: cilium/cilium-cli#1228. Hopefully that will give us better visibility into what's going on 🤞

aanm added the sig/datapath label on Nov 26, 2022
pchaigno changed the title from "CI: ConformanceAKS: curl succeeded while it should have failed" to "CI: ConformanceAKS: curl succeeded while it should have failed due to incorrect exit code" on Dec 2, 2022
@pchaigno (Member, Author) commented Dec 2, 2022

The latest discovery from Jibi on this is that it seems to be a similar issue to #15724, where kubectl ... curl ... returns a 0 exit code even though the curl clearly timed out. The test then fails because it thinks the request succeeded when it shouldn't have.
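
A sketch of how exit codes surface through client-go's exec path may help here (a minimal sketch of the general client-go behavior, not cilium-cli's exact code; the function name is illustrative). A non-zero remote exit comes back as a CodeExitError, which is where messages like "command terminated with exit code 28" originate; if the stream ends cleanly, err is nil and the command looks successful even if curl never ran to completion.

package main

import (
	"errors"

	kexec "k8s.io/client-go/util/exec"
)

// exitCode extracts the remote exit code, if any, from the error returned
// by a remotecommand stream.
func exitCode(err error) (code int, known bool) {
	if err == nil {
		return 0, true // stream ended cleanly: treated as exit code 0
	}
	var codeErr kexec.CodeExitError
	if errors.As(err, &codeErr) {
		return codeErr.Code, true // e.g. 28 for curl's connect timeout
	}
	return 0, false // transport/protocol error: no exit code available
}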

pchaigno removed the sig/datapath label on Dec 2, 2022
@pchaigno (Member, Author) commented Dec 3, 2022

Added some more debug output to curl and reran a bunch of times (in parallel because I don't have forever :p).

The expected output in case of a timeout looks like:

[2022-12-03T19:55:10Z]   ❌ command "curl -vvv -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --connect-timeout 5 --output /dev/null http://10.0.0.172:8080/" failed: command terminated with exit code 28
[2022-12-03T19:55:10Z]   ℹ️  curl output:
[2022-12-03T19:55:10Z]   *   Trying 10.0.0.172:8080...
* After 5000ms connect time, move on!
* connect to 10.0.0.172 port 8080 failed: Operation timed out
* Connection timeout after 5000 ms
* Closing connection 0
curl: (28) Connection timeout after 5000 ms
:0 -> :0 = 000

The output we get in our case is:

[2022-12-03T20:03:20Z]   ❌ command "curl -vvv -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --connect-timeout 5 --output /dev/null http://10.0.1.186:8080/" succeeded while it should have failed: *   Trying 10.0.1.186:8080...

[2022-12-03T20:03:20Z]   ℹ️  curl output:
[2022-12-03T20:03:20Z]   *   Trying 10.0.1.186:8080...
[2022-12-03T20:03:20Z]

It's as if the curl command had been interrupted before we reached the timeout (despite what the timestamps suggest).

For that reason, I'm starting to suspect it's a bug in the way we run curl rather than a bug in curl. Taking all this into account, it would be good to get agent/k8s/Golang eyes on https://github.com/cilium/cilium-cli/blob/1a823dc0a6afca645de32fbd057fff323a329f01/k8s/exec.go#L32 to see if something could explain the above.
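
For reference, the linked code follows the standard client-go remotecommand pattern. A minimal, self-contained sketch of that pattern (an approximation, not the actual cilium-cli code) looks like this:

package main

import (
	"bytes"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

// execInPod runs a command in a pod and returns its stdout and stderr.
func execInPod(client kubernetes.Interface, config *rest.Config, namespace, pod string, command []string) (string, string, error) {
	req := client.CoreV1().RESTClient().Post().
		Resource("pods").Namespace(namespace).Name(pod).
		SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Command: command,
			Stdout:  true,
			Stderr:  true,
		}, scheme.ParameterCodec)

	exec, err := remotecommand.NewSPDYExecutor(config, "POST", req.URL())
	if err != nil {
		return "", "", fmt.Errorf("error while creating executor: %w", err)
	}

	var stdout, stderr bytes.Buffer
	// Stream blocks until the remote command finishes or the connection
	// drops. If the stream ends without an error frame, the caller sees a
	// nil error and assumes success -- the suspected failure mode here.
	err = exec.Stream(remotecommand.StreamOptions{
		Stdout: &stdout,
		Stderr: &stderr,
	})
	return stdout.String(), stderr.String(), err
}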

@chancez (Contributor) commented Dec 4, 2022

Depending on the API version and kubectl version, exit codes from kubectl exec may not be forwarded. That said, the functionality for forwarding exit codes isn't very new; I forget which version it was added in. Maybe we can get verbose kubectl output (-v=10); it should include JSON output with the exec response code.

@aanm aanm added area/cli Impacts the command line interface of any command in the repository. sig/agent Cilium agent related. labels Dec 5, 2022
@tklauser (Member) commented Dec 5, 2022

> It's as if the curl command had been interrupted before we reached the timeout (despite what the timestamps suggest).

Not really relevant to the issue, but the timestamps stem from logging in cilium-cli, not from when curl was executed. I think for that we'd need to add --trace-time to the curl command.

@jibi (Member) commented Dec 5, 2022

> It's as if the curl command had been interrupted before we reached the timeout (despite what the timestamps suggest).

There's been a similar report for the exact same logic we have in k8s/exec.go, where the suggestion was to use `stdin: false`.

In our case we can't simply do that, as we need a way to interrupt the execution of the command after it times out, so I'm giving the new StreamWithContext method a try (it needs to be backported to our own client-go fork).

@tklauser (Member) commented Dec 5, 2022

> In our case we can't simply do that, as we need a way to interrupt the execution of the command after it times out, so I'm giving the new StreamWithContext method a try (it needs to be backported to our own client-go fork).

Hopefully, our own fork shouldn't be needed anymore: #22547

Once that's merged in cilium/cilium we'd only need to update the vendored copy to a more recent client-go version providing StreamWithContext.

@jibi (Member) commented Dec 5, 2022

Doesn't look like it's working 😞

diff --git a/k8s/exec.go b/k8s/exec.go
index 2b779686..595f55cd 100644
--- a/k8s/exec.go
+++ b/k8s/exec.go
@@ -12,8 +12,6 @@ import (
        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/runtime"
        "k8s.io/client-go/tools/remotecommand"
-
-       "github.com/cilium/cilium-cli/internal/utils"
 )

 type ExecResult struct {
@@ -42,7 +40,7 @@ func (c *Client) execInPodWithWriters(ctx context.Context, p ExecParameters, std
        req.VersionedParams(&corev1.PodExecOptions{
                Command:   p.Command,
                Container: p.Container,
-               Stdin:     p.TTY,
+               Stdin:     false,
                Stdout:    true,
                Stderr:    true,
                TTY:       p.TTY,
@@ -53,20 +51,9 @@ func (c *Client) execInPodWithWriters(ctx context.Context, p ExecParameters, std
                return fmt.Errorf("error while creating executor: %w", err)
        }

-       var stdin io.ReadCloser
-       if p.TTY {
-               // CtrlCReader sends Ctrl-C/D sequence if context is cancelled
-               stdin = utils.NewCtrlCReader(ctx)
-               // Graceful close of stdin once we are done, no Ctrl-C is sent
-               // if execution finishes before the context expires.
-               defer stdin.Close()
-       }
-
-       err = exec.Stream(remotecommand.StreamOptions{
-               Stdin:  stdin,
+       err = exec.StreamWithContext(ctx, remotecommand.StreamOptions{
                Stdout: stdout,
                Stderr: stderr,
-               Tty:    p.TTY,
        })

Failing test workflow: https://github.com/cilium/cilium/actions/runs/3619110325/jobs/6100967893

Usual error:

2022-12-05T12:33:21.9698804Z [2022-12-05T12:33:21Z]   ❌ command "curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --connect-timeout 5 --output /dev/null http://10.0.0.242:8080" succeeded while it should have failed: 
2022-12-05T12:33:21.9701282Z [2022-12-05T12:33:21Z]   ℹ️  curl output:
2022-12-05T12:33:21.9702004Z [2022-12-05T12:33:21Z]   
2022-12-05T12:33:21.9702420Z [2022-12-05T12:33:21Z]   
2022-12-05T12:33:21.9702969Z [2022-12-05T12:33:21Z]   📄 No flows recorded during action curl-3
2022-12-05T12:33:21.9703547Z [2022-12-05T12:33:21Z]   📄 No flows recorded during action curl-3

And traffic being dropped:

➜  cilium-sysdump-20221205-123321 cat hubble-flows-cilium-* | hubble observe | sort | grep DROPPED
Dec  5 12:33:16.914: cilium-test/client2-67754cb6fb-fkbvt:45384 <> cilium-test/echo-same-node-dc6b7fd9f-xc2j9:8080 Policy denied by denylist DROPPED (TCP Flags: SYN)
Dec  5 12:33:16.914: cilium-test/client2-67754cb6fb-fkbvt:45384 <> cilium-test/echo-same-node-dc6b7fd9f-xc2j9:8080 Policy denied by denylist DROPPED (TCP Flags: SYN)
Dec  5 12:33:17.923: cilium-test/client2-67754cb6fb-fkbvt:45384 <> cilium-test/echo-same-node-dc6b7fd9f-xc2j9:8080 Policy denied by denylist DROPPED (TCP Flags: SYN)
Dec  5 12:33:17.923: cilium-test/client2-67754cb6fb-fkbvt:45384 <> cilium-test/echo-same-node-dc6b7fd9f-xc2j9:8080 Policy denied by denylist DROPPED (TCP Flags: SYN)
Dec  5 12:33:19.939: cilium-test/client2-67754cb6fb-fkbvt:45384 <> cilium-test/echo-same-node-dc6b7fd9f-xc2j9:8080 Policy denied by denylist DROPPED (TCP Flags: SYN)
Dec  5 12:33:19.939: cilium-test/client2-67754cb6fb-fkbvt:45384 <> cilium-test/echo-same-node-dc6b7fd9f-xc2j9:8080 Policy denied by denylist DROPPED (TCP Flags: SYN)

@pchaigno (Member, Author) commented Dec 5, 2022

It is not as if someone else had already checked...

@pchaigno (Member, Author)

The example above actually had a different root cause, explained at cilium/cilium-cli#1260. I've sent cilium/cilium-cli#1541 to try and fix that.

Let's take another occurrence that looks like the present issue: https://github.com/cilium/cilium/actions/runs/4806075784
Sysdump: cilium-sysdumps.zip
Logs: logs_1390081.zip

pchaigno self-assigned this on Apr 26, 2023
pchaigno added a commit to cilium/cilium-cli that referenced this issue Apr 26, 2023
When running the connectivity tests in AKS, we sometimes get interrupted
commands that don't have any output [1]. Unfortunately, those commands
then exit without any error and are therefore considered successful. We
think this is caused by connectivity blips between Kubernetes
components.

This commit adds a check for those inconclusive results. If we see a
seemingly successful command with no output, we retry it until we get
something conclusive. This works because all our test commands (curl,
ping, nslookup) dump something to stdout.

1 - cilium/cilium#22162
Signed-off-by: Paul Chaignon <paul@cilium.io>
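
The check this commit message describes might look roughly like the following (a hypothetical sketch; the function name and retry budget are illustrative, not the actual cilium-cli implementation):

package main

import (
	"context"
	"fmt"
	"strings"
)

// runConclusively retries a command until it produces a conclusive result:
// either it fails, or it succeeds with non-empty output.
func runConclusively(ctx context.Context, run func(context.Context) (string, error)) (string, error) {
	const maxAttempts = 3 // illustrative retry budget
	var (
		out string
		err error
	)
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		out, err = run(ctx)
		// Conclusive: the command failed, or it succeeded with output.
		// All test commands (curl, ping, nslookup) print something on a
		// genuine success, so success with output can be trusted.
		if err != nil || strings.TrimSpace(out) != "" {
			return out, err
		}
		// Inconclusive: "success" with empty output, i.e. the suspected
		// interrupted kubectl exec stream. Retry.
	}
	return out, fmt.Errorf("no conclusive result after %d attempts", maxAttempts)
}
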
pchaigno added a commit to cilium/cilium-cli that referenced this issue Apr 26, 2023
pchaigno added a commit to cilium/cilium-cli that referenced this issue Apr 28, 2023
tklauser pushed a commit to cilium/cilium-cli that referenced this issue Apr 28, 2023
pchaigno removed their assignment on May 1, 2023
michi-covalent pushed a commit to michi-covalent/cilium that referenced this issue May 30, 2023
@pchaigno (Member, Author) commented Jun 6, 2023

The workaround for this (cf. cilium/cilium-cli#1543) is now used in Cilium's CI, so if anyone hits this flake again, it would be good to know.

@amitmavgupta (Contributor)

Hit this issue on AKS with Isovalent Enterprise for Cilium from the Azure Marketplace. Let me know if I need to open a different issue for this; I have the sysdump handy as well.

❌ 1/41 tests failed (1/282 actions), 14 tests skipped, 0 scenarios skipped:
Test [client-ingress-from-other-client-icmp-deny]:
❌ client-ingress-from-other-client-icmp-deny/client-to-client/ping-ipv4-1: cilium-test/client2-78f748dd67-dklg5 (192.168.1.206) -> cilium-test/client-7b78db77d5-ltl4q (192.168.1.161:0)
connectivity test failed: 1 tests failed

@pchaigno (Member, Author)

@amitmavgupta Did you get the error message "succeeded while it should have failed"? What version of the CLI were you using to run the connectivity tests?

@amitmavgupta (Contributor)

@pchaigno Hi Paul, the test should have failed.

CLI version:
cilium-cli: v0.14.5 compiled with go1.20.4 on darwin/arm64

@gandro (Member) commented Jul 6, 2023

I think I also hit this here:

  ℹ️  📜 Applying CiliumNetworkPolicy 'client-ingress-from-client2' to namespace 'cilium-test'..
  [-] Scenario [client-ingress/client-to-client]
  [.] Action [client-ingress/client-to-client/ping-ipv4-0: cilium-test/client-6f6788d7cc-jzxfm (10.0.0.38) -> cilium-test/client2-bc59f56d5-ds4jp (10.0.0.83:0)]
  ❌ command "ping -c 1 -W 2 -w 10 10.0.0.83" succeeded while it should have failed: PING 10.0.0.83 (10.0.0.83) 56(84) bytes of data.

  ℹ️  ping output:
  PING 10.0.0.83 (10.0.0.83) 56(84) bytes of data.

ping did output something, but it didn't show any successful pings, so it likely didn't actually succeed. Seen on PR #26662 against main: https://github.com/cilium/cilium/actions/runs/5474746995/jobs/9969923321

@pchaigno (Member, Author) commented Jul 6, 2023

Ah, my "fix" doesn't work for ping because there is some output at the beginning even when it hangs. Maybe we can change the ping command we use to remove that first line:

PING 10.0.0.83 (10.0.0.83) 56(84) bytes of data.

mhofstetter added a commit to mhofstetter/cilium-cli that referenced this issue Jul 21, 2023
When running the connectivity tests in AKS, we sometimes get interrupted
commands that don't have any output [1]. Therefore, successful command
terminations with an empty output are treated as inconclusive results
that trigger a retry [2].

Unfortunately, the ping command might get stuck after outputting only its
header line:

`PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.`

This commit adds an additional check for ping commands: if we see output
containing only the header line, we treat it as an inconclusive result.

1 - cilium/cilium#22162
2 - cilium#1543

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
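
The additional ping check described above might look like this (a hypothetical sketch; the regexp and function name are illustrative, not the actual cilium-cli code): output consisting solely of ping's header line is treated as inconclusive, just like empty output.

package main

import (
	"regexp"
	"strings"
)

// pingHeader matches ping's header line, e.g.
// "PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data."
var pingHeader = regexp.MustCompile(`^PING .* bytes of data\.$`)

func isInconclusivePingOutput(out string) bool {
	trimmed := strings.TrimSpace(out)
	if trimmed == "" {
		return true // empty output is already treated as inconclusive
	}
	// Only the header line, with no echo replies and no statistics: the
	// ping most likely hung and its output was cut off.
	return pingHeader.MatchString(trimmed)
}
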
mhofstetter added a commit to mhofstetter/cilium-cli that referenced this issue Jul 21, 2023
mhofstetter added a commit to mhofstetter/cilium-cli that referenced this issue Jul 21, 2023
michi-covalent pushed a commit to cilium/cilium-cli that referenced this issue Jul 25, 2023
liyihuang pushed a commit to liyihuang/cilium-cli that referenced this issue Aug 1, 2023
@pchaigno (Member, Author)

No reports since my fix and Marco's followup. Let's close.
