Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

workflows/datapath: Fix always-passing step #24918

Merged
merged 2 commits into from
Apr 18, 2023

Conversation

pchaigno
Copy link
Member

@pchaigno pchaigno commented Apr 17, 2023

First commit fixes an incorrect flag in the ConformanceDatapath workflow. Second commit fixes an always-passing step in the same workflow. See commits for details.

Passing run with test commit: https://github.com/cilium/cilium/actions/runs/4730060326/jobs/8393311157.

Fixes: #24296.

@pchaigno pchaigno added sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/ci This PR makes changes to the CI. labels Apr 17, 2023
@pchaigno pchaigno force-pushed the pr/pchaigno/test-datapath-workflow branch 6 times, most recently from 91052b9 to b0723c9 Compare April 18, 2023 08:06
@pchaigno pchaigno changed the title workflows: TEST workflows/datapath: Fix always-passing step Apr 18, 2023
Support for flag --keep-registry was removed in commit 2975c75
("contrib/kind: no longer create local docker registry") because it is
no longer necessary. Because the ConformanceDatapath workflow was broken
and always passing, it was not noticed that this workflow was still
using the flag.

Fixes: 2975c75 ("contrib/kind: no longer create local docker registry")
Signed-off-by: Paul Chaignon <paul@cilium.io>
Flags -it to 'kubectl exec' are unnecessary given the command being
executed is not an interactive one and couldn't be anyway. Worse, the -i
flag causes this workflow step to exit immediately with a zero error
code. The subsequent commands, including the actual connectivity tests,
are therefore silently skipped.

That was causing the ConformanceDatapath workflow to seemingly always
succeed its connectivity tests. This commit fixes it.

I was able to reproduce a similar behavior locally:

    $ cat test.sh
    #!/bin/bash
    echo before
    kubectl -n kube-system -c cilium-agent exec -i ds/cilium -- echo test
    echo after
    $ ./test.sh &
    [4] 13155
    $ before

    [4]+  Stopped                 ./test.sh

If removing -i, then all of 'before', 'test', and 'after' are displayed.

It's unclear why this kubectl flag causes this behavior, but we should
fix nonetheless.

Fixes: edef93f ("gh/workflows: Split ci-dp encrypt tests into separate matrix configs")
Signed-off-by: Paul Chaignon <paul@cilium.io>
@pchaigno pchaigno force-pushed the pr/pchaigno/test-datapath-workflow branch from b0723c9 to 375d617 Compare April 18, 2023 08:44
@pchaigno pchaigno marked this pull request as ready for review April 18, 2023 08:52
@pchaigno pchaigno requested review from a team as code owners April 18, 2023 08:52

./cilium-cli connectivity test --datapath --collect-sysdump-on-failure \
--sysdump-output-filename "cilium-sysdump-${{ matrix.name }}-<ts>"
./cilium-cli connectivity test --collect-sysdump-on-failure \
--sysdump-output-filename "cilium-sysdump-${{ matrix.name }}-<ts>"
./contrib/scripts/kind-down.sh --keep-registry
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to your PR, but why do we need to kill Kind? The runner won't be reused by any other test run.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No idea. Could it be a leftover from when the workflow was running connectivity tests twice?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most likely. We can remove it to save a few secs.

Copy link
Member

@brb brb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing 🤯 🎉 Any idea how to prevent such issues in the future?

@pchaigno
Copy link
Member Author

Any idea how to prevent such issues in the future?

By considering any tests that passes 30 times in a row as broken 😸

No real idea. It's tricky to detect always-passing bugs :-/

@@ -428,7 +428,7 @@ jobs:
./cilium-cli install ${{ steps.vars.outputs.cilium_install_defaults }}

./cilium-cli status --wait
kubectl -n kube-system exec -it daemonset/cilium -- cilium status
kubectl -n kube-system exec daemonset/cilium -- cilium status
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I wonder why it exits with 0 though. Does it actually run the connectivity test now? :)

In your case in test.sh nothing is displayed, because you put your script to background, and as it wants the stdin, it waits in the background under sigstop. (Do a fg in bash and it will display you the test string and the script will terminate.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your case in test.sh nothing is displayed, because you put your script to background, and as it wants the stdin, it waits in the background under sigstop. (Do a fg in bash and it will display you the test string and the script will terminate.)

Yep. What I'm not entirely clear on is why when doing fg "after" doesn't show.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does for me

$ ./z.sh &
[1] 49590
$ before

[1]+  Stopped                 ./z.sh
$ fg
./z.sh
test
after

Anyway, if this fixes the issue, then fine.

@pchaigno
Copy link
Member Author

We have reviews from the datapath team. Missing reviews from the CI but this isn't changing the structure of the file or the way we run the workflow (though it would be nice to get their opinion on why it may have failed). Merging to avoid having other regressions sneak in.

@pchaigno pchaigno merged commit 0e094e9 into main Apr 18, 2023
42 checks passed
@pchaigno pchaigno deleted the pr/pchaigno/test-datapath-workflow branch April 18, 2023 21:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note/ci This PR makes changes to the CI. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants