sys tests: run_podman: check for unwanted warnings/errors #19878
Conversation
Been testing yesterday & today in #17831, from which I concluded that we can't possibly enable this on Debian or on kube-anything. Marking WIP because I want a lot more runs, because I suspect there are more flakes to be found. And because I have some questions that need addressing.
LGTM assuming tests go green.
This is really valuable, thank you!
My questions.
run_podman 0+e run --rmi --rm $NONLOCAL_IMAGE /bin/true
is "$output" ".*image is in use by a container" "--rmi should warn that the image was not removed"
Note that this is "e"; that is, the warning is actually level=error. Should it be?
That LGTM. The man page states: "After exit of the container, remove the image unless another container is using it." It seems like a good idea to log when the image cannot be removed.
Full agreement on "there should be a warning". My question is: should it be level=error?
Ah, thanks for clarifying. I think a warning would be better than an error. It's perfectly fine behavior which does not classify as an error IMO.
@@ -27,7 +27,10 @@ load helpers
run_podman wait $cid_none_implicit $cid_none_explicit $cid_on_failure

run_podman rm $cid_none_implicit $cid_none_explicit $cid_on_failure
-run_podman stop -t 1 $cid_always
+run_podman 0+w stop -t 1 $cid_always
if ! is_remote; then
The warning is never seen in podman-remote. Should it be?
Ideally yes, but it's technically very difficult. logrus logs are always shown on the server side, and never on the client side.
Ack, thanks.
test/system/520-checkpoint.bats
@@ -245,7 +245,7 @@ function teardown() {
assert "$mac2" == "$mac1" "mac after restore should match"

# restart the container we should get a new ip/mac because they are not static
-run_podman restart $cid
+run_podman 0+w restart $cid
restart seems to invoke stop, which then does the timeout-sigkill thing. I just want to point this out in case it is not 100% common knowledge. Also want to point out that this costs us 10 seconds per restart, so, 40 extra seconds of CI time.
Can't we change the default stop-timeout of the containers to 0?
I don't know if you mean in the tests, or in restart, but either way that's not something I'm confident doing on my own.
I didn't investigate it yet, but I guess that creating/running the containers with --stop-timeout will do the job.
I haven't investigated fully either, because that code is scary, but at least some of those containers are sleep inf, which is a bad idea all around. Changing to top would solve some of these. But again, I'm just not familiar enough with the purpose behind these tests, and don't feel comfortable making that change.
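For the record, the two alternatives floated in this thread would look roughly like this in bats form. This is an untested sketch: --stop-timeout is a real podman-run option, but whether either variant is safe for these checkpoint tests is exactly the open question.

```
# Alternative 1: create the container with a zero stop timeout, so
# stop/restart sends SIGKILL immediately instead of waiting 10 seconds:
run_podman run -d --stop-timeout 0 $IMAGE sleep inf

# Alternative 2: use a PID-1 process that handles SIGTERM, so the
# timeout never triggers at all:
run_podman run -d $IMAGE top
```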
# stdout is only emitted upon error; this printf is to help in debugging
printf "\n%s %s %s\n" "$(timestamp)" "$_LOG_PROMPT" "$*"
This is completely unrelated, I'm just sneaking it in. It adds an extra newline, which has greatly helped me scan error logs for most-recent-podman-command.
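A minimal, hypothetical reconstruction of that helper, for anyone not following the diff (timestamp() and _LOG_PROMPT are stand-ins, not the real helpers.bash definitions). The point is the leading "\n": it puts a blank line before each logged command, making the most recent podman invocation easy to spot in a long captured error log.

```shell
#!/bin/bash
# Stand-in for the real prompt string used in test logs
_LOG_PROMPT='$'

# Stand-in timestamp helper (the real one may use a different format)
timestamp() {
    date '+%H:%M:%S'
}

# Log a command, preceded by a blank line for readability
log_command() {
    printf "\n%s %s %s\n" "$(timestamp)" "$_LOG_PROMPT" "$*"
}

out=$(log_command podman stop -t 1 mycontainer)
printf '%s\n' "$out"
```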
# FIXME: don't do this on Debian: runc is way, way too flaky:
# FIXME: #11784 - lstat /sys/fs/.../*.scope: ENOENT
# FIXME: #11785 - cannot toggle freezer: cgroups not configured
I don't see any way around this.
# FIXME: All kube commands emit unpredictable errors:
#   "Storage for container <X> has been removed"
#   "no container with ID <X> found in database"
# These are level=error but we still get exit-status 0.
# Just skip all kube commands completely
This, though, really bothers me. podman kube is noisy, scary-noisy. I would really like us to consider fixing that.
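The two global exceptions from the commit message, sketched as a standalone function. The name and calling convention are invented for illustration; the real logic lives inside run_podman in test/system/helpers.bash. It returns 0 when stderr should be checked for unwanted level=warning/error messages, 1 when the check must be skipped.

```shell
#!/bin/bash
# Hypothetical predicate: should run_podman enforce its
# "no level=warning/error on exit-status 0" check?
stderr_check_wanted() {
    local distro=$1; shift

    # Global exception 1: Debian, where runc is too flaky
    # (#11784, #11785).
    if [[ $distro == "debian" ]]; then
        return 1
    fi

    # Global exception 2: kube commands, which emit level=error
    # noise ("Storage for container <X> has been removed", ...)
    # even when they exit 0.
    local arg
    for arg in "$@"; do
        if [[ $arg == "kube" ]]; then
            return 1
        fi
    done
    return 0
}
```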
@@ -1058,7 +1058,12 @@ $IMAGE--c_ok" \
"ls /dev/tty[0-9] with --systemd=always: should have no ttyN devices"

# Make sure run_podman stop supports -1 option
run_podman stop -t -1 $cid
# FIXME: why is there no signal name here? Should be 'StopSignal XYZ'
That looks like a fart. Especially the double white space suggests that the value being printed differs from the one actually being used. For instance, the value could be an empty string as printed, which is later normalized to SIGTERM.
Likely needs a dedicated issue.
@@ -1058,7 +1058,12 @@ $IMAGE--c_ok" \
"ls /dev/tty[0-9] with --systemd=always: should have no ttyN devices"

# Make sure run_podman stop supports -1 option
run_podman stop -t -1 $cid
# FIXME: why is there no signal name here? Should be 'StopSignal XYZ'
# FIXME: do we really really mean to say FFFFFFFFFFFFFFFF here???
I do not understand the question. Can you elaborate?
I guess I don't know the reasoning behind accepting stop -t -1. It seems meaningless to me, and even more meaningless to convert that to uint64 and display it as such.
Force-pushed 88108a6 to 9c7978e
CI passed. I rebased & repushed because I still expect to see occasional flakes, and don't want this to merge if it's going to cause annoyance.
Passed again...
Force-pushed 9c7978e to b31d5a9
...and again. Why don't flakes happen when we want them to?
Force-pushed e8f27cb to f87e79a
Sigh. Two more successful runs. Why does that make me nervous?
-run_podman run -d --network $netname $IMAGE sleep inf
+run_podman run -d --network $netname $IMAGE top
I changed all sleep infs to top. This eliminates the need for 0+w because top is interruptible. If there is some important reason for using sleep, or if checkpoint/restore does not work properly with top, please speak now.
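For anyone following along, a sketch of why this helps, assuming standard PID-1 signal semantics (illustrative, not a tested diff):

```
# Before: sleep is PID 1 in the container. PID 1 ignores signals for
# which it has no handler, and sleep installs none, so "podman stop"
# waits out its timeout, sends SIGKILL, and logs a level=warning:
run_podman run -d --network $netname $IMAGE sleep inf
run_podman 0+w stop $cid

# After: top installs a SIGTERM handler and exits promptly, so stop
# succeeds cleanly and no 0+w exception is needed:
run_podman run -d --network $netname $IMAGE top
run_podman stop $cid
```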
Force-pushed f87e79a to 652c222
With few exceptions, commands that exit 0 should not emit any messages with level=warning or =error. Let's start enforcing that in run_podman.

Allow one-off exceptions, typically when we're testing an actual warning condition (usual case: "podman stop" where it times out to SIGKILL). Exceptions are specified via:

    run_podman 0+w subcommand...
               ^^^---- or, rarely, 0+e

"0" stands for "expect exit status 0", which is the default so it's implicit anyway. The +w / +e (or even +we) is the new part. I have added it to tests where necessary.

And, because life is what it is, add two global exceptions:

- Debian. Because runc has too many flakes.
- kube. Ditto. Kube commands emit lots of nasty error messages (yes, level=error) that don't seem to affect results.

Similar to containers#18442

Signed-off-by: Ed Santiago <santiago@redhat.com>
Force-pushed 652c222 to c2575f7
The empirical rule: once is never, twice is always. Let's go!
Gulp. I guess if we haven't seen flakes in this many runs, it probably means that future test failures will be desirable (real issues). I'd still like more eyeballs on this, but am OK with proceeding toward merge. I will file issues today for the topics addressed above, and I'm working on a …
/lgtm
/hold
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago, vrothberg

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Approvers can indicate their approval by writing /approve in a comment.
/hold cancel
PR containers#19878 (checking for warnings in system tests) broke upgrade tests. Reason: my long-ago "optimization" in which, if a PR touches only tests in X, do not run tests in Y. Unfortunately, upgrade tests rely on code in the system-test directory. I don't know if this is fixable; nor if it's an acceptable tradeoff. Please discuss. Sorry, everyone. Signed-off-by: Ed Santiago <santiago@redhat.com>
Followup to containers#20016: - remove obsolete (misleading) comment - prune dangling <none>:<none> image Also, in kube test, rmi pause_image to avoid nasty red warnings Also, ouch, fix a stupid that I introduced in containers#19878: the PODMAN command path got dropped from log messages. Signed-off-by: Ed Santiago <santiago@redhat.com>
Part of RUN-1906. Followup to containers#19878 (check stderr in system tests): allow_warnings() and require_warning() functions to make sure no unexpected messages fall through the cracks. Signed-off-by: Ed Santiago <santiago@redhat.com>
With few exceptions, commands that exit 0 should not emit any
messages with level=warning or =error. Let's start enforcing
that in run_podman.
Allow one-off exceptions, typically when we're testing an
actual warning condition (usual case: "podman stop" where it
times out to SIGKILL). Exceptions are specified via:

    run_podman 0+w subcommand...
               ^^^---- or, rarely, 0+e
"0" stands for "expect exit status 0", which is the default
so it's implicit anyway. The +w / +e (or even +we) is the
new part. I have added it to tests where necessary.
And, because life is what it is, add two global exceptions:

- Debian. Because runc has too many flakes.
- kube. Ditto. Kube commands emit lots of nasty error
  messages (yes, level=error) that don't seem to affect
  results.
Similar to #18442
Signed-off-by: Ed Santiago <santiago@redhat.com>