sys tests: run_podman: check for unwanted warnings/errors #19878
Conversation
Been testing yesterday & today in #17831, from which I concluded that we can't possibly enable this on Debian or on kube-anything. Marking WIP because I want a lot more runs, because I suspect there are more flakes to be found. And because I have some questions that need addressing.
LGTM assuming tests go green.
This is really valuable, thank you!
My questions.
run_podman 0+e run --rmi --rm $NONLOCAL_IMAGE /bin/true
is "$output" ".*image is in use by a container" "--rmi should warn that the image was not removed"
Note that this is "e"; that is, the warning is actually level=error. Should it be?
That LGTM. The man page states: "After exit of the container, remove the image unless another container is using it." It seems like a good idea to log when the image cannot be removed.
Full agreement on "there should be a warning". My question is: should it be level=error?
Ah, thanks for clarifying. I think a warning would be better than an error. It's perfectly fine behavior which does not classify as an error IMO.
@@ -27,7 +27,10 @@ load helpers
run_podman wait $cid_none_implicit $cid_none_explicit $cid_on_failure

run_podman rm $cid_none_implicit $cid_none_explicit $cid_on_failure
-run_podman stop -t 1 $cid_always
+run_podman 0+w stop -t 1 $cid_always
if ! is_remote; then
The warning is never seen in podman-remote. Should it be?
Ideally yes, but it's technically very difficult. logrus logs are always shown on the server side, and never on the client side.
Ack, thanks.
test/system/520-checkpoint.bats
@@ -245,7 +245,7 @@ function teardown() {
assert "$mac2" == "$mac1" "mac after restore should match"

# restart the container we should get a new ip/mac because they are not static
-run_podman restart $cid
+run_podman 0+w restart $cid
restart seems to invoke stop, which then does the timeout-sigkill thing. I just want to point this out in case it is not 100% common knowledge. Also want to point out that this costs us 10 seconds per restart, so, 40 extra seconds of CI time.
Can't we change the default stop-timeout of the containers to 0?
I don't know if you mean in the tests, or in restart, but either way that's not something I'm confident doing on my own.
I didn't investigate it yet, but I guess that creating/running the containers with --stop-timeout will do the job.
I haven't investigated fully either, because that code is scary, but at least some of those containers are sleep inf, which is a bad idea all around. Changing to top would solve some of these. But again, I'm just not familiar enough with the purpose behind these tests, and don't feel comfortable making that change.
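For the record, the two alternatives floated in this thread would look roughly like this in bats form. This is an untested sketch: --stop-timeout is a real podman-run option, but whether either variant is safe for these checkpoint tests is exactly the open question.

```
# Alternative 1: create the container with a zero stop timeout, so
# stop/restart sends SIGKILL immediately instead of waiting 10 seconds:
run_podman run -d --stop-timeout 0 $IMAGE sleep inf

# Alternative 2: use a PID-1 process that handles SIGTERM, so the
# timeout never triggers at all:
run_podman run -d $IMAGE top
```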
# stdout is only emitted upon error; this printf is to help in debugging
printf "\n%s %s %s\n" "$(timestamp)" "$_LOG_PROMPT" "$*"
This is completely unrelated, I'm just sneaking it in. It adds an extra newline, which has greatly helped me scan error logs for most-recent-podman-command.
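A minimal, hypothetical reconstruction of that helper, for anyone not following the diff (timestamp() and _LOG_PROMPT are stand-ins, not the real helpers.bash definitions). The point is the leading "\n": it puts a blank line before each logged command, making the most recent podman invocation easy to spot in a long captured error log.

```shell
#!/bin/bash
# Stand-in for the real prompt string used in test logs
_LOG_PROMPT='$'

# Stand-in timestamp helper (the real one may use a different format)
timestamp() {
    date '+%H:%M:%S'
}

# Log a command, preceded by a blank line for readability
log_command() {
    printf "\n%s %s %s\n" "$(timestamp)" "$_LOG_PROMPT" "$*"
}

out=$(log_command podman stop -t 1 mycontainer)
printf '%s\n' "$out"
```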
# FIXME: don't do this on Debian: runc is way, way too flaky:
# FIXME: #11784 - lstat /sys/fs/.../*.scope: ENOENT
# FIXME: #11785 - cannot toggle freezer: cgroups not configured
I don't see any way around this.
# FIXME: All kube commands emit unpredictable errors:
#   "Storage for container <X> has been removed"
#   "no container with ID <X> found in database"
# These are level=error but we still get exit-status 0.
# Just skip all kube commands completely
This, though, really bothers me. podman kube is noisy, scary-noisy. I would really like us to consider fixing that.
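The two global exceptions from the commit message, sketched as a standalone function. The name and calling convention are invented for illustration; the real logic lives inside run_podman in test/system/helpers.bash. It returns 0 when stderr should be checked for unwanted level=warning/error messages, 1 when the check must be skipped.

```shell
#!/bin/bash
# Hypothetical predicate: should run_podman enforce its
# "no level=warning/error on exit-status 0" check?
stderr_check_wanted() {
    local distro=$1; shift

    # Global exception 1: Debian, where runc is too flaky
    # (#11784, #11785).
    if [[ $distro == "debian" ]]; then
        return 1
    fi

    # Global exception 2: kube commands, which emit level=error
    # noise ("Storage for container <X> has been removed", ...)
    # even when they exit 0.
    local arg
    for arg in "$@"; do
        if [[ $arg == "kube" ]]; then
            return 1
        fi
    done
    return 0
}
```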
@@ -1058,7 +1058,12 @@ $IMAGE--c_ok" \
"ls /dev/tty[0-9] with --systemd=always: should have no ttyN devices"

# Make sure run_podman stop supports -1 option
run_podman stop -t -1 $cid
# FIXME: why is there no signal name here? Should be 'StopSignal XYZ'
That looks like a fart. Especially the double white space suggests that the value being printed differs from the one actually being used. For instance, the value could be an empty string as printed, which is later normalized to SIGTERM.
Likely needs a dedicated issue.
@@ -1058,7 +1058,12 @@ $IMAGE--c_ok" \
"ls /dev/tty[0-9] with --systemd=always: should have no ttyN devices"

# Make sure run_podman stop supports -1 option
run_podman stop -t -1 $cid
# FIXME: why is there no signal name here? Should be 'StopSignal XYZ'
# FIXME: do we really really mean to say FFFFFFFFFFFFFFFF here???
I do not understand the question. Can you elaborate?
I guess I don't know the reasoning behind accepting stop -t -1. It seems meaningless to me, and even more meaningless to convert that to uint64 and display it as such.
Force-pushed 88108a6 to 9c7978e
CI passed. I rebased & repushed because I still expect to see occasional flakes, and don't want this to merge if it's going to cause annoyance.
Passed again...
Force-pushed 9c7978e to b31d5a9
...and again. Why don't flakes happen when we want them to?
Force-pushed e8f27cb to f87e79a
Sigh. Two more successful runs. Why does that make me nervous?
-run_podman run -d --network $netname $IMAGE sleep inf
+run_podman run -d --network $netname $IMAGE top
I changed all sleep infs to top. This eliminates the need for 0+w because top is interruptible. If there is some important reason for using sleep, or if checkpoint/restore does not work properly with top, please speak now.
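For anyone following along, a sketch of why this helps, assuming standard PID-1 signal semantics (illustrative, not a tested diff):

```
# Before: sleep is PID 1 in the container. PID 1 ignores signals for
# which it has no handler, and sleep installs none, so "podman stop"
# waits out its timeout, sends SIGKILL, and logs a level=warning:
run_podman run -d --network $netname $IMAGE sleep inf
run_podman 0+w stop $cid

# After: top installs a SIGTERM handler and exits promptly, so stop
# succeeds cleanly and no 0+w exception is needed:
run_podman run -d --network $netname $IMAGE top
run_podman stop $cid
```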
Force-pushed f87e79a to 652c222
With few exceptions, commands that exit 0 should not emit any messages with level=warning or =error. Let's start enforcing that in run_podman.

Allow one-off exceptions, typically when we're testing an actual warning condition (usual case: "podman stop" where it times out to SIGKILL). Exceptions are specified via:

    run_podman 0+w subcommand...
               ^^^---- or, rarely, 0+e

"0" stands for "expect exit status 0", which is the default so it's implicit anyway. The +w / +e (or even +we) is the new part. I have added it to tests where necessary.

And, because life is what it is, add two global exceptions:

- Debian. Because runc has too many flakes.
- kube. Ditto. Kube commands emit lots of nasty error messages (yes, level=error) that don't seem to affect results.

Similar to containers#18442

Signed-off-by: Ed Santiago <santiago@redhat.com>
Force-pushed 652c222 to c2575f7
The empirical rule: once is never, twice is always. Let's go!
Gulp. I guess if we haven't seen flakes in this many runs, it probably means that future test failures will be desirable (real issues). I'd still like more eyeballs on this, but am OK with proceeding toward merge. I will file issues today for the topics addressed above, and I'm working on a …
/lgtm
/hold
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago, vrothberg

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Approvers can indicate their approval by writing /approve in a comment.
/hold cancel
PR containers#19878 (checking for warnings in system tests) broke upgrade tests. Reason: my long-ago "optimization" in which, if a PR touches only tests in X, do not run tests in Y. Unfortunately, upgrade tests rely on code in the system-test directory. I don't know if this is fixable; nor if it's an acceptable tradeoff. Please discuss. Sorry, everyone. Signed-off-by: Ed Santiago <santiago@redhat.com>
Followup to containers#20016: - remove obsolete (misleading) comment - prune dangling <none>:<none> image Also, in kube test, rmi pause_image to avoid nasty red warnings Also, ouch, fix a stupid that I introduced in containers#19878: the PODMAN command path got dropped from log messages. Signed-off-by: Ed Santiago <santiago@redhat.com>
Part of RUN-1906. Followup to containers#19878 (check stderr in system tests): allow_warnings() and require_warning() functions to make sure no unexpected messages fall through the cracks. Signed-off-by: Ed Santiago <santiago@redhat.com>
With few exceptions, commands that exit 0 should not emit any
messages with level=warning or =error. Let's start enforcing
that in run_podman.
Allow one-off exceptions, typically when we're testing an
actual warning condition (usual case: "podman stop" where it
times out to SIGKILL). Exceptions are specified via:

    run_podman 0+w subcommand...
               ^^^---- or, rarely, 0+e
"0" stands for "expect exit status 0", which is the default
so it's implicit anyway. The +w / +e (or even +we) is the
new part. I have added it to tests where necessary.
And, because life is what it is, add two global exceptions:

- Debian. Because runc has too many flakes.
- kube. Ditto. Kube commands emit lots of nasty error
  messages (yes, level=error) that don't seem to affect
  results.
Similar to #18442
Signed-off-by: Ed Santiago <santiago@redhat.com>