Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sys tests: run_podman: check for unwanted warnings/errors #19878

Merged

Conversation

edsantiago
Copy link
Member

With few exceptions, commands that exit 0 should not emit any
messages with level=warning or =error. Let's start enforcing
that in run_podman.

Allow one-off exceptions, typically when we're testing an
actual warning condition (usual case: "podman stop" where it
times out to SIGKILL). Exceptions are specified via:

run_podman 0+w subcommand...
           ^^^---- or, rarely, 0+e

"0" stands for "expect exit status 0", which is the default
so it's implicit anyway. The +w / +e (or even +we) is the
new part. I have added it to tests where necessary.

And, because life is what it is, add two global exceptions:

  • Debian. Because runc has too many flakes.
  • kube. Ditto. Kube commands emit lots of nasty error
    messages (yes, level=error) that don't seem to affect
    results.

Similar to #18442

Signed-off-by: Ed Santiago santiago@redhat.com

None

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Sep 6, 2023
@edsantiago
Copy link
Member Author

Been testing yesterday & today in #17831, from which I concluded that we can't possibly enable this on Debian or on kube-anything.

Marking WIP because I want a lot more runs, because I suspect there are more flakes to be found. And because I have some questions that need addressing.

Copy link
Member

@vrothberg vrothberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM assuming tests go green.

This is really valuable, thank you!

Copy link
Member Author

@edsantiago edsantiago left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My questions.

Comment on lines +184 to 185
run_podman 0+e run --rmi --rm $NONLOCAL_IMAGE /bin/true
is "$output" ".*image is in use by a container" "--rmi should warn that the image was not removed"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this is "e", that is, the warning is actually level=error. Should it be?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That LGTM. The man page states: "After exit of the container, remove the image unless another container is using it." It seems like a good idea to log when the image cannot be removed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Full agreement on "there should be a warning". My question is, should it be level=error?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks for clarifying. I think a warning would be better than an error. It's perfectly fine behavior which does not classify as an error IMO.

test/system/030-run.bats Show resolved Hide resolved
@@ -27,7 +27,10 @@ load helpers
run_podman wait $cid_none_implicit $cid_none_explicit $cid_on_failure

run_podman rm $cid_none_implicit $cid_none_explicit $cid_on_failure
run_podman stop -t 1 $cid_always
run_podman 0+w stop -t 1 $cid_always
if ! is_remote; then
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The warning is never seen in podman-remote. Should it be?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally yes but it's technically very difficult. logrus logs are always shown on the server side, and never on the client side.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack, thanks.

@@ -245,7 +245,7 @@ function teardown() {
assert "$mac2" == "$mac1" "mac after restore should match"

# restart the container we should get a new ip/mac because they are not static
run_podman restart $cid
run_podman 0+w restart $cid
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

restart seems to invoke stop which then does the timeout-sigkill thing. I just want to point this out in case it is not 100% common knowledge. Also want to point out that this costs us 10 seconds per restart, so, 40 extra seconds of CI time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we change the default stop-timeout of the containers to 0?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if you mean in the tests, or in restart, but either way that's not something I'm confident doing on my own.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't investigate it yet but I guess that creating/running the containers with --stop-timeout will do the job.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't investigated fully either, because that code is scary, but at least some of those containers are sleep inf, which is a bad idea all around. Changing to top would solve some of these. But again, I'm just not familiar enough with the purpose behind these tests, and don't feel comfortable making that change.

Comment on lines +342 to +343
# stdout is only emitted upon error; this printf is to help in debugging
printf "\n%s %s %s\n" "$(timestamp)" "$_LOG_PROMPT" "$*"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is completely unrelated, I'm just sneaking it in. It adds an extra newline, which has greatly helped me scan error logs for most-recent-podman-command.

Comment on lines +393 to +395
# FIXME: don't do this on Debian: runc is way, way too flaky:
# FIXME: #11784 - lstat /sys/fs/.../*.scope: ENOENT
# FIXME: #11785 - cannot toggle freezer: cgroups not configured
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any way around this

Comment on lines +397 to +401
# FIXME: All kube commands emit unpredictable errors:
# "Storage for container <X> has been removed"
# "no container with ID <X> found in database"
# These are level=error but we still get exit-status 0.
# Just skip all kube commands completely
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This, though, really bothers me. podman kube is noisy, scary-noisy. I would really like us to consider fixing that.

@@ -1058,7 +1058,12 @@ $IMAGE--c_ok" \
"ls /dev/tty[0-9] with --systemd=always: should have no ttyN devices"

# Make sure run_podman stop supports -1 option
run_podman stop -t -1 $cid
# FIXME: why is there no signal name here? Should be 'StopSignal XYZ'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks like a fart. Especially the double white space screams for the value being printed to differ from the one actually being used. For instance, the value being an empty string as printed which is later on being normalized to SIGTERM.

Likely needs a dedicated issue.

@@ -1058,7 +1058,12 @@ $IMAGE--c_ok" \
"ls /dev/tty[0-9] with --systemd=always: should have no ttyN devices"

# Make sure run_podman stop supports -1 option
run_podman stop -t -1 $cid
# FIXME: why is there no signal name here? Should be 'StopSignal XYZ'
# FIXME: do we really really mean to say FFFFFFFFFFFFFFFF here???
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand the question. Can you elaborate?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I don't know the reasoning behind accepting stop -t -1. It seems meaningless to me, and even more meaningless to convert that to uint64 and display it as such.

@edsantiago
Copy link
Member Author

CI passed. I rebased & repushed because I still expect to see occasional flakes, and don't want this to merge if it's going to cause annoyance.

@rhatdan
Copy link
Member

rhatdan commented Sep 6, 2023

Passed again...

@edsantiago
Copy link
Member Author

...and again. Why don't flakes happen when we want them to?

@edsantiago edsantiago force-pushed the bats_check_stderr branch 2 times, most recently from e8f27cb to f87e79a Compare September 7, 2023 01:01
@edsantiago
Copy link
Member Author

Sigh. Two more successful runs. Why does that make me nervous?

run_podman run -d --network $netname $IMAGE sleep inf
run_podman run -d --network $netname $IMAGE top
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed all sleep infs to top. This eliminates the need for 0+w because top is interruptible.

If there is some important reason for using sleep; if checkpoint/restore do not work properly with top; please speak now.

With few exceptions, commands that exit 0 should not emit any
messages with level=warning or =error. Let's start enforcing
that in run_podman.

Allow one-off exceptions, typically when we're testing an
actual warning condition (usual case: "podman stop" where it
times out to SIGKILL). Exceptions are specified via:

    run_podman 0+w subcommand...
               ^^^---- or, rarely, 0+e

"0" stands for "expect exit status 0", which is the default
so it's implicit anyway. The +w / +e (or even +we) is the
new part. I have added it to tests where necessary.

And, because life is what it is, add two global exceptions:

  - Debian. Because runc has too many flakes.
  - kube. Ditto. Kube commands emit lots of nasty error
    messages (yes, level=error) that don't seem to affect
    results.

Similar to containers#18442

Signed-off-by: Ed Santiago <santiago@redhat.com>
@vrothberg
Copy link
Member

vrothberg commented Sep 7, 2023

Sigh. Two more successful runs. Why does that make me nervous?

The empirical rule: once is never, twice is always. Let's go!

@edsantiago edsantiago changed the title WIP: sys tests: run_podman: check for unwanted warnings/errors sys tests: run_podman: check for unwanted warnings/errors Sep 7, 2023
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 7, 2023
@edsantiago
Copy link
Member Author

Gulp. I guess if we haven't seen flakes in this many runs, it probably means that future test failures will be desirable (real issues). I'd still like more eyeballs on this, but am OK with proceeding toward merge.

I will file issues today for the topics addressed above, and I'm working on a these-warnings-are-ok check so that undesired errors don't creep into the 0+w blocks.

Copy link
Member

@vrothberg vrothberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 7, 2023
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 7, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 7, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago, vrothberg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [edsantiago,vrothberg]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mheon
Copy link
Member

mheon commented Sep 7, 2023

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 7, 2023
@openshift-merge-robot openshift-merge-robot merged commit 0c72726 into containers:main Sep 7, 2023
92 checks passed
@edsantiago edsantiago deleted the bats_check_stderr branch September 7, 2023 15:35
edsantiago added a commit to edsantiago/libpod that referenced this pull request Sep 7, 2023
PR containers#19878 (checking for warnings in system tests) broke upgrade tests.

Reason: my long-ago "optimization" in which, if a PR touches only
tests in X, do not run tests in Y. Unfortunately, upgrade tests
rely on code in the system-test directory. I don't know if this
is fixable; nor if it's an acceptable tradeoff. Please discuss.

Sorry, everyone.

Signed-off-by: Ed Santiago <santiago@redhat.com>
edsantiago added a commit to edsantiago/libpod that referenced this pull request Sep 19, 2023
Followup to containers#20016:
- remove obsolete (misleading) comment
- prune dangling <none>:<none> image

Also, in kube test, rmi pause_image to avoid nasty red warnings

Also, ouch, fix a stupid that I introduced in containers#19878: the PODMAN
command path got dropped from log messages.

Signed-off-by: Ed Santiago <santiago@redhat.com>
edsantiago added a commit to edsantiago/libpod that referenced this pull request Sep 19, 2023
Part of RUN-1906.

Followup to containers#19878 (check stderr in system tests): allow_warnings()
and require_warning() functions to make sure no unexpected messages
fall through the cracks.

Signed-off-by: Ed Santiago <santiago@redhat.com>
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Dec 7, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 7, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. release-note-none
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants