Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

podman image tree: restore previous behavior #10222

Merged
merged 1 commit into from
May 12, 2021

Conversation

vrothberg
Copy link
Member

@vrothberg vrothberg commented May 5, 2021

The initial version of libimage changed the order of layers which has
now been restored to remain backwards compatible.

Further changes:

  • Fix a bug in the journald logging which requires to strip trailing
    new lines from the message. The system tests did not pass due to
    empty new lines. Triggered by changing the default logger to
    journald in containers/common.

  • Fix another bug in the journald logging which embedded the container
    ID inside the message rather than the specifid field. That surfaced
    in a preceeding whitespace of each log line which broke the system
    tests.

  • Alter the system tests to make sure that the k8s-file and the
    journald logging drivers are executed.

  • A number of e2e tests have been changed to force the k8s-file driver
    to make them pass when running inside a root container.

  • Increase the timeout in a kill test which seems to take longer now.
    Reasons are unknown. Tests passed earlier and no signal-related
    changes happend. It may be CI VM flake since some system tests but
    other flaked.

Signed-off-by: Valentin Rothberg rothberg@redhat.com

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 5, 2021
@openshift-ci-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vrothberg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 5, 2021
@vrothberg vrothberg changed the title WIP - podman image tree: restore previous behavior podman image tree: restore previous behavior May 6, 2021
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 6, 2021
@vrothberg
Copy link
Member Author

@containers/podman-maintainers PTAL

@edsantiago this restores the previous order of layers.

@rhatdan
Copy link
Member

rhatdan commented May 6, 2021

LGTM
@containers/podman-maintainers PTAL

@vrothberg
Copy link
Member Author

Looks like the journald change broke e2e tests, having a look now.

@vrothberg
Copy link
Member Author

@mheon @rhatdan, I am running out of time today and am on PTO tomorrow. The errors look related to your changes in c/common. Feel free to take over this PR (the image-tree related parts are done and green).

@vrothberg
Copy link
Member Author

Currently failing. Requires containers/common#540.

@vrothberg vrothberg force-pushed the image-tree branch 4 times, most recently from 40f3a24 to 9821f32 Compare May 11, 2021 13:58
@vrothberg
Copy link
Member Author

vrothberg commented May 11, 2021

This turned out to be more work. Fixed bugs in the journald driver and updated tests.

@edsantiago, please take a look at the changes.

@vrothberg vrothberg force-pushed the image-tree branch 2 times, most recently from b276ec1 to dbd57f2 Compare May 11, 2021 15:24
@vrothberg
Copy link
Member Author

@rhatdan, given the changes in this PR to get journald to pass CI, we should make sure to highlight that in the changelog.

@vrothberg
Copy link
Member Author

@containers/podman-maintainers PTAL

vrothberg added a commit to vrothberg/common that referenced this pull request May 12, 2021
Commit 98ff7e1 changed the default logging driver k8s-file to
journald.  The only consumer of the log-driver is Podman which I think
still needs some more time to stabilize.  Vendoring containers/common
into Podman has revealed quite some warts (see
containers/podman/pull/10222) which reduced my confidence level.

To resolve the chicken-egg-problem of maturing the journald driver, I
want to only partially revert commit 98ff7e1.  The built-in default
remains k8s-file while the containers.conf sets it to journald.  The
intention behind is to make sure that running systems are not impacted
but we can change Fedora to journald to increase coverage.

Once the confidence level is back to normal, we can change the default
to journald.  Latest before RHEL9.

Signed-off-by: Valentin Rothberg <rothberg@redhat.com>
@edsantiago
Copy link
Collaborator

# $ podman exec 97109c3ce4e6efe077b26124eb217d18708e15147841f9b054c7b66edbedd3b6 touch /stop
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #| FAIL: Timed out waiting for DONE from container
# #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is really bad. The fact that this PR is mucking with logging makes me suspect that there's something broken in logging.

@edsantiago
Copy link
Collaborator

And no, bumping the timeout to 15 seconds is not a solution

@vrothberg
Copy link
Member Author

# $ podman exec 97109c3ce4e6efe077b26124eb217d18708e15147841f9b054c7b66edbedd3b6 touch /stop
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #| FAIL: Timed out waiting for DONE from container
# #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is really bad. The fact that this PR is mucking with logging makes me suspect that there's something broken in logging.

Still trying to get a feeling for it. Increasing the timeout worked for most cases but not rootless. I don't yet see a relation to logging in this specific case of exec.

@vrothberg
Copy link
Member Author

And no, bumping the timeout to 15 seconds is not a solution

I'd appreciate help. I wanted to fix image-tree and ended up fixing things that are very unrelated but merged untested in c/common 🙈

@edsantiago
Copy link
Collaborator

Something is broken in logging. The timeout has nothing to do with anything; this should trigger in less than a second. If it doesn't, you can bump the timeout to an hour, it will not change anything, the test will still fail.

Maybe it's the change to journald logging? Maybe journald logging doesn't work? Maybe you can try:

--- a/test/system/130-kill.bats
+++ b/test/system/130-kill.bats
@@ -8,7 +8,7 @@ load helpers
 @test "podman kill - test signal handling in containers" {
     # Start a container that will handle all signals by emitting 'got: N'
     local -a signals=(1 2 3 4 5 6 8 10 12 13 14 15 16 20 21 22 23 24 25 26 64)
-    run_podman run -d $IMAGE sh -c \
+    run_podman run --log-driver=k8s-file -d $IMAGE sh -c \
         "for i in ${signals[*]}; do trap \"echo got: \$i\" \$i; done;
         echo READY;
         while ! test -e /stop; do sleep 0.05; done;

If that fixes it, then the problem is with the journald driver, and I kind of think it is super important to find and fix that bug.

@vrothberg
Copy link
Member Author

vrothberg commented May 12, 2021

Thanks, Ed!

Maybe you can try:

Yes. I did that in the last push.

If that fixes it, then the problem is with the journald driver, and I kind of think it is super important to find and fix that bug.

Agreed. I find it curious though that it started flaking today. The only thing that changed since yesterday is the diff you proposed.

@edsantiago
Copy link
Collaborator

Basically, what seems to be happening is that podman logs -f CONTAINER is not spitting out the final output line from a container.

UPDATE: I have a reproducer.

$ while :;do ./bin/podman run --log-driver=journald -d --name foo quay.io/libpod/testimage:20210427 sh -c 'echo hi;sleep 2;echo bye';./bin/podman logs -f foo;./bin/podman rm foo;done
caf4dce7ce3ac5d08da339a8c0ecd4ec97df498cd472fbff61182352e71cf831
hi     <---- there is no subsequent bye!
caf4dce7ce3ac5d08da339a8c0ecd4ec97df498cd472fbff61182352e71cf831
08f58203f738027300d651fed861f1673eb8a1a953b869c788d905664b709938
hi
bye

@vrothberg
Copy link
Member Author

Thank you so much, @edsantiago! That gives me something to chew on.

@vrothberg vrothberg changed the title podman image tree: restore previous behavior WIP - podman image tree: restore previous behavior May 12, 2021
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 12, 2021
The initial version of libimage changed the order of layers which has
now been restored to remain backwards compatible.

Further changes:

 * Fix a bug in the journald logging which requires to strip trailing
   new lines from the message.  The system tests did not pass due to
   empty new lines.  Triggered by changing the default logger to
   journald in containers/common.

 * Fix another bug in the journald logging which embedded the container
   ID inside the message rather than the specifid field.  That surfaced
   in a preceeding whitespace of each log line which broke the system
   tests.

 * Alter the system tests to make sure that the k8s-file and the
   journald logging drivers are executed.

 * A number of e2e tests have been changed to force the k8s-file driver
   to make them pass when running inside a root container.

 * Increase the timeout in a kill test which seems to take longer now.
   Reasons are unknown.  Tests passed earlier and no signal-related
   changes happend.  It may be CI VM flake since some system tests but
   other flaked.

Signed-off-by: Valentin Rothberg <rothberg@redhat.com>
@vrothberg vrothberg changed the title WIP - podman image tree: restore previous behavior podman image tree: restore previous behavior May 12, 2021
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 12, 2021
@vrothberg
Copy link
Member Author

@edsantiago, I had a look and analyzed it. The race condition is nasty and not easy to fix. I did what I could in the given time but opened #10323 to track it and dropped a FIXME in the code (see https://github.com/containers/podman/pull/10222/files#diff-20cc30e1cdf302ef7404e5923eada3912c68c8b8943c0a7a0a834b29236eba69R92).

FWIW, I opened containers/common#546 earlier today to revert the journald defaulting in c/common.

@vrothberg
Copy link
Member Author

@rhatdan FYI

@edsantiago
Copy link
Collaborator

Reluctant LGTM. This journald thing is making me reeeeeeeeally uncomfortable.

@vrothberg
Copy link
Member Author

vrothberg commented May 12, 2021 via email

@edsantiago
Copy link
Collaborator

Tests green. @containers/podman-maintainers PTAL

@rhatdan
Copy link
Member

rhatdan commented May 12, 2021

/approve
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 12, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 12, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rhatdan, vrothberg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit d6507fc into containers:master May 12, 2021
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 23, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants