Podman-remote run should wait for exit code #3934

Merged: 2 commits merged into containers:master on Sep 13, 2019

Conversation

rhatdan
Member

@rhatdan rhatdan commented Sep 4, 2019

This change matches what is happening on the podman local side
and should eliminate a race condition.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>

@rhatdan
Member Author

rhatdan commented Sep 4, 2019

Hopefully fixes: #3870

@rhatdan
Member Author

rhatdan commented Sep 4, 2019

@mheon @baude @edsantiago PTAL

Member

@vrothberg vrothberg left a comment

LGTM, any chance we can test that?

@mheon
Member

mheon commented Sep 4, 2019

Code LGTM

@openshift-ci-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rhatdan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the "approved" label (Indicates a PR has been approved by an approver from all required OWNERS files) on Sep 4, 2019
@rhatdan
Member Author

rhatdan commented Sep 4, 2019

@edsantiago has a test that was revealing this error.

@edsantiago
Collaborator

@cevich can you try running this in your setup? (I've still never actually seen the problem)

@cevich can you also look at CI failures?

ERROR: (gcloud) The project property must be set to a valid project ID, not the project name

@cevich
Member

cevich commented Sep 4, 2019

LGTM, any chance we can test that?

Testing negatives/error-cases of a race is very difficult to do reliably in automation. I can manually run the change through the reproducer I used before; however, the results cannot be conclusive, since a code change nearly always causes a timing change 😕

@cevich
Member

cevich commented Sep 4, 2019

can you also look at CI failures?

@edsantiago that problem is not happening on master. Cirrus-CI (internally) is supposed to decrypt those ENCRYPTED[...] strings and pass the result as an env. var. value. Since the value is making it through like that, and there are no changes to the value in this PR, this must be a problem on the Cirrus-CI end of things.

If re-running doesn't clear up the problem, then I'll inform their support.

@edsantiago
Collaborator

Cirrus-CI (internally) is supposed to decrypt those ENCRYPTED[...] strings,

I think that's a red herring. My suspicion is that the problem is elsewhere, but Cirrus is showing ENCRYPTED because, um, it probably shouldn't show the real value in a public log

@cevich
Member

cevich commented Sep 4, 2019

ENCRYPTED because, um, it probably shouldn't show the real value in a public log

It mangles the output with SECRET for encrypted values.

@cevich
Member

cevich commented Sep 4, 2019

@rhatdan There's some problem here when I run Ed's reproducer using code from this PR on Fedora 29:

...cut...
STEP 4: COMMIT build_test
549471035c7a6927e8dc2c52e2a87456327bc0d1b561869c90ea5d0c1bd79289
+ pr run --rm build_test cat /myfile
+ podman-remote run --rm build_test cat /myfile
hi
Error: Virtual Read failed, 0: EOF
[root@cevich-fedora-29-libpod-5420341990522880 libpod]# 

On Fedora 30, it simply hangs after printing 'hi' and never returns (I waited 5 minutes). I'll look at the Cirrus-CI logs next...

@cevich
Member

cevich commented Sep 4, 2019

...ya, similar story in the integration tests: Timed out after 90.000s across the board, for all remote-client tests except cgroupv2.

@cevich
Member

cevich commented Sep 4, 2019

For reference, the VM images I'm using are:

fedora-30-libpod-5420341990522880
fedora-29-libpod-5420341990522880

(the same ones that were failing in #3870), using the reproducer that Ed wrote:

#!/bin/bash

set -xe

tmpdir=$(mktemp -d --tmpdir podman-remote-test.XXXXXX)
cat >$tmpdir/Dockerfile <<EOF
FROM quay.io/libpod/alpine_labels:latest
RUN apk add nginx
RUN echo hi >/myfile
EOF

pr() { podman-remote "$@"; }

while :;do
    pr build -t build_test --format=docker $tmpdir
    pr run --rm build_test cat /myfile
    pr rmi -f build_test
done

I run it like this from the repository root:

make podman podman-remote
make install PREFIX=/usr
systemctl enable io.podman.socket
systemctl enable io.podman.service
systemctl restart io.podman.socket
systemctl restart io.podman.service
chmod +x repro.sh
./repro.sh
...cut...

@rhatdan
Member Author

rhatdan commented Sep 5, 2019

@cevich @edsantiago Total rewrite of the original patch. It turns out the error, I believe, was on the server side: we were not waiting for a full exit, so I think the client side could exit before the container was cleaned up.

While working in this code, I also figured out why container exit codes were not being propagated.
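
To make the intended ordering concrete, here is a minimal, self-contained Go sketch of the flow described above. The names (Container, waitForExit, removeContainer, runContainer) are illustrative stand-ins, not the actual libpod or varlink API; the point is only that the reply to the client, carrying the real exit code, is produced after the wait and the cleanup have both finished.

package main

import "fmt"

// Container is a hypothetical stand-in for the server-side container handle.
type Container struct {
	ID       string
	ExitCode int
	Removed  bool
}

// waitForExit stands in for blocking until the container has fully exited.
func waitForExit(c *Container) int {
	// ... block until the runtime reports the container has exited ...
	return c.ExitCode
}

// removeContainer stands in for the server-side cleanup of a --rm container.
func removeContainer(c *Container) {
	c.Removed = true
}

// runContainer mirrors the intended flow: start, wait for a full exit,
// clean up, and only then return the exit code toward the client.
func runContainer(c *Container, rm bool) int {
	// start the container (elided) ...
	exitCode := waitForExit(c) // do not reply before this completes
	if rm {
		removeContainer(c) // cleanup finishes before the client hears back
	}
	return exitCode // propagated back to podman-remote as its own exit status
}

func main() {
	c := &Container{ID: "build_test", ExitCode: 0}
	fmt.Println("exit code:", runContainer(c, true), "removed:", c.Removed)
}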

@rhatdan
Member Author

rhatdan commented Sep 5, 2019

@TomSweeneyRedHat
Member

LGTM assuming happy tests.

@mheon
Member

mheon commented Sep 5, 2019

LGTM but I'd like a nod from @baude pre-merge

@mheon
Member

mheon commented Sep 5, 2019

System tests are failing for most remote client jobs. CGroups v2 remote succeeded, though?

@cevich
Member

cevich commented Sep 10, 2019

The major difference between the reproduction above and the previous one is the dnf update -y && reboot I ran. I'll get a fresh VM and try again once Matt's done looking at this one.

@cevich
Member

cevich commented Sep 11, 2019

Rebased onto #3985 just to be safe and spun up a VM without updating its packages. Running the reproducer above, I'm still getting Error: Virtual Read failed, 0: EOF when the script does podman-remote run --rm build_test cat /myfile.

Perhaps interestingly, this is also some kind of race, because when I change to podman-remote run -it --rm build_test cat /myfile, it sometimes works fine for 3-5 iterations before bombing. I also tried rebasing against master (093013b), but the behavior remains 😖

...cut...
STEP 3: RUN echo hi >/myfile
STEP 4: COMMIT build_test
6e2776951ef699a20d1fa766f2de116b88bada4d1dd44eafd6e682c4db83c73f
+ pr run -it --rm build_test cat /myfile
+ podman-remote run -it --rm build_test cat /myfile
hi
Error: Virtual Read failed, 0: EOF

@cevich
Member

cevich commented Sep 11, 2019

@baude Matt took a look at a VM behaving this way, and concluded:

i have no clue what's going on there. early varlink hangup probably

would you mind giving it a go (rebase this PR against master to be safe) when you have a chance?

(note: current CI results here are not helpful until rhatdan rebases the PR)

@cevich
Member

cevich commented Sep 11, 2019

Okay, the EOF issue is partially addressed by #3998, and I was able to confirm that the change in this PR does appear to fix the #3870 problem. So let's get this in.

LGTM

We have leaked the exit code numbers all over the code; this patch replaces the numbers with constants.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
This change matches what is happening on the podman local side
and should eliminate a race condition.

Also, exit codes on the server side should start to be returned to the client.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
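
The first commit message above is about replacing magic exit numbers with named constants. Below is a minimal sketch of that pattern, assuming the common podman/docker convention (125 for a podman-level error, 126 for a command that cannot be invoked, 127 for a command that is not found); the constant names are hypothetical, not necessarily the ones the patch introduces.

package main

import (
	"fmt"
	"os"
)

const (
	exitCodeGeneric      = 125 // podman itself failed
	exitCodeCannotInvoke = 126 // command found but could not be invoked
	exitCodeNotFound     = 127 // command not found in the container
)

// run stands in for running the container and returning its exit code.
func run() (int, error) {
	return 0, nil
}

func main() {
	code, err := run()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(exitCodeGeneric) // instead of a bare os.Exit(125) scattered around
	}
	os.Exit(code)
}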
@cevich
Member

cevich commented Sep 13, 2019

This is a new error in a start test that I haven't seen before, though it's Ubuntu, so maybe a flake?

@cevich
Member

cevich commented Sep 13, 2019

...digging a bit, the unable to find varlink pid message seems harmless, as it's part of the test cleaning up (test/e2e/libpod_suite_remoteclient_test.go:96). The ID of the test container is in fact 8f7a7fb5aa4f, so the podman ps -aq is definitely seeing the container after the expected non-zero podman-remote exit from test/e2e/start_test.go:110. I'm betting this is a simple race, where the remote has not yet gotten to do its container removal.

@mheon If the above is true, this race exists in many places in our integration tests. There are even some tests which assert the presence of remaining containers, and those could race with an erroneous removal to produce false-positive results.

Is there a way to detect whether the removal process has completed for a given container ID (like some file that should not exist, or something like that)? This would be the ideal solution in both (positive and negative) test scenarios, to prevent racing on an existence check. For example, test/utils/utils.go could delay until completion before calling ps := p.PodmanBase([]string{"ps", "-aq"}, true, false).
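
A hypothetical helper along these lines: poll until the container ID no longer shows up in ps -aq (or a timeout elapses) before asserting on its absence. The binary name, timeout, and polling interval are assumptions for illustration, not the project's test API.

package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// waitForRemoval returns true once id is no longer listed by `ps -aq`.
func waitForRemoval(podman string, id string, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		out, err := exec.Command(podman, "ps", "-aq").Output()
		if err == nil && !strings.Contains(string(out), id) {
			return true // removal has completed
		}
		time.Sleep(250 * time.Millisecond)
	}
	return false // still present: either a leak or the race under discussion
}

func main() {
	removed := waitForRemoval("podman-remote", "8f7a7fb5aa4f", 30*time.Second)
	fmt.Println("container removed:", removed)
}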

@cevich
Member

cevich commented Sep 13, 2019

uggg, sorry, I shouldn't tie up Dan's PR with more problems, let me open an issue for that race...

@cevich
Member

cevich commented Sep 13, 2019

...opened #4021

@cevich
Member

cevich commented Sep 13, 2019

(re-ran flaked test)

@mheon
Member

mheon commented Sep 13, 2019

After this PR, we should have a guarantee that the container is gone when the original Podman process exits.

@mheon
Member

mheon commented Sep 13, 2019

Never mind. We don't have an explicit remove in https://github.com/containers/libpod/blob/82ac0d8925dbb5aa738f1494ecb002eb6daca992/pkg/adapter/containers_remote.go#L462

We probably need to add one.

@rhatdan
Member Author

rhatdan commented Sep 13, 2019

@mheon the remove happens on the server side in the adapter code.
https://github.com/containers/libpod/pull/3934/files#diff-d6a6befd6e8af99104d67db5ea6786faR106
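
The client-side consequence of that server-side ordering can be sketched roughly as follows: podman-remote blocks on the remote call, and since the server (per the diff linked above) only replies after the container has exited and been cleaned up, the client cannot return early. remoteRun here is a hypothetical stand-in, not the real varlink client API.

package main

import (
	"fmt"
	"os"
)

// remoteRun stands in for the blocking round trip to the server: it does not
// return until the server has finished waiting on (and removing) the container.
func remoteRun(image string, args []string) (int, error) {
	// ... send the run request, block for the reply carrying the exit code ...
	return 0, nil
}

func main() {
	exitCode, err := remoteRun("build_test", []string{"cat", "/myfile"})
	if err != nil {
		fmt.Fprintln(os.Stderr, "Error:", err)
		os.Exit(125) // generic client-side failure
	}
	os.Exit(exitCode) // the container's own exit code, now propagated
}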

@rhatdan
Member Author

rhatdan commented Sep 13, 2019

I think this is ready to merge.

@mheon @giuseppe @vrothberg @TomSweeneyRedHat @cevich @edsantiago PTAL

@mheon
Member

mheon commented Sep 13, 2019

Are we guaranteed that the client doesn't exit until the server is done removing, though?
LGTM regardless

@cevich
Member

cevich commented Sep 13, 2019

/lgtm

@openshift-ci-robot added the "lgtm" label (Indicates that a PR is ready to be merged) on Sep 13, 2019
@cevich
Member

cevich commented Sep 13, 2019

Are we guaranteed that the client doesn't exit until the server is done removing, though?

Possibly not, that's why I opened #4021

@openshift-merge-robot merged commit 7875e00 into containers:master on Sep 13, 2019
@rhatdan
Member Author

rhatdan commented Sep 13, 2019

I believe we are guaranteed that the removal happens before the front end gets signaled.
