
podman-remote rmi: device or resource busy #3870

Closed
edsantiago opened this issue Aug 21, 2019 · 22 comments
Labels: locked - please file new issue/PR

Comments

@edsantiago
Collaborator

@cevich is seeing frequent system-test errors [1] with podman-remote

$  ./bin/podman-remote rmi -f build_test
# Error: error removing containers [10b343023c7d6187aa521fad47934b5d558adc53964ab491017e34d909782751] for image "8659137a98bd8c7f3528c1c2c09a0fc8115bc32ce7659bc7a1d31b7d73dd3029": could not remove container "10b343023c7d6187aa521fad47934b5d558adc53964ab491017e34d909782751": remove /var/lib/containers/storage/overlay-containers/10b343023c7d6187aa521fad47934b5d558adc53964ab491017e34d909782751/userdata/shm: device or resource busy

This is (I think) Ubuntu. I cannot reproduce on my Fedora laptop, even running the varlink server with nice 19 and even with cores disabled to emulate a slow system. My attempted reproducer is a simplified variant of the build system test:

#!/bin/bash

set -xe

tmpdir=$(mktemp -d --tmpdir podman-remote-test.XXXXXX)
cat >$tmpdir/Dockerfile <<EOF
FROM quay.io/libpod/alpine_labels:latest
RUN apk add nginx
RUN echo hi >/myfile
EOF

pr() { /path/to/podman-remote "$@"; }

while :; do
    pr build -t build_test --format=docker $tmpdir
    pr run --rm build_test cat /myfile
    pr rmi -f build_test
done

My only guess is that this might be a race condition with podman-remote run --rm: does anyone know if it is guaranteed to block until the container is removed?
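One quick way to probe that (a hedged sketch, not something I've verified on the failing VMs): check the container list immediately after run --rm returns. If run --rm truly blocks until removal, ps -aq should never print anything.

# Hedged probe sketch, using the pr() wrapper from the script above.
pr run --rm build_test true
leftover=$(pr ps -aq)
# If "run --rm" blocks until removal, $leftover is always empty;
# a non-empty value here would expose the race window.
[ -n "$leftover" ] && echo "race window: container(s) still present: $leftover"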

master @ 1ad8fe5. I'm sorry for not being able to reproduce it myself, but Chris seems to run into it constantly.

@cevich
Member

cevich commented Aug 21, 2019

Thanks Ed.

Thinking about this more, I'm not seeing those particular failures anywhere else, so there must be something special about PR #3754. That could simply be new/different VM images - including images with embedded bugs.

Note: The two failures are on F29 and F30 (not Ubuntu) images:

  • fedora-30-libpod-5420341990522880
  • fedora-29-libpod-5420341990522880

I'll try your reproducer (above) using those images vs what we have on master...

@cevich
Member

cevich commented Aug 21, 2019

@edsantiago good news: I was able to reproduce the error using your script above and the F30 VM image from my PR. Now trying the VM image from master (expecting no error)...

@cevich
Member

cevich commented Aug 21, 2019

...as expected, no error using the VM image from master (fedora-30-libpod-5751722641719296).

Excellent, so now (hopefully) it's a matter of comparing differences. Since the libpod code is common to both VMs and they're running in the same cloud environment, there must be a configuration or packaging difference, or something like that.

@cevich
Member

cevich commented Aug 21, 2019

Same on both VMs:

  • libvarlink-util-18-1.fc30.x86_64
  • container-selinux-2.113.0-1.dev.git4f7d6bb.fc30.noarch
  • runc-1.0.0-93.dev.gitb9b6cc6.fc30.x86_64
  • systemd-241-10.git511646b.fc30.x86_64
  • slirp4netns-0.4.0-4.git19d199a.fc30.x86_64
  • containernetworking-plugins-0.8.1-1.fc30.x86_64
  • containers-common-0.1.37-2.fc30.x86_64

hrmmmmm.

@cevich
Member

cevich commented Aug 21, 2019

Ran the reproducer again, and it worked fine for a while, failing after maybe 10 passes. So there's definitely some race condition here that's somehow more likely with the images in my PR than with the ones from master, even with the exact same libpod code 😖

Maybe I should try copying the podman and podman-remote binaries from the breaking VM over to the working VM, to see if the behavior moves/changes?

Not sure what else could be involved here, @mheon have any ideas?

@cevich
Member

cevich commented Aug 21, 2019

Confirmed: it's not the binaries; behavior remains the same after copying them from the breaking VM to the working VM.

I noticed one difference, though: the breaking VM is running kernel-5.2.8-200, while the working VM (from master) is on kernel-5.2.7-200. Going to try upgrading that kernel, rebooting, and seeing if the problem starts happening...

@cevich
Member

cevich commented Aug 21, 2019

JJF**($!!343$hg80*&#//**** 😠

Updated both VMs to kernel 5.2.9-200.fc30.x86_64 and rebooted, but the pattern remains: the problem reproduces on fedora-30-libpod-5420341990522880 but not on fedora-30-libpod-5751722641719296.

cevich added a commit to cevich/podman that referenced this issue Aug 29, 2019
@rhatdan
Member

rhatdan commented Aug 29, 2019

This could definitely cause a race.

pr build -t build_test --format=docker $tmpdir
pr run --rm build_test cat /myfile

When this command finishes, the image is still mounted. conmon will execute podman container cleanup, but only after the run command finishes.

pr rmi -f build_test

If this command executes before podman container cleanup finishes, you can get this error.
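If that theory holds, one workaround in the reproducer would be to poll until the container list drains before attempting the rmi. A minimal sketch, assuming the deferred cleanup does eventually remove the container (pr is the podman-remote wrapper from Ed's script):

# Hedged workaround sketch: give the deferred "podman container cleanup"
# time to finish by polling until no containers remain, then remove the image.
while [ -n "$(pr ps -aq)" ]; do
    sleep 0.5
done
pr rmi -f build_test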

@rhatdan
Member

rhatdan commented Aug 29, 2019

@mheon Is my assumption correct? Does podman container cleanup take care of the --rm as well?

@edsantiago
Collaborator Author

Thanks @rhatdan. If this is the case, would it be possible to have run --rm block until conmon finishes cleanup?

And if that's not possible, I'd like to request a clear documentation update stating something like:

 podman-remote run --rm actually means "remove the container some time in
 the future, not necessarily now", and is not actually guaranteed to remove
 the container. If you actually want to deterministically remove the container,
 perhaps because you want to rmi its parent image, please run [some sort
 of TBI command that blocks until container is removed]
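In the meantime, a deterministic pattern that sidesteps --rm entirely (a hedged sketch using the pr wrapper from the reproducer; rm only returns after the container is actually gone, so the image should then be free to delete):

# Hedged alternative sketch: create/start/rm explicitly instead of run --rm,
# so the container is guaranteed removed before the image removal.
cid=$(pr create build_test cat /myfile)
pr start --attach "$cid"
pr rm "$cid"
pr rmi -f build_test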

@rhatdan
Member

rhatdan commented Aug 29, 2019

Yes, I agree it should be cleaned up, but I am not sure my theory is correct.

@mheon
Member

mheon commented Aug 29, 2019

podman run --rm should, if attached (not run with -d), block until container removal is complete. I don't think there's a potential for races, either.

@rhatdan
Member

rhatdan commented Aug 30, 2019

Well @edsantiago is seeing what looks like a race when using the varlink connection.

@edsantiago
Collaborator Author

Small correction: I've never seen it; only @cevich has seen it.

@cevich
Member

cevich commented Aug 30, 2019

@mheon wrong answer. The correct answer is: Oh, I saw this problem yesterday and just merged a fix 😀

@mheon remember this is podman-remote -> podman. Perhaps podman-remote is exiting before podman/conmon finish their job?

With Ed's reproducer script, this reproduces very reliably for me (using hack/get_ci_vm.sh on the images mentioned). Would running in debug mode help at all? How can we prove/disprove Dan's theory? Or is there maybe a way we can tell the audit subsystem to log process exits (with timestamps)?
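On the audit idea, a minimal sketch (assuming auditd is running on the VM; the podman-race key name is just an example): a syscall rule on exit_group timestamps every process exit, which would let us order podman-remote's exit against conmon's cleanup.

# Log every process exit with a timestamp via the audit subsystem.
auditctl -a always,exit -F arch=b64 -S exit_group -k podman-race
# ...run the reproducer until it fails, then inspect the ordered exit events:
ausearch -k podman-race --start recent
# Remove the rule afterwards:
auditctl -d always,exit -F arch=b64 -S exit_group -k podman-race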

@cevich
Member

cevich commented Sep 3, 2019

@mheon I managed to catch this in the act by instrumenting the system-test that reproduces it:

--- a/test/system/070-build.bats
+++ b/test/system/070-build.bats
@@ -26,9 +26,11 @@ EOF
     PODMAN_TIMEOUT=240 run_podman build -t build_test --format=docker $tmpdir
     is "$output" ".*STEP 4: COMMIT" "COMMIT seen in log"
 
+    run_podman ps -a
     run_podman run --rm build_test cat /$rand_filename
     is "$output"   "$rand_content"   "reading generated file in image"
 
+    run_podman ps -a
     run_podman rmi -f build_test
 }
 

You can see what the test is doing in the diff. When it reproduces, it happens on that last rmi -f line:

[+0100s] not ok 28 podman build - basic test
[+0100s] # (from function `die' in file test/system/helpers.bash, line 264,
[+0100s] #  from function `run_podman' in file test/system/helpers.bash, line 164,
[+0100s] #  in test file test/system/070-build.bats, line 34)
[+0100s] #   `run_podman rmi -f build_test' failed with status 2
[+0100s] # $ ./bin/podman-remote rm --all --force
[+0100s] # $ ./bin/podman-remote images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
[+0100s] # quay.io/libpod/alpine_labels:latest   4fab981df737
[+0100s] # $ ./bin/podman-remote build -t build_test --format=docker /tmp/podman_bats.hPzLvX/build-test
[+0100s] # STEP 1: FROM quay.io/libpod/alpine_labels:latest
[+0100s] # STEP 2: RUN apk add nginx
[+0100s] # fetch http://dl-cdn.alpinelinux.org/alpine/v3.8/main/x86_64/APKINDEX.tar.gz
[+0100s] # fetch http://dl-cdn.alpinelinux.org/alpine/v3.8/community/x86_64/APKINDEX.tar.gz
[+0100s] # (1/2) Installing pcre (8.42-r0)
[+0100s] # (2/2) Installing nginx (1.14.2-r1)
[+0100s] # Executing nginx-1.14.2-r1.pre-install
[+0100s] # Executing busybox-1.28.4-r1.trigger
[+0100s] # OK: 6 MiB in 15 packages
[+0100s] # 27af317be9360cdb9cf9ec92e3852518e34831223001012fb28aaf496d2462ff
[+0100s] # STEP 3: RUN echo UkyW3wKhN0Uk8z1zdTzgjXxyUypI0WBjXmaKij34vb2vC9cHA2 > /UdFKEYczgNKHkBKXaNwL
[+0100s] # STEP 4: COMMIT build_test
[+0100s] # 102f5447866173d6e131165b85c0955f970cba4d5f2197cd40e1bd506824ce78
[+0100s] # $ ./bin/podman-remote ps -a
[+0100s] # CONTAINER ID  IMAGE  COMMAND  CREATED  STATUS  PORTS  NAMES
[+0100s] # $ ./bin/podman-remote run --rm build_test cat /UdFKEYczgNKHkBKXaNwL
[+0100s] # UkyW3wKhN0Uk8z1zdTzgjXxyUypI0WBjXmaKij34vb2vC9cHA2
[+0100s] # $ ./bin/podman-remote ps -a
[+0100s] # CONTAINER ID  IMAGE                        COMMAND               CREATED                 STATUS                             PORTS  NAMES
[+0100s] # e32470ce70ef  localhost/build_test:latest  cat /UdFKEYczgNKH...  Less than a second ago  Exited (0) Less than a second ago         keen_cohen
[+0100s] # $ ./bin/podman-remote rmi -f build_test
[+0100s] # Error: error removing containers [e32470ce70ef1fabf805c6db9d2bdb763bb0b802e79ac6d2a8b38e21171ac8e4] for image "102f5447866173d6e131165b85c0955f970cba4d5f2197cd40e1bd506824ce78": could not remove container "e32470ce70ef1fabf805c6db9d2bdb763bb0b802e79ac6d2a8b38e21171ac8e4": unlinkat /var/lib/containers/storage/overlay-containers/e32470ce70ef1fabf805c6db9d2bdb763bb0b802e79ac6d2a8b38e21171ac8e4/userdata/shm: device or resource busy
[+0100s] # [ rc=2 (** EXPECTED 0 **) ]
[+0100s] # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
[+0100s] # #| FAIL: exit code is 2; expected 0
[+0100s] # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

So clearly a race here is possible, but perhaps (somehow) more likely when podman-remote is being used?

Is it impossible for the run ... --rm to verify removal before exiting?

(ref. cirrus task)

cevich added a commit to cevich/podman that referenced this issue Sep 3, 2019
@mheon
Member

mheon commented Sep 3, 2019

Have we ever seen this without remote being involved? This should not be possible for local Podman.

cevich added a commit to cevich/podman that referenced this issue Sep 3, 2019
@cevich
Member

cevich commented Sep 3, 2019

@mheon As best I can recall, I have not seen it. Even with podman-remote it doesn't always happen. Is it possible for podman-remote to special-case this and verify removal happens before exiting (assuming there was no networking error)?

cevich added a commit to cevich/podman that referenced this issue Sep 3, 2019
@cevich changed the title from "[not reproducible] podman-remote rmi: device or resource busy" to "podman-remote rmi: device or resource busy" Sep 3, 2019
cevich added commits to cevich/podman that referenced this issue Sep 3, 2019
@rhatdan
Member

rhatdan commented Sep 4, 2019

Looking at the difference between pkg/adapter/containers.go and pkg/adapter/containers_remote.go: containers.go has a call to

if ecode, err := ctr.Wait(); err != nil {

and containers_remote.go does not. I would guess this is the issue.
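A quick behavioral check of that guess (a hedged sketch): the local client, which calls ctr.Wait(), should always leave an empty container list after run --rm, while the remote client may not.

podman run --rm quay.io/libpod/alpine_labels:latest true
podman ps -aq          # local path waits: expect no output
podman-remote run --rm quay.io/libpod/alpine_labels:latest true
podman-remote ps -aq   # without the Wait, this may still list the container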

@mheon
Member

mheon commented Sep 4, 2019

Oh. That sounds like it.

cevich added a commit to cevich/podman that referenced this issue Sep 4, 2019
@cevich
Member

cevich commented Sep 4, 2019

Oh that's a HUGE relief, thanks @rhatdan !

@mheon
Member

mheon commented Sep 13, 2019

Closing as #3934 is merged

@mheon mheon closed this as completed Sep 13, 2019
@github-actions bot added the locked - please file new issue/PR label Sep 23, 2023
@github-actions bot locked as resolved and limited conversation to collaborators Sep 23, 2023