
Container storage leak when the podman process is SIGKILL #3906

Closed
TristanCacqueray opened this issue Aug 29, 2019 · 21 comments · Fixed by #4493
Labels: do-not-close, kind/bug, locked - please file new issue/PR, stale-issue


@TristanCacqueray (Contributor)

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

When the podman process is SIGKILLed midway through cleanup, the container storage leaks and it is not obvious how to clean it up.

Steps to reproduce the issue:

Run this script:

#!/bin/python3
# Reproducer: run a container with `podman run --rm`, then on Ctrl-C send
# podman a SIGTERM immediately followed by a SIGKILL, interrupting its
# cleanup midway.

import subprocess, time

argv = ["podman", "run", "--rm", "--name", "test-server",
        "-e", "PYTHONUNBUFFERED=1",
        "fedora", "python3", "-m", "http.server"]

while True:
    try:
        print("Press ctrl-C if http server start...")
        proc = subprocess.Popen(argv)
        try:
            proc.communicate()
        except KeyboardInterrupt:
            proc.terminate()  # SIGTERM starts podman's cleanup...
            proc.kill()       # ...and SIGKILL cuts it off midway
            raise
    except KeyboardInterrupt:
        # The container may survive the killed podman process; kill it by name.
        subprocess.Popen(["podman", "kill", "test-server"]).wait()
    time.sleep(1)

Sample output:

$ python3 repro.py 
Press ctrl-C if http server start...
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
^C
Keyboard interrupt received, exiting.
Error: no container with name or ID test-server found: no such container
Press ctrl-C if http server start...
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
^C
Keyboard interrupt received, exiting.
Error: no container with name or ID test-server found: no such container
Press ctrl-C if http server start...
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
^C
Keyboard interrupt received, exiting.
Error: no container with name or ID test-server found: no such container
Press ctrl-C if http server start...
Error: error creating container storage: the container name "test-server" is already in use by "478dbcf9e5f5d262ad6b21ad3fe94907260c086b718ce29d54d05a311caf2abf". You have to remove that container to be able to reuse that name.: that name is already in use
Press ctrl-C if http server start...
Error: error creating container storage: the container name "test-server" is already in use by "478dbcf9e5f5d262ad6b21ad3fe94907260c086b718ce29d54d05a311caf2abf". You have to remove that container to be able to reuse that name.: that name is already in use

Describe the results you received:

Once this happens, the container storage needs to be removed manually before the test-server name can be reused:

$ podman container inspect test-server
Error: error looking up container "test-server": no container with name or ID test-server found: no such container
$ podman container inspect 478dbcf9e5f5d262ad6b21ad3fe94907260c086b718ce29d54d05a311caf2abf
Error: error looking up container "478dbcf9e5f5d262ad6b21ad3fe94907260c086b718ce29d54d05a311caf2abf": no container with name or ID 478dbcf9e5f5d262ad6b21ad3fe94907260c086b718ce29d54d05a311caf2abf found: no such container
$ podman rm test-server
Error: no container with name or ID test-server found: no such container
$ podman rm -f test-server
Error: no container with name or ID test-server found: no such container
$ podman rm -f --storage test-server
test-server

Describe the results you expected:

podman inspect should report that the container storage still exists, instead of saying "no such container".

Additional information you deem important (e.g. issue happens only occasionally):

The reproducer is arguably incorrect since it SIGKILLs podman while podman is cleaning up the container, but perhaps podman could remove the storage first?
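
As an aside for readers hitting this: a minimal cleanup sketch based only on the podman rm behavior shown in the transcript above; the helper name is made up.

#!/usr/bin/env python3
# Sketch: free a name held by leaked container storage, using the
# `podman rm --storage` fallback demonstrated in this report.
import subprocess

def remove_leaked(name_or_id):
    # Normal removal fails with "no such container" when only the
    # c/storage record is left behind.
    rm = subprocess.run(["podman", "rm", "-f", name_or_id],
                        capture_output=True, text=True)
    if rm.returncode != 0 and "no such container" in rm.stderr:
        # Fall back to removing the container directly from c/storage.
        subprocess.run(["podman", "rm", "-f", "--storage", name_or_id],
                       check=True)

remove_leaked("test-server")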

Output of podman version:

Version:            1.4.5-dev
RemoteAPI Version:  1
Go Version:         go1.12.2
OS/Arch:            linux/amd64

Output of podman info --debug:

debug:
  compiler: gc
  git commit: ""
  go version: go1.12.2
  podman version: 1.4.5-dev
host:
  BuildahVersion: 1.9.0
  Conmon:
    package: podman-1.5.1-3.fc30.x86_64
    path: /usr/libexec/podman/conmon
    version: 'conmon version 2.0.0, commit: d728afa06cd2df86a27f32a4692c7099a56acc97-dirty'
  Distribution:
    distribution: fedora
    version: "30"
  MemFree: 11724386304
  MemTotal: 16815677440
  OCIRuntime:
    package: containerd.io-1.2.6-3.3.fc30.x86_64
    path: /usr/bin/runc
    version: |-
      runc version 1.0.0-rc8
      commit: 425e105d5a03fabd737a126ad93d62a9eeede87f
      spec: 1.0.1-dev
  SwapFree: 0
  SwapTotal: 0
  arch: amd64
  cpus: 8
  hostname: ocp.rdocloud
  kernel: 5.2.9-200.fc30.x86_64
  os: linux
  rootless: true
  uptime: 1h 31m 26.12s (Approximately 0.04 days)
registries:
  blocked: null
  insecure: null
  search:
  - docker.io
  - registry.fedoraproject.org
  - quay.io
  - registry.access.redhat.com
  - registry.centos.org
store:
  ConfigFile: /home/fedora/.config/containers/storage.conf
  ContainerStore:
    number: 9
  GraphDriverName: vfs
  GraphOptions: null
  GraphRoot: /home/fedora/.local/share/containers/storage
  GraphStatus: {}
  ImageStore:
    number: 6
  RunRoot: /tmp/1000
  VolumePath: /home/fedora/.local/share/containers/storage/volumes

Package info (e.g. output of rpm -q podman or apt list podman):

podman-1.5.1-3.fc30.x86_64

Additional environment details (AWS, VirtualBox, physical, etc.):

The host is running in OpenStack

@openshift-ci-robot added the kind/bug label Aug 29, 2019
@vrothberg (Member) commented Aug 29, 2019

Hi @TristanCacqueray, thanks for opening the issue. There's something odd with the version, since podman version claims 1.4.5-dev but the package is podman-1.5.1-3.fc30.x86_64. I have the very same package installed on my machine, but podman returns the correct version.

Can you check if you used the correct binary?

@TristanCacqueray (Contributor, Author)

Oops, it was a devel version in ~/.local/bin... Using the system podman resulted in the same behavior. Here is the new debug info:

Version:            1.5.1
RemoteAPI Version:  1
Go Version:         go1.12.7
OS/Arch:            linux/amd64
debug:
  compiler: gc
  git commit: ""
  go version: go1.12.7
  podman version: 1.5.1
host:
  BuildahVersion: 1.10.1
  Conmon:
    package: podman-1.5.1-3.fc30.x86_64
    path: /usr/libexec/podman/conmon
    version: 'conmon version 2.0.0, commit: d728afa06cd2df86a27f32a4692c7099a56acc97-dirty'
  Distribution:
    distribution: fedora
    version: "30"
  MemFree: 11273560064
  MemTotal: 16815677440
  OCIRuntime:
    package: containerd.io-1.2.6-3.3.fc30.x86_64
    path: /usr/bin/runc
    version: |-
      runc version 1.0.0-rc8
      commit: 425e105d5a03fabd737a126ad93d62a9eeede87f
      spec: 1.0.1-dev
  SwapFree: 0
  SwapTotal: 0
  arch: amd64
  cpus: 8
  eventlogger: file
  hostname: ocp.rdocloud
  kernel: 5.2.9-200.fc30.x86_64
  os: linux
  rootless: true
  uptime: 1h 45m 53.63s (Approximately 0.04 days)
registries:
  blocked: null
  insecure: null
  search:
  - docker.io
  - registry.fedoraproject.org
  - quay.io
  - registry.access.redhat.com
  - registry.centos.org
store:
  ConfigFile: /home/fedora/.config/containers/storage.conf
  ContainerStore:
    number: 10
  GraphDriverName: vfs
  GraphOptions: null
  GraphRoot: /home/fedora/.local/share/containers/storage
  GraphStatus: {}
  ImageStore:
    number: 6
  RunRoot: /tmp/1000
  VolumePath: /home/fedora/.local/share/containers/storage/volumes

@mheon (Member) commented Aug 29, 2019

If we finally have a consistent reproducer for this, that's a very good thing - removing storage happens before containers are evicted from the database, so figuring out why it's failing but the container is still not in the Libpod DB might let us fix this for good.

On inspect showing the container: I would not expect inspect to be able to locate containers that are not in our database. There is some talk of adding a podman ps --storage that would display containers in storage (in a somewhat limited form, as we can't get status, etc.).
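
(As an illustration of what such a listing involves, here is a rough sketch that reads c/storage's containers.json directly, bypassing the libpod DB. The file path comes from the GraphRoot in the report above; the overlay-containers directory name and the id/names fields are assumptions about the c/storage layout, and the reporter's vfs driver would use a vfs-containers directory instead.)

import json
from pathlib import Path

# GraphRoot from the report above (rootless); rootful installs typically
# use /var/lib/containers/storage.
graph_root = Path.home() / ".local/share/containers/storage"
containers_json = graph_root / "overlay-containers" / "containers.json"

# Lists every container known to c/storage, including ones that are
# missing from the libpod database.
for entry in json.loads(containers_json.read_text()):
    print(entry.get("id"), entry.get("names"))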

@vrothberg (Member)

@mheon, I can share some of my cycles. Want me to have a look?

@TristanCacqueray (Contributor, Author)

Here is a new sample log using system podman:

$ python3 repro.py 
Press ctrl-C if http server start...
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
^C
Keyboard interrupt received, exiting.
Error: no container with name or ID test-server found: no such container
Press ctrl-C if http server start...
Error: error creating container storage: the container name "test-server" is already in use by "1c5629586bd866f80c05cc90d38b6e395c0114aac01c000446c52dfe3708d841". You have to remove that container to be able to reuse that name.: that name is already in use
Press ctrl-C if http server start...
Error: error creating container storage: the container name "test-server" is already in use by "1c5629586bd866f80c05cc90d38b6e395c0114aac01c000446c52dfe3708d841". You have to remove that container to be able to reuse that name.: that name is already in use
Press ctrl-C if http server start...
^CError: no container with name or ID test-server found: no such container
^CTraceback (most recent call last):
  File "repro.py", line 22, in <module>
    time.sleep(1)
KeyboardInterrupt
$ podman container inspect 1c5629586bd866f80c05cc90d38b6e395c0114aac01c000446c52dfe3708d841
Error: error looking up container "1c5629586bd866f80c05cc90d38b6e395c0114aac01c000446c52dfe3708d841": no container with name or ID 1c5629586bd866f80c05cc90d38b6e395c0114aac01c000446c52dfe3708d841 found: no such container
$ podman container inspect test-server
Error: error looking up container "test-server": no container with name or ID test-server found: no such container
$ podman rm test-server
Error: no container with name or ID test-server found: no such container
$ podman rm 1c5629586bd866f80c05cc90d38b6e395c0114aac01c000446c52dfe3708d841
Error: no container with name or ID 1c5629586bd866f80c05cc90d38b6e395c0114aac01c000446c52dfe3708d841 found: no such container
$ podman rm 1c5629586bd866f80c05cc90d38b6e395c0114aac01c000446c52dfe3708d841^C
$ podman run -it --rm --name test-server fedora
Error: error creating container storage: the container name "test-server" is already in use by "1c5629586bd866f80c05cc90d38b6e395c0114aac01c000446c52dfe3708d841". You have to remove that container to be able to reuse that name.: that name is already in use
$ podman rm -f 1c5629586bd866f80c05cc90d38b6e395c0114aac01c000446c52dfe3708d841
Error: no container with name or ID 1c5629586bd866f80c05cc90d38b6e395c0114aac01c000446c52dfe3708d841 found: no such container
$ podman rm -f test-server
Error: no container with name or ID test-server found: no such container
$ podman rm -f --storage test-server
test-server
$ podman run -it --rm --name test-server fedora echo it is working
it is working

@TristanCacqueray (Contributor, Author)

@mheon perhaps the podman run failure in that situation could hint at using "podman rm --storage" or something. It currently says "You have to remove that container to be able to reuse that name.: that name is already in use".
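
(A client-side sketch of that hint, relying only on the error text quoted in this thread: pull the blocking container ID out of the "already in use by" message and retry after podman rm --storage. The regex targets the message format shown above, which is not a stable interface.)

import re
import subprocess

def run_with_storage_fallback(argv):
    proc = subprocess.run(argv, capture_output=True, text=True)
    # Matches: ... is already in use by "<64-hex-char id>" ...
    m = re.search(r'already in use by "([0-9a-f]{64})"', proc.stderr)
    if m:
        subprocess.run(["podman", "rm", "-f", "--storage", m.group(1)],
                       check=True)
        proc = subprocess.run(argv, capture_output=True, text=True)
    return proc

run_with_storage_fallback(["podman", "run", "--rm", "--name", "test-server",
                           "fedora", "echo", "hello"])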

@holser commented Sep 3, 2019

podman should treat /var/lib/containers/storage/overlay-containers/containers.json with ACID semantics. SIGTERM or SIGKILL shouldn't corrupt containers.json. Having containers.lock is not enough to implement ACID.

@mheon (Member) commented Sep 3, 2019

I haven't seen any corrupted containers.json yet. Here, we're seeing a failure to remove for reasons we've yet to determine - but the storage metadata is intact. The container can be removed, but later.

@holser If you have a corrupted containers.json that's a separate issue - and significantly more serious, because that sounds like it would break storage a lot worse than this does.

@mheon self-assigned this Sep 3, 2019
@mheon (Member) commented Sep 5, 2019

I've identified something that may be part of the problem - Podman may not be unmounting the SHM mount before the removal. I'm adding some patches that might address it to #3931

openstack-gerrit pushed a commit to openstack/openstack that referenced this issue Sep 6, 2019
* Update tripleo-heat-templates from branch 'master'
  - Merge "container-puppet: run podman rm with --storage"
  - container-puppet: run podman rm with --storage
    
    There is an ongoing bug in libpod where the removal of a container can
    randomly (probably under heavy load) leak and fail to finish the tasks,
    leaving leftovers in the podman database, which causes a container
    creation with the same name to fail.
    
    It is related to this issue:
    containers/podman#3906
    
    And originally reported here as well:
    https://bugzilla.redhat.com/show_bug.cgi?id=1747885
    
    This patch is a mitigation that has been validated by the Container Team
    where we use --storage option to force the low level removal of the
    container by using containers/storage project to really remove all the
    data.
    
    Closes-Bug: #1840691
    Change-Id: I711f460bd51747c3985b95784d19560d1be2028a
openstack-gerrit pushed a commit to openstack-archive/tripleo-heat-templates that referenced this issue Sep 6, 2019 (same commit message as above)
openstack-gerrit pushed a commit to openstack-archive/tripleo-heat-templates that referenced this issue Sep 6, 2019 (same commit message as above; cherry picked from commit 9e2a971)
@mheon (Member) commented Sep 23, 2019

I'm going to test this now that #3931 is merged and see if we can call it fixed

@mheon (Member) commented Sep 23, 2019

Does not appear to be fixed. Looking deeper...

@mheon (Member) commented Sep 23, 2019

SHM is still mounted, but I'm not seeing any error messages out of Podman.

@mheon (Member) commented Sep 23, 2019

Alright, I think the reproducer here has introduced a weird race.

podman run --rm invokes container removal twice: once in the Conmon cleanup process, once in the main podman process. This duplication is to ensure that, if the user detaches from the container, it will still be removed in the background; but if the container remains alive, it is guaranteed to be removed before podman run --rm exits. With our locking model, this is quite safe - one of the processes will lose, but the resulting "no such container" error is safely handled, and the container is definitely gone.

In this case, podman run --rm appears to be consistently winning the race as to who gets to remove the container; which makes sense, because it doesn't need to launch a fresh process. It begins container removal, successfully deletes it from the database, and then is immediately whacked by a SIGKILL from the reproducer before it can completely clean up the mount.

The cleanup process then arrives and attempts to remove the container, but it's already gone from the database; removing the database entry is one of the first steps we take, to ensure we don't leave unusable, partially-deleted containers around. So podman cleanup can't find the container, determines it has been removed by the other Podman process, and doesn't complain much.

I don't know how well we can handle this. If podman takes a SIGKILL in the middle of removing a container, things are going to be left undone, so we're left with the decision of whether to leave partially-removed, unusable containers around in podman ps, or potentially leave storage containers around that block names from re-use.

Perhaps the simplest takeaway here is to not whack podman processes with SIGKILL unless they are actually stuck and timing out (well, really, Podman processes that are expected to remove containers - podman run --rm, podman rm).

I'll think about this more, but I don't know if there's an easy solution on our end.

@mheon (Member) commented Sep 23, 2019

(For reference: adding a time.sleep(1) before the proc.kill() completely fixes the bug for me.)
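
Concretely, that one-line change in the reproducer's interrupt handler (excerpt; subprocess and time are imported at the top of the script):

try:
    proc.communicate()
except KeyboardInterrupt:
    proc.terminate()
    time.sleep(1)  # give `podman run --rm` time to finish removing the container
    proc.kill()
    raise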
Podman will handle other signals via sig-proxy, but we can't catch SIGKILL, and can't sanely handle the situation.

@bogdando commented Oct 3, 2019

Please note that we cannot, in general, protect podman from a sudden SIGKILL. AFAIK a fenced controller would end up in the same situation: a forced power-off has the same effect as a SIGKILL. systemd might also send SIGKILL when it times out shutting down a container via its service unit.

So there needs to be some transaction replaying or similar. I'm not sure, but Galera DB, for example, recovers well from such crashes...

@holser commented Oct 3, 2019

@mheon SIGKILL is a valid signal that can be used by the user. There is no way to intercept it, that's true. However, there are many ways to handle it. A lot of software guarantees data consistency no matter what happens (databases, file systems, clustering software); that is why I mentioned ACID. For instance, there could be a transaction log (journal) of podman operations. If podman is killed by SIGKILL, the transaction log won't be clean, which means extra steps should be taken to verify consistency and ensure nothing blows up.
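
(Purely to illustrate the journal idea, not how podman is implemented; the file location and record format here are made up. The intent is recorded and fsynced before the destructive operation, so a process killed midway leaves a record that the next start can replay.)

import json, os

JOURNAL = "/tmp/podman-demo-journal"  # hypothetical location

def log_intent(op, ctr_id):
    # Record what we are about to do, and force it to disk before starting.
    with open(JOURNAL, "a") as f:
        f.write(json.dumps({"op": op, "id": ctr_id}) + "\n")
        f.flush()
        os.fsync(f.fileno())

def recover():
    # A non-empty journal on startup means a previous process died midway;
    # finish (or roll back) its work before accepting new operations.
    if os.path.exists(JOURNAL):
        for line in open(JOURNAL):
            entry = json.loads(line)
            print("replaying unfinished", entry["op"], "of", entry["id"])
        os.remove(JOURNAL)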

@ZeroExistence commented Oct 9, 2019

Not sure if my issue is related to the SIGKILL issue here, but I'm encountering a case where the container seems never to have made it to executing the command inside it. podman ps -a didn't even show the container.

Oct 09 19:50:34 prod.moe.ph podman[6966]: time="2019-10-09T19:50:34+08:00" level=debug msg="setting container name nginx"
Oct 09 19:50:34 prod.moe.ph podman[6966]: time="2019-10-09T19:50:34+08:00" level=debug msg="created OCI spec and options for new container"
Oct 09 19:50:34 prod.moe.ph podman[6966]: time="2019-10-09T19:50:34+08:00" level=debug msg="Allocated lock 21 for container 6e3076659ef6c90a9c35d7f2e7d43cccb0a25fd7d2ab5edc9859c8deb64c2da0"
Oct 09 19:50:34 prod.moe.ph podman[6966]: time="2019-10-09T19:50:34+08:00" level=debug msg="parsed reference into \"[overlay@/var/lib/containers/storage+/var/run/containers/storage:overlay.mountopt=nodev,metacopy=on]@0dac5b41d811ca6e1bfe68d31bec5fb1f5c37485b741674619a1a2a3dec5cc0e\""
Oct 09 19:50:34 prod.moe.ph podman[6966]: time="2019-10-09T19:50:34+08:00" level=debug msg="exporting opaque data as blob \"sha256:0dac5b41d811ca6e1bfe68d31bec5fb1f5c37485b741674619a1a2a3dec5cc0e\""
Oct 09 19:50:34 prod.moe.ph podman[6966]: time="2019-10-09T19:50:34+08:00" level=debug msg="created container \"6e3076659ef6c90a9c35d7f2e7d43cccb0a25fd7d2ab5edc9859c8deb64c2da0\""
Oct 09 19:50:34 prod.moe.ph podman[6966]: time="2019-10-09T19:50:34+08:00" level=debug msg="container \"6e3076659ef6c90a9c35d7f2e7d43cccb0a25fd7d2ab5edc9859c8deb64c2da0\" has work directory \"/var/lib/containers/storage/overlay-containers/6e3076659ef6c90a9c35d7f2e7d43cccb0a25fd7d2ab5edc9859c8deb64c2da0/userdata\""
Oct 09 19:50:34 prod.moe.ph podman[6966]: time="2019-10-09T19:50:34+08:00" level=debug msg="container \"6e3076659ef6c90a9c35d7f2e7d43cccb0a25fd7d2ab5edc9859c8deb64c2da0\" has run directory \"/var/run/containers/storage/overlay-containers/6e3076659ef6c90a9c35d7f2e7d43cccb0a25fd7d2ab5edc9859c8deb64c2da0/userdata\""
Oct 09 20:17:36 prod.moe.ph systemd[1]: Stopping nginx container service...
Oct 09 20:17:36 prod.moe.ph podman[9147]: Error: no container with name or ID nginx found: no such container
Oct 09 20:17:36 prod.moe.ph systemd[1]: cnginx.service: Succeeded.
Oct 09 20:17:36 prod.moe.ph systemd[1]: Stopped nginx container service.

I'm assuming that if podman didn't start the container successfully, it won't be able to clean up the SHM mounts afterwards. In my case, the container was still in containers.json, so I had to remove it using podman rm --storage.

Here are my podman info details.

[root@prod storage]# podman info --debug
debug:
  compiler: gc
  git commit: ""
  go version: go1.12.9
  podman version: 1.6.1
host:
  BuildahVersion: 1.11.2
  CgroupVersion: v1
  Conmon:
    package: conmon-2.0.1-1.fc30.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.1, commit: 4346fbe0b2634b05857973bdf663598081240374'
  Distribution:
    distribution: fedora
    version: "30"
  MemFree: 332312576
  MemTotal: 4133416960
  OCIRuntime:
    package: runc-1.0.0-93.dev.gitb9b6cc6.fc30.x86_64
    path: /usr/bin/runc
    version: |-
      runc version 1.0.0-rc8+dev
      commit: e3b4c1108f7d1bf0d09ab612ea09927d9b59b4e3
      spec: 1.0.1-dev
  SwapFree: 5341835264
  SwapTotal: 5368705024
  arch: amd64
  cpus: 2
  eventlogger: journald
  hostname: prod.moe.ph
  kernel: 5.2.14-200.fc30.x86_64
  os: linux
  rootless: false
  uptime: 19h 17m 42.67s (Approximately 0.79 days)
registries:
  blocked: null
  insecure: null
  search:
  - docker.io
  - registry.fedoraproject.org
  - quay.io
  - registry.access.redhat.com
  - registry.centos.org
store:
  ConfigFile: /etc/containers/storage.conf
  ContainerStore:
    number: 9
  GraphDriverName: overlay
  GraphOptions:
    overlay.mountopt: nodev,metacopy=on
  GraphRoot: /var/lib/containers/storage
  GraphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  ImageStore:
    number: 44
  RunRoot: /var/run/containers/storage
  VolumePath: /var/lib/containers/storage/volumes

Edit 1:
Never mind. It seems my case is related to containers/storage; the issue appears to be caused by mounting 3 or more volumes, or something else.

Edit 2:
It really is related to containers/storage. I was able to fix my issue by stopping all containers and running podman system renumber.

Significant changes were made to Podman volumes in this release. If you have pre-existing volumes, it is strongly recommended to run podman system renumber after upgrading.

@github-actions

This issue had no activity for 30 days. In the absence of activity or the "do-not-close" label, the issue will be automatically closed within 7 days.

@mheon closed this as completed Nov 10, 2019
@mheon reopened this Nov 10, 2019
@mheon (Member) commented Nov 10, 2019

Oops. I thought the bot closed this.

Adding do-not-close label.

mheon added a commit to mheon/libpod that referenced this issue Nov 11, 2019
When Libpod removes a container, there is the possibility that
removal will not fully succeed. The most notable problems are
storage issues, where the container cannot be removed from
c/storage.

When this occurs, we were faced with a choice. We can keep the
container in the state, appearing in `podman ps` and available for
other API operations, but likely unable to do any of them as it's
been partially removed. Or we can remove it very early and clean
up after it's already gone. We have, until now, used the second
approach.

The problem that arises is intermittent problems removing
storage. We end up removing a container, failing to remove its
storage, and ending up with a container permanently stuck in
c/storage that we can't remove with the normal Podman CLI, can't
use the name of, and generally can't interact with. A notable
cause is when Podman is hit by a SIGKILL midway through removal,
which can consistently cause `podman rm` to fail to remove
storage.

We now add a new state for containers that are in the process of
being removed, ContainerStateRemoving. We set this at the
beginning of the removal process. It notifies Podman that the
container cannot be used anymore, but preserves it in the DB
until it is fully removed. This will allow Remove to be run on
these containers again, which should successfully remove storage
if it fails.

Fixes containers#3906

Signed-off-by: Matthew Heon <mheon@redhat.com>
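
(Schematically, the ordering the commit describes; libpod is written in Go, so this Python sketch with hypothetical db and storage objects only mirrors the sequence, it is not the actual implementation.)

def remove_container(db, storage, ctr_id):
    # 1. Mark the container as being removed: it stays in the DB, visible
    #    but unusable.
    db.set_state(ctr_id, "removing")
    # 2. Remove storage. A SIGKILL landing here no longer orphans the
    #    storage: the DB record survives, so `podman rm` can retry.
    storage.delete(ctr_id)
    # 3. Only once storage is gone, drop the container from the DB.
    db.delete(ctr_id)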
mheon added further commits to mheon/libpod that referenced this issue on Nov 11 and Nov 19, 2019 (same commit message as above)
@TristanCacqueray (Contributor, Author)

Thanks @mheon !

@mheon (Member) commented Dec 2, 2019

Took us half a year, but it's finally done 😄

vrothberg pushed a commit to vrothberg/libpod that referenced this issue Jun 28, 2022
(same commit message as above)

Downstream-patch: podman-1775647.patch
Signed-off-by: Matthew Heon <mheon@redhat.com>
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
@github-actions bot added the locked - please file new issue/PR label Sep 23, 2023
@github-actions bot locked as resolved and limited conversation to collaborators Sep 23, 2023