
container/pod refuse to be removed cause defunct exec session process #14252

Closed
dispensable opened this issue May 16, 2022 · 11 comments · Fixed by #14258
Labels
kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

@dispensable

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

Podman refuses to remove a container/pod because of an active exec session. But the exec session is just a defunct zombie process that will never be killed or cleaned up. The only way to clean up the zombie process and the pod/container is to reboot the system.

Steps to reproduce the issue:

Sorry, but I don't know how to create a zombie exec session (it just happened).

Describe the results you received:
b094f727f1dd4dc06f7743a061b2d2ece691eb59adfc3c4e4791d0b2ae16ac62 has containers that are not ready to be removed: cannot remove container 585ec09d8b1ec7c5ff2791261e30829dc117c838f0e9d925eb25263a8182ec37 as it has active exec sessions: container state improper

Describe the results you expected:
The container/pod should be removed successfully. podman pod rm -f should ignore defunct zombie exec sessions, since they can never be killed while the system is up and keeping them around serves no purpose.

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

Version:      3.4.0
API Version:  3.4.0
Go Version:   go1.17.1
Built:        Mon Oct 18 11:00:12 2021
OS/Arch:      linux/amd64

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.23.1
  cgroupControllers: []
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: app-containers/conmon-2.0.27
    path: /usr/libexec/podman/conmon
    version: 'conmon version 2.0.27, commit: v2.0.27'
  cpus: 48
  distribution:
    distribution: gentoo
    version: unknown
  eventLogger: file
  hostname: hithlan2
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 2392
      size: 1
    - container_id: 1
      host_id: 362144
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 2392
      size: 1
    - container_id: 1
      host_id: 362144
      size: 65536
  kernel: 5.10.27-gentoo
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 91562278912
  memTotal: 202617040896
  ociRuntime:
    name: crun
    package: app-containers/crun-1.2
    path: /usr/bin/crun
    version: |-
      crun version 1.2
      commit: 4f6c8e0583c679bfee6a899c05ac6b916022561b
      spec: 1.0.0
      +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  remoteSocket:
    path: /tmp/podman-run-2392/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_AUDIT_WRITE,CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_MKNOD,CAP_NET_BIND_SERVICE,CAP_NET_RAW,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true

Package info (e.g. output of rpm -q podman or apt list podman):

[?] app-emulation/podman [1]
     Available versions:  ~3.3.0_rc3^t {apparmor btrfs +fuse +rootless selinux}
     Installed versions:  3.4.0^st{tbz2}(11:00:45 AM 10/18/2021)(fuse rootless -apparmor -btrfs -selinux)
     Homepage:            https://github.com/containers/podman/
     Description:         Library and podman tool for running OCI-based containers in Pods

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/main/troubleshooting.md)

It's hard to create a zombie exec session, so I haven't tested with the latest Podman.

Additional environment details (AWS, VirtualBox, physical, etc.):

@openshift-ci openshift-ci bot added the kind/bug Categorizes issue or PR as related to a bug. label May 16, 2022
@vrothberg
Member

Thanks for reaching out, @dispensable.

The request makes sense to me, especially with --force. @mheon WDYT?

@rhatdan
Member

rhatdan commented May 16, 2022

@mheon Would this be enough?

diff --git a/libpod/runtime_ctr.go b/libpod/runtime_ctr.go
index df7174ac6..e9dab24ae 100644
--- a/libpod/runtime_ctr.go
+++ b/libpod/runtime_ctr.go
@@ -644,9 +644,11 @@ func (r *Runtime) removeContainer(ctx context.Context, c *Container, force, remo
                return err
        }
 
-       // If we're not force-removing, we need to check if we're in a good
-       // state to remove.
-       if !force {
+       if force {
+               c.removeAllExecSessions()
+       } else {
+               // If we're not force-removing, we need to check if we're in a good
+               // state to remove.
                if err := c.checkReadyForRemoval(); err != nil {
                        return err
                }

@mheon
Member

mheon commented May 16, 2022

That won't help, we already short-circuit checkReadyForRemoval() in the force case, and I don't want to remove a container with active exec sessions without force.

@rhatdan
Member

rhatdan commented May 16, 2022

Huh? This is only removing ExecSessions with Force.

@mheon
Member

mheon commented May 16, 2022

Which happens later in removeContainer

@rhatdan
Member

rhatdan commented May 16, 2022

I think we are failing before we get there, which is what the user is complaining about. Perhaps this was fixed already, and my patch is not necessary.

@mheon
Member

mheon commented May 16, 2022

The way I read the code, this already works, though you need to remove the container twice. removeAllExecSessions() is guaranteed to evict all exec sessions from the DB, but it returns an error if any of them failed to stop, and that error blocks podman rm. The second run removes the container because the exec sessions are already gone.

@mheon
Member

mheon commented May 16, 2022

@rhatdan No, that code never runs for rm --force

@mheon
Member

mheon commented May 16, 2022

I think the real fix is probably to continue, not hard-error, if removing exec sessions failed.

@rhatdan
Member

rhatdan commented May 16, 2022

SGTM, could you open a PR?

Googling for a zombie finds:

cat /tmp/zombie.c
#include <stdlib.h> /* exit */
#include <unistd.h> /* fork, sleep */

int main(int argc, char *argv[])
{
    int i = fork();

    if (i == 0) {
        exit(0); /* let the child die as fast as possible */
    } else {
        /* During these 30 seconds the child is a zombie: it is dead, but
         * not yet reaped with waitpid(). Run ps during this window to see
         * it in the process list. */
        sleep(30);
    }
    /* If we exit without reaping the child, it is reparented to init,
     * which reaps it. */
    return 0;
}

mheon added a commit to mheon/libpod that referenced this issue May 16, 2022
Removing exec sessions is guaranteed to evict them from the DB,
but in the case of a zombie process (or similar) it may error and
block removal of the container. A subsequent run of `podman rm`
would succeed (because the exec sessions have been purged from
the DB), which is potentially confusing to users. So let's just
continue, instead of erroring out, if removing exec sessions
fails.

[NO NEW TESTS NEEDED] I wouldn't want to spawn a zombie in our
test VMs even if I could.

Fixes containers#14252

Signed-off-by: Matthew Heon <matthew.heon@pm.me>
@mheon
Member

mheon commented May 16, 2022

#14258

cdoern pushed a commit to cdoern/podman that referenced this issue May 27, 2022
mheon added a commit to mheon/libpod that referenced this issue Jun 14, 2022
gbraad pushed a commit to gbraad-redhat/podman that referenced this issue Jul 13, 2022
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 20, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 20, 2023