
Rarely Error: getting pod's service container: container {ID} not found in DB: no such container #16964

Closed
MartinX3 opened this issue Jan 1, 2023 · 14 comments · Fixed by #17003
Labels
kind/bug · locked - please file new issue/PR

Comments

@MartinX3

MartinX3 commented Jan 1, 2023

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

Maybe connected to #12034

I can't remove, start, or restart the pod.
It just tells me that it's missing the service container with a specific ID.

My workaround is to edit the database file ~/.local/share/containers/storage/libpod/bolt_state.db and replace this ID with the ID of a running service container, then remove the pod and start it again with systemd.
On removal I get the error Error: freeing pod 59c9006b753b60141d73d23e1f42ef0fae794a45e3c6315a27faee9db1bf930a lock: no such file or directory, but I can ignore it and just recreate the database pod.
The service container I stole the ID from still works fine.
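
For illustration only: a minimal read-only sketch, assuming container IDs are stored in bolt_state.db as plain hex strings, that locates which buckets and keys still reference the stale service-container ID before any manual edit. This is a hypothetical helper, not a supported Podman tool, and podman should be stopped before running it:

package main

import (
    "bytes"
    "fmt"
    "log"
    "os"

    bolt "go.etcd.io/bbolt"
)

// scanBucket recursively walks a bucket and reports any key or value that
// contains the given ID bytes.
func scanBucket(path string, b *bolt.Bucket, needle []byte) error {
    return b.ForEach(func(k, v []byte) error {
        if v == nil { // a nil value means k names a nested bucket
            return scanBucket(path+"/"+string(k), b.Bucket(k), needle)
        }
        if bytes.Contains(k, needle) || bytes.Contains(v, needle) {
            fmt.Printf("%s: key %q still references the ID\n", path, k)
        }
        return nil
    })
}

func main() {
    if len(os.Args) != 3 {
        log.Fatalf("usage: %s <bolt_state.db> <container-id>", os.Args[0])
    }
    needle := []byte(os.Args[2])

    // Open read-only so the scan cannot corrupt the state file.
    db, err := bolt.Open(os.Args[1], 0o600, &bolt.Options{ReadOnly: true})
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    err = db.View(func(tx *bolt.Tx) error {
        return tx.ForEach(func(name []byte, b *bolt.Bucket) error {
            return scanBucket(string(name), b, needle)
        })
    })
    if err != nil {
        log.Fatal(err)
    }
}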

Steps to reproduce the issue:

  1. podman pod rm database

Describe the results you received:

$ podman pod ps
POD ID        NAME        STATUS      CREATED            INFRA ID      # OF CONTAINERS
59c9006b753b  database    Created     24 hours ago                     0
$ podman pod rm database 
Error: getting pod's service container: container b6a7e0e2c9325da40272180a80445b2e8d090866a78911545773e1173a045737 not found in DB: no such container

Describe the results you expected:

Removed pod.

Additional information you deem important (e.g. issue happens only occasionally):

happens rarely

Output of podman version:

Client:       Podman Engine
Version:      4.3.1
API Version:  4.3.1
Go Version:   go1.19.3
Git Commit:   814b7b003cc630bf6ab188274706c383f9fb9915-dirty
Built:        Sun Nov 20 23:32:45 2022
OS/Arch:      linux/amd64

Output of podman info:

host:
  arch: amd64
  buildahVersion: 1.28.0
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: /usr/bin/conmon is contained in conmon 1:2.1.5-1
    path: /usr/bin/conmon
    version: 'conmon version 2.1.5, commit: c9f7f19eb82d5b8151fc3ba7fbbccf03fdcd0325'
  cpuUtilization:
    idlePercent: 96.7
    systemPercent: 2.48
    userPercent: 0.82
  cpus: 4
  distribution:
    distribution: arch
    version: unknown
  eventLogger: journald
  hostname: homeserver
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.15.85-1-lts
  linkmode: dynamic
  logDriver: journald
  memFree: 1236152320
  memTotal: 6218739712
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: /usr/bin/crun is contained in crun 1.7.2-1
    path: /usr/bin/crun
    version: |-
      crun version 1.7.2
      commit: 0356bf4aff9a133d655dc13b1d9ac9424706cac4
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    exists: true
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /etc/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: /usr/bin/slirp4netns is contained in slirp4netns 1.2.0-1
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.4
  swapFree: 34359730176
  swapTotal: 34359730176
  uptime: 1h 17m 33.00s (Approximately 0.04 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries: {}
store:
  configFile: /home/homeserver/.config/containers/storage.conf
  containerStore:
    number: 13
    paused: 0
    running: 13
    stopped: 0
  graphDriverName: btrfs
  graphOptions: {}
  graphRoot: /home/homeserver/.local/share/containers/storage
  graphRootAllocated: 1000187023360
  graphRootUsed: 2227044352
  graphStatus:
    Build Version: Btrfs v6.0.1
    Library Version: "102"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 13
  runRoot: /run/user/1000/containers
  volumePath: /home/homeserver/.local/share/containers/storage/volumes
version:
  APIVersion: 4.3.1
  Built: 1668983565
  BuiltTime: Sun Nov 20 23:32:45 2022
  GitCommit: 814b7b003cc630bf6ab188274706c383f9fb9915-dirty
  GoVersion: go1.19.3
  Os: linux
  OsArch: linux/amd64
  Version: 4.3.1

Package info (e.g. output of rpm -q podman or apt list podman or brew info podman):

podman 4.3.1-2

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide?

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):
physical

@openshift-ci openshift-ci bot added the kind/bug label Jan 1, 2023
@MartinX3 MartinX3 changed the title from "Rarely" to "Rarely Error: getting pod's service container: container {ID} not found in DB: no such container" Jan 1, 2023
@vrothberg
Member

Thanks for reaching out. Can you share a reproducer?

@MartinX3
Author

MartinX3 commented Jan 2, 2023

Thank you for your response.

Sadly I don't know how it happened each time.
Maybe on shutdown, maybe on podman pod rm.
Luckily it happens only rarely.

In the end, it would be helpful to be able to rm and restart pods that are stuck in the Created state forever because a service container with a specific ID is missing.
A restart of the server didn't solve this. Only my workaround of editing the DB file and replacing the service container ID with an existing one from a different pod let me rm the broken pod.

@vrothberg
Member

I assume you have created the pod via kube play, right?

@MartinX3
Author

MartinX3 commented Jan 2, 2023

Together with systemd, yes.
systemctl --user enable --now podman-kube@$(systemd-escape $(pwd)/database-pod.yaml).service
It also has the restartPolicy: on-failure

Here is the file
https://github.com/MartinX3-AdministrativeDevelopment/ServerContainerTemplate/blob/development/container/services/postgresql/postgresql-pod.yaml

Maybe it's helpful, but since it only happens rarely, there is probably some kind of race condition; I don't know.

@vrothberg
Member

Thanks, @MartinX3. Are you seeing the error when stopping the service via systemctl or when removing the pod manually?

@MartinX3
Author

MartinX3 commented Jan 2, 2023

Sadly I don't remember.
I only notice the problem after my server logs an error about a service that isn't working, and then I see that one pod is stuck in the "Created" state.
I don't know whether it happened while modifying and reapplying the YAML file, during a restart, or via the background podman auto-update and prune services.

@Luap99
Member

Luap99 commented Jan 3, 2023

Regardless of how we reproduce it, I think the first step for podman pod rm is to ignore ErrNoSuchCtr for the service container here:

if err := p.maybeRemoveServiceContainer(); err != nil {
    return err
}

This will make the pod rm command work AFAICT.
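
A minimal sketch of what that suggestion could look like, reusing the names from the snippet above (define.ErrNoSuchCtr is libpod's "no such container" error); this only illustrates the idea, not the change that was eventually merged:

// Sketch: tolerate an already-missing service container during pod removal.
// Assumes the standard errors package plus libpod's define and logrus imports.
func removeServiceContainerIfPresent(p *Pod) error {
    if err := p.maybeRemoveServiceContainer(); err != nil {
        // The container may have been removed manually or lost in a crash;
        // "no such container" should not block removing the pod itself.
        if !errors.Is(err, define.ErrNoSuchCtr) {
            return err
        }
        logrus.Debugf("Service container of pod %s already gone: %v", p.ID(), err)
    }
    return nil
}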

@rhatdan
Member

rhatdan commented Jan 3, 2023

Care to open a PR?

@vrothberg
Member

I don't think that's the right approach at the current time. We don't have a reproducer. If a pod was created with a service container and the service container was removed before the pod, then there is a bug OR the user removed it (which they should not).

Unless we have a reproducer, I am against patching symptoms.

@vrothberg
Member

Ah, but it seems the pod cannot be removed in that case. Then I think the pod should be removed but the error for the service container should still be returned? If that's too intrusive a change, it should at least be logged as an error.

@Luap99
Member

Luap99 commented Jan 4, 2023

Ah, but it seems the pod cannot be removed in that case. Then I think the pod should be removed but the error for the service container should still be returned? If that's too intrusive a change, it should at least be logged as an error.

Well, in theory the podman process could be killed after p.maybeRemoveServiceContainer() but before

if err := r.state.RemovePod(p); err != nil {

Since the next pod rm will always fail because of the missing service container, this would leave the DB in a very bad state. I don't even think we should log this; at this point the user wants to remove the pod, so why should they care about such a warning? The pod is gone afterwards anyway.
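
Putting the two call sites quoted above next to each other makes that window concrete (paraphrased ordering, with the crash window marked in the comment):

if err := p.maybeRemoveServiceContainer(); err != nil {
    return err
}
// A SIGKILL in this window removes the service container but leaves the pod
// record in the DB still pointing at it, so every later `podman pod rm`
// fails with "not found in DB: no such container".
if err := r.state.RemovePod(p); err != nil {
    return err
}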

You can reproduce this easily by removing the service container manually:

$ podman kube play --service-container test.yml
Trying to pull docker.io/library/alpine:latest...
Getting image source signatures
Copying blob c158987b0551 done  
Copying config 49176f190c done  
Writing manifest to image destination
Storing signatures
Pod:
bf4200f0ba3aec4cf1769bd434a3099bc4c3d1ea14c06728134107c43184c439
Container:
efbf32d5909d12a7aa7288b8c74bb3aaebb1964efa9740bd090ef332a219bcc8

$ podman ps
CONTAINER ID  IMAGE                                    COMMAND     CREATED        STATUS            PORTS       NAMES
1c1e38f098da  localhost/podman-pause:4.3.1-1668178887              8 seconds ago  Up 3 seconds ago              142bbe0be38a-service
c6cce367faef  localhost/podman-pause:4.3.1-1668178887              8 seconds ago  Up 3 seconds ago              bf4200f0ba3a-infra
efbf32d5909d  docker.io/library/alpine:latest                      3 seconds ago  Up 2 seconds ago              gallantmahavira-pod-gallantmahavira
$ podman rm 142bbe0be38a-service 
Error: container 1c1e38f098da09df4b2ec853b521167b8847784329aad8ab5dc74848c29166ed is the service container of pod(s) bf4200f0ba3aec4cf1769bd434a3099bc4c3d1ea14c06728134107c43184c439 and cannot be removed without removing the pod(s)
$ podman pod stop gallantmahavira-pod 
bf4200f0ba3aec4cf1769bd434a3099bc4c3d1ea14c06728134107c43184c439
ERRO[0000] Stopping service container 1c1e38f098da09df4b2ec853b521167b8847784329aad8ab5dc74848c29166ed: container is stopped 
ERRO[0000] Stopping service container 1c1e38f098da09df4b2ec853b521167b8847784329aad8ab5dc74848c29166ed: container is stopped 
ERRO[0000] Stopping service container 1c1e38f098da09df4b2ec853b521167b8847784329aad8ab5dc74848c29166ed: container is stopped 
$ podman rm 142bbe0be38a-service 
142bbe0be38a-service
$ podman pod rm gallantmahavira-pod 
Error: getting pod's service container: container 1c1e38f098da09df4b2ec853b521167b8847784329aad8ab5dc74848c29166ed not found in DB: no such container

If the service container is that critical to the pod, podman should not allow it to be removed without the pod.
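
That is the direction the eventual fix took (see the referenced commits below): treat the service container like an infra container and refuse to remove it while pods still reference it. A rough sketch of such a guard, where IsService is assumed to be libpod's service-container flag, podsReferencingService is a hypothetical lookup, and the error text mirrors the message shown in the transcript above:

// Sketch only; not the implementation that was merged for this issue.
func checkCanRemoveServiceContainer(c *Container) error {
    if !c.IsService() { // assumed service-container flag
        return nil
    }
    podIDs, err := podsReferencingService(c) // hypothetical DB lookup
    if err != nil {
        return err
    }
    if len(podIDs) > 0 {
        return fmt.Errorf("container %s is the service container of pod(s) %s and cannot be removed without removing the pod(s)",
            c.ID(), strings.Join(podIDs, ", "))
    }
    return nil
}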

@vrothberg
Member

I don't even think we should log this; at this point the user wants to remove the pod, so why should they care about such a warning? The pod is gone afterwards anyway.

See my other comment. It either means there's a bug, or that a user manually removed the service container (which they should not), or something got killed. Logging seems the right way to me. Ignoring such errors (I call them symptoms) will also hide them from tests, which can negatively impact quality since we don't know/see the errors.

A service container is only used when executed in systemd. In that case, systemd should manage the deployments (not the user).

@vrothberg
Member

To avoid duplicate work. @Luap99 do you want to tackle the issue or shall I?

@Luap99
Member

Luap99 commented Jan 4, 2023

@vrothberg Please take it.

vrothberg added a commit to vrothberg/libpod that referenced this issue Jan 5, 2023
Allow for removing a pod even if the associated service container is
already gone.  This can happen if a user manually removes the service
container (which they shouldn't).  Yet, accidents and bugs can happen,
so we need to be able to clean up and remove the pods.

Fixes: containers#16964
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
vrothberg added a commit to vrothberg/libpod that referenced this issue Jan 9, 2023
Do not allow for removing the service container unless all associated
pods have been removed.  Previously, the service container could be
removed when all pods have exited which can lead to a number of issues.

Now, the service container is treated like an infra container and can
only be removed along with the pods.

Fixes: containers#16964
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
vrothberg added a commit to vrothberg/libpod that referenced this issue Jan 9, 2023
Do not allow for removing the service container unless all associated
pods have been removed.  Previously, the service container could be
removed when all pods have exited which can lead to a number of issues.

Now, the service container is treated like an infra container and can
only be removed along with the pods.

Also make sure that a pod is unlinked from the service container once
it's being removed.

Fixes: containers#16964
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
@github-actions github-actions bot added the locked - please file new issue/PR label Sep 5, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 5, 2023