Podman hangs, possible deadlock? #16062
A friendly reminder that this issue had no activity for 30 days.
@mheon PTAL
Adding output from SIGABRT signal: https://pastebin.com/raw/wTuJ3jND (large output)
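For context: sending SIGABRT to a Go binary such as Podman makes the runtime print stack traces of all goroutines, which is what the linked dump contains. Below is a minimal sketch of capturing the same kind of dump from inside any Go service using only the standard library; the structure is illustrative, not Podman's actual code.

```go
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

func main() {
	// Catch SIGABRT ourselves and dump all goroutine stacks, similar
	// to what the Go runtime prints when the signal is unhandled.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGABRT)
	go func() {
		<-sigs
		// debug=2 prints stacks in the same format as a panic trace.
		pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		os.Exit(1)
	}()
	select {} // simulate a hung daemon waiting forever
}
```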
Seems like a c/storage deadlock, given this.
@nalind Any thoughts?
Well, the backtrace includes one goroutine (1) waiting for a file lock, and no other routines in a path that would obviously have obtained the lock, and the …
Adding output from SIGABRT from the second process (podman system service): https://transfer.sh/SrjUkf/pdm.dump
@nalind By the way, none of this happens on Docker (we are using --userns=auto and a few libpod endpoints instead of the compat ones; then there is of course the cgroup v1 vs cgroup v2 difference).
Confirmed: it's a locking bug. In goroutine 57243, the getMaxSizeFromImage() function calls LayerStore() before iterating through the list of known layers, and LayerStore() attempts to lock the layers.json file while reading it. The problem is that getMaxSizeFromImage() is being called from getAutoUserNS(), which is being called from CreateContainer(), which is already holding a write lock on that very file.
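To make the failure mode concrete, here is a minimal Go sketch of the pattern: a non-reentrant lock re-acquired further down the same call path. The names and the sync.Mutex are illustrative; the real lock is a file lock on layers.json and the real chain is CreateContainer → getAutoUserNS → getMaxSizeFromImage → LayerStore.

```go
package main

import "sync"

// store stands in for the layer store; mu stands in for the
// non-reentrant file lock guarding layers.json.
type store struct {
	mu sync.Mutex
}

// layerStore mirrors LayerStore(): it locks before reading state.
func (s *store) layerStore() {
	s.mu.Lock() // blocks forever if the caller already holds mu
	defer s.mu.Unlock()
	// ... read the list of known layers ...
}

// createContainer mirrors CreateContainer(): it takes the write lock,
// then indirectly (getAutoUserNS → getMaxSizeFromImage) calls layerStore.
func (s *store) createContainer() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.layerStore() // second acquisition of the same lock: deadlock
}

func main() {
	s := &store{}
	s.createContainer() // hangs; Go reports "all goroutines are asleep - deadlock!"
}
```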
... and I'm pretty sure @mtrmac fixed it in containers/storage#1387.
In this log, in goroutine 57243, isn’t the blocking lock […]? Either way, that […]. It does seem that containers/storage#1387 fixes this … without even noticing that it was broken. Oops. Does anyone have suggestions for containers/storage#1389? I’ll, at least, update containers/storage#1438 to document the […].
A friendly reminder that this issue had no activity for 30 days.
containers/storage@8d8d6be is now included in the Podman main branch.
@mtrmac Hey, just tried https://download.copr.fedorainfracloud.org/results/rhcontainerbot/podman-next/fedora-37-x86_64/05226098-podman/podman-0.0.git.17669.93118464-1.fc37.x86_64.rpm on FCOS, installed via this snippet, but there is still some deadlock.
@lukasmrtvy It’s hard to act upon “some deadlock”. Please file a new issue, with similar data as before (…
Dump from …
I’m afraid that doesn’t match the backtrace; that backtrace seems to be consistent with this being unmodified Podman 4.3.1.
https://github.com/containers/podman/releases/tag/v4.4.0-rc1? But if there were some deeper reason for the mismatch between what crashes and what you have upgraded to, just updating to that might not make a difference.
In #16062 (comment) this issue is identified as a locking bug inside Podman. We are seeing a deadlock in the XFS file system, triggered by Podman, and I am wondering whether both have the same explanation. The issue occurs when starting a container for RabbitMQ with --user rabbitmq --userns=keep-id, on newer kernels, when the native overlayfs is used rather than fuse-overlayfs. One sub-optimal detail is that RabbitMQ needs its home directory mounted in the container (/var/lib/rabbitmq), but this is also where Podman stores all the container files; so effectively the container files are mounted into the container. The kernel warns about processes being stuck; here is one of them:
After this, more and more processes get blocked as they try to access the file system. We are currently working around it by forcing the use of fuse-overlayfs instead of the native driver. The version of Podman we have is fairly old (3.4.2), but Ubuntu doesn't seem to have packaged a newer version for Ubuntu 20.04, so we could not test whether the fix above also works for us.
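For anyone wanting to apply the same workaround: assuming a stock containers/storage setup, forcing fuse-overlayfs is done via mount_program in storage.conf. The paths below are the usual defaults; adjust as needed.

```toml
# /etc/containers/storage.conf (rootless: ~/.config/containers/storage.conf)
[storage]
driver = "overlay"

[storage.options.overlay]
# Use the FUSE implementation instead of the kernel's native overlayfs.
mount_program = "/usr/bin/fuse-overlayfs"
```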
This sounds completely unrelated to the original bug (and, honestly, is most likely a kernel bug, not a Podman one). Please open a fresh issue (probably against Podman first, so we can be sure this is actually the kernel).
@mheon Yes, I will do that, and I will add some extra information that may help. I also think it is more likely a kernel bug (after all, userland programs should not be able to mess up the kernel), but I wasn't sure where else to report it.
Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)
/kind bug
Description
Seems that I am suffering from the very same problem as described in #1574 (4 years old), but with the latest Podman version (4.2.0).
My scenario is a little different. I am running a "custom scheduler" in a container with podman.sock mounted, run as a systemd service. This scheduler is responsible for creating/managing/deleting containers (other operations such as attaching, streaming logs, stats, etc. are also involved) and for waiting on containers. It seems that when I restart this scheduler, Podman becomes unavailable. Containers created via the "custom scheduler" also use --userns=auto.
Steps to reproduce the issue:
1. (Not relevant due to its own business logic)
2. systemctl start scheduler.service (podman run -v "/run/podman/podman.sock:/run/podman/podman.sock" myscheduler)
3. systemctl restart scheduler.service
Describe the results you received:
Podman is not accessible via CLI and API
Describe the results you expected:
Podman is accessible via CLI and API
Additional information you deem important (e.g. issue happens only occasionally):
Related logs:
journalctl -u podman --no-pager | grep 04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944
ps -ef | grep 04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944
lslocks | grep 44396
ps -ef | grep 5482
strace podman ps
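When Podman hangs like this, a quick liveness probe against the socket helps distinguish "service dead" from "service deadlocked but still accepting connections". Here is a minimal sketch in Go, using only the standard library; the socket path and API version prefix are taken from this report and may differ on other setups.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"time"
)

func main() {
	// Dial the libpod API over the unix socket mounted into the scheduler.
	sock := "/run/podman/podman.sock"
	client := &http.Client{
		Timeout: 5 * time.Second, // a deadlocked daemon will trip this timeout
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", sock)
			},
		},
	}
	resp, err := client.Get("http://d/v4.0.0/libpod/_ping")
	if err != nil {
		fmt.Println("ping failed:", err) // a timeout here suggests a hang, not a crash
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body)) // expect "200 OK" and body "OK" when healthy
}
```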
Output of podman version:
Output of podman info:
Package info (e.g. output of rpm -q podman or apt list podman or brew info podman):
Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/main/troubleshooting.md)
Yes
Additional environment details (AWS, VirtualBox, physical, etc.):
AWS EC2 instance, /var/lib/containers/storage/ mounted from EBS running Fedora CoreOS 36.20220820.3.0