Unable to start containers after a forced shutdown #5986
Comments
A friendly reminder that this issue had no activity for 30 days.

This issue seems to have been lost. Sorry about that, are you still having issues with this?
Hi, yes, occasionally. It seems either the IP config file can be left dangling, or a reference to the image is left behind.
Leigh

I was under the impression the symlink issue with images had already been resolved in c/storage, but that seems to be incorrect.

@nalind PTAL

When we get this error, I think that calling …

Is there an easy way to determine if the error in question is an error that would require …

It looks like the error that's coming back from …
I am seeing this issue too. The workaround for me was to delete the image: …
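A minimal sketch of automating that workaround. The helper names, the error pattern, and the recovery flow are my own illustration, not part of podman:

```shell
#!/bin/sh
# Sketch only: detect the broken-layer-link symptom reported in this
# issue and recover by removing and re-pulling the affected image.

# Succeeds if an error message matches the
# "readlink .../overlay/l/...: no such file or directory" pattern.
is_broken_layer_link() {
  printf '%s\n' "$1" | grep -q 'readlink .*/overlay/l/.*: no such file or directory'
}

# Remove the image (-f also removes containers using it) and pull it again.
recover_image() {
  image=$1
  errmsg=$2
  if is_broken_layer_link "$errmsg"; then
    sudo podman rmi -f "$image"
    sudo podman pull "$image"
  fi
}
```

Capturing stderr from the failing `podman run` and feeding it to `recover_image` automates the manual fix described later in this thread.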
Any idea how you got into this state? Do you have a reproducer?

@rhatdan This happened for me when I had an orange pi zero with its underpowered SD card trying to pull and launch around 7 containers simultaneously. Of course the little guy overheated and/or kernel panicked. Whatever happened caused a forced shutdown during pulling/creation/startup of containers; it probably had containers in most of these stages, since 2 of those images were relatively small and were likely further along than the others. So: high load, multiple concurrent operations, and an unclean shutdown. Maybe a high ext4 commit interval would make this issue reproduce more reliably.

I don't have a reliable reproduction @rhatdan, but I hit it frequently enough using containers running couchdb during any shutdown/reboot. It seems more likely when the shutdown comes as a power-off of a VM.

@giuseppe Could this be fuse-overlayfs related, or is this just a partial removal from container storage that is causing this problem?
@rhatdan I don't think it is related to fuse-overlayfs. Generally, the storage can get corrupted on a forced shutdown, and the missing symlinks are just one symptom. What worries me most is that images could be corrupted as well (e.g. missing or incomplete files), and this is difficult to detect. When running in a cluster, CRI-O wipes out the entire storage on the next node boot if the node wasn't stopped cleanly. I think this is still the safest thing we can do for now, until we have something like "podman storage fsck" that can verify that each file in the images is not corrupted and, if needed, re-pull the image.
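The CRI-O behaviour described here can be approximated on a podman host with a boot-time check. The sentinel path and the use of `podman system reset` are my own illustration, not something podman or CRI-O ships for this purpose:

```shell
#!/bin/sh
# Sketch, run early at boot: if the previous shutdown did not remove
# the sentinel file, assume container storage may be corrupted and wipe
# it. SENTINEL is an arbitrary path chosen for this example.
SENTINEL=/var/lib/containers/.unclean-shutdown

# True if the sentinel survived the previous boot (i.e. hard shutdown).
unclean_shutdown() {
  [ -e "$1" ]
}

boot_check() {
  if unclean_shutdown "$SENTINEL"; then
    echo "unclean shutdown detected; wiping container storage" >&2
    # Drops all containers, images and volumes, like CRI-O's wipe.
    sudo podman system reset --force
  fi
  # Arm the sentinel; a clean-shutdown systemd unit should remove it.
  sudo touch "$SENTINEL"
}
```

This trades availability for safety exactly as described above: everything is re-pulled after a hard shutdown, whether or not it was actually corrupted.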
How difficult would it be to reassemble the storage with an fsck option? The difference between CRI-O and Podman is that blowing away containers could mean losing a serious amount of work. Think toolbox containers.

I can confirm this is a thing that happens. I've seen applications crashing for no apparent reason, which turned out to be fixed by removing and re-pulling the image (same digest). But again, no reproducer. Does podman support any sort of read-only rootfs setup? Like storing images in a partition which gets mounted read-only? Or even the whole rootfs mounted read-only.
We would need to checksum each file in the image. It would get us closer to the OSTree storage model; OSTree has a fsck operation that works this way. Alternatively, more expensive in terms of I/O, we record that the image is pulled only after we do a …
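The checksum idea can be sketched in a few lines. The manifest format (plain `sha256sum` output) is my own choice, and nothing like this exists in podman today:

```shell
#!/bin/sh
# Sketch of a "storage fsck": record a checksum manifest for a directory
# tree (e.g. an image's extracted layers) at pull time, verify it later.

# record_manifest DIR MANIFEST_FILE
record_manifest() {
  (cd "$1" && find . -type f -print0 | xargs -0 -r sha256sum) > "$2"
}

# verify_manifest DIR MANIFEST_FILE -> nonzero exit if anything differs
verify_manifest() {
  (cd "$1" && sha256sum --quiet -c -) < "$2"
}
```

This is roughly what `ostree fsck` does against its checksummed object store; it catches truncated or garbled files, but not files that were never written at all unless the manifest itself survived.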
You can use an additional store that works exactly how you described: keep the entire storage on a read-only partition, and tell Podman to use it with … in the …
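The option presumably being referred to is `additionalimagestores` in `storage.conf`. A sketch, where the mount point is my own choice for the example:

```toml
# /etc/containers/storage.conf (fragment)
# Assumes a read-only partition, populated with the images ahead of
# time, mounted at /usr/lib/containers/storage.
[storage.options]
additionalimagestores = [
  "/usr/lib/containers/storage",
]
```

Images found in an additional store are used in place rather than copied into the writable store, so they should survive an unclean shutdown that corrupts the writable storage.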

A friendly reminder that this issue had no activity for 30 days.

@nalind Any movement on this?

Sorry, been focused on other bugs.

I just faced this issue in a very strange situation: …

A friendly reminder that this issue had no activity for 30 days.

This problem still exists.

A similar issue exists after a plain Linux `sudo reboot`: neither locks nor ports are released. A proper `systemctl reboot` works fine.
I too can confirm that this issue recently occurred and is still happening. There was a power surge in the area over the weekend which resulted in a short loss of electricity. I had recently moved the Raspberry Pi I use for testing, and it had no UPS. Only one container seems to have been affected, a rootful haproxy; the other rootless containers seem not to have malfunctioned. The solution was indeed to pull haproxy again. Not really something I'd want to do in production, considering that haproxy and podman are both key components.
`$ sudo podman info --debug` …
Issue fixed in c/storage by containers/storage#822

@umohnani8 Please backport containers/storage#822 to v1.26 so we can update the vendor in podman to fix this in podman 3.0.

@rhatdan the patch is already in c/storage v1.26

Did you open a PR to vendor in an update?

I don't think that'll pass CI due to the libcap farts, see #9462
I've also hit this on podman 2.2.1 on Fedora IoT. I want to emphasize that this is really problematic in low-bandwidth IoT use cases with unstable power supply, as I'm facing right now: when operating podman on devices at sites with slow connection speeds, data limits (around 500M to 1G per month), or a per-KB/MB billing model, a corrupted storage and re-downloading 500MB nodejs images is either impossible or at least lethal on site. For this it's crucial to store all images on update in a dedicated, read-only storage like giuseppe suggested earlier. Just mentioning it because it has caused a lot of headache in the past, and maybe others hitting this with a similar use case can benefit from the idea.

As a side note on ostree-based systems: I think it's a viable approach for those scenarios to commit the images into the ostree, which also allows diff-based container updates, since ostree comes with the ability to do delta updates. It also ties the container image state/version to the OS version, which is a nice property IMO.

The storage fix made it into podman 3.1.0-rc1. @rhatdan do we plan to backport this fix for 2.2.1 as well?

no.

Okay, this is fixed in podman …

Just to clarify, I'm assuming that the ro-store workaround would not be sufficient for containers run as root, since it relies on filesystem permissions, right? I'm getting hit with this fairly often using preemptible instances on Google Cloud. Since I have to expect random hard shutdowns, I'm already taking container checkpoints at regular intervals (which require root). My fairly overkill workaround to the random corruption is, if after boot any … My scripts seem to catch the corruption and allow recovery, but it's fairly aggressive.
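A sketch of that checkpoint-based safety net, assuming CRIU is installed and the container runs as root. The container name, archive directory, and helper function are placeholders of mine, not from the thread:

```shell
#!/bin/sh
# Periodically export a checkpoint archive so a container can be
# restored after a hard shutdown corrupts storage. CTR and CKPT_DIR
# are placeholders.
CTR=mycontainer
CKPT_DIR=/var/lib/checkpoints

# Builds the archive path for a container name.
ckpt_path() {
  printf '%s/%s.tar.gz' "$2" "$1"
}

take_checkpoint() {
  sudo mkdir -p "$CKPT_DIR"
  # --leave-running keeps the container up; --export writes a portable
  # archive that survives wiping the container storage.
  sudo podman container checkpoint --leave-running \
    --export "$(ckpt_path "$CTR" "$CKPT_DIR")" "$CTR"
}

# After a corrupted boot: wipe storage, re-pull the image, then
#   sudo podman container restore --import "$(ckpt_path "$CTR" "$CKPT_DIR")"
```

Keeping the archives on a separate partition from `/var/lib/containers` is what makes them survive a storage wipe.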
Is this a BUG REPORT or FEATURE REQUEST?
/kind bug
Description
Steps to reproduce the issue:
sudo /usr/bin/podman run --rm -d --name atomix-1 -p 5679:5679 -it -v /opt/onos/config:/etc/atomix/conf -v /var/lib/atomix-1/data:/var/lib/atomix/data:Z atomix/atomix:3.1.5 --config /etc/atomix/conf/atomix-1.conf --ignore-resources --data-dir /var/lib/atomix/data --log-level WARN
sudo virsh destroy
Describe the results you received:
Container failed to start with the `readlink … no such file or directory` error shown below. This occurs in approximately 1 in 5 forced shutdowns.
sudo /usr/bin/podman run --rm -d --name atomix-1 -p 5679:5679 -it -v /opt/onos/config:/etc/atomix/conf -v /var/lib/atomix-1/data:/var/lib/atomix/data:Z atomix/atomix:3.1.5 --config /etc/atomix/conf/atomix-1.conf --ignore-resources --data-dir /var/lib/atomix/data --log-level WARN
Error: readlink /var/lib/containers/storage/overlay/l/QRPHWAOMUOP7RQXQKPUY4Y7I3Z: no such file or directory
sudo podman inspect localhost/atomix/atomix:3.1.5
Error: error parsing image data "57ddcf43f4ac8f399810d4b44ded2c3a63e5abfb672bc447c3aa0f18e39a282c": readlink /var/lib/containers/storage/overlay/l/GMVU2BJI2CBP6Z2DFDEHCCZGTD: no such file or directory
Describe the results you expected:
Container starts correctly
Additional information you deem important (e.g. issue happens only occasionally):
The only workaround seems to be to delete the image and re-pull it:
sudo podman rmi -f atomix/atomix:3.1.5
sudo podman pull atomix/atomix:3.1.5
Output of `podman version`:

Output of `podman info --debug`:

Package info (e.g. output of `rpm -q podman` or `apt list podman`):

Additional environment details (AWS, VirtualBox, physical, etc.):
KVM CentOS 8.1 Guest VM running latest stable podman.