podman-run --systemd=true insufficient to properly run systemd #2996
Comments
I think you have issues in your labeling. systemd should be able to write anywhere within the container image. Could you remove the tmpfs and check whether you have an issue with SELinux? You can test whether SELinux is causing the permission denied by running setenforce 0. I would expect you have a labeling issue in /var/lib/containers; running restorecon -R -v /var/lib/containers would fix the labels. |
My bad - I totally forgot to mention that --read-only was being used. I would imagine that is typically the case for a production container: systemd artifacts would mostly be considered transient (tmpfs). At least that is how I run containers: everything is read-only except tmpfs mounts, and bind mounts are used for anything that must be permanent. |
I think that is what the --systemd option is meant for: enabling systemd to function within a mostly or completely immutable image. |
However, if this is not the common use case (treating stuff in /var as transient and discarded on container restart), then this is probably not a bug. |
Would just mounting the tmpfs on /var/log be enough to get this to work? |
So.. I played around with this a bit. It seems these are needed: /var/tmp is needed for systemd units which specify PrivateTmp (a popular setting), and /var/lib/systemd is needed for systemd's random_seed. Note that I did not test this exhaustively; there may be others. podman makes provisions for /var/lib/journal, but at least on CentOS/RHEL 7 that is not even a standard directory in a base install. This could vary with the systemd version and/or Linux distribution as well. Maybe there is a /var/lib/journal in RHEL 8; I have not had the chance to play with that yet. I am personally just sticking to mounting /var as tmpfs, but I understand why it makes sense to limit tmpfs mounts to only those required by systemd to function. |
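For anyone who prefers selective mounts over a tmpfs on all of /var, here is a minimal sketch that turns a list of paths into --tmpfs flags. The helper name tmpfs_flags, the variable SYSTEMD_WRITABLE_PATHS, and the image name my-systemd-image are made up for illustration. The path list only reflects what has been reported in this thread (/var/log comes from the wtmp/btmp errors in the original report) and is probably incomplete.

```shell
# Print one --tmpfs flag per path; the mount options are the first argument.
tmpfs_flags() {
  opts=$1
  shift
  for path in "$@"; do
    printf '%s\n' "--tmpfs=${path}:${opts}"
  done
}

# Paths reported in this thread as written to by systemd on CentOS 7.
# Assumption: not exhaustive, and may differ on other distributions.
SYSTEMD_WRITABLE_PATHS="/var/tmp /var/lib/systemd /var/log"

# Example invocation, commented out because the image name is hypothetical:
# podman run --read-only --systemd=true \
#   $(tmpfs_flags rw,nosuid,nodev $SYSTEMD_WRITABLE_PATHS) \
#   my-systemd-image /sbin/init
```

The unquoted command substitution in the example is deliberate, so that each flag becomes a separate argument; it would break on paths containing spaces.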
For what it is worth... I find the --systemd and --rootfs features of podman super useful - just passing that along since I know that running systemd inside a container or not using the formal /var/lib/containers location for storage is not considered by some to be the "right way". Just letting you know that there are users out there who appreciate these podman features. |
We ought to get some of the systemd maintainers in here and ask their opinions on how they'd like us to handle this. |
@poettering @lnykryn WDYT |
Do you have any steps to reproduce?
Failed to create cgroup /machine.slice/libpod-714a01a10a64d5fb72c394f064bdfe5dff78957e1da7484c2bb6a6a716bc48b3.scope/system.slice/dbus.service: Permission denied |
setsebool container_manage_cgroup true |
Hmm, so I just did "setenforce 0" (sorry @rhatdan) and now everything seems to work just fine. |
@lnykryn Setting the boolean above would fix it. The thing @caffeinejolt is trying to do is run systemd as PID 1 in a read-only container. So add --read-only to your podman command. We are looking for the directories where systemd expects to be able to write. |
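As a rough sketch of the boolean fix mentioned above, the following checks the current value before changing it, rather than disabling enforcement with setenforce 0. The function names are made up for illustration. It assumes an SELinux host with the getsebool and setsebool tools installed, and root privileges for the change.

```shell
# Succeed if a line of `getsebool` output on stdin reports the boolean as on.
# getsebool prints lines of the form "name --> on" or "name --> off".
boolean_is_on() {
  grep -q -- '--> on$'
}

# Turn on container_manage_cgroup only if it is currently off.
enable_cgroup_boolean() {
  if getsebool container_manage_cgroup | boolean_is_on; then
    echo "container_manage_cgroup is already on"
  else
    # -P makes the change persistent across reboots.
    setsebool -P container_manage_cgroup true
  fi
}
```

Calling enable_cgroup_boolean once on the host should be enough; the container then needs to be restarted for systemd to retry creating its cgroups.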
Sorry, I missed that part, that is an entirely different topic, that is not related just to containers. I would like to see the same thing also working on the bare metal, since there we are for the same use-case still using the ancient readonly-root from initscripts[1]. I think we need to discuss that with @poettering @keszybz [1] https://github.com/fedora-sysv/initscripts/blob/master/usr/libexec/readonly-root |
As for
There will be applications which expect to create file in /var/cache and so on. So providing a full-featured writeable Alternatives would be either make
|
/var/tmp probably should, but the other directories are up to the user to mount volumes in. If the unit file is running mariadb then it is up to the person packaging mariadb to make /var/log/mariadb and /var/lib/mariadb writeable to the container. |
Isn't systemd just looking to see if the |
That's the check that should apply in this case. https://www.freedesktop.org/wiki/Software/systemd/ContainerInterface/#environmentvariables describes What does
The stuff that @caffeinejolt lists in #2996 (comment) should be enough, at least for now. |
We would detect podman as container-other. Let's assign a name to it. Inspired by containers/podman#2996.
SYSTEMD_LOG_LEVEL=debug systemd-detect-virt |
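To see what the container manager actually advertises, one can look at the container variable in PID 1's environment, which is the mechanism the Container Interface page describes. This is a sketch only: the function name container_type is invented here, and reading /proc/1/environ normally requires root inside the container.

```shell
# Print the value of the "container" variable from a NUL-separated environ file.
# Defaults to PID 1's environment when no file is given.
container_type() {
  environ_file=${1:-/proc/1/environ}
  tr '\0' '\n' < "$environ_file" | sed -n 's/^container=//p'
}

# Example, to be run inside the container:
# container_type
```

An empty result would mean the variable is not set for PID 1, in which case systemd has to fall back to other detection heuristics.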
The assumption is that /var is writable really, it's already in the name... everything else may be read-only. I mean, we worked hard on getting /etc/mtab and /etc/resolv.conf out of /etc, so that /etc can be read-only, and we moved it to /var, but /var should really be writable. /tmp needs to be writable too, /run as well. otoh / itself, /etc, /usr can all be read-only. That said, you can get away with making /var/log/journal read-only if you turn off persistent storage of journald (using Storage=volatile in journald.conf). But in general: if it's not systemd itself that wants to write to /var, it's going to be some package you install in addition, since the general assumption is that you can write to /var. /run/lock should not be a separate fs, it should just be a regular subdir of /run. It's pretty much legacy, and there's nothing special about it really, it's just one subdir of /run among many others... It should not require any special hookup whatsoever. please remove any special mention of that dir in podman. |
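For image authors who want to try the Storage=volatile route, a minimal sketch follows. It writes a journald drop-in into an image root at build time. The function name and the file name volatile.conf are arbitrary, and it assumes the systemd version in the image reads drop-ins from /etc/systemd/journald.conf.d; older releases may only read the main journald.conf, in which case the setting would have to go there instead.

```shell
# Write a journald drop-in that keeps the journal in memory only.
# $1 is the root directory of the image being built.
write_volatile_journal_dropin() {
  dropin_dir="$1/etc/systemd/journald.conf.d"
  mkdir -p "$dropin_dir"
  cat > "$dropin_dir/volatile.conf" <<'EOF'
[Journal]
Storage=volatile
EOF
}

# Example: when building directly inside the image, the root is "/".
# write_volatile_journal_dropin /
```

This has to happen when the image is built, since the setting lives inside the image rather than in anything the container runtime controls.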
btw, this all reminds me of the "--volatile=" switch in nspawn. If you pick "--volatile=state" then this means that nspawn will mount the container image read-only as a whole, and then overmount /var with an empty tmpfs. Moreover /tmp and /run are mounted as tmpfs (as they always are, even without --volatile=). systemd as payload has been carefully tuned so that it can boot up fine with an entirely empty /var, and that everything it needs is automatically re-populated again via tmpfiles.d entries. It appears to me, that podman should just set things up the same way and make use of the fact that systemd is totally happy in such an environment. To summarize:
Which means:
|
btw, the requirements we make are documented here: https://www.freedesktop.org/wiki/Software/systemd/ContainerInterface/ — it's (mostly) up-to-date even |
Well, OCI containers are a little different: they can come with a pre-populated /var. For example, you might want a container with lots of static HTML pages in /var/www, and you don't want the container to be able to write there. Podman/runc does have the ability to copy the content underneath a tmpfs mount point up into the tmpfs. I think the change to podman would be to mount a tmpfs on /run, /tmp, and /var/tmp when run in --read-only mode, and then make sure the other directories where systemd needs to write are writable when it comes up. Changing the journald.conf entry is not really something we can do with podman, since we have no control over the image that is handed to us. |
I think we can close this one with the changes @rhatdan landed for mounting some directories tmpfs when the container is read-only. |
I think systemd continues to fail out of the box in read-only mode, due to the
Steps to reproduce:
FROM centos:centos8
RUN yum install -y httpd && systemctl enable httpd
EXPOSE 80
CMD [ "/sbin/init" ]
Expected: systemd should start without any warnings.
Actual: Many errors in the output, such as
Is there a way to redirect journal logs from systemd services inside the container to be in the host's journal instead? If journald didn't try to write to
Another option is to punt this to userland, and say that it's your responsibility to mount necessary paths as tmpfs if you plan to use systemd. However, my expectation is that at least journald should work out of the box, since it runs in virtually any systemd setup. |
Note: I am running RHEL 8 with Podman 1.4.2, the latest version from |
Why not volume mount /var/log into your container?
|
I can't duplicate your
|
If I had to guess here - we're only mounting a tmpfs on |
Recommended solution: Modify your Apache configuration to log elsewhere, or mount either a tmpfs or a volume at |
podman-0.12.1.2-2.git9551f6b.el7.centos.x86_64
centos-release-7-6.1810.2.el7.centos.x86_64
systemd-219-62.el7_6.5.x86_64
As noted in the podman-run man page for --systemd=true (default value): "podman will setup tmpfs mount points - /run, /run/lock, /tmp, /sys/fs/cgroup/systemd, /var/lib/journal"
However, running systemd this way results in numerous errors on startup (visible in podman-run when using "-i -t"):
[FAILED] Failed to start Load/Save Random Seed.
│See 'systemctl status systemd-random-seed.service' for details.
[FAILED] Failed to start Create Volatile Files and Directories.
│See 'systemctl status systemd-tmpfiles-setup.service' for details.
[FAILED] Failed to start Update UTMP about System Boot/Shutdown.
│See 'systemctl status systemd-update-utmp.service' for details.
It turns out that systemd needs some of what those failures allude to: it won't start services specified via ExecStart in unit files either, and such start attempts end in failures:
Failed to create file /var/log/wtmp: Permission denied
Failed to create file /var/log/btmp: Permission denied
Failed to write utmp record: Permission denied
systemd-update-utmp.service: main process exited, code=exited, status=1/FAILURE
Unit systemd-update-utmp.service has failed.
And if systemd unit files cannot be used to launch services, it undermines the value of having a --systemd option at all.
A work-around is to do something like --tmpfs=/var:rw,noexec,nosuid,nodev,size=524288000
With /var mounted as a tmpfs, everything works fine. It's possible the tmpfs mounts could be more selective (i.e. not covering all of /var). I tried adding just /var/log, and that got rid of some errors but not all of them. I am fine with /var as a tmpfs, and the systemd guys seem to do the same with systemd-nspawn containers: https://www.freedesktop.org/software/systemd/man/systemd-nspawn.html
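For reference, here is a small sketch of how the work-around flag can be built. The size value 524288000 is 500 MiB expressed in bytes. The helper names and the image name my-systemd-image are made up for illustration; the mount options are the ones from the work-around above.

```shell
# Convert MiB to bytes, since the size= option here is given as a byte count.
mib_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

# Build the --tmpfs flag for /var; $1 is the tmpfs size in MiB.
var_tmpfs_flag() {
  printf '%s\n' "--tmpfs=/var:rw,noexec,nosuid,nodev,size=$(mib_to_bytes "$1")"
}

# Example, commented out because the image name is hypothetical:
# podman run -i -t --read-only --systemd=true "$(var_tmpfs_flag 500)" \
#   my-systemd-image /sbin/init
```

Anything written under /var is lost when the container stops, so data that must persist still needs a bind mount or volume.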