podman-run --systemd=true insufficient to properly run systemd #2996

Closed
caffeinejolt opened this issue Apr 23, 2019 · 30 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

@caffeinejolt

podman-0.12.1.2-2.git9551f6b.el7.centos.x86_64
centos-release-7-6.1810.2.el7.centos.x86_64
systemd-219-62.el7_6.5.x86_64

As noted in the podman-run man page for --systemd=true (default value): "podman will setup tmpfs mount points - /run, /run/lock, /tmp, /sys/fs/cgroup/systemd, /var/lib/journal".

However, running systemd this way results in numerous errors on startup (visible in podman-run when using "-i -t"):

[FAILED] Failed to start Load/Save Random Seed.
│See 'systemctl status systemd-random-seed.service' for details.
[FAILED] Failed to start Create Volatile Files and Directories.
│See 'systemctl status systemd-tmpfiles-setup.service' for details.
[FAILED] Failed to start Update UTMP about System Boot/Shutdown.
│See 'systemctl status systemd-update-utmp.service' for details.

It turns out that systemd needs some of what those failures allude to, since it won't actually start services specified in unit files via ExecStart either; such start attempts end in failures:

Failed to create file /var/log/wtmp: Permission denied
Failed to create file /var/log/btmp: Permission denied
Failed to write utmp record: Permission denied
systemd-update-utmp.service: main process exited, code=exited, status=1/FAILURE
Unit systemd-update-utmp.service has failed.

And if systemd unit files cannot be used to launch services, that undermines the value of having a --systemd option at all.

A work-around is to do something like --tmpfs=/var:rw,noexec,nosuid,nodev,size=524288000

With /var mounted as a tmpfs, everything works fine. It's possible the tmpfs mounts could be more selective (i.e. not covering all of /var). I tried adding just /var/log, and that got rid of some errors, but not all of them. I am fine with /var as a tmpfs, and the systemd guys seem to do the same with systemd-nspawn containers: https://www.freedesktop.org/software/systemd/man/systemd-nspawn.html
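
A minimal sketch of the workaround above as a complete command line. The function only prints the command rather than running it. The helper name `var_tmpfs_cmd` and the image argument are illustrative assumptions; the flags are the ones named in this thread (--read-only, which the reporter clarifies further down, --systemd=true, and the --tmpfs option quoted above).

```shell
#!/bin/sh
# Print (do not execute) a podman command line for a read-only systemd
# container with all of /var on a tmpfs.
# $1: image name (placeholder, not taken from the original report)
# $2: init binary inside the image (defaults to the path used in this thread)
var_tmpfs_cmd() {
    image=$1
    init=${2:-/usr/lib/systemd/systemd}
    printf '%s\n' "podman run -it --read-only --systemd=true --tmpfs=/var:rw,noexec,nosuid,nodev,size=524288000 $image $init"
}
```

Calling `var_tmpfs_cmd centos` prints the command for review; piping the output to `sh` would run it.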

@mheon mheon added the kind/bug Categorizes issue or PR as related to a bug. label Apr 23, 2019
@rhatdan
Member

rhatdan commented Apr 23, 2019

I think you have issues in your labeling. systemd should be able to write anywhere within the container image.

Could you remove the tmpfs and check whether you have an issue with SELinux?

You could see if SELinux is causing the permission denied via setenforce 0 to test it.

I would expect you have a labeling issue in /var/lib/containers

restorecon -R -v /var/lib/containers

Would fix the labels.

@caffeinejolt
Author

my bad - i totally forgot to mention --read-only was being used - but that is typically the case i would imagine for a production container - systemd artifacts would mostly be considered transient (tmpfs) - or at least that is how i run containers - everything is read-only except tmpfs stuff and bind mounts are used for anything that must be permanent

@caffeinejolt
Author

caffeinejolt commented Apr 23, 2019

which i think is what the --systemd option seems to be meant for - to enable systemd to function within a mostly/completely immutable image

@caffeinejolt
Author

however.. if this is not the common use case scenario (treating stuff in /var as transient/discarded on restarting the container) - then this is probably not a bug

@rhatdan
Member

rhatdan commented Apr 24, 2019

Would just mounting the tmpfs on /var/log be enough to get this to work?

@caffeinejolt
Author

So.. I played around with this a bit - it seems these are needed:

/var/tmp
/var/log
/var/lib/systemd

/var/tmp is needed for systemd units which specify PrivateTmp (a popular setting). /var/lib/systemd is needed for systemd's random_seed. It should be noted that I did not exhaustively test this - there may be others.

podman makes provisions for /var/lib/journal, but at least on centos/rhel 7 that is not even a standard directory in a base install. This could vary based on systemd version and/or linux distribution as well. Maybe there is a /var/lib/journal in rhel8; I have not had the chance to play with that yet.

I am personally just sticking to mounting /var as tmpfs, but I understand why it makes sense to limit tmpfs mounts to those only required by systemd to function - I suppose.
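
The narrower set of mounts listed above could be expressed as repeated --tmpfs flags. This is a sketch only: `tmpfs_flags` is a hypothetical helper, and the directory list is the one from this comment, which was not exhaustively tested.

```shell
#!/bin/sh
# Turn a list of directories into a string of podman --tmpfs flags.
# Prints one line; prints an empty line when no directories are given.
tmpfs_flags() {
    flags=""
    for dir in "$@"; do
        flags="$flags --tmpfs=$dir"
    done
    # Strip the single leading space left by the loop.
    printf '%s\n' "${flags# }"
}
```

The result could then be spliced into a command such as `podman run --read-only --systemd=true $(tmpfs_flags /var/tmp /var/log /var/lib/systemd) IMAGE /usr/lib/systemd/systemd`, where IMAGE is a placeholder.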

@caffeinejolt
Author

For what it is worth... I find the --systemd and --rootfs features of podman super useful - just passing that along since I know that running systemd inside a container or not using the formal /var/lib/containers location for storage is not considered by some to be the "right way". Just letting you know that there are users out there who appreciate these podman features.

@mheon
Member

mheon commented Apr 24, 2019

We ought to get some of the systemd maintainers in here and ask their opinions on how they'd like us to handle this.

@rhatdan
Member

rhatdan commented Apr 24, 2019

@poettering @lnykryn WDYT

@lnykryn

lnykryn commented Apr 24, 2019

Do you have any steps to reproduce?
With podman-1.0.0-2.git921f98f.module+el8.0.0+2958+4e823551.x86_64
I've tried "podman run -i -t --systemd=true centos /usr/lib/systemd/systemd" and I've run into a completely different set of error messages, mostly about cgroupfs not being writable

Failed to create cgroup /machine.slice/libpod-714a01a10a64d5fb72c394f064bdfe5dff78957e1da7484c2bb6a6a716bc48b3.scope/system.slice/dbus.service: Permission denied

@caffeinejolt
Author

> Do you have any steps to reproduce?
> With podman-1.0.0-2.git921f98f.module+el8.0.0+2958+4e823551.x86_64
> I've tried "podman run -i -t --systemd=true centos /usr/lib/systemd/systemd" and I've run into a completely different set of error messages, mostly about cgroupfs not being writable
>
> Failed to create cgroup /machine.slice/libpod-714a01a10a64d5fb72c394f064bdfe5dff78957e1da7484c2bb6a6a716bc48b3.scope/system.slice/dbus.service: Permission denied

setsebool container_manage_cgroup true

@lnykryn

lnykryn commented Apr 24, 2019

Hmm, so I just did "setenforce 0" (sorry @rhatdan) and now everything seems to work just fine.
https://paste.fedoraproject.org/paste/Ey9hEhZ6~~cmdCXf5rUczA

@rhatdan
Member

rhatdan commented Apr 24, 2019

@lnykryn Setting the boolean above would fix it.

The thing @caffeinejolt is trying to do is run systemd as pid 1 in a read-only container.

So add --read-only to your podman command. We are looking for the directories where systemd expects to be able to write.
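
One way to hunt for those directories is to probe candidate paths from inside the running container. The following is a rough sketch, not something used in this thread: `check_writable` is a hypothetical helper that tries to create and then remove a small probe file in each directory.

```shell
#!/bin/sh
# For each directory given, report whether a file can be created in it.
# The probe runs in a subshell so a failed redirection cannot abort the
# calling shell; its error message is discarded.
check_writable() {
    for dir in "$@"; do
        probe="$dir/.write-probe.$$"
        if ( : > "$probe" ) 2>/dev/null; then
            rm -f "$probe"
            echo "$dir: writable"
        else
            echo "$dir: not writable"
        fi
    done
}
```

Inside the container, something like `check_writable /run /tmp /var/tmp /var/log /var/lib/systemd` would list which of the paths discussed here accept writes. A missing directory is reported as not writable as well.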

@lnykryn

lnykryn commented Apr 24, 2019

Sorry, I missed that part. That is an entirely different topic, and it is not specific to containers. I would like to see the same thing working on bare metal too, since there we are still using the ancient readonly-root from initscripts[1] for the same use case. I think we need to discuss that with @poettering @keszybz

[1] https://github.com/fedora-sysv/initscripts/blob/master/usr/libexec/readonly-root

@keszybz

keszybz commented Apr 26, 2019

/var/tmp — this directory is generally needed and must be writeable. It's a standard linux interface and many things will make use of it, not just systemd.

As for /var/log and /var/lib — I think it's better to make them available, or even to make the whole /var available. In file-hierarchy(7) we say

/var/
Must be writable. Persistency is recommended, but optional, ...

There will be applications which expect to create files in /var/cache and so on. So providing a full-featured writeable /var will just make things smoother.

Alternatives would be to either make systemd-update-utmp.service somehow conditional on /var being writeable, or disable or mask that service in the image, or even compile systemd without it. None of those options seems nice: it should be possible to run a fully generic image without any errors, this is generally our goal with containers. And if systemd-update-utmp.service is compiled-in and enabled, it should work, and complain if it can't do its job. This service is not only a cosmetic thing: the files it creates are part of the compatibility layer for sysvinit, in particular for applications which want to know the "runlevel".

systemd-random-seed.service is a different story. It has ConditionVirtualization=!container. If it gets run, it seems systemd does not think it is running in a container. Please figure out why that is, and fix that, and then this service will not be started.

@rhatdan
Member

rhatdan commented Apr 26, 2019

/var/tmp probably should, but the other directories are up to the user to mount volumes in.
I suppose podman could react to --read-only flag and mount some of these, but I am looking for specifically what systemd would need to successfully run.

If the unit file is running mariadb then it is up to the person packaging mariadb to make /var/log/mariadb and /var/lib/mariadb writeable to the container.

@rhatdan
Member

rhatdan commented Apr 26, 2019

# podman run alpine printenv container
podman

Isn't systemd just looking to see if the container environment variable is enabled? Shouldn't this be enough to keep systemd-random-seed.service, with its ConditionVirtualization=!container, from being executed?

@keszybz

keszybz commented Apr 26, 2019

> Isn't systemd just looking to see if the container environment variable is enabled?

That's the check that should apply in this case. https://www.freedesktop.org/wiki/Software/systemd/ContainerInterface/#environmentvariables describes $container.
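
For illustration, a simplified sketch of the environment-variable check being discussed. It only reads `$container` from the current environment and reports its value. systemd's real detection logic does more than this, and `container_hint` is a hypothetical name.

```shell
#!/bin/sh
# Report the value of the "container" environment variable, or "none"
# when it is unset or empty. This is a simplification for illustration,
# not systemd's actual detection code.
container_hint() {
    if [ -n "${container:-}" ]; then
        printf '%s\n' "$container"
    else
        echo none
    fi
}
```

Run in a shell started by `podman run`, this would be expected to print the same value that `printenv container` showed earlier in the thread.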

What does SYSTEMD_LOG_LEVEL=debug systemd-detect-virt say under podman?

> but I am looking for specifically what systemd would need to successfully run.

The stuff that @caffeinejolt lists in #2996 (comment) should be enough, at least for now.

keszybz added a commit to keszybz/systemd that referenced this issue Apr 26, 2019
We would detect podman as container-other. Let's assign a name to it.
Inspired by containers/podman#2996.
@rhatdan
Member

rhatdan commented Apr 26, 2019

SYSTEMD_LOG_LEVEL=debug systemd-detect-virt
Found container virtualization container-other.
container-other

@poettering

The assumption is that /var is writable really, it's already in the name... everything else may be read-only. I mean, we worked hard on getting /etc/mtab and /etc/resolv.conf out of /etc, so that /etc can be read-only, and we moved it to /var, but /var should really be writable. /tmp needs to be writable too, /run as well. otoh / itself, /etc, /usr can all be read-only.

That said, you can get away with making /var/log/journal read-only if you turn off persistent storage of journald (using Storage=volatile in journald.conf).
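
Where the image is under your control, the Storage=volatile setting mentioned above could be shipped as a journald drop-in. A sketch, assuming the image's systemd reads drop-ins from /etc/systemd/journald.conf.d; the helper name and the file name volatile.conf are made up for illustration.

```shell
#!/bin/sh
# Write a journald drop-in that turns off persistent journal storage.
# $1: drop-in directory, e.g. /etc/systemd/journald.conf.d in the image
#     (assumed location; adjust for the systemd version in use)
write_volatile_dropin() {
    confdir=$1
    mkdir -p "$confdir" || return 1
    cat > "$confdir/volatile.conf" <<'EOF'
[Journal]
Storage=volatile
EOF
}
```

This would typically run at image build time, for example from a RUN step, so that journald keeps its logs under /run only.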

But in general: if it's not systemd itself that wants to write to /var, it's going to be some package you install in addition, since the general assumption is that you can write to /var.

/run/lock should not be a separate fs, it should just be a regular subdir of /run. It's pretty much legacy, and there's nothing special about it really, it's just one subdir of /run among many others... It should not require any special hookup whatsoever. please remove any special mention of that dir in podman.

@poettering

btw, this all reminds me of the "--volatile=" switch in nspawn. If you pick "--volatile=state" then this means that nspawn will mount the container image read-only as a whole, and then overmount /var with an empty tmpfs. Moreover /tmp and /run are mounted as tmpfs (as they always are, even without --volatile=). systemd as payload has been carefully tuned so that it can boot up fine with an entirely empty /var, and that everything it needs is automatically re-populated again via tmpfiles.d entries. It appears to me, that podman should just set things up the same way and make use of the fact that systemd is totally happy in such an environment. To summarize:

  1. /var writable tmpfs, as a whole, including /var/tmp
  2. /tmp writable tmpfs
  3. /run writable tmpfs
  4. everything else read-only, including /, /etc and /usr

Which means:

  1. drop podman's special handling of /run/lock
  2. drop podman's special handling of /var/lib/journal
  3. instead add special handling of /var as a whole

@poettering

btw, the requirements we make are documented here: https://www.freedesktop.org/wiki/Software/systemd/ContainerInterface/ — it's (mostly) up-to-date even

@rhatdan
Member

rhatdan commented Apr 26, 2019

Well, OCI containers are a little different: they can come with a pre-populated /var. For example, you might want a container with lots of html pages in /var/www that are static, and you don't want the container to be able to write there.

Podman/runc does have the ability to copy the content that sits underneath a tmpfs mount point up onto the tmpfs.

I think the change to podman would be to mount a tmpfs on /run, /tmp, and /var/tmp when run in --read-only mode.

Then make sure the other directories where systemd needs to write are writable when it comes up.

Changing the journald.conf entry is not really something we can do with podman, since we have no control over the image that is handed to us.

keszybz added a commit to keszybz/systemd that referenced this issue Apr 29, 2019
We would detect podman as container-other. Let's assign a name to it.
Inspired by containers/podman#2996.
@mheon
Member

mheon commented May 7, 2019

I think we can close this one with the changes @rhatdan landed for mounting some directories tmpfs when the container is read-only.

@mheon mheon closed this as completed May 7, 2019
edevolder pushed a commit to edevolder/systemd that referenced this issue Jun 26, 2019
We would detect podman as container-other. Let's assign a name to it.
Inspired by containers/podman#2996.
Yamakuzure pushed a commit to elogind/elogind that referenced this issue Sep 23, 2019
We would detect podman as container-other. Let's assign a name to it.
Inspired by containers/podman#2996.
@sffc

sffc commented Jan 2, 2020

I think systemd continues to fail out of the box in read-only mode, due to the /var permissions.

Steps to reproduce:

  1. Make a Dockerfile according to @rhatdan's blog post:

      FROM centos:centos8
      RUN yum install -y httpd && systemctl enable httpd
      EXPOSE 80
      CMD [ "/sbin/init" ]

  2. Build the image:

      # podman build -t init .

  3. Run the image in read-only mode:

      # podman run -it --read-only init

Expected: systemd should start without any warnings.

Actual: Many errors in the output, such as [FAILED] Failed to start Journal Service and [FAILED] Failed to start The Apache HTTP Server.

Is there a way to redirect journal logs from systemd services inside the container to be in the host's journal instead? If journald didn't try to write to /var/log, that would solve many of the startup errors. Of course, httpd will still try to write to /var/log, so maybe some other solution is needed.

Another option is to punt this to userland, and say that it's your responsibility to mount necessary paths as tmpfs if you plan to use systemd. However, my expectation is that at least journald should work out of the box, since it runs in virtually any systemd setup.

@sffc

sffc commented Jan 2, 2020

Note: I am running RHEL 8 with Podman 1.4.2, the latest version from sudo yum module install container-tools:rhel8.

@rhatdan
Member

rhatdan commented Jan 2, 2020

Why not volume mount /var/log into your container?

# mkdir /var/log/mylog
# podman run -it -v /var/log/mylog:/var/log:Z  --read-only init

@mheon
Member

mheon commented Jan 2, 2020

I can't duplicate your Failed to start Journal service result, but I do see httpd in the failed state.

Jan 02 14:45:52 4a86362ca6eb httpd[28]: AH00015: Unable to open logs seems the most likely error?

@mheon
Member

mheon commented Jan 2, 2020

If I had to guess here - we're only mounting a tmpfs on /var/log/journal and not /var/log itself. Apache wants to write logs to /var/log/httpd/ which is now on a read-only filesystem courtesy of --read-only and is failing as such.

@mheon
Member

mheon commented Jan 2, 2020

Recommended solution: Modify your Apache configuration to log elsewhere, or mount either a tmpfs or a volume at /var/log/httpd to allow Apache to log.
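
A sketch of the tmpfs variant of this recommendation, using the `init` image name from the reproduction steps above. The function prints the command instead of running it. `httpd_tmpfs_cmd` is a hypothetical helper, and other paths may also need to be writable depending on the image.

```shell
#!/bin/sh
# Print (do not execute) a podman command line that keeps the container
# read-only but gives Apache a writable tmpfs for its logs.
# $1: image name (defaults to "init", the tag built earlier in the thread)
httpd_tmpfs_cmd() {
    image=${1:-init}
    printf '%s\n' "podman run -it --read-only --tmpfs /var/log/httpd $image"
}
```

Logs written to the tmpfs are lost when the container stops. For persistent logs, a volume mount at the same path, in the style of the -v example earlier in the thread, would be the alternative.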

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 23, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 23, 2023