pursuing conventional systemd+podman interaction #6400

Closed
andrewgdunn opened this issue May 27, 2020 · 63 comments · Fixed by #6666
Labels: locked - please file new issue/PR, stale-issue

Comments

@andrewgdunn

This is an RFE after talking with @mheon for a bit in IRC (thanks for that, and sorry I kept you so late). In the shortest form I can think of, the enhancement would be: facilitate podman/conmon interacting with systemd in a way that provides console output to systemctl and journalctl. In bullet form (a rough sketch of this workflow follows the list):

  • create a "system" user (e.g. UID/GID less than 1000 by convention), set shell to /sbin/nologin
  • create a sub-UID and sub-GID mapping for that user
  • create a "system" level unit file (e.g. /etc/systemd/system/<unit>.service) that specifies that "system" user in User=.
  • systemctl start <unit>.service and be able to see the console output of the container
  • journalctl -u <unit>.service and be able to see the historical console output of the container
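A rough sketch of what that workflow could look like; everything here is invented for illustration (the account name, the subordinate ID range, and the unit are examples, not something podman ships):

# create the "system" account with no login shell
useradd --system --create-home --shell /sbin/nologin svc-app
# give it subordinate UID/GID ranges so rootless podman can map users
echo "svc-app:200000:65536" >> /etc/subuid
echo "svc-app:200000:65536" >> /etc/subgid
# system-level unit running podman as that user, e.g. /etc/systemd/system/podman-app.service:
#   [Service]
#   User=svc-app
#   ExecStart=/usr/bin/podman run --name app docker.io/library/alpine sleep infinity
systemctl daemon-reload
systemctl start podman-app.service     # the ask: console output visible here
journalctl -u podman-app.service       # ...and retrievable here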

My use case is that I want to use podman to run images that are essentially "system" services, but as a "user" because I want the rootless isolation. I've been consuming podman for a bit now (starting with 1.8.2) and am likely stuck on that version because newer versions break my approach: I lose all logging from the container. I have tried --log-driver=journald but have no idea how to find a hand-hold for the console output (what -u should I be looking for? It's not .service, it's not the container... and it's not podman-.scope). Basically podman doesn't provide the init system with a console hand-hold, so I'm flying blind.

Here is an example with mattermost; under 1.8.2 this works how I'd like it to (i.e. I'm getting console output). I'm doing some things differently than what podman generate systemd offers, but that's because my explicit goals are to:

  • run the container rootless under a "system" user
  • see what the heck is going on inside the container with my init system (rather than having to sudo -u <user> -h <home> podman logs <container-name>)
[root@vault ~]# systemctl cat podman-mattermost.service 
# /etc/systemd/system/podman-mattermost.service
[Unit]
Description=Podman running mattermost
Wants=network.target
After=network-online.target
Requires=podman-mattermost-postgres.service

[Service]
WorkingDirectory=/app/gitlab
User=gitlab
Group=gitlab
Restart=no
ExecStartPre=/usr/bin/rm -f %T/%N.pid %T/%N.cid
ExecStartPre=/usr/bin/podman rm --ignore -f mattermost
ExecStart=/usr/bin/podman run --conmon-pidfile %T/%N.pid --cidfile %T/%N.cid --cgroups=no-conmon \
  --name=mattermost \
  --env-file /app/gitlab/mattermost/mattermost.env \
  --publish 127.0.0.1:8065:8065 \
  --security-opt label=disable \
  --health-cmd=none \
  --volume /app/gitlab/mattermost/data:/mattermost/data \
  --volume /app/gitlab/mattermost/logs:/mattermost/logs \
  --volume /app/gitlab/mattermost/config:/mattermost/config \
  --volume /app/gitlab/mattermost/plugins:/mattermost/client/plugins \
  docker.io/mattermost/mattermost-team-edition:release-5.24
ExecStop=/usr/bin/podman stop --ignore mattermost -t 30
ExecStopPost=/usr/bin/podman rm --ignore -f mattermost
ExecStopPost=/usr/bin/rm -f %T/%N.pid %T/%N.cid
KillMode=none
Type=simple

[Install]
WantedBy=multi-user.target default.target
[root@vault ~]# systemctl cat podman-mattermost-postgres.service 
# /etc/systemd/system/podman-mattermost-postgres.service
[Unit]
Description=Podman running postgres for mattermost
Wants=network.target
After=network-online.target podman-mattermost.service
PartOf=podman-mattermost.service

[Service]
WorkingDirectory=/app/gitlab
User=gitlab
Group=gitlab
Restart=no
ExecStartPre=/usr/bin/rm -f %T/%N.pid %T/%N.cid
ExecStartPre=/usr/bin/podman rm --ignore -f postgres
ExecStart=/usr/bin/podman run --conmon-pidfile %T/%N.pid --cidfile %T/%N.cid --cgroups=no-conmon \
  --name=postgres \
  --env-file /app/gitlab/mattermost/postgres.env \
  --net=container:mattermost \
  --volume /app/gitlab/mattermost/postgres:/var/lib/postgresql/data:Z \
  docker.io/postgres:12
ExecStop=/usr/bin/podman stop --ignore postgres -t 30
ExecStopPost=/usr/bin/podman rm --ignore -f postgres
ExecStopPost=/usr/bin/rm -f %T/%N.pid %T/%N.cid
KillMode=none
Type=simple

[Install]
WantedBy=multi-user.target default.target

With these units above I am able to:

  • run as rootless as the "system" user (in this case I'm running both gitlab and mattermost)
  • see the console output (notice the Type=simple and lack of -d)
    • in both systemctl <unit> and journalctl -u <unit>
  • have the container instance be ephemeral (excessive ExecPre and ExecStop)
  • have a shared networking namespace so that mattermost and postgres can talk
  • have container level dependencies represented through the init system
    • podman-mattermost.service requires the podman-mattermost-postgres.service (Requires=)
    • podman-mattermost-postgres.service will get a stop signal if I stop podman-mattermost.service (PartOf=)
    • there are challenges here where podman-mattermost.service closes out the networking namespace before podman-mattermost-postgres.service can finish up (I think), so it's not ideal... I'd be interested in suggestions.

Tagging @lsm5 as well since I think for my use case I'm relegated to using 1.8.2 in F32 for the time being... so I am wondering whether that package is going away anytime soon?

@andrewgdunn
Author

In case I didn't state it clearly: I did try to adopt 1.9.2. It requires a couple of things (but ultimately does not work well). #6084 has some more information as well.

  • do the loginctl enable-linger on the "system" user
  • switch over to the more conventional things from podman generate systemd like -d, Type=forking

Starting this up, you can only see console output from the container by doing sudo -u <user> -h <home> podman logs <containername>, while systemctl/journalctl give you nothing.

Using --log-driver=journald doesn't make anything better... because I can't figure out which unit to actually query logs from (I think it might be some composite of the container ID?)... and when you do sudo -u <user> -h <home> podman logs <containername> you get nothing.

@lsm5
Member

lsm5 commented May 27, 2020

you can get 1.8.2-2 from https://koji.fedoraproject.org/koji/buildinfo?buildID=1479547

I'll save it to my fedorapeople page as well and send you the URL later.

@giuseppe
Member

If you enable linger mode and the user session is already running, is there any disadvantage in installing the .service file into ~/.config/systemd/user/?

@andrewgdunn
Author

andrewgdunn commented May 27, 2020

@giuseppe for you to be able to do that you'd need a shell for that "system" account. Above I'm creating the user as root with a /sbin/nologin shell. To access the systemctl --user session you'd actually need to log in, or you'd need to set the XDG_RUNTIME_DIR variable... I think... (it could also be DBUS_SESSION_BUS_ADDRESS), like XDG_RUNTIME_DIR=/run/user/$UID systemctl --user status. It generally gets messy.
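For reference, one way to poke at that user manager for a nologin account looks roughly like this (svc-app is a placeholder account, and this assumes lingering is enabled so the user manager is actually running):

loginctl enable-linger svc-app
sudo -u svc-app env XDG_RUNTIME_DIR=/run/user/$(id -u svc-app) systemctl --user status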

Also, this suggestion doesn't address what I'm primarily asking for above: I want console output from the running container to be visible to systemd/journald. Without being able to see the combination of:

  • systemd log output
  • podman log output
  • the console output of the running container

You have an extremely hard time figuring out what is going on with the system (you have to look in multiple places to piece together the state of errors).

@mheon
Member

mheon commented May 27, 2020

@vrothberg The core ask here (viewing logs for systemd-managed Podman) seems to be a pretty valid one - our current Type=forking approach does break this, and podman logs becomes very inconvenient when the services are running rootless and you have to sudo into each of them to get logs.

I was thinking that it ought to be possible for the journald log driver to write straight to the logs for the unit file if we know it, and we did add something similar for auto-update?

@vrothberg
Member

There was a very similar request by @lucab: coreos/fedora-coreos-docs#75 (comment)

I was also thinking about the log driver 👍

@rhatdan
Member

rhatdan commented Jun 9, 2020

@ashley-cui Could you look into the --log-driver changes?

@goochjj
Contributor

goochjj commented Jun 12, 2020

@storrgie I've been pursuing similar things recently.

Do -d, and keep the forking.
Enable --log-driver journald

That alone should take care of all container logs showing up in journald; you just need to do
journalctl CONTAINER_NAME=mattermost

As conmon will be providing those keys - CONTAINER_ID and CONTAINER_NAME. I've been doing lots of testing; basically what I've been doing is: start a container, generate the output to journald, then use journalctl -n 10 to grab the last 10 lines and find a line it logged, tweaking to 20 or 30 lines or whatever it takes. Then journalctl -n 10 -o json-pretty or -o json to get the raw line and figure out what other metadata you have to work with.

You could use CONTAINER_TAG too... i.e. add --log-opt tag=WhateverYouWant and find it with
journalctl CONTAINER_TAG=WhateverYouWant
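Putting that together, a minimal sequence might look like this (container name and tag are just examples):

podman run -d --name mattermost --log-driver journald --log-opt tag=mm \
  docker.io/mattermost/mattermost-team-edition:release-5.24
journalctl CONTAINER_NAME=mattermost                       # key attached by conmon
journalctl CONTAINER_TAG=mm                                # key from --log-opt tag=
journalctl CONTAINER_NAME=mattermost -n 10 -o json-pretty  # see all available fields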

If you want it to show under the unit, like I do, I do this:
--cgroup-parent=/system.slice/%n --cgroup-manager cgroupfs

Note, my container is root, not rootless, and the host is running Flatcar. My guess is you can get similar results by possibly tweaking the cgroup-parent. By putting the processes under the cgroup, systemd finds that they're associated with a unit - but I'd expect conmon being in the correct cgroup SHOULD be all you need.

The added benefit of running all the processes in the systemd service's cgroup is that a bind-mounted /dev/log ALSO associates to the unit, automagically. You don't get the automagic CONTAINER_NAME from conmon's journald records, but you DO get anything you put in the service file as a LogExtraField - so you could use that to find your logs as well.
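For completeness, the systemd directive being referred to is LogExtraFields= (systemd.exec); a sketch of stamping a unit's own journal entries with a searchable field (the field name and value here are arbitrary examples):

[Service]
LogExtraFields=CONTAINER_NAME=mattermost

journalctl CONTAINER_NAME=mattermost should then also match anything the unit (including a bind-mounted /dev/log) writes to the journal.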

@TravisBowers

I'm running rootless containers on Fedora Server. I'm able to see logs using --log-opt tag=<tag> and journalctl CONTAINER_TAG=<tag>. However, when I add --cgroup-parent=/system.slice/%n --cgroup-manager cgroupfs, my units fail with result 'exit-code'. @rhatdan, are they failing because they're rootless?

@mheon
Member

mheon commented Jun 12, 2020

I really do not recommend running --cgroup-manager=cgroupfs with systemd-managed Podman - you end up with both systemd and Podman potentially altering the same cgroup, and I think there's the potential for them to trample each other. If you want to stay in the systemd cgroup, I'd recommend using the crun OCI runtime and passing --cgroups=disabled to prevent Podman from creating a container cgroup. We lose the ability to set resource limits, but you can just set them from within the systemd unit, so it's not a big loss.
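A sketch of that recommendation in unit form (the limits and names here are examples, not from this thread):

[Service]
# systemd owns the unit cgroup and enforces the limits
MemoryMax=2G
CPUQuota=200%
# podman/crun is told not to create a container cgroup at all
ExecStart=/usr/bin/podman run --runtime crun --cgroups=disabled --name mattermost \
  docker.io/mattermost/mattermost-team-edition:release-5.24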

@mheon
Member

mheon commented Jun 12, 2020

(There is also --cgroups=no-conmon to only place Conmon in the systemd cgroup - we use that by default in unit files from podman generate systemd)

@andrewgdunn
Author

andrewgdunn commented Jun 12, 2020

I see traffic on the mailing list from @rhatdan about an FAQ... I'm feeling more and more, as I learn about this project, that the idea that this can "replace" docker is basically gimmicky at this stage. There is no clear golden pathway for running containers as daemons on systems with podman+systemd. It seems fraught with edge cases. I'd really love to see this ticket taken seriously, as I think there are a LOT of people trying to depart docker land, and systemd+podman is a way to rid yourself of the monolithic docker daemon.

@mheon
Member

mheon commented Jun 12, 2020

I think we definitely need a single page containing everything we recommend about running containers inside units (best practices, and the reasons for them). I've probably explained why we made the choice for forking vs simple five times at this point; having a single page with a definitive answer on that would be greatly helpful to everyone. We'll need to hash some things out as part of this, especially the use of rootless Podman + root systemd as this issue asks, but even getting the basics written down would be a start.

@lucab
Member

lucab commented Jun 13, 2020

@mheon that would indeed help, but I'm not sure that's going to solve much. For example, from the thread at coreos/fedora-coreos-docs#75, that content currently exists in the form of a blog post which unfortunately is:

  • already stale at this point (podman-generate does not generate that unit anymore)
  • not really integrating well with systemd service handling (e.g. journald, sd-notify, user setting, etc)
  • somehow concerning/fragile (e.g. KillMode=none)

I think it would be better to first devise a podman mode which works well when integrated in the systemd ecosystem, and only then document it.

As a sidenote, many containerized services (eg. etcd, haproxy, etc.) do use sd-notify in order to signal when they are actually initialized and ready to start serving requests. For that kind of autoscale-friendly logic to work, a Type=notify service unit would be required.
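To make that ask concrete, a notify-type unit for such a container could look roughly like this; it is only a sketch, the image name is a placeholder, and it assumes a mechanism for handing the notify socket to the container (e.g. the --sdnotify work mentioned later in this thread):

[Service]
Type=notify
NotifyAccess=all
ExecStart=/usr/bin/podman run --name etcd --sdnotify=container registry.example.com/etcd:latest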

@mheon
Member

mheon commented Jun 13, 2020 via email

@mheon
Member

mheon commented Jun 13, 2020 via email

@goochjj
Contributor

goochjj commented Jun 13, 2020

We got the User setting working; it was mainly a problem with -d, no? Unless there's something else outstanding, I think that's solved. Similarly, the journald log driver works well for me... unless you try to log a tty, which would be a bad idea anyway, now that exec is fixed.

Systemd integration isn't great with docker either: docker's log driver is exactly analogous to what conmon does; docker's containers are launched by the daemon, which puts them in another cgroup unless you use cgroup-parent tricks; and sometimes getting the container to work right w.r.t. logging and cgroups requires hacks like systemd-docker, which throws a hacky shim around sd-notify. So are we really saying podman+systemd is somehow worse? Or just not better? Because it seems better to me. It doesn't seem like Docker has a golden pathway either.

I've run docker w/ cgroup-parent sharing the unit's cgroup and systemd-docker (even though it's unsupported) for over a year, and haven't had any problems with systemd and docker fighting. I'm not sure why podman would... but I defer to the experts.

The only thing I have with docker now that I don't have with podman is bind mounting /dev/log works - because I put the docker container in the same cgroup as the unit. Without that, I'd need some sort of syslog proxy, which would probably have to live in conmon, and is a whole other discussion and probably only relevant to me.

@vrothberg
Member

@mheon that would indeed help, but I'm not sure that's going to solve much. For example, from the thread at coreos/fedora-coreos-docs#75, that content currently exists in the form of a blog post which unfortunately is:

* already stale at this point (podman-generate does not generate that unit anymore)

That's not accurate. We just updated the blog post last week and do that regularly. The units are still generated the same way. Once Podman v2 is out, we need to create some upstream docs as a living document and point the blog post there.

* not really integrating well with systemd service handling (e.g. journald, sd-notify, user setting, etc)

We only support Type=forking with podman generate systemd.

* somehow concerning/fragile (e.g. `KillMode=none`)

We've been discussing that already in depth. We want Podman to handle shutdown (and killing) and prevent signal races with systemd which does not know the order in which all processes should be killed.

I think it would be better to first devise a podman mode which works well when integrated in the systemd ecosystem, and only then document it.

As a sidenote, many containerized services (eg. etcd, haproxy, etc.) do use sd-notify in order to signal when they are actually initialized and ready to start serving requests. For that kind of autoscale-friendly logic to work, a Type=notify service unit would be required.

Type=notify is supported but we don't generate them with podman generate systemd. I guess this could be part of an upstream doc?

@vrothberg
Member

I think we definitely need a single page containing everything we recommend about running containers inside units (best practices, and the reasons for them). I've probably explained why we made the choice for forking vs simple five times at this point; having a single page with a definitive answer on that would be greatly helpful to everyone. We'll need to hash some things out as part of this, especially the use of rootless Podman + root systemd as this issue asks, but even getting the basics written down would be a start.

I agree and came to a similar conclusion last week when working with support on some issues. Once v2 is out (and all fixes are in), I'd love for us to create a living upstream document that the blog post can link to.

@vrothberg
Member

I opened #6604 to break out the logging discussion.

@lucab
Member

lucab commented Jun 15, 2020

@vrothberg thanks! I shouldn't have piled up more topics in here, sorry for that.
If you prefer, I can split the other ones (e.g. sd-notify) to their own tickets, so they can be incrementally closed as soon as we are done.

@vrothberg
Member

No worries at all, @lucab! All input and feedback is much appreciated.

If you prefer, I can split the other ones (e.g. sd-notify) to their own tickets, so they can be incrementally closed as soon as we are done.

That would be great, sure. While we support sd-notify, we don't generate these types. Having a dedicated issue will help us agree on what such a unit should look like and eventually get that into the upstream docs (and man pages). Thanks a lot!

@goochjj
Contributor

goochjj commented Jun 17, 2020

Since we're having this discussion, and there's plenty of talk about KillMode, cgroups, and where things should reside - it seems to me that podman's integration with systemd already has a blueprint: systemd-nspawn. The systemd-nspawn@.service unit includes things like:

KillMode=mixed
Delegate=yes
Slice=machine.slice

This means (among other things) you end up with
/machine.slice/unit.service/supervisor - which contains the systemd-nspawn ("conmon"-esque) process, and
/machine.slice/unit.service/payload - which contains the contained processes

And systemd has no problem monitoring the supervisor PID, I'm guessing because Delegate is set and it's a sub-cgroup.

nspawn has options like --slice, --property, --register, and --keep-unit - probably all of which should be implemented similarly in podman... and the caveats are already spelled out in the documentation.

https://www.freedesktop.org/software/systemd/man/systemd-nspawn.html

nspawn also has options for the journal - how it's bind mounted and supported, plus setting the machine ID properly for those logs... etc.

I'd imagine we'd want nspawn to be the template?

@goochjj
Contributor

goochjj commented Jun 17, 2020

And doing Delegate and sub-cgroups like that also means systemctl status knows the Main PID is the supervisor but still shows the full process tree, including the payload, clearly in the status output; the service type is notify (sd-notify), so I imagine it's talking back to systemd to let it know these things.

@goochjj
Contributor

goochjj commented Jun 17, 2020

For that matter, I've wondered if it's possible to use/wrap/hack/mangle something into place to allow systemd-nspawn itself to be the OCI container runtime, instead of crun or runc. It's more of a thought experiment than anything else, but the key hangup seems to be that nspawn wants a specific mount to use, which podman can provide since it has already done all the work to create the appropriate overlay bind mount.

Probably involves reading config.json and turning it into command line arguments? I'm unclear separation-wise which parts of the above fit into which parts of the execution lifecycle.

@mheon
Member

mheon commented Jun 17, 2020

There was talk about making nspawn accept OCI specs, even that may not be necessary. I don't know how well it would interface with Conmon though.

On the Delegate change - I'd have to think more about what this means for containers which forward host cgroups into the container (we'll need a way to guarantee that the entire unit cgroup isn't forwarded). I also think we'll need to ensure that the container remembers it was started with cgroupfs, so that other Podman commands launched from outside the unit file that require cgroups (e.g. podman stats) still work.

@giuseppe
Member

To simulate what nspawn does, we'd need to tell the OCI runtime to use the cgroup already created by conmon instead of creating a new one.

The next crun version will automatically create a /container sub-cgroup in the same way nspawn does.

I think we can go a step further and get closer to what nspawn does by having a single cgroup for the conmon + container payload.

giuseppe added a commit to giuseppe/libpod that referenced this issue Jun 18, 2020
add a new cgroup mode, conmon-delegated.

When running under systemd there is no need to create yet another
cgroup for the container.

With conmon-delegated the current cgroup will be split in two sub
cgroups:

- supervisor
- container

The supervisor cgroup will hold conmon and the podman process, while
the container cgroup is used by the OCI runtime (using the cgroupfs
backend).

Closes: containers#6400

Depends on: containers/crun#409

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
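Based on that commit description (the option ends up being discussed as --cgroups split later in this thread), the intended shape is roughly the following; a sketch with an invented unit and image:

[Service]
Delegate=yes
ExecStart=/usr/bin/podman run --cgroups=split --name app docker.io/library/alpine sleep infinity

# expected sub-cgroups under the unit, per the commit message:
#   .../app.service/supervisor   <- conmon (and podman)
#   .../app.service/container    <- the container payload (OCI runtime, cgroupfs backend)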
@goochjj
Contributor

goochjj commented Jun 30, 2020

@jdoss If you're using SELinux, I suggest you compile and place the crun binary in /usr/local/bin, as that folder is recognized in the policy. If you're going to have a local podman, runc, or crun, it should live there and be chcon'd to match, i.e. chcon --reference=/usr/bin/crun /usr/local/bin/crun

In /etc/containers/containers.conf:

runtime = "crun"

[engine.runtimes]
crun = [ "/usr/local/bin/crun" ]

Or specify it on the command line as @giuseppe indicated.
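For example (the image is just a placeholder), pointing podman at the locally built binary directly:

podman --runtime /usr/local/bin/crun run --rm -it docker.io/library/alpine sh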

@jdoss
Contributor

jdoss commented Jun 30, 2020

@goochjj and @giuseppe I just compiled crun from master and put it in /usr/local/bin/crun, and I'm still getting the same error:

# /usr/local/bin/crun --version
crun version 0.13.227-d38b
commit: d38b8c28fc50a14978a27fa6afc69a55bfdd2c11
spec: 1.0.0
+SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
Jun 30 15:48:12 mycool mycool-elasticsearch[65963]: [conmon:d]: failed to write to /proc/self/oom_score_adj: Permission denied
Jun 30 15:48:12 mycool conmon[65963]: conmon c0cf8da55a1936150298 <ndebug>: failed to write to /proc/self/oom_score_adj: Permission denied
Jun 30 15:48:12 mycool conmon[65964]: conmon c0cf8da55a1936150298 <ninfo>: attach sock path: /tmp/run-1001/libpod/tmp/socket/c0cf8da55a1936150298fdf9608ad04154a9de89c774c0fc93e2809b489a97b7/attach
Jun 30 15:48:12 mycool conmon[65964]: conmon c0cf8da55a1936150298 <ninfo>: addr{sun_family=AF_UNIX, sun_path=/tmp/run-1001/libpod/tmp/socket/c0cf8da55a1936150298fdf9608ad04154a9de89c774c0fc93e2809b489a97b7/attach}
Jun 30 15:48:12 mycool conmon[65964]: conmon c0cf8da55a1936150298 <ninfo>: terminal_ctrl_fd: 13
Jun 30 15:48:12 mycool conmon[65964]: conmon c0cf8da55a1936150298 <ninfo>: winsz read side: 15, winsz write side: 15
Jun 30 15:48:12 mycool conmon[65965]: conmon c0cf8da55a1936150298 <nwarn>: Failed to chown stdin
Jun 30 15:48:12 mycool conmon[65964]: conmon c0cf8da55a1936150298 <error>: Failed to create container: exit status 1
Jun 30 15:48:12 mycool mycool-elasticsearch[65952]: time="2020-06-30T15:48:12Z" level=debug msg="Received: -1"
Jun 30 15:48:12 mycool mycool-elasticsearch[65952]: time="2020-06-30T15:48:12Z" level=debug msg="Cleaning up container c0cf8da55a1936150298fdf9608ad04154a9de89c774c0fc93e2809b489a97b7"
Jun 30 15:48:12 mycool mycool-elasticsearch[65952]: time="2020-06-30T15:48:12Z" level=debug msg="unmounted container \"c0cf8da55a1936150298fdf9608ad04154a9de89c774c0fc93e2809b489a97b7\""
Jun 30 15:48:12 mycool mycool-elasticsearch[65952]: time="2020-06-30T15:48:12Z" level=debug msg="ExitCode msg: \"cannot set limits without cgroups: oci runtime error\""
Jun 30 15:48:12 mycool mycool-elasticsearch[65952]: Error: cannot set limits without cgroups: OCI runtime error
Jun 30 15:48:12 mycool systemd[1]: mycool-elasticsearch.service: Control process exited, code=exited, status=126/n/a

@goochjj
Contributor

goochjj commented Jun 30, 2020

Add --pids-limit 0 to your run args

@goochjj
Contributor

goochjj commented Jun 30, 2020

Wait, you're on cgroups v2 now? I don't have that problem under cgroups v2 rootless. What does cat /proc/self/cgroup show?

@jdoss
Contributor

jdoss commented Jun 30, 2020

--pids-limit 0 does let the containers start, but yea, I booted FCOS into cgroups v2 with rootless here. I have a non-root user mycool that is being used via systemd to launch these containers.

[core@mycool ~]$ cat /proc/self/cgroup
0::/user.slice/user-1000.slice/session-1.scope

@goochjj
Contributor

goochjj commented Jun 30, 2020

I can't get the infra container to start, because you're binding to ports 80 and 443 as non-root...

@goochjj
Contributor

goochjj commented Jun 30, 2020

Setting /proc/sys/net/ipv4/ip_unprivileged_port_start gets around that.
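i.e. something like the following at runtime; jdoss's Ignition snippet further down makes the same setting persistent:

sysctl net.ipv4.ip_unprivileged_port_start=0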

@goochjj
Contributor

goochjj commented Jun 30, 2020

Hmm and there it is

If I remove your --pod it works

@jdoss
Contributor

jdoss commented Jun 30, 2020

- path: /etc/sysctl.d/90-ip-unprivileged-port-start.conf
  mode: 0644
  contents:
    inline: |
      net.ipv4.ip_unprivileged_port_start = 0

To allow the pod to bind to those ports.

@goochjj
Contributor

goochjj commented Jun 30, 2020

I think it's because you're using a pod.

When I run this as the user, rootless, I get this:

Pod creates:
/user.slice/user-(uid).slice/user@(uid).service/user.slice/user-libpod_pod_(podid).slice/libpod-(infracid).scope/container
/user.slice/user-(uid).slice/user@(uid).service/user.slice/user-libpod_pod_(podid).slice/libpod-conmon-(infracid).scope

Container (without split) creates:
/user.slice/user-(uid).slice/user@(uid).service/user.slice/user-libpod_pod_(podid).slice/libpod-(escid).scope/container
/user.slice/user-(uid).slice/user@(uid).service/user.slice/user-libpod_pod_(podid).slice/libpod-conmon-(escid).scope

Through Systemd as the user, I get this:
Pod creates:
/user.slice/user-(uid).slice/user@(uid).service/user.slice/user-libpod_pod_(podid).slice/libpod-(infracid).scope/container
/system.slice/mycool-pod.service

Container (without split) creates:
/user.slice/user-(uid).slice/user@(uid).service/user.slice/user-libpod_pod_(podid).slice/libpod-(escid).scope/container
/system.slice/mycool-elasticsearch.service

@goochjj
Contributor

goochjj commented Jun 30, 2020

TLDR, @giuseppe would have to modify/extend another PR to handle pods.

It looks like when a container is spawned in a pod, it assumes its parent slice will be the parent cgroup path (which is reasonable). Since pod create doesn't have a --cgroups split option, the pod's conmon is attached to the service cgroup, and the pod's slice is in the user slice, divorced from the service's cgroup.

You can't simultaneously have a service (i.e. elasticsearch) be part of the unit's service, and also the pod's slice. Nor can you have a second systemd unit muck around with the pod's cgroup - that's probably a bad idea.

What's your desired outcome here, @jdoss? Something like this:

/system.slice/mycool-pod.service/supervisor -> pod conmon
/system.slice/mycool-pod.service/container -> infra container
/system.slice/mycool-elasticsearch.service/supervisor -> conmon
/system.slice/mycool-elasticsearch.service/container -> ES processes

But then ALL the pod's services aren't contained in a slice.

Right now it's
/system.slice/mycool-pod.service -> pod conmon
/(user's systemd service)/user.slice/user-libpod_pod_(podid).slice/libpod-(cid).scope/container -> infra procs
/system.slice/mycool-elasticsearch.service -> conmon
/(user's systemd service)/user.slice/user-libpod_pod_(podid).slice/libpod-(cid).scope/container -> elasticsearch procs

Is this insufficient in some way?

@goochjj
Contributor

goochjj commented Jun 30, 2020

Or maybe we should do this in a more systemd-like way?

i.e. Slice=machines-mycool_pod.slice

Pod
/machines.slice/machines-mycool_pod.slice/mycool-pod.service/supervisor -> pod conmon
/machines.slice/machines-mycool_pod.slice/mycool-pod.service/container -> infra container
/machines.slice/machines-mycool_pod.slice/mycool-elasticsearch.service/supervisor -> conmon
/machines.slice/machines-mycool_pod.slice/mycool-elasticsearch.service/container -> ES processes

Then everything is properly in a parent slice - is this what we'd want split to do with pods?

If so, --cgroups split would have to be set at the pod create level, and child services would have to know split was passed so they don't inherit the cgroup-parent of the pod.

@goochjj
Contributor

goochjj commented Jun 30, 2020

--pids-limit 0 does let the containers start, but yea, I booted FCOS into cgroups v2 with rootless here. I have a non-root user mycool that is being used via systemd to launch these containers.

[core@mycool ~]$ cat /proc/self/cgroup
0::/user.slice/user-1000.slice/session-1.scope

@giuseppe I don't know what's causing this - but there are times when I need to set --pids-limit 0. It seems like there's a default pids-limit of 2048 coming from somewhere - not the config file and not the command line - and when crun sees it can't do cgroups with a pids limit, it throws the runtime error.

If you happen to get the cgroup right - i.e. it's something crun can modify and it has a pids controller, then the error isn't present.

@jdoss
Contributor

jdoss commented Jun 30, 2020

@goochjj I am trying to set things up so I can have many pods running under rootless users via systemd units with the User= directive, with each application stack running as rootless containers inside its pod. Having everything in its own pod namespace as a rootless user is pretty great: I don't need to juggle ports for each application stack, just the pod ports. I also like the isolation pods give each application stack deployment.

Since FCOS doesn't support user systemd units via Ignition, I have to set them up as system units. Which is fine, since I prefer system units over user units anyway, to prevent them from being modified by non-root users.
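For context, a rough sketch of one of those system units (the names are taken from the log excerpt above, the image is a placeholder, and this glosses over the pod/cgroup layout questions being discussed); the pod itself would be created by a separate mycool-pod.service:

[Service]
User=mycool
Group=mycool
ExecStart=/usr/bin/podman run --pod mycool-pod --name mycool-elasticsearch --log-driver journald \
  registry.example.com/elasticsearch:latest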

@goochjj
Contributor

goochjj commented Jun 30, 2020

Right, but all this works for you without --cgroups split, correct? Is there something you're hoping to gain with --cgroups split?

@mheon
Member

mheon commented Jun 30, 2020

The pids-limit is probably Podman automatically trying to set the maximum available for that rlimit - we should code that to only happen if cgroups are present.

@jdoss
Contributor

jdoss commented Jun 30, 2020

@goochjj I was running FCOS with cgroups v1 until I saw this thread introduce --cgroups split, so I started down the road of giving it a try with cgroups v2. My old setup that works on FCOS with cgroups v1 doesn't work at all on FCOS with cgroups v2 without setting --pids-limit 0.

I am not trying to gain anything specific by using --cgroups split. I thought it would help provide a better setup for my use case.

@goochjj
Contributor

goochjj commented Jul 1, 2020

@mheon I'm unclear on why cgroups aren't present... let alone that default.

It's really annoying, and seems to be cgroups v1 specific. Should I create this as a separate issue?

@mheon
Member

mheon commented Jul 1, 2020

I believe that's a requirement forced on us by cgroups v1 not being safe for rootless use, unless I'm greatly misunderstanding?

@goochjj
Contributor

goochjj commented Jul 1, 2020

@mheon I'm fine with that, as long as it doesn't explicitly require me to --pids-limit 0 everything, which it's currently doing.

This code

118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 302)            // then ignore the settings.  If the caller asked for a
118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 303)            // non-default, then try to use it.
118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 304)            setPidLimit := true
118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 305)            if rootless.IsRootless() {
118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 306)                    cgroup2, err := cgroups.IsCgroup2UnifiedMode()
118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 307)                    if err != nil {
118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 308)                            return nil, err
118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 309)                    }
4352d58549 (Daniel J Walsh    2020-03-27 10:13:51 -0400 310)                    if (!cgroup2 || (runtimeConfig != nil && runtimeConfig.Engine.CgroupManager != cconfig.SystemdCgroupsManager)) && config.Resources.PidsLimit == sysinfo.GetDefaultPidsLimit() {
118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 311)                            setPidLimit = false
118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 312)                    }
118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 313)            }
118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 314)            if setPidLimit {
118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 315)                    g.SetLinuxResourcesPidsLimit(config.Resources.PidsLimit)
118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 316)                    addedResources = true
118cf1fc63 (Daniel J Walsh    2019-09-14 06:21:10 -0400 317)            }

in pkg/spec/spec.go seems to indicate it should already be ignoring the default on cgroups v1. I'm digging.

@goochjj
Contributor

goochjj commented Jul 1, 2020

Cuz this isn't great.

(focal)mrwizard@FocalCG1Dev:~/src/podman
$ podman run --rm -it alpine sh
Error: cannot set limits without cgroups: OCI runtime error

@mheon
Member

mheon commented Jul 1, 2020

This is definitely a bug. Is this 2.0? pkg/spec is deprecated, we've moved to pkg/specgen/generate - so the offending code likely lives there.

@goochjj
Contributor

goochjj commented Jul 1, 2020

2.1.0-dev. Actually, master plus my sdnotify work.

So, sounds like I should create a new issue.
:-D

@goochjj
Contributor

goochjj commented Jul 1, 2020

#6834

@github-actions

github-actions bot commented Aug 1, 2020

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Aug 4, 2020

Fixed in master.

@rhatdan rhatdan closed this as completed Aug 4, 2020