Allow a CRI-O container's cgroup to be delegated to a user #5228

ajwock · 2021-08-19T19:40:23Z

Reference to discussion in 8/19/2021 weekly meeting.

Acceptance: When crio is using cgroups v2, there should be a way to specify a user that will own the root cgroup within the container. Low level details not yet determined.

My first step is to try to make a POC.

ajwock · 2021-08-19T19:44:30Z

And, to anyone who might know- since we actually have to change some filesystem ownership and permissions, does this require messing with the OCI, making these changes as a cri-o process, or simply passing the right configs to the OCI? Specifically, we need to make the changes indicated here to the cgroup that contains the container processes.

ajwock · 2021-08-19T20:01:49Z

Documentation note, quoted in this lkml link:

The delegater should not change the ownership of any of the
controller interfaces files (e.g., pids.max, memory.high) in
dlgt_grp. Those files are used from the next level above the
delegated subtree in order to distribute resources into the
subtree, and the delegatee should not have permission to change
the resources that are distributed into the delegated subtree.

Roman recently added /sys/kernel/cgroup/delegate and
/sys/kernel/cgroup/features. The former contains newline separated
list of cgroup files which should be chowned on delegation (in
addition to the directory itself) and the latter contains optional
features (currently only nsdelegate).

A 'proper' way to delegate cgroups would be to use the /sys/kernel/cgroup/delegate file to know which files to chown. These are the files that belong to the cgroup itself and not the parent, and thus should be controlled by the delegatee.

mrunalp · 2021-08-20T20:15:52Z

@giuseppe @kolyshkin ptal

kolyshkin · 2021-08-25T22:15:25Z

Since this is cgroup v2 and we do have cgroupns, this can probably be achieved by merely passing "nsdelegate" option to cgroupfs mount (and making it rw). I tried it briefly and it didn't work as expected, will continue tomorrow.

ajwock · 2021-08-25T23:27:52Z

I thought it might work like that when I read this here:

A cgroup can be delegated in two ways. First, to a less privileged user by granting write access of the directory and its “cgroup.procs”, “cgroup.threads” and “cgroup.subtree_control” files to the user. Second, if the “nsdelegate” mount option is set, automatically to a cgroup namespace on namespace creation.

But when I try to parse that second sentence out, it doesn't really make much sense. The first sentence is saying it's delegating to a user when you grant write access to those files. The second sentence is saying it's delegating to a namespace, not a user, and in my knowledge of cgroups, that doesn't make sense. I haven't found any sources other than this documentation that backs this up as correct. What does delegating to a namespace mean?

From what I can gather on this lkml thread, it seems that nsdelegate only makes namespaces into a stronger security boundary but it doesn't change the permissions on the files, and nothing short of actually changing those permissions has worked for me.

ajwock · 2021-08-25T23:39:31Z

I wouldn't guess that doing this would be possible without changing the ownership of the delegation files over to the delegatee user.

giuseppe · 2021-08-26T08:13:03Z

is fine for your use case to use user namespaces as well? In this case the cgroup could be owned by root in the user namespace (crun always chowns the cgroup to root in the container) and still it maps to a different user on the host.

ajwock · 2021-08-26T10:32:49Z

I think that delegating to a userns root would be the ideal solution, though I would have to wait for kubernetes user namespace support. A non-userns delegation would be useful as a stop-gap or an alternative design option.

Edit:
This made me go do some reading and I found your name. Is there already some sort of kubernetes-transparent support for userns on CRI-O?
kubernetes/enhancements#2101 (comment)

IMO this is a step too far unless the FS-related issues are addressed. As a
final goal, it sounds great. But it suffers the same problems described
elsewhere - the inside space and the outside space are the same size, which
means you can't GUARANTEE that it works for any arbitrary 2^32 UID.

in CRI-O we punch holes for runAsUser/runAsGroup/fsGroup and map them back to their value on the host, so anything that relies on these values will still keep working.

We don't handle yet arbitrary IDs in the image files set to higher values than what is available in the inner user namespace, e.g. given an allocation of 65k IDs, a file owned by 200000:200000 will show up as owned by the overflow ID.

I think this issue could also be addressed in the runtime/kubelet. The available IDs in the userns can be allocated in a non contiguous way to please the files present in the image. What is preventing to address this issue right now in CRI-O is that the IDs must be allocated when the sandbox is created and the containers/images running in that pod are not known in this phase.

Another possibility would be to create a per-pod user namespace and then each container runs in their own sub user namespace, child of the pod user namespace; but that would probably require changes in the OCI runtime as well.

giuseppe · 2021-08-26T12:33:21Z

This made me go do some reading and I found your name. Is there already some sort of kubernetes-transparent support for userns on CRI-O?

yes :) there is some kind of transparent support for user namespaces in CRI-O using annotations that was added with #3944

The downside is that user namespaces are still expensive, as we need to create a clone of the image to chown the files. There is support in the kernel now for doing it at runtime, but we are still not supporting it.

ajwock · 2021-08-26T12:59:58Z

I'll see if I can figure out how to make that work and get back. Sounds promising!

ajwock · 2021-08-27T13:49:15Z

cri-o/server/sandbox_run_linux.go

Line 98 in ee8e721

    
           return nil, errors.New("cannot use uidmapping or gidmapping if not running as root")

@giuseppe Will it be possible to do this without being root in the securitycontext in the future?

Or, when you use this annotation and runAsRoot in security context, are you still only the container's root and not the host's?
Also, when you get the chance- what kernel support are you referring to? Sounds like a good read.

ajwock · 2021-08-27T19:38:13Z

So, I'm not that well versed in uidmapping: When I use

io.kubernetes.cri-o.userns-mode: "private:uidmapping=0:57500:5000"

where 57500 is my host uid

and

runAsUser: 0

I show up in the container as 5000. How would I map root in the container as my host uid? When I tried to make the last number '1', I got this:

Error: error creating an ID-mapped copy of layer "7555a8182c42c7737a384cfe03a3c7329f646a3bf389c4bcd75379fc85e6c144": exit status 1: error during chown: error mapping container ID pair idtools.IDPair{UID:0, GID:42} for "etc/gshadow" to host: Container ID 42 cannot be mapped to a host ID

haircommander · 2021-08-27T19:59:10Z

have you tried just using auto

ajwock · 2021-08-27T20:01:07Z

Tried it just now-

here's my annotation- do I need more?
io.kubernetes.cri-o.userns-mode: "auto"

Failed to create pod sandbox: rpc error: code = Unknown desc = error creating pod sandbox with name "k8s_crio-test2_default_cadfbde1-5d73-43b9-b30a-2bd550a9c2b2_0": could not find enough available IDs

haircommander · 2021-08-27T20:11:44Z

what's
cat /etc/sub?id

ajwock · 2021-08-27T20:14:33Z

Empty. I'm guessing it's not supposed to be?

haircommander · 2021-08-27T20:15:35Z

yeah I would have
containers:200000:65536
in each
/etc/subgid and /etc/subuid

ajwock · 2021-08-31T10:25:52Z

Had to change to a different kubernetes cluster because I was previously using a heavily modified kind cluster. But thanks for the help. I can confirm that as pod root within a namespace, I can create and configure cgroups. I'll need to do a little more research before I'm sure this works with other components of the program.

Okay, had some spare cycles today to try all of this out more. I've managed to configure up a system where this works. Will do more testing to see if this works with my use case.

m-yosefpor · 2021-10-20T20:21:07Z

Actually there is an ongoing effort to make this happen in OCI: opencontainers/runtime-spec#1123
Also you can find an implementation in runc here: opencontainers/runc#3057

m-yosefpor · 2021-10-20T20:25:09Z

This made me go do some reading and I found your name. Is there already some sort of kubernetes-transparent support for userns on CRI-O?

yes :) there is some kind of transparent support for user namespaces in CRI-O using annotations that was added with #3944

The downside is that user namespaces are still expensive, as we need to create a clone of the image to chown the files. There is support in the kernel now for doing it at runtime, but we are still not supporting it.

Exactly, we tested userns and it heavily impacts pod start time (increased from a few seconds to 30-60 seconds)

frasertweedale · 2021-10-22T07:02:13Z

Yes, based on my development nothing special needs to be done in CRI-O at this point. It is possible (though I'm hoping not) that the runc behaviour might land behind an OCI annotation to enable it, in which case either CRI-O needs to add the annotation, or the user must add the annotation to the Pod spec (CRI-O propagates these annotations to the OCI config).

frasertweedale · 2021-12-10T02:55:21Z

runc PR opencontainers/runc#3057 was merged. It will automatically chown the cgroupfs to the container process UID when particular conditions are met, as now specified in the OCI runtime-spec: https://github.com/opencontainers/runtime-spec/blob/main/config-linux.md#cgroup-ownership.

Accordingly there is nothing more to be done here. I leave to a CRI-O maintainer to close this ticket.

ajwock · 2021-12-16T14:25:37Z

Wanted to come back and say that I've had a chance to test this stack, and it works perfectly! I'm able to get the cgroup hierarchy delegated to a user on the host namespace with the cgroup2-mount-hierarchy-rw annotation in cri-o set to "true" and with default settings on the newest (built from github source) runc. Perhaps in the future we can get a setting to the cgroup.max.descendants for the base of the hierarchy and remove the need for an annotation.

ajwock closed this as completed Dec 16, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow a CRI-O container's cgroup to be delegated to a user #5228

Allow a CRI-O container's cgroup to be delegated to a user #5228

ajwock commented Aug 19, 2021

ajwock commented Aug 19, 2021 •

edited

ajwock commented Aug 19, 2021

mrunalp commented Aug 20, 2021

kolyshkin commented Aug 25, 2021

ajwock commented Aug 25, 2021 •

edited

ajwock commented Aug 25, 2021

giuseppe commented Aug 26, 2021

ajwock commented Aug 26, 2021 •

edited

giuseppe commented Aug 26, 2021

ajwock commented Aug 26, 2021

ajwock commented Aug 27, 2021 •

edited

ajwock commented Aug 27, 2021

haircommander commented Aug 27, 2021

ajwock commented Aug 27, 2021

haircommander commented Aug 27, 2021

ajwock commented Aug 27, 2021

haircommander commented Aug 27, 2021

ajwock commented Aug 31, 2021 •

edited

m-yosefpor commented Oct 20, 2021 •

edited

m-yosefpor commented Oct 20, 2021

frasertweedale commented Oct 22, 2021

frasertweedale commented Dec 10, 2021

ajwock commented Dec 16, 2021

Allow a CRI-O container's cgroup to be delegated to a user #5228

Allow a CRI-O container's cgroup to be delegated to a user #5228

Comments

ajwock commented Aug 19, 2021

ajwock commented Aug 19, 2021 • edited

ajwock commented Aug 19, 2021

mrunalp commented Aug 20, 2021

kolyshkin commented Aug 25, 2021

ajwock commented Aug 25, 2021 • edited

ajwock commented Aug 25, 2021

giuseppe commented Aug 26, 2021

ajwock commented Aug 26, 2021 • edited

giuseppe commented Aug 26, 2021

ajwock commented Aug 26, 2021

ajwock commented Aug 27, 2021 • edited

ajwock commented Aug 27, 2021

haircommander commented Aug 27, 2021

ajwock commented Aug 27, 2021

haircommander commented Aug 27, 2021

ajwock commented Aug 27, 2021

haircommander commented Aug 27, 2021

ajwock commented Aug 31, 2021 • edited

m-yosefpor commented Oct 20, 2021 • edited

m-yosefpor commented Oct 20, 2021

frasertweedale commented Oct 22, 2021

frasertweedale commented Dec 10, 2021

ajwock commented Dec 16, 2021

ajwock commented Aug 19, 2021 •

edited

ajwock commented Aug 25, 2021 •

edited

ajwock commented Aug 26, 2021 •

edited

ajwock commented Aug 27, 2021 •

edited

ajwock commented Aug 31, 2021 •

edited

m-yosefpor commented Oct 20, 2021 •

edited