Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow a CRI-O container's cgroup to be delegated to a user #5228

Closed
ajwock opened this issue Aug 19, 2021 · 23 comments
Closed

Allow a CRI-O container's cgroup to be delegated to a user #5228

ajwock opened this issue Aug 19, 2021 · 23 comments

Comments

@ajwock
Copy link
Contributor

ajwock commented Aug 19, 2021

Reference to discussion in 8/19/2021 weekly meeting.

Acceptance: When crio is using cgroups v2, there should be a way to specify a user that will own the root cgroup within the container. Low level details not yet determined.

My first step is to try to make a POC.

@ajwock
Copy link
Contributor Author

ajwock commented Aug 19, 2021

And, to anyone who might know- since we actually have to change some filesystem ownership and permissions, does this require messing with the OCI, making these changes as a cri-o process, or simply passing the right configs to the OCI? Specifically, we need to make the changes indicated here to the cgroup that contains the container processes.

@ajwock
Copy link
Contributor Author

ajwock commented Aug 19, 2021

Documentation note, quoted in this lkml link:

The delegater should not change the ownership of any of the
controller interfaces files (e.g., pids.max, memory.high) in
dlgt_grp. Those files are used from the next level above the
delegated subtree in order to distribute resources into the
subtree, and the delegatee should not have permission to change
the resources that are distributed into the delegated subtree.

Roman recently added /sys/kernel/cgroup/delegate and
/sys/kernel/cgroup/features. The former contains newline separated
list of cgroup files which should be chowned on delegation (in
addition to the directory itself) and the latter contains optional
features (currently only nsdelegate).

A 'proper' way to delegate cgroups would be to use the /sys/kernel/cgroup/delegate file to know which files to chown. These are the files that belong to the cgroup itself and not the parent, and thus should be controlled by the delegatee.

@mrunalp
Copy link
Member

mrunalp commented Aug 20, 2021

@giuseppe @kolyshkin ptal

@kolyshkin
Copy link
Collaborator

Since this is cgroup v2 and we do have cgroupns, this can probably be achieved by merely passing "nsdelegate" option to cgroupfs mount (and making it rw). I tried it briefly and it didn't work as expected, will continue tomorrow.

@ajwock
Copy link
Contributor Author

ajwock commented Aug 25, 2021

I thought it might work like that when I read this here:

A cgroup can be delegated in two ways. First, to a less privileged user by granting write access of the directory and its “cgroup.procs”, “cgroup.threads” and “cgroup.subtree_control” files to the user. Second, if the “nsdelegate” mount option is set, automatically to a cgroup namespace on namespace creation.

But when I try to parse that second sentence out, it doesn't really make much sense. The first sentence is saying it's delegating to a user when you grant write access to those files. The second sentence is saying it's delegating to a namespace, not a user, and in my knowledge of cgroups, that doesn't make sense. I haven't found any sources other than this documentation that backs this up as correct. What does delegating to a namespace mean?

From what I can gather on this lkml thread, it seems that nsdelegate only makes namespaces into a stronger security boundary but it doesn't change the permissions on the files, and nothing short of actually changing those permissions has worked for me.

@ajwock
Copy link
Contributor Author

ajwock commented Aug 25, 2021

I wouldn't guess that doing this would be possible without changing the ownership of the delegation files over to the delegatee user.

@giuseppe
Copy link
Member

is fine for your use case to use user namespaces as well? In this case the cgroup could be owned by root in the user namespace (crun always chowns the cgroup to root in the container) and still it maps to a different user on the host.

@ajwock
Copy link
Contributor Author

ajwock commented Aug 26, 2021

I think that delegating to a userns root would be the ideal solution, though I would have to wait for kubernetes user namespace support. A non-userns delegation would be useful as a stop-gap or an alternative design option.

Edit:
This made me go do some reading and I found your name. Is there already some sort of kubernetes-transparent support for userns on CRI-O?
kubernetes/enhancements#2101 (comment)

IMO this is a step too far unless the FS-related issues are addressed. As a
final goal, it sounds great. But it suffers the same problems described
elsewhere - the inside space and the outside space are the same size, which
means you can't GUARANTEE that it works for any arbitrary 2^32 UID.

in CRI-O we punch holes for runAsUser/runAsGroup/fsGroup and map them back to their value on the host, so anything that relies on these values will still keep working.

We don't handle yet arbitrary IDs in the image files set to higher values than what is available in the inner user namespace, e.g. given an allocation of 65k IDs, a file owned by 200000:200000 will show up as owned by the overflow ID.

I think this issue could also be addressed in the runtime/kubelet. The available IDs in the userns can be allocated in a non contiguous way to please the files present in the image. What is preventing to address this issue right now in CRI-O is that the IDs must be allocated when the sandbox is created and the containers/images running in that pod are not known in this phase.

Another possibility would be to create a per-pod user namespace and then each container runs in their own sub user namespace, child of the pod user namespace; but that would probably require changes in the OCI runtime as well.

@giuseppe
Copy link
Member

This made me go do some reading and I found your name. Is there already some sort of kubernetes-transparent support for userns on CRI-O?

yes :) there is some kind of transparent support for user namespaces in CRI-O using annotations that was added with #3944

The downside is that user namespaces are still expensive, as we need to create a clone of the image to chown the files. There is support in the kernel now for doing it at runtime, but we are still not supporting it.

@ajwock
Copy link
Contributor Author

ajwock commented Aug 26, 2021

I'll see if I can figure out how to make that work and get back. Sounds promising!

@ajwock
Copy link
Contributor Author

ajwock commented Aug 27, 2021

return nil, errors.New("cannot use uidmapping or gidmapping if not running as root")

@giuseppe Will it be possible to do this without being root in the securitycontext in the future?

Or, when you use this annotation and runAsRoot in security context, are you still only the container's root and not the host's?
Also, when you get the chance- what kernel support are you referring to? Sounds like a good read.

@ajwock
Copy link
Contributor Author

ajwock commented Aug 27, 2021

So, I'm not that well versed in uidmapping: When I use

io.kubernetes.cri-o.userns-mode: "private:uidmapping=0:57500:5000"

where 57500 is my host uid

and

runAsUser: 0

I show up in the container as 5000. How would I map root in the container as my host uid? When I tried to make the last number '1', I got this:

Error: error creating an ID-mapped copy of layer "7555a8182c42c7737a384cfe03a3c7329f646a3bf389c4bcd75379fc85e6c144": exit status 1: error during chown: error mapping container ID pair idtools.IDPair{UID:0, GID:42} for "etc/gshadow" to host: Container ID 42 cannot be mapped to a host ID

@haircommander
Copy link
Member

have you tried just using auto

@ajwock
Copy link
Contributor Author

ajwock commented Aug 27, 2021

Tried it just now-

here's my annotation- do I need more?
io.kubernetes.cri-o.userns-mode: "auto"

Failed to create pod sandbox: rpc error: code = Unknown desc = error creating pod sandbox with name "k8s_crio-test2_default_cadfbde1-5d73-43b9-b30a-2bd550a9c2b2_0": could not find enough available IDs

@haircommander
Copy link
Member

what's
cat /etc/sub?id

@ajwock
Copy link
Contributor Author

ajwock commented Aug 27, 2021

Empty. I'm guessing it's not supposed to be?

@haircommander
Copy link
Member

yeah I would have
containers:200000:65536
in each
/etc/subgid and /etc/subuid

@ajwock
Copy link
Contributor Author

ajwock commented Aug 31, 2021

Had to change to a different kubernetes cluster because I was previously using a heavily modified kind cluster. But thanks for the help. I can confirm that as pod root within a namespace, I can create and configure cgroups. I'll need to do a little more research before I'm sure this works with other components of the program.

Okay, had some spare cycles today to try all of this out more. I've managed to configure up a system where this works. Will do more testing to see if this works with my use case.

@m-yosefpor
Copy link

m-yosefpor commented Oct 20, 2021

Actually there is an ongoing effort to make this happen in OCI: opencontainers/runtime-spec#1123
Also you can find an implementation in runc here: opencontainers/runc#3057

@m-yosefpor
Copy link

This made me go do some reading and I found your name. Is there already some sort of kubernetes-transparent support for userns on CRI-O?

yes :) there is some kind of transparent support for user namespaces in CRI-O using annotations that was added with #3944

The downside is that user namespaces are still expensive, as we need to create a clone of the image to chown the files. There is support in the kernel now for doing it at runtime, but we are still not supporting it.

Exactly, we tested userns and it heavily impacts pod start time (increased from a few seconds to 30-60 seconds)

@frasertweedale
Copy link
Contributor

Yes, based on my development nothing special needs to be done in CRI-O at this point. It is possible (though I'm hoping not) that the runc behaviour might land behind an OCI annotation to enable it, in which case either CRI-O needs to add the annotation, or the user must add the annotation to the Pod spec (CRI-O propagates these annotations to the OCI config).

@frasertweedale
Copy link
Contributor

runc PR opencontainers/runc#3057 was merged. It will automatically chown the cgroupfs to the container process UID when particular conditions are met, as now specified in the OCI runtime-spec: https://github.com/opencontainers/runtime-spec/blob/main/config-linux.md#cgroup-ownership.

Accordingly there is nothing more to be done here. I leave to a CRI-O maintainer to close this ticket.

@ajwock
Copy link
Contributor Author

ajwock commented Dec 16, 2021

Wanted to come back and say that I've had a chance to test this stack, and it works perfectly! I'm able to get the cgroup hierarchy delegated to a user on the host namespace with the cgroup2-mount-hierarchy-rw annotation in cri-o set to "true" and with default settings on the newest (built from github source) runc. Perhaps in the future we can get a setting to the cgroup.max.descendants for the base of the hierarchy and remove the need for an annotation.

@ajwock ajwock closed this as completed Dec 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants