Add support for user namespaces in stateless pods (KEP-127) #7679
Conversation
Hi @rata. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
It would be great if we could add this to the 1.7 milestone too :)
Force-pushed from 5a28185 to b507641
Thanks for working on this! I will definitely take some time to look at this soon. I've added the 1.7 milestone as we clearly want to have support for the new CRI API changes for the userns KEP in containerd in our next major release.
We are working on a feature that depends on user namespaces, so we are very interested in your patch. I downloaded and tried your patch, but I haven't been able to enable user namespaces. When I follow the instructions here (https://kubernetes.io/docs/tasks/configure-pod-container/user-namespaces/) and add the hostUsers: false key, I still get no user namespace created. In fact, if I check the bundle that containerd creates, there is no reference to user namespaces whatsoever. What are we doing wrong?
@asierHuawei That is good to know, thanks! Are you sure you enabled the feature gate on all k8s components (not just the kubelet)? If that doesn't do the trick, something else simple must be missing. Let's continue this on slack: I'm rata in k8s slack, Rodrigo Campos or rata in CNCF slack.
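For reference, when user namespaces are enabled the generated OCI config in the bundle should contain a user namespace and ID mappings. A minimal sketch built with the runtime-spec Go types (the mapping values are illustrative; real values come from the kubelet):

```go
package main

import (
	"encoding/json"
	"fmt"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

func main() {
	s := specs.Spec{Linux: &specs.Linux{}}
	// The user namespace itself.
	s.Linux.Namespaces = append(s.Linux.Namespaces, specs.LinuxNamespace{
		Type: specs.UserNamespace,
	})
	// One contiguous mapping of 65536 IDs (illustrative values).
	m := specs.LinuxIDMapping{ContainerID: 0, HostID: 65536, Size: 65536}
	s.Linux.UIDMappings = []specs.LinuxIDMapping{m}
	s.Linux.GIDMappings = []specs.LinuxIDMapping{m}

	out, _ := json.MarshalIndent(s.Linux, "", "  ")
	// Entries like these should appear in the bundle's config.json.
	fmt.Println(string(out))
}
```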
As previously discussed with @rata, I was able to verify the proper behavior of this patch for unprivileged pods. I may add some feedback later on once I review the code changes in more detail.
Could you add integration tests too?
@rata I was able to get it working by enabling the feature gates. In our tests, everything seems to work. So far so good.
return "", err | ||
} | ||
// TODO(dgl): length isn't taken into account here yet either. | ||
if err := remapRootFS(ctx, mounts, hostUID, hostGID); err != nil { |
There's equivalent functionality already present in lower-level runtimes, which would be impacted by a solution like this one that unilaterally dictates how to handle the rootfs ownership. For example, the Sysbox runtime, in the absence of idmapped-mount support for overlayfs (kernel < 5.19), relies on shiftfs and/or chown-metacopy to considerably expedite this operation.
Would it be possible to expose a config knob that overrides this id-remap feature, to allow users to decide where to run this logic? This knob could be deprecated once snapshotters support more efficient id-remap approaches (probably after 5.19+).
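(For context, a rough sketch of what such an id-remap amounts to; remapRootFSSketch is a hypothetical stand-in, not the PR's actual helper:)

```go
package main

import (
	"io/fs"
	"os"
	"path/filepath"
	"syscall"
)

// remapRootFSSketch walks the rootfs and shifts every inode's owner by the
// base of the host mapping. Chowning each file is what makes this expensive
// on large images, hence the wish to delegate it to the runtime.
func remapRootFSSketch(root string, hostUID, hostGID uint32) error {
	return filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		st := info.Sys().(*syscall.Stat_t)
		// Shift the owner into the pod's host ID range; Lchown avoids
		// following symlinks.
		return os.Lchown(path, int(st.Uid+hostUID), int(st.Gid+hostGID))
	})
}

func main() {
	_ = remapRootFSSketch("/var/lib/rootfs", 65536, 65536) // illustrative path/IDs
}
```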
Yes, makes sense; it would essentially be an option to say that snapshotters support the "remap-ids" capability even if they don't declare it (and it's handled elsewhere).
(The only downside I see is that things outside containerd cannot see this capability, so e.g. if a snapshotter decides whether to report the capability based on the kernel version, something like Sysbox may have to infer that too. Given this would be a user config, though, I think that's fine. I'll have a think about that and add something here soon.)
> Yes, makes sense; it would essentially be an option to say that snapshotters support the "remap-ids" capability even if they don't declare it (and it's handled elsewhere).

Yes, I think that would work. The snapshotter wouldn't do anything in this case and would rely on lower-level runtimes for this operation.

> (The only downside I see is things outside containerd cannot see this capability, so e.g. if a snapshotter is deciding whether it reports the capability based on the kernel version or not, something like sysbox may have to infer that too...

Agreed, we are already doing something like this.
Thanks @dgl.
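A minimal sketch of the decision being discussed; the helper and config names are hypothetical, only the "remap-ids" capability string comes from the thread above:

```go
package main

import "fmt"

// snapshotterHandlesRemap reports whether containerd can skip its own rootfs
// chown: either the snapshotter declares the "remap-ids" capability, or a
// hypothetical user config override says the remap is handled elsewhere
// (e.g. by a runtime like Sysbox).
func snapshotterHandlesRemap(declaredCaps []string, configOverride bool) bool {
	if configOverride {
		return true
	}
	for _, c := range declaredCaps {
		if c == "remap-ids" {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(snapshotterHandlesRemap([]string{"remap-ids"}, false)) // true: declared
	fmt.Println(snapshotterHandlesRemap(nil, true))                    // true: overridden by config
}
```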
The method to set up CNI looks good to me. We can double-check that the runc-init process is still alive after mounting the netns. The shim watches for the exit event of runc-init, so this is reliable; at least to me, the shim is acting like a pidfd.
But this approach seems not to be available to the kata-containers/gVisor runtimes. I don't think it is a blocker: the Sandbox API currently refactors the existing CRI plugin and should work with other runtimes, so this is just like host networking for kata-containers. Not a blocker.
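To illustrate the analogy: a pidfd gives the same race-free liveness handle that the shim gets by watching exit events. A small sketch using golang.org/x/sys/unix (Linux 5.3+; the pid value is illustrative):

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	pid := 12345 // illustrative: the runc-init pid
	fd, err := unix.PidfdOpen(pid, 0)
	if err != nil {
		fmt.Println("process already gone:", err)
		return
	}
	defer unix.Close(fd)

	// Signal 0 performs the liveness/permission check without delivering
	// anything; a pidfd cannot be reused by a recycled pid.
	if err := unix.PidfdSendSignal(fd, 0, nil, 0); err != nil {
		fmt.Println("process exited:", err)
		return
	}
	fmt.Println("process is still alive")
}
```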
Grr, forgot to click "submit review". Now submitting
Force-pushed from 202292a to bd9a8d9
```go
	return nil, err
}

if len(resp.Plugins) <= 0 {
```
nit: len can't be < 0
LGTM
I think there are two follow-up items:
- Consider adding validation that the container's userns settings equal the pod's (see the sketch after this list).
- Add a CNI failpoint test case for the userns scenario.
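A minimal sketch of the first follow-up item, using illustrative local types rather than the exact CRI messages:

```go
package main

import "fmt"

// IDMapping mirrors the shape of a userns ID mapping (illustrative local
// type, not the exact CRI message).
type IDMapping struct {
	ContainerID, HostID, Length uint32
}

// sameUserNamespace reports whether a container's mappings match its pod
// sandbox's; the suggested validation would reject a mismatch.
func sameUserNamespace(pod, ctr []IDMapping) bool {
	if len(pod) != len(ctr) {
		return false
	}
	for i := range pod {
		if pod[i] != ctr[i] {
			return false
		}
	}
	return true
}

func main() {
	m := []IDMapping{{ContainerID: 0, HostID: 65536, Length: 65536}}
	fmt.Println(sameUserNamespace(m, m)) // true: identical mappings
}
```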
```go
		err: true,
	},
} {
	t.Run(desc, func(t *testing.T) {
```
please consider using the following pattern, in case the cases ever run in parallel:

```go
desc := desc
test := test
t.Run(desc, func(t *testing.T) {
})
```
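A complete version of the suggested pattern might look like this (the test body is illustrative): re-binding the loop variables makes each parallel subtest capture its own copy, which is needed before Go 1.22's per-iteration loop variables.

```go
package example

import (
	"strings"
	"testing"
)

func TestLower(t *testing.T) {
	for desc, test := range map[string]struct{ in, want string }{
		"upper": {in: "ABC", want: "abc"},
		"mixed": {in: "AbC", want: "abc"},
	} {
		desc, test := desc, test // shadow loop vars so the closure doesn't share them
		t.Run(desc, func(t *testing.T) {
			t.Parallel()
			if got := strings.ToLower(test.in); got != test.want {
				t.Errorf("ToLower(%q) = %q, want %q", test.in, got, test.want)
			}
		})
	}
}
```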
Sure, I can do that to avoid Go issues with closures if we parallelize! FYI, this was copied from other parts of the code, so there are more instances of this pattern missing elsewhere.
Should I push that amendment to this PR? I'm OK with doing a follow-up PR if pushing now would remove the LGTM labels.
LGTM still counts after you push small changes. It is OK to handle it in the follow-up.
OK, I'll do it in a different PR (I already have the code ready locally) so as not to delay merging this :)
LGTM to merge and progress to additional testing/TODOs on subsequent PRs.
This was suggested here: containerd#7679 (comment) Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Since we merged support for userns in: containerd#7679 overlay has been doing a chown for the rootfs using WithRemapperLabels. Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Hi!
This PR implements support for user namespaces in Kubernetes pods. I'm opening the PR to see if you think this approach is sound. I'll add tests later, of course, but we don't need to wait on that to see if this direction makes sense.
The 3 subitems listed in #7063 are solved in some way here, but not all possible solutions are implemented. In particular:
The other solutions listed in the issue for the ownership of the rootfs are FUSE and idmap mounts. I'm not particularly interested in FUSE solutions, as that changes the stack a lot, and we'd rather use idmap mounts/chown. We can implement idmap mounts in the future with the capability commits included here (this will probably need to be probed at runtime, as support depends on the kernel and the fs being used; I haven't checked how sound these commits are regarding runtime discovery, they were written by @dgl). There is an earlier attempt to use idmap mounts for the rootfs too: #5890; I haven't looked into merging that with the capabilities approach.
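For the record, a rough sketch of what creating an idmapped mount looks like with the new mount API via golang.org/x/sys/unix (kernel 5.12+, and 5.19+ for overlayfs; the paths and userns fd are illustrative, and this is not the code from #5890):

```go
package main

import (
	"log"

	"golang.org/x/sys/unix"
)

// idmapMount attaches an idmapped view of src at dst. usernsFD must be an
// open fd of a user namespace configured with the desired ID mappings.
func idmapMount(src, dst string, usernsFD int) error {
	// Clone the mount into a detached tree we can modify.
	treeFD, err := unix.OpenTree(unix.AT_FDCWD, src,
		unix.OPEN_TREE_CLONE|unix.OPEN_TREE_CLOEXEC)
	if err != nil {
		return err
	}
	defer unix.Close(treeFD)

	// Mark the detached tree as idmapped through the given user namespace.
	attr := &unix.MountAttr{
		Attr_set:  unix.MOUNT_ATTR_IDMAP,
		Userns_fd: uint64(usernsFD),
	}
	if err := unix.MountSetattr(treeFD, "", unix.AT_EMPTY_PATH, attr); err != nil {
		return err
	}

	// Attach the idmapped clone at the destination.
	return unix.MoveMount(treeFD, "", unix.AT_FDCWD, dst,
		unix.MOVE_MOUNT_F_EMPTY_PATH)
}

func main() {
	// Illustrative: fd 3 would be e.g. an open /proc/<pid>/ns/user of a
	// process with the desired mappings.
	if err := idmapMount("/var/lib/rootfs", "/run/idmapped-rootfs", 3); err != nil {
		log.Fatal(err)
	}
}
```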
While it is not needed, it would be nice to also merge #7141 for 1.7, as that reduces the storage overhead when using userns (we only create inodes and hardlinks instead of a full-blown copy of the rootfs for each pod). It probably reduces pod startup latency too :)
As a side note, I noticed there are pkg/cri/server and pkg/cri/sbserver, which seem to be copies of each other with some small changes. I've only implemented support for userns in pkg/cri/server, but it is trivial to duplicate the code in sbserver; see commit "WIP: Fix sbserver compilation". I wasn't sure if that is desirable or not, just let me know :)
To review, consider going commit by commit. Each commit is atomic and has a clear explanation.
To test this, a k8s 1.25+ cluster and the field "hostUsers: false" in the pod.spec should be enough. Feel free to reach out to me on slack if you hit any issue testing this.
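For reference, a minimal test pod expressed with the k8s.io/api Go types, equivalent to setting hostUsers: false in the pod spec YAML (pod name and image are illustrative):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func testPod() *corev1.Pod {
	hostUsers := false
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "userns-test"}, // illustrative name
		Spec: corev1.PodSpec{
			// hostUsers: false opts the pod into a user namespace. This needs
			// the userns feature gate (UserNamespacesStatelessPodsSupport in
			// k8s 1.25) enabled on all components, per the discussion above.
			HostUsers: &hostUsers,
			Containers: []corev1.Container{
				{Name: "app", Image: "busybox", Command: []string{"sleep", "inf"}},
			},
		},
	}
}

func main() {
	fmt.Printf("hostUsers: %v\n", *testPod().Spec.HostUsers)
}
```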
I'll try to join the next community meeting in case someone wants to discuss this in more detail.
TODO:
Fixes: #7063
cc @rodnymolina @dgl