
Add support for user namespaces in stateless pods (KEP-127) #7679

Merged · 4 commits · Dec 29, 2022

Conversation

@rata (Contributor) commented Nov 16, 2022

Hi!

This PR implements support for user namespaces in Kubernetes pods. I'm opening it now to see whether you think the approach is sound; I'll add tests later, of course, but there's no need to wait on those to judge whether this direction makes sense.

The three sub-items listed in #7063 are each addressed here, though not every possible solution is implemented. In particular:

  • The netns problem: it is solved in the first commit ("Let OCI runtime create netns when userns is used") and, as requested by Mike Brown in a community meeting, I left the code paths used today intact.
  • Ownership of the rootfs: three possible solutions are listed there. I chose to take the commits from PR #7310 (Add capability for snapshotters to declare support for UID remapping), which uses a chown and adds the notion of capabilities so a snapshotter can declare whether it supports idmap mounts (idmap support itself is not implemented). I've squashed the commits from that PR, rebased, fixed conflicts, and improved some code paths based on my PoC patches. We opened a separate PR for those changes first because we thought it might make sense to merge it earlier, but now that this PR is open, I think it can supersede that one.
  • The userns support itself: this is done in commit "cri: Support pods with user namespaces"

The other solutions listed in the issue for rootfs ownership are FUSE and idmap mounts. I'm not particularly interested in FUSE solutions, as they change the stack a lot; we prefer idmap mounts or chown. We can implement idmap mounts in the future on top of the capability commits included here (this will probably need to be probed at runtime, as support depends on the kernel and the filesystem in use; I haven't checked how sound these commits are regarding runtime discovery, as they were written by @dgl). There is also an earlier attempt to use idmap mounts for the rootfs: #5890. I haven't looked into merging that with the capabilities approach.

While not strictly needed, it would be nice to also merge #7141 for 1.7, as it reduces the storage overhead when using userns (we only create inodes and hard links instead of a full-blown copy of the rootfs for each pod). It probably reduces pod startup latency too :)

As a side note, I noticed that pkg/cri/server and pkg/cri/sbserver seem to be copies of each other with some small changes. I've only implemented userns support in pkg/cri/server, but it is trivial to duplicate the code in sbserver; see commit "WIP: Fix sbserver compilation". I wasn't sure whether that is desirable or not. Just let me know :)

To review, consider going commit by commit. Each commit is atomic and has a clear explanation.

To test this, a k8s 1.25+ cluster and the field "hostUsers: false" in the pod spec should be enough. Feel free to reach out to me on Slack if you hit any issues testing this.
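For reference, a minimal pod manifest exercising this could look like the sketch below. The pod name and image are placeholders; hostUsers: false is the only field that matters here, and (as discussed later in this thread) the relevant feature gate must be enabled on all k8s components — in 1.25 the alpha gate was named UserNamespacesStatelessPodsSupport, if I recall the naming correctly.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns-test          # placeholder name
spec:
  hostUsers: false           # opts the pod into a user namespace
  containers:
  - name: shell              # placeholder container
    image: busybox
    command: ["sleep", "infinity"]
```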

I'll try to join the next community meeting in case someone wants to discuss this in more detail.

TODO:

  • unit tests for userns in _linux.go files (containers and sandbox)
  • Check if we can add unit tests for userns in non-linux files (containers and sandbox)
  • Check if we can add unit tests for remapped snapshot (@dgl can you look into this?)
  • integration tests

Fixes: #7063

cc @rodnymolina @dgl

@k8s-ci-robot

Hi @rata. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rata rata changed the title Add support for user namespaces in stateless pods Add support for user namespaces in stateless pods (KEP-127) Nov 16, 2022
@rata (Contributor, Author) commented Nov 16, 2022

It would be great if we could add this to the 1.7 milestone too :)

@rata rata force-pushed the rata/userns-stateless-pods branch 3 times, most recently from 5a28185 to b507641 Compare November 16, 2022 17:22
@estesp estesp added this to New in Code Review via automation Nov 16, 2022
@estesp estesp added this to the 1.7 milestone Nov 16, 2022
@estesp estesp moved this from New to Needs Discussion in Code Review Nov 16, 2022
@estesp (Member) commented Nov 16, 2022

Thanks for working on this! I will definitely take some time to look at this soon. I've added the 1.7 milestone as we clearly want to have support for the new CRI API changes for userns KEP in containerd in our next major release.

@fuweid fuweid self-requested a review November 17, 2022 00:25
@asierHuawei

We are working on a feature that depends on user namespaces, so we are very interested in your patch.

I downloaded and tried your patch, but I haven't been able to enable user namespaces. When I follow the instructions here (https://kubernetes.io/docs/tasks/configure-pod-container/user-namespaces/) and add the hostUsers: false key, no user namespace is created. In fact, if I check the bundle that containerd creates, there is no reference to user namespaces whatsoever.

What are we doing wrong?

@rata (Contributor, Author) commented Nov 17, 2022

@asierHuawei That is good to know, thanks! Are you sure you enabled the feature gate on all k8s components (not just the kubelet)? If that doesn't do the trick, something simple must be missing. Let's continue on Slack: I'm rata on the k8s Slack, and Rodrigo Campos or rata on the CNCF Slack.

@rodnymolina

As previously discussed with @rata, I was able to verify the proper behavior of this patch for unprivileged pods. I may add some feedback later on, once I review the code changes in more detail.

@AkihiroSuda (Member)

Could you add integration tests too?
https://github.com/containerd/containerd/tree/main/integration

@asierHuawei

@rata I was able to get it working by enabling the feature gates. In our tests everything seems to work. So far so good.

return "", err
}
// TODO(dgl): length isn't taken into account here yet either.
if err := remapRootFS(ctx, mounts, hostUID, hostGID); err != nil {
@rodnymolina commented Nov 24, 2022

There's equivalent functionality already present in lower-level runtimes, which would be impacted by a solution like this one that unilaterally dictates how rootfs ownership is handled. For example, the Sysbox runtime, in the absence of idmapped-mount support for overlayfs (kernel < 5.19), relies on shiftfs and/or chown-metacopy to considerably expedite this operation.

Would it be possible to expose a config knob that overrides this id-remap feature, so users can decide where this logic runs? The knob could be deprecated once snapshotters support more efficient id-remap approaches (probably after 5.19+).

@dgl (Contributor) replied:

Yes, makes sense; it would essentially be an option to say that snapshotters support the "remap-ids" capability even if they don't declare it (and it's handled elsewhere).

(The only downside I see is things outside containerd cannot see this capability, so e.g. if a snapshotter is deciding whether it reports the capability based on the kernel version or not, something like sysbox may have to infer that too, given this would be a user config though I think that's fine. I'll have a little think about that and add something here soon.)

@rodnymolina commented Nov 24, 2022

> Yes, makes sense; it would essentially be an option to say that snapshotters support the "remap-ids" capability even if they don't declare it (and it's handled elsewhere).

Yes, I think that would work. The snapshotter wouldn't do anything in this case and would rely on lower-level runtimes for this operation.

> (The only downside I see is things outside containerd cannot see this capability, so e.g. if a snapshotter is deciding whether it reports the capability based on the kernel version or not, something like sysbox may have to infer that too...

Agree, we are already doing something like this.

Thanks @dgl.

@dmcgowan dmcgowan marked this pull request as draft November 30, 2022 16:43
@dmcgowan dmcgowan added status/needs-discussion Needs discussion and decision from maintainers and removed needs-ok-to-test labels Nov 30, 2022
@fuweid (Member) left a comment

The method used to set up CNI looks good to me. We can double-check that the runc-init process is still alive after mounting the netns; the shim watches for the exit event of runc-init, so this is reliable. To me, the shim is effectively acting like a pidfd.

This approach doesn't seem applicable to the Kata Containers/gVisor runtimes, but I don't think that is a blocker. The Sandbox API currently refactors the existing CRI plugin and should work with other runtimes, so this is analogous to host networking for Kata Containers. Not a blocker.

pkg/cri/sbserver/container_create_linux.go (outdated; resolved)
pkg/cri/server/sandbox_run_linux.go (resolved)
pkg/cri/server/sandbox_run_linux.go (resolved)
@rata (Contributor, Author) left a comment

Grr, forgot to click "submit review". Now submitting

pkg/cri/server/sandbox_run.go (outdated; resolved)
pkg/cri/server/sandbox_run.go (resolved)
return nil, err
}

if len(resp.Plugins) <= 0 {
(Member) commented:

nit: len can't be < 0

@fuweid (Member) left a comment

LGTM

I think there are two follow-up items:

  • Consider adding validation when the container's userns settings do not match the pod's.
  • Add a CNI failpoint test case for the userns scenario.

err: true,
},
} {
t.Run(desc, func(t *testing.T) {
(Member) commented:

Please consider using the following pattern, in case the cases are ever run in parallel.

desc := desc // rebind the loop variables so parallel subtests don't share them
test := test
t.Run(desc, func(t *testing.T) {
})

@rata (Contributor, Author) commented Dec 28, 2022

Sure, I can do that to avoid Go's closure-capture issues if we parallelize! FYI, this was copy-pasted from other parts of the code, so there are more instances of the same pattern elsewhere.

Should I push that amendment to this PR? I'm OK with doing a follow-up PR if pushing now would remove the LGTM labels.

(Member) commented:

LGTM still counts after you push small changes~ It is OK to handle it in the follow-up.

@rata (Contributor, Author) commented:

OK, I'll do it in a different PR (I already have the code ready locally) so as not to delay merging this :)

@fuweid fuweid removed the status/needs-discussion Needs discussion and decision from maintainers label Dec 28, 2022
@mikebrow (Member) left a comment

LGTM to merge and progress to additional testing/TODOs on subsequent PRs.

@mikebrow mikebrow merged commit 66f186d into containerd:main Dec 29, 2022
Code Review automation moved this from Needs Contributor Update to Done Dec 29, 2022
rata added a commit to kinvolk/containerd that referenced this pull request Dec 29, 2022
This was suggested here:
	containerd#7679 (comment)

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
rata added a commit to kinvolk/containerd that referenced this pull request Jul 11, 2023
Since we merged support for userns in:
	containerd#7679

overlay has been doing a chown for the rootfs using WithRemapperLabels.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
@mxpv mxpv added cherry-picked/sbserver Changes are backported to sbserver and removed cherry-pick/sbserver Changes need to be backported to sbserver labels Aug 2, 2023
jsturtevant pushed a commit to jsturtevant/containerd that referenced this pull request Sep 21, 2023
Since we merged support for userns in:
	containerd#7679

overlay has been doing a chown for the rootfs using WithRemapperLabels.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
juliusl pushed a commit to juliusl/containerd that referenced this pull request Jan 26, 2024
Since we merged support for userns in:
	containerd#7679

overlay has been doing a chown for the rootfs using WithRemapperLabels.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>

Successfully merging this pull request may close these issues.

Support user namespaces phase I in Kubernetes 1.25+