
Add support for user namespaces in stateless pods (KEP-127) #7679

Merged · 4 commits · Dec 29, 2022

Conversation

@rata (Contributor) commented Nov 16, 2022

Hi!

This PR implements support for user namespaces in Kubernetes pods. I'm opening it now to see whether you think the approach is sound; I'll add tests later, of course, but there's no need to wait on those to judge whether this direction makes sense.

The three sub-items listed in #7063 are each addressed here, though not every possible solution is implemented. In particular:

  • The netns problem: it is solved in the first commit ("Let OCI runtime create netns when userns is used") and, as requested by Mike Brown in a community meeting, I left the code paths used today intact.
  • Ownership of the rootfs: three possible solutions are listed there. I chose to take the commits from PR #7310 (Add capability for snapshotters to declare support for UID remapping), which uses a chown and adds the notion of capabilities so a snapshotter can declare whether it supports idmap mounts (idmap support itself is not implemented). I've squashed the commits from that PR, rebased, fixed conflicts, and improved some code paths based on my PoC patches. We opened a separate PR for those changes first because we thought it might make sense to merge it earlier, but now that this PR is open, I think it can supersede that one.
  • The userns support itself: this is done in commit "cri: Support pods with user namespaces"

The other solutions listed in the issue for rootfs ownership are FUSE and idmap mounts. I'm not particularly interested in FUSE solutions, as they change the stack a lot; we prefer idmap mounts or chown. We can implement idmap mounts in the future on top of the capability commits included here (this will probably need to be probed at runtime, as support depends on the kernel and the filesystem in use; I haven't checked how sound these commits are regarding runtime discovery, as they were written by @dgl). There is also an earlier attempt to use idmap mounts for the rootfs: #5890. I haven't looked into merging that with the capabilities approach.

While not strictly needed, it would be nice to also merge #7141 for 1.7, as it reduces the storage overhead when using userns (we only create inodes and hard links instead of a full-blown copy of the rootfs for each pod). It probably reduces pod startup latency too :)

As a side note, I noticed that pkg/cri/server and pkg/cri/sbserver seem to be copies of each other with some small changes. I've only implemented userns support in pkg/cri/server, but it is trivial to duplicate the code in sbserver; see commit "WIP: Fix sbserver compilation". I wasn't sure whether that is desirable or not. Just let me know :)

To review, consider going commit by commit. Each commit is atomic and has a clear explanation.

To test this, a k8s 1.25+ cluster and the field "hostUsers: false" in the pod spec should be enough. Feel free to reach out to me on Slack if you hit any issues testing this.
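For reference, a minimal pod manifest exercising this could look like the sketch below. The pod name and image are placeholders; hostUsers: false is the only field that matters here, and (as discussed later in this thread) the relevant feature gate must be enabled on all k8s components — in 1.25 the alpha gate was named UserNamespacesStatelessPodsSupport, if I recall the naming correctly.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns-test          # placeholder name
spec:
  hostUsers: false           # opts the pod into a user namespace
  containers:
  - name: shell              # placeholder container
    image: busybox
    command: ["sleep", "infinity"]
```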

I'll try to join the next community meeting in case someone wants to discuss this in more detail.

TODO:

  • unit tests for userns in _linux.go files (containers and sandbox)
  • Check if we can add unit tests for userns in non-linux files (containers and sandbox)
  • Check if we can add unit tests for remapped snapshot (@dgl can you look into this?)
  • integration tests

Fixes: #7063

cc @rodnymolina @dgl

@k8s-ci-robot

Hi @rata. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rata rata changed the title Add support for user namespaces in stateless pods Add support for user namespaces in stateless pods (KEP-127) Nov 16, 2022
@rata (Contributor, Author) commented Nov 16, 2022

It would be great if we could add this to the 1.7 milestone too :)

@rata rata force-pushed the rata/userns-stateless-pods branch 3 times, most recently from 5a28185 to b507641 Compare November 16, 2022 17:22
@estesp estesp added this to New in Code Review via automation Nov 16, 2022
@estesp estesp added this to the 1.7 milestone Nov 16, 2022
@estesp estesp moved this from New to Needs Discussion in Code Review Nov 16, 2022
@estesp (Member) commented Nov 16, 2022

Thanks for working on this! I will definitely take some time to look at this soon. I've added the 1.7 milestone as we clearly want to have support for the new CRI API changes for userns KEP in containerd in our next major release.

@fuweid fuweid self-requested a review November 17, 2022 00:25
@asierHuawei

We are working on a feature that depends on user namespaces, so we are very interested in your patch.

I downloaded and tried your patch, but I haven't been able to enable user namespaces. When I follow the instructions here (https://kubernetes.io/docs/tasks/configure-pod-container/user-namespaces/) and add the hostUsers: false key, no user namespace is created. In fact, if I check the bundle that containerd creates, there is no reference to user namespaces whatsoever.

What are we doing wrong?

@rata (Contributor, Author) commented Nov 17, 2022

@asierHuawei That is good to know, thanks! Are you sure you enabled the feature gate on all k8s components (not just the kubelet)? If that doesn't do the trick, something simple must be missing. Let's continue on Slack: I'm rata on the k8s Slack, and Rodrigo Campos or rata on the CNCF Slack.

@rodnymolina

As previously discussed with @rata, I was able to verify the proper behavior of this patch for unprivileged pods. I may add some feedback later on, once I review the code changes in more detail.

@AkihiroSuda (Member)

Could you add integration tests too?
https://github.com/containerd/containerd/tree/main/integration

@asierHuawei

@rata I was able to get it working by enabling the feature gates. In our tests everything seems to work. So far so good.

return "", err
}
// TODO(dgl): length isn't taken into account here yet either.
if err := remapRootFS(ctx, mounts, hostUID, hostGID); err != nil {
@rodnymolina commented Nov 24, 2022

There's equivalent functionality already present in lower-level runtimes, which would be impacted by a solution like this one that unilaterally dictates how rootfs ownership is handled. For example, the Sysbox runtime, in the absence of idmapped-mount support for overlayfs (kernel < 5.19), relies on shiftfs and/or chown-metacopy to considerably expedite this operation.

Would it be possible to expose a config knob that overrides this id-remap feature, so users can decide where this logic runs? The knob could be deprecated once snapshotters support more efficient id-remap approaches (probably after 5.19+).

@dgl (Contributor) replied:

Yes, makes sense; it would essentially be an option to say that snapshotters support the "remap-ids" capability even if they don't declare it (and it's handled elsewhere).

(The only downside I see is things outside containerd cannot see this capability, so e.g. if a snapshotter is deciding whether it reports the capability based on the kernel version or not, something like sysbox may have to infer that too, given this would be a user config though I think that's fine. I'll have a little think about that and add something here soon.)

@rodnymolina commented Nov 24, 2022

> Yes, makes sense; it would essentially be an option to say that snapshotters support the "remap-ids" capability even if they don't declare it (and it's handled elsewhere).

Yes, I think that would work. The snapshotter wouldn't do anything in this case and would rely on lower-level runtimes for this operation.

> (The only downside I see is things outside containerd cannot see this capability, so e.g. if a snapshotter is deciding whether it reports the capability based on the kernel version or not, something like sysbox may have to infer that too...

Agree, we are already doing something like this.

Thanks @dgl.

@dmcgowan dmcgowan marked this pull request as draft November 30, 2022 16:43
@dmcgowan dmcgowan added status/needs-discussion Needs discussion and decision from maintainers and removed needs-ok-to-test labels Nov 30, 2022
@fuweid (Member) left a comment

The method used to set up CNI looks good to me. We can double-check that the runc-init process is still alive after mounting the netns; the shim watches for the exit event of runc-init, so this is reliable. To me, the shim is effectively acting like a pidfd.

This approach doesn't seem applicable to the Kata Containers/gVisor runtimes, but I don't think that is a blocker. The Sandbox API currently refactors the existing CRI plugin and should work with other runtimes, so this is analogous to host networking for Kata Containers. Not a blocker.

pkg/cri/sbserver/container_create_linux.go (outdated; resolved)
pkg/cri/server/sandbox_run_linux.go (resolved)
pkg/cri/server/sandbox_run_linux.go (resolved)
@rata (Contributor, Author) left a comment

Grr, forgot to click "submit review". Now submitting

pkg/cri/server/sandbox_run.go (outdated; resolved)
pkg/cri/server/sandbox_run.go (resolved)
return nil, err
}

if len(resp.Plugins) <= 0 {
(Member) commented:

nit: len can't be < 0

@fuweid (Member) left a comment

LGTM

I think there are two follow-up items:

  • Consider adding validation when the container's userns settings do not match the pod's.
  • Add a CNI failpoint test case for the userns scenario.

err: true,
},
} {
t.Run(desc, func(t *testing.T) {
(Member) commented:

Please consider using the following pattern, in case the cases are ever run in parallel.

desc := desc // rebind the loop variables so parallel subtests don't share them
test := test
t.Run(desc, func(t *testing.T) {
})

@rata (Contributor, Author) commented Dec 28, 2022

Sure, I can do that to avoid Go's closure-capture issues if we parallelize! FYI, this was copy-pasted from other parts of the code, so there are more instances of the same pattern elsewhere.

Should I push that amendment to this PR? I'm OK with doing a follow-up PR if pushing now would remove the LGTM labels.

(Member) commented:

LGTM still counts after you push small changes~ It is OK to handle it in the follow-up.

@rata (Contributor, Author) commented:

OK, I'll do it in a different PR (I already have the code ready locally) so as not to delay merging this :)

@fuweid fuweid removed the status/needs-discussion Needs discussion and decision from maintainers label Dec 28, 2022
@mikebrow (Member) left a comment

LGTM to merge and progress to additional testing/TODOs on subsequent PRs.

@mikebrow mikebrow merged commit 66f186d into containerd:main Dec 29, 2022
Code Review automation moved this from Needs Contributor Update to Done Dec 29, 2022
rata added a commit to kinvolk/containerd that referenced this pull request Dec 29, 2022
This was suggested here:
	containerd#7679 (comment)

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
rata added a commit to kinvolk/containerd that referenced this pull request Jul 11, 2023
Since we merged support for userns in:
	containerd#7679

overlay has been doing a chown for the rootfs using WithRemapperLabels.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
@mxpv mxpv added cherry-picked/sbserver Changes are backported to sbserver and removed cherry-pick/sbserver Changes need to be backported to sbserver labels Aug 2, 2023
jsturtevant pushed a commit to jsturtevant/containerd that referenced this pull request Sep 21, 2023
Since we merged support for userns in:
	containerd#7679

overlay has been doing a chown for the rootfs using WithRemapperLabels.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
juliusl pushed a commit to juliusl/containerd that referenced this pull request Jan 26, 2024
Since we merged support for userns in:
	containerd#7679

overlay has been doing a chown for the rootfs using WithRemapperLabels.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>

Successfully merging this pull request may close these issues.

Support user namespaces phase I in Kubernetes 1.25+