-
Notifications
You must be signed in to change notification settings - Fork 18.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow non-privileged containers to create device nodes. #2979
Conversation
ping @jpetazzo |
@kevinwallace @jpetazzo Won't this be a problem for devicemapper? |
We can currently import or pull and run images that already have arbitrary device nodes, so this is no more or less secure, if that's what you mean. |
👍 have an immediate need for this. |
Some changes to the way privs were dropped broke this. Rebased on top of efde305; it works again. |
+1 this will fix my |
Rebased, and added DCO signoff to the commit messages. |
@kevinwallace is there a chance that you could begin some documentation for this? An example would be a great start - we can fill more in later, but that takes longer :). |
Sure! I'm not sure if/where this would fit into the docs, so I'll just dump it here: Before this change, if you needed to use a particular device node from within a container ( After this change, non-privileged containers include the SYS_MKNOD capability, allowing them to create device nodes at runtime. This only changes the docker security model if you both (a) strictly control the set of images your users have access to, and (b) allow users to run as root within containers. Here's an example of a Dockerfile that works after this change, but didn't before: https://github.com/kevinwallace/qemu-docker/blob/37388951255c8d835f4b943a2ca59e713495c173/Dockerfile (in particular, the |
so @kevinwallace this will help #1916 avoid the need for -privileged ? awesome :) merci for the doc-start, hopefully I'll find time to work out a place for it (or @metalivedev / @jamtur01 et al will ) |
This let me install GlusterFS (which depended on FUSE). 👍 Now I just need to get GlusterFS working.. 😁 |
If/when this code gets approved and merged, please file a new issue about creating documentation and refer back to this one. Otherwise the docs may really get left behind. |
I think the doc issue is larger in scope than this change. I'd expect to see this sort of information in a section that addresses the questions "What are the security properties of a container? What can unprivileged containers do? What can privileged containers do?". That section doesn't currently exist, AFAIK (though I'd love to be wrong about this). And I suspect the concept of a "privileged container" might not last long -- see the "[docker-dev] Breaking down system requirements / privileges" email thread. Anyway, agreed that this conversation belongs in a new, separate issue. |
Can we please add a huge warning about this to the docs. Allowing mknod() is totally unsafe and means you can easily break out of the container. For instance, any "root" user inside the container can just create a /dev/mem node and arbitrarily read/write to any physical memory. |
re: security, updates to and issues with http://docs.docker.io/en/latest/installation/security/ are welcome. I think that in this case we're not granting any capabilities that give new security holes because we're only granting the ability to create these nodes within the container, not to actually read or write them. See lxc_template.go:37 which hasn't changed. What is the purpose of creating a node that you can't read or write? Yeah, I though that was weird too. It lets you use a Dockerfile to create a container with all the right mount points in it for later running as a privileged container. It won't fix all the catch-22 problems of privileged container creation, but it will help in some cases, as with FUSE file system mounts. At least that's my current understanding. |
@metalivedev You nailed my motivation for this change. In particular, this is a prerequisite for using Trusted Builds to create an image that contains device nodes that aren't present in the parent image. |
@metalivedev That sounds good to me. I looked at this: |
There is a minor issue with enabling mknod in a container -- systemd compatibility. Apparently systemd does a check for CAP_MKNOD to determine whether it should run udevd and any other units that create device nodes. But that's not probably a big deal at this time, since systemd already doesn't work very well in a docker container, so it wouldn't really break anything that's not already broken. Once we get systemd working in a container we'd probably have to either change systemd's behavior wrt to CAP_MKNOD, or have docker drop CAP_MKNOD when it detects a systemd container. |
👍 I needz zis! |
+1 |
2 similar comments
+1 |
+1 |
+1 |
+1 this would be excellent for "live-build" (or anywhere else "debootstrap" is used, even if directly) to allow us to create ISOs or even base images directly inside a container without using the hammer of |
+1 |
3 similar comments
+1 |
+1 |
+1 |
LGTM |
Are you kidding me? Sure, land this, and watch how much fun people have with your underlying hardware and your system. No, this is horrible. Don't do this. Unless you like unsecure systems and enjoy people doing bad things, then sure, no objection from me... |
@gregkh how so? There isn't anything here I can't already do with a tar and ADD is there? Should we be scrubbing some of the device nodes from ADDed tar files too? |
@gregkh I think the argument is that you can already run a privileged container, mknod, commit, push, pull into an unprivileged container. So essentially you can mknod in a round about way today. Is there not an additional security layer that prevents access even if the device node exists? If there is not then you are basically saying that there is a huge security issue in docker today. |
Indeed, if this PR creates a problem with security then it's not a new one. Where's the harm in being able to create device nodes I can't read or write? |
@tianon, yes, you could do that today, and that's also a bad thing, that can cause problems. Don't try to make the issue worse by making it easy for people to do this, and having to break users when the whole thing finally gets fixed. Also, this breaks systemd, isn't that a huge blocker? Unless you don't want to run systemd, and then, well, again, there are easier ways to stop systemd from running, don't try to be subtle about it... |
@tianon, why would you ever want to create a device node you can't read or write? It's not useful at all, so why would you want to allow it? Unless you can use it, and then, well, it's a problem... Be explicit about permissions, if you want to create a device node, then require permissions to do so. Why is that a problem for people who want to create them today? |
If the mere presence of a device node causes a security concern, then that needs to be addressed. If cgroups can effectively block access, then why not just let users create them. It does make building images easier for packages that have a dependency on things like fuse, but don't actually use fuse. But... If this breaks systemd it should not be accepted. |
I don't think that cgroups can block device access, but I could be wrong. Someone should research that... I still don't understand why a package would want to create a device node, and never use it. Specific examples would be nice to see. |
@gregkh I can't speak for the motivation behind #2191 (the bug that this patch fixes), but the reason I wrote this patch was so that I could write a Dockerfile that created /dev/kvm without needing -privileged at build time, so that it could be published as a Trusted Build. It would still require -privileged at run time in order to be able to use KVM, of course. Another common motivation (scanning the +1s above) seems to be installing dpkg packages that depend on fuse. My understanding (based on conversation attached to #2191) was that cgroups would continue to block access to device nodes not explicitly whitelisted (see https://github.com/dotcloud/docker/blob/6c266c4b42eeabe2d433a994753d86637fe52a0b/execdriver/lxc/lxc_template.go#L42 -- whitelist is currently /dev/null, /dev/zero, consoles, /dev/{u,}random, /dev/pts, and tuntap). If that understanding is incorrect, then we have a big (pre-existing) problem. |
Here's an example non-privileged session to show why this isn't an issue: $ docker run -it --rm debian:stable bash
root@6f17eb5cebeb:/# cat /dev/mem # "do no evil (tm)"
cat: /dev/mem: Operation not permitted
root@6f17eb5cebeb:/# cat /dev/kmem
cat: /dev/kmem: Operation not permitted
root@6f17eb5cebeb:/# cat /dev/sda1
cat: /dev/sda1: Operation not permitted |
Also, as far as use cases go, |
odds are /dev/mem is mediated by your kernel, are you sure that isn't happening here? Pick a different device node if you want to test things out, like a random serial port... |
That's why I included "/dev/sda1" :) |
So, just for good measure: root@f70fd78e7d0b:/# cat /dev/ttyS0
cat: /dev/ttyS0: Permission denied
root@f70fd78e7d0b:/# echo 'something' > /dev/ttyS0
bash: /dev/ttyS0: Permission denied |
@gregkh Obviously cgroups can limit device access, and we're using this today in docker. I don't think the hysterical security ranting is called for, if you have actual feedback that would be interesting though. We're not completely stupid you know. |
So will never be possible to mount a fuse file system like s3ql or gluster inside a docker container? |
@guestisp It is possible right now if you pass -privileged, but that gives you a very unsecure container. Long term we will probably add ways open up privileges in a more fine grained way, but that is not currently possible. |
@guestisp the other alternative is to mount it outside the container and then add it to the container using a volume. |
Is not always possible to mount it outside. An option would be to mount a single volume and then bind inside a container a single folder. In this case each container share the same fs mount on node, but doesn't see other containers file due to the subdirectory, in example: /mnt/s3ql/container1 => mounted in container1 /mnt/s3ql is a single s3ql fs mounted on node. How did dotcloud s3ql mounts ? |
I made an earlier comment about building up an debian rootfs with multistrap: #2979 (comment) Whilst I can appreciate it's perhaps not a mainstream requirement, it's nonetheless a necessary part of my workflow for building embedded firmware images. |
@gregkh I won't get into the security discussion, plenty of people adequately covering that already, but I wanted to address this particular comment:
TLDR: it doesn't break systemd, we are just changing a hardcoded default but you can still change these values per-container or per-image (or in the case of MKNOD, we will add that flag). And we will NEVER EVER do anything to hurt another project on purpose! I'll state the (very) obvious, we're not actively trying to break any software. Quite the opposite we actively want as many programs as possible to run happily inside docker, including of course systemd, which is a common request. We are actually putting a lot of work specifically into getting docker to work well with systemd. We approach the problem of running systemd inside docker the same way we approach the problem for any other program:
Of the 4 examples I listed above, 4 are currently baked into the runtime, but 3 are scheduled to be "demoted" to runtime/buildtime options for various reasons:
The conclusion to all this is two-fold:
Sorry for the rather long comment. But I wanted to be thorough, because we put a lot of thought into these design decisions, and so far we have been able to fend off the spectrum of drama and keep everyone focused on technical merit and positive things like making our users happy. I feel incredibly lucky that experienced open-source people like you @gregkh and @alexlarsson take the time to discuss and contribute here. The reason it works well, I think, is that we trust each other to be well-intentioned and to put the good of the project and its users first, and have succeeded in setting a positive and friendly atmosphere regardless of technical disagreements. So I'm always alarmed when I see the notion that we might change the code on purpose to hurt someone else or generally do something negative. That is the very last thing I want this project associated with. If you feel like we are sending a negative vibe in any way, let's please discuss it, this matters a lot to me. A good place to discuss this would be #docker-dev on Freenode every thursday at 10:30AM, that's our weekly irc meeting. Ok I'm done :) |
On Fri, Mar 14, 2014 at 01:56:26AM -0700, Solomon Hykes wrote:
My "sarcasm" tag got filtered away by github, sorry about that. I know
That worries me, it's a permission that I feel should still be disabled, |
FYI, systemd does not drop mknod: http://cgit.freedesktop.org/systemd/systemd/tree/src/nspawn/nspawn.c#n113 |
Having dug into the issue and read through this thread it seems like a reasonable choice to have Docker build enabling mknod for this use case. I reviewed the code and the libcontainer masking of capabilities and it looks good too (after I cleared up my confusion via #4719). LGTM based on the use case and available options. @creack nspawn does lock down mknod to null, zero, full, random, urandom and tty by default though: http://cgit.freedesktop.org/systemd/systemd/diff/src/nspawn/nspawn.c?id=9457ac5b4e755e9019ead2f564124df5d35ee7cf |
Ok, I now believe that the container will prevent the "horrible" things I previously mentioned, sorry for the noise. But, I still don't like the feeling that it seems you are allowing permissions to be "elevated" here over what they normally are (i.e. normal users can't mknod). And what @philips pointed out with systemd, I don't think this will play well with that as it locks things down to prevent this from users from doing this as well. But I can't really test it, so I can't be sure how it interacts. Anyway, feel free to merge this if it passes your testing and doesn't mess with systemd, but it really feels wrong to me... |
Merged c94111b |
Allow non-privileged containers to create device nodes.
Such nodes could already be created by importing a tarball to a container; now
they can be created from within the container itself.
This gives non-privileged containers the mknod kernel capability, and modifies
their cgroup settings to allow creation of any node, not just whitelisted
ones. Use of such nodes is still controlled by the existing cgroup whitelist.
Fixes #2191