Allow non-privileged containers to create device nodes. #2979

kevinwallace · 2013-12-01T23:40:31Z

Allow non-privileged containers to create device nodes.

Such nodes could already be created by importing a tarball to a container; now
they can be created from within the container itself.

This gives non-privileged containers the mknod kernel capability, and modifies
their cgroup settings to allow creation of any node, not just whitelisted
ones. Use of such nodes is still controlled by the existing cgroup whitelist.

Fixes #2191

crosbymichael · 2013-12-02T17:31:19Z

ping @jpetazzo

creack · 2013-12-05T19:33:46Z

@kevinwallace @jpetazzo Won't this be a problem for devicemapper?

kevinwallace · 2013-12-07T04:29:33Z

@creack What specifically is the concern? I'm not familiar with devicemapper, but @jpetazzo indicated in #2191 that it shouldn't cause any issues.

tianon · 2013-12-07T04:43:14Z

We can currently import or pull and run images that already have arbitrary device nodes, so this is no more or less secure, if that's what you mean.

pwaller · 2013-12-19T15:49:43Z

👍 have an immediate need for this.

kevinwallace · 2013-12-20T01:21:24Z

Some changes to the way privs were dropped broke this. Rebased on top of efde305; it works again.

hsk81 · 2014-01-08T10:36:39Z

+1 this will fix my fuse problem in dockerfiles!

kevinwallace · 2014-01-08T20:52:20Z

Rebased, and added DCO signoff to the commit messages.

SvenDowideit · 2014-01-08T22:42:31Z

@kevinwallace is there a chance that you could begin some documentation for this? An example would be a great start - we can fill more in later, but that takes longer :).

kevinwallace · 2014-01-08T23:06:26Z

Sure! I'm not sure if/where this would fit into the docs, so I'll just dump it here:

Before this change, if you needed to use a particular device node from within a container (/dev/fuse, /dev/kvm, etc), it needed to be present in the image you built your container on. Because device nodes are simply special files (with two extra fields major and minor telling the kernel which type of device it represents), they can get into that image either by being imported as part of a tarball, or by some action by a container running in privileged mode (running mknod, for example). You couldn't create device nodes at runtime from a non-privileged container because Docker dropped the SYS_MKNOD kernel capability, which the kernel requires callers to have in order to create device nodes.

After this change, non-privileged containers include the SYS_MKNOD capability, allowing them to create device nodes at runtime. This only changes the docker security model if you both (a) strictly control the set of images your users have access to, and (b) allow users to run as root within containers.

Here's an example of a Dockerfile that works after this change, but didn't before: https://github.com/kevinwallace/qemu-docker/blob/37388951255c8d835f4b943a2ca59e713495c173/Dockerfile (in particular, the RUN mknod /dev/kvm c 10 232 line)

SvenDowideit · 2014-01-08T23:37:44Z

so @kevinwallace this will help #1916 avoid the need for -privileged ? awesome :)

merci for the doc-start, hopefully I'll find time to work out a place for it (or @metalivedev / @jamtur01 et al will )

scjudd · 2014-01-10T21:25:35Z

This let me install GlusterFS (which depended on FUSE). 👍

Now I just need to get GlusterFS working.. 😁

metalivedev · 2014-01-13T20:41:42Z

If/when this code gets approved and merged, please file a new issue about creating documentation and refer back to this one. Otherwise the docs may really get left behind.

kevinwallace · 2014-01-13T22:12:05Z

@SvenDowideit @metalivedev

I think the doc issue is larger in scope than this change. I'd expect to see this sort of information in a section that addresses the questions "What are the security properties of a container? What can unprivileged containers do? What can privileged containers do?". That section doesn't currently exist, AFAIK (though I'd love to be wrong about this). And I suspect the concept of a "privileged container" might not last long -- see the "[docker-dev] Breaking down system requirements / privileges" email thread.

Anyway, agreed that this conversation belongs in a new, separate issue.

alexlarsson · 2014-01-14T14:08:24Z

Can we please add a huge warning about this to the docs. Allowing mknod() is totally unsafe and means you can easily break out of the container. For instance, any "root" user inside the container can just create a /dev/mem node and arbitrarily read/write to any physical memory.

metalivedev · 2014-01-14T23:32:24Z

re: security, updates to and issues with http://docs.docker.io/en/latest/installation/security/ are welcome.

I think that in this case we're not granting any capabilities that give new security holes because we're only granting the ability to create these nodes within the container, not to actually read or write them. See lxc_template.go:37 which hasn't changed.

What is the purpose of creating a node that you can't read or write? Yeah, I though that was weird too. It lets you use a Dockerfile to create a container with all the right mount points in it for later running as a privileged container. It won't fix all the catch-22 problems of privileged container creation, but it will help in some cases, as with FUSE file system mounts. At least that's my current understanding.

kevinwallace · 2014-01-14T23:51:27Z

@metalivedev You nailed my motivation for this change. In particular, this is a prerequisite for using Trusted Builds to create an image that contains device nodes that aren't present in the parent image.

alexlarsson · 2014-01-15T08:38:11Z

@metalivedev That sounds good to me.

I looked at this:
https://www.kernel.org/doc/Documentation/cgroups/devices.txt
The security part there seemed to indicate that processes can move themselves to a different cgroup, but then there were a lot of handwaving about how to protect against this. It seems like this is likely out of date, and hopefully this is safe now.

jpoimboe · 2014-01-23T16:33:45Z

There is a minor issue with enabling mknod in a container -- systemd compatibility. Apparently systemd does a check for CAP_MKNOD to determine whether it should run udevd and any other units that create device nodes. But that's not probably a big deal at this time, since systemd already doesn't work very well in a docker container, so it wouldn't really break anything that's not already broken.

Once we get systemd working in a container we'd probably have to either change systemd's behavior wrt to CAP_MKNOD, or have docker drop CAP_MKNOD when it detects a systemd container.

ahmadnassri · 2014-02-14T16:14:19Z

👍 I needz zis!

maurerbot · 2014-02-14T16:14:41Z

+1

crimeminister · 2014-02-14T16:15:57Z

+1

alexflanagan · 2014-02-14T16:58:30Z

+1

flaub · 2014-02-14T18:10:07Z

+1
I'd like to use tup (http://gittup.org/tup/) which uses FUSE.

tianon · 2014-02-18T05:47:43Z

+1 this would be excellent for "live-build" (or anywhere else "debootstrap" is used, even if directly) to allow us to create ISOs or even base images directly inside a container without using the hammer of --privileged.

optikfluffel · 2014-02-22T11:14:19Z

+1

guestisp · 2014-02-22T15:17:58Z

+1

bbradbury · 2014-02-28T20:32:20Z

+1

bbirand · 2014-03-01T07:46:36Z

+1

vieux · 2014-03-14T00:17:53Z

LGTM

gregkh · 2014-03-14T05:03:16Z

Are you kidding me? Sure, land this, and watch how much fun people have with your underlying hardware and your system.

No, this is horrible. Don't do this.

Unless you like unsecure systems and enjoy people doing bad things, then sure, no objection from me...

tianon · 2014-03-14T05:07:55Z

@gregkh how so? There isn't anything here I can't already do with a tar and ADD is there? Should we be scrubbing some of the device nodes from ADDed tar files too?

ibuildthecloud · 2014-03-14T05:12:03Z

@gregkh I think the argument is that you can already run a privileged container, mknod, commit, push, pull into an unprivileged container. So essentially you can mknod in a round about way today.

Is there not an additional security layer that prevents access even if the device node exists? If there is not then you are basically saying that there is a huge security issue in docker today.

tianon · 2014-03-14T05:14:15Z

Indeed, if this PR creates a problem with security then it's not a new one.

Where's the harm in being able to create device nodes I can't read or write?

gregkh · 2014-03-14T05:15:04Z

@tianon, yes, you could do that today, and that's also a bad thing, that can cause problems. Don't try to make the issue worse by making it easy for people to do this, and having to break users when the whole thing finally gets fixed.

Also, this breaks systemd, isn't that a huge blocker? Unless you don't want to run systemd, and then, well, again, there are easier ways to stop systemd from running, don't try to be subtle about it...

gregkh · 2014-03-14T05:17:14Z

@tianon, why would you ever want to create a device node you can't read or write? It's not useful at all, so why would you want to allow it? Unless you can use it, and then, well, it's a problem...

Be explicit about permissions, if you want to create a device node, then require permissions to do so. Why is that a problem for people who want to create them today?

ibuildthecloud · 2014-03-14T05:23:15Z

If the mere presence of a device node causes a security concern, then that needs to be addressed. If cgroups can effectively block access, then why not just let users create them. It does make building images easier for packages that have a dependency on things like fuse, but don't actually use fuse.

But... If this breaks systemd it should not be accepted.

gregkh · 2014-03-14T05:35:20Z

I don't think that cgroups can block device access, but I could be wrong. Someone should research that...

I still don't understand why a package would want to create a device node, and never use it. Specific examples would be nice to see.

kevinwallace · 2014-03-14T05:52:42Z

@gregkh I can't speak for the motivation behind #2191 (the bug that this patch fixes), but the reason I wrote this patch was so that I could write a Dockerfile that created /dev/kvm without needing -privileged at build time, so that it could be published as a Trusted Build. It would still require -privileged at run time in order to be able to use KVM, of course.

Another common motivation (scanning the +1s above) seems to be installing dpkg packages that depend on fuse.

My understanding (based on conversation attached to #2191) was that cgroups would continue to block access to device nodes not explicitly whitelisted (see https://github.com/dotcloud/docker/blob/6c266c4b42eeabe2d433a994753d86637fe52a0b/execdriver/lxc/lxc_template.go#L42 -- whitelist is currently /dev/null, /dev/zero, consoles, /dev/{u,}random, /dev/pts, and tuntap). If that understanding is incorrect, then we have a big (pre-existing) problem.

tianon · 2014-03-14T06:12:45Z

Here's an example non-privileged session to show why this isn't an issue:

$ docker run -it --rm debian:stable bash
root@6f17eb5cebeb:/# cat /dev/mem # "do no evil (tm)"
cat: /dev/mem: Operation not permitted
root@6f17eb5cebeb:/# cat /dev/kmem   
cat: /dev/kmem: Operation not permitted
root@6f17eb5cebeb:/# cat /dev/sda1
cat: /dev/sda1: Operation not permitted

tianon · 2014-03-14T06:13:08Z

Also, as far as use cases go, live-build and debootstrap as part of docker build are at the top of my list.

gregkh · 2014-03-14T06:42:42Z

odds are /dev/mem is mediated by your kernel, are you sure that isn't happening here? Pick a different device node if you want to test things out, like a random serial port...

tianon · 2014-03-14T06:43:04Z

That's why I included "/dev/sda1" :)

tianon · 2014-03-14T06:50:36Z

So, just for good measure:

root@f70fd78e7d0b:/# cat /dev/ttyS0 
cat: /dev/ttyS0: Permission denied
root@f70fd78e7d0b:/# echo 'something' > /dev/ttyS0
bash: /dev/ttyS0: Permission denied

alexlarsson · 2014-03-14T08:10:10Z

@gregkh Obviously cgroups can limit device access, and we're using this today in docker.
See https://www.kernel.org/doc/Documentation/cgroups/devices.txt for the details (as linked to in a previous comment here). We're setting this up pretty tight with only a whitelist of allowed devices (null, zero, full, random, urandom, tty, console, tty0, tty1, pts/*) and everything else is denied open + read + write.

I don't think the hysterical security ranting is called for, if you have actual feedback that would be interesting though. We're not completely stupid you know.

guestisp · 2014-03-14T08:11:32Z

So will never be possible to mount a fuse file system like s3ql or gluster inside a docker container?

alexlarsson · 2014-03-14T08:14:42Z

@guestisp It is possible right now if you pass -privileged, but that gives you a very unsecure container. Long term we will probably add ways open up privileges in a more fine grained way, but that is not currently possible.

alexlarsson · 2014-03-14T08:15:11Z

@guestisp the other alternative is to mount it outside the container and then add it to the container using a volume.

guestisp · 2014-03-14T08:44:44Z

Is not always possible to mount it outside.
In our paas environment we will have hundreds of containers on each node and mounting every shared FS outside is not really an option but is certainly the most secure way.

An option would be to mount a single volume and then bind inside a container a single folder. In this case each container share the same fs mount on node, but doesn't see other containers file due to the subdirectory, in example:

/mnt/s3ql/container1 => mounted in container1
/mnt/s3ql/container2 => mounted in container2

/mnt/s3ql is a single s3ql fs mounted on node.
Each container has its own directory mounted inside with volumes. Will this be hackable or each container will never be able to move to upper directory?

How did dotcloud s3ql mounts ?

csirac2 · 2014-03-14T08:52:18Z

@gregkh

why would you ever want to create a device node you can't read or write? It's not
useful at all, so why would you want to allow it? Unless you can use it, and then, well,
it's a problem...

I made an earlier comment about building up an debian rootfs with multistrap: #2979 (comment)

Whilst I can appreciate it's perhaps not a mainstream requirement, it's nonetheless a necessary part of my workflow for building embedded firmware images.

shykes · 2014-03-14T08:56:25Z

@gregkh I won't get into the security discussion, plenty of people adequately covering that already, but I wanted to address this particular comment:

Also, this breaks systemd, isn't that a huge blocker? Unless you don't want to run systemd, and then, well, again, there are easier ways to stop systemd from running, don't try to be subtle about it...

TLDR: it doesn't break systemd, we are just changing a hardcoded default but you can still change these values per-container or per-image (or in the case of MKNOD, we will add that flag). And we will NEVER EVER do anything to hurt another project on purpose!

I'll state the (very) obvious, we're not actively trying to break any software. Quite the opposite we actively want as many programs as possible to run happily inside docker, including of course systemd, which is a common request. We are actually putting a lot of work specifically into getting docker to work well with systemd.

We approach the problem of running systemd inside docker the same way we approach the problem for any other program:

Identify everything the program requires to be present in the docker runtime environment. Examples for systemd may be and "I require uid 0", "$container should be set to a particular value", "CAP_MKNOD should be disabled" and "I must be pid 1".
We identify which requirements can be generalized to all programs without breaking some of them. Hardcode them as defaults into the Docker runtime.
For those requirements that don't make sense as universal defaults, we find an elegant way for the user to set them at runtime, and for the program author to set default values when creating images.

Of the 4 examples I listed above, 4 are currently baked into the runtime, but 3 are scheduled to be "demoted" to runtime/buildtime options for various reasons:

defaulting to uid 0 is a reasonable convention, it's the only uid guaranteed to mean the same thing in every container
most relevant to this discussion, CAP_MKNOD is currently disabled by default in the runtime, but merging this PR would change that.
we currently set $container to whatever value causes systemd and upstart to work, but we determined that this introspection method, although a pragmatic best effort solution to the problem, is not a sustainable solution and solves less problems than it creates. So we're not hardcoding it in the runtime because we don't believe it benefits other programs to have that variable added to their environment by default.
we currently run the application as pid1 in its own namespace, but as it turns out most applications don't handle well the special treatment reserved for pid 1 - specifically it creates quirks in signal handling, and makes it difficult to inject new processes into the namespace. So we are planning to remove this from the default as well, and make it a configurable runtime/buildtime option.

The conclusion to all this is two-fold:

Just because a particular program has quirks and requirements doesn't mean we'll automatically make it a default in the docker runtime. Even if we wanted to, we couldn't, since these requirements are often contradictory between programs (for example systemd needs pid 1, but say, apache needs pid != 1).
But that's totally OK, because your program can still run perfectly in Docker, quirks and all. All you need to do is build an image with the proper runtime options set. In the case of systemd, it should be built in an image which specifies "run me as PID 1", "run me with CAP_MKNOD disabled", and "run me with container=somethingsomethingsomething".

Sorry for the rather long comment. But I wanted to be thorough, because we put a lot of thought into these design decisions, and so far we have been able to fend off the spectrum of drama and keep everyone focused on technical merit and positive things like making our users happy. I feel incredibly lucky that experienced open-source people like you @gregkh and @alexlarsson take the time to discuss and contribute here. The reason it works well, I think, is that we trust each other to be well-intentioned and to put the good of the project and its users first, and have succeeded in setting a positive and friendly atmosphere regardless of technical disagreements. So I'm always alarmed when I see the notion that we might change the code on purpose to hurt someone else or generally do something negative. That is the very last thing I want this project associated with. If you feel like we are sending a negative vibe in any way, let's please discuss it, this matters a lot to me. A good place to discuss this would be #docker-dev on Freenode every thursday at 10:30AM, that's our weekly irc meeting.

Ok I'm done :)

gregkh · 2014-03-15T05:26:01Z

On Fri, Mar 14, 2014 at 01:56:26AM -0700, Solomon Hykes wrote:

TLDR: it doesn't break systemd, we are just moving runtime defaults to
configurable image options. And we will NEVER EVER do anything to hurt another
project on purpose!

My "sarcasm" tag got filtered away by github, sorry about that. I know
you all wouldn't ever do that on purpose, I was just teasing :)

• most relevant to this discussion, CAP_MKNOD is currently disabled by
default in the runtime, but merging this PR would change that.

That worries me, it's a permission that I feel should still be disabled,
as "normal" userspace requires that permission to be granted to create a
device node, so I worry that you are elevating permissions for users.

creack · 2014-03-15T08:02:43Z

FYI, systemd does not drop mknod: http://cgit.freedesktop.org/systemd/systemd/tree/src/nspawn/nspawn.c#n113

philips · 2014-03-18T17:08:09Z

Having dug into the issue and read through this thread it seems like a reasonable choice to have Docker build enabling mknod for this use case. I reviewed the code and the libcontainer masking of capabilities and it looks good too (after I cleared up my confusion via #4719).

LGTM based on the use case and available options.

@creack nspawn does lock down mknod to null, zero, full, random, urandom and tty by default though: http://cgit.freedesktop.org/systemd/systemd/diff/src/nspawn/nspawn.c?id=9457ac5b4e755e9019ead2f564124df5d35ee7cf

gregkh · 2014-03-19T04:08:14Z

Ok, I now believe that the container will prevent the "horrible" things I previously mentioned, sorry for the noise.

But, I still don't like the feeling that it seems you are allowing permissions to be "elevated" here over what they normally are (i.e. normal users can't mknod).

And what @philips pointed out with systemd, I don't think this will play well with that as it locks things down to prevent this from users from doing this as well. But I can't really test it, so I can't be sure how it interacts.

Anyway, feel free to merge this if it passes your testing and doesn't mess with systemd, but it really feels wrong to me...

vieux · 2014-04-03T18:47:08Z

Merged c94111b

SvenDowideit mentioned this pull request Jan 8, 2014

docker build should support privileged operations #1916

Closed

vieux closed this Apr 3, 2014

kevinwallace mentioned this pull request Apr 3, 2014

Enable mknod and setfcap capabilities in all containers #2191

Closed

crosbymichael mentioned this pull request Oct 31, 2017

Remove global device mknod cgroup rules opencontainers/runc#1623

Open

Allow non-privileged containers to create device nodes. #2979

Allow non-privileged containers to create device nodes. #2979

Conversation

kevinwallace commented Dec 1, 2013

crosbymichael commented Dec 2, 2013

creack commented Dec 5, 2013

kevinwallace commented Dec 7, 2013

tianon commented Dec 7, 2013

pwaller commented Dec 19, 2013

kevinwallace commented Dec 20, 2013

hsk81 commented Jan 8, 2014

kevinwallace commented Jan 8, 2014

SvenDowideit commented Jan 8, 2014

kevinwallace commented Jan 8, 2014

SvenDowideit commented Jan 8, 2014

scjudd commented Jan 10, 2014

metalivedev commented Jan 13, 2014

kevinwallace commented Jan 13, 2014

alexlarsson commented Jan 14, 2014

metalivedev commented Jan 14, 2014

kevinwallace commented Jan 14, 2014

alexlarsson commented Jan 15, 2014

jpoimboe commented Jan 23, 2014

ahmadnassri commented Feb 14, 2014

maurerbot commented Feb 14, 2014

crimeminister commented Feb 14, 2014

alexflanagan commented Feb 14, 2014

flaub commented Feb 14, 2014

tianon commented Feb 18, 2014

optikfluffel commented Feb 22, 2014

guestisp commented Feb 22, 2014

bbradbury commented Feb 28, 2014

bbirand commented Mar 1, 2014

vieux commented Mar 14, 2014

gregkh commented Mar 14, 2014

tianon commented Mar 14, 2014

ibuildthecloud commented Mar 14, 2014

tianon commented Mar 14, 2014

gregkh commented Mar 14, 2014

gregkh commented Mar 14, 2014

ibuildthecloud commented Mar 14, 2014

gregkh commented Mar 14, 2014

kevinwallace commented Mar 14, 2014

tianon commented Mar 14, 2014

tianon commented Mar 14, 2014

gregkh commented Mar 14, 2014

tianon commented Mar 14, 2014

tianon commented Mar 14, 2014

alexlarsson commented Mar 14, 2014

guestisp commented Mar 14, 2014

alexlarsson commented Mar 14, 2014

alexlarsson commented Mar 14, 2014

guestisp commented Mar 14, 2014

csirac2 commented Mar 14, 2014

shykes commented Mar 14, 2014

gregkh commented Mar 15, 2014

creack commented Mar 15, 2014

philips commented Mar 18, 2014

gregkh commented Mar 19, 2014

vieux commented Apr 3, 2014