Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow non-privileged containers to create device nodes. #2979

Closed
wants to merge 2 commits into from
Closed

Allow non-privileged containers to create device nodes. #2979

wants to merge 2 commits into from

Conversation

kevinwallace
Copy link
Contributor

Allow non-privileged containers to create device nodes.

Such nodes could already be created by importing a tarball to a container; now
they can be created from within the container itself.

This gives non-privileged containers the mknod kernel capability, and modifies
their cgroup settings to allow creation of any node, not just whitelisted
ones. Use of such nodes is still controlled by the existing cgroup whitelist.

Fixes #2191

@crosbymichael
Copy link
Contributor

ping @jpetazzo

@creack
Copy link
Contributor

creack commented Dec 5, 2013

@kevinwallace @jpetazzo Won't this be a problem for devicemapper?

@kevinwallace
Copy link
Contributor Author

@creack What specifically is the concern? I'm not familiar with devicemapper, but @jpetazzo indicated in #2191 that it shouldn't cause any issues.

@tianon
Copy link
Member

tianon commented Dec 7, 2013

We can currently import or pull and run images that already have arbitrary device nodes, so this is no more or less secure, if that's what you mean.

@pwaller
Copy link
Contributor

pwaller commented Dec 19, 2013

👍 have an immediate need for this.

@kevinwallace
Copy link
Contributor Author

Some changes to the way privs were dropped broke this. Rebased on top of efde305; it works again.

@hsk81
Copy link

hsk81 commented Jan 8, 2014

+1 this will fix my fuse problem in dockerfiles!

@kevinwallace
Copy link
Contributor Author

Rebased, and added DCO signoff to the commit messages.

@SvenDowideit
Copy link
Contributor

@kevinwallace is there a chance that you could begin some documentation for this? An example would be a great start - we can fill more in later, but that takes longer :).

@kevinwallace
Copy link
Contributor Author

Sure! I'm not sure if/where this would fit into the docs, so I'll just dump it here:

Before this change, if you needed to use a particular device node from within a container (/dev/fuse, /dev/kvm, etc), it needed to be present in the image you built your container on. Because device nodes are simply special files (with two extra fields major and minor telling the kernel which type of device it represents), they can get into that image either by being imported as part of a tarball, or by some action by a container running in privileged mode (running mknod, for example). You couldn't create device nodes at runtime from a non-privileged container because Docker dropped the SYS_MKNOD kernel capability, which the kernel requires callers to have in order to create device nodes.

After this change, non-privileged containers include the SYS_MKNOD capability, allowing them to create device nodes at runtime. This only changes the docker security model if you both (a) strictly control the set of images your users have access to, and (b) allow users to run as root within containers.

Here's an example of a Dockerfile that works after this change, but didn't before: https://github.com/kevinwallace/qemu-docker/blob/37388951255c8d835f4b943a2ca59e713495c173/Dockerfile (in particular, the RUN mknod /dev/kvm c 10 232 line)

@SvenDowideit
Copy link
Contributor

so @kevinwallace this will help #1916 avoid the need for -privileged ? awesome :)

merci for the doc-start, hopefully I'll find time to work out a place for it (or @metalivedev / @jamtur01 et al will )

@scjudd
Copy link

scjudd commented Jan 10, 2014

This let me install GlusterFS (which depended on FUSE). 👍

Now I just need to get GlusterFS working.. 😁

@metalivedev
Copy link
Contributor

If/when this code gets approved and merged, please file a new issue about creating documentation and refer back to this one. Otherwise the docs may really get left behind.

@kevinwallace
Copy link
Contributor Author

@SvenDowideit @metalivedev

I think the doc issue is larger in scope than this change. I'd expect to see this sort of information in a section that addresses the questions "What are the security properties of a container? What can unprivileged containers do? What can privileged containers do?". That section doesn't currently exist, AFAIK (though I'd love to be wrong about this). And I suspect the concept of a "privileged container" might not last long -- see the "[docker-dev] Breaking down system requirements / privileges" email thread.

Anyway, agreed that this conversation belongs in a new, separate issue.

@alexlarsson
Copy link
Contributor

Can we please add a huge warning about this to the docs. Allowing mknod() is totally unsafe and means you can easily break out of the container. For instance, any "root" user inside the container can just create a /dev/mem node and arbitrarily read/write to any physical memory.

@metalivedev
Copy link
Contributor

re: security, updates to and issues with http://docs.docker.io/en/latest/installation/security/ are welcome.

I think that in this case we're not granting any capabilities that give new security holes because we're only granting the ability to create these nodes within the container, not to actually read or write them. See lxc_template.go:37 which hasn't changed.

What is the purpose of creating a node that you can't read or write? Yeah, I though that was weird too. It lets you use a Dockerfile to create a container with all the right mount points in it for later running as a privileged container. It won't fix all the catch-22 problems of privileged container creation, but it will help in some cases, as with FUSE file system mounts. At least that's my current understanding.

@kevinwallace
Copy link
Contributor Author

@metalivedev You nailed my motivation for this change. In particular, this is a prerequisite for using Trusted Builds to create an image that contains device nodes that aren't present in the parent image.

@alexlarsson
Copy link
Contributor

@metalivedev That sounds good to me.

I looked at this:
https://www.kernel.org/doc/Documentation/cgroups/devices.txt
The security part there seemed to indicate that processes can move themselves to a different cgroup, but then there were a lot of handwaving about how to protect against this. It seems like this is likely out of date, and hopefully this is safe now.

@jpoimboe
Copy link
Contributor

There is a minor issue with enabling mknod in a container -- systemd compatibility. Apparently systemd does a check for CAP_MKNOD to determine whether it should run udevd and any other units that create device nodes. But that's not probably a big deal at this time, since systemd already doesn't work very well in a docker container, so it wouldn't really break anything that's not already broken.

Once we get systemd working in a container we'd probably have to either change systemd's behavior wrt to CAP_MKNOD, or have docker drop CAP_MKNOD when it detects a systemd container.

@ahmadnassri
Copy link

👍 I needz zis!

@maurerbot
Copy link

+1

2 similar comments
@crimeminister
Copy link

+1

@alexflanagan
Copy link

+1

@flaub
Copy link

flaub commented Feb 14, 2014

+1
I'd like to use tup (http://gittup.org/tup/) which uses FUSE.

@tianon
Copy link
Member

tianon commented Feb 18, 2014

+1 this would be excellent for "live-build" (or anywhere else "debootstrap" is used, even if directly) to allow us to create ISOs or even base images directly inside a container without using the hammer of --privileged.

@optikfluffel
Copy link

+1

3 similar comments
@guestisp
Copy link

+1

@bbradbury
Copy link

+1

@bbirand
Copy link

bbirand commented Mar 1, 2014

+1

@vieux
Copy link
Contributor

vieux commented Mar 14, 2014

LGTM

@gregkh
Copy link

gregkh commented Mar 14, 2014

Are you kidding me? Sure, land this, and watch how much fun people have with your underlying hardware and your system.

No, this is horrible. Don't do this.

Unless you like unsecure systems and enjoy people doing bad things, then sure, no objection from me...

@tianon
Copy link
Member

tianon commented Mar 14, 2014

@gregkh how so? There isn't anything here I can't already do with a tar and ADD is there? Should we be scrubbing some of the device nodes from ADDed tar files too?

@ibuildthecloud
Copy link
Contributor

@gregkh I think the argument is that you can already run a privileged container, mknod, commit, push, pull into an unprivileged container. So essentially you can mknod in a round about way today.

Is there not an additional security layer that prevents access even if the device node exists? If there is not then you are basically saying that there is a huge security issue in docker today.

@tianon
Copy link
Member

tianon commented Mar 14, 2014

Indeed, if this PR creates a problem with security then it's not a new one.

Where's the harm in being able to create device nodes I can't read or write?

@gregkh
Copy link

gregkh commented Mar 14, 2014

@tianon, yes, you could do that today, and that's also a bad thing, that can cause problems. Don't try to make the issue worse by making it easy for people to do this, and having to break users when the whole thing finally gets fixed.

Also, this breaks systemd, isn't that a huge blocker? Unless you don't want to run systemd, and then, well, again, there are easier ways to stop systemd from running, don't try to be subtle about it...

@gregkh
Copy link

gregkh commented Mar 14, 2014

@tianon, why would you ever want to create a device node you can't read or write? It's not useful at all, so why would you want to allow it? Unless you can use it, and then, well, it's a problem...

Be explicit about permissions, if you want to create a device node, then require permissions to do so. Why is that a problem for people who want to create them today?

@ibuildthecloud
Copy link
Contributor

If the mere presence of a device node causes a security concern, then that needs to be addressed. If cgroups can effectively block access, then why not just let users create them. It does make building images easier for packages that have a dependency on things like fuse, but don't actually use fuse.

But... If this breaks systemd it should not be accepted.

@gregkh
Copy link

gregkh commented Mar 14, 2014

I don't think that cgroups can block device access, but I could be wrong. Someone should research that...

I still don't understand why a package would want to create a device node, and never use it. Specific examples would be nice to see.

@kevinwallace
Copy link
Contributor Author

@gregkh I can't speak for the motivation behind #2191 (the bug that this patch fixes), but the reason I wrote this patch was so that I could write a Dockerfile that created /dev/kvm without needing -privileged at build time, so that it could be published as a Trusted Build. It would still require -privileged at run time in order to be able to use KVM, of course.

Another common motivation (scanning the +1s above) seems to be installing dpkg packages that depend on fuse.

My understanding (based on conversation attached to #2191) was that cgroups would continue to block access to device nodes not explicitly whitelisted (see https://github.com/dotcloud/docker/blob/6c266c4b42eeabe2d433a994753d86637fe52a0b/execdriver/lxc/lxc_template.go#L42 -- whitelist is currently /dev/null, /dev/zero, consoles, /dev/{u,}random, /dev/pts, and tuntap). If that understanding is incorrect, then we have a big (pre-existing) problem.

@tianon
Copy link
Member

tianon commented Mar 14, 2014

Here's an example non-privileged session to show why this isn't an issue:

$ docker run -it --rm debian:stable bash
root@6f17eb5cebeb:/# cat /dev/mem # "do no evil (tm)"
cat: /dev/mem: Operation not permitted
root@6f17eb5cebeb:/# cat /dev/kmem   
cat: /dev/kmem: Operation not permitted
root@6f17eb5cebeb:/# cat /dev/sda1
cat: /dev/sda1: Operation not permitted

@tianon
Copy link
Member

tianon commented Mar 14, 2014

Also, as far as use cases go, live-build and debootstrap as part of docker build are at the top of my list.

@gregkh
Copy link

gregkh commented Mar 14, 2014

odds are /dev/mem is mediated by your kernel, are you sure that isn't happening here? Pick a different device node if you want to test things out, like a random serial port...

@tianon
Copy link
Member

tianon commented Mar 14, 2014

That's why I included "/dev/sda1" :)

@tianon
Copy link
Member

tianon commented Mar 14, 2014

So, just for good measure:

root@f70fd78e7d0b:/# cat /dev/ttyS0 
cat: /dev/ttyS0: Permission denied
root@f70fd78e7d0b:/# echo 'something' > /dev/ttyS0
bash: /dev/ttyS0: Permission denied

@alexlarsson
Copy link
Contributor

@gregkh Obviously cgroups can limit device access, and we're using this today in docker.
See https://www.kernel.org/doc/Documentation/cgroups/devices.txt for the details (as linked to in a previous comment here). We're setting this up pretty tight with only a whitelist of allowed devices (null, zero, full, random, urandom, tty, console, tty0, tty1, pts/*) and everything else is denied open + read + write.

I don't think the hysterical security ranting is called for, if you have actual feedback that would be interesting though. We're not completely stupid you know.

@guestisp
Copy link

So will never be possible to mount a fuse file system like s3ql or gluster inside a docker container?

@alexlarsson
Copy link
Contributor

@guestisp It is possible right now if you pass -privileged, but that gives you a very unsecure container. Long term we will probably add ways open up privileges in a more fine grained way, but that is not currently possible.

@alexlarsson
Copy link
Contributor

@guestisp the other alternative is to mount it outside the container and then add it to the container using a volume.

@guestisp
Copy link

Is not always possible to mount it outside.
In our paas environment we will have hundreds of containers on each node and mounting every shared FS outside is not really an option but is certainly the most secure way.

An option would be to mount a single volume and then bind inside a container a single folder. In this case each container share the same fs mount on node, but doesn't see other containers file due to the subdirectory, in example:

/mnt/s3ql/container1 => mounted in container1
/mnt/s3ql/container2 => mounted in container2

/mnt/s3ql is a single s3ql fs mounted on node.
Each container has its own directory mounted inside with volumes. Will this be hackable or each container will never be able to move to upper directory?

How did dotcloud s3ql mounts ?

@csirac2
Copy link

csirac2 commented Mar 14, 2014

@gregkh

why would you ever want to create a device node you can't read or write? It's not
useful at all, so why would you want to allow it? Unless you can use it, and then, well,
it's a problem...

I made an earlier comment about building up an debian rootfs with multistrap: #2979 (comment)

Whilst I can appreciate it's perhaps not a mainstream requirement, it's nonetheless a necessary part of my workflow for building embedded firmware images.

@shykes
Copy link
Contributor

shykes commented Mar 14, 2014

@gregkh I won't get into the security discussion, plenty of people adequately covering that already, but I wanted to address this particular comment:

Also, this breaks systemd, isn't that a huge blocker? Unless you don't want to run systemd, and then, well, again, there are easier ways to stop systemd from running, don't try to be subtle about it...

TLDR: it doesn't break systemd, we are just changing a hardcoded default but you can still change these values per-container or per-image (or in the case of MKNOD, we will add that flag). And we will NEVER EVER do anything to hurt another project on purpose!

I'll state the (very) obvious, we're not actively trying to break any software. Quite the opposite we actively want as many programs as possible to run happily inside docker, including of course systemd, which is a common request. We are actually putting a lot of work specifically into getting docker to work well with systemd.

We approach the problem of running systemd inside docker the same way we approach the problem for any other program:

  1. Identify everything the program requires to be present in the docker runtime environment. Examples for systemd may be and "I require uid 0", "$container should be set to a particular value", "CAP_MKNOD should be disabled" and "I must be pid 1".

  2. We identify which requirements can be generalized to all programs without breaking some of them. Hardcode them as defaults into the Docker runtime.

  3. For those requirements that don't make sense as universal defaults, we find an elegant way for the user to set them at runtime, and for the program author to set default values when creating images.

Of the 4 examples I listed above, 4 are currently baked into the runtime, but 3 are scheduled to be "demoted" to runtime/buildtime options for various reasons:

  • defaulting to uid 0 is a reasonable convention, it's the only uid guaranteed to mean the same thing in every container
  • most relevant to this discussion, CAP_MKNOD is currently disabled by default in the runtime, but merging this PR would change that.
  • we currently set $container to whatever value causes systemd and upstart to work, but we determined that this introspection method, although a pragmatic best effort solution to the problem, is not a sustainable solution and solves less problems than it creates. So we're not hardcoding it in the runtime because we don't believe it benefits other programs to have that variable added to their environment by default.
  • we currently run the application as pid1 in its own namespace, but as it turns out most applications don't handle well the special treatment reserved for pid 1 - specifically it creates quirks in signal handling, and makes it difficult to inject new processes into the namespace. So we are planning to remove this from the default as well, and make it a configurable runtime/buildtime option.

The conclusion to all this is two-fold:

  1. Just because a particular program has quirks and requirements doesn't mean we'll automatically make it a default in the docker runtime. Even if we wanted to, we couldn't, since these requirements are often contradictory between programs (for example systemd needs pid 1, but say, apache needs pid != 1).

  2. But that's totally OK, because your program can still run perfectly in Docker, quirks and all. All you need to do is build an image with the proper runtime options set. In the case of systemd, it should be built in an image which specifies "run me as PID 1", "run me with CAP_MKNOD disabled", and "run me with container=somethingsomethingsomething".

Sorry for the rather long comment. But I wanted to be thorough, because we put a lot of thought into these design decisions, and so far we have been able to fend off the spectrum of drama and keep everyone focused on technical merit and positive things like making our users happy. I feel incredibly lucky that experienced open-source people like you @gregkh and @alexlarsson take the time to discuss and contribute here. The reason it works well, I think, is that we trust each other to be well-intentioned and to put the good of the project and its users first, and have succeeded in setting a positive and friendly atmosphere regardless of technical disagreements. So I'm always alarmed when I see the notion that we might change the code on purpose to hurt someone else or generally do something negative. That is the very last thing I want this project associated with. If you feel like we are sending a negative vibe in any way, let's please discuss it, this matters a lot to me. A good place to discuss this would be #docker-dev on Freenode every thursday at 10:30AM, that's our weekly irc meeting.

Ok I'm done :)

@gregkh
Copy link

gregkh commented Mar 15, 2014

On Fri, Mar 14, 2014 at 01:56:26AM -0700, Solomon Hykes wrote:

TLDR: it doesn't break systemd, we are just moving runtime defaults to
configurable image options. And we will NEVER EVER do anything to hurt another
project on purpose!

My "sarcasm" tag got filtered away by github, sorry about that. I know
you all wouldn't ever do that on purpose, I was just teasing :)

• most relevant to this discussion, CAP_MKNOD is currently disabled by
default in the runtime, but merging this PR would change that.

That worries me, it's a permission that I feel should still be disabled,
as "normal" userspace requires that permission to be granted to create a
device node, so I worry that you are elevating permissions for users.

@creack
Copy link
Contributor

creack commented Mar 15, 2014

FYI, systemd does not drop mknod: http://cgit.freedesktop.org/systemd/systemd/tree/src/nspawn/nspawn.c#n113

@philips
Copy link
Contributor

philips commented Mar 18, 2014

Having dug into the issue and read through this thread it seems like a reasonable choice to have Docker build enabling mknod for this use case. I reviewed the code and the libcontainer masking of capabilities and it looks good too (after I cleared up my confusion via #4719).

LGTM based on the use case and available options.

@creack nspawn does lock down mknod to null, zero, full, random, urandom and tty by default though: http://cgit.freedesktop.org/systemd/systemd/diff/src/nspawn/nspawn.c?id=9457ac5b4e755e9019ead2f564124df5d35ee7cf

@gregkh
Copy link

gregkh commented Mar 19, 2014

Ok, I now believe that the container will prevent the "horrible" things I previously mentioned, sorry for the noise.

But, I still don't like the feeling that it seems you are allowing permissions to be "elevated" here over what they normally are (i.e. normal users can't mknod).

And what @philips pointed out with systemd, I don't think this will play well with that as it locks things down to prevent this from users from doing this as well. But I can't really test it, so I can't be sure how it interacts.

Anyway, feel free to merge this if it passes your testing and doesn't mess with systemd, but it really feels wrong to me...

@vieux
Copy link
Contributor

vieux commented Apr 3, 2014

Merged c94111b

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable mknod and setfcap capabilities in all containers