
Proposal: User containers should be mounted read-only #890

Open

domdom82 opened this issue Jul 14, 2016 · 26 comments

@domdom82 (Contributor) commented Jul 14, 2016

This is a design-proposal for security hardening of the invoker.

During my work on AppArmor I did some research and found that it is currently very hard to enforce a disk size quota on docker containers. There is an ongoing discussion at moby/moby#3804, and docker 1.10 adds the --device-write-bps option, which only limits throughput. We can't use docker 1.10 for stability reasons anyway.

My proposal is this:

To prevent any action container from flooding the host filesystem we mount the container in read-only mode. Each container will be given a writable /tmp directory that is mounted on a tmpfs volume with a small size (e.g. 512 MB, 1 GB etc.).

tmpfs is a RAM-backed filesystem, so nothing gets written to the actual disk of the host (docker logs of course being a different issue).

I have been dabbling with docker volume but could not get it to actually create a real tmpfs volume with a size limit. The example for tmpfs here does not work.

Instead my proposal looks something like this:

  1. when makeContainer gets called, the invoker first calls makeTmpFs(containerName)

  2. makeTmpFs calls

    tempDir=$(mktemp -d)
    mount -t tmpfs -o size=512M,mode=0700 tmpfs $tempDir
    
  3. makeContainer then starts the action container like so

    docker run -d --read-only -v $tempDir:/tmp --security-stuff whisk/nodeJsAction
    
  4. rmContainer finally does

    umount $tempDir
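
Put together, a minimal shell sketch of the whole sequence could look like this (just an illustration of the flow; the real invoker logic is not a shell script, I've left out the --security-stuff options, and the 512M size is only an example):

    #!/bin/bash
    # sketch: per-container tmpfs lifecycle around docker run
    containerName="$1"

    # makeTmpFs: RAM-backed scratch space for this container
    tempDir=$(mktemp -d)
    mount -t tmpfs -o size=512M,mode=0700 tmpfs "$tempDir"

    # makeContainer: root fs read-only, only /tmp is writable (and RAM-backed)
    docker run -d --read-only --name "$containerName" -v "$tempDir":/tmp whisk/nodeJsAction

    # rmContainer: remove the container, then release the tmpfs
    docker rm -f "$containerName"
    umount "$tempDir" && rmdir "$tempDir"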
    

Advantages:

  1. Increased security. Containers cannot flood the host.
  2. Increased speed. tmpfs is RAM-backed, so super fast; mount/umount takes next to no time.
  3. We can really enforce a disk quota for actions.

Tradeoff we will have to make:

  1. Actions can no longer write wherever they want in the container.
  2. Blackbox implementers need to know about this, so the contract must be updated.

Once we can make the switch to docker > 1.10, we can simply use the --tmpfs option of docker run and drop the above tmpfs handler in the invoker.

@domdom82 (Contributor Author)

Noticed there is no "security" label. There should be one I think.

@markusthoemmes (Contributor)

Note that memory is the most expensive resource when getting VMs, so mounting a ramdisk of 1 GB is essentially assigning another gigabyte of memory to that user container.

@domdom82 (Contributor Author) commented Jul 14, 2016

That is correct. 1 GB may be a tad much; 512 MB should do just fine, since most actions should not write to disk at all.

@jeremiaswerner (Contributor)

Do we need persistent disk at all, or should we rather start the container with a read-only filesystem? What are the use cases for having a persistent store within the container? One that comes to mind is caching for multiple invocations within the same warm container...

In case we need persistent disk, I would further limit the tmpfs because memory is the limiting factor on invoker machines. Maybe 64 MB? (It really depends on the use cases we want to support.)

For any larger storage we should probably think about mounting a docker volume, which we would need to test carefully upfront to avoid performance and robustness issues.

@domdom82 (Contributor Author)

One use case I can think of is a blackbox nodejs action that requires additional modules; those modules need to be installed somewhere.

Other actions will also probably require some minimal /tmp space, but custom user actions that need to pull in additional libs are the main target.

Docker volumes are directly related to the storage driver. AFAIK only devicemapper gives you the option to attach LVM volumes of a fixed size. Quotas also seem to be supported for devicemapper, zfs and btrfs, see moby/moby#3804.

But keep in mind that no disk-based storage will ever be as fast as memory, so it will take some time to make the volume and remove it after the action is done. This is where the charm of tmpfs comes into play.

If we assume a limit of 512 MB for /tmp and a 512 MB memory limit for the container, that makes 1 GB per container. But bear in mind that both memory and tmpfs are not allocated right away; if an action only takes like 64 MB, then only that gets used.

I'd say on an 8 GB machine you could easily run 4-5 concurrent actions that use maximum resources without swapping.

@markusthoemmes (Contributor)

Besides the memory concern (a pretty huge one if you ask me; I don't think we can "waste" precious memory for container storage), we also need to make sure that "non-node-js" containers continue to work. Swift, for example, compiles the code on initialization of the container and thus needs to write the result somewhere. Is that then also stored in /tmp, effectively reducing the quota the user has at hand?

@domdom82 (Contributor Author)

Yes, of course. Memory vs. disk storage aside, we need to limit storage for containers. A user container is basically a function call; it should not need to store anything locally besides runtime data. Why would a stateless, serverless function need to store any data on disk anyway?

The 512 MB limit is mostly helpful for blackbox containers where you are free to do whatever you like. When you are in a swift action or nodejs action, you are basically using a runtime instead of the naked container, so storing files on disk is even less relevant.

Besides, the user code can not rely on being in the same container the next time it gets invoked.

@mbehrendt commented Jul 17, 2016

@jeremiaswerner I think we definitely need a writable /tmp disk -- there are use cases for it, also from a competitive point of view.

@domdom82 have you asked Phil Estes for advice?

@jeremiaswerner (Contributor)

@mbehrendt thanks for clarifying

@estesp commented Jul 19, 2016

Just FYI: support for mount options in the local driver for the docker volume command came in 1.11, which is why the example from the current docs doesn't work.

Read-only seems like a good option for your use case, with the available tmpfs as a "relief valve" for the cases that must have a writable location. I can't remember if you are using user namespaces, but we had a restriction that the --read-only option on docker run conflicted with a daemon with user namespaces enabled (based on a Linux kernel restriction on remounting the fs as read-only). That problem is potentially resolved upstream, so this is a point-in-time statement.

As you noted, there has been work on quota support for some backends recently (ping @cpuguy83), so that may be a future option if more complexities are found with the read-only path.

If memory is an issue, another option is to create a "real" file mounted as a writable temporary fs (loopback mounted file). The time cost would probably be higher than with the memory option, but it would save you the memory allocation (which, as was noted above, I think is only truly allocated if it's used).

@cpuguy83

FYI, on 1.11:

    docker volume create --name mytmpfs --opt type=tmpfs --opt device=tmpfs --opt o=size=512M,mode=0700
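
The volume could then be attached by name to an otherwise read-only container, roughly like this (a sketch, reusing the mytmpfs name from the example above):

    docker run -d --read-only -v mytmpfs:/tmp whisk/nodeJsAction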

@domdom82 (Contributor Author)

@estesp @cpuguy83 thanks for chiming in. Regarding the user namespace problem, yes, that is going to be an issue. We want to prevent fork bombs in the container; the current idea is ulimit nproc, which works on a per-user basis and only for non-root users. Since we have stability issues with docker >= 1.10 we are currently stuck on 1.9.1. I have proposed running our containers as non-root (see issue #898) using the docker run -u option with a high UID. Of course this would mean we would have to maintain the UIDs ourselves.
It would be awesome to have nproc-like limits on containers.
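
For illustration, combining a dedicated high UID with a per-user process limit could look roughly like this (a sketch only; the UID and the nproc values are made-up examples, not a tested configuration):

    # run the action as a dedicated non-root UID managed by the invoker,
    # and cap the number of processes that UID may create
    docker run -d --read-only -u 54321 --ulimit nproc=512:512 -v "$tempDir":/tmp whisk/nodeJsAction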

@cpuguy83

@domdom82 What stability issues? Can you detail them in another issue?

BTW, nproc is pretty horrible as it is per-user; however, if you are going to be controlling which user is in the container, maybe this isn't a problem for you.

As of kernel 4.3 there is now a pids cgroup controller for controlling the number of pids allowed. This should be used for fork bomb prevention.
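
For reference, on a setup with kernel >= 4.3 and a docker version that exposes the pids controller (1.11+, if I recall correctly), this would be a single flag (sketch):

    # cap the total number of pids (processes/threads) inside the container
    docker run -d --pids-limit 100 whisk/nodeJsAction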

@domdom82 (Contributor Author) commented Jul 20, 2016

@cpuguy83 we have been discussing this problem for a while. Here is the gist by @perryibm:

History: After upgrading from docker 1.9 to 1.10 and then 1.11, we experienced increasing and then severe instability to the point where the docker daemon, under light-moderate load, will go into a bad state in 3-6 hours. We decided to downgrade to 1.9 and have since achieved stability. But we still want to go forward and have been consulting local Docker experts.

aufs has frequent "Failed to remove..." issues which are readily reproducible with even a modest level of concurrency. With 2 threads doing run and rm it often takes 10 iterations before it manifests. In my experience, overlayfs completely solves this particular problem (even if there are other issues).

All of docker 1.9 through 1.11 experience daemon inconsistencies and soft panics under concurrent load, as recently as May-Jun 2016. See moby/moby#19758. It is dependent on concurrency (to enable the bug), heavy load (to make it show up faster) and use of docker features, particularly networking (to lengthen the critical section and increase the race probability).

In local testing (my python script), I have measured that the speedup of concurrent operations is modest for 1.9 (only 20%) and significant for 1.11 (60-70%).

Docker 1.11 has a bug with pause/unpause which was diagnosed by Doug Davis. He sent me a patched docker which fixes the issue. Independently and around the same time, a similar fix was merged into the code base, so presumably it will show up in 1.12.

Conclusions:

  1. Use overlayfs
  2. Docker 1.9 seems stable if we eschew parallelism. In any case, based on (3) alone, it's a disappointing gain.
  3. Docker 1.10 is not obviously more stable than 1.11 so don't bother with it either.
  4. Docker 1.11 is not usable unless we deploy a patched version, and even so it fails under concurrent load. It does offer a glimmer of hope in terms of better performance.
  5. Docker 1.12 - who knows? It does not seem to me that our container use case is well-supported by Docker.

Bottom line:

  • We had problems with parallel docker operations in general (both 1.9.x and later).
  • We worked around it by serializing docker operations per invoker node.
  • We switched from aufs to overlay (which currently pokes us with inode exhaustion, but that is manageable).
  • Docker 1.11 seems the most promising performance-wise, but it has problems with pause/unpause.
  • We would probably go with docker 1.11 if the mentioned pause/unpause patch were merged (maybe it is now, @perryibm can you elaborate on this?).

@estesp commented Jul 20, 2016

@cpuguy83 I've also been involved, as has @duglin, in trying to help with some of the performance/stability issues.

A couple comments:

  • The pause/unpause fix is merged into docker/docker master, but did not make any of the 1.11.x point releases, so officially 1.12 will be the first release to include it. 1.12 is at rc4, so it would be good for it to be tested ASAP to see whether that plus other changes fixes any of these issues.
  • The "overlay2" driver has been merged and I believe is in 1.12; this will significantly resolve inode exhaustion issues from what I understand. I believe it has a kernel dependency, but not an onerous one if I remember correctly.
  • I've looked on and off at some of the parallelism issues, but need a more deterministic scenario to replicate the issues seen at full load so that they can be debugged more fully and we can start to determine where fixes are necessary. I'm also interested in how the containerd/daemon/runc split may help, or at least impact, parallelism stability.

@domdom82 (Contributor Author) commented Jul 20, 2016

Cool. Looks like 1.12 is the new guy then.
Overlay2 will require us to move to kernel 4.x, which could be an issue. Alternatively, we can experiment with changing the inode/block ratio of the filesystem to get more inodes. This is only a problem with the CI system anyway, where new images are built 24/7.
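
For example, something like this at mkfs time (a sketch; the device and the bytes-per-inode value are made up and would need tuning for the CI machines):

    # ext4: one inode per 8 KB of space instead of the default 16 KB,
    # roughly doubling the inode count of the filesystem
    mkfs.ext4 -i 8192 /dev/sdb1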

As for the parallelism testsuite, I think @perryibm has prepared a suite for local testing. I'm sure he'd be happy to share.

Thanks a lot for your insights!

@perryibm (Contributor)

@jeremiaswerner @mbehrendt A simple use case for disk is our own image processing demo, which writes the image to a temporary file before invoking ImageMagick.

@domdom82 I agree that 1.12 is the first reasonable thing to switch to. I'll try to do a parallelism study soon so we have numbers. If 1.12 is at least as good as 1.11, then we should proceed on performance grounds. That leaves stability, which we can only confirm by testing at scale in the system, as local and small-scale tests historically cannot flush these issues out.

@domdom82 (Contributor Author)

After testing log limits today, I have opened a new issue with docker as there is still a security vulnerability with log sizes in docker 1.12. Linking to it here:

moby/moby#25930

@domdom82 (Contributor Author)

Good to hear: Apparently the log size issue is fixed with docker 1.13.

@domdom82 (Contributor Author)

Initial benchmark testing on Docker 1.12.1 using Kernel 4.4.0-28 looks encouraging:

Server Version: 1.12.1
Kernel Version: 4.4.0-28-generic

Storage Driver: overlay
Backing Filesystem: extfs
CPUs: 4

Full     = docker run, pause, unpause, rm
No pause = docker run, rm
Skip rm  = docker run, pause, unpause  (for testing)

Table shows the rate (per sec) of a docker sequence.  Measured with 5 iterations per thread.  Error count in parens.
-----------------------------------------------------------------------------------------------------------------------
            1 thread   2 threads (lock)   2 threads     Speedup      3 threads     Speedup
Full         2.236          2.392          3.706          1.66        4.147          1.85
No pause     2.554                         3.982          1.56        4.452          1.74
Skip rm      4.334                         6.857          1.58        6.727          1.55

The numbers are comparable to Docker 1.11 and about 50% better than Docker 1.9.
This is a single-node benchmark, not a load test on a distributed system yet; I will conduct that next.

@markusthoemmes (Contributor)

Regarding the log size issue, the PR to fix it went in already. See #22982. So as @domdom82 said, fixed in 1.13.

@domdom82 (Contributor Author) commented Aug 22, 2016

Regarding the /tmp filesystem, there is an alternative to tmpfs, although I would only use it as a last resort. I mention it here for reference.

  1. Make an empty file to hold the tmp fs (1 GB in this example; takes about 1.5 sec):
    dd if=/dev/zero of=/tmp/tmpfs.bin bs=1M count=1024
  2. Make a filesystem in the file (ext4 in this example; takes 0.1 sec):
    mkfs.ext4 -F -v /tmp/tmpfs.bin
  3. Mount the filesystem locally on the docker host:
    mount /tmp/tmpfs.bin /mnt/tmp/ -t ext4 -o loop,rw,noexec,nosuid
  4. Export the filesystem to the docker container (which is otherwise read-only):
    docker run --read-only -v /mnt/tmp:/tmp -it ubuntu bash
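
For completeness, the invoker would also have to clean up again after the container is gone (a sketch, using the paths from the steps above):

    # release the loop mount and delete the backing file
    umount /mnt/tmp
    rm /tmp/tmpfs.bin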

This has the one advantage that it does not eat memory.
However there are multiple drawbacks to this:

  • Loopbacks are very slow. We don't expect a lot of load here but still.
  • Massive overhead. The invoker will have to keep track of all those loopback files. It will need to create them, mkfs them, mount them and clean them up during gc later.
  • Performance loss to actually make the files (dd-ing alone adds about 1.5 secs on top of the docker run command; 0.8 secs for 512 MB files).

In contrast, tmpfs costs next to no time to create and is immediately destroyed when the container ends.

@domdom82 (Contributor Author)

With docker 1.12 there is even a CLI option to use tmpfs right away:

    docker run --read-only --tmpfs /tmp:rw,noexec,nosuid,size=64M -it ubuntu bash

Simple as that.

@domdom82 (Contributor Author) commented Sep 8, 2016

@perryibm regarding the tmpfs = memory problem, we could make the storage in /tmp configurable as well, similar to the action limits for memory. So the user could decide to have

  • 0MB - "true" read-only (i.e. no /tmp storage)
  • 32MB
  • 64MB
  • 128MB

etc.

This would then simply be added to the overall size calculation for container density.
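
A rough sketch of how the invoker could apply such a per-action setting (TMP_LIMIT is a hypothetical variable holding the configured limit, using the docker 1.12 --tmpfs flag mentioned above):

    # 0 means "true" read-only, anything else becomes the tmpfs size for /tmp
    if [ "$TMP_LIMIT" = "0" ]; then
        docker run -d --read-only whisk/nodeJsAction
    else
        docker run -d --read-only --tmpfs /tmp:rw,noexec,nosuid,size="$TMP_LIMIT" whisk/nodeJsAction
    fi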

Thoughts?

@perryibm (Contributor) commented Sep 9, 2016

My concern with tmpfs being in memory at all is that memory is our most constrained resource, limiting the number of containers we can deploy. I can see that reflected already in the numbers you propose, in that they are very small for a file system. Has there been consensus that file storage on par with, or significantly smaller than, the memory limits is acceptable?

@domdom82 (Contributor Author) commented Oct 4, 2016

Team meeting decision from Oct 4:

  • we will go forward with user namespaces
  • we will not mount containers read-only
  • we will limit storage using overlay2 + xfs using docker 1.13 (modulo stemcell being able to mount root fs as xfs)
  • if xfs is not possible, pursue other solutions like io-limits or watchdogs until ext4 quota support arrives
