Proposal: User containers should be mounted read-only #890
Comments
Noticed there is no "security" label. There should be one I think.
Note that memory is the most expensive resource when getting VMs, so mounting a ramdisk of 1 GB is essentially assigning another gigabyte of memory to that user container.
That is correct. 1 GB may be a tad much; 512 MB should do just fine. Most actions should not even write to disk at all.
Do we need persistent disk at all, or should we rather start the container with a read-only filesystem? What are the use cases for a persistent store within the container? One that comes to my mind is caching across multiple invocations within the same warm container... In case we do need persistent disk, I would further limit the tmpfs, because memory is the limiting factor on invoker machines. Maybe 64 MB? (Really depends on the use cases we want to support.) For any larger storage we should probably think about mounting a docker volume, where we would need to test carefully upfront to avoid performance and robustness issues.
One use case I can think of is a blackbox nodejs action that requires additional modules; those modules need to be installed somewhere. Other actions will also probably require some minimal /tmp space, but custom user actions that need to pull in additional libs are the main target. Docker volumes are directly related to the storage driver; afaik only devicemapper gives you the option to attach lvm volumes of a definitive size. Quotas also seem to be supported for devicemapper, zfs and btrfs, see moby/moby#3804. But keep in mind that no disk-based storage will ever be as fast as memory, so it will take some time to make the volume and remove it after the action is done. This is where the charm of tmpfs comes into play. If we assume a limit of 512 MB for /tmp and a 512 MB memory limit for the container, that makes 1 GB per container. But bear in mind that both memory and tmpfs are not allocated right away; if an action only takes like 64 MB, then only that gets used. I'd say on an 8 GB machine you could easily run 4-5 concurrent actions that use maximum resources without swapping.
Besides the memory concern (a pretty huge one if you ask me, I don't think we can "waste" precious memory for container storage) we also need to make sure that "non-nodejs" containers continue to work. Swift, for example, compiles the code on initialization of the container and thus needs to write that somewhere. Is that then also stored in
Yes, of course. The memory vs. HD storage question aside, we need to limit storage for containers. A user container is basically a function call. It should not have the need to store anything locally besides runtime data. Why would a stateless, serverless function need to store any data on a disk anyway? The 512 MB limit is mostly helpful for blackbox containers, where you are free to do whatever you like. When you are in a swift action or nodejs action, you are basically using a runtime instead of the naked container, so storing files on disk is even less relevant. Besides, the user code cannot rely on being in the same container the next time it gets invoked.
@jeremiaswerner i think we definitely need a writable /tmp disk -- there are use cases for it, also from a competitive view. @domdom82 have you asked phil estes for advice?
@mbehrendt thanks for clarifying |
Just FYI -- the mount options support in [...]. Read-only seems like a good option for your use case, with the available tmpfs as a "relief valve" for the cases that must have a writeable location. I can't remember if you are using user namespaces, but we had a restriction that the [...]. As you noted, there has been work for quota support on some backends recently (ping @cpuguy83), so that may be a future option if there are more complexities found with the readonly path. If memory is an issue, you can also create a "real" file mounted as a writeable temporary fs (loopback file mounted) as another option. The time cost would probably be higher than the memory option, but it would save you the memory allocation (which, as was noted above, is only truly allocated if it's used).
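A minimal sketch of the read-only-root plus tmpfs combination described above, as it looks with the flags from newer `docker run` versions (the image name and mount options here are illustrative, not from this thread):

```shell
# Root filesystem is mounted read-only; /tmp is the only writable
# path, backed by a size-capped tmpfs. Being RAM-backed, the tmpfs
# consumes host memory only for bytes actually written.
docker run --read-only \
  --tmpfs /tmp:rw,noexec,nosuid,size=512m \
  my-action-image
```

Any write outside /tmp then fails with a read-only-filesystem error, which is exactly the "relief valve" behavior described.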
FYI, on 1.11: |
@estesp @cpuguy83 thanks for chiming in. Regarding the user namespace problem: yes, that is going to be an issue. We want to prevent fork bombs in the container; the current idea is ulimit nproc, which works on a per-user basis and only on non-root users. Since we have stability issues with docker >=1.10, we are currently stuck on 1.9.1. I have proposed running our containers as non-root (see issue #898) using the
@domdom82 What stability issues? Can you detail them in another issue? BTW, nproc is pretty horrible as it is per-user; however, if you are going to be controlling which user is in the container, maybe this isn't a problem for you. As of kernel 4.3 there is now a pids cgroup controller for controlling the number of pids allowed. This should be used for fork bomb prevention.
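For reference, the pids cgroup controller mentioned above is exposed through a `docker run` flag in later Docker releases (the limit value here is illustrative):

```shell
# Cap the container at 200 pids (processes + threads). A fork bomb
# inside the container then gets EAGAIN from fork()/clone() instead
# of exhausting the host's pid space.
docker run --pids-limit 200 my-action-image
```

Unlike ulimit nproc, this is scoped to the container's cgroup rather than to a user id, so it works regardless of which user runs inside the container.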
@cpuguy83 we have been discussing this problem for a while. Here is the gist by @perryibm:
Bottom line:
|
@cpuguy83 I've also been involved, as has @duglin, in trying to help with some of the performance/stability issues. A couple of comments:
|
Cool. Looks like 1.12 is the new guy then. As for the parallelism testsuite, I think @perryibm has prepared a suite for local testing; I'm sure he'd be happy to share. Thanks a lot for your insights!
@jeremiaswerner @mbehrendt A simple use case for disk is our own image processing demo, which writes the image to a temporary file before invoking ImageMagick. @domdom82 I agree that 1.12 is the first reasonable thing to switch to. I'll try to do a parallelism study soon so we have numbers. If 1.12 is at least as good as 1.11, then we should proceed on performance grounds. That leaves stability testing, which we can only confirm by testing at scale in the system, as local and small-scale tests historically cannot flush these issues out.
After testing log limits today, I have opened a new issue with docker, as there is still a security vulnerability with log sizes in docker 1.12. Linking to it here:
Good to hear: Apparently the log size issue is fixed with docker 1.13. |
Initial benchmark testing on Docker 1.12.1 using Kernel 4.4.0-28 looks encouraging:
The numbers are comparable to Docker 1.11 and about 50% better than Docker 1.9.
Regarding the log size issue, the PR to fix it went in already. See #22982. So as @domdom82 said, fixed in 1.13. |
Regarding the /tmp filesystem there is an alternative to tmpfs (the loopback-mounted file that @estesp mentioned above), although I would only use it as a last resort; I only mention it here for reference.
This has the one advantage that it does not eat memory, but creating the backing file and tearing it down costs time.
In contrast, tmpfs costs next to no time to create and is immediately destroyed when the container ends.
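The loopback variant can be sketched like this (all paths, sizes and the image name are made up for illustration; the mount steps require root):

```shell
# Create a 512 MB backing file on the host disk (not in RAM),
# put a filesystem on it, and loop-mount it.
dd if=/dev/zero of=/var/tmp/ctr-scratch.img bs=1M count=512
mkfs.ext4 -F /var/tmp/ctr-scratch.img
mkdir -p /mnt/ctr-scratch
mount -o loop /var/tmp/ctr-scratch.img /mnt/ctr-scratch

# Bind it into the container as its writable /tmp.
docker run -v /mnt/ctr-scratch:/tmp my-action-image

# Tear down after the container exits.
umount /mnt/ctr-scratch
rm /var/tmp/ctr-scratch.img
```

The container can never write more than the 512 MB the file provides, and no host memory is consumed, at the cost of the dd/mkfs/mount round trip per container.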
With docker 1.12 there is even a CLI option to use tmpfs right away:
Simple as that. |
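The CLI option being referred to is `--tmpfs` (the flag is also named at the end of the proposal below; the size and image name here are illustrative):

```shell
# docker creates and size-caps the tmpfs itself; no invoker-side
# mount/umount handling is needed.
docker run --tmpfs /tmp:rw,size=512m my-action-image
```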
@perryibm regarding the tmpfs=memory problem, we could make the storage in /tmp also configurable, similar to action limits for memory, so the user could decide to have
etc. This would then simply be added to the overall size calculation for container density. Thoughts?
My concern with tmpfs being in memory at all is that memory is our most constraining resource, limiting the number of containers we can deploy. I can see that reflected already in the numbers you propose, in that they are very small for a file system. Has there been consensus that file storage on par with, or significantly smaller than, the memory limits is acceptable?
Team meeting decision from Oct 4:
|
This is a design proposal for security hardening of the invoker.
During my work on AppArmor I did some research and found that it is currently very hard to enforce a disk size quota on docker containers. There are some discussions going on in moby/moby#3804, and there is the `--device-write-bps` option in docker 1.10, which only limits throughput. We can't use docker 1.10 for stability reasons anyway.

My proposal is this:
To prevent any action container from flooding the host filesystem, we mount the container in read-only mode. Each container will be given a writable `/tmp` directory that is mounted on a tmpfs volume with a small size (e.g. 512 MB, 1 GB, etc.). tmpfs is a ramdisk, so nothing gets written to the actual disk of the host (docker logs of course being a different issue).
I have been dabbling with `docker volume` but could not get it to actually create a real tmpfs volume with a size limit. The example for tmpfs here does not work.

Instead my proposal looks something like this:
- when `makeContainer` gets called, the invoker first calls `makeTmpFs(containerName)`
- `makeTmpFs` calls [...]
- `makeContainer` then starts the action container like so: [...]
- `rmContainer` finally does: [...]

Advantages:
Tradeoff we will have to make:
Once we can make the switch to docker > 1.10, we can simply use the `--tmpfs` option of `docker run` and drop the above tmpfs handler in the invoker.
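The three invoker steps in the proposal above might look roughly like this in shell (the function names follow the proposal, but the mount point, size, and docker invocation details are illustrative assumptions, not the actual implementation; requires root):

```shell
#!/bin/sh
# makeTmpFs: create a size-capped, RAM-backed /tmp for one container.
makeTmpFs() {
  name="$1"
  mkdir -p "/containers/$name/tmp"
  mount -t tmpfs -o size=512m tmpfs "/containers/$name/tmp"
}

# makeContainer: start the action container with a read-only root
# and the per-container tmpfs bound to /tmp.
makeContainer() {
  name="$1"; image="$2"
  makeTmpFs "$name"
  docker run -d --name "$name" --read-only \
    -v "/containers/$name/tmp:/tmp" "$image"
}

# rmContainer: remove the container, then tear down its tmpfs.
# Unmounting frees any memory the tmpfs consumed.
rmContainer() {
  name="$1"
  docker rm -f "$name"
  umount "/containers/$name/tmp"
  rmdir "/containers/$name/tmp"
}
```

Because the tmpfs is mounted by the invoker rather than by docker, this works on docker 1.9.1; the `--tmpfs` flag of newer docker versions replaces `makeTmpFs` entirely.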