devicemapper: dm.basesize should be tunable per-container #14678

Closed
jeremyeder opened this issue Jul 16, 2015 · 31 comments
Labels
area/storage/devicemapper kind/enhancement Enhancements are not bugs or new features but can improve usability or performance.

Comments

@jeremyeder

Right now, when using the devicemapper graph driver, the dm.basesize daemon flag sets the size of the CoW volume for each container; the default is 10G.

If a user needs more than 10G in their CoW volume, they must completely tear down their storage backend (delete all containers and images), rm -rf /var/lib/docker, customize dm.basesize, start docker, and recreate all their images/containers.
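For reference, that tear-down looks roughly like this on a systemd host (a sketch; unit names, paths, and the daemon invocation vary by version and distro, and the 32G value is just an example):

# systemctl stop docker
# rm -rf /var/lib/docker
# docker daemon --storage-driver=devicemapper --storage-opt dm.basesize=32G

(On older versions the daemon is started with docker -d rather than docker daemon.)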

We need an option to make dm.basesize tunable on a per-container basis.
Keep the default of 10G, but allow an override at runtime:

e.g. this would create a new container with a 32G volume:

# docker run -it --basesize=32G rhel7 bash

I am explicitly NOT including resizing of existing/running containers in this issue, although that would be useful.

@unclejack
Contributor

This can't be --basesize; it has to be a storage driver option for the container, passed to docker run. Otherwise we'll end up with lots of top-level docker run options for each storage driver.

How would this interact with commit? Would that mean that the user would always get a container with that size?

@jeremyeder
Author

Perhaps a Dockerfile option.

@unclejack
Contributor

A Dockerfile-based approach could also be a problem in some circumstances. Building from a Dockerfile that needs a really large basesize raises several questions:

  • How do I enforce a limit on the maximum base size that can be set via the Dockerfile? This is a problem for build/CI servers.
  • How would the limits be handled for instances of an image that uses a larger base size? Unless the container is read-only, exhausting the available space becomes even easier.
  • How would space requirements be handled? Allowing a base size of 50 GB means the resulting image could use up a lot of disk space. Would a build fail if there isn't enough space (plus some extra headroom) left after the build?

This also seems to be somewhat related to quotas, so I think getting the design right would be a good idea.

Update: I'm not saying this isn't needed or that it shouldn't be implemented; it's just that we need to get the design right.

@rhvgoyal
Contributor

Quotas seem to work only on a per-user/group basis, and we don't have a notion of per-user/group containers.

The rest of the storage drivers seem to allow unlimited-size images/containers, so I am wondering: why not start with a bigger basesize for dm, say dm.basesize=100G?

This is not a solution to the problem, but a stopgap measure until a real solution is found and implemented.

@jeremyeder
Author

The difference is that with devicemapper, you can overcommit. With overlay, for example, you don't have a per-container limit, but you also can't overcommit.

I think the real problem is that dm.basesize is global, and can't be changed without tearing everything down.

@rhvgoyal
Contributor

Even if you get a --basesize parameter, that does not solve the overcommit problem; it seems orthogonal.

In both cases, space is allocated as containers/images request it. Giving a basesize of 100G makes no guarantees; it just says the max you can use is 100G. There is no guarantee of a minimum.

The rest of the graph drivers don't guarantee a minimum either, so I really can't see what additional problem overcommit introduces (compared to other storage drivers).

@jeremyeder
Author

Should we revisit the dm.resize and dm.thinpool options that Alex Larsson wrote?
https://github.com/rhatdan/docker/commits/thinpool

@rhvgoyal I just forwarded you an email with test results.

@rhvgoyal
Contributor

@unclejack

For the case of commit, we could probably create the new image with the same size as the container being committed. That would make sure commit does not fail.

So do we need something like --storage-driver dm.basesize=X for docker run? Is that the suggestion?

EDIT: I meant something like --storage-opt dm.basesize=X for docker run and docker create.
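For illustration, that invocation would look something like this (hypothetical syntax; nothing here is implemented yet):

# docker create --storage-opt dm.basesize=32G rhel7
# docker run -it --storage-opt dm.basesize=32G rhel7 bash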

@snitm
Contributor

snitm commented Jul 17, 2015

@jeremyeder if people use an lvm2-created thin-pool, then lvm2 may be used to resize the pool. But lvm2 cannot be used to resize the docker-created/managed thin device(s) associated with each container.

The best bet would be to overprovision the docker-created/managed thin device(s). Then, when the thin-pool is resized by lvm2, the docker thin device(s) will just grab more space from the thin-pool.
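For example, growing an lvm2-created thin-pool is a couple of lvextend calls (a sketch; assumes a volume group named vg and a pool LV named thinpool):

# lvextend -L +100G vg/thinpool
# lvextend --poolmetadatasize +1G vg/thinpool

The first call grows the pool's data device; the second grows its metadata device, which may also need headroom as the pool fills.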

@jeremyeder
Author

I think the default size of 10G was fairly arbitrary. @snitm IIUC you're agreeing with @rhvgoyal that we just bump up dm.basesize? Vivek suggested dm.basesize=100G.

This change would affect loopback and lvm2 thin-pools differently, because the upper bound of loopback is dm.loopdatasize, whereas the upper bound of an lvm2 thin-pool is the size of the partition/disk.

So, coupled with dm.basesize=100G, would we increase dm.loopdatasize as well?

@rhvgoyal
Contributor

@jeremyeder

dm.loopdatasize should not have any effect on dm.basesize. dm.basesize is a virtual size; it does not matter how much physical space there is in the underlying storage.

dm.loopdatasize specifies the max size of the loop device. It could even be 1G and we should still be able to create thin devices of size 100G; it's just that when you start writing to those devices, they cannot collectively consume more than 1G.

IOW, dm.basesize does not have a direct dependency on dm.loopdatasize.

@snitm
Contributor

snitm commented Jul 17, 2015

@jeremyeder loop-lvm support should be removed; not trying to be snarky, I genuinely believe that (you know this!). So I'm not going to put any energy into reasoning through loopback-specific insanity. Sorry.

Anyway, as for lvm2-created thin-pools: dm.basesize only controls the size of the docker-created thin devices, right? So I'm not following your comment about "whereas the upper bound of lvm2 thin-pool is the size of the partition/disk".

Could be I'm misremembering what dm.basesize governs, but I don't have the docker code at my fingertips at the moment.

@jeremyeder
Author

@rhvgoyal Right, I know there's no direct dependency; I was postulating that we keep a similar overcommit ratio, though. Maybe that's not necessary.

@snitm right; dm.basesize is for individual containers. The upper bound of the lvm2 thin-pool is controlled by the physical capacity of whatever dm.datadev is pointing to (DEVS=/dev/sdX in docker-storage-setup). i.e. my test system uses a 300GB disk.

Back to the original intent of this issue, 10GB is too small for some situations, thus we have the following issues:

  1. we can't adjust the dm.basesize on a per-container basis
  2. changing global dm.basesize requires tear-down of all docker images/containers
  3. we can't resize existing containers

A fix to any one of those would alleviate the situation.

What if we added a new option to docker-storage-setup for dm.basesize?

@snitm
Contributor

snitm commented Jul 17, 2015

@jeremyeder dm.datadev is really deprecated (from my point of view). If users are going to use "direct-lvm", the best way to do so is with an lvm2-created thin-pool and dm.thinpooldev. The entire point is to lean on lvm2's feature-rich thin-pool management rather than reinventing the same in docker.

I can appreciate that there may be docker users who deployed without understanding the knobs at their disposal, but honestly the right thing to do is:

  1. use dm.thinpooldev
  2. increase dm.basesize to overprovision
  3. use lvm2's monitoring and management to extend the thin-pool that was handed to docker in step 1

Rescuing legacy users isn't a good use of developer time. More documentation and training should be backfilled to avoid this in the future.
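Concretely, steps 1-3 look something like the following (a sketch; the VG/LV names and sizes are examples):

# lvcreate --type thin-pool -L 95G -n thinpool vg
# docker daemon --storage-driver=devicemapper \
      --storage-opt dm.thinpooldev=/dev/mapper/vg-thinpool \
      --storage-opt dm.basesize=100G

and, for step 3, enable lvm2's autoextend in lvm.conf (or a profile):

activation {
    thin_pool_autoextend_threshold = 80
    thin_pool_autoextend_percent = 20
}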

@rhvgoyal
Contributor

@jeremyeder

OK, I have generated a PR to increase the devicemapper default to 100G. I expect this might lead to an increased fs metadata footprint on container creation: with this change, initial data usage after starting docker was around 60MB, while with 10G it was around 20MB.

But when I pulled in an image and started a container, that difference remained more or less 40MB and did not increase, so I am assuming it might not be a very big deal.

We can experiment with this and if we run into issues, then revisit this issue.

@rhvgoyal
Contributor

@vbatts ping.

@jeremyeder
Author

@rhvgoyal what's the difference in mkfs.xfs time for a 10G volume vs. a 100G volume? I know XFS is extremely fast at mkfs, but just wondering, since we're measuring container startup time in the "hundreds of milliseconds" range right now.

@rhvgoyal
Contributor

@jeremyeder Here is the initial data with 100 container creation, 3 iterations.

10G (Time is in seconds)

Time taken to create containers is 51.707271494
Time taken to create containers is 52.670078980
Time taken to create containers is 54.176040303

100G (Time is in seconds)

Time taken to create containers is 51.592516651
Time taken to create containers is 52.896233596
Time taken to create containers is 52.387154341

10G (Data and Metadata space usage)

Data Space Used: 726.4 MB
Metadata Space Used: 3.281 MB

100G (Data and Metadata Space Usage)

Data Space Used: 766.5 MB
Metadata Space Used: 3.314 MB

Observations

  • Looks like nothing much has changed with basesize 100G. Container creation time is more or less the same.
  • There is around a 40MB penalty in terms of data space, and I think that penalty gets applied as soon as docker starts and does not scale with the number of containers.

@vbatts
Contributor

vbatts commented Jul 17, 2015

Besides the 40MB penalty, what is the drawback here? Reading the commentary, it doesn't look like there is one.

@rhvgoyal
Contributor

@vbatts

One drawback I can think of is that an unprivileged process can easily fill the pool:

dd if=/dev/zero of=zerofile

Previously the max was 10G, so the damage a container could do was limited.

But this is no different than any other graph driver.

@jeremyeder
Author

@rhvgoyal ok, thanks for running those tests.

@unclejack
Contributor

@rhvgoyal Yes, I was actually proposing that we add --storage-opt to docker run if we need to add anything like this.

The part about quotas was about per-container quotas, not per-user/group quotas. This would be similar to quotas, except that it applies to the entire container. It could integrate well with the devicemapper base size for the container.

We could take this even further and use it for builds. That would be a driver-agnostic way to specify this limit.

@rhvgoyal
Contributor

@unclejack Per-container quota is a good idea, but I am not aware of any functionality we could use to enforce it for filesystem-based graph drivers.

As of now, devicemapper seems to be the only graph driver that can enforce this. Anyway, I will keep it in mind. For now we have changed the dm.basesize default to 100G. If there is still a need for a per-container tunable, I will write a patch to implement --storage-opt for docker run.

@rhatdan
Contributor

rhatdan commented Jul 21, 2015

The people I have talked to who want quotas want them on more than just the CoW file systems; they also want them on volumes mounted into containers, like home directories. Relying on the quota db is tough if you have multiple containers sharing the same UIDs.

@unclejack
Contributor

@rhvgoyal @rhatdan Perhaps this could be a mount-time option for the file system, e.g. mount -o maxusage=8G .... This would work for the container's fs. A special driver could be used for volumes to get quotas there as well.

@rhatdan
Contributor

rhatdan commented Jul 22, 2015

Well, the problem is that there is no quota system for containers; quotas exist only per UID on file systems.
@rhvgoyal and I have talked about potentially using devicemapper/lvm to set up limited-size volumes that could be mounted into a container, giving admins the ability to control the size of a container's writable area. That would probably help out PaaS-type environments. I would think you would do this with the docker volume command; a rough sketch follows.
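A manual version of that idea with lvm thin volumes (a sketch; the names are hypothetical, and docker would need to manage these itself for it to be more than a workaround):

# lvcreate -V 5G --thin -n containervol vg/thinpool
# mkfs.xfs /dev/vg/containervol
# mount /dev/vg/containervol /mnt/containervol
# docker run -v /mnt/containervol:/data rhel7 bash

The container can then write at most 5G into /data, regardless of dm.basesize.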

@thaJeztah thaJeztah added area/storage/devicemapper kind/enhancement Enhancements are not bugs or new features but can improve usability or performance. labels Aug 9, 2015
@cubbasa

cubbasa commented Aug 17, 2015

Right now the devicemapper driver is very low level and can't easily enforce per-container quotas; quota is globally controlled by dm.basesize. So what if we had another kind of driver, an lvm driver, managed by lvm rather than by devicemapper directly? In this solution, container rootfs volumes would be full-fledged lvm thin-provisioned volumes.

This would allow an option like lvm.basesize globally, with a reasonable size, while letting users extend a rootfs according to their needs and providing a quota per container. You could do it manually, or automatically: first extend the lvm volume, and during container start a tool like cloud-init could automatically resize the rootfs if not prevented by capabilities.

Thanks to that, we wouldn't need any hacks like xfs project quotas or similar, and it could coexist with the devicemapper driver without any changes. Devicemapper could be used for containers that are mostly the same and repeatable, while lvm could provide options for unique containers that are hardly the same.

@rhvgoyal
Contributor

An lvm graph driver for docker might be worth trying. It would also remove a lot of complexity from the devicemapper code, like managing transactions and the thin-pool transaction id. We would not have to deal with libdm and could get rid of a lot of code there.

And as you said, this would allow growing a container outside of docker using the command line. We would probably just have to export the name of the logical volume for the container root via docker inspect.

Somebody will have to write it, and we will have to play with it and see how well it works.
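As a starting point, the devicemapper driver already exposes its backing device through inspect, so an lvm driver could do the same for the LV name (a sketch; the exact field names vary across docker versions):

# docker inspect -f '{{ .GraphDriver.Data.DeviceName }}' mycontainer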

@rhvgoyal
Contributor

The only drawback of an lvm driver seems to be that one would have to deal with the command-line interface, parse output, etc.

@runcom
Member

runcom commented Apr 1, 2016

fixed by #19123

@runcom runcom closed this as completed Apr 1, 2016