devicemapper: dm.basesize should be tunable per-container #14678

Closed
jeremyeder opened this issue Jul 16, 2015 · 31 comments
Labels
area/storage/devicemapper kind/enhancement Enhancements are not bugs or new features but can improve usability or performance.

Comments

@jeremyeder

Right now, when using the devicemapper graph driver, the dm.basesize daemon flag sets the size of the CoW volume for each container; the default is 10G.

If a user needs more than 10G in their CoW volume, they must completely tear down their storage backend (delete all containers and images), rm -rf /var/lib/docker, customize dm.basesize, start docker, and recreate all their images/containers.
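For reference, that tear-down looks roughly like this on a systemd host (a sketch; unit names, paths, and the daemon invocation vary by version and distro, and the 32G value is just an example):

# systemctl stop docker
# rm -rf /var/lib/docker
# docker daemon --storage-driver=devicemapper --storage-opt dm.basesize=32G

(On older versions the daemon is started with docker -d rather than docker daemon.)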

We need an option to make dm.basesize tunable on a per-container basis.
Keep the default of 10G, but allow an override at runtime:

e.g. this would create a new container with a 32G volume:

# docker run -it --basesize=32G rhel7 bash

I am explicitly NOT including resizing of existing/running containers in this issue, although that would be useful.

@unclejack
Contributor

This can't be --basesize; it has to be a storage driver option for the container, passed to docker run. Otherwise we'll end up with lots of top-level docker run options for each storage driver.

How would this interact with commit? Would that mean that the user would always get a container with that size?

@jeremyeder
Author

Perhaps a Dockerfile option.

@unclejack
Contributor

A Dockerfile-based approach could also be a problem in some circumstances. Building from a Dockerfile that needs a really large basesize raises several questions:

  • How do I enforce a limit on the maximum base size that can be set via the Dockerfile? This is a problem for build/CI servers.
  • How would the limits be handled for instances of an image that uses a larger base size? Unless the container is read-only, exhausting the available space becomes even easier.
  • How would space requirements be handled? Allowing a base size of 50 GB means the resulting image could use up a lot of disk space. Would a build fail if there isn't enough space (plus some extra headroom) left after the build?

This also seems to be somewhat related to quotas, so I think getting the design right would be a good idea.

Update: I'm not saying this isn't needed or that it shouldn't be implemented; it's just that we need to get the design right.

@rhvgoyal
Contributor

Quotas seem to work only on a per-user/group basis, and we don't have a notion of per-user/group containers.

The rest of the storage drivers seem to allow unlimited-size images/containers, so I am wondering: why not start with a bigger basesize for dm, say dm.basesize=100G?

This is not a solution to the problem, but a stopgap measure until a real solution is found and implemented.

@jeremyeder
Author

The difference is that with devicemapper, you can overcommit. With overlay, for example, you don't have a per-container limit, but you also can't overcommit.

I think the real problem is that dm.basesize is global, and can't be changed without tearing everything down.

@rhvgoyal
Contributor

Even if you get a --basesize parameter, that does not solve the overcommit problem; it seems orthogonal.

In both cases, space is allocated as containers/images request it. Giving a basesize of 100G makes no guarantees; it just says the max you can use is 100G. There is no guarantee of a minimum.

The rest of the graph drivers don't guarantee a minimum either, so I really can't see what additional problem overcommit introduces (compared to other storage drivers).

@jeremyeder
Author

Should we revisit the dm.resize and dm.thinpool options that Alex Larsson wrote?
https://github.com/rhatdan/docker/commits/thinpool

@rhvgoyal I just forwarded you an email with test results.

@rhvgoyal
Contributor

@unclejack

For the case of commit, we could probably create the new image with the same size as the container being committed. That would make sure commit does not fail.

So do we need something like --storage-driver dm.basesize=X for docker run? Is that the suggestion?

EDIT: I meant something like --storage-opt dm.basesize=X for docker run and docker create.
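For illustration, that invocation would look something like this (hypothetical syntax; nothing here is implemented yet):

# docker create --storage-opt dm.basesize=32G rhel7
# docker run -it --storage-opt dm.basesize=32G rhel7 bash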

@snitm
Contributor

snitm commented Jul 17, 2015

@jeremyeder if people use an lvm2-created thin-pool, then lvm2 may be used to resize the pool. But lvm2 cannot be used to resize the docker-created/managed thin device(s) associated with each container.

The best bet would be to overprovision the docker-created/managed thin device(s). Then, when the thin-pool is resized by lvm2, the docker thin device(s) will just grab more space from the thin-pool.
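For example, growing an lvm2-created thin-pool is a couple of lvextend calls (a sketch; assumes a volume group named vg and a pool LV named thinpool):

# lvextend -L +100G vg/thinpool
# lvextend --poolmetadatasize +1G vg/thinpool

The first call grows the pool's data device; the second grows its metadata device, which may also need headroom as the pool fills.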

@jeremyeder
Author

I think the default size of 10G was fairly arbitrary. @snitm IIUC you're agreeing with @rhvgoyal that we just bump up dm.basesize? Vivek suggested dm.basesize=100G.

This change would affect loopback and lvm2 thin-pools differently, because the upper bound of loopback is dm.loopdatasize, whereas the upper bound of an lvm2 thin-pool is the size of the partition/disk.

So, coupled with dm.basesize=100G, would we increase dm.loopdatasize as well?

@rhvgoyal
Contributor

@jeremyeder

dm.loopdatasize should not have any effect on dm.basesize. dm.basesize is a virtual size; it does not matter how much physical space there is in the underlying storage.

dm.loopdatasize specifies the max size of the loop device. It could even be 1G and we should still be able to create thin devices of size 100G; it's just that when you start writing to those devices, they cannot collectively consume more than 1G.

IOW, dm.basesize does not have a direct dependency on dm.loopdatasize.

@snitm
Contributor

snitm commented Jul 17, 2015

@jeremyeder loop-lvm support should be removed; not trying to be snarky, I genuinely believe that (you know this!). So I'm not going to put any energy into reasoning through loopback-specific insanity. Sorry.

Anyway, as for lvm2-created thin-pools: dm.basesize only controls the size of the docker-created thin devices, right? So I'm not following your comment about "whereas the upper bound of lvm2 thin-pool is the size of the partition/disk".

Could be I'm misremembering what dm.basesize governs, but I don't have the docker code at my fingertips at the moment.

@jeremyeder
Author

@rhvgoyal Right, I know there's no direct dependency; I was postulating that we keep a similar overcommit ratio, though. Maybe that's not necessary.

@snitm right; dm.basesize is for individual containers. The upper bound of the lvm2 thin-pool is controlled by the physical capacity of whatever dm.datadev is pointing to (DEVS=/dev/sdX in docker-storage-setup). i.e. my test system uses a 300GB disk.

Back to the original intent of this issue, 10GB is too small for some situations, thus we have the following issues:

  1. we can't adjust the dm.basesize on a per-container basis
  2. changing global dm.basesize requires tear-down of all docker images/containers
  3. we can't resize existing containers

A fix to any one of those would alleviate the situation.

What if we added a new option to docker-storage-setup for dm.basesize?

@snitm
Contributor

snitm commented Jul 17, 2015

@jeremyeder dm.datadev is really deprecated (from my point of view). If users are going to use "direct-lvm", the best way to do so is with an lvm2-created thin-pool and dm.thinpooldev. The entire point is to lean on lvm2's feature-rich thin-pool management rather than reinventing the same in docker.

I can appreciate that there may be docker users who deployed without understanding the knobs at their disposal, but honestly the right thing to do is:

  1. use dm.thinpooldev
  2. increase dm.basesize to overprovision
  3. use lvm2's monitoring and management to extend the thin-pool that was handed to docker in step 1

Rescuing legacy users isn't a good use of developer time. More documentation and training should be backfilled to avoid this in the future.
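Concretely, steps 1-3 look something like the following (a sketch; the VG/LV names and sizes are examples):

# lvcreate --type thin-pool -L 95G -n thinpool vg
# docker daemon --storage-driver=devicemapper \
      --storage-opt dm.thinpooldev=/dev/mapper/vg-thinpool \
      --storage-opt dm.basesize=100G

and, for step 3, enable lvm2's autoextend in lvm.conf (or a profile):

activation {
    thin_pool_autoextend_threshold = 80
    thin_pool_autoextend_percent = 20
}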

@rhvgoyal
Contributor

@jeremyeder

OK, I have generated a PR to increase the devicemapper default to 100G. I expect this might lead to an increased fs metadata footprint on container creation: with this change, initial data usage after starting docker was around 60MB, while with 10G it was around 20MB.

But when I pulled in an image and started a container, that difference remained more or less 40MB and did not increase, so I am assuming it might not be a very big deal.

We can experiment with this and if we run into issues, then revisit this issue.

@rhvgoyal
Contributor

@vbatts ping.

@jeremyeder
Author

@rhvgoyal what's the difference in mkfs.xfs time for a 10G volume vs. a 100G volume? I know XFS is extremely fast at mkfs, but just wondering, since we're measuring container startup time in the "hundreds of milliseconds" range right now.

@rhvgoyal
Contributor

@jeremyeder Here is the initial data with 100 container creation, 3 iterations.

10G (Time is in seconds)

Time taken to create containers is 51.707271494
Time taken to create containers is 52.670078980
Time taken to create containers is 54.176040303

100G (Time is in seconds)

Time taken to create containers is 51.592516651
Time taken to create containers is 52.896233596
Time taken to create containers is 52.387154341

10G (Data and Metadata space usage)

Data Space Used: 726.4 MB
Metadata Space Used: 3.281 MB

100G (Data and Metadata Space Usage)

Data Space Used: 766.5 MB
Metadata Space Used: 3.314 MB

Observations

  • Looks like nothing much has changed with basesize 100G. Container creation time is more or less the same.
  • There is around a 40MB penalty in terms of data space, and I think that penalty gets applied as soon as docker starts and does not scale with the number of containers.

@vbatts
Contributor

vbatts commented Jul 17, 2015

Besides the 40MB penalty, what is the drawback here? Reading the commentary, it doesn't look like there is one.

@rhvgoyal
Contributor

@vbatts

One drawback I can think of is that an unprivileged process can easily fill the pool:

dd if=/dev/zero of=zerofile

Previously the max was 10G, so the damage a container could do was limited.

But this is no different than any other graph driver.

@jeremyeder
Author

@rhvgoyal ok, thanks for running those tests.

@unclejack
Contributor

@rhvgoyal Yes, I was actually proposing that we add --storage-opt to docker run if we need to add anything like this.

The part about quotas was about per-container quotas, not per-user/group quotas. This would be similar to quotas, except that it applies to the entire container. It could integrate well with the devicemapper base size for the container.

We could take this even further and use it for builds. That would be a driver-agnostic way to specify this limit.

@rhvgoyal
Contributor

@unclejack Per-container quota is a good idea, but I am not aware of any functionality we could use to enforce it for filesystem-based graph drivers.

As of now, devicemapper seems to be the only graph driver that can enforce this. Anyway, I will keep it in mind. For now we have changed the dm.basesize default to 100G. If there is still a need for a per-container tunable, I will write a patch to implement --storage-opt for docker run.

@rhatdan
Contributor

rhatdan commented Jul 21, 2015

The people I have talked to who want quotas want them on more than just the CoW file systems; they also want them on volumes mounted into containers, like home directories. Relying on the quota db is tough if you have multiple containers sharing the same UIDs.

@unclejack
Contributor

@rhvgoyal @rhatdan Perhaps this could be a mount-time option for the file system, e.g. mount -o maxusage=8G .... This would work for the container's fs. A special driver could be used for volumes to get quotas there as well.

@rhatdan
Contributor

rhatdan commented Jul 22, 2015

Well, the problem is that there is no quota system for containers; quotas exist only per UID on file systems.
@rhvgoyal and I have talked about potentially using devicemapper/lvm to set up limited-size volumes that could be mounted into a container, giving admins the ability to control the size of a container's writable area. That would probably help out PaaS-type environments. I would think you would do this with the docker volume command; a rough sketch follows.
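A manual version of that idea with lvm thin volumes (a sketch; the names are hypothetical, and docker would need to manage these itself for it to be more than a workaround):

# lvcreate -V 5G --thin -n containervol vg/thinpool
# mkfs.xfs /dev/vg/containervol
# mount /dev/vg/containervol /mnt/containervol
# docker run -v /mnt/containervol:/data rhel7 bash

The container can then write at most 5G into /data, regardless of dm.basesize.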

@thaJeztah thaJeztah added area/storage/devicemapper kind/enhancement Enhancements are not bugs or new features but can improve usability or performance. labels Aug 9, 2015
@cubbasa

cubbasa commented Aug 17, 2015

Right now the devicemapper driver is very low level and can't easily enforce per-container quotas; quota is globally controlled by dm.basesize. So what if we had another kind of driver, an lvm driver, managed by lvm rather than by devicemapper directly? In this solution, container rootfs volumes would be full-fledged lvm thin-provisioned volumes.

This would allow an option like lvm.basesize globally, with a reasonable size, while letting users extend a rootfs according to their needs and providing a quota per container. You could do it manually, or automatically: first extend the lvm volume, and during container start a tool like cloud-init could automatically resize the rootfs if not prevented by capabilities.

Thanks to that, we wouldn't need any hacks like xfs project quotas or similar, and it could coexist with the devicemapper driver without any changes. Devicemapper could be used for containers that are mostly the same and repeatable, while lvm could provide options for unique containers that are hardly the same.

@rhvgoyal
Contributor

An lvm graph driver for docker might be worth trying. It would also remove a lot of complexity from the devicemapper code, like managing transactions and the thin-pool transaction id. We would not have to deal with libdm and could get rid of a lot of code there.

And as you said, this would allow growing a container outside of docker using the command line. We would probably just have to export the name of the logical volume for the container root via docker inspect.

Somebody will have to write it, and we will have to play with it and see how well it works.
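As a starting point, the devicemapper driver already exposes its backing device through inspect, so an lvm driver could do the same for the LV name (a sketch; the exact field names vary across docker versions):

# docker inspect -f '{{ .GraphDriver.Data.DeviceName }}' mycontainer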

@rhvgoyal
Contributor

The only drawback of an lvm driver seems to be that one would have to deal with the command-line interface, parse output, etc.

@runcom
Member

runcom commented Apr 1, 2016

fixed by #19123

@runcom runcom closed this as completed Apr 1, 2016