Overlay multiple lower directory support #22126

Merged
merged 6 commits into from Jun 13, 2016

Projects

None yet

10 participants

@dmcgowan
Member
dmcgowan commented Apr 18, 2016 edited

Summary

The overlay driver is quickly becoming more popular as users are looking for alternatives to devicemapper. Even though in many cases overlay is the best graphdriver option, it is still lacking optimization and has not been updated to take advantage of the 4.0 kernel feature to mount multiple lower directories. In addition, its use of the naive diff driver for generating diffs has led to at least one serious storage bug (#21555). Users often complain about inode exhaustion using overlay which is caused by the way upper directories are combined for each layer. For users not yet updated to 4.0 there is not much improvement we can make, however for 4.0 users, we can.

Design (updated May 17th)

Use a "diff" directory as the upper directory for a layer and have its contents be pure for the layer.
The previous "upper" directory was a combination of all layers on top of a single root. Having a pure
"diff" directory allows directly archiving a single directory to get an exportable diff. This also
allows apply to be done on an empty directory and changes to be calculated against a single directory.
The naive driver is no longer used and replaced by archiving the "diff" directory with rewrites of
whiteouts to use the aufs style expected by Docker diff tars.

Each layer has a unique identifier generated for it and a symbolic link in a link directory to the
"diff" directory for the layer. This shortened identifier allows referencing more layers as lower
directories for the overlay mount without hitting the page size limitation on mount arguments. This
unique identifier is put in a "link" file.

A "lower" file is created which stores a ':' separated list of lower directories to be used on mount.
Each item in the lower file is a reference to the symbolic link for each diff directory. The "lower"
file should always be less than or equal to the page size limit - 512 bytes to avoid getting an
error on mount. This is enforce inside the graph driver as a limit of 128 lower layers.

Performance Results

Testing shows a significant reduction in the number of inodes used, especially
for images with many layers. This is likely to solve many user issues around
running out of inodes while using overlay.

Full test log which generated inode/size information can be found here https://gist.github.com/dmcgowan/46d96ba48a38ace4ce03572efbf25697

Inode usage

overlay2 overlay %
dockercore/docker 76908 460576 17%
wordpress 27784 145609 19%
golang 24795 74487 33%
ubuntu 8154 17534 47%

Size (1K blocks)

overlay2 overlay %
dockercore/docker 2536356 3212196 79%
wordpress 463460 682796 68%
golang 787848 878640 90%
ubuntu 135700 147804 92%

Benchmark Results (between addition of commit with overlay change)

benchmark                         old ns/op     new ns/op     delta
BenchmarkExists-8                 1749          1549          -11.44%
BenchmarkGetEmpty-8               4224          6538          +54.78%
BenchmarkDiffBase-8               108961        101431        -6.91%
BenchmarkDiffSmallUpper-8         13801616      207843        -98.49%
BenchmarkDiff10KFileUpper-8       239367502     174130292     -27.25%
BenchmarkDiff10KFilesBottom-8     240473098     213452        -99.91%
BenchmarkDiffApply100-8           23074450      5931350       -74.29%
BenchmarkDiff20Layers-8           14447874      99199         -99.31%
BenchmarkRead20Layers-8           6555          6425          -1.98%

benchmark                         old allocs     new allocs     delta
BenchmarkExists-8                 4              4              +0.00%
BenchmarkGetEmpty-8               8              10             +25.00%
BenchmarkDiffBase-8               307            304            -0.98%
BenchmarkDiffSmallUpper-8         787            744            -5.46%
BenchmarkDiff10KFileUpper-8       644928         641897         -0.47%
BenchmarkDiff10KFilesBottom-8     644918         750            -99.88%
BenchmarkDiffApply100-8           9860           10330          +4.77%
BenchmarkDiff20Layers-8           4532           304            -93.29%
BenchmarkRead20Layers-8           7              7              +0.00%

benchmark                         old bytes     new bytes     delta
BenchmarkExists-8                 544           544           +0.00%
BenchmarkGetEmpty-8               1088          1152          +5.88%
BenchmarkDiffBase-8               24534         23976         -2.27%
BenchmarkDiffSmallUpper-8         44574         38142         -14.43%
BenchmarkDiff10KFileUpper-8       22451091      22057174      -1.75%
BenchmarkDiff10KFilesBottom-8     22433038      38901         -99.83%
BenchmarkDiffApply100-8           561485        631334        +12.44%
BenchmarkDiff20Layers-8           304053        24018         -92.10%
BenchmarkRead20Layers-8           1220          1212          -0.66%

The biggest change involves BenchmarkDiff10KFilesBottom which is intended
to represent steps during a docker build which add few files but
are built on bases with a large number of files (i.e. FROM ubuntu).

@dmcgowan dmcgowan changed the title from Overlay native diff to Overlay multiple lower directory support Apr 18, 2016
@tonistiigi tonistiigi and 1 other commented on an outdated diff Apr 18, 2016
daemon/graphdriver/graphtest/graphtest_unix.go
+<<<<<<< 894c1f08cd933023e9277eb61be28f6dd42db276
@dmcgowan
dmcgowan Apr 18, 2016 Member

Missed this conflict, looks like my after my rebase need to add CreateReadWrite as well

@dmcgowan
Member

One additional tweak to the design I am making now is to always have a lower-id and exclude the root from the lower. This will ensure that the combination of the diff and lowers is equivalent to the prior upper directory. It should allow cleaning up some logic and ease some of the cleanup when there is a squash failure.

@unclejack
Contributor

@dmcgowan Does this fix some other issues we were encountering when using Overlay?

@dmcgowan
Member
dmcgowan commented Apr 22, 2016 edited

@unclejack the 2 big issues I know this addresses is inode exhaustion and the missing updated file during diff #21555. There are other possible solutions to 21555 but inode exhaustion is directly related to the way each layer is flattened in the current version. My hope is this will also speed up builds on overlay as the amount of processing to produce a diff has greatly been reduced. What are the issues you had in mind?

@dmcgowan
Member
dmcgowan commented Apr 23, 2016 edited

Rebased to account for #22168. Might be worth discussing how this will impact this PR. The fsdiff is still used in some cases but not always.

@cpuguy83
Contributor

Pretty awesome!
Verified this works and seems to work well. I have not done any performance comparisons or stress testing, but functionally it seems pretty solid.

The only problem with the current approach is that it's not backwards compatible.
So a user using overlay can upgrade Docker get this changed out under the hood without noticing any changes, but then if they need to downgrade anything new will not work under the old driver.

It seems like this either needs to be a new driver (overlay2?) so there are no surprises here, or the newer version needs to setup new layers in a backwards compatible way.

@runcom
Member
runcom commented May 10, 2016 edited

awesome, running with patch w/o any issues so far

overlay2 sounds good to me to maintain back compatibility

@dmcgowan
Member

@cpuguy83 the current approach uses the new method by default with the option to turn it off using an option "nomultilower". It would be simple to make it an opt-in option rather than opt-out option. My concerns with making it a separate graph driver is related to the amount of code that is shared, it would requiring making almost all the internal functions of the currently overlay driver exported. Also having another driver is likely to make it more obscure and less used. If it is using the same directory and both methods are always supported, then in the future turning it on would allow backward compatibility with the next release and everything after.

@tonistiigi
Contributor

@dmcgowan How much code duplication would there be if the second driver would not keep the current hardlink based implementation at all and would only use upper overlay dirs, like aufs driver does(ignoring the problem with reaching the mount options length limit).

@dmcgowan
Member

@tonistiigi: just some of the initialization checks. That is fair to point out, the duplication is because the current approach can be used on top of an existing overlay directory and uses the current hard link method to "squash" layers past the options limit into upper directories as used by the current driver.

I found that squashing is required when an image gets about 35 layers deep when using the default "/var/lib/docker" location.

@unclejack
Contributor

@dmcgowan I'm sorry for the late reply. I was just hoping this change would get us some fixes for some of the other issues as well.

@dmcgowan
Member

@unclejack not late at all. Did you have any specific issues in mind? I am going back through the issue list to see which are reproducible and not identified as Kernel bugs.

These are the issues I know are addressed, would love to add to the list

  • #10613 - Inode exhaustion, fixed by using less files
  • #17551 - Slow delete, fixed by having less files to remove
  • #21555 - Unstable images, fixed by using a diff directory
@unclejack
Contributor

@dmcgowan #9572 #19647 #12080

These are not necessarily issues anyone would expect this PR to fix, just something I'd hoped this change would fix through multiple lower directories.

@thaJeztah
Member

Can someone make a list of pros/cons of having a separate driver vs a hybrid driver, so that we can move this forward?

@dmcgowan
Member

@thaJeztah added a section on the bottom. Please point out any bias or suggested updates.

@crosbymichael
Member

@dmcgowan does overlay work with MS_REMOUNT for specifying more layers after it is already mounted or do we have no choice but to use hardlinks to squash?

@dmcgowan
Member

@crosbymichael tried doing remount and did not work for me. Although my suspicion is that even if it did, it would not solve the problem since all the lowers would still need to be given as mount options.

@crosbymichael
Member

Would something like this work to get away from all of the hard linking or would the multiple mounts be worse?

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "syscall"

    "github.com/Sirupsen/logrus"
)

func main() {
    if err := run(); err != nil {
        logrus.Fatal(err)
    }
}

func run() error {
    cwd, err := os.Getwd()
    if err != nil {
        return err
    }
    for _, name := range []string{
        "lower1",
        "lower2",
        "merged",
        "upper",
        "work",
    } {
        if err := os.MkdirAll(filepath.Join(cwd, name), 0755); err != nil {
            return err
        }
    }
    // mount the RO layers first so we don't overload the mount syscall with too much data
    if err := syscall.Mount(
        "overlay",
        filepath.Join(cwd, "merged"),
        "overlay",
        0,
        fmt.Sprintf(
            "lowerdir=%s:%s,workdir=%s",
            filepath.Join(cwd, "lower1"),
            filepath.Join(cwd, "lower2"),
            filepath.Join(cwd, "work"))); err != nil {
        return err
    }
    // now mount the rw layer with the already mounted lower's merge dir as the lower dir here
    if err := syscall.Mount(
        "overlay",
        filepath.Join(cwd, "merged"),
        "overlay",
        0,
        fmt.Sprintf(
            "lowerdir=%s,workdir=%s,upperdir=%s",
            filepath.Join(cwd, "merged"),
            filepath.Join(cwd, "work"),
            filepath.Join(cwd, "upper"))); err != nil {
        return err
    }
    return nil
}
@tonistiigi
Contributor

@crosbymichael @vikstrous tried something like that in #18560 (comment) but hit some blockers.

There are other ways to get over the options limit. For example symlinks and relative paths.

@crosbymichael
Member

boooo

@crosbymichael
Member

I think my preference would be to make this change a new driver. It would be nice to have the new "pure" implementation and not have it messed up by the old one. Then as time passes we can eventually fade out the old implementation.

I'm fine whipping out my var/lib/docker for the new implementation as its much better and cleaner. This is just my initial thought and I don't feel too strongly about same or different drivers, whatever is cleaner and easier to maintain is better for the long run.

@dmcgowan
Member

@crosbymichael @tonistiigi: going to make a proof of concept using the symlink and relative path method. I calculated that we should be able to hit a max depth of 128, which should allow making the cleaner separate driver.

The approach I plan on taking is generating a random 26 character identifier for each layer and creating a symlink named with that identifier in a link directory pointing to the diff directory. On mount the process with fork, change directories and use the relative path to the links for the lower dirs.

@dmcgowan
Member
dmcgowan commented May 13, 2016 edited

Added the separate driver with no squashing and support for 128 layer depth. I will remove the other implementation from this PR once I have completed testing and benchmarking. I will update the PR description accordingly as well. The initial version is calling out to "/usr/bin/mount" but updating to use reexec instead. This is also still using the <id>/diff ordering instead of the diff/<id> order aufs uses. I don't have much of an opinion on this but don't mind updating if the consensus is to use the aufs method.

@crosbymichael crosbymichael and 1 other commented on an outdated diff May 13, 2016
daemon/graphdriver/overlay2/overlay.go
+ d.ctr.Decrement(id)
+ return mergedDir, nil
+ }
+
+ opts := fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s", string(lowers), path.Join(id, "diff"), path.Join(id, "work"))
+ mountLabel = label.FormatMountLabel(opts, mountLabel)
+ if len(mountLabel) > syscall.Getpagesize() {
+ d.ctr.Decrement(id)
+ return "", fmt.Errorf("cannot mount layer, mount label too large %d", len(mountLabel))
+ }
+
+ // TODO: Call mount with a reexec instead of using mount binary
+ cmd := exec.Cmd{
+ Path: "/usr/bin/mount",
+ Args: []string{"mount", "-t", "overlay", "overlay", "-o", mountLabel, path.Join(id, "merged")},
+ Dir: d.home,
@crosbymichael
crosbymichael May 13, 2016 Member

You are calling this because you are using the cwd so that you don't have to prepend all the paths with the root?

@dmcgowan
dmcgowan May 13, 2016 Member

Correct. If the root is prepended the page limit can easily be hit. Made more difficult by the fact that the root path could be any length.

@dmcgowan
Member

Updated PR description for the separate driver. Also re-ran the usage tests and included the wordpress image. It is an interesting case since it has many layers but is not a very large image.

@dmcgowan
Member

Updated to use directory based reference counter. Also added a unit test for changes and fixed a bug which it exposed.

@thaJeztah
Member

We discussed this PR in the maintainers meeting, and having this implementation as a separate driver (overlay2) was agreed on, so I moved this to code review.

@docker/core-maintainers please review and test this PR, it would be really great to see this in the upcoming release

@thaJeztah
Member

ping @dmcgowan looks like this needs a rebase 😢

ping @docker/core-engine-maintainers PTAL!

@cpuguy83
Contributor
cpuguy83 commented Jun 7, 2016

LGTM

@thaJeztah thaJeztah added this to the 1.12.0 milestone Jun 7, 2016
@cpuguy83
Contributor
cpuguy83 commented Jun 8, 2016

This now needs a rebase.

vikstrous and others added some commits Dec 14, 2015
@vikstrous @dmcgowan vikstrous aufs compatibility
Signed-off-by: Viktor Stanchev <me@viktorstanchev.com>
b03d323
@dmcgowan dmcgowan Update archive package to support overlay whiteouts
Signed-off-by: Derek McGowan <derek@mcgstyle.net> (github: dmcgowan)
8222c86
@dmcgowan dmcgowan Expand graphtest package
Signed-off-by: Derek McGowan <derek@mcgstyle.net> (github: dmcgowan)
8b0441d
@dmcgowan dmcgowan Add more overlay tests and benchmarks
Signed-off-by: Derek McGowan <derek@mcgstyle.net> (github: dmcgowan)
246e993
@dmcgowan dmcgowan Add separate overlay2 driver
Adds a new overlay driver which uses multiple lower directories to create the union fs.
Additionally it uses symlinks and relative mount paths to allow a depth of 128 and stay within the mount page size limit.
Diffs and done directly over a single directory allowing diffs to be done efficiently and without the need fo the naive diff driver.

Signed-off-by: Derek McGowan <derek@mcgstyle.net> (github: dmcgowan)
23e5c94
@thaJeztah
Member

ping @crosbymichael PTAL!

@vdemeester
Member

I'm trying this on my laptop and another computer and so far it looks good 👼

@vdemeester
Member

LGTM 🐯

@runcom
Member
runcom commented Jun 12, 2016

LGTM

@vdemeester
Member

switching to docs-review then 👼

@thaJeztah
Member

For documentation, we need to have a mention in the dockerd options;

Ideally, we should have a section on this storage driver in our "storage driver" user guide, explaining what the differences / advantages are;

But I can see that needs some work, so for now, perhaps just mention it as an option, and some minimum description

@thaJeztah
Member

Just chatted with @dmcgowan, he's doing some docs now, but we may need to do some follow up after to tweak. We'd like to have this merged before code freeze 😄

@vdemeester
Member

@thaJeztah sounds good to me 👍

@thaJeztah
Member

LGTM, thanks @dmcgowan

ping @vdemeester for review/merge

@dmcgowan dmcgowan Add documentation for using overlay2
Add mention in dockerd command line and storage driver selection documentation.

Signed-off-by: Derek McGowan <derek@mcgstyle.net> (github: dmcgowan)
a546042
@vdemeester
Member

Oh boy 🎉
LGTM 🐹

@crosbymichael
Member

LGTM

@crosbymichael crosbymichael merged commit 8a2f9a2 into docker:master Jun 13, 2016

7 of 8 checks passed

experimental Jenkins build Docker-PRs-experimental 19879 is running
Details
docker/dco-signed All commits signed
Details
documentation success
Details
gccgo Jenkins build Docker-PRs-gccgo 6739 has succeeded
Details
janky Jenkins build Docker-PRs 28667 has succeeded
Details
userns Jenkins build Docker-PRs-userns 10829 has succeeded
Details
win2lin Jenkins build Docker-PRs-Win2Lin 27248 has succeeded
Details
windowsTP5 Jenkins build Docker-PRs-WoW-TP5 3157 has succeeded
Details
@AkihiroSuda AkihiroSuda added a commit to AkihiroSuda/issues-docker that referenced this pull request Jun 14, 2016
@AkihiroSuda AkihiroSuda overlay2 can avoid excessive inode consumption (docker/docker#22126) 7a1700f
@dmcgowan dmcgowan deleted the dmcgowan:overlay-native-diff branch Jun 15, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment