
Switch from btrfs to some other filesystem to resolve stability and portability issues #1045

Closed
vito opened this Issue May 6, 2017 · 27 comments

@vito
Member

vito commented May 6, 2017

Feature Request

What challenge are you facing?

We've been using btrfs as our volume driver by default on Linux for a long time now, and while we get a lot of good things from it (primarily nestability), we've encountered stability and portability problems in the wild.

What is a volume driver/graph driver?

Buckle up, as there's a lot of history and quirks here. There's a reason Docker supports like 10 different drivers. :)

You may recognize the words aufs, overlay/overlayfs, btrfs - these are all filesystems you can use to have a copy-on-write replica of an original volume/directory. This is how 10 containers are able to all use a 1GB image as their rootfs without using 10GB of disk space. Docker has a very similar concept, called a "graph driver", which is how it does all its image layering shenanigans to tie 20 different layers together to form 1 rootfs. Concourse's volume driver interface is a bit simpler, as BaggageClaim just supports creating volumes and copy-on-writes of other volumes. The full interface is in driver.go.

How do they compare?

aufs and overlay (formerly overlayfs) are both union filesystems. Filesystems like these are "pseudo" filesystems in that they don't directly interact with a device to provide filesystem semantics on their own (like ext4 and btrfs and other "real" filesystems you'd use for a physical machine). Instead they tie together upper and lower directories on an existing filesystem to form one mount point.

The terms "upper directory" and "lower directory" refer to the directory that writes go to, and the (possibly read-only) directory that the writes are overlayed on to, respectively. For example, a container's rootfs will start with an empty upper directory, with the original rootfs image as its lower directory. If the container writes to /etc/hosts or /etc/some-file, those writes will go into the upper directory, leaving the lower directory unchanged, allowing it to be shared across many containers at once without polluting each other.

aufs is kind of a black sheep: it was never part of the kernel, and has only ever been available by installing an -extra package. Despite this, it was chosen by Docker early on as its filesystem of choice, probably because it supported creating filesystems from multiple lower directories. overlayfs was the primary competitor at the time, but it was pretty new, only came with some versions of Ubuntu, and only supported one upper directory and one lower directory.

These days, overlayfs has been renamed to overlay and ships with the kernel as of version 3.18, requiring no additional setup. Kernel 4.0 introduced support for multiple lower directories, making it a worthy replacement for aufs on paper.

The critical flaw with aufs and overlay for our use case is that they do not nest. Within a container with an aufs filesystem, you cannot then create more aufs filesystems with an aufs directory as lower directory. The same is true for overlay.

btrfs, on the other hand, is a real filesystem that deals directly with a block device. You can use it for your entire machine, just like ext4. Instead of gluing together "upper" and "lower" directories, it supports copy-on-write semantics via snapshotting a volume to create a separate volume. The key feature btrfs brings to the table is nestability. Volumes created from a snapshot can themselves be snapshotted to create another volume. So if you were to run docker within a container with a btrfs filesystem, it would just use its btrfs driver and create subvolumes within the container. An important side-tangent here is that containers are just processes, it's the management of their rootfs that's complicated with regard to nesting. Using btrfs solves this problem elegantly.

Nestability is important because:

  • The docker-image resource can just use the btrfs driver, so we don't need a loopback for every image, which is great, as loopbacks are a global system resource that can leak if we're not careful.
  • Tasks can use docker compose, again by just using the btrfs driver.

What's wrong with btrfs?

Stability and portability. While btrfs has been available in the kernel for a long time, it's never been rock-solid.

It's also occasionally stripped out of some systems (e.g. Docker for Mac), which is frustrating.

There's also the initial dance we need to do in order to get a btrfs device available. In most deployments this involves creating a loopback device for btrfs, as it's extremely uncommon for it to be the primary filesystem of the disk (probably due to the aforementioned stability issues).
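For reference, the "dance" and the snapshot-based copy-on-write look roughly like this (the paths and image size here are illustrative, not what any deployment actually uses; requires root and btrfs-progs):

root@think:~# truncate -s 10G /btrfs.img
root@think:~# mkfs.btrfs /btrfs.img
root@think:~# mkdir /volumes
root@think:~# mount -o loop /btrfs.img /volumes
root@think:~# btrfs subvolume create /volumes/base
root@think:~# btrfs subvolume snapshot /volumes/base /volumes/copy
root@think:~# btrfs subvolume snapshot /volumes/copy /volumes/copy-of-copy

That last snapshot-of-a-snapshot is exactly the nesting that union filesystems can't provide.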

Here are some of the specific issues and concerns we've run into:

  • most commonly and critically, #767 and #1035 (btrfs disk hangs, requiring worker recreate)
  • failure to collect volume stats via qgroup
  • possible disk usage creep due to sparse file; unconfirmed, but a lot of people are surprised by disk usage
  • #561 (sparse file sizing confusion)
  • #596 (slow in Docker for Mac as it has no btrfs support)

A Modest Proposal

Let's revisit our choice and see if we can accomplish everything we need with another driver. Because aufs does not come with the kernel, this will probably be overlay. There's also a new kid on the block in lcfs which may be worth investigating if it's something we can carry around and not require a kernel module to be installed.

We may be able to achieve nestability by ensuring there's a non-overlay scratch space available to resources and tasks, so that they can use their own layering driver. Initial testing suggests we'd need this kind of thing, as overlay does not nest:

root@think:~# mkdir overlay-test
root@think:~# cd overlay-test/
root@think:~/overlay-test# mkdir upper lower work merged
root@think:~/overlay-test# touch lower/a
root@think:~/overlay-test# mount -t overlay overlay -o lowerdir=./lower,upperdir=./upper,workdir=./work ./merged
root@think:~/overlay-test# ls merged/
a
root@think:~/overlay-test# touch merged/b
root@think:~/overlay-test# find .
.
./upper
./upper/b
./lower
./lower/a
./work
./work/work
./merged
./merged/a
./merged/b
root@think:~/overlay-test# cd merged/
root@think:~/overlay-test/merged# mkdir upper2 lower2 work2 merged2
root@think:~/overlay-test/merged# touch lower2/nested-existing
root@think:~/overlay-test/merged# mount -t overlay overlay -o lowerdir=./lower2,upperdir=./upper2,workdir=./work2 ./merged2
mount: wrong fs type, bad option, bad superblock on overlay,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.
root@think:~/overlay-test/merged# dmesg | tail -1
[57110.791959] overlayfs: filesystem on './upper2' not supported as upperdir

So one approach for this could be to create an empty volume and mount it somewhere like /tmp in the container. That would then propagate the filesystem from the host (probably ext4 or something) into the container. This would probably fix the Docker-in-Concourse case. The next case to handle would be Concourse-in-Docker. The concourse/concourse image is likely to be run by a Docker daemon using overlay or aufs as its driver. For this we can just have a VOLUME pragma for the work-dir, which should mount in the filesystem from the host.
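A sketch of what that VOLUME pragma might look like (the base image and work-dir path here are assumptions for illustration, not the actual concourse/concourse Dockerfile):

```Dockerfile
FROM ubuntu:16.04
# ... concourse binary installation elided ...

# Declaring the work dir as a volume makes the host's Docker daemon mount a
# host-filesystem (e.g. ext4) directory there instead of the container's
# overlay/aufs layer, so baggageclaim can create its own mounts inside it.
VOLUME /concourse-work-dir
```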

@concourse-bot


concourse-bot commented May 6, 2017

Hi there!

We use Pivotal Tracker to provide visibility into what our team is working on. A story for this issue has been automatically created.

The current status is as follows:

  • #144984065 Switch from btrfs some other filesystem to resolve stability and portability issues

This comment, as well as the labels on the issue, will be automatically updated as the status in Tracker changes.

@vito vito removed the unscheduled label May 8, 2017

@vito vito modified the milestone: v2.10.0 May 9, 2017

@clarafu clarafu added this to the v2.10.0 milestone May 9, 2017

@vito vito modified the milestones: Staging, v2.10.0 May 9, 2017

@andrewedstrom


Contributor

andrewedstrom commented May 11, 2017

One additional problem with btrfs is that occasionally workers will report No space left on device errors when there is actually still a lot of disk space left. This is a known btrfs bug. More info: https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_Btrfs_claims_I.27m_out_of_space.2C_but_it_looks_like_I_should_have_lots_left.21

@vito


Member

vito commented May 11, 2017

Prioritizing this higher since things seem to be coming to a head with btrfs lately.

Let's first investigate overlay as it comes with the kernel out-of-the-box. There's actually a branch on baggageclaim already with an overlay driver: https://github.com/concourse/baggageclaim/tree/overlay-driver

Let's also investigate lcfs as an alternative experimental approach. It would be great if all that can run in userland and we can literally package Concourse with it, and then we have a known-good (or at least known) version. Not sure if that's possible, but worth investigating.

Side-note: after finding a working alternative, it may be worthwhile to observe performance differences between them in various deployment scenarios (e.g. binary directly on VM, BOSH-deployed, Concourse-in-Docker).

@vito vito modified the milestones: v2.9.0, Staging, v3.1.0 May 11, 2017

@vito vito added in progress and removed in progress labels May 15, 2017

@vito


Member

vito commented May 16, 2017

Initial investigation into overlay has yielded no real surprises.

First I merged in the overlay-driver branch, then changed the baggageclaim_ctl in the BOSH release to use the overlay driver instead of btrfs. I then ran a build with a simple get and things worked. Next I ran a hello-world build which uses image_resource. The build failed with "invalid argument" trying to fetch the image, as expected, as /var/lib/docker in the container is mounted overlay, which Docker can't do anything with.

So I patched the ATC to create an empty volume and bind-mount it into the container at /var/lib/docker. Then it worked! I moved on to TestFlight, which started failing on volume destroys, which I fixed with concourse/baggageclaim@06884dd. TestFlight is now passing.

Continuing on this thread would probably mean setting up a canonical "scratch space" which would be made available to resource containers. And then decide if it should be available to tasks as well, to support the "Docker Compose" use case. This scratch space can't be /tmp, as it turns out that causes the container create to fail, as Guardian places its init there during container start. We'd probably want to invent something like /opt/resource/scratch, but then we'd need to figure out what to do for tasks.

Haven't looked into LCFS yet. That's probably the next step.

@vito


Member

vito commented May 16, 2017

Looked into LCFS for a bit. I haven't made a POC as it looks like it'd be a bit of an investment, and I already have an initial set of concerns:

  • It's a bit early on in the project for my liking.
  • It requires a block device or a sparse file, just like btrfs. It at least doesn't require a loopback device for the sparse-file case. They do mention that using a file has a negative performance impact, but I do not see any benchmarks so I don't know how bad it is.
  • I'm not sure if it nests, or if aufs can run on top of it. Hard to tell without a POC. It can at least operate directly with a sparse-file, so we wouldn't need a loopback.
  • I'm a bit confused by its CLI. lcfs daemon requires you to pass two mount points, one which is called "host-mnt" and one which is "plugin-mnt", which seems like some Docker graph driver plugin concerns leaking through the abstraction. The docs just describe them as two mount points. There's no perceivable difference in /proc/mounts. Wat?
  • It requires FUSE 3.0, but that may be fine. I think that's all user-land, and can be packaged alongside lcfs or just built in to it (not sure if the dependency is compile-time or runtime).
@vito


Member

vito commented May 16, 2017

Starting with some performance testing between btrfs and overlay. Configured two workers: one with overlay, one with btrfs, and a pipeline that runs two jobs against both: one job that builds the atc-ci image (i.e. this job), and another that builds the git resource image and then runs its integration tests (i.e. this job).

These are very high-level tests, compared to the usual tests run against a graph driver, but they're at least realistic.

I'll let the pipeline run periodically overnight. The dashboard showing the results can be seen here:

https://metrics.concourse.ci/dashboard/db/driver-testing?panelId=1&fullscreen&orgId=1&refresh=1m&from=now-1h&to=now

@vito


Member

vito commented May 17, 2017

In running tests overnight, I initially saw that btrfs was constantly slightly slower:

https://snapshot.raintank.io/dashboard/snapshot/8ZgvXLZoxswdMElcR3WXWmSBji2J7Ow8

Seeing that this probably won't change overnight, I then upped the ante and had the atc-ci job run three puts in parallel, simulating more load.

This showed a more pronounced performance difference:

https://snapshot.raintank.io/dashboard/snapshot/iCjVkUy3Bmg5gSse1aQk08JrcohkOh1s

I then opted to test another kind of job, @osis's Strabo, which demonstrates high write use during the initial get, and high COW volume use as it has eight put steps preceded by ~46 get steps, plus a couple task outputs. This results in 384 total COW volumes being created.

Initial findings were kind of interesting. btrfs took the 384 volume creates like a champ, and successfully ran the build.

overlay, however, cannot run the build successfully. Once it gets to the task, it errors trying to namespace the task's image, with Post /volumes: net/http: timeout awaiting response headers. I opened #1171 for this initially, but then I noticed it happens consistently. Looking into the logs reveals that namespacing a container's image takes about 1 second for btrfs, but 1 minute and 10 seconds for overlay:

May 17 07:25:16 btrfs-worker-0 baggageclaim:  {"timestamp":"1495031116.095516443","source":"baggageclaim","message":"baggageclaim.repository.create-volume.namespace.start","log_level":0,"data":{"handle":"9c1d7f2c-8354-4a14-73c2-9a73a9927744","path":"/var/vcap/data/baggageclaim/volumes/init/9c1d7f2c-8354-4a14-73c2-9a73a9927744/volume","session":"3.396046.1"}}
May 17 07:25:17 btrfs-worker-0 baggageclaim:  {"timestamp":"1495031116.940141678","source":"baggageclaim","message":"baggageclaim.repository.create-volume.namespace.done","log_level":0,"data":{"handle":"9c1d7f2c-8354-4a14-73c2-9a73a9927744","path":"/var/vcap/data/baggageclaim/volumes/init/9c1d7f2c-8354-4a14-73c2-9a73a9927744/volume","session":"3.396046.1"}}
May 17 07:30:21 overlay-worker-0 baggageclaim:  {"timestamp":"1495031421.247597456","source":"baggageclaim","message":"baggageclaim.repository.create-volume.namespace.start","log_level":0,"data":{"handle":"ae199c4e-b55f-4e59-4a0c-7949ca1c0910","path":"/var/vcap/data/baggageclaim/volumes/init/ae199c4e-b55f-4e59-4a0c-7949ca1c0910/volume","session":"2.408866.1"}}
May 17 07:31:31 overlay-worker-0 baggageclaim:  {"timestamp":"1495031491.111908197","source":"baggageclaim","message":"baggageclaim.repository.create-volume.namespace.done","log_level":0,"data":{"handle":"ae199c4e-b55f-4e59-4a0c-7949ca1c0910","path":"/var/vcap/data/baggageclaim/volumes/init/ae199c4e-b55f-4e59-4a0c-7949ca1c0910/volume","session":"2.408866.1"}}

This demonstrates the strengths and weaknesses of a "real" filesystem like btrfs compared to a union filesystem like overlay. Namespacing a volume entails recursing through it and chowning things owned by root. This takes much longer with overlay, possibly because each chown requires more I/O to hoist the file to the upper layer and then change the permissions, or some other file attribute tracking/bookkeeping that overlay has to maintain.
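Roughly, the namespacing step amounts to something like the following dry-run sketch (the function name and offset are illustrative; the real logic lives in baggageclaim and remaps only root-owned files into the user namespace's uid/gid range):

```shell
#!/bin/sh
# Walk a volume tree and print the chown each file would need under a
# shifted uid/gid mapping (dry run -- echoes instead of chowning).
namespace_volume() {
  dir=$1
  offset=$2
  find "$dir" -print | while IFS= read -r f; do
    uid=$(stat -c '%u' "$f")
    gid=$(stat -c '%g' "$f")
    # on overlay, each real chown first copies the file up to the upper
    # layer -- that copy-up is the extra I/O that btrfs doesn't pay
    echo "chown $((uid + offset)):$((gid + offset)) $f"
  done
}

# demo on a throwaway tree
mkdir -p /tmp/ns-demo/etc
touch /tmp/ns-demo/etc/hosts
namespace_volume /tmp/ns-demo 100000
```

Every file visited costs a stat and a chown, so the cost scales with the number of files in the image, multiplied by overlay's copy-up overhead.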

I'll look into how this behaves on Docker, as they'll have run into the same issues once they added user namespacing support.

This is also a problem set that may be addressed by something like shiftfs in the future, but we need to solve today's problems, not next year's. :)

@vito


Member

vito commented May 17, 2017

We could mitigate the slow namespacing by keeping privileged and unprivileged versions of these resource caches. Then it would only affect the first fetch. The downside would be doubled disk use.

Looking into how Docker avoids this issue: they have the advantage of not having data at rest that they then need to namespace. They fetch images from the registry, knowing whether their container has to be namespaced or not, and will just remap the UID/GIDs as they extract. So there shouldn't be any performance difference. We however have to deal with scenarios like a get of a Docker image (or some other resource, even) being subsequently used by one privileged and one unprivileged task within the same build.

vito added a commit that referenced this issue May 25, 2017

bump bin
Submodule src/github.com/concourse/bin caefdb94..57b22ec7:
  > auto-detect driver, and respect flag if provided
  > update .envrc now that bin lives in concourse repo

#1045

vito added a commit to concourse/bin that referenced this issue May 25, 2017

@clarafu


Contributor

clarafu commented May 25, 2017

The binary has been updated to only choose overlay on kernel >= 4.0.

The concourse worker command will not require a recreate for this upgrade; it'll just stick with btrfs, as there will already be an existing btrfs mount point. (Note: we should validate this during acceptance.)

The BOSH release however will require a recreate of the workers. This is because the same autodetect logic is not implemented in the release. Maybe we should push all this down into BaggageClaim?
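The autodetect is essentially a kernel version comparison; a hypothetical sketch of the decision (the function name is made up, and the real check in baggageclaim also keeps using an existing btrfs mount point if one is found):

```shell
#!/bin/sh
# Pick a volume driver from a kernel release string: prefer overlay on
# kernels >= 4.0 (multiple lowerdir support), otherwise fall back to btrfs.
detect_driver() {
  version=$1  # a kernel release string, e.g. "$(uname -r)"
  major=${version%%.*}
  rest=${version#*.}
  minor=${rest%%.*}
  if [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 0 ]; }; then
    echo overlay
  else
    echo btrfs
  fi
}

detect_driver "$(uname -r)"
```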

@clarafu


Contributor

clarafu commented May 25, 2017

We should make sure the Docker repository is able to use overlay. This may be as simple as adding a VOLUME pragma to the Dockerfile for the work dir.

clarafu added a commit that referenced this issue May 25, 2017

bump baggageclaim bin
this pushes the filesystem setup down in to baggageclaim, and should
make the upgrade from btrfs-default to overlay-default a smooth
transition, as workers will continue to use btrfs until they're
recreated.

Submodule src/github.com/concourse/baggageclaim c12e0c4..178b8c0:
  > complete the move of driver detection down
  > auto-detect driver and set up btrfs loopback
Submodule src/github.com/concourse/bin dd6ef2d8..7dfb3e86:
  > make asset setup non-platform-specific
  > complete the move to baggageclaim
  > move auto-driver-setup/detect into baggageclaim

#1045

Signed-off-by: Clara Fu <cfu@pivotal.io>
@vito


Member

vito commented May 25, 2017

We ended up pushing the driver detection and setup logic (i.e. btrfs loopback image wiring) down into BaggageClaim, and removing it from the binaries and BOSH release. This has the (very much intended) side effect of no longer requiring a worker recreate to upgrade - the BOSH release now defaults to detect, along with the binary, so they'll both just see an existing btrfs mount (as it was set up previously) and continue to use that driver. Yay!

@mbjelac


mbjelac commented May 26, 2017

What release is that (going to be) in?

@clarafu


Contributor

clarafu commented May 29, 2017

@mbjelac 3.1.0; the milestone attached to issues should answer that now

@clarafu clarafu closed this May 29, 2017

dpb587 added a commit to dpb587/bosh that referenced this issue Jun 8, 2017

Use scratch mount in main-bosh-docker
Avoids the following error in recent versions of concourse...

    Deploying:
      Creating instance bosh/0:
        Creating VM:
          Creating vm with stemcell cid bosh.io/stemcells:bc05e9fa-ede3-4250-6adf-8f91d30a170a:
            CPI create_vm method responded with error: CmdError{"type":"Bosh::Clouds::CloudError","message":"Creating VM with agent ID {{d52dc281-7207-4c4f-58b0-df2fd5c89ba8}}: Creating container: Error response from daemon: error creating aufs mount to /var/lib/docker/aufs/mnt/eaa48325c874cccbb68e0138d742fb1283a32511434d84576725fb566bda6233-init: invalid argument","ok_to_retry":false}

Related...

 * concourse/concourse#1045
 * https://github.com/concourse/docker-image-resource/blob/7ffaffb69b052c02cffa9a1bfed30b355af2c453/assets/common.sh#L64

dpb587 added a commit to dpb587/bosh that referenced this issue Jun 8, 2017

Use /scratch mount for docker data-dir in main-bosh-docker

ggeorgiev added a commit to ggeorgiev/docker-image-resource that referenced this issue Jun 8, 2017

place docker data root in /scratch/docker
this will be an empty volume mounted into the container such that Docker
can use overlay or aufs even if Concourse is already using overlay for
its volumes

concourse/concourse#1045

calippo added a commit to buildo/dcind that referenced this issue Jul 6, 2017

gabro added a commit to buildo/dcind that referenced this issue Jul 6, 2017

@AkihiroSuda AkihiroSuda referenced this issue Aug 3, 2017

Closed

TODO: update #5

@EugenMayer


EugenMayer commented Nov 30, 2017

Is there any reason why overlay2 has not been used? https://docs.docker.com/engine/userguide/storagedriver/overlayfs-driver/

It is highly recommended that you use the overlay2 driver if possible, rather than the overlay driver. The overlay driver is not supported for Docker EE.
@vito


Member

vito commented Nov 30, 2017

@EugenMayer


EugenMayer commented Nov 30, 2017

Oh, that's confusing, right - thanks for the clarification

@vito vito changed the title from Switch from btrfs some other filesystem to resolve stability and portability issues to Switch from btrfs to some other filesystem to resolve stability and portability issues Jun 13, 2018
