
Switch from btrfs to some other filesystem to resolve stability and portability issues #1045

Closed
vito opened this Issue May 6, 2017 · 27 comments

@vito
Member

vito commented May 6, 2017

Feature Request

What challenge are you facing?

We've been using btrfs as our volume driver by default on Linux for a long time now, and while we get a lot of good things from it (primarily nestability), we've encountered stability and portability problems in the wild.

What is a volume driver/graph driver?

Buckle up, as there's a lot of history and quirks here. There's a reason Docker supports like 10 different drivers. :)

You may recognize the words aufs, overlay/overlayfs, btrfs - these are all filesystems you can use to have a copy-on-write replica of an original volume/directory. This is how 10 containers are able to all use a 1GB image as their rootfs without using 10GB of disk space. Docker has a very similar concept, called a "graph driver", which is how it does all its image layering shenanigans to tie 20 different layers together to form 1 rootfs. Concourse's volume driver interface is a bit simpler, as BaggageClaim just supports creating volumes and copy-on-writes of other volumes. The full interface is in driver.go.

How do they compare?

aufs and overlay (formerly overlayfs) are both union filesystems. Filesystems like these are "pseudo" filesystems in that they don't directly interact with a device to provide filesystem semantics on their own (like ext4 and btrfs and other "real" filesystems you'd use for a physical machine). Instead they tie together upper and lower directories on an existing filesystem to form one mount point.

The terms "upper directory" and "lower directory" refer to the directory that writes go to, and the (possibly read-only) directory that the writes are overlayed on to, respectively. For example, a container's rootfs will start with an empty upper directory, with the original rootfs image as its lower directory. If the container writes to /etc/hosts or /etc/some-file, those writes will go into the upper directory, leaving the lower directory unchanged, allowing it to be shared across many containers at once without polluting each other.

aufs is kind of a black sheep: it was never part of the kernel, and has only ever been available by installing an -extra package. Despite this, it was chosen by Docker early on as its filesystem of choice, probably because it supported creating filesystems from multiple lower directories. overlayfs was the primary competitor at the time, but it was pretty new, only came with some versions of Ubuntu, and only supported one upper directory and one lower directory.

These days, overlayfs has been renamed to overlay and now ships with the kernel as of version 3.18, requiring no additional setup. Kernel version 4.0 introduces support for configuring multiple lower directories, making it a worthy replacement for aufs on paper.

The critical flaw with aufs and overlay for our use case is that they do not nest. Within a container whose rootfs is an aufs mount, you cannot create further aufs mounts that use an aufs directory as a lower directory. The same is true of overlay.

btrfs, on the other hand, is a real filesystem that deals directly with a block device. You can use it for your entire machine, just like ext4. Instead of gluing together "upper" and "lower" directories, it supports copy-on-write semantics via snapshotting a volume to create a separate volume. The key feature btrfs brings to the table is nestability. Volumes created from a snapshot can themselves be snapshotted to create another volume. So if you were to run docker within a container with a btrfs filesystem, it would just use its btrfs driver and create subvolumes within the container. An important side-tangent here is that containers are just processes, it's the management of their rootfs that's complicated with regard to nesting. Using btrfs solves this problem elegantly.
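The snapshot-of-a-snapshot flow can be sketched in a few commands. This is a minimal illustration, not our actual driver code: it assumes root and a btrfs filesystem mounted at a hypothetical /mnt/btrfs (overridable via an invented BTRFS_MNT variable), and skips itself when those prerequisites aren't met:

```shell
#!/bin/sh
set -e

# Hypothetical mount point; a real deployment's path differs.
BTRFS_MNT=${BTRFS_MNT:-/mnt/btrfs}

if [ "$(id -u)" -eq 0 ] && [ "$(stat -f -c %T "$BTRFS_MNT" 2>/dev/null)" = "btrfs" ]; then
  cd "$BTRFS_MNT"
  btrfs subvolume create base           # the original volume
  echo hello > base/file
  btrfs subvolume snapshot base cow-1   # COW of the base volume
  btrfs subvolume snapshot cow-1 cow-2  # COW of a COW: the nesting overlay/aufs lack
  echo changed > cow-2/file             # cow-2 diverges...
  cat base/file                         # ...while base still reads "hello"
else
  echo "skipping: need root and a btrfs filesystem at $BTRFS_MNT"
fi
```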

Nestability is important because:

  • The docker-image resource can just use the btrfs driver, so we don't need a loopback for every image, which is great, as loopbacks are a global system resource that can leak if we're not careful.
  • Tasks can use docker compose, again by just using the btrfs driver.

What's wrong with btrfs?

Stability and portability. While btrfs has been available in the kernel for a long time, it's never been rock-solid.

It's also occasionally stripped out of some systems (e.g. Docker for Mac), which is frustrating.

There's also the initial dance we need to do in order to get a btrfs device available. In most deployments this involves creating a loopback device for btrfs, as it's extremely uncommon for it to be the primary filesystem of the disk (probably due to the aforementioned stability issues).
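Concretely, that dance usually looks something like the following sketch (paths are hypothetical, not what any particular deployment uses). The sparse-file part runs as any user; the loopback/format/mount steps need root and are shown as comments only:

```shell
#!/bin/sh
set -e

# Hypothetical backing-file path.
IMG=/tmp/btrfs-volumes.img

# A sparse file: apparent size 10G, almost nothing allocated on disk yet.
# This lazy allocation is also the source of the disk-usage confusion below.
truncate -s 10G "$IMG"
ls -lh "$IMG"   # reports the 10G apparent size
du -h "$IMG"    # reports the ~0 actually-allocated blocks

# The remaining steps require root; shown for illustration only:
#   LOOP=$(losetup --find --show "$IMG")   # attach a loopback device
#   mkfs.btrfs "$LOOP"                     # format it as btrfs
#   mount "$LOOP" /var/lib/baggageclaim    # hypothetical volumes dir
```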

Here are some of the specific issues and concerns we've run into:

  • most commonly and critically, #767 and #1035 (btrfs disk hangs, requiring worker recreate)
  • failure to collect volume stats via qgroup
  • possible disk usage creep due to sparse file; unconfirmed, but a lot of people are surprised by disk usage
  • #561 (sparse file sizing confusion)
  • #596 (slow in Docker for Mac as it has no btrfs support)

A Modest Proposal

Let's revisit our choice and see if we can accomplish everything we need with another driver. Because aufs does not come with the kernel, this will probably be overlay. There's also a new kid on the block in lcfs which may be worth investigating if it's something we can carry around and not require a kernel module to be installed.

We may be able to achieve nestability by ensuring there's a non-overlay scratch space available to resources and tasks, so that they can use their own layering driver. Initial testing suggests we'd need this kind of thing, as overlay does not nest:

root@think:~# mkdir overlay-test
root@think:~# cd overlay-test/
root@think:~/overlay-test# mkdir upper lower work merged
root@think:~/overlay-test# touch lower/a
root@think:~/overlay-test# mount -t overlay overlay -o lowerdir=./lower,upperdir=./upper,workdir=./work ./merged
root@think:~/overlay-test# ls merged/
a
root@think:~/overlay-test# touch merged/b
root@think:~/overlay-test# find .
.
./upper
./upper/b
./lower
./lower/a
./work
./work/work
./merged
./merged/a
./merged/b
root@think:~/overlay-test# cd merged/
root@think:~/overlay-test/merged# mkdir upper2 lower2 work2 merged2
root@think:~/overlay-test/merged# touch lower2/nested-existing
root@think:~/overlay-test/merged# mount -t overlay overlay -o lowerdir=./lower2,upperdir=./upper2,workdir=./work2 ./merged2
mount: wrong fs type, bad option, bad superblock on overlay,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.
root@think:~/overlay-test/merged# dmesg | tail -1
[57110.791959] overlayfs: filesystem on './upper2' not supported as upperdir

So one approach for this could be to create an empty volume and mount it somewhere like /tmp in the container. That would propagate the filesystem from the host (probably ext4 or something) into the container, and would probably fix the Docker-in-Concourse case. The next case to handle is Concourse-in-Docker: the concourse/concourse image is likely to be run by a Docker daemon using overlay or aufs as its driver. For this we can just have a VOLUME pragma for the work-dir, which should mount in the filesystem from the host.
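To make the failure mode concrete: nested drivers care about what filesystem backs their storage directory, and you can check that directly. A small diagnostic sketch (defaulting to / so it runs anywhere; pass e.g. /var/lib/docker from inside a container):

```shell
#!/bin/sh
# Print the filesystem type backing a directory. Docker's overlay graph
# driver refuses to layer on top of overlay/aufs, which is why bind-mounting
# an ext4-backed volume over /var/lib/docker changes the outcome.
dir=${1:-/}
fstype=$(stat -f -c %T "$dir")
echo "$dir is backed by: $fstype"

case "$fstype" in
  overlayfs|aufs) echo "nested union filesystems will likely fail here" ;;
  *)              echo "a 'real' filesystem; nested drivers should work" ;;
esac
```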

@concourse-bot

concourse-bot commented May 6, 2017

Hi there!

We use Pivotal Tracker to provide visibility into what our team is working on. A story for this issue has been automatically created.

The current status is as follows:

  • #144984065 Switch from btrfs to some other filesystem to resolve stability and portability issues

This comment, as well as the labels on the issue, will be automatically updated as the status in Tracker changes.

@vito vito removed the unscheduled label May 8, 2017

@vito vito modified the milestone: v2.10.0 May 9, 2017

@clarafu clarafu added this to the v2.10.0 milestone May 9, 2017

@vito vito modified the milestones: Staging, v2.10.0 May 9, 2017

@andrewedstrom

Contributor

andrewedstrom commented May 11, 2017

One additional problem with btrfs is that occasionally workers will report No space left on device errors when there is actually still a lot of disk space left. This is a known btrfs bug. More info: https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_Btrfs_claims_I.27m_out_of_space.2C_but_it_looks_like_I_should_have_lots_left.21
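For anyone hitting this, the generic df output is the misleading part: btrfs pre-allocates space in per-type chunks (Data/Metadata/System), and a full metadata chunk triggers ENOSPC while df still shows plenty free. A small diagnostic sketch that compares the two views (it skips itself when the mount isn't btrfs; MNT is an invented variable):

```shell
#!/bin/sh
# Compare the generic free-space view with btrfs's per-type allocation view.
MNT=${MNT:-/}
if [ "$(stat -f -c %T "$MNT")" = "btrfs" ]; then
  df -h "$MNT"                # generic view: may show lots of space "free"
  btrfs filesystem df "$MNT"  # real view: Data / Metadata / System chunks
else
  echo "skipping: $MNT is not btrfs"
fi
```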

@vito

Member

vito commented May 11, 2017

Prioritizing this higher since things seem to be coming to a head with btrfs lately.

Let's first investigate overlay as it comes with the kernel out-of-the-box. There's actually a branch on baggageclaim already with an overlay driver: https://github.com/concourse/baggageclaim/tree/overlay-driver

Let's also investigate lcfs as an alternative experimental approach. It would be great if all that can run in userland and we can literally package Concourse with it, and then we have a known-good (or at least known) version. Not sure if that's possible, but worth investigating.

Side-note: after finding a working alternative, it may be worthwhile to observe performance differences between them in various deployment scenarios (e.g. binary directly on VM, BOSH-deployed, Concourse-in-Docker).

@vito vito modified the milestones: v2.9.0, Staging, v3.1.0 May 11, 2017

@vito vito added in progress and removed in progress labels May 15, 2017

@vito

Member

vito commented May 16, 2017

Initial investigation into overlay has yielded no real surprises.

First I merged in the overlay-driver branch, then changed the baggageclaim_ctl in the BOSH release to use the overlay driver instead of btrfs. I then ran a build with a simple get and things worked. Next I ran a hello-world build which uses image_resource. The build failed with "invalid argument" trying to fetch the image, as expected, as /var/lib/docker in the container is mounted overlay, which Docker can't do anything with.

So I patched the ATC to create an empty volume and bind-mount it into the container at /var/lib/docker. Then it worked! I moved on to TestFlight, which started failing on volume destroys, which I fixed with concourse/baggageclaim@06884dd. TestFlight is now passing.

Continuing on this thread would probably mean setting up a canonical "scratch space" which would be made available to resource containers. And then decide if it should be available to tasks as well, to support the "Docker Compose" use case. This scratch space can't be /tmp, as it turns out that causes the container create to fail, as Guardian places its init there during container start. We'd probably want to invent something like /opt/resource/scratch, but then we'd need to figure out what to do for tasks.

Haven't looked into LCFS yet. That's probably the next step.

@vito

Member

vito commented May 16, 2017

Looked into LCFS for a bit. I haven't made a POC as it looks like it'd be a bit of an investment, and I already have an initial set of concerns:

  • It's a bit early on in the project for my liking.
  • It requires a block device or a sparse file, just like btrfs. It at least doesn't require a loopback device for the sparse-file case. They do mention that using a file has a negative performance impact, but I do not see any benchmarks so I don't know how bad it is.
  • I'm not sure if it nests, or if aufs can run on top of it. Hard to tell without a POC. It can at least operate directly with a sparse-file, so we wouldn't need a loopback.
  • I'm a bit confused by its CLI. lcfs daemon requires you to pass two mount points, one which is called "host-mnt" and one which is "plugin-mnt", which seems like some Docker graph driver plugin concerns leaking through the abstraction. The docs just describe them as two mount points. There's no perceivable difference in /proc/mounts. Wat?
  • It requires FUSE 3.0, but that may be fine. I think that's all user-land, and can be packaged alongside lcfs or just built in to it (not sure if the dependency is compile-time or runtime).

@vito

Member

vito commented May 16, 2017

Starting with some performance testing between btrfs and overlay. Configured two workers: one with overlay, one with btrfs, and a pipeline that runs two jobs against both: one job that builds the atc-ci image (i.e. this job), and another that builds the git resource image and then runs its integration tests (i.e. this job).

These are very high-level tests, compared to the usual tests run against a graph driver, but they're at least realistic.

I'll let the pipeline run periodically overnight. The dashboard showing the results can be seen here:

https://metrics.concourse.ci/dashboard/db/driver-testing?panelId=1&fullscreen&orgId=1&refresh=1m&from=now-1h&to=now

@vito

Member

vito commented May 17, 2017

In running tests overnight, I initially saw that btrfs was consistently slightly slower:

https://snapshot.raintank.io/dashboard/snapshot/8ZgvXLZoxswdMElcR3WXWmSBji2J7Ow8

Seeing that this probably won't change overnight, I then upped the ante and had the atc-ci job run three puts in parallel, simulating more load.

This showed a more pronounced performance difference:

https://snapshot.raintank.io/dashboard/snapshot/iCjVkUy3Bmg5gSse1aQk08JrcohkOh1s

I then opted to test another kind of job, @osis's Strabo, which demonstrates high write use during the initial get, and high COW volume use as it has eight put steps preceded by ~46 get steps, plus a couple task outputs. This results in 384 total COW volumes being created.

Initial findings were kind of interesting. btrfs took the 384 volume creates like a champ, and successfully ran the build.

overlay, however, cannot run the build successfully. Once it gets to the task, it errors trying to namespace the task's image, with Post /volumes: net/http: timeout awaiting response headers. I opened #1171 for this initially, but then I noticed it happens consistently. Looking into the logs reveals that namespacing a container's image takes about 1 second for btrfs, but 1 minute and 10 seconds for overlay:

May 17 07:25:16 btrfs-worker-0 baggageclaim:  {"timestamp":"1495031116.095516443","source":"baggageclaim","message":"baggageclaim.repository.create-volume.namespace.start","log_level":0,"data":{"handle":"9c1d7f2c-8354-4a14-73c2-9a73a9927744","path":"/var/vcap/data/baggageclaim/volumes/init/9c1d7f2c-8354-4a14-73c2-9a73a9927744/volume","session":"3.396046.1"}}
May 17 07:25:17 btrfs-worker-0 baggageclaim:  {"timestamp":"1495031116.940141678","source":"baggageclaim","message":"baggageclaim.repository.create-volume.namespace.done","log_level":0,"data":{"handle":"9c1d7f2c-8354-4a14-73c2-9a73a9927744","path":"/var/vcap/data/baggageclaim/volumes/init/9c1d7f2c-8354-4a14-73c2-9a73a9927744/volume","session":"3.396046.1"}}
May 17 07:30:21 overlay-worker-0 baggageclaim:  {"timestamp":"1495031421.247597456","source":"baggageclaim","message":"baggageclaim.repository.create-volume.namespace.start","log_level":0,"data":{"handle":"ae199c4e-b55f-4e59-4a0c-7949ca1c0910","path":"/var/vcap/data/baggageclaim/volumes/init/ae199c4e-b55f-4e59-4a0c-7949ca1c0910/volume","session":"2.408866.1"}}
May 17 07:31:31 overlay-worker-0 baggageclaim:  {"timestamp":"1495031491.111908197","source":"baggageclaim","message":"baggageclaim.repository.create-volume.namespace.done","log_level":0,"data":{"handle":"ae199c4e-b55f-4e59-4a0c-7949ca1c0910","path":"/var/vcap/data/baggageclaim/volumes/init/ae199c4e-b55f-4e59-4a0c-7949ca1c0910/volume","session":"2.408866.1"}}

This demonstrates the strengths and weaknesses of a "real" filesystem like btrfs compared to a union filesystem like overlay. Namespacing a volume entails recursing through it and chowning things owned by root. This takes much longer with overlay, possibly because each chown requires more i/o to hoist the file to the upper layer and then change the permissions. Or some other file attribute tracking/bookkeeping that overlay has to maintain.
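To make "namespacing" concrete: it boils down to a recursive walk that re-chowns entries into the container's UID mapping. Here's a minimal sketch against a throwaway directory; MAX_UID is set to the current user purely so the sketch runs unprivileged (a real driver maps to a high UID instead), and the paths are illustrative:

```shell
#!/bin/sh
set -e

# Throwaway stand-in for a volume; paths and MAX_UID are illustrative.
VOLUME=$(mktemp -d)
MAX_UID=$(id -u)   # a real driver would map root to a high UID instead

mkdir -p "$VOLUME/etc" "$VOLUME/usr/bin"
touch "$VOLUME/etc/hosts" "$VOLUME/usr/bin/env"

# The expensive part: one chown per entry. On overlay each chown forces a
# copy-up of the file into the upper layer; on btrfs it's a cheap in-place
# metadata change, hence the ~1s vs ~70s difference in the logs above.
find "$VOLUME" -exec chown -h "$MAX_UID" {} +

find "$VOLUME" ! -uid "$MAX_UID" | wc -l   # prints 0: everything remapped
```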

I'll look into how this behaves on Docker, as they'll have run into the same issues once they added user namespacing support.

This is also a problem set that may be addressed by something like shiftfs in the future, but we need to solve today's problems, not next year's. :)

@vito

Member

vito commented May 17, 2017

We could mitigate the slow namespacing by keeping privileged and unprivileged versions of these resource caches. Then it would only affect the first fetch. The downside would be doubled disk use.

Looking into how Docker avoids this issue: they have the advantage of never having data at rest that they then need to namespace. They fetch images from the registry already knowing whether the container has to be namespaced, and simply remap the UIDs/GIDs as they extract, so there's no performance difference. We, however, have to deal with scenarios like a get of a Docker image (or any other resource) subsequently being used by one privileged and one unprivileged task within the same build.

@vito

Member

vito commented May 17, 2017

Even aside from the timeout (which I bumped to 5 minutes to continue testing), the inflated disk use and slowness due to copying really makes this look bad. Strabo can't successfully run on overlay because it runs out of disk: https://snapshot.raintank.io/dashboard/snapshot/woyCmHg4BSiTxLMgaKD3E0hmB0xfeMX4

Currently switching to overlay effectively places a tax on unprivileged tasks. All their inputs including their image will likely point to a privileged volume, as all resources are privileged.

They're privileged for one silly reason: the docker-image resource needs to run Docker. If we reversed the default behavior and made resources unprivileged by default, this wouldn't be nearly as big of a problem, and privileged tasks would be taxed instead. It would also be a lot safer.

So here's one proposal:

  1. Resources default to unprivileged.
  2. Base resource types such as docker-image can declare themselves to be privileged.
  3. Custom resource types can be marked privileged: true in the pipeline if necessary.

This would at least make it so that the resources themselves don't have to be copied in order to pass them to tasks. It would however not fix image_resource; those would still need to be namespaced and thus copied.

It may be worth investigating aufs at this point to see if it has any special magic handling of chown that's more efficient. Or investigating LCFS more deeply and doing a POC.

@vito

Member

vito commented May 18, 2017

Wrote up an aufs driver and deployed it alongside overlay and btrfs. It's not looking good. The git integration tests fail like so:

running it_can_check_from_head...
  warning: unable to access '/root/.config/git/ignore': Permission denied
  Switched to a new branch 'bogus'
  warning: unable to access '/root/.config/git/ignore': Permission denied
  Switched to branch 'master'
  warning: unable to access '/root/.config/git/ignore': Permission denied
  warning: unable to access '/root/.config/git/ignore': Permission denied
  warning: unable to access '/root/.config/git/attributes': Permission denied
  warning: unable to access '/root/.config/git/attributes': Permission denied
  rm: can't stat '/root/.netrc': Permission denied

I recall this being a peculiar bug with aufs with user namespaces. Here's some context including a workaround that I really don't want to implement: cloudfoundry-attic/garden-linux@f6bb305

Here's the driver implementation. I probably won't commit it. https://gist.github.com/vito/10ae0ac70bd5d2c280897ee9385cc425

vito (Member) commented May 18, 2017

Moving beyond that failure, I went ahead and tested the atc-ci job and Strabo. The atc-ci job performs around the same as overlay (slightly faster, but I didn't bother collecting enough data points). Strabo demonstrates the same problems as overlay. So there's really no reason to continue investigating aufs, as it only introduces more trouble: it's not in the mainline kernel, it has the permissions bug mentioned previously, and it doesn't fix any of the difficulties with overlay.


vito (Member) commented May 18, 2017

Started playing with making resources unprivileged (except docker-image - currently hardcoded).

The good news: strabo completes successfully!

The bad news: the first few builds error while fetching the resources.

The builds error with Post /volumes: net/http: timeout awaiting response headers on some/most of the get steps. This is even after bumping the timeout from 1 minute to 5 minutes, as was done somewhere in the previous comments. Notably, it's interesting that this errors on the get steps now, whereas it used to error on the task before (with the 1 minute timeout).

The get steps errored because they use a custom resource type. This is interesting because now the get steps themselves are unprivileged, but their resource image will be privileged, because it was fetched via the docker-image resource. This meant that every get step's container would have to namespace (i.e. copy) the rootfs image. With 46 of these happening at once on one worker, it was enough load to slow things down to the point of failure.

Running the build enough times, it'll eventually succeed as all the get steps complete and warm the cache. Once you get past that, the task container is able to be created in short time, because now all of its inputs are already unprivileged. Progress!

I believe the next step from here will be to ensure that resources fetched by privileged containers result in an unprivileged volume. This will remove the need for namespacing the image for custom resource types or image_resources, and make it so all resource caches are unprivileged, so that we don't have to keep track. We would then only need to "un-namespace" inputs for privileged tasks and puts.

To sum up, all this will require two changes to BaggageClaim:

  1. The ability to convert a volume from a privileged volume to an unprivileged volume. This will be used on the volume that privileged gets fetch into before marking them initialized.
  2. The ability to make a privileged COW volume from an unprivileged parent volume. This will be used for the inputs and image to a privileged task or put step.
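The "namespacing" mentioned above ultimately means rewriting file ownership across the whole volume tree. Here's a minimal sketch of the uid translation involved, assuming the common convention of mapping container root to a reserved "max uid" on the host (the exact value and mapping the runtime uses may differ):

```go
package main

import "fmt"

// maxUID is a hypothetical host-side uid reserved for container root in
// unprivileged containers; the real value depends on the runtime's config.
const maxUID uint32 = 4294967294

// toHostUID maps a uid as seen inside an unprivileged container to the
// uid actually stored on the host filesystem.
func toHostUID(containerUID uint32) uint32 {
	if containerUID == 0 {
		return maxUID
	}
	return containerUID
}

// toContainerUID is the inverse mapping, applied when "un-namespacing"
// a volume for a privileged container.
func toContainerUID(hostUID uint32) uint32 {
	if hostUID == maxUID {
		return 0
	}
	return hostUID
}

func main() {
	fmt.Println(toHostUID(0), toHostUID(1000)) // prints: 4294967294 1000
}
```

Under a scheme like this, change 1 above amounts to a tree walk applying `toHostUID` to each file's owner via `lchown`, which is why doing it for 46 rootfses at once is so expensive on a union filesystem.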
vito (Member) commented May 19, 2017

Late yesterday I completed the first step (converting a volume from privileged to unprivileged). This proved to work really well, as Strabo completed successfully right off the bat, and completed quicker than btrfs.

And now, I broke btrfs!

I let the atc-ci and git jobs run overnight to collect more data-points. This morning I also opted to run Strabo concurrent with it, which I'd never done before, just to collect avg times of Strabo between overlay and btrfs. Upon running Strabo, the atc-ci job got stuck, and the load avg of the btrfs worker started to climb:

https://snapshot.raintank.io/dashboard/snapshot/6kTWLqGd16Egut8jsA9lpeU1zl8zwK5i

The process tree shows a btrfs subvolume create stuck in disk sleep (D):

root        6363  4.9  0.0 2418588 2696 ?        S<l  May18  52:17 /var/vcap/packages/baggageclaim/bin/baggageclaim --volumes /var/vcap/data/baggageclaim/volumes --driver btrfs --overlays-dir /var/vcap/data/baggageclaim/overlays --btrfs-bin /var/vcap/packages/btrfs_tools/sbin/btrfs --mkfs-bin /var/vcap/packages/btrfs_tools/sbin/mkfs.btrfs --bind-ip 0.0.0.0 --b
root     1277635  0.0  0.0   2104     0 ?        D<   14:10   0:00  \_ /var/vcap/packages/btrfs_tools/sbin/btrfs subvolume create /var/vcap/data/baggageclaim/volumes/init/6a2cc100-dc5b-4fe4-6ade-72055edce6d6/volume

Here's possibly relevant info from /var/log/kern.log: https://gist.github.com/vito/4405a03a17925da46d4677c90f793f95 Notably, I don't see any btrfs panics, yet there are many hung processes.
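As an aside, the `D` in ps's STAT column is the same state character exposed as the third field of `/proc/<pid>/stat`, which is handy for spotting these hangs programmatically. A small parser sketch (the sample line is illustrative, not real output):

```go
package main

import (
	"fmt"
	"strings"
)

// procState extracts the state character from a /proc/<pid>/stat line.
// The comm field is parenthesized and may itself contain spaces, so we
// split after the closing paren rather than naively on whitespace.
func procState(statLine string) (byte, error) {
	i := strings.LastIndex(statLine, ")")
	if i < 0 || i+2 >= len(statLine) {
		return 0, fmt.Errorf("malformed stat line")
	}
	fields := strings.Fields(statLine[i+1:])
	if len(fields) == 0 {
		return 0, fmt.Errorf("malformed stat line")
	}
	return fields[0][0], nil
}

func main() {
	// Illustrative line resembling the hung `btrfs subvolume create` above.
	line := "1277635 (btrfs) D 6363 1277635 6363 0 -1 4194560"
	s, _ := procState(line)
	fmt.Printf("%c\n", s) // prints: D
}
```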

Anyhow. Back to paving the way for overlay. This adds more wind to the sails.


vito (Member) commented May 19, 2017

Side note: recreated the btrfs worker and queued up more Strabo builds, since it would actually be nice to get the data from it, and it locked up again. Pretty easy to reproduce, it seems. Here are the metrics if anyone's interested. https://snapshot.raintank.io/dashboard/snapshot/qO3HJQtuzDN3sNMjRslVxGjpVhlv0Y35

Overlay's looking pretty solid. All builds succeeded and in good time, even with the atc-ci and git jobs running concurrently.


vito added a commit to concourse/baggageclaim that referenced this issue May 19, 2017

volumes can be converted to/from privileged
also backfill coverage and make behavior consistent for streaming
to/from privileged/unprivileged volumes

concourse/concourse#1045

vito added a commit to concourse/atc that referenced this issue May 19, 2017

mount /var/lib/docker so docker magically works
this is a hack; we should ultimately mount it as "/scratch" or something
generic.

the point of this is to have a non-overlay mount point so that docker
can make its own aufs or overlay mounts.

the volume will be GCed when the container goes away. 'life' makes this
kind of thing super easy!

concourse/concourse#1045

vito added a commit to concourse/atc that referenced this issue May 19, 2017

ensure created resources are made unprivileged
this optimizes for the most common case of using images or inputs for an
unprivileged container. otherwise, union-mount filesystems like
'overlay' and 'aufs' will end up having to copy a bunch of data during
namespacing.

concourse/concourse#1045

vito added a commit to concourse/atc that referenced this issue May 24, 2017

mount /var/lib/docker so docker magically works
this is a hack; we should ultimately mount it as "/scratch" or something
generic.

the point of this is to have a non-overlay mount point so that docker
can make its own aufs or overlay mounts.

the volume will be GCed when the container goes away. 'life' makes this
kind of thing super easy!

concourse/concourse#1045

vito added a commit to concourse/atc that referenced this issue May 24, 2017

ensure created resources are made unprivileged
this optimizes for the most common case of using images or inputs for an
unprivileged container. otherwise, union-mount filesystems like
'overlay' and 'aufs' will end up having to copy a bunch of data during
namespacing.

concourse/concourse#1045

vito added a commit to concourse/atc that referenced this issue May 24, 2017

make all resources but 'docker-image' unprivileged
this makes it so that the vast majority of containers managed by
concourse are unprivileged, so they can all benefit from their images
and input artifacts/volumes being unprivileged.

ultimately this should be done by the base resource type being listed as
privileged, so that we're not special-casing by name.

we'll probably also need 'privileged: true' in the 'resource_types'
section of the pipeline.

concourse/concourse#1045

vito added a commit to concourse/atc that referenced this issue May 24, 2017

create /scratch, not /var/lib/docker
chose /scratch as the path because:

* there is at least some precedent in UNIX systems for having /scratch
  as an "efficient, large ephemeral scratch space"
* it's named to support the use case described in
  concourse/concourse#534 for check containers to have a "scratch space"
  available to them. when actually doing that issue, the volume for this
  mount would outlive the container and be used for the next check
  container for the same resource
* it can be reasonably platform agnostic, as it doesn't live under
  "/opt" or "/tmp"; on Windows and Darwin it can just live in the
  container's directory
* it's unlikely to collide with a task or resource's filesystem

concourse/concourse#1045
vito (Member) commented May 24, 2017

Bumped the worker version to 1.1 as the ATC now needs (Volume).SetPrivileged. The BaggageClaim changes are otherwise backwards-compatible, though, hence the minor bump and not major.


vito (Member) commented May 24, 2017

Forgot to reference this issue, but the commit that flips the switch on this is here: ddb6662

The work in the binary still has to be done.


vito (Member) commented May 24, 2017

Another caveat to all this is that overlay is only something we can use on kernels >= 4.0, as we make use of the 'multiple lower layers' feature.
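For context on what that feature gives us: a single overlay mount ties one writable upper layer to a chain of read-only lowers via the mount data string, and it's the ':'-separated multi-lower form that needs kernel >= 4.0. A sketch of building those options (paths hypothetical):

```go
package main

import (
	"fmt"
	"strings"
)

// overlayOpts builds the mount(2) data string for an overlay mount.
// Multiple ':'-separated lowerdir entries (top layer first) require
// kernel >= 4.0; a single lowerdir works on older kernels.
func overlayOpts(lowers []string, upper, work string) string {
	return fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s",
		strings.Join(lowers, ":"), upper, work)
}

func main() {
	fmt.Println(overlayOpts(
		[]string{"/volumes/live/parent", "/volumes/live/grandparent"},
		"/overlays/child/upper",
		"/overlays/child/work",
	))
}
```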


vito added a commit to concourse/bin that referenced this issue May 25, 2017

auto-detect driver, and respect flag if provided
previously it would default the driver to 'btrfs' but even if you set it
to something else it would just use btrfs anyway.

now it'll default to "detect", where it'll prefer btrfs if the
filesystem underlying the work dir is btrfs, and otherwise prefer
overlay if the kernel is >= 4.0.0, and then fall back on naive.

if btrfs is chosen explicitly, a loopback device will be created if
necessary, just as before.

concourse/bin#32, concourse/concourse#1045
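The detection order described in this commit message can be sketched roughly as follows (kernel-version parsing simplified to the major component; helper names hypothetical):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// kernelAtLeast4 reports whether a uname release string like
// "4.4.0-98-generic" has a major version of at least 4.
func kernelAtLeast4(release string) bool {
	major := strings.SplitN(release, ".", 2)[0]
	n, err := strconv.Atoi(major)
	return err == nil && n >= 4
}

// detectDriver mirrors the precedence above: stick with btrfs if the
// work dir already sits on one, otherwise overlay on new enough
// kernels, otherwise fall back on the naive (copy) driver.
func detectDriver(workDirFS, kernelRelease string) string {
	switch {
	case workDirFS == "btrfs":
		return "btrfs"
	case kernelAtLeast4(kernelRelease):
		return "overlay"
	default:
		return "naive"
	}
}

func main() {
	fmt.Println(detectDriver("ext4", "4.4.0-98-generic"))  // prints: overlay
	fmt.Println(detectDriver("btrfs", "3.19.0-80-generic")) // prints: btrfs
	fmt.Println(detectDriver("ext4", "3.19.0-80-generic"))  // prints: naive
}
```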

vito added a commit that referenced this issue May 25, 2017

bump bin
Submodule src/github.com/concourse/bin caefdb94..57b22ec7:
  > auto-detect driver, and respect flag if provided
  > update .envrc now that bin lives in concourse repo

concourse/concourse#1045

vito added a commit to concourse/bin that referenced this issue May 25, 2017

clarafu (Contributor) commented May 25, 2017

The binary has been updated to only choose overlay on kernel >= 4.0.

The concourse worker command will not require a recreate for this upgrade; it'll just stick with btrfs, as there will already be an existing btrfs mount point. (Note: we should validate this during acceptance.)

The BOSH release, however, will require a recreate of the workers, because the same autodetect logic is not implemented in the release. Maybe we should push all this down into BaggageClaim?


clarafu (Contributor) commented May 25, 2017

We should make sure the Docker repository is able to use overlay. This may be as simple as adding a VOLUME pragma to the Dockerfile for the work dir.
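If it really is that simple, the change would be along these lines (the work-dir path here is hypothetical; whatever directory the image uses as the worker's work dir):

```dockerfile
# Declare the work dir as a volume so it gets a non-overlay mount,
# letting the worker's own overlay driver run on top of it.
VOLUME /worker-state
```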


clarafu added a commit that referenced this issue May 25, 2017

bump baggageclaim bin
this pushes the filesystem setup down in to baggageclaim, and should
make the upgrade from btrfs-default to overlay-default a smooth
transition, as workers will continue to use btrfs until they're
recreated.

Submodule src/github.com/concourse/baggageclaim c12e0c4..178b8c0:
  > complete the move of driver detection down
  > auto-detect driver and set up btrfs loopback
Submodule src/github.com/concourse/bin dd6ef2d8..7dfb3e86:
  > make asset setup non-platform-specific
  > complete the move to baggageclaim
  > move auto-driver-setup/detect into baggageclaim

concourse/concourse#1045

Signed-off-by: Clara Fu <cfu@pivotal.io>
vito (Member) commented May 25, 2017

We ended up pushing the driver detection and setup logic (i.e. btrfs loopback image wiring) down into BaggageClaim, and removing it from the binaries and BOSH release. This has the (very much intended) side effect of no longer requiring a worker recreate to upgrade - the BOSH release now defaults to detect, along with the binary, so they'll both just see an existing btrfs mount (as it was set up previously) and continue to use that driver. Yay!


mbjelac commented May 26, 2017

What release is that (going to be) in?


clarafu (Contributor) commented May 29, 2017

@mbjelac 3.1.0; the milestone attached to issues should answer that now


clarafu closed this May 29, 2017

dpb587 added a commit to dpb587/bosh that referenced this issue Jun 8, 2017

Use scratch mount in main-bosh-docker
Avoids the following error in recent versions of concourse...

    Deploying:
      Creating instance bosh/0:
        Creating VM:
          Creating vm with stemcell cid bosh.io/stemcells:bc05e9fa-ede3-4250-6adf-8f91d30a170a:
            CPI create_vm method responded with error: CmdError{"type":"Bosh::Clouds::CloudError","message":"Creating VM with agent ID {{d52dc281-7207-4c4f-58b0-df2fd5c89ba8}}: Creating container: Error response from daemon: error creating aufs mount to /var/lib/docker/aufs/mnt/eaa48325c874cccbb68e0138d742fb1283a32511434d84576725fb566bda6233-init: invalid argument","ok_to_retry":false}

Related...

 * concourse/concourse#1045
 * https://github.com/concourse/docker-image-resource/blob/7ffaffb69b052c02cffa9a1bfed30b355af2c453/assets/common.sh#L64

dpb587 added a commit to dpb587/bosh that referenced this issue Jun 8, 2017

Use /scratch mount for docker data-dir in main-bosh-docker
Avoids the following error in recent versions of concourse...

    Deploying:
      Creating instance bosh/0:
        Creating VM:
          Creating vm with stemcell cid bosh.io/stemcells:bc05e9fa-ede3-4250-6adf-8f91d30a170a:
            CPI create_vm method responded with error: CmdError{"type":"Bosh::Clouds::CloudError","message":"Creating VM with agent ID {{d52dc281-7207-4c4f-58b0-df2fd5c89ba8}}: Creating container: Error response from daemon: error creating aufs mount to /var/lib/docker/aufs/mnt/eaa48325c874cccbb68e0138d742fb1283a32511434d84576725fb566bda6233-init: invalid argument","ok_to_retry":false}

Related...

 * concourse/concourse#1045
 * https://github.com/concourse/docker-image-resource/blob/7ffaffb69b052c02cffa9a1bfed30b355af2c453/assets/common.sh#L64

ggeorgiev added a commit to ggeorgiev/docker-image-resource that referenced this issue Jun 8, 2017

place docker data root in /scratch/docker
this will be an empty volume mounted into the container such that Docker
can use overlay or aufs even if Concourse is already using overlay for
its volumes

concourse/concourse#1045

calippo added a commit to buildo/dcind that referenced this issue Jul 6, 2017

gabro added a commit to buildo/dcind that referenced this issue Jul 6, 2017

@AkihiroSuda AkihiroSuda referenced this issue Aug 3, 2017

Closed

TODO: update #5

EugenMayer commented Nov 30, 2017

Is there any reason why overlay2 has not been used? https://docs.docker.com/engine/userguide/storagedriver/overlayfs-driver/

It is highly recommended that you use the overlay2 driver if possible, rather than the overlay driver. The overlay driver is not supported for Docker EE.

vito (Member) commented Nov 30, 2017

EugenMayer commented Nov 30, 2017

Oh, that's confusing, right - thanks for the clarification


vito changed the title from Switch from btrfs some other filesystem to resolve stabiity and portability issues to Switch from btrfs to some other filesystem to resolve stabiity and portability issues Jun 13, 2018
