
Low performance with volumes on btrfs #6862

Closed
matpen opened this issue Jul 6, 2020 · 24 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. stale-issue

Comments

@matpen

matpen commented Jul 6, 2020

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

I am experiencing low performance when containers use volumes backed by a btrfs filesystem. Since I started to use podman I have been puzzled by my applications being slow when deployed (DigitalOcean droplet), but pretty fast in my development environment (laptop). Now I have finally found a reproducer, which I will try to explain below.

The issue I am having is particularly with a mariadb container, where I am importing some data from CSV files. Performing the same query on DigitalOcean (inside the container) is 6 to 10 times slower than on my laptop (again, inside the container). Today I also compared with the same query performed directly on the host (droplet) and by using another volume backed by ext4: sure enough, the issue only appears inside the container when the volume is backed by btrfs.

I therefore performed some research, and came across this thread, this bug and this issue, which more or less describe my situation, but for docker. Unfortunately, the solutions described therein do not work in my case:

  • the most popular solution seems to be disabling barriers in the filesystem; however, after researching it a bit, I feel uncomfortable with this option on a production machine. Quoting from here:
Disabling barriers gets rid of that penalty but will most certainly lead to a corrupted filesystem in case of a crash or power loss.

and from here:

Your hard drive has been detected as not supporting barriers. This is a severe condition, which can result in full file-system corruption [...]
  • many seem to have had luck with this solution, which suggests changing options for the IO scheduler:
echo 0 >/sys/block/<device>/queue/iosched/slice_idle
echo 0 >/sys/block/<device>/queue/iosched/group_idle

however, these files do not exist on my system. See below for more related info.

  • finally, @cyphar seems to have performed extensive research on the topic, and summarized here a few other options (including the one described in the previous point). The first is to change the scheduler algorithm:
echo deadline >/sys/block/<device>/queue/scheduler
# or
echo cfq >/sys/block/<device>/queue/scheduler
# or
echo noop >/sys/block/<device>/queue/scheduler

however, none of these work in my case. I am only given the options none and mq-deadline, which do not seem to affect performance at all, even after restarting the containers (a short sketch of diagnostics to confirm the effective settings follows this list).

$ sudo cat /sys/block/sda/queue/scheduler
[mq-deadline] none
  • lastly @cyphar suggests "switching to btrfs", which is obviously impossible in my case, since I am already using btrfs. As mentioned, I tried moving the volume with the mysql data onto the root volume of the droplet (which is ext4) and the performance increased to the expected level. This is unfortunate, as I was planning to leverage some btrfs features like RAID and snapshots.
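
For reference, a short diagnostics sketch (assuming the device is sda and the volume is mounted at /mnt/data, as in this setup) to confirm which of the settings above are actually in effect:

$ findmnt -T /mnt/data -o SOURCE,FSTYPE,OPTIONS   # effective btrfs mount options (barriers are on unless "nobarrier" is listed)
$ cat /sys/block/sda/queue/scheduler              # the active I/O scheduler is the bracketed entry
$ cat /sys/block/sda/queue/rotational             # 0 = non-rotational (SSD-backed), 1 = rotational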

Having run out of ideas, I turn to the experts hoping to receive some insight.

Steps to reproduce the issue:

See above description.

Describe the results you received:

Low performance during SQL queries.

Describe the results you expected:

Same performance as on the host.

Additional information you deem important (e.g. issue happens only occasionally):

Although I have not measured it yet, mongodb and other processes seem to be affected, too.

However, I did accurately measure the performance of all other possibly involved components (CPU, RAM, even the very same volume using dd), and everything seems nominal: only the SQL query is reproducibly slow.
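
A way to check whether the slowdown is specific to sync-heavy I/O (which a SQL import is, unlike a plain dd run) would be a small fio benchmark on the volume. A minimal sketch, assuming fio is installed and the volume is mounted at /mnt/data (the test directory name is a placeholder):

$ mkdir -p /mnt/data/fio-test
$ fio --name=sync-write --directory=/mnt/data/fio-test --rw=randwrite --bs=16k --size=256m --fsync=1 --runtime=30 --time_based --group_reporting

Comparing the reported fsync latencies between the btrfs and the ext4 volume should show whether the filesystem, rather than the container, is the bottleneck.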

Output of podman version:

Version:            1.6.2
RemoteAPI Version:  1
Go Version:         go1.10.4
OS/Arch:            linux/amd64

Output of podman info --debug:

debug:
  compiler: gc
  git commit: ""
  go version: go1.10.4
  podman version: 1.6.2
host:
  BuildahVersion: 1.11.3
  CgroupVersion: v1
  Conmon:
    package: 'conmon: /usr/bin/conmon'
    path: /usr/bin/conmon
    version: 'conmon version 2.0.3, commit: unknown'
  Distribution:
    distribution: ubuntu
    version: "18.04"
  MemFree: 856244224
  MemTotal: 8348622848
  OCIRuntime:
    name: runc
    package: 'runc: /usr/sbin/runc'
    path: /usr/sbin/runc
    version: 'runc version spec: 1.0.1-dev'
  SwapFree: 2038906880
  SwapTotal: 2066706432
  arch: amd64
  cpus: 4
  eventlogger: journald
  hostname: develop
  kernel: 5.3.0-62-generic
  os: linux
  rootless: false
  uptime: 2h 29m 41.06s (Approximately 0.08 days)
registries:
  blocked: null
  insecure: null
  search: null
store:
  ConfigFile: /etc/containers/storage.conf
  ContainerStore:
    number: 59
  GraphDriverName: overlay
  GraphOptions: {}
  GraphRoot: /mnt/data/root/podman
  GraphStatus:
    Backing Filesystem: btrfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  ImageStore:
    number: 104
  RunRoot: /var/run/containers/storage
  VolumePath: /mnt/data/root/podman/volumes


Package info (e.g. output of rpm -q podman or apt list podman):

podman/bionic,now 1.6.2-1~ubuntu18.04~ppa1 amd64 [installed]

Additional environment details (AWS, VirtualBox, physical, etc.):

The issue appears on a DigitalOcean droplet, where both the images and the podman volumes are stored on a btrfs filesystem (/mnt/data) on an "attached storage volume".

$ cat /etc/issue
Ubuntu 18.04.4 LTS
$ uname -a
Linux develop 5.3.0-62-generic #56~18.04.1-Ubuntu SMP Wed Jun 24 16:17:03 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/fstab
LABEL=cloudimg-rootfs	/	 ext4	defaults	0 0
LABEL=UEFI	/boot/efi	vfat	defaults	0 0
/dev/disk/by-id/scsi-0DO_Volume_sdb	/mnt/data	btrfs	defaults,nofail,discard	0	0
@openshift-ci-robot openshift-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jul 6, 2020
@vrothberg
Member

Thanks for reaching out, @matpen! I don't know any btrfs experts in this community. Maybe there's another kernel setting for Ubuntu that needs tweaking to improve btrfs performance? I fear that moving the volumes to ext4 or xfs may be the quickest solution.

@giuseppe @mheon @rhatdan WDYT?

@mheon
Member

mheon commented Jul 6, 2020

This should all be kernel-side - under the hood, these are just bind mounts into the mount namespace of the container. I don't think Podman itself does (or can do) anything about performance here - if there are issues, they're likely on the kernel's side

@rhatdan rhatdan closed this as completed Jul 6, 2020
@matpen
Author

matpen commented Jul 6, 2020

While I agree with @mheon that podman probably cannot fix the problem by itself, I do not agree with the decision to close this issue. In fact, the problem seems to emerge from the interaction of podman with the kernel (or btrfs): btrfs does work well when used on the host directly.

@rhatdan
Member

rhatdan commented Jul 6, 2020

@matpen Sure, we can reopen it, but you would need to tell us what we are doing wrong with the mount point. If you look at how the mount point is being handled in the container and mount namespace, and then tell us what we are doing wrong, we would have a chance of fixing it.

None of us use BTRFS, so it is unlikely that we will figure this out ourselves, and we cannot even tell whether this is a Podman issue.

One thing to check would be whether this works better with Docker. It could be that BTRFS does not work well when used in namespaces.

@rhatdan rhatdan reopened this Jul 6, 2020
@matpen
Author

matpen commented Jul 7, 2020

@rhatdan Thank you for reopening the issue and supporting this investigation! I am definitely willing to do all the work necessary to set up a reproducible case and collect all the required information. All I ask from the podman team is a few pointers as I make progress, specifically for those podman features I have limited understanding of.

If you look at how the mount point is being handled in the container and mount namespace

For example, it seems like you already have a good idea of how to "look" at the mount namespace; would you mind suggesting it here? Maybe podman inspect will reveal the required info?

One thing to check would be whether this works better with Docker. It could be that BTRFS does not work well when used in namespaces.

This too is worth testing: I plan to set up a droplet from scratch, with two identical volumes running ext4 and btrfs, and compare docker with podman performance. As soon as the results are available, I will share them here.

@rhatdan
Member

rhatdan commented Jul 7, 2020

I was thinking podman run -v /SOURCE:/DEST fedora mount | grep /DEST
This should show you how we are mounting the volume into the container.
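
Presumably the same information can also be read from the container's inspect data; a hedged example (the container name is assumed, and the json template helper may not be available in every podman version):

$ podman inspect --format '{{ json .Mounts }}' test_container

Plain podman inspect test_container and reading the Mounts section works as well.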

@matpen
Author

matpen commented Jul 7, 2020

@rhatdan Thank you for the pointer. So here is the result for the actual container that is presenting the problem:

overlay on / type overlay (rw,relatime,lowerdir=/mnt/data/root/podman/overlay/l/3NNXSIUQVCMBHKAG3Z34P7CGND:/mnt/data/root/podman/overlay/l/E7D463O4J7IITMRDT4QLN6O6BJ:/mnt/data/root/podman/overlay/l/VQYGB6TBVD4JOP2JPK4EDQXLEB:/mnt/data/root/podman/overlay/l/DL7MO4ZN4JVKYLHEDEGNJ64OAO:/mnt/data/root/podman/overlay/l/I2M6Z5OYYQGRINEJ7KDRCHBEKO:/mnt/data/root/podman/overlay/l/JLCKGMDXRMJOF5YWCJRAD6RA7K:/mnt/data/root/podman/overlay/l/QRJK7V3YJ6C43DEYL5JKKJNDN6:/mnt/data/root/podman/overlay/l/WK3SKEVJK4R32EROMZ2JUKUZ6O:/mnt/data/root/podman/overlay/l/HR4VKHOOF3Z3JGMBWRYDUUGSXD,upperdir=/mnt/data/root/podman/overlay/cd875cc1852570e51f465a459929b6eda73e2d1a308b8aa8ab99ebd2adb56065/diff,workdir=/mnt/data/root/podman/overlay/cd875cc1852570e51f465a459929b6eda73e2d1a308b8aa8ab99ebd2adb56065/work,xino=off)

@rhatdan
Member

rhatdan commented Jul 7, 2020

Ok, that is showing the overlay mount, but not the volume mounts inside the container.

@matpen
Author

matpen commented Jul 7, 2020

My bad, here is the correct data:

/dev/sda on /var/lib/mysql type btrfs (rw,relatime,discard,space_cache,subvolid=5,subvol=/root/mydata/mariadb/mariadb_data)

@mheon
Member

mheon commented Jul 7, 2020

Does the same performance degradation happen if subvolumes are not involved?

@matpen
Author

matpen commented Jul 7, 2020

@mheon No. The btrfs volume makes it slow. I just performed extensive tests, and am about to post the results.

@matpen
Author

matpen commented Jul 7, 2020

As mentioned in my earlier comments, today I also took the time to setup a droplet from scratch and perform some comparisons.

The droplet is a standard droplet with 1 CPU, Ubuntu 18.04 and kernel 4.15.0-66-generic.
Then I added two identical 20 GB volumes, and formatted one with ext4 and the other with btrfs.

The test involves a simple set of SQL queries that create a new database and some tables, and import about 35 MB of CSV data into them.
The image is a standard mariadb image in order to avoid influence from other factors.

The command to launch the container is:

podman|docker run --rm --name test_container --volume <test folder>:/var/lib/mysql:rw -e MYSQL_ROOT_PASSWORD=secret docker.io/library/mariadb:10.4.13

After launching the container, I manually exec into it, download the data and run:

time mysql -psecret < test.sql
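
To keep the runs comparable, the whole per-backend test could be scripted roughly as below. This is only a sketch under the assumptions above: the data directory, the image tag, and a test.sql already present on the host are placeholders, and the sleep is a crude stand-in for waiting until mariadb is ready.

#!/bin/bash
# Hypothetical driver: time the import against one volume backend.
ENGINE=${1:-podman}            # podman or docker
DATADIR=${2:-/mnt/data/test}   # backing directory for /var/lib/mysql

$ENGINE run -d --rm --name test_container \
    --volume "$DATADIR":/var/lib/mysql:rw \
    -e MYSQL_ROOT_PASSWORD=secret docker.io/library/mariadb:10.4.13

sleep 30   # crude wait for mariadb initialization; checking the container logs would be more robust

# copy the SQL file in and time only the import itself
$ENGINE cp test.sql test_container:/test.sql
$ENGINE exec test_container bash -c 'time mysql -psecret < /test.sql'

$ENGINE stop test_container   # --rm removes the container afterwards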

The four tests refer to the same container, but with the data residing on:

  • no volume (skip the --volume flag);
  • the root ext4 volume of the droplet;
  • the external attached ext4 volume;
  • the external attached btrfs volume;

Of course, the tests have been repeated multiple times, and at different times of the day, to exclude other effects.
The averaged results are in the table below; I tested both podman and docker in the same environment.

-------------------|-------------------|-------------------
        TEST       |       PODMAN      |       DOCKER
-------------------|-------------------|-------------------
     no volume     |    1.5 - 1.6 s    |    1.6 - 1.7 s
    root volume    |    1.5 - 1.6 s    |    1.6 - 1.7 s
    ext4 volume    |     4 - 4.5 s     |    4.1 - 5.4 s
   btrfs volume    |     36 - 40 s     |     41 - 47 s
-------------------|-------------------|-------------------

Here is what I can comment so far:

  1. The good news is that podman performs slightly better than docker, although the difference is minimal;
  2. The external volumes are definitely slower than the root volume: this might be due to differences in the backing technology, of which I am unaware;
  3. btrfs is approximately 10 times slower when compared to ext4.

Since this is just a testing droplet, I will go ahead and perform more aggressive tests, like mounting btrfs with nobarrier. If someone has additional ideas for tests, I will be glad to carry them out at the first opportunity.

@matpen
Author

matpen commented Jul 7, 2020

btrfs with nobarrier indeed brings performance up to the level of the ext4 volume: 4-5 seconds on average. This seems to confirm the sources outlined in the OP.
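
For anyone wanting to reproduce this comparison, this is roughly how barriers can be disabled, either temporarily or via fstab. This is a testing-only sketch: as quoted in the OP, running without barriers risks filesystem corruption on a crash or power loss, and if the remount is rejected a full unmount/mount cycle with the option set is needed.

# temporary, until the next mount:
$ sudo mount -o remount,nobarrier /mnt/data

# or persistently via /etc/fstab (testing only):
/dev/disk/by-id/scsi-0DO_Volume_sdb  /mnt/data  btrfs  defaults,nofail,discard,nobarrier  0  0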

@cyphar

cyphar commented Jul 7, 2020

This does look like a kernel issue, but can you try running a similar test with just cgroups and no container runtime at all (to help narrow down where the issue is coming from)? You could do this by doing something like:

$ for ss in /sys/fs/cgroup/*; do mkdir $ss/test-performance; done
$ echo $$ >/sys/fs/cgroup/blkio/test-performance/cgroup.procs
$ # spawn mariadb in this cgroup
$ # run the test
$ # repeat, adding the shell to a new cgroup each time and restarting mariadb inside this cgroup

This would help figure out whether this is a derivative of the cgroup issue we had several years ago (that you linked in this thread). Alternatively, can you provide an example data.sql so I can go and run the tests myself?
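
For concreteness, a minimal sketch of one iteration of that procedure, assuming cgroup v1 (as in this environment), a locally installed mariadb, and placeholder paths; repeat with blkio replaced by each controller of interest:

$ sudo mkdir -p /sys/fs/cgroup/blkio/test-performance
$ echo $$ | sudo tee /sys/fs/cgroup/blkio/test-performance/cgroup.procs
$ sudo -u mysql mysqld --datadir=/mnt/data/mariadb &    # started from this shell, so it inherits the cgroup
$ time mysql -psecret < test.sql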

@matpen
Author

matpen commented Jul 8, 2020

@cyphar Thank you for joining this discussion, and for your valuable input: this is exactly the kind of help I was hoping to get.

So I went on and, based on your directions, I prepared a scripted test that I ran multiple times today:

  • the setup is the one I described earlier (Ubuntu droplet with two attached volumes);
  • the test spawns a new mariadb service for each existing cgroup, as per your code example; I also added a "bare" test outside of any cgroup, for comparison;
  • the above is repeated for the droplet root and the two attached volumes;
  • also, each test is repeated 5 times in order to average out any interference (such as droplet adjacency);

The results are attached. These include some statistics, too, such as the total and average test duration, and variance.

And here is what seems to emerge:

  1. The ext4 volume is 3-4x slower than the droplet root: after some research, I found that this is probably expected, as the volumes are network-bound;
  2. The btrfs volume is 2-3x slower than the ext4 volume: note that, in this test environment, I could not reproduce the same delay as in the full container setup, where it still happens;
  3. Inside cgroups, the variance is sometimes very large: in some extreme cases the test can take anywhere from 7 to 90 seconds.

I am eager to receive your comments on the results.

Also, at this link I found that the DO engineers suggest tweaking the "queue depth", which reminds me of the scheduler-related solutions that have been suggested in the past. However, I am unsure what values would be sensible to try.
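
For reference, the block-layer knobs that guidance most likely refers to can be inspected and adjusted like this (sda assumed; the right values depend on the backing storage, so treat the number below as an example only):

$ cat /sys/block/sda/queue/nr_requests    # request queue depth available to the scheduler
$ cat /sys/block/sda/queue/read_ahead_kb  # readahead size in KiB
$ echo 256 | sudo tee /sys/block/sda/queue/nr_requests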

@cmurf

cmurf commented Jul 20, 2020

I'm pretty sure this is an SQL on Btrfs issue, not a podman on Btrfs issue.

But to isolate this, it would be helpful if someone could put together a generic test case, i.e. a list of reproduction steps that makes it easy for someone unfamiliar with podman to test for performance differences between different setups; i.e. it would be filesystem-agnostic. An additional test case might then do some basic SQL tasks and time them.

Something like this:

  1. Load a test image (this can't be timed if it's downloaded)
  2. Time to create 100 containers based on the test image
  3. Time to do various typical modifications of each of those containers. Net time is sufficient.
  4. Time to destroy those 100 containers

Maybe it should be 1000 containers. That may not be typical, but the idea might be to go an order of magnitude higher in order to help expose problems. If 1000 containers is not so uncommon, then maybe the test case should test creating, modifying, and destroying 10000 containers.

And for SQL testing, it really should try to isolate specifically the SQL work, i.e. it would do no timing while a container is being created or destroyed. The idea of doing this with podman is purely as an expedient, to make it easier for those unfamiliar with such things to do the testing, including making it easier to automate in something like openQA - and then we would have a basis for automated regression tests of these things and an early warning if something has gone wrong somewhere.

Time to run a series of SQL tests (perhaps start with sqlite, since it's common, and also used in Firefox - there's a dual use for such a test):

  1. loads or creates a test database in advance so we're not benchmarking any network effects
  2. does some kind of minimal aging so it's faux-real-world rather than brand new (this could be done on the test database in advance, so that it's in a deterministic state for everyone)
  3. runs a set of read-heavy tests and times it
  4. runs a set of write-heavy tests and times it
  5. runs some other common tests and times them

The synthetic nature of such tests is not necessarily a negative. The lack of detailed timing information for every conceivable test is also not necessarily a negative. Such tests won't estimate whether a particular configuration runs a particular database any slower or faster, but they might help expose edge cases and regressions. If there's a big difference between two configurations, we can get more detailed information by running such benchmarks concurrently with eBPF tracing via bcc-tools like btrfsslower, fileslower, and biolatency, and figure out where any time discrepancies lie.
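
As an illustration of that last point, the tracing could look roughly like this while the benchmark runs in another terminal (a sketch assuming the bcc tools are installed under /usr/share/bcc/tools, the usual packaged location):

$ sudo /usr/share/bcc/tools/btrfsslower 10        # btrfs operations slower than 10 ms
$ sudo /usr/share/bcc/tools/fileslower 10         # file reads/writes slower than 10 ms
$ sudo /usr/share/bcc/tools/biolatency -D 10 3    # per-disk block I/O latency histograms, three 10 s intervals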

@matpen
Author

matpen commented Jul 30, 2020

@cmurf Thank you for sharing some ideas. I would appreciate it if you had the chance to test them in your environment, so as to have a comparison. Here are some thoughts:

But to isolate this, it would be helpful if someone could put together a generic test case, i.e. a list of reproduction steps that makes it easy for someone unfamiliar with podman to test for performance differences between different setups

The tests in my latest comments are pretty much automatic: simply launch the script, and it will run the tests by itself. Sure, some setup is needed to prepare the environment, e.g. install MySQL etc.

And for SQL testing, it really should try to isolate specifically the SQL work, i.e. it would do no timing while a container is being created or destroyed

The tests include a case that does not involve containers, so it's just "raw" SQL on different filesystems. The tests only measure the SQL query, and not the time required to set up the test (e.g. launching and destroying containers).

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Sep 8, 2020

@matpen @cmurf @cyphar Any movement on this? I don't see the core developers working on this, since none of them have specific knowledge of BTRFS. So we need help from the community if this is going to move forward.

@matpen
Author

matpen commented Sep 9, 2020

I just finished a long debugging process with the cloud provider, but we could not track down the problem, either. Unfortunately, the tests appear to lead to random results and are very complex to set up. All that we know so far is that:

  • it "looks lke" it is related to volumes;
  • the volumes sometime appear to "freeze" (even for more than one second) like they were hitting some bottleneck;
  • this can happen on different environments (i.e. laptop or server).

I am sorry I do not have much more to report: I too hoped that someone could add more insight. If you would rather close this issue, I will reopen it at a later point if I can provide more data.

@rhatdan
Member

rhatdan commented Oct 5, 2020

I don't believe this is a Podman issue, since the BTRFS filesystem is just being bind mounted into the container. This is a BTRFS issue and should be taken up with the OS vendor.

@rhatdan rhatdan closed this as completed Oct 5, 2020
@jhit

jhit commented May 24, 2023

@matpen did you report this somewhere else? I tried to import a 2.5 GB SQL dump today and it did not finish within hours. I first tried in docker and then tested against a native MySQL Server 8 on Ubuntu, with the same result.
I'm on Ubuntu 23.04 and using a root BTRFS filesystem.
So it looks like something is really off with MySQL and BTRFS.

@matpen
Author

matpen commented May 24, 2023

@jhit thank you for reporting your experience with this! I have moved away from that setup since my last comment in this issue, so unfortunately I have nothing new to share. IIRC I also have not reported this anywhere else... but I would be interested in the conversation, if you decide to do so.

@jhit

jhit commented May 25, 2023

@matpen Ok. Will see if I find some time to start a conversation in the btrfs community. Will post a link here if I get to it.

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Aug 24, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 24, 2023