BTRFS quota is reached when filling up VM disk image file #9124

Closed
tomponline opened this issue Aug 18, 2021 · 47 comments

@tomponline (Member) commented Aug 18, 2021

Even though LXD sets the BTRFS quota 100MiB larger (by default) than the VM disk file's maximum size, if the VM uses all of its disk space the underlying BTRFS filesystem reports that the referenced disk quota has been reached. This prevents LXD from starting the VM, because it cannot write the backup file, even though there should still be some space free.

It seems that BTRFS quotas aren't working the way we think they do.

See https://discuss.linuxcontainers.org/t/btrfs-issues-storage-pools-btrfs-empty-and-btrfs-quota-100-while-inside-the-vm-only-48-utilized/11897

Steps to reproduce:

lxc storage create btrfs btrfs
lxc init images:ubuntu/focal/cloud v1 --vm -s btrfs

# This should do two things: increase the disk image file size to 11GB, and set the BTRFS disk quota to 11GB + 100MiB (the default `size.state` size) to account for the entirety of the disk size and allow 100MiB for volume state usage.
lxc config device set v1 root size=11GB # Accounting for VM image file size sizeBytes=11104862208 blockSize=11000004608

Check the BTRFS quota that was set (expect blockSize 11000004608 + 100MiB (104857600 bytes) = 11104862208 bytes):

sudo btrfs subvolume show --raw /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1
virtual-machines/v1
	Name: 			v1
	UUID: 			5816cce0-0dba-3348-bd29-14b80dfdbc85
	Parent UUID: 		06d1312f-5158-f44e-89a0-999e132ee7bc
	Received UUID: 		-
	Creation time: 		2021-12-01 12:27:57 +0000
	Subvolume ID: 		272
	Generation: 		2516
	Gen at creation: 	2515
	Parent ID: 		5
	Top level ID: 		5
	Flags: 			-
	Snapshot(s):
	Quota group:		0/272
	  Limit referenced:	11104862208 # Matches requested sizeBytes of 11104862208
	  Limit exclusive:	-
	  Usage referenced:	2361466880
	  Usage exclusive:	57344
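
For reference, the same qgroup limits can also be listed directly (assuming the pool is mounted at the same path as above):

sudo btrfs qgroup show -re --raw /var/lib/lxd/storage-pools/btrfs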

Check size of root disk file (expect it to be 11000004608 bytes):

sudo du -b /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img
11000004608	/var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img

Start the VM:

lxc start v1
lxc shell v1

Now, inside v1, run the following until the disk fills up (it should fill up the disk image but not reach the BTRFS quota, as there is another 100MiB allowed):

cat /dev/urandom > /root/foo.bin
cat: write error: Read-only file system

You can now see that the BTRFS referenced quota has been reached, which it shouldn't have been.

sudo btrfs subvolume show --raw /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1
virtual-machines/v1
	Name: 			v1
	UUID: 			5816cce0-0dba-3348-bd29-14b80dfdbc85
	Parent UUID: 		06d1312f-5158-f44e-89a0-999e132ee7bc
	Received UUID: 		-
	Creation time: 		2021-12-01 12:27:57 +0000
	Subvolume ID: 		272
	Generation: 		3246
	Gen at creation: 	2515
	Parent ID: 		5
	Top level ID: 		5
	Flags: 			-
	Snapshot(s):
	Quota group:		0/272
	  Limit referenced:	11104862208
	  Limit exclusive:	-
	  Usage referenced:	11104841728 # Limit reached, yet the disk file size is still only 11000004608, so how is this explained?
	  Usage exclusive:	9011888128

Disk image is still at set size of 11000004608.

sudo du -b /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img
11000004608	/var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img

And the actual used blocks of the image are:

du  -B1 root.img 
10758160384	root.img

Indeed the total size of the volume is less than the quota:

du  -B1 
20480	./templates
8192	./config/systemd
4096	./config/udev
4096	./config/files
14815232	./config
10773151744	.
@tomponline tomponline added the Bug Confirmed to be a bug label Aug 18, 2021
@jamielsharief (Contributor)

Could it be related to this: https://github.com/lxc/lxd/issues/8468

@stgraber (Contributor)

It's possible, yeah. btrfs quotas are really frustrating to work with and very odd compared to what you get on ZFS or through project quotas on ext4/xfs. For one thing, enforcement seems somewhat asynchronous, making it possible to exceed the limit by a few hundred MBs before the quota kicks in. This may well be the source of the issue here, but we'll need to investigate some more.

@tomponline (Member, Author)

Yes, I noticed that when I looked into it initially: the quota consumption kept changing even though the actual instance wasn't running.

@jamielsharief (Contributor)

Issue #8468 is about incorrect disk usage: as seen in the examples, it reported usage of 9MB despite the actual usage being 500MB. If I remember correctly, creating a snapshot resets the reported usage.

It was another issue where I reported the delay in the usage information.

@jamielsharief (Contributor)

So I am thinking that the usage calculation is somehow computing the difference between the filesystem and the most recent snapshot, as opposed to the original image.

@tomponline (Member, Author)

It's not even that simple, because if you wait some time you'll see the utilisation change on the original volume too, as BTRFS performs an async usage scan after taking the snapshot.

@jamielsharief (Contributor)

I think I submitted that as a different issue, but my understanding was that this is how the storage driver works, so we have to deal with it. It's mainly noticeable to users when using a web UI for LXD; using the command line, you probably won't notice it.

@tomponline (Member, Author) commented Aug 18, 2021

Yes, it's not ideal that the BTRFS reporting tools are async. But if it allows users to exceed their quotas due to that async nature, then it's pretty nasty. And this is without even using snapshots (apart from the initial source one, that is).

There's also the concept of a referenced data limit vs an exclusive data limit, whose ramifications I've not fully understood yet.

@jamielsharief (Contributor)

I think we are talking about two different problems. The problem I discovered was that creating a snapshot on a custom BTRFS partition resets the usage to almost nothing. This means that the reported use is way lower than the actual use, so I am not sure if that later leads to the problem reported above.

@tomponline (Member, Author)

In this case I am talking about the example above.

@jamielsharief (Contributor)

No problem, I just saw the issue come through and it reminded me of similar problems, which could create bugs if the usage information is used by the API for something else.

@stgraber stgraber added this to the lxd-4.19 milestone Sep 10, 2021
@stgraber (Contributor)

@tomponline assigning to you so we can decide what to do with this.

If the issue is that btrfs applies quotas asynchronously, then I suspect the only thing we can do is mention it in doc/storage.md and close the issue. For those affected, setting the size.state property on the root device to something quite large like 1GiB should do the trick to let you start things back up, but it's not something that I think we should be doing for the users.
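
For example (using the reproducer instance from above; the 1GiB value is just illustrative):

lxc config device set v1 root size.state=1GiB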

@stgraber stgraber modified the milestones: lxd-4.19, lxd-4.20 Sep 30, 2021
@stgraber stgraber modified the milestones: lxd-4.20, lxd-4.21 Nov 1, 2021
@stgraber stgraber modified the milestones: lxd-4.21, lxd-4.22 Nov 29, 2021
@tomponline (Member, Author)

@stgraber looking into this now.

@jamielsharief (Contributor)

It's been a while since I encountered this, but let me know if I can help.

@tomponline (Member, Author) commented Dec 1, 2021

To rule out any issues with snapshots and quota accounting, I created a VM, exported it as a non-optimized backup, and then re-imported it so that it wasn't linked to any existing image subvolume.

After import I set the disk size to 12GiB and then filled it up as normal.

sudo btrfs subvolume show --raw /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1
virtual-machines/v1
	Name: 			v1
	UUID: 			35c8e59c-e13e-b049-959f-66746bb10a4f
	Parent UUID: 		-
	Received UUID: 		-
	Creation time: 		2021-12-01 14:34:18 +0000
	Subvolume ID: 		307
	Generation: 		28884
	Gen at creation: 	6547
	Parent ID: 		5
	Top level ID: 		5
	Flags: 			-
	Snapshot(s):
	Quota group:		0/307
	  Limit referenced:	13409189888
	  Limit exclusive:	-
	  Usage referenced:	13409112064
	  Usage exclusive:	13409112064

I checked using btrfs fi du and du -B1 and they agree on the size of the volume's root disk file:

sudo du -B1 /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img
12101988352	/var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img
sudo btrfs fi du --raw  /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img
     Total   Exclusive  Set shared  Filename
12101988352  12101988352           0  /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img

So it seems the quota group is reporting incorrect info, as even its total for the subvolume is larger than what btrfs fi du reports for the whole subvolume.

@tomponline (Member, Author) commented Dec 1, 2021

Here's something curious: if you create an empty VM, manually set up a loop device and a filesystem on it, mount it, and then fill up that filesystem, it doesn't exceed the disk image file's size and doesn't reach the BTRFS quota.

lxc init v1 --empty --vm -s btrfs
lxc config device set v1 root size=11GiB
losetup -f /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img --show
/dev/loop21
mkfs.ext4 /dev/loop21
mount /dev/loop21 /mnt
dd if=/dev/random of=/mnt/foo
dd: writing to '/mnt/foo': No space left on device
22460761+0 records in
22460760+0 records out
11499909120 bytes (11 GB, 11 GiB) copied, 189.115 s, 60.8 MB/s
sudo btrfs subvolume show --raw /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1
virtual-machines/v1
	Name: 			v1
	UUID: 			44d51cde-aa4e-874c-8bcc-344d454ec955
	Parent UUID: 		-
	Received UUID: 		-
	Creation time: 		2021-12-01 16:24:35 +0000
	Subvolume ID: 		373
	Generation: 		35468
	Gen at creation: 	35439
	Parent ID: 		5
	Top level ID: 		5
	Flags: 			-
	Snapshot(s):
	Quota group:		0/373
	  Limit referenced:	11916017664
	  Limit exclusive:	-
	  Usage referenced:	11572088832
	  Usage exclusive:	11572088832

And yet doing the same dd command inside the VM will fill up the BTRFS quota.

@tomponline (Member, Author)

I even tried using losetup to create a loop device for the disk image and then manually modified LXD to use the /dev/loop21 device as its root disk, rather than using the root file directly, with the same effect: the BTRFS quota was exceeded.
It looks like some issue in the interplay between QEMU and BTRFS quotas.

@jamielsharief (Contributor)

I think issue #8468 also covers how creating snapshots changed things as well.

@stgraber (Contributor) commented Dec 1, 2021

@tomponline this possibly has to do with the size of the writes. You could probably try dd with something like bs=4M conv=fdatasync? I wonder if btrfs re-computes the quota on sync and not on write, or something.
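
For example, something like (paths and sizes illustrative):

dd if=/dev/urandom of=/mnt/foo bs=4M conv=fdatasync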

@tomponline (Member, Author) commented Dec 1, 2021

@stgraber I've also tried this with a BTRFS storage pool on a raw NVMe device rather than on a loopdev, to avoid any issues with loopdev on loopdev, but the same thing happens.

It seems that I need to set size.state to 1098MiB to allow a 12GiB disk image to be filled without also causing the BTRFS quota to be reached.

@tomponline (Member, Author)

@tomponline this possibly has to do with the size of the writes. You could probably try dd with something like bs=4M conv=fdatasync? I wonder if btrfs re-computes the quota on sync and not on write, or something.

I did start to think perhaps it's a fragmentation issue, which could also be affected by block size. This would explain why the quota extents are larger than the actual used file sizes that BTRFS is tracking.

@stgraber (Contributor) commented Dec 2, 2021

@tomponline it likely has to do with the API used by QEMU; it may be related to async I/O with multiple I/O threads on the same block, or something similar, with btrfs incorrectly accounting for those writes.

@stgraber (Contributor) commented Dec 2, 2021

I'm closing the LXD issue, as the one thing that's clear right now is that our quota and file size calculations are all correct; it's the enforcement that is problematic.

@tomponline can you send what you have to the btrfs mailing list (or bug tracker, if they have one) and we'll see if they come up with anything useful.

@stgraber stgraber closed this as completed Dec 2, 2021
@tomponline (Member, Author)

I've asked on #btrfs IRC and, if I get no reply, will post to linux-btrfs@vger.kernel.org

@tomponline (Member, Author) commented Dec 3, 2021

Chatting on #btrfs IRC, forza (@Forza-tng?) says (paraphrased):

Extents are immutable, so when blocks are written to, the data ends up in new extents and the old extent remains until all of its data is dereferenced or rewritten. You'd need up to double the quota to be safe.
You have to allow for 200% space usage. Also try compress-force and autodefrag; how well autodefrag works depends on the workload.
Since extents in btrfs are immutable, the worst case is when only 4k of a 128MiB extent (the maximum extent size) is still referenced and 128MiB-4k is wasted.
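
To put rough numbers on that worst case: a 128MiB extent is 134217728 bytes, so if only one 4096-byte block of it is still referenced, 134217728 - 4096 = 134213632 bytes are dead data yet still counted against the referenced quota.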

mutlicore says:

I'd probably use compress or compress-force with datacow VM images to limit the extent sizes.
Compression limits the maximum compressed extent size to 128KiB; uncompressed extents can still be 128MiB. Compress-force is the same, but it also limits uncompressed extent sizes to 512KiB (currently as a side-effect of sorts).

@tomponline (Member, Author) commented Dec 3, 2021

I was asked to run the compsize tool, which also accounted for the extra usage:

sudo compsize --bytes /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%     12967923712  12967923712  11807326208
none       100%     12967923712  12967923712  11807326208

@tomponline (Member, Author) commented Dec 3, 2021

@stgraber so shall I look into the compress-force mount option for BTRFS VM volumes, and/or go for the belt-and-braces approach of using a BTRFS quota of <size.state size> + (2 * <disk image size>)?
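
Using the original reproducer's numbers, that belt-and-braces quota would be 104857600 + (2 * 11000004608) = 22104866816 bytes.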

@tomponline (Member, Author)

@stgraber there's various info about compression at https://btrfs.wiki.kernel.org/index.php/Compression, but the problem I see with compress-force is that it's a mount option and so would affect all files in the storage pool. You can enable compression on a per-file basis (https://btrfs.wiki.kernel.org/index.php/Compression#Can_I_force_compression_on_a_file_without_using_the_compress_mount_option.3F), but confusingly this doesn't enable compress-force for the file; it forcefully enables compress (which still uses heuristics to decide whether or not to compress).

It doesn't look like you can enable compress-force for a single file.
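
For reference, the per-file mechanism from the FAQ looks roughly like this (a sketch; as noted, it forces compress rather than compress-force):

sudo btrfs property set /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img compression zstd
sudo btrfs property get /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img compression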

tomponline added a commit to tomponline/lxd that referenced this issue Dec 3, 2021
… raw files

This has the side effect of reducing the maximum extent size for compressible extents and reduces space waste
when partially rewriting extents that can cause the block file to end up exceeding the BTRFS volume quota.

Fixes canonical#9124

Signed-off-by: Thomas Parrott <thomas.parrott@canonical.com>
@stgraber (Contributor) commented Dec 3, 2021

Hmm, so doubling the quota feels wrong as users will (rightfully) expect the volume to not exceed the quota they set...

The compress stuff has been absolutely horrible at causing filesystem corruption in the past and as it's indeed a global option for the entire filesystem, I'd be quite worried to turn it on.

One thing that comes to mind though is that I thought the recommendation was for VM images to be marked as nocow through a filesystem attribute. I wonder if that would improve this behaviour and what the downside would be.

@tomponline (Member, Author)

The compress stuff has been absolutely horrible at causing filesystem corruption in the past and as it's indeed a global option for the entire filesystem, I'd be quite worried to turn it on.

Isn't BTRFS wonderful :(

I thought the recommendation was for VM images to be marked as nocow

Interesting, I had no idea the nocow option existed, nor that it was the recommended option for VM images.
I found https://btrfs.wiki.kernel.org/index.php/FAQ#Can_copy-on-write_be_turned_off_for_data_blocks.3F and will see if that helps. It might well do, given what we know now.
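
A minimal sketch of applying the attribute (it can only be applied to empty files):

touch root.img # file must still be empty for +C to take effect
chattr +C root.img
lsattr root.img # expect the C attribute to be listed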

tomponline added a commit to tomponline/lxd that referenced this issue Dec 3, 2021
…aw disk files

Enable nodatacow so that random writes don't cause fragmentation and old extents to be kept.
BTRFS extents are immutable, so when blocks are written they end up in new extents and the old
ones remain until all of their data is dereferenced or rewritten. These old extents are counted
in the quota, and so leaving CoW enabled can cause the BTRFS subvolume quota to be reached even
if the block volume file isn't full.

Fixes canonical#9124

Signed-off-by: Thomas Parrott <thomas.parrott@canonical.com>
@tomponline (Member, Author)

@stgraber I'm afraid the nodatacow option didn't work. I was initially encouraged, but after using snapshots of VMs from an image it didn't work. I've also checked that I am applying the +C attribute correctly, because it can only be applied to empty files.

I tested this by running chattr +C on an existing file: it didn't show as applied with lsattr (even though the chattr command didn't fail with an error; another bug, then). Whereas when it was added to the empty root file before the image unpacker was run, it did apply and show with lsattr, and it was still showing as applied on a VM snapshot of the image volume.

Sadly it still managed to reach the quota, and compsize showed the same issue as before.

I wonder whether we should still use the 2x capacity approach, but rather than silently adding it, check when applying the quota that it allows for 2x the disk image size?

@tomponline (Member, Author)

@stgraber Good morning. I've found out why I initially thought that the nodatacow option was working and then abruptly changed my mind. The reason is that initially I was testing on a VM I had imported from a backup (so it wasn't a snapshot of an image) whereas later I was testing on a VM that was created as a snapshot from an image volume.

Further reading on the subject of nodatacow revealed this post:

https://www.spinics.net/lists/linux-btrfs/msg35491.html

Second, there's the snapshotting exception. Because a btrfs snapshot
locks the existing file data in place with the snapshot, the first
modification to a fileblock after a snapshot will force a COW for that
block, even on an otherwise nocow file. The nocow attribute remains in
effect, however, and further writes to the same block will modify it in-
place... until the next snapshot of course.

So the issue is that for VMs created as a snapshot from the VM image volume, the first write to a block will necessarily cause a CoW operation, and thus the VM volume's quota usage is increased because it references both the old and new extents of that block (this is why compression helps: it reduces the maximum extent size, so the issue is less severe).

It gets worse though.

I've noticed that the backup import system currently only restores the primary volume's quota, not the state volumes' quota.
This means that for BTRFS backup imports the subvolume is restored with no quota.

Fixing this issue then causes another serious problem.

Whilst in the previous example the image volume size is set to the default 10GiB, the actual data usage is whatever the image size is (for Ubuntu Focal, approximately 4GiB). So there is some leeway before the quota is totally filled up.

However, when exporting a VM to a non-optimized backup and then reimporting it, the full raw image file is written back to the subvolume. Combined with fixing the size.state quota restoration, this means the disk file is effectively considered full from a BTRFS quota perspective. So if a snapshot is taken of that restored VM, any write to that VM will cause a CoW event and almost immediately reach the subvolume quota (less than 100MiB of writes need to occur before it is reached).

So we are in a tricky position:

If we fix the backup restoration issue so that the size.state quota is set correctly, then any subsequent snapshot of the restored VM will very quickly cause the source VM's disk to fail with I/O errors, as it will hit the underlying BTRFS quota.
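
A sketch of that failure sequence (the backup file name is hypothetical):

lxc import backup.tar.gz # restores v1 with a fully-written raw image file
lxc snapshot v1 snap0 # locks the existing extents in place
lxc start v1 # the first write to each block now triggers a CoW, hitting the quota after <100MiB of writes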

@tomponline (Member, Author)

@stgraber this is effectively the same issue as LVM has for non-thin volume snapshots (where snapshots have to be created with a size that limits the total number of CoWs that can occur). For the LXD LVM driver, this has been addressed by creating the snapshot at the same size as the volume (effectively doubling the quota):

https://github.com/lxc/lxd/blob/master/lxd/storage/drivers/driver_lvm_utils.go#L397-L408

We effectively need to do the same and account for the BTRFS snapshot CoW, but using BTRFS semantics that means doubling the quota (which is not as nice as the LVM approach, as we cannot assign that additional quota just for CoW usage).

I realise you said above:

Hmm, so doubling the quota feels wrong as users will (rightfully) expect the volume to not exceed the quota they set...

But as we already have a precedent for this in the LVM driver (i.e. if I set an LVM volume to a 10GiB size and then take a snapshot, writes to the original LVM volume can now take up to 20GiB of space due to accounting for CoW of the snapshot), does this change your position?

@stgraber (Contributor) commented Dec 6, 2021

Not really, as in the LVM case it still won't allow me to dump 10GiB of state data on top of the quota.

In the btrfs case, if we use the approach of silently doubling the quotas and we have users who happen to start using stateful stop/snapshot, they will be allowed to exceed their quotas, potentially by tens of GiB, completely messing up any chance of doing proper resource control on the system (thinking of shared environments with restricted projects combined with project limits).

@tomponline (Member, Author) commented Dec 6, 2021

Not really, as in the LVM case it still won't allow me to dump 10GiB of state data on top of the quota.

This is because we have conflated the state and disk file quotas by not using a separate subvolume (without quota) for the disk image file, whereas with LVM there are separate LVs for state and root disk data.

In theory using a single subvolume made sense, but given how BTRFS does CoW accounting for quotas, using a separate subvolume, although a much larger change, seems like the best approach to address this cleanly.

@stgraber (Contributor) commented Dec 6, 2021

It would still be papering over an upstream issue. Yes, using two volumes would help a bit for the block case, but we'd still get that failure on the fs volume, as that one would still need a limit and so would hit the bug if ever snapshotted.

Similarly, we could absolutely reproduce this issue with a container filesystem.

I'm usually not very keen on papering over other people's bugs especially if we can't take care of the entire issue in a consistent way.

I still think the best we can do here is document the btrfs issue and let people decide what they want to do. For most, I'm hoping that will be staying away from btrfs, while those who really want btrfs should probably consider compression.

@tomponline (Member, Author)

OK, I will absolutely update the docs.
