BTRFS quota is reached when filling up VM disk image file #9124
Comments
Could it be related to this?
It's possible, yeah. btrfs quotas are really frustrating to work with and very odd compared to what you get on ZFS or through project quotas on ext4/xfs. For one thing, they seem somewhat asynchronous, making it possible to exceed the limit by a few hundred MB before the quota kicks in. This may well be the source of the issue here, but we'll need to investigate some more.
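For anyone wanting to observe this asynchrony, a quick sketch (the pool mount point is an assumption; adjust to your setup) is to check the rescan status and sample the qgroup numbers twice with no intervening writes:

```
# Assumed btrfs mount point for the LXD pool.
POOL=/var/snap/lxd/common/lxd/storage-pools/pool

# Show whether an async quota rescan is currently in progress.
sudo btrfs quota rescan -s "$POOL"

# Sample the accounting twice; the referenced/exclusive values can
# change between samples even though nothing is writing.
sudo btrfs qgroup show --raw "$POOL"
sleep 30
sudo btrfs qgroup show --raw "$POOL"
```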
Yes, I noticed that when I looked into it initially: the quota consumption kept changing even though the actual instance wasn't running.
Issue #8468 is about incorrect disk usage: as seen in the examples, it says the usage was 9MB despite the usage being 500MB. If I remember correctly, creating a snapshot resets the usage. It was another issue where I reported the delay of information.
So I am thinking that the usage calculation is somehow computing the difference between the filesystem and the most recent snapshot, as opposed to the original image.
It's not even that simple, because if you wait some time you'll see the utilisation change on the original volume too, as BTRFS performs an async usage scan after taking the snapshot.
I think I submitted that as a different issue, but my understanding was that this is how the storage driver works, so we have to deal with it. It's mainly noticeable to users when using a web UI for LXD; using the command line, you probably won't notice it.
Yes, it's not ideal if the BTRFS reporting tools are async. But if it allows users to exceed their quotas due to that async nature then it's pretty nasty. And this is without even using snapshots (apart from the initial source one, that is). There's also the concept of a referenced data limit vs an exclusive data limit, which I've not fully understood the ramifications of yet.
I think we are talking about two different problems. The problem I discovered was that creating a snapshot on a custom BTRFS partition resets the usage to almost nothing. This means that the reported use is way lower than the actual use, so I am not sure if that later leads to the problem reported above.
In this case I am talking about the example above.
No problem, I just saw the issue come through and it reminded me of similar problems, which could create bugs if the usage information is used by the API for something else.
@tomponline assigning to you so we can decide what to do with this. If the issue is that btrfs applies quotas asynchronously, then I suspect the only thing we can do is mention it in doc/storage.md and close the issue. For those affected, setting the
@stgraber looking into this now.
It's been a while since I encountered this, but let me know if I can help.
To rule out any issues with snapshots and quota accounting, I created a VM, exported it as a non-optimised backup and then re-imported it so it wasn't linked to any existing image subvolume. After import I set the disk size to 12GiB and then filled it up as normal.
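For reference, a sketch of that sequence (the exact commands were not in the comment; the device-set syntax varies between LXD versions):

```
# Export without optimized storage (the default), so the backup
# contains a plain raw image rather than a btrfs send stream.
lxc export v1 v1.tar.gz
lxc delete v1
lxc import v1.tar.gz

# Grow the root disk to 12GiB; older LXD used `size 12GiB` (space-separated).
lxc config device set v1 root size=12GiB

lxc start v1
```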
I checked using the qgroup commands shown below.
So it seems the quota group is reporting incorrect info, as even the total for the subvolume is larger than what the actual file usage adds up to.
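The command output was lost in the thread; the check was presumably along these lines (the pool path is an assumption):

```
# -p/-c show parent/child qgroups, -r/-e show referenced/exclusive limits.
sudo btrfs qgroup show -pcre --raw /var/snap/lxd/common/lxd/storage-pools/pool
```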
Here's something curious: if you create an empty VM, and then manually set up a loop device, create a filesystem on it, mount it and then fill up that filesystem, it doesn't exceed the disk image file's size and doesn't reach the BTRFS quota.
And yet doing the same
I even tried using
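The loop device experiment described above would look roughly like this (paths and sizes are assumptions, not the exact commands used):

```
# Create a sparse file the same size as the VM disk image and attach it
# to a free loop device.
truncate -s 11000004608 /mnt/pool/test/disk.img
LOOPDEV=$(sudo losetup -f --show /mnt/pool/test/disk.img)

# Put a filesystem on it, mount it, and fill it until ENOSPC.
sudo mkfs.ext4 "$LOOPDEV"
sudo mkdir -p /mnt/test
sudo mount "$LOOPDEV" /mnt/test
sudo dd if=/dev/urandom of=/mnt/test/fill bs=4M || true

# The backing file stays within its apparent size and the btrfs
# quota is not reached in this case.
sudo btrfs qgroup show --raw /mnt/pool
```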
I think issue #8468 also covers how creating snapshots changed things as well.
@tomponline possibly it has to do with the size of the writes. You could probably try
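The suggested command was truncated; presumably something along these lines inside the guest, writing the same total amount of data with different request sizes:

```
# Inside the VM: a few large sequential writes...
dd if=/dev/urandom of=/root/fill-large bs=4M count=256 oflag=direct
rm /root/fill-large

# ...versus many small writes of the same total size.
dd if=/dev/urandom of=/root/fill-small bs=4k count=262144 oflag=direct
```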
@stgraber I've also tried this with a BTRFS storage pool on a raw NVMe device rather than on a loopdev, to avoid any issues with loopdev on loopdev, but the same thing happens. It seems that I need to set
I did start to think perhaps it's a fragmentation issue, which could also be affected by block size. This would explain why the quota extents are larger than the actual used file sizes that BTRFS tracks.
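A fragmentation hypothesis like this can be checked on the host with filefrag and compsize (tool availability and the image path are assumptions):

```
IMG=/var/snap/lxd/common/lxd/storage-pools/pool/virtual-machines/v1/root.img

# Count the extents backing the image file; a heavily fragmented
# file is backed by many extents.
sudo filefrag "$IMG"

# Compare on-disk usage vs referenced bytes; a large disparity hints
# at partially-rewritten extents still being referenced.
sudo compsize "$IMG"
```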
@tomponline it likely has to do with the API used by QEMU; it may be related to async I/O with multiple I/O threads on the same block device or something, and btrfs incorrectly accounting for those writes.
I'm closing the LXD issue, as if there's one thing that's clear right now, it's that our quota and file size calculation is all correct; it's the enforcement which is problematic. @tomponline can you send what you have to the btrfs mailing-list (or bug tracker if they have one) and we'll see if they come up with anything useful.
I've asked on #btrfs IRC and if I get no reply I will post to linux-btrfs@vger.kernel.org
Chatting on #btrfs IRC, I was asked to run the compsize tool on the disk image.
@stgraber so shall I look into the compression approach then?
@stgraber there's various info about compression at https://btrfs.wiki.kernel.org/index.php/Compression, but the problem I see is that it doesn't look like you can enable it for a single subvolume; it appears to be an option for the whole filesystem.
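For completeness, a sketch of what enabling compression looks like; whether it can apply per-file or per-subvolume depends on the kernel and btrfs-progs version, so treat the property command as an assumption:

```
# Filesystem-wide: compress is a mount option, affecting newly written data only.
sudo mount -o remount,compress=zstd /var/snap/lxd/common/lxd/storage-pools/pool

# Per-file/per-directory: newer btrfs-progs expose a compression property.
sudo btrfs property set \
    /var/snap/lxd/common/lxd/storage-pools/pool/virtual-machines/v1 compression zstd
```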
… raw files

This has the side effect of reducing the maximum extent size for compressible extents and reduces space waste when partially rewriting extents that can cause the block file to end up exceeding the BTRFS volume quota.

Fixes canonical#9124

Signed-off-by: Thomas Parrott <thomas.parrott@canonical.com>
Hmm, so doubling the quota feels wrong, as users will (rightfully) expect the volume to not exceed the quota they set... The compress stuff has been absolutely horrible at causing filesystem corruption in the past, and as it's indeed a global option for the entire filesystem, I'd be quite worried to turn it on. One thing that comes to mind, though, is that I thought the recommendation was for VM images to be marked as nocow through a filesystem attribute. I wonder if that would improve this behaviour and what the downside would be.
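Marking an image nocow would look like this; note that on btrfs the attribute only takes effect on empty files, so it has to be set before any data is written (a sketch, not LXD's actual code):

```
# The C (no copy-on-write) attribute must be applied while the file is empty.
touch root.img
chattr +C root.img
lsattr root.img          # should list the 'C' attribute

# Only now allocate the image to its full apparent size.
truncate -s 11000004608 root.img
```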
Isn't BTRFS wonderful :(
Interesting, I had no idea the nocow attribute existed.
…aw disk files

Enable nodatacow so that random writes don't cause fragmentation and old extents to be kept. BTRFS extents are immutable, so when blocks are written they end up in new extents and the old ones remain until all of their data is dereferenced or rewritten. These old extents are counted in the quota, and so leaving CoW enabled can cause the BTRFS subvolume quota to be reached even if the block volume file isn't full.

Fixes canonical#9124

Signed-off-by: Thomas Parrott <thomas.parrott@canonical.com>
@stgraber I'm afraid the nodatacow change doesn't solve it. I tested this by running the same fill test as above. Sadly it still managed to reach the quota, and compsize showed the same issue as before. I wonder if we shouldn't use the 2x capacity approach, but rather than silently adding it, check when applying the quota that it allows for 2x the disk image size?
@stgraber Good morning. I've found out why I initially thought that the nodatacow attribute was working; see the explanation below. Further reading on the subject: https://www.spinics.net/lists/linux-btrfs/msg35491.html
So the issue is that for VMs created as a snapshot from the VM image volume, the first write to a block will necessarily cause a CoW operation, and thus the VM volume's quota usage will be increased because it references both the old and new extents of that block (this is why compression helps: it reduces the maximum extent size, so the issue is not as exacerbated).

It gets worse though. I've noticed that the backup import system currently only restores the primary volume's quota, not the state volumes' quota. Fixing this then causes another serious problem. Whilst in the previous example the image volume size is set to the default 10GiB, the actual data usage is whatever the image size is (for Ubuntu Focal it's approximately 4GiB), so there is some leeway before the quota is totally filled up. However, when exporting a VM to a non-optimised backup and then re-importing it, the full raw image file is written back to the subvolume. Combined with the fixed size.state quota restoration, this means that the disk file is effectively considered full from a BTRFS quota perspective. If a snapshot is then taken of that restored VM, any write to that VM will cause a CoW event and almost immediately reach the subvolume quota (less than 100MiB of writes need to occur before it is reached).

So we are in a tricky position: if we fix the backup restoration issue so that the size.state quota is set correctly, then any subsequent snapshot of the restored VM will very quickly cause the source VM's disk to fail with I/O errors as it hits the underlying BTRFS quota.
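A minimal demonstration of that snapshot-forced CoW accounting, assuming a scratch btrfs filesystem mounted at /mnt/scratch:

```
# Enable quotas on the scratch filesystem.
sudo btrfs quota enable /mnt/scratch

# Create a subvolume, fill a file, then snapshot it.
sudo btrfs subvolume create /mnt/scratch/vol
sudo dd if=/dev/urandom of=/mnt/scratch/vol/data bs=1M count=1024 conv=fsync
sudo btrfs subvolume snapshot /mnt/scratch/vol /mnt/scratch/vol-snap

# Rewrite half of the file in place. The snapshot still references the old
# extents, so the rewritten blocks must go to new extents (CoW), and the
# accounted usage grows even though the file size is unchanged.
sudo dd if=/dev/urandom of=/mnt/scratch/vol/data bs=1M count=512 conv=notrunc,fsync

sudo btrfs qgroup show -pcre --raw /mnt/scratch
```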
@stgraber this is effectively the same issue as LVM has for non-thin volume snapshots (where snapshots have to be created with a size that limits the total number of CoWs that can occur). For the LXD LVM driver, this has been addressed by creating the snapshot at the same size as the volume (effectively doubling the quota): https://github.com/lxc/lxd/blob/master/lxd/storage/drivers/driver_lvm_utils.go#L397-L408 We effectively need to do the same and account for the BTRFS snapshot CoW, but using BTRFS semantics of doubling the quota (which is not as nice as the LVM approach, as we cannot assign that additional quota just for CoW usage). I realise you said above:

> doubling the quota feels wrong, as users will (rightfully) expect the volume to not exceed the quota they set
But as we already have a precedent for this in the LVM driver (i.e. if I set an LVM volume to 10GiB and then take a snapshot, writes to the original LVM volume can now take up to 20GiB of space due to accounting for CoW of the snapshot), does this change your position?
Not really, as in the LVM case it still won't allow me to dump 10GiB of state data on top of the quota. In the btrfs case, if we use the approach of silently doubling the quotas and we have users who happen to start using stateful stop/snapshot, they will be allowed to exceed their quotas, potentially by tens of GiB, completely messing up any chance of doing proper resource control on the system (thinking of shared environments with restricted projects combined with project limits).
This is because we have conflated the state and disk file quotas by not using a separate subvolume (without quota) for the disk image file, whereas with LVM there are separate LVs for state and root disk data. In theory using a single subvolume made sense, but given how BTRFS does CoW accounting for quotas, using a separate subvolume, although a lot larger change, seems like the best approach to address this cleanly.
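A sketch of what that split layout could look like, using hypothetical paths (this is the proposal being discussed, not how the driver currently works):

```
# One subvolume for config/state, carrying the user-visible quota...
sudo btrfs subvolume create /pool/virtual-machines/v1
sudo btrfs qgroup limit 11104862208 /pool/virtual-machines/v1

# ...and a separate, unlimited subvolume holding only the raw disk image,
# so CoW extent churn on root.img cannot trip the state quota.
sudo btrfs subvolume create /pool/virtual-machines/v1.block
sudo btrfs qgroup limit none /pool/virtual-machines/v1.block
```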
It would still be papering over an upstream issue. Yes, doing two volumes would help a bit for the block case, but we'd still get that failure on the fs volume, as that one would still need a limit and so would hit the bug if ever snapshotted. Similarly, we could absolutely reproduce this issue with a container filesystem. I'm usually not very keen on papering over other people's bugs, especially if we can't take care of the entire issue in a consistent way. I still think the best we can do here is document the btrfs issue and let people decide what they want to do. For most I'm hoping it will be staying away from btrfs, while those who really want btrfs should probably consider compression.
OK, I will absolutely update the docs.
Even though LXD sets the BTRFS quota 100MiB (by default) larger than the VM disk file's maximum size, when the VM uses all of its disk space the underlying BTRFS filesystem sees that the referenced disk quota has been reached and prevents LXD from starting the VM, because it cannot write the backup file, even though there should be some space free.
It seems like the BTRFS quota isn't working the way we think it does.
See https://discuss.linuxcontainers.org/t/btrfs-issues-storage-pools-btrfs-empty-and-btrfs-quota-100-while-inside-the-vm-only-48-utilized/11897
Steps to reproduce:
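The setup step appears to have been stripped in extraction; presumably something like the following (pool name, image alias and size syntax are assumptions):

```
lxc storage create pool btrfs
lxc init images:ubuntu/focal v1 --vm -s pool
# Size in bytes; older LXD used `size 11000004608` (space-separated).
lxc config device set v1 root size=11000004608
```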
Check BTRFS quota set (expect it to be blockSize 11000004608 + 100MiB (104857600 bytes) = 11104862208 bytes):
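The original command block was stripped; presumably a qgroup query against the VM's subvolume (the snap path is an assumption):

```
# -F includes ancestral qgroups affecting the given path;
# -r/-e show referenced/exclusive limits.
sudo btrfs qgroup show -reF --raw \
    /var/snap/lxd/common/lxd/storage-pools/pool/virtual-machines/v1
```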
Check size of root disk file (expect it to be 11000004608 bytes):
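Presumably by listing the image file; the root.img name and path are assumptions based on how the btrfs driver lays out VM volumes:

```
ls -l /var/snap/lxd/common/lxd/storage-pools/pool/virtual-machines/v1/root.img
```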
Start the VM:
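```
lxc start v1
```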
Now inside v1 run until the disk fills up (should fill up the disk image but not reach BTRFS quota as it has another 100MiB allowed):
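The fill command was stripped; presumably a dd that exits with ENOSPC once the virtual disk is full:

```
# Run from the host; dd inside the guest stops when the disk is full.
lxc exec v1 -- dd if=/dev/urandom of=/root/fill bs=4M
```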
You can now see that the BTRFS referenced quota has been reached, which it shouldn't have been.
The disk image is still at the set size of 11000004608 bytes.
And the actual used blocks of the image are:
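The measurement command was stripped; presumably comparing apparent size with actually allocated blocks (path assumed as above):

```
IMG=/var/snap/lxd/common/lxd/storage-pools/pool/virtual-machines/v1/root.img
du --apparent-size --block-size=1 "$IMG"   # apparent (set) size
du --block-size=1 "$IMG"                   # blocks actually allocated
```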
Indeed the total size of the volume is less than the quota:
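Presumably checked with btrfs filesystem du (path assumed):

```
# Sums the on-disk usage of everything in the subvolume; expected to be
# below the 11104862208-byte qgroup limit even though the quota reports
# as reached.
sudo btrfs filesystem du -s \
    /var/snap/lxd/common/lxd/storage-pools/pool/virtual-machines/v1
```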