The overall idea of the new BG_TREE is pretty simple:
Put BLOCK_GROUP_ITEMS into a separate tree.
This brings one obvious enhancement:
- Reduce mount time of large fs
Although it could be possible to accept BLOCK_GROUP_ITEMS in either
tree (extent root or bg root), I'll leave that kernel convert as an
alternative to offline convert, as a next step if there is a lot of
interest in it.
So for now, if an existing fs wants to take advantage of the BG_TREE
feature, btrfs-progs will provide an offline conversion tool.
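As a minimal illustration of the idea (not code from the patchset): the
BLOCK_GROUP_ITEM key layout is unchanged, only the tree root it lives in
differs. The bg tree objectid below is a placeholder of my own choosing;
the real value is whatever the patchset reserves in the on-disk format.

    #include <stdint.h>
    #include <stdio.h>

    /* On-disk key triplet, as used throughout the btrfs on-disk format. */
    struct btrfs_disk_key {
            uint64_t objectid;  /* block group start, for BLOCK_GROUP_ITEM */
            uint8_t  type;      /* item type */
            uint64_t offset;    /* block group length, for BLOCK_GROUP_ITEM */
    };

    #define BTRFS_BLOCK_GROUP_ITEM_KEY  192
    #define BTRFS_EXTENT_TREE_OBJECTID  2ULL
    /* Placeholder; the real bg tree objectid is defined by the patchset. */
    #define BTRFS_BG_TREE_OBJECTID      11ULL

    int main(void)
    {
            /* Example block group: starts at 1GiB, 1GiB long. */
            struct btrfs_disk_key key = {
                    .objectid = 1ULL << 30,
                    .type     = BTRFS_BLOCK_GROUP_ITEM_KEY,
                    .offset   = 1ULL << 30,
            };

            /* The key is identical; only the tree searched differs. */
            printf("without bg-tree: search root %llu for (%llu %u %llu)\n",
                   (unsigned long long)BTRFS_EXTENT_TREE_OBJECTID,
                   (unsigned long long)key.objectid, key.type,
                   (unsigned long long)key.offset);
            printf("with bg-tree:    search root %llu for (%llu %u %llu)\n",
                   (unsigned long long)BTRFS_BG_TREE_OBJECTID,
                   (unsigned long long)key.objectid, key.type,
                   (unsigned long long)key.offset);
            return 0;
    }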
[[Benchmark]]
Physical device: NVMe SSD
VM device: VirtIO block device, backed by sparse file
Nodesize: 4K (to bump up tree height)
Extent data size: 4M
Fs size used: 1T
All file extents on disk are 4M in size, preallocated to reduce space usage
(as the VM uses loopback block device backed by sparse file)
Without patchset:
Use ftrace function graph:
7) | open_ctree [btrfs]() {
7) | btrfs_read_block_groups [btrfs]() {
7) @ 805851.8 us | }
7) @ 911890.2 us | }
btrfs_read_block_groups() takes 88% of the total mount time.
With the patchset, using the -O bg-tree mkfs option:
6) | open_ctree [btrfs]() {
6) | btrfs_read_block_groups [btrfs]() {
6) * 91204.69 us | }
6) @ 192039.5 us | }
open_ctree() now takes only 21% of the original mount time,
and btrfs_read_block_groups() accounts for only 47% of the total
open_ctree() execution time.
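For reference, these percentages follow directly from the durations in
the two traces; a quick sanity check (numbers copied verbatim from the
ftrace output above):

    #include <stdio.h>

    int main(void)
    {
            /* Durations in microseconds, from the ftrace output above. */
            double old_bg_read = 805851.8, old_open_ctree = 911890.2;
            double new_bg_read = 91204.69, new_open_ctree = 192039.5;

            /* ~88%: block group reading dominates the old mount. */
            printf("old bg read / old open_ctree:    %.0f%%\n",
                   100.0 * old_bg_read / old_open_ctree);
            /* ~21%: new mount time relative to the old one. */
            printf("new open_ctree / old open_ctree: %.0f%%\n",
                   100.0 * new_open_ctree / old_open_ctree);
            /* ~47%: block group reading within the new mount. */
            printf("new bg read / new open_ctree:    %.0f%%\n",
                   100.0 * new_bg_read / new_open_ctree);
            return 0;
    }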
The reason is pretty obvious when considering how many tree blocks need
to be read from disk:
- Original extent tree:
nodes: 55
leaves: 1025
total: 1080
- Block group tree:
nodes: 1
leaves: 13
total: 14
Not to mention that tree block readahead works pretty well for the bg
tree, as we will read every item.
Readahead for the extent tree, on the other hand, is just a disaster,
as the block group items are scattered across the whole extent tree.
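The 13-leaf figure is roughly what simple packing arithmetic predicts.
A minimal sketch, assuming the usual on-disk sizes (101-byte leaf
header, 25-byte item header, 24-byte block group item) and roughly one
1G data block group per GiB used, i.e. ~1024 block group items for 1T;
those assumptions are mine, not stated by the patchset:

    #include <stdio.h>

    int main(void)
    {
            const int nodesize    = 4096; /* -n 4K, as in the benchmark */
            const int leaf_header = 101;  /* sizeof(struct btrfs_header) */
            const int item_header = 25;   /* sizeof(struct btrfs_item) */
            const int bg_item     = 24;   /* sizeof(struct btrfs_block_group_item) */

            /* Assumption: ~1 block group per GiB used -> ~1024 items for 1T. */
            const int nr_block_groups = 1024;

            int items_per_leaf = (nodesize - leaf_header) / (item_header + bg_item);
            int leaves = (nr_block_groups + items_per_leaf - 1) / items_per_leaf;

            printf("items per 4K leaf: %d\n", items_per_leaf); /* ~81 */
            printf("leaves needed:     %d\n", leaves);         /* ~13 */
            return 0;
    }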
The reduction in mount time is already obvious even on a super fast NVMe
disk with memory cache.
It would be even more obvious if the fs were on spinning rust.
Signed-off-by: Qu Wenruo <wqu@suse.com>