Permalink
Cannot retrieve contributors at this time
Join GitHub today
GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together.
Sign upbtrfs-dev-docs/trees.txt
Go to file| Btrfs trees | |
| =========== | |
| BTRFS stands for B-tree filesystem. Naturally this means that the filesystem is | |
| organised as a series of b-trees. In fact the code to deal with btrees in btrfs | |
| is generic and is oblivious of the type of items being stored. Generally the | |
| BTRFS btree implementation knows of only 3 structures: keys, items and block | |
| headers. Trees are stored in nodes, each of with belong to a level in the | |
| b-tree structure. Internal nodes contain references to other internal nodes on | |
| the next level, or to leaf nodes. Leaf nodes are considered to be at level 0. | |
| Leaf nodes contain various types of data structures, depending on the tree. | |
| Data structures | |
| --------------- | |
| Btree keys | |
| ~~~~~~~~~~ | |
| The filesystem is composed of objects, each of which has an abstract 64bit | |
| object id. When an object is created, a previously unused object id is chosen | |
| for it. The object id makes up the most significant bits of the key, allowing | |
| all of the items for a given filesystem object to be logically grouped | |
| together in the B-tree. The 'type' field describes the kind of data held by an | |
| item; an object typically comprises several items. The offset field describes | |
| data held in an extent. | |
| Every object in the FS has a corresponding key with the following definition: | |
| struct btrfs_disk_key { | |
| u64 objectid; | |
| u8 type; | |
| u64 offset; | |
| } | |
| Btree internal nodes | |
| ~~~~~~~~~~~~~~~~~~~~ | |
| All nodes which do not contain actual data and whose level is higher than 0 | |
| are internal nodes or sometimes referred to as simply "nodes". They are | |
| represented as a struct btrfs_node. Conceptually a node consists of a header, | |
| followed by an array of btrfs_key_ptr structures: | |
| | block header | key1 | key2 | keyN | | |
| The block header is identical to the block header of a leaf node (see below). | |
| Each key is represented by the following struct: | |
| struct btrfs_key_ptr { | |
| /* This is the lowest key in the block pointed to by this pointer */ | |
| struct btrfs_disk_key key; | |
| /* Logical address of the block pointed at */ | |
| __le64 blockptr; | |
| /* Is the generation this block pointer was modified */ | |
| __le64 generation; | |
| } | |
| Following invariant can be extrapolated from that: | |
| 1. For a given key pointer in a node the actual btrfs keys which can be found | |
| in it are defined by the : | |
| a) btrfs_key_ptr.key - the starting key | |
| b) btrfs_key_ptr.key of the next key pointer - the end key in the range | |
| Btree leaf nodes | |
| ~~~~~~~~~~~~~~~~ | |
| Data in btrfs is always contained in the leaf nodes of the btrees. Those nodes | |
| always have a level of 0 and are simply referred to as "leaves". | |
| Conceptually a leaf node will have the following structure: | |
| | block header | I0 | I1 | I2 | free space | D2 | D1 | D0 | | |
| Every tree block (leaf or node) starts with the following structure: | |
| struct btrfs_header { | |
| /* Checksum of everything after this field (from 0x20 to the end of the node) */ | |
| u8 csum[BTRFS_CSUM_SIZE]; | |
| /* FS specific uuid */ | |
| u8 fsid[BTRFS_FSID_SIZE]; | |
| /* which block this node is supposed to live in (Logical address of the node) */ | |
| __le64 bytenr; | |
| /* Various flags */ | |
| __le64 flags; | |
| /* allowed to be different from the super from here on down */ | |
| u8 chunk_tree_uuid[BTRFS_UUID_SIZE]; | |
| /* Transaction id at time when this block was modified */ | |
| __le64 generation; | |
| /* The id of the tree that contains this node */ | |
| __le64 owner; | |
| /* Number of items this node houses */ | |
| __le32 nritems; | |
| /* Level at which this nodes resides (always 0 for leaf nodes) | |
| u8 level; | |
| } | |
| Following the block header there is an array of btrfs items, which are | |
| represented by the following structure: | |
| struct btrfs_item { | |
| struct btrfs_disk_key key; | |
| /* Offset of the data portion of this item in the btree node */ | |
| __le32 offset; | |
| /* Size of the data portion of this item in the btree node */ | |
| __le32 size; | |
| } | |
| In order to arrive at a particular item, the first thing that has to happen is | |
| locate the item. This is done with the help of the btrfs_disk_key strucutre. | |
| It essentially represents the logical address of an item. So an item is really | |
| just a key with additional offset and size information. | |
| Item data in btrfs is variably sized, and each item in the beginning of the | |
| b-tree node is paired with the respective data portion. This leads to the item | |
| array in the bginning and the reverse sorted data array at the end of the | |
| node. Those 2 array grow toward each other. | |
| Trees | |
| ----- | |
| A filesystem is comprised of a forest of trees. Currently there are 10 | |
| distinct trees: | |
| Root tree | |
| ~~~~~~~~~ | |
| This tree indexes the B-trees making up the filesystem. Essentially each root | |
| in this tree is the starting point for one of the trees below. Additionally, | |
| there could be items which represent ordinary files and are normally found in | |
| the FS tree. This is due to the old/default freespace cache being implemented | |
| as files, yet they are invisible to the user and are used internally by the | |
| filesystem to quickly find freespace in the filesystem. | |
| Extent allocation tree | |
| ~~~~~~~~~~~~~~~~~~~~~~ | |
| This tree tracks allocated extents in extent items. Each of those items | |
| describe a particular contiguous on-disk area. Those items also holds all | |
| back-references to an extent. This allows moving an extent if needed or | |
| recovering from a damaged disk block. Taken as a whole, back-references | |
| multiply the number of filesystem disk pointers by 2. For each forward | |
| bpointer, there is exactly one back-pointer. | |
| There could be many references to an extent, each addressing only part of it. | |
| For example, consider file foo, which has an on-disk extent 100KiB–128KiB. | |
| File foo is cloned creating file bar. Later on, a range of 10KiB is | |
| overwritten in bar. This could cause the following situation. | |
| There are three pointers into extent 100–128KiB, covering different parts of | |
| it. The extent-item keeps track of all such references, to allow moving the | |
| entire extent at a later time. An extent could potentially have a large number | |
| of back references, in which case the extent-item does not fit in a single | |
| B-tree leaf node. In such cases, the item spills and takes up more than one | |
| leaf. A back reference is a logical hint - it is not a physical address. It is | |
| constructed from the root object id, generation id, tree level, and lowest | |
| object-id in the pointing block. This allows finding the pointer after a | |
| lookup traversal starting at the root object-id. In this example, extent | |
| 100–128KiB has three back references, one from foo, and two from bar. The back | |
| references also serve as the ref count; when it reaches zero, the extent can | |
| be freed. | |
| An important operation on the extent-tree is finding all references into a | |
| disk area. The extent items are sorted by start address, and there are no | |
| overlaps between items. This allows doing a range query on the B-tree, | |
| resulting in all extents in the range. We can then extract all the back | |
| references and find all pointers to this disk area. This is a crucial | |
| operation used for moving data out of an area. There could be multiple | |
| reasons for this: garbage collection, device removal, file system resize, RAID | |
| rebalance, and so on. | |
| This tree also holds btrfs_block_group_items, which describe allocated chunks. | |
| Chunk tree | |
| ~~~~~~~~~~ | |
| The chunk tree holds 2 types of items - btrfs_chunk and btrfs_dev_item. Those | |
| items are used to resolve physical<=>logical mapping. Chunk items hold the | |
| logical to physical translation and dev_item describe physical devices, | |
| comprising this filesystem. | |
| Device tree | |
| ~~~~~~~~~~~ | |
| The device tree stores btrfs_dev_extent items. They describe allocated portions | |
| of the physical disk space. Device stats items are also found in this tree. | |
| Filesystem/subvolume tree | |
| ~~~~~~~~~~~~~~~~~~~~~~~~~ | |
| This tree stores user visible files and directories. Each subvolume is | |
| implemented by a separate tree. Subvolumes can be snapshotted and cloned, | |
| creating additional B-trees. The roots of all subvolumes are indexed by the | |
| tree of tree roots. | |
| Checksum tree | |
| ~~~~~~~~~~~~~ | |
| This tree holds one checksum item per allocated extent. The item itself | |
| contains a list of checksums per sector in the extent. | |
| Quota tree | |
| ~~~~~~~~~~ | |
| Quota tree is used to hold information about qgroups, which define the usage | |
| limits of a particular subvolume. The tree contains 4 types of items: | |
| * status items - they contains global information about qgroup state such as | |
| whether it's enabled and consistent | |
| * qgroup info item - this item holds accounting information about current space | |
| usage | |
| * qgroup limit item - configured limits reside in this item | |
| * quota relation item - this item describes a parent<->child relationship | |
| between 2 qgroups. | |
| UUID tree | |
| ~~~~~~~~~ | |
| The uuid was not part of the initial btrfs design. It was added later as a | |
| performance optimisation to be used during subvolume/snapshot send/receive. It | |
| is essentially a persistent cache of subvolume id <=> uuid mappings. Items in | |
| this tree store the 128-bit UUID as a two 64 bit portions. The | |
| btrfs_disk_key's objectid field contains the lower 64 bits and the offset | |
| field contains the topmost 64 bits field. The data portion of an item contains | |
| a single 64 bit value which is the subvolume id. | |
| Relocation tree | |
| ~~~~~~~~~~~~~~~ | |
| Rebalancing operations (shrinking, moving chunks) require extents to be | |
| relocated. However, doing a simple copy-on-write of the relocating extent will | |
| break sharing between snapshots and consume disk space. To preserve sharing, | |
| an update-and-swap algorithm is used, with a special relocation tree serving as | |
| scratch space for affected metadata. The extent to be relocated is first copied | |
| to its destination. Then, by following backreferences upward through the | |
| affected subvolume's file system tree, metadata pointing to the old extent is | |
| progressively updated to point at the new one; any newly updated items are | |
| stored in the relocation tree. Once the update is complete, items in the | |
| relocation tree are swapped with their counterparts in the affected subvolume, | |
| and the relocation tree is discarded. | |
| Log tree | |
| ~~~~~~~~ | |
| An fsync is a request to commit modified data immediately to stable storage. | |
| Fsync-heavy workloads (like a database or a virtual machine whose running OS | |
| fsyncs intensively) could potentially generate a great deal of redundant write | |
| I/O by forcing the file system to repeatedly copy-on-write and flush frequently | |
| modified parts of trees to storage. To avoid this, a temporary per-subvolume | |
| log tree is created to journal fsync-triggered copy-on-writes. Log trees are | |
| self-contained, tracking their own extents and keeping their own checksum | |
| items. Their items are replayed and deleted at the next full tree commit or | |
| (if there was a system crash) at the next remount. | |
| Root log tree | |
| ~~~~~~~~~~~~~ | |
| This is a b-tree which tracks all the existing log trees. So for example if a | |
| filesystem has 10 subvolumes, and in turn on each one of those there is at | |
| least one file fsynced, then there are going to be 10 Log trees and the Root | |
| log tree is going to point to each one of them. | |
| Free space tree | |
| ~~~~~~~~~~~~~~~ | |
| The free space tree is a relatively new addition to BTRFS and wasn't part of | |
| the original design. The idea behind is to provide better scalability than the | |
| quaint free space cache, improve the interface and make use of all the | |
| benefits of b-trees : checksumming, RAID handling and well-understood behavior. | |
| It contains items which describe the free space the file system has, this | |
| makes it efficient for BTRFS to quickly find available space when writing data | |
| to disk. To ensure consistency and proper operation the free space tree is | |
| updated in tandem with the extent tree. | |
| References: | |
| ACM Reference Format: | |
| Rodeh, O., Bacik, J., and Mason, C. 2013. BTRFS: The Linux B-tree filesystem. ACM Trans. Storage 9, 3, | |
| Article 9 (August 2013), 32 pages. | |
| DOI:http://dx.doi.org/10.1145/2501620.2501623 |