Permalink
Cannot retrieve contributors at this time
Join GitHub today
GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together.
Sign upbtrfs-dev-docs/fsid-change.txt
Go to file| BTRFS Metadata UUID (FSID change) feature | |
| SCOPE | |
| ================================== | |
| The most important requirement for this feature is to ensure that while FSID | |
| change is in progress a sudden power loss won’t lead to unrecoverable | |
| filesystem upon next boot. This document presents analysis of the failure modes | |
| and their automatic handling inside the kernel. | |
| This analysis looks at only the use case when a FSID is initiated on a regular | |
| btrfs filesystem (ie. one that is not a clone of an existing btrfs filesystem). | |
| Implementation | |
| =============================================== | |
| Since btrfs cannot guarantee that superblock writes can occur atomically across | |
| multiple devices it’s important to work around this deficiency. The proposal | |
| set forth in this document is to introduce 2 flags in the btrfs super: | |
| IP (IN PROGRESS) - it signals that FSID is in progress it’s purposes will be | |
| explained later | |
| C (Complete) - this flag will signal that filesystem change has already been | |
| completed. | |
| All things considered I envision the following sequence of operations: | |
| 1. The IP flag is set in transaction X and applied to all disks. | |
| 2. Then in transaction X+1 the C flag (this will be an incompat bit) will be set, | |
| current FSID will be copied to the metadata_uuid field and a new FSID is going | |
| to replace the current fsid, at this point the operation is considered complete. | |
| Power loss can happen anytime before, during or after either of trans X or | |
| trans X+1. Let’s look the failures in each of those cases: | |
| 1. Failure before transaction X is committed - not a problem since nothing is | |
| committed on disk, user just has to restart operations via btrfstune | |
| 2. Failure during transaction X - this means that transaction X will be | |
| persisted on number of N-out-of-M (N < M) disks for a filesystem consisting | |
| of M disks. This will in essence create 2 partitions P and Q, where disks | |
| in P will have the IP flag and disks in Q won’t and the super block | |
| generation of disks in P will be greater than the one in disks in Q. Since | |
| disks in btrfs are scanned arbitrarily either disks from P or Q can be | |
| scanned first hence create the initial fs_devices structure to represent | |
| the filesystem. This needs to be handled. | |
| 2.1 Let’s consider the first case where a disk from partition P is | |
| scanned first. In this case the scan code will detect that IP is set | |
| but since existing FSID is not changed so the fs_devices struct will be | |
| created with the old FSID identifying the filesystem. Then all other | |
| disks in P that are scanned will follow the same logic and if they | |
| already detect an fs_devices struct will just be added to it. Also | |
| their IP flags will be cleared upon next transaction commit. Following | |
| this, if a disk from partition Q are scanned they will just find the | |
| existing fs_devices and will be added to it and the IP flag will be | |
| cleared on mount. Additionally fs_devices created will be marked as | |
| having been created from disks with the IP flag set. This will be | |
| required in order to handle case 4.2 (see below). | |
| 2.2 The second case to consider is a disk from partition Q being | |
| scanned first. In this case fs_devices will be created as usual (using | |
| the unchanged FSID) then all other Q disks will be added to this | |
| fs_devices and their IP flag is going to be cleared. On the other hand | |
| all disks from partition P will detect their IP flag and since FSID is | |
| not changed they can just scan existing fs_devices for one which matches | |
| their FSID. One important thing to heed is that during mount btrfs’ | |
| mount code always users the superblock from the device with the highest | |
| generation. Disk in P will have a higher generation number as such when | |
| a filesystem is mounted a disk in set P will be used. In order to | |
| preserve consistency the IP flag must be cleared. | |
| 3. Failure after transaction X/before transaction X+1 - this will be handled | |
| the same way as in case 2, i.e IP flag will be cleared on mount and the | |
| filesystem will be created with the unchanged FSID. | |
| 4. Failure during transaction x + 1. When such a failure happens the | |
| filesystem in question will be partitioned in two sets P and Q. | |
| Q would have just the old fsid and the IP flag, also the superblock’s generation | |
| number of disks in P will be higher than those in Q. | |
| There are two cases of P here accoring to the new FSID value: | |
| 4.1 The new fsid value written to P differs from the old metadata_uuid value. | |
| Where P would have the C flag and a new value for fsid written to it as | |
| well as the old FSID value written in the ‘metadata_uuid’ field. | |
| In contrast Again two cases needs to be handled: | |
| 4.1.1 The first one is a disk from partition P being scanned initially. | |
| This means that fs_devices will be created with differing | |
| fsid/metadata_uuid values. | |
| Subsequently, when other disks from P are scanned they will just | |
| be added as normally. The interesting case is when a disk from partition | |
| Q are scanned. In this case the code should see that the superblock in | |
| question has the IP flag set and at the same time will see that there | |
| is already a fs_devices struct which has differing metadata/fsid members | |
| AND the metadata_uuid matches that of FSID in disk in Q. In this case | |
| the disk will be added to the filesystem, same logic applies to rest | |
| of disks in partition Q upon next transaction commit their SB will be | |
| overwritten with the correct information. | |
| 4.1.2 Disks in Q are handled the same way as in scenario 2.1. The main | |
| difference is how disks in P are handled. When such disks are scanned | |
| they will have differing metadata/fsid uuid, however they will detect | |
| that an fs_devices exist such that the metadata of the disk matches the | |
| metadata/fsid of the fs_devices AND the IP flag is set in fs_device | |
| indicating that it was created from a superblock that had the IP flag | |
| set. This will be an indication to the scanning code that it needs to | |
| add the disk, additionally since disks in P will have a higher | |
| superblock generation number than disk in Q upon adding such a disk to | |
| a fs_devices it needs to overwrite the fsid/metadata details, since | |
| upon mount a disk in P will be used. | |
| 4.2 The new fsid value written to P is same with the old metadata_uuid | |
| value. Now P's 'metadata_uuid' field is cleared and the field to | |
| indentify by is only 'fsid'. P does not have the C flag or the IP flag | |
| either. Two cases needs to be handled: | |
| 4.2.1 The first one is a disk from partition P being scanned initially. | |
| This means that fs_devices will be created with same | |
| fsid/metadata_uuid values. Subsequently, when other disks from P | |
| are scanned they will just be added as normally. | |
| The interesting case is when a disk from partition Q are scanned. | |
| In this case the code should see that the superblock in | |
| question has the IP flag set and at the same time will see that there | |
| is already a fs_devices struct which has same metadata/fsid members | |
| AND the metadata_uuid matches that of FSID in disk in Q. In this case | |
| the disk will be added to the filesystem, same logic applies to rest | |
| of disks in partition Q upon next transaction commit their SB will be | |
| overwritten with the correct information. | |
| 4.2.2 Disks in Q are handled the same way as in scenario 2.1. The main | |
| difference is how disks in P are handled. When such disks are | |
| scanned first, fs_devices will have differing metadata/fsid | |
| uuid. However they will detect that an fs_devices exist such that | |
| the fsid of the disk matches the metadata of the fs_devices | |
| AND the IP flag is set in fs_device indicating that it was created from | |
| a superblock that had the IP flag set. | |
| This will be an indication to the scanning code that it needs to | |
| add the disk, additionally since disks in P will have a higher | |
| superblock generation number than disk in Q upon adding such a disk to | |
| a fs_devices it needs to overwrite the fsid/metadata details, since | |
| upon mount a disk in P will be used. | |
| 5. Failure after transaction X + 1 - the filesystem is already changed so | |
| no need to do special handling | |
| 6. It is possible that there exists such a super block that has both IP and | |
| C flags set. This can occur in situation where a disk has undergone a | |
| successful metadata uuid change so it has the correct C flag set. When a | |
| subsequent FSID is initiated on such a fs the IP flag will be set as part | |
| of the new operation. During a failure of transaction X then we can end up | |
| with disks comprising this filesystem partitioned in two sets P and Q. | |
| Disks in P will have both C and IP flags set and disks in Q will have only | |
| C flag and their fsid will be different than that of disks in Q. Disks in | |
| both partition will have identical metadata_uuid. The 2 cases shall be | |
| handled as follows: | |
| 6.1 When a disk in P comes up it should scan the existing registered | |
| fs_devices for one which has differing fsid/metadata_uuid AND whose | |
| metadata uuid is equal to that of the disk and should join them. If none | |
| such is found then a new fs_devices should be created with the IP flag | |
| set. Subsequently as disks in Q come in they should scan for such an | |
| fs_device that has the IP flag set, whose metadata_uuid equal that of | |
| the scanned disk and whose fsid/metadat_uuid members differ. This will | |
| be a sign that this disk must be joined in the respective fs_devices | |
| and must also replace the fsid/metadata_uuid in struct btrfs_fs_devices | |
| to its own values. | |
| 6.2 When a disk in Q is scanned first, it will create fs_devices as | |
| usual. Then as disks from P appear they will scan the existing | |
| fs_devices for one with a matching metadata_uuid and a differing | |
| metdata_uuid/fsid in the fs_devices and should join that fs_devices | |
| collections. | |