-
Notifications
You must be signed in to change notification settings - Fork 4
Code documentation
Following is a list of kernel source files modified by the Next3 patches.
To better understand the changes in each file, one should follow the Next3 snapshot patches series.
The following kernel files have been changed to add the next3 file system:
- fs/Kconfig - add next3 file system kernel configuration option.
- fs/Makefile - add next3 file system folder to kernel build tree.
- include/linux/magic.h - add next3 file system magic number (same as ext3).
- include/linux/jbd.h - add task COW state, COW credits and COW statistic counters to transaction handle struct.
- include/linux/journal_head.h - add COW transaction id and modified bit to journal_head struct.
Most files under fs/next3 folder (except for the "Added Next3 snapshot files") have been cloned from fs/ext3 folder and from include/linux/ext3*.h include files.
The following files have been changed after they were cloned to support the next3 snapshot feature.
- fs/next3/Kconfig - add next3 snapshot support kernel configuration option.
- fs/next3/Makefile - add next3 snapshot files to next3 file system build.
- fs/next3/balloc.c - add snapshot delete hooks, reserve free blocks for snapshot use and update exclude bitmap on snapshot file block allocations/deletes.
- fs/next3/file.c - enforce snapshot file read-only access permissions.
- fs/next3/inode.c - snapshot file address space read-only ops, data write hooks and implementation of special snapshot file map commands.
- fs/next3/ioctl.c - entry point to snapshot control functions (by chattr -X).
- fs/next3/namei.c - reuse orphan inode list functions for snapshot list manipulation and restrict unlink of snapshot files on list.
- fs/next3/super.c - initialize and terminate snapshot code.
- fs/next3/next3_jbd.c - add snapshot metadata write hooks and trace remaining transaction COW credits.
- fs/next3/next3_jbd.h - add extra COW credits to all journal transactions.
- fs/next3/next3.h - add snapshot fields to next3 on-disk structs and define snapshot file status flags.
- fs/next3/next3_i.h - add snapshot fields to next3 in-memory inode struct.
- fs/next3/next3_sb.h - add snapshot fields to next3 in-memory super block struct.
The following files were added to the cloned next3 file system to implement the snapshot feature:
- fs/next3/buffer.c - add tracked read functions.
- fs/next3/snapshot.h - next3 snapshot macros and API.
- fs/next3/snapshot.c - next3 snapshot block access and COW functions.
- fs/next3/snapshot_ctl.c - next3 snapshot load, destroy and control functions.
- fs/next3/snapshot_debug.h - next3 snapshot debug macros.
- fs/next3/snapshot_debug.c - next3 snapshot debug functions.
Following is a list of kernel API's added or modified by the Next3 patches.
The interaction between the snapshot sub-system and the filesystem is mostly, but not exclusively, through these API's.
- next3_snapshot_load() - load snapshot list on filesystem mount.
- next3_snapshot_destroy() - free snapshot resources on filesystem umount.
The following functions are implemented in snapshot_ctl.c file and are called by User-kernel API functions to perform various snapshot administrative tasks.
- next3_snapshot_create() - allocate resources from a new snapshot.
- next3_snapshot_take() - take a current snapshot of the filesystem.
- next3_snapshot_enable() - enable user (read-only) access to snapshot.
- next3_snapshot_disable() - disable user access to snapshot.
- next3_snapshot_delete() - queue snapshot for deletion.
- next3_snapshot_update() - update snapshots status and cleanup deleted snapshots.
The following functions are called from various filesystem hooks to monitor the modification of snapshot protected blocks.
Some of the API names resemble the corresponding JBD block access API, which they are called from.
- next3_snapshot_get_write_access() - notify snapshot before modifying a metadata block.
- next3_snapshot_get_undo_access() - notify snapshot before modifying a block bitmap block.
- next3_snapshot_get_create_access() - notify snapshot before writing a new metadata block.
- next3_snapshot_get_move_access() - notify snapshot before modifying data blocks.
- next3_snapshot_get_delete_access() - notify snapshot before deleting blocks.
- next3_file_open() - grant restricted (read-only) access to snapshot files.
- next3_snapshot_readpage() - implementation of the VFS readpage operation on snapshot files.
- next3_no_writepage() - NOP implementation of the VFS writepage operation on snapshot files.
The following filesystem API's where extended, in order to accommodate the special snapshot files and implement new snapshot block operation.
- next3_free_branches_cow() - extends next3_free_branches() API to prune inode blocks selectively (according to COW bitmap). Several helper functions, such as next3_free_data(), where extended in a similar manner.
- next3_move_branches() - moves block from one (snapshot) file to another.
- next3_get_blocks_handle() - the boolean 'create' input argument was extended to pass special snapshot file block allocation commands such as 'move existing block' and 'allocate COW bitmap block'. Several helper functions, such as next3_alloc_branch(), where extended in a similar manner.
Following is a list of file system data structures manipulated by the snapshot code.
The data integrity rules SHOULD guarantee that file system data CAN NOT be corrupted by the snapshot code. They DO NOT guarantee the data integrity inside the snapshot read-only image.
The locking semantics SHOULD guarantee that data structure are manipulated by the snapshot code in a safe manner.
- Snapshot code MUST NOT change any data or metadata block, EXCEPT changes that are explicitly allowed by other data integrity rules.
- Snapshot code MAY change Code documentation#New_fields_and_bits_in_ondisk_structures (which are not used by ext2/ext3/ext4).
- Snapshot code MAY add Code documentation#New_fields_and_bits_in_inmemory_structures, as long as they are initialized properly.
- The Code documentation#New_snapshot_file_flags MAY ONLY be set on snapshot files.
- Snapshot code MAY indirectly change Code documentation#File_system_global_bitmaps_and_counters, when allocating blocks for active snapshot.
- Snapshot code MAY directly change Code documentation#Blocks_mapped_to_special_snapshot_inodes.
- Snapshot code MAY change the inode mapping of Code documentation#Blocks_moved_to_snapshot_files, given that the data of these blocks is about to become obsolete (overwritten or truncated completely).
Following are general rules about manipulating snapshot structures. Each individual structure in the sections below may have it's own specific rules.
The majority of the code in the snapshot_{ctl,debug}.c files is called from very few entry points in the code:
- {init,exit}_next3_fs() - calls {init,exit}_next3_snapshot() under BGL.
- next3_{fill,put}_super() - calls next3_snapshot_{load,destroy}() under VFS sb_lock, while the file system is not accessible to users.
- next3_ioctl() - only place that takes snapshot_mutex (after i_mutex) and only entry point to snapshot control functions.
- snapshot_mutex during snapshot control operations.
- VFS sb_lock during file system mount/umount time.
- Big kernel lock during module init time.
Snapshot COW code (in snapshot.c) is called from block access hooks during a transaction (with a transaction handle). This guaranties safe read access to s_active_snapshot, without taking snapshot_mutex, because the latter is only changed under journal_lock_updates() (while no transaction handles exist).
The transaction handle is a per task struct, so there is no need to protect fields on that struct (i.e., h_cowing, h_cow_*).
See also On-disk format page.
The inode number of next snapshot on the list. Used to be osd1.linux1.l_i_reserved1.
Overlaps with ext4 persistent inode version (l_i_version), but the next snapshot inode is a valid value for version. The in-memory version of snapshot file is their snapshot id. We use lsattr -v to display the snapshot id.
- Set by next3_do_update_inode() under inode_mutex when writing a snapshot inode to disk.
Snapshot file flags. The flags 0x1FF00000 are used in-memory, but only the flags 0x1F000000 are stored on-disk.
- Set by next3_snapshot_set_flags() under snapshot_mutex and inode_mutex when changing a snapshot file's status.
- Set by next3_snapshot_update() for all snapshots on the list under sb_lock on mount time and under snapshot_mutex after changing a snapshot file's status.
- s_snapshot_inum - inode number of active snapshot.
- s_snapshot_id - sequential ID of active snapshot.
- s_snapshot_r_blocks_count - reserved blocks for active snapshot future use (metadata only).
- Reset to 0 by tune2fs -O ^has_snapshot on cleanly un-mounted fs when converting back to Ext3.
- Set by next3_snapshot_take() under lock_super() and snapshot_mutex after a new snapshot was taken.
- s_snapshot_inum is reset to 0 by next3_snapshot_update() under lock_super() and snapshot_mutex after the last snapshot was deleted.
The inode number of the head of the on-disk snapshots list.
- Reset to 0 by tune2fs -O ^has_snapshot on cleanly un-mounted fs when converting back to Ext3.
- Set by next3_snapshot_create() under lock_super() and snapshot_mutex before taking a new snapshot.
- Reset to 0 by next3_snapshot_remove() under lock_super() and snapshot_mutex after the last snapshot was deleted.
- NEXT3_FEATURE_COMPAT_EXCLUDE_INODE - exclude inode that holds exclude bitmap blocks.
- Set ON/OFF by mke2fs/tune2fs on cleanly un-mounted fs with no snapshots.
- NEXT3_FEATURE_RO_COMPAT_HAS_SNAPSHOT - file system with snapshots.
- Set by mke2fs/tune2fs on a cleanly un-mounted fs.
- Set by next3_snapshot_take() under lock_super() and snapshot_mutex.
- NEXT3_FLAGS_IS_SNAPSHOT - identifies a read-only ext2 snapshot image.
- NEXT3_FLAGS_FIX_SNAPSHOT - reserved for future use. signal fsck that snapshot files needs to be checked.
- NEXT3_FLAGS_FIX_EXCLUDE - reserved for future use. signal fsck that exclude bitmap needs to be checked.
- FIX_EXCLUDE is set by snapshot code when a snapshot file block is not marked in exclude bitmap or when a non-snapshot file block or free block are marked in exclude bitmap.
- FIX_SNAPSHOT and FIX_EXCLUDE are cleared by tune2fs/e2fsck on a cleanly un-mounted fs.
Per block-group in-memory struct array.
- Allocated to maximum number of block groups (after resize) on mount.
- Freed on umount.
Caches the address of the block group exclude bitmap block, which is mapped by the "exclude" inode.
- Set by next3_group_add() under snapshot_mutex for new block group added on online resize.
- Set by next3_snapshot_init_bitmap_cache() under sb_lock on mount.
Caches the address of the block group COW bitmap block, which is mapped by the "active" snapshot file.
- Set by next3_snapshot_read_cow_bitmap() under sb_bgl_lock() on first block group access after snapshot take.
- Reset to 0 by next3_snapshot_reset_bitmap_cache() under lock_super() on snapshot take and under sb_lock on mount.
This mutex synchronizes snapshot control operations (such as take/delete/enable/disable). Both in-memory and on-disk snapshots list are manipulated with the snapshot_mutex held.
- Initialized on next3_fill_super() under sb_lock on mount time.
- Acquired on next3_ioctl() before snapshot control operations and before resizing the file system.
Holds a reference to active snapshot inode. The active snapshot has a positive reference count, so it is pinned to the inode cache.
- Initialized on next3_fill_super() under sb_lock on mount time.
- Set by next3_snapshot_load() under sb_lock on mount time.
- Set by next3_snapshot_take() under snapshot_mutex after a new snapshot was taken.
- Reset by next3_snapshot_update() under snapshot_mutex after the last snapshot was deleted.
- Reset by next3_snapshot_destroy() under sb_lock on umount time.
Head of the snapshot inodes list. All snapshots on the list have a positive reference count, so they are pinned to the inode cache.
- Initialized on next3_fill_super() under sb_lock on mount time.
- Set by next3_snapshot_load() under sb_lock on mount time.
- Set by next3_snapshot_create() under snapshot_mutex before a new snapshot is taken.
- Reset by next3_snapshot_remove() under snapshot_mutex after the last snapshot was deleted.
- Reset by next3_snapshot_destroy() under sb_lock on umount time.
The old field i_orphan is normally used to chain unlinked and truncated inodes on in-memory orphan list. Snapshots on the snapshot list cannot be unlinked nor truncated, so i_orphan is reused to chain the snapshot inodes on in-memory snapshot list. The Ext3 code uses the old field i_dtime to cache the inode number of the next orphan inode. Snapshot code uses the new field i_next to cache the inode number of the next snapshot inode.
- Initialized on allocation of new inode struct.
- Set by next3_snapshot_load() under sb_lock on mount time.
- Set by next3_snapshot_create() under snapshot_mutex before a new snapshot is taken.
- Set by next3_snapshot_remove() under snapshot_mutex after a snapshot was deleted.
- Reset by next3_snapshot_destroy() under sb_lock on umount time.
The h_cowing flag signals that the task is performing a snapshot COW operation. The h_cow_* counters collect statistics on the task's snapshot COW operations.
- The handle_s struct is a per task transaction handle, which requires no locking.
- h_cowing is set at the beginning of next3_snapshot_test_and_cow() and reset before exiting the function.
- h_cow_* counters are updated by next3_snapshot_test_and_cow() and reset when a new transaction handle is created.
The old field h_buffer_credits is used to keep count of remaining transaction credits. In Next3, the user requested transaction credits are multiplied by a factor to account for the extra snapshot COW operation credits. The new field h_base_credits is used to store the user requested credits. The new field h_user_credits is used to keep count of remaining user credits (requested by user - used by user).
- Set by next3_journal_start,restart,extend,release,dirty_metadata() after updating h_buffer_credits.
The old flag b_modified signals that the buffer was modified during the current transaction. The buffer may have been modified during a COW operation or not during a COW operation. The new flag b_user_modified signals that the buffer was modified during the current transaction, not during a COW operation.
- Set by next3_journal_dirty_metadata() under jbd_lock_bh_state() when a metadata buffer is modified, not during a COW operation.
Cache the id of the last transaction in which the buffer was COWed.
- Set by next3_snapshot_test_and_cow() under jbd_lock_bh_state() when buffer is COWed.
This flag signals that the buffer points to a snapshot file page which reads through a hole in active snapshot to the block device. The snapshot code tracks the a-sync I/O of these buffers to resolve read-COW race conditions (next3_snapshot_race_read.patch).
- Set by next3_snapshot_get_block() under lock_buffer() along with taking 0x10000 reference counts on block device page buffer.
- Reset by end_buffer_async_read() under lock_buffer() along with dropping 0x10000 reference counts on block device page buffer.
- end_buffer_tracked_read() also clears the buffer mapped flag to prevent reading the block returned by next3_snapshot_get_block() again without tracking the read.
NEXT3_SNAPFILE_FL - identifies a snapshot file, which has special read-only address space ops.
It cannot be changed on regular files, but only on directories, so a snapshot file must be created inside a snapshots directory. A snapshot file cannot be opened for write by users, so only the snapshot code can modify snapshot files data and metadata. Snapshot control operations are performed by privileged users on snapshot files by modifying snapshot file status flags.
NEXT3_SNAPFILE_LIST_FL - identifies a snapshot file on the in-memory snapshots list.
The snapshot list flag is not persistent. It is set on in-memory inode loaded by next3_snapshot_load(). It cannot be modified directly by the user, but only used to signal snapshot take or snapshot delete operations. Trying to set the list flag on an empty snapshot file signals a snapshot take operations. Trying to remove the list flag from a snapshot file signals a snapshot delete operations. The actual removal of snapshot from the list is usually deferred to a later cleanup time.
- Set by next3_snapshot_load() for all inodes on the on-disk snapshot list under sb_lock on mount time.
- Set by next3_snapshot_create() along with adding inode to snapshot list under snapshot_mutex and inode_mutex on start of snapshot take.
- Cleared by next3_snapshot_update() along with removing inode from snapshot list under snapshot_mutex on cleanup of deleted and unused snapshot.
- Cleared by next3_do_update_inode() under inode_mutex when writing a snapshot inode to disk.
The following snapshot flags may be modified by privileged users:
- NEXT3_SNAPFILE_ENABLED_FL - snapshot file read access enabled.
- NEXT3_SNAPFILE_TAGGED_FL - user tag for general snapshot management purpose.
- Set by next3_snapshot_set_flags() under snapshot_mutex and inode_mutex.
- NEXT3_SNAPFILE_ACTIVE_FL - identifies the active snapshot file (last snapshot taken).
- NEXT3_SNAPFILE_OPEN_FL - snapshot file is open for read (mounted via loop device).
- NEXT3_SNAPFILE_INUSE_FL - cannot be removed from list (has blocks that are in-use by older snapshots).
- NEXT3_SNAPFILE_DELETED_FL - marked for deletion on next snapshot list cleanup.
- NEXT3_SNAPFILE_SHRUNK_FL - unused block were freed on previous snapshot list cleanup.
- Set by next3_snapshot_update() under snapshot_mutex after snapshot control operations.
The file system has some global bitmaps and counters that are modified indirectly by snapshot COW operations. Whenever a block is allocated for the active snapshot file, the block bitmap is updated along with the exclude bitmap.
The following counters are affected by snapshot COW operations that allocate new active snapshot blocks:
- next3_group_desc.bg_free_blocks_count
- next3_super_block.s_free_blocks_count
- next3_sb_info.s_freeblocks_counter
Blocks that are mapped to snapshot files are not modifiable by users, because:
- snapshot files are read-only.
- snapshots on the list cannot be unlinked nor truncated by users.
- all snapshot file blocks are freed before a snapshot is removed from the list.
- Active snapshot blocks are allocated, modified and journaled (data blocks are ordered) by next3_snapshot_test_and_cow() under various snapshot race protections (next3_snapshot_race.patch).
- Deleted snapshot blocks are moved or freed by next3_snapshot_shrink,merge,remove() under snapshot_mutex.
The exclude inode is a regular file with a special reserved inode number (NEXT3_INO_EXCLUDE). It's data blocks are the exclude bitmap blocks. This is very similar to the Ext3 "resize" special inode, whose blocks are the reserved GDT blocks for file system resize.
Exclude inode blocks are allocated and initialized by tune2fs/mke2fs/resize2fs on an un-mounted fs. The special exclude inode is not accessible to users because it has no directory entry.
- Exclude bitmap blocks are modified and journaled (ordered) by next3_snapshot_test_and_cow() with atomic set_bit operation.
When truncating or unlinking regular files and directories, their blocks are unmapped from their inode and cleared from the block bitmap. Before the blocks are cleared from block bitmap, they may be mapped to the active snapshot inode. In that case, they are not cleared from block bitmap and they are set in the exclude bitmap.
- The un-mapping of blocks from their inode and mapping of these blocks to the active snapshot is journaled together in the same transaction.
Before writing to a regular file, missing data blocks are allocated and mapped to the file's inode. Before overwriting an entire allocated data block, the block may be mapped to the active snapshot inode. In that case, a new data block is allocated and mapped to the file's inode in place of the moved block.
- The mapping of the new block and remapping of the old block to the active snapshot is journaled together in the same transaction.
- If new block allocation fails, the old block is not moved to active snapshot and the write operation fails.