Permalink
Cannot retrieve contributors at this time
Join GitHub today
GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together.
Sign upbtrfs-dev-docs/delalloc.txt
Go to file| Delayed allocation (delalloc) | |
| ============================= | |
| Most filesystems resorts to all kinds of trickery to ensure performance. One of | |
| those tricks is called delayed allocation. This mean the filesystem goes to | |
| great lengths to try and delay as much as possible the moment calling into the | |
| extent allocator. BRFS makes no exception. Due to the nature of implementation | |
| the code involved in this operation is a bit involved. The idea of this document | |
| is to illuminate how delayed allocations (delallocs) are implemented in btrfs. | |
| The complexicty stems primarily from the fact that BTRFS relies on a lot of | |
| call back calls and asynchornicity to it's hard to follow the code. | |
| Buffered write path | |
| ------------------- | |
| Delallocs are created in the buffered write path (btrfs_buffered_write). That | |
| function is fairly involved but the gist really is: | |
| 1. Allocate enough space to satisfy the space being written. This is achieved | |
| via btrfs_delalloc_reserve_metadata. This function ensures there is enough | |
| space for metadata (checksum bytes and data structures describing the extents | |
| being written). | |
| 2. A number of pages necessary to house the number of bytes being written are | |
| prepared by means of two functions: prepare_pages allocates the pages in the | |
| pagecache tree of the file and lock_and_cleanup_extent_if_need ensures that | |
| if there are any pending writes to the range of the file being written they | |
| are completed. | |
| 3. Those pages are filled in with the new data via btrfs_copy_from_user. | |
| At this point we have an inode and a range of page cache pages, corresponding | |
| to the range of the file being currently written. No btrfs data structures have | |
| yet been created for those "dirty" pages. Since we are using delalloc we've | |
| only reserved space for the data/metadata but still don't have an idea where | |
| on disk the data is going to be persisted. So there needs to be a way to tell | |
| btrfs "those pages need something done to them in order to actually persist" the | |
| data. This final step of the process is implemented via: | |
| 4. btrfs_dirty_pages. What this function does is to ensure that for any | |
| writes after EOF in-memory copy of the extent data structures are created | |
| (btrfs_find_new_delalloc_bytes). After that btrfs_set_extent_delalloc is | |
| called which sets the EXTENT_DELALLOC bit to the newly written range of the | |
| file. Finally all the newly written pages have they dirty flag set. This implies | |
| that they are going to be written back to disk when the writeback timer expires | |
| or if the system is low on memory. | |
| Despite the buffered writeback code being messy, it's really those 4 logical | |
| steps which trigger the delalloc logic. Bear in mind that we still haven't | |
| called into the allocator to persist data i.e stuff is delayed. | |
| Until the pages are presisted any reads to the region which has been written | |
| will be served directly from the pages currently present in memory. | |
| Page writeback | |
| -------------- | |
| When the system is either low on memory or the writeback time has expired | |
| pag writeback is triggered. There is a substantial amount of generic code which | |
| is invoked before the filesystem is eventually called. The entry point into the | |
| fs is the ->writepages function pointer, which is set to btrfs_writepages, this | |
| is really a thin wrapper over extent_write_cache_pages. The last function goes | |
| through every page of the inode passed (i.e. the address_space argument) and | |
| calls __extent_writepage on it. __extent_writepage is responsible for allocating | |
| all the necessary on-disk data structures describing the to-be-written page | |
| and submitting the page for writeback. This is performed in the following | |
| sequence: | |
| 1. writepage_delalloc is called. This function goes through the io_tree of the | |
| inode whose page we are writing and finds a range marked with DELALLOC starting | |
| at the offset in the written page. This is done by calling find_lock_delalloc_range. | |
| Remember that the range was initialised by the buffered write path. Once such | |
| a range is found, as defined by delalloc_start/delalloc_end variables they are passed to the | |
| btrfs_run_delalloc_range(). This function pointer really points to | |
| run_delalloc_range, which is a simple dispatch function. The purpose of the | |
| dispatch function is to decide which routine to call to "fill the delalloc | |
| range". The decision is based on whether NODATACOW is enabled (run_delalloc_nocow | |
| is called in this case) or whether the BTRFS_INODE_PREALLOC flag is set, signalling | |
| there are preallocated extents (also handled by run_delalloc_nocow). Or whether | |
| compression is needed (in this case cow_file_range_async is called). Finally if | |
| none of the above is true - cow_file_range is called. | |
| For the sake of simplicity only cow_file_range is going to be considered in this | |
| article. The basic idea is to allocate extents via btrfs_reserve_extent on disk | |
| for the range. Following a successful reservation ordered data struct is | |
| allocated via btrfs_add_ordered_extent. It's used to track the state of | |
| pending io for a particular range. More on this will be discussed in the End IO | |
| path section. | |
| So let's stop for a moment and consider where we are in the lifetime of the | |
| data. So now we have dirty pages created by the buffered write. What just | |
| happened is appropriate extents were allocated for the range of bytes which the | |
| currently written page represents and appropriate ordered data structs were | |
| allocated to track those extents. What remains is for actual IO on the page be | |
| triggered. | |
| 2. __extent_writepage_io is called to trigger io. This function really loops | |
| until we've submitted write requests for all extents within the requested range. | |
| The sequence of submission consists of : | |
| 1. btrfs_get_extent - to get the newly created extents by btrfs_run_delalloc_range | |
| and use that extent information to calculate the amount that needs | |
| to be written. | |
| 2. Once all of this is done, submit_extent_page is called. This function | |
| builds the require 'bio' structs and sets endio routine to | |
| end_bio_extent_writepage. | |
| <TODO: Discuss btrfs_writepage_cow_fixup and | |
| btrfs_writepage_endio_finish_ordered> | |
| End IO path | |
| ----------- | |
| Upon completion of data write end_bio_extent_writepage end_io call back is | |
| called. This function calls end_extent_writepage for every segment (page) in | |
| this bio. end_extent_writepage is responsible for performing any fs-specific | |
| tasks in order to complete the io. This is implemented mainly via the | |
| writepage_end_io_hook, which is really btrfs_writepage_end_io_hook. This | |
| function ensures that the appropriate metadata is persisted on disk. First | |
| btrfs_dec_test_ordered_pending is called, its purpose is to search the | |
| tree of ordered extents (added in "Page writeback" section) and subtract the | |
| amount of io this write completed. Then based on the whether all the pending | |
| io bytes are completed (btrfs_ordered_extent->bytes_left == 0) this function | |
| returns 1 (all io completed) or 0 (io still pending). | |
| So the reason why we need ordered extents is that if we want to write something | |
| in the range of 2-4mb we could potentially queue multiple bios to fulfill | |
| this request. Upon every bio completion the generic code will eventually end up | |
| calling btrfs_dec_test_ordered_pending. Most of the time this function will | |
| return 0, since there will be more pending writes. So by returning 0 from | |
| btrfs_dec_test_ordered_pending we stop doing further end io processing and | |
| short-circuit btrfs_writepage_end_io_hook execution. | |
| Alternatively, when the current io completion is the last for this ordered | |
| extent the final job is queued for asynchronous execution. This entails | |
| executing finish_ordered_fn/btrfs_finish_ordered_io. The last function is the | |
| one which does all the necessary changes to the btree in order to bring the | |
| system to a consistent state and "publish" the newly written data. The 2 most | |
| important things that this function performs is calling either | |
| btrfs_mark_extent_written (in case we just wrote to a PREALLOC extent) or | |
| calling insert_reserved_file_extent. And adding the necessary checksums to the | |
| csum tree via add_pending_csums. | |
| It's important to note that the end io context is already asynchronous and | |
| arbitrary and initially it might seems queueing yet another asynchornous job | |
| from there is redundant. The matter of fact is it's not, this is required since | |
| modifying the btree is a pretty heavy operation and in order to avoid deadlocks | |
| as well as not block further ios from being completed we need to do this | |
| processing from a different thread than the end io routine. |