Commit 0bf1f51

riteshharjani authored and tytso committed
ext4: Add atomic block write documentation
Add an initial documentation around atomic writes support in ext4.

Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/d3893b9f5ad70317abae72046e81e4c180af91bf.1747337952.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
1 parent 642e0dc commit 0bf1f51

2 files changed: +226 -0 lines changed

Lines changed: 225 additions & 0 deletions
@@ -0,0 +1,225 @@
.. SPDX-License-Identifier: GPL-2.0
.. _atomic_writes:

Atomic Block Writes
-------------------------

Introduction
~~~~~~~~~~~~

Atomic (untorn) block writes ensure that either the entire write is committed
to disk or none of it is. This prevents "torn writes" during power loss or
system crashes. The ext4 filesystem supports atomic writes (only with Direct
I/O) on regular files with extents, provided the underlying storage device
supports hardware atomic writes. This is supported in the following two ways:

1. **Single-fsblock Atomic Writes**:
   EXT4 has supported atomic write operations on a single filesystem block
   since v6.13. In this mode the atomic write unit minimum and maximum sizes
   are both set to the filesystem block size. For example, an atomic write of
   16KB is possible with a 16KB filesystem block size on a system with a 64KB
   page size.

2. **Multi-fsblock Atomic Writes with Bigalloc**:
   EXT4 now also supports atomic writes spanning multiple filesystem blocks
   using a feature known as bigalloc. The atomic write unit's minimum and
   maximum sizes are determined by the filesystem block size and cluster size,
   based on the underlying device’s supported atomic write unit limits.

Requirements
~~~~~~~~~~~~

Basic requirements for atomic writes in ext4:

1. The extents feature must be enabled (default for ext4)
2. The underlying block device must support atomic writes
3. For single-fsblock atomic writes:

   1. A filesystem with an appropriate block size (up to the page size)

4. For multi-fsblock atomic writes:

   1. The bigalloc feature must be enabled
   2. The cluster size must be appropriately configured

NOTE: EXT4 does not support software or COW based atomic writes, which means
atomic writes on ext4 are only supported if the underlying storage device
supports them.

Multi-fsblock Implementation Details
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The bigalloc feature changes ext4 to allocate in units of multiple filesystem
blocks, also known as clusters. With bigalloc, each bit within the block bitmap
represents a cluster (a power-of-2 number of blocks) rather than an individual
filesystem block.

EXT4 supports multi-fsblock atomic writes with bigalloc, subject to the
following constraints. The minimum atomic write size is the larger of the
filesystem block size and the minimum hardware atomic write unit, and the
maximum atomic write size is the smaller of the bigalloc cluster size and the
maximum hardware atomic write unit. Bigalloc ensures that all allocations are
aligned to the cluster size, which satisfies the LBA alignment requirements of
the hardware device if the start of the partition/logical volume is itself
aligned correctly.
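
As a rough illustration of the sizing rule above, the following C sketch
computes the resulting range from the four inputs. The names ``awu_limits`` and
``compute_awu_limits`` are hypothetical and are not kernel interfaces, and the
kernel's own calculation may apply further checks that are not modelled here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the computed limits; not a kernel structure. */
struct awu_limits {
    uint32_t min;   /* minimum atomic write size in bytes */
    uint32_t max;   /* maximum atomic write size in bytes */
    bool supported; /* false when no valid range exists */
};

/*
 * min = larger of the fs block size and the hardware minimum unit
 * max = smaller of the bigalloc cluster size and the hardware maximum unit
 * All sizes are in bytes. A hardware maximum of 0 is taken to mean that the
 * device reports no atomic write support (assumption of this sketch).
 */
struct awu_limits compute_awu_limits(uint32_t fs_block_size,
                                     uint32_t cluster_size,
                                     uint32_t hw_unit_min,
                                     uint32_t hw_unit_max)
{
    struct awu_limits lim = { 0, 0, false };

    if (hw_unit_max == 0)
        return lim;

    lim.min = fs_block_size > hw_unit_min ? fs_block_size : hw_unit_min;
    lim.max = cluster_size < hw_unit_max ? cluster_size : hw_unit_max;

    /* An empty range (min > max) means atomic writes cannot be offered. */
    lim.supported = lim.min <= lim.max;
    if (!lim.supported)
        lim.min = lim.max = 0;
    return lim;
}
```

With a 4KB block size, a 64KB cluster size and a device reporting units from
512 bytes to 128KB, this sketch yields the range [4KB, 64KB].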

Here is the block allocation strategy in bigalloc for atomic writes:

* For regions with fully mapped extents, no additional work is needed
* For append writes, a new mapped extent is allocated
* For regions that are entirely holes, an unwritten extent is created
* For large unwritten extents, the extent gets split into two unwritten
  extents of appropriate requested size
* For mixed mapping regions (combinations of holes, unwritten extents, or
  mapped extents), ext4_map_blocks() is called in a loop with the
  EXT4_GET_BLOCKS_ZERO flag to convert the region into a single contiguous
  mapped extent by writing zeroes to it and converting any unwritten extents to
  written, if found within the range.
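
The cases in the list above can be pictured with a small userspace model. This
is only a sketch: the enums and ``classify_range()`` are hypothetical names
invented for illustration, the input is assumed to be a per-block state for the
requested range, and the append-write case and physical contiguity are not
modelled. It is not how ext4 represents extents internally.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-block mapping state; not an ext4 type. */
enum seg_state { SEG_HOLE, SEG_UNWRITTEN, SEG_MAPPED };

/* Hypothetical labels for the cases in the list above. */
enum alloc_action {
    ACT_NONE,            /* fully mapped: no additional work */
    ACT_ALLOC_UNWRITTEN, /* entirely a hole: create an unwritten extent */
    ACT_SPLIT_UNWRITTEN, /* entirely unwritten: split to the requested size */
    ACT_ZERO_AND_MERGE   /* mixed mapping: zero out into one mapped extent */
};

/* Classify a requested range given the state of each block in it. */
enum alloc_action classify_range(const enum seg_state *blocks, size_t nblocks)
{
    bool any_hole = false, any_unwritten = false, any_mapped = false;

    for (size_t i = 0; i < nblocks; i++) {
        switch (blocks[i]) {
        case SEG_HOLE:      any_hole = true;      break;
        case SEG_UNWRITTEN: any_unwritten = true; break;
        case SEG_MAPPED:    any_mapped = true;    break;
        }
    }

    /* More than one kind of state in the range means a mixed mapping. */
    int kinds = any_hole + any_unwritten + any_mapped;
    if (kinds > 1)
        return ACT_ZERO_AND_MERGE;
    if (any_hole)
        return ACT_ALLOC_UNWRITTEN;
    if (any_unwritten)
        return ACT_SPLIT_UNWRITTEN;
    return ACT_NONE; /* all mapped (or an empty range) */
}
```

Only the last case requires zeroing; the first three leave the range backed by
a single kind of extent.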

Note: Writing on a single contiguous underlying extent, whether mapped or
unwritten, is not inherently problematic. However, writing to a mixed mapping
region (i.e. one containing a combination of mapped and unwritten extents)
must be avoided when performing atomic writes.

The reason is that an atomic write, when issued via pwritev2() with the
RWF_ATOMIC flag, requires that either all data is written or none at all. In
the event of a system crash or unexpected power loss during the write
operation, the affected region (when later read) must reflect either the
complete old data or the complete new data, but never a mix of both.

To enforce this guarantee, we ensure that the write target is backed by
a single, contiguous extent before any data is written. This is critical because
ext4 defers the conversion of unwritten extents to written extents until the I/O
completion path (typically in ->end_io()). If a write is allowed to proceed over
a mixed mapping region (with mapped and unwritten extents) and a failure occurs
mid-write, the system could observe partially updated regions after reboot, i.e.
new data over mapped areas, and stale (old) data over unwritten extents that
were never marked written. This violates the atomicity and/or torn write
prevention guarantee.

To prevent such torn writes, ext4 proactively allocates a single contiguous
extent for the entire requested region in ``ext4_iomap_alloc()`` via
``ext4_map_blocks_atomic()``. EXT4 also force-commits the current journal
transaction if the allocation is done over a mixed mapping. This ensures that
any pending metadata updates (such as unwritten-to-written extent conversion)
in this range are in a consistent state with the file data blocks before the
actual write I/O is performed. If the commit fails, the whole I/O must be
aborted to prevent any possible torn writes.
Only after this step is the actual data write operation performed by iomap.

Handling Split Extents Across Leaf Blocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There can be a special edge case where we have logically and physically
contiguous extents stored in separate leaf nodes of the on-disk extent tree.
This occurs because on-disk extent tree merges only happen within leaf
blocks, except for the case of a 2-level tree, which can get merged and
collapsed entirely into the inode.
If such a layout exists and, in the worst case, the extent status cache entries
are reclaimed due to memory pressure, ``ext4_map_blocks()`` may never return
a single contiguous extent for these split leaf extents.

To address this edge case, a new get block flag,
``EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS``, is added to enhance the
``ext4_map_query_blocks()`` lookup behavior.

This new get block flag allows ``ext4_map_blocks()`` to first check if there is
an entry in the extent status cache for the full range.
If not present, it consults the on-disk extent tree using
``ext4_map_query_blocks()``.
If the located extent is at the end of a leaf node, it probes the next logical
block (lblk) to detect a contiguous extent in the adjacent leaf.

For now only one additional leaf block is queried to maintain efficiency, as
atomic writes are typically constrained to small sizes
(e.g. [blocksize, clustersize]).

Handling Journal transactions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To support multi-fsblock atomic writes, we ensure enough journal credits are
reserved during:

1. Block allocation time in ``ext4_iomap_alloc()``. We first query if there
   could be a mixed mapping for the underlying requested range. If yes, then we
   reserve credits of up to ``m_len``, assuming every alternate block can be
   an unwritten extent followed by a hole.

2. The ``->end_io()`` call. We make sure a single transaction is started for
   doing unwritten-to-written conversion. The loop for conversion is mainly
   required to handle a split extent across leaf blocks.

How to
------

Creating Filesystems with Atomic Write Support
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

First check the atomic write units supported by the block device.
See :ref:`atomic_write_bdev_support` for more details.

For single-fsblock atomic writes with a larger block size
(on systems with block size < page size):

.. code-block:: bash

    # Create an ext4 filesystem with a 16KB block size
    # (requires page size >= 16KB)
    mkfs.ext4 -b 16384 /dev/device

For multi-fsblock atomic writes with bigalloc:

.. code-block:: bash

    # Create an ext4 filesystem with bigalloc and 64KB cluster size
    mkfs.ext4 -F -O bigalloc -b 4096 -C 65536 /dev/device

Where ``-b`` specifies the block size, ``-C`` specifies the cluster size in bytes,
and ``-O bigalloc`` enables the bigalloc feature.

Application Interface
~~~~~~~~~~~~~~~~~~~~~

Applications can use the ``pwritev2()`` system call with the ``RWF_ATOMIC`` flag
to perform atomic writes:

.. code-block:: c

    pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC);

The write must be aligned to the filesystem's block size and not exceed the
filesystem's maximum atomic write unit size.
See ``generic_atomic_write_valid()`` for more details.
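
An application may want to reject obviously invalid requests before issuing
the call. The sketch below is a userspace pre-check under the rules described
in the pwritev2(2) manual page for ``RWF_ATOMIC``: the total length is a power
of two within the reported unit range, and the offset is naturally aligned to
that length. The function name ``atomic_write_args_ok`` is hypothetical, and
this is an approximation rather than a copy of the kernel's validation.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical userspace pre-check for an RWF_ATOMIC write.
 *
 * offset   - file offset of the write in bytes
 * len      - total write length in bytes
 * unit_min - minimum atomic write size (e.g. stx_atomic_write_unit_min)
 * unit_max - maximum atomic write size (e.g. stx_atomic_write_unit_max)
 */
bool atomic_write_args_ok(int64_t offset, uint64_t len,
                          uint32_t unit_min, uint32_t unit_max)
{
    /* Length must be a nonzero power of two. */
    if (len == 0 || (len & (len - 1)) != 0)
        return false;

    /* Length must fall inside the advertised unit range. */
    if (len < unit_min || len > unit_max)
        return false;

    /* Offset must be naturally aligned, i.e. a multiple of the length. */
    if (offset < 0 || ((uint64_t)offset % len) != 0)
        return false;

    return true;
}
```

A caller would typically run this check with the limits obtained from
``statx()`` and, if it passes, issue the ``pwritev2()`` call shown above on a
file descriptor opened for direct I/O with a single buffer.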

The ``statx()`` system call with the ``STATX_WRITE_ATOMIC`` flag provides the
following details:

* ``stx_atomic_write_unit_min``: Minimum size of an atomic write request.
* ``stx_atomic_write_unit_max``: Maximum size of an atomic write request.
* ``stx_atomic_write_segments_max``: Upper limit for segments. The number of
  separate memory buffers that can be gathered into a write operation
  (e.g., the iovcnt parameter for IOV_ITER). Currently, this is always set to one.

The ``STATX_ATTR_WRITE_ATOMIC`` flag in ``statx->attributes`` is set if atomic
writes are supported.

.. _atomic_write_bdev_support:

Hardware Support
----------------

The underlying storage device must support atomic write operations.
Modern NVMe and SCSI devices often provide this capability.
The Linux kernel exposes this information through sysfs:

* ``/sys/block/<device>/queue/atomic_write_unit_min`` - Minimum atomic write size
* ``/sys/block/<device>/queue/atomic_write_unit_max`` - Maximum atomic write size

Nonzero values for these attributes indicate that the device supports
atomic writes.
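
A program can read these attributes directly. The following is a minimal
sketch, assuming each attribute file contains a single decimal number; the
helper names ``read_queue_attr`` and ``bdev_supports_atomic_writes`` are
hypothetical, and the caller supplies the queue directory for its own device.

```c
#include <stdio.h>

/*
 * Read one unsigned decimal value from a sysfs-style attribute file.
 * Returns 0 on success and stores the value in *out; returns -1 on error.
 */
int read_queue_attr(const char *path, unsigned long long *out)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    int ok = fscanf(f, "%llu", out) == 1;
    fclose(f);
    return ok ? 0 : -1;
}

/*
 * queue_dir is the device's queue directory, i.e. /sys/block/<device>/queue.
 * Returns 1 if both limits are nonzero, 0 if either is zero, and -1 if an
 * attribute could not be read (for example on a kernel without these files).
 */
int bdev_supports_atomic_writes(const char *queue_dir)
{
    char path[4096];
    unsigned long long unit_min = 0, unit_max = 0;

    snprintf(path, sizeof(path), "%s/atomic_write_unit_min", queue_dir);
    if (read_queue_attr(path, &unit_min) != 0)
        return -1;

    snprintf(path, sizeof(path), "%s/atomic_write_unit_max", queue_dir);
    if (read_queue_attr(path, &unit_max) != 0)
        return -1;

    return (unit_min != 0 && unit_max != 0) ? 1 : 0;
}
```

The two values read here are the hardware limits that feed into the minimum
and maximum atomic write sizes described earlier in this document.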

See Also
--------

* :doc:`bigalloc` - Documentation on the bigalloc feature
* :doc:`allocators` - Documentation on block allocation in ext4
* Support for atomic block writes in 6.13:
  https://lwn.net/Articles/1009298/

Documentation/filesystems/ext4/overview.rst

Lines changed: 1 addition & 0 deletions
@@ -25,3 +25,4 @@ order.
 .. include:: inlinedata.rst
 .. include:: eainode.rst
 .. include:: verity.rst
+.. include:: atomic_writes.rst
