block: qcow: QCOW2 thread safe metadata for multiqueue I/O #7744
rbradford merged 6 commits into cloud-hypervisor:main
Conversation
Force-pushed 5151c46 to 5a18636
phip1611 left a comment
Generally LGTM! I think parts of it are hard to review (many changed LOC in single commits). Can we do better here?
Perhaps you could even split out commits 1 and 2 into a dedicated PR to keep this PR more focused (given that you might split it into more commits).
```rust
///
/// One instance is shared via Arc across all virtio-blk queues. Each
/// queue holds its own QcowRawFile clone for data I/O.
pub struct QcowMetadata {
```
general question: Could you please explain the bigger picture a little? We always have certain metadata in RAM and flush it to disk occasionally? How does it work?
QCOW2 uses two levels of indirection tables L1 + L2 to map guest cluster offsets to host file offsets:
guest offset -> L1 table -> L2 table -> host cluster offset.
These tables live on disk but are cached in RAM for performance. The L1 table is always fully loaded, L2 tables are loaded on demand into an LRU cache. On reads, the cached tables are looked up to find where the data lives on disk, then pread64 is issued at that host offset. On writes, a new cluster may need to be allocated by appending to the file and updating the L2 entry. That modifies the cached tables and marks them dirty. Dirty tables are flushed to disk on fsync or when evicted from the cache.
The RwLock protects these in memory tables. Multiple readers can look up cluster mappings concurrently, while writes that modify table entries take an exclusive lock. The metadata layer returns a ClusterReadMapping or ClusterWriteMapping that describes the host I/O needs, then the caller does pread64/pwrite64 on its own file descriptor without any lock held. This is what enables multiqueue parallelism - queues only briefly cross on the metadata lock for the lookup phase, while actual data I/O runs fully concurrent.
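To make the two-level lookup above concrete, here is a minimal sketch of the index arithmetic. The function name and signature are illustrative, not the PR's actual API; the real code fetches the L2 table from an LRU cache keyed by the L1 entry, while here the already-loaded L2 table is passed in directly.

```rust
// Hypothetical sketch of the QCOW2 two-level lookup:
// guest offset -> L1 index -> L2 index -> host cluster offset.
fn host_offset(guest_offset: u64, cluster_bits: u32, l1: &[u64], l2: &[u64]) -> Option<u64> {
    let cluster_size = 1u64 << cluster_bits;
    let l2_entries = cluster_size / 8; // each L2 entry is 8 bytes on disk
    let cluster_index = guest_offset >> cluster_bits;
    let l1_index = (cluster_index / l2_entries) as usize;
    let l2_index = (cluster_index % l2_entries) as usize;
    // A zero L1 entry means the whole L2 table is unallocated.
    if *l1.get(l1_index)? == 0 {
        return None;
    }
    let cluster_host = *l2.get(l2_index)?;
    if cluster_host == 0 {
        // Unallocated cluster: read zeros, or fall through to the backing file.
        None
    } else {
        // Allocated: host cluster start plus the offset within the cluster.
        Some(cluster_host + (guest_offset & (cluster_size - 1)))
    }
}
```

Once the caller has the host offset, it issues pread64/pwrite64 on its own file descriptor with no lock held.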
Thanks
```rust
    Ok(())
}

/// Flushes dirty metadata caches and clears the dirty bit for
```
Not about this line, but about the commit "block: qcow: Refactor BackingFile for ownership based decomposition":
it is also fairly difficult to review, many lines changed. Is there a way you can split this commit into multiple?
Performance summary

Main vs. Metadata Lock

Metadata Lock - Single vs. Multi Queue Scaling

Notes

Single queue performance is within noise, confirming no overhead on the single-threaded path. Multiqueue throughput improves dramatically: uncompressed reads scale 3.5x on 4 queues, while compressed workloads exceed 4x since per-cluster decompression now runs concurrently. Backing file overlays show the largest gains, from ~1.7 GiB/s to 5-6 GiB/s. There are full logs with all 30 perf tests, if someone is interested.
Force-pushed 5a18636 to 60f94c8
Thanks for giving this a shot! And yes, I second the reviewability concerns and gave this structure quite some thought before producing it, too. The main goal was to make each commit a self-contained logical step that compiles and passes clippy on its own, while keeping the overall diff as small as possible. The commits follow a strict dependency chain where each one builds on the previous, so they need to land in order, and reviewing them in sequence tells a coherent story. There is also a CI job that validates every commit individually, so any commit that does not build or breaks tests would fail the pipeline.

I also considered splitting commits 1+2 into a separate PR but decided against it because they are pure mechanical extractions that only make sense as preparation for commit 3. On their own they are a refactoring with no functional motivation, and reviewing them in isolation loses the context of why the split was made. The sequence also matters here: commit 3 depends on the types and helpers being in separate modules so metadata.rs can import them without circular dependencies in mod.rs. Keeping them in the same PR shows the full picture in one pass. The commit split already isolates each logical step (commit 3 "Add QcowMetadata with RwLock", commit 5 "Refactor BackingFile") so they can be reviewed individually.

That said, I am open to suggestions if you see a concrete split that would make a particular commit easier to follow. Happy to restructure if there is a way that works better for review. Thanks!
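A per-commit validation like the one described above could be run locally with something along these lines. This is a hypothetical helper, not the project's actual CI script; `range` and `check` are illustrative knobs, and the real job would run the full build, clippy, and test suite (e.g. `cargo clippy -p block -- -D warnings`).

```shell
# Check every commit in a git range individually; restore HEAD afterwards.
check_each_commit() {
    range="$1"
    check="$2"
    start=$(git rev-parse HEAD) || return 1
    for rev in $(git rev-list --reverse "$range"); do
        git checkout --quiet "$rev" || return 1
        if ! sh -c "$check"; then
            # Put the tree back where it was before reporting failure.
            git checkout --quiet "$start"
            echo "commit $rev failed: $check" >&2
            return 1
        fi
    done
    git checkout --quiet "$start"
}
```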
Force-pushed 78bc890 to 454eb32
I unfortunately don't have capacity to review this currently. Maybe next week!
No worries and thanks so much for all the efforts! I'm starting to look into the other parts of the refactoring to prepare and make progress in parallel. Thanks!
Force-pushed 4cd56c5 to a055883
block/src/qcow/metadata.rs (Outdated)
```rust
//
// SPDX-License-Identifier: Apache-2.0 AND BSD-3-Clause

#![allow(dead_code)] // wired in by qcow_sync
```
What does this mean? It would be good to not have this and have the types correctly labelled with scope (maybe just pub(crate) if needed).
The types are already labelled with the appropriate scope (`pub(crate)` / `pub`). The `#![allow(dead_code)]` was there because this module is introduced in commit 3 while the consumer qcow_sync.rs is not wired up until commit 5. I have removed it now. The final tree no longer carries it, but the intermediate commits 3 and 4 will not pass `cargo clippy -p block -- -D warnings` without it. Perhaps that's not too bad, if CI doesn't complain.
Thanks.
Confirmed: CI checks clippy on the final merge tree, not per intermediate commit. Thus, no allow(dead_code) needed.
Thanks
block/src/qcow_sync.rs (Outdated)
```rust
BackingKind::Raw(raw_file) => {
    // SAFETY: raw_file holds a valid open fd.
    let dup_fd = unsafe { libc::dup(raw_file.as_raw_fd()) };
    assert!(dup_fd >= 0, "dup() backing file fd");
```
assert!() should only be used for programming errors (or in test code); this could happen at runtime when the number of open FDs is exceeded.
Indeed. Replaced these with runtime checks that return an Err.
Thanks!
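The pattern the fix boils down to can be sketched with the standard library alone: dup() failure (e.g. EMFILE when the process fd limit is exhausted) is a runtime condition, so it is returned as an Err instead of asserted away. The function name here is illustrative; the PR performs the equivalent `< 0` check around libc::dup(), while `File::try_clone` wraps dup() in std and already yields io::Result.

```rust
use std::fs::File;
use std::io;

// Duplicate a backing file's fd, propagating failure as an io::Error
// with context instead of panicking on a runtime condition.
fn dup_backing_fd(raw_file: &File) -> io::Result<File> {
    raw_file
        .try_clone()
        .map_err(|e| io::Error::new(e.kind(), format!("dup() backing file fd: {e}")))
}
```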
Force-pushed 6652292 to b3e2f7c
Move QcowHeader, associated types, constants and helper functions into a new header.rs submodule. Public types are re-exported from mod.rs. No functional changes. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Move L1 and L2 table entry helpers, division utilities and related constants from mod.rs into a dedicated util.rs submodule. Both mod.rs and metadata.rs import from util. No functional changes. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Introduce QcowMetadata, a thread safe wrapper around QCOW2 metadata tables and caches using RwLock. Provides cluster resolution for reads and writes, and deallocate operations for discard. Extract parse_qcow() from QcowFile so both QcowFile and QcowDiskSync can share the parsing and validation logic. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add resize() and grow_l1_table() so the metadata layer can grow the virtual disk size. Only grow is supported. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Replace the clone based BackingFileOps trait with a BackingKind enum so backing files can be decomposed into their concrete owned types. BackingFile::new() for QCOW2 backings now calls parse_qcow() directly instead of building a full QcowFile. Remove Clone for BackingFile and QcowFile. Prerequisite for the qcow_sync rewrite which decomposes a BackingFile into a raw fd or QcowMetadata for lock free I/O. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add tests for multiqueue concurrent reads, raw and QCOW2 backing files, three layer backing chains, COW on partial cluster writes, discard with backing fallthrough, cross cluster boundary operations, reads beyond virtual size, and resize. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Force-pushed b3e2f7c to eb1fe0a
phip1611 left a comment
I only have small remarks. Great job done here!
```rust
        })
        .collect();

    for t in threads {
```
I am not sure how this test ensures that the reads are done in parallel and not sequentially
The threads are scheduled by the OS the same way virtio queue workers are. There's no synchronization; they just run whenever the scheduler picks them up. On a multicore machine, 8 threads doing 16 reads each will naturally overlap. Thus, the access is concurrent.
Thanks
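If a test wanted to force overlap rather than rely on the scheduler, a Barrier could release all readers at the same instant, so at least the lock-acquisition phases are truly concurrent. This is a generic sketch, not the PR's test code; the thread/iteration counts and the read closure are illustrative.

```rust
use std::sync::{Arc, Barrier};
use std::thread;

// Spawn `num_threads` workers that all start reading at the same time.
fn run_concurrent<F>(num_threads: usize, iters: usize, read: F)
where
    F: Fn(usize) + Send + Sync + 'static,
{
    let read = Arc::new(read);
    let barrier = Arc::new(Barrier::new(num_threads));
    let handles: Vec<_> = (0..num_threads)
        .map(|t| {
            let read = Arc::clone(&read);
            let barrier = Arc::clone(&barrier);
            thread::spawn(move || {
                barrier.wait(); // release all readers together
                for i in 0..iters {
                    read(t * iters + i);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().expect("reader thread panicked");
    }
}
```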
Thanks for all the reviews! Working on the error unification and trait infra as next refactoring steps.
block: QCOW2 thread safe metadata for multiqueue I/O
The QCOW2 synchronous backend currently wraps `QcowFile` in `Arc<Mutex>` to avoid the data corruption from cloned instances with independent caches, introduced in #7661. This is thread safe but serializes all queue I/O through a single lock, eliminating any benefit from multiple virtio-blk queues.

This series introduces `QcowMetadata`, a coarse `RwLock` wrapper around the in-memory QCOW2 metadata tables. Metadata lookup is separated from data I/O so each queue can read and write through its own file descriptor (via `dup` and `pread64`/`pwrite64`) without holding the metadata lock. Read-path L2 cache hits only need a shared read lock. Cache misses and all write operations upgrade to a write lock. `QcowDiskSync` holds a single `Arc<QcowMetadata>` shared across all queues. Backing files are decomposed into concrete owned types (a `BackingKind` enum) so QCOW2 backings also use the same metadata-plus-`pread64` pattern, recursively through the backing chain.

This is the first step in the block crate refactoring plan (#7560). The `ClusterReadMapping` and `ClusterWriteMapping` enums returned by the metadata layer describe exactly what host I/O is needed for each guest request without performing it. This separation lays the foundation for a future `io_uring` backend that can submit the mapped offsets as async operations instead of blocking `pread64`/`pwrite64` calls.

Commits

1. Extract `QcowHeader` into `header.rs` - Move `QcowHeader`, constants and helpers into a submodule. Pure code move, no functional changes.
2. Extract utility functions into `util.rs` - Move L1/L2 entry helpers, division utilities and constants into a submodule. Pure code move, no functional changes.
3. Add `QcowMetadata` with `RwLock` - The core change. `QcowMetadata` provides cluster resolution for reads and writes, and deallocate operations for discard. Extract `parse_qcow()` from `QcowFile` so both `QcowFile` and `QcowDiskSync` share parsing and validation.
4. Add `resize()` to `QcowMetadata` - Move resize logic into the metadata layer so `QcowDiskSync` can grow the virtual disk size without going through `QcowFile`.
5. Refactor `BackingFile` for ownership-based decomposition - Replace the clone-based `BackingFileOps` trait with a `BackingKind` enum. QCOW2 backings call `parse_qcow()` directly instead of building a full `QcowFile`. Remove `Clone` for `BackingFile` and `QcowFile`.
6. Extend unit tests - New tests covering multiqueue concurrent reads, raw and QCOW2 backing files, three-layer backing chains, COW on partial cluster writes, discard with backing fallthrough, cross-cluster boundary operations, reads beyond virtual size and resize.

Performance tests show substantial improvements over the `main` baseline due to parallel data I/O across queues.
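The lookup-then-I/O split can be sketched roughly like this. The variant names follow the PR description, but the fields and the resolve function are assumptions, not the PR's actual types: the metadata layer computes a mapping under the lock, and the caller then performs pread64 (or fills zeros) with no lock held.

```rust
// Describes what host I/O a guest read needs, without performing it.
#[derive(Debug, PartialEq)]
enum ClusterReadMapping {
    /// Data is allocated in this image: caller does pread64 at host_offset.
    Data { host_offset: u64, len: usize },
    /// Unallocated and no backing file: caller fills the buffer with zeros.
    Zero { len: usize },
}

// Resolve one guest read against a (simplified) L2 entry.
fn resolve_read(guest_offset: u64, l2_entry: u64, cluster_size: u64, len: usize) -> ClusterReadMapping {
    if l2_entry == 0 {
        ClusterReadMapping::Zero { len }
    } else {
        ClusterReadMapping::Data {
            host_offset: l2_entry + (guest_offset % cluster_size),
            len,
        }
    }
}
```

Because the mapping is plain data, a future io_uring backend could submit the same resolved offsets as async SQEs instead of blocking calls.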