block: qcow: QCOW2 thread safe metadata for multiqueue I/O #7744

Merged
rbradford merged 6 commits into cloud-hypervisor:main from weltling:qcow-metadata-lock, Mar 9, 2026
Conversation


@weltling weltling commented Feb 22, 2026

block: QCOW2 thread safe metadata for multiqueue I/O

The QCOW2 synchronous backend currently wraps QcowFile in Arc&lt;Mutex&gt; to avoid the data corruption from cloned instances with independent caches introduced in #7661. This is thread-safe but serializes all queue I/O through a single lock, eliminating any benefit from multiple virtio-blk queues.

This series introduces QcowMetadata, a coarse-grained RwLock wrapper around the in-memory QCOW2 metadata tables. Metadata lookup is separated from data I/O so each queue can read and write through its own file descriptor (via dup and pread64/pwrite64) without holding the metadata lock. On the read path, L2 cache hits only need a shared read lock; cache misses and all write operations upgrade to a write lock.

QcowDiskSync holds a single Arc<QcowMetadata> shared across all queues. Backing files are decomposed into concrete owned types (BackingKind enum) so QCOW2 backings also use the same metadata plus pread64 pattern, recursively through the backing chain.

This is the first step in the block crate refactoring plan (#7560). The ClusterReadMapping and ClusterWriteMapping enums returned by the metadata layer describe exactly what host I/O is needed for each guest request without performing it. This separation lays the foundation for a future io_uring backend that can submit the mapped offsets as async operations instead of blocking pread64/pwrite64 calls.
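As a hypothetical illustration of this "describe, don't perform" split, the sketch below resolves a guest cluster to a mapping value that the caller can later turn into host I/O. The enum shape, variant names and the toy table are invented for the sketch, not the actual crate types:

```rust
// Sketch of resolving a guest cluster into a description of the host I/O
// needed, instead of performing the I/O under the lock. Illustrative only.
use std::collections::HashMap;

/// What host I/O a guest read needs, resolved under the metadata lock.
#[derive(Debug, PartialEq)]
enum ClusterReadMapping {
    /// Cluster is unallocated: the caller fills the buffer with zeroes
    /// (or falls through to a backing file).
    Zero,
    /// Data lives at this host offset: the caller issues pread64 there,
    /// with no metadata lock held.
    Data { host_offset: u64 },
}

/// Toy metadata table: guest cluster index -> host offset.
fn resolve_read(l2: &HashMap<u64, u64>, guest_cluster: u64) -> ClusterReadMapping {
    match l2.get(&guest_cluster) {
        Some(&host_offset) => ClusterReadMapping::Data { host_offset },
        None => ClusterReadMapping::Zero,
    }
}

fn main() {
    let mut l2 = HashMap::new();
    l2.insert(0u64, 0x1_0000u64);
    // Allocated cluster: the mapping names the host offset to pread.
    assert_eq!(
        resolve_read(&l2, 0),
        ClusterReadMapping::Data { host_offset: 0x1_0000 }
    );
    // Unallocated cluster: no host I/O, just zero-fill.
    assert_eq!(resolve_read(&l2, 7), ClusterReadMapping::Zero);
}
```

Because the mapping is plain data, the same resolution step could later feed an io_uring submission queue instead of a blocking pread64 call.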

Commits

  1. Extract QcowHeader into header.rs - Move QcowHeader, constants and helpers into a submodule. Pure code move, no functional changes.

  2. Extract utility functions into util.rs - Move L1/L2 entry helpers, division utilities and constants into a submodule. Pure code move, no functional changes.

  3. Add QcowMetadata with RwLock - The core change. QcowMetadata provides cluster resolution for reads and writes, and deallocate operations for discard. Extract parse_qcow() from QcowFile so both QcowFile and QcowDiskSync share parsing and validation.

  4. Add resize() to QcowMetadata - Move resize logic into the metadata layer so QcowDiskSync can grow the virtual disk size without going through QcowFile.

  5. Refactor BackingFile for ownership based decomposition - Replace the clone based BackingFileOps trait with a BackingKind enum. QCOW2 backings call parse_qcow() directly instead of building a full QcowFile. Remove Clone for BackingFile and QcowFile.

  6. Extend unit tests - New tests covering multiqueue concurrent reads, raw and QCOW2 backing files, three layer backing chains, COW on partial cluster writes, discard with backing fallthrough, cross cluster boundary operations, reads beyond virtual size and resize.

Performance tests show substantial improvements over the main baseline due to parallel data I/O across queues.

@weltling weltling requested a review from a team as a code owner February 22, 2026 22:22
@phip1611 phip1611 left a comment


Generally LGTM! I think parts of it are hard to review (many changed LOC in single commits). Can we do better here?

Perhaps you could even split out commits 1 and 2 into a dedicated PR to keep this PR more focused (given that you might split it into more commits)

```rust
///
/// One instance is shared via Arc across all virtio-blk queues. Each
/// queue holds its own QcowRawFile clone for data I/O.
pub struct QcowMetadata {
```
Member


general question: Could you please explain the bigger picture a little? We always have certain metadata in RAM and flush it to disk occasionally? How does it work?

Member Author


QCOW2 uses two levels of indirection tables (L1 + L2) to map guest cluster offsets to host file offsets:

guest offset -> L1 table -> L2 table -> host cluster offset.

These tables live on disk but are cached in RAM for performance. The L1 table is always fully loaded, L2 tables are loaded on demand into an LRU cache. On reads, the cached tables are looked up to find where the data lives on disk, then pread64 is issued at that host offset. On writes, a new cluster may need to be allocated by appending to the file and updating the L2 entry. That modifies the cached tables and marks them dirty. Dirty tables are flushed to disk on fsync or when evicted from the cache.

The RwLock protects these in memory tables. Multiple readers can look up cluster mappings concurrently, while writes that modify table entries take an exclusive lock. The metadata layer returns a ClusterReadMapping or ClusterWriteMapping that describes the host I/O needs, then the caller does pread64/pwrite64 on its own file descriptor without any lock held. This is what enables multiqueue parallelism - queues only briefly cross on the metadata lock for the lookup phase, while actual data I/O runs fully concurrent.

Thanks
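The two-level translation described above can be sketched as pure offset arithmetic. This assumes the qcow2 default of 64 KiB clusters and 8-byte L2 entries; the constant and function names are illustrative, not the actual cloud-hypervisor helpers:

```rust
// Sketch of the two-level QCOW2 offset translation. Names and layout are
// simplified illustrations under the stated assumptions.

const CLUSTER_BITS: u64 = 16; // 64 KiB clusters, the qcow2 default
const CLUSTER_SIZE: u64 = 1 << CLUSTER_BITS;
// Each L2 table occupies one cluster of 8-byte entries.
const L2_ENTRIES: u64 = CLUSTER_SIZE / 8;

/// Split a guest offset into (l1_index, l2_index, offset within cluster).
fn split_guest_offset(guest_offset: u64) -> (u64, u64, u64) {
    let cluster_index = guest_offset >> CLUSTER_BITS;
    let l1_index = cluster_index / L2_ENTRIES;
    let l2_index = cluster_index % L2_ENTRIES;
    let in_cluster = guest_offset & (CLUSTER_SIZE - 1);
    (l1_index, l2_index, in_cluster)
}

fn main() {
    // Offset 0 lands in the first entry of the first L2 table.
    assert_eq!(split_guest_offset(0), (0, 0, 0));
    // One cluster plus 5 bytes: second L2 entry, 5 bytes into the cluster.
    assert_eq!(split_guest_offset(CLUSTER_SIZE + 5), (0, 1, 5));
    // Past one full L2 table's worth of clusters: second L1 entry.
    assert_eq!(split_guest_offset(L2_ENTRIES * CLUSTER_SIZE), (1, 0, 0));
}
```

The L1 lookup selects an L2 table (loaded on demand into the cache), the L2 lookup yields the host cluster offset, and the in-cluster remainder is added to it before issuing pread64/pwrite64.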

```rust
    Ok(())
}

/// Flushes dirty metadata caches and clears the dirty bit for
```
Member


Not about this line, but about the commit "block: qcow: Refactor BackingFile for ownership based decomposition":

It is also fairly difficult to review, many lines changed. Is there a way you can split this commit into multiple?

Member Author


Please see #7744 (comment)


weltling commented Feb 23, 2026

Performance summary

Ran `./scripts/dev_cli.sh tests --metrics -- --test-filter block_qcow` on both main and this PR's feature branch.

Main vs. Metadata Lock

| Category | 1Q Main (MiB/s) | 1Q Metadata Lock (MiB/s) | 1Q Delta | 4Q Main (MiB/s) | 4Q Metadata Lock (MiB/s) | 4Q Delta |
|---|---|---|---|---|---|---|
| Uncompressed | 1773 | 1759 | ~2% | 1825 | 6138 | +230-251% |
| Zlib | 2189 | 2152 | ~5% | 5811 | 8987 | +47-61% |
| Zstd | 2184 | 2220 | ~4% | 5801 | 9063 | +39-65% |
| Backing QCOW2 | 1421 | 1374 | ~6% | 1684 | 5237 | +172-241% |
| Backing RAW | 1518 | 1534 | ~2% | 1815 | 5572 | +172-236% |

Metadata Lock - Single vs. Multi Queue Scaling

| Category | 1Q (MiB/s) | 4Q (MiB/s) | Scaling |
|---|---|---|---|
| Uncompressed | 1759 | 6138 | 3.44-3.53x |
| Zlib | 2152 | 8987 | 4.13-4.22x |
| Zstd | 2220 | 9063 | 3.77-4.32x |
| Backing QCOW2 | 1374 | 5237 | 3.26-5.81x |
| Backing RAW | 1534 | 5572 | 3.45-4.54x |

Notes

Single queue performance is within noise, confirming no overhead on the single-threaded path. Multiqueue throughput improves dramatically: uncompressed reads scale 3.5x on 4 queues where Arc&lt;Mutex&gt; achieved only ~1x.

Compressed workloads exceed 4x since per cluster decompression now runs concurrently.

Backing file overlays show the largest gains, from ~1.7 GiB/s to 5-6 GiB/s.

The RwLock on QcowMetadata allows concurrent read access with no contention, while write exclusivity is preserved for cluster allocation.

Full logs for all 30 perf tests are attached for anyone interested:
perf_baseline_main.log
perf_baseline_metadata_lock.log

@weltling

> Generally LGTM! I think parts of it are hard to review (many changed LOC in single commits). Can we do better here?
>
> Perhaps you could even split out commits 1 and 2 into a dedicated PR to keep this PR more focused (given that you might split it into more commits)

Thanks for giving this a shot! And yes, I second the reviewability concerns and had quite some thought before producing this structure, too.

The main goal was to make each commit a self contained logical step that compiles and passes clippy on its own, while keeping the overall diff as small as possible. Those commits follow a strict dependency chain where each one builds on the previous, so they need to land in order and reviewing them in sequence tells a coherent story. There is also a CI job that validates every commit individually, so any commit that does not build or breaks tests would fail the pipeline.

I considered splitting commits 1+2 into a separate PR as well, but decided against it because they are pure mechanical extractions that only make sense as preparation for commit 3. On their own they are a refactoring with no functional motivation, and reviewing them in isolation loses the context of why the split was made. The sequence also matters here: commit 3 depends on the types and helpers being in separate modules so metadata.rs can import them without circular dependencies in mod.rs. Keeping them in the same PR shows the full picture in one pass, and the commit split already isolates each logical step so they can be reviewed individually.

For commit 3 "Add QcowMetadata with RwLock" - QcowMetadata, QcowState, the mapping enums and parse_qcow() form a single coherent type that cannot be introduced partially. Splitting would create broken intermediate commits where the struct exists but cannot resolve clusters, or the enums exist but nothing returns them.

For commit 5 "Refactor BackingFile" - the BackingFileOps trait removal, BackingKind enum introduction and parse_qcow() reuse all touch the same call sites. Splitting would mean introducing the new enum in one commit and removing the old trait in another, but both halves would need to modify the same functions, adding revert/redo churn with no reviewability gain.

That said, I am open to suggestions if you see a concrete split that would make a particular commit easier to follow. Happy to restructure if there is a way that works better for review.

Thanks!

@weltling weltling requested a review from phip1611 February 23, 2026 11:04
@weltling weltling force-pushed the qcow-metadata-lock branch 4 times, most recently from 78bc890 to 454eb32 Compare February 27, 2026 09:28
@phip1611

I unfortunately don't have capacity to review this currently. Maybe next week!

@weltling

> I unfortunately don't have capacity to review this currently. Maybe next week!

No worries and thanks so much for all the efforts! I'm starting to look into the other parts of the refactoring to prepare and make progress in parallel.

Thanks!

@weltling weltling force-pushed the qcow-metadata-lock branch 4 times, most recently from 4cd56c5 to a055883 Compare March 3, 2026 10:33
```rust
//
// SPDX-License-Identifier: Apache-2.0 AND BSD-3-Clause

#![allow(dead_code)] // wired in by qcow_sync
```
Member


What does this mean? It would be good not to have this and to have the types correctly labelled with scope (maybe just pub(crate) if needed).

Member Author


The types are already labelled with the appropriate scope (pub(crate) / pub). The #![allow(dead_code)] was there because this module is introduced in commit 3 while its consumer qcow_sync.rs is not wired up until commit 5. I have removed it now. The final tree no longer carries it, but the intermediate commits 3 and 4 will not pass `cargo clippy -p block -- -D warnings` without it. Perhaps that is not so bad, if CI doesn't complain.

Thanks.

Member Author


Confirmed: CI checks clippy on the final merge tree, not per intermediate commit. Thus, no #![allow(dead_code)] is needed.

Thanks

```rust
BackingKind::Raw(raw_file) => {
    // SAFETY: raw_file holds a valid open fd.
    let dup_fd = unsafe { libc::dup(raw_file.as_raw_fd()) };
    assert!(dup_fd >= 0, "dup() backing file fd");
```
Member


assert!() should only be used for programming errors (or in test code); this could fail at runtime if the limit on open FDs is exceeded.

Member Author


Indeed. Replaced these with runtime checks that return Err.

Thanks!
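For reference, safe Rust offers a runtime-checked version of this pattern out of the box: File::try_clone() performs the dup and surfaces failures such as an exceeded fd limit as io::Error instead of aborting. A minimal sketch, with the helper name invented here:

```rust
// Sketch: runtime-checked fd duplication via std, propagating failure as
// Err rather than asserting. The function name is hypothetical.
use std::fs::File;
use std::io::{self, Write};
use std::os::unix::io::AsRawFd;

fn dup_backing_fd(backing: &File) -> io::Result<File> {
    // try_clone() calls dup(2) and returns Err on EMFILE and friends.
    backing.try_clone()
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("dup_sketch.bin");
    let mut f = File::create(&path)?;
    f.write_all(b"hello")?;
    let dup = dup_backing_fd(&f)?;
    // The duplicate is a distinct descriptor sharing the same open file.
    assert_ne!(f.as_raw_fd(), dup.as_raw_fd());
    std::fs::remove_file(&path)?;
    Ok(())
}
```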

@weltling weltling force-pushed the qcow-metadata-lock branch 5 times, most recently from 6652292 to b3e2f7c Compare March 6, 2026 14:11
weltling added 5 commits March 9, 2026 08:11
Move QcowHeader, associated types, constants and helper functions
into a new header.rs submodule. Public types are re-exported from
mod.rs.

No functional changes.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Move L1 and L2 table entry helpers, division utilities and related
constants from mod.rs into a dedicated util.rs submodule. Both
mod.rs and metadata.rs import from util.

No functional changes.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Introduce QcowMetadata, a thread safe wrapper around QCOW2 metadata
tables and caches using RwLock.

Provides cluster resolution for reads and writes, and deallocate
operations for discard.

Extract parse_qcow() from QcowFile so both QcowFile and QcowDiskSync
can share the parsing and validation logic.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add resize() and grow_l1_table() so the metadata layer can grow
the virtual disk size. Only grow is supported.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Replace the clone based BackingFileOps trait with a BackingKind enum
so backing files can be decomposed into their concrete owned types.

BackingFile::new() for QCOW2 backings now calls parse_qcow() directly
instead of building a full QcowFile. Remove Clone for BackingFile and
QcowFile.

Prerequisite for the qcow_sync rewrite which decomposes a BackingFile
into a raw fd or QcowMetadata for lock free I/O.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add tests for multiqueue concurrent reads, raw and QCOW2 backing
files, three layer backing chains, COW on partial cluster writes,
discard with backing fallthrough, cross cluster boundary operations,
reads beyond virtual size, and resize.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
@weltling weltling force-pushed the qcow-metadata-lock branch from b3e2f7c to eb1fe0a Compare March 9, 2026 07:12

@phip1611 phip1611 left a comment


I only have small remarks. Great job done here!

```rust
        })
        .collect();

    for t in threads {
```
Member


I am not sure how this test ensures that the reads are done in parallel and not sequentially

Member Author


The threads are scheduled by the OS the same way virtio queue workers are. There is no synchronization; they just run whenever the scheduler picks them up. On a multicore machine, 8 threads doing 16 reads each will naturally overlap, so the access is concurrent.

Thanks
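The pattern under discussion can be sketched as a standalone program: several threads issue positioned reads (read_at, which maps to pread64 on Unix) through independently duplicated descriptors, with no shared cursor and no lock. This is an illustrative sketch, not the PR's actual test:

```rust
// Sketch: concurrent positioned reads on one file through per-thread
// duplicated descriptors. Illustrative only.
use std::fs::File;
use std::io::Write;
use std::os::unix::fs::FileExt;
use std::sync::Arc;
use std::thread;

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("mq_read_sketch.bin");
    let mut f = File::create(&path)?;
    // Byte at offset i holds the value i, so reads are self-checking.
    f.write_all(&(0u8..=255).collect::<Vec<u8>>())?;

    let file = Arc::new(File::open(&path)?);
    let threads: Vec<_> = (0..8u64)
        .map(|t| {
            // Each "queue" dups its own descriptor, as in the PR.
            let own = file.try_clone().expect("dup fd");
            thread::spawn(move || {
                for i in 0..16u64 {
                    let off = (t * 16 + i) % 256;
                    let mut buf = [0u8; 1];
                    // read_at never touches a shared file cursor.
                    own.read_at(&mut buf, off).expect("pread");
                    assert_eq!(buf[0], off as u8);
                }
            })
        })
        .collect();
    for t in threads {
        t.join().expect("thread panicked");
    }
    std::fs::remove_file(&path)?;
    Ok(())
}
```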

@rbradford rbradford added this pull request to the merge queue Mar 9, 2026
Merged via the queue into cloud-hypervisor:main with commit fd6891d Mar 9, 2026
37 checks passed

weltling commented Mar 9, 2026

Thanks for all the reviews! Working on the error unification and trait infra as next refactoring steps.

@weltling weltling deleted the qcow-metadata-lock branch March 9, 2026 22:12