block: qcow: QCOW2 thread safe metadata for multiqueue I/O #7744
rbradford merged 6 commits into cloud-hypervisor:main
Conversation
Force-pushed 5151c46 to 5a18636
phip1611 left a comment
Generally LGTM! I think parts of it are hard to review (many changed LOC in single commits). Can we do better here?
Perhaps you could even split out commits 1 and 2 into a dedicated PR to keep this PR more focused (given that you might split it into more commits).
```rust
///
/// One instance is shared via Arc across all virtio-blk queues. Each
/// queue holds its own QcowRawFile clone for data I/O.
pub struct QcowMetadata {
```
general question: Could you please explain the bigger picture a little? We always have certain metadata in RAM and flush it to disk occasionally? How does it work?
QCOW2 uses two levels of indirection tables L1 + L2 to map guest cluster offsets to host file offsets:
guest offset -> L1 table -> L2 table -> host cluster offset.
These tables live on disk but are cached in RAM for performance. The L1 table is always fully loaded, L2 tables are loaded on demand into an LRU cache. On reads, the cached tables are looked up to find where the data lives on disk, then pread64 is issued at that host offset. On writes, a new cluster may need to be allocated by appending to the file and updating the L2 entry. That modifies the cached tables and marks them dirty. Dirty tables are flushed to disk on fsync or when evicted from the cache.
The RwLock protects these in memory tables. Multiple readers can look up cluster mappings concurrently, while writes that modify table entries take an exclusive lock. The metadata layer returns a ClusterReadMapping or ClusterWriteMapping that describes the host I/O needs, then the caller does pread64/pwrite64 on its own file descriptor without any lock held. This is what enables multiqueue parallelism - queues only briefly cross on the metadata lock for the lookup phase, while actual data I/O runs fully concurrent.
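To make the two-level lookup above concrete, here is a minimal sketch of the index arithmetic. The function name and signature are illustrative, not the PR's actual API; the real code fetches the L2 table from an LRU cache keyed by the L1 entry, while here the already-loaded L2 table is passed in directly.

```rust
// Hypothetical sketch of the QCOW2 two-level lookup:
// guest offset -> L1 index -> L2 index -> host cluster offset.
fn host_offset(guest_offset: u64, cluster_bits: u32, l1: &[u64], l2: &[u64]) -> Option<u64> {
    let cluster_size = 1u64 << cluster_bits;
    let l2_entries = cluster_size / 8; // each L2 entry is 8 bytes on disk
    let cluster_index = guest_offset >> cluster_bits;
    let l1_index = (cluster_index / l2_entries) as usize;
    let l2_index = (cluster_index % l2_entries) as usize;
    // A zero L1 entry means the whole L2 table is unallocated.
    if *l1.get(l1_index)? == 0 {
        return None;
    }
    let cluster_host = *l2.get(l2_index)?;
    if cluster_host == 0 {
        // Unallocated cluster: read zeros, or fall through to the backing file.
        None
    } else {
        // Allocated: host cluster start plus the offset within the cluster.
        Some(cluster_host + (guest_offset & (cluster_size - 1)))
    }
}
```

Once the caller has the host offset, it issues pread64/pwrite64 on its own file descriptor with no lock held.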
Thanks
```rust
    Ok(())
}

/// Flushes dirty metadata caches and clears the dirty bit for
```
Not about this line, but about the commit "block: qcow: Refactor BackingFile for ownership based decomposition":
it is also fairly difficult to review, many lines changed. Is there a way you can split this commit into multiple?
Performance summary

Main vs. Metadata Lock

Metadata Lock - Single vs. Multi Queue Scaling

Notes

Single queue performance is within noise, confirming no overhead on the single-threaded path. Multiqueue throughput improves dramatically: uncompressed reads scale 3.5x on 4 queues, while compressed workloads exceed 4x since per-cluster decompression now runs concurrently. Backing file overlays show the largest gains, from ~1.7 GiB/s to 5-6 GiB/s. There are full logs with all 30 perf tests, if someone is interested.
Force-pushed 5a18636 to 60f94c8
Thanks for giving this a shot! And yes, I second the reviewability concerns and gave this structure quite some thought before producing it, too. The main goal was to make each commit a self-contained logical step that compiles and passes clippy on its own, while keeping the overall diff as small as possible. The commits follow a strict dependency chain where each one builds on the previous, so they need to land in order, and reviewing them in sequence tells a coherent story. There is also a CI job that validates every commit individually, so any commit that does not build or breaks tests would fail the pipeline.

I also considered splitting commits 1+2 into a separate PR but decided against it because they are pure mechanical extractions that only make sense as preparation for commit 3. On their own they are a refactoring with no functional motivation, and reviewing them in isolation loses the context of why the split was made. The sequence also matters here: commit 3 depends on the types and helpers being in separate modules so metadata.rs can import them without circular dependencies in mod.rs. Keeping them in the same PR shows the full picture in one pass. The commit split already isolates each logical step (commit 3 "Add QcowMetadata with RwLock", commit 5 "Refactor BackingFile") so they can be reviewed individually.

That said, I am open to suggestions if you see a concrete split that would make a particular commit easier to follow. Happy to restructure if there is a way that works better for review. Thanks!
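A per-commit validation like the one described above could be run locally with something along these lines. This is a hypothetical helper, not the project's actual CI script; `range` and `check` are illustrative knobs, and the real job would run the full build, clippy, and test suite (e.g. `cargo clippy -p block -- -D warnings`).

```shell
# Check every commit in a git range individually; restore HEAD afterwards.
check_each_commit() {
    range="$1"
    check="$2"
    start=$(git rev-parse HEAD) || return 1
    for rev in $(git rev-list --reverse "$range"); do
        git checkout --quiet "$rev" || return 1
        if ! sh -c "$check"; then
            # Put the tree back where it was before reporting failure.
            git checkout --quiet "$start"
            echo "commit $rev failed: $check" >&2
            return 1
        fi
    done
    git checkout --quiet "$start"
}
```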
Force-pushed 78bc890 to 454eb32
I unfortunately don't have capacity to review this currently. Maybe next week!
No worries and thanks so much for all the efforts! I'm starting to look into the other parts of the refactoring to prepare and make progress in parallel. Thanks!
Force-pushed 4cd56c5 to a055883
block/src/qcow/metadata.rs (Outdated)
```rust
//
// SPDX-License-Identifier: Apache-2.0 AND BSD-3-Clause

#![allow(dead_code)] // wired in by qcow_sync
```
What does this mean? It would be good to not have this and have the types correctly labelled with scope (maybe just pub(crate) if needed).
The types are already labelled with the appropriate scope (`pub(crate)` / `pub`). The `#![allow(dead_code)]` was there because this module is introduced in commit 3 while the consumer qcow_sync.rs is not wired up until commit 5. I have removed it now. The final tree no longer carries it, but the intermediate commits 3 and 4 will not pass `cargo clippy -p block -- -D warnings` without it. Perhaps that's not too bad, if CI doesn't complain.
Thanks.
Confirmed: CI checks clippy on the final merge tree, not per intermediate commit. Thus, no allow(dead_code) needed.
Thanks
block/src/qcow_sync.rs (Outdated)
```rust
BackingKind::Raw(raw_file) => {
    // SAFETY: raw_file holds a valid open fd.
    let dup_fd = unsafe { libc::dup(raw_file.as_raw_fd()) };
    assert!(dup_fd >= 0, "dup() backing file fd");
```
assert!() should only be used for programming errors (or in test code); this could happen at runtime when the number of open FDs is exceeded.
Indeed. Replaced these with runtime checks that return an Err.
Thanks!
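The pattern the fix boils down to can be sketched with the standard library alone: dup() failure (e.g. EMFILE when the process fd limit is exhausted) is a runtime condition, so it is returned as an Err instead of asserted away. The function name here is illustrative; the PR performs the equivalent `< 0` check around libc::dup(), while `File::try_clone` wraps dup() in std and already yields io::Result.

```rust
use std::fs::File;
use std::io;

// Duplicate a backing file's fd, propagating failure as an io::Error
// with context instead of panicking on a runtime condition.
fn dup_backing_fd(raw_file: &File) -> io::Result<File> {
    raw_file
        .try_clone()
        .map_err(|e| io::Error::new(e.kind(), format!("dup() backing file fd: {e}")))
}
```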
Force-pushed 6652292 to b3e2f7c
Move QcowHeader, associated types, constants and helper functions into a new header.rs submodule. Public types are re-exported from mod.rs. No functional changes. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Move L1 and L2 table entry helpers, division utilities and related constants from mod.rs into a dedicated util.rs submodule. Both mod.rs and metadata.rs import from util. No functional changes. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Introduce QcowMetadata, a thread safe wrapper around QCOW2 metadata tables and caches using RwLock. Provides cluster resolution for reads and writes, and deallocate operations for discard. Extract parse_qcow() from QcowFile so both QcowFile and QcowDiskSync can share the parsing and validation logic. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add resize() and grow_l1_table() so the metadata layer can grow the virtual disk size. Only grow is supported. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Replace the clone based BackingFileOps trait with a BackingKind enum so backing files can be decomposed into their concrete owned types. BackingFile::new() for QCOW2 backings now calls parse_qcow() directly instead of building a full QcowFile. Remove Clone for BackingFile and QcowFile. Prerequisite for the qcow_sync rewrite which decomposes a BackingFile into a raw fd or QcowMetadata for lock free I/O. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add tests for multiqueue concurrent reads, raw and QCOW2 backing files, three layer backing chains, COW on partial cluster writes, discard with backing fallthrough, cross cluster boundary operations, reads beyond virtual size, and resize. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Force-pushed b3e2f7c to eb1fe0a
phip1611 left a comment
I only have small remarks. Great job done here!
```rust
        })
        .collect();

    for t in threads {
```
I am not sure how this test ensures that the reads are done in parallel and not sequentially
The threads are scheduled by the OS the same way virtio queue workers are. There's no synchronization; they just run whenever the scheduler picks them up. On a multicore machine, 8 threads doing 16 reads each will naturally overlap. Thus, the access is concurrent.
Thanks
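If a test wanted to force overlap rather than rely on the scheduler, a Barrier could release all readers at the same instant, so at least the lock-acquisition phases are truly concurrent. This is a generic sketch, not the PR's test code; the thread/iteration counts and the read closure are illustrative.

```rust
use std::sync::{Arc, Barrier};
use std::thread;

// Spawn `num_threads` workers that all start reading at the same time.
fn run_concurrent<F>(num_threads: usize, iters: usize, read: F)
where
    F: Fn(usize) + Send + Sync + 'static,
{
    let read = Arc::new(read);
    let barrier = Arc::new(Barrier::new(num_threads));
    let handles: Vec<_> = (0..num_threads)
        .map(|t| {
            let read = Arc::clone(&read);
            let barrier = Arc::clone(&barrier);
            thread::spawn(move || {
                barrier.wait(); // release all readers together
                for i in 0..iters {
                    read(t * iters + i);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().expect("reader thread panicked");
    }
}
```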
Thanks for all the reviews! Working on the error unification and trait infra as next refactoring steps.
block: QCOW2 thread safe metadata for multiqueue I/O
The QCOW2 synchronous backend currently wraps `QcowFile` in `Arc<Mutex>` to avoid the data corruption from cloned instances with independent caches, introduced in #7661. This is thread safe but serializes all queue I/O through a single lock, eliminating any benefit from multiple virtio-blk queues.

This series introduces `QcowMetadata`, a coarse `RwLock` wrapper around the in-memory QCOW2 metadata tables. Metadata lookup is separated from data I/O so each queue can read and write through its own file descriptor (via `dup` and `pread64`/`pwrite64`) without holding the metadata lock. Read-path L2 cache hits only need a shared read lock. Cache misses and all write operations upgrade to a write lock. `QcowDiskSync` holds a single `Arc<QcowMetadata>` shared across all queues. Backing files are decomposed into concrete owned types (a `BackingKind` enum) so QCOW2 backings also use the same metadata-plus-`pread64` pattern, recursively through the backing chain.

This is the first step in the block crate refactoring plan (#7560). The `ClusterReadMapping` and `ClusterWriteMapping` enums returned by the metadata layer describe exactly what host I/O is needed for each guest request without performing it. This separation lays the foundation for a future `io_uring` backend that can submit the mapped offsets as async operations instead of blocking `pread64`/`pwrite64` calls.

Commits

1. Extract `QcowHeader` into `header.rs` - Move `QcowHeader`, constants and helpers into a submodule. Pure code move, no functional changes.
2. Extract utility functions into `util.rs` - Move L1/L2 entry helpers, division utilities and constants into a submodule. Pure code move, no functional changes.
3. Add `QcowMetadata` with `RwLock` - The core change. `QcowMetadata` provides cluster resolution for reads and writes, and deallocate operations for discard. Extract `parse_qcow()` from `QcowFile` so both `QcowFile` and `QcowDiskSync` share parsing and validation.
4. Add `resize()` to `QcowMetadata` - Move resize logic into the metadata layer so `QcowDiskSync` can grow the virtual disk size without going through `QcowFile`.
5. Refactor `BackingFile` for ownership-based decomposition - Replace the clone-based `BackingFileOps` trait with a `BackingKind` enum. QCOW2 backings call `parse_qcow()` directly instead of building a full `QcowFile`. Remove `Clone` for `BackingFile` and `QcowFile`.
6. Extend unit tests - New tests covering multiqueue concurrent reads, raw and QCOW2 backing files, three-layer backing chains, COW on partial cluster writes, discard with backing fallthrough, cross-cluster boundary operations, reads beyond virtual size and resize.

Performance tests show substantial improvements over the `main` baseline due to parallel data I/O across queues.
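The lookup-then-I/O split can be sketched roughly like this. The variant names follow the PR description, but the fields and the resolve function are assumptions, not the PR's actual types: the metadata layer computes a mapping under the lock, and the caller then performs pread64 (or fills zeros) with no lock held.

```rust
// Describes what host I/O a guest read needs, without performing it.
#[derive(Debug, PartialEq)]
enum ClusterReadMapping {
    /// Data is allocated in this image: caller does pread64 at host_offset.
    Data { host_offset: u64, len: usize },
    /// Unallocated and no backing file: caller fills the buffer with zeros.
    Zero { len: usize },
}

// Resolve one guest read against a (simplified) L2 entry.
fn resolve_read(guest_offset: u64, l2_entry: u64, cluster_size: u64, len: usize) -> ClusterReadMapping {
    if l2_entry == 0 {
        ClusterReadMapping::Zero { len }
    } else {
        ClusterReadMapping::Data {
            host_offset: l2_entry + (guest_offset % cluster_size),
            len,
        }
    }
}
```

Because the mapping is plain data, a future io_uring backend could submit the same resolved offsets as async SQEs instead of blocking calls.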