Describe the bug
Windows 11 25H2 guests intermittently deadlock during boot (~50% of boots in our tests) when the Windows PnP arbiter rebalances PCI resources at driver-load time and the new BAR address it picks for one virtio device collides with another live virtio device's BAR. CH's move_bar correctly rejects the conflicting allocation, but the guest kernel keeps re-issuing the same write. After a handful of retries the guest's internal PCI state is out of sync with what CH actually has mapped, wedging early boot before cocoon-agent (or any user-mode service) can start.
This is distinct from #7938 (driver bug → BAR programmed outside allocator range): here the BAR write is legitimate, addresses a valid MMIO region, and is exactly what pci.sys is documented to do.
Root cause: Windows PnP "resource rebalance"
pci.sys implements the PnP Manager's resource rebalance protocol. When a new FDO loads, the arbiter is allowed to recompute and rewrite the BARs of peer devices on the same bus. Microsoft documents this explicitly:
If a user adds a device to a system, and if the device requires system resources that the PnP manager has already assigned to another device, the PnP manager attempts to reassign resources. … It then delivers new resource lists to the devices so that they can restart, using the new resources.
— The PnP Manager Redistributes System Resources
The full sequence is IRP_MN_QUERY_STOP_DEVICE → IRP_MN_STOP_DEVICE → arbiter recomputes → IRP_MN_START_DEVICE with a new resource list (Stopping a Device to Rebalance Resources).
Confirming this isn't virtio-win's doing: VirtIO/WDF/PCI.c only consumes the resource list WDF passes it (PCIAllocBars, MmMapIoSpace) — there is no rewrite or veto path in any virtio Windows driver. The rewrite originates above virtio-win, in pci.sys's arbiter.
Linux guests never trigger this — Linux only re-assigns BARs in pci_assign_unassigned_resources (hotplug, rescan), not on every driver bind.
Why CH wedges where QEMU doesn't
QEMU's pci_update_mappings simply unmaps the old MemoryRegion and remaps at the new address on every BAR config write — no allocator check, no failure path. Two BARs targeting the same address would stomp each other on real hardware, and QEMU follows that.
CH instead routes the new address through mmio64_allocator via move_bar. If the new range is occupied, the allocator returns Overlap and the move fails. Even with #7938's rollback fix in place — which correctly restores bars[i].addr and the config register — the guest's view diverges:
- Guest writes new BAR (B) via IRP_MN_START_DEVICE's resource list.
- CH detects reprogramming, move_bar(A → B) fails, restore_bar_addr writes config back to A.
- Guest reads back A, doesn't reconcile with the resource list it just installed, re-issues the write of B. Loop.
- After ~6 retries (pci.sys gives up), the guest kernel thinks the BAR is at B (cached in its DEVICE_OBJECT), CH has it mapped at A. Subsequent virtio queue setup writes target B, which actually maps to another live virtio device's BAR (the one we refused to evict for the rebalance). The other device's worker thread receives garbage notifications; the originating driver waits forever for ring response; boot deadlocks at ~65 s VM uptime.
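The divergence described in the steps above can be reproduced with a toy model. This is only a sketch under stated assumptions: `ToyAllocator` and `simulate_rebalance` are hypothetical names invented for illustration, the allocator is a plain map rather than CH's mmio64_allocator, and the guest is assumed to keep B from its resource list whatever the VMM does.

```rust
use std::collections::BTreeMap;

/// Toy MMIO allocator: BAR base address -> size, refusing overlaps.
/// Hypothetical stand-in for the allocator behind move_bar, not CH's type.
#[derive(Default)]
struct ToyAllocator {
    ranges: BTreeMap<u64, u64>,
}

impl ToyAllocator {
    /// True if [base, base + size) intersects any allocated range.
    fn overlaps(&self, base: u64, size: u64) -> bool {
        self.ranges
            .iter()
            .any(|(&b, &s)| base < b + s && b < base + size)
    }

    /// Allocate a range, failing if it overlaps an existing one.
    fn allocate(&mut self, base: u64, size: u64) -> Result<(), &'static str> {
        if self.overlaps(base, size) {
            return Err("overlap");
        }
        self.ranges.insert(base, size);
        Ok(())
    }

    /// Move a BAR from `old` to `new`; on failure the old mapping is restored.
    fn move_bar(&mut self, old: u64, new: u64, size: u64) -> Result<(), &'static str> {
        self.ranges.remove(&old);
        if let Err(e) = self.allocate(new, size) {
            self.ranges.insert(old, size); // rollback: keep the old BAR
            return Err(e);
        }
        Ok(())
    }
}

/// Simulate the guest re-issuing the same BAR write up to `max_retries` times.
/// Returns (address the guest believes in, address the VMM has mapped).
fn simulate_rebalance(
    alloc: &mut ToyAllocator,
    a: u64,
    b: u64,
    size: u64,
    max_retries: u32,
) -> (u64, u64) {
    let mut vmm_addr = a;
    for _ in 0..max_retries {
        if alloc.move_bar(a, b, size).is_ok() {
            vmm_addr = b;
            break;
        }
        // Move rejected: config register is back at A, the guest retries.
    }
    // Assumption: the guest keeps B from its resource list regardless.
    (b, vmm_addr)
}
```

With two adjacent 512 KiB BARs allocated and a move of the first onto the second, the two returned addresses differ, which is the out-of-sync state the issue describes; with a free target they agree.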
Reproducer
Layout that triggers the deadlock on ~50% of boots with Win11 25H2:
- 5 virtio devices on bus 0 with 512 KiB BARs packed at the top of MMIO64 (the CH default with --disk + --net + --rng + --vsock + --watchdog): slots 1–5 at 0x3fffffe00000, 0x3fffffd80000, 0x3fffffd00000, 0x3fffffc80000, 0x3fffffc00000.
- Win11 25H2 with virtio-win 0.1.285.
- After cold boot, around 65 s into VM uptime, in the worst case the arbiter rewrites vsock's BAR from 0x3fffffd00000 to 0x3fffffd80000 (rng's slot). move_bar rejects the move.
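The address arithmetic behind this layout is easy to check. The sketch below assumes BARs are packed downward from 0x3fffffe00000 with a 512 KiB stride, which matches the five addresses listed; `packed_layout`, `ranges_overlap`, and `colliding_slot` are hypothetical helper names and say nothing about how CH's allocator is implemented.

```rust
const BAR_SIZE: u64 = 0x8_0000; // 512 KiB

/// Base addresses of `count` BARs packed downward from `top`, `stride` apart.
fn packed_layout(top: u64, stride: u64, count: u64) -> Vec<u64> {
    (0..count).map(|i| top - i * stride).collect()
}

/// True if [a, a + size) and [b, b + size) intersect.
fn ranges_overlap(a: u64, b: u64, size: u64) -> bool {
    a < b + size && b < a + size
}

/// Index of the first BAR, other than the one at index `moving`, that a
/// move to `target` would collide with.
fn colliding_slot(layout: &[u64], moving: usize, target: u64) -> Option<usize> {
    layout
        .iter()
        .enumerate()
        .find(|&(i, &base)| i != moving && ranges_overlap(base, target, BAR_SIZE))
        .map(|(i, _)| i)
}
```

Calling `packed_layout(0x3fffffe00000, BAR_SIZE, 5)` reproduces the five addresses above, and `colliding_slot` on a move of the BAR at 0x3fffffd00000 to 0x3fffffd80000 reports the neighbouring entry, since a densely packed layout leaves no gap between devices.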
Observed:
- ~50% of boots: arbiter happens to pick non-conflicting targets, boot proceeds.
- ~50% of boots: target collides → 6 retries in ~350 ms → guest wedged.
- Any of the 5 virtio devices on the bus can be involved, not just NICs; even 0-NIC configs deadlock.
Error log
cloud-hypervisor: 65.940720s: <vcpu0> WARN: pci/src/bus.rs:471 -- Failed moving device BAR: failed allocating new MMIO range: 0x3fffffd00000->0x3fffffd80000(0x80000), keeping old BAR
cloud-hypervisor: 66.007469s: <vcpu1> WARN: ...same...
cloud-hypervisor: 66.076797s: <vcpu1> WARN: ...same...
cloud-hypervisor: 66.143539s: <vcpu0> WARN: ...same...
cloud-hypervisor: 66.212357s: <vcpu0> WARN: ...same...
cloud-hypervisor: 66.285027s: <vcpu1> WARN: ...same...
(no further log; CH consumes 90+ % CPU indefinitely, guest never reaches login)
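As a quick sanity check on the "6 retries in ~350 ms" figure, the retry window can be computed from the uptime stamps in the log. This is a minimal sketch that assumes every line starts with `cloud-hypervisor: <seconds>s:` as in the excerpt above; `parse_uptime` and `retry_window_ms` are hypothetical helpers, not part of CH.

```rust
/// Parse the uptime in seconds from a line such as
/// "cloud-hypervisor: 65.940720s: <vcpu0> WARN: ...".
fn parse_uptime(line: &str) -> Option<f64> {
    let rest = line.strip_prefix("cloud-hypervisor: ")?;
    let (stamp, _) = rest.split_once("s:")?;
    stamp.trim().parse().ok()
}

/// Time between the first and last parsable line, in milliseconds.
fn retry_window_ms(lines: &[&str]) -> Option<f64> {
    let stamps: Vec<f64> = lines.iter().filter_map(|l| parse_uptime(l)).collect();
    let first = stamps.first()?;
    let last = stamps.last()?;
    Some((last - first) * 1000.0)
}
```

Fed the six WARN lines above, this gives a window of about 344 ms between the first and the sixth rejected write (66.285027 s minus 65.940720 s).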
Version
cloud-hypervisor v51.0.0 + #7938 fix (PR #7950)
VM configuration
cloud-hypervisor \
--firmware CLOUDHV.fd \
--disk path=windows.qcow2,image_type=qcow2,backing_files=on \
--net tap=tap0,mac=...,num_queues=4 \
--rng src=/dev/urandom \
--vsock cid=3,socket=/path/to/vsock.uds \
--watchdog \
--cpus boot=2,kvm_hyperv=on \
--memory size=4G
Guest: Windows 11 25H2 (build 26100), virtio-win 0.1.285
Host: Linux 6.17.0-1009-gcp, KVM, 46-bit phys addressing
Related issues
- detect_bar_reprogramming, based on the PCI spec note that the OS may program BARs at addresses different from the initial assignment.
Proposed fixes
Three options, increasing in invasiveness:
1. Sparse initial layout (mitigation, low risk). Increase the allocator alignment for the virtio capability BAR from CAPABILITY_BAR_SIZE (512 KiB) to e.g. 8 MiB. Initial devices then land 8 MiB apart in MMIO64, leaving ample slack for any address the arbiter might compute. On a 46-bit-phys host the alignment-induced waste is negligible (~64 TiB available). We have this on a defensive branch and Win11 boots to login cleanly with it.
2. Coordinated swap on conflicting move (proper fix). When move_bar(A → B) finds B occupied by device D2, allocate a fresh address C for D2, move D2's mapping to C, write C into D2's BAR config register, and only then complete the original A → B move. The guest's subsequent reads of D2's BAR will see C and reconcile. This mirrors what a real PCI bus rebalance does on hardware.
3. QEMU-style accept-and-stomp. Make mmio_bus.insert an upsert and let two BARs temporarily overlap. Simplest to implement but loses CH's allocator invariants and risks silent data corruption if a device worker hits the overlapped range before the guest finishes its rebalance.
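The coordinated-swap option can be sketched as follows. This is a toy model, not a patch: `ToyBus`, `free_pool`, and `move_bar_with_swap` are hypothetical, the free addresses are assumed to be known up front instead of coming from the real allocator, and the remap plus config-register update for the displaced device is reduced to a map insert.

```rust
use std::collections::BTreeMap;

const BAR_SIZE: u64 = 0x8_0000; // 512 KiB, as in the reproducer

/// Toy bus state: device name -> BAR base. Hypothetical, not CH's data model.
struct ToyBus {
    bars: BTreeMap<&'static str, u64>,
    /// Addresses assumed to be free (stand-in for a real allocator).
    free_pool: Vec<u64>,
}

impl ToyBus {
    /// Device other than `except` whose BAR overlaps [addr, addr + BAR_SIZE).
    fn occupant(&self, addr: u64, except: &str) -> Option<&'static str> {
        self.bars
            .iter()
            .find(|&(&name, &base)| {
                name != except && base < addr + BAR_SIZE && addr < base + BAR_SIZE
            })
            .map(|(&name, _)| name)
    }

    /// Move `dev` to `target`, first relocating any device already there.
    fn move_bar_with_swap(&mut self, dev: &'static str, target: u64) -> Result<(), &'static str> {
        while let Some(victim) = self.occupant(target, dev) {
            let fresh = self
                .free_pool
                .pop()
                .ok_or("no free MMIO range for the displaced device")?;
            // A real VMM would also remap the device here and update its
            // config-space BAR so the guest reads back `fresh`.
            self.bars.insert(victim, fresh);
        }
        self.bars.insert(dev, target);
        Ok(())
    }
}
```

The loop rather than a single check covers an unaligned target that straddles two neighbours. The sketch leaves open the hard parts of the real fix, such as doing the relocation atomically with respect to device worker threads and deciding what happens if the guest has cached the displaced device's old address.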
Defensive branch
Mitigation #1 is on a dedicated branch off cloud-hypervisor/main, ready as a PR if useful. We suspect the proper fix is #2, but #1 is what we ship on our dev fork today since it deterministically avoids the deadlock in our reproducer:
- virtio: 8 MiB-aligned initial BAR placement (sparse initial layout)
- virtio: relax BAR alignment on restore (snapshot/restore needs the natural BAR alignment, since the guest may have rewritten BARs to non-aligned addresses post-rebalance)
Logs
The 6-retry pattern in the error log above is deterministic when the rebalance hits a conflict. The deadlock is observable as: (a) CH process at 90+ % CPU continuously, (b) vm.info shows pci_devices_down == 0 and all 5 BARs at their original addresses, (c) no further log entries after the 6th retry, (d) no progress on vsock or guest agent indefinitely.