Windows guest PnP rebalance deadlocks boot when reprogrammed BAR overlaps another live device #8202

@CMGS

Describe the bug

Windows 11 25H2 guests intermittently deadlock during boot (~50% of boots in our tests). The deadlock occurs when the Windows PnP arbiter rebalances PCI resources at driver-load time and the new BAR address it picks for one virtio device collides with another live virtio device's BAR. CH's move_bar correctly rejects the conflicting allocation, but the guest kernel keeps re-issuing the same write. After a handful of retries the guest's internal PCI state is out of sync with what CH actually has mapped, which wedges early boot before cocoon-agent (or any user-mode service) can start.

This is distinct from #7938 (driver bug → BAR programmed outside allocator range): here the BAR write is legitimate, addresses a valid MMIO region, and is exactly what pci.sys is documented to do.

Root cause: Windows PnP "resource rebalance"

pci.sys implements the PnP Manager's resource rebalance protocol. When a new FDO loads, the arbiter is allowed to recompute and rewrite the BARs of peer devices on the same bus. Microsoft documents this explicitly:

If a user adds a device to a system, and if the device requires system resources that the PnP manager has already assigned to another device, the PnP manager attempts to reassign resources. … It then delivers new resource lists to the devices so that they can restart, using the new resources.
The PnP Manager Redistributes System Resources

The full sequence is IRP_MN_QUERY_STOP_DEVICE → IRP_MN_STOP_DEVICE → arbiter recomputes → IRP_MN_START_DEVICE with a new resource list (Stopping a Device to Rebalance Resources).

Confirming this isn't virtio-win's doing: VirtIO/WDF/PCI.c only consumes the resource list WDF passes it (PCIAllocBars, MmMapIoSpace) — there is no rewrite or veto path in any virtio Windows driver. The rewrite originates above virtio-win, in pci.sys's arbiter.

Linux guests never trigger this — Linux only re-assigns BARs in pci_assign_unassigned_resources (hotplug, rescan), not on every driver bind.

Why CH wedges where QEMU doesn't

QEMU's pci_update_mappings simply unmaps the old MemoryRegion and remaps at the new address on every BAR config write — no allocator check, no failure path. Two BARs targeting the same address would stomp each other on real hardware, and QEMU follows that.

CH instead routes the new address through mmio64_allocator via move_bar. If the new range is occupied, the allocator returns Overlap and the move fails. Even with #7938's rollback fix in place — which correctly restores bars[i].addr and the config register — the guest's view diverges:

  1. Guest writes new BAR (B) via IRP_MN_START_DEVICE's resource list.
  2. CH detects reprogramming, move_bar(A → B) fails, restore_bar_addr writes config back to A.
  3. Guest reads back A, doesn't reconcile with the resource list it just installed, re-issues the write of B. Loop.
  4. After ~6 retries (pci.sys gives up), the guest kernel thinks the BAR is at B (cached in its DEVICE_OBJECT), while CH still has it mapped at A. Subsequent virtio queue setup writes target B, which actually maps to another live virtio device's BAR (the one we refused to evict for the rebalance). The other device's worker thread receives garbage notifications, the originating driver waits forever for a ring response, and boot deadlocks at ~65 s VM uptime.
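The loop in steps 1 to 4 can be illustrated with a toy model. This is a minimal sketch, not Cloud Hypervisor code: `ToyAllocator`, this `move_bar`, and `simulate_rebalance` are hypothetical stand-ins, and the guest is assumed to keep the address from its resource list once it stops retrying.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for an MMIO allocator: range start -> length.
/// Hypothetical; not Cloud Hypervisor's actual allocator type.
#[derive(Default)]
struct ToyAllocator {
    ranges: BTreeMap<u64, u64>,
}

impl ToyAllocator {
    fn overlaps(&self, base: u64, len: u64) -> bool {
        self.ranges
            .iter()
            .any(|(&b, &l)| base < b + l && b < base + len)
    }

    fn allocate(&mut self, base: u64, len: u64) -> Result<(), &'static str> {
        if self.overlaps(base, len) {
            return Err("overlap");
        }
        self.ranges.insert(base, len);
        Ok(())
    }

    fn free(&mut self, base: u64) {
        self.ranges.remove(&base);
    }
}

/// Try to move a BAR from `old` to `new`. On conflict the old mapping is
/// kept and returned as the error value (the "rollback" in step 2).
fn move_bar(alloc: &mut ToyAllocator, old: u64, new: u64, len: u64) -> Result<u64, u64> {
    match alloc.allocate(new, len) {
        Ok(()) => {
            alloc.free(old);
            Ok(new)
        }
        Err(_) => Err(old),
    }
}

/// Simulate the guest re-issuing the same BAR write up to `retries` times.
/// Returns (address the guest believes, address the VMM has mapped).
fn simulate_rebalance(
    alloc: &mut ToyAllocator,
    old: u64,
    new: u64,
    len: u64,
    retries: usize,
) -> (u64, u64) {
    let mut vmm_addr = old;
    for _ in 0..retries {
        match move_bar(alloc, vmm_addr, new, len) {
            Ok(addr) => return (addr, addr),
            Err(kept) => vmm_addr = kept,
        }
    }
    // Assumption: the guest gives up but keeps `new` as its cached address.
    (new, vmm_addr)
}
```

With two adjacent 0x80000-byte BARs allocated at 0x3fffffd00000 and 0x3fffffd80000, asking to move the first onto the second returns a (guest, VMM) pair that disagrees, which is the divergence described in step 4.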

Reproducer

Layout that reproduces the deadlock on roughly 50% of Win11 25H2 cold boots:

  • 5 virtio devices on bus 0 with 512 KiB BARs packed at the top of MMIO64 (the CH default with --disk + --net + --rng + --vsock + --watchdog): slots 1–5 at 0x3fffffe00000, 0x3fffffd80000, 0x3fffffd00000, 0x3fffffc80000, 0x3fffffc00000.
  • Win11 25H2 with virtio-win 0.1.285.
  • After cold boot, around 65 s into VM uptime, in the worst case the arbiter writes vsock's BAR (0x3fffffd00000) → 0x3fffffd80000 (rng's slot). move_bar rejects.
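The slot addresses and the collision can be checked with a few lines of arithmetic. This is an illustrative sketch: `packed_layout` and `colliding_slot` are hypothetical helpers, and the only inputs taken from the report are the top slot address, the 512 KiB BAR size, and the rejected target.

```rust
const BAR_SIZE: u64 = 512 * 1024; // 0x80000
const TOP_SLOT: u64 = 0x3fff_ffe0_0000; // first slot address from the report

/// Bases of `n` equally sized BARs packed downward from `top`.
fn packed_layout(top: u64, size: u64, n: u64) -> Vec<u64> {
    (0..n).map(|i| top - i * size).collect()
}

/// Index of a slot (other than `moving`) whose range overlaps
/// `[target, target + size)`, if any.
fn colliding_slot(layout: &[u64], size: u64, moving: usize, target: u64) -> Option<usize> {
    layout
        .iter()
        .enumerate()
        .find(|&(i, &base)| i != moving && target < base + size && base < target + size)
        .map(|(i, _)| i)
}
```

For the reported move, `colliding_slot(&packed_layout(TOP_SLOT, BAR_SIZE, 5), BAR_SIZE, 2, 0x3fffffd80000)` identifies the neighbouring slot at index 1, since with tightly packed BARs any target inside the occupied window necessarily lands on a live device.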

Observed:

  • ~50% of boots: arbiter happens to pick non-conflicting targets, boot proceeds.
  • ~50% of boots: target collides → 6 retries in ~350 ms → guest wedged.
  • All 5 virtio devices on the bus are affected, not just NICs; even configurations with no NIC deadlock.

Error log

cloud-hypervisor: 65.940720s: <vcpu0> WARN: pci/src/bus.rs:471 -- Failed moving device BAR: failed allocating new MMIO range: 0x3fffffd00000->0x3fffffd80000(0x80000), keeping old BAR
cloud-hypervisor: 66.007469s: <vcpu1> WARN: ...same...
cloud-hypervisor: 66.076797s: <vcpu1> WARN: ...same...
cloud-hypervisor: 66.143539s: <vcpu0> WARN: ...same...
cloud-hypervisor: 66.212357s: <vcpu0> WARN: ...same...
cloud-hypervisor: 66.285027s: <vcpu1> WARN: ...same...
(no further log; CH consumes 90+ % CPU indefinitely, guest never reaches login)
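As a quick sanity check on the timing, the six warning timestamps above can be differenced directly. The helpers below are hypothetical and only restate the log values in microseconds.

```rust
/// VM-uptime timestamps (in microseconds) of the six warnings in the log.
const WARN_US: [u64; 6] = [
    65_940_720, 66_007_469, 66_076_797, 66_143_539, 66_212_357, 66_285_027,
];

/// Gaps between consecutive timestamps.
fn gaps(ts: &[u64]) -> Vec<u64> {
    ts.windows(2).map(|w| w[1] - w[0]).collect()
}

/// Time from the first to the last timestamp (0 for an empty slice).
fn span(ts: &[u64]) -> u64 {
    match (ts.first(), ts.last()) {
        (Some(first), Some(last)) => last - first,
        _ => 0,
    }
}
```

The attempts are spaced roughly 67 to 73 ms apart and cover about 344 ms in total, consistent with the "6 retries in ~350 ms" figure.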

Version

cloud-hypervisor v51.0.0 + #7938 fix (PR #7950)

VM configuration

cloud-hypervisor \
  --firmware CLOUDHV.fd \
  --disk path=windows.qcow2,image_type=qcow2,backing_files=on \
  --net tap=tap0,mac=...,num_queues=4 \
  --rng src=/dev/urandom \
  --vsock cid=3,socket=/path/to/vsock.uds \
  --watchdog \
  --cpus boot=2,kvm_hyperv=on \
  --memory size=4G

Guest: Windows 11 25H2 (build 26100), virtio-win 0.1.285
Host: Linux 6.17.0-1009-gcp, KVM, 46-bit phys addressing

Proposed fixes

Three options, increasing in invasiveness:

  1. Sparse initial layout (mitigation, low risk). Increase the allocator alignment for the virtio capability BAR from CAPABILITY_BAR_SIZE (512 KiB) to e.g. 8 MiB. Initial devices then land 8 MiB apart in MMIO64, leaving ample slack for any address the arbiter might compute. On a 46-bit-phys host the alignment-induced waste is negligible (~64 TiB available). We have this on a defensive branch and Win11 boots to login cleanly with it.

  2. Coordinated swap on conflicting move (proper fix). When move_bar(A → B) finds B occupied by device D2, allocate a fresh address C for D2, move D2's mapping to C, write C into D2's BAR config register as well, and only then complete the original A → B move. The guest's subsequent reads of D2's BAR will see C and reconcile. Mirrors what real PCI bus rebalance does on hardware.

  3. QEMU-style accept-and-stomp. Make mmio_bus.insert an upsert and let two BARs temporarily overlap. Simplest to implement but loses CH's allocator invariants and risks silent data corruption if a device worker hits the overlapped range before the guest finishes its rebalance.
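Option 2 could look roughly like the following. This is a sketch under stated assumptions, not a proposed patch: `ToyBus` and all of its methods are hypothetical, every device has one equally sized BAR, the mapping and the config register are modelled as a single value, and at most one device occupies the target.

```rust
use std::collections::BTreeMap;

/// Toy model of one PCI bus; names are hypothetical, not CH's real types.
struct ToyBus {
    size: u64,
    pool_lo: u64,             // inclusive lower bound of the MMIO pool
    pool_hi: u64,             // exclusive upper bound of the MMIO pool
    bars: BTreeMap<u32, u64>, // device id -> BAR base (mapping == config register)
}

impl ToyBus {
    fn new(size: u64, pool_lo: u64, pool_hi: u64) -> Self {
        ToyBus { size, pool_lo, pool_hi, bars: BTreeMap::new() }
    }

    fn place(&mut self, dev: u32, base: u64) {
        self.bars.insert(dev, base);
    }

    fn bar(&self, dev: u32) -> Option<u64> {
        self.bars.get(&dev).copied()
    }

    fn ranges_overlap(&self, a: u64, b: u64) -> bool {
        a < b + self.size && b < a + self.size
    }

    /// Device other than `except` whose BAR overlaps `[base, base + size)`.
    fn occupant(&self, base: u64, except: Option<u32>) -> Option<u32> {
        self.bars
            .iter()
            .find(|&(&id, &b)| Some(id) != except && self.ranges_overlap(base, b))
            .map(|(&id, _)| id)
    }

    /// First free slot scanning down from the top of the pool, skipping `avoid`.
    fn find_free(&self, avoid: u64) -> Option<u64> {
        let mut candidate = self.pool_hi.checked_sub(self.size)?;
        while candidate >= self.pool_lo {
            if !self.ranges_overlap(candidate, avoid)
                && self.occupant(candidate, None).is_none()
            {
                return Some(candidate);
            }
            candidate = candidate.checked_sub(self.size)?;
        }
        None
    }

    /// Move `dev` to `target`; if another device D2 sits there, relocate D2
    /// to a fresh address C first, then complete the original move.
    fn move_bar_with_swap(&mut self, dev: u32, target: u64) -> Result<(), &'static str> {
        if let Some(d2) = self.occupant(target, Some(dev)) {
            let fresh = self
                .find_free(target)
                .ok_or("no free range for displaced device")?;
            self.bars.insert(d2, fresh); // D2: mapping and config register -> C
        }
        self.bars.insert(dev, target); // original A -> B move
        Ok(())
    }
}
```

In this model the displaced device ends up at the first free slot below the packed block and the requesting device gets the address the guest asked for, so no two BARs overlap at any point. A real implementation would also have to remap the MMIO bus entries and any KVM memory regions for D2, which the sketch does not attempt.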

Defensive branch

Mitigation #1 is on a dedicated branch off cloud-hypervisor/main, ready as a PR if useful. We suspect the proper fix is #2, but #1 is what we ship on our dev fork today since it deterministically avoids the deadlock in our reproducer.
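To illustrate the effect of mitigation #1 (this is not the code on that branch), the sketch below compares a top-down layout with 512 KiB alignment against one with 8 MiB alignment. `align_down` and `top_down_layout` are hypothetical helpers, and the pool top 0x3fffffe80000 is an assumption chosen so that the 512 KiB case reproduces the reported slot addresses.

```rust
/// Round `addr` down to a multiple of `align` (which must be a power of two).
fn align_down(addr: u64, align: u64) -> u64 {
    addr & !(align - 1)
}

/// Allocate `n` BARs of `size` bytes top-down from `top` (exclusive),
/// aligning each base to `align`. Returns the bases in allocation order.
fn top_down_layout(top: u64, size: u64, align: u64, n: usize) -> Vec<u64> {
    let mut bases = Vec::with_capacity(n);
    let mut limit = top; // exclusive upper bound for the next BAR
    for _ in 0..n {
        let base = align_down(limit - size, align);
        bases.push(base);
        limit = base;
    }
    bases
}
```

With 8 MiB alignment the five BARs sit 8 MiB apart, leaving 7.5 MiB of unused address space above each one and consuming about 40 MiB of MMIO64 in total instead of 2.5 MiB.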

Logs

The 6-retry pattern in the error log above is deterministic when the rebalance hits a conflict. The deadlock is observable as: (a) CH process at 90+ % CPU continuously, (b) vm.info shows pci_devices_down == 0 and all 5 BARs at their original addresses, (c) no further log entries after the 6th retry, (d) no progress on vsock or guest agent indefinitely.
