vmm: fix CpuManager deadlock between pause handling and MMIO #136

Merged
phip1611 merged 3 commits into cyberus-technology:gardenlinux from phip1611:investigate-pause-resume-problem on Apr 10, 2026

@phip1611 phip1611 commented Apr 1, 2026

Extract AcpiCpuHotplugController from CpuManager and move the BusDevice implementation to the new type. This separates VMM-internal vCPU management from the guest-visible ACPI CPU hotplug MMIO interface.

Besides clarifying responsibilities and reducing technical debt, this fixes a rare deadlock involving pause handling and MMIO access.

New responsibilities:

  • CpuManager manages VMM-internal vCPU lifecycle and coordination
  • AcpiCpuHotplugController implements the guest-visible ACPI CPU hotplug MMIO interface
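
The split of responsibilities listed above could look roughly like the following sketch. It is not the code from this PR: the `BusDevice` trait here is a simplified local stand-in, and `VcpuState`, `SharedVcpuStates`, `hotplug_controller()`, `add_vcpu()`, and the register offsets are hypothetical names chosen for illustration. The sketch assumes the two types share a small piece of state whose lock is held only briefly and never while waiting on a vCPU thread.

```rust
use std::sync::{Arc, Mutex};

/// Simplified stand-in for a bus device trait (hypothetical signature).
pub trait BusDevice {
    fn read(&mut self, offset: u64, data: &mut [u8]);
    fn write(&mut self, offset: u64, data: &[u8]);
}

/// Hypothetical register layout of the hotplug MMIO window.
pub const SELECTOR_OFFSET: u64 = 0;
pub const STATUS_OFFSET: u64 = 4;

/// Hypothetical per-vCPU state that both types need to see.
#[derive(Default)]
pub struct VcpuState {
    pub present: bool,
}

pub type SharedVcpuStates = Arc<Mutex<Vec<VcpuState>>>;

/// VMM-internal side: vCPU lifecycle and coordination.
pub struct CpuManager {
    states: SharedVcpuStates,
}

/// Guest-visible side: the ACPI CPU hotplug MMIO interface.
pub struct AcpiCpuHotplugController {
    states: SharedVcpuStates,
    selected: usize,
}

impl CpuManager {
    pub fn new(max_vcpus: usize) -> Self {
        let states: Vec<VcpuState> = (0..max_vcpus).map(|_| VcpuState::default()).collect();
        Self { states: Arc::new(Mutex::new(states)) }
    }

    /// Hands out the MMIO device; it shares state but not the CpuManager itself.
    pub fn hotplug_controller(&self) -> AcpiCpuHotplugController {
        AcpiCpuHotplugController { states: Arc::clone(&self.states), selected: 0 }
    }

    /// Marks a vCPU as present; out-of-range ids are ignored.
    pub fn add_vcpu(&self, id: usize) {
        if let Some(state) = self.states.lock().unwrap().get_mut(id) {
            state.present = true;
        }
    }
}

impl BusDevice for AcpiCpuHotplugController {
    fn read(&mut self, offset: u64, data: &mut [u8]) {
        if offset == STATUS_OFFSET && !data.is_empty() {
            // Short critical section: no waiting on other threads while locked.
            let states = self.states.lock().unwrap();
            data[0] = states.get(self.selected).map_or(0, |s| s.present as u8);
        }
    }

    fn write(&mut self, offset: u64, data: &[u8]) {
        if offset == SELECTOR_OFFSET && !data.is_empty() {
            self.selected = data[0] as usize;
        }
    }
}
```

In this sketch the guest-facing `read`/`write` path never needs a reference to `CpuManager`, which is the property the refactoring is after.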

Deadlock scenario

A vCPU thread may exit KVM_RUN to perform an MMIO access that was previously handled by CpuManager. If the VMM thread begins processing a pause event before that MMIO operation has acquired the CpuManager lock, CpuManager::pause() blocks waiting for the vCPU thread to ACK the pause. Meanwhile, the vCPU thread blocks waiting to complete the MMIO operation through the same CpuManager, which it can never lock while pause() is in progress. The VMM is deadlocked.
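
The lock dependency in this scenario can be illustrated with a small, self-contained sketch. It assumes that the VMM thread holds a `Mutex` around `CpuManager` for the duration of `pause()`; the unit structs and both function names are hypothetical and only model the locking, not the real types. To keep the example from hanging, the simulated vCPU thread uses `try_lock` where the real MMIO path would block.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Placeholder types; only their locks matter for this illustration.
struct CpuManager;
struct AcpiCpuHotplugController;

/// Old layout: the MMIO handler lives behind the CpuManager lock.
/// Returns whether the vCPU thread could take the lock it needs.
fn old_design_mmio_gets_lock() -> bool {
    let cpu_manager = Arc::new(Mutex::new(CpuManager));
    // VMM thread: inside pause(), holding the lock and waiting for an ACK.
    let _pause_guard = cpu_manager.lock().unwrap();

    let mmio_target = Arc::clone(&cpu_manager);
    // vCPU thread: must finish its MMIO access before it can ACK the pause.
    thread::spawn(move || {
        let acquired = mmio_target.try_lock().is_ok();
        acquired
    })
    .join()
    .unwrap()
}

/// New layout: the MMIO handler is a separate device with its own lock.
fn new_design_mmio_gets_lock() -> bool {
    let cpu_manager = Arc::new(Mutex::new(CpuManager));
    let controller = Arc::new(Mutex::new(AcpiCpuHotplugController));
    // VMM thread: still holds the CpuManager lock during pause().
    let _pause_guard = cpu_manager.lock().unwrap();

    let mmio_target = Arc::clone(&controller);
    // vCPU thread: the MMIO access no longer touches CpuManager.
    thread::spawn(move || {
        let acquired = mmio_target.try_lock().is_ok();
        acquired
    })
    .join()
    .unwrap()
}
```

In the first function the vCPU thread cannot make progress until the pause completes, and the pause cannot complete until the vCPU thread makes progress. In the second, the MMIO access completes independently, so the vCPU thread can go on to ACK the pause.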

This can occur during early boot or CPU hotplug, when pause events race with MMIO accesses. The issue is rare and timing-dependent, but real. To reproduce, run ch-remote pause and ch-remote resume in a loop while booting a Linux VM (via direct kernel boot).
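
A reproduction loop along these lines might be sketched as follows. The helper names, the iteration count, and the socket path passed in by the caller are hypothetical; the sketch assumes `ch-remote` is on `PATH` and that the VM was started with a matching `--api-socket`.

```rust
use std::io;
use std::process::Command;

/// Builds the argument list for one `ch-remote` invocation.
fn ch_remote_args<'a>(api_socket: &'a str, action: &'a str) -> [&'a str; 3] {
    ["--api-socket", api_socket, action]
}

/// Alternates pause and resume requests against a booting VM.
/// If the VMM deadlocks, the pending `ch-remote` call is expected to hang.
fn pause_resume_loop(api_socket: &str, iterations: usize) -> io::Result<()> {
    for _ in 0..iterations {
        for action in ["pause", "resume"] {
            let status = Command::new("ch-remote")
                .args(ch_remote_args(api_socket, action))
                .status()?;
            if !status.success() {
                return Err(io::Error::new(
                    io::ErrorKind::Other,
                    format!("ch-remote {action} failed with {status}"),
                ));
            }
        }
    }
    Ok(())
}
```

Calling `pause_resume_loop` with the VM's API socket path right after starting the VM keeps pause events arriving during early boot, which is the window described above.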

With the new design, these MMIO operations no longer depend on CpuManager, which removes the deadlock path entirely.

PS: We have the same problem with DeviceManager. That one is more complex to fix, however.

Hints for Reviewers

CI Pipeline: https://gitlab.cyberus-technology.de/cyberus/cloud/libvirt/-/merge_requests/165/pipelines

@phip1611 phip1611 self-assigned this Apr 1, 2026
Comment thread vmm/src/cpu.rs
Comment thread vmm/src/cpu.rs Outdated
Comment thread vmm/src/cpu.rs Outdated
Comment thread vmm/src/cpu.rs Outdated
@phip1611 phip1611 force-pushed the investigate-pause-resume-problem branch 2 times, most recently from 1c9b7eb to 9d3fc06 Compare April 7, 2026 17:17
@phip1611 phip1611 requested a review from arctic-alpaca April 7, 2026 17:17
@phip1611 phip1611 force-pushed the investigate-pause-resume-problem branch from 9d3fc06 to c2b9ffc Compare April 7, 2026 17:20
@phip1611 phip1611 marked this pull request as draft April 8, 2026 06:08
@phip1611 phip1611 force-pushed the investigate-pause-resume-problem branch from c2b9ffc to 153f9d1 Compare April 9, 2026 11:32
@phip1611 phip1611 marked this pull request as ready for review April 9, 2026 11:33
@phip1611 phip1611 force-pushed the investigate-pause-resume-problem branch 2 times, most recently from 199578e to 065e7e8 Compare April 9, 2026 11:36
phip1611 commented Apr 9, 2026

Ready for another review round!

@amphi amphi left a comment

Looks good to me.

Comment thread vmm/src/cpu.rs Outdated
Comment thread vmm/src/device_manager.rs Outdated
@phip1611 phip1611 force-pushed the investigate-pause-resume-problem branch from 065e7e8 to 4658b49 Compare April 9, 2026 20:35
@phip1611 phip1611 requested a review from amphi April 9, 2026 20:35
phip1611 commented Apr 9, 2026

> Looks good to me.

I made significant changes to the implementation since your review, so please recheck.

@phip1611 phip1611 force-pushed the investigate-pause-resume-problem branch from 4658b49 to 7c2c1ce Compare April 9, 2026 20:36
Comment thread vmm/src/lib.rs
@phip1611 phip1611 force-pushed the investigate-pause-resume-problem branch 3 times, most recently from 1bbec5e to bb9a084 Compare April 9, 2026 20:44
@phip1611 phip1611 force-pushed the investigate-pause-resume-problem branch from bb9a084 to 688cf4e Compare April 10, 2026 06:52
Comment thread vmm/src/cpu.rs
This is a prerequisite for the next commit where we need shared access.

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>

Extract AcpiCpuHotplugController from CpuManager and move the BusDevice
implementation to the new type. This separates VMM-internal vCPU
management from the guest-visible ACPI CPU hotplug MMIO interface.

Besides clarifying responsibilities and reducing technical debt, this
fixes a rare deadlock involving pause handling and MMIO access.

New responsibilities:
- CpuManager manages VMM-internal vCPU lifecycle and coordination
- AcpiCpuHotplugController implements the guest-visible ACPI CPU hotplug
  MMIO interface

A vCPU thread may exit KVM_RUN to perform an MMIO access that was
previously handled by CpuManager. If the VMM thread begins processing a
`pause` event before that MMIO operation has acquired the CpuManager
lock, CpuManager::pause() blocks waiting for the vCPU thread to ACK the
pause. Meanwhile, the vCPU thread blocks waiting to complete the MMIO
operation through the same CpuManager, which it can never lock while
pause() is in progress. The VMM is deadlocked.

This can occur during early boot or CPU hotplug, when pause events race
with MMIO accesses. The issue is rare and timing-dependent, but real.
To reproduce, run `ch-remote pause` and `ch-remote resume` in a loop
while booting a Linux VM (via direct kernel boot).

With the new design, these MMIO operations no longer depend on
CpuManager, which removes the deadlock path entirely.

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>

This improves the documentation at various places that the next commits
will touch anyway.

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
@phip1611 phip1611 force-pushed the investigate-pause-resume-problem branch from 688cf4e to f067c39 Compare April 10, 2026 09:04
@phip1611 phip1611 enabled auto-merge (rebase) April 10, 2026 09:04
@phip1611 phip1611 merged commit cc1598c into cyberus-technology:gardenlinux Apr 10, 2026
11 checks passed
@phip1611 phip1611 deleted the investigate-pause-resume-problem branch April 10, 2026 09:11
3 participants