ARM64 GIC IPI delivery failure causes kernel soft lockup on boot with 2+ vCPUs #689

@arewm

Note: This issue was filed by an AI agent (Claude Sonnet) after an
extended debugging session on a real system. All traces, version strings, and
symptom descriptions are from actual observed behavior.

Description

On macOS ARM64 (Apple Silicon M4 Pro), a podman machine with the libkrun
provider locks up during boot whenever 2 or more vCPUs are configured. The
guest kernel hangs in the ARM64 SMP IPI delivery path within a few minutes of
kernel uptime, consistently, across all tested CPU counts (2, 4, 10).

The machine has LastUp: Never after multiple days of attempts with a freshly
initialized machine.

System info

  • MacBook Pro, Apple M4 Pro, 14 cores (10P + 4E), 48 GB RAM
  • macOS 26.5 arm64 (Build 25F71)
  • podman 5.8.2 (Homebrew)
  • krunkit 1.1.1 (slp/homebrew-krunkit tap)
  • libkrun-efi 1.16.0
  • Guest OS: Fedora CoreOS 43.20260316.3.1 (kernel 6.19.7-200.fc43.aarch64)

Steps to reproduce

CONTAINERS_MACHINE_PROVIDER=libkrun podman machine init --cpus 4 my-machine
CONTAINERS_MACHINE_PROVIDER=libkrun podman machine start my-machine

The machine serial log (captured via krunkit's --device virtio-serial) shows
a soft lockup within 250–500s of kernel uptime on every attempt.
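
To check the other CPU counts the same way, the two commands above can be wrapped in a small sweep. This is a minimal sketch, not part of the original report: the machine name `ipi-repro` and the 900-second timeout are assumptions (the timeout is simply chosen to be longer than the observed 250–500s lockup window), and the helper removes the machine after each attempt with `podman machine rm --force`.

```python
import os
import subprocess


def machine_commands(name, cpus):
    """Build the init/start/rm argv lists for one CPU count."""
    return [
        ["podman", "machine", "init", "--cpus", str(cpus), name],
        ["podman", "machine", "start", name],
        ["podman", "machine", "rm", "--force", name],
    ]


def try_cpu_count(cpus, name="ipi-repro", start_timeout=900):
    """Return True if `podman machine start` exits 0 within the timeout.

    `name` and `start_timeout` are hypothetical defaults for illustration.
    """
    # Same provider selection as in the reproduction steps.
    env = dict(os.environ, CONTAINERS_MACHINE_PROVIDER="libkrun")
    init, start, remove = machine_commands(name, cpus)
    subprocess.run(init, env=env, check=True)
    try:
        result = subprocess.run(start, env=env, timeout=start_timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        # A start that hangs past the timeout is treated as a failed boot.
        return False
    finally:
        # Always clean up so the next CPU count starts from a fresh machine.
        subprocess.run(remove, env=env)
```

Calling `{n: try_cpu_count(n) for n in (2, 4, 10)}` would then give a pass/fail map per CPU count; the serial log is still needed to confirm that a failure is the lockup described here and not something else.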

Observed kernel traces

10 CPUs — lockup in module loading (~476s kernel uptime):

[  476.006034] watchdog: BUG: soft lockup - CPU#3 stuck for 443s! [(udev-worker):634]
[  476.006044] CPU#3 Utilization every 4000ms during lockup:
[  476.006044]   #1:  95% system,   0% softirq,   6% hardirq,   0% idle
[  476.006143] Hardware name: Libkrun libkrun Virtual Machine, BIOS 0 01/05/2024
Call trace:
  smp_call_function_many_cond+0x18c/0x778 (P)
  kick_all_cpus_sync+0x4c/0x80
  flush_module_icache+0x88/0xe0
  load_module+0x530/0x998
  init_module_from_file+0xe8/0x158
  idempotent_init_module+0x1e0/0x2d0
  __arm64_sys_finit_module+0x70/0x100

4 CPUs — secondary CPU never leaves WFI (~261s kernel uptime):

[  263.138248] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-...D }
[  263.139589] Sending NMI from CPU 2 to CPUs 1:
Call trace:
  cpuidle_idle_call+0xb0/0x1e8 (P)
  do_idle+0x9c/0x118
  cpu_startup_entry+0x40/0x50
  secondary_start_kernel+0xe4/0x128
  __secondary_switched+0xc0/0xc8
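
For triaging further serial logs, the two report lines above can be picked out mechanically. The following is a rough sketch that assumes the log lines keep exactly the format shown in these traces; the function name and the event dictionary layout are made up for illustration. It also subtracts the "stuck for" duration from the report timestamp, which gives an approximate time at which the CPU stopped making progress (assuming the watchdog counts from its last touch).

```python
import re

# Matches the watchdog report line as it appears in the 10-CPU trace.
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<stuck>\d+)s!"
)
# Matches the RCU expedited-stall line as it appears in the 4-CPU trace.
RCU_STALL = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+rcu: INFO: \w+ detected expedited stalls"
)


def scan_serial_log(lines):
    """Return a list of lockup/stall events found in kernel log lines."""
    events = []
    for line in lines:
        m = SOFT_LOCKUP.search(line)
        if m:
            ts = float(m["ts"])
            stuck = int(m["stuck"])
            events.append({
                "kind": "soft_lockup",
                "cpu": int(m["cpu"]),
                "reported_at": ts,
                # Approximate uptime at which the CPU stopped progressing.
                "stuck_since": ts - stuck,
            })
            continue
        m = RCU_STALL.search(line)
        if m:
            events.append({"kind": "rcu_stall", "reported_at": float(m["ts"])})
    return events
```

Applied to the 10-CPU trace, this suggests CPU#3 was already stuck at roughly 33s of uptime (476s minus 443s), well before the watchdog report quoted above.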

Analysis

Both traces point to the same root cause: libkrun's virtual GIC does not
deliver SGIs (Software Generated Interrupts / IPIs) to vCPUs in WFI (Wait For
Interrupt) idle state.

  • 4-CPU trace: A secondary CPU enters WFI after secondary_start_kernel
    and never wakes to process IPIs, causing RCU stalls and eventual lockup.
  • 10-CPU trace: flush_module_icache broadcasts an IPI to all CPUs to
    synchronize instruction caches after a kernel module load; the remote CPUs
    never acknowledge, causing a soft lockup on the sending CPU.
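
The failure mode described in the two bullets can be illustrated with a toy model. This is not kernel or libkrun code; it is a simplified sketch of the hypothesis, with all names invented for illustration. A sender waits for every remote CPU to acknowledge a broadcast, and a CPU that is idle in WFI only acknowledges if the virtual GIC actually wakes it.

```python
def broadcast_ipi(sender, cpus, idle_cpus, sgi_wakes_idle, max_spins=1000):
    """Return the set of CPUs that never acknowledged the broadcast.

    Toy model only: `sgi_wakes_idle` stands in for whether the virtual GIC
    delivers an SGI to a vCPU that is parked in WFI.
    """
    # One pending entry per remote CPU, loosely analogous to the per-CPU
    # state the sender waits on in a synchronous cross-call.
    pending = {cpu for cpu in cpus if cpu != sender}
    for _ in range(max_spins):  # bounded stand-in for the sender's spin-wait
        for cpu in sorted(pending):
            delivered = sgi_wakes_idle or cpu not in idle_cpus
            if delivered:
                pending.discard(cpu)  # remote CPU ran the handler and acked
        if not pending:
            break
    return pending
```

With `sgi_wakes_idle=True` the returned set is empty; with `sgi_wakes_idle=False` any CPU listed in `idle_cpus` stays pending for the whole spin budget, which is the situation the sending CPU's soft lockup report would correspond to under this hypothesis.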

This may be related to commit 2fc86db ("Remove contention on the Gic"), which
addressed a related GIC contention issue: either that fix is incomplete for
kernel 6.19.x or there has been a regression.

For reference: Podman Desktop already caps libkrun machines at 8 CPUs with the
comment "libkrun has an issue that prevent to start a machine that has been
created with more than 8 cpus". This reproduction shows the lockup occurs
at 2+ CPUs on this system with this kernel version.

Workaround

--cpus 1 avoids all SMP IPI paths. However, the 1-CPU machine cannot complete
startup due to a separate bug in podman machine start's gvproxy lifecycle
handling (filed against containers/podman).
