KVM: kvmheartbeat.sh / kvmspheartbeat.sh hardcoded sysrq reboot causes false-positive host fencing on LINSTOR/DRBD primary storage #13089

@jmsperu

Description


The KVM agent's storage heartbeat scripts (kvmheartbeat.sh, kvmspheartbeat.sh) hard-code an immediate kernel-level reboot via echo b > /proc/sysrq-trigger when a heartbeat write to primary storage times out. This:

  1. Bypasses all OS-level shutdown protections (no clean filesystem unmount, no graceful VM stop)
  2. Drops ALL running VMs on the host instantly
  3. Triggers HA cascade — surviving hosts get flooded with VM restart requests
  4. Cannot be replaced with a gentler action (restart agent, graceful reboot, log-only) without forking the script

Affected files

  • scripts/vm/hypervisor/kvm/kvmheartbeat.sh (around line 162)
  • scripts/vm/hypervisor/kvm/kvmspheartbeat.sh (around line 64)
# Both scripts contain:
/usr/bin/logger -t heartbeat "...will reboot system..."
sync &
sleep 5
echo b > /proc/sysrq-trigger     # <-- hardcoded panic-reboot

Reproduction

  1. CloudStack 4.22.0.0 KVM hypervisor running Ubuntu 24.04
  2. Primary storage: LINSTOR with DRBD replication (Linstor storage pool)
  3. Trigger parallel DRBD resyncs across many resources (e.g. after a peer-node reboot, or linstor resource resume-sync after a maintenance pause)
  4. While resync I/O contends with normal I/O on the same disks, the heartbeat write to its hb file occasionally takes longer than the hardcoded timeout
  5. The host immediately force-reboots itself via sysrq, even though no actual fault exists
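Step 4 can be observed directly by timing a heartbeat-style write during a resync. This is a rough, illustrative probe, not part of the scripts: HB_FILE is a placeholder that should point at a file on the LINSTOR-backed primary storage mount on a real host; it defaults to a temp file here so the snippet runs anywhere.

```shell
# Rough latency probe for the heartbeat write path (illustrative only).
# On a real host, set HB_FILE to a file on the primary storage mount;
# it defaults to a temp file so the snippet is runnable anywhere.
hb="${HB_FILE:-$(mktemp)}"
start=$(date +%s%N)
echo "$(date +%s)" > "$hb" && sync "$hb"   # mimic the script's write-then-sync
end=$(date +%s%N)
echo "heartbeat write took $(( (end - start) / 1000000 )) ms"
```

If the reported time regularly approaches the script's timeout while DRBD resyncs are running, the false-positive fence described above becomes likely.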

Real-world impact

We hit this 5+ times within 4 hours during recovery from an unrelated incident. Cascade pattern:

  • Heartbeat times out on host A → sysrq reboot → 90+ VMs dropped
  • HA worker reschedules those VMs onto host B
  • Host B's I/O spikes → its heartbeat times out → sysrq reboot → all its VMs dropped
  • Host A comes back, repeat

Each reboot took ~3 minutes; total customer-visible outage was several hours. Patching the script to replace echo b > /proc/sysrq-trigger with a logger call immediately stopped the cascade.

The exact log line preceding each reboot:

heartbeat[68685]: kvmspheartbeat.sh will reboot system because it was unable to write the heartbeat to the storage.

Affected versions

  • CloudStack: 4.22.0.0 (cloudstack-agent / cloudstack-common 4.22.0.0)
  • Hypervisor: Ubuntu 24.04 LTS / KVM, libvirt 10.x
  • Storage: LINSTOR 1.33.2 / DRBD 9.3.1 (LINBIT public PPA)
  • Reproduces on both single-cluster and multi-cluster zones

Why this matters

The assumption "heartbeat write timeout = host is dead" was reasonable for NFS shared storage where transient I/O latency is rare and the only failure mode of concern is split-brain during a real network partition.

With LINSTOR/DRBD (or any local storage doing replication), the same disk serves application I/O, replication I/O, and heartbeat I/O — heartbeat can be transiently delayed without the host being dead. A fence-on-failure mechanism shouldn't be a binary panic-button: it should be configurable, and ideally graceful by default.

Existing partial workaround (poorly documented)

There is already an agent property reboot.host.and.alert.management.on.heartbeat.timeout (default true) that, when set to false, prevents the Java-side KVMHAMonitor from invoking the shell script in fence mode (cflag=1). Setting:

reboot.host.and.alert.management.on.heartbeat.timeout=false

avoids the reboot entirely. This is an effective, though poorly documented, workaround for the described issue.

However, this is a binary on/off: it doesn't support intermediate fence actions (restart agent, graceful reboot, log-with-alert). Operators want a middle ground: detect the failure, but react in a less destructive way.

Proposed enhancement

Add a finer-grained kvm.heartbeat.fence.action property that supersedes the binary boolean when set:

# Action when storage heartbeat write fails persistently
# Values: reboot | graceful-reboot | restart-agent | log-only
# Default: reboot (preserves current behavior for backward compatibility)
kvm.heartbeat.fence.action=graceful-reboot

Action semantics:

  • reboot — current behavior (sysrq-trigger), kept as fallback for backward compat
  • graceful-reboot — systemctl reboot instead of sysrq, lets VMs stop cleanly
  • restart-agent — restart cloudstack-agent only; running VMs survive
  • log-only — log + alert, take no automatic action (admin investigates)
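
The four actions above could be dispatched with a small case statement inside the heartbeat scripts. The dry-run sketch below prints the command each mode would execute rather than running it; the function name and the way the property value reaches the script are illustrative assumptions, not shipped code.

```shell
# Dry-run sketch of the proposed kvm.heartbeat.fence.action dispatch.
# fence_action prints the command each mode would run instead of executing it;
# the function name and argument plumbing are illustrative, not shipped code.
fence_action() {
    action="${1:-reboot}"
    case "$action" in
        reboot)          echo "echo b > /proc/sysrq-trigger" ;;        # current hard fence
        graceful-reboot) echo "systemctl reboot" ;;                    # clean shutdown
        restart-agent)   echo "systemctl restart cloudstack-agent" ;;  # running VMs survive
        log-only|*)      echo "logger -t heartbeat 'heartbeat timeout: no action taken'" ;;
    esac
}

fence_action graceful-reboot    # prints: systemctl reboot
```

Defaulting the unset case to reboot preserves today's behavior for existing deployments.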

Existing reboot.host.and.alert.management.on.heartbeat.timeout=false continues to work as a complete bypass (Java-side, never invokes the shell script in fence mode).

Workarounds available today

  1. Set reboot.host.and.alert.management.on.heartbeat.timeout=false in /etc/cloudstack/agent/agent.properties (Java-side flag — best official option until a proper fix lands)
  2. Patch both shell scripts to replace echo b > /proc/sysrq-trigger with a logger line — survives until next package upgrade
  3. Move heartbeat target to host-local storage that's not under DRBD I/O contention
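
Workaround 2 can be applied mechanically with sed. It is demonstrated here on a temp copy of the relevant lines, since installed script paths vary by package and distro (locating them with dpkg -L cloudstack-common is a suggestion, not something stated in the issue).

```shell
# Sketch of workaround 2: replace the sysrq line with a logger call.
# Demonstrated on a temp copy; on a real host, target both heartbeat scripts
# (e.g. locate them with `dpkg -L cloudstack-common | grep heartbeat`).
f=$(mktemp)
cat > "$f" <<'EOF'
/usr/bin/logger -t heartbeat "...will reboot system..."
sync &
sleep 5
echo b > /proc/sysrq-trigger
EOF

sed -i 's|echo b > /proc/sysrq-trigger|/usr/bin/logger -t heartbeat "heartbeat timeout: reboot suppressed by local patch"|' "$f"

grep -c sysrq-trigger "$f" || true   # prints 0: the reboot line is gone
```

As noted above, this local patch is overwritten on the next package upgrade, so the agent.properties flag remains the more durable option.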

Willingness to contribute

A follow-up PR implementing the configurable kvm.heartbeat.fence.action design above is in progress.
