Description
The KVM agent's storage heartbeat scripts (kvmheartbeat.sh, kvmspheartbeat.sh) hard-code an immediate kernel-level reboot via echo b > /proc/sysrq-trigger when a heartbeat write to primary storage times out. This:
- Bypasses all OS-level shutdown protections (no clean filesystem unmount, no graceful VM stop)
- Drops ALL running VMs on the host instantly
- Triggers an HA cascade: surviving hosts get flooded with VM restart requests
- Cannot be replaced with a gentler action (restart agent, graceful reboot, log-only) without forking the script
Affected files
scripts/vm/hypervisor/kvm/kvmheartbeat.sh (around line 162)
scripts/vm/hypervisor/kvm/kvmspheartbeat.sh (around line 64)
# Both scripts contain:
/usr/bin/logger -t heartbeat "...will reboot system..."
sync &
sleep 5
echo b > /proc/sysrq-trigger # <-- hardcoded panic-reboot
Reproduction
- CloudStack 4.22.0.0 KVM hypervisor running Ubuntu 24.04
- Primary storage: LINSTOR with DRBD replication (Linstor storage pool)
- Trigger parallel DRBD resyncs across many resources (e.g. after a peer-node reboot, or linstor resource resume-sync after a maintenance pause)
- While resync I/O contends with normal VM and host I/O on the same disks, the heartbeat write to the pool's hb file occasionally takes longer than the hardcoded timeout (see the probe sketch after this list)
- The host immediately force-reboots itself via sysrq, even though no actual fault exists
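A rough way to observe the latency window without waiting for a real resync (a sketch only; the mount point, file sizes, and probe file name are placeholders, not the actual heartbeat file layout):

# Generate heavy write I/O on the primary-storage mount to stand in for resync traffic
# (/mnt/primary is a placeholder for the pool's mount point)
fio --name=contend --directory=/mnt/primary --size=4G --rw=write --bs=1M \
    --numjobs=4 --ioengine=libaio --direct=1 --runtime=120 --time_based &

# Meanwhile, time a small synchronous write, similar in spirit to the heartbeat update;
# under contention this can easily take several seconds
time dd if=/dev/zero of=/mnt/primary/hb-probe bs=4k count=1 oflag=direct conv=fsync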
Real-world impact
We hit this 5+ times within 4 hours during recovery from an unrelated incident. Cascade pattern:
- Heartbeat times out on host A → sysrq reboot → 90+ VMs dropped
- HA worker reschedules those VMs onto host B
- Host B's I/O spikes → its heartbeat times out → sysrq reboot → all its VMs dropped
- Host A comes back, repeat
Each reboot took ~3 minutes; the total customer-visible outage was several hours. Replacing the echo b > /proc/sysrq-trigger line with a log-only statement immediately stopped the cascade.
The exact log line preceding each reboot:
heartbeat[68685]: kvmspheartbeat.sh will reboot system because it was unable to write the heartbeat to the storage.
Affected versions
- CloudStack: 4.22.0.0 (cloudstack-agent / cloudstack-common 4.22.0.0)
- Hypervisor: Ubuntu 24.04 LTS / KVM, libvirt 10.x
- Storage: LINSTOR 1.33.2 / DRBD 9.3.1 (LINBIT public PPA)
- Reproduces on both single-cluster and multi-cluster zones
Why this matters
The assumption "heartbeat write timeout = host is dead" was reasonable for NFS shared storage where transient I/O latency is rare and the only failure mode of concern is split-brain during a real network partition.
With LINSTOR/DRBD (or any local storage doing replication), the same disk serves application I/O, replication I/O, and heartbeat I/O — heartbeat can be transiently delayed without the host being dead. A fence-on-failure mechanism shouldn't be a binary panic-button: it should be configurable, and ideally graceful by default.
Existing partial workaround (poorly documented)
There is already an agent property reboot.host.and.alert.management.on.heartbeat.timeout (default true) that, when set to false, prevents the Java-side KVMHAMonitor from invoking the shell script in fence mode (cflag=1). Setting:
reboot.host.and.alert.management.on.heartbeat.timeout=false
avoids the reboot entirely. This is a largely undocumented but effective workaround for the described issue.
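To apply it on a host (a minimal sketch; assumes the stock agent.properties location from the packages above and that no configuration management owns the file):

# Remove any existing setting for the key, then append the override
sed -i '/^reboot\.host\.and\.alert\.management\.on\.heartbeat\.timeout/d' \
    /etc/cloudstack/agent/agent.properties
echo 'reboot.host.and.alert.management.on.heartbeat.timeout=false' \
    >> /etc/cloudstack/agent/agent.properties

# agent.properties is read at agent startup, so restart the agent to pick up the change
systemctl restart cloudstack-agent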
However, this is a binary on/off switch; it doesn't support intermediate fence actions (restart agent, graceful reboot, log-with-alert). Operators want a middle ground: detect the failure but react in a less destructive way.
Proposed enhancement
Add a finer-grained kvm.heartbeat.fence.action property that supersedes the binary boolean when set:
# Action when storage heartbeat write fails persistently
# Values: reboot | graceful-reboot | restart-agent | log-only
# Default: reboot (preserves current behavior for backward compatibility)
kvm.heartbeat.fence.action=graceful-reboot
Action semantics:
- reboot — current behavior (sysrq-trigger), kept as fallback for backward compat
- graceful-reboot — systemctl reboot instead of sysrq, lets VMs stop cleanly
- restart-agent — restart cloudstack-agent only; running VMs survive
- log-only — log + alert, take no automatic action (admin investigates)
Existing reboot.host.and.alert.management.on.heartbeat.timeout=false continues to work as a complete bypass (Java-side, never invokes the shell script in fence mode).
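A minimal sketch of how the fence block in the shell scripts could dispatch on the proposed action (illustrative only; how the value reaches the script, e.g. via a new command-line flag passed by KVMHAMonitor, is an open design question, and $fence_action is a placeholder):

case "${fence_action:-reboot}" in
  graceful-reboot)
    /usr/bin/logger -t heartbeat "heartbeat write failed; requesting graceful reboot"
    systemctl reboot
    ;;
  restart-agent)
    /usr/bin/logger -t heartbeat "heartbeat write failed; restarting cloudstack-agent"
    systemctl restart cloudstack-agent
    ;;
  log-only)
    /usr/bin/logger -t heartbeat "heartbeat write failed; no automatic action taken"
    ;;
  reboot|*)
    # current behavior, kept as the default and the fallback for unknown values
    /usr/bin/logger -t heartbeat "heartbeat write failed; will reboot system via sysrq"
    sync &
    sleep 5
    echo b > /proc/sysrq-trigger
    ;;
esac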
Workarounds available today
- Set reboot.host.and.alert.management.on.heartbeat.timeout=false in /etc/cloudstack/agent/agent.properties (Java-side flag; the best official option until a proper fix lands)
- Patch both shell scripts to replace echo b > /proc/sysrq-trigger with a logger line; the patch only lasts until the next package upgrade (see the sketch after this list)
- Move heartbeat target to host-local storage that's not under DRBD I/O contention
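A sketch of the log-only patch mentioned above (the sysrq line is swapped for a logger call; the message wording is illustrative):

/usr/bin/logger -t heartbeat "...will reboot system..."   # original alert line, unchanged
sync &
sleep 5
# echo b > /proc/sysrq-trigger                            # original panic-reboot, disabled
/usr/bin/logger -t heartbeat "heartbeat write timed out; reboot suppressed by local patch"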
Willingness to contribute
A follow-up PR implementing the configurable kvm.heartbeat.fence.action design above is in progress.