Description
The KVM agent's storage heartbeat scripts (kvmheartbeat.sh, kvmspheartbeat.sh) hard-code an immediate kernel-level reboot via echo b > /proc/sysrq-trigger when a heartbeat write to primary storage times out. This:
- Bypasses all OS-level shutdown protections (no clean filesystem unmount, no graceful VM stop)
- Drops ALL running VMs on the host instantly
- Triggers an HA cascade: surviving hosts get flooded with VM restart requests
- Cannot be replaced with a gentler action (restart agent, graceful reboot, log-only) without forking the script
Affected files
scripts/vm/hypervisor/kvm/kvmheartbeat.sh (around line 162)
scripts/vm/hypervisor/kvm/kvmspheartbeat.sh (around line 64)
# Both scripts contain:
/usr/bin/logger -t heartbeat "...will reboot system..."
sync &
sleep 5
echo b > /proc/sysrq-trigger # <-- hardcoded panic-reboot
Reproduction
- CloudStack 4.22.0.0 KVM hypervisor running Ubuntu 24.04
- Primary storage: LINSTOR with DRBD replication (Linstor storage pool)
- Trigger parallel DRBD resyncs across many resources (e.g. after a peer-node reboot, or linstor resource resume-sync after a maintenance pause)
- While resync I/O contends with normal VM and host I/O on the same disks, the heartbeat write to the pool's hb file occasionally takes longer than the hardcoded timeout (see the probe sketch after this list)
- The host immediately force-reboots itself via sysrq, even though no actual fault exists
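A rough way to observe the latency window without waiting for a real resync (a sketch only; the mount point, file sizes, and probe file name are placeholders, not the actual heartbeat file layout):

# Generate heavy write I/O on the primary-storage mount to stand in for resync traffic
# (/mnt/primary is a placeholder for the pool's mount point)
fio --name=contend --directory=/mnt/primary --size=4G --rw=write --bs=1M \
    --numjobs=4 --ioengine=libaio --direct=1 --runtime=120 --time_based &

# Meanwhile, time a small synchronous write, similar in spirit to the heartbeat update;
# under contention this can easily take several seconds
time dd if=/dev/zero of=/mnt/primary/hb-probe bs=4k count=1 oflag=direct conv=fsync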
Real-world impact
We hit this 5+ times within 4 hours during recovery from an unrelated incident. Cascade pattern:
- Heartbeat times out on host A → sysrq reboot → 90+ VMs dropped
- HA worker reschedules those VMs onto host B
- Host B's I/O spikes → its heartbeat times out → sysrq reboot → all its VMs dropped
- Host A comes back, repeat
Each reboot took ~3 minutes; the total customer-visible outage was several hours. Replacing the echo b > /proc/sysrq-trigger line with a log-only statement immediately stopped the cascade.
The exact log line preceding each reboot:
heartbeat[68685]: kvmspheartbeat.sh will reboot system because it was unable to write the heartbeat to the storage.
Affected versions
- CloudStack: 4.22.0.0 (cloudstack-agent / cloudstack-common 4.22.0.0)
- Hypervisor: Ubuntu 24.04 LTS / KVM, libvirt 10.x
- Storage: LINSTOR 1.33.2 / DRBD 9.3.1 (LINBIT public PPA)
- Reproduces on both single-cluster and multi-cluster zones
Why this matters
The assumption "heartbeat write timeout = host is dead" was reasonable for NFS shared storage where transient I/O latency is rare and the only failure mode of concern is split-brain during a real network partition.
With LINSTOR/DRBD (or any local storage doing replication), the same disk serves application I/O, replication I/O, and heartbeat I/O — heartbeat can be transiently delayed without the host being dead. A fence-on-failure mechanism shouldn't be a binary panic-button: it should be configurable, and ideally graceful by default.
Existing partial workaround (poorly documented)
There is already an agent property reboot.host.and.alert.management.on.heartbeat.timeout (default true) that, when set to false, prevents the Java-side KVMHAMonitor from invoking the shell script in fence mode (cflag=1). Setting:
reboot.host.and.alert.management.on.heartbeat.timeout=false
avoids the reboot entirely. This is a largely undocumented but effective workaround for the described issue.
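To apply it on a host (a minimal sketch; assumes the stock agent.properties location from the packages above and that no configuration management owns the file):

# Remove any existing setting for the key, then append the override
sed -i '/^reboot\.host\.and\.alert\.management\.on\.heartbeat\.timeout/d' \
    /etc/cloudstack/agent/agent.properties
echo 'reboot.host.and.alert.management.on.heartbeat.timeout=false' \
    >> /etc/cloudstack/agent/agent.properties

# agent.properties is read at agent startup, so restart the agent to pick up the change
systemctl restart cloudstack-agent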
However, this is a binary on/off switch; it doesn't support intermediate fence actions (restart agent, graceful reboot, log-with-alert). Operators want a middle ground: detect the failure but react in a less destructive way.
Proposed enhancement
Add a finer-grained kvm.heartbeat.fence.action property that supersedes the binary boolean when set:
# Action when storage heartbeat write fails persistently
# Values: reboot | graceful-reboot | restart-agent | log-only
# Default: reboot (preserves current behavior for backward compatibility)
kvm.heartbeat.fence.action=graceful-reboot
Action semantics:
- reboot — current behavior (sysrq-trigger), kept as fallback for backward compat
- graceful-reboot — systemctl reboot instead of sysrq, lets VMs stop cleanly
- restart-agent — restart cloudstack-agent only; running VMs survive
- log-only — log + alert, take no automatic action (admin investigates)
Existing reboot.host.and.alert.management.on.heartbeat.timeout=false continues to work as a complete bypass (Java-side, never invokes the shell script in fence mode).
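A minimal sketch of how the fence block in the shell scripts could dispatch on the proposed action (illustrative only; how the value reaches the script, e.g. via a new command-line flag passed by KVMHAMonitor, is an open design question, and $fence_action is a placeholder):

case "${fence_action:-reboot}" in
  graceful-reboot)
    /usr/bin/logger -t heartbeat "heartbeat write failed; requesting graceful reboot"
    systemctl reboot
    ;;
  restart-agent)
    /usr/bin/logger -t heartbeat "heartbeat write failed; restarting cloudstack-agent"
    systemctl restart cloudstack-agent
    ;;
  log-only)
    /usr/bin/logger -t heartbeat "heartbeat write failed; no automatic action taken"
    ;;
  reboot|*)
    # current behavior, kept as the default and the fallback for unknown values
    /usr/bin/logger -t heartbeat "heartbeat write failed; will reboot system via sysrq"
    sync &
    sleep 5
    echo b > /proc/sysrq-trigger
    ;;
esac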
Workarounds available today
- Set reboot.host.and.alert.management.on.heartbeat.timeout=false in /etc/cloudstack/agent/agent.properties (Java-side flag; the best official option until a proper fix lands)
- Patch both shell scripts to replace echo b > /proc/sysrq-trigger with a logger line; the patch only lasts until the next package upgrade (see the sketch after this list)
- Move heartbeat target to host-local storage that's not under DRBD I/O contention
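A sketch of the log-only patch mentioned above (the sysrq line is swapped for a logger call; the message wording is illustrative):

/usr/bin/logger -t heartbeat "...will reboot system..."   # original alert line, unchanged
sync &
sleep 5
# echo b > /proc/sysrq-trigger                            # original panic-reboot, disabled
/usr/bin/logger -t heartbeat "heartbeat write timed out; reboot suppressed by local patch"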
Willingness to contribute
A follow-up PR implementing the configurable kvm.heartbeat.fence.action design above is in progress.