Description
When using the NAS backup plugin on KVM, if a backup job fails (e.g. due to backup storage being full or I/O errors on the NFS target), the VM remains indefinitely paused at the hypervisor level. CloudStack marks the backup as Error but does not resume the VM, leaving it unresponsive until manually resumed via virsh resume.
Steps to Reproduce
- Configure NAS backup with NFS storage for a running KVM VM
- Fill up the NFS backup storage to 100% capacity
- Wait for the scheduled backup to trigger
- Observe the VM becomes paused and never resumes
Expected Behavior
The VM should be automatically resumed after a backup failure. The backup should be marked as failed, but the VM should continue running normally.
Actual Behavior
The VM remains in a paused state indefinitely. The backup monitoring loop in nasbackup.sh enters an infinite cycle:
- `virsh backup-begin` pauses the QEMU domain for a consistent snapshot
- The backup write fails (storage full)
- `domjobinfo` reports `Failed` status
- `cleanup()` is called but does not resume the VM
- There is no `exit` statement after cleanup, so the loop continues, repeatedly detecting the failed job
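A minimal, self-contained sketch of this cycle (poll_job and cleanup are stand-ins for the script's `virsh domjobinfo` polling and its real cleanup; the loop is bounded to three polls only so the demo terminates):

```sh
# Demonstrates the failure mode: with no exit in the Failed branch,
# cleanup runs again on every poll of the same failed job.
CLEANUP_COUNT=0
cleanup()  { CLEANUP_COUNT=$((CLEANUP_COUNT + 1)); }
poll_job() { echo "Failed"; }     # storage full: every poll reports Failed

polls=0
while [ "$polls" -lt 3 ]; do      # bounded here only so the sketch terminates
  polls=$((polls + 1))
  status=$(poll_job)
  case "$status" in
    Failed)
      cleanup ;;                  # bug: no exit, so cleanup runs again next poll
  esac
done
echo "cleanup ran $CLEANUP_COUNT times for a single failed job"
```

In the real script the loop also sleeps between polls, so the VM stays paused while this repeats forever.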
Root Cause Analysis
Three bugs in scripts/vm/hypervisor/kvm/nasbackup.sh:
Bug 1: Missing exit after failed backup cleanup (line 144)
```sh
case "$status" in
    Failed)
        echo "Virsh backup job failed"
        cleanup ;;  # <-- no exit, falls through to sleep and loops forever
esac
```
Bug 2: cleanup() never resumes the VM (line 222)
The cleanup() function only removes files and unmounts storage. It never checks if the VM is paused or attempts to resume it, even though virsh backup-begin may have paused the domain.
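A sketch of the kind of resume check the fix would add to cleanup() (resume_if_paused is a hypothetical name; VIRSH is parameterized here only so the logic can be exercised without libvirt, whereas the real script would call virsh directly):

```sh
# Resume the domain if virsh backup-begin left it paused. VIRSH defaults
# to the real virsh binary; the variable exists only to make this testable.
VIRSH="${VIRSH:-virsh}"

resume_if_paused() {
  vm="$1"
  state=$("$VIRSH" domstate "$vm" 2>/dev/null)
  if [ "$state" = "paused" ]; then
    "$VIRSH" resume "$vm"
  fi
}
```

Checking `virsh domstate` first keeps the call idempotent: resuming is attempted only when the domain is actually paused, so cleanup after a successful backup is unaffected.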
Bug 3: Missing exit in backup_stopped_vm() (line 181)
Similar to Bug 1, backup_stopped_vm() calls cleanup() on qemu-img convert failure but does not exit, allowing the loop to continue processing subsequent disks.
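The same pattern applies in backup_stopped_vm(); a sketch with a hypothetical convert_disks wrapper (the real script calls qemu-img convert inline; the command is passed in here only for testability):

```sh
# Abort the whole backup on the first failed disk conversion instead of
# letting the loop continue with the remaining disks.
cleanup() { :; }               # placeholder for the script's real cleanup

convert_disks() {
  convert_cmd="$1"; shift      # stand-in for `qemu-img convert ...`
  for disk in "$@"; do
    if ! "$convert_cmd" "$disk"; then
      echo "qemu-img convert failed for $disk"
      cleanup
      exit 1                   # proposed: exit, matching the fix for Bug 1
    fi
  done
}
```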
Impact
- Production outage: All services on the affected VM become unresponsive
- Cascading failures: When backup storage fills up, ALL VMs being backed up get paused simultaneously
- Silent failure: CloudStack UI shows the VM as "Running" while it is actually paused at the KVM level
- No automatic recovery: Manual intervention (`virsh resume`) is required per VM
In our environment, NFS backup storage filling to 100% caused 8 production VMs to become paused simultaneously across 3 KVM hosts, with some VMs remaining paused for over 6 hours before detection.
Environment
- CloudStack 4.19/4.20/main (code is unchanged across versions)
- KVM hypervisor
- NAS backup plugin with NFS storage
- File: `scripts/vm/hypervisor/kvm/nasbackup.sh`
Proposed Fix
PR forthcoming with the following changes:
- Add a VM state check and `virsh resume` to the `cleanup()` function
- Add the missing `exit 1` after `cleanup()` in the `Failed` backup job case
- Add the missing `exit 1` after `cleanup()` in `backup_stopped_vm()` on `qemu-img convert` failure
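Applied to the monitoring loop, the corrected Failed branch would look roughly like this (monitor_step is a hypothetical wrapper around one loop iteration; the substantive change is just the added `exit 1`):

```sh
# One step of the backup monitoring loop with the proposed fix: on a
# Failed job, clean up and then exit instead of falling through to sleep.
cleanup() { :; }               # placeholder for the script's real cleanup

monitor_step() {
  status="$1"
  case "$status" in
    Failed)
      echo "Virsh backup job failed"
      cleanup
      exit 1 ;;                # proposed fix: leave the loop on failure
    Completed)
      cleanup
      exit 0 ;;
  esac
}
```

With the exit in place, CloudStack still marks the backup as Error, but the script terminates after a single cleanup pass rather than re-detecting the same failed job forever.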