Description
When using the NAS backup plugin on KVM, if a backup job fails (e.g. due to backup storage being full or I/O errors on the NFS target), the VM remains indefinitely paused at the hypervisor level. CloudStack marks the backup as Error but does not resume the VM, leaving it unresponsive until manually resumed via virsh resume.
Steps to Reproduce
- Configure NAS backup with NFS storage for a running KVM VM
- Fill up the NFS backup storage to 100% capacity
- Wait for the scheduled backup to trigger
- Observe the VM becomes paused and never resumes
Expected Behavior
The VM should be automatically resumed after a backup failure. The backup should be marked as failed, but the VM should continue running normally.
Actual Behavior
The VM remains in a paused state indefinitely. The backup monitoring loop in nasbackup.sh enters an infinite cycle:
- `virsh backup-begin` pauses the QEMU domain for a consistent snapshot
- The backup write fails (storage full)
- `domjobinfo` reports `Failed` status
- `cleanup()` is called but does not resume the VM
- There is no `exit` statement after cleanup, so the loop continues, repeatedly detecting the failed job
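A minimal, self-contained sketch of this cycle (poll_job and cleanup are stand-ins for the script's `virsh domjobinfo` polling and its real cleanup; the loop is bounded to three polls only so the demo terminates):

```sh
# Demonstrates the failure mode: with no exit in the Failed branch,
# cleanup runs again on every poll of the same failed job.
CLEANUP_COUNT=0
cleanup()  { CLEANUP_COUNT=$((CLEANUP_COUNT + 1)); }
poll_job() { echo "Failed"; }     # storage full: every poll reports Failed

polls=0
while [ "$polls" -lt 3 ]; do      # bounded here only so the sketch terminates
  polls=$((polls + 1))
  status=$(poll_job)
  case "$status" in
    Failed)
      cleanup ;;                  # bug: no exit, so cleanup runs again next poll
  esac
done
echo "cleanup ran $CLEANUP_COUNT times for a single failed job"
```

In the real script the loop also sleeps between polls, so the VM stays paused while this repeats forever.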
Root Cause Analysis
Three bugs in scripts/vm/hypervisor/kvm/nasbackup.sh:
Bug 1: Missing exit after failed backup cleanup (line 144)
```sh
case "$status" in
    Failed)
        echo "Virsh backup job failed"
        cleanup ;;  # <-- no exit, falls through to sleep and loops forever
esac
```
Bug 2: cleanup() never resumes the VM (line 222)
The cleanup() function only removes files and unmounts storage. It never checks if the VM is paused or attempts to resume it, even though virsh backup-begin may have paused the domain.
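A sketch of the kind of resume check the fix would add to cleanup() (resume_if_paused is a hypothetical name; VIRSH is parameterized here only so the logic can be exercised without libvirt, whereas the real script would call virsh directly):

```sh
# Resume the domain if virsh backup-begin left it paused. VIRSH defaults
# to the real virsh binary; the variable exists only to make this testable.
VIRSH="${VIRSH:-virsh}"

resume_if_paused() {
  vm="$1"
  state=$("$VIRSH" domstate "$vm" 2>/dev/null)
  if [ "$state" = "paused" ]; then
    "$VIRSH" resume "$vm"
  fi
}
```

Checking `virsh domstate` first keeps the call idempotent: resuming is attempted only when the domain is actually paused, so cleanup after a successful backup is unaffected.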
Bug 3: Missing exit in backup_stopped_vm() (line 181)
Similar to Bug 1, backup_stopped_vm() calls cleanup() on qemu-img convert failure but does not exit, allowing the loop to continue processing subsequent disks.
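The same pattern applies in backup_stopped_vm(); a sketch with a hypothetical convert_disks wrapper (the real script calls qemu-img convert inline; the command is passed in here only for testability):

```sh
# Abort the whole backup on the first failed disk conversion instead of
# letting the loop continue with the remaining disks.
cleanup() { :; }               # placeholder for the script's real cleanup

convert_disks() {
  convert_cmd="$1"; shift      # stand-in for `qemu-img convert ...`
  for disk in "$@"; do
    if ! "$convert_cmd" "$disk"; then
      echo "qemu-img convert failed for $disk"
      cleanup
      exit 1                   # proposed: exit, matching the fix for Bug 1
    fi
  done
}
```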
Impact
- Production outage: All services on the affected VM become unresponsive
- Cascading failures: When backup storage fills up, ALL VMs being backed up get paused simultaneously
- Silent failure: CloudStack UI shows the VM as "Running" while it is actually paused at the KVM level
- No automatic recovery: Manual intervention (`virsh resume`) is required per VM
In our environment, NFS backup storage filling to 100% caused 8 production VMs to become paused simultaneously across 3 KVM hosts, with some VMs remaining paused for over 6 hours before detection.
Environment
- CloudStack 4.19/4.20/main (code is unchanged across versions)
- KVM hypervisor
- NAS backup plugin with NFS storage
- File: `scripts/vm/hypervisor/kvm/nasbackup.sh`
Proposed Fix
PR forthcoming with the following changes:
- Add a VM state check and `virsh resume` to the `cleanup()` function
- Add the missing `exit 1` after `cleanup()` in the `Failed` backup job case
- Add the missing `exit 1` after `cleanup()` in `backup_stopped_vm()` on `qemu-img convert` failure
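Applied to the monitoring loop, the corrected Failed branch would look roughly like this (monitor_step is a hypothetical wrapper around one loop iteration; the substantive change is just the added `exit 1`):

```sh
# One step of the backup monitoring loop with the proposed fix: on a
# Failed job, clean up and then exit instead of falling through to sleep.
cleanup() { :; }               # placeholder for the script's real cleanup

monitor_step() {
  status="$1"
  case "$status" in
    Failed)
      echo "Virsh backup job failed"
      cleanup
      exit 1 ;;                # proposed fix: leave the loop on failure
    Completed)
      cleanup
      exit 0 ;;
  esac
}
```

With the exit in place, CloudStack still marks the backup as Error, but the script terminates after a single cleanup pass rather than re-detecting the same failed job forever.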