
VM snapshot (checkpoint) and restore #6

@CMGS


Summary

Add VM snapshot (checkpoint) and restore support. Cloud Hypervisor's native vm.pause + vm.snapshot API captures the full VM state (CPU, memory, devices); combined with a copy of the COW disk, this enables fast resume from a saved point.

Background

Cloud Hypervisor provides snapshot/restore via its REST API (shown here through the ch-remote CLI, which needs --api-socket to reach the VMM):

# Snapshot (VM must be paused first)
ch-remote --api-socket api.sock pause
ch-remote --api-socket api.sock snapshot file:///path/to/snapshot

# Restore (new CH process, VM starts in paused state)
cloud-hypervisor --api-socket new.sock --restore source_url=file:///path/to/snapshot
ch-remote --api-socket new.sock resume

Snapshot produces three files:

  • config.json — VM configuration (human-editable, disk paths can be modified before restore)
  • state.json — CPU registers, MSR, virtio device state
  • memory-ranges — full guest RAM dump (size = guest memory)

Disks are NOT included in the snapshot — they must be copied separately.

Key CH Behaviors

  • --restore is mutually exclusive with all VM config flags (--cpus, --memory, --disk, --net, etc.) — config comes entirely from the snapshot's config.json
  • --api-socket CAN be used alongside --restore
  • If the original VM used tap=<name> (CH manages tap), CH recreates the tap automatically on restore — no net_fds needed
  • After restore + resume, the VM is a normal running VM. A subsequent stop/start uses the standard CLI args built from the VMRecord (cold boot from disk; memory state is lost)
  • config.json disk paths are absolute, so they must be updated when restoring to a different directory
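Because --restore cannot be combined with VM config flags, the restore command line reduces to the binary, the API socket and the snapshot URL. A minimal sketch of building that argv, assuming a hypothetical helper `restoreArgs`; the nsenter wrapping from the restore flow is left out:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// restoreArgs builds the argv for restoring a CH snapshot (hypothetical helper).
// Only --api-socket is added next to --restore: CH rejects VM config flags such
// as --cpus, --memory, --disk or --net here, because the whole VM config comes
// from the snapshot's config.json.
func restoreArgs(binary, apiSocket, snapshotDir string) ([]string, error) {
	// file:// URLs need an absolute path. The path is not URL-escaped, so this
	// assumes directory names without spaces or other special characters.
	if !filepath.IsAbs(snapshotDir) {
		return nil, fmt.Errorf("snapshot dir must be absolute, got %q", snapshotDir)
	}
	return []string{
		binary,
		"--api-socket", apiSocket,
		"--restore", "source_url=file://" + filepath.Clean(snapshotDir),
	}, nil
}
```

For example, `restoreArgs("cloud-hypervisor", "new.sock", "/path/to/snapshot")` yields the same restore command as in the Background section.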

Proposed Design

Storage Layout

runDir/{vmID}/
├── cow.raw                        # active COW disk
├── api.sock, ch.pid, ...          # runtime files
└── snapshots/
    └── {snapshot-name}/
        ├── meta.json              # cocoonv2 metadata (VMRecord + ImageBlobIDs)
        ├── config.json            # CH VM config (from vm.snapshot)
        ├── state.json             # CH device state (from vm.snapshot)
        ├── memory-ranges          # guest RAM dump (from vm.snapshot)
        └── cow.raw                # COW disk copy (reflink or rsync --sparse)
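The layout can be captured in a few path helpers so that snapshot code never builds paths by hand. A sketch with hypothetical names (`vmLayout`, `validSnapshotName`); only the file and directory names come from the tree above. Validating the snapshot name matters because it is joined straight into a path, so a name like `../other-vm` must not escape `snapshots/`:

```go
package main

import "path/filepath"

// vmLayout maps the proposed directory tree to paths. The type and its
// methods are hypothetical; the file names come from the proposed layout.
type vmLayout struct {
	RunDir string
	VMID   string
}

func (l vmLayout) Dir() string       { return filepath.Join(l.RunDir, l.VMID) }
func (l vmLayout) ActiveCOW() string { return filepath.Join(l.Dir(), "cow.raw") }
func (l vmLayout) APISocket() string { return filepath.Join(l.Dir(), "api.sock") }

// SnapshotDir returns runDir/{vmID}/snapshots/{name}.
func (l vmLayout) SnapshotDir(name string) string {
	return filepath.Join(l.Dir(), "snapshots", name)
}

// SnapshotFile returns one of meta.json, config.json, state.json,
// memory-ranges or cow.raw inside a snapshot directory.
func (l vmLayout) SnapshotFile(name, file string) string {
	return filepath.Join(l.SnapshotDir(name), file)
}

// validSnapshotName rejects names that would escape snapshots/ once joined,
// such as "..", "a/b" or "../other-vm".
func validSnapshotName(name string) bool {
	return name != "" && name != "." && name != ".." && filepath.Base(name) == name
}
```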

Snapshot Flow

Snapshot(vmID, name):
  1. PUT /api/v1/vm.pause
  2. PUT /api/v1/vm.snapshot  destination_url = file://runDir/{vmID}/snapshots/{name}/ (CH expects a file:// URL; runDir is absolute, so it starts with file:///)
  3. Copy COW disk to snapshot dir:
     - btrfs/XFS with reflink: FICLONE ioctl (near-instant; no extra space until blocks diverge)
     - ext4: rsync --sparse (slower; the VM stays paused for the whole copy)
  4. Write meta.json (VMRecord + ImageBlobIDs + timestamp)
  5. PUT /api/v1/vm.resume

Restore Flow

Restore(vmID, snapshotName):
  1. Read meta.json → full VMRecord with StorageConfigs, NetworkConfigs, BootConfig
  2. Create a new VM directory and copy the COW disk back into it
  3. Patch config.json disk paths to point to new VM directory
  4. recoverNetwork: CNI DEL + ADD with IP= arg + TC redirect setup
     (reuses existing recovery code from host-reboot network recovery)
  5. nsenter netns → cloud-hypervisor --api-socket X --restore source_url=...
  6. PUT /api/v1/vm.resume
  7. Write DB record (standard VMRecord, subsequent start/stop uses normal flow)

Post-Restore Lifecycle

After restore, the VM has a complete VMRecord in the DB. Subsequent operations use the standard code path:

  • Stop then Start: Normal cold boot from disk (memory state lost, uses --kernel/--firmware + full CLI args)
  • Host reboot then Start: recoverNetwork + normal cold boot
  • Preserving memory state across every stop (optional): change the Stop flow to pause → snapshot → terminate, and have Start check for an existing snapshot to decide between --restore and a normal boot

New REST API Calls Needed

| Endpoint | Method | Purpose |
|---|---|---|
| /api/v1/vm.pause | PUT | Pause the VM before snapshot |
| /api/v1/vm.resume | PUT | Resume the VM after snapshot or restore |
| /api/v1/vm.snapshot | PUT | Create a snapshot; body: {"destination_url":"file:///path"} |

These are existing CH API endpoints not currently used by cocoonv2.

Interface Changes

type Hypervisor interface {
    // ... existing methods ...
    Snapshot(ctx context.Context, ref string, name string) error
    Restore(ctx context.Context, ref string, name string) (*types.VM, error)
}

Open Questions

1. Snapshot lifecycle: tied to VM or independent?

Option A — Tied to VM: Snapshots live in runDir/{vmID}/snapshots/. Deleting the VM deletes all snapshots. Simple, no orphan cleanup needed.

Option B — Independent storage: Snapshots live in rootDir/snapshots/{snapshotID}/. Survive VM deletion, can rebuild a VM from snapshot. Requires separate GC module.

2. Pause duration during snapshot

The VM must stay paused until the COW disk copy completes. For large disks on ext4 (no reflink), this could take minutes.

Possible mitigations:

  • Recommend btrfs/XFS for runDir (instant reflink clone)
  • Accept the downtime for ext4 users
  • Investigate whether the VM can be resumed first and the disk copied afterwards (this sacrifices strict consistency: the guest may write new data between the snapshot and the disk copy, so the saved memory state would no longer match the saved disk)

3. memory-ranges file size

The memory-ranges file is as large as guest RAM: a 4GB VM produces a 4GB file, and each additional snapshot adds another full copy.

Options:

  • Accept as-is (disk is cheap)
  • Compress with zstd (unused guest memory pages are mostly zeros, so the ratio is good); the file must be decompressed before restore, since CH reads an uncompressed memory-ranges file
  • Limit number of snapshots per VM

4. Image blob GC protection

Snapshots reference read-only layers (EROFS blobs for OCI, base qcow2 for cloudimg) that are NOT included in the snapshot. If the image is garbage-collected, restore fails.

Solution: Include ImageBlobIDs in meta.json. GC must check snapshot references before collecting blobs.

5. Scale-to-zero mode

Optional future enhancement: change Stop to pause → snapshot → terminate and Start to check snapshot → --restore or cold boot. This gives ~200ms wake-up time (per Koyeb benchmarks) but doubles stop time and disk usage.

Known CH Limitations

  • No incremental snapshots (always full memory dump)
  • virtiofs root restore hangs (CH Issue #6931) — cocoonv2 doesn't use virtiofs
  • Cross-CH-version restore not supported
  • VFIO (GPU passthrough) VMs cannot snapshot
  • Hot-plugged memory regions not restored (CH Issue #3165)
