Summary
Add VM snapshot (checkpoint) and restore support, leveraging Cloud Hypervisor's native vm.pause + vm.snapshot API to capture full VM state (CPU, memory, devices) and COW disk, enabling fast resume from a saved point.
Background
Cloud Hypervisor provides snapshot/restore via REST API:
# Snapshot (VM must be paused first)
ch-remote --api-socket /path/to/api.sock pause
ch-remote --api-socket /path/to/api.sock snapshot file:///path/to/snapshot
# Restore (new CH process, VM starts in paused state)
cloud-hypervisor --api-socket new.sock --restore source_url=file:///path/to/snapshot
ch-remote --api-socket new.sock resume
Snapshot produces three files:
- config.json: VM configuration (human-editable; disk paths can be modified before restore)
- state.json: CPU registers, MSRs, virtio device state
- memory-ranges: full guest RAM dump (size = guest memory)
Disks are NOT included in the snapshot — they must be copied separately.
Key CH Behaviors
- --restore is mutually exclusive with all VM config flags (--cpus, --memory, --disk, --net, etc.); the configuration comes entirely from the snapshot's config.json
- --api-socket can be used alongside --restore
- If the original VM used tap=<name> (CH manages the tap), CH recreates the tap automatically on restore, so no net_fds are needed
- After restore + resume, the VM is a normal running VM. Subsequent stop/start uses standard CLI args built from VMRecord (cold boot from disk; memory state is lost)
- config.json disk paths are absolute and must be updated if restoring to a different directory
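The last point can be illustrated with a minimal Go sketch that rewrites disk paths before restore. It assumes config.json carries a top-level "disks" array whose entries have a string "path" field; that layout, the function name patchDiskPaths, and the example directories are assumptions for illustration, not something the snapshot format guarantees. Decoding into a generic map keeps every other field intact.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"strings"
)

// patchDiskPaths rewrites every disks[].path located under oldDir so that it
// points under newDir. Paths outside oldDir (for example shared read-only
// image layers) are left untouched.
// Assumption: config.json has a top-level "disks" array of objects with "path".
func patchDiskPaths(raw []byte, oldDir, newDir string) ([]byte, error) {
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.UseNumber() // keep large integers such as memory sizes exact
	var cfg map[string]any
	if err := dec.Decode(&cfg); err != nil {
		return nil, fmt.Errorf("decode config.json: %w", err)
	}
	disks, _ := cfg["disks"].([]any)
	oldPrefix := strings.TrimSuffix(oldDir, "/") + "/"
	newPrefix := strings.TrimSuffix(newDir, "/") + "/"
	for _, d := range disks {
		disk, ok := d.(map[string]any)
		if !ok {
			continue
		}
		p, ok := disk["path"].(string)
		if !ok || !strings.HasPrefix(p, oldPrefix) {
			continue
		}
		disk["path"] = newPrefix + strings.TrimPrefix(p, oldPrefix)
	}
	// Note: json.Marshal emits object keys in sorted order.
	return json.Marshal(cfg)
}
```

A caller would read config.json from the snapshot directory, pass it through patchDiskPaths with the old and new VM directories, and write the result back before launching cloud-hypervisor with --restore.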
Proposed Design
Storage Layout
runDir/{vmID}/
├── cow.raw                      # active COW disk
├── api.sock, ch.pid, ...        # runtime files
└── snapshots/
    └── {snapshot-name}/
        ├── meta.json            # cocoonv2 metadata (VMRecord + ImageBlobIDs)
        ├── config.json          # CH VM config (from vm.snapshot)
        ├── state.json           # CH device state (from vm.snapshot)
        ├── memory-ranges        # guest RAM dump (from vm.snapshot)
        └── cow.raw              # COW disk copy (reflink or rsync --sparse)
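Because the snapshot name becomes a path component in this layout, it is worth validating before use. The sketch below shows one way to derive the snapshot directory; the character-set policy, the 63-character limit, and the helper name snapshotDir are hypothetical choices, not part of the proposal.

```go
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
)

// validName restricts snapshot names to a conservative character set so a
// name can never escape the snapshots/ directory (hypothetical policy).
var validName = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9._-]{0,62}$`)

// snapshotDir returns runDir/{vmID}/snapshots/{name} for a valid name.
func snapshotDir(runDir, vmID, name string) (string, error) {
	if !validName.MatchString(name) {
		return "", fmt.Errorf("invalid snapshot name %q", name)
	}
	return filepath.Join(runDir, vmID, "snapshots", name), nil
}
```

Rejecting names that start with a dot or contain a slash rules out inputs such as "../other-vm" without any further path cleaning.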
Snapshot Flow
Snapshot(vmID, name):
1. PUT /api/v1/vm.pause
2. PUT /api/v1/vm.snapshot with destination_url = file://{runDir}/{vmID}/snapshots/{name}/
3. Copy COW disk to snapshot dir:
   - btrfs/XFS (reflink-enabled): FICLONE ioctl (near-instant, no extra space until blocks diverge)
   - ext4: rsync --sparse (slower, VM stays paused during copy)
4. Write meta.json (VMRecord + ImageBlobIDs + timestamp)
5. PUT /api/v1/vm.resume
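One detail the flow leaves implicit is what happens when a step fails after the pause: the guest should not stay paused. The following Go sketch orders the steps and always attempts a resume once the pause has succeeded. The vmAPI interface, the callback parameters, and the recordingAPI fake are hypothetical scaffolding for illustration, not existing cocoonv2 types.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// vmAPI is a hypothetical narrow view of a Cloud Hypervisor API client.
type vmAPI interface {
	Pause(ctx context.Context) error
	Snapshot(ctx context.Context, destURL string) error
	Resume(ctx context.Context) error
}

// snapshotVM runs pause → snapshot → disk copy → meta.json → resume.
// After a successful pause, resume is attempted even if a later step fails.
func snapshotVM(ctx context.Context, api vmAPI, destURL string, copyDisk, writeMeta func() error) (err error) {
	if err = api.Pause(ctx); err != nil {
		return fmt.Errorf("pause: %w", err)
	}
	defer func() {
		// Fresh context: a cancelled request must not leave the guest paused.
		if rerr := api.Resume(context.Background()); rerr != nil && err == nil {
			err = fmt.Errorf("resume: %w", rerr)
		}
	}()
	if err = api.Snapshot(ctx, destURL); err != nil {
		return fmt.Errorf("snapshot: %w", err)
	}
	if err = copyDisk(); err != nil {
		return fmt.Errorf("copy COW disk: %w", err)
	}
	if err = writeMeta(); err != nil {
		return fmt.Errorf("write meta.json: %w", err)
	}
	return nil
}

// recordingAPI is a fake that records call order, used only for the demo.
type recordingAPI struct{ calls []string }

func (r *recordingAPI) Pause(context.Context) error {
	r.calls = append(r.calls, "pause")
	return nil
}

func (r *recordingAPI) Snapshot(context.Context, string) error {
	r.calls = append(r.calls, "snapshot")
	return nil
}

func (r *recordingAPI) Resume(context.Context) error {
	r.calls = append(r.calls, "resume")
	return nil
}

var errDiskCopy = errors.New("disk copy failed")

// runDemo drives snapshotVM against the fake and returns the observed calls.
func runDemo(copyErr error) ([]string, error) {
	api := &recordingAPI{}
	noop := func() error { return nil }
	err := snapshotVM(context.Background(), api, "file:///tmp/snap",
		func() error { return copyErr }, noop)
	return api.calls, err
}
```

Calling runDemo(errDiskCopy) returns the copy error while the recorded calls still end with "resume", which is the property the deferred block is meant to provide.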
Restore Flow
Restore(vmID, snapshotName):
1. Read meta.json → full VMRecord with StorageConfigs, NetworkConfigs, BootConfig
2. Create new VM directory, copy COW disk back
3. Patch config.json disk paths to point to new VM directory
4. recoverNetwork: CNI DEL + ADD with IP= arg + TC redirect setup
   (reuses the existing host-reboot network recovery code)
5. nsenter netns → cloud-hypervisor --api-socket X --restore source_url=...
6. PUT /api/v1/vm.resume
7. Write DB record (standard VMRecord, subsequent start/stop uses normal flow)
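Step 5 can be sketched as a small argv builder. It assumes the network namespace is entered with the nsenter CLI and its --net=<file> option; how cocoonv2 actually enters the netns is not stated here, and the function name restoreArgv and the paths in the usage note are placeholders.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// restoreArgv builds the command line for restoring a VM inside a network
// namespace. Only --api-socket and --restore are passed to cloud-hypervisor,
// since --restore excludes the other VM config flags.
func restoreArgv(netnsPath, apiSocket, snapDir string) ([]string, error) {
	if !filepath.IsAbs(snapDir) {
		// A file:// URL needs an absolute path.
		return nil, fmt.Errorf("snapshot dir %q must be absolute", snapDir)
	}
	return []string{
		"nsenter", "--net=" + netnsPath,
		"cloud-hypervisor",
		"--api-socket", apiSocket,
		"--restore", "source_url=file://" + snapDir,
	}, nil
}
```

The result could be handed to exec.CommandContext(ctx, argv[0], argv[1:]...), for example with a netns path such as /var/run/netns/vm2 and an API socket under the new VM directory.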
Post-Restore Lifecycle
After restore, the VM has a complete VMRecord in the DB. Subsequent operations use the standard code path:
- Stop then Start: normal cold boot from disk (memory state lost; uses --kernel/--firmware + full CLI args)
- Host reboot then Start: recoverNetwork + normal cold boot
- To preserve memory state on every stop: change the Stop flow to pause → snapshot → terminate, and have Start check for an existing snapshot before choosing between --restore and a normal boot
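The Start-time decision in the last bullet could look like the sketch below: restore only when all three files produced by vm.snapshot are present, otherwise fall back to a cold boot. The bootMode type and the function names are hypothetical, and the existence check is injected so the decision logic can be exercised without touching the filesystem.

```go
package main

import (
	"os"
	"path/filepath"
)

type bootMode int

const (
	coldBoot bootMode = iota
	restoreBoot
)

// snapshotFiles are the three files a CH snapshot consists of.
var snapshotFiles = []string{"config.json", "state.json", "memory-ranges"}

// chooseBoot returns restoreBoot only if every snapshot file exists in
// snapDir; a partial snapshot falls back to a cold boot.
func chooseBoot(snapDir string, exists func(path string) bool) bootMode {
	for _, f := range snapshotFiles {
		if !exists(filepath.Join(snapDir, f)) {
			return coldBoot
		}
	}
	return restoreBoot
}

// fileExists reports whether path is an existing regular file.
func fileExists(path string) bool {
	info, err := os.Stat(path)
	return err == nil && info.Mode().IsRegular()
}
```

In production code the call would be chooseBoot(dir, fileExists); treating an incomplete snapshot as "no snapshot" keeps a crash during Stop from blocking the next Start.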
New REST API Calls Needed
| Endpoint | Method | Purpose |
|---|---|---|
| /api/v1/vm.pause | PUT | Pause VM before snapshot |
| /api/v1/vm.resume | PUT | Resume VM after snapshot |
| /api/v1/vm.snapshot | PUT | Create snapshot {"destination_url":"file:///path"} |
These are existing CH API endpoints not currently used by cocoonv2.
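As a rough illustration of how these endpoints might be called from Go, the sketch below issues PUT requests over the API Unix socket using only the standard library. The helper names (newCHClient, chPut, snapshotBody) are hypothetical, and the error handling simply treats any non-2xx status as a failure.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net"
	"net/http"
)

// newCHClient returns an HTTP client that dials the given Unix socket for
// every request, regardless of the host in the URL.
func newCHClient(socketPath string) *http.Client {
	return &http.Client{Transport: &http.Transport{
		DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
			var d net.Dialer
			return d.DialContext(ctx, "unix", socketPath)
		},
	}}
}

// snapshotBody builds the vm.snapshot request body for an absolute directory.
func snapshotBody(dir string) map[string]string {
	return map[string]string{"destination_url": "file://" + dir}
}

// chPut sends PUT /api/v1/{endpoint} with an optional JSON body.
func chPut(ctx context.Context, c *http.Client, endpoint string, body any) error {
	var rd io.Reader
	if body != nil {
		b, err := json.Marshal(body)
		if err != nil {
			return fmt.Errorf("encode body: %w", err)
		}
		rd = bytes.NewReader(b)
	}
	// The host part is ignored because the transport dials the socket.
	req, err := http.NewRequestWithContext(ctx, http.MethodPut, "http://localhost/api/v1/"+endpoint, rd)
	if err != nil {
		return err
	}
	if body != nil {
		req.Header.Set("Content-Type", "application/json")
	}
	resp, err := c.Do(req)
	if err != nil {
		return fmt.Errorf("%s: %w", endpoint, err)
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		msg, _ := io.ReadAll(io.LimitReader(resp.Body, 4096))
		return fmt.Errorf("%s: HTTP %d: %s", endpoint, resp.StatusCode, msg)
	}
	return nil
}
```

With a client from newCHClient(apiSock), a snapshot would be chPut(ctx, c, "vm.pause", nil), then chPut(ctx, c, "vm.snapshot", snapshotBody(dir)), then chPut(ctx, c, "vm.resume", nil).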
Interface Changes
type Hypervisor interface {
	// ... existing methods ...
	Snapshot(ctx context.Context, ref, name string) error
	Restore(ctx context.Context, ref, name string) (*types.VM, error)
}
Open Questions
1. Snapshot lifecycle: tied to VM or independent?
Option A — Tied to VM: Snapshots live in runDir/{vmID}/snapshots/. Deleting the VM deletes all snapshots. Simple, no orphan cleanup needed.
Option B — Independent storage: Snapshots live in rootDir/snapshots/{snapshotID}/. Survive VM deletion, can rebuild a VM from snapshot. Requires separate GC module.
2. Pause duration during snapshot
The VM must stay paused until the COW disk copy completes. For large disks on ext4 (no reflink), this could take minutes.
Possible mitigations:
- Recommend btrfs/XFS for runDir (instant reflink clone)
- Accept the downtime for ext4 users
- Investigate whether we can resume first and copy the disk afterwards (sacrifices strict consistency: the guest may have written new data between the snapshot and the disk copy)
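To make the reflink-versus-copy trade-off concrete, here is a Linux-only Go sketch that tries a FICLONE reflink first and falls back to a plain byte copy when the filesystem does not support it. The FICLONE constant is the value used on x86_64 and arm64 and is an assumption on other architectures; the fallback uses io.Copy for brevity and does not preserve sparseness, so a real ext4 path would still want rsync --sparse or a hole-aware copy. The function names are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"syscall"
)

// ficlone is the FICLONE ioctl request number on x86_64/arm64 Linux.
const ficlone = 0x40049409

// cloneOrCopy reflinks src to dst when the filesystem supports it and
// otherwise copies the bytes. It reports whether a reflink was made.
func cloneOrCopy(src, dst string) (reflinked bool, err error) {
	in, err := os.Open(src)
	if err != nil {
		return false, err
	}
	defer in.Close()
	out, err := os.OpenFile(dst, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o600)
	if err != nil {
		return false, err
	}
	defer func() {
		if cerr := out.Close(); err == nil {
			err = cerr
		}
	}()
	if _, _, errno := syscall.Syscall(syscall.SYS_IOCTL, out.Fd(), ficlone, in.Fd()); errno == 0 {
		return true, nil
	}
	// Reflink unsupported (e.g. ext4, tmpfs): fall back to a full copy.
	if _, err = io.Copy(out, in); err != nil {
		return false, fmt.Errorf("fallback copy: %w", err)
	}
	return false, out.Sync()
}

// selfCheck copies a small file in a temp dir and verifies the contents.
func selfCheck() error {
	dir, err := os.MkdirTemp("", "clone-demo")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)
	src, dst := filepath.Join(dir, "cow.raw"), filepath.Join(dir, "cow.copy")
	want := []byte("guest disk bytes")
	if err := os.WriteFile(src, want, 0o600); err != nil {
		return err
	}
	if _, err := cloneOrCopy(src, dst); err != nil {
		return err
	}
	got, err := os.ReadFile(dst)
	if err != nil {
		return err
	}
	if !bytes.Equal(got, want) {
		return fmt.Errorf("copy mismatch: %q", got)
	}
	return nil
}
```

The returned reflinked flag lets the caller log which path was taken, which is useful when estimating pause duration on a given host.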
3. memory-ranges file size
The memory-ranges file is as large as guest RAM: a 4 GB VM produces a 4 GB file, and every additional snapshot adds another full copy.
Options:
- Accept as-is (disk is cheap)
- Compress with zstd (guest unused memory pages are mostly zeros, good compression ratio)
- Limit number of snapshots per VM
4. Image blob GC protection
Snapshots reference read-only layers (EROFS blobs for OCI, base qcow2 for cloudimg) that are NOT included in the snapshot. If the image is garbage-collected, restore fails.
Solution: Include ImageBlobIDs in meta.json. GC must check snapshot references before collecting blobs.
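A minimal sketch of that GC check follows. It assumes meta.json serializes the blob list under an "image_blob_ids" key; the key name, the snapshotMeta struct, and both function names are hypothetical. The sketch fails closed: if any meta.json cannot be parsed, it returns an error so the GC run can be aborted instead of deleting a blob that a snapshot may still need.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// snapshotMeta is the subset of meta.json relevant to GC.
// Assumption: blob IDs are stored under "image_blob_ids".
type snapshotMeta struct {
	ImageBlobIDs []string `json:"image_blob_ids"`
}

// protectedBlobs returns the set of blob IDs referenced by any snapshot.
func protectedBlobs(metaFiles [][]byte) (map[string]bool, error) {
	keep := make(map[string]bool)
	for i, raw := range metaFiles {
		var m snapshotMeta
		if err := json.Unmarshal(raw, &m); err != nil {
			return nil, fmt.Errorf("meta.json #%d: %w", i, err)
		}
		for _, id := range m.ImageBlobIDs {
			keep[id] = true
		}
	}
	return keep, nil
}

// collectable filters GC candidates down to blobs no snapshot references.
func collectable(candidates []string, keep map[string]bool) []string {
	var out []string
	for _, id := range candidates {
		if !keep[id] {
			out = append(out, id)
		}
	}
	return out
}
```

The GC would gather the contents of every snapshots/*/meta.json, call protectedBlobs, and only delete what collectable returns.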
5. Scale-to-zero mode
Optional future enhancement: change Stop to pause → snapshot → terminate and Start to check snapshot → --restore or cold boot. This gives ~200ms wake-up time (per Koyeb benchmarks) but doubles stop time and disk usage.
Known CH Limitations
- No incremental snapshots (always full memory dump)
- virtiofs root restore hangs (CH Issue #6931) — cocoonv2 doesn't use virtiofs
- Cross-CH-version restore not supported
- VFIO (GPU passthrough) VMs cannot snapshot
- Hot-plugged memory regions not restored (CH Issue #3165)