macos: add support for macos, using builtin Apple Virtualization and APFS cow #3
Merged
Add src/backend/mod.rs with the three backend traits that abstract platform-specific operations behind a common interface:
- VmBackend: hypervisor lifecycle (start, stop, pause, resume)
- StorageBackend: disk images, CoW clones, snapshots, mount/unmount
- NetworkBackend: VM networking setup, teardown, IP discovery
Also defines supporting types: StartedVm, SnapshotInfo, InitConfig.
These traits will be implemented by platform-specific types (Linux: Firecracker+ZFS+TAP, macOS: AVF+APFS+vmnet) and selected at compile time via #[cfg(target_os)] type aliases.
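A minimal sketch of the compile-time selection described above. The trait name VmBackend and the type names come from the commit messages; the method signatures, the hypervisor() helper, and the stub bodies are assumptions for illustration only, and the real backends are cfg-gated modules rather than side-by-side structs.

```rust
/// Hypervisor lifecycle trait. Only a subset of the methods named in the
/// commit message is shown; all signatures are illustrative assumptions.
pub trait VmBackend {
    /// Hypothetical helper used here to show which backend was selected.
    fn hypervisor(&self) -> &'static str;
    fn start(&mut self) -> Result<(), String>;
    fn stop(&mut self) -> Result<(), String>;
    fn is_running(&self) -> bool;
}

/// Stub standing in for the Firecracker-backed implementation.
#[derive(Default)]
pub struct LinuxVm {
    running: bool,
}

/// Stub standing in for the Apple Virtualization Framework implementation.
#[derive(Default)]
pub struct MacosVm {
    running: bool,
}

impl VmBackend for LinuxVm {
    fn hypervisor(&self) -> &'static str {
        "firecracker"
    }
    fn start(&mut self) -> Result<(), String> {
        self.running = true;
        Ok(())
    }
    fn stop(&mut self) -> Result<(), String> {
        self.running = false;
        Ok(())
    }
    fn is_running(&self) -> bool {
        self.running
    }
}

impl VmBackend for MacosVm {
    fn hypervisor(&self) -> &'static str {
        "avf"
    }
    fn start(&mut self) -> Result<(), String> {
        self.running = true;
        Ok(())
    }
    fn stop(&mut self) -> Result<(), String> {
        self.running = false;
        Ok(())
    }
    fn is_running(&self) -> bool {
        self.running
    }
}

// Callers only ever name `Vm`; the concrete type is fixed at compile time.
#[cfg(target_os = "linux")]
pub type Vm = LinuxVm;
#[cfg(target_os = "macos")]
pub type Vm = MacosVm;
```

With this shape, CLI code can write Vm::default() and call trait methods without any platform checks of its own.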
Create src/backend/linux/vm.rs with LinuxVm implementing VmBackend. Wraps the existing firecracker::process and firecracker::api modules behind the trait interface:
- start: spawn Firecracker, configure via API, boot (expects network to be pre-configured in VmMetadata by NetworkBackend::setup)
- stop: graceful SSH reboot → SendCtrlAltDel → wait → SIGKILL fallback
- force_stop: SIGKILL immediately
- pause/resume: Firecracker PATCH /vm API
- is_running: kill(pid, 0) check
The CLI still calls firecracker directly; migration to the backend trait happens in a later task.
Create src/backend/linux/storage.rs with LinuxStorage implementing
StorageBackend. Wraps the existing zfs::pool, zfs::dataset, zfs::volume,
and zfs::snapshot modules behind the trait interface.
LinuxStorage holds ZFS dataset paths (derived from GlobalConfig) so
trait methods can map short names to full zvol paths:
- init: create ZFS datasets (pool/dataset/{images,vms})
- create_image_volume: create zvol, dd image, snapshot @base
- clone_for_vm: zfs clone image@base → vms/vm_name
- snapshot/restore/delete: zfs snapshot/rollback/destroy
- list_snapshots: zfs list (filters out @base)
- resize: zfs set volsize + e2fsck + resize2fs
- mount/unmount: mount block device / umount
- destroy_{vm,image}_storage: zfs destroy -r
Also updated StorageBackend trait to use &self methods (so the
implementation can hold config state), with init remaining an
associated function.
Create src/backend/linux/network.rs with LinuxNetwork implementing NetworkBackend. Wraps the existing network::ip, network::tap, network::nat, and network::wan modules behind the trait interface.
LinuxNetwork holds a StateStore for IP allocation tracking:
- setup: detect WAN interface, allocate /30 IP block, create TAP device, enable IP forwarding, add iptables NAT rules. Includes rollback cleanup on failure at each step.
- teardown: best-effort removal of iptables rules, TAP device, IP allocation.
- discover_guest_ip: not used on Linux (IPs are statically allocated).
Also updated NetworkBackend trait to use &self methods (consistent with StorageBackend) so implementations can hold state.
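To illustrate the /30 allocation mentioned above, here is a sketch of how the n-th block could be derived from a base address. The names Slash30 and slash30_block, and the choice of which usable address goes to the host (TAP) side and which to the guest, are assumptions and not taken from the ember source.

```rust
use std::net::Ipv4Addr;

/// The usable addresses of one /30 block (hypothetical type).
/// A /30 holds four addresses: network, two usable hosts, broadcast.
#[derive(Debug, PartialEq)]
pub struct Slash30 {
    pub network: Ipv4Addr,
    pub host: Ipv4Addr,  // assumed: first usable address, host/TAP side
    pub guest: Ipv4Addr, // assumed: second usable address, guest side
}

/// Returns the `index`-th /30 block counted from `base`, or None if the
/// arithmetic would run past the end of the IPv4 address space.
/// `base` is assumed to be aligned to a multiple of four addresses.
pub fn slash30_block(base: Ipv4Addr, index: u32) -> Option<Slash30> {
    let offset = index.checked_mul(4)?;
    let start = u32::from(base).checked_add(offset)?;
    // Make sure the broadcast address (start + 3) also fits.
    start.checked_add(3)?;
    Some(Slash30 {
        network: Ipv4Addr::from(start),
        host: Ipv4Addr::from(start + 1),
        guest: Ipv4Addr::from(start + 2),
    })
}
```

Because each block is a pure function of its index, a state store only has to track which indices are in use.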
…elpers
Move the ext4 image creation pipeline (create, estimate_size_mib, mount_loop, umount, copy_rootfs) from src/image/ext4.rs into src/backend/linux/image.rs. This is the Linux-specific implementation; macOS will later provide its own backend/macos/image.rs using hdiutil attach/detach instead of loop mount.
src/image/ext4.rs becomes a thin re-export layer so all existing call sites continue to work unchanged. LinuxStorage::unmount now calls the backend image module directly instead of going through the re-export.
Replace ~85 direct calls to crate::zfs, crate::firecracker, and crate::network in the CLI layer with calls through the backend trait abstractions (VmBackend, StorageBackend, NetworkBackend).
Key changes:
- backend/mod.rs: add type aliases (Vm, Storage, Network), add disk_device_path(), clone_from_snapshot(), destroy_fork_origin() to StorageBackend trait, add device field to InitConfig
- backend/linux/storage.rs: implement new trait methods, move pool creation into init(), add Clone derive, update mount() to wait for device
- cli/init.rs: use Storage::init() instead of direct zfs:: calls
- cli/snapshot.rs: use Storage methods for all snapshot operations
- cli/image.rs: use Storage for image volume create/destroy
- cli/vm.rs: use Vm/Storage/Network backends for start/stop/pause/resume/resize/create/fork/delete, remove ~200 lines of duplicated platform-specific helper functions
- state/store.rs: add Clone derive to StateStore
- state/vm.rs: add VmMetadata::default_for_teardown() helper
The CLI no longer imports zfs::, firecracker::, or network:: directly. The only remaining platform-specific call is network::wan::detect() in init (a utility, not a per-VM backend operation).
cargo build and cargo test pass with no behavior change after the backend trait extraction refactoring. The only test failure (mkfs_ext4_on_sparse_file) is pre-existing (mkfs.ext4 not installed).
Set up the ember-vz SPM project with swift-argument-parser for CLI parsing. Defines the Start subcommand with all flags from the spec (--kernel, --disk, --cpus, --memory, --boot-args, --network, --serial-log, --ready-fd). Links against Virtualization.framework. Implementation is stubbed out for subsequent tasks to fill in.
Design a btrfs alternative to ZFS for copy-on-write VM storage on Linux. Uses cp --reflink=always for instant file clones, with managed filesystem creation at ember init time. Both backends coexist via runtime selection.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement the core VM boot logic using Apple Virtualization Framework:
- VZLinuxBootLoader for direct kernel boot with configurable boot args
- VZVirtioBlockDeviceConfiguration for raw ext4 disk image (/dev/vda)
- VZVirtioNetworkDeviceConfiguration with VZNATNetworkDeviceAttachment (vmnet shared)
- VZVirtioConsoleDeviceSerialPortConfiguration for serial console output
- VZVirtioEntropyDeviceConfiguration for guest /dev/urandom
- VZVirtioTraditionalMemoryBalloonDeviceConfiguration for memory management
- VMDelegate that exits the process when the guest shuts down
- MAC address reported to stderr (and to --ready-fd if provided)
Signal handling (SIGTERM/SIGUSR1/SIGUSR2) and serial log file support are scaffolded via CLI flags but will be fully implemented in subsequent tasks.
Install a DispatchSource signal handler for SIGTERM that calls VZVirtualMachine.stop() to gracefully shut down the guest. The handler runs on the main queue (required by VZVirtualMachine) and falls back to exit(0) if stop() fails.
SIGKILL cannot be caught or handled — the OS terminates the process immediately. The SIGKILL force-stop path is implemented on the caller side (ember Rust CLI sends SIGTERM, waits, then SIGKILL). No ember-vz code needed.
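The caller-side escalation (SIGTERM, wait, then SIGKILL) might look roughly like the sketch below. The function name stop_with_escalation and the 50 ms polling interval are assumptions. Because the standard library has no call for sending SIGTERM, this sketch shells out to kill(1); the real CLI may well use a signal API instead.

```rust
use std::io;
use std::process::{Child, Command};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Asks the child to exit with SIGTERM and escalates to SIGKILL if it is
/// still alive after `grace`. Returns Ok(true) when the graceful path
/// worked and Ok(false) when SIGKILL was needed. (Hypothetical helper.)
pub fn stop_with_escalation(child: &mut Child, grace: Duration) -> io::Result<bool> {
    // Graceful request: the helper process is expected to handle SIGTERM.
    Command::new("kill")
        .arg("-TERM")
        .arg(child.id().to_string())
        .status()?;

    // Poll for exit until the grace period runs out.
    let deadline = Instant::now() + grace;
    while Instant::now() < deadline {
        if child.try_wait()?.is_some() {
            return Ok(true);
        }
        sleep(Duration::from_millis(50));
    }

    // Fallback: Child::kill sends SIGKILL on Unix, which cannot be caught.
    child.kill()?;
    child.wait()?;
    Ok(false)
}
```

Keeping the escalation in the parent means the helper needs no force-stop code of its own, which matches the note above.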
Add DispatchSource signal handlers for SIGUSR1 and SIGUSR2 that call VZVirtualMachine.pause() and .resume() respectively. Both handlers check canPause/canResume before acting and log the result to stderr.
Move the ready-fd MAC address write into the vm.start success callback so the parent process (ember) only gets the notification after the VM is actually running. The MAC is still logged to stderr immediately for debugging.
Serial console logging to file was already implemented as part of the start command: --serial-log redirects VZVirtioConsoleDeviceSerialPort output to the specified file (or stdout if not provided).
…ection
Major cross-platform compilation changes:
- Gate firecracker, zfs, network modules with #[cfg(target_os = "linux")]
- Gate root check and reconciliation for Linux only
- Add macOS backend module with StorageBackend, VmBackend, NetworkBackend stubs
- Implement MacosStorage::init() to create directory hierarchy (images/data, vms, kernels, network)
- Add macOS image helpers (hdiutil attach/detach instead of loop mount)
- Add state_dir field to GlobalConfig for macOS storage path derivation
- Wire up #[cfg(target_os)] type aliases in backend/mod.rs
All 148 unit tests pass. Builds cleanly on macOS.
On macOS the raw .img file IS the base image — no zvol, no snapshot. create_image_volume moves (or copies) the ext4 image into images/data/. Uses rename for same-filesystem, falls back to copy+delete for cross-device.
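The rename-then-copy fallback can be sketched as follows. The helper name move_file is an assumption, and for simplicity this version falls back on any rename error instead of checking specifically for the cross-device case.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Moves `src` to `dst`. Tries an atomic rename first and falls back to
/// copy + delete when the rename fails.
/// Simplification: falls back on any rename error, not only cross-device.
pub fn move_file(src: &Path, dst: &Path) -> io::Result<()> {
    match fs::rename(src, dst) {
        Ok(()) => Ok(()),
        Err(_) => {
            // Copy the bytes across, then remove the original.
            fs::copy(src, dst)?;
            fs::remove_file(src)
        }
    }
}
```

If the source is genuinely missing, the copy fails as well and that error is returned to the caller.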
Clone base images for VMs using APFS copy-on-write via cp -c. Creates the VM directory + snapshots subdirectory, then clones the base image as rootfs.img. The apfs_clone helper detects common failures (cross-volume, non-APFS) and provides clear error messages.
Implement the four snapshot methods for the macOS storage backend using APFS copy-on-write clones (cp -c):
- snapshot: clones rootfs.img → snapshots/<name>.img
- restore: removes rootfs, clones snapshot back to rootfs.img
- delete: removes the snapshot .img file
- list: reads snapshots/ dir, returns name/time/size metadata
All operations are instant CoW clones with no additional disk cost until blocks diverge. APFS handles reference counting internally.
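A small sketch of the on-disk layout implied by the list above (snapshots/<name>.img next to rootfs.img). The function names are hypothetical, and this version returns names only, whereas the commit mentions time and size metadata as well.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Path of a named snapshot inside a VM directory: snapshots/<name>.img.
pub fn snapshot_path(vm_dir: &Path, name: &str) -> PathBuf {
    vm_dir.join("snapshots").join(format!("{}.img", name))
}

/// Lists snapshot names by scanning snapshots/ for *.img files.
/// Names are sorted for stable output; other files are ignored.
pub fn list_snapshots(vm_dir: &Path) -> io::Result<Vec<String>> {
    let mut names = Vec::new();
    for entry in fs::read_dir(vm_dir.join("snapshots"))? {
        let path = entry?.path();
        if path.extension().and_then(|e| e.to_str()) == Some("img") {
            if let Some(stem) = path.file_stem().and_then(|s| s.to_str()) {
                names.push(stem.to_string());
            }
        }
    }
    names.sort();
    Ok(names)
}
```

Since the snapshots are plain files, deleting one is just a file removal and no reference-count bookkeeping is needed in the CLI.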
Grow a VM's rootfs image by:
1. truncate -s <bytes> to extend the raw .img file
2. e2fsck -fy to ensure filesystem consistency
3. resize2fs to expand ext4 to fill the larger image
Works directly on raw image files (no block device needed). Requires e2fsprogs from Homebrew.
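The three steps above can be expressed as an ordered list of command lines. This sketch only builds the argument vectors; the helper name resize_commands is an assumption, and the real backend also resolves the Homebrew tool paths before running anything.

```rust
/// Builds the argv for each resize step, in the order they must run.
/// (Hypothetical helper; nothing is executed here.)
pub fn resize_commands(image: &str, new_size_bytes: u64) -> Vec<Vec<String>> {
    fn argv(parts: &[&str]) -> Vec<String> {
        parts.iter().map(|p| p.to_string()).collect()
    }
    let size = new_size_bytes.to_string();
    vec![
        // 1. Extend the raw image file to the new size.
        argv(&["truncate", "-s", size.as_str(), image]),
        // 2. Force a filesystem check, answering yes to repairs.
        argv(&["e2fsck", "-fy", image]),
        // 3. Grow ext4 to fill the larger file.
        argv(&["resize2fs", image]),
    ]
}
```

A caller would run each entry with std::process::Command and stop at the first non-zero exit status.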
Mount raw ext4 disk images using hdiutil attach with -plist output parsing to extract the mount point. Unmount with hdiutil detach. -nobrowse prevents mounted volumes from cluttering Finder. This is the macOS equivalent of Linux's mount/umount for loop devices, used during image preparation (SSH key injection) and resize operations.
…ot, destroy_fork_origin
Complete the remaining macOS StorageBackend methods:
- destroy_vm_storage: rm -rf the VM directory (rootfs + snapshots)
- destroy_image_storage: rm the base image .img file
- disk_device_path: returns the rootfs.img path (no block device indirection)
- clone_from_snapshot: snapshot source VM then APFS-clone into target (for vm fork)
- destroy_fork_origin: clean up fork snapshot using 'source_vm/snap_name' identifier
All macOS StorageBackend methods are now implemented.
Run 'diskutil info -plist' on the state directory during init to check that it resides on an APFS volume. Warns (doesn't error) if the volume isn't APFS, since cp -c CoW clones won't work on other filesystems. Walks up to the nearest existing ancestor directory if the state dir doesn't exist yet. Silently skips the check if diskutil isn't available.
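The "walk up to the nearest existing ancestor" step can be sketched with the standard library alone. The function name is an assumption; the diskutil call itself is omitted.

```rust
use std::path::Path;

/// Returns the closest ancestor of `path` (including `path` itself) that
/// exists on disk, or None if no component exists.
pub fn nearest_existing_ancestor(path: &Path) -> Option<&Path> {
    // `ancestors()` yields the path itself first, then each parent in turn.
    path.ancestors().find(|p| p.exists())
}
```

The returned directory is what a filesystem check would be run against when the state directory has not been created yet.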
Measure wall-clock time of every APFS clone operation. A CoW clone completes in milliseconds regardless of file size, so if it takes over 1 second, warn the user that copy-on-write may not be working and suggest running 'ember debug storage-efficiency' to investigate. Also marks the cp -c error handling task as done (was already implemented in the apfs_clone helper).
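A sketch of the timing check, assuming a helper shaped like the apfs_clone mentioned above. The 1 second threshold comes from the commit message; the function signatures, the constant name, and the warning text are assumptions. cp -c is the macOS clonefile option, so apfs_clone is only meaningful on macOS.

```rust
use std::io;
use std::path::Path;
use std::process::Command;
use std::time::{Duration, Instant};

/// A CoW clone should finish in milliseconds, so anything slower than
/// this suggests a full copy happened instead.
pub const SLOW_CLONE_THRESHOLD: Duration = Duration::from_secs(1);

/// True when a clone took long enough to be suspicious.
pub fn clone_looks_slow(elapsed: Duration) -> bool {
    elapsed > SLOW_CLONE_THRESHOLD
}

/// Clones `src` to `dst` with `cp -c` (macOS) and returns how long it took.
/// Hypothetical signature; error detection is reduced to the exit status.
pub fn apfs_clone(src: &Path, dst: &Path) -> io::Result<Duration> {
    let started = Instant::now();
    let output = Command::new("cp").arg("-c").arg(src).arg(dst).output()?;
    let elapsed = started.elapsed();

    if !output.status.success() {
        let stderr = String::from_utf8_lossy(&output.stderr);
        return Err(io::Error::new(
            io::ErrorKind::Other,
            format!("cp -c failed: {}", stderr.trim()),
        ));
    }
    if clone_looks_slow(elapsed) {
        eprintln!(
            "warning: clone took {:?}; copy-on-write may not be working, \
             try 'ember debug storage-efficiency'",
            elapsed
        );
    }
    Ok(elapsed)
}
```

Separating the threshold check from the process call keeps the warning logic testable on machines without APFS.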
Add a new 'debug' CLI subcommand with 'storage-efficiency' that reports CoW storage savings on macOS APFS volumes:
- Counts images, VM rootfs files, and snapshots with their logical sizes
- Reads actual disk usage via 'df -k' on the state directory volume
- Computes CoW efficiency ratio (logical / actual)
Output format matches the spec's example with aligned columns and a clear summary. No root required — reads file metadata and df only.
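The efficiency ratio is a single division. The sketch below assumes both sizes are already available in bytes; the function name and the choice to return None for zero actual usage are assumptions.

```rust
/// CoW efficiency as logical size divided by actual on-disk usage.
/// Returns None when actual usage is zero, where the ratio is undefined.
pub fn cow_efficiency(logical_bytes: u64, actual_bytes: u64) -> Option<f64> {
    if actual_bytes == 0 {
        None
    } else {
        Some(logical_bytes as f64 / actual_bytes as f64)
    }
}
```

For example, three 10 GiB clones that together occupy 10 GiB on disk give a ratio of 3.0, and a ratio near 1.0 suggests that clones are not sharing blocks.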
Add a note at the top of MACOS-TODO.md explaining that test tasks should be implemented as #[test] #[ignore] functions in tests/*.rs, following the same patterns as existing Linux integration tests.
Infrastructure changes:
- Add #![cfg(target_os = "linux")] to all existing Linux test files so they compile as empty on macOS (no spurious ignored tests)
- Update run-integration-tests.sh to work on both platforms: Linux tests run under sudo, macOS tests run as current user
- Fix macOS resize backend to find e2fsprogs via Homebrew paths (keg-only install not in PATH)
New macOS integration tests (tests/macos_storage.rs):
- storage_lifecycle_create_clone_snapshot_restore: full snapshot CRUD
- snapshot_create_duplicate_fails: duplicate name error
- snapshot_create_base_name_rejected: reserved name guard
- snapshot_restore_nonexistent_fails: missing snapshot error
- snapshot_delete_nonexistent_fails: missing snapshot error
- snapshot_list_empty: empty list display
- apfs_clone_does_not_reduce_free_space: CoW efficiency proof
- storage_efficiency_shows_savings: debug command output
- vm_delete_removes_storage: cleanup verification
Tests bypass 'ember vm create' (which needs ext4 mount, a Phase 5 task) and instead set up VM state manually with cp -c + vm.json.
…etwork
Add integration test that verifies the ember-vz Swift helper can:
- Boot a Linux VM via Apple Virtualization Framework
- Produce serial console output (kernel boot messages)
- Configure vmnet network (MAC address assignment)
- Shut down gracefully on SIGTERM
The test spawns ember-vz directly (VmBackend is not yet implemented), uses the Firecracker CI kernel (auto-downloaded and cached), and creates a minimal ext4 rootfs. Skips gracefully if ember-vz isn't built or no kernel is available.
Made exec_on_stopped_vm_fails cross-platform using TestEnv::with_vm(). Moved exec_command_returns_stdout and cp_upload_and_download into a #[cfg(target_os = "linux")] module (they need ubuntu-slim + docker). 1 test on macOS, 3 on Linux.
Checked off remaining TODO items: ssh.rs unification verified, macos_storage.rs already slimmed, run-integration-tests.sh unchanged. Full suite passes: 9 suites, 41 tests on macOS.
Add two tests specified in TEST-SPEC.md that were missing:
- vm_list: creates two VMs, verifies both appear in table and JSON output from `ember vm list`
- vm_force_stop: starts a VM, force-stops it with --force, verifies status transitions to stopped and PID is cleared
Both are cross-platform (vm_list uses TestEnv::with_vm, vm_force_stop uses TestEnv::with_running_vm with skip-if-unavailable).
The exec_command_returns_stdout and cp_upload_and_download tests were Linux-only (#[cfg(target_os = "linux")] mod linux_ssh) because they needed ubuntu-slim with sshd. Docker is available on macOS too, so there's no reason they can't run on both platforms.
Changes:
- Add docker_available() and stop_and_delete_vm() to common/mod.rs
- Add TestEnv::with_running_ssh_vm() constructor: inits, builds ubuntu-slim via Docker, creates/starts VM, waits for SSH readiness
- Rewrite ssh.rs: remove Linux-only module, use with_running_ssh_vm() for cross-platform tests that skip gracefully if prerequisites (docker + hypervisor) are missing
- Add wait_for_ssh_via_exec() helper that retries `ember exec true` until SSH is ready (up to 120s for systemd boot)
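The retry-until-ready behaviour of a helper like wait_for_ssh_via_exec() can be sketched as a generic polling loop. The name wait_until and its signature are assumptions; in the real test the probe would run `ember exec true` and report whether it succeeded.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Calls `probe` repeatedly until it returns true or `timeout` elapses.
/// The probe always runs at least once, even with a zero timeout.
pub fn wait_until<F: FnMut() -> bool>(
    timeout: Duration,
    interval: Duration,
    mut probe: F,
) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if probe() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        sleep(interval);
    }
}
```

A 120 second timeout with a one or two second interval would match the budget mentioned for systemd boot.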
Container-derived images often ship without /etc/hosts, making 'localhost' unresolvable. This causes tools like psql to fail when connecting to localhost instead of 127.0.0.1. Inject a minimal /etc/hosts (IPv4 + IPv6 loopback) alongside the existing resolv.conf injection.
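As an illustration, injecting the file into an extracted rootfs tree could look like the sketch below. The exact file contents, the function name, and the decision to leave an existing /etc/hosts untouched are all assumptions; the commit only states that a minimal IPv4 + IPv6 loopback file is injected.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Assumed minimal contents: IPv4 and IPv6 loopback entries for localhost.
pub const MINIMAL_HOSTS: &str = "127.0.0.1\tlocalhost\n::1\tlocalhost\n";

/// Writes etc/hosts under `rootfs` if the image does not ship one.
/// Returns Ok(true) when a file was written, Ok(false) when one existed.
pub fn inject_hosts(rootfs: &Path) -> io::Result<bool> {
    let etc = rootfs.join("etc");
    let hosts = etc.join("hosts");
    if hosts.exists() {
        // Assumption: an image-provided hosts file is left untouched.
        return Ok(false);
    }
    fs::create_dir_all(&etc)?;
    fs::write(&hosts, MINIMAL_HOSTS)?;
    Ok(true)
}
```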
Homebrew fakeroot uses DYLD_INSERT_LIBRARIES which fails on macOS Sequoia runners due to arm64/arm64e architecture mismatch. When running as root, tar and mkfs.ext4 can set file ownership natively, so fakeroot is unnecessary. CI now runs integration tests with sudo on both platforms, avoiding the fakeroot issue entirely while preserving correct file ownership in extracted images.
Remove silent skipping when Firecracker, Docker, ember-vz, or kernel are not available. Tests now panic with clear messages (e.g. 'firecracker not found in PATH') instead of silently passing. Integration tests are run locally only, not in CI, so missing prerequisites should fail loudly. Also fix clippy identity_op warnings in fmt.rs.
The Docker apt repo was hardcoded to arch=amd64, causing unresolvable dependency errors when building on arm64 (Apple Silicon). Use $(dpkg --print-architecture) to match the build platform, consistent with how the GitHub CLI repo is already configured.
A stray 'ends' after serial-getty@hvc0.service caused systemctl to try enabling a nonexistent 'ends.service', failing the build.
The old estimate used `du -sm` which reports APFS-compressed disk usage, significantly underreporting what ext4 actually needs for text-heavy rootfs trees (HTML docs, man pages). Replace with find/stat to sum apparent (logical) file sizes, plus block-alignment waste per file. Use 2x data estimate since the sparse image file costs nothing on APFS. Also pass -i 8192 (more inodes for many-small-file trees) and -m 0 (no reserved blocks, unnecessary for VM images) to mkfs.ext4.
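The estimate described above (apparent sizes, rounded up per file to the block size, then doubled) can be sketched as a pure function. The 4096-byte block size and the signature are assumptions; the real estimate_size_mib walks the rootfs tree itself and may add a floor or inode overhead.

```rust
/// Assumed ext4 block size used for per-file alignment waste.
const BLOCK_SIZE: u64 = 4096;
const MIB: u64 = 1024 * 1024;

/// Estimates the image size in MiB from the apparent size of each file:
/// round every file up to a whole number of blocks, sum, apply the 2x
/// headroom factor, then round up to whole MiB.
pub fn estimate_size_mib(file_sizes: &[u64]) -> u64 {
    let data_bytes: u64 = file_sizes
        .iter()
        .map(|size| (size + BLOCK_SIZE - 1) / BLOCK_SIZE * BLOCK_SIZE)
        .sum();
    let with_headroom = data_bytes * 2;
    (with_headroom + MIB - 1) / MIB
}
```

Rounding per file matters for trees with many small files, where alignment waste can exceed the data itself.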
After mkfs.ext4 -d populates the filesystem, run e2fsck + resize2fs -M to shrink it to minimum size, then truncate the file to match. This reclaims the generous headroom from estimate_size_mib so the stored image only contains actual data + metadata, no wasted empty space. Also update the registry's size_mib to reflect the final file size rather than the pre-shrink estimate.
Allows updating cpus, memory, kernel, boot-args, ssh-user, and ssh-key on a stopped VM without recreating it. Includes integration tests.
Correct spec discrepancies: macOS 13+ (not 12+), debugfs for SSH injection (not hdiutil mount), mkfs.ext4 -d (not mount+cp), st_blocks for storage efficiency (not df), remove unimplemented ember-vz status command, remove ember-vz.pid file (PID in vm.json). Add missing spec details: entropy and balloon devices, fakeroot support, image shrinking, Makefile build orchestration, e2fsprogs tool path resolution, StorageBackend trait updates. Check off completed Phase 7 items (CI, swift build, e2e tests).
…ase script
Formula supports both versioned releases (GitHub release tarballs with sha256) and --HEAD installs from git main. Builds Rust CLI via cargo and Swift helper via swift build + codesign on macOS. Depends on e2fsprogs and skopeo. Requires macOS 13+ (Ventura).
Added script/release.sh to automate tagging, GitHub release creation, and formula sha256 update.
README now leads with both platforms, shows 'brew tap aljoscha/ember && brew install ember' as the primary macOS install path (plus --HEAD for dev builds), separate quick-start sections for each platform, and a platform comparison table. Removed sudo from macOS examples. Added vm update-config and storage-efficiency commands. Links to both SPEC.md and MACOS-SPEC.md.
Add a note at the top of SPEC.md pointing to MACOS-SPEC.md for the macOS-specific design (AVF + APFS instead of Firecracker + ZFS).
No description provided.