51 changes: 51 additions & 0 deletions plans/future-squashfs-sidecar.md
# Future: Read-Only Squashfs Sidecar for Heavy Image Content

**Status:** Future optimization. Do not design details until image bloat is an actual problem.
**Depends on:** Profiles, custom images, and overlay disks (see `profiles-and-overlay.md`) must be in place.

## The problem this solves

The pen initramfs is unpacked into tmpfs at boot, which means **every byte in the image consumes guest RAM for the life of the VM**. For the tooling we're currently targeting (claude binary, node, git, ripgrep, apk packages) this is fine — we're talking tens of megabytes.

But the ceiling is real. Profiles that want to bake in heavy content will hit it:
- Full JDK + Maven local repo
- CUDA toolkit
- Multi-gigabyte model weights
- Large platform SDKs (Android NDK, Xcode command-line equivalents)

Once profiles start exceeding a few hundred MB of baked content, the RAM cost becomes user-visible and the "just put it in the image" answer stops working.

## When to revisit

- When a user builds a profile that pushes the initrd past ~500MB and the resulting VM feels memory-starved.
- When a natural use case (ML tooling, JDK-based agents) appears that requires > 1GB of stable, read-only content.
- **Not** when someone just wants more space for mutable state — that belongs on the per-VM overlay disk, not here.

## Rough shape of the approach

Ship heavy, read-only content as a **squashfs image mounted from a virtio-blk device**, alongside the existing kernel/initrd. The initramfs remains small; the squashfs carries the bulk.

1. Extend the profile image format from `{vmlinuz, initrd}` to `{vmlinuz, initrd, sidecar.squashfs}` (sidecar optional).
2. During image build, the builder VM decides which files go in the initrd (small, required early) vs. the squashfs (large, accessed on demand). Probably driven by a profile field like `squashfs_paths = ["/opt/jdk", "/opt/models"]`.
3. At runtime, pen attaches the squashfs as a read-only virtio-blk device. Guest init mounts it at a fixed path (e.g. `/nix`-style or overlay-style) and the content is available without copying into RAM.
4. The image cache key now covers both the initrd and the squashfs.

## Key properties

- **Read-only.** This is not a replacement for the overlay disk. Anything mutable still goes on the per-VM overlay.
- **Content-addressed, shared across VMs.** Same semantics as custom initrds today — one squashfs per profile, reused by every VM built from it.
- **No per-VM storage cost beyond the overlay.** The squashfs lives in `~/.config/pen/images/profiles/<name>/sidecar.squashfs` alongside the initrd.
- **Cold reads only touch disk.** Linux page cache handles hot content naturally; RAM cost is proportional to the working set, not the image size.

## Known questions (to answer at design time, not now)

- **How does the profile author split content between initrd and squashfs?** Explicit path list, size threshold heuristic, or build-time tooling that analyzes what's large and moves it automatically?
- **Mount strategy:** plain mount at a fixed path, or overlay the squashfs into `/` so baked content appears in its "natural" location? Overlay is more magical but matches user expectations.
- **Multiple sidecars per profile?** Probably not for v1 — one sidecar per profile keeps the cache model simple.
- **Interaction with the per-VM overlayfs:** the overlay's lowerdir becomes a stack — initramfs root + squashfs content. Overlayfs supports this (multiple lower dirs), but ordering matters and failure modes multiply.
- **Build VM complexity:** the builder VM needs `mksquashfs`. Add to the base image's apk packages for building, or use a dedicated "builder image" variant?
- **Distribution:** per-profile squashfs files can be large. Does `pen image build` still run locally, or do we move to pre-built profile artifacts hosted on GitHub Releases?
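The mount-strategy and overlay-stacking questions above both hinge on lowerdir ordering. A hedged sketch of how guest init might assemble the mount options (the paths and the choice of which layer shadows which are hypothetical; overlayfs resolves lookups left to right, so the first lowerdir wins conflicts):

```go
package main

import (
	"fmt"
	"strings"
)

// overlayOpts builds the option string for a multi-lower overlayfs mount.
// In overlayfs, the first lowerdir is the top-most lower layer: lookups
// resolve left to right, so earlier layers shadow later ones.
func overlayOpts(lowers []string, upper, work string) string {
	return fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s",
		strings.Join(lowers, ":"), upper, work)
}

func main() {
	// Hypothetical layout: the initramfs root shadows the squashfs sidecar,
	// and the per-VM overlay disk supplies the writable upper layer.
	opts := overlayOpts(
		[]string{"/run/rootfs", "/run/sidecar"},
		"/mnt/overlay/upper", "/mnt/overlay/work",
	)
	fmt.Println(opts)
	// Guest init would then run something like:
	//   mount -t overlay overlay -o <opts> /newroot
}
```

Getting this ordering wrong silently flips which layer wins on path conflicts, which is exactly the "ordering matters and failure modes multiply" concern above.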

## Why we're not doing this now

The two-layer model (immutable image + mutable per-VM disk) is sufficient for everything the current user story needs. Adding a third tier introduces real complexity — multi-lower overlayfs, image format versioning, build-time partitioning — and the RAM ceiling won't be hit until someone actually tries to bake something heavy. When that day comes, the plan above is the escape hatch; until then, keep the design simple.
43 changes: 43 additions & 0 deletions plans/future-vm-snapshots.md
# Future: VM Snapshots and Suspend/Resume

**Status:** Future optimization. Do not design details until we're committing to build it.
**Depends on:** Profiles, custom images, and overlay disks (see `profiles-and-overlay.md`) should be stable first.

## Goal

Make `pen shell` startup time effectively instant (sub-second) by skipping the boot path entirely on subsequent invocations. Boot once, capture a snapshot of the running VM's memory and device state, and restore from that snapshot on future shells.

## When to revisit

- When per-shell startup time becomes a user complaint *after* profiles + custom images are in place. The current boot path is already fast; the value of this feature is almost entirely about eliminating the residual few seconds, not about solving a current pain point.
- When `Code-Hex/vz/v3` and Apple Virtualization.framework have documented, stable save/restore APIs we can rely on.
- If competing tools (Lima, OrbStack, etc.) popularize sub-second VM startup and it becomes a baseline expectation.

## Rough shape of the approach

1. On first `pen shell` for a VM, boot normally, run profile setup if needed, and reach a "ready" state (user-visible prompt).
2. Immediately before attaching the console, pause the VM and write a snapshot (memory image + device state) to `~/.config/pen/vms/<name>/snapshot/`.
3. On subsequent `pen shell` invocations, if a valid snapshot exists, restore from it instead of booting.
4. Invalidate the snapshot on any change that would make it stale: profile image hash mismatch, overlay disk modified out-of-band, env var changes (snapshots freeze the environment), kernel/image upgrade.

## Known hard problems (to solve at design time, not now)

- **Stale state in the restored VM:** clock drift, expired DHCP leases, dangling network connections, cached DNS, `/tmp` contents from the snapshot moment. Need a post-restore hook to re-run clock sync, renew networking, clear volatile state.
- **Environment variable changes:** pen injects env vars at boot. Snapshots freeze those. Either re-inject post-restore (requires a host→guest channel beyond the current share-and-read model) or invalidate the snapshot whenever env changes (common case, erodes the speedup).
- **Snapshot size:** memory snapshots are roughly the size of allocated guest RAM. A 4GB VM = 4GB snapshot on disk per VM. Disk pressure, especially with many VMs.
- **Snapshot portability:** snapshots are tied to macOS version, CPU generation, vz version. User upgrades can invalidate every snapshot. Need a version tag and automatic invalidation.
- **`vz` API maturity:** confirm save/restore is supported, stable, and exposed by `Code-Hex/vz/v3` (not just Apple's private APIs). If not, this plan is blocked until the bindings catch up.
- **Interaction with overlay disks:** the snapshot must be consistent with the overlay disk's on-disk state at snapshot time. Restoring after the overlay has been modified since the snapshot was taken means corruption. Need a consistency check (overlay mtime or hash) and automatic invalidation.
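Several of the invalidation triggers above reduce to comparing a metadata record written at snapshot time against current state. A minimal sketch; the field names and format are hypothetical, not a committed design:

```go
package main

import "fmt"

// snapshotMeta would be written next to the memory image at snapshot time.
// Field names are illustrative, not pen's actual format.
type snapshotMeta struct {
	FormatVersion int    // bumped on macOS/vz/pen upgrades that break snapshots
	ImageHash     string // profile image (kernel + initrd) content hash
	OverlayHash   string // overlay disk hash (or mtime proxy) at snapshot time
	EnvHash       string // hash of injected env vars, which snapshots freeze
}

const currentFormatVersion = 1

// staleReasons returns why a snapshot must be discarded; empty means valid.
func staleReasons(saved, current snapshotMeta) []string {
	var reasons []string
	if saved.FormatVersion != currentFormatVersion {
		reasons = append(reasons, "snapshot format version mismatch")
	}
	if saved.ImageHash != current.ImageHash {
		reasons = append(reasons, "profile image changed")
	}
	if saved.OverlayHash != current.OverlayHash {
		reasons = append(reasons, "overlay disk modified since snapshot")
	}
	if saved.EnvHash != current.EnvHash {
		reasons = append(reasons, "environment variables changed")
	}
	return reasons
}

func main() {
	saved := snapshotMeta{1, "img-a", "ovl-1", "env-x"}
	fmt.Println(len(staleReasons(saved, saved))) // identical state: valid
	fmt.Println(staleReasons(saved, snapshotMeta{1, "img-a", "ovl-2", "env-x"}))
}
```

Collecting *all* reasons rather than short-circuiting on the first makes the eventual "discarding snapshot because …" log line more useful when debugging.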

## Non-obvious things to keep in mind

- Snapshots are **per-VM**, not per-profile. Each VM has its own memory state, overlay contents, and workspace path.
- The first shell is slower, not faster (boot + snapshot). The win only materializes on the 2nd+ shell.
- Users will expect `pen shell` to "just work" after a macOS upgrade. Graceful fallback to a fresh boot on snapshot-version mismatch is non-negotiable.

## Alternatives to consider before committing

- **Kexec-based fast reboot:** if the slowness is the firmware/boot phase, kexec within the guest could shave time without full snapshot machinery.
- **Keep VMs running in the background:** instead of snapshot/restore, keep the VM alive across `pen shell` exits and just reattach the console. Simpler, avoids all snapshot pitfalls, but uses memory continuously.

Decide between snapshot, backgrounding, and kexec at design time. They solve the same problem with very different tradeoffs.