Skip to content

Scope host-global resources to a per-installation instance id#17

Merged
aljoscha merged 10 commits into
mainfrom
claude/isolate-config-installations-7rzVm
May 20, 2026
Merged

Scope host-global resources to a per-installation instance id#17
aljoscha merged 10 commits into
mainfrom
claude/isolate-config-installations-7rzVm

Conversation

@aljoscha
Copy link
Copy Markdown
Owner

Summary

Two ember installations on the same host previously shared the dm-thin
pool name (ember-pool), the em-* TAP namespace, the 10.100.0.0/16
default subnet, and — on macOS — vmnet's 192.168.64.0/24. Worse,
reconcile() listed every em-* interface and would delete the other
install's TAPs on any command; deinit --purge tore down the singleton
ember-pool regardless of which install was asking. An integration
test against a temp --state-dir could destroy a live install on the
same machine.

This branch introduces a 4-hex-char instance_id (derived from a hash
of the canonicalized state-dir at ember init, overridable via
--instance-id) and threads it through every host-global resource
name.

What gets scoped

  • dm-thin: ember-{id}-pool, ember-{id}-img-, ember-{id}-vm-
  • Networking (Linux): em{id}- TAP prefix (14-char names fit
    IFNAMSIZ), ember:{id} iptables comment threaded through every
    add/remove so cleanup only ever matches this install's rules
  • IP subnet (Linux): /16 slot inside 10.0.0.0/8 derived from
    the id; overridable via --ip-subnet
  • IP subnet (macOS): vmnet's /24 sub-allocated into 8× /27 slices
    picked by the low 3 bits of the id (1/8 collision per pair; the
    override is --ip-subnet)
  • macOS guest IPs: replaced the Linux /30 P2P allocator with a
    single-IP /32 allocator since vmnet is a shared L2 bridge. Lifts
    the per-install cap from 8 to ~30 VMs while staying inside the /27.

reconcile() now scopes the orphan-TAP sweep by the install's prefix
and skips it entirely if the config can't be read (preferable to
deleting a foreign device).

Boundary discipline

GlobalConfig does not own the format of any subsystem's names. It
exposes one generic accessor — instance_namespace() -> Option<&str>,
which collapses the empty-string legacy sentinel to None — and each
subsystem owns its own pure derivation next to the code that has to
match the literals byte-for-byte:

  • dm_thin::pool::name(Option<&str>)
  • dm_thin::thin::{image_prefix, vm_prefix}(Option<&str>)
  • network::tap::prefix(Option<&str>)
  • network::nat::comment(Option<&str>)
  • ember_linux::network::ip::derive_default_subnet(...) (Linux /16 in /8)
  • ember_macos::network::derive_vmnet_subnet(...) (macOS /27 slot)

Added a one-bullet style note in CLAUDE.md capturing the principle.

Back-compat

No deinit --purge && init migration required. Older configs lack
instance_id; instance_namespace() returns None and each
subsystem falls back to its historical literal (ember-pool, em-,
ember-img-, ember-vm-, empty iptables comment, full /24 on macOS).

The subtle one is iptables: older binaries added rules without
-m comment at all, so on -D the comment match has to be elided
or the rule won't match. Handled in network::nat.

Risk

  • Cross-install isolation is invariant-level: covered by three
    integration tests in tests/isolation.rs that go through the
    binary — two-install dm-thin non-interference, legacy config
    compatibility, scoped TAP reconcile.
  • macOS /27 collision is probabilistic (1/8 per install pair).
    Acceptable for personal use; --ip-subnet is the escape hatch.
  • macOS allocations.json layout is preserved byte-for-byte for
    legacy installs (no instance_id → keep the /30 allocator on the
    full /24); new installs get the /32 allocator on the /27.
  • derive_vmnet_subnet on bad instance_id panics rather than
    silently falling back to slot 0 (which would re-collide every
    affected install on the same slice). The id is CLI-validated or
    auto-derived, so this is unreachable in practice — the panic is
    there to catch a regression in validation.

Test plan

  • cargo test (229 unit tests pass)
  • cargo test --test isolation (three cross-install tests)
  • cargo test --test dm_thin -- --ignored on a host with the
    dm-thin module + thin-provisioning-tools
  • Manual: two ember init invocations against distinct
    --state-dirs on the same Linux host; create VMs in both;
    deinit --purge on A leaves B running
  • Manual: upgrade a pre-existing install in place (no re-init);
    verify VMs keep working and deinit cleans up old iptables
    rules
  • Manual macOS: two installs running VMs concurrently; verify
    distinct /27 slices and no guest-IP collisions

claude added 10 commits May 18, 2026 07:32
Two ember installations on the same host previously shared the dm-thin
pool name, the `em-*` TAP namespace, and (when both ran VMs) the IP
subnet. Worse, reconcile() listed every `em-*` interface and would
delete the other install's TAPs on any command, and `deinit --purge`
tore down the singleton `ember-pool` regardless of which install was
asking. Integration tests against a temp `--state-dir` could destroy
the user's live install on the same machine.

Introduce `instance_id` (4 hex chars, derived from a hash of the
canonicalized state-dir at `ember init`, overridable via
`--instance-id`) and `ip_subnet` (defaulting to a `/16` slot in
`10.0.0.0/8` derived from the same hash, overridable via
`--ip-subnet`). Both are persisted on `GlobalConfig`. Helper
accessors derive every host-global name from the id:
`ember-{id}-pool`, `ember-{id}-{img,vm}-` device-mapper prefixes,
`em{id}-` TAP prefix (14-char names fit Linux IFNAMSIZ), and
`ember:{id}` iptables comment threaded through every rule add/remove
so cleanup only ever matches this install's rules.

reconcile() now scopes the orphan-TAP sweep by the install's prefix
and skips it entirely if the config can't be read, which is
preferable to deleting a foreign device. NetworkBackend::teardown
takes &GlobalConfig to thread the comment through; macOS impl is a
no-op update.

Removes the singleton `pool::POOL_NAME`, `thin::IMAGE_PREFIX`,
`thin::VM_PREFIX`, `network::ip::DEFAULT_SUBNET`, and
`tap::list_ember_devices()` constants/helpers — names now flow from
config. No back-compat shim for older configs: existing installs
need `ember deinit --purge && ember init` to migrate.

https://claude.ai/code/session_01XbXjsQa8SRWLzxnuKFzxt6
Previous commit's clean-slate naming forced existing installs to
deinit + reinit because the config schema gained two required-by-
convention fields (instance_id, ip_subnet). That's a worse migration
than the problem it solves. Make the accessors back-compat aware
instead: when instance_id is empty (the serde default for older
configs), return the historical unprefixed names — `ember-pool`,
`em-`, `ember-img-`, `ember-vm-`, and an empty iptables comment.
ip_subnet's serde default already maps to `10.100.0.0/16`.

The iptables comment match is the subtle one. Older binaries added
rules without `-m comment` at all; if the upgraded binary then tried
to delete the same rule WITH a comment match, iptables would treat
it as a different rule and the cleanup would silently no-op. Splice
the comment match in only when the comment is non-empty so the byte-
exact rule stays matchable on `-D`.

Add three integration tests guarding the isolation contract:

* `dm_thin_two_installs_dont_interfere`: two installs with distinct
  --instance-id stand up `ember-aaaa-pool` and `ember-bbbb-pool`
  side-by-side; deinit on A leaves B's pool intact.
* `legacy_config_without_instance_id_keeps_working`: writes a
  hand-crafted legacy config (no instance_id), runs `ember info`
  through the binary, asserts the rendered pool name is the
  unprefixed `ember-pool`.
* `reconcile_does_not_touch_other_installs_taps`: creates a TAP
  matching install A's prefix manually, runs `ember vm list` against
  install B (which triggers reconcile), verifies A's TAP survives.

https://claude.ai/code/session_01XbXjsQa8SRWLzxnuKFzxt6
The cross-installation isolation tests (two-install dm-thin
non-interference, legacy config compatibility, scoped TAP
reconciliation) live in `tests/isolation.rs` rather than
piggybacking on `tests/dm_thin.rs`. The dm-thin file is for backend
mechanics; isolation is a different invariant that happens to use
dm-thin as its substrate.

Also fold a one-line WHY note into each legacy-mode branch on
GlobalConfig — the literals there must match what the
pre-instance-id binary wrote to disk and to the kernel, and the
specific failure mode (orphaned thin ids, IFNAMSIZ overflow,
silent iptables -D no-ops) varies per accessor.

https://claude.ai/code/session_01XbXjsQa8SRWLzxnuKFzxt6
vmnet's shared subnet (192.168.64.0/24) is owned by the framework
and there's no public API to ask for a different one, so two macOS
installs running VMs concurrently both allocate from the same /24
and hand out colliding guest IPs. State-dir isolation handled
everything else on macOS, but this last collision survived.

Carve the /24 into 8 /27 slots (32 addresses, 8 /30 VM blocks per
install) and pick a slot from the low 3 bits of the instance id.
Two installs collide with 1/8 probability per pair; `--ip-subnet`
is the override when that bites. 8 concurrent VMs per install is
generous for personal use on macOS.

Plumb via a new `Platform::default_ip_subnet(instance_id)` trait
method so init.rs picks the right derivation per platform: Linux
keeps the existing /16 slot inside 10/8, macOS gets the /27 slice.
`MacosNetwork::setup` reads `config.ip_subnet` for new configs and
falls back to the full /24 when `instance_id` is empty so legacy
installs continue to use the same allocations.json layout the old
binary wrote.

IP reuse on VM teardown was already correct: the allocator picks
the lowest-numbered free `/30` block and `release()` removes the
entry, so freed slots are reused before higher indices. Existing
`allocate_reuses_released_block` test covers it.

https://claude.ai/code/session_01XbXjsQa8SRWLzxnuKFzxt6
Previous /27 slicing capped each macOS install at 8 VMs because
ember reused Linux's /30 P2P allocator on macOS, where it makes no
sense — vmnet provides a shared L2 bridge so every guest sits on
the same subnet behind one gateway, and the host (.1) / broadcast
(.3) of the per-VM /30 are pure waste. 75% of the address space
went unused.

Add `network::ip::allocate_single`: hands out one /32 address from
a shared subnet, with a `reserved` list so callers can pin
host-global addresses (vmnet's network/gateway/broadcast in our
case) without seeding allocations.json. Persists through the same
flock-locked file as `allocate`, with `block_index` reinterpreted
as an address offset.

`MacosNetwork::setup` branches on `instance_id`:
* Legacy (no instance id): keeps using the /30 allocator against
  the full /24 so existing allocations.json layouts stay valid
  byte-for-byte across the upgrade.
* New install: uses `allocate_single` against the /27 slot. ~30
  VMs/install (32 - reserved), same 1/8 collision profile.

Move the /30-minimum constraint out of `parse_cidr` into `allocate`
so the shared parser can also accept /27, /28, /32 for
single-IP callers.

https://claude.ai/code/session_01XbXjsQa8SRWLzxnuKFzxt6
GlobalConfig previously owned the format of the dm-thin pool name,
the image/VM thin-volume prefixes, the TAP device prefix, and the
iptables comment — five accessors that hard-coded what should be a
subsystem-private concern. Each had its own legacy-fallback branch
duplicating the same `if instance_id.is_empty()` shape.

Replace the five accessors with one generic identity surface,
`GlobalConfig::instance_namespace() -> Option<&str>`, that collapses
the empty-string legacy sentinel into `None`. Each subsystem owns
its own pure derivation:

* `dm_thin::pool::name(Option<&str>)`
* `dm_thin::thin::{image_prefix, vm_prefix}(Option<&str>)`
* `network::tap::prefix(Option<&str>)`
* `network::nat::comment(Option<&str>)`

The legacy literals (`ember-pool`, `ember-img-`, `ember-vm-`, `em-`,
empty iptables comment) live next to the code that has to match
them byte-for-byte against existing kernel state, with the
"why this exact string" comment alongside.

Two related cleanups from the review:

* `DmThinStorage::init` was rebuilding the pool name itself with
  `format!("ember-{}-pool", ...)`, duplicating the
  `dm_thin_pool_name()` accessor's format and creating a silent
  drift risk. It now calls `pool::name(Some(...))` — one source of
  truth.
* The macOS vmnet-reserved address list was inlined in
  `MacosNetwork::setup`; hoist to a module-level `VMNET_RESERVED`
  const so the three host-global addresses have one definition.
* `reconcile::cleanup_network` was a one-line passthrough to
  `network::cleanup` — inlined the call.

229 ember-core + ember-linux unit tests still pass; integration
tests (`tests/isolation.rs`, `tests/dm_thin.rs`) are unchanged and
still cover the cross-install isolation contract via the binary.

https://claude.ai/code/session_01F51Z281wnRSUHqHPub4msK
Adds a one-bullet style note under "Coding Style & Conventions"
capturing the principle that motivated the previous refactor:
subsystems own their own names/formats, shared types like
`GlobalConfig` expose generic identity, and reaching across a
boundary to format or match another subsystem's strings is a
signal to move the logic to the owning side.

https://claude.ai/code/session_01F51Z281wnRSUHqHPub4msK
`MacosNetwork::setup` was the lone holdout still reading
`config.instance_id` raw to decide between the legacy /30
allocator and the new /27 single-IP path — every other subsystem
goes through `GlobalConfig::instance_namespace()`. Switch to the
accessor and match on `Option<&str>` so the legacy-vs-tagged
intent reads at the type level, consistent with `dm_thin::pool`,
`network::tap`, `network::nat`, etc.

No behavior change — `instance_namespace()` collapses the empty
instance_id to `None`, which selects the same legacy branch the
`is_empty()` check did.

https://claude.ai/code/session_01DdHPXvcXCZagRRDobyPbvi
`derive_ip_subnet` lived in `ember_core::config` but was only ever
called from `LinuxPlatform::default_ip_subnet`. macOS has its own
derivation (`derive_vmnet_subnet` in `ember-macos`) that carves a
/27 out of vmnet's fixed /24, so keeping the Linux-specific
function in core was the opposite of the boundary discipline this
branch is trying to land: shared types stay generic; platforms
own their own scoped derivations.

Move it to `ember_linux::network::ip::derive_default_subnet`,
alongside the `network::ip::allocate`/`release` it pairs with.
Expose `ember_core::config::fnv1a_32` as `pub` so the platform
crate can hash the instance id with the same primitive
`derive_instance_id` (still in core, still platform-agnostic)
uses — duplicating a 7-line hash would be sillier than
publishing a stable utility.

`network.rs`'s `pub use ember_core::network::ip` re-export
becomes a local `pub mod ip` that re-exports the same surface
(`allocate`, `allocate_single`, `release`, `IpAllocation`, …) and
adds the Linux-only `derive_default_subnet`. Every existing
`network::ip::X` call site keeps resolving — the new module pulls
core's public items through `pub use ember_core::network::ip::*`.

https://claude.ai/code/session_01DdHPXvcXCZagRRDobyPbvi
`derive_vmnet_subnet`'s `unwrap_or(0)` was defensive against a
hand-edited config, but the function is only called from
`MacosPlatform::default_ip_subnet` at `ember init` — never at
runtime — and the `instance_id` it gets there is either
CLI-validated (`parse_instance_id` enforces 4 lowercase hex) or
auto-derived (`derive_instance_id` always emits 4 hex chars).
The fallback was unreachable, but in the hypothetical world where
validation regresses it would silently collide every affected
install on slot 0 — exactly the cross-install clash this branch
exists to prevent.

Switch to a `panic!` with a message that names the contract
("CLI validation should have rejected this"), and replace the
`falls_back_to_slot_zero_on_garbage_id` test with two
`#[should_panic]` cases. The test that was locking in the silent
fallback is now locking in the loud one.

Also drops a trailing blank line in reconcile.rs that
`cargo fmt` flagged in passing.

https://claude.ai/code/session_01DdHPXvcXCZagRRDobyPbvi
@aljoscha aljoscha merged commit 62250d9 into main May 20, 2026
2 checks passed
@aljoscha aljoscha deleted the claude/isolate-config-installations-7rzVm branch May 20, 2026 12:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants