feat(network): always provision per-VM netns under CNI, even with 0 NICs #44

Merged
CMGS merged 5 commits into master from feat/cni-netns-always on May 13, 2026

@CMGS CMGS commented May 13, 2026

Problem

cocoon vm run --nics 0 (or vm clone --nics 0) under CNI silently put the CH/FC process in the host netns because initNetwork short-circuited on nics <= 0. A subsequent vm net resize --nics N was then rejected by plumbingForVM:

if len(configs) == 0 {
    return nil, fmt.Errorf("zero NICs; resize up not supported (use `vm clone --nics N` instead)")
}

This hard-codes an asymmetry users can't see from the CLI surface — --nics 0 looks like a valid lower bound on --nics N, but it actually puts the VM in a different mode (no per-VM netns, can't be lifted back up).

Bridge mode was mechanically fine (no netns concept at all) but got blocked by the same guard.

Fix

Move netns ownership entirely into the network module:

  • network.Network gains Prepare(ctx, vmID, vmCfg) (netnsPath, error).
    • CNI.Prepare idempotently creates the per-VM netns and returns its path. Returns "" when no CNI conflist is configured.
    • Bridge.Prepare is a no-op (bridge stays in host netns by design).
  • initNetwork always calls Prepare, regardless of NIC count, and writes the result onto vmCfg.{NetBackend, NetnsPath, NetBridgeDev} (transient json:"-" slots).
  • Add is only called when nics > 0.
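
The flow described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `VMConfig` shape, the `Add` signature, the in-memory fakes, and the `/var/run/netns/cocoon-<vmID>` path layout are all assumptions made for the example. Only the `Prepare(ctx, vmID, vmCfg) (netnsPath, error)` shape and the "always Prepare, Add only when nics > 0" rule come from the description.

```go
package main

import (
	"context"
	"fmt"
)

// VMConfig is a hypothetical stand-in for the real VM config type.
type VMConfig struct{ NICs int }

// Network lists only the two methods discussed here; Add's signature is assumed.
type Network interface {
	Prepare(ctx context.Context, vmID string, cfg *VMConfig) (netnsPath string, err error)
	Add(ctx context.Context, vmID string, cfg *VMConfig) error
}

// fakeCNI records netns "creation" in a map instead of touching the kernel.
type fakeCNI struct {
	netns    map[string]string // vmID -> netns path
	addCalls int
}

func (c *fakeCNI) Prepare(_ context.Context, vmID string, _ *VMConfig) (string, error) {
	if p, ok := c.netns[vmID]; ok {
		return p, nil // idempotent: a second call returns the existing netns
	}
	p := "/var/run/netns/cocoon-" + vmID // assumed path layout
	c.netns[vmID] = p
	return p, nil
}

func (c *fakeCNI) Add(_ context.Context, _ string, _ *VMConfig) error {
	c.addCalls++
	return nil
}

// fakeBridge stays in the host netns, so Prepare has nothing to create.
type fakeBridge struct{ addCalls int }

func (b *fakeBridge) Prepare(_ context.Context, _ string, _ *VMConfig) (string, error) {
	return "", nil
}

func (b *fakeBridge) Add(_ context.Context, _ string, _ *VMConfig) error {
	b.addCalls++
	return nil
}

// initNetwork always calls Prepare and only calls Add when NICs were requested.
func initNetwork(ctx context.Context, n Network, vmID string, cfg *VMConfig) (string, error) {
	netnsPath, err := n.Prepare(ctx, vmID, cfg)
	if err != nil {
		return "", fmt.Errorf("prepare network for %s: %w", vmID, err)
	}
	if cfg.NICs > 0 {
		if err := n.Add(ctx, vmID, cfg); err != nil {
			return "", fmt.Errorf("add NICs for %s: %w", vmID, err)
		}
	}
	return netnsPath, nil
}

// demoCNI runs initNetwork against a fresh fake CNI provider.
func demoCNI(nics int) (netnsPath string, addCalls int, err error) {
	cni := &fakeCNI{netns: map[string]string{}}
	netnsPath, err = initNetwork(context.Background(), cni, "vm1", &VMConfig{NICs: nics})
	return netnsPath, cni.addCalls, err
}

// demoBridge runs initNetwork against a fresh fake bridge provider.
func demoBridge(nics int) (netnsPath string, addCalls int, err error) {
	br := &fakeBridge{}
	netnsPath, err = initNetwork(context.Background(), br, "vm1", &VMConfig{NICs: nics})
	return netnsPath, br.addCalls, err
}

func main() {
	p, adds, _ := demoCNI(0)
	fmt.Printf("cni, 0 NICs: netns=%q addCalls=%d\n", p, adds)
	p, adds, _ = demoBridge(2)
	fmt.Printf("bridge, 2 NICs: netns=%q addCalls=%d\n", p, adds)
}
```

With 0 NICs the fake CNI provider still hands back a netns path while `Add` is never invoked, which is the property a later resize-up depends on.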

Persistence:

  • Promoted NetBackend / NetnsPath / NetBridgeDev from per-NIC NetworkConfig to VM-level types.VM. They are the per-VM identity of "which netns CH lives in" and had always been denormalized onto NIC[0]. The persisted JSON keeps the new top-level fields (net_backend, netns_path, net_bridge_dev).
  • Resolved*() accessors fall back to NetworkConfigs[0] for pre-PR records so existing VMs and snapshots keep working without migration.
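
A fallback accessor of the kind described above might look like the sketch below. The struct layouts are simplified assumptions (the real `types.VM` and `NetworkConfig` carry more fields, and the per-NIC field name is a guess); only the rule "prefer the VM-level value, otherwise read `NetworkConfigs[0]`" is taken from the description.

```go
package main

import "fmt"

// NetworkConfig is a simplified per-NIC record; the field name is assumed.
type NetworkConfig struct {
	NetnsPath string
}

// VM is a simplified stand-in for types.VM.
type VM struct {
	NetnsPath      string           // VM-level value written by new code
	NetworkConfigs []*NetworkConfig // pre-PR records only carry the value here
}

// ResolvedNetnsPath prefers the VM-level field and falls back to NIC[0]
// so records written before the promotion keep working without migration.
func (vm *VM) ResolvedNetnsPath() string {
	if vm == nil {
		return ""
	}
	if vm.NetnsPath != "" {
		return vm.NetnsPath
	}
	if len(vm.NetworkConfigs) > 0 && vm.NetworkConfigs[0] != nil {
		return vm.NetworkConfigs[0].NetnsPath
	}
	return ""
}

func main() {
	newRec := &VM{NetnsPath: "/var/run/netns/cocoon-new"}
	oldRec := &VM{NetworkConfigs: []*NetworkConfig{{NetnsPath: "/var/run/netns/cocoon-old"}}}
	empty := &VM{}

	fmt.Println(newRec.ResolvedNetnsPath())
	fmt.Println(oldRec.ResolvedNetnsPath())
	fmt.Printf("%q\n", empty.ResolvedNetnsPath())
}
```

The nil check on the receiver is optional in this sketch; it lets callers resolve the path on a record that failed to load without an extra guard.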

Hypervisor consumers:

  • CH.launchProcess / FC.launchProcess no longer take the withNetwork bool parameter — the len(rec.NetworkConfigs) > 0 gate conflated "has NICs" with "needs netns entry", which is the same bug at the hypervisor layer. They read rec.ResolvedNetnsPath() instead.
  • cmd/vm/netresize.plumbingForVM and cmd/vm/lifecycle.providerForVM dispatch via vm.ResolvedNetBackend() and read vm.ResolvedNetBridgeDev() for bridge mode; the zero-NICs hard error is gone.
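
To make the "has NICs" versus "needs netns entry" distinction concrete, here is a small sketch contrasting the removed gate with a path-based check and a backend-based dispatch. It assumes the backend strings are `"cni"` and `"bridge"`, that an empty netns path means "stay in the host netns", and that `describePlumbing` is a hypothetical helper; none of these are confirmed by the PR text.

```go
package main

import (
	"errors"
	"fmt"
)

// VM is a simplified record; the real accessors also fall back to NIC[0].
type VM struct {
	NetBackend   string
	NetnsPath    string
	NetBridgeDev string
	NICCount     int
}

// oldShouldEnterNetns mirrors the removed gate: it keys off NIC presence.
func oldShouldEnterNetns(vm *VM) bool { return vm.NICCount > 0 }

// newShouldEnterNetns keys off the recorded netns path instead.
func newShouldEnterNetns(vm *VM) bool { return vm.NetnsPath != "" }

// describePlumbing is a hypothetical dispatcher that selects by backend,
// so a 0-NIC VM still resolves to a provider on resize-up.
func describePlumbing(vm *VM) (string, error) {
	switch vm.NetBackend {
	case "cni": // assumed backend name
		return "cni in " + vm.NetnsPath, nil
	case "bridge": // assumed backend name
		return "bridge on " + vm.NetBridgeDev, nil
	default:
		return "", errors.New("no network backend recorded for this VM")
	}
}

func main() {
	zeroNIC := &VM{NetBackend: "cni", NetnsPath: "/var/run/netns/cocoon-vm1"}
	fmt.Println("old gate enters netns:", oldShouldEnterNetns(zeroNIC))
	fmt.Println("new gate enters netns:", newShouldEnterNetns(zeroNIC))
	desc, err := describePlumbing(zeroNIC)
	fmt.Println(desc, err)
}
```

For a 0-NIC CNI VM the two gates disagree, which is the case the PR fixes; for a bridge VM both report false because no netns path is recorded.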

Result

| Path | Before | After |
| --- | --- | --- |
| vm run --nics 0 (CNI) | CH in host netns | CH in cocoon-<vmID> netns |
| vm run --nics 0 (bridge) | CH in host netns | CH in host netns (unchanged) |
| vm run --nics N (CNI) | CH in cocoon-<vmID> netns | unchanged |
| vm net resize --nics N on 0-NIC CNI VM | hard error | adds NICs to the existing netns ✓ |
| vm net resize --nics N on 0-NIC bridge VM | hard error | adds TAPs on host bridge ✓ |
| vm net resize --nics 0 (any) | preserves netns | unchanged |
| Resize back up after resize-to-0 | already worked (netns preserved) | still works |

Test plan

  • make lint linux + darwin: 0 issues
  • go test ./...: green
  • Live: cocoon vm run --nics 0 under CNI → ip netns ls | grep cocoon- shows the netns; cocoon vm net resize --nics 2 succeeds and NICs land in the right netns
  • Live: same flow under --bridge cni0 → no netns, TAPs land on host bridge
  • Live: pre-PR-created VM with 1 NIC continues to work (fallback path)

@CMGS CMGS force-pushed the feat/cni-netns-always branch 2 times, most recently from 33d55a3 to f0cb8e5 Compare May 13, 2026 07:06
@CMGS CMGS requested a review from Copilot May 13, 2026 07:13

Copilot AI left a comment

Pull request overview

This PR fixes a networking edge case where running/cloning a VM with --nics 0 under CNI would leave the hypervisor process in the host network namespace, preventing later vm net resize --nics N. It does so by making per-VM network namespace provisioning independent of NIC count and by persisting per-VM network identity at the VM level (with backward-compatible fallbacks).

Changes:

  • Add Network.Prepare(...) and call it unconditionally so CNI can provision a per-VM netns even when nics == 0, while bridge remains a no-op.
  • Promote net_backend / netns_path / net_bridge_dev to VM-level persisted fields and add Resolved*() accessors for backward compatibility.
  • Update hypervisor launch paths and vm net resize plumbing selection to use VM-level resolved network identity instead of len(NetworkConfigs) > 0.

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| types/vm.go | Adds VM-level persisted network identity fields and Resolved*() accessors; introduces types.NetSetup. |
| types/vm_test.go | Adds unit tests for the new Resolved*() accessors. |
| README.md | Updates NIC hot-resize documentation, including 0→N resize behavior under CNI/bridge. |
| network/network.go | Extends the Network interface with Prepare(...). |
| network/cni/create.go | Implements CNI.Prepare to idempotently ensure the per-VM netns exists when configured. |
| network/bridge/bridge_other.go | Implements Prepare for non-Linux bridge builds (unsupported). |
| network/bridge/bridge_linux.go | Implements bridge Prepare as a no-op for host-netns operation. |
| KNOWN_ISSUES.md | Updates documentation to reflect longer eject wait timeout. |
| hypervisor/hypervisor.go | Updates hypervisor interface to accept types.NetSetup for create/clone/direct-clone. |
| hypervisor/firecracker/start.go | Removes withNetwork gating and uses resolved netns path for process launch. |
| hypervisor/firecracker/restore.go | Aligns restore launch with new launchProcess signature. |
| hypervisor/firecracker/direct.go | Updates DirectClone signature to accept types.NetSetup. |
| hypervisor/firecracker/create.go | Updates Create signature to accept types.NetSetup. |
| hypervisor/firecracker/clone.go | Updates Clone flow to accept types.NetSetup and persist VM-level network identity. |
| hypervisor/create.go | Persists VM-level network identity fields during create. |
| hypervisor/cloudhypervisor/start.go | Removes withNetwork gating and uses resolved netns path for process launch. |
| hypervisor/cloudhypervisor/restore.go | Aligns restore launch with new launchProcess signature. |
| hypervisor/cloudhypervisor/netresize.go | Increases NIC eject wait timeout to 30s. |
| hypervisor/cloudhypervisor/direct.go | Updates DirectClone signature to accept types.NetSetup. |
| hypervisor/cloudhypervisor/create.go | Updates Create signature to accept types.NetSetup. |
| hypervisor/cloudhypervisor/clone.go | Updates Clone flow to accept types.NetSetup and persist VM-level network identity. |
| hypervisor/clone.go | Threads types.NetSetup through clone helpers. |
| hypervisor/backend.go | Replaces NetworkConfigs field in CreateSpec with Net types.NetSetup. |
| cmd/vm/run.go | Makes initNetwork always call Prepare and passes types.NetSetup into hypervisor create/clone. |
| cmd/vm/netresize.go | Selects plumbing based on resolved VM-level network identity rather than NIC presence. |
| cmd/vm/lifecycle.go | Uses resolved VM-level network identity for provider selection and adds Prepare to recovery. |

Comment thread types/vm.go Outdated
Comment thread cmd/vm/run.go
Comment thread network/bridge/bridge_linux.go Outdated
Comment thread types/vm_test.go

Copilot AI left a comment

Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 3 comments.

Comment thread cmd/vm/run.go Outdated
Comment thread cmd/vm/netresize.go Outdated
Comment thread cmd/vm/lifecycle.go
CH/FC born from `vm run --nics 0` (or `vm clone --nics 0`) used to land
in the host netns because initNetwork short-circuited and no per-VM
netns was created. A later `vm net resize --nics N` was rejected by
plumbingForVM's zero-NICs guard, because the CH process couldn't see
NICs added into a netns it doesn't live in. The interface implied 0→N
was supported by symmetry but it wasn't — that asymmetry is the bug.

Move netns ownership onto the network module via a new Prepare step:
- network.Network gains Prepare(ctx, vmID, vmCfg) (netnsPath, error).
- CNI.Prepare ensureNetns idempotently and returns the path; "" when no
  conflist is configured.
- Bridge.Prepare is a no-op (bridge stays in host netns by design).
- initNetwork always calls Prepare regardless of NIC count.
- Add is only called when nics > 0.

Carry the result via a single types.NetSetup{Backend, NetnsPath,
BridgeDev, NICs} struct. Hypervisor.Create/Clone/DirectClone now take
NetSetup in place of `[]*NetworkConfig` so the same value flows from
the network module to the persisted VM record without a transient
duplicate slot on VMConfig.

Persist NetBackend / NetnsPath / NetBridgeDev at types.VM level (per
VM, not per NIC), with ResolvedNetX() fallbacks for old records that
still carry these on NetworkConfigs[0]. start.go / restore.go /
clone.go read rec.ResolvedNetnsPath() directly; the withNetwork
len(rec.NetworkConfigs) > 0 gate is gone — it conflated "has NICs"
with "needs netns entry", same bug one layer down.

netresize/plumbingForVM and lifecycle/providerForVM dispatch via
vm.ResolvedNetBackend() so a 0-NIC VM picks the right provider on
resize-up; the "zero NICs; resize up not supported" hard error is gone.
Bridge mode was always mechanically capable of 0→N (no netns concept);
it now shares the same code path as CNI without the spurious guard.

README updated to drop the "cannot resize up from zero" caveat. Adds
TestVMResolvedNetFields covering VM-level, NIC[0]-fallback, and empty
cases.

Copilot AI left a comment

Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated no new comments.

CMGS added 2 commits May 13, 2026 16:06
…ess takes netns directly

Senior + /simplify pass on PR #44:
- VM.{IsBridge,IsCNI} helpers; Resolved*/ApplyNetSetup are nil-safe
- launchProcess(netns) directly drops the partial-VM ferry in CH/FC clone
- printPostCloneHints reads vm.NetworkConfigs (no second arg)
- initNetwork rolls back on Prepare failure; caches netProvider.Type()
- recoverNetwork splits nil/empty-backend checks
- Tighten godoc and error strings

Copilot AI left a comment

Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated no new comments.

Single source of truth for VM network state. Anonymous embed preserves
JSON wire format (net_backend, netns_path, net_bridge_dev,
network_configs all stay flat on VM). Old records load unchanged.

- NetSetup gains JSON tags matching VM's existing field names
- VM drops the four duplicate fields; embeds NetSetup
- info.ApplyNetSetup(net) → info.NetSetup = net (direct assignment)
- setup.NICs → setup.NetworkConfigs (rename in NetSetup)
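
The wire-format claim rests on how `encoding/json` treats anonymous struct fields: an embedded struct without its own JSON tag has its fields promoted to the enclosing object. A minimal sketch, assuming Go field names, `omitempty` tags, and a placeholder `NetworkConfig` that are not taken from the PR (only the JSON key names are):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NetworkConfig is a placeholder per-NIC record; its field is hypothetical.
type NetworkConfig struct {
	Tap string `json:"tap,omitempty"`
}

// NetSetup carries the JSON tags; the Go field names here are guesses.
type NetSetup struct {
	Backend        string           `json:"net_backend,omitempty"`
	NetnsPath      string           `json:"netns_path,omitempty"`
	BridgeDev      string           `json:"net_bridge_dev,omitempty"`
	NetworkConfigs []*NetworkConfig `json:"network_configs,omitempty"`
}

// VM embeds NetSetup anonymously, so its fields marshal flat on the VM object.
type VM struct {
	ID string `json:"id"`
	NetSetup
}

// encode returns the JSON form of vm, or an empty string on error.
func encode(vm VM) string {
	b, err := json.Marshal(vm)
	if err != nil {
		return ""
	}
	return string(b)
}

// decode parses a flat record, such as one written before the embed.
func decode(s string) (VM, error) {
	var vm VM
	err := json.Unmarshal([]byte(s), &vm)
	return vm, err
}

func main() {
	vm := VM{ID: "vm1", NetSetup: NetSetup{Backend: "cni", NetnsPath: "/var/run/netns/cocoon-vm1"}}
	fmt.Println(encode(vm))

	old, err := decode(`{"id":"vm0","net_backend":"bridge","net_bridge_dev":"cni0"}`)
	fmt.Println(old.Backend, old.BridgeDev, err)
}
```

Because the keys stay at the top level, a record written with four separate VM fields and one written through the embedded struct are indistinguishable on disk.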
@CMGS CMGS merged commit d03dec4 into master May 13, 2026
4 checks passed
@CMGS CMGS deleted the feat/cni-netns-always branch May 13, 2026 08:46