## Symptom
`cocoon snapshot export` aborts on the `memory-ranges` entry for snapshots taken from VMs whose guest memory is sufficiently fragmented:

```
$ sudo cocoon snapshot export <name> -o /tmp/probe.tar
INF exporting to /tmp/probe.tar ...
Error: write archive: write header memory-ranges: archive/tar: header field too long
```
Reproduced 2026-05-05 against a Windows 11 cocoon VM running `simular-pro-agent-runtime` 1.8.0 (Electron app, signed in with a live Firebase WebSocket). The same export against the equivalent VM with the agent not signed in (no live Firebase connection → fewer fragmented allocations) succeeds normally. The 1.5.0 build of the same agent also exports fine.
## Root cause
`utils/tar_sparse_linux.go:tarFileMaybeSparse` packs the entire sparse-segment list into a single tar PAX record (`COCOON.sparse.map`):
```go
hdr.PAXRecords = map[string]string{
	paxSparseMap:  string(mapJSON), // can be > 1MB
	paxSparseSize: strconv.FormatInt(size, 10),
}
if err := tw.WriteHeader(hdr); err != nil { ... }
```
Go's `archive/tar` caps the encoded PAX block at `maxSpecialFileSize = 1<<20` (see `archive/tar/format.go`), and `Writer.writeHeader` returns `ErrFieldTooLong` when the encoded records exceed it. For a guest with many small live allocations (V8 heap, IPC buffers, WebSocket pools, native-module mmaps), `memory-ranges` produces tens of thousands of sparse segments; the JSON-encoded segment list balloons past the 1 MiB cap.
Empirical limit (measured against Go 1.26 `archive/tar`):

| segments | `mapJSON` size | `tar.Writer.WriteHeader` |
|---------:|---------------:|:-------------------------|
| 1,000 | ~22 KB | ok |
| 10,000 | ~239 KB | ok |
| 30,000 | ~736 KB | ok |
| 50,000 | ~1.2 MB | `header field too long` |
| 100,000 | ~2.5 MB | `header field too long` |
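The cap is easy to confirm with the standard library alone. A minimal sketch (plain `archive/tar`, not cocoon code) that straddles the limit with an oversized custom PAX record:

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"strings"
)

func main() {
	// Record sizes chosen to land on either side of the table above.
	for _, kb := range []int{736, 1200} {
		var buf bytes.Buffer
		tw := tar.NewWriter(&buf)
		hdr := &tar.Header{
			Name: "memory-ranges",
			Mode: 0o600,
			PAXRecords: map[string]string{
				// Stand-in for COCOON.sparse.map at a given JSON size.
				"COCOON.sparse.map": strings.Repeat("x", kb<<10),
			},
		}
		fmt.Printf("%4d KB record: err=%v\n", kb, tw.WriteHeader(hdr))
	}
	// Prints:
	//  736 KB record: err=<nil>
	// 1200 KB record: err=archive/tar: header field too long
}
```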
## Downstream impact
vk-cocoon's hibernate path (`Save → Push → Remove`) calls `Pusher.PushSnapshot`, which streams `cocoon snapshot export -o -` into epoch. When export fails on the `memory-ranges` entry, push only uploads the small metadata blobs (`snapshot.json`, config) before erroring; the memory layer never PUTs. vk-cocoon's workqueue then silently retries every ~30s with the same outcome (the streaming shape is sketched after the list below), so:
- vm-service's hibernate API never observes `phase=Suspended` and times out (was 300s; raising the budget does not help, since the loop is permanent).
- The CocoonSet stays in the `Running` phase with `pod=ProviderFailed`.
- Hibernating any VM whose agent holds a live Firebase connection is currently impossible in production.
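To make the failure mode concrete, here is a hypothetical sketch of the push pipeline; the `uploader` interface stands in for epoch's actual upload API, and only the `cocoon snapshot export -o -` invocation comes from this report:

```go
package pusher

import (
	"context"
	"fmt"
	"io"
	"os/exec"
)

// uploader is an assumed stand-in for epoch's upload API.
type uploader interface {
	Upload(ctx context.Context, r io.Reader) error
}

// pushSnapshot streams the export's stdout straight into the upload. When
// export dies on memory-ranges, only the small leading entries have been
// consumed, so the memory layer is never PUT and Wait reports the failure.
func pushSnapshot(ctx context.Context, up uploader, name string) error {
	cmd := exec.CommandContext(ctx, "cocoon", "snapshot", "export", name, "-o", "-")
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return err
	}
	if err := cmd.Start(); err != nil {
		return err
	}
	upErr := up.Upload(ctx, stdout) // reads until the export closes stdout
	if err := cmd.Wait(); err != nil {
		return fmt.Errorf("cocoon snapshot export %s: %w", name, err)
	}
	return upErr
}
```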
vk-cocoon's `Provider.UpdatePod` does return the error from `hibernate()` to the workqueue, but neither vk-cocoon nor the cocoon CLI logs the wrapped error message at INF/WRN level; the only signal in operator logs is the absence of the expected `vm rm` call after the export step. That made this bug significantly harder to diagnose than it had to be; the silent-retry behavior is separately worth fixing so the failing `PushSnapshot` error surfaces in the journal, e.g. along the lines sketched below.
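A possible shape for that logging fix, assuming a zerolog-style logger (the INF/WRN prefixes suggest one); the surrounding types are illustrative, not vk-cocoon's actual API:

```go
package provider

import (
	"context"
	"fmt"

	"github.com/rs/zerolog/log"
)

type snapshotPusher interface {
	PushSnapshot(ctx context.Context, name string) error
}

func hibernate(ctx context.Context, pusher snapshotPusher, vm string) error {
	if err := pusher.PushSnapshot(ctx, vm); err != nil {
		// WRN here means every ~30s workqueue retry leaves a line in the
		// journal instead of vanishing into UpdatePod's return value.
		log.Warn().Err(err).Str("vm", vm).Msg("hibernate: PushSnapshot failed; will retry")
		return fmt.Errorf("hibernate %s: push snapshot: %w", vm, err)
	}
	return nil
}
```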
## Proposed fix
PR #23 falls back to a non-sparse tar entry when `len(mapJSON)` exceeds ~800 KB (well below the 1 MiB cap, with margin for the size record plus PAX framing). Memory-ranges can be GB-scale, so the fallback gives up the sparse-export size win on the affected file, but a successful larger push beats an indefinitely hung loop. The reader path is unchanged. A sketch of the gate follows.
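A sketch of the gate's shape under stated assumptions: the `COCOON.sparse.map` key and the ~800 KB budget come from this report, while `writeHeaderMaybeSparse`, its signature, and the `COCOON.sparse.size` key string are illustrative:

```go
package export

import (
	"archive/tar"
	"encoding/json"
	"strconv"
)

const (
	paxSparseMap  = "COCOON.sparse.map"
	paxSparseSize = "COCOON.sparse.size" // assumed key string

	// Stay well below archive/tar's 1 MiB encoded-PAX cap, leaving
	// headroom for the size record and PAX framing.
	paxSparseMapBudget = 800 << 10
)

type segment struct {
	Offset int64 `json:"offset"`
	Length int64 `json:"length"`
}

// writeHeaderMaybeSparse attaches the sparse map only when it fits the
// budget; otherwise it writes a plain header so the entry is emitted
// full-size instead of aborting the whole export.
func writeHeaderMaybeSparse(tw *tar.Writer, hdr *tar.Header, segs []segment, size int64) error {
	mapJSON, err := json.Marshal(segs)
	if err != nil {
		return err
	}
	if len(mapJSON) <= paxSparseMapBudget {
		hdr.PAXRecords = map[string]string{
			paxSparseMap:  string(mapJSON),
			paxSparseSize: strconv.FormatInt(size, 10),
		}
	}
	// Without PAXRecords the caller must stream every byte of the file;
	// with them, only the live segments follow (reader path unchanged).
	return tw.WriteHeader(hdr)
}
```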
Open question: should there also be an INF/WRN log on the cocoon-CLI side when the fallback fires, so operators can see "this snapshot took the slow path because the segment map was too fragmented"? Right now the fallback is silent, which is friendly to existing tooling, but operators investigating slow snapshots won't know why a particular file was emitted full-size.