
snapshot export fails with "header field too long" on highly fragmented memory-ranges #24

@tonicmuroq

Description


Symptom

cocoon snapshot export aborts on the memory-ranges entry for snapshots taken from VMs whose guest memory is sufficiently fragmented:

$ sudo cocoon snapshot export <name> -o /tmp/probe.tar
INF exporting to /tmp/probe.tar ...
Error: write archive: write header memory-ranges: archive/tar: header field too long

Reproduced 2026-05-05 against a Windows 11 cocoon VM running simular-pro-agent-runtime 1.8.0 (Electron app, signed in with a live Firebase WebSocket connection). The same export against the equivalent VM with the agent not signed in (no live Firebase connection → fewer fragmented allocations) succeeds normally. The 1.5.0 build of the same agent also exports fine.

Root cause

utils/tar_sparse_linux.go:tarFileMaybeSparse packs the entire sparse-segment list into a single tar PAX record (COCOON.sparse.map):

hdr.PAXRecords = map[string]string{
    paxSparseMap:  string(mapJSON),                  // can be > 1MB
    paxSparseSize: strconv.FormatInt(size, 10),
}
if err := tw.WriteHeader(hdr); err != nil { ... }

Go's archive/tar caps the encoded PAX block at maxSpecialFileSize = 1<<20 (see archive/tar/format.go), and Writer.WriteHeader returns ErrFieldTooLong when the encoded records exceed it. For a guest with many small live allocations (V8 heap, IPC buffers, WebSocket pools, native-module mmaps), memory-ranges produces tens of thousands of sparse segments; the JSON-encoded segment list balloons past the 1MB cap.

Empirical limit (measured against Go 1.26 archive/tar):

segments   mapJSON size   tar.Writer.WriteHeader
1,000      ~22 KB         ok
10,000     ~239 KB        ok
30,000     ~736 KB        ok
50,000     ~1.2 MB        header field too long
100,000    ~2.5 MB        header field too long
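The cap can be reproduced with stock archive/tar alone, no cocoon code needed. A minimal sketch (the segment field names are assumptions for illustration; the real COCOON.sparse.map schema isn't shown in this issue, so exact byte counts will differ from the table above):

```go
package main

import (
	"archive/tar"
	"bytes"
	"encoding/json"
	"fmt"
)

// seg is a stand-in for one sparse-map entry; the field names are
// assumptions, not the real COCOON.sparse.map schema.
type seg struct {
	Offset int64 `json:"offset"`
	Length int64 `json:"length"`
}

// tryPAXMap JSON-encodes an n-segment map, packs it into a single PAX
// record, and reports the encoded size plus archive/tar's verdict.
func tryPAXMap(n int) (int, error) {
	segs := make([]seg, n)
	for i := range segs {
		segs[i] = seg{Offset: int64(i) * 8192, Length: 4096}
	}
	mapJSON, err := json.Marshal(segs)
	if err != nil {
		return 0, err
	}
	tw := tar.NewWriter(&bytes.Buffer{})
	hdr := &tar.Header{
		Name:       "memory-ranges",
		Mode:       0600,
		Typeflag:   tar.TypeReg,
		Format:     tar.FormatPAX,
		PAXRecords: map[string]string{"COCOON.sparse.map": string(mapJSON)},
	}
	// WriteHeader encodes all PAX records into one extended-header file;
	// past maxSpecialFileSize (1<<20) it returns ErrFieldTooLong.
	return len(mapJSON), tw.WriteHeader(hdr)
}

func main() {
	for _, n := range []int{1_000, 50_000} {
		size, err := tryPAXMap(n)
		fmt.Printf("segments=%d mapJSON=%dB err=%v\n", n, size, err)
	}
}
```

With ~33 bytes per encoded segment, the 50,000-segment map lands well past the 1 MiB cap while the 1,000-segment map sails through.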

Downstream impact

vk-cocoon's hibernate path (Save → Push → Remove) calls Pusher.PushSnapshot, which streams cocoon snapshot export -o - into epoch. When export fails on the memory-ranges entry, push only uploads the small metadata blobs (snapshot.json, config) before erroring; the memory layer never PUTs. vk-cocoon's workqueue then silently retries every ~30s with the same outcome, so:

  • vm-service's hibernate API never observes phase=Suspended and times out (was 300s, raising the budget does not help — the loop is permanent).
  • The CocoonSet stays in Running phase with pod=ProviderFailed.
  • Hibernating any VM whose agent has a live Firebase connection is currently unreachable in production.

vk-cocoon's Provider.UpdatePod does return the error from hibernate() to the workqueue, but neither vk-cocoon nor the cocoon CLI logs the wrapped error message at INF/WRN level; the only signal in operator logs is the absence of the expected vm rm call after the export step. That made this bug significantly harder to diagnose than it needed to be; the silent-retry behavior is separately worth fixing so the failing PushSnapshot error surfaces in the journal.

Proposed fix

PR #23 falls back to a non-sparse tar entry when len(mapJSON) exceeds ~800 KB (well below the 1MB cap, with margin for the size record + framing). Memory-ranges can be GB-scale, so the fallback gives up the sparse-export size win on the affected file — but a successful larger push beats an indefinite hung loop. The reader path is unchanged.

Open question: should there also be an INF/WRN log on the cocoon-CLI side when the fallback fires, so operators can see "this snapshot took the slow path because the segment map was too fragmented"? Right now the fallback is silent — that's friendly to existing tooling, but operators investigating slow snapshots won't know why a particular file was emitted full-size.
