BRANCH 3+ pause_ms jumps 5× on same source (snapshot-worker single-thread loop)

## Problem
Consecutive \`diff: true\` BRANCHes on the same source sandbox show a sharp pause-time jump starting around BRANCH 3:

| BRANCH | pause_ms |
|---|---|
| 1 | 288 |
| 2 | 263 |
| 3 | 1321 |
| 4 | 1389 |
| 5 | 1446 |

Source: \`bench/pause-window/multi-branch-sweep.csv\` (mem-2048, SSD, 3s gap). Reproducible on smaller snapshots too (\`coding-agent-fork-prewarm-v1\` shows the same shape, sustaining ~700-2700ms from BRANCH 6+).

\`diff_physical_bytes\` does NOT grow with the slow BRANCHes — they're SMALLER in the slow regime (300-700 KB vs 1.2 MB for BRANCH 1). So it's not a data-volume problem.

## What's been ruled out (3 probe passes)

**[PR #128](https://github.com/deeplethe/forkd/pull/128) — initial strace probe**:
- Not syscall count (constant ~709-1078 calls)
- Not time-in-syscalls (~5-10ms out of 800-1000ms wall)
- Concluded: user-space CPU. Direction set for Phase 2/3 of #118.

**[PR #140](https://github.com/deeplethe/forkd/pull/140) — bpftrace + per-thread stack poll**:
- bpftrace profile:hz:199 → only 18 samples in 1.6s of BRANCH out of ~320 expected = **FC is off-CPU ~94% of the slow window**
- /proc/$pid/task/*/stack histogram during slow window:
  - 90 \`ep_poll\` (main thread idle)
  - 88 \`[kvm]\` (vCPU thread in \`kvm_vcpu_halt\` — pause working as designed)
  - 50 \`vhost_task_fn\` (vhost-net idle)
  - **17 \`futex_wait_queue\`** ← actual signal
  - 5 block IO / journal commit
- Original "user-space CPU" conclusion was too strong; needed deeper investigation.

**[PR #143](https://github.com/deeplethe/forkd/pull/143) — futex tracing**:
- bpftrace on \`sys_enter_futex\` / \`sys_exit_futex\` with \`args->uaddr\` capture
- 3 different futexes each accumulate **~152ms wait** during an 8s window covering one BRANCH = ~the entire pause
- All 3 use op 137 = \`FUTEX_WAIT_BITSET_PRIVATE\`
- Cross-referenced \`/proc/$pid/maps\`: 3 uaddrs land in anonymous heap mappings adjacent to FC binary's \`.bss\` (consistent with Rust heap-allocated synchronization primitives — \`parking_lot::Mutex\` / \`Condvar\` inner atomics)
- **Interpretation flip**: these are NOT contended. They're 3 idle waiter threads parked on the snapshot worker's completion signal. Eliminating the futex calls wouldn't speed anything up — the bottleneck is whatever the snapshot worker is doing single-threaded.

## Current best hypothesis
The slowness lives in a single-threaded loop inside Firecracker's \`vmm::persist::create_snapshot\` (~21 KB of compiled code; objdump symbol at \`0x192e80\` in the v1.12 binary). The loop's iteration count grows with the number of snapshots taken on a single VM. Candidates:
- Memory region descriptor walk (each BRANCH may append metadata)
- vCPU snapshot \`versionize\` chain
- KVM dirty-page bitmap user-side post-processing
- Block-device / vsock state accumulation

## Direct consequences for [#118](https://github.com/deeplethe/forkd/issues/118)
The original "Phase 2: io_uring writer, Phase 3: 1s pre-emptive tick" scope is wrong:
- Phase 2 addresses block-IO writeback — ~2% of the slow window per the thread-poll histogram
- Phase 3 (1s tick) compounds nothing about the futexes (they're passive waiters), and would itself hit the same per-snapshot CPU growth within ~10 ticks

## Next concrete step
1. **Build Firecracker with DWARF debug info** (\`[profile.release] debug = true\` + frame pointers, or \`RUSTFLAGS="-C debuginfo=2 -C force-frame-pointers=yes"\`). Current FC binary is not-stripped but lacks DWARF + frame pointers, so bpftrace's \`ustack\` and perf's call-graph mode can't symbolize the worker thread's user-space stacks.
2. Install \`perf\` for kernel 6.14.0-36 on the dev box (\`linux-tools-6.14.0-36-generic\` ships without it; needs build-from-source or kernel-tools workaround).
3. \`perf record -F 99 -p $FC_PID -g --call-graph dwarf\` during a slow BRANCH window.
4. Convert to flamegraph (Brendan Gregg's \`flamegraph.pl\`). Hot bar = the per-snapshot growing loop.
5. Read FC source for \`vmm::persist::create_snapshot\` around the hot function. Patch upstream OR cap per-VM diff BRANCH count and recycle source via Full BRANCH + restore.

## Workarounds today
- Cap diff BRANCHes per source at 2-3; spawn a fresh source from the latest BRANCH and continue from there
- Still ~14× better aggregate downtime than Full BRANCHes even in the slow regime — doesn't block production use

## Refs
- \`bench/pause-window/PROBE-multi-branch-anomaly.md\` — full investigation log
- \`bench/pause-window/RESULTS-v0.3.md\` § "Phase 1d: multi-BRANCH diff" — original anomaly note
- \`bench/pause-window/probe-multi-branch-strace.sh\`, \`probe-bpftrace-fc.sh\`, \`probe-futex-trace.sh\` — reproducible probes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BRANCH 3+ pause_ms jumps 5× on same source (snapshot-worker single-thread loop) #146

Problem

What's been ruled out (3 probe passes)

Current best hypothesis

Direct consequences for #118

Next concrete step

Workarounds today

Refs

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

BRANCH 3+ pause_ms jumps 5× on same source (snapshot-worker single-thread loop) #146

Description

Problem

What's been ruled out (3 probe passes)

Current best hypothesis

Direct consequences for #118

Next concrete step

Workarounds today

Refs

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions