Problem
Consecutive `diff: true` BRANCHes on the same source sandbox show a sharp pause-time jump starting around BRANCH 3:
| BRANCH |
pause_ms |
| 1 |
288 |
| 2 |
263 |
| 3 |
1321 |
| 4 |
1389 |
| 5 |
1446 |
Source: `bench/pause-window/multi-branch-sweep.csv` (mem-2048, SSD, 3s gap). Reproducible on smaller snapshots too (`coding-agent-fork-prewarm-v1` shows the same shape, sustaining ~700-2700ms from BRANCH 6+).
`diff_physical_bytes` does NOT grow with the slow BRANCHes — they're SMALLER in the slow regime (300-700 KB vs 1.2 MB for BRANCH 1). So it's not a data-volume problem.
What's been ruled out (3 probe passes)
PR #128 — initial strace probe:
PR #140 — bpftrace + per-thread stack poll:
- bpftrace profile:hz:199 → only 18 samples in 1.6s of BRANCH out of ~320 expected = FC is off-CPU ~94% of the slow window
- /proc/$pid/task/*/stack histogram during slow window:
- 90 `ep_poll` (main thread idle)
- 88 `[kvm]` (vCPU thread in `kvm_vcpu_halt` — pause working as designed)
- 50 `vhost_task_fn` (vhost-net idle)
- 17 `futex_wait_queue` ← actual signal
- 5 block IO / journal commit
- Original "user-space CPU" conclusion was too strong; needed deeper investigation.
PR #143 — futex tracing:
- bpftrace on `sys_enter_futex` / `sys_exit_futex` with `args->uaddr` capture
- 3 different futexes each accumulate ~152ms wait during an 8s window covering one BRANCH = ~the entire pause
- All 3 use op 137 = `FUTEX_WAIT_BITSET_PRIVATE`
- Cross-referenced `/proc/$pid/maps`: 3 uaddrs land in anonymous heap mappings adjacent to FC binary's `.bss` (consistent with Rust heap-allocated synchronization primitives — `parking_lot::Mutex` / `Condvar` inner atomics)
- Interpretation flip: these are NOT contended. They're 3 idle waiter threads parked on the snapshot worker's completion signal. Eliminating the futex calls wouldn't speed anything up — the bottleneck is whatever the snapshot worker is doing single-threaded.
Current best hypothesis
The slowness lives in a single-threaded loop inside Firecracker's `vmm::persist::create_snapshot` (~21 KB of compiled code; objdump symbol at `0x192e80` in the v1.12 binary). The loop's iteration count grows with the number of snapshots taken on a single VM. Candidates:
- Memory region descriptor walk (each BRANCH may append metadata)
- vCPU snapshot `versionize` chain
- KVM dirty-page bitmap user-side post-processing
- Block-device / vsock state accumulation
Direct consequences for #118
The original "Phase 2: io_uring writer, Phase 3: 1s pre-emptive tick" scope is wrong:
- Phase 2 addresses block-IO writeback — ~2% of the slow window per the thread-poll histogram
- Phase 3 (1s tick) compounds nothing about the futexes (they're passive waiters), and would itself hit the same per-snapshot CPU growth within ~10 ticks
Next concrete step
- Build Firecracker with DWARF debug info (`[profile.release] debug = true` + frame pointers, or `RUSTFLAGS="-C debuginfo=2 -C force-frame-pointers=yes"`). Current FC binary is not-stripped but lacks DWARF + frame pointers, so bpftrace's `ustack` and perf's call-graph mode can't symbolize the worker thread's user-space stacks.
- Install `perf` for kernel 6.14.0-36 on the dev box (`linux-tools-6.14.0-36-generic` ships without it; needs build-from-source or kernel-tools workaround).
- `perf record -F 99 -p $FC_PID -g --call-graph dwarf` during a slow BRANCH window.
- Convert to flamegraph (Brendan Gregg's `flamegraph.pl`). Hot bar = the per-snapshot growing loop.
- Read FC source for `vmm::persist::create_snapshot` around the hot function. Patch upstream OR cap per-VM diff BRANCH count and recycle source via Full BRANCH + restore.
Workarounds today
- Cap diff BRANCHes per source at 2-3; spawn a fresh source from the latest BRANCH and continue from there
- Still ~14× better aggregate downtime than Full BRANCHes even in the slow regime — doesn't block production use
Refs
- `bench/pause-window/PROBE-multi-branch-anomaly.md` — full investigation log
- `bench/pause-window/RESULTS-v0.3.md` § "Phase 1d: multi-BRANCH diff" — original anomaly note
- `bench/pause-window/probe-multi-branch-strace.sh`, `probe-bpftrace-fc.sh`, `probe-futex-trace.sh` — reproducible probes
Problem
Consecutive `diff: true` BRANCHes on the same source sandbox show a sharp pause-time jump starting around BRANCH 3:
Source: `bench/pause-window/multi-branch-sweep.csv` (mem-2048, SSD, 3s gap). Reproducible on smaller snapshots too (`coding-agent-fork-prewarm-v1` shows the same shape, sustaining ~700-2700ms from BRANCH 6+).
`diff_physical_bytes` does NOT grow with the slow BRANCHes — they're SMALLER in the slow regime (300-700 KB vs 1.2 MB for BRANCH 1). So it's not a data-volume problem.
What's been ruled out (3 probe passes)
PR #128 — initial strace probe:
PR #140 — bpftrace + per-thread stack poll:
PR #143 — futex tracing:
Current best hypothesis
The slowness lives in a single-threaded loop inside Firecracker's `vmm::persist::create_snapshot` (~21 KB of compiled code; objdump symbol at `0x192e80` in the v1.12 binary). The loop's iteration count grows with the number of snapshots taken on a single VM. Candidates:
Direct consequences for #118
The original "Phase 2: io_uring writer, Phase 3: 1s pre-emptive tick" scope is wrong:
Next concrete step
Workarounds today
Refs