Skip to content

BRANCH 3+ pause_ms jumps 5× on same source (snapshot-worker single-thread loop) #146

@WaylandYang

Description

@WaylandYang

Problem

Consecutive `diff: true` BRANCHes on the same source sandbox show a sharp pause-time jump starting around BRANCH 3:

BRANCH pause_ms
1 288
2 263
3 1321
4 1389
5 1446

Source: `bench/pause-window/multi-branch-sweep.csv` (mem-2048, SSD, 3s gap). Reproducible on smaller snapshots too (`coding-agent-fork-prewarm-v1` shows the same shape, sustaining ~700-2700ms from BRANCH 6+).

`diff_physical_bytes` does NOT grow with the slow BRANCHes — they're SMALLER in the slow regime (300-700 KB vs 1.2 MB for BRANCH 1). So it's not a data-volume problem.

What's been ruled out (3 probe passes)

PR #128 — initial strace probe:

PR #140 — bpftrace + per-thread stack poll:

  • bpftrace profile:hz:199 → only 18 samples in 1.6s of BRANCH out of ~320 expected = FC is off-CPU ~94% of the slow window
  • /proc/$pid/task/*/stack histogram during slow window:
    • 90 `ep_poll` (main thread idle)
    • 88 `[kvm]` (vCPU thread in `kvm_vcpu_halt` — pause working as designed)
    • 50 `vhost_task_fn` (vhost-net idle)
    • 17 `futex_wait_queue` ← actual signal
    • 5 block IO / journal commit
  • Original "user-space CPU" conclusion was too strong; needed deeper investigation.

PR #143 — futex tracing:

  • bpftrace on `sys_enter_futex` / `sys_exit_futex` with `args->uaddr` capture
  • 3 different futexes each accumulate ~152ms wait during an 8s window covering one BRANCH = ~the entire pause
  • All 3 use op 137 = `FUTEX_WAIT_BITSET_PRIVATE`
  • Cross-referenced `/proc/$pid/maps`: 3 uaddrs land in anonymous heap mappings adjacent to FC binary's `.bss` (consistent with Rust heap-allocated synchronization primitives — `parking_lot::Mutex` / `Condvar` inner atomics)
  • Interpretation flip: these are NOT contended. They're 3 idle waiter threads parked on the snapshot worker's completion signal. Eliminating the futex calls wouldn't speed anything up — the bottleneck is whatever the snapshot worker is doing single-threaded.

Current best hypothesis

The slowness lives in a single-threaded loop inside Firecracker's `vmm::persist::create_snapshot` (~21 KB of compiled code; objdump symbol at `0x192e80` in the v1.12 binary). The loop's iteration count grows with the number of snapshots taken on a single VM. Candidates:

  • Memory region descriptor walk (each BRANCH may append metadata)
  • vCPU snapshot `versionize` chain
  • KVM dirty-page bitmap user-side post-processing
  • Block-device / vsock state accumulation

Direct consequences for #118

The original "Phase 2: io_uring writer, Phase 3: 1s pre-emptive tick" scope is wrong:

  • Phase 2 addresses block-IO writeback — ~2% of the slow window per the thread-poll histogram
  • Phase 3 (1s tick) compounds nothing about the futexes (they're passive waiters), and would itself hit the same per-snapshot CPU growth within ~10 ticks

Next concrete step

  1. Build Firecracker with DWARF debug info (`[profile.release] debug = true` + frame pointers, or `RUSTFLAGS="-C debuginfo=2 -C force-frame-pointers=yes"`). Current FC binary is not-stripped but lacks DWARF + frame pointers, so bpftrace's `ustack` and perf's call-graph mode can't symbolize the worker thread's user-space stacks.
  2. Install `perf` for kernel 6.14.0-36 on the dev box (`linux-tools-6.14.0-36-generic` ships without it; needs build-from-source or kernel-tools workaround).
  3. `perf record -F 99 -p $FC_PID -g --call-graph dwarf` during a slow BRANCH window.
  4. Convert to flamegraph (Brendan Gregg's `flamegraph.pl`). Hot bar = the per-snapshot growing loop.
  5. Read FC source for `vmm::persist::create_snapshot` around the hot function. Patch upstream OR cap per-VM diff BRANCH count and recycle source via Full BRANCH + restore.

Workarounds today

  • Cap diff BRANCHes per source at 2-3; spawn a fresh source from the latest BRANCH and continue from there
  • Still ~14× better aggregate downtime than Full BRANCHes even in the slow regime — doesn't block production use

Refs

  • `bench/pause-window/PROBE-multi-branch-anomaly.md` — full investigation log
  • `bench/pause-window/RESULTS-v0.3.md` § "Phase 1d: multi-BRANCH diff" — original anomaly note
  • `bench/pause-window/probe-multi-branch-strace.sh`, `probe-bpftrace-fc.sh`, `probe-futex-trace.sh` — reproducible probes

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions