Skip to content

probe: perf flamegraph (DWARF FC) — pause is in-kernel sleep, not user CPU#150

Merged
WaylandYang merged 1 commit into
mainfrom
probe/118-perf-flamegraph
May 23, 2026
Merged

probe: perf flamegraph (DWARF FC) — pause is in-kernel sleep, not user CPU#150
WaylandYang merged 1 commit into
mainfrom
probe/118-perf-flamegraph

Conversation

@WaylandYang
Copy link
Copy Markdown
Contributor

Summary

Fourth pass on #146. Built FC v1.12.0 from source with DWARF + frame pointers, swapped into /usr/local/bin/firecracker. Ran `perf record -F 99 -a -g --call-graph fp` during a slow BRANCH window (3.8 s of pause inside 10 s capture).

Result flips the interpretation again — and the new picture is the simplest yet.

What perf saw

```
Samples by process (top):
1176 swapper ← CPU idle (47 % of samples)
97 plymouthd
78 bash
73 sshd
...
1 firecracker ← !!!
```

Only 1 sample landed in firecracker in the entire 3.8 s of pause window. That sample was in `vmm::vstate::vm::ArchVm::snapshot_memory_to_file+0x871` (called from `vmm::persist::create_snapshot+0x2c5`). ~10 ms of on-CPU time can't explain 3.8 s of pause.

The pause time isn't burned in user-space CPU. It's burned in in-kernel sleep with FC parked. On-CPU flamegraphing is the wrong tool here.

Synthesis across four probes

Pass PR Hypothesis Verdict
1 strace -c #128 user-space CPU bottleneck too strong
2 bpftrace + /proc/stack #140 94 % off-CPU + futex waiters right shape
3 futex args.uaddr #143 passive waiters, not contention eliminates futex
4 perf -a -g + DWARF FC this PR in-kernel sleep dominates final picture

Best remaining hypotheses for the kernel-sleep duration growth

  1. KVM_GET_DIRTY_LOG kernel walk grows with VM uptime — KVM's per-memslot dirty bitmap clears each call, but kernel-side scan may walk structures that grow as guest pages ever-get-touched.
  2. page-cache → writeback contention — N+1th BRANCH's `memory.bin` writes compete with N's pending writeback. `kswapd0` / `kworker/u80:1+f` both visible in samples.
  3. VMA / mmap teardown — process VMA list grows per BRANCH; some kernel paths walk it linearly.

Next concrete step

Off-CPU flamegraph via bpftrace `finish_task_switch` + `try_to_wake_up` paired probe. Sums per-(user-stack, kernel-stack) sleep duration; the hottest pair tells us exactly which kernel function FC is parked on. Estimated 30 min.

Setup notes for reproducing

  • `firecracker-microvm/firecracker@v1.12.0` cloned at `~/work/firecracker-src` on the dev box
  • Patched `Cargo.toml`: `lto = false`, `debug = "full"`
  • Built with `RUSTFLAGS="-C force-frame-pointers=yes"`
  • Required apt: `cmake`, `clang-17`, `libseccomp-dev=2.5.5-1ubuntu3` (downgraded to match libseccomp2)
  • `kernel.perf_event_paranoid=1` (4 silently produces 0-sample perf.data)
  • `kernel.kptr_restrict=0` for kernel symbol resolution

Files

  • `bench/pause-window/probe-perf-flamegraph.sh` — the probe script
  • `bench/pause-window/PROBE-multi-branch-anomaly.md` — "Follow-up: perf flamegraph with DWARF Firecracker" section appended

Refs #146.

🤖 Generated with Claude Code

Fourth pass on #146. Built Firecracker v1.12.0 from source with
\`debug = \"full\"\` + force-frame-pointers, swapped into /usr/local/bin
/firecracker. Lowered kernel.perf_event_paranoid 4→1 (was silently
dropping samples). Ran \`perf record -F 99 -a -g --call-graph fp\` for
10 s with two slow BRANCHes (pause 2279ms + 1530ms = 3.8s active) in
the window.

Result that flips the picture again:

  Samples by process (top):
    1176  swapper           ← CPU idle (47 % of samples)
      97  plymouthd
      78  bash
      ...
       1  firecracker       ← !!!

Only ONE sample in 3.8 s of pause landed in firecracker — in
\`vmm::vstate::vm::ArchVm::snapshot_memory_to_file\` called from
\`vmm::persist::create_snapshot\`. ~10 ms of on-CPU time can't
explain 3.8 s of pause.

**The pause time isn't burned in user-space CPU. It's burned in
in-kernel sleep with FC parked.** On-CPU flamegraphing is the wrong
tool here.

Synthesis across all four probes:
  Pass 1 (strace -c)        → \"user-space CPU\"        (too strong)
  Pass 2 (bpftrace + stack) → \"94 % off-CPU + futex\"  (right shape)
  Pass 3 (futex args.uaddr) → \"passive waiters\"        (eliminates futex)
  Pass 4 (perf -a -g)        → \"in-kernel sleep wins\"   (final picture)

Most likely real causes of the grow-per-BRANCH wait time:
  1. KVM_GET_DIRTY_LOG kernel walk that scales with VM uptime
  2. page-cache → writeback contention with prior BRANCHes
  3. VMA / mmap teardown that walks growing process VMA list

Next concrete step: off-CPU flamegraph. bpftrace probe sketch on
kprobe:finish_task_switch + try_to_wake_up to sum per-(ustack,kstack)
sleep duration. Estimated 30 min once probe is right.

Ships:
  - bench/pause-window/probe-perf-flamegraph.sh
  - PROBE-multi-branch-anomaly.md \"Follow-up: perf flamegraph\" section

Refs #146.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@WaylandYang WaylandYang force-pushed the probe/118-perf-flamegraph branch from de5efc4 to 07758d0 Compare May 23, 2026 05:49
@WaylandYang WaylandYang merged commit 6db3eeb into main May 23, 2026
2 checks passed
@WaylandYang WaylandYang deleted the probe/118-perf-flamegraph branch May 23, 2026 05:53
WaylandYang added a commit that referenced this pull request May 23, 2026
…151)

Round 5 of #146. Two complementary findings:

1. perf with explicit hybrid sampling (-e cpu_core/cycles/P -e
   cpu_atom/cycles/P) — fixes the Alder Lake E-core blindness from
   round 4. Got 13 FC samples in a 20s window with 4 slow BRANCHes,
   46% of which are in ext4 kernel paths:
     io_schedule ← wbt_wait ← submit_bio ← ext4_bio_write_folio
       ← ext4_do_writepages
     down_write ← ext4_da_map_blocks ← ext4_da_write_begin
       ← vfs_write
     crc32c_x86_3way ← ext4_block_bitmap_csum_set
       ← ext4_mb_mark_diskspace_used ← ext4_mb_new_blocks
   = ext4 delayed allocation + writeback throttle (wbt) + multi-block
   allocator + block-bitmap checksumming.

2. tmpfs control — second daemon with --snapshot-root /dev/shm/...
   Ran 10 consecutive diff BRANCHes:
     BRANCH  1 (cold): pause_ms=728
     BRANCH  2-10:     pause_ms 111-259, flat
   No slow regime. Anomaly is 100% filesystem-layer.

Synthesis of all 5 probe rounds:
  Pass 1 strace -c    → user-space CPU             (wrong)
  Pass 2 bpftrace      → 94% off-CPU + futex        (side obs)
  Pass 3 futex args    → passive waiters            (confirmed)
  Pass 4 perf -a -g    → kernel sleep dominates     (right direction, blind)
  Pass 5 perf hybrid + tmpfs → ext4 wbt + mballoc   (ROOT CAUSE)

Fix path is now clear and small:
  1. fallocate memory.bin to full size before FC writes (~30 LOC Rust)
  2. O_DIRECT in FC's snapshot writer (upstream FC PR)
  3. io_uring (Phase 2 original case, now justified)
  4. memfd-backed memory.bin (v0.4)

Recommended first step: fallocate. If pause stays flat → ship as
forkd v0.3.4 + close #146.

Ships:
  - bench/pause-window/probe-perf-hybrid.sh
  - PROBE-multi-branch-anomaly.md "ROOT CAUSE FOUND" section
  - probe-offcpu.sh referenced (already shipped in #150)

Refs #146.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
WaylandYang added a commit that referenced this pull request May 23, 2026
…balloc (#152)

5 rounds of probe (#128, #140, #143, #150, #151) traced the multi-BRANCH
pause_ms anomaly to ext4 delayed allocation + writeback throttle
+ multi-block allocator + block-bitmap CRC compounding per BRANCH.
tmpfs control confirmed: anomaly is 100% in the fs layer.

Fix: posix_fallocate the destination memory.bin to source full size
right after we create snap_dir, before either the diff-mode background
copy OR FC's snapshot/create write. ext4 reserves the extents up-front;
subsequent writes don't run mballoc or update block bitmap.

Best-effort — on tmpfs / unsupported FS the syscall returns ENOSYS, we
log at WARN and continue (no behavior change). On ext4 (the actual
problem case), this should flatten pause_ms across the BRANCH chain.

Adds libc 0.2 as a cfg(unix) dep for posix_fallocate.

Refs #146.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant