probe: perf flamegraph (DWARF FC) — pause is in-kernel sleep, not user CPU#150
Merged
Conversation
Fourth pass on #146. Built Firecracker v1.12.0 from source with \`debug = \"full\"\` + force-frame-pointers, swapped into /usr/local/bin /firecracker. Lowered kernel.perf_event_paranoid 4→1 (was silently dropping samples). Ran \`perf record -F 99 -a -g --call-graph fp\` for 10 s with two slow BRANCHes (pause 2279ms + 1530ms = 3.8s active) in the window. Result that flips the picture again: Samples by process (top): 1176 swapper ← CPU idle (47 % of samples) 97 plymouthd 78 bash ... 1 firecracker ← !!! Only ONE sample in 3.8 s of pause landed in firecracker — in \`vmm::vstate::vm::ArchVm::snapshot_memory_to_file\` called from \`vmm::persist::create_snapshot\`. ~10 ms of on-CPU time can't explain 3.8 s of pause. **The pause time isn't burned in user-space CPU. It's burned in in-kernel sleep with FC parked.** On-CPU flamegraphing is the wrong tool here. Synthesis across all four probes: Pass 1 (strace -c) → \"user-space CPU\" (too strong) Pass 2 (bpftrace + stack) → \"94 % off-CPU + futex\" (right shape) Pass 3 (futex args.uaddr) → \"passive waiters\" (eliminates futex) Pass 4 (perf -a -g) → \"in-kernel sleep wins\" (final picture) Most likely real causes of the grow-per-BRANCH wait time: 1. KVM_GET_DIRTY_LOG kernel walk that scales with VM uptime 2. page-cache → writeback contention with prior BRANCHes 3. VMA / mmap teardown that walks growing process VMA list Next concrete step: off-CPU flamegraph. bpftrace probe sketch on kprobe:finish_task_switch + try_to_wake_up to sum per-(ustack,kstack) sleep duration. Estimated 30 min once probe is right. Ships: - bench/pause-window/probe-perf-flamegraph.sh - PROBE-multi-branch-anomaly.md \"Follow-up: perf flamegraph\" section Refs #146. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
de5efc4 to
07758d0
Compare
WaylandYang
added a commit
that referenced
this pull request
May 23, 2026
…151) Round 5 of #146. Two complementary findings: 1. perf with explicit hybrid sampling (-e cpu_core/cycles/P -e cpu_atom/cycles/P) — fixes the Alder Lake E-core blindness from round 4. Got 13 FC samples in a 20s window with 4 slow BRANCHes, 46% of which are in ext4 kernel paths: io_schedule ← wbt_wait ← submit_bio ← ext4_bio_write_folio ← ext4_do_writepages down_write ← ext4_da_map_blocks ← ext4_da_write_begin ← vfs_write crc32c_x86_3way ← ext4_block_bitmap_csum_set ← ext4_mb_mark_diskspace_used ← ext4_mb_new_blocks = ext4 delayed allocation + writeback throttle (wbt) + multi-block allocator + block-bitmap checksumming. 2. tmpfs control — second daemon with --snapshot-root /dev/shm/... Ran 10 consecutive diff BRANCHes: BRANCH 1 (cold): pause_ms=728 BRANCH 2-10: pause_ms 111-259, flat No slow regime. Anomaly is 100% filesystem-layer. Synthesis of all 5 probe rounds: Pass 1 strace -c → user-space CPU (wrong) Pass 2 bpftrace → 94% off-CPU + futex (side obs) Pass 3 futex args → passive waiters (confirmed) Pass 4 perf -a -g → kernel sleep dominates (right direction, blind) Pass 5 perf hybrid + tmpfs → ext4 wbt + mballoc (ROOT CAUSE) Fix path is now clear and small: 1. fallocate memory.bin to full size before FC writes (~30 LOC Rust) 2. O_DIRECT in FC's snapshot writer (upstream FC PR) 3. io_uring (Phase 2 original case, now justified) 4. memfd-backed memory.bin (v0.4) Recommended first step: fallocate. If pause stays flat → ship as forkd v0.3.4 + close #146. Ships: - bench/pause-window/probe-perf-hybrid.sh - PROBE-multi-branch-anomaly.md "ROOT CAUSE FOUND" section - probe-offcpu.sh referenced (already shipped in #150) Refs #146. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
WaylandYang
added a commit
that referenced
this pull request
May 23, 2026
…balloc (#152) 5 rounds of probe (#128, #140, #143, #150, #151) traced the multi-BRANCH pause_ms anomaly to ext4 delayed allocation + writeback throttle + multi-block allocator + block-bitmap CRC compounding per BRANCH. tmpfs control confirmed: anomaly is 100% in the fs layer. Fix: posix_fallocate the destination memory.bin to source full size right after we create snap_dir, before either the diff-mode background copy OR FC's snapshot/create write. ext4 reserves the extents up-front; subsequent writes don't run mballoc or update block bitmap. Best-effort — on tmpfs / unsupported FS the syscall returns ENOSYS, we log at WARN and continue (no behavior change). On ext4 (the actual problem case), this should flatten pause_ms across the BRANCH chain. Adds libc 0.2 as a cfg(unix) dep for posix_fallocate. Refs #146. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
9 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fourth pass on #146. Built FC v1.12.0 from source with DWARF + frame pointers, swapped into /usr/local/bin/firecracker. Ran `perf record -F 99 -a -g --call-graph fp` during a slow BRANCH window (3.8 s of pause inside 10 s capture).
Result flips the interpretation again — and the new picture is the simplest yet.
What perf saw
```
Samples by process (top):
1176 swapper ← CPU idle (47 % of samples)
97 plymouthd
78 bash
73 sshd
...
1 firecracker ← !!!
```
Only 1 sample landed in firecracker in the entire 3.8 s of pause window. That sample was in `vmm::vstate::vm::ArchVm::snapshot_memory_to_file+0x871` (called from `vmm::persist::create_snapshot+0x2c5`). ~10 ms of on-CPU time can't explain 3.8 s of pause.
The pause time isn't burned in user-space CPU. It's burned in in-kernel sleep with FC parked. On-CPU flamegraphing is the wrong tool here.
Synthesis across four probes
Best remaining hypotheses for the kernel-sleep duration growth
KVM_GET_DIRTY_LOGkernel walk grows with VM uptime — KVM's per-memslot dirty bitmap clears each call, but kernel-side scan may walk structures that grow as guest pages ever-get-touched.Next concrete step
Off-CPU flamegraph via bpftrace `finish_task_switch` + `try_to_wake_up` paired probe. Sums per-(user-stack, kernel-stack) sleep duration; the hottest pair tells us exactly which kernel function FC is parked on. Estimated 30 min.
Setup notes for reproducing
Files
Refs #146.
🤖 Generated with Claude Code