feat: --unwind auto lazy CFI compile by dpsoft · Pull Request #11 · dpsoft/perf-agent

dpsoft · 2026-04-28T15:26:37Z

Summary

Implements lazy CFI compile for --unwind auto -a. Replaces today's eager AttachAllProcesses (compiles every binary visible at startup, ~30 s on a typical desktop) with a lazy variant that populates only pid_mappings at startup and defers each per-binary compile until a sample fires inside an unprepared binary.

Validated bench results on this hardware:

Metric	Eager (`--unwind dwarf`)	Lazy (`--unwind auto`)	Δ%
p50 wall time	34,618 ms	8,141 ms	−76.5%
p95 / max	69,683 ms	40,982 ms	−41.2%
Warm-state runs (2-5)	34-37 s	6-8 s	−78% to −83%

Roughly 5× faster from the second invocation onward, which is the operator-attaching-to-running-host case the doc was concerned about.

Architecture

BPF (bpf/unwind_common.h): new cfi_miss_events ringbuf (64 KB) + cfi_miss_ratelimit LRU_HASH (4096 entries, 1-sec rate limit per (pid, table_id)). Walker probes cfi_classification_lengths[table_id] before classify_rel_pc; if missing, emits a miss event and falls through to FP path. The existing FP_LESS+cfi_lookup miss path also emits as a fallback for the rare case of compiled-but-incomplete CFI.
Userspace drainer (unwind/dwarfagent/miss_drainer.go): consumeCFIMisses goroutine reads ringbuf, dedupes per-(pid, table_id), poisons after 3 failures, calls tracker.AttachCompileOnly(pid, path) to compile on demand. MissStats exposes counters.
Tracker (unwind/ehmaps/tracker.go): new EnrollWithoutCompile (populates pid_mappings, no refcount) + AttachCompileOnly (compiles + populates cfi_*, takes refcount, no pid_mappings write).
ScanAndEnroll (unwind/ehmaps/scan_enroll.go): walks /proc/* and enrolls every binary with a shared build-id cache. Cached so 1,000 unique binaries get ~1,000 build-id reads, not ~270,000.
Mode dispatch (unwind/dwarfagent/agent.go): new Mode enum + NewProfilerWithMode. Per-PID forces ModeEager (compile cost is 1.4% there, lazy buys nothing). Existing NewProfiler/NewProfilerWithHooks delegate with ModeEager — backward compatible.
--unwind dispatch (perfagent/agent.go): auto → ModeLazy; dwarf → ModeEager (escape hatch). Off-CPU stays eager (v1 non-goal per spec).
Bench harness (bench/cmd/scenario/main.go): --unwind {auto|dwarf} flag for before/after diffs. schema.Config.UnwindMode records the chosen mode.

What's preserved

--unwind dwarf retains today's eager behavior. Same code path, plus one map lookup per frame in walk_step (~tens of ns).
--unwind fp unchanged.
--pid N unchanged regardless of --unwind value (compile cost is already negligible).
All existing unit + integration tests pass; off-CPU profiler unchanged.
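The dispatch rules listed above can be condensed into one decision function. The sketch below is illustrative only: Mode, ModeEager, and ModeLazy are names from this PR, but their integer representation, the selectMode helper, and its parameters are assumptions. It covers only the DWARF-capable values of --unwind (auto and dwarf); --unwind fp does not go through this path.

```go
package main

import "fmt"

// Mode mirrors the enum named in the PR; the underlying values are assumed.
type Mode int

const (
	ModeEager Mode = iota
	ModeLazy
)

// selectMode is a hypothetical helper. Per the PR, per-PID sessions and the
// off-CPU profiler always compile eagerly, "auto" selects lazy compile for
// system-wide CPU profiling, and "dwarf" remains the eager escape hatch.
func selectMode(unwind string, perPID, offCPU bool) Mode {
	if perPID || offCPU {
		return ModeEager
	}
	if unwind == "auto" {
		return ModeLazy
	}
	return ModeEager
}

func main() {
	fmt.Println(selectMode("auto", false, false) == ModeLazy)   // true
	fmt.Println(selectMode("dwarf", false, false) == ModeEager) // true
	fmt.Println(selectMode("auto", true, false) == ModeEager)   // true
	fmt.Println(selectMode("auto", false, true) == ModeEager)   // true
}
```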

What was caught and fixed during execution

Spec reviewer caught a legacy wg.Add(1) + go func() pattern (commit 5034cfe0) — converted to Go 1.25's wg.Go(...).
Integration test failed on first run with MissStats.Received == 0 — exposed a real spec bug. Original spec said the BPF emit point was "FP_LESS + cfi_lookup miss"; in lazy mode cfi_classification isn't populated so classify_rel_pc returned MODE_FP_SAFE and the FP_LESS branch was unreachable. Added a cfi_classification_lengths probe before classify_rel_pc (commit 8b9dba69). Test passed on second run with Received=32, Resolved=32.
Code reviewer caught a LastLatencyNs clock mismatch (time.Now() vs bpf_ktime_get_ns()); fixed by using unix.ClockGettime(unix.CLOCK_MONOTONIC, ...) (commit 2f13d419).
Off-CPU scope creep in Task 9 (added NewOffCPUProfilerWithMode, would have nil-panicked) — reverted (commit 3e9af28e).

Test plan

make test-unit passes (7 packages, 0 failures).
BenchmarkScanAndEnroll_BuildIDCacheHit reports buildid_reads/op = 5.000 for 100 PIDs × 5 binaries — proves the cache caps reads at K, not N×K.
TestLazyMode_FiresAndCompilesOnMiss (caps-gated) PASS: Received=32, Resolved=32, PoisonedKeys=0 after 5-second window.
make bench-scenarios system-wide-mixed --processes 30 --runs 5 for both --unwind dwarf and --unwind auto. Diff shows a 76.5% p50 reduction.
Soak test on a high-fork-rate workload to confirm rate-limit + LRU eviction behaves under load (follow-up if needed).

Companion docs

Spec: docs/superpowers/specs/2026-04-27-unwind-auto-lazy-a2-design.md (with the validated-results section).
Plan: docs/superpowers/plans/2026-04-27-unwind-auto-lazy-a2.md.
docs/unwind-auto-refinement-design.md updated to mark A2 as implemented.

Follow-ups (deliberately deferred)

Off-CPU lazy mode (v1 non-goal — small ringbuf cost on off-CPU sessions documented inline).
Userspace-side compile parallelism (current drainer is serial; bench shows it's fast enough).
MissStats.LastLatencyNs exposure to a histogram / external metric.

Without this, lazy mode never fires miss events: cfi_classification isn't populated for enrolled-but-uncompiled binaries, so classify_rel_pc returns the default MODE_FP_SAFE, the walker takes the FP path, and the existing FP_LESS+miss emit (Task 1) is unreachable. Add a check in walk_step before classify_rel_pc: probe cfi_classification_lengths[table_id]. If missing, emit miss and continue with FP path. Eager mode is unchanged (length always populated).
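The control flow this fix introduces can be modelled in userspace Go. The real logic is BPF C in walk_step, so everything here is a simplified stand-in: Go maps replace the BPF maps, the fpLess set collapses classify_rel_pc into a per-table flag, and the walker, step, and unwindPath names are hypothetical.

```go
package main

import "fmt"

type unwindPath int

const (
	pathFP  unwindPath = iota // frame-pointer unwind
	pathCFI                   // CFI-based unwind
)

// walker models the state consulted by walk_step.
type walker struct {
	classificationLengths map[uint64]uint32 // models cfi_classification_lengths
	fpLess                map[uint64]bool   // simplification: tables whose PCs classify as FP_LESS
}

// step returns the unwind path taken and whether a miss event is emitted.
func (w *walker) step(tableID uint64) (unwindPath, bool) {
	// The added probe: no length entry means the binary is enrolled but
	// not yet compiled, so emit a miss and fall through to the FP path.
	if _, ok := w.classificationLengths[tableID]; !ok {
		return pathFP, true
	}
	// Compiled: classification decides between CFI and FP, with no miss.
	if w.fpLess[tableID] {
		return pathCFI, false
	}
	return pathFP, false
}

func main() {
	w := &walker{
		classificationLengths: make(map[uint64]uint32),
		fpLess:                make(map[uint64]bool),
	}
	p, miss := w.step(1)
	fmt.Println(p == pathFP, miss) // true true

	// Simulate the drainer compiling table 1 after the miss.
	w.classificationLengths[1] = 128
	w.fpLess[1] = true
	p, miss = w.step(1)
	fmt.Println(p == pathCFI, miss) // true false
}
```

In eager mode the length entry is always present, so the first branch never fires and behaviour is unchanged apart from the extra lookup.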

Two final-review followups:
1. LastLatencyNs was subtracting bpf_ktime_get_ns() (CLOCK_MONOTONIC since boot) from time.Now().UnixNano() (wallclock since 1970), producing garbage. Fix: read CLOCK_MONOTONIC on the userspace side via unix.ClockGettime to match the BPF clock. Guard against negative deltas.
2. offcpu_dwarf.bpf.c inherits cfi_miss_events + ratelimit maps via unwind_common.h but never drains them (off-CPU is always eager in v1 per the A2 spec's non-goals). Add a comment documenting that the bounded ~96 KB cost is intentional and a v2 may extend the drainer.

The committed architecture.excalidraw was the original sketch from the FP-only era: it still showed only profile/, offcpu/, cpu/, perf.bpf.c, offcpu.bpf.c, and cpu.bpf.c. Since then the codebase has grown:

- unwind/dwarfagent/ (DWARF hybrid CPU + off-CPU walker, PRs #1–#11)
- unwind/{ehcompile,ehmaps,procmap} (CFI compile, per-PID lifecycle, /proc resolver; PRs #3–#9)
- internal/perfevent (per-CPU perf_event_open + AttachRawLink shared helper extracted from profile/dwarfagent; PR #13)
- inject/{python,ptraceop,elfsym} (Python perf-trampoline activator via ptrace; PR #12)
- internal/{nspid,k8slabels} (namespace-aware --pid + k8s pprof labels; PR #14)
- bpf/perf_dwarf.bpf.c, bpf/offcpu_dwarf.bpf.c (DWARF kernel-side)

The new file groups them into pre-profile setup (optional, dashed: nspid, k8slabels, inject), profilers (FP, DWARF, PMU), helpers (perfevent, ehcompile, ehmaps, procmap, pprof), and the four BPF programs in the kernel band, with arrows showing data + control flow. Output lands in *-on-cpu.pb.gz / *-off-cpu.pb.gz / PMU stdout-or-file.

Layout was generated programmatically; open the file in https://excalidraw.com or the VS Code extension to fine-tune positions and colours.

The README's ASCII version of the same diagram is unchanged. Both exist intentionally so readers can grok the architecture without opening Excalidraw, and the .excalidraw is the editable source for future iterations.

dpsoft added 20 commits April 28, 2026 17:03

docs: A2 lazy CFI design spec for --unwind auto

545de92

docs: add implementation plan for A2 lazy CFI

484ddcc

bpf: add cfi_miss_events ringbuf + rate-limit map for lazy CFI (A2)

239fb1b

profile: expose CFIMissRingbuf + CFIMissRatelimitMap accessors

bbf604a

ehmaps: add EnrollWithoutCompile + AttachCompileOnly for lazy CFI

8a381b8

ehmaps: add ScanAndEnroll + ScanAndEnrollFromTree for lazy startup

315c74d

ehmaps: add ScanAndEnroll unit tests + build-id cache microbenchmark

105d747

dwarfagent: add cfi miss event parser, MissStats, sentinel errors

5af15cb

dwarfagent: add CFI miss drainer goroutine + session lazy-mode fields

5000278

dwarfagent: add Mode + NewProfilerWithMode + lazy branch in newSession

87b2131

dwarfagent: use wg.Go for drainer spawn (modern Go 1.25+ idiom)

0ec952e

perfagent: --unwind auto dispatches to dwarfagent.ModeLazy (A2)

09a24b3

perfagent: keep off-CPU on eager path; A2 lazy is CPU-only in v1

137582a

bench/scenario: add --unwind flag and record mode in schema

e63353c

dwarfagent: add caps-gated TestLazyMode_FiresAndCompilesOnMiss

27ec1f6

docs: validated A2 lazy CFI bench results (76.5% p50 reduction)

a50f757

docs: mark A2 as implemented in unwind-auto-refinement-design

85be39f

ehmaps: fix staticcheck QF1012 (fmt.Fprintf over WriteString+Sprintf)

cd131cc

dpsoft force-pushed the feat/unwind-auto-lazy-a2 branch from d0c5aa3 to cd131cc Compare April 28, 2026 20:03

dpsoft changed the title ~~feat: --unwind auto lazy CFI compile (Option A2)~~ feat: --unwind auto lazy CFI compile Apr 28, 2026

dpsoft merged commit d87a1c7 into main Apr 28, 2026
10 checks passed

dpsoft deleted the feat/unwind-auto-lazy-a2 branch April 28, 2026 21:01

dpsoft merged 20 commits into main from feat/unwind-auto-lazy-a2
