feat: --unwind auto lazy CFI compile #11
Merged
Without this, lazy mode never fires miss events: cfi_classification isn't populated for enrolled-but-uncompiled binaries, so classify_rel_pc returns the default MODE_FP_SAFE, the walker takes the FP path, and the existing FP_LESS+miss emit (Task 1) is unreachable. Add a check in walk_step before classify_rel_pc: probe cfi_classification_lengths[table_id]. If missing, emit miss and continue with FP path. Eager mode is unchanged (length always populated).
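The ordering described in this commit message can be modelled in userspace Go. This is only a sketch of the decision logic, not the BPF code: the `walker` type, its fields, and the string return values are hypothetical stand-ins for the BPF maps and unwind paths, and the mode constants only borrow their names from the message.

```go
package main

import "fmt"

// Unwind modes. Names mirror the commit message; values are illustrative.
type mode int

const (
	modeFPSafe mode = iota // default when no classification exists
	modeFPLess
)

// walker is a hypothetical userspace stand-in for the BPF-side state.
type walker struct {
	classificationLengths map[uint32]uint32 // models cfi_classification_lengths
	classify              func(tableID uint32, relPC uint64) mode
	misses                []uint32 // models miss events sent to userspace
}

// step probes the lengths map before classifying. A missing entry means the
// binary is enrolled but not compiled: record a miss and take the FP path.
func (w *walker) step(tableID uint32, relPC uint64) string {
	if _, ok := w.classificationLengths[tableID]; !ok {
		w.misses = append(w.misses, tableID)
		return "fp"
	}
	if w.classify(tableID, relPC) == modeFPLess {
		return "cfi"
	}
	return "fp"
}

func main() {
	w := &walker{
		classificationLengths: map[uint32]uint32{1: 128}, // table 1 is compiled
		classify:              func(uint32, uint64) mode { return modeFPLess },
	}
	fmt.Println(w.step(1, 0x40), w.step(2, 0x40), w.misses)
}
```

In eager mode every table has a length entry, so the probe always succeeds and the model takes the same branches as before.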
Two final-review followups:

1. LastLatencyNs was subtracting bpf_ktime_get_ns() (CLOCK_MONOTONIC since boot) from time.Now().UnixNano() (wallclock since 1970), producing garbage. Fix: read CLOCK_MONOTONIC on the userspace side via unix.ClockGettime to match the BPF clock. Guard against negative deltas.
2. offcpu_dwarf.bpf.c inherits cfi_miss_events + ratelimit maps via unwind_common.h but never drains them (off-CPU is always eager in v1 per the A2 spec's non-goals). Add a comment documenting that the bounded ~96 KB cost is intentional and a v2 may extend the drainer.
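The negative-delta guard from the first followup can be sketched as a small pure function. The name `latencyNs` and its parameters are hypothetical. In the real code the "now" value would come from unix.ClockGettime(unix.CLOCK_MONOTONIC, ...) as the message says; the sketch takes it as an argument so it stays standard-library only.

```go
package main

import "fmt"

// latencyNs returns nowMonoNs - eventMonoNs, clamped at zero.
// Both timestamps must come from the same clock (CLOCK_MONOTONIC here);
// mixing in a wallclock value is the bug this commit fixes.
func latencyNs(nowMonoNs, eventMonoNs uint64) uint64 {
	if nowMonoNs < eventMonoNs {
		// Unsigned subtraction would wrap to a huge value, so report 0.
		return 0
	}
	return nowMonoNs - eventMonoNs
}

func main() {
	fmt.Println(latencyNs(1_500, 1_000)) // event is 500 ns old
	fmt.Println(latencyNs(1_000, 1_500)) // event appears to be in the future
}
```

Clamping to zero rather than returning an error keeps the counter usable as a plain gauge, at the cost of hiding the occasional out-of-order reading.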
dpsoft added a commit that referenced this pull request on May 3, 2026
The committed architecture.excalidraw was the original sketch from the FP-only era: it still showed only profile/, offcpu/, cpu/, perf.bpf.c, offcpu.bpf.c, and cpu.bpf.c. Since then the codebase has grown:

- unwind/dwarfagent/ (DWARF hybrid CPU + off-CPU walker, PRs #1–#11)
- unwind/{ehcompile,ehmaps,procmap} (CFI compile, per-PID lifecycle, /proc resolver; PRs #3–#9)
- internal/perfevent (per-CPU perf_event_open + AttachRawLink shared helper extracted from profile/dwarfagent; PR #13)
- inject/{python,ptraceop,elfsym} (Python perf-trampoline activator via ptrace; PR #12)
- internal/{nspid,k8slabels} (namespace-aware --pid + k8s pprof labels; PR #14)
- bpf/perf_dwarf.bpf.c, bpf/offcpu_dwarf.bpf.c (DWARF kernel-side)

The new file groups them into pre-profile setup (optional, dashed: nspid, k8slabels, inject), profilers (FP, DWARF, PMU), helpers (perfevent, ehcompile, ehmaps, procmap, pprof), and the four BPF programs in the kernel band, with arrows showing data + control flow. Output lands in *-on-cpu.pb.gz / *-off-cpu.pb.gz / PMU stdout-or-file.

Layout was generated programmatically; open the file in https://excalidraw.com or the VS Code extension to fine-tune positions and colours.

The README's ASCII version of the same diagram is unchanged. Both exist intentionally so readers can grok the architecture without opening Excalidraw, and the .excalidraw is the editable source for future iterations.
Summary
Implements lazy CFI compile for --unwind auto -a. Replaces today's eager AttachAllProcesses (compiles every binary visible at startup, ~30 s on a typical desktop) with a lazy variant that populates pid_mappings only at startup and defers per-binary compile until a sample fires inside an unprepared binary.

Validated bench results on this hardware:

(Results table comparing --unwind dwarf and --unwind auto; cell values not preserved.)

5× faster from the second invocation onward, which is the operator-attaching-to-running-host case the doc was concerned about.
Architecture
- bpf/unwind_common.h: new cfi_miss_events ringbuf (64 KB) + cfi_miss_ratelimit LRU_HASH (4096 entries, 1-sec rate limit per (pid, table_id)). Walker probes cfi_classification_lengths[table_id] before classify_rel_pc; if missing, emits a miss event and falls through to FP path. The existing FP_LESS + cfi_lookup miss path also emits as a fallback for the rare case of compiled-but-incomplete CFI.
- unwind/dwarfagent/miss_drainer.go: consumeCFIMisses goroutine reads ringbuf, dedupes per-(pid, table_id), poisons after 3 failures, calls tracker.AttachCompileOnly(pid, path) to compile on demand. MissStats exposes counters.
- unwind/ehmaps/tracker.go: new EnrollWithoutCompile (populates pid_mappings, no refcount) + AttachCompileOnly (compiles + populates cfi_*, takes refcount, no pid_mappings write).
- ScanAndEnroll (unwind/ehmaps/scan_enroll.go): walks /proc/* and enrolls every binary with a shared build-id cache. Cached so 1,000 unique binaries get ~1,000 build-id reads, not ~270,000.
- unwind/dwarfagent/agent.go: new Mode enum + NewProfilerWithMode. Per-PID forces ModeEager (compile cost is 1.4% there, lazy buys nothing). Existing NewProfiler / NewProfilerWithHooks delegate with ModeEager, so they stay backward compatible.
- --unwind dispatch (perfagent/agent.go): auto → ModeLazy; dwarf → ModeEager (escape hatch). Off-CPU stays eager (v1 non-goal per spec).
- bench/cmd/scenario/main.go: --unwind {auto|dwarf} flag for before/after diffs. schema.Config.UnwindMode records the chosen mode.

What's preserved
- --unwind dwarf retains today's eager behavior. Same code path, plus one map lookup per frame in walk_step (~tens of ns).
- --unwind fp unchanged.
- --pid N unchanged regardless of --unwind value (compile cost is already negligible).

What was caught and fixed during execution
- wg.Add(1) + go func() pattern (commit 5034cfe0): converted to Go 1.25's wg.Go(...).
- MissStats.Received == 0: exposed a real spec bug. Original spec said the BPF emit point was "FP_LESS + cfi_lookup miss"; in lazy mode cfi_classification isn't populated, so classify_rel_pc returned MODE_FP_SAFE and the FP_LESS branch was unreachable. Added a cfi_classification_lengths probe before classify_rel_pc (commit 8b9dba69). Test passed on second run with Received=32, Resolved=32.
- LastLatencyNs clock mismatch (time.Now() vs bpf_ktime_get_ns()); fixed by using unix.ClockGettime(unix.CLOCK_MONOTONIC, ...) (commit 2f13d419).
- […] NewOffCPUProfilerWithMode, would have nil-panicked): reverted (commit 3e9af28e).

Test plan
- make test-unit passes (7 packages, 0 failures).
- BenchmarkScanAndEnroll_BuildIDCacheHit reports buildid_reads/op = 5.000 for 100 PIDs × 5 binaries, which shows the cache caps reads at K, not N×K.
- TestLazyMode_FiresAndCompilesOnMiss (caps-gated) PASS: Received=32, Resolved=32, PoisonedKeys=0 after 5-second window.
- make bench-scenarios system-wide-mixed --processes 30 --runs 5 for both --unwind dwarf and --unwind auto. Diff shows a 76.5% reduction in p50.

Companion docs
- docs/superpowers/specs/2026-04-27-unwind-auto-lazy-a2-design.md (with the validated-results section).
- docs/superpowers/plans/2026-04-27-unwind-auto-lazy-a2.md.
- docs/unwind-auto-refinement-design.md updated to mark A2 as implemented.

Follow-ups (deliberately deferred)
- MissStats.LastLatencyNs exposure to a histogram / external metric.