
feat: --unwind auto lazy CFI compile #11

Merged: dpsoft merged 20 commits into main from feat/unwind-auto-lazy-a2 on Apr 28, 2026

@dpsoft dpsoft commented Apr 28, 2026

Summary

Implements lazy CFI compile for --unwind auto -a. Replaces today's eager AttachAllProcesses, which compiles every binary visible at startup (~30 s on a typical desktop), with a lazy variant that populates only pid_mappings at startup and defers each per-binary compile until a sample fires inside an unprepared binary.

Validated bench results on this hardware:

| Metric | Eager (--unwind dwarf) | Lazy (--unwind auto) | Δ% |
|---|---|---|---|
| p50 wall time | 34,618 ms | 8,141 ms | −76.5% |
| p95 / max | 69,683 ms | 40,982 ms | −41.2% |
| Warm-state runs (2-5) | 34-37 s | 6-8 s | −78% to −83% |

Roughly 5× faster from the second invocation onward, which is the operator-attaching-to-a-running-host case the doc was concerned about.
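For anyone re-deriving the Δ% column, the arithmetic is just relative change against the eager baseline. A minimal sketch follows; pctChange is a hypothetical helper written for this note, not part of the bench harness, and the inputs are the millisecond values from the table above.

```go
package main

import "fmt"

// pctChange returns the relative change from before to after, in percent.
// A negative result means "after" is smaller, i.e. the lazy run was faster.
func pctChange(before, after float64) float64 {
	return (after - before) / before * 100
}

func main() {
	// Wall times in milliseconds, copied from the bench table.
	fmt.Printf("p50 wall time: %.1f%%\n", pctChange(34618, 8141))
	fmt.Printf("p95 / max:     %.1f%%\n", pctChange(69683, 40982))
}
```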

Architecture

  • BPF (bpf/unwind_common.h): new cfi_miss_events ringbuf (64 KB) + cfi_miss_ratelimit LRU_HASH (4096 entries, 1-sec rate limit per (pid, table_id)). Walker probes cfi_classification_lengths[table_id] before classify_rel_pc; if missing, emits a miss event and falls through to FP path. The existing FP_LESS+cfi_lookup miss path also emits as a fallback for the rare case of compiled-but-incomplete CFI.
  • Userspace drainer (unwind/dwarfagent/miss_drainer.go): consumeCFIMisses goroutine reads ringbuf, dedupes per-(pid, table_id), poisons after 3 failures, calls tracker.AttachCompileOnly(pid, path) to compile on demand. MissStats exposes counters.
  • Tracker (unwind/ehmaps/tracker.go): new EnrollWithoutCompile (populates pid_mappings, no refcount) + AttachCompileOnly (compiles + populates cfi_*, takes refcount, no pid_mappings write).
  • ScanAndEnroll (unwind/ehmaps/scan_enroll.go): walks /proc/* and enrolls every binary with a shared build-id cache. Cached so 1,000 unique binaries get ~1,000 build-id reads, not ~270,000.
  • Mode dispatch (unwind/dwarfagent/agent.go): new Mode enum + NewProfilerWithMode. Per-PID forces ModeEager (compile cost is 1.4% there, lazy buys nothing). Existing NewProfiler/NewProfilerWithHooks delegate with ModeEager — backward compatible.
  • --unwind dispatch (perfagent/agent.go): auto → ModeLazy; dwarf → ModeEager (escape hatch). Off-CPU stays eager (v1 non-goal per spec).
  • Bench harness (bench/cmd/scenario/main.go): --unwind {auto|dwarf} flag for before/after diffs. schema.Config.UnwindMode records the chosen mode.
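
The drainer's per-key bookkeeping (dedupe per (pid, table_id), poison after 3 failures) can be sketched roughly as below. This is an illustration under assumptions, not the code in miss_drainer.go: missKey, missState, handle, poisoned, and errCompile are hypothetical names, the uint64 type for table_id is a guess, and the compile callback only stands in for tracker.AttachCompileOnly.

```go
package main

import (
	"errors"
	"fmt"
)

// errCompile is a placeholder error for a failed on-demand compile.
var errCompile = errors.New("compile failed")

// missKey mirrors the (pid, table_id) dedupe key; field types are assumptions.
type missKey struct {
	PID     uint32
	TableID uint64
}

const maxFailures = 3 // poison threshold named in the description

// missState is a hypothetical stand-in for the drainer's bookkeeping.
type missState struct {
	resolved map[missKey]bool
	failures map[missKey]int
	compile  func(pid uint32, path string) error // stands in for tracker.AttachCompileOnly
}

func newMissState(compile func(uint32, string) error) *missState {
	return &missState{
		resolved: make(map[missKey]bool),
		failures: make(map[missKey]int),
		compile:  compile,
	}
}

// poisoned reports whether a key has hit the failure threshold.
func (s *missState) poisoned(k missKey) bool { return s.failures[k] >= maxFailures }

// handle processes one miss event and reports whether a compile was attempted.
func (s *missState) handle(k missKey, path string) bool {
	if s.resolved[k] || s.poisoned(k) {
		return false // duplicate of a compiled key, or a poisoned key
	}
	if err := s.compile(k.PID, path); err != nil {
		s.failures[k]++
		return true
	}
	s.resolved[k] = true
	return true
}

func main() {
	calls := 0
	s := newMissState(func(pid uint32, path string) error {
		calls++
		if path == "/bad" {
			return errCompile
		}
		return nil
	})
	good := missKey{PID: 100, TableID: 1}
	bad := missKey{PID: 100, TableID: 2}
	for i := 0; i < 5; i++ {
		s.handle(good, "/good")
		s.handle(bad, "/bad")
	}
	// One call for the good key plus three for the bad key before it is poisoned.
	fmt.Println("compile calls:", calls, "bad key poisoned:", s.poisoned(bad))
}
```

The point of the sketch is that repeated misses for the same key cost one compile at most, and a binary that cannot be compiled stops being retried after the third failure.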

What's preserved

  • --unwind dwarf retains today's eager behavior. Same code path, plus one map lookup per frame in walk_step (~tens of ns).
  • --unwind fp unchanged.
  • --pid N unchanged regardless of --unwind value (compile cost is already negligible).
  • All existing unit + integration tests pass; off-CPU profiler unchanged.
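
Taken together, the dispatch rules above amount to a small decision table. The sketch below restates it; resolveMode is a hypothetical function invented for this note (the real dispatch lives in perfagent/agent.go and may be shaped differently), and the numeric values of the Mode constants are an assumption.

```go
package main

import "fmt"

// Mode mirrors the Mode enum named in the description; the values are assumed.
type Mode int

const (
	ModeEager Mode = iota
	ModeLazy
)

// resolveMode restates the rules: only system-wide (-a) runs with
// --unwind auto go lazy; --unwind dwarf and per-PID runs stay eager.
// --unwind fp is rejected here because it does not compile CFI at all.
func resolveMode(unwind string, systemWide bool) (Mode, error) {
	switch unwind {
	case "auto":
		if systemWide {
			return ModeLazy, nil
		}
		return ModeEager, nil // --pid N forces eager
	case "dwarf":
		return ModeEager, nil // escape hatch
	default:
		return ModeEager, fmt.Errorf("--unwind %s has no CFI compile mode", unwind)
	}
}

func main() {
	for _, c := range []struct {
		unwind     string
		systemWide bool
	}{{"auto", true}, {"auto", false}, {"dwarf", true}} {
		m, err := resolveMode(c.unwind, c.systemWide)
		fmt.Printf("unwind=%s systemWide=%v lazy=%v err=%v\n", c.unwind, c.systemWide, m == ModeLazy, err)
	}
}
```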

What was caught and fixed during execution

  • Spec reviewer caught a legacy wg.Add(1) + go func() pattern (commit 5034cfe0) — converted to Go 1.25's wg.Go(...).
  • Integration test failed on first run with MissStats.Received == 0 — exposed a real spec bug. Original spec said the BPF emit point was "FP_LESS + cfi_lookup miss"; in lazy mode cfi_classification isn't populated so classify_rel_pc returned MODE_FP_SAFE and the FP_LESS branch was unreachable. Added a cfi_classification_lengths probe before classify_rel_pc (commit 8b9dba69). Test passed on second run with Received=32, Resolved=32.
  • Code reviewer caught a LastLatencyNs clock mismatch (time.Now() vs bpf_ktime_get_ns()); fixed by using unix.ClockGettime(unix.CLOCK_MONOTONIC, ...) (commit 2f13d419).
  • Off-CPU scope creep in Task 9 (added NewOffCPUProfilerWithMode, would have nil-panicked) — reverted (commit 3e9af28e).

Test plan

  • make test-unit passes (7 packages, 0 failures).
  • BenchmarkScanAndEnroll_BuildIDCacheHit reports buildid_reads/op = 5.000 for 100 PIDs × 5 binaries — proves the cache caps reads at K, not N×K.
  • TestLazyMode_FiresAndCompilesOnMiss (caps-gated) PASS: Received=32, Resolved=32, PoisonedKeys=0 after 5-second window.
  • make bench-scenarios system-wide-mixed --processes 30 --runs 5 for both --unwind dwarf and --unwind auto. The diff shows a 76.5% reduction in p50 wall time.
  • Soak test on a high-fork-rate workload to confirm rate-limit + LRU eviction behaves under load (follow-up if needed).
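
The K-versus-N×K claim behind BenchmarkScanAndEnroll_BuildIDCacheHit can be illustrated with a toy memoizing cache. This is a sketch under assumptions, not the code in scan_enroll.go: buildIDCache, readBuildID, and scanReads are hypothetical names, the cache is keyed by path here (the real key may differ), and readBuildID fakes the expensive ELF read.

```go
package main

import "fmt"

// buildIDCache memoizes build-id lookups and counts the expensive reads.
type buildIDCache struct {
	ids   map[string]string
	reads int
}

// readBuildID is a placeholder for opening the binary and parsing its build-id.
func readBuildID(path string) string { return "id:" + path }

// get returns the cached build-id, reading it only on the first request.
func (c *buildIDCache) get(path string) string {
	if id, ok := c.ids[path]; ok {
		return id
	}
	c.reads++
	id := readBuildID(path)
	c.ids[path] = id
	return id
}

// scanReads simulates a /proc walk in which every PID maps the same
// binaries, and returns how many build-id reads the shared cache performed.
func scanReads(pids int, binaries []string) int {
	c := &buildIDCache{ids: make(map[string]string)}
	for pid := 0; pid < pids; pid++ {
		for _, b := range binaries {
			_ = c.get(b)
		}
	}
	return c.reads
}

func main() {
	bins := []string{"/usr/bin/a", "/usr/bin/b", "/lib/c.so", "/lib/d.so", "/lib/e.so"}
	// 100 PIDs × 5 binaries: one read per unique binary, not one per mapping.
	fmt.Println("build-id reads:", scanReads(100, bins))
}
```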

Companion docs

  • Spec: docs/superpowers/specs/2026-04-27-unwind-auto-lazy-a2-design.md (with the validated-results section).
  • Plan: docs/superpowers/plans/2026-04-27-unwind-auto-lazy-a2.md.
  • docs/unwind-auto-refinement-design.md updated to mark A2 as implemented.

Follow-ups (deliberately deferred)

  • Off-CPU lazy mode (v1 non-goal — small ringbuf cost on off-CPU sessions documented inline).
  • Userspace-side compile parallelism (current drainer is serial; bench shows it's fast enough).
  • MissStats.LastLatencyNs exposure to a histogram / external metric.

dpsoft added 20 commits April 28, 2026 17:03
Without this, lazy mode never fires miss events: cfi_classification
isn't populated for enrolled-but-uncompiled binaries, so classify_rel_pc
returns the default MODE_FP_SAFE, the walker takes the FP path, and the
existing FP_LESS+miss emit (Task 1) is unreachable.

Add a check in walk_step before classify_rel_pc: probe
cfi_classification_lengths[table_id]. If missing, emit miss and continue
with FP path. Eager mode is unchanged (length always populated).

Two final-review followups:

1. LastLatencyNs was subtracting bpf_ktime_get_ns() (CLOCK_MONOTONIC since
   boot) from time.Now().UnixNano() (wallclock since 1970), producing
   garbage. Fix: read CLOCK_MONOTONIC on the userspace side via
   unix.ClockGettime to match the BPF clock. Guard against negative deltas.

2. offcpu_dwarf.bpf.c inherits cfi_miss_events + ratelimit maps via
   unwind_common.h but never drains them (off-CPU is always eager in v1
   per the A2 spec's non-goals). Add a comment documenting that the
   bounded ~96 KB cost is intentional and a v2 may extend the drainer.
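
The LastLatencyNs fix in item 1 reduces to subtracting two readings of the same clock and guarding the result. A minimal sketch, assuming the caller has already read CLOCK_MONOTONIC in userspace (for example via unix.ClockGettime, as the commit describes); latencyNs is a hypothetical helper, and clamping to zero is only one possible reading of "guard against negative deltas".

```go
package main

import "fmt"

// latencyNs returns nowMonoNs - eventKtimeNs, clamped at zero.
// Both arguments must come from CLOCK_MONOTONIC: the event timestamp from
// bpf_ktime_get_ns() and the "now" value from a userspace monotonic read.
func latencyNs(nowMonoNs, eventKtimeNs uint64) uint64 {
	if nowMonoNs < eventKtimeNs {
		return 0 // guard: never report a wrapped-around "negative" latency
	}
	return nowMonoNs - eventKtimeNs
}

func main() {
	ktime := uint64(3_600_000_000_000)        // event stamped one hour after boot
	mono := ktime + 2_000_000                 // same clock, read 2 ms later
	wall := uint64(1_777_000_000_000_000_000) // a wallclock UnixNano-style value, for contrast

	fmt.Println("same clock:  ", latencyNs(mono, ktime), "ns")
	fmt.Println("mixed clocks:", wall-ktime, "ns (meaningless)")
}
```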
@dpsoft dpsoft force-pushed the feat/unwind-auto-lazy-a2 branch from d0c5aa3 to cd131cc Compare April 28, 2026 20:03
@dpsoft dpsoft changed the title feat: --unwind auto lazy CFI compile (Option A2) feat: --unwind auto lazy CFI compile Apr 28, 2026
@dpsoft dpsoft merged commit d87a1c7 into main Apr 28, 2026
10 checks passed
@dpsoft dpsoft deleted the feat/unwind-auto-lazy-a2 branch April 28, 2026 21:01
dpsoft added a commit that referenced this pull request May 3, 2026
The committed architecture.excalidraw was the original sketch from the
FP-only era — it still showed only profile/, offcpu/, cpu/, perf.bpf.c,
offcpu.bpf.c, and cpu.bpf.c. Since then the codebase has grown:

- unwind/dwarfagent/ (DWARF hybrid CPU + off-CPU walker, PRs #1–#11)
- unwind/{ehcompile,ehmaps,procmap} (CFI compile, per-PID lifecycle,
  /proc resolver; PRs #3–#9)
- internal/perfevent (per-CPU perf_event_open + AttachRawLink shared
  helper extracted from profile/dwarfagent — PR #13)
- inject/{python,ptraceop,elfsym} (Python perf-trampoline activator
  via ptrace — PR #12)
- internal/{nspid,k8slabels} (namespace-aware --pid + k8s pprof labels
  — PR #14)
- bpf/perf_dwarf.bpf.c, bpf/offcpu_dwarf.bpf.c (DWARF kernel-side)

The new file groups them into pre-profile setup (optional, dashed —
nspid, k8slabels, inject), profilers (FP, DWARF, PMU), helpers
(perfevent, ehcompile, ehmaps, procmap, pprof), and the four BPF
programs in the kernel band, with arrows showing data + control flow.
Output lands in *-on-cpu.pb.gz / *-off-cpu.pb.gz / PMU stdout-or-file.

Layout was generated programmatically; open the file in
https://excalidraw.com or the VS Code extension to fine-tune
positions and colours.

The README's ASCII version of the same diagram is unchanged — both
exist intentionally so readers can grok the architecture without
opening Excalidraw, and the .excalidraw is the editable source for
future iterations.