DWARF stack unwinding + pprof address/mapping fidelity#7
Conversation
Initial spec assumed blazesym could unwind (regs, stack_bytes) -> PCs. Verified against blazesym main (2025) and upstream activity: the C API exposes only symbolize/normalize/inspect/kernel; the Rust crate has no unwind module. A prototype FP unwinder existed in the 2021 initial commit but was removed during the pivot to symbolization-only. Update Option B to document the real architecture: kernel captures (regs, stack_bytes) as before, but userspace needs a separate unwinder library. Three concrete paths documented — libunwind via CGO (recommended), delve dwarf/frame (pure Go), or shelling out to perf as a prototype harness. Scope estimate adjusted accordingly. No code changes on this branch yet; spec correction is the first commit.
- --unwind {fp,dwarf,auto} with auto as default.
- auto is a hybrid: consult pre-scanned .eh_frame classification per
PC range; take the kernel's cheap FP callchain when PC is in
FP-safe code; DWARF-unwind only when PC lands in FP-less code
(libstd, glibc, C++ release builds) or when FP callchain truncated
suspiciously.
- Pre-scan .eh_frame per ELF at mmap-change events; cache by build-id
so repeated loads don't re-parse. Sample-time decision is a range
lookup + length check.
Tests pass, binary builds clean, go1.26.0 toolchain resolved via GOTOOLCHAIN=auto. No code changes required for the upgrade.
Captures PERF_RECORD_SAMPLE with IP + TID + TIME + CALLCHAIN +
REGS_USER + STACK_USER for a target PID at a configurable frequency.
Ring buffer is mmapped and drained on demand via ReadNext().
Split into:
- unwind/perfreader/regs_{amd64,arm64}.go — per-arch register masks
and SP/BP/IP accessors using sparse-bit indexing into the dense
regs[] array the kernel emits.
- unwind/perfreader/reader.go — perf_event_open, mmap, ring parser.
- unwind/perfreader/sample.go — Sample record type.
- cmd/perfreader-test/main.go — diagnostic CLI that prints sample
summaries (IP, callchain, registers, first bytes of stack).
No unwinder yet; this is the kernel-side capture validated
end-to-end. Needs the host to have cap_sys_admin+cap_bpf+cap_perfmon
(set via setcap on the built binary).
Initial spike failed to read samples because perfMmapPage struct layout was wrong — my padding placed data_head/data_tail 4 bytes off from where the kernel actually writes them. The page struct has a long reserved area that's evolved across kernels and is risky to mirror in Go. Fix: access data_head (offset 1024) and data_tail (offset 1032) directly via raw pointer into the mmap — those offsets are kernel-ABI stable. Deleted the full perfMmapPage struct mirror entirely. Also in the test CLI: - Add --duration so a silent hang is bounded. - Periodic progress line to stderr (wakeups/records/samples). - Correct PERF_CONTEXT_USER sentinel (0xfffffffffffffe00 = -512, not the wrong -256 I had before). End-to-end validated against Go cpu_bound workload: samples arrive with correct IP + 17 regs + 8KB stack, and the stack bytes contain real return addresses (verified [rbp+8] matches callchain[2]).
unwind/fpwalker/ reconstructs a callchain by following the frame-pointer chain through captured PERF_SAMPLE_STACK_USER bytes. Arch-agnostic (x86_64 rbp and arm64 x29 share the same [savedBP, retAddr] layout at the frame pointer). 6 unit tests cover simple chains, OOB bp, null terminator, non-monotonic chains, short stacks, and cycle safety. Validation: the diagnostic CLI now runs the FP walker on each sample and compares its output to the kernel's PERF_SAMPLE_CALLCHAIN (minus PERF_CONTEXT_* sentinels). Tested against Go cpu_bound — 3/3 samples produce identical chains, confirming end-to-end that the captured (regs, stack_addr, stack_bytes) is fully usable for userspace unwinding. This is the cheap leg of the planned --unwind auto hybrid. The DWARF leg (libunwind CGO) is next.
Package unwind/dwarfunwind/ wraps libunwind for DWARF-based stack
unwinding. This commit lands the architecture without a working
find_proc_info — the CGO wiring, address-space setup, custom
accessors, ELF-mapping parser, and end-to-end Go→C→Go path are
verified via a smoke test.
Files:
- unwinder.h : C surface exposed to Go.
- unwinder.c : libunwind accessor callbacks and /proc/<pid>/maps
based ELF mmap table. Custom access_mem serves
captured stack, locally-mmap'd ELFs, and
/proc/<pid>/mem fallback. Custom access_reg
tracks live_regs updated by libunwind as it
restores prior frames.
- unwinder.go : Go facade (Unwinder, New, Close, Unwind).
- CGO flags : -lunwind -lunwind-x86_64 from libunwind-devel.
Known gap: find_proc_info is currently a UNW_ENOINFO stub. libunwind
will therefore fail to step beyond the first frame. Next commit adds
a real find_proc_info that locates PT_GNU_EH_FRAME in each ELF and
builds per-mapping unw_dyn_info_t for dwarf_search_unwind_table.
No integration with the rest of the repo yet; this is standalone.
…ding The original spec chose Option B — userspace libunwind via CGO. While bringing up find_proc_info we hit the libunwind REMOTE_TABLE format's requirement that table_data point at the FDE search table (not the .eh_frame_hdr header). That's solvable, but it surfaced how much of the unwinder complexity we'd be pushing into userspace: .eh_frame_hdr parsing, /proc/<pid>/maps tracking, ELF mmap management, register-restore state machine per sample — all fighting the CGO boundary and competing with the workload CPU. Option A moves the same work into the BPF program using CFI tables pre-compiled in userspace. Wins: lower per-sample cost (single-digit microseconds vs. tens), native off-CPU support (previously impossible in Option B because sched_switch doesn't generate PERF_SAMPLE_STACK_USER records), no CGO on the hot path, no libunwind dependency. Non-goal: matching parca-agent scope. We skip full DWARF expressions (0.05% of glibc FDEs measured on this host), JIT code tracking, and high-sample-rate always-on system-wide. We keep all current features — every --profile, --offcpu, --pmu, --pid, -a combination that works on main continues to work. Spec rewritten at docs/dwarf-unwinding-design.md. Staging: ehcompile/ → BPF FP+classify → BPF DWARF walker → ehmaps/ → dwarfagent/ → off-CPU port → system-wide. Each stage independently mergeable; --unwind fp (default) never regresses. unwind/dwarfunwind/ deleted — the libunwind CGO wrapper is now dead code.
Material fixes: - Resolve --unwind auto contradiction: auto routes to FP programs in S1-S5 and flips to DWARF programs in new stage S8. - Kernel floor simplified to Linux 6.0+ (all features mature well before 6.0; keeps the compat matrix simple). - Build-id hash widened from 32 bits to 64 (FNV-1a of build-id) to eliminate birthday-collision risk. - S6 (off-CPU port) now spells out the tracepoint-vs-perf_event capture differences: bpf_task_pt_regs + explicit bpf_probe_read_user instead of PERF_SAMPLE_STACK_USER. - counts map clarified: DWARF programs don't use one. Samples go out per-event via ringbuf; userspace aggregates after symbolization. - Per-stage exit criteria added — each stage now has a verifiable end state instead of just an effort estimate. Minor: - FALLBACK classification clarified: runtime-identical to FP_SAFE, different only for telemetry. - 'byte-for-byte identical' claim scoped to the BPF layer; the Go path gets one new switch arm. - BPF unit-testing strategy made explicit: Go mirror of the walker, fuzz-compare BPF vs Go outputs. - Ringbuf record changed from fixed-128-slot to variable length (header + n_pcs × u64), saves ~75% of typical bandwidth.
Spec updated: cfi_entry layout made arch-neutral — cfa_type uses SP/FP instead of cfa_reg; added ra_offset field; fp_type/fp_offset renamed from rbp_* for arch clarity. RA on arm64 isn't conventionally at [CFA-8] like x86_64, so tracking it explicitly is required. Plan saved to docs/superpowers/plans/2026-04-22-ehcompile.md — 19 tasks covering the ehcompile Go package: - Package skeleton + arch-neutral output types (CFIEntry, Classification). - ULEB128/SLEB128/DW_EH_PE_* primitives. - archInfo struct with x86_64 and arm64 register maps. - CIE/FDE parsing with augmentation handling (zR, zRS, zPR, etc.). - .eh_frame section walker with backward CIE pointer semantics. - CFI interpreter covering: advance_loc family, def_cfa family, offset/ restore/same/undefined/register, remember/restore_state, expression opcodes → FALLBACK, DW_CFA_GNU_args_size. - Compile() with ELF-machine-type dispatch. - Tests: x86_64 golden fixture (committed), arm64 cross-compiled fixture (optional, skips if aarch64-linux-gnu-gcc absent), arm64 synthetic bytecode coverage (host-independent), system glibc stress test, Go binary integration test, benchmark. Stage is independently mergeable — no BPF code, no integration with profile/ or offcpu/ packages, no new runtime dependencies beyond Go stdlib debug/elf.
Also fix sampleFDE to compute CIE_pointer dynamically from the actual ciePos/fdePos arguments instead of a stale hardcoded value.
…ndler Cross-compile hello.c to hello_arm64.o as a relocatable ELF — we can't link a full arm64 executable without libc/crt, but the object file still has .eh_frame and parses fine through debug/elf. Plan defect surfaced by real arm64 CFI: DW_CFA_AArch64_negate_ra_state (0x2d) wasn't in our opcode list. It's emitted in virtually every arm64 FDE to toggle a pointer-auth (PAC) state bit. No operand; no effect on our CFA/FP/RA tracking. Added as a no-op case. Also regenerated hello.golden after removing the unused stdio.h include from hello.c (was blocking the arm64 cross-compile).
System glibc: 13,769 CFI entries, 8 FALLBACK (0.058%) — matches the research estimate of 0.05% for DW_CFA_expression usage in real-world glibc. Test asserts the FALLBACK ratio stays under 2%. Go binaries: pure-Go executables have no .eh_frame (Go's runtime uses pclntab for stack unwinding instead). Test accepts ErrNoEHFrame as a valid outcome.
Benchmarks on AMD Ryzen 9 7940HS: - Glibc (13,769 entries): 4.4 ms / 5.1 MB / 24,727 allocs - Hello x86_64 (71 entries): 16.1 us / 15 KB / 169 allocs - Hello arm64 .o (6 entries): 11.0 us / 5.7 KB / 78 allocs Well under the plan's 50 ms/compile target for glibc. Allocation count is high (mostly slice growth during FDE accumulation) — a future pass could pre-size entries/classifications based on .eh_frame_hdr's fde_count field. Not needed for S1's success criteria. Docstring now summarizes the full CFI dialect we handle, arch support, and explicit out-of-scope items.
go.mod pins 1.26+; system Go is often 1.25 on current distros. Without GOTOOLCHAIN=auto the build fails outright; with it, Go fetches the required toolchain on demand. Users can still pin to local with GOTOOLCHAIN=local make build.
AttachAllProcesses/AttachAllMappings can fail on individual binaries (notably Go's static executables, which use .gopclntab + frame pointers instead of .eh_frame). Before this, NewProfiler refused to start at all when the first binary failed. The hybrid walker handles FP-safe code without CFI tables, so eager-compile is purely an optimization — log the error and continue rather than abort.
Static binaries (Go) are self-contained: no libc, no shared libs, so exactly one real (non-sentinel) mapping is correct. The original >=2 threshold assumed dynamically-linked targets. The point of the assertion is 'we're not falling back to the hardcoded placeholder' — that holds with one real mapping.
Captures architecture for full-stack GPU profiling beyond what eBPF alone can deliver: vendor-agnostic host layer + vendor-specific device layer. Draft only — not on the S9-narrow critical path; tracked here so the design isn't lost between sessions.
- Rewrite README architecture diagram to show DWARF/hybrid path, ehcompile/ehmaps/procmap pipeline, and the address-keyed pprof ProfileBuilder. Refresh flags (--unwind), build (Go 1.26+, GOTOOLCHAIN=auto), and testing (cap-aware skip gates). - Track .claude/settings.json (shared plugin config); keep settings.local.json (per-user state) ignored.
Idea capture only — not a design spec, not a plan. Documents two implementation paths (direct pprof->LLVM converter vs perf-agent emitting perf.data + reusing autofdo/llvm-profgen), the open questions worth resolving before designing, and what's deliberately out of scope. Tracked here so the thread isn't lost between sessions; revisit and brainstorm into a real spec before implementation.
Three CI failures motivated this: - Lint failed because golangci-lint built against Go 1.25 can't parse a go.mod that requires 1.26. - Unit Tests failed with 'compile: version go1.26.0 does not match go tool version go1.25.9' — toolchain mismatch when the Go binary is pinned to 1.25 but go.mod targets 1.26. - Integration Tests failed with 'symbol bpf_loop: unsatisfied program reference' — bpf_loop requires kernel >=5.17, ubuntu-22.04 ships 5.15. ubuntu-24.04 ships kernel 6.x. Updates tests.yml, ci.yml, release.yml. Build job + lint + unit + integration.
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a DWARF-capable hybrid stack unwinder and refactors pprof emission to preserve per-binary mapping identity and address-keyed locations (enabling higher-fidelity downstream consumers like sample-based PGO tooling).
Changes:
- Introduces DWARF/hybrid CPU + off-CPU profilers (BPF programs + Go wiring) with CFI table support and MMAP2-driven mapping lifecycle.
- Refactors pprof builder to accept rich
Frameobjects, intern realMappings (path/start/limit/off/build-id), and keyLocations by binary-relative address. - Expands integration/unit tests and workloads (Rust + dlopen probe) and updates CLI/config defaults around
--unwind.
Reviewed changes
Copilot reviewed 49 out of 111 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| unwind/dwarfagent/agent_test.go | Adds an end-to-end DWARF profiler test against the Rust workload. |
| unwind/dwarfagent/agent.go | Introduces DWARF-capable CPU profiler implementation and perf-event attachment. |
| test/workloads/rust/src/main.rs | Extends Rust workload with optional dlopen + probe function pointer path. |
| test/workloads/rust/probe/src/lib.rs | Adds a small cdylib probe function for dlopen/MMAP2 tracking tests. |
| test/workloads/rust/probe/Cargo.toml | Defines probe cdylib build configuration (debug symbols in release). |
| test/workloads/rust/Cargo.toml | Adds libc dependency to support dlopen/dlsym. |
| test/integration_test.go | Adds CAP_BPF gating, DWARF walker tests, and pprof fidelity assertions. |
| test/go.mod | Bumps Go version and promotes several deps to direct requirements for tests. |
| profile/profiler_test.go | Adds tests for address propagation into frames and resolver wiring. |
| profile/profiler.go | Switches to frame-based stacks, batches symbolization, and wires procmap resolver into pprof building. |
| profile/perf_dwarf_x86_bpfel.go | Adds generated bpf2go bindings for perf_dwarf (x86). |
| profile/perf_dwarf_test.go | Adds a verifier smoke test for loading perf_dwarf BPF. |
| profile/perf_dwarf_arm64_bpfel.go | Adds generated bpf2go bindings for perf_dwarf (arm64). |
| profile/offcpu_dwarf_x86_bpfel.go | Adds generated bpf2go bindings for offcpu_dwarf (x86). |
| profile/offcpu_dwarf_export.go | Exposes an API wrapper to load/use the offcpu_dwarf program and maps. |
| profile/offcpu_dwarf_arm64_bpfel.go | Adds generated bpf2go bindings for offcpu_dwarf (arm64). |
| profile/gen.go | Extends go:generate to build new DWARF BPF objects for amd64/arm64. |
| profile/dwarf_export.go | Exposes an API wrapper to load/use the perf_dwarf program and maps. |
| pprof/pprof_test.go | Updates tests to frame stacks; adds fidelity and perf-map decoder coverage. |
| pprof/pprof.go | Introduces Frame model, resolver-based mapping/location fidelity, and perf-map decoding. |
| perfagent/options.go | Adds config option + option helper for selecting unwind mode. |
| perfagent/agent.go | Dispatches to FP vs DWARF profilers based on --unwind, adds adapters. |
| offcpu/profiler.go | Refactors off-CPU symbolization to frame stacks and batched lookups. |
| main.go | Adds --unwind flag defaulting to auto. |
| internal/bpfstack/stack_test.go | Adds unit tests and benchmarks for parsing raw stackmap buffers. |
| internal/bpfstack/stack.go | Adds shared helper to decode stackmap IPs. |
| go.mod | Bumps module Go version to 1.26.0. |
| docs/unwind-auto-refinement-design.md | Adds a follow-up design doc for reducing auto mode startup cost. |
| docs/superpowers/specs/2026-04-25-s10-pgo-converter-idea.md | Adds idea note for future pprof→LLVM sample profile conversion. |
| docs/superpowers/specs/2026-04-25-gpu-profiling-design.md | Adds draft GPU profiling design spec (out of scope follow-up). |
| docs/superpowers/specs/2026-04-24-s9-pprof-fidelity-design.md | Adds design spec capturing the S9 fidelity work implemented here. |
| docs/dwarf-unwinding-design.md | Adds DWARF unwinding design spec (architecture and staging). |
| cmd/test_blazesym/main.go | Removes an old blazesym diagnostic binary. |
| bpf/unwind_common.h | Adds shared BPF types/maps/helpers for hybrid FP+DWARF unwinding. |
| bpf/perf_dwarf.bpf.c | Adds DWARF-capable CPU sampling BPF program emitting ringbuf records. |
| bpf/offcpu_dwarf.bpf.c | Adds DWARF-capable off-CPU BPF program for sched_switch-based sampling. |
| README.md | Updates docs to reflect DWARF/hybrid unwinding and pprof fidelity changes. |
| Makefile | Sets GOTOOLCHAIN=auto and extends workload/test targets (Rust probe + unwind tests). |
| .claude/settings.json | Adds editor plugin config (non-runtime). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ips := bpfstack.ExtractIPs(stack) | ||
| if len(ips) > 0 { | ||
| symbols, err := pr.symbolizer.SymbolizeProcessAbsAddrs( | ||
| ips, | ||
| samplePid, | ||
| blazesym.ProcessSourceWithPerfMap(true), | ||
| blazesym.ProcessSourceWithDebugSyms(true), | ||
| ) | ||
| if err != nil { | ||
| log.Printf("Failed to symbolize: %v", err) | ||
| break | ||
| } else { | ||
| for _, s := range symbols { | ||
| for _, f := range blazeSymToFrames(s) { | ||
| sb.append(f) | ||
| } | ||
| } | ||
| } | ||
|
|
||
| sb.append(symbol[0].Name) | ||
| } |
There was a problem hiding this comment.
Off-CPU symbolization drops the per-IP address, so pprof.Frame.Address remains 0 for all off-CPU frames. With the new pprof builder, Address=0 forces fallback location-keying (and prevents mapping/file-offset fidelity) even when ips are available. Fix by iterating symbols with an index and threading ips[i] into the frame conversion (mirroring profile/profiler.go’s blazeSymToFrames(s, ips[i]) approach).
| // TestProfilerEndToEnd runs the full dwarfagent stack against the | ||
| // rust-workload and asserts that the resulting pprof contains at | ||
| // least one sample naming cpu_intensive_work. | ||
| func TestProfilerEndToEnd(t *testing.T) { | ||
| if os.Getuid() != 0 { | ||
| caps := cap.GetProc() | ||
| have, _ := caps.GetFlag(cap.Permitted, cap.BPF) | ||
| if !have { | ||
| t.Skip("requires root or CAP_BPF") | ||
| } |
There was a problem hiding this comment.
This test only gates on CAP_BPF, but loading/attaching the DWARF profilers also requires additional capabilities (e.g., PERFMON/SYS_PTRACE/CHECKPOINT_RESTORE, and memlock handling). As written, the test can fail (instead of skip) on systems where CAP_BPF is present but the other required caps are missing. Consider reusing the broader capability gate used elsewhere in the repo (or checking the full set needed by the loader) and handling a potential nil/failed cap.GetProc() defensively.
| // TestProfilerEndToEnd runs the full dwarfagent stack against the | |
| // rust-workload and asserts that the resulting pprof contains at | |
| // least one sample naming cpu_intensive_work. | |
| func TestProfilerEndToEnd(t *testing.T) { | |
| if os.Getuid() != 0 { | |
| caps := cap.GetProc() | |
| have, _ := caps.GetFlag(cap.Permitted, cap.BPF) | |
| if !have { | |
| t.Skip("requires root or CAP_BPF") | |
| } | |
| // haveRequiredProfilerCaps reports whether the current process has the | |
| // capabilities required to load and attach the DWARF profiler stack when | |
| // not running as root. | |
| func haveRequiredProfilerCaps() (bool, string) { | |
| if os.Getuid() == 0 { | |
| return true, "" | |
| } | |
| caps := cap.GetProc() | |
| if caps == nil { | |
| return false, "cannot inspect process capabilities" | |
| } | |
| required := []struct { | |
| value cap.Value | |
| name string | |
| }{ | |
| {value: cap.BPF, name: "CAP_BPF"}, | |
| {value: cap.PERFMON, name: "CAP_PERFMON"}, | |
| {value: cap.SYS_PTRACE, name: "CAP_SYS_PTRACE"}, | |
| {value: cap.CHECKPOINT_RESTORE, name: "CAP_CHECKPOINT_RESTORE"}, | |
| {value: cap.IPC_LOCK, name: "CAP_IPC_LOCK"}, | |
| } | |
| missing := make([]string, 0, len(required)) | |
| for _, req := range required { | |
| have, err := caps.GetFlag(cap.Permitted, req.value) | |
| if err != nil || !have { | |
| missing = append(missing, req.name) | |
| } | |
| } | |
| if len(missing) > 0 { | |
| return false, "missing required capabilities: " + strings.Join(missing, ", ") | |
| } | |
| return true, "" | |
| } | |
| // TestProfilerEndToEnd runs the full dwarfagent stack against the | |
| // rust-workload and asserts that the resulting pprof contains at | |
| // least one sample naming cpu_intensive_work. | |
| func TestProfilerEndToEnd(t *testing.T) { | |
| if ok, reason := haveRequiredProfilerCaps(); !ok { | |
| t.Skipf("requires root or profiler capabilities: %s", reason) |
| resolver *procmap.Resolver | ||
| mappings map[mappingKey]*profile.Mapping | ||
| locations map[any]*profile.Location | ||
| functions map[any]*profile.Function |
There was a problem hiding this comment.
Using map[any]... for locations and functions weakens type safety and introduces a sharp edge: inserting a non-comparable key (even accidentally) will panic at runtime. Since these maps are internal and already have a small, closed set of key types, consider splitting them into strongly typed maps (e.g., map[locationKey]*profile.Location and map[locationFallbackKey]*profile.Location, similarly for functionKey) or introducing a single comparable tagged-key struct to keep compile-time guarantees.
| // looksJIT returns true when frame.Name matches one of the perf-map | ||
| // runtime prefixes (Python, Node). decodePerfMapFrame zeros Address | ||
| // for these in Task 8; before then, the Address field may still be | ||
| // nonzero but we keep these on the [jit] sentinel because anonymous | ||
| // JIT mappings have no file-offset identity. | ||
| func looksJIT(f Frame) bool { | ||
| return strings.HasPrefix(f.Name, "py::") || | ||
| strings.HasPrefix(f.Name, "JS:") || | ||
| strings.HasPrefix(f.Name, "LazyCompile:") || | ||
| strings.HasPrefix(f.Name, "Function:") || | ||
| strings.HasPrefix(f.Name, "Builtin:") || | ||
| strings.HasPrefix(f.Name, "Code:") || | ||
| strings.HasPrefix(f.Name, "Script:") | ||
| } |
There was a problem hiding this comment.
The comment references a staged rollout ('Task 8') and suggests Address might still be nonzero 'before then', but decodePerfMapFrame in this diff unconditionally zeros Address when it recognizes Python/Node formats. Updating this comment to describe the current behavior (without stage references) will keep the code documentation accurate.
| import ( | ||
| "bytes" | ||
| _ "embed" | ||
| "fmt" | ||
| "io" | ||
| "structs" | ||
|
|
||
| "github.com/cilium/ebpf" | ||
| ) |
There was a problem hiding this comment.
The generated bpf2go bindings import a package named structs via a bare import path. Unless this repo provides a top-level structs package/module with HostLayout, this will not compile. If this is intended, please ensure the structs package is present and in-module; otherwise, regenerate with the correct import path or adjust the generator configuration so the import resolves consistently in module builds.
v2.8.0 was built with Go 1.25 and refuses to lint a project targeting Go 1.26 with: 'the Go language version (go1.25) used to build golangci-lint is lower than the targeted Go version (1.26.0)'. v2.11.x is built against Go 1.26+.
errcheck: explicit '_ =' on best-effort Close/Munmap calls (defer cleanup paths), or wrap defer in a lambda where appropriate. ineffassign: drop dead pcEnd reassignment in parseSample. staticcheck: apply QF1008/QF1011/QF1001 simplifications.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 49 out of 111 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // `pids` map (populated via objs.AddPID above) restricts emission to the | ||
| // workload's TGID. Using pid=workloadPID here would sample ONLY that | ||
| // specific TID, missing the worker threads where the actual CPU load runs. | ||
| for cpu := range ncpu { |
There was a problem hiding this comment.
range over an int (ncpu) does not compile. Replace this with an indexed loop (e.g., for cpu := 0; cpu < ncpu; cpu++) to iterate CPUs.
| for cpu := range ncpu { | |
| for cpu := 0; cpu < ncpu; cpu++ { |
Conflict in pprof/pprof.go between PR #8's pre-S5 method-based dedup (f.locationKey() / f.functionKey() returning frameKey) and S9's struct-based intern keys + sentinels + looksJIT helper. Resolution: keep S9. The methods are dead code in S9's world — addLocation now constructs locationKey/locationFallbackKey/functionKey structs directly, never calls Frame.locationKey() or Frame.functionKey(). go build, pprof unit tests, and procmap unit tests all pass after the merge.
…sgs, fidelity asserts, Acquire ordering - offcpu/profiler.go: thread Address through blazeSymToFrames; create procmap.Resolver in NewProfiler; pass into BuildersOptions; close in Profiler.Close. Off-CPU pprof now carries the same address+mapping fidelity as on-CPU. - perfagent/options.go: Config.Unwind doc reflects auto default. - unwind/dwarfagent/agent.go: 'no perf events attached' error message no longer implies the PID exited (wrong in system-wide mode). - pprof/pprof_test.go: rename TestAddSameNameDifferentModulesMakesDistinctFunctions to ...DedupsWithoutResolver and update the comment to describe current behavior; drop Task 1 fixture reference; scrub stage markers. - pprof/pprof.go: looksJIT comment describes current behavior; drop pre-S9 reference in locationFallbackKey doc; drop 'After Task 8' in addLocation comment. - test/integration_test.go: apply assertPprofFidelity to off-CPU tests too; doc comment reflects the >=1 threshold; drop S9: prefixes. - test/workloads/rust/src/main.rs: probe.load uses Acquire to pair with Release store. - bpf/unwind_common.h: drop (Task 5) from walker helper comment. - unwind/ehmaps/ehmaps_runtime_test.go: drop 'in Task 9' phrase.
Local golangci-lint v2.11.4 caught 11 issues that the previous CI-driven fix pass missed. Fixed: - errcheck on Close() calls in dwarfagent/agent.go, dwarfagent/offcpu.go, ehcompile/ehcompile.go, ehcompile/ehcompile_test.go, ehmaps/ehmaps.go (multiple sites in pid_mapping/classification), ehmaps/tracker_test.go. - QF1008 in dwarfagent/offcpu.go: drop redundant .session selector on embedded field for collect() and close() calls. - QF1001 in procmap/procmap_test.go: apply De Morgan's law to hex-char test in TestReadBuildID. Verified locally: golangci-lint run --timeout=5m → 0 issues.
CI failure on amd64 (arm64 same code passed) where the system-wide DWARF capture happened to land on binaries without GNU build-ids. This is environmental — kernel threads, custom Rust release builds with stripped notes, or older system binaries can legitimately have no .note.gnu.build-id section. The mapping count + non-zero Address checks remain the load-bearing fidelity guarantees. BuildID presence is now logged via t.Logf so a regression that wipes build-ids globally is still visible in test output, but the test no longer fails on captures that legitimately sample stripped code.
#25) * fix(kernel-stacks): lockdown-safe kernel symbolization + perf.data userspace MMAP2 Three production gaps shipped in v1.2.0 M1 kernel-stacks: 1. Kernel symbolization fails under lockdown=integrity (Secure Boot). blazesym's kernel source unconditionally probes /proc/kcore for DWARF, which is 0400 root:root + CAP_SYS_RAWIO-gated. perf-agent binaries setcap'd with the standard cap_perfmon/cap_bpf set lack that, so blazesym returns BLAZE_ERR_PERMISSION_DENIED for every batch and kernel frames disappear from the pprof. Setting vmlinux="" doesn't help — blazesym still hits kcore. Fix: pure-Go /proc/kallsyms symbolizer (symbolize/kallsyms.go) as the lockdown-safe fallback. On EPERM from blazesym, LocalKernelSymbolizer switches to the kallsyms path for the rest of its lifetime (sticky). Resolution loses inline expansion and file:line attribution but keeps function name + offset + module marker, which is the load-bearing property for flame graphs. PERFAGENT_FORCE_KERNEL_FALLBACK=1 lets CI exercise the fallback on hosts that don't naturally hit EPERM. 2. When every symbolizer fails, kernel frames are dropped entirely instead of being preserved as raw addresses. Caller's pprof loses kernel context. Fix: rawKernelAddrFrames synthesizes Frame{Name:"0x<hex>", Module:"[kernel.kallsyms]", Reason:FailureMissingSymbols} so operators can still decode hot kernel frames via /proc/kallsyms (the existing awk workaround the spec calls out), and the pprof keeps structural shape. 3. perf.data files written by --perf-data-output contain the [kernel.kallsyms]_text MMAP2 but no userspace MMAP2 records. `perf script` / `perf report` show [unknown] for every user-space IP, forcing operators to post-process with the maps.txt + nm hack. Fix: perfdata.Writer.AddUserspaceMmaps walks /proc/<pid>/maps via the existing procmap.Resolver and emits one PERF_RECORD_MMAP2 per executable mapping for the target PID. Wired into agent.go right after AddKernelMmap. System-wide (-a) is out of scope — left as a documented follow-up. Tests: stub-based unit tests for the fallback ladder and raw-address preservation (no kallsyms-readable host required), plus four integration tests including a forced-fallback variant that exercises the kallsyms path on every host. * docs(roadmap): post-prod-hardening improvement track Captures 10 follow-ups surfaced while fixing the v1.2.0 kernel-stacks production gaps: - self-profile lane (would have caught the lockdown bug at PR time) - metrics.Exporter histograms for symbolize hot paths - /metrics + /debug/pprof HTTP endpoint - symbolizer error counter by reason (lockdown_eperm, kallsyms_unreadable, ...) - CI lane that runs kernel-stacks integration with PERFAGENT_FORCE_KERNEL_FALLBACK=1 - go test -bench budgets for symbolize - allocs/sample budget - system-wide userspace MMAP2 (the -a follow-up to this PR) - lazy MMAP2 on first new-PID sample - kthread MMAP / COMM attribution audit Prioritized so the first three would prevent the next ship-and-pray. * feat(kernel-stacks): ship roadmap items #2, #5, #8, #10 in same PR Four follow-ups from docs/post-prod-hardening.md landed together with the bug fixes — small enough to ship in one commit, large enough to matter: #5 CI lockdown lane: new `make test-integration-lockdown` target runs the kernel-stacks integration tests with PERFAGENT_FORCE_KERNEL_FALLBACK=1, wired into the default `make test`. Every CI run now exercises the kallsyms fallback path regardless of the host's lockdown state — would have caught the original v1.2.0 ship-and-pray. #2 metrics counters: symbolize.Counters (atomic) tracks KernelBatches, KernelInputIPs, KernelBatchFailures, KernelFallbackEngaged, KernelRawAddrFrames. Logged at agent shutdown via agent.cleanup. Histograms (p50/p99 batch latency) left as follow-up (requires reservoir sampling). The fallback_engaged counter bumps both on natural EPERM-triggered switch AND on env-var forced fallback — operators see the active mode from the end-of-run log without scraping. #8 system-wide userspace MMAP2: emitCommAndMmapsForAllPIDs walks /proc at writer init and emits PERF_RECORD_MMAP2 for every non-kthread PID's executable mappings (extends Bug 3 single-PID fix to -a mode). TestPerfDataUserspaceMmap2_SystemWide verifies the walk covers multiple distinct PIDs (saw 9068 on this host). Snapshot semantics; lazy emission is roadmap #9. #10 COMM attribution audit: found that perfdata.Writer.AddComm was never called from anywhere in perf-agent — every perf.data emitted had zero COMM records. Not just kthreads — all PIDs showed up nameless in `perf script`. emitCommForPID reads /proc/<pid>/comm and emits PERF_RECORD_COMM for every PID enumerated, including kthreads (kvm-pit, vhost-*, kworker/*). TestPerfDataUserspaceMmap2 asserts the COMM record appears alongside the MMAP2 record for the workload pid. Modernized three for-loops to Go 1.22 `for i := range N` style in the modified files (per Go 1.26 project target). Roadmap doc updated to mark shipped items and reorder remaining priorities — #1 (self-profile lane) is now the highest-leverage follow-up. * feat(bench): self-profile scenario (roadmap #1) — profile the profiler A second perf-agent profiles the first while it captures a CPU workload. Catches: - perf-agent's own CPU overhead (agent_samples / workload_samples at the same sample rate — directly comparable) - kernel-symbol resolution rate (fraction of kernel Locations whose Function.Name isn't a "0x<hex>" raw-address fallback) The kernel-resolution canary is what would have surfaced the v1.2.0 lockdown bug at PR time: a sudden drop from ~100% to 0% on lockdown hosts would have been impossible to miss in CI. Wiring: - new "self" case in bench/cmd/scenario, with --self-duration, --cpu-budget, --resolution-budget flags - schema.SelfMetrics inlined into Run (omitzero, no effect on other scenarios' JSON) - countKernelResolution helper + unit tests for the metric logic - non-zero exit (3) when budgets breached so CI can gate on it - make bench-self target with 1.5 CPU / 0.5 resolution defaults (calibrated to catch regressions, not enforce a tight target on this codebase as-of-today) Verified end-to-end: 2 runs on this host showed agent/workload CPU ratios of 0.38 and 0.85 with 100% kernel resolution — both within budget, scenario exits 0. Tight budgets (--cpu-budget 0.01) exit 3. Skips the in-process caps check for "self" because the scenario is pure orchestration; the perf-agent subprocesses carry their own file caps. bench-self Makefile target deliberately depends only on bench-build (not build) so it doesn't wipe the manually-applied setcap on perf-agent each invocation. Roadmap doc updated to mark #1 shipped; remaining items reordered. * fix(symbolize): userspace fallback chain — discovered via bench-self Two related bugs surfaced by dogfooding the self-profile scenario (roadmap #1) against this PR's own perf-agent binary: A. DebuginfodSymbolizer doesn't fall back to local on NULL. When DEBUGINFOD_URLS points at a server that doesn't have the target binary's build-id (the normal case for locally-built perf-agent), blaze_symbolize_process_abs_addrs returns NULL and the dispatcher gave up — every userspace frame became <unknown>. Fix: keep a LocalSymbolizer next to the cgo state; on NULL, retry through it before erroring. B. Userspace symbolize EPERM when target has file caps. Profiling a setcap'd process (e.g., a second perf-agent) hits "permission denied" from blazesym because /proc/<pid>/exe is restricted by ptrace_scope when the target is privileged. The error propagated as "Failed to symbolize user" and the whole batch was dropped. Fix: on any blazesym error, LocalSymbolizer.SymbolizeProcess now returns rawUserAddrFrames (Name="0x<hex>", Reason=FailureMissingSymbols) — symmetric to the kernel-side rawKernelAddrFrames fix in this PR's first commit. Stack shape and addresses survive into the pprof so operators can recover with addr2line. Full per-mapping ELF resolution (the proper fix that bypasses /proc/<pid>/exe entirely by using procmap-supplied paths) is left as a roadmap follow-up — see docs/post-prod-hardening.md "Findings from self-profile" section. Verified end-to-end: profiling perf-agent with perf-agent on a setcap'd binary went from 100% <unknown> (whole batch dropped) to named raw-hex frames + structural mapping info; addr2line on those hex names produced the hot-function ranking that drives the new roadmap section (cilium/ebpf BTF parse, net.* DNS for debuginfod, SQLite cache init, etc.). No "permission denied" log spam. * perf(symbolize, bpfstack): close the dogfood loop — three iterations Running the self scenario four times in sequence, each iteration exposed the next bottleneck once the previous one was fixed: iter 2 → fix per-IP empty-name rendering frameFromCSym and fromBlazesymSym left Name="" when blazesym resolved the binary but couldn't map a specific IP to a symbol. pprof rendered those as <unknown>. Filling with "0x<hex>" + FailureMissingSymbols matches the kernel-side behavior shipped earlier in this PR. <unknown> went from 76% to ~0%. iter 3 → fix BPF user-stack kernel leak + speed up /proc/kallsyms bpf_get_stackid with BPF_F_USER_STACK can include kernel addresses in the user-stack buffer when the sampled task was in syscall/irq/fault context. That leak made 40% of "user" CPU look like 0xffff...kernel-range hex names in the pprof, with the user code below them invisible. bpfstack.SplitUserKernelIPs partitions ips by the canonical x86_64 kernel threshold and routes the strays to the kernel symbolizer (where they belong). Once the leak was fixed, the new top of the profile was /proc/kallsyms reading (module_get_kallsym, seq_read_iter, vsnprintf — all kernel-side). The kallsyms file is synthesized by the kernel: each small read() forces vsnprintf for every line. The default 4 KiB scanner buffer meant ~700 read syscalls for 3M kallsyms lines. Wrapping the file in a 256 KiB bufio Reader cut that ~64x and dropped kallsyms entirely out of the top 20. iter 4 → no fix; steady-state hot paths surface do_syscall_64 (42% cum), finish_task_switch (21%), ep_poll (10%), ringbuf.readRecord, dwarfagent.aggregateCPUSample, Syscall6 / nanotime1 / futex. All intrinsic costs of being an eBPF-based sampler. Further wins are diminishing returns. Roadmap doc captures the iteration log so future contributors see the dogfood-as-process pattern: ship #1, run it against itself, let it tell you what to fix next. Tests: new internal/bpfstack/split_test.go covers SplitUserKernelIPs edge cases (kernel-at-top leak, no-leak passthrough, all-kernel, boundary value). All existing unit suites still pass. * perf(symbolize): allocation-free /proc/kallsyms parser bench-self iter 5 (A=with --kernel-stacks, B=without) surfaced strings.Fields + strconv.ParseUint as the top user-side allocation source: 3M kallsyms lines × multi-string Fields × ParseUint triggered visible GC pressure (mallocgc / sweepone / tryDeferToSpanScan in the user-side pprof's top 10). Replaced with parseKallsymsLine: byte-level scan over the read buffer, no string allocation per line. Names are copied out once per kept symbol (mandatory — buffer reuses). Module strings go through an intern map (typical kernel has tens of modules but millions of module symbols). BenchmarkParseKallsymsLine-16 72630638 16.54 ns/op 0 B/op 0 allocs/op iter 6 confirmed: user-side allocation hotspots are gone. The remaining kallsyms cost is kernel-side (module_get_kallsym + vsnprintf rendering the synthesized /proc/kallsyms file) and can only be avoided by NOT parsing on every invocation. Roadmap doc adds a "remaining bottleneck" section proposing a disk-cached index keyed by kernel boot_id as the next follow-up. Also lifts strings.Fields-based parsing out of the kallsyms hot path entirely — net deletion of 2 imports (strconv, strings) from kallsyms.go. * perf(symbolize): disk-cached /proc/kallsyms index keyed by boot_id bench-self iter 6 identified the kernel synthesizing /proc/kallsyms (module_get_kallsym + vsnprintf in the profile) as the floor — unfixable from userspace because the file is generated on read. The fix is to avoid re-reading on every agent invocation. Cache key: kernel boot_id from /proc/sys/kernel/random/boot_id. Kernel symbol addresses change only across reboots (KASLR), so boot_id is exactly the right signal — drops the cache when it would be wrong, never falsely retains a stale copy. Format (little-endian): header: magic u32 | version u32 | boot_id [16]byte | n_syms u32 per-symbol: addr u64 | name_len u16 | mod_len u16 | name | module ~50 bytes/symbol, ~11 MiB on disk for this host's filtered kallsyms (~225k T/t/W/w/i entries). atomic write via temp + rename, so concurrent readers never see a partial file. Behavioral flow in newKallsymsSymbolizer: 1. loadCachedKallsyms → return on hit (boot_id matches) 2. parseKallsymsFresh on stale / corrupt / missing 3. best-effort writeKallsymsCache for next invocation iter 7 dogfood (cold vs warm cache run, same workload + flags): metric cold warm delta module_get_kallsym 16% 11% -45% vsnprintf (cum) 13% 5% -67% seq_read_iter (cum)40% n/a out of top 12 total CPU 676ms 565ms -16% Tests cover roundtrip, stale boot_id rejection, missing file, and corrupt-file non-fatal handling. cachePathFn and readBootIDFn are package vars so tests can pin both deterministically. Next opportunity (next commit): the remaining ~11% module_get_kallsym in the warm run is blazesym's FAILED kernel-source attempt reading kallsyms before hitting EPERM on /proc/kcore. A persistent "blazesym EPERM'd on this boot" marker would let perf-agent skip that attempt entirely on subsequent invocations. * perf(symbolize): persistent blazesym-EPERM marker — skip failing attempt Iter 7 dogfood revealed that even with a warm kallsyms disk cache, the FIRST blazesym call per agent invocation still cost ~16% of perf-agent CPU: blazesym opens /proc/kallsyms (the read shows up as module_get_kallsym + vsnprintf in the profile) before discovering /proc/kcore is restricted and returning EPERM. The sticky in-process bit prevents subsequent batches from re-paying, but every new agent process re-attempts blazesym from scratch. Fix: a boot_id-scoped marker file at ~/.cache/perf-agent/blazesym-eperm-<hex_boot_id>. Written on the first observed EPERM. Checked at LocalKernelSymbolizer construction: if present, fallback is pre-engaged and blazesym is never invoked. Marker is automatically invalidated by reboot (boot_id changes), so a kernel upgrade that loosens lockdown is re-detected on next boot. iter 8 dogfood (cold → warm cache + warm marker, same workload): metric cold warm delta module_get_kallsym 16% ZERO out of top 12 vsnprintf (cum) 14% ZERO out of top 12 do_syscall_64 (cum) 52% 0% down (no failing blazesym calls) total perf-agent CPU 505ms 131ms -74% Warm pprof now shows intrinsic eBPF-sampler costs only: finish_task_switch (21%), ep_poll waiting (69% cum), spin_unlock. Diminishing returns from here — every remaining sample is genuine "be a profiler" work. Tests pin XDG_CACHE_HOME via t.Setenv + override readBootIDFn so the marker lands in a t.TempDir(). Covers: roundtrip, boot_id-scoped invisibility (post-reboot equivalence), and filename structurally encodes the boot_id. * perf(perfdata): lazy MMAP2/COMM on first sample per PID (roadmap #9) bench-self iter 9 found system-wide capture burning ~30% of perf-agent CPU in kernel /proc/<pid>/maps rendering (show_map_vma, mangle_path, lock_next_vma, prepend_path, d_path) because emitCommAndMmapsForAllPIDs walked every PID at writer init — ~9000 reads on this host, of which only ~200 produce samples in a typical capture. Fix: perfdata.Writer.OnNewPID callback fires the first time AddSample sees each unique pid (sentinel-filtered: skips 0 and 0xffffffff). agent.go wires it to emitCommForPID + emitUserspaceMmapsForPID for system-wide mode; the eager walk is gone. Single-PID mode is unchanged (always knows its target). iter 11 dogfood (system-wide, --perf-data-output ON): perf.data size: ~3 MB (vs ~40-50 MB the eager walk produced) enrolled PIDs: ~208 lazy (vs ~9000 eager) remaining /proc/maps cost: ~15% cum, bounded by sampled-PID count rather than host-PID count — natural floor from procmap.Resolver's per-batch re-snapshot (spec invariant) and dwarfagent's lazy attach. Bonus: lazy mode also covers PIDs that exec AFTER capture starts. The eager walk could never see those. Tests pin the contract: fires once per unique pid (multiple samples for same pid don't re-trigger); sentinel pids are filtered; nil callback is a no-op (existing --pid-mode callers keep working unchanged). * feat(symbolize): reason-bucketed error counters + bench coverage (roadmap #4, #6) #4 reason-bucketed counters: - KernelLockdownEPERM bumps on each BLAZE_ERR_PERMISSION_DENIED - KernelOtherErr bumps on every other blazesym failure - CountersSnapshot.String() now logs eperm=N and other_err=N Operators can distinguish "canonical lockdown" (high eperm, fallback_engaged=1) from "unexpected blazesym failure" (any other_err > 0 deserves attention) without grepping logs. #6 symbolize benchmarks: - BenchmarkParseKallsymsFresh (cold-path baseline) - BenchmarkLoadCachedKallsyms (warm-path; ~20x faster) - BenchmarkResolveKernelIPs (per-IP cost ~35 ns) - BenchmarkParseKallsymsLine kept for per-line regression - new `make bench-symbolize` target The 20x cache speedup is the headline metric. PRs touching the symbolize hot path can now show numerical evidence rather than vibes. Roadmap doc reflects shipped status; both items move from Pending to Shipped in the triage table. * feat(perfagent): --metrics-listen HTTP endpoint (roadmap #3) --metrics-listen <addr> starts an in-process HTTP server on the given address (e.g. 127.0.0.1:7777) exposing: /metrics Prometheus text format of symbolize.Counters (batches, input_ips, batch_failures, fallback_engaged, raw_addr_frames, lockdown_eperm, other_err) /debug/pprof/ Go runtime self-pprof — live scrape with `go tool pprof http://host:7777/debug/pprof/profile` Default off — no port opened, zero CPU overhead when unused. Lifecycle: server starts after kernelSymbolizer is constructed (so the metrics handler closure can snapshot live counters); shutdown happens in Agent.cleanup AFTER the symbolizer is closed, with a 2s graceful drain. Tests cover: - Prometheus exposition format (HELP + TYPE + value lines per metric, all 7 counters present) - Lifecycle round-trip against a :0 listener (live scrape, Stop returns within grace, post-Stop scrape fails) - Empty-addr no-op (--metrics-listen not set → nil server) - /debug/pprof/ mount works through the metrics mux End-to-end verified: agent run with --metrics-listen serves the canonical Prometheus payload AND a working gzipped pprof profile (1f 8b 08 magic). * feat(symbolize): batch-duration histograms + allocs budget (roadmap #2, #7) #2 batch-duration histograms: - LatencyHist: ring buffer (1024 samples) + min/max/sum, with p50/p99 on Snapshot via slices.Sort. - Wired into SymbolizeKernel: time.Now() on entry, Record on deferred exit. Microseconds — readable across warm-cache (sub-ms) and cold-CGO (millisecond+) paths. - End-of-run log line gains batch_p50_us / batch_p99_us / batch_max_us fields. - /metrics endpoint exposes three new gauges: perf_agent_symbolize_kernel_batch_{p50,p99,max}_microseconds Sliding-window semantics: an early burst of slow batches doesn't permanently bias the percentile — only the most recent 1024 stay in the ring. Tested with the TestLatencyHist_RingOverwrite case. #7 allocs budget: - testing.AllocsPerRun gates on the three hottest paths: parseKallsymsLine (0 alloc/op), kallsymsSymbolizer.Resolve (1 alloc/op = the return slice), LatencyHist.Record (0 alloc/op). - PRs that reintroduce strings.Fields, strconv.ParseUint, or other allocators into these paths will fail the test immediately — not just look worse in benchstat. Catches the regression class dogfood iter 5 surfaced (mallocgc / sweepone in the user-side profile of perf-agent). Roadmap doc: items #2 and #7 marked shipped (#7 partial — the symbolize hot path is gated; sample-processing follow-up is documented). * chore(perfagent): drop dead emitCommAndMmapsForAllPIDs CI lint caught it: the helper was the eager-walk path replaced by the lazy OnNewPID hook in 405611a (roadmap #9). emitCommForPID and emitUserspaceMmapsForPID are now the only callers — both already exported through the lazy callback. Dead. * docs(README): drop GPU-profiling experimental notice GPU profiling lives on its own track (and its own worktree). The README banner was conflating it with the stable CPU / off-CPU / PMU paths; dropping the line keeps the front page honest about what this build delivers.
Summary
This branch lands the DWARF/hybrid stack unwinder and the pprof address + mapping fidelity refactor (S — 115 commits totaling roughly 2 months of design and implementation. End result:
.eh_frameand consulted in BPF on FP-less PCs.Mappingentries carry path + start/limit/file-offset + GNU build-id;Locationentries are address-keyed (Address - MapStart + MapOff) so two PCs at the same source line stay distinguishable.-Cprofile-useand friends) is now in place; the conversion tool itself is the next piece and is intentionally out of scope here.Architecture changes
The hybrid walker shares the FP path's userspace pipeline:
unwind/ehcompile/— pure-Go.eh_frameparser. Classifies PC ranges into FP_SAFE / FP_LESS / FALLBACK and emits flat 24-byte CFI entries the BPF walker can binary-search.unwind/ehmaps/— BPF map population + per-PID lifecycle (refcounted CFI tables,MmapWatcherconsuming PERF_RECORD_MMAP2/EXIT/FORK to keep maps fresh throughdlopen/exec).unwind/dwarfagent/—Profiler/OffCPUProfilerfor the DWARF path; shares asessionstruct (BPF objs, ringbuf reader, symbolizer, MMAP2 watcher). Eager-compile failures (Go binaries, no.eh_frame) are tolerated — the walker's FP fast path covers those.unwind/procmap/(new in S9) — lazy per-PID/proc/<pid>/maps+ ELF.note.gnu.build-idresolver. Keyed binary-search lookup feeds the pprof builder. MmapWatcher events invalidate cache entries through an observer callback onPIDTracker.Run.pprof/—FramegainsAddress/BuildID/MapStart/MapLimit/MapOff/IsKernel. New intern keys (mappingKey,locationKey,locationFallbackKey,functionKey) replace the name-only dedup.addLocationdispatches: kernel sentinel → JIT sentinel (Python/Node perf-maps) → resolver-driven address path → name-based fallback.bpf/perf_dwarf.bpf.c+bpf/offcpu_dwarf.bpf.c, sharing types viabpf/unwind_common.h. CFI consulted viaHASH_OF_MAPSkeyed by FNV-1a of GNU build-id.--unwindflag selects the walker:fp(cheap, FP-only),dwarf(eager CFI compile + per-frame hybrid),auto(=dwarf; the hybrid walker covers FP-safe code via the FP path with no CFI lookup).Notable behavior changes
--unwind autois now the default (previouslyfp). Most workloads see no change because the hybrid walker's FP fast path is byte-equivalent to pure-FP for FP-safe code; the difference shows up on FP-less binaries that previously truncated.functionKey{MappingID, Name}). Same symbol name in two binaries (main.mainin the target plus a vendored tool invoked viaexec) is now two distinctprofile.Functionentries — pprof-correct, but a schema-compatible change.Mapping[0].File == ""no longer indicates "the binary" — it's a fallback placeholder, kept for synthetic[unknown]and resolver-miss frames. Real binaries get their ownMappingentries.Test plan
Verified end-to-end on Fedora 43, kernel 6.19, on a
setcap-capped binary (no sudo):make build— clean (defaultedGOTOOLCHAIN=autoso the Go 1.26 requirement doesn't fail on hosts with older Go).make test-unit— all unit packages pass, including newunwind/procmaptests (parser, build-id reader, Resolver lazy cache, InvalidateAddr observable behavior).TestProfileMode(go-cpu/go-io/rust-cpu/python-cpu/python-io) andTestPerfAgentSystemWideDwarfProfile: every captured profile has ≥1 real (non-sentinel) Mapping with non-empty BuildID, and every user-space Location has non-zero file-offset Address.go tool pprof -raw <output>.pb.gzparses cleanly and shows real binary paths, build-ids, and addresses.Operational notes
setcapon theperf-agentbinary file (no sudo needed for the integration suite). Tests categorize as Category A (exec perf-agent — file caps suffice) or B (in-process BPF — test binary itself needs caps). SeerequireBPFRunnableintest/integration_test.go.AttachAllMappingsis now best-effort: per-binary failures (Go executables lack.eh_frame) log + continue rather than abortingNewProfiler. The hybrid walker covers those binaries via the FP path at runtime.Out of scope (follow-ups)
--unwind autoruntime-cost refinement (lazy compile / binary-detect heuristics). Spec atdocs/unwind-auto-refinement-design.md. Recommendation: benchmark current eager-compile cost on real workloads first; only act if-astartup is painful.docs/superpowers/specs/2026-04-25-gpu-profiling-design.md.