Skip to content

DWARF stack unwinding + pprof address/mapping fidelity#7

Merged
dpsoft merged 122 commits into
mainfrom
feat/dwarf-unwinding
Apr 25, 2026
Merged

DWARF stack unwinding + pprof address/mapping fidelity#7
dpsoft merged 122 commits into
mainfrom
feat/dwarf-unwinding

Conversation

@dpsoft
Copy link
Copy Markdown
Owner

@dpsoft dpsoft commented Apr 25, 2026

Summary

This branch lands the DWARF/hybrid stack unwinder and the pprof address + mapping fidelity refactor (S — 115 commits totaling roughly 2 months of design and implementation. End result:

  • Release-built C++/Rust binaries unwind correctly even without frame pointers, via per-binary CFI tables compiled from .eh_frame and consulted in BPF on FP-less PCs.
  • pprof output preserves real per-binary identity: Mapping entries carry path + start/limit/file-offset + GNU build-id; Location entries are address-keyed (Address - MapStart + MapOff) so two PCs at the same source line stay distinguishable.
  • Plumbing for downstream LLVM sample-PGO (Rust -Cprofile-use and friends) is now in place; the conversion tool itself is the next piece and is intentionally out of scope here.

Architecture changes

The hybrid walker shares the FP path's userspace pipeline:

  • unwind/ehcompile/ — pure-Go .eh_frame parser. Classifies PC ranges into FP_SAFE / FP_LESS / FALLBACK and emits flat 24-byte CFI entries the BPF walker can binary-search.
  • unwind/ehmaps/ — BPF map population + per-PID lifecycle (refcounted CFI tables, MmapWatcher consuming PERF_RECORD_MMAP2/EXIT/FORK to keep maps fresh through dlopen/exec).
  • unwind/dwarfagent/Profiler / OffCPUProfiler for the DWARF path; shares a session struct (BPF objs, ringbuf reader, symbolizer, MMAP2 watcher). Eager-compile failures (Go binaries, no .eh_frame) are tolerated — the walker's FP fast path covers those.
  • unwind/procmap/ (new in S9) — lazy per-PID /proc/<pid>/maps + ELF .note.gnu.build-id resolver. Keyed binary-search lookup feeds the pprof builder. MmapWatcher events invalidate cache entries through an observer callback on PIDTracker.Run.
  • pprof/Frame gains Address/BuildID/MapStart/MapLimit/MapOff/IsKernel. New intern keys (mappingKey, locationKey, locationFallbackKey, functionKey) replace the name-only dedup. addLocation dispatches: kernel sentinel → JIT sentinel (Python/Node perf-maps) → resolver-driven address path → name-based fallback.
  • BPF: bpf/perf_dwarf.bpf.c + bpf/offcpu_dwarf.bpf.c, sharing types via bpf/unwind_common.h. CFI consulted via HASH_OF_MAPS keyed by FNV-1a of GNU build-id.

--unwind flag selects the walker: fp (cheap, FP-only), dwarf (eager CFI compile + per-frame hybrid), auto (= dwarf; the hybrid walker covers FP-safe code via the FP path with no CFI lookup).

Notable behavior changes

  • --unwind auto is now the default (previously fp). Most workloads see no change because the hybrid walker's FP fast path is byte-equivalent to pure-FP for FP-safe code; the difference shows up on FP-less binaries that previously truncated.
  • pprof Function dedup is now per-binary (functionKey{MappingID, Name}). Same symbol name in two binaries (main.main in the target plus a vendored tool invoked via exec) is now two distinct profile.Function entries — pprof-correct, but a schema-compatible change.
  • Mapping[0].File == "" no longer indicates "the binary" — it's a fallback placeholder, kept for synthetic [unknown] and resolver-miss frames. Real binaries get their own Mapping entries.

Test plan

Verified end-to-end on Fedora 43, kernel 6.19, on a setcap-capped binary (no sudo):

  • make build — clean (defaulted GOTOOLCHAIN=auto so the Go 1.26 requirement doesn't fail on hosts with older Go).
  • make test-unit — all unit packages pass, including new unwind/procmap tests (parser, build-id reader, Resolver lazy cache, InvalidateAddr observable behavior).
  • Full integration suite — 18 PASS, 6 SKIP, 0 FAIL (5:34 runtime). Skipped tests load BPF in-process and require capping the test binary itself.
  • S9 fidelity assertions on TestProfileMode (go-cpu/go-io/rust-cpu/python-cpu/python-io) and TestPerfAgentSystemWideDwarfProfile: every captured profile has ≥1 real (non-sentinel) Mapping with non-empty BuildID, and every user-space Location has non-zero file-offset Address.
  • go tool pprof -raw <output>.pb.gz parses cleanly and shows real binary paths, build-ids, and addresses.

Operational notes

  • Test capability gates now honor setcap on the perf-agent binary file (no sudo needed for the integration suite). Tests categorize as Category A (exec perf-agent — file caps suffice) or B (in-process BPF — test binary itself needs caps). See requireBPFRunnable in test/integration_test.go.
  • AttachAllMappings is now best-effort: per-binary failures (Go executables lack .eh_frame) log + continue rather than aborting NewProfiler. The hybrid walker covers those binaries via the FP path at runtime.

Out of scope (follow-ups)

  • S10 — pprof → LLVM sample-profile converter for Rust PGO. The originally stated end goal; deferred to a fresh session because the LLVM sample-profile format has its own design surface (line discriminators, weight calibration, body-sample vs callsite-sample emission).
  • --unwind auto runtime-cost refinement (lazy compile / binary-detect heuristics). Spec at docs/unwind-auto-refinement-design.md. Recommendation: benchmark current eager-compile cost on real workloads first; only act if -a startup is painful.
  • GPU profiling (full-stack CPU launch + device-side correlation). Draft spec at docs/superpowers/specs/2026-04-25-gpu-profiling-design.md.

dpsoft added 30 commits April 21, 2026 16:41
Initial spec assumed blazesym could unwind (regs, stack_bytes) -> PCs.
Verified against blazesym main (2025) and upstream activity: the C API
exposes only symbolize/normalize/inspect/kernel; the Rust crate has no
unwind module. A prototype FP unwinder existed in the 2021 initial
commit but was removed during the pivot to symbolization-only.

Update Option B to document the real architecture: kernel captures
(regs, stack_bytes) as before, but userspace needs a separate unwinder
library. Three concrete paths documented — libunwind via CGO (recommended),
delve dwarf/frame (pure Go), or shelling out to perf as a prototype
harness. Scope estimate adjusted accordingly.

No code changes on this branch yet; spec correction is the first commit.
- --unwind {fp,dwarf,auto} with auto as default.
- auto is a hybrid: consult pre-scanned .eh_frame classification per
  PC range; take the kernel's cheap FP callchain when PC is in
  FP-safe code; DWARF-unwind only when PC lands in FP-less code
  (libstd, glibc, C++ release builds) or when FP callchain truncated
  suspiciously.
- Pre-scan .eh_frame per ELF at mmap-change events; cache by build-id
  so repeated loads don't re-parse. Sample-time decision is a range
  lookup + length check.
Tests pass, binary builds clean, go1.26.0 toolchain resolved via
GOTOOLCHAIN=auto. No code changes required for the upgrade.
Captures PERF_RECORD_SAMPLE with IP + TID + TIME + CALLCHAIN +
REGS_USER + STACK_USER for a target PID at a configurable frequency.
Ring buffer is mmapped and drained on demand via ReadNext().

Split into:
- unwind/perfreader/regs_{amd64,arm64}.go — per-arch register masks
  and SP/BP/IP accessors using sparse-bit indexing into the dense
  regs[] array the kernel emits.
- unwind/perfreader/reader.go — perf_event_open, mmap, ring parser.
- unwind/perfreader/sample.go — Sample record type.
- cmd/perfreader-test/main.go — diagnostic CLI that prints sample
  summaries (IP, callchain, registers, first bytes of stack).

No unwinder yet; this is the kernel-side capture validated
end-to-end. Needs the host to have cap_sys_admin+cap_bpf+cap_perfmon
(set via setcap on the built binary).
Initial spike failed to read samples because perfMmapPage struct layout
was wrong — my padding placed data_head/data_tail 4 bytes off from
where the kernel actually writes them. The page struct has a long
reserved area that's evolved across kernels and is risky to mirror
in Go.

Fix: access data_head (offset 1024) and data_tail (offset 1032)
directly via raw pointer into the mmap — those offsets are kernel-ABI
stable. Deleted the full perfMmapPage struct mirror entirely.

Also in the test CLI:
- Add --duration so a silent hang is bounded.
- Periodic progress line to stderr (wakeups/records/samples).
- Correct PERF_CONTEXT_USER sentinel (0xfffffffffffffe00 = -512, not
  the wrong -256 I had before).

End-to-end validated against Go cpu_bound workload: samples arrive
with correct IP + 17 regs + 8KB stack, and the stack bytes contain
real return addresses (verified [rbp+8] matches callchain[2]).
unwind/fpwalker/ reconstructs a callchain by following the frame-pointer
chain through captured PERF_SAMPLE_STACK_USER bytes. Arch-agnostic
(x86_64 rbp and arm64 x29 share the same [savedBP, retAddr] layout at
the frame pointer). 6 unit tests cover simple chains, OOB bp, null
terminator, non-monotonic chains, short stacks, and cycle safety.

Validation: the diagnostic CLI now runs the FP walker on each sample
and compares its output to the kernel's PERF_SAMPLE_CALLCHAIN (minus
PERF_CONTEXT_* sentinels). Tested against Go cpu_bound — 3/3 samples
produce identical chains, confirming end-to-end that the captured
(regs, stack_addr, stack_bytes) is fully usable for userspace
unwinding.

This is the cheap leg of the planned --unwind auto hybrid. The DWARF
leg (libunwind CGO) is next.
Package unwind/dwarfunwind/ wraps libunwind for DWARF-based stack
unwinding. This commit lands the architecture without a working
find_proc_info — the CGO wiring, address-space setup, custom
accessors, ELF-mapping parser, and end-to-end Go→C→Go path are
verified via a smoke test.

Files:
- unwinder.h        : C surface exposed to Go.
- unwinder.c        : libunwind accessor callbacks and /proc/<pid>/maps
                      based ELF mmap table. Custom access_mem serves
                      captured stack, locally-mmap'd ELFs, and
                      /proc/<pid>/mem fallback. Custom access_reg
                      tracks live_regs updated by libunwind as it
                      restores prior frames.
- unwinder.go       : Go facade (Unwinder, New, Close, Unwind).
- CGO flags         : -lunwind -lunwind-x86_64 from libunwind-devel.

Known gap: find_proc_info is currently a UNW_ENOINFO stub. libunwind
will therefore fail to step beyond the first frame. Next commit adds
a real find_proc_info that locates PT_GNU_EH_FRAME in each ELF and
builds per-mapping unw_dyn_info_t for dwarf_search_unwind_table.

No integration with the rest of the repo yet; this is standalone.
…ding

The original spec chose Option B — userspace libunwind via CGO. While
bringing up find_proc_info we hit the libunwind REMOTE_TABLE format's
requirement that table_data point at the FDE search table (not the
.eh_frame_hdr header). That's solvable, but it surfaced how much of
the unwinder complexity we'd be pushing into userspace: .eh_frame_hdr
parsing, /proc/<pid>/maps tracking, ELF mmap management,
register-restore state machine per sample — all fighting the CGO
boundary and competing with the workload CPU.

Option A moves the same work into the BPF program using CFI tables
pre-compiled in userspace. Wins: lower per-sample cost (single-digit
microseconds vs. tens), native off-CPU support (previously impossible
in Option B because sched_switch doesn't generate PERF_SAMPLE_STACK_USER
records), no CGO on the hot path, no libunwind dependency.

Non-goal: matching parca-agent scope. We skip full DWARF expressions
(0.05% of glibc FDEs measured on this host), JIT code tracking, and
high-sample-rate always-on system-wide. We keep all current features —
every --profile, --offcpu, --pmu, --pid, -a combination that works on
main continues to work.

Spec rewritten at docs/dwarf-unwinding-design.md.

Staging: ehcompile/ → BPF FP+classify → BPF DWARF walker → ehmaps/ →
dwarfagent/ → off-CPU port → system-wide. Each stage independently
mergeable; --unwind fp (default) never regresses.

unwind/dwarfunwind/ deleted — the libunwind CGO wrapper is now dead
code.
Material fixes:
- Resolve --unwind auto contradiction: auto routes to FP programs
  in S1-S5 and flips to DWARF programs in new stage S8.
- Kernel floor simplified to Linux 6.0+ (all features mature well
  before 6.0; keeps the compat matrix simple).
- Build-id hash widened from 32 bits to 64 (FNV-1a of build-id) to
  eliminate birthday-collision risk.
- S6 (off-CPU port) now spells out the tracepoint-vs-perf_event
  capture differences: bpf_task_pt_regs + explicit bpf_probe_read_user
  instead of PERF_SAMPLE_STACK_USER.
- counts map clarified: DWARF programs don't use one. Samples go out
  per-event via ringbuf; userspace aggregates after symbolization.
- Per-stage exit criteria added — each stage now has a verifiable
  end state instead of just an effort estimate.

Minor:
- FALLBACK classification clarified: runtime-identical to FP_SAFE,
  different only for telemetry.
- 'byte-for-byte identical' claim scoped to the BPF layer; the Go
  path gets one new switch arm.
- BPF unit-testing strategy made explicit: Go mirror of the walker,
  fuzz-compare BPF vs Go outputs.
- Ringbuf record changed from fixed-128-slot to variable length
  (header + n_pcs × u64), saves ~75% of typical bandwidth.
Spec updated: cfi_entry layout made arch-neutral — cfa_type uses SP/FP
instead of cfa_reg; added ra_offset field; fp_type/fp_offset renamed
from rbp_* for arch clarity. RA on arm64 isn't conventionally at
[CFA-8] like x86_64, so tracking it explicitly is required.

Plan saved to docs/superpowers/plans/2026-04-22-ehcompile.md — 19 tasks
covering the ehcompile Go package:

- Package skeleton + arch-neutral output types (CFIEntry, Classification).
- ULEB128/SLEB128/DW_EH_PE_* primitives.
- archInfo struct with x86_64 and arm64 register maps.
- CIE/FDE parsing with augmentation handling (zR, zRS, zPR, etc.).
- .eh_frame section walker with backward CIE pointer semantics.
- CFI interpreter covering: advance_loc family, def_cfa family, offset/
  restore/same/undefined/register, remember/restore_state, expression
  opcodes → FALLBACK, DW_CFA_GNU_args_size.
- Compile() with ELF-machine-type dispatch.
- Tests: x86_64 golden fixture (committed), arm64 cross-compiled fixture
  (optional, skips if aarch64-linux-gnu-gcc absent), arm64 synthetic
  bytecode coverage (host-independent), system glibc stress test, Go
  binary integration test, benchmark.

Stage is independently mergeable — no BPF code, no integration with
profile/ or offcpu/ packages, no new runtime dependencies beyond
Go stdlib debug/elf.
Also fix sampleFDE to compute CIE_pointer dynamically from the
actual ciePos/fdePos arguments instead of a stale hardcoded value.
…ndler

Cross-compile hello.c to hello_arm64.o as a relocatable ELF — we can't
link a full arm64 executable without libc/crt, but the object file
still has .eh_frame and parses fine through debug/elf.

Plan defect surfaced by real arm64 CFI: DW_CFA_AArch64_negate_ra_state
(0x2d) wasn't in our opcode list. It's emitted in virtually every arm64
FDE to toggle a pointer-auth (PAC) state bit. No operand; no effect on
our CFA/FP/RA tracking. Added as a no-op case.

Also regenerated hello.golden after removing the unused stdio.h include
from hello.c (was blocking the arm64 cross-compile).
System glibc: 13,769 CFI entries, 8 FALLBACK (0.058%) — matches the
research estimate of 0.05% for DW_CFA_expression usage in real-world
glibc. Test asserts the FALLBACK ratio stays under 2%.

Go binaries: pure-Go executables have no .eh_frame (Go's runtime uses
pclntab for stack unwinding instead). Test accepts ErrNoEHFrame as a
valid outcome.
Benchmarks on AMD Ryzen 9 7940HS:
- Glibc (13,769 entries):          4.4 ms  /  5.1 MB  / 24,727 allocs
- Hello x86_64 (71 entries):      16.1 us  /   15 KB  /    169 allocs
- Hello arm64 .o (6 entries):     11.0 us  /  5.7 KB  /     78 allocs

Well under the plan's 50 ms/compile target for glibc. Allocation count
is high (mostly slice growth during FDE accumulation) — a future pass
could pre-size entries/classifications based on .eh_frame_hdr's fde_count
field. Not needed for S1's success criteria.

Docstring now summarizes the full CFI dialect we handle, arch support,
and explicit out-of-scope items.
dpsoft added 14 commits April 25, 2026 16:28
go.mod pins 1.26+; system Go is often 1.25 on current distros. Without
GOTOOLCHAIN=auto the build fails outright; with it, Go fetches the
required toolchain on demand. Users can still pin to local with
GOTOOLCHAIN=local make build.
AttachAllProcesses/AttachAllMappings can fail on individual binaries
(notably Go's static executables, which use .gopclntab + frame pointers
instead of .eh_frame). Before this, NewProfiler refused to start at all
when the first binary failed. The hybrid walker handles FP-safe code
without CFI tables, so eager-compile is purely an optimization — log
the error and continue rather than abort.
Static binaries (Go) are self-contained: no libc, no shared libs, so
exactly one real (non-sentinel) mapping is correct. The original >=2
threshold assumed dynamically-linked targets. The point of the
assertion is 'we're not falling back to the hardcoded placeholder' —
that holds with one real mapping.
Captures architecture for full-stack GPU profiling beyond what eBPF
alone can deliver: vendor-agnostic host layer + vendor-specific device
layer. Draft only — not on the S9-narrow critical path; tracked here
so the design isn't lost between sessions.
- Rewrite README architecture diagram to show DWARF/hybrid path,
  ehcompile/ehmaps/procmap pipeline, and the address-keyed pprof
  ProfileBuilder. Refresh flags (--unwind), build (Go 1.26+,
  GOTOOLCHAIN=auto), and testing (cap-aware skip gates).
- Track .claude/settings.json (shared plugin config); keep
  settings.local.json (per-user state) ignored.
Idea capture only — not a design spec, not a plan. Documents two
implementation paths (direct pprof->LLVM converter vs perf-agent
emitting perf.data + reusing autofdo/llvm-profgen), the open questions
worth resolving before designing, and what's deliberately out of scope.

Tracked here so the thread isn't lost between sessions; revisit and
brainstorm into a real spec before implementation.
@dpsoft dpsoft requested a review from Copilot April 25, 2026 20:57
Three CI failures motivated this:
- Lint failed because golangci-lint built against Go 1.25 can't parse a
  go.mod that requires 1.26.
- Unit Tests failed with 'compile: version go1.26.0 does not match go
  tool version go1.25.9' — toolchain mismatch when the Go binary is
  pinned to 1.25 but go.mod targets 1.26.
- Integration Tests failed with 'symbol bpf_loop: unsatisfied program
  reference' — bpf_loop requires kernel >=5.17, ubuntu-22.04 ships 5.15.
  ubuntu-24.04 ships kernel 6.x.

Updates tests.yml, ci.yml, release.yml. Build job + lint + unit + integration.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a DWARF-capable hybrid stack unwinder and refactors pprof emission to preserve per-binary mapping identity and address-keyed locations (enabling higher-fidelity downstream consumers like sample-based PGO tooling).

Changes:

  • Introduces DWARF/hybrid CPU + off-CPU profilers (BPF programs + Go wiring) with CFI table support and MMAP2-driven mapping lifecycle.
  • Refactors pprof builder to accept rich Frame objects, intern real Mappings (path/start/limit/off/build-id), and key Locations by binary-relative address.
  • Expands integration/unit tests and workloads (Rust + dlopen probe) and updates CLI/config defaults around --unwind.

Reviewed changes

Copilot reviewed 49 out of 111 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
unwind/dwarfagent/agent_test.go Adds an end-to-end DWARF profiler test against the Rust workload.
unwind/dwarfagent/agent.go Introduces DWARF-capable CPU profiler implementation and perf-event attachment.
test/workloads/rust/src/main.rs Extends Rust workload with optional dlopen + probe function pointer path.
test/workloads/rust/probe/src/lib.rs Adds a small cdylib probe function for dlopen/MMAP2 tracking tests.
test/workloads/rust/probe/Cargo.toml Defines probe cdylib build configuration (debug symbols in release).
test/workloads/rust/Cargo.toml Adds libc dependency to support dlopen/dlsym.
test/integration_test.go Adds CAP_BPF gating, DWARF walker tests, and pprof fidelity assertions.
test/go.mod Bumps Go version and promotes several deps to direct requirements for tests.
profile/profiler_test.go Adds tests for address propagation into frames and resolver wiring.
profile/profiler.go Switches to frame-based stacks, batches symbolization, and wires procmap resolver into pprof building.
profile/perf_dwarf_x86_bpfel.go Adds generated bpf2go bindings for perf_dwarf (x86).
profile/perf_dwarf_test.go Adds a verifier smoke test for loading perf_dwarf BPF.
profile/perf_dwarf_arm64_bpfel.go Adds generated bpf2go bindings for perf_dwarf (arm64).
profile/offcpu_dwarf_x86_bpfel.go Adds generated bpf2go bindings for offcpu_dwarf (x86).
profile/offcpu_dwarf_export.go Exposes an API wrapper to load/use the offcpu_dwarf program and maps.
profile/offcpu_dwarf_arm64_bpfel.go Adds generated bpf2go bindings for offcpu_dwarf (arm64).
profile/gen.go Extends go:generate to build new DWARF BPF objects for amd64/arm64.
profile/dwarf_export.go Exposes an API wrapper to load/use the perf_dwarf program and maps.
pprof/pprof_test.go Updates tests to frame stacks; adds fidelity and perf-map decoder coverage.
pprof/pprof.go Introduces Frame model, resolver-based mapping/location fidelity, and perf-map decoding.
perfagent/options.go Adds config option + option helper for selecting unwind mode.
perfagent/agent.go Dispatches to FP vs DWARF profilers based on --unwind, adds adapters.
offcpu/profiler.go Refactors off-CPU symbolization to frame stacks and batched lookups.
main.go Adds --unwind flag defaulting to auto.
internal/bpfstack/stack_test.go Adds unit tests and benchmarks for parsing raw stackmap buffers.
internal/bpfstack/stack.go Adds shared helper to decode stackmap IPs.
go.mod Bumps module Go version to 1.26.0.
docs/unwind-auto-refinement-design.md Adds a follow-up design doc for reducing auto mode startup cost.
docs/superpowers/specs/2026-04-25-s10-pgo-converter-idea.md Adds idea note for future pprof→LLVM sample profile conversion.
docs/superpowers/specs/2026-04-25-gpu-profiling-design.md Adds draft GPU profiling design spec (out of scope follow-up).
docs/superpowers/specs/2026-04-24-s9-pprof-fidelity-design.md Adds design spec capturing the S9 fidelity work implemented here.
docs/dwarf-unwinding-design.md Adds DWARF unwinding design spec (architecture and staging).
cmd/test_blazesym/main.go Removes an old blazesym diagnostic binary.
bpf/unwind_common.h Adds shared BPF types/maps/helpers for hybrid FP+DWARF unwinding.
bpf/perf_dwarf.bpf.c Adds DWARF-capable CPU sampling BPF program emitting ringbuf records.
bpf/offcpu_dwarf.bpf.c Adds DWARF-capable off-CPU BPF program for sched_switch-based sampling.
README.md Updates docs to reflect DWARF/hybrid unwinding and pprof fidelity changes.
Makefile Sets GOTOOLCHAIN=auto and extends workload/test targets (Rust probe + unwind tests).
.claude/settings.json Adds editor plugin config (non-runtime).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread offcpu/profiler.go
Comment on lines +174 to 191
ips := bpfstack.ExtractIPs(stack)
if len(ips) > 0 {
symbols, err := pr.symbolizer.SymbolizeProcessAbsAddrs(
ips,
samplePid,
blazesym.ProcessSourceWithPerfMap(true),
blazesym.ProcessSourceWithDebugSyms(true),
)
if err != nil {
log.Printf("Failed to symbolize: %v", err)
break
} else {
for _, s := range symbols {
for _, f := range blazeSymToFrames(s) {
sb.append(f)
}
}
}

sb.append(symbol[0].Name)
}
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Off-CPU symbolization drops the per-IP address, so pprof.Frame.Address remains 0 for all off-CPU frames. With the new pprof builder, Address=0 forces fallback location-keying (and prevents mapping/file-offset fidelity) even when ips are available. Fix by iterating symbols with an index and threading ips[i] into the frame conversion (mirroring profile/profiler.go’s blazeSymToFrames(s, ips[i]) approach).

Copilot uses AI. Check for mistakes.
Comment thread perfagent/options.go Outdated
Comment on lines +17 to +26
// TestProfilerEndToEnd runs the full dwarfagent stack against the
// rust-workload and asserts that the resulting pprof contains at
// least one sample naming cpu_intensive_work.
func TestProfilerEndToEnd(t *testing.T) {
if os.Getuid() != 0 {
caps := cap.GetProc()
have, _ := caps.GetFlag(cap.Permitted, cap.BPF)
if !have {
t.Skip("requires root or CAP_BPF")
}
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test only gates on CAP_BPF, but loading/attaching the DWARF profilers also requires additional capabilities (e.g., PERFMON/SYS_PTRACE/CHECKPOINT_RESTORE, and memlock handling). As written, the test can fail (instead of skip) on systems where CAP_BPF is present but the other required caps are missing. Consider reusing the broader capability gate used elsewhere in the repo (or checking the full set needed by the loader) and handling a potential nil/failed cap.GetProc() defensively.

Suggested change
// TestProfilerEndToEnd runs the full dwarfagent stack against the
// rust-workload and asserts that the resulting pprof contains at
// least one sample naming cpu_intensive_work.
func TestProfilerEndToEnd(t *testing.T) {
if os.Getuid() != 0 {
caps := cap.GetProc()
have, _ := caps.GetFlag(cap.Permitted, cap.BPF)
if !have {
t.Skip("requires root or CAP_BPF")
}
// haveRequiredProfilerCaps reports whether the current process has the
// capabilities required to load and attach the DWARF profiler stack when
// not running as root.
func haveRequiredProfilerCaps() (bool, string) {
if os.Getuid() == 0 {
return true, ""
}
caps := cap.GetProc()
if caps == nil {
return false, "cannot inspect process capabilities"
}
required := []struct {
value cap.Value
name string
}{
{value: cap.BPF, name: "CAP_BPF"},
{value: cap.PERFMON, name: "CAP_PERFMON"},
{value: cap.SYS_PTRACE, name: "CAP_SYS_PTRACE"},
{value: cap.CHECKPOINT_RESTORE, name: "CAP_CHECKPOINT_RESTORE"},
{value: cap.IPC_LOCK, name: "CAP_IPC_LOCK"},
}
missing := make([]string, 0, len(required))
for _, req := range required {
have, err := caps.GetFlag(cap.Permitted, req.value)
if err != nil || !have {
missing = append(missing, req.name)
}
}
if len(missing) > 0 {
return false, "missing required capabilities: " + strings.Join(missing, ", ")
}
return true, ""
}
// TestProfilerEndToEnd runs the full dwarfagent stack against the
// rust-workload and asserts that the resulting pprof contains at
// least one sample naming cpu_intensive_work.
func TestProfilerEndToEnd(t *testing.T) {
if ok, reason := haveRequiredProfilerCaps(); !ok {
t.Skipf("requires root or profiler capabilities: %s", reason)

Copilot uses AI. Check for mistakes.
Comment thread unwind/dwarfagent/agent.go Outdated
Comment thread pprof/pprof.go
Comment on lines +241 to +244
resolver *procmap.Resolver
mappings map[mappingKey]*profile.Mapping
locations map[any]*profile.Location
functions map[any]*profile.Function
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using map[any]... for locations and functions weakens type safety and introduces a sharp edge: inserting a non-comparable key (even accidentally) will panic at runtime. Since these maps are internal and already have a small, closed set of key types, consider splitting them into strongly typed maps (e.g., map[locationKey]*profile.Location and map[locationFallbackKey]*profile.Location, similarly for functionKey) or introducing a single comparable tagged-key struct to keep compile-time guarantees.

Copilot uses AI. Check for mistakes.
Comment thread pprof/pprof_test.go Outdated
Comment thread pprof/pprof.go
Comment on lines +225 to +238
// looksJIT returns true when frame.Name matches one of the perf-map
// runtime prefixes (Python, Node). decodePerfMapFrame zeros Address
// for these in Task 8; before then, the Address field may still be
// nonzero but we keep these on the [jit] sentinel because anonymous
// JIT mappings have no file-offset identity.
func looksJIT(f Frame) bool {
return strings.HasPrefix(f.Name, "py::") ||
strings.HasPrefix(f.Name, "JS:") ||
strings.HasPrefix(f.Name, "LazyCompile:") ||
strings.HasPrefix(f.Name, "Function:") ||
strings.HasPrefix(f.Name, "Builtin:") ||
strings.HasPrefix(f.Name, "Code:") ||
strings.HasPrefix(f.Name, "Script:")
}
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment references a staged rollout ('Task 8') and suggests Address might still be nonzero 'before then', but decodePerfMapFrame in this diff unconditionally zeros Address when it recognizes Python/Node formats. Updating this comment to describe the current behavior (without stage references) will keep the code documentation accurate.

Copilot uses AI. Check for mistakes.
Comment on lines +6 to +14
import (
"bytes"
_ "embed"
"fmt"
"io"
"structs"

"github.com/cilium/ebpf"
)
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generated bpf2go bindings import a package named structs via a bare import path. Unless this repo provides a top-level structs package/module with HostLayout, this will not compile. If this is intended, please ensure the structs package is present and in-module; otherwise, regenerate with the correct import path or adjust the generator configuration so the import resolves consistently in module builds.

Copilot uses AI. Check for mistakes.
Comment thread test/integration_test.go Outdated
v2.8.0 was built with Go 1.25 and refuses to lint a project targeting
Go 1.26 with: 'the Go language version (go1.25) used to build
golangci-lint is lower than the targeted Go version (1.26.0)'.
v2.11.x is built against Go 1.26+.
@dpsoft dpsoft requested a review from Copilot April 25, 2026 21:16
errcheck: explicit '_ =' on best-effort Close/Munmap calls (defer
cleanup paths), or wrap defer in a lambda where appropriate.
ineffassign: drop dead pcEnd reassignment in parseSample.
staticcheck: apply QF1008/QF1011/QF1001 simplifications.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 49 out of 111 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread test/integration_test.go
// `pids` map (populated via objs.AddPID above) restricts emission to the
// workload's TGID. Using pid=workloadPID here would sample ONLY that
// specific TID, missing the worker threads where the actual CPU load runs.
for cpu := range ncpu {
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

range over an int (ncpu) does not compile. Replace this with an indexed loop (e.g., for cpu := 0; cpu < ncpu; cpu++) to iterate CPUs.

Suggested change
for cpu := range ncpu {
for cpu := 0; cpu < ncpu; cpu++ {

Copilot uses AI. Check for mistakes.
Comment thread test/workloads/rust/src/main.rs Outdated
Comment thread test/integration_test.go Outdated
dpsoft added 3 commits April 25, 2026 18:21
Conflict in pprof/pprof.go between PR #8's pre-S5 method-based dedup
(f.locationKey() / f.functionKey() returning frameKey) and S9's
struct-based intern keys + sentinels + looksJIT helper.

Resolution: keep S9. The methods are dead code in S9's world —
addLocation now constructs locationKey/locationFallbackKey/functionKey
structs directly, never calls Frame.locationKey() or Frame.functionKey().

go build, pprof unit tests, and procmap unit tests all pass after the
merge.
…sgs, fidelity asserts, Acquire ordering

- offcpu/profiler.go: thread Address through blazeSymToFrames; create
  procmap.Resolver in NewProfiler; pass into BuildersOptions; close in
  Profiler.Close. Off-CPU pprof now carries the same address+mapping
  fidelity as on-CPU.
- perfagent/options.go: Config.Unwind doc reflects auto default.
- unwind/dwarfagent/agent.go: 'no perf events attached' error message
  no longer implies the PID exited (wrong in system-wide mode).
- pprof/pprof_test.go: rename TestAddSameNameDifferentModulesMakesDistinctFunctions
  to ...DedupsWithoutResolver and update the comment to describe
  current behavior; drop Task 1 fixture reference; scrub stage markers.
- pprof/pprof.go: looksJIT comment describes current behavior; drop
  pre-S9 reference in locationFallbackKey doc; drop 'After Task 8' in
  addLocation comment.
- test/integration_test.go: apply assertPprofFidelity to off-CPU tests
  too; doc comment reflects the >=1 threshold; drop S9: prefixes.
- test/workloads/rust/src/main.rs: probe.load uses Acquire to pair
  with Release store.
- bpf/unwind_common.h: drop (Task 5) from walker helper comment.
- unwind/ehmaps/ehmaps_runtime_test.go: drop 'in Task 9' phrase.
Local golangci-lint v2.11.4 caught 11 issues that the previous CI-driven
fix pass missed. Fixed:
- errcheck on Close() calls in dwarfagent/agent.go, dwarfagent/offcpu.go,
  ehcompile/ehcompile.go, ehcompile/ehcompile_test.go,
  ehmaps/ehmaps.go (multiple sites in pid_mapping/classification),
  ehmaps/tracker_test.go.
- QF1008 in dwarfagent/offcpu.go: drop redundant .session selector on
  embedded field for collect() and close() calls.
- QF1001 in procmap/procmap_test.go: apply De Morgan's law to hex-char
  test in TestReadBuildID.

Verified locally: golangci-lint run --timeout=5m → 0 issues.
@dpsoft dpsoft changed the title DWARF stack unwinding + pprof address/mapping fidelity (S3-S9) DWARF stack unwinding + pprof address/mapping fidelity Apr 25, 2026
CI failure on amd64 (arm64 same code passed) where the system-wide
DWARF capture happened to land on binaries without GNU build-ids.
This is environmental — kernel threads, custom Rust release builds
with stripped notes, or older system binaries can legitimately have
no .note.gnu.build-id section.

The mapping count + non-zero Address checks remain the load-bearing
fidelity guarantees. BuildID presence is now logged via t.Logf so a
regression that wipes build-ids globally is still visible in test
output, but the test no longer fails on captures that legitimately
sample stripped code.
@dpsoft dpsoft merged commit 8d24076 into main Apr 25, 2026
10 checks passed
@dpsoft dpsoft deleted the feat/dwarf-unwinding branch April 25, 2026 22:08
dpsoft added a commit that referenced this pull request May 21, 2026
#25)

* fix(kernel-stacks): lockdown-safe kernel symbolization + perf.data userspace MMAP2

Three production gaps shipped in v1.2.0 M1 kernel-stacks:

1. Kernel symbolization fails under lockdown=integrity (Secure Boot).
   blazesym's kernel source unconditionally probes /proc/kcore for
   DWARF, which is 0400 root:root + CAP_SYS_RAWIO-gated. perf-agent
   binaries setcap'd with the standard cap_perfmon/cap_bpf set lack
   that, so blazesym returns BLAZE_ERR_PERMISSION_DENIED for every
   batch and kernel frames disappear from the pprof. Setting
   vmlinux="" doesn't help — blazesym still hits kcore.

   Fix: pure-Go /proc/kallsyms symbolizer (symbolize/kallsyms.go) as
   the lockdown-safe fallback. On EPERM from blazesym,
   LocalKernelSymbolizer switches to the kallsyms path for the rest
   of its lifetime (sticky). Resolution loses inline expansion and
   file:line attribution but keeps function name + offset + module
   marker, which is the load-bearing property for flame graphs.
   PERFAGENT_FORCE_KERNEL_FALLBACK=1 lets CI exercise the fallback
   on hosts that don't naturally hit EPERM.

2. When every symbolizer fails, kernel frames are dropped entirely
   instead of being preserved as raw addresses. Caller's pprof loses
   kernel context.

   Fix: rawKernelAddrFrames synthesizes Frame{Name:"0x<hex>",
   Module:"[kernel.kallsyms]", Reason:FailureMissingSymbols} so
   operators can still decode hot kernel frames via /proc/kallsyms
   (the existing awk workaround the spec calls out), and the pprof
   keeps structural shape.

3. perf.data files written by --perf-data-output contain the
   [kernel.kallsyms]_text MMAP2 but no userspace MMAP2 records.
   `perf script` / `perf report` show [unknown] for every user-space
   IP, forcing operators to post-process with the maps.txt + nm hack.

   Fix: perfdata.Writer.AddUserspaceMmaps walks /proc/<pid>/maps via
   the existing procmap.Resolver and emits one PERF_RECORD_MMAP2
   per executable mapping for the target PID. Wired into agent.go
   right after AddKernelMmap. System-wide (-a) is out of scope —
   left as a documented follow-up.

Tests: stub-based unit tests for the fallback ladder and raw-address
preservation (no kallsyms-readable host required), plus four
integration tests including a forced-fallback variant that exercises
the kallsyms path on every host.

* docs(roadmap): post-prod-hardening improvement track

Captures 10 follow-ups surfaced while fixing the v1.2.0 kernel-stacks
production gaps:

- self-profile lane (would have caught the lockdown bug at PR time)
- metrics.Exporter histograms for symbolize hot paths
- /metrics + /debug/pprof HTTP endpoint
- symbolizer error counter by reason (lockdown_eperm, kallsyms_unreadable, ...)
- CI lane that runs kernel-stacks integration with
  PERFAGENT_FORCE_KERNEL_FALLBACK=1
- go test -bench budgets for symbolize
- allocs/sample budget
- system-wide userspace MMAP2 (the -a follow-up to this PR)
- lazy MMAP2 on first new-PID sample
- kthread MMAP / COMM attribution audit

Prioritized so the first three would prevent the next ship-and-pray.

* feat(kernel-stacks): ship roadmap items #2, #5, #8, #10 in same PR

Four follow-ups from docs/post-prod-hardening.md landed together with
the bug fixes — small enough to ship in one commit, large enough to
matter:

#5 CI lockdown lane: new `make test-integration-lockdown` target
   runs the kernel-stacks integration tests with
   PERFAGENT_FORCE_KERNEL_FALLBACK=1, wired into the default `make
   test`. Every CI run now exercises the kallsyms fallback path
   regardless of the host's lockdown state — would have caught the
   original v1.2.0 ship-and-pray.

#2 metrics counters: symbolize.Counters (atomic) tracks
   KernelBatches, KernelInputIPs, KernelBatchFailures,
   KernelFallbackEngaged, KernelRawAddrFrames. Logged at agent
   shutdown via agent.cleanup. Histograms (p50/p99 batch latency)
   left as follow-up (requires reservoir sampling). The
   fallback_engaged counter bumps both on natural EPERM-triggered
   switch AND on env-var forced fallback — operators see the active
   mode from the end-of-run log without scraping.

#8 system-wide userspace MMAP2: emitCommAndMmapsForAllPIDs walks
   /proc at writer init and emits PERF_RECORD_MMAP2 for every
   non-kthread PID's executable mappings (extends Bug 3 single-PID
   fix to -a mode). TestPerfDataUserspaceMmap2_SystemWide verifies
   the walk covers multiple distinct PIDs (saw 9068 on this host).
   Snapshot semantics; lazy emission is roadmap #9.

#10 COMM attribution audit: found that perfdata.Writer.AddComm was
    never called from anywhere in perf-agent — every perf.data
    emitted had zero COMM records. Not just kthreads — all PIDs
    showed up nameless in `perf script`. emitCommForPID reads
    /proc/<pid>/comm and emits PERF_RECORD_COMM for every PID
    enumerated, including kthreads (kvm-pit, vhost-*, kworker/*).
    TestPerfDataUserspaceMmap2 asserts the COMM record appears
    alongside the MMAP2 record for the workload pid.

Modernized three for-loops to Go 1.22 `for i := range N` style
in the modified files (per Go 1.26 project target).

Roadmap doc updated to mark shipped items and reorder remaining
priorities — #1 (self-profile lane) is now the highest-leverage
follow-up.

* feat(bench): self-profile scenario (roadmap #1) — profile the profiler

A second perf-agent profiles the first while it captures a CPU
workload. Catches:

  - perf-agent's own CPU overhead (agent_samples / workload_samples
    at the same sample rate — directly comparable)
  - kernel-symbol resolution rate (fraction of kernel Locations whose
    Function.Name isn't a "0x<hex>" raw-address fallback)

The kernel-resolution canary is what would have surfaced the v1.2.0
lockdown bug at PR time: a sudden drop from ~100% to 0% on lockdown
hosts would have been impossible to miss in CI.

Wiring:
  - new "self" case in bench/cmd/scenario, with --self-duration,
    --cpu-budget, --resolution-budget flags
  - schema.SelfMetrics inlined into Run (omitzero, no effect on
    other scenarios' JSON)
  - countKernelResolution helper + unit tests for the metric logic
  - non-zero exit (3) when budgets breached so CI can gate on it
  - make bench-self target with 1.5 CPU / 0.5 resolution defaults
    (calibrated to catch regressions, not enforce a tight target
    on this codebase as-of-today)

Verified end-to-end: 2 runs on this host showed agent/workload CPU
ratios of 0.38 and 0.85 with 100% kernel resolution — both within
budget, scenario exits 0. Tight budgets (--cpu-budget 0.01) exit 3.

Skips the in-process caps check for "self" because the scenario is
pure orchestration; the perf-agent subprocesses carry their own
file caps.

bench-self Makefile target deliberately depends only on bench-build
(not build) so it doesn't wipe the manually-applied setcap on
perf-agent each invocation.

Roadmap doc updated to mark #1 shipped; remaining items reordered.

* fix(symbolize): userspace fallback chain — discovered via bench-self

Two related bugs surfaced by dogfooding the self-profile scenario
(roadmap #1) against this PR's own perf-agent binary:

A. DebuginfodSymbolizer doesn't fall back to local on NULL.
   When DEBUGINFOD_URLS points at a server that doesn't have the
   target binary's build-id (the normal case for locally-built
   perf-agent), blaze_symbolize_process_abs_addrs returns NULL and
   the dispatcher gave up — every userspace frame became
   <unknown>. Fix: keep a LocalSymbolizer next to the cgo state;
   on NULL, retry through it before erroring.

B. Userspace symbolize EPERM when target has file caps.
   Profiling a setcap'd process (e.g., a second perf-agent) hits
   "permission denied" from blazesym because /proc/<pid>/exe is
   restricted by ptrace_scope when the target is privileged. The
   error propagated as "Failed to symbolize user" and the whole
   batch was dropped. Fix: on any blazesym error,
   LocalSymbolizer.SymbolizeProcess now returns rawUserAddrFrames
   (Name="0x<hex>", Reason=FailureMissingSymbols) — symmetric to
   the kernel-side rawKernelAddrFrames fix in this PR's first
   commit. Stack shape and addresses survive into the pprof so
   operators can recover with addr2line.

Full per-mapping ELF resolution (the proper fix that bypasses
/proc/<pid>/exe entirely by using procmap-supplied paths) is left
as a roadmap follow-up — see docs/post-prod-hardening.md "Findings
from self-profile" section.

Verified end-to-end: profiling perf-agent with perf-agent on a
setcap'd binary went from 100% <unknown> (whole batch dropped) to
named raw-hex frames + structural mapping info; addr2line on those
hex names produced the hot-function ranking that drives the new
roadmap section (cilium/ebpf BTF parse, net.* DNS for debuginfod,
SQLite cache init, etc.). No "permission denied" log spam.

* perf(symbolize, bpfstack): close the dogfood loop — three iterations

Running the self scenario four times in sequence, each iteration
exposed the next bottleneck once the previous one was fixed:

iter 2 → fix per-IP empty-name rendering
  frameFromCSym and fromBlazesymSym left Name="" when blazesym
  resolved the binary but couldn't map a specific IP to a symbol.
  pprof rendered those as <unknown>. Filling with "0x<hex>" +
  FailureMissingSymbols matches the kernel-side behavior shipped
  earlier in this PR. <unknown> went from 76% to ~0%.

iter 3 → fix BPF user-stack kernel leak + speed up /proc/kallsyms
  bpf_get_stackid with BPF_F_USER_STACK can include kernel
  addresses in the user-stack buffer when the sampled task was
  in syscall/irq/fault context. That leak made 40% of "user"
  CPU look like 0xffff...kernel-range hex names in the pprof,
  with the user code below them invisible. bpfstack.SplitUserKernelIPs
  partitions ips by the canonical x86_64 kernel threshold and
  routes the strays to the kernel symbolizer (where they belong).

  Once the leak was fixed, the new top of the profile was
  /proc/kallsyms reading (module_get_kallsym, seq_read_iter,
  vsnprintf — all kernel-side). The kallsyms file is synthesized
  by the kernel: each small read() forces vsnprintf for every
  line. The default 4 KiB scanner buffer meant ~700 read syscalls
  for 3M kallsyms lines. Wrapping the file in a 256 KiB bufio
  Reader cut that ~64x and dropped kallsyms entirely out of the
  top 20.

iter 4 → no fix; steady-state hot paths surface
  do_syscall_64 (42% cum), finish_task_switch (21%), ep_poll
  (10%), ringbuf.readRecord, dwarfagent.aggregateCPUSample,
  Syscall6 / nanotime1 / futex. All intrinsic costs of being
  an eBPF-based sampler. Further wins are diminishing returns.

Roadmap doc captures the iteration log so future contributors see
the dogfood-as-process pattern: ship #1, run it against itself,
let it tell you what to fix next.

Tests: new internal/bpfstack/split_test.go covers SplitUserKernelIPs
edge cases (kernel-at-top leak, no-leak passthrough, all-kernel,
boundary value). All existing unit suites still pass.

* perf(symbolize): allocation-free /proc/kallsyms parser

bench-self iter 5 (A=with --kernel-stacks, B=without) surfaced
strings.Fields + strconv.ParseUint as the top user-side allocation
source: 3M kallsyms lines × multi-string Fields × ParseUint
triggered visible GC pressure (mallocgc / sweepone /
tryDeferToSpanScan in the user-side pprof's top 10).

Replaced with parseKallsymsLine: byte-level scan over the read
buffer, no string allocation per line. Names are copied out once
per kept symbol (mandatory — buffer reuses). Module strings go
through an intern map (typical kernel has tens of modules but
millions of module symbols).

  BenchmarkParseKallsymsLine-16  72630638  16.54 ns/op  0 B/op  0 allocs/op

iter 6 confirmed: user-side allocation hotspots are gone. The
remaining kallsyms cost is kernel-side (module_get_kallsym +
vsnprintf rendering the synthesized /proc/kallsyms file) and
can only be avoided by NOT parsing on every invocation. Roadmap
doc adds a "remaining bottleneck" section proposing a disk-cached
index keyed by kernel boot_id as the next follow-up.

Also lifts strings.Fields-based parsing out of the kallsyms hot
path entirely — net deletion of 2 imports (strconv, strings) from
kallsyms.go.

* perf(symbolize): disk-cached /proc/kallsyms index keyed by boot_id

bench-self iter 6 identified the kernel synthesizing /proc/kallsyms
(module_get_kallsym + vsnprintf in the profile) as the floor —
unfixable from userspace because the file is generated on read.
The fix is to avoid re-reading on every agent invocation.

Cache key: kernel boot_id from /proc/sys/kernel/random/boot_id.
Kernel symbol addresses change only across reboots (KASLR), so
boot_id is exactly the right signal — drops the cache when it
would be wrong, never falsely retains a stale copy.

Format (little-endian):

  header: magic u32 | version u32 | boot_id [16]byte | n_syms u32
  per-symbol: addr u64 | name_len u16 | mod_len u16 | name | module

~50 bytes/symbol, ~11 MiB on disk for this host's filtered kallsyms
(~225k T/t/W/w/i entries). atomic write via temp + rename, so
concurrent readers never see a partial file.

Behavioral flow in newKallsymsSymbolizer:
  1. loadCachedKallsyms → return on hit (boot_id matches)
  2. parseKallsymsFresh on stale / corrupt / missing
  3. best-effort writeKallsymsCache for next invocation

iter 7 dogfood (cold vs warm cache run, same workload + flags):

  metric            cold    warm   delta
  module_get_kallsym 16%    11%    -45%
  vsnprintf (cum)    13%     5%    -67%
  seq_read_iter (cum)40%    n/a    out of top 12
  total CPU       676ms   565ms    -16%

Tests cover roundtrip, stale boot_id rejection, missing file,
and corrupt-file non-fatal handling. cachePathFn and readBootIDFn
are package vars so tests can pin both deterministically.

Next opportunity (next commit): the remaining ~11% module_get_kallsym
in the warm run is blazesym's FAILED kernel-source attempt reading
kallsyms before hitting EPERM on /proc/kcore. A persistent
"blazesym EPERM'd on this boot" marker would let perf-agent skip
that attempt entirely on subsequent invocations.

* perf(symbolize): persistent blazesym-EPERM marker — skip failing attempt

Iter 7 dogfood revealed that even with a warm kallsyms disk cache,
the FIRST blazesym call per agent invocation still cost ~16% of
perf-agent CPU: blazesym opens /proc/kallsyms (the read shows up
as module_get_kallsym + vsnprintf in the profile) before
discovering /proc/kcore is restricted and returning EPERM. The
sticky in-process bit prevents subsequent batches from re-paying,
but every new agent process re-attempts blazesym from scratch.

Fix: a boot_id-scoped marker file at
~/.cache/perf-agent/blazesym-eperm-<hex_boot_id>. Written on the
first observed EPERM. Checked at LocalKernelSymbolizer
construction: if present, fallback is pre-engaged and blazesym is
never invoked. Marker is automatically invalidated by reboot
(boot_id changes), so a kernel upgrade that loosens lockdown is
re-detected on next boot.

iter 8 dogfood (cold → warm cache + warm marker, same workload):

  metric                cold    warm    delta
  module_get_kallsym    16%     ZERO    out of top 12
  vsnprintf (cum)       14%     ZERO    out of top 12
  do_syscall_64 (cum)   52%      0%     down (no failing blazesym calls)
  total perf-agent CPU 505ms   131ms    -74%

Warm pprof now shows intrinsic eBPF-sampler costs only:
finish_task_switch (21%), ep_poll waiting (69% cum), spin_unlock.
Diminishing returns from here — every remaining sample is genuine
"be a profiler" work.

Tests pin XDG_CACHE_HOME via t.Setenv + override readBootIDFn so
the marker lands in a t.TempDir(). Covers: roundtrip,
boot_id-scoped invisibility (post-reboot equivalence), and
filename structurally encodes the boot_id.

* perf(perfdata): lazy MMAP2/COMM on first sample per PID (roadmap #9)

bench-self iter 9 found system-wide capture burning ~30% of
perf-agent CPU in kernel /proc/<pid>/maps rendering
(show_map_vma, mangle_path, lock_next_vma, prepend_path, d_path)
because emitCommAndMmapsForAllPIDs walked every PID at writer
init — ~9000 reads on this host, of which only ~200 produce
samples in a typical capture.

Fix: perfdata.Writer.OnNewPID callback fires the first time
AddSample sees each unique pid (sentinel-filtered: skips 0 and
0xffffffff). agent.go wires it to emitCommForPID +
emitUserspaceMmapsForPID for system-wide mode; the eager walk
is gone. Single-PID mode is unchanged (always knows its target).

iter 11 dogfood (system-wide, --perf-data-output ON):
  perf.data size:  ~3 MB (vs ~40-50 MB the eager walk produced)
  enrolled PIDs:   ~208 lazy (vs ~9000 eager)
  remaining /proc/maps cost: ~15% cum, bounded by sampled-PID
    count rather than host-PID count — natural floor from
    procmap.Resolver's per-batch re-snapshot (spec invariant)
    and dwarfagent's lazy attach.

Bonus: lazy mode also covers PIDs that exec AFTER capture
starts. The eager walk could never see those.

Tests pin the contract: fires once per unique pid (multiple
samples for same pid don't re-trigger); sentinel pids are
filtered; nil callback is a no-op (existing --pid-mode callers
keep working unchanged).

* feat(symbolize): reason-bucketed error counters + bench coverage (roadmap #4, #6)

#4 reason-bucketed counters:
  - KernelLockdownEPERM bumps on each BLAZE_ERR_PERMISSION_DENIED
  - KernelOtherErr bumps on every other blazesym failure
  - CountersSnapshot.String() now logs eperm=N and other_err=N

  Operators can distinguish "canonical lockdown" (high eperm,
  fallback_engaged=1) from "unexpected blazesym failure"
  (any other_err > 0 deserves attention) without grepping logs.

#6 symbolize benchmarks:
  - BenchmarkParseKallsymsFresh (cold-path baseline)
  - BenchmarkLoadCachedKallsyms (warm-path; ~20x faster)
  - BenchmarkResolveKernelIPs (per-IP cost ~35 ns)
  - BenchmarkParseKallsymsLine kept for per-line regression
  - new `make bench-symbolize` target

  The 20x cache speedup is the headline metric. PRs touching
  the symbolize hot path can now show numerical evidence
  rather than vibes.

Roadmap doc reflects shipped status; both items move from
Pending to Shipped in the triage table.

* feat(perfagent): --metrics-listen HTTP endpoint (roadmap #3)

--metrics-listen <addr> starts an in-process HTTP server on the
given address (e.g. 127.0.0.1:7777) exposing:

  /metrics       Prometheus text format of symbolize.Counters
                 (batches, input_ips, batch_failures, fallback_engaged,
                 raw_addr_frames, lockdown_eperm, other_err)
  /debug/pprof/  Go runtime self-pprof — live scrape with
                 `go tool pprof http://host:7777/debug/pprof/profile`

Default off — no port opened, zero CPU overhead when unused.

Lifecycle: server starts after kernelSymbolizer is constructed
(so the metrics handler closure can snapshot live counters);
shutdown happens in Agent.cleanup AFTER the symbolizer is
closed, with a 2s graceful drain.

Tests cover:
  - Prometheus exposition format (HELP + TYPE + value lines per
    metric, all 7 counters present)
  - Lifecycle round-trip against a :0 listener (live scrape,
    Stop returns within grace, post-Stop scrape fails)
  - Empty-addr no-op (--metrics-listen not set → nil server)
  - /debug/pprof/ mount works through the metrics mux

End-to-end verified: agent run with --metrics-listen serves
the canonical Prometheus payload AND a working gzipped pprof
profile (1f 8b 08 magic).

* feat(symbolize): batch-duration histograms + allocs budget (roadmap #2, #7)

#2 batch-duration histograms:
  - LatencyHist: ring buffer (1024 samples) + min/max/sum, with
    p50/p99 on Snapshot via slices.Sort.
  - Wired into SymbolizeKernel: time.Now() on entry, Record on
    deferred exit. Microseconds — readable across warm-cache
    (sub-ms) and cold-CGO (millisecond+) paths.
  - End-of-run log line gains batch_p50_us / batch_p99_us /
    batch_max_us fields.
  - /metrics endpoint exposes three new gauges:
    perf_agent_symbolize_kernel_batch_{p50,p99,max}_microseconds

  Sliding-window semantics: an early burst of slow batches
  doesn't permanently bias the percentile — only the most
  recent 1024 stay in the ring. Tested with the
  TestLatencyHist_RingOverwrite case.

#7 allocs budget:
  - testing.AllocsPerRun gates on the three hottest paths:
    parseKallsymsLine (0 alloc/op), kallsymsSymbolizer.Resolve
    (1 alloc/op = the return slice), LatencyHist.Record
    (0 alloc/op).
  - PRs that reintroduce strings.Fields, strconv.ParseUint, or
    other allocators into these paths will fail the test
    immediately — not just look worse in benchstat.

  Catches the regression class dogfood iter 5 surfaced (mallocgc
  / sweepone in the user-side profile of perf-agent).

Roadmap doc: items #2 and #7 marked shipped (#7 partial — the
symbolize hot path is gated; sample-processing follow-up is
documented).

* chore(perfagent): drop dead emitCommAndMmapsForAllPIDs

CI lint caught it: the helper was the eager-walk path replaced by
the lazy OnNewPID hook in 405611a (roadmap #9). emitCommForPID
and emitUserspaceMmapsForPID are now the only callers — both
already exported through the lazy callback. Dead.

* docs(README): drop GPU-profiling experimental notice

GPU profiling lives on its own track (and its own worktree). The
README banner was conflating it with the stable CPU / off-CPU /
PMU paths; dropping the line keeps the front page honest about
what this build delivers.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants