lib mode support by dpsoft · Pull Request #1 · dpsoft/perf-agent

dpsoft · 2026-01-26T18:16:44Z

Add runqueue latency tracking (time waiting for CPU) in cpu.bpf.c
Add task state classification (preempted, voluntary, I/O wait) to PMU metrics
Refactor goto statement in cpu.bpf.c for better code clarity
Add perfagent library package for programmatic usage
Add metrics package with console and exporter interfaces
Update PMU output to show On-CPU Time, Runqueue Latency, and Context Switch Reasons
Add comprehensive integration tests for new metrics
Fix LD_LIBRARY_PATH handling in CI/CD and test scripts
Update README with library usage examples and new metrics documentation

Blazesym Vendoring Strategy:

Use local internal/blazesym wrapper for Go 1.22+ compatibility
Official github.com/libbpf/blazesym/go requires Go 1.25+
C library (libblazesym_c.a) built separately from upstream
Maintains backward compatibility while using latest blazesym C API

- Add runqueue latency tracking (time waiting for CPU) in cpu.bpf.c - Add task state classification (preempted, voluntary, I/O wait) to PMU metrics - Refactor goto statement in cpu.bpf.c for better code clarity - Add perfagent library package for programmatic usage - Add metrics package with console and exporter interfaces - Update PMU output to show On-CPU Time, Runqueue Latency, and Context Switch Reasons - Add comprehensive integration tests for new metrics - Fix LD_LIBRARY_PATH handling in CI/CD and test scripts - Update README with library usage examples and new metrics documentation Blazesym Vendoring Strategy: - Use local internal/blazesym wrapper for Go 1.22+ compatibility - Official github.com/libbpf/blazesym/go requires Go 1.25+ - C library (libblazesym_c.a) built separately from upstream - Maintains backward compatibility while using latest blazesym C API

Copilot

Pull request overview

This PR adds library mode support to perf-agent by introducing a perfagent package for programmatic usage, along with a metrics package providing exporter interfaces for flexible metrics output. The changes refactor profile collection to support streaming output, replace dependencies on go-perf and the official blazesym/go with an internal blazesym wrapper for Go 1.22+ compatibility, and add comprehensive integration tests for the new library features.

Changes:

Adds perfagent package with agent abstraction and configuration options for programmatic usage
Adds metrics package with exporter interface and console implementation for flexible metrics output
Refactors CPU and off-CPU profilers to support streaming (io.Writer) in addition to file-based output
Replaces external dependencies with internal blazesym wrapper and direct unix syscalls for perf events
Updates test infrastructure with LD_LIBRARY_PATH handling and adds library integration tests

Reviewed changes

Copilot reviewed 31 out of 35 changed files in this pull request and generated 10 comments.

Show a summary per file

File	Description
go.mod, test/go.mod	Updated Go version to 1.24.0 and removed go-perf/blazesym dependencies
perfagent/agent.go	Main agent implementation with Start/Stop lifecycle and configuration
perfagent/options.go	Functional options for agent configuration
perfagent/*_test.go	Unit and integration tests for library usage
metrics/types.go	Data structures for metrics snapshots
metrics/exporter.go	Exporter interface definition
metrics/console.go	Console exporter implementation
profile/profiler.go	Added Collect() method for streaming profile output
offcpu/profiler.go	Added Collect() method for streaming profile output
cpu/perf_events.go	Replaced go-perf with direct unix.PerfEventOpen syscalls
cpu/pmu_monitor.go	Added GetMetricsSnapshot and ExportMetrics methods
cpu/cpu_usage_collector.go	Added GetSnapshot and ExportMetrics for metrics export
internal/blazesym/*.go	Internal blazesym wrapper for Go 1.22+ compatibility
main.go	Refactored to use perfagent library
test/run_tests.sh	Added LD_LIBRARY_PATH handling for blazesym runtime loading
test/integration_test.go	Added library usage tests
.github/workflows/*.yml	Updated CI to pass LD_LIBRARY_PATH to sudo
Makefile	Updated test-integration target with LD_LIBRARY_PATH
README.md	Added library usage documentation and examples

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-01-26T18:49:23Z

+	// Double start should error (after stop, agent is no longer started)
+	err = agent.Start(ctx)
+	require.Error(t, err)


This test comment is misleading. Looking at Agent.Stop(), it doesn't set a.started = false, so after calling Stop(), the agent is still in the "started" state. The actual behavior here is that calling Start() after Stop() will error with "already started", not because it's "not started". The test behavior contradicts the comment.

Suggested change

// Double start should error (after stop, agent is no longer started)

err = agent.Start(ctx)

require.Error(t, err)

// Double start should error with "already started" (Stop does not reset started state)

err = agent.Start(ctx)

require.Error(t, err)

require.Contains(t, err.Error(), "already started")

Copilot · 2026-01-26T18:49:23Z

+
+	pointers := unsafe.Slice(input, length)
+
+	// Go 1.21 compatible loop (instead of: for i := range length)


The comment mentions "Go 1.21 compatible loop" but the module requires Go 1.24.0 (which doesn't exist). If the code is actually targeting Go 1.22 or 1.23, this loop style is already supported by those versions. The comment should be updated to match the actual Go version requirement, or the Go version should be corrected to an existing version.

Suggested change

// Go 1.21 compatible loop (instead of: for i := range length)

// Use an index-based loop instead of ranging directly over length.

Copilot · 2026-01-26T18:49:23Z

+func (a *Agent) Start(ctx context.Context) error {
+	a.mu.Lock()
+	defer a.mu.Unlock()
+
+	if a.started {
+		return errors.New("agent already started")
+	}
+
+	// Set capabilities
+	caps := cap.GetProc()
+	if err := caps.SetFlag(cap.Effective, true, cap.SYS_ADMIN, cap.PERFMON); err != nil {
+		return fmt.Errorf("set capabilities: %w", err)
+	}
+
+	// Remove memlock limit
+	if err := rlimit.RemoveMemlock(); err != nil {
+		return fmt.Errorf("remove memlock: %w", err)
+	}
+
+	// Get CPUs to monitor
+	cpus := a.config.CPUs
+	if len(cpus) == 0 {
+		var err error
+		cpus, err = cpuonline.Get()
+		if err != nil {
+			return fmt.Errorf("get online CPUs: %w", err)
+		}
+	}
+
+	// Start CPU profiler if enabled
+	if a.config.EnableCPUProfile {
+		profiler, err := profile.NewProfiler(
+			a.config.PID,
+			a.config.SystemWide,
+			cpus,
+			a.config.Tags,
+			a.config.SampleRate,
+		)
+		if err != nil {
+			return fmt.Errorf("create CPU profiler: %w", err)
+		}
+		a.cpuProfiler = profiler
+		if a.config.SystemWide {
+			log.Printf("CPU profiler enabled (system-wide, %d Hz)", a.config.SampleRate)
+		} else {
+			log.Printf("CPU profiler enabled (PID: %d, %d Hz)", a.config.PID, a.config.SampleRate)
+		}
+	}
+
+	// Start off-CPU profiler if enabled
+	if a.config.EnableOffCPUProfile {
+		profiler, err := offcpu.NewProfiler(
+			a.config.PID,
+			a.config.SystemWide,
+			a.config.Tags,
+		)
+		if err != nil {
+			a.cleanup()
+			return fmt.Errorf("create off-CPU profiler: %w", err)
+		}
+		a.offcpuProfiler = profiler
+		if a.config.SystemWide {
+			log.Println("Off-CPU profiler enabled (system-wide)")
+		} else {
+			log.Printf("Off-CPU profiler enabled (PID: %d)", a.config.PID)
+		}
+	}
+
+	// Start PMU monitor if enabled
+	if a.config.EnablePMU {
+		monitor, err := cpu.NewPMUMonitor(
+			a.config.PID,
+			a.config.SystemWide,
+			cpus,
+		)
+		if err != nil {
+			a.cleanup()
+			return fmt.Errorf("create PMU monitor: %w", err)
+		}
+		a.pmuMonitor = monitor
+		if a.config.SystemWide {
+			log.Println("PMU monitor enabled (system-wide)")
+		} else {
+			log.Printf("PMU monitor enabled (PID: %d)", a.config.PID)
+		}
+	}
+
+	a.started = true
+	return nil
+}


The context parameter ctx is accepted but never used in the Start method. If the context is not needed for cancellation or deadline handling, it should either be used (e.g., for goroutine cancellation) or removed from the signature. If it's for future extensibility, add a comment explaining this.

Copilot · 2026-01-26T18:49:24Z

+    agent.Start(ctx)
+    time.Sleep(10 * time.Second)
+    agent.Stop(ctx)


The example code doesn't check the error returned from agent.Start() and agent.Stop(). While this is example code, it's a best practice to show proper error handling, especially since these are operations that can fail. Consider adding error checks: if err := agent.Start(ctx); err != nil { log.Fatal(err) }

Suggested change

agent.Start(ctx)

time.Sleep(10 * time.Second)

agent.Stop(ctx)

if err := agent.Start(ctx); err != nil {

log.Fatal(err)

}

time.Sleep(10 * time.Second)

if err := agent.Stop(ctx); err != nil {

log.Fatal(err)

}

Copilot · 2026-01-26T18:49:24Z

+### Custom Metrics Export
+
+```go
+agent, _ := perfagent.New(


The example uses error suppression with agent, _ := perfagent.New(...). This is poor practice in documentation as it encourages users to ignore errors. The example should show proper error handling.

Copilot · 2026-01-26T18:49:24Z

+func (e *ConsoleExporter) Export(ctx context.Context, snapshot *MetricsSnapshot) error {
+	if len(snapshot.Processes) == 0 {
+		_, _ = fmt.Fprintln(e.Writer, "\nNo PMU metrics collected")
+		return nil
+	}
+
+	if !snapshot.SystemWide {
+		// Targeted mode (single PID)
+		for pid, m := range snapshot.Processes {
+			_, _ = fmt.Fprintf(e.Writer, "\n=== PMU Metrics (PID: %d) ===\n", pid)
+			e.printSinglePIDMetrics(m)
+		}
+		return nil
+	}
+
+	if e.PerPID {
+		// System-wide with per-PID breakdown
+		_, _ = fmt.Fprintf(e.Writer, "\n=== PMU Metrics (System-Wide, Per-PID) ===\n")
+		_, _ = fmt.Fprintf(e.Writer, "Profiled %d processes\n", len(snapshot.Processes))
+		for pid, m := range snapshot.Processes {
+			_, _ = fmt.Fprintf(e.Writer, "\n--- PID %d ---\n", pid)
+			e.printSinglePIDMetrics(m)
+		}
+	} else {
+		// System-wide aggregate (default)
+		_, _ = fmt.Fprintf(e.Writer, "\n=== PMU Metrics (System-Wide) ===\n")
+		e.printAggregateMetrics(snapshot)
+	}
+
+	return nil
+}


The context parameter ctx is accepted but never used in the Export method. If the context is not needed for cancellation or timeout handling, it should either be used (e.g., to support cancellation during export) or the signature should be reconsidered. If it's for future extensibility or interface consistency, add a comment explaining this.

Copilot · 2026-01-26T18:49:24Z

+			}
+		}
+	}
+


The Agent.Stop method does not set a.started = false, which means after calling Stop(), the agent remains in the "started" state. This causes the test comment on line 205 to be incorrect - after stop, the agent IS still considered started, so calling Start() again should error with "already started" not because it's not started.

This appears to be a bug: Stop() should set a.started = false to allow the agent to be stopped and indicate it's no longer running. The cleanup is done but the state flag is not updated.

Suggested change

// Mark the agent as no longer started after stopping and cleanup.

a.started = false

Copilot · 2026-01-26T18:49:24Z

+func makeCArrayOfStrings(input []string) **C.char {
+	arr := C.malloc(C.size_t(len(input)) * C.size_t(unsafe.Sizeof(uintptr(0))))
+
+	pointers := unsafe.Slice((**C.char)(arr), len(input))
+
+	for i, s := range input {
+		pointers[i] = C.CString(s)
+	}
+
+	return (**C.char)(arr)
+}


Potential memory leak: If len(input) is 0, the function allocates memory with C.malloc but returns immediately without freeing it. The allocated memory at line 85 should be freed when input is empty, or the allocation should be skipped entirely.

Copilot · 2026-01-26T18:49:25Z

+
+```go
+var buf bytes.Buffer
+agent, _ := perfagent.New(


The example uses error suppression with agent, _ := perfagent.New(...). This is poor practice in documentation as it encourages users to ignore errors. The example should show proper error handling.

Copilot · 2026-01-26T18:49:25Z

+	// For aggregate metrics, we need to merge histograms from all processes
+	// This is a simplified version - for full histogram merging, see cpu_usage_collector.go
+	if len(snapshot.Processes) > 0 {
+		// Find first process with data to show sample stats
+		for _, m := range snapshot.Processes {
+			if m.OnCPUStats.Count > 0 {
+				_, _ = fmt.Fprintf(w, "\nOn-CPU Time (time slice per context switch):\n")
+				_, _ = fmt.Fprintf(w, "  Min:    %.3f ms\n", float64(m.OnCPUStats.Min)/1e6)
+				_, _ = fmt.Fprintf(w, "  Max:    %.3f ms\n", float64(m.OnCPUStats.Max)/1e6)
+				_, _ = fmt.Fprintf(w, "  Mean:   %.3f ms\n", m.OnCPUStats.Mean/1e6)
+				_, _ = fmt.Fprintf(w, "  P50:    %.3f ms\n", float64(m.OnCPUStats.P50)/1e6)
+				_, _ = fmt.Fprintf(w, "  P95:    %.3f ms\n", float64(m.OnCPUStats.P95)/1e6)
+				_, _ = fmt.Fprintf(w, "  P99:    %.3f ms\n", float64(m.OnCPUStats.P99)/1e6)
+				_, _ = fmt.Fprintf(w, "  P99.9:  %.3f ms\n", float64(m.OnCPUStats.P999)/1e6)
+				break
+			}
+		}
+
+		for _, m := range snapshot.Processes {
+			if m.RunqueueStats.Count > 0 {
+				_, _ = fmt.Fprintf(w, "\nRunqueue Latency (time waiting for CPU):\n")
+				_, _ = fmt.Fprintf(w, "  Min:    %.3f ms\n", float64(m.RunqueueStats.Min)/1e6)
+				_, _ = fmt.Fprintf(w, "  Max:    %.3f ms\n", float64(m.RunqueueStats.Max)/1e6)
+				_, _ = fmt.Fprintf(w, "  Mean:   %.3f ms\n", m.RunqueueStats.Mean/1e6)
+				_, _ = fmt.Fprintf(w, "  P50:    %.3f ms\n", float64(m.RunqueueStats.P50)/1e6)
+				_, _ = fmt.Fprintf(w, "  P95:    %.3f ms\n", float64(m.RunqueueStats.P95)/1e6)
+				_, _ = fmt.Fprintf(w, "  P99:    %.3f ms\n", float64(m.RunqueueStats.P99)/1e6)
+				_, _ = fmt.Fprintf(w, "  P99.9:  %.3f ms\n", float64(m.RunqueueStats.P999)/1e6)
+				break
+			}
+		}
+	}


The aggregate metrics display in system-wide mode only shows latency stats from the first process with data (lines 129-155). This is misleading because it doesn't actually aggregate the histograms from all processes - it just picks one process arbitrarily. The comment on line 126 mentions this is a "simplified version" but this makes the output misleading. Consider either properly merging histograms or clearly labeling that these are sample stats from a single process, not aggregated stats.

The committed architecture.excalidraw was the original sketch from the FP-only era — it still showed only profile/, offcpu/, cpu/, perf.bpf.c, offcpu.bpf.c, and cpu.bpf.c. Since then the codebase has grown: - unwind/dwarfagent/ (DWARF hybrid CPU + off-CPU walker, PRs #1–#11) - unwind/{ehcompile,ehmaps,procmap} (CFI compile, per-PID lifecycle, /proc resolver — PRs #3–#9) - internal/perfevent (per-CPU perf_event_open + AttachRawLink shared helper extracted from profile/dwarfagent — PR #13) - inject/{python,ptraceop,elfsym} (Python perf-trampoline activator via ptrace — PR #12) - internal/{nspid,k8slabels} (namespace-aware --pid + k8s pprof labels — PR #14) - bpf/perf_dwarf.bpf.c, bpf/offcpu_dwarf.bpf.c (DWARF kernel-side) The new file groups them into pre-profile setup (optional, dashed — nspid, k8slabels, inject), profilers (FP, DWARF, PMU), helpers (perfevent, ehcompile, ehmaps, procmap, pprof), and the four BPF programs in the kernel band, with arrows showing data + control flow. Output lands in *-on-cpu.pb.gz / *-off-cpu.pb.gz / PMU stdout-or-file. Layout was generated programmatically; open the file in https://excalidraw.com or the VS Code extension to fine-tune positions and colours. The README's ASCII version of the same diagram is unchanged — both exist intentionally so readers can grok the architecture without opening Excalidraw, and the .excalidraw is the editable source for future iterations.

#25) * fix(kernel-stacks): lockdown-safe kernel symbolization + perf.data userspace MMAP2 Three production gaps shipped in v1.2.0 M1 kernel-stacks: 1. Kernel symbolization fails under lockdown=integrity (Secure Boot). blazesym's kernel source unconditionally probes /proc/kcore for DWARF, which is 0400 root:root + CAP_SYS_RAWIO-gated. perf-agent binaries setcap'd with the standard cap_perfmon/cap_bpf set lack that, so blazesym returns BLAZE_ERR_PERMISSION_DENIED for every batch and kernel frames disappear from the pprof. Setting vmlinux="" doesn't help — blazesym still hits kcore. Fix: pure-Go /proc/kallsyms symbolizer (symbolize/kallsyms.go) as the lockdown-safe fallback. On EPERM from blazesym, LocalKernelSymbolizer switches to the kallsyms path for the rest of its lifetime (sticky). Resolution loses inline expansion and file:line attribution but keeps function name + offset + module marker, which is the load-bearing property for flame graphs. PERFAGENT_FORCE_KERNEL_FALLBACK=1 lets CI exercise the fallback on hosts that don't naturally hit EPERM. 2. When every symbolizer fails, kernel frames are dropped entirely instead of being preserved as raw addresses. Caller's pprof loses kernel context. Fix: rawKernelAddrFrames synthesizes Frame{Name:"0x<hex>", Module:"[kernel.kallsyms]", Reason:FailureMissingSymbols} so operators can still decode hot kernel frames via /proc/kallsyms (the existing awk workaround the spec calls out), and the pprof keeps structural shape. 3. perf.data files written by --perf-data-output contain the [kernel.kallsyms]_text MMAP2 but no userspace MMAP2 records. `perf script` / `perf report` show [unknown] for every user-space IP, forcing operators to post-process with the maps.txt + nm hack. Fix: perfdata.Writer.AddUserspaceMmaps walks /proc/<pid>/maps via the existing procmap.Resolver and emits one PERF_RECORD_MMAP2 per executable mapping for the target PID. Wired into agent.go right after AddKernelMmap. System-wide (-a) is out of scope — left as a documented follow-up. Tests: stub-based unit tests for the fallback ladder and raw-address preservation (no kallsyms-readable host required), plus four integration tests including a forced-fallback variant that exercises the kallsyms path on every host. * docs(roadmap): post-prod-hardening improvement track Captures 10 follow-ups surfaced while fixing the v1.2.0 kernel-stacks production gaps: - self-profile lane (would have caught the lockdown bug at PR time) - metrics.Exporter histograms for symbolize hot paths - /metrics + /debug/pprof HTTP endpoint - symbolizer error counter by reason (lockdown_eperm, kallsyms_unreadable, ...) - CI lane that runs kernel-stacks integration with PERFAGENT_FORCE_KERNEL_FALLBACK=1 - go test -bench budgets for symbolize - allocs/sample budget - system-wide userspace MMAP2 (the -a follow-up to this PR) - lazy MMAP2 on first new-PID sample - kthread MMAP / COMM attribution audit Prioritized so the first three would prevent the next ship-and-pray. * feat(kernel-stacks): ship roadmap items #2, #5, #8, #10 in same PR Four follow-ups from docs/post-prod-hardening.md landed together with the bug fixes — small enough to ship in one commit, large enough to matter: #5 CI lockdown lane: new `make test-integration-lockdown` target runs the kernel-stacks integration tests with PERFAGENT_FORCE_KERNEL_FALLBACK=1, wired into the default `make test`. Every CI run now exercises the kallsyms fallback path regardless of the host's lockdown state — would have caught the original v1.2.0 ship-and-pray. #2 metrics counters: symbolize.Counters (atomic) tracks KernelBatches, KernelInputIPs, KernelBatchFailures, KernelFallbackEngaged, KernelRawAddrFrames. Logged at agent shutdown via agent.cleanup. Histograms (p50/p99 batch latency) left as follow-up (requires reservoir sampling). The fallback_engaged counter bumps both on natural EPERM-triggered switch AND on env-var forced fallback — operators see the active mode from the end-of-run log without scraping. #8 system-wide userspace MMAP2: emitCommAndMmapsForAllPIDs walks /proc at writer init and emits PERF_RECORD_MMAP2 for every non-kthread PID's executable mappings (extends Bug 3 single-PID fix to -a mode). TestPerfDataUserspaceMmap2_SystemWide verifies the walk covers multiple distinct PIDs (saw 9068 on this host). Snapshot semantics; lazy emission is roadmap #9. #10 COMM attribution audit: found that perfdata.Writer.AddComm was never called from anywhere in perf-agent — every perf.data emitted had zero COMM records. Not just kthreads — all PIDs showed up nameless in `perf script`. emitCommForPID reads /proc/<pid>/comm and emits PERF_RECORD_COMM for every PID enumerated, including kthreads (kvm-pit, vhost-*, kworker/*). TestPerfDataUserspaceMmap2 asserts the COMM record appears alongside the MMAP2 record for the workload pid. Modernized three for-loops to Go 1.22 `for i := range N` style in the modified files (per Go 1.26 project target). Roadmap doc updated to mark shipped items and reorder remaining priorities — #1 (self-profile lane) is now the highest-leverage follow-up. * feat(bench): self-profile scenario (roadmap #1) — profile the profiler A second perf-agent profiles the first while it captures a CPU workload. Catches: - perf-agent's own CPU overhead (agent_samples / workload_samples at the same sample rate — directly comparable) - kernel-symbol resolution rate (fraction of kernel Locations whose Function.Name isn't a "0x<hex>" raw-address fallback) The kernel-resolution canary is what would have surfaced the v1.2.0 lockdown bug at PR time: a sudden drop from ~100% to 0% on lockdown hosts would have been impossible to miss in CI. Wiring: - new "self" case in bench/cmd/scenario, with --self-duration, --cpu-budget, --resolution-budget flags - schema.SelfMetrics inlined into Run (omitzero, no effect on other scenarios' JSON) - countKernelResolution helper + unit tests for the metric logic - non-zero exit (3) when budgets breached so CI can gate on it - make bench-self target with 1.5 CPU / 0.5 resolution defaults (calibrated to catch regressions, not enforce a tight target on this codebase as-of-today) Verified end-to-end: 2 runs on this host showed agent/workload CPU ratios of 0.38 and 0.85 with 100% kernel resolution — both within budget, scenario exits 0. Tight budgets (--cpu-budget 0.01) exit 3. Skips the in-process caps check for "self" because the scenario is pure orchestration; the perf-agent subprocesses carry their own file caps. bench-self Makefile target deliberately depends only on bench-build (not build) so it doesn't wipe the manually-applied setcap on perf-agent each invocation. Roadmap doc updated to mark #1 shipped; remaining items reordered. * fix(symbolize): userspace fallback chain — discovered via bench-self Two related bugs surfaced by dogfooding the self-profile scenario (roadmap #1) against this PR's own perf-agent binary: A. DebuginfodSymbolizer doesn't fall back to local on NULL. When DEBUGINFOD_URLS points at a server that doesn't have the target binary's build-id (the normal case for locally-built perf-agent), blaze_symbolize_process_abs_addrs returns NULL and the dispatcher gave up — every userspace frame became <unknown>. Fix: keep a LocalSymbolizer next to the cgo state; on NULL, retry through it before erroring. B. Userspace symbolize EPERM when target has file caps. Profiling a setcap'd process (e.g., a second perf-agent) hits "permission denied" from blazesym because /proc/<pid>/exe is restricted by ptrace_scope when the target is privileged. The error propagated as "Failed to symbolize user" and the whole batch was dropped. Fix: on any blazesym error, LocalSymbolizer.SymbolizeProcess now returns rawUserAddrFrames (Name="0x<hex>", Reason=FailureMissingSymbols) — symmetric to the kernel-side rawKernelAddrFrames fix in this PR's first commit. Stack shape and addresses survive into the pprof so operators can recover with addr2line. Full per-mapping ELF resolution (the proper fix that bypasses /proc/<pid>/exe entirely by using procmap-supplied paths) is left as a roadmap follow-up — see docs/post-prod-hardening.md "Findings from self-profile" section. Verified end-to-end: profiling perf-agent with perf-agent on a setcap'd binary went from 100% <unknown> (whole batch dropped) to named raw-hex frames + structural mapping info; addr2line on those hex names produced the hot-function ranking that drives the new roadmap section (cilium/ebpf BTF parse, net.* DNS for debuginfod, SQLite cache init, etc.). No "permission denied" log spam. * perf(symbolize, bpfstack): close the dogfood loop — three iterations Running the self scenario four times in sequence, each iteration exposed the next bottleneck once the previous one was fixed: iter 2 → fix per-IP empty-name rendering frameFromCSym and fromBlazesymSym left Name="" when blazesym resolved the binary but couldn't map a specific IP to a symbol. pprof rendered those as <unknown>. Filling with "0x<hex>" + FailureMissingSymbols matches the kernel-side behavior shipped earlier in this PR. <unknown> went from 76% to ~0%. iter 3 → fix BPF user-stack kernel leak + speed up /proc/kallsyms bpf_get_stackid with BPF_F_USER_STACK can include kernel addresses in the user-stack buffer when the sampled task was in syscall/irq/fault context. That leak made 40% of "user" CPU look like 0xffff...kernel-range hex names in the pprof, with the user code below them invisible. bpfstack.SplitUserKernelIPs partitions ips by the canonical x86_64 kernel threshold and routes the strays to the kernel symbolizer (where they belong). Once the leak was fixed, the new top of the profile was /proc/kallsyms reading (module_get_kallsym, seq_read_iter, vsnprintf — all kernel-side). The kallsyms file is synthesized by the kernel: each small read() forces vsnprintf for every line. The default 4 KiB scanner buffer meant ~700 read syscalls for 3M kallsyms lines. Wrapping the file in a 256 KiB bufio Reader cut that ~64x and dropped kallsyms entirely out of the top 20. iter 4 → no fix; steady-state hot paths surface do_syscall_64 (42% cum), finish_task_switch (21%), ep_poll (10%), ringbuf.readRecord, dwarfagent.aggregateCPUSample, Syscall6 / nanotime1 / futex. All intrinsic costs of being an eBPF-based sampler. Further wins are diminishing returns. Roadmap doc captures the iteration log so future contributors see the dogfood-as-process pattern: ship #1, run it against itself, let it tell you what to fix next. Tests: new internal/bpfstack/split_test.go covers SplitUserKernelIPs edge cases (kernel-at-top leak, no-leak passthrough, all-kernel, boundary value). All existing unit suites still pass. * perf(symbolize): allocation-free /proc/kallsyms parser bench-self iter 5 (A=with --kernel-stacks, B=without) surfaced strings.Fields + strconv.ParseUint as the top user-side allocation source: 3M kallsyms lines × multi-string Fields × ParseUint triggered visible GC pressure (mallocgc / sweepone / tryDeferToSpanScan in the user-side pprof's top 10). Replaced with parseKallsymsLine: byte-level scan over the read buffer, no string allocation per line. Names are copied out once per kept symbol (mandatory — buffer reuses). Module strings go through an intern map (typical kernel has tens of modules but millions of module symbols). BenchmarkParseKallsymsLine-16 72630638 16.54 ns/op 0 B/op 0 allocs/op iter 6 confirmed: user-side allocation hotspots are gone. The remaining kallsyms cost is kernel-side (module_get_kallsym + vsnprintf rendering the synthesized /proc/kallsyms file) and can only be avoided by NOT parsing on every invocation. Roadmap doc adds a "remaining bottleneck" section proposing a disk-cached index keyed by kernel boot_id as the next follow-up. Also lifts strings.Fields-based parsing out of the kallsyms hot path entirely — net deletion of 2 imports (strconv, strings) from kallsyms.go. * perf(symbolize): disk-cached /proc/kallsyms index keyed by boot_id bench-self iter 6 identified the kernel synthesizing /proc/kallsyms (module_get_kallsym + vsnprintf in the profile) as the floor — unfixable from userspace because the file is generated on read. The fix is to avoid re-reading on every agent invocation. Cache key: kernel boot_id from /proc/sys/kernel/random/boot_id. Kernel symbol addresses change only across reboots (KASLR), so boot_id is exactly the right signal — drops the cache when it would be wrong, never falsely retains a stale copy. Format (little-endian): header: magic u32 | version u32 | boot_id [16]byte | n_syms u32 per-symbol: addr u64 | name_len u16 | mod_len u16 | name | module ~50 bytes/symbol, ~11 MiB on disk for this host's filtered kallsyms (~225k T/t/W/w/i entries). atomic write via temp + rename, so concurrent readers never see a partial file. Behavioral flow in newKallsymsSymbolizer: 1. loadCachedKallsyms → return on hit (boot_id matches) 2. parseKallsymsFresh on stale / corrupt / missing 3. best-effort writeKallsymsCache for next invocation iter 7 dogfood (cold vs warm cache run, same workload + flags): metric cold warm delta module_get_kallsym 16% 11% -45% vsnprintf (cum) 13% 5% -67% seq_read_iter (cum)40% n/a out of top 12 total CPU 676ms 565ms -16% Tests cover roundtrip, stale boot_id rejection, missing file, and corrupt-file non-fatal handling. cachePathFn and readBootIDFn are package vars so tests can pin both deterministically. Next opportunity (next commit): the remaining ~11% module_get_kallsym in the warm run is blazesym's FAILED kernel-source attempt reading kallsyms before hitting EPERM on /proc/kcore. A persistent "blazesym EPERM'd on this boot" marker would let perf-agent skip that attempt entirely on subsequent invocations. * perf(symbolize): persistent blazesym-EPERM marker — skip failing attempt Iter 7 dogfood revealed that even with a warm kallsyms disk cache, the FIRST blazesym call per agent invocation still cost ~16% of perf-agent CPU: blazesym opens /proc/kallsyms (the read shows up as module_get_kallsym + vsnprintf in the profile) before discovering /proc/kcore is restricted and returning EPERM. The sticky in-process bit prevents subsequent batches from re-paying, but every new agent process re-attempts blazesym from scratch. Fix: a boot_id-scoped marker file at ~/.cache/perf-agent/blazesym-eperm-<hex_boot_id>. Written on the first observed EPERM. Checked at LocalKernelSymbolizer construction: if present, fallback is pre-engaged and blazesym is never invoked. Marker is automatically invalidated by reboot (boot_id changes), so a kernel upgrade that loosens lockdown is re-detected on next boot. iter 8 dogfood (cold → warm cache + warm marker, same workload): metric cold warm delta module_get_kallsym 16% ZERO out of top 12 vsnprintf (cum) 14% ZERO out of top 12 do_syscall_64 (cum) 52% 0% down (no failing blazesym calls) total perf-agent CPU 505ms 131ms -74% Warm pprof now shows intrinsic eBPF-sampler costs only: finish_task_switch (21%), ep_poll waiting (69% cum), spin_unlock. Diminishing returns from here — every remaining sample is genuine "be a profiler" work. Tests pin XDG_CACHE_HOME via t.Setenv + override readBootIDFn so the marker lands in a t.TempDir(). Covers: roundtrip, boot_id-scoped invisibility (post-reboot equivalence), and filename structurally encodes the boot_id. * perf(perfdata): lazy MMAP2/COMM on first sample per PID (roadmap #9) bench-self iter 9 found system-wide capture burning ~30% of perf-agent CPU in kernel /proc/<pid>/maps rendering (show_map_vma, mangle_path, lock_next_vma, prepend_path, d_path) because emitCommAndMmapsForAllPIDs walked every PID at writer init — ~9000 reads on this host, of which only ~200 produce samples in a typical capture. Fix: perfdata.Writer.OnNewPID callback fires the first time AddSample sees each unique pid (sentinel-filtered: skips 0 and 0xffffffff). agent.go wires it to emitCommForPID + emitUserspaceMmapsForPID for system-wide mode; the eager walk is gone. Single-PID mode is unchanged (always knows its target). iter 11 dogfood (system-wide, --perf-data-output ON): perf.data size: ~3 MB (vs ~40-50 MB the eager walk produced) enrolled PIDs: ~208 lazy (vs ~9000 eager) remaining /proc/maps cost: ~15% cum, bounded by sampled-PID count rather than host-PID count — natural floor from procmap.Resolver's per-batch re-snapshot (spec invariant) and dwarfagent's lazy attach. Bonus: lazy mode also covers PIDs that exec AFTER capture starts. The eager walk could never see those. Tests pin the contract: fires once per unique pid (multiple samples for same pid don't re-trigger); sentinel pids are filtered; nil callback is a no-op (existing --pid-mode callers keep working unchanged). * feat(symbolize): reason-bucketed error counters + bench coverage (roadmap #4, #6) #4 reason-bucketed counters: - KernelLockdownEPERM bumps on each BLAZE_ERR_PERMISSION_DENIED - KernelOtherErr bumps on every other blazesym failure - CountersSnapshot.String() now logs eperm=N and other_err=N Operators can distinguish "canonical lockdown" (high eperm, fallback_engaged=1) from "unexpected blazesym failure" (any other_err > 0 deserves attention) without grepping logs. #6 symbolize benchmarks: - BenchmarkParseKallsymsFresh (cold-path baseline) - BenchmarkLoadCachedKallsyms (warm-path; ~20x faster) - BenchmarkResolveKernelIPs (per-IP cost ~35 ns) - BenchmarkParseKallsymsLine kept for per-line regression - new `make bench-symbolize` target The 20x cache speedup is the headline metric. PRs touching the symbolize hot path can now show numerical evidence rather than vibes. Roadmap doc reflects shipped status; both items move from Pending to Shipped in the triage table. * feat(perfagent): --metrics-listen HTTP endpoint (roadmap #3) --metrics-listen <addr> starts an in-process HTTP server on the given address (e.g. 127.0.0.1:7777) exposing: /metrics Prometheus text format of symbolize.Counters (batches, input_ips, batch_failures, fallback_engaged, raw_addr_frames, lockdown_eperm, other_err) /debug/pprof/ Go runtime self-pprof — live scrape with `go tool pprof http://host:7777/debug/pprof/profile` Default off — no port opened, zero CPU overhead when unused. Lifecycle: server starts after kernelSymbolizer is constructed (so the metrics handler closure can snapshot live counters); shutdown happens in Agent.cleanup AFTER the symbolizer is closed, with a 2s graceful drain. Tests cover: - Prometheus exposition format (HELP + TYPE + value lines per metric, all 7 counters present) - Lifecycle round-trip against a :0 listener (live scrape, Stop returns within grace, post-Stop scrape fails) - Empty-addr no-op (--metrics-listen not set → nil server) - /debug/pprof/ mount works through the metrics mux End-to-end verified: agent run with --metrics-listen serves the canonical Prometheus payload AND a working gzipped pprof profile (1f 8b 08 magic). * feat(symbolize): batch-duration histograms + allocs budget (roadmap #2, #7) #2 batch-duration histograms: - LatencyHist: ring buffer (1024 samples) + min/max/sum, with p50/p99 on Snapshot via slices.Sort. - Wired into SymbolizeKernel: time.Now() on entry, Record on deferred exit. Microseconds — readable across warm-cache (sub-ms) and cold-CGO (millisecond+) paths. - End-of-run log line gains batch_p50_us / batch_p99_us / batch_max_us fields. - /metrics endpoint exposes three new gauges: perf_agent_symbolize_kernel_batch_{p50,p99,max}_microseconds Sliding-window semantics: an early burst of slow batches doesn't permanently bias the percentile — only the most recent 1024 stay in the ring. Tested with the TestLatencyHist_RingOverwrite case. #7 allocs budget: - testing.AllocsPerRun gates on the three hottest paths: parseKallsymsLine (0 alloc/op), kallsymsSymbolizer.Resolve (1 alloc/op = the return slice), LatencyHist.Record (0 alloc/op). - PRs that reintroduce strings.Fields, strconv.ParseUint, or other allocators into these paths will fail the test immediately — not just look worse in benchstat. Catches the regression class dogfood iter 5 surfaced (mallocgc / sweepone in the user-side profile of perf-agent). Roadmap doc: items #2 and #7 marked shipped (#7 partial — the symbolize hot path is gated; sample-processing follow-up is documented). * chore(perfagent): drop dead emitCommAndMmapsForAllPIDs CI lint caught it: the helper was the eager-walk path replaced by the lazy OnNewPID hook in 405611a (roadmap #9). emitCommForPID and emitUserspaceMmapsForPID are now the only callers — both already exported through the lazy callback. Dead. * docs(README): drop GPU-profiling experimental notice GPU profiling lives on its own track (and its own worktree). The README banner was conflating it with the stable CPU / off-CPU / PMU paths; dropping the line keeps the front page honest about what this build delivers.

dpsoft added 3 commits January 26, 2026 15:13

Improve error handling and output formatting in console metrics

83f89dc

Improve output formatting in console metrics for better readability

5f8e5a2

dpsoft requested a review from Copilot January 26, 2026 18:43

Copilot started reviewing on behalf of dpsoft January 26, 2026 18:43 View session

Copilot AI reviewed Jan 26, 2026

View reviewed changes

dpsoft merged commit e6e458a into main Jan 26, 2026
16 checks passed

dpsoft deleted the lib-mode branch January 26, 2026 18:50

dpsoft mentioned this pull request May 20, 2026

fix(kernel-stacks): lockdown-safe symbolization + production hardening #25

Merged

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

lib mode support#1

lib mode support#1
dpsoft merged 3 commits into
mainfrom
lib-mode

dpsoft commented Jan 26, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Jan 26, 2026

Uh oh!

Copilot AI Jan 26, 2026

Uh oh!

Copilot AI Jan 26, 2026

Uh oh!

Copilot AI Jan 26, 2026

Uh oh!

Copilot AI Jan 26, 2026

Uh oh!

Copilot AI Jan 26, 2026

Uh oh!

Copilot AI Jan 26, 2026

Uh oh!

Copilot AI Jan 26, 2026

Uh oh!

Copilot AI Jan 26, 2026

Uh oh!

Copilot AI Jan 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants


		pointers := unsafe.Slice(input, length)

		// Go 1.21 compatible loop (instead of: for i := range length)

	// Go 1.21 compatible loop (instead of: for i := range length)
	// Use an index-based loop instead of ranging directly over length.



	// Mark the agent as no longer started after stopping and cleanup.
	a.started = false

Conversation

dpsoft commented Jan 26, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants