Skip to content

lib mode support#1

Merged
dpsoft merged 3 commits into
mainfrom
lib-mode
Jan 26, 2026
Merged

lib mode support#1
dpsoft merged 3 commits into
mainfrom
lib-mode

Conversation

@dpsoft
Copy link
Copy Markdown
Owner

@dpsoft dpsoft commented Jan 26, 2026

  • Add runqueue latency tracking (time waiting for CPU) in cpu.bpf.c
  • Add task state classification (preempted, voluntary, I/O wait) to PMU metrics
  • Refactor goto statement in cpu.bpf.c for better code clarity
  • Add perfagent library package for programmatic usage
  • Add metrics package with console and exporter interfaces
  • Update PMU output to show On-CPU Time, Runqueue Latency, and Context Switch Reasons
  • Add comprehensive integration tests for new metrics
  • Fix LD_LIBRARY_PATH handling in CI/CD and test scripts
  • Update README with library usage examples and new metrics documentation

Blazesym Vendoring Strategy:

  • Use local internal/blazesym wrapper for Go 1.22+ compatibility
  • Official github.com/libbpf/blazesym/go requires Go 1.25+
  • C library (libblazesym_c.a) built separately from upstream
  • Maintains backward compatibility while using latest blazesym C API

- Add runqueue latency tracking (time waiting for CPU) in cpu.bpf.c
- Add task state classification (preempted, voluntary, I/O wait) to PMU metrics
- Refactor goto statement in cpu.bpf.c for better code clarity
- Add perfagent library package for programmatic usage
- Add metrics package with console and exporter interfaces
- Update PMU output to show On-CPU Time, Runqueue Latency, and Context Switch Reasons
- Add comprehensive integration tests for new metrics
- Fix LD_LIBRARY_PATH handling in CI/CD and test scripts
- Update README with library usage examples and new metrics documentation

Blazesym Vendoring Strategy:
- Use local internal/blazesym wrapper for Go 1.22+ compatibility
- Official github.com/libbpf/blazesym/go requires Go 1.25+
- C library (libblazesym_c.a) built separately from upstream
- Maintains backward compatibility while using latest blazesym C API
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds library mode support to perf-agent by introducing a perfagent package for programmatic usage, along with a metrics package providing exporter interfaces for flexible metrics output. The changes refactor profile collection to support streaming output, replace dependencies on go-perf and the official blazesym/go with an internal blazesym wrapper for Go 1.22+ compatibility, and add comprehensive integration tests for the new library features.

Changes:

  • Adds perfagent package with agent abstraction and configuration options for programmatic usage
  • Adds metrics package with exporter interface and console implementation for flexible metrics output
  • Refactors CPU and off-CPU profilers to support streaming (io.Writer) in addition to file-based output
  • Replaces external dependencies with internal blazesym wrapper and direct unix syscalls for perf events
  • Updates test infrastructure with LD_LIBRARY_PATH handling and adds library integration tests

Reviewed changes

Copilot reviewed 31 out of 35 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
go.mod, test/go.mod Updated Go version to 1.24.0 and removed go-perf/blazesym dependencies
perfagent/agent.go Main agent implementation with Start/Stop lifecycle and configuration
perfagent/options.go Functional options for agent configuration
perfagent/*_test.go Unit and integration tests for library usage
metrics/types.go Data structures for metrics snapshots
metrics/exporter.go Exporter interface definition
metrics/console.go Console exporter implementation
profile/profiler.go Added Collect() method for streaming profile output
offcpu/profiler.go Added Collect() method for streaming profile output
cpu/perf_events.go Replaced go-perf with direct unix.PerfEventOpen syscalls
cpu/pmu_monitor.go Added GetMetricsSnapshot and ExportMetrics methods
cpu/cpu_usage_collector.go Added GetSnapshot and ExportMetrics for metrics export
internal/blazesym/*.go Internal blazesym wrapper for Go 1.22+ compatibility
main.go Refactored to use perfagent library
test/run_tests.sh Added LD_LIBRARY_PATH handling for blazesym runtime loading
test/integration_test.go Added library usage tests
.github/workflows/*.yml Updated CI to pass LD_LIBRARY_PATH to sudo
Makefile Updated test-integration target with LD_LIBRARY_PATH
README.md Added library usage documentation and examples

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +205 to +207
// Double start should error (after stop, agent is no longer started)
err = agent.Start(ctx)
require.Error(t, err)
Copy link

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test comment is misleading. Looking at Agent.Stop(), it doesn't set a.started = false, so after calling Stop(), the agent is still in the "started" state. The actual behavior here is that calling Start() after Stop() will error with "already started", not because it's "not started". The test behavior contradicts the comment.

Suggested change
// Double start should error (after stop, agent is no longer started)
err = agent.Start(ctx)
require.Error(t, err)
// Double start should error with "already started" (Stop does not reset started state)
err = agent.Start(ctx)
require.Error(t, err)
require.Contains(t, err.Error(), "already started")

Copilot uses AI. Check for mistakes.

pointers := unsafe.Slice(input, length)

// Go 1.21 compatible loop (instead of: for i := range length)
Copy link

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment mentions "Go 1.21 compatible loop" but the module requires Go 1.24.0 (which doesn't exist). If the code is actually targeting Go 1.22 or 1.23, this loop style is already supported by those versions. The comment should be updated to match the actual Go version requirement, or the Go version should be corrected to an existing version.

Suggested change
// Go 1.21 compatible loop (instead of: for i := range length)
// Use an index-based loop instead of ranging directly over length.

Copilot uses AI. Check for mistakes.
Comment thread perfagent/agent.go
Comment on lines +75 to +164
func (a *Agent) Start(ctx context.Context) error {
a.mu.Lock()
defer a.mu.Unlock()

if a.started {
return errors.New("agent already started")
}

// Set capabilities
caps := cap.GetProc()
if err := caps.SetFlag(cap.Effective, true, cap.SYS_ADMIN, cap.PERFMON); err != nil {
return fmt.Errorf("set capabilities: %w", err)
}

// Remove memlock limit
if err := rlimit.RemoveMemlock(); err != nil {
return fmt.Errorf("remove memlock: %w", err)
}

// Get CPUs to monitor
cpus := a.config.CPUs
if len(cpus) == 0 {
var err error
cpus, err = cpuonline.Get()
if err != nil {
return fmt.Errorf("get online CPUs: %w", err)
}
}

// Start CPU profiler if enabled
if a.config.EnableCPUProfile {
profiler, err := profile.NewProfiler(
a.config.PID,
a.config.SystemWide,
cpus,
a.config.Tags,
a.config.SampleRate,
)
if err != nil {
return fmt.Errorf("create CPU profiler: %w", err)
}
a.cpuProfiler = profiler
if a.config.SystemWide {
log.Printf("CPU profiler enabled (system-wide, %d Hz)", a.config.SampleRate)
} else {
log.Printf("CPU profiler enabled (PID: %d, %d Hz)", a.config.PID, a.config.SampleRate)
}
}

// Start off-CPU profiler if enabled
if a.config.EnableOffCPUProfile {
profiler, err := offcpu.NewProfiler(
a.config.PID,
a.config.SystemWide,
a.config.Tags,
)
if err != nil {
a.cleanup()
return fmt.Errorf("create off-CPU profiler: %w", err)
}
a.offcpuProfiler = profiler
if a.config.SystemWide {
log.Println("Off-CPU profiler enabled (system-wide)")
} else {
log.Printf("Off-CPU profiler enabled (PID: %d)", a.config.PID)
}
}

// Start PMU monitor if enabled
if a.config.EnablePMU {
monitor, err := cpu.NewPMUMonitor(
a.config.PID,
a.config.SystemWide,
cpus,
)
if err != nil {
a.cleanup()
return fmt.Errorf("create PMU monitor: %w", err)
}
a.pmuMonitor = monitor
if a.config.SystemWide {
log.Println("PMU monitor enabled (system-wide)")
} else {
log.Printf("PMU monitor enabled (PID: %d)", a.config.PID)
}
}

a.started = true
return nil
}
Copy link

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The context parameter ctx is accepted but never used in the Start method. If the context is not needed for cancellation or deadline handling, it should either be used (e.g., for goroutine cancellation) or removed from the signature. If it's for future extensibility, add a comment explaining this.

Copilot uses AI. Check for mistakes.
Comment thread README.md
Comment on lines +207 to +209
agent.Start(ctx)
time.Sleep(10 * time.Second)
agent.Stop(ctx)
Copy link

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example code doesn't check the error returned from agent.Start() and agent.Stop(). While this is example code, it's a best practice to show proper error handling, especially since these are operations that can fail. Consider adding error checks: if err := agent.Start(ctx); err != nil { log.Fatal(err) }

Suggested change
agent.Start(ctx)
time.Sleep(10 * time.Second)
agent.Stop(ctx)
if err := agent.Start(ctx); err != nil {
log.Fatal(err)
}
time.Sleep(10 * time.Second)
if err := agent.Stop(ctx); err != nil {
log.Fatal(err)
}

Copilot uses AI. Check for mistakes.
Comment thread README.md
### Custom Metrics Export

```go
agent, _ := perfagent.New(
Copy link

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example uses error suppression with agent, _ := perfagent.New(...). This is poor practice in documentation as it encourages users to ignore errors. The example should show proper error handling.

Copilot uses AI. Check for mistakes.
Comment thread metrics/console.go
Comment on lines +30 to +60
func (e *ConsoleExporter) Export(ctx context.Context, snapshot *MetricsSnapshot) error {
if len(snapshot.Processes) == 0 {
_, _ = fmt.Fprintln(e.Writer, "\nNo PMU metrics collected")
return nil
}

if !snapshot.SystemWide {
// Targeted mode (single PID)
for pid, m := range snapshot.Processes {
_, _ = fmt.Fprintf(e.Writer, "\n=== PMU Metrics (PID: %d) ===\n", pid)
e.printSinglePIDMetrics(m)
}
return nil
}

if e.PerPID {
// System-wide with per-PID breakdown
_, _ = fmt.Fprintf(e.Writer, "\n=== PMU Metrics (System-Wide, Per-PID) ===\n")
_, _ = fmt.Fprintf(e.Writer, "Profiled %d processes\n", len(snapshot.Processes))
for pid, m := range snapshot.Processes {
_, _ = fmt.Fprintf(e.Writer, "\n--- PID %d ---\n", pid)
e.printSinglePIDMetrics(m)
}
} else {
// System-wide aggregate (default)
_, _ = fmt.Fprintf(e.Writer, "\n=== PMU Metrics (System-Wide) ===\n")
e.printAggregateMetrics(snapshot)
}

return nil
}
Copy link

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The context parameter ctx is accepted but never used in the Export method. If the context is not needed for cancellation or timeout handling, it should either be used (e.g., to support cancellation during export) or the signature should be reconsidered. If it's for future extensibility or interface consistency, add a comment explaining this.

Copilot uses AI. Check for mistakes.
Comment thread perfagent/agent.go
}
}
}

Copy link

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Agent.Stop method does not set a.started = false, which means after calling Stop(), the agent remains in the "started" state. This causes the test comment on line 205 to be incorrect - after stop, the agent IS still considered started, so calling Start() again should error with "already started" not because it's not started.

This appears to be a bug: Stop() should set a.started = false to allow the agent to be stopped and indicate it's no longer running. The cleanup is done but the state flag is not updated.

Suggested change
// Mark the agent as no longer started after stopping and cleanup.
a.started = false

Copilot uses AI. Check for mistakes.
Comment on lines +84 to +94
func makeCArrayOfStrings(input []string) **C.char {
arr := C.malloc(C.size_t(len(input)) * C.size_t(unsafe.Sizeof(uintptr(0))))

pointers := unsafe.Slice((**C.char)(arr), len(input))

for i, s := range input {
pointers[i] = C.CString(s)
}

return (**C.char)(arr)
}
Copy link

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential memory leak: If len(input) is 0, the function allocates memory with C.malloc but returns immediately without freeing it. The allocated memory at line 85 should be freed when input is empty, or the allocation should be skipped entirely.

Copilot uses AI. Check for mistakes.
Comment thread README.md

```go
var buf bytes.Buffer
agent, _ := perfagent.New(
Copy link

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example uses error suppression with agent, _ := perfagent.New(...). This is poor practice in documentation as it encourages users to ignore errors. The example should show proper error handling.

Copilot uses AI. Check for mistakes.
Comment thread metrics/console.go
Comment on lines +125 to +156
// For aggregate metrics, we need to merge histograms from all processes
// This is a simplified version - for full histogram merging, see cpu_usage_collector.go
if len(snapshot.Processes) > 0 {
// Find first process with data to show sample stats
for _, m := range snapshot.Processes {
if m.OnCPUStats.Count > 0 {
_, _ = fmt.Fprintf(w, "\nOn-CPU Time (time slice per context switch):\n")
_, _ = fmt.Fprintf(w, " Min: %.3f ms\n", float64(m.OnCPUStats.Min)/1e6)
_, _ = fmt.Fprintf(w, " Max: %.3f ms\n", float64(m.OnCPUStats.Max)/1e6)
_, _ = fmt.Fprintf(w, " Mean: %.3f ms\n", m.OnCPUStats.Mean/1e6)
_, _ = fmt.Fprintf(w, " P50: %.3f ms\n", float64(m.OnCPUStats.P50)/1e6)
_, _ = fmt.Fprintf(w, " P95: %.3f ms\n", float64(m.OnCPUStats.P95)/1e6)
_, _ = fmt.Fprintf(w, " P99: %.3f ms\n", float64(m.OnCPUStats.P99)/1e6)
_, _ = fmt.Fprintf(w, " P99.9: %.3f ms\n", float64(m.OnCPUStats.P999)/1e6)
break
}
}

for _, m := range snapshot.Processes {
if m.RunqueueStats.Count > 0 {
_, _ = fmt.Fprintf(w, "\nRunqueue Latency (time waiting for CPU):\n")
_, _ = fmt.Fprintf(w, " Min: %.3f ms\n", float64(m.RunqueueStats.Min)/1e6)
_, _ = fmt.Fprintf(w, " Max: %.3f ms\n", float64(m.RunqueueStats.Max)/1e6)
_, _ = fmt.Fprintf(w, " Mean: %.3f ms\n", m.RunqueueStats.Mean/1e6)
_, _ = fmt.Fprintf(w, " P50: %.3f ms\n", float64(m.RunqueueStats.P50)/1e6)
_, _ = fmt.Fprintf(w, " P95: %.3f ms\n", float64(m.RunqueueStats.P95)/1e6)
_, _ = fmt.Fprintf(w, " P99: %.3f ms\n", float64(m.RunqueueStats.P99)/1e6)
_, _ = fmt.Fprintf(w, " P99.9: %.3f ms\n", float64(m.RunqueueStats.P999)/1e6)
break
}
}
}
Copy link

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The aggregate metrics display in system-wide mode only shows latency stats from the first process with data (lines 129-155). This is misleading because it doesn't actually aggregate the histograms from all processes - it just picks one process arbitrarily. The comment on line 126 mentions this is a "simplified version" but this makes the output misleading. Consider either properly merging histograms or clearly labeling that these are sample stats from a single process, not aggregated stats.

Copilot uses AI. Check for mistakes.
@dpsoft dpsoft merged commit e6e458a into main Jan 26, 2026
16 checks passed
@dpsoft dpsoft deleted the lib-mode branch January 26, 2026 18:50
dpsoft added a commit that referenced this pull request May 3, 2026
The committed architecture.excalidraw was the original sketch from the
FP-only era — it still showed only profile/, offcpu/, cpu/, perf.bpf.c,
offcpu.bpf.c, and cpu.bpf.c. Since then the codebase has grown:

- unwind/dwarfagent/ (DWARF hybrid CPU + off-CPU walker, PRs #1#11)
- unwind/{ehcompile,ehmaps,procmap} (CFI compile, per-PID lifecycle,
  /proc resolver — PRs #3#9)
- internal/perfevent (per-CPU perf_event_open + AttachRawLink shared
  helper extracted from profile/dwarfagent — PR #13)
- inject/{python,ptraceop,elfsym} (Python perf-trampoline activator
  via ptrace — PR #12)
- internal/{nspid,k8slabels} (namespace-aware --pid + k8s pprof labels
  — PR #14)
- bpf/perf_dwarf.bpf.c, bpf/offcpu_dwarf.bpf.c (DWARF kernel-side)

The new file groups them into pre-profile setup (optional, dashed —
nspid, k8slabels, inject), profilers (FP, DWARF, PMU), helpers
(perfevent, ehcompile, ehmaps, procmap, pprof), and the four BPF
programs in the kernel band, with arrows showing data + control flow.
Output lands in *-on-cpu.pb.gz / *-off-cpu.pb.gz / PMU stdout-or-file.

Layout was generated programmatically; open the file in
https://excalidraw.com or the VS Code extension to fine-tune
positions and colours.

The README's ASCII version of the same diagram is unchanged — both
exist intentionally so readers can grok the architecture without
opening Excalidraw, and the .excalidraw is the editable source for
future iterations.
dpsoft added a commit that referenced this pull request May 3, 2026
The committed architecture.excalidraw was the original sketch from the
FP-only era — it still showed only profile/, offcpu/, cpu/, perf.bpf.c,
offcpu.bpf.c, and cpu.bpf.c. Since then the codebase has grown:

- unwind/dwarfagent/ (DWARF hybrid CPU + off-CPU walker, PRs #1#11)
- unwind/{ehcompile,ehmaps,procmap} (CFI compile, per-PID lifecycle,
  /proc resolver — PRs #3#9)
- internal/perfevent (per-CPU perf_event_open + AttachRawLink shared
  helper extracted from profile/dwarfagent — PR #13)
- inject/{python,ptraceop,elfsym} (Python perf-trampoline activator
  via ptrace — PR #12)
- internal/{nspid,k8slabels} (namespace-aware --pid + k8s pprof labels
  — PR #14)
- bpf/perf_dwarf.bpf.c, bpf/offcpu_dwarf.bpf.c (DWARF kernel-side)

The new file groups them into pre-profile setup (optional, dashed —
nspid, k8slabels, inject), profilers (FP, DWARF, PMU), helpers
(perfevent, ehcompile, ehmaps, procmap, pprof), and the four BPF
programs in the kernel band, with arrows showing data + control flow.
Output lands in *-on-cpu.pb.gz / *-off-cpu.pb.gz / PMU stdout-or-file.

Layout was generated programmatically; open the file in
https://excalidraw.com or the VS Code extension to fine-tune
positions and colours.

The README's ASCII version of the same diagram is unchanged — both
exist intentionally so readers can grok the architecture without
opening Excalidraw, and the .excalidraw is the editable source for
future iterations.
dpsoft added a commit that referenced this pull request May 21, 2026
#25)

* fix(kernel-stacks): lockdown-safe kernel symbolization + perf.data userspace MMAP2

Three production gaps shipped in v1.2.0 M1 kernel-stacks:

1. Kernel symbolization fails under lockdown=integrity (Secure Boot).
   blazesym's kernel source unconditionally probes /proc/kcore for
   DWARF, which is 0400 root:root + CAP_SYS_RAWIO-gated. perf-agent
   binaries setcap'd with the standard cap_perfmon/cap_bpf set lack
   that, so blazesym returns BLAZE_ERR_PERMISSION_DENIED for every
   batch and kernel frames disappear from the pprof. Setting
   vmlinux="" doesn't help — blazesym still hits kcore.

   Fix: pure-Go /proc/kallsyms symbolizer (symbolize/kallsyms.go) as
   the lockdown-safe fallback. On EPERM from blazesym,
   LocalKernelSymbolizer switches to the kallsyms path for the rest
   of its lifetime (sticky). Resolution loses inline expansion and
   file:line attribution but keeps function name + offset + module
   marker, which is the load-bearing property for flame graphs.
   PERFAGENT_FORCE_KERNEL_FALLBACK=1 lets CI exercise the fallback
   on hosts that don't naturally hit EPERM.

2. When every symbolizer fails, kernel frames are dropped entirely
   instead of being preserved as raw addresses. Caller's pprof loses
   kernel context.

   Fix: rawKernelAddrFrames synthesizes Frame{Name:"0x<hex>",
   Module:"[kernel.kallsyms]", Reason:FailureMissingSymbols} so
   operators can still decode hot kernel frames via /proc/kallsyms
   (the existing awk workaround the spec calls out), and the pprof
   keeps structural shape.

3. perf.data files written by --perf-data-output contain the
   [kernel.kallsyms]_text MMAP2 but no userspace MMAP2 records.
   `perf script` / `perf report` show [unknown] for every user-space
   IP, forcing operators to post-process with the maps.txt + nm hack.

   Fix: perfdata.Writer.AddUserspaceMmaps walks /proc/<pid>/maps via
   the existing procmap.Resolver and emits one PERF_RECORD_MMAP2
   per executable mapping for the target PID. Wired into agent.go
   right after AddKernelMmap. System-wide (-a) is out of scope —
   left as a documented follow-up.

Tests: stub-based unit tests for the fallback ladder and raw-address
preservation (no kallsyms-readable host required), plus four
integration tests including a forced-fallback variant that exercises
the kallsyms path on every host.

* docs(roadmap): post-prod-hardening improvement track

Captures 10 follow-ups surfaced while fixing the v1.2.0 kernel-stacks
production gaps:

- self-profile lane (would have caught the lockdown bug at PR time)
- metrics.Exporter histograms for symbolize hot paths
- /metrics + /debug/pprof HTTP endpoint
- symbolizer error counter by reason (lockdown_eperm, kallsyms_unreadable, ...)
- CI lane that runs kernel-stacks integration with
  PERFAGENT_FORCE_KERNEL_FALLBACK=1
- go test -bench budgets for symbolize
- allocs/sample budget
- system-wide userspace MMAP2 (the -a follow-up to this PR)
- lazy MMAP2 on first new-PID sample
- kthread MMAP / COMM attribution audit

Prioritized so the first three would prevent the next ship-and-pray.

* feat(kernel-stacks): ship roadmap items #2, #5, #8, #10 in same PR

Four follow-ups from docs/post-prod-hardening.md landed together with
the bug fixes — small enough to ship in one commit, large enough to
matter:

#5 CI lockdown lane: new `make test-integration-lockdown` target
   runs the kernel-stacks integration tests with
   PERFAGENT_FORCE_KERNEL_FALLBACK=1, wired into the default `make
   test`. Every CI run now exercises the kallsyms fallback path
   regardless of the host's lockdown state — would have caught the
   original v1.2.0 ship-and-pray.

#2 metrics counters: symbolize.Counters (atomic) tracks
   KernelBatches, KernelInputIPs, KernelBatchFailures,
   KernelFallbackEngaged, KernelRawAddrFrames. Logged at agent
   shutdown via agent.cleanup. Histograms (p50/p99 batch latency)
   left as follow-up (requires reservoir sampling). The
   fallback_engaged counter bumps both on natural EPERM-triggered
   switch AND on env-var forced fallback — operators see the active
   mode from the end-of-run log without scraping.

#8 system-wide userspace MMAP2: emitCommAndMmapsForAllPIDs walks
   /proc at writer init and emits PERF_RECORD_MMAP2 for every
   non-kthread PID's executable mappings (extends Bug 3 single-PID
   fix to -a mode). TestPerfDataUserspaceMmap2_SystemWide verifies
   the walk covers multiple distinct PIDs (saw 9068 on this host).
   Snapshot semantics; lazy emission is roadmap #9.

#10 COMM attribution audit: found that perfdata.Writer.AddComm was
    never called from anywhere in perf-agent — every perf.data
    emitted had zero COMM records. Not just kthreads — all PIDs
    showed up nameless in `perf script`. emitCommForPID reads
    /proc/<pid>/comm and emits PERF_RECORD_COMM for every PID
    enumerated, including kthreads (kvm-pit, vhost-*, kworker/*).
    TestPerfDataUserspaceMmap2 asserts the COMM record appears
    alongside the MMAP2 record for the workload pid.

Modernized three for-loops to Go 1.22 `for i := range N` style
in the modified files (per Go 1.26 project target).

Roadmap doc updated to mark shipped items and reorder remaining
priorities — #1 (self-profile lane) is now the highest-leverage
follow-up.

* feat(bench): self-profile scenario (roadmap #1) — profile the profiler

A second perf-agent profiles the first while it captures a CPU
workload. Catches:

  - perf-agent's own CPU overhead (agent_samples / workload_samples
    at the same sample rate — directly comparable)
  - kernel-symbol resolution rate (fraction of kernel Locations whose
    Function.Name isn't a "0x<hex>" raw-address fallback)

The kernel-resolution canary is what would have surfaced the v1.2.0
lockdown bug at PR time: a sudden drop from ~100% to 0% on lockdown
hosts would have been impossible to miss in CI.

Wiring:
  - new "self" case in bench/cmd/scenario, with --self-duration,
    --cpu-budget, --resolution-budget flags
  - schema.SelfMetrics inlined into Run (omitzero, no effect on
    other scenarios' JSON)
  - countKernelResolution helper + unit tests for the metric logic
  - non-zero exit (3) when budgets breached so CI can gate on it
  - make bench-self target with 1.5 CPU / 0.5 resolution defaults
    (calibrated to catch regressions, not enforce a tight target
    on this codebase as-of-today)

Verified end-to-end: 2 runs on this host showed agent/workload CPU
ratios of 0.38 and 0.85 with 100% kernel resolution — both within
budget, scenario exits 0. Tight budgets (--cpu-budget 0.01) exit 3.

Skips the in-process caps check for "self" because the scenario is
pure orchestration; the perf-agent subprocesses carry their own
file caps.

bench-self Makefile target deliberately depends only on bench-build
(not build) so it doesn't wipe the manually-applied setcap on
perf-agent each invocation.

Roadmap doc updated to mark #1 shipped; remaining items reordered.

* fix(symbolize): userspace fallback chain — discovered via bench-self

Two related bugs surfaced by dogfooding the self-profile scenario
(roadmap #1) against this PR's own perf-agent binary:

A. DebuginfodSymbolizer doesn't fall back to local on NULL.
   When DEBUGINFOD_URLS points at a server that doesn't have the
   target binary's build-id (the normal case for locally-built
   perf-agent), blaze_symbolize_process_abs_addrs returns NULL and
   the dispatcher gave up — every userspace frame became
   <unknown>. Fix: keep a LocalSymbolizer next to the cgo state;
   on NULL, retry through it before erroring.

B. Userspace symbolize EPERM when target has file caps.
   Profiling a setcap'd process (e.g., a second perf-agent) hits
   "permission denied" from blazesym because /proc/<pid>/exe is
   restricted by ptrace_scope when the target is privileged. The
   error propagated as "Failed to symbolize user" and the whole
   batch was dropped. Fix: on any blazesym error,
   LocalSymbolizer.SymbolizeProcess now returns rawUserAddrFrames
   (Name="0x<hex>", Reason=FailureMissingSymbols) — symmetric to
   the kernel-side rawKernelAddrFrames fix in this PR's first
   commit. Stack shape and addresses survive into the pprof so
   operators can recover with addr2line.

Full per-mapping ELF resolution (the proper fix that bypasses
/proc/<pid>/exe entirely by using procmap-supplied paths) is left
as a roadmap follow-up — see docs/post-prod-hardening.md "Findings
from self-profile" section.

Verified end-to-end: profiling perf-agent with perf-agent on a
setcap'd binary went from 100% <unknown> (whole batch dropped) to
named raw-hex frames + structural mapping info; addr2line on those
hex names produced the hot-function ranking that drives the new
roadmap section (cilium/ebpf BTF parse, net.* DNS for debuginfod,
SQLite cache init, etc.). No "permission denied" log spam.

* perf(symbolize, bpfstack): close the dogfood loop — three iterations

Running the self scenario four times in sequence, each iteration
exposed the next bottleneck once the previous one was fixed:

iter 2 → fix per-IP empty-name rendering
  frameFromCSym and fromBlazesymSym left Name="" when blazesym
  resolved the binary but couldn't map a specific IP to a symbol.
  pprof rendered those as <unknown>. Filling with "0x<hex>" +
  FailureMissingSymbols matches the kernel-side behavior shipped
  earlier in this PR. <unknown> went from 76% to ~0%.

iter 3 → fix BPF user-stack kernel leak + speed up /proc/kallsyms
  bpf_get_stackid with BPF_F_USER_STACK can include kernel
  addresses in the user-stack buffer when the sampled task was
  in syscall/irq/fault context. That leak made 40% of "user"
  CPU look like 0xffff...kernel-range hex names in the pprof,
  with the user code below them invisible. bpfstack.SplitUserKernelIPs
  partitions ips by the canonical x86_64 kernel threshold and
  routes the strays to the kernel symbolizer (where they belong).

  Once the leak was fixed, the new top of the profile was
  /proc/kallsyms reading (module_get_kallsym, seq_read_iter,
  vsnprintf — all kernel-side). The kallsyms file is synthesized
  by the kernel: each small read() forces vsnprintf for every
  line. The default 4 KiB scanner buffer meant ~700 read syscalls
  for 3M kallsyms lines. Wrapping the file in a 256 KiB bufio
  Reader cut that ~64x and dropped kallsyms entirely out of the
  top 20.

iter 4 → no fix; steady-state hot paths surface
  do_syscall_64 (42% cum), finish_task_switch (21%), ep_poll
  (10%), ringbuf.readRecord, dwarfagent.aggregateCPUSample,
  Syscall6 / nanotime1 / futex. All intrinsic costs of being
  an eBPF-based sampler. Further wins are diminishing returns.

Roadmap doc captures the iteration log so future contributors see
the dogfood-as-process pattern: ship #1, run it against itself,
let it tell you what to fix next.

Tests: new internal/bpfstack/split_test.go covers SplitUserKernelIPs
edge cases (kernel-at-top leak, no-leak passthrough, all-kernel,
boundary value). All existing unit suites still pass.

* perf(symbolize): allocation-free /proc/kallsyms parser

bench-self iter 5 (A=with --kernel-stacks, B=without) surfaced
strings.Fields + strconv.ParseUint as the top user-side allocation
source: 3M kallsyms lines × multi-string Fields × ParseUint
triggered visible GC pressure (mallocgc / sweepone /
tryDeferToSpanScan in the user-side pprof's top 10).

Replaced with parseKallsymsLine: byte-level scan over the read
buffer, no string allocation per line. Names are copied out once
per kept symbol (mandatory — buffer reuses). Module strings go
through an intern map (typical kernel has tens of modules but
millions of module symbols).

  BenchmarkParseKallsymsLine-16  72630638  16.54 ns/op  0 B/op  0 allocs/op

iter 6 confirmed: user-side allocation hotspots are gone. The
remaining kallsyms cost is kernel-side (module_get_kallsym +
vsnprintf rendering the synthesized /proc/kallsyms file) and
can only be avoided by NOT parsing on every invocation. Roadmap
doc adds a "remaining bottleneck" section proposing a disk-cached
index keyed by kernel boot_id as the next follow-up.

Also lifts strings.Fields-based parsing out of the kallsyms hot
path entirely — net deletion of 2 imports (strconv, strings) from
kallsyms.go.

* perf(symbolize): disk-cached /proc/kallsyms index keyed by boot_id

bench-self iter 6 identified the kernel synthesizing /proc/kallsyms
(module_get_kallsym + vsnprintf in the profile) as the floor —
unfixable from userspace because the file is generated on read.
The fix is to avoid re-reading on every agent invocation.

Cache key: kernel boot_id from /proc/sys/kernel/random/boot_id.
Kernel symbol addresses change only across reboots (KASLR), so
boot_id is exactly the right signal — drops the cache when it
would be wrong, never falsely retains a stale copy.

Format (little-endian):

  header: magic u32 | version u32 | boot_id [16]byte | n_syms u32
  per-symbol: addr u64 | name_len u16 | mod_len u16 | name | module

~50 bytes/symbol, ~11 MiB on disk for this host's filtered kallsyms
(~225k T/t/W/w/i entries). atomic write via temp + rename, so
concurrent readers never see a partial file.

Behavioral flow in newKallsymsSymbolizer:
  1. loadCachedKallsyms → return on hit (boot_id matches)
  2. parseKallsymsFresh on stale / corrupt / missing
  3. best-effort writeKallsymsCache for next invocation

iter 7 dogfood (cold vs warm cache run, same workload + flags):

  metric            cold    warm   delta
  module_get_kallsym 16%    11%    -45%
  vsnprintf (cum)    13%     5%    -67%
  seq_read_iter (cum)40%    n/a    out of top 12
  total CPU       676ms   565ms    -16%

Tests cover roundtrip, stale boot_id rejection, missing file,
and corrupt-file non-fatal handling. cachePathFn and readBootIDFn
are package vars so tests can pin both deterministically.

Next opportunity (next commit): the remaining ~11% module_get_kallsym
in the warm run is blazesym's FAILED kernel-source attempt reading
kallsyms before hitting EPERM on /proc/kcore. A persistent
"blazesym EPERM'd on this boot" marker would let perf-agent skip
that attempt entirely on subsequent invocations.

* perf(symbolize): persistent blazesym-EPERM marker — skip failing attempt

Iter 7 dogfood revealed that even with a warm kallsyms disk cache,
the FIRST blazesym call per agent invocation still cost ~16% of
perf-agent CPU: blazesym opens /proc/kallsyms (the read shows up
as module_get_kallsym + vsnprintf in the profile) before
discovering /proc/kcore is restricted and returning EPERM. The
sticky in-process bit prevents subsequent batches from re-paying,
but every new agent process re-attempts blazesym from scratch.

Fix: a boot_id-scoped marker file at
~/.cache/perf-agent/blazesym-eperm-<hex_boot_id>. Written on the
first observed EPERM. Checked at LocalKernelSymbolizer
construction: if present, fallback is pre-engaged and blazesym is
never invoked. Marker is automatically invalidated by reboot
(boot_id changes), so a kernel upgrade that loosens lockdown is
re-detected on next boot.

iter 8 dogfood (cold → warm cache + warm marker, same workload):

  metric                cold    warm    delta
  module_get_kallsym    16%     ZERO    out of top 12
  vsnprintf (cum)       14%     ZERO    out of top 12
  do_syscall_64 (cum)   52%      0%     down (no failing blazesym calls)
  total perf-agent CPU 505ms   131ms    -74%

Warm pprof now shows intrinsic eBPF-sampler costs only:
finish_task_switch (21%), ep_poll waiting (69% cum), spin_unlock.
Diminishing returns from here — every remaining sample is genuine
"be a profiler" work.

Tests pin XDG_CACHE_HOME via t.Setenv + override readBootIDFn so
the marker lands in a t.TempDir(). Covers: roundtrip,
boot_id-scoped invisibility (post-reboot equivalence), and
filename structurally encodes the boot_id.

* perf(perfdata): lazy MMAP2/COMM on first sample per PID (roadmap #9)

bench-self iter 9 found system-wide capture burning ~30% of
perf-agent CPU in kernel /proc/<pid>/maps rendering
(show_map_vma, mangle_path, lock_next_vma, prepend_path, d_path)
because emitCommAndMmapsForAllPIDs walked every PID at writer
init — ~9000 reads on this host, of which only ~200 produce
samples in a typical capture.

Fix: perfdata.Writer.OnNewPID callback fires the first time
AddSample sees each unique pid (sentinel-filtered: skips 0 and
0xffffffff). agent.go wires it to emitCommForPID +
emitUserspaceMmapsForPID for system-wide mode; the eager walk
is gone. Single-PID mode is unchanged (always knows its target).

iter 11 dogfood (system-wide, --perf-data-output ON):
  perf.data size:  ~3 MB (vs ~40-50 MB the eager walk produced)
  enrolled PIDs:   ~208 lazy (vs ~9000 eager)
  remaining /proc/maps cost: ~15% cum, bounded by sampled-PID
    count rather than host-PID count — natural floor from
    procmap.Resolver's per-batch re-snapshot (spec invariant)
    and dwarfagent's lazy attach.

Bonus: lazy mode also covers PIDs that exec AFTER capture
starts. The eager walk could never see those.

Tests pin the contract: fires once per unique pid (multiple
samples for same pid don't re-trigger); sentinel pids are
filtered; nil callback is a no-op (existing --pid-mode callers
keep working unchanged).

* feat(symbolize): reason-bucketed error counters + bench coverage (roadmap #4, #6)

#4 reason-bucketed counters:
  - KernelLockdownEPERM bumps on each BLAZE_ERR_PERMISSION_DENIED
  - KernelOtherErr bumps on every other blazesym failure
  - CountersSnapshot.String() now logs eperm=N and other_err=N

  Operators can distinguish "canonical lockdown" (high eperm,
  fallback_engaged=1) from "unexpected blazesym failure"
  (any other_err > 0 deserves attention) without grepping logs.

#6 symbolize benchmarks:
  - BenchmarkParseKallsymsFresh (cold-path baseline)
  - BenchmarkLoadCachedKallsyms (warm-path; ~20x faster)
  - BenchmarkResolveKernelIPs (per-IP cost ~35 ns)
  - BenchmarkParseKallsymsLine kept for per-line regression
  - new `make bench-symbolize` target

  The 20x cache speedup is the headline metric. PRs touching
  the symbolize hot path can now show numerical evidence
  rather than vibes.

Roadmap doc reflects shipped status; both items move from
Pending to Shipped in the triage table.

* feat(perfagent): --metrics-listen HTTP endpoint (roadmap #3)

--metrics-listen <addr> starts an in-process HTTP server on the
given address (e.g. 127.0.0.1:7777) exposing:

  /metrics       Prometheus text format of symbolize.Counters
                 (batches, input_ips, batch_failures, fallback_engaged,
                 raw_addr_frames, lockdown_eperm, other_err)
  /debug/pprof/  Go runtime self-pprof — live scrape with
                 `go tool pprof http://host:7777/debug/pprof/profile`

Default off — no port opened, zero CPU overhead when unused.

Lifecycle: server starts after kernelSymbolizer is constructed
(so the metrics handler closure can snapshot live counters);
shutdown happens in Agent.cleanup AFTER the symbolizer is
closed, with a 2s graceful drain.

Tests cover:
  - Prometheus exposition format (HELP + TYPE + value lines per
    metric, all 7 counters present)
  - Lifecycle round-trip against a :0 listener (live scrape,
    Stop returns within grace, post-Stop scrape fails)
  - Empty-addr no-op (--metrics-listen not set → nil server)
  - /debug/pprof/ mount works through the metrics mux

End-to-end verified: agent run with --metrics-listen serves
the canonical Prometheus payload AND a working gzipped pprof
profile (1f 8b 08 magic).

* feat(symbolize): batch-duration histograms + allocs budget (roadmap #2, #7)

#2 batch-duration histograms:
  - LatencyHist: ring buffer (1024 samples) + min/max/sum, with
    p50/p99 on Snapshot via slices.Sort.
  - Wired into SymbolizeKernel: time.Now() on entry, Record on
    deferred exit. Microseconds — readable across warm-cache
    (sub-ms) and cold-CGO (millisecond+) paths.
  - End-of-run log line gains batch_p50_us / batch_p99_us /
    batch_max_us fields.
  - /metrics endpoint exposes three new gauges:
    perf_agent_symbolize_kernel_batch_{p50,p99,max}_microseconds

  Sliding-window semantics: an early burst of slow batches
  doesn't permanently bias the percentile — only the most
  recent 1024 stay in the ring. Tested with the
  TestLatencyHist_RingOverwrite case.

#7 allocs budget:
  - testing.AllocsPerRun gates on the three hottest paths:
    parseKallsymsLine (0 alloc/op), kallsymsSymbolizer.Resolve
    (1 alloc/op = the return slice), LatencyHist.Record
    (0 alloc/op).
  - PRs that reintroduce strings.Fields, strconv.ParseUint, or
    other allocators into these paths will fail the test
    immediately — not just look worse in benchstat.

  Catches the regression class dogfood iter 5 surfaced (mallocgc
  / sweepone in the user-side profile of perf-agent).

Roadmap doc: items #2 and #7 marked shipped (#7 partial — the
symbolize hot path is gated; sample-processing follow-up is
documented).

* chore(perfagent): drop dead emitCommAndMmapsForAllPIDs

CI lint caught it: the helper was the eager-walk path replaced by
the lazy OnNewPID hook in 405611a (roadmap #9). emitCommForPID
and emitUserspaceMmapsForPID are now the only callers — both
already exported through the lazy callback. Dead.

* docs(README): drop GPU-profiling experimental notice

GPU profiling lives on its own track (and its own worktree). The
README banner was conflating it with the stable CPU / off-CPU /
PMU paths; dropping the line keeps the front page honest about
what this build delivers.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants