feat(inject): Python perf-trampoline injector via ptrace#12
Conversation
CPython 3.12+ ships an opt-in perf-compatible trampoline that emits
JIT entries to /tmp/perf-<pid>.map. Today perf-agent users must
restart targets with `python -X perf` to get Python frame names —
hostile UX for production debugging.
This spec proposes a v1 injector that uses ptrace to remotely call
sys.activate_stack_trampoline("perf") on running Python 3.12+
processes when the user passes --profile --inject-python, and
deactivates on shutdown so trampoline overhead doesn't persist
past the profiling window.
Decisions reached during brainstorming (full Q&A in spec):
- 3.12+ only; older versions fall back to existing DWARF unwinding
- Profiling-time flag (--profile --inject-python), not subcommand
- Strict per-PID, lenient system-wide
- Pure Go via golang.org/x/sys/unix
- Both amd64 and arm64 in v1
- SONAME-or-exe detection + debug/elf for symbol resolution
- Inject before BPF attach
- Three remote calls per activation: PyGILState_Ensure →
PyRun_SimpleString(payload) → PyGILState_Release; SIGSEGV-at-0
return sentinel; stack-with-mmap-fallback for payload bytes
- Idempotency via /tmp/perf-PID.map preexisting check
- 5s bounded shutdown deactivation
Move from flat inject/ (Python-only assumption) to a tree that splits
language-agnostic primitives from language-specific logic:
inject/
elfsym/ (shared — ELF symbol resolution, any compiled runtime)
ptraceop/ (shared — low-level ptrace primitives, any C-ABI runtime)
python/ (Python-specific: detector, manager, payload, errors)
Future runtimes (Node.js, JVM, Ruby) plug in as siblings of python/
without churn in the shared layer. Some runtimes may not use ptrace
at all — e.g., a Node.js injector via the V8 inspector protocol would
sit alongside ptraceop/ without using it. The layout supports both.
Public API change: import "inject/python"; python.NewManager(...)
ptraceop has no dependency on python — it takes primitive SymbolAddrs
to avoid an import cycle, which has the side benefit of making
ptraceop trivially reusable for non-Python runtimes that need a
brief remote-function-call sequence.
Test paths and InjectStats accessor renamed to PythonInjectStats
to reserve the namespace for future runtimes.
12 tasks from sentinel-error scaffolding through caps-gated integration
tests:
1. errors + payload encoding (inject/python)
2. SONAME parsing (inject/elfsym/soname.go)
3. ELF symbol resolver (inject/elfsym/elfsym.go)
4. detection ladder (inject/python/detector.go)
5. ptraceop arch-generic + amd64 (inject/ptraceop)
6. ptraceop arm64
7. Manager + lifecycle + dedupe (inject/python/{python,manager}.go)
8. wire into perfagent.Agent + main.go --inject-python flag
9. mmap-watcher new-exec hook for late arrivals
10. bench harness flag
11. integration tests (caps-gated)
12. docs (README + docs/python-profiling.md)
Each task is TDD-shaped with bite-sized steps. Tests exercise stubs
for ptrace operations (real ptrace only in integration tests, since
unit tests can't fake a target process safely). The plan flags one
intentional placeholder: the haveCaps helper in Task 11 must be
copied from existing integration_test.go conventions.
Hard requirement made explicit: producer must do zero work when no subscriber is registered. Two patterns documented (callback or Subscribe-on-demand channel). Added TestNewExec_NoSubscribersZeroDispatch to guard the invariant against refactors.
Add --inject-python CLI flag that injects sys.activate_stack_trampoline('perf')
into CPython 3.12+ targets via ptrace before BPF attach, so early samples carry
Python frame names. Strict per-PID mode with --pid; lenient with -a. Validates
that --inject-python requires --profile and CAP_SYS_PTRACE.
… profiling) Remove PreexistingMarkerCheck, defaultPreexistingMarkerCheck, SkippedPreexisting, and the preexisting field on trackedTarget. Always activate and always deactivate: the marker check conflated "trampoline active now" with "file left from prior run", causing Run 2+ in continuous profiling to silently skip activation. Remove ErrPreexisting (now unused). Update package doc to document the idempotent activation model and the -X perf tradeoff.
When ptrace attaches to a target mid-syscall (e.g. nanosleep), the kernel
records the syscall number in orig_rax. On PtraceCont the kernel runs
syscall-return processing, which clobbers RAX with -ERESTART or similar,
overwriting our remote-call return value. The pyrasite/py-spy/manhole
convention is to set orig_rax = -1 ('not in a syscall') in the call
frame to disable that processing.
Symptom: PyRun_SimpleString remote call returned 230 deterministically
on a Python target attached during time.sleep(). With the fix, all three
remote calls (PyGILState_Ensure → PyRun_SimpleString → PyGILState_Release)
return their actual values cleanly.
Verified end-to-end: integration test now produces a perf-map with
py:: entries and a profile with Python qualnames (Thread.run,
cpu_work, etc.) from a Python 3.14 target launched without -X perf.
The pre-flight check on _PyPerf_Callbacks was meant as a free 'is the trampoline compiled in?' filter, but it does not survive real-world packaging: distributors (Fedora, Ubuntu, ...) strip internal/private symbols from production libpython builds, so the symbol is absent from .dynsym/.symtab even when --enable-perf-trampoline was set at compile time. Result: detector wrongly skipped Python 3.12+ on standard distro builds with ErrNoPerfTrampoline. We now rely on the runtime return value of PyRun_SimpleString — if the trampoline isn't available, the call returns -1 and the manager surfaces ErrNoPerfTrampoline at activation time. Cost: one extra ptrace round-trip per genuinely-unsupported target during scan. Verified on Fedora's stock python3.14 (libpython3.14.so.1.0 with all internal symbols stripped) — detector now lets it through and activation succeeds.
requirePython312Plus now also runs sys.activate_stack_trampoline + deactivate_stack_trampoline at the command line to confirm the build actually supports the trampoline. Hosts where Python is 3.12+ but the build lacks --enable-perf-trampoline (rare, mostly custom builds) will skip cleanly instead of failing.
errcheck flagged unchecked defer f.Close() in inject/elfsym,
inject/ptraceop, inject/python/detector. Wrap with the project's
existing `defer func() { _ = f.Close() }()` idiom.
staticcheck QF1006 in tracker_newexec_test: lift the break condition
into the for header.
There was a problem hiding this comment.
Pull request overview
This PR adds a Python “perf trampoline” injector to perf-agent, enabling --profile --inject-python to activate/deactivate CPython 3.12+’s sys.activate_stack_trampoline('perf') on already-running processes via ptrace, with a system-wide late-process hook via the mmap watcher.
Changes:
- Introduces a new
inject/package tree (elfsym,ptraceop,python) for detection + remote-call injection + lifecycle management. - Wires
--inject-pythonthrough CLI →perfagent.Agent→dwarfagent/ehmapsto support late execs in-amode. - Adds unit tests, integration tests, bench plumbing, and documentation for Python injection.
Reviewed changes
Copilot reviewed 31 out of 31 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| unwind/ehmaps/tracker_newexec_test.go | Adds tests for lazy new-exec hook dispatch behavior |
| unwind/ehmaps/tracker.go | Adds onNewExec callback + dispatch on group-leader fork events |
| unwind/dwarfagent/hooks.go | Extends dwarfagent Hooks with OnNewExec callback |
| unwind/dwarfagent/common.go | Registers OnNewExec hook onto PIDTracker before Run starts |
| test/integration_inject_python_test.go | Adds caps-gated integration tests for Python injection behavior |
| perfagent/options.go | Adds config + option for InjectPython |
| perfagent/agent.go | Constructs python Manager, runs activate/deactivate, wires dwarf hooks, adds PID scan |
| main.go | Adds --inject-python flag and forwards option |
| inject/python/python.go | Defines Manager/Options/Stats and lifecycle entry points |
| inject/python/manager.go | Implements activation, skip/summary logic, and ESRCH detection |
| inject/python/manager_test.go | Unit tests for strict/lenient policy, dedupe, deadline behavior, ESRCH tolerance |
| inject/python/payload.go | Defines activate/deactivate payload bytes |
| inject/python/payload_test.go | Tests payload encoding/termination |
| inject/python/errors.go | Defines sentinel errors for detection/activation outcomes |
| inject/python/detector.go | /proc-based detector + ELF symbol resolution for libpython/exe |
| inject/python/detector_test.go | Unit tests for detector behavior with synthetic /proc trees |
| inject/ptraceop/ptraceop.go | Implements ptrace attach → 3-call sequence → restore/detach + stack discovery |
| inject/ptraceop/regs_amd64.go | amd64 call-frame setup / return extraction helpers |
| inject/ptraceop/regs_amd64_test.go | Unit tests for amd64 call-frame setup helpers |
| inject/ptraceop/regs_arm64.go | arm64 call-frame setup / return extraction helpers |
| inject/ptraceop/regs_arm64_test.go | Unit tests for arm64 call-frame setup helpers |
| inject/elfsym/soname.go | libpython SONAME parsing + 3.12+ gating helper |
| inject/elfsym/soname_test.go | Unit tests for SONAME parsing + version gating |
| inject/elfsym/elfsym.go | ELF symbol resolver (.dynsym first, .symtab fallback) |
| inject/elfsym/elfsym_test.go | Unit tests for ELF symbol resolution against host libc |
| docs/superpowers/specs/2026-04-28-python-perf-injector-design.md | Adds detailed injector design doc |
| docs/superpowers/plans/2026-04-28-python-perf-injector.md | Adds implementation plan doc |
| docs/python-profiling.md | Adds user documentation for Python injection |
| bench/internal/schema/schema.go | Records inject_python in bench output schema |
| bench/cmd/scenario/main.go | Adds --inject-python flag to bench harness and plumbs into scenarios |
| README.md | Adds README section describing profiling running Python processes |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // DeactivateAll runs the bounded shutdown deactivation pass. Tolerates ESRCH | ||
| // (process gone). Honors ctx cancellation AND the configured deactivation | ||
| // deadline (5s default). | ||
| func (m *Manager) DeactivateAll(ctx context.Context) { | ||
| deadline, cancel := context.WithTimeout(ctx, m.opts.DeactivateDeadline) | ||
| defer cancel() | ||
|
|
||
| snapshot := m.snapshotTracked() | ||
| for pid, tt := range snapshot { | ||
| select { | ||
| case <-deadline.Done(): | ||
| m.log.Warn("python deactivate cancelled (deadline or ctx)", | ||
| "abandoned", len(snapshot)-int(m.stats.Deactivated.Load())) | ||
| return | ||
| default: | ||
| } | ||
| addrs := SymbolAddrsForTarget{ | ||
| PyGILEnsure: tt.target.PyGILEnsureAddr, | ||
| PyGILRelease: tt.target.PyGILReleaseAddr, | ||
| PyRunString: tt.target.PyRunStringAddr, | ||
| } | ||
| if err := m.opts.Injector.RemoteDeactivate(pid, addrs); err != nil { | ||
| if isProcessGone(err) { | ||
| continue | ||
| } | ||
| m.log.Warn("python deactivate failed", "pid", pid, "err", err) | ||
| m.stats.DeactivateFailed.Add(1) | ||
| continue |
There was a problem hiding this comment.
DeactivateAll's timeout only gets checked between iterations; a single RemoteDeactivate call can still block indefinitely (e.g., ptrace attach/wait on an uninterruptible-sleep target), so this doesn't actually cap total shutdown time despite the comment. Consider running each RemoteDeactivate in a separate goroutine and returning once the deadline fires (even if a goroutine remains stuck), or extend the LowLevelInjector API to accept a context/deadline that can be enforced at the low level.
| if runErr != nil { | ||
| return fmt.Errorf("PyRun_SimpleString: %w", runErr) | ||
| } | ||
| if runResult != 0 { | ||
| return fmt.Errorf("PyRun_SimpleString returned non-zero: %d (likely activation refused at runtime)", runResult) | ||
| } |
There was a problem hiding this comment.
When PyRun_SimpleString returns non-zero, this returns a generic fmt.Errorf. The higher-level python.Manager expects to be able to classify this as ErrNoPerfTrampoline (and expose a structured skip reason / increment SkippedNoTramp), but ptraceop can't currently communicate this in a machine-readable way. Consider returning a typed error (or ptraceop-level sentinel) carrying the remote return code so the python bridge/Manager can translate it into python.ErrNoPerfTrampoline and treat it as a skip rather than a generic activation failure.
| addrs := SymbolAddrsForTarget{ | ||
| PyGILEnsure: target.PyGILEnsureAddr, | ||
| PyGILRelease: target.PyGILReleaseAddr, | ||
| PyRunString: target.PyRunStringAddr, | ||
| } | ||
| if err := m.opts.Injector.RemoteActivate(pid, addrs); err != nil { | ||
| m.stats.ActivateFailed.Add(1) | ||
| m.log.Warn("python inject failed", "pid", pid, "err", err) | ||
| if m.opts.StrictPerPID { | ||
| return fmt.Errorf("activate pid=%d: %w", pid, err) | ||
| } | ||
| return nil |
There was a problem hiding this comment.
activateOne increments ActivateFailed for any RemoteActivate error, so failures due to missing perf-trampoline support (PyRun_SimpleString non-zero) won't be counted as SkippedNoTramp / reason=no_perf_trampoline even though ErrNoPerfTrampoline exists and recordSkipReason handles it. Once the low-level injector returns a typed/sentinel error for 'no perf trampoline', consider mapping it to python.ErrNoPerfTrampoline here and counting it as a skip (and, in lenient mode, not as a generic failure) so stats/reasons match the documented behavior.
| ctx, cancel := context.WithCancel(t.Context()) | ||
| done := make(chan struct{}) | ||
| go func() { | ||
| tracker.Run(ctx, w) | ||
| close(done) | ||
| }() | ||
|
|
||
| // Send a group-leader ForkEvent (pid == tid). | ||
| w.ch <- MmapEventRecord{Kind: ForkEvent, PID: 99, TID: 99} | ||
| // Give Run time to process the event. | ||
| time.Sleep(50 * time.Millisecond) | ||
| cancel() | ||
| <-done |
There was a problem hiding this comment.
TestNewExec_NoSubscribersZeroDispatch doesn't actually prove the ForkEvent was processed: even if Run never consumes from w.ch before the cancel, the test still passes because onNewExec remains nil. To make this a real guardrail, use Run's observer hook to signal when the ForkEvent is observed/handled (e.g., close a channel from an observer when ev.Kind==ForkEvent && ev.PID==99) and wait on that instead of time.Sleep.
scanForLibpython and scanForExeBase picked the first executable mapping as load_base. That assumes the executable PT_LOAD segment is the first one (file offset 0). Modern glibc (Ubuntu 24.04+, default for actions/setup-python's CPython 3.12 build) uses -Wl,-z,separate-code, which produces an ELF with a leading r-- LOAD segment (header + RELRO) at offset 0 and the r-x .text segment at a non-zero offset. Old logic returned start_of_text as load_base, off by the size of the first segment. Every absolute symbol address derived from it was wrong by that delta, so the remote call's RIP jumped into garbage; the target SEGVed mid-execution and we read its pre-SEGV RAX — which still held the kernel's -ERESTARTNOHAND (-514) from when ptrace attached during clock_nanosleep. Symptom: TestInjectPython_ActivatesTrampoline failed deterministically on GHA runners (Ubuntu 24.04, Python 3.12.13) with "PyRun_SimpleString returned non-zero: 18446744073709551102" while passing on Fedora hosts (Python 3.14, executable-first layout). Fix: filter mappings by file offset == 0 to anchor on the first PT_LOAD segment regardless of perms. Both static and dynamic detection paths get the same fix. Also: - ptraceop: validate post-call RIP == sentinel(0); if SIGSEGV fired at a non-zero address, the function crashed mid-execution — surface that with the actual fnAddr instead of returning whatever was in RAX. - python/manager: include libpython, load_base, py_run_string in the "python inject failed" warn — makes future regressions debuggable from a single CI log line. - detector_test: TestDetect_DynamicLinkedPython_SeparateCodeLayout covers the leading-r-- case directly.
Previous fix (filter mappings by file offset==0) gave the right
mapping_start, but the detector still treated mapping_start as the
load_bias. That works for PIE binaries with first-LOAD p_vaddr==0
(PIE convention) — and exclusively that case happens on Fedora local
hosts, hence the false-pass locally. It is wrong for non-PIE ET_EXEC
binaries: Ubuntu /usr/bin/python3.12 is a non-PIE executable with
first-LOAD p_vaddr=0x400000 and absolute symbol values; adding
mapping_start (0x400000) to an already-absolute sym.Value pushes RIP
past the binary's last mapping → SIGSEGV.
The right formula:
load_bias = mapping_start - p_vaddr_of_first_LOAD
abs_addr = load_bias + sym.Value
For PIE p_vaddr==0 it collapses to mapping_start (current behavior,
unchanged). For non-PIE p_vaddr=0x400000 it correctly resolves to 0.
CI evidence: previous build's diagnostic showed
loadbase=0x400000, fnAddr=0xab27d0, SIGSEGV at fnAddr — proving fnAddr
was off the binary's executable text. With load_bias=0 we get
fnAddr=sym.Value, which lands inside .text.
Same ehmaps/LoadProcessMappings already does this for unwind tables;
inject/python now matches the convention.
…as SkippedNoTramp ptraceop.RemoteActivate / RemoteDeactivate previously returned a generic fmt.Errorf when the called function (PyRun_SimpleString) completed the SIGSEGV-sentinel handshake but returned non-zero. The python.Manager had no machine-readable way to distinguish that "Python-level error" case from any other ptrace failure, so every such case incremented ActivateFailed even though commit 384b01a's design intent was to surface ErrNoPerfTrampoline at activation time after dropping the _PyPerf_Callbacks pre-flight. Wiring delivered: ptraceop: return *ErrRemoteCallNonZero{Op, Result} on non-zero return. ptraceop stays language-agnostic — it does not import inject/python and does not interpret the int. perfagent: ptraceopBridge translates *ErrRemoteCallNonZero with Result that sign-extends to int32(-1) into a wrap of python.ErrNoPerfTrampoline. The bridge is the only piece that knows both layers, which keeps the existing import boundary intact. python: activateOne checks errors.Is(err, ErrNoPerfTrampoline) and classifies via recordSkipReason → SkippedNoTramp + reason=no_perf_trampoline, mirroring detect-time skips. Strict mode wraps with the sentinel so callers can errors.Is. Also clarify NewManager's doc — Detector and Injector are not defaulted inside this package (which deliberately doesn't import ptraceop); they are caller-injected by perfagent.Agent. Calling activate/deactivate with either nil panics at first use, intentionally — a silent default would mask wiring mistakes. Rename TestPayloadsAreImmutable → TestActivatePayloadIsConsistent and clarify what it actually asserts (consistency between consecutive calls, not in-place mutation safety; the slice is intentionally shared). Tests: ptraceop: errors.As round-trip + Error() formatting for the typed err. python: SkippedNoTramp counted (not ActivateFailed) on lenient path; strict path returns wrap that errors.Is the sentinel.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 32 out of 32 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
bench/cmd/scenario/main.go:163
- The new --inject-python bench flag is plumbed through function parameters (e.g., measureOnePID(..., injectPython bool)) but never used to alter profiler construction or hooks. As-is, the flag only changes the emitted JSON field and has no behavioral effect. Either wire it into the benchmarked code path (e.g., by constructing a perfagent.Agent with WithInjectPython / appropriate dwarf hooks), or remove the flag/parameters until it can be supported to avoid misleading results.
// measureOnePID times one NewProfilerWithMode(pid=...) call and the
// per-binary breakdown collected via OnCompile.
func measureOnePID(pid, runN int, unwind string, injectPython bool) schema.Run {
var (
mu sync.Mutex
entries []schema.Binary
)
hooks := &dwarfagent.Hooks{
OnCompile: func(path, buildID string, ehFrameBytes int, dur time.Duration) {
mu.Lock()
defer mu.Unlock()
entries = append(entries, schema.Binary{
Path: path,
BuildID: buildID,
EhFrameBytes: ehFrameBytes,
CompileMs: float64(dur.Microseconds()) / 1000.0,
})
},
}
t0 := time.Now()
prof, err := dwarfagent.NewProfilerWithMode(pid, false, []uint{0}, nil, 99, hooks, modeFromFlag(unwind))
totalMs := float64(time.Since(t0).Microseconds()) / 1000.0
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…sserts
Five separate review items folded into one commit (each independent
within its file):
* perfagent/agent.go: hasCapSysPtrace
- Nil-check cap.GetProc() before dereferencing (mirrors the same
guard in test/integration_test.go).
- Check Permitted ∥ Effective. validate() runs before Start()
promotes Permitted → Effective via SetFlag, so Effective-only
rejected legitimate runs where the cap was granted via setcap or
inherited but not yet promoted.
* inject/python/python.go: NewManager
- Default opts.Detector = NewDetector("/proc", logger) when nil
(zero-cost, both live in this package — no new import).
- Panic at construction with a clear message if opts.Injector is
nil instead of letting the deeper NPE fire later in activateOne.
inject/python deliberately does not import inject/ptraceop so
Injector cannot be defaulted; surfacing the wiring mistake at
NewManager-time is friendlier than at first activate.
* bench/cmd/scenario, bench/internal/schema: drop --inject-python
- The flag was plumbed through measureOnePID/measureSystemWide
parameters and emitted into the JSON, but no codepath in the bench
actually invokes Python injection — measure* constructs
dwarfagent.NewProfilerWithMode directly instead of going through
perfagent.Agent (which owns the python.Manager wiring). Reporting
"with injection" results identical to "without" is misleading;
drop the flag, parameter, and schema field. Re-add when the bench
is reworked around perfagent.Agent. Comment in main.go records
the intent.
* test/integration_inject_python_test.go: TestInjectPython_*
- TestInjectPython_ActivatesTrampoline: handle os.ReadFile error with
t.Fatalf so a transient read failure produces a clear message
instead of silently asserting against an empty buffer.
- TestInjectPython_StrictFailsOnNonPython: drop the bare "python"
substring fallback that would mask unrelated failures (missing
CAP_SYS_PTRACE, validate-time flag rejection, etc.) and assert
only on the structured "not python" reason tokens. Match
case-insensitive because slog/fmt.Errorf chains differ in casing.
The committed architecture.excalidraw was the original sketch from the FP-only era — it still showed only profile/, offcpu/, cpu/, perf.bpf.c, offcpu.bpf.c, and cpu.bpf.c. Since then the codebase has grown: - unwind/dwarfagent/ (DWARF hybrid CPU + off-CPU walker, PRs #1–#11) - unwind/{ehcompile,ehmaps,procmap} (CFI compile, per-PID lifecycle, /proc resolver — PRs #3–#9) - internal/perfevent (per-CPU perf_event_open + AttachRawLink shared helper extracted from profile/dwarfagent — PR #13) - inject/{python,ptraceop,elfsym} (Python perf-trampoline activator via ptrace — PR #12) - internal/{nspid,k8slabels} (namespace-aware --pid + k8s pprof labels — PR #14) - bpf/perf_dwarf.bpf.c, bpf/offcpu_dwarf.bpf.c (DWARF kernel-side) The new file groups them into pre-profile setup (optional, dashed — nspid, k8slabels, inject), profilers (FP, DWARF, PMU), helpers (perfevent, ehcompile, ehmaps, procmap, pprof), and the four BPF programs in the kernel band, with arrows showing data + control flow. Output lands in *-on-cpu.pb.gz / *-off-cpu.pb.gz / PMU stdout-or-file. Layout was generated programmatically; open the file in https://excalidraw.com or the VS Code extension to fine-tune positions and colours. The README's ASCII version of the same diagram is unchanged — both exist intentionally so readers can grok the architecture without opening Excalidraw, and the .excalidraw is the editable source for future iterations.
The committed architecture.excalidraw was the original sketch from the FP-only era — it still showed only profile/, offcpu/, cpu/, perf.bpf.c, offcpu.bpf.c, and cpu.bpf.c. Since then the codebase has grown: - unwind/dwarfagent/ (DWARF hybrid CPU + off-CPU walker, PRs #1–#11) - unwind/{ehcompile,ehmaps,procmap} (CFI compile, per-PID lifecycle, /proc resolver — PRs #3–#9) - internal/perfevent (per-CPU perf_event_open + AttachRawLink shared helper extracted from profile/dwarfagent — PR #13) - inject/{python,ptraceop,elfsym} (Python perf-trampoline activator via ptrace — PR #12) - internal/{nspid,k8slabels} (namespace-aware --pid + k8s pprof labels — PR #14) - bpf/perf_dwarf.bpf.c, bpf/offcpu_dwarf.bpf.c (DWARF kernel-side) The new file groups them into pre-profile setup (optional, dashed — nspid, k8slabels, inject), profilers (FP, DWARF, PMU), helpers (perfevent, ehcompile, ehmaps, procmap, pprof), and the four BPF programs in the kernel band, with arrows showing data + control flow. Output lands in *-on-cpu.pb.gz / *-off-cpu.pb.gz / PMU stdout-or-file. Layout was generated programmatically; open the file in https://excalidraw.com or the VS Code extension to fine-tune positions and colours. The README's ASCII version of the same diagram is unchanged — both exist intentionally so readers can grok the architecture without opening Excalidraw, and the .excalidraw is the editable source for future iterations.
Summary
perf-agent --profile --inject-pythonto activate CPython 3.12+'s perf trampoline (sys.activate_stack_trampoline('perf')) on a running target via ptrace, so users get Python JIT names without restarting the target withpython -X perf.-amode is lenient (failures logged + counted, profile continues).-aruns are picked up via a new lazy mmap-watcher hook.Architecture
The split anticipates future runtime injectors (Node.js inspector protocol, JVM perf-map-agent) plugging in as siblings of
python/without touching the shared layer.Lifecycle
Agent.Start()→python.Manager.ActivateAll(pids)walks /proc, identifies Python 3.12+ via libpython SONAME + ELF symbols, ptrace-attaches each, runs three remote calls (PyGILState_Ensure→PyRun_SimpleString(payload)→PyGILState_Release) inside a single ptrace session, detaches.Agent.Stop()→python.Manager.DeactivateAll(ctx)runs the same dance with the deactivate payload, capped at 5s total to prevent shutdown hangs on uninterruptible-sleep targets.Activation is idempotent — continuous profiling pattern works without cleanup. perf-map files persist between runs (we don't delete them); blazesym reads them on demand for symbol resolution.
Lazy mmap-watcher hook
For
-amode, new exec events firepython.Manager.ActivateLate(pid). The hook is registered viaPIDTracker.SetOnNewExec(fn)only when--inject-pythonis on; with the flag off, the producer does a single nil check and skips dispatch entirely (zero overhead).Critical correctness fixes landed during review/testing
runtime.LockOSThreadinrunSequence: on Linux, only the OS thread that issuedPTRACE_ATTACHmay issue subsequent ptrace ops on the tracee. Without pinning, Go's scheduler could migrate the goroutine mid-sequence and the next ptrace op would fail withEPERM/ESRCH— or worse, target a different stopped tracee.PtraceSetRegsbefore deferred detach: on error paths after a remote call, the target's PC was 0 (the SIGSEGV sentinel); without restoring registers before detach, the target would resume from PC=0 and immediately segfault. The deferred restore guarantees clean target state on every exit path./proc/<pid>/mapstruncation fix: replaced 64KB fixed read withbufio.Scanner+ 4MB max — Python apps with hundreds of mappings would silently miss the[stack]line.Orig_rax = -1in the call frame: when ptrace attaches mid-syscall (e.g.nanosleep), the kernel runs syscall-return processing onPtraceContand clobbersRAX. Settingorig_rax = -1disables this. (Standard pyrasite/py-spy/manhole workaround.)_PyPerf_Callbackspre-flight check: distributors strip internal symbols from production libpython, so this 'is the trampoline compiled in?' filter wrongly rejected stock Fedora/Ubuntu Python builds. We now rely on the runtime return value ofPyRun_SimpleString.Test plan
Unit
inject/elfsym/— SONAME parsing + ELF symbol resolution against host libc.so.6.inject/ptraceop/— register frame setup for amd64 + arm64 (build-tagged), alignment validation, return-value extraction.inject/python/— detection ladder (synthetic /proc trees), Manager strict/lenient policy, dedupe under concurrency, bounded shutdown.unwind/ehmaps/— lazy new-exec hook (TestNewExec_NoSubscribersZeroDispatch,TestNewExec_HookFires).Integration (caps-gated, verified locally)
TestInjectPython_ActivatesTrampoline— Python 3.14 launched WITHOUT-X perf,--inject-pythonactivates trampoline, profile containspy::Thread.run,py::cpu_work, etc.; deactivation fires at profile end (perf-map size stops growing).TestInjectPython_StrictFailsOnNonPython— Go binary target, perf-agent exits non-zero withnot_pythonreason.Cross-arch
GOOS=linux GOARCH=arm64 go build ./inject/ptraceop/...— clean.Lint/vet
gofmt -lclean across all touched packages.go vetclean (with CGO env for blazesym).go test -race ./inject/python/... ./unwind/ehmaps/....Documentation
README.md— new section "Profiling running Python processes" with worked example.docs/python-profiling.md— deeper dive: how it works, when injection is skipped (table), continuous profiling, container caveats, performance impact, troubleshooting.Build / run
Out of scope (deliberate non-goals for v1)
--enable-perf-trampoline(skipped at activation time).-X perf-launched targets will get deactivated on profile end — documented tradeoff for users who want-X perfalways-on).