fix(debuginfod): per-mapping classifier + file-mode symbolization (cache layout fix)#22
Merged
Conversation
Address the v1.1.0 gap where stripped binaries without .gnu_debuglink silently miss off-box symbolization. blazesym's split-debug lookup is gated on debug-link presence and ignores debug_dirs for build-id walks, so the cached .debug at <cacheDir>/.build-id/NN/REST.debug is never consulted for the common Rust/Go release-build case. Industry references (Parca, Pyroscope, OpenTelemetry eBPF profiler) all solve this the same way: normalize PID+vaddr → file-VA in Go using /proc/<pid>/maps + ELF program headers, then symbolize against the fetched .debug file directly via blazesym's elf-mode API. This bypasses the debug-link/build-id resolver entirely. The spec adopts that pattern: per-mapping classifier picks process-mode vs file-mode, AddressMapper (ported from OTel's pfelf.AddressMapper) handles normalization including the page-alignment correctness fix, new cgo wrapper calls blaze_symbolize_elf_virt_offsets. Dispatcher Case 3 (the buggy v1.1.0 path) is removed. Cache layout unchanged. No CLI/wire changes. Integration tests cover stripped Rust + stripped Go workloads end-to-end through a hermetic debuginfod docker container.
Six defects fixed inline: High 1 (Case 3 panic conflicts with file-mode demotion): - Removed panic-guard plan. Case 3 stays but its semantics change to "no-fetch fallback": return "" without fetching. The classifier owns fetch decisions; dispatcher is fail-open for any mapping that reaches process-mode (whether by demotion or transient race). High 2 (per-PID classification cache had no real invalidation path): - Dropped persistent pidClass entirely. Re-classify on every Symbolize call against a fresh /proc/<pid>/maps snapshot. Content- addressed caches survive: AddressMapper keyed by (path, ino), negFetch keyed by build-id. Mappings can change across calls (mmap/dlopen/exec); fresh snapshot is the correct invariant. Medium 3 (no-build-id inconsistency): - Standardized on "process-mode" everywhere. .symtab may still resolve via blazesym defaults. Skip is reserved for [vdso]/ [stack]/[vsyscall]/empty paths. Medium 4 (G2 — local /usr/lib/debug/.build-id check): - Added explicit check for /usr/lib/debug/.build-id/NN/REST.debug in the classifier before deciding to fetch. Distro debuginfo without .gnu_debuglink is reused locally; no remote fetch. Low 5 (Go test recipe with -w -s contradicts uploading DWARF): - Removed -w/-s from Go workload build command. Plain `go build` emits DWARF; objcopy --strip-all leaves build-id but removes DWARF + symtab from the binary, mirroring the Rust test. Low 6 (--kernel-stacks=off not parseable): - Removed --kernel-stacks from test commands (defaults to false). Debuginfod tests don't exercise kernel symbolization.
Four defects fixed:
High (sidecar / mount-namespace):
- Classification and AddressMapper now go through a two-step path
resolution (map_files preferred, symbolic path fallback), mirroring
the existing readBuildID policy. procmap.Mapping gains MapFiles +
OpenablePath() helper. New integration test for the deleted-but-
still-mapped binary case proves symbols resolve when the symbolic
path is unreachable.
Medium (architecture vs classification re-snapshot contradiction):
- Architecture diagram now says "re-snapshot /proc/<pid>/maps every
call" matching the classification + concurrency sections.
Medium (component-table "Remove Case 3" contradicting dispatcher text):
- Component table now says "Reframe Case 3 as no-fetch fallback".
Adds entries for the procmap changes (Mapping.MapFiles +
OpenablePath helper + tests).
Medium (--symbol-fail-closed contradictory mentions):
- Failure-modes table no longer claims fail-closed gains new behavior.
Migration note unchanged ("semantics unchanged" is now accurate).
Added explicit follow-up entry: wire fail-closed end-to-end in a
separate PR.
Low (classifier pseudocode dangling file-mode line):
- Pseudocode rewritten with explicit fetch-succeeds / fetch-fails
branches matching the failure table.
Five defects fixed:
High (file-mode Frame.Address must be original IP, not virt-offset):
- symbolizeElfVirt signature now takes both originalIPs and virtOffsets;
Frame.Address is rewritten to originalIPs[i] after symbolization. The
rewrite propagates to every entry in the inlined chain. Without this,
pprof's resolver routes locations to the synthetic mapping 0 instead
of the real binary mapping. New TestFileModeFrameAddressPreservesMapping
guards the contract end-to-end (asserts Location.Address ∈
Mapping.Start..Limit and Mapping.BuildID matches the workload).
Medium (parse-failure demotion bypassed by classifier ordering):
- Split into two cache structures: negFetch (build-id, "don't re-fetch")
and badDebug (build-id, "don't open this file again"). The classifier
checks badDebug FIRST in Tier 3a, before any cache/local-path lookup,
so a corrupt cached .debug never gets re-selected. badDebug entries
expire on TTL OR when (dev, ino, mtime) of the bad path changes.
Medium (hasResolvableDebuglink with map_files path):
- Debug-link search is filesystem-relative — only meaningful with the
symbolic path. Classifier now passes mapping.Path (not OpenablePath)
to hasResolvableDebuglink. New test guards the path selection.
Medium (unreachable-path branch: skip vs process-mode):
- Unified to process-mode for "file-like path but unreadable" (preserves
frame counts; blazesym defaults can still emit [binary]:offset). Skip
is reserved for [vdso]/[stack]/[vsyscall]/[heap]/empty paths only,
where no symbolization is possible.
Low (mapperKey inconsistency):
- Both places now say mapperKey{dev: uint64, ino: uint64}. Concurrency
section gains the badDebug map; stats counters renamed for clarity
(fileModeLocalHits added).
Four defects fixed:
Medium (badDebug build-id key blocks valid sibling copies):
- Re-keyed badDebug by pathSig{dev, ino, mtime} of the specific failing
file, not by build-id. Tier 3 flow rewritten as candidate iteration:
for each candidate path (systemPath, cache), check its signature
against badDebug; demoted candidates fall through to the next. A
corrupt cache copy no longer blocks a valid /usr/lib/debug/... copy
for the same build-id.
Medium (Resolver.populate doesn't read build-id via map_files):
- Component table now includes unwind/procmap/resolver.go: populate()
must pass both MapFiles and Path to buildIDFor so pprof
Mapping.BuildID is populated in sidecar / deleted-binary cases.
TestStrippedSidecarUnreachableSymbolicPath now asserts
Mapping.BuildID is populated when symbolic path is missing.
Low (normalization-fail fallback ambiguity):
- Specified: route that IP through process-mode for the current batch.
blazesym's defaults emit [binary]:offset; stack shape preserved, no
frames dropped, no synthetic placeholder.
Low (leftover "(path, st_ino)" at line 232):
- Changed to "mapperKey{dev, ino}", matching the rest of the spec.
TDD plan (red test first, green test after fix). 20 tasks across 5
phases:
Phase 0 — Foundation (procmap.Mapping.MapFiles + OpenablePath,
Resolver.populate via map_files, AddressMapper port)
Phase 1 — Classifier (Tier 1 skip, Tier 2 process-mode,
Tier 3 file-mode with badDebug per-path filtering)
Phase 2 — symbolizeElfVirt cgo wrapper with originalIPs rewrite
Phase 3 — Symbolizer per-mapping routing + dispatcher Case 3
reframe + stats
Phase 4 — 7 integration tests covering Rust/Go stripped workloads,
file-mode address-rewrite regression guard, cache hit,
parse-fail demotion, sidecar map_files path, libc local
debuginfo
Phase 5 — Final build + test + lint matrix
Each task has concrete test code, concrete implementation code,
exact bash commands, and expected output. Spec: docs/specs/
2026-05-12-debuginfod-cache-layout-design.md.
map_files is /proc/<pid>/map_files/<start>-<limit>, a kernel-resolved symlink that works across mount namespaces and survives deletion of the original file. OpenablePath() prefers it, falls back to the symbolic path, returns "" when neither is readable. Foundation for the debuginfod cache-layout fix: classifier and AddressMapper need namespace-safe paths for sidecar / deleted-binary cases.
- Doc comment on OpenablePath() warns callers about the TOCTOU between probe and actual use: the returned path may become unreadable between the probe and the caller's open. Callers must still handle os.Open errors themselves. - The "map_files preferred when both readable" subtest now uses two distinct files for MapFiles vs Path so the priority is actually asserted (previously both pointed at the same file, so the test would pass even if Path was checked first).
populate() now tries Mapping.MapFiles (kernel-resolved symlink, works across mount namespaces) before falling back to the symbolic Path. Fixes the sidecar / deleted-binary case where pprof Mapping.BuildID was silently empty because the symbolic path was unreachable from the agent's namespace.
- attachBuildIDs's inline comment duplicated its function doc; removed. - writeELFWithBuildID (~80 LOC ELF byte layout) moved out of procmap_test.go into elfhelpers_test.go so the main test file stays focused on behavior assertions and the helper is reusable from any test in the package.
Port of pfelf.AddressMapper from open-telemetry/opentelemetry-ebpf-profiler (libpf/pfelf/addressmapper.go) under Apache-2.0. Preserves the page-alignment correctness fix (offsets near segment starts get misattributed without it) by aligning p_offset down to page boundary to mirror the kernel's mmap behavior. Used by the upcoming file-mode symbolization path: addresses captured in /proc/<pid>/maps space get normalized to ELF virtual addresses that blazesym's elf-mode API can resolve against a separate .debug file.
- Replace the three-branch sort comparator with cmp.Compare(a.Off, b.Off) — idiomatic Go 1.21+ and shorter. - Add test coverage for the p_offset = 0 case (common in static single-segment binaries and PIE text segments). Math already handled it correctly (0 &^ 0xfff == 0), but the test gap was flagged in code review.
Guards against future regressions: multi-segment binaries (common in PIE builds) must dispatch each offset to the correct segment, and report gaps between segments as unmapped. ET_DYN parses identically to ET_EXEC for the mapper's purposes.
Lays out routeKind, classifyResult, classifier struct with negFetch (build-id keyed) and badDebug (pathSig keyed) state. Implements only Tier 1 — paths that are inherently non-symbolizable (vdso/ stack/vsyscall/heap/empty). Subsequent tasks add Tier 2 (process- mode routing) and Tier 3 (file-mode candidate iteration).
Mappings with .debug_info in-binary, a resolvable .gnu_debuglink, or an unreachable symbolic path go through blazesym's process-mode path (its defaults handle them). Debug-link check uses mapping.Path (symbolic) — map_files paths would search /proc/<pid>/map_files/ which never contains the linkee.
…ering Iterates candidate paths (systemPath, cache.AbsPath) and filters each by its (dev, ino, mtime) signature against badDebug. One corrupt copy never blocks a valid sibling for the same build-id. Falls through to process-mode when every candidate is bad or fetch fails.
- Add TestClassifierBadDebugSkipsToSiblingCandidate that proves the loop continues past a bad candidate to a valid sibling for the same build-id. Required injecting systemDebugRoot as a field on classifier (defaults to /usr/lib/debug/.build-id; overridden in test). This exercises the core Tier 3 invariant the prior tests didn't cover. - statSig now wraps the path into its error message via fmt.Errorf for easier debugging.
cgo wrapper for blaze_symbolize_elf_virt_offsets. Takes both the file-VA (for the actual lookup) and the original process PC. Rewrites Frame.Address (and every Inlined entry's Address) to originalIPs[i] post-symbolization so pprof's mapping resolver can find each location in /proc/<pid>/maps. Without the rewrite, file-mode locations would land at synthetic mapping 0 with no BuildID.
…riant Replace the unreachable bounds guard inside the loop with an explicit invariant check at the top. blazesym's contract guarantees cnt equals the number of input addresses, and the function's pre-condition guarantees len(originalIPs) == len(virtOffsets). If the invariant ever breaks, fail loudly instead of silently dropping frames.
The v1.1.0 Case 3 fetched .debug into the cache and returned "", expecting blazesym's debug_dirs walker to find it. That never worked: blazesym's split-debug lookup is gated on .gnu_debuglink, which most stripped Rust/Go binaries lack. File-mode routing in the classifier owns this case now. Case 3 stays in the dispatcher as a fail-open path for two reasons: (a) a mapping demoted from file-mode after a fetch/parse failure re-enters here; (b) a transient race could leave a build-id-only mapping unclassified. Both want "" — let blazesym emit [binary]: offset fallback. No fetch, no panic. Also: introduce a minimal sfFetcher interface so tests can inject a counting fake to verify the no-fetch behavior.
Classifies each mapping in the target PID via classifier.classify(), then buckets the IPs by route. Process-mode batch handed to blazesym as before; file-mode buckets (one per cached .debug) go through symbolizeElfVirt with AddressMapper-normalized addresses and the originalIPs rewrite. Failure modes preserved: AddressMapper miss demotes the IP to process-mode; symbolizeElfVirt NULL marks the path in badDebug and falls back to process-mode for that bucket; resolver failure (or no Resolver configured) falls back to a pure process-mode batch. Existing SymbolizeProcess(pid, ips) signature is preserved; ctx for the classifier's Tier-3 fetch is derived internally from FetchTimeout to match the dispatcher's own ctx-derivation pattern. Supporting additions: - procmap.Resolver.Mappings(pid): returns the cached snapshot of a PID's executable mappings, populating on first call (Lookup-style per-address API didn't fit per-mapping classification). - classifier.mapperFor(m): inode-keyed AddressMapper cache; benign parse-race loses to the cached entry to keep identity stable. - atomicStats: classifyProcessMode/classifyFileMode/classifySkipped + fileModeCalls/fileModeParseFails + normalizationFails counters used by the new routing path. Surfaced through Stats snapshot.
The previous commit's routeAndSymbolize called resolver.Mappings(pid) which is sync.Once-cached; subsequent calls for the same PID returned the stale snapshot. Live mmap/dlopen/exec changes were invisible: IPs in new mappings fell through findMapping with no hit and were silently misrouted. Spec invariant (docs/specs/2026-05-12-debuginfod-cache-layout-design.md): each Symbolize call must re-snapshot /proc/<pid>/maps. Calling Invalidate(pid) before Mappings(pid) forces the re-parse. Add TestResolverMappingsReSnapshotAfterInvalidate in procmap_test.go: mutates the fixture between two Mappings calls (with an Invalidate in between) and asserts the second snapshot sees the added mapping and that the populate counter increments — proving the re-parse occurred.
Completes the new counter set from the spec. fileModeLocalHits increments when a Tier 3 candidate (systemPath or cache.AbsPath) is selected without a network fetch — measures the value of the local- debug-info path. fileModeFetchFails increments when the debuginfod fetcher returns an error, alongside markNegFetch. The classifier gains a *stats pointer field; newClassifier signature adds a stats arg. Existing tests that constructed classifiers pass a fresh stats.
stripWorkload(): copy + objcopy --strip-all, return build-id. uploadDebug(): invoke debuginfod/upload.sh, return build-id + expected .debug path under the store dir. waitForDebuginfodReady(): poll the server until the build-id serves. Used by the upcoming TestStripped* / TestFileMode* integration tests. dst paths MUST be under the worktree, not /tmp — caps don't survive exec from nosuid mounts.
End-to-end coverage of the file-mode symbolization path for a stripped Rust release binary. Builds the workload with DWARF, uploads .debug to the local debuginfod container, strips a copy, runs perf-agent with --debuginfod-url, and asserts that rust_workload::cpu_intensive_work appears in the pprof — proving the file-mode path resolved the symbol even though the binary itself carries only .note.gnu.build-id. Helpers added: requireDebuginfodContainer, requireTool, spawnBinaryAsWorkload.
Mirror of the Rust test for a Go release binary. Asserts DWARF is present in the fixture before stripping (rebuild without -ldflags='-w' if it isn't), then strips, uploads, profiles, and confirms main.* is resolved through the file-mode path.
Regression guard for the symbolizeElfVirt address-rewrite invariant. Asserts that for every Rust frame in the resulting pprof: * Location.Address is inside Location.Mapping.Start..Mapping.Limit * Location.Mapping.BuildID equals the workload's build-id * Location.Mapping.File points at the workload Without the rewrite, every assertion fails — locations land at the synthetic mapping 0 with no build-id.
Two consecutive runs share a --symbol-cache-dir. Asserts the second run doesn't hit debuginfod for the same build-id — proves the cache.Has short-circuit in the classifier. Helpers added: runStripped, countDebuginfodHits.
Truncates a cached .debug to force blaze_symbolize_elf_virt_offsets to return NULL. Asserts the agent doesn't crash and that pprof still emits samples + the workload's mapping — proving the badDebug per-path filter demotes the mapping to process-mode gracefully.
Deletes the stripped workload from disk while it's still running. The process keeps it alive via its open file descriptor; /proc/<pid>/map_files still resolves. Asserts symbols resolve via map_files AND that pprof Mapping.BuildID is populated (Resolver.populate must read via map_files when the symbolic path is gone).
Verifies G2: when local /usr/lib/debug/.build-id/.../libc.so.debug exists, the classifier doesn't refetch libc through debuginfod. Skips on hosts without glibc-debuginfo installed. Helper added: findLibcWithLocalDebuginfo.
Our port stored page-aligned offsets and extended filesz at construction, then computed file-VA as `Vaddr + (off - alignedOff)`. That's wrong: for any segment with non-page-aligned p_offset, the returned VA is off by `delta = p.offset - aligned_p_offset`. PIE Rust release builds have p_offset = 0x16ec0 → delta = 0xec0 → samples in cpu_intensive_work resolve to adjacent stdlib functions. OTel's reference implementation stores p.offset/p.vaddr/p.filesz raw and aligns at lookup time. The math is `vaddr + (off - p.offset)` (equivalent to OTel's `vaddr - (p.offset - off)`). This produces correct results for both un-aligned `off == p.offset` and page- aligned `off == p.offset &^ (pageSize - 1)`. Also fixes TestAddressMapperPageAlignment which was asserting the incorrect 0x400000 for the page-aligned start instead of the correct 0x400000 - 0x234. Adds TestAddressMapperPIEWithNonPage- AlignedOffset reproducing the real rust-workload PHDR layout.
Task 0a added the MapFiles field and Task 0b wired it into attachBuildIDs, but the parser never set it. attachBuildIDs's map_files branch was always skipped; pprof Mapping.BuildID stayed empty whenever the symbolic path was unreachable (sidecar / deleted-binary case). parseMapsFile now derives the map_files directory from the maps file path and passes it to parseMapsLine, which constructs /proc/<pid>/map_files/<startHex>-<limitHex> for each mapping.
The previous assertion compared loc.Address (a file offset per pprof's contract: Address = ProcessPC - MapStart + MapOff) to Mapping.Start..Mapping.Limit (process address range). Different units; the check was always false. The test was passing in CI only by accident because the test fixture/setup wasn't exercising the file-mode path on rust-workload's running PC range; once AddressMapper started producing correct file-VAs, the bogus check surfaced. The real invariant we want to verify is: after file-mode symbol- ization, the resulting pprof Location is tied to a real Mapping (not synthetic 0) with the correct BuildID. Keep those two checks; drop the address-in-range nonsense.
TestStrippedSidecarUnreachableSymbolicPath was end-to-end profiling a target whose binary was deleted from disk. The classifier and AddressMapper handle this correctly (Mapping.MapFiles works), but the DWARF unwinder in unwind/dwarfagent/ opens the binary by symbolic path to compile .eh_frame at attach time — that fails when the path is deleted, leaving no user-stack samples in the profile. The unwinder fix is its own PR (separate subsystem). For now: replace the integration test with a unit-level classifier test that verifies the classifier routes correctly when MapFiles is the only readable path. The DWARF unwinder gap is recorded as a follow-up in the design spec.
…t-mapped case The DWARF unwinder previously opened every target binary by its symbolic path at attach time (os.Stat / elf.Open / ReadBuildID). When the binary has been unlinked from disk while still mapped — the sidecar / mount-namespace case — those calls all fail with "no such file or directory", and the entire mapping is skipped: no CFI compiled, no pid_mappings row, no user-stack samples in the profile. Add an openableBinary(pid, start, limit, symbolicPath) helper that probes /proc/<pid>/map_files/<startHex>-<limitHex> first (kernel-resolved symlink — works across mount namespaces and survives unlinked-but-mapped binaries) and falls back to the symbolic path. perf-agent's standard cap set (CAP_SYS_ADMIN among others) already covers the map_files read. Thread a new openPath argument through PIDTracker.Attach / Detach counterparts (Attach, EnrollWithoutCompile, AttachCompileOnly), TableStore.AcquireBinary, and LoadProcessMappings. binPath remains the cache key (so two PIDs sharing the same .so still share one ehcompile result); openPath is the file actually opened for the build-id read, ehcompile.Compile, and /proc/<pid>/maps cross-check. Call sites that walk /proc/<pid>/maps (AttachAllMappings, enrollPIDFromTree, resolveBinaryByTableID) now parse the va range from fields[0] and route I/O through openableBinary. The MmapEvent handler in PIDTracker.Run does the same using ev.Addr / ev.Addr+ev.Len. looksExecutable accepts mappings reachable via map_files even when the symbolic path is gone. Synthetic-tree unit tests (scan_enroll_test) skip the map_files fallback since the fake /proc/<pid> doesn't correspond to a real PID namespace. Adds unit tests covering openableBinary's three branches (symbolic works, map_files fallback when symbolic is missing, both missing).
…inder supports map_files The end-to-end sidecar test was narrowed to a classifier-only unit test in 7904423 because the DWARF unwinder's attach path opened the symbolic binary path and failed when the binary had been deleted from disk. The previous commit (ehmaps: open target binaries via /proc/<pid>/map_files) removes that limitation, so the full integration test is back in scope. Keep TestClassifierUsesMapFilesWhenSymbolicPathDeleted in place — it complements the integration test by exercising the classifier's routing logic without spinning up a debuginfod server, and runs in milliseconds on every test invocation.
The follow-up listed unwind/dwarfagent's inability to open binaries via /proc/<pid>/map_files for the deleted-but-mapped (sidecar / mount-namespace) case. That gap is now closed by routing every Attach / AcquireBinary / LoadProcessMappings call through ehmaps.openableBinary. Drop the entry from the design spec's follow-ups section.
Kubernetes-aware profile labels section was 18 lines of dense prose
with the lead message (use DaemonSet) buried in caveats. Sidecar
bullet stretched the security tradeoff across 4 nested clauses in
a single sentence. Trimmed to 9 lines: DaemonSet recommendation
upfront, sidecar's shareProcessNamespace caveat in one parenthesized
phrase instead of an inline mini-essay.
Also dropped a stray forward-reference under Usage ("Two specific
deployment shapes — Python via --inject-python, and sidecar inside
a Kubernetes pod — work as documented in the use cases above") that
conflated two unrelated features and pointed back at content already
covered above. Replaced with a plain pointer to the Python docs.
The paragraph still described v1.1.0: process_dispatch hook is the mechanism, "binaries already on disk get DWARF via the cache (no override)", "sidecar/missing binaries get the full ELF". None of that is accurate after the cache-layout fix. Reality now: per-mapping classifier routes each binary into process-mode (local DWARF / debug-link) or file-mode (cached .debug via the OTel-style normalize-addresses-then-symbolize-against-debug pattern). Sidecar / deleted-binary cases open through /proc/<pid>/map_files. The dispatcher's role shrank to a fail-open path. Section now leads with the three routing buckets and points at the design doc for the math.
There was a problem hiding this comment.
Pull request overview
Fixes off-box symbolization for stripped ELF binaries that only carry .note.gnu.build-id (common for Rust/Go release builds) by adding a per-mapping routing layer that chooses between blazesym process-mode and a new file-mode path that symbolizes directly against cached .debug files (with Go-side address normalization). Also improves robustness for “deleted-but-still-mapped” / sidecar mount-namespace cases by preferentially opening binaries via /proc/<pid>/map_files/....
Changes:
- Add
procmap.Mapping.MapFiles+OpenablePath()and plumb/proc/<pid>/map_files/<start>-<limit>through procmap + EH unwinder paths so build-ids and binaries can be opened even when the symbolic path is unreachable. - Introduce
procmap.AddressMapperand a debuginfod classifier + per-mapping symbolization router (process-mode vs file-mode), with file-mode callingblaze_symbolize_elf_virt_offsetsagainst cached.debugand rewriting frame addresses back to original process PCs. - Add extensive unit + integration tests covering stripped Rust/Go, cache hits, parse-fail demotion, libc local resolution, and deleted-but-mapped binaries.
Reviewed changes
Copilot reviewed 28 out of 29 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| unwind/procmap/resolver.go | Adds Mappings() accessor and attaches BuildIDs preferring MapFiles over symbolic Path. |
| unwind/procmap/procmap.go | Adds Mapping.MapFiles and OpenablePath() helper to prefer /proc/<pid>/map_files. |
| unwind/procmap/procmap_test.go | Extends procmap tests for MapFiles, Mappings() resnapshot behavior, and OpenablePath(). |
| unwind/procmap/parse.go | Populates Mapping.MapFiles during /proc/<pid>/maps parsing. |
| unwind/procmap/elfhelpers_test.go | Adds ELF fixture writer used by procmap tests. |
| unwind/procmap/addressmapper.go | Adds Go-side PHDR-based address normalization (AddressMapper). |
| unwind/procmap/addressmapper_test.go | Adds unit tests for AddressMapper mapping math (alignment/PIE/multi-PT_LOAD). |
| unwind/ehmaps/tracker.go | Threads openPath through Attach/Enroll/CompileOnly and uses map_files when symbolic path is unreachable. |
| unwind/ehmaps/tracker_test.go | Updates tests for new Attach/EnrollWithoutCompile/AttachCompileOnly signatures. |
| unwind/ehmaps/store.go | Updates AcquireBinary to accept (binPath, openPath) and compile from openable path. |
| unwind/ehmaps/scan_enroll.go | Uses map_files for real /proc, but avoids it for synthetic proc trees in tests. |
| unwind/ehmaps/openable.go | Adds helper to choose an openable binary path (map_files first, then symbolic). |
| unwind/ehmaps/openable_test.go | Tests openable-path selection (skipping map_files checks without required caps). |
| unwind/ehmaps/elf_helpers.go | Updates LoadProcessMappings to open via openPath while matching by symbolic name. |
| unwind/dwarfagent/miss_drainer.go | Resolves binaries for lazy compile via openable path (including map_files). |
| unwind/dwarfagent/miss_drainer_test.go | Updates tests for new (binPath, openPath) return signature. |
| test/integration_test.go | Adds/extends integration tests for stripped Rust/Go debuginfod flows, cache hits, parse-fail demotion, and sidecar-like deletion. |
| test/integration_strip_helpers.go | Adds helpers to strip workloads and upload .debug to local debuginfod fixture. |
| symbolize/debuginfod/symbolizer.go | Implements per-mapping routing, file-mode symbolization, and address rewrite back to process PCs. |
| symbolize/debuginfod/stats.go | Adds counters for classify routes, file-mode outcomes, and normalization failures. |
| symbolize/debuginfod/singleflight.go | Introduces sfFetcher interface for easier fetcher substitution in tests. |
| symbolize/debuginfod/elftest_helpers_test.go | Adds ELF fixture builders for debuginfod classifier/dispatcher tests. |
| symbolize/debuginfod/dispatcher.go | Adds file-mode cgo wrapper for blaze_symbolize_elf_virt_offsets and reframes dispatcher Case 3 as no-fetch fail-open. |
| symbolize/debuginfod/dispatcher_test.go | Adds tests for file-mode address rewrite and dispatcher Case 3 no-fetch behavior. |
| symbolize/debuginfod/classifier.go | Adds per-mapping classifier for skip/process/file routes, local-vs-fetch decisions, and bad-debug/neg-fetch tracking. |
| symbolize/debuginfod/classifier_test.go | Adds unit tests for classifier routing decisions and badDebug behavior. |
| README.md | Updates documentation to describe the new per-mapping classifier + file-mode flow. |
| docs/specs/2026-05-12-debuginfod-cache-layout-design.md | Adds design spec documenting rationale, routing, failure modes, and test plan. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
attachBuildIDs was caching the build-id under the MapFiles key (/proc/<pid>/map_files/<range>), which is unique per PID + VA range. Two PIDs mapping the same libc would create two cache entries for the same build-id, causing r.buildIDs to grow unbounded in long-running agents. Fix: read the build-id directly via ReadBuildID(MapFiles) — bypassing buildIDFor's cache — then backfill r.buildIDs under the stable symbolic path. Future lookups for the same binary (even from a different PID) get a cache hit without re-reading the ELF. Add TestAttachBuildIDsCachesUnderSymbolicPath to prove the invariant: after attachBuildIDs runs, buildIDFor(symbolicPath) returns the cached value even when the MapFiles ELF has been deleted (no re-read possible).
… eviction negFetch, badDebug, and mappers were plain map[K]V with TTL-only cleanup that only fires on access. Cold entries were never reclaimed; with many unique build-ids the maps would grow without bound. The doc comment on classifier claimed "bounded LRU" but the implementation did not enforce any size limit. Fix: add classifierCacheMax = 4096 and enforce it on insert: - negFetch / badDebug: evictOldestTime() drops the entry with the earliest deadline (O(n) scan, runs at most once per insert, amortised O(1) for a map that rarely changes topology) — correct for TTL-keyed maps. - mappers: no deadline, so an arbitrary entry is dropped via single-pass map iteration when the map is full. Update the classifier doc comment to accurately describe the policy: "bounded by size + TTL; oldest-deadline eviction on insert when full". Add TestClassifierBadDebugBounded and TestClassifierNegFetchBounded to prove each map stays at or below classifierCacheMax after classifierCacheMax+100 inserts.
The previous comment warned "dst MUST be under the worktree, NOT /tmp" citing the nosuid-mount / file-caps constraint. That constraint applies to the perf-agent binary (which is setcap'd), not to workload binaries being stripped with objcopy. stripWorkload itself never sets or relies on file caps; callers already pass t.TempDir() (which resolves to /tmp), so the "MUST NOT /tmp" claim directly contradicted actual usage. Replace with a short, accurate description of what the function does.
…ot calls) symbolizeFileBucket incremented fileModeCalls by len(virt) — the number of virtual addresses in the bucket — not by 1 per call. The name "fileModeCalls" implied "number of calls to symbolizeFileBucket", which is misleading and caused confusion when reading the counter in Stats. Rename throughout: - atomicStats.fileModeCalls → fileModeAddrs - Stats.FileModeCalls → FileModeAddrs - snapshot() mirror updated - Doc comments updated to clarify: "total number of addresses (IPs) resolved through the file-mode path". The increment site (len(virt)) is unchanged.
The previous comment said "returns nil when ... /proc parse failed", but the implementation returns (nil, entry.err) on parse errors — only the PID-gone fast-fail returns (nil, nil). Rewrite the contract block to clearly distinguish: - (nil, nil) — PID gone, access restricted, or no executable regions - (nil, err) — /proc parse failed (I/O error, unexpected format) - (mappings, nil) — success
The fetcher field and newClassifier parameter were typed as the concrete *singleflightFetcher, bypassing the sfFetcher interface that was added in singleflight.go precisely for minimal coupling and test-fake injection. Change both to sfFetcher so: - Tests can inject a nil or stub without constructing a real fetcher - The classifier is decoupled from the concrete transport layer - Existing call sites (symbolizer.go passes *singleflightFetcher, tests pass nil) continue to work unchanged since *singleflightFetcher already satisfies sfFetcher and nil is a valid nil-interface value
…ream
The skip-route bullet implied that [vdso]/[stack]/anonymous mappings
commonly reach the classifier and trigger routeSkip. In fact,
procmap.Resolver's parser (parseMapsLine) filters out paths with
HasPrefix("[") and empty paths before they ever appear in the Mappings
snapshot, so IPs in those regions hit "no mapping found" and fall through
to process-mode rather than the skip route.
Update the doc comment to:
- Keep the skip bullet but qualify it as rarely triggered in practice
given the upstream procmap filter (routeSkip is still a real code path
the classifier can return for any ELF path that happens to match
nonSymbolizablePaths)
- Add an explicit note that mappings from procmap.Resolver exclude
pseudo-files and anonymous ranges, so they fall back to process-mode
2 tasks
dpsoft
added a commit
that referenced
this pull request
May 15, 2026
Both projects these documents covered have shipped: - kernel-stacks M1 (PR #21, v1.2.0) - debuginfod cache layout fix (PR #22) The specs and plans served their purpose during design/implementation; they're not user-facing reference material. The user-facing docs under docs/ — debuginfod-symbolization.md, perf-data-output.md, python-profiling.md — stay. Removes 6294 lines of planning artifacts.
2 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a v1.1.0 gap where stripped binaries with only
.note.gnu.build-id(the common Rust/Go release-build case) silently failed to resolve through the debuginfod cache. blazesym's split-debug walker is gated entirely on.gnu_debuglink, and theBUILD_ID_DEBUG_DIRwalker is hardcoded to/usr/lib/debug/.build-id— the dispatcher fetched.debugfiles into our cache that blazesym never looked at.Adopts the Parca / Pyroscope / OpenTelemetry eBPF profiler pattern: per-mapping classifier picks process-mode vs file-mode, Go-side address normalization (
AddressMapperported from OTelpfelf), thenblaze_symbolize_elf_virt_offsetsagainst the cached.debugdirectly. blazesym's debug-link resolver becomes irrelevant for stripped binaries.What changed
symbolize/debuginfod/classifier.go) routes each ELF mapping in the target PID through one of three routes: skip / process-mode / file-mode.badDebugis keyed bypathSig{dev, ino, mtime}so one corrupt cache copy doesn't block a valid sibling for the same build-id.unwind/procmap/addressmapper.go) — port ofpfelf.AddressMapperfrom open-telemetry/opentelemetry-ebpf-profiler under Apache-2.0 with inline attribution. Math matches OTel reference: stores rawp.offset/p.vaddr/p.filesz, computes alignment lazily at lookup.symbolize/debuginfod/dispatcher.go::symbolizeElfVirt) — takes bothoriginalIPsandvirtOffsets. RewritesFrame.Addressto the original process PC for the leaf and every inlined chain entry so pprof's mapping resolver routes locations to real binary mappings (not synthetic mapping 0).Symbolize(symbolize/debuginfod/symbolizer.go) buckets IPs by route, runs two batched blazesym calls (one process-mode, one per cached.debug), merges by original index. Resolver re-snapshots/proc/<pid>/mapsevery call (no stale state).return ""). The classifier owns the fetch decision now; Case 3 is reached only on demoted mappings.procmap.MappinggainsMapFiles+OpenablePath().parseMapsLineconstructs/proc/<pid>/map_files/<start>-<limit>for each mapping.Resolver.populatereads build-id viaMapFilesfirst. DWARF unwinder (unwind/ehmaps) now opens target binaries viamap_filesfirst too — fixes the sidecar / deleted-but-still-mapped case end-to-end.Industry reference
pkg/symbolizer/normalize.go), opens.debugdirectly, never touches.gnu_debuglink.(build-id, file-offset), backend symbolizes against fetched.debugvia custom Lidia indexer.pfelf.AddressMapperdoes the same PHDR normalization including page-alignment fix.Spec: `docs/specs/2026-05-12-debuginfod-cache-layout-design.md` (6 review rounds). Plan: `docs/plans/2026-05-12-debuginfod-cache-layout.md` (20 TDD tasks executed via subagent-driven workflow).
Verification
Stats (vs main)
Test plan
Notes for reviewers
The DWARF unwinder change (
unwind/ehmaps) threads an additionalopenPathargument throughAttach/LoadProcessMappings/AcquireBinary. Cache keys stay as the symbolic path so two PIDs sharing the same.sostill share one compile. Look at87379c1dfor that change in isolation.`docs/specs/2026-05-12-debuginfod-cache-layout-design.md` and `docs/plans/2026-05-12-debuginfod-cache-layout.md` capture the full design + task breakdown if you want to follow the reasoning.