Skip to content

fix(debuginfod): per-mapping classifier + file-mode symbolization (cache layout fix)#22

Merged
dpsoft merged 47 commits into
mainfrom
fix/debuginfod-cache-layout
May 15, 2026
Merged

fix(debuginfod): per-mapping classifier + file-mode symbolization (cache layout fix)#22
dpsoft merged 47 commits into
mainfrom
fix/debuginfod-cache-layout

Conversation

@dpsoft
Copy link
Copy Markdown
Owner

@dpsoft dpsoft commented May 14, 2026

Summary

Fixes a v1.1.0 gap where stripped binaries with only .note.gnu.build-id (the common Rust/Go release-build case) silently failed to resolve through the debuginfod cache. blazesym's split-debug walker is gated entirely on .gnu_debuglink, and the BUILD_ID_DEBUG_DIR walker is hardcoded to /usr/lib/debug/.build-id — the dispatcher fetched .debug files into our cache that blazesym never looked at.

Adopts the Parca / Pyroscope / OpenTelemetry eBPF profiler pattern: per-mapping classifier picks process-mode vs file-mode, Go-side address normalization (AddressMapper ported from OTel pfelf), then blaze_symbolize_elf_virt_offsets against the cached .debug directly. blazesym's debug-link resolver becomes irrelevant for stripped binaries.

What changed

  • Per-mapping classifier (symbolize/debuginfod/classifier.go) routes each ELF mapping in the target PID through one of three routes: skip / process-mode / file-mode. badDebug is keyed by pathSig{dev, ino, mtime} so one corrupt cache copy doesn't block a valid sibling for the same build-id.
  • Address normalization (unwind/procmap/addressmapper.go) — port of pfelf.AddressMapper from open-telemetry/opentelemetry-ebpf-profiler under Apache-2.0 with inline attribution. Math matches OTel reference: stores raw p.offset/p.vaddr/p.filesz, computes alignment lazily at lookup.
  • File-mode cgo wrapper (symbolize/debuginfod/dispatcher.go::symbolizeElfVirt) — takes both originalIPs and virtOffsets. Rewrites Frame.Address to the original process PC for the leaf and every inlined chain entry so pprof's mapping resolver routes locations to real binary mappings (not synthetic mapping 0).
  • Per-mapping Symbolize (symbolize/debuginfod/symbolizer.go) buckets IPs by route, runs two batched blazesym calls (one process-mode, one per cached .debug), merges by original index. Resolver re-snapshots /proc/<pid>/maps every call (no stale state).
  • Dispatcher Case 3 reframed as a no-fetch fail-open (return ""). The classifier owns the fetch decision now; Case 3 is reached only on demoted mappings.
  • map_files plumbingprocmap.Mapping gains MapFiles + OpenablePath(). parseMapsLine constructs /proc/<pid>/map_files/<start>-<limit> for each mapping. Resolver.populate reads build-id via MapFiles first. DWARF unwinder (unwind/ehmaps) now opens target binaries via map_files first too — fixes the sidecar / deleted-but-still-mapped case end-to-end.
  • New stats: classify{ProcessMode,FileMode,Skipped}, fileMode{Calls,LocalHits,FetchFails,ParseFails}, normalizationFails.

Industry reference

Project Same approach
Parca Normalizes in Go (pkg/symbolizer/normalize.go), opens .debug directly, never touches .gnu_debuglink.
Pyroscope Ships (build-id, file-offset), backend symbolizes against fetched .debug via custom Lidia indexer.
OpenTelemetry eBPF profiler pfelf.AddressMapper does the same PHDR normalization including page-alignment fix.

Spec: `docs/specs/2026-05-12-debuginfod-cache-layout-design.md` (6 review rounds). Plan: `docs/plans/2026-05-12-debuginfod-cache-layout.md` (20 TDD tasks executed via subagent-driven workflow).

Verification

Check Result
`make build` ✅ clean
`go test ./... -count=1` (20+ packages) ✅ all pass
`go vet ./...` ✅ clean
Integration tests (caps + local debuginfod docker)
TestStrippedRustOffBoxSymbolization ✅ stripped Rust release symbols resolved end-to-end
TestStrippedGoOffBoxSymbolization ✅ stripped Go release symbols resolved end-to-end
TestFileModeFrameAddressPreservesMapping ✅ pprof Mapping.BuildID matches workload (regression guard for address-rewrite invariant)
TestStrippedCachedHitNoFetch ✅ second run reuses cache, no re-fetch
TestFileModeParseFailDemotes ✅ corrupt cached .debug → badDebug, demote to process-mode, no crash
TestStrippedSidecarUnreachableSymbolicPath ✅ binary deleted while mapped, symbols still resolve via /proc//map_files
TestOffBoxLibcResolution ⏭️ SKIP on hosts without glibc-debuginfo

Stats (vs main)

  • 33 commits
  • ~+7000 / −125 lines (most are tests + design docs)
  • New: classifier, AddressMapper, symbolizeElfVirt, integration test helpers, 6 integration tests, 6 classifier unit tests, 1 ehmaps unit test, design spec, implementation plan.

Test plan

  • Run a workload with stripped binary against a debuginfod server with the debug-info uploaded — confirm symbols resolve.
  • Run with libc-debuginfo installed locally — confirm libc is not refetched.
  • Run against a binary deleted while mapped (sidecar simulation) — confirm symbols still resolve and pprof Mapping.BuildID is populated.
  • Run two consecutive profiles for the same target — confirm second run uses cache (no debuginfod GETs).

Notes for reviewers

The DWARF unwinder change (unwind/ehmaps) threads an additional openPath argument through Attach / LoadProcessMappings / AcquireBinary. Cache keys stay as the symbolic path so two PIDs sharing the same .so still share one compile. Look at 87379c1d for that change in isolation.

`docs/specs/2026-05-12-debuginfod-cache-layout-design.md` and `docs/plans/2026-05-12-debuginfod-cache-layout.md` capture the full design + task breakdown if you want to follow the reasoning.

dpsoft added 30 commits May 12, 2026 20:34
Address the v1.1.0 gap where stripped binaries without .gnu_debuglink
silently miss off-box symbolization. blazesym's split-debug lookup is
gated on debug-link presence and ignores debug_dirs for build-id walks,
so the cached .debug at <cacheDir>/.build-id/NN/REST.debug is never
consulted for the common Rust/Go release-build case.

Industry references (Parca, Pyroscope, OpenTelemetry eBPF profiler) all
solve this the same way: normalize PID+vaddr → file-VA in Go using
/proc/<pid>/maps + ELF program headers, then symbolize against the
fetched .debug file directly via blazesym's elf-mode API. This bypasses
the debug-link/build-id resolver entirely.

The spec adopts that pattern: per-mapping classifier picks process-mode
vs file-mode, AddressMapper (ported from OTel's pfelf.AddressMapper)
handles normalization including the page-alignment correctness fix,
new cgo wrapper calls blaze_symbolize_elf_virt_offsets. Dispatcher
Case 3 (the buggy v1.1.0 path) is removed.

Cache layout unchanged. No CLI/wire changes. Integration tests cover
stripped Rust + stripped Go workloads end-to-end through a hermetic
debuginfod docker container.
Six defects fixed inline:

High 1 (Case 3 panic conflicts with file-mode demotion):
- Removed panic-guard plan. Case 3 stays but its semantics change to
  "no-fetch fallback": return "" without fetching. The classifier
  owns fetch decisions; dispatcher is fail-open for any mapping that
  reaches process-mode (whether by demotion or transient race).

High 2 (per-PID classification cache had no real invalidation path):
- Dropped persistent pidClass entirely. Re-classify on every
  Symbolize call against a fresh /proc/<pid>/maps snapshot. Content-
  addressed caches survive: AddressMapper keyed by (path, ino),
  negFetch keyed by build-id. Mappings can change across calls
  (mmap/dlopen/exec); fresh snapshot is the correct invariant.

Medium 3 (no-build-id inconsistency):
- Standardized on "process-mode" everywhere. .symtab may still
  resolve via blazesym defaults. Skip is reserved for [vdso]/
  [stack]/[vsyscall]/empty paths.

Medium 4 (G2 — local /usr/lib/debug/.build-id check):
- Added explicit check for /usr/lib/debug/.build-id/NN/REST.debug
  in the classifier before deciding to fetch. Distro debuginfo
  without .gnu_debuglink is reused locally; no remote fetch.

Low 5 (Go test recipe with -w -s contradicts uploading DWARF):
- Removed -w/-s from Go workload build command. Plain `go build`
  emits DWARF; objcopy --strip-all leaves build-id but removes
  DWARF + symtab from the binary, mirroring the Rust test.

Low 6 (--kernel-stacks=off not parseable):
- Removed --kernel-stacks from test commands (defaults to false).
  Debuginfod tests don't exercise kernel symbolization.
Four defects fixed:

High (sidecar / mount-namespace):
- Classification and AddressMapper now go through a two-step path
  resolution (map_files preferred, symbolic path fallback), mirroring
  the existing readBuildID policy. procmap.Mapping gains MapFiles +
  OpenablePath() helper. New integration test for the deleted-but-
  still-mapped binary case proves symbols resolve when the symbolic
  path is unreachable.

Medium (architecture vs classification re-snapshot contradiction):
- Architecture diagram now says "re-snapshot /proc/<pid>/maps every
  call" matching the classification + concurrency sections.

Medium (component-table "Remove Case 3" contradicting dispatcher text):
- Component table now says "Reframe Case 3 as no-fetch fallback".
  Adds entries for the procmap changes (Mapping.MapFiles +
  OpenablePath helper + tests).

Medium (--symbol-fail-closed contradictory mentions):
- Failure-modes table no longer claims fail-closed gains new behavior.
  Migration note unchanged ("semantics unchanged" is now accurate).
  Added explicit follow-up entry: wire fail-closed end-to-end in a
  separate PR.

Low (classifier pseudocode dangling file-mode line):
- Pseudocode rewritten with explicit fetch-succeeds / fetch-fails
  branches matching the failure table.
Five defects fixed:

High (file-mode Frame.Address must be original IP, not virt-offset):
- symbolizeElfVirt signature now takes both originalIPs and virtOffsets;
  Frame.Address is rewritten to originalIPs[i] after symbolization. The
  rewrite propagates to every entry in the inlined chain. Without this,
  pprof's resolver routes locations to the synthetic mapping 0 instead
  of the real binary mapping. New TestFileModeFrameAddressPreservesMapping
  guards the contract end-to-end (asserts Location.Address ∈
  Mapping.Start..Limit and Mapping.BuildID matches the workload).

Medium (parse-failure demotion bypassed by classifier ordering):
- Split into two cache structures: negFetch (build-id, "don't re-fetch")
  and badDebug (build-id, "don't open this file again"). The classifier
  checks badDebug FIRST in Tier 3a, before any cache/local-path lookup,
  so a corrupt cached .debug never gets re-selected. badDebug entries
  expire on TTL OR when (dev, ino, mtime) of the bad path changes.

Medium (hasResolvableDebuglink with map_files path):
- Debug-link search is filesystem-relative — only meaningful with the
  symbolic path. Classifier now passes mapping.Path (not OpenablePath)
  to hasResolvableDebuglink. New test guards the path selection.

Medium (unreachable-path branch: skip vs process-mode):
- Unified to process-mode for "file-like path but unreadable" (preserves
  frame counts; blazesym defaults can still emit [binary]:offset). Skip
  is reserved for [vdso]/[stack]/[vsyscall]/[heap]/empty paths only,
  where no symbolization is possible.

Low (mapperKey inconsistency):
- Both places now say mapperKey{dev: uint64, ino: uint64}. Concurrency
  section gains the badDebug map; stats counters renamed for clarity
  (fileModeLocalHits added).
Four defects fixed:

Medium (badDebug build-id key blocks valid sibling copies):
- Re-keyed badDebug by pathSig{dev, ino, mtime} of the specific failing
  file, not by build-id. Tier 3 flow rewritten as candidate iteration:
  for each candidate path (systemPath, cache), check its signature
  against badDebug; demoted candidates fall through to the next. A
  corrupt cache copy no longer blocks a valid /usr/lib/debug/... copy
  for the same build-id.

Medium (Resolver.populate doesn't read build-id via map_files):
- Component table now includes unwind/procmap/resolver.go: populate()
  must pass both MapFiles and Path to buildIDFor so pprof
  Mapping.BuildID is populated in sidecar / deleted-binary cases.
  TestStrippedSidecarUnreachableSymbolicPath now asserts
  Mapping.BuildID is populated when symbolic path is missing.

Low (normalization-fail fallback ambiguity):
- Specified: route that IP through process-mode for the current batch.
  blazesym's defaults emit [binary]:offset; stack shape preserved, no
  frames dropped, no synthetic placeholder.

Low (leftover "(path, st_ino)" at line 232):
- Changed to "mapperKey{dev, ino}", matching the rest of the spec.
TDD plan (red test first, green test after fix). 20 tasks across 5
phases:

  Phase 0 — Foundation (procmap.Mapping.MapFiles + OpenablePath,
            Resolver.populate via map_files, AddressMapper port)
  Phase 1 — Classifier (Tier 1 skip, Tier 2 process-mode,
            Tier 3 file-mode with badDebug per-path filtering)
  Phase 2 — symbolizeElfVirt cgo wrapper with originalIPs rewrite
  Phase 3 — Symbolizer per-mapping routing + dispatcher Case 3
            reframe + stats
  Phase 4 — 7 integration tests covering Rust/Go stripped workloads,
            file-mode address-rewrite regression guard, cache hit,
            parse-fail demotion, sidecar map_files path, libc local
            debuginfo
  Phase 5 — Final build + test + lint matrix

Each task has concrete test code, concrete implementation code,
exact bash commands, and expected output. Spec: docs/specs/
2026-05-12-debuginfod-cache-layout-design.md.
map_files is /proc/<pid>/map_files/<start>-<limit>, a kernel-resolved
symlink that works across mount namespaces and survives deletion of
the original file. OpenablePath() prefers it, falls back to the
symbolic path, returns "" when neither is readable.

Foundation for the debuginfod cache-layout fix: classifier and
AddressMapper need namespace-safe paths for sidecar / deleted-binary
cases.
- Doc comment on OpenablePath() warns callers about the TOCTOU between
  probe and actual use: the returned path may become unreadable
  between the probe and the caller's open. Callers must still handle
  os.Open errors themselves.
- The "map_files preferred when both readable" subtest now uses two
  distinct files for MapFiles vs Path so the priority is actually
  asserted (previously both pointed at the same file, so the test
  would pass even if Path was checked first).
populate() now tries Mapping.MapFiles (kernel-resolved symlink, works
across mount namespaces) before falling back to the symbolic Path.
Fixes the sidecar / deleted-binary case where pprof Mapping.BuildID
was silently empty because the symbolic path was unreachable from
the agent's namespace.
- attachBuildIDs's inline comment duplicated its function doc; removed.
- writeELFWithBuildID (~80 LOC ELF byte layout) moved out of
  procmap_test.go into elfhelpers_test.go so the main test file
  stays focused on behavior assertions and the helper is reusable
  from any test in the package.
Port of pfelf.AddressMapper from open-telemetry/opentelemetry-ebpf-profiler
(libpf/pfelf/addressmapper.go) under Apache-2.0. Preserves the
page-alignment correctness fix (offsets near segment starts get
misattributed without it) by aligning p_offset down to page boundary
to mirror the kernel's mmap behavior.

Used by the upcoming file-mode symbolization path: addresses captured
in /proc/<pid>/maps space get normalized to ELF virtual addresses
that blazesym's elf-mode API can resolve against a separate .debug file.
- Replace the three-branch sort comparator with cmp.Compare(a.Off, b.Off)
  — idiomatic Go 1.21+ and shorter.
- Add test coverage for the p_offset = 0 case (common in static
  single-segment binaries and PIE text segments). Math already handled
  it correctly (0 &^ 0xfff == 0), but the test gap was flagged in
  code review.
Guards against future regressions: multi-segment binaries (common in
PIE builds) must dispatch each offset to the correct segment, and
report gaps between segments as unmapped. ET_DYN parses identically
to ET_EXEC for the mapper's purposes.
Lays out routeKind, classifyResult, classifier struct with negFetch
(build-id keyed) and badDebug (pathSig keyed) state. Implements
only Tier 1 — paths that are inherently non-symbolizable (vdso/
stack/vsyscall/heap/empty). Subsequent tasks add Tier 2 (process-
mode routing) and Tier 3 (file-mode candidate iteration).
Mappings with .debug_info in-binary, a resolvable .gnu_debuglink, or
an unreachable symbolic path go through blazesym's process-mode path
(its defaults handle them). Debug-link check uses mapping.Path
(symbolic) — map_files paths would search /proc/<pid>/map_files/
which never contains the linkee.
…ering

Iterates candidate paths (systemPath, cache.AbsPath) and filters each
by its (dev, ino, mtime) signature against badDebug. One corrupt copy
never blocks a valid sibling for the same build-id. Falls through to
process-mode when every candidate is bad or fetch fails.
- Add TestClassifierBadDebugSkipsToSiblingCandidate that proves the
  loop continues past a bad candidate to a valid sibling for the
  same build-id. Required injecting systemDebugRoot as a field on
  classifier (defaults to /usr/lib/debug/.build-id; overridden in
  test). This exercises the core Tier 3 invariant the prior tests
  didn't cover.
- statSig now wraps the path into its error message via fmt.Errorf
  for easier debugging.
cgo wrapper for blaze_symbolize_elf_virt_offsets. Takes both the
file-VA (for the actual lookup) and the original process PC. Rewrites
Frame.Address (and every Inlined entry's Address) to originalIPs[i]
post-symbolization so pprof's mapping resolver can find each location
in /proc/<pid>/maps. Without the rewrite, file-mode locations would
land at synthetic mapping 0 with no BuildID.
…riant

Replace the unreachable bounds guard inside the loop with an explicit
invariant check at the top. blazesym's contract guarantees cnt equals
the number of input addresses, and the function's pre-condition
guarantees len(originalIPs) == len(virtOffsets). If the invariant
ever breaks, fail loudly instead of silently dropping frames.
The v1.1.0 Case 3 fetched .debug into the cache and returned "",
expecting blazesym's debug_dirs walker to find it. That never worked:
blazesym's split-debug lookup is gated on .gnu_debuglink, which most
stripped Rust/Go binaries lack. File-mode routing in the classifier
owns this case now.

Case 3 stays in the dispatcher as a fail-open path for two reasons:
(a) a mapping demoted from file-mode after a fetch/parse failure
re-enters here; (b) a transient race could leave a build-id-only
mapping unclassified. Both want "" — let blazesym emit [binary]:
offset fallback. No fetch, no panic.

Also: introduce a minimal sfFetcher interface so tests can inject a
counting fake to verify the no-fetch behavior.
Classifies each mapping in the target PID via classifier.classify(),
then buckets the IPs by route. Process-mode batch handed to blazesym
as before; file-mode buckets (one per cached .debug) go through
symbolizeElfVirt with AddressMapper-normalized addresses and the
originalIPs rewrite.

Failure modes preserved: AddressMapper miss demotes the IP to
process-mode; symbolizeElfVirt NULL marks the path in badDebug and
falls back to process-mode for that bucket; resolver failure (or
no Resolver configured) falls back to a pure process-mode batch.
Existing SymbolizeProcess(pid, ips) signature is preserved; ctx for
the classifier's Tier-3 fetch is derived internally from FetchTimeout
to match the dispatcher's own ctx-derivation pattern.

Supporting additions:
- procmap.Resolver.Mappings(pid): returns the cached snapshot of a
  PID's executable mappings, populating on first call (Lookup-style
  per-address API didn't fit per-mapping classification).
- classifier.mapperFor(m): inode-keyed AddressMapper cache; benign
  parse-race loses to the cached entry to keep identity stable.
- atomicStats: classifyProcessMode/classifyFileMode/classifySkipped
  + fileModeCalls/fileModeParseFails + normalizationFails counters
  used by the new routing path. Surfaced through Stats snapshot.
The previous commit's routeAndSymbolize called resolver.Mappings(pid)
which is sync.Once-cached; subsequent calls for the same PID returned
the stale snapshot. Live mmap/dlopen/exec changes were invisible:
IPs in new mappings fell through findMapping with no hit and were
silently misrouted.

Spec invariant (docs/specs/2026-05-12-debuginfod-cache-layout-design.md):
each Symbolize call must re-snapshot /proc/<pid>/maps. Calling
Invalidate(pid) before Mappings(pid) forces the re-parse.

Add TestResolverMappingsReSnapshotAfterInvalidate in procmap_test.go:
mutates the fixture between two Mappings calls (with an Invalidate in
between) and asserts the second snapshot sees the added mapping and that
the populate counter increments — proving the re-parse occurred.
Completes the new counter set from the spec. fileModeLocalHits
increments when a Tier 3 candidate (systemPath or cache.AbsPath) is
selected without a network fetch — measures the value of the local-
debug-info path. fileModeFetchFails increments when the debuginfod
fetcher returns an error, alongside markNegFetch.

The classifier gains a *stats pointer field; newClassifier signature
adds a stats arg. Existing tests that constructed classifiers pass a
fresh stats.
stripWorkload(): copy + objcopy --strip-all, return build-id.
uploadDebug(): invoke debuginfod/upload.sh, return build-id + expected
.debug path under the store dir.
waitForDebuginfodReady(): poll the server until the build-id serves.

Used by the upcoming TestStripped* / TestFileMode* integration tests.
dst paths MUST be under the worktree, not /tmp — caps don't survive
exec from nosuid mounts.
End-to-end coverage of the file-mode symbolization path for a stripped
Rust release binary. Builds the workload with DWARF, uploads .debug to
the local debuginfod container, strips a copy, runs perf-agent with
--debuginfod-url, and asserts that rust_workload::cpu_intensive_work
appears in the pprof — proving the file-mode path resolved the symbol
even though the binary itself carries only .note.gnu.build-id.

Helpers added: requireDebuginfodContainer, requireTool, spawnBinaryAsWorkload.
Mirror of the Rust test for a Go release binary. Asserts DWARF is
present in the fixture before stripping (rebuild without -ldflags='-w'
if it isn't), then strips, uploads, profiles, and confirms main.* is
resolved through the file-mode path.
Regression guard for the symbolizeElfVirt address-rewrite invariant.
Asserts that for every Rust frame in the resulting pprof:
  * Location.Address is inside Location.Mapping.Start..Mapping.Limit
  * Location.Mapping.BuildID equals the workload's build-id
  * Location.Mapping.File points at the workload
Without the rewrite, every assertion fails — locations land at the
synthetic mapping 0 with no build-id.
Two consecutive runs share a --symbol-cache-dir. Asserts the second
run doesn't hit debuginfod for the same build-id — proves the cache.Has
short-circuit in the classifier.

Helpers added: runStripped, countDebuginfodHits.
Truncates a cached .debug to force blaze_symbolize_elf_virt_offsets to
return NULL. Asserts the agent doesn't crash and that pprof still emits
samples + the workload's mapping — proving the badDebug per-path
filter demotes the mapping to process-mode gracefully.
Deletes the stripped workload from disk while it's still running. The
process keeps it alive via its open file descriptor; /proc/<pid>/map_files
still resolves. Asserts symbols resolve via map_files AND that pprof
Mapping.BuildID is populated (Resolver.populate must read via map_files
when the symbolic path is gone).
dpsoft added 10 commits May 13, 2026 17:41
Verifies G2: when local /usr/lib/debug/.build-id/.../libc.so.debug
exists, the classifier doesn't refetch libc through debuginfod. Skips
on hosts without glibc-debuginfo installed.

Helper added: findLibcWithLocalDebuginfo.
Our port stored page-aligned offsets and extended filesz at
construction, then computed file-VA as `Vaddr + (off - alignedOff)`.
That's wrong: for any segment with non-page-aligned p_offset, the
returned VA is off by `delta = p.offset - aligned_p_offset`. PIE
Rust release builds have p_offset = 0x16ec0 → delta = 0xec0 → samples
in cpu_intensive_work resolve to adjacent stdlib functions.

OTel's reference implementation stores p.offset/p.vaddr/p.filesz raw
and aligns at lookup time. The math is `vaddr + (off - p.offset)`
(equivalent to OTel's `vaddr - (p.offset - off)`). This produces
correct results for both un-aligned `off == p.offset` and page-
aligned `off == p.offset &^ (pageSize - 1)`.

Also fixes TestAddressMapperPageAlignment which was asserting the
incorrect 0x400000 for the page-aligned start instead of the
correct 0x400000 - 0x234. Adds TestAddressMapperPIEWithNonPage-
AlignedOffset reproducing the real rust-workload PHDR layout.
Task 0a added the MapFiles field and Task 0b wired it into
attachBuildIDs, but the parser never set it. attachBuildIDs's
map_files branch was always skipped; pprof Mapping.BuildID stayed
empty whenever the symbolic path was unreachable (sidecar /
deleted-binary case).

parseMapsFile now derives the map_files directory from the maps
file path and passes it to parseMapsLine, which constructs
/proc/<pid>/map_files/<startHex>-<limitHex> for each mapping.
The previous assertion compared loc.Address (a file offset per
pprof's contract: Address = ProcessPC - MapStart + MapOff) to
Mapping.Start..Mapping.Limit (process address range). Different
units; the check was always false. The test was passing in CI
only by accident because the test fixture/setup wasn't exercising
the file-mode path on rust-workload's running PC range; once
AddressMapper started producing correct file-VAs, the bogus check
surfaced.

The real invariant we want to verify is: after file-mode symbol-
ization, the resulting pprof Location is tied to a real Mapping
(not synthetic 0) with the correct BuildID. Keep those two checks;
drop the address-in-range nonsense.
TestStrippedSidecarUnreachableSymbolicPath was end-to-end profiling
a target whose binary was deleted from disk. The classifier and
AddressMapper handle this correctly (Mapping.MapFiles works), but
the DWARF unwinder in unwind/dwarfagent/ opens the binary by
symbolic path to compile .eh_frame at attach time — that fails when
the path is deleted, leaving no user-stack samples in the profile.

The unwinder fix is its own PR (separate subsystem). For now: replace
the integration test with a unit-level classifier test that verifies
the classifier routes correctly when MapFiles is the only readable
path. The DWARF unwinder gap is recorded as a follow-up in the
design spec.
…t-mapped case

The DWARF unwinder previously opened every target binary by its symbolic
path at attach time (os.Stat / elf.Open / ReadBuildID). When the binary
has been unlinked from disk while still mapped — the sidecar /
mount-namespace case — those calls all fail with "no such file or
directory", and the entire mapping is skipped: no CFI compiled, no
pid_mappings row, no user-stack samples in the profile.

Add an openableBinary(pid, start, limit, symbolicPath) helper that
probes /proc/<pid>/map_files/<startHex>-<limitHex> first (kernel-resolved
symlink — works across mount namespaces and survives unlinked-but-mapped
binaries) and falls back to the symbolic path. perf-agent's standard
cap set (CAP_SYS_ADMIN among others) already covers the map_files read.

Thread a new openPath argument through PIDTracker.Attach / Detach
counterparts (Attach, EnrollWithoutCompile, AttachCompileOnly),
TableStore.AcquireBinary, and LoadProcessMappings. binPath remains the
cache key (so two PIDs sharing the same .so still share one ehcompile
result); openPath is the file actually opened for the build-id read,
ehcompile.Compile, and /proc/<pid>/maps cross-check.

Call sites that walk /proc/<pid>/maps (AttachAllMappings,
enrollPIDFromTree, resolveBinaryByTableID) now parse the va range from
fields[0] and route I/O through openableBinary. The MmapEvent handler
in PIDTracker.Run does the same using ev.Addr / ev.Addr+ev.Len.
looksExecutable accepts mappings reachable via map_files even when the
symbolic path is gone. Synthetic-tree unit tests (scan_enroll_test)
skip the map_files fallback since the fake /proc/<pid> doesn't
correspond to a real PID namespace.

Adds unit tests covering openableBinary's three branches (symbolic
works, map_files fallback when symbolic is missing, both missing).
…inder supports map_files

The end-to-end sidecar test was narrowed to a classifier-only unit test in
7904423 because the DWARF unwinder's attach path opened the symbolic
binary path and failed when the binary had been deleted from disk. The
previous commit (ehmaps: open target binaries via /proc/<pid>/map_files)
removes that limitation, so the full integration test is back in scope.

Keep TestClassifierUsesMapFilesWhenSymbolicPathDeleted in place — it
complements the integration test by exercising the classifier's routing
logic without spinning up a debuginfod server, and runs in milliseconds
on every test invocation.
The follow-up listed unwind/dwarfagent's inability to open binaries via
/proc/<pid>/map_files for the deleted-but-mapped (sidecar / mount-namespace)
case. That gap is now closed by routing every Attach / AcquireBinary /
LoadProcessMappings call through ehmaps.openableBinary. Drop the entry
from the design spec's follow-ups section.
Kubernetes-aware profile labels section was 18 lines of dense prose
with the lead message (use DaemonSet) buried in caveats. Sidecar
bullet stretched the security tradeoff across 4 nested clauses in
a single sentence. Trimmed to 9 lines: DaemonSet recommendation
upfront, sidecar's shareProcessNamespace caveat in one parenthesized
phrase instead of an inline mini-essay.

Also dropped a stray forward-reference under Usage ("Two specific
deployment shapes — Python via --inject-python, and sidecar inside
a Kubernetes pod — work as documented in the use cases above") that
conflated two unrelated features and pointed back at content already
covered above. Replaced with a plain pointer to the Python docs.
The paragraph still described v1.1.0: process_dispatch hook is the
mechanism, "binaries already on disk get DWARF via the cache (no
override)", "sidecar/missing binaries get the full ELF". None of
that is accurate after the cache-layout fix.

Reality now: per-mapping classifier routes each binary into
process-mode (local DWARF / debug-link) or file-mode (cached .debug
via the OTel-style normalize-addresses-then-symbolize-against-debug
pattern). Sidecar / deleted-binary cases open through
/proc/<pid>/map_files. The dispatcher's role shrank to a fail-open
path. Section now leads with the three routing buckets and points
at the design doc for the math.
@dpsoft dpsoft requested a review from Copilot May 15, 2026 18:04
@dpsoft dpsoft marked this pull request as ready for review May 15, 2026 18:05
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes off-box symbolization for stripped ELF binaries that only carry .note.gnu.build-id (common for Rust/Go release builds) by adding a per-mapping routing layer that chooses between blazesym process-mode and a new file-mode path that symbolizes directly against cached .debug files (with Go-side address normalization). Also improves robustness for “deleted-but-still-mapped” / sidecar mount-namespace cases by preferentially opening binaries via /proc/<pid>/map_files/....

Changes:

  • Add procmap.Mapping.MapFiles + OpenablePath() and plumb /proc/<pid>/map_files/<start>-<limit> through procmap + EH unwinder paths so build-ids and binaries can be opened even when the symbolic path is unreachable.
  • Introduce procmap.AddressMapper and a debuginfod classifier + per-mapping symbolization router (process-mode vs file-mode), with file-mode calling blaze_symbolize_elf_virt_offsets against cached .debug and rewriting frame addresses back to original process PCs.
  • Add extensive unit + integration tests covering stripped Rust/Go, cache hits, parse-fail demotion, libc local resolution, and deleted-but-mapped binaries.

Reviewed changes

Copilot reviewed 28 out of 29 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
unwind/procmap/resolver.go Adds Mappings() accessor and attaches BuildIDs preferring MapFiles over symbolic Path.
unwind/procmap/procmap.go Adds Mapping.MapFiles and OpenablePath() helper to prefer /proc/<pid>/map_files.
unwind/procmap/procmap_test.go Extends procmap tests for MapFiles, Mappings() resnapshot behavior, and OpenablePath().
unwind/procmap/parse.go Populates Mapping.MapFiles during /proc/<pid>/maps parsing.
unwind/procmap/elfhelpers_test.go Adds ELF fixture writer used by procmap tests.
unwind/procmap/addressmapper.go Adds Go-side PHDR-based address normalization (AddressMapper).
unwind/procmap/addressmapper_test.go Adds unit tests for AddressMapper mapping math (alignment/PIE/multi-PT_LOAD).
unwind/ehmaps/tracker.go Threads openPath through Attach/Enroll/CompileOnly and uses map_files when symbolic path is unreachable.
unwind/ehmaps/tracker_test.go Updates tests for new Attach/EnrollWithoutCompile/AttachCompileOnly signatures.
unwind/ehmaps/store.go Updates AcquireBinary to accept (binPath, openPath) and compile from openable path.
unwind/ehmaps/scan_enroll.go Uses map_files for real /proc, but avoids it for synthetic proc trees in tests.
unwind/ehmaps/openable.go Adds helper to choose an openable binary path (map_files first, then symbolic).
unwind/ehmaps/openable_test.go Tests openable-path selection (skipping map_files checks without required caps).
unwind/ehmaps/elf_helpers.go Updates LoadProcessMappings to open via openPath while matching by symbolic name.
unwind/dwarfagent/miss_drainer.go Resolves binaries for lazy compile via openable path (including map_files).
unwind/dwarfagent/miss_drainer_test.go Updates tests for new (binPath, openPath) return signature.
test/integration_test.go Adds/extends integration tests for stripped Rust/Go debuginfod flows, cache hits, parse-fail demotion, and sidecar-like deletion.
test/integration_strip_helpers.go Adds helpers to strip workloads and upload .debug to local debuginfod fixture.
symbolize/debuginfod/symbolizer.go Implements per-mapping routing, file-mode symbolization, and address rewrite back to process PCs.
symbolize/debuginfod/stats.go Adds counters for classify routes, file-mode outcomes, and normalization failures.
symbolize/debuginfod/singleflight.go Introduces sfFetcher interface for easier fetcher substitution in tests.
symbolize/debuginfod/elftest_helpers_test.go Adds ELF fixture builders for debuginfod classifier/dispatcher tests.
symbolize/debuginfod/dispatcher.go Adds file-mode cgo wrapper for blaze_symbolize_elf_virt_offsets and reframes dispatcher Case 3 as no-fetch fail-open.
symbolize/debuginfod/dispatcher_test.go Adds tests for file-mode address rewrite and dispatcher Case 3 no-fetch behavior.
symbolize/debuginfod/classifier.go Adds per-mapping classifier for skip/process/file routes, local-vs-fetch decisions, and bad-debug/neg-fetch tracking.
symbolize/debuginfod/classifier_test.go Adds unit tests for classifier routing decisions and badDebug behavior.
README.md Updates documentation to describe the new per-mapping classifier + file-mode flow.
docs/specs/2026-05-12-debuginfod-cache-layout-design.md Adds design spec documenting rationale, routing, failure modes, and test plan.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread unwind/procmap/resolver.go
Comment thread symbolize/debuginfod/classifier.go
Comment thread test/integration_strip_helpers.go Outdated
Comment thread symbolize/debuginfod/symbolizer.go Outdated
dpsoft added 4 commits May 15, 2026 15:15
attachBuildIDs was caching the build-id under the MapFiles key
(/proc/<pid>/map_files/<range>), which is unique per PID + VA range.
Two PIDs mapping the same libc would create two cache entries for the
same build-id, causing r.buildIDs to grow unbounded in long-running agents.

Fix: read the build-id directly via ReadBuildID(MapFiles) — bypassing
buildIDFor's cache — then backfill r.buildIDs under the stable symbolic
path. Future lookups for the same binary (even from a different PID) get a
cache hit without re-reading the ELF.

Add TestAttachBuildIDsCachesUnderSymbolicPath to prove the invariant:
after attachBuildIDs runs, buildIDFor(symbolicPath) returns the cached value
even when the MapFiles ELF has been deleted (no re-read possible).
… eviction

negFetch, badDebug, and mappers were plain map[K]V with TTL-only cleanup
that only fires on access. Cold entries were never reclaimed; with many
unique build-ids the maps would grow without bound. The doc comment on
classifier claimed "bounded LRU" but the implementation did not enforce
any size limit.

Fix: add classifierCacheMax = 4096 and enforce it on insert:
- negFetch / badDebug: evictOldestTime() drops the entry with the earliest
  deadline (O(n) scan, runs at most once per insert, amortised O(1) for a
  map that rarely changes topology) — correct for TTL-keyed maps.
- mappers: no deadline, so an arbitrary entry is dropped via single-pass
  map iteration when the map is full.

Update the classifier doc comment to accurately describe the policy:
"bounded by size + TTL; oldest-deadline eviction on insert when full".

Add TestClassifierBadDebugBounded and TestClassifierNegFetchBounded to
prove each map stays at or below classifierCacheMax after
classifierCacheMax+100 inserts.
The previous comment warned "dst MUST be under the worktree, NOT /tmp"
citing the nosuid-mount / file-caps constraint. That constraint applies
to the perf-agent binary (which is setcap'd), not to workload binaries
being stripped with objcopy. stripWorkload itself never sets or relies on
file caps; callers already pass t.TempDir() (which resolves to /tmp),
so the "MUST NOT /tmp" claim directly contradicted actual usage.

Replace with a short, accurate description of what the function does.
…ot calls)

symbolizeFileBucket incremented fileModeCalls by len(virt) — the number
of virtual addresses in the bucket — not by 1 per call. The name
"fileModeCalls" implied "number of calls to symbolizeFileBucket", which
is misleading and caused confusion when reading the counter in Stats.

Rename throughout:
- atomicStats.fileModeCalls → fileModeAddrs
- Stats.FileModeCalls        → FileModeAddrs
- snapshot() mirror updated
- Doc comments updated to clarify: "total number of addresses (IPs)
  resolved through the file-mode path".

The increment site (len(virt)) is unchanged.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 28 out of 29 changed files in this pull request and generated 3 comments.

Comment thread unwind/procmap/resolver.go
Comment thread symbolize/debuginfod/classifier.go Outdated
Comment thread symbolize/debuginfod/symbolizer.go Outdated
dpsoft added 3 commits May 15, 2026 15:52
The previous comment said "returns nil when ... /proc parse failed", but
the implementation returns (nil, entry.err) on parse errors — only the
PID-gone fast-fail returns (nil, nil).

Rewrite the contract block to clearly distinguish:
- (nil, nil)  — PID gone, access restricted, or no executable regions
- (nil, err)  — /proc parse failed (I/O error, unexpected format)
- (mappings, nil) — success
The fetcher field and newClassifier parameter were typed as the concrete
*singleflightFetcher, bypassing the sfFetcher interface that was added in
singleflight.go precisely for minimal coupling and test-fake injection.

Change both to sfFetcher so:
- Tests can inject a nil or stub without constructing a real fetcher
- The classifier is decoupled from the concrete transport layer
- Existing call sites (symbolizer.go passes *singleflightFetcher, tests
  pass nil) continue to work unchanged since *singleflightFetcher already
  satisfies sfFetcher and nil is a valid nil-interface value
…ream

The skip-route bullet implied that [vdso]/[stack]/anonymous mappings
commonly reach the classifier and trigger routeSkip. In fact,
procmap.Resolver's parser (parseMapsLine) filters out paths with
HasPrefix("[") and empty paths before they ever appear in the Mappings
snapshot, so IPs in those regions hit "no mapping found" and fall through
to process-mode rather than the skip route.

Update the doc comment to:
- Keep the skip bullet but qualify it as rarely triggered in practice
  given the upstream procmap filter (routeSkip is still a real code path
  the classifier can return for any ELF path that happens to match
  nonSymbolizablePaths)
- Add an explicit note that mappings from procmap.Resolver exclude
  pseudo-files and anonymous ranges, so they fall back to process-mode
@dpsoft dpsoft merged commit 3167642 into main May 15, 2026
10 checks passed
@dpsoft dpsoft deleted the fix/debuginfod-cache-layout branch May 15, 2026 19:17
dpsoft added a commit that referenced this pull request May 15, 2026
Both projects these documents covered have shipped:

- kernel-stacks M1 (PR #21, v1.2.0)
- debuginfod cache layout fix (PR #22)

The specs and plans served their purpose during design/implementation;
they're not user-facing reference material. The user-facing docs
under docs/ — debuginfod-symbolization.md, perf-data-output.md,
python-profiling.md — stay.

Removes 6294 lines of planning artifacts.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants