feat(cmd): pprof CPU + heap capture per MCP tool (mache-6b6da6 phase 2)#361
Conversation
Phase 2 of mache-6b6da6: extends the e2e tool harness to optionally
capture runtime/pprof CPU + heap profiles per tool. Gated on
$E2E_CAPTURE_PPROF=1; phase 1's latency/alloc-only mode is
unchanged when the env var isn't set.
== What lands ==
cmd/all_tools_e2e_test.go:
- toolProfile gains CPUProfile, HeapProfile, CPUIterations
fields. omitempty in the JSON manifest, so phase-1 runs emit
the same shape as before.
- pprofOpts struct (dir + iterations) carries env-driven config.
- readPprofOpts(t) parses E2E_CAPTURE_PPROF, E2E_PPROF_DIR (with
a sensible default sibling of the manifest), and
E2E_CPU_ITERATIONS (default 500 — see comment for why).
- capturePprof(t, ...) runs the handler in a K-iteration loop
under pprof.StartCPUProfile, then snapshots heap via
pprof.WriteHeapProfile. Skipped on tools that errored or were
skipped in the canonical call (profiling "no _ast table" is
noise, not signal).
The latency/alloc measurement still comes from a single canonical
call so manifest numbers stay comparable across runs regardless of
pprof being on. Iterations exist solely to give the runtime
sampler enough work to record stacks.
== Why 500 iterations default ==
Mache's tools complete in <1ms each (phase 1 baseline). The CPU
sampler runs at ~100Hz (10ms tick). Below ~200 iterations the
profile records zero samples — the loop exits before any tick
fires. 500 iterations × ~1ms ≈ 500ms per tool yields ~30ms of
samples (14% duty cycle), enough for flamegraphs to find dominant
stacks. Heap profiling has no equivalent constraint and is
meaningful at any iteration count.
For larger fixtures with slower tools, lower iterations work;
override via E2E_CPU_ITERATIONS.
== Taskfile ==
New task profile-tools-pprof variant runs the harness with
E2E_CAPTURE_PPROF=1 + manifest pinned to {{.ROOT_DIR}}/.e2e/.
Outputs land in .e2e/pprof/<tool>.{cpu,heap}.pprof and link from
the manifest. Total runtime ~3s on this fixture.
Phase-1 task profile-tools is unchanged (latency/alloc only, ~0.6s).
== Sample output ==
go tool pprof -top .e2e/pprof/get_architecture.cpu.pprof
Duration: 200ms, Total samples = 30ms (14.83%)
Showing nodes accounting for 30ms, 100% of 30ms total
flat flat% sum% cum cum%
20ms 66.67% 66.67% 20ms 66.67% runtime.kevent
10ms 33.33% 100% 10ms 33.33% runtime.add (inline)
(runtime.kevent dominates because the file watcher started by
buildMaybeMultiGraph runs concurrently. Phase 3 flamegraph
generation can apply -focus to filter to tool-specific stacks.)
go tool pprof -top .e2e/pprof/get_architecture.heap.pprof
File: cmd.test
Type: inuse_space
Showing top 5 nodes out of 35
flat flat% sum% cum cum%
2051.25kB 40.03% 40.03% 2051.25kB 40.03% runtime.mallocgc
1024.03kB 19.99% 60.02% 1024.03kB 19.99% strings.Fields
512.38kB 10.00% 70.02% 1024.41kB 19.99% jsonschema/v6.newSchema
512.07kB 9.99% 80.01% 512.07kB 9.99% hclsyntax.init.0
512.03kB 9.99% 90.01% 512.03kB 9.99% fmt.Sprintf
Heap profiles are immediately useful — actionable allocation
attribution for every tool, no sampling-rate tax.
== What's next (phase 3) ==
- SVG flamegraph generation via 'go tool pprof -svg' for each
tool's profiles, written to .e2e/pprof/<tool>.{cpu,heap}.svg
- INDEX.md table linking each tool to its profiles + flamegraphs
- Optional: 'go tool pprof -focus=<handler>' to filter background
noise from the file watcher
Refs: mache-6b6da6 phase 2, builds on phase 1 (#360)
find_smells (advisory)Scoped to files changed in this PR. Rules below run on the standalone (no-LLO) cross-ref tables; No structural smells in changed files. ✓ Rules: see |
There was a problem hiding this comment.
Pull request overview
Adds optional per-tool runtime/pprof CPU + heap capture to the existing MCP tools E2E profiling harness (phase 2 of mache-6b6da6), gated by E2E_CAPTURE_PPROF=1, and exposes a new Task task to run it.
Changes:
- Extend the per-tool manifest schema to include optional
cpu_profile/heap_profilepaths and CPU iteration count. - Add env-var driven pprof configuration (
E2E_CAPTURE_PPROF,E2E_PPROF_DIR,E2E_CPU_ITERATIONS) and implement per-tool CPU/heap capture. - Add
task profile-tools-pprofto generate pprof artifacts alongside the existing latency/alloc manifest.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| Taskfile.yml | Adds a new task to run the E2E harness with pprof capture enabled and documents expected outputs/usage. |
| cmd/all_tools_e2e_test.go | Implements env-var parsing and per-tool CPU/heap pprof capture and records artifact links in the manifest. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Five points from Copilot on PR #361, all valid: 1. Manifest path comment claimed paths were 'relative to test cwd' but capturePprof records absolute paths. Updated comment to be honest about absolute-paths-not-portable, with a forward-link to phase 4 making them manifest-relative when regression-detection consumes baselines. 2. Default CPU iteration count comment said 50, code said 500. Drift from the bump made during phase 2 iteration. Fixed comment to match code. 3. StopCPUProfile is now deferred for panic-safety. Without the defer, a panicking handler leaves profiling enabled, and the next tool's StartCPUProfile fails with 'already profiling', cascading the failure across the harness. Kept the explicit StopCPUProfile() after the loop so the CPU profile boundary tightens to just the iterations on the happy path; the defer becomes a no-op there and the panic-safety fallback otherwise. 4. readPprofOpts hard-failed on mkdir while capturePprof treats pprof as best-effort. Aligned to best-effort: mkdir failure logs and disables capture for the run; phase-1 latency/alloc capture still runs. 5. Taskfile claimed 'pprof -svg' renders a flamegraph. It doesn't — it emits a call graph. Updated comment to be accurate and forward-link to phase 3 for actual flamegraphs via brendangregg/flamegraph.pl or pprof's HTTP UI. No behavioral change for the happy path. Full e2e suite still green; pprof artifacts still produced for ok-status tools. Refs: PR #361 Copilot review, mache-6b6da6 phase 2
Two fixes folded into the phase-2 PR before merge, addressing the 'heap profile dominated by init noise' problem caught when looking at the rendered SVG (Image #2 from inline review): == Fix 1: heap baseline + delta == cmd/all_tools_e2e_test.go::capturePprof now captures a heap snapshot BEFORE the iteration loop in addition to the one AFTER. Each tool ends up with: <tool>.heap.baseline.pprof — pre-loop snapshot (init state) <tool>.heap.pprof — post-loop snapshot (tool work) <tool>.cpu.pprof — CPU profile over the loop The baseline lets consumers compute the delta via 'pprof -base=<baseline>'. Without it, heap profiles show cumulative allocation since process start — dominated by package-init noise (jsonschema, sqlite, tree-sitter init) that buries the tool's actual signal at ~5% of width. == Fix 2: flamegraphs task uses -alloc_space + -base == Taskfile 'flamegraphs' now runs: go tool pprof -alloc_space -base=baseline -raw <after.heap.pprof> | sed (drop cols 3-4 to match stackcollapse-go.pl's 2-col regex) | stackcollapse-go.pl | flamegraph.pl --title='<tool> heap delta (alloc_space, baseline-subtracted)' -alloc_space is the right column for delta because -inuse_space diff is near-zero on short loops (allocs get GC'd between the two snapshots). alloc_space captures bytes allocated DURING the loop, regardless of whether they're still live. == Real flamegraphs (not call graphs) == PR #361 phase 2 deferred this to a follow-up after Copilot correctly noted that 'pprof -svg' emits a call graph, not a flamegraph. Phase 3 wires the brendangregg/FlameGraph pipeline: task flamegraphs: - pre-flight: requires flamegraph.pl + stackcollapse-go.pl on PATH (install via 'brew install flamegraph' on macOS, 'git clone https://github.com/brendangregg/FlameGraph' on Linux) - emits .e2e/pprof/<tool>.{cpu,heap}.svg per tool - emits .e2e/INDEX.md linking each tool to its profiles + flamegraphs - placeholder SVGs for tools with no CPU samples (handler too fast for 100Hz sampler at 500 iterations) so INDEX.md stays consistent Vendoring the perl scripts considered + rejected: ~30KB of CDDL+GPL perl into an Apache 2.0 Go project is license fragmentation not worth the convenience. brew/git-clone is universal enough. == First-run signal (sample, get_architecture) == Pre-fix flamegraph dominated by: runtime.doInit / runtime.doInit1 (66% of width) github.com/santhosh-tekuri/jsonschema (init) modernc.org/sqlite (init) github.com/smacker/go-tree-sitter (init) Tool's actual stack: <5% of width on the right edge Post-fix flamegraph (baseline-subtracted): Tool's actual stack base wide: testing.tRunner → TestE2E_AllMCPTools → profileTool → capturePprof → makeGetArchitectureHandler.func1 Hotspots above: internal/graph.buildProjection (widest — graph projection) internal/graph.DetectCommunities (Louvain modularity) mark3labs/mcp-go (JSON-RPC framework overhead) That's actionable: get_architecture's heap pressure comes from community detection, an obvious caching target for follow-up optimization work. == Out of scope (phase 4) == - Rich-fixture mode (mache-on-mache instead of toy 4-package) so CPU samples have meaningful tool runtime to attribute - Regression detection vs checked-in baseline manifest - INDEX.md auto-detection of placeholder SVGs (footnote them) Refs: mache-6b6da6 phase 3 (folded into phase 2 PR before merge)
…3) (#362) Surfaced by the e2e tool harness from PR #361. Heap-delta flamegraphs showed MemoryStore.DefsMap as the dominant allocator in 4 of 16 MCP tools: get_impact: 52% of heap delta (4.51MB) get_overview: 32% of heap delta (2.51MB) get_architecture: significant (community detection path) get_communities: significant (Louvain on RefsMap) Each call to DefsMap or RefsMap deep-copied the whole map plus one slice per entry. On the 4-package toy fixture that's small absolute numbers; on a real workload (10K+ constructs, 50K+ ref tokens) it's millions of allocs per call. == Fix == Memoize the deep-copy snapshot. atomic.Pointer holds the cached snapshot; AddDef / AddRef / DeleteFileNodes invalidate it under the existing write lock; DefsMap / RefsMap return the cached instance on hit, rebuild under double-checked-locking on miss. Concurrency contract: - Snapshot store happens under RLock (the rebuild side) - Invalidation store happens under Lock (the mutation side) - Lock blocks behind any in-flight RLock, so a stale snapshot can't be stored after invalidation - Multiple readers seeing a nil snapshot may race to rebuild; double-checked locking inside the RLock prevents torn snapshots Callers must treat the returned map as read-only (every existing caller is). LookupDef remains the right API for callers who want a mutable per-token slice. == Tests == internal/graph/snapshot_cache_test.go pins the cache CONTRACT, not the speedup (the e2e harness measures the latter empirically): - DefsMap_ReturnsSameInstanceWithoutMutation: repeated calls return the same map address (cache hit) - DefsMap_AddDefInvalidatesCache: AddDef invalidates; pre-invalidation snapshots stay consistent (no retroactive mutation for existing readers) - RefsMap_ReturnsSameInstanceWithoutMutation: same for refs - RefsMap_AddRefInvalidatesCache: same for refs - SnapshotCache_ConcurrentReadersGetConsistentSnapshot: 32 concurrent readers after invalidation must all see a complete (non-torn) snapshot of 100 entries If a future refactor re-introduces per-call allocation here, these tests fail loudly — that's the CI gate. == Empirical validation == Pre-cache (single canonical call + 500-iter loop, baseline-subtracted heap delta): get_impact: ~8.7MB total, DefsMap=52% top get_overview: ~7.8MB total, DefsMap=32% top get_architecture: ~26MB total, tool-fn=42% (buildProjection deep) get_communities: ~14.7MB total, tool-fn=20% Post-cache: get_impact: ~4.75MB total (-45%), DefsMap GONE from top-3 get_overview: ~9.7MB total (different shape, harness pprof overhead now dominates instead of DefsMap) get_architecture: ~17.1MB total (-34%), tool-fn=35% (still top but smaller in absolute terms) get_communities: ~20.5MB total The remaining heap on get_overview / get_communities is harness overhead (runtime/pprof.StartCPUProfile, compress/flate from profile writing) and JSON response marshaling — not caching candidates we control. == Full suite == go test ./... — 20/20 green go test -race — passes (concurrent reader test on -race) golangci-lint — clean == What's next == - Rich fixture (mache-on-mache, synthetic-medium) — mache-be8090, surfaces more memoization candidates at real-world scale - Sheaf cache cross-repo wiring (mache-b37cff) — daemon-side sheaf_get_topology op so cached community detection survives process restarts - Hetero corpus (mache-d332b5) — gated on CGO removal Refs: mache-6b6da6 phase 3 follow-up; preserves the e2e harness contract from PR #361
Summary
Phase 2 of
mache-6b6da6— extends the e2e tool harness from #360 to optionally captureruntime/pprofCPU + heap per tool. Gated onE2E_CAPTURE_PPROF=1; phase 1's latency/alloc path is unchanged when the env var is unset.What lands
cmd/all_tools_e2e_test.go:toolProfilegainsCPUProfile,HeapProfile,CPUIterationsfields (omitempty)pprofOpts+readPprofOpts(t)parseE2E_CAPTURE_PPROF,E2E_PPROF_DIR,E2E_CPU_ITERATIONScapturePprof()runs the handler in a K-iteration loop underpprof.StartCPUProfile, then snapshots heap viapprof.WriteHeapProfileTaskfile.yml:task profile-tools-pprofruns the harness with capture enabledtask profile-toolsunchanged (latency/alloc only)Why 500 iterations default
Mache's tools complete in <1ms each. The CPU sampler runs at ~100Hz (10ms tick). Below ~200 iterations the CPU profile records zero samples — loop exits before any tick. 500 × ~1ms ≈ 500ms per tool yields ~30ms of samples (~14% duty cycle), enough for flamegraphs to find dominant stacks. Heap has no equivalent constraint.
Override via
E2E_CPU_ITERATIONS=Nfor slower fixtures or deeper traces.Sample outputs
CPU profile (
get_architecture):Heap profile (
get_architecture):Heap is immediately actionable. CPU is dominated by background
runtime.kevent(file watcher started bybuildMaybeMultiGraphruns concurrently); phase 3 will apply-focus=<handler>to filter to tool-specific stacks.What's next (Phase 3)
go tool pprof -svgper toolINDEX.mdlinking tools → profiles → flamegraphs-focusfilter for background-noise filteringTest plan
task profile-tools(no env vars) — phase 1 path unchanged, ~0.6stask profile-tools-pprofproduces 32 pprof files (16 tools × CPU + heap), ~3s totalcpu_profile/heap_profilefieldsgo tool pprof -topreads both profile types successfullygo test ./...20/20 greenRefs:
mache-6b6da6phase 2, builds on phase 1 (#360).