feat(cmd): pprof CPU + heap capture per MCP tool (mache-6b6da6 phase 2) by jamestexas · Pull Request #361 · agentic-research/mache

jamestexas · 2026-05-09T16:19:13Z

Summary

Phase 2 of mache-6b6da6 — extends the e2e tool harness from #360 to optionally capture runtime/pprof CPU + heap per tool. Gated on E2E_CAPTURE_PPROF=1; phase 1's latency/alloc path is unchanged when the env var is unset.

What lands

cmd/all_tools_e2e_test.go:

toolProfile gains CPUProfile, HeapProfile, CPUIterations fields (omitempty)
pprofOpts + readPprofOpts(t) parse E2E_CAPTURE_PPROF, E2E_PPROF_DIR, E2E_CPU_ITERATIONS
capturePprof() runs the handler in a K-iteration loop under pprof.StartCPUProfile, then snapshots heap via pprof.WriteHeapProfile
Skipped on tools that errored / were skipped (profiling "no _ast table" is noise)
Latency/alloc measurement still uses single canonical call → manifest numbers stay comparable across runs

Taskfile.yml:

New task profile-tools-pprof runs the harness with capture enabled
Phase 1 task profile-tools unchanged (latency/alloc only)

Why 500 iterations default

Mache's tools complete in <1ms each. The CPU sampler runs at ~100Hz (10ms tick). Below ~200 iterations the CPU profile records zero samples — loop exits before any tick. 500 × ~1ms ≈ 500ms per tool yields ~30ms of samples (~14% duty cycle), enough for flamegraphs to find dominant stacks. Heap has no equivalent constraint.

Override via E2E_CPU_ITERATIONS=N for slower fixtures or deeper traces.

Sample outputs

CPU profile (get_architecture):

Duration: 200ms, Total samples = 30ms (14.83%)
Showing nodes accounting for 30ms, 100% of 30ms total
      flat  flat%   sum%        cum   cum%
      20ms 66.67% 66.67%  20ms 66.67%  runtime.kevent
      10ms 33.33%   100%  10ms 33.33%  runtime.add (inline)

Heap profile (get_architecture):

File: cmd.test
Type: inuse_space
Showing top 5 nodes out of 35
 2051.25kB 40.03% 40.03%  2051.25kB 40.03%  runtime.mallocgc
 1024.03kB 19.99% 60.02%  1024.03kB 19.99%  strings.Fields
  512.38kB 10.00% 70.02%  1024.41kB 19.99%  jsonschema/v6.newSchema
  512.07kB  9.99% 80.01%   512.07kB  9.99%  hclsyntax.init.0
  512.03kB  9.99% 90.01%   512.03kB  9.99%  fmt.Sprintf

Heap is immediately actionable. CPU is dominated by background runtime.kevent (file watcher started by buildMaybeMultiGraph runs concurrently); phase 3 will apply -focus=<handler> to filter to tool-specific stacks.

What's next (Phase 3)

SVG flamegraph generation via go tool pprof -svg per tool
INDEX.md linking tools → profiles → flamegraphs
-focus filter for background-noise filtering

Test plan

task profile-tools (no env vars) — phase 1 path unchanged, ~0.6s
task profile-tools-pprof produces 32 pprof files (16 tools × CPU + heap), ~3s total
Manifest links profiles via cpu_profile / heap_profile fields
go tool pprof -top reads both profile types successfully
Full go test ./... 20/20 green
golangci-lint passes
CI green
Copilot review

Refs: mache-6b6da6 phase 2, builds on phase 1 (#360).

Phase 2 of mache-6b6da6: extends the e2e tool harness to optionally capture runtime/pprof CPU + heap profiles per tool. Gated on $E2E_CAPTURE_PPROF=1; phase 1's latency/alloc-only mode is unchanged when the env var isn't set. == What lands == cmd/all_tools_e2e_test.go: - toolProfile gains CPUProfile, HeapProfile, CPUIterations fields. omitempty in the JSON manifest, so phase-1 runs emit the same shape as before. - pprofOpts struct (dir + iterations) carries env-driven config. - readPprofOpts(t) parses E2E_CAPTURE_PPROF, E2E_PPROF_DIR (with a sensible default sibling of the manifest), and E2E_CPU_ITERATIONS (default 500 — see comment for why). - capturePprof(t, ...) runs the handler in a K-iteration loop under pprof.StartCPUProfile, then snapshots heap via pprof.WriteHeapProfile. Skipped on tools that errored or were skipped in the canonical call (profiling "no _ast table" is noise, not signal). The latency/alloc measurement still comes from a single canonical call so manifest numbers stay comparable across runs regardless of pprof being on. Iterations exist solely to give the runtime sampler enough work to record stacks. == Why 500 iterations default == Mache's tools complete in <1ms each (phase 1 baseline). The CPU sampler runs at ~100Hz (10ms tick). Below ~200 iterations the profile records zero samples — the loop exits before any tick fires. 500 iterations × ~1ms ≈ 500ms per tool yields ~30ms of samples (14% duty cycle), enough for flamegraphs to find dominant stacks. Heap profiling has no equivalent constraint and is meaningful at any iteration count. For larger fixtures with slower tools, lower iterations work; override via E2E_CPU_ITERATIONS. == Taskfile == New task profile-tools-pprof variant runs the harness with E2E_CAPTURE_PPROF=1 + manifest pinned to {{.ROOT_DIR}}/.e2e/. Outputs land in .e2e/pprof/<tool>.{cpu,heap}.pprof and link from the manifest. Total runtime ~3s on this fixture. Phase-1 task profile-tools is unchanged (latency/alloc only, ~0.6s). == Sample output == go tool pprof -top .e2e/pprof/get_architecture.cpu.pprof Duration: 200ms, Total samples = 30ms (14.83%) Showing nodes accounting for 30ms, 100% of 30ms total flat flat% sum% cum cum% 20ms 66.67% 66.67% 20ms 66.67% runtime.kevent 10ms 33.33% 100% 10ms 33.33% runtime.add (inline) (runtime.kevent dominates because the file watcher started by buildMaybeMultiGraph runs concurrently. Phase 3 flamegraph generation can apply -focus to filter to tool-specific stacks.) go tool pprof -top .e2e/pprof/get_architecture.heap.pprof File: cmd.test Type: inuse_space Showing top 5 nodes out of 35 flat flat% sum% cum cum% 2051.25kB 40.03% 40.03% 2051.25kB 40.03% runtime.mallocgc 1024.03kB 19.99% 60.02% 1024.03kB 19.99% strings.Fields 512.38kB 10.00% 70.02% 1024.41kB 19.99% jsonschema/v6.newSchema 512.07kB 9.99% 80.01% 512.07kB 9.99% hclsyntax.init.0 512.03kB 9.99% 90.01% 512.03kB 9.99% fmt.Sprintf Heap profiles are immediately useful — actionable allocation attribution for every tool, no sampling-rate tax. == What's next (phase 3) == - SVG flamegraph generation via 'go tool pprof -svg' for each tool's profiles, written to .e2e/pprof/<tool>.{cpu,heap}.svg - INDEX.md table linking each tool to its profiles + flamegraphs - Optional: 'go tool pprof -focus=<handler>' to filter background noise from the file watcher Refs: mache-6b6da6 phase 2, builds on phase 1 (#360)

github-actions · 2026-05-09T16:21:58Z

find_smells (advisory)

Scoped to files changed in this PR. Rules below run on the standalone (no-LLO) cross-ref tables; _ast rules (cyclomatic_complexity, long_function, long_file, magic_int_in_comparison) are not exercised here.

No structural smells in changed files. ✓

Rules: see docs/ARCHITECTURE.md for the full registry. Advisory only — these are heuristics, not gates.

Copilot

Pull request overview

Adds optional per-tool runtime/pprof CPU + heap capture to the existing MCP tools E2E profiling harness (phase 2 of mache-6b6da6), gated by E2E_CAPTURE_PPROF=1, and exposes a new Task task to run it.

Changes:

Extend the per-tool manifest schema to include optional cpu_profile / heap_profile paths and CPU iteration count.
Add env-var driven pprof configuration (E2E_CAPTURE_PPROF, E2E_PPROF_DIR, E2E_CPU_ITERATIONS) and implement per-tool CPU/heap capture.
Add task profile-tools-pprof to generate pprof artifacts alongside the existing latency/alloc manifest.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File	Description
Taskfile.yml	Adds a new task to run the E2E harness with pprof capture enabled and documents expected outputs/usage.
cmd/all_tools_e2e_test.go	Implements env-var parsing and per-tool CPU/heap pprof capture and records artifact links in the manifest.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Five points from Copilot on PR #361, all valid: 1. Manifest path comment claimed paths were 'relative to test cwd' but capturePprof records absolute paths. Updated comment to be honest about absolute-paths-not-portable, with a forward-link to phase 4 making them manifest-relative when regression-detection consumes baselines. 2. Default CPU iteration count comment said 50, code said 500. Drift from the bump made during phase 2 iteration. Fixed comment to match code. 3. StopCPUProfile is now deferred for panic-safety. Without the defer, a panicking handler leaves profiling enabled, and the next tool's StartCPUProfile fails with 'already profiling', cascading the failure across the harness. Kept the explicit StopCPUProfile() after the loop so the CPU profile boundary tightens to just the iterations on the happy path; the defer becomes a no-op there and the panic-safety fallback otherwise. 4. readPprofOpts hard-failed on mkdir while capturePprof treats pprof as best-effort. Aligned to best-effort: mkdir failure logs and disables capture for the run; phase-1 latency/alloc capture still runs. 5. Taskfile claimed 'pprof -svg' renders a flamegraph. It doesn't — it emits a call graph. Updated comment to be accurate and forward-link to phase 3 for actual flamegraphs via brendangregg/flamegraph.pl or pprof's HTTP UI. No behavioral change for the happy path. Full e2e suite still green; pprof artifacts still produced for ok-status tools. Refs: PR #361 Copilot review, mache-6b6da6 phase 2

Two fixes folded into the phase-2 PR before merge, addressing the 'heap profile dominated by init noise' problem caught when looking at the rendered SVG (Image #2 from inline review): == Fix 1: heap baseline + delta == cmd/all_tools_e2e_test.go::capturePprof now captures a heap snapshot BEFORE the iteration loop in addition to the one AFTER. Each tool ends up with: <tool>.heap.baseline.pprof — pre-loop snapshot (init state) <tool>.heap.pprof — post-loop snapshot (tool work) <tool>.cpu.pprof — CPU profile over the loop The baseline lets consumers compute the delta via 'pprof -base=<baseline>'. Without it, heap profiles show cumulative allocation since process start — dominated by package-init noise (jsonschema, sqlite, tree-sitter init) that buries the tool's actual signal at ~5% of width. == Fix 2: flamegraphs task uses -alloc_space + -base == Taskfile 'flamegraphs' now runs: go tool pprof -alloc_space -base=baseline -raw <after.heap.pprof> | sed (drop cols 3-4 to match stackcollapse-go.pl's 2-col regex) | stackcollapse-go.pl | flamegraph.pl --title='<tool> heap delta (alloc_space, baseline-subtracted)' -alloc_space is the right column for delta because -inuse_space diff is near-zero on short loops (allocs get GC'd between the two snapshots). alloc_space captures bytes allocated DURING the loop, regardless of whether they're still live. == Real flamegraphs (not call graphs) == PR #361 phase 2 deferred this to a follow-up after Copilot correctly noted that 'pprof -svg' emits a call graph, not a flamegraph. Phase 3 wires the brendangregg/FlameGraph pipeline: task flamegraphs: - pre-flight: requires flamegraph.pl + stackcollapse-go.pl on PATH (install via 'brew install flamegraph' on macOS, 'git clone https://github.com/brendangregg/FlameGraph' on Linux) - emits .e2e/pprof/<tool>.{cpu,heap}.svg per tool - emits .e2e/INDEX.md linking each tool to its profiles + flamegraphs - placeholder SVGs for tools with no CPU samples (handler too fast for 100Hz sampler at 500 iterations) so INDEX.md stays consistent Vendoring the perl scripts considered + rejected: ~30KB of CDDL+GPL perl into an Apache 2.0 Go project is license fragmentation not worth the convenience. brew/git-clone is universal enough. == First-run signal (sample, get_architecture) == Pre-fix flamegraph dominated by: runtime.doInit / runtime.doInit1 (66% of width) github.com/santhosh-tekuri/jsonschema (init) modernc.org/sqlite (init) github.com/smacker/go-tree-sitter (init) Tool's actual stack: <5% of width on the right edge Post-fix flamegraph (baseline-subtracted): Tool's actual stack base wide: testing.tRunner → TestE2E_AllMCPTools → profileTool → capturePprof → makeGetArchitectureHandler.func1 Hotspots above: internal/graph.buildProjection (widest — graph projection) internal/graph.DetectCommunities (Louvain modularity) mark3labs/mcp-go (JSON-RPC framework overhead) That's actionable: get_architecture's heap pressure comes from community detection, an obvious caching target for follow-up optimization work. == Out of scope (phase 4) == - Rich-fixture mode (mache-on-mache instead of toy 4-package) so CPU samples have meaningful tool runtime to attribute - Regression detection vs checked-in baseline manifest - INDEX.md auto-detection of placeholder SVGs (footnote them) Refs: mache-6b6da6 phase 3 (folded into phase 2 PR before merge)

…3) (#362) Surfaced by the e2e tool harness from PR #361. Heap-delta flamegraphs showed MemoryStore.DefsMap as the dominant allocator in 4 of 16 MCP tools: get_impact: 52% of heap delta (4.51MB) get_overview: 32% of heap delta (2.51MB) get_architecture: significant (community detection path) get_communities: significant (Louvain on RefsMap) Each call to DefsMap or RefsMap deep-copied the whole map plus one slice per entry. On the 4-package toy fixture that's small absolute numbers; on a real workload (10K+ constructs, 50K+ ref tokens) it's millions of allocs per call. == Fix == Memoize the deep-copy snapshot. atomic.Pointer holds the cached snapshot; AddDef / AddRef / DeleteFileNodes invalidate it under the existing write lock; DefsMap / RefsMap return the cached instance on hit, rebuild under double-checked-locking on miss. Concurrency contract: - Snapshot store happens under RLock (the rebuild side) - Invalidation store happens under Lock (the mutation side) - Lock blocks behind any in-flight RLock, so a stale snapshot can't be stored after invalidation - Multiple readers seeing a nil snapshot may race to rebuild; double-checked locking inside the RLock prevents torn snapshots Callers must treat the returned map as read-only (every existing caller is). LookupDef remains the right API for callers who want a mutable per-token slice. == Tests == internal/graph/snapshot_cache_test.go pins the cache CONTRACT, not the speedup (the e2e harness measures the latter empirically): - DefsMap_ReturnsSameInstanceWithoutMutation: repeated calls return the same map address (cache hit) - DefsMap_AddDefInvalidatesCache: AddDef invalidates; pre-invalidation snapshots stay consistent (no retroactive mutation for existing readers) - RefsMap_ReturnsSameInstanceWithoutMutation: same for refs - RefsMap_AddRefInvalidatesCache: same for refs - SnapshotCache_ConcurrentReadersGetConsistentSnapshot: 32 concurrent readers after invalidation must all see a complete (non-torn) snapshot of 100 entries If a future refactor re-introduces per-call allocation here, these tests fail loudly — that's the CI gate. == Empirical validation == Pre-cache (single canonical call + 500-iter loop, baseline-subtracted heap delta): get_impact: ~8.7MB total, DefsMap=52% top get_overview: ~7.8MB total, DefsMap=32% top get_architecture: ~26MB total, tool-fn=42% (buildProjection deep) get_communities: ~14.7MB total, tool-fn=20% Post-cache: get_impact: ~4.75MB total (-45%), DefsMap GONE from top-3 get_overview: ~9.7MB total (different shape, harness pprof overhead now dominates instead of DefsMap) get_architecture: ~17.1MB total (-34%), tool-fn=35% (still top but smaller in absolute terms) get_communities: ~20.5MB total The remaining heap on get_overview / get_communities is harness overhead (runtime/pprof.StartCPUProfile, compress/flate from profile writing) and JSON response marshaling — not caching candidates we control. == Full suite == go test ./... — 20/20 green go test -race — passes (concurrent reader test on -race) golangci-lint — clean == What's next == - Rich fixture (mache-on-mache, synthetic-medium) — mache-be8090, surfaces more memoization candidates at real-world scale - Sheaf cache cross-repo wiring (mache-b37cff) — daemon-side sheaf_get_topology op so cached community detection survives process restarts - Hetero corpus (mache-d332b5) — gated on CGO removal Refs: mache-6b6da6 phase 3 follow-up; preserves the e2e harness contract from PR #361

jamestexas requested a review from Copilot May 9, 2026 16:27

Copilot started reviewing on behalf of jamestexas May 9, 2026 16:28 View session

Copilot AI reviewed May 9, 2026

View reviewed changes

Comment thread cmd/all_tools_e2e_test.go Outdated

Comment thread cmd/all_tools_e2e_test.go Outdated

Comment thread cmd/all_tools_e2e_test.go

Comment thread cmd/all_tools_e2e_test.go Outdated

Comment thread Taskfile.yml Outdated

jamestexas added 2 commits May 9, 2026 10:34

jamestexas merged commit 49b740a into main May 9, 2026
15 checks passed

jamestexas deleted the feat/mache-6b6da6-phase2-pprof branch May 9, 2026 17:02

jamestexas mentioned this pull request May 9, 2026

perf(graph): memoize DefsMap + RefsMap snapshots (mache-6b6da6 phase 3 followup) #362

Merged

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(cmd): pprof CPU + heap capture per MCP tool (mache-6b6da6 phase 2)#361

feat(cmd): pprof CPU + heap capture per MCP tool (mache-6b6da6 phase 2)#361
jamestexas merged 3 commits into
mainfrom
feat/mache-6b6da6-phase2-pprof

jamestexas commented May 9, 2026

Uh oh!

github-actions Bot commented May 9, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jamestexas commented May 9, 2026

Summary

What lands

Why 500 iterations default

Sample outputs

What's next (Phase 3)

Test plan

Uh oh!

github-actions Bot commented May 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

find_smells (advisory)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions Bot commented May 9, 2026 •

edited

Loading