Skip to content

feat(cmd): pprof CPU + heap capture per MCP tool (mache-6b6da6 phase 2)#361

Merged
jamestexas merged 3 commits into
mainfrom
feat/mache-6b6da6-phase2-pprof
May 9, 2026
Merged

feat(cmd): pprof CPU + heap capture per MCP tool (mache-6b6da6 phase 2)#361
jamestexas merged 3 commits into
mainfrom
feat/mache-6b6da6-phase2-pprof

Conversation

@jamestexas
Copy link
Copy Markdown
Contributor

Summary

Phase 2 of mache-6b6da6 — extends the e2e tool harness from #360 to optionally capture runtime/pprof CPU + heap per tool. Gated on E2E_CAPTURE_PPROF=1; phase 1's latency/alloc path is unchanged when the env var is unset.

What lands

cmd/all_tools_e2e_test.go:

  • toolProfile gains CPUProfile, HeapProfile, CPUIterations fields (omitempty)
  • pprofOpts + readPprofOpts(t) parse E2E_CAPTURE_PPROF, E2E_PPROF_DIR, E2E_CPU_ITERATIONS
  • capturePprof() runs the handler in a K-iteration loop under pprof.StartCPUProfile, then snapshots heap via pprof.WriteHeapProfile
  • Skipped on tools that errored / were skipped (profiling "no _ast table" is noise)
  • Latency/alloc measurement still uses single canonical call → manifest numbers stay comparable across runs

Taskfile.yml:

  • New task profile-tools-pprof runs the harness with capture enabled
  • Phase 1 task profile-tools unchanged (latency/alloc only)

Why 500 iterations default

Mache's tools complete in <1ms each. The CPU sampler runs at ~100Hz (10ms tick). Below ~200 iterations the CPU profile records zero samples — loop exits before any tick. 500 × ~1ms ≈ 500ms per tool yields ~30ms of samples (~14% duty cycle), enough for flamegraphs to find dominant stacks. Heap has no equivalent constraint.

Override via E2E_CPU_ITERATIONS=N for slower fixtures or deeper traces.

Sample outputs

CPU profile (get_architecture):

Duration: 200ms, Total samples = 30ms (14.83%)
Showing nodes accounting for 30ms, 100% of 30ms total
      flat  flat%   sum%        cum   cum%
      20ms 66.67% 66.67%  20ms 66.67%  runtime.kevent
      10ms 33.33%   100%  10ms 33.33%  runtime.add (inline)

Heap profile (get_architecture):

File: cmd.test
Type: inuse_space
Showing top 5 nodes out of 35
 2051.25kB 40.03% 40.03%  2051.25kB 40.03%  runtime.mallocgc
 1024.03kB 19.99% 60.02%  1024.03kB 19.99%  strings.Fields
  512.38kB 10.00% 70.02%  1024.41kB 19.99%  jsonschema/v6.newSchema
  512.07kB  9.99% 80.01%   512.07kB  9.99%  hclsyntax.init.0
  512.03kB  9.99% 90.01%   512.03kB  9.99%  fmt.Sprintf

Heap is immediately actionable. CPU is dominated by background runtime.kevent (file watcher started by buildMaybeMultiGraph runs concurrently); phase 3 will apply -focus=<handler> to filter to tool-specific stacks.

What's next (Phase 3)

  • SVG flamegraph generation via go tool pprof -svg per tool
  • INDEX.md linking tools → profiles → flamegraphs
  • -focus filter for background-noise filtering

Test plan

  • task profile-tools (no env vars) — phase 1 path unchanged, ~0.6s
  • task profile-tools-pprof produces 32 pprof files (16 tools × CPU + heap), ~3s total
  • Manifest links profiles via cpu_profile / heap_profile fields
  • go tool pprof -top reads both profile types successfully
  • Full go test ./... 20/20 green
  • golangci-lint passes
  • CI green
  • Copilot review

Refs: mache-6b6da6 phase 2, builds on phase 1 (#360).

Phase 2 of mache-6b6da6: extends the e2e tool harness to optionally
capture runtime/pprof CPU + heap profiles per tool. Gated on
$E2E_CAPTURE_PPROF=1; phase 1's latency/alloc-only mode is
unchanged when the env var isn't set.

== What lands ==

cmd/all_tools_e2e_test.go:

  - toolProfile gains CPUProfile, HeapProfile, CPUIterations
    fields. omitempty in the JSON manifest, so phase-1 runs emit
    the same shape as before.
  - pprofOpts struct (dir + iterations) carries env-driven config.
  - readPprofOpts(t) parses E2E_CAPTURE_PPROF, E2E_PPROF_DIR (with
    a sensible default sibling of the manifest), and
    E2E_CPU_ITERATIONS (default 500 — see comment for why).
  - capturePprof(t, ...) runs the handler in a K-iteration loop
    under pprof.StartCPUProfile, then snapshots heap via
    pprof.WriteHeapProfile. Skipped on tools that errored or were
    skipped in the canonical call (profiling "no _ast table" is
    noise, not signal).

The latency/alloc measurement still comes from a single canonical
call so manifest numbers stay comparable across runs regardless of
pprof being on. Iterations exist solely to give the runtime
sampler enough work to record stacks.

== Why 500 iterations default ==

Mache's tools complete in <1ms each (phase 1 baseline). The CPU
sampler runs at ~100Hz (10ms tick). Below ~200 iterations the
profile records zero samples — the loop exits before any tick
fires. 500 iterations × ~1ms ≈ 500ms per tool yields ~30ms of
samples (14% duty cycle), enough for flamegraphs to find dominant
stacks. Heap profiling has no equivalent constraint and is
meaningful at any iteration count.

For larger fixtures with slower tools, lower iterations work;
override via E2E_CPU_ITERATIONS.

== Taskfile ==

New task profile-tools-pprof variant runs the harness with
E2E_CAPTURE_PPROF=1 + manifest pinned to {{.ROOT_DIR}}/.e2e/.
Outputs land in .e2e/pprof/<tool>.{cpu,heap}.pprof and link from
the manifest. Total runtime ~3s on this fixture.

Phase-1 task profile-tools is unchanged (latency/alloc only, ~0.6s).

== Sample output ==

  go tool pprof -top .e2e/pprof/get_architecture.cpu.pprof
    Duration: 200ms, Total samples = 30ms (14.83%)
    Showing nodes accounting for 30ms, 100% of 30ms total
          flat  flat%   sum%        cum   cum%
          20ms 66.67% 66.67%  20ms 66.67%  runtime.kevent
          10ms 33.33%   100%  10ms 33.33%  runtime.add (inline)

(runtime.kevent dominates because the file watcher started by
buildMaybeMultiGraph runs concurrently. Phase 3 flamegraph
generation can apply -focus to filter to tool-specific stacks.)

  go tool pprof -top .e2e/pprof/get_architecture.heap.pprof
    File: cmd.test
    Type: inuse_space
    Showing top 5 nodes out of 35
          flat  flat%   sum%        cum   cum%
     2051.25kB 40.03% 40.03%  2051.25kB 40.03%  runtime.mallocgc
     1024.03kB 19.99% 60.02%  1024.03kB 19.99%  strings.Fields
      512.38kB 10.00% 70.02%  1024.41kB 19.99%  jsonschema/v6.newSchema
      512.07kB  9.99% 80.01%   512.07kB  9.99%  hclsyntax.init.0
      512.03kB  9.99% 90.01%   512.03kB  9.99%  fmt.Sprintf

Heap profiles are immediately useful — actionable allocation
attribution for every tool, no sampling-rate tax.

== What's next (phase 3) ==

  - SVG flamegraph generation via 'go tool pprof -svg' for each
    tool's profiles, written to .e2e/pprof/<tool>.{cpu,heap}.svg
  - INDEX.md table linking each tool to its profiles + flamegraphs
  - Optional: 'go tool pprof -focus=<handler>' to filter background
    noise from the file watcher

Refs: mache-6b6da6 phase 2, builds on phase 1 (#360)
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 9, 2026

find_smells (advisory)

Scoped to files changed in this PR. Rules below run on the standalone (no-LLO) cross-ref tables; _ast rules (cyclomatic_complexity, long_function, long_file, magic_int_in_comparison) are not exercised here.

No structural smells in changed files. ✓

Rules: see docs/ARCHITECTURE.md for the full registry. Advisory only — these are heuristics, not gates.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds optional per-tool runtime/pprof CPU + heap capture to the existing MCP tools E2E profiling harness (phase 2 of mache-6b6da6), gated by E2E_CAPTURE_PPROF=1, and exposes a new Task task to run it.

Changes:

  • Extend the per-tool manifest schema to include optional cpu_profile / heap_profile paths and CPU iteration count.
  • Add env-var driven pprof configuration (E2E_CAPTURE_PPROF, E2E_PPROF_DIR, E2E_CPU_ITERATIONS) and implement per-tool CPU/heap capture.
  • Add task profile-tools-pprof to generate pprof artifacts alongside the existing latency/alloc manifest.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
Taskfile.yml Adds a new task to run the E2E harness with pprof capture enabled and documents expected outputs/usage.
cmd/all_tools_e2e_test.go Implements env-var parsing and per-tool CPU/heap pprof capture and records artifact links in the manifest.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cmd/all_tools_e2e_test.go Outdated
Comment thread cmd/all_tools_e2e_test.go Outdated
Comment thread cmd/all_tools_e2e_test.go
Comment thread cmd/all_tools_e2e_test.go Outdated
Comment thread Taskfile.yml Outdated
jamestexas added 2 commits May 9, 2026 10:34
Five points from Copilot on PR #361, all valid:

1. Manifest path comment claimed paths were 'relative to test
   cwd' but capturePprof records absolute paths. Updated comment
   to be honest about absolute-paths-not-portable, with a
   forward-link to phase 4 making them manifest-relative when
   regression-detection consumes baselines.

2. Default CPU iteration count comment said 50, code said 500.
   Drift from the bump made during phase 2 iteration. Fixed
   comment to match code.

3. StopCPUProfile is now deferred for panic-safety. Without the
   defer, a panicking handler leaves profiling enabled, and the
   next tool's StartCPUProfile fails with 'already profiling',
   cascading the failure across the harness. Kept the explicit
   StopCPUProfile() after the loop so the CPU profile boundary
   tightens to just the iterations on the happy path; the defer
   becomes a no-op there and the panic-safety fallback otherwise.

4. readPprofOpts hard-failed on mkdir while capturePprof treats
   pprof as best-effort. Aligned to best-effort: mkdir failure
   logs and disables capture for the run; phase-1 latency/alloc
   capture still runs.

5. Taskfile claimed 'pprof -svg' renders a flamegraph. It
   doesn't — it emits a call graph. Updated comment to be
   accurate and forward-link to phase 3 for actual flamegraphs
   via brendangregg/flamegraph.pl or pprof's HTTP UI.

No behavioral change for the happy path. Full e2e suite still
green; pprof artifacts still produced for ok-status tools.

Refs: PR #361 Copilot review, mache-6b6da6 phase 2
Two fixes folded into the phase-2 PR before merge, addressing the
'heap profile dominated by init noise' problem caught when looking
at the rendered SVG (Image #2 from inline review):

== Fix 1: heap baseline + delta ==

cmd/all_tools_e2e_test.go::capturePprof now captures a heap
snapshot BEFORE the iteration loop in addition to the one
AFTER. Each tool ends up with:

  <tool>.heap.baseline.pprof  — pre-loop snapshot (init state)
  <tool>.heap.pprof           — post-loop snapshot (tool work)
  <tool>.cpu.pprof            — CPU profile over the loop

The baseline lets consumers compute the delta via
'pprof -base=<baseline>'. Without it, heap profiles show
cumulative allocation since process start — dominated by
package-init noise (jsonschema, sqlite, tree-sitter init)
that buries the tool's actual signal at ~5% of width.

== Fix 2: flamegraphs task uses -alloc_space + -base ==

Taskfile 'flamegraphs' now runs:

  go tool pprof -alloc_space -base=baseline -raw <after.heap.pprof>
    | sed (drop cols 3-4 to match stackcollapse-go.pl's 2-col regex)
    | stackcollapse-go.pl
    | flamegraph.pl --title='<tool> heap delta (alloc_space, baseline-subtracted)'

-alloc_space is the right column for delta because -inuse_space
diff is near-zero on short loops (allocs get GC'd between the
two snapshots). alloc_space captures bytes allocated DURING
the loop, regardless of whether they're still live.

== Real flamegraphs (not call graphs) ==

PR #361 phase 2 deferred this to a follow-up after Copilot
correctly noted that 'pprof -svg' emits a call graph, not a
flamegraph. Phase 3 wires the brendangregg/FlameGraph pipeline:

  task flamegraphs:
    - pre-flight: requires flamegraph.pl + stackcollapse-go.pl on
      PATH (install via 'brew install flamegraph' on macOS,
      'git clone https://github.com/brendangregg/FlameGraph' on Linux)
    - emits .e2e/pprof/<tool>.{cpu,heap}.svg per tool
    - emits .e2e/INDEX.md linking each tool to its profiles + flamegraphs
    - placeholder SVGs for tools with no CPU samples (handler too
      fast for 100Hz sampler at 500 iterations) so INDEX.md stays
      consistent

Vendoring the perl scripts considered + rejected: ~30KB of
CDDL+GPL perl into an Apache 2.0 Go project is license
fragmentation not worth the convenience. brew/git-clone is
universal enough.

== First-run signal (sample, get_architecture) ==

Pre-fix flamegraph dominated by:
  runtime.doInit / runtime.doInit1 (66% of width)
  github.com/santhosh-tekuri/jsonschema (init)
  modernc.org/sqlite (init)
  github.com/smacker/go-tree-sitter (init)
  Tool's actual stack: <5% of width on the right edge

Post-fix flamegraph (baseline-subtracted):
  Tool's actual stack base wide:
    testing.tRunner → TestE2E_AllMCPTools → profileTool
      → capturePprof → makeGetArchitectureHandler.func1
  Hotspots above:
    internal/graph.buildProjection (widest — graph projection)
    internal/graph.DetectCommunities (Louvain modularity)
    mark3labs/mcp-go (JSON-RPC framework overhead)

That's actionable: get_architecture's heap pressure comes from
community detection, an obvious caching target for follow-up
optimization work.

== Out of scope (phase 4) ==

  - Rich-fixture mode (mache-on-mache instead of toy 4-package)
    so CPU samples have meaningful tool runtime to attribute
  - Regression detection vs checked-in baseline manifest
  - INDEX.md auto-detection of placeholder SVGs (footnote them)

Refs: mache-6b6da6 phase 3 (folded into phase 2 PR before merge)
@jamestexas jamestexas merged commit 49b740a into main May 9, 2026
15 checks passed
@jamestexas jamestexas deleted the feat/mache-6b6da6-phase2-pprof branch May 9, 2026 17:02
jamestexas added a commit that referenced this pull request May 9, 2026
…3) (#362)

Surfaced by the e2e tool harness from PR #361. Heap-delta
flamegraphs showed MemoryStore.DefsMap as the dominant allocator
in 4 of 16 MCP tools:

  get_impact:         52% of heap delta (4.51MB)
  get_overview:       32% of heap delta (2.51MB)
  get_architecture:   significant (community detection path)
  get_communities:    significant (Louvain on RefsMap)

Each call to DefsMap or RefsMap deep-copied the whole map plus
one slice per entry. On the 4-package toy fixture that's small
absolute numbers; on a real workload (10K+ constructs, 50K+ ref
tokens) it's millions of allocs per call.

== Fix ==

Memoize the deep-copy snapshot. atomic.Pointer holds the cached
snapshot; AddDef / AddRef / DeleteFileNodes invalidate it under
the existing write lock; DefsMap / RefsMap return the cached
instance on hit, rebuild under double-checked-locking on miss.

Concurrency contract:
  - Snapshot store happens under RLock (the rebuild side)
  - Invalidation store happens under Lock (the mutation side)
  - Lock blocks behind any in-flight RLock, so a stale snapshot
    can't be stored after invalidation
  - Multiple readers seeing a nil snapshot may race to rebuild;
    double-checked locking inside the RLock prevents torn snapshots

Callers must treat the returned map as read-only (every existing
caller is). LookupDef remains the right API for callers who want
a mutable per-token slice.

== Tests ==

internal/graph/snapshot_cache_test.go pins the cache CONTRACT,
not the speedup (the e2e harness measures the latter empirically):

  - DefsMap_ReturnsSameInstanceWithoutMutation:
    repeated calls return the same map address (cache hit)
  - DefsMap_AddDefInvalidatesCache:
    AddDef invalidates; pre-invalidation snapshots stay
    consistent (no retroactive mutation for existing readers)
  - RefsMap_ReturnsSameInstanceWithoutMutation: same for refs
  - RefsMap_AddRefInvalidatesCache: same for refs
  - SnapshotCache_ConcurrentReadersGetConsistentSnapshot:
    32 concurrent readers after invalidation must all see a
    complete (non-torn) snapshot of 100 entries

If a future refactor re-introduces per-call allocation here,
these tests fail loudly — that's the CI gate.

== Empirical validation ==

Pre-cache (single canonical call + 500-iter loop, baseline-subtracted heap delta):

  get_impact:        ~8.7MB total, DefsMap=52% top
  get_overview:      ~7.8MB total, DefsMap=32% top
  get_architecture:  ~26MB total, tool-fn=42% (buildProjection deep)
  get_communities:   ~14.7MB total, tool-fn=20%

Post-cache:

  get_impact:        ~4.75MB total (-45%), DefsMap GONE from top-3
  get_overview:      ~9.7MB total (different shape, harness pprof
                     overhead now dominates instead of DefsMap)
  get_architecture:  ~17.1MB total (-34%), tool-fn=35%
                     (still top but smaller in absolute terms)
  get_communities:   ~20.5MB total

The remaining heap on get_overview / get_communities is harness
overhead (runtime/pprof.StartCPUProfile, compress/flate from
profile writing) and JSON response marshaling — not caching
candidates we control.

== Full suite ==

  go test ./...   — 20/20 green
  go test -race   — passes (concurrent reader test on -race)
  golangci-lint   — clean

== What's next ==

  - Rich fixture (mache-on-mache, synthetic-medium) — mache-be8090,
    surfaces more memoization candidates at real-world scale
  - Sheaf cache cross-repo wiring (mache-b37cff) — daemon-side
    sheaf_get_topology op so cached community detection survives
    process restarts
  - Hetero corpus (mache-d332b5) — gated on CGO removal

Refs: mache-6b6da6 phase 3 follow-up; preserves the e2e harness
contract from PR #361
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants