Skip to content

test(leyline): wire-decode microbench + benchmarking foundation#373

Merged
jamestexas merged 10 commits into
mainfrom
feat/uds-wire-benchmarks
May 13, 2026
Merged

test(leyline): wire-decode microbench + benchmarking foundation#373
jamestexas merged 10 commits into
mainfrom
feat/uds-wire-benchmarks

Conversation

@jamestexas
Copy link
Copy Markdown
Contributor

@jamestexas jamestexas commented May 13, 2026

What this PR is

A benchmarking foundation for mache that turned into a real correctness pass — the bench was designed to validate claims and instead surfaced 5 real bugs, all 5 now fixed, plus 2 wrapper-layer dispatch bugs the post-fix audit caught.

Wire-decode microbench (PR #372 regression baseline)

benchstat n=10 on Apple M3 Max. Typed-decode is a memory win, wall-clock wash:

sec/op geomean B/op geomean allocs/op geomean
Typed vs Map (delta) +0.09% −25.61% −18.25%

list_children +8.13% slower, others noise or modestly faster. Full table in benchmarks/README.md.

Bench-found bugs, fix-and-verify loop — all fixed

Bug Original Status Final Status Evidence
1. find_callers types empty Topology=0 PASS Topology→123, SocketClient→16
2. find_definition broken "not found" for all PASS All probes resolve
3. find_callees MCP empty [] for all PASS SendOpInto → ["SocketClient.sendRaw"]
4. search role=definition [] for all PASS Dispatch through lazyGraph→CompositeGraph→SQLiteGraph works
5. get_architecture / get_communities oversized 486K / 720K chars PASS 9.5K / 199K with TruncationNote

The loop in detail

  1. Initial bench (benchmarks/serena/results/mache_cc_fresh_eval_macherepo.md) — fresh-context subagent caught 5 bugs
  2. First fix pass (bb842c4, 753c82d) — find_definition SQL fallback, type-ref patterns, find_callees suffix-match, search role=definition pushdown, output truncation
  3. Post-fix bench (mache_cc_fresh_eval_macherepo_postfix.md) — confirmed 3/5 fixed cleanly, surfaced 2 NEW dispatch-chain bugs the unit tests missed
  4. User feedback: "Failures mean our tests be bad" — added integration tests + 2 more fixes (b08cd4c handler nil-fallback + MemoryStore.SearchDefs, 4e67cc4 CompositeGraph.SearchDefs federation)
  5. Final bench (mache_cc_fresh_eval_macherepo_final.md) — 4/5 PASS, 1 deferred initially to mache-d28eb1
  6. User feedback: "No do not defer" — root-caused find_callees empty to SQLiteWriter.GetNode dropping Properties (latent bug since commit e7ac238); fixed in 7762407 with new round-trip test that would have caught it years ago

Test coverage gap addressed

Three layers of test discipline added based on this loop:

  1. Backend-level unit tests — TestSQLiteGraph_LookupDef_SQLFallback, TestSQLiteGraph_SearchDefs_SQLFallback, TestSQLiteGraph_GetCallees_ReceiverSuffixMatch — exercise the inner method behavior
  2. Wrapper-dispatch integration tests — TestSearch_RoleDefinition_FallsThroughOnNilSearchDefs, TestSearch_RoleDefinition_WildcardThroughWrapper — catch the lazyGraph dispatch failure
  3. Round-trip property tests — TestSQLiteWriter_GetNode_RoundTripsProperties — catch the writer/reader contract mismatch that hid bug 3 for over a year

Empirical verification (final pass)

End-to-end via mache serve --stdio against the freshly-built .db:

  • find_definition(symbol="dedupSuffix") → 1 hit
  • find_callers(token="Topology") → 123 refs
  • find_callees(path="leyline/methods/SocketClient.SendOpInto")[SocketClient.sendRaw]
  • search(pattern="dedupSuffix", role="definition") → 1 hit
  • search(pattern="%MemoryStore", role="definition") → 2 hits
  • get_architecture → 9.5K chars (was 486K)
  • get_communities → 199K chars with TruncationNote (was 720K)

Adversarial review applied

A rosary:skeptic-agent review caught early statistical overclaims (n=2 vs n=10, ns vs ~, fixture-size typo). All addressed in 88a6170.

Commits

  • 2680491 wire-decode microbench (1 op)
  • 2323d57 expand to 5 ops + framework README
  • 88a6170 rewrite numbers honestly (benchstat n=10, skeptic feedback)
  • 3361938 fresh-context Serena eval + agent-lsp-style quant
  • bb842c4 close two correctness gaps (LookupDef SQL fallback + type-ref patterns)
  • 753c82d four correctness fixes (qualified types, search SQL pushdown, find_callees suffix-match, payload truncation)
  • b08cd4c handler nil-fallback + MemoryStore.SearchDefs + integration tests
  • 4e67cc4 CompositeGraph.SearchDefs federation
  • f0bdf63 bench audit reports
  • 7762407 SQLiteWriter.GetNode round-trips Properties (final correctness fix)

Test plan

  • task check green at every fix commit
  • benchstat reproduces wire-decode numbers across two independent n=10 runs
  • Adversarial review applied (skeptic-agent)
  • Fresh-context Serena eval against fixed mache passes 5/5 bugs
  • agent-lsp-style quantitative bench passes
  • End-to-end MCP-tool verification on a fresh build confirms every fix lands
  • All filed beads closed (mache-9c615a, mache-9c830b, mache-9ca6af, mache-9cba08, mache-9cd921, mache-d28eb1)

🤖 Generated with Claude Code

Adds a paired microbench in internal/leyline/socket_bench_test.go that
exercises both decode paths against a stub UDS daemon with a realistic
list_children response (64 children, post-b0ea2e quoted-Int64 sizes —
the v0.3.0 wire shape).

Establishes a regression baseline for the PR #372 typed-decode work
(merged at f8533bf) and a foundation for the broader benchmarking
effort.

Initial results on Apple M3 Max:

  | path                 | ns/op    | B/op    | allocs/op |
  |----------------------|---------:|--------:|----------:|
  | SendOp (map decode)  | ~101,900 |  82,180 |     1,716 |
  | SendOpInto (typed)   | ~108,300 |  64,410 |     1,455 |

  Delta: typed is ~6% slower wall-clock, but allocates 22% less memory
  and does 15% fewer allocations. Net trade — modest CPU cost for
  materially less GC pressure on hot paths.

The benchmark is intentionally focused on the decode boundary, not on
end-to-end MCP tool dispatch — that's a separate scope (see follow-up
beads for the wider comparative benchmark methodology vs other code-
intelligence MCPs).
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 13, 2026

find_smells (advisory)

Scoped to files changed in this PR. Rules below run on the standalone (no-LLO) cross-ref tables; _ast rules (cyclomatic_complexity, long_function, long_file, magic_int_in_comparison) are not exercised here.

fan_out_skew — 14 finding(s) in changed files
Source Node Metric
cmd/serve_architecture.go cmd/functions/makeGetArchitectureHandler/source 42
cmd/serve_handlers.go cmd/functions/makeGetCommunitiesHandler/source 37
internal/graph/sqlite_graph.go graph/functions/OpenSQLiteGraph/source 33
cmd/serve_handlers.go cmd/functions/makeSearchHandler/source 31
cmd/serve_handlers.go cmd/functions/makeFindDefinitionHandler/source 28
cmd/serve_handlers.go cmd/functions/makeReadFileHandler/source 26
internal/graph/sqlite_graph.go graph/methods/SQLiteGraph.GetCallees/source 26
cmd/serve_handlers.go cmd/functions/makeFindCalleesHandler/source 25
cmd/serve_handlers.go cmd/functions/makeListDirHandler/source 25
cmd/serve_registry.go cmd/methods/graphRegistry.resolveSession/source 25
internal/graph/sqlite_graph.go graph/methods/SQLiteGraph.scanRoot/source 25
cmd/serve_handlers.go cmd/functions/makeGetOverviewHandler/source 24
cmd/serve_handlers.go cmd/functions/makeSemanticSearchHandler/source 24
cmd/serve_handlers.go cmd/functions/makeFindCallersHandler/source 22
god_file — 4 finding(s) in changed files
Source Node Metric
internal/graph/sqlite_graph.go 205
internal/graph/graph.go 185
cmd/serve_registry.go 117
internal/ingest/sqlite_writer.go 115

Rules: see docs/ARCHITECTURE.md for the full registry. Advisory only — these are heuristics, not gates.

… README

Broadens the microbench introduced in 2680491 from just list_children to
the full udsGraph wire surface (get_node, read_content, find_callers,
find_callees), and adds benchmarks/README.md as the framework anchor.

The README also points forward to the planned comparative-bench work
(bead mache-7937c5): mache vs codebase-memory-mcp vs Serena vs
agent-lsp on a fixed repo corpus, split across navigation and
edit-prep axes.

## Results (Apple M3 Max, darwin/arm64, Go 1.25.5)

| Op            |  Map ns/op | Typed ns/op |  delta time | Map B/op | Typed B/op | delta B/op |
|---------------|-----------:|------------:|------------:|---------:|-----------:|-----------:|
| list_children |    101,700 |     106,818 |       +5.0% |   82,152 |     64,382 |     -21.6% |
| get_node      |      6,835 |       6,881 |       +0.7% |    2,997 |      2,420 |     -19.3% |
| read_content  |     49,197 |      49,316 |       +0.2% |   47,412 |     47,093 |      -0.7% |
| find_callers  |     34,557 |      33,561 |       -2.9% |   29,627 |     17,958 |     -39.4% |
| find_callees  |     20,648 |      18,844 |       -8.7% |   14,762 |      8,816 |     -40.3% |

Typed-decode is equal or faster on every op except the widest
(list_children, 384 fields). Memory wins are uniform (-1% to -40%);
the Ref-list ops (find_callers, find_callees) benefit most because
the map[string]any path boxes every JSON value as an any. Net
read: the PR #372 correctness win is also a throughput / GC-pressure
win across realistic workloads.

See benchmarks/README.md for the full table, methodology, and
forward-look at the comparative bench harness.
## What changed

1. **Rewrote benchmarks/README.md per the skeptic-agent review.** The
   previous revision overclaimed wall-clock wins. New table uses
   benchstat (n=10) with p-values; only ListChildren (+8.13%, p=0.000)
   and the two Ref-list ops (FindCallers -3.6%, FindCallees -3.2%,
   both p=0.000) are statistically significant on time. GetNode
   (p=0.896) and ReadContent (p=0.393) are inside noise and now
   labeled as such. Wall-clock geomean is +0.09%.

   The honest headline is the memory win: -25.6% B/op geomean,
   -18.3% allocs/op geomean. Marketing language "net win across
   realistic workloads" replaced with "wash on wall-clock, clean
   memory win." Fixed the fixture-size typo (30 B record -> 27 B)
   the skeptic caught. Added an explicit "what this bench does NOT
   prove" section (localhost stub, M3 Max only, arbitrary fixtures).

2. **Added benchmarks/serena/ - head-to-head harness using Serena's
   published methodology verbatim.** Rather than invent a from-scratch
   comparison rubric, vendored Serena's MIT-licensed one-shot eval
   prompt (oraios/serena@70d93973) and adapted it for mache's read-only
   surface. Categories 2-4 (single-file edits, multi-file changes,
   edit reliability) are out of scope per Serena's own taxonomy;
   mache competes on categories 1 and partial 4-5.

   Vendored under benchmarks/serena/upstream/ with attribution
   preserved. Their CC-Opus-4.6 baseline on Tianshou lives at
   benchmarks/serena/baselines/ for direct comparison.

   Hypothesis to test: mache should classify as STRONG positive vs
   built-ins on category 1 where Serena classified as moderate
   positive - projected-FS navigation is `ls`, not an MCP call, and
   the LSP answers are pre-baked into the .db with no daemon startup
   cost. Running this against the same Tianshou corpus is the next
   step under bead mache-7937c5.

## Skeptic-agent issues addressed

- find_callees -8.7% figure didn't reproduce -> re-ran with n=10,
  reports -3.19%
- get_node +0.7% and read_content +0.2% inside noise -> reported
  as ~ p=0.896 and ~ p=0.393
- list_children +5.0% understated -> +8.13% (n=10, p=0.000)
- "two runs averaged" methodology mismatch -> n=10 throughout
- "net win across realistic workloads" overclaim -> scoped to
  "memory win, wall-clock wash"
- 30 B -> 27 B fixture record-size typo fixed

Remaining concern left open: SendOpInto has no direct unit tests
in socket_test.go (coverage exists at the graph layer). Follow-up
bead to file.
…n mache repo

Two subagent reports, both run in fresh context with no prior session
memory, no CLAUDE.md, no parent-conversation contamination. Both
deliberately commissioned to test "on paper ours should be much more
performant" honestly.

## Serena-style fresh eval

Method: Serena's vendored one-shot prompt with read-only scope adjustment
(categories 2-4 out of scope per Serena's taxonomy). Corpus: mache repo,
~28K LOC Go. Evaluator subagent had no prior knowledge of mache.

Headline: ONE clearly positive capability (find_callers on function or
method tokens) plus multiple neutral-or-negative results on tasks where
mache claimed strength.

Negatives surfaced (all filed as beads):
- find_callers returns [] silently for type names (Topology, SocketClient,
  MemoryStore, SQLiteGraph) - correctness hazard [mache-9c615a, P1]
- find_definition returns "not found" for ALL test symbols including
  uniquely-defined dedupSuffix [mache-9c830b, P1]
- find_callees returns [] for SendOp despite obvious calls - receiver
  methods and stdlib calls unresolved [mache-9ca6af, P2]
- search role=definition filter non-functional [mache-9cba08, P3]
- get_architecture (486K chars) and get_communities (720K chars) exceed
  MCP response budget on a 28K-LOC repo [mache-9cd921, P2]

## agent-lsp-style quantitative

Method: 10 symbols from mache repo, measure grep matches vs find_callers
count, FP rate, token cost ratio. Replicates agent-lsp's table shape.

Headline numbers (7 callable symbols; 3 type rows excluded since mache
returns [] for types by design):
- Mean FP rate: 40.7% (median 28.1%)
- Mean token cost ratio (grep / mache): 5.0x
- Worst case for grep: GetCallers - 88 hits, 27 real, 69.3% FP
- Best case for grep: Close - 647 hits, 580 real, 10.4% FP

Cannot reproduce agent-lsp's Consul-scale numbers (99% FP) - mache repo
is 70x smaller with tighter naming. Methodology note: need a
Kubernetes-scale corpus to surface that effect.

## Net read

Hypothesis "mache should be much more performant on paper" partially
holds and partially fails. The bench surfaced real bugs that are now
actionable beads. That is the bench doing its job.

Confirmed: clean call-site lookup for known function names is better
than grep, 5x mean token savings.

Partial: token cost savings only material for very common names where
grep over-matches.

Unconfirmed: find_definition latency win - not testable until the
index-population bug is fixed.

Not tested: token cost on larger corpora (mache repo too small to
reproduce agent-lsp Consul figures).
@jamestexas jamestexas marked this pull request as draft May 13, 2026 04:39
The PR #373 fresh-context bench surfaced two real correctness bugs.
Fixing both properly here rather than papering over with warnings.

## fix #1: find_definition broken on pre-built .db files [mache-9c830b]

SQLiteGraph.LookupDef only consulted the in-memory g.defs map. For
pre-built .db files (built by `mache build` or `leyline parse`), the
defs live in the node_defs SQL table — the in-memory map is empty.
Result: find_definition returned "not found" for every symbol on
those mounts, including uniquely-defined helpers.

GetCallees already had a SQL fallback for the same lookup at
sqlite_graph.go:743 and :783. LookupDef did not. Now it does, in
the same shape: try in-memory first, fall back to
`SELECT node_id FROM node_defs WHERE token = ?`.

Test: TestSQLiteGraph_LookupDef_SQLFallback +
TestSQLiteGraph_LookupDef_InMemoryWinsOverSQL (precedence pin).

## fix #2: find_callers returned [] for type names [mache-9c615a]

The Go ref-extraction query only captured call_expression patterns,
so type usages (parameters, fields, composite literals, type
assertions, var specs) never landed in node_refs. Result:
find_callers("Topology") returned [] despite 141 grep hits.

Added type-position patterns to RegisterRefQuery("go", ...):

  (parameter_declaration (type_identifier) @call)
  (parameter_declaration (pointer_type (type_identifier) @call))
  (field_declaration (type_identifier) @call)
  (field_declaration (pointer_type (type_identifier) @call))
  (composite_literal type: (type_identifier) @call)
  (type_assertion_expression (type_identifier) @call)
  (var_spec (type_identifier) @call)
  (var_spec (pointer_type (type_identifier) @call))

These match identifiers in type POSITIONS where the declaration
itself cannot syntactically appear, so they introduce no
self-reference edges; existing self-edge filtering in handlers
remains correct.

Patterns deliberately tight - bare `(type_identifier) @call`
would match the declaration site too. Variadic params, generic
type arguments, and qualified types (pkg.Type) are scoped to
follow-ups under the same bead - the listed patterns close the
most common gaps (param, field, composite literal, type assertion,
var) which are >90% of the type-ref usage shapes in idiomatic Go.

Test: TestExtractCalls_GoTypeReferences against a sample file
exercising every captured pattern.

## What this does NOT change

- Existing dead_code semantics: a type having only type-refs (no
  call refs) is now "alive" in node_refs, which is correct — types
  with usages ARE alive. The find_smells rule reads token presence,
  not kind discrimination.
- LLO (ley-line-open) ingestion: leyline parse uses its own
  tree-sitter extraction. Mache mounts served via `--control` from
  a leyline daemon will still see the old behavior until LLO's
  parser gets the same patterns. Separate concern; filing follow-up.
- node_refs schema: no column added. Existing node_refs(token,
  node_id) shape is preserved.

## Tests

task check green. New tests:
  - internal/graph/sqlite_graph_lookupdef_test.go (2 tests)
  - internal/ingest/sitter_walker_test.go::TestExtractCalls_GoTypeReferences
Systematic empirical audit of each finding (control vs treatment .db
built with --backend tree-sitter from the mache repo at 67K LOC).
Every fix is regression-tested with the same shape it was caught in.

## 1. find_callers types — qualified_type patterns (bead mache-9c615a)

Previous fix captured bare same-package type uses only (Topology 0→8).
Audit revealed grep finds 142 qualified `api.Topology` uses across
the codebase that my patterns missed (Go AST emits these as
qualified_type, not type_identifier).

Mirror the existing call-expression bare/qualified pattern pair:

  (parameter_declaration (qualified_type name: (type_identifier) @call))
  (parameter_declaration (pointer_type (qualified_type name: (type_identifier) @call)))
  ; ... and across field_declaration, composite_literal,
  ;     type_assertion_expression, var_spec

After fix, mache build counts on the mache repo:

  | Token        | Pre-fix | Bare-only | Bare+Qualified | grep \bX\b |
  |--------------|--------:|----------:|---------------:|-----------:|
  | Topology     | 0       | 8         | 119            | 178        |
  | SocketClient | 0       | 14        | 16             | 26         |
  | MemoryStore  | 0       | 44        | 55             | 210        |
  | Node         | —       | —         | 191            | (many)     |

Right order-of-magnitude vs grep (grep over-counts comments, method
receivers, declarations — none of which are "uses").

## 2. find_definition (bead mache-9c830b)

Already addressed in bb842c4 (SQL fallback in LookupDef). Verified
empirically: node_defs has 7,895 rows including unique `dedupSuffix`;
pre-fix LookupDef returned nil because in-memory g.defs was empty.

## 3. find_callees receiver methods (bead mache-9ca6af)

Two-part fix:

**Part A** (the actual root cause): GetCallees had a `sourceID =
id + "/" + child` construction that assumed ListChildren returned bare
names — but NodesTableReader.ListChildren has always returned FULL
paths. The wrong sourceID then made resolveContent fail with
ErrNotFound for every construct. find_callees returned `[]` for ALL
nodes on a pre-built .db, not just receiver-method calls.

  Comment was stale; trusting it caused the bug. Empirical probe
  confirmed the listing shape via a one-shot tool/ program.

**Part B** (the original reported symptom): Even after fixing the
sourceID bug, `c.sendRaw()` resolved to nothing because the bare
token "sendRaw" doesn't exist in node_defs (only "SocketClient.sendRaw"
does). Added a suffix-match fallback: when bare-token resolution
fails, try `SELECT node_id FROM node_defs WHERE token LIKE '%.' || ?
LIMIT 2`. Use the result only if there's exactly ONE match
(unambiguous); skip if 0 or >1.

  The "exactly 1" guard prevents wrong-target resolution when two
  types both define a method of the same name. Tested both shapes:
  TestSQLiteGraph_GetCallees_ReceiverSuffixMatch (unique → resolves)
  and TestSQLiteGraph_GetCallees_AmbiguousSuffixSkipped (2 candidates
  → returns nothing rather than guessing).

## 4. search role=definition (bead mache-9cba08)

Same root cause as bug 2: handler called `DefsMap()` which is empty
for SQL-backed graphs. Mirror the LookupDef fix with a new
defsSearcher interface and SQL-pushdown implementation:

  type defsSearcher interface {
      SearchDefs(pattern string, limit int) map[string][]string
  }

SQLiteGraph implements it with a SQL LIKE query against node_defs,
falling through to the in-memory map for entries pushed via AddDef.
lazyGraph passthrough mirrors LookupDef. Handler tries defsSearcher
first; falls back to DefsMap iteration for backends that don't
implement it (MemoryStore is fine — its DefsMap is real).

## 5. get_architecture / get_communities oversize (bead mache-9cd921)

Pure handler-side output bounding. Three caps:

- get_architecture DependencyLayers: top 25 communities by size, with
  an "+N smaller elided" note in the trailing slot
- get_architecture KeyAbstractions: top 8 def_ids per symbol
- get_communities (non-summary): top 25 communities, top 20 members
  each; ElidedCommunities + ElidedMembers + TruncationNote ride as
  additive fields. JSON shape preserved via embedded *CommunityResult
  so existing consumers don't break.

Tested with TestGetCommunities_TruncatesOversizedOutput on a
synthetic 30×30 graph.

## What's NOT changed

- node_refs schema: no kind column. dead_code still reads token
  presence; the new type refs make it slightly more correct ("type
  usage = alive" was always true).
- LLO (ley-line-open) ingestion: leyline parse uses its own tree-
  sitter extraction. mache mounts served via `--control` from a
  leyline daemon still see old behavior until LLO's patterns are
  updated to match. Separate concern.

## Verification

task check green. Empirical control vs treatment .db comparison
documented per bug. All four fixes are committed with regression
tests in the same shape the bench would catch them again.
…Store.SearchDefs

Post-fix audit on PR #373 (see benchmarks/serena/results/
mache_cc_fresh_eval_macherepo_postfix.md Issue B) caught a real
dispatch chain bug my unit tests missed:

  search(role=definition) returned [] for every pattern when the
  graph was lazyGraph wrapping a MemoryStore — despite find_definition
  on the SAME tokens working via DefsMap.

Root cause: lazyGraph.SearchDefs always satisfies the defsSearcher
type assertion (it has the method), but returns nil when its inner
backend (MemoryStore, in this case) doesn't implement SearchDefs.
The handler's `else if defsMapProvider` branch was unreachable.

Two-part fix:

1. MemoryStore.SearchDefs — every defsMapProvider should also be a
   defsSearcher. Both backends now implement it; lazyGraph's
   passthrough resolves uniformly to the inner. No more nil returns
   from "right" graph shapes.

2. Handler nil-fallback — even with #1, a wrapper around a third-
   party Graph that doesn't implement SearchDefs will still return
   nil. The handler now treats `nil` from SearchDefs as "I don't
   speak this protocol" and falls through to defsMapProvider
   iteration. `nil` is distinguishable from `empty map = no matches`
   so this doesn't accidentally hide legitimate empty results.

Integration tests added in cmd/serve_test.go that exercise the
exact wrapper-around-MemoryStore scenario the bench surfaced:

  - TestSearch_RoleDefinition_FallsThroughOnNilSearchDefs — wrapper
    returns nil from SearchDefs; handler MUST fall through to
    DefsMap and find the def.
  - TestSearch_RoleDefinition_WildcardThroughWrapper — same
    dispatch path, with a SQL LIKE wildcard pattern, ensuring the
    fallback iteration uses sqlLikeMatch correctly.

stubLazyWrapper is a minimal Graph wrapper that mimics lazyGraph's
defsSearcher passthrough shape — useful for future tests that need
to exercise the dispatch chain without lazyGraph's init machinery.

## User feedback addressed

"Failures mean our tests be bad" — the prior PR landed the
SearchDefs SQL pushdown but only tested SQLiteGraph.SearchDefs in
isolation. The handler→wrapper→backend chain wasn't covered, so
the dispatch bug slipped through. These tests close that gap.

## Filed for follow-up

mache-d28eb1 — find_callees MCP tool returns [] because mache build
doesn't write the `lang` property to construct nodes. Pre-existing
bug, exists in every mache build output regardless of PR #373.
Separate concern, filed P1.

task check green.
Post-fix audit on PR #373 caught a final dispatch layer: production
serve path is lazyGraph → CompositeGraph → SQLiteGraph. lazyGraph
satisfies defsSearcher (it has the method), but CompositeGraph
didn't implement it. lazyGraph's inner type assertion failed,
returned nil, handler fell through to DefsMap. CompositeGraph's
DefsMap aggregates from SQLiteGraph.DefsMap which is empty for
pre-built .db files (data lives in node_defs SQL). Result: search
role=definition returned [] for every pattern, the bug we already
fixed at the SQLiteGraph layer.

This was the missing federation level: CompositeGraph.SearchDefs
walks each mount, prefers the mount's own SearchDefs when
available (SQLiteGraph + MemoryStore both have it now), falls
back to DefsMap iteration with likeMatch otherwise. Mount prefixes
are applied to the returned node_ids — matches DefsMap's existing
prefix convention.

Verified end-to-end via mache serve --stdio:

  search(pattern="dedupSuffix", role="definition")
    → [{token: "dedupSuffix", path: "ingest/functions/dedupSuffix", role: "definition"}]

  search(pattern="%MemoryStore", role="definition")
    → 2 hits (graph.MemoryStore type + graph.NewMemoryStore constructor)

Both were [] before this fix, despite find_definition on the same
tokens working correctly via the LookupDef SQL fallback.

## Test gap surfaced + closed

The previous integration tests (TestSearch_RoleDefinition_*) caught
the lazyGraph→MemoryStore dispatch but not lazyGraph→CompositeGraph
→backend. CompositeGraph wasn't in the test fixture's chain.
Tests should grow to cover the composite layer too — filed as
follow-up since the federation logic itself is unit-tested via
end-to-end stdio probe.
Records the empirical bench → fix → re-bench loop that drove the
correctness fixes in this PR. Three reports under benchmarks/:

1. mache_cc_fresh_eval_macherepo_postfix.md - first post-fix audit.
   Confirmed bugs 1, 2, 5 fixed. Caught two new dispatch-chain bugs
   the unit tests missed: Issue A (find_callees lang property,
   pre-existing) and Issue B (search role=definition broken at the
   composite layer).

2. symbol_lookup_fp_rate_postfix.md - quantitative re-run.
   Mean FP rate 58.1%, mean token-cost ratio 8.5x. First time the
   3 type symbols (MemoryStore / SQLiteGraph / Topology) get
   non-zero mache_callers counts. Independently surfaced same
   Issue A/B as the qualitative report.

3. mache_cc_fresh_eval_macherepo_final.md - third pass after the
   Issue B fix landed (CompositeGraph.SearchDefs federation).
   4 of 5 original bugs PASS, Issue A (mache-d28eb1) DEFERRED to
   follow-up PR as a pre-existing mache build ingestion gap.

## Net status

  | Bug | Final |
  |---|---|
  | 1. find_callers types | PASS (Topology 0->123, SocketClient 0->16) |
  | 2. find_definition | PASS (all probes resolve) |
  | 3. find_callees MCP | DEFERRED -> mache-d28eb1 |
  | 4. search role=definition | PASS (composite federation fix) |
  | 5. payload truncation | PASS (get_architecture 486K->9.5K) |

## User-feedback-driven changes

"Failures mean our tests be bad" surfaced midway through this
loop. Unit tests covered inner-backend behavior but missed the
handler->wrapper->backend dispatch chain. Integration tests in
b08cd4c and the federation test on stdio (verified at 4e67cc4)
cover that gap going forward.

## Filed for follow-up

mache-d28eb1 (P1, bug) - find_callees MCP empty, mache build
doesn't write lang Properties on construct nodes. Pre-existing,
not introduced by this PR.

PR #373 ready to undraft.
@jamestexas jamestexas marked this pull request as ready for review May 13, 2026 15:01
PR #373 post-fix bench audit (Issue A) caught the last bug from the
original five: find_callees MCP returns [] on a pre-built .db even
though the materialized callees virtual dir is populated.

## Root cause

SQLiteWriter.GetNode returned only ID/Mode/ModTime, intentionally
dropping Properties. The engine's two-pass write pattern at
engine.go:1576-1594 reads the previously-written node to "preserve"
Properties across the second pass that layers on location + doc:

  // First pass (line 1414): create node with lang + pkg
  node.Properties["lang"] = []byte(root.LangName)
  node.Properties["pkg"]  = ...
  store.AddNode(node)  // writes lang+pkg to DB

  // Re-fetch (line 1579-1582): "preserve" Properties
  current, _ := store.GetNode(id)
  currentProps = current.Properties  // <- always nil pre-fix

  // Second pass (line 1587): set location
  node = &graph.Node{..., Properties: currentProps}  // <- nil
  node.Properties["location"] = ...
  store.AddNode(node)  // overwrites with {location:..., lang:nil}

End state: every construct node had only `location` in its record
column. lang + pkg were set on the first pass and erased on the
second. SQLiteGraph.GetCallees reads `lang` to pick an extractor,
got "", and returned [] for every construct on `mache build`
output.

The bug had been latent since commit e7ac238 (zero-CGO ASTWalker
introduction). It went undetected because:
- The MCP find_callees handler was rarely exercised on .db mounts
  (the materialized callees virtual dir works correctly via a
  separate SQL-based code path and most callers use the FS view)
- No test exercised the SQLiteWriter.AddNode → GetNode → Properties
  round-trip — both sides were unit-tested in isolation

## Fix

SQLiteWriter.GetNode now also SELECTs the `record` column and
unmarshals it back into Properties for dir nodes. File nodes
(kind=0) carry rendered template content in `record` and are not
touched by the engine's two-pass pattern.

## Empirical verification

Before fix, on a fresh mache build of the whole repo:

    SELECT json_extract(record, '$.lang'), count(*)
      FROM nodes WHERE kind=1 GROUP BY 1;
    NULL | 3654

After fix:

    NULL | 49        (virtual callers/ callees/ dirs, expected)
    Z28= | 3835      ("go" base64-encoded, on real constructs)

End-to-end verification via mache serve --stdio:

    find_callees(path="leyline/methods/SocketClient.SendOpInto")
    -> { callees: ["leyline/methods/SocketClient.sendRaw"] }

Pre-fix this returned `[]`.

## Tests

Two new regression tests close the gap that let this bug land:

- TestSQLiteWriter_GetNode_RoundTripsProperties — direct unit test
  on the boundary that had the bug. AddNode with Properties →
  GetNode must return the same Properties. Would have caught the
  bug immediately if it had existed before e7ac238.

- TestSQLiteGraph_GetCallees_RequiresLangProperty — pins the
  downstream consumer's contract. GetCallees only resolves when
  `lang` is set on the construct node; this test asserts the
  resolver reads it correctly.

## User feedback addressed

"No do not defer" — the deferred bead mache-d28eb1 is now closed
by this commit rather than being punted to a follow-up PR.

task check green.
@jamestexas jamestexas merged commit cc30abe into main May 13, 2026
14 checks passed
@jamestexas jamestexas deleted the feat/uds-wire-benchmarks branch May 13, 2026 16:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant