
perf(llm): batch classifier + dead-code, bump concurrency 2->8 [WIP 1/5]#3

Merged
andreinknv merged 1 commit into wip-stress-base from pr1-perf-classifier-batching on May 1, 2026
Conversation

@andreinknv
Owner

Stress-test fixes — PR 1 of 5 (stacked).
Base: wip-stress-base (= df96c6c, the verified round-3 tip).
Stack: PR1 → PR2 → PR3 → PR4 → PR5. Each subsequent PR targets the previous one as base.

Commits

  • 9927549 perf: batch LLM classifier + dead-code, bump concurrency, N1 attribution hint

Summary

  • Concurrency 2→8 for non-bridge providers (openai-compat, anthropic-api). The hardcoded 2 was tuned for local proxies that serialise internally; cloud Haiku is dominated by network roundtrip and benefits from parallelism. claude-bridge stays at 4 (subprocess spawn cost).
  • Classifier batching — classifyAllRoles rewritten from one chat call per symbol to N-per-call (50 default) with bracket-aware JSON-array parsing. Same pattern applied to dead-code judging.
  • N1 attribution hint in formatImpact — when codegraph_impact resolves to >=2 sources, the merged per-file detail now points readers up to the per-source breakdown so they don't mistake the merged list for being source-attributed.

Test plan

  • npm test — full suite at 1113 / 13 skip / 0 fail across the entire 5-PR stack.

WIP — opened as draft for review.

…ion hint

Layered on top of the round-1+2+3+4 stress-test fixes. Four changes:

## N1 follow-up — attribution hint in formatImpact

When codegraph_impact resolves the symbol name to >=2 sources, the
existing per-source rollup (round-4 Gap 2) tells you which definition
contributes which file counts. The merged per-file *symbol detail* below
it still shows symbols cross-source — middleware/openai.go: Write,
EmbeddingsMiddleware, ... does not say which Encode definition each
symbol traces back to. Adds a one-line cross-reference under the merged
detail pointing readers up to the per-source breakdown so they do not
mistake the merged list for being source-attributed.

formatImpact now takes a multiSource: boolean flag, threaded through
from handleImpact based on perSource.length > 1.

## Concurrency bump — 2->8 for non-bridge providers

src/index.ts:1002 hardcoded chatConcurrency = 2 for openai-compat /
anthropic-api. The original comment claimed openai-compat servers
serialise internally so 2 is plenty — true for some local proxies but
not for cloud Anthropic Haiku, where per-call latency is dominated by
network roundtrip. Bumped to 8 (claude-bridge stays at 4 — subprocess
spawn cost is the bottleneck and this is what already worked).

DEFAULT_CONCURRENCY in classifier.ts, summarizer.ts, dead-code.ts
similarly bumped 2->8 in case anyone calls them directly without
overriding. dir-summarizer.ts left at 1 (large per-call output, serial
is fine).

## Classifier batching — 50 symbols per call, JSON list response

classifyAllRoles in src/llm/classifier.ts rewritten from one chat call
per symbol to N-per-call:

- New buildBatchPrompt sends the role list once + a numbered list of
  symbols, asks for a JSON array of {i, role} entries in the same order.
- New parseBatchResponse extracts the first balanced JSON array
  (string-aware bracket counting so brackets inside string literals do
  not fool the depth tracker), requires >=80% coverage, returns null
  on parse failure.
- Worker claims batchSize candidates at a time. On batch parse failure
  or LlmEndpointError, falls back to classifyOne (per-item) for that
  batch only — an unrelated bad symbol cannot lose the rest.
- buildSinglePrompt factored out for fallback; ROLE_LIST_TEXT shared
  between both prompt builders.
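The string-aware bracket counting can be sketched as below. This is an illustrative reconstruction, not the real parseBatchResponse (which additionally enforces the >=80% coverage check); the function name is invented:

```typescript
// Hypothetical sketch: extract the first balanced JSON array from LLM prose.
// Brackets inside string literals must not fool the depth tracker.
export function extractFirstJsonArray(text: string): unknown[] | null {
  const start = text.indexOf("[");
  if (start < 0) return null;
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = start; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      if (escaped) escaped = false; // char after a backslash is literal
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
    } else if (ch === '"') {
      inString = true; // brackets inside strings are ignored from here
    } else if (ch === "[") {
      depth++;
    } else if (ch === "]") {
      depth--;
      if (depth === 0) {
        try {
          return JSON.parse(text.slice(start, i + 1)) as unknown[];
        } catch {
          return null; // balanced but not valid JSON
        }
      }
    }
  }
  return null; // array never closed
}
```

A caller would treat a null return as batch parse failure and fall back to per-item classification.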

Net effect on a --force re-index of 722 symbols: ~48x fewer chat calls
(15 batches vs 722 calls), ~3x fewer input tokens (role list sent 15x
instead of 722x), and ~30x lower wall-clock latency at the new
concurrency.

## Dead-code batching — 8 candidates per call

Same pattern as classifier in src/llm/dead-code.ts, smaller batch (8)
because verdict output is bigger (~80 tokens per entry). New
parseBatchJudge mirrors parseBatchResponse string-aware bracket
counting (matters here because reason fields can contain brackets).
Synthetic uncertain entries are created for items missing from a parsed
batch (when the batch passes 80% coverage but a row is dropped) — they
also increment judged, keeping the judged === results.length invariant
that the test asserts.

useAskModel: true preserved — dead-code judge still goes to Sonnet,
not Haiku, since the verdict is high-stakes.

## Tests

- __tests__/llm-tiers.test.ts fake-server stub now detects batched
  prompts via the structural marker "Symbols (zero-indexed):"
  (anchored on structure, not on reply-instruction phrasing which
  could drift independently in either file). Counts numbered symbol
  lines and emits a same-sized JSON array of stubbed verdicts/roles.
- 2 new unit tests:
  - parseBatchResponse extracts roles, tolerates surrounding prose,
    rejects under-coverage — covers happy path, prose-wrapped JSON,
    title-case role coercion via parseRole, sparse response rejection
    (under 80% coverage), malformed JSON, and bracket-inside-string-
    literal preservation.
  - parseBatchJudge handles dead-code verdict arrays + recovery edges
    — covers happy path, confidence clamp to [0,1], unknown-verdict
    coercion to uncertain, bracket-inside-string-literal via inString
    tracker, sparse rejection.

Suite: 1081 / 13 skip / 0 fail (was 1079).

## Reviewer trail

Two passes. Pass 1: APPROVE + 3 info findings (parseBatchResponse
missing string-state tracking, dead-code judged undercount on
synthetic entries, fragile test stub anchored on reply phrase). All
addressed. Pass 2: APPROVE + 2 info findings (parseBatchResponse test
missing bracket-in-string case, doc-comment overclaiming what
string-aware tracking handles). Both addressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@andreinknv andreinknv merged commit 9927549 into wip-stress-base May 1, 2026
@andreinknv andreinknv deleted the pr1-perf-classifier-batching branch May 1, 2026 11:37
andreinknv added a commit that referenced this pull request May 1, 2026
… docs

Three reviewer-flagged improvements on 6d0e7a2 (the WASM-visibility
commit). All informational items from that review — no functional
regression, no scope drift, suite stays at 1117/13/0.

## #1 — backend is now per-DatabaseConnection, not a process global

`createDatabase` previously set a module-level `activeBackend` and
exposed it via `getActiveBackend()`. In the MCP cross-project path
(handleStatus called against `projectPath` opens a SECOND DB via
openSync) the global reflected whichever DB was opened most recently,
not necessarily the one whose stats were just rendered. Benign in
practice (every DB in a process resolves the same backend) but
structurally imprecise.

Refactor:
- `createDatabase(dbPath)` now returns `{db, backend}` instead of a
  bare `SqliteDatabase`. Caller stores both.
- `DatabaseConnection` carries a `private backend: SqliteBackend` and
  exposes `getBackend()`.
- `CodeGraph.getBackend()` delegates — that's the public surface.
- CLI `codegraph status` and MCP `handleStatus` both call
  `cg.getBackend()` instead of the global. The global is removed.

Two pre-existing tests (`migrations-015-016`, `migrations-022`) that
called `createDatabase` directly now destructure `{db: adapter}`.
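The shape of the refactor can be sketched as below; the types are stand-ins for the real sqlite-adapter.ts ones, and the backend probe is a placeholder:

```typescript
// Hypothetical sketch of the per-connection backend refactor.
type SqliteBackend = "native" | "wasm";
interface SqliteDatabase { close(): void; }

function resolveBackend(): SqliteBackend {
  return "native"; // real code probes for a loadable native binding first
}

export function createDatabase(dbPath: string): { db: SqliteDatabase; backend: SqliteBackend } {
  const backend = resolveBackend();
  const db: SqliteDatabase = { close() {} }; // real code opens dbPath here
  return { db, backend }; // caller stores both; no module-level global
}

export class DatabaseConnection {
  constructor(private db: SqliteDatabase, private backend: SqliteBackend) {}
  // Each connection reports its own backend, so a second DB opened for a
  // cross-project status call can no longer shadow the first one's answer.
  getBackend(): SqliteBackend { return this.backend; }
}
```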

## #2 — fix recipe deduplicated across the two code surfaces

The `xcode-select` / `npm rebuild` / `npm install --save` recipe
appeared inline in both `buildWasmFallbackBanner` (sqlite-adapter.ts)
and the MCP `handleStatus` formatter (mcp/tools.ts). New
`WASM_FALLBACK_FIX_RECIPE` constant in sqlite-adapter.ts is the
single source for the one-line summary; the MCP formatter
interpolates it. The banner formats the same content multi-line for
the stderr surface. README is intentionally separate (different
audience, different rendering).

## #3 — README troubleshooting now covers Linux

Section title renamed "Indexing is slow on macOS / WASM fallback" ->
"Indexing is slow / WASM fallback active". New code block lists fix
steps for macOS, Debian/Ubuntu, RHEL/Fedora, and the cross-platform
`npm install --save` escape hatch. The banner stderr block also
gained the Linux equivalent for symmetry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 1, 2026
… server-config flags

Tooling-gap backlog (codegraph/docs/codegraph-tooling-gaps.md) closed:
  #1 freshness severity bucket — `classifyFreshness` with fresh|recent|stale|very_stale
  #2 allowStale flag — opt-in bypass for the heavy-drift gate, registry-injected schema
  #3 module format in status — `module-format.ts` parses package.json + tsconfig (JSONC-safe)
  #4 codegraph_imports tool + import-classifier — file/directory/bare/unresolvable filters
  #5 dynamic imports — extractor catches `import('…')` + `require('…')`, incl. template_string
  #6 build-context refs — new `build_context_refs` table for `__dirname` / `import.meta.*`
  #7 files.is_test flag — column populated by glob; surfaced in status as `(N test)`
  colbymchenry#11 summarize-also-embeds (discovered while dogfooding) — `cg.summarizeAll()` chains
       `embedAllSummaries`; new `cg.embedAll()` for embed-only path; CLI `codegraph embed`

CLI/MCP alignment (5/32 → 33+/35):
  - 13 new CLI commands via `runViaMCP` shim: callers, callees, impact, node, similar,
    biomarkers, imports, help-tools, explore, hotspots, dead-code, config-refs, sql-refs,
    module-summary, role, coverage-query, pending-summaries, save-summaries, review-context
  - 7 new MCP tools: codegraph_imports, codegraph_embed, codegraph_summarize, codegraph_sync,
    codegraph_reindex, codegraph_coverage_ingest, codegraph_init, codegraph_uninit,
    codegraph_unlock, codegraph_affected

MCP server-level operator config (`codegraph serve --mcp`):
  - --no-write-tools / --allow-stale-default / --disable-tool (sandboxing)
  - --llm-endpoint / --llm-chat-model / --llm-ask-model / --llm-embedding-model /
    --llm-api-key (operator LLM config; per-project config wins on conflict)
  - New CODEGRAPH_LLM_* env vars wired through `mergeLlmEnv` in resolveLlmProviders

Architectural cleanups:
  - `bypassFreshnessGate` and `isWriteTool` declarative flags on ToolModule (replaces
    growing string-comparison chain in execute())
  - `withAllowStale` registry injection only on tools that DO see the gate
  - DRY of inline copy-paste in 3 hooks → `src/index-hooks/enclosing.ts`
  - `LlmClient.isEmbeddingReachable` for split-provider correctness
  - SyncResult `lockContention` flag → handleSync emits distinct retryable message
  - `clearStructural` deletes from build_context_refs (was orphan-leaking on --force)
  - cli:dev npm script + tsx CLI fixed (web-tree-sitter `import type` for type-only refs)

Migrations:
  023-files-is-test.ts — add `files.is_test`
  024-build-context-refs.ts — add `build_context_refs` table

Reviewer rounds: 11 total, all REQUEST_CHANGES addressed inline. Notable fixes:
  - JSONC URL strip via state machine (was eating `https://` tails)
  - classifyFreshness very_stale now requires isStale (in-sync-but-old → recent)
  - Dynamic imports also match template_string nodes
  - process.exit deferred until after finally cleanup in runViaMCP
  - --same-language / --different-language mutual exclusion guard
  - help-tools CLI bypasses isInitialized (works without a project)
  - handleUninit sweeps projectCache by getProjectRoot (no dangling alias leaks)
  - handleAffected errors instead of silently dropping unsupported glob filters
  - mergeLlmEnv preserves precedence: legacy flat config wins over env-synthesised block
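The classifyFreshness fix noted above (very_stale now requires isStale, so an in-sync-but-old index lands in recent) can be sketched as below; the day thresholds are invented for illustration:

```typescript
// Hypothetical sketch of the freshness bucketing; thresholds are assumptions.
type Freshness = "fresh" | "recent" | "stale" | "very_stale";

export function classifyFreshness(ageDays: number, isStale: boolean): Freshness {
  // In-sync indexes never escalate past "recent", however old they are.
  if (!isStale) return ageDays <= 1 ? "fresh" : "recent";
  return ageDays > 30 ? "very_stale" : "stale";
}
```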

Suite: 1268 passing, 1 expected red (colbymchenry#8 — undecided), 13 skipped, 1 todo, 0 regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 2, 2026
Move 3 graph-derived biomarker scanners to a new
src/db/queries-biomarkers-graph.ts module as free functions taking
(qb: QueryBuilder, ...args). Same proven pattern as the prior ten
extractions.

Functions moved:
- findGodClasses (god_class biomarker — class-like nodes with too
  many `contains` children)
- findFeatureEnvy (feature_envy biomarker — methods calling out
  more than in)
- findUnusedExports (unused_export biomarker — exported symbols
  with no cross-file `calls` / `references` / `instantiates` /
  `extends` / `implements` edge)

This module covers the cross-file biomarker SCANNERS only; finding
STORAGE (appendFindings, etc.) lives in queries-findings.ts (cluster
#3).

Touchpoints (only one consumer):
- src/biomarkers/index.ts: 3 call sites (one per scanner)

Verification:
- tsc clean, suite 1352/13/0 (no regressions)
- Reindex: QueryBuilder god_class **46 -> 43** (delta = 3, exact)

QueryBuilder is now within 3 of the error-threshold (40). Next
cluster will be unresolved-refs (~10 methods), which clears the
error entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 3, 2026

getFindingsRanked gains `excludeFile` — drop findings whose file_path
starts with a literal prefix. The triage workflow it enables:
"give me ranked warnings excluding src/legacy/" without having to
rebaseline the index or post-filter in JS.

Implementation
--------------
- src/db/queries-findings.ts: WHERE n.file_path NOT LIKE @Prefix
  ESCAPE '\\'. The prefix is escaped (\\, _, %) before being bound
  so any of those characters in real path names match themselves;
  the first draft used GLOB but that was discarded after the
  reviewer flagged that GLOB has no escape syntax — `excludeFile: '*'`
  would have collapsed to "match every path" and silently zeroed
  the result set.
- src/mcp/tools/biomarkers.ts: new `excludeFile` arg on the
  inputSchema, passed through to `getFindingsRanked`. The
  description notes that the filter only applies to mode='ranked'
  (silently ignored on 'symbol' / 'stats').
- src/bin/codegraph.ts: `--exclude-file <prefix>` on the
  `codegraph biomarkers` CLI command.
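The escape step can be sketched as follows — an illustrative reconstruction of the queries-findings.ts behavior, with invented helper names:

```typescript
// Hypothetical sketch: escape \, _ and % so a literal path prefix matches
// itself under LIKE ... ESCAPE '\'. GLOB was rejected because it has no
// escape syntax, so '*' would have matched every path.
export function escapeLikePrefix(prefix: string): string {
  // Escape the escape character first, then the LIKE wildcards.
  return prefix.replace(/\\/g, "\\\\").replace(/_/g, "\\_").replace(/%/g, "\\%");
}

// Bound as: WHERE n.file_path NOT LIKE @prefix ESCAPE '\'
export function buildExcludePattern(prefix: string): string {
  return escapeLikePrefix(prefix) + "%"; // trailing % = "starts with prefix"
}
```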

Tests
-----
__tests__/biomarkers.test.ts:
- baseline (no filter) sees both functions
- directory exclusion via trailing-slash prefix drops one
- single-file exclusion drops the named file
- `excludeFile: '*'` is treated literally (no path starts with '*')
  so the result equals baseline — pinning the reviewer-caught fix
- `excludeFile: 'src/k_ep/'` does NOT match 'src/keep/' because the
  underscore is escaped, not a wildcard

Also: struck the duplicate B colbymchenry#33 marker in BACKLOG.md (it was the
same item as A #3).

Suite: 1366/13/0 (+1 new test). tsc clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 4, 2026
ToolHandler was a 491-line god-class doing five things commingled.
This extracts the project-cache + watcher coordination into its own
class so the dispatcher (ToolHandler) can focus on what's left:
disabled-tool filtering, freshness gating, dispatch, and lifecycle.

Before:
- ToolHandler held projectCache (Map), cachedRoots (Map),
  watchedRoots (Set), MAX_CACHED_PROJECTS, MAX_WATCHED_PROJECTS
- ...plus getCodeGraph, evictOldestIfFull, tryStartWatcher,
  closeAll, closeProjectsMatching, shouldEvictCachedProject,
  closeDefaultCgIfMatching as private methods.

After:
- New `src/mcp/tools/_project-cache.ts` (175 LOC) owns the cache
  Map, the cachedRoots map, the watchedRoots set, the FIFO eviction,
  and tryStartWatcher. Public surface: getOrOpen / closeAll /
  closeProjectsMatching / readonlyView (for ToolCtx exposure).
- ToolHandler shrinks ~150 lines; getCodeGraph becomes a 12-line
  method that delegates to cache.getOrOpen for explicit-project
  paths. closeAll delegates to cache.closeAll + closes the default
  cg if any. closeProjectsMatching delegates + sweeps the default cg.

Defense-by-design angle:
- Each class now has a narrower public surface — fewer reachable
  invariants per file → smaller blast radius for the next change.
- ProjectCache's own state (the three collections) can no longer be
  perturbed by unrelated dispatch logic; encapsulated behind 4 methods.

Self-doc angle:
- ToolHandler.execute() now reads as a clear pipeline: validate → gate
  → validate args (D1) → dispatch. The "what's a watcher doing here?"
  noise is gone.

What's NOT changed (intentional, smaller follow-up scope):
- The `runFreshnessGate` + auto-sync logic stays inside ToolHandler.
  Splitting that further is a separate exercise — it's tightly coupled
  to ToolResult shaping (banner injection, freshness metadata).
- WatcherCoordinator is folded INTO ProjectCache rather than a third
  class because the eviction pathway shares state with the watcher
  (closing a cached cg must stop its watcher). Keeping them in one
  class avoids cross-class state coordination.
- The CodeGraph class itself was triaged and DECLINED for split: it's
  already a facade exposing sub-managers (db/queries/orchestrator/etc).
  The remaining lifecycle code is genuinely cohesive.

Tests:
- Watcher tests update their internal-state probes to drill through
  the new `cache` field (handler.cache.watchedRoots instead of
  handler.watchedRoots). 4 fixture tests adjusted, 0 new tests needed.
- Full suite 1634/34/0 — same as pre-split.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 8, 2026
…arch arc)

Three additions inspired by the 2026-05-08 code-KG paper sweep
(CodeRAG, GraphCodeAgent, GraphGen4Code, Maarleveld GNN survey,
CodexGraph, FalkorDB). Each verified absent before implementation;
all opt-in to keep default structural traversals unchanged.

#1 — In-graph similar_to edges (similarity as graph hops)
  - New EdgeKind similar_to (confidence=INFERRED, metadata.score)
  - buildSimilarToEdges reuses findSimilarViaVec; delete+insert are
    wrapped in db.transaction so the replacement is atomic
  - Migration 036 partial-index on edges(source,kind) WHERE
    kind=similar_to — guarded by sqlite_master existence check for
    the pre-016 hand-rolled migration tests
  - Surfaced as codegraph_admin({action: build-similarity-edges})
    and CLI codegraph admin build-similarity-edges --k --min-score
  - EXCLUDED_EDGE_KINDS keeps it out of default traversals; explicit
    edgeKinds bypass the filter

#2 — mode=intent search over symbol_summaries
  - FTS5 virtual table summary_fts (porter unicode61, mirrors nodes_fts)
    + INSERT/UPDATE/DELETE triggers on symbol_summaries
  - Migration 037 + parallel schema.sql entry for fresh-init path
  - bm25-ranked, optional kind/language/pathFilter filters
  - pathFilter LIKE uses canonical backslash-escape pattern matching
    queries-findings.ts:227-232 (no _/% injection)
  - Refuse-when-empty error points at codegraph summarize
  - FTS5 query parse errors caught and re-surfaced as a clear
    syntax-error message

#3 — Intra-procedural def_use edges (TS/JS/TSX/JSX)
  - New EdgeKind def_use as self-loops on function/method nodes;
    metadata carries name, defLine, useLines
  - Scope-bounded extractor in src/extraction/def-use.ts, called
    from both tsExtractFunction and tsExtractMethod
  - Skips parameters (function inputs), fields (covered by
    field_access), nested-scope vars (belong to inner function set)
  - EXCLUDED_EDGE_KINDS opt-in; no traversal helper assumes
    source != target

Schema-version assertions bumped 35 -> 37 in foundation.test.ts and
pr19-improvements.test.ts.

Suite 1742/0/34 (was 1729 baseline, +13 new tests). Three reviewer
rounds: round 1 caught LIKE-escape, atomicity, field rename, FTS5
try/catch; round 2 caught a JSDoc rot from the rename and a
contradictory test assertion; round 3 APPROVE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 8, 2026
…ge-level diff (eval arc)

Three additions with pre-set hypotheses + post-impl measurement, per
the user's "evaluation and measurement how much useful it will
actually be" brief. Pre-impl thresholds were committed before code so
post numbers couldn't be motivated. One feature deferred with honest
documented reasoning.

#1 Selective parse-cache invalidation
  - clearParseCache(qb, language?) returns deleted-row count
  - CLI: --clear-parse-cache [language] (boolean OR optional value)
  - MCP: clearParseCache: boolean + clearParseCacheLanguage: string
    (typed schema doesn't allow oneOf — split into two args; language
     wins when both set)

  Hypothesis (pre-set): >=3x wall-clock speedup, <30s absolute
  Measured (codegraph repo, 463/498 = 93% TS in parse cache):
    - TS-only clear: 3.70s wall / 3.77s user-CPU
    - Full clear:    3.87s wall / 8.13s user-CPU
    - Wall ratio: 1.04x (parallelism masks the work delta)
    - User-CPU ratio: 2.16x more work for full clear
  Verdict: speed threshold NOT MET on monoglot testbed. Real value
  here is correctness (targeted invalidation when an extractor
  changes). On polyglot repos at 50% target-lang ratio, expected ~2x
  wall-clock speedup.

#1.5 Docstring source for mode='intent' (user follow-up: "make intent richer")
  - Migration 038: docstring_fts FTS5 over nodes.docstring + INS/UPD/DEL
    triggers (with WHEN docstring IS NOT NULL AND != ''); schema.sql parity
    for fresh-init path; pragma_table_info guard for pre-016 hand-rolled
    migration test setups
  - _search-intent.ts queries BOTH summary_fts AND docstring_fts,
    UNIONs by node_id keeping best rank, surfaces 'via summary' /
    'via docstring' provenance label per result
  - Empty-corpus check fixed: FTS5 external-content COUNT(*) reads
    from the source table, not the actual indexed rows — switched to
    direct content count

  Hypothesis (pre-set): >=30% recall increase
  Measured (20 hand-picked intent queries, codegraph corpus):
    - Summary-only hits: 22
    - Docstring-only hits: 34
    - Combined unique-node hits: 56 (2.55x = 155% improvement)
  Verdict: well above threshold. Best ROI of this arc — docstrings
  cover 26% of nodes vs summaries' 18%, AND describe intent verbatim
  in JSDoc / Python docstrings / Go comments.

#2 Edge-level diff in compare_to_ref
  - EdgeDelta / EdgesDelta types
  - diffEdgeLists keyed by stable
    (srcQualName::srcKind=>tgtQualName::tgtKind::edgeKind) so line
    shifts don't surface as spurious changes
  - Cross-file edges out of scope (compareToRef is per-file)
  - Opt-in via includeEdges: true (CLI: --include-edges)
  - Renderer surfaces source -> target node IDs (round-1 reviewer
    finding: discarded data; fixed)

  Hypothesis (pre-set): >=30% additional info (>=20% loose)
  Measured (HEAD vs HEAD~3 on this branch):
    - Node changes: 21+11+308 = 340
    - Edge changes (NEW signal): 83+11 = 94
    - Files surfaced: 22 -> 27 (+5 visible only via edge changes)
    - Information gain: 94/340 = 27.6%
  Verdict: >=20% threshold MET; just below >=30% strict. The 5
  newly-surfaced files (pure-edge changes) are the qualitative win.
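The stable-key diff can be sketched as below; the key shape follows the commit text, while the field names are illustrative:

```typescript
// Hypothetical sketch: diff two edge lists keyed by qualified names + kinds,
// so pure line shifts never surface as spurious edge changes.
export interface Edge {
  srcQualName: string;
  srcKind: string;
  tgtQualName: string;
  tgtKind: string;
  edgeKind: string;
}

const keyOf = (e: Edge) =>
  `${e.srcQualName}::${e.srcKind}=>${e.tgtQualName}::${e.tgtKind}::${e.edgeKind}`;

export function diffEdgeLists(before: Edge[], after: Edge[]): { added: Edge[]; removed: Edge[] } {
  const beforeKeys = new Set(before.map(keyOf));
  const afterKeys = new Set(after.map(keyOf));
  return {
    added: after.filter((e) => !beforeKeys.has(keyOf(e))),
    removed: before.filter((e) => !afterKeys.has(keyOf(e))),
  };
}
```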

#3 Stack-graphs cross-language resolver — DEFERRED
  Survey of the codegraph corpus: monoglot TypeScript. child_process
  invocations are ~30 git execFileSync calls; no Python/Ruby/Go
  spawn targets. Dynamic imports are NPM packages. string_imports
  table is dominated by test fixtures. Conclusion: this corpus
  lacks the ground-truth cross-language references needed to
  measure a scope-graph rule meaningfully. Building infrastructure
  without testable signal would be speculative abstraction (CLAUDE.md
  anti-pattern). Stays on the borrow-ideas backlog as the
  long-horizon item; not blocked, just not session-feasible without
  a polyglot testbed.

Schema-version assertions bumped 37 -> 38 in foundation.test.ts and
pr19-improvements.test.ts.

Suite 1746/0/34 (was 1742; +4 new tests). Two reviewer rounds: round
1 caught the edge-delta formatter discarding source/target IDs;
round 2 APPROVE. Info-level note tracked for later: summary_fts_au
trigger could mirror the docstring_fts_au SELECT-WHERE guard pattern
for consistency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 8, 2026
…enrichment

User asked: "is there other hot data we can apply intent search on, and
is there something missing we can apply on hot data?" Answer: test
descriptions mined from it/test/describe(...) calls in test files.
"Hot" = extracted at index time, no LLM pass needed. Plus enrich
codegraph_tests_for and codegraph_node with the same data while we're
in the area.

What shipped
============

#1.6a Schema + mining hook
  - Migration 039: test_names (id / file_path / line / description, FK
    on files.path, ON DELETE CASCADE) + test_names_fts FTS5 with
    porter unicode61 tokenizer + INS/UPD/DEL triggers; schema.sql
    parity for fresh-init path; sqlite_master guard for pre-016 hand-
    rolled migration tests
  - src/index-hooks/test-names.ts: regex-mining hook (NOT tree-sitter
    — patterns are simple enough; ~5% missed cases acceptable per
    docstring justification). Matches it/test/describe/context/
    fit/fdescribe/fcontext/xit/xtest/xdescribe with optional
    .skip/.only/.each/.todo/.concurrent/.sequential/.failing modifiers.
    Idempotent: full rescan on indexAll, per-file rescan on sync.
    Runs after TESTS_EDGES_HOOK so test->subject edges are fresh.
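The mining regex can be sketched as below — a simplified stand-in for the real hook (which handles more quoting and edge cases than this pattern):

```typescript
// Hypothetical sketch: mine test descriptions from it/test/describe(...)
// call sites with optional .skip/.only/... modifiers, one match per line.
const TEST_CALL =
  /^\s*(?:it|test|describe|context|fit|fdescribe|fcontext|xit|xtest|xdescribe)(?:\.(?:skip|only|each|todo|concurrent|sequential|failing))?\(\s*(['"`])(.*?)\1/;

export function mineTestNames(source: string): Array<{ line: number; description: string }> {
  const out: Array<{ line: number; description: string }> = [];
  source.split("\n").forEach((text, idx) => {
    const m = TEST_CALL.exec(text);
    if (m) out.push({ line: idx + 1, description: m[2] }); // 1-based lines
  });
  return out;
}
```

Regex mining like this trades a few missed cases (multi-line or computed descriptions) for not needing a tree-sitter pass, matching the ~5% tolerance stated above.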

#1.6b mode='intent' third source
  - _search-intent.ts queries summary_fts + docstring_fts +
    test_names_fts. Symbol-anchored sources merge by node_id keeping
    best rank; test_names render in a separate "Test-description
    matches" section because they anchor to file:line not a node id.
  - kind/languageFilter skip the test_names branch (test_names don't
    have a node kind). pathFilter applies via the canonical
    LIKE-escape pattern from queries-findings.ts (memo item #3).
  - Empty-corpus refusal + no-match messages updated to mention all
    three sources.
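The best-rank merge for the symbol-anchored sources can be sketched as below; bm25 ranks are "lower is better", and the field names are illustrative:

```typescript
// Hypothetical sketch: merge hits from multiple FTS sources by node_id,
// keeping the best (lowest) rank and its provenance label.
export interface IntentHit {
  nodeId: number;
  rank: number; // bm25: more negative = better match
  via: "summary" | "docstring";
}

export function mergeByNodeId(hits: IntentHit[]): IntentHit[] {
  const best = new Map<number, IntentHit>();
  for (const hit of hits) {
    const prev = best.get(hit.nodeId);
    if (!prev || hit.rank < prev.rank) best.set(hit.nodeId, hit);
  }
  return [...best.values()].sort((a, b) => a.rank - b.rank);
}
```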

#1.6c codegraph_tests_for enrichment
  - TestRow.testDescriptions: Array<{line, description}>, fetched via
    fetchTestDescriptionsForFile (50-row cap per file).
  - appendBucket renders up to 8 descriptions per file with overflow
    note ("...and N more assertions").
  - handleFilesMode (the files=[...] path) gets the same enrichment.

#1.6d codegraph_node enrichment
  - When fetching a single symbol's details, surface up to 5 mined
    test assertions from any test file linked to the symbol's file
    via the `tests` edge. Anchored by file (not by precise symbol —
    label says "may include broader-scope tests").

Schema-version assertions bumped 38 -> 39 in foundation.test.ts and
pr19-improvements.test.ts.

Measurement
===========

Coverage on this codegraph repo: 2169 test descriptions across 133
test files mined.

Per-source hits across 10 hand-picked queries:
  - summary-only:  22 hits
  - docstring:     80 hits
  - test_name:    101 hits
  - Combined:     203 hits
  - Gain from test_name: 99% additional signal vs summary+docstring

Vocabulary specialization: test_names dominate on assertion-style
queries ("rejects", "returns empty", "refusal error"); docstrings
dominate on implementation-style queries ("parses tree sitter",
"writes embeddings"). Sources are complementary, not redundant.

Suite 1751/0/34 (was 1750; +1 for the new node-enrichment test;
total +5 across the test-names test file). Typecheck clean.

Reviewer: APPROVE with 3 info-level findings. Addressed the missing
codegraph_node enrichment test in this commit; the two style notes
(canonical FK-safe insert pattern via `WHERE EXISTS`, transaction
wrap on clearAllTestNames for very large repos) are tracked as
follow-ups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 8, 2026
…ulk codegraph_at_range

Three additive surface extensions to cut investigation round-trips:

#1 codegraph_node accepts `symbols: string[]` (up to 20) alongside
   the existing single-symbol `symbol`. Duplicate inputs that resolve
   to the same node are merged. Saves N round-trips when checking a
   list of suspect symbols.

#2 codegraph_node accepts four new inline-expansion flags
   (`includeCallers` / `includeCallees` / `includeBiomarkers` /
   `includeTests`). When set, the response folds in the corresponding
   tool's answer under each card, capped per-section to keep token
   pressure low (10 callers, 10 callees, 5 findings, 5 test files).
   Collapses 3-5 round-trips into one for "tell me everything about X"
   patterns; the dedicated tools remain available for the full lists.

#3 codegraph_at_range accepts `ranges: Array<{file, startLine, endLine}>`
   (up to 100) alongside the single-range form. Output renders one
   subsection per range so the agent can map results back to specific
   diff hunks. PR review with N hunks goes from N+1 calls to 1.
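The duplicate-merge from #1 can be sketched as below; resolve() is a hypothetical stand-in for the real symbol resolution:

```typescript
// Hypothetical sketch: cap inputs at 20 and merge duplicates that resolve
// to the same node object (Set dedupes by identity).
export function dedupeByNode<T>(
  symbols: string[],
  resolve: (s: string) => T | null,
  cap = 20
): T[] {
  const seen = new Set<T>();
  for (const sym of symbols.slice(0, cap)) {
    const node = resolve(sym);
    if (node !== null) seen.add(node); // two spellings, one node -> one card
  }
  return [...seen];
}
```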

All three paths are additive — the legacy single-input shapes are
preserved verbatim. Backward-compat is locked in by the existing tests
plus 19 new ones (8 for node multi/expansions, 5 for bulk at-range,
+6 from refactoring). Docs updated in CLAUDE.md, README.md, and the
server-instructions playbook.

Suite 1772/0/34. No schema migration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 9, 2026
…polish items

Eight friction-tracker items addressed in parallel by sub-agents (2 Haiku,
1 Sonnet); reviewer caught one real correctness edge case (bucket overlap
on degenerate fresh-index shapes) plus two info items, all addressed in
this commit.

## colbymchenry#21 — at_range cost-benefit JSDoc

Doc-only update to src/mcp/tools/at-range.ts. Tool description and
JSDoc now state "pays off most on dense files (100+ symbols) and
multi-range bulk lookups; for tiny preview fetches on small files,
raw `head -N` is comparable." No code change.

## colbymchenry#25 — blame surfaces rename detection inline

src/git-utils.ts gains a new helper `getFileFollowEarliestTs` that runs
`git log --follow --format=%aI -- <path>` (5 s timeout, ISO timestamp).
src/mcp/tools/blame.ts compares the rename-aware oldest commit against
the line-range-only timeline's oldest. When `--follow` reaches further
back, appends a warning that the timeline truncated at the file's
rename and points at `git log --follow <file>` for the full history.
Edge cases handled: not-a-git-repo, timeout, empty timeline.

Test approach uses `vi.spyOn` to mock pre-rename history because
real fixtures are unreliable: modern git's `git log -L` follows
renames via content-similarity tracking, making a deterministic
black-box rename-fixture impossible.

## colbymchenry#26 — hotspots split into 3 mutually-exclusive categories

src/db/queries-history.ts gains `getCategorizedHotspots` and
src/mcp/tools/hotspots.ts gains a `category: 'risk' | 'maintenance'
| 'brittle' | 'all'` arg (default 'risk' for backward compat).
Thresholds use 75/25 percentile rather than hardcoded magic
numbers — they adapt as the project grows.

Buckets:
- risk        : high centrality AND high churn — where bugs hide
- maintenance : high churn AND not-high centrality — refactor target
- brittle     : high centrality AND not-high churn — stable critical

Reviewer-caught correctness bug: original filters used `<= low` for
the secondary axis, which collapsed buckets when high == low (fresh
index where centrality is uniformly zero, or repos where every file
has identical churn). A file at the threshold could appear in both
risk AND maintenance simultaneously. Fixed by switching maintenance
and brittle to `< highThreshold`, making them strictly disjoint
even on degenerate inputs. Also added a more-hint when any section
hit the per-category cap (the existing `category='risk'` path
already had this; `category='all'` now mirrors).
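The disjoint bucketing can be sketched as below. This is an illustrative reduction, not the real `getCategorizedHotspots` query: thresholds are passed in (in the real code they come from the 75th/25th percentiles), and `bucketFor` only shows the membership logic and the strict `<` fix:

```typescript
type Bucket = "risk" | "maintenance" | "brittle" | null;

// centHigh / churnHigh play the role of the 75th-percentile thresholds.
function bucketFor(
  centrality: number,
  churn: number,
  centHigh: number,
  churnHigh: number,
): Bucket {
  const highCent = centrality >= centHigh;
  const highChurn = churn >= churnHigh;
  if (highCent && highChurn) return "risk"; // high on BOTH axes
  // Strict `<` on the secondary axis keeps the buckets disjoint even when
  // the high and low thresholds coincide (uniform centrality or churn).
  if (highChurn && centrality < centHigh) return "maintenance";
  if (highCent && churn < churnHigh) return "brittle";
  return null; // neither axis is high: not a hotspot
}
```

On a degenerate index where both thresholds collapse to 0, every file satisfies `>= 0`, so everything lands in `risk` and nothing double-counts.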

New `__tests__/hotspots.test.ts` (4 cases) covers all-section
rendering, single-category dispatch, and the backward-compat
default path.

## colbymchenry#27 — search centrality:high differentiates "hook hasn't run"
       vs "no node met the threshold"

src/mcp/tools/search.ts. `probeCentralityFilterCulprit` now runs a
sub-millisecond probe `SELECT 1 FROM nodes WHERE centrality IS NOT
NULL LIMIT 1` (uses the existing `idx_nodes_centrality` index). When
ALL nodes have NULL centrality the agent gets the existing "centrality
hook hasn't run — run codegraph index" hint. When SOME nodes have
centrality but none cleared the filter, a different hint suggests
relaxing the threshold. Two-case hint instead of one.
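The two-case dispatch reduces to a boolean fed by the one-row probe. A sketch, with the probe shown as a comment and the second hint's wording illustrative rather than quoted from the source:

```typescript
// Probe (sub-millisecond, hits idx_nodes_centrality):
//   SELECT 1 FROM nodes WHERE centrality IS NOT NULL LIMIT 1
function centralityFilterHint(anyNodeHasCentrality: boolean): string {
  if (!anyNodeHasCentrality) {
    // Every node is NULL: the hook never ran at all.
    return "centrality hook hasn't run — run codegraph index";
  }
  // Some nodes carry centrality, but none cleared the filter.
  return "no node met the centrality threshold; try relaxing the filter";
}
```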

## colbymchenry#28 — search exact promotes multi-token-query warning to pre-result

src/mcp/tools/search.ts. `buildConceptHintIfNeeded` now returns
`{ preResult, postResult }` instead of a single string. When the
query splits into 2+ space-separated non-qualified tokens (likely
"multiple symbol names"), the agent gets a leading hint to call
search per name OR use codegraph_explore — BEFORE the result list
rather than buried after.

Field-qualified tokens (`kind:function lang:typescript`) and
single-free-token queries are unchanged.
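The trigger condition can be sketched as follows; the interface shape matches the `{ preResult, postResult }` return described above, while the hint text and helper name are assumptions for illustration:

```typescript
interface ConceptHints {
  preResult?: string; // shown BEFORE the result list
  postResult?: string; // shown after, as before
}

function buildConceptHint(query: string): ConceptHints {
  const tokens = query.trim().split(/\s+/).filter((t) => t.length > 0);
  // Field-qualified tokens like `kind:function` are not symbol names.
  const freeTokens = tokens.filter((t) => !t.includes(":"));
  if (freeTokens.length >= 2) {
    return {
      preResult:
        "query looks like multiple symbol names; call search once per name, or use codegraph_explore",
    };
  }
  return {};
}
```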

## colbymchenry#33 — callers on "constructor" with no callers explains
       the instantiates-edge model

src/mcp/tools/callers.ts. When the resolved symbol is
`kind=method && name=constructor` AND the callers list is empty,
appends a one-line note: "constructors are invoked via
`new ClassName(...)`, which graph-edges as `instantiates` on the
parent class. To find construction sites, run codegraph_callers on
the enclosing class instead of 'constructor'." Both the multi-match
and single-match paths got the note (guarded by the same
kind+name+empty check). Constructors WITH callers (e.g. via super())
render normally — no false positive.
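The guard shared by both paths is small enough to sketch whole (function name assumed):

```typescript
// Only a constructor METHOD with ZERO callers gets the note; constructors
// reached via super() have callers and render normally.
function shouldAppendConstructorNote(
  kind: string,
  name: string,
  callersCount: number,
): boolean {
  return kind === "method" && name === "constructor" && callersCount === 0;
}
```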

## colbymchenry#35 — node.symbol tie-break prefers non-fixture, then centrality

src/mcp/tools/symbol-resolver.ts. `pickFromMultipleExactMatches`
now filters out fixture paths first (falls back to all-fixture when
that's all that matches), then sorts by centrality DESC (NULL → 0).
A `helper` symbol that exists in both `src/core.ts` and
`docs/test-beds/fixture.ts` resolves to `src/core.ts` as the displayed
primary. Tier #3 (last_touched_ts) deferred — data not in the
resolver's existing query.
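The two-tier pick can be sketched like this. The filter-then-sort shape follows the description above; the `isFixturePath` regex here is a toy stand-in, not the shared predicate's actual pattern set:

```typescript
interface Match {
  path: string;
  centrality: number | null;
}

// Illustrative only; the real predicate lives in src/mcp/tools/shared.ts.
function isFixturePath(p: string): boolean {
  return /(^|\/)(fixtures?|test-beds)(\/|$)/.test(p);
}

function pickPrimary(matches: Match[]): Match {
  const nonFixture = matches.filter((m) => !isFixturePath(m.path));
  // Tier 1: prefer non-fixture; fall back when everything is a fixture.
  const pool = nonFixture.length > 0 ? nonFixture : matches;
  // Tier 2: centrality DESC, treating NULL as 0.
  return [...pool].sort((a, b) => (b.centrality ?? 0) - (a.centrality ?? 0))[0];
}
```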

Reviewer-caught DRY issue: the fixture-path regex set was duplicated
between symbol-resolver.ts and dead-code.ts (introduced by parallel
sub-agents on the same brief). Extracted to `isFixturePath` in
src/mcp/tools/shared.ts; both consumers now import the single source.

## colbymchenry#49 — getSummaryCoverage denominator threading (3 call sites)

src/bin/codegraph.ts (lines 348, 1461) + src/mcp/tools/status.ts
(line 440). All three pass `SUMMARIZABLE_KINDS` to getSummaryCoverage
to match the canonical pattern from the previously-fixed
_search-intent.ts:218. Without this, the helper falls back to
COUNT(*) which inflates the denominator with parameters / imports /
file nodes — its own JSDoc explicitly warns against this.

## Test re-additions

Sub-agent #1 deleted its own test files for colbymchenry#33 and colbymchenry#35 (a brief
misread — "DO NOT commit" was interpreted as "DO NOT leave tests in
repo"). Re-added as
`__tests__/mcp-callers-constructor-and-fixture-tiebreak.test.ts`
covering: constructor-with-no-callers note appears,
non-constructor-method note absent, name-collision picks non-fixture
primary.

## Verification

- 15 modified files + 2 new test files, +619/-55
- npm run typecheck — clean
- 74/74 tests pass across 9 LLM/search/hotspots-related test files
- New exports: `isFixturePath` (shared.ts), `getCategorizedHotspots`
  (queries-history.ts), `getFileFollowEarliestTs` (git-utils.ts) —
  all have concrete in-tree callers in the same diff per
  reviewer-memo item #7

Reviewer pass with .claude/reviewer-memo.md prepended caught:
- (request_changes) bucket-exclusivity edge case → fixed
- (info) isFixturePath duplication → deduped
- (info) category='all' missing more-hint → added

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 9, 2026
New module src/llm/local-role-classifier.ts. Wraps
@huggingface/transformers `zero-shot-classification` pipeline to
predict the codegraph 7-class role taxonomy
(business_logic / api_endpoint / util / data_model / framework_glue
/ test_helper / unknown) without a chat round-trip.

Default model: Xenova/distilbert-base-uncased-mnli (~80MB ONNX,
zero-shot via NLI). Same patterns as the other Stage-3+ in-process
clients (RerankerClient, QaClient): module-level pipeline cache
keyed by `<model>::<dtype>`, eviction on rejection, LlmEndpointError
on missing optional dep, empty-input short-circuit.

Public API:
- LocalRoleClassifier(cfg)
- classify(text) → { label, score, scores[] }
- classifyBatch(texts) → RoleClassification[]
- isReachable / listModels
- _clearRoleClassifierCache (test export)

Output normalization handles flat {labels, scores} AND nested
[{labels, scores}] shapes; out-of-taxonomy labels mapped to 'unknown'.
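The normalization step can be sketched in isolation. The taxonomy constant mirrors the seven classes listed above; the function name and output shape are assumptions for illustration:

```typescript
const ROLE_LABELS = [
  "business_logic",
  "api_endpoint",
  "util",
  "data_model",
  "framework_glue",
  "test_helper",
  "unknown",
] as const;
type Role = (typeof ROLE_LABELS)[number];

interface ZeroShotOut {
  labels: string[]; // sorted by score DESC, as the pipeline returns them
  scores: number[];
}

// Accept both the flat {labels, scores} and nested [{labels, scores}]
// shapes; coerce anything outside the taxonomy to 'unknown'.
function normalizeZeroShot(raw: ZeroShotOut | ZeroShotOut[]): { label: Role; score: number } {
  const out = Array.isArray(raw) ? raw[0] : raw;
  const top = out.labels[0];
  const label = (ROLE_LABELS as readonly string[]).includes(top) ? (top as Role) : "unknown";
  return { label, score: out.scores[0] };
}
```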

26 vitest cases across 9 describe blocks.

Replaces the chat-mediated codegraph_role taxonomy path when
configured — token cost ~200 tokens/call → 0. Integration into the
classify pipeline lands separately (tracked as Stage 7 #3 follow-up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 9, 2026
Wires the in-process zero-shot role classifier into the classifier
phase as replace-mode. When localRoleLlm is configured:
1. Per batch, call LocalRoleClassifier.classifyBatch on the texts.
2. Stamp results with localModelLabel (cache-attribution discipline
   from the Stage 5 #B reviewer fix).
3. On any local failure (transient model issue, abort, length
   mismatch), fall through to the existing chat-batch path.

- Added localRoleLlm to LlmEndpointConfig + ResolvedLlm + user-facing
  CodeGraphConfig.llm.
- New resolveLocalRole helper in provider.ts.
- classifyAllRoles accepts localRoleClassifier + localModelLabel.
- New classifierTryLocalBatch helper in classifier.ts; threads
  through ClassifierRunCtx.
- runClassifyPhase instantiates LocalRoleClassifier lazily when
  resolved.localRoleLlm is set.

Confidence floor: LOCAL_ROLE_CONFIDENCE_FLOOR = 0.25; below that
the label is stamped as 'unknown' rather than the model's top
guess. Tunable later if the empirical taxonomy distribution warrants.
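The replace-mode flow (try local, apply the floor, fall through on failure or length mismatch) can be sketched with the classifier and chat path injected as functions. Names and signatures here are assumptions, not the real classifier.ts API:

```typescript
const LOCAL_ROLE_CONFIDENCE_FLOOR = 0.25;

type RoleResult = { label: string; score: number };

async function classifyTryLocalBatch(
  texts: string[],
  local: (texts: string[]) => Promise<RoleResult[]>,
  chatFallback: (texts: string[]) => Promise<string[]>,
): Promise<string[]> {
  try {
    const results = await local(texts);
    if (results.length !== texts.length) throw new Error("length mismatch");
    // Below the floor, stamp 'unknown' rather than the model's top guess.
    return results.map((r) =>
      r.score >= LOCAL_ROLE_CONFIDENCE_FLOOR ? r.label : "unknown",
    );
  } catch {
    // Transient model issue, abort, or mismatch: existing chat-batch path.
    return chatFallback(texts);
  }
}
```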

Off by default. When configured, classification token cost drops
~200/symbol → 0/symbol (entire chat path is bypassed for the common
case). Suite remains 2374/0/34. Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 10, 2026
…Stage 7 #2/#3 follow-ups)

Two new local-NLI surfaces. Both use the existing bart-large-mnli
zero-shot classifier; both keep the heuristic / rule-based path
on the happy case and only consult the model when the deterministic
rules can't reach a confident answer. Token cost vs the chat
backend: ~0 per call.

# C — commit-intent NLI fallback

`classifyCommitMessageWithFallback(message, classifier?)` runs the
existing heuristic first; when it returns `'unknown'` (~30% of
commits in messy histories: "wip", "stuff", "more"), feeds the
subject into a 7-hypothesis NLI classifier:

  feat     → "a new feature or capability"
  fix      → "a bug fix or error correction"
  refactor → "code restructuring or cleanup without behaviour change"
  perf     → "a performance improvement or optimisation"
  test     → "tests, specs, or test infrastructure"
  docs     → "documentation, comments, or readme"
  chore    → "dependency bumps, build config, or routine chores"

Confidence floor 0.45 — below that we keep `'unknown'` rather than
ascribe a low-confidence label (avoids polluting commit_intents
with junk on truly-opaque commits). Wired into the cochange index
hook: classifier is constructed from `localRoleLlm` config when
present, otherwise the hook stays heuristic-only and
behaviour-identical to before.

The original sync `classifyCommitMessage` is unchanged — existing
callers + tests continue to use the heuristic path.
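The heuristic-then-NLI flow can be sketched as below. The hypothesis strings mirror the table above; the function signature and the `nli` callback shape are assumptions made so the sketch stays self-contained:

```typescript
const INTENT_HYPOTHESES: Record<string, string> = {
  feat: "a new feature or capability",
  fix: "a bug fix or error correction",
  refactor: "code restructuring or cleanup without behaviour change",
  perf: "a performance improvement or optimisation",
  test: "tests, specs, or test infrastructure",
  docs: "documentation, comments, or readme",
  chore: "dependency bumps, build config, or routine chores",
};

async function classifyCommitMessageWithFallback(
  message: string,
  heuristic: (m: string) => string,
  nli?: (text: string, labels: string[]) => Promise<{ label: string; score: number }>,
): Promise<string> {
  const fromRules = heuristic(message);
  // Happy case stays rule-based; no classifier means heuristic-only.
  if (fromRules !== "unknown" || !nli) return fromRules;
  const { label, score } = await nli(message, Object.values(INTENT_HYPOTHESES));
  if (score < 0.45) return "unknown"; // confidence floor: keep opaque commits opaque
  // Map the winning hypothesis back to its intent key.
  return Object.keys(INTENT_HYPOTHESES).find((k) => INTENT_HYPOTHESES[k] === label) ?? "unknown";
}
```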

# D — new structured change-kind classifier

`classifyChangeKind({classifier, beforeBody, afterBody, name, kind})`
in `src/llm/change-kind.ts`. Distinct from the existing
generative `summarizeChange` (which produces prose via the chat
backend): this produces a STRUCTURED label suitable for grouping /
filtering / metrics:

  addition | removal | modification | refactor
  | signature_change | behavioral_change | doc_only | unknown

Rule-based dispatch handles the trivial cases (empty before →
addition, empty after → removal, identical → unknown) without an
NLI call. The remaining cases consult bart-large-mnli with 4
prose hypotheses against the diff. Same 0.45 confidence floor;
sub-threshold tops fall through to `'modification'`.
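The rule-based front end can be sketched on its own, with `null` meaning "fall through to the NLI call". The helper name is assumed; the real dispatch lives inside `classifyChangeKind`:

```typescript
type ChangeKind =
  | "addition"
  | "removal"
  | "modification"
  | "refactor"
  | "signature_change"
  | "behavioral_change"
  | "doc_only"
  | "unknown";

// Trivial cases never cost an NLI call.
function ruleBasedChangeKind(beforeBody: string, afterBody: string): ChangeKind | null {
  if (beforeBody.trim() === "" && afterBody.trim() !== "") return "addition";
  if (afterBody.trim() === "" && beforeBody.trim() !== "") return "removal";
  if (beforeBody === afterBody) return "unknown"; // identical bodies
  return null; // non-trivial diff: consult the zero-shot classifier
}
```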

# Supporting refactors

- `LocalRoleClassifier.classifyLabels(text, labels)` — new method
  that runs zero-shot against an arbitrary caller-supplied label
  set (vs the existing `classify()` which is hardcoded to the
  7-class role taxonomy). The classifier wraps the same pipeline,
  so both surfaces share one model load.

- `IndexHook` afterIndexAll/afterSync are already async-aware in
  the registry; the cochange hook had been synchronous. Made
  `applyResults`/`applyFullRescan`/`applyIncremental`/`refresh`
  async so the NLI fallback can await per-commit. Behaviour
  unchanged when `localRoleLlm` is unset.

# Tests

+ `__tests__/commit-intent-fallback.test.ts` — 13 cases covering
  short-circuit, NLI dispatch, low-confidence floor, error
  degradation, all 7 intent labels.
+ `__tests__/change-kind.test.ts` — 12 cases covering rule-based
  dispatch, NLI dispatch, low-confidence floor, error paths.
+ `__tests__/local-role-classifier.test.ts` — 7 new cases for
  `classifyLabels` (custom label set, short-circuit, abort,
  unknown labels NOT coerced to ROLE_LABELS).

# Why now

Web research (HF API + WebSearch agent) confirmed that five small
specialized models is the realistic ceiling for the in-process
transformers.js surface: no code-tuned summarizer or embedder ships
in a transformers.js-loadable form today.
remaining win for the existing surface was unifying everything that
used to chat-call onto the one bart-large-mnli load — both C and D
fit that shape.