feat+perf(llm): split chat/askChat provider + summary batching [WIP 3/5]#5

Closed
andreinknv wants to merge 2 commits into
pr2-cache-preservation from
pr3-split-provider-and-summary-batching

Conversation

@andreinknv
Owner

Stress-test fixes — PR 3 of 5 (stacked). Base: pr2-cache-preservation.

Commits

  • 2620304 feat(llm): split provider per role — chat vs askChat
  • 877930d perf(llm): summary batching, cache-aware, provider-defaulted to 3 for bridge/api

Summary

Why these are paired: the summary-batching commit's provider gating reads the new askChat slot from 2620304, so they must ship together.

Split provider (2620304)

End-to-end measurement uncovered a real cost/quality tradeoff: bulk summary work runs ~2.6× faster on a local MLX-served Qwen3-Coder-30B than on claude-bridge → Haiku, with comparable summary quality. But the same Qwen3-Coder confidently confabulates wrong answers on synthesis-heavy ask queries. Sonnet via claude-bridge gives precisely correct answers (12-40s).

The pre-existing schema (chat.askModel: string) only allowed swapping the model id within the same provider/endpoint. A new top-level askChat block lets bulk and ask traffic route through entirely different providers:

"chat":    { "provider": "openai-compat", "endpoint": "...", "model": "qwen3-coder" },
"askChat": { "provider": "claude-bridge",                      "model": "claude-sonnet-4-6" }

getAskModel() cascade: askChat.model → chat.askModel → chat.model → chatModel (legacy).

Summary batching (877930d)

Benchmarked against MLX with 12 real symbols at batch sizes 1/3/5/10:

  • Local single-stream backends (MLX, llama.cpp): wall-clock identical regardless of batch size — no per-call overhead to amortise.
  • Quality: BATCH=3 indistinguishable from 1, BATCH=5 starts losing specifics, BATCH=10 noticeably terser.
  • Implication: batch only when there's per-call overhead (claude-bridge subprocess spawn, anthropic-api network roundtrip).

Worker accumulates a per-worker pending list; cache hits never enter it. Parse failures fall back to per-item generateSingle. Provider gating defaults batch=3 for bridge/anthropic-api, batch=1 for openai-compat.

Test plan

  • Suite green at 1113 / 13 skip / 0 fail.
  • Manual: split-provider config produces summaries via chat model and ask responses via askChat model.

WIP draft.

andreinknv and others added 2 commits April 30, 2026 18:53
End-to-end measurement of the cache-preservation work uncovered a
real cost/quality tradeoff: bulk summary work runs 2.6x faster on a
local MLX-served model (Qwen3-Coder-30B-A3B-Instruct-4bit-DWQ via
mlx-openai-server) than on claude-bridge -> Haiku, with comparable
quality. But the same Qwen3-Coder model confidently confabulates
WRONG answers on synthesis-heavy ask queries — got the PRAGMA
foreign_keys direction reversed and cited the wrong migration
filename when asked about the cache-preservation mechanism. Sonnet
via claude-bridge gave precisely correct answers, but took 12-40s.

The pre-existing schema (chat.askModel: string) only allowed swapping
the model id within the SAME provider/endpoint. To route bulk and
ask through entirely DIFFERENT providers, this PR adds a top-level
`askChat` block.

## Changes

### Schema (src/types.ts, src/llm/client.ts)

New optional `llm.askChat` block mirroring `chat`:

    "chat":    { "provider": "openai-compat",
                 "endpoint": "http://localhost:8081/v1",
                 "model": "qwen3-coder" },
    "askChat": { "provider": "claude-bridge",
                 "model": "claude-sonnet-4-6" }

`LlmEndpointConfig.askChat?: ChatProviderConfig | null` parallels the
existing `chat` slot. `normalizeEndpointConfig` returns the new field
alongside chat + embeddings.

### Resolver (src/llm/provider.ts)

New `resolveAskChat()` mirrors `resolveChat()` but never reads legacy
flat fields — askChat is opt-in only. Wired into `resolveLlmProviders`
so the resulting `ResolvedLlm` carries both `chat` and `askChat` slots.

New `getAskModel()` helper with cascade:
- askChat.model -> chat.askModel -> chat.model -> chatModel (legacy)
This is the right primitive for any caller that wants to know which
model id will actually answer an ask/dead-code call.
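
A minimal sketch of that cascade (field names per the bullets above;
the coalescing chain itself is illustrative, not the in-tree body):

    function getAskModel(llm: ResolvedLlm): string | undefined {
      return llm.askChat?.model       // split-provider slot (new)
          ?? llm.chat?.askModel       // same-provider model swap
          ?? llm.chat?.model          // plain chat model
          ?? llm.chatModel;           // legacy flat field
    }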

### Client routing (src/llm/client.ts)

`LlmClient` now lazy-loads TWO backends instead of one:
- `chatBackend` — for bulk work (summaries, classifier, dir summaries).
- `askChatBackend` — for ask + dead-code judge, only when askChat is
  configured.

`LlmClient.chat(messages, { useAskModel: true })` checks `askChatCfg`:
- If set → dispatch to ask backend with `useAskModel: false` (the ask
  backend's own `model` field is already the right model; no re-swap
  needed within that backend).
- If unset → dispatch to chat backend with `useAskModel` passed
  through (legacy single-provider behaviour: same backend, swap model
  id via `chat.askModel`).
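
As a sketch (types and helper names beyond chat/useAskModel are
illustrative):

    async chat(messages: ChatMessage[], opts: { useAskModel?: boolean } = {}) {
      if (opts.useAskModel && this.askChatCfg) {
        // Ask backend already carries the right `model`; no re-swap inside it.
        return (await this.getAskChatBackend()).chat(messages, { useAskModel: false });
      }
      // Legacy path: same backend, model id swapped via chat.askModel.
      return (await this.getChatBackend()).chat(messages, { useAskModel: opts.useAskModel });
    }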

`instantiateBackend` factored out from `getChatBackend` so both paths
share construction logic.

`isReachable()` now also probes the askChat backend when configured —
returns false if EITHER chat OR askChat fails, so status output never
claims reachable when ask calls would throw at runtime (e.g.
claude-bridge binary missing on the ask side).
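
Sketch of that AND-semantics (the probe helpers are hypothetical
names):

    async isReachable(): Promise<boolean> {
      if (!(await this.probeBackend(await this.getChatBackend()))) return false;
      if (!this.askChatCfg) return true;   // single-provider setup
      return this.probeBackend(await this.getAskChatBackend());
    }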

### Display + gating (src/bin/codegraph.ts, src/mcp/tools.ts)

- `codegraph status` shows a separate `Ask model:` line when ask
  routes differently from chat. Provider hint in parentheses
  (`Ask model: claude-sonnet-4-6 (claude-bridge)`) appears only in
  true split-provider setups, not for single-provider askModel
  overrides.
- `codegraph ask` CLI trailer now uses getAskModel for the displayed
  model id so the printed `model X` matches what actually generated
  the answer.
- handleAsk and handleDeadCode in the MCP tools now gate on
  getAskModel rather than getChatModel, with error messages updated
  to mention the askChat block as an alternative configuration path.
  Pre-fix, a config with chat=null but askChat configured (rare but
  legitimate) would have failed the gate even though ask was
  perfectly configured.

### Tests (__tests__/llm.test.ts)

Two new tests:

1. `split provider: useAskModel routes to askChat backend when
   configured` — uses TWO fake servers, asserts bulk hits chat server
   only and ask hits askChat server only. Captures and asserts the
   `model` field in the request body of each, guarding against a
   regression where routing is right but the model id sent is wrong.

2. `legacy single-provider: useAskModel stays on chat backend, swaps
   model id` — asserts no-askChat config preserves prior behaviour
   (single backend handles both calls, model id swaps).

`FakeServer` extended with `lastChatBody` capture so tests can assert
which model id reached which server.

## Live validation

Tested on the codegraph self-repo with chat -> MLX/Qwen3-Coder and
askChat -> claude-bridge/Sonnet. Same ask question that Qwen3-Coder
confabulated about now returns Sonnet's precisely correct answer:
- Pre-split: 6.9s, wrong on PRAGMA foreign_keys direction, cited
  migration 014 (unrelated).
- Post-split: 39.5s, precisely correct including line numbers
  (`summarizer.ts:189-197`, `queries.ts:2125`) and the right
  migration filename (022-add-content-hash-index).

## Backwards compatibility

Existing configs with just `chat: { ..., askModel: "..." }` continue
to work unchanged — askChat is optional. Test #2 covers this path.

## Reviewer trail

Two passes. Pass 1: REQUEST_CHANGES + 4 findings.
- (1) handleAsk/handleDeadCode used getChatModel — addressed,
  switched to getAskModel with updated error messages.
- (2) status display omitted ask model — addressed, added
  conditional Ask model line in split-provider setup.
- (3) isReachable didn't probe askChat — addressed.
- (4) split-provider test didn't assert request-body model id —
  addressed.

Pass 2: APPROVE + 2 info findings (stale test comment, redundant
provider hint for single-provider askModel override) — both
addressed.

Suite: 1089 / 13 skip / 0 fail (was 1087).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… bridge/api

Benching against the live MLX server with 12 real codegraph symbols at
batch sizes 1/3/5/10 surfaced an asymmetry the original "batch all
LLM calls uniformly" instinct would have got wrong:

- Wall-clock on local single-stream backends (MLX, llama.cpp): IDENTICAL
  across batch sizes (~9.3-9.9s for 12 symbols regardless). No per-call
  overhead to amortise; total generation work dominates.
- Quality: monotonically degraded — BATCH=3 indistinguishable from
  BATCH=1 on Qwen3-Coder; BATCH=5 starts losing specifics; BATCH=10
  noticeably terser.
- Implication: batching helps ONLY when there's per-call overhead
  (claude-bridge subprocess spawn ~1-2s, anthropic-api network
  roundtrip). For local servers, batching is pure quality loss.

For codegraph self-repo on claude-bridge, the math: 672 summaries at
~30 sym/min (Phase A baseline) = 22 min. Batched 3-per-call:
~224 calls × ~5s / 4 concurrency ≈ 5 min. ~4-5x speedup with
quality holding.

## Implementation (cache-aware)

Worker rewritten in src/llm/summarizer.ts:

- New `summaryBatchSize?: number` SummarizerOption, default 1.
- Per-worker `pending` accumulator. Cache hits NEVER batch — they
  short-circuit before pending push, so they don't burn batch slots.
  Only cache MISSES enter the batch.
- `flushBatch` calls client.chat with `buildBatchPrompt`, parses via
  `parseBatchSummaries` (mirrors classifier's bracket-aware string-
  state JSON extractor + ≥80% coverage gate).
- On parse failure or LlmEndpointError → per-item fallback via
  `generateSingle` so a bad batch can't lose the whole group.
- Single-item batch (e.g. trailing partial) routes straight to
  `generateSingle` to avoid batch-prompt overhead.
- Pre-fix `try/finally`-with-`continue` double-count hazard removed
  by the rewrite — every exit path explicitly increments `done`.
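
Control-flow sketch of the worker loop (pending, flushBatch and
generateSingle are the names above; types and helpers around them are
illustrative):

    const pending: SummaryJob[] = [];
    for (const job of jobs) {
      if (cacheHit(job)) { done++; continue; }   // hits never enter the batch
      pending.push(job);
      if (pending.length >= batchSize) {
        await flushBatch(pending.splice(0));     // parse failure / endpoint error
      }                                          //   -> per-item generateSingle inside
    }
    // Trailing partial: a single leftover skips the batch prompt entirely.
    if (pending.length === 1) await generateSingle(pending[0]);
    else if (pending.length) await flushBatch(pending.splice(0));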

## Provider gating

src/index.ts startBackgroundSummarization picks the default based on
chat.provider:

- `claude-bridge` / `anthropic-api` → 3 (per-call overhead amortises)
- `openai-compat` → 1 (most local servers don't continuous-batch)

Users can override via `chat.summaryBatchSize` in config — useful for
high-latency openai-compat endpoints (cloud-hosted proxies that pay
network roundtrip cost).

The public CodeGraph.summarizeAll() wrapper applies the same cascade
(caller option > config > provider default).
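
The cascade as a sketch (resolveBatchSize is a hypothetical name; the
precedence is the documented one):

    function resolveBatchSize(callerOpt: number | undefined,
                              cfg: number | undefined,
                              provider: string): number {
      if (callerOpt !== undefined) return callerOpt;   // caller option
      if (cfg !== undefined) return cfg;               // chat.summaryBatchSize
      return provider === 'claude-bridge' || provider === 'anthropic-api'
        ? 3    // per-call overhead amortises
        : 1;   // openai-compat local servers
    }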

## Token budget

batchMaxTokens = max(160, batch.length * 120). Per-item budget breaks
down to ~60 tokens summary text (200 chars / 3.5 chars per token) +
~40 tokens JSON framing + ~20 tokens headroom for verbose models.
batch=5 gets 600 tokens, batch=10 gets 1200 — comfortable margin.
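
The same budget as code (arithmetic verbatim from the paragraph above):

    // 60 (summary text) + 40 (JSON framing) + 20 (headroom) = 120/item;
    // the 160 floor keeps a one-item batch from being squeezed.
    const batchMaxTokens = Math.max(160, batch.length * 120);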

## Tests (4 new)

- `parseBatchSummaries extracts summaries, tolerates prose, rejects
  under-coverage` — happy path, prose-wrapped JSON, bracket-in-string,
  sparse rejection, malformed.
- `parseBatchSummaries: ≥80% coverage gate works at batch=5` — locks
  down the boundary (4 of 5 passes, 3 of 5 fails).
- `summary batching reduces chat-call count when batchSize > 1` —
  asserts `incremental ≤ ceil(generated/3) + 2` slop with batchSize=3.
- `summary batching: cache hits never burn batch slots` — second-pass
  summarizeAll on identical bodies asserts ZERO incremental chat calls
  AND `secondPass.cacheHits === firstPass.cacheHits + firstPass.generated`
  (resilient to first-pass body-read errors).

Plus: fake server stub updated to detect batched-summary mode (anchored
on the unique `Write a SINGLE LINE summary` instruction marker) and emit
`{i, summary}` shape responses. Fixed a subtle bug in the existing
batched-prompt detection: it counted bare `N.` line prefixes which
would over-count when a symbol body happened to contain `1. step`
text. Summary-mode now uses `### N.` headers (the summarizer's
distinctive per-item marker).

Suite: 1093 / 13 skip / 0 fail (was 1089).

## Reviewer trail

Two passes. Pass 1: APPROVE + 3 info findings (token budget tight,
test assertion fragile to first-pass errors, missing user config knob)
— all addressed. Pass 2: APPROVE + 1 info finding (public
summarizeAll wrapper didn't thread summaryBatchSize) — addressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@andreinknv
Owner Author

Stack collapsed into wip-stress-base (now at 56fac32). Commits landed via the fast-forward of wip-stress-base from df96c6c to 56fac32; #3 shows as merged, #4-#7 are closed because their tips are already ancestors of the base. Branch preserved for reference.

@andreinknv andreinknv closed this May 1, 2026
@andreinknv andreinknv deleted the pr3-split-provider-and-summary-batching branch May 1, 2026 11:37
andreinknv added a commit that referenced this pull request May 1, 2026
… server-config flags

Tooling-gap backlog (codegraph/docs/codegraph-tooling-gaps.md) closed:
  #1 freshness severity bucket — `classifyFreshness` with fresh|recent|stale|very_stale
  #2 allowStale flag — opt-in bypass for the heavy-drift gate, registry-injected schema
  #3 module format in status — `module-format.ts` parses package.json + tsconfig (JSONC-safe)
  #4 codegraph_imports tool + import-classifier — file/directory/bare/unresolvable filters
  #5 dynamic imports — extractor catches `import('…')` + `require('…')`, incl. template_string
  #6 build-context refs — new `build_context_refs` table for `__dirname` / `import.meta.*`
  #7 files.is_test flag — column populated by glob; surfaced in status as `(N test)`
  colbymchenry#11 summarize-also-embeds (discovered while dogfooding) — `cg.summarizeAll()` chains
       `embedAllSummaries`; new `cg.embedAll()` for embed-only path; CLI `codegraph embed`

CLI/MCP alignment (5/32 → 33+/35):
  - 13 new CLI commands via `runViaMCP` shim: callers, callees, impact, node, similar,
    biomarkers, imports, help-tools, explore, hotspots, dead-code, config-refs, sql-refs,
    module-summary, role, coverage-query, pending-summaries, save-summaries, review-context
  - 7 new MCP tools: codegraph_imports, codegraph_embed, codegraph_summarize, codegraph_sync,
    codegraph_reindex, codegraph_coverage_ingest, codegraph_init, codegraph_uninit,
    codegraph_unlock, codegraph_affected

MCP server-level operator config (`codegraph serve --mcp`):
  - --no-write-tools / --allow-stale-default / --disable-tool (sandboxing)
  - --llm-endpoint / --llm-chat-model / --llm-ask-model / --llm-embedding-model /
    --llm-api-key (operator LLM config; per-project config wins on conflict)
  - New CODEGRAPH_LLM_* env vars wired through `mergeLlmEnv` in resolveLlmProviders

Architectural cleanups:
  - `bypassFreshnessGate` and `isWriteTool` declarative flags on ToolModule (replaces
    growing string-comparison chain in execute())
  - `withAllowStale` registry injection only on tools that DO see the gate
  - DRY of inline copy-paste in 3 hooks → `src/index-hooks/enclosing.ts`
  - `LlmClient.isEmbeddingReachable` for split-provider correctness
  - SyncResult `lockContention` flag → handleSync emits distinct retryable message
  - `clearStructural` deletes from build_context_refs (was orphan-leaking on --force)
  - cli:dev npm script + tsx CLI fixed (web-tree-sitter `import type` for type-only refs)

Migrations:
  023-files-is-test.ts — add `files.is_test`
  024-build-context-refs.ts — add `build_context_refs` table

Reviewer rounds: 11 total, all REQUEST_CHANGES addressed inline. Notable fixes:
  - JSONC URL strip via state machine (was eating `https://` tails)
  - classifyFreshness very_stale now requires isStale (in-sync-but-old → recent)
  - Dynamic imports also match template_string nodes
  - process.exit deferred until after finally cleanup in runViaMCP
  - --same-language / --different-language mutual exclusion guard
  - help-tools CLI bypasses isInitialized (works without a project)
  - handleUninit sweeps projectCache by getProjectRoot (no dangling alias leaks)
  - handleAffected errors instead of silently dropping unsupported glob filters
  - mergeLlmEnv preserves precedence: legacy flat config wins over env-synthesised block

Suite: 1268 passing, 1 expected red (colbymchenry#8 — undecided), 13 skipped, 1 todo, 0 regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 3, 2026
…ion A #5)

Closes the post-refactor backlog item #5: static large_method flags
evergreen-large symbols indistinguishably from actively-growing ones;
recently_grew surfaces the latter as the actionable refactor target.

Architecture:
- Migration 027 adds node_loc_history(node_id, indexed_ts, loc) — one
  row per analyseProject pass that touched the symbol. The per-file
  content-hash short-circuit means unchanged symbols are not
  re-snapshotted; the compound index (node_id, indexed_ts DESC) makes
  "previous snapshot for this node" a sub-millisecond seek.
- src/db/queries-loc-history.ts exports recordLocSnapshots (batch
  insert with the canonical FK-safe SELECT WHERE EXISTS / ON CONFLICT
  pattern from queries-summaries) and getPriorLocSnapshots (per-node
  seek; one prepared stmt, N steps).
- biomarkers/index.ts: at analyseProject start, capture nowMs once;
  per-file batch-fetch prior snapshots, evaluate rules with prior +
  nowMs threaded into RuleContext, persist a fresh snapshot for every
  symbol whose metrics we computed.
- biomarkers/engine.ts: new evaluateRecentlyGrew rule. Triggers when
  prior_loc >= 20 AND current_loc > prior_loc * 1.5 AND
  (now - prior_ts) <= 30d. Severity ladder: 1.5–2x info,
  2–3x warning, ≥3x error. Metric is round(ratio * 100) so the
  ranked-mode sort orders by growth percent.
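
Condensed sketch of the trigger + ladder (the in-tree rule reads its
inputs from RuleContext; names here are illustrative):

    const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

    function evaluateRecentlyGrew(priorLoc: number, priorTs: number,
                                  loc: number, nowMs: number) {
      if (priorLoc < 20 || loc <= priorLoc * 1.5) return null;
      if (nowMs - priorTs > THIRTY_DAYS_MS) return null;
      const ratio = loc / priorLoc;
      const severity = ratio >= 3 ? 'error' : ratio >= 2 ? 'warning' : 'info';
      return { rule: 'recently_grew', severity, metric: Math.round(ratio * 100) };
    }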

Reviewer-memo gates passed:
- #4 schema-version asserts both bumped to 27 (foundation +
  pr19-improvements).
- #5 node_loc_history added to BOTH the migration AND schema.sql so
  fresh installs initialize correctly.
- #7 dead-export check: dropped recordLocSnapshot/getPriorLocSnapshot
  singulars after the reviewer flagged YAGNI; only the batch
  variants are exported (and called).

9 engine-level tests for the rule logic. Full suite 1393/34/0
(+9 from prior). Persistence path is exercised implicitly by every
existing biomarker test (silent green confirms the schema + writes
are sound on real fixtures).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 3, 2026
First half of the trace-logging arc that backs the viewer's
Agent-trace tab. This commit lands the storage + logger; the
MCP server hookup (commit B) and the viewer endpoints (commit C)
follow.

- Migration 028 + schema.sql: mcp_sessions(id, started_ts,
  last_activity_ts, tool_count) + mcp_tool_calls(session_id, step,
  ts, tool_name, args_json, result_summary, duration_ms). FK
  CASCADE from calls → sessions; idx_mcp_tool_calls_ts gives the
  retention sweep an O(log n) scan.
- src/db/queries-trace.ts: insertSession, appendToolCall (transaction
  that bumps the parent session's tool_count atomically),
  pruneToolCalls (cap rows + GC orphan sessions), recentSessions,
  callsForSession.
- src/trace/logger.ts: TraceLogger class. Constructor inserts a
  session row; log() appends + prunes-every-N. All persistence is
  best-effort (DB write failures log at debug, never throw —
  trace must not lose tool results). Args + result-summary are
  truncated at 2000 chars.
- generateSessionId(): unix-seconds prefix + 6 random hex chars,
  sortable by time.
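
A sketch of that ID shape (the separator is a guess; the sortable
unix-seconds prefix is the contract):

    import { randomBytes } from 'node:crypto';

    function generateSessionId(): string {
      const secs = Math.floor(Date.now() / 1000);           // sortable prefix
      return `${secs}-${randomBytes(3).toString('hex')}`;   // 6 hex chars
    }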

Reviewer-memo gates passed:
- #4 schema-version asserts both bumped to 28 (foundation +
  pr19-improvements).
- #5 mcp_sessions + mcp_tool_calls in BOTH the migration AND
  schema.sql.

10 logger round-trip tests. Full suite 1422/34/0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 3, 2026
Second commit of the trace-logging arc. Wraps every
toolHandler.execute() call in a TraceLogger.log() so the call
flows into the mcp_tool_calls table that landed in 027139b.
The viewer's Agent-trace tab (commit C) reads it back.

Lifecycle:
- TraceLogger is created lazily once `cg` is open (via
  ensureTraceLogger() called on each dispatch). Skipped entirely
  when --no-write-tools is passed: the spirit of that flag is "no
  DB writes", and trace logging is a write path.
- log() is contractually best-effort (DB failures swallowed at
  debug). The dispatch site additionally wraps it in try/finally
  so even a contract violation can never strand the tool result —
  sendResult always fires.

Reviewer-memo gates passed:
- #1 docstring rot: ensureTraceLogger and the dispatch comments
  match the implementation.
- #2 best-effort claim is upheld at both layers (logger try/catch
  + caller try/finally).
- #5 N/A — this commit doesn't add tables.

The 3-line wiring isn't separately tested; TraceLogger has 10
round-trip tests covering every behavior the wiring composes,
and the wiring has no branches. Suite 1422/34/0 unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 3, 2026
Closes the cyclomatic + maxNesting "—" gap in the viewer's right
pane. Previously these fields only had a value when a biomarker
finding fired (via the metric on `code_health_findings`); clean
symbols read null. Now every analysable symbol has a row in
node_metrics with all six engine-computed metrics, overwritten
on each pass.

Architecture:
- Migration 029 + node_metrics table: PK on node_id, overwritten
  per pass via a single FK-safe upsert. Distinct from
  code_health_findings (threshold-only) and node_loc_history
  (LoC time-series for recently_grew). Schema in BOTH the
  migration AND schema.sql per memo #5.
- queries-metrics.ts: upsertNodeMetricsBatch (one transaction,
  WHERE EXISTS guard), getNodeMetrics (single-row reader).
- biomarkers/index.ts: evaluateAnalysableNodes returns a third
  map of metrics-by-node; analyseSingleFile persists to the new
  table after the findings persist.
- BIOMARKER_CACHE_KEY bumped v1 → v2 so existing indexes
  self-heal: without the bump, the per-file content-hash cache
  would short-circuit unchanged files and leave node_metrics
  empty until users hit `--force` themselves.
- Viewer server: buildMetricsBlock now reads from node_metrics
  first; falls back to the per-finding metric on older indexes.
  paramCount also surfaced (not yet displayed in the UI).

Smoke-tested on the codegraph self-index after re-index:
extractFromSource now reports cyclomatic 3 / maxNesting 1 /
paramCount 3 — matching what the engine computes during
biomarker analysis.

Reviewer-memo gates passed:
- #4 schema-version asserts both bumped to 29.
- #5 node_metrics in BOTH migration AND schema.sql.

Updated mcp-reindex test (cache key bump). Suite 1425/34/0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 3, 2026
Adds appendMoreHint() helper in shared.ts and wires it into the six
highest-traffic capped tools: callers, callees, search, biomarkers,
hotspots, history. When the tool detects its result was at-or-above
the requested limit, the response gets a tail line:
"> Result capped — pass a higher `limit` to see more."

Lets agents skip a redundant re-fetch when the absence of the hint
confirms they got the complete set, AND signals when re-running with
a higher limit could surface more — backlog #5, scoped narrower than
the literal spec after a verification pass found that almost every
tool already caps results.

Each tool computes hasMore from its own ground truth — exact
rawCount > shownCount for callers/callees/search, heuristic
rows.length >= limit for the SQL-LIMIT tools (biomarkers/hotspots/
history). Hint goes after truncateOutput in every tool so the 15K
budget can't chop it. Multi-match grouped paths in callers/callees
also wired (formatGroupedCallers/Callees now return { text, hasMore }).
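
Sketch of the per-tool call site (formatResults and the count names are
placeholders; the ordering after truncateOutput is the documented one):

    let text = truncateOutput(formatResults(rows));
    const hasMore = rawCount > shownCount;   // or rows.length >= limit (heuristic)
    text = appendMoreHint(text, hasMore);
    // hasMore=false returns text unchanged; hasMore=true appends:
    //   "> Result capped — pass a higher `limit` to see more."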

11 new tests in __tests__/more-hint.test.ts cover the helper and
single-match integration paths; suite 1442/34/0 (was 1431).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 8, 2026
… 040)

Ships the role-classifier cascade after the empirical N=48 ablation
established that docstring-input retains comparable quality to
summary-input (B<->baseline 66.7% vs A<->baseline 68.8% — within
local-LLM noise; eval memory at
project_role_classifier_input_eval_2026_05_08).

Migration 040
  - Adds nodes.role + nodes.role_model + idx_nodes_role
  - Backfills from symbol_summaries.role; idempotent guard
    (nodes.role IS NULL) so a direct re-invoke can't clobber
    classifier-assigned values
  - Pre-016 hand-rolled-test guard via sqlite_master + pragma_table_info
  - schema.sql parity (memo item #5)
  - Old symbol_summaries.role + idx_summaries_role left in place for
    one cycle; future migration drops them

Classifier cascade input
  - Candidate.summary now nullable; new Candidate.docstring
  - descriptionFor(c) picks summary when present, falls back to
    docstring (both .trim()); empty strings count as absent (see the
    sketch after this list)
  - buildSinglePrompt and buildBatchPrompt use the cascade output
    and label the source ("Description (from summary)" vs "Description
    (from docstring)") so the LLM knows the input provenance
  - getClassifiableSummaries -> getClassifiableNodes (LEFT JOIN
    summary, returns nodes where (summary OR docstring) AND no role
    for this model)
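
A sketch of the cascade helper (return shape illustrative; the
trim-and-fallback rule is the documented one):

    function descriptionFor(c: Candidate): { text: string; from: 'summary' | 'docstring' } | null {
      const s = c.summary?.trim();
      if (s) return { text: s, from: 'summary' };
      const d = c.docstring?.trim();
      if (d) return { text: d, from: 'docstring' };
      return null;   // neither present: not classifiable
    }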

Consumer read paths
  - getSymbolRoles / findNodesByRole / getRoleCounts now read from
    nodes.role (not symbol_summaries.role)
  - upsertSymbolRole writes to nodes.role / nodes.role_model
  - All in-tree consumers (codegraph_role tool, status, module tool)
    use the helpers; transparent to them

Test fixtures + 1 new test
  - foundation.test.ts + pr19-improvements.test.ts: schema 39 -> 40
  - llm-tiers.test.ts wipe target: nodes.role (not symbol_summaries.role)
  - New focused test: classifier input cascade — docstring-only node
    gets classified when summary is absent

Docs
  - CLAUDE.md classify line + server-instructions.ts codegraph_role
    description updated to mention the cascade input
  - JSDoc rot purged across 8 sites in classifier.ts /
    codegraph-llm-service.ts / bin/codegraph.ts / mcp/tools/classify.ts
    (all "already-summarised symbols" / "every summarised symbol"
    replaced with cascade-aware wording)

Live measurement on this codegraph corpus (1932 newly-classified
symbols via the cascade, end-to-end run)
  - Roles before: 1414
  - Roles after:  3346  (+1932 = +136% coverage)
  - Via summary path:    1416
  - Via docstring path:  1930  (the cascade-only contribution)
  - Unknown rate: 78/3346 = 2.3%

Suite 1752/0/34 (was 1751; +1 cascade test). Typecheck clean. Two
reviewer rounds: round 1 caught JSDoc rot across 8 sites + flagged
backfill idempotency; round 2 APPROVE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 9, 2026
Four request_changes issues from the independent reviewer pass on
87109da^..ef21afd:

1. schema.sql: add hnsw_meta DDL alongside migration 042 — fresh
   installs that initialize from schema.sql directly were silently
   falling back to vec0 forever (recurring scrutiny pattern #5).

2. codegraph-llm-pass.ts: clear svc.hnswByDim alongside the existing
   embeddingCache.invalidate() in runEmbedPhase. The background embed
   path was leaving stale HNSW indexes in the in-memory query cache
   until the next foreground summarizeAll/embedAll fired the clear.

3. hnsw-build.ts: use Math.abs on the rowCount delta so pure-deletion
   churn (no new inserts, maxRowid unchanged) still triggers a
   rebuild. The growth-only check would have left the index stale
   indefinitely after a large delete.

4. codegraph-llm-service.ts: narrow tryLoadHnswForDim catch to the
   "no such table" shape so genuine DB failures (SQLITE_CORRUPT /
   SQLITE_IOERR) surface instead of being silently swallowed as
   vec0-fallback signals.
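
Fix (4) in sketch form (the load call is a placeholder; the narrowed
predicate is the point):

    try {
      return loadHnswFromTable(db, dim);
    } catch (err) {
      // Missing table = pre-042 index: legitimate vec0 fallback.
      if (err instanceof Error && /no such table/i.test(err.message)) return null;
      throw err;   // SQLITE_CORRUPT / SQLITE_IOERR must surface
    }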

Suite remains 2100/0/34. Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 9, 2026
…rieval

Subtask colbymchenry#10.1 of the embedding-features arc Stage 2. Adds a `grain
TEXT NOT NULL DEFAULT 'symbol'` column to `symbol_embeddings` so the
table can carry per-symbol embeddings (existing) and per-file
aggregated embeddings (forthcoming subtask colbymchenry#10.2). File-grain rows
will reuse the file's nodes(id) since `getEmbeddableNodes` already
excludes file kinds from per-symbol embedding — no PK collision.

Schema change is column-additive (ALTER TABLE ADD COLUMN); no
backfill needed because the default 'symbol' covers every existing
row. schema.sql updated alongside the migration per recurring
scrutiny pattern #5. Schema-version test assertions bumped 42 → 43
per recurring scrutiny pattern #4.

Suite remains 2100/0/34. Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 9, 2026
…#2)

Wires the heuristic commit-intent classifier (shipped in d12ee01) into
a SHA-keyed persistence layer:

- Migration 045: new `commit_intents (sha PK, intent, score, seen_at)`
  table with idx_commit_intents_intent for per-intent queries.
- schema.sql parity (recurring scrutiny pattern #5).
- Schema-version test assertions bumped 44 → 45.
- src/db/queries-commit-intents.ts (new): recordCommitIntent /
  recordCommitIntents (batch upsert with txn) / getCommitIntent /
  getCommitIntents (multi-SHA Map fetch) / aggregateIntentBreakdown
  (returns Record<intent, count> for codegraph_history-style folds) /
  clearCommitIntents (cochange full-rescan path).

Mining-side integration deferred — `mineCoChanges` doesn't currently
emit subjects (only SHAs + file pairs). Adding a subject-capture pass
is tracked as a Stage 7 #2 follow-up.

Suite remains 2374/0/34. Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 10, 2026
Sibling-table approach (a design deferred after earlier-session friction
with the compound-PK rebuild on symbol_embeddings):

- Migration 046: new `symbol_chunk_embeddings (node_id, chunk_idx)` PK
  table, FK CASCADE to nodes(id). chunk_idx >= 1 for chunks; the
  canonical chunk_idx=0 stays in symbol_embeddings.
- schema.sql parity (recurring scrutiny pattern #5).
- Schema-version test bumps 45 → 46.
- src/embeddings/multi-vec.ts (NEW): `chunkSymbol` line-window split
  with overlap (sketch below). Constants:
    LARGE_SYMBOL_LOC_THRESHOLD = 500
    CHUNK_LOC = 200
    CHUNK_OVERLAP = 50
- src/db/queries-chunk-embeddings.ts (NEW): upsertChunkEmbedding /
  getChunkEmbeddingsByNode / clearChunkEmbeddings / countChunkEmbeddings.
- 33 tests in __tests__/multi-vec.test.ts covering chunking boundary
  cases, overlap correctness, persistence, cascade-on-delete.
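
Window-arithmetic sketch of chunkSymbol (the real signature may differ;
constants are the ones listed above):

    const LARGE_SYMBOL_LOC_THRESHOLD = 500;
    const CHUNK_LOC = 200;
    const CHUNK_OVERLAP = 50;

    function chunkSymbol(lines: string[]): string[][] {
      if (lines.length <= LARGE_SYMBOL_LOC_THRESHOLD) return [];  // canonical row only
      const chunks: string[][] = [];
      const stride = CHUNK_LOC - CHUNK_OVERLAP;                   // 150-line step
      for (let start = 0; start < lines.length; start += stride) {
        chunks.push(lines.slice(start, start + CHUNK_LOC));
        if (start + CHUNK_LOC >= lines.length) break;             // final window
      }
      return chunks;
    }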

What's NOT done (deferred to a follow-up PR): runEmbedPhase integration
to actually generate chunk embeddings, findSimilarViaVec multi-vec
retrieval merge, vec0 mirror table for chunks. Foundation is sound;
the wiring is the next step.

Suite: 2374 → 2433 (+33 from new tests + agent-already-bumped vec
tests). Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>