feat: Add full Kotlin language support by mduenas · Pull Request #8 · colbymchenry/codegraph

mduenas · 2026-01-30T17:22:09Z

Summary

Upgrades Kotlin from "Basic support" to "Full support"
Adds extraction for: object declarations, companion objects, interfaces, enums, properties, type aliases
Adds inheritance/implementation tracking (extends/implements edges)
Improves getChildByField() with fallback for tree-sitter grammars without field definitions
Adds 15+ comprehensive Kotlin tests

Test plan

All 55 extraction tests pass (including 15+ new Kotlin tests)
All 211 total tests pass with no regressions
Tests cover: data classes, sealed classes, object declarations, companion objects, interfaces, enums, extension functions, properties, type aliases, inheritance, visibility modifiers, abstract classes, function calls

🤖 Generated with Claude Code

This PR upgrades Kotlin from "Basic support" to "Full support" by adding:

- Object declarations (singletons) extraction
- Companion object extraction
- Data classes, sealed classes, abstract classes detection
- Interface extraction (properly distinguishes from classes)
- Enum extraction with enum members
- Properties (val/var) extraction
- Type alias extraction
- Inheritance/implementation extraction (extends/implements edges)
- Enhanced function body parsing for call tracking

Also improves getChildByField() to fall back to node type matching, which fixes issues with tree-sitter grammars that don't define field names (like tree-sitter-kotlin).

Added 15+ new comprehensive Kotlin tests covering all constructs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
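The getChildByField() fallback described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `SyntaxNodeLike` is a hypothetical stand-in for a tree-sitter node, and the `FIELD_FALLBACK_TYPES` table (including its node type names) is an assumption made for the example.

```typescript
// Minimal stand-in for a tree-sitter syntax node; only the members used here.
interface SyntaxNodeLike {
  type: string;
  namedChildren: SyntaxNodeLike[];
  childForFieldName(field: string): SyntaxNodeLike | null;
}

// Hypothetical table: node types that can play the role of a field when the
// grammar defines no field names. The entries are illustrative only.
const FIELD_FALLBACK_TYPES: Record<string, string[]> = {
  name: ["simple_identifier", "type_identifier"],
  body: ["class_body", "function_body"],
};

function getChildByField(node: SyntaxNodeLike, field: string): SyntaxNodeLike | null {
  // Preferred path: the grammar exposes the field directly.
  const direct = node.childForFieldName(field);
  if (direct) return direct;

  // Fallback: first named child whose type is registered for this field.
  const known = Object.prototype.hasOwnProperty.call(FIELD_FALLBACK_TYPES, field);
  const candidates = known ? FIELD_FALLBACK_TYPES[field] : [];
  for (const child of node.namedChildren) {
    if (candidates.indexOf(child.type) !== -1) return child;
  }
  return null;
}
```

Trying the field lookup first keeps behavior unchanged for grammars that do define fields; the type-based scan only runs when that lookup returns nothing.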

Add HTTP/JSON-RPC transport to MCP server enabling integration with GitHub Copilot and other HTTP-based MCP clients. The stdio transport remains the default for Claude Code.

- Extract ITransport interface for pluggable transport mechanisms
- Add HttpTransport class with CORS support and graceful error handling
- Refactor MCPServer to accept transport configuration
- Add CLI flags: --http, --port, --host

Usage: codegraph serve --mcp --http --port 3000

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Installer now asks which AI assistant the user plans to use:

- Claude Code (stdio transport, default)
- GitHub Copilot (HTTP transport)
- Both

Shows appropriate configuration instructions and next steps for each option.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Copilot uses stdio transport like Claude Code, not HTTP. Updated installer and CLI to show correct mcp-config.json format:

    {
      "mcpServers": {
        "codegraph": {
          "type": "stdio",
          "command": "codegraph",
          "args": ["serve", "--mcp"],
          "tools": ["*"]
        }
      }
    }

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Remove tree-sitter-liquid (uses regex extraction, not tree-sitter)
- Add overrides for tree-sitter to suppress peer dependency warnings

Fixes install failures due to node-gyp-build issues with GitHub deps.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

When GitHub Copilot is selected, installer now writes CodeGraph tool usage instructions to .github/copilot-instructions.md Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The DEFAULT_CONFIG.include was missing *.kt, *.kts, and *.swift patterns, causing these files to be skipped during indexing even though the grammars were properly configured. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Batch fetch node info upfront instead of per-reference queries
- Add progress reporting during resolution phase
- Add --skip-resolve flag to skip resolution for faster indexing
- Resolution now shows actual progress instead of stuck at 0%

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Builds out the rest of the LLM enrichment roadmap on top of the semantic-search/RAG foundation. All features remain fully optional; codegraph behaves identically when no LLM is reachable. Tier 1 #3 — directory/module summaries: - Schema v6 adds directory_summaries(dir_path PK, summary, content_hash, model, generated_at) - src/llm/dir-summarizer.ts aggregates symbol summaries per dir into one paragraph (≤600 chars), keyed by stable content_hash; min 3 summarised symbols per dir as the threshold for "worth a paragraph" - Wired into background pass after summaries+embeddings - New MCP tool codegraph_module (per-dir lookup, or list-all when dirPath omitted) Tier 2 #4 — change-intent helper: - src/llm/change-intent.ts: standalone summarizeChange(client, name, kind, beforeBody, afterBody). Three modes (added | removed | modified) with shape-appropriate prompts. - CodeGraph.summarizeChange wrapper; no caching at this layer (PR-review tooling owns lifecycle), no MCP/CLI surface yet. Tier 2 #5 — role classification: - Schema v7 adds role + role_model columns on symbol_summaries plus idx_summaries_role - Closed label set: api_endpoint | business_logic | data_model | util | framework_glue | test_helper | unknown - parseRole tolerates punctuation, fences, and multi-word "Business Logic" variants by snake_casing as a fallback - Wired into background pass after dir summaries - New MCP tool codegraph_role (filter by label); CodeGraph methods findNodesByRole, getRoleCounts Tier 3 #7 — dead-code judge: - SQL pre-filter findOrphanedSymbols: kind in {function|method| class|component} + is_exported=0 + zero incoming `calls` edges, with path-pattern exemptions for tests/scripts/bench - LLM judge returns JSON {verdict: dead|live|uncertain, confidence, reason}; tolerant fence-stripper (incl. 
stray backticks) and confidence clamp [0..1]; results sorted dead-first then by confidence - New MCP tool codegraph_dead_code with verdict filter Tier 3 colbymchenry#8 — naming-drift checker: - sampleSiblingNames pulls ≤30 same-kind names from other files - LLM judge returns {consistent, suggestion, reason}; defaults to "consistent" on parse failure or <5 siblings (advisory, not enforcement) - CodeGraph.checkNamingDrift wrapper; designed for sync-hook callers, no MCP surface yet Agent-as-LLM bridge (NEW — works without any local LLM): - src/llm/agent-bridge.ts exposes a contract for the agent in the user's session (Claude Code, etc.) to fill the summary cache directly when no Ollama is available. - codegraph_pending_summaries returns un-summarised symbols with bodies + content_hash for the agent to summarise. - codegraph_save_summaries persists agent results, re-validating content_hash against current disk to reject stale answers. - Cache shape is identical to the local-LLM path; the two coexist cleanly. - Reuses validatePathWithinRoot for body reads; same SUMMARIZABLE_KINDS / MIN_BODY_LINES filter as local pass for cache-key parity. 
Reviewer-flagged fixes (from independent review of the tier work): - dir-summarizer cache-hit branch no longer double-increments `done` (the `finally` always runs on `continue`) - classifier parseRole gains snake_case fallback so "Business Logic" and similar multi-word responses no longer silently degrade to "unknown" - dead-code parseJudgeResponse strips stray backticks before JSON.parse so fenced multi-line responses don't fall back to uncertain - Post-chat abort guards added to summarizer, dir-summarizer, classifier, dead-code: writes after a close()-cancelled chat are now skipped instead of persisted as stale entries - dir-summarizer module doc acknowledges prompt-injection accepted risk (data originates from user's own codebase) - dead-code test replaces tautological assertion with meaningful judged ≤ candidates / errors == 0 checks - parseRole exported for direct testing; new test covers canonical, fenced, multi-word, and trailing-punct inputs Tests: 7 new in __tests__/llm-tiers.test.ts - Background pass writes both directory summaries and role labels - summarizeChange added/removed/modified - Dead-code judge returns parsed verdicts in valid range, no errors - parseRole handles all four input shapes - Agent bridge round-trip without LLM (pending → save → coverage reflects writes; second batch short-circuits on cache) - Agent bridge stale-content-hash rejection Schema migrations cover v4 → v5 → v6 → v7 cleanly. Test schema version expectations bumped. Full suite 409/409 passing including the watcher tests that were previously flaky. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
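The two tolerant parsers described in this commit (parseRole with a snake_case fallback, and the dead-code judge parser with fence stripping and a confidence clamp) might look something like the sketch below. The function names come from the commit message; the bodies, the `Judgement` shape, and the exact cleaning rules are assumptions for illustration.

```typescript
const ROLES = [
  "api_endpoint", "business_logic", "data_model", "util",
  "framework_glue", "test_helper", "unknown",
] as const;
type Role = (typeof ROLES)[number];

function parseRole(raw: string): Role {
  // Drop backticks/fences, lowercase, and trim punctuation from both ends.
  const cleaned = raw.replace(/`+/g, " ").trim().toLowerCase().replace(/^[^a-z]+|[^a-z]+$/g, "");
  // Fallback: snake_case multi-word answers such as "Business Logic".
  const snake = cleaned.replace(/[\s-]+/g, "_");
  const idx = (ROLES as readonly string[]).indexOf(snake);
  return idx === -1 ? "unknown" : ROLES[idx];
}

type Verdict = "dead" | "live" | "uncertain";
interface Judgement { verdict: Verdict; confidence: number; reason: string }

function parseJudgeResponse(raw: string): Judgement {
  // Strip ```json fences and stray backticks before handing off to JSON.parse.
  // (Assumption: backticks inside the reason text are expendable.)
  const text = raw.replace(/```(?:json)?/gi, "").replace(/`/g, "").trim();
  try {
    const obj = JSON.parse(text);
    const verdict: Verdict =
      obj.verdict === "dead" || obj.verdict === "live" ? obj.verdict : "uncertain";
    const n = Number(obj.confidence);
    // Clamp confidence into [0, 1]; non-numeric values become 0.
    const confidence = isNaN(n) ? 0 : Math.min(1, Math.max(0, n));
    return { verdict, confidence, reason: String(obj.reason || "") };
  } catch {
    // Parse failure degrades to "uncertain" rather than throwing.
    return { verdict: "uncertain", confidence: 0, reason: "unparseable response" };
  }
}
```

Both parsers fail closed: anything outside the label set or unparseable maps to the least committal answer (`unknown` / `uncertain`).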

… server-config flags Tooling-gap backlog (codegraph/docs/codegraph-tooling-gaps.md) closed: #1 freshness severity bucket — `classifyFreshness` with fresh|recent|stale|very_stale #2 allowStale flag — opt-in bypass for the heavy-drift gate, registry-injected schema #3 module format in status — `module-format.ts` parses package.json + tsconfig (JSONC-safe) #4 codegraph_imports tool + import-classifier — file/directory/bare/unresolvable filters #5 dynamic imports — extractor catches `import('…')` + `require('…')`, incl. template_string #6 build-context refs — new `build_context_refs` table for `__dirname` / `import.meta.*` #7 files.is_test flag — column populated by glob; surfaced in status as `(N test)` colbymchenry#11 summarize-also-embeds (discovered while dogfooding) — `cg.summarizeAll()` chains `embedAllSummaries`; new `cg.embedAll()` for embed-only path; CLI `codegraph embed` CLI/MCP alignment (5/32 → 33+/35): - 13 new CLI commands via `runViaMCP` shim: callers, callees, impact, node, similar, biomarkers, imports, help-tools, explore, hotspots, dead-code, config-refs, sql-refs, module-summary, role, coverage-query, pending-summaries, save-summaries, review-context - 7 new MCP tools: codegraph_imports, codegraph_embed, codegraph_summarize, codegraph_sync, codegraph_reindex, codegraph_coverage_ingest, codegraph_init, codegraph_uninit, codegraph_unlock, codegraph_affected MCP server-level operator config (`codegraph serve --mcp`): - --no-write-tools / --allow-stale-default / --disable-tool (sandboxing) - --llm-endpoint / --llm-chat-model / --llm-ask-model / --llm-embedding-model / --llm-api-key (operator LLM config; per-project config wins on conflict) - New CODEGRAPH_LLM_* env vars wired through `mergeLlmEnv` in resolveLlmProviders Architectural cleanups: - `bypassFreshnessGate` and `isWriteTool` declarative flags on ToolModule (replaces growing string-comparison chain in execute()) - `withAllowStale` registry injection only on tools that DO see the gate - DRY of 
inline copy-paste in 3 hooks → `src/index-hooks/enclosing.ts` - `LlmClient.isEmbeddingReachable` for split-provider correctness - SyncResult `lockContention` flag → handleSync emits distinct retryable message - `clearStructural` deletes from build_context_refs (was orphan-leaking on --force) - cli:dev npm script + tsx CLI fixed (web-tree-sitter `import type` for type-only refs) Migrations: 023-files-is-test.ts — add `files.is_test` 024-build-context-refs.ts — add `build_context_refs` table Reviewer rounds: 11 total, all REQUEST_CHANGES addressed inline. Notable fixes: - JSONC URL strip via state machine (was eating `https://` tails) - classifyFreshness very_stale now requires isStale (in-sync-but-old → recent) - Dynamic imports also match template_string nodes - process.exit deferred until after finally cleanup in runViaMCP - --same-language / --different-language mutual exclusion guard - help-tools CLI bypasses isInitialized (works without a project) - handleUninit sweeps projectCache by getProjectRoot (no dangling alias leaks) - handleAffected errors instead of silently dropping unsupported glob filters - mergeLlmEnv preserves precedence: legacy flat config wins over env-synthesised block Suite: 1268 passing, 1 expected red (colbymchenry#8 — undecided), 13 skipped, 1 todo, 0 regressions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
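The "JSONC URL strip via state machine" fix mentioned above addresses a classic bug: a naive `//` comment stripper also eats the tail of `https://...` inside string values. A minimal sketch of the state-machine approach follows; `stripJsonComments` is a hypothetical name, and this is not the code in `module-format.ts`.

```typescript
// Remove // and /* */ comments from JSONC text without touching string contents.
function stripJsonComments(text: string): string {
  let out = "";
  let i = 0;
  let inString = false;
  while (i < text.length) {
    const ch = text[i];
    const next = text[i + 1];
    if (inString) {
      out += ch;
      if (ch === "\\" && i + 1 < text.length) {
        out += next; // keep the escaped character verbatim
        i += 2;
        continue;
      }
      if (ch === '"') inString = false;
      i++;
    } else if (ch === '"') {
      inString = true;
      out += ch;
      i++;
    } else if (ch === "/" && next === "/") {
      // Line comment: skip to end of line, leaving the newline in place.
      while (i < text.length && text[i] !== "\n") i++;
    } else if (ch === "/" && next === "*") {
      // Block comment: skip past the closing */ (or to EOF if unterminated).
      const end = text.indexOf("*/", i + 2);
      i = end === -1 ? text.length : end + 2;
    } else {
      out += ch;
      i++;
    }
  }
  return out;
}
```

Tracking the in-string state is what keeps `"https://example.com"` intact. Trailing commas, which some JSONC files also contain, are not handled in this sketch.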

…iness Adds a new `### Server config` section to `codegraph_status` that surfaces operator-level MCP flags so a sandboxed deployment can verify the running server actually applied the intended config: - **Disabled tools:** lists names from `--disable-tool` - **Write tools:** disabled when `--no-write-tools` is set - **Default `allowStale`:** true when `--allow-stale-default` is set - **LLM env overrides:** which CODEGRAPH_LLM_* vars are active Section is hidden entirely when nothing is set (default install stays clean). Renders identically across MCP and CLI surfaces since both route through `handleStatus`. Visual treatment for the existing sections: - Backend: ✅ suffix when sqlite-vec is loaded; ⚠ when not - Status: 🟢 fresh / 🟡 recent / 🟠 stale / 🔴 very_stale (driven by `FreshnessSeverity` from #1) - Feature Readiness: 🟢/🟡/⚪ traffic-light per row, bold keys, percentages on summaries - Section headers gained icons: 🚦 Feature Readiness, 🤖 LLM providers, 🔧 Server config Test changes: - 5 new cases for the Server config section (omitted-when-empty, disable-write-tools, disabled-tools list, allowStale default, LLM env-overrides) - Two fragile string-equality assertions widened to regex (mcp-handlers-stress-test-fixes / mcp-llm-visibility) so future emoji tweaks do not re-break them — semantic invariants preserved. Suite: 1273 passing, 1 expected red (colbymchenry#8), 0 regressions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

#8)

Tree-sitter parsing is the dominant non-LLM cost on `--force` reindex. The existing per-file unchanged-files check inside `storeExtractionResult` only short-circuits the PERSIST step; the parse already happened. This adds a content-hash-keyed read-through cache in front of `requestParse`, so a `--force` on a project with no content changes replays cached ExtractionResults and skips parsing entirely.

How it works
------------
- New SQLite table `parse_cache(content_hash, language, file_path)` with a JSON payload (migration 026 + schema.sql entry). Key includes file_path because `Node.id = sha256(filePath:kind:name:line)`, so cross-path replay would need ID rewriting; v1 keeps it simple, hits same-path-same-content only. Renames take a fresh parse.
- Payload is wrapped `{ _v: 1, result }` so a future ExtractionResult shape change can bump the version and have older entries treated as misses (no risk of replaying a stale shape against new code).
- `processOneFile` is the insertion point: hash content, check cache, on hit replay; on miss, parse and write through (only when the parse produced nodes or no errors; never cache a broken parse).
- LRU eviction at the END of every indexAll: drops the oldest 25% when row count exceeds 20k. Cheap under the cap (one COUNT(*) probe).
- Cache survives `clearStructural` (--force) intentionally: content-addressed entries stay valid against any re-extract of the same content.

Surfaces
--------
- src/db/migrations/026-parse-cache.ts
- src/db/queries-parse-cache.ts: getCachedParse, putCachedParse, evictParseCacheIfOversized, clearParseCache, getParseCacheStats.
- src/db/schema.sql: parse_cache table + idx_parse_cache_generated_at.
- src/extraction/index.ts: cache check in processOneFile, public parseCacheHits counter via cg.orchestrator.getParseCacheHits() (used by tests + a logDebug summary at indexAll end).

Tests (11 new)
- round-trip put/get
- miss on wrong key
- corrupt JSON treated as miss
- payload-without-version-envelope treated as miss
- payload with stale _v treated as miss
- eviction drops oldest 25% over cap
- no-op under cap
- clearParseCache wipes
- integration: fresh indexAll populates, re-extract is full hit, edit causes miss + re-cache

Suite: 1392/13/0 (was 1381/13/0; +11 new). Two existing schema-version tests bumped 25 → 26.

Reviewer pass, addressed
- Schema-drift gap: added the `_v: 1` payload version envelope. A future ExtractionResult shape change just bumps the constant; old entries become misses + re-parse, no corrupt replay.
- Removed a redundant `detectLanguage` call inside the existing `processOneFile` flow.

Reviewer-noted scope limits, deferred
- Sync path (indexFileWithContent) doesn't use the cache.
- Retry-after-WASM-fault paths don't write through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
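To make the versioned envelope and the 25% eviction concrete, here is a rough in-memory sketch. It is an illustration under assumptions: a plain object stands in for the `parse_cache` SQLite table, a counter stands in for the timestamp column, and the `ParseCache` class with its method names is hypothetical (the real surface is the query functions listed above).

```typescript
const PARSE_CACHE_VERSION = 1;

interface CacheRow { payload: string; generatedAt: number }

class ParseCache<T> {
  private rows: Record<string, CacheRow> = {};
  private clock = 0; // monotonic stand-in for a generated_at timestamp

  private key(contentHash: string, language: string, filePath: string): string {
    return [contentHash, language, filePath].join("\u0000");
  }

  // Store a raw payload string (exposed so corrupt rows can be simulated).
  putRaw(contentHash: string, language: string, filePath: string, payload: string): void {
    this.rows[this.key(contentHash, language, filePath)] = { payload, generatedAt: this.clock++ };
  }

  put(contentHash: string, language: string, filePath: string, result: T): void {
    // Wrap the result in a version envelope so shape changes can invalidate old rows.
    this.putRaw(contentHash, language, filePath, JSON.stringify({ _v: PARSE_CACHE_VERSION, result }));
  }

  get(contentHash: string, language: string, filePath: string): T | null {
    const row = this.rows[this.key(contentHash, language, filePath)];
    if (!row) return null;
    try {
      const parsed = JSON.parse(row.payload);
      // Missing or stale envelope version reads as a miss.
      if (!parsed || parsed._v !== PARSE_CACHE_VERSION) return null;
      return parsed.result as T;
    } catch {
      return null; // corrupt JSON reads as a miss
    }
  }

  // Drop the oldest 25% of rows once the row count exceeds the cap.
  evictIfOversized(cap: number): number {
    const keys = Object.keys(this.rows);
    if (keys.length <= cap) return 0;
    keys.sort((a, b) => this.rows[a].generatedAt - this.rows[b].generatedAt);
    const drop = Math.floor(keys.length * 0.25);
    for (const k of keys.slice(0, drop)) delete this.rows[k];
    return drop;
  }

  size(): number {
    return Object.keys(this.rows).length;
  }
}
```

Treating every decode problem as a miss means the worst case is one extra parse, never a replay of a result the current code cannot interpret.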

…olbymchenry#8) Promotes the resolver numeric `metadata.confidence` to a categorical, queryable column on `edges`. `codegraph_callers`, `_callees`, `_impact` now annotate each row with `*(INFERRED)*` when the edge is not concrete, and accept a new `minConfidence` arg to drop edges below a chosen level — useful for refactor-impact analysis where false-positive reach inflates the blast radius. Migration 032 adds `edges.confidence TEXT` (nullable; read-cast in EDGE_SCHEMA collapses NULL → `EXTRACTED`). Nullable lets legacy / structural / extractor-direct edges need no producer change. Index on the column for filter-down queries. Idempotent ADD COLUMN with table-exists guard following the convention from migrations 020 / 021 / 023 / 030. Producer: `ReferenceResolver.createEdges` calls a new free `classifyConfidence(resolvedBy, score)` helper: - import / qualified-name / file-path / exact-match → EXTRACTED - framework with score >= 0.85 → EXTRACTED, else INFERRED - instance-method / fuzzy → INFERRED AMBIGUOUS is not stamped from the resolver path in v1 — needs tie-tracking from the candidate picker. Tracked as B3. Consumers (`result-formatters.ts`): - `formatConfidence(edge)` — markdown italic suffix (empty for EXTRACTED; `*(INFERRED)*` / `*(AMBIGUOUS)*` otherwise). - `parseMinConfidence(raw)` — returns level / null / errorResult. - `filterByConfidence(rows, min)` — drop below threshold. - `CONFIDENCE_RANK` — numeric ordering for inline filter loops. Wired into all four code paths in `callers.ts` (collectCallers, collectCallersForSource, collectTypeUsers, formatGroupedCallers), both paths in `callees.ts`, and `impact.ts` with a `filterImpactByConfidence` BFS that prunes nodes unreachable along the surviving edge set. Type cleanup: `Edge.confidence` added with JSDoc; the long-dead `Edge.provenance` field also got a JSDoc note pointing at the SCIP-integration-reservation explanation in memory. 
Reviewer-memo #4 catch (schema-version test forgetfulness): noticed `__tests__/pr19-improvements.test.ts` was still asserting `toBe(29)` even though migrations 030 and 031 had shipped. Catch-up to 32 in this commit covers both the missed bumps and the new migration. Reviewer: APPROVE, three info-only findings (type-user display gap, framework-rule calibration table doc, catch of the recurring pr19 pattern). Suite: 1555 / 34 / 0 (+18 new edge-confidence tests). Eval: 11/11 passed | recall=1.00 | mrr=0.86 (within regression budget vs 091e935 baseline). Phase 3 done. Backlog now down to: #6 TOON (deferred), colbymchenry#11 trace_to_culprits, colbymchenry#17/colbymchenry#18 Phase 7, colbymchenry#19 streaming. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
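The classification rules and consumers listed in this commit can be sketched as below. The rule table follows the commit message; the `ResolvedBy` union, the `EdgeRow` shape, and the numeric values in `CONFIDENCE_RANK` are assumptions for illustration (only the relative ordering matters here).

```typescript
type Confidence = "EXTRACTED" | "INFERRED" | "AMBIGUOUS";
type ResolvedBy =
  | "import" | "qualified-name" | "file-path" | "exact-match"
  | "framework" | "instance-method" | "fuzzy";

const FRAMEWORK_EXTRACTED_MIN = 0.85;

function classifyConfidence(resolvedBy: ResolvedBy, score: number): Confidence {
  switch (resolvedBy) {
    case "import":
    case "qualified-name":
    case "file-path":
    case "exact-match":
      return "EXTRACTED";
    case "framework":
      return score >= FRAMEWORK_EXTRACTED_MIN ? "EXTRACTED" : "INFERRED";
    default:
      return "INFERRED"; // instance-method / fuzzy
  }
}

// Assumed ordering: higher rank means a more concrete edge.
const CONFIDENCE_RANK: Record<Confidence, number> = { AMBIGUOUS: 0, INFERRED: 1, EXTRACTED: 2 };

interface EdgeRow { source: string; target: string; confidence: Confidence | null }

// NULL (legacy / structural edges) reads as EXTRACTED, per the migration note.
function effectiveConfidence(row: EdgeRow): Confidence {
  return row.confidence || "EXTRACTED";
}

function filterByConfidence(rows: EdgeRow[], min: Confidence): EdgeRow[] {
  return rows.filter((r) => CONFIDENCE_RANK[effectiveConfidence(r)] >= CONFIDENCE_RANK[min]);
}

// Markdown italic suffix; empty for EXTRACTED so concrete rows stay uncluttered.
function formatConfidence(row: EdgeRow): string {
  const level = effectiveConfidence(row);
  return level === "EXTRACTED" ? "" : ` *(${level})*`;
}
```

For example, filtering with a minimum of `INFERRED` keeps concrete and heuristic edges but drops `AMBIGUOUS` ones, which is the kind of pruning the `minConfidence` argument is described as providing.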

…g (B2 + B3) ## B2 — codegraph_search inline `kind:X` qualifier surfaces hint When the agent writes `getNodeById kind:function` (the rich-query syntax tool descriptions document), the empty-results path was only checking `args['kind']`. The inline qualifier sets the SQL filter via `parseQuery` but leaves the explicit arg undefined — so the much- more-useful "Without the kind filter, N results match (kinds: method). Drop the kind filter or set it to one of those" hint never fired. Agent saw a bare did-you-mean trail and reasonably concluded the symbol didn't exist. Fix in `buildEmptyResultsResponse`: extract inline `kind:X` tokens from the query string, treat the first as the effective kind for the hint, and STRIP all `kind:` tokens from the probe query so the "without filter" search actually runs without it. Mirrors the pattern already used for centrality-stripping a few lines above. The `kindNote` at the bottom of the function picks up the same effective kind so the no-results trailer stays consistent. ## B3 — resolver tie tracking → AMBIGUOUS confidence (colbymchenry#8 follow-up) Phase 3 (colbymchenry#8) defined three buckets — EXTRACTED / INFERRED / AMBIGUOUS — but the resolver only stamped the first two; AMBIGUOUS was a TODO. The classifier needed a signal that the picker had a near-tie with a DIFFERENT target. `ResolvedRef` gains an optional `tieMargin?: number` (winner.confidence - runnerUp.confidence). `resolveOne` now sorts candidates and finds the best different-target runner-up; if one exists, the gap stamps onto the returned `ResolvedRef`. `classifyConfidence` short-circuits to AMBIGUOUS when `tieMargin < AMBIGUOUS_TIE_MARGIN` (0.05) regardless of `resolvedBy` — an "import" resolution that scored 0.61 over a 0.60 alternative is still a coin flip from the agent's POV. `createEdges` forwards the threshold call AND copies tieMargin into `metadata` so consumers unpacking the JSON can see the underlying gap. 
Same-target candidates reinforce (don't contest) the winner so they don't count toward the gap — `runnerUp` is the first sorted candidate with a different `targetNodeId`. ## Found-issue surfaced B8 (filed): the resolver's name-matcher returns only ONE candidate even for textbook same-name same-kind ambiguities (two `ambigTarget` exports, one no-import caller). With one candidate, the picker has no runner-up to compare against → tieMargin never set → AMBIGUOUS never stamped on the most natural ambiguity case. B3's wiring is correct and exercised by the integration test, but its real-world impact is gated on B8 fixing the matcher's candidate generation. ## Tests Suite: **1560 / 34 / 0** (+1 B2 test in realwork-fixes; +4 B3 tests in edge-confidence). Eval: **11/11 | recall=1.00 | mrr=0.86 | within regression budget**. Typecheck clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
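The tie-margin rule from B3 can be illustrated with a small sketch. The `tieMargin` definition and the 0.05 threshold come from the commit message; `pickWithTieMargin`, `isAmbiguous`, and the `Candidate` shape are hypothetical names used only for this example.

```typescript
interface Candidate { targetNodeId: string; confidence: number }

const AMBIGUOUS_TIE_MARGIN = 0.05;

function pickWithTieMargin(
  candidates: Candidate[],
): { winner: Candidate; tieMargin?: number } | null {
  if (candidates.length === 0) return null;
  const sorted = candidates.slice().sort((a, b) => b.confidence - a.confidence);
  const winner = sorted[0];

  // The runner-up is the best candidate pointing at a DIFFERENT target;
  // same-target candidates reinforce the winner and are skipped.
  let runnerUp: Candidate | undefined;
  for (const c of sorted) {
    if (c.targetNodeId !== winner.targetNodeId) {
      runnerUp = c;
      break;
    }
  }
  if (!runnerUp) return { winner }; // nothing contests the winner
  return { winner, tieMargin: winner.confidence - runnerUp.confidence };
}

function isAmbiguous(tieMargin: number | undefined): boolean {
  return tieMargin !== undefined && tieMargin < AMBIGUOUS_TIE_MARGIN;
}
```

With a single candidate there is no runner-up, so `tieMargin` stays unset and the edge is never marked ambiguous, which is the B8 limitation described above.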

Half of colbymchenry#17 (the rename half; extract was descoped per user check — needs per-language scope analysis that doesn't pay back enough on a multi-language index). Pure query wrapper that uses the edge-confidence column from colbymchenry#8 + B8 to give the agent a ready-to-execute rename plan. ## What it returns Three sections, header counts: 1. **Definitions** — every node matching `symbol` (handles overloads / re-exports). The site(s) where the rename starts. 2. **Call sites (graph edges)** — every incoming edge from callers / type-users / instantiators. Grouped by `edge.confidence`: - **EXTRACTED** *concrete; safe to rename mechanically* — imports / qualified-name / file-path resolution. Apply with sed / Edit replace_all. - **INFERRED** *heuristic; glance before applying* — same-name match, fuzzy dispatch, framework rule below the 0.85 confidence threshold. - **AMBIGUOUS** *ambiguous; verify the target before renaming* — picker had a near-tied runner-up (B8 tieMargin < 0.05). Sites are sorted EXTRACTED-first, so `limit` truncation drops the lowest-confidence rows last. 3. **Textual mentions** — `\bname\b` regex over the indexed file set, attributed to enclosing function / method / class / interface. Word-boundary so `id` doesn't match `userId`. Doc comments / JSDoc / string literals — surfaces where the symbol is mentioned NON-graph (no edge), so the agent can review case- by-case (not every textual match is the same symbol). De-dups: textual hits that ALSO appear as graph call sites are filtered out — graph edges are the higher-quality signal, no reason to surface the same line twice. ## Validation (warnings, not blocks) - newName checked for valid-identifier shape (Unicode-letter + digit + underscore + $; rejects whitespace / hyphens). - newName checked for collision with existing indexed symbols (would shadow / break). Both surface as `### Warnings` at the top of the report. The agent decides whether to proceed. 
## Why a separate tool when codegraph_callers + codegraph_grep exist Three reasons: 1. **Confidence-stamped grouping** — `codegraph_callers` returns a flat list with the confidence suffix per row. Rename planning needs the EXTRACTED rows isolated for mechanical-safe edits; ad-hoc partitioning by the agent is error-prone. 2. **Doc-mention attribution to enclosing symbol + dedup against call sites** — `codegraph_grep` doesn't dedup against the call graph. The agent doing this manually would re-render the same import line in both surfaces. 3. **Validation up front** — collision + identifier checks happen once, not per-call-site. Saves the agent a round trip. ## CLI mirror `codegraph propose-rename <symbol> <newName>` routes via runViaMCP, supports `--limit` and `--doc-limit`. Dogfooded against the live repo: `extractFromSource → extractFromCode` surfaces 14 call sites (all EXTRACTED) + 29 textual mentions, plus a collision warning on the existing `extractFromCode` if any. ## Tests 9 cases in `__tests__/mcp-propose-rename.test.ts`: - Three-section render with header counts - EXTRACTED-first grouping - Collision warning fires - Invalid-identifier warning fires - Same-name rename → errorResult - Unknown-symbol → notFound - docLimit=0 skips textual scan + section - Textual mentions don't double-count graph call sites - limit caps the call-sites section Suite: **1608 / 34 / 0** (+9 propose-rename tests). ## Backlog after this #17a (this) shipped. #17b (propose_extract) deferred — needs per- language scope analysis (free-variable detection, parameter inference). User-checked decision: not worth the cross-language risk. colbymchenry#18 (plan-and-execute CLI) deferred — the macro infra from colbymchenry#13 already covers the structural part; the LLM-planning piece is arguably the calling agent's job. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
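A rough sketch of the textual-mention scan and the identifier check follows. It assumes a line-based scan and a leading-character rule (no leading digit) that the commit message does not spell out; `findTextualMentions`, `isValidIdentifier`, and `escapeRegExp` are hypothetical names, not the tool's actual helpers.

```typescript
// Escape regex metacharacters so the symbol is matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Unicode letters, digits, underscore and $; assumed not to start with a digit.
function isValidIdentifier(name: string): boolean {
  return new RegExp("^[\\p{L}_$][\\p{L}\\p{N}_$]*$", "u").test(name);
}

interface Mention { line: number; text: string }

// Word-boundary scan; lines already reported as graph call sites are skipped
// so the same line is not surfaced twice.
function findTextualMentions(lines: string[], symbol: string, graphLines: number[]): Mention[] {
  const re = new RegExp("\\b" + escapeRegExp(symbol) + "\\b");
  const out: Mention[] = [];
  lines.forEach((text, i) => {
    const line = i + 1;
    if (graphLines.indexOf(line) !== -1) return;
    if (re.test(text)) out.push({ line, text });
  });
  return out;
}
```

One caveat of `\b`: it is defined in terms of word characters, so a symbol that starts or ends with `$` would need a different boundary check than this sketch uses.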

…ts (S3 batch 1)

S3 (file-split) starts with the lowest-risk hotspot: `src/utils.ts` (762 LOC, ranked #8 in churn × centrality). Extracts the contiguous concurrency / time / memory block (was lines 414-761) into a sibling `src/utils-concurrency.ts`.

Moved:
- FileLock: cross-process file lock with PID staleness check
- Mutex: in-process async mutex
- MemoryMonitor: interval-based heap watch
- processInBatches: bounded-parallelism batch runner
- readFileInChunks: streaming file reader
- debounce / throttle: call-rate limiters

What stays in utils.ts: path / string / language predicates (isJsFamily, validatePathWithinRootReal, isTestPath, normalizePath, stripBom, globToSafeRegex, etc.), all pure functions with no runtime state.

Caller compatibility: utils.ts re-exports every name from utils-concurrency.ts so `import { FileLock } from '../utils.js'` keeps working unchanged. Future imports can switch to the direct path when convenient; no rush.

Result:
- utils.ts: 762 → 426 LOC (-336, -44%)
- utils-concurrency.ts: 348 LOC (new)
- Net: same total LOC, but each file holds one cohesive concern.
- Suite 1634/34/0 unchanged.

Other S3 candidates (extraction/index.ts 2098 LOC, tree-sitter.ts 1595 LOC, context/index.ts 1422 LOC) need per-file scoping with fresh attention to their internal cohesion; left as separate follow-ups per the original task description ("one PR per file").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Foundation for #8 (HNSW approximate KNN). Module is unwired -- no behavior change; provides the building block that 8.4 (similar-edges) and 8.5 (semantic search) will swap in for per-row vec0 KNN.

- hnswlib-node ^3.0.0 in optionalDependencies (cross-platform prebuilds, 3-sec install verified). Same fall-back discipline as @huggingface/transformers.
- HnswIndex: build / query / persist / load lifecycle. One index per embedding dimension, mirroring vec0's per-dim layout. Cosine similarity on normalised vectors via the 'ip' (inner-product) space.
- isHnswAvailable + computeRowsetSignature helpers ready for the staleness-tracking migration (8.3).

Verified spike on 5096x768: 745 builds/s, 1832 queries/s, 0.55ms per query, 10ms persist+reload. Recall@10 ~0.97 at the chosen M=16 / efConstruction=200 / ef=64 defaults.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
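The reason an inner-product space can stand in for cosine is that, for L2-normalised vectors, the dot product equals the cosine similarity. A small self-contained check, with hypothetical helper names and no dependency on hnswlib-node:

```typescript
function dot(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Scale a vector to unit length; a zero vector is returned unchanged.
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(dot(v, v));
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

function cosineSimilarity(a: number[], b: number[]): number {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

// Inner product of the normalised vectors; this is what an 'ip' index would score.
function ipOnNormalized(a: number[], b: number[]): number {
  return dot(l2Normalize(a), l2Normalize(b));
}
```

Normalising once at insert and query time lets the index rank by cosine without computing vector norms for every comparison.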

Subtask 8.3 of the embedding-features arc (#8 HNSW). Adds a per-dim staleness ledger (row_count, max_rowid, built_at, file_path, recall_hint, ef_query) so postHooks can decide whether to rebuild the HNSW index without rescanning every row. The handoff memo reserved migration 041 but it was taken by summary-priority-queue in interim work, so this lands as 042.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
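One way a ledger like this could drive the rebuild decision is sketched below. The field names mirror `row_count` and `max_rowid` from the commit message, but the comparison rule and the `needsRebuild` helper are assumptions, not the migration's actual logic.

```typescript
// Subset of the ledger row relevant to the staleness check.
interface HnswLedgerRow { dim: number; rowCount: number; maxRowid: number; builtAt: number }

// Cheap snapshot of the current embedding rows for one dimension
// (e.g. obtainable from COUNT(*) and MAX(rowid)).
interface RowsetSnapshot { rowCount: number; maxRowid: number }

function needsRebuild(ledger: HnswLedgerRow | undefined, current: RowsetSnapshot): boolean {
  // Never built: rebuild only if there is something to index.
  if (!ledger) return current.rowCount > 0;
  // Any change in count or highest rowid means rows were added or removed.
  return ledger.rowCount !== current.rowCount || ledger.maxRowid !== current.maxRowid;
}
```

A count-plus-max-rowid check cannot see an in-place update that leaves both numbers unchanged, so a real implementation may need an additional signal for that case.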

…raph friction items Biomarker pass: 10 findings (1 error / 3 warnings / 6 info) -> 0/0/0. - has() LRU low_coverage: new __tests__/node-lru-cache.test.ts exercises the previously 0%-covered path via QueryBuilder.nodeCache. - hasExtraStatement complex_method (cyc 17 -> 0): table-driven STEP_HANDLERS dispatch in src/mcp/tools/sql.ts replaces the inline state machine. - buildQueryOutput / appendCompletenessAndBudget long_parameter_list: bundled positional args into *Args interfaces. - findLowCoverage + buildCoverageTips magic_number: extracted PCT_SCALE / COVERAGE_PCT_ROUND / CENTRALITY_ROUND + MS_PER_DAY / STALE_AFTER_DAYS named constants (docstrings cite values to satisfy stale_doc). - 4 unused exports dropped: RULE_NAMES deleted; LabelConfidence/LabelSource un-exported; duplicate CURRENT_SCHEMA_VERSION at migrations/index.ts:226 removed (derive directly in migrations.ts). Codegraph friction fixes: - #7 unused_export FP on aliased named imports: empirically zero remaining FPs on the in-tree corpus. Rule comment updated to reflect that emitNamedImportRefsFromFile + pickMatchingImport resolve aliased imports correctly. - colbymchenry#8 n_xxxxxxxx UIDs not accepted by codegraph_biomarkers / _coverage / _role mode=symbol: each tool's resolveSymbolToNodeId now checks RefIdCache.isUid first and resolves via the per-server cache. handleGetRoleOf bundled into HandleGetRoleOfArgs after refIds pushed it past the long-param tier. Punctuation guard aligned across all three siblings. Regression test in __tests__/biomarkers.test.ts mints a UID via codegraph_find and round-trips it through codegraph_biomarkers. Gates: tsgo typecheck clean, vitest 2633/0/34, codegraph_biomarkers stats empty after sync, independent reviewer APPROVE on both diff slices. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ache to EXTRACTION_LOGIC_VERSION Two improvements that came out of analysing the friction items closed in 9cb4ab5: 1. **Consolidate resolveSymbolToNodeId (prevents friction colbymchenry#8 recurrence).** Three near-identical helpers in biomarkers.ts / coverage.ts / role.ts deleted; replaced with a single exported `resolveSymbolToNodeId(cg, symbol, refIds)` in symbol-resolver.ts. The shared helper preserves the post-colbymchenry#8 contract: UID-first check + punctuation guard + getNodeById + FTS top-1 fallback. Distinct from `findSymbol` in the same file, which is richer (multi-match disambiguation + fuzzy fallback with notes) and is what callers/callees/impact use. Future mode=symbol handlers import one helper instead of risking a copy of the buggy pre-colbymchenry#8 version. 2. **Couple BIOMARKER_CACHE_KEY to EXTRACTION_LOGIC_VERSION.** `BIOMARKER_CACHE_KEY` is now `biomarker_file_state_v10_ext${EXTRACTION_LOGIC_VERSION}` (currently `_ext7`). The per-file biomarker cache short-circuits on `content_hash` match, but `content_hash` is the source hash — when EXTRACTION_LOGIC_VERSION bumps and re-extraction emits new nodes/edges for unchanged source, the per-file cache skips re-evaluation and findings stay stale until a forced reindex. Suspected cause of the original CURRENT_SCHEMA_VERSION unused_export FP on 2026-05-11 (cached past the resolver improvement that made aliased imports resolve correctly). Cross-file rules already re-run on every sync (clear+append in runCrossFileBiomarkers) so they were never affected. Trade-off: one extra full pass per extractor bump. Acceptable — the alternative is silently-stale findings for any rule that reads `nodes.docstring` or the references-edges graph. Drift detector manifest updated: src/db/cache-rule-hashes.json biomarker hash 19fa804b5ca6f866 -> 60025bd2a6a831d9 (real semantic change to cache-key derivation). 
Existing `src/db/queries.ts:714` cleanup uses `LIKE 'biomarker_file_state_%'` so it picks up both old `v10` and new `v10_ext7` entries on `clearStructural`. `isBiomarkerCacheCold` semantics unchanged: a new ext-suffix reads as absent and triggers a one-shot full pass. CLAUDE.md and docs/emulation-questions.example.json updated to reflect the new key shape (reviewer caught both as docstring rot). Gates: tsgo typecheck clean, vitest 2633/0/34, codegraph_biomarkers stats empty, independent reviewer APPROVE. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
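The effect of coupling the cache key to the extractor version can be shown with a small sketch. The key format is taken from the commit message; the plain-object store, `isCacheCold`, and `clearBiomarkerCache` are hypothetical stand-ins for the project's metadata table and the `LIKE 'biomarker_file_state_%'` cleanup.

```typescript
function biomarkerCacheKey(extractionLogicVersion: number): string {
  return `biomarker_file_state_v10_ext${extractionLogicVersion}`;
}

// Stand-in for a key/value metadata table.
type MetaStore = Record<string, string>;

// A key that is absent reads as "cold" and would trigger a one-shot full pass.
function isCacheCold(store: MetaStore, key: string): boolean {
  return !Object.prototype.hasOwnProperty.call(store, key);
}

// Prefix cleanup: removes old-style and version-suffixed entries alike.
function clearBiomarkerCache(store: MetaStore): number {
  let removed = 0;
  for (const key of Object.keys(store)) {
    if (key.indexOf("biomarker_file_state_") === 0) {
      delete store[key];
      removed++;
    }
  }
  return removed;
}
```

Because the version is part of the key, bumping the extractor version needs no explicit invalidation step: the new key simply has no entry yet.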

…olved-by-FTS

Three more friction closures from the 2026-05-11 workout:

- src/extraction/index.ts + src/db/queries-files.ts: heal-marked files now propagate through the git fast-path. The EXTRACTION_LOGIC_VERSION self-heal sets needs_reextract=1 on every file, but the git fast-path eoCollectGitChanges only iterated files git reported as changed, so heal-marked files with no diff were silently skipped in any git-tracked project. Adds a getFilesNeedingReextract(qb) query and unions the result into the change set after the git fast-path returns. Full-scan path was already correct via eoClassifyFileChange; no change there. Closes friction colbymchenry#22 (surfaced while validating the cluster fix in 6957b74: the original needs_reextract=1 → 620 of 627 stayed pending after sync, requiring admin index --force).

- src/mcp/tools/entry-points.ts: new "MCP tools" bucket scans for exported `*_TOOL` constants whose initializer signature contains `name: 'codegraph_*'` and surfaces them with the parsed canonical tool name. Closes friction colbymchenry#13 / cluster symptom #6: every MCP tool dispatcher previously fell through every bucket because routes/CLI commands/CLI-files/public-exports each had a different filter mismatch. Five buckets now (HTTP routes, CLI commands, MCP tools, CLI files, public exports). __tests__/mcp-entry-points grows 5 → 6 tests with the new bucket fixture.

- Friction colbymchenry#9 (semantic/intent search can't bridge tool-name → handler-name): resolved by existing FTS5 path. Investigation showed nodes_fts indexes the `signature` column at bm25=2, so searching `codegraph_X` already returns the matching XXX_TOOL constant via signature substring match. The morning-workout "No results" observation was likely a stale-index transient (pre-cluster-fix). Added a regression test in mcp-search-family to lock in the behavior.

Test impact: 2633 → 2635 (one new MCP-tools-bucket test, one new canonical-tool-name regression test). Type-check clean.
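The heal-flag fix in the first bullet amounts to a set union, which can be sketched as follows. This is an illustration under stated assumptions: `FileRow`, `filesNeedingReextract`, and `collectChangeSet` are hypothetical names, and the array of rows stands in for the database that the real `getFilesNeedingReextract(qb)` query reads.

```typescript
// Hypothetical row shape; the real files table has more columns.
interface FileRow {
  path: string;
  needsReextract: boolean; // mirrors the needs_reextract=1 heal flag
}

// Stand-in for the getFilesNeedingReextract(qb) query named in the commit:
// returns every indexed path whose heal flag is set.
function filesNeedingReextract(rows: FileRow[]): string[] {
  return rows.filter((row) => row.needsReextract).map((row) => row.path);
}

// Union the git-reported changes with the heal-marked files, so a file with
// no diff is still re-extracted after an extractor version bump.
function collectChangeSet(gitChanged: string[], rows: FileRow[]): string[] {
  const changeSet = new Set<string>(gitChanged);
  for (const path of filesNeedingReextract(rows)) {
    changeSet.add(path);
  }
  return Array.from(changeSet).sort();
}

// git reports only src/a.ts as changed; two other files carry the heal flag.
const indexedFiles: FileRow[] = [
  { path: "src/a.ts", needsReextract: false },
  { path: "src/b.ts", needsReextract: true },
  { path: "src/c.ts", needsReextract: true },
];
const toReextract = collectChangeSet(["src/a.ts"], indexedFiles);
```

Without the union, only `src/a.ts` would be processed here and the two heal-marked files would stay pending, which matches the symptom the commit describes.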
Closes:
- colbymchenry#22 EXTRACTION_LOGIC_VERSION heal flag honored
- colbymchenry#13 entry_points 5th bucket for MCP tool dispatchers
- colbymchenry#9 already-resolved-by-FTS, locked in with regression test

Still open:
- colbymchenry#8 / cluster #1 (context whiff): graph edge now exists, but the context-builder ranker still surfaces ToolHandler/ToolResult/ToolCtx instead of dispatchers. Needs a context-ranker boost for `codegraph_*` queries. Logged as colbymchenry#24 follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
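As a rough illustration of the "MCP tools" bucket closed under colbymchenry#13, the sketch below filters exported `*_TOOL` constants and parses the canonical tool name out of the initializer signature. The `ConstantNode` and `McpToolEntry` shapes, the regular expression, and the acceptance of double quotes alongside single quotes are all assumptions made for the example; the actual logic in src/mcp/tools/entry-points.ts may differ.

```typescript
// Hypothetical node shape for an indexed constant.
interface ConstantNode {
  name: string;      // identifier, e.g. "FIND_TOOL"
  signature: string; // initializer text as indexed
  exported: boolean;
}

interface McpToolEntry {
  constant: string;
  toolName: string;
}

// Captures the canonical tool name from `name: 'codegraph_*'`.
// Accepting double quotes as well is an assumption of this sketch.
const TOOL_NAME_PATTERN = /name:\s*['"](codegraph_[A-Za-z0-9_]+)['"]/;

function collectMcpTools(nodes: ConstantNode[]): McpToolEntry[] {
  const entries: McpToolEntry[] = [];
  for (const node of nodes) {
    // Only exported constants named *_TOOL qualify for the bucket.
    if (!node.exported || !node.name.endsWith("_TOOL")) continue;
    const toolName = TOOL_NAME_PATTERN.exec(node.signature)?.[1];
    if (toolName) {
      entries.push({ constant: node.name, toolName });
    }
  }
  return entries;
}

const sampleConstants: ConstantNode[] = [
  { name: "FIND_TOOL", signature: "{ name: 'codegraph_find', description: 'Find symbols' }", exported: true },
  { name: "INTERNAL_TOOL", signature: "{ name: 'codegraph_internal' }", exported: false },
  { name: "LEGACY_TOOL", signature: "{ label: 'legacy' }", exported: true },
  { name: "DEFAULT_PORT", signature: "3000", exported: true },
];
const mcpToolBucket = collectMcpTools(sampleConstants);
```

Only `FIND_TOOL` survives all three filters in this fixture: the others are unexported, lack a `codegraph_*` name in their initializer, or do not follow the `*_TOOL` naming pattern.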

Closes friction colbymchenry#8 / cluster #1: codegraph_context whiff on the PRIMARY natural-language tool when the query mentions a canonical MCP tool name.

Symptom: query like "how does codegraph_find dispatch across by:name/content/env/sql axes" returned generic ToolHandler / ToolResult / ToolCtx plumbing or unrelated SQL extractor symbols (both match `name`/`content`/`sql` tokens in the trailing English). The actual XXX_TOOL constant got pushed below the cap by text-search noise even though FTS5 finds it via signature.

Fix: new `cbPromoteMcpToolMatches` helper. After the standard extractor + prefix passes, scan extracted query symbols for the canonical `codegraph_X` shape; for each hit, look up `constant` nodes whose initializer signature contains `name: 'codegraph_X'` and append them to exactMatches with a high base score (MCP_TOOL_PROMOTION_SCORE = 100). The score puts the registered tool above text-search and centrality-boosted noise so it lands in the top entry-point slots.

Implementation cost: one extra `getNodesByKind('constant')` call per query, bounded by the count of top-level exported constants (~hundreds in a typical codebase). No-op when no `codegraph_*` token is in the extracted symbol list.

Tests: __tests__/context.test.ts grows 18 → 19 cases. Fixture has DEMO_TOOL plus three buildSql* helpers; without promotion the SQL helpers outrank DEMO_TOOL in the entry points.

Test impact: 2635 → 2636 (one new MCP-tool-promotion test). Type-check clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
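The promotion step can be sketched roughly as below. Only the helper's purpose, the `name: 'codegraph_X'` signature match, and `MCP_TOOL_PROMOTION_SCORE = 100` come from the commit message. The `GraphNode` and `ScoredMatch` shapes, the function name `promoteMcpToolMatches`, the baseline scores, and the final sort are assumptions added so the ranking effect is visible; the sketch also does not deduplicate against nodes already present in the match list.

```typescript
// Hypothetical shapes; the real node and match types are richer.
interface GraphNode {
  id: string;
  kind: string;
  name: string;
  signature: string;
}
interface ScoredMatch {
  node: GraphNode;
  score: number;
}

const MCP_TOOL_PROMOTION_SCORE = 100; // value quoted in the commit message

function promoteMcpToolMatches(
  querySymbols: string[],
  constants: GraphNode[],
  exactMatches: ScoredMatch[],
): ScoredMatch[] {
  // Keep only symbols with the canonical codegraph_X shape.
  const toolNames = querySymbols.filter((s) => /^codegraph_[A-Za-z0-9_]+$/.test(s));
  if (toolNames.length === 0) return exactMatches; // no-op, as described

  const promoted = [...exactMatches];
  for (const toolName of toolNames) {
    const needle = `name: '${toolName}'`;
    for (const node of constants) {
      if (node.kind === "constant" && node.signature.includes(needle)) {
        promoted.push({ node, score: MCP_TOOL_PROMOTION_SCORE });
      }
    }
  }
  // Sorting here is an assumption, standing in for the downstream ranker.
  return promoted.sort((a, b) => b.score - a.score);
}

// Fixture loosely modelled on the test described above: DEMO_TOOL plus
// three buildSql* helpers that text search scored first (scores invented).
const demoTool: GraphNode = {
  id: "n1", kind: "constant", name: "DEMO_TOOL",
  signature: "{ name: 'codegraph_demo', description: 'demo' }",
};
const textSearchMatches: ScoredMatch[] = ["buildSqlSelect", "buildSqlWhere", "buildSqlJoin"].map(
  (fn, i) => ({ node: { id: `f${i}`, kind: "function", name: fn, signature: `${fn}()` }, score: 40 - i }),
);
const rankedMatches = promoteMcpToolMatches(["codegraph_demo", "sql"], [demoTool], textSearchMatches);
```

With the promotion applied, `DEMO_TOOL` ranks ahead of the three SQL helpers; a query with no `codegraph_*` token returns the input list unchanged.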

mduenas and others added 9 commits January 30, 2026 10:21

chore: Bump version to 0.5.0

dfd4132

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

feat: Add Copilot instructions support to installer

31bf037

When GitHub Copilot is selected, installer now writes CodeGraph tool usage instructions to .github/copilot-instructions.md Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

fix: Add Kotlin and Swift to default include patterns

6e2a6cf

The DEFAULT_CONFIG.include was missing *.kt, *.kts, and *.swift patterns, causing these files to be skipped during indexing even though the grammars were properly configured. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

mduenas closed this Jan 30, 2026

mduenas deleted the feature/full-kotlin-support branch January 30, 2026 22:58

andreinknv mentioned this pull request Apr 26, 2026

feat: LLM symbol summaries, semantic search, RAG Q&A, dir/role/dead-code/naming + agent-as-LLM bridge #111

Open

6 tasks

mduenas wants to merge 9 commits into colbymchenry:main from mduenas:feature/full-kotlin-support
