feat: Add full Kotlin language support #8
Closed
mduenas wants to merge 9 commits into
This PR upgrades Kotlin from "Basic support" to "Full support" by adding:
- Object declarations (singletons) extraction
- Companion object extraction
- Data classes, sealed classes, abstract classes detection
- Interface extraction (properly distinguishes from classes)
- Enum extraction with enum members
- Properties (val/var) extraction
- Type alias extraction
- Inheritance/implementation extraction (extends/implements edges)
- Enhanced function body parsing for call tracking
Also improves getChildByField() to fall back to node type matching, which fixes issues with tree-sitter grammars that don't define field names (like tree-sitter-kotlin).
Added 15+ new comprehensive Kotlin tests covering all constructs.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add HTTP/JSON-RPC transport to MCP server enabling integration with GitHub Copilot and other HTTP-based MCP clients. The stdio transport remains the default for Claude Code.
- Extract ITransport interface for pluggable transport mechanisms
- Add HttpTransport class with CORS support and graceful error handling
- Refactor MCPServer to accept transport configuration
- Add CLI flags: --http, --port, --host
Usage: codegraph serve --mcp --http --port 3000
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Installer now asks which AI assistant the user plans to use:
- Claude Code (stdio transport, default)
- GitHub Copilot (HTTP transport)
- Both
Shows appropriate configuration instructions and next steps for each option.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copilot uses stdio transport like Claude Code, not HTTP. Updated
installer and CLI to show correct mcp-config.json format:
{
  "mcpServers": {
    "codegraph": {
      "type": "stdio",
      "command": "codegraph",
      "args": ["serve", "--mcp"],
      "tools": ["*"]
    }
  }
}
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove tree-sitter-liquid (uses regex extraction, not tree-sitter)
- Add overrides for tree-sitter to suppress peer dependency warnings
Fixes install failures due to node-gyp-build issues with GitHub deps.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When GitHub Copilot is selected, installer now writes CodeGraph tool usage instructions to .github/copilot-instructions.md Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The DEFAULT_CONFIG.include was missing *.kt, *.kts, and *.swift patterns, causing these files to be skipped during indexing even though the grammars were properly configured. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Batch fetch node info upfront instead of per-reference queries
- Add progress reporting during resolution phase
- Add --skip-resolve flag to skip resolution for faster indexing
- Resolution now shows actual progress instead of stuck at 0%
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
andreinknv
added a commit
to andreinknv/codegraph
that referenced
this pull request
Apr 26, 2026
Builds out the rest of the LLM enrichment roadmap on top of the semantic-search/RAG foundation. All features remain fully optional; codegraph behaves identically when no LLM is reachable. Tier 1 #3 — directory/module summaries: - Schema v6 adds directory_summaries(dir_path PK, summary, content_hash, model, generated_at) - src/llm/dir-summarizer.ts aggregates symbol summaries per dir into one paragraph (≤600 chars), keyed by stable content_hash; min 3 summarised symbols per dir as the threshold for "worth a paragraph" - Wired into background pass after summaries+embeddings - New MCP tool codegraph_module (per-dir lookup, or list-all when dirPath omitted) Tier 2 #4 — change-intent helper: - src/llm/change-intent.ts: standalone summarizeChange(client, name, kind, beforeBody, afterBody). Three modes (added | removed | modified) with shape-appropriate prompts. - CodeGraph.summarizeChange wrapper; no caching at this layer (PR-review tooling owns lifecycle), no MCP/CLI surface yet. Tier 2 #5 — role classification: - Schema v7 adds role + role_model columns on symbol_summaries plus idx_summaries_role - Closed label set: api_endpoint | business_logic | data_model | util | framework_glue | test_helper | unknown - parseRole tolerates punctuation, fences, and multi-word "Business Logic" variants by snake_casing as a fallback - Wired into background pass after dir summaries - New MCP tool codegraph_role (filter by label); CodeGraph methods findNodesByRole, getRoleCounts Tier 3 #7 — dead-code judge: - SQL pre-filter findOrphanedSymbols: kind in {function|method| class|component} + is_exported=0 + zero incoming `calls` edges, with path-pattern exemptions for tests/scripts/bench - LLM judge returns JSON {verdict: dead|live|uncertain, confidence, reason}; tolerant fence-stripper (incl. 
stray backticks) and confidence clamp [0..1]; results sorted dead-first then by confidence - New MCP tool codegraph_dead_code with verdict filter Tier 3 colbymchenry#8 — naming-drift checker: - sampleSiblingNames pulls ≤30 same-kind names from other files - LLM judge returns {consistent, suggestion, reason}; defaults to "consistent" on parse failure or <5 siblings (advisory, not enforcement) - CodeGraph.checkNamingDrift wrapper; designed for sync-hook callers, no MCP surface yet Agent-as-LLM bridge (NEW — works without any local LLM): - src/llm/agent-bridge.ts exposes a contract for the agent in the user's session (Claude Code, etc.) to fill the summary cache directly when no Ollama is available. - codegraph_pending_summaries returns un-summarised symbols with bodies + content_hash for the agent to summarise. - codegraph_save_summaries persists agent results, re-validating content_hash against current disk to reject stale answers. - Cache shape is identical to the local-LLM path; the two coexist cleanly. - Reuses validatePathWithinRoot for body reads; same SUMMARIZABLE_KINDS / MIN_BODY_LINES filter as local pass for cache-key parity. 
Reviewer-flagged fixes (from independent review of the tier work): - dir-summarizer cache-hit branch no longer double-increments `done` (the `finally` always runs on `continue`) - classifier parseRole gains snake_case fallback so "Business Logic" and similar multi-word responses no longer silently degrade to "unknown" - dead-code parseJudgeResponse strips stray backticks before JSON.parse so fenced multi-line responses don't fall back to uncertain - Post-chat abort guards added to summarizer, dir-summarizer, classifier, dead-code: writes after a close()-cancelled chat are now skipped instead of persisted as stale entries - dir-summarizer module doc acknowledges prompt-injection accepted risk (data originates from user's own codebase) - dead-code test replaces tautological assertion with meaningful judged ≤ candidates / errors == 0 checks - parseRole exported for direct testing; new test covers canonical, fenced, multi-word, and trailing-punct inputs Tests: 7 new in __tests__/llm-tiers.test.ts - Background pass writes both directory summaries and role labels - summarizeChange added/removed/modified - Dead-code judge returns parsed verdicts in valid range, no errors - parseRole handles all four input shapes - Agent bridge round-trip without LLM (pending → save → coverage reflects writes; second batch short-circuits on cache) - Agent bridge stale-content-hash rejection Schema migrations cover v4 → v5 → v6 → v7 cleanly. Test schema version expectations bumped. Full suite 409/409 passing including the watcher tests that were previously flaky. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
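The snake_case fallback described for `parseRole` can be illustrated with a minimal sketch. Only the closed label set is taken from the commit message; the function body is a hypothetical re-implementation and the real `parseRole` may differ (for example in how it treats fences that carry a language tag).

```typescript
// Closed label set quoted from the commit message.
const ROLES = [
  "api_endpoint",
  "business_logic",
  "data_model",
  "util",
  "framework_glue",
  "test_helper",
  "unknown",
] as const;
type Role = (typeof ROLES)[number];

// Hypothetical sketch, not the project's actual implementation.
function parseRole(raw: string): Role {
  // Replace backticks (code fences, stray ticks) with spaces, then trim and lowercase.
  const cleaned = raw.replace(/`+/g, " ").trim().toLowerCase();
  // Snake-case fallback: collapse every run of non-alphanumerics into "_"
  // and drop leading/trailing underscores left by punctuation.
  const snake = cleaned.replace(/[^a-z0-9]+/g, "_").replace(/^_+|_+$/g, "");
  return (ROLES as readonly string[]).includes(snake) ? (snake as Role) : "unknown";
}
```

Under these assumptions a response such as `Business Logic.` maps onto `business_logic` instead of degrading to `unknown`, while anything outside the label set still falls back to `unknown`.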
andreinknv
added a commit
to andreinknv/codegraph
that referenced
this pull request
May 1, 2026
… server-config flags Tooling-gap backlog (codegraph/docs/codegraph-tooling-gaps.md) closed: #1 freshness severity bucket — `classifyFreshness` with fresh|recent|stale|very_stale #2 allowStale flag — opt-in bypass for the heavy-drift gate, registry-injected schema #3 module format in status — `module-format.ts` parses package.json + tsconfig (JSONC-safe) #4 codegraph_imports tool + import-classifier — file/directory/bare/unresolvable filters #5 dynamic imports — extractor catches `import('…')` + `require('…')`, incl. template_string #6 build-context refs — new `build_context_refs` table for `__dirname` / `import.meta.*` #7 files.is_test flag — column populated by glob; surfaced in status as `(N test)` colbymchenry#11 summarize-also-embeds (discovered while dogfooding) — `cg.summarizeAll()` chains `embedAllSummaries`; new `cg.embedAll()` for embed-only path; CLI `codegraph embed` CLI/MCP alignment (5/32 → 33+/35): - 13 new CLI commands via `runViaMCP` shim: callers, callees, impact, node, similar, biomarkers, imports, help-tools, explore, hotspots, dead-code, config-refs, sql-refs, module-summary, role, coverage-query, pending-summaries, save-summaries, review-context - 7 new MCP tools: codegraph_imports, codegraph_embed, codegraph_summarize, codegraph_sync, codegraph_reindex, codegraph_coverage_ingest, codegraph_init, codegraph_uninit, codegraph_unlock, codegraph_affected MCP server-level operator config (`codegraph serve --mcp`): - --no-write-tools / --allow-stale-default / --disable-tool (sandboxing) - --llm-endpoint / --llm-chat-model / --llm-ask-model / --llm-embedding-model / --llm-api-key (operator LLM config; per-project config wins on conflict) - New CODEGRAPH_LLM_* env vars wired through `mergeLlmEnv` in resolveLlmProviders Architectural cleanups: - `bypassFreshnessGate` and `isWriteTool` declarative flags on ToolModule (replaces growing string-comparison chain in execute()) - `withAllowStale` registry injection only on tools that DO see the gate - DRY of 
inline copy-paste in 3 hooks → `src/index-hooks/enclosing.ts` - `LlmClient.isEmbeddingReachable` for split-provider correctness - SyncResult `lockContention` flag → handleSync emits distinct retryable message - `clearStructural` deletes from build_context_refs (was orphan-leaking on --force) - cli:dev npm script + tsx CLI fixed (web-tree-sitter `import type` for type-only refs) Migrations: 023-files-is-test.ts — add `files.is_test` 024-build-context-refs.ts — add `build_context_refs` table Reviewer rounds: 11 total, all REQUEST_CHANGES addressed inline. Notable fixes: - JSONC URL strip via state machine (was eating `https://` tails) - classifyFreshness very_stale now requires isStale (in-sync-but-old → recent) - Dynamic imports also match template_string nodes - process.exit deferred until after finally cleanup in runViaMCP - --same-language / --different-language mutual exclusion guard - help-tools CLI bypasses isInitialized (works without a project) - handleUninit sweeps projectCache by getProjectRoot (no dangling alias leaks) - handleAffected errors instead of silently dropping unsupported glob filters - mergeLlmEnv preserves precedence: legacy flat config wins over env-synthesised block Suite: 1268 passing, 1 expected red (colbymchenry#8 — undecided), 13 skipped, 1 todo, 0 regressions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
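The "JSONC URL strip via state machine" fix mentioned above addresses a classic pitfall: a naive `//` comment regex also eats the tail of `https://` inside string values. A minimal sketch of the state-machine approach follows; `stripJsonComments` is a hypothetical name, and this version handles comments only (not trailing commas), so it should be read as an illustration rather than the code in `module-format.ts`.

```typescript
// Hypothetical helper: removes // and /* */ comments from JSONC text while
// leaving comment-like sequences inside string literals untouched.
function stripJsonComments(text: string): string {
  let out = "";
  let i = 0;
  let inString = false;
  while (i < text.length) {
    const ch = text.charAt(i);
    const next = text.charAt(i + 1); // "" when past the end
    if (inString) {
      out += ch;
      if (ch === "\\" && next !== "") {
        // Copy the escaped character verbatim so \" does not close the string.
        out += next;
        i += 2;
        continue;
      }
      if (ch === '"') inString = false;
      i += 1;
    } else if (ch === '"') {
      inString = true;
      out += ch;
      i += 1;
    } else if (ch === "/" && next === "/") {
      // Line comment: skip up to (not including) the newline.
      while (i < text.length && text.charAt(i) !== "\n") i += 1;
    } else if (ch === "/" && next === "*") {
      // Block comment: skip past the closing */ (or to end of input).
      const end = text.indexOf("*/", i + 2);
      i = end === -1 ? text.length : end + 2;
    } else {
      out += ch;
      i += 1;
    }
  }
  return out;
}
```

Tracking the in-string state is what keeps a value such as `"https://example.com"` intact; the comment branches are only reachable outside string literals.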
andreinknv
added a commit
to andreinknv/codegraph
that referenced
this pull request
May 1, 2026
…iness
Adds a new `### Server config` section to `codegraph_status` that
surfaces operator-level MCP flags so a sandboxed deployment can verify
the running server actually applied the intended config:
- **Disabled tools:** lists names from `--disable-tool`
- **Write tools:** disabled when `--no-write-tools` is set
- **Default `allowStale`:** true when `--allow-stale-default` is set
- **LLM env overrides:** which CODEGRAPH_LLM_* vars are active
Section is hidden entirely when nothing is set (default install stays
clean). Renders identically across MCP and CLI surfaces since both
route through `handleStatus`.
Visual treatment for the existing sections:
- Backend: ✅ suffix when sqlite-vec is loaded; ⚠ when not
- Status: 🟢 fresh / 🟡 recent / 🟠 stale / 🔴 very_stale
(driven by `FreshnessSeverity` from #1)
- Feature Readiness: 🟢/🟡/⚪ traffic-light per row, bold keys,
percentages on summaries
- Section headers gained icons: 🚦 Feature Readiness, 🤖 LLM providers,
🔧 Server config
Test changes:
- 5 new cases for the Server config section (omitted-when-empty,
disable-write-tools, disabled-tools list, allowStale default,
LLM env-overrides)
- Two fragile string-equality assertions widened to regex
(mcp-handlers-stress-test-fixes / mcp-llm-visibility) so future
emoji tweaks do not re-break them — semantic invariants preserved.
Suite: 1273 passing, 1 expected red (colbymchenry#8), 0 regressions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv
added a commit
to andreinknv/codegraph
that referenced
this pull request
May 3, 2026
colbymchenry#8) Tree-sitter parsing is the dominant non-LLM cost on `--force` reindex. The existing per-file unchanged-files check inside `storeExtractionResult` only short-circuits the PERSIST step — the parse already happened. This adds a content-hash-keyed read-through cache in front of `requestParse`, so a `--force` on a project with no content changes replays cached ExtractionResults and skips parsing entirely. How it works ------------ - New SQLite table `parse_cache(content_hash, language, file_path)` with a JSON payload (migration 026 + schema.sql entry). Key includes file_path because `Node.id = sha256(filePath:kind:name:line)`, so cross-path replay would need ID rewriting; v1 keeps it simple, hits same-path-same-content only. Renames take a fresh parse. - Payload is wrapped `{ _v: 1, result }` so a future ExtractionResult shape change can bump the version and have older entries treated as misses (no risk of replaying a stale shape against new code). - `processOneFile` is the insertion point: hash content, check cache, on hit replay; on miss, parse and write through (only when the parse produced nodes or no errors — never cache a broken parse). - LRU eviction at the END of every indexAll: drops the oldest 25% when row count exceeds 20k. Cheap under the cap (one COUNT(*) probe). - Cache survives `clearStructural` (--force) intentionally — content-addressed entries stay valid against any re-extract of the same content. Surfaces -------- - src/db/migrations/026-parse-cache.ts - src/db/queries-parse-cache.ts: getCachedParse, putCachedParse, evictParseCacheIfOversized, clearParseCache, getParseCacheStats. - src/db/schema.sql: parse_cache table + idx_parse_cache_generated_at. - src/extraction/index.ts: cache check in processOneFile, public parseCacheHits counter via cg.orchestrator.getParseCacheHits() (used by tests + a logDebug summary at indexAll end). 
Tests (11 new) - round-trip put/get; miss on wrong key; corrupt JSON treated as miss; payload-without-version-envelope treated as miss; payload with stale _v treated as miss; eviction drops oldest 25% over cap; no-op under cap; clearParseCache wipes; integration: fresh indexAll populates, re-extract is full hit, edit causes miss + re-cache. Suite: 1392/13/0 (was 1381/13/0; +11 new). Two existing schema-version tests bumped 25 → 26. Reviewer pass — addressed - Schema-drift gap: added the `_v: 1` payload version envelope. A future ExtractionResult shape change just bumps the constant; old entries become misses + re-parse, no corrupt replay. - Removed a redundant `detectLanguage` call inside the existing `processOneFile` flow. Reviewer-noted scope limits, deferred - Sync path (indexFileWithContent) doesn't use the cache. - Retry-after-WASM-fault paths don't write through. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
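The version-envelope behaviour described above (same-path-same-content hits, corrupt or stale payloads treated as misses, broken parses never cached) can be sketched with an in-memory map standing in for the `parse_cache` table. The class, the simplified `ExtractionResult` shape, and the `putRaw` test hook are assumptions made for illustration; they are not the functions in `src/db/queries-parse-cache.ts`.

```typescript
const PAYLOAD_VERSION = 1;

// Simplified stand-in for the real ExtractionResult shape.
interface ExtractionResult {
  nodes: string[];
  errors: string[];
}

// Hypothetical in-memory analogue of the parse_cache table.
class ParseCache {
  private rows = new Map<string, string>();

  // The key includes file_path, so only same-path-same-content replays hit.
  private key(contentHash: string, language: string, filePath: string): string {
    return [contentHash, language, filePath].join("\u0000");
  }

  put(contentHash: string, language: string, filePath: string, result: ExtractionResult): void {
    // Write through only when the parse produced nodes or reported no errors.
    if (result.nodes.length === 0 && result.errors.length > 0) return;
    const payload = JSON.stringify({ _v: PAYLOAD_VERSION, result });
    this.rows.set(this.key(contentHash, language, filePath), payload);
  }

  // Test hook: store an arbitrary payload to simulate corruption or old versions.
  putRaw(contentHash: string, language: string, filePath: string, payload: string): void {
    this.rows.set(this.key(contentHash, language, filePath), payload);
  }

  get(contentHash: string, language: string, filePath: string): ExtractionResult | null {
    const raw = this.rows.get(this.key(contentHash, language, filePath));
    if (raw === undefined) return null;
    try {
      const parsed: unknown = JSON.parse(raw);
      if (typeof parsed !== "object" || parsed === null) return null;
      const envelope = parsed as { _v?: unknown; result?: unknown };
      // Missing or mismatched version envelope counts as a miss.
      if (envelope._v !== PAYLOAD_VERSION || envelope.result === undefined) return null;
      return envelope.result as ExtractionResult;
    } catch {
      return null; // corrupt JSON is treated as a miss
    }
  }
}
```

Treating every unreadable entry as a miss means the worst case is a fresh parse, never a replay of a stale shape against new code.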
andreinknv
added a commit
to andreinknv/codegraph
that referenced
this pull request
May 3, 2026
…olbymchenry#8) Promotes the resolver numeric `metadata.confidence` to a categorical, queryable column on `edges`. `codegraph_callers`, `_callees`, `_impact` now annotate each row with `*(INFERRED)*` when the edge is not concrete, and accept a new `minConfidence` arg to drop edges below a chosen level — useful for refactor-impact analysis where false-positive reach inflates the blast radius. Migration 032 adds `edges.confidence TEXT` (nullable; read-cast in EDGE_SCHEMA collapses NULL → `EXTRACTED`). Nullable lets legacy / structural / extractor-direct edges need no producer change. Index on the column for filter-down queries. Idempotent ADD COLUMN with table-exists guard following the convention from migrations 020 / 021 / 023 / 030. Producer: `ReferenceResolver.createEdges` calls a new free `classifyConfidence(resolvedBy, score)` helper: - import / qualified-name / file-path / exact-match → EXTRACTED - framework with score >= 0.85 → EXTRACTED, else INFERRED - instance-method / fuzzy → INFERRED AMBIGUOUS is not stamped from the resolver path in v1 — needs tie-tracking from the candidate picker. Tracked as B3. Consumers (`result-formatters.ts`): - `formatConfidence(edge)` — markdown italic suffix (empty for EXTRACTED; `*(INFERRED)*` / `*(AMBIGUOUS)*` otherwise). - `parseMinConfidence(raw)` — returns level / null / errorResult. - `filterByConfidence(rows, min)` — drop below threshold. - `CONFIDENCE_RANK` — numeric ordering for inline filter loops. Wired into all four code paths in `callers.ts` (collectCallers, collectCallersForSource, collectTypeUsers, formatGroupedCallers), both paths in `callees.ts`, and `impact.ts` with a `filterImpactByConfidence` BFS that prunes nodes unreachable along the surviving edge set. Type cleanup: `Edge.confidence` added with JSDoc; the long-dead `Edge.provenance` field also got a JSDoc note pointing at the SCIP-integration-reservation explanation in memory. 
Reviewer-memo #4 catch (schema-version test forgetfulness): noticed `__tests__/pr19-improvements.test.ts` was still asserting `toBe(29)` even though migrations 030 and 031 had shipped. Catch-up to 32 in this commit covers both the missed bumps and the new migration. Reviewer: APPROVE, three info-only findings (type-user display gap, framework-rule calibration table doc, catch of the recurring pr19 pattern). Suite: 1555 / 34 / 0 (+18 new edge-confidence tests). Eval: 11/11 passed | recall=1.00 | mrr=0.86 (within regression budget vs 091e935 baseline). Phase 3 done. Backlog now down to: #6 TOON (deferred), colbymchenry#11 trace_to_culprits, colbymchenry#17/colbymchenry#18 Phase 7, colbymchenry#19 streaming. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
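As a rough illustration of the v1 producer and consumer rules above, the sketch below maps `resolvedBy` and `score` onto the categorical levels and filters rows by a minimum level. The mapping rules and the NULL-to-`EXTRACTED` collapse come from the commit message; the numeric values in `CONFIDENCE_RANK`, the `EdgeRow` shape, and the default for unlisted strategies are assumptions.

```typescript
type Confidence = "EXTRACTED" | "INFERRED" | "AMBIGUOUS";

// Assumed ordering: AMBIGUOUS < INFERRED < EXTRACTED.
const CONFIDENCE_RANK: Record<Confidence, number> = {
  AMBIGUOUS: 0,
  INFERRED: 1,
  EXTRACTED: 2,
};

const CONCRETE_STRATEGIES = new Set(["import", "qualified-name", "file-path", "exact-match"]);

// v1 rules: AMBIGUOUS is not stamped from the resolver path.
function classifyConfidence(resolvedBy: string, score: number): Confidence {
  if (CONCRETE_STRATEGIES.has(resolvedBy)) return "EXTRACTED";
  if (resolvedBy === "framework") return score >= 0.85 ? "EXTRACTED" : "INFERRED";
  // instance-method / fuzzy; unlisted strategies default to INFERRED (assumption).
  return "INFERRED";
}

// Hypothetical row shape; a NULL column value collapses to EXTRACTED on read.
interface EdgeRow {
  source: string;
  confidence: Confidence | null;
}

function filterByConfidence(rows: EdgeRow[], min: Confidence): EdgeRow[] {
  return rows.filter(
    (row) => CONFIDENCE_RANK[row.confidence ?? "EXTRACTED"] >= CONFIDENCE_RANK[min],
  );
}
```

With a threshold of `EXTRACTED`, heuristic edges drop out of the result, which is the blast-radius reduction the `minConfidence` argument is meant to provide.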
andreinknv
added a commit
to andreinknv/codegraph
that referenced
this pull request
May 3, 2026
…g (B2 + B3) ## B2 — codegraph_search inline `kind:X` qualifier surfaces hint When the agent writes `getNodeById kind:function` (the rich-query syntax tool descriptions document), the empty-results path was only checking `args['kind']`. The inline qualifier sets the SQL filter via `parseQuery` but leaves the explicit arg undefined — so the much- more-useful "Without the kind filter, N results match (kinds: method). Drop the kind filter or set it to one of those" hint never fired. Agent saw a bare did-you-mean trail and reasonably concluded the symbol didn't exist. Fix in `buildEmptyResultsResponse`: extract inline `kind:X` tokens from the query string, treat the first as the effective kind for the hint, and STRIP all `kind:` tokens from the probe query so the "without filter" search actually runs without it. Mirrors the pattern already used for centrality-stripping a few lines above. The `kindNote` at the bottom of the function picks up the same effective kind so the no-results trailer stays consistent. ## B3 — resolver tie tracking → AMBIGUOUS confidence (colbymchenry#8 follow-up) Phase 3 (colbymchenry#8) defined three buckets — EXTRACTED / INFERRED / AMBIGUOUS — but the resolver only stamped the first two; AMBIGUOUS was a TODO. The classifier needed a signal that the picker had a near-tie with a DIFFERENT target. `ResolvedRef` gains an optional `tieMargin?: number` (winner.confidence - runnerUp.confidence). `resolveOne` now sorts candidates and finds the best different-target runner-up; if one exists, the gap stamps onto the returned `ResolvedRef`. `classifyConfidence` short-circuits to AMBIGUOUS when `tieMargin < AMBIGUOUS_TIE_MARGIN` (0.05) regardless of `resolvedBy` — an "import" resolution that scored 0.61 over a 0.60 alternative is still a coin flip from the agent's POV. `createEdges` forwards the threshold call AND copies tieMargin into `metadata` so consumers unpacking the JSON can see the underlying gap. 
Same-target candidates reinforce (don't contest) the winner so they don't count toward the gap — `runnerUp` is the first sorted candidate with a different `targetNodeId`. ## Found-issue surfaced B8 (filed): the resolver's name-matcher returns only ONE candidate even for textbook same-name same-kind ambiguities (two `ambigTarget` exports, one no-import caller). With one candidate, the picker has no runner-up to compare against → tieMargin never set → AMBIGUOUS never stamped on the most natural ambiguity case. B3's wiring is correct and exercised by the integration test, but its real-world impact is gated on B8 fixing the matcher's candidate generation. ## Tests Suite: **1560 / 34 / 0** (+1 B2 test in realwork-fixes; +4 B3 tests in edge-confidence). Eval: **11/11 | recall=1.00 | mrr=0.86 | within regression budget**. Typecheck clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
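The tie-margin rule can be sketched in a few lines. The 0.05 threshold and the "first sorted candidate with a different `targetNodeId`" rule are taken from the text above; the `Candidate` shape and the helper names are hypothetical simplifications of `resolveOne` and `classifyConfidence`.

```typescript
// Hypothetical candidate shape; the real ResolvedRef carries more fields.
interface Candidate {
  targetNodeId: string;
  confidence: number;
}

const AMBIGUOUS_TIE_MARGIN = 0.05;

// Returns winner.confidence - runnerUp.confidence, where the runner-up is the
// best-scoring candidate that points at a DIFFERENT target. Same-target
// candidates reinforce the winner and are skipped.
function computeTieMargin(candidates: Candidate[]): number | undefined {
  const sorted = [...candidates].sort((a, b) => b.confidence - a.confidence);
  const winner = sorted[0];
  if (winner === undefined) return undefined;
  const runnerUp = sorted.find((c) => c.targetNodeId !== winner.targetNodeId);
  return runnerUp === undefined ? undefined : winner.confidence - runnerUp.confidence;
}

function isAmbiguous(tieMargin: number | undefined): boolean {
  return tieMargin !== undefined && tieMargin < AMBIGUOUS_TIE_MARGIN;
}
```

A single-candidate list yields no margin at all, which is the gap the B8 note describes: without a second candidate from the matcher, the ambiguity check never has anything to compare.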
andreinknv
added a commit
to andreinknv/codegraph
that referenced
this pull request
May 4, 2026
Half of colbymchenry#17 (the rename half; extract was descoped per user check — needs per-language scope analysis that doesn't pay back enough on a multi-language index). Pure query wrapper that uses the edge-confidence column from colbymchenry#8 + B8 to give the agent a ready-to-execute rename plan. ## What it returns Three sections, header counts: 1. **Definitions** — every node matching `symbol` (handles overloads / re-exports). The site(s) where the rename starts. 2. **Call sites (graph edges)** — every incoming edge from callers / type-users / instantiators. Grouped by `edge.confidence`: - **EXTRACTED** *concrete; safe to rename mechanically* — imports / qualified-name / file-path resolution. Apply with sed / Edit replace_all. - **INFERRED** *heuristic; glance before applying* — same-name match, fuzzy dispatch, framework rule below the 0.85 confidence threshold. - **AMBIGUOUS** *ambiguous; verify the target before renaming* — picker had a near-tied runner-up (B8 tieMargin < 0.05). Sites are sorted EXTRACTED-first, so `limit` truncation drops the lowest-confidence rows last. 3. **Textual mentions** — `\bname\b` regex over the indexed file set, attributed to enclosing function / method / class / interface. Word-boundary so `id` doesn't match `userId`. Doc comments / JSDoc / string literals — surfaces where the symbol is mentioned NON-graph (no edge), so the agent can review case- by-case (not every textual match is the same symbol). De-dups: textual hits that ALSO appear as graph call sites are filtered out — graph edges are the higher-quality signal, no reason to surface the same line twice. ## Validation (warnings, not blocks) - newName checked for valid-identifier shape (Unicode-letter + digit + underscore + $; rejects whitespace / hyphens). - newName checked for collision with existing indexed symbols (would shadow / break). Both surface as `### Warnings` at the top of the report. The agent decides whether to proceed. 
## Why a separate tool when codegraph_callers + codegraph_grep exist Three reasons: 1. **Confidence-stamped grouping** — `codegraph_callers` returns a flat list with the confidence suffix per row. Rename planning needs the EXTRACTED rows isolated for mechanical-safe edits; ad-hoc partitioning by the agent is error-prone. 2. **Doc-mention attribution to enclosing symbol + dedup against call sites** — `codegraph_grep` doesn't dedup against the call graph. The agent doing this manually would re-render the same import line in both surfaces. 3. **Validation up front** — collision + identifier checks happen once, not per-call-site. Saves the agent a round trip. ## CLI mirror `codegraph propose-rename <symbol> <newName>` routes via runViaMCP, supports `--limit` and `--doc-limit`. Dogfooded against the live repo: `extractFromSource → extractFromCode` surfaces 14 call sites (all EXTRACTED) + 29 textual mentions, plus a collision warning on the existing `extractFromCode` if any. ## Tests 9 cases in `__tests__/mcp-propose-rename.test.ts`: - Three-section render with header counts - EXTRACTED-first grouping - Collision warning fires - Invalid-identifier warning fires - Same-name rename → errorResult - Unknown-symbol → notFound - docLimit=0 skips textual scan + section - Textual mentions don't double-count graph call sites - limit caps the call-sites section Suite: **1608 / 34 / 0** (+9 propose-rename tests). ## Backlog after this #17a (this) shipped. #17b (propose_extract) deferred — needs per- language scope analysis (free-variable detection, parameter inference). User-checked decision: not worth the cross-language risk. colbymchenry#18 (plan-and-execute CLI) deferred — the macro infra from colbymchenry#13 already covers the structural part; the LLM-planning piece is arguably the calling agent's job. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
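Two of the checks described above, the identifier-shape validation and the word-boundary textual scan, are easy to sketch. Both helpers below are hypothetical illustrations and not the tool's code; the scan works on a single in-memory string instead of the indexed file set and returns 1-based line numbers only.

```typescript
// Unicode letters, digits, underscore and $; must not start with a digit.
// Built via the RegExp constructor so the Unicode property escapes are
// evaluated at runtime.
const IDENTIFIER_RE = new RegExp("^[\\p{L}_$][\\p{L}\\p{N}_$]*$", "u");

function isValidIdentifier(name: string): boolean {
  return IDENTIFIER_RE.test(name);
}

// Returns 1-based line numbers where `name` appears as a whole word.
function findTextualMentions(source: string, name: string): number[] {
  // Escape regex metacharacters in the symbol name before embedding it.
  const escaped = name.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const wordRe = new RegExp(`\\b${escaped}\\b`);
  const hits: number[] = [];
  source.split("\n").forEach((line, index) => {
    if (wordRe.test(line)) hits.push(index + 1);
  });
  return hits;
}
```

The word boundary is what keeps a short name such as `id` from matching inside `uuid`; de-duplicating these hits against graph call sites would be a separate step on top of this scan.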
andreinknv
added a commit
to andreinknv/codegraph
that referenced
this pull request
May 4, 2026
…ts (S3 batch 1)
S3 (file-split) starts with the lowest-risk hotspot: `src/utils.ts` (762 LOC, ranked #8 in churn × centrality). Extracts the contiguous concurrency / time / memory block (was lines 414-761) into a sibling `src/utils-concurrency.ts`.
Moved:
- FileLock: cross-process file lock with PID staleness check
- Mutex: in-process async mutex
- MemoryMonitor: interval-based heap watch
- processInBatches: bounded-parallelism batch runner
- readFileInChunks: streaming file reader
- debounce / throttle: call-rate limiters
What stays in utils.ts: path / string / language predicates (isJsFamily, validatePathWithinRootReal, isTestPath, normalizePath, stripBom, globToSafeRegex, etc.), all pure functions with no runtime state.
Caller compatibility: utils.ts re-exports every name from utils-concurrency.ts so `import { FileLock } from '../utils.js'` keeps working unchanged. Future imports can switch to the direct path when convenient; there is no rush.
Result:
- utils.ts: 762 → 426 LOC (-336, -44%)
- utils-concurrency.ts: 348 LOC (new)
- Net: same total LOC, but each file holds one cohesive concern.
- Suite 1634/34/0 unchanged.
Other S3 candidates (extraction/index.ts 2098 LOC, tree-sitter.ts 1595 LOC, context/index.ts 1422 LOC) need per-file scoping with fresh attention to their internal cohesion, so they are left as separate follow-ups per the original task description ("one PR per file").
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv
added a commit
to andreinknv/codegraph
that referenced
this pull request
May 9, 2026
Foundation for colbymchenry#8 (HNSW approximate KNN). Module is unwired -- no behavior change; provides the building block that 8.4 (similar-edges) and 8.5 (semantic search) will swap in for per-row vec0 KNN.
- hnswlib-node ^3.0.0 in optionalDependencies (cross-platform prebuilds, 3-sec install verified). Same fall-back discipline as @huggingface/transformers.
- HnswIndex: build / query / persist / load lifecycle. One index per embedding dimension, mirroring vec0's per-dim layout. Cosine similarity is computed on normalised vectors via the 'ip' (inner-product) space.
- isHnswAvailable + computeRowsetSignature helpers ready for the staleness-tracking migration (8.3).
Verified spike on 5096x768: 745 builds/s, 1832 queries/s, 0.55ms per query, 10ms persist+reload. Recall@10 ~0.97 at the chosen M=16 / efConstruction=200 / ef=64 defaults.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
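The choice of an inner-product space for cosine search rests on a simple identity: once vectors are scaled to unit length, their dot product equals their cosine similarity. The sketch below checks that identity in plain TypeScript; it does not touch hnswlib-node, and the helper names are illustrative only.

```typescript
// Scale a vector to unit length (zero vectors are returned unchanged).
function normalise(v: number[]): number[] {
  const norm = Math.hypot(...v);
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

function dot(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < Math.min(a.length, b.length); i += 1) {
    sum += (a[i] ?? 0) * (b[i] ?? 0);
  }
  return sum;
}

// Reference cosine similarity computed directly from the raw vectors.
function cosine(a: number[], b: number[]): number {
  return dot(a, b) / (Math.hypot(...a) * Math.hypot(...b));
}
```

Normalising once at insert time lets every query use the cheaper inner product while ranking results in the same order as cosine similarity would.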
andreinknv
added a commit
to andreinknv/codegraph
that referenced
this pull request
May 9, 2026
Subtask 8.3 of the embedding-features arc (colbymchenry#8 HNSW). Adds a per-dim staleness ledger (row_count, max_rowid, built_at, file_path, recall_hint, ef_query) so postHooks can decide whether to rebuild the HNSW index without rescanning every row. The handoff memo reserved migration 041 but it was taken by summary-priority-queue in interim work, so this lands as 042. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv
added a commit
to andreinknv/codegraph
that referenced
this pull request
May 11, 2026
…raph friction items Biomarker pass: 10 findings (1 error / 3 warnings / 6 info) -> 0/0/0. - has() LRU low_coverage: new __tests__/node-lru-cache.test.ts exercises the previously 0%-covered path via QueryBuilder.nodeCache. - hasExtraStatement complex_method (cyc 17 -> 0): table-driven STEP_HANDLERS dispatch in src/mcp/tools/sql.ts replaces the inline state machine. - buildQueryOutput / appendCompletenessAndBudget long_parameter_list: bundled positional args into *Args interfaces. - findLowCoverage + buildCoverageTips magic_number: extracted PCT_SCALE / COVERAGE_PCT_ROUND / CENTRALITY_ROUND + MS_PER_DAY / STALE_AFTER_DAYS named constants (docstrings cite values to satisfy stale_doc). - 4 unused exports dropped: RULE_NAMES deleted; LabelConfidence/LabelSource un-exported; duplicate CURRENT_SCHEMA_VERSION at migrations/index.ts:226 removed (derive directly in migrations.ts). Codegraph friction fixes: - #7 unused_export FP on aliased named imports: empirically zero remaining FPs on the in-tree corpus. Rule comment updated to reflect that emitNamedImportRefsFromFile + pickMatchingImport resolve aliased imports correctly. - colbymchenry#8 n_xxxxxxxx UIDs not accepted by codegraph_biomarkers / _coverage / _role mode=symbol: each tool's resolveSymbolToNodeId now checks RefIdCache.isUid first and resolves via the per-server cache. handleGetRoleOf bundled into HandleGetRoleOfArgs after refIds pushed it past the long-param tier. Punctuation guard aligned across all three siblings. Regression test in __tests__/biomarkers.test.ts mints a UID via codegraph_find and round-trips it through codegraph_biomarkers. Gates: tsgo typecheck clean, vitest 2633/0/34, codegraph_biomarkers stats empty after sync, independent reviewer APPROVE on both diff slices. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv
added a commit
to andreinknv/codegraph
that referenced
this pull request
May 11, 2026
…ache to EXTRACTION_LOGIC_VERSION

Two improvements that came out of analysing the friction items closed in 9cb4ab5:

1. **Consolidate resolveSymbolToNodeId (prevents friction colbymchenry#8 recurrence).** Three near-identical helpers in biomarkers.ts / coverage.ts / role.ts deleted; replaced with a single exported `resolveSymbolToNodeId(cg, symbol, refIds)` in symbol-resolver.ts. The shared helper preserves the post-colbymchenry#8 contract: UID-first check + punctuation guard + getNodeById + FTS top-1 fallback. Distinct from `findSymbol` in the same file, which is richer (multi-match disambiguation + fuzzy fallback with notes) and is what callers/callees/impact use. Future mode=symbol handlers import one helper instead of risking a copy of the buggy pre-colbymchenry#8 version.

2. **Couple BIOMARKER_CACHE_KEY to EXTRACTION_LOGIC_VERSION.** `BIOMARKER_CACHE_KEY` is now `biomarker_file_state_v10_ext${EXTRACTION_LOGIC_VERSION}` (currently `_ext7`). The per-file biomarker cache short-circuits on `content_hash` match, but `content_hash` is the source hash: when EXTRACTION_LOGIC_VERSION bumps and re-extraction emits new nodes/edges for unchanged source, the per-file cache skips re-evaluation and findings stay stale until a forced reindex. Suspected cause of the original CURRENT_SCHEMA_VERSION unused_export FP on 2026-05-11 (cached past the resolver improvement that made aliased imports resolve correctly). Cross-file rules already re-run on every sync (clear+append in runCrossFileBiomarkers) so they were never affected.

   Trade-off: one extra full pass per extractor bump. Acceptable: the alternative is silently-stale findings for any rule that reads `nodes.docstring` or the references-edges graph.

   Drift detector manifest updated: src/db/cache-rule-hashes.json biomarker hash 19fa804b5ca6f866 -> 60025bd2a6a831d9 (real semantic change to cache-key derivation).

   Existing `src/db/queries.ts:714` cleanup uses `LIKE 'biomarker_file_state_%'` so it picks up both old `v10` and new `v10_ext7` entries on `clearStructural`. `isBiomarkerCacheCold` semantics unchanged: a new ext-suffix reads as absent and triggers a one-shot full pass.

CLAUDE.md and docs/emulation-questions.example.json updated to reflect the new key shape (reviewer caught both as docstring rot).

Gates: tsgo typecheck clean, vitest 2633/0/34, codegraph_biomarkers stats empty, independent reviewer APPROVE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
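To make the cache-key coupling concrete, here is a small sketch of how a version-suffixed key behaves. The key format comes from the commit message; the `Map`-backed store and the bodies of the helpers are assumptions for illustration and only mimic what the message says about `isBiomarkerCacheCold` and the `LIKE` cleanup.

```typescript
// Key shape taken from the commit message; everything else is illustrative.
function biomarkerCacheKey(extractionLogicVersion: number): string {
  return `biomarker_file_state_v10_ext${extractionLogicVersion}`;
}

// Assumed stand-in for a metadata table: key -> serialized per-file state.
type MetaStore = Map<string, string>;

// A key that has never been written reads as "cold", which forces a full pass.
function isBiomarkerCacheCold(store: MetaStore, key: string): boolean {
  return !store.has(key);
}

// Approximates cleanup via LIKE 'biomarker_file_state_%' with a prefix check.
// (In SQL LIKE, `_` is itself a single-character wildcard, so the real
// pattern is slightly looser than this.)
function clearBiomarkerState(store: MetaStore): number {
  let removed = 0;
  for (const key of [...store.keys()]) {
    if (key.startsWith("biomarker_file_state_")) {
      store.delete(key);
      removed++;
    }
  }
  return removed;
}
```

Because the extractor version is part of the key, bumping it from 7 to 8 needs no explicit invalidation: the new key simply has no entry yet, so the next sync pays for one full pass and then writes state under the new key.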
andreinknv added a commit to andreinknv/codegraph that referenced this pull request May 11, 2026
…olved-by-FTS

Three more friction closures from the 2026-05-11 workout:

- src/extraction/index.ts + src/db/queries-files.ts: heal-marked files now propagate through the git fast-path. The EXTRACTION_LOGIC_VERSION self-heal sets needs_reextract=1 on every file, but the git fast-path eoCollectGitChanges only iterated files git reported as changed, so heal-marked files with no diff were silently skipped in any git-tracked project. Adds a getFilesNeedingReextract(qb) query and unions the result into the change set after the git fast-path returns. Full-scan path was already correct via eoClassifyFileChange; no change there. Closes friction colbymchenry#22 (surfaced while validating the cluster fix in 6957b74: the original needs_reextract=1 → 620 of 627 stayed pending after sync, requiring admin index --force).
- src/mcp/tools/entry-points.ts: new "MCP tools" bucket scans for exported `*_TOOL` constants whose initializer signature contains `name: 'codegraph_*'` and surfaces them with the parsed canonical tool name. Closes friction colbymchenry#13 / cluster symptom #6: every MCP tool dispatcher previously fell through every bucket because routes/CLI commands/CLI-files/public-exports each had a different filter mismatch. Five buckets now (HTTP routes, CLI commands, MCP tools, CLI files, public exports). __tests__/mcp-entry-points grows 5 → 6 tests with the new bucket fixture.
- Friction colbymchenry#9 (semantic/intent search can't bridge tool-name → handler-name): resolved by existing FTS5 path. Investigation showed nodes_fts indexes the `signature` column at bm25=2, so searching `codegraph_X` already returns the matching XXX_TOOL constant via signature substring match. The morning-workout "No results" observation was likely a stale-index transient (pre-cluster-fix). Added a regression test in mcp-search-family to lock in the behavior.

Test impact: 2633 → 2635 (one new MCP-tools-bucket test, one new canonical-tool-name regression test). Type-check clean.

Closes:

- colbymchenry#22 EXTRACTION_LOGIC_VERSION heal flag honored
- colbymchenry#13 entry_points 5th bucket for MCP tool dispatchers
- colbymchenry#9 already-resolved-by-FTS, locked in with regression test

Still open:

- colbymchenry#8 / cluster #1 (context whiff): graph edge now exists, but the context-builder ranker still surfaces ToolHandler/ToolResult/ToolCtx instead of dispatchers. Needs a context-ranker boost for `codegraph_*` queries. Logged as colbymchenry#24 follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
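The heal-flag fix for colbymchenry#22 amounts to a set union, sketched below under simplifying assumptions: the file table is an in-memory array rather than a query-builder result, and `filesNeedingReextract` and `collectChangeSet` are hypothetical stand-ins, not the project's `getFilesNeedingReextract(qb)` or `eoCollectGitChanges`.

```typescript
// Illustrative row shape; the real files table has more columns.
interface FileRow {
  path: string;
  needsReextract: boolean; // mirrors needs_reextract=1
}

// Stand-in for the query that lists heal-marked files.
function filesNeedingReextract(rows: FileRow[]): string[] {
  return rows.filter((row) => row.needsReextract).map((row) => row.path);
}

// Start from what git reported as changed, then union in heal-marked files
// so that files with no diff are still re-extracted.
function collectChangeSet(gitChanged: string[], rows: FileRow[]): string[] {
  const changeSet = new Set(gitChanged);
  for (const path of filesNeedingReextract(rows)) {
    changeSet.add(path);
  }
  return [...changeSet].sort();
}
```

Using a `Set` keeps a file that is both git-changed and heal-marked from being processed twice.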
andreinknv added a commit to andreinknv/codegraph that referenced this pull request May 11, 2026
Closes friction colbymchenry#8 / cluster #1: codegraph_context whiff on the PRIMARY natural-language tool when the query mentions a canonical MCP tool name.

Symptom: query like "how does codegraph_find dispatch across by:name/content/env/sql axes" returned generic ToolHandler / ToolResult / ToolCtx plumbing or unrelated SQL extractor symbols (both match `name`/`content`/`sql` tokens in the trailing English). The actual XXX_TOOL constant got pushed below the cap by text-search noise even though FTS5 finds it via signature.

Fix: new `cbPromoteMcpToolMatches` helper. After the standard extractor + prefix passes, scan extracted query symbols for the canonical `codegraph_X` shape; for each hit, look up `constant` nodes whose initializer signature contains `name: 'codegraph_X'` and append them to exactMatches with a high base score (MCP_TOOL_PROMOTION_SCORE = 100). The score puts the registered tool above text-search and centrality-boosted noise so it lands in the top entry-point slots.

Implementation cost: one extra `getNodesByKind('constant')` call per query, bounded by the count of top-level exported constants (~hundreds in a typical codebase). No-op when no `codegraph_*` token is in the extracted symbol list.

Tests: __tests__/context.test.ts grows 18 → 19 cases. Fixture has DEMO_TOOL plus three buildSql* helpers; without promotion the SQL helpers outrank DEMO_TOOL in the entry points.

Test impact: 2635 → 2636 (one new MCP-tool-promotion test). Type-check clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
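The promotion step can be sketched as follows. Only `MCP_TOOL_PROMOTION_SCORE = 100` and the `name: 'codegraph_X'` signature match come from the commit message; the node and match shapes, the function name `promoteMcpToolMatches`, and the choice to raise the score of an already-present match rather than duplicate it are assumptions for illustration.

```typescript
// Illustrative shapes; the real graph nodes carry more fields.
interface GraphNode {
  id: string;
  kind: string;
  name: string;
  signature: string;
}

interface ScoredMatch {
  node: GraphNode;
  score: number;
}

const MCP_TOOL_PROMOTION_SCORE = 100;
const MCP_TOOL_NAME = /^codegraph_[A-Za-z0-9_]+$/;

function promoteMcpToolMatches(
  querySymbols: string[],
  constants: GraphNode[],
  exactMatches: ScoredMatch[],
): ScoredMatch[] {
  const toolNames = querySymbols.filter((symbol) => MCP_TOOL_NAME.test(symbol));
  // No-op when the query mentions no canonical tool name.
  if (toolNames.length === 0) return exactMatches;

  const promoted = exactMatches.map((match) => ({ ...match }));
  for (const toolName of toolNames) {
    const needle = `name: '${toolName}'`;
    for (const node of constants) {
      if (node.kind !== "constant" || !node.signature.includes(needle)) continue;
      const existing = promoted.find((match) => match.node.id === node.id);
      if (existing) {
        // Assumed behavior: lift an existing match instead of adding a duplicate.
        existing.score = Math.max(existing.score, MCP_TOOL_PROMOTION_SCORE);
      } else {
        promoted.push({ node, score: MCP_TOOL_PROMOTION_SCORE });
      }
    }
  }
  return promoted;
}
```

With a fixture like the one the tests describe, a `DEMO_TOOL` constant whose signature contains `name: 'codegraph_find'` ends up scored above SQL helpers that only matched trailing English tokens such as `sql`.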
Summary
- getChildByField() with fallback for tree-sitter grammars without field definitions

Test plan
🤖 Generated with Claude Code
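For readers unfamiliar with the `getChildByField()` fallback named in the summary, a rough sketch follows. It assumes a node shape with `childForFieldName` and `namedChildren`, modelled on the tree-sitter JavaScript bindings, and it assumes the fallback compares the requested name against each child's node type; the PR's actual implementation may map names differently.

```typescript
// Minimal node shape modelled on tree-sitter's SyntaxNode (assumed subset).
interface SyntaxNodeLike {
  type: string;
  namedChildren: SyntaxNodeLike[];
  childForFieldName(fieldName: string): SyntaxNodeLike | null;
}

// Prefer the grammar's field name; if the grammar defines no such field,
// fall back to the first named child whose node type matches the name.
function getChildByField(
  node: SyntaxNodeLike,
  fieldOrType: string,
): SyntaxNodeLike | null {
  const byField = node.childForFieldName(fieldOrType);
  if (byField) return byField;
  return node.namedChildren.find((child) => child.type === fieldOrType) ?? null;
}
```

The field lookup stays first so grammars that do define fields behave as before; the type match only applies when the field lookup returns nothing.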