perf: TopKHeap vector search, evidence-based edge sorting, graph traversal hardening#33
Closed
MO2k4 wants to merge 1 commit into
Closed
perf: TopKHeap vector search, evidence-based edge sorting, graph traversal hardening#33MO2k4 wants to merge 1 commit into
MO2k4 wants to merge 1 commit into
Conversation
- Add TopKHeap for O(N log K) brute-force vector search - Add Float32Array vectorCache to avoid repeated Buffer conversions - Add query embedding LRU cache (max 50 entries) - Add SQL-level nodeKinds filtering in brute-force search - Replace console.log/warn with logDebug/logWarn - Set maxDepth=20 default, add FIND_PATH_MAX_VISITED=10000 - Add evidence-based edge sorting (sortEdgesByEvidence) - Rewrite findPath to parent-pointer BFS with visit limit - Use separate visitedAncestors/visitedDescendants in type hierarchy - Add picomatch for safe glob matching in findByQualifiedName - Batch getNodesByIds in getNodeContext/getFileDependencies/getFileDependents - Use getNodesByKinds in findByQualifiedName for single-query batch fetch - Optimize cycle detection with depsCache and pathSet - Re-add getNodesByIds and getNodesByKinds (now have callers)
andreinknv
added a commit
to andreinknv/codegraph
that referenced
this pull request
May 3, 2026
) getFindingsRanked gains `excludeFile` — drop findings whose file_path starts with a literal prefix. The triage workflow it enables: "give me ranked warnings excluding src/legacy/" without having to rebaseline the index or post-filter in JS. Implementation -------------- - src/db/queries-findings.ts: WHERE n.file_path NOT LIKE @Prefix ESCAPE '\\'. The prefix is escaped (\\, _, %) before being bound so any of those characters in real path names match themselves; the first draft used GLOB but that was discarded after the reviewer flagged that GLOB has no escape syntax — `excludeFile: '*'` would have collapsed to "match every path" and silently zeroed the result set. - src/mcp/tools/biomarkers.ts: new `excludeFile` arg on the inputSchema, passed through to `getFindingsRanked`. The description notes that the filter only applies to mode='ranked' (silently ignored on 'symbol' / 'stats'). - src/bin/codegraph.ts: `--exclude-file <prefix>` on the `codegraph biomarkers` CLI command. Tests ----- __tests__/biomarkers.test.ts: - baseline (no filter) sees both functions - directory exclusion via trailing-slash prefix drops one - single-file exclusion drops the named file - `excludeFile: '*'` is treated literally (no path starts with '*') so the result equals baseline — pinning the reviewer-caught fix - `excludeFile: 'src/k_ep/'` does NOT match 'src/keep/' because the underscore is escaped, not a wildcard Also: struck the duplicate B colbymchenry#33 marker in BACKLOG.md (it was the same item as A #3). Suite: 1366/13/0 (+1 new test). tsc clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv
added a commit
to andreinknv/codegraph
that referenced
this pull request
May 9, 2026
…polish items Eight friction-tracker items addressed in parallel by sub-agents (2 Haiku, 1 Sonnet); reviewer caught one real correctness edge case (bucket overlap on degenerate fresh-index shapes) plus two info items, all addressed in this commit. ## colbymchenry#21 — at_range cost-benefit JSDoc Doc-only update to src/mcp/tools/at-range.ts. Tool description and JSDoc now state "pays off most on dense files (100+ symbols) and multi-range bulk lookups; for tiny preview fetches on small files, raw `head -N` is comparable." No code change. ## colbymchenry#25 — blame surfaces rename detection inline src/git-utils.ts gains a new helper `getFileFollowEarliestTs` that runs `git log --follow --format=%aI -- <path>` (5 s timeout, ISO timestamp). src/mcp/tools/blame.ts compares the rename-aware oldest commit against the line-range-only timeline's oldest. When `--follow` reaches further back, appends a warning that the timeline truncated at the file's rename and points at `git log --follow <file>` for the full history. Edge cases handled: not-a-git-repo, timeout, empty timeline. Test approach uses `vi.spyOn` to mock pre-rename history because real fixtures are unreliable: modern git's `git log -L` follows renames via content-similarity tracking, making a deterministic black-box rename-fixture impossible. ## colbymchenry#26 — hotspots split into 3 mutually-exclusive categories src/db/queries-history.ts gains `getCategorizedHotspots` and src/mcp/tools/hotspots.ts gains a `category: 'risk' | 'maintenance' | 'brittle' | 'all'` arg (default 'risk' for backward compat). Thresholds use 75/25 percentile rather than hardcoded magic numbers — they adapt as the project grows. Buckets: - risk : high centrality AND high churn — where bugs hide - maintenance : high churn AND not-high centrality — refactor target - brittle : high centrality AND not-high churn — stable critical Reviewer-caught correctness bug: original filters used `<= low` for the secondary axis, which collapsed buckets when high == low (fresh index where centrality is uniformly zero, or repos where every file has identical churn). A file at the threshold could appear in both risk AND maintenance simultaneously. Fixed by switching maintenance and brittle to `< highThreshold`, making them strictly disjoint even on degenerate inputs. Also added a more-hint when any section hit the per-category cap (the existing `category='risk'` path already had this; `category='all'` now mirrors). New `__tests__/hotspots.test.ts` (4 cases) covers all-section rendering, single-category dispatch, and the backward-compat default path. ## colbymchenry#27 — search centrality:high differentiates "hook hasn't run" vs "no node met the threshold" src/mcp/tools/search.ts. `probeCentralityFilterCulprit` now runs a sub-millisecond probe `SELECT 1 FROM nodes WHERE centrality IS NOT NULL LIMIT 1` (uses the existing `idx_nodes_centrality` index). When ALL nodes have NULL centrality the agent gets the existing "centrality hook hasn't run — run codegraph index" hint. When SOME nodes have centrality but none cleared the filter, a different hint suggests relaxing the threshold. Two-case hint instead of one. ## colbymchenry#28 — search exact promotes multi-token-query warning to pre-result src/mcp/tools/search.ts. `buildConceptHintIfNeeded` now returns `{ preResult, postResult }` instead of a single string. When the query splits into 2+ space-separated non-qualified tokens (likely "multiple symbol names"), the agent gets a leading hint to call search per name OR use codegraph_explore — BEFORE the result list rather than buried after. Field-qualified tokens (`kind:function lang:typescript`) and single-free-token queries are unchanged. ## colbymchenry#33 — callers on "constructor" with no callers explains the instantiates-edge model src/mcp/tools/callers.ts. When the resolved symbol is `kind=method && name=constructor` AND the callers list is empty, appends a one-line note: "constructors are invoked via `new ClassName(...)`, which graph-edges as `instantiates` on the parent class. To find construction sites, run codegraph_callers on the enclosing class instead of 'constructor'." Both the multi-match and single-match paths got the note (guarded by the same kind+name+empty check). Constructors WITH callers (e.g. via super()) render normally — no false positive. ## colbymchenry#35 — node.symbol tie-break prefers non-fixture, then centrality src/mcp/tools/symbol-resolver.ts. `pickFromMultipleExactMatches` now filters out fixture paths first (falls back to all-fixture when that's all that matches), then sorts by centrality DESC (NULL → 0). A `helper` symbol that exists in both `src/core.ts` and `docs/test-beds/fixture.ts` resolves to `src/core.ts` as the displayed primary. Tier #3 (last_touched_ts) deferred — data not in the resolver's existing query. Reviewer-caught DRY issue: the fixture-path regex set was duplicated between symbol-resolver.ts and dead-code.ts (introduced by parallel sub-agents on the same brief). Extracted to `isFixturePath` in src/mcp/tools/shared.ts; both consumers now import the single source. ## colbymchenry#49 — getSummaryCoverage denominator threading (3 call sites) src/bin/codegraph.ts (lines 348, 1461) + src/mcp/tools/status.ts (line 440). All three pass `SUMMARIZABLE_KINDS` to getSummaryCoverage to match the canonical pattern from the previously-fixed _search-intent.ts:218. Without this, the helper falls back to COUNT(*) which inflates the denominator with parameters / imports / file nodes — its own JSDoc explicitly warns against this. ## Test re-additions Sub-agent #1 deleted its own test files for colbymchenry#33 and colbymchenry#35 (a brief misread — "DO NOT commit" was interpreted as "DO NOT leave tests in repo"). Re-added as `__tests__/mcp-callers-constructor-and-fixture-tiebreak.test.ts` covering: constructor-with-no-callers note appears, non-constructor-method note absent, name-collision picks non-fixture primary. ## Verification - 15 modified files + 2 new test files, +619/-55 - npm run typecheck — clean - 74/74 tests pass across 9 LLM/search/hotspots-related test files - New exports: `isFixturePath` (shared.ts), `getCategorizedHotspots` (queries-history.ts), `getFileFollowEarliestTs` (git-utils.ts) — all have concrete in-tree callers in the same diff per reviewer-memo item #7 Reviewer pass with .claude/reviewer-memo.md prepended caught: - (request_changes) bucket-exclusivity edge case → fixed - (info) isFixturePath duplication → deduped - (info) category='all' missing more-hint → added Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
TopKHeap(min-heap) for O(N log K) brute-force search instead of full array sort; addvectorCachefor in-memory vector storage; addqueryEmbeddingCacheLRU (max 50) in VectorManagersortEdgesByEvidence()withEDGE_KIND_PRIORITYfor deterministic, meaningful edge ordering; capmaxDepthat 20; addFIND_PATH_MAX_VISITED=10000limit; rewritefindPathto parent-pointer BFS; use separatevisitedAncestors/visitedDescendantsin type hierarchypicomatchfor safe glob matching infindByQualifiedName; batch-fetch nodes viagetNodesByIdsingetNodeContext/getFileDependencies/getFileDependents; optimize cycle detection withdepsCacheandpathSetgetNodesByIdsandgetNodesByKinds(now have callers in graph/queries.ts)console.log/console.warnwithlogDebug/logWarnin vector searchDetails
TopKHeap
The brute-force vector search previously sorted the entire results array to find top-K matches.
TopKHeapmaintains a min-heap of size K, yielding O(N log K) instead of O(N log N). For typical searches (K=10, N=10000+), this is a significant improvement.Evidence-based edge sorting
Edges are now sorted by an evidence priority that reflects how informative each edge kind is for code understanding:
Graph traversal hardening
maxDepthcapped at 20 (wasInfinity) to prevent unbounded traversalfindPathrewritten from recursive DFS to iterative BFS with parent-pointer backtracking and a 10,000-node visit limitBatch node fetching
getNodesByIdsfetches nodes in chunks of 999 (SQLite parameter limit) with a Map return for O(1) lookup. Used in 4 call sites in graph/queries.ts.getNodesByKindsuses an IN clause for filtered queries, used ingetProjectGraph.Test plan
npm run buildcompiles without errorsnpm testpasses with same baseline (28 pre-existing failures, 326 passing)src/db/queries.ts,src/graph/queries.ts,src/graph/traversal.ts,src/vectors/manager.ts,src/vectors/search.tssortEdgesByEvidence) — has 2 callers in traversal.tsgetNodesByIds— 4 callers in graph/queries.tsgetNodesByKinds— 1 caller in graph/queries.tscaptureExceptionin db/queries.ts, same as upstream)