Skip to content

perf: TopKHeap vector search, evidence-based edge sorting, graph traversal hardening#33

Closed
MO2k4 wants to merge 1 commit into
colbymchenry:mainfrom
MO2k4:perf/vector-search-graph-traversal
Closed

perf: TopKHeap vector search, evidence-based edge sorting, graph traversal hardening#33
MO2k4 wants to merge 1 commit into
colbymchenry:mainfrom
MO2k4:perf/vector-search-graph-traversal

Conversation

@MO2k4
Copy link
Copy Markdown
Contributor

@MO2k4 MO2k4 commented Feb 11, 2026

Summary

  • Vector search: Add TopKHeap (min-heap) for O(N log K) brute-force search instead of full array sort; add vectorCache for in-memory vector storage; add queryEmbeddingCache LRU (max 50) in VectorManager
  • Graph traversal: Add sortEdgesByEvidence() with EDGE_KIND_PRIORITY for deterministic, meaningful edge ordering; cap maxDepth at 20; add FIND_PATH_MAX_VISITED=10000 limit; rewrite findPath to parent-pointer BFS; use separate visitedAncestors/visitedDescendants in type hierarchy
  • Graph queries: Use picomatch for safe glob matching in findByQualifiedName; batch-fetch nodes via getNodesByIds in getNodeContext/getFileDependencies/getFileDependents; optimize cycle detection with depsCache and pathSet
  • DB queries: Re-add getNodesByIds and getNodesByKinds (now have callers in graph/queries.ts)
  • Replace console.log/console.warn with logDebug/logWarn in vector search

Details

TopKHeap

The brute-force vector search previously sorted the entire results array to find top-K matches. TopKHeap maintains a min-heap of size K, yielding O(N log K) instead of O(N log N). For typical searches (K=10, N=10000+), this is a significant improvement.

Evidence-based edge sorting

Edges are now sorted by an evidence priority that reflects how informative each edge kind is for code understanding:

calls: 10, extends: 9, implements: 8, imports: 7, exports: 6,
type_of: 5, returns: 4, instantiates: 3, overrides: 3,
decorates: 2, references: 1, contains: 0

Graph traversal hardening

  • maxDepth capped at 20 (was Infinity) to prevent unbounded traversal
  • findPath rewritten from recursive DFS to iterative BFS with parent-pointer backtracking and a 10,000-node visit limit
  • Type hierarchy uses separate visited sets for ancestors vs descendants to avoid incorrectly skipping nodes

Batch node fetching

getNodesByIds fetches nodes in chunks of 999 (SQLite parameter limit) with a Map return for O(1) lookup. Used in 4 call sites in graph/queries.ts. getNodesByKinds uses an IN clause for filtered queries, used in getProjectGraph.

Test plan

  • npm run build compiles without errors
  • npm test passes with same baseline (28 pre-existing failures, 326 passing)
  • 5 files changed: src/db/queries.ts, src/graph/queries.ts, src/graph/traversal.ts, src/vectors/manager.ts, src/vectors/search.ts
  • 1 new export (sortEdgesByEvidence) — has 2 callers in traversal.ts
  • getNodesByIds — 4 callers in graph/queries.ts
  • getNodesByKinds — 1 caller in graph/queries.ts
  • Sentry preserved (2 captureException in db/queries.ts, same as upstream)
  • No upstream features removed

- Add TopKHeap for O(N log K) brute-force vector search
- Add Float32Array vectorCache to avoid repeated Buffer conversions
- Add query embedding LRU cache (max 50 entries)
- Add SQL-level nodeKinds filtering in brute-force search
- Replace console.log/warn with logDebug/logWarn
- Set maxDepth=20 default, add FIND_PATH_MAX_VISITED=10000
- Add evidence-based edge sorting (sortEdgesByEvidence)
- Rewrite findPath to parent-pointer BFS with visit limit
- Use separate visitedAncestors/visitedDescendants in type hierarchy
- Add picomatch for safe glob matching in findByQualifiedName
- Batch getNodesByIds in getNodeContext/getFileDependencies/getFileDependents
- Use getNodesByKinds in findByQualifiedName for single-query batch fetch
- Optimize cycle detection with depsCache and pathSet
- Re-add getNodesByIds and getNodesByKinds (now have callers)
andreinknv added a commit to andreinknv/codegraph that referenced this pull request May 3, 2026
)

getFindingsRanked gains `excludeFile` — drop findings whose file_path
starts with a literal prefix. The triage workflow it enables:
"give me ranked warnings excluding src/legacy/" without having to
rebaseline the index or post-filter in JS.

Implementation
--------------
- src/db/queries-findings.ts: WHERE n.file_path NOT LIKE @Prefix
  ESCAPE '\\'. The prefix is escaped (\\, _, %) before being bound
  so any of those characters in real path names match themselves;
  the first draft used GLOB but that was discarded after the
  reviewer flagged that GLOB has no escape syntax — `excludeFile: '*'`
  would have collapsed to "match every path" and silently zeroed
  the result set.
- src/mcp/tools/biomarkers.ts: new `excludeFile` arg on the
  inputSchema, passed through to `getFindingsRanked`. The
  description notes that the filter only applies to mode='ranked'
  (silently ignored on 'symbol' / 'stats').
- src/bin/codegraph.ts: `--exclude-file <prefix>` on the
  `codegraph biomarkers` CLI command.

Tests
-----
__tests__/biomarkers.test.ts:
- baseline (no filter) sees both functions
- directory exclusion via trailing-slash prefix drops one
- single-file exclusion drops the named file
- `excludeFile: '*'` is treated literally (no path starts with '*')
  so the result equals baseline — pinning the reviewer-caught fix
- `excludeFile: 'src/k_ep/'` does NOT match 'src/keep/' because the
  underscore is escaped, not a wildcard

Also: struck the duplicate B colbymchenry#33 marker in BACKLOG.md (it was the
same item as A #3).

Suite: 1366/13/0 (+1 new test). tsc clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit to andreinknv/codegraph that referenced this pull request May 9, 2026
…polish items

Eight friction-tracker items addressed in parallel by sub-agents (2 Haiku,
1 Sonnet); reviewer caught one real correctness edge case (bucket overlap
on degenerate fresh-index shapes) plus two info items, all addressed in
this commit.

## colbymchenry#21 — at_range cost-benefit JSDoc

Doc-only update to src/mcp/tools/at-range.ts. Tool description and
JSDoc now state "pays off most on dense files (100+ symbols) and
multi-range bulk lookups; for tiny preview fetches on small files,
raw `head -N` is comparable." No code change.

## colbymchenry#25 — blame surfaces rename detection inline

src/git-utils.ts gains a new helper `getFileFollowEarliestTs` that runs
`git log --follow --format=%aI -- <path>` (5 s timeout, ISO timestamp).
src/mcp/tools/blame.ts compares the rename-aware oldest commit against
the line-range-only timeline's oldest. When `--follow` reaches further
back, appends a warning that the timeline truncated at the file's
rename and points at `git log --follow <file>` for the full history.
Edge cases handled: not-a-git-repo, timeout, empty timeline.

Test approach uses `vi.spyOn` to mock pre-rename history because
real fixtures are unreliable: modern git's `git log -L` follows
renames via content-similarity tracking, making a deterministic
black-box rename-fixture impossible.

## colbymchenry#26 — hotspots split into 3 mutually-exclusive categories

src/db/queries-history.ts gains `getCategorizedHotspots` and
src/mcp/tools/hotspots.ts gains a `category: 'risk' | 'maintenance'
| 'brittle' | 'all'` arg (default 'risk' for backward compat).
Thresholds use 75/25 percentile rather than hardcoded magic
numbers — they adapt as the project grows.

Buckets:
- risk        : high centrality AND high churn — where bugs hide
- maintenance : high churn AND not-high centrality — refactor target
- brittle     : high centrality AND not-high churn — stable critical

Reviewer-caught correctness bug: original filters used `<= low` for
the secondary axis, which collapsed buckets when high == low (fresh
index where centrality is uniformly zero, or repos where every file
has identical churn). A file at the threshold could appear in both
risk AND maintenance simultaneously. Fixed by switching maintenance
and brittle to `< highThreshold`, making them strictly disjoint
even on degenerate inputs. Also added a more-hint when any section
hit the per-category cap (the existing `category='risk'` path
already had this; `category='all'` now mirrors).

New `__tests__/hotspots.test.ts` (4 cases) covers all-section
rendering, single-category dispatch, and the backward-compat
default path.

## colbymchenry#27 — search centrality:high differentiates "hook hasn't run"
       vs "no node met the threshold"

src/mcp/tools/search.ts. `probeCentralityFilterCulprit` now runs a
sub-millisecond probe `SELECT 1 FROM nodes WHERE centrality IS NOT
NULL LIMIT 1` (uses the existing `idx_nodes_centrality` index). When
ALL nodes have NULL centrality the agent gets the existing "centrality
hook hasn't run — run codegraph index" hint. When SOME nodes have
centrality but none cleared the filter, a different hint suggests
relaxing the threshold. Two-case hint instead of one.

## colbymchenry#28 — search exact promotes multi-token-query warning to pre-result

src/mcp/tools/search.ts. `buildConceptHintIfNeeded` now returns
`{ preResult, postResult }` instead of a single string. When the
query splits into 2+ space-separated non-qualified tokens (likely
"multiple symbol names"), the agent gets a leading hint to call
search per name OR use codegraph_explore — BEFORE the result list
rather than buried after.

Field-qualified tokens (`kind:function lang:typescript`) and
single-free-token queries are unchanged.

## colbymchenry#33 — callers on "constructor" with no callers explains
       the instantiates-edge model

src/mcp/tools/callers.ts. When the resolved symbol is
`kind=method && name=constructor` AND the callers list is empty,
appends a one-line note: "constructors are invoked via
`new ClassName(...)`, which graph-edges as `instantiates` on the
parent class. To find construction sites, run codegraph_callers on
the enclosing class instead of 'constructor'." Both the multi-match
and single-match paths got the note (guarded by the same
kind+name+empty check). Constructors WITH callers (e.g. via super())
render normally — no false positive.

## colbymchenry#35 — node.symbol tie-break prefers non-fixture, then centrality

src/mcp/tools/symbol-resolver.ts. `pickFromMultipleExactMatches`
now filters out fixture paths first (falls back to all-fixture when
that's all that matches), then sorts by centrality DESC (NULL → 0).
A `helper` symbol that exists in both `src/core.ts` and
`docs/test-beds/fixture.ts` resolves to `src/core.ts` as the displayed
primary. Tier #3 (last_touched_ts) deferred — data not in the
resolver's existing query.

Reviewer-caught DRY issue: the fixture-path regex set was duplicated
between symbol-resolver.ts and dead-code.ts (introduced by parallel
sub-agents on the same brief). Extracted to `isFixturePath` in
src/mcp/tools/shared.ts; both consumers now import the single source.

## colbymchenry#49 — getSummaryCoverage denominator threading (3 call sites)

src/bin/codegraph.ts (lines 348, 1461) + src/mcp/tools/status.ts
(line 440). All three pass `SUMMARIZABLE_KINDS` to getSummaryCoverage
to match the canonical pattern from the previously-fixed
_search-intent.ts:218. Without this, the helper falls back to
COUNT(*) which inflates the denominator with parameters / imports /
file nodes — its own JSDoc explicitly warns against this.

## Test re-additions

Sub-agent #1 deleted its own test files for colbymchenry#33 and colbymchenry#35 (a brief
misread — "DO NOT commit" was interpreted as "DO NOT leave tests in
repo"). Re-added as
`__tests__/mcp-callers-constructor-and-fixture-tiebreak.test.ts`
covering: constructor-with-no-callers note appears,
non-constructor-method note absent, name-collision picks non-fixture
primary.

## Verification

- 15 modified files + 2 new test files, +619/-55
- npm run typecheck — clean
- 74/74 tests pass across 9 LLM/search/hotspots-related test files
- New exports: `isFixturePath` (shared.ts), `getCategorizedHotspots`
  (queries-history.ts), `getFileFollowEarliestTs` (git-utils.ts) —
  all have concrete in-tree callers in the same diff per
  reviewer-memo item #7

Reviewer pass with .claude/reviewer-memo.md prepended caught:
- (request_changes) bucket-exclusivity edge case → fixed
- (info) isFixturePath duplication → deduped
- (info) category='all' missing more-hint → added

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants