feat: File nodes, arrow functions, parallel I/O #27

Merged
colbymchenry merged 3 commits into colbymchenry:main from MO2k4:feat/extraction-quality on Feb 10, 2026

MO2k4 (Contributor) commented Feb 10, 2026

Summary

  • Create file kind nodes for each parsed source file
  • Extract arrow functions and function expressions from variable declarators
  • Add isInsideClassLikeNode() helper for method vs function detection
  • Batch file I/O with FILE_IO_BATCH_SIZE=10 using Promise.all
  • Add symlink cycle detection with visited directory tracking
  • Add lazy grammar loading with on-demand cache
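
The batching item above can be sketched as follows. This is a minimal illustration, not the PR's code: `processInBatches` and its `worker` parameter are hypothetical names, and only `FILE_IO_BATCH_SIZE` and the use of `Promise.all` come from the summary. In practice the worker would presumably be a file read.

```typescript
// Batch size named after the constant in the PR summary.
const FILE_IO_BATCH_SIZE = 10;

// Runs `worker` over `items` in fixed-size batches. Items inside one batch run
// concurrently via Promise.all; batches run one after another, so at most
// `batchSize` operations are in flight at any time. Result order matches input order.
async function processInBatches<T, R>(
  items: readonly T[],
  worker: (item: T) => Promise<R>,
  batchSize: number = FILE_IO_BATCH_SIZE,
): Promise<R[]> {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const results: R[] = [];
  for (let start = 0; start < items.length; start += batchSize) {
    const batch = items.slice(start, start + batchSize);
    const batchResults = await Promise.all(batch.map((item) => worker(item)));
    results.push(...batchResults);
  }
  return results;
}
```

Awaiting each batch before starting the next bounds the number of open file handles, at the cost of waiting for the slowest item in every batch.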

Files changed

  • src/extraction/tree-sitter.ts — File nodes, arrow functions, isInsideClassLikeNode
  • src/extraction/index.ts — Parallel I/O batching, symlink cycle detection
  • src/extraction/grammars.ts — Lazy grammar loading
  • __tests__/extraction.test.ts — Tests for file nodes and arrow functions
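
A minimal sketch of symlink cycle detection with a visited-directory set, as listed for src/extraction/index.ts. The `DirReader` interface is a hypothetical stand-in for real filesystem calls so the example runs without I/O, and this `scanDirectory` signature is an assumption rather than the project's.

```typescript
// Hypothetical minimal filesystem surface (assumption, not the project's API).
interface DirReader {
  realpath(path: string): string; // resolves symlinks to a canonical path
  list(path: string): { name: string; isDir: boolean }[];
}

// Walks `dir` recursively and collects file paths. Each directory is keyed by
// its resolved path, so a symlink that points back up the tree is skipped.
function scanDirectory(
  dir: string,
  reader: DirReader,
  visitedDirs: Set<string> = new Set(),
  files: string[] = [],
): string[] {
  const real = reader.realpath(dir);
  if (visitedDirs.has(real)) return files; // already walked: cycle or duplicate link
  visitedDirs.add(real);
  for (const entry of reader.list(real)) {
    const child = `${dir}/${entry.name}`;
    if (entry.isDir) {
      scanDirectory(child, reader, visitedDirs, files);
    } else {
      files.push(child);
    }
  }
  return files;
}
```

Keying the set on the resolved path rather than the path as written is what breaks the loop: two different spellings of the same directory collapse to one entry.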

Test plan

  • npm run build compiles without errors
  • npm test - no new failures
  • All Svelte and Dart extraction code preserved
  • All Sentry captureException calls preserved

MO2k4 and others added 3 commits February 10, 2026 11:47

- Create file-kind nodes for each parsed source file
- Add isInsideClassLikeNode() for method vs function detection
- Extract arrow functions and function expressions from variable declarators
- Batch file I/O with FILE_IO_BATCH_SIZE=10 using Promise.all
- Add symlink cycle detection with visitedDirs Set in scanDirectory
- Add lazy grammar loading with exported getGrammar() function
- Add indexFileWithContent() for pre-read content processing
- Add tests for file nodes and arrow function extraction

Keeps the PR's visitedDirs rename and main's gitIgnoredDirs addition.

- Remove extractFunctionVariable() and its dispatch (already handled by extractVariable)
- Remove dead getGrammar() export (zero callers)
- Deduplicate indexFile by delegating to indexFileWithContent
- Remove redundant arrow function variable extraction tests (covered by existing suite)

colbymchenry (Owner) commented

@MO2k4 Thanks for the contribution! Great additions with file nodes, isInsideClassLikeNode(), and the parallel I/O batching.

I pushed a cleanup commit (36af284) before merging that trims some redundant/dead code:

  1. Removed extractFunctionVariable() and its dispatch: arrow functions were already extracted via extractVariable() → extractFunction() on main, so this was a duplicate code path
  2. Removed the getGrammar() export: zero callers anywhere in the codebase
  3. Deduplicated indexFile(): it now reads the file, then delegates to indexFileWithContent() instead of duplicating ~40 lines of validation/detection/extraction/storage logic
  4. Removed the redundant "Arrow Function Variable Extraction" tests: already covered by the existing "Arrow Function Export Extraction" suite
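
The delegation in item 3 has roughly this shape. It is a sketch under assumptions: the `IndexResult` type, the injected `readFile` parameter, and the line-counting body are illustrative stand-ins, not the project's validation/detection/extraction/storage pipeline.

```typescript
interface IndexResult {
  filePath: string;
  symbolCount: number;
}

// Single home for the per-file pipeline. The body is a stand-in that only
// counts lines that look like function declarations.
function indexFileWithContent(filePath: string, content: string): IndexResult {
  if (content.length === 0) return { filePath, symbolCount: 0 };
  const symbolCount = content
    .split("\n")
    .filter((line) => /^\s*(export\s+)?function\s/.test(line)).length;
  return { filePath, symbolCount };
}

// Thin wrapper: read the file, then delegate. No pipeline logic lives here,
// so callers that already hold the content can skip the read entirely.
async function indexFile(
  filePath: string,
  readFile: (path: string) => Promise<string>,
): Promise<IndexResult> {
  const content = await readFile(filePath);
  return indexFileWithContent(filePath, content);
}
```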

Everything else from your PR is kept as-is: file node creation, isInsideClassLikeNode() helper, parallel file I/O batching, symlink cycle detection rename, and all test updates.
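
For readers unfamiliar with the helper, one plausible shape for isInsideClassLikeNode() is an ancestor walk. The sketch below assumes a node exposes `type` and `parent`, and the set of class-like node types is a guess; the real list depends on each grammar.

```typescript
// Minimal node shape used by this sketch (assumption).
interface SyntaxNodeLike {
  type: string;
  parent: SyntaxNodeLike | null;
}

// Assumed class-like node types; not taken from the PR.
const CLASS_LIKE_TYPES = new Set([
  "class_declaration",
  "class",
  "class_body",
  "interface_declaration",
]);

// Returns true when any ancestor of `node` is class-like, which is the signal
// for treating a function-shaped node as a method.
function isInsideClassLikeNode(node: SyntaxNodeLike): boolean {
  for (let cur = node.parent; cur !== null; cur = cur.parent) {
    if (CLASS_LIKE_TYPES.has(cur.type)) return true;
  }
  return false;
}
```

A full ancestor walk also reports a function nested inside a method body as class-contained; whether that is desirable depends on how the extractor classifies nested functions.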

colbymchenry merged commit e2a95ee into colbymchenry:main on Feb 10, 2026
andreinknv added a commit to andreinknv/codegraph that referenced this pull request May 3, 2026
User-driven backlog cleanup before next session:

- **colbymchenry#27 GraphQL `extend type` — verified NOT implemented.** I had
  briefly thought this was done; double-check found
  `src/extraction/graphql-extractor.ts:131` explicitly skips
  `type_system_extension` in v1 with a "needs a second resolution
  pass we don't do yet" comment. No fixture coverage. Backlog
  entry annotated to flag the verification result so a future
  session doesn't re-make the same mis-recall.

- **colbymchenry#25 build-snapshot:** annotated with the 2026-05-03 triage
  numbers (~50ms upper-bound win on a ~261ms cold start;
  ESM + native-module fragility) but kept on the backlog per
  user — worth picking up later when TS 7 + ESM build-snapshot
  tooling matures.

Also checks in `__tests__/evaluation/baseline-self.json` — the
self-eval baseline captured before B colbymchenry#19 (ranking arc) shipped,
referenced by the runner's `--compare` flag. Without it in-repo,
every fresh checkout would have to regenerate the baseline before
gating ranking changes against it.

The baseline is the **pre-improvement** snapshot so its mean
recall (0.79) is the floor every future ranking change must clear
or stay flat against. Bump the file deliberately when a verified
improvement should be the new floor — never silently overwrite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit to andreinknv/codegraph that referenced this pull request May 3, 2026
Pre-PR, `graphql-extractor.ts` explicitly skipped `type_system_extension`
AST nodes ("intentionally skipped in v1 — merging extensions
across files needs a second resolution pass we don't do yet"),
so federation-style `extend type User { posts: [Post] }` produced
zero nodes. Post-PR, each extension emits a separate node carrying
the new fields/values plus an `extends` UnresolvedReference
targeting the base type — cross-file merging reconstructible by
walking the resolver-promoted edges.

Mapping
- `extend type X { … }`        → class      + extends ref
- `extend interface X { … }`   → interface  + extends ref
- `extend input X { … }`       → class      + extends ref
- `extend enum X { … }`        → enum       + extends ref + new enum_members
- `extend union X = …`         → type_alias + extends ref + new union refs
- `extend scalar X`            → unsupported by tree-sitter-graphql
                                   0.1.0 (parses as ERROR);
                                   defensive scaffold kept for a
                                   future grammar bump

Per-line node-id derivation makes multi-extension cases distinct
(`extend type User` at L5 and L20 both produce nodes named `User`
of kind `class` with separate ids). Cross-file: filePath in the
id-hash makes them unique by source location. Fields / enum
values / union members go under the extension node, preserving
"this field came from this extender" provenance.
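
A sketch of why location-based ids keep repeated extensions distinct. The FNV-1a hash and the `nodeId` signature are stand-ins chosen for illustration; the commit only says that line and filePath feed the id hash.

```typescript
// 32-bit FNV-1a over UTF-16 code units, rendered as 8 hex digits.
// A stand-in hash for this sketch, not the project's.
function fnv1a(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return ("0000000" + (hash >>> 0).toString(16)).slice(-8);
}

// Hypothetical id derivation: the same name and kind at a different line or in
// a different file hashes a different input string.
function nodeId(filePath: string, kind: string, name: string, line: number): string {
  return fnv1a([filePath, kind, name, String(line)].join("\u0000"));
}
```

The NUL separator keeps field boundaries unambiguous, so ("ab", "c") and ("a", "bc") never hash the same input.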

Known same-file edge case
If a base definition and its extension live in the SAME file, the
existing `findBestMatch` line-proximity may pick the extension's
own node (distance 0) over the base definition (distance > 0),
producing a self-referential extends edge. Federation patterns
put base + extension in different files, which is what this
targets. Documented in `pushExtendsRef` JSDoc as a future
resolver-pass filter target.
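
To make the edge case concrete, here is a hypothetical line-proximity picker with the kind of exclusion filter the commit defers to a future resolver pass. `pickNearestCandidate` is an illustrative name and is not the project's `findBestMatch`.

```typescript
interface Candidate {
  id: string;
  name: string;
  line: number;
}

// Picks the candidate closest to `refLine`. Passing `excludeId` drops the
// node that owns the reference, which prevents a distance-0 self-match.
function pickNearestCandidate(
  candidates: Candidate[],
  refLine: number,
  excludeId?: string,
): Candidate | undefined {
  let best: Candidate | undefined;
  for (const c of candidates) {
    if (c.id === excludeId) continue;
    if (
      best === undefined ||
      Math.abs(c.line - refLine) < Math.abs(best.line - refLine)
    ) {
      best = c;
    }
  }
  return best;
}
```

Without the exclusion, an extension at line 10 referencing `User` resolves to itself; with it, the base definition wins even though it is farther away.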

Files
- src/extraction/graphql-extractor.ts: visitDefinition routes to
  the new `visitTypeSystemExtension` dispatcher; 6 emit*Extension
  methods reuse `emitFieldsOf` and the new `pushExtendsRef`
  helper. Class-level docstring mapping table updated to cover
  the extension forms (memo scrutiny-area #1 catch by reviewer).
- __tests__/graphql-extend-type.test.ts: 3 new cases (5 kinds
  end-to-end, signature distinction, type_of refs).
- __tests__/extraction.test.ts: one existing test flipped from
  "extend type silently produces zero nodes (v1 out-of-scope)" to
  "extension node + extends ref emitted".
- docs/test-beds/graphql/fixture.graphql: full schema fixture
  covering definitions and all 5 supported extension forms;
  auto-discovered by the language-coverage harness.

Verification
- npm run typecheck (tsgo) — clean.
- npx vitest run — 1374 / 34 / 0 (was 1371; +3 new + 1 flipped).
- E2E probe on a multi-kind extend fixture: 5 extension nodes,
  5 extends refs, fields under the right parent, 0 errors.

Reviewer pass — eighth memo-load-bearing review this session:
- Class-level mapping table missing the extension rows (memo
  scrutiny-area #1 docstring rotting). Added.
- Same-file self-resolve edge case noted as future resolver
  filter target.
- emitScalarExtension's unreachable status confirmed adequate
  per its existing JSDoc (memo scrutiny-area #7 doesn't apply
  to private methods).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit to andreinknv/codegraph that referenced this pull request May 9, 2026
…polish items

Eight friction-tracker items addressed in parallel by sub-agents (2 Haiku,
1 Sonnet); reviewer caught one real correctness edge case (bucket overlap
on degenerate fresh-index shapes) plus two info items, all addressed in
this commit.

## colbymchenry#21 — at_range cost-benefit JSDoc

Doc-only update to src/mcp/tools/at-range.ts. Tool description and
JSDoc now state "pays off most on dense files (100+ symbols) and
multi-range bulk lookups; for tiny preview fetches on small files,
raw `head -N` is comparable." No code change.

## colbymchenry#25 — blame surfaces rename detection inline

src/git-utils.ts gains a new helper `getFileFollowEarliestTs` that runs
`git log --follow --format=%aI -- <path>` (5 s timeout, ISO timestamp).
src/mcp/tools/blame.ts compares the rename-aware oldest commit against
the line-range-only timeline's oldest. When `--follow` reaches further
back, appends a warning that the timeline truncated at the file's
rename and points at `git log --follow <file>` for the full history.
Edge cases handled: not-a-git-repo, timeout, empty timeline.
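
The comparison step can be sketched in isolation. This assumes both timestamps arrive as ISO 8601 strings (or null when unavailable); `renameTruncationWarning` and the warning wording are hypothetical, and the git invocation itself is left out.

```typescript
// Returns a warning when rename-aware history reaches further back than the
// line-range timeline, otherwise null. Null or unparseable inputs (for
// example not a git repo, a timeout, or an empty timeline) produce no warning.
function renameTruncationWarning(
  followEarliestIso: string | null,
  timelineEarliestIso: string | null,
  filePath: string,
): string | null {
  if (followEarliestIso === null || timelineEarliestIso === null) return null;
  const follow = Date.parse(followEarliestIso);
  const timeline = Date.parse(timelineEarliestIso);
  if (Number.isNaN(follow) || Number.isNaN(timeline)) return null;
  if (follow >= timeline) return null;
  return (
    "Timeline stops at a rename of " + filePath +
    "; run 'git log --follow " + filePath + "' for the full history."
  );
}
```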

Test approach uses `vi.spyOn` to mock pre-rename history because
real fixtures are unreliable: modern git's `git log -L` follows
renames via content-similarity tracking, making a deterministic
black-box rename-fixture impossible.

## colbymchenry#26 — hotspots split into 3 mutually-exclusive categories

src/db/queries-history.ts gains `getCategorizedHotspots` and
src/mcp/tools/hotspots.ts gains a `category: 'risk' | 'maintenance'
| 'brittle' | 'all'` arg (default 'risk' for backward compat).
Thresholds use 75/25 percentile rather than hardcoded magic
numbers — they adapt as the project grows.

Buckets:
- risk        : high centrality AND high churn — where bugs hide
- maintenance : high churn AND not-high centrality — refactor target
- brittle     : high centrality AND not-high churn — stable critical

Reviewer-caught correctness bug: original filters used `<= low` for
the secondary axis, which collapsed buckets when high == low (fresh
index where centrality is uniformly zero, or repos where every file
has identical churn). A file at the threshold could appear in both
risk AND maintenance simultaneously. Fixed by switching maintenance
and brittle to `< highThreshold`, making them strictly disjoint
even on degenerate inputs. Also added a more-hint when any section
hit the per-category cap (the existing `category='risk'` path
already had this; `category='all'` now mirrors).
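
The disjointness argument can be sketched as follows. Assumptions: "high" means at or above the 75th percentile, the percentile uses the nearest-rank method, and `categorize` is a hypothetical name. The point is that the secondary axis is the strict complement of "high", so a file can never land in two buckets.

```typescript
type Category = "risk" | "maintenance" | "brittle";

interface FileStats {
  path: string;
  centrality: number;
  churn: number;
}

// Nearest-rank percentile over a sorted copy; p is in (0, 100].
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new RangeError("no values");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Each file gets at most one category. "Not high" is expressed as
// `< highThreshold`, never `<= lowThreshold`, so the buckets stay disjoint
// even when every file has the same centrality or churn.
function categorize(
  file: FileStats,
  highCentrality: number,
  highChurn: number,
): Category | null {
  const central = file.centrality >= highCentrality;
  const churny = file.churn >= highChurn;
  if (central && churny) return "risk";
  if (churny) return "maintenance"; // implies centrality < highCentrality
  if (central) return "brittle"; // implies churn < highChurn
  return null;
}
```

On a fresh index where centrality is uniformly zero, the threshold is also zero, so every file counts as central and is split between risk and brittle by churn alone.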

New `__tests__/hotspots.test.ts` (4 cases) covers all-section
rendering, single-category dispatch, and the backward-compat
default path.

## colbymchenry#27 — search centrality:high differentiates "hook hasn't run"
       vs "no node met the threshold"

src/mcp/tools/search.ts. `probeCentralityFilterCulprit` now runs a
sub-millisecond probe `SELECT 1 FROM nodes WHERE centrality IS NOT
NULL LIMIT 1` (uses the existing `idx_nodes_centrality` index). When
ALL nodes have NULL centrality the agent gets the existing "centrality
hook hasn't run — run codegraph index" hint. When SOME nodes have
centrality but none cleared the filter, a different hint suggests
relaxing the threshold. Two-case hint instead of one.
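
The two-case decision can be sketched over an in-memory list instead of SQL. `diagnoseEmptyCentralityFilter` and the hint labels are hypothetical; the `some` call stands in for the `SELECT 1 ... LIMIT 1` probe described above.

```typescript
interface NodeRow {
  name: string;
  centrality: number | null;
}

type CentralityHint = "hook-not-run" | "threshold-too-strict" | null;

// Explains why a centrality filter returned nothing. Returns null when the
// filter would in fact match something.
function diagnoseEmptyCentralityFilter(
  nodes: NodeRow[],
  threshold: number,
): CentralityHint {
  const hasMatch = nodes.some(
    (n) => n.centrality !== null && n.centrality >= threshold,
  );
  if (hasMatch) return null;
  // In-memory equivalent of: SELECT 1 FROM nodes WHERE centrality IS NOT NULL LIMIT 1
  const anyComputed = nodes.some((n) => n.centrality !== null);
  return anyComputed ? "threshold-too-strict" : "hook-not-run";
}
```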

## colbymchenry#28 — search exact promotes multi-token-query warning to pre-result

src/mcp/tools/search.ts. `buildConceptHintIfNeeded` now returns
`{ preResult, postResult }` instead of a single string. When the
query splits into 2+ space-separated non-qualified tokens (likely
"multiple symbol names"), the agent gets a leading hint to call
search per name OR use codegraph_explore — BEFORE the result list
rather than buried after.

Field-qualified tokens (`kind:function lang:typescript`) and
single-free-token queries are unchanged.
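
A sketch of the token rule, assuming a token counts as field-qualified when it contains a colon. `buildMultiTokenHint` and the hint wording are hypothetical; only the `{ preResult, postResult }` return shape comes from the commit message.

```typescript
interface ConceptHint {
  preResult: string | null;
  postResult: string | null;
}

// Emits a leading hint when the query holds two or more free (non-qualified)
// tokens, which likely means several symbol names were passed at once.
function buildMultiTokenHint(query: string): ConceptHint {
  const tokens = query.trim().split(/\s+/).filter((t) => t.length > 0);
  const freeTokens = tokens.filter((t) => t.indexOf(":") === -1);
  if (freeTokens.length >= 2) {
    return {
      preResult:
        "Query looks like " + freeTokens.length +
        " separate symbol names; search each name separately or use codegraph_explore.",
      postResult: null,
    };
  }
  return { preResult: null, postResult: null };
}
```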

## colbymchenry#33 — callers on "constructor" with no callers explains
       the instantiates-edge model

src/mcp/tools/callers.ts. When the resolved symbol is
`kind=method && name=constructor` AND the callers list is empty,
appends a one-line note: "constructors are invoked via
`new ClassName(...)`, which graph-edges as `instantiates` on the
parent class. To find construction sites, run codegraph_callers on
the enclosing class instead of 'constructor'." Both the multi-match
and single-match paths got the note (guarded by the same
kind+name+empty check). Constructors WITH callers (e.g. via super())
render normally — no false positive.
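
The guard described above amounts to a three-part check. In this sketch `constructorNote` and the `ResolvedSymbol` shape are hypothetical, and the note text is abbreviated from the commit message.

```typescript
interface ResolvedSymbol {
  kind: string;
  name: string;
}

const CONSTRUCTOR_NOTE =
  "constructors are invoked via `new ClassName(...)`, which graph-edges as " +
  "`instantiates` on the parent class. Run codegraph_callers on the enclosing class.";

// Returns the explanatory note only for a constructor method with no callers.
// Constructors that do have callers (for example via super()) get no note.
function constructorNote(symbol: ResolvedSymbol, callerCount: number): string | null {
  const isConstructor = symbol.kind === "method" && symbol.name === "constructor";
  return isConstructor && callerCount === 0 ? CONSTRUCTOR_NOTE : null;
}
```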

## colbymchenry#35 — node.symbol tie-break prefers non-fixture, then centrality

src/mcp/tools/symbol-resolver.ts. `pickFromMultipleExactMatches`
now filters out fixture paths first (falls back to all-fixture when
that's all that matches), then sorts by centrality DESC (NULL → 0).
A `helper` symbol that exists in both `src/core.ts` and
`docs/test-beds/fixture.ts` resolves to `src/core.ts` as the displayed
primary. Tier #3 (last_touched_ts) deferred — data not in the
resolver's existing query.

Reviewer-caught DRY issue: the fixture-path regex set was duplicated
between symbol-resolver.ts and dead-code.ts (introduced by parallel
sub-agents on the same brief). Extracted to `isFixturePath` in
src/mcp/tools/shared.ts; both consumers now import the single source.
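
A sketch of the two-tier tie-break. The fixture patterns below are guesses for illustration, since the commit does not list the real regex set, and the `Match` shape is an assumption.

```typescript
interface Match {
  filePath: string;
  centrality: number | null;
}

// Hypothetical fixture patterns; the project's actual set is not shown here.
const FIXTURE_PATTERNS: RegExp[] = [
  /(^|\/)__tests__\//,
  /(^|\/)test-beds\//,
  /(^|\/)fixtures?\//,
];

function isFixturePath(filePath: string): boolean {
  return FIXTURE_PATTERNS.some((re) => re.test(filePath));
}

// Tier 1: prefer non-fixture paths, falling back to all matches when every
// match is a fixture. Tier 2: highest centrality first, with NULL treated as 0.
function pickFromMultipleExactMatches(matches: Match[]): Match | undefined {
  const nonFixture = matches.filter((m) => !isFixturePath(m.filePath));
  const pool = nonFixture.length > 0 ? nonFixture : matches;
  return [...pool].sort((a, b) => (b.centrality ?? 0) - (a.centrality ?? 0))[0];
}
```

Filtering before sorting means a fixture never outranks real source, even when the fixture copy has the higher centrality score.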

## colbymchenry#49 — getSummaryCoverage denominator threading (3 call sites)

src/bin/codegraph.ts (lines 348, 1461) + src/mcp/tools/status.ts
(line 440). All three pass `SUMMARIZABLE_KINDS` to getSummaryCoverage
to match the canonical pattern from the previously-fixed
_search-intent.ts:218. Without this, the helper falls back to
COUNT(*) which inflates the denominator with parameters / imports /
file nodes — its own JSDoc explicitly warns against this.
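
The denominator effect is easy to see on a toy graph. This is an in-memory sketch: the real helper queries the database, and both the `GraphNode` shape and the contents of `SUMMARIZABLE_KINDS` here are assumptions.

```typescript
interface GraphNode {
  kind: string;
  hasSummary: boolean;
}

// Assumed contents; the real constant lives in the codebase.
const SUMMARIZABLE_KINDS: ReadonlySet<string> = new Set(["function", "method", "class"]);

// Fraction of eligible nodes that have a summary. Without `kinds`, every node
// counts toward the denominator, mirroring the COUNT(*) fallback.
function getSummaryCoverage(nodes: GraphNode[], kinds?: ReadonlySet<string>): number {
  const eligible = kinds ? nodes.filter((n) => kinds.has(n.kind)) : nodes;
  if (eligible.length === 0) return 0;
  return eligible.filter((n) => n.hasSummary).length / eligible.length;
}
```

With two summarized functions plus one import and one parameter node, the unfiltered call reports 50% coverage while the filtered call reports 100%.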

## Test re-additions

Sub-agent #1 deleted its own test files for colbymchenry#33 and colbymchenry#35 (a brief
misread — "DO NOT commit" was interpreted as "DO NOT leave tests in
repo"). Re-added as
`__tests__/mcp-callers-constructor-and-fixture-tiebreak.test.ts`
covering: constructor-with-no-callers note appears,
non-constructor-method note absent, name-collision picks non-fixture
primary.

## Verification

- 15 modified files + 2 new test files, +619/-55
- npm run typecheck — clean
- 74/74 tests pass across 9 LLM/search/hotspots-related test files
- New exports: `isFixturePath` (shared.ts), `getCategorizedHotspots`
  (queries-history.ts), `getFileFollowEarliestTs` (git-utils.ts) —
  all have concrete in-tree callers in the same diff per
  reviewer-memo item #7

Reviewer pass with .claude/reviewer-memo.md prepended caught:
- (request_changes) bucket-exclusivity edge case → fixed
- (info) isFixturePath duplication → deduped
- (info) category='all' missing more-hint → added

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>