
fix(extraction): drop duplicate export-var nodes and honour maxFileSize in bulk path #129

Merged
colbymchenry merged 1 commit into colbymchenry:main from andreinknv:fix/extraction-dup-exports-and-maxsize on May 8, 2026


@andreinknv
Contributor

Summary

Two correctness bugs in the core extraction pipeline, surfaced by stress-testing against an adversarial corpus (5k synthetic export-const declarations plus an 8MB single-line file).

1. Every export const produces two nodes for the same symbol

export const X = ... was producing two nodes for the same symbol:

  • one with kind: 'variable' from extractExportedVariables
  • one with kind: 'constant' from extractVariable (invoked when the walker descended into the export_statement child)

Stress-testing showed 100% duplication across 5,003 export const declarations.

The dedicated extractVariable dispatch is the correct one — it picks kind from isConst, captures the initializer signature, and walks type annotations. The extractExportedVariables helper was redundant because each language extractor's isExported predicate already walks the parent chain to detect the export wrapper. Removed the export_statement branch from the walker dispatch (children are descended into normally) and dropped the private helper.
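As a rough illustration of why the export_statement branch was redundant, here is a minimal sketch of a walker that has no special case for the export wrapper. The `SyntaxNode` and `GraphNode` shapes, the `mk` helper, and the node type names are simplified assumptions for this example, not the actual codegraph types; only the names `extractVariable`, `isExported`, and `isConst` come from the description above.

```typescript
// Hypothetical, simplified syntax-node shape (the real parser node is richer).
interface SyntaxNode {
  type: string;
  text?: string;
  parent?: SyntaxNode;
  children: SyntaxNode[];
}

// Hypothetical graph-node shape for this sketch only.
interface GraphNode {
  name: string;
  kind: "constant" | "variable";
  exported: boolean;
}

// Test helper: builds a node and wires up parent pointers.
function mk(type: string, children: SyntaxNode[] = [], text?: string): SyntaxNode {
  const n: SyntaxNode = { type, text, children };
  for (const c of children) c.parent = n;
  return n;
}

// Walks the parent chain to detect an enclosing export wrapper.
function isExported(node: SyntaxNode): boolean {
  for (let p = node.parent; p; p = p.parent) {
    if (p.type === "export_statement") return true;
  }
  return false;
}

// Picks the kind from isConst and records export status via the parent walk.
function extractVariable(node: SyntaxNode, out: GraphNode[]): void {
  const isConst = node.children.some((c) => c.type === "const");
  const declarator = node.children.find((c) => c.type === "variable_declarator");
  if (!declarator) return;
  out.push({
    name: declarator.text ?? "",
    kind: isConst ? "constant" : "variable",
    exported: isExported(node),
  });
}

// No export_statement branch: the wrapper's children are descended into normally,
// so each declaration is visited exactly once.
function walk(node: SyntaxNode, out: GraphNode[]): void {
  if (node.type === "lexical_declaration") {
    extractVariable(node, out);
    return;
  }
  for (const child of node.children) walk(child, out);
}

// Models `export const X = ...`.
const tree = mk("program", [
  mk("export_statement", [
    mk("lexical_declaration", [mk("const"), mk("variable_declarator", [], "X")]),
  ]),
]);
const nodes: GraphNode[] = [];
walk(tree, nodes);
// nodes holds a single entry: { name: "X", kind: "constant", exported: true }
```

Adding a second handler keyed on export_statement to this walker would emit another node for the same declaration, which is the duplication described above.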

2. maxFileSize silently ignored on the bulk-index path

extractFile() (single-file API) checked stats.size > config.maxFileSize, but the bulk indexAll() path read each file's stats and never compared stats.size against the cap. Vendored generated files (multi-MB headers, minified bundles) were therefore indexed regardless of the user's limit. The bulk path now mirrors the single-file behaviour: emit a size_exceeded warning, count the file as skipped, advance progress, and continue.
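The shape of the bulk-path check can be sketched as follows. This is a minimal, self-contained approximation under stated assumptions: `FileEntry`, `IndexWarning`, `IndexResult`, and `indexAllSketch` are hypothetical names for this example, and file sizes are passed in directly rather than read from disk. Only `maxFileSize` and the `size_exceeded` warning code come from the description above.

```typescript
// Hypothetical shapes for this sketch; the real indexer's types will differ.
interface FileEntry {
  path: string;
  size: number; // bytes, as would be reported by a stat call
}

interface IndexWarning {
  code: "size_exceeded";
  path: string;
  size: number;
  limit: number;
}

interface IndexResult {
  indexed: string[];
  skipped: number;
  warnings: IndexWarning[];
}

function indexAllSketch(
  files: FileEntry[],
  maxFileSize: number,
  onProgress: (done: number, total: number) => void = () => {}
): IndexResult {
  const result: IndexResult = { indexed: [], skipped: 0, warnings: [] };
  let done = 0;

  for (const file of files) {
    if (file.size > maxFileSize) {
      // Mirror the single-file path: warn, count as skipped, advance progress, continue.
      result.warnings.push({
        code: "size_exceeded",
        path: file.path,
        size: file.size,
        limit: maxFileSize,
      });
      result.skipped += 1;
      done += 1;
      onProgress(done, files.length);
      continue;
    }

    // Stand-in for the real extraction work.
    result.indexed.push(file.path);
    done += 1;
    onProgress(done, files.length);
  }

  return result;
}

const ONE_MB = 1024 * 1024;
const progressCalls: number[] = [];
const bulkResult = indexAllSketch(
  [
    { path: "src/small.ts", size: 500 },
    { path: "vendor/bundle.min.js", size: 8 * ONE_MB },
  ],
  ONE_MB,
  (done) => progressCalls.push(done)
);
// bulkResult.indexed is ["src/small.ts"]; the 8MB file is skipped with a warning.
```

Progress is advanced on the skip branch as well, so a progress bar still reaches the total when files are skipped.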

Test plan

Verified live against a stress workspace (5,005 synthetic files; 50,000 fns in one 3MB file; 8MB single-line file; 5,000 export const declarations):

| Metric | Before | After |
| --- | --- | --- |
| Duplicate var/const node sets | 5,003 (100%) | 0 |
| Files >1MB indexed despite maxFileSize: 1MB | 2 | 0 |
| Total nodes | 65,014 | 10,008 |

  • npx vitest run: 380 passed (was 374 passed, 6 failed before this fix; the 6 failing tests asserted the duplicate behavior and were updated to expect the correct kind: 'constant')
  • npx tsc --noEmit passes
  • npm run build succeeds
  • Live codegraph index against stress corpus produces the expected node counts and size_exceeded warnings
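For readers who want to reproduce the "duplicate var/const node sets" metric, one possible way to count it is sketched below. The `NodeRecord` shape and the choice of (file, name, line) as the identity of a symbol are assumptions for this example; the PR does not state how the metric was computed.

```typescript
// Hypothetical record shape; the real node schema may carry more fields.
interface NodeRecord {
  file: string;
  name: string;
  line: number;
  kind: string;
}

// Counts how many (file, name, line) groups contain more than one node.
function countDuplicateSets(records: NodeRecord[]): number {
  const counts = new Map<string, number>();
  for (const r of records) {
    const key = JSON.stringify([r.file, r.name, r.line]);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  let duplicateSets = 0;
  counts.forEach((count) => {
    if (count > 1) duplicateSets += 1;
  });
  return duplicateSets;
}

// Before the fix: the same declaration appears once per extractor.
const beforeFix: NodeRecord[] = [
  { file: "a.ts", name: "X", line: 1, kind: "variable" },
  { file: "a.ts", name: "X", line: 1, kind: "constant" },
  { file: "a.ts", name: "Y", line: 2, kind: "constant" },
];
// After the fix: one node per declaration.
const afterFix: NodeRecord[] = [
  { file: "a.ts", name: "X", line: 1, kind: "constant" },
  { file: "a.ts", name: "Y", line: 2, kind: "constant" },
];

const dupBefore = countDuplicateSets(beforeFix); // 1
const dupAfter = countDuplicateSets(afterFix); // 0
```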

🤖 Generated with Claude Code

…ze in bulk path

Two correctness bugs in the core extraction pipeline, surfaced by an
adversarial stress corpus (5k synthetic export-const declarations
plus a deliberate 8MB single-line file):

1) Every `export const X = ...` produced TWO nodes for the same
   symbol — one kind:'variable' from extractExportedVariables, plus
   one kind:'constant' from extractVariable (called when the walker
   descended into the export_statement child). Stress test showed
   100% duplication across 5,003 export-const declarations. The
   dedicated extractVariable dispatch is the correct one — it picks
   kind from isConst, captures the initializer signature, and walks
   type annotations; the export-statement helper was redundant
   because the language extractors' isExported predicate already
   walks parent chains. Remove the export_statement branch from the
   dispatch (children are descended into normally) and drop the
   private helper.

2) The bulk indexAll path read each file's stats but never compared
   stats.size against config.maxFileSize. Vendored generated files
   (multi-MB headers, minified bundles, etc.) were indexed regardless
   of the user's size cap. The single-file extractFile path enforced
   it; only the bulk path was missing the check. Mirror the
   single-file behaviour: emit a 'size_exceeded' warning, count the
   file as skipped, advance progress, and continue.

On the stress workspace (5,005 synthetic files; 50,000 fns in one
3MB file; 8MB single-line file; 5,000 export-const declarations):

  before:  65,014 nodes (100% var/const duplication, every >1MB file
           indexed despite maxFileSize=1MB)
   after:  10,008 nodes (0 duplicates, large files correctly skipped
           with size_exceeded warnings)

Tests calibrated to the duplicate behavior were updated to look for
kind:'constant' on `export const`, which is the correct kind. Full
suite: 380 passed (was 374 passed, 6 failed before this fix).
andreinknv added a commit to andreinknv/codegraph that referenced this pull request Apr 28, 2026
andreinknv added a commit to andreinknv/codegraph that referenced this pull request Apr 29, 2026
Adds Steps K-O to walk the new PRs in dependency order:
  K: bug-fix wave (clean):    colbymchenry#128, colbymchenry#129
  L: resolution + search:     colbymchenry#130 (resolve), colbymchenry#131 (resolve)
  M: extraction edges:        colbymchenry#134 (resolve)
  N: biomarker stack:         colbymchenry#132, colbymchenry#133 (both resolve, on top of colbymchenry#125)
  O: search advanced:         colbymchenry#135 (resolve, on top of colbymchenry#131)

Also flips colbymchenry#125 from merge_clean to merge_resolve - it now hits a
queries.ts conflict after the Phase-4 stack lands (colbymchenry#111/colbymchenry#112/colbymchenry#123/colbymchenry#124
all extend the same QueryBuilder surface, so colbymchenry#125's biomarker columns
no longer apply cleanly without a resolution).

Validated end-to-end against colbymchenry/main HEAD: script ran
clean through all 43 PRs, npm run build succeeded, full test
suite reports 877/877 passing (was 829 before this wave: +48 from
new tests added by the new PRs plus the reviewer-driven follow-ups).
@colbymchenry
Owner

Reviewed and merging.

Both bugs are real and the fix is the obviously-right shape — removing the redundant export_statement dispatch (the inner declaration's extractor handles it correctly with isExported propagating via parent-walk) and mirroring the single-file path's maxFileSize check on the bulk path. Net negative LOC, no regressions, auto-merges cleanly with main. Thanks.

@colbymchenry colbymchenry merged commit 4f6c51d into colbymchenry:main May 8, 2026
