fix(extraction): drop duplicate export-var nodes and honour maxFileSize in bulk path #129
Merged
colbymchenry merged 1 commit on May 8, 2026
…ze in bulk path
Two correctness bugs in the core extraction pipeline, surfaced by an
adversarial stress corpus (5k synthetic export-const declarations
plus a deliberate 8MB single-line file):
1) Every `export const X = ...` produced TWO nodes for the same
symbol — one kind:'variable' from extractExportedVariables, plus
one kind:'constant' from extractVariable (called when the walker
descended into the export_statement child). Stress test showed
100% duplication across 5,003 export-const declarations. The
dedicated extractVariable dispatch is the correct one — it picks
kind from isConst, captures the initializer signature, and walks
type annotations; the export-statement helper was redundant
because the language extractors' isExported predicate already
walks parent chains. Remove the export_statement branch from the
dispatch (children are descended into normally) and drop the
private helper.
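
To illustrate why a separate export_statement branch is unnecessary, here is a minimal sketch and not the project's actual code. The `SyntaxNode` and `GraphNode` shapes, the `link` helper, and the `variable_declaration` node type are assumptions made for illustration; only the names `extractVariable`, `isExported`, `isConst`, and `export_statement` come from the description above.

```typescript
// Hypothetical, simplified syntax-tree shape (the real walker operates on parser nodes).
interface SyntaxNode {
  type: string;
  name?: string;
  isConst?: boolean;
  parent?: SyntaxNode;
  children: SyntaxNode[];
}

interface GraphNode {
  name: string;
  kind: "constant" | "variable";
  exported: boolean;
}

// Walk up the parent chain looking for an export wrapper.
function isExported(node: SyntaxNode): boolean {
  for (let p = node.parent; p; p = p.parent) {
    if (p.type === "export_statement") return true;
  }
  return false;
}

// Single place where variable nodes are created; kind is derived from isConst.
function extractVariable(node: SyntaxNode): GraphNode {
  return {
    name: node.name ?? "<anonymous>",
    kind: node.isConst ? "constant" : "variable",
    exported: isExported(node),
  };
}

// No special case for export_statement: its children are descended into normally.
function walk(node: SyntaxNode, out: GraphNode[] = []): GraphNode[] {
  if (node.type === "variable_declaration") out.push(extractVariable(node));
  for (const child of node.children) walk(child, out);
  return out;
}

// Illustration-only helper that fills in parent links.
function link(node: SyntaxNode): SyntaxNode {
  for (const child of node.children) {
    child.parent = node;
    link(child);
  }
  return node;
}

// Models `export const X = ...`
const tree = link({
  type: "program",
  children: [
    {
      type: "export_statement",
      children: [
        { type: "variable_declaration", name: "X", isConst: true, children: [] },
      ],
    },
  ],
});

const nodes = walk(tree);
```

With only one dispatch site, the declaration yields a single node, and the export flag is still recovered from the parent chain.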
2) The bulk indexAll path read each file's stats but never compared
stats.size against config.maxFileSize. Vendored generated files
(multi-MB headers, minified bundles, etc.) were indexed regardless
of the user's size cap. The single-file extractFile path enforced
it; only the bulk path was missing the check. Mirror the
single-file behaviour: emit a 'size_exceeded' warning, count the
file as skipped, advance progress, and continue.
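
The skip behaviour described in (2) could look roughly like the sketch below. It is not the project's `indexAll` implementation: `FileEntry`, `IndexWarning`, `BulkResult`, `indexAllSketch`, and the `onProgress` callback are hypothetical names, and file stats are modelled as plain objects so the example stays self-contained.

```typescript
// Stands in for a path plus the size read from the file's stats.
interface FileEntry {
  path: string;
  size: number;
}

interface IndexWarning {
  path: string;
  code: "size_exceeded";
  size: number;
  limit: number;
}

interface BulkResult {
  indexed: string[];
  skipped: number;
  processed: number;
  warnings: IndexWarning[];
}

function indexAllSketch(
  files: FileEntry[],
  maxFileSize: number,
  onProgress?: (done: number, total: number) => void,
): BulkResult {
  const result: BulkResult = { indexed: [], skipped: 0, processed: 0, warnings: [] };

  for (const file of files) {
    if (file.size > maxFileSize) {
      // Same comparison the single-file path is described as making.
      result.warnings.push({
        path: file.path,
        code: "size_exceeded",
        size: file.size,
        limit: maxFileSize,
      });
      result.skipped++;
      result.processed++;
      onProgress?.(result.processed, files.length); // progress still advances
      continue;
    }

    result.indexed.push(file.path); // real extraction would happen here
    result.processed++;
    onProgress?.(result.processed, files.length);
  }

  return result;
}

const ONE_MB = 1024 * 1024;
const result = indexAllSketch(
  [
    { path: "src/a.ts", size: 2_000 },
    { path: "vendor/bundle.min.js", size: 8 * ONE_MB },
  ],
  ONE_MB,
);
```

Advancing progress before `continue` matters in this sketch: without it, a progress bar would never reach the total when files are skipped.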
On the stress workspace (5,005 synthetic files; 50,000 fns in one
3MB file; 8MB single-line file; 5,000 export-const declarations):
before: 65,014 nodes (100% var/const duplication, every >1MB file
indexed despite maxFileSize=1MB)
after: 10,008 nodes (0 duplicates, large files correctly skipped
with size_exceeded warnings)
Tests calibrated to the duplicate behavior were updated to look for
kind:'constant' on `export const`, which is the correct kind. Full
suite: 380 passed (was 374 passed, 6 failed before this fix).
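
A duplicate check of the kind used to measure "0 duplicates" could be sketched as follows. This is an assumption about how such a check might be written, not the test suite's actual code; `GraphNode`, `findDuplicateSymbols`, and the sample data are hypothetical.

```typescript
// Hypothetical minimal node shape for illustration.
interface GraphNode {
  filePath: string;
  name: string;
  kind: string;
}

// Group nodes by (file, name) and return only the groups with more than one entry.
function findDuplicateSymbols(nodes: GraphNode[]): Map<string, GraphNode[]> {
  const byKey = new Map<string, GraphNode[]>();
  for (const node of nodes) {
    const key = `${node.filePath}#${node.name}`;
    const group = byKey.get(key) ?? [];
    group.push(node);
    byKey.set(key, group);
  }

  const duplicates = new Map<string, GraphNode[]>();
  byKey.forEach((group, key) => {
    if (group.length > 1) duplicates.set(key, group);
  });
  return duplicates;
}

// Shape of the output before the fix: one symbol, two nodes with different kinds.
const before: GraphNode[] = [
  { filePath: "consts.ts", name: "X", kind: "variable" },
  { filePath: "consts.ts", name: "X", kind: "constant" },
];

// Shape of the output after the fix: one node, kind 'constant'.
const after: GraphNode[] = [{ filePath: "consts.ts", name: "X", kind: "constant" }];

const dupsBefore = findDuplicateSymbols(before);
const dupsAfter = findDuplicateSymbols(after);
```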
andreinknv added a commit to andreinknv/codegraph that referenced this pull request on Apr 28, 2026
andreinknv added a commit to andreinknv/codegraph that referenced this pull request on Apr 29, 2026
Adds Steps K-O to walk the new PRs in dependency order:
K: bug-fix wave (clean): colbymchenry#128, colbymchenry#129
L: resolution + search: colbymchenry#130 (resolve), colbymchenry#131 (resolve)
M: extraction edges: colbymchenry#134 (resolve)
N: biomarker stack: colbymchenry#132, colbymchenry#133 (both resolve, on top of colbymchenry#125)
O: search advanced: colbymchenry#135 (resolve, on top of colbymchenry#131)
Also flips colbymchenry#125 from merge_clean to merge_resolve - it now hits a queries.ts conflict after the Phase-4 stack lands (colbymchenry#111/colbymchenry#112/colbymchenry#123/colbymchenry#124 all extend the same QueryBuilder surface, so colbymchenry#125's biomarker columns no longer apply cleanly without a resolution).
Validated end-to-end against colbymchenry/main HEAD: script ran clean through all 43 PRs, npm run build succeeded, full test suite reports 877/877 passing (was 829 before this wave: +48 from new tests added by the new PRs plus the reviewer-driven follow-ups).
Owner
Reviewed and merging. Both bugs are real and the fix is the obviously-right shape: removing the redundant …
Summary
Two correctness bugs in the core extraction pipeline, surfaced by stress-testing against an adversarial corpus (5k synthetic export-const declarations plus an 8MB single-line file).

1. Every `export const` produces two duplicate nodes

`export const X = ...` was producing two nodes for the same symbol:
- `kind: 'variable'` from `extractExportedVariables`
- `kind: 'constant'` from `extractVariable` (invoked when the walker descended into the `export_statement` child)

Stress-testing showed 100% duplication across 5,003 `export const` declarations.

The dedicated `extractVariable` dispatch is the correct one: it picks `kind` from `isConst`, captures the initializer signature, and walks type annotations. The `extractExportedVariables` helper was redundant because each language extractor's `isExported` predicate already walks the parent chain to detect the export wrapper. Removed the `export_statement` branch from the walker dispatch (children are descended into normally) and dropped the private helper.

2. `maxFileSize` silently ignored on the bulk-index path

`extractFile()` (single-file API) checked `stats.size > config.maxFileSize`, but the bulk `indexAll()` path read each file's stats and never compared. Vendored generated files (multi-MB headers, minified bundles) were indexed regardless of the user's cap. Mirrored the single-file behaviour: emit a `size_exceeded` warning, count the file as skipped, advance progress, continue.

Test plan

- Verified live against a stress workspace (5,005 synthetic files; 50,000 fns in one 3MB file; 8MB single-line file; 5,000 `export const` declarations):
  `maxFileSize: 1MB`
- `npx vitest run`: 380 passed (was 374 passed, 6 failed before this fix; the 6 failing tests asserted the duplicate behavior, now updated to match the correct `kind: 'constant'`)
- `npx tsc --noEmit` passes
- `npm run build` succeeds
- `codegraph index` against stress corpus produces the expected node counts and `size_exceeded` warnings

🤖 Generated with Claude Code