Skip to content

refactor: per-language registry — eliminate cross-PR conflict surface for language additions#116

Open
andreinknv wants to merge 2 commits into
colbymchenry:mainfrom
andreinknv:refactor/language-registry
Open

refactor: per-language registry — eliminate cross-PR conflict surface for language additions#116
andreinknv wants to merge 2 commits into
colbymchenry:mainfrom
andreinknv:refactor/language-registry

Conversation

@andreinknv
Copy link
Copy Markdown
Contributor

@andreinknv andreinknv commented Apr 27, 2026

Reviewers: for a 5-minute review path that surfaces only what to verify, see the review checklist comment below. The original description is preserved here for full technical detail.


Summary

Adding a new language used to require coordinated edits to 6 shared lists across 4 files. Two PRs adding different languages (HCL, R, SQL, Scala, Vue, ReScript, mql5, …) typically conflicted on every one of them.

After this refactor, adding a new language is:

  1. Drop a file at src/extraction/languages/<name>.ts exporting an <NAME>_DEF: LanguageDef constant.
  2. Add one import line and one array entry to src/extraction/languages/registry.ts (alphabetical position — adjacent additions still possible but rare).

Everything else (grammars.ts, types.ts, tree-sitter.ts dispatch, DEFAULT_CONFIG.include) is derived from the registry.

What's a LanguageDef?

interface LanguageDef {
  name: string;                       // canonical id, used in Node/Edge.language
  displayName: string;                // "Pascal / Delphi"
  extensions: readonly string[];      // ['.pas', '.dpr', '.dfm', ...]
  includeGlobs: readonly string[];    // ['**/*.pas', ...]
  grammar?: {                         // tree-sitter-backed languages
    wasmFile: string;
    vendored?: boolean;
    extractor: LanguageExtractor;
  };
  customExtractor?: CustomExtractorFn;  // for Liquid, Svelte, etc.
  extensionOverrides?: Record<string, { customExtractor }>;  // Pascal .dfm/.fmx
}

Each existing language file now exports both its xxxExtractor (unchanged) AND a new XXX_DEF. New files were added for tsx, jsx, svelte, liquid — bringing all 19 currently-supported languages into the registry uniformly.

Refactored consumers

File Before After
src/extraction/grammars.ts Hardcoded WASM_GRAMMAR_FILES, EXTENSION_MAP, getLanguageDisplayName map All derived from registry. EXTENSION_MAP is now a Proxy that lazy-builds on first access (avoids TDZ in cyclic load paths). WASM_GRAMMAR_FILES removed (was internal-only).
src/extraction/tree-sitter.ts if (lang === 'svelte') ... if 'liquid' ... if 'pascal' && ... chain in extractFromSource One lookup: `def.extensionOverrides[ext]?.customExtractor
src/types.ts DEFAULT_CONFIG literal with hardcoded include globs per language Moved to src/default-config.ts; types.ts re-exports for backward compat. include array built lazily from LanguageDef.includeGlobs.
src/extraction/languages/index.ts Hardcoded EXTRACTORS: Partial<Record<Language, LanguageExtractor>> Now derived from registry.

What still requires a one-line edit

The Language string union in types.ts still hard-codes the known languages (typescript | javascript | … | unknown). New languages added to the registry work at runtime as strings, but adding the literal here is required IF the resolver wants exhaustive narrowing on the new language — src/resolution/index.ts and src/resolution/import-resolver.ts have a few language === 'X' branches. Most new languages don't need any.

Trade-off: strict narrowing preserved for the existing handful of language-specific code paths; everything else is registry-driven.

Tests

380/380 existing tests pass + 16 new structural-invariant tests in __tests__/language-registry.test.ts (added in the fix commit e43a618 after the reviewer pass — see below).

The 16 invariants assert that every derived consumer (EXTRACTORS, EXTENSION_MAP, detectLanguage, isLanguageSupported, getSupportedLanguages, getLanguageDisplayName) exactly mirrors the registry contents — drift between the registry and any derived view is impossible. Existing extraction.test.ts and pr19-improvements.test.ts continue to exercise detectLanguage, isLanguageSupported, getSupportedLanguages, loadAllGrammars, and the per-language extractors — all green.

Impact on open PRs

Every open language PR (#92 HCL, #94 R, #95 SQL, #91 Scala, #66 Vue, #58 ReScript, #57 mql5) currently conflicts with main + with each other on the 6 shared lists. After this lands, they each shrink to:

  • 1 new file: src/extraction/languages/<name>.ts
  • 2 lines in registry.ts (the import + the array entry)
  • 1 line in Language union if exhaustive narrowing is needed (rare)

That's it. Two language PRs no longer conflict on each other unless their alphabetical positions overlap in registry.ts (and even then, it's a trivial "add both lines" resolution).

Follow-ups (out of scope, tracked in commit message)

  • Auto-discovery in registry.ts (fs.readdirSync) — works in built dist/ but vite-node doesn't support extensionless require() of TS source. A small build-time generator could replace the static import list entirely.
  • Splitting __tests__/extraction.test.ts into per-language files — eliminates the test-end-of-file conflict surface that every language PR currently also hits.
  • Similar registry refactors for: MCP tool definitions, migration files, index/sync hooks. Each follows the same pattern (self-registering files, derived registries) and would unblock another class of PR conflicts.

Test plan

  • tsc --noEmit clean
  • npm test — 380/380 passing
  • End-to-end smoke: npm run build && node -e "require('./dist').getSupportedLanguages()" returns all 19 languages
  • DEFAULT_CONFIG.include built lazily — verified via require('./dist/types').DEFAULT_CONFIG.include.length === 29
  • Pascal .dfm files still route to DfmExtractor via extensionOverrides
  • Svelte and Liquid still route to their custom extractors

🤖 Generated with Claude Code

andreinknv and others added 2 commits April 27, 2026 16:27
Adding a new language used to require coordinated edits to 6
shared lists across 4 files (Language union in types.ts;
DEFAULT_CONFIG.include; WASM_GRAMMAR_FILES, EXTENSION_MAP, and
getLanguageDisplayName in grammars.ts; EXTRACTORS map in
languages/index.ts). Two PRs adding different languages typically
conflicted on every one of those.

After this refactor, adding a new language is:

  1. Drop a file at src/extraction/languages/<name>.ts exporting an
     <NAME>_DEF: LanguageDef constant.
  2. Add ONE import line and ONE array entry to
     src/extraction/languages/registry.ts (alphabetical position —
     adjacent additions are still possible but rare).

That's it. grammars.ts, types.ts, tree-sitter.ts dispatch, and the
default include globs are all derived from the registry.

## What's in a LanguageDef

```ts
interface LanguageDef {
  name: string;                  // canonical id
  displayName: string;           // "Pascal / Delphi"
  extensions: readonly string[]; // ['.pas', '.dpr', ...]
  includeGlobs: readonly string[];
  grammar?: { wasmFile, vendored?, extractor };  // tree-sitter
  customExtractor?: (fp, src) => ExtractionResult;  // Liquid, Svelte
  extensionOverrides?: { '.dfm': { customExtractor } };  // Pascal forms
}
```

Each existing language file now exports both its `xxxExtractor`
(unchanged) AND a new `XXX_DEF`. New files were added for tsx, jsx,
svelte, liquid (the latter two wrap their existing custom extractor
classes via the customExtractor field).

## Refactored consumers

- src/extraction/grammars.ts: WASM_GRAMMAR_FILES removed (was
  internal-only); EXTENSION_MAP now a Proxy that lazy-builds from
  the registry on first access (avoids TDZ in cyclic load paths).
  loadGrammarsForLanguages, isLanguageSupported, isGrammarLoaded,
  getSupportedLanguages, getLanguageDisplayName, detectLanguage —
  all read from registry.
- src/extraction/tree-sitter.ts: extractFromSource's if-chain
  (svelte / liquid / pascal+.dfm/.fmx) replaced with one lookup:
  def.extensionOverrides[ext]?.customExtractor || def.customExtractor.
  Drops direct imports of LiquidExtractor, SvelteExtractor,
  DfmExtractor.
- src/types.ts: DEFAULT_CONFIG moved to src/default-config.ts (cycle
  break). types.ts re-exports for backward compat. The `include`
  array is now built lazily from each LanguageDef's includeGlobs.

## What still requires a one-line edit

The Language string union in types.ts still hard-codes the known
languages (typescript | javascript | … | unknown). New languages
added to the registry work at runtime as strings, but adding the
literal here is required IF the resolver wants to do exhaustive
narrowing on the new language (resolution/index.ts and
resolution/import-resolver.ts have a few `language === 'X'`
branches). Most new languages don't need such branches.

This trade-off keeps strict narrowing for the existing handful of
language-specific code paths while making everything else
registry-driven.

## Tests

380/380 pass. No new tests; behavior is identical. Existing
extraction.test.ts and pr19-improvements.test.ts heavily exercise
detectLanguage, isLanguageSupported, getSupportedLanguages, and
loadAllGrammars — all green.

## Follow-ups (out of scope)

- Auto-discovery in registry.ts via fs.readdirSync — works in
  built dist/ but vite-node doesn't support extensionless require()
  of TS source. A small build-time generator could remove the
  static import list entirely.
- Splitting __tests__/extraction.test.ts into per-language test
  files — eliminates the test-end-of-file conflict surface that
  every language PR currently hits.
- Similar registry refactors for:
  - MCP tool definitions (each tool self-registers; no shared
    tools[] array or case-switch in execute())
  - Migration files (each migration in src/db/migrations/NNN-*.ts;
    auto-discovered by version)
  - Index/sync hooks (centrality, churn, issue-history,
    config-refs, sql-refs, cochange all currently mutate
    CodeGraph.indexAll/sync; an IndexHook interface would make
    each pass self-contained)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tractor

Reviewer caught a real bug: the original commit kept the
EXTRACTORS map in src/extraction/languages/index.ts as a separate
hand-curated registry that TreeSitterExtractor read from. Adding
a new grammar-backed language would have required editing
EXTRACTORS too, undermining the refactor's stated single-source-of-
truth claim. A future contributor missing the EXTRACTORS update
would silently produce empty extraction results.

Fix:
- TreeSitterExtractor now reads its extractor straight off the
  language def: getLanguageDefByName(this.language)?.grammar?.extractor
- EXTRACTORS in languages/index.ts becomes a Proxy that derives
  lazily from the registry (kept for backward compat — readers
  unchanged).
- Add 16 structural-invariant tests in __tests__/language-registry.test.ts
  that fail loudly if any derived consumer drifts from the registry:
  EXTRACTORS / EXTENSION_MAP / detectLanguage / isLanguageSupported /
  getSupportedLanguages / getLanguageDisplayName all asserted to
  exactly mirror the registry contents.

Adding a new grammar-backed language is now genuinely "one new file
+ two lines in registry.ts" — no other files to touch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@andreinknv
Copy link
Copy Markdown
Contributor Author

Reviewer caught a real bug — the original commit left a parallel EXTRACTORS map in src/extraction/languages/index.ts that TreeSitterExtractor read from. Adding a new grammar-backed language would have required editing it too, undermining the stated "single source of truth."

Pushed fix in e43a618:

  • TreeSitterExtractor now reads getLanguageDefByName(this.language)?.grammar?.extractor directly.
  • EXTRACTORS becomes a Proxy that derives lazily from the registry (kept for back-compat; no new readers needed).
  • Added __tests__/language-registry.test.ts (16 tests) asserting that every derived consumer (EXTRACTORS, EXTENSION_MAP, detectLanguage, isLanguageSupported, getSupportedLanguages, getLanguageDisplayName) exactly mirrors the registry. Future drift fails CI loudly.

396/396 tests pass.

@andreinknv
Copy link
Copy Markdown
Contributor Author

Part of a coordinated 4-PR refactor that unblocks the open PR backlog. See #120 for the full merge-order guide explaining how this PR fits into the picture and how each open language/feature PR rebases onto the new pattern.

andreinknv added a commit to andreinknv/codegraph that referenced this pull request Apr 27, 2026
Originally based on monolithic grammars.ts/tree-sitter.ts; rebased
to the per-language registry pattern (PR colbymchenry#116):
- src/extraction/languages/hcl.ts (LanguageDef with grammar+custom)
- 'hcl' added to Language union
- Vendored tree-sitter-hcl.wasm
- HclExtractor (custom block extractor)
- 220 extraction tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit to andreinknv/codegraph that referenced this pull request Apr 27, 2026
Sections 3, 5a, 5b previously taught the monolithic-file pattern that
PR colbymchenry#116 obsoletes. After colbymchenry#116, adding a language is one new file in
src/extraction/languages/ + 2 lines in registry.ts (and an optional
1-line addition to the Language union for TypeScript narrowing).

Updated:
- Section 3: full rewrite. Was 3-file mutation (types.ts, grammars.ts,
  CLAUDE.md). Now: 1 LanguageDef file + registry import + Language
  union entry. Includes a "why per-file" sidebar pointing at the
  cross-PR conflict bottleneck the registry resolves.
- Section 5a: drops the EXTRACTORS-map registration step. The
  extractor is referenced from the LanguageDef directly.
- Section 5b: drops the tree-sitter.ts dispatch wiring. customExtractor
  on the LanguageDef takes the dispatch — no per-language if-branches.

Section 1 (sourcing wasm), Section 2 (probing AST), Sections 6/7/8
(NodeKind mapping, tests, PR checklist), and the existing-extractors
reference table are unchanged — those parts of the workflow didn't
change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@andreinknv
Copy link
Copy Markdown
Contributor Author

andreinknv commented Apr 28, 2026

Maintainer review checklist (5-min path)

This refactor is behavior-preserving. Below is the shortest path to verify that without re-reading the 26-file patch.

1. What changed structurally

  • New src/extraction/languages/types.ts (83 lines) — LanguageDef interface
  • New src/extraction/languages/registry.ts (108 lines) — alphabetised barrel + lookup helpers (getLanguageDefByName, getLanguageDefByExtension)
  • 17 language files under src/extraction/languages/ — most are existing language files extended to also export <NAME>_DEF: LanguageDef; 4 are brand new (tsx, jsx, svelte, liquid). Svelte/liquid wrap their existing custom-extractor classes via the customExtractor field
  • Shrunk src/types.ts (-205 lines): DEFAULT_CONFIG moved to src/default-config.ts to break a load cycle; include[] now built lazily from each LanguageDef.includeGlobs
  • Shrunk src/extraction/grammars.ts (~204 lines reorganised): WASM_GRAMMAR_FILES removed; EXTENSION_MAP is now a Proxy that lazy-builds from the registry; detectLanguage, isLanguageSupported, getSupportedLanguages, getLanguageDisplayName all read from registry
  • Shrunk src/extraction/tree-sitter.ts: the if-chain (svelte/liquid/pascal+.dfm/.fmx) replaced with one lookup: def.extensionOverrides[ext]?.customExtractor || def.customExtractor

2. What stays exactly the same — verify in seconds

Public surface Where Status
Language string union src/types.ts:58 unchanged (intentionally — see §4)
DEFAULT_CONFIG src/types.ts re-exports from src/default-config.ts re-exported, same shape
EXTENSION_MAP src/extraction/grammars.ts:52 now a Proxy, same Record<string, Language> API
EXTRACTORS src/extraction/languages/index.ts now a Proxy derived from registry, same shape
detectLanguage, isLanguageSupported, getSupportedLanguages, getLanguageDisplayName, loadAllGrammars, etc. src/extraction/grammars.ts unchanged signatures

Existing imports keep working unchanged. The DEFAULT_CONFIG move is the only file relocation, and types.ts re-exports it.

3. Invariants now enforced by tests (__tests__/language-registry.test.ts, 16 tests)

  • ✅ Every registered extension is unique (no two languages claim the same .ext)
  • ✅ Every grammar-backed def has wasmFile + extractor; custom-extractor defs have customExtractor
  • EXTRACTORS exactly mirrors grammar-backed languages — drift between the proxied derivation and the registry fails loudly
  • ✅ Every grammar-backed EXTRACTORS[lang] is def.grammar.extractor (same reference, not a copy) — the bug from the reviewer pass (see §4) cannot regress
  • EXTENSION_MAP entries exactly mirror registry extensions
  • detectLanguage, isLanguageSupported, getSupportedLanguages, getLanguageDisplayName all asserted to mirror registry contents
  • ✅ Pascal extensionOverrides routes .dfm/.fmx to its custom extractor — covers the only language using overrides today
  • ✅ Regression guard: at least 19 languages registered

If a future PR adds a language but forgets to add it to the registry, every one of these tests fails — pointing at the missing registration directly.

4. Reviewer pass — caught and fixed before this comment

The fix commit (e43a618) addresses one real bug:

  • The original commit kept EXTRACTORS as a separate hand-curated map in languages/index.ts that TreeSitterExtractor read from. Adding a grammar-backed language would have required editing EXTRACTORS too — undermining the "one file + two lines" claim. A future contributor missing it would silently produce empty extraction results.
  • Fix: TreeSitterExtractor now reads its extractor straight off the def: getLanguageDefByName(this.language)?.grammar?.extractor. EXTRACTORS retained as a Proxy derived from the registry for backward compat.
  • Guard: invariant test windows 11 didn't install, won't start... #4 above asserts EXTRACTORS[lang] is the same reference as def.grammar.extractor — silent drift impossible.

5. One honest caveat

The Language string union in src/types.ts:58 still hard-codes the known languages. This is intentional: a few language === 'X' branches in resolution/index.ts and resolution/import-resolver.ts use exhaustive narrowing. New languages added to the registry work at runtime as strings — adding the literal here is required ONLY if the new language wants strict-narrowing branches in resolution. Most new languages don't need them.

6. Mergeability

380/380 existing tests pass; the 16 new invariant tests pass. If §1-§4 look right, this is safe to land.

🤖 Comment template generated with Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant