feat(search): subword tokens + Porter stemmer + stopword filter for FTS #104

Open
andreinknv wants to merge 1 commit into colbymchenry:main from andreinknv:feat/fts-search-quality

@andreinknv
Contributor

Context

Embeddings were removed in 453c39d (issue #87). Since then, all search quality has to come from FTS. The original PR #74 documented several queries where FTS-only badly trailed semantic search — the dominant cause is that SQLite's default tokenizer treats getParser as one indivisible token, so a query for parser never matches it at the FTS layer no matter how clever the post-hoc rescorer is.

This PR closes most of that gap with three compounding, model-free changes.

What changed

  1. Subword tokens. New name_subwords column on nodes populated with the camel/snake split of the identifier (kept alongside the original) and indexed by FTS5 at weight 10×. A query for parser now finds getParser at the FTS layer, not just via post-hoc rescoring on the limited candidate set BM25 surfaces.

  2. Porter stemmer. tokenize="porter unicode61" on the FTS table collapses morphological variants: parse / parsing / parses all stem to pars, so a natural-language query matches identifier subwords and docstring prose alike.

  3. Stopword stripping. searchNodesFTS filters stopwords from the query before the OR-join. Without this, words like how / does / the become OR'd FTS hits against any prose-bearing docstring and crowd out the actually-relevant identifier tokens. Reuses the existing STOP_WORDS set in src/search/query-utils.ts via a new shared filterStopwords helper (no second source of truth).
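
The subword split in item 1 can be illustrated with a minimal sketch. The helper names `splitIdentifierTokens` and `buildNameSubwords` come from this PR, but their bodies are not shown here, so everything below is an assumption: the regexes, the lowercasing, and the choice to store the original identifier first in the same string are one plausible reading of the description, not the actual code in src/utils.ts.

```typescript
// Hypothetical sketch only: the PR names these helpers but does not show their bodies.

// Split an identifier on snake/kebab separators and camelCase boundaries.
function splitIdentifierTokens(identifier: string): string[] {
  return identifier
    // Separators such as "_" or "-" become spaces (resolve_one -> "resolve one").
    .replace(/[^A-Za-z0-9]+/g, " ")
    // Lower/digit followed by upper marks a camelCase boundary (getParser -> "get Parser").
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2")
    // An acronym followed by a capitalized word (HTTPServer -> "HTTP Server").
    .replace(/([A-Z]+)([A-Z][a-z])/g, "$1 $2")
    .split(" ")
    .filter((token) => token.length > 0)
    .map((token) => token.toLowerCase());
}

// Build the text stored in name_subwords: original identifier plus its parts, deduplicated.
// Assumption: the original is kept in the same string and everything is lowercased.
function buildNameSubwords(identifier: string): string {
  const tokens = [identifier.toLowerCase()].concat(splitIdentifierTokens(identifier));
  // Keep only the first occurrence so "parse" does not become "parse parse".
  return tokens.filter((token, index) => tokens.indexOf(token) === index).join(" ");
}
```

Under these assumptions, `buildNameSubwords("getParser")` yields `"getparser get parser"`, which gives FTS5 a standalone `parser` token to match, and `buildNameSubwords("parse")` yields just `"parse"`.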

Empirical results

Bench: codegraph's own src/, 1242 nodes, 71 files. Same query set as PR #74's evidence. Lower rank = better.

| Query | baseline (current main) | this PR |
|---|---:|---:|
| ExtractionOrchestrator | #1 | #1 |
| how does file parsing work (target: getParser) | NOT FOUND in 20 | #2 |
| database connection management (target: DatabaseConnection) | #18 | #1 |
| resolves references between modules (target: resolveOne) | #19 | #2 |

Mean rank: ~14 → 1.5. Same workload, no model dependencies, no install-time changes, no Node version constraints.

What I tested and rejected

Concept-mode docstring re-weighting (swap BM25 weights to favor docstring 20× when query contains stopwords): regressed how does file parsing work from #4 to #11 because amplifying docstring weight floods results with prose-keyword spam more than it lifts truly relevant prose. Not included.

Migration v4

Existing v3 databases get migrated by:

  • Adding the name_subwords column to nodes (idempotent PRAGMA table_info guard so a re-run after partial DDL failure doesn't fail with "duplicate column")
  • Dropping the old FTS table + triggers (FTS5 tokenizer cannot be ALTERed)
  • Recreating FTS without triggers
  • Backfilling name_subwords for every existing node
  • Rebuilding the FTS index in one shot via INSERT INTO nodes_fts(nodes_fts) VALUES('rebuild')
  • Recreating the triggers afterward (so they don't fire mid-backfill — earlier prototype runs corrupted FTS5 when triggers fired against half-populated shadow tables)
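
The ordering and the idempotent guard above can be sketched as a pure function over the rows that `PRAGMA table_info(nodes)` returns. This is a hedged illustration, not the code in src/db/migrations.ts: `planMigrationV4`, `hasColumn`, and the `TEXT` column type are hypothetical, and steps whose exact SQL is not given in this description are left as `--` placeholder lines.

```typescript
// Hypothetical sketch of the v4 step order; not the PR's actual migration code.

// One row of `PRAGMA table_info(nodes)`; only the `name` field matters here.
interface TableInfoRow {
  name: string;
}

function hasColumn(rows: readonly TableInfoRow[], column: string): boolean {
  return rows.some((row) => row.name === column);
}

// Returns the ordered steps. Entries starting with "--" stand in for SQL
// that this description does not spell out.
function planMigrationV4(nodesTableInfo: readonly TableInfoRow[]): string[] {
  const steps: string[] = [];
  // Guard: skip the ALTER if a previous, partially applied run already added the column.
  if (!hasColumn(nodesTableInfo, "name_subwords")) {
    // Assumption: the column type is TEXT.
    steps.push("ALTER TABLE nodes ADD COLUMN name_subwords TEXT");
  }
  steps.push(
    "-- drop the old nodes_fts table and its triggers",
    "-- recreate nodes_fts with the porter tokenizer, without triggers",
    "-- backfill nodes.name_subwords for every existing row",
    "INSERT INTO nodes_fts(nodes_fts) VALUES('rebuild')",
    "-- recreate the triggers"
  );
  return steps;
}
```

Calling `planMigrationV4` with rows that already include `name_subwords` returns the same plan minus the `ALTER TABLE`, which is the re-run case after a crash that landed only the column addition. The triggers come last so that nothing fires while the backfill and rebuild are still in progress.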

CURRENT_SCHEMA_VERSION 3 → 4. Two existing version-pinned tests updated.

Files changed

| File | Change |
|---|---|
| `src/utils.ts` | Add splitIdentifierTokens, buildNameSubwords (with dedup) |
| `src/search/query-utils.ts` | Add shared filterStopwords helper using existing STOP_WORDS |
| `src/db/schema.sql` | Add name_subwords column, add it to nodes_fts, add tokenize="porter unicode61", update three triggers |
| `src/db/migrations.ts` | Bump version to 4; add migration v4 with idempotent ALTER guard |
| `src/db/queries.ts` | Populate name_subwords on insert/update; new BM25 weights (0,20,5,1,2,10); stopword filter in searchNodesFTS |
| `__tests__/foundation.test.ts`, `__tests__/pr19-improvements.test.ts` | Update expected schema version |
| `__tests__/search-quality.test.ts` (NEW) | 21 regression tests: helpers, end-to-end search, full v3-to-v4 migration with a real v3-shape DB, and migration idempotency on partial-DDL re-run |

Test plan

  • npm test: 404/404 pass on macOS (one pre-existing fs.watch flake under parallel load, passes in isolation — unrelated)
  • npx tsc --noEmit clean
  • Bench script confirms the rank lifts in the table above
  • Independent reviewer pass before pushing — addressed three findings:
    • merged duplicate stopword sets (now uses STOP_WORDS from query-utils.ts)
    • dedup tokens in buildNameSubwords (parse no longer stores "parse parse")
    • made migration idempotent on partial-DDL re-run (re-running after a crash that landed only ALTER TABLE no longer fails)

Notes for review

🤖 Generated with Claude Code

The codebase no longer ships embeddings (commit 453c39d), so all search
quality has to come from FTS. The maintainer's evidence in PR #74
documented several queries where FTS-only badly trailed semantic search
because the SQLite default tokenizer treats `getParser` as a single
indivisible token. Three changes that compound to fix that:

1. **Subword tokens.** New `name_subwords` column on `nodes` populated
   with the camel/snake split of the identifier (kept alongside the
   original) and indexed by FTS5 at weight 10x. A query for `parser`
   now finds `getParser` at the FTS layer, not just via post-hoc
   rescoring on the limited candidate set BM25 surfaces.

2. **Porter stemmer.** `tokenize="porter unicode61"` on the FTS table
   collapses morphological variants: `parse`/`parsing`/`parses` all
   stem to `pars`, so a natural-language query matches identifier subwords
   and docstring prose alike.

3. **Stopword stripping.** `searchNodesFTS` now filters stopwords from
   the query before constructing the OR-join. Without this, words like
   `how` / `does` / `the` become OR'd FTS hits against any prose-bearing
   docstring and crowd out the actually-relevant identifier tokens.
   Reuses the existing `STOP_WORDS` set in src/search/query-utils.ts via
   a new shared `filterStopwords` helper.
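
The stopword step in item 3 can be sketched as follows. Only the names `STOP_WORDS` and `filterStopwords` come from this PR; the word list, the helper's signature, the fallback when every token is a stopword, and `buildFtsOrQuery` itself are assumptions made for illustration.

```typescript
// Hypothetical sketch: the real STOP_WORDS set lives in src/search/query-utils.ts
// and its contents are not listed in this PR description.
const STOP_WORDS: ReadonlySet<string> = new Set<string>([
  "how", "does", "the", "a", "an", "is", "of", "to", "in",
]);

// Drop stopwords, but fall back to the original tokens if nothing would remain
// (assumption: an all-stopword query should still search for something).
function filterStopwords(tokens: string[]): string[] {
  const kept = tokens.filter((token) => !STOP_WORDS.has(token.toLowerCase()));
  return kept.length > 0 ? kept : tokens;
}

// Build an FTS5 MATCH expression that ORs the remaining tokens together.
function buildFtsOrQuery(query: string): string {
  const tokens = query.split(/\s+/).filter((token) => token.length > 0);
  return filterStopwords(tokens)
    // Quote each token as an FTS5 string; embedded double quotes are doubled.
    .map((token) => `"${token.replace(/"/g, '""')}"`)
    .join(" OR ");
}
```

With this sketch, `buildFtsOrQuery("how does file parsing work")` produces `"file" OR "parsing" OR "work"`, so `how` and `does` no longer pull in every docstring that happens to contain them.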

## Empirical results (codegraph's own src/, 1242 nodes, 71 files)

| Query | baseline rank | this PR rank |
|---|---:|---:|
| `ExtractionOrchestrator` | 1 | 1 |
| `how does file parsing work` | NOT FOUND in 20 | 2 |
| `database connection management` | 18 | 1 |
| `resolves references between modules` | 19 | 2 |

Mean rank: ~14 → 1.5.

Concept-mode docstring re-weighting was tested as a fourth lever and
rejected — it regressed `how does file parsing work` because amplifying
docstring weight floods the result list with prose-keyword spam more
than it lifts truly relevant prose. Not included.

## Migration v4

Existing v3 databases get migrated by:
  - Adding the `name_subwords` column to `nodes` (idempotent guard so a
    re-run after partial DDL failure doesn't fail with "duplicate column")
  - Dropping the old FTS table + triggers (tokenize cannot be ALTERed)
  - Recreating FTS without triggers
  - Backfilling name_subwords for every existing node
  - Rebuilding the FTS index in one shot via `INSERT INTO nodes_fts(nodes_fts) VALUES('rebuild')`
  - Recreating the triggers afterward (so they don't fire mid-backfill,
    which corrupted FTS5 in earlier prototype runs)

## Files changed

| File | Change |
|---|---|
| `src/utils.ts` | Add `splitIdentifierTokens`, `buildNameSubwords` |
| `src/search/query-utils.ts` | Add shared `filterStopwords` helper using existing STOP_WORDS |
| `src/db/schema.sql` | Add `name_subwords` column, add it to nodes_fts, add `tokenize="porter unicode61"`, update triggers |
| `src/db/migrations.ts` | Bump version to 4; add migration v4 with idempotent ALTER guard |
| `src/db/queries.ts` | Populate name_subwords on insert/update; new BM25 weights; stopword filter in searchNodesFTS |
| `__tests__/foundation.test.ts`, `__tests__/pr19-improvements.test.ts` | Update expected schema version |
| `__tests__/search-quality.test.ts` | 21 regression tests including helpers, end-to-end search, full v3-to-v4 migration, and migration idempotency |

## Test plan

- [x] `npm test`: 404/404 pass on macOS (one pre-existing fs.watch flake under parallel load, passes in isolation)
- [x] `npx tsc --noEmit` clean
- [x] Bench script confirms targets at #18, #19, and NOT FOUND on baseline jump to #1, #2, #2 with this PR
- [x] Independent reviewer pass before pushing — addressed three findings:
  - merged duplicate stopword sets (now uses STOP_WORDS from query-utils.ts)
  - dedup tokens in buildNameSubwords (`parse` no longer stores `parse parse`)
  - made migration idempotent on partial-DDL re-run

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>