feat(search): subword tokens + Porter stemmer + stopword filter for FTS by andreinknv · Pull Request #104 · colbymchenry/codegraph

andreinknv · 2026-04-26T17:36:42Z

Context

Embeddings were removed in 453c39d (issue #87). Since then, all search quality has to come from FTS. The original PR #74 documented several queries where FTS-only badly trailed semantic search — the dominant cause is that SQLite's default tokenizer treats getParser as one indivisible token, so a query for parser never matches it at the FTS layer no matter how clever the post-hoc rescorer is.

This PR closes most of that gap with three compounding, model-free changes.

What changed

Subword tokens. New name_subwords column on nodes populated with the camel/snake split of the identifier (kept alongside the original) and indexed by FTS5 at weight 10×. A query for parser now finds getParser at the FTS layer, not just via post-hoc rescoring on the limited candidate set BM25 surfaces.
Porter stemmer. tokenize="porter unicode61" on the FTS table collapses morphological variants — parser / parsing / parses all stem to pars so a natural-language query matches identifier subwords and docstring prose alike.
Stopword stripping. searchNodesFTS filters stopwords from the query before the OR-join. Without this, words like how / does / the become OR'd FTS hits against any prose-bearing docstring and crowd out the actually-relevant identifier tokens. Reuses the existing STOP_WORDS set in src/search/query-utils.ts via a new shared filterStopwords helper (no second source of truth).
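
As a rough illustration of the subword idea, here is a minimal sketch of how an identifier could be split and stored. The function names match the helpers listed under "Files changed", but the bodies are assumptions and the real code in src/utils.ts may differ (for example in how it handles acronyms or casing).

```typescript
// Hypothetical sketch: the PR's real helpers live in src/utils.ts and may differ.

/** Split an identifier on non-alphanumeric separators and camelCase boundaries. */
function splitIdentifierTokens(identifier: string): string[] {
  return identifier
    // "getParser" -> "get Parser": lowercase letter or digit followed by uppercase
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2")
    // "HTTPServer" -> "HTTP Server": acronym followed by a capitalised word
    .replace(/([A-Z]+)([A-Z][a-z])/g, "$1 $2")
    .split(/[^A-Za-z0-9]+/)
    .filter((t) => t.length > 0)
    .map((t) => t.toLowerCase());
}

/** Original name plus its subwords, de-duplicated and space-joined for an FTS column. */
function buildNameSubwords(identifier: string): string {
  const tokens = [identifier.toLowerCase()].concat(splitIdentifierTokens(identifier));
  // Keep only the first occurrence of each token, so "parse" is stored once.
  return tokens.filter((t, i) => tokens.indexOf(t) === i).join(" ");
}
```

With this sketch, `buildNameSubwords("getParser")` yields `"getparser get parser"`, so the token `parser` is present in the indexed column, and `buildNameSubwords("parse")` yields `"parse"` rather than `"parse parse"`.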

Empirical results

Bench: codegraph's own src/, 1242 nodes, 71 files. Same query set as PR #74's evidence. Lower rank = better.

Query	baseline (current main)	this PR
`ExtractionOrchestrator`	#1	#1
`how does file parsing work` (target: `getParser`)	NOT FOUND in 20	#2
`database connection management` (target: `DatabaseConnection`)	#18	#1
`resolves references between modules` (target: `resolveOne`)	#19	#2

Mean rank: ~14 → 1.5. Same workload, no model dependencies, no install-time changes, no Node version constraints.
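
The mean-rank figures can be reproduced from the table. This is a sketch under one assumption: "NOT FOUND in 20" has no rank, so it is counted as 20 here, which makes the baseline mean a lower bound.

```typescript
// Ranks copied from the table above. "NOT FOUND in 20" is counted as 20
// (an assumption; the true rank is at least 21).
const NOT_FOUND_FLOOR = 20;
const baselineRanks = [1, NOT_FOUND_FLOOR, 18, 19];
const thisPrRanks = [1, 2, 1, 2];

function meanRank(ranks: number[]): number {
  return ranks.reduce((sum, r) => sum + r, 0) / ranks.length;
}

const baselineMean = meanRank(baselineRanks); // (1 + 20 + 18 + 19) / 4 = 14.5
const thisPrMean = meanRank(thisPrRanks);     // (1 + 2 + 1 + 2) / 4 = 1.5
```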

What I tested and rejected

Concept-mode docstring re-weighting (swap BM25 weights to favor docstring 20× when query contains stopwords): regressed how does file parsing work from #4 to #11 because amplifying docstring weight floods results with prose-keyword spam more than it lifts truly relevant prose. Not included.

Migration v4

Existing v3 databases get migrated by:

Adding the name_subwords column to nodes (idempotent PRAGMA table_info guard so a re-run after partial DDL failure doesn't fail with "duplicate column")
Dropping the old FTS table + triggers (FTS5 tokenizer cannot be ALTERed)
Recreating FTS without triggers
Backfilling name_subwords for every existing node
Rebuilding the FTS index in one shot via INSERT INTO nodes_fts(nodes_fts) VALUES('rebuild')
Recreating the triggers afterward (so they don't fire mid-backfill — earlier prototype runs corrupted FTS5 when triggers fired against half-populated shadow tables)
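
The ordering of these steps can be sketched as follows. This is an illustration under stated assumptions, not the code in src/db/migrations.ts: the `MigrationDb` interface, the trigger names (`nodes_ai`, `nodes_ad`, `nodes_au`), the reduced three-column FTS shape, and the use of an external-content table are all hypothetical.

```typescript
// Hypothetical minimal database surface; the repo's real driver may differ.
interface MigrationDb {
  exec(sql: string): void;
  all(sql: string): Array<Record<string, unknown>>;
  run(sql: string, params: unknown[]): void;
}

// Reduced, assumed FTS shape: the real nodes_fts indexes more columns.
const CREATE_FTS = `
  CREATE VIRTUAL TABLE nodes_fts USING fts5(
    name, docstring, name_subwords,
    content='nodes', content_rowid='rowid',
    tokenize="porter unicode61"
  )`;

// Assumed trigger set that keeps the external-content index in sync.
const CREATE_TRIGGERS = `
  CREATE TRIGGER nodes_ai AFTER INSERT ON nodes BEGIN
    INSERT INTO nodes_fts(rowid, name, docstring, name_subwords)
    VALUES (new.rowid, new.name, new.docstring, new.name_subwords);
  END;
  CREATE TRIGGER nodes_ad AFTER DELETE ON nodes BEGIN
    INSERT INTO nodes_fts(nodes_fts, rowid, name, docstring, name_subwords)
    VALUES ('delete', old.rowid, old.name, old.docstring, old.name_subwords);
  END;
  CREATE TRIGGER nodes_au AFTER UPDATE ON nodes BEGIN
    INSERT INTO nodes_fts(nodes_fts, rowid, name, docstring, name_subwords)
    VALUES ('delete', old.rowid, old.name, old.docstring, old.name_subwords);
    INSERT INTO nodes_fts(rowid, name, docstring, name_subwords)
    VALUES (new.rowid, new.name, new.docstring, new.name_subwords);
  END;`;

function migrateV3ToV4(db: MigrationDb, buildSubwords: (name: string) => string): void {
  // 1. Idempotent ALTER: skip if a partially failed earlier run already added the column.
  const columns = db.all("PRAGMA table_info(nodes)").map((row) => row.name);
  if (columns.indexOf("name_subwords") === -1) {
    db.exec("ALTER TABLE nodes ADD COLUMN name_subwords TEXT");
  }
  // 2. Drop the triggers and the old FTS table; the tokenizer cannot be changed in place.
  for (const trigger of ["nodes_ai", "nodes_ad", "nodes_au"]) {
    db.exec(`DROP TRIGGER IF EXISTS ${trigger}`);
  }
  db.exec("DROP TABLE IF EXISTS nodes_fts");
  // 3. Recreate FTS with no triggers attached yet.
  db.exec(CREATE_FTS);
  // 4. Backfill name_subwords for every existing node.
  for (const row of db.all("SELECT rowid, name FROM nodes")) {
    db.run("UPDATE nodes SET name_subwords = ? WHERE rowid = ?", [
      buildSubwords(String(row.name)),
      row.rowid,
    ]);
  }
  // 5. Rebuild the index from the content table in one shot.
  db.exec("INSERT INTO nodes_fts(nodes_fts) VALUES('rebuild')");
  // 6. Only now recreate the triggers, so none fire during the backfill.
  db.exec(CREATE_TRIGGERS);
}
```

Running the whole function inside a single transaction would be a natural addition, but whether the real migration does so is not stated here.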

CURRENT_SCHEMA_VERSION 3 → 4. Two existing version-pinned tests updated.

Files changed

File	Change
`src/utils.ts`	Add `splitIdentifierTokens`, `buildNameSubwords` (with dedup)
`src/search/query-utils.ts`	Add shared `filterStopwords` helper using existing `STOP_WORDS`
`src/db/schema.sql`	Add `name_subwords` column, add it to `nodes_fts`, add `tokenize="porter unicode61"`, update three triggers
`src/db/migrations.ts`	Bump version to 4; add migration v4 with idempotent ALTER guard
`src/db/queries.ts`	Populate `name_subwords` on insert/update; new BM25 weights `(0,20,5,1,2,10)`; stopword filter in `searchNodesFTS`
`__tests__/foundation.test.ts`, `__tests__/pr19-improvements.test.ts`	Update expected schema version
`__tests__/search-quality.test.ts` (NEW)	21 regression tests: helpers, end-to-end search, full v3-to-v4 migration with a real v3-shape DB, and migration idempotency on partial-DDL re-run
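
The stopword filter and BM25 weights listed for `src/db/queries.ts` could be combined roughly as follows. This is a sketch, not the repo's `searchNodesFTS`: the stopword sample, the `buildMatchQuery` helper, and the fallback when every term is a stopword are assumptions. The weights are the ones quoted above, one per FTS column in declaration order; the final `10` is presumably the `name_subwords` column.

```typescript
// Hypothetical sample; the real STOP_WORDS set lives in src/search/query-utils.ts.
const STOP_WORDS_SAMPLE = ["how", "does", "the", "a", "an", "is", "of", "to", "in"];

function filterStopwords(terms: string[]): string[] {
  const kept = terms.filter((t) => STOP_WORDS_SAMPLE.indexOf(t.toLowerCase()) === -1);
  // Assumed fallback: if every term is a stopword, keep the original terms
  // so the MATCH expression is never empty.
  return kept.length > 0 ? kept : terms;
}

/** Turn free text into an FTS5 MATCH expression of OR-joined quoted terms. */
function buildMatchQuery(userQuery: string): string {
  const terms = userQuery.split(/[^A-Za-z0-9_]+/).filter((t) => t.length > 0);
  // Quoting makes FTS5 read each term as a string rather than as query syntax.
  return filterStopwords(terms).map((t) => `"${t}"`).join(" OR ");
}

// bm25() returns lower scores for better matches, so ascending order ranks best first.
const SEARCH_SQL = `
  SELECT rowid, bm25(nodes_fts, 0, 20, 5, 1, 2, 10) AS score
  FROM nodes_fts
  WHERE nodes_fts MATCH ?
  ORDER BY score
  LIMIT ?`;
```

For the query `how does file parsing work`, this sketch produces `"file" OR "parsing" OR "work"`, so the docstring-heavy words `how` and `does` no longer contribute hits.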

Test plan

npm test: 404/404 pass on macOS (one pre-existing fs.watch flake under parallel load, passes in isolation — unrelated)
npx tsc --noEmit clean
Bench script confirms the rank lifts in the table above
Independent reviewer pass before pushing — addressed three findings:
- merged duplicate stopword sets (now uses STOP_WORDS from query-utils.ts)
- dedup tokens in buildNameSubwords (parse no longer stores "parse parse")
- made migration idempotent on partial-DDL re-run (re-running after a crash that landed only ALTER TABLE no longer fails)

Notes for review

This PR is independent of #102 (fix(db): enforce UNIQUE on edges so INSERT OR IGNORE actually dedupes), which also bumps the schema version to 4. If #102 lands first, I'll rebase to v5.
No new dependencies. No install-time changes. No Node version constraints.
The change is gradual — fresh installs hit the new schema directly via schema.sql, existing installs run migration v4 on first open.
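
To make the fresh-install versus existing-install behaviour concrete, here is a minimal sketch of a version-gated runner. The `Migration` shape and `runPendingMigrations` are hypothetical; how the repo actually stores and compares the schema version is not shown in this PR description.

```typescript
// Hypothetical shapes; the repo's migrations.ts may structure this differently.
interface Migration {
  version: number;
  up: () => void;
}

const CURRENT_SCHEMA_VERSION = 4;

/** Apply, in ascending order, every migration newer than the stored version. */
function runPendingMigrations(storedVersion: number, migrations: Migration[]): number {
  const pending = migrations
    .filter((m) => m.version > storedVersion && m.version <= CURRENT_SCHEMA_VERSION)
    .sort((a, b) => a.version - b.version);
  let version = storedVersion;
  for (const migration of pending) {
    migration.up();
    version = migration.version;
  }
  return version;
}
```

Under this sketch, a database created from schema.sql already reports version 4 and runs nothing, while a v3 database runs migration v4 once on first open.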

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.7 (1M context)
<noreply@anthropic.com>

This was referenced Apr 26, 2026

vector search #87

Closed

feat(cochange): file-level co-change graph mined from git history #105

Open

Merge-order guide for the open PR backlog (4 refactor PRs unblock the rest) #120

Open

This was referenced May 6, 2026

feat(cochange): file-level co-change graph mined from git history mschreib28/codegraph#31

Open

feat(search): subword tokens + Porter stemmer + stopword filter for FTS mschreib28/codegraph#32

Open

andreinknv wants to merge 1 commit into colbymchenry:main from andreinknv:feat/fts-search-quality
