
feat(search): field-qualified queries (kind:/lang:/path:/name:) + fuzzy typo fallback #131

Merged
colbymchenry merged 3 commits into colbymchenry:main from andreinknv:feat/search-fields-and-fuzzy
May 8, 2026

Conversation

@andreinknv
Contributor

Summary

Two UX improvements that turn free-text search into something a user can drive precisely.

1. Field-qualified queries

A new query parser splits the raw query into structured filters and a free-text remainder:

kind:function name:auth path:src/api authenticate

becomes:

{ kinds: ['function'], nameFilters: ['auth'],
  pathFilters: ['src/api'], text: 'authenticate' }

Filters parsed from the query are intersected with any filters passed in the SearchOptions arg. Unknown prefixes pass through as plain text, so a query such as "TODO:" keeps working. Quoted values (path:"my dir") may contain whitespace. When the user supplies only filters and no text, the search runs a filter-only candidate scan instead of bailing out.

Recognised fields:

Prefix                     Value
kind:                      any NodeKind value (function, method, class, ...)
lang: (alias language:)    any Language value
path:                      case-insensitive substring of file_path
name:                      case-insensitive substring of node.name
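
To make the parsing rules above concrete, here is a minimal sketch of how such a parser could work. It is not the code in src/search/query-parser.ts: the value sets are a hypothetical subset, the `languages` field name is an assumption (the PR only shows `kinds`, `nameFilters`, `pathFilters` and `text`), and the helper names `tokenize` and `stripQuotes` are made up for illustration.

```typescript
// Hypothetical subsets; the real NodeKind / Language lists live in the project.
const KIND_VALUES = new Set<string>(["function", "method", "class"]);
const LANGUAGE_VALUES = new Set<string>(["go", "typescript", "python"]);

interface ParsedQuery {
  kinds: string[];
  languages: string[]; // assumed field name, not shown in the PR
  pathFilters: string[];
  nameFilters: string[];
  text: string;
}

// Split on whitespace, but keep double-quoted spans together even when the
// quote opens in the middle of a token (path:"my dir").
function tokenize(raw: string): string[] {
  const tokens: string[] = [];
  let current = "";
  let inQuotes = false;
  for (const ch of raw) {
    if (ch === '"') {
      inQuotes = !inQuotes;
      current += ch;
    } else if (/\s/.test(ch) && !inQuotes) {
      if (current) tokens.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  if (current) tokens.push(current);
  return tokens;
}

// Remove one pair of surrounding double quotes, if present.
function stripQuotes(value: string): string {
  return value.length >= 2 && value.startsWith('"') && value.endsWith('"')
    ? value.slice(1, -1)
    : value;
}

function parseQuery(raw: string): ParsedQuery {
  const out: ParsedQuery = {
    kinds: [], languages: [], pathFilters: [], nameFilters: [], text: "",
  };
  const textParts: string[] = [];
  for (const token of tokenize(raw)) {
    const colon = token.indexOf(":");
    const field = colon > 0 ? token.slice(0, colon).toLowerCase() : "";
    const value = colon > 0 ? stripQuotes(token.slice(colon + 1)) : "";
    if (field === "kind" && KIND_VALUES.has(value)) {
      out.kinds.push(value);
    } else if ((field === "lang" || field === "language") && LANGUAGE_VALUES.has(value)) {
      out.languages.push(value);
    } else if (field === "path" && value) {
      out.pathFilters.push(value);
    } else if (field === "name" && value) {
      out.nameFilters.push(value);
    } else {
      // Unknown prefix, unknown or empty value, or plain text: pass through.
      textParts.push(token);
    }
  }
  out.text = textParts.join(" ");
  return out;
}
```

In this sketch an empty `text` together with at least one non-empty filter list is what would route a query to the filter-only scan.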

2. Fuzzy typo fallback

When both FTS and LIKE return nothing AND the text is at least 3 chars, the search scans the distinct-name set with a bounded edit distance (≤2 for queries of 5 or more chars, ≤1 for 4-char queries). The distance computation exits early once the minimum of the current row exceeds maxDist, so the per-query cost stays O(distinct-names × avg-name-length) with a very low constant.
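
The early-exit idea can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: it uses plain Levenshtein distance (no transposition handling), and the names `maxDistanceFor` and `fuzzyMatches` are hypothetical. The thresholds follow the numbers above, with 0 for shorter queries as an assumed reading of "off for shorter" in the commit message.

```typescript
// Returns the Levenshtein distance between a and b, or maxDist + 1 as soon as
// the distance is known to exceed maxDist. Comparison is case-sensitive.
function boundedEditDistance(a: string, b: string, maxDist: number): number {
  // Length-difference shortcut: every extra character costs at least one edit.
  if (Math.abs(a.length - b.length) > maxDist) return maxDist + 1;
  if (a === b) return 0;

  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);

  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    let rowMin = i;
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      const value = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
      curr.push(value);
      if (value < rowMin) rowMin = value;
    }
    // Row minima never decrease, so the result cannot drop back under the bound.
    if (rowMin > maxDist) return maxDist + 1;
    prev = curr;
  }
  return Math.min(prev[b.length], maxDist + 1);
}

// Assumed threshold table: 2 for >=5 chars, 1 for 4 chars, 0 otherwise.
function maxDistanceFor(queryLength: number): number {
  if (queryLength >= 5) return 2;
  if (queryLength === 4) return 1;
  return 0;
}

// Hypothetical helper: keep the names within the allowed distance of the query.
function fuzzyMatches(query: string, distinctNames: string[]): string[] {
  const maxDist = maxDistanceFor(query.length);
  if (maxDist === 0) return [];
  return distinctNames.filter(
    (name) => boundedEditDistance(query, name, maxDist) <= maxDist,
  );
}
```

With these definitions, "getUssr" is at distance 1 from getUser and 2 from SetUser, and "confg" is at distance 2 from Config, which is consistent with the typo results reported in the test plan.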

Test plan

Verified live against ollama/ollama@v0.22.0:

Query                        Result
kind:function auth           only function-kind hits
lang:go path:server route    Go files under server/
getUssr (typo)               finds getUser, SetUser
confg (typo)                 finds Config

  • npx vitest run: 380 passed
  • npx tsc --noEmit: clean
  • npm run build: succeeds

🤖 Generated with Claude Code

…zy typo fallback

Two UX improvements that turn a free-text search into something a
real user can drive precisely.

1) Field-qualified queries.

A new query parser (src/search/query-parser.ts) splits the raw query
into structured filters and a free-text remainder:

  kind:function name:auth path:src/api authenticate

becomes
  { kinds: ['function'], nameFilters: ['auth'],
    pathFilters: ['src/api'], text: 'authenticate' }

Filters compose with the SearchOptions arg (intersection). Unknown
prefixes pass through as plain text so `query "TODO:"` keeps working.
Quoted values (`path:"my dir"`) handle whitespace. When the user
specifies only filters with no text, the search uses a filter-only
candidate scan instead of bailing out.

Recognised today:
  kind:        any NodeKind value
  lang:        any Language value (alias: language:)
  path:        case-insensitive substring of file_path
  name:        case-insensitive substring of node.name

2) Fuzzy fallback.

When BOTH FTS and LIKE return nothing AND the text is at least 3
chars, the resolver scans the distinct-name set with a bounded
Damerau-Levenshtein-style edit distance (≤2 for ≥5 chars, ≤1 for
4-char queries, off for shorter). Bounded edit-distance early-exits
once the row min exceeds maxDist, so this stays O(distinct-names *
avg-name-length) with a very low constant.

Verified live against ollama/ollama@v0.22.0:
  query "kind:function auth"          → only function-kind hits
  query "lang:go path:server route"   → Go files under server/
  query "getUssr"   (typo)            → finds getUser, SetUser
  query "confg"     (typo)            → finds Config

Full test suite: 380 passed.
…fuzzy fan-out cap, larger filter-only over-fetch, unit tests

Five fixes from independent review:

- parseQuery tokenizer: quotes that appear MID-token (path:"my dir/
  file") were not being recognised — only quotes at the start of a
  token were treated as quoted spans. The fixture path:"my dir"
  parsed as ['path:"my', 'dir"'] instead of ['path:"my dir"'].
  Tokeniser is now a single state machine that scans into a token
  until whitespace OR a quote, and recognises quotes anywhere within
  the token (skips to the matching close quote).

- searchNodesFuzzy: cap the per-name follow-up SQL queries at
  Math.max(limit*2, 50) AFTER edit-distance filtering. Without
  this, a project with many similar names (getUser1, getUser2...)
  could fan out far beyond limit queries before the inner-loop
  break kicks in.

- searchAllByFilters (filter-only no-text path): bumped over-fetch
  multiplier from 2× to 5× so a selective post-filter (e.g.
  path:src/very/specific/file.ts) doesn't return fewer than limit
  results despite the DB having matches.

- 23 new unit tests in __tests__/search-query-parser.test.ts:
  parseQuery covers known-field filter, lang/language alias,
  multiple kind: ORs, quoted spans (incl. mid-token), URL
  passthrough, empty-value passthrough, unknown prefix passthrough,
  unknown value passthrough, all-filters-no-text, empty input,
  20k-char input. boundedEditDistance covers identity, single
  insertion/deletion/substitution, length-difference shortcut,
  empty inputs, case-sensitivity, early-exit correctness.
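
The fan-out cap and the larger over-fetch described in the list above might look roughly like this. It is a sketch only: `capFuzzyNames`, `filterOnlyScan`, the `Row` shape and the `fetchRows` callback are hypothetical stand-ins for searchNodesFuzzy, searchAllByFilters and the underlying SQL, whose real signatures are not shown in this PR.

```typescript
interface Row {
  name: string;
  filePath: string;
}

// Fan-out cap: after edit-distance filtering, bound how many names get a
// per-name follow-up query.
function capFuzzyNames(matchedNames: string[], limit: number): string[] {
  const cap = Math.max(limit * 2, 50);
  return matchedNames.slice(0, cap);
}

// Filter-only path: fetch 5x the limit, apply the selective post-filter,
// then trim back down to the limit.
function filterOnlyScan(
  fetchRows: (count: number) => Row[], // stand-in for the database query
  pathFilters: string[],
  limit: number,
): Row[] {
  const OVER_FETCH = 5;
  const candidates = fetchRows(limit * OVER_FETCH);
  const matches = candidates.filter((row) =>
    pathFilters.every((p) =>
      row.filePath.toLowerCase().includes(p.toLowerCase()),
    ),
  );
  return matches.slice(0, limit);
}
```

A fixed multiplier is a trade-off: a very selective filter can still leave fewer than limit matches in the fetched window, and a larger multiplier reduces that risk at the cost of reading more rows.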

Full test suite: 853 passed (up from 830).
andreinknv added a commit to andreinknv/codegraph that referenced this pull request Apr 28, 2026
andreinknv added a commit to andreinknv/codegraph that referenced this pull request Apr 29, 2026
Adds Steps K-O to walk the new PRs in dependency order:
  K: bug-fix wave (clean):    colbymchenry#128, colbymchenry#129
  L: resolution + search:     colbymchenry#130 (resolve), colbymchenry#131 (resolve)
  M: extraction edges:        colbymchenry#134 (resolve)
  N: biomarker stack:         colbymchenry#132, colbymchenry#133 (both resolve, on top of colbymchenry#125)
  O: search advanced:         colbymchenry#135 (resolve, on top of colbymchenry#131)

Also flips colbymchenry#125 from merge_clean to merge_resolve - it now hits a
queries.ts conflict after the Phase-4 stack lands (colbymchenry#111/colbymchenry#112/colbymchenry#123/colbymchenry#124
all extend the same QueryBuilder surface, so colbymchenry#125's biomarker columns
no longer apply cleanly without a resolution).

Validated end-to-end against colbymchenry/main HEAD: script ran
clean through all 43 PRs, npm run build succeeded, full test
suite reports 877/877 passing (was 829 before this wave: +48 from
new tests added by the new PRs plus the reviewer-driven follow-ups).
Convert NodeKind and Language to runtime-iterable as const arrays
(NODE_KINDS, LANGUAGES) so the query parser imports the canonical
list instead of duplicating it. Also fix the path: JSDoc to say
substring (matches the .includes() impl).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@colbymchenry
Owner

Reviewed and merging. Pushed a small polish commit:

  • Derived KIND_VALUES / LANGUAGE_VALUES from new NODE_KINDS / LANGUAGES as const arrays in types.ts so the parser stays in sync if a new kind or language gets added (e.g. lang:unknown now works because the type already had it).
  • Fixed the path: JSDoc — it claimed prefix match but the implementation is substring.
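
For readers unfamiliar with the pattern, deriving both the type and the runtime lookup set from one as const array can be sketched like this. The array contents are an illustrative subset (only function, method, class, go and unknown are named in this PR), and the guard functions `isNodeKind` and `isLanguage` are hypothetical.

```typescript
// Single source of truth: a runtime-iterable, literal-typed array.
const NODE_KINDS = ["function", "method", "class"] as const;
type NodeKind = (typeof NODE_KINDS)[number]; // "function" | "method" | "class"

const LANGUAGES = ["go", "typescript", "unknown"] as const;
type Language = (typeof LANGUAGES)[number];

// Lookup sets derived from the arrays, so adding a kind or language in one
// place updates the parser's accepted values as well.
const KIND_VALUES = new Set<string>(NODE_KINDS);
const LANGUAGE_VALUES = new Set<string>(LANGUAGES);

// Hypothetical type guards a parser could use to validate field values.
function isNodeKind(value: string): value is NodeKind {
  return KIND_VALUES.has(value);
}

function isLanguage(value: string): value is Language {
  return LANGUAGE_VALUES.has(value);
}
```

Because the sets are built from the arrays, a value that exists in the type (such as unknown for languages) is accepted at runtime without a second hand-maintained list.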

The field-qualified syntax and the bounded-edit-distance fuzzy fallback are both clean, well-tested wins. Thanks for the contribution.

@colbymchenry colbymchenry merged commit 56f6b3b into colbymchenry:main May 8, 2026

2 participants