[ENG-2689] Language selection 3/5: CJK-aware MiniSearch tokenizer #712
Merged
…e / Korean queries
Fixes a confirmed CJK-search bug in `search-knowledge-service.ts`'s
BM25 index. MiniSearch 7.2.0's default tokenizer splits on `\p{Z}\p{P}`
(whitespace + punctuation). CJK scripts have no whitespace, so
`'认证系统使用JWT令牌'` tokenizes as a single token and a query for
`'认证'` returns zero matches against indexed content.
Empirical confirmation pre-fix:
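    import MiniSearch from 'minisearch'

    const ms = new MiniSearch({fields: ['t'], idField: 'id'})
    ms.addAll([
      {id: 1, t: '认证系统使用JWT令牌'},
      {id: 2, t: 'Привет мир'}, // Cyrillic control document
    ])
    ms.search('认证') // → [] (broken)
    ms.search('Привет мир') // → matches document 2, as expected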
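The failure can be reproduced without MiniSearch itself. The sketch below assumes the default tokenizer's split class is equivalent to newlines plus `\p{Z}` plus `\p{P}`; `defaultLikeTokenize` is a hypothetical stand-in, not a MiniSearch export.

```typescript
// Assumption: this pattern mirrors the default tokenizer's split class
// (newlines, Unicode separators \p{Z}, Unicode punctuation \p{P}).
const SPACE_OR_PUNCTUATION = new RegExp('[\\n\\r\\p{Z}\\p{P}]+', 'u')

// Hypothetical stand-in for the default tokenizer: split and nothing else.
function defaultLikeTokenize(text: string): string[] {
  return text.split(SPACE_OR_PUNCTUATION)
}

const cjkTokens = defaultLikeTokenize('认证系统使用JWT令牌')
const cyrillicTokens = defaultLikeTokenize('Привет мир')

console.log(cjkTokens.length) // 1: the whole sentence is a single token
console.log(cyrillicTokens) // [ 'Привет', 'мир' ]
```

Because the only indexed term is the full sentence, an exact-match query for `'认证'` has nothing to hit.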
Without this fix, language preservation from ENG-2687 / ENG-2688 is
invisible to CJK users: their curate output is correctly authored in
Chinese / Japanese / Korean but their queries return zero matches.
Approach:
- New `cjk-tokenizer.ts` module. Algorithm:
  1. Split on Unicode whitespace + punctuation (same as MiniSearch default).
  2. For each token, split at CJK ↔ non-CJK script boundaries.
  3. Non-CJK segments are emitted as-is (Latin / Cyrillic / Vietnamese / European
     text behaves byte-identically to the default).
  4. CJK segments are emitted as overlapping bigrams; a single-character segment
     falls back to a unigram.
- CJK ranges: U+4E00–9FFF (Unified Ideographs), U+3040–309F (Hiragana),
  U+30A0–30FF (Katakana), U+AC00–D7AF (Hangul Syllables).
- Bigrams are the standard CJK IR compromise: unigrams are too noisy
  (common chars like `的` dominate scoring), trigrams too sparse.
- Wired via the top-level `tokenize` option on `MINISEARCH_OPTIONS`. Per
  the MiniSearch source (lines 1564-1566), that single option applies at
  both index and query time when `searchOptions.tokenize` is unset.
- No new runtime deps: `Intl.Segmenter` is available in Node 16+ but its
  CJK quality varies by ICU version; inline bigrams are deterministic
  across Node versions / platforms.
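The four steps can be sketched as follows. This is an illustration under stated assumptions, not the contents of `cjk-tokenizer.ts`: the names `isCjk`, `bigrams`, and `tokenizeCjkAware` are hypothetical, the split pattern is assumed to match the default tokenizer's, and empty fragments from leading or trailing separators are dropped for simplicity.

```typescript
// Assumed split class: newlines, Unicode separators, Unicode punctuation.
const SPLIT_PATTERN = new RegExp('[\\n\\r\\p{Z}\\p{P}]+', 'u')

// Ranges from the list above. All sit in the Basic Multilingual Plane,
// so testing UTF-16 code units is sufficient.
function isCjk(code: number): boolean {
  return (
    (code >= 0x4e00 && code <= 0x9fff) || // CJK Unified Ideographs
    (code >= 0x3040 && code <= 0x309f) || // Hiragana
    (code >= 0x30a0 && code <= 0x30ff) || // Katakana
    (code >= 0xac00 && code <= 0xd7af) // Hangul Syllables
  )
}

// Overlapping bigrams for a CJK run; a single character stays a unigram.
function bigrams(run: string): string[] {
  if (run.length < 2) return [run]
  const out: string[] = []
  for (let i = 0; i + 1 < run.length; i++) out.push(run.slice(i, i + 2))
  return out
}

function tokenizeCjkAware(text: string): string[] {
  const tokens: string[] = []
  // Step 1: split on whitespace and punctuation.
  for (const word of text.split(SPLIT_PATTERN)) {
    if (word === '') continue // simplification: drop empty fragments
    // Step 2: walk the word and cut wherever the CJK/non-CJK class flips.
    let start = 0
    for (let i = 1; i <= word.length; i++) {
      const atEnd = i === word.length
      if (atEnd || isCjk(word.charCodeAt(i)) !== isCjk(word.charCodeAt(start))) {
        const segment = word.slice(start, i)
        if (isCjk(word.charCodeAt(start))) {
          // Step 4: CJK segments become overlapping bigrams.
          for (const gram of bigrams(segment)) tokens.push(gram)
        } else {
          // Step 3: non-CJK segments pass through unchanged.
          tokens.push(segment)
        }
        start = i
      }
    }
  }
  return tokens
}

console.log(tokenizeCjkAware('认证系统使用JWT令牌'))
// [ '认证', '证系', '系统', '统使', '使用', 'JWT', '令牌' ]
console.log(tokenizeCjkAware('认证')) // [ '认证' ]
```

A function of this shape (`(text: string) => string[]`) is what MiniSearch accepts as its `tokenize` option, so the query term `'认证'` now coincides with the first indexed bigram.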
INDEX_SCHEMA_VERSION bumped 6 → 7 so cached indexes built with the
default tokenizer invalidate and rebuild on first daemon start.
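The invalidation rule amounts to a version comparison at load time. A minimal sketch, assuming a cache envelope that stores the schema version next to the serialized index; the `CachedIndex` shape and `shouldRebuild` helper are hypothetical, since the real on-disk format is not shown here.

```typescript
// Current schema version after this change.
const INDEX_SCHEMA_VERSION = 7

// Hypothetical cache envelope; field names are illustrative only.
interface CachedIndex {
  schemaVersion: number
  serializedIndex: string
}

// Rebuild when there is no cache or it was written under another schema version.
function shouldRebuild(cached: CachedIndex | undefined): boolean {
  return cached === undefined || cached.schemaVersion !== INDEX_SCHEMA_VERSION
}

console.log(shouldRebuild({schemaVersion: 6, serializedIndex: '{}'})) // true
console.log(shouldRebuild({schemaVersion: 7, serializedIndex: '{}'})) // false
```

An index built with the default tokenizer carries version 6, so it is discarded regardless of whether it contains CJK content.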
Tests (18 new):
- 4 cases for non-CJK scripts (English, Russian, Vietnamese, punctuation)
asserting byte-identical behavior to the MiniSearch default.
- 5 cases for CJK scripts (Chinese 4-char bigrams, Chinese 2-char single
bigram, Japanese kanji+kana, Korean Hangul, single-char unigram fallback).
- 3 cases for mixed Latin+CJK tokens (whitespace-separated, no-whitespace
split, multiple alternating runs).
- 6 MiniSearch integration cases — the actual production wiring contract:
Chinese / Japanese / Korean queries return matches; Russian regression;
English regression; English query does NOT match CJK content
(cross-script isolation).
251/251 across affected surfaces (18 new tokenizer + 58 existing
search-knowledge + 175 prior PRs from ENG-2687/2688). Typecheck + lint
clean; 9 pre-existing warnings on unrelated functions unchanged.
Summary
Commit 3 of 5 for the language-selection initiative (issue #616). Fixes a confirmed BM25 search bug that makes Chinese / Japanese / Korean content unsearchable, even when the content is curated correctly.
Without this fix, the language preservation from ENG-2687 / ENG-2688 is invisible to CJK users — their curate output is in Chinese / Japanese / Korean but every query returns zero matches.
Plan: `features/language-selection/plan.md`. Spec: `tasks/commit-03-tokenizer.md`.
The bug
MiniSearch 7.2.0's default tokenizer splits on `\p{Z}\p{P}` (Unicode whitespace + punctuation). CJK scripts have no whitespace between characters, so a whole sentence becomes one token. Empirical confirmation before this fix: the snippet in the commit message above, where `ms.search('认证')` returns `[]`.
The fix
New `cjk-tokenizer.ts` module wired via the top-level `tokenize` option on `MINISEARCH_OPTIONS`. The algorithm:
1. Split on Unicode whitespace + punctuation (same as the MiniSearch default).
2. Split each token at CJK ↔ non-CJK script boundaries.
3. Emit non-CJK segments as-is.
4. Emit CJK segments as overlapping bigrams, falling back to a unigram for a single character.
CJK Unicode ranges covered: U+4E00–9FFF (Unified Ideographs), U+3040–309F (Hiragana), U+30A0–30FF (Katakana), U+AC00–D7AF (Hangul Syllables).
Bigrams are the standard CJK IR compromise: unigrams are too noisy (common characters like `的` dominate scoring), trigrams are too sparse. Per the MiniSearch source (`MiniSearch.js:1564-1566`), the top-level `tokenize` option applies at both index and query time, so a single setting covers both paths.
No new runtime deps: `Intl.Segmenter` is available in Node 16+ but its CJK quality varies by ICU version; inline bigrams are deterministic across Node versions and platforms.
`INDEX_SCHEMA_VERSION` bumped 6 → 7 so any cached index built with the default tokenizer invalidates and rebuilds on first daemon start.
Test plan
- … (`search-knowledge-service.ts` unchanged)
- … (`JWT 令牌`, `JWT令牌`, `API请求JSON响应`)
Reviewer notes
- … `'认'` would emit `[]` and MiniSearch would treat the document as having no content for that field.
- … `0x4E_00` (numeric-separator) style. Matches the codebase convention.
- … `searchOptions.tokenize` for per-query overrides; left unset so the top-level option auto-applies. Confirmed in MiniSearch source at lines 1564-1566.
- `INDEX_SCHEMA_VERSION` bump is conservative: even non-CJK indexes will rebuild once, but the rebuild is fast (~filesystem walk + indexer pass) and the alternative (selective invalidation) is more code for a one-time hit.
What's next
- … `brv config set language.*` + release notes crediting Language selecting #616 reporter.
After this commit lands, the language-selection feature works end-to-end for every target script. CJK users can curate AND query in their language.