[ENG-2689] Language selection 3/5 — CJK-aware MiniSearch tokenizer by danhdoan · Pull Request #712 · campfirein/byterover-cli

danhdoan · 2026-05-26T11:07:30Z

Summary

Commit 3 of 5 for the language-selection initiative (issue #616). Fixes a confirmed BM25 search bug that makes Chinese / Japanese / Korean content unsearchable, even when the content is curated correctly.

Without this fix, the language preservation from ENG-2687 / ENG-2688 is invisible to CJK users — their curate output is in Chinese / Japanese / Korean but every query returns zero matches.

Plan: features/language-selection/plan.md. Spec: tasks/commit-03-tokenizer.md.

The bug

MiniSearch 7.2.0's default tokenizer splits on runs of Unicode separators and punctuation (\p{Z} and \p{P}). CJK scripts put no whitespace between words, so a punctuation-free CJK sentence becomes a single token, and only a query for that entire token can match. Empirical confirmation before this fix:

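import MiniSearch from 'minisearch'

const ms = new MiniSearch({fields: ['t'], idField: 'id'})
ms.addAll([{id: 1, t: '认证系统使用JWT令牌'}, {id: 2, t: 'Привет мир'}])
ms.search('认证')          // → [] (broken: the CJK run was indexed as one token)
ms.search('Привет мир')  // → matches document 2 as expected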
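
To see why the query misses, it helps to look at the tokens directly. The sketch below assumes the default tokenizer behaves like a split on runs of separator and punctuation characters; the regex and the `defaultLikeTokenize` name are illustrative stand-ins, not code copied from MiniSearch.

```typescript
// Assumption: approximates a split on Unicode separators and punctuation.
const SPACE_OR_PUNCTUATION = /[\n\r\p{Z}\p{P}]+/u

// Hypothetical helper: split and drop empty strings.
function defaultLikeTokenize(text: string): string[] {
  return text.split(SPACE_OR_PUNCTUATION).filter((t) => t.length > 0)
}

const cjkTokens = defaultLikeTokenize('认证系统使用JWT令牌')
const ruTokens = defaultLikeTokenize('Привет мир')

console.log(cjkTokens) // one token: the entire string
console.log(ruTokens)  // two tokens: 'Привет' and 'мир'
```

Because the index only contains the one long token, an exact-match lookup for '认证' has nothing to hit, while the space-separated Russian text is unaffected.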

The fix

New cjk-tokenizer.ts module wired via the top-level tokenize option on MINISEARCH_OPTIONS. The algorithm:

1. Split on Unicode whitespace + punctuation (matches the MiniSearch default).
2. For each token, split at CJK ↔ non-CJK script boundaries.
3. Emit non-CJK segments as-is, so Latin / Cyrillic / Vietnamese / European text tokenizes byte-identically to the default.
4. Emit overlapping bigrams for CJK segments; a single-character segment falls back to a unigram so it stays searchable.

CJK Unicode ranges covered: U+4E00–9FFF (Unified Ideographs), U+3040–309F (Hiragana), U+30A0–30FF (Katakana), U+AC00–D7AF (Hangul Syllables).
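
The four steps and the ranges above can be sketched as a single function. This is a minimal illustration, not the contents of cjk-tokenizer.ts: the names `CJK_RANGES`, `isCjk`, `bigrams` and `tokenizeCjkAware` are hypothetical, and the split regex is an assumed approximation of the default.

```typescript
// Hypothetical sketch; the real cjk-tokenizer.ts may differ in names and details.
const SPACE_OR_PUNCTUATION = /[\n\r\p{Z}\p{P}]+/u

// Ranges as listed in the PR description.
const CJK_RANGES: ReadonlyArray<readonly [number, number]> = [
  [0x4E_00, 0x9F_FF], // CJK Unified Ideographs
  [0x30_40, 0x30_9F], // Hiragana
  [0x30_A0, 0x30_FF], // Katakana
  [0xAC_00, 0xD7_AF], // Hangul Syllables
]

function isCjk(char: string): boolean {
  const cp = char.codePointAt(0)
  if (cp === undefined) return false
  return CJK_RANGES.some(([lo, hi]) => cp >= lo && cp <= hi)
}

// Overlapping bigrams; a single character falls back to a unigram.
function bigrams(chars: string[]): string[] {
  if (chars.length === 1) return [...chars]
  const out: string[] = []
  for (let i = 0; i + 2 <= chars.length; i++) out.push(chars.slice(i, i + 2).join(''))
  return out
}

function tokenizeCjkAware(text: string): string[] {
  const tokens: string[] = []
  // Step 1: split on whitespace and punctuation.
  for (const word of text.split(SPACE_OR_PUNCTUATION)) {
    if (word.length === 0) continue
    let run: string[] = []
    let runIsCjk = false
    // Steps 3 and 4: emit the current run as-is or as bigrams.
    const flush = (): void => {
      if (run.length === 0) return
      if (runIsCjk) tokens.push(...bigrams(run))
      else tokens.push(run.join(''))
      run = []
    }
    // Step 2: cut the word wherever the script class changes.
    for (const char of Array.from(word)) {
      const cjk = isCjk(char)
      if (run.length > 0 && cjk !== runIsCjk) flush()
      runIsCjk = cjk
      run.push(char)
    }
    flush()
  }
  return tokens
}

console.log(tokenizeCjkAware('认证系统使用JWT令牌'))
// → ['认证', '证系', '系统', '统使', '使用', 'JWT', '令牌']
```

With tokens like these in the index, a query for '认证' tokenizes to the same bigram and can match, while 'JWT' is still emitted whole.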

Bigrams are the standard CJK IR compromise: unigrams are too noisy (common characters like 的 dominate scoring), and trigrams are too sparse. Per the MiniSearch source (MiniSearch.js:1564-1566), the top-level tokenize option applies at both index and query time when searchOptions.tokenize is unset, so a single setting covers both paths.
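
To make the n-gram trade-off concrete, the sketch below generates unigrams, bigrams and trigrams for a short CJK string. The `ngrams` helper is illustrative only and is not part of this PR.

```typescript
// Illustrative helper (an assumption, not the PR's code): overlapping character n-grams.
function ngrams(text: string, n: number): string[] {
  const chars = Array.from(text)
  const out: string[] = []
  for (let i = 0; i + n <= chars.length; i++) out.push(chars.slice(i, i + n).join(''))
  return out
}

const sample = '认证系统'
console.log(ngrams(sample, 1)) // ['认', '证', '系', '统']
console.log(ngrams(sample, 2)) // ['认证', '证系', '系统']
console.log(ngrams(sample, 3)) // ['认证系', '证系统']

// A two-character query yields no trigram at all, so a trigram-only
// index could not answer it.
console.log(ngrams('认证', 3)) // []
```

Unigrams would let any document sharing a single common character match, whereas bigrams keep two-character words intact as index terms.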

No new runtime deps — Intl.Segmenter is available in Node 16+ but its CJK quality varies by ICU version; inline bigrams are deterministic across Node versions and platforms.

INDEX_SCHEMA_VERSION bumped 6 → 7 so any cached index built with the default tokenizer invalidates and rebuilds on first daemon start.
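
The invalidation can be pictured as a version gate on the cached index. The sketch below is an assumption about the general pattern: the `CachedIndex` shape and `loadCachedIndex` function are hypothetical, since the real cache format is not shown in this PR.

```typescript
// Hypothetical shapes; only the constant name and its value come from the PR.
const INDEX_SCHEMA_VERSION = 7

interface CachedIndex {
  schemaVersion: number
  serialized: string
}

// Returns the cached payload only when it was built under the current schema;
// undefined tells the caller to rebuild from source files.
function loadCachedIndex(cached: CachedIndex | undefined): string | undefined {
  if (cached === undefined) return undefined
  if (cached.schemaVersion !== INDEX_SCHEMA_VERSION) return undefined // stale tokenizer output
  return cached.serialized
}

console.log(loadCachedIndex({schemaVersion: 6, serialized: 'old'})) // undefined
console.log(loadCachedIndex({schemaVersion: 7, serialized: 'new'})) // 'new'
```

A gate like this matters here because terms stored by the old tokenizer would never line up with bigram queries from the new one.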

Test plan

Typecheck clean
Lint clean (0 errors; 9 pre-existing warnings on unrelated functions in search-knowledge-service.ts unchanged)
18 new tests:
- 4 non-CJK regression cases (English, Russian, Vietnamese, punctuation) asserting byte-identical behavior to the MiniSearch default
- 5 CJK cases (Chinese 4-char → 3 bigrams, Chinese 2-char → 1 bigram, Japanese kanji+katakana, Korean Hangul, single-char unigram fallback)
- 3 mixed Latin+CJK cases (JWT 令牌, JWT令牌, API请求JSON响应)
- 6 MiniSearch integration cases — the production wiring contract: Chinese / Japanese / Korean queries return matches; Russian regression; English regression; cross-script isolation (English query does NOT match CJK content)
58/58 existing search-knowledge-service tests pass — no regression on the production indexer
175/175 tests from PRs #710 ("[ENG-2687] Language selection 1/5 — foundation (BrvConfig.language + language-clause module)") and #711 ("[ENG-2688] Language selection 2/5 — tool-mode injection (kickoff + correction + MCP tool description)") still pass

Reviewer notes

The single-char CJK fallback (unigram for length-1 input) is documented and tested. Without it, a one-character field value such as '认' would emit [] and MiniSearch would treat the document as having no content for that field.
Cross-script isolation: the integration test "English query does NOT match unrelated CJK content" pins the contract that bigrams stay opaque to Latin queries. If a future refactor accidentally widens this, the test will catch it.
Hex literal style — auto-fix converted to 0x4E_00 (numeric-separator) style. Matches the codebase convention.
The MiniSearch 7.x API exposes searchOptions.tokenize for per-query overrides; it is left unset so the top-level option applies automatically. Confirmed in the MiniSearch source at lines 1564-1566.
The INDEX_SCHEMA_VERSION bump is conservative: even non-CJK indexes will rebuild once, but the rebuild is fast (roughly a filesystem walk plus an indexer pass), and the alternative (selective invalidation) is more code for a one-time hit.
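
The cross-script isolation contract from the notes above can be illustrated with a toy inverted index. Everything here is an assumption made for illustration: `toyTokenize` is a compact regex-based variant of the described algorithm, it lowercases inside the tokenizer for brevity, and the `Map`-based index stands in for MiniSearch rather than reproducing it.

```typescript
// Toy inverted index for illustration only; MiniSearch's real index and scoring differ.
const CJK = '\\u4E00-\\u9FFF\\u3040-\\u309F\\u30A0-\\u30FF\\uAC00-\\uD7AF'
const RUNS = new RegExp(`[${CJK}]+|[^${CJK}]+`, 'gu')
const STARTS_WITH_CJK = new RegExp(`^[${CJK}]`, 'u')

// Hypothetical compact tokenizer: split, cut into script runs, bigram the CJK runs.
function toyTokenize(text: string): string[] {
  const out: string[] = []
  for (const word of text.toLowerCase().split(/[\n\r\p{Z}\p{P}]+/u)) {
    for (const run of word.match(RUNS) ?? []) {
      if (!STARTS_WITH_CJK.test(run)) {
        out.push(run) // non-CJK run: emitted whole
        continue
      }
      const chars = Array.from(run)
      if (chars.length === 1) out.push(run) // unigram fallback
      for (let i = 0; i + 2 <= chars.length; i++) out.push(chars.slice(i, i + 2).join(''))
    }
  }
  return out
}

const index = new Map<string, Set<number>>()

function addDoc(id: number, text: string): void {
  for (const token of toyTokenize(text)) {
    const ids = index.get(token) ?? new Set<number>()
    ids.add(id)
    index.set(token, ids)
  }
}

// OR semantics: a document matches if it contains any query token.
function search(query: string): number[] {
  const hits = new Set<number>()
  for (const token of toyTokenize(query)) {
    for (const id of index.get(token) ?? []) hits.add(id)
  }
  return Array.from(hits).sort((a, b) => a - b)
}

addDoc(1, '认证系统')
addDoc(2, 'authentication system')

console.log(search('认证'))           // [1]
console.log(search('authentication')) // [2]
```

Since CJK bigrams and Latin words never share a term, neither query reaches the other document, which is the behaviour the integration test pins down.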

What's next

Commit 04 (validation): RU / VI / ZH / JA fixtures, cross-language E2E, English regression byte-identical.
Commit 05 (CLI): brv config set language.* + release notes crediting the reporter of issue #616 ("Language selecting").

After this commit lands, the language-selection feature works end-to-end for every target script. CJK users can curate AND query in their language.

…e / Korean queries

Fixes a confirmed CJK-search bug in `search-knowledge-service.ts`'s BM25 index. MiniSearch 7.2.0's default tokenizer splits on `\p{Z}\p{P}` (whitespace + punctuation). CJK scripts have no whitespace, so `'认证系统使用JWT令牌'` tokenizes as a single token and a query for `'认证'` returns zero matches against indexed content.

Empirical confirmation pre-fix:

const ms = new MiniSearch({fields: ['t'], idField: 'id'})
ms.addAll([{id: 1, t: '认证系统使用JWT令牌'}])
ms.search('认证')          // → [] — broken
ms.search('Привет мир')  // → matches as expected

Without this fix, language preservation from ENG-2687 / ENG-2688 is invisible to CJK users: their curate output is correctly authored in Chinese / Japanese / Korean but their queries return zero matches.

Approach:
- New `cjk-tokenizer.ts` module. Algorithm:
  1. Split on Unicode whitespace + punctuation (same as MiniSearch default).
  2. For each token, split at CJK ↔ non-CJK script boundaries.
  3. Non-CJK segments emit as-is (Latin / Cyrillic / Vietnamese / European behave byte-identical to the default).
  4. CJK segments emit overlapping bigrams; single-char fallback to unigram.
- CJK ranges: U+4E00–9FFF (Unified Ideographs), U+3040–309F (Hiragana), U+30A0–30FF (Katakana), U+AC00–D7AF (Hangul Syllables).
- Bigrams are the standard CJK IR compromise — unigrams are too noisy (common chars like `的` dominate scoring), trigrams too sparse.
- Wired via the top-level `tokenize` option on `MINISEARCH_OPTIONS`. Per the MiniSearch source (line 1564-1566), that single option applies at both index and query time when `searchOptions.tokenize` is unset.
- No new runtime deps — `Intl.Segmenter` is available in Node 16+ but its CJK quality varies by ICU version; inline bigrams are deterministic across Node versions / platforms.

INDEX_SCHEMA_VERSION bumped 6 → 7 so cached indexes built with the default tokenizer invalidate and rebuild on first daemon start.

Tests (18 new):
- 4 cases for non-CJK scripts (English, Russian, Vietnamese, punctuation) asserting byte-identical behavior to the MiniSearch default.
- 5 cases for CJK scripts (Chinese 4-char bigrams, Chinese 2-char single bigram, Japanese kanji+kana, Korean Hangul, single-char unigram fallback).
- 3 cases for mixed Latin+CJK tokens (whitespace-separated, no-whitespace split, multiple alternating runs).
- 6 MiniSearch integration cases — the actual production wiring contract: Chinese / Japanese / Korean queries return matches; Russian regression; English regression; English query does NOT match CJK content (cross-script isolation).

251/251 across affected surfaces (18 new tokenizer + 58 existing search-knowledge + 175 prior PRs from ENG-2687/2688). Typecheck + lint clean; 9 pre-existing warnings on unrelated functions unchanged.

danhdoan requested review from RyanNg1403, bao-byterover, cuongdo-byterover, leehpham and ngduyanhece as code owners May 26, 2026 11:07

danhdoan merged commit c19c5bf into proj/language-selection May 26, 2026

danhdoan mentioned this pull request May 26, 2026

[ENG-2691] Language selection 5/5 — brv config set/get + release notes + ship gate #714

Merged

5 tasks

danhdoan merged 1 commit into proj/language-selection from feat/ENG-2689-tokenizer-cjk
