[ENG-2689] Language selection 3/5 — CJK-aware MiniSearch tokenizer #712

Merged
danhdoan merged 1 commit into proj/language-selection from feat/ENG-2689-tokenizer-cjk
May 26, 2026

@danhdoan
Collaborator

Summary

Commit 3 of 5 for the language-selection initiative (issue #616). Fixes a confirmed BM25 search bug that makes Chinese / Japanese / Korean content unsearchable, even when the content is curated correctly.

Without this fix, the language preservation from ENG-2687 / ENG-2688 is invisible to CJK users — their curate output is in Chinese / Japanese / Korean but every query returns zero matches.

Plan: features/language-selection/plan.md. Spec: tasks/commit-03-tokenizer.md.

The bug

MiniSearch 7.2.0's default tokenizer splits on runs of \p{Z} and \p{P} characters (Unicode separators and punctuation). CJK scripts put no whitespace between words, so a sentence without punctuation is indexed as a single token and a query for any part of it finds nothing. Empirical confirmation before this fix:

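import MiniSearch from 'minisearch'

const ms = new MiniSearch({fields: ['t'], idField: 'id'})
// One CJK document and one Cyrillic document, so both queries have something to hit.
ms.addAll([{id: 1, t: '认证系统使用JWT令牌'}, {id: 2, t: 'Привет мир'}])
ms.search('认证')        // → [] (broken: the whole sentence was indexed as one token)
ms.search('Привет мир')  // → matches as expected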
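
To see why the CJK query misses, it helps to look at the tokens themselves. The sketch below does not call MiniSearch; it applies a hand-written split pattern that is assumed to mirror the library's default (Unicode separators plus punctuation). `SPACE_OR_PUNCTUATION` and `defaultTokenize` are illustrative names, not code from this PR.

```typescript
// Assumption: this pattern approximates MiniSearch's default split on
// Unicode separators (\p{Z}) and punctuation (\p{P}).
const SPACE_OR_PUNCTUATION = /[\n\r\p{Z}\p{P}]+/u

// Hypothetical stand-in for the default tokenizer.
function defaultTokenize(text: string): string[] {
  return text.split(SPACE_OR_PUNCTUATION).filter((t) => t.length > 0)
}

const cjk = defaultTokenize('认证系统使用JWT令牌')
const cyrillic = defaultTokenize('Привет мир')

console.log(cjk)      // a single token: the whole string
console.log(cyrillic) // two tokens: 'Привет' and 'мир'
```

Because the only indexed term is the full sentence, an exact-term lookup for 认证 has nothing to match against.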

The fix

New cjk-tokenizer.ts module wired via the top-level tokenize option on MINISEARCH_OPTIONS. The algorithm:

  1. Split on Unicode whitespace + punctuation (matches MiniSearch default).
  2. For each token, split at CJK ↔ non-CJK script boundaries.
  3. Non-CJK segments emit as-is — Latin / Cyrillic / Vietnamese / European behave byte-identical to the default.
  4. CJK segments emit overlapping bigrams; single-char input falls back to unigram so it stays searchable.

CJK Unicode ranges covered: U+4E00–9FFF (Unified Ideographs), U+3040–309F (Hiragana), U+30A0–30FF (Katakana), U+AC00–D7AF (Hangul Syllables).
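
The four steps and the listed ranges can be sketched roughly as follows. This is an illustrative reconstruction from the description above, not the contents of cjk-tokenizer.ts; `isCjk`, `bigrams`, and `tokenizeCjkAware` are hypothetical names, and the split pattern is assumed to match the MiniSearch default.

```typescript
// Assumed equivalent of the default whitespace + punctuation split.
const SPLIT = /[\n\r\p{Z}\p{P}]+/u

// Ranges from the PR: Unified Ideographs, Hiragana, Katakana, Hangul Syllables.
function isCjk(ch: string): boolean {
  const cp = ch.codePointAt(0) ?? 0
  return (
    (cp >= 0x4e00 && cp <= 0x9fff) ||
    (cp >= 0x3040 && cp <= 0x309f) ||
    (cp >= 0x30a0 && cp <= 0x30ff) ||
    (cp >= 0xac00 && cp <= 0xd7af)
  )
}

// Step 4: overlapping bigrams, with a unigram fallback for one character.
function bigrams(chars: string[]): string[] {
  if (chars.length === 1) return [chars[0]]
  const out: string[] = []
  for (let i = 0; i < chars.length - 1; i++) out.push(chars[i] + chars[i + 1])
  return out
}

function tokenizeCjkAware(text: string): string[] {
  const tokens: string[] = []
  for (const word of text.split(SPLIT)) {        // step 1
    let run: string[] = []
    let runIsCjk = false
    // Emit the current run: bigrams for CJK (step 4), as-is otherwise (step 3).
    const flush = (): void => {
      if (run.length === 0) return
      if (runIsCjk) tokens.push(...bigrams(run))
      else tokens.push(run.join(''))
      run = []
    }
    for (const ch of Array.from(word)) {         // step 2: script boundaries
      const cjk = isCjk(ch)
      if (run.length > 0 && cjk !== runIsCjk) flush()
      runIsCjk = cjk
      run.push(ch)
    }
    flush()
  }
  return tokens
}

console.log(tokenizeCjkAware('认证系统使用JWT令牌'))
// → ['认证', '证系', '系统', '统使', '使用', 'JWT', '令牌']
```

A query for 认证 now tokenizes to the single bigram 认证, which is also one of the indexed terms for the sentence.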

Bigrams are the standard CJK IR compromise: unigrams are too noisy (common characters like 的 dominate scoring), and trigrams are too sparse. Per the MiniSearch source (MiniSearch.js:1564-1566), the top-level tokenize option applies at both index and query time, so a single setting covers both paths.
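
The reason index-time and query-time tokenization must agree can be shown with a toy inverted index. The sketch below is not MiniSearch and says nothing about its internals; `createToyIndex`, `plainTokenize`, and `bigramTokenize` are hypothetical, and the lowercasing is only an assumption that loosely imitates a term-processing step.

```typescript
// Toy inverted index, NOT MiniSearch: every name here is hypothetical.
type Tokenizer = (text: string) => string[]

const SPLIT = /[\n\r\p{Z}\p{P}]+/u
const CJK_CHAR = /[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]/
const RUNS = /[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]+|[^\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]+/g

// Whitespace/punctuation split only, lowercased.
const plainTokenize: Tokenizer = (text) =>
  text.split(SPLIT).filter((t) => t.length > 0).map((t) => t.toLowerCase())

// Same split, then CJK runs become overlapping bigrams (unigram if length 1).
const bigramTokenize: Tokenizer = (text) => {
  const out: string[] = []
  for (const word of text.split(SPLIT)) {
    for (const run of word.match(RUNS) ?? []) {
      if (!CJK_CHAR.test(run)) { out.push(run.toLowerCase()); continue }
      const chars = Array.from(run)
      if (chars.length === 1) out.push(run)
      for (let i = 0; i < chars.length - 1; i++) out.push(chars[i] + chars[i + 1])
    }
  }
  return out
}

// The same tokenizer is used for both add() and search().
function createToyIndex(tokenize: Tokenizer) {
  const postings = new Map<string, Set<number>>()
  return {
    add(id: number, text: string): void {
      for (const term of tokenize(text)) {
        if (!postings.has(term)) postings.set(term, new Set<number>())
        postings.get(term)!.add(id)
      }
    },
    // OR semantics: a document matches if it contains any query term.
    search(query: string): number[] {
      const hits = new Set<number>()
      for (const term of tokenize(query)) {
        postings.get(term)?.forEach((id) => hits.add(id))
      }
      return Array.from(hits).sort((a, b) => a - b)
    },
  }
}

const before = createToyIndex(plainTokenize)
const after = createToyIndex(bigramTokenize)
for (const idx of [before, after]) {
  idx.add(1, '认证系统使用JWT令牌')
  idx.add(2, 'authentication uses JWT tokens')
}

console.log(before.search('认证'))          // → []
console.log(after.search('认证'))           // → [1]
console.log(after.search('authentication')) // → [2]
```

In the toy, the English query only reaches the English document because CJK bigrams and Latin terms never share a token, which is the same isolation property the integration tests pin down.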

No new runtime deps — Intl.Segmenter is available in Node 16+ but its CJK quality varies by ICU version; inline bigrams are deterministic across Node versions and platforms.

INDEX_SCHEMA_VERSION bumped 6 → 7 so any cached index built with the default tokenizer invalidates and rebuilds on first daemon start.
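
A version-gated cache load of this kind might look like the sketch below. Only the name INDEX_SCHEMA_VERSION and the value 7 come from this PR; `CachedIndex`, `loadOrRebuild`, and the cache shape are assumptions made for illustration.

```typescript
// Only INDEX_SCHEMA_VERSION is taken from the PR; the rest is hypothetical.
const INDEX_SCHEMA_VERSION = 7

interface CachedIndex {
  schemaVersion: number
  serialized: string
}

// Reuse the cache only when it was written with the current schema version;
// otherwise rebuild, so tokens produced by the old tokenizer are never served.
function loadOrRebuild(
  cached: CachedIndex | undefined,
  rebuild: () => string,
): CachedIndex {
  if (cached !== undefined && cached.schemaVersion === INDEX_SCHEMA_VERSION) {
    return cached
  }
  return { schemaVersion: INDEX_SCHEMA_VERSION, serialized: rebuild() }
}

let rebuilds = 0
const rebuild = (): string => {
  rebuilds++
  return 'fresh-index'
}

const stale = loadOrRebuild({ schemaVersion: 6, serialized: 'old-index' }, rebuild)
const current = loadOrRebuild(stale, rebuild)
console.log(stale.serialized, current.serialized, rebuilds)
```

Comparing a single integer keeps the check cheap, at the cost of rebuilding indexes that contain no CJK text at all.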

Test plan

  • Typecheck clean
  • Lint clean (0 errors; 9 pre-existing warnings on unrelated functions in search-knowledge-service.ts unchanged)
  • 18 new tests:
    • 4 non-CJK regression cases (English, Russian, Vietnamese, punctuation) asserting byte-identical behavior to the MiniSearch default
    • 5 CJK cases (Chinese 4-char → 3 bigrams, Chinese 2-char → 1 bigram, Japanese kanji+katakana, Korean Hangul, single-char unigram fallback)
    • 3 mixed Latin+CJK cases (JWT 令牌, JWT令牌, API请求JSON响应)
    • 6 MiniSearch integration cases — the production wiring contract: Chinese / Japanese / Korean queries return matches; Russian regression; English regression; cross-script isolation (English query does NOT match CJK content)
  • 58/58 existing search-knowledge-service tests pass — no regression on the production indexer
  • 175/175 tests from PR #710 ([ENG-2687] Language selection 1/5 — foundation (BrvConfig.language + language-clause module)) and PR #711 ([ENG-2688] Language selection 2/5 — tool-mode injection (kickoff + correction + MCP tool description)) still pass

Reviewer notes

  • The single-char CJK fallback (unigram for length-1 input) is documented and tested. Without it, a one-character input like '认' would emit [] and MiniSearch would treat the document as having no content for that field.
  • Cross-script isolation: the integration test "English query does NOT match unrelated CJK content" pins the contract that bigrams stay opaque to Latin queries. If a future refactor accidentally widens this, the test will catch it.
  • Hex literal style — auto-fix converted to 0x4E_00 (numeric-separator) style. Matches the codebase convention.
  • The MiniSearch 7.x API exposes searchOptions.tokenize for per-query overrides; it is left unset so the top-level option applies automatically. Confirmed in the MiniSearch source at lines 1564-1566.
  • INDEX_SCHEMA_VERSION bump is conservative — even non-CJK indexes will rebuild once, but the rebuild is fast (~filesystem walk + indexer pass) and the alternative (selective invalidation) is more code for a one-time hit.

What's next

  • Commit 04 (validation): RU / VI / ZH / JA fixtures, cross-language E2E, English regression byte-identical.
  • Commit 05 (CLI): brv config set language.* + release notes crediting the reporter of issue #616 (Language selecting).

After this commit lands, the language-selection feature works end-to-end for every target script. CJK users can curate AND query in their language.

…e / Korean queries

Fixes a confirmed CJK-search bug in `search-knowledge-service.ts`'s
BM25 index. MiniSearch 7.2.0's default tokenizer splits on `\p{Z}\p{P}`
(whitespace + punctuation). CJK scripts have no whitespace, so
`'认证系统使用JWT令牌'` tokenizes as a single token and a query for
`'认证'` returns zero matches against indexed content.

Empirical confirmation pre-fix:

  const ms = new MiniSearch({fields: ['t'], idField: 'id'})
  ms.addAll([{id: 1, t: '认证系统使用JWT令牌'}, {id: 2, t: 'Привет мир'}])
  ms.search('认证')        // → [] (broken)
  ms.search('Привет мир')  // → matches as expected

Without this fix, language preservation from ENG-2687 / ENG-2688 is
invisible to CJK users: their curate output is correctly authored in
Chinese / Japanese / Korean but their queries return zero matches.

Approach:

- New `cjk-tokenizer.ts` module. Algorithm:
  1. Split on Unicode whitespace + punctuation (same as MiniSearch default).
  2. For each token, split at CJK ↔ non-CJK script boundaries.
  3. Non-CJK segments emit as-is (Latin / Cyrillic / Vietnamese / European
     behave byte-identical to the default).
  4. CJK segments emit overlapping bigrams; single-char fallback to unigram.
- CJK ranges: U+4E00–9FFF (Unified Ideographs), U+3040–309F (Hiragana),
  U+30A0–30FF (Katakana), U+AC00–D7AF (Hangul Syllables).
- Bigrams are the standard CJK IR compromise — unigrams are too noisy
  (common chars like `的` dominate scoring), trigrams too sparse.
- Wired via the top-level `tokenize` option on `MINISEARCH_OPTIONS`. Per
  the MiniSearch source (line 1564-1566), that single option applies at
  both index and query time when `searchOptions.tokenize` is unset.
- No new runtime deps — `Intl.Segmenter` is available in Node 16+ but its
  CJK quality varies by ICU version; inline bigrams are deterministic
  across Node versions / platforms.

INDEX_SCHEMA_VERSION bumped 6 → 7 so cached indexes built with the
default tokenizer invalidate and rebuild on first daemon start.

Tests (18 new):

- 4 cases for non-CJK scripts (English, Russian, Vietnamese, punctuation)
  asserting byte-identical behavior to the MiniSearch default.
- 5 cases for CJK scripts (Chinese 4-char bigrams, Chinese 2-char single
  bigram, Japanese kanji+kana, Korean Hangul, single-char unigram fallback).
- 3 cases for mixed Latin+CJK tokens (whitespace-separated, no-whitespace
  split, multiple alternating runs).
- 6 MiniSearch integration cases — the actual production wiring contract:
  Chinese / Japanese / Korean queries return matches; Russian regression;
  English regression; English query does NOT match CJK content
  (cross-script isolation).

251/251 across affected surfaces (18 new tokenizer + 58 existing
search-knowledge + 175 prior PRs from ENG-2687/2688). Typecheck + lint
clean; 9 pre-existing warnings on unrelated functions unchanged.