fix: unicode token boundaries for non-ASCII “Other” scripts by Konf · Pull Request #1 · darkskygit/memory-indexer

Konf · 2026-03-10T05:55:39Z

This PR fixes tokenization for all non-ASCII scripts that go through SegmentScript::Other, such as Cyrillic, Greek, and Armenian.

Bug summary

DefaultTextNormalizer::normalize had explicit split logic for ASCII (normalize_ascii_split), but passed each non-ASCII Other segment to normalize_span as a single token.
This missed the token boundaries in the many languages that separate words with whitespace and punctuation.
As a result, a whole phrase of non-ASCII Other text, spaces and punctuation included, was normalized as one token, so search queries for individual words in it failed.
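
To make the failure mode concrete, here is a minimal sketch of the pre-fix behaviour. It is not the crate's code: `index_whole_segment` and `matches` are hypothetical helpers, and the exact-match lookup and lowercasing are assumptions made only for illustration.

```rust
use std::collections::HashSet;

/// Hypothetical model of the pre-fix behaviour: the whole non-ASCII
/// segment is stored as a single lowercased term (assumption).
fn index_whole_segment(segment: &str) -> HashSet<String> {
    let mut terms = HashSet::new();
    terms.insert(segment.trim().to_lowercase());
    terms
}

/// A query matches only if it equals an indexed term exactly (assumption).
fn matches(index: &HashSet<String>, query: &str) -> bool {
    index.contains(&query.to_lowercase())
}
```

With this model, indexing `"Привет, мир"` yields one term containing the comma and the space, so a query for `"мир"` alone finds nothing.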

Fix

This PR adds a normalize_unicode_split function that works like normalize_ascii_split but for Unicode strings, and wires it into SegmentScript::Other text normalization.
Other text is now split into Unicode word-like spans before normalization, so terms are indexed and matched independently.
Four tests were also added to reproduce these issues and guard against future regressions.
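
The idea behind the split can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `split_unicode_words` is a hypothetical stand-in for normalize_unicode_split, and the boundary rule (split at every character that is not `char::is_alphanumeric`) and the lowercasing step are assumptions.

```rust
/// Hypothetical stand-in for the PR's `normalize_unicode_split`.
/// Splits text at every non-alphanumeric char (Unicode-aware) and
/// lowercases each resulting word-like span.
fn split_unicode_words(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        // Consecutive separators produce empty spans; drop them.
        .filter(|span| !span.is_empty())
        .map(|span| span.to_lowercase())
        .collect()
}
```

For example, `split_unicode_words("Привет, мир!")` produces the two terms `привет` and `мир`, each of which can be indexed and matched on its own. A rule this simple also splits at combining marks that are not classified as alphabetic, so a real implementation may need finer boundary handling.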

Konf added 2 commits March 10, 2026 08:41

Add proper word-split for unicode (SegmentScript::Other)

4dd7a00

Add tests to cover unicode split issues

a38ef6f

Konf mentioned this pull request Mar 10, 2026

[Bug]: Search is broken in desktop apps for non-latin and non-CJK words toeverything/AFFiNE#14595

Open

2 tasks

darkskygit changed the title ~~Fix Unicode token boundaries for non-ASCII “Other” scripts~~ fix: unicode token boundaries for non-ASCII “Other” scripts Apr 5, 2026

darkskygit approved these changes Apr 5, 2026

darkskygit merged commit 35c7f7a into darkskygit:master Apr 5, 2026
1 check failed

darkskygit merged 2 commits into darkskygit:master from Konf:unicode_fix
