Case-Folding UTF-8: ẞ → ss #285

Merged
ashvardanian merged 21 commits into main from main-dev
Nov 29, 2025

Conversation

@ashvardanian
Owner

No description provided.

ashvardanian and others added 21 commits November 27, 2025 15:18
The use of emojis in the section title causes the links
to break and the page is not scrolled correctly (tested
in Chrome & Safari).

Co-authored-by: Markus Wendorf <18446907+MarkusWendorf@users.noreply.github.com>

Testing across different languages of the Leipzig
dataset yields the following results:

Tier 1: Excellent (~5-7x speedup)
 Language      Script      Speedup  Notes
--------------------------------------------------------------------
 Hindi (hi)    Devanagari  7.10x    Pure 3-byte E0, no case folding
 English (en)  Latin       6.94x    Pure ASCII fast path
 Bengali (bn)  Bengali     6.49x    Pure 3-byte E0, no case folding
 Tamil (ta)    Tamil       6.41x    Pure 3-byte E0, no case folding
 Korean (ko)   Hangul      6.13x    Pure 3-byte EA, no case folding
 Dutch (nl)    Latin       5.31x    Mostly ASCII with some Latin-1
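
The "Pure 3-byte E0, no case folding" notes can be checked with a few lines of Python. This is an illustrative sketch, not code from this PR: the `lead_bytes` helper and the sample words are hypothetical, and Python's built-in `str.casefold` stands in for the library's own folding routine.

```python
def lead_bytes(text: str) -> set[int]:
    """Collect the UTF-8 bytes that start a character (skip 0b10xxxxxx continuation bytes)."""
    return {b for b in text.encode("utf-8") if (b & 0xC0) != 0x80}


# Hypothetical sample words, one per script.
samples = {
    "Hindi": "नमस्ते",
    "Bengali": "নমস্কার",
    "Tamil": "வணக்கம்",
}

for language, word in samples.items():
    leads = sorted(hex(b) for b in lead_bytes(word))
    unchanged = word.casefold() == word  # caseless scripts fold to themselves
    print(f"{language}: lead bytes {leads}, unchanged by folding: {unchanged}")
```

For these samples every character starts with the same lead byte and folding is the identity, which is presumably the property a fast path can exploit to skip such text without per-character lookups.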

Tier 2: Good (3-5x speedup)
 Language         Script          Speedup  Notes
------------------------------------------------------------------------
 German (de)      Latin           4.38x    Latin-1 umlauts (ä, ö, ü)
 Portuguese (pt)  Latin           4.17x    Latin-1 accents
 Japanese (ja)    CJK + Hiragana  4.03x    Mixed but mostly safe 3-byte
 Ukrainian (uk)   Cyrillic        3.90x    Cyrillic fast path
 Spanish (es)     Latin           3.89x    Latin-1 accents
 Russian (ru)     Cyrillic        3.62x    Cyrillic fast path
 Greek (el)       Greek           3.44x    2-byte with +0x20 folding
 Hebrew (he)      Hebrew          3.35x    RTL script, no case folding
 Arabic (ar)      Arabic          3.35x    RTL script, no case folding
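
The "+0x20 folding" note for Greek refers to the fixed distance between capital and small letters. A minimal sketch, assuming Python's `str.casefold` as a reference; the `fold_offset` helper is hypothetical and not part of this PR:

```python
def fold_offset(ch: str) -> int:
    """Code point distance between a single character and its case-folded form."""
    folded = ch.casefold()
    assert len(folded) == 1, "only meaningful for one-to-one foldings"
    return ord(folded) - ord(ch)


for upper in "ΑΒΩ":
    folded = upper.casefold()
    print(
        upper, "->", folded,
        "offset", hex(fold_offset(upper)),
        "bytes", upper.encode("utf-8").hex(" "), "->", folded.encode("utf-8").hex(" "),
    )
```

The offset applies to code points, not to raw bytes: Α (CE 91) folds to α (CE B1) by changing only the second byte, while Ω (CE A9) folds to ω (CF 89), which also changes the lead byte.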

Tier 3: Moderate (~2-3x speedup)
 Language       Script    Speedup  Notes
-------------------------------------------------------------------
 French (fr)    Latin     3.21x    Latin-1 + Latin Extended
 Armenian (hy)  Armenian  2.92x    2-byte with complex folding
 Persian (fa)   Arabic    2.91x    No case folding, mostly safe
 Czech (cs)     Latin     2.05x    Latin Extended-A (háčky, čárky)
 Polish (pl)    Latin     2.04x    Latin Extended-A (ł, ś, ź)
 Turkish (tr)   Latin     2.00x    Latin Extended (İ→i, dotless ı)

Tier 4: Limited (~1-2x speedup)
 Language         Script         Speedup  Notes
--------------------------------------------------------------------
 Chinese (zh)     CJK            1.89x    Has fullwidth A-Z (EF)
 Vietnamese (vi)  Latin Ext Add  1.04x    Mixed ASCII + 2-byte + E1

Tier 5: Regression (<1x)
 Language       Script    Speedup  Notes
--------------------------------------------------------------
 Georgian (ka)  Georgian  0.36x    Cross-block folding: E1→E2
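
The slower tiers involve foldings that change the byte length or the lead byte of a character. The sketch below uses Python's `str.casefold` to show a few such cases. The `describe` helper and the choice of sample characters are hypothetical, and reading the Georgian "E1→E2" note as the Asomtavruli to Nuskhuri mapping is an assumption, not something the commit message states.

```python
def describe(ch: str) -> tuple[str, str, str]:
    """Return (original bytes, folded text, folded bytes) for one character."""
    folded = ch.casefold()
    return ch.encode("utf-8").hex(" "), folded, folded.encode("utf-8").hex(" ")


cases = {
    "capital sharp s (PR title)": "\u1E9E",          # ẞ folds to two ASCII letters
    "fullwidth A": "\uFF21",                         # stays 3 bytes under lead byte EF
    "Georgian Asomtavruli An (assumed)": "\u10A0",   # lead byte moves from E1 to E2
    "Turkish dotted capital I": "\u0130",            # folds to i + combining dot above
}

for name, ch in cases.items():
    before, folded, after = describe(ch)
    print(f"{name}: {before} -> {after} ({folded!r})")
```

In each case the folded bytes cannot be produced by adjusting a single byte in place, which is consistent with these inputs falling off the fast paths listed in the tables.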
@ashvardanian merged commit ae307ab into main on Nov 29, 2025
32 checks passed

2 participants