Merged
Emojis in the section title break the anchor links, so the page does not scroll to the correct section (tested in Chrome & Safari).

Co-authored-by: Markus Wendorf <18446907+MarkusWendorf@users.noreply.github.com>
Testing across different Leipzig dataset languages yields the following results:

Tier 1: Excellent (5-7x speedup)

Language         Script          Speedup  Notes
--------------------------------------------------------------------
Hindi (hi)       Devanagari      7.10x    Pure 3-byte E0, no case folding
English (en)     Latin           6.94x    Pure ASCII fast path
Bengali (bn)     Bengali         6.49x    Pure 3-byte E0, no case folding
Tamil (ta)       Tamil           6.41x    Pure 3-byte E0, no case folding
Korean (ko)      Hangul          6.13x    Pure 3-byte EA, no case folding
Dutch (nl)       Latin           5.31x    Mostly ASCII with some Latin-1

Tier 2: Good (3-5x speedup)

Language         Script          Speedup  Notes
------------------------------------------------------------------------
German (de)      Latin           4.38x    Latin-1 umlauts (ä, ö, ü)
Portuguese (pt)  Latin           4.17x    Latin-1 accents
Japanese (ja)    CJK + Hiragana  4.03x    Mixed but mostly safe 3-byte
Ukrainian (uk)   Cyrillic        3.90x    Cyrillic fast path
Spanish (es)     Latin           3.89x    Latin-1 accents
Russian (ru)     Cyrillic        3.62x    Cyrillic fast path
Greek (el)       Greek           3.44x    2-byte with +0x20 folding
Hebrew (he)      Hebrew          3.35x    RTL script, no case folding
Arabic (ar)      Arabic          3.35x    RTL script, no case folding

Tier 3: Moderate (2-3x speedup)

Language         Script          Speedup  Notes
-------------------------------------------------------------------
French (fr)      Latin           3.21x    Latin-1 + Latin Extended
Armenian (hy)    Armenian        2.92x    2-byte with complex folding
Persian (fa)     Arabic          2.91x    No case folding, mostly safe
Czech (cs)       Latin           2.05x    Latin Extended-A (háčky, čárky)
Polish (pl)      Latin           2.04x    Latin Extended-A (ł, ś, ź)
Turkish (tr)     Latin           2.00x    Latin Extended (İ→i, dotless ı)

Tier 4: Limited (~1-2x speedup)

Language         Script          Speedup  Notes
--------------------------------------------------------------------
Chinese (zh)     CJK             1.89x    Has fullwidth A-Z (EF)
Vietnamese (vi)  Latin Ext Add   1.04x    Mixed ASCII + 2-byte + E1

Tier 5: Regression (<1x)

Language         Script          Speedup  Notes
--------------------------------------------------------------
Georgian (ka)    Georgian        0.36x    Cross-block folding: E1→E2
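The Notes column refers to UTF-8 lead bytes (E0, EA, EF, E1→E2) and to byte offsets such as +0x20. The sketch below is an illustration only, not the benchmarked implementation: it uses Python's built-in `str.casefold()` as a stand-in for the folding logic, and the helper name `fold_profile` and the sample characters are assumptions chosen to match the table's notes.

```python
def fold_profile(ch: str) -> dict:
    """Compare the UTF-8 bytes of one character before and after case folding.

    Hypothetical helper for illustration; str.casefold() stands in for
    whatever folding routine the benchmark actually measures.
    """
    before = ch.encode("utf-8")
    after = ch.casefold().encode("utf-8")
    return {
        "before": before.hex(),
        "after": after.hex(),
        "lead_before": before[0],
        "lead_after": after[0],
        "unchanged": before == after,
    }


# One sample character per pattern named in the Notes column.
SAMPLES = {
    "Devanagari HA (no folding, lead E0)": "\u0939",
    "Hangul GA (no folding, lead EA)": "\uac00",
    "Greek capital ALPHA (2-byte, +0x20)": "\u0391",
    "Fullwidth capital A (lead EF)": "\uff21",
    "Georgian capital AN (lead E1 -> E2)": "\u10a0",
}

if __name__ == "__main__":
    for label, ch in SAMPLES.items():
        p = fold_profile(ch)
        print(f"{label}: {p['before']} -> {p['after']}")
```

Under these assumptions, a script whose characters never change under folding can be scanned without touching the bytes, while Georgian capitals move to a different lead byte, which is consistent with the "cross-block" note on the only regressing row.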