# Unicode Normalization and Search Anchor Analysis

This notebook explores Unicode case folding and normalization properties to identify optimal "anchor points" for case-insensitive and normalization-insensitive string search algorithms.

In [1]:
!python -m pip install -q pandas 2>/dev/null || curl -sS https://bootstrap.pypa.io/get-pip.py | python && python -m pip install -q pandas


Before we start, a small reminder on Unicode.
Unicode is a versioned standard.
In 2025, the latest version is Unicode 17.0.
Its codespace holds over a million code points, of which roughly 160,000 are assigned characters.
Some of them belong to "bicameral" scripts (like Latin, Greek, Cyrillic) that have distinct uppercase and lowercase forms.
Others belong to "unicameral" scripts (like those used for Chinese, Japanese, Korean, and Arabic) that do not have case distinctions.
That doesn't mean, however, that a character in such a script has only one possible representation.
So "case folding" and "normalization" are two different concepts.
We will explore both in this notebook.
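
To make the distinction concrete, here is a minimal sketch using only Python's standard library. It relies on the Unicode tables bundled with the interpreter, which may be older than 17.0, and the sample strings are illustrative choices rather than anything taken from the UCD files loaded later.

```python
import unicodedata

# Two spellings of the same letter: one precomposed code point vs. 'e' + combining acute accent
precomposed = "\u00e9"
decomposed = "e\u0301"

# Case folding alone does not unify them: both are already lowercase
folding_unifies = precomposed.casefold() == decomposed.casefold()

# Normalization (here NFC) does unify them
nfc_unifies = unicodedata.normalize("NFC", precomposed) == unicodedata.normalize("NFC", decomposed)

# Conversely, normalization alone does not unify the uppercase and lowercase forms,
# while case folding does
upper = "\u00c9"
nfc_unifies_case = unicodedata.normalize("NFC", upper) == unicodedata.normalize("NFC", precomposed)
folding_unifies_case = upper.casefold() == precomposed.casefold()

print(folding_unifies, nfc_unifies, nfc_unifies_case, folding_unifies_case)
```

A search that should be insensitive to both properties therefore needs both steps.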

Unicode doesn't require UTF-8, but UTF-8 is the most popular encoding on the web and in modern applications, and it is the one we focus on in StringZilla.
In UTF-8, each code point is represented by one, two, three, or four bytes.
A folded or normalized character can map to a sequence of multiple code points, and each of those code points can have a different byte length in UTF-8.
That's why most modern text-processing applications skip full Unicode processing.
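
As a rough illustration of how folding changes both the code point count and the byte count, the sketch below uses Python's built-in `str.casefold`, which applies full case folding from the interpreter's bundled Unicode tables (possibly older than 17.0). The helper `utf8_length` and the sample characters are hypothetical choices made for this example.

```python
def utf8_length(text: str) -> int:
    """Number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

# 'A', sharp s, Greek iota with dialytika and tonos, and the 'ffi' ligature
samples = ["A", "\u00df", "\u0390", "\ufb03"]

report = {}
for source in samples:
    folded = source.casefold()
    # (source bytes, folded code points, folded bytes)
    report[source] = (utf8_length(source), len(folded), utf8_length(folded))
    print(
        f"U+{ord(source):04X}: {utf8_length(source)} byte(s) -> "
        f"{len(folded)} code point(s), {utf8_length(folded)} byte(s)"
    )
```

A single code point can thus turn into up to three code points, and the folded byte length does not have to match the source byte length.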

Typically, when people perform case-insensitive search, they either:

1. Use simple ASCII case folding (A-Z to a-z), ignoring all other characters.
2. Use pretty much the only major library that supports full Unicode case folding and normalization, ICU (International Components for Unicode).

The first is clearly insufficient, and the second is quite heavy and works at a character level, making SIMD optimizations difficult.
This notebook will focus on more SIMD-vectorizable ideas.

To start, let's pull the most recent Unicode Character Database (UCD) files from the Unicode website.

In [2]:
import sys
from collections import Counter
import unicodedata
import pandas as pd

# Import shared Unicode data loading functions
sys.path.insert(0, ".")
from test_helpers import (
    UNICODE_VERSION,
    get_all_codepoints,
    get_case_folding_rules_as_codepoints,
    get_normalization_props,
    get_unicode_xml_data,
)

# UTF-8 byte boundaries
UTF8_1BYTE_MAX = 0x7F  # 127 - ASCII range
UTF8_2BYTE_MAX = 0x7FF  # 2047
UTF8_3BYTE_MAX = 0xFFFF  # 65535

print(f"Using Unicode version: {UNICODE_VERSION}")
all_codepoints = get_all_codepoints(UNICODE_VERSION)

Using Unicode version: 17.0.0
Using cached Unicode 17.0.0 UCD XML: /tmp/ucd-17.0.0.all.flat.xml


[The highest allowed code point in Unicode is `0x10FFFF` or "U+10FFFF"](https://stackoverflow.com/questions/52203351/why-is-unicode-restricted-to-0x10ffff), but it doesn't mean that all code points up to that value are assigned.

- Planes 15-16 (U+F0000 to U+10FFFF) are reserved as "Private Use Areas" and do not contain assigned characters.
- Plane 14 (U+E0000 to U+EFFFF), the "Supplementary Special-purpose Plane", contains very few assigned characters.
- Many code points in other planes are also unassigned.
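
A small sketch of how a code point maps to its plane, assuming only the standard library. The helper name `plane_of` is hypothetical, and the general category comes from the interpreter's bundled Unicode tables.

```python
import unicodedata

def plane_of(codepoint: int) -> int:
    """Each plane spans 0x10000 code points, so the plane index is the value above the low 16 bits."""
    return codepoint >> 16

last_plane = plane_of(0x10FFFF)  # last code point of the codespace
variation_selector_plane = plane_of(0xE01EF)  # 917,999 in decimal
private_use_category = unicodedata.category(chr(0xF0000))  # a plane-15 code point

print(last_plane, variation_selector_plane, private_use_category)
```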

In [3]:
print(f"Total assigned codepoints: {len(all_codepoints):,}")
print(f"Highest assigned codepoint: {all_codepoints[-1]:,}")
print(f"Highest possible codepoint: {0x10FFFF:,}")
print(f"Range density: {len(all_codepoints) / (all_codepoints[-1] + 1):.6%}")

Total assigned codepoints: 159,866
Highest assigned codepoint: 917,999
Highest possible codepoint: 1,114,111
Range density: 17.414597%


## Unicode Case Folding Analysis

### Direct Folding Targets

Case folding maps characters to a "folded" form for case-insensitive comparisons.
This is more comprehensive than simple lowercasing - it handles special cases like German ß → ss.
The very first thing we are interested in is: how often does each codepoint become a folding target for other characters?

The reason we are curious about this is that in simple cases, like the Russian letters "А" (A) and "а" (a), both forms fold to the same codepoint U+0430 (Cyrillic small letter a).
So when scanning for exact case-insensitive matches, we can compare each 2-byte UTF-8 slice against just 2 possible values: 0xD090 (U+0410) and 0xD0B0 (U+0430), without actually performing any case folding.
The easiest way to solve the problem is to avoid it altogether!
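
The idea can be sketched at the byte level with Python's `re` module standing in for a SIMD comparison. This is only an illustration under that assumption: the pattern name and the sample haystack are hypothetical.

```python
import re

# UTF-8 encodings of U+0410 and U+0430 share the lead byte 0xD0 and differ only in the second byte
upper_bytes = "\u0410".encode("utf-8")  # b"\xd0\x90"
lower_bytes = "\u0430".encode("utf-8")  # b"\xd0\xb0"

# A byte-level pattern that accepts either form, so the haystack itself is never folded
either_a = re.compile(re.escape(upper_bytes) + b"|" + re.escape(lower_bytes))

# Four Cyrillic letters: capital BE, capital A, small be, small a (2 bytes each)
haystack = "\u0411\u0410\u0431\u0430".encode("utf-8")
offsets = [match.start() for match in either_a.finditer(haystack)]
print(offsets)
```

Because 0xD0 is a lead byte, it can only occur at the start of a character in valid UTF-8, so the reported byte offsets always fall on character boundaries.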

In [4]:
case_folds = get_case_folding_rules_as_codepoints(UNICODE_VERSION)
print(f"Total case folding rules: {len(case_folds):,}")

Using cached Unicode 17.0.0 CaseFolding.txt: /tmp/CaseFolding-17.0.0.txt
Total case folding rules: 1,585


In [5]:
target_frequency = Counter()

for source_codepoint, target_codepoints in case_folds.items():
    for target_codepoint in target_codepoints:
        target_frequency[target_codepoint] += 1

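# Each rule may contribute several target codepoints, so this sum exceeds the rule count
print(f"Total target codepoint occurrences: {sum(target_frequency.values()):,}")
print(f"Unique target codepoints: {len(target_frequency):,}")

Total target codepoint occurrences: 1,705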
Unique target codepoints: 1,462


Let's display the most common folding targets:

In [6]:
rows = []
for codepoint, frequency in target_frequency.most_common():
    try:
        character = chr(codepoint)
        name = unicodedata.name(character, "")
    except (ValueError, OverflowError):
        character = "?"
        name = ""
    rows.append({"Codepoint": f"U+{codepoint:04X}", "Char": character, "Frequency": frequency, "Name": name})

pd.DataFrame(rows)

Unnamed: 0,Codepoint,Char,Frequency,Name
0,U+03B9,ι,71,GREEK SMALL LETTER IOTA
1,U+0342,͂,11,COMBINING GREEK PERISPOMENI
2,U+03C5,υ,10,GREEK SMALL LETTER UPSILON
3,U+0066,f,9,LATIN SMALL LETTER F
4,U+0308,̈,9,COMBINING DIAERESIS
...,...,...,...,...
1457,U+1E93F,𞤿,1,ADLAM SMALL LETTER KHA
1458,U+1E940,𞥀,1,ADLAM SMALL LETTER GBE
1459,U+1E941,𞥁,1,ADLAM SMALL LETTER ZAL
1460,U+1E942,𞥂,1,ADLAM SMALL LETTER KPO


This suggests that the "GREEK SMALL LETTER IOTA" (U+03B9) is the most common folding target, appearing in the folded form of 71 different codepoints.
The reason for this is historical.
Ancient Greek orthography has a feature called the "iota subscript" where iota was written as a small subscript beneath vowels (α, η, ω) to indicate certain grammatical forms (dative case, etc.).
When case-folding, these decompose and the subscript iota becomes a regular lowercase iota:

- ᾳ (alpha with ypogegrammeni) → αι
- ῃ (eta with ypogegrammeni) → ηι
- ῳ (omega with ypogegrammeni) → ωι
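
These three foldings can be reproduced with Python's built-in `str.casefold`, which performs full case folding using the interpreter's bundled Unicode tables. The dictionary name `folded_forms` is just a label for this sketch.

```python
# Alpha, eta, and omega with ypogegrammeni
sources = ("\u1fb3", "\u1fc3", "\u1ff3")

folded_forms = {source: source.casefold() for source in sources}
for source, folded in folded_forms.items():
    targets = " ".join(f"U+{ord(character):04X}" for character in folded)
    print(f"U+{ord(source):04X} -> {targets}")
```

Each result ends in U+03B9, which is how the iota collects so many partial hits.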

More importantly, at this point we see that `'f'`, `'s'`, `'i'`, `'t'` are the most common single-byte UTF-8 folding targets.
Each appears in the folded form of at least 4 different codepoints.
But that doesn't tell the whole story!

### Otherwise Ambiguous Folding Targets

Oftentimes, a character is only one of many characters in the produced folding result.

- `'ﬀ'` → `"ff"` - 3-byte codepoint mapping into 2x 1-byte codepoints
- `'ﬁ'` → `"fi"` - 3-byte codepoint mapping into 2x 1-byte codepoints
- `'ﬂ'` → `"fl"` - 3-byte codepoint mapping into 2x 1-byte codepoints
- `'ﬃ'` → `"ffi"` - 3-byte codepoint mapping into 3x 1-byte codepoints
- `'ﬄ'` → `"ffl"` - 3-byte codepoint mapping into 3x 1-byte codepoints

Let's account for those as well:

In [7]:
direct_target_frequency = Counter()  # Codepoint is the ONLY target of a folding
partial_target_frequency = Counter()  # Codepoint is ONE OF multiple targets in a folding

for source_codepoint, target_codepoints in case_folds.items():
    if len(target_codepoints) == 1:
        # Direct 1:1 folding (e.g., 'A' → 'a')
        direct_target_frequency[target_codepoints[0]] += 1
    else:
        # Multi-codepoint expansion (e.g., 'ﬁ' → 'f', 'i')
        for target_codepoint in target_codepoints:
            partial_target_frequency[target_codepoint] += 1

# Some codepoints may be both direct AND partial targets
both_targets = set(direct_target_frequency.keys()) & set(partial_target_frequency.keys())

print(f"Total folding rules: {len(case_folds):,}")
print(f"  - Direct 1:1 foldings: {sum(1 for t in case_folds.values() if len(t) == 1):,}")
print(f"  - Multi-codepoint expansions: {sum(1 for t in case_folds.values() if len(t) > 1):,}")
print()
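print(f"Unique target codepoints:")
# Caution: Counter subtraction keeps every key whose count stays positive, so a codepoint that is
# both a direct and a partial target is still counted below whenever its two counts differ
print(f"  - Only direct targets: {len(direct_target_frequency - partial_target_frequency):,}")
print(f"  - Only partial targets: {len(partial_target_frequency - direct_target_frequency):,}")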
print(f"  - Both direct AND partial: {len(both_targets):,}")

Total folding rules: 1,585
  - Direct 1:1 foldings: 1,481
  - Multi-codepoint expansions: 104

Unique target codepoints:
  - Only direct targets: 1,398
  - Only partial targets: 48
  - Both direct AND partial: 54


The following table differentiates complete and partial folding targets:

In [8]:
rows = []
for codepoint, partial_frequency in partial_target_frequency.most_common():
    try:
        character = chr(codepoint)
    except (ValueError, OverflowError):
        character = "?"

    direct_frequency = direct_target_frequency.get(codepoint, 0)

    # Find an example expansion containing this codepoint
    example = ""
    for source_codepoint, target_codepoints in case_folds.items():
        if len(target_codepoints) > 1 and codepoint in target_codepoints:
            try:
                source_character = chr(source_codepoint)
                target_string = "".join(chr(c) for c in target_codepoints)
                example = f"'{source_character}' → \"{target_string}\""
            except (ValueError, OverflowError):
                example = f"U+{source_codepoint:04X} → {target_codepoints}"
            break

    rows.append(
        {
            "Codepoint": f"U+{codepoint:04X}",
            "Char": character,
            "Partial": partial_frequency,
            "Direct": direct_frequency,
            "Example Expansion": example,
        }
    )

pd.DataFrame(rows)

Unnamed: 0,Codepoint,Char,Partial,Direct,Example Expansion
0,U+03B9,ι,68,3,"'ΐ' → ""ΐ"""
1,U+0342,͂,11,0,"'ὖ' → ""ὖ"""
2,U+0308,̈,9,0,"'ΐ' → ""ΐ"""
3,U+03C5,υ,9,1,"'ΰ' → ""ΰ"""
4,U+0066,f,8,1,"'ﬀ' → ""ff"""
...,...,...,...,...,...
60,U+1F7C,ὼ,1,1,"'ῲ' → ""ὼι"""
61,U+03CE,ώ,1,1,"'ῴ' → ""ώι"""
62,U+056B,ի,1,1,"'ﬕ' → ""մի"""
63,U+057E,վ,1,1,"'ﬖ' → ""վն"""


### Safe Single-byte Folding Anchors

Of all those characters, we are most interested in the codepoints representable in just 1 byte in UTF-8, as we can process 64 of them in a `ZMM` register at once.
Those are the boring ASCII letters.
But we can't just apply traditional SIMD ASCII case-insensitive search techniques like:

```c
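// Requires <immintrin.h>. ORing 0x20 into every byte maps A-Z to a-z,
// but it also alters other bytes, e.g. '@' (0x40) becomes '`' (0x60).
__m512i lower_mask = _mm512_set1_epi8(0x20);
__m512i input_chunk = _mm512_loadu_si512(input_ptr);
__m512i folded_chunk = _mm512_or_si512(input_chunk, lower_mask);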
```

If the needle contains an `'f'` and the haystack contains an `'ﬃ'`, we would miss the match.
So we must know which of the single-byte codepoints are folding targets of multiple codepoints.
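
The missed match is easy to reproduce in plain Python. The sketch below emulates an ASCII-only fold byte by byte and compares it with `str.casefold`; the helper `ascii_fold` and the sample haystack are hypothetical, and a real implementation would operate on SIMD registers rather than Python bytes.

```python
def ascii_fold(data: bytes) -> bytes:
    """Lowercase only the bytes for A-Z, leaving every other byte untouched."""
    return bytes(byte | 0x20 if 0x41 <= byte <= 0x5A else byte for byte in data)

needle = "f"
haystack = "o\ufb03ce"  # "office" spelled with the 'ffi' ligature U+FB03

# ASCII-only folding never sees an 'f' byte, because the ligature is encoded as EF AC 83
ascii_hit = ascii_fold(needle.encode("utf-8")) in ascii_fold(haystack.encode("utf-8"))

# Full case folding expands the ligature into "ffi" first
full_hit = needle.casefold() in haystack.casefold()

print(ascii_hit, full_hit)
```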

In [14]:
ascii_targets = {}
for codepoint in range(UTF8_1BYTE_MAX + 1):
    direct = direct_target_frequency.get(codepoint, 0)
    partial = partial_target_frequency.get(codepoint, 0)
    total = direct + partial
    if total > 0:
        ascii_targets[codepoint] = {"direct": direct, "partial": partial, "total": total}

# Separate into "safe" (exactly 1 source) vs "ambiguous" (multiple sources)
safe_ascii = {codepoint: info for codepoint, info in ascii_targets.items() if info["total"] == 1}
ambiguous_ascii = {codepoint: info for codepoint, info in ascii_targets.items() if info["total"] > 1}

print(f"Total ASCII targets: {len(ascii_targets)}")
print(f"  - Safe (exactly 1 source): {len(safe_ascii)}")
print(f"  - Ambiguous (multiple sources): {len(ambiguous_ascii)}")

Total ASCII targets: 26
  - Safe (exactly 1 source): 14
  - Ambiguous (multiple sources): 12


The following table shows safe ASCII targets that can use simple SIMD case folding (each has exactly one source):

In [10]:
safe_rows = []
for codepoint in sorted(safe_ascii.keys()):
    character = chr(codepoint)
    for source_codepoint, target_codepoints in case_folds.items():
        if codepoint in target_codepoints:
            source_character = chr(source_codepoint)
            safe_rows.append(
                {
                    "Target": f"'{character}' (U+{codepoint:04X})",
                    "Source": f"'{source_character}' (U+{source_codepoint:04X})",
                }
            )
            break

pd.DataFrame(safe_rows)

Unnamed: 0,Target,Source
0,'b' (U+0062),'B' (U+0042)
1,'c' (U+0063),'C' (U+0043)
2,'d' (U+0064),'D' (U+0044)
3,'e' (U+0065),'E' (U+0045)
4,'g' (U+0067),'G' (U+0047)
5,'m' (U+006D),'M' (U+004D)
6,'o' (U+006F),'O' (U+004F)
7,'p' (U+0070),'P' (U+0050)
8,'q' (U+0071),'Q' (U+0051)
9,'r' (U+0072),'R' (U+0052)


The following table shows ambiguous ASCII targets that need special handling in SIMD (each has multiple sources):

In [11]:
ambiguous_rows = []
for codepoint in sorted(ambiguous_ascii.keys()):
    character = chr(codepoint)
    info = ambiguous_ascii[codepoint]

    # Find all sources
    sources = []
    for source_codepoint, target_codepoints in case_folds.items():
        if codepoint in target_codepoints:
            try:
                source_character = chr(source_codepoint)
                if len(target_codepoints) == 1:
                    sources.append(f"'{source_character}'")
                else:
                    target_string = "".join(chr(c) for c in target_codepoints)
                    sources.append(f"'{source_character}'→\"{target_string}\"")
            except (ValueError, OverflowError):
                sources.append(f"U+{source_codepoint:04X}")

    sources_string = ", ".join(sources[:6])
    if len(sources) > 6:
        sources_string += f" (+{len(sources)-6} more)"

    ambiguous_rows.append(
        {
            "Char": f"'{character}'",
            "Codepoint": f"U+{codepoint:04X}",
            "Direct": info["direct"],
            "Partial": info["partial"],
            "Total": info["total"],
            "Sources": sources_string,
        }
    )

pd.DataFrame(ambiguous_rows)

Unnamed: 0,Char,Codepoint,Direct,Partial,Total,Sources
0,'a',U+0061,1,1,2,"'A', 'ẚ'→""aʾ"""
1,'f',U+0066,1,8,9,"'F', 'ﬀ'→""ff"", 'ﬁ'→""fi"", 'ﬂ'→""fl"", 'ﬃ'→""ffi"", ..."
2,'h',U+0068,1,1,2,"'H', 'ẖ'→""ẖ"""
3,'i',U+0069,1,3,4,"'I', 'İ'→""i̇"", 'ﬁ'→""fi"", 'ﬃ'→""ffi"""
4,'j',U+006A,1,1,2,"'J', 'ǰ'→""ǰ"""
5,'k',U+006B,2,0,2,"'K', 'K'"
6,'l',U+006C,1,2,3,"'L', 'ﬂ'→""fl"", 'ﬄ'→""ffl"""
7,'n',U+006E,1,1,2,"'N', 'ŉ'→""ʼn"""
8,'s',U+0073,2,6,8,"'S', 'ß'→""ss"", 'ſ', 'ẞ'→""ss"", 'ﬅ'→""st"", 'ﬆ'→""st"""
9,'t',U+0074,1,3,4,"'T', 'ẗ'→""ẗ"""


However, even "ambiguous" ASCII characters can be contextually safe based on what follows them in the needle.
For example, `'f'` is ambiguous because of ligatures like `'ﬁ'` → `"fi"`.
But if the needle contains `"fog"`, the `'f'` is safe because no ligature expands to `"fo..."`.
The following analysis identifies when each ambiguous character becomes safe based on its context:

In [19]:
contextual_safety = {}

for codepoint in ambiguous_ascii.keys():
    char = chr(codepoint)
    dangerous_following = set()
    dangerous_preceding = set()
    ligature_examples = []

    # Find all multi-codepoint expansions that include this character
    for source_codepoint, target_codepoints in case_folds.items():
        if len(target_codepoints) > 1:  # Multi-codepoint expansion
            expansion = "".join(chr(c) for c in target_codepoints)

            # Find all positions where our character appears
            for pos, c in enumerate(expansion):
                if ord(c) == codepoint:
                    source_char = chr(source_codepoint)

                    # If not the last character, next char is "dangerous following"
                    if pos < len(expansion) - 1:
                        next_char = expansion[pos + 1]
                        dangerous_following.add(next_char)
                        if len(ligature_examples) < 3:
                            ligature_examples.append(f"'{source_char}'→\"{expansion}\"")

                    # If not the first character, prev char is "dangerous preceding"
                    if pos > 0:
                        prev_char = expansion[pos - 1]
                        dangerous_preceding.add(prev_char)

    if dangerous_following or dangerous_preceding:
        contextual_safety[char] = {
            "dangerous_following": dangerous_following,
            "dangerous_preceding": dangerous_preceding,
            "examples": ligature_examples,
        }

# Build output table
context_rows = []
for char in sorted(contextual_safety.keys()):
    info = contextual_safety[char]
    following = info["dangerous_following"]
    preceding = info["dangerous_preceding"]

    if following:
        safe_following = f"NOT followed by: {{{', '.join(repr(c) for c in sorted(following))}}}"
    else:
        safe_following = "any following char"

    if preceding:
        safe_preceding = f"NOT preceded by: {{{', '.join(repr(c) for c in sorted(preceding))}}}"
    else:
        safe_preceding = "any preceding char"

    context_rows.append(
        {
            "Char": f"'{char}'",
            "Safe when following": safe_following,
            "Safe when preceding": safe_preceding,
            "Example ligatures": ", ".join(info["examples"]),
        }
    )

pd.DataFrame(context_rows)

Unnamed: 0,Char,Safe when following,Safe when preceding,Example ligatures
0,'a',NOT followed by: {'ʾ'},any preceding char,"'ẚ'→""aʾ"""
1,'f',"NOT followed by: {'f', 'i', 'l'}",NOT preceded by: {'f'},"'ﬀ'→""ff"", 'ﬁ'→""fi"", 'ﬂ'→""fl"""
2,'h',NOT followed by: {'̱'},any preceding char,"'ẖ'→""ẖ"""
3,'i',NOT followed by: {'̇'},NOT preceded by: {'f'},"'İ'→""i̇"""
4,'j',NOT followed by: {'̌'},any preceding char,"'ǰ'→""ǰ"""
5,'l',any following char,NOT preceded by: {'f'},
6,'n',any following char,NOT preceded by: {'ʼ'},
7,'s',"NOT followed by: {'s', 't'}",NOT preceded by: {'s'},"'ß'→""ss"", 'ẞ'→""ss"", 'ﬅ'→""st"""
8,'t',NOT followed by: {'̈'},NOT preceded by: {'s'},"'ẗ'→""ẗ"""
9,'w',NOT followed by: {'̊'},any preceding char,"'ẘ'→""ẘ"""
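
The "fog" versus "fi" distinction can be checked with a standalone sketch that does not depend on the notebook's helpers. It assumes Python's `str.casefold` as the source of expansions (so the Unicode version is whatever the interpreter bundles), and it only inspects adjacent pairs inside the needle. It deliberately ignores what may precede or follow the needle in the haystack, so it is a simplification rather than a complete safety test. The names `dangerous_pairs` and `risky_positions` are hypothetical.

```python
import sys

# Collect every multi-codepoint expansion known to this interpreter's Unicode tables
expansions = set()
for codepoint in range(sys.maxunicode + 1):
    folded = chr(codepoint).casefold()
    if len(folded) > 1:
        expansions.add(folded)

# Adjacent pairs inside any expansion are the "dangerous" contexts
dangerous_pairs = {
    (expansion[i], expansion[i + 1]) for expansion in expansions for i in range(len(expansion) - 1)
}

def risky_positions(needle: str) -> list:
    """Offsets in the folded needle where an adjacent pair also occurs inside some expansion."""
    folded = needle.casefold()
    return [i for i in range(len(folded) - 1) if (folded[i], folded[i + 1]) in dangerous_pairs]

print(risky_positions("fog"), risky_positions("fig"))
```

An empty list means no pair of neighbouring needle characters could have come from a single expanded codepoint.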


Looking at this, if the needle contains a contiguous run made up only of `'b'`, `'c'`, `'d'`, `'e'`, `'g'`, `'m'`, `'o'`, `'p'`, `'q'`, `'r'`, `'u'`, `'v'`, `'x'`, `'z'`, in any order or case, we can trivially match that run using the simple SIMD snippet from above.
The run must not include `'a'`, `'f'`, `'h'`, `'i'`, `'j'`, `'k'`, `'l'`, `'n'`, `'s'`, `'t'`, `'w'`, or `'y'`.

Moreover, there is a group of single-byte UTF-8 codepoints that don't participate in any folding mappings at all:

In [12]:
uninvolved_ascii = [
    codepoint
    for codepoint in range(UTF8_1BYTE_MAX + 1)
    if codepoint not in ascii_targets and codepoint not in case_folds
]
print(f"ASCII codepoints completely uninvolved in folding: {len(uninvolved_ascii)}")

control_characters = [codepoint for codepoint in uninvolved_ascii if codepoint < 32 or codepoint == 127]
digits = [chr(codepoint) for codepoint in uninvolved_ascii if chr(codepoint).isdigit()]
punctuation = [
    chr(codepoint) for codepoint in uninvolved_ascii if 32 <= codepoint < 127 and not chr(codepoint).isalnum()
]

print(f"Control characters: {len(control_characters)} (0x00-0x1F, 0x7F)")
print(f"Digits: {''.join(digits)}")
print(f"Punctuation/Symbols: {''.join(punctuation)}")

ASCII codepoints completely uninvolved in folding: 76
Control characters: 33 (0x00-0x1F, 0x7F)
Digits: 0123456789
Punctuation/Symbols:  !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
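
Putting the safe letters and the uninvolved characters together, one way to pick an anchor is to take the longest run of such characters in the needle. The sketch below is a hypothetical illustration, not StringZilla's implementation: the two sets are copied from the outputs above (control characters are left out for brevity), and `longest_safe_run` is a name invented for this example.

```python
# Letters with exactly one folding source, and printable ASCII that takes no part in folding
SAFE_LETTERS = set("bcdegmopqruvxz")
UNINVOLVED = set("0123456789 !\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~")

def longest_safe_run(needle: str) -> str:
    """Longest substring of `needle` made only of safe letters (either case) and uninvolved ASCII."""
    best, current = "", ""
    for character in needle:
        if character.lower() in SAFE_LETTERS or character in UNINVOLVED:
            current += character
            if len(current) > len(best):
                best = current
        else:
            # An ambiguous or non-ASCII character ends the current run
            current = ""
    return best

print(repr(longest_safe_run("Unicode-Aware Grep")))
```

Here the run `"code-"` is interrupted by the ambiguous `'A'`, so the longer `"re Grep"` wins.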


### Safe Two-byte Folding Anchors

The more interesting and challenging part is the 2-byte UTF-8 codepoints that fold into either another single 2-byte codepoint or two 1-byte codepoints.

In [None]:
two_byte_folds = {}
for source_codepoint, target_codepoints in case_folds.items():
    if UTF8_1BYTE_MAX < source_codepoint <= UTF8_2BYTE_MAX:  # 2-byte UTF-8 range
        two_byte_folds[source_codepoint] = target_codepoints

print(f"2-byte UTF-8 codepoints with case folding: {len(two_byte_folds):,}")
print()

# Categorize by target type
folds_to_1byte = {}  # 2-byte → single 1-byte (e.g., some Latin letters)
folds_to_2byte = {}  # 2-byte → single 2-byte (most common)
folds_to_2x1byte = {}  # 2-byte → two 1-byte codepoints
folds_to_other = {}  # Other patterns

for source_codepoint, target_codepoints in two_byte_folds.items():
    target_sizes = [
        (
            1
            if codepoint <= UTF8_1BYTE_MAX
            else 2 if codepoint <= UTF8_2BYTE_MAX else 3 if codepoint <= UTF8_3BYTE_MAX else 4
        )
        for codepoint in target_codepoints
    ]

    if len(target_codepoints) == 1:
        if target_sizes[0] == 1:
            folds_to_1byte[source_codepoint] = target_codepoints
        elif target_sizes[0] == 2:
            folds_to_2byte[source_codepoint] = target_codepoints
        else:
            folds_to_other[source_codepoint] = target_codepoints
    elif len(target_codepoints) == 2 and all(size == 1 for size in target_sizes):
        folds_to_2x1byte[source_codepoint] = target_codepoints
    else:
        folds_to_other[source_codepoint] = target_codepoints

print(f"Folding patterns for 2-byte UTF-8 sources:")
print(f"  2-byte → 1-byte:     {len(folds_to_1byte):,}")
print(f"  2-byte → 2-byte:     {len(folds_to_2byte):,}")
print(f"  2-byte → 2x 1-byte:  {len(folds_to_2x1byte):,}")
print(f"  Other patterns:      {len(folds_to_other):,}")

2-byte UTF-8 codepoints with case folding: 460

Folding patterns for 2-byte UTF-8 sources:
  2-byte → 1-byte:     1
  2-byte → 2-byte:     450
  2-byte → 2x 1-byte:  1
  Other patterns:      8


Of the 460 case folding rules for 2-byte UTF-8 sources, the vast majority (450) map to another 2-byte codepoint.
The remaining 10 are special cases worth understanding:

__2-byte → 1-byte (1 case):__

- `'ſ'` (U+017F, LATIN SMALL LETTER LONG S) → `'s'` - historical long S folds to regular ASCII s

__2-byte → 2x 1-byte (1 case):__

- `'ß'` (U+00DF, LATIN SMALL LETTER SHARP S) → `"ss"` - German eszett expands to two ASCII characters

__Other patterns (8 cases):__

These are the tricky edge cases that don't fit clean patterns:

- `'İ'` (U+0130) → `'i'` + combining dot above (1-byte + 2-byte) - Turkish capital I with dot
- `'ŉ'` (U+0149) → modifier apostrophe + `'n'` (2-byte + 1-byte) - deprecated character
- `'ǰ'` (U+01F0) → `'j'` + combining caron (1-byte + 2-byte) - J with caron decomposes
- `'Ⱥ'` (U+023A) → `'ⱥ'` (U+2C65) - 2-byte source maps to 3-byte target!
- `'Ⱦ'` (U+023E) → `'ⱦ'` (U+2C66) - another 2-byte → 3-byte case
- `'ΐ'` (U+0390) → ι + combining diaeresis + combining acute (3x 2-byte) - Greek with diacritics
- `'ΰ'` (U+03B0) → υ + combining diaeresis + combining acute (3x 2-byte) - Greek with diacritics
- `'և'` (U+0587) → ե + ւ (2x 2-byte) - Armenian ligature

The `'Ⱥ'` and `'Ⱦ'` cases are particularly noteworthy: they are 2-byte UTF-8 sources that fold to 3-byte targets, meaning the folded form is *longer* than the original!
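
One practical consequence is that a buffer for folded output cannot simply reuse the input length. The sketch below scans every code point that takes 2 bytes in UTF-8 and records those whose folded form is longer. It assumes Python's `str.casefold` and the interpreter's bundled Unicode tables, so the exact count may differ from the 17.0 data used in this notebook; the name `growing` is a label for this example.

```python
growing = {}
for codepoint in range(0x80, 0x800):  # every code point that takes 2 bytes in UTF-8
    folded_size = len(chr(codepoint).casefold().encode("utf-8"))
    if folded_size > 2:
        growing[codepoint] = folded_size

worst = max(growing.values())
print(f"{len(growing)} two-byte sources grow when folded; the largest result is {worst} bytes")
print(f"U+023A -> {growing[0x023A]} bytes, U+0390 -> {growing[0x0390]} bytes")
```

Under these assumptions, the worst case for a 2-byte source is a threefold growth, from the Greek letters that expand into three 2-byte code points.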

Given the much larger search space, we want to group these foldings into contiguous source and target ranges wherever possible.

The following table shows continuous ranges of 2-byte UTF-8 codepoints that fold to other 2-byte codepoints with a constant offset (e.g., uppercase → lowercase within the same script block):

In [15]:
# Find continuous ranges of 2-byte → 2-byte foldings with constant offset
# Sort by source codepoint to find consecutive sequences
sorted_2byte = sorted(folds_to_2byte.items())

ranges = []
if sorted_2byte:
    range_start = sorted_2byte[0][0]
    range_offset = sorted_2byte[0][1][0] - sorted_2byte[0][0]
    prev_source = sorted_2byte[0][0]

    for source_codepoint, target_codepoints in sorted_2byte[1:]:
        target_codepoint = target_codepoints[0]
        current_offset = target_codepoint - source_codepoint

        # Check if this continues the current range (consecutive source AND same offset)
        if source_codepoint == prev_source + 1 and current_offset == range_offset:
            prev_source = source_codepoint
        else:
            # End the current range and start a new one
            ranges.append((range_start, prev_source, range_offset))
            range_start = source_codepoint
            range_offset = current_offset
            prev_source = source_codepoint

    # Don't forget the last range
    ranges.append((range_start, prev_source, range_offset))

# Build DataFrame with range information
range_rows = []
for start, end, offset in ranges:
    length = end - start + 1
    try:
        start_char = chr(start)
        end_char = chr(end)
        target_start_char = chr(start + offset)
        target_end_char = chr(end + offset)
        script = unicodedata.name(start_char, "").split()[0] if length > 1 else ""
    except (ValueError, OverflowError):
        start_char = end_char = target_start_char = target_end_char = "?"
        script = ""

    range_rows.append(
        {
            "Source Start": f"U+{start:04X} ({start_char})",
            "Source End": f"U+{end:04X} ({end_char})",
            "Target Start": f"U+{start + offset:04X} ({target_start_char})",
            "Target End": f"U+{end + offset:04X} ({target_end_char})",
            "Length": length,
            "Offset": f"+{offset}" if offset > 0 else str(offset),
            "Script": script,
        }
    )

print(f"Found {len(ranges)} continuous ranges of 2-byte → 2-byte foldings")
print(f"Ranges of length > 1: {sum(1 for r in ranges if r[1] - r[0] > 0)}")
print(f"Single-codepoint 'ranges': {sum(1 for r in ranges if r[1] == r[0])}")
print()

# Show only ranges with length > 1 (the interesting ones for SIMD)
multi_ranges = [r for r in range_rows if r["Length"] > 1]
print(f"Multi-codepoint ranges (useful for SIMD optimization):")
pd.DataFrame(multi_ranges)

Found 308 continuous ranges of 2-byte → 2-byte foldings
Ranges of length > 1: 12
Single-codepoint 'ranges': 296

Multi-codepoint ranges (useful for SIMD optimization):


Unnamed: 0,Source Start,Source End,Target Start,Target End,Length,Offset,Script
0,U+00C0 (À),U+00D6 (Ö),U+00E0 (à),U+00F6 (ö),23,32,LATIN
1,U+00D8 (Ø),U+00DE (Þ),U+00F8 (ø),U+00FE (þ),7,32,LATIN
2,U+0189 (Ɖ),U+018A (Ɗ),U+0256 (ɖ),U+0257 (ɗ),2,205,LATIN
3,U+01B1 (Ʊ),U+01B2 (Ʋ),U+028A (ʊ),U+028B (ʋ),2,217,LATIN
4,U+0388 (Έ),U+038A (Ί),U+03AD (έ),U+03AF (ί),3,37,GREEK
5,U+038E (Ύ),U+038F (Ώ),U+03CD (ύ),U+03CE (ώ),2,63,GREEK
6,U+0391 (Α),U+03A1 (Ρ),U+03B1 (α),U+03C1 (ρ),17,32,GREEK
7,U+03A3 (Σ),U+03AB (Ϋ),U+03C3 (σ),U+03CB (ϋ),9,32,GREEK
8,U+03FD (Ͻ),U+03FF (Ͽ),U+037B (ͻ),U+037D (ͽ),3,-130,GREEK
9,U+0400 (Ѐ),U+040F (Џ),U+0450 (ѐ),U+045F (џ),16,80,CYRILLIC
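
A range table like this can replace per-character lookups with one comparison and one addition per range. The sketch below copies five of the rows above into a hypothetical `RANGES` list and also prints the UTF-8 bytes before and after folding, since what matters for a byte-level kernel is whether the offset stays within the same lead byte. The function name `fold_by_range` is an assumption made for this example.

```python
# A few (first, last, offset) rows copied from the table above
RANGES = [
    (0x00C0, 0x00D6, 32),
    (0x00D8, 0x00DE, 32),
    (0x0391, 0x03A1, 32),
    (0x03A3, 0x03AB, 32),
    (0x0400, 0x040F, 80),
]

def fold_by_range(codepoint: int) -> int:
    """Add the range's offset if `codepoint` falls in a known range, else return it unchanged."""
    for first, last, offset in RANGES:
        if first <= codepoint <= last:
            return codepoint + offset
    return codepoint

# In UTF-8, an offset may or may not stay within the same lead byte
for source in (0x00C0, 0x0400):
    before = chr(source).encode("utf-8")
    after = chr(fold_by_range(source)).encode("utf-8")
    print(f"U+{source:04X}: {before.hex()} -> {after.hex()}")
```

For U+00C0 only the second byte changes (`c380` to `c3a0`), while for U+0400 both bytes change (`d080` to `d190`), so the two ranges would need different byte-level handling.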


### Three-byte UTF-8 Case Folding

3-byte UTF-8 covers codepoints U+0800 to U+FFFF (2048 to 65535).
This includes many scripts: Extended Greek, Cherokee, Georgian, and various symbol blocks.

In [16]:
# 3-byte UTF-8 codepoints: U+0800 to U+FFFF (2048 to 65535)
three_byte_folds = {}
for source_codepoint, target_codepoints in case_folds.items():
    if UTF8_2BYTE_MAX < source_codepoint <= UTF8_3BYTE_MAX:
        three_byte_folds[source_codepoint] = target_codepoints

print(f"3-byte UTF-8 codepoints with case folding: {len(three_byte_folds):,}")
print()

# Categorize by target pattern
three_to_3byte = {}  # 3-byte → single 3-byte
three_to_2byte = {}  # 3-byte → single 2-byte (shrinks!)
three_to_1byte = {}  # 3-byte → 1-byte sequence
three_to_other = {}  # Multi-codepoint or mixed

for source_codepoint, target_codepoints in three_byte_folds.items():
    target_sizes = [
        1 if cp <= UTF8_1BYTE_MAX else 2 if cp <= UTF8_2BYTE_MAX else 3 if cp <= UTF8_3BYTE_MAX else 4
        for cp in target_codepoints
    ]

    if len(target_codepoints) == 1:
        if target_sizes[0] == 3:
            three_to_3byte[source_codepoint] = target_codepoints
        elif target_sizes[0] == 2:
            three_to_2byte[source_codepoint] = target_codepoints
        elif target_sizes[0] == 1:
            three_to_1byte[source_codepoint] = target_codepoints
        else:
            three_to_other[source_codepoint] = target_codepoints
    else:
        three_to_other[source_codepoint] = target_codepoints

print(f"Folding patterns for 3-byte UTF-8 sources:")
print(f"  3-byte → 3-byte:  {len(three_to_3byte):,}")
print(f"  3-byte → 2-byte:  {len(three_to_2byte):,}")
print(f"  3-byte → 1-byte:  {len(three_to_1byte):,}")
print(f"  Other patterns:   {len(three_to_other):,}")

3-byte UTF-8 codepoints with case folding: 792

Folding patterns for 3-byte UTF-8 sources:
  3-byte → 3-byte:  663
  3-byte → 2-byte:  31
  3-byte → 1-byte:  1
  Other patterns:   97


The following table shows continuous ranges of 3-byte UTF-8 codepoints that fold to other 3-byte codepoints:

In [17]:
# Find continuous ranges of 3-byte → 3-byte foldings
sorted_3byte = sorted(three_to_3byte.items())

ranges_3byte = []
if sorted_3byte:
    range_start = sorted_3byte[0][0]
    range_offset = sorted_3byte[0][1][0] - sorted_3byte[0][0]
    prev_source = sorted_3byte[0][0]

    for source_codepoint, target_codepoints in sorted_3byte[1:]:
        target_codepoint = target_codepoints[0]
        current_offset = target_codepoint - source_codepoint

        if source_codepoint == prev_source + 1 and current_offset == range_offset:
            prev_source = source_codepoint
        else:
            ranges_3byte.append((range_start, prev_source, range_offset))
            range_start = source_codepoint
            range_offset = current_offset
            prev_source = source_codepoint

    ranges_3byte.append((range_start, prev_source, range_offset))

# Build DataFrame
range_rows_3byte = []
for start, end, offset in ranges_3byte:
    length = end - start + 1
    try:
        start_char = chr(start)
        end_char = chr(end)
        target_start_char = chr(start + offset)
        target_end_char = chr(end + offset)
        script = unicodedata.name(start_char, "").split()[0] if length > 1 else ""
    except (ValueError, OverflowError):
        start_char = end_char = target_start_char = target_end_char = "?"
        script = ""

    range_rows_3byte.append(
        {
            "Source Start": f"U+{start:04X} ({start_char})",
            "Source End": f"U+{end:04X} ({end_char})",
            "Target Start": f"U+{start + offset:04X} ({target_start_char})",
            "Target End": f"U+{end + offset:04X} ({target_end_char})",
            "Length": length,
            "Offset": f"+{offset}" if offset > 0 else str(offset),
            "Script": script,
        }
    )

print(f"Found {len(ranges_3byte)} continuous ranges of 3-byte → 3-byte foldings")
print(f"Ranges of length > 1: {sum(1 for r in ranges_3byte if r[1] - r[0] > 0)}")
print(f"Single-codepoint 'ranges': {sum(1 for r in ranges_3byte if r[1] == r[0])}")
print()

multi_ranges_3byte = [r for r in range_rows_3byte if r["Length"] > 1]
print(f"Multi-codepoint ranges (useful for SIMD optimization):")
pd.DataFrame(multi_ranges_3byte)

Found 337 continuous ranges of 3-byte → 3-byte foldings
Ranges of length > 1: 24
Single-codepoint 'ranges': 313

Multi-codepoint ranges (useful for SIMD optimization):


Unnamed: 0,Source Start,Source End,Target Start,Target End,Length,Offset,Script
0,U+10A0 (Ⴀ),U+10C5 (Ⴥ),U+2D00 (ⴀ),U+2D25 (ⴥ),38,7264,GEORGIAN
1,U+13F8 (ᏸ),U+13FD (ᏽ),U+13F0 (Ᏸ),U+13F5 (Ᏽ),6,-8,CHEROKEE
2,U+1C90 (Ა),U+1CBA (Ჺ),U+10D0 (ა),U+10FA (ჺ),43,-3008,GEORGIAN
3,U+1CBD (Ჽ),U+1CBF (Ჿ),U+10FD (ჽ),U+10FF (ჿ),3,-3008,GEORGIAN
4,U+1F08 (Ἀ),U+1F0F (Ἇ),U+1F00 (ἀ),U+1F07 (ἇ),8,-8,GREEK
5,U+1F18 (Ἐ),U+1F1D (Ἕ),U+1F10 (ἐ),U+1F15 (ἕ),6,-8,GREEK
6,U+1F28 (Ἠ),U+1F2F (Ἧ),U+1F20 (ἠ),U+1F27 (ἧ),8,-8,GREEK
7,U+1F38 (Ἰ),U+1F3F (Ἷ),U+1F30 (ἰ),U+1F37 (ἷ),8,-8,GREEK
8,U+1F48 (Ὀ),U+1F4D (Ὅ),U+1F40 (ὀ),U+1F45 (ὅ),6,-8,GREEK
9,U+1F68 (Ὠ),U+1F6F (Ὧ),U+1F60 (ὠ),U+1F67 (ὧ),8,-8,GREEK
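
As a cross-check, a few of these rows can be compared against Python's own `str.casefold`, together with a count of how many UTF-8 bytes actually change. This is a sketch under the assumption that the interpreter's bundled Unicode tables already contain these scripts; `ROWS`, `row_matches_casefold`, and `differing_bytes` are hypothetical names.

```python
# (first, last, offset) for the Georgian, Cherokee, and first Greek rows of the table above
ROWS = [(0x10A0, 0x10C5, 7264), (0x13F8, 0x13FD, -8), (0x1F08, 0x1F0F, -8)]

def row_matches_casefold(first: int, last: int, offset: int) -> bool:
    """True if every code point in the range folds to itself plus `offset`."""
    return all(chr(cp).casefold() == chr(cp + offset) for cp in range(first, last + 1))

def differing_bytes(first: int, offset: int) -> int:
    """Number of UTF-8 byte positions that differ between the first source and its target."""
    before = chr(first).encode("utf-8")
    after = chr(first + offset).encode("utf-8")
    return sum(a != b for a, b in zip(before, after))

for first, last, offset in ROWS:
    print(
        f"U+{first:04X}..U+{last:04X}: matches casefold = {row_matches_casefold(first, last, offset)}, "
        f"bytes changed = {differing_bytes(first, offset)}"
    )
```

The offset of -8 touches only the last byte of each 3-byte sequence, whereas the Georgian offset of 7264 changes all three bytes.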
