# Unicode Normalization and Search Anchor Analysis

This notebook explores Unicode case folding and normalization properties to identify optimal "anchor points" for case-insensitive and normalization-insensitive string search algorithms.

In [1]:
!python -m pip install -q tabulate 2>/dev/null || curl -sS https://bootstrap.pypa.io/get-pip.py | python && python -m pip install -q tabulate


Before we start, a small reminder on Unicode.
Unicode is a versioned standard.
In 2025, the latest version is Unicode 17.0.
It defines over a million code points (1,114,112 in total), of which roughly 160,000 are assigned.
Some of them belong to "bicameral" scripts (like Latin, Greek, Cyrillic) that have distinct uppercase and lowercase forms.
Others belong to "unicameral" scripts (like Chinese, Japanese, Korean, Arabic) that do not have case distinctions.
That does not mean, however, that every character in such a script has a single representation: the same text can often be encoded in several equivalent ways.
So "case folding" and "normalization" are two different concepts.
We will explore both in this notebook.

Unicode also doesn't require UTF-8 encoding, but UTF-8 is the most popular encoding on the web and in modern applications and the one we will focus on in StringZilla.
In UTF-8, each code point is represented by one, two, three, or four bytes.
A folded or normalized character can map to a sequence of multiple code points, and each of those code points can have a different length representation in UTF-8.
That's why the vast majority of modern text-processing applications disable full Unicode processing.
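
To make the length mismatch concrete, here is a minimal sketch using Python's built-in `str.casefold()`, which applies full case folding with whatever Unicode tables ship with the running interpreter (an assumption: they may lag behind Unicode 17.0). The helper name `utf8_len` is ours and not part of any library.

```python
def utf8_len(text: str) -> int:
    """Number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Characters whose folded form differs in codepoint count or byte length
samples = [
    ("\u00DF", "LATIN SMALL LETTER SHARP S"),  # folds to "ss"
    ("\uFB03", "LATIN SMALL LIGATURE FFI"),  # folds to "ffi"
    ("\u023A", "LATIN CAPITAL LETTER A WITH STROKE"),  # folds to U+2C65
    ("\u0130", "LATIN CAPITAL LETTER I WITH DOT ABOVE"),  # folds to "i" + U+0307
]

for character, name in samples:
    folded = character.casefold()
    print(
        f"{name}: {len(character)} codepoint, {utf8_len(character)} bytes "
        f"→ {len(folded)} codepoint(s), {utf8_len(folded)} bytes"
    )
```

The folded form may keep its byte length, shrink, or grow, so a search routine cannot assume that a match in the haystack spans the same number of bytes as the needle.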

Typically, when people perform case-insensitive search, they either:

1. Use simple ASCII case folding (A-Z to a-z), ignoring all other characters.
2. Use pretty much the only major library that supports full Unicode case folding and normalization, ICU (International Components for Unicode).

The first is clearly insufficient, and the second is quite heavy and works at a character level, making SIMD optimizations difficult.
This notebook will focus on more SIMD-vectorizable ideas.

To start, let's pull the most recent Unicode Character Database (UCD) files from the Unicode website.

In [None]:
import sys
from collections import Counter
import unicodedata
from tabulate import tabulate

# Import shared Unicode data loading functions
sys.path.insert(0, ".")
from test_helpers import (
    UNICODE_VERSION,
    get_all_codepoints,
    get_case_folding_rules_as_codepoints,
)

# UTF-8 byte boundaries
UTF8_1BYTE_MAX = 0x7F  # 127 - ASCII range
UTF8_2BYTE_MAX = 0x7FF  # 2047
UTF8_3BYTE_MAX = 0xFFFF  # 65535


def utf8_hex(text):
    """Return UTF-8 hex byte sequence for a string."""
    return " ".join(f"0x{b:02X}" for b in text.encode("utf-8"))


print(f"Using Unicode version: {UNICODE_VERSION}")
all_codepoints = get_all_codepoints(UNICODE_VERSION)

Using Unicode version: 17.0.0
Using cached Unicode 17.0.0 UCD XML: /tmp/ucd-17.0.0.all.flat.xml


[The highest allowed code point in Unicode is `0x10FFFF` or "U+10FFFF"](https://stackoverflow.com/questions/52203351/why-is-unicode-restricted-to-0x10ffff), but it doesn't mean that all code points up to that value are assigned.

- Planes 15-16 (U+F0000 to U+10FFFF) are the Supplementary Private Use Areas and contain no standardized characters.
- Plane 14 (U+E0000 to U+EFFFF), the "Supplementary Special-purpose Plane", contains very few assigned characters.
- Many code points in other planes are also unassigned.
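
Since every plane spans exactly `0x10000` code points, the plane index is just the code point shifted right by 16 bits. The helper below is a small illustrative sketch; `plane_of` is a name we made up for it.

```python
def plane_of(codepoint: int) -> int:
    """Return the Unicode plane (0-16) that contains `codepoint`."""
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError(f"U+{codepoint:X} is outside the Unicode codespace")
    return codepoint >> 16  # each plane covers 0x10000 code points


for cp in (0x0041, 0x1F600, 0xE01EF, 0xF0000, 0x10FFFF):
    print(f"U+{cp:04X} → plane {plane_of(cp)}")
```

U+E01EF is decimal 917,999, which places the highest assigned code point in plane 14.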

In [3]:
print(f"Total assigned codepoints: {len(all_codepoints):,}")
print(f"Highest assigned codepoint: {all_codepoints[-1]:,}")
print(f"Highest possible codepoint: {0x10FFFF:,}")
print(f"Range density: {len(all_codepoints) / (all_codepoints[-1] + 1):.6%}")

Total assigned codepoints: 159,866
Highest assigned codepoint: 917,999
Highest possible codepoint: 1,114,111
Range density: 17.414597%


## Unicode Case Folding Analysis

### Direct Folding Targets

Case folding maps characters to a "folded" form for case-insensitive comparisons.
This is more comprehensive than simple lowercasing - it handles special cases like German ß → ss.
The first question is: how often does each codepoint appear as a folding target of other characters?

The reason we are curious about this is that in simple cases, like the Russian letter "А" (A) and "а" (a), both fold to the same codepoint U+0430 (Cyrillic small letter a).
So when scanning for exact case-insensitive matches, we can just compare each 2-byte UTF-8 slice against just 2 possible values: 0xD090 (U+0410) and 0xD0B0 (U+0430), without actually performing any case folding.
The easiest way to solve a problem is to avoid it altogether!
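
As a scalar illustration of that idea (not the SIMD kernel itself), the sketch below scans a UTF-8 buffer for either of two equally long byte sequences. The function name `find_either` is hypothetical.

```python
def find_either(haystack: bytes, first: bytes, second: bytes) -> list:
    """Byte offsets where `haystack` holds `first` or `second` (equal lengths)."""
    if len(first) != len(second):
        raise ValueError("both candidates must have the same encoded length")
    width = len(first)
    return [
        offset
        for offset in range(len(haystack) - width + 1)
        if haystack[offset : offset + width] in (first, second)
    ]


upper_a = "\u0410".encode("utf-8")  # b"\xD0\x90", CYRILLIC CAPITAL LETTER A
lower_a = "\u0430".encode("utf-8")  # b"\xD0\xB0", CYRILLIC SMALL LETTER A
text = "\u0430\u0410\u0431".encode("utf-8")  # "аАб"
print(find_either(text, upper_a, lower_a))  # [0, 2]
```

Because UTF-8 is self-synchronizing, a lead byte such as `0xD0` never appears in the middle of another character, so comparing at every byte offset yields no false positives.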

In [4]:
case_folds = get_case_folding_rules_as_codepoints(UNICODE_VERSION)
print(f"Total case folding rules: {len(case_folds):,}")

Using cached Unicode 17.0.0 CaseFolding.txt: /tmp/CaseFolding-17.0.0.txt
Total case folding rules: 1,585


In [5]:
target_frequency = Counter()

for source_codepoint, target_codepoints in case_folds.items():
    for target_codepoint in target_codepoints:
        target_frequency[target_codepoint] += 1

print(f"Total folding rules: {sum(target_frequency.values()):,}")
print(f"Unique target codepoints: {len(target_frequency):,}")

Total folding rules: 1,705
Unique target codepoints: 1,462


Let's display the most common folding targets:

In [6]:
rows = []
for codepoint, frequency in target_frequency.most_common():
    try:
        character = chr(codepoint)
        name = unicodedata.name(character, "")
        hex_bytes = utf8_hex(character)
    except (ValueError, OverflowError):
        character = "?"
        name = ""
        hex_bytes = "?"
    rows.append(
        {
            "Codepoint": f"U+{codepoint:04X}",
            "Char": character,
            "UTF-8": hex_bytes,
            "Freq": frequency,
            "Name": name,
        }
    )

print(tabulate(rows, headers="keys", tablefmt="github"))

| Codepoint   | Char   | UTF-8               |   Freq | Name                                                   |
|-------------|--------|---------------------|--------|--------------------------------------------------------|
| U+03B9      | ι      | 0xCE 0xB9           |     71 | GREEK SMALL LETTER IOTA                                |
| U+0342      | ͂       | 0xCD 0x82           |     11 | COMBINING GREEK PERISPOMENI                            |
| U+03C5      | υ      | 0xCF 0x85           |     10 | GREEK SMALL LETTER UPSILON                             |
| U+0066      | f      | 0x66                |      9 | LATIN SMALL LETTER F                                   |
| U+0308      | ̈       | 0xCC 0x88           |      9 | COMBINING DIAERESIS                                    |
| U+0073      | s      | 0x73                |      8 | LATIN SMALL LETTER S                                   |
| U+03C9      | ω      | 0xCF 0x89           |      6 | GREEK SMALL LETTER OMEGA                               |

This suggests that the "GREEK SMALL LETTER IOTA" (U+03B9) is the most common folding target, appearing in the folded form of 71 different codepoints.
The reason for this is historical.
Ancient Greek had a grammatical feature called the "iota subscript" where iota was written as a small subscript beneath vowels (α, η, ω) to indicate certain grammatical forms (dative case, etc.).
When case-folding, these decompose and the subscript iota becomes a regular lowercase iota:

- ᾳ (alpha with ypogegrammeni) → αι
- ῃ (eta with ypogegrammeni) → ηι
- ῳ (omega with ypogegrammeni) → ωι
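
These three expansions can be checked with Python's `str.casefold()`. This is only a quick sanity check and assumes the interpreter's Unicode tables already contain these long-standing rules.

```python
import unicodedata

for character in ("\u1FB3", "\u1FC3", "\u1FF3"):  # ᾳ, ῃ, ῳ
    folded = character.casefold()
    parts = " + ".join(unicodedata.name(c) for c in folded)
    print(f"U+{ord(character):04X} {unicodedata.name(character)} → {parts}")
```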

More importantly, at this point we see that `'f'`, `'s'`, `'i'`, and `'t'` are the most common single-byte UTF-8 folding targets.
Each appears in the folded form of at least 4 different codepoints.
But that doesn't tell the whole story!

### Otherwise Ambiguous Folding Targets

Oftentimes, a character is only one of many characters in the produced folding result.

- `'ﬀ'` → `"ff"` - 3-byte codepoint mapping into 2x 1-byte codepoints
- `'ﬁ'` → `"fi"` - 3-byte codepoint mapping into 2x 1-byte codepoints
- `'ﬂ'` → `"fl"` - 3-byte codepoint mapping into 2x 1-byte codepoints
- `'ﬃ'` → `"ffi"` - 3-byte codepoint mapping into 3x 1-byte codepoints
- `'ﬄ'` → `"ffl"` - 3-byte codepoint mapping into 3x 1-byte codepoints

Let's account for those as well:

In [7]:
direct_target_frequency = Counter()  # Codepoint is the ONLY target of a folding
partial_target_frequency = Counter()  # Codepoint is ONE OF multiple targets in a folding

for source_codepoint, target_codepoints in case_folds.items():
    if len(target_codepoints) == 1:
        # Direct 1:1 folding (e.g., 'A' → 'a')
        direct_target_frequency[target_codepoints[0]] += 1
    else:
        # Multi-codepoint expansion (e.g., 'ﬁ' → 'f', 'i')
        for target_codepoint in target_codepoints:
            partial_target_frequency[target_codepoint] += 1

# Some codepoints may be both direct AND partial targets
both_targets = set(direct_target_frequency.keys()) & set(partial_target_frequency.keys())

print(f"Total folding rules: {len(case_folds):,}")
print(f"  - Direct 1:1 foldings: {sum(1 for t in case_folds.values() if len(t) == 1):,}")
print(f"  - Multi-codepoint expansions: {sum(1 for t in case_folds.values() if len(t) > 1):,}")
print()
print(f"Unique target codepoints:")
print(f"  - Only direct targets: {len(direct_target_frequency - partial_target_frequency):,}")
print(f"  - Only partial targets: {len(partial_target_frequency - direct_target_frequency):,}")
print(f"  - Both direct AND partial: {len(both_targets):,}")

Total folding rules: 1,585
  - Direct 1:1 foldings: 1,481
  - Multi-codepoint expansions: 104

Unique target codepoints:
  - Only direct targets: 1,398
  - Only partial targets: 48
  - Both direct AND partial: 54
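
One caveat when reading these figures: subtracting one `Counter` from another subtracts counts and keeps only positive results, rather than removing shared keys. A codepoint that is both a direct and a partial target therefore still lands in one of the two "only" rows whenever its counts differ, which is why the three rows add up to 1,500 rather than the 1,462 unique targets counted earlier. A strict split by key sets could look like the sketch below. The toy counters are made up for illustration and only loosely mirror the `'s'` row.

```python
from collections import Counter

# Hypothetical toy data: 's' is both a direct and a partial target
direct = Counter({"s": 2, "b": 1})
partial = Counter({"s": 6, "\u0342": 11})

# Counter subtraction compares counts, so 's' survives on the partial side
print(sorted((direct - partial).keys()))  # keys with more direct than partial hits
print(sorted((partial - direct).keys()))  # keys with more partial than direct hits

# Set operations on the keys give the strict buckets
only_direct = direct.keys() - partial.keys()
only_partial = partial.keys() - direct.keys()
both = direct.keys() & partial.keys()
print(sorted(only_direct), sorted(only_partial), sorted(both))
```

With the key-set version, the three buckets are disjoint and their sizes sum to the number of unique targets.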


The following table differentiates complete and partial folding targets:

In [8]:
rows = []
for codepoint, partial_frequency in partial_target_frequency.most_common():
    try:
        character = chr(codepoint)
        hex_bytes = utf8_hex(character)
    except (ValueError, OverflowError):
        character = "?"
        hex_bytes = "?"

    direct_frequency = direct_target_frequency.get(codepoint, 0)

    # Find an example expansion containing this codepoint
    example = ""
    example_hex = ""
    for source_codepoint, target_codepoints in case_folds.items():
        if len(target_codepoints) > 1 and codepoint in target_codepoints:
            try:
                source_character = chr(source_codepoint)
                target_string = "".join(chr(c) for c in target_codepoints)
                example = f"'{source_character}' → \"{target_string}\""
                example_hex = f"{utf8_hex(source_character)} → {utf8_hex(target_string)}"
            except (ValueError, OverflowError):
                example = f"U+{source_codepoint:04X} → {target_codepoints}"
                example_hex = ""
            break

    rows.append(
        {
            "Codepoint": f"U+{codepoint:04X}",
            "Char": character,
            "UTF-8": hex_bytes,
            "Partial": partial_frequency,
            "Direct": direct_frequency,
            "Example": example,
            "Example Hex": example_hex,
        }
    )

print(tabulate(rows, headers="keys", tablefmt="github"))

| Codepoint   | Char   | UTF-8          |   Partial |   Direct | Example    | Example Hex                                    |
|-------------|--------|----------------|-----------|----------|------------|------------------------------------------------|
| U+03B9      | ι      | 0xCE 0xB9      |        68 |        3 | 'ΐ' → "ΐ"  | 0xCE 0x90 → 0xCE 0xB9 0xCC 0x88 0xCC 0x81      |
| U+0342      | ͂       | 0xCD 0x82      |        11 |        0 | 'ὖ' → "ὖ"  | 0xE1 0xBD 0x96 → 0xCF 0x85 0xCC 0x93 0xCD 0x82 |
| U+0308      | ̈       | 0xCC 0x88      |         9 |        0 | 'ΐ' → "ΐ"  | 0xCE 0x90 → 0xCE 0xB9 0xCC 0x88 0xCC 0x81      |
| U+03C5      | υ      | 0xCF 0x85      |         9 |        1 | 'ΰ' → "ΰ"  | 0xCE 0xB0 → 0xCF 0x85 0xCC 0x88 0xCC 0x81      |
| U+0066      | f      | 0x66           |         8 |        1 | 'ﬀ' → "ff" | 0xEF 0xAC 0x80 → 0x66 0x66                     |
| U+0073      | s      | 0x73           |         6 |        2 | 'ß' → "ss" | 0xC3 0x9F → 0x73 0x73                          |

### Safe Single-byte Folding Anchors

Of all those characters, we are most interested in the codepoints representable in just 1 byte in UTF-8, as we can process 64 of them in a `ZMM` register at once.
Those are the boring ASCII letters.
But we can't just apply traditional SIMD ASCII case-insensitive search techniques like:

```c
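// Setting bit 0x20 maps 'A'-'Z' onto 'a'-'z' (it also alters some non-letter bytes)
__m512i lower_mask = _mm512_set1_epi8(0x20);
__m512i input_chunk = _mm512_loadu_si512(input_ptr); // 64 bytes of the haystack
__m512i folded_chunk = _mm512_or_si512(input_chunk, lower_mask);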
```

If the needle contains an `'f'` and the haystack contains an `'ﬃ'`, we would miss the match.
So we must know which single-byte codepoints are folding targets of more than one codepoint.
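
The missed match is easy to reproduce with a byte-at-a-time emulation of the snippet above. This is a sketch for illustration only: `ascii_or_fold` is a hypothetical helper, and the comparison against `str.casefold()` assumes the interpreter's own Unicode tables.

```python
def ascii_or_fold(data: bytes) -> bytes:
    """Emulate the SIMD snippet one byte at a time: set bit 0x20 in every byte."""
    return bytes(byte | 0x20 for byte in data)


needle = ascii_or_fold(b"f")
plain = "oFFice".encode("utf-8")
ligature = "o\uFB03ce".encode("utf-8")  # "oﬃce" written with U+FB03 LATIN SMALL LIGATURE FFI

print(needle in ascii_or_fold(plain))  # True: 'F' (0x46) becomes 'f' (0x66)
print(needle in ascii_or_fold(ligature))  # False: no byte of the ligature equals 0x66
print("f" in "o\uFB03ce".casefold())  # True: full folding expands the ligature to "ffi"
```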

In [9]:
ascii_targets = {}
for codepoint in range(UTF8_1BYTE_MAX + 1):
    direct = direct_target_frequency.get(codepoint, 0)
    partial = partial_target_frequency.get(codepoint, 0)
    total = direct + partial
    if total > 0:
        ascii_targets[codepoint] = {"direct": direct, "partial": partial, "total": total}

# Separate into "safe" (exactly 1 source) vs "ambiguous" (multiple sources)
safe_ascii = {codepoint: info for codepoint, info in ascii_targets.items() if info["total"] == 1}
ambiguous_ascii = {codepoint: info for codepoint, info in ascii_targets.items() if info["total"] > 1}

print(f"Total ASCII targets: {len(ascii_targets)}")
print(f"  - Safe (exactly 1 source): {len(safe_ascii)}")
print(f"  - Ambiguous (multiple sources): {len(ambiguous_ascii)}")

Total ASCII targets: 26
  - Safe (exactly 1 source): 14
  - Ambiguous (multiple sources): 12


The following table shows safe ASCII targets that can use simple SIMD case folding (each has exactly one source):

In [10]:
safe_rows = []
for codepoint in sorted(safe_ascii.keys()):
    character = chr(codepoint)
    for source_codepoint, target_codepoints in case_folds.items():
        if codepoint in target_codepoints:
            source_character = chr(source_codepoint)
            safe_rows.append(
                {
                    "Target": f"'{character}' (U+{codepoint:04X})",
                    "Target Hex": utf8_hex(character),
                    "Source": f"'{source_character}' (U+{source_codepoint:04X})",
                    "Source Hex": utf8_hex(source_character),
                }
            )
            break

print(tabulate(safe_rows, headers="keys", tablefmt="github"))

| Target       | Target Hex   | Source       | Source Hex   |
|--------------|--------------|--------------|--------------|
| 'b' (U+0062) | 0x62         | 'B' (U+0042) | 0x42         |
| 'c' (U+0063) | 0x63         | 'C' (U+0043) | 0x43         |
| 'd' (U+0064) | 0x64         | 'D' (U+0044) | 0x44         |
| 'e' (U+0065) | 0x65         | 'E' (U+0045) | 0x45         |
| 'g' (U+0067) | 0x67         | 'G' (U+0047) | 0x47         |
| 'm' (U+006D) | 0x6D         | 'M' (U+004D) | 0x4D         |
| 'o' (U+006F) | 0x6F         | 'O' (U+004F) | 0x4F         |
| 'p' (U+0070) | 0x70         | 'P' (U+0050) | 0x50         |
| 'q' (U+0071) | 0x71         | 'Q' (U+0051) | 0x51         |
| 'r' (U+0072) | 0x72         | 'R' (U+0052) | 0x52         |
| 'u' (U+0075) | 0x75         | 'U' (U+0055) | 0x55         |
| 'v' (U+0076) | 0x76         | 'V' (U+0056) | 0x56         |
| 'x' (U+0078) | 0x78         | 'X' (U+0058) | 0x58         |
| 'z' (U+007A) | 0x7A         | 'Z' (U+005A) | 0x5A         |


The following table shows ambiguous ASCII targets that need special handling in SIMD (each has multiple sources):

In [11]:
ambiguous_rows = []
for codepoint in sorted(ambiguous_ascii.keys()):
    character = chr(codepoint)
    info = ambiguous_ascii[codepoint]

    # Find all sources
    sources = []
    for source_codepoint, target_codepoints in case_folds.items():
        if codepoint in target_codepoints:
            try:
                source_character = chr(source_codepoint)
                source_hex = utf8_hex(source_character)
                if len(target_codepoints) == 1:
                    sources.append(f"'{source_character}' ({source_hex})")
                else:
                    target_string = "".join(chr(c) for c in target_codepoints)
                    target_hex = utf8_hex(target_string)
                    sources.append(f"'{source_character}'→\"{target_string}\" ({source_hex}→{target_hex})")
            except (ValueError, OverflowError):
                sources.append(f"U+{source_codepoint:04X}")

    sources_string = ", ".join(sources[:4])
    if len(sources) > 4:
        sources_string += f" (+{len(sources)-4} more)"

    ambiguous_rows.append(
        {
            "Char": f"'{character}'",
            "Hex": utf8_hex(character),
            "Direct": info["direct"],
            "Partial": info["partial"],
            "Total": info["total"],
            "Sources": sources_string,
        }
    )

print(tabulate(ambiguous_rows, headers="keys", tablefmt="github"))

| Char   | Hex   |   Direct |   Partial |   Total | Sources                                                                                                                             |
|--------|-------|----------|-----------|---------|-------------------------------------------------------------------------------------------------------------------------------------|
| 'a'    | 0x61  |        1 |         1 |       2 | 'A' (0x41), 'ẚ'→"aʾ" (0xE1 0xBA 0x9A→0x61 0xCA 0xBE)                                                                                |
| 'f'    | 0x66  |        1 |         8 |       9 | 'F' (0x46), 'ﬀ'→"ff" (0xEF 0xAC 0x80→0x66 0x66), 'ﬁ'→"fi" (0xEF 0xAC 0x81→0x66 0x69), 'ﬂ'→"fl" (0xEF 0xAC 0x82→0x66 0x6C) (+2 more) |
| 'h'    | 0x68  |        1 |         1 |       2 | 'H' (0x48), 'ẖ'→"ẖ" (0xE1 0xBA 0x96→0x68 0xCC 0xB1)                                                                                 |
| 'i'    | 0x69  |        1 |         3 |       4 | 'I' (0x49), 'İ'→"

However, even "ambiguous" ASCII characters can be contextually safe based on what follows them in the needle.
For example, `'f'` is ambiguous because of ligatures like `'ﬁ'` → `"fi"`.
But if the needle contains `"fog"`, the `'f'` is safe because no ligature expands to `"fo..."`.
The following analysis identifies when each ambiguous character becomes safe based on its context:

In [12]:
contextual_safety = {}

for codepoint in ambiguous_ascii.keys():
    char = chr(codepoint)
    dangerous_following = set()
    dangerous_preceding = set()
    ligature_examples = []

    # Find all multi-codepoint expansions that include this character
    for source_codepoint, target_codepoints in case_folds.items():
        if len(target_codepoints) > 1:  # Multi-codepoint expansion
            expansion = "".join(chr(c) for c in target_codepoints)

            # Find all positions where our character appears
            for pos, c in enumerate(expansion):
                if ord(c) == codepoint:
                    source_char = chr(source_codepoint)

                    # If not the last character, next char is "dangerous following"
                    if pos < len(expansion) - 1:
                        next_char = expansion[pos + 1]
                        dangerous_following.add(next_char)
                        if len(ligature_examples) < 3:
                            ligature_examples.append(f"'{source_char}'→\"{expansion}\"")

                    # If not the first character, prev char is "dangerous preceding"
                    if pos > 0:
                        prev_char = expansion[pos - 1]
                        dangerous_preceding.add(prev_char)

    if dangerous_following or dangerous_preceding:
        contextual_safety[char] = {
            "dangerous_following": dangerous_following,
            "dangerous_preceding": dangerous_preceding,
            "examples": ligature_examples,
        }

# Build output table
context_rows = []
for char in sorted(contextual_safety.keys()):
    info = contextual_safety[char]
    following = info["dangerous_following"]
    preceding = info["dangerous_preceding"]

    if following:
        following_chars = ", ".join(f"'{c}' ({utf8_hex(c)})" for c in sorted(following))
        safe_following = f"NOT: {following_chars}"
    else:
        safe_following = "any"

    if preceding:
        preceding_chars = ", ".join(f"'{c}' ({utf8_hex(c)})" for c in sorted(preceding))
        safe_preceding = f"NOT: {preceding_chars}"
    else:
        safe_preceding = "any"

    context_rows.append(
        {
            "Char": f"'{char}'",
            "Hex": utf8_hex(char),
            "Safe following": safe_following,
            "Safe preceding": safe_preceding,
            "Examples": ", ".join(info["examples"]),
        }
    )

print(tabulate(context_rows, headers="keys", tablefmt="github"))

| Char   | Hex   | Safe following                          | Safe preceding       | Examples                     |
|--------|-------|-----------------------------------------|----------------------|------------------------------|
| 'a'    | 0x61  | NOT: 'ʾ' (0xCA 0xBE)                    | any                  | 'ẚ'→"aʾ"                     |
| 'f'    | 0x66  | NOT: 'f' (0x66), 'i' (0x69), 'l' (0x6C) | NOT: 'f' (0x66)      | 'ﬀ'→"ff", 'ﬁ'→"fi", 'ﬂ'→"fl" |
| 'h'    | 0x68  | NOT: '̱' (0xCC 0xB1)                     | any                  | 'ẖ'→"ẖ"                      |
| 'i'    | 0x69  | NOT: '̇' (0xCC 0x87)                     | NOT: 'f' (0x66)      | 'İ'→"i̇"                      |
| 'j'    | 0x6A  | NOT: '̌' (0xCC 0x8C)                     | any                  | 'ǰ'→"ǰ"                      |
| 'l'    | 0x6C  | any                                     | NOT: 'f' (0x66)      |                              |
| 'n'    | 0x6E  | any                                     | NOT: 'ʼ' (0xC

Looking at this, if the needle is a contiguous run of `'b'`, `'c'`, `'d'`, `'e'`, `'g'`, `'m'`, `'o'`, `'p'`, `'q'`, `'r'`, `'u'`, `'v'`, `'x'`, `'z'` in any order or case, we can match it with the simple SIMD snippet from above. The shortcut no longer applies once the needle contains `'a'`, `'f'`, `'h'`, `'i'`, `'j'`, `'k'`, `'l'`, `'n'`, `'s'`, `'t'`, `'w'`, or `'y'`.
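
A needle pre-check along these lines could decide up front whether the fast path applies. The sketch below is one possible shape for it, not StringZilla's actual dispatch logic; the names `SAFE_ASCII_LETTERS` and `is_trivially_foldable` are hypothetical, and the letter set is copied from the "safe" table above.

```python
# Lowercase letters that are the folding target of exactly one source codepoint
SAFE_ASCII_LETTERS = frozenset("bcdegmopqruvxz")


def is_trivially_foldable(needle: str) -> bool:
    """True if every character is an ASCII letter from the safe set, in either case."""
    return len(needle) > 0 and all(
        character.isascii() and character.lower() in SAFE_ASCII_LETTERS
        for character in needle
    )


for needle in ("bedroom", "GROOVE", "fog", "office"):
    print(f"{needle!r}: {is_trivially_foldable(needle)}")
```

Needles that fail the check are not necessarily slow to search for. They only need one of the context-aware strategies discussed above.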

Moreover, there is a group of single-byte UTF-8 codepoints that don't participate in any folding mappings at all:

In [13]:
uninvolved_ascii = [
    codepoint
    for codepoint in range(UTF8_1BYTE_MAX + 1)
    if codepoint not in ascii_targets and codepoint not in case_folds
]
print(f"ASCII codepoints completely uninvolved in folding: {len(uninvolved_ascii)}")

control_characters = [codepoint for codepoint in uninvolved_ascii if codepoint < 32 or codepoint == 127]
digits = [chr(codepoint) for codepoint in uninvolved_ascii if chr(codepoint).isdigit()]
punctuation = [
    chr(codepoint) for codepoint in uninvolved_ascii if 32 <= codepoint < 127 and not chr(codepoint).isalnum()
]

print(f"Control characters: {len(control_characters)} (0x00-0x1F, 0x7F)")
print(f"Digits: {''.join(digits)}")
print(f"Punctuation/Symbols: {''.join(punctuation)}")

ASCII codepoints completely uninvolved in folding: 76
Control characters: 33 (0x00-0x1F, 0x7F)
Digits: 0123456789
Punctuation/Symbols:  !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


### Safe Two-byte Folding Anchors

The more interesting and challenging part is the 2-byte UTF-8 codepoints that fold into either another single 2-byte codepoint or two 1-byte codepoints.

In [14]:
two_byte_folds = {}
for source_codepoint, target_codepoints in case_folds.items():
    if UTF8_1BYTE_MAX < source_codepoint <= UTF8_2BYTE_MAX:  # 2-byte UTF-8 range
        two_byte_folds[source_codepoint] = target_codepoints

print(f"2-byte UTF-8 codepoints with case folding: {len(two_byte_folds):,}")
print()

# Categorize by target type
folds_to_1byte = {}  # 2-byte → single 1-byte (e.g., some Latin letters)
folds_to_2byte = {}  # 2-byte → single 2-byte (most common)
folds_to_2x1byte = {}  # 2-byte → two 1-byte codepoints
folds_to_other = {}  # Other patterns

for source_codepoint, target_codepoints in two_byte_folds.items():
    target_sizes = [
        (
            1
            if codepoint <= UTF8_1BYTE_MAX
            else 2 if codepoint <= UTF8_2BYTE_MAX else 3 if codepoint <= UTF8_3BYTE_MAX else 4
        )
        for codepoint in target_codepoints
    ]

    if len(target_codepoints) == 1:
        if target_sizes[0] == 1:
            folds_to_1byte[source_codepoint] = target_codepoints
        elif target_sizes[0] == 2:
            folds_to_2byte[source_codepoint] = target_codepoints
        else:
            folds_to_other[source_codepoint] = target_codepoints
    elif len(target_codepoints) == 2 and all(size == 1 for size in target_sizes):
        folds_to_2x1byte[source_codepoint] = target_codepoints
    else:
        folds_to_other[source_codepoint] = target_codepoints

print(f"Folding patterns for 2-byte UTF-8 sources:")
print(f"  2-byte → 1-byte:     {len(folds_to_1byte):,}")
print(f"  2-byte → 2-byte:     {len(folds_to_2byte):,}")
print(f"  2-byte → 2x 1-byte:  {len(folds_to_2x1byte):,}")
print(f"  Other patterns:      {len(folds_to_other):,}")

2-byte UTF-8 codepoints with case folding: 460

Folding patterns for 2-byte UTF-8 sources:
  2-byte → 1-byte:     1
  2-byte → 2-byte:     450
  2-byte → 2x 1-byte:  1
  Other patterns:      8


Of the 460 case folding rules for 2-byte UTF-8 sources, the vast majority (450) map to another 2-byte codepoint.
The remaining 10 are special cases worth understanding.

2-byte → 1-byte (1 case):

- `'ſ'` (U+017F, LATIN SMALL LETTER LONG S) → `'s'` - historical long S folds to regular ASCII s

2-byte → 2x 1-byte (1 case):

- `'ß'` (U+00DF, LATIN SMALL LETTER SHARP S) → `"ss"` - German eszett expands to two ASCII characters

Other patterns (8 cases) are the tricky edge cases that don't fit clean patterns:

- `'İ'` (U+0130) → `'i'` + combining dot above (1-byte + 2-byte) - Turkish capital I with dot
- `'ŉ'` (U+0149) → modifier apostrophe + `'n'` (2-byte + 1-byte) - deprecated character
- `'ǰ'` (U+01F0) → `'j'` + combining caron (1-byte + 2-byte) - J with caron decomposes
- `'Ⱥ'` (U+023A) → `'ⱥ'` (U+2C65) - 2-byte source maps to 3-byte target!
- `'Ⱦ'` (U+023E) → `'ⱦ'` (U+2C66) - another 2-byte → 3-byte case
- `'ΐ'` (U+0390) → ι + combining diaeresis + combining acute (3x 2-byte) - Greek with diacritics
- `'ΰ'` (U+03B0) → υ + combining diaeresis + combining acute (3x 2-byte) - Greek with diacritics
- `'և'` (U+0587) → ե + ւ (2x 2-byte) - Armenian ligature

The `'Ⱥ'` and `'Ⱦ'` cases are particularly noteworthy: they are 2-byte UTF-8 sources that fold to 3-byte targets, meaning the folded form is longer than the original!
Given the much larger search space of 2-byte codepoints, we want to group the foldings into contiguous source and target ranges wherever possible.

The following table shows continuous ranges of 2-byte UTF-8 codepoints that fold to other 2-byte codepoints with a constant offset (e.g., uppercase → lowercase within the same script block):

In [None]:
sorted_2byte = sorted(folds_to_2byte.items())

ranges = []
if sorted_2byte:
    range_start = sorted_2byte[0][0]
    range_offset = sorted_2byte[0][1][0] - sorted_2byte[0][0]
    prev_source = sorted_2byte[0][0]

    for source_codepoint, target_codepoints in sorted_2byte[1:]:
        target_codepoint = target_codepoints[0]
        current_offset = target_codepoint - source_codepoint

        # Check if this continues the current range (consecutive source AND same offset)
        if source_codepoint == prev_source + 1 and current_offset == range_offset:
            prev_source = source_codepoint
        else:
            # End the current range and start a new one
            ranges.append((range_start, prev_source, range_offset))
            range_start = source_codepoint
            range_offset = current_offset
            prev_source = source_codepoint

    # Don't forget the last range
    ranges.append((range_start, prev_source, range_offset))

print(f"Found {len(ranges)} continuous ranges of 2-byte → 2-byte foldings")
print(f"Ranges of length > 1: {sum(1 for r in ranges if r[1] - r[0] > 0)}")
print(f"Single-codepoint 'ranges': {sum(1 for r in ranges if r[1] == r[0])}")

Found 308 continuous ranges of 2-byte → 2-byte foldings
Ranges of length > 1: 12
Single-codepoint 'ranges': 296


The following table shows multi-codepoint ranges (length > 1) which are useful for SIMD optimization:

In [16]:
# Build table with range information
range_rows = []
for start, end, offset in ranges:
    length = end - start + 1
    try:
        start_char = chr(start)
        end_char = chr(end)
        target_start_char = chr(start + offset)
        target_end_char = chr(end + offset)
        script = unicodedata.name(start_char, "").split()[0] if length > 1 else ""
    except (ValueError, OverflowError):
        start_char = end_char = target_start_char = target_end_char = "?"
        script = ""

    range_rows.append(
        {
            "Src Start": f"U+{start:04X} ({start_char})",
            "Src Start Hex": utf8_hex(start_char),
            "Src End": f"U+{end:04X} ({end_char})",
            "Src End Hex": utf8_hex(end_char),
            "Tgt Start": f"U+{start + offset:04X} ({target_start_char})",
            "Tgt End": f"U+{end + offset:04X} ({target_end_char})",
            "Len": length,
            "Offset": f"+{offset}" if offset > 0 else str(offset),
            "Script": script,
        }
    )

multi_ranges = [r for r in range_rows if r["Len"] > 1]
print(tabulate(multi_ranges, headers="keys", tablefmt="github"))

| Src Start   | Src Start Hex   | Src End    | Src End Hex   | Tgt Start   | Tgt End    |   Len |   Offset | Script   |
|-------------|-----------------|------------|---------------|-------------|------------|-------|----------|----------|
| U+00C0 (À)  | 0xC3 0x80       | U+00D6 (Ö) | 0xC3 0x96     | U+00E0 (à)  | U+00F6 (ö) |    23 |      +32 | LATIN    |
| U+00D8 (Ø)  | 0xC3 0x98       | U+00DE (Þ) | 0xC3 0x9E     | U+00F8 (ø)  | U+00FE (þ) |     7 |      +32 | LATIN    |
| U+0189 (Ɖ)  | 0xC6 0x89       | U+018A (Ɗ) | 0xC6 0x8A     | U+0256 (ɖ)  | U+0257 (ɗ) |     2 |     +205 | LATIN    |
| U+01B1 (Ʊ)  | 0xC6 0xB1       | U+01B2 (Ʋ) | 0xC6 0xB2     | U+028A (ʊ)  | U+028B (ʋ) |     2 |     +217 | LATIN    |
| U+0388 (Έ)  | 0xCE 0x88       | U+038A (Ί) | 0xCE 0x8A     | U+03AD (έ)  | U+03AF (ί) |     3 |      +37 | GREEK    |
| U+038E (Ύ)  | 0xCE 0x8E       | U+038F (Ώ) | 0xCE 0x8F     | U+03CD (ύ)  | U+03CE (ώ) |     2 |      +63 | GREEK    |
| U+0391 (Α)  | 0xCE 0x91       | U+03A1
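
A constant-offset range reduces folding to a range check plus an addition, which is the kind of operation that vectorizes well. The sketch below hard-codes only the first two rows of the table and shows the same rule at the codepoint level and at the UTF-8 byte level. Both function names are hypothetical, and a real implementation would cover the remaining ranges too.

```python
# (first source, last source, offset) triples copied from the first two table rows
LATIN1_RANGES = [
    (0x00C0, 0x00D6, 32),  # À..Ö → à..ö
    (0x00D8, 0x00DE, 32),  # Ø..Þ → ø..þ
]


def fold_codepoint(codepoint, ranges=LATIN1_RANGES):
    """Add the range's offset if `codepoint` falls inside one of `ranges`."""
    for first, last, offset in ranges:
        if first <= codepoint <= last:
            return codepoint + offset
    return codepoint  # outside the listed ranges: returned unchanged


def fold_c3_pair(lead, trail):
    """Byte-level view of the same ranges: lead byte 0xC3 stays, trail byte gains 0x20."""
    if lead == 0xC3 and (0x80 <= trail <= 0x96 or 0x98 <= trail <= 0x9E):
        return lead, trail + 0x20
    return lead, trail


print(chr(fold_codepoint(ord("\u00C9"))))  # É folds to é
print(bytes(fold_c3_pair(*"\u00C9".encode("utf-8"))).decode("utf-8"))
```

For these two ranges the source and target share the lead byte `0xC3`, so the fold touches only the trail byte. Ranges whose offset crosses a lead-byte boundary, such as the +205 and +217 rows, need both bytes rewritten.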

### Three-byte UTF-8 Case Folding

3-byte UTF-8 covers codepoints U+0800 to U+FFFF (2048 to 65535).
This includes many scripts and blocks: Greek Extended, Cherokee, Georgian, and various symbol blocks.

In [17]:
# 3-byte UTF-8 codepoints: U+0800 to U+FFFF (2048 to 65535)
three_byte_folds = {}
for source_codepoint, target_codepoints in case_folds.items():
    if UTF8_2BYTE_MAX < source_codepoint <= UTF8_3BYTE_MAX:
        three_byte_folds[source_codepoint] = target_codepoints

print(f"3-byte UTF-8 codepoints with case folding: {len(three_byte_folds):,}")
print()

# Categorize by target pattern
three_to_3byte = {}  # 3-byte → single 3-byte
three_to_2byte = {}  # 3-byte → single 2-byte (shrinks!)
three_to_1byte = {}  # 3-byte → single 1-byte (shrinks even more!)
three_to_other = {}  # Multi-codepoint or mixed

for source_codepoint, target_codepoints in three_byte_folds.items():
    target_sizes = [
        1 if cp <= UTF8_1BYTE_MAX else 2 if cp <= UTF8_2BYTE_MAX else 3 if cp <= UTF8_3BYTE_MAX else 4
        for cp in target_codepoints
    ]

    if len(target_codepoints) == 1:
        if target_sizes[0] == 3:
            three_to_3byte[source_codepoint] = target_codepoints
        elif target_sizes[0] == 2:
            three_to_2byte[source_codepoint] = target_codepoints
        elif target_sizes[0] == 1:
            three_to_1byte[source_codepoint] = target_codepoints
        else:
            three_to_other[source_codepoint] = target_codepoints
    else:
        three_to_other[source_codepoint] = target_codepoints

print("Folding patterns for 3-byte UTF-8 sources:")
print(f"  3-byte → 3-byte:  {len(three_to_3byte):,}")
print(f"  3-byte → 2-byte:  {len(three_to_2byte):,}")
print(f"  3-byte → 1-byte:  {len(three_to_1byte):,}")
print(f"  Other patterns:   {len(three_to_other):,}")

3-byte UTF-8 codepoints with case folding: 792

Folding patterns for 3-byte UTF-8 sources:
  3-byte → 3-byte:  663
  3-byte → 2-byte:  31
  3-byte → 1-byte:  1
  Other patterns:   97
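
The shrinking and expanding cases are easy to overlook, so here is a small illustration using Python's built-in `str.casefold()`. This is a sketch under one assumption: `str.casefold()` follows the Unicode version bundled with the interpreter, which may differ from the `case_folds` table built earlier in this notebook. The helper `utf8_len` and the `examples` list are hypothetical names. The Kelvin sign is a well-known 3-byte character that folds to a 1-byte ASCII letter, the Ohm and Angstrom signs fold to 2-byte letters, and the capital sharp S expands to two codepoints.

```python
import unicodedata


def utf8_len(text: str) -> int:
    """Number of bytes in the UTF-8 encoding of a string (hypothetical helper)."""
    return len(text.encode("utf-8"))


# Hand-picked 3-byte sources whose folded form changes length in UTF-8
examples = ["\u212A", "\u2126", "\u212B", "\u1E9E"]

for source in examples:
    folded = source.casefold()
    print(
        f"U+{ord(source):04X} {unicodedata.name(source)}: "
        f"{utf8_len(source)} bytes → {folded!r} "
        f"({len(folded)} codepoint(s), {utf8_len(folded)} bytes)"
    )
```

A search kernel that assumes the folded text has the same byte length as the source will misalign on inputs like these, which is why the notebook counts them separately.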


The next cell groups the 3-byte → 3-byte foldings into continuous source ranges that share a constant offset, and counts how many of them span more than one codepoint:

In [18]:
# Find continuous ranges of 3-byte → 3-byte foldings
sorted_3byte = sorted(three_to_3byte.items())

ranges_3byte = []
if sorted_3byte:
    range_start = sorted_3byte[0][0]
    range_offset = sorted_3byte[0][1][0] - sorted_3byte[0][0]
    prev_source = sorted_3byte[0][0]

    for source_codepoint, target_codepoints in sorted_3byte[1:]:
        target_codepoint = target_codepoints[0]
        current_offset = target_codepoint - source_codepoint

        if source_codepoint == prev_source + 1 and current_offset == range_offset:
            prev_source = source_codepoint
        else:
            ranges_3byte.append((range_start, prev_source, range_offset))
            range_start = source_codepoint
            range_offset = current_offset
            prev_source = source_codepoint

    ranges_3byte.append((range_start, prev_source, range_offset))

print(f"Found {len(ranges_3byte)} continuous ranges of 3-byte → 3-byte foldings")
print(f"Ranges of length > 1: {sum(1 for r in ranges_3byte if r[1] - r[0] > 0)}")
print(f"Single-codepoint 'ranges': {sum(1 for r in ranges_3byte if r[1] == r[0])}")

Found 337 continuous ranges of 3-byte → 3-byte foldings
Ranges of length > 1: 24
Single-codepoint 'ranges': 313
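
To make the value of such ranges concrete: each one is a `(start, end, offset)` triple, so folding a codepoint inside it takes one bounds check and one addition, with no table lookup. The scalar sketch below applies two such triples (the Georgian and Greek entries from this section's table of multi-codepoint ranges) and cross-checks them against `str.casefold()`. The names `SAMPLE_RANGES` and `fold_with_ranges` are illustrative and not part of the notebook; a vectorized kernel would presumably perform the same compare-and-add on many lanes at once.

```python
# (start, end, offset) triples; a hypothetical two-entry sample, not the full list
SAMPLE_RANGES = [
    (0x10A0, 0x10C5, 7264),  # Georgian: U+10A0..U+10C5 → U+2D00..U+2D25
    (0x1F08, 0x1F0F, -8),    # Greek:    U+1F08..U+1F0F → U+1F00..U+1F07
]


def fold_with_ranges(codepoint: int, ranges=SAMPLE_RANGES) -> int:
    """Fold one codepoint using constant-offset ranges; anything else passes through."""
    for start, end, offset in ranges:
        if start <= codepoint <= end:
            return codepoint + offset
    return codepoint


# Cross-check every codepoint in the sample ranges against Python's own folding
for start, end, _ in SAMPLE_RANGES:
    for codepoint in range(start, end + 1):
        assert chr(fold_with_ranges(codepoint)) == chr(codepoint).casefold()

print(f"U+{fold_with_ranges(0x10A0):04X}")  # → U+2D00
```

Single-codepoint "ranges" gain nothing from this scheme, since each would need its own comparison.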


The following table lists the multi-codepoint ranges (length > 1), which are the ones useful for SIMD optimization because one range check and one constant offset fold the whole run:

In [19]:
# Build table
range_rows_3byte = []
for start, end, offset in ranges_3byte:
    length = end - start + 1
    try:
        start_char = chr(start)
        end_char = chr(end)
        target_start_char = chr(start + offset)
        target_end_char = chr(end + offset)
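        # An unnamed codepoint (e.g. newer than Python's Unicode database) yields no words,
        # so guard the index instead of letting an uncaught IndexError escape
        name_words = unicodedata.name(start_char, "").split()
        script = name_words[0] if length > 1 and name_words else ""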
    except (ValueError, OverflowError):
        start_char = end_char = target_start_char = target_end_char = "?"
        script = ""

    range_rows_3byte.append(
        {
            "Src Start": f"U+{start:04X} ({start_char})",
            "Src Start Hex": utf8_hex(start_char),
            "Src End": f"U+{end:04X} ({end_char})",
            "Src End Hex": utf8_hex(end_char),
            "Tgt Start": f"U+{start + offset:04X} ({target_start_char})",
            "Tgt End": f"U+{end + offset:04X} ({target_end_char})",
            "Len": length,
            "Offset": f"+{offset}" if offset > 0 else str(offset),
            "Script": script,
        }
    )

multi_ranges_3byte = [r for r in range_rows_3byte if r["Len"] > 1]
print(tabulate(multi_ranges_3byte, headers="keys", tablefmt="github"))

| Src Start   | Src Start Hex   | Src End     | Src End Hex    | Tgt Start   | Tgt End     |   Len |   Offset | Script     |
|-------------|-----------------|-------------|----------------|-------------|-------------|-------|----------|------------|
| U+10A0 (Ⴀ)  | 0xE1 0x82 0xA0  | U+10C5 (Ⴥ)  | 0xE1 0x83 0x85 | U+2D00 (ⴀ)  | U+2D25 (ⴥ)  |    38 |    +7264 | GEORGIAN   |
| U+13F8 (ᏸ)  | 0xE1 0x8F 0xB8  | U+13FD (ᏽ)  | 0xE1 0x8F 0xBD | U+13F0 (Ᏸ)  | U+13F5 (Ᏽ)  |     6 |       -8 | CHEROKEE   |
| U+1C90 (Ა)  | 0xE1 0xB2 0x90  | U+1CBA (Ჺ)  | 0xE1 0xB2 0xBA | U+10D0 (ა)  | U+10FA (ჺ)  |    43 |    -3008 | GEORGIAN   |
| U+1CBD (Ჽ)  | 0xE1 0xB2 0xBD  | U+1CBF (Ჿ)  | 0xE1 0xB2 0xBF | U+10FD (ჽ)  | U+10FF (ჿ)  |     3 |    -3008 | GEORGIAN   |
| U+1F08 (Ἀ)  | 0xE1 0xBC 0x88  | U+1F0F (Ἇ)  | 0xE1 0xBC 0x8F | U+1F00 (ἀ)  | U+1F07 (ἇ)  |     8 |       -8 | GREEK      |
| U+1F18 (Ἐ)  | 0xE1 0xBC 0x98  | U+1F1D (Ἕ)  | 0xE1 0xBC 0x9D | U+1F10 (ἐ)  | U+1F15 (ἕ)  |     6 |       -8 | GREEK      |
