# Unicode Normalization and Search Anchor Analysis

This notebook explores Unicode case folding and normalization properties to identify optimal "anchor points" for case-insensitive and normalization-insensitive string search algorithms.

In [None]:
!python -m pip install -q tabulate 2>/dev/null || curl -sS https://bootstrap.pypa.io/get-pip.py | python && python -m pip install -q tabulate


Before we start, a small reminder on Unicode.
Unicode is a versioned standard.
In 2025, the latest version is Unicode 17.0.
It defines over a million code points (1,114,112 in total), of which roughly 160,000 are assigned characters.
Some of them belong to "bicameral" scripts (like Latin, Greek, Cyrillic) that have distinct uppercase and lowercase forms.
Others belong to "unicameral" scripts (like Chinese, Japanese, Korean, Arabic) that do not have case distinctions.
That does not mean, however, that a script has only one way to represent a given character.
So "case folding" and "normalization" are two different concepts.
We will explore both in this notebook.

Unicode doesn't require UTF-8 encoding either, but UTF-8 is the most popular encoding on the web and in modern applications, and it is the one we will focus on in StringZilla.
In UTF-8, each code point is represented by one, two, three, or four bytes.
A folded or normalized character can map to a sequence of multiple code points, and each of those code points can have a different length representation in UTF-8.
That's why full Unicode processing is disabled in the vast majority of modern text-processing applications.
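
As a rough illustration of that length mismatch, the sketch below uses Python's built-in `str.casefold()` as a stand-in for full case folding. That is an assumption: it follows the Unicode version bundled with the interpreter, which may lag behind the UCD files used later in this notebook. The helper name `utf8_len` is made up for this example.

```python
def utf8_len(text: str) -> int:
    """Number of bytes the string occupies in UTF-8."""
    return len(text.encode("utf-8"))


# 'A', capital sharp S (U+1E9E), Kelvin sign (U+212A), and the 'ffi' ligature (U+FB03)
samples = ["A", "\u1E9E", "\u212A", "\uFB03"]
for source in samples:
    folded = source.casefold()  # stand-in for full Unicode case folding
    print(
        f"U+{ord(source):04X}: {utf8_len(source)} byte(s) -> "
        f"{folded!r}: {len(folded)} codepoint(s), {utf8_len(folded)} byte(s)"
    )
```

Both the number of code points and the number of bytes can change, so a folded haystack generally cannot be compared position-by-position against the original buffer.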

Typically, when people perform case-insensitive search, they either:

1. Use simple ASCII case folding (A-Z to a-z), ignoring all other characters.
2. Use pretty much the only major library that supports full Unicode case folding and normalization, ICU (International Components for Unicode).

The first is clearly insufficient, and the second is quite heavy and works at a character level, making SIMD optimizations difficult.
This notebook will focus on more SIMD-vectorizable ideas.

To start, let's pull the most recent Unicode Character Database (UCD) files from the Unicode website.

In [None]:
import sys
from collections import Counter
import unicodedata
from tabulate import tabulate

# Import shared Unicode data loading functions
sys.path.insert(0, ".")
from test_helpers import (
    UNICODE_VERSION,
    get_all_codepoints,
    get_case_folding_rules_as_codepoints,
)

# UTF-8 byte boundaries
UTF8_1BYTE_MAX = 0x7F  # 127 - ASCII range
UTF8_2BYTE_MAX = 0x7FF  # 2047
UTF8_3BYTE_MAX = 0xFFFF  # 65535


def utf8_hex(text):
    """Return UTF-8 hex byte sequence for a string."""
    return " ".join(f"0x{b:02X}" for b in text.encode("utf-8"))


print(f"Using Unicode version: {UNICODE_VERSION}")
all_codepoints = get_all_codepoints(UNICODE_VERSION)

[The highest allowed code point in Unicode is `0x10FFFF` or "U+10FFFF"](https://stackoverflow.com/questions/52203351/why-is-unicode-restricted-to-0x10ffff), but it doesn't mean that all code points up to that value are assigned.

- Planes 15-16 (U+F0000 to U+10FFFF) are reserved for "Private Use Area" and do not contain assigned characters.
- Plane 14 (U+E0000 to U+EFFFF), the "Supplementary Special-purpose Plane", contains very few assigned characters, all of them at the start of the plane.
- Many code points in other planes are also unassigned.
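
The plane boundaries follow from simple arithmetic: each plane spans `0x10000` code points, so the plane number is the code point shifted right by 16 bits. A small sketch (the helper name `plane_of` is hypothetical):

```python
def plane_of(codepoint: int) -> int:
    """Each Unicode plane spans 0x10000 code points, so the plane is the high bits."""
    return codepoint >> 16


for cp in (0x0041, 0xFFFF, 0xE0000, 0xEFFFF, 0xF0000, 0x10FFFF):
    print(f"U+{cp:04X} is in plane {plane_of(cp)}")
```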

In [None]:
print(f"Total assigned codepoints: {len(all_codepoints):,}")
print(f"Highest assigned codepoint: {all_codepoints[-1]:,}")
print(f"Highest possible codepoint: {0x10FFFF:,}")
print(f"Range density: {len(all_codepoints) / (all_codepoints[-1] + 1):.6%}")

## Unicode Case Folding Analysis

### Direct Folding Targets

Case folding maps characters to a "folded" form for case-insensitive comparisons.
This is more comprehensive than simple lowercasing - it handles special cases like German ß → ss.
The very first thing we are interested in is: how often each codepoint becomes a folding target for other characters?

The reason we are curious about this is that in simple cases, like the Russian letter "А" (A) and "а" (a), both fold to the same codepoint U+0430 (Cyrillic small letter a).
So when scanning for exact case-insensitive matches, we can just compare each 2-byte UTF-8 slice against just 2 possible values: 0xD090 (U+0410) and 0xD0B0 (U+0430), without actually performing any case folding.
The easiest way to solve the problem is to avoid it altogether!
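
A minimal scalar sketch of that idea, assuming the haystack is valid UTF-8. The helper name `find_cyrillic_a` is made up for this illustration, and a real implementation would compare many positions at once in a SIMD register instead of slicing in a loop.

```python
def find_cyrillic_a(haystack: bytes) -> list[int]:
    """Byte offsets where either case of the Cyrillic letter A starts."""
    candidates = (b"\xD0\x90", b"\xD0\xB0")  # U+0410 and U+0430 in UTF-8
    return [i for i in range(len(haystack) - 1) if haystack[i : i + 2] in candidates]


text = "Аа бА"  # Cyrillic letters, not Latin look-alikes
print(find_cyrillic_a(text.encode("utf-8")))
```

Because `0xD0` can only appear as a lead byte in valid UTF-8, a two-byte match can never straddle two neighboring characters.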

In [None]:
case_folds = get_case_folding_rules_as_codepoints(UNICODE_VERSION)
print(f"Total case folding rules: {len(case_folds):,}")

In [None]:
target_frequency = Counter()

for source_codepoint, target_codepoints in case_folds.items():
    for target_codepoint in target_codepoints:
        target_frequency[target_codepoint] += 1

print(f"Total target codepoints across all rules: {sum(target_frequency.values()):,}")
print(f"Unique target codepoints: {len(target_frequency):,}")

Let's display the most common folding targets:

In [None]:
rows = []
for codepoint, frequency in target_frequency.most_common():
    try:
        character = chr(codepoint)
        name = unicodedata.name(character, "")
        hex_bytes = utf8_hex(character)
    except (ValueError, OverflowError):
        character = "?"
        name = ""
        hex_bytes = "?"
    rows.append(
        {
            "Codepoint": f"U+{codepoint:04X}",
            "Char": character,
            "UTF-8": hex_bytes,
            "Freq": frequency,
            "Name": name,
        }
    )

print(tabulate(rows, headers="keys", tablefmt="github"))

This suggests that the "GREEK SMALL LETTER IOTA" (U+03B9) is the most common folding target, being the folded form of 71 different codepoints.
The reason for this is historical.
Ancient Greek had a grammatical feature called the "iota subscript" where iota was written as a small subscript beneath vowels (α, η, ω) to indicate certain grammatical forms (dative case, etc.).
When case-folding, these decompose and the subscript iota becomes a regular lowercase iota:

- ᾳ (alpha with ypogegrammeni) → αι
- ῃ (eta with ypogegrammeni) → ηι
- ῳ (omega with ypogegrammeni) → ωι
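
These three mappings can be reproduced with Python's built-in `str.casefold()`, used here as a stand-in for the folding table loaded above. This assumes the interpreter's bundled Unicode data agrees with the notebook's UCD version for these long-established characters.

```python
iota_subscript_vowels = ["\u1FB3", "\u1FC3", "\u1FF3"]  # ᾳ, ῃ, ῳ

for source in iota_subscript_vowels:
    folded = source.casefold()
    folded_codepoints = " ".join(f"U+{ord(ch):04X}" for ch in folded)
    print(f"{source} (U+{ord(source):04X}) -> {folded} ({folded_codepoints})")
```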

More importantly, at this point we see that `'f'`, `'s'`, `'i'`, `'t'` are the most common single-byte UTF-8 folding targets.
Each appears at least 4 times across the folding rules.
But that doesn't tell the whole story!

### Otherwise Ambiguous Folding Targets

Oftentimes, a character is only one of many characters in the produced folding result.

- `'ﬀ'` → `"ff"` - 3-byte codepoint mapping into 2x 1-byte codepoints
- `'ﬁ'` → `"fi"` - 3-byte codepoint mapping into 2x 1-byte codepoints
- `'ﬂ'` → `"fl"` - 3-byte codepoint mapping into 2x 1-byte codepoints
- `'ﬃ'` → `"ffi"` - 3-byte codepoint mapping into 3x 1-byte codepoints
- `'ﬄ'` → `"ffl"` - 3-byte codepoint mapping into 3x 1-byte codepoints

Let's account for those as well:

In [None]:
direct_target_frequency = Counter()  # Codepoint is the ONLY target of a folding
partial_target_frequency = Counter()  # Codepoint is ONE OF multiple targets in a folding

for source_codepoint, target_codepoints in case_folds.items():
    if len(target_codepoints) == 1:
        # Direct 1:1 folding (e.g., 'A' → 'a')
        direct_target_frequency[target_codepoints[0]] += 1
    else:
        # Multi-codepoint expansion (e.g., 'ﬁ' → 'f', 'i')
        for target_codepoint in target_codepoints:
            partial_target_frequency[target_codepoint] += 1

# Some codepoints may be both direct AND partial targets
both_targets = set(direct_target_frequency.keys()) & set(partial_target_frequency.keys())

print(f"Total folding rules: {len(case_folds):,}")
print(f"  - Direct 1:1 foldings: {sum(1 for t in case_folds.values() if len(t) == 1):,}")
print(f"  - Multi-codepoint expansions: {sum(1 for t in case_folds.values() if len(t) > 1):,}")
print()
print(f"Unique target codepoints:")
print(f"  - Only direct targets: {len(set(direct_target_frequency) - set(partial_target_frequency)):,}")
print(f"  - Only partial targets: {len(set(partial_target_frequency) - set(direct_target_frequency)):,}")
print(f"  - Both direct AND partial: {len(both_targets):,}")

The following table differentiates direct and partial folding targets:

In [None]:
rows = []
for codepoint, partial_frequency in partial_target_frequency.most_common():
    try:
        character = chr(codepoint)
        hex_bytes = utf8_hex(character)
    except (ValueError, OverflowError):
        character = "?"
        hex_bytes = "?"

    direct_frequency = direct_target_frequency.get(codepoint, 0)

    # Find an example expansion containing this codepoint
    example = ""
    example_hex = ""
    for source_codepoint, target_codepoints in case_folds.items():
        if len(target_codepoints) > 1 and codepoint in target_codepoints:
            try:
                source_character = chr(source_codepoint)
                target_string = "".join(chr(c) for c in target_codepoints)
                example = f"'{source_character}' → \"{target_string}\""
                example_hex = f"{utf8_hex(source_character)} → {utf8_hex(target_string)}"
            except (ValueError, OverflowError):
                example = f"U+{source_codepoint:04X} → {target_codepoints}"
                example_hex = ""
            break

    rows.append(
        {
            "Codepoint": f"U+{codepoint:04X}",
            "Char": character,
            "UTF-8": hex_bytes,
            "Partial": partial_frequency,
            "Direct": direct_frequency,
            "Example": example,
            "Example Hex": example_hex,
        }
    )

print(tabulate(rows, headers="keys", tablefmt="github"))

### Safe Single-byte Folding Anchors

Of all those characters, we are most interested in the codepoints representable in just 1 byte in UTF-8, as we can process 64 of them in a `ZMM` register at once.
Those are the boring ASCII letters.
But we can't just apply traditional SIMD ASCII case-insensitive search techniques like:

```c
__m512i lower_mask = _mm512_set1_epi8(0x20);
__m512i input_chunk = _mm512_loadu_si512(input_ptr);
__m512i folded_chunk = _mm512_or_si512(input_chunk, lower_mask);
```

If the needle contains an `'f'` and the haystack contains an `'ﬃ'`, we would miss the match.
So we must know which of the single-byte codepoints are folding targets of multiple codepoints.
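
The missed match is easy to reproduce in scalar Python. The sketch below mirrors the OR-with-`0x20` trick byte by byte and compares it against `str.casefold()`, which stands in for full case folding here; the helper name `ascii_or_fold` is hypothetical.

```python
def ascii_or_fold(data: bytes) -> bytes:
    """Scalar equivalent of OR-ing every byte with 0x20, as in the SIMD snippet."""
    return bytes(b | 0x20 for b in data)


haystack = "O\uFB03CE hours"  # 'O', the 'ffi' ligature, 'C', 'E', ...
needle = "office"

naive_hit = ascii_or_fold(needle.encode("utf-8")) in ascii_or_fold(haystack.encode("utf-8"))
full_hit = needle.casefold() in haystack.casefold()

print(f"ASCII OR trick finds the needle: {naive_hit}")
print(f"Full case folding finds the needle: {full_hit}")
```

Note that the OR trick also rewrites non-letter bytes, including the continuation bytes of the ligature, which is one more reason it only works on pure ASCII letters.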

In [None]:
ascii_targets = {}
for codepoint in range(UTF8_1BYTE_MAX + 1):
    direct = direct_target_frequency.get(codepoint, 0)
    partial = partial_target_frequency.get(codepoint, 0)
    total = direct + partial
    if total > 0:
        ascii_targets[codepoint] = {"direct": direct, "partial": partial, "total": total}

# Separate into "safe" (exactly 1 source) vs "ambiguous" (multiple sources)
safe_ascii = {codepoint: info for codepoint, info in ascii_targets.items() if info["total"] == 1}
ambiguous_ascii = {codepoint: info for codepoint, info in ascii_targets.items() if info["total"] > 1}

print(f"Total ASCII targets: {len(ascii_targets)}")
print(f"  - Safe (exactly 1 source): {len(safe_ascii)}")
print(f"  - Ambiguous (multiple sources): {len(ambiguous_ascii)}")

The following table shows safe ASCII targets that can use simple SIMD case folding (each has exactly one source):

In [None]:
safe_rows = []
for codepoint in sorted(safe_ascii.keys()):
    character = chr(codepoint)
    for source_codepoint, target_codepoints in case_folds.items():
        if codepoint in target_codepoints:
            source_character = chr(source_codepoint)
            safe_rows.append(
                {
                    "Target": f"'{character}' (U+{codepoint:04X})",
                    "Target Hex": utf8_hex(character),
                    "Source": f"'{source_character}' (U+{source_codepoint:04X})",
                    "Source Hex": utf8_hex(source_character),
                }
            )
            break

print(tabulate(safe_rows, headers="keys", tablefmt="github"))

The following table shows ambiguous ASCII targets that need special handling in SIMD (each has multiple sources):

In [None]:
ambiguous_rows = []
for codepoint in sorted(ambiguous_ascii.keys()):
    character = chr(codepoint)
    info = ambiguous_ascii[codepoint]

    # Find all sources
    sources = []
    for source_codepoint, target_codepoints in case_folds.items():
        if codepoint in target_codepoints:
            try:
                source_character = chr(source_codepoint)
                source_hex = utf8_hex(source_character)
                if len(target_codepoints) == 1:
                    sources.append(f"'{source_character}' ({source_hex})")
                else:
                    target_string = "".join(chr(c) for c in target_codepoints)
                    target_hex = utf8_hex(target_string)
                    sources.append(f"'{source_character}'→\"{target_string}\" ({source_hex}→{target_hex})")
            except (ValueError, OverflowError):
                sources.append(f"U+{source_codepoint:04X}")

    sources_string = ", ".join(sources[:4])
    if len(sources) > 4:
        sources_string += f" (+{len(sources)-4} more)"

    ambiguous_rows.append(
        {
            "Char": f"'{character}'",
            "Hex": utf8_hex(character),
            "Direct": info["direct"],
            "Partial": info["partial"],
            "Total": info["total"],
            "Sources": sources_string,
        }
    )

print(tabulate(ambiguous_rows, headers="keys", tablefmt="github"))

However, even "ambiguous" ASCII characters can be contextually safe based on what follows them in the needle.
For example, `'f'` is ambiguous because of ligatures like `'ﬁ'` → `"fi"`.
But if the needle contains `"fog"`, the `'f'` is safe because no ligature expands to `"fo..."`.
The following analysis identifies when each ambiguous character becomes safe based on its context:

In [None]:
contextual_safety = {}

for codepoint in ambiguous_ascii.keys():
    char = chr(codepoint)
    dangerous_following = set()
    dangerous_preceding = set()
    ligature_examples = []

    # Find all multi-codepoint expansions that include this character
    for source_codepoint, target_codepoints in case_folds.items():
        if len(target_codepoints) > 1:  # Multi-codepoint expansion
            expansion = "".join(chr(c) for c in target_codepoints)

            # Find all positions where our character appears
            for pos, c in enumerate(expansion):
                if ord(c) == codepoint:
                    source_char = chr(source_codepoint)

                    # If not the last character, next char is "dangerous following"
                    if pos < len(expansion) - 1:
                        next_char = expansion[pos + 1]
                        dangerous_following.add(next_char)
                        if len(ligature_examples) < 3:
                            ligature_examples.append(f"'{source_char}'→\"{expansion}\"")

                    # If not the first character, prev char is "dangerous preceding"
                    if pos > 0:
                        prev_char = expansion[pos - 1]
                        dangerous_preceding.add(prev_char)

    if dangerous_following or dangerous_preceding:
        contextual_safety[char] = {
            "dangerous_following": dangerous_following,
            "dangerous_preceding": dangerous_preceding,
            "examples": ligature_examples,
        }

# Build output table
context_rows = []
for char in sorted(contextual_safety.keys()):
    info = contextual_safety[char]
    following = info["dangerous_following"]
    preceding = info["dangerous_preceding"]

    if following:
        following_chars = ", ".join(f"'{c}' ({utf8_hex(c)})" for c in sorted(following))
        safe_following = f"NOT: {following_chars}"
    else:
        safe_following = "any"

    if preceding:
        preceding_chars = ", ".join(f"'{c}' ({utf8_hex(c)})" for c in sorted(preceding))
        safe_preceding = f"NOT: {preceding_chars}"
    else:
        safe_preceding = "any"

    context_rows.append(
        {
            "Char": f"'{char}'",
            "Hex": utf8_hex(char),
            "Safe following": safe_following,
            "Safe preceding": safe_preceding,
            "Examples": ", ".join(info["examples"]),
        }
    )

print(tabulate(context_rows, headers="keys", tablefmt="github"))

Looking at this, if the needle is a continuous sequence of the letters `'b'`, `'c'`, `'d'`, `'e'`, `'g'`, `'m'`, `'o'`, `'p'`, `'q'`, `'r'`, `'u'`, `'v'`, `'x'`, `'z'`, in any order or case, we can trivially match it using the simple SIMD snippet from above.
That holds only as long as the needle doesn't contain `'a'`, `'f'`, `'h'`, `'i'`, `'j'`, `'k'`, `'l'`, `'n'`, `'s'`, `'t'`, `'w'`, or `'y'`.
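
As a sketch, the context-free version of this rule fits in a few lines. The set below is copied from the sentence above, and the function name is hypothetical. It is deliberately stricter than the contextual table: a needle like `"fog"` is rejected even though its `'f'` is contextually safe.

```python
SAFE_ASCII_LETTERS = frozenset("bcdegmopqruvxz")  # the 14 unambiguous letters listed above


def needle_allows_ascii_fast_path(needle: str) -> bool:
    """True if every character is one of the unambiguous ASCII letters, in either case."""
    return len(needle) > 0 and all(
        ch.isascii() and ch.lower() in SAFE_ASCII_LETTERS for ch in needle
    )


for candidate in ("CoDe", "fog", "zebra"):
    print(f"{candidate!r}: {needle_allows_ascii_fast_path(candidate)}")
```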

Moreover, there is a group of single-byte UTF-8 codepoints that don't participate in any folding mappings at all:

In [None]:
uninvolved_ascii = [
    codepoint
    for codepoint in range(UTF8_1BYTE_MAX + 1)
    if codepoint not in ascii_targets and codepoint not in case_folds
]
print(f"ASCII codepoints completely uninvolved in folding: {len(uninvolved_ascii)}")

control_characters = [codepoint for codepoint in uninvolved_ascii if codepoint < 32 or codepoint == 127]
digits = [chr(codepoint) for codepoint in uninvolved_ascii if chr(codepoint).isdigit()]
punctuation = [
    chr(codepoint) for codepoint in uninvolved_ascii if 32 <= codepoint < 127 and not chr(codepoint).isalnum()
]

print(f"Control characters: {len(control_characters)} (0x00-0x1F, 0x7F)")
print(f"Digits: {''.join(digits)}")
print(f"Punctuation/Symbols: {''.join(punctuation)}")

### Safe Two-byte Folding Anchors

The more interesting and challenging part is the 2-byte UTF-8 codepoints, most of which fold into either another single 2-byte codepoint or into 1-byte codepoints.

In [None]:
two_byte_folds = {}
for source_codepoint, target_codepoints in case_folds.items():
    if UTF8_1BYTE_MAX < source_codepoint <= UTF8_2BYTE_MAX:  # 2-byte UTF-8 range
        two_byte_folds[source_codepoint] = target_codepoints

print(f"2-byte UTF-8 codepoints with case folding: {len(two_byte_folds):,}")
print()

# Categorize by target type
folds_to_1byte = {}  # 2-byte → single 1-byte (e.g., some Latin letters)
folds_to_2byte = {}  # 2-byte → single 2-byte (most common)
folds_to_2x1byte = {}  # 2-byte → two 1-byte codepoints
folds_to_other = {}  # Other patterns

for source_codepoint, target_codepoints in two_byte_folds.items():
    target_sizes = [
        (
            1
            if codepoint <= UTF8_1BYTE_MAX
            else 2 if codepoint <= UTF8_2BYTE_MAX else 3 if codepoint <= UTF8_3BYTE_MAX else 4
        )
        for codepoint in target_codepoints
    ]

    if len(target_codepoints) == 1:
        if target_sizes[0] == 1:
            folds_to_1byte[source_codepoint] = target_codepoints
        elif target_sizes[0] == 2:
            folds_to_2byte[source_codepoint] = target_codepoints
        else:
            folds_to_other[source_codepoint] = target_codepoints
    elif len(target_codepoints) == 2 and all(size == 1 for size in target_sizes):
        folds_to_2x1byte[source_codepoint] = target_codepoints
    else:
        folds_to_other[source_codepoint] = target_codepoints

print(f"Folding patterns for 2-byte UTF-8 sources:")
print(f"  2-byte → 1-byte:     {len(folds_to_1byte):,}")
print(f"  2-byte → 2-byte:     {len(folds_to_2byte):,}")
print(f"  2-byte → 2x 1-byte:  {len(folds_to_2x1byte):,}")
print(f"  Other patterns:      {len(folds_to_other):,}")

Of the 460 case folding rules for 2-byte UTF-8 sources, the vast majority (450) map to another 2-byte codepoint.
The remaining 10 are special cases worth understanding.

2-byte → 1-byte (1 case):

- `'ſ'` (U+017F, LATIN SMALL LETTER LONG S) → `'s'` - historical long S folds to regular ASCII s

2-byte → 2x 1-byte (1 case):

- `'ß'` (U+00DF, LATIN SMALL LETTER SHARP S) → `"ss"` - German eszett expands to two ASCII characters

Other patterns (8 cases) are the tricky edge cases that don't fit clean patterns:

- `'İ'` (U+0130) → `'i'` + combining dot above (1-byte + 2-byte) - Turkish capital I with dot
- `'ŉ'` (U+0149) → modifier apostrophe + `'n'` (2-byte + 1-byte) - deprecated character
- `'ǰ'` (U+01F0) → `'j'` + combining caron (1-byte + 2-byte) - J with caron decomposes
- `'Ⱥ'` (U+023A) → `'ⱥ'` (U+2C65) - 2-byte source maps to 3-byte target!
- `'Ⱦ'` (U+023E) → `'ⱦ'` (U+2C66) - another 2-byte → 3-byte case
- `'ΐ'` (U+0390) → ι + combining diaeresis + combining acute (3x 2-byte) - Greek with diacritics
- `'ΰ'` (U+03B0) → υ + combining diaeresis + combining acute (3x 2-byte) - Greek with diacritics
- `'և'` (U+0587) → ե + ւ (2x 2-byte) - Armenian ligature
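
The ten special cases above can be cross-checked against Python's built-in `str.casefold()`. Treat this as a sanity check under the assumption that the interpreter's Unicode data matches the notebook's UCD version for these characters; `utf8_sizes` is a hypothetical helper.

```python
special_sources = [
    "\u017F", "\u00DF",  # long s, sharp s
    "\u0130", "\u0149", "\u01F0",  # I with dot above, n with apostrophe, j with caron
    "\u023A", "\u023E",  # A with stroke, T with diagonal stroke
    "\u0390", "\u03B0",  # Greek iota / upsilon with dialytika and tonos
    "\u0587",  # Armenian ligature ech yiwn
]


def utf8_sizes(text: str) -> list[int]:
    """UTF-8 byte length of each codepoint in the string."""
    return [len(ch.encode("utf-8")) for ch in text]


for source in special_sources:
    folded = source.casefold()
    codepoints = " ".join(f"U+{ord(ch):04X}" for ch in folded)
    print(f"U+{ord(source):04X} ({utf8_sizes(source)[0]} bytes) -> {codepoints}, bytes per codepoint: {utf8_sizes(folded)}")
```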

The `'Ⱥ'` and `'Ⱦ'` cases are particularly noteworthy: they are 2-byte UTF-8 sources that fold to 3-byte targets, meaning the folded form is longer than the original!
Given the much larger search space of 2-byte codepoints, we want to group the 2-byte → 2-byte foldings into continuous source/target ranges wherever possible.

The following table shows continuous ranges of 2-byte UTF-8 codepoints that fold to other 2-byte codepoints with a constant offset (e.g., uppercase → lowercase within the same script block):

In [None]:
sorted_2byte = sorted(folds_to_2byte.items())

ranges = []
if sorted_2byte:
    range_start = sorted_2byte[0][0]
    range_offset = sorted_2byte[0][1][0] - sorted_2byte[0][0]
    prev_source = sorted_2byte[0][0]

    for source_codepoint, target_codepoints in sorted_2byte[1:]:
        target_codepoint = target_codepoints[0]
        current_offset = target_codepoint - source_codepoint

        # Check if this continues the current range (consecutive source AND same offset)
        if source_codepoint == prev_source + 1 and current_offset == range_offset:
            prev_source = source_codepoint
        else:
            # End the current range and start a new one
            ranges.append((range_start, prev_source, range_offset))
            range_start = source_codepoint
            range_offset = current_offset
            prev_source = source_codepoint

    # Don't forget the last range
    ranges.append((range_start, prev_source, range_offset))

print(f"Found {len(ranges)} continuous ranges of 2-byte → 2-byte foldings")
print(f"Ranges of length > 1: {sum(1 for r in ranges if r[1] - r[0] > 0)}")
print(f"Single-codepoint 'ranges': {sum(1 for r in ranges if r[1] == r[0])}")

The following table shows multi-codepoint ranges (length > 1) which are useful for SIMD optimization:

In [None]:
# Build table with range information
range_rows = []
for start, end, offset in ranges:
    length = end - start + 1
    try:
        start_char = chr(start)
        end_char = chr(end)
        target_start_char = chr(start + offset)
        target_end_char = chr(end + offset)
        script = (unicodedata.name(start_char, "").split() or [""])[0] if length > 1 else ""
    except (ValueError, OverflowError):
        start_char = end_char = target_start_char = target_end_char = "?"
        script = ""

    range_rows.append(
        {
            "Src Start": f"U+{start:04X} ({start_char})",
            "Src Start Hex": utf8_hex(start_char),
            "Src End": f"U+{end:04X} ({end_char})",
            "Src End Hex": utf8_hex(end_char),
            "Tgt Start": f"U+{start + offset:04X} ({target_start_char})",
            "Tgt End": f"U+{end + offset:04X} ({target_end_char})",
            "Len": length,
            "Offset": f"+{offset}" if offset > 0 else str(offset),
            "Script": script,
        }
    )

multi_ranges = [r for r in range_rows if r["Len"] > 1]
print(tabulate(multi_ranges, headers="keys", tablefmt="github"))

### Three-byte UTF-8 Case Folding

3-byte UTF-8 covers codepoints U+0800 to U+FFFF (2048 to 65535).
This includes many scripts: Extended Greek, Cherokee, Georgian, and various symbol blocks.

In [None]:
# 3-byte UTF-8 codepoints: U+0800 to U+FFFF (2048 to 65535)
three_byte_folds = {}
for source_codepoint, target_codepoints in case_folds.items():
    if UTF8_2BYTE_MAX < source_codepoint <= UTF8_3BYTE_MAX:
        three_byte_folds[source_codepoint] = target_codepoints

print(f"3-byte UTF-8 codepoints with case folding: {len(three_byte_folds):,}")
print()

# Categorize by target pattern
three_to_3byte = {}  # 3-byte → single 3-byte
three_to_2byte = {}  # 3-byte → single 2-byte (shrinks!)
three_to_1byte = {}  # 3-byte → single 1-byte
three_to_other = {}  # Multi-codepoint or mixed

for source_codepoint, target_codepoints in three_byte_folds.items():
    target_sizes = [
        1 if cp <= UTF8_1BYTE_MAX else 2 if cp <= UTF8_2BYTE_MAX else 3 if cp <= UTF8_3BYTE_MAX else 4
        for cp in target_codepoints
    ]

    if len(target_codepoints) == 1:
        if target_sizes[0] == 3:
            three_to_3byte[source_codepoint] = target_codepoints
        elif target_sizes[0] == 2:
            three_to_2byte[source_codepoint] = target_codepoints
        elif target_sizes[0] == 1:
            three_to_1byte[source_codepoint] = target_codepoints
        else:
            three_to_other[source_codepoint] = target_codepoints
    else:
        three_to_other[source_codepoint] = target_codepoints

print(f"Folding patterns for 3-byte UTF-8 sources:")
print(f"  3-byte → 3-byte:  {len(three_to_3byte):,}")
print(f"  3-byte → 2-byte:  {len(three_to_2byte):,}")
print(f"  3-byte → 1-byte:  {len(three_to_1byte):,}")
print(f"  Other patterns:   {len(three_to_other):,}")

The following table shows continuous ranges of 3-byte UTF-8 codepoints that fold to other 3-byte codepoints:

In [None]:
# Find continuous ranges of 3-byte → 3-byte foldings
sorted_3byte = sorted(three_to_3byte.items())

ranges_3byte = []
if sorted_3byte:
    range_start = sorted_3byte[0][0]
    range_offset = sorted_3byte[0][1][0] - sorted_3byte[0][0]
    prev_source = sorted_3byte[0][0]

    for source_codepoint, target_codepoints in sorted_3byte[1:]:
        target_codepoint = target_codepoints[0]
        current_offset = target_codepoint - source_codepoint

        if source_codepoint == prev_source + 1 and current_offset == range_offset:
            prev_source = source_codepoint
        else:
            ranges_3byte.append((range_start, prev_source, range_offset))
            range_start = source_codepoint
            range_offset = current_offset
            prev_source = source_codepoint

    ranges_3byte.append((range_start, prev_source, range_offset))

print(f"Found {len(ranges_3byte)} continuous ranges of 3-byte → 3-byte foldings")
print(f"Ranges of length > 1: {sum(1 for r in ranges_3byte if r[1] - r[0] > 0)}")
print(f"Single-codepoint 'ranges': {sum(1 for r in ranges_3byte if r[1] == r[0])}")

The following table shows multi-codepoint ranges (length > 1) which are useful for SIMD optimization:

In [None]:
# Build table
range_rows_3byte = []
for start, end, offset in ranges_3byte:
    length = end - start + 1
    try:
        start_char = chr(start)
        end_char = chr(end)
        target_start_char = chr(start + offset)
        target_end_char = chr(end + offset)
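        # An empty name (codepoint newer than Python's unicodedata) would make split()[0] raise IndexError
        script = (unicodedata.name(start_char, "").split() or [""])[0] if length > 1 else ""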
    except (ValueError, OverflowError):
        start_char = end_char = target_start_char = target_end_char = "?"
        script = ""

    range_rows_3byte.append(
        {
            "Src Start": f"U+{start:04X} ({start_char})",
            "Src Start Hex": utf8_hex(start_char),
            "Src End": f"U+{end:04X} ({end_char})",
            "Src End Hex": utf8_hex(end_char),
            "Tgt Start": f"U+{start + offset:04X} ({target_start_char})",
            "Tgt End": f"U+{end + offset:04X} ({target_end_char})",
            "Len": length,
            "Offset": f"+{offset}" if offset > 0 else str(offset),
            "Script": script,
        }
    )

multi_ranges_3byte = [r for r in range_rows_3byte if r["Len"] > 1]
print(tabulate(multi_ranges_3byte, headers="keys", tablefmt="github"))

## Script-by-Script Analysis

Now that we understand the general structure of Unicode case folding, let's dive into specific scripts.
For each script, we'll answer these questions:

1. **What UTF-8 byte patterns identify this script?** (Lead bytes)
2. **Are there any multi-character expansions?** (Characters that become multiple codepoints)
3. **What are the safe ranges for SIMD fast paths?**
4. **What contextual safety rules apply?**

We'll start with the "easiest" scripts (simple 1:1 mappings) and work toward the more complex ones.

### Cyrillic (U+0400-U+04FF)

Cyrillic is the writing system for Russian, Ukrainian, Bulgarian, Serbian, and many other languages.
It's one of the most SIMD-friendly scripts because:

- **All case folding is 1:1** - no multi-character expansions
- **Predictable offsets** - uppercase letters map to lowercase with fixed offsets
- **Compact UTF-8** - all characters fit in 2 bytes (lead bytes 0xD0 through 0xD3, with the basic alphabet under 0xD0 and 0xD1)

Let's verify this:

In [None]:
# Cyrillic range: U+0400 to U+04FF (main block)
CYRILLIC_START = 0x0400
CYRILLIC_END = 0x04FF

# Extract Cyrillic case folding rules
cyrillic_folds = {}
cyrillic_expansions = {}  # Multi-character expansions
cyrillic_simple = {}      # Simple 1:1 mappings

for source_codepoint, target_codepoints in case_folds.items():
    if CYRILLIC_START <= source_codepoint <= CYRILLIC_END:
        cyrillic_folds[source_codepoint] = target_codepoints
        if len(target_codepoints) == 1:
            cyrillic_simple[source_codepoint] = target_codepoints[0]
        else:
            cyrillic_expansions[source_codepoint] = target_codepoints

print(f"Cyrillic case folding rules: {len(cyrillic_folds)}")
print(f"  Simple 1:1 mappings: {len(cyrillic_simple)}")
print(f"  Multi-character expansions: {len(cyrillic_expansions)}")

if cyrillic_expansions:
    print("\n⚠️ Found expansions:")
    for source, targets in cyrillic_expansions.items():
        print(f"  '{chr(source)}' → {''.join(chr(t) for t in targets)}")
else:
    print("\n✓ No multi-character expansions! Perfect for SIMD.")

Now let's see the actual mapping pattern.
In Cyrillic, uppercase letters have predictable relationships to lowercase:

- U+0410-U+042F (А-Я) → U+0430-U+044F (а-я) with offset +32
- U+0400-U+040F (Ѐ-Џ) → U+0450-U+045F (ѐ-џ) with offset +80
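
Both offsets can be confirmed independently of the loaded folding table. The check below relies on Python's built-in `str.casefold()`, which is an assumption about the interpreter's Unicode data rather than the notebook's UCD files.

```python
# Set of (folded - source) differences over each uppercase range
main_offsets = {ord(chr(cp).casefold()) - cp for cp in range(0x0410, 0x0430)}
extension_offsets = {ord(chr(cp).casefold()) - cp for cp in range(0x0400, 0x0410)}

print(f"U+0410-U+042F offsets: {sorted(main_offsets)}")
print(f"U+0400-U+040F offsets: {sorted(extension_offsets)}")
```

A single-element set means the whole range shares one constant offset, which is exactly the property the range tables above look for.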

In [None]:
# Show Cyrillic alphabet with case mappings
print("Main Cyrillic alphabet (А-Я → а-я):")
print()

rows = []
for cp in range(0x0410, 0x0430):  # А to Я
    upper = chr(cp)
    lower = chr(cp + 32)  # Known offset
    upper_utf8 = ' '.join(f'{b:02X}' for b in upper.encode('utf-8'))
    lower_utf8 = ' '.join(f'{b:02X}' for b in lower.encode('utf-8'))
    rows.append({
        'Upper': upper,
        'Lower': lower,
        'Upper UTF-8': upper_utf8,
        'Lower UTF-8': lower_utf8,
        'Offset': '+32'
    })

print(tabulate(rows[:16], headers='keys', tablefmt='github'))  # First 16
print("...")
print(tabulate(rows[-6:], headers='keys', tablefmt='github'))  # Last 6

Notice the UTF-8 pattern:
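- Lead byte 0xD0 covers U+0400-U+043F
- Lead byte 0xD1 covers U+0440-U+047F
- Lead bytes 0xD2 and 0xD3 cover the rest of the block, U+0480-U+04FF

This means when we see `0xD0` or `0xD1` as a lead byte, we know we're in Cyrillic territory.
For А-П the case folding is just arithmetic on the second byte; for Р-Я and Ѐ-Џ the lead byte also changes from 0xD0 to 0xD1.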

**SIMD Strategy for Cyrillic:**
1. Detect lead bytes 0xD0/0xD1 in 64-byte chunks
2. For uppercase ranges, add offset to second byte
3. Handle lead byte transitions (some uppercase chars cross from 0xD0 to 0xD1)
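
The three steps above can be sketched in scalar Python before committing to intrinsics. This is an illustration under stated assumptions, not the StringZilla kernel: it assumes valid UTF-8 input, covers only the uppercase ranges U+0400-U+042F, leaves ASCII and every other script untouched, and the function name is made up.

```python
def fold_basic_cyrillic_utf8(data: bytes) -> bytes:
    """Fold U+0400-U+042F to lowercase directly on UTF-8 bytes."""
    out = bytearray(data)
    i = 0
    while i + 1 < len(out):
        if out[i] != 0xD0:  # all uppercase letters handled here start with 0xD0
            i += 1
            continue
        second = out[i + 1]
        if 0x80 <= second <= 0x8F:  # U+0400-U+040F -> U+0450-U+045F
            out[i], out[i + 1] = 0xD1, second + 0x10
        elif 0x90 <= second <= 0x9F:  # U+0410-U+041F -> U+0430-U+043F, same lead byte
            out[i + 1] = second + 0x20
        elif 0xA0 <= second <= 0xAF:  # U+0420-U+042F -> U+0440-U+044F
            out[i], out[i + 1] = 0xD1, second - 0x20
        i += 2
    return bytes(out)


print(fold_basic_cyrillic_utf8("Привет, Мир".encode("utf-8")).decode("utf-8"))
```

Only the middle range keeps its lead byte; the other two are the "lead byte transitions" from step 3, where the codepoint offset carries into the first byte.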

### Greek (U+0370-U+03FF, U+1F00-U+1FFF)

Greek is more interesting than Cyrillic because of historical features:

- **Final sigma (ς)**: Greek uses different forms of sigma at word-end vs. middle
- **Polytonic orthography**: Ancient/formal Greek uses multiple diacritics (accents, breathings)
- **Iota subscript**: A small iota written beneath vowels in certain grammatical forms
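
The final sigma matters for search because case folding, unlike lowercasing, is context-free: both lowercase forms fold to the same codepoint. A small check with Python's built-in `str.casefold()` (assumed here to match the notebook's folding table for these characters):

```python
word_upper = "ΟΔΥΣΣΕΥΣ"
word_lower = "οδυσσευς"  # ends with final sigma (U+03C2)

for sigma in ("Σ", "σ", "ς"):
    folded = sigma.casefold()
    print(f"U+{ord(sigma):04X} -> U+{ord(folded):04X}")

print(f"Folded forms match: {word_upper.casefold() == word_lower.casefold()}")
```

In UTF-8 terms, a byte-level matcher looking for `σ` (0xCF 0x83) also has to accept `ς` (0xCF 0x82) and `Σ` (0xCE 0xA3).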

Basic Greek (U+0370-U+03FF) is 2-byte UTF-8: the letters Α-Ω and α-ω use lead bytes 0xCE/0xCF, while the first 16 codepoints of the block (U+0370-U+037F) use 0xCD.
Extended Greek (U+1F00-U+1FFF) is 3-byte UTF-8 for polytonic characters.
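
Lead bytes are easy to get wrong by one, so it helps to compute them. The hypothetical helper below collects the distinct first bytes of a codepoint range; the alphabet itself sits under 0xCE/0xCF, and the block's first 16 codepoints (U+0370-U+037F) add 0xCD.

```python
def lead_bytes(start: int, end: int) -> list[int]:
    """Distinct UTF-8 lead bytes used by codepoints in [start, end]."""
    return sorted({chr(cp).encode("utf-8")[0] for cp in range(start, end + 1)})


for label, start, end in [
    ("Basic Greek", 0x0370, 0x03FF),
    ("Greek letters Α-ω", 0x0391, 0x03C9),
    ("Extended Greek", 0x1F00, 0x1FFF),
]:
    formatted = ", ".join(f"0x{b:02X}" for b in lead_bytes(start, end))
    print(f"{label} (U+{start:04X}-U+{end:04X}): {formatted}")
```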

In [None]:
# Greek ranges
GREEK_BASIC_START = 0x0370
GREEK_BASIC_END = 0x03FF
GREEK_EXTENDED_START = 0x1F00
GREEK_EXTENDED_END = 0x1FFF

# Extract Greek case folding rules
greek_folds = {}
greek_expansions = {}
greek_simple = {}

for source_codepoint, target_codepoints in case_folds.items():
    if (GREEK_BASIC_START <= source_codepoint <= GREEK_BASIC_END or 
        GREEK_EXTENDED_START <= source_codepoint <= GREEK_EXTENDED_END):
        greek_folds[source_codepoint] = target_codepoints
        if len(target_codepoints) == 1:
            greek_simple[source_codepoint] = target_codepoints[0]
        else:
            greek_expansions[source_codepoint] = target_codepoints

print(f"Greek case folding rules: {len(greek_folds)}")
print(f"  Basic Greek (U+0370-U+03FF): {sum(1 for cp in greek_folds if cp <= GREEK_BASIC_END)}")
print(f"  Extended Greek (U+1F00-U+1FFF): {sum(1 for cp in greek_folds if cp >= GREEK_EXTENDED_START)}")
print()
print(f"Simple 1:1 mappings: {len(greek_simple)}")
print(f"Multi-character expansions: {len(greek_expansions)}")

That's a lot of expansions!
Let's understand why by looking at some examples:

In [None]:
print("Greek multi-character expansions (first 15):")
print()

rows = []
for source, targets in list(greek_expansions.items())[:15]:
    source_char = chr(source)
    target_str = ''.join(chr(t) for t in targets)
    source_name = unicodedata.name(source_char, '?')
    rows.append({
        'Char': source_char,
        'Codepoint': f'U+{source:04X}',
        'Folds to': target_str,
        'Length': len(targets),
        'Name': source_name[:40] + '...' if len(source_name) > 40 else source_name
    })

print(tabulate(rows, headers='keys', tablefmt='github'))
print(f"\n... and {len(greek_expansions) - 15} more")

The expansions all involve **iota subscript** (ypogegrammeni) or **diacritics**.
When case-folded, these decompose:

- `ᾼ` (capital alpha with prosgegrammeni, the uppercase counterpart of ypogegrammeni) → `αι` (alpha + iota)
- `ΐ` (iota with dialytika and tonos) → `ι` + combining marks
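
These two expansions can be checked without the `case_folds` table, since Python's built-in `str.casefold()` applies Unicode full case folding. A small sketch follows; the helper name `show_fold` is ours, for illustration only.

```python
import unicodedata


def show_fold(char: str) -> str:
    """Describe the full case folding of one character as a list of code points."""
    folded = char.casefold()
    parts = ", ".join(f"U+{ord(c):04X} {unicodedata.name(c, '?')}" for c in folded)
    return f"U+{ord(char):04X} -> [{parts}]"


for char in ("\u1FBC", "\u0390"):  # ᾼ and ΐ
    print(show_fold(char))

# Expansions change the UTF-8 length, so byte offsets in folded text
# do not map 1:1 back to byte offsets in the original text.
before = len("\u1FBC".encode("utf-8"))
after = len("\u1FBC".casefold().encode("utf-8"))
print(f"UTF-8 length: {before} -> {after} bytes")
```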

But here's the good news: **basic Greek letters (Α-Ω) have simple 1:1 mappings!**

In [None]:
# Show the simple Greek alphabet
print("Basic Greek alphabet (Α-Ω → α-ω):")
print()

# Greek uppercase: U+0391-U+03A9 (with gap at U+03A2)
# Greek lowercase: U+03B1-U+03C9
greek_upper = 'ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ'
greek_lower = 'αβγδεζηθικλμνξοπρστυφχψω'

rows = []
for u, l in zip(greek_upper, greek_lower):
    rows.append({
        'Upper': u,
        'Lower': l,
        'Upper UTF-8': ' '.join(f'{b:02X}' for b in u.encode('utf-8')),
        'Lower UTF-8': ' '.join(f'{b:02X}' for b in l.encode('utf-8'))
    })

print(tabulate(rows, headers='keys', tablefmt='github'))

**The Sigma Problem**

Greek has three forms of sigma:
- `Σ` (U+03A3) - uppercase sigma
- `σ` (U+03C3) - lowercase sigma (middle of word)
- `ς` (U+03C2) - lowercase final sigma (end of word)

For case-insensitive matching, how does this work?

In [None]:
# Sigma case folding
sigma_chars = {
    0x03A3: ('Σ', 'CAPITAL'),
    0x03C3: ('σ', 'SMALL'),
    0x03C2: ('ς', 'SMALL FINAL')
}

print("Sigma variants and their case folding:")
print()

for cp, (char, desc) in sigma_chars.items():
    if cp in case_folds:
        target = case_folds[cp][0]
        target_char = chr(target)
        print(f"  {char} ({desc}, U+{cp:04X}) → {target_char} (U+{target:04X})")
    else:
        print(f"  {char} ({desc}, U+{cp:04X}) → (no folding rule, is target)")

print()
print("✓ Both Σ and ς fold to σ!")
print("  This means σ is the 'anchor' - when searching for 'σ',")
print("  we match both 'Σ' and 'ς' automatically.")
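
The same anchoring can be observed with Python's built-in `str.casefold()`, independent of the `case_folds` table. In this minimal sketch, `caseless_find` is a hypothetical helper that folds both sides before a plain substring search. It returns an offset into the folded haystack, which equals the original offset only when no expansions occur.

```python
def caseless_find(haystack: str, needle: str) -> int:
    """Find needle in haystack after case-folding both; returns -1 if absent."""
    return haystack.casefold().find(needle.casefold())


# All three sigma forms share one folded form.
assert "\u03A3".casefold() == "\u03C2".casefold() == "\u03C3"  # Σ, ς -> σ

word_upper = "\u039A\u038C\u03A3\u039C\u039F\u03A3"  # ΚΌΣΜΟΣ
word_lower = "\u03BA\u03CC\u03C3\u03BC\u03BF\u03C2"  # κόσμος (ends in final sigma)

# The uppercase needle matches the lowercase word despite the final sigma.
print(caseless_find("\u03BF " + word_lower, word_upper))
```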

### Armenian (U+0530-U+058F)

Armenian is a bicameral script with 38 letters.
It's nearly as SIMD-friendly as Cyrillic, with just one exception:
the ligature **և** (ech-yiwn).

UTF-8: 2-byte sequences with lead bytes 0xD4, 0xD5, 0xD6.

In [None]:
# Armenian range
ARMENIAN_START = 0x0530
ARMENIAN_END = 0x058F

# Extract Armenian case folding rules
armenian_folds = {}
armenian_expansions = {}
armenian_simple = {}

for source_codepoint, target_codepoints in case_folds.items():
    if ARMENIAN_START <= source_codepoint <= ARMENIAN_END:
        armenian_folds[source_codepoint] = target_codepoints
        if len(target_codepoints) == 1:
            armenian_simple[source_codepoint] = target_codepoints[0]
        else:
            armenian_expansions[source_codepoint] = target_codepoints

print(f"Armenian case folding rules: {len(armenian_folds)}")
print(f"  Simple 1:1 mappings: {len(armenian_simple)}")
print(f"  Multi-character expansions: {len(armenian_expansions)}")

if armenian_expansions:
    print("\nThe expansion(s):")
    for source, targets in armenian_expansions.items():
        source_char = chr(source)
        target_str = ''.join(chr(t) for t in targets)
        print(f"  '{source_char}' (U+{source:04X}) → \"{target_str}\"")
        print(f"  {unicodedata.name(source_char, '?')}")
        print(f"  UTF-8: {utf8_hex(source_char)} → {utf8_hex(target_str)}")

The Armenian ligature **և** is the only complication.
When case-folding, it expands to **եւ** (ech + yiwn).

This is similar to German ß → ss, but less common in practice.
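
For a byte-oriented search, what matters is whether an expansion changes the UTF-8 length. The sketch below uses Python's built-in `str.casefold()` (not the notebook's `case_folds` table) to compare lengths before and after folding; `fold_growth` is an illustrative name.

```python
def fold_growth(char: str) -> tuple:
    """Return (UTF-8 bytes before folding, UTF-8 bytes after folding)."""
    return len(char.encode("utf-8")), len(char.casefold().encode("utf-8"))


for char in ("\u0587", "\u00DF", "\u0531"):  # և, ß, Ա
    before, after = fold_growth(char)
    print(f"U+{ord(char):04X}: {before} -> {after} bytes")
```

The ligature doubles from 2 to 4 bytes, whereas ß → ss stays at 2 bytes, so the Armenian case is the one that shifts byte offsets.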

### Georgian (U+10A0-U+10FF, U+1C90-U+1CBF, U+2D00-U+2D2F)

Georgian is fascinating because Unicode encodes **four** sets of letters for it, three historical writing systems plus a recently added uppercase:

1. **Asomtavruli** (U+10A0-U+10C5) - ancient capitals, now the uppercase of Nuskhuri
2. **Nuskhuri** (U+2D00-U+2D2F) - medieval lowercase (rarely used)
3. **Mkhedruli** (U+10D0-U+10FF) - modern lowercase (most common)
4. **Mtavruli** (U+1C90-U+1CBF) - modern uppercase (added in Unicode 11.0)

For most of its history, Georgian was **unicameral** (no case distinction).
Uppercase/lowercase was only formalized recently!

UTF-8: All Georgian is 3-byte (lead byte 0xE1 or 0xE2).
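
Mtavruli and Mkhedruli sit at a constant distance of 0x0BC0 code points, so in UTF-8 the modern uppercase can be folded by swapping a single byte: U+1C90-U+1CBF encodes as `E1 B2 xx` and U+10D0-U+10FF as `E1 83 xx`, with the same third byte. The sketch below is our own illustration of that observation; `fold_mtavruli_utf8` is a hypothetical helper that assumes valid UTF-8 input and does not handle Asomtavruli.

```python
def fold_mtavruli_utf8(data: bytes) -> bytes:
    """Fold Mtavruli (U+1C90-U+1CBF) to Mkhedruli by rewriting the second byte."""
    out = bytearray(data)
    for i in range(len(out) - 2):
        # E1 B2 90..BF is the Mtavruli block; E1 B2 80..8F belongs to another block.
        if out[i] == 0xE1 and out[i + 1] == 0xB2 and 0x90 <= out[i + 2] <= 0xBF:
            out[i + 1] = 0x83  # E1 83 90..BF is Mkhedruli
    return bytes(out)


sample = "".join(chr(cp) for cp in range(0x1C90, 0x1C93))  # first three Mtavruli letters
folded = fold_mtavruli_utf8(sample.encode("utf-8")).decode("utf-8")
print(" ".join(f"U+{ord(c):04X}" for c in folded))
```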

In [None]:
# Georgian ranges
GEORGIAN_ASOMTAVRULI = (0x10A0, 0x10C5)  # Ancient capitals
GEORGIAN_MKHEDRULI = (0x10D0, 0x10FF)    # Modern lowercase
GEORGIAN_MTAVRULI = (0x1C90, 0x1CBF)     # Modern uppercase
GEORGIAN_NUSKHURI = (0x2D00, 0x2D2F)     # Medieval lowercase

# Extract Georgian case folding rules
georgian_folds = {}
for source_codepoint, target_codepoints in case_folds.items():
    if any(start <= source_codepoint <= end for start, end in 
           [GEORGIAN_ASOMTAVRULI, GEORGIAN_MKHEDRULI, GEORGIAN_MTAVRULI, GEORGIAN_NUSKHURI]):
        georgian_folds[source_codepoint] = target_codepoints

print(f"Georgian case folding rules: {len(georgian_folds)}")
print()

# Check for expansions
georgian_expansions = {k: v for k, v in georgian_folds.items() if len(v) > 1}
print(f"Multi-character expansions: {len(georgian_expansions)}")

if not georgian_expansions:
    print("✓ All Georgian case folding is simple 1:1 mappings!")

In [None]:
# Show Georgian alphabet examples
print("Georgian case mapping examples:")
print()

# Mtavruli (modern uppercase) → Mkhedruli (modern lowercase)
print("Mtavruli (modern uppercase) → Mkhedruli (modern lowercase):")
rows = []
for cp in range(0x1C90, 0x1C90 + 8):  # First 8
    if cp in georgian_folds:
        upper = chr(cp)
        lower = chr(georgian_folds[cp][0])
        rows.append({
            'Upper': upper,
            'Lower': lower,
            'Upper UTF-8': utf8_hex(upper),
            'Lower UTF-8': utf8_hex(lower)
        })

print(tabulate(rows, headers='keys', tablefmt='github'))
print()
print("Notice: Georgian letters are გ ბ ა - they look quite different from Latin!")

### Cherokee (U+13A0-U+13FF, U+AB70-U+ABBF)

Cherokee is **unique** among writing systems:
its lowercase letters fold **to uppercase** (opposite of every other script!).

This happened because Cherokee lowercase was added to Unicode later,
after the uppercase forms were already established.

- Uppercase: U+13A0-U+13F5 (original Cherokee syllabary)
- Lowercase: U+AB70-U+ABBF (the "Cherokee Supplement" block) plus six letters at U+13F8-U+13FD in the main block, all added in Unicode 8.0

UTF-8: 3-byte sequences (lead bytes 0xE1 and 0xEA).
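
The direction of the fold is easy to confirm with Python's built-in string methods, assuming a Python build with Unicode 8.0 or newer data. This short sketch is illustrative only: it contrasts `str.casefold()` with `str.lower()`, which is where Cherokee differs from the other bicameral scripts.

```python
small_a = "\uAB70"    # CHEROKEE SMALL LETTER A
capital_a = "\u13A0"  # CHEROKEE LETTER A

# Case folding sends the small letter to the original (uppercase) syllable...
folded = small_a.casefold()
print(f"casefold: U+{ord(small_a):04X} -> U+{ord(folded):04X}")

# ...while lowercasing goes in the opposite direction.
lowered = capital_a.lower()
print(f"lower:    U+{ord(capital_a):04X} -> U+{ord(lowered):04X}")
```

For Cherokee text, lowercasing and case folding therefore produce different strings, so a search has to pick one consistently for both needle and haystack.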

In [None]:
# Cherokee ranges
CHEROKEE_UPPER = (0x13A0, 0x13F5)
CHEROKEE_LOWER = (0xAB70, 0xABBF)
CHEROKEE_LOWER_MAIN = (0x13F8, 0x13FD)  # Six small letters kept in the main block

# Extract Cherokee case folding
cherokee_folds = {}
for source_codepoint, target_codepoints in case_folds.items():
    if any(start <= source_codepoint <= end for start, end in
           [CHEROKEE_UPPER, CHEROKEE_LOWER, CHEROKEE_LOWER_MAIN]):
        cherokee_folds[source_codepoint] = target_codepoints

print(f"Cherokee case folding rules: {len(cherokee_folds)}")
print()

# Show examples
print("Cherokee lowercase → uppercase (unusual!):")
rows = []
for cp in range(0xAB70, 0xAB70 + 6):  # First 6
    if cp in cherokee_folds:
        lower = chr(cp)
        upper = chr(cherokee_folds[cp][0])
        rows.append({
            'Lower': lower,
            'Upper (fold target)': upper,
            'Lower UTF-8': utf8_hex(lower),
            'Upper UTF-8': utf8_hex(upper)
        })

print(tabulate(rows, headers='keys', tablefmt='github'))
print()
print("Notice: The 'fold target' is UPPERCASE, not lowercase!")
print("Cherokee is the only script where case folding goes lowercase → uppercase.")

## Unicameral Scripts (No Case Distinction)

Many of the world's writing systems don't have uppercase/lowercase.
These "unicameral" scripts are **caseless** - they have no case folding rules at all.

This is great for SIMD optimization!
When searching in caseless text, we can use fast binary comparison
instead of case folding.

Major unicameral scripts include:
- **CJK** (Chinese, Japanese Kanji, Korean Hanja)
- **Arabic** and **Hebrew**
- **Thai**, **Devanagari** (Hindi), and other Indic scripts
- **Japanese Hiragana/Katakana**
- **Korean Hangul**

In [None]:
# Check that these scripts have no case folding
unicameral_ranges = {
    'CJK Unified': (0x4E00, 0x9FFF),
    'Hiragana': (0x3040, 0x309F),
    'Katakana': (0x30A0, 0x30FF),
    'Hangul Syllables': (0xAC00, 0xD7AF),
    'Arabic': (0x0600, 0x06FF),
    'Hebrew': (0x0590, 0x05FF),
    'Thai': (0x0E00, 0x0E7F),
    'Devanagari': (0x0900, 0x097F),
}

print("Unicameral script verification:")
print()

rows = []
for name, (start, end) in unicameral_ranges.items():
    # Count codepoints with case folding rules
    with_folding = sum(1 for cp in range(start, end + 1) if cp in case_folds)
    total = end - start + 1
    
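    # Get sample characters: the first three letters found in the range
    # (several blocks start with unassigned code points, marks, or format characters)
    letters = (chr(cp) for cp in range(start, end + 1)
               if unicodedata.category(chr(cp)).startswith('L'))
    samples = ''.join(next(letters, '') for _ in range(3))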
    
    rows.append({
        'Script': name,
        'Range': f'U+{start:04X}-U+{end:04X}',
        'Sample': samples,
        'Case Folding Rules': with_folding,
        'Caseless?': '✓' if with_folding == 0 else f'✗ ({with_folding})'
    })

print(tabulate(rows, headers='keys', tablefmt='github'))

For case-insensitive search with unicameral needles,
we can bypass case folding entirely and use fast binary `memcmp`.

The function `sz_utf8_is_fully_caseless_` in StringZilla detects this:
- Scan the needle for bicameral characters
- If none found, use fast path (binary search)
- If found, use case-folded search (slower)
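
The three steps above can be sketched in Python. This is not StringZilla's implementation of `sz_utf8_is_fully_caseless_`: the names `is_fully_caseless` and `pick_search_path` are hypothetical, and the check relies on Python's built-in case conversions rather than the `case_folds` table.

```python
def is_fully_caseless(needle: str) -> bool:
    """True when no character changes under folding, lowercasing, or uppercasing."""
    return all(
        ch == ch.casefold() == ch.lower() == ch.upper()
        for ch in needle
    )


def pick_search_path(needle: str) -> str:
    """Choose between a plain byte comparison and a case-folded search."""
    return "binary" if is_fully_caseless(needle) else "case-folded"


for needle in ("中文", "2025-01-01", "Hello", "hello"):
    print(f"{needle!r}: {pick_search_path(needle)}")
```

Checking `upper()` as well as `casefold()` matters: a lowercase letter such as `h` is its own fold target, yet a haystack may contain `H`, so the needle still needs the case-folded path.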

## Practical Search Examples

Let's see how all this theory applies to real searches.
For each example, we'll show:
1. The search needle
2. What it case-folds to
3. What optimization path StringZilla uses

In [None]:
def analyze_needle(needle):
    """Analyze a search needle for case-folding behavior."""
    print(f"Needle: '{needle}'")
    print(f"UTF-8:  {utf8_hex(needle)}")
    print()
    
    # Case fold each character
    folded_chars = []
    has_bicameral = False
    has_expansion = False
    
    for char in needle:
        cp = ord(char)
        if cp in case_folds:
            targets = case_folds[cp]
            folded = ''.join(chr(t) for t in targets)
            folded_chars.append(folded)
            has_bicameral = True
            if len(targets) > 1:
                has_expansion = True
                print(f"  '{char}' (U+{cp:04X}) → \"{folded}\" (expansion!)")
            else:
                print(f"  '{char}' (U+{cp:04X}) → '{folded}'")
        else:
            folded_chars.append(char)
            # Check if it's a lowercase that doesn't fold
            cat = unicodedata.category(char)
            if cat == 'Ll':  # Lowercase letter
                has_bicameral = True
                print(f"  '{char}' (U+{cp:04X}) → '{char}' (lowercase, no change)")
    
    folded_str = ''.join(folded_chars)
    print()
    print(f"Folded: '{folded_str}'")
    print(f"UTF-8:  {utf8_hex(folded_str)}")
    print()
    
    if not has_bicameral:
        print("Optimization: FAST PATH (caseless needle, binary search)")
    elif has_expansion:
        print("Optimization: SLOW PATH (has expansions, must case-fold)")
    else:
        print("Optimization: SIMD FAST PATH (simple 1:1 folding)")
    print()

In [None]:
# Example 1: Pure ASCII
print("="*60)
print("Example 1: ASCII needle")
print("="*60)
analyze_needle("Hello World")

In [None]:
# Example 2: Cyrillic
print("="*60)
print("Example 2: Cyrillic needle (Russian 'Hello')")
print("="*60)
analyze_needle("ПРИВЕТ")  # Russian for 'hello'

In [None]:
# Example 3: German with eszett
print("="*60)
print("Example 3: German with ß (expands to 'ss')")
print("="*60)
analyze_needle("straße")  # German for 'street'

In [None]:
# Example 4: CJK (caseless)
print("="*60)
print("Example 4: Chinese (caseless, fast path!)")
print("="*60)
analyze_needle("中文")  # Chinese for 'Chinese'

In [None]:
# Example 5: Greek with word-final sigma
print("="*60)
print("Example 5: Greek with word-final sigma")
print("="*60)
analyze_needle("ΚΌΣΜΟΣ")  # Greek for 'world'; the last Σ is written ς in lowercase, and both fold to σ