# Unicode Normalization and Search Anchor Analysis

This notebook explores Unicode case folding and normalization properties to identify optimal "anchor points" for case-insensitive and normalization-insensitive string search algorithms.

Before we start, a small reminder on Unicode.
Unicode is a versioned standard.
In 2025, the latest version is Unicode 17.0.
It defines over a million code points, of which around 160,000 are assigned characters.
Some of them belong to "bicameral" scripts (like Latin, Greek, Cyrillic) that have distinct uppercase and lowercase forms.
Others belong to "unicameral" scripts (like Chinese, Japanese, Korean, Arabic) that do not have case distinctions.
That doesn't mean, however, that every character in such scripts has a single representation: the same character can often be encoded in several ways.
That is why "case folding" and "normalization" are two separate concepts.
We will explore both in this notebook.
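
To make the distinction concrete, here is a minimal sketch using only Python's standard library. It assumes the interpreter's bundled Unicode tables, which may be older than Unicode 17.0 but agree on these long-established characters.

```python
import unicodedata

precomposed = "\u00e9"   # "é" stored as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Normalization unifies two spellings of the same character...
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
# ...while case folding alone leaves them different.
same_after_fold = decomposed.casefold() == precomposed.casefold()
# Case folding unifies upper and lower case instead.
upper_matches_lower = "\u00c9".casefold() == precomposed.casefold()

print(same_after_nfc, same_after_fold, upper_matches_lower)  # True False True
```

A search that wants to be insensitive to both has to apply both transformations, in a well-defined order.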

Unicode also doesn't require UTF-8 encoding, but UTF-8 is the most popular encoding on the web and in modern applications and the one we will focus on in StringZilla.
In UTF-8, each code point is represented by one, two, three, or four bytes.
A folded or normalized character can map to a sequence of multiple code points, and each of those code points can have a different length representation in UTF-8.
That's why the vast majority of modern text-processing applications disable full Unicode processing.
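
As a rough illustration of that variability, the sketch below computes the UTF-8 length of each code point produced by Python's built-in `str.casefold()`. The helper names `utf8_length` and `folded_byte_lengths` are hypothetical and exist only for this example.

```python
def utf8_length(codepoint: int) -> int:
    """Number of bytes UTF-8 needs to encode one code point."""
    if codepoint < 0x80:
        return 1
    if codepoint < 0x800:
        return 2
    if codepoint < 0x10000:
        return 3
    return 4


def folded_byte_lengths(char: str) -> list:
    """UTF-8 length of every code point in the case-folded form of `char`."""
    return [utf8_length(ord(c)) for c in char.casefold()]


# "A", U+00DF (sharp s), U+FB03 (ffi ligature)
for char in ["A", "\u00df", "\ufb03"]:
    source_bytes = len(char.encode("utf-8"))
    print(f"U+{ord(char):04X}: {source_bytes} byte(s) -> {folded_byte_lengths(char)}")
```

One source code point can therefore turn into one, two, or three code points, each with its own byte length, which is what makes fixed-width SIMD processing awkward.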

Typically, when people perform case-insensitive search, they either:

1. Use simple ASCII case folding (A-Z to a-z), ignoring all other characters.
2. Use pretty much the only major library that supports full Unicode case folding and normalization, ICU (International Components for Unicode).

The first is clearly insufficient, and the second is quite heavy and works at a character level, making SIMD optimizations difficult.
This notebook will focus on more SIMD-vectorizable ideas.

To start, let's pull the most recent Unicode Character Database (UCD) files from the Unicode website.

In [None]:
import sys
from collections import Counter
import unicodedata

# Import shared Unicode data loading functions
sys.path.insert(0, '.')
from test_helpers import (
    UNICODE_VERSION,
    get_all_codepoints,
    get_case_folding_rules_as_codepoints,
    get_normalization_props,
    get_unicode_xml_data,
)

print(f"Using Unicode version: {UNICODE_VERSION}")
all_codepoints = get_all_codepoints(UNICODE_VERSION)

Using Unicode version: 17.0.0
Using cached Unicode 17.0.0 UCD XML: /tmp/ucd-17.0.0.all.flat.xml
Total assigned codepoints: 159,866


[The highest allowed code point in Unicode is `0x10FFFF` or "U+10FFFF"](https://stackoverflow.com/questions/52203351/why-is-unicode-restricted-to-0x10ffff), but it doesn't mean that all code points up to that value are assigned.

- Planes 15-16 (U+F0000 to U+10FFFF) are reserved as "Private Use Areas" and do not contain assigned characters.
- Plane 14 (U+E0000 to U+EFFFF), the "Supplementary Special-purpose Plane", contains very few assigned characters.
- Many code points in other planes are also unassigned.

In [6]:
print(f"Total assigned codepoints: {len(all_codepoints):,}")
print(f"Highest assigned codepoint: {all_codepoints[-1]:,}")
print(f"Highest possible codepoint: {0x10FFFF:,}")
print(f"Range density: {len(all_codepoints) / (all_codepoints[-1] + 1):.6%}")

Total assigned codepoints: 159,866
Highest assigned codepoint: 917,999
Highest possible codepoint: 1,114,111
Range density: 17.414597%
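
To relate the numbers above to planes: each plane spans 65,536 code points, so the plane index is simply the code point shifted right by 16 bits. A small sketch, where `plane_of` is a hypothetical helper:

```python
def plane_of(codepoint: int) -> int:
    """Index of the Unicode plane (0..16) that contains `codepoint`."""
    return codepoint >> 16


highest_assigned = 917_999  # value reported by the cell above
print(f"U+{highest_assigned:04X} is in plane {plane_of(highest_assigned)}")
print(f"U+{0x10FFFF:04X} is in plane {plane_of(0x10FFFF)}")
```

The highest assigned code point is U+E01EF, which sits in plane 14, while the private-use planes 15 and 16 above it contribute nothing to the count.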


## Unicode Case Folding Analysis

### Direct Folding Targets

Case folding maps characters to a "folded" form for case-insensitive comparisons.
This is more comprehensive than simple lowercasing - it handles special cases like German ß → ss.
The very first thing we are interested in is: how often each codepoint becomes a folding target for other characters?

The reason we are curious about this is that in simple cases, like the Russian letters "А" (A) and "а" (a), both characters fold to the same codepoint U+0430 (Cyrillic small letter a).
So when scanning for exact case-insensitive matches, we can compare each 2-byte UTF-8 slice against just 2 possible values: 0xD090 (U+0410) and 0xD0B0 (U+0430), without actually performing any case folding.
The easiest way to solve the problem is to avoid it altogether :smile:
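
A scalar sketch of that idea, assuming nothing beyond the two byte patterns named above (the function name `find_cyrillic_a` is made up for illustration, and a real implementation would compare many positions per instruction):

```python
UPPER_A = "\u0410".encode("utf-8")  # b"\xd0\x90"
LOWER_A = "\u0430".encode("utf-8")  # b"\xd0\xb0"


def find_cyrillic_a(haystack: bytes) -> list:
    """Byte offsets where either case of Cyrillic 'a' starts, without folding."""
    offsets = []
    for i in range(len(haystack) - 1):
        if haystack[i:i + 2] in (UPPER_A, LOWER_A):
            offsets.append(i)
    return offsets


text = "\u0410\u043d\u043d\u0430".encode("utf-8")  # the name "Anna" in Cyrillic
print(find_cyrillic_a(text))  # [0, 6]
```

Because `0xD0` is a UTF-8 lead byte and can never appear as a continuation byte, a two-byte match cannot start in the middle of another character.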

In [7]:
case_folds = get_case_folding_rules_as_codepoints(UNICODE_VERSION)
print(f"Total case folding rules: {len(case_folds):,}")

Using cached Unicode 17.0.0 CaseFolding.txt: /tmp/CaseFolding-17.0.0.txt
Total case folding rules: 1,585


In [8]:
target_frequency = Counter()

for source_cp, target_cps in case_folds.items():
    for target_cp in target_cps:
        target_frequency[target_cp] += 1

print(f"Total folding rules: {sum(target_frequency.values()):,}")
print(f"Unique target codepoints: {len(target_frequency):,}")

Total folding rules: 1,705
Unique target codepoints: 1,462


Let's display the most common folding targets:

In [16]:
print("=" * 70)
print(f"{'Codepoint':<12} {'Char':<6} {'Freq':<6} {'Name'}")
print("-" * 70)

for cp, freq in target_frequency.most_common():
    try:
        char = chr(cp)
        name = unicodedata.name(char, "")
    except (ValueError, OverflowError):
        char = "?"
        name = ""
    print(f"U+{cp:04X}       {char!r:<6} {freq:<6} {name}")

Codepoint    Char   Freq   Name
----------------------------------------------------------------------
U+03B9       'ι'    71     GREEK SMALL LETTER IOTA
U+0342       '͂'    11     COMBINING GREEK PERISPOMENI
U+03C5       'υ'    10     GREEK SMALL LETTER UPSILON
U+0066       'f'    9      LATIN SMALL LETTER F
U+0308       '̈'    9      COMBINING DIAERESIS
U+0073       's'    8      LATIN SMALL LETTER S
U+03C9       'ω'    6      GREEK SMALL LETTER OMEGA
U+0301       '́'    5      COMBINING ACUTE ACCENT
U+03B1       'α'    5      GREEK SMALL LETTER ALPHA
U+03B7       'η'    5      GREEK SMALL LETTER ETA
U+0574       'մ'    5      ARMENIAN SMALL LETTER MEN
U+0313       '̓'    5      COMBINING COMMA ABOVE
U+0069       'i'    4      LATIN SMALL LETTER I
U+0074       't'    4      LATIN SMALL LETTER T
U+006C       'l'    3      LATIN SMALL LETTER L
U+03B8       'θ'    3      GREEK SMALL LETTER THETA
U+03C1       'ρ'    3      GREEK SMALL LETTER RHO
U+0442       'т'    3      CY

This shows that "GREEK SMALL LETTER IOTA" (U+03B9) is the most common folding target, appearing in the folded form of 71 different codepoints.
The reason for this is historical.
Ancient Greek had a grammatical feature called the "iota subscript", where iota was written as a small subscript beneath vowels (α, η, ω) to indicate certain grammatical forms (dative case, etc.).
When case-folding, these decompose and the subscript iota becomes a regular lowercase iota:

- ᾳ (alpha with ypogegrammeni) → αι
- ῃ (eta with ypogegrammeni) → ηι
- ῳ (omega with ypogegrammeni) → ωι
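
These three expansions can be spot-checked with Python's built-in `str.casefold()`, which applies full case folding using the interpreter's own Unicode tables (an assumption: those tables may lag behind Unicode 17.0, so counts over the whole Greek Extended block can differ slightly from the table above). `fold_codepoints` is a hypothetical helper.

```python
IOTA = "\u03b9"  # GREEK SMALL LETTER IOTA


def fold_codepoints(char: str) -> list:
    """Code points of the full case folding of a single character."""
    return [ord(c) for c in char.casefold()]


# Alpha, eta, and omega with ypogegrammeni.
for char in ["\u1fb3", "\u1fc3", "\u1ff3"]:
    targets = " ".join(f"U+{cp:04X}" for cp in fold_codepoints(char))
    print(f"U+{ord(char):04X} -> {targets}")

# Characters in the Greek Extended block whose folding expands and contains iota.
with_iota = [
    chr(cp) for cp in range(0x1F00, 0x2000)
    if len(chr(cp).casefold()) > 1 and IOTA in chr(cp).casefold()
]
print(f"Greek Extended characters expanding to a sequence with iota: {len(with_iota)}")
```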

More importantly, at this point we see that `'f'`, `'s'`, `'i'`, `'t'` are the most common direct single-byte UTF-8 folding targets.
Each is the target of at least 4 different codepoints.
But that doesn't tell the whole story!

### Otherwise Ambiguous Folding Targets

Oftentimes, a character is only one of many characters in the produced folding result.

- `'ﬀ'` → `"ff"` - 3-byte codepoint mapping into 2x 1-byte codepoints
- `'ﬁ'` → `"fi"` - 3-byte codepoint mapping into 2x 1-byte codepoints
- `'ﬂ'` → `"fl"` - 3-byte codepoint mapping into 2x 1-byte codepoints
- `'ﬃ'` → `"ffi"` - 3-byte codepoint mapping into 3x 1-byte codepoints
- `'ﬄ'` → `"ffl"` - 3-byte codepoint mapping into 3x 1-byte codepoints

Let's account for those as well:

In [13]:
direct_target_freq = Counter()  # cp is the ONLY target of a folding
partial_target_freq = Counter()  # cp is ONE OF multiple targets in a folding

for source_cp, target_cps in case_folds.items():
    if len(target_cps) == 1:
        # Direct 1:1 folding (e.g., 'A' → 'a')
        direct_target_freq[target_cps[0]] += 1
    else:
        # Multi-codepoint expansion (e.g., 'ﬁ' → 'f', 'i')
        for target_cp in target_cps:
            partial_target_freq[target_cp] += 1

# Some codepoints may be both direct AND partial targets
both_targets = set(direct_target_freq.keys()) & set(partial_target_freq.keys())

print(f"Folding target analysis:")
print(f"=" * 60)
print(f"Total folding rules: {len(case_folds):,}")
print(f"  - Direct 1:1 foldings: {sum(1 for t in case_folds.values() if len(t) == 1):,}")
print(f"  - Multi-codepoint expansions: {sum(1 for t in case_folds.values() if len(t) > 1):,}")
print()
print(f"Unique target codepoints:")
print(f"  - Only direct targets: {len(direct_target_freq - partial_target_freq):,}")
print(f"  - Only partial targets: {len(partial_target_freq - direct_target_freq):,}")
print(f"  - Both direct AND partial: {len(both_targets):,}")

Folding target analysis:
Total folding rules: 1,585
  - Direct 1:1 foldings: 1,481
  - Multi-codepoint expansions: 104

Unique target codepoints:
  - Only direct targets: 1,398
  - Only partial targets: 48
  - Both direct AND partial: 54
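
One caveat when reading these three figures: they add up to 1,500, not to the 1,462 unique targets counted earlier. The likely cause is that subtracting one `Counter` from another subtracts the counts and keeps only keys with a positive result, so a codepoint present in both counters can still be reported as "only direct" or "only partial". A minimal sketch with toy counters (the `'s'`, `'a'`, and U+0342 counts mirror rows shown in this notebook, the rest is illustrative) shows the difference between that and a set difference over the keys:

```python
from collections import Counter

direct = Counter({"s": 2, "a": 1, "b": 1})
partial = Counter({"s": 6, "a": 1, "\u0342": 11})

# Counter subtraction compares counts, not membership.
print(len(direct - partial))   # 1: only 'b' survives
print(len(partial - direct))   # 2: U+0342 and also 's', since 6 - 2 > 0

# Set operations on the keys give disjoint groups.
only_direct = set(direct) - set(partial)
only_partial = set(partial) - set(direct)
both = set(direct) & set(partial)
print(len(only_direct), len(only_partial), len(both))  # 1 1 2
```

With the set-based version, the three groups always sum to the number of unique target codepoints.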


Let's now redo our table, differentiating complete and partial folding targets!

In [18]:
print("=" * 80)
print(f"{'Codepoint':<12} {'Char':<6} {'Partial':<8} {'Direct':<8} {'Example expansion'}")
print("-" * 80)

# Sort by partial frequency descending
for cp, partial_freq in partial_target_freq.most_common():
    try:
        char = chr(cp)
        name = unicodedata.name(char, "")
    except (ValueError, OverflowError):
        char = "?"
        name = ""
    
    direct_freq = direct_target_freq.get(cp, 0)
    
    # Find an example expansion containing this codepoint
    example = ""
    for src_cp, tgt_cps in case_folds.items():
        if len(tgt_cps) > 1 and cp in tgt_cps:
            try:
                src_char = chr(src_cp)
                tgt_str = "".join(chr(c) for c in tgt_cps)
                example = f"'{src_char}' → \"{tgt_str}\""
            except (ValueError, OverflowError):
                example = f"U+{src_cp:04X} → {tgt_cps}"
            break
    
    print(f"U+{cp:04X}       {char!r:<6} {partial_freq:<8} {direct_freq:<8} {example}")

Codepoint    Char   Partial  Direct   Example expansion
--------------------------------------------------------------------------------
U+03B9       'ι'    68       3        'ΐ' → "ΐ"
U+0342       '͂'    11       0        'ὖ' → "ὖ"
U+0308       '̈'    9        0        'ΐ' → "ΐ"
U+03C5       'υ'    9        1        'ΰ' → "ΰ"
U+0066       'f'    8        1        'ﬀ' → "ff"
U+0073       's'    6        2        'ß' → "ss"
U+0301       '́'    5        0        'ΐ' → "ΐ"
U+0313       '̓'    5        0        'ὐ' → "ὐ"
U+03B1       'α'    4        1        'ᾳ' → "αι"
U+03B7       'η'    4        1        'ῃ' → "ηι"
U+03C9       'ω'    4        2        'ῳ' → "ωι"
U+0574       'մ'    4        1        'ﬓ' → "մն"
U+0069       'i'    3        1        'İ' → "i̇"
U+0074       't'    3        1        'ẗ' → "ẗ"
U+0300       '̀'    3        0        'ὒ' → "ὒ"
U+0565       'ե'    2     

### Safe Single-byte Folding Anchors

Of all those characters, we are most interested in the codepoints representable in just 1 byte in UTF-8, as we can process 64 of them in a `ZMM` register at once.
Those are the boring ASCII letters.
But we can't just apply traditional SIMD ASCII case-insensitive search techniques like:

```c
__m512i lower_mask = _mm512_set1_epi8(0x20);
__m512i input_chunk = _mm512_loadu_si512(input_ptr);
__m512i folded_chunk = _mm512_or_si512(input_chunk, lower_mask);
```

If the needle contains an `'f'` and the haystack contains an `'ﬃ'`, we would miss the match.
So we must know which of the single-byte codepoints are folding targets of multiple codepoints.
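
The miss is easy to reproduce with a scalar Python emulation of the snippet above. This is only a sketch: `fold_ascii_or_0x20` is a hypothetical stand-in for the vector OR, and the reference result relies on Python's built-in `str.casefold()`.

```python
def fold_ascii_or_0x20(data: bytes) -> bytes:
    """Emulates OR-ing every byte with 0x20, as the SIMD snippet does."""
    return bytes(b | 0x20 for b in data)


needle = fold_ascii_or_0x20(b"FFI")
haystack_text = "O\ufb03ce"  # "Office" spelled with the U+FB03 ligature
haystack = haystack_text.encode("utf-8")

naive_hit = needle in fold_ascii_or_0x20(haystack)
full_hit = needle in haystack_text.casefold().encode("utf-8")
print(naive_hit, full_hit)  # False True

# The blanket OR also rewrites bytes that are not letters at all.
print(fold_ascii_or_0x20(b"@"))  # b'`'
```

The byte-wise OR never sees the ligature as three letters, and it also alters non-letter bytes such as `'@'`, so it needs guarding even for pure ASCII input.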

In [21]:
# ASCII codepoints (0-127) that are folding targets
# Group by: how many sources fold to them (direct + partial)

ascii_targets = {}
for cp in range(128):
    direct = direct_target_freq.get(cp, 0)
    partial = partial_target_freq.get(cp, 0)
    total = direct + partial
    if total > 0:
        ascii_targets[cp] = {'direct': direct, 'partial': partial, 'total': total}

# Separate into "safe" (exactly 1 source) vs "ambiguous" (multiple sources)
safe_ascii = {cp: info for cp, info in ascii_targets.items() if info['total'] == 1}
ambiguous_ascii = {cp: info for cp, info in ascii_targets.items() if info['total'] > 1}

print("ASCII folding targets (single-byte UTF-8):")
print("=" * 70)
print(f"Total ASCII targets: {len(ascii_targets)}")
print(f"  - Safe (exactly 1 source): {len(safe_ascii)}")
print(f"  - Ambiguous (multiple sources): {len(ambiguous_ascii)}")
print()

print("SAFE ASCII targets (can use simple SIMD case folding):")
print("-" * 70)
for cp in sorted(safe_ascii.keys()):
    char = chr(cp)
    info = safe_ascii[cp]
    # Find the single source
    for src_cp, tgt_cps in case_folds.items():
        if cp in tgt_cps:
            src_char = chr(src_cp)
            print(f"  '{char}' (U+{cp:04X}) ← '{src_char}' (U+{src_cp:04X})")
            break

print()
print("AMBIGUOUS ASCII targets (need special handling in SIMD):")
print("-" * 70)
print(f"{'Char':<8} {'CP':<10} {'Direct':<8} {'Partial':<8} {'Total':<8} Sources")
print("-" * 70)

for cp in sorted(ambiguous_ascii.keys()):
    char = chr(cp)
    info = ambiguous_ascii[cp]
    
    # Find all sources
    sources = []
    for src_cp, tgt_cps in case_folds.items():
        if cp in tgt_cps:
            try:
                src_char = chr(src_cp)
                if len(tgt_cps) == 1:
                    sources.append(f"'{src_char}'")
                else:
                    tgt_str = "".join(chr(c) for c in tgt_cps)
                    sources.append(f"'{src_char}'→\"{tgt_str}\"")
            except (ValueError, OverflowError):
                sources.append(f"U+{src_cp:04X}")
    
    sources_str = ", ".join(sources[:6])
    if len(sources) > 6:
        sources_str += f" (+{len(sources)-6} more)"
    
    print(f"'{char}'      U+{cp:04X}     {info['direct']:<8} {info['partial']:<8} {info['total']:<8} {sources_str}")

ASCII folding targets (single-byte UTF-8):
Total ASCII targets: 26
  - Safe (exactly 1 source): 14
  - Ambiguous (multiple sources): 12

SAFE ASCII targets (can use simple SIMD case folding):
----------------------------------------------------------------------
  'b' (U+0062) ← 'B' (U+0042)
  'c' (U+0063) ← 'C' (U+0043)
  'd' (U+0064) ← 'D' (U+0044)
  'e' (U+0065) ← 'E' (U+0045)
  'g' (U+0067) ← 'G' (U+0047)
  'm' (U+006D) ← 'M' (U+004D)
  'o' (U+006F) ← 'O' (U+004F)
  'p' (U+0070) ← 'P' (U+0050)
  'q' (U+0071) ← 'Q' (U+0051)
  'r' (U+0072) ← 'R' (U+0052)
  'u' (U+0075) ← 'U' (U+0055)
  'v' (U+0076) ← 'V' (U+0056)
  'x' (U+0078) ← 'X' (U+0058)
  'z' (U+007A) ← 'Z' (U+005A)

AMBIGUOUS ASCII targets (need special handling in SIMD):
----------------------------------------------------------------------
Char     CP         Direct   Partial  Total    Sources
----------------------------------------------------------------------
'a'      U+0061     1        1    

Looking at this, if the needle contains a contiguous run of `'b'`, `'c'`, `'d'`, `'e'`, `'g'`, `'m'`, `'o'`, `'p'`, `'q'`, `'r'`, `'u'`, `'v'`, `'x'`, `'z'` in any order or case, we can trivially match that run using the simple SIMD snippet from above, as long as the run doesn't include `'a'`, `'f'`, `'h'`, `'i'`, `'j'`, `'k'`, `'l'`, `'n'`, `'s'`, `'t'`, `'w'`, or `'y'`.
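
A minimal sketch of that check, with the two letter sets copied from the output above. The name `is_trivially_foldable` is hypothetical, and the function only answers whether an entire needle qualifies:

```python
SAFE_LETTERS = frozenset("bcdegmopqruvxz")     # exactly one folding source each
AMBIGUOUS_LETTERS = frozenset("afhijklnstwy")  # also reachable from non-ASCII sources


def is_trivially_foldable(needle: str) -> bool:
    """True if every character is an ASCII letter from the safe set, in either case."""
    return all(ch.isascii() and ch.lower() in SAFE_LETTERS for ch in needle)


for needle in ["Code", "BUZZ", "safe"]:
    print(needle, is_trivially_foldable(needle))
```

`"Code"` and `"BUZZ"` pass, while `"safe"` fails on its first letter, since `'s'` is also the folding target of `'ſ'` and `'ß'`.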

Moreover, there is a group of single-byte UTF-8 codepoints that don't participate in any folding mappings at all:

In [24]:
uninvolved_ascii = [cp for cp in range(128) if cp not in ascii_targets and cp not in case_folds]
print(f"ASCII codepoints completely uninvolved in folding: {len(uninvolved_ascii)}")

control = [cp for cp in uninvolved_ascii if cp < 32 or cp == 127]
digits = [chr(cp) for cp in uninvolved_ascii if chr(cp).isdigit()]
punct = [chr(cp) for cp in uninvolved_ascii if 32 <= cp < 127 and not chr(cp).isalnum()]

print(f"Control characters: {len(control)} (0x00-0x1F, 0x7F)")
print(f"Digits: {''.join(digits)}")
print(f"Punctuation/Symbols: {''.join(punct)}")

ASCII codepoints completely uninvolved in folding: 76
Control characters: 33 (0x00-0x1F, 0x7F)
Digits: 0123456789
Punctuation/Symbols:  !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
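
Combining the safe letters with these uninvolved bytes gives a simple way to pick an anchor inside a needle. The sketch below is one possible heuristic, not necessarily what StringZilla does: it returns the longest run of characters that are either safe ASCII letters or ASCII non-letters. Both function names are hypothetical.

```python
SAFE_LETTERS = frozenset("bcdegmopqruvxz")


def is_anchor_char(ch: str) -> bool:
    """Safe ASCII letter, or an ASCII byte that takes no part in case folding."""
    if not ch.isascii():
        return False
    if ch.isalpha():
        return ch.lower() in SAFE_LETTERS
    return True  # digits, punctuation, whitespace, control characters


def longest_anchor(needle: str) -> tuple:
    """(start, length) of the longest run of anchor characters in `needle`."""
    best_start, best_len, run_start = 0, 0, 0
    for i, ch in enumerate(needle):
        if not is_anchor_char(ch):
            run_start = i + 1  # the run restarts after this character
        elif i + 1 - run_start > best_len:
            best_start, best_len = run_start, i + 1 - run_start
    return best_start, best_len


needle = "find_code_2025 fast"
start, length = longest_anchor(needle)
print(repr(needle[start:start + length]))  # 'd_code_2025 '
```

Within such a run, only the letters need the `0x20` trick, and every other byte can be compared exactly.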


### Safe Two-byte Folding Anchors

The more interesting and challenging part is the 2-byte UTF-8 codepoints that fold into either another single 2-byte codepoint or two 1-byte codepoints.
Given the much larger search space, we want to group them into contiguous source and target ranges wherever possible.
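
One way to do that grouping, sketched under the assumption that only 1:1 foldings are passed in: merge consecutive sources that share the same source-to-target offset. `group_into_ranges` is a hypothetical helper, and the toy input hard-codes the two well-known Cyrillic uppercase blocks rather than reading `case_folds`.

```python
def group_into_ranges(folds: dict) -> list:
    """Collapse {source: target} into (first, last, offset) runs of constant offset."""
    ranges = []
    for src in sorted(folds):
        offset = folds[src] - src
        if ranges and ranges[-1][1] == src - 1 and ranges[-1][2] == offset:
            ranges[-1] = (ranges[-1][0], src, offset)  # extend the current run
        else:
            ranges.append((src, src, offset))          # start a new run
    return ranges


# U+0400..U+040F fold to U+0450..U+045F, and U+0410..U+042F to U+0430..U+044F.
toy_folds = {0x0400 + i: 0x0450 + i for i in range(16)}
toy_folds.update({0x0410 + i: 0x0430 + i for i in range(32)})

for first, last, offset in group_into_ranges(toy_folds):
    print(f"U+{first:04X}..U+{last:04X}  offset +0x{offset:X}")
```

The 48 rules collapse into two ranges even though the sources are adjacent, because the offset changes from `+0x50` to `+0x20` at U+0410.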

In [23]:
# 2-byte UTF-8 codepoints: U+0080 to U+07FF (128 to 2047)
# Filter case folding rules to only those involving 2-byte source codepoints

two_byte_folds = {}
for src_cp, tgt_cps in case_folds.items():
    if 128 <= src_cp < 2048:  # 2-byte UTF-8 range
        two_byte_folds[src_cp] = tgt_cps

print(f"2-byte UTF-8 codepoints with case folding: {len(two_byte_folds):,}")
print()

# Categorize by target type
folds_to_1byte = {}      # 2-byte → single 1-byte (e.g., some Latin letters)
folds_to_2byte = {}      # 2-byte → single 2-byte (most common)
folds_to_2x1byte = {}    # 2-byte → two 1-byte codepoints
folds_to_other = {}      # Other patterns

for src_cp, tgt_cps in two_byte_folds.items():
    tgt_sizes = [1 if cp < 128 else 2 if cp < 2048 else 3 if cp < 65536 else 4 for cp in tgt_cps]
    
    if len(tgt_cps) == 1:
        if tgt_sizes[0] == 1:
            folds_to_1byte[src_cp] = tgt_cps
        elif tgt_sizes[0] == 2:
            folds_to_2byte[src_cp] = tgt_cps
        else:
            folds_to_other[src_cp] = tgt_cps
    elif len(tgt_cps) == 2 and all(s == 1 for s in tgt_sizes):
        folds_to_2x1byte[src_cp] = tgt_cps
    else:
        folds_to_other[src_cp] = tgt_cps

print(f"Folding patterns for 2-byte UTF-8 sources:")
print(f"  2-byte → 1-byte:     {len(folds_to_1byte):,}")
print(f"  2-byte → 2-byte:     {len(folds_to_2byte):,}")
print(f"  2-byte → 2x 1-byte:  {len(folds_to_2x1byte):,}")
print(f"  Other patterns:      {len(folds_to_other):,}")

2-byte UTF-8 codepoints with case folding: 460

Folding patterns for 2-byte UTF-8 sources:
  2-byte → 1-byte:     1
  2-byte → 2-byte:     450
  2-byte → 2x 1-byte:  1
  Other patterns:      8


Of the 460 case folding rules for 2-byte UTF-8 sources, the vast majority (450) map to another 2-byte codepoint.
The remaining 10 are special cases worth understanding:

__2-byte → 1-byte (1 case):__

- `'ſ'` (U+017F, LATIN SMALL LETTER LONG S) → `'s'` - historical long S folds to regular ASCII s

__2-byte → 2x 1-byte (1 case):__

- `'ß'` (U+00DF, LATIN SMALL LETTER SHARP S) → `"ss"` - German eszett expands to two ASCII characters

__Other patterns (8 cases):__

These are the tricky edge cases that don't fit clean patterns:

- `'İ'` (U+0130) → `'i'` + combining dot above (1-byte + 2-byte) - Turkish capital I with dot
- `'ŉ'` (U+0149) → modifier apostrophe + `'n'` (2-byte + 1-byte) - deprecated character
- `'ǰ'` (U+01F0) → `'j'` + combining caron (1-byte + 2-byte) - J with caron decomposes
- `'Ⱥ'` (U+023A) → `'ⱥ'` (U+2C65) - 2-byte source maps to 3-byte target!
- `'Ⱦ'` (U+023E) → `'ⱦ'` (U+2C66) - another 2-byte → 3-byte case
- `'ΐ'` (U+0390) → ι + combining diaeresis + combining acute (3x 2-byte) - Greek with diacritics
- `'ΰ'` (U+03B0) → υ + combining diaeresis + combining acute (3x 2-byte) - Greek with diacritics
- `'և'` (U+0587) → ե + ւ (2x 2-byte) - Armenian ligature

The `'Ⱥ'` and `'Ⱦ'` cases are particularly noteworthy: they are 2-byte UTF-8 sources that fold to 3-byte targets, meaning the folded form is *longer* than the original!
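
To quantify how much longer, the sketch below hard-codes the eight "other" mappings listed above (the target code points are written out by hand from that list, not read from `case_folds`) and measures the UTF-8 growth of each. `utf8_size` is a hypothetical helper.

```python
# Source code point -> folded code points, transcribed from the list above.
OTHER_FOLDS = {
    0x0130: [0x0069, 0x0307],          # i + combining dot above
    0x0149: [0x02BC, 0x006E],          # modifier apostrophe + n
    0x01F0: [0x006A, 0x030C],          # j + combining caron
    0x023A: [0x2C65],
    0x023E: [0x2C66],
    0x0390: [0x03B9, 0x0308, 0x0301],  # iota + diaeresis + acute
    0x03B0: [0x03C5, 0x0308, 0x0301],  # upsilon + diaeresis + acute
    0x0587: [0x0565, 0x0582],          # Armenian ech + yiwn
}


def utf8_size(codepoints: list) -> int:
    """Total UTF-8 byte length of a sequence of code points."""
    return len("".join(chr(cp) for cp in codepoints).encode("utf-8"))


growth = {src: utf8_size(tgt) / utf8_size([src]) for src, tgt in OTHER_FOLDS.items()}
for src, factor in growth.items():
    print(f"U+{src:04X}: {utf8_size([src])} -> {utf8_size(OTHER_FOLDS[src])} bytes ({factor:.1f}x)")

print(f"Worst-case growth among these: {max(growth.values()):.1f}x")
```

Among these eight mappings the folded form grows by up to 3x (U+0390 and U+03B0 go from 2 to 6 bytes), so an in-place fold is not an option and any output buffer has to be sized for the worst case.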