# Unicode Normalization and Search Anchor Analysis

This notebook explores Unicode case folding and normalization properties to identify optimal
"anchor points" for case-insensitive and normalization-insensitive string search algorithms.

**Key insight:** For fast UTF-8 search, we want to find characters that:
1. Are invariant under case folding (don't change when folded)
2. Are invariant under normalization (NFC, NFD, NFKC, NFKD)
3. Are not targets of multiple other characters' transformations
4. Have high byte-level entropy (good for SIMD pattern matching)

These "anchor points" can be used to quickly scan for potential matches before
performing expensive case-insensitive or normalized comparisons.

In [1]:
import sys
from collections import Counter
import unicodedata

# Import shared Unicode data loading functions
sys.path.insert(0, '.')
from test_stringzilla import (
    UNICODE_VERSION,
    get_case_folding_rules,
    get_case_folding_rules_as_codepoints,
    get_normalization_props,
    get_unicode_xml_data,
)

print(f"Using Unicode version: {UNICODE_VERSION}")

Using Unicode version: 17.0.0


## 1. Case Folding Analysis

Case folding maps characters to a "folded" form for case-insensitive comparisons.
This is more comprehensive than simple lowercasing - it handles special cases like German ß → ss.

In [2]:
# Load case folding rules
case_folds = get_case_folding_rules_as_codepoints(UNICODE_VERSION)
print(f"Total case folding rules: {len(case_folds):,}")

Downloading Unicode 17.0.0 CaseFolding.txt from https://www.unicode.org/Public/17.0.0/ucd/CaseFolding.txt...
Cached to /tmp/CaseFolding-17.0.0.txt
Total case folding rules: 1,585


In [None]:
# Count how often each codepoint appears as a case folding TARGET
target_frequency = Counter()

for source_cp, target_cps in case_folds.items():
    for target_cp in target_cps:
        target_frequency[target_cp] += 1

print(f"Total folding rules: {sum(target_frequency.values()):,}")
print(f"Unique target codepoints: {len(target_frequency):,}")

# Characters that map to themselves are not in the table, so targets with
# frequency > 0 are characters that OTHER characters fold into

Unique target codepoints: 1,462
Total target occurrences: 1,705


In [4]:
# Display the most common case folding targets
print("Most common case folding targets:")
print("=" * 70)
print(f"{'Codepoint':<12} {'Char':<6} {'Freq':<6} {'Name'}")
print("-" * 70)

for cp, freq in target_frequency.most_common(30):
    try:
        char = chr(cp)
        name = unicodedata.name(char, "")
    except (ValueError, OverflowError):
        char = "?"
        name = ""
    print(f"U+{cp:04X}       {char!r:<6} {freq:<6} {name}")

Most common case folding targets:
Codepoint    Char   Freq   Name
----------------------------------------------------------------------
U+03B9       'ι'    71     GREEK SMALL LETTER IOTA
U+0342       '͂'    11     COMBINING GREEK PERISPOMENI
U+03C5       'υ'    10     GREEK SMALL LETTER UPSILON
U+0066       'f'    9      LATIN SMALL LETTER F
U+0308       '̈'    9      COMBINING DIAERESIS
U+0073       's'    8      LATIN SMALL LETTER S
U+03C9       'ω'    6      GREEK SMALL LETTER OMEGA
U+0301       '́'    5      COMBINING ACUTE ACCENT
U+03B1       'α'    5      GREEK SMALL LETTER ALPHA
U+03B7       'η'    5      GREEK SMALL LETTER ETA
U+0574       'մ'    5      ARMENIAN SMALL LETTER MEN
U+0313       '̓'    5      COMBINING COMMA ABOVE
U+0069       'i'    4      LATIN SMALL LETTER I
U+0074       't'    4      LATIN SMALL LETTER T
U+006C       'l'    3      LATIN SMALL LETTER L
U+03B8       'θ'    3      GREEK SMALL LETTER THETA
U+03C1       'ρ'    3      GREEK SMALL LETTER RHO
U+0442  

In [5]:
# Analyze the distribution of target frequencies
freq_distribution = Counter(target_frequency.values())

print("\nFrequency distribution (how many targets have N sources folding to them):")
print("=" * 50)
print(f"{'Sources folding to target':<30} {'Count':<10}")
print("-" * 50)

for n_sources, count in sorted(freq_distribution.items()):
    print(f"{n_sources:<30} {count:<10}")

print(f"\nMost characters are targeted by just 1 source (uppercase → lowercase)")
print(f"Characters with freq > 1 have multiple sources folding to them")


Frequency distribution (how many targets have N sources folding to them):
Sources folding to target      Count     
--------------------------------------------------
1                              1379      
2                              38        
3                              31        
4                              2         
5                              5         
6                              1         
8                              1         
9                              2         
10                             1         
11                             1         
71                             1         

Most characters are targeted by just 1 source (uppercase → lowercase)
Characters with freq > 1 have multiple sources folding to them


In [6]:
# Show characters with multiple sources folding to them (most interesting cases)
print("Characters with multiple sources folding to them:")
print("=" * 80)

multi_source_targets = [(cp, freq) for cp, freq in target_frequency.items() if freq > 1]
multi_source_targets.sort(key=lambda x: (-x[1], x[0]))

print(f"{'Target':<12} {'Char':<6} {'Freq':<6} {'Sources folding to this target'}")
print("-" * 80)

for cp, freq in multi_source_targets[:25]:
    try:
        char = chr(cp)
    except (ValueError, OverflowError):
        char = "?"
    
    # Find all sources that fold to this target
    sources = []
    for src_cp, tgt_cps in case_folds.items():
        if cp in tgt_cps:
            try:
                sources.append(f"U+{src_cp:04X} ({chr(src_cp)})")
            except (ValueError, OverflowError):
                sources.append(f"U+{src_cp:04X}")
    
    sources_str = ", ".join(sources[:5])
    if len(sources) > 5:
        sources_str += f", ... (+{len(sources)-5} more)"
    
    print(f"U+{cp:04X}       {char!r:<6} {freq:<6} {sources_str}")

Characters with multiple sources folding to them:
Target       Char   Freq   Sources folding to this target
--------------------------------------------------------------------------------
U+03B9       'ι'    71     U+0345 (ͅ), U+0390 (ΐ), U+0399 (Ι), U+1F80 (ᾀ), U+1F81 (ᾁ), ... (+66 more)
U+0342       '͂'    11     U+1F56 (ὖ), U+1FB6 (ᾶ), U+1FB7 (ᾷ), U+1FC6 (ῆ), U+1FC7 (ῇ), ... (+6 more)
U+03C5       'υ'    10     U+03A5 (Υ), U+03B0 (ΰ), U+1F50 (ὐ), U+1F52 (ὒ), U+1F54 (ὔ), ... (+5 more)
U+0066       'f'    9      U+0046 (F), U+FB00 (ﬀ), U+FB01 (ﬁ), U+FB02 (ﬂ), U+FB03 (ﬃ), ... (+1 more)
U+0308       '̈'    9      U+0390 (ΐ), U+03B0 (ΰ), U+1E97 (ẗ), U+1FD2 (ῒ), U+1FD3 (ΐ), ... (+4 more)
U+0073       's'    8      U+0053 (S), U+00DF (ß), U+017F (ſ), U+1E9E (ẞ), U+FB05 (ﬅ), ... (+1 more)
U+03C9       'ω'    6      U+03A9 (Ω), U+1FF3 (ῳ), U+1FF6 (ῶ), U+1FF7 (ῷ), U+1FFC (ῼ), ... (+1 more)
U+0301       '́'    5      U+0390 (ΐ), U+03B0 (ΰ), U+1F54 (ὔ), U+1FD3 (ΐ), U+1FE3 (ΰ)
U+0313       '̓' 

In [7]:
# Analyze expansions (one character folding to multiple characters)
expansions = {src: tgt for src, tgt in case_folds.items() if len(tgt) > 1}

print(f"Case folding expansions (1 char → multiple chars): {len(expansions)}")
print("=" * 70)
print(f"{'Source':<12} {'Char':<6} {'Expands to'}")
print("-" * 70)

for src_cp, tgt_cps in list(expansions.items())[:20]:
    try:
        src_char = chr(src_cp)
        tgt_str = "".join(chr(cp) for cp in tgt_cps)
        tgt_cps_str = " ".join(f"U+{cp:04X}" for cp in tgt_cps)
    except (ValueError, OverflowError):
        src_char = "?"
        tgt_str = "?"
        tgt_cps_str = ""
    
    print(f"U+{src_cp:04X}       {src_char!r:<6} {tgt_str!r} ({tgt_cps_str})")

Case folding expansions (1 char → multiple chars): 104
Source       Char   Expands to
----------------------------------------------------------------------
U+00DF       'ß'    'ss' (U+0073 U+0073)
U+0130       'İ'    'i̇' (U+0069 U+0307)
U+0149       'ŉ'    'ʼn' (U+02BC U+006E)
U+01F0       'ǰ'    'ǰ' (U+006A U+030C)
U+0390       'ΐ'    'ΐ' (U+03B9 U+0308 U+0301)
U+03B0       'ΰ'    'ΰ' (U+03C5 U+0308 U+0301)
U+0587       'և'    'եւ' (U+0565 U+0582)
U+1E96       'ẖ'    'ẖ' (U+0068 U+0331)
U+1E97       'ẗ'    'ẗ' (U+0074 U+0308)
U+1E98       'ẘ'    'ẘ' (U+0077 U+030A)
U+1E99       'ẙ'    'ẙ' (U+0079 U+030A)
U+1E9A       'ẚ'    'aʾ' (U+0061 U+02BE)
U+1E9E       'ẞ'    'ss' (U+0073 U+0073)
U+1F50       'ὐ'    'ὐ' (U+03C5 U+0313)
U+1F52       'ὒ'    'ὒ' (U+03C5 U+0313 U+0300)
U+1F54       'ὔ'    'ὔ' (U+03C5 U+0313 U+0301)
U+1F56       'ὖ'    'ὖ' (U+03C5 U+0313 U+0342)
U+1F80       'ᾀ'    'ἀι' (U+1F00 U+03B9)
U+1F81       'ᾁ'    'ἁι' (U+1F01 U+03B9)
U+1F82       'ᾂ'    'ἂι'

## 2. Normalization Analysis (All 4 Forms)

Unicode defines 4 normalization forms:
- **NFC** (Canonical Composition) - most common for text interchange
- **NFD** (Canonical Decomposition) - fully decomposed canonical form
- **NFKC** (Compatibility Composition) - compatibility + canonical composition
- **NFKD** (Compatibility Decomposition) - compatibility + canonical decomposition

Characters have Quick_Check properties indicating if they're already normalized:
- `Y` (Yes) - definitely normalized
- `N` (No) - definitely not normalized
- `M` (Maybe) - need to check context

In [8]:
# Load normalization properties
norm_props = get_normalization_props(UNICODE_VERSION)
print(f"Characters with normalization properties: {len(norm_props):,}")

Downloading Unicode 17.0.0 DerivedNormalizationProps.txt from https://www.unicode.org/Public/17.0.0/ucd/DerivedNormalizationProps.txt...
Cached to /tmp/DerivedNormalizationProps-17.0.0.txt
Characters with normalization properties: 22,417


In [9]:
# Analyze Quick_Check properties
qc_forms = ['NFC_QC', 'NFD_QC', 'NFKC_QC', 'NFKD_QC']

print("Quick_Check property distribution:")
print("=" * 60)

for form in qc_forms:
    values = Counter()
    for cp, props in norm_props.items():
        if form in props:
            values[props[form]] += 1
    
    print(f"\n{form}:")
    for val, count in sorted(values.items()):
        print(f"  {val}: {count:,} characters")

Quick_Check property distribution:

NFC_QC:
  M: 132 characters
  N: 1,120 characters

NFD_QC:
  N: 13,253 characters

NFKC_QC:
  M: 132 characters
  N: 4,965 characters

NFKD_QC:
  N: 17,086 characters


In [10]:
# Find characters that are NOT stable under each normalization form
unstable = {form: set() for form in qc_forms}

for cp, props in norm_props.items():
    for form in qc_forms:
        if form in props and props[form] != 'Y':
            unstable[form].add(cp)

print("Characters unstable under each normalization form:")
print("=" * 50)
for form in qc_forms:
    print(f"{form}: {len(unstable[form]):,} characters")

# Characters unstable under ANY form
unstable_any = set()
for s in unstable.values():
    unstable_any.update(s)
print(f"\nUnstable under ANY form: {len(unstable_any):,} characters")

Characters unstable under each normalization form:
NFC_QC: 1,252 characters
NFD_QC: 13,253 characters
NFKC_QC: 5,097 characters
NFKD_QC: 17,086 characters

Unstable under ANY form: 17,206 characters


In [11]:
# Show some examples of unstable characters
print("Examples of normalization-unstable characters:")
print("=" * 80)
print(f"{'Codepoint':<12} {'Char':<6} {'NFC':<6} {'NFD':<6} {'NFKC':<6} {'NFKD':<6} {'Name'}")
print("-" * 80)

shown = 0
for cp in sorted(unstable_any):
    if shown >= 25:
        break
    props = norm_props.get(cp, {})
    try:
        char = chr(cp)
        name = unicodedata.name(char, "")[:30]
    except (ValueError, OverflowError):
        char = "?"
        name = ""
    
    nfc = props.get('NFC_QC', 'Y')
    nfd = props.get('NFD_QC', 'Y')
    nfkc = props.get('NFKC_QC', 'Y')
    nfkd = props.get('NFKD_QC', 'Y')
    
    print(f"U+{cp:04X}       {char!r:<6} {nfc:<6} {nfd:<6} {nfkc:<6} {nfkd:<6} {name}")
    shown += 1

Examples of normalization-unstable characters:
Codepoint    Char   NFC    NFD    NFKC   NFKD   Name
--------------------------------------------------------------------------------
U+00A0       '\xa0' Y      Y      N      N      NO-BREAK SPACE
U+00A8       '¨'    Y      Y      N      N      DIAERESIS
U+00AA       'ª'    Y      Y      N      N      FEMININE ORDINAL INDICATOR
U+00AF       '¯'    Y      Y      N      N      MACRON
U+00B2       '²'    Y      Y      N      N      SUPERSCRIPT TWO
U+00B3       '³'    Y      Y      N      N      SUPERSCRIPT THREE
U+00B4       '´'    Y      Y      N      N      ACUTE ACCENT
U+00B5       'µ'    Y      Y      N      N      MICRO SIGN
U+00B8       '¸'    Y      Y      N      N      CEDILLA
U+00B9       '¹'    Y      Y      N      N      SUPERSCRIPT ONE
U+00BA       'º'    Y      Y      N      N      MASCULINE ORDINAL INDICATOR
U+00BC       '¼'    Y      Y      N      N      VULGAR FRACTION ONE QUARTER
U+00BD       '½'    Y      Y      N      N    

## 3. Anchor Point Analysis for Search

Now we identify characters that make good "anchor points" for search algorithms.
A good anchor point should:
1. Not change under case folding
2. Not change under normalization (all 4 forms)
3. Not be a target of many other characters' transformations
4. Have good UTF-8 byte properties for SIMD matching

In [12]:
# Load all characters from XML for comprehensive analysis
root = get_unicode_xml_data(UNICODE_VERSION)
chars = [elem for elem in root.iter() if elem.tag.endswith('char')]

all_codepoints = set()
char_info = {}  # cp -> {name, gc, ...}

for elem in chars:
    if 'cp' in elem.attrib:
        cp = int(elem.attrib['cp'], 16)
        all_codepoints.add(cp)
        char_info[cp] = {
            'name': elem.attrib.get('na', '').strip(),
            'gc': elem.attrib.get('gc', '').strip(),
        }

print(f"Total codepoints in Unicode {UNICODE_VERSION}: {len(all_codepoints):,}")

Using cached Unicode 17.0.0 UCD XML: /tmp/ucd-17.0.0.all.flat.xml
Total codepoints in Unicode 17.0.0: 159,866


In [13]:
def compute_anchor_score(cp: int) -> dict:
    """
    Compute an anchor point score for a codepoint.
    Returns a dict with component scores and total score.
    Higher score = better anchor point.
    """
    score = {
        'case_fold_stable': 0,     # Does not change under case folding
        'case_fold_target': 0,     # Not targeted by other characters
        'norm_stable': 0,          # Stable under all normalization forms
        'not_combining': 0,        # Not a combining mark
        'byte_entropy': 0,         # Good UTF-8 byte properties
        'total': 0
    }
    
    # Case folding stability: +2 if not in fold table (maps to itself)
    if cp not in case_folds:
        score['case_fold_stable'] = 2
    
    # Case folding target: +2 if not targeted, -1 per additional source
    if cp not in target_frequency:
        score['case_fold_target'] = 2
    else:
        freq = target_frequency[cp]
        score['case_fold_target'] = max(-2, 1 - freq)  # Penalize heavily-targeted chars
    
    # Normalization stability: +1 for each stable form, +2 bonus if all stable
    props = norm_props.get(cp, {})
    stable_count = 0
    for form in qc_forms:
        if props.get(form, 'Y') == 'Y':
            stable_count += 1
    score['norm_stable'] = stable_count
    if stable_count == 4:
        score['norm_stable'] += 2  # Bonus for fully stable
    
    # Not combining mark: +2 if not a combining mark (gc != Mn, Mc, Me)
    info = char_info.get(cp, {})
    gc = info.get('gc', '')
    if gc not in ('Mn', 'Mc', 'Me'):
        score['not_combining'] = 2
    
    # UTF-8 byte entropy
    # ASCII (0-127): highest entropy, single byte, +3
    # 2-byte sequences (128-2047): continuation bytes are diverse, +2
    # 3-byte sequences: lead byte has fewer info bits, +1
    # 4-byte sequences: lead byte has only 3 info bits, +0
    if cp < 128:
        score['byte_entropy'] = 3
    elif cp < 2048:
        score['byte_entropy'] = 2
    elif cp < 65536:
        score['byte_entropy'] = 1
    else:
        score['byte_entropy'] = 0
    
    score['total'] = sum(v for k, v in score.items() if k != 'total')
    return score

In [14]:
# Compute anchor scores for all codepoints
anchor_scores = {}
for cp in all_codepoints:
    anchor_scores[cp] = compute_anchor_score(cp)

# Sort by total score descending
sorted_anchors = sorted(anchor_scores.items(), key=lambda x: -x[1]['total'])

print(f"Computed anchor scores for {len(anchor_scores):,} codepoints")
print(f"\nScore distribution:")
score_dist = Counter(s['total'] for s in anchor_scores.values())
for score, count in sorted(score_dist.items(), reverse=True)[:10]:
    print(f"  Score {score}: {count:,} characters")

Computed anchor scores for 159,866 codepoints

Score distribution:
  Score 15: 76 characters
  Score 14: 622 characters
  Score 13: 37,174 characters
  Score 12: 101,127 characters
  Score 11: 1,896 characters
  Score 10: 1,813 characters
  Score 9: 13,511 characters
  Score 8: 1,829 characters
  Score 7: 1,150 characters
  Score 6: 596 characters


In [15]:
# Display best anchor points
print("Best anchor points for search:")
print("=" * 90)
print(f"{'CP':<10} {'Char':<6} {'Total':<6} {'Fold':<5} {'Tgt':<5} {'Norm':<5} {'Comb':<5} {'Byte':<5} {'Name'}")
print("-" * 90)

for cp, scores in sorted_anchors[:50]:
    try:
        char = chr(cp)
        # Skip control characters for display
        if unicodedata.category(char) in ('Cc', 'Cf', 'Cs', 'Co', 'Cn'):
            char = '·'
        name = unicodedata.name(char, char_info.get(cp, {}).get('name', ''))[:25]
    except (ValueError, OverflowError):
        char = "?"
        name = ""
    
    print(f"U+{cp:04X}    {char!r:<6} {scores['total']:<6} "
          f"{scores['case_fold_stable']:<5} {scores['case_fold_target']:<5} "
          f"{scores['norm_stable']:<5} {scores['not_combining']:<5} "
          f"{scores['byte_entropy']:<5} {name}")

Best anchor points for search:
CP         Char   Total  Fold  Tgt   Norm  Comb  Byte  Name
------------------------------------------------------------------------------------------
U+0000    '·'    15     2     2     6     2     3     MIDDLE DOT
U+0001    '·'    15     2     2     6     2     3     MIDDLE DOT
U+0002    '·'    15     2     2     6     2     3     MIDDLE DOT
U+0003    '·'    15     2     2     6     2     3     MIDDLE DOT
U+0004    '·'    15     2     2     6     2     3     MIDDLE DOT
U+0005    '·'    15     2     2     6     2     3     MIDDLE DOT
U+0006    '·'    15     2     2     6     2     3     MIDDLE DOT
U+0007    '·'    15     2     2     6     2     3     MIDDLE DOT
U+0008    '·'    15     2     2     6     2     3     MIDDLE DOT
U+0009    '·'    15     2     2     6     2     3     MIDDLE DOT
U+000A    '·'    15     2     2     6     2     3     MIDDLE DOT
U+000B    '·'    15     2     2     6     2     3     MIDDLE DOT
U+000C    '·'    15     2     2     6 

In [16]:
# Focus on ASCII anchors (most useful for fast paths)
print("ASCII Anchor Analysis (printable characters):")
print("=" * 80)
print(f"{'Char':<6} {'Score':<6} {'Fold':<5} {'Tgt':<5} {'Notes'}")
print("-" * 80)

for cp in range(32, 127):
    if cp not in anchor_scores:
        continue
    scores = anchor_scores[cp]
    char = chr(cp)
    
    notes = []
    if cp in case_folds:
        tgt = case_folds[cp]
        notes.append(f"folds to {chr(tgt[0])}")
    if cp in target_frequency:
        notes.append(f"target of {target_frequency[cp]} char(s)")
    
    notes_str = "; ".join(notes) if notes else "STABLE - good anchor"
    
    print(f"{char!r:<6} {scores['total']:<6} {scores['case_fold_stable']:<5} "
          f"{scores['case_fold_target']:<5} {notes_str}")

ASCII Anchor Analysis (printable characters):
Char   Score  Fold  Tgt   Notes
--------------------------------------------------------------------------------
' '    15     2     2     STABLE - good anchor
'!'    15     2     2     STABLE - good anchor
'"'    15     2     2     STABLE - good anchor
'#'    15     2     2     STABLE - good anchor
'$'    15     2     2     STABLE - good anchor
'%'    15     2     2     STABLE - good anchor
'&'    15     2     2     STABLE - good anchor
"'"    15     2     2     STABLE - good anchor
'('    15     2     2     STABLE - good anchor
')'    15     2     2     STABLE - good anchor
'*'    15     2     2     STABLE - good anchor
'+'    15     2     2     STABLE - good anchor
','    15     2     2     STABLE - good anchor
'-'    15     2     2     STABLE - good anchor
'.'    15     2     2     STABLE - good anchor
'/'    15     2     2     STABLE - good anchor
'0'    15     2     2     STABLE - good anchor
'1'    15     2     2     STABLE - good an

In [17]:
# Identify perfect ASCII anchors (highest score, no transformations)
perfect_ascii_anchors = []
for cp in range(32, 127):
    if cp not in anchor_scores:
        continue
    scores = anchor_scores[cp]
    # Perfect anchor: not in fold table, not a target, all norm stable
    if (scores['case_fold_stable'] == 2 and 
        scores['case_fold_target'] == 2 and
        scores['norm_stable'] >= 6):
        perfect_ascii_anchors.append((cp, chr(cp)))

print(f"\nPerfect ASCII anchors ({len(perfect_ascii_anchors)} characters):")
print("These are ideal for fast-path case-insensitive/normalized search:")
print()

# Group by category
digits = [c for cp, c in perfect_ascii_anchors if c.isdigit()]
punct = [c for cp, c in perfect_ascii_anchors if not c.isalnum()]
lower = [c for cp, c in perfect_ascii_anchors if c.islower()]
upper = [c for cp, c in perfect_ascii_anchors if c.isupper()]

print(f"Digits: {''.join(digits)}")
print(f"Lowercase (stable): {''.join(lower) if lower else '(none - all have uppercase variants)'}")
print(f"Uppercase (stable): {''.join(upper) if upper else '(none - all fold to lowercase)'}")
print(f"Punctuation/Symbols: {''.join(punct)}")


Perfect ASCII anchors (43 characters):
These are ideal for fast-path case-insensitive/normalized search:

Digits: 0123456789
Lowercase (stable): (none - all have uppercase variants)
Uppercase (stable): (none - all fold to lowercase)
Punctuation/Symbols:  !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [18]:
# Summary statistics
print("\n" + "=" * 60)
print("SUMMARY: Anchor Point Statistics")
print("=" * 60)

max_score = max(s['total'] for s in anchor_scores.values())
perfect_anchors = sum(1 for s in anchor_scores.values() if s['total'] == max_score)

print(f"\nMaximum possible anchor score: {max_score}")
print(f"Characters with perfect score: {perfect_anchors:,}")

# Characters stable under all conditions
fully_stable = sum(1 for s in anchor_scores.values() 
                   if s['case_fold_stable'] == 2 and 
                      s['case_fold_target'] == 2 and
                      s['norm_stable'] >= 6)
print(f"Characters fully stable (no case/norm transformations): {fully_stable:,}")

print(f"\nFor search algorithm design:")
print(f"  - Use ASCII digits and punctuation as primary anchors")
print(f"  - Lowercase letters are targets of uppercase folding")
print(f"  - Avoid combining marks (gc=Mn/Mc/Me) as anchors")
print(f"  - Prefer single-byte UTF-8 (ASCII) for SIMD efficiency")


SUMMARY: Anchor Point Statistics

Maximum possible anchor score: 15
Characters with perfect score: 76
Characters fully stable (no case/norm transformations): 140,592

For search algorithm design:
  - Use ASCII digits and punctuation as primary anchors
  - Lowercase letters are targets of uppercase folding
  - Avoid combining marks (gc=Mn/Mc/Me) as anchors
  - Prefer single-byte UTF-8 (ASCII) for SIMD efficiency
