# Unicode and Perfect Hashing

Generalizing StringZilla from byte strings to UTF-8 strings requires a solid understanding of Unicode.
This notebook is a playground for exploring Unicode and the UTF-8 encoding.
Most importantly, it provides a snippet for searching for a perfect hash function over Unicode code points, which would allow more compact histograms and lookup tables for Unicode characters.
Such a hash can be a building block of any UTF-8-aware text-processing algorithm, be it Levenshtein automata or distance calculation, Aho-Corasick automata, or higher-level NLP tasks like feature extraction or text classification.

In [None]:
!python -m pip install -q numba numpy tqdm 2>/dev/null || curl -sS https://bootstrap.pypa.io/get-pip.py | python && python -m pip install -q numba numpy tqdm

In [None]:
from numba import jit as njit
from tqdm import tqdm
import numpy as np
import random
import sys

# Import shared Unicode data loading functions
sys.path.insert(0, '.')
from test_helpers import get_unicode_xml_data, UNICODE_VERSION

print(f"Using Unicode version: {UNICODE_VERSION}")

In [None]:
# Load Unicode XML data (cached)
root = get_unicode_xml_data(UNICODE_VERSION)

The XML structure contains a `<repertoire>` element with many `<char>` elements.
Each `<char>` element has attributes, including:

- `'cp'`: the code point (as hexadecimal)
- `'na'`: the character name
- `'gc'`: the general category
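
To make the layout concrete, here is a minimal sketch that parses a tiny hand-written fragment shaped like the UCD XML. The fragment, `SAMPLE`, and `sample_chars` are illustrative assumptions, not part of the real data file or of `test_helpers`; the only point is to show how the `cp`, `na`, and `gc` attributes are read and why range entries need separate handling.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment imitating the UCD XML layout (the real file is far larger).
SAMPLE = """<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
  <repertoire>
    <char cp="0041" na="LATIN CAPITAL LETTER A" gc="Lu"/>
    <char cp="03B1" na="GREEK SMALL LETTER ALPHA" gc="Ll"/>
    <char first-cp="E000" last-cp="F8FF" na="" gc="Co"/>
  </repertoire>
</ucd>"""

sample_root = ET.fromstring(SAMPLE)
sample_chars = []
for elem in sample_root.iter():
    # Tags carry the namespace in braces, e.g. '{http://...}char'.
    if not elem.tag.endswith('char'):
        continue
    # Range entries use 'first-cp'/'last-cp' and have no 'cp' attribute.
    if 'cp' in elem.attrib:
        cp = int(elem.attrib['cp'], 16)
        sample_chars.append((cp, elem.attrib['na'], elem.attrib['gc']))

for cp, na, gc in sample_chars:
    print(f"U+{cp:04X} {chr(cp)!r}: {na} ({gc})")
```

Only the two entries with an explicit `cp` survive; the range entry is skipped, which mirrors what the processing cell below does.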

In [None]:
# Use a namespace-agnostic search for all elements ending with 'char'
chars = [elem for elem in root.iter() if elem.tag.endswith('char')]

# List to hold all individually listed code points (range entries are skipped)
all_chars = []

def process_char(elem):
    """
    Process a <char> element, handling all individual code points (cp)
    and ignoring ranges (first-cp and last-cp). Appends each code point to all_chars.
    """
    if 'cp' in elem.attrib:
        cp = int(elem.attrib['cp'], 16)
        entry = {
            'cp': cp,
            'name': elem.attrib.get('na', '').strip(),
            'gc': elem.attrib.get('gc', '').strip(),
            'age': elem.attrib.get('age', '').strip()
            # You can pull additional attributes here if needed.
        }
        all_chars.append(entry)

# Process every 'char' element found
for elem in chars:
    process_char(elem)

print(f"Total code points processed (ranges skipped): {len(all_chars):,}")

The Unicode standard defines a range of 1,114,112 possible code points (from U+0000 to U+10FFFF), but only a subset of these are actually assigned characters or have specific property data.
As of Unicode version 17.0, there are ~160,000 characters with code points, covering 168 modern and historical scripts, as well as multiple symbol sets, split into [338 blocks](https://en.wikipedia.org/wiki/Unicode_block).
Let's randomly sample and print a few:

In [None]:
random_chars = random.sample(all_chars, 10)

print("Example symbols:")
for char in random_chars:
    cp = char['cp']
    na = char['name']
    gc = char['gc']
    print(f"U+{cp:04X}: {na} ({gc})")

A natural question: is that set of code points dense, or does it contain holes?

In [None]:
highest_code_point = max(char['cp'] for char in all_chars)
print(f"Highest code point: U+{highest_code_point:04X} ({highest_code_point:,})")
# Code points run from 0 to highest_code_point inclusive, hence the +1.
count_holes = highest_code_point + 1 - len(all_chars)
print(f"Number of holes: {count_holes:,}")

The presence of holes means that simply using the code point itself as a lookup index would result in a significant "memory amplification" factor, lower data locality, and a very uneven distribution of data.

In [None]:
# A direct-indexed table needs highest_code_point + 1 slots (index 0 included).
memory_amplification = (highest_code_point + 1) / len(all_chars)
print(f"Memory amplification: {memory_amplification:.1f}")

For each hash function, we want to find the smallest table size that results in no collisions.
Moreover, given how small code points are (at most 21 bits), we would prefer hash functions that rely only on 32-bit arithmetic and avoid expensive operations.
A power-of-two table size is a natural starting point, as the final stage of the hash function then reduces to a single bitwise AND.

- $2^{17} = 131072$ is the closest power of two to the number of code points, but it is too small to hold them all.
- $2^{18} = 262144$ is the next power of two and the first one that fits all code points.

The latter would still amplify memory by about 1.69x, with only 59% of the slots filled.
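
As a small illustration of why power-of-two sizes are attractive, the sketch below rounds a key count up to the next power of two and checks that reducing a hash with a bitwise AND matches the modulo. The helper `next_power_of_two` and the round `n_keys` value are assumptions made for this example, not names or figures from the notebook.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()

n_keys = 160_000  # illustrative stand-in for len(all_chars)
table_size = next_power_of_two(n_keys)
mask = table_size - 1

# For a power-of-two table size, `h % table_size` and `h & mask` agree.
for h in (0, 1, 0x1F600, 0xDEADBEEF, 0xFFFFFFFF):
    assert h % table_size == h & mask

print(f"table_size = {table_size} = 2**{table_size.bit_length() - 1}")
```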

Let's export all code points to a flat NumPy array so that, for efficiency, all hashes can be calculated at once.

In [None]:
code_points = np.array([char['cp'] for char in all_chars], dtype=np.uint32)
print(f"Memory usage for code points: {code_points.nbytes:,} bytes")

In [None]:
# ---------------------------------------------------------------------
# 1. Jenkins One-at-a-Time Hash
# ---------------------------------------------------------------------
def hash_all_jenkins(code_points: np.ndarray) -> np.ndarray:
    # Ensure input is np.uint32.
    code_points = code_points.astype(np.uint32)
    h = np.zeros_like(code_points, dtype=np.uint32)
    # Process each of the 4 bytes of the 32-bit integer.
    for shift in (0, 8, 16, 24):
        # Extract one byte at a time.
        b = (code_points >> shift) & np.uint32(0xFF)
        h = (h + b) & np.uint32(0xFFFFFFFF)
        h = (h + (h << np.uint32(10))) & np.uint32(0xFFFFFFFF)
        h = (h ^ (h >> np.uint32(6))) & np.uint32(0xFFFFFFFF)
    h = (h + (h << np.uint32(3))) & np.uint32(0xFFFFFFFF)
    h = (h ^ (h >> np.uint32(11))) & np.uint32(0xFFFFFFFF)
    h = (h + (h << np.uint32(15))) & np.uint32(0xFFFFFFFF)
    return h

# ---------------------------------------------------------------------
# 2. FNV-1a Hash (32-bit)
# ---------------------------------------------------------------------
def hash_all_fnv1a(code_points: np.ndarray) -> np.ndarray:
    # FNV-1a 32-bit parameters
    FNV_offset = np.uint32(0x811C9DC5)
    FNV_prime  = np.uint32(16777619)
    code_points = code_points.astype(np.uint32)
    h = np.full_like(code_points, FNV_offset, dtype=np.uint32)
    # Process each of the 4 bytes
    for shift in (0, 8, 16, 24):
        byte = (code_points >> shift) & np.uint32(0xFF)
        h = h ^ byte
        h = (h * FNV_prime) & np.uint32(0xFFFFFFFF)
    return h

# ---------------------------------------------------------------------
# 3. Thomas Wang's 32-bit Integer Hash
# ---------------------------------------------------------------------
def hash_all_thomas_wang(code_points: np.ndarray) -> np.ndarray:
    code_points = code_points.astype(np.uint32)
    x = code_points.copy()
    x = (x ^ np.uint32(61)) ^ (x >> np.uint32(16))
    x = (x + (x << np.uint32(3))) & np.uint32(0xFFFFFFFF)
    x = x ^ (x >> np.uint32(4))
    x = (x * np.uint32(0x27d4eb2d)) & np.uint32(0xFFFFFFFF)
    x = x ^ (x >> np.uint32(15))
    return x

# ---------------------------------------------------------------------
# 4. MurmurHash3 (x86 32-bit variant for 4-byte input)
# ---------------------------------------------------------------------
def hash_all_murmur3(code_points: np.ndarray, seed: np.uint32 = np.uint32(0)) -> np.ndarray:
    code_points = code_points.astype(np.uint32)
    c1 = np.uint32(0xcc9e2d51)
    c2 = np.uint32(0x1b873593)
    r1 = np.uint32(15)
    r2 = np.uint32(13)
    m  = np.uint32(5)
    n  = np.uint32(0xe6546b64)
    
    # Treat each 32-bit integer as 4 bytes of data.
    k = (code_points * c1) & np.uint32(0xFFFFFFFF)
    k = ((k << r1) | (k >> (32 - r1))) & np.uint32(0xFFFFFFFF)
    k = (k * c2) & np.uint32(0xFFFFFFFF)
    
    h = seed ^ k
    h = ((h << r2) | (h >> (32 - r2))) & np.uint32(0xFFFFFFFF)
    h = (h * m + n) & np.uint32(0xFFFFFFFF)
    
    # Since input length is always 4 bytes for a 32-bit integer:
    h ^= np.uint32(4)
    # Finalization mix
    h ^= (h >> np.uint32(16))
    h = (h * np.uint32(0x85ebca6b)) & np.uint32(0xFFFFFFFF)
    h ^= (h >> np.uint32(13))
    h = (h * np.uint32(0xc2b2ae35)) & np.uint32(0xFFFFFFFF)
    h ^= (h >> np.uint32(16))
    return h

In [None]:
def count_unique(x: np.ndarray) -> int:
    # This approach is about 50% faster than `len(np.unique(x))`.
    if x.size == 0:
        return 0
    xs = np.sort(x)
    return int(np.count_nonzero(np.diff(xs)) + 1)

In [None]:
def rotate_left(x: np.ndarray, r: int) -> np.ndarray:
    """Rotate left the 32-bit integers in x by r bits."""
    return ((x << np.uint32(r)) | (x >> np.uint32(32 - r))) & np.uint32(0xFFFFFFFF)

def hash_custom(code_points: np.ndarray) -> np.ndarray:
    """
    Compute a composite hash on an array of 32-bit integers.
    The hash is a combination of multiplications, rotations, and XOR mixing.
    """
    # Ensure code_points are treated as 32-bit unsigned integers.
    x = code_points.astype(np.uint32)
    
    # First mixing stage:
    # Multiply by a constant and then rotate left.
    x = (x * np.uint32(0xcc9e2d51)) & np.uint32(0xFFFFFFFF)
    x = rotate_left(x, 15)
    
    # Second stage: XOR with a constant.
    x ^= np.uint32(0x1b873593)
    
    # Third stage: Multiply and then rotate.
    x = (x * np.uint32(0x85ebca6b)) & np.uint32(0xFFFFFFFF)
    x = rotate_left(x, 13)
    
    # Fourth stage: Final XOR mix.
    x ^= np.uint32(0xc2b2ae35)
    
    # Optionally, perform one more multiplication to scramble bits further.
    x = (x * np.uint32(0x27d4eb2d)) & np.uint32(0xFFFFFFFF)
    
    return x

In [None]:
for name, func in [
    ('Jenkins One-at-a-Time', hash_all_jenkins),
    ('FNV-1a', hash_all_fnv1a),
    ("Thomas Wang's Hash", hash_all_thomas_wang),
    ('MurmurHash3', hash_all_murmur3),
    ('Custom', hash_custom),
]:
    print(f"\n{name}:")
    hashes = func(code_points)
    
    unique_hashes = count_unique(hashes)
    print(f"Unique hashes: {unique_hashes:,} = {unique_hashes / len(code_points):.4%}")
    
    # Let's estimate the number of collisions when reducing modulo the vocabulary size
    hashes_modulo_valid = hashes % len(code_points)
    unique_hashes_modulo_valid = count_unique(hashes_modulo_valid)
    print(f"Unique hashes (modulo size): {unique_hashes_modulo_valid:,} = {unique_hashes_modulo_valid / len(code_points):.4%}")
    
    # Try the next power of 2 for modulo size
    bitceil = 2 ** 18
    hashes_modulo_bitceil = hashes % bitceil
    unique_hashes_modulo_bitceil = count_unique(hashes_modulo_bitceil)
    print(f"Unique hashes (modulo {bitceil}): {unique_hashes_modulo_bitceil:,} = {unique_hashes_modulo_bitceil / len(code_points):.4%}")

We end up with a fairly high collision rate of around 37% when reducing modulo the vocabulary size, and a slightly more tolerable 25% with the next power of two.
Still, that's far from perfect hashing.
Let's try plain multiplicative hash functions and search for a multiplier that does better.
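
Those rates are close to what an idealized, uniformly random hash would give. Throwing $n$ keys into $m$ slots fills $m \left(1 - (1 - 1/m)^n\right)$ slots in expectation, so the share of keys that keep a slot of their own is that quantity divided by $n$. The sketch below evaluates it for an assumed key count of 155,000, chosen only because it roughly matches the 59% fill mentioned earlier; the function name is hypothetical.

```python
import math

def expected_unique_fraction(n_keys: int, n_slots: int) -> float:
    """Expected distinct slots per key under an idealized uniform random hash."""
    # -expm1(n * log1p(-1/m)) equals 1 - (1 - 1/m)**n, computed stably.
    occupied = n_slots * -math.expm1(n_keys * math.log1p(-1.0 / n_slots))
    return occupied / n_keys

n = 155_000  # assumed key count, for illustration only
for slots in (n, 2 ** 18):
    unique = expected_unique_fraction(n, slots)
    print(f"{slots:>7,} slots: ~{unique:.1%} unique, ~{1 - unique:.1%} colliding")
```

Under these assumptions the model predicts roughly 37% and 25% collisions, which suggests the general-purpose hashes above behave about as well as random ones, and that a perfect hash needs a function fitted to the key set.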

In [None]:
# Enumerate all 32-bit unsigned integers in random order to use as candidate multipliers.
# ? Randomizing the search order is a good idea,
# ? but materializing and shuffling 2**32 (about 4.3 billion) values is slow and memory-hungry.
# ! all_integers = np.arange(1, 2**32, dtype=np.uint32)
# ! np.random.shuffle(all_integers)
# Note: `np.random.permutation` typically returns 64-bit integers here, i.e. about 32 GiB.
all_integers = np.random.permutation(2**32)

In [None]:
all_integers = all_integers.astype(np.uint32)
print(f"Memory usage for all_integers: {all_integers.nbytes:,} bytes")
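
If materializing every 32-bit integer is not feasible, one alternative is to draw candidates lazily in small batches. The generator below is a sketch of that idea and a swapped-in technique, not what the cells above do: `random_odd_multipliers` is a hypothetical helper, it samples with replacement (so candidates may repeat, unlike a permutation), and restricting to odd values is an assumption motivated by odd multipliers being invertible modulo $2^{32}$.

```python
import itertools
import numpy as np

def random_odd_multipliers(batch_size: int = 1 << 16, seed: int = 0):
    """Yield random odd 32-bit multipliers forever, one batch at a time."""
    rng = np.random.default_rng(seed)
    while True:
        batch = rng.integers(0, 2**32, size=batch_size, dtype=np.uint32)
        # Force the lowest bit: odd multipliers are invertible modulo 2**32,
        # so the multiplication alone never merges two distinct 32-bit keys.
        yield from (batch | np.uint32(1))

# Usage: take a handful of candidates without allocating gigabytes.
first_few = list(itertools.islice(random_odd_multipliers(batch_size=4), 6))
print(first_few)
```

Memory use is bounded by `batch_size`, so the search can run until interrupted.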

In [None]:
# Brute-force search: test each candidate multiplier with vectorized NumPy operations
bitceil = 2 ** 18
for multiplier in tqdm(all_integers):
    hashes = code_points * multiplier
    
    # Check for a collision-free mapping modulo the vocabulary size
    hashes_modulo_valid = hashes % len(code_points)
    unique_hashes_modulo_valid = count_unique(hashes_modulo_valid)
    if unique_hashes_modulo_valid == len(code_points):
        print(f"Multiplier (modulo size): {multiplier}")
        break
    
    # Try the next power of 2 for modulo size
    hashes_modulo_bitceil = hashes % bitceil
    unique_hashes_modulo_bitceil = count_unique(hashes_modulo_bitceil)
    if unique_hashes_modulo_bitceil == len(code_points):
        print(f"Multiplier (modulo {bitceil}): {multiplier}")
        break

In [None]:
from numba import uint32

@njit(nopython=True)
def check_multiplier(code_points: np.ndarray, multiplier: uint32, seen_flags: np.ndarray) -> bool:
    """
    Check if the multiplier produces a perfect hash mapping
    for the given code_points with modulus `len(seen_flags)`.
    Returns True if no collisions are found, False otherwise.
    """
    # `seen_flags` must be zeroed by the caller; it marks the slots already taken.
    n = code_points.shape[0]
    modulo = uint32(len(seen_flags))
    for i in range(n):
        # The cast to uint32 truncates the product to 32 bits (wrap-around) before the modulo.
        h = uint32(code_points[i] * multiplier) % modulo
        if seen_flags[h] == 1:
            # Collision found.
            return False
        seen_flags[h] = 1
    return True
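
The same check can be tried on a toy key set in plain NumPy, which makes the behavior easier to inspect. The six keys, the table sizes, and the helper `is_perfect` below are assumptions chosen for illustration; the hash has the same shape as above, $(k \cdot a \bmod 2^{32}) \bmod m$.

```python
import numpy as np

def is_perfect(keys: np.ndarray, multiplier: int, table_size: int) -> bool:
    """True if ((key * multiplier) mod 2**32) mod table_size has no collisions."""
    # 64-bit products cannot overflow here: keys < 2**21 and multiplier < 2**32.
    slots = (keys.astype(np.uint64) * np.uint64(multiplier)) & np.uint64(0xFFFFFFFF)
    slots %= np.uint64(table_size)
    return np.unique(slots).size == keys.size

toy_keys = np.array([0x41, 0x3B6, 0x416, 0x4E2D, 0x1F600, 0x10FFFF], dtype=np.uint32)

# Non-power-of-two table: scan multipliers until one is collision-free.
found = next(m for m in range(1, 1 << 16) if is_perfect(toy_keys, m, 7))
print(f"Perfect multiplier for 7 slots: {found}")

# Power-of-two table: the low 3 bits of the product depend only on the
# low 3 bits of the key, and 0x3B6 and 0x416 share those bits.
print(any(is_perfect(toy_keys, m, 8) for m in range(1, 1 << 12)))
```

The second check hints at a limitation worth keeping in mind: when the product is reduced with a power-of-two modulus, two keys that agree in their low bits collide for every multiplier. For a $2^{18}$-slot table that affects any pair of code points differing by a multiple of $2^{18}$, such as U+20001 and U+E0001.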

In [None]:
# Reusable flag buffers, one per candidate table size
seen_modulo_vocabulary = np.zeros(len(code_points), dtype=np.uint8)
seen_modulo_bitceil = np.zeros(2 ** 18, dtype=np.uint8)

for multiplier in tqdm(all_integers):
    seen_modulo_vocabulary.fill(0)
    seen_modulo_bitceil.fill(0)

    # Test the vocabulary-size modulus first, then the next power of two
    if check_multiplier(code_points, multiplier, seen_modulo_vocabulary):
        print(f"Multiplier (modulo size): {multiplier}")
        break

    if check_multiplier(code_points, multiplier, seen_modulo_bitceil):
        print(f"Multiplier (modulo {bitceil}): {multiplier}")
        break