# Benchmarks for Token-Level String Operations

We will run the benchmarks on a real-world text dataset split into whitespace-separated words. Feel free to replace it with your favorite dataset :)

In [1]:
!wget --no-clobber -O ../leipzig1M.txt https://introcs.cs.princeton.edu/python/42sort/leipzig1m.txt

File ‘../leipzig1M.txt’ already there; not retrieving.


In [2]:
text = open("../xlsum.csv", "r").read(1024 * 1024 * 1024)
words = text.split()
words = tuple(words)
print(f"{len(words):,} words")

171,285,845 words


In [3]:
import random

word_examples = random.choices(words, k=10)
word_lengths = [len(s.encode('utf-8')) for s in word_examples]

list(zip(word_examples, word_lengths))

[('to', 2),
 ('Châu,', 6),
 ('возможность', 22),
 ('la', 2),
 ("doesn't", 7),
 ('सकता', 12),
 ('and', 3),
 ('Interestingly,', 14),
 ('have', 4),
 ('Зрители', 14)]
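
The lengths above are UTF-8 byte counts, not character counts, which is why non-ASCII words look longer than they read. A minimal sketch of the difference, using a few of the sampled words (the list `samples` is made up for this illustration):

```python
# Character count vs. UTF-8 byte count for a few of the sampled words.
# "Ch\u00e2u," is 'Châu,' spelled with a precomposed 'â' (U+00E2).
samples = ["to", "Ch\u00e2u,", "возможность", "सकता"]

for word in samples:
    code_points = len(word)                 # number of Unicode code points
    utf8_bytes = len(word.encode("utf-8"))  # number of bytes after encoding
    print(f"{word!r}: {code_points} code points, {utf8_bytes} bytes")
```

ASCII letters take one byte each, Cyrillic letters two, and Devanagari code points three, so the byte length is the quantity that matters for hashing throughput.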

## Hashing

### Throughput

Let's check how long it takes the default Python implementation to hash the entire dataset.

In [4]:
%%timeit -n 1 -r 1
text.__hash__()

1.09 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [5]:
%%timeit -n 1 -r 1
for word in words: word.__hash__()

9.75 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


Let's compare to StringZilla's implementation.

In [6]:
import stringzilla as sz
import numpy as np

In [7]:
%%timeit -n 1 -r 1
sz.hash(text)

7.69 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [8]:
%%timeit -n 1 -r 1
for word in words: sz.hash(word)

13.7 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
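
The word-by-word timings include the cost of the Python loop and of each call, not just the hashing itself. A back-of-the-envelope conversion into nanoseconds per call, using only the numbers printed above (the helper `per_call_ns` is a made-up name for this sketch):

```python
def per_call_ns(total_seconds: float, calls: int) -> float:
    """Average wall-clock time per call, in nanoseconds."""
    return total_seconds / calls * 1e9

word_count = 171_285_845  # number of words reported by the notebook

default_ns = per_call_ns(9.75, word_count)      # loop over `word.__hash__()`
stringzilla_ns = per_call_ns(13.7, word_count)  # loop over `sz.hash(word)`

print(f"default hash: {default_ns:.1f} ns per word")
print(f"sz.hash:      {stringzilla_ns:.1f} ns per word")
```

That works out to roughly 57 ns and 80 ns per word. CPython also caches the hash of a `str` object after computing it once, which is presumably why the cells are timed with `-n 1 -r 1`: repeated runs on the same objects would measure the cached lookup instead.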


### Quality and Collision Frequency

One of the most important qualities of a hash function is its resistance to collisions. Let's check how many collisions we have in the dataset.
For that, we will create a bitset using NumPy with `2 * len(words)` slots, twice the number of words in the dataset. Then, we will hash each word and set the corresponding slot. Finally, we will count the populated slots and compare that number to the number of unique words: every unique word that fails to claim its own slot counts as a collision. The fewer slots are populated, the weaker the function.

In [9]:
def count_populated(words, hasher) -> int:
    slots_count = len(words) * 2
    bitset = np.zeros(slots_count, dtype=bool)

    # Hash each word and set the corresponding bit in the bitset
    for word in words:
        hash_value = hasher(word) % slots_count
        bitset[hash_value] = True

    # Count the number of set bits
    return np.sum(bitset)

In [10]:
unique_words = set(words)
print(f"{len(unique_words):,} unique words")

7,982,184 unique words


In [11]:
populated_default = count_populated(words, hash)
collisions_default = len(unique_words) - populated_default
print(f"Collisions for `hash`: {collisions_default:,} ~ {collisions_default / len(unique_words):.4%}")

Collisions for `hash`: 92,147 ~ 1.1544%


In [12]:
populated_stringzilla = count_populated(words, sz.hash)
collisions_stringzilla = len(unique_words) - populated_stringzilla
print(f"Collisions for `sz.hash`: {collisions_stringzilla:,} ~ {collisions_stringzilla / len(unique_words):.4%}")

Collisions for `sz.hash`: 97,183 ~ 1.2175%
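
To judge these counts, it helps to know what an ideal, uniformly random hash would produce under the same measurement. A minimal sketch of that baseline, assuming each unique key lands in a uniformly random slot (the function name `expected_collisions` is made up for this illustration):

```python
import math

def expected_collisions(unique_keys: int, slots: int) -> float:
    """Expected number of unique keys that fail to claim their own slot
    when each key is placed into one of `slots` slots uniformly at random."""
    # A given slot stays empty with probability (1 - 1/slots) ** unique_keys.
    # log1p/expm1 keep the arithmetic accurate when 1/slots is tiny.
    log_empty = unique_keys * math.log1p(-1.0 / slots)
    expected_populated = -slots * math.expm1(log_empty)
    return unique_keys - expected_populated

# Sizes reported by the notebook for the word dataset.
total_words = 171_285_845
unique_word_count = 7_982_184
baseline = expected_collisions(unique_word_count, 2 * total_words)
print(f"Ideal hash baseline: {baseline:,.0f} ~ {baseline / unique_word_count:.4%}")
```

When the number of slots is exactly twice the number of unique keys, the same formula gives about 21.3% collisions, so a result near that figure indicates close-to-random behavior rather than a weak hash.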


#### Base10 Numbers

Benchmarks on small datasets may not be very representative. Let's generate `n` unique strings of different lengths, with `n` in the hundreds of millions, and check the quality of the hash function on them. To make that efficient, let's define a generator function that produces the strings on the fly. Each string is the decimal representation of an integer from 0 to `n - 1`.

In [13]:
def count_populated_synthetic(make_generator, n, hasher) -> int:
    slots_count = n * 2
    bitset = np.zeros(slots_count, dtype=bool)

    # Hash each word and set the corresponding bit in the bitset
    for word in make_generator(n):
        hash_value = hasher(word) % (slots_count)
        bitset[hash_value] = True

    # Count the number of set bits
    return np.sum(bitset)
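
A NumPy array with `dtype=bool` spends one byte per slot, so the "bitset" above is really a byte set. If memory becomes the limit, one bit per slot is enough. The class below is a hypothetical sketch of that idea, not part of the notebook or of StringZilla; `PackedBitset` and its methods are made-up names:

```python
import numpy as np

class PackedBitset:
    """Hypothetical helper: one bit per slot, packed into a uint8 array."""

    def __init__(self, slots_count: int):
        self.slots_count = slots_count
        # Round up to a whole number of bytes.
        self.storage = np.zeros((slots_count + 7) // 8, dtype=np.uint8)

    def set(self, index: int) -> None:
        # Byte index is index // 8, bit position inside the byte is index % 8.
        self.storage[index >> 3] |= np.uint8(1 << (index & 7))

    def count(self) -> int:
        # Popcount of every possible byte value, used as a lookup table.
        table = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)
        return int(table[self.storage].sum(dtype=np.int64))

def count_populated_packed(words, hasher, slots_count: int) -> int:
    """Same measurement as above, but with one bit per slot."""
    bitset = PackedBitset(slots_count)
    for word in words:
        bitset.set(hasher(word) % slots_count)
    return bitset.count()
```

This uses an eighth of the memory at the price of extra bit arithmetic on every update, so it is likely slower per word in pure Python.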

In [14]:
n = 256 * 1024 * 1024

In [15]:
def generate_printed_numbers_until(n):
    """Generator expression to yield strings of printed integers from 0 to n."""
    for i in range(n):
        yield str(i)

In [16]:
populated_default = count_populated_synthetic(generate_printed_numbers_until, n, hash)
collisions_default = n - populated_default
print(f"Collisions for `hash`: {collisions_default:,} ~ {collisions_default / n:.4%}")

Collisions for `hash`: 57,197,029 ~ 21.3076%


In [17]:
populated_sz = count_populated_synthetic(generate_printed_numbers_until, n, sz.hash)
collisions_sz = n - populated_sz
print(f"Collisions for `sz.hash`: {collisions_sz:,} ~ {collisions_sz / n:.4%}")

Collisions for `sz.hash`: 57,820,385 ~ 21.5398%


#### Base64 Numbers

In [18]:
import base64

def int_to_base64(n):
    byte_length = (n.bit_length() + 7) // 8
    byte_array = n.to_bytes(byte_length, 'big')
    base64_string = base64.b64encode(byte_array)
    return base64_string.decode() 

def generate_base64_numbers_until(n):
    """Generator expression to yield strings of printed integers from 0 to n."""
    for i in range(n):
        yield int_to_base64(i)
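
To see what kind of inputs this generator feeds to the hash functions, a few boundary values can be encoded by hand. The sketch below repeats the notebook's `int_to_base64` so that it runs on its own; the chosen values are only illustrative:

```python
import base64

def int_to_base64(n: int) -> str:
    # Same logic as the notebook cell: minimal big-endian bytes, then Base64.
    byte_length = (n.bit_length() + 7) // 8
    return base64.b64encode(n.to_bytes(byte_length, "big")).decode()

# Values around the 1-, 2-, 3- and 4-byte boundaries.
boundary_values = (0, 1, 255, 256, 65_535, 65_536, 16_777_216)
encoded = {value: int_to_base64(value) for value in boundary_values}

for value, text in encoded.items():
    print(f"{value:>10,} -> {text!r}")
```

Zero maps to the empty string because its bit length is 0, and every other input is 4 or 8 characters long, often ending in `=` padding. The inputs are therefore short and share a lot of structure.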

In [19]:
populated_default = count_populated_synthetic(generate_base64_numbers_until, n, hash)
collisions_default = n - populated_default
print(f"Collisions for `hash`: {collisions_default:,} ~ {collisions_default / n:.4%}")

Collisions for `hash`: 57,197,478 ~ 21.3077%


In [20]:
populated_sz = count_populated_synthetic(generate_base64_numbers_until, n, sz.hash)
collisions_sz = n - populated_sz
print(f"Collisions for `sz.hash`: {collisions_sz:,} ~ {collisions_sz / n:.4%}")

Collisions for `sz.hash`: 224,998,905 ~ 83.8186%


#### Base256 Representations

In [21]:
def generate_base256_numbers_until(n) -> bytes:
    """Generator a 4-byte long binary string wilth all possible values of `uint32_t` until value `n`."""
    for i in range(n):
        yield i.to_bytes(4, byteorder='big')

In [22]:
populated_default = count_populated_synthetic(generate_base256_numbers_until, n, hash)
collisions_default = n - populated_default
print(f"Collisions for `hash`: {collisions_default:,} ~ {collisions_default / n:.4%}")

Collisions for `hash`: 57,195,602 ~ 21.3070%


In [23]:
populated_sz = count_populated_synthetic(generate_base256_numbers_until, n, sz.hash)
collisions_sz = n - populated_sz
print(f"Collisions for `sz.hash`: {collisions_sz:,} ~ {collisions_sz / n:.4%}")

Collisions for `sz.hash`: 59,905,848 ~ 22.3167%


#### Protein Sequences

The same experiment can be repeated on longer inputs drawn from a tiny alphabet. Let's generate `n` random 300-character strings over the 4-letter alphabet `ACGT` on the fly and check the quality of the hash function on them. The strings are sampled at random, so uniqueness is not guaranteed, but with `4 ** 300` possible strings duplicates are practically impossible.

In [24]:
n = 1 * 1024 * 1024

In [25]:
def generate_proteins(n):
    """Generator expression to yield strings of printed integers from 0 to n."""
    alphabet = 'ACGT'
    for _ in range(n):
        yield ''.join(random.choices(alphabet, k=300))


In [26]:
populated_default = count_populated_synthetic(generate_proteins, n, hash)
collisions_default = n - populated_default
print(f"Collisions for `hash`: {collisions_default:,} ~ {collisions_default / n:.4%}")

Collisions for `hash`: 223,848 ~ 21.3478%


In [27]:
populated_sz = count_populated_synthetic(generate_proteins, n, sz.hash)
collisions_sz = n - populated_sz
print(f"Collisions for `sz.hash`: {collisions_sz:,} ~ {collisions_sz / n:.4%}")

Collisions for `sz.hash`: 223,995 ~ 21.3618%
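
Strictly speaking, `ACGT` is the four-letter nucleotide alphabet, while protein sequences are usually written with the one-letter codes of the 20 standard amino acids. A possible variation of the generator for that alphabet is sketched below; `generate_amino_acid_sequences`, its `length` and `seed` parameters, and the constant `AMINO_ACIDS` are assumptions made for this example, not part of the notebook:

```python
import random

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def generate_amino_acid_sequences(n, length=300, seed=None):
    """Yield n random strings of `length` amino-acid letters each."""
    rng = random.Random(seed)  # private generator, reproducible when seeded
    for _ in range(n):
        yield "".join(rng.choices(AMINO_ACIDS, k=length))
```

Because the extra parameters have defaults, the function can be passed to `count_populated_synthetic` in place of `generate_proteins` without other changes.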

