# Benchmarks for Token-Level String Operations

We will run the benchmarks on a real-world dataset of English words. Feel free to replace it with your favorite dataset :)

In [1]:
!wget --no-clobber -O ../leipzig1M.txt https://introcs.cs.princeton.edu/python/42sort/leipzig1m.txt

File ‘../leipzig1M.txt’ already there; not retrieving.


In [10]:
# Read the whole corpus and close the file handle right away
with open("../leipzig1M.txt", "r") as file:
    text = file.read()
words = tuple(text.split())
print(f"{len(words):,} words")

21,191,455 words


## Hashing

### Throughput

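Let's check how long it takes the default Python implementation to hash the entire dataset. Note that each measurement uses a single run (`-n 1 -r 1`). CPython caches a string's hash after the first computation, so repeated runs on the same objects would mostly measure the cached lookup.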

In [12]:
%%timeit -n 1 -r 1
text.__hash__()

36.6 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [13]:
%%timeit -n 1 -r 1
for word in words: word.__hash__()

1.09 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


Let's compare to StringZilla's implementation.

In [6]:
import stringzilla as sz

In [15]:
%%timeit -n 1 -r 1
sz.hash(text)

19.1 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [16]:
%%timeit -n 1 -r 1
for word in words: sz.hash(word)

1.02 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
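
Dividing the loop timings above by the 21,191,455 words gives a rough per-call cost. The sketch below only redoes that arithmetic with the numbers reported in this notebook; the `timings` dictionary is an illustrative name, and single-run timings like these are noisy.

```python
# Per-call cost derived from the single-run timings reported above
words_count = 21_191_455
timings = {"str.__hash__": 1.09, "sz.hash": 1.02}  # seconds for the whole loop

per_call_ns = {name: seconds / words_count * 1e9 for name, seconds in timings.items()}
for name, ns in per_call_ns.items():
    print(f"{name}: {ns:.1f} ns per word")
```

Both land around 50 ns per word, which suggests that for short strings the cost is dominated by the Python-level loop and call overhead rather than by the hash function itself. On the single large string, by contrast, `sz.hash` took roughly half the time (19.1 ms vs 36.6 ms).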


### Quality and Collision Frequency

One of the most important qualities of a hash function is its resistance to collisions. Let's check how many collisions we get on the dataset.
For that, we will create a bitset using NumPy with more than `len(words)` slots, two per word in the dataset (NumPy stores each boolean as one byte, so it is a byte array in practice). Then, we will hash each word and set the slot at the hash value modulo the number of slots. Finally, we will count the populated slots. Every unique word that fails to claim a slot of its own is a collision, so the fewer slots end up populated, the weaker the function.

In [22]:
import numpy as np

def count_populated(words, hasher) -> int:
    """Count the distinct slots hit when hashing `words` into `2 * len(words)` slots."""
    slots_count = len(words) * 2
    bitset = np.zeros(slots_count, dtype=bool)

    # Hash each word and set the corresponding slot in the bitset
    for word in words:
        hash_value = hasher(word) % slots_count
        bitset[hash_value] = True

    # Count the populated slots as a plain Python integer
    return int(np.count_nonzero(bitset))

In [23]:
unique_words = set(words)
print(f"{len(unique_words):,} unique words")

534,580 unique words


In [24]:
populated_default = count_populated(words, hash)
collisions_default = len(unique_words) - populated_default
print(f"Collisions for `hash`: {collisions_default:,} ~ {collisions_default / len(unique_words):.4%}")

Collisions for `hash`: 3,414 ~ 0.6386%


In [25]:
populated_stringzilla = count_populated(words, sz.hash)
collisions_stringzilla = len(unique_words) - populated_stringzilla
print(f"Collisions for `sz.hash`: {collisions_stringzilla:,} ~ {collisions_stringzilla / len(unique_words):.4%}")

Collisions for `sz.hash`: 3,302 ~ 0.6177%
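
To judge these counts, it helps to know what an ideal hash would do. If each of `k` distinct keys lands in one of `m` slots uniformly at random, the expected number of populated slots is `m * (1 - (1 - 1/m)**k)`, and the expected number of collisions is `k` minus that. The helper below, `expected_collisions`, is a name introduced here for illustration and is not part of StringZilla or the benchmark itself.

```python
import math

def expected_collisions(keys_count: int, slots_count: int) -> float:
    """Expected collisions when `keys_count` distinct keys are thrown
    uniformly at random into `slots_count` slots."""
    # Probability that a given slot stays empty: (1 - 1/m)^k, via log1p for accuracy
    empty_probability = math.exp(keys_count * math.log1p(-1.0 / slots_count))
    expected_populated = slots_count * (1.0 - empty_probability)
    return keys_count - expected_populated

# Numbers reported above: 534,580 unique words, 2 * 21,191,455 slots
baseline = expected_collisions(534_580, 2 * 21_191_455)
print(f"Expected collisions for an ideal hash: {baseline:,.0f}")
```

The baseline comes out at around 3,360 collisions, so both 3,414 for `hash` and 3,302 for `sz.hash` are close to what a uniformly random assignment would give on this dataset.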


Benchmarks on small datasets may not be very representative. Let's generate about 67 million (`4 * 1024 * 1024 * 16`) unique strings of varying length and check the quality of the hash functions on them. To keep memory usage low, let's define a generator function that produces the strings on the fly. Each string is the decimal representation of an integer from 0 up to, but not including, `n`.

In [30]:
def count_populated_synthetic(make_generator, n, hasher) -> int:
    """Count the distinct slots hit when hashing `n` generated strings into `2 * n` slots."""
    slots_count = n * 2
    bitset = np.zeros(slots_count, dtype=bool)

    # Hash each generated string and set the corresponding slot in the bitset
    for word in make_generator(n):
        hash_value = hasher(word) % slots_count
        bitset[hash_value] = True

    # Count the populated slots as a plain Python integer
    return int(np.count_nonzero(bitset))

#### Base10 Numbers

In [31]:
n = 4 * 1024 * 1024 * 16  # 67,108,864 strings

In [32]:
def generate_printed_numbers_until(n):
    """Generator function yielding the decimal strings of integers from 0 to n - 1."""
    for i in range(n):
        yield str(i)

In [33]:
populated_default = count_populated_synthetic(generate_printed_numbers_until, n, hash)
collisions_default = n - populated_default
print(f"Collisions for `hash`: {collisions_default:,} ~ {collisions_default / n:.4%}")

Collisions for `hash`: 14,298,109 ~ 21.3058%


In [34]:
populated_sz = count_populated_synthetic(generate_printed_numbers_until, n, sz.hash)
collisions_sz = n - populated_sz
print(f"Collisions for `sz.hash`: {collisions_sz:,} ~ {collisions_sz / n:.4%}")

Collisions for `sz.hash`: 15,519,808 ~ 23.1263%
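
Collision rates above 20% look alarming next to the 0.6% seen for English words, but most of that comes from the load factor: here `n` keys share `2 * n` slots, whereas the word experiment spread 534,580 unique keys over more than 42 million slots. For uniformly random placement, the expected collision fraction at this load factor is `2 * exp(-1/2) - 1`, about 21.3%. The simulation below is a sketch that uses Python's `random` module as a stand-in for an ideal hash; `simulate_collision_rate` and the sample size are illustrative choices, not part of the benchmark.

```python
import math
import random

def simulate_collision_rate(keys_count: int, seed: int = 0) -> float:
    """Throw `keys_count` keys into twice as many slots uniformly at random
    and return the fraction of keys that collided."""
    rng = random.Random(seed)
    slots_count = keys_count * 2
    populated = bytearray(slots_count)  # one byte per slot, 0 = empty
    for _ in range(keys_count):
        populated[rng.randrange(slots_count)] = 1
    return (keys_count - sum(populated)) / keys_count

simulated = simulate_collision_rate(200_000)
theoretical = 2 * math.exp(-0.5) - 1
print(f"simulated: {simulated:.4%}, theoretical: {theoretical:.4%}")
```

By that yardstick, the 21.3058% measured for `hash` matches an ideal hash almost exactly, while the 23.1263% measured for `sz.hash` is somewhat above it.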


#### Base64 Numbers

In [35]:
import base64

def int_to_base64(n):
    """Encode a non-negative integer as Base64 of its minimal big-endian bytes."""
    # Note: 0 needs zero bytes, so it is encoded as the empty string
    byte_length = (n.bit_length() + 7) // 8
    byte_array = n.to_bytes(byte_length, 'big')
    base64_string = base64.b64encode(byte_array)
    return base64_string.decode()

def generate_base64_numbers_until(n):
    """Generator function yielding Base64-encoded integers from 0 to n - 1."""
    for i in range(n):
        yield int_to_base64(i)
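
A quick look at what these keys look like may help when interpreting the results. The snippet repeats `int_to_base64` from the cell above so that it runs on its own; the sample values are arbitrary.

```python
import base64

def int_to_base64(n):
    byte_length = (n.bit_length() + 7) // 8
    byte_array = n.to_bytes(byte_length, 'big')
    return base64.b64encode(byte_array).decode()

# 0 needs zero bytes, so it maps to the empty string
samples = {i: int_to_base64(i) for i in (0, 1, 255, 256, 65_536)}
for number, encoded in samples.items():
    print(f"{number:>6} -> {encoded!r}")
```

Every key's length is a multiple of four, keys whose byte count is not a multiple of three end in `=` padding, and 0 becomes the empty string.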

In [36]:
populated_default = count_populated_synthetic(generate_base64_numbers_until, n, hash)
collisions_default = n - populated_default
print(f"Collisions for `hash`: {collisions_default:,} ~ {collisions_default / n:.4%}")

Collisions for `hash`: 14,295,775 ~ 21.3024%


In [37]:
populated_sz = count_populated_synthetic(generate_base64_numbers_until, n, sz.hash)
collisions_sz = n - populated_sz
print(f"Collisions for `sz.hash`: {collisions_sz:,} ~ {collisions_sz / n:.4%}")

Collisions for `sz.hash`: 21,934,451 ~ 32.6849%
