Matching Text with Fuzzy Characters
-----

Let's say you have the following strings:

```python
s0 = "fuzzy rabbit"
s1 = "wuzzy rabbit"
s2 = "wuzzy wabbit"
s3 = "wuzzy bear"
```

You want to cluster s0, s1, and s2 together, with s3 in a separate cluster.

This is best done at the character level (for an example of word-level semantic clustering, see [here](https://github.com/brianspiering/nlp-cookbook/blob/main/matching_text_with_embeddings.ipynb)).

Let's frame this with common terminology. Hashing maps an object (a string, in this case) to an integer. Most uses of hashing try to avoid collisions, but here you want them: similar strings should map to the same integer bucket. The goal is to pick a hash function whose collisions depend on how many characters two strings share.
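
One way to make "shared characters" precise is the Jaccard similarity of the two character sets: the size of the intersection divided by the size of the union. The sketch below computes it exactly, as a baseline for what a similarity-preserving hash should approximate. The helper name `char_jaccard` is made up for this example, and treating two empty strings as identical is a convention chosen here.

```python
def char_jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the sets of characters in a and b."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty strings count as identical
    return len(set_a & set_b) / len(set_a | set_b)


examples = {
    "s0": "fuzzy rabbit",
    "s1": "wuzzy rabbit",
    "s2": "wuzzy wabbit",
    "s3": "wuzzy bear",
}

# Compare every string against s0
for name, text in examples.items():
    print(f"s0 vs {name}: {char_jaccard(examples['s0'], text):.3f}")
# s0 vs s0: 1.000
# s0 vs s1: 0.818
# s0 vs s2: 0.727
# s0 vs s3: 0.583
```

Comparing every pair this way is quadratic in the number of strings, which is why a hashing approach is attractive at scale.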

A scalable implementation of this idea is locality-sensitive hashing (LSH).
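
LSH over sets is commonly built on MinHash signatures: for each of many hash functions, keep the minimum hash value over the set's elements. The fraction of positions where two signatures agree estimates the Jaccard similarity of the underlying sets. Here is a minimal pure-Python sketch of that idea. It uses salted SHA-1 as a stand-in family of hash functions, which is an assumption made for illustration and not how a production library does it. The function names are hypothetical.

```python
import hashlib


def minhash_signature(text: str, num_hashes: int = 256) -> list:
    """Return one minimum hash value per salted hash function.

    Assumes text is non-empty (min() over an empty set raises ValueError).
    """
    chars = set(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")  # a different salt acts as a different hash function
        signature.append(
            min(
                int.from_bytes(hashlib.sha1(salt + c.encode("utf8")).digest()[:8], "big")
                for c in chars
            )
        )
    return signature


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions where the two minima agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


sig_fuzzy = minhash_signature("fuzzy rabbit")
sig_wuzzy = minhash_signature("wuzzy rabbit")
print(estimate_jaccard(sig_fuzzy, sig_wuzzy))  # an estimate of the exact value 9/11
```

More hash functions give a tighter estimate at the cost of a longer signature.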

```python
# Use the excellent datasketch library
from datasketch import MinHash, MinHashLSH

strings = [s0, s1, s2, s3]

# Build a MinHash signature for each string, one character at a time
hashes = []
for s in strings:
    m = MinHash(num_perm=128)
    for c in s:
        m.update(c.encode("utf8"))
    hashes.append(m)

# Create LSH storage for scalable querying
# (num_perm must match the value used for the signatures)
lsh = MinHashLSH(threshold=0.8, num_perm=128)
for n, hash_value in enumerate(hashes):
    lsh.insert(f"s{n}", hash_value)

# Test that the queries for the hashed values return the expected neighbors
hash_s0 = hashes[0]
assert set(lsh.query(hash_s0)) == {"s0", "s1", "s2"}
hash_s3 = hashes[3]
assert set(lsh.query(hash_s3)) == {"s3"}
```
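
Note that `threshold=0.8` is a target for an approximate filter, not a hard cutoff. In the standard banding construction for MinHash LSH, a signature is split into `b` bands of `r` rows, and two items become candidates if any band matches exactly. For Jaccard similarity `s`, that happens with probability `1 - (1 - s**r)**b`. Below is a minimal sketch of that curve. The 16 x 8 split is hypothetical: datasketch derives its own band and row counts from the threshold, so these are not necessarily the values used above.

```python
def candidate_probability(similarity: float, bands: int, rows: int) -> float:
    """Chance that two sets with this Jaccard similarity match in at least one band."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


# Hypothetical split of a 128-value signature: 16 bands of 8 rows each
BANDS, ROWS = 16, 8

for s in (0.583, 0.727, 0.818, 1.0):
    p = candidate_probability(s, BANDS, ROWS)
    print(f"similarity {s:.3f} -> candidate probability {p:.3f}")
```

The curve rises steeply around the threshold instead of jumping from 0 to 1, so pairs close to it (the character sets of s0 and s2 have a Jaccard similarity of 8/11, about 0.727) can land on either side depending on the band split and the hash seeds.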
