# Homework 1: Locally Sensitive Hashing
Locality Sensitive Hashing (LSH) is a technique used in computer science to solve the approximate or exact Near Neighbor Search in high-dimensional spaces. It is used to find similar items in a large dataset by hashing input items so that similar items map to the same "buckets" with high probability. LSH is commonly used in recommendation systems, image and audio recognition, and data mining.

In this particular notebook we will implement a simplified version of the LSH algorithm for to compare texts and find how similar are. We will implement 4 classes which will help us to compute how similar 2 texts are. Those classes are: Shingling, CompareSets, MinHasing and CompareSignatures.

## Dataset
As a dataset, we have used the following texts:
 - Lorem Ipsum with 5 paragraphs (https://www.lipsum.com/)\[1.txt\]
 - Lorem Ipsum with 7 paragraphs (https://www.lipsum.com/)\[2.txt\]
 - Quijote de la Mancha by Miguel de Cervantes (https://www.gutenberg.org/cache/epub/60884/pg60884.txt)\[3.txt\]
 - The Adventures of Sherlock Holmes by Arthur Conan Doyle (https://www.gutenberg.org/cache/epub/1661/pg1661.txt)\[4.txt\]
 - The picture of Dorian Gray by Oscar Wilde (https://www.gutenberg.org/cache/epub/174/pg174.txt)\[5.txt\]
 - Beyond good and evil by Friedrich Wilheim Nietzsche (https://www.gutenberg.org/cache/epub/4363/pg4363.txt)\[6.txt\]

This dataset has been selected like this, so it has two text which should be fairly similar (1 & 2), a text which should be fairly different (3) and three texts which should have something in common even though they may be different (4, 5 & 6).

In this notebook we have implemented 4 classes which we will help us to compute the similarity of 2 texts. Those classes are: Shingling, CompareSets, MinHashing and CompareSignatures.

In [1]:
import os

dataset_path: str = "dataset"

texts: list[str] = []

for filename in os.listdir(dataset_path):
    if filename.endswith(".txt"):
        with open(os.path.join(dataset_path, filename), "r") as f:
            text = f.read()
            texts.append(text)

print(texts[4])


Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam placerat nisi quis tempus semper. Fusce eu dapibus leo, id convallis orci. Vestibulum a ligula nulla. Quisque tincidunt sem hendrerit arcu pulvinar convallis. Quisque pellentesque diam vel nunc condimentum, in pulvinar mi volutpat. In vitae lorem justo. Nullam id gravida libero. Nunc ligula mauris, tincidunt eu nisl eu, tincidunt pharetra purus. Nulla cursus euismod rhoncus. In commodo magna ut mollis condimentum. Cras ut augue id nulla molestie hendrerit ut consequat mauris.

Sed at tempor enim. Ut commodo congue orci quis vestibulum. Nam condimentum et tortor sed maximus. Maecenas tempor dui eu volutpat iaculis. Sed a orci ut mauris lobortis mollis. Ut dictum vestibulum arcu, quis rhoncus dolor imperdiet id. Donec eget vulputate ex, vel cursus eros. Donec pretium odio ac nibh iaculis, ac dictum eros lacinia. Donec eu varius mi. Morbi a dignissim leo. Pellentesque facilisis justo vel sapien eleifend facilisis. In ferment

## Shingling
Shingling is a technique used in text analysis to represent a document as a set of overlapping subsequences of fixed length k, called k-shingles. To compute shingles, we slide a window of size k over the document and extract the k-length substrings that fall within the window. We then store these substrings as a set, which represents the shingles of the document.

In [2]:
class Shingling():
    
    def __init__(self, k: int):
        self.k: int = k
        
    def get_shingles(self, document: str) -> set:
        """
        This method constructs k-shingles of a given length k from a given document.
        """
        shingles: set = set()
        for i in range(len(document) - self.k + 1):
            shingle: str = document[i:i+self.k]
            shingles.add(shingle)
        return shingles
    
    def get_hashed_shingles(self, document: str):
        """
        This method computes a hash value for each unique shingle and represents the document in the form of an ordered set of its hashed k-shingles.
        """
        shingles: set = self.get_shingles(document)
        hashed_shingles: list[int] = [hash(shingle) for shingle in shingles]
        hashed_shingles.sort()
        return hashed_shingles

As an example to show what is shingling, we will use the following text:
> "The quick brown fox jumps over the lazy dog"

We can see that the text is divided on chunks of 3 characters like "fox", "the" or "dog". If there is a space in the middle it is also considered as a character. So, if we want to get the shingles of this text with a k=3, we will get the following shingles:

In [3]:
example_text_1: str = "The quick brown fox jumps over the lazy dog"
example_text_2: str = "The agile black cat leaps over the active dog"

shingling: Shingling = Shingling(k=3)

example_shingles_1: list[str] = shingling.get_shingles(example_text_1)
print(example_shingles_1)

example_hashed_shingles_texts: list[list[int]] = []
for text in [example_text_1, example_text_2]:
    example_hashed_shingles: int = shingling.get_hashed_shingles(text)
    example_hashed_shingles_texts.append(example_hashed_shingles)

{'azy', 'zy ', 'ver', ' ju', 'he ', 'k b', 'ck ', 'er ', ' th', 'y d', 'bro', 'e l', 'ick', ' br', 'n f', 'x j', 'the', ' la', 'e q', 'qui', 's o', 'row', 'r t', 'wn ', ' qu', 'dog', 'ps ', 'The', 'ove', 'fox', 'laz', ' ov', 'mps', ' do', ' fo', 'ump', 'uic', 'ox ', 'jum', 'own'}


Now we do it with our text. We use a k=9 as we are analyzing text from books instead of emails. In the following lines we are doing exactly the same as we showed in the example above. In this case we will compute the hash of each shingle and we will store it in a set. We will do this for each text.

In [4]:
shingling: Shingling = Shingling(k=9)

hashed_shingles_texts: list[list[int]] = []

for text in texts:
    hashed_shingles: int = shingling.get_hashed_shingles(text)
    hashed_shingles_texts.append(hashed_shingles)

## MinHashing
### One-hot encoding
Before computing the MinHashing we will use the **One-hot encoding** to represent the shingles of each text. This will help us to compute the Jaccard similarity. To create the one-hot encoding for each text we create a set with all the shingles of all the texts. Then, for each text we create a vector with the size of the set of all shingles. If the shingle is in the text, we will put a 1 in the vector, otherwise we will put a 0. This way we will have a vector for each text with the size of the set of all shingles. We will do this for each text.

In [5]:
example_vocab_hashed: set = set()
for hashed_shingles in example_hashed_shingles_texts:
    for hashed_shingle in hashed_shingles:
        example_vocab_hashed.add(hashed_shingle)

example_matrix_one_hot_encoding: list[list[int]] = []
for hashed_shingles in example_hashed_shingles_texts:
    example_one_hot_encoding: list[int] = [1 if x in hashed_shingles else 0 for x in example_vocab_hashed]
    example_matrix_one_hot_encoding.append(example_one_hot_encoding)
print(example_matrix_one_hot_encoding)

[[1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1], [0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0]]


In the following code we do the same, but with the larger texts we originally had.

In [6]:
vocab_hashed: set = set()
for hashed_shingles in hashed_shingles_texts:
    for hashed_shingle in hashed_shingles:
        vocab_hashed.add(hashed_shingle)

matrix_one_hot_encoding: list[list[int]] = []
for hashed_shingles in hashed_shingles_texts:
    one_hot_encoding: list[int] = [1 if x in hashed_shingles else 0 for x in vocab_hashed]
    matrix_one_hot_encoding.append(one_hot_encoding)


### MinHashing
Now that we have the one-hot encoding for each text, we need to shuffle the texts multiple in order to create the signatures, which will be used to compute the similarity between the texts. We will do this by using the **MinHashing** technique. 

To compute the signature of each text we look at each col of the one-hot encoding. We need to find the first row which has a 1. We will store the row number in the signature. We will do this for each col of the one-hot encoding. This way we will have a signature for each text. We will do this for each text.

We will use our previous example to show how MinHashing works. First of all, we create a list with all the indexes.

In [7]:
example_hash_indexes: list[int] = list(range(1, len(example_vocab_hashed)+1))
print(example_hash_indexes)

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67]


After we have the list, we shuffle it one time (in this case). We will use this shuffled list to compute the signature of each text. We will do this for each text.

In [8]:
from random import shuffle

shuffle(example_hash_indexes)

for i in range(1, 10):
    print(f"{i} -> {example_hash_indexes.index(i)}")

1 -> 11
2 -> 53
3 -> 37
4 -> 39
5 -> 42
6 -> 31
7 -> 62
8 -> 52
9 -> 29


The following class MinHashing is used to compute all the different shuffle needed to create the signature. It can be thought as an auxiliary class which assists Signature. Basically, it does what we have shown previously. In this case we can set as well how many shuffles we want to do. One shuffle is equivalent to one bit in the signature.

In [9]:
class MinHashing():

    def __init__(self, vocab_size: int, nbits: int):
        self.vocab_size: int = vocab_size
        self.nbits: int = nbits
        self.hashes: list[int] = []

    def create_hash_functions(self) -> list[int]:
        """
        Function for creating the hash vector / function
        """
        hash_indexes: list[int] = list(range(1, self.vocab_size+1))
        shuffle(hash_indexes)
        return hash_indexes
    
    def build_minhashing_functions(self) -> list[int]:
        """
        Function for building multiple minhashing vectors
        """
        hashes: list[int] = []
        for _ in range(self.nbits):
            hashes.append(
                self.create_hash_functions()
            )
        return hashes 

### Signature
To illustrate how the creation of the signature works, we will illustrate first how the signature finds the first bit. In this case it looks for the first index with a 1 in the column. This index will be the first bit of the signature. We will do this for each column of the one-hot encoding. This way we will have a signature for each text. We will do this for each text. 

In [10]:
for i in range(1, len(example_vocab_hashed)+1):
    idx: int = example_hash_indexes.index(i)
    signature_value: int = matrix_one_hot_encoding[0][idx]
    print(f"{i}. Index: {idx} -> {signature_value}")
    if signature_value == 1:
        print("Match!")
        break

1. Index: 11 -> 0
2. Index: 53 -> 0
3. Index: 37 -> 0
4. Index: 39 -> 0
5. Index: 42 -> 0
6. Index: 31 -> 0
7. Index: 62 -> 1
Match!


As we can see, after a several indexes, we find a 1. When we have the match, we store the index in the signature. We will do this for each text. For example, if the signature is of 20 bits, we will do this 20 times.

To compute this in a more systematic way, we have the class Signature which does all this computations. Internally, it computes the minhashing to create the different shuffled vectors for the indexes and then it computes the signature. We will do this for each text.

In [22]:
class Signature():
    
    def __init__(self, matrix_one_hot_encoding: list[list[int]], nbits: int = 20):
        self.vocab_size: int = len(matrix_one_hot_encoding[0])
        minhashing = MinHashing(self.vocab_size, nbits)
        self.minhash_functions: list[int] = minhashing.build_minhashing_functions()

    def create_hash(self, vector: list[int]) -> list[int]:
        """
        This function creates our signatures matching the 1s
        """
        signature: list[int] = []
        for function in self.minhash_functions:
            for i in range(1, self.vocab_size+1):
                idx: int = function.index(i)
                signature_value: int = vector[idx]
                if signature_value == 1:
                    signature.append(idx)
                    break
        return signature

Now, we will look which are the signatures with the examples texts. Later, we will use this signatures to compute the similarity between the texts.

In [12]:
example_texts_signatures: Signature = Signature(
    [example_text_1, example_text_2], nbits = 20
)

example_text_1_signature: list[int] = example_texts_signatures.create_hash(example_matrix_one_hot_encoding[0])
print(example_text_1_signature)

example_text_2_signature: list[int] = example_texts_signatures.create_hash(example_matrix_one_hot_encoding[1])
print(example_text_2_signature)

26
13
27
35
37
38
35
23
38
12
28
13
28
40
34
16
36
18
33
39
26
28
6
12
35
34
22
11
38
16
[27, 35, 38, 35, 23, 38, 28, 28, 40, 34, 16, 18, 33, 28, 35, 34, 22, 11, 38, 16]
26
35
37
35
23
38
24
12
13
40
29
34
10
16
41
36
33
39
6
34
12
22
41
11
1
7
38
10
16
18
[26, 35, 37, 35, 23, 24, 12, 13, 29, 10, 41, 36, 33, 39, 6, 12, 41, 7, 10, 18]


In this examples, we can see that some values in the signature are similar. This means that the texts are similar at least in some part. This makes sense as if we look at the examples text we can see that they have some words in common.

- Example text 1: "The quick brown fox jumps over the lazy dog"
- Example text 2: "The agile black cat leaps over the active dog"

In this example, we can expect that there is some similarity, as there are some similar words like "dog", "over" or "the". However, we can see that the similarity is not that high. This is because the texts are not that similar.

Now we do it, with our dataset, instead of the examples.

In [23]:
texts_signatures_func: Signature = Signature(matrix_one_hot_encoding, nbits=100)

texts_signatures: list[list[int]] = []
for text_one_hot_encoding in matrix_one_hot_encoding:
    text_signature: list[int] = texts_signatures_func.create_hash(text_one_hot_encoding)
    texts_signatures.append(text_signature)

[[2635, 2914, 10921, 2294, 17232, 14032, 11054, 6967, 2064, 14304, 12721, 8752, 12889, 8620, 14102, 1403, 11352, 16719, 5324, 3666, 16468, 16138, 15167, 5015, 6843, 5933, 14505, 11044, 15752, 15508, 2740, 16572, 4703, 5655, 9548, 13565, 4814, 10176, 829, 13236, 2630, 345, 2240, 15406, 6857, 17784, 14680, 18244, 11222, 16428, 7210, 5224, 16585, 18106, 13601, 9118, 10054, 5893, 5147, 2630, 8058, 497, 3579, 546, 4237, 16498, 1639, 5537, 14047, 5435, 15154, 7290, 8200, 628, 15393, 6431, 829, 14983, 7015, 3044, 4079, 15728, 17366, 8848, 5854, 7667, 6904, 5285, 1481, 4837, 9987, 6661, 14211, 604, 7684, 5933, 2950, 4501, 10879, 16332], [17078, 6613, 16110, 8192, 2316, 13082, 13141, 1362, 12095, 4321, 7342, 17062, 848, 2203, 13867, 16398, 590, 2562, 9206, 4868, 4715, 10335, 4794, 13638, 9456, 12395, 9730, 16670, 15827, 5647, 9058, 9754, 12094, 12760, 5664, 16126, 17334, 6481, 6467, 9514, 16840, 4245, 5441, 8799, 16108, 6008, 5161, 12678, 17900, 4653, 15375, 7137, 979, 14545, 6030, 12561, 3439,

We can do the comparison of the signatures manually if we want. Nonetheless, looking at this manually for each text would be really difficult. This is where the Jaccard similarity can help to compute how similar texts are.

### CompareSignatures
If we look 

In [None]:
### TO CONTINUE ###

def jaccard(x, y):
    return len(x.intersection(y)) / len(x.union(y))

jaccard(set(example_text_1_signature), set(example_text_2_signature)), jaccard(set(example_text_1), set(example_text_2))


In [None]:
import numpy as np
from typing import List

def min_hash(A: List[int], hash_length: int = 100, seed: int = 0, generator: np.random.Generator = None) -> np.ndarray:
    """
    The function takes as input the list of hashed shingling in a document and returns a vector representation of the
    document hashed through min hashing.
    :param A: the list of hashed shingling representing a document
    :param hash_length: the length of the signature to be returned
    :param seed: the seed used to generate the hash functions
    :param generator: a pre-created random generator
    :return: a vector representation of the document, with len=hash_len
    """
    if generator is None:
        generator = np.random.default_rng(seed=seed)

    min_value = -2 ** 31
    max_value = 2 ** 31 - 1

    hash_parameters = generator.integers(low=min_value, high=max_value, size=(hash_length, 2), dtype=np.int64)

    return np.array([
        min(np.remainder(np.multiply(x, a) + b, max_value) for x in A)
        for a, b in hash_parameters
    ], dtype=np.int64)

class MinHashing:
    
    def __init__(self, n: int, seed: int = 0):
        self.n: int = n
        self.seed: int = seed
        self.generator: np.random.Generator = np.random.default_rng(seed=self.seed)
    
    def compute_minhash_signature(self, hashed_shingles: List[int]) -> np.ndarray:
        return min_hash(hashed_shingles, hash_length=self.n, generator=self.generator)

# Example usage:
minhashing = MinHashing(n=100, seed=42)  # Set the desired length of the MinHash signature and a seed

# Example: Compute the MinHash signature for the hashed shingles of Text 1
hashed_shingles_text1 = list(hashed_shingles_list[0])
minhash_signature_text1 = minhashing.compute_minhash_signature(hashed_shingles_text1)

print(f"MinHash signature for Text 1: {minhash_signature_text1}")


NameError: name 'hashed_shingles_list' is not defined

## CompareSets

Before comparing the texts, we need to know how similar are the shingles of each text. To do this, we will use the Jaccard similarity. The Jaccard similarity is a measure of how similar two sets are. The higher the Jaccard index, the more similar the two sets are. The Jaccard index is defined as the size of the intersection divided by the size of the union of the sample sets.

For the jaccard similarity we will use the following formula:

$$J(A,B) = \frac{|A \cap B|}{|A \cup B|}$$

In the following lines we will compute the Jaccard similarity of each pair of texts. We will store the results in a matrix so we can compare them later.

In [None]:
class CompareSets():
    
    def jaccard_similarity(self, x: set, y: set) -> float:
        """
        This method computes the Jaccard similarity of two sets.
        """
        intersection_size: int = len(x.intersection(y))
        union_size: int = len(x.union(y))
        jaccard_similarity: float = intersection_size / union_size
        return jaccard_similarity

In [None]:
compare_sets: CompareSets = CompareSets()

# compute the Jaccard similarity of the hashed shingles for Text 1 and Text 2
jaccard_similarity: float = compare_sets.jaccard_similarity(set(hashed_shingles_list[0]), set(hashed_shingles_list[1]))

print(f"The Jaccard similarity of Text 1 and Text 2 is {jaccard_similarity:.2f}")

The Jaccard similarity of Text 1 and Text 2 is 0.01


## CompareSignatures


In [None]:
import numpy as np

class CompareSignatures:
    
    def __init__(self):
        pass
    
    def estimate_similarity(self, signature1: np.ndarray, signature2: np.ndarray) -> float:
        """
        This method estimates the similarity of two minhash signatures as the probability to have an equal entry in the signatures.
        """
        if signature1.shape != signature2.shape:
            raise ValueError("Signatures must have the same shape for comparison.")
        
        return np.sum(signature1 == signature2) / len(signature1)

# Example usage:
compare_signatures = CompareSignatures()

# Example: Estimate the similarity of MinHash signatures for Text 1 and Text 2
minhash_signature_text1 = minhashing.compute_minhash_signature(set(hashed_shingles_list[0]))
minhash_signature_text2 = minhashing.compute_minhash_signature(set(hashed_shingles_list[1]))

similarity = compare_signatures.estimate_similarity(minhash_signature_text1, minhash_signature_text2)
print(f"Estimated similarity between Text 1 and Text 2: {similarity:.2f}")

# Print the hashed shingles for Text 1 and Text 2
print(f"Hashed shingles for Text 1: {set(hashed_shingles_list[0])}")
print(f"Hashed shingles for Text 2: {set(hashed_shingles_list[1])}")

# Print a sample of the content for Text 1 and Text 2
print(f"Sample content for Text 1:\n{texts[0][:200]}")
print(f"Sample content for Text 2:\n{texts[1][:200]}")



Estimated similarity between Text 1 and Text 2: 0.00
Hashed shingles for Text 1: {1760274617084862464, -2197375986492907519, -587441954832973823, 8743244729096470525, -4069586526347517944, 6331014978557845515, 5818091183956467727, 7050967123566649359, 3524886073462628372, -2784253521557356520, 6758801991898300440, -7498525360044376033, 4471223957186633757, -7515682048999178204, 7947297268579459107, 4077152595292020775, -3855745517656244182, -5391524090379059155, -8420847051959975888, -273830232932909009, 8512408720321855533, -4560292091898838990, 8556655781343600691, -7582834162104762307, 1380761621584920636, 3814778549835587643, -1397081132129877953, -4627549677990952894, -1330387972764630974, -6613339633290583993, 7121291870531600452, 4775043253244223561, 6263755175384375371, -6244408197613338539, 5728306877713776722, 7475708454762020947, -1373716053137514399, -858019315660906396, 4511259943006421092, -2931918184541716374, 518473348838072427, -8947425525905112977, 412706756010475631,

## LSH