# Homework 1: Locality-Sensitive Hashing
Locality-Sensitive Hashing (LSH) is a technique used in computer science for approximate nearest-neighbor search in high-dimensional spaces. It finds similar items in a large dataset by hashing the input items so that similar items map to the same "buckets" with high probability. LSH is commonly used in recommendation systems, image and audio recognition, and data mining.

In this notebook we implement a simplified version of the LSH algorithm to compare texts and measure how similar they are. We implement four classes that together compute the similarity of two texts: Shingling, CompareSets, MinHashing and CompareSignatures.

## Dataset
As a dataset, we have used the following texts:
 - Lorem Ipsum with 5 paragraphs (https://www.lipsum.com/)\[1.txt\]
 - Lorem Ipsum with 7 paragraphs (https://www.lipsum.com/)\[2.txt\]
 - Quijote de la Mancha by Miguel de Cervantes (https://www.gutenberg.org/cache/epub/60884/pg60884.txt)\[3.txt\]
 - The Adventures of Sherlock Holmes by Arthur Conan Doyle (https://www.gutenberg.org/cache/epub/1661/pg1661.txt)\[4.txt\]
 - The Picture of Dorian Gray by Oscar Wilde (https://www.gutenberg.org/cache/epub/174/pg174.txt)\[5.txt\]
 - Beyond Good and Evil by Friedrich Wilhelm Nietzsche (https://www.gutenberg.org/cache/epub/4363/pg4363.txt)\[6.txt\]

This dataset was chosen so that it contains two texts that should be fairly similar (1 & 2), one text that should be fairly different (3), and three texts that should have something in common even though they differ (4, 5 & 6).


In [34]:
import os

# define the path to the dataset folder
dataset_path: str = "dataset"

# create an empty list to store the texts
texts: list[str] = []

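# loop through the files in the dataset folder
# note: os.listdir() returns entries in arbitrary order, so "Text N" in the outputs
# below is not necessarily N.txt (wrap the call in sorted() to align them)
for filename in os.listdir(dataset_path):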
    # check if the file is a text file
    if filename.endswith(".txt"):
        # open the file and read its contents
        with open(os.path.join(dataset_path, filename), "r") as f:
            text = f.read()
            texts.append(text)

# print the texts
for i, text in enumerate(texts):
    print(f"Text {i+1}:")
    print(text)
    print("-"*132)


Text 1:
HAPTER I.


The studio was filled with the rich odour of roses, and when the light
summer wind stirred amidst the trees of the garden, there came through
the open door the heavy scent of the lilac, or the more delicate
perfume of the pink-flowering thorn.

From the corner of the divan of Persian saddle-bags on which he was
lying, smoking, as was his custom, innumerable cigarettes, Lord Henry
Wotton could just catch the gleam of the honey-sweet and honey-coloured
blossoms of a laburnum, whose tremulous branches seemed hardly able to
bear the burden of a beauty so flamelike as theirs; and now and then
the fantastic shadows of birds in flight flitted across the long
tussore-silk curtains that were stretched in front of the huge window,
producing a kind of momentary Japanese effect, and making him think of
those pallid, jade-faced painters of Tokyo who, through the medium of
an art that is necessarily immobile, seek to convey the sense of
swiftness and motion. The sullen murmur of 

## Shingling
Shingling is a technique used in text analysis to represent a document as a set of overlapping subsequences of fixed length k, called k-shingles. To compute shingles, we slide a window of size k over the document and extract the k-length substrings that fall within the window. We then store these substrings as a set, which represents the shingles of the document.

In [35]:
class Shingling():
    
    def __init__(self, k: int):
        self.k: int = k
        
    def get_shingles(self, document: str) -> set:
        """
        This method constructs k-shingles of a given length k from a given document.
        """
        shingles: set = set()
        for i in range(len(document) - self.k + 1):
            shingle: str = document[i:i+self.k]
            shingles.add(shingle)
        return shingles
    
    def get_hashed_shingles(self, document: str) -> list[int]:
        """
        This method computes a hash value for each unique shingle and represents the document as a sorted list of its hashed k-shingles.
        Note: the built-in hash() is salted per process for strings, so the values differ between runs unless PYTHONHASHSEED is fixed.
        """
        shingles: set = self.get_shingles(document)
        hashed_shingles: list[int] = [hash(shingle) for shingle in shingles]
        hashed_shingles.sort()
        return hashed_shingles
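
Because `get_hashed_shingles` relies on Python's built-in `hash()`, the hashed shingles are only comparable within a single interpreter session. If reproducible values across runs are wanted, one option is a content-based hash. The sketch below is a minimal illustration that swaps in `zlib.crc32`; the function name `stable_hashed_shingles` is hypothetical and not part of the notebook. CRC32 yields only 32-bit values, so collisions become more likely on large corpora.

```python
import zlib


def stable_hashed_shingles(document: str, k: int) -> list[int]:
    """Return the sorted CRC32 values of the distinct k-shingles of `document`."""
    # Slide a window of size k over the text and keep each distinct substring once.
    shingles = {document[i:i + k] for i in range(len(document) - k + 1)}
    # zlib.crc32 depends only on the bytes, so the values are identical in every run.
    return sorted(zlib.crc32(s.encode("utf-8")) for s in shingles)


hashes = stable_hashed_shingles("The quick brown fox jumps over the lazy dog", k=5)
print(len(hashes), hashes[:3])
```

The rest of the pipeline can consume this list in the same way as the output of `get_hashed_shingles`.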

To illustrate shingling, we use the following text:
> "The quick brown fox jumps over the lazy dog"

In [36]:
text = "The quick brown fox jumps over the lazy dog"
shingling: Shingling = Shingling(k=5)
shingles: set[str] = shingling.get_shingles(text)
print(shingles)
shingles_hash: list[int] = shingling.get_hashed_shingles(text)
print("-"*132)
print(shingles_hash)



{'e laz', 'ox ju', 'fox j', 'zy do', 'e qui', 'x jum', 'rown ', 'over ', ' lazy', 's ove', 'quick', 'he la', 'k bro', ' jump', 'ver t', 'azy d', 'uick ', 'n fox', ' over', 'lazy ', 'the l', 'ps ov', 'umps ', ' quic', 'jumps', ' brow', ' the ', 'r the', 'own f', 'y dog', 'The q', 'ck br', 'mps o', ' fox ', 'brown', 'he qu', 'er th', 'wn fo', 'ick b'}
------------------------------------------------------------------------------------------------------------------------------------
[-9016638787060246468, -8074594585998353367, -7224978517212656118, -7148557594369510330, -6185998227707969675, -6109577377024358358, -4764682993535391332, -3591335854120488994, -3561638923555095727, -3416112134396693993, -2746489826632443158, -1997148926115336808, -1834752826044684633, -1802119594342921879, 208899909644114857, 343876992444415975, 588132197152147257, 695739747979867140, 742555885541354968, 805873665443469337, 2022737794358441648, 2778142146404480341, 2960403262237501325, 3310843553231635042, 39

Now we apply the same procedure to our texts. We use k=9 because we are analyzing book-length texts rather than short documents such as emails. The following cell does exactly what the example above did, once per text.

In [37]:
shingling: Shingling = Shingling(k=9)

hashed_shingles_list: list[list[int]] = []

for text in texts:
    hashed_shingles: list[int] = shingling.get_hashed_shingles(text)
    hashed_shingles_list.append(hashed_shingles)
    
for i, hashed_shingles in enumerate(hashed_shingles_list):
    print(f"Hashed shingles for Text {i+1}:")
    print(hashed_shingles)
    print("-"*132)


Hashed shingles for Text 1:
[-9222435853077391407, -9212514751376339775, -9200449057145930167, -9195050631304365164, -9186282838664733488, -9183342630358980641, -9180135884561149694, -9175370711648055302, -9175189710003292330, -9169101728732376525, -9166713072382895883, -9153076688877219038, -9140635402065522365, -9131208475139012637, -9130839646563869479, -9129737531464145580, -9128631288128285821, -9105401042249038788, -9090592470490776177, -9088850088696599948, -9088746018125664485, -9038670040952873932, -9025340874906732166, -9020626567470885550, -9019989217696408237, -9015990529007389151, -9010699087805287486, -9006980553945939582, -9002691539258298188, -8979347712220403194, -8978444068492702537, -8975762365216874790, -8974035068673354447, -8969026713993328469, -8964336742350676985, -8957913168780538358, -8953356441425154761, -8949525599595459996, -8947425525905112977, -8932941094040640387, -8932167775690979417, -8913041745824496677, -8913012735146107573, -8906582681476646209, -89

## CompareSets


In [38]:
class CompareSets():
    
    def __init__(self):
        pass
    
    def jaccard_similarity(self, set1: set, set2: set) -> float:
        """
        This method computes the Jaccard similarity of two sets.
        """
        union_size: int = len(set1.union(set2))
        # two empty sets have an empty union; return 0.0 instead of dividing by zero
        if union_size == 0:
            return 0.0
        intersection_size: int = len(set1.intersection(set2))
        return intersection_size / union_size

In [39]:
compare_sets: CompareSets = CompareSets()

# compute the Jaccard similarity of the hashed shingles for Text 1 and Text 2
jaccard_similarity: float = compare_sets.jaccard_similarity(set(hashed_shingles_list[0]), set(hashed_shingles_list[1]))

print(f"The Jaccard similarity of Text 1 and Text 2 is {jaccard_similarity:.2f}")

The Jaccard similarity of Text 1 and Text 2 is 0.01
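
The Jaccard similarity of two sets is the size of their intersection divided by the size of their union. As a small worked example, the sketch below compares the 3-shingles of two short strings; they share 7 shingles out of 13 distinct ones, so the similarity is 7/13. The helpers `jaccard` and `shingles` are illustrative stand-ins rather than the notebook's classes, and returning 0.0 for two empty sets is a convention assumed here.

```python
def jaccard(a: set, b: set) -> float:
    """|a & b| / |a | b|; 0.0 when both sets are empty (a convention assumed here)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def shingles(text: str, k: int) -> set:
    """Distinct character k-shingles of `text`."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


s1 = shingles("the lazy dog", 3)
s2 = shingles("the lazy cat", 3)
print(f"{len(s1 & s2)} shared of {len(s1 | s2)} distinct -> {jaccard(s1, s2):.3f}")
```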


## MinHashing

In [40]:
import numpy as np
from typing import List, Optional

def min_hash(A: List[int], hash_length: int = 100, seed: int = 0, generator: Optional[np.random.Generator] = None) -> np.ndarray:
    """
    The function takes as input the list of hashed shingles of a document and returns a vector representation of the
    document computed through min hashing.
    :param A: the list of hashed shingles representing a document
    :param hash_length: the length of the signature to be returned
    :param seed: the seed used to generate the hash functions (ignored if a generator is given)
    :param generator: a pre-created random generator
    :return: a vector representation of the document, with len=hash_length
    """
    if generator is None:
        generator = np.random.default_rng(seed=seed)

    min_value = -2 ** 31
    max_value = 2 ** 31 - 1

    hash_parameters = generator.integers(low=min_value, high=max_value, size=(hash_length, 2), dtype=np.int64)

    return np.array([
        min(np.remainder(np.multiply(x, a) + b, max_value) for x in A)
        for a, b in hash_parameters
    ], dtype=np.int64)

class MinHashing:
    
    def __init__(self, n: int, seed: int = 0):
        self.n: int = n
        self.seed: int = seed
    
    def compute_minhash_signature(self, hashed_shingles: List[int]) -> np.ndarray:
        # Re-seed on every call so that all documents are hashed with the same
        # n hash functions; sharing one generator across calls would draw new
        # (a, b) parameters per document and make the signatures incomparable.
        return min_hash(hashed_shingles, hash_length=self.n, seed=self.seed)

# Example usage:
minhashing = MinHashing(n=100, seed=42)  # Set the desired length of the MinHash signature and a seed

# Example: Compute the MinHash signature for the hashed shingles of Text 1
hashed_shingles_text1 = list(hashed_shingles_list[0])
minhash_signature_text1 = minhashing.compute_minhash_signature(hashed_shingles_text1)

print(f"MinHash signature for Text 1: {minhash_signature_text1}")


MinHash signature for Text 1: [ 609224  125670 1231698  309933  162964 1960573  316290 1514853  450981
  679265 1135985  207157  318911   19325 1749711  690140  142497  963514
 1509898  658309  212516     800  313803  476700 1074786  624984 1633970
 1664494  379816 1568756  449992  325036   28778    4821  104472   47667
  658791  343694 1798474 1300102  155475  298316  525963  865292  258952
  339986 1985321 2436487  117894  117040   68312 1347156 2342139 1142695
  141365  842880  325929   14450   35432  472683  146720  105312  129378
 1330649 1402429 2356293  245308 1389799  491208  115058 2021892  270697
  192599  101884  149730  162266  449562 2851387 1783399 1330447   69796
  892798   45390 1171454   89403  362288 1022363  939759   74458  571009
  113321  350743  394503   40090  994640  133091  497743  909893   76598
  152145]


## CompareSignatures


In [41]:
import numpy as np

class CompareSignatures:
    
    def __init__(self):
        pass
    
    def estimate_similarity(self, signature1: np.ndarray, signature2: np.ndarray) -> float:
        """
        This method estimates the similarity of two minhash signatures as the probability to have an equal entry in the signatures.
        """
        if signature1.shape != signature2.shape:
            raise ValueError("Signatures must have the same shape for comparison.")
        
        return np.sum(signature1 == signature2) / len(signature1)

# Example usage:
compare_signatures = CompareSignatures()

# Example: Estimate the similarity of MinHash signatures for Text 1 and Text 2
minhash_signature_text1 = minhashing.compute_minhash_signature(set(hashed_shingles_list[0]))
minhash_signature_text2 = minhashing.compute_minhash_signature(set(hashed_shingles_list[1]))

similarity = compare_signatures.estimate_similarity(minhash_signature_text1, minhash_signature_text2)
print(f"Estimated similarity between Text 1 and Text 2: {similarity:.2f}")

# Print the hashed shingles for Text 1 and Text 2
print(f"Hashed shingles for Text 1: {set(hashed_shingles_list[0])}")
print(f"Hashed shingles for Text 2: {set(hashed_shingles_list[1])}")

# Print a sample of the content for Text 1 and Text 2
print(f"Sample content for Text 1:\n{texts[0][:200]}")
print(f"Sample content for Text 2:\n{texts[1][:200]}")



Estimated similarity between Text 1 and Text 2: 0.00
Hashed shingles for Text 1: {1760274617084862464, -2197375986492907519, -587441954832973823, 8743244729096470525, -4069586526347517944, 6331014978557845515, 5818091183956467727, 7050967123566649359, 3524886073462628372, -2784253521557356520, 6758801991898300440, -7498525360044376033, 4471223957186633757, -7515682048999178204, 7947297268579459107, 4077152595292020775, -3855745517656244182, -5391524090379059155, -8420847051959975888, -273830232932909009, 8512408720321855533, -4560292091898838990, 8556655781343600691, -7582834162104762307, 1380761621584920636, 3814778549835587643, -1397081132129877953, -4627549677990952894, -1330387972764630974, -6613339633290583993, 7121291870531600452, 4775043253244223561, 6263755175384375371, -6244408197613338539, 5728306877713776722, 7475708454762020947, -1373716053137514399, -858019315660906396, 4511259943006421092, -2931918184541716374, 518473348838072427, -8947425525905112977, 412706756010475631,
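
Two MinHash signatures are only comparable when both documents are hashed with the same hash functions: in that case the fraction of matching positions estimates the Jaccard similarity of the underlying sets. The following sketch illustrates this on synthetic integer sets, using the standard-library `random` module instead of NumPy. The names `make_hash_params`, `signature` and `estimate` are hypothetical helpers, and the modulus simply reuses the value of `max_value` from `min_hash`.

```python
import random

PRIME = 2**31 - 1  # same modulus as max_value in min_hash


def make_hash_params(n, seed):
    """Draw n (a, b) pairs once; every document must reuse the same pairs."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]


def signature(items, params):
    """MinHash signature: the minimum of each hash function over the set."""
    return [min((a * x + b) % PRIME for x in items) for a, b in params]


def estimate(sig1, sig2):
    """Fraction of positions where the two signatures agree."""
    return sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)


# Two synthetic "documents" sharing 500 of 1500 distinct items (exact Jaccard = 1/3).
pool = random.Random(0).sample(range(10**9), 1500)
doc_a, doc_b = set(pool[:1000]), set(pool[500:])
exact = len(doc_a & doc_b) / len(doc_a | doc_b)

# Same parameters for both documents: the estimate should land near the exact value.
params = make_hash_params(n=256, seed=42)
shared = estimate(signature(doc_a, params), signature(doc_b, params))

# Different parameters for the second document: matches become coincidental.
other_params = make_hash_params(n=256, seed=43)
mismatched = estimate(signature(doc_a, params), signature(doc_b, other_params))

print(f"exact={exact:.3f} shared={shared:.3f} mismatched={mismatched:.3f}")
```

With mismatched parameters the estimate collapses toward zero regardless of how similar the documents are, which is worth ruling out whenever an estimated similarity looks suspiciously low.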

## LSH