# Semantic Hashing Demo

## Description

The objective is to identify relationships among three kinds of texts: those that are similar, those that are semantically similar (generated using Large Language Models, or LLMs), and those that are completely unrelated. This notebook uses the "Amazon Food Reviews" dataset from Kaggle for this purpose.

Texts are then generated using [Marvin](https://www.askmarvin.ai/) through its `marvin` Python package. The generated dataset includes paragraph-sized texts and their semantically similar counterparts. Matrices are computed from these texts and visualized as heatmap plots to show how the semantic relationships change as the number of hyperplanes increases.

From an algorithmic standpoint, the Locality Sensitive Hashing (LSH) method is employed. This technique draws random hyperplanes in the embedding space, which separate the different texts (e.g., food reviews) represented as embedding vectors.
> The size of the embedding vector is 1536 for OpenAI's small embedding model, and 3072 for the large model.

Then, for any given text, a semantic hash is computed. This process converts a large embedding vector (a numerical representation of the text) into a few bits, one per hyperplane, akin to a hash code. For example, the phrase "The food was very delicious" could be represented as `1011` in a system with 4 hyperplanes. Finally, texts with identical hashes are grouped into the same bucket.
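
As a minimal sketch of the idea, the snippet below hashes a toy 3-dimensional vector against 4 hand-picked hyperplane normals. Both the vector and the normals are illustrative assumptions, chosen so that the result matches the `1011` example; real embeddings have 1536 or 3072 dimensions and the hyperplanes are drawn at random.

```python
import numpy as np

# Hypothetical hyperplane normals, one row per hyperplane (4 bits in total)
plane_norms = np.array(
    [
        [1.0, 0.0, 0.0],
        [0.0, -1.0, 0.0],
        [0.0, 0.0, -1.0],
        [1.0, 1.0, 1.0],
    ]
)

# Hypothetical "embedding" of a text
embedding = np.array([1.0, 2.0, -1.0])

# A bit is 1 when the vector lies on the positive side of the hyperplane
bits = np.dot(plane_norms, embedding) > 0
semantic_hash = "".join(str(int(bit)) for bit in bits)
print(semantic_hash)  # 1011
```

The dot products are `1, -2, 1, 2`, so only the second hyperplane has the vector on its negative side.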

## Imports

In [10]:
import polars as pl
import numpy as np
import csv
from typing import Dict, List
from openai import OpenAI
from pathlib import Path

## Config

All the parameters (constants mostly) are defined here.

In [1]:
"""
This module contains the input data for the semantic hashing demo.
"""

# no. of hyperplanes
nbits = 8


# data file
# NOTE: The full dataset has around 570k text reviews (of types: single line, paragraph);
# the file below appears to be a 1k-review subset of it.
# So, parse accordingly depending on the computational resources for bucketing.
data_file = "./data/fine_food_reviews_1k.csv"
preprocessed_data_file = "./output/preprocessed_data.csv"
generated_data_file = "./data/paragraphs.csv"

# no. of text samples
n = 20

# seed for hyperplane generation
seed = 2254  # subspace address format prefix

# embedding model
model = "text-embedding-3-small"
# model = "text-embedding-3-large"

# embedding size
embedding_size = int(1536)  # for small
# embedding_size = int(3072)   # for large

## Utils

In [4]:
def ensure_file_exists(dir_name: str, file_name: str) -> None:
    """
    Create the directory and the file if they don't exist.
    Existing directories and files are left untouched.

    Args:
        dir_name (str): The directory path.
        file_name (str): The file name to create.
    """
    directory = Path(dir_name)
    file_path = directory.joinpath(file_name)
    # Create the directory first; this is a no-op if it already exists
    directory.mkdir(parents=True, exist_ok=True)
    # Create the file whenever it is missing, even if the directory was just created
    if not file_path.is_file():
        file_path.touch()


def check_files_exist(directory: str, file_names: List[str]) -> bool:
    """
    Check if files exist in the given directory.

    Args:
        directory (str): The directory path.
        file_names (List[str]): List of file names to check.

    Returns:
        bool: If all of the files exist in the directory.
    """
    return all(Path(directory).joinpath(name).is_file() for name in file_names)


## LSH class



In [2]:
class LSH:
    def __init__(self, nbits: int, embedding_size: int, seed: int):
        self.nbits = nbits
        self.seed = seed
        self.plane_norms = self._generate_plane_norms(embedding_size)

    def _generate_plane_norms(self, embedding_size: int) -> np.ndarray:
        rng = np.random.RandomState(self.seed)
        return rng.rand(self.nbits, embedding_size) - 0.5

    @staticmethod
    def get_embedding(texts: List[str], model: str) -> np.ndarray:
        client = OpenAI()
        processed_texts = [
            text.replace("\n", " ").replace("<br />", " ") for text in texts
        ]
        embeddings = client.embeddings.create(input=processed_texts, model=model).data
        return np.array([embedding.embedding for embedding in embeddings])

    def hash_vector(self, v: np.ndarray) -> List[str]:
        v_dots = np.dot(v, self.plane_norms.T) > 0
        return ["".join(str(int(i)) for i in v_dot) for v_dot in v_dots]

    @staticmethod
    def bucket_hashes(v: List[str]) -> Dict[str, List[int]]:
        buckets = {}
        for idx, hash_str in enumerate(v):
            buckets.setdefault(hash_str, []).append(idx)
        return buckets

    @staticmethod
    def hashes_to_df(v: List[str], col1: str, col2: str) -> pl.DataFrame:
        buckets = {}
        for i, hash_str in enumerate(v):
            buckets.setdefault(hash_str, []).append(i)

        buckets_df = pl.from_dict(
            {col1: list(buckets.keys()), col2: list(buckets.values())}
        )
        return buckets_df

    @staticmethod
    def write_buckets_to_csv(
        buckets: Dict[str, List[int]], col1: str, col2: str, file_path: str
    ):
        # Open the file in write mode
        with open(file_path, "w", newline="") as file:
            writer = csv.writer(file)
            writer.writerow([col1, col2])

            for key, value in buckets.items():
                list_as_string = str(value).strip("[]")
                writer.writerow([key, list_as_string])

    @staticmethod
    def hamming_distance(str1: str, str2: str) -> int:
        if len(str1) != len(str2):
            raise ValueError("Strings must be of equal length")
        return sum(char1 != char2 for char1, char2 in zip(str1, str2))

    @staticmethod
    def get_text_idx(hamming_distances: List[int]) -> int:
        if len(hamming_distances) == 0:
            raise ValueError("No hamming distances found")
        return int(np.argmin(hamming_distances))
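
The bucketing and nearest-bucket lookup can be tried without calling the embedding API. The sketch below re-implements the logic of `bucket_hashes` and `hamming_distance` as standalone functions and applies them to made-up 4-bit hashes; the hash values and the query are illustrative assumptions, not taken from the dataset.

```python
from typing import Dict, List


def bucket_hashes(hashes: List[str]) -> Dict[str, List[int]]:
    # Group text indices by their hash string
    buckets: Dict[str, List[int]] = {}
    for idx, hash_str in enumerate(hashes):
        buckets.setdefault(hash_str, []).append(idx)
    return buckets


def hamming_distance(str1: str, str2: str) -> int:
    # Count the positions at which the two bit strings differ
    if len(str1) != len(str2):
        raise ValueError("Strings must be of equal length")
    return sum(char1 != char2 for char1, char2 in zip(str1, str2))


# Hypothetical hashes of four texts (texts 0 and 1 collide)
toy_hashes = ["1011", "1011", "0011", "1110"]
buckets = bucket_hashes(toy_hashes)

# Hypothetical query hash with no exact match among the bucket keys
query_hash = "1111"
# min() keeps the first key among those with the smallest distance
nearest_key = min(buckets, key=lambda key: hamming_distance(query_hash, key))

print(buckets)
print(nearest_key, buckets[nearest_key])
```

Here two buckets (`1011` and `1110`) are at distance 1 from the query, and the first one wins. `np.argmin` in `get_text_idx` resolves ties the same way, by returning the first minimum.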


## Data pre-processing

Preprocess the data and generate the hash buckets for the text reviews.<br/>
Store the buckets for different nbits into different files.

In [7]:
dir_name = "output"
file_name = "preprocessed_data.csv"

# load data
df = pl.read_csv(data_file)
reviews = df.get_column("Text").to_list()

# Generate embeddings for each review
embeddings = LSH.get_embedding(reviews, model)

reviews_updated = [
    review.replace("\n", " ").replace("<br />", " ") for review in reviews
]

# Create DataFrame with updated reviews and embeddings
df2 = pl.DataFrame(
    {
        "Text": reviews_updated,
        "Embedding": [str(embedding) for embedding in embeddings.tolist()],
    }
)

print("Writing Embeddings, Hashes, Buckets to CSV files...\n")
for nbits in [8, 16, 32, 64, 128]:
    # Create LSH instance
    lsh = LSH(nbits=nbits, seed=seed, embedding_size=embedding_size)

    # Hash each embedding into a hash code; a list of hash codes is returned
    hashes = lsh.hash_vector(embeddings)

    # Add LSH hashes corresponding to the embeddings to the df2 DataFrame
    df2.insert_column(len(df2.columns), pl.Series(f"Hash {nbits}-bit", hashes))

    # Hashes to buckets
    buckets = lsh.bucket_hashes(hashes)

    # Define the bucket file name and ensure the directory and file exist
    bucket_file_name = f"buckets_{nbits}bit.csv"
    ensure_file_exists(dir_name, bucket_file_name)

    # write to CSV
    lsh.write_buckets_to_csv(
        buckets, "Text Hash", "Text Indices", f"{dir_name}/{bucket_file_name}"
    )

    print(f"\tfor nbits = {nbits} ✅\n")

# Ensure the output directory and file exist
ensure_file_exists(dir_name, file_name)

# Save embeddings + LSH hashes to CSV, linked with the source sample
df2.write_csv(f"{dir_name}/{file_name}", separator=",")

Writing Embeddings, Hashes, Buckets to CSV files...

	for nbits = 8 ✅

	for nbits = 16 ✅

	for nbits = 32 ✅

	for nbits = 64 ✅

	for nbits = 128 ✅
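
Because every `LSH` instance above is created with the same `seed`, the hyperplanes drawn for a smaller `nbits` should be the first rows of those drawn for a larger one. A 16-bit hash then starts with the corresponding 8-bit hash, and buckets can only split, never merge, as `nbits` grows. The sketch below checks this on random stand-in vectors; the toy `embedding_size`, the sample count, and the helper names `make_planes` and `hash_vectors` are assumptions used here in place of real embeddings to avoid API calls.

```python
import numpy as np

seed = 2254
embedding_size = 16  # toy size; the notebook uses 1536
n_vectors = 200  # toy sample count


def make_planes(nbits: int) -> np.ndarray:
    # Same generation scheme as LSH._generate_plane_norms
    rng = np.random.RandomState(seed)
    return rng.rand(nbits, embedding_size) - 0.5


def hash_vectors(vectors: np.ndarray, planes: np.ndarray) -> list:
    # One bit per hyperplane: 1 if the vector is on its positive side
    bits = np.dot(vectors, planes.T) > 0
    return ["".join(str(int(b)) for b in row) for row in bits]


# Random stand-ins for embedding vectors
vectors = np.random.RandomState(0).randn(n_vectors, embedding_size)

hashes_by_nbits = {n: hash_vectors(vectors, make_planes(n)) for n in [8, 16, 32]}
bucket_counts = {n: len(set(h)) for n, h in hashes_by_nbits.items()}

prefix_ok = all(
    h16.startswith(h8)
    for h8, h16 in zip(hashes_by_nbits[8], hashes_by_nbits[16])
)
print("16-bit hashes extend 8-bit hashes:", prefix_ok)
print("Number of buckets per nbits:", bucket_counts)
```

Using a different seed per `nbits` would break this nesting, so results across bit widths would no longer be directly comparable.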



## Detection of similar text

Here, we pass in a text that is similar to an existing one (say, the 1st food review) and check whether it falls into the expected bucket.

Suppose the 1st review is slightly modified from:

```text
I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.
```

to:

```text
I have bought many of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells good. My Labrador is finicky and she likes this product better than  most.
```

In [8]:
# slightly modified 1st review from the dataset
query = "I have bought many of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells good. My Labrador is finicky and she likes this product better than  most."

required_files = [f"buckets_{nbits}bit.csv" for nbits in [8, 16, 32, 64, 128]] + [
    "preprocessed_data.csv"
]
if not check_files_exist("output", required_files):
    raise ValueError("Please run `preprocessing.py` first")

for nbits in [8, 16, 32, 64, 128]:
    print(f"\n=====For nbits = {nbits}======")

    # instantiate LSH
    lsh = LSH(nbits=nbits, embedding_size=embedding_size, seed=seed)

    # get hash of a query text
    query_hash = lsh.hash_vector(lsh.get_embedding([query], model))[0]
    print(
        f"For a given text: \n\"{query}\", \nit's computed hash is '{query_hash}'."
    )

    # load data
    df = pl.read_csv(
        f"output/buckets_{nbits}bit.csv", dtypes={"Text Hash": pl.String}
    )
    bucket_hashes = df.get_column("Text Hash").to_list()
    bucket_indices = df.get_column("Text Indices").to_list()

    # get hamming distances between the query and each bucket key
    hamming_distances = [
        lsh.hamming_distance(query_hash, hash_str) for hash_str in bucket_hashes
    ]

    # HD: Hamming distance
    if 0 in hamming_distances:
        print("😊 Falls into a bucket with HD == 0.")
    else:
        print(
            "😟 Falls into closest bucket with HD != 0,\nwhen traversed from left --> right."
        )
    print(
        f"The bucket contains texts at indices: {bucket_indices[lsh.get_text_idx(hamming_distances)]}."
    )


For a given text: 
"I have bought many of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells good. My Labrador is finicky and she likes this product better than  most.", 
it's computed hash is '10010010'.
😊 Falls into a bucket with HD == 0.
The bucket contains texts at indices: 3, 31, 35, 147, 227, 248, 284, 285, 354, 355, 425, 436, 442, 457, 462, 506, 509, 518, 519, 521, 526, 532, 546, 558, 580, 608, 616, 627, 653, 690, 698, 716, 725, 835, 970.

For a given text: 
"I have bought many of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells good. My Labrador is finicky and she likes this product better than  most.", 
it's computed hash is '1001001011101011'.
😟 Falls into closest bucket with HD != 0,
when traversed from left --> right.
The bucket contains texts at indices: 0.

For a

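As the results show, for `nbits = 8` the query text falls into a bucket with HD = 0, but that bucket does not contain the original text, i.e. the text at index 0. For `nbits = 16`, no bucket matches the query hash exactly, and the closest bucket found is the one that contains index 0.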
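
One way to see why a near-duplicate can miss the bucket of its original: every additional hyperplane is another chance to separate the two vectors, so an exact hash match becomes less likely as `nbits` grows. The sketch below estimates this on two nearby synthetic vectors by redrawing the hyperplanes with different seeds. The vectors, the noise level, the trial count and the helper name `identical_hash_rate` are assumptions for illustration, not the actual review embeddings.

```python
import numpy as np


def identical_hash_rate(a, b, nbits_list, trials=200):
    # Fraction of random hyperplane draws for which a and b get the same hash
    counts = {n: 0 for n in nbits_list}
    max_bits = max(nbits_list)
    for trial_seed in range(trials):
        rng = np.random.RandomState(trial_seed)
        # Same generation scheme as the notebook's LSH class
        planes = rng.rand(max_bits, a.shape[0]) - 0.5
        agree = (planes @ a > 0) == (planes @ b > 0)
        for n in nbits_list:
            # The first n planes form the n-bit hash
            counts[n] += int(agree[:n].all())
    return {n: counts[n] / trials for n in nbits_list}


rng = np.random.RandomState(0)
original = rng.randn(64)  # stand-in for the original text's embedding
modified = original + 0.2 * rng.randn(64)  # stand-in for the edited text

nbits_list = [8, 16, 32, 64, 128]
rates = identical_hash_rate(original, modified, nbits_list)
for n in nbits_list:
    print(f"nbits = {n:3d}: identical hash in {rates[n]:.0%} of trials")
```

Falling back to the closest bucket by Hamming distance, as the cell above does, is one way to recover such near misses.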