# Semantic Hashing Demo

## Description

The goal of this notebook is to identify relationships among three kinds of texts: those that are similar, those that are semantically similar (generated using Large Language Models, or LLMs), and those that are completely unrelated. The notebook uses the "Amazon Food Reviews" dataset from Kaggle for this purpose.

Texts are then generated with [Marvin](https://www.askmarvin.ai/) through its `marvin` Python package. The generated dataset contains paragraph-sized texts and their semantically similar counterparts. Hamming distance matrices between the hashes of the originals and their counterparts are then computed and visualized as heatmaps, one per hyperplane count, to show how the semantic relationships emerge as the number of hyperplanes increases.

From an algorithmic standpoint, the notebook uses Locality Sensitive Hashing (LSH). This technique creates hyperplanes that partition the space in which the texts (e.g., food reviews) live, with each text represented as an embedding vector.
> The size of the embedding vector is 1536 for OpenAI's small embedding model, and 3072 for the large model.

Next, a semantic hash is computed for any given text. This process converts a large embedding vector (a numerical representation of the text) into a few bits, one per hyperplane, akin to a hash code. For example, the phrase "The food was very delicious" could be represented as `1011` in a system with 4 hyperplanes. Finally, texts with identical hashes are grouped into the same bucket.
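
The hashing and bucketing steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the common random-hyperplane variant of LSH; it is not the code in `lsh.py`, and the helper names (`make_hyperplanes`, `semantic_hash`, `bucket_by_hash`) and the toy vectors are made up for the example.

```python
import random
from collections import defaultdict


def make_hyperplanes(nbits, dim, seed):
    """Draw `nbits` random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(nbits)]


def semantic_hash(vector, hyperplanes):
    """One bit per hyperplane: '1' if the vector lies on its positive side."""
    bits = []
    for normal in hyperplanes:
        dot = sum(v * w for v, w in zip(vector, normal))
        bits.append("1" if dot > 0 else "0")
    return "".join(bits)


def bucket_by_hash(embeddings, hyperplanes):
    """Group text keys whose embeddings share an identical hash."""
    buckets = defaultdict(list)
    for key, vector in embeddings.items():
        buckets[semantic_hash(vector, hyperplanes)].append(key)
    return dict(buckets)


# Toy 3-dimensional "embeddings" (hypothetical values); real embeddings
# would have 1536 or 3072 components.
toy_embeddings = {
    "review_a": [0.9, 0.1, 0.3],
    "review_a_scaled": [1.8, 0.2, 0.6],  # same direction as review_a
    "review_b": [-0.9, -0.1, -0.3],      # opposite direction
}
planes = make_hyperplanes(nbits=4, dim=3, seed=2254)
buckets = bucket_by_hash(toy_embeddings, planes)
```

Only the direction of a vector matters here, so `review_a` and `review_a_scaled` land in the same bucket, while `review_b` gets the complementary bit pattern. Under this random-hyperplane assumption, two vectors separated by an angle θ disagree on any single bit with probability θ/π, so the expected Hamming distance between their hashes is `nbits * θ / π`; more hyperplanes therefore give a finer-grained estimate of similarity.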

## Imports

In [15]:
import polars as pl
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor, as_completed
import marvin
from pydantic import BaseModel, Field
import plotly as py
import plotly.express as px
import sys

# Add the 'src' directory to the Python path to find the required modules
src_path = Path("main.ipynb").parent.resolve() / 'src'
sys.path.append(str(src_path))

from semantic_hashing_demo.utils import ensure_file_exists, check_files_exist
from semantic_hashing_demo.lsh import LSH

## Config

All the parameters (constants mostly) are defined here.

In [6]:
"""
This module contains the input data for the semantic hashing demo.
"""

# no. of hyperplanes
nbits = 8


# data file
# NOTE: The full Amazon Food Reviews dataset has around 570k text reviews
# (single-line and paragraph-length). Load only as many rows as your
# computational resources allow for bucketing.
data_file = "./data/fine_food_reviews_1k.csv"
preprocessed_data_file = "./output/preprocessed_data.csv"
generated_data_file = "./data/paragraphs.csv"

# no. of text samples
n = 20

# seed for hyperplane generation
seed = 2254  # subspace address format prefix

# embedding model
model = "text-embedding-3-small"
# model = "text-embedding-3-large"

# embedding size
embedding_size = 1536  # for small
# embedding_size = 3072   # for large
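
Because `embedding_size` has to match the chosen `model`, one option is to derive it from a lookup table instead of editing both settings by hand. The following is a minimal sketch of that idea; the names `EMBEDDING_SIZES` and `embedding_size_for` are hypothetical and not part of the demo's code, and the sizes are the ones stated in the description.

```python
# Embedding dimensions per model, as stated in the description above.
EMBEDDING_SIZES = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def embedding_size_for(model_name):
    """Return the embedding dimension for a known model name."""
    try:
        return EMBEDDING_SIZES[model_name]
    except KeyError:
        raise ValueError(f"Unknown embedding model: {model_name!r}") from None


model = "text-embedding-3-small"
embedding_size = embedding_size_for(model)
```

With this arrangement, switching to the large model is a one-line change, and a mistyped model name fails early rather than producing hyperplanes of the wrong dimension.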

## Utils

Imported from [here](./src/semantic_hashing_demo/utils.py).

## LSH class

Imported from [here](./src/semantic_hashing_demo/lsh.py).


## Generate Data

Generate paragraph pairs with `marvin`: ten parallel calls, each requesting two `ParagraphData` items.

In [4]:
class ParagraphData(BaseModel):
    original: str = Field(description="The original paragraph")
    very_similar: str = Field(
        description="A paragraph that is almost identical to the original paragraph with only a couple of words changed"
    )
    # semantically_similar:str


def generate_data():
    print("generating data")
    new_data = marvin.generate(
        n=2,
        target=ParagraphData,
        instructions="generate paragraphs for comparison testing. the paragraphs should be almost identical, with only a few words changed. each paragraph should be at least 100 words long.",
    )
    return new_data


data = []

# Number of parallel calls you want to make
num_parallel_calls = 10

# Use ThreadPoolExecutor to execute calls in parallel
with ThreadPoolExecutor(max_workers=num_parallel_calls) as executor:
    # Submit all generate calls to the executor (this builds a set of futures)
    futures = {
        executor.submit(generate_data) for _ in range(num_parallel_calls)
    }

    # Collect the results as they are completed
    for future in as_completed(futures):
        try:
            data.extend(future.result())
        except Exception as exc:
            print(f"generate_data raised an exception: {exc}")

# Print the data collected
# for item in data:
#     print(item)

# model_dump() is the Pydantic v2 replacement for the deprecated .dict()
data_dicts = [d.model_dump() for d in data]
df = pl.DataFrame(data_dicts)
df.write_csv("./data/paragraphs.csv")


generating data
generating data
generating data
generating data
generating data
generating data
generating data
generating data
generating data
generating data


## Process Generated data

Load the generated data and compute LSH codes for both the source texts and their variants with an increasing number of hyperplanes (`nbits` = 8, 16, 32, 64, 128).
Then, for each `nbits`, plot the Hamming distance (HD) matrix between all sources and all variants.
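
To make the HD matrix concrete, here is a small sketch that builds one from hashes written as bit strings. It assumes hashes are equal-length strings of `0` and `1`; the actual return type of `lsh.hash_vector` and the implementation of `lsh.hamming_distance` live in `lsh.py` and may differ. The function names and the 8-bit hashes below are invented for the example.

```python
def hamming_distance(hash_a, hash_b):
    """Count the bit positions at which two equal-length bit strings differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same number of bits")
    return sum(a != b for a, b in zip(hash_a, hash_b))


def hamming_matrix(source_hashes, variant_hashes):
    """Build the distance matrix; rows index variants, columns index sources."""
    return [
        [hamming_distance(s, v) for s in source_hashes]
        for v in variant_hashes
    ]


# Made-up 8-bit hashes for three source texts and their three variants.
source_hashes = ["10110010", "01001101", "11100001"]
variant_hashes = ["10110011", "01001101", "01100001"]
matrix = hamming_matrix(source_hashes, variant_hashes)
```

Entry `matrix[i][j]` compares variant `i` with source `j`, so the diagonal holds the distance between each text and its own variant. If the hashes capture semantic similarity, the diagonal should stay small relative to the off-diagonal entries, which is the pattern to look for in the heatmaps.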

In [19]:
dir_name = "output"
file_name = "paragraphs_processed.csv"

# load data
df = pl.read_csv("data/paragraphs.csv")
source_texts = df.get_column("original").to_list()
variants_texts = df.get_column("very_similar").to_list()

# get embeddings for source and variants
source_embeddings = LSH.get_embedding(source_texts, model)
variant_embeddings = LSH.get_embedding(variants_texts, model)

# Create DataFrame with embeddings of source and variants
df2 = pl.DataFrame(
    {
        "Source": source_texts,
        "Variant": variants_texts,
        "Source Embedding": [
            str(embedding) for embedding in source_embeddings.tolist()
        ],
        "Variant Embedding": [
            str(embedding) for embedding in variant_embeddings.tolist()
        ],
    }
)

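# Make sure the output directory exists before the plots are written into it
Path(dir_name).mkdir(parents=True, exist_ok=True)

print("Saving Embeddings, LSH codes & HD matrix...\n")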
for nbits in [8, 16, 32, 64, 128]:
    print(f"\tfor nbits = {nbits}:")
    # Create LSH instance
    lsh = LSH(nbits=nbits, seed=seed, embedding_size=embedding_size)

    hashes_source = lsh.hash_vector(source_embeddings)
    hashes_variant = lsh.hash_vector(variant_embeddings)

    # Add LSH hashes corresponding to the embeddings to the df2 DataFrame
    df2.insert_column(
        len(df2.columns), pl.Series(f"Source Hash {nbits}-bit", hashes_source)
    )
    df2.insert_column(
        len(df2.columns), pl.Series(f"Variant Hash {nbits}-bit", hashes_variant)
    )

    hamming_distances = []
    # calculate HD matrix
    for hash_variant in hashes_variant:
        hamming_distances.append(
            [
                lsh.hamming_distance(hash_source, hash_variant)
                for hash_source in hashes_source
            ]
        )

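    # generate a heatmap for nbits hyperplanes (rows: variants, columns: sources)
    fig = px.imshow(
        hamming_distances,
        labels=dict(x="Source", y="Variant", color="Hamming distance"),
        title=f"Hamming distances with {nbits} hyperplanes",
    )
    fig.show()
    py.offline.plot(fig, filename=f"{dir_name}/plot_matrix_{nbits}.html", auto_open=False)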

# Ensure the output file exists at the desired path
ensure_file_exists(dir_name, file_name)

# Save embeddings + LSH codes to CSV, linked with their source samples
df2.write_csv(f"{dir_name}/{file_name}", separator=",")


Saving Embeddings, LSH codes & HD matrix...

	for nbits = 8:


	for nbits = 16:


	for nbits = 32:


	for nbits = 64:


	for nbits = 128:
