# 🔍 Vector Database Demo with LSH-based Clustering

This notebook demonstrates a lightweight vector database implementation for semantic search, built around the features below.

#### ✨ Key Features
- **🧩 Flexible Architecture**: Uses abstract base classes to allow swapping embedding models without changing search logic.
- **🧮 LSH-based Clustering**: Implements Locality Sensitive Hashing (LSH) to cluster semantically similar documents.
- **🚀 Optimized Search Strategy**: Uses a two-phase search approach:
  1. First searches within the query's LSH cluster.
  2. Then explores neighboring clusters, continuing only while they keep yielding closer documents.
- **🛑 Early Termination**: Stops searching additional clusters after N consecutive failures to find closer documents.
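
The search strategy above can be sketched in a few lines. This is a minimal illustration, not the notebook's actual implementation: it assumes random-hyperplane LSH, unit-length vectors, and a neighbor order based on Hamming distance between cluster signatures. The names `lsh_cluster`, `search`, `patience`, `DIM`, and `N_PLANES` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_PLANES = 16, 4  # hypothetical sizes, chosen only for illustration

# Random hyperplanes: each one contributes one bit of the cluster signature.
planes = rng.standard_normal((N_PLANES, DIM))


def lsh_cluster(vec: np.ndarray) -> int:
    """Map a vector to a cluster id built from the sign of each projection."""
    bits = (planes @ vec) > 0
    return int(sum(1 << i for i, b in enumerate(bits) if b))


def search(query: np.ndarray, docs: np.ndarray, k: int = 3, patience: int = 2):
    """Search the query's own cluster first, then neighboring clusters,
    stopping after `patience` consecutive clusters that do not change the top-k."""
    buckets: dict[int, list[int]] = {}
    for idx, doc in enumerate(docs):
        buckets.setdefault(lsh_cluster(doc), []).append(idx)

    home = lsh_cluster(query)
    # Visit order: home cluster (Hamming distance 0), then the nearest signatures.
    order = sorted(buckets, key=lambda c: bin(c ^ home).count("1"))

    best, failures, searched = [], 0, 0
    for cluster in order:
        members = buckets[cluster]
        searched += len(members)
        # Cosine distance, assuming unit-length vectors.
        candidates = [(1.0 - float(docs[i] @ query), i) for i in members]
        merged = sorted(best + candidates)[:k]
        improved = merged != best
        best = merged
        if cluster != home:
            failures = 0 if improved else failures + 1
            if failures >= patience:
                break  # early termination
    return best, searched


docs = rng.standard_normal((200, DIM))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[0]

top, searched = search(query, docs)
print(top[0][1], searched, len(docs))
```

Since the query here is a copy of document 0, that document lands in the query's own cluster and is found in the first phase. The `searched` count shows how many documents were actually compared.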

The demo shows how to set up the database, encode documents using Azure OpenAI embeddings, and run semantic search queries using the LSH-accelerated approach.

## (1) SQLite Database Setup

Run the script `db_setup.py` to create the required SQLite database file:
```
usage: db_setup.py [-h] -t TAG

Setup the vector database. Creates necessary tables (Documents, Neighbors).

options:
  -h, --help         show this help message and exit
  -t TAG, --tag TAG  Required: Tag or name for the database file (e.g., 'my_vec_library.db')
```
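
Going by the help text above, a typical invocation would be `python db_setup.py -t my_vec_library.db`. The script itself is not shown in this notebook. As a rough sketch, a command-line interface producing that help output could be declared with `argparse` as follows; `build_parser` is a hypothetical name, and the real script may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare a CLI matching the usage text shown above (assumed layout)."""
    parser = argparse.ArgumentParser(
        prog="db_setup.py",
        description="Setup the vector database. "
                    "Creates necessary tables (Documents, Neighbors).",
    )
    parser.add_argument(
        "-t", "--tag",
        required=True,
        help="Required: Tag or name for the database file "
             "(e.g., 'my_vec_library.db')",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["-t", "my_vec_library.db"])
print(args.tag)
```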

## (2) Document pre-processing

(2.1) Create the embedding model by subclassing the `EmbeddingModel` abstract base class. This example uses an OpenAI embedding model deployed in Azure.

In [None]:
import numpy as np
import time
from abstract.base import EmbeddingModel, VectorDatabase
from openai import AzureOpenAI
from tqdm import tqdm

API_KEY = "<your_api_key>"
ENDPOINT = "<your_endpoint>"
DEPLOYMENT = "text-embedding-3-large"
SQLITE_DB_PATH = "<your_db_file_path>.db"


class AzureOpenAIEmbeddingModel(EmbeddingModel):
    def __init__(self, api_key: str, endpoint: str, deployment: str, batch_size: int = 32):
        self.api_key = api_key
        self.endpoint = endpoint
        self.deployment = deployment
        self.batch_size = batch_size
        self.client = AzureOpenAI(
            api_version="2024-12-01-preview",
            azure_endpoint=endpoint,
            api_key=api_key
        )

    @staticmethod
    def get_embedding_dimension() -> int:
        return 3072

    def embed_documents(self, documents: list[str]) -> np.ndarray:
        """
        Given a list of documents, return their embeddings as a numpy array.
        Documents are sent in batches of up to `self.batch_size`.
        If a request fails, the batch size is halved and the request is retried
        after a short pause; a failure at batch size 1 is re-raised.
        """
        all_embeddings = []
        n = len(documents)
        i = 0
        while i < n:
            # Every new batch starts from the configured maximum size.
            batch_size = self.batch_size
            while True:
                batch = documents[i:i + batch_size]
                try:
                    response = self.client.embeddings.create(
                        input=batch,
                        model=self.deployment
                    )
                except Exception:
                    # If even a single-document request fails, give up.
                    if batch_size == 1:
                        raise
                    # Otherwise halve the batch size and retry after a brief backoff.
                    batch_size //= 2
                    time.sleep(1)
                    continue
                all_embeddings.extend(item.embedding for item in response.data)
                # Advance by the number of documents actually sent.
                i += len(batch)
                break
        return np.array(all_embeddings)

    def distance(self, emb1: np.ndarray, emb2: np.ndarray) -> float:
        """
        Cosine distance for normalized embeddings: 1 - dot(emb1, emb2)
        Lower is more similar.
        """
        return 1.0 - float(np.dot(emb1, emb2))
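
The `1 - dot` shortcut in `distance` equals cosine distance only when both vectors have unit length. The sketch below illustrates that assumption with small hand-picked vectors; `normalize` and `cosine_distance` are hypothetical helpers, not part of the notebook's classes.

```python
import numpy as np


def normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = np.linalg.norm(v)
    return v if norm == 0 else v / norm


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance that normalizes first, so inputs may have any length."""
    return 1.0 - float(np.dot(normalize(a), normalize(b)))


a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])   # same direction as a, different length
c = np.array([-4.0, 3.0])  # orthogonal to a

print(cosine_distance(a, b))      # close to 0: same direction
print(cosine_distance(a, c))      # 1.0 up to rounding: orthogonal
print(1.0 - float(np.dot(a, b)))  # -49.0: the shortcut misbehaves without normalization
```

If an embedding model does not return unit-length vectors, normalizing once at insertion time would keep the cheaper `1 - dot` form valid at query time.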


(2.2) Implement `VectorDatabase` using the `AzureOpenAIEmbeddingModel` defined above.

In [None]:
class VectorDatabaseAzure(VectorDatabase):
    def get_embedding_model(self):
        return AzureOpenAIEmbeddingModel(
            api_key=API_KEY,
            endpoint=ENDPOINT,
            deployment=DEPLOYMENT,
        )
    
vdb = VectorDatabaseAzure(SQLITE_DB_PATH)
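
Because the database subclass only supplies a model through `get_embedding_model`, swapping models should amount to returning a different `EmbeddingModel` subclass. To stay self-contained, the sketch below defines a hypothetical stand-in for the abstract base, inferred from the three methods implemented in (2.1); the real `abstract.base.EmbeddingModel` is not shown here and may differ. `HashingEmbeddingModel` is a toy offline model for illustration only and is not semantically meaningful.

```python
from abc import ABC, abstractmethod
import hashlib

import numpy as np


class EmbeddingModelSketch(ABC):
    """Hypothetical stand-in for abstract.base.EmbeddingModel."""

    @staticmethod
    @abstractmethod
    def get_embedding_dimension() -> int: ...

    @abstractmethod
    def embed_documents(self, documents: list[str]) -> np.ndarray: ...

    @abstractmethod
    def distance(self, emb1: np.ndarray, emb2: np.ndarray) -> float: ...


class HashingEmbeddingModel(EmbeddingModelSketch):
    """Toy model: counts words hashed into a fixed number of buckets."""

    @staticmethod
    def get_embedding_dimension() -> int:
        return 64

    def embed_documents(self, documents: list[str]) -> np.ndarray:
        dim = self.get_embedding_dimension()
        out = np.zeros((len(documents), dim))
        for row, doc in enumerate(documents):
            for word in doc.lower().split():
                digest = hashlib.md5(word.encode("utf-8")).digest()
                out[row, digest[0] % dim] += 1.0
        # Normalize rows to unit length so 1 - dot is a cosine distance.
        norms = np.linalg.norm(out, axis=1, keepdims=True)
        return out / np.where(norms == 0, 1.0, norms)

    def distance(self, emb1: np.ndarray, emb2: np.ndarray) -> float:
        return 1.0 - float(np.dot(emb1, emb2))


model = HashingEmbeddingModel()
embs = model.embed_documents(
    ["machine learning news", "machine learning news", "cooking recipes"]
)
print(embs.shape)                        # (3, 64)
print(model.distance(embs[0], embs[1]))  # close to 0: identical documents
```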

(2.3) Run the documents' text through the model and store the embeddings in the database.

In [None]:
# test.tgt is from https://huggingface.co/datasets/alexfabbri/multi_news/tree/main/data
with open("test.tgt", "r", encoding="utf-8") as file:
    lines = file.read().splitlines()

batch_size = 64
for i in tqdm(range(0, len(lines), batch_size)):
    vdb.add_documents(lines[i:i + batch_size])

100%|██████████| 87/87 [10:12<00:00,  7.04s/it]


## (3) Demo: using the vector DB to search for documents

In [25]:
query = "something about machine learning"
# Use the Vector DB to fetch documents
docs = vdb.get_closest_documents(query, 5, search_all=False)
docs

Docs searched: 2904


[(0.7283551096916199,
  '– You probably can already identify the contents of most of your photos, but this is still fun. A new website from Stephen Wolfram, whom you may know from the search tool WolframAlpha, lets you drag and drop any photo; it will then in theory identify what\'s in it. Right now, ImageIdentify manages some impressive feats, the Verge reports: For instance, it was able to tell that a picture of a cow was a black angus. On the other hand, it thought a cupcake was a bottle cap. The Wolfram Language team is happy to acknowledge the mistakes. In a blog post, Wolfram notes that "somehow the errors seemed very understandable, and in a sense very human. It seemed as if what ImageIdentify was doing was successfully capturing some of the essence of the human process of identifying images." In the meantime, it does have some practical uses: At PC World, Jared Newman writes that "using the site, I was able to figure the breed of dog that kept following my wife and I around on 

In [27]:
docs = vdb.get_closest_documents(query, 5, search_all=True)
docs

Docs searched: 5622


[(0.7283551096916199,
  '– You probably can already identify the contents of most of your photos, but this is still fun. A new website from Stephen Wolfram, whom you may know from the search tool WolframAlpha, lets you drag and drop any photo; it will then in theory identify what\'s in it. Right now, ImageIdentify manages some impressive feats, the Verge reports: For instance, it was able to tell that a picture of a cow was a black angus. On the other hand, it thought a cupcake was a bottle cap. The Wolfram Language team is happy to acknowledge the mistakes. In a blog post, Wolfram notes that "somehow the errors seemed very understandable, and in a sense very human. It seemed as if what ImageIdentify was doing was successfully capturing some of the essence of the human process of identifying images." In the meantime, it does have some practical uses: At PC World, Jared Newman writes that "using the site, I was able to figure the breed of dog that kept following my wife and I around on 