**Author:** J. Žovák, `482857@mail.muni.cz`

In [None]:
%reload_ext autoreload
%autoreload 2

# Learned Vector Database (LVD) Usage Notebook
This notebook demonstrates the functionality of LVD.
First, it shows how to create a collection and upload data into it.
After the data upload, the index is built and visualised.
Next, all search query types that LVD supports are presented.
Currently, the following query types are supported: regular kANN search, constrained search, and hybrid search.

In [None]:
import chromadb
import pandas as pd
from tqdm.notebook import tqdm
from demo_utils import visualize_dataset, plot_bucket_items, visualize_bucket_order

## Load Dataset
In this demo, a dataset called `synthetic_clusters_colored.csv` is used to present the capabilities of LVD. 
This dataset contains the coordinates of points that form 4 clusters in 2D space. 
Each point is assigned a color based on its cluster. 
Each point also has a document assigned to it. The document is a single sentence about some topic, and each cluster contains sentences about one particular topic.

In [None]:
csv_file_path = 'data/synthetic_clusters_colored.csv'
data = pd.read_csv(csv_file_path)

### Pick Query Point
A query point is selected that will be used across the different search operations. 
This allows us to observe how the output of the search changes with the search operation. 
Feel free to change the query point.

In [None]:
query_color = "purple"
query_point = data[data['cluster'] == query_color].iloc[0]
print("Selected query point: \n", query_point)
query_point = query_point[['x', 'y']]

### Visualize Data
Below is the visualization of the dataset with the selected query. The visualization shows that the dataset contains 4 distinct clusters. Each cluster has a color assigned to it, and the documents within a cluster are about one particular topic. Note that the vectors are generated randomly and are **not** created by embedding the documents.


In [None]:
visualize_dataset(data, query_point)

## Set Up Database And Collection
Now we will connect to the database and create a collection for the synthetic dataset. Then we will upload the data to the database. After the data is uploaded we will build the index.

Connect to the database and delete previously created collections.

In [None]:
# Connect to the database
client = chromadb.Client()

# Delete all previously created collections, not only the first one
for existing_collection in client.list_collections():
    client.delete_collection(existing_collection.name)

Create a collection for the synthetic dataset and specify the LMI configuration that will be used for it.

In [None]:
collection_name = "synthetic_collection"
collection = client.create_collection(
    name=collection_name,
    metadata={
        "lmi:epochs": "[200]",
        "lmi:model_types": "['MLP']",
        "lmi:lrs": "[0.01]",
        "lmi:n_categories": "[4]",
        "lmi:kmeans": "{'verbose': False, 'seed': 2023, 'nredo': 10}",
    }
)

Upload data in batches to the collection.

In [None]:
# Use batch upload just to test it out
batch_size = 25
for i in tqdm(range(0, len(data), batch_size), desc="Adding documents"):
    collection.add(
        embeddings=data[['x', 'y']].iloc[i: i + batch_size].values.tolist(),
        metadatas=[{"cluster": cluster} for cluster in data['cluster'].iloc[i: i + batch_size]],
        ids=data['id'].iloc[i: i + batch_size].values.tolist(),
        documents=[document for document in data['document'].iloc[i: i + batch_size]]
    )

Build the LMI index over the uploaded data.

In [None]:
bucket_assignment = collection.build_index()

### Visualize LMI Buckets

In [None]:
# Map each id in data to its bucket using the assignment returned by build_index()
data['bucket'] = data['id'].map(lambda x: list(bucket_assignment.get(x, [])))
# String form of the bucket path, convenient as a categorical label in plots
data['bucket_str'] = data['bucket'].apply(str)

### Visualize buckets

Let's visualize how the LMI indexed the data by looking at the distribution of items in the buckets (the leaves of the tree).

In [None]:
plot_bucket_items(data, False)

Now let's look at the items in each bucket by their color.

In [None]:
plot_bucket_items(data, True)

The visualization above shows that each bucket contains items from exactly one cluster.

## Regular kANN search

Below we perform a kANN search. We search for the 5 objects in the collection that are most similar to our query. In the query we specify the query vector itself with `query_embeddings`, the desired output format with `include`, the number of items to retrieve with `n_results`, and finally the number of buckets to search with `n_buckets`. `n_buckets` is a search hyperparameter of LMI: the more buckets we search, the more precise the answer, but the longer the search takes.
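The `n_buckets` trade-off can be illustrated with a minimal, self-contained sketch. It is not LVD's implementation: LMI ranks buckets with learned models, whereas this toy ranks them by distance to the bucket centroid, and `search_buckets` and `toy_buckets` are hypothetical names used only for illustration.

```python
import math

def search_buckets(query, buckets, k, n_buckets):
    """Return the k nearest points found in the n_buckets best-ranked buckets."""
    def centroid(points):
        return tuple(sum(coords) / len(points) for coords in zip(*points))

    # Rank buckets by centroid distance (a stand-in for the learned ranking).
    ranked = sorted(buckets, key=lambda b: math.dist(query, centroid(buckets[b])))
    # Only the points inside the first n_buckets buckets are candidates.
    candidates = [p for b in ranked[:n_buckets] for p in buckets[b]]
    return sorted(candidates, key=lambda p: math.dist(query, p))[:k]

toy_buckets = {
    "A": [(0.0, 0.0), (0.0, 4.0), (4.0, 0.0), (4.0, 4.0)],  # centroid (2, 2)
    "B": [(4.5, 2.0), (9.0, 2.0)],                          # centroid (6.75, 2)
}
query = (4.2, 2.0)

one_bucket = search_buckets(query, toy_buckets, k=1, n_buckets=1)
two_buckets = search_buckets(query, toy_buckets, k=1, n_buckets=2)
print(one_bucket, two_buckets)
```

With one bucket the search only sees bucket `A` and misses the true nearest neighbour `(4.5, 2.0)`, which lives in bucket `B`; searching both buckets finds it at the cost of scanning more points.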

In [None]:
results = collection.query(
    query_embeddings=list(query_point),
    include=["metadatas", 'distances', "documents"],
    n_results=5, # Specifies k objects to retrieve from the collection
    n_buckets=1, # Number of buckets LMI is supposed to search through
)

Below we can see the result of our query: the 5 objects most similar to the query. 

In [None]:
print("Ids: ", results['ids'])
print("Distances: ", results['distances'])
print("Metadata: ", results['metadatas'])
print("Documents: ", results['documents'])

## Constrained Search

Let's perform a constrained search by requiring that the objects in the result set satisfy the condition `cluster == "red"`. This condition is specified in the `where` argument below. For constrained search, three additional arguments affect the behaviour of the similarity search. `bruteforce_threshold` specifies at what percentage of the dataset a brute-force search should be used instead of LMI (if unspecified, brute force is used when fewer than 20 000 objects remain after applying the `where` condition). `constraint_weight` specifies how strongly the buckets with objects satisfying the condition should be prioritised during navigation in the LMI. Lastly, if `search_until_bucket_not_empty` is set to `True`, the LMI searches more than `n_buckets` buckets when no objects satisfying the condition were found in the first `n_buckets`.
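The brute-force decision described above can be sketched as follows. This is only an illustration under stated assumptions: `should_use_bruteforce` is a hypothetical helper, not part of LVD, and the comparison direction (brute force when the remaining fraction is below the threshold) is an assumption chosen so that `bruteforce_threshold=0.0` always uses the index, as in the queries in this notebook.

```python
def should_use_bruteforce(n_total, n_matching, bruteforce_threshold=None,
                          default_limit=20_000):
    """Decide between brute force and the index (hypothetical helper)."""
    if bruteforce_threshold is None:
        # Default described in the text: brute force when few objects remain.
        return n_matching < default_limit
    if n_total == 0:
        return True  # empty collection, scanning is trivially cheap
    # Assumption: brute force when the remaining fraction is below the threshold.
    return n_matching / n_total < bruteforce_threshold

# 25 of 100 objects satisfy the condition.
forced_index = should_use_bruteforce(100, 25, bruteforce_threshold=0.0)
default_rule = should_use_bruteforce(100, 25)
half_threshold = should_use_bruteforce(100, 25, bruteforce_threshold=0.5)
print(forced_index, default_rule, half_threshold)
```

Under these assumptions, a threshold of `0.0` never triggers brute force, which is why it is convenient for demonstrating the LMI navigation on a small dataset.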

In [None]:
%%time
filter_color = "red"
results = collection.query(
    query_embeddings=list(query_point),
    include=["metadatas", 'documents', 'distances'],
    where={"cluster": filter_color},
    n_results=5,
    n_buckets=1,
    bruteforce_threshold=0.0, 
    constraint_weight=0.0,
    search_until_bucket_not_empty=True
)

In [None]:
print("Ids: ", results['ids'])
print("Distances: ", results['distances'])
print("Metadata: ", results['metadatas'])
print("Documents: ", results['documents'])
print("Constraint Weight Used: ", results['constraint_weight'])
print("LMI Bucket Order: ", results['bucket_order'])

The `constraint_weight` parameter influences the resulting bucket order (a list), which determines the order in which the buckets are searched. Since the parameter is set to `0.0` here, the buckets are ordered purely by their similarity to the query. For constrained search, however, ordering by similarity alone might not be enough, because a bucket with objects satisfying the condition may not be at the beginning of the list. Hence, we might miss those objects (when `search_until_bucket_not_empty` is `False`, as in real applications where we cannot afford to search many buckets), or the search may take a very long time to find them (when `search_until_bucket_not_empty` is `True`).
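To make the effect of the weight concrete, here is a minimal sketch of one way a bucket order could blend similarity with constraint satisfaction. The linear blend, the function `order_buckets`, and all numbers are assumptions made for illustration; the scoring LVD actually uses is not shown in this notebook.

```python
def order_buckets(similarity, satisfying_fraction, constraint_weight):
    """Order bucket ids by a linear blend of similarity and constraint match."""
    def score(bucket):
        return ((1.0 - constraint_weight) * similarity[bucket]
                + constraint_weight * satisfying_fraction[bucket])
    # Highest blended score first.
    return sorted(similarity, key=score, reverse=True)

# Hypothetical per-bucket similarity of the query to each bucket.
similarity = {0: 0.9, 1: 0.6, 2: 0.3, 3: 0.1}
# Hypothetical fraction of each bucket that satisfies the condition.
satisfying_fraction = {0: 0.0, 1: 0.0, 2: 0.0, 3: 1.0}

order_similarity_only = order_buckets(similarity, satisfying_fraction, 0.0)
order_blended = order_buckets(similarity, satisfying_fraction, 0.5)
print(order_similarity_only)
print(order_blended)
```

With a weight of `0.0` the only bucket holding matching objects is visited last; with a larger weight it moves to the front of the order.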

In [None]:
visualize_bucket_order(data, results['bucket_order'][0])

Now let's set `constraint_weight` to `0.5` and observe how the resulting bucket order changes.

In [None]:
%%time
filter_color = "red"
results = collection.query(
    query_embeddings=list(query_point),
    include=["metadatas", 'documents', 'distances'],
    where={"cluster": filter_color},
    n_results=5,
    n_buckets=1,
    bruteforce_threshold=0.0, 
    constraint_weight=0.5,
    search_until_bucket_not_empty=True
)

In [None]:
print("Ids: ", results['ids'])
print("Distances: ", results['distances'])
print("Metadata: ", results['metadatas'])
print("Documents: ", results['documents'])
print("Constraint Weight Used: ", results['constraint_weight'])
print("LMI Bucket Order: ", results['bucket_order'])

As we can see, with `constraint_weight` set to `0.5` the bucket we are interested in is moved to the beginning of the bucket order. Picking an optimal value for `constraint_weight` can be tricky: if it is too high, the search no longer follows similarity, which leads to poor precision; if it is too low, we may miss the buckets that contain objects satisfying the condition. Based on the experiments, it is best to set it according to the selectivity of the condition: if the percentage of the data remaining after applying the condition is high, `constraint_weight` should be low, and vice versa. This is how `constraint_weight` behaves when it is set to `-1`.
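The selectivity heuristic can be sketched in a few lines. The mapping `1 - fraction_remaining` is a hypothetical rule that merely has the described shape (high remaining fraction gives a low weight); the exact mapping LVD applies for `-1` is not shown in this notebook, and both helper names are assumptions.

```python
def selectivity(metadatas, key, value):
    """Fraction of objects whose metadata satisfies metadata[key] == value."""
    if not metadatas:
        return 0.0
    matching = sum(1 for metadata in metadatas if metadata.get(key) == value)
    return matching / len(metadatas)

def weight_from_selectivity(fraction_remaining):
    """Hypothetical rule: the more data remains, the lower the weight."""
    return 1.0 - fraction_remaining

# Toy metadata: 4 equally sized clusters of 25 objects each.
toy_metadatas = [{"cluster": color}
                 for color in ("red", "blue", "green", "purple")
                 for _ in range(25)]

fraction = selectivity(toy_metadatas, "cluster", "red")
weight = weight_from_selectivity(fraction)
print(fraction, weight)
```

A condition that keeps a quarter of the data thus receives a fairly high weight, while a condition that keeps almost everything would receive a weight close to zero and leave the order driven by similarity.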

In [None]:
visualize_bucket_order(data, results['bucket_order'][0])

## Hybrid Search

Hybrid search combines keyword search with vector similarity search through reciprocal rank fusion. This type of search gives the user greater control over the search results without specifying a strict condition. The paper [The Chronicles of RAG](https://arxiv.org/abs/2401.07883) showed that this type of search improves the recall of document retrieval.  
In LVD, hybrid search combines the results from LMI with those of the BM25 algorithm through reciprocal rank fusion.
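Reciprocal rank fusion itself is simple enough to sketch. The version below is a generic illustration, not LVD's code: each document scores the sum of `1 / (k + rank)` over the rankings it appears in. The constant `k = 60` is a commonly used default and an assumption here, and the document ids and rankings are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists into one list ordered by RRF score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked high in any list receive a larger contribution.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from the vector index and from keyword search.
vector_ranking = ["doc_a", "doc_b", "doc_c"]
keyword_ranking = ["doc_d", "doc_a", "doc_e"]

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)
```

`doc_a` comes first because it appears in both rankings, while `doc_d` is ranked second even though the vector ranking never returned it, which is the behaviour a hybrid query relies on.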

In [None]:
results = collection.query(
    query_embeddings=list(query_point),
    include=["metadatas", 'distances', "documents"],
    n_results=5, # Specifies k objects to retrieve from the collection
    n_buckets=1, # Number of buckets LMI is supposed to search through
    where_document={"$hybrid":{ "$hybrid_terms": ["digital", "data", "programming"]}}
)

As we can see from the result below, the result set contains different ids than the regular kANN search and is ordered by the ranking obtained from reciprocal rank fusion. The `id97` is first in the result list since it contains terms specified in `$hybrid_terms` and is also similar to the query vector. Also, as can be seen below, `'id12'` has distance -1 because it was retrieved by the BM25 algorithm and not by the LMI.

In [None]:
print("Ids: ", results['ids'])
print("Distances: ", results['distances'])
print("Metadata: ", results['metadatas'])
print("Documents: ")
for doc in results['documents'][0]:
    print(doc)
    print()

# Data Manipulation

## Delete embedding

In [None]:
collection.delete(['id59'])

In [None]:
bucket_assignment = collection.build_index()

In [None]:
results = collection.query(
    query_embeddings=list(query_point),
    include=["metadatas", 'distances', "documents"],
    n_results=5, # Specifies k objects to retrieve from the collection
    n_buckets=1, # Number of buckets LMI is supposed to search through
)

# id59 should no longer be in the results
print("Ids: ", results['ids'])

# Persistency

In [None]:
persistentClient = chromadb.PersistentClient()

In [None]:
collection = persistentClient.create_collection(
    name="persistent_synthetic_collection",
    metadata={
        "lmi:epochs": "[200]",
        "lmi:model_types": "['MLP']",
        "lmi:lrs": "[0.01]",
        "lmi:n_categories": "[4]",
        "lmi:kmeans": "{'verbose': False, 'seed': 2023, 'nredo': 10}",
    }
)

In [None]:
# Use batch upload just to test it out
batch_size = 25
for i in tqdm(range(0, len(data), batch_size), desc="Adding documents"):
    collection.add(
        embeddings=data[['x', 'y']].iloc[i: i + batch_size].values.tolist(),
        metadatas=[{"cluster": cluster} for cluster in data['cluster'].iloc[i: i + batch_size]],
        ids=data['id'].iloc[i: i + batch_size].values.tolist(),
        documents=[document for document in data['document'].iloc[i: i + batch_size]]
    )

In [None]:
bucket_assignment = collection.build_index()

### Restart the Jupyter kernel and re-run the import, dataset, and query point cells. Then load the data and the index from disk.

In [None]:
anotherClient = chromadb.PersistentClient()

In [None]:
anotherCollection = anotherClient.get_collection("persistent_synthetic_collection")

In [None]:
results = anotherCollection.query(
    query_embeddings=list(query_point),
    include=["metadatas", 'distances', "documents"],
    n_results=5, # Specifies k objects to retrieve from the collection
    n_buckets=1, # Number of buckets LMI is supposed to search through
)

In [None]:
print("Ids: ", results['ids'])
print("Distances: ", results['distances'])
print("Metadata: ", results['metadatas'])
print("Documents: ", results['documents'])