# Use a vector database for storage and retrieval

In the [last notebook](11-similarity-embeddings.ipynb) you saw how
embeddings can be calculated and used to retrieve data. We saved
the embeddings as `.npy` files so that we do not have to calculate them
again.

Retrieval worked by calculating the similarity of the question to all
the answers. In a scenario with just a few thousand documents, this works
well. However, as the number of documents increases, we have to find a more
scalable solution. This can be achieved with a vector database.
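
To make the cost of that approach concrete, here is a minimal sketch of such a brute-force search in NumPy. The function name `brute_force_search` and the toy vectors are made up for illustration; the real embeddings from the previous notebook would take the place of `docs`. Every query touches every document vector, which is why the runtime grows linearly with the size of the collection.

```python
import numpy as np

def brute_force_search(query_vec, doc_vecs, top=3):
    """Return (index, cosine similarity) pairs for the `top` most similar rows."""
    # Normalize so that the dot product equals the cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                      # one similarity per document
    order = np.argsort(-scores)[:top]   # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]

# Toy data standing in for the document embeddings (hypothetical values)
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
hits = brute_force_search(np.array([1.0, 0.1]), docs, top=2)
print(hits)
```

A vector database avoids part of this work by building an index over the vectors, typically at the price of approximate rather than exact results.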

The vector database stores the document vectors and performs a
*similarity search* on them. In this notebook, we use
[Milvus](https://milvus.io/de) as the vector database. Milvus is much
more popular than usearch, but also a lot slower in this scenario.

Therefore, we won't run this notebook in the live course.

## Load data (from previous notebook)

In [None]:
import json
with open("sentences.json") as f:
    sentences = json.load(f)

In [None]:
len(sentences)

In [None]:
import numpy as np
with open("sentences-mqa.npy", "rb") as f:
    sembeddings = np.load(f)

## Vector DB

In [None]:
# usearch?!

In [None]:
from pymilvus import MilvusClient

In [None]:
client = MilvusClient("un-78.db")

In [None]:
data = [ { "id": i, 
           "vector": sembeddings[i], 
           "text": sentences[i] } for i in range(len(sembeddings)) ]

We could add many more fields here, such as `country`. These fields can then be used for filtering.
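
As a rough sketch of how such filtering might look: the helpers `build_records` and `search_in_country` below are hypothetical, and so are the `country` values. The sketch assumes that the collection was created with the quick-setup call used in this notebook (which, to my understanding, accepts additional fields as dynamic fields) and relies on the `filter` parameter of `MilvusClient.search`.

```python
def build_records(embeddings, texts, countries):
    """Combine vectors, texts and a metadata field into Milvus-style records."""
    return [{"id": i, "vector": list(vec), "text": text, "country": country}
            for i, (vec, text, country) in enumerate(zip(embeddings, texts, countries))]

def search_in_country(client, collection, query_vector, country, top=20):
    """Vector search restricted to rows whose `country` field matches."""
    return client.search(collection_name=collection, data=[query_vector], limit=top,
                         filter=f'country == "{country}"',  # scalar filter expression
                         output_fields=["text", "country"])

# Toy input (hypothetical values) to show the resulting record layout
records = build_records([[0.1, 0.2], [0.3, 0.4]], ["first", "second"], ["DE", "FR"])
print(records[1])
```

With a real client, a call such as `search_in_country(client, "mqa", question_embedding, "DE")` would then only consider sentences tagged with that country.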

In [None]:
# remove a collection left over from a previous run so that this notebook can be re-executed
client.drop_collection(collection_name="mqa")

In [None]:
%%time
client.create_collection(collection_name="mqa", dimension=sembeddings[0].shape[0])
res = client.insert(collection_name="mqa", data=data)

In [None]:
# the model is needed to calculate embeddings for new queries
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')

In [None]:
import pandas as pd
def search(query, client, collection, model, query_prompt_name=None, top=20):
    # encode the query as a normalized embedding vector
    question_embedding = model.encode(query, normalize_embeddings=True, prompt_name=query_prompt_name)
    
    # search vector database
    hits = client.search(collection_name=collection, data=[question_embedding], limit=top,
                        output_fields=["text"])
    
    # Return as dataframe
    return pd.DataFrame([{ "id": r["id"], 
                           "text": r["entity"]["text"], 
                           "score": r["distance"] } for r in hits[0]])

In [None]:
pd.set_option('display.max_colwidth', 0)

In [None]:
search("Is the climate crisis worse for poorer countries?", client, "mqa", model)

In [None]:
model3 = SentenceTransformer('Snowflake/snowflake-arctic-embed-l-v2.0')
with open("sentences-arctic.npy", "rb") as f:
    sembeddings3 = np.load(f)

In [None]:
data = [ { "id": i, 
           "vector": sembeddings3[i], 
           "text": sentences[i] } for i in range(len(sembeddings3)) ]

In [None]:
client.drop_collection(collection_name="arctic")

In [None]:
client.create_collection(collection_name="arctic", dimension=sembeddings3[0].shape[0])
res = client.insert(collection_name="arctic", data=data)

In [None]:
search("Is the climate crisis worse for poorer countries?", 
       client, "arctic", model3, query_prompt_name="query")