# Use a vector database for storage and retrieval

In the [last notebook](11-similarity-embeddings.ipynb) you saw how
embeddings can be calculated and used to retrieve data. We saved
the embeddings as `.npy` files so that we do not have to calculate them
again.

Retrieval worked by calculating the similarity of the question to all
the answers. With just a few thousand documents, this works well.
However, every query has to be compared with every document, so the cost
grows with the size of the collection and we need a more scalable
solution. A vector database provides one.
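For reference, the brute-force approach described above can be sketched in a few lines of NumPy. This is a minimal illustration on random toy vectors, not the code of the previous notebook; the function name `brute_force_search` and the toy data are assumptions made for this example.

```python
import numpy as np

def brute_force_search(query_vec, doc_vecs, top=3):
    """Return (indices, scores) of the `top` most similar rows by cosine similarity."""
    # normalize so that the dot product equals the cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                      # one similarity per document
    order = np.argsort(-scores)[:top]   # highest similarity first
    return order, scores[order]

# toy data: 1000 random "document" vectors with 16 dimensions
rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 16))
# a query that is a slightly perturbed copy of document 42
query = docs[42] + 0.01 * rng.normal(size=16)

ids, scores = brute_force_search(query, docs)
print(ids, scores)
```

The line `d @ q` touches every document vector once per query, which is why this approach stops scaling for large collections.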

The vector database stores the document vectors and performs a
*similarity search* on them. In this notebook, we use
[usearch](https://github.com/unum-cloud/usearch) as the vector database.
It is a lightweight solution, but it illustrates the concept well.

## Load data (from previous notebook)

In [None]:
import json
with open("sentences.json") as f:
    sentences = json.load(f)

In [None]:
len(sentences)

In [None]:
import numpy as np
with open("sentences-arctic.npy", "rb") as f:
    sembeddings = np.load(f)

In [None]:
sembeddings.shape
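Before indexing, it can be worth checking that the sentences and the embeddings belong together. The sketch below uses a hypothetical helper, `check_embeddings`, and runs on toy data; in this notebook it would be called as `check_embeddings(sentences, sembeddings)`. It assumes that one embedding row was stored per sentence, and it also reports whether the rows already have unit length.

```python
import numpy as np

def check_embeddings(texts, embeddings, atol=1e-3):
    """Verify one embedding row per text and report whether rows have unit length."""
    if embeddings.ndim != 2:
        raise ValueError("expected a 2D array of shape (n_texts, n_dims)")
    if len(texts) != embeddings.shape[0]:
        raise ValueError(f"{len(texts)} texts but {embeddings.shape[0]} embedding rows")
    norms = np.linalg.norm(embeddings, axis=1)
    return {
        "count": len(texts),
        "ndim": embeddings.shape[1],
        "normalized": bool(np.allclose(norms, 1.0, atol=atol)),
    }

# toy data standing in for the real sentences and embeddings
toy_texts = ["a", "b", "c"]
toy_vecs = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
unit_vecs = toy_vecs / np.linalg.norm(toy_vecs, axis=1, keepdims=True)

report_raw = check_embeddings(toy_texts, toy_vecs)
report_unit = check_embeddings(toy_texts, unit_vecs)
print(report_raw)
print(report_unit)
```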

## Use usearch vector DB

In [None]:
from usearch.index import Index, MetricKind

# create an index with the correct number of dimensions
index = Index(ndim=sembeddings.shape[1], metric='cos')

Add all vectors to the database:

In [None]:
%%time
index.add(list(range(len(sembeddings))), sembeddings)

In [None]:
index.save("sentences-arctic.usearch")

In [None]:
# the model is needed to calculate embeddings for new queries
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('Snowflake/snowflake-arctic-embed-l-v2.0')

In [None]:
import pandas as pd

def search(query, index, sentences, model, query_prompt_name=None, top=20):
    # encode the query into an embedding vector (same model as for the documents)
    question_embedding = model.encode(query, normalize_embeddings=True, prompt_name=query_prompt_name)

    # search the vector database; the cosine metric was already set when the index was created
    hits = index.search(question_embedding, top)

    # return as a dataframe; usearch reports a cosine *distance*, so convert it to a similarity score
    return pd.DataFrame([{"id": r.key,
                          "text": sentences[r.key],
                          "score": 1 - r.distance} for r in hits])

In [None]:
pd.set_option('display.max_colwidth', 0)

In [None]:
search("Is the climate crisis worse for poorer countries?", index, sentences, model, query_prompt_name="query")
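The `score` column returned by `search` is computed as `1 - r.distance`. This relies on the cosine distance being one minus the cosine similarity, which is the usual convention and the one assumed here for the `cos` metric. The following NumPy sketch (the helper names are illustrative) makes the relationship concrete.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1 = same direction, 0 = orthogonal)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a, b):
    """Cosine distance under the assumed convention: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)

reference = np.array([1.0, 0.0])
examples = {
    "same direction": np.array([2.0, 0.0]),
    "orthogonal": np.array([0.0, 3.0]),
    "opposite": np.array([-1.0, 0.0]),
}

for name, vec in examples.items():
    distance = cosine_distance(reference, vec)
    score = 1 - distance  # the same conversion as in search()
    print(f"{name}: distance={distance:.2f}, score={score:.2f}")
```

A small distance therefore corresponds to a high score, which is why the best hits appear with scores close to 1.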