# Vector Database Example.
This is a modified version of the notebook that appears in [this Medium article](https://arupnanda.medium.com/lowdown-on-vector-databases-ec39fe70a17). The article is Part 2 of a 3-part series that provides a good introduction to vector databases.

## Initial dataset.

In [None]:
# From Hugging Face dataset library
from datasets import load_dataset

We use the dataset `wiki_qa`, known as the "WikiQA Corpus," which is provided by Microsoft. Find the official documentation [here](https://huggingface.co/datasets/wiki_qa). Each record in this dataset corresponds to a single question/answer pair.

In [None]:
ds = load_dataset('wiki_qa', split='train')

Here are the first 5 entries in the dataset. Note the `'label'` key: it is 1 when the answer correctly addresses the question and 0 otherwise:

In [None]:
first_five = ds[:5]
for i in range(5):
    print('\nRECORD {}'.format(i))
    for key in first_five:
        print(key, ':', first_five[key][i])
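
Slicing a Hugging Face dataset, as in `ds[:5]`, returns a dictionary that maps each column name to a list of values, which is why the loop above indexes `first_five[key][i]`. As a minimal sketch, the helper below converts such a column-oriented dictionary into one dictionary per record. The name `columns_to_rows` and the toy data are illustrative and not part of the original notebook:

```python
def columns_to_rows(batch):
    """Turn {'col': [v0, v1, ...], ...} into [{'col': v0, ...}, {'col': v1, ...}, ...]."""
    keys = list(batch)
    # zip(*columns) walks the columns in lockstep, yielding one tuple per record
    return [dict(zip(keys, values)) for values in zip(*(batch[k] for k in keys))]


# Toy stand-in for a dataset slice; these values are made up for illustration
toy_batch = {
    'question': ['what is a vector?', 'what is a database?'],
    'label': [0, 1],
}

rows = columns_to_rows(toy_batch)
for i, row in enumerate(rows):
    print('RECORD', i, row)
```

With the real data, `columns_to_rows(ds[:5])` should give five such dictionaries, one per question/answer pair.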

We collect just the questions in this dataset, and remove duplicates:

In [None]:
# Collect every question, then drop duplicates (note: set() does not preserve order)
questions = list(set(ds['question']))

print('\nNumber of unique questions:', len(questions), '\n')
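
Note that `set` does not preserve insertion order, so the order of `questions` (and therefore which ID each question receives later on) can differ between runs. If a reproducible ordering matters, one option is to deduplicate with `dict.fromkeys`, which keeps the first occurrence of each item in its original position. A minimal sketch on made-up data, where `unique_in_order` is a hypothetical helper name:

```python
def unique_in_order(items):
    """Remove duplicates while keeping the first occurrence of each item."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


# Made-up sample with repeated questions
sample = [
    'how are glaciers formed?',
    'what is a cell?',
    'how are glaciers formed?',
    'what is a cell?',
    'who wrote Hamlet?',
]
deduped = unique_in_order(sample)
print(deduped)
```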

Questions 0 through 9:

In [None]:
questions[:10]

## ChromaDB.
ChromaDB is an "open-source embedding database," i.e., an open-source vector database that makes use of vector embeddings. Find the official documentation [here](https://docs.trychroma.com/getting-started). If you do not already have the ChromaDB library, run `!pip install chromadb`.

In [None]:
import chromadb

Create a *Client* object for interacting with the database:

In [None]:
client = chromadb.Client()

Create a new collection, called `'my_collection'`:

In [None]:
my_collection = client.create_collection(name='my_collection')

### Set up embeddings.
We will need:
1. An ID for the record.
2. A "document", namely the question we collected.
3. A vector representation of the document, i.e., a vector embedding.
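
These three pieces are handled as parallel lists: position `i` in each list describes the same record. A minimal sketch with made-up questions and toy two-dimensional vectors standing in for real embeddings (a real embedding model produces much longer vectors); the `toy_` names are illustrative only:

```python
# Made-up questions and toy 2-D "embeddings", for illustration only
toy_documents = ['what is a vector?', 'what is a database?']
toy_embeddings = [[0.1, 0.9], [0.8, 0.2]]
toy_ids = [str(i) for i in range(len(toy_documents))]  # this notebook uses string IDs

# The three lists must line up one-to-one
assert len(toy_ids) == len(toy_documents) == len(toy_embeddings)

for record_id, doc, vec in zip(toy_ids, toy_documents, toy_embeddings):
    print(record_id, doc, vec)
```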

We will encode questions using a model in the Hugging Face *Sentence Transformers* library. Find official documentation [here](https://www.sbert.net/#). If you haven't installed this library, run `!pip install -U sentence-transformers`. **Note:** I had to update my Hugging Face Hub as well by running `!pip install --upgrade huggingface_hub`.

In [None]:
from tqdm.auto import tqdm # For tracking progress and runtime
from sentence_transformers import SentenceTransformer # Hugging Face's Sentence Transformer library

In [None]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

Upsert (i.e., update or insert, depending on whether a record with that ID already exists) into `'my_collection'` in batches of size 128:

In [None]:
batch_size = 128
total_size = len(questions) # Number of unique questions collected above
for ctr in tqdm(range(0, total_size, batch_size)):
    ctr_end = min(ctr + batch_size, total_size)
    IDs = [str(i) for i in range(ctr, ctr_end)]
    documents = questions[ctr:ctr_end]
    embeddings = model.encode(documents).tolist() # Here we encode each question in the batch as a vector
    my_collection.upsert(documents=documents, ids=IDs, embeddings=embeddings)

print('\nDatabase contains {} distinct records.\n'.format(my_collection.count()))
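
The batching arithmetic in the loop above can be factored into a small helper. The sketch below is one way to do it; `batched_ranges` is a hypothetical name, not a ChromaDB or tqdm function. It yields `(start, end)` index pairs that cover a sequence of a given length in fixed-size chunks, with a shorter final chunk when the length is not a multiple of the batch size:

```python
def batched_ranges(total_size, batch_size):
    """Yield (start, end) pairs covering range(total_size) in chunks of batch_size."""
    for start in range(0, total_size, batch_size):
        yield start, min(start + batch_size, total_size)


# 300 items in batches of 128: two full batches and one partial batch
ranges = list(batched_ranges(300, 128))
print(ranges)
```

Each pair can then be used to slice the list of questions, e.g. `questions[start:end]`, exactly as the loop does with `ctr` and `ctr_end`.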

### Query execution.

Suppose we want to ask a question that might not be in the database. For instance:

In [None]:
question = 'why did Americans fight their own'

Think of this `question` as being our query. We do part of the query processing by hand. Namely, we encode the question as a vector:

In [None]:
question_vector = model.encode(question).tolist()

The rest of the query execution can now be done by the vector database. The `.query()` method we use here returns a Python dictionary. Its `'documents'` key holds the documents associated with the records nearest to our `question`. These approximately matching questions are the *values* output by the database's query execution:

In [None]:
similar_vectors = my_collection.query(query_embeddings=[question_vector], n_results=3)

for n, entry in enumerate(similar_vectors['documents'][0]):
    print('Closest question {count}: \'{question}\''.format(count = n, question = entry))
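
Conceptually, the query ranks stored vectors by their distance to the query vector and returns the closest ones. The sketch below does this by brute force in plain Python on made-up 2-D vectors, using squared Euclidean distance. It illustrates the idea only: a vector database typically relies on an index instead of scanning every record, and the function names `squared_l2` and `nearest` are hypothetical.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest(query, vectors, n_results=3):
    """Return (index, distance) pairs for the n_results vectors closest to query."""
    scored = [(i, squared_l2(query, vec)) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:n_results]


# Made-up stored vectors and query, for illustration only
stored = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 0.0]]
hits = nearest([1.0, 0.25], stored, n_results=2)
print(hits)
```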

More detail about the records we retrieved, including each record's ID and its distance from the query vector:

In [None]:
print(f'{"Distance":>8} {"ID":>4} {"Question"}') # Print table header
# Cycle through query output, one row per retrieved record
for distance, record_id, document in zip(similar_vectors['distances'][0], similar_vectors['ids'][0], similar_vectors['documents'][0]):
    print(f'{distance:1.6f} {record_id:>4} {document}') # Print table row
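
The repeated `[0]` in the cells above reflects the shape of the result: each key maps to a list with one inner list per query, because `.query()` can take several queries at once. A sketch with a hand-written dictionary that mimics this shape (the values are made up and are not real output from the database):

```python
# Hand-written stand-in for a query result with one query and two matches
fake_result = {
    'ids': [['17', '4']],
    'documents': [['first made-up question', 'second made-up question']],
    'distances': [[0.25, 0.75]],
}

query_index = 0  # a single query was sent, so its results sit at position 0
rows = list(zip(
    fake_result['distances'][query_index],
    fake_result['ids'][query_index],
    fake_result['documents'][query_index],
))
for distance, record_id, document in rows:
    print(f'{distance:8.6f} {record_id:>4} {document}')
```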

### Deleting your collection.

**WARNING:** Running the following cell will delete the collection `'my_collection'` that you created above.

In [None]:
client.delete_collection(name='my_collection')