# Exercise 2: Introduction to vector databases
In the previous exercise, we explored the clustering capabilities of embedding models. 

This notebook will explore how to use these properties to find semantically similar items.

We will do this first by sorting the items by their similarity score; the higher the score, the more similar the items are.

However, this approach does not scale well to large datasets: every item must be scored, and sorting the scores has $O(n \log n)$ complexity. Therefore, we will also explore how to use vector databases to find similar items more efficiently.
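
If only the few best items are needed, a full sort is not strictly required. The snippet below is a minimal sketch, using made-up scores and Python's standard `heapq` module, of selecting the `k` best items in $O(n \log k)$ time. Note that every item still has to be scored first, which is the part a vector database tries to avoid.

```python
import heapq

# Hypothetical similarity scores, one per item (made up for illustration).
scores = [0.12, 0.87, 0.45, 0.91, 0.30, 0.66]
k = 3

# Indices of the k highest scores, best first, without sorting the full list.
top_k = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

print(top_k)
```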


In [None]:
import json
import numpy as np
from langchain_community.embeddings import HuggingFaceEmbeddings
from llm_in_production.huggingface_utils import get_device
from llm_in_production.numpy_utils import cosine_similarity
from langchain_community.vectorstores import FAISS
import dotenv
import pandas as pd

dotenv.load_dotenv()

We will again use the [MiniLM](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model from [HuggingFace](https://huggingface.co/) to embed the text. To make it easier to use, we load the model through the [HuggingFaceEmbeddings](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.huggingface.HuggingFaceEmbeddings.html) class from LangChain. Running the cell below will download the model from the HuggingFace model hub and load it into memory. This can take a while the first time you run it. However, the model will be cached on your computer, so it will be much faster the next time you run it.

In [None]:
# get_device checks whether an accelerator (such as a GPU) is available and, if so, selects it.
device = get_device()
# Here we create the embedding function that will be used to embed the sentences.
embedding_func = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2", model_kwargs={"device": device})

## Exercise 2a: Finding similar items by sorting
Here, we will explore how to find similar items using cosine similarity.
We do this by taking the following steps:
1. We embed all the items we want to search through. 
2. We embed the query. 
3. We compute the cosine similarity between the query and all the items. This will give us a score for each item. The higher the score, the more similar the item is to the query.
4. We sort the items by their score in descending order, so the most similar items come first.
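
Steps 3 and 4 can be sketched in plain NumPy on toy vectors. This is only an illustration of what cosine similarity computes; it is not necessarily how the `cosine_similarity` helper imported above is implemented, and the three-dimensional "embeddings" are made up.

```python
import numpy as np

# Toy 3-dimensional "embeddings", made up for illustration.
item_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 2.0],
])
query_embedding = np.array([2.0, 0.0, 0.0])

# Cosine similarity = dot product divided by the product of the vector lengths.
item_norms = np.linalg.norm(item_embeddings, axis=1)
query_norm = np.linalg.norm(query_embedding)
scores = item_embeddings @ query_embedding / (item_norms * query_norm)

# Indices of the items, from most to least similar.
ranking = np.argsort(-scores)

print(scores)
print(ranking)
```

Because the score is normalized by the vector lengths, only the direction of each vector matters, not its magnitude.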

The code below does everything except for the sorting. Your task is to sort the items by their score using the [df.sort_values](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) method from pandas.


After finishing the code, experiment with different queries and see how the results change. Also, try to change the sentences to see how that affects the results.

In [None]:
sentences = [
    "Merel said something about cats.",
    "Merel said dogs are awesome.",
    "Samantha's buddies said to meet them in the bar",
    "The cat is walking in the bedroom",
    "The kittens are in the bedroom",
    "The dogs were running across the kitchen",
    "The puppies were running around in the kitchen",
]
# Here we embed all the sentences at once using the embed_documents method
sentence_embeddings = embedding_func.embed_documents(sentences)
# We convert the embeddings to a numpy array/matrix of shape (n_samples, n_features).
sentence_embeddings = np.array(sentence_embeddings)


query = "What did she say?"
# query = "what are the puppies doing?" # This is an alternative query that you can try out.

# Embed the query using the embed_query method.
# It works like embed_documents, except that it takes a single string
# and returns a single embedding.
query_embedding = np.array(embedding_func.embed_query(query))

similarity_score = []
for sentence_embedding in sentence_embeddings:
    similarity_score.append(cosine_similarity(query_embedding, sentence_embedding))

# Here we create a dataframe with the sentences and their similarity score
# the main reason for this is that it renders nicer in jupyter notebooks.
df = pd.DataFrame({"sentences": sentences, "score": similarity_score})
# YOUR CODE HERE START: Use the df.sort_values method to sort the dataframe by the score column.
df = df.sort_values("score", ascending=False)
# YOUR CODE HERE END

print(f"Query: `{query}`")
print("Most similar sentences")
df

## Introduction to vector databases
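Vector databases are databases optimized for similarity search.
Many of them use approximate nearest-neighbor indexes, so they don't need to compare the query against every item to find the most similar ones.
This can make a query faster than an $O(n)$ scan over all items, usually at the cost of occasionally missing a true nearest neighbor.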
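
To give an intuition for how a search can skip most items, here is a minimal sketch loosely inspired by inverted-file indexes: vectors are grouped into cells at build time, and a query only scores the vectors in the few closest cells. The data, the number of cells, and the `search` function are all made up for illustration; this is not how FAISS is implemented in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1,000 random unit vectors standing in for embeddings (made-up data).
vectors = rng.normal(size=(1000, 8))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# Build step: pick 10 data points as cell centers and assign every vector
# to the cell whose center it is most similar to.
n_cells = 10
centers = vectors[rng.choice(len(vectors), size=n_cells, replace=False)]
cell_of_vector = np.argmax(vectors @ centers.T, axis=1)


def search(query, k=3, n_probe=2):
    """Return the ids of the k best candidates and how many vectors were scored."""
    # Only look inside the n_probe cells whose centers are closest to the query.
    probed_cells = np.argsort(-(centers @ query))[:n_probe]
    candidate_ids = np.flatnonzero(np.isin(cell_of_vector, probed_cells))
    scores = vectors[candidate_ids] @ query
    best = np.argsort(-scores)[:k]
    return candidate_ids[best], len(candidate_ids)


ids, n_scored = search(vectors[42])
print(f"Best matches: {ids}, scored {n_scored} of {len(vectors)} vectors")
```

The trade-off is that a true nearest neighbor sitting in a cell that was not probed will be missed; raising `n_probe` scores more vectors and reduces that risk.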

In the remainder of this notebook, we will experiment with the [FAISS](https://github.com/facebookresearch/faiss) vector database.
FAISS is a very fast similarity search library from Facebook (now Meta), which we will use as our vector database.
It is also easy to use because all it takes to install it is `pip install faiss-cpu`.
FAISS also integrates nicely with LangChain, making it even easier to use.

In the cell below, we do the following:
1. We define our query and sentences.
2. We create a vector database around the sentences using the [FAISS.from_texts](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.faiss.FAISS.html#langchain.vectorstores.faiss.FAISS.from_texts) method.
3. We search through the vector database using the [similarity_search](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.faiss.FAISS.html#langchain.vectorstores.faiss.FAISS.similarity_search) method. We pass the query and the number of items we want to retrieve (`k`) as arguments. 
4. We print the results.


Please run the cell below and do the following:
1. Change the query and see how the results change. Does it behave similarly to the previous exercise?
2. Change the number of items we want to retrieve (`k`). What happens if you set it to 1? What happens if you set it to `len(sentences)`?

In [None]:
query = "what are the puppies doing?"
sentences = [
    "Merel said something about cats.",
    "Merel said dogs are awesome.",
    "Samantha's buddies said to meet them in the bar",
    "The cat is walking in the bedroom",
    "The kittens are in the bedroom",
    "The dogs were running across the kitchen",
    "The puppies were running around in the kitchen",
]

vector_database = FAISS.from_texts(sentences, embedding=embedding_func)

k = 3
documents = vector_database.similarity_search(query, k=k)

print("The result is a list of documents, sorted in order of relevance:")
print(documents)

print()
print("Each document has a page_content (the string that was embedded) and optionally metadata")
for doc in documents:
    print(f"- page_content=`{doc.page_content}`")

## Exercise 2b: Introduction to metadata in vector databases
The nice thing about vector databases is that they can also store metadata. This is extra information that is not embedded but can be used to filter the results. You can also use it to keep track of information about the original document, such as:
- The original document id.
- The original document URL.
- The original document title.
- The date the document was added to the database.
- etc.

Please run the cell below and do the following:
- Change the metadata. Does it have any effect on the results?
- Add some additional metadata and print it.
- Add some additional metadata and filter the results based on this metadata (for example, try printing only the documents with an even `original_document_id`).
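
As a minimal sketch of filtering after the search, the example below uses a hypothetical `Doc` class that mimics the `page_content` and `metadata` attributes of the documents returned by the vector database. The results and the `source` and `year` keys are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a search result: text plus free-form metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Made-up search results, already ordered by relevance.
results = [
    Doc("The puppies were running around in the kitchen", {"source": "blog", "year": 2023}),
    Doc("Merel said dogs are awesome.", {"source": "chat", "year": 2021}),
    Doc("Merel said something about cats.", {"source": "chat"}),
]

# Keep only results from 2022 or later. Documents without a "year" are dropped,
# because metadata keys are optional and may be missing.
recent = [doc for doc in results if doc.metadata.get("year", 0) >= 2022]

for doc in recent:
    print(doc.page_content, doc.metadata)
```

Keep in mind that filtering after the search can leave you with fewer than `k` results.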

In [None]:
query = "what are the puppies doing?"
sentences = [
    "Merel said something about cats.",
    "Merel said dogs are awesome.",
    "The dogs were running across the kitchen",
    "The puppies were running around in the kitchen",
]
metadatas = [
    {"original_document_id": 1},
    {"original_document_id": 2},
    {"original_document_id": 3, "some_additional_key": "bla bla"},
    {"original_document_id": 4, "are puppies fun?": "yes"},
]
vector_database = FAISS.from_texts(sentences, metadatas=metadatas, embedding=embedding_func)

k = 3
documents = vector_database.similarity_search(query, k=k)

print("Now each document has a page_content and metadata")
for doc in documents:
    print(f"- page_content=`{doc.page_content}` metadata={doc.metadata}")

print("#" * 80)
# YOUR CODE HERE START: Try to filter the results based on the metadata. (e.g., print only the documents with even original_document_id)
for doc in documents:
    original_document_id = doc.metadata.get("original_document_id")
    if original_document_id is not None and original_document_id % 2 == 0:
        print(f"- page_content=`{doc.page_content}` metadata={doc.metadata}")
# YOUR CODE HERE END

## Exercise 2c: Metadata-based title search
In this exercise, we will search through a database of stories and return the title of the story that best matches a query.

We will do the following:
1. For each story, we will split the story into sentences using the `.split(".")` method.
2. We will then build a vector database from these sentences (with the name of the story they belong to stored as metadata).
3. To find the most relevant story for a given query, the process is then as follows:
    1. Embed the query.
    2. Search through the vector database to find the most similar sentences.
    3. Count how often each story is mentioned in the results (based on the metadata of the found sentences).
    4. Pick the story that is mentioned most often.
    
We have already given you a skeleton of the code. Your task is to fill in the missing parts marked with `# YOUR CODE HERE START` and `# YOUR CODE HERE END`.
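
Two details of the steps above can be sketched without any embeddings. First, a plain `.split(".")` leaves leading spaces and a trailing empty string, so the hypothetical helper `split_into_sentences` below trims and drops those. Second, the majority vote can be expressed with `collections.Counter`. The retrieved story names are made up for illustration.

```python
from collections import Counter


def split_into_sentences(text):
    """Split on periods, trim whitespace, and drop empty pieces."""
    return [part.strip() for part in text.split(".") if part.strip()]


print(split_into_sentences("Snow loved the forest. He ran all day."))

# Made-up story names, as if read from the metadata of the top 3 search results.
retrieved_story_names = ["Biking story", "Snow the husky story", "Biking story"]

# Count the votes and pick the story name that occurs most often.
votes = Counter(retrieved_story_names)
best_story, n_votes = votes.most_common(1)[0]
print(best_story, n_votes)
```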

In [None]:
stories = {
    "Snow the husky story": "Snow, the husky puppy, was born with eyes as blue as the winter sky. He loved bounding through the snow-covered forest, his paws leaving tiny imprints behind. With each playful leap, Snow brought joy and warmth to all who crossed his path, reminding them that even in the coldest of times, there is always a glimmer of happiness.",
    "Siamese twins story": "Luna and Stella, the Siamese twins cats, were inseparable from the moment they were born. With their striking blue eyes and sleek coats, these cats were a sight to behold. Their synchronized movements and playful antics enchanted everyone they met, leaving a lasting impression that two Siamese cats are always better than one.",
    "Biking story": "As the sun kissed the horizon, Sarah hopped on her bike, ready for an adventure. With the wind in her hair and the pedals beneath her feet, she embarked on a journey of freedom and exploration. Each mile brought her closer to new sights, fresh air, and the exhilarating feeling of the open road.",
}

sentences = [] # A list of str, where each str is a sentence.
metadatas = [] # A list of dict, which contain the story name of the corresponding sentence. For example, {"story_name": "Snow the husky"}.


for story_name, story in stories.items():
    # YOUR CODE HERE START: split the story into sentences and store each sentence in the sentences list.
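    for sentence in story.split("."):
        sentence = sentence.strip()
        if not sentence:
            continue  # Skip the empty string left after the final period.
        sentences.append(sentence)
        metadatas.append({"story_name": story_name})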
    # YOUR CODE HERE END

assert len(sentences) > 0, "It looks like you forgot to add the sentences."
assert len(sentences) == len(metadatas), f"Metadata and sentences must have the same length but {len(metadatas)} != {len(sentences)}"

# YOUR CODE HERE START: Create a vector database around the sentences and their metadata. (Hint: use the FAISS.from_texts method)
vector_database = FAISS.from_texts(sentences, metadatas=metadatas, embedding=embedding_func)
# YOUR CODE HERE END

# A query and the corresponding answer it should produce.
queries_and_answer = {
    "Give me a story about biking":  "Biking story",
    "Give me a story about Siamese cats": "Siamese twins story",
    "Give me a story about dogs": "Snow the husky story",
}

# Loop through each query:
for query, answer in queries_and_answer.items():

    # For each query, keep a counter of the recommendations.
    recommendation_per_story = {story_name: 0 for story_name in stories.keys()}

    # Return sentences similar to the query
    k = 3
    similar_sentences = vector_database.similarity_search(query, k=k)
    
    # YOUR CODE HERE START: for each document, count how often each story is mentioned.
    for similar_sentence in similar_sentences:
        story_name = similar_sentence.metadata["story_name"]
        recommendation_per_story[story_name] += 1
    # YOUR CODE HERE END
    
    story_mentioned_most_often = max(recommendation_per_story, key=recommendation_per_story.get)
    print(f"Testing query: `{query}`. Got the following mentions: {recommendation_per_story}" )
    assert story_mentioned_most_often == answer, f"Expected `{answer}` but got `{story_mentioned_most_often}` for query `{query}`."
    print("✅ Passed!")

## PyData talks
Now we know how to build a FAISS vector database with metadata and how to search through it.
Let's try to build a vector database around the PyData talks and try to find the most relevant talks for a given query.

Let's start by loading the data.

In [None]:
with open("pydata.json", "r") as f:
    talks = json.load(f)["talks"]
    titles = [talk["title"] for talk in talks]
    abstracts = [talk["abstract"] for talk in talks]
    descriptions = [talk["description"] for talk in talks]

## Exercise 2d: Search through PyData talks
In this exercise, we will do the following:
1. Create a vector database around the titles, abstracts, and descriptions of the talks. For each item, we also store the following metadata:
    - The `title` of the talk.
    - The `talk_idx`, which is the index of the talk in the `talks` list.
    - The `form` of the text (title, abstract, or description) so we know where the text came from.
2. We then search the vector database based on the query and print the results.

We have already given you a skeleton of the code. Your task is to fill in the missing parts marked with `# YOUR CODE HERE START` and `# YOUR CODE HERE END`.
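
Because every talk contributes up to three texts, the same talk can appear several times in the search results. The sketch below uses two made-up talks and pretend search hits (no real search is run) to show the flattening into parallel lists and one possible way to keep only the best hit per talk via `talk_idx`.

```python
# Two made-up talks with the same fields as in the exercise.
toy_talks = [
    {"title": "Talk A", "abstract": "About LLMs.", "description": "A longer text about LLMs."},
    {"title": "Talk B", "abstract": "About pandas.", "description": "A longer text about pandas."},
]

texts = []
metadatas = []
for talk_idx, talk in enumerate(toy_talks):
    # One entry per text field, each remembering which talk and field it came from.
    for form in ("title", "abstract", "description"):
        texts.append(talk[form])
        metadatas.append({"title": talk["title"], "talk_idx": talk_idx, "form": form})

# Pretend a search returned these entries (by position in `texts`), best first.
hit_positions = [1, 2, 4]

# The same talk can match several times, so keep only its first (best) hit.
seen = set()
unique_titles = []
for position in hit_positions:
    meta = metadatas[position]
    if meta["talk_idx"] not in seen:
        seen.add(meta["talk_idx"])
        unique_titles.append(meta["title"])

print(unique_titles)
```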


In [None]:
texts = [] # A list of str 
metadatas = [] # A list of dict

for talk_idx in range(len(talks)):
    title = titles[talk_idx]
    abstract = abstracts[talk_idx]
    description = descriptions[talk_idx]
    
    texts.append(title)
    metadatas.append({"title": title, "talk_idx": talk_idx, "form": "title"})
    
    # YOUR CODE HERE START: Add the abstract and its metadata to the texts and metadatas lists.
    texts.append(abstract)
    metadatas.append({"title": title, "talk_idx": talk_idx, "form": "abstract"})
    # YOUR CODE HERE END
    
    # YOUR CODE HERE START: Add the description and its metadata to the texts and metadatas lists.
    texts.append(description)
    metadatas.append({"title": title, "talk_idx": talk_idx, "form": "description"})
    # YOUR CODE HERE END

assert len(texts) == len(metadatas)
    
# YOUR CODE HERE START: Create a vector database around the texts and their metadata (Hint: use the FAISS.from_texts method).
vector_database = FAISS.from_texts(texts, metadatas=metadatas, embedding=embedding_func)
# YOUR CODE HERE END

In [None]:
query = 'which talks are about LLM?'
# query = 'which talks are about data engineering?' # This is an alternative query that you can try out.
# Without an explicit k, similarity_search returns the 4 best matches by default.
documents = vector_database.similarity_search(query)

for document in documents:
    # YOUR CODE HERE START: print the title and the text where the query was found.
    title = document.metadata['title']
    form = document.metadata['form']
    page_content = document.page_content
    
    print(f"Title: {title}")
    print(f"Found via the {form}: {page_content}")
    # YOUR CODE HERE END
    
    print("#" * 80 + "\n")