### 2023.11.30 - Introduction to Transformers | Homework 4
In this exercise, you will implement key components of Retrieval-Augmented Generation (RAG): Data Ingestion, Retrieval and Augmentation.
RAG extends the capabilities of language models by letting them incorporate external knowledge at query time.

In case you are interested in diving deeper into RAG, check out the following resources:
- Original Paper on RAG: [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/pdf/2005.11401.pdf)
- LlamaIndex Tutorial Series: [Building RAG from Scratch (Lower-Level)](https://docs.llamaindex.ai/en/stable/optimizing/building_rag_from_scratch.html)

Base your code on the following skeleton code that we provide:

In [1]:
%pip install scikit-learn


In [2]:
import requests
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [3]:
# Import any additional dependencies
# In case you don't need any, just remove the error raise below
# YOUR CODE HERE
# raise NotImplementedError()

### Embedding Model
The embedding model transforms textual data into a numerical format (embeddings) that can be easily stored and processed.

In this exercise we will use the free Inference API from Hugging Face together with an open-source model.
In order to use this API you need to create an account and obtain an access token under https://huggingface.co/settings/tokens.

In [4]:
token = "YOUR_HF_ACCESS_TOKEN"  # placeholder: paste your own access token here and never publish a real one
# YOUR CODE HERE
# raise NotImplementedError()

In [5]:
API_URL = "https://api-inference.huggingface.co/models/BAAI/bge-small-en-v1.5"
headers = {"Authorization": f"Bearer {token}"}

def query(payload):
    # Send the payload to the hosted embedding model and return the parsed JSON response
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()
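
Since the hosted API occasionally answers with an error instead of embeddings, re-running the cell by hand can be replaced by a small retry loop. The following is only a sketch: `query_with_retry`, `max_attempts` and `wait_seconds` are hypothetical names, and it assumes that a successful call returns a list while a failed call returns something else (for example a dict describing the error).

```python
import time

def query_with_retry(query_fn, payload, max_attempts=3, wait_seconds=1.0):
    """Call query_fn(payload) until it returns a list or the attempts run out."""
    last_result = None
    for attempt in range(max_attempts):
        last_result = query_fn(payload)
        # Assumption: embeddings come back as a list (of floats or of lists),
        # whereas an error comes back as some other JSON value.
        if isinstance(last_result, list):
            return last_result
        # Wait before the next attempt, but not after the final one
        if attempt < max_attempts - 1:
            time.sleep(wait_seconds)
    raise RuntimeError(f"No embeddings after {max_attempts} attempts: {last_result!r}")
```

With the `query` function defined above, a call could then look like `embeddings = query_with_retry(query, {"inputs": knowledge_base})`.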

To keep our example simple, we will use a small set of short, predefined sentences as our knowledge base. Keep in mind that in a real-life scenario, pre-processing is an important step.
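
As an illustration of what such pre-processing can involve, one common step is splitting longer documents into smaller chunks before embedding them. The sketch below is not part of the exercise; `split_into_chunks` and `max_words` are hypothetical names, and splitting on whitespace is a deliberately naive assumption.

```python
def split_into_chunks(text, max_words=20):
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# A made-up document of 60 words, cut into chunks of at most 12 words each
document = "this sentence has exactly six words " * 10
chunks = split_into_chunks(document, max_words=12)
```

Each chunk would then become one entry of the knowledge base, in the same way as the predefined sentences used here.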

In [6]:
knowledge_base = [
    "on the 23th december i ate a lovely cheesecake for dinner and a carrot as a breakfast",
    "the second name of my aunt's second chicken is miranda",
    "the eiffel tower is located in south tirol."
]

In [7]:
embeddings = query({"inputs": knowledge_base})
embeddings # NOTE: Sometimes the API returns an error; if so, just run this cell again

[[-0.0007707758340984583,
  0.0556488074362278,
  0.03702349588274956,
  -0.018917910754680634,
  0.0005186566850170493,
  -0.035176362842321396,
  0.04667029157280922,
  0.052842576056718826,
  0.013757056556642056,
  0.03643593564629555,
  0.019616125151515007,
  -0.00797932781279087,
  0.07250022888183594,
  0.030136065557599068,
  0.004694719798862934,
  -0.016018908470869064,
  0.058853879570961,
  -0.06375222653150558,
  -0.10813543945550919,
  -0.02374984510242939,
  -0.01715598814189434,
  0.012579520232975483,
  -0.031027015298604965,
  -0.0042838044464588165,
  0.01162469107657671,
  0.11835338920354843,
  0.0008058389066718519,
  -0.03337622806429863,
  -0.057265691459178925,
  -0.0945599228143692,
  0.03406844288110733,
  -0.016903160139918327,
  0.07702722400426865,
  -0.05516739934682846,
  -0.06428147852420807,
  -0.0012352548073977232,
  0.009444925002753735,
  0.02972765453159809,
  -0.028057515621185303,
  0.02758563496172428,
  0.08588645607233047,
  0.00724495342001

After encoding our knowledge base into embeddings, we need to store them together with the original text, since most embedding models don't provide a decoder that maps an embedding back to text.

<b>Task:</b> Create an array of nodes, where each node has the form {"embd": THE EMBEDDING, "text": THE HUMAN READABLE TEXT}. Each element of the knowledge base should have exactly one node, so your db should look something like [{"embd": [0.321, ...], "text": "on the 23th ..."}, ...].

In [10]:
db = [] # TODO: DB Ingestion
# YOUR CODE HERE
# raise NotImplementedError()
# Pair each embedding with the sentence it was computed from (same order as knowledge_base)
db = [{"embd": emb, "text": text} for emb, text in zip(embeddings, knowledge_base)]

To be able to query our db, we need to transform a given prompt into the same vector space, i.e. embed it with the same model.

In [13]:
prompt = "What is the second name of my aunt's second chicken?"

In [14]:
prompt_embd = query({"inputs": prompt})
prompt_embd # NOTE: Sometimes the API returns an error; if so, just run this cell again

[-0.02983754687011242,
 -0.06206024810671806,
 -0.024697057902812958,
 -0.018111390992999077,
 0.03994860500097275,
 0.006569406483322382,
 0.06831027567386627,
 0.0198664627969265,
 0.03358141705393791,
 -0.04474203288555145,
 -0.005215412005782127,
 -0.07393546402454376,
 0.049773506820201874,
 -0.02159593068063259,
 0.04254261031746864,
 0.012756102718412876,
 -0.009595093317329884,
 0.047350045293569565,
 -0.08663540333509445,
 -0.08012942969799042,
 -0.07792751491069794,
 -0.07075493782758713,
 -0.016432998701930046,
 -0.025671884417533875,
 0.02180907316505909,
 0.030728744342923164,
 -0.014563776552677155,
 0.035833690315485,
 -0.048452556133270264,
 -0.07933619618415833,
 0.018661275506019592,
 0.007877512834966183,
 -0.026215240359306335,
 0.016605643555521965,
 -0.030086567625403404,
 -0.0022265741135925055,
 -0.015365676954388618,
 0.009265095926821232,
 0.008185644634068012,
 -0.02807963453233242,
 0.022793009877204895,
 -0.03771887347102165,
 0.0018306400161236525,
 -0.043

<b>Task:</b> Implement a function named calculate_similarity which takes two arguments, vec1 and vec2. These arguments are text embeddings that should be compared semantically. The function should return a single similarity value, where 1 indicates vectors pointing in the same direction and 0 indicates orthogonal vectors. Note that cosine similarity can in general also be negative, down to -1 for vectors pointing in opposite directions.

In [16]:
def calculate_similarity(vec1, vec2):
    """
    Calculate the cosine similarity between two vectors.

    Args:
    vec1 (list or array): The first vector.
    vec2 (list or array): The second vector.

    Returns:
    float: The cosine similarity, where 1 means the vectors point in the same direction and 0 means they are orthogonal (negative values are possible for opposing vectors).
    """
    # YOUR CODE HERE
    # raise NotImplementedError()
    # scikit-learn's cosine_similarity expects 2D inputs of shape (n_samples, n_features)
    vec1 = np.array(vec1).reshape(1, -1)
    vec2 = np.array(vec2).reshape(1, -1)

    # The result is a 1x1 matrix; return its single entry as a plain float
    return float(cosine_similarity(vec1, vec2)[0, 0])
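
For reference, cosine similarity is the dot product of the two vectors divided by the product of their lengths. The sketch below computes it directly with NumPy instead of scikit-learn; `cosine_similarity_manual` is a hypothetical name, and the function assumes that neither vector is all zeros (otherwise the division is undefined).

```python
import numpy as np

def cosine_similarity_manual(vec1, vec2):
    """Cosine similarity: dot(a, b) / (|a| * |b|). Assumes non-zero vectors."""
    a = np.asarray(vec1, dtype=float)
    b = np.asarray(vec2, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Three toy cases: same direction, orthogonal, and opposite direction
same = cosine_similarity_manual([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity_manual([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity_manual([1.0, 0.0], [-1.0, 0.0])
```

The three toy cases show why the score depends only on direction and not on length, and that opposing vectors yield a negative value.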

<b>Task:</b> Calculate the cosine similarity between a given prompt embedding and each embedding in your database (db).
Identify the database entry (node) that has the highest similarity to the prompt and retrieve the text associated with this most similar node as your augmentation data. (_hint:_ you might want to use np.argmax on an array of similarities)

In [22]:
# TODO: Implement similarity search below
# YOUR CODE HERE
# raise NotImplementedError()
# Score every node against the prompt, then keep the text of the best match
cos_sim = [calculate_similarity(node["embd"], prompt_embd) for node in db]
augmentation_data = db[np.argmax(cos_sim)]["text"]  # text of the most similar node
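
Retrieving only the single best node is enough for this exercise, but the same idea extends to returning the k most similar nodes. The following is a sketch under assumptions: `retrieve_top_k` and `toy_db` are hypothetical, the toy embeddings are made up, and the nodes follow the {"embd": ..., "text": ...} form used above.

```python
import numpy as np

def retrieve_top_k(db, prompt_embd, k=2):
    """Return the k (text, similarity) pairs most similar to the prompt, best first."""
    q = np.asarray(prompt_embd, dtype=float)
    matrix = np.asarray([node["embd"] for node in db], dtype=float)
    # Cosine similarity of every row of the matrix with the prompt vector
    sims = matrix @ q / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))
    # argsort is ascending, so reverse it and keep the first k indices
    top = np.argsort(sims)[::-1][:k]
    return [(db[i]["text"], float(sims[i])) for i in top]

# Made-up 2D embeddings, for illustration only
toy_db = [
    {"embd": [1.0, 0.0], "text": "a"},
    {"embd": [0.0, 1.0], "text": "b"},
    {"embd": [1.0, 1.0], "text": "c"},
]
results = retrieve_top_k(toy_db, [1.0, 0.2], k=2)
```

With k=1 this reduces to the np.argmax approach from the cell above.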

In [18]:
def get_augmented_prompt(prompt, augmentation):
    return f"""
Context information: "{augmentation}".
Given the context information and no prior knowledge, answer the query.
Query: {prompt}
Answer: \
"""

In [19]:
"""
Expected Output:
'\nContext information: "the second name of my aunt's second chicken is miranda".\nGiven the context information and no prior knowledge, answer the query.\nQuery: What is the second name of my aunt's second chicken?\nAnswer: '
"""
augmented_prompt = get_augmented_prompt(prompt, augmentation_data)
print(augmented_prompt)

assert "miranda" in augmented_prompt




Context information: "the second name of my aunt's second chicken is miranda".
Given the context information and no prior knowledge, answer the query.
Query: What is the second name of my aunt's second chicken?
Answer: 
