<a href="https://colab.research.google.com/github/aaronachermann/Amphion/blob/main/assignment_2_q4_rag_student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 2 - RAG

Parts which require your interaction are marked with `TODO:`

In [None]:
# Install the relevant dependencies
!pip3 install datasets sentence_transformers tqdm numpy

Imagine your task is to build a question-answering (QA) system for a company. You are given a language model and have to create this product out of it.
The system needs to adapt very quickly to new data without retraining.
For this, we will use **Retrieval Augmented Generation (RAG)**.
The company insists you use their in-house language model trained on multiple tasks, a _flan-t5-small_.
You can test its QA functionality by asking the question _"When was ETH founded?"_:

In [None]:
# Example inference with the model.
# TODO: run me to test the environment

from transformers import pipeline
vanilla_qa_pipe = pipeline("text2text-generation", model="google/flan-t5-small", device=0, truncation=True)

QUESTION = "QUESTION: When was ETH founded?"

vanilla_qa_pipe(f"{QUESTION} ANSWER:", max_new_tokens=10)[0]["generated_text"]

In [None]:
vanilla_qa_pipe(f"""
        CONTEXT: ETH Zurich (German: Eidgenoessische Technische Hochschule Zurich; English:
        Federal Institute of Technology Zurich) is a public research university in Zurich,
        Switzerland. Founded in 1854 with the stated mission to educate engineers and scientists,
        the university focuses primarily on science, technology, engineering, and mathematics. It
        consistently ranks among the top universities in the world and its 16 departments span a
        variety of disciplines and subjects.
        {QUESTION}
        ANSWER:
    """,
    max_new_tokens=10
)[0]["generated_text"]

The first output is 1897, which is incorrect.

This is not a problem: we can use RAG to automatically provide a passage from an [external source](https://en.wikipedia.org/wiki/ETH_Zurich) and let the model answer based on it. Concatenating the first paragraph from Wikipedia with the question makes the model yield the correct answer, 1854.

In [None]:
# Define model function; do not modify
from typing import List

def rag_qa_pipe(question: str, passages: List[str]) -> str:
    """
    Define the RAG pipeline which concatenates passages to the question.
    :param question: Question text.
    :param passages: Relevant text passages.
    :return: Generated text from the pipeline.
    """
    passages = "\n".join([f"CONTEXT: {c}" for c in passages])
    return vanilla_qa_pipe(f"{passages}\nQUESTION: {question}\nANSWER: ", max_new_tokens=10)[0]["generated_text"]

To make sure you understand the function `rag_qa_pipe`, ask a question both without and with relevant context.
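Before calling the model, it can help to inspect the exact prompt string that `rag_qa_pipe` assembles. The sketch below mirrors its string formatting in a hypothetical helper, `build_prompt`, which is not part of the assignment code, so the prompt can be examined without running the model.

```python
from typing import List


def build_prompt(question: str, passages: List[str]) -> str:
    """Hypothetical helper: reproduces the prompt layout used by rag_qa_pipe."""
    context = "\n".join(f"CONTEXT: {c}" for c in passages)
    return f"{context}\nQUESTION: {question}\nANSWER: "


# Without passages, the prompt begins with an empty context line.
print(repr(build_prompt("When was ETH founded?", [])))

# With a passage, each context is placed on its own line before the question.
print(repr(build_prompt("When was ETH founded?", ["ETH Zurich was founded in 1854."])))
```

Note that every passage gets its own `CONTEXT:` prefix, so several retrieved passages simply stack up before the question.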

In [None]:
# TODO: call rag_qa_pipe with a question of your own, just to test this function

print(rag_qa_pipe("TODO your question", []))
print(rag_qa_pipe("TODO the same question", ["TODO add some relevant context, such as from Wikipedia"]))

Start with the provided model and the first 500 questions from the validation part of the _SQuAD_ dataset. The dataset has a ground truth Wikipedia passage linked to it and you can directly use it.

Then, compute the QA performance of the model with and without prepended passage using `rag_qa_pipe(question, passages)`.

Report the average case-sensitive exact match (EM; the model output is identical to the gold answer) and the average case-insensitive [answer F1 score](https://kierszbaumsamuel.medium.com/f1-score-in-nlp-span-based-qa-task-5b115a5e7d41) (F1) for both setups.
Because each question has multiple possible answers, take the maximum score for a model answer across all gold answers.
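As an illustration of the two metrics, here is a minimal sketch. It assumes plain whitespace tokenization and lowercasing for F1, with no stripping of punctuation or articles, which is simpler than what full SQuAD evaluation scripts typically do. The function names `exact_match`, `token_f1`, and `best_score` are hypothetical and not part of the assignment template.

```python
from collections import Counter
from typing import Callable, List


def exact_match(pred: str, gold: str) -> float:
    """Case-sensitive exact match: 1.0 only if the strings are identical."""
    return float(pred == gold)


def token_f1(pred: str, gold: str) -> float:
    """Case-insensitive token-level F1 (assumes whitespace tokenization)."""
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def best_score(metric: Callable[[str, str], float], pred: str, golds: List[str]) -> float:
    """Maximum of a metric over all gold answers (golds must be non-empty)."""
    return max(metric(pred, gold) for gold in golds)


print(exact_match("1854", "1854"))                    # 1.0
print(token_f1("in 1854", "1854"))                    # precision 0.5, recall 1.0
print(best_score(token_f1, "Zurich", ["Bern", "zurich"]))  # 1.0
```

In the SQuAD dataset loaded in this notebook, the list of gold answer strings for a question is available as `line["answers"]["text"]`.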

In [None]:
# baseline model evaluation
# TODO: this cell requires <30 new lines

import tqdm
import numpy as np
from datasets import load_dataset
dataset = load_dataset("rajpurkar/squad")

def metric_exact_match(ans_pred: str, ans_true: str) -> float:
    """
    Case-sensitive answer exact match, model output is identical to the gold answer.
    :param ans_pred: Predicted answer
    :param ans_true: Ground truth answer
    :return: 1. if the answers are the same, 0. otherwise
    """
    # TODO: ~1 line
    return 0.

def metric_f1(ans_pred: str, ans_true: str) -> float:
    """
    Case-insensitive answer F1 score.
    :param ans_pred: Predicted answer.
    :param ans_true: Ground truth answer.
    :return: F1 score between the predicted and ground truth answers.
    """
    # TODO: ~10 lines
    return 0.

for line in tqdm.tqdm(dataset["validation"].select(range(500))):
    # hint: use `line["question"]`, `line["context"]`, and `line["answers"]`
    # TODO: run with and without prepended passage
    pass

# TODO: Print mean of the exact match and mean of F1 scores for the model with and without prepended passage

You will likely see improvements in scores by providing a passage to the model.

In contrast to the previous evaluation, during inference in a real world scenario, we do not have access to the ground truth passage.
All we have access to is the question from a user.
Luckily, the company is providing you with an unstructured knowledge base. This could be the whole of Wikipedia but in our scenario, we use all the passages from the SQuAD dataset and shuffle them to remove any existing structure.

In [None]:
import random

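# sort the unique passages first: the iteration order of a set of strings
# is not guaranteed to be the same across runs, which would defeat the fixed seed
kb = sorted(set(dataset["validation"]["context"]))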

# make sure that there is no remaining structure
random.Random(42).shuffle(kb)
print(len(kb), "passages in the knowledge base")

Now, whenever we receive a question, we need to find the relevant passage(s) in the knowledge base and put them in the model input.
This is a non-trivial task, and the whole research field of Information Retrieval is devoted to it.


We are going to convert all the knowledge base passages into vectors using TF-IDF and the provided embedding model ([bert-base-nli-mean-tokens](https://huggingface.co/sentence-transformers/bert-base-nli-mean-tokens)).
The model inference is already implemented for you, but you need to fill in all the functions in the `KnowledgeBase` class.
You will need to implement the retrieval and the three metrics (Euclidean distance, cosine similarity, inner product).
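For orientation, the three metrics can be written in a few lines of NumPy. This is only a sketch with hypothetical free-function names, not the `KnowledgeBase` methods themselves. The zero-vector guard in the cosine function is an added assumption.

```python
import numpy as np


def euclidean_distance(a, b) -> float:
    """L2 distance: smaller values mean the vectors are closer."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between a and b: larger values mean closer."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Assumption: return 0.0 for zero vectors instead of dividing by zero.
    return float(a @ b / denom) if denom > 0 else 0.0


def inner_product(a, b) -> float:
    """Dot product: larger values mean closer, and it grows with vector norms."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b)


a, b = [1.0, 2.0], [2.0, 4.0]
print(euclidean_distance(a, b))  # sqrt(5), about 2.236
print(cosine_similarity(a, b))   # parallel vectors, so about 1.0
print(inner_product(a, b))       # 10.0
```

Keep in mind that the Euclidean value is a distance while the other two are similarities, so the sort direction during retrieval has to be flipped accordingly.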

We need to build an abstraction for the knowledge base. It needs to support:
- adding new keys (vectors) and their corresponding values
- retrieving the values of the closest keys for a given query key, based on one of three vector metrics

The implementation does not need to be efficient.

Hint: it's ok to just add all the elements to a list and on retrieval sort the list by the distance.
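The hint above can be sketched with a toy store. `ToyStore` is a hypothetical class for illustration only, restricted to Euclidean distance. It is not the `KnowledgeBase` interface from the assignment.

```python
from typing import Any, List, Tuple

import numpy as np


class ToyStore:
    """Hypothetical brute-force key-value store (illustration only)."""

    def __init__(self) -> None:
        # A plain list of (key vector, value) pairs is the whole "index".
        self.items: List[Tuple[np.ndarray, Any]] = []

    def add(self, key, val: Any) -> None:
        self.items.append((np.asarray(key, dtype=float), val))

    def nearest(self, query, k: int = 1) -> List[Any]:
        query = np.asarray(query, dtype=float)
        # Sort all stored items by L2 distance to the query (ascending).
        ranked = sorted(self.items, key=lambda item: np.linalg.norm(item[0] - query))
        return [val for _, val in ranked[:k]]


store = ToyStore()
store.add([0.0, 0.0], "origin")
store.add([1.0, 1.0], "diagonal")
store.add([5.0, 5.0], "far")
print(store.nearest([0.9, 0.8], k=2))  # ['diagonal', 'origin']
```

Sorting the full list costs O(n log n) per query, which is acceptable at the scale of this knowledge base. For a similarity score such as cosine or inner product, the sort would need to be descending instead.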

In [None]:
# Knowledge base building. This cell requires <20 new lines.
from typing import Literal, List, Any

Vec = List
Val = Any

class KnowledgeBase:
    def __init__(self, dim: int):
        """
        Initialize a knowledge base with a given dimensionality.
        :param dim: the dimensionality of the vectors to be stored
        """
        # TODO: initialize a persistent structure, such as a simple list
        pass

    def add_item(self, key: Vec, val: Val):
        """
        Store the key-value pair in the knowledge base.
        :param key: key
        :param val: value
        """
        # TODO: add to the persistent structure
        pass

    def retrieve(
        self, key: Vec, metric: Literal['l2', 'cos', 'ip'], k: int = 1
    ) -> List[Val]:
        """
        Retrieve the top k values from the knowledge base given a key and similarity metric.
        :param key: key
        :param metric: Similarity metric to use.
        :param k: Top k similar items to retrieve.
        :return: List of top k similar values.
        """
        # TODO: retrieve the k closest vectors and return their corresponding values
        # Hint: this does not have to be efficient, feel free to just sort the whole persistent structure and return the top k
        pass

    @staticmethod
    def _sim_euclidean(a: Vec, b: Vec) -> float:
        """
        Compute Euclidean (L2) distance between two vectors.
        :param a: Vector a
        :param b: Vector b
        :return: Distance (smaller means more similar)
        """
        # hint: use numpy
        # TODO: compute the Euclidean distance between two vectors
        pass

    @staticmethod
    def _sim_cosine(a: Vec, b: Vec) -> float:
        """
        Compute the cosine similarity between two vectors.
        :param a: Vector a
        :param b: Vector b
        :return: Similarity score
        """
        # hint: use numpy
        # TODO: compute the cosine similarity between two vectors
        pass

    @staticmethod
    def _sim_inner_product(a: Vec, b: Vec) -> float:
        """
        Compute the inner product between two vectors.
        :param a: Vector a
        :param b: Vector b
        :return: Similarity score
        """
        # hint: use numpy
        # TODO: compute the inner product between two vectors
        pass


In [None]:
# Build knowledge base index
# Ideally, this cell does not need to be changed and can just be run.
# Make modifications if you feel they are necessary.

from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

# Sparse retrieval using TF-IDF: vectorize the passages and index them
vectorizer = TfidfVectorizer(max_features=768, norm=None)
kb_vectorized = vectorizer.fit_transform(kb).toarray()
kb_index_tfidf = KnowledgeBase(dim=768)
for passage_index, passage_embd in enumerate(kb_vectorized):
    kb_index_tfidf.add_item(passage_embd.squeeze(), passage_index)

# Dense retrieval using Sentence Transformers
model_embd = SentenceTransformer("bert-base-nli-mean-tokens").to("cuda:0")
kb_index_embd = KnowledgeBase(dim=768)
for passage_index, passage_embd in enumerate(tqdm.tqdm(kb)):
    kb_index_embd.add_item(model_embd.encode(passage_embd).squeeze(), passage_index)

For the same first 500 questions from the validation split, evaluate how often the top retrieved passage is the correct one (formally Recall@1) and how often the correct passage is among the top 5 retrieved (Recall@5).
Perform the retrieval with all three metrics: Euclidean distance, cosine similarity, and inner product. With two vectorizations, two recall levels, and three metrics, the result should be 12 numbers.
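Recall@k can be computed as in the following sketch. The function name `recall_at_k` and the toy indices are hypothetical. The sketch assumes each question has exactly one gold passage index and that the retrieval returns passage indices ranked from best to worst.

```python
from typing import Sequence


def recall_at_k(retrieved: Sequence[Sequence[int]], gold: Sequence[int], k: int) -> float:
    """Fraction of questions whose gold passage index is among the top-k retrieved."""
    hits = sum(1 for ranked, g in zip(retrieved, gold) if g in ranked[:k])
    return hits / len(gold)


# Toy data: three questions, each with a ranked list of retrieved passage indices.
retrieved = [[3, 1, 7], [2, 0, 5], [4, 6, 8]]
gold = [3, 5, 9]

print(recall_at_k(retrieved, gold, k=1))  # only the first question hits at rank 1
print(recall_at_k(retrieved, gold, k=3))  # the first two questions hit within the top 3
```

Since the index cell stores positions in `kb` as values, one way to obtain the gold index for a question in this notebook would be `kb.index(line["context"])`.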

In production, you receive a question from the user and to answer it, you need to first retrieve the relevant passage(s), pass it to the model, and only then generate the answer.

Evaluate the model performance with passages retrieved by TF-IDF (TFIDF) and embedding (EMBD) vectorization.
Consider the top-1 and top-5 passages.
This time, use only the case-insensitive F1.
The result for this cell should be $|\text{vectorizations}| \times |\text{passage counts}| \times |\text{distance metrics}| = 2 \times 2 \times 3 = 12$ numbers.
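One way to keep track of the 12 configurations is to enumerate them explicitly. In the sketch below, the labels `"tfidf"` and `"embd"` are hypothetical names for the two indexes, while the metric names follow the `retrieve` signature.

```python
from itertools import product

vectorizations = ["tfidf", "embd"]  # hypothetical labels for the two indexes
top_ks = [1, 5]                     # number of retrieved passages
metrics = ["l2", "cos", "ip"]       # metric names accepted by retrieve()

configs = list(product(vectorizations, top_ks, metrics))

# One list of per-question F1 scores per configuration.
f1_scores = {config: [] for config in configs}

for vectorization, k, metric in configs:
    print(f"{vectorization:>5}  top-{k}  {metric}")
print(len(configs), "configurations")
```

Collecting the per-question scores in a dictionary keyed by configuration makes it straightforward to print all means at the end.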

Answer the following questions:
* Provide one potential advantage and two potential disadvantages of using multiple retrieved passages.
* Describe one approach to detect if none of the retrieved passages is relevant to the user question.

TODO:
* YOUR ANSWER HERE
* YOUR ANSWER HERE



In [None]:
# Ideally, this cell does not need to be changed and can just be run.
# Make modifications if you feel they are necessary.

for metric in ["l2", "cos", "ip"]:
    print(metric.upper())

    for line in tqdm.tqdm(dataset["validation"].select(range(500))):
        # TODO: evaluate the retrieval
        # TODO: store RAG model output
        # This requires <30 new lines
        pass

Answer the following questions about similarity metrics:
* Compare and contrast the three metrics, what they might be influenced by, and their advantages and disadvantages.
* Consider the scenario in which the vectors in the knowledge base are normalized so that $\|x\|_2 = 1$. What would the results look like? Hint: look at the formulas under this assumption.
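A quick numerical check of the normalization hint, as a sketch on random toy data (the sizes and seed are arbitrary assumptions): for unit-norm keys $k$ and any query $q$, expanding the square gives $\|k - q\|_2^2 = 1 + \|q\|_2^2 - 2\,k \cdot q$.

```python
import numpy as np

rng = np.random.default_rng(0)
keys = rng.normal(size=(50, 8))   # toy knowledge base vectors
query = rng.normal(size=8)        # the query is left unnormalized

# Normalize every key to unit L2 norm.
keys /= np.linalg.norm(keys, axis=1, keepdims=True)

ip = keys @ query
cos = ip / (np.linalg.norm(keys, axis=1) * np.linalg.norm(query))
l2 = np.linalg.norm(keys - query, axis=1)

# Identity for unit-norm keys: squared distance is an affine function of the inner product.
identity_holds = np.allclose(l2 ** 2, 1.0 + query @ query - 2.0 * ip)
print("identity holds:", identity_holds)

# Compare the best match under each metric (min for distance, max for similarity).
best_l2, best_cos, best_ip = int(np.argmin(l2)), int(np.argmax(cos)), int(np.argmax(ip))
print("best match (l2, cos, ip):", best_l2, best_cos, best_ip)
```

For a fixed query, the squared distance decreases as the inner product increases, and the cosine is the inner product divided by a constant, which is worth keeping in mind when comparing the rankings.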

TODO:

* YOUR ANSWER HERE
* YOUR ANSWER HERE

Lastly, it is good practice to analyze the failure cases of your solution to better understand the pipeline.
For each of the situations below, find the first example and compute how often it occurs (as a percentage). Use the maximum exact match to determine correctness and L2 with the embedding index for retrieval.

- For top-1: The retrieved passage is **correct** but the model is **not correct**.
- For top-1: The retrieved passage is **not correct** but the model is **still correct**.
- For top-5: One of the retrieved passages is the **correct** one but the model is **not correct**.
- For top-1: Without a retrieved passage the model is **correct**, but with the passage it becomes **incorrect**.
- For top-1: Without a retrieved passage the model is **incorrect**, and with the passage it remains **incorrect** but in a different way (different answer).
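The bookkeeping for the five situations above can be sketched as follows. The `Record` fields, the `CASES` labels, and the toy data are all hypothetical. In the notebook, each record would be filled from the retrieval and model outputs for one question.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Record:
    """Hypothetical per-question summary."""
    hit_top1: bool   # gold passage is the top-1 retrieved passage
    hit_top5: bool   # gold passage is among the top-5 retrieved passages
    em_none: bool    # model correct (max EM) without any passage
    em_top1: bool    # model correct with the top-1 passage
    em_top5: bool    # model correct with the top-5 passages
    ans_none: str    # model answer without any passage
    ans_top1: str    # model answer with the top-1 passage


CASES: Dict[str, Callable[[Record], bool]] = {
    "top-1: passage correct, model incorrect": lambda r: r.hit_top1 and not r.em_top1,
    "top-1: passage incorrect, model correct": lambda r: not r.hit_top1 and r.em_top1,
    "top-5: gold passage retrieved, model incorrect": lambda r: r.hit_top5 and not r.em_top5,
    "top-1: correct without passage, incorrect with it": lambda r: r.em_none and not r.em_top1,
    "top-1: incorrect both times, different answers": lambda r: (
        not r.em_none and not r.em_top1 and r.ans_none != r.ans_top1
    ),
}


def summarize(records: List[Record]) -> Dict[str, Tuple[Optional[int], float]]:
    """Return (index of first example or None, percentage of questions) per case."""
    summary = {}
    for name, matches in CASES.items():
        indices = [i for i, record in enumerate(records) if matches(record)]
        first = indices[0] if indices else None
        summary[name] = (first, 100.0 * len(indices) / len(records))
    return summary


toy_records = [
    Record(True, True, False, True, True, "a", "b"),
    Record(True, True, False, False, False, "x", "y"),
    Record(False, True, True, False, True, "p", "q"),
    Record(False, False, True, True, True, "m", "m"),
]
summary = summarize(toy_records)
for name, (first, percentage) in summary.items():
    print(f"{name}: first example at index {first}, {percentage:.1f}%")
```

Note that the situations are not mutually exclusive: a single question can fall into several of them, so the percentages need not sum to 100.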

In [None]:
# compute the 5 phenomena statistics (relative frequency) and find examples

# TODO: <30 lines
for line in tqdm.tqdm(dataset["validation"].select(range(500))):
    pass

A client is complaining that the model incorrectly answers the question _"Who is the premier of Victoria?"_.
1. Show your model output for this question with the top-1 retrieved passage, using any metric.
2. Show which top-1 context is retrieved by L2 with the embedding index.

Hint: for the correct answer, see [en.wikipedia.org/wiki/Premier_of_Victoria](https://en.wikipedia.org/wiki/Premier_of_Victoria).

In [None]:
QUESTION = "Who is the premier of Victoria?"
# TODO: < 20 lines

Answer the following questions:
* Provide a reason why your model is giving the incorrect answer. (information tracing)
* Propose a way by which this could be remedied. (information editing)

TODO:
* YOUR ANSWER HERE
* YOUR ANSWER HERE

Note on compute: the GPU time of the gold solution is ~15 minutes. If your solution requires much more compute (e.g. hours), then you are likely doing something incorrectly.