# IN4080: obligatory assignment 3
 
Mandatory assignment 3 is about the practical use of Large Language Models (LLMs). More specifically, you will be tasked to implement a RAG (Retrieval-Augmented Generation) system able to answer factual questions based on a document database, more specifically Wiki pages extracted from an [online Star Wars encyclopedia](https://starwars.fandom.com). 

You are required to get at least 12/20 points to pass. 

- We assume that you have read and are familiar with IFI’s requirements and guidelines for mandatory assignments, see [here](https://www.uio.no/english/studies/examinations/compulsory-activities/mn-ifi-mandatory.html) and [here](https://www.uio.no/english/studies/examinations/compulsory-activities/mn-ifi-guidelines.html).
- This is an individual assignment. You should not deliver joint submissions. 
- You may redeliver in Devilry before the deadline (__Sunday, October 13 at 23:59__).
- Only the last delivery will be read! If you deliver more than one file, put them into a zip-archive. You don't have to include in your delivery the data files already provided for this assignment. 
- Name your submission _your\_username\_in4080\_mandatory\_3_

The preferred way to complete this assignment is using the high-performance computing cluster _Fox_. See [here](https://www.uio.no/studier/emner/matnat/ifi/IN4080/h24/computing-setup.html) for instructions on how to register and log in to Fox.

You should deliver a completed version of this Jupyter notebook, containing both your code and explanations about the steps you followed. We want to stress that simply submitting code is __not__ by itself sufficient to complete the assignment - we expect the notebook to also contain explanations of what you have implemented, along with motivations for the choices you made along the way. Preferably use whole sentences, and mathematical formulas if necessary. Explaining in your own words (using concepts we have covered through in the lectures) what you have implemented and reflecting on your solution is an important part of the learning process - take it seriously!

Regarding the use of LLMs (ChatGPT or similar): you are allowed to use them as 'sparring partner', for instance to clarify something you have not understood. However, you are __not__ allowed to use them to generate solutions (either in part or in full) to the assignment tasks. 


## Basic setup

We will start by building a chatbot that directly answers user questions using an instruction-tuned LLM, without relying on any database. We will use the instruction-tuned version of the [Gemma 1.1 language model](https://huggingface.co/google/gemma-1.1-2b-it) from Google, which is available on HuggingFace. 

_Note: feel free to switch to another model (such as the newly released Llama 3 models) if you wish to experiment with them. Note, however, that the most recent LLMs will likely require a newer version of the `transformers` library than what is currently installed on Fox._



**Task 1** (4 points): Drawing inspiration from the code examples on the [Gemma webpage](https://huggingface.co/google/gemma-1.1-2b-it), implement the `__init__` and `get_response` methods. If you run the code on Fox with a GPU (or on a personal machine with a GPU), make sure that your code actually runs on the GPU.

In [1]:
from huggingface_hub import login
login(token='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')

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /fp/homes01/u01/ec-gabrield/.cache/huggingface/token
Login successful


In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

class BasicResponseGenerator:

    def __init__(self, model_name="google/gemma-1.1-2b-it", token=None):
        """Loads the tokenizer and pretrained causal LM for the given model.
        If a GPU is available, the model should be loaded on the GPU."""
        
        self.model_name = model_name

        # Debugging error login
        try:
            self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
            self.model = AutoModelForCausalLM.from_pretrained(self.model_name)
        except Exception as e:
            print(f"Error loading model {self.model_name}: {e}")
            raise

        # GPU if available
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

    def get_response(self, prompt: str, max_length: int = 50) -> str:
        """Given a prompt, generate a response (of a maximum max_length tokens) and return it.
        Only the response should be returned, not the text of the prompt itself."""

        input_ids = self.tokenizer.encode(prompt, return_tensors='pt').to(self.device)

        # Generate resp
        output_ids = self.model.generate(
            input_ids,
            max_length=max_length,
            pad_token_id=self.tokenizer.eos_token_id)

        response = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
        response = response[len(prompt):].strip()
        return response
        
agent = BasicResponseGenerator()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

_Note: An easy way to verify that the GPU is actually used is to run the command `nvidia-smi` while your code is running. There also exists alternative GPU monitoring tools, like [`gpustat`](https://pypi.org/project/gpustat/0.3.2/)._

You can then test your response generator with the following set of questions: 

In [3]:
questions = ["Who is Luke Skywalker?",
             "Where is the Niima Outpost in Star Wars?",
             "Have you heard of Nute Gunray? Who is he?",
             "What kind of planet is Kashyyyk, and who discovered it?",
             "Who are Condlurans, and can you give 2-3 names of known Condlurans?",
             "What can you tell me about the First Battle of Geonosis?",
             "What is the name of the settlement where Anakin Skywalker and his mother lived?",
             "Which planet did Darth Sidious represent as senator?"]

for question in questions:
    print("Question:", question)
    print("Answer:", agent.get_response(question))
    print("-------")

Question: Who is Luke Skywalker?
Answer: Luke Skywalker is a fictional character in the Star Wars franchise, a member of the Skywalker family, and a key figure in the Star Wars saga. He is the son of Anakin Skywalker and Padmé Amidala, and
-------
Question: Where is the Niima Outpost in Star Wars?
Answer: The Niima Outpost is not mentioned in the Star Wars universe, so it does not exist.
-------
Question: Have you heard of Nute Gunray? Who is he?
Answer: I am unable to find any information about Nute Gunray on the internet.
-------
Question: What kind of planet is Kashyyyk, and who discovered it?
Answer: **Kashyyyk** is a fictional planet from the Star Wars universe. It is a forested planet located in the Outer Rim.

**Kashyyyk was discovered by
-------
Question: Who are Condlurans, and can you give 2-3 names of known Condlurans?
Answer: The term "Condlurans" is derived from the Greek word "kondilos," which means "to scrape."

**Answer:**
-------
Question: What can you tell me about th

If your implementation is correct, the model should give you a few correct answers, but also many responses for which the model is either unable to give a precise answer, or hallucinates a (wrong) answer. This is expected, as the model is relatively small (3 billion parameters) and is a generic model that is not particularly optimised to generate trivia about the Star Wars Franchise. We will now try to improve the model performance by coupling the LLM to a document database.

## Retrieval step

Retrieval-augmented generation operates on a simple idea: instead of directly generating a response based on the "parametric knowledge" of the LLM, we first search for relevant documents in a database (or on the web). We then include the most relevant documents to the prompt, and ask the LLM to answer the user question _based on this retrieved knowledge_. 

In this assignment, you will use a set of Wiki texts extracted from an [online Star Wars encyclopedia](https://starwars.fandom.com) as document database. The wiki texts are available as a JSON file, either [here](https://home.nr.no/~plison/data/starwars.json) or on Fox at `/fp/projects01/ec403/IN4080/starwars.json`. The JSON is simply a dictionary mapping Wiki page titles to their content (in plain text).

### Sparse retrieval 

We can start by using the newly released [BM25s](https://bm25s.github.io/) library, which implements a number of well-known search algorithms, which are all variants of the original [BM25 algorithm](https://en.wikipedia.org/wiki/Okapi_BM25) . Although BM25 is an old-fashioned search technique based on bag-of-words, it remains suprisingly effective, and is still widely used in modern NLP systems.

**Task 2** (4 points): Fill in the implementation for the `BM25Retriever` class using [BM25s](https://bm25s.github.io/) (see the library documentation for details). You should filter out stop words by adding `stopwords='en_plus'` to the arguments of the tokenizer. 

In [4]:
# !pip install rank-bm25

In [5]:
from rank_bm25 import BM25Okapi
from nltk.corpus import stopwords
import json
from typing import List

class BM25Retriever:

    def __init__(self, filename="/fp/projects01/ec403/IN4080/starwars.json"):
        """Using the JSON file provided as input, create a BM25s retriever 
        containing all (indexed) documents."""
        
        with open(filename, 'r') as file:
            self.documents = json.load(file)

        self.texts = list(self.documents.values())
        self.stop_words = set(stopwords.words('english'))
        self.tokenized_documents = [self.tokenize(doc) for doc in self.texts]
        self.bm25 = BM25Okapi(self.tokenized_documents)

    def tokenize(self, text: str) -> List[str]:
        """Tokenize the input text and filter out stop words."""
        
        tokens = text.lower().split()
        filtered_tokens = [token for token in tokens if token not in self.stop_words]
        return filtered_tokens

    def search(self, query: str, k: int = 5) -> List[str]:
        """Use the BM25 retriever to find the k documents that are closest
        to the provided query."""
        
        tokenized_query = self.tokenize(query)
        scores = self.bm25.get_scores(tokenized_query)
        top_k_indices = scores.argsort()[-k:][::-1]

        return [self.texts[i] for i in top_k_indices]


We can then test our retriever by checking whether the documents with highest BM25 scores are indeed the ones that are most relevant to the query:

In [6]:
retriever = BM25Retriever()
for question in questions:
    print("Question:", question)
    print("Retrieved documents:")
    for relevant_doc in retriever.search(question):
        print("- " + relevant_doc.replace("\n", " "))
    print("===========")

Question: Who is Luke Skywalker?
Retrieved documents:
- The mecho-organic droid was a drone created by Ship to wield a lightsaber in combat against Luke Skywalker. At first, Ship thought Luke was merely another simulated opponent, part of the games he played with the ship's computer. When Luke managed to overcome the mecho-organic droid, however, Ship realized that Luke was real.
- Luke Skywalker's lightsaber was the first lightsaber constructed by Luke Skywalker and the second one he owned.
- Yoda Teaches Driver's Ed is a comic printed in Star Wars Jedi Quest Kids Club 3. Luke Skywalker asks Yoda if Yoda is certain Luke needs a driver's license to be a Jedi. Yoda chastises Luke not to question his teaching. Han Solo asks Luke if he wants to "drag" before saying "Wrong movie!" and "Outta my way graffiti geek!" (referencing Harrison Ford's character in American Graffiti). Han speeds off, leaving Luke coughing in his exhaust. Darth Vader implores Luke to complete his training so the two 

If your implementation is correct, the retrieved documents should for the most part relevant to the query. 

### Dense retrieval 

Many of those documents are, however, way too long to be included in the prompt for our Gemma model (especially if we wish to include 4-5 retrieved texts for each query!). Can we ensure that the length of each retrieved text stays within a reasonable length, such as one or two sentences? 

One strategy is to not return the full documents, but instead determine the most relevant _sentences_ within those documents. But how do we determine which sentence is most relevant? A sparse retriever using BM25 would not work well here, as it does not really account for the semantics of the query. Instead, what we can do is to:
- split the documents (retrieved through BM25) into sentences
- extract sentence embeddings for the query and for each sentence
- compute the cosine similarities between the query vector and each sentence vector
- and return the _k_ most similar sentences

In other words, our approach starts with a _sparse retrieval step_ at the level of full documents (which we already have implemented, using BM25S), and continues with a _dense retrieval step_ to determine the most relevant sentences among the sentences that are found in the retrieved documents.

**Task 3** (4 points): Re-implement the `search` method to segment into sentences each document retrieved with BM25, extract sentence embeddings for the query and sentences using the encoder model (see [here](https://sbert.net/examples/applications/semantic-search/README.html) for explanations and code examples), and then select the _k_ sentences with highest cosine similarities.  

_Tips_: You can use `nltk.sent_tokenize` to segment your document in sentences.

In [9]:
import bm25s
import re, json
import sentence_transformers
import nltk
from typing import List
from rank_bm25 import BM25Okapi
from nltk.corpus import stopwords

class Retriever_mk2(BM25Retriever): # we add sentence embeddings

    def __init__(self, filename="/fp/projects01/ec403/IN4080/starwars.json", 
                 encoder_model="msmarco-MiniLM-L-6-v3"):
        
        """Using the json file provided as input, create a BM25 retriever 
        containing all (indexed) documents, and loads a sentence transformer model
        used to compute the embeddings for the query and sentences"""

        BM25Retriever.__init__(self, filename)
        self.encoder = sentence_transformers.SentenceTransformer(encoder_model) # transformer for embeddings

    def search(self, query: str, k: int = 5) -> List[str]:
        """Use the BM25 retriever to find the documents that are closest
        to the provided query, and then the sentence transformer model to
        determine the most relevant sentences"""

        docs = BM25Retriever.search(self, query, k) # top-k documents
        all_sentences = []
        for doc in docs:
            sentences = nltk.sent_tokenize(doc)  # segment into sentences
            all_sentences.extend(sentences) # ìsentences from top-k documents

        # Step 3: Compute sentence embeddings for the query and all sentences
        query_embedding = self.encoder.encode(query)  # query embedding
        sentence_embeddings = self.encoder.encode(all_sentences)  # embeddings for all sentences

        from sklearn.metrics.pairwise import cosine_similarity
        similarities = cosine_similarity([query_embedding], sentence_embeddings)[0] # compute distance
        top_k_indices = similarities.argsort()[-k:][::-1]  # top-k sentences with the highest similarities
        
        return [all_sentences[i] for i in top_k_indices]


And we can test our hybrid (sparse followed by dense) retriever on the same questions as before:

In [10]:

retriever = Retriever_mk2()
for question in questions:
    print("Question:", question)
    print("Retrieved documents:")
    for relevant_doc in retriever.search(question):
        print("- " + relevant_doc.replace("\n", " "))
    print("===========")

Question: Who is Luke Skywalker?
Retrieved documents:
- Luke Skywalker's lightsaber was the first lightsaber constructed by Luke Skywalker and the second one he owned.
- The Luke Skywalker X-Wing Mech was a mech piloted by Luke Skywalker.
- Shortly after the Battle of Yavin, Kivas repaired the Y 4 BTL-S3 Y-wing Starfighter belonging to Luke Skywalker, a rebel who crash-landed near Tikaroo.
- Luke Skywalker asks Yoda if Yoda is certain Luke needs a driver's license to be a Jedi.
- Darth Vader implores Luke to complete his training so the two of them can "cruise the galaxy together," but Luke complains that he's trying but the vehicle he's in won't move because it's made of plastic.
Question: Where is the Niima Outpost in Star Wars?
Retrieved documents:
- Niima Outpost was a junkyard settlement on Jakku, a desert planet in the Western Reaches of the galaxy.
- Niima Outpost was also the location of Constable Zuvio's office.
- Niima Outpost was the only spaceport on the planet, although it

## Putting it all together

Now that we have a functioning retriever model, we can connect it to the generative language model employed to produce the responses.

**Task 4** (4 points): Implement the `RetrievalAugmentedResponseGenerator`. Given an initial input prompt, the method should first retrieve relevant sentences using the `HybridRetriever` we have just developed. Then, it should expand the initial prompt using the provided template (you are of course free to edit or adapt it as you see fit). This expanded prompt should then be tokenized and fed as input to the LLM in the same way as before.

In [13]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

PROMPT_TEMPLATE = (
    "You are given the following information about Star Wars:\n-{retrieved_sentences}\n"
    "Now answer the following question in 1 or 2 sentences, based on the provided information: '{query}'"
)

class RetrievalAugmentedResponseGenerator:

    def __init__(self, model_name="google/gemma-1.1-2b-it", 
                 doc_filename="/fp/projects01/ec403/IN4080/starwars.json", 
                 encoder_model="all-MiniLM-L6-v2"):
        """Loads the tokenizer, pretrained causal LM for the given model, along with the 
        hybrid sparse-dense retriever model populated with the documents in doc_filename."""
        
        # Load the Retriever_mk2
        self.retriever = Retriever_mk2(filename=doc_filename, encoder_model=encoder_model)

        # Load the language model and tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)

    def get_response(self, query: str, max_length: int = 200, k: int = 3) -> str:
        """Given a prompt, retrieve k relevant sentences, generate a response (of a maximum 
        max_length tokens) and return it.
        Only the response should be returned, not the text of the prompt itself.
        """
        
        retrieved_sentences = self.retriever.search(query, k=k)
        retrieved_text = "\n-".join(retrieved_sentences)  # Format sentences
        formatted_prompt = PROMPT_TEMPLATE.format(retrieved_sentences=retrieved_text, query=query) # prompt using PROMPT_TEMPLATE
        input_ids = self.tokenizer.encode(formatted_prompt, return_tensors='pt')
        output_ids = self.model.generate(input_ids, max_length=max_length, num_return_sequences=1, pad_token_id=self.tokenizer.eos_token_id) # Generating response
        response = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
        response_text = response[len(formatted_prompt):].strip() # extract the response

        return response_text

agent = RetrievalAugmentedResponseGenerator()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

The last step is to test our system end-to-end:

In [14]:
for question in questions:
    print("Question:", question)
    print("Answer:", agent.get_response(question))
    print("-------")

Question: Who is Luke Skywalker?
Answer: Based on the provided information, Luke Skywalker is a young man who is the second owner of a lightsaber and has a questionable relationship with Darth Vader.
-------
Question: Where is the Niima Outpost in Star Wars?
Answer: The Niima Outpost is located on Jakku, a desert planet in the Western Reaches of the galaxy.
-------
Question: Have you heard of Nute Gunray? Who is he?
Answer: Nute Gunray is a character from the Star Wars universe. He is known for wearing the collar and being the Viceroy of the Trade Federation.
-------
Question: What kind of planet is Kashyyyk, and who discovered it?
Answer: We do not have enough information to determine the planet of Kashyyyk or who discovered it from the provided text.
-------
Question: Who are Condlurans, and can you give 2-3 names of known Condlurans?
Answer: The provided text does not contain any information regarding Condlurans, so I am unable to answer this question from the provided context.
----

**Task 5** (4 points): If you have implemented your model correctly, the system should answer correctly to at least a few questions. But it is still far from perfect, and some of the answers are flat-out wrong. Suggest 2-3 ways one could improve the current system and get even better answers. You don't need to implement anything, simply flesh out a few ideas you believe are worth trying out.

_(of course, it is even better if you actually try to implement those ideas and evaluate their influence on the quality of the system responses!)_

**Answer**

To improve the current system, that is basically Retrieval-Augmented Generation (RAG), we can implement new techniques in the retrieval and generation components. For example we can implement the Improving Contextual Relevance in Retrieval. Currently, the 'Retriever_mk2' is only using BM25 and basic sentence embeddings to rank sentences. We can enhance this by using more advanced dense retrievers that are fine-tuned on a relevant dataset (e.g., MS MARCO or other QA-specific datasets). Hybrid sparse-dense retrieval approaches like BM25 + Sentence Embeddings can sometimes miss semantic nuances. A dense retriever fine-tuned for question answering can better capture the semantics of the query.

We can also try the Answer Consistency through Context Re-Ranking, because instead of relying solely on sentence embeddings for relevance, re-rank the retrieved sentences using a cross-encoder model like cross-encoder/ms-marco-MiniLM-L-6-v2. Cross-encoders perform better than bi-encoders in terms of ranking since they consider pairwise interactions between the query and each candidate sentence.

**References**
- [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906)
- [ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction](https://arxiv.org/abs/2112.01488)
- [gemma-1.1-2b-it ](https://huggingface.co/google/gemma-1.1-2b-it)
- [A block-sparse Tensor Train Format for sample-efficient high-dimensional Polynomial Regression](https://arxiv.org/abs/2104.14255)