## Assignment 2

In the second assignment, we are going to use a large language model in a retrieval-augmented setup. As an application, we are going to consider a question answering task.

You can use any LLM you want in this assignment, but your solution must include at least one open model (e.g. Mistral or one of the Llama models). Optionally, you may compare it to a commercial model.

The dataset we will use in this assignment is a simplified version of Natural Questions, which was compiled by Google and consists of real search engine queries about factual questions.

In [1]:
import pandas as pd

nq_data = pd.read_csv('nq_simplified.val.tsv', sep='\t', header=None, names=['question', 'answer', 'gold_context'], quoting=3)
nq_data

Unnamed: 0,question,answer,gold_context
0,what purpose did seasonal monsoon winds have o...,enabled European empire expansion into the Ame...,The westerlies (blue arrows) and trade winds (...
1,who got the first nobel prize in physics,"Wilhelm Conrad Röntgen, of Germany",The award is presented in Stockholm at an annu...
2,when is the next deadpool movie being released,"May 18, 2018","Though the original creative team of Reynolds,..."
3,where did the idea of fortnite come from,as a cross between Minecraft and Left 4 Dead,"Fortnite is set in contemporary Earth, where t..."
4,which mode is used for short wave broadcast se...,MFSK Olivia,"All one needs is a pair of transceivers, each ..."
...,...,...,...
4284,who challenged the aristotelian model of a geo...,Copernicus,Planets Variations in speed through the zodiac...
4285,when was the miraculous journey of edward tula...,"March 30, 2006",The Miraculous Journey of Edward Tulane-wikipe...
4286,character in macbeth who is murdered and appea...,Lord Banquo,Banquo Thane of Lochaber Macbeth character Thé...
4287,when was as you like it first performed,"uncertain, though a performance at Wilton Hous...",As You Like It-wikipedia As You Like It Jump t...


### Step 1: Evaluating an LLM on Natural Questions

Load an LLM and explore different prompting strategies to try to make it answer the questions in the dataset. As a benchmark, you can use the ROUGE-1 precision/recall/F1 scores.

In [2]:
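def rouge1(gold, predicted):
  """Corpus-level ROUGE-1 precision, recall and F1 over sets of unigrams."""
  assert len(gold) == len(predicted)
  n_p = 0  # total number of distinct tokens in the predictions
  n_g = 0  # total number of distinct tokens in the gold answers
  n_c = 0  # total number of tokens shared by prediction and gold answer
  for g, p in zip(gold, predicted):
    g = set(cleanup(str(g)).split())
    p = set(cleanup(str(p)).split())
    n_g += len(g)
    n_p += len(p)
    n_c += len(p.intersection(g))
  # Guard against empty inputs to avoid a division by zero.
  pr = n_c / n_p if n_p > 0 else 0.0
  re = n_c / n_g if n_g > 0 else 0.0
  if pr > 0 and re > 0:
    f1 = 2*pr*re/(pr + re)
  else:
    f1 = 0.0
  return pr, re, f1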

def cleanup(text):
  text = text.replace(',', ' ')
  text = text.replace('.', ' ')
  return text
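
To make the metric concrete, here is a small worked example of the same set-based unigram overlap on a single gold/predicted pair. The helper name `unigram_overlap` and the toy strings are illustrative assumptions and are not part of the assignment code.

```python
def unigram_overlap(gold, predicted):
    """Set-based unigram precision/recall/F1 for one pair (illustrative helper)."""
    def tokens(text):
        # Same normalisation idea as cleanup(): commas and periods become spaces.
        return set(text.replace(',', ' ').replace('.', ' ').split())

    g, p = tokens(gold), tokens(predicted)
    common = len(g & p)
    precision = common / len(p) if p else 0.0
    recall = common / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if common else 0.0
    return precision, recall, f1

# A verbose prediction: it contains the gold token, so recall is perfect,
# but most of its tokens are not in the gold answer, so precision is low.
print(unigram_overlap("Copernicus", "It was Copernicus who challenged it."))
```

This illustrates why long, chatty model outputs tend to be penalised on one side of the metric even when they contain the correct answer. Note also that no lowercasing is applied, so `It` and `it` count as different tokens.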

In [3]:
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")



In [21]:
from transformers import pipeline
import tqdm

class Mistral:
    def __init__(self, model, tokenizer):
        self.pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=64, batch_size=64)
        self.pipe.tokenizer.pad_token_id = model.config.eos_token_id

    def prompt(self, questions):
        messages = [[{"role": "user", "content": question},] for question in questions]
        messages = self.pipe(messages)
        return messages

    def simple_prompt(self, questions):
        messages = [[{"role": "user", "content": f"Answer the following question with a short and straightforward answer: {question}"},] for question in questions]
        messages = self.pipe(messages)
        return messages

    def context_prompt(self, questions, contexts):
        def prompt(question, context):
            return f"Context: {context}\n\nAnswer the following question with a short and straightforward answer based on the provided context: {question}"
        messages = [[{"role": "user", "content": prompt(question, context)},] for question, context in zip(questions, contexts)]
        messages = self.pipe(messages)
        return messages

    def few_shot_prompt(self, examples, questions):
        def prompt(question):
            return f"""You are a question-answering bot. Your job is to answer the provided question with a short answer.

Here are some examples of how you should answer the questions.
Question: {examples["question"][0]}?
Answer: {examples["answer"][0]}

Question: {examples["question"][1]}?
Answer: {examples["answer"][1]}

Now, answer the following question:
{question}?
"""
        messages = [[{"role": "user", "content": prompt(question)},] for question in questions]
        messages = self.pipe(messages)
        return messages

    def few_shot_context_prompt(self, examples, questions, contexts):
        def prompt(examples, question, context):
            return f"""You are a question-answering bot. Your job is to answer the provided question with a short answer.

Here are some examples of how you should answer the questions.
Question: {examples["question"][0]}?
Answer: {examples["answer"][0]}

Question: {examples["question"][1]}?
Answer: {examples["answer"][1]}

You should also consider the following context:
{context}

Now, answer the following question:
{question}?
"""
        messages = [[{"role": "user", "content": prompt(examples, question, context)},] for question, context in zip(questions, contexts)]
        messages = self.pipe(messages)
        return messages
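
The few-shot methods above hard-code exactly two examples. As a sketch, the same prompt text could be assembled from any number of examples. `build_few_shot_prompt` is a hypothetical helper introduced only for illustration; its wording mirrors the prompts in the class above.

```python
def build_few_shot_prompt(examples, question, context=None):
    """Assemble a few-shot prompt from (question, answer) pairs (hypothetical helper)."""
    parts = [
        "You are a question-answering bot. Your job is to answer the "
        "provided question with a short answer.",
        "",
        "Here are some examples of how you should answer the questions.",
    ]
    for ex_question, ex_answer in examples:
        parts += [f"Question: {ex_question}?", f"Answer: {ex_answer}", ""]
    if context is not None:
        # Optional retrieved or gold passage, placed after the examples.
        parts += ["You should also consider the following context:", context, ""]
    parts += ["Now, answer the following question:", f"{question}?"]
    return "\n".join(parts)

examples = [
    ("who got the first nobel prize in physics", "Wilhelm Conrad Röntgen, of Germany"),
    ("when is the next deadpool movie being released", "May 18, 2018"),
]
print(build_few_shot_prompt(examples, "when was as you like it first performed"))
```

Keeping the prompt construction in one function makes it easier to vary the number of examples when comparing prompting strategies.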

In [22]:
mistral = Mistral(model, tokenizer)

In [10]:
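answers = mistral.prompt(nq_data["question"])
extracted_answers = [answer[-1]["generated_text"][-1]["content"] for answer in answers]
# Note: rouge1 is declared as rouge1(gold, predicted). Here and in the cells below,
# the predictions are passed first, so the first two returned values are
# recall and precision, in that order. F1 is unaffected by the argument order.
rouge1(extracted_answers, nq_data["answer"])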

(0.36628042254176557, 0.052906460893178825, 0.09245804970016623)
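
The indexing expression `answer[-1]["generated_text"][-1]["content"]` is repeated in every evaluation cell. The sketch below wraps it in a hypothetical helper, `extract_answer`, and applies it to a hand-written mock that has the nesting this indexing assumes (a list of generated sequences, each holding the chat as a list of messages). The mock is not real model output.

```python
def extract_answer(pipeline_output):
    """Return the text of the last message of the last generated sequence."""
    return pipeline_output[-1]["generated_text"][-1]["content"]

# Hand-written mock with the nesting that the notebook's indexing assumes:
# one list per question, one dict per generated sequence, and the chat
# (user message followed by the assistant reply) under "generated_text".
mock_outputs = [
    [{"generated_text": [
        {"role": "user", "content": "who got the first nobel prize in physics"},
        {"role": "assistant", "content": "Wilhelm Conrad Röntgen"},
    ]}],
]
print([extract_answer(out) for out in mock_outputs])
```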

In [11]:
simple_answers = mistral.simple_prompt(nq_data["question"])
extracted_simple_answers = [answer[-1]["generated_text"][-1]["content"] for answer in simple_answers]
rouge1(extracted_simple_answers, nq_data["answer"])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

(0.3693982968030155, 0.06598064966585764, 0.11196287650655515)

In [6]:
few_shot_answers = mistral.few_shot_prompt(nq_data[:2], nq_data["question"][2:])
extracted_few_shot_answers = [answer[-1]["generated_text"][-1]["content"] for answer in few_shot_answers]
rouge1(extracted_few_shot_answers, nq_data["answer"][2:])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

(0.3570596729864443, 0.07175286683828692, 0.11949303152184082)

### Step 2: An idealized retrieval-augmented LLM

The third column in the dataset (called gold_context above) contains a text fragment from a Wikipedia page, from which the answer can be deduced. Try out new prompts where you include this relevant context. How does this change the evaluation scores?


In [10]:
gold_context_answers = mistral.context_prompt(nq_data["question"], nq_data["gold_context"])
extracted_gold_context_answers = [answer[-1]["generated_text"][-1]["content"] for answer in gold_context_answers]
rouge1(extracted_gold_context_answers, nq_data["answer"])

OutOfMemoryError: CUDA out of memory. Tried to allocate 2.70 GiB. GPU 

In [None]:
few_shot_gold_context_answers = mistral.few_shot_context_prompt(nq_data[:2], nq_data["question"][2:], nq_data["gold_context"][2:])
extracted_few_shot_gold_context_answers = [answer[-1]["generated_text"][-1]["content"] for answer in few_shot_gold_context_answers]
rouge1(extracted_few_shot_gold_context_answers, nq_data["answer"][2:])
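
The `OutOfMemoryError` above is most likely caused by batching prompts that contain long gold contexts. One possible mitigation, shown here only as a sketch, is to cap each context at a fixed word budget before building the prompt, and if needed to lower `batch_size` in the pipeline. The helper `truncate_words` and the budget of 300 words are assumptions chosen for illustration. Truncation can also cut off the part of the context that contains the answer.

```python
def truncate_words(text, max_words=300):
    """Keep at most max_words whitespace-separated words of text."""
    words = str(text).split()
    if len(words) <= max_words:
        return str(text)
    return " ".join(words[:max_words])

contexts = ["short context", "word " * 1000]
short_contexts = [truncate_words(c, max_words=300) for c in contexts]
# Number of words per context after truncation.
print([len(c.split()) for c in short_contexts])
```

In the notebook, this could be applied as `[truncate_words(c) for c in nq_data["gold_context"]]` before calling `mistral.context_prompt`.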

### Step 3: Setting up the retriever

The setup in Step 2 is idealized, because we provided a context from Wikipedia that is known to contain the answer. In real-world settings, this is not going to be the case.

To make this assignment work in Colab, we are going to work with a rather small set of passages. You can download these texts from here. For a given question, we are going to search among these passages to find the best-matching passage.


#### Representing the passages as vectors

Set up a representation model that maps a text passage to a numerical vector.

For instance, a model from SentenceTransformers such as all-MiniLM-L6-v2 could be a good choice.

Apply this model to all text passages.


In [15]:
with open("passages.txt", "r") as f:
    # One passage per line; trailing newlines are kept.
    passages = f.readlines()

print(passages[0])
print(len(passages))

and fielded .964. He was 10th in Hoofdklasse in average, fifth with 47 hits (one behind brother Mark), 6th with 24 RBI and tied for 7th with 9 steals. He led the 2005 European Championship with 14 walks in 10 contests; no one else had more than 9. He also played error-free ball at second base. In the 2005 Baseball World Cup, Duursma hit .302/.412/.395 with 12 runs in 11 games. During the 2006 World Baseball Classic, Duursma had the best average on the Dutch team, though he played in just one game. He went 2 for 4 with two

34312


In [16]:
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(passages)



#### Storing the passage vectors in a database

We now create a vector database that allows us to search efficiently for the nearest neighbors of a given query vector in the vector space. We recommend the FAISS library for this purpose.

In [17]:
import faiss

index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

In [18]:
question_embeddings = sentence_model.encode(nq_data["question"])
_, ix = index.search(question_embeddings, 1)
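
`IndexFlatL2` performs an exact, brute-force search and reports squared Euclidean distances. The following pure-Python sketch shows the same idea on toy 2-dimensional vectors. `nearest_neighbors` is a hypothetical helper for illustration only; it is not how FAISS is implemented internally.

```python
def nearest_neighbors(query, vectors, k=1):
    """Return (squared_distance, index) pairs for the k vectors closest to query."""
    scored = []
    for i, vec in enumerate(vectors):
        # Squared Euclidean distance between the query and this vector.
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((dist, i))
    scored.sort()  # smallest distance first
    return scored[:k]

passage_vectors = [[0.0, 1.0], [1.0, 0.0], [0.7, 0.7]]
query_vector = [0.9, 0.1]
print(nearest_neighbors(query_vector, passage_vectors, k=2))
```

With the real index, calling `index.search(question_embeddings, k)` with `k` greater than 1 returns several candidate passages per question, which could then be concatenated into a single context.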

In [23]:
rag_context_answers = mistral.context_prompt(nq_data["question"], [passages[idx[0]] for idx in ix])
extracted_rag_context_answers = [answer[-1]["generated_text"][-1]["content"] for answer in rag_context_answers]
rouge1(extracted_rag_context_answers, nq_data["answer"])

(0.36986365117036624, 0.08213543873427925, 0.1344202408334461)

In [24]:
# These answers use retrieved passages rather than the gold contexts.
rag_few_shot_context_answers = mistral.few_shot_context_prompt(nq_data[:2], nq_data["question"][2:], [passages[idx[0]] for idx in ix[2:]])
extracted_rag_few_shot_context_answers = [answer[-1]["generated_text"][-1]["content"] for answer in rag_few_shot_context_answers]
rouge1(extracted_rag_few_shot_context_answers, nq_data["answer"][2:])

(0.3926491824661108, 0.08820082456103635, 0.14404484205309614)
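
The scores above mix retrieval quality and generation quality. A rough way to inspect the retriever on its own is to check how often the retrieved passage contains the gold answer string. This is a sketch with a hypothetical helper, `answer_hit_rate`, and toy data. Exact substring matching is a crude proxy that misses paraphrases.

```python
def answer_hit_rate(retrieved_passages, gold_answers):
    """Fraction of questions whose retrieved passage contains the gold answer."""
    hits = 0
    for passage, answer in zip(retrieved_passages, gold_answers):
        # Case-insensitive substring match between answer and passage.
        if str(answer).lower() in str(passage).lower():
            hits += 1
    return hits / len(gold_answers) if len(gold_answers) else 0.0

toy_passages = ["Banquo is a character in Macbeth.", "A passage about baseball."]
toy_answers = ["Banquo", "Copernicus"]
print(answer_hit_rate(toy_passages, toy_answers))
```

In the notebook, the inputs would be `[passages[idx[0]] for idx in ix]` and `nq_data["answer"]`.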