# Exercício 8_9: Retrieval-Augmented Generation

**Nome:** Caio Petrucci dos Santos Rosa

**RA:** 248245

## Enunciado

- Reproduzir o Visconde

- Buscador: BM25 (pyserini) ou um sentence-transformer

- Geração: LLaMa 3 70B (groq)

- Avaliação no IIRC (F1, exact_match) usar 10% (150 primeiras perguntas.)

- Pedir resultado do LLM em JSON para facilitar a avaliação automática

- Usar exact match e F1-bow como métricas de avaliação

- Usem o código do Visconde como referência: https://github.com/neuralmind-ai/visconde/

- Ao indexar a base de busca, indexem apenas os documentos que são utilizados pelas 150 primeiras perguntas. Procurem pela key "links". Observem que o uso de segmentação é necessário. Tentei usar slding window baseado nas sentenças. Uma janela com 5 sentenças e 1 ou 2 de overlap é interessante. Podem experimentar outros valores, se houver tempo. Procurem no código do visconde o trecho onde o janelamento é feito e usem como base.

- Para fazer os exemplos few-shot usem o autocot (ver slides)

# Bibliotecas e pacotes

In [1]:
!apt install openjdk-21-jdk-headless

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
openjdk-21-jdk-headless is already the newest version (21.0.2+13-1~22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.


In [2]:
import os

os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-21-openjdk-amd64'

In [3]:
!pip install pyserini faiss-cpu
!pip install -q groq
!pip install -q beautifulsoup4



In [4]:
from google.colab import userdata
from groq import Groq, RateLimitError
from tqdm import tqdm
from bs4 import BeautifulSoup
from pyserini.search.lucene import LuceneSearcher
from collections import Counter

import json
import threading
import time
import json
import spacy
import argparse
import collections
import numpy as np
import re
import string
import sys
import unicodedata

# Atributos e parâmetros

In [5]:
LLM_MODEL_NAME = "llama3-70b-8192"
LLM_CONTEXT_SIZE = 8192
LLM_TEMPERATURE = 0
LLM_TOP_P = 1

DOCUMENT_WINDOW_SENTENCE_OVERLAP = 2
DOCUMENT_WINDOW_SENTENCES_NUM = 5

RETRIEVER_TOP_K = 5

N_SAMPLES = 150

# Dataset IIRC

In [6]:
!wget https://iirc-dataset.s3.us-west-2.amazonaws.com/iirc_test.json

!wget https://iirc-dataset.s3.us-west-2.amazonaws.com/context_articles.tar.gz
!tar -xzf context_articles.tar.gz

--2024-05-09 02:00:58--  https://iirc-dataset.s3.us-west-2.amazonaws.com/iirc_test.json
Resolving iirc-dataset.s3.us-west-2.amazonaws.com (iirc-dataset.s3.us-west-2.amazonaws.com)... 3.5.86.39, 52.92.248.130, 52.92.195.170, ...
Connecting to iirc-dataset.s3.us-west-2.amazonaws.com (iirc-dataset.s3.us-west-2.amazonaws.com)|3.5.86.39|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2874825 (2.7M) [application/json]
Saving to: ‘iirc_test.json’


2024-05-09 02:00:58 (25.3 MB/s) - ‘iirc_test.json’ saved [2874825/2874825]

--2024-05-09 02:00:58--  https://iirc-dataset.s3.us-west-2.amazonaws.com/context_articles.tar.gz
Resolving iirc-dataset.s3.us-west-2.amazonaws.com (iirc-dataset.s3.us-west-2.amazonaws.com)... 3.5.86.39, 52.92.248.130, 52.92.195.170, ...
Connecting to iirc-dataset.s3.us-west-2.amazonaws.com (iirc-dataset.s3.us-west-2.amazonaws.com)|3.5.86.39|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 385263479 (367M) [application/x-gzi

In [7]:
test_set = json.load(open('iirc_test.json', 'r'))
context_articles = json.load(open('context_articles.json', 'r'))

# Etapa de *Indexing* e pré-processamento

In [8]:
# Código adaptado do código do Ramon Simões Abilio
# Também foi inspirado no código do Visconde (https://github.com/neuralmind-ai/visconde/blob/main/iirc_create_indices.ipynb)

def remove_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

def select_n_samples(dataset, max_samples):
    documents = []
    all_titles = []
    count_contexts = 0

    for item in tqdm(dataset):
        title = item['title'].lower()

        if title not in all_titles:
            documents.append(item)
            all_titles.append(title)
            count_contexts += 1

        if count_contexts == max_samples:
            break

    return documents

def extract_documents(dataset, context_articles):
    documents = []
    all_titles = []

    count_contexts = 0
    count_links = 0

    for item in tqdm(dataset):
        title = item['title'].lower()
        if title not in all_titles:
            cleaned_text = remove_html_tags(item["text"])
            documents.append({
                "title": item['title'],
                "content": cleaned_text
            })
            all_titles.append(title)
            count_contexts += 1

        for link in item["links"]:
            link_title = link['target'].lower()
            if link_title in context_articles and link_title not in all_titles:
                cleaned_text = remove_html_tags(context_articles[link_title])
                documents.append({
                    "title": link['target'],
                    "content": cleaned_text
                })
                all_titles.append(link_title)
                count_links += 1

    print(f"\nQuantidade de contextos: \t {count_contexts}")
    print(f"Quantidade de contextos relacionados: \t {count_links}")
    print(f"Quantidade total de documentos: \t {len(documents)}")

    return documents

In [9]:
selected_samples = select_n_samples(test_set, N_SAMPLES)
full_documents = extract_documents(selected_samples, context_articles)

 29%|██▉       | 149/514 [00:00<00:00, 218133.09it/s]
100%|██████████| 150/150 [00:34<00:00,  4.38it/s]


Quantidade de contextos: 	 148
Quantidade de contextos relacionados: 	 2016
Quantidade total de documentos: 	 2164





In [10]:
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

<spacy.pipeline.sentencizer.Sentencizer at 0x7fefb6a039c0>

In [11]:
# Código baseado no código do Visconde (https://github.com/neuralmind-ai/visconde/blob/main/iirc_create_indices.ipynb)
# Pequena adaptação para o spaCy v3

def sliding_window_split(documents, stride, max_length):
    treated_documents = []

    for j, document in enumerate(tqdm(documents)):
        doc_text = document['content']
        doc = nlp(doc_text[:10000])
        sentences = [sent.text.strip() for sent in doc.sents]
        for i in range(0, len(sentences), stride):
            segment = ' '.join(sentences[i:i+max_length]).strip()
            treated_documents.append({
                "title": document['title'],
                "contents": document['title']+". "+segment,
                "segment": segment
            })
            if i+max_length >= len(sentences):
                break

    return treated_documents

In [12]:
def add_id_and_filter_empty(documents):
    filtered_documents = []
    for i, doc in enumerate(documents):
        if doc['segment'] != "":
            filtered_doc = { **doc }
            filtered_doc['id'] = i
            filtered_documents.append(filtered_doc)
    return filtered_documents

In [13]:
treated_documents = add_id_and_filter_empty(sliding_window_split(full_documents, stride=DOCUMENT_WINDOW_SENTENCE_OVERLAP, max_length=DOCUMENT_WINDOW_SENTENCES_NUM))

100%|██████████| 2164/2164 [00:26<00:00, 81.01it/s] 


In [14]:
!mkdir iirc_index_content

In [15]:
with open("iirc_index_content/contents.jsonl",'w') as file:
    for doc in treated_documents:
        file.write(json.dumps(doc)+"\n")

In [16]:
!python3 -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator -threads 1 -input iirc_index_content -index iirc_index -storeRaw

pyserini.index is deprecated, please use pyserini.index.lucene.
2024-05-09 02:02:39,605 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:204) - Setting log level to INFO
2024-05-09 02:02:39,608 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:208) - AbstractIndexer settings:
2024-05-09 02:02:39,609 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:209) -  + DocumentCollection path: iirc_index_content
2024-05-09 02:02:39,617 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:210) -  + CollectionClass: JsonCollection
2024-05-09 02:02:39,619 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:211) -  + Index path: iirc_index
2024-05-09 02:02:39,619 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:212) -  + Threads: 1
2024-05-09 02:02:39,620 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:213) -  + Optimize (merge segments)? false
May 09, 2024 2:02:39 AM org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
INFO: Using MemorySegme

# Etapa de *Retrieval*

In [17]:
class PyseriniRetriever:
    def __init__(self):
        self.searcher = LuceneSearcher('./iirc_index')

    def __call__(self, query, top_k):
        hits = self.searcher.search(query, k=top_k)
        return [ json.loads(hit.lucene_document.get('raw')) for hit in hits ]

In [18]:
retriever = PyseriniRetriever()

# Etapa de *Generation*

In [19]:
class GroqCompletionInterface:
    '''
    Interface for using the Groq API

    Implements a rate limit control for multi-threading use.
    '''

    # Groq client
    _client = None

    # documentacao dos parametros em: https://console.groq.com/docs/text-chat
    _model_name = LLM_MODEL_NAME
    _context_size = LLM_CONTEXT_SIZE
    _temperature = LLM_TEMPERATURE
    _top_p = LLM_TOP_P
    _stop = None
    _stream = False

    # Mutex lock
    _rate_lock = threading.Lock()

    def __init__(self):
        '''
        GroqCompletionInterface constructor.
        '''
        if GroqCompletionInterface._client is None:
            api_key = userdata.get('GROQ_API_KEY')
            if api_key is None:
                raise RuntimeError("'GROQ_API_KEY' variable is not set in environment.")
            GroqCompletionInterface._client = Groq(api_key=api_key)

    def __call__(self, prompt: str) -> str:
        '''
        Generates the model response

        Args:
            prompt (str): prompt to send to the model.

        Returns:
            str: model response.
        '''

        done = False
        while not done:
            try:
                GroqCompletionInterface._rate_lock.acquire()
                GroqCompletionInterface._rate_lock.release()
                chat_completion = GroqCompletionInterface._client.chat.completions.create(
                    messages=[
                        {
                            "role": "system",
                            "content": prompt,
                        }
                    ],
                    model=self._model_name,
                    temperature=self._temperature,
                    max_tokens=self._context_size,
                    top_p=self._top_p,
                    stop=self._stop,
                    stream=self._stream,
                )
                done = True
            except RateLimitError as exception:
                GroqCompletionInterface.error = exception
                if not GroqCompletionInterface._rate_lock.locked():
                    GroqCompletionInterface._rate_lock.acquire()
                    time.sleep(1.75)
                    GroqCompletionInterface._rate_lock.release()

        return chat_completion.choices[0].message.content


In [20]:
groq_completion = GroqCompletionInterface()

# Pipeline RAG

In [32]:
base_prompt = """Consider the following context passages and answer the given question. Let's think step by step.
If you don't know any plausible answer, answer "Not enough information provided in the documents.".

{related_documents}

Question: {query}
"""

In [33]:
class RagPipeline:
    def __init__(self, llm_completion, retriever):
        self._llm_completion: GroqCompletionInterface = llm_completion
        self._retriever: PyseriniRetriever = retriever
        self._base_prompt: str = base_prompt

    def _search_related_documents(self, query):
        return self._retriever(query, RETRIEVER_TOP_K)

    def _augment_prompt(self, query, related_documents):
        formatted_related_documents = ""
        for i, doc in enumerate(related_documents, 1):
            formatted_related_documents += f"Context passage {i}: {doc['segment']}"

        return self._base_prompt.format(
            query=query,
            related_documents=formatted_related_documents,
        )

    def _generate_completion(self, augmented_prompt):
        return self._llm_completion(augmented_prompt)

    def __call__(self, query, associated_contexts):
        documents = self._search_related_documents(query)
        prompt = self._augment_prompt(query, associated_contexts + list(reversed(documents)))
        return self._generate_completion(prompt)

In [34]:
rag_pipeline = RagPipeline(groq_completion, retriever)

In [35]:
rag_pipeline("What was the 9/11 incident?", associated_contexts=[])

'Not enough information provided in the documents.'

# Avaliação

In [36]:
# Código adaptado do código do Visconde (https://github.com/neuralmind-ai/visconde/blob/main/iirc_generate_and_evaluate.ipynb)

def normalize_answer(s):
    """Lower text and remove punctuation, articles and extra whitespace."""

    def remove_articles(text):
        regex = re.compile(r'\b(a|an|the)\b', re.UNICODE)
        return re.sub(regex, ' ', text)

    def white_space_fix(text):
        return ' '.join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    def remove_accents(input_str):
        nfkd_form = unicodedata.normalize('NFKD', input_str)
        only_ascii = nfkd_form.encode('ASCII', 'ignore')
        return only_ascii.decode("utf-8")

    return white_space_fix(remove_articles(remove_punc(lower(remove_accents(s)))))


def get_tokens(s):
    if not s: return []
    return normalize_answer(s).split()


def compute_exact(a_gold, a_pred):
    return int(normalize_answer(a_gold) == normalize_answer(a_pred))


def compute_f1(a_gold, a_pred):
    gold_toks = get_tokens(a_gold)
    pred_toks = get_tokens(a_pred)
    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
    num_same = sum(common.values())

    if len(gold_toks) == 0 or len(pred_toks) == 0:
        # If either is no-answer, then F1 is 1 if they agree, 0 otherwise
        return int(gold_toks == pred_toks)

    if num_same == 0:
        return 0

    precision = 1.0 * num_same / len(pred_toks)
    recall = 1.0 * num_same / len(gold_toks)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1

In [74]:
evaluation_results = []

# use 10% of samples to evaluate
idx_p10 = int(0.1 * len(test_set))

for sample in tqdm(test_set[:idx_p10]):
    for q in sample['questions']:
        associated_contexts = [ { 'segment': c['text'] } for c in q['context'] ]
        question = q['question']
        llm_answer = rag_pipeline(question, associated_contexts)

        expected_answer = ""
        if q['answer']['type'] == "span":
            expected_answer = ", ".join([a['text'] for a in q['answer']["answer_spans"]])
        elif q['answer']['type'] == "value":
            expected_answer = "{0} {1}".format(q['answer']['answer_value'], q['answer']['answer_unit'])
        elif q['answer']['type'] == "binary":
            expected_answer = q['answer']['answer_value']
        elif q['answer']['type'] == "none":
            expected_answer = "Not enough information provided in the documents."

        evaluation_results.append({
            'question': question,
            'expected_answer': expected_answer,
            'llm_answer': llm_answer,
        })

100%|██████████| 51/51 [20:03<00:00, 23.60s/it]


In [75]:
f1s = []
ems = []

for eval_result in evaluation_results:
    f1s.append(compute_f1(eval_result['expected_answer'], eval_result['llm_answer']))
    ems.append(compute_exact(eval_result['expected_answer'], eval_result['llm_answer']))

print("F1: ",np.mean(f1s))
print("EM: ",np.mean(ems))

F1:  0.4021961630058525
EM:  0.2923076923076923
