Name: Fabio Grassiotto  
RA: 890441

# Class 9_10 - Implementing ReAct with LLaMA 3 70B (Groq)

- Test on the IIRC dataset: the first 50 questions that have an answer (test_questions.json, attached)  
- Use the LlamaIndex prompt: https://github.com/run-llama/llama_index/blob/a87b63fce3cc3d24dc71ae170a8d431440025565/llama_index/agent/react/prompts.py  
- Save the final answers to the 50 questions as JSON for a future evaluation exercise  
- Instruct the model to follow the sequence Thought, Action, Input, Observation (the observation does not come from the model itself; it is the result of the search)  
- The parameter stop_sequence="Observation:" is required so that the model stops generating text and waits for the search result. Implement the search code and return the top-k documents to the model (suggestion: k=5).  
- Instruct the model to act step by step (question decomposition).  
- You may use LangChain, LlamaIndex, or another framework, or implement it by hand.  
- Use search as a tool  
- Use BM25 as the retriever (repeat the indexing from the previous exercise)  
- Use the Visconde indexing: https://github.com/neuralmind-ai/visconde/blob/main/iirc_create_indices.ipyn  
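
The Thought / Action / Observation cycle described in the list above can be sketched as a small loop. This is only an illustration under stated assumptions: `react_loop`, `fake_llm`, and `fake_search` are hypothetical names, the LLM call is assumed to stop right before `Observation:`, and the tool input is assumed to arrive as `{"input": ...}`.

```python
import json
import re


def react_loop(question, llm, search, max_steps=5):
    """Run Thought/Action/Observation turns until the model emits an Answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # The LLM call is assumed to stop right before "Observation:".
        step = llm(transcript)
        transcript += step.rstrip() + "\n"

        answer = re.search(r"Answer:\s*(.*)", step, re.DOTALL)
        if answer:
            return answer.group(1).strip()

        action_input = re.search(r"Action Input:\s*(\{.*\})", step, re.DOTALL)
        if action_input is None:
            break  # malformed turn: give up instead of looping forever
        query = json.loads(action_input.group(1))["input"]

        # The observation comes from the search tool, never from the model.
        docs = search(query, k=5)
        transcript += "Observation: " + " | ".join(docs) + "\n"
    return None


# Hypothetical stand-ins so the sketch runs offline.
def fake_llm(transcript):
    if "Observation:" not in transcript:
        return 'Thought: I need to search.\nAction: search\nAction Input: {"input": "Zeus"}\n'
    return "Thought: I can answer without using any more tools.\nAnswer: sky and thunder god"


def fake_search(query, k=5):
    return ["Zeus is the sky and thunder god in ancient Greek religion"][:k]


print(react_loop("What is Zeus known for in Greek mythology?", fake_llm, fake_search))
```

In a real run, `llm` would wrap the Groq chat call with a stop sequence and `search` would wrap the BM25 retriever; the stubs only show how the transcript grows between turns.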


## Setup Environment
### Install packages

In [51]:
#%%capture
%pip install -q torch
%pip install groq
%pip install langchain
%pip install rank_bm25
%pip install bs4

Note: you may need to restart the kernel to use updated packages.


### Imports

In [53]:
import os
import sys
import json
import torch
import groq
from groq import Groq
from bs4 import BeautifulSoup
from langchain_core.documents import Document
from langchain_community.retrievers import BM25Retriever
from rank_bm25 import BM25Okapi

import warnings
warnings.simplefilter('ignore')

### Colab setup

In [54]:
# Colab environment
IN_COLAB = 'google.colab' in sys.modules

if (IN_COLAB):
    # Google Drive
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)

    project_folder="/content/drive/MyDrive/Classes/IA024/Aula_9_10"
    os.chdir(project_folder)
    !ls -la

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

cuda


### Groq API

In [55]:
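def load_groq_key(path="groq-key.txt"):
    """Read the Groq API key from a local text file; return None on failure."""
    try:
        with open(path, 'r') as file:
            # Strip surrounding whitespace (e.g. a trailing newline) so it is
            # not sent as part of the key.
            return file.read().strip()

    except FileNotFoundError:
        print(f"The file {path} does not exist.")
        return None
    except Exception as e:
        # Handle other potential exceptions (e.g., permission errors)
        print(f"An error occurred while reading the file: {str(e)}")
        return None

groq_key = load_groq_key()
if groq_key is None:
    # os.environ only accepts strings, so fail early with a clear message.
    raise RuntimeError("Groq API key not found; create groq-key.txt first.")
os.environ["GROQ_API_KEY"] = groq_key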

client = Groq(
    api_key=os.environ.get("GROQ_API_KEY"),
)

def groq_chat(content):
    try:

        chat_completion = client.chat.completions.create(
            #
            # Required parameters
            #
            messages=[
                # Set an optional system message. This sets the behavior of the
                # assistant and can be used to provide specific instructions for
                # how it should behave throughout the conversation.
                {
                    "role": "system",
                    "content": "you are a helpful assistant."
                },
                # Set a user message for the assistant to respond to.
                {
                    "role": "user",
                    "content": content,
                }
            ],

            # The language model which will generate the completion.
            model="llama3-70b-8192",

            #
            # Optional parameters
            #

            # Controls randomness: lowering results in less random completions.
            # As the temperature approaches zero, the model will become deterministic
            # and repetitive.
            temperature=0,

            # The maximum number of tokens to generate. For llama3-70b-8192 the
            # prompt and completion share an 8,192-token context window.
            #max_tokens=10,

            # Controls diversity via nucleus sampling: 0.5 means half of all
            # likelihood-weighted options are considered.
            top_p=1,

            # A stop sequence is a predefined or user-specified text string that
            # signals an AI to stop generating content, ensuring its responses
            # remain focused and concise. Examples include punctuation marks and
            # markers like "[end]".
            stop=None,

            # If set, partial message deltas will be sent.
            stream=False,
        )

    except groq.APIConnectionError as e:
        print("The server could not be reached")
        print(e.__cause__)  # an underlying Exception, likely raised within httpx.
        return None
    except groq.RateLimitError as e:
        print("A 429 status code was received; we should back off a bit.")
        return None
    except groq.APIStatusError as e:
        print("Another non-200-range status code was received")
        print(e.status_code)
        print(e.response)
        return None

    # Only reached when the request succeeded, so chat_completion is defined.
    return chat_completion.choices[0].message.content
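
The rate-limit branch above says the caller "should back off a bit". One possible way to do that is exponential backoff around the raw API call. The helper below is a generic sketch, not part of the Groq SDK: `with_backoff` and its parameters are hypothetical, and `flaky` only simulates two failures followed by a success.

```python
import time


def with_backoff(call, retries=3, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call `call()`; on a retryable error wait base_delay * 2**attempt, then retry."""
    for attempt in range(retries):
        try:
            return call()
        except retry_on:
            if attempt == retries - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)


# Simulated endpoint that fails twice before succeeding.
attempts = {"n": 0}


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("429")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = with_backoff(flaky, retry_on=(RuntimeError,), sleep=delays.append)
print(result, delays)
```

With the Groq client, `retry_on` would be `(groq.RateLimitError,)` and `call` would wrap `client.chat.completions.create(...)` directly, since `groq_chat` catches that exception itself.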

## Globals

In [56]:
NUM_QUESTIONS = 50

"""Default prompt for ReAct agent."""


# ReAct chat prompt
# TODO: have formatting instructions be a part of react output parser

REACT_CHAT_SYSTEM_HEADER = """\

You are designed to help with a variety of tasks, from answering questions \
    to providing summaries to other types of analyses.

## Tools
You have access to a wide variety of tools. You are responsible for using
the tools in any sequence you deem appropriate to complete the task at hand.
This may require breaking the task into subtasks and using different tools
to complete each subtask.

You have access to the following tools:
{tool_desc}

## Output Format
To answer the question, please use the following format.

```
Thought: I need to use a tool to help me answer the question.
Action: tool name (one of {tool_names}) if using a tool.
Action Input: the input to the tool, in a JSON format representing the kwargs (e.g. {{"input": "hello world", "num_beams": 5}})
```

Please ALWAYS start with a Thought.

Please use a valid JSON format for the Action Input. Do NOT do this {{'input': 'hello world', 'num_beams': 5}}.

If this format is used, the user will respond in the following format:

```
Observation: tool response
```

You should keep repeating the above format until you have enough information
to answer the question without using any more tools. At that point, you MUST respond
in the one of the following two formats:

```
Thought: I can answer without using any more tools.
Answer: [your answer here]
```

```
Thought: I cannot answer the question with the provided tools.
Answer: Sorry, I cannot answer your query.
```

## Current Conversation
Below is the current conversation consisting of interleaving human and assistant messages.

"""

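The header above contains two placeholders, `{tool_desc}` and `{tool_names}`, and uses doubled braces so that the JSON examples survive `str.format`. A minimal sketch of filling it in follows; it uses a shortened stand-in template to stay self-contained, and the `TOOLS` registry and `build_system_prompt` helper are assumptions made for illustration.

```python
# Shortened stand-in for REACT_CHAT_SYSTEM_HEADER so the example is self-contained.
TEMPLATE = """\
You have access to the following tools:
{tool_desc}

Action: tool name (one of {tool_names}) if using a tool.
Action Input: the input to the tool, in a JSON format representing the kwargs (e.g. {{"input": "hello world"}})
"""

# Hypothetical tool registry: tool name -> description shown to the model.
TOOLS = {
    "search": 'Searches the indexed articles with BM25; takes {"input": "<query text>"}',
}


def build_system_prompt(template, tools):
    """Fill the two placeholders; '{{' and '}}' in the template become literal braces."""
    tool_desc = "\n".join(f"{name}: {desc}" for name, desc in tools.items())
    return template.format(tool_desc=tool_desc, tool_names=", ".join(tools))


prompt = build_system_prompt(TEMPLATE, TOOLS)
print(prompt)
```

Braces inside the substituted descriptions are left untouched, because `str.format` only interprets braces in the template itself.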
## IIRC Dataset

In [57]:
if not os.path.exists("dataset/context_articles.json"):
    !wget http://jamesf-incomplete-qa.s3.amazonaws.com/context_articles.tar.gz
    !tar -xzf context_articles.tar.gz --directory dataset
    !rm context_articles.tar.gz

In [58]:
# Context managers close the files once they are parsed.
with open('dataset/test_questions.json', 'r') as f:
    test_dataset = json.load(f)
with open('dataset/context_articles.json', 'r') as f:
    articles = json.load(f)

In [59]:
test_dataset[0]

{'answer': {'type': 'span',
  'answer_spans': [{'text': 'sky and thunder god',
    'passage': 'zeus',
    'type': 'answer',
    'start': 83,
    'end': 102}]},
 'question': 'What is Zeus know for in Greek mythology?',
 'context': [{'text': 'he Palici the sons of Zeus',
   'passage': 'main',
   'indices': [684, 710]},
  {'text': 'in Greek mythology', 'passage': 'main', 'indices': [137, 155]},
  {'text': 'Zeus (British English , North American English ; , Zeús ) is the sky and thunder god in ancient Greek religion',
   'passage': 'Zeus',
   'indices': [0, 110]}],
 'question_links': ['Greek mythology', 'Zeus'],
 'title': 'Palici'}

### Load and parse

In [77]:
# Grab the first 50 questions with an answer

def grab_documents():
  questions_found = []
  num_questions_found = 0
  documents = []
  all_titles = []

  for item in test_dataset:
    
    question = item['question']
    answer = item['answer']
    answer_type = answer['type']

    if answer_type == 'binary' or answer_type == 'value':
      final_answer = answer['answer_value']
    elif answer_type == 'span':
      final_answer = answer['answer_spans'][0]['text']
    elif answer_type == 'none':
      final_answer = 'none'
    else:
      final_answer = 'An error perhaps, bad type'
      print(answer_type)

    if (final_answer == 'none'):
      # Skip this one.
      continue
    else:
      
      # That's a good question: grab the documents and titles associated with
      # it through the question_links element in the test set.

      for link in item["question_links"]:
        # Context
        context_list = item['context']
        for context in context_list:
          text = context['text']
          # clean up html
          soup = BeautifulSoup(text, 'html.parser')
          clean_text = soup.get_text()

          documents.append({
            "title": item['title'],
            "content": clean_text
          })
          all_titles.append(item['title'].lower())

        # Articles
        if link.lower() in articles:
          # clean up html
          soup = BeautifulSoup(articles[link.lower()], 'html.parser')
          clean_text = soup.get_text()

          documents.append({
            "title": link,
            "content": clean_text
            })
          all_titles.append(link.lower())

      questions_found.append({"Question": question, "Answer": final_answer})
      num_questions_found += 1
      if (num_questions_found == NUM_QUESTIONS):
        # found our questions
        break

  return questions_found, documents

questions_to_ask, documents = grab_documents()

### Inspecting data

In [83]:
len(questions_to_ask)

50

In [84]:
len(documents)

275

In [85]:
questions_to_ask[0]

{'Question': 'What is Zeus know for in Greek mythology?',
 'Answer': 'sky and thunder god'}
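
The assignment asks for the final answers to be saved as JSON for a later evaluation exercise. Given records shaped like the one above, one way to do it is sketched below; the output file name, the field names, and the `save_answers` helper are assumptions, and the prediction is a placeholder rather than a real model output.

```python
import json


def save_answers(questions, predictions, path):
    """Write one record per question with the gold and the predicted answer."""
    records = [
        {"question": q["Question"], "gold_answer": q["Answer"], "predicted_answer": p}
        for q, p in zip(questions, predictions)
    ]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return records


sample_questions = [
    {"Question": "What is Zeus know for in Greek mythology?", "Answer": "sky and thunder god"}
]
sample_predictions = ["sky and thunder god"]  # placeholder, not a real model output
saved = save_answers(sample_questions, sample_predictions, "react_answers.json")
print(saved[0])
```

Keeping the gold answer next to the prediction means the evaluation step can run from the JSON file alone.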

In [86]:
documents[:5]

[{'title': 'Palici', 'content': 'he Palici the sons of Zeus'},
 {'title': 'Palici', 'content': 'in Greek mythology'},
 {'title': 'Palici',
  'content': 'Zeus (British English , North American English ; , Zeús ) is the sky and thunder god in ancient Greek religion'},
 {'title': 'Greek mythology',
 {'title': 'Palici', 'content': 'he Palici the sons of Zeus'}]
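
The listing above shows the same Palici snippet more than once, so the corpus contains repeated entries. If duplicates are not wanted in the index, they could be dropped before indexing. The sketch below uses a hypothetical `dedupe_documents` helper and a three-item sample in place of the real `documents` list.

```python
def dedupe_documents(docs):
    """Drop repeated (title, content) pairs while keeping first-seen order."""
    seen = set()
    unique = []
    for doc in docs:
        key = (doc["title"], doc["content"])
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


sample = [
    {"title": "Palici", "content": "he Palici the sons of Zeus"},
    {"title": "Palici", "content": "in Greek mythology"},
    {"title": "Palici", "content": "he Palici the sons of Zeus"},
]
unique_sample = dedupe_documents(sample)
print(len(sample), "->", len(unique_sample))
```

Removing duplicates also keeps the top-k results from being filled with copies of a single passage.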

## BM25 Indexing

### Simple test using Rank_BM25

In [103]:
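question = questions_to_ask[30]['Question']
# BM25Retriever expects LangChain Document objects, so wrap the parsed dicts.
doc_list = [
    Document(page_content=d['content'], metadata={'title': d['title']})
    for d in documents
]
retriever = BM25Retriever.from_documents(doc_list)
result = retriever.invoke(question)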

In [111]:
corpus = [d.get('content', 'default_value') for d in documents]
tokenized_corpus = [doc.split(" ") for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)
query = questions_to_ask[30]['Question']
tokenized_query = query.split(" ")

result = bm25.get_top_n(tokenized_query, corpus, n=5)
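
Splitting on single spaces keeps punctuation and case, so a query token such as `Series?` does not match `series` in a document. A possible refinement is a small normalizing tokenizer, sketched below; `tokenize` is a hypothetical helper and is not part of rank_bm25.

```python
import re


def tokenize(text):
    """Lowercase the text and keep runs of word characters (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


query = "Which team won the first game in the 2004 National League Division Series?"
print(query.split(" ")[-1])  # naive split keeps the trailing question mark
print(tokenize(query)[-1])   # normalized token
print(tokenize("Odalis Pérez faced Woody Williams"))
```

The same function would be applied to both the corpus and the query before they are passed to `BM25Okapi` and `get_top_n`, in place of `split(" ")`.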

In [112]:
print(f'Question: {question}')
print('Sentences:')
for sentence in result:
    print(sentence)

Question: Which team won the first game in the 2004 National League Division Series?
Sentences:
The 2004 National League Division Series (NLDS), the opening round of the 2004 National League playoffs, began on Tuesday, October 5, and ended on Monday, October 11, with the champions of the three NL divisions—along with a "wild card" team—participating in two best-of-five series. They were:

- (1) St. Louis Cardinals (Central Division champion, 105–57) vs. (3) Los Angeles Dodgers (Western Division champion, 93–69): Cardinals win series, 3–1.
- (2) Atlanta Braves (Eastern Division champion, 96–66) vs. (4) Houston Astros (Wild Card, 92–70): Astros win series, 3–2.
The St. Louis Cardinals and Houston Astros went on to meet in the NL Championship Series (NLCS). The Cardinals became the National League champion, and lost to the American League champion Boston Red Sox in the 2004 World Series.

Game 1, October 5.Busch Stadium (II) in St. Louis, Missouri
Odalis Pérez faced Woody Williams in Game