# Exercise 10_11: Evaluating the IIRC dataset with RAGAs

**Name:** Caio Petrucci dos Santos Rosa

**RA:** 248245

## Assignment

- Implement RAGAS with LLaMA-3 70B to evaluate the quality of the 50 IIRC annotations used in the previous exercise.

- RAGAS uses the context, question, and answer keys, which are available in the IIRC test set.

- Optional:

    - Evaluate the answers from the exercise of class 9_10

    - Use multi-agents

# Libraries and packages

In [None]:
!pip install -q groq
!pip install -q langchain
!pip install -q langchain-groq
!pip install -q langchain-community
!pip install -q -U sentence-transformers
!pip install -q ragas
!pip install -q datasets

In [None]:
from google.colab import userdata
from tqdm import tqdm
from typing import List, Dict

from groq import Groq, RateLimitError

from langchain_groq import ChatGroq
from langchain_community.embeddings import HuggingFaceEmbeddings

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

from datasets import Dataset

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, ContextRelevancy
from ragas.metrics._faithfulness import LONG_FORM_ANSWER_PROMPT, NLI_STATEMENTS_MESSAGE
from ragas.metrics._answer_relevance import QUESTION_GEN
from ragas.metrics._context_relevancy import CONTEXT_RELEVANCE

from functools import reduce

import json
import threading
import time
import numpy as np
import pandas as pd

context_relevancy = ContextRelevancy()

# Attributes and hyperparameters

In [None]:
LLM_MODEL_NAME = "llama3-70b-8192"
LLM_CONTEXT_SIZE = 8192
LLM_TEMPERATURE = 0
LLM_TOP_P = 1

N_SAMPLES = 10

N_QUESTIONS = 3

RAGAS_EVAL_N_BATCHES = 3

EMBEDDINGS_MODEL_NAME = "sentence-transformers/multi-qa-mpnet-base-cos-v1"

GROQ_API_KEY = userdata.get('GROQ_API_KEY')

# IIRC dataset

## Downloading and loading the data

In [None]:
!wget https://iirc-dataset.s3.us-west-2.amazonaws.com/iirc_test.json

!wget https://iirc-dataset.s3.us-west-2.amazonaws.com/context_articles.tar.gz
!tar -xzf context_articles.tar.gz


In [None]:
with open('iirc_test.json', 'r') as f:
    test_set = json.load(f)
with open('context_articles.json', 'r') as f:
    context_articles = json.load(f)

## Inspecting the dataset structure

In [None]:
def structure_as_string(data, n_indentation):
    indent = lambda n : "    " * n
    ret = ""

    if type(data) == dict:
        ret += "{"

        for key, value in data.items():
            ret += "\n" + indent(n_indentation+1)
            ret += f"{key}: "
            ret += structure_as_string(value, n_indentation+1)

        if len(data.items()) != 0:
            ret += "\n" + indent(n_indentation)

        ret += "}"

    elif type(data) == list:
        ret += "["

        if len(data) != 0:
            ret += "\n" + indent(n_indentation+1)
            ret += structure_as_string(data[0], n_indentation+1)
            ret += "\n" + indent(n_indentation)

        ret += "]"

    else:
        ret += f"{type(data)}"

    return ret

In [None]:
print("    Structure of the test set data    ")
print("--------------------------------------")
print(structure_as_string(test_set, 0))

    Structure of the test set data    
--------------------------------------
[
    {
        questions: [
            {
                answer: {
                    type: <class 'str'>
                    answer_spans: [
                        {
                            text: <class 'str'>
                            passage: <class 'str'>
                            type: <class 'str'>
                            start: <class 'int'>
                            end: <class 'int'>
                        }
                    ]
                }
                question: <class 'str'>
                context: [
                    {
                        text: <class 'str'>
                        passage: <class 'str'>
                        indices: [
                            <class 'int'>
                        ]
                    }
                ]
                question_links: [
                    <class 'str'>
                ]
            }

## Building the QA set for analysis with RAGAs

In [None]:
def get_qa_with_context(iirc_dataset, n_samples):
    answer_parsing = {
        "none": lambda ans: "Not enough information provided in the documents.",
        "span": lambda ans: ans['answer_spans'][0]['text'],
        "binary": lambda ans: ans["answer_value"],
        "value": lambda ans: f"{ans['answer_value']} {ans['answer_unit']}"
    }

    qa_dataset = []

    for item in iirc_dataset:
        for question in item["questions"]:
            if len(qa_dataset) >= n_samples:
                return qa_dataset

            context_passages = [ f"{c['passage'] if c['passage'] != 'main' else item['title'] }: {c['text']}" for c in question["context"] ]

            qa_item = {
                "question": question["question"],
                "answer": answer_parsing[question['answer']['type']](question['answer']),
                "context": context_passages,
            }

            qa_dataset.append(qa_item)

    return qa_dataset

In [None]:
qa_dataset = get_qa_with_context(test_set, N_SAMPLES)

In [None]:
qa_dataset[0]

{'question': 'What is Zeus know for in Greek mythology?',
 'answer': 'sky and thunder god',
 'context': ['Palici: he Palici the sons of Zeus',
  'Palici: in Greek mythology',
  'Zeus: Zeus (British English , North American English ; , Zeús ) is the sky and thunder god in ancient Greek religion']}

# Implementing RAGAs

## ChatCompletion interface with Groq

In [None]:
# Part of this code was inspired by code from Rian Radeck Santos Costa

class GroqCompletionInterface:
    '''
    Interface for using the Groq API

    Implements a rate limit control for multi-threading use.
    '''

    # parameter documentation: https://console.groq.com/docs/text-chat
    _model_name = LLM_MODEL_NAME
    _context_size = LLM_CONTEXT_SIZE
    _temperature = LLM_TEMPERATURE
    _top_p = LLM_TOP_P
    _stop = None
    _stream = False
    _response_format = {"type": "json_object"}

    # Mutex lock
    _rate_lock = threading.Lock()

    def __init__(self, base_prompt):
        '''
        GroqCompletionInterface constructor.
        '''

        api_key = userdata.get('GROQ_API_KEY')
        if api_key is None:
            raise RuntimeError("'GROQ_API_KEY' variable is not set in environment.")

        self._client = Groq(api_key=api_key)
        self._base_prompt = base_prompt


    def __call__(self, prompt: str) -> str:
        '''
        Generates the model response

        Args:
            prompt (str): prompt to send to the model.

        Returns:
            str: model response.
        '''

        done = False
        while not done:
            try:
                # Gate: wait here while another thread is backing off after a rate-limit error.
                with GroqCompletionInterface._rate_lock:
                    pass

                messages = []

                if self._base_prompt:
                    messages.append(
                        {
                            "role" : "system",
                            "content" : self._base_prompt.instruction
                        }
                    )

                    if self._base_prompt.examples:
                        messages.append(
                            {
                                "role" : "system",
                                "content" : f"You MUST output in JSON exactly like this example:\n{self._base_prompt.examples[0]}\n"
                            }
                        )

                    messages.append(
                        {
                            "role" : "system",
                            "content" : f"You will receive a JSON with the following keys: {self._base_prompt.input_keys}. You must include the received keys and this additional key {self._base_prompt.output_key} in the output, just like the given example."
                        }
                    )



                messages.append(
                    {
                        "role" : "user",
                        "content" : prompt
                    }
                )


                chat_completion = self._client.chat.completions.create(
                    messages=messages,
                    model=self._model_name,
                    temperature=self._temperature,
                    max_tokens=self._context_size,
                    top_p=self._top_p,
                    stop=self._stop,
                    stream=self._stream,
                    response_format=self._response_format,
                )

                done = True

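            except RateLimitError as exception:
                GroqCompletionInterface.error = exception
                # Non-blocking acquire avoids the race between checking locked() and acquiring:
                # only one thread sleeps while holding the lock, the others wait at the gate above.
                if GroqCompletionInterface._rate_lock.acquire(blocking=False):
                    try:
                        time.sleep(1.75)
                    finally:
                        GroqCompletionInterface._rate_lock.release()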

        return chat_completion.choices[0].message.content
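
The rate-limit handling above can be studied in isolation. The following is a minimal sketch of the same gate pattern, not the class itself: `RateLimited`, `call_with_backoff`, `pause`, and `max_attempts` are hypothetical names introduced here, and `RateLimited` merely stands in for `groq.RateLimitError`. Unlike the loop above, this sketch assumes a bounded number of attempts.

```python
import threading
import time


class RateLimited(Exception):
    """Hypothetical stand-in for groq.RateLimitError."""


_gate = threading.Lock()  # shared by every caller, like _rate_lock above


def call_with_backoff(fn, pause=0.01, max_attempts=5):
    """Call fn(), retrying when it raises RateLimited."""
    for _ in range(max_attempts):
        # Wait here while some other thread is sleeping with the gate held.
        with _gate:
            pass
        try:
            return fn()
        except RateLimited:
            # Only the thread that wins the non-blocking acquire sleeps;
            # the rest loop back and block at the gate.
            if _gate.acquire(blocking=False):
                try:
                    time.sleep(pause)
                finally:
                    _gate.release()
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

Bounding the attempts is a design choice of this sketch: it turns a persistent outage into an explicit error instead of an endless loop.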

## Embeddings model for semantic comparison with Sentence Transformers

In [None]:
embeddings_model = SentenceTransformer(EMBEDDINGS_MODEL_NAME)


In [None]:
sentence1 = "I love spaghetti!"
sentence2 = "I really like pasta!"

emb_sentence1 = embeddings_model.encode(sentence1)
emb_sentence2 = embeddings_model.encode(sentence2)

print('Cosine similarity between the sentences:')
print(f'\tSentence 1: \t {sentence1}')
print(f'\tSentence 2: \t {sentence2}')
print(f'\tSimilarity: \t {cos_sim(emb_sentence1, emb_sentence2).item()}')

Cosine similarity between the sentences:
	Sentence 1: 	 I love spaghetti!
	Sentence 2: 	 I really like pasta!
	Similarity: 	 0.805577278137207


## RAGAs metrics

Throughout this section, several parts of the code were inspired by code from my colleague Rian Radeck Santos Costa.
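
As implemented by the classes in this section, the three scores reduce to the ratios below. The symbols are labels chosen here for illustration: $S$ is the list of statements extracted from the answer, $v(s) \in \{0, 1\}$ the LLM verdict for statement $s$, $q$ the original question, $\hat{q}_1, \dots, \hat{q}_n$ the generated questions, $E$ the embeddings model, $C$ the list of context passages, and $X$ the list of sentences the LLM extracts as relevant.

$$
\text{faithfulness} = \frac{1}{|S|} \sum_{s \in S} v(s)
$$

$$
\text{answer relevance} = \frac{1}{n} \sum_{i=1}^{n} \cos\big(E(q), E(\hat{q}_i)\big)
$$

$$
\text{context relevance} = \frac{|X|}{|C|}
$$

In this notebook the context relevance denominator counts context passages, and the score is set to 0 when the LLM reports insufficient information instead of returning a list.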

### Faithfulness

In [None]:
class StatementExtractorAgent:
    def __init__(self):
        self._llm_interface = GroqCompletionInterface(base_prompt=LONG_FORM_ANSWER_PROMPT)

    def extract_answer_statements(self, question: str, answer: str) -> List[str]:
        prompt = f'''\u007b
  "question": "{question}",
  "answer": "{answer}",
  "sentences": "{'. '.join([ f"Sentence {i}: {text}" for i, text in enumerate(answer.split('.')) ])}"
\u007d'''

        response = json.loads(self._llm_interface(prompt))
        statements = reduce(lambda x, y : x + y, [ analysis["simpler_statements"] for analysis in response["analysis"] ])

        return statements
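
The prompt above is assembled by interpolating raw strings into a JSON template, so a question or answer containing a double quote or a newline would yield invalid JSON. A minimal sketch of an alternative, assuming the same three keys, builds the payload as a dict and lets `json.dumps` do the escaping. The helper name `build_statement_prompt` is hypothetical, and it drops empty sentence fragments, which the original does not.

```python
import json


def build_statement_prompt(question: str, answer: str) -> str:
    """Serialize the prompt with json.dumps so quotes and newlines are escaped."""
    # Split on periods and skip empty fragments (an assumption of this sketch).
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    payload = {
        "question": question,
        "answer": answer,
        "sentences": " ".join(f"Sentence {i}: {s}." for i, s in enumerate(sentences)),
    }
    return json.dumps(payload, indent=2, ensure_ascii=False)


prompt = build_statement_prompt('Who said "veni, vidi, vici"?', "Julius Caesar. He was Roman.")
print(prompt)
```

Because the result is produced by a JSON serializer, it always round-trips through `json.loads`, whatever characters the dataset contains.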

In [None]:
class FaithfulnessScoreAgent:
    def __init__(self):
        self._llm_interface = GroqCompletionInterface(base_prompt=NLI_STATEMENTS_MESSAGE)
        self._statement_extractor_agent = StatementExtractorAgent()

    def get_faithfulness_score(self, question: str, answer: str, contexts: List[str]):
        # Arguments follow the signature extract_answer_statements(question, answer).
        answer_statements = self._statement_extractor_agent.extract_answer_statements(question, answer)
        formatted_statements = ',\n\t\t'.join([ f"\"{statem}\"" for statem in answer_statements ])

        prompt = f'''\u007b
    "context": "{'. '.join(contexts)}",
    "statements": [
        {formatted_statements}
    ]
\u007d'''

        statements_verdict = json.loads(self._llm_interface(prompt))

        supported = 0
        for statement_verdict in statements_verdict["answer"]:
            supported += statement_verdict["verdict"]
        return supported / len(statements_verdict["answer"])
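
The final ratio divides by the number of verdicts, which fails if the model returns an empty list. The sketch below isolates that arithmetic and guards the empty case. It assumes each verdict is a dict with a `"verdict"` key equal to 1 or 0, as read by the loop above; the function name and the choice of returning 0.0 for an empty list are assumptions made here.

```python
from typing import Dict, List


def faithfulness_from_verdicts(verdicts: List[Dict]) -> float:
    """Fraction of statements judged as supported by the context."""
    if not verdicts:
        # Assumption: no statements means nothing was shown to be supported.
        return 0.0
    # Accept the verdict as an int or as the string "1".
    supported = sum(1 for v in verdicts if v.get("verdict") in (1, "1"))
    return supported / len(verdicts)


example = [
    {"statement": "Zeus is the sky god.", "verdict": 1},
    {"statement": "Zeus is a river god.", "verdict": 0},
]
print(faithfulness_from_verdicts(example))
```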

### Answer Relevance

In [None]:
class QuestionGeneratorAgent:
    def __init__(self):
        self._llm_interface = GroqCompletionInterface(base_prompt=QUESTION_GEN)

    def generate_questions(self, answer: str, contexts: List[str], n_questions: int) -> List[str]:
        prompt = f'''\u007b
    "answer": "{answer}",
    "context": "{'. '.join(contexts)}"
\u007d'''

        questions = []
        for i in range(n_questions):
            generated_question = json.loads(self._llm_interface(prompt))
            questions.append(generated_question)

        return questions

In [None]:
class AnswerRelevanceScoreAgent:
    def __init__(self, embeddings_model: SentenceTransformer):
        self._question_generator_agent = QuestionGeneratorAgent()
        self._embeddings_model = embeddings_model

    def get_answer_relevance_score(self, question: str, answer: str, contexts: List[str], n_questions: int) -> float:
        generated_questions = self._question_generator_agent.generate_questions(answer, contexts, n_questions)

        emb_ground_truth_question = self._embeddings_model.encode(question)

        cos_sim_sum = 0
        for gen_question in generated_questions:
            actual_gen_question = gen_question["output"]["question"]
            emb_gen_question = self._embeddings_model.encode(actual_gen_question)
            cos_sim_sum += cos_sim(emb_ground_truth_question, emb_gen_question)

        cos_sim_avg = cos_sim_sum / len(generated_questions)
        return cos_sim_avg.item()
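
The score above is the mean cosine similarity between the original question and each generated question. A small NumPy sketch makes the computation explicit; the 2-D vectors are toy values standing in for real sentence embeddings, and both function names are hypothetical.

```python
import numpy as np


def cosine(u, v) -> float:
    """Cosine similarity between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def answer_relevance(question_emb, generated_embs) -> float:
    """Mean cosine similarity between the question and each generated question."""
    return float(np.mean([cosine(question_emb, g) for g in generated_embs]))


q = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0]]  # one identical vector, one orthogonal vector
print(answer_relevance(q, generated))
```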

### Context Relevance

In [None]:
class ContextRelevanceScoreAgent:
    def __init__(self):
        self._llm_interface = GroqCompletionInterface(base_prompt=CONTEXT_RELEVANCE)

    def get_context_relevance_score(self, question: str, contexts: List[str]):
        prompt = f'''\u007b
    "question": "{question}",
    "context": "{'. '.join(contexts)}"
\u007d'''

        extracted_sentences_output = json.loads(self._llm_interface(prompt))
        extracted_sentences = extracted_sentences_output["candidate sentences"]

        if isinstance(extracted_sentences, str): # Insufficient information
            return 0
        return len(extracted_sentences) / len(contexts)

# Evaluating the IIRC dataset

## Using our own RAGAs implementation

In [None]:
def evaluate_ragas_on_dataset(
    qa_dataset: List[Dict[str, any]],
    faithfulness_scorer: FaithfulnessScoreAgent,
    answer_relevance_scorer: AnswerRelevanceScoreAgent,
    context_relevance_scorer: ContextRelevanceScoreAgent
):
    # dataframe of evaluated data
    df = pd.DataFrame(
        columns=[
            'question',
            'answer',
            'context',
            'faithfulness',
            'answer_relevance',
            'context_relevance'
        ]
    )

    for qa in tqdm(qa_dataset):
        try:
            question = qa['question']
            answer = qa['answer']
            context = qa['context']

            # Use the scorers passed as arguments rather than the global agents.
            faithfulness_score = faithfulness_scorer.get_faithfulness_score(question, answer, context)
            answer_relevance_score = answer_relevance_scorer.get_answer_relevance_score(question, answer, context, N_QUESTIONS)
            context_relevance_score = context_relevance_scorer.get_context_relevance_score(question, context)

            # append evaluation item to dataframe
            row = pd.Series(
                [question, answer, context, faithfulness_score, answer_relevance_score, context_relevance_score],
                index=df.columns
            )
            df = pd.concat(
                [df, pd.DataFrame([row])],
                ignore_index=True
            )
        except Exception as e:
            print("Error while evaluating a sample.", e)

    return df

In [None]:
faithfulness_agent = FaithfulnessScoreAgent()
answer_relevance_agent = AnswerRelevanceScoreAgent(embeddings_model)
context_relevance_agent = ContextRelevanceScoreAgent()

In [None]:
df_own_implementation_evaluation_results = evaluate_ragas_on_dataset(
    qa_dataset,
    faithfulness_agent,
    answer_relevance_agent,
    context_relevance_agent,
)

df_own_implementation_evaluation_results.head()

 60%|██████    | 6/10 [04:09<02:58, 44.61s/it]

Error while evaluating a sample. Error code: 400 - {'error': {'message': "Failed to generate JSON. Please adjust your prompt. See 'failed_generation' for more details.", 'type': 'invalid_request_error', 'code': 'json_validate_failed', 'failed_generation': "{'answer': 'Not enough information provided in the documents.', 'context': 'Chris Brunt: the 2016–17 season. Chris Brunt: His second goal of the season came on 2 January 2017. Chris Brunt: against Hull City', 'output': {'question': 'What can be said about Chris Brunt in the 2016-17 season?', 'noncommittal': 1}}"}}

 80%|████████  | 8/10 [05:40<01:27, 43.77s/it]

Error while evaluating a sample. Error code: 400 - {'error': {'message': "Failed to generate JSON. Please adjust your prompt. See 'failed_generation' for more details.", 'type': 'invalid_request_error', 'code': 'json_validate_failed', 'failed_generation': "{'answer': 'Germany', 'context': 'Wilhelm Müller: Wilhelm Müller was born on 7 October 1794 at Dessau. Dessau: Dessau is a town and former municipality in Germany', 'output': {'question': 'Where is Dessau located?', 'noncommittal': 0}}"}}

100%|██████████| 10/10 [07:46<00:00, 46.69s/it]

Error while evaluating a sample. Error code: 400 - {'error': {'message': "Failed to generate JSON. Please adjust your prompt. See 'failed_generation' for more details.", 'type': 'invalid_request_error', 'code': 'json_validate_failed', 'failed_generation': "{'answer': '9 years', 'context': 'Wilhelm Müller: In 1814 he returned to his studies at Berlin.. Max Müller: Max Müller was born into a cultured family on 6 December 1823 in Dessau, ', 'output': {'question': 'How old was Max Müller when he started his studies at Berlin?', 'noncommittal': 0}}"}}

| | question | answer | context | faithfulness | answer_relevance | context_relevance |
|---|---|---|---|---|---|---|
| 0 | What is Zeus know for in Greek mythology? | sky and thunder god | [Palici: he Palici the sons of Zeus, Palici: i... | 1.0 | 0.834537 | 0.333333 |
| 1 | How long had the First World War been over whe... | 5 years | [Giovanni Messe: he became aide-de-camp to Kin... | 1.0 | 0.682583 | 1.0 |
| 2 | How old was Messe when the First World War sta... | 30 years | [Giovanni Messe: Messe was born in Mesagne, in... | 0.5 | 0.702441 | 1.0 |
| 3 | How long had Angela Scoular been acting profes... | 2 years | [Casino Royale (1967 film): Angela Scoular app... | 0.5 | 0.820559 | 0.0 |
| 4 | What is the capacity of the stadium where Brun... | 26688 | [Chris Brunt: Brunt returned to first-team act... | 0.5 | 0.681363 | 0.333333 |
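
The three failures logged in this run share one pattern: the `failed_generation` payload is a Python-style dict written with single quotes, which strict JSON rejects. A minimal sketch of a lenient parser, assuming the raw text of the generation is already available as a string, tries `json.loads` first and falls back to `ast.literal_eval`. The helper name `parse_lenient` is hypothetical, and how to obtain the string from the exception object is left out because it depends on the Groq client.

```python
import ast
import json


def parse_lenient(text: str) -> dict:
    """Parse strict JSON first; fall back to a Python dict literal."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # literal_eval only evaluates literals, so it does not run arbitrary code.
        value = ast.literal_eval(text)
        if not isinstance(value, dict):
            raise ValueError("expected a dict literal")
        return value


# Sample trimmed from one of the logged 'failed_generation' payloads.
sample = "{'answer': 'Germany', 'output': {'question': 'Where is Dessau located?', 'noncommittal': 0}}"
print(parse_lenient(sample)["output"]["question"])
```

This only recovers generations that are valid Python literals; anything else still raises and would have to be counted as a failed sample.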


In [None]:
print("      Evaluated RAGAs metrics      ")
print("===================================")
print(f"Faithfulness Score: \t {df_own_implementation_evaluation_results['faithfulness'].mean():.3f} ± {df_own_implementation_evaluation_results['faithfulness'].std():.3f}")
print(f"Answer Relevance Score: \t {df_own_implementation_evaluation_results['answer_relevance'].mean():.3f} ± {df_own_implementation_evaluation_results['answer_relevance'].std():.3f}")
print(f"Context Relevance Score: \t {df_own_implementation_evaluation_results['context_relevance'].mean():.3f} ± {df_own_implementation_evaluation_results['context_relevance'].std():.3f}")

      Evaluated RAGAs metrics      
===================================
Faithfulness Score: 	 0.667 ± 0.236
Answer Relevance Score: 	 0.656 ± 0.163
Context Relevance Score: 	 0.567 ± 0.380


## Using the official RAGAs implementation

In [None]:
def to_huggingface_dataset(data_samples):
    parsed_samples = {
        "question": [ sample["question"] for sample in data_samples ],
        "answer": [ sample["answer"] for sample in data_samples ],
        "ground_truth": [ sample["answer"] for sample in data_samples ],
        "contexts": [ sample["context"] for sample in data_samples ],
    }

    return Dataset.from_dict(parsed_samples)

In [None]:
def evaluate_batch_with_ragas(dataset, n_batches, llm_model, embeddings_model):
    df = None

    # At least 1, so the loop always advances even when n_batches > len(dataset).
    batch_size = max(1, len(dataset) // n_batches)
    k = 0
    i = 0
    while i < len(dataset):
        try:
            batch_data = dataset[i : min(i+batch_size, len(dataset))]
            hf_batch_data = to_huggingface_dataset(batch_data)

            evaluation_result = evaluate(
                hf_batch_data,
                metrics=[
                    faithfulness,
                    answer_relevancy,
                    context_relevancy
                ],
                llm=llm_model,
                embeddings=embeddings_model,
            )

            batch_df = evaluation_result.to_pandas()

            if df is not None:
                df = pd.concat([df, batch_df], ignore_index=True)
            else:
                df = batch_df

            i += batch_size
            k += 1
        except Exception:
            # Catch Exception only, so KeyboardInterrupt can still stop the retry loop.
            print(f"Error while trying to process batch {k}!")
            time.sleep(20)

    return df
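
The loop above retries a failing batch indefinitely, which is what the run below ends up doing until it is interrupted. A sketch of the same batching with a bounded number of retries follows. `run_in_batches`, `process`, `max_retries`, and `pause` are hypothetical names, and `process` stands in for the call to `evaluate`.

```python
import time
from typing import Callable, Sequence


def run_in_batches(items: Sequence, n_batches: int,
                   process: Callable[[Sequence], list],
                   max_retries: int = 3, pause: float = 0.0) -> list:
    """Process items in batches, retrying each batch at most max_retries times."""
    batch_size = max(1, len(items) // n_batches)
    results = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        for attempt in range(1, max_retries + 1):
            try:
                results.extend(process(batch))
                break
            except Exception as exc:
                print(f"batch at {start}: attempt {attempt} failed ({exc})")
                time.sleep(pause)
        else:
            # Runs only when every attempt failed; the batch is skipped.
            print(f"batch at {start}: giving up after {max_retries} attempts")
    return results
```

Skipping a batch after the last retry trades completeness for termination; raising instead would be the stricter alternative.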

In [None]:
langchain_groq_completion = ChatGroq(
    temperature=LLM_TEMPERATURE,
    model_name=LLM_MODEL_NAME,
    api_key=GROQ_API_KEY
)
langchain_embeddings_model = HuggingFaceEmbeddings(model_name=EMBEDDINGS_MODEL_NAME)



In [None]:
df_official_evaluation_result = evaluate_batch_with_ragas(
    qa_dataset,
    RAGAS_EVAL_N_BATCHES,
    langchain_groq_completion,
    langchain_embeddings_model,
)

df_official_evaluation_result.head()

Evaluating:   0%|          | 0/9 [00:00<?, ?it/s]

Exception in thread Thread-14:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.10/dist-packages/ragas/executor.py", line 96, in run
    results = self.loop.run_until_complete(self._aresults())
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.10/dist-packages/ragas/executor.py", line 84, in _aresults
    raise e
  File "/usr/local/lib/python3.10/dist-packages/ragas/executor.py", line 79, in _aresults
    r = await future
  File "/usr/lib/python3.10/asyncio/tasks.py", line 571, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/usr/local/lib/python3.10/dist-packages/ragas/executor.py", line 38, in sema_coro
    return await coro
  File "/usr/local/lib/python3.10/dist-packages/ragas/executor.py", line 112, in wrapped_callable_async
    return counter, await callable(

Error while trying to process batch 0!


Evaluating:   0%|          | 0/9 [00:00<?, ?it/s]

Error while trying to process batch 0!


KeyboardInterrupt: 

In [None]:
print("      Evaluated RAGAs metrics      ")
print("===================================")
print(f"Faithfulness Score: \t {df_official_evaluation_result['faithfulness'].mean():.3f} ± {df_official_evaluation_result['faithfulness'].std():.3f}")
print(f"Answer Relevance Score: \t {df_official_evaluation_result['answer_relevancy'].mean():.3f} ± {df_official_evaluation_result['answer_relevancy'].std():.3f}")
print(f"Context Relevance Score: \t {df_official_evaluation_result['context_relevancy'].mean():.3f} ± {df_official_evaluation_result['context_relevancy'].std():.3f}")