# Preparation of custom RAG evaluation dataset

## Import LLM client

In [1]:
import os
from langchain_openai import ChatOpenAI
from dotenv import load_dotenv

load_dotenv()

TEMPERATURE=0.1
TOP_P=0.95
MAX_TOKENS=2048

In [5]:
MODEL = "meta-llama/Meta-Llama-3.1-405B-Instruct"

llm_client = ChatOpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ.get("NB_AI_STUDIO_KEY"),
    model=MODEL,
    temperature=TEMPERATURE,
    top_p=TOP_P,
    max_tokens=MAX_TOKENS,
)

llm_client.invoke("What is the capital of France").content

'The capital of France is Paris.'

## Prepare evaluation dataset

### Generate QA pairs

To evaluate a RAG pipeline, we need a reference dataset that contains golden QA pairs together with their reference context.

To generate the QA pairs we use `jamescalam/ai-arxiv-chunked`, which contains chunks of NLP-related research papers. We do not use the chunks themselves, only the paper summaries that are provided alongside them.
There are over 400 unique summaries, and we sample 100 of them to use as source context for QA pair generation.

In [5]:
from datasets import load_dataset
import random

data = load_dataset("jamescalam/ai-arxiv-chunked", split="train")
summaries = list(set(data["summary"]))
sampled_summaries = random.sample(summaries, 100)
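
The sample above differs on every run: `random.sample` is unseeded, and the order of a `set` of strings is not stable across interpreter sessions. If a repeatable sample is wanted, one option is sketched below. The helper name `sample_reproducibly` and the seed value are assumptions for illustration, not part of the original notebook.

```python
import random


def sample_reproducibly(items, k, seed=42):
    """Return k unique items, identical across runs for the same seed.

    Hypothetical helper: the name and the default seed are illustrative.
    """
    # Sorting removes the run-to-run ordering differences of a set of strings.
    unique_items = sorted(set(items))
    # A dedicated Random instance leaves the global random state untouched.
    rng = random.Random(seed)
    return rng.sample(unique_items, k)


toy_summaries = ["paper A", "paper B", "paper C", "paper A", "paper D"]
first = sample_reproducibly(toy_summaries, 2)
second = sample_reproducibly(toy_summaries, 2)
print(first == second)
```

With the real data this would be called as `sample_reproducibly(data["summary"], 100)`.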

Let's group the summaries into batches of 4 to speed up generation:

In [55]:
BATCH_SIZE = 4
batches = [sampled_summaries[i * BATCH_SIZE:(i+1) * BATCH_SIZE] for i in range(len(sampled_summaries)//BATCH_SIZE)]
len(batches)

25
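
The list comprehension above uses integer division, so it works cleanly here because 100 is divisible by 4. With a sample size that is not a multiple of the batch size, the trailing items would be dropped. A small sketch of a batching helper that keeps the remainder (`make_batches` is a hypothetical name, not from the notebook):

```python
def make_batches(items, batch_size):
    """Split items into consecutive batches; the last one may be shorter."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Step through the list in strides of batch_size so no item is skipped.
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


toy_items = list(range(10))
toy_batches = make_batches(toy_items, 4)
print(toy_batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```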

We need a helper function that processes a batch of summaries and generates QA pairs. It is useful to wrap this function with `retry` in case we run into a rate limit, for example:

In [13]:
from langchain_core.prompts import ChatPromptTemplate
from tenacity import retry, wait_random_exponential, stop_after_attempt 

SYSTEM_PROMPT = """
Your task is to write a standalone factoid question and an answer given a context.
Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine.
Your answer to the factoid question should be detailed and relying upon given context and be accessible for a wide variety of users.
This means that your standalone factoid question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Factoid question: (your factoid question)
Answer: (your answer to the factoid question)"""

USER_PROMPT = """Now here is the context.

Context: {context}\n
Output:::"""

gen_prompt_template = ChatPromptTemplate.from_messages(
    [
        ("system", SYSTEM_PROMPT),
        ("human", USER_PROMPT),
    ]
)

gen_prompt_template.batch(inputs=[{"context": "context"},{"context": "context"}])

@retry(wait=wait_random_exponential(multiplier=4, max=120), stop=stop_after_attempt(3))
def process_batch_messages(llm, batch):
    return llm.batch(gen_prompt_template.batch(inputs=[{"context": d} for d in batch]))
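
To make the retry behaviour more concrete, here is a standard-library-only sketch of randomized exponential backoff with a bounded number of attempts. It illustrates the general idea behind the decorator arguments above and is not tenacity's implementation; the function name `call_with_backoff`, its parameters, and the simulated failure are all assumptions.

```python
import random
import time


def call_with_backoff(func, max_attempts=3, multiplier=4.0, max_wait=120.0,
                      sleep=time.sleep):
    """Call func(), retrying on any exception with randomized exponential backoff.

    Illustrative only: defaults echo the decorator arguments used above.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The wait window doubles on each attempt and is capped at max_wait.
            window = min(max_wait, multiplier * 2 ** (attempt - 1))
            sleep(random.uniform(0, window))


calls = {"count": 0}


def flaky():
    """Fail twice (as a rate-limited API might), then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


waits = []
# Record the waits instead of sleeping so the example finishes instantly.
result = call_with_backoff(flaky, sleep=waits.append)
print(result, calls["count"], len(waits))
```

Randomizing the wait spreads out retries from concurrent clients, which helps when many requests hit the same rate limit at once.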

Some other helper functions which may be useful later:

In [7]:
import hashlib

def hash_string(input: str) -> str:
    h = hashlib.new('sha256')
    h.update(input.encode())
    
    return h.hexdigest()

from datetime import datetime

def get_timestamp() -> str:
    timestamp = datetime.now()
    return timestamp.strftime("%Y-%m-%d_%H-%M-%S")

Here's the code which processes the batches and parses the output into QA pairs:

In [59]:
import json
import mlflow
from tqdm import tqdm


formatted_timestamp = get_timestamp()
mlflow.set_experiment(f"Data generation for RAG eval {formatted_timestamp}")
mlflow.langchain.autolog()

file_path = f'generated_items_{formatted_timestamp}.jsonl'

outputs = []
for batch in tqdm(batches[:]):
    responses = process_batch_messages(llm_client, batch)
    for response, doc in zip(responses, batch):
        output_QA_couple = response.content
        try:
            question = output_QA_couple.split("Factoid question: ")[-1].split("Answer: ")[0]
            answer = output_QA_couple.split("Answer: ")[-1]
            item =  {
                "document": {
                    "content": doc,
                    "collection_id": str(hash_string(doc)),
                },
                "question": question,
                "answer": answer,
            }
            outputs.append(item)
            json_line = json.dumps(item)
            with open(file_path, 'a') as file:
                file.write(json_line + '\n')
        except Exception as e:
            print(e.__str__())
            continue

2024/09/25 14:43:38 INFO mlflow.tracking.fluent: Experiment with name 'Data generation for RAG eval 2024-09-25_14-43-38' does not exist. Creating a new experiment.
100%|██████████| 25/25 [00:32<00:00,  1.28s/it]
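
Note that `str.split` does not raise when a marker is missing, so the `try`/`except` in the loop above will not catch a malformed response, and the extracted question keeps any trailing newlines. A stricter parser could look like the sketch below; `parse_qa` and `QA_PATTERN` are hypothetical names, and the regular expression assumes the model follows the `Factoid question:` / `Answer:` format requested in the system prompt.

```python
import re

# Assumed output layout: "Factoid question: <question> Answer: <answer>".
QA_PATTERN = re.compile(
    r"Factoid question:\s*(?P<question>.+?)\s*Answer:\s*(?P<answer>.+)",
    re.DOTALL,
)


def parse_qa(raw_output):
    """Return (question, answer) stripped of whitespace, or None if malformed."""
    match = QA_PATTERN.search(raw_output)
    if match is None:
        return None
    question = match.group("question").strip()
    answer = match.group("answer").strip()
    if not question or not answer:
        return None
    return question, answer


good = "Output:::\nFactoid question: What is RoPE?\n\nAnswer: A position embedding method.\n"
bad = "The model ignored the requested format."
print(parse_qa(good))
print(parse_qa(bad))
```

Responses for which `parse_qa` returns `None` can then be skipped or logged instead of being written to the JSONL file.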


### (Optional) Evaluate generated eval dataset with Prometheus 2

Serve the `prometheus-bgb-8x7b-v2.0` LLM with `text-generation-inference` (TGI):

```text
docker run -e HF_TOKEN=$HF_TOKEN --gpus all --shm-size 64g -p 8089:80 -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:2.2.0 \
    --model-id prometheus-eval/prometheus-bgb-8x7b-v2.0 \
    --dtype bfloat16 \
    --sharded true \
    --num-shard 2 \
    --max-input-tokens 32000 \
    --max-total-tokens 32768
```

Let's create a custom `LiteLLM` client which will be using our self-hosted model on TGI:

In [6]:
from prometheus_eval.litellm import LiteLLM
from prometheus_eval import PrometheusEval
from prometheus_eval.prompts import SCORE_RUBRIC_TEMPLATE
from typing import Tuple

_MODEL = "prometheus-eval/prometheus-bgb-8x7b-v2.0"

class CustomLiteLLM(LiteLLM):
    def __init__(self, api_key: str="-", **kwargs):
        """Initialize the LiteLLM with basic configurations."""
        super().__init__(**kwargs)
        
        self.api_key = api_key

    def completions(self, *args, **kwargs):

        kwargs.update({"api_key": self.api_key})
        return super().completions(*args, **kwargs)

litellm_client = CustomLiteLLM(
    name=f"huggingface/{_MODEL}",
    api_base="http://localhost:8089",
)

Prompts required to evaluate the generated QA pairs with Prometheus 2:

In [7]:
prometheus_prompt = """###Task Description:
An instruction (might include an Input inside it), a question to evaluate, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: "(write a feedback for criteria) [RESULT] (an integer number between 1 and 5)"
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
{instruction}

###Question to evaluate:
{response}

###Score Rubrics:
{rubric}

###Feedback: """
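
The prompt asks the judge to answer in the form `(feedback) [RESULT] (score)`. The `prometheus_eval` library takes care of parsing this output; the sketch below only illustrates what such an output looks like when split into its two parts. The function `split_feedback_and_score` is a hypothetical helper, not part of the library.

```python
import re


def split_feedback_and_score(judge_output):
    """Split '<feedback> [RESULT] <score>' into (feedback, score or None)."""
    feedback, marker, tail = judge_output.partition("[RESULT]")
    if not marker:
        # The judge did not follow the format: keep the text, no score.
        return judge_output.strip(), None
    # Accept only a standalone integer between 1 and 5 after the marker.
    match = re.search(r"\b[1-5]\b", tail)
    score = int(match.group()) if match else None
    return feedback.strip(), score


example = "The question is clear and grounded in the context. [RESULT] 4"
print(split_feedback_and_score(example))
```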

In [8]:
groundedness_rubric_data = {
    "criteria": "The Question to evaluate is generated by a model in response to an instruction or question and uses the provided Context. Is question formulated clearly, without ambiguity, and remain grounded in the Context?",
    "score1_description": "The question is vague or unclear and does not engage with the provided Context, making it impossible to discern how it relates to the question or instruction.",
    "score2_description": "The question partially engages with the Context but includes significant ambiguities or unclear portions, often straying from the context or not fully addressing the question.",
    "score3_description": "The question generally addresses the question using the provided Context but has occasional ambiguities or is unclear in certain aspects, making parts of the question less grounded.",
    "score4_description": "The question is mostly clear and unambiguous, providing a grounded question based on the Context. However, there are minor instances where clarity could be improved or where the grounding in the context is weaker.",
    "score5_description": "The question is entirely clear, unambiguous, and fully grounded in the provided Context. It aligns with context precisely with no unnecessary or unclear content."
}

relevance_rubric_data = {
    "criteria": "The Question to evaluate is intended for NLP researchers and practitioners. How relevant and useful is this question to the practical needs, concerns, or tasks of NLP practitioners?",
    "score1_description": "The question is irrelevant or entirely unrelated to NLP researchers and practitioners. It provides no utility and does not address any relevant issues or tasks.",
    "score2_description": "The question has minimal relevance to research and operations on NLP. It touches on tangential or obscure topics with little practical use for most researchers and practitioners.",
    "score3_description": "The question is somewhat relevant but lacks focus on key NLP use cases or concerns. While it could be useful in some scenarios, it doesn't address a common or critical issue for researchers and practitioners.",
    "score4_description": "The question is relevant and addresses a useful or moderately important aspect of NLP. It could help researchers and practitioners solve a typical problem or address a common task, but it may not target a critical or high-impact area.",
    "score5_description": "The question is highly relevant to NLP researchers and practitioners. It addresses a common, critical, or high-impact issue or task that many users are likely to face, making it very useful for NLP researchers and practitioners."
}

standalone_rubric_data = {
    "criteria": "The Question to evaluate is intended to be self-contained. Can the question be understood and answered without looking at the Context in the Instruction to evaluate, or does it depend on external information (like previous context, documents, or scenarios) to be fully comprehensible?",
    "score1_description": "The question heavily depends on external context or previous information to be understood. It refers to specific content (e.g., 'in the context' or 'in the document') and is incomplete on its own.",
    "score2_description": "The question is mostly dependent on external information. While parts of the question may be clear, it still requires knowledge of additional context or documents to be fully understood.",
    "score3_description": "The question is partially understandable on its own but still relies on some implicit context or background knowledge to be fully clear. It is incomplete without certain pieces of information.",
    "score4_description": "The question is largely standalone and makes sense without needing much additional context. It may refer to specific technical details but can generally be understood independently.",
    "score5_description": "The question is entirely self-contained and makes complete sense on its own. Even if technical terms or acronyms are used, a user with relevant expertise or access to documentation would understand it without needing additional context."
}

In [9]:
instruction_prompt = """
Your task is to write a standalone factoid question and an answer given a context.
Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine.

Context: {context}"""

The code below evaluates each QA pair and updates the original item with the evaluation scores and the corresponding feedback:

In [None]:

params={
    "temperature": 0.0, 
    "max_tokens": 500,
}

judge = PrometheusEval(model=litellm_client, absolute_grade_template=prometheus_prompt)

formatted_timestamp = get_timestamp()
file_path = f'evaluated_items_{formatted_timestamp}.jsonl'

for output in tqdm(outputs):
    for metric, rubric_criteria in {"groundedness": groundedness_rubric_data, "relevance": relevance_rubric_data, "standalone": standalone_rubric_data}.items():
        feedback, score = judge.single_absolute_grade(
            # The content is already a single string; joining it with ' ' would insert a space between every character.
            instruction=instruction_prompt.format(context=output["document"]["content"]),
            response=output["question"],             
            rubric=SCORE_RUBRIC_TEMPLATE.format(**rubric_criteria),                             
            params=params,                               
        )
        output.update(
            {
                f"{metric}_score": score,
                f"{metric}_feedback": feedback,
            }
        )
    
    json_line = json.dumps(output)
    with open(file_path, 'a') as file:
        file.write(json_line + '\n')
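
Because each item is appended to a JSONL file as soon as it is evaluated, partial results survive an interrupted run and can be loaded back later. A minimal sketch of such a round trip, using a temporary file and toy items (the helper `read_jsonl` and the toy data are assumptions for illustration):

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Load one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as file:
        return [json.loads(line) for line in file if line.strip()]


toy_items = [
    {"question": "q1", "answer": "a1", "groundedness_score": 5},
    {"question": "q2", "answer": "a2", "groundedness_score": 4},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_items.jsonl")
    # Append mode mirrors the notebook: one line is written per item.
    with open(toy_path, "a", encoding="utf-8") as file:
        for item in toy_items:
            file.write(json.dumps(item) + "\n")
    restored = read_jsonl(toy_path)

print(restored == toy_items)
```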

Let's take a look at the evaluation results:

In [69]:
import pandas as pd
import datasets

pd.set_option("display.max_colwidth", None)

generated_questions = pd.DataFrame.from_dict(outputs)

print("Evaluation dataset before filtering:")
display(
    generated_questions[
        [
            "question",
            "answer",
            "groundedness_score",
            "relevance_score",
            "standalone_score",
        ]
    ]
)
generated_questions = generated_questions.loc[
    (generated_questions["groundedness_score"] >= 4)
    & (generated_questions["relevance_score"] >= 4)
    & (generated_questions["standalone_score"] >= 5)
]
print("============================================")
print("Final evaluation dataset:")
display(
    generated_questions[
        [
            "question",
            "answer",
            "groundedness_score",
            "relevance_score",
            "standalone_score",
        ]
    ]
)

eval_dataset = datasets.Dataset.from_pandas(generated_questions, split="train", preserve_index=False)

Evaluation dataset before filtering:


| | question | answer | groundedness_score | relevance_score | standalone_score |
|---|---|---|---|---|---|
| 0 | What is the name of the approach that generates both reasoning traces and task-specific actions in an interleaved manner for large language models?\n\n | The approach is named ReAct. | 5 | 5 | 4 |
| 1 | What are the two main sources of carbon emissions in computing?\n | The two main sources of carbon emissions in computing are operational energy consumption and hardware manufacturing and infrastructure. | 5 | 3 | 4 |
| 2 | What is the name of the test introduced to measure the magnitude of overall bias in neural language models?\n | The Contextualized Embedding Association Test (CEAT). | 5 | 4 | 5 |
| 3 | What is the name of the large language model introduced in the paper that can store, combine and reason about scientific knowledge?\n | Galactica | 5 | 5 | 4 |
| 4 | What is the name of the dataset collected for studying the characteristics of ChatGPT's responses and comparing them with human experts?\n\n | The collected dataset is called the Human ChatGPT Comparison Corpus (HC3), which contains tens of thousands of comparison responses from both human experts and ChatGPT, covering various areas such as open-domain, financial, medical, legal, and psychological questions. | 5 | 5 | 5 |
| ... | ... | ... | ... | ... | ... |
| 95 | What is the name of the pre-training objective proposed in the paper that combines diverse pre-training paradigms together?\n\n | The pre-training objective proposed in the paper is called Mixture-of-Denoisers (MoD). | 5 | 4 | 4 |
| 96 | What is the dimensionality of the subspace that the gradient dynamically converges to in large-scale deep learning scenarios?\n | The subspace is spanned by a few top eigenvectors of the Hessian, equal to the number of classes in the dataset. | 5 | 3 | 4 |
| 97 | What is the name of the novel method proposed to effectively leverage positional information in transformer-based language models?\n | The novel method is named Rotary Position Embedding (RoPE). | 5 | 5 | 4 |
| 98 | What is the size range of the language models investigated in the red teaming efforts?\n | The language models investigated in the red teaming efforts range in size from 2.7 billion parameters to 52 billion parameters. | 5 | 4 | 4 |


Final evaluation dataset:


| | question | answer | groundedness_score | relevance_score | standalone_score |
|---|---|---|---|---|---|
| 2 | What is the name of the test introduced to measure the magnitude of overall bias in neural language models?\n | The Contextualized Embedding Association Test (CEAT). | 5 | 4 | 5 |
| 4 | What is the name of the dataset collected for studying the characteristics of ChatGPT's responses and comparing them with human experts?\n\n | The collected dataset is called the Human ChatGPT Comparison Corpus (HC3), which contains tens of thousands of comparison responses from both human experts and ChatGPT, covering various areas such as open-domain, financial, medical, legal, and psychological questions. | 5 | 5 | 5 |
| 7 | What is the name of the dataset used in the preliminary experiments to test the neural math solver model?\n | The dataset used in the preliminary experiments is called Math23K. | 5 | 4 | 5 |
| 9 | What is the name of the open-source modular library introduced for optimizing language generators with reinforcement learning?\n | The library is called RL4LMs (Reinforcement Learning for Language Models). | 5 | 5 | 5 |
| 14 | What are the six specific risk areas associated with large-scale Language Models outlined in the paper?\n\n | The six specific risk areas associated with large-scale Language Models outlined in the paper are: I. Discrimination, Exclusion and Toxicity, II. Information Hazards, III. Misinformation Harms, IV. Malicious Uses, V. Human-Computer Interaction Harms, and VI. Automation, Access, and Environmental Harms. | 5 | 5 | 5 |
| 19 | What are the key qualities necessary to build an engaging open-domain conversational agent?\n | The key qualities necessary to build an engaging open-domain conversational agent include continual learning, providing engaging content, and being well-behaved. | 5 | 4 | 5 |
| 25 | What is the name of the dataset introduced to evaluate the capabilities of computational models for text understanding?\n | LAMBADA | 5 | 5 | 5 |
| 29 | How do large language models typically generate answers?\n | Large language models typically generate answers through a single call to the model, resulting in a lack of transparency and potentially compromising performance on multi-step problems. | 4 | 4 | 5 |
| 30 | What percentage of test cases mapped "Muslim" to "terrorist" in the GPT-3 language model?\n | In the GPT-3 language model, "Muslim" was analogized to "terrorist" in 23% of test cases. | 5 | 5 | 5 |
| 31 | How many papers from the ACL anthology were surveyed to examine the consideration of race in NLP research and development?\n | 79 papers from the ACL anthology were surveyed to examine the consideration of race in NLP research and development. | 5 | 5 | 5 |
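
The filter applied above keeps a row only when groundedness and relevance are at least 4 and the standalone score is 5. The same rule can be written as a small, dependency-free predicate, which also makes the thresholds easy to adjust. This is a sketch with hypothetical names (`THRESHOLDS`, `passes_thresholds`) and toy rows, not the notebook's own code:

```python
# Minimum score required per metric, mirroring the pandas filter.
THRESHOLDS = {
    "groundedness_score": 4,
    "relevance_score": 4,
    "standalone_score": 5,
}


def passes_thresholds(item, thresholds=THRESHOLDS):
    """Return True when every metric is an integer at or above its minimum."""
    for column, minimum in thresholds.items():
        score = item.get(column)
        # A missing or unparsed score (None) fails the filter instead of raising.
        if not isinstance(score, int) or score < minimum:
            return False
    return True


toy_rows = [
    {"question": "q1", "groundedness_score": 5, "relevance_score": 5, "standalone_score": 4},
    {"question": "q2", "groundedness_score": 4, "relevance_score": 4, "standalone_score": 5},
    {"question": "q3", "groundedness_score": 5, "relevance_score": 3, "standalone_score": 5},
    {"question": "q4", "groundedness_score": None, "relevance_score": 5, "standalone_score": 5},
]
kept = [row["question"] for row in toy_rows if passes_thresholds(row)]
print(kept)  # ['q2']
```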


Only about a quarter of the generated QA pairs pass the filter (groundedness and relevance scores of at least 4, and a standalone score of 5). Let's save the filtered dataset:

In [68]:
eval_dataset.save_to_disk("NLP_eval_dataset")

Saving the dataset (1/1 shards): 100%|██████████| 25/25 [00:00<00:00, 3591.38 examples/s]
