## RAG Evaluation (LLM as a Judge)

### Evaluation-Dataset Generation Workflow

The basic workflow for automatically generating a RAG dataset starts with reading our knowledge base from documents, such as PDF files.

Then we ask a generator LLM to generate question-answer pairs from the given document context.

Finally, we use a judge LLM to perform quality control. The LLM will give each question-answer-context sample a score, which we can use to filter out bad samples.

Inspired by: The Hugging Face cookbook “RAG Evaluation”
(Link: https://huggingface.co/learn/cookbook/rag_evaluation)
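
The three-stage workflow described above can be sketched in a few lines. This is only a minimal outline under assumptions: `build_eval_dataset`, `fake_generate`, and `fake_judge` are hypothetical names, and the two toy functions stand in for the generator and judge LLM calls that are defined later in this notebook.

```python
from typing import Callable, Iterable


def build_eval_dataset(
    chunks: Iterable[str],
    generate: Callable[[str], tuple[str, str]],
    judge: Callable[[str, str, str], int],
    min_score: int,
) -> list[dict]:
    """Generate one QA pair per chunk and keep only samples the judge rates highly."""
    samples = []
    for context in chunks:
        question, answer = generate(context)      # stage 2: generator LLM
        score = judge(context, question, answer)  # stage 3: judge LLM
        if score >= min_score:
            samples.append(
                {"context": context, "question": question, "answer": answer, "score": score}
            )
    return samples


# Toy stand-ins for the two LLM calls (hypothetical, for illustration only)
def fake_generate(context: str) -> tuple[str, str]:
    return f"What does this passage say about '{context.split()[0]}'?", context


def fake_judge(context: str, question: str, answer: str) -> int:
    return 4 if answer in context and len(context) > 20 else 1


chunks = ["Einstein developed the theory of relativity.", "Too short."]
dataset = build_eval_dataset(chunks, fake_generate, fake_judge, min_score=4)
print(len(dataset))
```

Passing the generator and the judge as plain callables keeps the pipeline independent of any particular LLM client.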

#### Loading environment variables

In [None]:
#%pip install python-dotenv

In [None]:
from dotenv import load_dotenv

load_dotenv()  # take environment variables from .env.
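
Since the client setup below reads several variables from the environment, it can help to check early that they are all present. The following is a small sketch; the variable names are the ones used in this notebook, while `missing_env_vars` is a hypothetical helper and not part of python-dotenv.

```python
import os
from typing import Mapping, Sequence

# Variables read by the Azure OpenAI cells in this notebook
REQUIRED_VARS = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_DEPLOYMENT_ID",
)


def missing_env_vars(required: Sequence[str], env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```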

#### Setting the client

##### Azure OpenAI

In [None]:
# install from PyPI
#%pip install openai

In [None]:
import os
from openai import AzureOpenAI

openai_client = AzureOpenAI(
  azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"), 
  api_key=os.getenv("AZURE_OPENAI_API_KEY"),  
  api_version=os.getenv("AZURE_OPENAI_API_VERSION")
)

# simple test
chat_completion = openai_client.chat.completions.create(
    model=os.getenv("AZURE_DEPLOYMENT_ID"), # in my case: "models-gpt-4o"
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Does Azure OpenAI support customer managed keys?"},
        {"role": "assistant", "content": "Yes, customer managed keys are supported by Azure OpenAI."},
        {"role": "user", "content": "Do other Azure AI services support this too?"}
    ]
)
print(chat_completion.choices[0].message.content)


In [None]:
# # Possible alternative?

# #!pip install azure-ai-inference

# ### Azure Inference Client
# from azure.ai.inference import ChatCompletionsClient
# from azure.core.credentials import AzureKeyCredential

# # For Azure OpenAI endpoint
# client = ChatCompletionsClient(
#     endpoint=endpoint,  # Of the form https://<your-resource-name>.openai.azure.com/openai/deployments/<your-deployment-name>
#     credential=AzureKeyCredential(key),
#     api_version="2024-06-01",  # Azure OpenAI api-version. See https://aka.ms/azsdk/azure-ai-inference/azure-openai-api-versions
# )

##### Hugging Face as an alternative? (Free, but with very limited capacity)

In [None]:
# from langchain_huggingface import HuggingFaceEndpoint

# # Use your Hugging Face API token
# HF_TOKEN = "your HuggingFace token here!"

# # Initialize the LLM client with authentication
# client = HuggingFaceEndpoint(
#     repo_id="HuggingFaceH4/zephyr-7b-beta",
#     huggingfacehub_api_token=HF_TOKEN
# )

# # Let’s perform a quick sanity check to see that everything works as expected:

# response = client.invoke("Say this is a test")
# print(response)  # Directly prints the response string

### Read Files

In [None]:
#%pip install nltk

#### NLTK
The Natural Language Toolkit (NLTK) is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) of English text, written in Python. It supports classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

In [None]:
import os
import nltk

project_path = "<your-project-path>"
nltk_path = os.path.join(project_path, "nltk_data")

# Check that the local NLTK data directory exists before registering it
if os.path.exists(nltk_path):
    print("NLTK data directory exists. Files inside:")
    print(os.listdir(nltk_path))
else:
    print("NLTK data directory is missing.")

nltk.data.path.append(nltk_path)
#nltk.download('averaged_perceptron_tagger', download_dir=nltk_path)
#nltk.download('punkt', download_dir=nltk_path)


In [None]:
# check current directory
cwd = os.getcwd()
print(cwd)

In [None]:
# Verify that the NLTK package is recognized
try:
    nltk.data.find('tokenizers/punkt')
    print("NLTK 'punkt' tokenizer is available.")
except LookupError:
    print("NLTK 'punkt' tokenizer is missing. Please check your installation.")

try:
    nltk.data.find('taggers/averaged_perceptron_tagger')
    print("NLTK 'averaged_perceptron_tagger' tagger is available.")
except LookupError:
    print("NLTK 'averaged_perceptron_tagger' not found. Please download it manually using nltk.download('averaged_perceptron_tagger')")


We will use LangChain to read a folder with all our files.

First, we need to install the necessary packages. LangChain’s DirectoryLoader uses the unstructured library to read all kinds of file types. In this notebook, I will only be reading PDFs so we can install a smaller version of unstructured.

In [None]:
#%pip install langchain-community unstructured[pdf]

Now we can read our data folder to get the LangChain documents. The following code loads all the PDF files from a folder; the next step then splits them into relatively large chunks of up to 2,000 characters with an overlap of 200 characters.

In [None]:
import os
from langchain_community.document_loaders.directory import DirectoryLoader  # type: ignore
documents_path = os.path.join(project_path, r"flowiseai\eu_ai_act\document_store\EN")
loader = DirectoryLoader(documents_path, glob="**/*.pdf", show_progress=True)
docs = loader.load()

#### Text Splitting/Chunking

In [None]:
from langchain_text_splitters.character import RecursiveCharacterTextSplitter 
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
    add_start_index=True,
    separators=["\n\n", "\n", ".", " ", ""],
)

docs_processed = []
for doc in docs:
    docs_processed.extend(text_splitter.split_documents([doc]))

#### Verify chunked documents

In [None]:
from IPython.display import display, HTML 

formatted_docs = "\n\n".join([
    f"<div style='margin-bottom: 20px; padding: 10px; border-bottom: 1px solid #ddd;'>"
    f"<h3>Document {i+1}</h3><pre>{doc}</pre></div>"
    for i, doc in enumerate(docs_processed)
])

display(HTML(f"""
    <div style="max-height: 400px; overflow-y: auto; border: 1px solid #ccc; padding: 10px;">
        {formatted_docs}
    </div>
"""))


The result is a list 'docs_processed' with items of the type Document. Each document has some metadata and the actual page_content.

This list of documents is our knowledge base from which we will create question-answer pairs based on the context of the page_content.
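
To make the roles of `chunk_size` and `chunk_overlap` more concrete, here is a deliberately simplified sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break at the given separators); `sliding_window_chunks` is a hypothetical helper that only cuts fixed-size character windows, and the dictionary keys merely mimic the `start_index` metadata and `page_content` mentioned above.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[dict]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "page_content": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(text, chunk_size=10, chunk_overlap=2)
for chunk in demo_chunks:
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so neighbouring chunks share their boundary text and a fact that straddles a boundary is less likely to be cut in half.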

### Generating Question-Answer-Context Samples
Using the OpenAI client and the model we created earlier, we first write a generator function to create questions and answers from our documents.

In [None]:
#from huggingface_hub import InferenceClient
import json

def qa_generator_llm(context: str, client, model: str = os.getenv("AZURE_DEPLOYMENT_ID")): # original model: AMead10/Llama-3.2-3B-Instruct-AWQ
    generation_prompt = f"""
    Your task is to write a factoid question and an answer given a context.
    Your factoid question should be answerable with a specific, concise piece of factual information from the context.
    Your factoid question should be formulated in the same style as questions users could ask in a search engine.
    This means that your factoid question MUST NOT mention something like "according to the passage" or "context".

    Provide your answer as follows:

    Output:::
    Factoid question: (your factoid question)
    Answer: (your answer to the factoid question)

    Now here is the context.

    Context: {context}\n
    Output:::
    """

    # Send request to the model
    
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": generation_prompt}],
        temperature=0.5,
        max_tokens=500,
        top_p=0.99
    )
    
    return response  # The response should contain the generated question-answer pair

# Example usage:
client = openai_client
context = "Albert Einstein developed the theory of relativity, which revolutionized modern physics."
output = qa_generator_llm(context, client)

display(output)

In [None]:
# def qa_generator_llm(context: str, client: OpenAI, model: str = "AMead10/Llama-3.2-3B-Instruct-AWQ"):
#     generation_prompt = """
# Your task is to write a factoid question and an answer given a context.
# Your factoid question should be answerable with a specific, concise piece of factual information from the context.
# Your factoid question should be formulated in the same style as questions users could ask in a search engine.
# This means that your factoid question MUST NOT mention something like "according to the passage" or "context".

# Provide your answer as follows:

# Output:::
# Factoid question: (your factoid question)
# Answer: (your answer to the factoid question)

# Now here is the context.

# Context: {context}\n
# Output:::"""

#     chat_completion = client.chat.completions.create(
#         messages=[
#             {
#                 "role": "system",
#                 "content": "You are a question-answer pair generator."
#             },
#             {
#                 "role": "user",
#                 "content": generation_prompt.format(context=context),
#             }
#         ],
#         model=model,
#         temperature=0.5,
#         top_p=0.99,
#         max_tokens=500
#     )

#     return chat_completion.choices[0].message.content

If you want to use a language other than English, you will need to translate the generation_prompt (and the system instruction).

Next, we simply loop through all of our document chunks in our knowledge base and generate a question and an answer for each chunk.

In [None]:
# Check the number of document chunks
count = len(docs_processed)
print("Number of document chunks: ", count)

In [None]:
from tqdm.notebook import tqdm # Instantly make your loops show a progress meter

outputs = []
num_questions = 100  # Change this to the desired number of Q&A pairs
print(f"Generating {num_questions} QA pairs...")

for doc in tqdm(docs_processed[:num_questions]):
#for doc in tqdm(docs_processed): # in this case the number of generated qa is = size(docs_processed)
    
    # Generate QA couple
    output_QA = qa_generator_llm(doc.page_content, client).choices[0].message.content
    
    try:
        question = output_QA.split("Factoid question: ")[-1].split("Answer: ")[0].strip()
        answer = output_QA.split("Answer: ")[-1].strip()
        assert len(answer) < 500, "Answer is too long"
        outputs.append(
            {
                "context": doc.page_content,
                "question": question,
                "answer": answer,
                "source_doc": doc.metadata["source"],
            }
        )
    except Exception as e:
        print(e)

Depending on how many PDF files are used, this may take a while... <br> 
Don’t forget to translate the strings in output_QA.split if necessary.
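
The `split`-based parsing above silently returns the whole response as both question and answer when the model ignores the requested format. A stricter alternative is sketched below using a regular expression; `parse_qa_output` and `QA_PATTERN` are hypothetical names, and the marker strings are the ones from the generation prompt (they would need translating along with the prompt).

```python
import re
from typing import Optional

# Markers taken from the generation prompt; translate them together with the prompt
QA_PATTERN = re.compile(
    r"Factoid question:\s*(?P<question>.+?)\s*Answer:\s*(?P<answer>.+)",
    flags=re.DOTALL,
)


def parse_qa_output(text: str, max_answer_len: int = 500) -> Optional[dict]:
    """Extract question and answer; return None if the format or length check fails."""
    match = QA_PATTERN.search(text)
    if match is None:
        return None
    question = match.group("question").strip()
    answer = match.group("answer").strip()
    if not question or not answer or len(answer) >= max_answer_len:
        return None
    return {"question": question, "answer": answer}


sample = "Factoid question: Who developed the theory of relativity?\nAnswer: Albert Einstein"
print(parse_qa_output(sample))
print(parse_qa_output("The model ignored the requested format."))
```

Returning `None` for malformed responses makes it easy to skip them explicitly instead of storing unusable samples.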

#### Verifying the outputs

In [None]:
print("The generated question-answer pairs are in ", str(type(outputs)), " format")
display(outputs)

#### Converting the QA to JSON Format
To generate a RAG evaluation dataset, I used a PDF of the EU AI Act, the European Union's regulation on artificial intelligence. <br>
Here is the raw generated dataset:

In [None]:
import json
from IPython.display import display, HTML

json_output = json.dumps(outputs, indent=4, ensure_ascii=False)
display(HTML(f'<div style="white-space: pre-wrap; overflow-y: auto; height: 300px; border: 1px solid #ccc;">{json_output}</div>'))


### Filtering out Bad Question-Answer Pairs
Next, we use an LLM as a judge to automatically filter out bad samples.

When using an LLM as a judge to evaluate the quality of a sample, it is best practice to use a different model than the one that was used to generate it because of a self-preference bias.

When it comes to judging our generated questions and answers, there are a lot of possible prompts we could use.

To build our prompt, there is a structure we can use from the G-Eval paper:

1. We start with the task introduction.
2. We present our evaluation criteria.
3. We want the model to perform chain-of-thought (CoT) reasoning to improve its performance.
4. We ask for the total score at the end.

For the evaluation criteria, we can use a list where each criterion adds one point if it is fulfilled.

The evaluation criteria should ensure that the question, the answer, and the context all fit together and make sense.

Here are two evaluation criteria from the Hugging Face RAG evaluation cookbook:

**Groundedness**: can the question be answered from the given context?<br>
**Stand-alone**: is the question understandable without any context? (To avoid a question like "What is the name of the function used in this guide?")

And two more evaluation criteria from the RAGAS paper:

**Faithfulness**: the answer should be grounded in the given context<br>
**Answer Relevance**: the answer should address the actual question posed

You can try to add more criteria or change the text for the ones that I used.
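
If you experiment with the criteria, it may be convenient to keep them in one place so that the criteria list in the prompt and the maximum score stay in sync. A possible sketch follows; `CRITERIA` and `build_criteria_block` are hypothetical names, and the rule texts are shortened paraphrases of the four criteria above.

```python
# One entry per additive criterion: name -> scoring rule
CRITERIA = {
    "Groundedness": "Add 1 point if the question can be answered from the context.",
    "Stand-alone": "Add 1 point if the question is understandable without the context.",
    "Faithfulness": "Add 1 point if the answer can be derived from the context.",
    "Answer Relevance": "Add 1 point if the answer actually answers the question.",
}


def build_criteria_block(criteria: dict[str, str]) -> tuple[str, int]:
    """Return the bullet list for the prompt and the maximum achievable score."""
    lines = [f"- {name}: {rule}" for name, rule in criteria.items()]
    return "\n".join(lines), len(criteria)


criteria_text, max_score = build_criteria_block(CRITERIA)
print(criteria_text)
print(f"Total rating: (your rating, as an integer number between 0 and {max_score})")
```

Because each criterion contributes exactly one point, the maximum score is simply the number of criteria, which is also the threshold to use when only perfect samples should be kept.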

Here is the judge_llm() function, which critiques a question, answer, and context sample and produces a total rating score at the end:

In [None]:
def judge_llm(
    context: str,
    question: str,
    answer: str,
    client,
    eval_model: str, # this model needs to be different from the model that generated the QA pairs
):
    critique_prompt = """
    You will be given a question, answer, and a context.
    Your task is to provide a total rating using the additive point scoring system described below.
    Points start at 0 and are accumulated based on the satisfaction of each evaluation criterion:

    Evaluation Criteria:
    - Groundedness: Can the question be answered unambiguously from the given context? Add 1 point if the question can be answered from the context
    - Stand-alone: Is the question understandable free of any context, for someone with domain knowledge/Internet access? Add 1 point if the question is independent and can stand alone.
    - Faithfulness: The answer should be grounded in the given context. Add 1 point if the answer can be derived from the context
    - Answer Relevance: The generated answer should address the actual question that was provided. Add 1 point if the answer actually answers the question

    Provide your answer as follows:

    Answer:::
    Evaluation: (your rationale for the rating, as a text)
    Total rating: (your rating, as an integer number between 0 and 4)

    You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

    Now here are the question, answer, and context.

    Question: {question}\n
    Answer: {answer}\n
    Context: {context}\n
    Answer::: """

    chat_completion = client.chat.completions.create(
        messages=[
            {"role": "system", "content": "You are a neutral judge."},
            {
                "role": "user",
                "content": critique_prompt.format(
                    question=question, answer=answer, context=context
                ),
            },
        ],
        model=eval_model,
        temperature=0.1,
        top_p=0.99,
        max_tokens=800
    )

    return chat_completion.choices[0].message.content

Now we loop through our generated dataset and critique each sample:

In [None]:
import math
from tqdm.notebook import tqdm  # Use tqdm.notebook for Jupyter Notebook compatibility

qa_evaluator_client = openai_client
evaluation_progress_bar = tqdm(total=len(outputs), desc="Evaluating Outputs", unit="evaluation")
for output in outputs:
    try:
        evaluation = judge_llm(
            context=output["context"],
            question=output["question"],
            answer=output["answer"],
            client=qa_evaluator_client,
            eval_model= "models-gpt-35-turbo",
        )
        # Parse the judge output; "rationale" avoids shadowing the built-in eval()
        score, rationale = (
            #int(evaluation.split("Total rating: ")[-1].strip()),
            math.floor(float(evaluation.split("Total rating: ")[-1].strip())),
            evaluation.split("Total rating: ")[-2].split("Evaluation: ")[1],
        )
        output.update(
            {
                "score": score,
                "eval": rationale
            }
        )
        print("evaluation score of: ", str(score) ," \n for question: ", output["question"], " \n and answer: ", output["answer"])  
    except Exception as e:
        print(e)
    evaluation_progress_bar.update(1)
evaluation_progress_bar.close()

Let’s filter out all the bad samples.

Since the generated dataset will be the ground truth for evaluation purposes, we should only allow very high-quality data samples. That’s why I decided to keep only samples with the highest possible score.
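
Parsing the judge's free-text output is the fragile part of this step. The sketch below shows one possible way to extract the rating and rationale with regular expressions and then keep only samples with the maximum score; `parse_judge_output` is a hypothetical helper, and the example strings are made up for illustration rather than real model outputs.

```python
import re
from typing import Optional

RATING_PATTERN = re.compile(r"Total rating:\s*(\d+(?:\.\d+)?)")
EVAL_PATTERN = re.compile(r"Evaluation:\s*(.*?)\s*Total rating:", flags=re.DOTALL)


def parse_judge_output(text: str, max_score: int = 4) -> Optional[dict]:
    """Return {'score', 'eval'}, or None if no valid rating is found."""
    ratings = RATING_PATTERN.findall(text)
    if not ratings:
        return None
    score = int(float(ratings[-1]))  # last rating wins; fractions are rounded down
    if not 0 <= score <= max_score:
        return None
    eval_match = EVAL_PATTERN.search(text)
    rationale = eval_match.group(1) if eval_match else ""
    return {"score": score, "eval": rationale}


# Made-up judge responses, for illustration only
judged = [
    parse_judge_output("Evaluation: All four criteria are met.\nTotal rating: 4"),
    parse_judge_output("Evaluation: The answer is not in the context.\nTotal rating: 2.5"),
    parse_judge_output("I cannot rate this sample."),
]
kept = [j for j in judged if j is not None and j["score"] == 4]
print(kept)
```

Unparsable responses come back as `None`, so they can be dropped or re-queried rather than being assigned an arbitrary score.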

#### Verify if every QA-pair has a valid score

In [None]:
for item in outputs:
    item.setdefault("score", 1)
    item.setdefault("eval", "The question was not scored by the LLM. A default score of 1 is set.")


In [None]:
#%pip install datasets
import pandas as pd
from datasets import Dataset

# Filtering based on the score
qa_filtered = [doc for doc in outputs if doc["score"] >= 4]

# Converting qa to pandas Dataframe
qa_df = pd.DataFrame(qa_filtered)

In [None]:
# verifying the number of qualified question & answers:
count = len(qa_df)
print(f"{count} QA pairs have passed the quality control.")

And here is our final RAG evaluation dataset as a Pandas DataFrame:

#### First visualization option:

In [None]:
# Set Pandas' display options
pd.set_option("display.max_colwidth", 50)

display(qa_df)

#### Second visualization option:

In [None]:
import textwrap
from IPython.display import display, HTML

# Set Pandas' display options
pd.set_option("display.max_colwidth", 50)

# Function to wrap long text (ensuring readability)
def wrap_text(text, width=50):
    if isinstance(text, str):
        return "<br>".join(textwrap.wrap(text, width))  # Wraps text using <br> for HTML
    return text

# Apply text wrapping (DataFrame.map requires pandas >= 2.1; use applymap on older versions)
df2 = qa_df.map(lambda x: wrap_text(x, width=50))

# Convert DataFrame to HTML
html = df2.to_html(escape=False)


styled_html = f"""
    <div style="overflow-x: auto; overflow-y: auto; max-height: 400px; border: 1px solid #ddd; padding: 5px;">
        <style>
            table {{
                border-collapse: collapse; 
                width: 100%; 
                table-layout: auto;
            }}
            th, td {{
                min-width: 30px; 
                padding: 5px; 
                word-wrap: break-word; 
                text-align: left !important;  /* Ensure left alignment for both headers and data */
            }}
            th {{
                white-space: nowrap;         /* Prevent text from wrapping in headers */
            }}
        </style>
        {html}
    </div>
"""

# Display the properly formatted DataFrame
display(HTML(styled_html))


### Saving the Dataset
We can convert our Pandas DataFrame into a HuggingFace dataset. Then, we can save it to disk and load it later when needed.

In [None]:
# converting to dataset format from HuggingFace
qa_dataset = Dataset.from_pandas(qa_df, split="test")
print("Data type after conversion: ", str(type(qa_dataset)))

# save QA in HuggingFace and JSON Format:
qa_list_path = os.path.join(project_path, r"flowiseai\eu_ai_act\documents_qa")
qa_dataset.save_to_disk(qa_list_path)

# Keys to keep
keys_to_keep = {"question","answer","source_doc"}

# Extract a subset of keys
qa_selected = [{key: item[key] for key in keys_to_keep if key in item} for item in qa_filtered]

# Convert to JSON string
qa_json = json.dumps(qa_selected, indent=4, ensure_ascii=False)

# Writing to qa_eu_ai_act.json (UTF-8, since ensure_ascii=False may emit non-ASCII characters)
with open(os.path.join(qa_list_path, "qa_eu_ai_act.json"), "w", encoding="utf-8") as outfile:
    outfile.write(qa_json)


#### Load the (previously) saved dataset

In [None]:
from datasets import Dataset

# Load via the class so that this cell also works in a fresh session
qa_dataset = Dataset.load_from_disk(qa_list_path)

# display using pandas
qa_df = qa_dataset.to_pandas()
display(qa_df)

### Connecting to the RAG using Flowise API
Now we have created a RAG evaluation dataset from a collection of documents.

To change the domain of our RAG evaluation dataset, we simply exchange the documents that we feed to the DirectoryLoader. The documents do not have to be PDF files; they can also be CSV files, Markdown files, etc.

To change the language of our RAG evaluation dataset, we simply translate the LLM prompts from English to another language.

The next step is to connect to the RAG system in Flowise:

In [None]:
import requests 

API_URL = "<your-api-url>" # QnA V 0.4

def query(payload):
    response = requests.post(API_URL, json=payload, verify=False)
    return response.json()

##### Simple Test

In [None]:
output = query({
    "question": "What is the purpose of the Regulation regarding artificial intelligence systems in the Union?",
})

# list of available keys:
print("List of available keys:")
print(output.keys())

# Keys to extract
keys_to_extract = ['text', 'chatId', 'sessionId','sourceDocuments']

# Create a new dictionary with only the desired keys
filtered_output = {key: output[key] for key in keys_to_extract if key in output}
print(filtered_output)

### Benchmarking the RAG system

The RAG system and the evaluation datasets are now ready. The last step is to judge the RAG system’s output on this evaluation dataset.

To this end, we set up a judge agent.

Out of the different RAG evaluation metrics, we choose to focus only on **faithfulness** since it is the best end-to-end metric of our system’s performance.

In [None]:
import datasets
from typing import Optional    
    
def query_rag_system(payload: dict):
    """Queries the remote RAG system via API and retrieves an answer."""
    
    response = requests.post(API_URL, json=payload, verify=False)
    
    if response.status_code == 200:
        data = response.json()
        if not data.get("text"):  # Raise an error if the answer is empty
            raise ValueError(f"Error: No response received from the RAG system. Response: {data}")
        return data
    else:
        raise RuntimeError(f"Error {response.status_code}: Failed to retrieve response. Info: {response.text}")


def run_rag_tests(
    eval_dataset: datasets.Dataset,
    output_file: str,
    verbose: Optional[bool] = True,
    test_settings: Optional[dict] = None,  # To pass the test settings used
):
    """Runs RAG tests on the given dataset and saves the results to the given output file."""
    try: # load previous generations if they exist
        with open(output_file, "r") as f:
            outputs = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):  # no usable checkpoint yet
        outputs = []

    for example in eval_dataset:
        question = example["question"]
        
        if question in [output["question"] for output in outputs]:
            continue

        query_dict = {"question": question}
        if test_settings:  # override the retriever config only when settings are given
            query_dict["overrideConfig"] = {
                "JinaRerankRetriever_0": {
                    "TOP N": test_settings["rerank_topn"]
                }
            }
        
 
        response = query_rag_system(query_dict) 
        
        # Keys to extract
        
        # List of available keys (for debugging):
        # print("List of available keys in the API response:")
        # print(response.keys())
        
        answer = response.get("text")
        relevant_docs = response.get("sourceDocuments")

        if verbose:
            print("======================================================")
            print(f"Question: {question}")
            print(f"Answer: {answer}")
            print(f'True answer: {example["answer"]}')

        result = {
            "question": question,
            "true_answer": example["answer"],
            "source_doc": example.get("source_doc", "Unknown"),
            "generated_answer": answer,
            "retrieved_docs": relevant_docs,
        }
        if test_settings:
            result["test_settings"] = str(test_settings)
        outputs.append(result)

        with open(output_file, "w") as f:
            json.dump(outputs, f, indent=4)
    print("======================================================")
    print("Testing complete. Results saved to", output_file)
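
The checkpointing idea used in `run_rag_tests` (load earlier results, skip questions that are already answered, save after every new answer) can be isolated in a small sketch. The names `load_checkpoint`, `run_with_checkpoint`, and `fake_rag` are hypothetical, and `fake_rag` only stands in for the real API call.

```python
import json
import os
import tempfile


def load_checkpoint(path: str) -> list[dict]:
    """Return previously saved results, or an empty list if none exist yet."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return []


def run_with_checkpoint(questions: list[str], answer_fn, path: str) -> list[dict]:
    """Answer only the questions not yet in the checkpoint; save after each one."""
    results = load_checkpoint(path)
    done = {r["question"] for r in results}  # set lookup instead of rebuilding a list
    for question in questions:
        if question in done:
            continue
        results.append({"question": question, "generated_answer": answer_fn(question)})
        done.add(question)
        with open(path, "w", encoding="utf-8") as f:
            json.dump(results, f, indent=4, ensure_ascii=False)
    return results


# Demo with a stand-in for the RAG call (hypothetical)
calls = []


def fake_rag(question: str) -> str:
    calls.append(question)
    return question.upper()


path = os.path.join(tempfile.mkdtemp(), "rag_results.json")
run_with_checkpoint(["q1", "q2"], fake_rag, path)
results = run_with_checkpoint(["q1", "q2", "q3"], fake_rag, path)  # resumes: only q3 is new
print(len(results), calls)
```

Saving after every answer costs some I/O, but an interrupted run loses at most one API call.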


#### Defining the evaluation prompt

In [None]:
EVALUATION_PROMPT = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: {{write a feedback for criteria}} [RESULT] {{an integer number between 1 and 5}}\"
4. Please do not generate any other opening, closing, and explanations. Be sure to include [RESULT] in your output.

###The instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Reference Answer (Score 5):
{reference_answer}

###Score Rubrics:
[Is the response correct, accurate, and factual based on the reference answer?]
Score 1: The response is completely incorrect, inaccurate, and/or not factual.
Score 2: The response is mostly incorrect, inaccurate, and/or not factual.
Score 3: The response is somewhat correct, accurate, and/or factual.
Score 4: The response is mostly correct, accurate, and factual.
Score 5: The response is completely correct, accurate, and factual.

###Feedback:"""

# Import from langchain_core, where these classes are defined
from langchain_core.prompts import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain_core.messages import SystemMessage


evaluation_prompt_template = ChatPromptTemplate.from_messages(
    [
        SystemMessage(content="You are a fair evaluator language model."),
        HumanMessagePromptTemplate.from_template(EVALUATION_PROMPT),
    ]
)

#### Defining the evaluation function

In [None]:
client_eval_model = "models-gpt-4o-mini"
client = openai_client

def evaluate_answers(
    answer_path: str,
    client,
    evaluator_model: str,
    evaluation_prompt_template: ChatPromptTemplate,
) -> None:
    """Evaluates generated answers. Modifies the given answer file in place for better checkpointing."""
    answers = []
    if os.path.isfile(answer_path):  # load previous generations if they exist
        with open(answer_path, "r") as f:
            answers = json.load(f)

    for experiment in answers:
        if f"eval_score_{evaluator_model}" in experiment:
            continue # Skip already evaluated experiments

        # Here is the original prompt for OpenAI:
        # eval_prompt = evaluation_prompt_template.format_messages(
        #     instruction=experiment["question"],
        #     response=experiment["generated_answer"],
        #     reference_answer=experiment["true_answer"],
        # )
        
        # Adopted prompt for Azure OpenAI:
        eval_prompt = [
            {"role": "system", "content": "You are a fair evaluator language model. First, provide reasoning, then give the final score from 1 to 5."},
            {"role": "user", "content": evaluation_prompt_template.format(
            instruction=experiment["question"],
            response=experiment["generated_answer"],
            reference_answer=experiment["true_answer"],
            )}
        ]
        
        # Call the evaluation model
        eval_result =client.chat.completions.create(
            model=evaluator_model,  # Adjust this based on your service API
            messages=eval_prompt,  # Ensure eval_prompt is a list of messages
        )

        # feedback, score = [item.strip() for item in eval_result.content.split("[RESULT]")] # original command for Open AI
        
        # Extract response content properly
        response_text = eval_result.choices[0].message.content  # Correct way to access model output

        # Ensure result format is correct before splitting
        if "[RESULT]" in response_text:
            feedback, score = [item.strip() for item in response_text.split("[RESULT]")]
            
        else:
            feedback= response_text # Handle missing "[RESULT]"
            print("A missing score is replaced with a default score value of 3")
            print("Here is the corresponding feedback:", response_text)
            score = 3 # Handle missing score
            #raise ValueError(f"Error: Missing '[RESULT]' in response: {response_text}")
            
            
        experiment[f"eval_score_{evaluator_model}"] = score
        experiment[f"eval_feedback_{evaluator_model}"] = feedback

        with open(answer_path, "w") as f:
            json.dump(answers, f)
    
    print("Evaluation complete. Scores saved to", answer_path)
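
Splitting on `[RESULT]` works when the evaluator follows the format exactly, but it stores the score as a string and fails to unpack if the marker appears more than once. One possible alternative is sketched below; `split_feedback_and_score` is a hypothetical helper, and returning `None` for a missing score (instead of a default value) is a design assumption, not the behaviour of the function above.

```python
import re
from typing import Optional

RESULT_PATTERN = re.compile(r"\[RESULT\]\s*(\d)")


def split_feedback_and_score(response_text: str) -> tuple[str, Optional[int]]:
    """Return (feedback, score); score is None when no valid 1-5 score is found."""
    match = RESULT_PATTERN.search(response_text)
    if match is None:
        return response_text.strip(), None
    score = int(match.group(1))
    feedback = response_text[: match.start()].strip()
    if feedback.startswith("Feedback:"):
        feedback = feedback[len("Feedback:"):].strip()
    return feedback, (score if 1 <= score <= 5 else None)


print(split_feedback_and_score("Feedback: Matches the reference. [RESULT] 5"))
print(split_feedback_and_score("The response is partially correct."))
```

Keeping the score as an integer (or `None`) avoids mixing strings and numbers in the result file, which simplifies the normalization step later on.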


#### Set the path

In [None]:
cwd = os.getcwd()
print("current directory: ", cwd)
os.chdir(qa_list_path)
cwd = os.getcwd()
print("is moved to: ", cwd)

In [None]:
# suppress the warning during the development phase:
import warnings
from urllib3.exceptions import InsecureRequestWarning

# Suppress the InsecureRequestWarning
warnings.simplefilter('ignore', InsecureRequestWarning)

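Let’s run the tests and evaluate the answers:<br>
The loop below iterates over chunk sizes, embedding models, and reranker top-N values. Note that only the top-N value is passed to the RAG system; the chunk size and embedding model are used here solely to label the output files, so they should match the configuration of the RAG system under test.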

In [None]:
if not os.path.exists("./eval_output"):
    os.mkdir("./eval_output")
eval_dataset = qa_dataset

# Define test configurations
chunk_sizes = [200]
embeddings_list = ["text-embedding-ada-002"]
# rerank_options = [False] # If the re-ranking is implemented in the RAG-system, this option can be activated.
rerank_topn = [4, 6]
total_iterations = len(chunk_sizes) * len(embeddings_list) * len(rerank_topn)
progress_bar = tqdm(total=total_iterations, desc="Processing Configurations", unit="iteration")

for chunk_size in chunk_sizes:  # Add other chunk sizes as needed
    
    for embeddings in embeddings_list:  # Add other embeddings as needed
        
        for rerank in rerank_topn:
            
            # name the output file and the setting
            settings_name = f"chunk_{chunk_size}_embeddings_{embeddings}_rerank-top-n_{rerank}"
            #settings_name = f"chunk_{chunk_size}_embeddings_{embeddings.replace('/', '~')}_rerank_{rerank}_reader-model_AzureOpenAI"
            output_file_name = f"./eval_output/rag_{settings_name}.json"
            
            # set the parameters which are to be optimized
            settings = {"rerank_topn": rerank}
            
            print(f"\n \n Running RAG test & evaluation for {settings_name}:")
            
            # run rag test
            print("\n 1- Running test...")
            run_rag_tests(
                eval_dataset=eval_dataset,
                output_file=output_file_name,
                verbose=True,
                test_settings=settings,
            )
            
            # run rag evaluation
            print("\n 2- Running evaluation...")
            evaluate_answers(
                output_file_name,
                openai_client,
                client_eval_model,
                evaluation_prompt_template,
            )
            
            progress_bar.update(1)
            
progress_bar.close()
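
As more parameters are added, the nested loops can become hard to read. One option is to enumerate the configurations with `itertools.product`; the sketch below only builds the settings and file names used above and does not call the RAG system. The `configs` structure is a hypothetical convenience, not part of the original code.

```python
from itertools import product

chunk_sizes = [200]
embeddings_list = ["text-embedding-ada-002"]
rerank_topn = [4, 6]

# One dictionary per combination of the three parameter lists
configs = [
    {
        "settings_name": f"chunk_{c}_embeddings_{e}_rerank-top-n_{r}",
        "settings": {"rerank_topn": r},
    }
    for c, e, r in product(chunk_sizes, embeddings_list, rerank_topn)
]

for cfg in configs:
    print(f"./eval_output/rag_{cfg['settings_name']}.json", cfg["settings"])
```

The number of configurations is the product of the list lengths, which matches `total_iterations` in the cell above.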

#### Inspect results

In [None]:
import glob

outputs = []
for file in glob.glob("./eval_output/*.json"):
    with open(file, "r") as f:  # close each result file after reading
        output = pd.DataFrame(json.load(f))
    output["settings"] = file
    outputs.append(output)
result = pd.concat(outputs)

#### Normalize the evaluation scores

In [None]:
# Convert the stored scores to integers:
# - strings such as "4" are parsed,
# - numeric values (e.g., the default of 3 set in evaluate_answers) are kept,
# - anything missing or unparsable falls back to a default value of 1.
def to_int_score(x, default=1):
    try:
        return int(float(x))
    except (TypeError, ValueError):
        return default

eval_score_name = "eval_score_" + str(client_eval_model)
result[eval_score_name] = result[eval_score_name].apply(to_int_score)

# Normalize the evaluation scores between [0, 1]
result[eval_score_name] = (result[eval_score_name] - 1) / 4

#### Averaging the score over the QAs

In [None]:
# Group by "test_settings" and calculate the mean of the evaluation score column
avg_scores_df = result.groupby("test_settings", as_index=False)[eval_score_name].mean()

# Rename the columns
avg_scores_df.columns = ["test_settings", "average_scores"]

# Display the new dataframe
print("Average scores for the generated dataset for different configurations: ")
print(avg_scores_df)
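
The normalization and averaging steps boil down to a linear rescaling from the 1-5 rubric to [0, 1], followed by a per-configuration mean. Here is a pandas-free sketch of the same arithmetic; the scores in `rows` are made-up values for illustration, not results from this notebook, and `normalize` is a hypothetical helper.

```python
from collections import defaultdict


def normalize(score: int, low: int = 1, high: int = 5) -> float:
    """Map a score from [low, high] linearly onto [0, 1]."""
    return (score - low) / (high - low)


# Hypothetical judge scores per configuration (illustrative values only)
rows = [
    ("{'rerank_topn': 4}", 5),
    ("{'rerank_topn': 4}", 3),
    ("{'rerank_topn': 6}", 4),
    ("{'rerank_topn': 6}", 2),
]

grouped = defaultdict(list)
for settings, score in rows:
    grouped[settings].append(normalize(score))

averages = {settings: sum(vals) / len(vals) for settings, vals in grouped.items()}
print(averages)
```

With this mapping, a score of 1 becomes 0.0 and a score of 5 becomes 1.0, so the averages can be read as a fraction of the best possible result.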

#### Performance comparison via visualization

In [None]:
import plotly.express as px

scores = avg_scores_df  # Make sure this is the correct DataFrame

fig = px.bar(
    scores,
    x="test_settings",  # X-axis: test settings
    y="average_scores",  # Y-axis: average normalized scores
    color="average_scores",  # Color the bars by the average score
    labels={
        "average_scores": "Faithfulness",
        "test_settings": "Configuration",
    },
    color_continuous_scale="bluered",
)

fig.update_layout(
    width=1000,
    height=600,
    barmode="group",
    yaxis_range=[0, 1],
    title="<b>Faithfulness of different RAG configurations (normalized)</b>",
    xaxis_title="RAG settings",
    font=dict(size=15),
)

fig.update_coloraxes(showscale=False)
fig.update_traces(texttemplate="%{y:.2f}", textposition="outside")
fig.show()

#### Convert this notebook to python code (Optional)

In [None]:
# Browse to the directory where you want to save the file.
os.chdir(project_path + "\\scripts\\bitbucket_repository\\rag-evaluation\\")
cwd = os.getcwd()
print("current directory: ", cwd)

In [None]:
!jupyter nbconvert --to script RAG_Evaluation_LLM_as_a_Judge.ipynb