# Evaluation of the RAG model

One way to think about the different types of RAG evaluators is as a pair: what is being evaluated × what it is being evaluated against:

1. Correctness: Response vs reference answer
- Goal: Measure "how similar/correct is the RAG chain answer, relative to a ground-truth answer"
- Mode: Requires a ground truth (reference) answer supplied through a dataset
- Evaluator: Use LLM-as-judge to assess answer correctness.

2. Relevance: Response vs input
- Goal: Measure "how well does the generated response address the initial user input"
- Mode: Does not require reference answer, because it will compare the answer to the input question
- Evaluator: Use LLM-as-judge to assess answer relevance, helpfulness, etc.

3. Groundedness: Response vs retrieved docs
- Goal: Measure "to what extent does the generated response agree with the retrieved context"
- Mode: Does not require reference answer, because it will compare the answer to the retrieved context
- Evaluator: Use LLM-as-judge to assess faithfulness, hallucinations, etc.

4. Retrieval relevance: Retrieved docs vs input
- Goal: Measure "how relevant are my retrieved results for this query"
- Mode: Does not require reference answer, because it will compare the question to the retrieved context
- Evaluator: Use LLM-as-judge to assess relevance
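
The four evaluator types above can be summarized as a small lookup table. The sketch below is purely illustrative: the names `EvaluatorSpec`, `EVALUATORS`, and `needs_reference` are assumptions introduced here and are not used elsewhere in the notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluatorSpec:
    """One evaluator: what is judged and what it is judged against."""

    name: str
    evaluated: str
    against: str
    needs_reference: bool  # True only when a ground-truth answer is required


EVALUATORS = [
    EvaluatorSpec("correctness", "response", "reference answer", True),
    EvaluatorSpec("relevance", "response", "input", False),
    EvaluatorSpec("groundedness", "response", "retrieved docs", False),
    EvaluatorSpec("retrieval relevance", "retrieved docs", "input", False),
]

# Evaluators that can run without a labeled dataset
reference_free = [e.name for e in EVALUATORS if not e.needs_reference]
print(reference_free)
```

Only correctness depends on a reference answer, which is why a question-answer dataset is generated later in the notebook.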

## Libraries

In [2]:
import json
from typing_extensions import Annotated, TypedDict
import os
import pickle as pkl
import helper_functions as hf
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
import glob
import re
from tqdm import tqdm
import pandas as pd
import random
from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.schema import SystemMessage

## Setting

In [18]:
with open("api_google.txt") as f:
    api_key = json.load(f)

In [19]:
if not os.environ.get("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = api_key["key"]

In [None]:
from langchain.chat_models import init_chat_model

llm = init_chat_model("gemini-2.0-flash", model_provider="google_genai", temperature=0.1)

In [None]:
with open("generated_questions_final_gemini_2.pkl", "rb") as f:
    eval_datasets = pkl.load(f)

## Load data

In [4]:
with open("info_articles_main.pkl", "rb") as f:
    info_articles_main = pkl.load(f)
with open("info_articles_ref_final.pkl", "rb") as f:
    info_articles_ref = pkl.load(f)

info_articles_final = info_articles_main + info_articles_ref

In [3]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,  # Chosen empirically, based on the performance of the model.
    chunk_overlap=150,  # 10% of chunk_size
    separators=["\n\n", "\n", ".", "!", "?", " "],  # Tried in order: paragraphs, lines, sentences, words
)
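
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window chunker. It is only a sketch under stated assumptions: `sliding_chunks` is a hypothetical helper, and it ignores separators entirely, so it is not how `RecursiveCharacterTextSplitter` works internally. It only shows how consecutive chunks share `chunk_overlap` characters.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # How far each window advances
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start : start + chunk_size] for start in range(0, last_start, step)]


# Toy document of 3500 characters
text = "".join(str(i % 10) for i in range(3500))
chunks = sliding_chunks(text, chunk_size=1500, chunk_overlap=150)

print([len(c) for c in chunks])
# The tail of one chunk is repeated at the head of the next one
print(chunks[0][-150:] == chunks[1][:150])
```

The overlap means a sentence that straddles a chunk boundary is still likely to appear whole in at least one chunk.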

In [5]:
info_splitted = hf.chunk_makers(info_articles_final, splitter=splitter)

## Evaluation

### Agents for questions

In [None]:
def call_llm(llm, prompt):
    response = llm.invoke(prompt)
    return response.content

In [None]:
QA_generation_prompt = """
You are given a piece of scientific text (context).
Your task is to generate ONE question and ONE answer from it.

Guidelines for the question:
- It must be factual and answerable using the context only.
- Phrase it naturally, as if a researcher typed it into a search engine.
- Do NOT mention "context", "passage", or "according to the text".
- The question should be specific and concise.

Guidelines for the answer:
- The answer must be a short, factual statement directly supported by the context.
- Do not add explanations, speculation, or references to the text.

Formatting rules (strict):
Output:::
Question: <your question here>
Answer: <your answer here>

Now here is the context:

{context}

Output:::
"""

In [6]:
info_splitted_evaluation = [
    d
    for d in info_splitted
    if d["parent"]
    in ["Abstract", "Introduction", "Results", "Conclusion", "Discussion", "Methods"]
]

In [None]:
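def get_context(piece_of_paper, all_papers):
    """Rebuild the full section text that a chunk belongs to."""
    info = ""  # Avoids an unbound variable if chunk 0 is missing or out of order
    for d in all_papers:
        if (
            d["Reference"] == piece_of_paper["Reference"]
            and d["parent"] == piece_of_paper["parent"]
        ):
            if d["chunk_index"] == 0:
                info = d["content"]
            else:
                # Keep only the text after the last ":\n " separator
                info += d["content"].split(":\n ")[-1]

    return info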

We sample 600 chunks and generate one question-answer pair from each, using the full section the chunk belongs to as context. These pairs are later used to evaluate the performance of the RAG pipeline.

In [None]:
N = 600
examples = []
for sample in tqdm(random.sample(info_splitted_evaluation, N), total=N):
    context = get_context(piece_of_paper=sample, all_papers=info_splitted_evaluation)
    response = call_llm(llm=llm, prompt=QA_generation_prompt.format(context=context))

    # str.split never raises here, so check the expected markers explicitly
    if "Question:" not in response or "Answer:" not in response:
        continue
    question = response.split("Question:")[-1].split("Answer:")[0].strip()
    answer = response.split("Answer:")[-1].strip()
    examples.append({"context": context, "question": question, "answer": answer})

with open("examples_evaluation_gemini_2.pkl", "wb") as f:
    pkl.dump(examples, f)

100%|██████████| 600/600 [48:13<00:00,  4.82s/it]
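
The `Question:` / `Answer:` output format requested in the prompt can also be parsed with a regular expression, which makes malformed responses easy to detect. This is a minimal sketch, not the notebook's own parser; `QA_PATTERN` and `parse_qa` are hypothetical names.

```python
import re

# Lazy match for the question, greedy match for everything after "Answer:"
QA_PATTERN = re.compile(
    r"Question:\s*(?P<question>.+?)\s*Answer:\s*(?P<answer>.+)", re.DOTALL
)


def parse_qa(response: str):
    """Return (question, answer), or None if the response is malformed."""
    match = QA_PATTERN.search(response)
    if match is None:
        return None
    return match.group("question").strip(), match.group("answer").strip()


good = "Output:::\nQuestion: What is X?\nAnswer: X is Y.\n"
bad = "I could not generate a question from this text."

print(parse_qa(good))
print(parse_qa(bad))
```

Returning `None` lets the caller skip a bad response explicitly instead of storing an empty or garbled question.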


## Evaluation of generated questions

In [1]:
question_groundedness_critique_prompt = """
You will be given a context and a question.
Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.
Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here are the question and context.

Question: {question}\n
Context: {context}\n
Answer::: """

question_relevance_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how useful this question can be to researchers in the reproductive medicine field.
Give your answer on a scale of 1 to 5, where 1 means that the question is not useful at all, and 5 means that the question is extremely useful.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """

question_standalone_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how context-independent this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independent from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """

### Evaluating the generated questions

In [None]:
print("Generating critique for each QA couple...")
for output in tqdm(examples, total=len(examples)):
    evaluations = {
        "groundedness": call_llm(
            llm,
            question_groundedness_critique_prompt.format(
                context=output["context"], question=output["question"]
            ),
        ),
        "relevance": call_llm(
            llm,
            question_relevance_critique_prompt.format(question=output["question"]),
        ),
        "standalone": call_llm(
            llm,
            question_standalone_critique_prompt.format(question=output["question"]),
        ),
    }
    for criterion, evaluation in evaluations.items():
        try:
            score = int(evaluation.split("Total rating: ")[-1].strip())
            rationale = evaluation.split("Total rating: ")[-2].split("Evaluation: ")[1]
        except (ValueError, IndexError):
            # Skip only the criterion whose output could not be parsed
            continue
        output.update(
            {
                f"{criterion}_score": score,
                f"{criterion}_eval": rationale,
            }
        )

Generating critique for each QA couple...


100%|██████████| 600/600 [56:53<00:00,  5.69s/it]  
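
The critique prompts ask for an `Evaluation:` rationale followed by a `Total rating:` between 1 and 5. A regular expression can extract both fields and reject anything outside that format. The sketch below is an illustration only; `RATING_PATTERN` and `parse_critique` are hypothetical names, not part of the notebook.

```python
import re

RATING_PATTERN = re.compile(
    r"Evaluation:\s*(?P<rationale>.*?)\s*Total rating:\s*(?P<score>[1-5])",
    re.DOTALL,
)


def parse_critique(text: str):
    """Return (score, rationale), or None if the judge ignored the format."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    return int(match.group("score")), match.group("rationale")


reply = "Evaluation: The context states the value explicitly.\nTotal rating: 5"
print(parse_critique(reply))
print(parse_critique("Total rating: five"))
```

Unparsable critiques come back as `None`, so they can be counted or retried rather than silently dropped.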


In [None]:
generated_questions = pd.DataFrame.from_dict(examples)
generated_questions.loc[
    :,
    [
        "question",
        "context",
        "answer",
        "groundedness_score",
        "relevance_score",
        "standalone_score",
    ],
]

# Saving the questions with their answers and their scores

with open("generated_questions_gemini_2.pkl", "wb") as f:
    pkl.dump(generated_questions, f)

Each question was graded from 1 to 5 on groundedness, relevance, and standalone clarity. Only questions with a score of at least 4 on all three criteria are kept.

In [None]:
generated_questions_final = generated_questions.loc[
    (generated_questions["groundedness_score"] >= 4)
    & (generated_questions["relevance_score"] >= 4)
    & (generated_questions["standalone_score"] >= 4)
]

with open("generated_questions_final_gemini_2.pkl", "wb") as f:
    pkl.dump(generated_questions_final, f)

In [4]:
generated_questions_final.loc[
    :,
    ["question", "answer", "groundedness_score", "relevance_score", "standalone_score"],
]

| | question | answer | groundedness_score | relevance_score | standalone_score |
|---|---|---|---|---|---|
| 6 | How does the FGB rs1800790A allele affect fibr... | In F13A 34Val/Val wildtypes, carriage of the F... | 5.0 | 5.0 | 5.0 |
| 20 | For which patient group might the ERA test be ... | The ERA test may be helpful for women with sus... | 5.0 | 5.0 | 5.0 |
| 25 | What kind of values does the Color Pathway too... | The Color Pathway tool accepts numerical values. | 5.0 | 4.0 | 5.0 |
| 32 | What is the implantation potential of an euplo... | Once an euploid blastocyst is identified, its ... | 5.0 | 5.0 | 5.0 |
| 38 | What does the PRISMA 2020 statement reflect? | The PRISMA 2020 statement reflects advances in... | 5.0 | 5.0 | 5.0 |
| ... | ... | ... | ... | ... | ... |
| 566 | Which genes share genetic susceptibility for A... | The ESR1, HK3, and BRSK1 genes share genetic s... | 5.0 | 5.0 | 5.0 |
| 573 | What database were the GSE26787 and GSE63901 d... | The Gene Expression Omnibus (GEO) database. | 5.0 | 4.0 | 5.0 |
| 576 | What percentage of women globally are affected... | 3.7% of women globally. | 5.0 | 5.0 | 5.0 |
| 589 | What is the purpose of unique molecular identi... | Unique molecular identifiers are applied to ov... | 5.0 | 5.0 | 5.0 |


In [6]:
# Convert the DataFrame into a list of dicts, one per question
eval_dataset = generated_questions_final.to_dict("records")
f"We have a total of {len(eval_dataset)} questions"

'We have a total of 70 questions'

### Evaluate RAG performance using grading for correctness

#### Prompts for the LLM

In [None]:
RAG_PROMPT_TEMPLATE = """
<|system|>
Using the information contained in the context,
give a comprehensive answer to the question.
Respond only to the question asked, response should be concise and relevant to the question.
Provide the number of the source document when relevant.
If the answer cannot be deduced from the context, do not give an answer.</s>
<|user|>
Context:
{context}
---
Now here is the question you need to answer.

Question: {question}
</s>
<|assistant|>
"""

In [None]:
EVALUATION_PROMPT = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format MUST look as follows: \"Feedback: {{write a feedback for criteria}} [RESULT] {{an integer number between 1 and 5}}\"
4. Please do not generate any other opening, closing, and explanations. Be sure to include [RESULT] in your output, it is required.

###The instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Reference Answer (Score 5):
{reference_answer}

###Score Rubrics:
[Is the response correct, accurate, and factual based on the reference answer?]
Score 1: The response is completely incorrect, inaccurate, and/or not factual.
Score 2: The response is mostly incorrect, inaccurate, and/or not factual.
Score 3: The response is somewhat correct, accurate, and/or factual.
Score 4: The response is mostly correct, accurate, and factual.
Score 5: The response is completely correct, accurate, and factual.

###Feedback:"""

evaluation_prompt_template = ChatPromptTemplate.from_messages(
    [
        SystemMessage(content="You are a fair evaluator language model."),
        HumanMessagePromptTemplate.from_template(EVALUATION_PROMPT),
    ]
)

### Answering question and getting a grade from 1 to 5 for correctness

In [None]:
if not os.path.exists("./output"):
    os.mkdir("./output")

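For each configuration (chunk size, embedding model, recursive chunking on or off, and number of retrieved documents `k`), the RAG pipeline answers every question in the evaluation set, and an LLM judge grades each answer for correctness on a scale from 1 to 5.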
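
Before launching the sweep, it can help to count how many configurations it covers. The sketch below rebuilds the same grid with `itertools.product`, using the values that appear in the nested loops of the next cell; the variable names are introduced here for illustration.

```python
from itertools import product

chunk_sizes = [1000, 1500, 2000]
reader_models = ["gemini-2.0-flash"]
embedding_models = [
    "avsolatorio/GIST-small-Embedding-v0",
    "Snowflake/snowflake-arctic-embed-s",
    "intfloat/e5-small-v2",
]
recursive_options = [False, True]
k_values = range(4, 11)  # k = 4, 5, ..., 10

grid = list(
    product(chunk_sizes, reader_models, embedding_models, recursive_options, k_values)
)
print(len(grid))  # 3 * 1 * 3 * 2 * 7 = 126
```

Each configuration answers and grades the full question set, so total runtime scales with this count.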

In [None]:
for chunk_size in [1000, 1500, 2000]:  # Add other chunk sizes (in tokens) as needed
    for llm_model in ["gemini-2.0-flash"]:
        llm_reader = init_chat_model(
            llm_model,
            model_provider="google_genai",
            temperature=0.5,
            max_output_tokens=1024,
        )

        for embedding_model in [
            "avsolatorio/GIST-small-Embedding-v0",
            "Snowflake/snowflake-arctic-embed-s",
            "intfloat/e5-small-v2",
        ]:
            name_model = embedding_model.split("/")[1]
            embedding_function = HuggingFaceEmbeddings(
                model_name=embedding_model, model_kwargs={"device": "cuda"}
            )

            for recursive_chunk in [False, True]:
                for k in range(4, 11):
                    settings_name = f"chunk_{chunk_size}_reader-model_{llm_model}_emnedding_model_{name_model}recursive_{recursive_chunk}_k_{k}"
                    output_file_name = f"./output/rag_{settings_name}.json"

                    if os.path.exists(output_file_name):
                        with open(output_file_name) as f:
                            output = pd.DataFrame(json.load(f))
                        if f"eval_score_{llm_model}" in output.columns:
                            print("The file already exists")
                            continue

                    print(f"Running evaluation for {settings_name}:")

                    print("Loading knowledge base embeddings...")

                    db = hf.load_embeddings(
                        info_articles_final,
                        chunk_size=chunk_size,
                        embedding_model=embedding_function,
                    )

                    print("Running RAG...")
                    hf.run_rag_tests(
                        eval_dataset=eval_dataset,
                        llm=llm_reader,
                        database=db,
                        template=RAG_PROMPT_TEMPLATE,
                        output_file=output_file_name,
                        verbose=False,
                        test_settings=settings_name,
                        recursive_chunk=recursive_chunk,
                        num_docs_final=k,
                    )

                    print("Running evaluation...")
                    hf.evaluate_answers(
                        output_file_name,
                        llm_reader,
                        llm_model,
                        evaluation_prompt_template,
                    )
                    print("Removing database")
                    db.delete_collection()  # We remove the collection in each iteration

Running evaluation...


100%|██████████| 63/63 [00:41<00:00,  1.51it/s]


Removing database
Running evaluation for chunk_2000_reader-model_gemini-2.0-flash_emnedding_model_GIST-small-Embedding-v0recursive_True_k_5:
Loading knowledge base embeddings...
Running RAG...


100%|██████████| 70/70 [00:40<00:00,  1.71it/s]


Running evaluation...


100%|██████████| 63/63 [00:39<00:00,  1.58it/s]


Removing database
[Output truncated: the same log sequence repeats for every remaining chunk_2000 configuration: GIST-small-Embedding-v0 with recursive_True and k = 6 to 10, then snowflake-arctic-embed-s and e5-small-v2 with recursive_False and recursive_True and k = 4 to 10. Each run answers 70 questions and grades 63 answers.]


## Evaluating the models on four metrics without grading

We evaluate the RAG answers again, this time with four binary (True/False) metrics: correctness, groundedness, relevance, and retrieval relevance.

- Correctness

In [38]:
# Grade output schema
class CorrectnessGrade(TypedDict):
    # Note that the order in which the fields are defined is the order in which the model will generate them.
    # It is useful to put explanations before responses because it forces the model to think through
    # its final response before generating it:
    explanation: Annotated[str, ..., "Explain your reasoning for the score"]
    correct: Annotated[bool, ..., "True if the answer is correct, False otherwise."]


# Grade prompt
correctness_instructions = """You are a senior researcher in the reproductive field and bioinformatics. 

You will be given a QUESTION, the GROUND TRUTH (correct) ANSWER, and the PhD Student answer. 

Here is the grade criteria to follow:
(1) Grade the student answers based ONLY on their factual accuracy relative to the ground truth answer. 
(2) Ensure that the student answer does not contain any conflicting statements.
(3) It is OK if the student answer contains more information than the ground truth answer, as long as it is factually accurate relative to the  ground truth answer.

Correctness:
A correctness value of True means that the PhD student's answer meets all of the criteria.
A correctness value of False means that the PhD student's answer does not meet all of the criteria.

Explain your reasoning in a step-by-step manner to ensure your reasoning and conclusion are correct. 

Avoid simply stating the correct answer at the outset."""

# Grader LLM

grader_llm = llm.with_structured_output(
    CorrectnessGrade, method="json_schema", strict=True
)
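
The grader defined above is presumably called by `hf.correctness`, whose body is not shown in this notebook. The sketch below shows one plausible shape for such a function. Everything in it is an assumption: the input keys `question`, `true_answer`, and `generated_answer` are hypothetical, and `StubGrader` is a stand-in that mimics the dict returned by a structured-output model so the example runs without an API key.

```python
class StubGrader:
    """Stand-in for a structured-output LLM (hypothetical, for illustration)."""

    def invoke(self, messages):
        user = messages[-1]["content"]
        truth = user.split("GROUND TRUTH ANSWER: ")[1].split("\n")[0]
        student = user.split("STUDENT ANSWER: ")[1]
        # Naive rule: correct if the reference text appears in the answer
        correct = truth.lower() in student.lower()
        return {"explanation": "substring check", "correct": correct}


def correctness(inputs: dict, grader, instructions: str) -> bool:
    """Ask the grader whether the generated answer matches the reference."""
    user_content = (
        f"QUESTION: {inputs['question']}\n"
        f"GROUND TRUTH ANSWER: {inputs['true_answer']}\n"
        f"STUDENT ANSWER: {inputs['generated_answer']}"
    )
    grade = grader.invoke(
        [
            {"role": "system", "content": instructions},
            {"role": "user", "content": user_content},
        ]
    )
    return grade["correct"]


example = {
    "question": "What percentage of women are affected?",
    "true_answer": "3.7%",
    "generated_answer": "About 3.7% of women are affected.",
}
print(correctness(example, StubGrader(), "You are a grader."))
```

With a real model, `StubGrader()` would be replaced by the structured-output grader, which returns a dict with the same `explanation` and `correct` keys.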

- Relevance

In [39]:
# Grade output schema
class RelevanceGrade(TypedDict):
    explanation: Annotated[str, ..., "Explain your reasoning for the score"]
    relevant: Annotated[
        bool, ..., "Provide the score on whether the answer addresses the question"
    ]


# Grade prompt
relevance_instructions = """You are a senior researcher in the reproductive field and bioinformatics. 

You will be given a QUESTION and a STUDENT ANSWER. 

Here is the grade criteria to follow:
(1) Ensure the PhD STUDENT ANSWER is concise and relevant to the QUESTION
(2) Ensure the PhD STUDENT ANSWER helps to answer the QUESTION

Relevance:
A relevance value of True means that the student's answer meets all of the criteria.
A relevance value of False means that the student's answer does not meet all of the criteria.

Explain your reasoning in a step-by-step manner to ensure your reasoning and conclusion are correct. 

Avoid simply stating the correct answer at the outset."""

relevance_llm = llm.with_structured_output(
    RelevanceGrade, method="json_schema", strict=True
)

- Retrieval Relevance

In [40]:
# Grade output schema
class RetrievalRelevanceGrade(TypedDict):
    explanation: Annotated[str, ..., "Explain your reasoning for the score"]
    relevant: Annotated[
        bool,
        ...,
        "True if the retrieved documents are relevant to the question, False otherwise",
    ]


# Grade prompt
retrieval_relevance_instructions = """You are a senior researcher in the reproductive field and bioinformatics grading PhD students questions. 

You will be given a QUESTION and a set of FACTS provided by the PhD student. 

Here is the grade criteria to follow:
(1) Your goal is to identify FACTS that are completely unrelated to the QUESTION
(2) If the facts contain ANY keywords or semantic meaning related to the question, consider them relevant
(3) It is OK if the facts have SOME information that is unrelated to the question as long as (2) is met

Relevance:
A relevance value of True means that the FACTS contain ANY keywords or semantic meaning related to the QUESTION and are therefore relevant.
A relevance value of False means that the FACTS are completely unrelated to the QUESTION.

Explain your reasoning in a step-by-step manner to ensure your reasoning and conclusion are correct. 

Avoid simply stating the correct answer at the outset."""

# Grader LLM
retrieval_llm = llm.with_structured_output(
    RetrievalRelevanceGrade, method="json_schema", strict=True
)

- Groundedness

In [24]:
# Grade output schema
class GroundedGrade(TypedDict):
    explanation: Annotated[str, ..., "Explain your reasoning for the score"]
    grounded: Annotated[
        bool, ..., "True if the answer is grounded in the documents (no hallucination), False otherwise"
    ]


# Grade prompt
grounded_instructions = """You are a senior researcher in the reproductive field and bioinformatics grading PhD students questions. 

You will be given FACTS and a PhD STUDENT ANSWER. 

Here is the grade criteria to follow:
(1) Ensure the PhD STUDENT ANSWER is grounded in the FACTS. 
(2) Ensure the PhD STUDENT ANSWER does not contain "hallucinated" information outside the scope of the FACTS.

Grounded:
A grounded value of True means that the student's answer meets all of the criteria.
A grounded value of False means that the student's answer does not meet all of the criteria.

Explain your reasoning in a step-by-step manner to ensure your reasoning and conclusion are correct. 

Avoid simply stating the correct answer at the outset."""

# Grader LLM
grounded_llm = llm.with_structured_output(
    GroundedGrade, method="json_schema", strict=True
)
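
The helpers in `helper_functions` are not shown in this notebook, so the following is only a minimal sketch of how a grader such as `grounded_llm` might be called. The function name `groundedness_sketch`, the record keys `"context"` and `"answer"`, and the `FakeStructuredLLM` stub are assumptions made for illustration. The stub stands in for the real model so that the example runs offline; with a `TypedDict` schema, the structured-output model is expected to return a plain dictionary with the schema's keys.

```python
from typing import Any, Dict, List


class FakeStructuredLLM:
    """Hypothetical stand-in for `llm.with_structured_output(GroundedGrade, ...)`."""

    def invoke(self, messages: List[Dict[str, str]]) -> Dict[str, Any]:
        # A real grader would send `messages` to the model. This stub only
        # checks the message order and returns a canned grade.
        assert messages[0]["role"] == "system"
        assert messages[1]["role"] == "user"
        return {"explanation": "All claims appear in the FACTS.", "grounded": True}


def groundedness_sketch(
    inputs: Dict[str, Any], grounded_llm: Any, grounded_instructions: str
) -> bool:
    """Grade one answer against its retrieved context (illustrative only)."""
    # Assumed record layout: "context" is a list of chunk texts,
    # "answer" is the generated response.
    facts = "\n\n".join(inputs["context"])
    user_content = f"FACTS: {facts}\nPhD STUDENT ANSWER: {inputs['answer']}"
    grade = grounded_llm.invoke(
        [
            {"role": "system", "content": grounded_instructions},
            {"role": "user", "content": user_content},
        ]
    )
    return grade["grounded"]


record = {
    "context": ["Chunk one of the retrieved text.", "Chunk two of the retrieved text."],
    "answer": "An answer that only restates the retrieved text.",
}
is_grounded = groundedness_sketch(record, FakeStructuredLLM(), "Grade the answer.")
print(is_grounded)
```

Returning only the boolean keeps the per-question result easy to store in a pandas column; keeping the `explanation` as well would help when auditing individual grades.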

In [10]:
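outputs = {}
for file in glob.glob("./output/*.json"):
    # Use a context manager so each file handle is closed promptly
    with open(file, "r") as f:
        output = json.load(f)
    # Key each run by its file name without the "./output/rag_" prefix
    outputs[re.sub(r"^\./output/rag_", "", file)] = output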
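
Each key of `outputs` encodes the settings of one run, for example `chunk_1000_reader-model_gemini-2.0-flash_emnedding_model_e5-small-v2recursive_True_k_8.json`. As a hedged sketch, the key can be split back into its parameters with a regular expression. The function `parse_settings` and the field names it returns are hypothetical, and the pattern assumes that every key follows the naming seen in the progress output, including the `emnedding` spelling.

```python
import re
from typing import Dict, Union

# Pattern inferred from the file names printed by the evaluation loop.
SETTINGS_PATTERN = re.compile(
    r"^chunk_(?P<chunk_size>\d+)"
    r"_reader-model_(?P<reader_model>.+?)"
    r"_emnedding_model_(?P<embedding_model>.+?)"  # spelling as in the file names
    r"recursive_(?P<recursive>True|False)"
    r"_k_(?P<k>\d+)\.json$"
)


def parse_settings(key: str) -> Dict[str, Union[str, int, bool]]:
    """Split a settings key into its parameters (illustrative helper)."""
    match = SETTINGS_PATTERN.match(key)
    if match is None:
        raise ValueError(f"Unrecognized settings key: {key}")
    fields = match.groupdict()
    return {
        "chunk_size": int(fields["chunk_size"]),
        "reader_model": fields["reader_model"],
        "embedding_model": fields["embedding_model"],
        "recursive": fields["recursive"] == "True",
        "k": int(fields["k"]),
    }


example_key = (
    "chunk_1000_reader-model_gemini-2.0-flash"
    "_emnedding_model_e5-small-v2recursive_True_k_8.json"
)
parsed = parse_settings(example_key)
print(parsed)
```

Having the parameters as separate fields makes it possible to group the evaluation results by chunk size, embedding model, or `k` later on.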

We load the pickle from a previous run, if one exists, so that an interrupted evaluation loop can resume where it stopped.

In [None]:
try:
    with open("./evaluation/evaluation.pkl", "rb") as f:
        list_dataframes = pkl.load(f)
except FileNotFoundError:
    # No previous run: start with an empty list of results
    print("File not found")
    list_dataframes = []

File not found


In [None]:
# Read the settings already evaluated in a previous (possibly interrupted) run
if os.path.exists("./evaluation/evaluation_log.txt"):
    with open("./evaluation/evaluation_log.txt", "r") as f:
        lines = [line.strip() for line in f]
else:
    lines = []

for settings, evaluation in tqdm(outputs.items(), total=len(outputs)):
    if settings in lines:
        print(f"Settings: {settings} already analyzed")
        continue

    print(f"Analyzing {settings}")
    aux_correct = []
    aux_relevant = []
    aux_retrival = []
    aux_ground = []

    for mydic in evaluation:
        aux_correct.append(
            hf.correctness(
                inputs=mydic,
                grader_llm=grader_llm,
                correctness_instructions=correctness_instructions,
            )
        )
        aux_relevant.append(
            hf.relevance(
                inputs=mydic,
                relevance_llm=relevance_llm,
                relevance_instructions=relevance_instructions,
            )
        )
        aux_retrival.append(
            hf.retrieval_relevance(
                inputs=mydic,
                retrieval_llm=retrieval_llm,
                retrieval_relevance_instructions=retrieval_relevance_instructions,
            )
        )
        aux_ground.append(
            hf.groundedness(
                inputs=mydic,
                grounded_llm=grounded_llm,
                grounded_instructions=grounded_instructions,
            )
        )

    # Create dataframe
    correctness_results = pd.Series(aux_correct, name="Correctness")
    relevance_results = pd.Series(aux_relevant, name="Relevance")
    retrieval_results = pd.Series(aux_retrival, name="Retrieval")
    ground_results = pd.Series(aux_ground, name="Groundedness")
    list_dataframes.append(
        pd.concat(
            [
                pd.DataFrame.from_dict(evaluation),
                relevance_results,
                correctness_results,
                retrieval_results,
                ground_results,
            ],
            axis=1,
        )
    )

    with open("./evaluation/evaluation.pkl", "wb") as f:
        pkl.dump(list_dataframes, f)

    with open("./evaluation/evaluation_log.txt", "a") as f:
        f.write(f"{settings}\n")

  0%|          | 0/126 [00:00<?, ?it/s]

Analyzing chunk_1000_reader-model_gemini-2.0-flash_emnedding_model_e5-small-v2recursive_True_k_8.json


  1%|          | 1/126 [00:48<1:40:56, 48.45s/it]

...


100%|██████████| 126/126 [1:42:54<00:00, 49.00s/it]
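
The loop above saves the results and appends the finished settings to a log after every iteration, so a rerun skips what is already done. A minimal, self-contained sketch of that checkpoint-and-resume pattern follows. The function `run_with_checkpoints`, its arguments, and the toy tasks are hypothetical and only illustrate the idea with temporary files.

```python
import os
import pickle as pkl
import tempfile
from typing import Any, Callable, Dict, List


def run_with_checkpoints(
    tasks: Dict[str, Any],
    process: Callable[[Any], Any],
    results_path: str,
    log_path: str,
) -> List[Any]:
    """Process each task once, skipping names already recorded in the log."""
    done = set()
    if os.path.exists(log_path):
        with open(log_path, "r") as f:
            done = {line.strip() for line in f}

    results: List[Any] = []
    if os.path.exists(results_path):
        with open(results_path, "rb") as f:
            results = pkl.load(f)

    for name, payload in tasks.items():
        if name in done:
            continue
        results.append(process(payload))
        # Save the results first, then log the task name. A crash between
        # these two writes leaves the result saved but unlogged, so that
        # task would be processed and appended again on the next run.
        with open(results_path, "wb") as f:
            pkl.dump(results, f)
        with open(log_path, "a") as f:
            f.write(f"{name}\n")
    return results


calls: List[int] = []


def tenfold(x: int) -> int:
    calls.append(x)  # record which payloads were actually processed
    return x * 10


with tempfile.TemporaryDirectory() as tmp:
    results_file = os.path.join(tmp, "results.pkl")
    log_file = os.path.join(tmp, "log.txt")
    first = run_with_checkpoints({"a": 1, "b": 2}, tenfold, results_file, log_file)
    # The second run only processes the new task "c".
    second = run_with_checkpoints(
        {"a": 1, "b": 2, "c": 3}, tenfold, results_file, log_file
    )

print(first, second, calls)
```

With roughly 50 seconds per configuration and 126 configurations, being able to resume avoids repeating more than an hour of grading calls after an interruption.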


## Evaluate the benchmark (Without RAG)

In [None]:
with open("./evaluation/evaluation_final.pkl", "rb") as f:
    list_dataframes = pkl.load(f)
with open("./generated_questions_final_gemini_2.pkl", "rb") as f:
    examples = pkl.load(f)

In [6]:
examples["question"].tolist()[0]

'How does the FGB rs1800790A allele affect fibrinogen levels in F13A 34Val/Val wildtypes?'

In [26]:
template = """
<|system|>
Give a comprehensive answer to the question.
Respond only to the question asked, response should be concise and relevant to the question.
Now here is the question you need to answer.

Question: {question}
</s>
<|assistant|>
"""
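
As a small illustration, and assuming the `{question}` placeholder is filled with plain string formatting (how `hf.run_rag_tests` does this is not shown here), the final prompt for the first example question would be built as follows. The template is repeated so that the snippet runs on its own.

```python
template = """
<|system|>
Give a comprehensive answer to the question.
Respond only to the question asked, response should be concise and relevant to the question.
Now here is the question you need to answer.

Question: {question}
</s>
<|assistant|>
"""

question = (
    "How does the FGB rs1800790A allele affect fibrinogen levels "
    "in F13A 34Val/Val wildtypes?"
)

# Fill the single placeholder; no retrieved context is added in the no-RAG setting.
prompt = template.format(question=question)
print(prompt)
```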

We generate an answer for each question without retrieval.

In [None]:
examples_eval = examples.to_dict(orient="records")
llm_reader = init_chat_model(
    "gemini-2.0-flash",
    model_provider="google_genai",
    temperature=0.5,
    max_output_tokens=1024,
)
output_file_name = "./output_bench/benchmark.json"

hf.run_rag_tests(
    eval_dataset=examples_eval,
    llm=llm_reader,
    output_file=output_file_name,
    template=template,
    rag=False,
    verbose=True,
    test_settings="No_RAG",
    recursive_chunk=False,
)

In [None]:
# Load data for benchmark
with open("./output_bench/benchmark.json", "rb") as f:
    outputs = json.load(f)

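We apply only the measures that make sense for the benchmark: correctness and relevance. Groundedness and retrieval relevance require retrieved documents, which the benchmark does not have.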

In [52]:
list_benchmark = []
aux_correct = []
aux_relevant = []

for mydic in outputs:
    aux_correct.append(
        hf.correctness(
            inputs=mydic,
            grader_llm=grader_llm,
            correctness_instructions=correctness_instructions,
        )
    )
    aux_relevant.append(
        hf.relevance(
            inputs=mydic,
            relevance_llm=relevance_llm,
            relevance_instructions=relevance_instructions,
        )
    )

# Create dataframe
correctness_results = pd.Series(aux_correct, name="Correctness")
relevance_results = pd.Series(aux_relevant, name="Relevance")
list_benchmark.append(
    pd.concat(
        [pd.DataFrame.from_dict(outputs), relevance_results, correctness_results],
        axis=1,
    )
)

with open("./output_bench/benchmark.pkl", "wb") as f:
    pkl.dump(list_benchmark, f)
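
Once the grades are stored, a simple way to summarize them is the fraction of questions that pass each metric. The sketch below is written in plain Python over a list of records, such as the output of `DataFrame.to_dict(orient="records")`. It assumes that each grade is a boolean, which depends on what `hf.correctness` and `hf.relevance` return; the function `pass_rates` and the toy records are made up for illustration and are not results from this notebook.

```python
from typing import Any, Dict, List


def pass_rates(records: List[Dict[str, Any]], metrics: List[str]) -> Dict[str, float]:
    """Return the fraction of records graded True for each metric."""
    rates: Dict[str, float] = {}
    for metric in metrics:
        values = [bool(record[metric]) for record in records]
        # Avoid a division by zero when there are no records.
        rates[metric] = sum(values) / len(values) if values else float("nan")
    return rates


# Toy grades, for illustration only.
toy_records = [
    {"Correctness": True, "Relevance": True},
    {"Correctness": False, "Relevance": True},
    {"Correctness": True, "Relevance": True},
    {"Correctness": True, "Relevance": True},
]
rates = pass_rates(toy_records, ["Correctness", "Relevance"])
print(rates)
```

The same summary computed for each RAG configuration would give values directly comparable with this no-RAG baseline.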