# Evaluation of a RAG model

One way to think about the different types of RAG evaluators is as a pairing of what is being evaluated with what it is being evaluated against:

1. Correctness: Response vs reference answer
- Goal: Measure "how similar/correct is the RAG chain answer, relative to a ground-truth answer"
- Mode: Requires a ground truth (reference) answer supplied through a dataset
- Evaluator: Use LLM-as-judge to assess answer correctness.

2. Relevance: Response vs input
- Goal: Measure "how well does the generated response address the initial user input"
- Mode: Does not require reference answer, because it will compare the answer to the input question
- Evaluator: Use LLM-as-judge to assess answer relevance, helpfulness, etc.

3. Groundedness: Response vs retrieved docs
- Goal: Measure "to what extent does the generated response agree with the retrieved context"
- Mode: Does not require reference answer, because it will compare the answer to the retrieved context
- Evaluator: Use LLM-as-judge to assess faithfulness, hallucinations, etc.

4. Retrieval relevance: Retrieved docs vs input
- Goal: Measure "how relevant are my retrieved results for this query"
- Mode: Does not require reference answer, because it will compare the question to the retrieved context
- Evaluator: Use LLM-as-judge to assess relevance.
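
The four pairings above can be written down as a small lookup table, which makes it easy to see which evaluators need a labeled dataset. This is only an illustrative sketch: the `EvaluatorSpec` dataclass and its field names are hypothetical and are not part of LangSmith or LangChain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluatorSpec:
    """Hypothetical record describing one evaluator pairing."""
    name: str
    evaluated: str         # what is being graded
    against: str           # what it is compared with
    needs_reference: bool  # True if a ground-truth answer is required


EVALUATORS = [
    EvaluatorSpec("correctness", "response", "reference answer", True),
    EvaluatorSpec("relevance", "response", "input", False),
    EvaluatorSpec("groundedness", "response", "retrieved docs", False),
    EvaluatorSpec("retrieval relevance", "retrieved docs", "input", False),
]

# Evaluators that can run without a ground-truth answer
reference_free = [e.name for e in EVALUATORS if not e.needs_reference]
print(reference_free)
```

Only the correctness evaluator depends on reference answers, which is why it is the one that requires a dataset of question and answer pairs.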

In [3]:
from langchain_google_genai import GoogleGenerativeAI
import google.generativeai as genai
import json
from pydantic import BaseModel
from typing_extensions import Annotated, TypedDict, Type
from langsmith import Client



In [4]:
with open('api_google.txt') as f:
    # json.load implies the file stores the key as JSON (e.g. a quoted string)
    api_key = json.load(f)

genai.configure(api_key=api_key)

## Setup

In [16]:
import getpass
import os

if not os.environ.get("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter API key for Google Gemini: ")

In [17]:
from langchain.chat_models import init_chat_model
llm = init_chat_model("gemma-3-27b-it", model_provider="google_genai")

## Correctness: Response vs reference answer

In [20]:
# Define the examples for the dataset
examples = [
    {
        "inputs": {"question": "What are the key regulators of endometrial progression and function?"},
        "outputs": {"answer": "According to Parraga-Leo 2023, endometrial progression and function are regulated by steroid hormones, mainly estrogen and progesterone, which orchestrate transcriptional changes essential for implantation and fertility."},
    },
    {
        "inputs": {"question": "What is the role of progesterone and estrogen in endometrial progression?"},
        "outputs": {"answer": "Parraga-Leo 2023 showed that progesterone drives the transition from proliferation to differentiation during the secretory phase, while estrogen promotes endometrial proliferation during the follicular phase. Together, they regulate transcriptomic dynamics crucial for receptivity."},
    },
    {
        "inputs": {"question": "Which transcription factors regulate endometrial progression?"},
        "outputs": {"answer": "As described in Parraga-Leo 2023, key transcription factors include CTCF, GATA6, HOXA10, FOXO1, C/EBPβ, and GATA2, which integrate hormonal signaling and control gene expression underlying endometrial receptivity."},
    },
    {
        "inputs": {"question": "Which microRNAs regulate endometrial progression?"},
        "outputs": {"answer": "Parraga-Leo 2023 identified microRNAs such as hsa-miR-15a-5p, hsa-miR-218-5p, hsa-miR-107, hsa-miR-103a-3p, hsa-miR-128-3p, and members of the let-7 family as key regulators that fine-tune endometrial receptivity and embryo implantation processes."},
    },
    {
        "inputs": {"question": "What are the main findings of the study on endometrial transcriptomic regulation?"},
        "outputs": {"answer": "Parraga-Leo 2023 revealed that endometrial receptivity is driven by a coordinated program of hormone-regulated genes, transcription factors (CTCF, GATA6) and microRNAs. This network ensures proper timing of implantation and highlights potential therapeutic targets for infertility."},
    },
]

In [23]:
client = Client()

# Create the dataset and examples in LangSmith
dataset_name = "ReproRAG Q&A"
dataset = client.create_dataset(dataset_name=dataset_name)
client.create_examples(
    dataset_id=dataset.id,
    examples=examples
)



LangSmithAuthError: Authentication failed for /datasets. HTTPError('401 Client Error: Unauthorized for url: https://api.smith.langchain.com/datasets', '{"detail":"Invalid token"}')
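
The 401 "Invalid token" response above suggests the client was created without a valid LangSmith API key. A minimal sketch of two guards follows: one that fails early when the key is missing, and one that reuses an existing dataset so that re-running the cell does not try to create a duplicate. It assumes the key is read from the `LANGSMITH_API_KEY` environment variable; `require_env` and `get_or_create_dataset` are hypothetical helper names, and the `client` argument is expected to behave like `langsmith.Client` (`has_dataset`, `read_dataset`, `create_dataset`).

```python
import os


def require_env(name: str, env=os.environ) -> str:
    """Return the variable's value, or raise a clear error before any API call."""
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before creating the client.")
    return value


def get_or_create_dataset(client, dataset_name: str):
    """Reuse the dataset if it already exists.

    Returns (dataset, created), where created is True only for a new dataset.
    """
    if client.has_dataset(dataset_name=dataset_name):
        return client.read_dataset(dataset_name=dataset_name), False
    return client.create_dataset(dataset_name=dataset_name), True
```

In the notebook, `require_env("LANGSMITH_API_KEY")` would run before `Client()`, and `client.create_examples(...)` would be called only when `created` is true.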

In [None]:

# Grade output schema
# A TypedDict is used so the grader returns a dict that can be indexed as grade["correct"].
class CorrectnessGrade(TypedDict):
    # Note that the order in which the fields are defined is the order in which the model will generate them.
    # It is useful to put explanations before responses because it forces the model to think through
    # its final response before generating it:
    explanation: Annotated[str, ..., "Explain your reasoning for the score"]
    correct: Annotated[bool, ..., "True if the answer is correct, False otherwise."]

# Grade prompt
correctness_instructions = """You are a senior researcher. 

You will be given a QUESTION, the GROUND TRUTH (correct) ANSWER, and the PhD STUDENT ANSWER.

Here are the grading criteria to follow:
(1) Grade the student answer based ONLY on its factual accuracy relative to the ground truth answer.
(2) Ensure that the student answer does not contain any conflicting statements.
(3) It is OK if the student answer contains more information than the ground truth answer, as long as it is factually accurate relative to the ground truth answer.

Correctness:
A correctness value of True means that the student's answer meets all of the criteria.
A correctness value of False means that the student's answer does not meet all of the criteria.

Explain your reasoning in a step-by-step manner to ensure your reasoning and conclusion are correct. 

Avoid simply stating the correct answer at the outset."""

In [19]:
# Grader LLM

grader_llm = llm.with_structured_output(CorrectnessGrade, method="json_schema", strict=True)

def correctness(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    """An evaluator for RAG answer accuracy"""
    answers = f"""\
QUESTION: {inputs['question']}
GROUND TRUTH ANSWER: {reference_outputs['answer']}
STUDENT ANSWER: {outputs['answer']}"""

    # Run evaluator
    grade = grader_llm.invoke([
        {"role": "system", "content": correctness_instructions}, 
        {"role": "user", "content": answers}
    ])
    return grade["correct"]
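
Before spending model calls, the evaluator's prompt assembly and return value can be smoke-tested with a stub in place of the grader LLM. The sketch below is an assumption-laden illustration, not the notebook's own code: `make_correctness_evaluator` and `StubGrader` are hypothetical names, the stub returns a plain dict with `explanation` and `correct` keys, and the question and answers are placeholder text.

```python
def make_correctness_evaluator(grader, instructions: str):
    """Build an evaluator around any object exposing .invoke(messages)."""
    def correctness(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
        answers = (
            f"QUESTION: {inputs['question']}\n"
            f"GROUND TRUTH ANSWER: {reference_outputs['answer']}\n"
            f"STUDENT ANSWER: {outputs['answer']}"
        )
        grade = grader.invoke([
            {"role": "system", "content": instructions},
            {"role": "user", "content": answers},
        ])
        return bool(grade["correct"])
    return correctness


class StubGrader:
    """Hypothetical stand-in for the grader LLM; records the messages it receives."""
    def __init__(self, verdict: bool):
        self.verdict = verdict
        self.last_messages = None

    def invoke(self, messages):
        self.last_messages = messages
        return {"explanation": "stubbed", "correct": self.verdict}


stub = StubGrader(verdict=True)
evaluate_correctness = make_correctness_evaluator(stub, "You are a senior researcher.")
result = evaluate_correctness(
    inputs={"question": "Which hormones regulate the endometrium?"},
    outputs={"answer": "Estrogen and progesterone."},
    reference_outputs={"answer": "Steroid hormones, mainly estrogen and progesterone."},
)
print(result)
print(stub.last_messages[1]["content"])
```

Passing the grader in as an argument keeps the evaluator logic independent of any particular model, so the same function can later be wired to the real grader.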