# RAG Evaluation

This notebook lets you experiment with RAG evaluation using an LLM as a judge.

- It uses the ground-truth dataset generated during retrieval evaluation.
- Set `OPENAI_API_KEY` in your environment if you want to run the LLM steps.
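
A quick way to confirm that the key is visible to the notebook kernel is sketched below. The helper name `has_openai_key` is hypothetical and not part of this project; it only checks that the variable is present and non-empty, not that the key is valid.

```python
import os


def has_openai_key(env=os.environ) -> bool:
    """Return True when OPENAI_API_KEY is present and non-empty (hypothetical helper)."""
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not has_openai_key():
    # Assumption: the LLM cells need the key; export it in the shell or a .env file.
    print("OPENAI_API_KEY is not set; the LLM cells are likely to fail.")
```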

In [7]:
# Setup: add the project's `src` directory to sys.path and set optional env vars
import os
import sys
from pathlib import Path

import pandas as pd

# In Jupyter, __file__ is not defined. Use the current notebook's directory.
# Notebook lives in PROJECT_ROOT / "notebooks", so project root is parent of cwd.
PROJECT_ROOT = Path.cwd().parent
SRC_PATH = PROJECT_ROOT / "src"
if str(SRC_PATH) not in sys.path:
    sys.path.insert(0, str(SRC_PATH))

# Reduce tokenizer threads warning noise for fastembed
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")

# Optionally set your API key for LLM usage here (prefer using a .env or shell env)
# os.environ["OPENAI_API_KEY"] = "<your_key_here>"

print("Project root:", PROJECT_ROOT)
print("Using src path:", SRC_PATH)

restaurants_csv = str(PROJECT_ROOT / "data" / "restaurants.csv")
menus_csv = str(PROJECT_ROOT / "data" / "restaurant-menus.csv")


Project root: /Users/anupamgupta/Desktop/Github projects/LLM-based-Agentic-RAG
Using src path: /Users/anupamgupta/Desktop/Github projects/LLM-based-Agentic-RAG/src


In [2]:
ground_truth_path = PROJECT_ROOT / "data" / "ground-truth-retrieval.csv"
!head "{ground_truth_path}"

id,question
0,What is the price of the Extra Large Meat Lovers pie at PJ Fresh?
0,"What is PJ Fresh's full address in Birmingham, AL 35207?"
0,Under which menu category is the Extra Large Meat Lovers pizza listed?
1,What is the price of the Extra Large Supreme pizza?
1,Is the Extra Large Supreme a whole pie?
1,"What is the street address and ZIP code for PJ Fresh in Birmingham, AL 35207?"
2,"What is the price of the Extra Large Pepperoni pizza at PJ Fresh in Birmingham, AL?"
2,What is the street address of PJ Fresh?
2,What is the description for the Extra Large Pepperoni item on the menu?


In [3]:
df_ground_truth = pd.read_csv(ground_truth_path)
df_ground_truth.head()

,id,question
0,0,What is the price of the Extra Large Meat Love...
1,0,"What is PJ Fresh's full address in Birmingham,..."
2,0,Under which menu category is the Extra Large M...
3,1,What is the price of the Extra Large Supreme p...
4,1,Is the Extra Large Supreme a whole pie?


In [4]:
from tqdm.auto import tqdm

In [5]:
prompt2_template = """
You are an expert evaluator for a RAG system.
Your task is to analyze the relevance of the generated answer to the given question.
Based on the relevance of the generated answer, you will classify it
as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Here is the data for evaluation:

Question: {question}
Generated Answer: {answer_llm}

Please analyze the content and context of the generated answer in relation to the question
and provide your evaluation in parsable JSON without using code blocks:

{{
  "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
  "Explanation": "[Provide a brief explanation for your evaluation]"
}}
""".strip()
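
The doubled braces in the template matter: `str.format` treats `{{` and `}}` as literal braces, so the JSON skeleton survives formatting while `{question}` and `{answer_llm}` are substituted. A minimal sketch with a shortened stand-in template (the answer text is a placeholder, not real model output):

```python
# Shortened stand-in for the judge prompt; only the brace handling is of interest.
template = """
Question: {question}
Generated Answer: {answer_llm}

{{
  "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT"
}}
""".strip()

prompt = template.format(
    question="What is the street address of PJ Fresh?",
    answer_llm="(placeholder answer text)",
)

# The printed prompt contains single braces around the JSON skeleton.
print(prompt)
```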

In [8]:
from llm_utility import RAGQueryEngine
from restaurant_retreival_engine import (
    DataLoader,
    EmbeddingService,
    RestaurantSearchEngine,
    RestaurantVectorStore,
)

# One dict per ground-truth row: {"id": ..., "question": ...}
sample = df_ground_truth.to_dict(orient='records')

# Build the retrieval stack from the project's own components
vector_store = RestaurantVectorStore()
embedding = EmbeddingService()
data_loader = DataLoader(restaurants_csv, menus_csv)

collection_name = "rag-eval-temp"  # defined here but not passed to the engine below
engine = RestaurantSearchEngine(vector_store, embedding, data_loader)
rag = RAGQueryEngine()  # generates answers and also serves as the LLM judge

In [14]:
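import json

evaluations = []

for record in tqdm(sample):
    question = record['question']

    # Retrieve context for the question and collect each hit's payload
    results = engine.search(question)
    context = [point.payload for point in results.points]

    # Generate the answer from the retrieved context (stats are not used here)
    answer_llm, _stats = rag.query(question, context)

    # Ask the LLM judge to rate the answer's relevance to the question
    prompt = prompt2_template.format(
        question=question,
        answer_llm=answer_llm
    )
    evaluation = json.loads(rag.llm(prompt))  # raises if the reply is not valid JSON

    evaluations.append((record, answer_llm, evaluation))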

  0%|          | 0/30 [00:00<?, ?it/s]
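
The loop above calls `json.loads` directly on the judge's reply, so one malformed response stops the whole run. A more tolerant parser could look like the sketch below. `parse_judge_json` is a hypothetical helper, not part of `llm_utility`, and it assumes the reply contains at most one top-level JSON object.

```python
import json


def parse_judge_json(raw: str) -> dict:
    """Parse judge output, tolerating extra text around the JSON object (hypothetical helper)."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span, e.g. when the model adds a preamble.
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end <= start:
            raise  # nothing that looks like a JSON object; surface the original error
        return json.loads(raw[start:end + 1])


wrapped = 'Here is my evaluation: {"Relevance": "RELEVANT", "Explanation": "ok"}'
print(parse_judge_json(wrapped)["Relevance"])
```

In the loop, `json.loads(rag.llm(prompt))` could be swapped for `parse_judge_json(rag.llm(prompt))`, optionally wrapped in a `try`/`except` that records the failure and moves on to the next question.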

In [15]:
df_eval = pd.DataFrame(evaluations, columns=['record', 'answer', 'evaluation'])

# Flatten the ground-truth record into plain columns
df_eval['id'] = df_eval.record.apply(lambda d: d['id'])
df_eval['question'] = df_eval.record.apply(lambda d: d['question'])

# Flatten the judge's JSON verdict into plain columns
df_eval['relevance'] = df_eval.evaluation.apply(lambda d: d['Relevance'])
df_eval['explanation'] = df_eval.evaluation.apply(lambda d: d['Explanation'])

# Drop the nested columns now that their fields are extracted
del df_eval['record']
del df_eval['evaluation']

In [16]:
df_eval.relevance.value_counts(normalize=True)

relevance
RELEVANT           0.766667
NON_RELEVANT       0.133333
PARTLY_RELEVANT    0.100000
Name: proportion, dtype: float64
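
The proportions are easier to sanity-check as counts. Assuming the 30 evaluated records reported by the progress bar above, the arithmetic below converts each share back to a whole number of questions. The values are copied from the output, and rounding absorbs the six-decimal truncation.

```python
total = 30  # assumption: number of evaluated records, taken from the progress bar

# Proportions as printed by value_counts(normalize=True)
proportions = {
    "RELEVANT": 0.766667,
    "NON_RELEVANT": 0.133333,
    "PARTLY_RELEVANT": 0.100000,
}

# Convert each proportion back to a whole-number count
counts = {label: round(share * total) for label, share in proportions.items()}
print(counts)
```

The counts add up to 30, and the four `NON_RELEVANT` answers match the four rows listed in the next cell.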

In [17]:
df_eval[df_eval.relevance == 'NON_RELEVANT']

,answer,id,question,relevance,explanation
8,Slice.,2,What is the description for the Extra Large Pe...,NON_RELEVANT,The generated answer 'Slice.' does not describ...
10,There are two entries for Extra Large BBQ Chic...,3,What is the description for the Extra Large BB...,NON_RELEVANT,The user asked for the actual description of t...
17,"Extra Large 16"" Pizza.",5,Question: Under which menu category is the Ext...,NON_RELEVANT,The generated answer describes a specific pizz...
22,Extra Large Pizza.,7,Under which menu category is the Extra Large M...,NON_RELEVANT,The question asks for the menu category under ...
