# RAG Evaluation


Install packages


In [74]:
!uv pip install -q \
    pandas==2.3.2 \
    pandas-stubs==2.3.2.250827 \
    requests==2.32.5 \
    python-dotenv==1.2.1 \
    tqdm==4.67.1 \
    litellm==1.78.5

Import packages


In [None]:
import hashlib
import json
import random
import time
import uuid
from collections import defaultdict
from pathlib import Path

import litellm
import pandas as pd
import requests
from dotenv import load_dotenv
from tqdm import tqdm

load_dotenv()

True

## Ground Truth Dataset


Download documents


In [None]:
docs_url = "https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json"
docs_response = requests.get(docs_url, timeout=30)
docs_response.raise_for_status()  # fail early on a bad HTTP status
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course["course"]
    for doc in course["documents"]:
        doc["course"] = course_name
        documents.append(doc)

documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

Generate document ids based on the document's content


In [None]:
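def generate_document_id(doc):
    # Hash the course, the question, and the first 10 characters of the text.
    combined = f"{doc['course']}-{doc['question']}-{doc['text'][:10]}"
    hash_object = hashlib.md5(combined.encode())
    hash_hex = hash_object.hexdigest()
    # Keep the first 8 hex characters as a short id.
    document_id = hash_hex[:8]

    return document_id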

Apply ids


In [None]:
for doc in documents:
    doc["id"] = generate_document_id(doc)
documents[3]

{'text': "You don't need it. You're accepted. You can also just start learning and submitting homework without registering. It is not checked against any registered list. Registration is just to gauge interest before the start date.",
 'section': 'General course-related questions',
 'question': 'Course - I have registered for the Data Engineering Bootcamp. When can I expect to receive the confirmation email?',
 'course': 'data-engineering-zoomcamp',
 'id': '0bbf41ec'}

Check for duplicates


In [None]:
hashes = defaultdict(list)

for doc in documents:
    doc_id = doc["id"]
    hashes[doc_id].append(doc)

len(hashes), len(documents)

(947, 948)

Duplicated ids


In [None]:
for k, v in hashes.items():
    if len(v) > 1:
        print(k, len(v))

593f7569 2


In [None]:
hashes["593f7569"]

[{'text': "They both do the same, it's just less typing from the script.\nAsked by Andrew Katoch, Added by Edidiong Esu",
  'section': '6. Decision Trees and Ensemble Learning',
  'question': 'Does it matter if we let the Python file create the server or if we run gunicorn directly?',
  'course': 'machine-learning-zoomcamp',
  'id': '593f7569'},
 {'text': "They both do the same, it's just less typing from the script.",
  'section': '6. Decision Trees and Ensemble Learning',
  'question': 'Does it matter if we let the Python file create the server or if we run gunicorn directly?',
  'course': 'machine-learning-zoomcamp',
  'id': '593f7569'}]
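
The two records above share the course, the question, and the first ten characters of `text`, so `generate_document_id` hashes the same string for both. The sketch below reproduces that collision and shows one possible alternative that hashes the full text. The name `generate_document_id_full` is hypothetical and not part of the notebook; whether to keep both records or drop one is a separate judgment call.

```python
import hashlib


def generate_document_id(doc):
    # Same logic as the notebook: only the first 10 characters of the text are hashed.
    combined = f"{doc['course']}-{doc['question']}-{doc['text'][:10]}"
    return hashlib.md5(combined.encode()).hexdigest()[:8]


def generate_document_id_full(doc):
    # Hypothetical variant: hash the whole text so near-duplicates get different ids.
    combined = f"{doc['course']}-{doc['question']}-{doc['text']}"
    return hashlib.md5(combined.encode()).hexdigest()[:8]


base = {
    "course": "machine-learning-zoomcamp",
    "question": "Does it matter if we let the Python file create the server or if we run gunicorn directly?",
}
doc_a = {
    **base,
    "text": "They both do the same, it's just less typing from the script.\nAsked by Andrew Katoch, Added by Edidiong Esu",
}
doc_b = {**base, "text": "They both do the same, it's just less typing from the script."}

# The prefix-based id cannot tell the two records apart.
same_prefix_id = generate_document_id(doc_a) == generate_document_id(doc_b)
# The full-text id sees the extra attribution line in doc_a.
distinct_full_id = generate_document_id_full(doc_a) != generate_document_id_full(doc_b)

print(same_prefix_id, distinct_full_id)
```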

Save documents with ids


In [None]:
with open("documents-with-ids.json", "w", encoding="utf-8") as f_out:
    json.dump(documents, f_out, indent=2)

In [12]:
!head documents-with-ids.json

[
  {
    "text": "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  \u201cOffice Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon\u2019t forget to register in DataTalks.Club's Slack and join the channel.",
    "section": "General course-related questions",
    "question": "Course - When will the course start?",
    "course": "data-engineering-zoomcamp",
    "id": "c02e79ef"
  },
  {
    "text": "GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites",


In [None]:
# !docker exec -it ollama ollama pull qwen3:0.6b

In [None]:
# !docker exec -it ollama ollama list

LiteLLM with OpenRouter


In [None]:
response = litellm.completion(
    model="openrouter/meta-llama/llama-3.3-70b-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": "Explain transformers llm architecture in one paragraph.",
        },
    ],
    # api_base="http://localhost:11434",
    # api_key="ollama",
    # format="json",
    # custom_llm_provider="ollama",
)

print(response.choices[0].message.content)

The Transformer LLM (Large Language Model) architecture is a type of neural network designed for natural language processing tasks. It's based on the Transformer model, which relies on self-attention mechanisms to weigh the importance of different input elements relative to each other. The architecture consists of an encoder and a decoder, but in the case of LLMs, the decoder is often used in a standalone fashion to generate text. The model is composed of a series of identical layers, each comprising two sub-layers: a self-attention mechanism and a position-wise fully connected feed-forward network. The self-attention mechanism allows the model to attend to different parts of the input sequence simultaneously and weigh their importance, while the feed-forward network transforms the output of the self-attention mechanism. The output of each layer is then passed through a layer normalization and a residual connection, which helps to stabilize the training process and allow the model to l

Prompt template


In [None]:
prompt_template = """
You emulate a student who's taking our course.
Formulate 5 questions this student might ask based on a FAQ record. The record
should contain the answer to the questions, and the questions should be complete and not too short.
If possible, use as few words as possible from the record.

The record:

section: {section}
question: {question}
answer: {text}

Provide the output in parsable JSON without using code blocks:

["question1", "question2", ..., "question5"]
""".strip()
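
To see how the template is filled in, the sketch below renders an abbreviated stand-in for `prompt_template` (same placeholders, shorter wording) against a shortened, illustrative record. It relies on the fact that `str.format` ignores keyword arguments the template does not reference, so the extra `course` and `id` keys in each document are harmless.

```python
# Abbreviated stand-in for the notebook's prompt_template (same placeholders).
template = """
The record:

section: {section}
question: {question}
answer: {text}
""".strip()

# Illustrative record; the text is shortened for the example.
doc = {
    "section": "General course-related questions",
    "question": "Course - When will the course start?",
    "text": "The course will start with the first Office Hours live.",
    "course": "data-engineering-zoomcamp",
    "id": "c02e79ef",
}

# Unused keys ("course", "id") are silently ignored by str.format.
prompt = template.format(**doc)
print(prompt)
```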

In [None]:
def generate_questions(doc):
    prompt = prompt_template.format(**doc)

    response = litellm.completion(
        model="openrouter/meta-llama/llama-3.3-70b-instruct",
        messages=[
            {"role": "user", "content": prompt},
        ],
        format="json",
        # api_base="http://localhost:11434",
        # api_key="ollama",
        # custom_llm_provider="ollama",
    )

    return response.choices[0].message.content
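
A run over all 948 documents makes many network calls, and any of them can fail transiently. One possible way to make the loop more robust is a small retry wrapper with exponential backoff and jitter. This is a sketch under assumptions: `with_retries` is a hypothetical helper, not part of the notebook or of LiteLLM, and the demo uses a stand-in function instead of a real API call.

```python
import random
import time


def with_retries(func, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args), retrying on any exception with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt; jitter avoids synchronized retries.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            sleep(delay)


# Stand-in for an API call that fails twice before succeeding.
calls = {"count": 0}


def flaky(value):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return value.upper()


# Pass a no-op sleep so the demo finishes instantly.
result = with_retries(flaky, "ok", sleep=lambda seconds: None)
print(result, calls["count"])
```

In the notebook this could wrap the call as `with_retries(generate_questions, doc)`.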

In [None]:
OUTPUT_PATH = Path("generated_questions.json")

if OUTPUT_PATH.exists():
    with OUTPUT_PATH.open("r", encoding="utf-8") as f:
        results = json.load(f)
else:
    results = {}

In [None]:
def save_results(data, path=OUTPUT_PATH):
    tmp_path = path.with_suffix(".tmp")
    with tmp_path.open("w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    tmp_path.replace(path)
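
`save_results` writes to a temporary file first and then swaps it into place with `Path.replace`, so an interrupted run should not leave a half-written `generated_questions.json`. The sketch below exercises the same function in a temporary directory; the sample ids and values are placeholders.

```python
import json
import tempfile
from pathlib import Path


def save_results(data, path):
    # Write the full payload to a sibling .tmp file first.
    tmp_path = path.with_suffix(".tmp")
    with tmp_path.open("w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    # Swap the finished file into place, overwriting any previous version.
    tmp_path.replace(path)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "generated_questions.json"
    save_results({"c02e79ef": "[]"}, path)
    save_results({"c02e79ef": "[]", "1f6520ca": "[]"}, path)  # overwrite

    reloaded = json.loads(path.read_text(encoding="utf-8"))
    leftover_tmp = path.with_suffix(".tmp").exists()

print(sorted(reloaded), leftover_tmp)
```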

In [None]:
for doc in tqdm(documents):
    doc_id = str(doc["id"])

    if doc_id in results:
        continue

    questions = generate_questions(doc)
    results[doc_id] = questions

    save_results(results)

100%|██████████| 948/948 [00:02<00:00, 355.93it/s]


In [None]:
def extract_json(text):
    # Locate the first "[", where the JSON array is expected to start.
    start_idx = text.find("[")

    if start_idx == -1:
        return None

    # Try progressively shorter spans until one parses as JSON.
    for end_idx in range(len(text), start_idx, -1):
        try:
            return json.loads(text[start_idx:end_idx])
        except json.JSONDecodeError:
            continue
    return None
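
`extract_json` retries `json.loads` on progressively shorter slices, which can mean many parse attempts when the model appends text after the array. A possible alternative, sketched here under the hypothetical name `extract_json_array`, uses the standard library's `json.JSONDecoder.raw_decode`, which parses one value starting at a given index and ignores whatever follows. Unlike the notebook's function, this variant also tries later `[` characters if the first one does not start a valid array.

```python
import json


def extract_json_array(text):
    """Return the first JSON array that parses in text, or None."""
    decoder = json.JSONDecoder()
    start_idx = text.find("[")
    while start_idx != -1:
        try:
            # raw_decode parses one value at start_idx and ignores trailing text.
            value, _end = decoder.raw_decode(text, start_idx)
            return value
        except json.JSONDecodeError:
            # Not a valid array here; try the next "[".
            start_idx = text.find("[", start_idx + 1)
    return None


with_chatter = extract_json_array('Here you go: ["q1", "q2"] Hope this helps!')
no_array = extract_json_array("no json here")
print(with_chatter, no_array)
```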

In [None]:
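parsed_results = {}

for doc_id, questions in results.items():
    parsed = extract_json(questions)
    if parsed is None:
        # extract_json returns None instead of raising, so report failures here.
        print(f"Could not parse questions for {doc_id}:")
        print(questions)
        break
    parsed_results[doc_id] = parsed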

In [None]:
doc_index = {d["id"]: d for d in documents}

In [None]:
final_results = []

for doc_id, questions in parsed_results.items():
    course = doc_index[doc_id]["course"]
    for q in questions:
        final_results.append((q, course, doc_id))
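
The prompt asks for five questions per record, but model output is not guaranteed to comply. A minimal sanity check, assuming a hypothetical helper `find_invalid_entries` and made-up sample ids, could flag entries that are missing or have the wrong shape before they reach the ground-truth file:

```python
def find_invalid_entries(parsed, expected_count=5):
    """Return ids whose value is not a list of expected_count non-empty strings."""
    invalid = []
    for doc_id, questions in parsed.items():
        ok = (
            isinstance(questions, list)
            and len(questions) == expected_count
            and all(isinstance(q, str) and q.strip() for q in questions)
        )
        if not ok:
            invalid.append(doc_id)
    return invalid


# Made-up sample; the ids are placeholders, not real document ids.
sample = {
    "aaaa0001": ["q1", "q2", "q3", "q4", "q5"],
    "aaaa0002": None,          # no parsable array was found
    "aaaa0003": ["q1", "q2"],  # too few questions
}

invalid_ids = find_invalid_entries(sample)
print(invalid_ids)
```

In the notebook, the same check could be applied as `find_invalid_entries(parsed_results)`.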

In [None]:
df = pd.DataFrame(final_results, columns=["question", "course", "document"])
df.head()

|   | question | course | document |
|---|----------|--------|----------|
| 0 | What is the exact date and time when our cours... | data-engineering-zoomcamp | c02e79ef |
| 1 | How can I stay updated about the course schedu... | data-engineering-zoomcamp | c02e79ef |
| 2 | What are the necessary steps I need to take be... | data-engineering-zoomcamp | c02e79ef |
| 3 | Where can I find the course calendar and how d... | data-engineering-zoomcamp | c02e79ef |
| 4 | What are the different platforms I need to joi... | data-engineering-zoomcamp | c02e79ef |


In [None]:
df.to_csv("ground-truth-data.csv", index=False)

In [88]:
!head ground-truth-data.csv

question,course,document
What is the exact date and time when our course is scheduled to begin,data-engineering-zoomcamp,c02e79ef
How can I stay updated about the course schedule and important announcements,data-engineering-zoomcamp,c02e79ef
What are the necessary steps I need to take before the course starts,data-engineering-zoomcamp,c02e79ef
Where can I find the course calendar and how do I access it,data-engineering-zoomcamp,c02e79ef
What are the different platforms I need to join to be fully registered for the course,data-engineering-zoomcamp,c02e79ef
What do I need to know before enrolling in this course,data-engineering-zoomcamp,1f6520ca
Are there any specific requirements to join this course,data-engineering-zoomcamp,1f6520ca
Do I need prior experience to take this course,data-engineering-zoomcamp,1f6520ca
What are the necessary skills to succeed in this course,data-engineering-zoomcamp,1f6520ca
