# Test Set Generator

In this tutorial, we'll explore the test set generation module in Ragas to create a synthetic test set for a Retrieval-Augmented Generation (RAG)-based question-answering bot.

To make sure our synthetic dataset is as realistic and diverse as possible, we will create different customer personas. Each persona will represent distinct traveler types and behaviors, helping us build a comprehensive and representative test set. This approach ensures that we can thoroughly evaluate the effectiveness and robustness of our RAG model.



## Download and Load documents
Run the command below to download the dummy Ragas Airline dataset and load the documents using LangChain.

In [1]:
# git clone https://huggingface.co/datasets/explodinggradients/ragas-airline-dataset

In [4]:
from langchain_community.document_loaders import DirectoryLoader

path = "data/ragas-airline-dataset"
loader = DirectoryLoader(path, glob="**/*.md")
docs = loader.load()

## Set up the LLM and Embedding Model

In [5]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import OpenAIEmbeddings
from langchain_openai import ChatOpenAI
import openai


generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
openai_client = openai.OpenAI()
generator_embeddings = OpenAIEmbeddings(client=openai_client, model="text-embedding-3-small")


## Create Knowledge Graph
Create a base knowledge graph from the documents.

In [6]:
from ragas.testset.graph import KnowledgeGraph
from ragas.testset.graph import Node, NodeType


kg = KnowledgeGraph()

for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )

kg

KnowledgeGraph(nodes: 9, relationships: 0)

## Set up the transforms
In this tutorial, we create a Single Hop Query dataset using a knowledge graph built solely from nodes. To enhance our graph and improve query generation, we apply three key transformations:

1. Headline Extraction: Uses a language model to extract clear section titles from each document (e.g., “Airline Initiated Cancellations” from flight cancellations.md). These titles isolate specific topics and provide direct context for generating focused questions.

2. Headline Splitting: Divides documents into manageable subsections based on the extracted headlines. This increases the number of nodes and ensures more granular, context-specific query generation.

3. Keyphrase Extraction: Identifies core thematic keyphrases (such as key seating information) that serve as semantic seed points, enriching the diversity and relevance of the generated queries.

In [7]:
from ragas.testset.transforms import apply_transforms
from ragas.testset.transforms import HeadlinesExtractor, HeadlineSplitter, KeyphrasesExtractor

headline_extractor = HeadlinesExtractor(llm=generator_llm, max_num=20)
headline_splitter = HeadlineSplitter(max_tokens=1500)
keyphrase_extractor = KeyphrasesExtractor(llm=generator_llm)

transforms = [
    headline_extractor,
    headline_splitter,
    keyphrase_extractor
]

apply_transforms(kg, transforms=transforms)

Applying HeadlinesExtractor:   0%|          | 0/9 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/9 [00:00<?, ?it/s]

Applying KeyphrasesExtractor:   0%|          | 0/30 [00:00<?, ?it/s]

Property 'keyphrases' already exists in node 'ee3d35'. Skipping!
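
To make the effect of headline splitting concrete, here is a minimal sketch in plain Python. It is not the Ragas `HeadlineSplitter` implementation: the helper `split_by_headlines` and the sample text are hypothetical, and the sketch assumes each extracted headline appears verbatim on its own line of the document.

```python
from typing import Dict, List


def split_by_headlines(text: str, headlines: List[str]) -> List[Dict[str, str]]:
    """Split `text` into sections, starting a new one at each known headline."""
    sections: List[Dict[str, str]] = []
    current = {"headline": "", "content": ""}
    for line in text.splitlines():
        # Drop markdown heading markers before comparing against the headlines
        stripped = line.strip().lstrip("#").strip()
        if stripped in headlines:
            # Close the running section before opening the next one
            if current["headline"] or current["content"].strip():
                sections.append(current)
            current = {"headline": stripped, "content": ""}
        else:
            current["content"] += line + "\n"
    if current["headline"] or current["content"].strip():
        sections.append(current)
    return sections


# Hypothetical sample document (not taken from the airline dataset)
sample = """# Flight Cancellations
Cancellations can be initiated by the airline or the passenger.
## Airline Initiated Cancellations
You will be rebooked or refunded.
## Passenger Initiated Cancellations
Fees depend on the fare type.
"""
chunks = split_by_headlines(
    sample,
    [
        "Flight Cancellations",
        "Airline Initiated Cancellations",
        "Passenger Initiated Cancellations",
    ],
)
for chunk in chunks:
    print(chunk["headline"], "->", len(chunk["content"].split()), "words")
```

Each resulting section would become its own node, which is why the node count grows after splitting (9 documents became 30 nodes in the run above).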


## Configuring Personas for Query Generation
Personas provide context and perspective, ensuring that generated queries are natural, user-specific, and diverse. By tailoring queries to different user viewpoints, our test set covers a wide range of scenarios:

* First Time Flier: Generates queries with detailed, step-by-step guidance, catering to newcomers who need clear instructions.
* Frequent Flier: Produces concise, efficiency-focused queries for experienced travelers.
* Angry Business Class Flier: Yields queries with a critical, urgent tone to reflect high expectations and immediate resolution demands.

In [8]:
from ragas.testset.persona import Persona

persona_first_time_flier = Persona(
    name="First Time Flier",
    role_description="Is flying for the first time and may feel anxious. Needs clear guidance on flight procedures, safety protocols, and what to expect throughout the journey.",
)

persona_frequent_flier = Persona(
    name="Frequent Flier",
    role_description="Travels regularly and values efficiency and comfort. Interested in loyalty programs, express services, and a seamless travel experience.",
)

persona_angry_business_flier = Persona(
    name="Angry Business Class Flier",
    role_description="Demands top-tier service and is easily irritated by any delays or issues. Expects immediate resolutions and is quick to express frustration if standards are not met.",
)

personas = [persona_first_time_flier, persona_frequent_flier, persona_angry_business_flier]

## Query Generation Using Synthesizers
Synthesizers are responsible for converting enriched nodes and personas into queries. They achieve this by selecting a node property (e.g., "entities" or "keyphrases"), pairing it with a persona, style, and query length, and then using an LLM to generate a query-answer pair based on the content of the node.
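
As a rough illustration of that pairing step, the sketch below samples (term, persona, style, length) combinations for one node. It is an assumption-laden stand-in rather than the Ragas code: `Scenario`, `build_scenarios`, the style and length labels, and the sample keyphrases are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Scenario:
    term: str      # value picked from the chosen node property
    persona: str   # persona name the query is written for
    style: str     # illustrative style label (assumption)
    length: str    # illustrative length label (assumption)


def build_scenarios(
    node: Dict[str, List[str]],
    property_name: str,
    personas: List[str],
    n: int,
    seed: int = 0,
) -> List[Scenario]:
    """Sample `n` (term, persona, style, length) combinations for one node."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    styles = ["perfect grammar", "poor grammar", "misspelled"]
    lengths = ["short", "medium", "long"]
    terms = node[property_name]
    return [
        Scenario(
            rng.choice(terms),
            rng.choice(personas),
            rng.choice(styles),
            rng.choice(lengths),
        )
        for _ in range(n)
    ]


# Hypothetical node with two extracted keyphrases
node = {"keyphrases": ["baggage allowance", "flight delay compensation"]}
scenarios = build_scenarios(
    node, "keyphrases", ["First Time Flier", "Frequent Flier"], n=3
)
for s in scenarios:
    print(s)
```

In the real pipeline, each such combination is then handed to the LLM together with the node content to produce the query and its reference answer.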

Two instances of the SingleHopSpecificQuerySynthesizer are used to define the query distribution:

* Headlines-Based Synthesizer – Generates queries using extracted document headlines, leading to structured questions that reference specific sections.
* Keyphrases-Based Synthesizer – Forms queries around key concepts, generating broader, thematic questions.

Both synthesizers are weighted equally (0.5 each), ensuring a balanced mix of specific and conceptual queries, which ultimately enhances the diversity of the test set.

In [9]:
from ragas.testset.synthesizers.single_hop.specific import (
    SingleHopSpecificQuerySynthesizer,
)

query_distibution = [
    (
        SingleHopSpecificQuerySynthesizer(llm=generator_llm, property_name="headlines"),
        0.5,
    ),
    (
        SingleHopSpecificQuerySynthesizer(
            llm=generator_llm, property_name="keyphrases"
        ),
        0.5,
    ),
]
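
To see what equal weights mean in practice, the following sketch converts weights into per-synthesizer sample counts. The helper `allocate_samples` is hypothetical and uses a largest-remainder rule as an assumption; Ragas may round differently internally.

```python
from typing import Dict


def allocate_samples(weights: Dict[str, float], testset_size: int) -> Dict[str, int]:
    """Split `testset_size` across synthesizers in proportion to their weights."""
    total = sum(weights.values())
    # Floor each share first, then hand leftover samples to the largest remainders
    exact = {name: testset_size * w / total for name, w in weights.items()}
    counts = {name: int(share) for name, share in exact.items()}
    leftover = testset_size - sum(counts.values())
    by_remainder = sorted(
        exact, key=lambda name: exact[name] - counts[name], reverse=True
    )
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


print(allocate_samples({"headlines": 0.5, "keyphrases": 0.5}, testset_size=10))
# {'headlines': 5, 'keyphrases': 5}
```

With a test set of 10 and weights of 0.5 each, both synthesizers are expected to contribute about five samples.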

## Testset Generation

In [10]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(
    llm=generator_llm,
    embedding_model=generator_embeddings,
    knowledge_graph=kg,
    persona_list=personas,
)

In [11]:
testset = generator.generate(testset_size=10, query_distribution=query_distibution)
testset.to_pandas()

Generating Scenarios:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/10 [00:00<?, ?it/s]

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | What Ragas Airlines baggage policies I need to... | [Baggage Policies\n\nThis section provides a d... | Ragas Airlines' baggage policies include allow... | single_hop_specific_query_synthesizer |
| 1 | What should I do if my flight is delayed? | [Flight Delays\n\nFlight delays can be caused ... | If your flight is delayed, Ragas Airlines will... | single_hop_specific_query_synthesizer |
| 2 | What happens during Step 3: Refund Processing ... | [Flight Cancellations\n\nFlight cancellations ... | During Step 3: Refund Processing, refunds will... | single_hop_specific_query_synthesizer |
| 3 | How I change my booking? | [Managing Reservations\n\nManaging your reserv... | To change your booking, first check the fare r... | single_hop_specific_query_synthesizer |
| 4 | What happens if I don't request special assist... | [Special Assistance\n\nRagas Airlines provides... | If you did not request assistance in advance, ... | single_hop_specific_query_synthesizer |
| 5 | What are the baggage restrictions for Ragas Ai... | [Baggage Policies\n\nThis section provides a d... | To avoid delays at security checkpoints, ensur... | single_hop_specific_query_synthesizer |
| 6 | What I do if my baggage is damaged? | [Delayed, Lost, or Damaged Baggage\n\nIf you e... | If your checked baggage arrives damaged, repor... | single_hop_specific_query_synthesizer |
| 7 | How I resubmit the claim if I miss documents? | [Potential Issues and Resolutions for Baggage ... | If your claim is denied due to missing documen... | single_hop_specific_query_synthesizer |
| 8 | Wher r the compansation detals for my flight d... | [Flight Delays\n\nFlight delays can be caused ... | The compensation details will be included in t... | single_hop_specific_query_synthesizer |
| 9 | How can I claim compensation for additional ex... | [Passenger Responsibilities During Delays\n\nS... | To claim compensation for additional expenses ... | single_hop_specific_query_synthesizer |


# Non-English Testset Generation

## Download and Load corpus

In [None]:
# ! git clone https://huggingface.co/datasets/explodinggradients/Sample_non_english_corpus

In [13]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader


path = "data/Sample_non_english_corpus/"
loader = DirectoryLoader(path, glob="**/*.txt")
docs = loader.load()

In [None]:
len(docs)

6

In [None]:
docs

[Document(metadata={'source': 'Sample_non_english_corpus/New York.txt'}, page_content="New York (prononcé en anglais américain : /nu ˈjɔɹk/), officiellement nommée City of New York, connue également sous les noms et abréviations de New York City ou NYC (pour éviter la confusion avec l'État de New York), et dont le surnom le plus connu est The Big Apple (« La grosse pomme »), est la plus grande ville des États-Unis en nombre d'habitants et l'une des plus importantes du continent américain et du monde. Elle se situe dans le Nord-Est du pays, sur la côte atlantique, à l'extrémité sud-est de l'État de New York. La ville de New York se compose de cinq arrondissements appelés boroughs : Manhattan, Brooklyn, Queens, le Bronx et Staten Island. Ses habitants s'appellent les New-Yorkais (en anglais : New Yorkers).\n\nNew York exerce un impact significatif sur le commerce mondial, la finance, les médias, l'art, la mode, la recherche, la technologie, l'éducation, le divertissement et le tourisme, 

In [None]:
from ragas.llms import LangchainLLMWrapper, llm_factory
from ragas.embeddings import OpenAIEmbeddings
from langchain_openai import ChatOpenAI
import openai
import os

openai_client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
generator_embeddings = OpenAIEmbeddings(client=openai_client, model="text-embedding-3-small")


In [None]:
from ragas.testset.persona import Persona

personas = [
    Persona(
        name="curious student",
        role_description="A student who is curious about the world and wants to learn more about different cultures and languages",
    ),
]

In [None]:
from ragas.testset.transforms.extractors.llm_based import NERExtractor, HeadlinesExtractor
from ragas.testset.transforms.splitters import HeadlineSplitter

headline_extractor = HeadlinesExtractor(llm=generator_llm, max_num=20)
headline_splitter = HeadlineSplitter(max_tokens=1500)
ner_extractor = NERExtractor(llm=generator_llm)

transforms = [
    headline_extractor,
    headline_splitter,
    ner_extractor
]

#transforms = [HeadlinesExtractor(), HeadlineSplitter(), NERExtractor()]

In [None]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(
    llm=generator_llm, embedding_model=generator_embeddings, persona_list=personas
)

In [None]:
from ragas.testset.synthesizers.single_hop.specific import (
    SingleHopSpecificQuerySynthesizer,
)

distribution = [
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 1.0),
]

for query, _ in distribution:
    prompts = await query.adapt_prompts("french", llm=generator_llm)
    query.set_prompts(**prompts)

In [None]:
dataset = generator.generate_with_langchain_docs(
    docs[:],
    testset_size=5,
    transforms=transforms,
    query_distribution=distribution,
)

Applying HeadlinesExtractor:   0%|          | 0/6 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/6 [00:00<?, ?it/s]

Applying NERExtractor:   0%|          | 0/13 [00:00<?, ?it/s]

Property 'entities' already exists in node '998d5c'. Skipping!
Property 'entities' already exists in node '02a4d1'. Skipping!
Property 'entities' already exists in node '3f8eb0'. Skipping!
Property 'entities' already exists in node '8e7810'. Skipping!
Property 'entities' already exists in node '206565'. Skipping!


Generating Scenarios:   0%|          | 0/1 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/5 [00:00<?, ?it/s]

In [None]:
eval_dataset = dataset.to_evaluation_dataset()

In [None]:
print("Query:", eval_dataset[0].user_input)
print("Reference:", eval_dataset[0].reference)

Query: Qu'est-ce que Manhatten et pourquoi est-ce un endroit si important dans la ville de New York?
Reference: Manhattan est l'un des cinq arrondissements de New York, qui est la plus grande ville des États-Unis. Il est particulièrement connu pour son quartier financier, ancré par Wall Street, qui fonctionne comme la 'capitale financière du monde'. De plus, Manhattan abrite le marché immobilier parmi les plus chers au monde et le plus grand nombre de milliardaires de toutes les villes du monde. C'est également un centre majeur de l'industrie du divertissement, avec des attractions comme Times Square et Broadway.


# How to Evaluate and Improve a RAG App

## Necessary Imports

In [None]:
from typing import Any, Dict, Optional

from langchain_classic.docstore.document import Document
from langchain_classic.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.retrievers import BM25Retriever as LangchainBM25Retriever
from openai import AsyncOpenAI

import datasets

## Define Retriever

In [None]:
class BM25Retriever:
    """Simple BM25-based retriever for document search."""
    
    def __init__(self, dataset_name="m-ric/huggingface_doc", default_k=3):
        self.default_k = default_k
        self.retriever = self._build_retriever(dataset_name)
    
    def _build_retriever(self, dataset_name: str) -> LangchainBM25Retriever:
        """Build a BM25 retriever from HuggingFace docs."""
        knowledge_base = datasets.load_dataset(dataset_name, split="train")
        
        # Create documents
        source_documents = [
            Document(
                page_content=row["text"],
                metadata={"source": row["source"].split("/")[1]},
            )
            for row in knowledge_base
        ]
        
        # Split documents
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=100,
            add_start_index=True,
            strip_whitespace=True,
            separators=["\n\n", "\n", ".", " ", ""],
        )
        
        all_chunks = []
        for document in source_documents:
            chunks = text_splitter.split_documents([document])
            all_chunks.extend(chunks)
        
        # Simple deduplication
        unique_chunks = []
        seen_content = set()
        for chunk in all_chunks:
            if chunk.page_content not in seen_content:
                seen_content.add(chunk.page_content)
                unique_chunks.append(chunk)
        
        return LangchainBM25Retriever.from_documents(
            documents=unique_chunks,
            k=1,  # Will be overridden by retrieve method
        )
    
    def retrieve(self, query: str, top_k: Optional[int] = None):
        """Retrieve documents for a given query."""
        if top_k is None:
            top_k = self.default_k
        self.retriever.k = top_k
        return self.retriever.invoke(query)
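
For intuition about what a BM25 retriever ranks by, here is a small, self-contained sketch of Okapi BM25 scoring. It is not the code the LangChain retriever runs: the function `bm25_scores`, the whitespace tokenizer, the toy corpus, and the parameter defaults `k1=1.5` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: str, corpus: List[str], k1: float = 1.5, b: float = 0.75
) -> List[float]:
    """Score every document in `corpus` against `query` with Okapi BM25."""
    docs = [doc.lower().split() for doc in corpus]  # naive whitespace tokenizer
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates and is normalized by document length
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (illustrative only)
corpus = [
    "use from_pretrained to load a model",
    "push a dataset to the hub",
    "the tokenizer splits text into tokens",
]
scores = bm25_scores("load model from_pretrained", corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
print(corpus[best])
```

Because BM25 matches exact tokens, a query that paraphrases the document wording can score zero, which is the weakness the later agentic section works around by trying several search terms.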

## Define Simple RAG

In [None]:
from typing import Any, Dict, Optional
from openai import AsyncOpenAI

class RAG:
    """Simple RAG system for document retrieval and answer generation."""

    def __init__(self, llm_client: AsyncOpenAI, retriever: BM25Retriever, system_prompt=None, model="gpt-4o-mini", default_k=3):
        self.llm_client = llm_client
        self.retriever = retriever
        self.model = model
        self.default_k = default_k
        self.system_prompt = system_prompt or "Answer only based on documents. Be concise.\n\nQuestion: {query}\nDocuments:\n{context}\nAnswer:"

    async def query(self, question: str, top_k: Optional[int] = None) -> Dict[str, Any]:
        """Query the RAG system."""
        if top_k is None:
            top_k = self.default_k

        return await self._naive_query(question, top_k)

    async def _naive_query(self, question: str, top_k: int) -> Dict[str, Any]:
        """Handle naive RAG: retrieve once, then generate."""
        # 1. Retrieve documents using BM25
        docs = self.retriever.retrieve(question, top_k)

        if not docs:
            return {"answer": "No relevant documents found.", "retrieved_documents": [], "num_retrieved": 0}

        # 2. Build context from retrieved documents
        context = "\n\n".join([f"Document {i}:\n{doc.page_content}" for i, doc in enumerate(docs, 1)])
        prompt = self.system_prompt.format(query=question, context=context)

        # 3. Generate response using OpenAI with retrieved context
        response = await self.llm_client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}]
        )

        return {
            "answer": response.choices[0].message.content.strip(),
            "retrieved_documents": [{"content": doc.page_content, "metadata": doc.metadata, "document_id": i} for i, doc in enumerate(docs)],
            "num_retrieved": len(docs)
        }

## Create evaluation dataset

We'll use `huggingface_doc_qa_eval`, a dataset of questions and answers about Hugging Face documentation.



The evaluation script downloads the dataset from the Ragas GitHub repository and converts it into Ragas Dataset format:

In [None]:
import urllib.request
from pathlib import Path
from ragas import Dataset
import pandas as pd

def download_and_save_dataset() -> Path:
    dataset_path = Path("datasets/hf_doc_qa_eval.csv")
    dataset_path.parent.mkdir(exist_ok=True)

    if not dataset_path.exists():
        github_url = "https://raw.githubusercontent.com/explodinggradients/ragas/main/examples/ragas_examples/improve_rag/datasets/hf_doc_qa_eval.csv"
        urllib.request.urlretrieve(github_url, dataset_path)

    return dataset_path

def create_ragas_dataset(dataset_path: Path) -> Dataset:
    dataset = Dataset(name="hf_doc_qa_eval", backend="local/csv", root_dir=".")
    df = pd.read_csv(dataset_path)

    for _, row in df.iterrows():
        dataset.append({"question": row["question"], "expected_answer": row["expected_answer"]})

    dataset.save()
    return dataset

## Set up metrics for RAG evaluation
Now that we have our evaluation dataset ready, we need metrics to measure RAG performance. Start with simple, focused metrics that directly measure your core use case.

Here we use a correctness discrete metric that evaluates whether the RAG response contains the key information from the expected answer and is factually accurate based on the provided context.

In [None]:
from ragas.metrics import DiscreteMetric

# Define correctness metric
correctness_metric = DiscreteMetric(
    name="correctness",
    prompt="""Compare the model response to the expected answer and determine if it's correct.

Consider the response correct if it:
1. Contains the key information from the expected answer
2. Is factually accurate based on the provided context
3. Adequately addresses the question asked

Return 'pass' if the response is correct, 'fail' if it's incorrect.

Question: {question}
Expected Answer: {expected_answer}
Model Response: {response}

Evaluation:""",
    allowed_values=["pass", "fail"],
)
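
The sketch below shows, in plain Python, the two mechanical pieces such a metric relies on: filling the prompt template and checking that the judge's output is one of the allowed values. It is a simplified assumption about the flow, not the `DiscreteMetric` internals; `validate_label`, the shortened `PROMPT`, and the example question are hypothetical.

```python
from typing import List

# Shortened version of the correctness prompt, with the same placeholders
PROMPT = (
    "Return 'pass' if the response is correct, 'fail' if it's incorrect.\n\n"
    "Question: {question}\n"
    "Expected Answer: {expected_answer}\n"
    "Model Response: {response}\n\n"
    "Evaluation:"
)


def validate_label(raw: str, allowed_values: List[str]) -> str:
    """Normalize a judge's raw output and reject anything outside the allowed set."""
    label = raw.strip().strip("'\"").lower()
    if label not in allowed_values:
        raise ValueError(f"Unexpected label {raw!r}; expected one of {allowed_values}")
    return label


# Placeholder values standing in for one dataset row and one model response
filled = PROMPT.format(
    question="What does BM25 rank by?",
    expected_answer="Term frequency and inverse document frequency.",
    response="It ranks by term overlap weighted by rarity.",
)
print(filled)
print(validate_label(" Pass ", ["pass", "fail"]))
```

Restricting the output to a small set of labels keeps the scores easy to aggregate into a pass rate later.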

Now that we have our evaluation metric, we need to run it systematically across our dataset. This is where Ragas experiments come in.

## Create the evaluation experiment
The experiment function runs your RAG system on each data sample and evaluates the response using our correctness metric.

The experiment function takes a dataset row containing the question and expected answer, then:

1. Queries the RAG system with the question
2. Evaluates the response using the correctness metric
3. Returns detailed results, including the score and the reason for it

In [None]:
import asyncio
from typing import Dict, Any
from ragas import experiment


@experiment()
async def evaluate_rag(row: Dict[str, Any], rag: RAG, llm) -> Dict[str, Any]:
    """
    Run RAG evaluation on a single row.

    Args:
        row: Dictionary containing question and expected_answer
        rag: Pre-initialized RAG instance
        llm: Pre-initialized LLM client for evaluation

    Returns:
        Dictionary with evaluation results
    """
    question = row["question"]

    # Query the RAG system
    rag_response = await rag.query(question, top_k=4)
    model_response = rag_response.get("answer", "")

    # Evaluate correctness asynchronously
    score = await correctness_metric.ascore(
        question=question,
        expected_answer=row["expected_answer"],
        response=model_response,
        llm=llm
    )

    # Return evaluation results
    result = {
        **row,
        "model_response": model_response,
        "correctness_score": score.value,
        "correctness_reason": score.reason,
        "mlflow_trace_id": rag_response.get("mlflow_trace_id", "N/A"),  # Trace ID if the RAG returns one; "N/A" otherwise
        "retrieved_documents": [
            doc.get("content", "")[:200] + "..." if len(doc.get("content", "")) > 200 else doc.get("content", "")
            for doc in rag_response.get("retrieved_documents", [])
        ]
    }

    return result
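
Conceptually, running an experiment means applying an async function like this to every row and collecting the results. The sketch below illustrates that pattern with `asyncio` only; it is not how the `@experiment` decorator is implemented, and `run_over_rows`, `fake_evaluate`, and the `max_concurrency` parameter are hypothetical names.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


async def run_over_rows(
    fn: Callable[[Dict[str, Any]], Awaitable[Dict[str, Any]]],
    rows: List[Dict[str, Any]],
    max_concurrency: int = 4,
) -> List[Dict[str, Any]]:
    """Apply async `fn` to each row, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(row: Dict[str, Any]) -> Dict[str, Any]:
        async with semaphore:
            return await fn(row)

    # gather preserves input order in its result list
    return await asyncio.gather(*(guarded(row) for row in rows))


async def fake_evaluate(row: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for evaluate_rag: marks a row as passing if it has an answer."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    score = "pass" if row["expected_answer"] else "fail"
    return {**row, "correctness_score": score}


rows = [
    {"question": "q1", "expected_answer": "a1"},
    {"question": "q2", "expected_answer": ""},
]
results = asyncio.run(run_over_rows(fake_evaluate, rows))
print([r["correctness_score"] for r in results])
```

Inside a notebook, where an event loop is already running, the last call would be `results = await run_over_rows(fake_evaluate, rows)` instead of `asyncio.run(...)`.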

With our dataset, metrics, and experiment function ready, we can now evaluate our RAG system's performance.

## Run initial RAG experiment
Now let's run the complete evaluation pipeline to get baseline performance metrics for our RAG system:

In [None]:
# Import required components
import asyncio
from datetime import datetime
from openai import AsyncOpenAI
from langchain_openai import ChatOpenAI
from ragas.llms import llm_factory
import os

async def run_evaluation():
    # Download and prepare dataset
    dataset_path = download_and_save_dataset()
    dataset = create_ragas_dataset(dataset_path)

    # Initialize RAG components
    openai_client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    retriever = BM25Retriever()
    rag = RAG(llm_client=openai_client, retriever=retriever)
    llm = llm_factory('gpt-4o-mini', client=openai_client)

    # Run evaluation experiment
    exp_name = f"{datetime.now().strftime('%Y%m%d-%H%M%S')}_naiverag"
    results = await evaluate_rag.arun(
        dataset, 
        name=exp_name,
        rag=rag,
        llm=llm
    )

    # Print results
    if results:
        pass_count = sum(1 for result in results if result.get("correctness_score") == "pass")
        total_count = len(results)
        pass_rate = (pass_count / total_count) * 100 if total_count > 0 else 0
        print(f"Results: {pass_count}/{total_count} passed ({pass_rate:.1f}%)")

    return results

In [None]:
# Run the evaluation
results = await run_evaluation()
print(results)

Running experiment: 100%|██████████| 66/66 [01:05<00:00,  1.00it/s]

Results: 43/66 passed (65.2%)
Experiment(name=20251111-230933_naiverag,  len=66)





This downloads the dataset, initializes the BM25 retriever, runs the evaluation experiment on each sample, and saves detailed results to the experiments/ directory as CSV files for analysis.

With a `65.2%` pass rate, we now have a baseline. The detailed results CSV in experiments/ contains all the data we need for error analysis and systematic improvement.
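
A first pass at error analysis can be as simple as listing the failing rows with their reasons. The sketch below assumes the results CSV has columns named after the keys returned by `evaluate_rag` (`question`, `model_response`, `correctness_score`, `correctness_reason`); the helper `failed_rows` and the in-memory sample data are hypothetical.

```python
import csv
import io
from typing import Dict, List


def failed_rows(csv_text: str) -> List[Dict[str, str]]:
    """Return the rows whose correctness_score is not 'pass'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["correctness_score"] != "pass"]


# Tiny in-memory stand-in for a results file; real files live under experiments/
sample_csv = """question,model_response,correctness_score,correctness_reason
q1,answer one,pass,matches expected answer
q2,answer two,fail,missing key detail
q3,answer three,fail,retrieved documents were off-topic
"""
failures = failed_rows(sample_csv)
for row in failures:
    print(row["question"], "->", row["correctness_reason"])
```

For a real run, the same function could be fed the contents of the experiment's CSV file, for example via `Path(...).read_text()`.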

## Improve the RAG app
With retrieval identified as the primary bottleneck, there are two broad ways to improve the system: traditional retrieval tuning and an agentic approach.

Traditional approaches focus on better chunking, hybrid search, or vector embeddings. However, since our BM25 retrieval consistently misses relevant documents with single queries, we'll explore an agentic approach instead.

`Agentic RAG` lets the AI iteratively refine its search strategy - trying multiple search terms and deciding when it has found sufficient context, rather than relying on one static query.
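
The control flow behind this idea can be sketched without an LLM. In the hypothetical example below, a fixed list of candidate queries stands in for the search terms an agent would propose, and a simple `min_docs` threshold stands in for the agent's judgment that it has enough context; `iterative_retrieve`, `keyword_search`, and the toy documents are all assumptions for illustration.

```python
from typing import Callable, List


def iterative_retrieve(
    queries: List[str],
    search: Callable[[str], List[str]],
    min_docs: int = 2,
) -> List[str]:
    """Try each query in turn and stop once `min_docs` unique documents are found."""
    found: List[str] = []
    for query in queries:
        for doc in search(query):
            if doc not in found:
                found.append(doc)
        if len(found) >= min_docs:
            break  # enough context; no need to try the remaining queries
    return found


# Toy keyword search over three documents (illustrative data only)
DOCS = [
    "from_pretrained loads a model from the Hub",
    "torchrun launches distributed training",
    "model weights can be loaded with from_pretrained",
]


def keyword_search(query: str) -> List[str]:
    """Return documents sharing at least one exact token with the query."""
    terms = query.lower().split()
    return [d for d in DOCS if any(t in d.lower().split() for t in terms)]


context = iterative_retrieve(
    ["how to load checkpoint", "from_pretrained"], keyword_search
)
print(context)
```

Here the first, natural-language query matches nothing because no document contains the exact tokens, while the second, exact-term query finds both relevant documents. A single static query would have stopped after the first attempt.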

## Agentic RAG implementation

In [None]:
class ImprovedRAG(RAG):
    """RAG system that can operate in naive or agentic mode."""

    def __init__(self, llm_client: AsyncOpenAI, retriever: BM25Retriever, mode="agentic", system_prompt=None, model="gpt-5-mini", default_k=3):
        super().__init__(
            llm_client=llm_client,
            retriever=retriever,
            system_prompt=system_prompt,
            model=model,
            default_k=default_k
        )
        self.mode = mode.lower()
        self._agent = None
        
        if self.mode == "agentic":
            self._setup_agent()

    def _setup_agent(self):
        """Setup agent for agentic mode."""
        try:
            from agents import Agent, function_tool
        except ImportError:
            raise ImportError("agents package required for agentic mode")

        @function_tool
        def retrieve(query: str) -> str:
            """Search Hugging Face docs for technical info, APIs, commands, and examples.
            Use exact terms (e.g., "from_pretrained", "ESPnet upload", "torchrun"). 
            Try 2-3 targeted searches: specific terms → tool names → alternatives."""
            docs = self.retriever.retrieve(query, self.default_k)
            if not docs:
                return f"No documents found for '{query}'. Try different search terms or break down the query into smaller parts."
            return "\n\n".join([f"Doc {i}: {doc.page_content}" for i, doc in enumerate(docs, 1)])

        self._agent = Agent(
            name="RAG Assistant",
            model=self.model,
            instructions="Search with exact terms first (commands, APIs, tool names). Try 2-3 different searches if needed. Only answer from retrieved documents. Preserve exact syntax and technical details.",
            tools=[retrieve]
        )

    async def _agentic_query(self, question: str, top_k: int) -> Dict[str, Any]:
        """Handle agentic mode: agent controls retrieval strategy."""
        try:
            from agents import Runner
        except ImportError:
            raise ImportError("agents package required for agentic mode")
        
        # Let agent handle the retrieval and reasoning
        result = await Runner.run(self._agent, input=question)
        
        # In agentic mode, the agent controls retrieval internally
        # so we don't return specific retrieved documents
        return {
            "answer": result.final_output,
            "retrieved_documents": [],  # Agent handles retrieval internally
            "num_retrieved": 0,  # Cannot determine exact count from agent execution
        }

    async def query(self, question: str, top_k: Optional[int] = None) -> Dict[str, Any]:
        """Query the RAG system."""
        if top_k is None:
            top_k = self.default_k
            
        try:
            if self.mode == "naive":
                return await self._naive_query(question, top_k)
            elif self.mode == "agentic":
                return await self._agentic_query(question, top_k)
            else:
                raise ValueError(f"Unknown mode: {self.mode}")
        except Exception as e:
            return {
                "answer": f"Error: {str(e)}", 
                "retrieved_documents": [], 
                "num_retrieved": 0,
            }

Unlike naive mode's single retrieval call, the agent autonomously decides when and how to search - trying multiple keyword combinations until it finds sufficient context.

Run the Agentic RAG app for a sample query:

In [None]:
openai_client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
retriever = BM25Retriever()

# Switch to agentic mode
rag_agentic = ImprovedRAG(openai_client, retriever, mode="agentic")

In [None]:
question = "What architecture is the `tokenizers-linux-x64-musl` binary designed for?"
result = await rag_agentic.query(question)
print(f"Answer: {result['answer']}")

Answer: It's the x86_64 architecture — specifically the "x86_64-unknown-linux-musl" binary.


## Run experiment again and compare results
Now let's evaluate the agentic RAG approach:

In [None]:
# Import required components
import asyncio
from datetime import datetime


async def run_agentic_evaluation():
    # Download and prepare dataset
    dataset_path = download_and_save_dataset()
    dataset = create_ragas_dataset(dataset_path)

    # Initialize RAG components with agentic mode
    openai_client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    retriever = BM25Retriever()
    rag = ImprovedRAG(llm_client=openai_client, retriever=retriever, model="gpt-5-mini", mode="agentic")
    llm = llm_factory('gpt-4o-mini', client=openai_client)

    # Run evaluation experiment
    exp_name = f"{datetime.now().strftime('%Y%m%d-%H%M%S')}_agenticrag"
    results = await evaluate_rag.arun(
        dataset, 
        name=exp_name,
        rag=rag,
        llm=llm
    )

    # Print results
    if results:
        pass_count = sum(1 for result in results if result.get("correctness_score") == "pass")
        total_count = len(results)
        pass_rate = (pass_count / total_count) * 100 if total_count > 0 else 0
        print(f"Results: {pass_count}/{total_count} passed ({pass_rate:.1f}%)")

    return results

# Run the agentic evaluation
results = await run_agentic_evaluation()
print("\nDetailed results:")
print(results)

Running experiment: 100%|██████████| 66/66 [01:27<00:00,  1.33s/it]

Results: 58/66 passed (87.9%)

Detailed results:
Experiment(name=20251111-231055_agenticrag,  len=66)





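Excellent! The pass rate rose from `65.2%` (naive) to `87.9%` (agentic), a `22.7` percentage point improvement with the agentic RAG approach. Note that the agentic run also switches the answering model from `gpt-4o-mini` to `gpt-5-mini`, so part of the gain may come from the model rather than the retrieval strategy.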
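
The reported figures can be reproduced from the raw pass counts in the two runs above (43 and 58 out of 66). The short calculation below also shows the relative gain over the baseline, which is a derived number not stated in the run output.

```python
naive_pass, agentic_pass, total = 43, 58, 66

naive_rate = naive_pass / total * 100
agentic_rate = agentic_pass / total * 100

# Absolute gain in percentage points vs. relative gain over the baseline
point_gain = agentic_rate - naive_rate
relative_gain = (agentic_pass - naive_pass) / naive_pass * 100

print(
    f"{naive_rate:.1f}% -> {agentic_rate:.1f}% "
    f"(+{point_gain:.1f} points, +{relative_gain:.1f}% relative)"
)
# 65.2% -> 87.9% (+22.7 points, +34.9% relative)
```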