### Importance of a High-Quality Test Dataset for RAG Systems

### Purpose
Test sets are critical for:
- Accurately measuring RAG system performance
- Identifying system strengths and weaknesses
- Guiding continuous improvement

### Key Evaluation Dimensions
1. **Retrieval Effectiveness**: Assess how well relevant context is retrieved
2. **Generation Quality**: Evaluate the accuracy and coherence of generated responses
3. **Contextual Relevance**: Measure how well the system understands and integrates retrieved information

### Evaluation Goals
- Benchmark system performance
- Detect hallucinations
- Validate generalization capabilities
- Simulate real-world query complexity

### Best Practices
- Use diverse query types
- Cover multiple domains
- Include edge cases
- Create reproducible test scenarios
- Include enough samples to draw statistically significant conclusions
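
To get a rough feel for what "enough samples" means, the sketch below computes the half-width of an approximate 95% confidence interval for a pass rate, using the normal approximation for a proportion. The function name `margin_of_error` and the pass/fail framing are illustrative assumptions, not part of RAGAs.

```python
import math


def margin_of_error(n_samples: int, pass_rate: float = 0.5, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for a proportion.

    Uses the normal approximation; pass_rate=0.5 is the worst case.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n_samples)


# Compare how the uncertainty shrinks as the test set grows.
for n in (50, 200, 1000):
    print(f"n={n}: +/- {margin_of_error(n):.3f}")
```

With 50 samples the worst-case uncertainty is roughly 14 percentage points, so small differences between two system versions cannot be told apart reliably at that size.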

### 1. Retrieve data:
* The data can be a dataset from `huggingface` or any other platform.

* Alternatively, use files available on disk (PDF, Markdown, etc.).

* One can also use `AsyncHtmlLoader` from `langchain` to scrape pages from the internet.
    - **Be careful not to violate any terms and conditions when scraping the web!**
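
As a first sanity check before scraping, a site's `robots.txt` can be consulted with the standard library. This is only a sketch: the helper name `allowed_by_robots` and the example rules are made up for illustration, and passing this check does not replace reading the site's actual terms and conditions.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt content permits fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, used here instead of a network request.
example_robots = """\
User-agent: *
Disallow: /private/
"""

print(allowed_by_robots(example_robots, "https://example.com/docs/page.html"))
print(allowed_by_robots(example_robots, "https://example.com/private/data.html"))
```

For a live site, `RobotFileParser.set_url(...)` followed by `read()` fetches the real file instead of parsing a string.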

In [1]:
!git clone https://huggingface.co/datasets/explodinggradients/ragas-airline-dataset data

Cloning into 'data'...
remote: Enumerating objects: 14, done.
remote: Total 14 (delta 0), reused 0 (delta 0), pack-reused 14 (from 1)
Unpacking objects: 100% (14/14), 16.16 KiB | 2.31 MiB/s, done.


**NOTE**:
* I generate my test data with **DeepEval**, but this notebook can be used as an alternative.
* If you want to run the notebook, install the requirements first.
    * To do so, run `setup.sh` in the parent folder.
* Make sure you select the matching environment as your kernel.

### 2. Load data into document objects

I prefer `langchain`, but `llama-index` works as well.

In [None]:
from langchain_community.document_loaders import DirectoryLoader

path = "data/"
loader = DirectoryLoader(
    path,
    glob="**/*.md",
    exclude="README.md"
)
docs = loader.load()
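
Before loading, it can help to preview which files the pattern is expected to pick up. The sketch below does this with `pathlib` only; `preview_markdown_files` is a hypothetical helper, and whether it matches exactly the same files as `DirectoryLoader` with `glob` and `exclude` is an assumption worth verifying on your own data.

```python
from pathlib import Path


def preview_markdown_files(root: str, excluded_names: tuple[str, ...] = ("README.md",)) -> list[Path]:
    """List all .md files below `root`, skipping files with an excluded name."""
    return sorted(
        p for p in Path(root).rglob("*.md")
        if p.is_file() and p.name not in excluded_names
    )


# Only runs if the cloned dataset folder from step 1 exists.
if Path("data").is_dir():
    for path in preview_markdown_files("data"):
        print(path)
```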

### 3. Construct knowledge graph

- A **knowledge graph** is a fundamental concept in **RAGAs** and the basis of its **automatic data generation**.

- Initially, a **knowledge graph** consists only of **Node** objects, which represent **documents**: their content and additional metadata.

- Afterwards, the graph is enriched by **applying various transformations** to it, and **relationships get built** that express some kind of connection between Node objects. Transformations are applied through **Extractors** and/or **RelationshipBuilder** objects. Depending on its type, an extractor gathers relevant data from the documents, and that data is then used to logically connect two or more nodes.

- The graph is then used to generate so-called **scenarios**, and can also be used to generate **personas**, from which the test samples are derived.
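
The idea of nodes, extractors and relationship builders can be illustrated with a toy model in plain Python. None of the names below (`ToyNode`, `ToyRelationship`, `extract_entities`, `build_entity_overlap`) are RAGAs classes, and the documents are invented; the real extractors use an LLM rather than a fixed vocabulary.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class ToyNode:
    node_id: str
    properties: dict = field(default_factory=dict)


@dataclass
class ToyRelationship:
    source: str
    target: str
    kind: str
    shared: set


def extract_entities(node: ToyNode, vocabulary: set[str]) -> None:
    """Toy 'extractor': store the vocabulary terms found in the node's text."""
    text = node.properties["page_content"].lower()
    node.properties["entities"] = {term for term in vocabulary if term in text}


def build_entity_overlap(nodes: list[ToyNode]) -> list[ToyRelationship]:
    """Toy 'relationship builder': link every pair of nodes sharing an entity."""
    relationships = []
    for a, b in combinations(nodes, 2):
        shared = a.properties["entities"] & b.properties["entities"]
        if shared:
            relationships.append(ToyRelationship(a.node_id, b.node_id, "entity_overlap", shared))
    return relationships


# Invented documents standing in for loaded files.
nodes = [
    ToyNode("baggage", {"page_content": "Checked baggage fees and lost baggage claims."}),
    ToyNode("delays", {"page_content": "Compensation for flight delays and lost baggage."}),
    ToyNode("booking", {"page_content": "How to change a booking online."}),
]
for node in nodes:
    extract_entities(node, {"baggage", "compensation", "booking"})

for rel in build_entity_overlap(nodes):
    print(rel.source, "<->", rel.target, sorted(rel.shared))
```

Only the first two nodes end up connected, because they are the only pair that shares an extracted entity.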

In [2]:
from ragas.testset.graph import (
    Node,
    NodeType,
    KnowledgeGraph,
)

kg = KnowledgeGraph()

for doc in docs:
    kg.add(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )



### 4. Instantiate required objects

- **RAGAs** requires a **large language model** and an **embedding model** to apply the **transformations** to the **knowledge graph**. For that purpose, one must create **wrapper** objects for both models. `langchain` and `llama-index` are both supported.

- Additionally, a **configuration** can be used to modify the default behaviour of the framework. For example, timeout values and the maximum number of retries for failed operations can be set through the **RunConfig**.
    - **NOTE**: since llama3.1:8b is not particularly efficient, you may need to modify the `timeout` value.

- Lastly, **RAGAs** provides a single implementation for caching intermediate steps on disk: the **DiskCacheBackend** class.

In [None]:
import os
from dotenv import load_dotenv
from langchain_ollama.llms import OllamaLLM
from langchain_ollama.embeddings import OllamaEmbeddings

from ragas import (
    RunConfig,
    DiskCacheBackend
)
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

run_config = RunConfig(
    timeout=14400, # This may need to be much higher depending on the GPU
    max_retries=15,
    max_wait=30
)

cacher = DiskCacheBackend(cache_dir=".cache")

load_dotenv("../../env/rag.env")
chat_model = os.getenv("CHAT_MODEL")
embedding_model = os.getenv("EMBEDDING_MODEL")
temperature = float(os.getenv("TEMPERATURE"))
num_ctx = int(os.getenv("LLM_CONTEXT_WINDOW_TOKENS"))

ollama_llm = OllamaLLM(
    model=chat_model,
    base_url="http://localhost:11434",
    temperature=temperature,
    num_ctx=num_ctx,
    format="json"
)

ollama_embeddings = OllamaEmbeddings(
    model=embedding_model,
    base_url="http://localhost:11434"
)

langchain_llm = LangchainLLMWrapper(
    langchain_llm=ollama_llm,
    run_config=run_config,
    cache=cacher
)

langchain_embeddings = LangchainEmbeddingsWrapper(
    embeddings=ollama_embeddings,
    run_config=run_config,
    cache=cacher
)
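
Calling `float(os.getenv(...))` fails with a `TypeError` when a variable is missing from the `.env` file. A more defensive way to read the numeric settings could look like the sketch below; the helper `env_number` and the fallback values are assumptions chosen for illustration, not values taken from this project.

```python
import os


def env_number(name: str, default, cast=float):
    """Read a numeric setting from the environment, falling back to a default."""
    raw = os.getenv(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        return cast(raw)
    except ValueError as exc:
        raise ValueError(
            f"Environment variable {name}={raw!r} is not a valid {cast.__name__}"
        ) from exc


# The defaults below are placeholders, not recommended values.
temperature = env_number("TEMPERATURE", default=0.0)
num_ctx = env_number("LLM_CONTEXT_WINDOW_TOKENS", default=8192, cast=int)
```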

### 5. Create the transformation pipeline

The sequence of transformations:

1. HeadlinesExtractor and HeadlineSplitter
    - This step ensures that longer documents are split into logical sections

2. Named Entity Recognition (NER) & Theme Extraction  
    - NERExtractor identifies and extracts named entities (e.g., people, organizations, locations).  
    - ThemesExtractor detects overarching topics/themes in each chunk.

3. Summary Extraction and Summary Embedding Extraction
    - Relevant for the MultiHopAbstractQuerySynthesizer

4. Extraction of key phrases
    - Gain additional insights into documents

5. OverlapScoreBuilder and CosineSimilarityBuilder
    - `OverlapScoreBuilder` is used to link nodes containing overlapping entities
    - `CosineSimilarityBuilder` links semantically close nodes by their summary embeddings

6. Parallel Processing for Efficiency
    - Certain transformations run in parallel to improve performance

**Final outcome:**
- A structured set of document transformations that extract valuable information
- Used to enrich the knowledge graph for further generation of scenarios and finally samples
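
How a similarity threshold turns summary embeddings into relationships can be sketched in a few lines. The vectors are made-up three-dimensional stand-ins for real embeddings, and `similar_pairs` is a hypothetical helper; whether the library compares with `>=` or `>` at the threshold is not something this sketch claims to reproduce.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_pairs(embeddings: dict, threshold: float = 0.7):
    """Return (name_a, name_b, score) for every pair at or above the threshold."""
    pairs = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            pairs.append((name_a, name_b, round(score, 3)))
    return pairs


# Toy "summary embeddings"; real ones have hundreds of dimensions.
toy_summary_embeddings = {
    "baggage_policy": [0.9, 0.1, 0.0],
    "lost_baggage": [0.8, 0.2, 0.1],
    "loyalty_program": [0.0, 0.1, 0.9],
}
print(similar_pairs(toy_summary_embeddings))
```

Only the two baggage-related vectors clear the 0.7 threshold, so only that pair would be linked. A lower threshold produces a denser graph and more multi-hop candidates.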

In [4]:
from ragas.testset.transforms.engine import Parallel
from ragas.testset.transforms.extractors.llm_based import (
    HeadlinesExtractor,
    SummaryExtractor,
    NERExtractor,
    ThemesExtractor,
    KeyphrasesExtractor,
)
from ragas.testset.transforms.relationship_builders import (
    OverlapScoreBuilder,
    CosineSimilarityBuilder
)
from ragas.testset.transforms.splitters import HeadlineSplitter
from ragas.testset.transforms.extractors.embeddings import EmbeddingExtractor

headline_extractor = HeadlinesExtractor(
    llm=langchain_llm,
    max_num=10
)

headline_splitter = HeadlineSplitter(
    max_tokens=1500
)

summary_extractor = SummaryExtractor(
    llm=langchain_llm
)

ner_extractor = NERExtractor(
    llm=langchain_llm,
    max_num_entities=20
)

themes_extractor = ThemesExtractor(
    llm=langchain_llm,
    max_num_themes=20
)

summary_emb_extractor = EmbeddingExtractor(
    property_name="summary_embedding",
    embed_property_name="summary",
    embedding_model=langchain_embeddings,
)

keyphrase_extractor = KeyphrasesExtractor(
    llm=langchain_llm,
    max_num=15
)

ner_overlap_sim = OverlapScoreBuilder(
    threshold=0.01
)

cosine_sim_builder = CosineSimilarityBuilder(
    property_name="summary_embedding",
    new_property_name="summary_similarity",
    threshold=0.7
)

transforms = [
    headline_extractor,
    headline_splitter,
    summary_extractor,
    Parallel(
        summary_emb_extractor,
        themes_extractor,
        ner_extractor,
        keyphrase_extractor,
    ),
    Parallel(
        cosine_sim_builder, 
        ner_overlap_sim
    )
]

### 6. Apply the transformations to the knowledge graph

In the cell below, `apply_transforms` applies all the previously defined transformations, enriching the knowledge graph in the process.

In [5]:
from ragas.testset.transforms import apply_transforms

apply_transforms(
    kg,
    transforms,
    run_config
)

Applying SummaryExtractor:   0%|          | 0/23 [00:00<?, ?it/s] Property 'summary' already exists in node '0d39a9'. Skipping!
Property 'summary' already exists in node 'b23be3'. Skipping!
Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor, KeyphrasesExtractor]:   0%|          | 0/92 [00:00<?, ?it/s]Property 'summary_embedding' already exists in node '0d39a9'. Skipping!
Property 'summary_embedding' already exists in node 'b23be3'. Skipping!
Property 'themes' already exists in node '0d39a9'. Skipping!
Property 'themes' already exists in node 'b23be3'. Skipping!
Property 'entities' already exists in node '0d39a9'. Skipping!
Property 'entities' already exists in node 'b23be3'. Skipping!
Property 'keyphrases' already exists in node '0d39a9'. Skipping!
Property 'keyphrases' already exists in node 'b23be3'. Skipping!
                                                                                                                                 

### 7. Generating personas

- A **Persona** is an entity/role which interacts with the system. **Personas** provide context and perspective, ensuring that **generated queries are natural, user-specific, and diverse**.

- `Example: a Senior DevOps Engineer, a Junior Data Scientist, or a Marketing Manager in the context of an IT company`

- A **Persona** object consists of a **name** and a **role description** (`role_description`).

    - The name identifies the persona, and the role description describes its role.

In [6]:
from ragas.testset.persona import Persona

persona_first_time_flier = Persona(
    name="First Time Flier",
    role_description="Is flying for the first time and may feel anxious. Needs clear guidance on flight procedures, safety protocols, and what to expect throughout the journey.",
)

persona_frequent_flier = Persona(
    name="Frequent Flier",
    role_description="Travels regularly and values efficiency and comfort. Interested in loyalty programs, express services, and a seamless travel experience.",
)

persona_angry_business_flier = Persona(
    name="Angry Business Flier",
    role_description="Demands top-tier service and is easily irritated by any delays or issues. Expects immediate resolutions and is quick to express frustration if standards are not met.",
)

persona_traveler_with_medical_needs = Persona(
    name="Traveler with Medical Needs",
    role_description="Has specific medical requirements and needs information about carrying medications, requesting medical clearance, and accessing in-flight medical assistance. Concerned about potential health issues during travel.",
)

persona_family_traveling_with_children = Persona(
    name="Family with Children",
    role_description="Traveling with young children, including occasionally unaccompanied minors. Requires information about special services for families, meal options for kids, entertainment options, and how to manage multiple reservations efficiently.",
)

personas = [
    persona_first_time_flier,
    persona_frequent_flier,
    persona_angry_business_flier,
    persona_traveler_with_medical_needs,
    persona_family_traveling_with_children
]

### 8. Generate query types

- There are two main types of queries in **RAGAs**:
    
    - **SingleHopQuery** where the **context** relevant for answering a question lies in a **single document/chunk**

    - **MultiHopQuery** where the **context** relevant for answering a question lies in **multiple documents/chunks**

- Additionally, each of those query types comes in a **Specific** or an **Abstract** variant:

    - A **Specific** query pertains to a **fact**.

        - `Example: When did WW1 break out? (Can be answered precisely; there's no room for guessing or interpretation)`

    - An **Abstract** query is more about testing the **reasoning** capabilities of the LLM.

        - `Example: Why did WW1 break out? (There's room for interpretation in this case)`

**Synthesizers** are responsible for **converting enriched nodes and personas into queries**. They achieve this by **selecting a node property (e.g., "entities" or "keyphrases"), pairing it with a persona, style, and query length**, and then using an LLM to generate a query-answer pair based on the content of the node.
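
The pairing step can be pictured as sampling from the cross product of terms, personas, styles and lengths. The sketch below is a simplified illustration in plain Python: `sample_scenarios` is a hypothetical helper, the style and length values are invented, and the real synthesizers additionally use an LLM to match themes to personas.

```python
import random
from itertools import product


def sample_scenarios(terms, personas, styles, lengths, k, seed=0):
    """Draw up to k distinct (term, persona, style, length) combinations."""
    combos = list(product(terms, personas, styles, lengths))
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    return rng.sample(combos, min(k, len(combos)))


scenarios = sample_scenarios(
    terms=["baggage allowance", "flight delay"],
    personas=["First Time Flier", "Frequent Flier"],
    styles=["formal", "casual"],        # illustrative values
    lengths=["short", "long"],          # illustrative values
    k=3,
)
for term, persona, style, length in scenarios:
    print(f"{persona} asks a {length}, {style} question about '{term}'")
```

Each sampled tuple corresponds to one scenario, which an LLM would then turn into a concrete question and reference answer.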


In [9]:
import numpy as np
import typing as t
from dataclasses import dataclass

from langchain_core.callbacks import Callbacks

from ragas.testset.graph import (
    Node,
    KnowledgeGraph,
)
from ragas.prompt import PydanticPrompt
from ragas.testset.synthesizers.prompts import (
    ThemesPersonasInput,
    ThemesPersonasMatchingPrompt,
)
from ragas.testset.synthesizers.single_hop import (
    SingleHopScenario,
    SingleHopQuerySynthesizer,
)

# Doesn't filter based on document type but solely on the property
@dataclass
class MySingleHopSpecificQuerySynthesizer(SingleHopQuerySynthesizer):
    name: str = "single_hop_specifc_query_synthesizer"
    theme_persona_matching_prompt: PydanticPrompt = ThemesPersonasMatchingPrompt()
    property_name: str = "entities"

    def get_node_clusters(self, knowledge_graph: KnowledgeGraph) -> t.List[Node]:
        """
        Get all nodes that contain the configured property (`self.property_name`), regardless of type.
        """
        return [
            node
            for node in knowledge_graph.nodes
            if node.get_property(self.property_name) is not None
        ]

    async def _generate_scenarios(
        self,
        n: int,
        knowledge_graph: KnowledgeGraph,
        persona_list: t.List[Persona],
        callbacks: Callbacks,
    ) -> t.List[SingleHopScenario]:
        """
        Generates a list of scenarios based on all nodes that have the `entities` property.
        """

        nodes = self.get_node_clusters(knowledge_graph)
        if len(nodes) == 0:
            raise ValueError(f"No nodes found with the `{self.property_name}` property.")
        samples_per_node = int(np.ceil(n / len(nodes)))

        scenarios = []
        for node in nodes:
            if len(scenarios) >= n:
                break
            themes = node.properties.get(self.property_name, [""])
            prompt_input = ThemesPersonasInput(themes=themes, personas=persona_list)
            persona_concepts = await self.theme_persona_matching_prompt.generate(
                data=prompt_input, llm=self.llm, callbacks=callbacks
            )
            base_scenarios = self.prepare_combinations(
                node,
                themes,
                personas=persona_list,
                persona_concepts=persona_concepts.mapping,
            )
            scenarios.extend(self.sample_combinations(base_scenarios, samples_per_node))

        return scenarios


In [10]:
from ragas.testset.synthesizers.multi_hop.specific import MultiHopSpecificQuerySynthesizer
from ragas.testset.synthesizers.multi_hop.abstract import MultiHopAbstractQuerySynthesizer

single_hop_specific_entities = MySingleHopSpecificQuerySynthesizer(
    llm=langchain_llm,
    property_name="entities"
)

single_hop_specific_keyphrases = MySingleHopSpecificQuerySynthesizer(
    llm=langchain_llm,
    property_name="keyphrases"
)

multi_hop_specific_entities = MultiHopSpecificQuerySynthesizer(
    llm=langchain_llm
)

multi_hop_abstract_entities = MultiHopAbstractQuerySynthesizer(
    llm=langchain_llm
)

query_distribution = [
    (single_hop_specific_entities, 0.25),
    (single_hop_specific_keyphrases, 0.25),
    (multi_hop_specific_entities, 0.25),
    (multi_hop_abstract_entities, 0.25)
]
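
To plan how many samples each synthesizer is expected to contribute, the weights can be turned into integer counts. The sketch below uses largest-remainder rounding; `allocate_samples` is a hypothetical helper and not how RAGAs splits the test set internally (the run in this notebook produced 52 samples for a requested 50, so the library evidently rounds differently).

```python
import math


def allocate_samples(total: int, weights: list[float]) -> list[int]:
    """Split `total` into integer counts proportional to `weights`."""
    if not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
        raise ValueError(f"weights must sum to 1.0, got {sum(weights)}")
    exact = [total * w for w in weights]
    counts = [math.floor(x) for x in exact]
    leftover = total - sum(counts)
    # Hand the remaining samples to the entries with the largest fractional part.
    by_remainder = sorted(
        range(len(weights)), key=lambda i: exact[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


print(allocate_samples(50, [0.25, 0.25, 0.25, 0.25]))
```

The explicit sum check also catches a common mistake: editing one weight in the distribution without adjusting the others.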

### 9. Generate the samples

### Definition of Evaluation Sample

An evaluation sample is a single structured data instance that is used to assess and measure the performance of your LLM application in specific scenarios. It represents a single unit of interaction or a specific use case that the AI application is expected to handle. In Ragas, evaluation samples are represented using the `SingleTurnSample` and `MultiTurnSample` classes.

### SingleTurnSample

`SingleTurnSample` represents a single-turn interaction between a user, LLM, and expected results for evaluation. It is suitable for evaluations that involve a single question and answer pair, possibly with additional context or reference information.

This type of sample is ideal for straightforward question-answering scenarios where a user asks a single question and expects a direct response.

### MultiTurnSample

`MultiTurnSample` represents a multi-turn interaction between Human, AI and optionally a Tool and expected results for evaluation. It is suitable for representing conversational agents in more complex interactions for evaluation.

In `MultiTurnSample`, the `user_input` attribute represents a sequence of messages that collectively form a multi-turn conversation between a human user and an AI system. These messages are instances of the classes `HumanMessage`, `AIMessage`, and `ToolMessage`.

This type of sample is designed for evaluating more complex conversational flows where multiple turns of dialogue occur, potentially involving tool usage for gathering additional information.

In [None]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(
    langchain_llm,
    langchain_embeddings,
    kg,
    personas
)

dataset = generator.generate(
    testset_size=50,
    query_distribution=query_distribution,
    num_personas=5,
    run_config=run_config,
    with_debugging_logs=True,
)

Generating Scenarios: 100%|██████████| 4/4 [1:02:56<00:00, 944.13s/it] 
Generating Samples: 100%|██████████| 52/52 [18:05<00:00, 20.88s/it]


### 10. Ingest data into R2R if not already ingested

* **Note:** you can also use the frontend at `http://localhost:8501` after starting the application.
* Alternatively, a script will also do the job:

```python
from pathlib import Path
from r2r import R2RClient, R2RException

client = R2RClient(
    base_url="http://localhost:7272", # May be different for you
    timeout=600
)

dir_path = Path("data")
for item in dir_path.iterdir():
    if item.is_file() and item.suffix == '.md' and item.name != "README.md":
        try:
            client.documents.create(
                file_path=str(item),
                ingestion_mode="custom",
                run_with_orchestration=True   
            )
            print(f"Ingested file: {item.name}")
        except R2RException as r2re:
            print(f"Couldn't ingest file: {item.name} due to {str(r2re)}")
```

### 11. Add missing information to dataset using R2R

* Add the `actual response`
* Add the `retrieved context`

**EXAMPLE:**

```python
from r2r import R2RClient, R2RException

client = R2RClient(
    base_url="http://localhost:7272",
    timeout=600
)

# Ideally, these settings should match the configuration used by the R2R backend,
# so that its results can be reproduced as closely as possible
search_settings = {
    "use_semantic_search": True,
    "limit": 5,
    "offset": 0,
    "include_metadatas": False,
    "search_strategy": "vanilla",
}
    
rag_generation_config = {
    "temperature": 0.1,
    "top_p": 1,
    "max_tokens_to_sample": 512
}

template = """ 
## Task:

Answer the given query ONLY using the provided context. Keep your answer short and concise.

### Guidelines:
- Strictly limit responses to 2-3 sentences whenever possible.
- If a longer answer is necessary, make it as brief as possible, focusing only on relevant details.
- Merge lists/enumerations into a single coherent sentence using conjunctions or commas.
- Do NOT reference line numbers or list items from the context.
- If the provided context lacks sufficient information, explicitly inform the user that the answer cannot be determined.
- NEVER generate an answer beyond the given context — do not speculate or infer missing details.
- Do NOT use external knowledge; rely only on the retrieved context.

---

### Query:
{query}

### Context:
{context}

---

### Reminder:  
- Keep the response short and factual.  
- If the context lacks the answer, say so explicitly.  
- Do NOT generate an answer beyond the provided information.  

## Response:
"""
    
final_dataset = dataset # Alias, not a copy: the loop below modifies `dataset` in place; copy it first (e.g. `copy.deepcopy`) to keep the original
for i, sample in enumerate(final_dataset.samples):
    try:
        # Submit a query using the randomly generated question by RAGAs
        response = client.retrieval.rag(
            query=sample.eval_sample.user_input,
            search_mode="custom",
            search_settings=search_settings,
            rag_generation_config=rag_generation_config,
            task_prompt=template
        ).results

        llm_response = response.completion
        retrieved_context_txt = [chunk.text for chunk in response.search_results.chunk_search_results]
        
        final_dataset.samples[i].eval_sample.response = llm_response
        final_dataset.samples[i].eval_sample.retrieved_contexts = retrieved_context_txt
        
        print(f"Added data to sample: {i + 1} out of {len(final_dataset.samples)}")
        
    except R2RException as r2re:
        print(f"Something went wrong when submitting query: {sample.eval_sample.user_input} due to {str(r2re)}")
```
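
When a request fails, the loop above only prints a message, so the affected samples silently stay without a response. One option is to collect the failed indices and retry or drop them afterwards. The sketch below shows the pattern with plain dictionaries and a fake backend; `enrich_samples` and `fake_ask` are hypothetical stand-ins, not R2R or RAGAs functions.

```python
def enrich_samples(samples: list[dict], ask) -> list[int]:
    """Fill `response` and `retrieved_contexts` in place; return failed indices."""
    failed = []
    for i, sample in enumerate(samples):
        try:
            answer, contexts = ask(sample["user_input"])
        except Exception as exc:  # the real loop would catch R2RException here
            print(f"Sample {i} failed: {exc}")
            failed.append(i)
            continue
        sample["response"] = answer
        sample["retrieved_contexts"] = contexts
    return failed


def fake_ask(query: str):
    """Stand-in for the RAG backend; fails on purpose for one query."""
    if "fail" in query:
        raise RuntimeError("simulated backend error")
    return f"answer to: {query}", ["context chunk"]


toy_samples = [{"user_input": "What is the baggage limit?"}, {"user_input": "please fail"}]
failed_indices = enrich_samples(toy_samples, fake_ask)
print("Failed indices:", failed_indices)
```

The returned indices make it easy to rerun only the failed queries instead of the whole dataset.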

### 12. Save the dataset

Save the dataset locally if required.

In [None]:
dataset.to_jsonl("dataset.jsonl")
dataset.to_csv("dataset.csv")
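
After saving, a quick check of the exported file can reveal samples that were never enriched in step 11. The sketch below reads a JSONL file with the standard library; `rows_missing_fields` is a hypothetical helper, and the field names it looks for are an assumption about the export format that should be compared against your actual `dataset.jsonl`.

```python
import json
from pathlib import Path


def rows_missing_fields(
    jsonl_path: str,
    required=("user_input", "response", "retrieved_contexts"),
) -> list[int]:
    """Return 1-based line numbers of rows where a required field is empty."""
    missing = []
    with Path(jsonl_path).open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)
            if any(row.get(name) in (None, "", []) for name in required):
                missing.append(line_no)
    return missing


# Only runs if the export from the cell above exists.
if Path("dataset.jsonl").exists():
    print("Rows needing attention:", rows_missing_fields("dataset.jsonl"))
```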

### 13. Upload to the cloud (Optional)

* To upload the data to **app.ragas.io**, make sure you:
    * First create an account
    * Get an **API key**
    * Finally, create a `.env` file in the parent folder with the following content and load it in your notebook:

```bash
RAGAS_APP_TOKEN=apt.1234a-......-9dfew
```

In [None]:
from dotenv import load_dotenv

load_dotenv() # This will load the token

dataset.upload()

Testset uploaded! View at https://app.ragas.io/dashboard/alignment/testset/b0eb0e65-8914-4910-a103-8dab699927b3


'https://app.ragas.io/dashboard/alignment/testset/b0eb0e65-8914-4910-a103-8dab699927b3'