### Importance of a High-Quality Test Dataset for RAG Systems

### Purpose
Test sets are critical for:
- Accurately measuring RAG system performance
- Identifying system strengths and weaknesses
- Guiding continuous improvement

### Key Evaluation Dimensions
1. **Retrieval Effectiveness**: Assess how well relevant context is retrieved
2. **Generation Quality**: Evaluate the accuracy and coherence of generated responses
3. **Contextual Relevance**: Measure how well the system understands and integrates retrieved information

### Evaluation Goals
- Benchmark system performance
- Detect hallucinations
- Validate generalization capabilities
- Simulate real-world query complexity

### Best Practices
- Use diverse query types
- Cover multiple domains
- Include edge cases
- Create reproducible test scenarios
- Include enough samples to draw statistically significant conclusions
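
As a rough illustration of the last point, the sketch below estimates how the uncertainty of a measured pass rate shrinks as the number of samples grows. It assumes that each sample is scored as a binary pass/fail and uses the normal approximation with `z = 1.96` (roughly 95% confidence). The function name `margin_of_error` is made up for this example.

```python
import math


def margin_of_error(pass_rate: float, n_samples: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation confidence interval for a proportion."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must lie in [0, 1]")
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n_samples)


# Assumed observed pass rate of 80% at three different test set sizes
for n in (50, 200, 1000):
    print(n, round(margin_of_error(0.8, n), 3))
# 50 0.111
# 200 0.055
# 1000 0.025
```

With 50 samples an observed pass rate of 80% is only known to within about 11 percentage points, so small differences between experiments should be interpreted with care.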

### 0. Configuration for generating the goldens

Goldens are samples that contain the `reference`, `reference_contexts` and `user_input` fields. To fill out the remaining fields, `response` and `retrieved_contexts`, I used 9 different configurations, which are listed in `experiments.csv` at the root of the project.

The goldens are going to be saved under `./goldens`.


Configuration for the generation of each dataset containing **goldens**:

```bash
# Going to stay the same across all generations
MAX_TOKENS=512
TEMPERATURE=0.0
CHAT_MODEL=llama3.1:8b-instruct-q4_1
EMBEDDING_MODEL=mxbai-embed-large

# These values will change for each experiment, since we want to simulate how the RAG framework splits the data
CHUNK_SIZE=<experiment_id_chunk_size>
CHUNK_OVERLAP=<experiment_id_chunk_overlap>
```
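
Since `CHUNK_SIZE` and `CHUNK_OVERLAP` change per experiment, it can help to validate them before use. The following is a minimal sketch, not part of the project: the helpers `read_int_env` and `read_chunking_config` are hypothetical names, and the rule that the overlap must be smaller than the chunk size is an assumption that matches how character splitters are usually configured.

```python
import os
from typing import Optional


def read_int_env(name: str, default: Optional[int] = None) -> int:
    """Read an integer environment variable and fail with a clear message if it is missing."""
    raw = os.getenv(name)
    if raw is None or raw.strip() == "":
        if default is None:
            raise RuntimeError(f"Environment variable {name} is not set")
        return default
    return int(raw)


def read_chunking_config() -> tuple[int, int]:
    """Return (chunk_size, chunk_overlap) after basic sanity checks."""
    chunk_size = read_int_env("CHUNK_SIZE")
    chunk_overlap = read_int_env("CHUNK_OVERLAP")
    if chunk_size <= 0:
        raise ValueError("CHUNK_SIZE must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("CHUNK_OVERLAP must be >= 0 and smaller than CHUNK_SIZE")
    return chunk_size, chunk_overlap
```

Calling `read_chunking_config()` right after `load_dotenv(...)` would surface a missing or inconsistent value before any documents are split.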

### 1. Retrieve data:
* The data can be a dataset from `huggingface` or any other platform.

* Alternatively, files available on disk - pdf, md, etc.

* One can also use `AsyncHtmlLoader` from `langchain` to scrape from the internet.
    - **Be careful not to violate any terms and conditions when scraping the web!**

**NOTE**:
* Make sure you install the requirements first if you want to test the notebook.
    * To do so run the `setup.sh` in the parent folder.
* Make sure you select the proper environment as your kernel: `eval`. 

In [1]:
# For this notebook I will use a dataset provided by RAGAs.
# The dataset contains markdown files about a fictional airline company.
! git clone https://huggingface.co/datasets/explodinggradients/ragas-airline-dataset data

Cloning into 'data'...
remote: Enumerating objects: 14, done.
remote: Total 14 (delta 0), reused 0 (delta 0), pack-reused 14 (from 1)
Unpacking objects: 100% (14/14), 16.16 KiB | 8.08 MiB/s, done.


### 2. Load data into document objects

I prefer to use `langchain`, but `llama-index` is a viable alternative.

In [2]:
import os
from typing import Final
from dotenv import load_dotenv
from langchain.docstore.document import Document
from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

load_dotenv("../../env/rag.env")

# The folder in which the documents are stored.
# Make sure you select the proper one.
DIR_PATH: Final[str] = "data/"
loader = DirectoryLoader(
    DIR_PATH,
    glob="**/*.md",
    exclude="README.md"
)

# The R2R framework uses a recursive character text splitter to split the documents.
# I use the same kind of splitter here to mimic that behavior.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=int(os.getenv("CHUNK_SIZE")),
    chunk_overlap=int(os.getenv("CHUNK_OVERLAP")),
    length_function=len
)

docs: list[Document] = loader.load_and_split(splitter)
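
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sliding-window splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which additionally tries to break on separators such as paragraphs and sentences. The function `split_with_overlap` is a hypothetical helper for illustration only.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the effect the overlap setting has on neighbouring chunks.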

### 3. Construct knowledge graph

- A **knowledge graph** is a fundamental concept in **RAGAs** and the basis of its **automatic synthetic data generation**.

- A **knowledge graph** initially consists of **Node** objects, which represent **documents** (their content and additional metadata). After enrichment, the nodes may additionally contain attributes such as `entities` and `themes`, depending on the pipeline you define.

- The graph is enriched by applying **transformations**, which come in the form of **Extractor**s, **Splitter**s and/or **RelationshipBuilder**s. Extractors gather relevant data from the documents and add it to the relevant nodes as additional attributes. Based on that data, **relationships get built**, which express some kind of connection between two or more Node objects.

- This graph is then used to generate so-called **scenarios** and can also be used to generate **personas**.

![Knowledge graph creation workflow RAGAs](../../img/kg_rag.webp "Knowledge graph RAGAs")

In [3]:
from ragas.testset.graph import (
    Node,
    NodeType,
    KnowledgeGraph,
)

kg = KnowledgeGraph()

for doc in docs:
    kg.add(
        Node(
            type=NodeType.CHUNK, # Since we already split the documents, we can use the chunk type.
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
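
The lifecycle described above (create nodes, enrich them, then connect them) can be sketched with plain dataclasses. This is a toy model and not the `ragas.testset.graph` API: `ToyNode`, `ToyRelationship`, `ToyGraph` and the example texts are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    node_id: int
    properties: dict = field(default_factory=dict)


@dataclass
class ToyRelationship:
    source: int
    target: int
    kind: str


@dataclass
class ToyGraph:
    nodes: list = field(default_factory=list)
    relationships: list = field(default_factory=list)

    def add(self, properties: dict) -> ToyNode:
        node = ToyNode(node_id=len(self.nodes), properties=dict(properties))
        self.nodes.append(node)
        return node


graph = ToyGraph()

# Step 1: nodes start out with content only
first = graph.add({"page_content": "Chunk about baggage rules."})
second = graph.add({"page_content": "Chunk about baggage fees."})

# Step 2: an extractor-like step adds a new property to each node
first.properties["entities"] = ["baggage", "rules"]
second.properties["entities"] = ["baggage", "fees"]

# Step 3: a relationship-builder-like step links nodes that share an entity
shared = set(first.properties["entities"]) & set(second.properties["entities"])
if shared:
    graph.relationships.append(
        ToyRelationship(first.node_id, second.node_id, "entities_overlap")
    )

print(len(graph.nodes), len(graph.relationships))
```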

### 4. Instantiate required objects

- **RAGAs** requires a **large language model** and an **embedding model** to apply the **transformations** to the **knowledge graph**. For that purpose one must create **wrapper** objects for both models. `Langchain` and `llama-index` are both supported.

- Additionally, a **RunConfig** can be used to modify the default behaviour of the framework, for example the timeout values and the maximum number of retries for failed operations.
    - **NOTE**: depending on the LLM and GPU you may need to adjust the `timeout` value, otherwise you will run into a `TimeoutException`.

- Lastly, **RAGAs** provides a single implementation for caching intermediate steps on disk: the **DiskCacheBackend** class.

In [4]:
from langchain_ollama import ChatOllama, OllamaEmbeddings

from ragas import RunConfig, DiskCacheBackend
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

run_config = RunConfig(
    timeout=86400,    # Wait up to 24 hours for a single operation
    max_retries=20,   # Max retries before giving up
    max_wait=600,     # Max wait between retries
    max_workers=4,    # Concurrent requests
    log_tenacity=True # Print retry attempts
)

# This stores data generation and evaluation results locally on disk
# When using it for the first time, it will create a .cache folder
# When using it again, it will read from that folder and finish almost instantly
cacher = DiskCacheBackend(cache_dir=".cache")

langchain_llm = LangchainLLMWrapper(
    langchain_llm=ChatOllama(
        model=os.getenv("CHAT_MODEL"),
        base_url="http://localhost:11434",
        temperature=float(os.getenv("TEMPERATURE")),
        num_ctx=int(os.getenv("LLM_CONTEXT_WINDOW_TOKENS")),
        format="json"
    ),
    run_config=run_config,
    cache=cacher
)

langchain_embeddings = LangchainEmbeddingsWrapper(
    embeddings=OllamaEmbeddings(
        model=os.getenv("EMBEDDING_MODEL"),
        base_url="http://localhost:11434"
    ),
    run_config=run_config,
    cache=cacher
)
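
To give an intuition for what `max_retries` and `max_wait` control, the sketch below shows a generic retry loop with exponential backoff where the wait time is capped. It illustrates the general pattern only and is not how RAGAs implements retries internally. `call_with_retries`, `base_wait` and the injectable `sleep` argument are assumptions made for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    max_retries: int = 3,
    base_wait: float = 1.0,
    max_wait: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry a failing operation, doubling the wait each time up to max_wait."""
    attempt = 0
    while True:
        try:
            return operation()
        except Exception:
            attempt += 1
            if attempt > max_retries:
                raise  # give up and surface the last error
            sleep(min(max_wait, base_wait * 2 ** (attempt - 1)))


# Demo: an operation that fails twice before succeeding.
calls = {"count": 0}
waits: list[float] = []


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


# The waits are recorded instead of actually sleeping.
result = call_with_retries(flaky, max_retries=5, base_wait=1.0, max_wait=1.5, sleep=waits.append)
print(result, waits)
# ok [1.0, 1.5]
```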

### 5. Create the transformation pipeline

The sequence of transformations:

1. Named Entity Recognition (NER) and keyphrase extraction
    - `NERExtractor` identifies and extracts named entities (e.g., people, organizations, locations).
    - `KeyphrasesExtractor` extracts the main keyphrases found in the text.

2. Entity and keyphrase overlap builders (both are instances of `OverlapScoreBuilder` in the code below)
    - Used to establish a relationship between nodes containing similar:
        - entities
        - keyphrases

3. Parallel processing for efficiency
    - Certain transformations run in parallel to improve performance.

- Final outcome:
    - A structured set of document transformations that extract valuable information
    - Used to enrich the knowledge graph for the subsequent generation of scenarios and, finally, samples

**NOTE:** Some of the extractors *(the LLM-based ones)* accept an optional `prompt`, which can be used to modify the workflow. For instance, the `NERExtractor` can receive a custom prompt whose instructions differ from the original one, so that entities are extracted in a different way.

In [5]:
from ragas.testset.transforms import (
    Parallel,
    OverlapScoreBuilder,
    KeyphrasesExtractor,
)
from ragas.testset.transforms.extractors import NERExtractor

from prompts.extractors.custom_ner_prompt import MyNERPrompt
from prompts.extractors.custom_keyphrases_prompt import MyKeyphrasesExtractorPrompt

ner_extractor = NERExtractor(
    llm=langchain_llm,
    prompt=MyNERPrompt(
        name="custom_ner_extractor_prompt"
    ),
    max_num_entities=15
)

keyphrases_extractor = KeyphrasesExtractor(
    llm=langchain_llm,
    prompt=MyKeyphrasesExtractorPrompt(
        name="custom_keyphrases_extractor_prompt"
    ),
    max_num=15
)

ner_overlap_sim = OverlapScoreBuilder()

keyphrases_overlap_sim = OverlapScoreBuilder(
    property_name="keyphrases",
)

transforms = [
    Parallel(
        ner_extractor,
        keyphrases_extractor
    ),
    Parallel(
        ner_overlap_sim,
        keyphrases_overlap_sim
    )
]
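
The idea behind the overlap builders can be sketched as follows: compare the extracted items of every pair of nodes and create an edge when they are similar enough. The scoring used here is plain Jaccard similarity and the threshold value is arbitrary; both are assumptions made for illustration, and `OverlapScoreBuilder` may score overlap differently. The helper names are hypothetical.

```python
def jaccard_overlap(items_a: list[str], items_b: list[str]) -> float:
    """Share of distinct items (case-insensitive) that two lists have in common."""
    set_a = {item.strip().lower() for item in items_a}
    set_b = {item.strip().lower() for item in items_b}
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def build_overlap_edges(
    nodes: dict[str, list[str]], threshold: float = 0.2
) -> list[tuple[str, str, float]]:
    """Return (left, right, score) for every node pair whose overlap reaches the threshold."""
    node_ids = sorted(nodes)
    edges = []
    for index, left in enumerate(node_ids):
        for right in node_ids[index + 1:]:
            score = jaccard_overlap(nodes[left], nodes[right])
            if score >= threshold:
                edges.append((left, right, score))
    return edges


# Invented entity lists for three nodes
example_nodes = {
    "a": ["Refund", "Baggage"],
    "b": ["baggage", "Delay"],
    "c": ["Lounge"],
}
edges = build_overlap_edges(example_nodes)
print(edges)  # only nodes "a" and "b" share an entity
```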

### 6. Apply the transformations to the knowledge graph

The cell below uses `apply_transforms` to apply all the previously defined transformations, enriching the `knowledge graph` in the process. Note that some transformations can be executed in parallel. For that purpose **RAGAs** provides an abstraction called `Parallel`, which takes transformations as parameters.

In [6]:
from ragas.testset.transforms import apply_transforms

apply_transforms(
    kg,
    transforms,
    run_config
)

Applying [NERExtractor, KeyphrasesExtractor]:   0%|          | 0/138 [00:00<?, ?it/s]

Applying [OverlapScoreBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]
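
Why independent transformations can run side by side is easiest to see with a small sketch. The example below runs two unrelated extractor functions on the same text with a thread pool and merges their results. It only illustrates the concept and does not reflect how `Parallel` is implemented in RAGAs. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def extract_word_count(text: str) -> dict:
    return {"word_count": len(text.split())}


def extract_first_word(text: str) -> dict:
    words = text.split()
    return {"first_word": words[0] if words else ""}


def run_parallel(text: str, extractors: list[Callable[[str], dict]], max_workers: int = 4) -> dict:
    """Run extractors that do not depend on each other concurrently and merge their output."""
    merged: dict = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(extractor, text) for extractor in extractors]
        for future in futures:
            merged.update(future.result())
    return merged


properties = run_parallel("hello big world", [extract_word_count, extract_first_word])
print(properties)
```

The overlap builders, in contrast, need the extracted properties as input, which is why they form a second stage in the `transforms` list.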

### 7. Generating personas

- A **Persona** is an entity/role which interacts with the system. **Personas** provide context and perspective, ensuring that **generated queries are natural, user-specific, and diverse**.

- Example: a Senior DevOps engineer, a Junior Data Scientist, a Marketing Manager in the context of an IT company

- **Persona** object consists of a **name** and a **description**.
    
    - The name is used to identify the persona and the description is used to describe the role of the persona.

- Note that personas can also be generated from a **knowledge graph** if you have one available.

In [10]:
from ragas.testset.persona import Persona

# This example is taken from `RAGAs`:
# https://docs.ragas.io/en/latest/howtos/applications/singlehop_testset_gen/#configuring-personas-for-query-generation

persona_first_time_flier = Persona(
    name="First Time Flier",
    role_description="Is flying for the first time and may feel anxious. Needs clear guidance on flight procedures, safety protocols, and what to expect throughout the journey.",
)

persona_frequent_flier = Persona(
    name="Frequent Flier",
    role_description="Travels regularly and values efficiency and comfort. Interested in loyalty programs, express services, and a seamless travel experience.",
)

persona_angry_business_flier = Persona(
    name="Angry Business Class Flier",
    role_description="Demands top-tier service and is easily irritated by any delays or issues. Expects immediate resolutions and is quick to express frustration if standards are not met.",
)

personas: list[Persona] = [
    persona_first_time_flier,
    persona_frequent_flier,
    persona_angry_business_flier
]

### 8. Generate query types

- There are two main types of queries in **RAGAs**:
    
    - **SingleHopQuery** where the **context** relevant for answering a question lies in a **single document/chunk**

    - **MultiHopQuery** where the **context** relevant for answering a question lies in **multiple documents/chunks**

- Additionally, each of those queries comes in a **Specific** or an **Abstract** variant:

    - A **Specific** query pertains to a **fact**.

        - Example: When did WW1 break out? (Can be answered precisely; there is no room for guessing or interpretation)

    - An **Abstract** query is more about testing the **reasoning** capabilities of the LLM.

        - Example: Why did WW1 break out? (There is room for interpretation in this case)

- **Specific** vs. **Abstract** queries in a RAG system
    - Specific query: Focuses on clear, fact-based retrieval. The goal in RAG is to retrieve highly relevant information from one or more documents that directly address the specific question.

    - Abstract query: Requires a broader, more interpretive response. In RAG, abstract queries challenge the retrieval system to pull from documents that contain higher-level reasoning, explanations, or opinions, rather than simple facts.

![Query types in RAGAs](../../img/ragas_query_types.png  "Queries")

**Synthesizers** are responsible for **converting enriched nodes and personas into queries**. They achieve this by **selecting a node property (e.g., "entities" or "keyphrases"), pairing it with a persona, style, and query length**, and then using an LLM to generate a query-answer pair based on the content of the node.

* Query lengths may vary:
    - short
    - medium
    - long

* Query style:
    - misspelled
    - websearch-like
    - perfect-grammar
    - poor-grammar

Note that **synthesizers** can additionally be extended/modified by specifying custom **prompts**.
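
The pairing step described above can be pictured as drawing from all combinations of persona, theme, style and length. The sketch below does exactly that with the standard library. It is a simplification for intuition, not the RAGAs implementation, and `sample_scenarios` as well as the example themes are hypothetical.

```python
import itertools
import random

QUERY_LENGTHS = ("short", "medium", "long")
QUERY_STYLES = ("misspelled", "websearch-like", "perfect-grammar", "poor-grammar")


def sample_scenarios(
    persona_names: list[str], themes: list[str], n: int, seed: int = 0
) -> list[tuple[str, str, str, str]]:
    """Draw n distinct (persona, theme, style, length) combinations."""
    combinations = list(itertools.product(persona_names, themes, QUERY_STYLES, QUERY_LENGTHS))
    rng = random.Random(seed)  # seeded so that the draw is reproducible
    return rng.sample(combinations, k=min(n, len(combinations)))


scenarios = sample_scenarios(
    persona_names=["First Time Flier", "Frequent Flier"],
    themes=["baggage", "delays"],  # invented themes
    n=5,
)
for scenario in scenarios:
    print(scenario)
```

Each tuple would then be handed to an LLM prompt together with the node content to produce one query-answer pair.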


In [11]:
from ragas.testset.synthesizers.multi_hop.specific import MultiHopSpecificQuerySynthesizer
from ragas.testset.synthesizers.single_hop.specific import SingleHopSpecificQuerySynthesizer

from prompts.synthesizers.custom_themes_matching import MyThemesPersonasMatchingPrompt
from prompts.synthesizers.custom_multi_hop_qa_generation import MyMultiHopQAGenerationPrompt
from prompts.synthesizers.custom_single_hop_qa_generation import MySingleHopQAGenerationPrompt

single_hop_specific_entities = SingleHopSpecificQuerySynthesizer(
    llm=langchain_llm,
    generate_query_reference_prompt=MySingleHopQAGenerationPrompt(),
    theme_persona_matching_prompt=MyThemesPersonasMatchingPrompt(),
    property_name="entities"
)

single_hop_specific_keyphrases = SingleHopSpecificQuerySynthesizer(
    llm=langchain_llm,
    generate_query_reference_prompt=MySingleHopQAGenerationPrompt(),
    theme_persona_matching_prompt=MyThemesPersonasMatchingPrompt(),
    property_name="keyphrases"
)

multi_hop_specific_entities = MultiHopSpecificQuerySynthesizer(
    llm=langchain_llm,
    generate_query_reference_prompt=MyMultiHopQAGenerationPrompt(),
    theme_persona_matching_prompt=MyThemesPersonasMatchingPrompt()
)

multi_hop_specific_keyphrases = MultiHopSpecificQuerySynthesizer(
    llm=langchain_llm,
    generate_query_reference_prompt=MyMultiHopQAGenerationPrompt(),
    relation_type="keyphrases_overlap",
    property_name="keyphrases",
    theme_persona_matching_prompt=MyThemesPersonasMatchingPrompt()
)

query_distribution = [
    (single_hop_specific_entities, 0.25),
    (single_hop_specific_keyphrases, 0.25),
    (multi_hop_specific_entities, 0.25),
    (multi_hop_specific_keyphrases, 0.25)
]
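
With four synthesizers weighted 0.25 each, a test set size that is not divisible by four cannot be split evenly. The sketch below shows one common way to turn weights into whole sample counts, the largest-remainder method. Whether RAGAs rounds in the same way is not confirmed here; `allocate_samples` is a hypothetical helper.

```python
import math


def allocate_samples(weights: list[float], total: int) -> list[int]:
    """Split total into integer counts proportional to weights (largest-remainder method)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    exact = [weight * total for weight in weights]
    counts = [math.floor(value) for value in exact]
    leftover = total - sum(counts)
    # Hand the remaining samples to the entries with the largest fractional part
    by_remainder = sorted(range(len(weights)), key=lambda i: exact[i] - counts[i], reverse=True)
    for index in by_remainder[:leftover]:
        counts[index] += 1
    return counts


print(allocate_samples([0.25, 0.25, 0.25, 0.25], 50))
# [13, 13, 12, 12]
```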

### 9. Generate the samples

### Definition of evaluation sample

An evaluation sample is a single structured data instance that is used to assess and measure the performance of your LLM application in specific scenarios. It represents a single unit of interaction or a specific use case that the AI application is expected to handle. In Ragas, evaluation samples are represented using the `SingleTurnSample` and `MultiTurnSample` classes.

### SingleTurnSample

`SingleTurnSample` represents a single-turn interaction between a user, LLM, and expected results for evaluation. It is suitable for evaluations that involve a single question and answer pair, possibly with additional context or reference information.

This type of sample is ideal for straightforward question-answering scenarios where a user asks a single question and expects a direct response.

### MultiTurnSample

`MultiTurnSample` represents a multi-turn interaction between Human, AI and optionally a Tool and expected results for evaluation. It is suitable for representing conversational agents in more complex interactions for evaluation.

In `MultiTurnSample`, the `user_input` attribute represents a sequence of messages that collectively form a multi-turn conversation between a human user and an AI system. These messages are instances of the classes `HumanMessage`, `AIMessage`, and `ToolMessage`.

This type of sample is designed for evaluating more complex conversational flows where multiple turns of dialogue occur, potentially involving tool usage for gathering additional information.

![Scenario generation workflow RAGAs](../../img/scenario_rag.webp "Scenarios RAGAs")

In [12]:
from ragas.testset import TestsetGenerator, Testset

generator = TestsetGenerator(
    langchain_llm,
    langchain_embeddings,
    kg,
    personas
)

dataset: Testset = generator.generate(
    testset_size=50,
    query_distribution=query_distribution,
    num_personas=len(personas),
    run_config=run_config,
    with_debugging_logs=True,
)

Generating Scenarios:   0%|          | 0/4 [00:00<?, ?it/s]

KeyboardInterrupt: 

### 10. Save dataset containing goldens (no actual output and retrieval context)

In [10]:
dataset.to_jsonl("dataset-goldens.jsonl")
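
Before moving on, it can be worth checking that every saved golden actually contains the three golden fields. The sketch below parses JSONL text and validates each record. It works on an in-memory example with invented values; the exact set of additional fields written by `to_jsonl` is not assumed. `parse_goldens` is a hypothetical helper.

```python
import json

GOLDEN_FIELDS = ("user_input", "reference", "reference_contexts")


def parse_goldens(jsonl_text: str) -> list[dict]:
    """Parse JSONL text and ensure each record has non-empty golden fields."""
    goldens = []
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        missing = [name for name in GOLDEN_FIELDS if not record.get(name)]
        if missing:
            raise ValueError(f"line {line_number}: missing fields {missing}")
        goldens.append(record)
    return goldens


example_jsonl = "\n".join([
    json.dumps({"user_input": "q1", "reference": "a1", "reference_contexts": ["c1"]}),
    json.dumps({"user_input": "q2", "reference": "a2", "reference_contexts": ["c2", "c3"]}),
])
goldens = parse_goldens(example_jsonl)
print(len(goldens))
```

For a real file, the text could be read with `open("dataset-goldens.jsonl", encoding="utf-8").read()` and passed to the same function.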

### 11. Upload to the cloud (Optional)

* To upload the data on **app.ragas.io** make sure you:
    * First create an account
    * Get an **API key**
    * Finally, create a `.env` file in the parent folder like so and export it in your notebook:

```bash
RAGAS_APP_TOKEN=apt.1234a-......-9dfew
```

In [11]:
from dotenv import load_dotenv

# Load the token
load_dotenv("../.env") 

dataset.upload()

Testset uploaded! View at https://app.ragas.io/dashboard/alignment/testset/b2a2d2b2-6b6d-4150-9027-73c7fc310500


'https://app.ragas.io/dashboard/alignment/testset/b2a2d2b2-6b6d-4150-9027-73c7fc310500'

### 12. Fill out the missing fields in the dataset to complete it.

* This script will first create a virtual environment with all required dependencies.
* Next the files that are currently ingested will be removed.
* The files from the `data` folder will be re-ingested.
* Then the dataset will be filled out with `response` and `retrieved_contexts`.
* Finally, the dataset will be persisted into `/datasets`.

This is required since the experiments have varying `chunk_size` and `chunk_overlap` values. The application therefore needs to be restarted and the data re-ingested for each experiment, so that the data generation is carried out properly.

**NOTE**:
- Make sure you set the proper values under `/env/rag.env` relative to the experiment id you want to carry out.
- Make sure you restart the application each time you do so.
    - Run `docker compose down --remove-orphans` to stop the containers
    - Run `./run.sh` to start the application. Do note you need `Ollama` and `Docker` up and running.
- For each experiment provide an intuitive name to distinguish between the experiments
    - In my case I just use the experiment id
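
Conceptually, the fill step takes each golden, sends its `user_input` to the running RAG application and stores the answer together with the retrieved chunks. The sketch below shows that shape with a stand-in for the application. It is not the content of `fill_dataset.sh`; `fill_golden`, `fake_rag` and the example values are hypothetical.

```python
from typing import Callable


def fill_golden(golden: dict, answer_fn: Callable[[str], tuple[str, list[str]]]) -> dict:
    """Return a copy of the golden with `response` and `retrieved_contexts` added."""
    response, retrieved_contexts = answer_fn(golden["user_input"])
    return {**golden, "response": response, "retrieved_contexts": list(retrieved_contexts)}


def fake_rag(question: str) -> tuple[str, list[str]]:
    """Stand-in for a call to the real RAG application."""
    return f"answer to: {question}", ["context A", "context B"]


golden = {
    "user_input": "example question",
    "reference": "example reference answer",
    "reference_contexts": ["example reference context"],
}
filled = fill_golden(golden, fake_rag)
print(sorted(filled))
```

The original golden stays untouched, so the same goldens can be reused for every experiment configuration.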

In [1]:
# Make the script executable
!chmod u+x ./fill_dataset.sh

# Run the script with arguments
# Make sure that before you run the script you restart the app with the proper environment variables
# Also do provide different names for the different configurations
!./fill_dataset.sh "dataset-goldens" "test_id_9"

Virtual environment already exists. Skipping creation.
Dependencies already installed. Skipping installation.
Environment is set and ready to be used
Generating dataset in /datasets/test_id_9-dataset.jsonl
TOP_K=10
MAX_TOKENS_TO_SAMPLE=1024
CHUNK_SIZE=768
CHUNK_OVERLAP=64
CHAT_MODEL=deepseek-r1:7b
TEMPERATURE=0.0

DELETION STEP COMPLETED...
/tmp/tmpyhkmhxh5/special_assistance.md: Document created and ingested successfully.
/tmp/tmpyhkmhxh5/managing_reservations.md: Document created and ingested successfully.
/tmp/tmpyhkmhxh5/flight_delays.md: Document created and ingested successfully.
/tmp/tmpyhkmhxh5/baggage_policies.md: Document created and ingested successfully.
/tmp/tmpyhkmhxh5/inflight_services.md: Document created and ingested successfully.
/tmp/tmpyhkmhxh5/schedule_changes.md: Document created and ingested successfully.
/tmp/tmpyhkmhxh5/bookings.md: Document created and ingested successfully.
/tmp/tmpyhkmhxh5/flight_cancellations.md: Document created and ingested successfully.


### 13. Synthetic Data Generation completed

With this you have created your own synthetic goldens, used various configurations to fill out the remaining fields, and saved all the data locally. Now you can move on to the next notebook, check out the metrics that are relevant in my opinion, and evaluate your application using the different experiments.