### Importance of High-Quality Test Dataset for RAG Systems

### Purpose
Test sets are critical for:
- Accurately measuring RAG system performance
- Identifying system strengths and weaknesses
- Guiding continuous improvement

### Key Evaluation Dimensions
1. **Retrieval Effectiveness**: Assess how well context is retrieved
    - completeness (recall)
    - relevancy (relative to user input)
    - precision (proper ranking)
2. **Generation Quality**: Evaluate the generated responses
    - faithfulness/groundedness in context (lack of hallucinations)
    - relevance (relative to the user input)
    - coherence
    - correctness
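
To make the retrieval dimensions above concrete, here is a minimal sketch of recall and a rank-aware precision over chunk IDs. It assumes the relevant chunks are already known as ground truth; the IDs and function names are made up for illustration, and RAGAs itself relies on LLM judgments rather than exact ID matching.

```python
from typing import List, Set


def context_recall(retrieved: List[str], relevant: Set[str]) -> float:
    """Share of the relevant chunks that appear anywhere in the retrieved list."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


def average_precision(retrieved: List[str], relevant: Set[str]) -> float:
    """Rank-aware precision: relevant chunks near the top contribute more."""
    hits, score = 0, 0.0
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / hits if hits else 0.0


# Hypothetical chunk IDs, in the order the retriever returned them
retrieved_ids = ["c1", "c7", "c3", "c9"]
relevant_ids = {"c1", "c3", "c5"}

recall = context_recall(retrieved_ids, relevant_ids)
avg_precision = average_precision(retrieved_ids, relevant_ids)
print(f"recall={recall:.2f}, average precision={avg_precision:.2f}")
```

Recall drops because `c5` was never retrieved, while the precision score is penalized because the irrelevant `c7` is ranked above `c3`.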

### Setup

* Run `setup.sh` from the root of the `evaluation` folder.
* If the script cannot be executed, make it executable first: `chmod u+x setup.sh`.
* The script installs all required dependencies and opens a JupyterLab instance.

---

#### (OPTIONAL STEP)

**RAGAs** provides a cloud platform for storing and viewing datasets and evaluation results. To use it, go to [RAGAs.io](https://app.ragas.io/) and:

* Sign up
* Retrieve the **token**
* Create a `.env` file in the root of the `evaluation` folder with the following content:

```bash
RAGAS_APP_TOKEN=apt.......-9f6ed
```

### 0. Configuration for generating the goldens

Goldens (synthetically generated) are samples that contain:
- `reference` (expected output)
- `reference_contexts` (context used for generating the query)
- `user_input` (query)

The remaining fields are filled in using a range of configurations, which are listed in *experiments.csv* at the root of the project:
- `response` (actual output, generated by the LLM using the retrieved context)
- `retrieved_contexts` (data retrieved during retrieval)
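
As an illustration of how these fields relate, the sketch below models a sample as a plain dataclass. The class name `GoldenSample`, the `is_complete` helper and the placeholder texts are assumptions for this example only; the notebook itself works with the sample classes provided by RAGAs.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GoldenSample:
    # Stage 1: synthesized from the corpus (the "golden" part)
    user_input: str
    reference: str
    reference_contexts: List[str]
    # Stage 2: filled in later by the RAG pipeline under test
    response: Optional[str] = None
    retrieved_contexts: List[str] = field(default_factory=list)

    def is_complete(self) -> bool:
        """True once the RAG pipeline has added its output to the golden."""
        return self.response is not None and len(self.retrieved_contexts) > 0


# Placeholder values, not taken from the actual dataset
sample = GoldenSample(
    user_input="(synthetic query)",
    reference="(expected answer derived from the reference context)",
    reference_contexts=["(chunk the query was generated from)"],
)
complete_before = sample.is_complete()

sample.response = "(answer produced by the RAG application)"
sample.retrieved_contexts = ["(chunk returned by the retriever)"]
complete_after = sample.is_complete()

print(complete_before, complete_after)
```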

To generate the *goldens* for all experiments I use the following parameters:
- generative model - `llama3.1:8b-instruct-q4_1`
- temperature - `0.0`
- the remaining parameters (`chunk size`, `overlap`, etc.) vary with each experiment's configuration

The data generation pipeline consists of 2 stages:
- first, the so-called `goldens` are created using RAGAs (up to step 11)
- then, the RAG application fills in the remaining fields in step 12 (you can use your own custom RAG pipeline here)

**NOTE**: Before running this notebook, make sure that:
- the variables under `env/rag.env` are set correctly and the application has been started by running `./run.sh` in the root of the project
- Alternatively, you could make use of your own RAG application, or start the services manually:
    - set the environment variables under `env/rag.env` so that they reflect the experiment you want to generate data for
    - export all the environment variables under `env`
    - run ollama: `ollama serve` (check `run.sh` for additional environment variables)
    - run all the docker services: `docker compose up -d --build`

### 1. Retrieve data (Optional):

The data will be used as a corpus to create synthetic queries (based on the context) and generate references (expected outputs).

* The data can be a dataset from `huggingface` or any other platform.
* Alternatively, it can be files available on disk (pdf, md, etc.).
* One can also use `AsyncHtmlLoader` from `langchain` to scrape pages from the internet.
    - **Be careful not to violate any terms and conditions when web scraping!**

In [None]:
DIR_PATH: str = "data"

# For this notebook I will use a dataset provided by RAGAs.
# The dataset contains markdown files about a fictional airline company.
! git clone https://huggingface.co/datasets/explodinggradients/ragas-airline-dataset {DIR_PATH}

### 2. Import all dependencies

This notebook relies on a number of libraries, so it is best to import everything in a single cell to keep things neat.

Additionally, environment variables will be exported here.

In [None]:
import os
import base64
import mimetypes
from distutils.util import strtobool
from typing import Final, List, Dict, Union

import requests
import markdown
from bs4 import BeautifulSoup
from dotenv import load_dotenv

from langchain_ollama import (
    ChatOllama, OllamaEmbeddings
)

from ragas import (
    RunConfig, DiskCacheBackend
)
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

from ragas.testset.graph import (
    Node, NodeType, KnowledgeGraph
)
from ragas.testset.transforms import (
    Parallel,
    OverlapScoreBuilder,
    KeyphrasesExtractor,
    apply_transforms
)
from ragas.testset.persona import Persona
from ragas.testset import TestsetGenerator, Testset
from ragas.testset.transforms.extractors import NERExtractor
from ragas.testset.synthesizers.multi_hop.specific import MultiHopSpecificQuerySynthesizer
from ragas.testset.synthesizers.single_hop.specific import SingleHopSpecificQuerySynthesizer

from prompts.extractors.custom_ner_prompt import MyNERPrompt
from prompts.extractors.custom_keyphrases_prompt import MyKeyphrasesExtractorPrompt
from prompts.synthesizers.custom_themes_matching import MyThemesPersonasMatchingPrompt
from prompts.synthesizers.custom_multi_hop_qa_generation import MyMultiHopQAGenerationPrompt
from prompts.synthesizers.custom_single_hop_qa_generation import MySingleHopQAGenerationPrompt

# Load the RAG parameters
load_dotenv("../../env/rag.env")

# Load the RAGAs token (if available - optional)
load_dotenv("../.env")

### 3. Load documents data and prepare them for knowledge graph creation (Pre-processing phase)

The information from the documents will be used as context by RAGAs to generate synthetic queries.

For extracting data from documents and splitting them into chunks we will make use of the [unstructured library](https://docs.unstructured.io/welcome). In my project I use the `basic` chunking strategy, however `by_title` could also be useful.

If you want to learn more about that follow these links:
- https://docs.unstructured.io/api-reference/partition/partitioning
- https://docs.unstructured.io/api-reference/partition/chunking

Also check the `ingestion` section of the file `project/backend/config.toml`:
```toml
# https://unstructured.io/blog/chunking-for-rag-best-practices
[ingestion]
provider = "unstructured_local"  # Use the local instance that is running in a container
strategy = "auto"                # https://docs.unstructured.io/open-source/concepts/partitioning-strategies
chunking_strategy = "basic"      # https://docs.unstructured.io/api-reference/partition/chunking
new_after_n_chars = 1024         # Soft limit for chunk size (always max_characters)
max_characters = 1024            # Hard limit for chunk size (it can never exceed this value)
combine_text_under_n_chars = 512 # If chunks are smaller than this value, they will be combined (always max_characters / 2)
overlap = 128                    # chunk overlap 
chunk_size = 1024                # This is used for the RecursiveCharacterTextSplitter in case unstructured doesn't work (fallback)
chunk_overlap = 128              # This is used for the RecursiveCharacterTextSplitter in case unstructured doesn't work (fallback) 
chunks_for_document_summary = 16
document_summary_model = "ollama_chat/llama3.1:8b"
automatic_extraction = false
```

#### Pseudo-algorithm
1. Retrieve the configuration for ingestion
2. Format it properly and prepare the ingestion request
3. Submit the request to the `unstructured` service
4. Extract the data from the response for each file

In [None]:
# Step 1.
# We request the current configuration settings from r2r.
# They are hardcoded at project/backend/config.toml.
# However we can overwrite them at runtime using environment variables.
settings_resp: requests.Response = requests.get(
    url="http://localhost:7272/v3/system/settings"
)

# Make sure the environment variables under env/rag.env are correct.
# Each experiment can have different parameters, and restarting the docker containers
# for every change is inefficient, so modify the variables directly in env/rag.env.
# Below we then overwrite the specific ones so that they match the experiment requirements.
# Restart the kernel before re-running the notebook so that the changes take effect.

# Get the current ingestion configuration
ingestion_config: Dict = settings_resp.json()['results']['config']['ingestion']

# Extra fields is the relevant part of the ingestion config for submitting a request to unstructured.
ingestion_config_request: Dict = ingestion_config['extra_fields']
ingestion_config_request['chunking_strategy'] = ingestion_config['chunking_strategy']

# Overwrite the experiment specific parameters for ingestion.
ingestion_config_request['max_characters'] = int(os.getenv('CHUNK_SIZE'))
ingestion_config_request['new_after_n_chars'] = int(ingestion_config_request['max_characters'])
ingestion_config_request['overlap'] = int(os.getenv('CHUNK_OVERLAP'))
ingestion_config_request['combine_text_under_n_chars'] = (
    int(ingestion_config_request['max_characters']) // 2
)

# The max_characters should match the CHUNK_SIZE environment variable in the rag.env file
# The new_after_n_chars is equal to the max_characters
# The overlap should match the CHUNK_OVERLAP environment variable in the rag.env file
# combine_text_under_n_chars is always max_characters / 2
print(f"Current ingestion config sent to unstructured:\n{ingestion_config_request}")

In [None]:
def markdown_to_plaintext(md_content: str) -> str:
    html: str = markdown.markdown(md_content)
    soup = BeautifulSoup(html, features="html.parser")
    return soup.get_text()

In [None]:
# This will hold all the chunks for each particular file
# Each key-value pair will be a mapping from the filename to its chunks
chunks: Dict[str, List[str]] = {}

# Iterate over the files and prepare the file content for chunking by unstructured
for file in os.listdir(DIR_PATH):
    if file.endswith(".md") and file != "README.md": # This may vary depending on the documents you are working with
        # Open and read the contents
        with open(file=os.path.join(DIR_PATH, file), mode="r", encoding="utf-8") as f:
            markdown_text: str = f.read()

        plain_text: str = markdown_to_plaintext(markdown_text)
        file_bytes: bytes = plain_text.encode("utf-8")  # convert back to bytes
        encoded_file: str = base64.b64encode(file_bytes).decode("utf-8")  # base64 text, safe to embed in JSON
        
        # Step 2.
        # Prepare the payload for the ingestion request
        payload = {
            "file_content": encoded_file,
            "filename": file,
            "ingestion_config": ingestion_config_request
        }

        # Step 3.
        # Send request
        response: requests.Response = requests.post(
            "http://localhost:7275/partition", # See the docker compose file
            json=payload,
            timeout=60
        )
        
        if response.status_code != 200:
            raise Exception(
                f"Failed to process file {file}: {response.status_code} - {response.text}"
            )

        # Step 4.
        # Collect the chunks for each file
        chunks[file] = [el['text'] for el in response.json()['elements'] if el['text']]

### 4. Construct knowledge graph

- https://docs.ragas.io/en/latest/getstarted/rag_testset_generation/
- https://docs.ragas.io/en/latest/concepts/test_data_generation/rag/

- A **knowledge graph** is a fundamental concept in **RAGAs** and underpins its **automatic synthetic data generation**.

- Initially, a **knowledge graph** consists only of **node**s, which represent **documents/chunks**: their content and, optionally, metadata.

- The graph can then be enriched by applying **transformations**, which are implemented as **Extractor**s, **Splitter**s and/or **RelationshipBuilder**s. Extractors add attributes (such as entities or keyphrases) to the relevant nodes, and relationship builders use those attributes to create **relationships** that logically connect two or more nodes.

- Finally, the graph is used to generate so-called **scenario**s.

- Optionally, **persona**s can also be generated from it.

![Knowledge graph creation workflow RAGAs](../../img/ragas/kg_rag.webp "Knowledge graph RAGAs")

In [None]:
kg = KnowledgeGraph()

# After having chunked the documents we can create the knowledge graph
# Then we can use transformations to enrich it
# Finally, RAGAs will use the nodes and their relationships to build contexts and finally synthesize queries
for filename, text_chunks in chunks.items():
    for chunk in text_chunks:
        kg.add(
            Node(
                # Since we already split the documents, we can use the chunk type.
                # Alternatively, with full documents you can use DOCUMENT.
                type=NodeType.CHUNK,
                # Can add other properties if you want
                properties={
                    "page_content": chunk,
                    "document_metadata": {
                        "source": filename,
                        "file_type": mimetypes.guess_type(filename)[0] or "text/plain"
                    }
                }
            )
        )

### 5. Instantiate required objects for interacting with RAGAs

- https://docs.ragas.io/en/latest/references/llms/
- https://docs.ragas.io/en/latest/references/embeddings/
- https://docs.ragas.io/en/latest/references/run_config/
- https://docs.ragas.io/en/latest/references/cache/

- Depending on the **transformation**s applied to the **knowledge graph**, **RAGAs** requires an **LLM** and an **embedding model**. Both models must be wrapped in *wrapper* objects. `langchain`, `llama-index`, `haystack`, etc. are supported.

- Additionally, a **RunConfig** can be used to modify the default behaviour of the framework, for example the timeout values and the maximum number of retries for failed operations.

- **NOTE**: depending on the LLM and GPU you may need to increase the `timeout` value, otherwise operations will fail with timeout errors.

- Lastly, **RAGAs** ships a single implementation for caching intermediate steps on disk: the **DiskCacheBackend** class. It is useful if the kernel freezes, since cached operations are not carried out again.
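
To illustrate what retry settings such as `max_retries` and `max_wait` control, here is a minimal sketch of retrying with capped exponential backoff. This is not the retry code used by RAGAs; the helper `call_with_retries`, its `base_delay` parameter and the simulated failure are assumptions for this example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    max_retries: int = 3,     # treated here as the total number of attempts
    max_wait: float = 1.0,    # upper bound (seconds) for a single wait
    base_delay: float = 0.01,
) -> T:
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    attempt = 0
    while True:
        attempt += 1
        try:
            return operation()
        except Exception:
            if attempt >= max_retries:
                raise  # out of attempts: surface the last error
            # Exponential backoff with jitter, capped at max_wait
            delay = min(max_wait, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))


calls = {"count": 0}


def flaky_operation() -> str:
    """Fails twice with a simulated timeout, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


result = call_with_retries(flaky_operation, max_retries=5)
print(result, calls["count"])
```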

In [None]:
test_id: str = input("Enter the test id (Ex. 1): ")

run_config = RunConfig(
    timeout=86400,    # Wait up to 24 hours for a single operation
    max_retries=20,   # Max retries before giving up
    max_wait=600,     # Max wait between retries
    max_workers=4,    # Concurrent requests
    log_tenacity=True # Print retry attempts
)

# This stores data generation and evaluation results locally on disk
# When using it for the first time, it will create a .cache folder
# When using it again, it will read from that folder and finish almost instantly
cacher = DiskCacheBackend(cache_dir=f".cache-{test_id}")

ollama_llm = ChatOllama(
    model=os.getenv("DATA_GENERATION_MODEL"),
    base_url="http://localhost:11434", # Can vary for you
    temperature=float(os.getenv("TEMPERATURE")),
    num_ctx=int(os.getenv("LLM_CONTEXT_WINDOW_TOKENS")),
    format="json" # We need to enforce JSON output, since most outputs would be validated by a pydantic model
)

ollama_embeddings = OllamaEmbeddings(
    model=os.getenv("EMBEDDING_MODEL"),
    base_url="http://localhost:11434" # Can vary for you
)

ragas_llm = LangchainLLMWrapper(
    langchain_llm=ollama_llm,
    run_config=run_config,
    cache=cacher
)

ragas_embeddings = LangchainEmbeddingsWrapper(
    embeddings=ollama_embeddings,
    run_config=run_config,
    cache=cacher
)

### 6. Create the transformation pipeline

https://docs.ragas.io/en/latest/references/transforms/

The sequence of transformations:

1. Named Entity Recognition (NER) and Keyphrases extraction:
    - NERExtractor identifies and extracts named entities (e.g., people, organizations, locations)
    - KeyphrasesExtractor extracts the main keyphrases to be found in the text

2. Relationship building with `OverlapScoreBuilder` (one instance for entities, one for keyphrases):
    - Used to establish a relationship between nodes containing similar:
        - entities
        - keyphrases

3. Parallel Processing for Efficiency:
    - Certain transformations can run in parallel to improve performance.

- Final Outcome:
    - A structured set of document transformations that extract valuable information
    - Used to enrich the knowledge graph for further generation of scenarios and finally samples

**NOTE:**
- Some of the extractors *(the LLM-based ones)* accept an optional `prompt`, which can be used to modify the workflow. For instance, the `NERExtractor` can receive a custom prompt whose instructions differ from the original one, so that entities are extracted in a different way.
- Refer to the `prompts/extractors` folder.
- There are also many other transformations that can be used.
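
The idea behind overlap-based relationships can be sketched in plain Python: two nodes are linked when the sets of items extracted from them are similar enough. The sketch uses exact-match Jaccard similarity, which is a simplification and not the scoring that `OverlapScoreBuilder` implements; the function names, the threshold and the entity sets are made up for illustration.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_overlap_edges(
    node_items: Dict[str, Set[str]], threshold: float = 0.2
) -> List[Tuple[str, str, float]]:
    """Link every pair of nodes whose extracted items overlap enough."""
    edges = []
    for (id_a, items_a), (id_b, items_b) in combinations(node_items.items(), 2):
        score = jaccard(items_a, items_b)
        if score >= threshold:
            edges.append((id_a, id_b, round(score, 2)))
    return edges


# Hypothetical entities extracted per chunk
entities_per_chunk = {
    "chunk_1": {"baggage", "economy class", "check-in"},
    "chunk_2": {"baggage", "check-in", "refund"},
    "chunk_3": {"loyalty program"},
}
edges = build_overlap_edges(entities_per_chunk)
print(edges)
```

Only `chunk_1` and `chunk_2` share entities, so they are the only pair that gets connected; such connected pairs are what make multi-hop queries possible later on.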

In [None]:
# Extract entities
ner_extractor = NERExtractor(
    llm=ragas_llm,
    prompt=MyNERPrompt(
        name="custom_ner_extractor_prompt"
    ),
    max_num_entities=15
)

# Extract keyphrases
keyphrases_extractor = KeyphrasesExtractor(
    llm=ragas_llm,
    prompt=MyKeyphrasesExtractorPrompt(
        name="custom_keyphrases_extractor_prompt"
    ),
    max_num=15
)

# Create relationships between chunks based on entities
ner_overlap_sim = OverlapScoreBuilder()

# Create relationships between chunks based on keyphrases
keyphrases_overlap_sim = OverlapScoreBuilder(
    property_name="keyphrases",
)

# Collect all the transformations
transforms = [
    Parallel(
        ner_extractor,
        keyphrases_extractor
    ),
    Parallel(
        ner_overlap_sim,
        keyphrases_overlap_sim
    )
]

### 7. Apply the transformations to the knowledge graph

In the cell below, `apply_transforms` applies all the previously defined transformations, enriching the `knowledge graph` in the process.

Make sure your LLM provider is available, otherwise this will not work.
If you run a provider locally, both the LLM and the embedding model must also be installed.

In [None]:
apply_transforms(
    kg,
    transforms,
    run_config
)

In [None]:
# The knowledge graph should now have relationships
kg

### 8. Generating personas

https://docs.ragas.io/en/latest/howtos/customizations/testgenerator/_persona_generator/

- A **persona** is an entity/role which interacts with the system. **Personas** provide context and perspective, ensuring that **generated queries are natural, user-specific, and diverse**.

- Example: a Senior DevOps engineer, a Junior Data Scientist, a Marketing Manager in the context of an IT company

- A **Persona** object consists of a **name** and a **description**.

- The name is used to identify the persona and the description is used to describe the role of the persona.

- Note that personas can also be generated from a **knowledge graph** if you have one available.

In [None]:
# This example is taken from `RAGAs`:
# https://docs.ragas.io/en/latest/howtos/applications/singlehop_testset_gen/#configuring-personas-for-query-generation

persona_first_time_flier = Persona(
    name="First Time Flier",
    role_description="Is flying for the first time and may feel anxious. Needs clear guidance on flight procedures, safety protocols, and what to expect throughout the journey.",
)

persona_frequent_flier = Persona(
    name="Frequent Flier",
    role_description="Travels regularly and values efficiency and comfort. Interested in loyalty programs, express services, and a seamless travel experience.",
)

persona_angry_business_flier = Persona(
    name="Angry Business Class Flier",
    role_description="Demands top-tier service and is easily irritated by any delays or issues. Expects immediate resolutions and is quick to express frustration if standards are not met.",
)

personas: List[Persona] = [
    persona_first_time_flier,
    persona_frequent_flier,
    persona_angry_business_flier
]

### 9. Generate query types 

- https://docs.ragas.io/en/latest/concepts/test_data_generation/rag/
- https://docs.ragas.io/en/latest/references/synthesizers/

- There are two main types of queries in **RAGAs**:
    
    - **SingleHopQuery** where the **context** relevant for answering a question lies in a **single document/chunk**
    - **MultiHopQuery** where the **context** relevant for answering a question lies in **multiple documents/chunks**

- Additionally, for each of those queries there's a **Specific** or **Abstract** query variant:
    
    - **Specific** one which pertains to a **fact**.
 
        - Example: When did WW1 break out? (Can be precisely answered, there's no room for guessing/interpretation)
    
    - **Abstract** one which is more about testing the **reasoning** capabilities of the LLM. 

        - Example: Why did WW1 break out? (There's room for interpretation in this case)

- **Specific** vs. **Abstract Queries** in a RAG
    - Specific Query: Focuses on clear, fact-based retrieval. The goal in RAG is to retrieve highly relevant information from one or more documents that directly address the specific question.
    - Abstract Query: Requires a broader, more interpretive response. In RAG, abstract queries challenge the retrieval system to pull from documents that contain higher-level reasoning, explanations, or opinions, rather than simple facts.

![Query types in RAGAs](../../img/ragas/ragas_query_types.png  "Queries")

**Synthesizers** are responsible for **converting enriched nodes and personas into queries**. They achieve this by **selecting a node property (e.g., "entities" or "keyphrases"), pairing it with a persona, style, and query length**, and then using an LLM to generate a query-answer pair based on the content of the node.

* Query lengths may vary:
    - short
    - medium
    - long

* Query style:
    - misspelled
    - websearch-like
    - perfect-grammar
    - poor-grammar

Note that **synthesizers** can additionally be extended/modified by specifying custom **prompts**.
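
Synthesizers are combined into a query distribution, where each synthesizer gets a weight. The sketch below shows one way such weights could translate into sample counts for a given test set size, using a largest-remainder split. How RAGAs splits the samples internally may differ; the function `allocate_samples` and the synthesizer labels are assumptions for illustration.

```python
import math
from typing import List, Tuple


def allocate_samples(
    weights: List[Tuple[str, float]], testset_size: int
) -> List[Tuple[str, int]]:
    """Split testset_size across synthesizers proportionally to their weights."""
    total = sum(weight for _, weight in weights)
    if not math.isclose(total, 1.0):
        raise ValueError(f"weights must sum to 1.0, got {total}")

    exact = [(name, weight * testset_size) for name, weight in weights]
    counts = {name: math.floor(share) for name, share in exact}
    leftover = testset_size - sum(counts.values())

    # Hand out the remaining samples to the largest fractional parts first.
    # sorted() is stable, so ties keep the original order.
    by_remainder = sorted(
        exact, key=lambda item: item[1] - math.floor(item[1]), reverse=True
    )
    for name, _ in by_remainder[:leftover]:
        counts[name] += 1
    return [(name, counts[name]) for name, _ in weights]


allocation = allocate_samples(
    [
        ("single_hop_entities", 0.25),
        ("single_hop_keyphrases", 0.25),
        ("multi_hop_entities", 0.25),
        ("multi_hop_keyphrases", 0.25),
    ],
    testset_size=50,
)
print(allocation)
```

With four equal weights and 50 samples, the exact share is 12.5 each, so two synthesizers end up with 13 samples and two with 12.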


In [None]:
single_hop_specific_entities = SingleHopSpecificQuerySynthesizer(
    llm=ragas_llm,
    generate_query_reference_prompt=MySingleHopQAGenerationPrompt(),
    theme_persona_matching_prompt=MyThemesPersonasMatchingPrompt(),
    property_name="entities"
)

single_hop_specific_keyphrases = SingleHopSpecificQuerySynthesizer(
    llm=ragas_llm,
    generate_query_reference_prompt=MySingleHopQAGenerationPrompt(),
    theme_persona_matching_prompt=MyThemesPersonasMatchingPrompt(),
    property_name="keyphrases"
)

multi_hop_specific_entities = MultiHopSpecificQuerySynthesizer(
    llm=ragas_llm,
    generate_query_reference_prompt=MyMultiHopQAGenerationPrompt(),
    theme_persona_matching_prompt=MyThemesPersonasMatchingPrompt()
)

multi_hop_specific_keyphrases = MultiHopSpecificQuerySynthesizer(
    llm=ragas_llm,
    generate_query_reference_prompt=MyMultiHopQAGenerationPrompt(),
    relation_type="keyphrases_overlap",
    property_name="keyphrases",
    theme_persona_matching_prompt=MyThemesPersonasMatchingPrompt()
)

query_distribution = [
    (single_hop_specific_entities, 0.25),
    (single_hop_specific_keyphrases, 0.25),
    (multi_hop_specific_entities, 0.25),
    (multi_hop_specific_keyphrases, 0.25)
]

### 10. Generate the samples

- https://docs.ragas.io/en/latest/concepts/components/eval_sample/
- https://docs.ragas.io/en/latest/concepts/components/eval_dataset/
- https://docs.ragas.io/en/latest/references/generate/

#### Definition of evaluation sample

An evaluation sample is a single structured data instance that is used to assess and measure the performance of your LLM application in specific scenarios. It represents a single unit of interaction or a specific use case that the AI application is expected to handle. In Ragas, evaluation samples are represented using the `SingleTurnSample` and `MultiTurnSample` classes.

#### SingleTurnSample

`SingleTurnSample` represents a single-turn interaction between a user, LLM, and expected results for evaluation. It is suitable for evaluations that involve a single question and answer pair, possibly with additional context or reference information.

This type of sample is ideal for straightforward question-answering scenarios where a user asks a single question and expects a direct response.

![Scenario generation workflow RAGAs](../../img/ragas/scenario_rag.webp "Scenarios RAGAs")

In [None]:
generator = TestsetGenerator(
    ragas_llm,
    ragas_embeddings,
    kg,
    personas
)

dataset: Testset = generator.generate(
    testset_size=50,
    query_distribution=query_distribution,
    num_personas=len(personas),
    run_config=run_config,
    with_debugging_logs=True,
)

### 11. Upload to the cloud (Optional)

In [None]:
# Only upload if you have a token, otherwise you will not be authorized
if os.getenv("RAGAS_APP_TOKEN"):
    dataset.upload()

### 12. Fill out the missing fields in the dataset to complete it.

- If you are not using [R2R](https://r2r-docs.sciphi.ai/introduction), make sure you configure the parameters and your pipeline as needed.
- This notebook assumes you follow all the steps sequentially and use my application.

In [None]:
# VANILLA_RAG=False would mean RAG-Fusion
use_vanilla_rag: bool = bool(
    strtobool(os.getenv("VANILLA_RAG"))
)

if use_vanilla_rag:
    search_strategy: str = "vanilla"
else:
    search_strategy: str = "query_fusion" # RAG-Fusion

# Used after the context has been fetched to generate the final response
RAG_GENERATION_CONFIG: Final[Dict[str, Union[str, float, int]]] = {
    "model": f"ollama_chat/{os.getenv('CHAT_MODEL')}",
    "temperature": float(os.getenv("TEMPERATURE")),
    "top_p": float(os.getenv("TOP_P")),
    "max_tokens_to_sample": int(os.getenv("MAX_TOKENS"))
}

# Relevant during the retrieval phase for fetching relevant context
# Only cosine similarity is used by this project, however keyword-based search and Graph-RAG are also supported
SEARCH_SETTINGS: Final[Dict[str, Union[bool, int, str]]] = {
    "use_semantic_search": True,
    "limit": int(os.getenv("TOP_K")),
    "offset": 0,
    "include_metadatas": False,
    "include_scores": True,
    "search_strategy": search_strategy, # can be vanilla or RAG-fusion, (HyDE is also supported)
    "chunk_settings": {
        "index_measure": "cosine_distance",
        "enabled": True,
        "ef_search": 80
    }
}

if search_strategy == "query_fusion":
    # This is only relevant when using `hyde` or `rag-fusion`
    # Number of sub-queries (hypothetical documents for HyDE) to generate, by default it's 5 if not specified
    # https://r2r-docs.sciphi.ai/api-and-sdks/retrieval/rag-app
    SEARCH_SETTINGS['num_sub_queries'] = 5
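
RAG-Fusion generates several sub-queries, retrieves a ranked list for each, and merges the lists. A common way to merge them is reciprocal rank fusion (RRF), sketched below. Whether the `query_fusion` strategy merges results exactly this way is an assumption; the constant `k = 60` is the value conventionally used with RRF, and the chunk IDs are made up.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists; items ranked high in many lists win."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


# One ranked result list per generated sub-query (hypothetical IDs)
rankings_per_sub_query = [
    ["c1", "c2", "c3"],
    ["c2", "c1", "c4"],
    ["c2", "c5", "c1"],
]
fused = reciprocal_rank_fusion(rankings_per_sub_query)
print(fused[:3])
```

`c2` comes out on top because it is ranked first for two of the three sub-queries, while `c1` appears in all three lists but lower on average.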

In [None]:
# You can modify this template as needed
# This gets submitted to the LLM after the context has been fetched and re-ranked
TEMPLATE: Final[str] = """You are a helpful assistant in a Retrieval-Augmented Generation (RAG) system.
Your task is to answer the user's question using *only* the provided context below.

INSTRUCTIONS:
- Answer concisely and clearly using only the information from the context.
- If the context does not contain sufficient information to answer the question, state that clearly.
- DO NOT include citations, references, or mention the context itself.
- Do not speculate or make up information beyond what is in the context.

CONTEXT:
{context}

QUESTION:
{query}

ANSWER:
"""

In [None]:
# First we need to authenticate the admin user and receive a token
# The username and password are the default ones provided by `R2R`
# Note that you can overwrite those in the `config.toml` file
# If this fails it means there're either connectivity issues or the credentials are wrong
authentication: requests.Response = requests.post(
    url="http://localhost:7272/v3/users/login", # This may vary depending on your setup
    headers={
        "Content-Type": "application/x-www-form-urlencoded"
    },
    data="username=admin@example.com&password=change_me_immediately",
)
token: str = authentication.json()['results']['access_token']['token'] # Token for further authentication

In [None]:
# Retrieve the IDs of all currently ingested documents
documents: requests.Response = requests.get(
    url="http://localhost:7272/v3/documents",
    headers={
        "Authorization": f"Bearer {token}"
    }
)

doc_ids: List[str] = [document['id'] for document in documents.json()['results']]
print(f"Found {len(doc_ids)} documents")

# Delete all documents available
for doc_id in doc_ids:
    del_resp: requests.Response = requests.delete(
        url=f"http://localhost:7272/v3/documents/{doc_id}",
        headers={
            "Authorization": f"Bearer {token}"
        }
    )
    if del_resp.status_code == 200:
        print(f"Deleted document with ID: {doc_id}")
    else:
        print(f"Failed to delete document with ID: {doc_id}")

Since our experiments test different parameters like `chunk size` and `chunk overlap` we need to make sure that documents are re-ingested, so that we properly conduct each and every experiment.

In [None]:
# Payload for ingestion by unstructured
settings_resp: requests.Response = requests.get(
    url="http://localhost:7272/v3/system/settings"
)
ingestion_config: Dict = settings_resp.json()['results']['config']['ingestion']
ingestion_config['extra_fields'] = ingestion_config_request

print(ingestion_config['extra_fields'])

# Needed to serialize the ingestion config into a form field
import json

# Files getting ingested and saved into R2R
for i, filename in enumerate(os.listdir(DIR_PATH), 1):
    # Skip files that are not markdown or the README file
    if not filename.endswith(".md") or filename == "README.md":
        print(f"[{i}]. Skipping: {filename}")
        continue

    file_path: str = os.path.join(DIR_PATH, filename)

    # Read the file's content
    with open(file_path, "r", encoding="utf-8") as f:
        markdown_text = f.read()

    plaintext: str = markdown_to_plaintext(markdown_text)

    # Guess the content type (MIME type) based on file extension
    mime_type, _ = mimetypes.guess_type(file_path)
    if mime_type is None:
        mime_type = "application/octet-stream"  # fallback if unknown

    # Ingest file - extract text, chunk it, generate embeddings and finally store in vector store
    # `requests` ignores the `json` argument whenever `files` is passed, so the
    # extra fields are sent as multipart form fields via `data` instead.
    # Assumption: the endpoint expects `ingestion_config` as a JSON-encoded string.
    ingestion_resp: requests.Response = requests.post(
        url="http://localhost:7272/v3/documents",
        headers={
            "Authorization": f"Bearer {token}"
        },
        files={
            "file": (filename.replace(".md", ".txt"), plaintext, mime_type)
        },
        data={
            "metadata": "{}", # Feel free to add your own metadata
            "ingestion_mode": "custom",
            "ingestion_config": json.dumps(ingestion_config)
        }
    )

    if ingestion_resp.status_code == 202:
        print(f"[{i}]. Ingested: {filename}")
    else:
        print(f"[{i}]. Failed to ingest {filename}: {ingestion_resp.status_code}")
        print(ingestion_resp.json())

In [None]:
def extract_deepseek_response(full_response: str) -> str:
    """
    Extract the actual response from deepseek-r1 output by ignoring the <think>...</think> section.
    """
    if "</think>" not in full_response:
        raise ValueError("Response from deepseek-r1 is not full!")

    strings: List[str] = full_response.split("</think>")
    answer_without_section: str = strings[-1].lstrip()
    return answer_without_section

The cell below actually fills out the rest of the fields - `response` and `retrieved_contexts`.
This is where you could use your own pipeline and create synthetic datasets.

In [None]:
# Some debugging info
print(f"""{'='*80}\nGenerating dataset in ../datasets/{test_id}_dataset.jsonl
TOP_K={int(os.getenv("TOP_K"))}
CHUNK_SIZE={int(os.getenv("CHUNK_SIZE"))}
CHUNK_OVERLAP={int(os.getenv("CHUNK_OVERLAP"))}
CHAT_MODEL={os.getenv("CHAT_MODEL")}
VANILLA_RAG={"True" if search_strategy == "vanilla" else "False"}
{'='*80}
""")

# Filling out the rest of our dataset, for each individual entry
# This is where you can employ your own RAG pipeline
for i, sample in enumerate(dataset.samples):
    # [1] Embed the `user_input`
    # [2] Perform semantic similarity search fetching the top-k most relevant contexts
    # [3] Re-rank based on relevance relative to `user_input`
    context: requests.Response = requests.post(
        url="http://localhost:7272/v3/retrieval/search",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}"
        },
        json={
            "query": sample.eval_sample.user_input,
            "search_settings": SEARCH_SETTINGS,
            "search_mode": "custom"
        }
    )
    # Extract the relevant context (if any)
    retrieved_chunks = [el['text'] for el in context.json()['results']['chunk_search_results'] if el['text']]
    
    # [4] Use the template defined above and replace placeholders dynamically
    user_message: str = TEMPLATE.format(
        context="\n".join(retrieved_chunks),
        query=sample.eval_sample.user_input
    )
    
    # [5] Submit the augmented prompt to the LLM
    rag_response: requests.Response = requests.post(
        url="http://localhost:7272/v3/retrieval/completion",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}"
        },
        json={
            "messages": [
                {
                    "role": "system",
                    "content": "You are a helpful assistant in a Retrieval-Augmented Generation (RAG) system. Your task is to answer the user's questions using only the context provided."
                },
                {
                    "role": "user",
                    "content": user_message
                }
            ],
            "generation_config": RAG_GENERATION_CONFIG
        }
    )
    if rag_response.status_code != 200:
        raise Exception(f"Request failed {rag_response.json()}")
    
    # [6] LLM generates the response
    response: str = rag_response.json()['results']['choices'][0]['message']['content']
    
    # If deepseek-r1 is used remove the content between the <think> tags
    # (the empty default avoids a TypeError when CHAT_MODEL is not set)
    if "deepseek-r1" in os.getenv("CHAT_MODEL", ""):
        response = extract_deepseek_response(response)

    # [7] Augment dataset
    sample.eval_sample.response = response
    sample.eval_sample.retrieved_contexts = retrieved_chunks

    print(f"Added data to sample: {i + 1} out of {len(dataset.samples)}")

# Persist the complete dataset
# Create the directory if it doesn't exist
os.makedirs("../datasets", exist_ok=True)
dataset.to_jsonl(path=f"../datasets/{test_id}_dataset.jsonl")

print(f"{'='*80}\nGenerated dataset in ../datasets/{test_id}_dataset.jsonl")
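
Before moving on, it can be worth checking that every row of the saved file has all fields filled in. The sketch below is one way to do that with the standard library only. It assumes each JSONL row is a flat object using the field names from section 0; the helper `find_incomplete_rows` and the demo rows are made up for illustration.

```python
import json
import os
import tempfile
from typing import Dict, List

REQUIRED_FIELDS = (
    "user_input",
    "reference",
    "reference_contexts",
    "response",
    "retrieved_contexts",
)


def find_incomplete_rows(path: str) -> Dict[int, List[str]]:
    """Map each line number to the required fields that are missing or empty."""
    problems: Dict[int, List[str]] = {}
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            missing = [name for name in REQUIRED_FIELDS if not row.get(name)]
            if missing:
                problems[line_no] = missing
    return problems


# Demo on a throwaway file with one complete and one incomplete row
demo_rows = [
    {"user_input": "q1", "reference": "a1", "reference_contexts": ["ctx"],
     "response": "r1", "retrieved_contexts": ["ctx"]},
    {"user_input": "q2", "reference": "a2", "reference_contexts": ["ctx"],
     "response": "", "retrieved_contexts": []},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_dataset.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        for row in demo_rows:
            f.write(json.dumps(row) + "\n")
    problems = find_incomplete_rows(demo_path)

print(problems)
```

To check a real run, call `find_incomplete_rows` with the path of the generated dataset; an empty result means every sample was completed.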

### 13. Synthetic Data Generation completed

Congratulations!

With this you have created your own synthetic goldens, filled in the missing fields using various configurations, and saved all the data locally. You can now move on to the next notebook, which covers the metrics I consider relevant and evaluates the application across the different experiments.