### Importance of High-Quality Test Dataset for RAG Systems

### Purpose
Test sets are critical for:
- Accurately measuring RAG system performance
- Identifying system strengths and weaknesses
- Guiding continuous improvement

### Key Evaluation Dimensions
1. **Retrieval Effectiveness**: Assess how well relevant context is retrieved
2. **Generation Quality**: Evaluate the accuracy and coherence of generated responses
3. **Contextual Relevance**: Measure how well the system understands and integrates retrieved information

### Evaluation Goals
- Benchmark system performance
- Detect hallucinations
- Validate generalization capabilities
- Simulate real-world query complexity

### Best Practices
- Use diverse query types
- Cover multiple domains
- Include edge cases
- Create reproducible test scenarios
- Include enough samples to draw statistically significant conclusions
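
How many samples count as "enough" depends on the precision you want. As a rough sketch, assuming the metric is a pass/fail proportion and that a normal approximation is acceptable (both are assumptions of this example, not requirements of the notebook), the 95% margin of error for `n` samples can be estimated like this:

```python
import math


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Approximate margin of error for a proportion p measured on n samples."""
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(p * (1.0 - p) / n)


def required_samples(target_margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """Smallest n whose margin of error does not exceed target_margin."""
    if target_margin <= 0:
        raise ValueError("target_margin must be positive")
    return math.ceil((z ** 2) * p * (1.0 - p) / target_margin ** 2)


# p=0.5 is the worst case, so these are conservative estimates
for n in (50, 200, 1000):
    print(f"n={n:>4}: +/- {margin_of_error(n):.3f}")
```

With 50 samples the worst-case margin is roughly ±0.14, so small differences between experiments should be interpreted with care.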

### Setup

* Execute the `setup.sh` script in the root of the `evaluation` folder.

* If executing the script fails, run `chmod u+x setup.sh` and try again.

* The script installs all required dependencies.

* Finally, make sure you select the environment in the notebook by specifying `eval` as the kernel.

---

*(OPTIONAL STEP)*

**RAGAs** provides a cloud platform where a dataset and evaluation results can be stored and viewed. To use it follow this link: [RAGAs.io](https://app.ragas.io/).

* Sign-up

* Retrieve the **token**

* Create a `.env` file with the following content:

```bash
RAGAS_APP_TOKEN=apt.......-9f6ed
```

### 0. Configuration for generating the goldens

Goldens are samples containing the `reference`, `reference_contexts` and `user_input` fields. To fill in the remaining fields, `response` and `retrieved_contexts`, I use several configurations, which can be found at the root of the project in *experiments.csv*.

The goldens are going to be saved under `./goldens`.

Configuration for the generation of *goldens*:

* To generate the goldens for all experiments I use the following parameters:
    
    * Generation model - `llama3.1:8b-instruct-q4_1`

    * Temperature - `0.0`

    * The rest of the parameters vary depending on the experiment configuration

* For filling out the rest of the fields:

    * Generation model - depends on the experiment configuration

    * Temperature - depends on the experiment configuration

    * The rest of the parameters will vary depending on the experiment configuration

* The goal is to use the stronger instruction-following capabilities of `llama3.1:8b-instruct-q4_1` to generate a synthetic dataset that is as *clean* and as free of factual or logical errors as possible, and then to run the **RAG** application with various settings and compare the results.

* I think that using the same `model` and `temperature` to generate the *goldens* for every experiment keeps the comparison fair.


### 1. Retrieve data:
* The data can be a dataset from `huggingface` or any other platform.

* Alternatively, it can be files available on disk (pdf, md, etc.).

* One can also use `AsyncHtmlLoader` from `langchain` to scrape pages from the internet.
    - **When scraping the web, take care not to violate any terms and conditions!**

**NOTE**:

* Make sure you install the requirements first if you want to test the notebook.

    * To do so run the `setup.sh` in the parent folder.

* Make sure you select the proper environment as your kernel: `eval`. 

In [1]:
# For this notebook I will use a dataset provided by RAGAs.
# The dataset contains markdown files about a fictional airline company.
! git clone https://huggingface.co/datasets/explodinggradients/ragas-airline-dataset data

fatal: destination path 'data' already exists and is not an empty directory.


### 2. Load data into document objects

For extracting data from documents and splitting them into chunks one can make use of various frameworks:

* `langchain` <- The one I use

* `llamaindex` <- Another popular solution

Both frameworks provide various abstractions to load documents, extract data and split into chunks.

In [2]:
import os
from typing import Final

from dotenv import load_dotenv
from langchain.docstore.document import Document
from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the RAG parameters
load_dotenv("../../env/rag.env")

# Path to the folder where the documents are stored.
# Adjust it if you work with other documents.
DIR_PATH: Final[str] = "data/"
loader = DirectoryLoader(
    DIR_PATH,
    glob="**/*.md",
    exclude="README.md"
)

# The R2R framework uses a recursive character text splitter to split the documents.
# I use the same splitter here to mimic that behavior.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=int(os.getenv("CHUNK_SIZE")),
    chunk_overlap=int(os.getenv("CHUNK_OVERLAP")),
    length_function=len
)

docs: list[Document] = loader.load_and_split(splitter)
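
To make the roles of `CHUNK_SIZE` and `CHUNK_OVERLAP` concrete, here is a deliberately simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences); the function name `chunk_text` is hypothetical and only illustrates the two parameters.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=0))
```

A larger overlap repeats more text between neighbouring chunks, which can help retrieval when an answer straddles a chunk boundary, at the cost of more chunks to embed.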

### 3. Construct knowledge graph

- A **knowledge graph** is a fundamental concept in **RAGAs** and underpins its **automatic synthetic data generation**.

- Initially, a **knowledge graph** consists of **Node**s, which represent **documents/chunks**: their content and, optionally, metadata.

- Thereafter, one can enrich the graph by applying **transformations**, which are implemented as **Extractor**s, **Splitter**s and/or **RelationshipBuilder**s. Extractors gather relevant data from the documents and add it to the nodes as additional attributes, while relationship builders use those attributes to logically connect two or more nodes.

- Finally, the graph is used to generate so-called **Scenario**s and can also be used to generate **Persona**s.

![Knowledge graph creation workflow RAGAs](../../img/ragas/kg_rag.webp "Knowledge graph RAGAs")
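
The enrichment idea can be sketched without the library. The toy classes below (`ToyNode`, `ToyRelationship`, `ToyGraph`) and the keyword-based extractor are hypothetical stand-ins, not the RAGAs classes, and the three sentences are made up for illustration. The sketch shows how an extractor adds a property to each node and how a relationship builder then links nodes that share extracted items.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    node_id: int
    properties: dict = field(default_factory=dict)


@dataclass
class ToyRelationship:
    source: int
    target: int
    type: str
    properties: dict = field(default_factory=dict)


@dataclass
class ToyGraph:
    nodes: list = field(default_factory=list)
    relationships: list = field(default_factory=list)


def toy_extractor(node: ToyNode, vocabulary: set) -> None:
    """Stand-in for an LLM-based extractor: tag the node with known terms it mentions."""
    text = node.properties["page_content"].lower()
    node.properties["entities"] = sorted(term for term in vocabulary if term in text)


def toy_relationship_builder(graph: ToyGraph) -> None:
    """Connect every pair of nodes that shares at least one extracted entity."""
    for i, left in enumerate(graph.nodes):
        for right in graph.nodes[i + 1:]:
            shared = set(left.properties["entities"]) & set(right.properties["entities"])
            if shared:
                graph.relationships.append(
                    ToyRelationship(left.node_id, right.node_id, "entities_overlap",
                                    {"overlapped_items": sorted(shared)})
                )


graph = ToyGraph()
texts = [
    "Baggage fees apply to checked baggage over 23 kg.",
    "A refund is issued when a flight is cancelled.",
    "Lost baggage claims may lead to a refund.",
]
for idx, text in enumerate(texts):
    graph.nodes.append(ToyNode(idx, {"page_content": text}))

for node in graph.nodes:
    toy_extractor(node, {"baggage", "refund", "flight"})
toy_relationship_builder(graph)

for rel in graph.relationships:
    print(rel.source, "->", rel.target, rel.properties["overlapped_items"])
```

Nodes that end up connected are the ones a multi-hop query can later be built from, since answering it requires both chunks.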

In [3]:
from ragas.testset.graph import (
    Node,
    NodeType,
    KnowledgeGraph,
)

kg = KnowledgeGraph()

for doc in docs:
    kg.add(
        Node(
            type=NodeType.CHUNK, # Since we already split the documents, we can use the chunk type.
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )

### 4. Instantiate required objects

- **RAGAs** requires an **LLM** and an **embedding model**, depending on the **Transformation**s one would like to apply to the **knowledge graph**. Both models must be wrapped in *wrapper* objects. `langchain`, `llama-index`, `haystack`, etc. are supported.

- Additionally, a **RunConfig** can be used to modify the default behaviour of the framework, for example timeout values and the maximum number of retries for failed operations.
    - **NOTE**: depending on the LLM and GPU you may need to increase the `timeout` value, otherwise you will run into a `TimeoutException`

- Lastly, **RAGAs** ships a single implementation for caching intermediate steps on disk: the **DiskCacheBackend** class.
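
To illustrate what settings such as `max_retries` and `max_wait` control, here is a minimal retry loop with capped exponential backoff. It is a sketch of the general technique only, not the framework's implementation; `call_with_retries`, `base_wait` and the injectable `sleep` argument are hypothetical names chosen for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    max_retries: int = 3,
    max_wait: float = 8.0,
    base_wait: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on failure with a wait that doubles up to max_wait."""
    attempt = 0
    while True:
        try:
            return operation()
        except Exception:
            attempt += 1
            if attempt > max_retries:
                raise  # Give up and surface the last error
            sleep(min(max_wait, base_wait * 2 ** (attempt - 1)))


# A fake operation that times out twice before succeeding
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


waits: list = []  # Record the waits instead of actually sleeping
result = call_with_retries(flaky, max_retries=5, max_wait=1.5, sleep=waits.append)
print(result, waits)
```

The cap matters with slow local models: without it, the wait between attempts grows without bound as retries accumulate.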

In [4]:
from langchain_ollama import ChatOllama, OllamaEmbeddings

from ragas import RunConfig, DiskCacheBackend
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

test_id: str = input("Enter the test id (Ex. 1): ")

run_config = RunConfig(
    timeout=86400,    # Wait up to 24 hours (in seconds) for a single operation
    max_retries=20,   # Max retries before giving up
    max_wait=600,     # Max wait between retries
    max_workers=4,    # Concurrent requests
    log_tenacity=True # Print retry attempts
)

# This stores data generation and evaluation results locally on disk
# When using it for the first time, it will create a `.cache-<test_id>` folder
# When using it again, it will read from that folder and finish almost instantly
cacher = DiskCacheBackend(cache_dir=f".cache-{test_id}")

ollama_llm = ChatOllama(
    model=os.getenv("DATA_GENERATION_MODEL"),
    base_url="http://localhost:11434",
    temperature=float(os.getenv("DATA_GENERATION_TEMPERATURE")),
    num_ctx=int(os.getenv("LLM_CONTEXT_WINDOW_TOKENS")),
    format="json" # We need to enforce JSON output, since most outputs would be validated by a pydantic model
)

ollama_embeddings = OllamaEmbeddings(
    model=os.getenv("EMBEDDING_MODEL"),
    base_url="http://localhost:11434"
)

ragas_llm = LangchainLLMWrapper(
    langchain_llm=ollama_llm,
    run_config=run_config,
    cache=cacher
)

ragas_embeddings = LangchainEmbeddingsWrapper(
    embeddings=ollama_embeddings,
    run_config=run_config,
    cache=cacher
)

### 5. Create the transformation pipeline

The sequence of transformations:

1. Named Entity Recognition (NER) and keyphrase extraction
    - `NERExtractor` identifies and extracts named entities (e.g., people, organizations, locations).

    - `KeyphrasesExtractor` extracts the main keyphrases found in the text.

2. Entity and keyphrase overlap builders (both are `OverlapScoreBuilder` instances in the cell below)
    - Used to establish a relationship between nodes containing similar:
        
        - entities
        - keyphrases

3. Parallel Processing for Efficiency:
    - Certain transformations can run in parallel to improve performance.

- Final Outcome:
    - A structured set of document transformations that extract valuable information
    - Used to enrich the knowledge graph for further generation of scenarios and finally samples

**NOTE:** Some of the extractors *(the LLM-based ones)* accept an optional `prompt`, which one can use to modify the workflow. For instance, the `NERExtractor` can receive a custom prompt whose instructions differ from the original one, so that entities are extracted in a different way.
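
The staging idea (independent extractors run side by side, dependent steps run afterwards) can be sketched with plain `asyncio`. The two extractor functions below are hypothetical rule-based placeholders for LLM calls and the input sentence is invented; the sketch only shows the concurrency pattern, not how the library schedules its transforms.

```python
import asyncio


async def extract_entities(chunk: str) -> list[str]:
    """Placeholder for an LLM call: treat capitalised words as entities."""
    await asyncio.sleep(0)  # Yield control, as a real network call would
    return [word.strip(".,") for word in chunk.split() if word[0].isupper()]


async def extract_keyphrases(chunk: str) -> list[str]:
    """Placeholder for an LLM call: treat long words as keyphrases."""
    await asyncio.sleep(0)
    words = [word.strip(".,").lower() for word in chunk.split()]
    return [word for word in words if len(word) > 6]


async def run_stage(chunk: str) -> dict:
    # Both extractors only read the chunk, so they can run concurrently
    entities, keyphrases = await asyncio.gather(
        extract_entities(chunk), extract_keyphrases(chunk)
    )
    return {"entities": entities, "keyphrases": keyphrases}


# Inside a notebook, use `await run_stage(...)` because an event loop is already running
result = asyncio.run(run_stage("Passengers may rebook via the Helpdesk in Berlin."))
print(result)
```

The overlap builders form a second stage because they need the `entities` and `keyphrases` properties produced by the first one.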

In [5]:
from ragas.testset.transforms import (
    Parallel,
    OverlapScoreBuilder,
    KeyphrasesExtractor,
)
from ragas.testset.transforms.extractors import NERExtractor

from prompts.extractors.custom_ner_prompt import MyNERPrompt
from prompts.extractors.custom_keyphrases_prompt import MyKeyphrasesExtractorPrompt

ner_extractor = NERExtractor(
    llm=ragas_llm,
    prompt=MyNERPrompt(
        name="custom_ner_extractor_prompt"
    ),
    max_num_entities=15
)

keyphrases_extractor = KeyphrasesExtractor(
    llm=ragas_llm,
    prompt=MyKeyphrasesExtractorPrompt(
        name="custom_keyphrases_extractor_prompt"
    ),
    max_num=15
)

ner_overlap_sim = OverlapScoreBuilder()

keyphrases_overlap_sim = OverlapScoreBuilder(
    property_name="keyphrases",
)

transforms = [
    Parallel(
        ner_extractor,
        keyphrases_extractor
    ),
    Parallel(
        ner_overlap_sim,
        keyphrases_overlap_sim
    )
]

### 6. Apply the transformations to the knowledge graph

In the cell below, `apply_transforms` applies all the previously defined transformations, enriching the knowledge graph in the process.

In [6]:
from ragas.testset.transforms import apply_transforms

apply_transforms(
    kg,
    transforms,
    run_config
)

Applying [NERExtractor, KeyphrasesExtractor]:   0%|          | 0/138 [00:00<?, ?it/s]

Applying [OverlapScoreBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

### 7. Generating personas

- A **Persona** is an entity/role which interacts with the system. **Personas** provide context and perspective, ensuring that **generated queries are natural, user-specific, and diverse**.

- Example: a Senior DevOps engineer, a Junior Data Scientist, a Marketing Manager in the context of an IT company

- A **Persona** object consists of a **name** and a **role description** (`role_description`).
    
    - The name identifies the persona and the role description describes its role.

- Note that personas can also be generated from a **knowledge graph** if you have one available

In [7]:
from ragas.testset.persona import Persona

# This example is taken from `RAGAs`:
# https://docs.ragas.io/en/latest/howtos/applications/singlehop_testset_gen/#configuring-personas-for-query-generation

persona_first_time_flier = Persona(
    name="First Time Flier",
    role_description="Is flying for the first time and may feel anxious. Needs clear guidance on flight procedures, safety protocols, and what to expect throughout the journey.",
)

persona_frequent_flier = Persona(
    name="Frequent Flier",
    role_description="Travels regularly and values efficiency and comfort. Interested in loyalty programs, express services, and a seamless travel experience.",
)

persona_angry_business_flier = Persona(
    name="Angry Business Class Flier",
    role_description="Demands top-tier service and is easily irritated by any delays or issues. Expects immediate resolutions and is quick to express frustration if standards are not met.",
)

personas: list[Persona] = [
    persona_first_time_flier,
    persona_frequent_flier,
    persona_angry_business_flier
]

### 8. Generate query types

- There are two main types of queries in **RAGAs**:
    
    - **SingleHopQuery** where the **context** relevant for answering a question lies in a **single document/chunk**

    - **MultiHopQuery** where the **context** relevant for answering a question lies in **multiple documents/chunks**

- Additionally, for each of those queries there's a **Specific** or **Abstract** query variant:
    
    - **Specific** one which pertains to a **fact**. 

        - Example: When did WW1 break out? (Can be precisely answered, there's no room for guessing/interpretation)
    
    - **Abstract** one which is more about testing the **reasoning** capabilities of the LLM. 

        - Example: Why did WW1 break out? (There's room for interpretation in this case)

- **Specific** vs. **Abstract Queries** in a RAG
    - Specific Query: Focuses on clear, fact-based retrieval. The goal in RAG is to retrieve highly relevant information from one or more documents that directly address the specific question.

    - Abstract Query: Requires a broader, more interpretive response. In RAG, abstract queries challenge the retrieval system to pull from documents that contain higher-level reasoning, explanations, or opinions, rather than simple facts.

![Query types in RAGAs](../../img/ragas/ragas_query_types.png "Queries")

**Synthesizers** are responsible for **converting enriched nodes and personas into queries**. They achieve this by **selecting a node property (e.g., "entities" or "keyphrases"), pairing it with a persona, style, and query length**, and then using an LLM to generate a query-answer pair based on the content of the node.

* Query lengths may vary:
    - short
    - medium
    - long

* Query style:
    - misspelled
    - websearch-like
    - perfect-grammar
    - poor-grammar

Note that **synthesizers** can additionally be extended/modified by specifying custom **prompts**.
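
The pairing step can be pictured as building small scenario records. In the sketch below the themes are invented and the combination is chosen at random with a fixed seed; the synthesizers in the next cell instead rely on a theme-persona matching prompt, so treat `ToyScenario` and `sample_scenarios` as hypothetical illustrations only.

```python
import random
from dataclasses import dataclass

QUERY_LENGTHS = ("short", "medium", "long")
QUERY_STYLES = ("misspelled", "websearch-like", "perfect-grammar", "poor-grammar")


@dataclass(frozen=True)
class ToyScenario:
    theme: str    # An entity or keyphrase taken from a node property
    persona: str  # Who is asking
    length: str   # One of QUERY_LENGTHS
    style: str    # One of QUERY_STYLES


def sample_scenarios(themes: list[str], personas: list[str], n: int, seed: int = 0) -> list[ToyScenario]:
    """Randomly pair a theme with a persona, a query length and a query style."""
    rng = random.Random(seed)  # Seeded so that the scenarios are reproducible
    return [
        ToyScenario(
            theme=rng.choice(themes),
            persona=rng.choice(personas),
            length=rng.choice(QUERY_LENGTHS),
            style=rng.choice(QUERY_STYLES),
        )
        for _ in range(n)
    ]


scenarios = sample_scenarios(
    themes=["baggage allowance", "flight delay"],
    personas=["First Time Flier", "Frequent Flier"],
    n=3,
)
for scenario in scenarios:
    print(scenario)
```

Each scenario would then be handed to an LLM together with the node content to produce the actual query and its reference answer.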


In [8]:
from ragas.testset.synthesizers.multi_hop.specific import MultiHopSpecificQuerySynthesizer
from ragas.testset.synthesizers.single_hop.specific import SingleHopSpecificQuerySynthesizer

from prompts.synthesizers.custom_themes_matching import MyThemesPersonasMatchingPrompt
from prompts.synthesizers.custom_multi_hop_qa_generation import MyMultiHopQAGenerationPrompt
from prompts.synthesizers.custom_single_hop_qa_generation import MySingleHopQAGenerationPrompt

single_hop_specific_entities = SingleHopSpecificQuerySynthesizer(
    llm=ragas_llm,
    generate_query_reference_prompt=MySingleHopQAGenerationPrompt(),
    theme_persona_matching_prompt=MyThemesPersonasMatchingPrompt(),
    property_name="entities"
)

single_hop_specific_keyphrases = SingleHopSpecificQuerySynthesizer(
    llm=ragas_llm,
    generate_query_reference_prompt=MySingleHopQAGenerationPrompt(),
    theme_persona_matching_prompt=MyThemesPersonasMatchingPrompt(),
    property_name="keyphrases"
)

multi_hop_specific_entities = MultiHopSpecificQuerySynthesizer(
    llm=ragas_llm,
    generate_query_reference_prompt=MyMultiHopQAGenerationPrompt(),
    theme_persona_matching_prompt=MyThemesPersonasMatchingPrompt()
)

multi_hop_specific_keyphrases = MultiHopSpecificQuerySynthesizer(
    llm=ragas_llm,
    generate_query_reference_prompt=MyMultiHopQAGenerationPrompt(),
    relation_type="keyphrases_overlap",
    property_name="keyphrases",
    theme_persona_matching_prompt=MyThemesPersonasMatchingPrompt()
)

query_distribution = [
    (single_hop_specific_entities, 0.25),
    (single_hop_specific_keyphrases, 0.25),
    (multi_hop_specific_entities, 0.25),
    (multi_hop_specific_keyphrases, 0.25)
]

### 9. Generate the samples

#### Definition of evaluation sample

An evaluation sample is a single structured data instance that is used to assess and measure the performance of your LLM application in specific scenarios. It represents a single unit of interaction or a specific use case that the AI application is expected to handle. In Ragas, evaluation samples are represented using the `SingleTurnSample` and `MultiTurnSample` classes.

#### SingleTurnSample

`SingleTurnSample` represents a single-turn interaction between a user, LLM, and expected results for evaluation. It is suitable for evaluations that involve a single question and answer pair, possibly with additional context or reference information.

This type of sample is ideal for straightforward question-answering scenarios where a user asks a single question and expects a direct response.
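
The split between a golden and a complete sample can be shown with a small stand-in class. `ToySingleTurnSample` below is a hypothetical dataclass that only mirrors the field names used in this notebook; it is not the Ragas class, and the example values are invented.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToySingleTurnSample:
    # Fields that make up a golden
    user_input: str
    reference: str
    reference_contexts: list[str]
    # Fields filled in later by running the RAG application
    response: Optional[str] = None
    retrieved_contexts: list[str] = field(default_factory=list)

    def is_complete(self) -> bool:
        """A sample is ready for evaluation once the RAG output has been added."""
        return self.response is not None and len(self.retrieved_contexts) > 0


golden = ToySingleTurnSample(
    user_input="How much checked baggage may I bring?",
    reference="One bag of up to 23 kg is included.",
    reference_contexts=["Each passenger may check one bag of up to 23 kg."],
)
before = golden.is_complete()

# Later, after querying the RAG application
golden.response = "You may check one bag weighing up to 23 kg."
golden.retrieved_contexts = ["Each passenger may check one bag of up to 23 kg."]
after = golden.is_complete()

print(before, after)
```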

#### MultiTurnSample

`MultiTurnSample` represents a multi-turn interaction between Human, AI and optionally a Tool and expected results for evaluation. It is suitable for representing conversational agents in more complex interactions for evaluation.

In `MultiTurnSample`, the `user_input` attribute represents a sequence of messages that collectively form a multi-turn conversation between a human user and an AI system. These messages are instances of the classes `HumanMessage`, `AIMessage`, and `ToolMessage`.

This type of sample is designed for evaluating more complex conversational flows where multiple turns of dialogue occur, potentially involving tool usage for gathering additional information.

![Scenario generation workflow RAGAs](../../img/ragas/scenario_rag.webp "Scenarios RAGAs")

In [9]:
from ragas.testset import TestsetGenerator, Testset

generator = TestsetGenerator(
    ragas_llm,
    ragas_embeddings,
    kg,
    personas
)

dataset: Testset = generator.generate(
    testset_size=50,
    query_distribution=query_distribution,
    num_personas=len(personas),
    run_config=run_config,
    with_debugging_logs=True,
)

Generating Scenarios:   0%|          | 0/4 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/52 [00:00<?, ?it/s]
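
The progress bar reports 52 samples although `testset_size=50` was requested. One plausible explanation, stated here as an assumption rather than confirmed library behaviour, is that each synthesizer's share is rounded up individually. The hypothetical helper below reproduces the arithmetic:

```python
import math


def split_by_rounding_up(testset_size: int, probabilities: list[float]) -> list[int]:
    """Give each synthesizer ceil(testset_size * probability) samples."""
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    return [math.ceil(testset_size * p) for p in probabilities]


splits = split_by_rounding_up(50, [0.25, 0.25, 0.25, 0.25])
print(splits, sum(splits))
```

If an exact count matters, pick a `testset_size` that divides evenly across the distribution, or trim the dataset afterwards.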

### 10. Save dataset containing goldens (no actual output and retrieval context)

In [10]:
dataset.to_jsonl(f"./goldens/{test_id}_goldens.jsonl")

### 11. Upload to the cloud (Optional)

* To upload the data to **app.ragas.io** make sure you:
    * First create an account
    * Get an **API token**
    * Finally, create a `.env` file in the parent folder like so and load it in your notebook:

```bash
RAGAS_APP_TOKEN=apt.1234a-......-9dfew
```

In [11]:
from dotenv import load_dotenv

# Load the token
load_dotenv("../.env")

dataset.upload()

Testset uploaded! View at https://app.ragas.io/dashboard/alignment/testset/97c61f25-cde0-456c-b24c-45ebf86369ef


'https://app.ragas.io/dashboard/alignment/testset/97c61f25-cde0-456c-b24c-45ebf86369ef'

### 12. Fill out the missing fields in the dataset to complete it.

**NOTE**:

- Make sure you set the proper values under `env/rag.env` relative to the experiment id you want to carry out.
- For each experiment provide an intuitive name to distinguish between the experiments
    - In my case I just use the experiment id
- If you are not using `R2R` make sure you configure the parameters and your pipeline as needed. This notebook assumes you follow along all the steps sequentially and make use of `R2R`.

In [12]:
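import os
from typing import Dict, Union, Final

# `distutils` was removed in Python 3.12, so the flag is parsed manually
# using the same truthy strings that `strtobool` accepted; anything else is False.
# VANILLA_RAG=False would mean RAG-Fusion
use_vanilla_rag: bool = os.getenv("VANILLA_RAG", "false").strip().lower() in {
    "y", "yes", "t", "true", "on", "1"
}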

if use_vanilla_rag:
    search_strategy: str = "vanilla"
else:
    search_strategy: str = "query_fusion" # RAG-Fusion

# Used after the context has been fetched to generate the final response
RAG_GENERATION_CONFIG: Final[Dict[str, Union[str, float, int]]] = {
    "model": f"ollama_chat/{os.getenv('CHAT_MODEL')}",
    "temperature": float(os.getenv("TEMPERATURE")),
    "top_p": float(os.getenv("TOP_P")),
    "max_tokens_to_sample": int(os.getenv("MAX_TOKENS")),
}

# Relevant during the retrieval phase for fetching relevant context
SEARCH_SETTINGS: Final[Dict[str, Union[bool, int, str]]] = {
    "use_semantic_search": True,
    "limit": int(os.getenv("TOP_K")),
    "offset": 0,
    "include_metadatas": False,
    "include_scores": True,
    "search_strategy": search_strategy, # can be vanilla or hyde, (fusion is also supported)
    "chunk_settings": {
        "index_measure": "cosine_distance",
        "enabled": True,
        "ef_search": 80
    }
}

if search_strategy == "query_fusion":
    # This is only relevant when using `hyde` or `rag-fusion`
    # Number of hypothetical documents to generate, by default it's 5 if not specified
    # https://r2r-docs.sciphi.ai/api-and-sdks/retrieval/rag-app
    SEARCH_SETTINGS['num_sub_queries'] = 5

In [13]:
# You can modify this template as needed
TEMPLATE: Final[str] = """You are a helpful RAG assistant.
Your task is to provide an answer given a question using the context.
Please make sure the answer is complete and relevant to the question.

**IMPORTANT:
1. BASE YOUR ANSWER ONLY ON THE GIVEN CONTEXT.
2. IF THE CONTEXT IS NOT ENOUGH TO ANSWER THE QUESTION, SAY THAT YOU CANNOT ANSWER BASED ON THE AVAILABLE INFORMATION.
3. DO NOT INCLUDE CITATIONS OR REFERENCES TO SPECIFIC LINES OR PARTS OF THE CONTEXT.
4. ALWAYS KEEP YOUR ANSWER RELEVANT AND FOCUSED ON THE QUESTION.
5. DO NOT PROVIDE ANY ADDITIONAL INFORMATION EXCEPT THE ANSWER.
**

### CONTEXT:
{context}

### QUESTION:
{query}

### ANSWER:
"""

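The template is sent to the server as `task_prompt`, so the placeholders are filled there. As a local illustration of what that substitution amounts to, the sketch below fills a shortened copy of the template with `str.format`. How the server actually joins the retrieved chunks is not shown in this notebook, so the blank-line separator and the names `SHORT_TEMPLATE` and `render_prompt` are assumptions.

```python
SHORT_TEMPLATE = """### CONTEXT:
{context}

### QUESTION:
{query}

### ANSWER:
"""


def render_prompt(template: str, query: str, contexts: list[str]) -> str:
    """Fill the {context} and {query} placeholders of a prompt template."""
    # Assumption: chunks are separated by a blank line
    context_block = "\n\n".join(contexts)
    return template.format(context=context_block, query=query)


prompt = render_prompt(
    SHORT_TEMPLATE,
    query="Can I change my seat after booking?",
    contexts=["Seats can be changed online.", "Changes close two hours before departure."],
)
print(prompt)
```

Note that `str.format` treats every pair of braces as a placeholder, so literal braces in a custom template would have to be doubled.
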
In [14]:
import requests

# First we need to authenticate the admin user and receive a token
# The username and password are the default ones provided by `R2R`
# Note that you can overwrite those in the `config.toml` file
# If this fails, there are either connectivity issues or the credentials are wrong
authentication: requests.Response = requests.post(
    url="http://localhost:7272/v3/users/login", # This may vary depending on your setup
    headers={
        "Content-Type": "application/x-www-form-urlencoded"
    },
    data="username=admin@example.com&password=change_me_immediately",
)
token: str = authentication.json()['results']['access_token']['token'] # Token for further authentication

In [15]:
from typing import List

# Retrieve the IDs of all currently ingested documents
documents: requests.Response = requests.get(
    url="http://localhost:7272/v3/documents",
    headers={
        "Authorization": f"Bearer {token}"
    }
)

doc_ids: List[str] = [document['id'] for document in documents.json()['results']]
print(f"Found {len(doc_ids)} documents")

# Delete all documents available
for doc_id in doc_ids:
    del_resp: requests.Response = requests.delete(
        url=f"http://localhost:7272/v3/documents/{doc_id}",
        headers={
            "Authorization": f"Bearer {token}"
        }
    )
    if del_resp.status_code == 200:
        print(f"Deleted document with ID: {doc_id}")
    else:
        print(f"Failed to delete document with ID: {doc_id}")

Found 8 documents
Deleted document with ID: 15bea571-fbd0-5ed6-a1c2-5bb70b6b7f36
Deleted document with ID: db0faff8-f57f-5459-91f4-92f611106fa1
Deleted document with ID: 6530a034-daa1-5198-b340-563b5e45ca2b
Deleted document with ID: a5ef7ed6-02c2-5a86-9299-fff661c1e7f7
Deleted document with ID: 2a9978ac-84fd-5644-8633-dcc90e19c123
Deleted document with ID: d7c24a75-99ba-5b84-8339-5a9188be0580
Deleted document with ID: bf55e614-3330-5283-a759-ea1bfa15a655
Deleted document with ID: 88fa13ca-5921-590f-8693-408b1ed047bf


In [16]:
import tempfile
import mimetypes
from langchain.docstore.document import Document
from langchain_community.document_loaders import DirectoryLoader

# Load files
loader = DirectoryLoader(
    "./data", # The folder where the documents are stored.
    glob="**/*.md",
    exclude="README.md"
)
docs: list[Document] = loader.load()

# Clean-up markdown
with tempfile.TemporaryDirectory() as temp_dir:
    for doc in docs:
        doc_filepath: str = doc.metadata['source'].split("/")[-1]
        temp_file_path = os.path.join(temp_dir, doc_filepath)
        with open(temp_file_path, "w", encoding="utf-8") as file:
            file.write(doc.page_content)

    # Re-ingest every file on each run so that chunk size and chunk overlap match the experiment config
    # If working with another dataset, modify as required
    for i, file in enumerate(os.listdir(temp_dir), 1):
        if not file.endswith(".md"):
            continue

        filepath = os.path.join(temp_dir, file)

        # Guess the content type (MIME type) based on file extension
        mime_type, _ = mimetypes.guess_type(filepath)
        if mime_type is None:
            mime_type = "application/octet-stream"  # fallback if unknown

        with open(filepath, "rb") as content:
            # Ingest file - extract text, chunk it, generate embeddings and finally store in vector store
            ingestion_resp: requests.Response = requests.post(
                url="http://localhost:7272/v3/documents",
                headers={
                    "Authorization": f"Bearer {token}"
                },
                files={
                    "file": (file, content, mime_type)
                },
                data={
                    "metadata": "{}", # Feel free to add your own metadata
                }
            )
            
            if ingestion_resp.status_code == 202:
                print(f"[{i}]. Ingested: {file}")
            else:
                print(ingestion_resp.json())

[1]. Ingested: special_assistance.md
[2]. Ingested: managing_reservations.md
[3]. Ingested: flight_delays.md
[4]. Ingested: baggage_policies.md
[5]. Ingested: inflight_services.md
[6]. Ingested: schedule_changes.md
[7]. Ingested: bookings.md
[8]. Ingested: flight_cancellations.md


In [17]:
import json

def load_goldens(filepath: str) -> List[Dict]:
    """Loads the synthetically generated goldens from a JSONL file."""
    _goldens: List[Dict] = []
    try:
        with open(file=f"./goldens/{filepath}.jsonl", mode="r", encoding="utf-8") as file:
            # Read the file line by line and parse each line as JSON
            for line in file:
                if line.strip():  # Skip empty lines
                    _goldens.append(json.loads(line))
    except FileNotFoundError:
        raise FileNotFoundError(f"File `./goldens/{filepath}.jsonl` containing goldens not found.")
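    except json.JSONDecodeError as e:
        # JSONDecodeError requires (msg, doc, pos), so re-raise it with added context
        raise json.JSONDecodeError(
            f"Invalid JSON in `./goldens/{filepath}.jsonl`: {e.msg}", e.doc, e.pos
        ) from e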

    return _goldens

In [18]:
def extract_deepseek_response(full_response: str):
    """
    Extract the actual response from deepseek-r1 output by ignoring the <think>...</think> section.
    """
    if "</think>" not in full_response:
        raise ValueError("Response from deepseek-r1 is not full!")

    strings: List[str] = full_response.split("</think>")
    answer_without_section: str = strings[-1].lstrip()
    return answer_without_section
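
A more tolerant variant, shown here only as an alternative sketch, removes every complete `<think>...</think>` block with a regular expression and leaves responses without such a block untouched. Unlike the function above, it does not raise when the closing tag is missing, so a response that was cut off mid-thought would pass through unchanged. The name `strip_think_sections` is hypothetical.

```python
import re

# DOTALL lets `.` match newlines, since the reasoning usually spans several lines
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_think_sections(full_response: str) -> str:
    """Remove all complete <think>...</think> blocks and trim surrounding whitespace."""
    return THINK_BLOCK.sub("", full_response).strip()


print(strip_think_sections("<think>Let me check\nthe context.</think>\n\nYes, it is included."))
print(strip_think_sections("No reasoning block here."))
```

Which behaviour is preferable depends on whether a missing reasoning block should count as an error in your experiments.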

In [19]:
goldens_filepath: str = f"{test_id}_goldens" # This depends on how you name your goldens

# Some debugging info
print(f"""{'='*80}\nGenerating dataset in ./datasets/{test_id}_dataset.jsonl
TOP_K={int(os.getenv("TOP_K"))}
MAX_TOKENS_TO_SAMPLE={int(os.getenv("MAX_TOKENS"))}
CHUNK_SIZE={int(os.getenv("CHUNK_SIZE"))}
CHUNK_OVERLAP={int(os.getenv("CHUNK_OVERLAP"))}
CHAT_MODEL={os.getenv("CHAT_MODEL")}
TEMPERATURE={float(os.getenv("TEMPERATURE"))}
VANILLA_RAG={"True" if search_strategy == "vanilla" else "False"}
{'='*80}
""")

goldens: List[Dict] = load_goldens(goldens_filepath)

# Filling out the rest of our dataset, for each individual entry
for i, golden in enumerate(goldens):
    # [1] Embed the `user_input`
    # [2] Perform semantic similarity search fetching the top-k most relevant contexts
    # [3] Re-rank based on relevance relative to `user_input`
    # [4] Use the template defined above and replace placeholders dynamically
    # [5] Submit the augmented prompt to the LLM
    # [6] LLM generates the response and returns an object containing it and the context
    rag_response: requests.Response = requests.post(
        url="http://localhost:7272/v3/retrieval/rag",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json"
        },
        json={
            "query": golden["user_input"], # Submit query from synthetically generated goldens
            "rag_generation_config": RAG_GENERATION_CONFIG,
            "search_mode": "custom",
            "search_settings": SEARCH_SETTINGS,
            "task_prompt": TEMPLATE
        }
    )

    if rag_response.status_code != 200:
        raise Exception(f"Request failed {rag_response.json()}")
    
    response: Dict = rag_response.json()['results']

    # Get the LLM response and context
    actual_output: str = response['completion']
    retrieved_contexts: List[str] = [
        chunk['text']
        for chunk in response['search_results']['chunk_search_results']
    ]

    # If any deepseek-r1 variant is used (regardless of parameter count),
    # remove the content between the <think> tags
    if "deepseek-r1" in os.getenv("CHAT_MODEL"):
        actual_output = extract_deepseek_response(actual_output)

    # Fill out the rest of your dataset
    golden["response"] = actual_output
    golden["retrieved_contexts"] = retrieved_contexts

    print(f"Added data to sample: {i + 1} out of {len(goldens)}")

# Persist the complete dataset
os.makedirs("./datasets", exist_ok=True)  # Create the directory if it doesn't exist
with open(file=f"./datasets/{test_id}_dataset.jsonl", mode="w", encoding="utf-8") as f:
    for golden in goldens:
        f.write(json.dumps(golden, ensure_ascii=False) + "\n")

print(f"{'='*80}\nGenerated dataset in ./datasets/{test_id}_dataset.jsonl")

Generating dataset in ./datasets/1_dataset.jsonl
TOP_K=3
MAX_TOKENS_TO_SAMPLE=512
CHUNK_SIZE=512
CHUNK_OVERLAP=0
CHAT_MODEL=llama3.1:8b
TEMPERATURE=0.0
VANILLA_RAG=True

Added data to sample: 1 out of 52
Added data to sample: 2 out of 52
Added data to sample: 3 out of 52
Added data to sample: 4 out of 52
Added data to sample: 5 out of 52
Added data to sample: 6 out of 52
Added data to sample: 7 out of 52
Added data to sample: 8 out of 52
Added data to sample: 9 out of 52
Added data to sample: 10 out of 52
Added data to sample: 11 out of 52
Added data to sample: 12 out of 52
Added data to sample: 13 out of 52
Added data to sample: 14 out of 52
Added data to sample: 15 out of 52
Added data to sample: 16 out of 52
Added data to sample: 17 out of 52
Added data to sample: 18 out of 52
Added data to sample: 19 out of 52
Added data to sample: 20 out of 52
Added data to sample: 21 out of 52
Added data to sample: 22 out of 52
Added data to sample: 23 out of 52
Added data to sample: 24 out of 52
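
Before moving on, it can be worth checking that every record in the saved file really contains all five fields. The sketch below works on JSONL text so that it runs on its own; the field list follows the names used in this notebook, while `find_incomplete_records` and the two sample records are hypothetical.

```python
import json

REQUIRED_FIELDS = (
    "user_input",
    "reference",
    "reference_contexts",
    "response",
    "retrieved_contexts",
)


def find_incomplete_records(jsonl_text: str) -> list[tuple[int, list[str]]]:
    """Return (line number, missing or empty fields) for every incomplete record."""
    problems: list[tuple[int, list[str]]] = []
    for line_no, line in enumerate(jsonl_text.splitlines(), 1):
        if not line.strip():
            continue  # Skip empty lines
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if record.get(name) in (None, "", [])]
        if missing:
            problems.append((line_no, missing))
    return problems


sample = "\n".join([
    json.dumps({
        "user_input": "q1", "reference": "a1", "reference_contexts": ["c1"],
        "response": "r1", "retrieved_contexts": ["c1"],
    }),
    json.dumps({"user_input": "q2", "reference": "a2", "reference_contexts": ["c2"]}),
])
print(find_incomplete_records(sample))
```

To check a real file, read it with `open(path, encoding="utf-8").read()` and pass the text to the function.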

### 13. Synthetic Data Generation completed

Congratulations!

With this you have created your own synthetic goldens, used various configurations to fill in the missing fields and saved all the data locally. Now you can move on to the next notebook, review the metrics I consider relevant and evaluate your application across the different experiments.