## Generating a synthetic dataset using DeepEval

### Synthesizer

This object can be used to generate **Golden** instances, each consisting of an **input**, an **expected output** and a **context**. It uses an LLM to come up with input values based on a context and then tries to enhance them through evolutions, which make them more complex and realistic.

For a comprehensive guide on how this object works, see: [Synthesizer](https://www.confident-ai.com/blog/the-definitive-guide-to-synthetic-data-generation-using-llms)

### Summary

I will try to summarize the most important information:

* It uses an **LLM to come up with a comprehensive dataset** much faster than a human can

* The process starts with the LLM generating **synthetic queries** based on context from a knowledge base

* Those initial queries are then **evolved** to reflect real-life complexity and, together with the context, can be used to generate a **target/expected output**

![Dataset generation workflow](../../img/deepeval/synthesizer-overview.png "Synthetic generation")

* There exist two main methods:
    
    - Self-improvement: Iteratively uses the LLM's own output to generate more complex queries

    - Distillation: Uses a stronger model to generate the data

* Constructing contexts:
    - During this phase documents from the knowledge base are split using a *token splitter*

    - A random chunk is selected
    
    - Finally, additional chunks are retrieved based on **semantic similarity**, **knowledge graphs** or other approaches
    
    - Keeping **chunk size**, **chunk overlap** and similar parameters identical here and in the **retrieval component** of the **RAG** application yields better results

![Constructing contexts](../../img/deepeval/synthesizer-context.png "Context construction")

* Constructing synthetic queries:
    - In **RAG**, when a user submits a query, all the relevant data is retrieved and then a template augments the input with the context. The `synthesizer` reverses this approach.

    - Using the contexts, the **Synthesizer** can now generate synthetic input

    - Doing so ensures that the input corresponds to the context, which enhances **relevancy** and **accuracy**

![Constructing synthetic queries](../../img/deepeval/synthesizer-query.png "Synthetic queries creation")

* Data Filtering:

    Data filtering is important once you have the `synthetic query`, the `context` and optionally a `reference answer`, so that no resources are wasted on refining flawed queries. Filtering occurs at 2 critical stages:

    1. Context filtering: Removes low-quality chunks that may be unintelligible, for example due to excessive whitespace

    ![Context filtering](../../img/deepeval/synthesizer-context-filtering.png "Filtering context")

    2. Input filtering: Ensures generated inputs meet quality standards. Sometimes, even with good and well-structured context, an input might be ambiguous or unclear.

    ![Input filtering](../../img/deepeval/synthesizer-query-filtering.png "Filtering queries")
    
* Customizing dataset generation:
    - Depending on the scenario, inputs and outputs can be tailored to specific use cases
        
        - For example, a medical chatbot would behave completely differently from a scientific one. It would need to comfort patients and avoid bias. False negatives could also turn out to be quite dangerous.
    
* Data Evolution:
    This is crucial for the proper generation of a dataset, since it iteratively refines the dataset.

    - **In-Depth Evolving**: Expands simple instructions into more detailed versions

    - **In-Breadth Evolving**: Produces diverse instructions to enrich the dataset
    
    - **Elimination Evolving**: Removes less effective instructions

    ![Data evolution types](../../img/deepeval/synthesizer-evolution.png "Data Evolution")
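
The context-construction step summarized above depends on how **chunk size** and **chunk overlap** interact. The following is a minimal sketch of that interaction, not DeepEval's or R2R's implementation: the function name `split_with_overlap` is hypothetical and whitespace-separated words stand in for real tokens.

```python
from typing import List


def split_with_overlap(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split a token list into windows of `chunk_size` that share `chunk_overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Toy "tokenizer": whitespace-separated words stand in for real tokens
example_tokens = "a b c d e f g".split()
example_chunks = split_with_overlap(example_tokens, chunk_size=4, chunk_overlap=2)
print(example_chunks)
```

Changing either parameter changes which text ends up in which chunk, which is why the dataset generation and the retrieval component should use the same values.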

### Dependencies

* To install the dependencies run the `setup` bash script in the root of the `evaluation` folder.

* Make sure you select the correct kernel (eval) in your notebook environment.

In [None]:
# After installing the dependencies and selecting the kernel you should be good to go.
# Make sure the package is installed before continuing further.
! pip3 show deepeval

### Download dataset

In [None]:
! git clone https://huggingface.co/datasets/explodinggradients/ragas-airline-dataset data

### LLM provider

**DeepEval** uses **OpenAI** as its default LLM provider; however, **Ollama** is also available. To use it, execute the code cell below. This will generate a `.deepeval` file that stores key-value pairs describing that particular LLM provider, such as the model name and base URL.

In [None]:
import os
from typing import Final
from dotenv import load_dotenv

load_dotenv("../../env/rag.env")

DATA_GENERATION_MODEL: Final[str] = os.getenv("DATA_GENERATION_MODEL")
EMBEDDING_MODEL: Final[str] = os.getenv("EMBEDDING_MODEL")

! deepeval set-ollama {DATA_GENERATION_MODEL} --base-url="http://localhost:11434/"
! deepeval set-ollama-embeddings {EMBEDDING_MODEL} --base-url="http://localhost:11434"

### Extracting chunks from knowledge base to be used as context in data generation


Since **DeepEval** uses a `TokenTextSplitter` when generating a synthetic dataset from [documents](https://www.deepeval.com/docs/synthesizer-generate-from-docs), whereas `R2R` uses a `RecursiveCharacterTextSplitter`, we need to perform the chunking ourselves. Then we can generate a dataset from [contexts](https://www.deepeval.com/docs/synthesizer-generate-from-contexts).

**Before executing the next cells:**
* Make sure Ollama is up and running.

* Download the required models for generation and embedding.

* Make sure the RAG application is running; if not, start it with `./run.sh`.

In [None]:
import os

import ollama
from dotenv import load_dotenv

load_dotenv("../../env/rag.env")

# These objects are going to be required when performing semantic search (using the embedding model)

ollama_client = ollama.Client(host="http://localhost:11434")

ollama_options = ollama.Options(
    temperature=float(os.getenv("TEMPERATURE")),
    top_p=float(os.getenv("TOP_P")),
    top_k=int(os.getenv("TOP_K")),
    num_ctx=int(os.getenv("LLM_CONTEXT_WINDOW_TOKENS")),
)

In [None]:
from typing import Dict, Tuple, List

# Retrieve this many chunks from a document at random
CHUNKS_PER_DOCUMENT: Final[int] = 3

# A context is a collection of chunks that share some degree of similarity
CHUNKS_PER_CONTEXT: Final[int] = 3

# Maps each chunk ID to a tuple of (chunk text, embedding of that text)
CHUNKS_WITH_EMBEDDINGS: Dict[str, Tuple[str, List[float]]] = {}

In [None]:
import requests

# First we need to authenticate the admin user and receive a token
# The username and password are the default ones provided by `R2R`
# Note that you can overwrite those in the `config.toml` file
# If this fails, there are either connectivity issues or the credentials are wrong
authentication: requests.Response = requests.post(
    url="http://localhost:7272/v3/users/login", # This may vary depending on your setup
    headers={
        "Content-Type": "application/x-www-form-urlencoded"
    },
    data="username=admin@example.com&password=change_me_immediately",
)
authentication.raise_for_status() # Fail early with a clear HTTP error
token: str = authentication.json()['results']['access_token']['token'] # Token for further authentication

In [None]:
# Retrieve the IDs of all currently ingested documents
documents: requests.Response = requests.get(
    url="http://localhost:7272/v3/documents",
    headers={
        "Authorization": f"Bearer {token}"
    }
)

doc_ids: List[str] = [document['id'] for document in documents.json()['results']]
print(f"Found {len(doc_ids)} documents")

# Delete all documents available
for doc_id in doc_ids:
    del_resp: requests.Response = requests.delete(
        url=f"http://localhost:7272/v3/documents/{doc_id}",
        headers={
            "Authorization": f"Bearer {token}"
        }
    )
    if del_resp.status_code == 200:
        print(f"Deleted document with ID: {doc_id}")
    else:
        print(f"Failed to delete document with ID: {doc_id}")

In [None]:
import tempfile
import mimetypes
from langchain_core.documents import Document  # `langchain.docstore.document` is a deprecated import path
from langchain_community.document_loaders import DirectoryLoader

# Load files
loader = DirectoryLoader(
    "./data", # The folder, where the documents are stored at.
    glob="**/*.md",
    exclude="README.md"
)
docs: list[Document] = loader.load()

# Clean-up markdown
with tempfile.TemporaryDirectory() as temp_dir:
    for doc in docs:
        doc_filename: str = os.path.basename(doc.metadata['source'])
        temp_file_path = os.path.join(temp_dir, doc_filename)
        with open(temp_file_path, "w", encoding="utf-8") as file:
            file.write(doc.page_content)

    # Ingest individual files, on every run so that chunk size and chunk overlap match the experiment config
    # If working with another dataset modify as required
    for i, file in enumerate(os.listdir(temp_dir), 1):
        if not file.endswith(".md"):
            continue

        filepath = os.path.join(temp_dir, file)

        # Guess the content type (MIME type) based on file extension
        mime_type, _ = mimetypes.guess_type(filepath)
        if mime_type is None:
            mime_type = "application/octet-stream"  # fallback if unknown

        with open(filepath, "rb") as content:
            # Ingest file - extract text, chunk it, generate embeddings and finally store in vector store
            ingestion_resp: requests.Response = requests.post(
                url="http://localhost:7272/v3/documents",
                headers={
                    "Authorization": f"Bearer {token}"
                },
                files={
                    "file": (file, content, mime_type)
                },
                data={
                    "metadata": "{}", # Feel free to add your own metadata
                }
            )
            
            if ingestion_resp.status_code == 202:
                print(f"[{i}]. Ingested: {file}")
            else:
                print(ingestion_resp.json())

Define functions to embed an input text and to measure semantic similarity between two embeddings using `cosine similarity`.

In [None]:
import numpy as np

def compute_embedding(text: str) -> List[float]:
    """
    Uses `ollama` to convert the text to an embedding preserving the semantic meaning.
    
    Args:
        text: Text to compute the embedding for
    
    Returns:
        List of floats representing the vector embedding
    """
    return ollama_client.embeddings(
        model=os.getenv("EMBEDDING_MODEL"),
        prompt=text,
        options=ollama_options
    )["embedding"]

def compute_similarity(embedding1: List[float], embedding2: List[float]) -> float:
    """
    Compute the cosine similarity between two vector embeddings.
    
    Args:
        embedding1: First vector embedding
        embedding2: Second vector embedding
    
    Returns:
        float: Cosine similarity score between the two embeddings. 
        The closer to 1, the more similar the embeddings are.
    
    Raises:
        ValueError: If the embeddings have different lengths
    """
    if len(embedding1) != len(embedding2):
        raise ValueError("Embeddings must have the same length!")

    vec1 = np.array(embedding1)
    vec2 = np.array(embedding2)
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
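
To make the similarity measure concrete, here is a small worked example. It is a standalone sketch in pure Python rather than the notebook's `compute_similarity`: the name `cosine_similarity` and the zero-vector guard are additions for illustration, and the toy 2-dimensional vectors are made up.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("Cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors: close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # unrelated vectors: 0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])       # opposite vectors: -1
print(same_direction, orthogonal, opposite)
```

Only the direction of the vectors matters, not their length, which is why `[1, 2]` and `[2, 4]` count as maximally similar.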

For every available chunk in the knowledge base we compute the embedding and store it.

In [None]:
chunks: requests.Response = requests.get(
    url="http://localhost:7272/v3/chunks",
    headers={
        "Authorization": f"Bearer {token}"
    }
)

if chunks.status_code != 200:
    raise Exception("Failed to retrieve chunks from vector store")

for chunk in chunks.json()['results']:
    CHUNKS_WITH_EMBEDDINGS[chunk['id']] = (
        chunk['text'],
        compute_embedding(chunk['text'])
    )

In [None]:
def retrieve_n_similar_chunks(current_chunk_id: str, n: int) -> List[str]:
    """
    It receives a `chunk_id` of the current chunk and returns the `n` most similar chunks.
    It makes use of `compute_similarity` to compute the cosine similarity between chunks.
    The closer to 1, the more similar the chunks are.
    The chunks are sorted in descending order by similarity.
    Finally, we make sure that only the `n` chunks are kept or if less available we keep all.
    
    Args:
        current_chunk_id: ID of the current chunk
        n: Number of similar chunks to retrieve
    
    Returns:
        List[str]: list of chunk texts
    """
    similarities: Dict[str, float] = {} # chunk_id -> similarity

    for chunk_id, (_, embedding) in CHUNKS_WITH_EMBEDDINGS.items():
        # We don't want to consider the same chunk relevant for the context
        if chunk_id == current_chunk_id:
            continue

        similarity: float = compute_similarity(
            embedding1 = CHUNKS_WITH_EMBEDDINGS[current_chunk_id][1],
            embedding2 = embedding
        )
        similarities[chunk_id] = similarity # chunk_id -> similarity

    # Sort by similarity in descending order
    most_similar = sorted(similarities.items(), key=lambda x: x[1], reverse=True)

    # Get the top `n` chunk IDs
    # If less available we keep all
    if len(most_similar) < n:
        n = len(most_similar)
    
    # Retrieve all chunk texts, by not exceeding `n`
    top_n_chunks: List[str] = [CHUNKS_WITH_EMBEDDINGS[chunk_id][0] for chunk_id, _ in most_similar[:n]]

    return top_n_chunks
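
The same top-`n` selection can be tried out without a running vector store. The sketch below uses a hypothetical in-memory `toy_store` with made-up 2-dimensional embeddings and `heapq.nlargest` instead of a full sort; it is an illustration of the idea, not a replacement for the function above.

```python
import heapq
import math
from typing import Dict, List, Tuple

# Hypothetical toy store: chunk_id -> (text, embedding); 2-d vectors for readability
toy_store: Dict[str, Tuple[str, List[float]]] = {
    "c1": ("baggage allowance", [1.0, 0.0]),
    "c2": ("checked baggage fees", [0.9, 0.1]),
    "c3": ("flight delays", [0.0, 1.0]),
    "c4": ("lost baggage claims", [0.7, 0.3]),
}


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_n_similar(chunk_id: str, n: int) -> List[str]:
    """Texts of the `n` chunks most similar to `chunk_id`, excluding the chunk itself."""
    _, query_embedding = toy_store[chunk_id]
    candidates = [
        (cosine(query_embedding, embedding), text)
        for other_id, (text, embedding) in toy_store.items()
        if other_id != chunk_id
    ]
    # nlargest returns results in descending order and copes with n > len(candidates)
    best = heapq.nlargest(n, candidates, key=lambda pair: pair[0])
    return [text for _, text in best]


print(top_n_similar("c1", 2))
```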

In [None]:
"""
The algorithm I follow:
    1. I go over all the ingested documents
    2. For each document I select CHUNKS_PER_DOCUMENT chunks at random (or all of them if fewer are available)
    3. For each selected chunk I pick the CHUNKS_PER_CONTEXT - 1 most similar chunks using semantic similarity (out of all the documents)
    4. Finally, the selected chunk and its similar chunks are grouped together and form a context
"""

import random

def extract_context_chunks() -> List[List[str]]:
    """
    Each context contains chunks, which are similar to each other.
    The chunks in a given context might be derived from different documents.
    
    Returns:
        List[List[str]]: list of contexts, each being a list of chunk texts
    """
    contexts: List[List[str]] = []

    # Fetch all documents
    documents: requests.Response = requests.get(
        url="http://localhost:7272/v3/documents",
        headers={
            "Authorization": f"Bearer {token}"
        }
    )

    # Make sure that the request was successful
    if documents.status_code != 200:
        raise Exception("Failed to retrieve documents from vector store")
    
    # Iterate over all documents
    for i, document in enumerate(documents.json()['results'], 1):

        # Fetch all chunks of a document
        doc_chunks: requests.Response = requests.get(
            url=f"http://localhost:7272/v3/documents/{document['id']}/chunks",
            headers={
                "Authorization": f"Bearer {token}"
            }
        )

        # Make sure that the request was successful
        if doc_chunks.status_code != 200:
            raise Exception(
                f"Failed to retrieve chunks from vector store for {document['id']}"
            )

        chunk_ids: List[str] = [chunk['id'] for chunk in doc_chunks.json()['results']]

        # Select at most CHUNKS_PER_DOCUMENT chunks at random for each document
        random_chunk_ids: List[str] = random.sample(
            chunk_ids,
            min(CHUNKS_PER_DOCUMENT, len(chunk_ids))
        )

        # Create contexts 
        for j, chunk_id in enumerate(random_chunk_ids, 1):
            initial_text: str = CHUNKS_WITH_EMBEDDINGS[chunk_id][0]
            context: List[str] = [initial_text]
            similar_chunks: List[str] = retrieve_n_similar_chunks(
                chunk_id,
                CHUNKS_PER_CONTEXT - 1
            )
            context.extend(similar_chunks)
            contexts.append(context)
            print(f"Extracted context {len(contexts)} from document {i} and chunk {j}")

    return contexts

Create contexts to be used for synthetic data generation by **DeepEval**.

In [None]:
import json

experiment_id: int = int(input("Enter experiment id (Ex. 1): "))
contexts_filename: str = f"{experiment_id}_contexts"

context_chunks: List[List[str]] = extract_context_chunks()

os.makedirs("./contexts", exist_ok=True)  # Create the directory if it doesn't exist
with open(f"./contexts/{contexts_filename}.json", "w", encoding="utf-8") as f:
    json.dump(
        context_chunks,
        f,
        ensure_ascii=False,
        indent=4
    )


**Filtration config** controls the quality of the generated synthetic input queries. A higher threshold means that only higher-quality input queries are accepted.

If the **quality_score** is still lower than the **synthetic_input_quality_threshold** after **max_quality_retries**, the **golden with the highest quality_score** will be used.

In [None]:
from deepeval.synthesizer.config import FiltrationConfig

# (This step is completely OPTIONAL)
# https://www.deepeval.com/docs/synthesizer-introduction
filtration_config = FiltrationConfig(
    synthetic_input_quality_threshold=0.7,
    max_quality_retries=5
)
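
The retry rule described above can be sketched as follows. This is not DeepEval's code: `generate_with_quality_retries` and the `generate_candidate` callable are hypothetical, and whether the first attempt counts towards the retry budget is an assumption (here it does not).

```python
from typing import Callable, Tuple


def generate_with_quality_retries(
    generate_candidate: Callable[[], Tuple[str, float]],
    quality_threshold: float,
    max_retries: int,
) -> Tuple[str, float]:
    """Return the first candidate meeting the threshold, else the best one seen."""
    best_query, best_score = generate_candidate()  # initial attempt
    retries = 0
    while best_score < quality_threshold and retries < max_retries:
        query, score = generate_candidate()
        retries += 1
        if score > best_score:
            best_query, best_score = query, score
    return best_query, best_score


# Canned (query, quality_score) pairs stand in for LLM generations
canned = iter([("q1", 0.4), ("q2", 0.6), ("q3", 0.5)])
fallback_result = generate_with_quality_retries(lambda: next(canned), 0.7, max_retries=2)
print(fallback_result)  # no candidate reached 0.7, so the best one is kept
```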

**Evolutions** specify which approaches are used to make the synthetic queries more complex. Since this is a **RAG** application, I will only use the evolution types that make use of the **context**. The `num_evolutions` parameter specifies how many evolution iterations are performed.

In [None]:
from deepeval.synthesizer.config import (
    Evolution,
    EvolutionConfig,
)

# (This step is completely OPTIONAL)
# https://www.deepeval.com/docs/synthesizer-introduction
evolution_config = EvolutionConfig(
    num_evolutions=1,
    evolutions={
        Evolution.MULTICONTEXT: 0.25,
        Evolution.CONCRETIZING: 0.25,
        Evolution.CONSTRAINED: 0.25,
        Evolution.COMPARATIVE: 0.25,
    }
)
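
One way to read the weights in the configuration above is as sampling probabilities: for every evolution step one evolution type is drawn according to its weight. The sketch below illustrates that reading with the standard library only. The helper `sample_evolutions` is hypothetical, plain strings stand in for the `Evolution` enum members, and how DeepEval draws evolutions internally is an assumption.

```python
import random
from typing import Dict, List


def sample_evolutions(weights: Dict[str, float], num_evolutions: int, rng: random.Random) -> List[str]:
    """Draw one evolution type per evolution step, proportionally to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=num_evolutions)


toy_weights = {
    "MULTICONTEXT": 0.25,
    "CONCRETIZING": 0.25,
    "CONSTRAINED": 0.25,
    "COMPARATIVE": 0.25,
}
sampled = sample_evolutions(toy_weights, num_evolutions=3, rng=random.Random(0))
print(sampled)
```

With equal weights every type is equally likely; a type with weight 0 would never be drawn.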

### Synthesizer

The synthesizer object, as explained at the beginning of the notebook, can be used to generate the synthetic dataset. In the current version of **DeepEval** it provides four different generation methods.

In [None]:
from deepeval.synthesizer import Synthesizer

# https://www.deepeval.com/docs/synthesizer-introduction
synthesizer = Synthesizer(
    filtration_config=filtration_config,
    evolution_config=evolution_config
)

### Generate the goldens

In this notebook I use `generate_goldens_from_contexts`, which skips the first steps described in the Synthesizer section: the loading and splitting of documents. This provides more freedom; however, one has to be careful to ingest the documents properly and to derive high-quality contexts.

In [None]:
from deepeval.dataset.golden import Golden

goldens: list[Golden] = synthesizer.generate_goldens_from_contexts(
    contexts=context_chunks,
    include_expected_output=True,
    max_goldens_per_context=2,
)

### Confident AI (Optional)

1. In short, **Confident AI** is a cloud-based platform that is part of the **DeepEval** project and stores **datasets**, **evaluations**, **traces**, etc.

2. If you want to use the **Confident AI** platform, create an account here: [Confident AI](https://www.confident-ai.com/)

3. After signing up, an **API key** will be generated, which can be used to interact with the platform from inside the notebook.

---

Example of .env file:
```bash
DEEPEVAL_RESULTS_FOLDER=<folder> # Results of evaluations can be saved locally (cache)
DEEPEVAL_API_KEY=<your api key>  # Relevant if you want to use Confident AI
DEEPEVAL_TELEMETRY_OPT_OUT="YES" # Remove telemetry
```

In [None]:
import os
from dotenv import load_dotenv
from deepeval import login_with_confident_api_key

# Loads the environment variables from a `.env` file.
# If you want to use Confident AI be sure to create one in this directory.
load_dotenv("../.env")

deepeval_api_key: str = os.getenv("DEEPEVAL_API_KEY")

# You should get a message letting you know you are logged-in.
login_with_confident_api_key(deepeval_api_key)

Make sure to visit the link printed when invoking the `push` method. It takes you to the page containing the `goldens`, where you can clean up the data. Cleaning up will almost always be necessary, since the project uses a weak model and the generated input will not always be **clean**.

Do note that if the `push` to the cloud fails you might need to upgrade **DeepEval** to the latest version. To do so run:
`pip3 install --upgrade deepeval`

In [None]:
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset(goldens=goldens)

alias: str = f"{experiment_id}_dataset"

dataset.push(
    alias=alias,
    overwrite=True
)

After cleaning up the dataset, you can pull it.

In [None]:
import json
from deepeval.dataset import EvaluationDataset

# I did some cleaning on the data since the input was not fully in the expected format on the ConfidentAI platform.
final_dataset = EvaluationDataset()
final_dataset.pull(alias)

### Filling the missing fields

Configure this as needed.

In [None]:
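import os
from typing import Dict, Union, Final

# `distutils` (and its `strtobool`) was removed in Python 3.12, so the flag is parsed manually
# Unset or unrecognized values are treated as False
# VANILLA_RAG=False would mean RAG-Fusion
use_vanilla_rag: bool = os.getenv("VANILLA_RAG", "").strip().lower() in (
    "y", "yes", "t", "true", "on", "1"
)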

if use_vanilla_rag:
    search_strategy: str = "vanilla"
else:
    search_strategy: str = "query_fusion" # RAG-Fusion

# Used after context had been fetched to generate the final response
RAG_GENERATION_CONFIG: Final[Dict[str, Union[str, float, int]]] = {
    "model": f"ollama_chat/{os.getenv('CHAT_MODEL')}",
    "temperature": float(os.getenv("TEMPERATURE")),
    "top_p": float(os.getenv("TOP_P")),
    "max_tokens_to_sample": int(os.getenv("MAX_TOKENS")),
}

# Relevant during the retrieval phase for fetching relevant context
SEARCH_SETTINGS: Final[Dict[str, Union[bool, int, str]]] = {
    "use_semantic_search": True,
    "limit": int(os.getenv("TOP_K")),
    "offset": 0,
    "include_metadatas": False,
    "include_scores": True,
    "search_strategy": search_strategy, # can be vanilla or hyde, (fusion is also supported)
    "chunk_settings": {
        "index_measure": "cosine_distance",
        "enabled": True,
        "ef_search": 80
    }
}

if search_strategy == "query_fusion":
    # This is only relevant when using `hyde` or `rag-fusion`
    # Number of hypothetical documents to generate, by default it's 5 if not specified
    # https://r2r-docs.sciphi.ai/api-and-sdks/retrieval/rag-app
    SEARCH_SETTINGS['num_sub_queries'] = 5

In [None]:
# You can modify this template as needed
TEMPLATE: Final[str] = """You are a helpful RAG assistant.
Your task is to provide an answer given a question using the context.
Please make sure the answer is complete and relevant to the question.

**IMPORTANT:
1. BASE YOUR ANSWER ONLY ON THE GIVEN CONTEXT.
2. IF THE CONTEXT IS NOT ENOUGH TO ANSWER THE QUESTION, SAY THAT YOU CANNOT ANSWER BASED ON THE AVAILABLE INFORMATION.
3. DO NOT INCLUDE CITATIONS OR REFERENCES TO SPECIFIC LINES OR PARTS OF THE CONTEXT.
4. ALWAYS KEEP YOUR ANSWER RELEVANT AND FOCUSED ON THE QUESTION.
5. DO NOT PROVIDE ANY ADDITIONAL INFORMATION EXCEPT THE ANSWER.
**

### CONTEXT:
{context}

### QUESTION:
{query}

### ANSWER:
"""

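The two placeholders in the template are filled in at request time. How `R2R` performs the substitution internally is an assumption; the sketch below only illustrates the idea with `str.format`, a shortened stand-in template, a hypothetical helper `fill_template` and made-up chunk texts.

```python
from typing import List

# Shortened stand-in for the template above; only the placeholders matter here
toy_template = "### CONTEXT:\n{context}\n\n### QUESTION:\n{query}\n\n### ANSWER:\n"


def fill_template(template: str, chunks: List[str], query: str) -> str:
    """Join the retrieved chunks and substitute both placeholders."""
    context = "\n\n".join(chunks)
    return template.format(context=context, query=query)


prompt = fill_template(
    toy_template,
    chunks=["chunk A text", "chunk B text"],
    query="What does chunk A say?",
)
print(prompt)
```

Any additional literal curly braces in a template would have to be doubled (`{{` and `}}`) for `str.format` to leave them untouched.
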
Authenticate

In [None]:
import requests

# First we need to authenticate the admin user and receive a token
# The username and password are the default ones provided by `R2R`
# Note that you can overwrite those in the `config.toml` file
# If this fails, there are either connectivity issues or the credentials are wrong
authentication: requests.Response = requests.post(
    url="http://localhost:7272/v3/users/login", # This may vary depending on your setup
    headers={
        "Content-Type": "application/x-www-form-urlencoded"
    },
    data="username=admin@example.com&password=change_me_immediately",
)
authentication.raise_for_status() # Fail early with a clear HTTP error
token: str = authentication.json()['results']['access_token']['token'] # Token for further authentication

Relevant only when working with `deepseek-r1`, which emits a `<think>` section containing its reasoning before it produces the actual output.

In [None]:
def extract_deepseek_response(full_response: str) -> str:
    """
    Extract the actual response from deepseek-r1 output by ignoring the <think>...</think> section.
    """
    if "</think>" not in full_response:
        raise ValueError("Response from deepseek-r1 is incomplete: closing </think> tag not found")

    strings: List[str] = full_response.split("</think>")
    answer_without_section: str = strings[-1].lstrip()
    return answer_without_section
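
For comparison, a more lenient variant is sketched below. It is an alternative, not the function used in this notebook: the hypothetical `strip_think_sections` removes every `<think>...</think>` block with a regular expression and, unlike the function above, returns the text unchanged (apart from surrounding whitespace) when no such block is present instead of raising.

```python
import re

# Non-greedy match so that several separate blocks are removed individually;
# DOTALL lets `.` span the newlines inside a reasoning block
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_think_sections(full_response: str) -> str:
    """Remove every <think>...</think> block and trim surrounding whitespace."""
    return THINK_BLOCK.sub("", full_response).strip()


raw = "<think>\nThe user asks about X.\n</think>\n\nThe answer is X."
cleaned = strip_think_sections(raw)
print(cleaned)
```

Which behaviour is preferable depends on whether a missing `<think>` section should be treated as an error, for example as a sign that the output was truncated.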

Print some debugging information (Optional).

In [None]:
# Some debugging info
print(f"""{'='*80}\nGenerating dataset in ./datasets/{experiment_id}_dataset.jsonl
TOP_K={int(os.getenv("TOP_K"))}
MAX_TOKENS_TO_SAMPLE={int(os.getenv("MAX_TOKENS"))}
CHUNK_SIZE={int(os.getenv("CHUNK_SIZE"))}
CHUNK_OVERLAP={int(os.getenv("CHUNK_OVERLAP"))}
CHAT_MODEL={os.getenv("CHAT_MODEL")}
TEMPERATURE={float(os.getenv("TEMPERATURE"))}
VANILLA_RAG={"True" if search_strategy == "vanilla" else "False"}
{'='*80}
""")

In [None]:
# Filling out the rest of our dataset, for each individual entry
for i, golden in enumerate(final_dataset.goldens):
    # [1] Embed the `user_input`
    # [2] Perform semantic similarity search fetching the top-k most relevant contexts
    # [3] Re-rank based on relevance relative to `user_input`
    # [4] Use the template defined above and replace placeholders dynamically
    # [5] Submit the augmented prompt to the LLM
    # [6] LLM generates the response and returns an object containing it and the context
    rag_response: requests.Response = requests.post(
        url="http://localhost:7272/v3/retrieval/rag",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json"
        },
        json={
            "query": golden.input, # Submit query from synthetically generated goldens
            "rag_generation_config": RAG_GENERATION_CONFIG,
            "search_mode": "custom",
            "search_settings": SEARCH_SETTINGS,
            "task_prompt": TEMPLATE
        }
    )

    if rag_response.status_code != 200:
        raise Exception(f"Request failed {rag_response.json()}")
    
    response: Dict = rag_response.json()['results']

    # Get the LLM response and context
    actual_output: str = response['completion']
    retrieved_contexts: List[str] = [
        chunk['text']
        for chunk in response['search_results']['chunk_search_results']
    ]

    # If any deepseek-r1 variant is used (regardless of parameter count),
    # remove the content between the <think> tags
    if "deepseek-r1" in os.getenv("CHAT_MODEL"):
        actual_output = extract_deepseek_response(actual_output)

    # Fill out the rest of your dataset
    golden.actual_output = actual_output
    golden.retrieval_context = retrieved_contexts

    print(f"Added data to sample: {i + 1} out of {len(final_dataset.goldens)}")

In [None]:
dataset_out: str = f"{experiment_id}_dataset.jsonl"

# Persist the complete dataset
os.makedirs("./datasets", exist_ok=True)  # Create the directory if it doesn't exist
with open(file=f"./datasets/{dataset_out}", mode="w", encoding="utf-8") as f:
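    # Iterate over the pulled goldens, since those were filled with
    # `actual_output` and `retrieval_context` in the loop above
    for golden in final_dataset.goldens:
        # `model_dump_json()` already returns a JSON string, so it is written as-is
        f.write(golden.model_dump_json() + "\n")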

print(f"{'='*80}\nGenerated dataset in ./datasets/{dataset_out}")
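
A file in the JSON Lines layout (one JSON document per line) can be read back line by line. The following is a minimal sketch with a hypothetical helper `read_jsonl` and made-up records written to a temporary file; it does not touch the dataset generated above.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def read_jsonl(path: str) -> List[Dict[str, Any]]:
    """Parse a JSON Lines file: one JSON object per non-empty line."""
    records: List[Dict[str, Any]] = []
    with open(path, mode="r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                records.append(json.loads(line))
    return records


# Round trip with made-up records, written to a temporary file
toy_records = [
    {"input": "q1", "actual_output": "a1"},
    {"input": "q2", "actual_output": "a2"},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_dataset.jsonl")
    with open(toy_path, mode="w", encoding="utf-8") as f:
        for record in toy_records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    loaded = read_jsonl(toy_path)

print(loaded == toy_records)
```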

Once the dataset is complete, you can push it to **Confident AI** again and replace the previous, incomplete version.

In [None]:
# After having all of the data push the full dataset to ConfidentAI (Optional)

final_dataset.push(
    alias=alias,
    overwrite=True,
    auto_convert_test_cases_to_goldens=True
)