## Generating a synthetic dataset using DeepEval

### Synthesizer

This object can be used to generate **Golden** instances, which consist of an **input**, an **expected output** and a **context**. It uses an LLM to come up with input values based on a context and then tries to enhance them through evolutions, which make them more complex and realistic.

For a comprehensive guide to how this object works, see: [Synthesizer](https://www.confident-ai.com/blog/the-definitive-guide-to-synthetic-data-generation-using-llms)

### Summary

I will try to summarize the most important information:

* It uses an **LLM to come up with a comprehensive dataset** much faster than a human can
* The process starts with the LLM generating **synthetic queries** based on context from a knowledge base, usually a set of documents
* Those initial queries are then **evolved** to reflect real-life complexity and, together with the context, can be used to generate a **target/expected output**

![Dataset generation workflow](../../img/synthesizer-overview.png "Synthetic generation")

* There exist two main methods:
    - Self-improvement: Iteratively uses the LLM's own output to generate more complex queries
    - Distillation: A stronger model is used to generate the data

* Constructing contexts:
    - During this phase documents from the knowledge base are split using a token splitter
    - A random chunk is selected
    - Finally, additional chunks are retrieved based on **semantic similarity**, **knowledge graphs** or other approaches
    - Results are better when **chunk size**, **chunk overlap** and similar parameters are identical here and in the **retrieval component** of the **RAG** application

![Constructing contexts](../../img/synthesizer-context.png "Context construction")
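
The context construction steps above can be sketched in plain Python. This is only an illustration under stated assumptions: tokens are whitespace-separated words, and word overlap (Jaccard similarity) stands in for semantic similarity. The function names are hypothetical and are not part of DeepEval.

```python
import random


def split_into_chunks(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Split a token list into windows of `chunk_size` that overlap by `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def jaccard(a: list[str], b: list[str]) -> float:
    """Word-overlap similarity; a crude stand-in for embedding similarity."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def build_context(chunks: list[list[str]], extra_chunks: int, rng: random.Random) -> list[list[str]]:
    """Pick a random seed chunk, then add the `extra_chunks` most similar other chunks."""
    seed_index = rng.randrange(len(chunks))
    seed = chunks[seed_index]
    others = [c for i, c in enumerate(chunks) if i != seed_index]
    others.sort(key=lambda c: jaccard(seed, c), reverse=True)
    return [seed] + others[:extra_chunks]


tokens = "checked bags must not exceed 23 kg and cabin bags must not exceed 8 kg".split()
chunks = split_into_chunks(tokens, chunk_size=6, chunk_overlap=2)
context = build_context(chunks, extra_chunks=1, rng=random.Random(0))
print([" ".join(chunk) for chunk in context])
```

Using the same `chunk_size` and `chunk_overlap` as the retrieval component means the synthesizer sees the knowledge base cut at the same boundaries as the retriever.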

* Constructing synthetic queries:
    - In **RAG**, when a user submits a query, the relevant data is retrieved and a prompt template augments the input with that context. The **Synthesizer** reverses this approach: it starts from the context and derives the input.
    - Using the contexts, the **Synthesizer** can now generate synthetic inputs
    - This ensures that each input corresponds to its context, which improves **relevancy** and **accuracy**

![Constructing synthetic queries](../../img/synthesizer-query.png "Synthetic queries creation")
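
To make the reversed approach concrete, here is a minimal sketch of a prompt that goes from context to questions. The template wording and the `build_query_prompt` helper are assumptions made for illustration; DeepEval uses its own internal templates.

```python
def build_query_prompt(context: list[str], num_queries: int) -> str:
    """Build a prompt asking an LLM to write questions answerable from `context` alone."""
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(context, start=1))
    return (
        f"Write {num_queries} question(s) that a user could ask and that can be "
        "answered using only the context below.\n\n"
        f"Context:\n{numbered}\n\n"
        "Questions:"
    )


prompt = build_query_prompt(
    ["Checked bags must not exceed 23 kg.", "Cabin bags must not exceed 8 kg."],
    num_queries=2,
)
print(prompt)
```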

* Data Filtering:

    Data filtering is important once you have the `synthetic query`, the `context` and, optionally, the `reference answer`: it prevents flawed queries from being refined, which would waste valuable resources. Filtering occurs at two critical stages:

    1. Context filtering: Removes low-quality chunks that may be unintelligible, for example because of excessive whitespace

    ![Context filtering](../../img/synthesizer-context-filtering.png "Filtering context")

    2. Input filtering: Ensures generated inputs meet quality standards. Even with a good, well-structured context, an input can sometimes be ambiguous or unclear.

    ![Input filtering](../../img/synthesizer-query-filtering.png "Filtering queries")
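
As a rough illustration of context filtering, the sketch below drops chunks that are too short or consist mostly of whitespace and symbols. It is a rule-based stand-in with hypothetical thresholds (`min_chars`, `min_alnum_ratio`); it is not how the library scores chunks.

```python
def is_intelligible(chunk: str, min_chars: int = 40, min_alnum_ratio: float = 0.6) -> bool:
    """Heuristic check: long enough and mostly letters/digits once whitespace is collapsed."""
    collapsed = " ".join(chunk.split())
    if len(collapsed) < min_chars:
        return False
    alnum = sum(ch.isalnum() for ch in collapsed)
    return alnum / len(collapsed) >= min_alnum_ratio


chunks = [
    "Passengers may check in online from 24 hours before departure.",
    "   \n\n  \t   |  |  |   \n   ---   \n",  # table residue, mostly whitespace
    "ok",  # too short to be useful as context
]
kept = [chunk for chunk in chunks if is_intelligible(chunk)]
print(kept)
```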
    
* Customizing dataset generation:
    - Depending on the scenario, inputs and outputs can be tailored to specific use cases
        - For example, a medical chatbot behaves very differently from a scientific one: it needs to comfort patients and avoid bias, and false negatives can be quite dangerous.
    
* Data Evolution:
    This is crucial for the proper generation of a dataset, since it iteratively refines the dataset.

    - **In-Depth Evolving**: Expands simple instructions into more detailed versions
    - **In-Breadth Evolving**: Produces diverse instructions to enrich the dataset
    - **Elimination Evolving**: Removes less effective instructions

    ![Data evolution types](../../img/synthesizer-evolution.png "Data Evolution")

### Dependencies

* To install the dependencies run the `setup` bash script in the root of the `evaluation` folder.
* Make sure you select the correct kernel in your notebook environment.

In [1]:
# After installing the dependencies and selecting the kernel you should be good to go.
# Make sure the package is installed before continuing further.
! pip3 show deepeval

Name: deepeval
Version: 2.8.2
Summary: The LLM Evaluation Framework
Home-page: https://github.com/confident-ai/deepeval
Author: Jeffrey Ip
Author-email: jeffreyip@confident-ai.com
License: Apache-2.0
Location: /home/p3tr0vv/Desktop/Evaluation-Approaches-for-Retrieval-Augmented-Generation-RAG-/evaluation/eval/lib/python3.12/site-packages
Requires: aiohttp, anthropic, black, coverage, google-genai, grpcio, nest_asyncio, ollama, openai, opentelemetry-api, opentelemetry-exporter-otlp-proto-grpc, opentelemetry-sdk, portalocker, posthog, pytest, pytest-asyncio, pytest-repeat, pytest-rerunfailures, pytest-xdist, requests, rich, sentry-sdk, setuptools, tabulate, tenacity, tqdm, twine, typer, wheel
Required-by: 


### LLM provider

**DeepEval** uses **OpenAI** as its default LLM provider, but **Ollama** is also available. To use it, execute the code cell below. This generates a `.deepeval` file that stores key-value pairs for that LLM provider, such as the model name and base URL.

In [None]:
import os
from typing import Final
from dotenv import load_dotenv

load_dotenv("../../env/rag.env")

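# Fail early with a KeyError if either variable is missing from the env file,
# instead of passing the string "None" to the CLI commands below.
CHAT_MODEL: Final[str] = os.environ["CHAT_MODEL"]
EMBEDDING_MODEL: Final[str] = os.environ["EMBEDDING_MODEL"]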

! deepeval set-ollama {CHAT_MODEL} --base-url="http://localhost:11434/"
! deepeval set-ollama-embeddings {EMBEDDING_MODEL} --base-url="http://localhost:11434"

🙌 Congratulations! You're now using a local Ollama model for all evals that 
require an LLM.
🙌 Congratulations! You're now using Ollama embeddings for all evals that 
require text embeddings.


### Extracting chunks from knowledge base to be used as context in data generation

**Before executing the next cell:**
* Make sure Ollama is up and running.
* Download the required models for generation and embedding.
* Make sure Docker is up and running.
* Start the services defined in the compose file in the root of the project.

In [3]:
chunks_out: str = input("Enter a name for the output file (no extension): ")

! chmod u+x ./extract_chunks.sh
! ./extract_chunks.sh {chunks_out}

Data directory already exists. Skipping download.
Virtual environment already exists. Skipping creation.
Dependencies already installed. Skipping installation.
Environment is set and ready to be used
Generating context in /contexts/chunks_1.json.json.
TOP_K=10
MAX_TOKENS_TO_SAMPLE=1024
CHUNK_SIZE=768
CHUNK_OVERLAP=64
CHAT_MODEL=llama3.1:8b-instruct-q4_1
TEMPERATURE=0.0

DELETION STEP COMPLETED...
data/special_assistance.md: Document created and ingested successfully.
data/managing_reservations.md: Document created and ingested successfully.
data/flight_delays.md: Document created and ingested successfully.
data/baggage_policies.md: Document created and ingested successfully.
data/inflight_services.md: Document created and ingested successfully.
data/schedule_changes.md: Document created and ingested successfully.
data/bookings.md: Document created and ingested successfully.
data/flight_cancellations.md: Document created and ingested successfully.
INGESTION STEP COMPLETED...
Extracted c

The **filtration config** controls the quality of the generated synthetic input queries. A higher threshold ensures that the input queries are of higher quality.

If the **quality_score** is still lower than the **synthetic_input_quality_threshold** after **max_quality_retries**, the **golden with the highest quality_score** will be used.

In [4]:
from deepeval.synthesizer.config import FiltrationConfig

# (This step is completely OPTIONAL)
# https://www.deepeval.com/docs/synthesizer-introduction
filtration_config = FiltrationConfig(
    synthetic_input_quality_threshold=0.7,
    max_quality_retries=5
)
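
The retry behaviour described for the filtration config can be sketched as follows. This is a simplified model, not DeepEval's implementation: candidates arrive as `(query, quality_score)` pairs, `pick_query` is a hypothetical helper, and `max_attempts` is assumed to be the total number of attempts.

```python
from itertools import islice
from typing import Iterable


def pick_query(
    candidates: Iterable[tuple[str, float]],
    threshold: float,
    max_attempts: int,
) -> tuple[str, float]:
    """Return the first candidate that reaches `threshold`, else the best one seen."""
    best = None
    for query, score in islice(candidates, max_attempts):
        if best is None or score > best[1]:
            best = (query, score)
        if score >= threshold:
            break  # good enough, stop retrying
    if best is None:
        raise ValueError("no candidates were generated")
    return best


attempts = [("q1", 0.4), ("q2", 0.65), ("q3", 0.5)]
# No attempt reaches 0.7, so the highest-scoring attempt is kept.
print(pick_query(attempts, threshold=0.7, max_attempts=3))
```

A higher threshold therefore costs more LLM calls per golden, because more attempts are consumed before a candidate is accepted.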

**Evolutions** specify the approach used to make the synthetic queries more complex. Since this is a **RAG** application, I only use the evolution types that rely on **context**. The `num_evolutions` parameter sets the number of evolution steps to perform.

In [5]:
from deepeval.synthesizer.config import (
    Evolution,
    EvolutionConfig,
)

# (This step is completely OPTIONAL)
# https://www.deepeval.com/docs/synthesizer-introduction
evolution_config = EvolutionConfig(
    num_evolutions=1,
    evolutions={
        Evolution.MULTICONTEXT: 0.25,
        Evolution.CONCRETIZING: 0.25,
        Evolution.CONSTRAINED: 0.25,
        Evolution.COMPARATIVE: 0.25,
    }
)
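
One plausible reading of the `evolutions` mapping is that its values act as sampling weights for each evolution step. The sketch below illustrates that reading with plain strings instead of the `Evolution` enum; `sample_evolutions` is a hypothetical helper and the library's internals may differ.

```python
import random


def sample_evolutions(weights: dict[str, float], num_evolutions: int, rng: random.Random) -> list[str]:
    """Draw one evolution type per step, proportionally to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=num_evolutions)


weights = {
    "MULTICONTEXT": 0.25,
    "CONCRETIZING": 0.25,
    "CONSTRAINED": 0.25,
    "COMPARATIVE": 0.25,
}
picks = sample_evolutions(weights, num_evolutions=5, rng=random.Random(42))
print(picks)
```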

### Synthesizer

As explained at the beginning of the notebook, the synthesizer object is used to generate the synthetic dataset. In the **DeepEval** version used here, it provides four different generation methods.

In [6]:
from deepeval.synthesizer import Synthesizer

# https://www.deepeval.com/docs/synthesizer-introduction
synthesizer = Synthesizer(
    filtration_config=filtration_config,
    evolution_config=evolution_config
)

### Load the contexts for this dataset

In [16]:
import json

# Load all the contexts that were previously generated
with open(file=f"./contexts/{chunks_out}.json", mode="r", encoding="utf-8") as f:
    context_chunks = json.load(f)
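
Before starting a long generation run, it can help to check the shape of the loaded data. The sketch below assumes the JSON file holds a list of contexts, each being a list of strings, which is the shape passed as `contexts` further down. `find_context_problems` is a hypothetical helper, not a DeepEval function.

```python
def find_context_problems(contexts: object) -> list[str]:
    """Return human-readable problems; an empty list means the shape looks right."""
    if not isinstance(contexts, list) or not contexts:
        return ["contexts must be a non-empty list"]
    problems = []
    for index, context in enumerate(contexts):
        if not isinstance(context, list) or not context:
            problems.append(f"context {index} must be a non-empty list")
        elif not all(isinstance(chunk, str) and chunk.strip() for chunk in context):
            problems.append(f"context {index} contains an empty or non-string chunk")
    return problems


sample = [["Checked bags must not exceed 23 kg."], ["Cabin bags must not exceed 8 kg.", ""]]
print(find_context_problems(sample))
```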

### Generate the goldens

In this notebook I use the `generate_goldens_from_contexts` method, which skips some of the steps described in the Synthesizer section, namely the loading and splitting of documents. This provides more freedom, but one has to take care to ingest the documents properly and to derive high-quality contexts.

In [18]:
from deepeval.dataset.golden import Golden

goldens: list[Golden] = synthesizer.generate_goldens_from_contexts(
    contexts=context_chunks,
    include_expected_output=True,
    max_goldens_per_context=2,
)

✨ Generating up to 48 goldens using DeepEval (using llama3.1:8b-instruct-q4_1 (Ollama), method=default):  42%|████▏     | 20/48 [40:09<23:10, 49.67s/it]    

KeyboardInterrupt: 

✨ Generating up to 48 goldens using DeepEval (using llama3.1:8b-instruct-q4_1 (Ollama), method=default):  52%|█████▏    | 25/48 [41:34<38:15, 99.80s/it]


### Confident AI

1. In short, **Confident AI** is the cloud platform that accompanies the **DeepEval** framework; it stores **datasets**, **evaluations** and **monitoring data**.

2. If you want to use the **Confident AI** platform, create an account here: [Confident AI](https://www.confident-ai.com/)

3. After signing up, an **API key** is generated, which can be used to interact with the platform from inside the notebook.

---

Example of .env file:
```bash
DEEPEVAL_RESULTS_FOLDER=<folder> # Results of evaluations can be saved locally (cache)
DEEPEVAL_API_KEY=<your api key>  # Relevant if you want to use Confident AI
DEEPEVAL_TELEMETRY_OPT_OUT="YES" # Remove telemetry
```

In [19]:
import os
from dotenv import load_dotenv
from deepeval import login_with_confident_api_key

# Loads the environment variables from a `.env` file.
# If you want to use Confident AI be sure to create one in this directory.
load_dotenv("../.env")

deepeval_api_key: str = os.getenv("DEEPEVAL_API_KEY")

# You should get a message letting you know you are logged-in.
login_with_confident_api_key(deepeval_api_key)
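
Because `os.getenv` returns `None` for unset variables, a missing key only surfaces later as a confusing login or push error. A small check like the one below can catch this earlier. `missing_env_vars` is a hypothetical helper, and the variable names are the ones used in this notebook.

```python
import os
from typing import Mapping


def missing_env_vars(names: list[str], environ: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names that are unset or empty in `environ`."""
    return [name for name in names if not environ.get(name)]


# Demonstration with an explicit mapping instead of the real environment.
example_env = {"DEEPEVAL_API_KEY": "example-key", "DATASET_ALIAS": ""}
print(missing_env_vars(["DEEPEVAL_API_KEY", "DATASET_ALIAS"], example_env))
```

In the notebook the second argument can be omitted, so the check runs against the variables loaded by `load_dotenv`.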

Make sure to visit the link provided when invoking the `push` method. It redirects you to the page containing the `goldens`, where you can clean up the data. Cleaning will almost always be necessary, since the project uses a weak model and the generated inputs will not always be **clean**.

Do note that if the `push` to the cloud fails you might need to upgrade **DeepEval** to the latest version. To do so run:
`pip3 install --upgrade deepeval`

In [None]:
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset(goldens=goldens)

dataset.push(
    alias=os.getenv("DATASET_ALIAS"),
    overwrite=True
)

Gtk-Message: 12:57:34.499: Failed to load module "canberra-gtk-module"
Gtk-Message: 12:57:34.500: Failed to load module "canberra-gtk-module"


Opening in existing browser session.


In [None]:
import json
from deepeval.dataset import EvaluationDataset

# I did some cleaning on the data since the input was not fully in the expected format on the ConfidentAI platform.
final_dataset = EvaluationDataset()
final_dataset.pull(alias=os.getenv("DATASET_ALIAS"))

# Saving the data locally so I can use it in a script.
# Since R2R and DeepEval have conflicting dependencies a virtual environment with both of these 
# libraries doesn't work. They need to be separated. (`Ollama` is the conflicting package)
json_out: list[dict] = [golden.model_dump() for golden in final_dataset.goldens]

# Save json data
with open("deepeval_dataset.json", "w") as f:
    json.dump(json_out, f, indent=4)


### Generation of `actual response` and `retrieval context`

In [None]:

goldens_filepath: str = input("Enter a name for the goldens (no extension): ")

! chmod u+x ./fill_dataset.sh
! ./fill_dataset.sh {goldens_filepath}

In [2]:
# After having all of the data push the full dataset to ConfidentAI
import os
import json
from dotenv import load_dotenv
from deepeval.dataset import EvaluationDataset

# Make sure you specify the correct file name below
DATASET_PATH = "full_deepeval_dataset.json"

# Fail early if the file is missing or is not valid JSON
with open(DATASET_PATH, "r", encoding="utf-8") as f:
    json.load(f)

full_dataset = EvaluationDataset()
full_dataset.add_goldens_from_json_file(
    file_path=DATASET_PATH,
    input_key_name="input",
    actual_output_key_name="actual_output",
    expected_output_key_name="expected_output",
    context_key_name="context",
    retrieval_context_key_name="retrieval_context"
)

load_dotenv("../.env")

full_dataset.push(
    alias=os.getenv("DATASET_ALIAS"),
    overwrite=True
)

Gtk-Message: 16:15:48.222: Failed to load module "canberra-gtk-module"
Gtk-Message: 16:15:48.223: Failed to load module "canberra-gtk-module"


Opening in existing browser session.
