## Generating a synthetic dataset using DeepEval

### Synthesizer

This object can be used to generate **Golden** instances, which consist of an **input**, an **expected output** and a **context**. It uses an LLM to come up with initial inputs and then evolves them to make them more complex and realistic.

For a comprehensive guide to how this object works, see: [Synthesizer](https://www.confident-ai.com/blog/the-definitive-guide-to-synthetic-data-generation-using-llms)

### Summary

The most important points are:

* It uses an **LLM to come up with a comprehensive dataset** much faster than a human can
* The process starts with the LLM generating **synthetic queries** based on context from a knowledge base, usually documents
* Those initial queries are then **evolved** to reflect real-life complexity and, together with the context, can be used to generate a **target/expected output**

![Dataset generation workflow](https://cdn.prod.website-files.com/64bd90bdba579d6cce245aec/670574639fc6b9d5c483d766_664050ef1eb43f5fb8f57ff8_diagram.png "Synthetic generation")

* There are two main methods:
    - Self-improvement: Iteratively uses the LLM's own output to generate more complex queries
    - Distillation: Uses a stronger model to generate the synthetic data

* Constructing contexts:
    - During this phase documents from the knowledge base are split using a token splitter
    - A random chunk is selected
    - Finally, additional chunks are retrieved based on **semantic similarity**, **knowledge graphs** or other methods
    - Keeping **chunk size**, **chunk overlap** and similar parameters identical here and in the **retrieval component** of the **RAG** application yields better results

![Constructing contexts](https://cdn.prod.website-files.com/64bd90bdba579d6cce245aec/672cb28e9f8f60aabd382788_672cb201dadd3fd2de4451d2_context_generation.png "Context construction")
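The context-construction steps above can be sketched in plain Python. This is only an illustration of the idea, not DeepEval's implementation: the function names are hypothetical, tokens are approximated by whitespace-separated words, and a bag-of-words cosine similarity stands in for embedding-based semantic similarity.

```python
import math
import random
from collections import Counter


def split_into_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into chunks of `chunk_size` tokens that overlap by `chunk_overlap`."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    tokens = text.split()  # assumption: whitespace words approximate tokens
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks


def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; a stand-in for embedding similarity."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def build_context(chunks: list[str], extra_chunks: int, rng: random.Random) -> list[str]:
    """Pick one random seed chunk, then add the most similar remaining chunks."""
    seed_index = rng.randrange(len(chunks))
    seed = chunks[seed_index]
    others = [c for i, c in enumerate(chunks) if i != seed_index]
    others.sort(key=lambda c: cosine_similarity(seed, c), reverse=True)
    return [seed] + others[:extra_chunks]


document = "the retriever splits documents into chunks and the generator answers questions using retrieved chunks"
doc_chunks = split_into_chunks(document, chunk_size=6, chunk_overlap=2)
print(build_context(doc_chunks, extra_chunks=1, rng=random.Random(0)))
```

Using the same `chunk_size` and `chunk_overlap` here as in the application's retriever means the synthetic contexts resemble what the retriever would actually return.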

* Constructing synthetic queries:
    - Using the contexts, the **Synthesizer** can now generate synthetic inputs
    - This ensures that each input corresponds to its context, which improves **relevancy** and **accuracy**

![Constructing synthetic queries](https://cdn.prod.website-files.com/64bd90bdba579d6cce245aec/672cb28e9f8f60aabd382775_672cb23c502672c70e0372cd_asymmetry.png "Synthetic queries creation")

* Data Filtering:
    1. Context filtering: Removes low-quality chunks that may be unintelligible

    ![Context filtering](https://cdn.prod.website-files.com/64bd90bdba579d6cce245aec/672cb28e9f8f60aabd38278b_672cb26b461b45b0b5a6cd30_context_filtering.png "Filtering context")

    2. Input filtering: Ensures generated inputs meet quality standards

    ![Input filtering](https://cdn.prod.website-files.com/64bd90bdba579d6cce245aec/672cb28e9f8f60aabd382772_672cb27b799642a337436c3f_input_filtering.png "Filtering queries")
    
* Customizing dataset generation:
    - Depending on the scenario, inputs and outputs can be tailored to specific use cases
        - For example, a medical chatbot would behave very differently from a scientific one, since it would need to comfort patients.
    
* Data Evolution:
    - **In-Depth Evolving**: Expands simple instructions into more detailed versions
    - **In-Breadth Evolving**: Produces diverse instructions to enrich the dataset
    - **Elimination Evolving**: Removes less effective instructions

    ![Data evolution](https://cdn.prod.website-files.com/64bd90bdba579d6cce245aec/670574639fc6b9d5c483d763_6641a0d7ef709f365d888577_Screenshot%25202024-05-13%2520at%25201.10.30%2520PM.png)

### Install dependencies:

* If you already have a **virtual environment** you **DON'T** need to execute the next function.
* Make sure you select the correct kernel in your notebook environment.

In [1]:
def setup_venv(venv_exists: bool = False, name: str = "venv"):
    """Create the virtual environment if needed and install the dependencies into it."""
    if not venv_exists:
        !python3 -m venv {name}

    # Install requirements into the venv directly
    !{name}/bin/pip install -U deepeval ipykernel

    # Report accurately whether the venv was created or reused
    status = "already existed" if venv_exists else "has been created"
    print(f"Virtual environment '{name}' {status} and packages are installed.")
    print("Important: You need to manually select this kernel in your notebook:")
    print("1. Restart the kernel")
    print(f"2. Select the '{name}' kernel from the kernel menu")

In [None]:
setup_venv(venv_exists=False, name="venv")

In [None]:
# After installing dependencies and selecting the kernel you should be good to go.
# Make sure the package is installed before continuing further.
!pip3 show deepeval
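
As an alternative to the shell command, the same check can be done from Python, which confirms that the package is visible to the currently selected kernel rather than to whichever `pip3` is on the path. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


deepeval_version = installed_version("deepeval")
if deepeval_version is None:
    print("deepeval is not installed in this kernel; re-run the setup cell and re-select the kernel.")
else:
    print(f"deepeval {deepeval_version} is available.")
```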

### LLM provider

**DeepEval** uses **OpenAI** as its LLM provider by default, but **Ollama** is also available. To use it, execute the code cell below. This generates a `.deepeval` file that stores key-value pairs describing that LLM provider, such as the model name and base URL.

In [4]:
!deepeval set-ollama llama3.1 --base-url="http://localhost:11434/"
!deepeval set-ollama-embeddings mxbai-embed-large --base-url="http://localhost:11434"

🙌 Congratulations! You're now using a local Ollama model for all evals that 
require an LLM.
🙌 Congratulations! You're now using Ollama embeddings for all evals that 
require text embeddings.


### Extracting chunks from knowledge base to be used as context in data generation

In [2]:
!chmod u+x ./extract_context_chunks.sh
!./extract_context_chunks.sh
!ls # There should be a chunks.json file

Data directory already exists. Skipping download.
Virtual environment already exists. Skipping creation.
Dependencies already installed. Skipping installation.
Environment is set and ready to be used
Error when creating document: {'message': 'Document 88fa13ca-5921-590f-8693-408b1ed047bf already exists. Submit a DELETE request to `/documents/{document_id}` to delete this document and allow for re-ingestion.', 'error_type': 'R2RException'}
Error when creating document: {'message': 'Document bf55e614-3330-5283-a759-ea1bfa15a655 already exists. Submit a DELETE request to `/documents/{document_id}` to delete this document and allow for re-ingestion.', 'error_type': 'R2RException'}
Error when creating document: {'message': 'Document d7c24a75-99ba-5b84-8339-5a9188be0580 already exists. Submit a DELETE request to `/documents/{document_id}` to delete this document and allow for re-ingestion.', 'error_type': 'R2RException'}
Error when creating document: {'message': 'Document 2a9978ac-84fd-5644-

**Filtration config** controls the quality of the generated synthetic input queries. A higher threshold means that only higher-quality input queries are accepted.

If the **quality_score** is still lower than the **synthetic_input_quality_threshold** after **max_quality_retries**, the **golden with the highest quality_score** will be used.

In [3]:
from deepeval.synthesizer.config import FiltrationConfig

filtration_config = FiltrationConfig(
    synthetic_input_quality_threshold=0.7,
    max_quality_retries=5
)
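
The retry behaviour described above can be sketched as a small loop. This is an illustration under stated assumptions, not DeepEval's internal code: `generate` and `score` are hypothetical callables standing in for the LLM generation and quality-scoring steps, and the sketch assumes one initial attempt followed by up to `max_retries` further attempts.

```python
from typing import Callable


def generate_with_quality_retries(
    generate: Callable[[], str],
    score: Callable[[str], float],
    threshold: float,
    max_retries: int,
) -> tuple[str, float]:
    """Regenerate until the score reaches `threshold`; otherwise keep the best attempt."""
    best_input = generate()
    best_score = score(best_input)
    retries = 0
    while best_score < threshold and retries < max_retries:
        candidate = generate()
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best_input, best_score = candidate, candidate_score
        retries += 1
    return best_input, best_score


# Canned candidates and scores so the example is deterministic.
candidates = iter(["q1", "q2", "q3", "q4"])
scores = {"q1": 0.4, "q2": 0.6, "q3": 0.5, "q4": 0.9}
chosen, chosen_score = generate_with_quality_retries(
    generate=lambda: next(candidates),
    score=scores.__getitem__,
    threshold=0.7,
    max_retries=2,
)
print(chosen, chosen_score)  # q2 0.6
```

In this run no candidate reaches the threshold within two retries, so the best-scoring attempt (`q2`) is kept, mirroring the fallback to the golden with the highest `quality_score`.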

**Evolutions** specify which strategies are used to make the synthetic queries more complex. Since this is a **RAG** application, I will only use the evolution types that use **context**. `num_evolutions` sets how many evolution steps the **Synthesizer** applies to each query; the cell below sets it to 1, so each query is evolved once.

In [None]:
from deepeval.synthesizer.config import (
    Evolution,
    EvolutionConfig,
)

# https://www.deepeval.com/docs/synthesizer-introduction
evolution_config = EvolutionConfig(
    num_evolutions=1,
    evolutions={
        Evolution.MULTICONTEXT: 0.25,
        Evolution.CONCRETIZING: 0.25,
        Evolution.CONSTRAINED: 0.25,
        Evolution.COMPARATIVE: 0.25,
    }
)
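
To make the role of the weights concrete, the sketch below treats them as sampling probabilities and draws one evolution type per evolution step. It is a standalone illustration, not DeepEval's code: the string keys stand in for the `Evolution` enum members, `sample_evolution_chain` is a hypothetical helper, and the check that the weights sum to 1 is a choice made for this sketch.

```python
import random

# Hypothetical stand-in for the weights passed to EvolutionConfig.
evolution_weights = {
    "MULTICONTEXT": 0.25,
    "CONCRETIZING": 0.25,
    "CONSTRAINED": 0.25,
    "COMPARATIVE": 0.25,
}


def sample_evolution_chain(
    weights: dict[str, float], num_evolutions: int, rng: random.Random
) -> list[str]:
    """Draw one evolution type per evolution step, proportionally to its weight."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("evolution weights should sum to 1")
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=num_evolutions)


# With num_evolutions=3, a query would go through three sampled evolution steps.
print(sample_evolution_chain(evolution_weights, num_evolutions=3, rng=random.Random(42)))
```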

In [5]:
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer(
    filtration_config=filtration_config,
    evolution_config=evolution_config
)

In [5]:
import os
import json

with open(file="chunks.json", mode="r", encoding="utf-8") as f:
    context_chunks = json.load(f)

source_files = []
for file in os.listdir("data"):
    if file.endswith(".md") and file != "README.md":  
        source_files.append(f"data/{file}")
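
Before starting a long generation run, it can help to check the shape of the loaded data. The sketch below assumes `chunks.json` holds a list of contexts, each of which is a list of chunk strings, which is the shape the `contexts` argument expects; the extraction script is not shown here, so that is an assumption, and `validate_contexts` is a hypothetical helper.

```python
def validate_contexts(contexts: object) -> list[list[str]]:
    """Check that `contexts` is a non-empty list of non-empty lists of strings."""
    if not isinstance(contexts, list) or not contexts:
        raise ValueError("expected a non-empty list of contexts")
    for i, context in enumerate(contexts):
        if not isinstance(context, list) or not context:
            raise ValueError(f"context {i} should be a non-empty list of chunks")
        for j, chunk in enumerate(context):
            if not isinstance(chunk, str) or not chunk.strip():
                raise ValueError(f"context {i}, chunk {j} should be a non-empty string")
    return contexts


# Example data in the assumed shape; in the notebook you would pass `context_chunks`.
example_contexts = [["chunk about topic A", "related chunk"], ["chunk about topic B"]]
print(len(validate_contexts(example_contexts)))  # 2
```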

In [7]:
from deepeval.dataset.golden import Golden

goldens: list[Golden] = synthesizer.generate_goldens_from_contexts(
    contexts=context_chunks,
    include_expected_output=True,
    max_goldens_per_context=7,
    source_files=source_files
)

✨ Generating up to 56 goldens using DeepEval (using llama3.1 (Ollama), method=default): 100%|██████████| 56/56 [1:29:59<00:00, 96.42s/it]  
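
Once generation finishes, a quick preview helps to spot malformed goldens before uploading them. The helper below is a hypothetical sketch that relies only on the three attributes described earlier (`input`, `expected_output`, `context`); it uses stand-in objects so that it runs without DeepEval.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def preview_goldens(goldens: Iterable[Any], limit: int = 3, width: int = 80) -> list[str]:
    """Return one-line previews of the first `limit` goldens."""
    lines = []
    for index, golden in enumerate(goldens):
        if index >= limit:
            break
        expected = golden.expected_output or ""
        context_size = len(golden.context or [])
        lines.append(
            f"[{index}] input={golden.input[:width]!r} "
            f"expected={expected[:width]!r} context_chunks={context_size}"
        )
    return lines


# Stand-in object for illustration; in the notebook you would pass `goldens`.
sample_goldens = [
    SimpleNamespace(input="What is X?", expected_output="X is ...", context=["chunk 1", "chunk 2"])
]
for line in preview_goldens(sample_goldens):
    print(line)
```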


### Confident AI

1. In short, **Confident AI** is a cloud-based platform that is part of the **DeepEval** framework and stores **datasets**, **evaluations** and **monitoring data**.

2. If you want to use the **Confident AI** platform, create an account here: [Confident AI](https://www.confident-ai.com/)

3. After signing up, an **API key** is generated, which can be used to interact with the platform from inside the notebook.

---

Example of a `.env` file:
```bash
DEEPEVAL_RESULTS_FOLDER=<folder> # Results of evaluations can be saved locally
DEEPEVAL_API_KEY=<your api key>  # Relevant if you want to use Confident AI
DEEPEVAL_TELEMETRY_OPT_OUT="YES" # Remove telemetry
```

In [8]:
import os
from dotenv import load_dotenv
from deepeval import login_with_confident_api_key

# Loads the environment variables from a `.env` file.
# If you want to use Confident AI be sure to create one in this directory.
load_dotenv()

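deepeval_api_key = os.getenv("DEEPEVAL_API_KEY")
if not deepeval_api_key:
    # os.getenv returns None when the variable is missing, so fail early with a clear message.
    raise RuntimeError("DEEPEVAL_API_KEY is not set; add it to your .env file.")

# You should get a message letting you know you are logged in.
login_with_confident_api_key(deepeval_api_key)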

In [14]:
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset(goldens=goldens)
dataset.push(alias="DeepEval Dataset")



Opening in existing browser session.


In [15]:
from deepeval.dataset import EvaluationDataset

# The pushed data was cleaned up on the Confident AI platform because the inputs
# were not fully in the expected format, so the cleaned version is pulled back here.
final_dataset = EvaluationDataset()
final_dataset.pull(alias="DeepEval Dataset")

# final_dataset.save_as(
#     file_type="json",
#     directory="./deepeval_dataset"
# )