## Generating a synthetic dataset using DeepEval

### Synthesizer

This object generates **Golden** instances, which consist of an **input**, an **expected output** and a **context**. It uses an LLM to come up with random inputs and then enhances them by making them more complex and realistic.

For a comprehensive guide to how this object works, see: [Synthesizer](https://www.confident-ai.com/blog/the-definitive-guide-to-synthetic-data-generation-using-llms)
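To make the three fields concrete, here is a minimal sketch of what a golden carries. `SketchGolden` is a hypothetical stand-in written for illustration, not DeepEval's own `Golden` class, and the example values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SketchGolden:
    """Hypothetical stand-in for a golden; not DeepEval's own class."""

    input: str                             # the synthetic user query
    expected_output: Optional[str] = None  # the target answer, if one was generated
    context: List[str] = field(default_factory=list)  # chunks the query is grounded in


golden = SketchGolden(
    input="How much checked baggage can I bring?",
    expected_output="Placeholder answer derived from the context.",
    context=["Placeholder chunk from a baggage policy document."],
)
print(golden.input)
```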

### Summary

I will try to summarize the most important information:

* It uses an **LLM to come up with a comprehensive dataset** much faster than a human can
* The process starts with the LLM generating **synthetic queries** based on context from a knowledge base, usually documents
* Those initial queries are then **evolved** to reflect real-life complexity and, together with the context, can be used to generate a **target/expected output**

![Dataset generation workflow](https://cdn.prod.website-files.com/64bd90bdba579d6cce245aec/670574639fc6b9d5c483d766_664050ef1eb43f5fb8f57ff8_diagram.png "Synthetic generation")

* There are two main methods:
    - Self-improvement: Iteratively uses the LLM's own output to generate more complex queries
    - Distillation: Uses a stronger model to generate the synthetic data

* Constructing contexts:
    - During this phase, documents from the knowledge base are split using a token splitter
    - A random chunk is selected
    - Finally, additional chunks are retrieved based on **semantic similarity**, **knowledge graphs** or other methods
    - Keeping **chunk size**, **chunk overlap** and similar parameters identical here and in the **retrieval component** of the **RAG** application yields better results

![Constructing contexts](https://cdn.prod.website-files.com/64bd90bdba579d6cce245aec/672cb28e9f8f60aabd382788_672cb201dadd3fd2de4451d2_context_generation.png "Context construction")
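The context construction steps above can be sketched in plain Python. This is only an illustration under simplifying assumptions: tokens are whitespace-separated words, and word overlap (Jaccard) stands in for real embedding similarity. The helper names `split_into_chunks`, `jaccard` and `build_context`, as well as the `max_chunks` parameter, are hypothetical and not part of DeepEval.

```python
import random
from typing import List


def split_into_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split whitespace tokens into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


def jaccard(a: List[str], b: List[str]) -> float:
    """Word-overlap similarity; a crude stand-in for embedding similarity."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def build_context(chunks: List[List[str]], max_chunks: int, rng: random.Random) -> List[List[str]]:
    """Pick a random seed chunk, then add the most similar remaining chunks."""
    seed_index = rng.randrange(len(chunks))
    seed = chunks[seed_index]
    others = [c for i, c in enumerate(chunks) if i != seed_index]
    others.sort(key=lambda c: jaccard(seed, c), reverse=True)
    return [seed] + others[: max_chunks - 1]


chunks = split_into_chunks("a b c d e f g h i j", chunk_size=4, chunk_overlap=2)
context = build_context(chunks, max_chunks=2, rng=random.Random(0))
print(len(chunks), context)
```

With an overlap of two tokens, each chunk starts with the last two tokens of the previous one, which is why mismatched chunking settings between generation and retrieval can produce contexts the retriever never sees in that form.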

* Constructing synthetic queries:
    - Using the contexts, the **Synthesizer** can now generate synthetic inputs
    - This ensures that each input corresponds to its context, which improves **relevancy** and **accuracy**

![Constructing synthetic queries](https://cdn.prod.website-files.com/64bd90bdba579d6cce245aec/672cb28e9f8f60aabd382775_672cb23c502672c70e0372cd_asymmetry.png "Synthetic queries creation")
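A rough sketch of how a context can be turned into a query-generation prompt is shown here. The prompt wording and the `build_query_prompt` helper are my own assumptions for illustration; DeepEval uses its own internal templates.

```python
from typing import List


def build_query_prompt(context: List[str], num_queries: int) -> str:
    """Assemble a prompt asking an LLM for queries answerable from the given context."""
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(context, start=1))
    return (
        f"Write {num_queries} questions that can be answered using only "
        "the context below.\n\n"
        f"Context:\n{numbered}"
    )


prompt = build_query_prompt(["Chunk one.", "Chunk two."], num_queries=2)
print(prompt)
```

Because the context is part of the prompt, every generated query is tied to text that actually exists in the knowledge base.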

* Data Filtering:
    1. Context filtering: Removes low-quality chunks that may be unintelligible

    ![Context filtering](https://cdn.prod.website-files.com/64bd90bdba579d6cce245aec/672cb28e9f8f60aabd38278b_672cb26b461b45b0b5a6cd30_context_filtering.png "Filtering context")

    2. Input filtering: Ensures generated inputs meet quality standards

    ![Input filtering](https://cdn.prod.website-files.com/64bd90bdba579d6cce245aec/672cb28e9f8f60aabd382772_672cb27b799642a337436c3f_input_filtering.png "Filtering queries")
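Both filtering steps boil down to scoring an item and dropping it when the score is too low. The sketch below assumes a toy scoring heuristic (`toy_score`, the share of letters and spaces in the text) purely for illustration; a real pipeline would typically use an LLM or another model as the judge. The function names and the threshold value are hypothetical.

```python
from typing import Callable, List


def filter_by_quality(items: List[str], score: Callable[[str], float], threshold: float) -> List[str]:
    """Keep only the items whose quality score reaches the threshold."""
    return [item for item in items if score(item) >= threshold]


def toy_score(text: str) -> float:
    """Toy heuristic: share of characters that are letters or whitespace."""
    if not text:
        return 0.0
    readable = sum(ch.isalpha() or ch.isspace() for ch in text)
    return readable / len(text)


candidates = ["Passengers may check two bags.", "%%## 0x1f ::: ~~", ""]
kept = filter_by_quality(candidates, toy_score, threshold=0.8)
print(kept)
```

The same `filter_by_quality` function works for contexts and for generated inputs; only the scorer changes.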
    
* Customizing dataset generation:
    - Depending on the scenario, inputs and outputs can be tailored to specific use cases
        - For example, a medical chatbot would behave completely differently from a scientific one, since it would need to comfort patients.
    
* Data Evolution:
    - **In-Depth Evolving**: Expands simple instructions into more detailed versions
    - **In-Breadth Evolving**: Produces diverse instructions to enrich the dataset
    - **Elimination Evolving**: Removes less effective instructions

    ![Data evolution](https://cdn.prod.website-files.com/64bd90bdba579d6cce245aec/670574639fc6b9d5c483d763_6641a0d7ef709f365d888577_Screenshot%25202024-05-13%2520at%25201.10.30%2520PM.png)

### LLM provider

**DeepEval** uses **OpenAI** as its LLM provider by default; however, **Ollama** is also available. To use it, execute the code cell below. This generates a `.deepeval` file that stores key-value pairs about that LLM provider, such as the model name and base URL.

In [2]:
!deepeval set-ollama model-name=llama3.1:latest --base-url="http://localhost:11434/"
!deepeval set-ollama-embeddings model_name=mxbai-embed-large --base-url="http://localhost:11434"

🙌 Congratulations! You're now using a local Ollama model for all evals that 
require an LLM.
🙌 Congratulations! You're now using Ollama embeddings for all evals that 
require text embeddings.


In [3]:
!git clone https://huggingface.co/datasets/explodinggradients/ragas-airline-dataset data

Cloning into 'data'...
remote: Enumerating objects: 14, done.
remote: Total 14 (delta 0), reused 0 (delta 0), pack-reused 14 (from 1)
Unpacking objects: 100% (14/14), 16.16 KiB | 4.04 MiB/s, done.


**Evolutions** specify which approaches are used to make the synthetic queries more complex. Since this is a **RAG** application, I will only use the evolution types that rely on **context**. Setting `num_evolutions` to three makes the **Synthesizer** repeat the process of complicating each query three times.
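To illustrate how weighted evolution types and `num_evolutions` could interact, here is a small sketch. It assumes that one evolution type is drawn per round with probability proportional to its weight, which is my reading of the configuration rather than a confirmed description of DeepEval's internals. `plan_evolutions` and `evolve` are hypothetical helpers, and the string tag stands in for an actual LLM rewrite.

```python
import random
from typing import Dict, List


def plan_evolutions(weights: Dict[str, float], num_evolutions: int, rng: random.Random) -> List[str]:
    """Draw one evolution type per round, with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=num_evolutions)


def evolve(query: str, plan: List[str]) -> str:
    """Apply each planned evolution in order; a real synthesizer would call an LLM here."""
    for name in plan:
        query = f"{query} [{name}]"  # the tag stands in for the LLM rewrite
    return query


weights = {
    "MULTICONTEXT": 0.25,
    "CONCRETIZING": 0.25,
    "CONSTRAINED": 0.25,
    "COMPARATIVE": 0.25,
}
plan = plan_evolutions(weights, num_evolutions=3, rng=random.Random(42))
print(evolve("What is the baggage allowance?", plan))
```

With equal weights, every type is equally likely in each round, and the same type may be picked more than once.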

In [None]:
import typing as t
from pathlib import Path
from deepeval.dataset import Golden
from deepeval.synthesizer.config import (
    Evolution,
    EvolutionConfig,
    ContextConstructionConfig
)
from deepeval.synthesizer import Synthesizer

# TODO: Define different scenarios to get a more comprehensive dataset
# ChromaDB missing
# Supported file types -> docx, pdf, txt, NOT md
# Instead of using generate_goldens_from_docs we can take the other approach and generate from contexts
# To do so, use R2R to fetch all chunks from the ingested files
# Then use those contexts to generate the synthetic queries
# The final maximum number of goldens to be generated is max_goldens_per_context multiplied by
# max_contexts_per_document as specified in the context_construction_config, and NOT simply max_goldens_per_context.

synthesizer = Synthesizer(
    evolution_config=EvolutionConfig(
        num_evolutions=3,
        evolutions={
            Evolution.MULTICONTEXT: 0.25,
            Evolution.CONCRETIZING: 0.25,
            Evolution.CONSTRAINED: 0.25,
            Evolution.COMPARATIVE: 0.25,
        }
    )
)

# Collect every Markdown document in ./data except the README
docs_dir = Path("./data")
doc_paths = [
    str(file.absolute())
    for file in docs_dir.iterdir()
    if file.is_file() and file.suffix == ".md" and file.name != "README.md"
]

goldens: t.List[Golden] = synthesizer.generate_goldens_from_docs(
    document_paths=doc_paths,
    max_goldens_per_context=5,
    context_construction_config=ContextConstructionConfig(
        max_contexts_per_document=5,
        max_context_length=5,
        chunk_size=1024,
        chunk_overlap=128,
        max_retries=5
    )
)

['/home/p3tr0vv/Desktop/Evaluation-Approaches-for-Retrieval-Augmented-Generation-RAG-/project/deepeval/data/special_assistance.md',
 '/home/p3tr0vv/Desktop/Evaluation-Approaches-for-Retrieval-Augmented-Generation-RAG-/project/deepeval/data/managing_reservations.md',
 '/home/p3tr0vv/Desktop/Evaluation-Approaches-for-Retrieval-Augmented-Generation-RAG-/project/deepeval/data/flight_delays.md',
 '/home/p3tr0vv/Desktop/Evaluation-Approaches-for-Retrieval-Augmented-Generation-RAG-/project/deepeval/data/baggage_policies.md',
 '/home/p3tr0vv/Desktop/Evaluation-Approaches-for-Retrieval-Augmented-Generation-RAG-/project/deepeval/data/inflight_services.md',
 '/home/p3tr0vv/Desktop/Evaluation-Approaches-for-Retrieval-Augmented-Generation-RAG-/project/deepeval/data/schedule_changes.md',
 '/home/p3tr0vv/Desktop/Evaluation-Approaches-for-Retrieval-Augmented-Generation-RAG-/project/deepeval/data/bookings.md',
 '/home/p3tr0vv/Desktop/Evaluation-Approaches-for-Retrieval-Augmented-Generation-RAG-/project
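The note in the code comments about the upper bound on generated goldens can be checked with quick arithmetic. The `max_goldens` helper is hypothetical, and extending the bound across several documents by simple multiplication is my own assumption.

```python
def max_goldens(num_documents: int, max_contexts_per_document: int, max_goldens_per_context: int) -> int:
    """Upper bound on goldens: documents x contexts per document x goldens per context."""
    return num_documents * max_contexts_per_document * max_goldens_per_context


# Values taken from the configuration used in the cell above, for a single document
per_document = max_goldens(1, max_contexts_per_document=5, max_goldens_per_context=5)
print(per_document)  # 25
```

This is only a ceiling: filtering and failed retries can leave fewer goldens than the bound suggests.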

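In [None]:
# Unfinished cell: generate_goldens_from_contexts expects a list of contexts,
# each of which is itself a list of text chunks (e.g. fetched via R2R, see the TODO above).
contexts: t.List[t.List[str]] = []  # fill with chunks before running
goldens_from_contexts = synthesizer.generate_goldens_from_contexts(
    contexts=contexts,
)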