## Generating a synthetic dataset using DeepEval

### Synthesizer

This object generates **Golden** instances, each consisting of an **input**, an **expected output** and a **context**. It uses an LLM to come up with initial inputs and then tries to enhance them by making them more complex and realistic.

For a comprehensive guide to how this object works, see: [Synthesizer](https://www.confident-ai.com/blog/the-definitive-guide-to-synthetic-data-generation-using-llms)

### Summary

I will try to summarize the most important information:

* It uses an **LLM to come up with a comprehensive dataset** much faster than a human can
* The process starts with the LLM generating **synthetic queries** based on context from a knowledge base, usually documents
* Those initial queries are then **evolved** to reflect real-life complexity; together with the context, they can then be used to generate a **target/expected output**

![Dataset generation workflow](https://cdn.prod.website-files.com/64bd90bdba579d6cce245aec/670574639fc6b9d5c483d766_664050ef1eb43f5fb8f57ff8_diagram.png "Synthetic generation")

* There are two main methods:
    - Self-improvement: Iteratively uses the LLM's own output to generate more complex queries
    - Distillation: Uses a stronger model to generate the data

* Constructing contexts:
    - During this phase documents from the knowledge base are split using a token splitter
    - A random chunk is selected
    - Finally, additional chunks are retrieved based on **semantic similarity**, **knowledge graphs** or other methods
    - Results are better when **chunk size**, **chunk overlap** and similar parameters match those used in the **retrieval component** of the **RAG** application

![Constructing contexts](https://cdn.prod.website-files.com/64bd90bdba579d6cce245aec/672cb28e9f8f60aabd382788_672cb201dadd3fd2de4451d2_context_generation.png "Context construction")
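
The context-construction steps above can be sketched in plain Python. This is only an illustration under simplifying assumptions, not DeepEval's implementation: it splits on whitespace instead of real tokens, and it uses bag-of-words cosine similarity as a stand-in for an embedding model. The names `split_into_chunks`, `cosine_similarity` and `build_context` are hypothetical.

```python
import math
import random
from collections import Counter


def split_into_chunks(text, chunk_size=8, chunk_overlap=2):
    """Split text into overlapping chunks of whitespace-separated words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def cosine_similarity(a, b):
    """Bag-of-words cosine similarity (assumed stand-in for embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_context(chunks, max_context_size=3, rng=None):
    """Pick a random seed chunk, then add the most similar other chunks."""
    rng = rng or random.Random()
    seed_index = rng.randrange(len(chunks))
    seed = chunks[seed_index]
    others = [c for i, c in enumerate(chunks) if i != seed_index]
    others.sort(key=lambda c: cosine_similarity(seed, c), reverse=True)
    return [seed] + others[:max_context_size - 1]
```

Passing the same `chunk_size` and `chunk_overlap` here as in the retriever of the RAG application is what the last bullet above recommends.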

* Constructing synthetic queries:
    - Using the contexts, the **Synthesizer** can now generate synthetic inputs
    - This ensures that each input corresponds to its context, which improves **relevancy** and **accuracy**

![Constructing synthetic queries](https://cdn.prod.website-files.com/64bd90bdba579d6cce245aec/672cb28e9f8f60aabd382775_672cb23c502672c70e0372cd_asymmetry.png "Synthetic queries creation")
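
As a rough sketch of this step, the snippet below builds a prompt from a context and asks a model for questions that the context can answer. The prompt wording, the helper names and the `llm` callable are all assumptions made for illustration; DeepEval's actual prompts and internals may differ. A stand-in model is used so the example runs without an API key.

```python
from typing import Callable, List


def build_query_prompt(context: List[str], num_queries: int = 2) -> str:
    """Build a prompt asking for questions answerable from the context."""
    joined = "\n".join(f"- {chunk}" for chunk in context)
    return (
        f"Write {num_queries} questions that can be answered using only "
        "the context below. Return one question per line.\n\n"
        f"Context:\n{joined}"
    )


def generate_inputs(
    context: List[str], llm: Callable[[str], str], num_queries: int = 2
) -> List[str]:
    """Call the (hypothetical) model and split its reply into inputs."""
    reply = llm(build_query_prompt(context, num_queries))
    lines = [line.strip() for line in reply.splitlines()]
    return [line for line in lines if line][:num_queries]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; always returns the same reply."""
    return "What is chunk overlap?\nWhy does chunk size matter?\n"


context = ["Chunk overlap repeats text between neighbouring chunks."]
inputs = generate_inputs(context, fake_llm)
```

Because the prompt only contains the selected context, every generated input stays tied to text that actually exists in the knowledge base.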

* Data Filtering:
    1. Context filtering: Removes low-quality chunks that may be unintelligible

    ![Context filtering](https://cdn.prod.website-files.com/64bd90bdba579d6cce245aec/672cb28e9f8f60aabd38278b_672cb26b461b45b0b5a6cd30_context_filtering.png "Filtering context")

    2. Input filtering: Ensures generated inputs meet quality standards

    ![Input filtering](https://cdn.prod.website-files.com/64bd90bdba579d6cce245aec/672cb28e9f8f60aabd382772_672cb27b799642a337436c3f_input_filtering.png "Filtering queries")
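
Both filtering steps follow the same pattern: score each item and keep only those above a threshold. The sketch below shows that pattern with a deliberately simple, hypothetical scoring function (`readable_ratio`); a real pipeline would more likely use a model-based quality score, and the threshold value here is an arbitrary assumption.

```python
from typing import Callable, List, Tuple


def filter_by_quality(
    items: List[str], score: Callable[[str], float], threshold: float = 0.5
) -> Tuple[List[str], List[str]]:
    """Split items into (kept, dropped) based on a quality score."""
    kept, dropped = [], []
    for item in items:
        if score(item) >= threshold:
            kept.append(item)
        else:
            dropped.append(item)
    return kept, dropped


def readable_ratio(chunk: str) -> float:
    """Toy score: share of characters that are letters, digits or spaces."""
    if not chunk:
        return 0.0
    readable = sum(ch.isalnum() or ch.isspace() for ch in chunk)
    return readable / len(chunk)


chunks = [
    "DeepEval generates goldens from documents",
    "@@##%% ~~ ||| ^^^",  # unintelligible chunk, e.g. a parsing artifact
]
kept, dropped = filter_by_quality(chunks, readable_ratio, threshold=0.8)
```

The same `filter_by_quality` helper could be applied to generated inputs by swapping in a score that judges query quality instead of chunk readability.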
    
* Customizing dataset generation:
    - Depending on the scenario, inputs and outputs can be tailored to specific use cases
        - For example, a medical chatbot would behave very differently from a scientific one, since it would need to comfort patients.
    
* Data Evolution:
    - **In-Depth Evolving**: Expands simple instructions into more detailed versions
    - **In-Breadth Evolving**: Produces diverse instructions to enrich the dataset
    - **Elimination Evolving**: Removes less effective instructions

    ![Data evolution](https://cdn.prod.website-files.com/64bd90bdba579d6cce245aec/670574639fc6b9d5c483d763_6641a0d7ef709f365d888577_Screenshot%25202024-05-13%2520at%25201.10.30%2520PM.png)
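
The three evolution types can be illustrated with a small sketch. The prompt templates, the `llm` callable and the elimination rule (drop empty, unchanged or duplicate queries) are assumptions chosen for illustration, not the prompts or criteria DeepEval uses.

```python
import random
from typing import Callable, List

# Hypothetical prompt templates for the two rewriting evolutions
IN_DEPTH = "Rewrite this query so it needs more detailed reasoning: {query}"
IN_BREADTH = "Write a new query on a related but different topic: {query}"


def evolve(
    query: str, llm: Callable[[str], str], num_evolutions: int = 2, rng=None
) -> List[str]:
    """Apply randomly chosen evolutions in sequence; return all versions."""
    rng = rng or random.Random()
    history = [query]
    for _ in range(num_evolutions):
        template = rng.choice([IN_DEPTH, IN_BREADTH])
        history.append(llm(template.format(query=history[-1])))
    return history


def eliminate(candidates: List[str], original: str) -> List[str]:
    """Elimination: drop evolved queries that are empty, unchanged or repeated."""
    seen = {original.strip().lower()}
    kept = []
    for candidate in candidates:
        normalized = candidate.strip().lower()
        if normalized and normalized not in seen:
            seen.add(normalized)
            kept.append(candidate)
    return kept


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: echoes the query with a marker."""
    return prompt.split(": ", 1)[1] + " (evolved)"


history = evolve("What is RAG?", fake_llm, num_evolutions=2, rng=random.Random(0))
survivors = eliminate(history[1:], original=history[0])
```

Keeping the full `history` makes it easy to inspect how a query changed at each evolution step before deciding which versions to keep.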

In [None]:
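from deepeval.synthesizer import Synthesizer

# Create a synthesizer with its default settings
synthesizer = Synthesizer()
# Generate goldens (input, expected output, context) from the listed documents
synthesizer.generate_goldens_from_docs(
    document_paths=['example.txt', 'example.docx', 'example.pdf'],
)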