# Synthetic test data generation

## Importance of High-Quality Test Sets for RAG Systems

### Purpose
Test sets are critical for:
- Accurately measuring RAG system performance
- Identifying system strengths and weaknesses
- Guiding continuous improvement

### Key Evaluation Dimensions
1. **Retrieval Effectiveness**: Assess how well relevant context is retrieved
2. **Generation Quality**: Evaluate the accuracy and coherence of generated responses
3. **Contextual Relevance**: Measure how well the system understands and integrates retrieved information

### Evaluation Goals
- Benchmark system performance
- Detect hallucinations
- Validate generalization capabilities
- Simulate real-world query complexity

### Best Practices
- Use diverse query types
- Cover multiple domains
- Include edge cases
- Create reproducible test scenarios
- Include enough samples to draw statistically significant conclusions
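
To make "enough samples" concrete, here is a rough sketch that estimates how many test samples are needed to measure a rate (for example, the share of correct answers) within a given margin of error. It assumes the standard normal approximation for a proportion; the function name and defaults are illustrative and not part of RAGAs.

```python
import math


def required_sample_size(margin_of_error, confidence_z=1.96, expected_rate=0.5):
    """Samples needed so a measured rate has the given margin of error.

    Uses the normal approximation n = z^2 * p * (1 - p) / e^2.
    expected_rate=0.5 is the most conservative (largest) choice.
    """
    variance = expected_rate * (1.0 - expected_rate)
    return math.ceil(confidence_z ** 2 * variance / margin_of_error ** 2)


# 95% confidence (z = 1.96) at a 10% and a 5% margin of error.
n_10 = required_sample_size(0.10)
n_05 = required_sample_size(0.05)
print(n_10, n_05)
```

Halving the margin of error roughly quadruples the required number of samples, which is worth keeping in mind when choosing `testset_size` later on.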

## 1. Retrieve data:
* The data can be a dataset from `huggingface` or any other platform.

* Alternatively, use files available on disk (PDF, Markdown, etc.).

* One can also use `AsyncHtmlLoader` from `langchain` to scrape from the internet.
    - **Be careful when scraping the web not to violate any terms and conditions!**

In [1]:
!git clone https://huggingface.co/datasets/explodinggradients/Sample_Docs_Markdown data

Cloning into 'data'...
remote: Enumerating objects: 31, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 31 (delta 4), reused 0 (delta 0), pack-reused 10 (from 1)
Unpacking objects: 100% (31/31), 132.02 KiB | 8.80 MiB/s, done.


## 2. Load data into document objects

I prefer `langchain`, but `llama-index` is also an option.

In [2]:
from langchain_community.document_loaders import DirectoryLoader

path = "data/"
loader = DirectoryLoader(path, glob="**/*.md", exclude="README.md")
docs = loader.load()
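
Before loading, it can help to preview which files the pattern will pick up. The following is a small standard-library sketch that approximates the `glob="**/*.md"` and `exclude="README.md"` selection; the helper name is hypothetical and the exact matching rules of `DirectoryLoader` may differ.

```python
from pathlib import Path


def preview_markdown_files(root, exclude_name="README.md"):
    """Return the Markdown files under root, skipping files named exclude_name."""
    return sorted(
        path
        for path in Path(root).glob("**/*.md")
        if path.name != exclude_name
    )


# Prints nothing if the directory does not exist or holds no Markdown files.
for path in preview_markdown_files("data/"):
    print(path)
```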

## 3. Construct knowledge graph

- A **knowledge graph** is a fundamental concept in **RAGAs** and underpins its **automatic data generation**.

- A **knowledge graph** initially consists of **Node** objects, which represent **documents**: their content and additional metadata.

- The graph is then enriched by **applying various transformations** to it. The transformations rely on **Extractors**, which gather relevant data from the documents (what exactly depends on the type of extractor). From that data, **relationships are built** that logically connect two or more Node objects.

- The graph is then used to generate so-called **scenarios**, and can also be used to generate **personas**, from which the test samples are derived.
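
To illustrate the idea independently of the library, here is a minimal pure-Python sketch of nodes, an extraction step, and a relationship-building step. All class and property names (`SketchNode`, `SketchGraph`, `entity_overlap`, ...) are hypothetical and only mirror the concepts; they are not the RAGAs classes used in the next cell.

```python
from dataclasses import dataclass, field


@dataclass
class SketchNode:
    node_id: str
    properties: dict = field(default_factory=dict)


@dataclass
class SketchRelationship:
    source: str
    target: str
    kind: str
    properties: dict = field(default_factory=dict)


@dataclass
class SketchGraph:
    nodes: dict = field(default_factory=dict)
    relationships: list = field(default_factory=list)

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def connect(self, source, target, kind, **properties):
        self.relationships.append(
            SketchRelationship(source, target, kind, properties)
        )

    def neighbours(self, node_id):
        """Ids of all nodes linked to node_id, in either direction."""
        linked = []
        for rel in self.relationships:
            if rel.source == node_id:
                linked.append(rel.target)
            elif rel.target == node_id:
                linked.append(rel.source)
        return linked


graph = SketchGraph()
graph.add_node(SketchNode("doc-1", {"page_content": "Alice joined Acme in Berlin."}))
graph.add_node(SketchNode("doc-2", {"page_content": "Acme opened an office in Paris."}))

# "Extractor" step: attach a derived property to each node (hard-coded here;
# a real extractor would ask an LLM).
graph.nodes["doc-1"].properties["entities"] = {"Alice", "Acme", "Berlin"}
graph.nodes["doc-2"].properties["entities"] = {"Acme", "Paris"}

# "Relationship builder" step: connect nodes that share at least one entity.
shared = (
    graph.nodes["doc-1"].properties["entities"]
    & graph.nodes["doc-2"].properties["entities"]
)
if shared:
    graph.connect("doc-1", "doc-2", "entity_overlap", shared=sorted(shared))

print(graph.neighbours("doc-1"))
```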

In [3]:
from ragas.testset.graph import (
    Node,
    NodeType,
    KnowledgeGraph,
)

kg = KnowledgeGraph()

for doc in docs:
    kg.add(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )



## 4. Instantiate required objects

- **RAGAs** requires a **large language model** and an **embedding model** to be able to apply the **transformations** to the **knowledge graph**. For that purpose, one must create **wrapper** objects for both models. `Langchain` and `llama-index` are both supported.

- Additionally, a **configuration** can be used to modify the default behaviour of the framework. For example, timeout values and the maximum number of retries for failed operations can be configured through the **RunConfig**.
    - **NOTE**: since llama3.1:8b is not particularly fast, you may need to increase the `timeout` value.

- Lastly, **RAGAs** has a single caching implementation, which saves intermediate results to disk. It is available through the **DiskCacheBackend** class.
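
To give an intuition for what settings such as `max_retries` and `max_wait` control, here is a generic sketch of retrying a failing call with capped exponential backoff. This is an assumption-laden illustration of the general technique, not RAGAs' actual retry logic; the helper names and the doubling schedule are my own.

```python
import time


def backoff_delays(max_retries, max_wait, base=1.0):
    """Waiting times before each retry: 1, 2, 4, ... seconds, capped at max_wait."""
    return [min(base * 2 ** attempt, max_wait) for attempt in range(max_retries)]


def call_with_retries(operation, max_retries, max_wait, sleep=time.sleep):
    """Run operation(); on failure wait and try again, up to max_retries retries."""
    last_error = None
    # The leading 0.0 stands for the first attempt, which happens immediately.
    for delay in [0.0] + backoff_delays(max_retries, max_wait):
        if delay:
            sleep(delay)
        try:
            return operation()
        except Exception as error:
            last_error = error
    raise last_error


# A fake model call that times out twice before succeeding.
attempts = {"count": 0}


def flaky_model_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("model did not answer in time")
    return "ok"


waits = []  # record the delays instead of actually sleeping
result = call_with_retries(flaky_model_call, max_retries=15, max_wait=30, sleep=waits.append)
print(result, waits)
```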

In [4]:
from langchain_core.caches import InMemoryCache
from langchain_ollama.llms import OllamaLLM
from langchain_ollama.embeddings import OllamaEmbeddings

from ragas import (
    RunConfig,
    DiskCacheBackend
)
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

run_config = RunConfig(
    timeout=14400,
    max_retries=15,
    max_wait=30
)

cacher = DiskCacheBackend(cache_dir=".cache")

ollama_llm = OllamaLLM(
    cache=InMemoryCache(),
    model="llama3.1",
    base_url="http://localhost:11434",
    temperature=0.1,
    num_ctx=24000,
    format="json"
)

ollama_embeddings = OllamaEmbeddings(
    model="mxbai-embed-large",
    base_url="http://localhost:11434"
)

langchain_llm = LangchainLLMWrapper(
    langchain_llm=ollama_llm,
    run_config=run_config,
    cache=cacher
)

langchain_embeddings = LangchainEmbeddingsWrapper(
    embeddings=ollama_embeddings,
    run_config=run_config,
    cache=cacher
)

## 5. Create the transformation pipeline

The sequence of transformations:

1. Headline Extraction & Splitting (for documents longer than 500 tokens; chunks of 500 to 1500 tokens)  
    - Extracts headlines from large documents to create logical sections.  
    - Splits long documents into smaller chunks at headline boundaries.

2. Summary Extraction  
    - Generates a concise summary for each document to facilitate quick understanding.

3. Named Entity Recognition (NER) & Theme Extraction  
    - NERExtractor identifies and extracts named entities (e.g., people, organizations, locations).  
    - ThemesExtractor detects overarching topics/themes in each chunk.

4. Embedding Generation
    - Uses an embedding model to convert summaries into vector representations for similarity-based retrieval.

5. Cosine Similarity Computation  
    - Measures semantic similarity between documents based on their summary embeddings.  
    - Creates relationships between similar documents using a threshold of `0.75`.

6. NER-Based Overlap Score Computation
    - Computes overlap scores between extracted named entities in different chunks.  
    - Helps detect if two chunks talk about similar entities.

7. Extraction of key phrases and topics
    - Used here as additional input for generating personas.
    - Provides additional insight into the documents.

8. Custom Node Filtering 
    - Filters nodes to keep only relevant chunks for processing.

9. Parallel Processing for Efficiency
    - Certain transformations run in parallel to improve performance:
      - Summary embeddings, theme extraction, NER, keyphrase extraction, and topic description extraction run together.
      - Cosine similarity and entity overlap scoring run together.

-> Final Outcome:
    - A structured set of document transformations that extracts headlines, summaries, key entities, themes, embeddings, and relationships between different chunks/documents.
    - Used to construct a knowledge graph for downstream retrieval-augmented generation (RAG) tasks.
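
As an illustration of the cosine-similarity step, the sketch below computes pairwise cosine similarity between toy embedding vectors and keeps only the pairs that reach the `0.75` threshold. The vectors and helper names are made up for the example; a real embedding model produces vectors with hundreds of dimensions.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_pairs(embeddings, threshold=0.75):
    """Return (id_a, id_b, score) for every pair whose similarity reaches the threshold."""
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            pairs.append((id_a, id_b, round(score, 3)))
    return pairs


# Toy 3-dimensional "summary embeddings".
toy_embeddings = {
    "doc-1": [1.0, 0.0, 0.0],
    "doc-2": [0.9, 0.1, 0.0],
    "doc-3": [0.0, 1.0, 0.0],
}

# Only doc-1 and doc-2 point in nearly the same direction.
pairs = similar_pairs(toy_embeddings)
print(pairs)
```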

In [5]:
from ragas.testset.transforms.extractors.llm_based import (
    HeadlinesExtractor,
    SummaryExtractor,
    ThemesExtractor,
    NERExtractor,
    KeyphrasesExtractor,
    TopicDescriptionExtractor,
)
from ragas.testset.transforms.relationship_builders import (
    OverlapScoreBuilder,
    CosineSimilarityBuilder,
)
from ragas.testset.transforms.engine import Parallel
from ragas.testset.transforms.filters import CustomNodeFilter
from ragas.testset.transforms.splitters import HeadlineSplitter
from ragas.testset.transforms.extractors.embeddings import EmbeddingExtractor

from ragas.utils import num_tokens_from_string

# Taken from the default_transforms
def filter_doc_with_num_tokens(node, min_num_tokens=500):
    return (
        node.type == NodeType.DOCUMENT
        and num_tokens_from_string(node.properties["page_content"]) > min_num_tokens
    )

headline_extractor = HeadlinesExtractor(
    llm=langchain_llm,
    max_num=10,
    filter_nodes=lambda node: filter_doc_with_num_tokens(node)
)

splitter = HeadlineSplitter(
    min_tokens=500,
    max_tokens=1500
)

summary_extractor = SummaryExtractor(
    llm=langchain_llm,
    filter_nodes=lambda node: filter_doc_with_num_tokens(node)
)

theme_extractor = ThemesExtractor(
    llm=langchain_llm,
    max_num_themes=10
)

ner_extractor = NERExtractor(
    llm=langchain_llm
)

summary_emb_extractor = EmbeddingExtractor(
    embedding_model=langchain_embeddings,
    property_name="summary_embedding",
    embed_property_name="summary",
    filter_nodes=lambda node: filter_doc_with_num_tokens(node),
)

keyphrase_extractor = KeyphrasesExtractor(
    llm=langchain_llm,
    max_num=15
)

topic_description_extractor = TopicDescriptionExtractor(
    llm=langchain_llm
)

cosine_sim_builder = CosineSimilarityBuilder(
    property_name="summary_embedding",
    new_property_name="summary_similarity",
    threshold=0.75,
    filter_nodes=lambda node: filter_doc_with_num_tokens(node),
)

ner_overlap_sim = OverlapScoreBuilder(
    threshold=0.01, filter_nodes=lambda node: node.type == NodeType.CHUNK
)

node_filter = CustomNodeFilter(
    llm=langchain_llm,
    filter_nodes=lambda node: node.type == NodeType.CHUNK
)

transforms = [
    headline_extractor,
    splitter,
    summary_extractor,
    node_filter,
    Parallel(summary_emb_extractor, theme_extractor, ner_extractor, keyphrase_extractor, topic_description_extractor),
    Parallel(cosine_sim_builder, ner_overlap_sim)
]
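
To give a feel for what an entity overlap score expresses, here is a simple sketch based on the Jaccard index (shared entities divided by all distinct entities). This is only an assumed stand-in for illustration; the scoring that `OverlapScoreBuilder` performs internally may well differ.

```python
def entity_overlap(entities_a, entities_b):
    """Jaccard overlap between two entity lists, ignoring case."""
    set_a = {entity.lower() for entity in entities_a}
    set_b = {entity.lower() for entity in entities_b}
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


chunk_a = ["Alice", "Acme", "Berlin"]
chunk_b = ["acme", "Paris"]

score = entity_overlap(chunk_a, chunk_b)
# With a low threshold such as 0.01, a single shared entity already links the chunks.
related = score >= 0.01
print(score, related)
```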

## 6. Apply the transformations to the knowledge graph

The `apply_transforms` call below applies all the previously defined transformations, enriching the knowledge graph in the process.
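
Conceptually, a pipeline like this runs its stages one after another, while the steps grouped inside one stage are independent and can run concurrently. The sketch below mimics that with plain functions and a thread pool. It is a simplified mental model with hypothetical names, not how `apply_transforms` or `Parallel` are implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def add_word_count(node):
    node["word_count"] = len(node["page_content"].split())


def add_first_sentence(node):
    node["first_sentence"] = node["page_content"].split(".")[0].strip()


def add_is_long(node):
    # Depends on word_count, so it must run in a later stage.
    node["is_long"] = node["word_count"] > 5


def run_stage(stage, nodes):
    """Run one stage; a tuple of steps is treated as a parallel group."""
    steps = stage if isinstance(stage, tuple) else (stage,)
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(step, node) for step in steps for node in nodes]
        for future in futures:
            future.result()  # re-raises any exception from a step


def apply_pipeline(nodes, stages):
    """Apply the stages in order; each stage finishes before the next starts."""
    for stage in stages:
        run_stage(stage, nodes)
    return nodes


nodes = [{"page_content": "RAG systems retrieve context. Then they generate."}]
stages = [(add_word_count, add_first_sentence), add_is_long]
print(apply_pipeline(nodes, stages))
```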

In [6]:
from ragas.testset.transforms import apply_transforms

apply_transforms(kg, transforms, run_config)

Applying HeadlineSplitter:   0%|          | 0/11 [00:00<?, ?it/s]
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
Applying SummaryExtractor:   0%|          | 0/6 [00:00<?, ?it/s]
Property 'summary' already exists in node '8d10ac'. Skipping!
Applying CustomNodeFilter:   0%|          | 0/11 [00:00<?, ?it/s]
Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt fix_output_format failed to parse output: The

## 7. Generating personas

- A **Persona** is an entity/role which interacts with the system. **Personas** provide context and perspective, ensuring that **generated queries are natural, user-specific, and diverse**.

- `Example: a Senior DevOps engineer, a Junior Data Scientist, a Marketing Manager in the context of an IT company `

- A **Persona** object consists of a **name** and a **role description**.
    
    - The name is used to identify the persona and the description is used to describe the role of the persona.

I've decided to deviate a little from **RAGAs** and created my own custom function for generating personas, which does not differ much from the built-in one. Instead of just using the **summary** property, I also make use of the **keyphrases** and **themes**, in the hope of getting more diverse and creative persona objects.
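
Because an LLM may return several personas that share the same name, a small post-processing step can help keep the list diverse. The sketch below uses a hypothetical `PersonaSketch` dataclass (standing in for the real persona object) and a hypothetical `deduplicate_personas` helper that keeps the first persona per name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonaSketch:
    name: str
    role_description: str


def deduplicate_personas(personas):
    """Keep the first persona for each name, comparing names case-insensitively."""
    seen = set()
    unique = []
    for persona in personas:
        key = persona.name.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(persona)
    return unique


candidates = [
    PersonaSketch("Senior DevOps Engineer", "Maintains deployment pipelines."),
    PersonaSketch("Junior Data Scientist", "Explores data and trains models."),
    PersonaSketch("senior devops engineer", "Automates infrastructure."),
]

unique_personas = deduplicate_personas(candidates)
print([persona.name for persona in unique_personas])
```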

In [9]:
from persona import generate_personas_from_kg

personas = generate_personas_from_kg(
    kg,
    langchain_llm,
    num_personas=10
)

Generating personas: 100%|██████████| 10/10 [00:00<00:00, 595.53it/s]


In [10]:
personas

[Persona(name='Diversity and Inclusion Manager', role_description='Works to create inclusive workplaces and promote diversity through events and initiatives.'),
 Persona(name='Diversity and Inclusion Manager', role_description='Oversees and implements strategies to promote diversity, inclusion, and belonging within the organization.'),
 Persona(name='Diversity and Inclusion Coordinator', role_description='Develops and implements strategies to promote diversity, equity, and inclusion within the organization.'),
 Persona(name='Diversity and Inclusion Advisor', role_description='Works to promote diversity, equity, and inclusion within an organization.'),
 Persona(name='Program Coordinator', role_description='Develops and implements programs to promote diversity, inclusion, and belonging within an organization.'),
 Persona(name='Global Diversity Officer', role_description='Develops strategies to promote diversity, equity, and inclusion in a predominantly US-based organization, focusing on 

## 8. Generate query types 

- There are two main types of queries in **RAGAs**:
    
    - **SingleHopQuery** where the **context** relevant for answering a question lies in a **single document/chunk**

    - **MultiHopQuery** where the **context** relevant for answering a question lies in **multiple documents/chunks**

- Additionally, each of those query types comes in a **Specific** and an **Abstract** variant:
    
    - A **Specific** query pertains to a **fact**.

        - `Example: When did WW1 break out? (Can be precisely answered, there's no room for guessing/interpretation)`
    
    - An **Abstract** query is more about testing the **reasoning** capabilities of the LLM.

        - `Example: Why did WW1 break out? (There's room for interpretation in this case)`

**Synthesizers** are responsible for **converting enriched nodes and personas into queries**. They achieve this by **selecting a node property (e.g., "entities" or "keyphrases"), pairing it with a persona, style, and query length**, and then using an LLM to generate a query-answer pair based on the content of the node.
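
The pairing step can be pictured as sampling from all combinations of a term, a persona, a style, and a length. The sketch below does exactly that with the standard library; the function name and the example values are assumptions for illustration, and the real synthesizers add an LLM call on top of each sampled combination.

```python
import itertools
import random


def sample_scenarios(terms, personas, styles, lengths, count, seed=0):
    """Pick `count` distinct (term, persona, style, length) combinations."""
    combos = list(itertools.product(terms, personas, styles, lengths))
    rng = random.Random(seed)  # seeded so the selection is reproducible
    return rng.sample(combos, k=min(count, len(combos)))


scenarios = sample_scenarios(
    terms=["diversity events", "inclusion strategy"],  # e.g. values of a node property
    personas=["Senior DevOps Engineer", "Marketing Manager"],
    styles=["formal", "casual"],
    lengths=["short", "long"],
    count=4,
)

for term, persona, style, length in scenarios:
    print(f"{persona} asks a {length}, {style} question about '{term}'")
```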


In [11]:
from ragas.testset.synthesizers.multi_hop.specific import MultiHopSpecificQuerySynthesizer
from ragas.testset.synthesizers.multi_hop.abstract import MultiHopAbstractQuerySynthesizer
from ragas.testset.synthesizers.single_hop.specific import SingleHopSpecificQuerySynthesizer

single_hop_specific_entities = SingleHopSpecificQuerySynthesizer(
    llm=langchain_llm,
    property_name="entities"
)

single_hop_specific_keyphrases = SingleHopSpecificQuerySynthesizer(
    llm=langchain_llm,
    property_name="keyphrases"
)

single_hop_specific_headlines = SingleHopSpecificQuerySynthesizer(
    llm=langchain_llm,
    property_name="headlines"
)

single_hop_specific_themes = SingleHopSpecificQuerySynthesizer(
    llm=langchain_llm,
    property_name="themes"
)

multi_hop_specific_entities = MultiHopSpecificQuerySynthesizer(
    llm=langchain_llm
)

multi_hop_abstract_entities = MultiHopAbstractQuerySynthesizer(
    llm=langchain_llm
)

query_distribution = [
    (single_hop_specific_entities, 0.25),
    (single_hop_specific_keyphrases, 0.25),
    (single_hop_specific_headlines, 0.125),
    (single_hop_specific_themes, 0.125),
    (multi_hop_specific_entities, 0.125),
    (multi_hop_abstract_entities, 0.125),
]
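
It is worth checking that the weights sum to 1 and seeing roughly how many samples each synthesizer would receive. The sketch below uses the largest-remainder method to turn the weights into whole sample counts for a test set of 50. This is my own illustrative allocation; the way RAGAs rounds internally may differ.

```python
import math


def allocate_samples(weights, testset_size):
    """Split testset_size across synthesizers using the largest-remainder method."""
    total = sum(weights.values())
    if not math.isclose(total, 1.0):
        raise ValueError(f"weights must sum to 1.0, got {total}")
    exact = {name: weight * testset_size for name, weight in weights.items()}
    counts = {name: math.floor(value) for name, value in exact.items()}
    leftover = testset_size - sum(counts.values())
    # Hand the remaining samples to the largest fractional parts first.
    by_remainder = sorted(
        exact, key=lambda name: exact[name] - counts[name], reverse=True
    )
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


weights = {
    "single_hop_specific_entities": 0.25,
    "single_hop_specific_keyphrases": 0.25,
    "single_hop_specific_headlines": 0.125,
    "single_hop_specific_themes": 0.125,
    "multi_hop_specific_entities": 0.125,
    "multi_hop_abstract_entities": 0.125,
}

counts = allocate_samples(weights, testset_size=50)
print(counts)
```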

## 9. Generate the samples

In [12]:
from dotenv import load_dotenv
from ragas.testset import TestsetGenerator

load_dotenv()

generator = TestsetGenerator(
    langchain_llm,
    langchain_embeddings,
    kg,
    personas
)

dataset = generator.generate_with_langchain_docs(
    docs,
    testset_size=50,
    query_distribution=query_distribution,
    run_config=run_config,
    with_debugging_logs=True,
)

Applying HeadlinesExtractor:   0%|          | 0/5 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/11 [00:00<?, ?it/s]
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
Applying SummaryExtractor:   0%|          | 0/6 [00:00<?, ?it/s]
Property 'summary' already exists in node '6b5c2e'. Skipping!
Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]
Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt fix_output_format failed to parse output: The output 