# Synthetic Test Data generation¶
## Why synthetic test data?¶
Evaluating RAG (Retrieval-Augmented Generation) augmented pipelines is crucial for assessing their performance. However, manually creating hundreds of QA (Question-Context-Answer) samples from documents can be time-consuming and labor-intensive. Additionally, human-generated questions may struggle to reach the level of complexity required for a thorough evaluation, ultimately impacting the quality of the assessment. By using synthetic data generation developer time in data aggregation process can be reduced by 90%.

## How does Ragas differ in test data generation?¶
Ragas takes a novel approach to evaluation data generation. An ideal evaluation dataset should encompass various types of questions encountered in production, including questions of varying difficulty levels. LLMs by default are not good at creating diverse samples as it tends to follow common paths. Inspired by works like Evol-Instruct, Ragas achieves this by employing an evolutionary generation paradigm, where questions with different characteristics such as reasoning, conditioning, multi-context, and more are systematically crafted from the provided set of documents. This approach ensures comprehensive coverage of the performance of various components within your pipeline, resulting in a more robust evaluation process.

evol-generate
## In-Depth Evolution¶
Large Language Models (LLMs) possess the capability to transform simple questions into more complex ones effectively. To generate medium to hard samples from the provided documents, we employ the following methods:

* Reasoning: Rewrite the question in a way that enhances the need for reasoning to answer it effectively.

* Conditioning: Modify the question to introduce a conditional element, which adds complexity to the question.

* Multi-Context: Rephrase the question in a manner that necessitates information from multiple related sections or chunks to formulate an answer.

Moreover, our paradigm extends its capabilities to create conversational questions from the given documents:

* Conversational: A portion of the questions, following the evolution process, can be transformed into conversational samples. These questions simulate a chat-based question-and-follow-up interaction, mimicking a chat-Q&A pipeline.

In [1]:
import os
from dotenv import load_dotenv

# Load the .env file
load_dotenv()

True

## Load source documents

In [2]:
import json
import langchain
from langchain.docstore.document import Document as LangchainDocument

def load_json_files_to_documents(directory):
  """Loads JSON files from a given directory into a list of Langchain Documents.

  Args:
    directory: The path to the directory containing JSON files.

  Returns:
    A list of Langchain Documents.
  """

  documents = []
  for filename in os.listdir(directory):
      if filename.endswith('.json'):
          file_path = os.path.join(directory, filename)
          with open(file_path, 'r') as f:
              data = json.load(f)
              # Extract relevant fields from the JSON data
              content = data['text']  # Replace 'content' with the actual field name
              metadata = {'source': filename}  # Add additional metadata if needed
              document = LangchainDocument(page_content=content, metadata=metadata)
              documents.append(document)
  return documents

In [3]:
# Example usage:
from from_root import from_root
department = "data\HR" # -> Choose the department you would like to generate synthetic data
documents = load_json_files_to_documents(os.path.join(from_root(), department))

## Creating a generative agent

In [4]:
# Choose one of the LLMs as the generative LLM. Ideally OpenAI's LLMs

# For LLMs from OpenAI
from langchain_openai import ChatOpenAI
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")
model_name = 'gpt-3.5-turbo-16k' # -> Choose your desired model
generator_llm = ChatOpenAI(model=model_name)

# For LLMs from Groq
#from langchain_groq import ChatGroq
#os.environ["GROD_CLOUD_API_KEY"] = os.getenv('GROD_CLOUD_API_KEY')
#model_name = 'llama3-8b-8192' # -> Choose your desired model
#generator_groq_llm = ChatGroq(
#                    groq_api_key=os.environ["GROD_CLOUD_API_KEY"],
#                    model_name=model_name
#                    )

## Create an embedding model

In [5]:
#OpenAI Embedding
from langchain_openai import OpenAIEmbeddings
os.getenv("OPENAI_API_KEY")
embeddings = OpenAIEmbeddings(disallowed_special=())

# For other embedding models
#from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
#embeddings_fastembed = FastEmbedEmbeddings()

## Create a critic agent

In [17]:
# Create a critic LLM. This should be different than the generative llm
#from langchain_groq import ChatGroq
#os.environ["GROD_CLOUD_API_KEY"] = os.getenv('GROD_CLOUD_API_KEY')
#model_name = 'llama-3.1-8b-instant'
#critic_llm = ChatGroq(
#                    groq_api_key=os.environ["GROD_CLOUD_API_KEY"],
#                    model_name=model_name
#                    )

import time
from ratelimit import limits, sleep_and_retry
ONE_MINUTE = 60
# Set the rate limit to 25 calls per minute
@sleep_and_retry
@limits(calls=25, period=ONE_MINUTE)
def call_groq_llm(prompt):
    # Your Groq Llama 3.1 API call logic here
    from langchain_groq import ChatGroq
    os.environ["GROD_CLOUD_API_KEY"] = os.getenv('GROD_CLOUD_API_KEY')
    model_name = 'llama-3.1-8b-instant'
    critic_llm = ChatGroq(
                    groq_api_key=os.environ["GROD_CLOUD_API_KEY"],
                    model_name=model_name
                    )
    response = critic_llm(prompt)
    return response

#### Test set generation

In [18]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
#from langchain_openai import ChatOpenAI, OpenAIEmbeddings

generator_test = TestsetGenerator.from_langchain(
    generator_llm,
    call_groq_llm,
    embeddings
)

# Change resulting question type distribution
distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}

In [None]:
# use generator.generate_with_llamaindex_docs if you use llama-index as document loader
testset = generator_test.generate_with_langchain_docs(documents[:6], 10, distributions, with_debugging_logs=True) 

embedding nodes:  44%|████▍     | 8/18 [00:00<00:00, 24.97it/s][ragas.testset.extractor.DEBUG] topics: {'keyphrases': ['Training and Development', 'Leaves and Time Off', 'Medical Plans and Benefits', 'Complaints and Grievances', 'General Policies']}
[ragas.testset.extractor.DEBUG] topics: {'keyphrases': ['Training and Development', 'Leaves and Time Off', 'Medical Plans and Benefits', 'Complaints and Grievances', 'General Policies']}
[ragas.testset.extractor.DEBUG] topics: {'keyphrases': ['Monitoring and Auditing', 'Internal Audits', 'External Audits', 'Reporting and Investigation', 'Disciplinary Actions', 'Legal Consequences', 'Employee Responsibility']}
[ragas.testset.extractor.DEBUG] topics: {'keyphrases': ['Monitoring and Auditing', 'Internal Audits', 'External Audits', 'Reporting and Investigation', 'Disciplinary Actions', 'Legal Consequences', 'Employee Responsibility']}
[ragas.testset.extractor.DEBUG] topics: {'keyphrases': ['Retriever.retrieve', 'Generate Responses with LangChai

RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for model `llama-3.1-8b-instant` in organization `org_01j44t1e1nfqws13bdbxhmz8my` on requests per minute (RPM): Limit 30, Used 30, Requested 1. Please try again in 1.953s. Visit https://console.groq.com/docs/rate-limits for more information.', 'type': 'requests', 'code': 'rate_limit_exceeded'}}

[ragas.testset.filters.DEBUG] context scoring: {'clarity': 3, 'depth': 3, 'structure': 3, 'relevance': 3, 'score': 3.0}
[ragas.testset.filters.DEBUG] context scoring: {'clarity': 3, 'depth': 3, 'structure': 3, 'relevance': 3, 'score': 3.0}
[ragas.testset.filters.DEBUG] context scoring: {'clarity': 3, 'depth': 3, 'structure': 3, 'relevance': 3, 'score': 3.0}
[ragas.testset.evolutions.DEBUG] keyphrases in merged node: ['Training and Development', 'Leaves and Time Off', 'Medical Plans and Benefits', 'Complaints and Grievances', 'General Policies']
[ragas.testset.evolutions.DEBUG] keyphrases in merged node: ['Training and Development', 'Leaves and Time Off', 'Medical Plans and Benefits', 'Complaints and Grievances', 'General Policies']
[ragas.testset.evolutions.DEBUG] keyphrases in merged node: ['Training and Development', 'Leaves and Time Off', 'Medical Plans and Benefits', 'Complaints and Grievances', 'General Policies']
[ragas.testset.evolutions.INFO] seed question generated: What is th

In [16]:
test_df = testset.to_pandas()
test_df.head()

NameError: name 'testset' is not defined