# Synthetic Test Data Generation
## Why synthetic test data?
Evaluating RAG (Retrieval-Augmented Generation) pipelines is crucial for assessing their performance. However, manually creating hundreds of QCA (Question-Context-Answer) samples from documents is time-consuming and labor-intensive. Human-written questions may also fall short of the complexity required for a thorough evaluation, which lowers the quality of the assessment. Synthetic data generation can reduce the developer time spent on data aggregation by 90%.

## How does Ragas differ in test data generation?
Ragas takes a novel approach to evaluation data generation. An ideal evaluation dataset should cover the various types of questions encountered in production, including questions of varying difficulty. By default, LLMs are not good at creating diverse samples because they tend to follow common paths. Inspired by works like Evol-Instruct, Ragas addresses this with an evolutionary generation paradigm, in which questions with different characteristics such as reasoning, conditioning, multi-context, and more are systematically crafted from the provided set of documents. This approach gives comprehensive coverage of the various components within your pipeline, resulting in a more robust evaluation process.

## In-Depth Evolution
Large Language Models (LLMs) possess the capability to transform simple questions into more complex ones effectively. To generate medium to hard samples from the provided documents, we employ the following methods:

* Reasoning: Rewrite the question in a way that enhances the need for reasoning to answer it effectively.

* Conditioning: Modify the question to introduce a conditional element, which adds complexity to the question.

* Multi-Context: Rephrase the question in a manner that necessitates information from multiple related sections or chunks to formulate an answer.

Moreover, our paradigm extends its capabilities to create conversational questions from the given documents:

* Conversational: A portion of the questions, following the evolution process, can be transformed into conversational samples. These questions simulate a chat-based question-and-follow-up interaction, mimicking a chat-Q&A pipeline.
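
To make the evolution idea more concrete, here is a minimal sketch of how a seed question might be wrapped in an evolution instruction before it is sent to an LLM. The instruction wording, the `EVOLUTION_INSTRUCTIONS` dictionary, and the `build_evolution_prompt` helper are hypothetical illustrations that paraphrase the list above; they are not the prompts or the API that Ragas uses internally. The sample question and contexts are made up.

```python
# Hypothetical instruction templates paraphrasing the evolution types above.
# These are NOT the prompts that Ragas ships with.
EVOLUTION_INSTRUCTIONS = {
    "reasoning": "Rewrite the question so that answering it requires multi-step reasoning.",
    "conditioning": "Rewrite the question so that it includes a conditional element.",
    "multi_context": "Rewrite the question so that answering it requires information from all of the given contexts.",
}


def build_evolution_prompt(seed_question: str, contexts: list[str], evolution: str) -> str:
    """Assemble a prompt that asks an LLM to evolve a seed question."""
    if evolution not in EVOLUTION_INSTRUCTIONS:
        raise ValueError(f"unknown evolution type: {evolution!r}")
    # Number the contexts so the model can tell the chunks apart.
    context_block = "\n\n".join(
        f"Context {i + 1}:\n{text}" for i, text in enumerate(contexts)
    )
    return (
        f"{EVOLUTION_INSTRUCTIONS[evolution]}\n\n"
        f"{context_block}\n\n"
        f"Question: {seed_question}\n"
        "Rewritten question:"
    )


# Made-up seed question and contexts, for illustration only.
prompt = build_evolution_prompt(
    "What is the leave policy?",
    ["Employees accrue leave every month.", "Unused leave expires at the end of the year."],
    "multi_context",
)
print(prompt)
```

The resulting string would then be passed to whichever generator LLM you configure; a critic model can afterwards judge whether the rewritten question is still answerable from the contexts.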

In [4]:
import os
from dotenv import load_dotenv

# Load the .env file
load_dotenv()

True

## Load source documents

In [5]:
import json
import os

from langchain.docstore.document import Document as LangchainDocument

def load_json_files_to_documents(directory):
  """Loads JSON files from a given directory into a list of Langchain Documents.

  Args:
    directory: The path to the directory containing JSON files.

  Returns:
    A list of Langchain Documents.
  """

  documents = []
  for filename in os.listdir(directory):
      if filename.endswith('.json'):
          file_path = os.path.join(directory, filename)
          with open(file_path, 'r') as f:
              data = json.load(f)
              # Extract relevant fields from the JSON data
              content = data['text']  # Replace 'text' with the actual field name in your JSON
              metadata = {'source': filename}  # Add additional metadata if needed
              document = LangchainDocument(page_content=content, metadata=metadata)
              documents.append(document)
  return documents
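
The loader above assumes that every JSON file has a top-level `text` field and raises a `KeyError` otherwise. As a rough sketch, the stdlib-only helper below reports which files would fail before any documents are built. The name `find_files_missing_field` and the throwaway sample files are made up for illustration.

```python
import json
import os
import tempfile


def find_files_missing_field(directory, field="text"):
    """Return the sorted names of .json files in `directory` that lack `field` at the top level."""
    missing = []
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(".json"):
            continue
        with open(os.path.join(directory, filename), "r", encoding="utf-8") as f:
            data = json.load(f)
        # A file counts as missing the field if it is not a JSON object or lacks the key.
        if not isinstance(data, dict) or field not in data:
            missing.append(filename)
    return missing


# Demonstration with temporary files (names and contents are invented).
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "policy.json": {"text": "Leave policy ..."},
        "broken.json": {"body": "no text field here"},
    }
    for name, payload in samples.items():
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            json.dump(payload, f)
    missing_files = find_files_missing_field(tmp)

print(missing_files)
```

Running a check like this first makes it easier to tell a malformed input file apart from a wrong directory path.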

In [6]:
# Example usage:
from from_root import from_root
documents = load_json_files_to_documents(os.path.join(from_root(), "data", "HR"))

FileNotFoundError: [WinError 3] The system cannot find the path specified: 'C:\\Users\\dqmin\\data\\HR'

## Creating a generative agent

In [11]:
# Choose one of the LLMs as the generative agent. Ideally OpenAI's LLMs

# For LLMs from OpenAI
from langchain_openai import ChatOpenAI
model_name = 'gpt-3.5-turbo' # -> Choose your desired model
generator_llm = ChatOpenAI(model=model_name, temperature=0.1)


# For LLMs from Groq
#from langchain_groq import ChatGroq
#os.environ["GROD_CLOUD_API_KEY"] = os.getenv('GROD_CLOUD_API_KEY')
#model_name = 'llama3-8b-8192' # -> Choose your desired model
#generator_groq_llm = ChatGroq(
#                    groq_api_key=os.environ["GROD_CLOUD_API_KEY"],
#                    model_name=model_name
#                    )

## Create an embedding model

In [16]:


from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
embeddings_fastembed = FastEmbedEmbeddings()

Fetching 5 files: 100%|██████████| 5/5 [00:00<00:00, 4935.64it/s]


## Create a critic agent

In [13]:
from langchain_groq import ChatGroq
os.environ["GROD_CLOUD_API_KEY"] = os.getenv('GROD_CLOUD_API_KEY')
critic_groq_llm = ChatGroq(
                    groq_api_key=os.environ["GROD_CLOUD_API_KEY"],
                    model_name='gemma2-9b-it'
                    )

## Test set generation

In [17]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
#from langchain_openai import ChatOpenAI, OpenAIEmbeddings

generator_test = TestsetGenerator.from_langchain(
    generator_llm,  # generator defined above; use generator_groq_llm if you enabled the Groq option
    critic_groq_llm,
    embeddings_fastembed
)

# Change resulting question type distribution
distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}

In [18]:
# use generator.generate_with_llamaindex_docs if you use llama-index as document loader
testset = generator_test.generate_with_langchain_docs(documents[:6], test_size=10, distributions=distributions, with_debugging_logs=True)
testset.to_pandas()

Filename and doc_id are the same for all nodes.               
Generating:   0%|          | 0/5 [00:00<?, ?it/s][ragas.testset.filters.DEBUG] context scoring: {'clarity': 3, 'depth': 2, 'structure': 3, 'relevance': 3, 'score': 2.75}
[ragas.testset.evolutions.DEBUG] keyphrases in merged node: ['Compliance with labor laws', 'Internal and external audits', 'Disciplinary actions and consequences', 'Employee responsibility and reporting', 'Workplace ethics and safety']
[ragas.testset.filters.DEBUG] context scoring: {'clarity': 3, 'depth': 2, 'structure': 3, 'relevance': 3, 'score': 2.75}
[ragas.testset.evolutions.DEBUG] keyphrases in merged node: ['Equal Employment Opportunity', 'Workplace Safety and Health', 'Employee Rights and Benefits', 'Labor Laws and Regulations', 'Compliance and Training']
[ragas.testset.filters.DEBUG] context scoring: {'clarity': 3, 'depth': 2, 'structure': 3, 'relevance': 3, 'score': 2.75}
[ragas.testset.evolutions.DEBUG] keyphrases in merged node: ['Equal Employme

| | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | Here is a question that can be fully answered ... | [Introduction\nTech Innovators Inc. is committ... | Tech Innovators Inc. has procedures in place t... | simple | [{'source': 'Tech Innovators Inc. Compliance w... | True |
| 1 | Here is a question that can be fully answered ... | [5.2 Monitoring and Auditing\nInternal Audits\... | The potential consequences of non-compliance w... | simple | [{'source': 'Tech Innovators Inc. Compliance w... | True |


## Creating a generative agent

In [3]:
# Cohere llm
from langchain_cohere import ChatCohere
os.environ["COHERE_API_KEY"] = os.getenv('COHERE_API_KEY')
generator_cohere_llm = ChatCohere(model='command-r')



## Create a critic agent

In [72]:
# Cohere llm
from langchain_cohere import ChatCohere
os.environ["COHERE_API_KEY"] = os.getenv('COHERE_API_KEY')
critic_cohere_llm = ChatCohere(model='command-light')

## Create an embeddings model

In [4]:
# The langchain_community import is deprecated; use the langchain_cohere package instead
from langchain_cohere import CohereEmbeddings
cohere_embeddings = CohereEmbeddings(cohere_api_key=os.getenv('COHERE_API_KEY'),
                                     model="embed-english-light-v3.0")

## Customising test data distribution

In [75]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
#from langchain_openai import ChatOpenAI, OpenAIEmbeddings

generator = TestsetGenerator.from_langchain(
    generator_cohere_llm,
    critic_cohere_llm,
    cohere_embeddings
)

# Change resulting question type distribution
distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}
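
As a rough illustration of what these weights mean, the sketch below turns a distribution into approximate per-type question counts for a given test size. It uses plain strings as stand-ins for the Ragas evolution objects, and `expected_counts` is a hypothetical helper rather than part of Ragas; the generator's actual allocation may differ.

```python
import math


def expected_counts(distribution, test_size):
    """Split `test_size` across question types in proportion to their weights.

    Uses the largest-remainder method so the counts always sum to `test_size`.
    """
    total = sum(distribution.values())
    if not math.isclose(total, 1.0):
        raise ValueError(f"weights must sum to 1.0, got {total}")
    exact = {name: weight * test_size for name, weight in distribution.items()}
    counts = {name: math.floor(value) for name, value in exact.items()}
    leftover = test_size - sum(counts.values())
    # Hand the remaining questions to the types with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# String keys stand in for the simple, multi_context and reasoning evolutions.
weights = {"simple": 0.5, "multi_context": 0.4, "reasoning": 0.1}
counts = expected_counts(weights, 10)
print(counts)
```

With a test size of 10, this split corresponds to 5 simple, 4 multi-context, and 1 reasoning question. Checking that the weights sum to 1.0 before starting a long generation run is a cheap way to catch typos in the distribution.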

In [None]:
# use generator.generate_with_llamaindex_docs if you use llama-index as document loader
testset = generator.generate_with_langchain_docs(documents, test_size=100, distributions=distributions)
testset.to_pandas()

In [77]:
test_df = testset.to_pandas()
test_df.head()

| | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | Here is a question that can be fully answered ... | [ learn about SEO, social media marketing, and... | Tech Innovators Inc. is committed to the conti... | simple | [{'source': 'Tech Innovators Inc. Training and... | True |
| 1 | Here is a question that can be fully answered ... | [ Inc. upholds the highest ethical standards i... | Tech Innovators Inc. creates a diverse and inc... | simple | [{'source': 'Tech Innovators Inc. Recruitment ... | True |
| 2 | Here is a rewritten version of the question th... | [Introduction\nTech Innovators Inc. is committ... | Tech Innovators Inc. fosters a positive work c... | multi_context | [{'source': 'Tech Innovators Inc. Compliance w... | True |
