# Synthetic Q&A Dataset Generator for RAG Evaluation

This script automates the generation of synthetic question-answer pairs from PDF documents for evaluating Retrieval-Augmented Generation (RAG) systems. It uses LangChain with Meta's Llama 3 70B Instruct model on Amazon Bedrock to:
- Extract meaningful chunks from PDF documents
- Generate relevant questions based on the content
- Create corresponding answers and identify source contexts
- Output the data in two formats: prompt-only and prompt-with-ground-truth
- Perform quality checks to ensure valid content

### Prerequisites
- Amazon Bedrock access with required model access
- PDF documents in a specified directory
- Required packages: langchain, langchain-community, pypdf, boto3, pandas, tqdm


### 0 - Setup
Before running the rest of this notebook, run the cells below to install the required libraries and connect to Amazon Bedrock.

You can ignore any pip dependency errors that appear while the libraries are installing.

In [None]:
%pip install --upgrade pip --quiet
%pip install -r ../requirements.txt --no-deps --quiet
%pip install -r ../requirements.txt --upgrade --quiet

In [1]:
import warnings
warnings.filterwarnings('ignore')

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFDirectoryLoader
import json
import boto3
from langchain_community.chat_models import BedrockChat
from langchain.prompts import PromptTemplate
import pandas as pd
from tqdm import tqdm
import os
import shutil

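The next cell is part of the setup. It adds the parent directory to the Python module search path (`sys.path`) and keeps that directory in `current_path`, which is used later to locate the PDF files.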

In [None]:
import sys
import logging
from pathlib import Path

current_path = Path().resolve()
current_path = current_path.parent

if str(current_path) not in sys.path:
    sys.path.append(str(current_path))

# Print sys.path to verify
print(sys.path)

#### Configure Amazon Bedrock

Sets up the connection to Amazon Bedrock and configures the Llama 3 70B Instruct model with a maximum generation length and a moderate temperature.

In [None]:
boto3_bedrock = boto3.client('bedrock-runtime',region_name='us-east-1')
llama_3_70B = "meta.llama3-70b-instruct-v1:0"
inference_modifier_llama = {
    "max_gen_len": 4096,
    "temperature": 0.5,
}

llm = BedrockChat(
    model_id = llama_3_70B,
    client = boto3_bedrock, 
    model_kwargs = inference_modifier_llama 
)

### 1 - S3 Bucket Configuration & Load document(s)

Before proceeding, set the name of an S3 bucket that has `CORS` enabled and that you have permission to use. The synthetic dataset will be uploaded to this bucket, where the evaluation job will read it.

See the `CORS` requirements on our [documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-security-cors.html) page.

In [1]:
bucket_name = "<YOUR_EVAL_BUCKET_NAME>"

#### Load and Process PDF Documents

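This section loads PDF documents and splits them into manageable chunks for processing. The RecursiveCharacterTextSplitter tries the listed separators in order (paragraph breaks, line breaks, sentence ends, spaces) so that chunks end at natural boundaries where possible, and it carries up to 100 characters of overlap between consecutive chunks to maintain coherence.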

In [26]:
# Load only the sample 10-K filing from the synthetic_dataset directory
loader = PyPDFDirectoryLoader(
    f"{current_path}/synthetic_dataset/",
    glob="**/octank_financial_10K.pdf",
)
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 2500,  
    chunk_overlap  = 100,
    separators=["\n\n", "\n", ".", " ", ""],
)

docs = text_splitter.split_documents(documents)

In [None]:
len(docs)
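
To see what `chunk_size` and `chunk_overlap` mean in practice, here is a minimal sketch that uses plain fixed-width windows. It is a simplification: `RecursiveCharacterTextSplitter` additionally tries to break at the listed separators, so its chunks will not match these exactly. The function name `sliding_chunks` is hypothetical and not part of LangChain.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-width windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is the role the 100-character overlap plays for the 2500-character chunks above.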

### 2 - Define Prompt Templates

These templates guide the LLM in generating questions, answers, and identifying relevant context. Each template is carefully structured to ensure:
- Questions are meaningful and answerable
- Answers are precise and based on context
- Source contexts are accurately extracted

In [7]:
initial_question_prompt_template = PromptTemplate(
    input_variables=["context"],
    template="""
    [INST]
    <Instructions>
    Here is some context:
    <context>
    {context}
    </context>

    Your task is to generate 1 question that can be answered using the provided context, following these rules:

    <rules>
    1. The question should make sense to humans even when read without the given context.
    2. The question should be fully answered from the given context.
    3. The question should be framed from a part of context that contains important information. It can also be from tables, code, etc.
    4. The answer to the question should not contain any links.
    5. The question should be of moderate difficulty.
    6. The question must be reasonable and must be understood and responded by humans.
    7. Do not use phrases like 'provided context', etc. in the question.
    8. Avoid framing questions using the word "and" that can be decomposed into more than one question.
    9. The question should not contain more than 10 words, make use of abbreviations wherever possible.
    </rules>

    Output only the generated question with a "?" at the end, no other text or characters.
    </Instructions>
    [/INST]
    """)

answer_prompt_template = PromptTemplate(
    input_variables=["context", "question"],
    template="""
    [INST]
    <Instructions>
    <Task>
    <role>You are an experienced QA Engineer for building large language model applications.</role>
    <task>It is your task to generate an answer to the following question <question>{question}</question> only based on the <context>{context}</context></task>
    The output should be only the answer generated from the context.

    <rules>
    1. Only use the given context as a source for generating the answer.
    2. Be as precise as possible with answering the question.
    3. Be concise in answering the question and only answer the question at hand rather than adding extra information.
    </rules>

    Only output the generated answer as a sentence. No extra characters.
    </Task>
    </Instructions>
    [/INST]
    Assistant:
    """)

source_prompt_template = PromptTemplate(
    input_variables=["context", "question"],
    template="""Human:
    [INST]
    <Instructions>
    Here is the context:
    <context>
    {context}
    </context>

    Your task is to extract the relevant sentences from the given context that can potentially help answer the following question. You are not allowed to make any changes to the sentences from the context.

    <question>
    {question}
    </question>

    Output only the relevant sentences you found, one sentence per line, without any extra characters or explanations.
    </Instructions>
    [/INST]
    Assistant:
    """)

### 3 - Define Helper Functions

These core functions handle the interaction with the LLM to generate questions, answers, and extract relevant source contexts.

In [8]:
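def generate_question(doc, llm):
    """Ask the LLM for one question that the chunk `doc` can answer."""
    # Format only the chunk text; formatting the Document object itself would also embed its metadata
    initial_question_prompt = initial_question_prompt_template.format(context=doc.page_content)
    initial_question = llm.invoke(initial_question_prompt)
    return initial_question

def generate_answer(question: str, doc, llm):
    """Ask the LLM to answer `question` using only the chunk `doc`."""
    answer_prompt = answer_prompt_template.format(question=question, context=doc.page_content)
    answer = llm.invoke(answer_prompt)
    return answer

def generate_source(question: str, doc, llm):
    """Ask the LLM to quote the sentences in `doc` that help answer `question`."""
    source_prompt = source_prompt_template.format(question=question, context=doc.page_content)
    source = llm.invoke(source_prompt)
    return source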
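
The model does not always follow the rules in the question prompt, so it can help to check the generated text before it goes into the dataset. The sketch below tests a few of the rules that are easy to verify mechanically. `check_question` is a hypothetical helper, not part of the original notebook, and counting words by splitting on whitespace is an assumption.

```python
def check_question(question: str, max_words: int = 10) -> list[str]:
    """Return a list of rule violations; an empty list means the question passes."""
    problems = []
    text = question.strip()
    if not text.endswith("?"):
        problems.append("does not end with '?'")
    if text.count("?") > 1:
        problems.append("contains more than one '?'")
    if len(text.split()) > max_words:
        problems.append(f"has more than {max_words} words")
    if "provided context" in text.lower():
        problems.append("mentions 'provided context'")
    return problems


# Illustrative strings only, not real model output
print(check_question("What was the total revenue in 2021?"))
print(check_question("Based on the provided context, what were the revenue and net income"))
```

A question that returns a non-empty list could be regenerated or dropped, depending on how strict the dataset needs to be.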

#### Define Dataset Generation Functions

These functions orchestrate the QA pair generation process, managing the creation and storage of questions, answers, and contexts in a structured format.

In [9]:
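def generate_qa_dataset_doc(doc, llm, dataset, doc_number):
    """Generate a question, answer, and source extract for one chunk and store them in row `doc_number`."""
    question = generate_question(doc, llm)
    # llm.invoke returns a message object; pass its text (not the object) to the follow-up prompts
    question_text = question.content.strip()
    dataset.at[doc_number, "question"] = question_text

    answer = generate_answer(question_text, doc, llm)
    dataset.at[doc_number, "reference_answer"] = answer.content

    source_sentence = generate_source(question_text, doc, llm)
    dataset.at[doc_number, "source_sentence"] = source_sentence.content

    # Keep the raw chunk and its file path so each row can be traced back to the PDF
    dataset.at[doc_number, "source_raw"] = doc.page_content
    dataset.at[doc_number, "source_document"] = doc.metadata["source"]

    return dataset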

def generate_dataset(documents, llm, dataset):
    """Run QA generation over every chunk, showing a progress bar."""
    for doc_number, doc in enumerate(tqdm(documents)):
        dataset = generate_qa_dataset_doc(doc=doc, llm=llm, dataset=dataset, doc_number=doc_number)
    return dataset
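
Because every chunk triggers three Bedrock calls, it can be useful to dry-run the orchestration without the model. The sketch below is one way to do that, under the assumption that the functions above only rely on `llm.invoke(...)` returning an object with a `.content` attribute, and on documents exposing `page_content` and `metadata`. `FakeLLM` and `fake_doc` are hypothetical stand-ins, not LangChain classes.

```python
from types import SimpleNamespace


class FakeLLM:
    """Stand-in for the Bedrock chat model: records prompts and returns canned replies."""

    def __init__(self, replies):
        self.replies = list(replies)
        self.prompts = []

    def invoke(self, prompt):
        self.prompts.append(prompt)
        # Cycle through the canned replies in order
        reply = self.replies[(len(self.prompts) - 1) % len(self.replies)]
        return SimpleNamespace(content=reply)


# Stand-in for a LangChain Document with the two attributes the notebook reads
fake_doc = SimpleNamespace(
    page_content="Example chunk text.",
    metadata={"source": "example.pdf"},
)

fake_llm = FakeLLM(["Example question?", "Example answer.", "Example chunk text."])
first = fake_llm.invoke("question prompt")
second = fake_llm.invoke("answer prompt")
print(first.content, "|", second.content, "|", len(fake_llm.prompts))
```

Passing a fresh `FakeLLM` and a list such as `[fake_doc]` to `generate_dataset` in place of `llm` and the real chunks should exercise the DataFrame bookkeeping without calling Bedrock.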

#### Define Schema Conversion Functions

These functions handle data validation and conversion into two specific JSON schemas:
- prompt_only: Contains just the question for evaluation
- prompt_with_gt: Contains question, reference answer, and contexts

The functions include quality checks so that records with empty or invalid content are skipped rather than written to the final output.

In [10]:
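def is_valid_content(text):
    # Non-string values (e.g. None or NaN from pandas) and whitespace-only strings are invalid
    return isinstance(text, str) and bool(text.strip())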

def convert_schema(example, schema_type="prompt_only"):
    if not is_valid_content(example["query"]):
        return None
    
    query = example["query"].strip()

    if schema_type == "prompt_only":
        new_schema = {
            "conversationTurns": [
                {
                    "prompt": {
                        "content": [{"text": query}]
                    }
                }
            ]
        }
    elif schema_type == "prompt_with_gt":
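        reference_answer = example["reference_answer"]
        # Validate before stripping so a missing answer is skipped instead of raising
        if not (is_valid_content(reference_answer) and example["reference_contexts"]):
            return None
        reference_answer = reference_answer.strip()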
            
        valid_contexts = [
            context.strip() for context in example["reference_contexts"] 
            if is_valid_content(context)
        ]
        
        if not valid_contexts:
            return None

        new_schema = {
            "conversationTurns": [
                {
                    "prompt": {
                        "content": [{"text": query}]
                    },
                    "referenceResponses": [
                        {"content": [{"text": reference_answer}]}
                    ],
                    "referenceContexts": [
                        {"content": [{"text": context}]} for context in valid_contexts
                    ]
                }
            ]
        }
    else:
        raise ValueError(f"Invalid schema_type: {schema_type}. Must be either 'prompt_only' or 'prompt_with_gt'")
    return new_schema

def save_to_jsonl(df, output_file_prefix, schema_type):
    valid_records = 0
    skipped_records = 0
    
    with open(f'{output_file_prefix}_{schema_type}.jsonl', 'w') as file:
        for _, row in df.iterrows():
            example = {
                "query": row["query"],
                "query_by": {"model_name": row["model_name"], "type": row["type"]},
                # Keep the chunk as a single context; splitting on ", " would cut it at every comma
                "reference_contexts": [row["reference_contexts"]],
                "reference_answer": row["reference_answer"],
                "reference_answer_by": {"model_name": row["model_name"], "type": row["type"]}
            }
            
            schema = convert_schema(example, schema_type)
            if schema:
                json.dump(schema, file)
                file.write('\n')
                valid_records += 1
            else:
                skipped_records += 1
    
    print(f"Schema type: {schema_type}")
    print(f"Valid records written: {valid_records}")
    print(f"Skipped records: {skipped_records}")
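
After the files are written, a quick structural check can catch malformed lines before the evaluation job does. The sketch below parses one JSONL line and verifies the keys that `convert_schema` emits. `check_record` is a hypothetical helper, and it only mirrors the structure built above rather than the full schema that Bedrock accepts.

```python
import json


def check_record(line: str, schema_type: str = "prompt_only") -> bool:
    """Return True if a JSONL line has the structure produced by convert_schema."""
    try:
        record = json.loads(line)
        turn = record["conversationTurns"][0]
        if not turn["prompt"]["content"][0]["text"].strip():
            return False
        if schema_type == "prompt_with_gt":
            if not turn["referenceResponses"][0]["content"][0]["text"].strip():
                return False
            contexts = turn["referenceContexts"]
            # Every context entry must carry non-empty text
            return bool(contexts) and all(c["content"][0]["text"].strip() for c in contexts)
        return True
    except (json.JSONDecodeError, KeyError, IndexError, TypeError, AttributeError):
        return False


# Illustrative record with placeholder text
sample = json.dumps({
    "conversationTurns": [{
        "prompt": {"content": [{"text": "Example question?"}]},
        "referenceResponses": [{"content": [{"text": "Example answer."}]}],
        "referenceContexts": [{"content": [{"text": "Example context."}]}],
    }]
})
print(check_record(sample, "prompt_only"), check_record(sample, "prompt_with_gt"))
```

To check a whole file, the same function can be applied to each line read from the `.jsonl` output.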

### 4 - Generate Dataset

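Runs the generation pipeline on the first 20 chunks as a quick test; pass `docs` instead of `docs_subset` to process every chunk. Stray `[/INST]` tokens are then removed from the generated text, and the columns are renamed to the names that `save_to_jsonl` expects.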

In [None]:
docs_subset = docs[:20]
dataset = pd.DataFrame(columns=["question", "reference_answer", "source_sentence","source_raw","source_document"])
dataset_df = generate_dataset(docs_subset, llm, dataset)
dataset_df['reference_answer'] = dataset_df['reference_answer'].str.replace(r'\[\/INST\]', '', regex=True)
dataset_df['source_raw'] = dataset_df['source_raw'].str.replace(r'\[\/INST\]', '', regex=True)

filtered_df = dataset_df.drop(["source_sentence", "source_document"], axis=1)
# Rename to the column names that save_to_jsonl expects
filtered_df = filtered_df.rename(columns={
    'question': 'query',
    'source_raw': 'reference_contexts'
})

filtered_df["model_name"] = "llama_3_70B"
filtered_df["type"] = "ai"
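
The two `str.replace` calls above remove only the closing `[/INST]` tag. If the model also echoes the opening `[INST]` tag or an `Assistant:` prefix (an assumption; whether it does depends on the model and prompt), a slightly broader pattern could be used. `strip_prompt_tokens` is a hypothetical helper, not part of the original notebook.

```python
import re

# Matches [INST] or [/INST] anywhere, and an "Assistant:" prefix at the start of a line
_PROMPT_TOKEN_RE = re.compile(r"\[/?INST\]|^\s*Assistant:\s*", flags=re.MULTILINE)


def strip_prompt_tokens(text: str) -> str:
    """Remove echoed prompt-format tokens and surrounding whitespace."""
    return _PROMPT_TOKEN_RE.sub("", text).strip()


print(strip_prompt_tokens("Assistant: Revenue grew. [/INST]"))
```

It could be applied column-wise with `dataset_df['reference_answer'].map(strip_prompt_tokens)`.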

#### Save Dataset Files

Creates the final JSONL files in both formats and organizes them in an evaluation_data directory.

In [None]:
save_to_jsonl(filtered_df, 'rag_dataset', 'prompt_only')
save_to_jsonl(filtered_df, 'rag_dataset', 'prompt_with_gt')

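# Create the output directory if needed (no error when it already exists)
os.makedirs("evaluation_data", exist_ok=True)

for file in ['rag_dataset_prompt_only.jsonl', 'rag_dataset_prompt_with_gt.jsonl']:
    # Give the full destination path; moving into the directory raises an error
    # if a file with the same name is already there from a previous run
    shutil.move(file, os.path.join('evaluation_data', file))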

### 5 - Upload to S3 (Optional)

Optionally upload the generated datasets to the S3 bucket configured in step 1 so that the evaluation job can read them.

In [None]:
s3_client = boto3.client('s3', region_name='us-east-1')

for file in ['rag_dataset_prompt_only.jsonl', 'rag_dataset_prompt_with_gt.jsonl']:
    s3_client.upload_file(f'evaluation_data/{file}', bucket_name, f'evaluation_data/{file}')
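
The upload fails with a fairly opaque error if `bucket_name` still holds the `<YOUR_EVAL_BUCKET_NAME>` placeholder. A small guard like the one sketched below could run first; it also builds the `s3://` URIs of the uploaded files for later reference. `dataset_s3_uris` is a hypothetical helper, and the `evaluation_data` prefix simply mirrors the key used in the upload cell above.

```python
def dataset_s3_uris(bucket_name: str, files: list[str], prefix: str = "evaluation_data") -> list[str]:
    """Build s3:// URIs for the dataset files, rejecting an unset placeholder bucket."""
    if not bucket_name or bucket_name.startswith("<"):
        raise ValueError("Set bucket_name to a real S3 bucket before uploading")
    return [f"s3://{bucket_name}/{prefix}/{name}" for name in files]


# "example-bucket" is a placeholder name for illustration
uris = dataset_s3_uris(
    "example-bucket",
    ["rag_dataset_prompt_only.jsonl", "rag_dataset_prompt_with_gt.jsonl"],
)
print(uris)
```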

The script generates two types of evaluation datasets in JSONL format, stored in the 'evaluation_data' directory:
1. rag_dataset_prompt_only.jsonl: Contains only questions for basic evaluation
2. rag_dataset_prompt_with_gt.jsonl: Contains questions, reference answers, and contexts for comprehensive evaluation

These datasets can be used to evaluate RAG systems by comparing their responses against the generated reference answers and contexts.