# Golden Dataset Creator

This notebook generates golden evaluation datasets from your Knowledge Base documentation using Amazon Bedrock.
It splits the documentation into chunks and prompts an LLM to generate QA pairs grounded in each chunk, producing one JSONL test set per prompt type (happy path, edge case, adversarial) for RAG evaluation.

### 1. Setup and Dependencies
Import necessary libraries and configure the Bedrock client for processing.

In [1]:
%pip install langchain_community langchain_text_splitters boto3

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
import json
import random
import boto3
from typing import List
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Set project paths
ROOT_DIR = os.path.abspath("../../../../")
KB_PATH = os.path.join(ROOT_DIR, "assets/knowledge_base/samples/data")
OUTPUT_DIR = os.path.join(ROOT_DIR, "assets/knowledge_base/samples/evaluation/test_sets")
PROMPT_PATH = os.path.join(ROOT_DIR, "assets/prompts/evaluation_prompts")

# Initialize Bedrock client
bedrock = boto3.client('bedrock-runtime')
MODEL_ID = "us.anthropic.claude-3-7-sonnet-20250219-v1:0"

### 2. Load Documentation
Load all Markdown files from the knowledge base directory; their content serves as the context for the generated questions.

In [3]:
print(f"Loading files from {KB_PATH}...")
loader = DirectoryLoader(KB_PATH, glob="**/*.md", loader_cls=TextLoader)
documents = loader.load()
print(f"Loaded {len(documents)} documents.")

Loading files from /Users/alvaro/VSProjects/enterprise-bedrock-agent/assets/knowledge_base/samples/data...
Loaded 37 documents.


### 3. Text Splitting
Divide the documents into smaller, manageable chunks to ensure the LLM generates focused questions.

In [4]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)
print(f"Split documents into {len(chunks)} chunks.")

Split documents into 120 chunks.
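
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding window over characters. It is a simplification for illustration only: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which this sketch does not do. The function name `sliding_window_chunks` is hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 800, chunk_overlap: int = 100) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return windows


# Illustrative input: 2,000 digits, so the overlap is easy to inspect
demo_text = "".join(str(i % 10) for i in range(2000))
demo_chunks = sliding_window_chunks(demo_text)
print([len(c) for c in demo_chunks])
```

With these settings each window starts 700 characters after the previous one, so the last 100 characters of a chunk reappear at the start of the next. That overlap keeps a sentence that straddles a boundary visible in at least one chunk.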


### 4. Load Prompt Template
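Read the three prompt templates (happy path, edge case, adversarial) that instruct the LLM on how to generate the question-answer pairs. Each template is later filled with a chunk and sent as the user message.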

In [5]:

evaluation_prompts = {
    "happy_path_prompt": "",
    "edge_case_prompt": "",
    "adversarial_prompt": ""
}

# Fill each entry from its matching <prompt_type>.txt file in PROMPT_PATH
for prompt_type in evaluation_prompts:
    with open(os.path.join(PROMPT_PATH, f"{prompt_type}.txt"), "r", encoding="utf-8") as prompt_file:
        evaluation_prompts[prompt_type] = prompt_file.read()
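
The prompt files themselves are not shown in this notebook. As a rough sketch of what a compatible template might look like, the example below uses the `{chunk_content}` placeholder that the generation step fills in and asks for the `user_input` and `response` keys that the generation step reads. The template wording is an assumption. Note that literal JSON braces in a template must be doubled (`{{` and `}}`), otherwise `str.format` treats them as placeholders and raises an error.

```python
# Hypothetical template text; the real files live in PROMPT_PATH and may differ.
EXAMPLE_TEMPLATE = """Generate one question-answer pair grounded only in the context below.

Context:
{chunk_content}

Respond with a single JSON object of the form:
{{"user_input": "<question>", "response": "<answer>"}}
"""


def render_prompt(template: str, chunk_content: str) -> str:
    """Fill the template's chunk_content placeholder with the chunk text."""
    return template.format(chunk_content=chunk_content)


rendered = render_prompt(EXAMPLE_TEMPLATE, "The API rate limit is 100 requests per minute.")
print(rendered)
```

Only the template is parsed for placeholders, so braces that appear inside the chunk text itself are inserted unchanged.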


### 5. Dataset Generation
For each prompt type, sample up to `TESTSET_SIZE` chunks at random and call Amazon Bedrock to generate one technical QA pair per chunk.

In [6]:
def call_llm(prompt):
    """
    Sends a request to Amazon Bedrock using the Anthropic Messages API.
    """
    payload = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0
    }
    
    # Invoke model
    response = bedrock.invoke_model(modelId=MODEL_ID, body=json.dumps(payload))
    
    # Read and parse the response body
    result = json.loads(response['body'].read())
    return result['content'][0]['text']
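
`call_llm` returns the text of the first content block. The sketch below shows the shape of the parsed response body it reads from, using a hand-written sample rather than real model output, so it runs without AWS credentials. The helpers `extract_text` and `was_truncated` are hypothetical additions: the first joins every text block instead of only the first, and the second checks `stop_reason` to detect a reply that was cut off by the `max_tokens` limit.

```python
import json
from typing import Any, Dict


def extract_text(result: Dict[str, Any]) -> str:
    """Join the text of every text block in a parsed Messages API response."""
    return "".join(
        block.get("text", "")
        for block in result.get("content", [])
        if block.get("type") == "text"
    )


def was_truncated(result: Dict[str, Any]) -> bool:
    """Return True when generation stopped because max_tokens was reached."""
    return result.get("stop_reason") == "max_tokens"


# Hand-written sample body for illustration, not captured from a real call
sample_body = json.dumps({
    "id": "msg_example",
    "type": "message",
    "role": "assistant",
    "content": [{"type": "text", "text": '{"user_input": "Q?", "response": "A."}'}],
    "stop_reason": "end_turn",
    "usage": {"input_tokens": 120, "output_tokens": 24},
})

sample_result = json.loads(sample_body)
print(extract_text(sample_result))
print(was_truncated(sample_result))
```

A truncated reply usually contains incomplete JSON, so checking `stop_reason` before parsing can make failures easier to diagnose.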


In [7]:
# Logic to generate multiple test sets
TESTSET_SIZE = 20
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Iterate over different prompt types defined in the previous cell
for prompt_type, prompt_template in evaluation_prompts.items():
    print(f"\n--- Generating dataset for: {prompt_type} ---")
    
    test_set = []
    # Draw a fresh random sample from the chunks produced in the Text Splitting step
    selected_chunks = random.sample(chunks, min(TESTSET_SIZE, len(chunks)))

    for i, chunk in enumerate(selected_chunks):
        print(f"Generating QA pair {i+1}/{len(selected_chunks)} for {prompt_type}...")
        
        # Use the specific prompt template for this iteration
        prompt = prompt_template.format(chunk_content=chunk.page_content)
        
        try:
            raw_res = call_llm(prompt)
            
            # Extract JSON from model response
            start = raw_res.find('{')
            end = raw_res.rfind('}') + 1
            item = json.loads(raw_res[start:end])
            
            # DeepEval Schema Mapping: input, actual_output, expected_output, retrieval_context
            test_set.append({
                "input": item["user_input"],
                "actual_output": "", # This will be populated by the actual Agent response later
                "expected_output": item["response"],
                "retrieval_context": [chunk.page_content]
            })
        except Exception as e:
            print(f"Error generating question {i+1}: {e}")

    print(f"✅ Successfully generated {len(test_set)} test cases for {prompt_type}.")
    
    # Save the specific test set
    output_filename = f"golden_set_{prompt_type.replace('_prompt', '')}.jsonl"
    output_file = os.path.join(OUTPUT_DIR, output_filename)
    
    with open(output_file, 'w', encoding='utf-8') as f:
        for item in test_set:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
            
    print(f"Saved to {output_file}")


--- Generating dataset for: happy_path_prompt ---
Generating QA pair 1/20 for happy_path_prompt...
Generating QA pair 2/20 for happy_path_prompt...


KeyboardInterrupt: 
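
The generation loop slices the reply from the first `{` to the last `}`, which fails if the model adds any text containing braces after the JSON object. As one possible alternative, the sketch below scans for the first position where a complete JSON object can be decoded and then checks for the two keys the loop reads. The function names and the validation rule (non-empty strings) are assumptions, not part of the notebook.

```python
import json
from typing import Any, Dict, Optional

REQUIRED_KEYS = ("user_input", "response")


def extract_json_object(raw: str) -> Optional[Dict[str, Any]]:
    """Return the first decodable JSON object found in raw, or None."""
    decoder = json.JSONDecoder()
    pos = raw.find("{")
    while pos != -1:
        try:
            # raw_decode parses one value starting at pos and ignores trailing text
            obj, _ = decoder.raw_decode(raw, pos)
            return obj
        except json.JSONDecodeError:
            pos = raw.find("{", pos + 1)
    return None


def parse_qa_pair(raw: str) -> Optional[Dict[str, Any]]:
    """Return the QA dict if both required keys hold non-empty strings, else None."""
    obj = extract_json_object(raw)
    if obj is None:
        return None
    for key in REQUIRED_KEYS:
        value = obj.get(key)
        if not isinstance(value, str) or not value.strip():
            return None
    return obj


# Illustrative reply with prose and stray braces around the JSON object
reply = 'Here is the pair:\n{"user_input": "Q?", "response": "A."}\nNote: {placeholders} were left as-is.'
print(parse_qa_pair(reply))
```

Returning `None` instead of raising lets the caller count and skip bad replies explicitly rather than relying on a broad `except`.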

### 6. Save Results
The previous step writes each test set to a JSONL file for consumption by the evaluation engine. List the generated files to confirm they exist.

In [8]:
print("All datasets generated and saved in:", OUTPUT_DIR)
for file in os.listdir(OUTPUT_DIR):
    if file.startswith("golden_set_"):
        print(f"- {file}")

All datasets generated and saved in: /Users/alvaro/VSProjects/enterprise-bedrock-agent/assets/knowledge_base/samples/evaluation/test_sets
- golden_set_happy_path.jsonl
- golden_set_adversarial.jsonl
- golden_set_edge_case.jsonl
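
Before handing a file to the evaluation engine, it can help to read it back and confirm every record has the four fields written by the generation step. The following is a minimal sketch; `load_golden_set` is a hypothetical helper, and the demo writes a small temporary file so the example runs on its own.

```python
import json
import os
import tempfile
from typing import Any, Dict, List

EXPECTED_FIELDS = {"input", "actual_output", "expected_output", "retrieval_context"}


def load_golden_set(path: str) -> List[Dict[str, Any]]:
    """Read a JSONL test set and verify each record has the expected fields."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = EXPECTED_FIELDS - set(record)
            if missing:
                raise ValueError(f"Line {line_number} is missing fields: {sorted(missing)}")
            records.append(record)
    return records


# Demo with a temporary file containing one illustrative record
demo_record = {
    "input": "What is the rate limit?",
    "actual_output": "",
    "expected_output": "100 requests per minute.",
    "retrieval_context": ["The API rate limit is 100 requests per minute."],
}
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "golden_set_demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps(demo_record, ensure_ascii=False) + "\n")
    loaded = load_golden_set(demo_path)

print(f"Loaded {len(loaded)} record(s).")
```

To check a real output, pass the path of one of the files listed above, for example `os.path.join(OUTPUT_DIR, "golden_set_happy_path.jsonl")`.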
