# Golden Dataset Generator - Standalone

This notebook uses RAGAS to generate a synthetic golden dataset from loan complaint data, then uploads the result to LangSmith.

**Features:**
- Self-contained (no dependencies on other notebooks)
- Generates 15 synthetic Q&A pairs
- Uploads dataset to LangSmith for evaluation tracking
- Uses the "Abstracted SDG" approach from RAGAS

## Step 1: Install Dependencies

In [1]:
# Install required packages.
# %pip installs into the environment of the running kernel, whereas `!pip` runs whichever
# pip the shell finds, which can point at a missing or broken interpreter.
%pip install -q ragas langsmith langchain-community datasets pandas numpy pillow rapidfuzz
%pip install -q langchain-openai langchain-core qdrant-client


## Step 2: API Keys Configuration

In [4]:
import os
import getpass
from uuid import uuid4

# OpenAI API Key
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key: ")

# LangSmith API configuration
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
if "LANGCHAIN_API_KEY" not in os.environ:
    os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("Enter your LangSmith API Key: ")
os.environ["LANGCHAIN_PROJECT"] = f"AIM - S09 - Golden Dataset Generation - {uuid4().hex[0:8]}"

print("✅ API keys configured successfully!")

✅ API keys configured successfully!


## Step 3: Load Loan Complaint Data

In [5]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime

# Load loan complaint data from CSV
loader = CSVLoader(
    file_path="./data/complaints.csv",
    metadata_columns=[
        "Date received", 
        "Product", 
        "Sub-product", 
        "Issue", 
        "Sub-issue", 
        "Consumer complaint narrative", 
        "Company public response", 
        "Company", 
        "State", 
        "ZIP code", 
        "Tags", 
        "Consumer consent provided?", 
        "Submitted via", 
        "Date sent to company", 
        "Company response to consumer", 
        "Timely response?", 
        "Consumer disputed?", 
        "Complaint ID"
    ]
)

loan_complaint_data = loader.load()

# Set page_content to the complaint narrative
for doc in loan_complaint_data:
    doc.page_content = doc.metadata["Consumer complaint narrative"]

print(f"✅ Loaded {len(loan_complaint_data)} loan complaint documents")
print(f"Example document preview: {loan_complaint_data[0].page_content[:200]}...")

✅ Loaded 825 loan complaint documents
Example document preview: The federal student loan COVID-19 forbearance program ended in XX/XX/XXXX. However, payments were not re-amortized on my federal student loans currently serviced by Nelnet until very recently. The new...
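
Because `page_content` is set directly from the complaint narrative, any row with a blank narrative becomes an empty document. The following is a minimal sketch of dropping such rows before generation. The helper name `filter_usable_docs`, the `min_chars` threshold, and the stand-in `Doc` class are all assumptions made for illustration and are not part of the original notebook; LangChain `Document` objects expose the same `page_content` attribute, so the helper should apply to `loan_complaint_data` unchanged.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document (same page_content/metadata attributes)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_usable_docs(docs, min_chars=50):
    """Keep documents whose stripped narrative has at least min_chars characters.

    min_chars is an illustrative threshold, not a value taken from the notebook.
    """
    return [d for d in docs if len((d.page_content or "").strip()) >= min_chars]


sample_docs = [
    Doc("The federal student loan forbearance program ended and my payment nearly doubled."),
    Doc(""),            # blank narrative
    Doc("   "),         # whitespace only
    Doc("Too short."),  # below the illustrative threshold
]
usable = filter_usable_docs(sample_docs)
print(f"Kept {len(usable)} of {len(sample_docs)} documents")
```

In the notebook this could be applied as `filter_usable_docs(loan_complaint_data)` before slicing the first 50 documents, if short or empty narratives turn out to be a problem.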


## Step 4: Configure Synthetic Data Generation Models

In [6]:
from ragas.testset import TestsetGenerator
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Set up the LLM and embedding models for synthetic data generation
# Using the same models as in the reference notebooks
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-small"))

print("✅ Synthetic data generation models configured successfully!")

✅ Synthetic data generation models configured successfully!


## Step 5: Generate Synthetic Golden Dataset

In [7]:
# Generate synthetic dataset using the abstracted SDG approach
print("Generating synthetic dataset using RAGAS...")
print(f"Using first 50 documents from {len(loan_complaint_data)} total loan complaint documents")

# Initialize the test set generator
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)

# Generate 15 synthetic Q&A pairs (reduced from the original, larger test set size)
synthetic_dataset = generator.generate_with_langchain_docs(
    loan_complaint_data[:50],  # Use first 50 docs to balance quality and speed
    testset_size=15,  # Number of Q&A pairs to generate
)

print(f"✅ Generated {len(synthetic_dataset)} synthetic Q&A pairs successfully!")

Generating synthetic dataset using RAGAS...
Using first 50 documents from 825 total loan complaint documents



Applying SummaryExtractor:   0%|          | 0/31 [00:00<?, ?it/s]


Applying CustomNodeFilter:   0%|          | 0/50 [00:00<?, ?it/s]

Node 6b8d73e1-29fa-4f07-9b88-b27064a9365e does not have a summary. Skipping filtering.
Node a87fad26-84a6-439e-b694-093f6d4fd8d4 does not have a summary. Skipping filtering.
Node f0aed855-110c-475a-be23-bc8a824282de does not have a summary. Skipping filtering.
Node 7f504c21-fe02-4aa4-a883-132ac187c471 does not have a summary. Skipping filtering.
Node dcbecf25-4519-45bc-a69d-74d9829bda6a does not have a summary. Skipping filtering.
Node 270675d9-07ce-4d85-a1a1-771b6fa21730 does not have a summary. Skipping filtering.
Node 7a2b2eca-ce3f-4965-897d-b5644af0c308 does not have a summary. Skipping filtering.
Node 67becb17-310b-42d1-9df8-1d16312ca709 does not have a summary. Skipping filtering.
Node 58bad9a5-4455-40b9-b2c4-7ec08bff9b8e does not have a summary. Skipping filtering.
Node f1d56d9b-2e58-45ce-95e9-00bacb2c4189 does not have a summary. Skipping filtering.
Node a223fb2b-7abe-4bad-8e3d-e1e1480563bf does not have a summary. Skipping filtering.
Node 2b636fb4-ec7a-4508-9453-6e2f6fdecf51 d

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/131 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/15 [00:00<?, ?it/s]

✅ Generated 15 synthetic Q&A pairs successfully!
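
The summary in Step 8 reports five questions from each of three synthesizers, which is what an even split of `testset_size=15` would give. The sketch below shows one way such a split can be computed with largest-remainder rounding. It is not RAGAS's own allocation code: the function `split_testset_size` and the equal weights are assumptions inferred only from the 5/5/5 result in this run.

```python
def split_testset_size(testset_size, weights):
    """Split testset_size across named synthesizers in proportion to weights.

    Uses largest-remainder rounding so the counts always sum to testset_size.
    """
    total = sum(weights.values())
    exact = {name: testset_size * w / total for name, w in weights.items()}
    counts = {name: int(x) for name, x in exact.items()}
    leftover = testset_size - sum(counts.values())
    # Hand the remaining samples to the largest fractional parts first.
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Equal weights are an assumption; the synthesizer names come from the notebook output.
equal_weights = {
    "single_hop_specifc_query_synthesizer": 1,
    "multi_hop_abstract_query_synthesizer": 1,
    "multi_hop_specific_query_synthesizer": 1,
}
allocation = split_testset_size(15, equal_weights)
for name, count in allocation.items():
    print(f"{name}: {count}")
```

With a size that does not divide evenly, the counts still sum to the requested total, which is the property that matters when comparing a requested `testset_size` against the number of rows generated.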


## Step 6: Convert to Pandas and Display Dataset

In [8]:
import pandas as pd

# Convert the synthetic dataset to pandas DataFrame
golden_df = synthetic_dataset.to_pandas()

print(f"Golden dataset shape: {golden_df.shape}")
print(f"Columns: {list(golden_df.columns)}")
print("\n" + "="*80)
print("DATASET PREVIEW:")
print("="*80)

# Display first few examples
display(golden_df.head(3))

print("\n" + "="*80)
print("EXAMPLE QUESTION-ANSWER PAIR:")
print("="*80)

# Show a detailed example
example_idx = 0
print(f"Question: {golden_df.iloc[example_idx]['user_input']}")
print(f"\nExpected Answer: {golden_df.iloc[example_idx]['reference']}")
print(f"\nReference Contexts ({len(golden_df.iloc[example_idx]['reference_contexts'])} chunks):")
for i, context in enumerate(golden_df.iloc[example_idx]['reference_contexts'][:2]):  # Show first 2 contexts
    print(f"  Context {i+1}: {context[:200]}...")

Golden dataset shape: (15, 4)
Columns: ['user_input', 'reference_contexts', 'reference', 'synthesizer_name']

DATASET PREVIEW:


| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | What is Nelnet? | [The federal student loan COVID-19 forbearance... | Nelnet is the servicer for federal student loa... | single_hop_specifc_query_synthesizer |
| 1 | How is Aidvantage handling my IDR repayment am... | [I submitted my annual Income-Driven Repayment... | Aidvantage assigned me a repayment amount that... | single_hop_specifc_query_synthesizer |
| 2 | How does FERPA protect my personal and financi... | [My personal and financial data was compromise... | My personal and financial data was compromised... | single_hop_specifc_query_synthesizer |



EXAMPLE QUESTION-ANSWER PAIR:
Question: What is Nelnet?

Expected Answer: Nelnet is the servicer for federal student loans, and payments on these loans were not re-amortized until very recently after the end of the COVID-19 forbearance program. The new payment amount starting from the specified date will nearly double the previous payment, and re-amortization was expected to occur once the forbearance ended to help reduce the impact on borrowers.

Reference Contexts (1 chunks):
  Context 1: The federal student loan COVID-19 forbearance program ended in XX/XX/XXXX. However, payments were not re-amortized on my federal student loans currently serviced by Nelnet until very recently. The new...
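
Before uploading, it can help to confirm that every row has a question, a reference answer, and at least one context chunk. The sketch below checks this on plain dictionaries keyed by the column names shown above. The helper `validate_golden_records` is hypothetical and not part of the original notebook; it assumes the rows would be obtained with `golden_df.to_dict("records")` and that `reference_contexts` holds a list.

```python
def validate_golden_records(records):
    """Return (row_index, problem) tuples; an empty list means every row looks usable."""
    problems = []
    for i, rec in enumerate(records):
        # Question and reference answer must be non-empty strings.
        for key in ("user_input", "reference"):
            value = rec.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((i, f"missing or empty {key}"))
        # At least one reference context chunk is expected per question.
        contexts = rec.get("reference_contexts")
        if not isinstance(contexts, (list, tuple)) or len(contexts) == 0:
            problems.append((i, "no reference_contexts"))
    return problems


sample_records = [
    {
        "user_input": "What is Nelnet?",
        "reference": "Nelnet is the servicer for federal student loans.",
        "reference_contexts": ["The federal student loan COVID-19 forbearance program ended."],
    },
    {"user_input": "  ", "reference": "Some answer", "reference_contexts": []},
]
issues = validate_golden_records(sample_records)
for row_index, problem in issues:
    print(f"Row {row_index}: {problem}")
```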


## Step 7: Upload Dataset to LangSmith

In [9]:
from langsmith import Client
from datetime import datetime

# Initialize LangSmith client
client = Client()

# Create a unique dataset name with timestamp
dataset_name = f"loan-complaints-golden-dataset-standalone-{datetime.now().strftime('%Y%m%d-%H%M%S')}"

# Create the dataset in LangSmith
try:
    langsmith_dataset = client.create_dataset(
        dataset_name=dataset_name,
        description="Standalone golden dataset for RAG evaluation using 15 synthetic loan complaint Q&A pairs generated with RAGAS"
    )
    print(f"✅ Created LangSmith dataset: {dataset_name}")
    print(f"Dataset ID: {langsmith_dataset.id}")
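except Exception as e:
    # Creation fails if a dataset with this name already exists, so fall back to reading it.
    # Any other failure (for example an invalid API key) will surface again from read_dataset.
    print(f"Could not create dataset ({e}); reading the existing dataset instead.")
    langsmith_dataset = client.read_dataset(dataset_name=dataset_name)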

✅ Created LangSmith dataset: loan-complaints-golden-dataset-standalone-20250725-105810
Dataset ID: b7278dc2-7a24-4d9b-8e2e-8175adf6411d


In [10]:
# Upload examples to LangSmith dataset
print("Uploading examples to LangSmith...")

upload_count = 0
for idx, row in golden_df.iterrows():
    try:
        client.create_example(
            inputs={
                "question": row["user_input"]
            },
            outputs={
                "answer": row["reference"]
            },
            metadata={
                "reference_contexts": row["reference_contexts"],
                "synthesizer_name": row.get("synthesizer_name", "unknown"),
                # The next two columns are not in the generated DataFrame (see the
                # column list printed in Step 6), so they always take the defaults.
                "evolution_type": row.get("evolution_type", "unknown"),
                "episode_done": row.get("episode_done", False),
                "dataset_size": len(golden_df),
                "source": "standalone_generator"
            },
            dataset_id=langsmith_dataset.id
        )
        upload_count += 1
    except Exception as e:
        print(f"Error uploading example {idx}: {e}")
        continue

print(f"✅ Successfully uploaded {upload_count}/{len(golden_df)} examples to LangSmith dataset!")

Uploading examples to LangSmith...
✅ Successfully uploaded 15/15 examples to LangSmith dataset!
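
The loop above makes one API call per row. As a rough alternative, the payloads can be built first and sent in a single batch. The sketch below covers only the payload-building step, which runs without network access; `build_example_payloads` is a hypothetical helper, and the commented batch call assumes the installed `langsmith` client accepts `inputs`, `outputs`, `metadata`, and `dataset_id` keyword arguments on `create_examples`, which should be checked against the installed version.

```python
def build_example_payloads(records, source="standalone_generator"):
    """Turn golden-dataset rows into parallel lists of inputs, outputs, and metadata."""
    inputs, outputs, metadata = [], [], []
    for rec in records:
        inputs.append({"question": rec["user_input"]})
        outputs.append({"answer": rec["reference"]})
        metadata.append({
            # list(...) ensures a plain list even if the column holds another sequence type.
            "reference_contexts": list(rec["reference_contexts"]),
            "synthesizer_name": rec.get("synthesizer_name", "unknown"),
            "source": source,
        })
    return inputs, outputs, metadata


sample_records = [
    {
        "user_input": "What is Nelnet?",
        "reference": "Nelnet is the servicer for federal student loans.",
        "reference_contexts": ("chunk one", "chunk two"),
        "synthesizer_name": "single_hop_specifc_query_synthesizer",
    },
]
inputs, outputs, metadata = build_example_payloads(sample_records)
print(f"Prepared {len(inputs)} example(s) for upload")

# Assumed batch call, not executed here:
# client.create_examples(
#     inputs=inputs, outputs=outputs, metadata=metadata,
#     dataset_id=langsmith_dataset.id,
# )
```

In the notebook the rows would come from `golden_df.to_dict("records")`. Keeping the payload construction separate from the upload also makes it easy to inspect what will be sent before any request is made.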


## Step 8: Summary and Next Steps

In [11]:
print("\n" + "="*80)
print("GOLDEN DATASET GENERATION COMPLETE!")
print("="*80)
print(f"📊 Dataset Name: {dataset_name}")
print(f"📊 Dataset ID: {langsmith_dataset.id}")
print(f"📊 Number of Q&A Pairs: {len(golden_df)}")
print(f"📊 Source Documents Used: {len(loan_complaint_data)} (first 50 for generation)")
print("📊 Generation Method: RAGAS Abstracted SDG")
print("📊 Models Used:")
print("   - LLM: gpt-4.1-nano")
print("   - Embeddings: text-embedding-3-small")

print("\n🎯 READY FOR EVALUATION!")
print("This dataset can now be used to evaluate RAG chains with both LangSmith and RAGAS metrics.")
print("\n📋 Dataset Structure:")
print(f"   - Questions: {golden_df['user_input'].nunique()} unique")
print(f"   - Answer Length: {golden_df['reference'].str.len().mean():.0f} chars avg")
print(f"   - Context Chunks: {golden_df['reference_contexts'].apply(len).mean():.1f} per question")

# Display synthesizer distribution
synthesizer_counts = golden_df['synthesizer_name'].value_counts()
print(f"\n📈 Synthesizer Distribution:")
for synthesizer, count in synthesizer_counts.items():
    print(f"   - {synthesizer}: {count} questions")


GOLDEN DATASET GENERATION COMPLETE!
📊 Dataset Name: loan-complaints-golden-dataset-standalone-20250725-105810
📊 Dataset ID: b7278dc2-7a24-4d9b-8e2e-8175adf6411d
📊 Number of Q&A Pairs: 15
📊 Source Documents Used: 825 (first 50 for generation)
📊 Generation Method: RAGAS Abstracted SDG
📊 Models Used:
   - LLM: gpt-4.1-nano
   - Embeddings: text-embedding-3-small

🎯 READY FOR EVALUATION!
This dataset can now be used to evaluate RAG chains with both LangSmith and RAGAS metrics.

📋 Dataset Structure:
   - Questions: 15 unique
   - Answer Length: 606 chars avg
   - Context Chunks: 1.7 per question

📈 Synthesizer Distribution:
   - single_hop_specifc_query_synthesizer: 5 questions
   - multi_hop_abstract_query_synthesizer: 5 questions
   - multi_hop_specific_query_synthesizer: 5 questions


## Optional: Save Dataset Locally

In [12]:
# Save the dataset locally for backup/reference.
# Compute the timestamp once so the CSV and JSON files always share the same name stem.
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

local_filename = f"golden_dataset_{timestamp}.csv"
golden_df.to_csv(local_filename, index=False)
print(f"💾 Dataset saved locally as: {local_filename}")

# Also save as JSON, which keeps reference_contexts as a list rather than a string
json_filename = f"golden_dataset_{timestamp}.json"
golden_df.to_json(json_filename, orient='records', indent=2)
print(f"💾 Dataset saved locally as JSON: {json_filename}")

💾 Dataset saved locally as: golden_dataset_20250725_105853.csv
💾 Dataset saved locally as JSON: golden_dataset_20250725_105853.json
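
The reason JSON preserves context better is that CSV cells are plain text, so a list of context chunks is written as its string representation. The sketch below illustrates the difference with the standard library only; it mimics the behavior rather than calling pandas, and the sample record is made up for illustration.

```python
import ast
import csv
import io
import json

record = {
    "user_input": "What is Nelnet?",
    "reference_contexts": ["first context chunk", "second context chunk"],
}

# CSV: every cell is text, so the list is stored as its string representation.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
csv_buffer.seek(0)
csv_row = next(csv.DictReader(csv_buffer))

# JSON: the list survives the round trip as a list.
json_row = json.loads(json.dumps(record))

print(type(csv_row["reference_contexts"]).__name__)
print(type(json_row["reference_contexts"]).__name__)

# A stringified list read back from CSV can be parsed again if needed.
recovered = ast.literal_eval(csv_row["reference_contexts"])
```

When reloading the CSV backup, a similar parsing step would be needed for the `reference_contexts` column, whereas the JSON file can be loaded as-is.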
