<a href="https://colab.research.google.com/github/ericphamhung-gretel/demos/blob/main/data_designer_sdk_qa_pairs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📚 Data Designer SDK: Generate Diverse Q&A Pairs

This notebook demonstrates a general approach for extracting Q&A pairs from any source document
(e.g. text, markdown, or PDF files). The generated Q&A pairs can be used for:
- **Instruction Tuning:** Training models with clear, self-contained examples.
- **Retrieval-Augmented Generation (RAG):** Enhancing retrieval systems with precise and context-supported Q&A pairs.
- **Search and FAQ Systems:** Powering natural language query systems and documentation.

> **Note:** The [Data Designer](https://docs.gretel.ai/create-synthetic-data/gretel-data-designer-beta) functionality demonstrated in this notebook is currently in **Early Preview**. To access these features and run this notebook, please [join the waitlist](https://gretel.ai/navigator/data-designer#waitlist).

# 📘 Getting Started

First, let's install and import the required packages:

In [1]:
%%capture
# Install required libraries (the %%capture cell magic must be the first line of the cell)
!pip install -qq langchain "unstructured[pdf]" smart_open git+https://github.com/gretelai/gretel-python-client

In [22]:
# Configuration
# -------------
# Define your source document(s) and the number of Q&A pairs to generate.
# You can replace this with your own documents in PDF, markdown, or text formats.

DOCUMENT_LIST = ["https://www.americanexpress.com/content/dam/amex/us/rewards/membership-rewards/mr-terms-conditions-11.01.2024.pdf?mrlinknav=footer-tandc"]
NUM_QA_PAIRS = 100

In [3]:
# Document Processing
# ------------------
# The DocumentProcessor class handles loading and chunking source documents for Q&A generation.
# We use langchain's RecursiveCharacterTextSplitter and unstructured.io for robust document parsing.

from typing import List, Union
from langchain.text_splitter import RecursiveCharacterTextSplitter
from unstructured.partition.auto import partition
from smart_open import open
import tempfile
import os

class DocumentProcessor:
    def __init__(self, chunk_size: int = 4192, chunk_overlap: int = 200):
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
        )

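    def parse_document(self, uri: str) -> str:
        """Parse a single document from URI into raw text."""
        # smart_open's `open` reads local paths and remote URIs (e.g. https://) alike.
        with open(uri, 'rb') as file:
            content = file.read()

        # Write the bytes to a temporary file so `partition` can parse it from disk.
        with tempfile.NamedTemporaryFile(delete=False) as temp_file:
            temp_file.write(content)

        try:
            elements = partition(temp_file.name)
        finally:
            # Remove the temporary file even if parsing fails.
            os.unlink(temp_file.name)

        return "\n\n".join(str(element) for element in elements)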

    def process_documents(self, uris: Union[str, List[str]]) -> List[str]:
        """Process one or more documents into chunks for Q&A generation."""
        if isinstance(uris, str):
            uris = [uris]

        all_chunks = []
        for uri in uris:
            text = self.parse_document(uri)
            chunks = self.text_splitter.split_text(text)
            all_chunks.extend(chunks)

        return all_chunks
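
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line boundaries, which this sketch ignores. `chunk_with_overlap` is a hypothetical helper for illustration, not part of langchain.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


demo_chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo_chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.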

In [4]:
# Data Models
# -----------
# Define Pydantic models for structured output generation:
# 1. QAPair: Schema for question-answer pairs
# 2. EvalMetrics: Schema for scoring generation quality

from pydantic import BaseModel, Field
from typing import Optional, Literal

class QAPair(BaseModel):
    question: str = Field(..., description="A clear and concise question derived from the context.")
    answer: str = Field(..., description="A detailed and accurate answer fully supported by the context.")
    reasoning: str = Field(..., description="A brief explanation of why this Q&A pair is valuable.")


class EvalMetrics(BaseModel):
    clarity: int = Field(..., description="Clarity of the question", ge=1, le=5)
    factual_accuracy: int = Field(..., description="Factual accuracy of the answer", ge=1, le=5)
    comprehensiveness: int = Field(..., description="Completeness of the answer", ge=1, le=5)
    reasoning_quality: int = Field(..., description="Quality of the provided reasoning", ge=1, le=5)

In [52]:
# Setup & Configure Data Designer
# --------------------------------
# Initialize the Data Designer with a custom system message to ensure that the generated
# Q&A pairs are high-quality, context-supported, and tailored for a variety of downstream applications.


from gretel_client.navigator import DataDesigner

# Process document chunks
processor = DocumentProcessor(chunk_size=4192, chunk_overlap=200)
chunks = processor.process_documents(DOCUMENT_LIST)

# Initialize Data Designer
designer = DataDesigner(
    api_key="prompt",
    model_suite="llama-3.x",  # or "apache-2.0" as needed
    endpoint="https://api.gretel.cloud",
    special_system_instructions="""\
You are an expert at generating high-quality, context-supported Q&A pairs.
Your output should be clear, concise, and factually correct.
Ensure that every question is self-contained and every answer is comprehensive and derived solely from the provided context.
"""
)

# Add Seed Columns for Controlled Diversity

# Context: Document chunks.
designer.add_categorical_seed_column(
    name="context",
    description="A chunk of text extracted from the source document.",
    values=chunks
)

# Focus area with nested sub-focus areas.
designer.add_categorical_seed_column(
    name="focus_area",
    description="The focus area of the question.",
    values=["program overview", "earning points", "redeeming rewards", "account management", "promotion and bonuses", "terms and conditions"],
    subcategories=[
        {
            "name": "sub_focus_area",
            "values": {
                "program overview": ["introduction and description", "eligibility and enrollment", "membership tiers", "general benefits"],
                "earning points": ["spending categories", "earning rates", "bonus offers", "promotional earning opportunities"],
                "redeeming rewards": ["redemption options", "redemption process", "points valuation and conversion rates", "restrictions and limitations"],
                "account management": ["points tracking and monitoring", "points expiration policies", "points transfer and consolidation"],
                "promotion and bonuses": ["sign-up and referral bonuses", "seasonal and limited time offers", "bonus qualification requirements"],
                "terms and conditions": ["program rules and legal terms", "fee structure and charges", "limitations and exclusions", "amendments and updates"]
            }
        }
    ]
)

# Question Style: Tone and approach (optional; uncomment to enable).
# designer.add_categorical_seed_column(
#     name="question_style",
#     description="The style or tone of the question.",
#     values=["factual", "exploratory", "analytical", "comparative"]
# )

# Target Audience: Language complexity.
designer.add_categorical_seed_column(
    name="target_audience",
    description="The intended audience for the Q&A pair.",
    values=["novice", "intermediate", "expert"]
)

# Response Format: Expected answer style (optional; uncomment to enable).
# designer.add_categorical_seed_column(
#     name="response_format",
#     description="The format of the answer expected (e.g., short, detailed, step-by-step).",
#     values=["short", "detailed", "step-by-step", "list"]
# )

# Define Generation Template for Q&A Pairs
designer.add_generated_data_column(
    name="qa_pair",
    generation_prompt=(
        "\n{context}\n\n"
        "Based on the above context, generate a high-quality Q&A pair. The question should be clear and concise, "
        "and tailored for an audience at the '{target_audience}' level. The overall focus should be '{focus_area}', "
        "with a corresponding subfocus area (e.g., {sub_focus_area}). "
        "Ensure that the answer is factually accurate, fully derived from the context, and comprehensive. "
        "Additionally, include a brief explanation (reasoning) on why this Q&A pair is valuable.\n\n"
        "Output the result in JSON format as follows:\n"
        '{{\n'
        '  "question": "<your question>",\n'
        '  "answer": "<your answer>",\n'
        '  "reasoning": "<your explanation>"\n'
        '}}\n'
    ),
    data_config={
        "type": "structured",
        "params": {"model": QAPair}
    }
)

# Optional: Define Evaluation Template for Q&A Pairs
designer.add_generated_data_column(
    name="eval_metrics",
    llm_type="judge",
    generation_prompt=(
        "\n{context}\n\n"
        "For the above Q&A pair:\n"
        "{qa_pair}\n\n"
        "Evaluate the following criteria on a scale of 1 (lowest) to 5 (highest):\n"
        "1. Clarity: How clear and understandable is the question?\n"
        "2. Factual Accuracy: Is the answer fully supported by the context?\n"
        "3. Comprehensiveness: Does the answer cover all necessary details?\n"
        "4. Reasoning Quality: Is the provided reasoning well-justified and clear?\n\n"
        "Output the scores in JSON format."
    ),
    data_config={
        "type": "structured",
        "params": {"model": EvalMetrics}
    }
)

# Preview a Sample of 10 Generated Records
preview = designer.generate_dataset_preview()
# preview.display_sample_record()


[23:29:43] [INFO] 🦜 Using llama-3.x model suite
Gretel API Key: ··········
Logged in as eric.phamhung@gretel.ai ✅
[23:29:51] [INFO] 🚀 Generating dataset preview
[23:29:52] [INFO] 📥 Step 1: Load data seeds
[23:29:52] [INFO] 🎲 Step 2: Sample data seeds
[23:29:52] [INFO] 🦜 Step 3: Generate column from template >> generating qa pair
[23:30:03] [INFO] 🦜 Step 4: Generate column from template >> generating eval metrics
[23:30:15] [INFO] 👀 Your dataset preview is ready for a peek!
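
The sketch below illustrates how seed columns and the generation prompt fit together: one value is sampled per seed column, the sub-focus area is drawn from the list belonging to the sampled focus area, and the values fill the `{placeholders}` in the prompt. Uniform sampling and `str.format`-style substitution are assumptions made for illustration; the actual Data Designer internals may differ. `sample_seed_row`, `FOCUS_AREAS`, and the shortened `TEMPLATE` are hypothetical.

```python
import random

# Trimmed-down stand-ins for the seed values configured above.
FOCUS_AREAS = {
    "earning points": ["spending categories", "earning rates"],
    "redeeming rewards": ["redemption options", "redemption process"],
}
AUDIENCES = ["novice", "intermediate", "expert"]

# Doubled braces ({{ and }}) render as literal braces after formatting.
TEMPLATE = (
    "\n{context}\n\n"
    "Generate a Q&A pair for a '{target_audience}' audience. "
    "Focus: '{focus_area}' (e.g., {sub_focus_area}). "
    "Output JSON as follows:\n"
    '{{\n  "question": "<your question>"\n}}\n'
)


def sample_seed_row(chunks, rng):
    """Draw one value per seed column; the sub-focus depends on the focus."""
    focus_area = rng.choice(list(FOCUS_AREAS))
    return {
        "context": rng.choice(chunks),
        "focus_area": focus_area,
        "sub_focus_area": rng.choice(FOCUS_AREAS[focus_area]),
        "target_audience": rng.choice(AUDIENCES),
    }


rng = random.Random(0)  # seeded so the demo is repeatable
row = sample_seed_row(["Points are earned on eligible purchases."], rng)
prompt = TEMPLATE.format(**row)
print(prompt)
```

Because the sub-focus area is sampled conditionally, a row never pairs a focus area with a sub-focus area from a different branch.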


In [55]:
from rich.console import Console
from rich.table import Table
from collections import Counter
import pandas as pd

def generate_qa_report(df):
    console = Console()
    categories = ['focus_area', 'sub_focus_area', 'target_audience']

    console.print("\n[bold blue]🧠 QA Pair Generation Report[/bold blue]", justify="center")
    console.print("=" * 80, justify="center")
    console.print(f"\n[bold]Total QA Pairs:[/bold] {len(df)}")

    for category in categories:
        if category in df.columns:
            # Only count non-empty values
            counts = Counter(x for x in df[category] if pd.notna(x) and x != '')
            if not counts:
                continue

            table = Table(title=f"\n{category.title()} Distribution")
            table.add_column("Category", style="cyan")
            table.add_column("Count", justify="right")
            table.add_column("Percentage", justify="right")

            total = sum(counts.values())
            for value, count in sorted(counts.items()):
                percentage = (count / total) * 100
                table.add_row(str(value), str(count), f"{percentage:.1f}%")

            console.print(table)

    if any(col.startswith('eval_metrics.') for col in df.columns):
        metrics_table = Table(title="\nQuality Metrics Summary")
        metrics_table.add_column("Metric")
        metrics_table.add_column("Average Score", justify="right")

        metrics = ['clarity', 'factual_accuracy', 'comprehensiveness', 'reasoning_quality']
        for metric in metrics:
            col = f'eval_metrics.{metric}'
            if col in df.columns:
                scores = df[col].dropna()
                if len(scores) > 0:
                    avg_score = scores.mean()
                    metrics_table.add_row(
                        metric.replace('_', ' ').title(),
                        f"{avg_score:.2f}/5.00"
                    )

        console.print(metrics_table)

generate_qa_report(preview.dataset)

In [54]:
# View the preview dataset as a dataframe
preview.dataset

Unnamed: 0,context,focus_area,sub_focus_area,target_audience,qa_pair,eval_metrics
0,The points you got for a purchase are reversed...,terms and conditions,amendments and updates,intermediate,"{""question"": ""If a claim is submitted to media...","{""clarity"": 5, ""factual_accuracy"": 5, ""compreh..."
1,Lose Points\n\nThis section of the Terms and C...,earning points,earning rates,intermediate,"{""question"": ""How do I handle a negative point...","{""clarity"": 5, ""factual_accuracy"": 5, ""compreh..."
2,Business Purchase Account\n\nNo\n\n$40\n\nNo\n...,terms and conditions,limitations and exclusions,novice,"{""question"": ""What are the limitations on tran...","{""clarity"": 5, ""factual_accuracy"": 5, ""compreh..."
3,Business Purchase Account\n\nNo\n\n$40\n\nNo\n...,earning points,promotional earning opportunities,intermediate,"{""question"": ""For which business cards can I p...","{""clarity"": 5, ""factual_accuracy"": 5, ""compreh..."
4,If you don't make a timely payment of the mini...,earning points,earning rates,expert,"{""question"": ""How do merchants get assigned co...","{""clarity"": 5, ""factual_accuracy"": 5, ""compreh..."
5,11\n\nLimitation on Arbitration\n\nArbitration...,earning points,spending categories,novice,"{""question"": ""What spending categories are eli...","{""clarity"": 4, ""factual_accuracy"": 4, ""compreh..."
6,11\n\nLimitation on Arbitration\n\nArbitration...,program overview,membership tiers,expert,"{""question"": ""What are the limitations on arbi...","{""clarity"": 5, ""factual_accuracy"": 5, ""compreh..."
7,The points you got for a purchase are reversed...,program overview,eligibility and enrollment,intermediate,"{""question"": ""Can you explain the steps to res...","{""clarity"": 4, ""factual_accuracy"": 5, ""compreh..."
8,1\n\nCorporate Cards issued to more than one i...,promotion and bonuses,bonus qualification requirements,novice,"{""question"": ""How do I earn points with my Cor...","{""clarity"": 5, ""factual_accuracy"": 4, ""compreh..."
9,4 additional points (for a total of 5 points) ...,program overview,general benefits,expert,"{""question"": ""What are the different categorie...","{""clarity"": 5, ""factual_accuracy"": 5, ""compreh..."
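
In the table above, `qa_pair` and `eval_metrics` appear as JSON strings, whereas `generate_qa_report` looks for flattened columns named `eval_metrics.<metric>`. If the dataset comes back in that shape, the quality metrics summary would be skipped. A minimal sketch of one way to flatten the JSON columns first, using only the standard library (`flatten_json_columns` is a hypothetical helper, and the sample rows are made up for the demo):

```python
import json


def flatten_json_columns(records, columns=("qa_pair", "eval_metrics")):
    """Expand JSON-string columns into '<column>.<key>' entries per record."""
    flat_records = []
    for record in records:
        flat = dict(record)
        for column in columns:
            raw = flat.pop(column, None)
            if isinstance(raw, str):
                parsed = json.loads(raw)
            elif isinstance(raw, dict):
                parsed = raw
            else:
                continue  # missing or unparseable value: drop the column
            for key, value in parsed.items():
                flat[f"{column}.{key}"] = value
        flat_records.append(flat)
    return flat_records


rows = [
    {"focus_area": "earning points", "eval_metrics": '{"clarity": 5, "factual_accuracy": 4}'},
    {"focus_area": "earning points", "eval_metrics": '{"clarity": 4, "factual_accuracy": 5}'},
]
flat = flatten_json_columns(rows)
print(flat[0]["eval_metrics.clarity"])  # 5
```

Assuming `preview.dataset` is a pandas DataFrame, the result can be passed to the report with `generate_qa_report(pd.DataFrame(flatten_json_columns(preview.dataset.to_dict("records"))))`.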


In [60]:
preview.display_sample_record(1)

In [62]:
# Generate and Analyze Dataset
# ----------------------------
# Submit a batch job to generate the full set of Q&A pairs, report on the
# coverage and quality of the generated dataset, and save it to JSONL.

batch_job = designer.submit_batch_workflow(num_records=NUM_QA_PAIRS)
dataset = batch_job.fetch_dataset(wait_for_completion=True)

generate_qa_report(dataset)

dataset.to_json('qa_pairs.jsonl', orient='records', lines=True)


[23:44:03] [INFO] ⚙️ Configuring Data Designer Workflow steps:
[23:44:03] [INFO]   |-- Step 1: load-data-seeds-1
[23:44:03] [INFO]   |-- Step 2: sample-data-seeds-2
[23:44:03] [INFO]   |-- Step 3: generate-column-from-template-3-generating-qa-pair
[23:44:03] [INFO]   |-- Step 4: generate-column-from-template-4-generating-eval-metrics
[23:44:07] [INFO] 🛜 Connecting to your Gretel Project:
[23:44:07] [INFO] 🔗 -> https://console.gretel.ai/proj_2sutmWeFOxRNAMWb4tpVanamI7e
[23:44:08] [INFO] ▶️ Starting your workflow run to generate 100 records:
[23:44:08] [INFO]   |-- project_name: gretel-sdk-07d1e-44ab09eab0fa0bb
[23:44:08] [INFO]   |-- project_id: proj_2sutmWeFOxRNAMWb4tpVanamI7e
[23:44:08] [INFO]   |-- workflow_run_id: wr_2sutmkBngXKEa6jExORbl5xn7Fj
[23:44:08] [INFO] 🔗 -> https://console.gretel.ai/workflows/w_2sutmpWQeKlMnRaUyyi291rPeMD/runs/wr_2sutmkBngXKEa6jExORbl5xn7Fj
[23:44:12] [INFO] ⏳ Waiting for workflow step `generate-column-from-template-4-generating-eval-metrics` to complete..
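
For the instruction-tuning use case mentioned at the top of this notebook, the saved `qa_pairs.jsonl` can be reshaped into prompt/completion records. The sketch below assumes each line holds a `qa_pair` field that is either a JSON string or an object with `question` and `answer` keys; the `prompt`/`completion` output names and the `to_instruction_records` helper are hypothetical and should be adapted to whatever your training framework expects.

```python
import json
import tempfile
from pathlib import Path


def to_instruction_records(jsonl_path):
    """Read a JSONL file of generated rows and keep only question/answer."""
    records = []
    for line in Path(jsonl_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        qa = json.loads(line)["qa_pair"]
        if isinstance(qa, str):  # the column may hold a JSON string
            qa = json.loads(qa)
        records.append({"prompt": qa["question"], "completion": qa["answer"]})
    return records


# Demo on a one-row stand-in file rather than the real output.
sample_row = {
    "context": "...",
    "qa_pair": json.dumps({"question": "Q?", "answer": "A.", "reasoning": "R."}),
}
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "qa_pairs.jsonl"
    sample_path.write_text(json.dumps(sample_row) + "\n", encoding="utf-8")
    records = to_instruction_records(sample_path)

print(records)
```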

In [63]:
dataset

Unnamed: 0,context,focus_area,sub_focus_area,target_audience,qa_pair,eval_metrics
0,All returns are subject to Amazon.com's polici...,redeeming rewards,restrictions and limitations,intermediate,"{""question"": ""Can I use points to cover charge...","{""clarity"": 5, ""comprehensiveness"": 5, ""factua..."
1,11\n\nLimitation on Arbitration\n\nArbitration...,terms and conditions,program tules and legal terms,novice,"{""question"": ""Who pays for arbitration fees if...","{""clarity"": 5, ""comprehensiveness"": 5, ""factua..."
2,1\n\nCorporate Cards issued to more than one i...,program overview,eligibility and enrollment,novice,"{""question"": ""What are the general terms and c...","{""clarity"": 4, ""comprehensiveness"": 4, ""factua..."
3,Arbitration procedures are generally simpler t...,program overview,membership tiers,expert,"{""question"": ""What are the key limitations on ...","{""clarity"": 5, ""factual_accuracy"": 5, ""compreh..."
4,11\n\nLimitation on Arbitration\n\nArbitration...,earning points,earning rates,novice,"{""question"": ""What is the minimum amount you c...","{""clarity"": 5, ""comprehensiveness"": 4, ""factua..."
...,...,...,...,...,...,...
95,Lose Points\n\nThis section of the Terms and C...,terms and conditions,amendments and updates,novice,"{""question"": ""Can I still use points if my Cor...","{""clarity"": 5, ""comprehensiveness"": 5, ""factua..."
96,Welcome to Membership Rewards!\n\nLast Updated...,account management,points transfer and consolidation,novice,"{""question"": ""Can I link multiple products to ...","{""clarity"": 5, ""comprehensiveness"": 4, ""factua..."
97,The list of eligible charges can change from t...,redeeming rewards,redemption options,intermediate,"{""question"": ""Can points be used to pay the Mi...","{""clarity"": 5, ""comprehensiveness"": 5, ""factua..."
98,"Also, some products will earn additional point...",promotion and bonuses,bonus qualification requirements,novice,"{""question"": ""What is the bonus qualification ...","{""clarity"": 5, ""comprehensiveness"": 4, ""factua..."


In [64]:
# Inspect a single generated Q&A pair
import json
parsed = json.loads(dataset['qa_pair'][99])
print(json.dumps(parsed, indent=4))

{
    "question": "Will arbitration replace your rights to seek legal recourse in court?",
    "answer": "Arbitration procedures are generally simpler than court proceedings, but they do have some limitations. Claims may not be joined or consolidated unless you and American Express agree in writing, and the arbitrator's decisions are final and binding, subject to some review by a court.",
    "reasoning": "This Q&A pair is valuable because it highlights the differences between arbitration and court proceedings. Understanding these differences is essential for users to make informed decisions about whether to pursue arbitration or other forms of dispute resolution."
}


# Using OpenAI

In [65]:
from openai import OpenAI
from google.colab import userdata


In [66]:
client = OpenAI(
    base_url="https://litellm-proxy.dev.gretel.cloud/v1",
    api_key=userdata.get("GRETEL_API_KEY")
)

# [m.id for m in client.models.list()]

In [67]:
system_prompt = """
You are an expert at generating high-quality, context-supported Q&A pairs.
Your output should be clear, concise, and factually correct.
Ensure that every question is self-contained and every answer is comprehensive and derived solely from the provided context.
"""

# Join the chunks into one context string instead of embedding the list's repr.
context = "\n\n".join(chunks)

user_prompt = f"""
{context}

Based on the above context, generate {NUM_QA_PAIRS} high-quality Q&A pairs. Each question should be clear and concise,
and tailored for an audience at one of the following levels: "novice", "intermediate", or "expert".
The overall focus should be one of: "program overview", "earning points", "redeeming rewards", "account management", "promotion and bonuses", or "terms and conditions",
with a corresponding subfocus area.
Ensure that each answer is factually accurate, fully derived from the context, and comprehensive.
Additionally, include a brief explanation (reasoning) on why each Q&A pair is valuable.

Output the result in JSON format as follows:
{{
  "question": "<your question>",
  "answer": "<your answer>",
  "target_audience": "<target audience>",
  "focus_area": "<focus area>",
  "sub_focus_area": "<subfocus area>",
  "reasoning": "<your explanation>"
}}
"""

In [68]:
completion = client.chat.completions.create(
    model='gpt-4o',
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": user_prompt
        }
    ]
)
# print(completion.choices[0].message)




In [69]:
print(completion.choices[0].message.content)

{
  "question": "What happens to your Membership Rewards points if your Card Account is canceled due to inactivity?",
  "answer": "If your Card Account is canceled due to inactivity, you will have 90 days to use the points in your Rewards Account before losing them. This grace period allows you to redeem any accumulated points before they are forfeited.",
  "target_audience": "intermediate",
  "focus_area": "account management",
  "sub_focus_area": "points retention",
  "reasoning": "This Q&A pair is valuable because it addresses a common concern about what happens to accumulated rewards points if an account is canceled due to inactivity. Intermediate users who have been accumulating points may not be aware of the specific rules regarding point retention and potential loss upon account cancellation, and this answer provides clear guidance."
}
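
Chat models sometimes wrap the JSON in extra prose, so it can help to parse the completion defensively before using it. A minimal sketch using only the standard library; `parse_qa_json` and `REQUIRED_KEYS` are hypothetical names, and the sample text is made up for the demo:

```python
import json

REQUIRED_KEYS = {"question", "answer", "reasoning"}


def parse_qa_json(text):
    """Decode the first JSON object found in model output and check its keys."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")
    # raw_decode parses one JSON value starting at `start` and ignores the rest.
    parsed, _ = json.JSONDecoder().raw_decode(text, start)
    missing = REQUIRED_KEYS - set(parsed)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return parsed


sample_output = (
    "Here is the result:\n"
    '{"question": "Q?", "answer": "A.", "reasoning": "R."}\n'
    "Let me know if you need more."
)
parsed_pair = parse_qa_json(sample_output)
print(parsed_pair["question"])  # Q?
```

In the notebook this would be called as `parse_qa_json(completion.choices[0].message.content)`. Only the first object is returned, so a response containing several Q&A pairs would need a loop over successive `raw_decode` calls.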


# Evaluation

1. Fine-tune and test on a holdout dataset?
  - BLEU/ROUGE scores? BLEU is primarily used for machine translation tasks, while ROUGE is used for text summarization tasks.
2. Compare intrinsic properties of the datasets:
  - diversity coverage (topic distribution, keyword distributions, question types, answer styles, content analysis)
  - quality annotations (ambiguity detection, error analysis)
  - linguistic and structural properties (readability scores, syntactic complexity, length and structure distribution)
  - vocabulary diversity (type/token ratio, vocabulary distribution)
  - semantic consistency
  - redundancy and duplication checks
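
Two of the intrinsic checks listed above, type/token ratio and duplicate detection, can be sketched with the standard library alone. The tokenizer is a deliberately simple assumption (lowercased runs of letters, digits, and apostrophes), duplicates are matched only after that normalization, and the sample questions are made up; near-duplicate detection would need something stronger, such as embedding similarity.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def type_token_ratio(texts):
    """Unique tokens divided by total tokens across all texts."""
    tokens = [token for text in texts for token in tokenize(text)]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def find_duplicates(texts):
    """Return normalized texts that occur more than once, with their counts."""
    counts = Counter(" ".join(tokenize(text)) for text in texts)
    return {text: n for text, n in counts.items() if n > 1}


questions = [
    "How do I earn points?",
    "how do I earn points",
    "When do points expire?",
]
print(type_token_ratio(questions))  # 0.5
print(find_duplicates(questions))   # {'how do i earn points': 2}
```

Comparing these numbers between the Data Designer output and the single-prompt OpenAI output gives a rough first view of vocabulary diversity and redundancy in each dataset.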


value add?
-