<a href="https://colab.research.google.com/github/ericphamhung-gretel/demos/blob/main/data_designer_sdk_qa_pairs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📚 Data Designer SDK: Generate Diverse Q&A Pairs

This notebook demonstrates a general approach for extracting Q&A pairs from any source document
(e.g. text, markdown, or PDF files). The generated Q&A pairs can be used for:
- **Instruction Tuning:** Training models with clear, self-contained examples.
- **Retrieval-Augmented Generation (RAG):** Enhancing retrieval systems with precise and context-supported Q&A pairs.
- **Search and FAQ Systems:** Powering natural language query systems and documentation.

> **Note:** The [Data Designer](https://docs.gretel.ai/create-synthetic-data/gretel-data-designer-beta) functionality demonstrated in this notebook is currently in **Early Preview**. To access these features and run this notebook, please [join the waitlist](https://gretel.ai/navigator/data-designer#waitlist).

# 📘 Getting Started

First, let's install and import the required packages:

In [1]:
%%capture
# Install required libraries (the %%capture cell magic must be the first line of the cell)
!pip install -qq langchain "unstructured[pdf]" smart_open git+https://github.com/gretelai/gretel-python-client

In [12]:
# Configuration
# -------------
# Define your source document(s) and the number of Q&A pairs to generate.
# You can replace this with your own documents in PDF, markdown, or text formats.

DOCUMENT_LIST = ["https://www.americanexpress.com/content/dam/amex/us/rewards/membership-rewards/mr-terms-conditions-11.01.2024.pdf?mrlinknav=footer-tandc"]
NUM_QA_PAIRS = 10

In [13]:
# Document Processing
# ------------------
# The DocumentProcessor class handles loading and chunking source documents for Q&A generation.
# We use langchain's RecursiveCharacterTextSplitter and unstructured.io for robust document parsing.

from typing import List, Union
from langchain.text_splitter import RecursiveCharacterTextSplitter
from unstructured.partition.auto import partition
from smart_open import open
import tempfile
import os

class DocumentProcessor:
    def __init__(self, chunk_size: int = 4192, chunk_overlap: int = 200):
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
        )

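    def parse_document(self, uri: str) -> str:
        """Parse a single document from URI into raw text."""
        # smart_open's open() reads local paths and remote URIs (e.g. https://) alike.
        with open(uri, 'rb') as file:
            content = file.read()

        # partition() takes a filename, so stage the bytes in a temporary file.
        with tempfile.NamedTemporaryFile(delete=False) as temp_file:
            temp_file.write(content)
            temp_path = temp_file.name

        try:
            elements = partition(temp_path)
        finally:
            os.unlink(temp_path)  # clean up even if parsing fails
        return "\n\n".join(str(element) for element in elements)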

    def process_documents(self, uris: Union[str, List[str]]) -> List[str]:
        """Process one or more documents into text chunks for Q&A generation."""
        if isinstance(uris, str):
            uris = [uris]

        all_chunks = []
        for uri in uris:
            text = self.parse_document(uri)
            chunks = self.text_splitter.split_text(text)
            all_chunks.extend(chunks)

        return all_chunks
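
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal, dependency-free sketch of fixed-size chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` additionally prefers to split on separators such as paragraph and line breaks, so its chunk boundaries will differ. The `sliding_chunks` helper is hypothetical and not part of langchain.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters, so text near a
    boundary appears in both neighbouring chunks.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return windows


demo_chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(demo_chunks)
# → ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts 7 characters after the previous one (`chunk_size - chunk_overlap`), which is why `hij`, `opq`, and `vwx` each appear twice.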

In [14]:
# Data Models
# -----------
# Define Pydantic models for structured output generation:
# 1. QAPair: Schema for question-answer pairs
# 2. EvalMetrics: Schema for scoring generation quality

from pydantic import BaseModel, Field

class QAPair(BaseModel):
    question: str = Field(..., description="A clear and concise question derived from the context.")
    answer: str = Field(..., description="A detailed and accurate answer fully supported by the context.")
    reasoning: str = Field(..., description="A brief explanation of why this Q&A pair is valuable.")


class EvalMetrics(BaseModel):
    clarity: int = Field(..., description="Clarity of the question", ge=1, le=5)
    factual_accuracy: int = Field(..., description="Factual accuracy of the answer", ge=1, le=5)
    comprehensiveness: int = Field(..., description="Completeness of the answer", ge=1, le=5)
    reasoning_quality: int = Field(..., description="Quality of the provided reasoning", ge=1, le=5)
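
As a rough, dependency-free illustration of the constraints that `EvalMetrics` describes (four integer scores, each between 1 and 5), the sketch below checks a raw JSON string by hand. The `check_eval_metrics` helper and the sample payloads are hypothetical; in the notebook itself these constraints are expressed through the Pydantic model.

```python
import json

METRIC_NAMES = ("clarity", "factual_accuracy", "comprehensiveness", "reasoning_quality")


def check_eval_metrics(raw: str) -> dict:
    """Parse a JSON string and enforce the same bounds as EvalMetrics (1 to 5)."""
    data = json.loads(raw)
    for name in METRIC_NAMES:
        if name not in data:
            raise ValueError(f"missing field: {name}")
        score = data[name]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(score, bool) or not isinstance(score, int):
            raise ValueError(f"{name} must be an integer, got {score!r}")
        if not 1 <= score <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {score}")
    return {name: data[name] for name in METRIC_NAMES}


good = '{"clarity": 5, "factual_accuracy": 4, "comprehensiveness": 4, "reasoning_quality": 5}'
bad = '{"clarity": 7, "factual_accuracy": 4, "comprehensiveness": 4, "reasoning_quality": 5}'

print(check_eval_metrics(good))
try:
    check_eval_metrics(bad)
except ValueError as err:
    print(f"rejected: {err}")
```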

In [15]:
# Setup & Configure Data Designer
# --------------------------------
# Initialize the Data Designer with a custom system message to ensure that the generated
# Q&A pairs are high-quality, context-supported, and tailored for a variety of downstream applications.


from gretel_client.navigator import DataDesigner

# Process document chunks
processor = DocumentProcessor(chunk_size=4192, chunk_overlap=200)
chunks = processor.process_documents(DOCUMENT_LIST)

# Initialize Data Designer
designer = DataDesigner(
    api_key="prompt",
    model_suite="llama-3.x",  # or "apache-2.0" as needed
    endpoint="https://api.gretel.cloud",
    special_system_instructions="""\
You are an expert at generating high-quality, context-supported Q&A pairs.
Your output should be clear, concise, and factually correct.
Ensure that every question is self-contained and every answer is comprehensive and derived solely from the provided context.
"""
)

# Add Seed Columns for Controlled Diversity

# Context: Document chunks.
designer.add_categorical_seed_column(
    name="context",
    description="A chunk of text extracted from the source document.",
    values=chunks
)

# Difficulty with nested Sophistication.
designer.add_categorical_seed_column(
    name="difficulty",
    description="The overall difficulty level of the question.",
    values=["easy", "medium", "hard"],
    subcategories=[
        {
            "name": "sophistication",
            "values": {
                "easy": ["basic", "straightforward"],
                "medium": ["intermediate", "moderately complex"],
                "hard": ["advanced", "sophisticated"]
            }
        }
    ]
)

# Question Style: Tone and approach.
designer.add_categorical_seed_column(
    name="question_style",
    description="The style or tone of the question.",
    values=["factual", "exploratory", "analytical", "comparative"]
)

# Target Audience: Language complexity.
designer.add_categorical_seed_column(
    name="target_audience",
    description="The intended audience for the Q&A pair.",
    values=["novice", "intermediate", "expert"]
)

# Response Format: Expected answer style.
designer.add_categorical_seed_column(
    name="response_format",
    description="The format of the answer expected (e.g., short, detailed, step-by-step).",
    values=["short", "detailed", "step-by-step", "list"]
)

# Define Generation Template for Q&A Pairs
designer.add_generated_data_column(
    name="qa_pair",
    generation_prompt=(
        "\n{context}\n\n"
        "Based on the above context, generate a high-quality Q&A pair. The question should be clear and concise, "
        "and tailored for an audience at the '{target_audience}' level. The overall difficulty should be '{difficulty}', "
        "with a corresponding sophistication level (e.g., {sophistication}). The question style should be '{question_style}', "
        "and the answer should be provided in a '{response_format}' format. "
        "Ensure that the answer is factually accurate, fully derived from the context, and comprehensive. "
        "Additionally, include a brief explanation (reasoning) on why this Q&A pair is valuable.\n\n"
        "Output the result in JSON format as follows:\n"
        '{{\n'
        '  "question": "<your question>",\n'
        '  "answer": "<your answer>",\n'
        '  "reasoning": "<your explanation>"\n'
        '}}\n'
    ),
    data_config={
        "type": "structured",
        "params": {"model": QAPair}
    }
)

# Optional: Define Evaluation Template for Q&A Pairs
designer.add_generated_data_column(
    name="eval_metrics",
    llm_type="judge",
    generation_prompt=(
        "\n{context}\n\n"
        "For the above Q&A pair:\n"
        "{qa_pair}\n\n"
        "Evaluate the following criteria on a scale of 1 (lowest) to 5 (highest):\n"
        "1. Clarity: How clear and understandable is the question?\n"
        "2. Factual Accuracy: Is the answer fully supported by the context?\n"
        "3. Comprehensiveness: Does the answer cover all necessary details?\n"
        "4. Reasoning Quality: Is the provided reasoning well-justified and clear?\n\n"
        "Output the scores in JSON format."
    ),
    data_config={
        "type": "structured",
        "params": {"model": EvalMetrics}
    }
)

# Preview a Sample of 10 Generated Records
preview = designer.generate_dataset_preview()
# preview.display_sample_record()


[16:58:27] [INFO] 🦜 Using llama-3.x model suite
Gretel API Key: ··········
Logged in as eric.phamhung@gretel.ai ✅
[16:58:30] [INFO] 🚀 Generating dataset preview
[16:58:31] [INFO] 📥 Step 1: Load data seeds
[16:58:31] [INFO] 🎲 Step 2: Sample data seeds
[16:58:31] [INFO] 🦜 Step 3: Generate column from template >> generating qa pair
[16:58:41] [INFO] 🦜 Step 4: Generate column from template >> generating eval metrics
[16:58:55] [INFO] 👀 Your dataset preview is ready for a peek!
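
To illustrate how the seed columns and the generation prompt fit together, the sketch below samples one value per seed column, picks a `sophistication` value conditioned on the sampled `difficulty`, and fills a shortened version of the prompt. This is only an approximation under stated assumptions: uniform sampling via `random.choice` and rendering via `str.format` are stand-ins, since the notebook does not show how Data Designer samples seeds or renders templates. The `sample_seeds` helper and the placeholder context are hypothetical.

```python
import random

SEEDS = {
    "difficulty": ["easy", "medium", "hard"],
    "question_style": ["factual", "exploratory", "analytical", "comparative"],
    "target_audience": ["novice", "intermediate", "expert"],
    "response_format": ["short", "detailed", "step-by-step", "list"],
}
# Subcategory values are keyed by the parent category's value.
SOPHISTICATION = {
    "easy": ["basic", "straightforward"],
    "medium": ["intermediate", "moderately complex"],
    "hard": ["advanced", "sophisticated"],
}

# Shortened stand-in for the notebook's prompt; doubled braces are literal braces.
PROMPT = (
    "\n{context}\n\n"
    "Generate a Q&A pair for a '{target_audience}' audience. Difficulty: '{difficulty}' "
    "({sophistication}). Style: '{question_style}'. Answer format: '{response_format}'.\n"
    "Output JSON as follows:\n"
    '{{\n  "question": "<your question>"\n}}\n'
)


def sample_seeds(rng: random.Random) -> dict:
    """Draw one value per seed column, then a subcategory tied to the difficulty."""
    record = {name: rng.choice(values) for name, values in SEEDS.items()}
    record["sophistication"] = rng.choice(SOPHISTICATION[record["difficulty"]])
    return record


rng = random.Random(0)  # fixed seed so the sketch is repeatable
record = sample_seeds(rng)
rendered = PROMPT.format(context="<document chunk goes here>", **record)
print(record)
print(rendered)
```

Under `str.format`, the doubled braces collapse to single literal braces, so the rendered prompt shows a plain JSON skeleton while the single-brace names are replaced by the sampled seed values.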


In [16]:
from rich.console import Console
from rich.table import Table
from collections import Counter
import pandas as pd

def generate_qa_report(df):
    console = Console()
    categories = ['difficulty', 'sophistication', 'question_style', 'target_audience', 'response_format']

    console.print("\n[bold blue]🧠 QA Pair Generation Report[/bold blue]", justify="center")
    console.print("=" * 80, justify="center")
    console.print(f"\n[bold]Total QA Pairs:[/bold] {len(df)}")

    for category in categories:
        if category in df.columns:
            # Only count non-empty values
            counts = Counter(x for x in df[category] if pd.notna(x) and x != '')
            if not counts:
                continue

            table = Table(title=f"\n{category.title()} Distribution")
            table.add_column("Category", style="cyan")
            table.add_column("Count", justify="right")
            table.add_column("Percentage", justify="right")

            total = sum(counts.values())
            for value, count in sorted(counts.items()):
                percentage = (count / total) * 100
                table.add_row(str(value), str(count), f"{percentage:.1f}%")

            console.print(table)

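    # Scores may arrive as flattened 'eval_metrics.<metric>' columns or as a single
    # 'eval_metrics' column holding dicts or JSON strings; expand the latter.
    if 'eval_metrics' in df.columns:
        import json  # local import keeps this fix self-contained
        parsed = [json.loads(v) if isinstance(v, str) else v for v in df['eval_metrics']]
        flat = pd.DataFrame([v if isinstance(v, dict) else {} for v in parsed], index=df.index)
        df = df.join(flat.add_prefix('eval_metrics.'))

    if any(col.startswith('eval_metrics.') for col in df.columns):
        metrics_table = Table(title="\nQuality Metrics Summary")
        metrics_table.add_column("Metric")
        metrics_table.add_column("Average Score", justify="right")

        metrics = ['clarity', 'factual_accuracy', 'comprehensiveness', 'reasoning_quality']
        for metric in metrics:
            col = f'eval_metrics.{metric}'
            if col in df.columns:
                scores = df[col].dropna()
                if len(scores) > 0:
                    avg_score = scores.mean()
                    metrics_table.add_row(
                        metric.replace('_', ' ').title(),
                        f"{avg_score:.2f}/5.00"
                    )

        console.print(metrics_table)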

generate_qa_report(preview.dataset)

In [17]:
# View the preview dataset as a dataframe
preview.dataset

| | context | difficulty | sophistication | question_style | target_audience | response_format | qa_pair | eval_metrics |
|---|---|---|---|---|---|---|---|---|
| 0 | arbitration organization and all parties in wr... | medium | moderately complex | exploratory | expert | short | {"question": "What is the process for appealin... | {"clarity": 5, "factual_accuracy": 5, "compreh... |
| 1 | 11\n\nLimitation on Arbitration\n\nArbitration... | easy | basic | comparative | intermediate | detailed | {"question": "What happens if I bring a claim ... | {"clarity": 5, "factual_accuracy": 5, "compreh... |
| 2 | Use Redeem for Deposits for a deposit into you... | hard | advanced | analytical | novice | short | {"question": "What is the minimum number of po... | {"clarity": 5, "factual_accuracy": 5, "compreh... |
| 3 | 3. Gasoline at gas stations in the U.S. (not i... | hard | advanced | analytical | novice | detailed | {"question": "What are the two categories that... | {"clarity": 5, "factual_accuracy": 5, "compreh... |
| 4 | If an airline loyalty program stops participat... | easy | basic | comparative | novice | list | {"question": "What are the requirements to use... | {"clarity": 5, "factual_accuracy": 5, "compreh... |
| 5 | 3. Gasoline at gas stations in the U.S. (not i... | easy | straightforward | factual | expert | list | {"question": "What are the categories that qua... | {"clarity": 5, "factual_accuracy": 5, "compreh... |
| 6 | Welcome to Membership Rewards!\n\nLast Updated... | medium | intermediate | comparative | novice | short | {"question": "What are the main differences in... | {"clarity": 4, "factual_accuracy": 5, "compreh... |
| 7 | 1 additional point (for a total of 2 points) o... | easy | straightforward | exploratory | novice | detailed | {"question": "What rewards do you earn on purc... | {"clarity": 5, "factual_accuracy": 5, "compreh... |
| 8 | 1\n\nCorporate Cards issued to more than one i... | medium | intermediate | comparative | intermediate | step-by-step | {"question": "How do Corporate Card Members an... | {"clarity": 5, "factual_accuracy": 5, "compreh... |
| 9 | 1 additional point (for a total of 2 points) o... | medium | moderately complex | comparative | intermediate | detailed | {"question": "What are the differences in earn... | {"clarity": 5, "factual_accuracy": 4, "compreh... |
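
The `qa_pair` and `eval_metrics` columns in the preview above appear to hold JSON objects rather than flat columns. A minimal sketch of unpacking them into flat records follows, assuming each cell is either a JSON string or an already-parsed dict. The two sample rows are made up for illustration, and `flatten_row` is a hypothetical helper.

```python
import json


def flatten_row(row: dict, nested_columns=("qa_pair", "eval_metrics")) -> dict:
    """Expand nested JSON columns into 'column.field' keys; copy other columns as-is."""
    flat = {}
    for key, value in row.items():
        if key in nested_columns:
            parsed = json.loads(value) if isinstance(value, str) else value
            for field, field_value in parsed.items():
                flat[f"{key}.{field}"] = field_value
        else:
            flat[key] = value
    return flat


# Illustrative rows only; real rows could come from preview.dataset.to_dict("records").
rows = [
    {
        "difficulty": "easy",
        "qa_pair": '{"question": "Q1?", "answer": "A1.", "reasoning": "R1."}',
        "eval_metrics": '{"clarity": 5, "factual_accuracy": 4}',
    },
    {
        "difficulty": "hard",
        "qa_pair": {"question": "Q2?", "answer": "A2.", "reasoning": "R2."},
        "eval_metrics": {"clarity": 4, "factual_accuracy": 5},
    },
]

flat_rows = [flatten_row(row) for row in rows]
mean_clarity = sum(r["eval_metrics.clarity"] for r in flat_rows) / len(flat_rows)
print(flat_rows[0]["qa_pair.question"], mean_clarity)
# → Q1? 4.5
```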


In [18]:
preview.display_sample_record(3)

In [21]:
df = preview.dataset
df



In [38]:
# Generate and Analyze Dataset
# ---------------------------
# Submit a batch workflow to generate the full set of Q&A pairs, report on the
# coverage and quality of the generated dataset, and save it to a JSONL file.
batch_job = designer.submit_batch_workflow(num_records=NUM_QA_PAIRS)
dataset = batch_job.fetch_dataset(wait_for_completion=True)

generate_qa_report(dataset)

dataset.to_json('qa_pairs.jsonl', orient='records', lines=True)


[17:29:07] [INFO] ⚙️ Configuring Data Designer Workflow steps:
[17:29:07] [INFO]   |-- Step 1: load-data-seeds-1
[17:29:07] [INFO]   |-- Step 2: sample-data-seeds-2
[17:29:07] [INFO]   |-- Step 3: generate-column-from-template-3-generating-qa-pair
[17:29:07] [INFO]   |-- Step 4: generate-column-from-template-4-generating-eval-metrics
[17:29:09] [INFO] 🛜 Connecting to your Gretel Project:
[17:29:09] [INFO] 🔗 -> https://console.gretel.ai/proj_2suABFX6JoDpwSC9A8KwFir6cVa
[17:29:10] [INFO] ▶️ Starting your workflow run to generate 10 records:
[17:29:10] [INFO]   |-- project_name: gretel-sdk-bd417-44ab09eab0fa0bb
[17:29:10] [INFO]   |-- project_id: proj_2suABFX6JoDpwSC9A8KwFir6cVa
[17:29:10] [INFO]   |-- workflow_run_id: wr_2suABaOto957CoV7vY2o24NWxKa
[17:29:10] [INFO] 🔗 -> https://console.gretel.ai/workflows/w_2suABOp9Kx5FLhKatpfSBIP9CH5/runs/wr_2suABaOto957CoV7vY2o24NWxKa
[17:29:12] [INFO] ⏳ Waiting for workflow step `generate-column-from-template-4-generating-eval-metrics` to complete...
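
Once `qa_pairs.jsonl` has been written, a common next step is to keep only the pairs the judge scored highly. The sketch below reads a JSON Lines file and filters on a minimum score; it writes a tiny stand-in file to a temporary directory so that it runs on its own. The threshold of 4, the `load_high_quality` helper, and the sample records are assumptions, and the code assumes `eval_metrics` is stored in each line as a nested object or a JSON string.

```python
import json
import os
import tempfile


def load_high_quality(path: str, min_score: int = 4) -> list:
    """Return records whose every eval_metrics score is at least min_score."""
    kept = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            metrics = record.get("eval_metrics") or {}
            if isinstance(metrics, str):  # scores stored as a JSON string
                metrics = json.loads(metrics)
            if metrics and all(score >= min_score for score in metrics.values()):
                kept.append(record)
    return kept


# Stand-in data, not real output from the workflow.
sample_records = [
    {"qa_pair": {"question": "Q1?"}, "eval_metrics": {"clarity": 5, "factual_accuracy": 5}},
    {"qa_pair": {"question": "Q2?"}, "eval_metrics": {"clarity": 3, "factual_accuracy": 5}},
    {"qa_pair": {"question": "Q3?"}, "eval_metrics": '{"clarity": 4, "factual_accuracy": 4}'},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "qa_pairs_sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as handle:
        for sample in sample_records:
            handle.write(json.dumps(sample) + "\n")
    high_quality = load_high_quality(sample_path)

print([r["qa_pair"]["question"] for r in high_quality])
# → ['Q1?', 'Q3?']
```

Requiring every score to clear the threshold is a deliberately strict choice; averaging the scores or filtering on `factual_accuracy` alone would be equally reasonable depending on the downstream use.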

In [1]:
from openai import OpenAI
from google.colab import userdata


In [8]:
client = OpenAI(
    base_url="https://litellm-proxy.dev.gretel.cloud/v1",
    api_key=userdata.get("GRETEL_API_KEY")
)

# [m.id for m in client.models.list()]

In [34]:
system_prompt = """
You are an expert at generating high-quality, context-supported Q&A pairs.
Your output should be clear, concise, and factually correct.
Ensure that every question is self-contained and every answer is comprehensive and derived solely from the provided context.
"""

# Join the chunks into one block of text; interpolating the list directly
# would embed its Python repr (brackets, quotes, escaped newlines) in the prompt.
context_text = "\n\n".join(chunks)

user_prompt = f"""
Context:
{context_text}

Based on the above context, generate {NUM_QA_PAIRS} high-quality Q&A pairs. Each question should be clear and concise, and tailored for an audience at the 'novice' level.
The overall difficulty should be 'easy', with a corresponding sophistication level of 'basic'. The question style should be
'factual', and each answer should be provided in a 'short' format. Ensure that each answer is factually accurate, fully derived from the context, and comprehensive.
Additionally, include a brief explanation (reasoning) of why each Q&A pair is valuable.
"""

In [35]:
completion = client.chat.completions.create(
    model='gpt-4o',
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": user_prompt
        }
    ]
)
print(completion.choices[0].message)




ChatCompletionMessage(content='1. **Q: Who operates the Membership Rewards® Program?**\n   - A: American Express Travel Related Services Company, Inc.\n   - *Reasoning:* This question establishes foundational knowledge about the organization behind the program, which is key for understanding the context of the rewards.\n\n2. **Q: Are terms and conditions for specific products included in the Membership Rewards® Program Terms and Conditions?**\n   - A: No, they are provided in separate documents received upon enrollment.\n   - *Reasoning:* This clarifies that product-specific information must be looked at separately from the general rewards program terms, aiding understanding of the documentation structure.\n\n3. **Q: Can American Express change the number of points required to redeem rewards?**\n   - A: Yes, they can change the number of points required to redeem rewards.\n   - *Reasoning:* This knowledge is essential for users to understand the flexibility and potential changes in the

In [36]:
print(completion.choices[0].message.content)

1. **Q: Who operates the Membership Rewards® Program?**
   - A: American Express Travel Related Services Company, Inc.
   - *Reasoning:* This question establishes foundational knowledge about the organization behind the program, which is key for understanding the context of the rewards.

2. **Q: Are terms and conditions for specific products included in the Membership Rewards® Program Terms and Conditions?**
   - A: No, they are provided in separate documents received upon enrollment.
   - *Reasoning:* This clarifies that product-specific information must be looked at separately from the general rewards program terms, aiding understanding of the documentation structure.

3. **Q: Can American Express change the number of points required to redeem rewards?**
   - A: Yes, they can change the number of points required to redeem rewards.
   - *Reasoning:* This knowledge is essential for users to understand the flexibility and potential changes in the value of their points.

4. **Q: What hap