# Generate QA Dataset from PDFs

This notebook helps you create a QA dataset for evaluation by automatically generating questions and reference answers from your uploaded PDFs.

## How it works:
1. The notebook asks the RAG system a few seed questions to retrieve diverse content from your uploaded PDFs
2. An LLM generates questions based on that content
3. The RAG system generates a reference answer for each question
4. The dataset is saved in the required JSON format
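
For orientation, each saved record has an `id`, a `question`, a reference `answer`, and an optional `context`. The sketch below shows that shape and how it serializes. The `QARecord` type name and the sample record are made up for illustration and are not part of the notebook.

```python
import json
from typing import List, TypedDict


class QARecord(TypedDict, total=False):
    # Keys written by the save step; "context" is left out when empty.
    id: int
    question: str
    answer: str
    context: str


# Hypothetical sample record, only to show the layout.
records: List[QARecord] = [
    {"id": 1, "question": "What is feature selection?", "answer": "A reference answer."},
]

# Same serialization settings as the save step: indented, non-ASCII preserved.
serialized = json.dumps(records, indent=2, ensure_ascii=False)
print(serialized)
```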

## Setup
- FastAPI server must be running on `http://localhost:8000`
- PDFs must be uploaded to the system
- An OpenAI API key must be set in `OPENAI_API_KEY`, for example in a `.env` file (this notebook uses it to generate the questions)


In [47]:
# Import libraries
import os
import json
import requests
from typing import List, Dict
from pathlib import Path
from dotenv import load_dotenv

load_dotenv()

# Configuration
API_BASE_URL = "http://localhost:8000"
OUTPUT_DIR = Path(".")
OUTPUT_DIR.mkdir(exist_ok=True)

print("✓ Libraries imported")
print(f"API Base URL: {API_BASE_URL}")


✓ Libraries imported
API Base URL: http://localhost:8000


## Configuration

Set how many questions to generate, the output filename, and the question types to cover.


In [48]:
# Configuration
NUM_QUESTIONS = 50  # Number of questions to generate
OUTPUT_FILE = "qa_dataset.json"  # Output filename
QUESTION_TYPES = [
    "factual",      # What is X? Who is Y?
    "procedural",   # How to do X? What are the steps?
    "conceptual",   # Explain X. Why is Y important?
    "comparative",  # Compare X and Y. What's the difference?
    "summary"       # Summarize X. What is this about?
]

print("Configuration:")
print(f"  Number of questions: {NUM_QUESTIONS}")
print(f"  Output file: {OUTPUT_FILE}")
print(f"  Question types: {', '.join(QUESTION_TYPES)}")


Configuration:
  Number of questions: 50
  Output file: qa_dataset.json
  Question types: factual, procedural, conceptual, comparative, summary


## Check Server Status

Verify that the server is running and PDFs are uploaded.


In [49]:
# Check server status
try:
    response = requests.get(f"{API_BASE_URL}/api/status", timeout=5)
    if response.status_code == 200:
        status = response.json()
        print("✓ Server is running")
        print(f"  Documents uploaded: {status.get('total_documents', 0)}")
        if status.get('total_documents', 0) == 0:
            print("\n⚠️  WARNING: No documents uploaded!")
            print("   Please upload PDFs via the web interface before generating questions.")
    else:
        print(f"✗ Server error: {response.status_code}")
except Exception as e:
    print(f"✗ Cannot connect to server: {e}")
    print(f"   Make sure the server is running on {API_BASE_URL}")


✓ Server is running
  Documents uploaded: 112


## Generate Questions and Answers

This step uses the OpenAI API to generate the questions and the RAG system to generate the reference answers.
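
The reference-answer step depends on the shape of the `/api/ask` response: an `answer` string and a list of `citations`, each with `text_snippet`, `source`, `pdf_filename`, and `page`. The sketch below turns one such response into a dataset record, assuming that shape. The helper `build_record` and the sample payload are hypothetical. Skipping empty answers is an extra assumption added here.

```python
from typing import Any, Dict, Optional

NO_ANSWER_MARKER = "couldn't find relevant information"


def build_record(question: str, data: Dict[str, Any], next_id: int) -> Optional[Dict[str, Any]]:
    """Convert an /api/ask response dict into a QA record, or None to skip it."""
    answer = data.get("answer", "")
    # Assumption: an empty answer is treated like the "no answer" marker.
    if not answer or NO_ANSWER_MARKER in answer.lower():
        return None
    citations = data.get("citations", [])
    # Join short snippets from the first three citations as context.
    context = " ".join(c.get("text_snippet", "")[:200] for c in citations[:3]).strip()
    return {
        "id": next_id,
        "question": question,
        "answer": answer,
        "context": context or None,
        "citations": [
            {"source": c.get("source"), "pdf": c.get("pdf_filename"), "page": c.get("page")}
            for c in citations[:5]
        ],
    }


# Hypothetical payload, standing in for response.json().
sample = {
    "answer": "Example answer text.",
    "citations": [
        {"text_snippet": "Example snippet.", "source": "doc", "pdf_filename": "a.pdf", "page": 3}
    ],
}
record = build_record("Example question?", sample, next_id=1)
print(record["context"])
```

Keeping this logic in a pure function makes it possible to test it without a running server.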


In [50]:
from openai import OpenAI

# Initialize OpenAI client for generating questions
openai_key = os.getenv("OPENAI_API_KEY")
if not openai_key:
    raise ValueError("OPENAI_API_KEY not found in environment variables")

client = OpenAI(api_key=openai_key)

def generate_qa_pairs(num_questions: int, question_types: List[str]) -> List[Dict]:
    """
    Generate question-answer pairs from PDF content.
    
    Strategy:
    1. First, get diverse content from PDFs by asking various questions
    2. Use LLM to generate questions based on the content
    3. Use RAG system to generate reference answers
    """
    print(f"\n{'='*60}")
    print(f"Generating {num_questions} QA pairs")
    print(f"{'='*60}\n")
    
    # Step 1: Get diverse content by asking various seed questions
    print("Step 1: Retrieving diverse content from PDFs...")
    seed_questions = [
        "What are the main topics covered?",
        "What are the key concepts?",
        "What procedures or steps are described?",
        "What are important definitions?",
        "What comparisons or analyses are made?",
    ]
    
    # Get answers to seed questions to understand content
    content_summaries = []
    for seed_q in seed_questions[:3]:  # Use first 3 to get content overview
        try:
            response = requests.post(
                f"{API_BASE_URL}/api/ask",
                json={"question": seed_q},
                timeout=60
            )
            if response.status_code == 200:
                data = response.json()
                content_summaries.append({
                    "question": seed_q,
                    "answer": data.get("answer", ""),
                    "citations": data.get("citations", [])
                })
        except Exception as e:
            print(f"  Warning: Could not get answer for '{seed_q}': {e}")
    
    if not content_summaries:
        raise ValueError("Could not retrieve content from PDFs. Make sure PDFs are uploaded.")
    
    print(f"  ✓ Retrieved content from {len(content_summaries)} seed questions")
    
    # Step 2: Generate questions using LLM
    print("\nStep 2: Generating questions using LLM...")
    
    # Build context from content summaries
    context_text = "\n\n".join([
        f"Content {i+1}:\nQ: {item['question']}\nA: {item['answer'][:500]}..."
        for i, item in enumerate(content_summaries)
    ])
    
    # Create prompt for question generation
    prompt = f"""You are creating a QA dataset for evaluating a RAG (Retrieval-Augmented Generation) system.

Based on the following content extracted from PDFs, generate exactly {num_questions} diverse questions.

REQUIREMENTS:
1. Questions should cover different aspects: {', '.join(question_types)}
2. Questions should be answerable based on the provided content
3. Mix question types: factual, procedural, conceptual, comparative, summary
4. Questions should be clear and specific
5. Format as a JSON object with a "questions" key containing an array:

{{
  "questions": [
    {{
      "question": "Your question here?",
      "question_type": "factual|procedural|conceptual|comparative|summary",
      "expected_topics": ["topic1", "topic2"]
    }},
    ...
  ]
}}

Return ONLY valid JSON, no additional text.

Content from PDFs:
{context_text}

Generate {num_questions} questions now:"""

    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are an expert at creating evaluation questions. Always return valid JSON only."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.8,
            response_format={"type": "json_object"}
        )
        
        result_text = response.choices[0].message.content.strip()
        
        # Parse JSON response
        try:
            result_json = json.loads(result_text)
            # Extract questions list from JSON object
            if isinstance(result_json, dict):
                questions_list = result_json.get("questions", [])
                # If "questions" key doesn't exist, try to find any list value
                if not questions_list:
                    for value in result_json.values():
                        if isinstance(value, list):
                            questions_list = value
                            break
            elif isinstance(result_json, list):
                questions_list = result_json
            else:
                raise ValueError("Unexpected JSON structure")
        except json.JSONDecodeError:
            # If direct parsing fails, try to extract JSON from text
            import re
            json_match = re.search(r'\{.*"questions".*\[.*\].*\}', result_text, re.DOTALL)
            if json_match:
                result_json = json.loads(json_match.group(0))
                questions_list = result_json.get("questions", [])
            else:
                # Try to find just the array
                array_match = re.search(r'\[.*\]', result_text, re.DOTALL)
                if array_match:
                    questions_list = json.loads(array_match.group(0))
                else:
                    raise ValueError("Could not parse questions from LLM response")
        
        if not isinstance(questions_list, list):
            raise ValueError("LLM did not return a list of questions")
        
        print(f"  ✓ Generated {len(questions_list)} questions")
        
    except Exception as e:
        # Chain the original exception so the traceback shows the root cause
        raise RuntimeError(f"Error generating questions: {e}") from e
    
    # Step 3: Generate reference answers using RAG system
    print("\nStep 3: Generating reference answers using RAG system...")
    qa_pairs = []
    skipped_count = 0
    
    for i, q_item in enumerate(questions_list[:num_questions], 1):
        question = q_item.get("question", "")
        if not question:
            continue
        
        print(f"  [{i}/{min(len(questions_list), num_questions)}] Processing: {question[:50]}...", end=" ", flush=True)
        
        try:
            # Get answer from RAG system
            response = requests.post(
                f"{API_BASE_URL}/api/ask",
                json={"question": question},
                timeout=60
            )
            
            if response.status_code == 200:
                data = response.json()
                answer = data.get("answer", "")
                citations = data.get("citations", [])
                
                # Skip questions that couldn't find answers
                if "couldn't find relevant information" in answer.lower():
                    print("✗ (No answer found, skipping)")
                    skipped_count += 1
                    continue
                
                # Extract context from citations for reference
                context_parts = []
                for cit in citations[:3]:  # Use first 3 citations as context
                    context_parts.append(cit.get("text_snippet", "")[:200])
                context = " ".join(context_parts)
                
                qa_pairs.append({
                    "id": len(qa_pairs) + 1,  # Re-number to avoid gaps
                    "question": question,
                    "answer": answer,
                    "context": context if context else None,
                    "question_type": q_item.get("question_type", "factual"),
                    "citations": [{"source": cit.get("source"), "pdf": cit.get("pdf_filename"), "page": cit.get("page")} for cit in citations[:5]]
                })
                print("✓")
            else:
                print(f"✗ Error: {response.status_code}")
                
        except Exception as e:
            print(f"✗ Error: {str(e)[:50]}")
            continue
    
    if skipped_count > 0:
        print(f"\n  ⚠ Skipped {skipped_count} questions that couldn't find answers")
    
    print(f"\n✓ Generated {len(qa_pairs)} QA pairs")
    return qa_pairs

# Generate QA pairs
qa_dataset = generate_qa_pairs(NUM_QUESTIONS, QUESTION_TYPES)



Generating 50 QA pairs

Step 1: Retrieving diverse content from PDFs...
  ✓ Retrieved content from 3 seed questions

Step 2: Generating questions using LLM...
  ✓ Generated 50 questions

Step 3: Generating reference answers using RAG system...
  [1/50] Processing: What are the main topics covered in the Machine Le... ✓
  [2/50] Processing: What are some key concepts related to machine lear... ✓
  [3/50] Processing: What are the procedures described for Feature Sele... ✓
  [4/50] Processing: How do intrinsic feature selection methods functio... ✓
  [5/50] Processing: What are the pros of using intrinsic feature selec... ✓
  [6/50] Processing: What are the cons of intrinsic feature selection m... ✓
  [7/50] Processing: How is feature selection embedded in tree-based mo... ✓
  [8/50] Processing: What is the relationship between feature selection... ✓
  [9/50] Processing: Which models have intrinsic feature selection embe... ✓
  [10/50] Processing: What are the pros and cons of linear mod

## Review Generated Dataset

Preview the generated questions and answers.


In [51]:
# Preview dataset
print(f"\n{'='*60}")
print("Generated QA Dataset Preview")
print(f"{'='*60}\n")

for i, item in enumerate(qa_dataset[:3], 1):  # Show first 3
    print(f"Question {item['id']}: {item['question']}")
    print(f"Type: {item.get('question_type', 'N/A')}")
    print(f"Answer: {item['answer'][:150]}...")
    print(f"Citations: {len(item.get('citations', []))}")
    print()

if len(qa_dataset) > 3:
    print(f"... and {len(qa_dataset) - 3} more questions\n")

print(f"Total questions: {len(qa_dataset)}")



Generated QA Dataset Preview

Question 1: What are the main topics covered in the Machine Learning Interview Cheat Sheet?
Type: factual
Answer: Based on the provided context from the PDF documents, the main topics covered in the **Machine Learning Interview Cheat Sheet** are:

1. Data and feat...
Citations: 5

Question 2: What are some key concepts related to machine learning mentioned in the PDFs?
Type: factual
Answer: ### Key Concepts Related to Machine Learning:

- **Data and Feature Engineering**:
  - Involves handling missing values, feature selection, and addres...
Citations: 5

Question 3: What are the procedures described for Feature Selection?
Type: procedural
Answer: Based on the provided context from the PDF documents, the procedures described for **Feature Selection** include the following steps:

1. **Wrapper Fe...
Citations: 2

... and 47 more questions

Total questions: 50
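
Beyond the preview, it can help to check how the generated questions are spread across types. The sketch below assumes each item carries the `question_type` field set during generation. The helper name and the sample list are stand-ins; in the notebook you would pass `qa_dataset` instead.

```python
from collections import Counter
from typing import Dict, List


def count_question_types(dataset: List[Dict]) -> Counter:
    """Count items per question type; items without a type count as 'N/A'."""
    return Counter(item.get("question_type", "N/A") for item in dataset)


# Hypothetical stand-in for qa_dataset.
sample_dataset = [
    {"id": 1, "question_type": "factual"},
    {"id": 2, "question_type": "factual"},
    {"id": 3, "question_type": "procedural"},
    {"id": 4},
]
type_counts = count_question_types(sample_dataset)
for question_type, count in type_counts.most_common():
    print(f"{question_type}: {count}")
```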


## Save Dataset

Save the generated dataset to a JSON file.


In [52]:
# Save dataset
output_path = OUTPUT_DIR / OUTPUT_FILE

# Clean up the dataset (remove internal fields, keep only what's needed for evaluation)
clean_dataset = []
for item in qa_dataset:
    clean_item = {
        "id": item.get("id"),
        "question": item.get("question"),
        "answer": item.get("answer"),  # Reference answer
        "context": item.get("context")  # Optional context
    }
    # Only add context if it exists
    if not clean_item["context"]:
        del clean_item["context"]
    clean_dataset.append(clean_item)

with open(output_path, 'w', encoding='utf-8') as f:
    json.dump(clean_dataset, f, indent=2, ensure_ascii=False)

print(f"✓ Dataset saved to: {output_path}")
print(f"  Total questions: {len(clean_dataset)}")
print("\nYou can now use this file in llm_judge_evaluation.ipynb")
print(f"  Set DATASET_PATH = '{OUTPUT_FILE}' in the evaluation notebook")


✓ Dataset saved to: qa_dataset.json
  Total questions: 50

You can now use this file in llm_judge_evaluation.ipynb
  Set DATASET_PATH = 'qa_dataset.json' in the evaluation notebook
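
Before switching to the evaluation notebook, a quick read-back can confirm that the file parses and that every record has the expected keys. The sketch assumes records carry `id`, `question`, and `answer`. The helper `load_and_check` is hypothetical, and the demo writes to a temporary directory rather than touching `qa_dataset.json`.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

REQUIRED_KEYS = {"id", "question", "answer"}


def load_and_check(path: Path) -> List[Dict]:
    """Load a QA dataset and raise ValueError if a record is malformed."""
    with open(path, "r", encoding="utf-8") as f:
        dataset = json.load(f)
    if not isinstance(dataset, list):
        raise ValueError("Dataset must be a JSON array of records")
    for index, record in enumerate(dataset):
        if not isinstance(record, dict):
            raise ValueError(f"Record {index} is not a JSON object")
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"Record {index} is missing keys: {sorted(missing)}")
    return dataset


# Demo on a throwaway file so the real dataset is left untouched.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo_dataset.json"
    demo_path.write_text(
        json.dumps([{"id": 1, "question": "Q?", "answer": "A."}]), encoding="utf-8"
    )
    loaded = load_and_check(demo_path)
    print(f"Loaded {len(loaded)} record(s)")
```

In the notebook itself, the equivalent call would be `load_and_check(output_path)`.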


## Optional: Manual Editing

You can manually edit the generated dataset to:
- Remove questions you don't want
- Add your own questions
- Improve reference answers
- Add more context

Just edit the JSON file directly, or load it here, modify, and save again.
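
Removing or adding items by hand can leave gaps or duplicate `id` values. The sketch below renumbers the records after edits, assuming ids should run consecutively from 1 as they do in the generated file. `renumber_ids` is a hypothetical helper, not part of the notebook.

```python
from typing import Dict, List


def renumber_ids(dataset: List[Dict]) -> List[Dict]:
    """Return copies of the records with ids reassigned as 1..N in list order."""
    return [{**item, "id": new_id} for new_id, item in enumerate(dataset, start=1)]


# Hypothetical edited dataset with a gap and a duplicate id.
edited = [
    {"id": 1, "question": "Q1?", "answer": "A1"},
    {"id": 3, "question": "Q3?", "answer": "A3"},  # id 2 was removed by hand
    {"id": 3, "question": "New?", "answer": "A"},  # duplicate id from a manual append
]
renumbered = renumber_ids(edited)
print([item["id"] for item in renumbered])  # prints [1, 2, 3]
```

The function returns new dictionaries, so the list you loaded stays unchanged until you decide to save the renumbered copy.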


In [53]:
# Optional: Load, edit, and save dataset
# Uncomment and modify as needed

# # Load dataset
# with open(output_path, 'r', encoding='utf-8') as f:
#     dataset = json.load(f)
# 
# # Example: Remove a question
# # dataset = [item for item in dataset if item['id'] != 5]
# 
# # Example: Add a custom question
# # dataset.append({
# #     "id": max((item["id"] for item in dataset), default=0) + 1,  # avoids reusing an id after removals
# #     "question": "Your custom question?",
# #     "answer": "Your reference answer",
# #     "context": "Optional context"
# # })
# 
# # Save modified dataset
# with open(output_path, 'w', encoding='utf-8') as f:
#     json.dump(dataset, f, indent=2, ensure_ascii=False)
# 
# print("✓ Dataset updated")
