# 02 - Synthetic BRD Data Generation

This notebook generates synthetic Business Requirements Documents (BRDs), each paired with effort, timeline, and cost estimates.

## What we'll do:
1. Set up API client (Claude or GPT-4)
2. Create generation prompts
3. Generate diverse BRD documents
4. Record ground-truth labels
5. Create data augmentations
6. Save dataset

## Strategy:
We'll generate 1,000 diverse BRDs covering:
- Different project types (web, mobile, API, data pipelines)
- Various industries (finance, healthcare, e-commerce, etc.)
- Different scales (budgets from roughly $5K to $280K, given the hour ranges and hourly rates defined below)
- Multiple document styles

## 1. Setup

In [None]:
import anthropic
import json
import random
from typing import Dict, List
from tqdm.notebook import tqdm
import os
from datetime import datetime
import time

# Set random seed for reproducibility
random.seed(42)

## 2. Configure API

**Choose one:**
- Anthropic Claude (recommended)
- OpenAI GPT-4

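Set your API key as an environment variable (preferred) or paste it below. The generation functions in this notebook call the Anthropic client, so using OpenAI means adapting those calls as well.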

In [None]:
# Option 1: Anthropic Claude
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY") or "your-api-key-here"
client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)

# Option 2: OpenAI GPT-4 (uncomment if using)
# import openai
# OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") or "your-api-key-here"
# openai.api_key = OPENAI_API_KEY

print("✓ API client configured")

## 3. Define Project Parameters

These will be used to create diverse BRDs.

In [None]:
# Project types
PROJECT_TYPES = [
    "Web Application",
    "Mobile Application (iOS/Android)",
    "REST API Development",
    "Data Pipeline/ETL",
    "Machine Learning Model",
    "E-commerce Platform",
    "CRM System",
    "Dashboard/Analytics Tool",
    "Payment Integration",
    "Authentication System",
]

# Industries
INDUSTRIES = [
    "Financial Services",
    "Healthcare",
    "E-commerce",
    "Education",
    "Real Estate",
    "Manufacturing",
    "Retail",
    "Media & Entertainment",
    "Travel & Hospitality",
    "SaaS",
]

# Complexity levels (affects timeline and cost)
COMPLEXITY = [
    {"level": "Simple", "hours_range": (80, 300), "hourly_rate": 75},
    {"level": "Medium", "hours_range": (300, 800), "hourly_rate": 100},
    {"level": "Complex", "hours_range": (800, 2000), "hourly_rate": 125},
]

# Team sizes
TEAM_SIZES = [1, 2, 3, 4, 5]

print(f"✓ Defined {len(PROJECT_TYPES)} project types")
print(f"✓ Defined {len(INDUSTRIES)} industries")
print(f"✓ Defined {len(COMPLEXITY)} complexity levels")
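
As a quick sanity check on these parameters, the sketch below computes the lowest and highest cost each complexity level can produce. It repeats the `COMPLEXITY` table so it runs on its own, and it assumes the hourly rate varies by up to 15 in either direction, as in the generation function in the next section. The names `RATE_VARIATION` and `cost_bounds` are illustrative and not part of the notebook.

```python
# Copy of the notebook's complexity table, repeated so this block runs on its own.
COMPLEXITY = [
    {"level": "Simple", "hours_range": (80, 300), "hourly_rate": 75},
    {"level": "Medium", "hours_range": (300, 800), "hourly_rate": 100},
    {"level": "Complex", "hours_range": (800, 2000), "hourly_rate": 125},
]

# Assumption: mirrors the +/-15 hourly-rate variation applied in generate_brd.
RATE_VARIATION = 15


def cost_bounds(level: dict, variation: int = RATE_VARIATION) -> tuple:
    """Return the (lowest, highest) cost in USD that a complexity level can produce."""
    min_hours, max_hours = level["hours_range"]
    lowest = min_hours * (level["hourly_rate"] - variation)
    highest = max_hours * (level["hourly_rate"] + variation)
    return lowest, highest


for level in COMPLEXITY:
    low, high = cost_bounds(level)
    print(f"{level['level']:<8} ${low:>9,} to ${high:>9,}")
```

With the values above, the overall range runs from $4,800 (Simple, lowest rate) to $280,000 (Complex, highest rate).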

## 4. BRD Generation Function

In [None]:
def generate_brd(project_type: str, industry: str, complexity: Dict, team_size: int) -> Dict:
    """
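    Generate a synthetic BRD using the Claude API.

    Labels are sampled first and then passed to the model in the prompt,
    so the generated text is written to match them.

    Returns:
        Dict with 'brd_text', 'labels' (effort_hours, timeline_weeks, cost_usd),
        and 'metadata', or None if the API call fails.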
    """
    # Calculate realistic estimates
    effort_hours = random.randint(*complexity["hours_range"])
    hourly_rate = complexity["hourly_rate"] + random.randint(-15, 15)  # Add variation
    cost_usd = effort_hours * hourly_rate
    
    # Calculate timeline (assuming 40 hours/week per person)
    hours_per_week = team_size * 40
    timeline_weeks = max(1, round(effort_hours / hours_per_week))
    
    # Create generation prompt
    prompt = f"""Generate a realistic Business Requirements Document (BRD) for a {complexity['level'].lower()} complexity {project_type.lower()} project in the {industry} industry.

The BRD should be 2-3 paragraphs and include:
- Project overview and business objectives
- Key features and functional requirements
- Technical scope and deliverables
- Resource requirements (mention {team_size} team member(s))
- Timeline estimate (mention approximately {timeline_weeks} weeks)
- Effort estimate (mention approximately {effort_hours} hours total)
- Budget/cost estimate (mention approximately ${cost_usd:,})

Write it in a professional, business document style. Include some natural variations in how you mention the estimates (e.g., "estimated at", "projected", "approximately", "budget of", etc.).

Make it realistic and specific to the industry. Do NOT use a template format - write it as flowing prose."""
    
    try:
        # Call Claude API
        message = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1000,
            messages=[{"role": "user", "content": prompt}]
        )
        
        brd_text = message.content[0].text
        
        return {
            "brd_text": brd_text,
            "labels": {
                "effort_hours": float(effort_hours),
                "timeline_weeks": int(timeline_weeks),
                "cost_usd": float(cost_usd)
            },
            "metadata": {
                "project_type": project_type,
                "industry": industry,
                "complexity": complexity["level"],
                "team_size": team_size
            }
        }
    except Exception as e:
        print(f"Error generating BRD: {e}")
        return None

print("✓ BRD generation function defined")
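
The label arithmetic inside `generate_brd` can be checked without calling the API. The sketch below isolates it in a hypothetical helper, `derive_labels`, which takes fixed inputs instead of random ones; the 40-hour week matches the assumption stated in the function above.

```python
def derive_labels(effort_hours: int, hourly_rate: int, team_size: int,
                  hours_per_person_week: int = 40) -> dict:
    """Derive cost and timeline from effort, rate, and team size."""
    cost_usd = effort_hours * hourly_rate
    # Team capacity per week, assuming everyone works full time on the project.
    weekly_capacity = team_size * hours_per_person_week
    # A project never takes less than one week, however small it is.
    timeline_weeks = max(1, round(effort_hours / weekly_capacity))
    return {
        "effort_hours": float(effort_hours),
        "timeline_weeks": int(timeline_weeks),
        "cost_usd": float(cost_usd),
    }


labels = derive_labels(effort_hours=600, hourly_rate=100, team_size=3)
print(labels)
```

One detail worth knowing: Python's built-in `round` rounds exact halves to the nearest even integer, so 100 hours for a single person (2.5 weeks) becomes 2 weeks rather than 3.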

## 5. Test Generation

Let's test the function with one example.

In [None]:
# Generate one test BRD
test_brd = generate_brd(
    project_type="Web Application",
    industry="E-commerce",
    complexity=COMPLEXITY[1],  # Medium
    team_size=3
)

if test_brd:
    print("Test BRD Generated Successfully!")
    print("=" * 80)
    print("\nBRD TEXT:")
    print(test_brd["brd_text"])
    print("\n" + "=" * 80)
    print("\nLABELS:")
    print(json.dumps(test_brd["labels"], indent=2))
    print("\nMETADATA:")
    print(json.dumps(test_brd["metadata"], indent=2))
    print("=" * 80)
else:
    print("❌ Test generation failed. Check your API key and connection.")
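
Because the labels are only reliable if the model actually mentions them, it may help to confirm that each label value appears in the generated text. The sketch below is one strict way to do that; `mentioned_numbers` and `missing_labels` are hypothetical helpers, and the sample text is made up for illustration. A model that writes "five weeks" or "$60K" would be flagged, so treat a non-empty result as a prompt for manual review rather than proof of an error.

```python
import re


def mentioned_numbers(text: str) -> set:
    """Extract every integer mentioned in the text, ignoring thousands separators."""
    return {int(match.replace(",", "")) for match in re.findall(r"\d[\d,]*", text)}


def missing_labels(brd_text: str, labels: dict) -> list:
    """Return the names of labels whose numeric value never appears in the text."""
    found = mentioned_numbers(brd_text)
    return [name for name, value in labels.items() if int(value) not in found]


# Made-up sample in the same shape as the output of generate_brd.
sample_text = (
    "The project is estimated at 600 hours over approximately 5 weeks, "
    "with a budget of $60,000."
)
sample_labels = {"effort_hours": 600.0, "timeline_weeks": 5, "cost_usd": 60000.0}

print(missing_labels(sample_text, sample_labels))
```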

## 6. Generate Full Dataset

Now let's generate 1,000 diverse BRDs.

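**Note:** This takes roughly 30 minutes or longer, depending on API latency (the estimate printed below assumes about 2 seconds per sample), and costs approximately $2-5 in API credits.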

In [None]:
# Configuration
NUM_SAMPLES = 1000  # Adjust if needed
BATCH_SIZE = 50  # Save every 50 samples

print(f"Generating {NUM_SAMPLES} BRD documents...")
print(f"This will take approximately {NUM_SAMPLES * 2 / 60:.0f} minutes.\n")

dataset = []
failed = 0

for i in tqdm(range(NUM_SAMPLES), desc="Generating BRDs"):
    # Random selection
    project_type = random.choice(PROJECT_TYPES)
    industry = random.choice(INDUSTRIES)
    complexity = random.choice(COMPLEXITY)
    team_size = random.choice(TEAM_SIZES)
    
    # Generate BRD
    brd = generate_brd(project_type, industry, complexity, team_size)
    
    if brd:
        brd["id"] = i
        dataset.append(brd)
    else:
        failed += 1
    
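    # Save intermediate results: only the samples added since the last checkpoint,
    # so failed generations do not cause overlap between batch files
    if (i + 1) % BATCH_SIZE == 0:
        os.makedirs("../data/synthetic_brds", exist_ok=True)
        batch = [s for s in dataset if s["id"] > i - BATCH_SIZE]
        with open(f"../data/synthetic_brds/batch_{i+1}.json", "w") as f:
            json.dump(batch, f, indent=2)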
    
    # Rate limiting (adjust based on your API tier)
    time.sleep(0.5)  # Caps throughput at 2 requests per second; API latency lowers it further

print(f"\n✓ Generation complete!")
print(f"Successfully generated: {len(dataset)}")
print(f"Failed: {failed}")
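
The loop above counts a sample as failed after a single unsuccessful call. One option, sketched below under the assumption that failures are often transient, is to retry with exponential backoff before giving up. `call_with_retries` is a hypothetical helper, not part of the notebook; it relies on the fact that `generate_brd` returns `None` on failure. The `sleep` parameter exists only so the demo can run without waiting.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(fn: Callable[[], Optional[T]], max_attempts: int = 3,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> Optional[T]:
    """Call fn until it returns a non-None result or the attempts run out."""
    for attempt in range(max_attempts):
        result = fn()
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            # Double the wait after each failure: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
    return None


# Demo with a stand-in function that fails twice and then succeeds.
attempts = []


def flaky():
    attempts.append(1)
    return {"brd_text": "ok"} if len(attempts) >= 3 else None


delays = []
result = call_with_retries(flaky, max_attempts=4, base_delay=0.5, sleep=delays.append)
print(result, delays)

# In the notebook, the call inside the loop would become something like:
# brd = call_with_retries(lambda: generate_brd(project_type, industry, complexity, team_size))
```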

## 7. Save Complete Dataset

In [None]:
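# Save full dataset (create the output directory if it does not exist yet)
os.makedirs("../data/synthetic_brds", exist_ok=True)
output_file = f"../data/synthetic_brds/full_dataset_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"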

with open(output_file, "w") as f:
    json.dump(dataset, f, indent=2)

print(f"✓ Dataset saved to: {output_file}")
print(f"Total samples: {len(dataset)}")
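
After saving, it can be worth reloading the file and checking that every record has the expected shape. The sketch below does a round trip through a temporary file with two made-up records; `validate_sample` and `REQUIRED_LABELS` are hypothetical names, and the rules (non-empty text, positive numeric labels) are assumptions about what a valid sample looks like.

```python
import json
import os
import tempfile

REQUIRED_LABELS = ("effort_hours", "timeline_weeks", "cost_usd")


def validate_sample(sample: dict) -> list:
    """Return a list of problems found in one record (empty list means valid)."""
    problems = []
    text = sample.get("brd_text")
    if not isinstance(text, str) or not text.strip():
        problems.append("empty brd_text")
    labels = sample.get("labels", {})
    for name in REQUIRED_LABELS:
        value = labels.get(name)
        if not isinstance(value, (int, float)) or value <= 0:
            problems.append(f"bad label: {name}")
    return problems


# Made-up records: the first is valid, the second is deliberately broken.
records = [
    {"id": 0, "brd_text": "A sample BRD.",
     "labels": {"effort_hours": 600.0, "timeline_weeks": 5, "cost_usd": 60000.0}},
    {"id": 1, "brd_text": "",
     "labels": {"effort_hours": 100.0, "timeline_weeks": 0}},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dataset.json")
    with open(path, "w") as f:
        json.dump(records, f)
    with open(path) as f:
        reloaded = json.load(f)

report = {sample["id"]: validate_sample(sample) for sample in reloaded}
print(report)
```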

## 8. Dataset Statistics

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Create DataFrame for analysis
df = pd.DataFrame([
    {
        **sample["labels"],
        **sample["metadata"]
    }
    for sample in dataset
])

print("Dataset Statistics")
print("=" * 80)
print("\nLabel Statistics:")
print(df[["effort_hours", "timeline_weeks", "cost_usd"]].describe())

print("\nDistribution by Project Type:")
print(df["project_type"].value_counts())

print("\nDistribution by Industry:")
print(df["industry"].value_counts())

print("\nDistribution by Complexity:")
print(df["complexity"].value_counts())

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Effort hours distribution
axes[0, 0].hist(df["effort_hours"], bins=50, edgecolor='black')
axes[0, 0].set_title("Distribution of Effort Hours")
axes[0, 0].set_xlabel("Effort Hours")
axes[0, 0].set_ylabel("Frequency")

# Timeline distribution
axes[0, 1].hist(df["timeline_weeks"], bins=30, edgecolor='black', color='green')
axes[0, 1].set_title("Distribution of Timeline (Weeks)")
axes[0, 1].set_xlabel("Timeline (Weeks)")
axes[0, 1].set_ylabel("Frequency")

# Cost distribution
axes[1, 0].hist(df["cost_usd"], bins=50, edgecolor='black', color='orange')
axes[1, 0].set_title("Distribution of Cost (USD)")
axes[1, 0].set_xlabel("Cost (USD)")
axes[1, 0].set_ylabel("Frequency")

# Complexity distribution
df["complexity"].value_counts().plot(kind='bar', ax=axes[1, 1], color='purple')
axes[1, 1].set_title("Distribution by Complexity")
axes[1, 1].set_xlabel("Complexity Level")
axes[1, 1].set_ylabel("Count")
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.savefig("../data/synthetic_brds/dataset_statistics.png", dpi=300, bbox_inches='tight')
plt.show()

print("\n✓ Visualizations saved to: ../data/synthetic_brds/dataset_statistics.png")

## 9. Create Data Augmentations

Generate variations of existing BRDs to increase dataset size.

In [None]:
def augment_brd(original_brd: Dict) -> Dict:
    """
    Create an augmented version by paraphrasing while keeping labels the same.
    """
    prompt = f"""Rewrite the following Business Requirements Document using different wording and structure, but keep the same information and estimates:

{original_brd['brd_text']}

Use different terminology and sentence structures, but maintain all the key information about effort, timeline, and cost."""
    
    try:
        message = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1000,
            messages=[{"role": "user", "content": prompt}]
        )
        
        augmented_text = message.content[0].text
        
        return {
            "brd_text": augmented_text,
            "labels": original_brd["labels"].copy(),
            "metadata": {
                **original_brd["metadata"],
                "augmented": True,
                "original_id": original_brd["id"]
            }
        }
    except Exception as e:
        print(f"Error augmenting BRD: {e}")
        return None

# Augment 200 random samples (20% of dataset)
NUM_AUGMENTATIONS = 200
samples_to_augment = random.sample(dataset, min(NUM_AUGMENTATIONS, len(dataset)))

print(f"Creating {len(samples_to_augment)} augmented samples...\n")

augmented_data = []
for sample in tqdm(samples_to_augment, desc="Augmenting"):
    aug = augment_brd(sample)
    if aug:
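        # Original ids lie in range(NUM_SAMPLES), so offsetting by NUM_SAMPLES avoids
        # collisions even when some generations failed and len(dataset) < NUM_SAMPLES
        aug["id"] = NUM_SAMPLES + len(augmented_data)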
        augmented_data.append(aug)
    time.sleep(0.5)

print(f"\n✓ Created {len(augmented_data)} augmented samples")

# Combine original and augmented
full_dataset = dataset + augmented_data

print(f"✓ Total dataset size: {len(full_dataset)}")
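
Original ids come from the loop index while failed generations are skipped, so it is worth confirming that ids stay unique once originals and augmentations are combined, and that every augmentation points at an original that exists. The sketch below shows one way to check both; `check_ids` is a hypothetical helper and the four records are made up to trigger each problem once.

```python
from collections import Counter


def check_ids(samples: list) -> dict:
    """Report duplicate ids and augmented samples whose original is missing."""
    id_counts = Counter(s["id"] for s in samples)
    duplicates = sorted(i for i, n in id_counts.items() if n > 1)

    original_ids = {
        s["id"] for s in samples if not s.get("metadata", {}).get("augmented")
    }
    orphans = sorted(
        s["id"] for s in samples
        if s.get("metadata", {}).get("augmented")
        and s["metadata"]["original_id"] not in original_ids
    )
    return {"duplicate_ids": duplicates, "orphaned_augmentations": orphans}


# Made-up records: id 1 is absent (a failed generation), one augmentation
# reuses id 2, and another points at the missing original 1.
samples = [
    {"id": 0, "metadata": {}},
    {"id": 2, "metadata": {}},
    {"id": 2, "metadata": {"augmented": True, "original_id": 0}},
    {"id": 3, "metadata": {"augmented": True, "original_id": 1}},
]

print(check_ids(samples))
```

On the real data the call would be `check_ids(full_dataset)`, and both lists should come back empty.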

## 10. Save Final Dataset with Augmentations

In [None]:
# Save complete dataset with augmentations
final_output = f"../data/synthetic_brds/complete_dataset_{len(full_dataset)}_samples.json"

with open(final_output, "w") as f:
    json.dump(full_dataset, f, indent=2)

print(f"✓ Complete dataset saved to: {final_output}")
print(f"\nFinal Statistics:")
print(f"  Original samples: {len(dataset)}")
print(f"  Augmented samples: {len(augmented_data)}")
print(f"  Total: {len(full_dataset)}")

## Summary

### What we've created:
- ✓ Generated up to 1,000 diverse synthetic BRDs (fewer if any API calls failed)
- ✓ Created up to 200 augmented variations
- ✓ Total dataset: up to 1,200 samples
- ✓ Covered 10 project types across 10 industries
- ✓ 3 complexity levels with realistic estimates
- ✓ Saved dataset statistics and visualizations

### Next Steps:
Move on to `03_data_preparation.ipynb` to format this data for training.

### Notes:
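- BRD texts vary in wording because each one comes from a separate API call
- Labels are sampled from the project parameters before the text is generated
- Augmentations add paraphrased variants that share labels with their originals
- Categories are sampled uniformly at random, so the dataset is roughly (not exactly) balanced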