# 🤖 Synthetic Dataset Generator
## AI-Powered Synthetic Data Generation with Claude 3 Haiku
Generate custom synthetic datasets by simply describing your data schema. This tool uses Claude 3 Haiku to create realistic, diverse datasets for testing, development, and prototyping.

![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg) ![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)

## ✨ Features

- 🎯 Schema-Based Generation - Describe your data structure in plain English
- 🚀 Fast & Efficient - Powered by Claude 3 Haiku for cost-effective generation
- 📊 Batch Processing - Automatically splits requests above 50 records into batches (up to 200 records per run)
- 💾 Export Ready - Download as CSV for immediate use
- 🎨 User-Friendly UI - Built with Gradio for easy interaction
- 🔒 Secure - API key management via .env files
- 📝 Built-in Examples - Pre-configured schemas for common use cases

## 🌍 Use Cases

- 🧪 Testing & Development - Generate test data for applications
- 📈 Data Science - Create training datasets for ML models
- 🎓 Education - Generate sample datasets for learning
- 🏢 Prototyping - Quick data mockups for demos
- 🔬 Research - Synthetic data for experiments

## 🧠 Model

- AI Model: Anthropic's claude-3-haiku-20240307
- Task: Structured data generation based on natural language schemas
- Output Format: JSON arrays converted to Pandas DataFrames and CSV
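
Because the output arrives as a JSON array, it can help to check its shape before building a DataFrame from it. The following is a minimal sketch, not part of the tool itself; `check_records` is a hypothetical helper name, and the rule that every record should share the keys of the first one is an assumption about what a "clean" payload looks like.

```python
def check_records(data):
    """Return a list of problems found in a parsed JSON payload (empty if none)."""
    if not isinstance(data, list):
        return ["top-level value is not a JSON array"]

    problems = []
    expected_keys = None
    for index, record in enumerate(data):
        if not isinstance(record, dict):
            problems.append(f"record {index} is not an object")
            continue
        if expected_keys is None:
            # The first object defines the expected set of columns.
            expected_keys = set(record)
        elif set(record) != expected_keys:
            problems.append(f"record {index} has different keys than the first record")
    return problems


print(check_records([{"name": "Ada", "age": 34}, {"name": "Luis", "age": 52}]))
print(check_records([{"name": "Ada"}, {"city": "Austin"}, 3]))
```

An empty list means the payload is a uniform array of objects, which maps cleanly onto one DataFrame row per record and one column per key.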

## 🛠️ Requirements
### ⚙️ Hardware

- ✅ CPU is sufficient — No GPU required
- 💾 Minimal RAM (2GB+)

### 📦 Software

- Python 3.8 or higher
- Anthropic API Key
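
One quick way to confirm the software requirements before running the notebook is to check that each third-party module it imports can be found. This is a sketch under assumptions: the module names are taken from the import cells in this notebook (`dotenv` is the import name of the `python-dotenv` package), and `missing_modules` is a hypothetical helper.

```python
from importlib.util import find_spec

# Top-level import names used by the notebook's import cells.
REQUIRED_MODULES = ["gradio", "pandas", "dotenv", "anthropic"]


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if find_spec(name) is None]


# Prints an empty list when everything is installed.
print(missing_modules(REQUIRED_MODULES))
```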

### Refer to `README.md` if you run into errors

In [28]:
import gradio as gr
import json
import pandas as pd
from typing import List, Dict
import os
from dotenv import load_dotenv
import tempfile

In [29]:
# Load environment variables from .env file
load_dotenv()

True

In [30]:
# Get API key from environment
ANTHROPIC_API_KEY = os.getenv('ANTHROPIC_API_KEY')

In [31]:
# Import anthropic after other imports to avoid conflicts
try:
    from anthropic import Anthropic, APIError
except ImportError:
    import anthropic
    Anthropic = anthropic.Anthropic
    APIError = anthropic.APIError


In [32]:
# Initialize Anthropic client
def create_client(api_key: str):
    """Create Anthropic client with proper initialization"""
    try:
        # Try normal initialization
        return Anthropic(api_key=api_key)
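    except TypeError as e:
        if 'proxies' in str(e):
            # Workaround for an SDK/httpx mismatch: httpx 0.28 removed the
            # `proxies` argument that older anthropic releases still pass.
            import httpx
            # Supply our own httpx client so the SDK does not have to build one
            http_client = httpx.Client()
            return Anthropic(api_key=api_key, http_client=http_client)
        else:
            # Re-raise unrelated errors with the original traceback
            raise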


In [33]:
def generate_synthetic_data(
    api_key: str,
    schema_description: str,
    num_records: int,
    example_format: str = ""
) -> tuple:
    """
    Generate synthetic dataset using Claude 3 Haiku
    
    Args:
        api_key: Anthropic API key
        schema_description: Description of the data schema
        num_records: Number of records to generate
        example_format: Optional example of desired format
    
    Returns:
        tuple: (DataFrame, status message, csv_file_path)
    """
    try:
        # Create client
        client = create_client(api_key)
        
        # Construct the prompt
        example_section = f"\n\nExample format:\n{example_format}" if example_format else ""
        
        prompt = f"""Generate {num_records} synthetic data records based on the following schema:

{schema_description}{example_section}

Requirements:
1. Return ONLY a valid JSON array of objects
2. Each object should be one record matching the schema
3. Make the data realistic and diverse
4. Ensure data types are appropriate (strings, numbers, booleans, dates, etc.)
5. Do not include any explanation, only the JSON array

Generate exactly {num_records} records."""

        # Call Claude API with explicit parameters
        message = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=4096,
            messages=[
                {"role": "user", "content": prompt}
            ]
        )
        
        # Extract the response
        response_text = message.content[0].text
        
        # Try to parse JSON from the response
        # Sometimes Claude might wrap it in markdown code blocks
        if "```json" in response_text:
            json_str = response_text.split("```json")[1].split("```")[0].strip()
        elif "```" in response_text:
            json_str = response_text.split("```")[1].split("```")[0].strip()
        else:
            json_str = response_text.strip()
        
        # Parse JSON
        data = json.loads(json_str)
        
        # Convert to DataFrame
        df = pd.DataFrame(data)
        
        # Save to temporary CSV file with proper path
        fd, temp_path = tempfile.mkstemp(suffix='.csv', prefix='synthetic_data_')
        os.close(fd)  # Close the file descriptor
        
        # Write CSV to the temp file
        df.to_csv(temp_path, index=False)
        
        status = f"✅ Successfully generated {len(df)} records!"
        return df, status, temp_path
        
    except json.JSONDecodeError as e:
        return None, f"❌ Error parsing JSON: {str(e)}\n\nResponse received:\n{response_text[:500] if 'response_text' in locals() else 'N/A'}...", None
    except APIError as e:
        return None, f"❌ API Error: {str(e)}", None
    except Exception as e:
        return None, f"❌ Error: {type(e).__name__}: {str(e)}", None
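
The fence-stripping step in the function above can be isolated and tested on its own. The sketch below is one possible variant, not the notebook's exact logic: `extract_json_payload` is a hypothetical name, and it assumes the reply contains at most one fenced block, optionally tagged with a language name on the opening fence line.

```python
import json

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable


def extract_json_payload(response_text: str):
    """Parse JSON from a model reply that may be wrapped in a Markdown fence."""
    text = response_text.strip()
    if FENCE in text:
        # Keep only what sits between the first pair of fences.
        inner = text.split(FENCE, 2)[1]
        # Drop an optional language tag such as "json" on the opening fence line.
        first_line, _, rest = inner.partition("\n")
        text = rest if first_line.strip().isalpha() else inner
    return json.loads(text.strip())


wrapped = "Here is the data:\n" + FENCE + 'json\n[{"name": "Ada", "age": 34}]\n' + FENCE
print(extract_json_payload(wrapped))
print(extract_json_payload('[{"name": "Luis", "age": 52}]'))
```

Any text before the opening fence is discarded, so a short preamble from the model does not break parsing. A reply with no fence at all is passed to `json.loads` unchanged.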

In [34]:
def generate_batch_data(
    api_key: str,
    schema_description: str,
    total_records: int,
    example_format: str = "",
    batch_size: int = 50
) -> tuple:
    """
    Generate larger datasets in batches
    """
    all_data = []
    batches = (total_records + batch_size - 1) // batch_size
    
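    for i in range(batches):
        # Request only what is still missing; the model may return fewer
        # (or more) records than asked for.
        records_in_batch = min(batch_size, total_records - len(all_data))
        if records_in_batch <= 0:
            break
        df_batch, status, csv_path = generate_synthetic_data(
            api_key, schema_description, records_in_batch, example_format
        )
        
        if df_batch is None:
            return None, f"❌ Error in batch {i+1}: {status}", None
        
        # The per-batch CSV is not needed; only the combined file is returned
        if csv_path and os.path.exists(csv_path):
            os.remove(csv_path)
        all_data.extend(df_batch.to_dict('records'))
    
    # Trim any surplus so the result never exceeds the requested total
    final_df = pd.DataFrame(all_data[:total_records])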
    
    # Save final CSV with proper temp file handling
    fd, temp_path = tempfile.mkstemp(suffix='.csv', prefix='synthetic_data_batch_')
    os.close(fd)
    
    final_df.to_csv(temp_path, index=False)
    
    status = f"✅ Successfully generated {len(final_df)} records in {batches} batches!"
    return final_df, status, temp_path
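
The batch arithmetic used above (ceiling division for the batch count, with a smaller final batch) can be sketched in isolation. `plan_batches` is a hypothetical helper for illustration only; it assumes every request returns exactly the number of records asked for, which the model does not guarantee.

```python
def plan_batches(total_records: int, batch_size: int = 50) -> list:
    """Split total_records into per-request sizes of at most batch_size."""
    if total_records < 0 or batch_size <= 0:
        raise ValueError("total_records must be >= 0 and batch_size must be > 0")
    full_batches, remainder = divmod(total_records, batch_size)
    # All full batches first, then one smaller batch for whatever is left.
    return [batch_size] * full_batches + ([remainder] if remainder else [])


print(plan_batches(120))  # [50, 50, 20]
print(plan_batches(200))  # [50, 50, 50, 50]
```

The length of the returned list matches `(total_records + batch_size - 1) // batch_size`, the expression used for `batches` in `generate_batch_data`.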


In [39]:
# Create Gradio Interface
def create_interface():
    with gr.Blocks(title="Synthetic Dataset Generator", theme=gr.themes.Soft()) as demo:
        gr.Markdown("""
        # 🤖 Synthetic Dataset Generator
        ### Powered by Claude 3 Haiku
        
        Create custom synthetic datasets by describing your schema. Claude will generate realistic data matching your specifications.
        """)
        
        with gr.Row():
            with gr.Column(scale=1):
                # Show API key input only if not found in environment
                if not ANTHROPIC_API_KEY:
                    api_key_input = gr.Textbox(
                        label="Anthropic API Key",
                        type="password",
                        placeholder="sk-ant-...",
                        info="API key not found in .env file"
                    )
                else:
                    api_key_input = gr.Textbox(
                        label="Anthropic API Key",
                        type="password",
                        value=ANTHROPIC_API_KEY,
                        placeholder="Loaded from .env",
                        info="✅ API key loaded from environment",
                        interactive=False
                    )
                
                schema_input = gr.Textbox(
                    label="Data Schema Description",
                    placeholder="""Example: Generate customer data with:
- name (full name)
- email (valid email address)
- age (between 18-80)
- city (US cities)
- purchase_amount (between $10-$1000)
- join_date (dates in 2023-2024)""",
                    lines=10
                )
                
                example_input = gr.Textbox(
                    label="Example Format (Optional)",
                    placeholder="""{"name": "John Doe", "email": "john@example.com", "age": 35, "city": "New York", "purchase_amount": 299.99, "join_date": "2023-05-15"}""",
                    lines=4
                )
                
                num_records = gr.Slider(
                    minimum=1,
                    maximum=200,
                    value=10,
                    step=1,
                    label="Number of Records"
                )
                
                generate_btn = gr.Button("🚀 Generate Dataset", variant="primary")
            
            with gr.Column(scale=2):
                status_output = gr.Textbox(label="Status", lines=2)
                dataframe_output = gr.Dataframe(
                    label="Generated Dataset",
                    wrap=True
                )
                csv_output = gr.File(label="Download CSV", type="filepath")
        
        # Examples
        gr.Markdown("### 📝 Example Schemas")
        gr.Examples(
            examples=[
                [
                    """Generate employee records with:
- employee_id (format: EMP001, EMP002, etc.)
- name (full name)
- department (Engineering, Sales, Marketing, HR, Finance)
- salary (between $40,000-$150,000)
- hire_date (between 2020-2024)
- performance_rating (1-5)""",
                    10
                ],
                [
                    """Generate e-commerce product data with:
- product_id (format: PRD-XXXX)
- product_name (creative product names)
- category (Electronics, Clothing, Home, Books, Sports)
- price (between $5-$500)
- stock_quantity (between 0-1000)
- rating (1.0-5.0)
- num_reviews (0-500)""",
                    15
                ],
                [
                    """Generate student records with:
- student_id (format: STU2024XXX)
- name (full name)
- major (Computer Science, Biology, Business, Arts, Engineering)
- gpa (2.0-4.0)
- year (Freshman, Sophomore, Junior, Senior)
- credits_completed (0-120)""",
                    20
                ]
            ],
            inputs=[schema_input, num_records]
        )
        
        def generate_wrapper(api_key, schema, num_rec, example):
            # Use environment API key if available, otherwise use input
            final_api_key = ANTHROPIC_API_KEY or api_key
            
            if not final_api_key:
                return None, "❌ Please provide your Anthropic API key (either in .env file or input field)", None
            if not schema:
                return None, "❌ Please describe your data schema", None
            
            # The slider value may arrive as a float; normalize it to an int
            num_rec = int(num_rec)
            
            # For larger datasets, use batch generation
            if num_rec > 50:
                return generate_batch_data(final_api_key, schema, num_rec, example)
            else:
                return generate_synthetic_data(final_api_key, schema, num_rec, example)
        
        generate_btn.click(
            fn=generate_wrapper,
            inputs=[api_key_input, schema_input, num_records, example_input],
            outputs=[dataframe_output, status_output, csv_output]
        )
        
        gr.Markdown("""
        ---
        ### 💡 Tips:
        - Be specific about data types, ranges, and formats
        - Provide examples for better results
        - For large datasets (>50 records), generation happens in batches
        - Claude 3 Haiku is fast and cost-effective for this task
        
        ### 🔑 API Key Setup:
        Create a `.env` file in the same directory with:
        ```
        ANTHROPIC_API_KEY=your_api_key_here
        ```
        
        ### ⚠️ Troubleshooting:
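        If you see a "proxies" error, your `anthropic` package is likely too old for the installed `httpx` (httpx 0.28 removed the `proxies` argument). Upgrade the SDK:
        ```
        pip install --upgrade anthropic
        ```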
        """)
    
    return demo


In [40]:
demo = create_interface()
demo.launch()

* Running on local URL:  http://127.0.0.1:7867

To create a public link, set `share=True` in `launch()`.


