# Synthetic Data Generator - Google Colab Version

This notebook provides a synthetic data generator using Hugging Face models with a Gradio interface. It can generate structured data based on templates or custom prompts.

## Setup Instructions

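1. **Get a Hugging Face API Key**:
   - Go to [Hugging Face](https://huggingface.co/settings/tokens)
   - Create a new token with read access
   - Copy the token
   - The default model, `meta-llama/Meta-Llama-3.1-8B-Instruct`, is gated: request access on its model page with the same account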

2. **Run the cells below in order**

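3. **Provide your API key**: store the token in Colab Secrets (key icon in the left sidebar) under the name `HF_TOKEN` and enable notebook access; section 3 reads it with `userdata.get('HF_TOKEN')`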


## 1. Install Dependencies


In [None]:
# Install required packages (quoted so the shell does not treat ">=" as an output redirect)
!pip install "gradio>=4.0.0" "httpx>=0.24.0" "python-dotenv>=1.0.0" "jsonschema>=4.19.0" "transformers>=4.30.0" "torch>=2.0.0" "huggingface-hub>=0.16.0"


## 2. Import Libraries and Setup


In [None]:
import os
import json
import re
from typing import Any, Dict, List, Optional, Tuple
from pathlib import Path

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from huggingface_hub import login
from jsonschema import validate as jsonschema_validate, ValidationError
import gradio as gr
from google.colab import userdata

print("Libraries imported successfully!")


## 3. API Key Configuration


In [None]:
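# Read the Hugging Face token from Colab Secrets (stored under the name HF_TOKEN)
HF_TOKEN = userdata.get('HF_TOKEN')

if not HF_TOKEN:
    raise RuntimeError("HF_TOKEN is empty; add it to Colab Secrets and enable notebook access.")
print("API key configured successfully!")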


## 4. Utility Functions


In [None]:
def safe_json_load(text: str) -> Optional[Any]:
    """
    Attempt to parse JSON. Heuristics:
      1) direct json.loads
      2) extract first [...] block
      3) extract first {...} block
      4) try to find JSON after common prefixes
    Returns parsed object or None.
    """
    # Clean up the text first
    text = text.strip()
    
    # Debug: Print what we're trying to parse
    print(f"DEBUG: Attempting to parse JSON from text (length: {len(text)})")
    print(f"DEBUG: First 200 chars: {text[:200]}")
    
    try:
        result = json.loads(text)
        print(f"DEBUG: Direct JSON parse successful")
        return result
    except Exception as e:
        print(f"DEBUG: Direct JSON parse failed: {e}")

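    # Try to find a JSON array first (most common for our use case).
    # Greedy match from the first '[' to the last ']' so nested arrays
    # (e.g. "requirements") are not cut off at the first ']'.
    arr_match = re.search(r'(\[.*\])', text, flags=re.S)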
    if arr_match:
        try:
            result = json.loads(arr_match.group(1))
            print(f"DEBUG: Array regex match successful")
            return result
        except Exception as e:
            print(f"DEBUG: Array regex match failed: {e}")

    # Try to find JSON object
    obj_match = re.search(r'(\{.*?\})', text, flags=re.S)
    if obj_match:
        try:
            result = json.loads(obj_match.group(1))
            print(f"DEBUG: Object regex match successful")
            return result
        except Exception as e:
            print(f"DEBUG: Object regex match failed: {e}")

    # Try to find JSON after common prefixes like "Here is the JSON:" or "```json"
    json_patterns = [
        r'(?:Here is the JSON:|```json|JSON:|Output:)\s*(\[.*?\])',
        r'(?:Here is the JSON:|```json|JSON:|Output:)\s*(\{.*?\})',
    ]
    
    for pattern in json_patterns:
        match = re.search(pattern, text, flags=re.S | re.I)
        if match:
            try:
                result = json.loads(match.group(1))
                print(f"DEBUG: Pattern match successful")
                return result
            except Exception as e:
                print(f"DEBUG: Pattern match failed: {e}")
                continue

    print(f"DEBUG: All parsing attempts failed")
    return None

print("Utility functions defined successfully!")
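
The regex heuristics above can cut a match short when the JSON contains nested brackets or braces. As an alternative sketch (not part of the notebook's code), the standard library's `json.JSONDecoder.raw_decode` can parse the first complete JSON value that starts at a `[` or `{`, ignoring any text around it. The name `extract_first_json` and the sample reply are hypothetical.

```python
import json
from typing import Any, Optional

def extract_first_json(text: str) -> Optional[Any]:
    """Return the first JSON array or object embedded in text, or None."""
    decoder = json.JSONDecoder()
    for start, ch in enumerate(text):
        if ch not in "[{":
            continue
        try:
            # raw_decode parses one complete value and ignores trailing text
            value, _end = decoder.raw_decode(text, start)
            return value
        except json.JSONDecodeError:
            continue  # not valid JSON from this position; try the next bracket
    return None

# Hand-written reply imitating a chatty model response (illustrative only)
reply = 'Here is the JSON:\n[{"title": "Engineer", "requirements": ["Python", "SQL"]}]\nHope this helps!'
print(extract_first_json(reply))
```

Because the decoder tracks nesting itself, the inner `requirements` array does not end the match early.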


## 5. Template Data


In [None]:
# Template data
TEMPLATES = {
    "user_profile": {
        "id": "user_profile",
        "schema": {
            "type": "object",
            "properties": {
                "first_name": { "type": "string" },
                "last_name": { "type": "string" },
                "email": { "type": "string", "format": "email" },
                "age": { "type": "integer", "minimum": 18, "maximum": 80 },
                "country": { "type": "string" }
            },
            "required": ["first_name", "last_name", "email", "age"]
        },
        "prompt_template": "Generate {{count}} user profile(s) as a JSON array matching this schema: {{schema}}. Tone: {{tone}}. Output ONLY the JSON array, no other text."
    },
    "job_description": {
        "id": "job_description",
        "schema": {
            "type": "object",
            "properties": {
                "title": { "type": "string" },
                "company": { "type": "string" },
                "location": { "type": "string" },
                "salary_range": { "type": "string" },
                "requirements": { "type": "array", "items": { "type": "string" } },
                "benefits": { "type": "array", "items": { "type": "string" } }
            },
            "required": ["title", "company", "location"]
        },
        "prompt_template": "Generate {{count}} job description(s) as a JSON array matching this schema: {{schema}}. Tone: {{tone}}. Output ONLY the JSON array, no other text."
    },
    "product_spec": {
        "id": "product_spec",
        "schema": {
            "type": "object",
            "properties": {
                "name": { "type": "string" },
                "category": { "type": "string" },
                "price": { "type": "number" },
                "description": { "type": "string" },
                "features": { "type": "array", "items": { "type": "string" } },
                "in_stock": { "type": "boolean" }
            },
            "required": ["name", "category", "price"]
        },
        "prompt_template": "Generate {{count}} product specification(s) as a JSON array matching this schema: {{schema}}. Tone: {{tone}}. Output ONLY the JSON array, no other text."
    },
    "address": {
        "id": "address",
        "schema": {
            "type": "object",
            "properties": {
                "street": { "type": "string" },
                "city": { "type": "string" },
                "state": { "type": "string" },
                "zip_code": { "type": "string" },
                "country": { "type": "string" }
            },
            "required": ["street", "city", "state", "zip_code"]
        },
        "prompt_template": "Generate {{count}} address(es) as a JSON array matching this schema: {{schema}}. Tone: {{tone}}. Output ONLY the JSON array, no other text."
    }
}

print("Template data loaded successfully!")
print(f"Available templates: {list(TEMPLATES.keys())}")
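
When adding a template of your own, a quick consistency check can catch typos before any model call. The sketch below is an assumption about what is worth checking, not part of the notebook: the helper `check_template` and the `sample` entry are hypothetical, and `sample` only mirrors the shape of the `TEMPLATES` entries above.

```python
from typing import Any, Dict, List

def check_template(template: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one template entry (empty if none)."""
    problems = []
    schema = template.get("schema", {})
    properties = schema.get("properties", {})
    # Every required field should be declared under "properties"
    for field in schema.get("required", []):
        if field not in properties:
            problems.append(f"required field '{field}' is not declared in properties")
    # The prompt should contain the placeholders used by the built-in templates
    for placeholder in ("{{count}}", "{{schema}}", "{{tone}}"):
        if placeholder not in template.get("prompt_template", ""):
            problems.append(f"prompt_template is missing {placeholder}")
    return problems

# Illustrative entry in the same shape as TEMPLATES (not one of the notebook's templates)
sample = {
    "id": "book",
    "schema": {
        "type": "object",
        "properties": {"title": {"type": "string"}, "pages": {"type": "integer"}},
        "required": ["title", "author"],
    },
    "prompt_template": "Generate {{count}} book(s) as a JSON array matching this schema: {{schema}}. Tone: {{tone}}.",
}
print(check_template(sample))
```

Here the check flags `author`, which is listed as required but never declared as a property.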


## 6. Core Classes


In [None]:
class TemplateRegistry:
    def __init__(self, templates: Dict[str, Any]):
        self.templates = templates

    def get_ids(self) -> List[str]:
        return list(self.templates.keys())

    def get_template(self, template_id: str) -> Optional[Dict[str, Any]]:
        return self.templates.get(template_id)

class PromptEngine:
    def build_prompt(self, template: Dict[str, Any], params: Dict[str, Any]) -> str:
        """
        Build a single prompt string.
        Template must include 'prompt_template' and 'schema'.
        params may include count, tone, etc.
        """
        tpl = template.get("prompt_template", "")
        schema = template.get("schema", {})
        # Render minimal placeholders: {{count}}, {{tone}}, {{schema}}
        prompt = tpl.replace("{{count}}", str(params.get("count", 1)))
        prompt = prompt.replace("{{tone}}", str(params.get("tone", "concise")))
        prompt = prompt.replace("{{schema}}", json.dumps(schema))
        return prompt

print("TemplateRegistry and PromptEngine classes defined!")
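
To see what text the model actually receives, the placeholder substitution in `build_prompt` can be reproduced in a few standalone lines. This is a sketch for inspection only; `render_prompt` is a hypothetical name that mirrors `build_prompt`, and the small template is made up for illustration.

```python
import json
from typing import Any, Dict

def render_prompt(template: Dict[str, Any], params: Dict[str, Any]) -> str:
    """Fill {{count}}, {{tone}} and {{schema}} the same way build_prompt does."""
    prompt = template.get("prompt_template", "")
    prompt = prompt.replace("{{count}}", str(params.get("count", 1)))
    prompt = prompt.replace("{{tone}}", str(params.get("tone", "concise")))
    return prompt.replace("{{schema}}", json.dumps(template.get("schema", {})))

# Made-up template in the same shape as the TEMPLATES entries
template = {
    "schema": {"type": "object", "properties": {"city": {"type": "string"}}},
    "prompt_template": "Generate {{count}} address(es) matching this schema: {{schema}}. Tone: {{tone}}.",
}
rendered = render_prompt(template, {"count": 2})
print(rendered)
```

Since `tone` is not supplied, the default `"concise"` is used, and the schema is embedded as a compact JSON string.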


In [None]:
class HFClient:
    def __init__(self, model_id: Optional[str] = None, api_key: Optional[str] = None):
        self.model_id = model_id or "meta-llama/Meta-Llama-3.1-8B-Instruct"
        self.api_key = api_key or os.getenv("HF_TOKEN")
        
        if not self.api_key:
            raise RuntimeError("HF_TOKEN not provided in environment.")
        
        print(f"DEBUG: Initializing HFClient with model: {self.model_id}")
        
        # Login to Hugging Face
        try:
            login(token=self.api_key, add_to_git_credential=True)
            print("DEBUG: Successfully logged in to Hugging Face")
        except Exception as e:
            print(f"DEBUG: Login failed: {e}")
            raise RuntimeError(f"Failed to login to Hugging Face: {e}")
        
        # Load tokenizer and model
        try:
            print("DEBUG: Loading tokenizer...")
            self.tokenizer = AutoTokenizer.from_pretrained(self.model_id)
            print("DEBUG: Loading model...")
            self.model = AutoModelForCausalLM.from_pretrained(
                self.model_id,
                dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
                device_map="auto" if torch.cuda.is_available() else None
            )
            print("DEBUG: Model loaded successfully")
        except Exception as e:
            print(f"DEBUG: Model loading failed: {e}")
            raise RuntimeError(f"Failed to load model {self.model_id}: {e}")

    def generate(self, prompt: str, temperature: float = 0.2, max_tokens: int = 512) -> Tuple[str, Dict[str, Any]]:
        print(f"DEBUG: Generating text with prompt length: {len(prompt)}")
        
        try:
            # Tokenize input and move it to the same device as the model
            inputs = self.tokenizer(prompt, return_tensors="pt")
            inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
            prompt_len = inputs["input_ids"].shape[1]

            # transformers rejects temperature=0 when do_sample=True,
            # so fall back to greedy decoding in that case
            if temperature > 0:
                sampling = {"do_sample": True, "temperature": temperature}
            else:
                sampling = {"do_sample": False}

            # Generate with the model
            with torch.no_grad():
                outputs = self.model.generate(
                    **inputs,
                    max_new_tokens=int(max_tokens),
                    pad_token_id=self.tokenizer.eos_token_id,
                    **sampling
                )

            # Decode only the newly generated tokens, skipping the prompt
            new_tokens = outputs[0][prompt_len:]
            generated_text = self.tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
            
            print(f"DEBUG: Generated text length: {len(generated_text)}")
            print(f"DEBUG: Generated text preview: {generated_text[:200]}")
            
            return generated_text, {"status": "success", "model": self.model_id}
            
        except Exception as e:
            print(f"DEBUG: Generation failed: {e}")
            return f"Generation error: {str(e)}", {"status": "error", "error": str(e)}

print("HFClient class defined!")


In [None]:
class Validator:
    def validate(self, schema: Dict[str, Any], data: Any) -> Dict[str, Any]:
        if not schema:
            return {"valid": True, "errors": []}
        try:
            # If top-level is array and schema expects object, validate each item
            if isinstance(data, list):
                for item in data:
                    jsonschema_validate(instance=item, schema=schema)
            else:
                jsonschema_validate(instance=data, schema=schema)
            return {"valid": True, "errors": []}
        except ValidationError as e:
            return {"valid": False, "errors": [str(e)]}
        except Exception as e:
            return {"valid": False, "errors": [str(e)]}

class Generator:
    def __init__(self, hf_client: HFClient, registry: TemplateRegistry, prompt_engine: PromptEngine, validator: Validator):
        self.hf = hf_client
        self.registry = registry
        self.engine = prompt_engine
        self.validator = validator

    def generate(self,
                 template_id: Optional[str],
                 params: Dict[str, Any],
                 custom_prompt: Optional[str] = None,
                 temperature: float = 0.2,
                 max_tokens: int = 512) -> Dict[str, Any]:
        # Load template or build ephemeral one
        if custom_prompt:
            template = {"id": "custom", "prompt_template": custom_prompt, "schema": {}}
        else:
            template = self.registry.get_template(template_id) or {"id": template_id, "prompt_template": "", "schema": {}}

        prompt = self.engine.build_prompt(template, params)
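        try:
            raw_text, meta = self.hf.generate(prompt, temperature=temperature, max_tokens=max_tokens)
        except Exception as exc:
            return {"status": "error", "output": None, "raw_model_text": "", "validation": {"valid": False, "errors": [str(exc)]}}

        # HFClient.generate reports failures through meta instead of raising
        if meta.get("status") == "error":
            return {"status": "error", "output": None, "raw_model_text": raw_text, "validation": {"valid": False, "errors": [meta.get("error", "generation_error")]}}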

        parsed = safe_json_load(raw_text)
        if parsed is None:
            return {"status": "error", "output": None, "raw_model_text": raw_text, "validation": {"valid": False, "errors": ["parse_error"]}}

        validation = self.validator.validate(template.get("schema", {}), parsed)
        return {"status": "ok", "output": parsed, "raw_model_text": raw_text, "validation": validation}

print("Validator and Generator classes defined!")


## 7. Initialize Components


In [None]:
# Initialize components
registry = TemplateRegistry(TEMPLATES)
prompt_engine = PromptEngine()
hf_client = HFClient(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct", api_key=HF_TOKEN)
validator = Validator()
generator = Generator(hf_client, registry, prompt_engine, validator)

print("All components initialized successfully!")
print(f"Available templates: {registry.get_ids()}")


## 8. Gradio Interface


In [None]:
def generate_action(template_id: str, count: int, tone: str, temperature: float, max_tokens: int, custom_prompt: str, show_raw: bool):
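    # Gradio sliders and number boxes can deliver floats; the model needs integers
    count, max_tokens = int(count), int(max_tokens)
    params = {"count": count, "tone": tone}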
    if template_id == "custom":
        result = generator.generate(None, params, custom_prompt=custom_prompt, temperature=temperature, max_tokens=max_tokens)
    else:
        result = generator.generate(template_id, params, custom_prompt=None, temperature=temperature, max_tokens=max_tokens)
    output = result.get("output")
    raw = result.get("raw_model_text", "")
    validation = result.get("validation", {"valid": False, "errors": []})
    json_text = json.dumps(output, indent=2) if output is not None else ""
    
    # Debug: Always show raw output for now to help debug
    debug_info = f"Status: {result.get('status', 'unknown')}\nRaw length: {len(raw)} chars\nFirst 200 chars: {raw[:200]}"
    
    if show_raw:
        return json_text, raw, json.dumps(validation, indent=2)
    return json_text, debug_info, json.dumps(validation, indent=2)

print("Generate function defined!")


In [None]:
# Build UI
templates = registry.get_ids()
templates_dropdown = templates + ["custom"]

with gr.Blocks(title="Synthetic Data Generator - Colab") as demo:
    with gr.Row():
        with gr.Column(scale=1):
            template = gr.Dropdown(label="Template", choices=templates_dropdown, value=templates_dropdown[0])
            count = gr.Slider(label="Count", minimum=1, maximum=10, value=1, step=1)
            tone = gr.Textbox(label="Tone (optional)", value="concise")
            temperature = gr.Slider(label="Temperature", minimum=0.0, maximum=1.0, value=0.2, step=0.05)
            max_tokens = gr.Number(label="Max tokens", value=512)
            custom_prompt = gr.Textbox(label="Custom prompt (used if Template=custom)", lines=6, visible=False)
            show_raw = gr.Checkbox(label="Show raw model output", value=False)
            gen_btn = gr.Button("Generate")
        with gr.Column(scale=1):
            output = gr.Code(label="Output (JSON)", language="json")
            raw_out = gr.Textbox(label="Raw model output", lines=8)
            validation = gr.Code(label="Validation", language="json")

    def on_template_change(t):
        return gr.update(visible=(t == "custom"))

    template.change(on_template_change, inputs=[template], outputs=[custom_prompt])

    gen_btn.click(fn=generate_action,
                  inputs=[template, count, tone, temperature, max_tokens, custom_prompt, show_raw],
                  outputs=[output, raw_out, validation])

print("Gradio interface created!")


## 9. Launch the Interface


In [None]:
# Launch the interface
demo.launch(share=True, debug=True)


## Usage Instructions

1. **Select a Template**: Choose from predefined templates (user_profile, job_description, product_spec, address) or select "custom" for your own prompt

2. **Configure Parameters**:
   - **Count**: Number of items to generate (1-10)
   - **Tone**: Style of generation (e.g., "concise", "detailed", "professional")
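   - **Temperature**: Controls sampling randomness (values near 0.0 give near-deterministic output; 1.0 samples from the model's unscaled distribution)
   - **Max tokens**: Maximum number of new tokens the model may generate; output that is cut off at this limit can fail JSON parsing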

3. **Custom Prompts**: If you select the "custom" template, enter your own prompt in the text area that appears; any `{{count}}` and `{{tone}}` placeholders in it are filled from the controls above

4. **Generate**: Click the "Generate" button to create synthetic data

5. **View Results**:
   - **Output (JSON)**: The structured JSON data
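   - **Raw model output**: The model's raw text when "Show raw model output" is checked; otherwise a short debug summary (status, length, first 200 characters)
   - **Validation**: Whether the output matches the template's schema (custom prompts have no schema, so they always pass)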
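
To make the Temperature setting concrete: during sampling, the model's logits are divided by the temperature before the softmax, so low values concentrate probability on the top-scoring token and higher values spread it out. The sketch below uses made-up logits for three candidate tokens, not output from any model, and `softmax_with_temperature` is a hypothetical helper.

```python
import math
from typing import List

def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.2, 1.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At 0.2 nearly all probability lands on the first token, while at 1.0 the other two tokens keep a noticeable share. The formula divides by the temperature, so it is undefined at exactly 0.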

## Example Custom Prompts

Here are some example custom prompts you can try:

**Generate product reviews:**
```
Generate 3 product reviews as a JSON array. Each review should have: rating (1-5), title, content, reviewer_name, date. Tone: honest and detailed.
```

**Generate social media posts:**
```
Generate 2 social media posts as a JSON array. Each post should have: platform, content, hashtags (array), engagement_metrics (likes, shares, comments). Tone: engaging and modern.
```

**Generate customer support tickets:**
```
Generate 3 customer support tickets as a JSON array. Each ticket should have: ticket_id, customer_email, subject, description, priority (low/medium/high), status. Tone: realistic and varied.
```
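
Custom prompts run without a schema, so the Validation panel cannot tell you whether the fields you asked for are present. A minimal sketch for checking the parsed output yourself follows; `missing_fields` is a hypothetical helper, and the sample reviews are hand-written, not model output.

```python
from typing import Any, Dict, List

def missing_fields(items: List[Dict[str, Any]], expected: List[str]) -> Dict[int, List[str]]:
    """Map each item's index to the expected fields it lacks (complete items are omitted)."""
    report = {}
    for index, item in enumerate(items):
        missing = [name for name in expected if name not in item]
        if missing:
            report[index] = missing
    return report

# Hand-written sample in the shape the product-review prompt asks for
reviews = [
    {"rating": 5, "title": "Great", "content": "Works well.", "reviewer_name": "A. Lee", "date": "2024-01-05"},
    {"rating": 3, "title": "Okay", "content": "Average."},
]
print(missing_fields(reviews, ["rating", "title", "content", "reviewer_name", "date"]))
```

An empty report means every item carries all the requested fields. The check covers presence only, not types or value ranges.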
