# Week 3 Exercise Solution - Synthetic Data Generator with HuggingFace

**Author:** Samuel Kalu  
**Team:** Euclid  
**Week:** 3

## Overview

This solution combines Week 3 concepts into a comprehensive synthetic data generation pipeline:
- ✅ HuggingFace Transformers for model inference
- ✅ Multiple model architectures (causal LM, seq2seq)
- ✅ Token generation and sampling strategies
- ✅ Batch processing for efficiency
- ✅ Gradio UI for interactive data generation
- ✅ Export to multiple formats (JSON, CSV, JSONL)

## Use Cases
- Training data creation for fine-tuning
- Data augmentation for ML pipelines
- Synthetic Q&A pair generation
- Multi-lingual dataset creation

In [None]:
# Imports
import os
import json
import csv
from typing import List, Dict, Tuple
from dotenv import load_dotenv
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM, AutoModelForSeq2SeqLM
import torch
import gradio as gr
from datetime import datetime
import random

In [None]:
# Load environment variables
load_dotenv(override=True)

hf_token = os.getenv('HF_TOKEN')
if hf_token:
    print(f"✓ HuggingFace Token exists: {hf_token[:8]}...")
else:
    print("✗ HuggingFace Token not set - some models may not work")
    print("  Get your token from https://huggingface.co/settings/tokens")

## Model Configuration

Multiple pre-trained models for different generation tasks

In [None]:
# Available models for different tasks
MODELS = {
    "Text Generation (GPT-2)": {
        "model_id": "gpt2",
        "task": "text-generation",
        "type": "causal",
        "max_length": 100
    },
    "Text Generation (Phi-2)": {
        "model_id": "microsoft/phi-2",
        "task": "text-generation",
        "type": "causal",
        "max_length": 150
    },
    "Summarization (BART)": {
        "model_id": "facebook/bart-large-cnn",
        "task": "summarization",
        "type": "seq2seq",
        "max_length": 130
    },
    "Translation (Marian)": {
        "model_id": "Helsinki-NLP/opus-mt-en-de",
        "task": "translation",
        "type": "seq2seq",
        "max_length": 100
    },
    "Q&A Generation (T5)": {
        "model_id": "google/flan-t5-base",
        "task": "text2text-generation",
        "type": "seq2seq",
        "max_length": 100
    }
}

# Cache for loaded pipelines
pipelines_cache = {}

def load_pipeline(model_key: str):
    """Load and cache a model pipeline"""
    if model_key in pipelines_cache:
        return pipelines_cache[model_key]
    
    model_config = MODELS[model_key]
    print(f"Loading {model_config['model_id']}...")
    
    try:
        pipe = pipeline(
            model_config['task'],
            model=model_config['model_id'],
            token=hf_token,
            device=0 if torch.cuda.is_available() else -1
        )
        pipelines_cache[model_key] = pipe
        device = "GPU" if torch.cuda.is_available() else "CPU"
        print(f"✓ Model loaded on {device}")
        return pipe
    except Exception as e:
        print(f"✗ Error loading model: {e}")
        return None
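
The caching behaviour of `load_pipeline` can be illustrated without downloading a model. The sketch below is a minimal stand-in under stated assumptions: `fake_loader`, `load_cached`, and `load_calls` are hypothetical names, and `fake_loader` only simulates what `pipeline(...)` would return.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for transformers.pipeline: it records every call
# so we can see how often a "model" is really loaded.
load_calls: List[str] = []

def fake_loader(model_id: str) -> Callable[[str], str]:
    load_calls.append(model_id)
    return lambda prompt: f"[{model_id}] {prompt}"

_cache: Dict[str, Callable[[str], str]] = {}

def load_cached(model_id: str) -> Callable[[str], str]:
    """Return the cached object for model_id, loading it on first use only."""
    if model_id not in _cache:
        _cache[model_id] = fake_loader(model_id)
    return _cache[model_id]

first = load_cached("gpt2")
second = load_cached("gpt2")                # served from the cache, no second load
other = load_cached("google/flan-t5-base")  # different key, so one more load
```

As in `load_pipeline`, only successful loads end up in the cache, so a failed load would be retried on the next call.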

## Data Generation Templates

Prompts and templates for different dataset types

In [None]:
# Templates for different dataset types
DATASET_TEMPLATES = {
    "Q&A Pairs": {
        "prompt": "Generate a question and answer pair about {topic}. Format: Q: [question] A: [answer]",
        "output_format": "json",
        "fields": ["question", "answer"]
    },
    "Summaries": {
        "prompt": "Summarize the following text in 2-3 sentences: {text}",
        "output_format": "text",
        "fields": ["original", "summary"]
    },
    "Translations": {
        "prompt": "{text}",
        "output_format": "text",
        "fields": ["source", "translation"]
    },
    "Story Continuations": {
        "prompt": "Continue this story in 3-4 sentences: {text}",
        "output_format": "text",
        "fields": ["prompt", "continuation"]
    },
    "Instruction-Response": {
        "prompt": "Generate an instruction and its response about {topic}. Format: Instruction: [instruction] Response: [response]",
        "output_format": "json",
        "fields": ["instruction", "response"]
    }
}

# Sample topics for generation
SAMPLE_TOPICS = [
    "artificial intelligence",
    "climate change",
    "space exploration",
    "healthy eating",
    "renewable energy",
    "machine learning",
    "history of internet",
    "mental health",
    "sustainable living",
    "future of work"
]
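
The `fields` entries suggest that raw model output is meant to be split into named fields. A minimal sketch of that step for the "Q&A Pairs" template follows; `fill_template`, `parse_qa`, and the regular expression are assumptions made for illustration and are not part of the notebook.

```python
import re
from typing import Dict, Optional

QA_TEMPLATE = (
    "Generate a question and answer pair about {topic}. "
    "Format: Q: [question] A: [answer]"
)

def fill_template(template: str, **values: str) -> str:
    """Substitute placeholders such as {topic} or {text}."""
    return template.format(**values)

# Lazy match for the question so it stops at the first "A:" marker
QA_PATTERN = re.compile(r"Q:\s*(?P<question>.+?)\s*A:\s*(?P<answer>.+)", re.DOTALL)

def parse_qa(text: str) -> Optional[Dict[str, str]]:
    """Split 'Q: ... A: ...' text into fields, or return None if it does not match."""
    match = QA_PATTERN.search(text)
    if match is None:
        return None
    return {
        "question": match.group("question").strip(),
        "answer": match.group("answer").strip(),
    }

prompt = fill_template(QA_TEMPLATE, topic="machine learning")
parsed = parse_qa("Q: What is overfitting? A: Memorising noise in the training data.")
unparsed = parse_qa("free text without markers")
```

Text-generation pipelines return the prompt followed by the continuation by default, and the template itself contains `Q:` and `A:`, so for causal models the prompt would need to be stripped before parsing.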

## Synthetic Data Generator Class

Core generation logic with batching and sampling

In [None]:
class SyntheticDataGenerator:
    """Generate synthetic datasets using HuggingFace models"""
    
    def __init__(self, model_key: str = "Text Generation (GPT-2)"):
        self.model_key = model_key
        self.pipeline = load_pipeline(model_key)
        self.generated_data = []
    
    def generate(self, 
                 prompt: str, 
                 num_samples: int = 5,
                 temperature: float = 0.7,
                 top_p: float = 0.9,
                 batch_size: int = 3) -> List[Dict]:
        """
        Generate synthetic data samples
        
        Args:
            prompt: Input prompt or template
            num_samples: Number of samples to generate
            temperature: Sampling temperature (higher = more diverse)
            top_p: Nucleus sampling parameter
            batch_size: Number of samples to generate in parallel
        
        Returns:
            List of generated samples
        """
        if not self.pipeline:
            return [{"error": "Model not loaded"}]
        
        results = []
        model_config = MODELS[self.model_key]
        
        # Prepare generation parameters. Sampling knobs are only passed when
        # sampling is enabled, so greedy decoding does not receive unused options.
        gen_kwargs = {
            "max_length": model_config["max_length"],
            "do_sample": temperature > 0
        }
        if gen_kwargs["do_sample"]:
            gen_kwargs["temperature"] = temperature
            gen_kwargs["top_p"] = top_p
        
        # Pipelines name their output field after the task
        output_keys = ("generated_text", "summary_text", "translation_text")
        
        # Generate in batches
        for batch_num in range(0, num_samples, batch_size):
            current_batch = min(batch_size, num_samples - batch_num)
            
            try:
                # One copy of the prompt per requested sample. Causal and seq2seq
                # pipelines are called the same way, so no branching is needed.
                outputs = self.pipeline(
                    [prompt] * current_batch,
                    **gen_kwargs
                )
                for output in outputs:
                    # Each result may be wrapped in a single-element list
                    item = output[0] if isinstance(output, list) else output
                    generated = next(
                        (item[key] for key in output_keys if key in item), ""
                    )
                    results.append({
                        "input": prompt,
                        "output": generated,
                        "model": self.model_key
                    })
            except Exception as e:
                results.append({"error": str(e), "input": prompt})
        
        self.generated_data = results
        return results
    
    def generate_qa_pairs(self, topic: str, num_pairs: int = 5) -> List[Dict]:
        """Generate Q&A pairs for a specific topic"""
        prompt = f"Generate a thoughtful question about {topic} and provide a comprehensive answer."
        return self.generate(prompt, num_samples=num_pairs)
    
    def generate_summaries(self, texts: List[str]) -> List[Dict]:
        """Generate summaries for a list of texts"""
        results = []
        for text in texts:
            prompt = f"Summarize: {text}"
            output = self.generate(prompt, num_samples=1)
            results.append({
                "original": text[:200] + "..." if len(text) > 200 else text,
                "summary": output[0].get("output", "Error"),
                "model": self.model_key
            })
        return results
    
    def export_to_json(self, filename: str = None) -> str:
        """Export generated data to JSON"""
        if not filename:
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            filename = f"synthetic_data_{timestamp}.json"
        
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(self.generated_data, f, indent=2, ensure_ascii=False)
        
        return filename
    
    def export_to_csv(self, filename: str = None) -> str:
        """Export generated data to CSV"""
        if not filename:
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            filename = f"synthetic_data_{timestamp}.csv"
        
        if not self.generated_data:
            return None
        
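        # Use the union of keys across all rows: error rows carry an "error" key
        # that the first row may lack, which would make DictWriter raise ValueError.
        fieldnames = list(dict.fromkeys(key for row in self.generated_data for key in row))
        
        with open(filename, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(self.generated_data)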
        
        return filename
    
    def export_to_jsonl(self, filename: str = None) -> str:
        """Export generated data to JSONL (for fine-tuning)"""
        if not filename:
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            filename = f"synthetic_data_{timestamp}.jsonl"
        
        with open(filename, 'w', encoding='utf-8') as f:
            for item in self.generated_data:
                f.write(json.dumps(item, ensure_ascii=False) + '\n')
        
        return filename
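
The `temperature` and `top_p` arguments described in the `generate` docstring can be worked through by hand on a toy distribution. The sketch below is an illustration only: the logits are made-up numbers, the function names are hypothetical, and it does not reproduce how `transformers` implements sampling internally.

```python
import math
from typing import Dict

def softmax_with_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Divide logits by the temperature, then apply softmax."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalise."""
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

# Toy logits for four candidate next tokens (made-up numbers)
logits = {"the": 2.0, "a": 1.0, "robots": 0.5, "banana": -1.0}

sharp = softmax_with_temperature(logits, temperature=0.5)  # concentrates mass on "the"
flat = softmax_with_temperature(logits, temperature=1.5)   # spreads mass more evenly
nucleus = top_p_filter(softmax_with_temperature(logits, 1.0), top_p=0.9)
```

With these numbers the three most likely tokens already cover more than 90% of the probability mass at temperature 1.0, so `banana` is dropped from the nucleus and can never be sampled.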

## Demo: Generate Synthetic Data

Test the generator with different models and templates

In [None]:
# Demo 1: Text Generation with GPT-2
print("=" * 60)
print("DEMO 1: Text Generation with GPT-2")
print("=" * 60)

generator_gpt2 = SyntheticDataGenerator("Text Generation (GPT-2)")
prompt = "Artificial intelligence is revolutionizing"
results = generator_gpt2.generate(prompt, num_samples=3, temperature=0.8)

for i, result in enumerate(results, 1):
    print(f"\n--- Sample {i} ---")
    print(f"Input: {result['input']}")
    print(f"Output: {result['output']}")

In [None]:
# Demo 2: Q&A Generation with FLAN-T5
print("\n" + "=" * 60)
print("DEMO 2: Q&A Generation with FLAN-T5")
print("=" * 60)

generator_t5 = SyntheticDataGenerator("Q&A Generation (T5)")
prompt = "Generate a question and answer about machine learning."
results = generator_t5.generate(prompt, num_samples=3, temperature=0.5)

for i, result in enumerate(results, 1):
    print(f"\n--- Q&A Pair {i} ---")
    print(f"{result['output']}")

In [None]:
# Demo 3: Summarization with BART
print("\n" + "=" * 60)
print("DEMO 3: Summarization with BART")
print("=" * 60)

sample_text = """
Large language models are artificial intelligence systems that have been trained on vast amounts of text data. 
They can understand and generate human-like text, making them useful for tasks like translation, summarization, 
and question answering. Recent models like GPT-4, Claude, and Llama have demonstrated remarkable capabilities 
in understanding context, following instructions, and even reasoning about complex problems.
"""

generator_bart = SyntheticDataGenerator("Summarization (BART)")
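# BART-large-CNN is not instruction-tuned: pass the text itself, since a
# "Summarize:" prefix would be treated as part of the article.
prompt = sample_text.strip()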
results = generator_bart.generate(prompt, num_samples=2)

for i, result in enumerate(results, 1):
    print(f"\n--- Summary {i} ---")
    print(f"{result['output']}")

## Export Generated Data

Save datasets in multiple formats

In [None]:
# Export to JSON
json_file = generator_gpt2.export_to_json()
print(f"✓ Exported to JSON: {json_file}")

# Export to CSV
csv_file = generator_gpt2.export_to_csv()
print(f"✓ Exported to CSV: {csv_file}")

# Export to JSONL
jsonl_file = generator_t5.export_to_jsonl()
print(f"✓ Exported to JSONL: {jsonl_file}")
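
An exported file is easy to sanity-check by reading it back. The following sketch writes two made-up records in the shape returned by `generate()` and reloads them; `read_jsonl` is a hypothetical helper, and the temporary directory is used only to keep the example self-contained.

```python
import json
import os
import tempfile
from typing import Dict, List

def read_jsonl(path: str) -> List[Dict]:
    """Load one JSON object per non-empty line."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# Made-up records in the shape produced by generate()
samples = [
    {"input": "Artificial intelligence is revolutionizing",
     "output": "healthcare.", "model": "Text Generation (GPT-2)"},
    {"input": "Artificial intelligence is revolutionizing",
     "output": "éducation.", "model": "Text Generation (GPT-2)"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "synthetic_data.jsonl")
    # Same writing scheme as export_to_jsonl: one JSON object per line
    with open(path, "w", encoding="utf-8") as f:
        for item in samples:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
    loaded = read_jsonl(path)
```

Because the file is written with `ensure_ascii=False` and UTF-8 encoding, non-ASCII characters survive the round trip unchanged.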

## Gradio UI - Interactive Data Generator

Create a user-friendly interface for synthetic data generation

In [None]:
def create_data_generator_ui():
    """Create Gradio interface for synthetic data generation"""
    
    with gr.Blocks(title="Synthetic Data Generator", theme=gr.themes.Soft()) as demo:
        gr.Markdown("""# 🤖 Synthetic Data Generator
### Week 3 Exercise Solution - Samuel Kalu (Team Euclid)

Generate high-quality synthetic datasets using HuggingFace transformers:
- Multiple pre-trained models (GPT-2, FLAN-T5, BART, etc.)
- Various dataset types (Q&A, Summaries, Translations, Instructions)
- Batch generation for efficiency
- Export to JSON, CSV, or JSONL formats
""")
        
        with gr.Row():
            with gr.Column(scale=2):
                gr.Markdown("### ⚙️ Configuration")
                
                model_dropdown = gr.Dropdown(
                    choices=list(MODELS.keys()),
                    value="Text Generation (GPT-2)",
                    label="Model"
                )
                
                dataset_type = gr.Dropdown(
                    choices=list(DATASET_TEMPLATES.keys()),
                    value="Q&A Pairs",
                    label="Dataset Type"
                )
                
                topic_input = gr.Textbox(
                    label="Topic / Input Text",
                    placeholder="e.g., machine learning, climate change, or paste your text here",
                    lines=3
                )
                
                with gr.Row():
                    num_samples = gr.Slider(
                        minimum=1,
                        maximum=20,
                        value=5,
                        step=1,
                        label="Number of Samples"
                    )
                    temperature = gr.Slider(
                        minimum=0.1,
                        maximum=1.5,
                        value=0.7,
                        step=0.1,
                        label="Temperature (Diversity)"
                    )
                
                generate_btn = gr.Button("🚀 Generate Data", variant="primary", size="lg")
                
            with gr.Column(scale=3):
                gr.Markdown("### 📊 Generated Data")
                
                output_area = gr.JSON(
                    label="Generated Samples",
                    height=400
                )
                
                with gr.Row():
                    export_json = gr.Button("📄 Export JSON")
                    export_csv = gr.Button("📊 Export CSV")
                    export_jsonl = gr.Button("📝 Export JSONL")
                
                status_box = gr.Textbox(
                    label="Status",
                    interactive=False
                )
        
        current_generator = gr.State(None)
        
        def generate_data(model, data_type, topic, num_samples, temp):
            """Generate synthetic data"""
            if not topic:
                return None, "Please enter a topic or text"
            
            template = DATASET_TEMPLATES[data_type]
            prompt = template["prompt"].format(topic=topic, text=topic)
            
            generator = SyntheticDataGenerator(model)
            
            results = generator.generate(
                prompt=prompt,
                num_samples=int(num_samples),  # sliders may deliver floats; range() needs an int
                temperature=temp
            )
            
            return results, f"✓ Generated {len(results)} samples with {model}"
        
        def export_json_handler():
            return "JSON export functionality available after generation"
        
        def export_csv_handler():
            return "CSV export functionality available after generation"
        
        def export_jsonl_handler():
            return "JSONL export functionality available after generation"
        
        generate_btn.click(
            fn=generate_data,
            inputs=[model_dropdown, dataset_type, topic_input, num_samples, temperature],
            outputs=[output_area, status_box]
        )
        
        export_json.click(fn=export_json_handler, outputs=[status_box])
        export_csv.click(fn=export_csv_handler, outputs=[status_box])
        export_jsonl.click(fn=export_jsonl_handler, outputs=[status_box])
        
        gr.Markdown("""---
### 💡 Tips:
- **Temperature**: Lower values (0.1-0.5) = more focused, Higher values (0.8-1.5) = more creative
- **Batch Generation**: Generate multiple samples at once for efficiency
- **Export Formats**: JSONL is ideal for fine-tuning LLMs
""")
    
    return demo
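
The three export handlers in the interface only return a status message and never write a file. One possible way to give them real behaviour is a single function that receives the generated records and a format name. The sketch below is an assumption, not the notebook's implementation: `export_records` is a hypothetical name, and connecting it to the buttons (for example by keeping the last results in `gr.State`) is left out.

```python
import csv
import json
import os
import tempfile
from typing import Dict, List

def export_records(records: List[Dict], fmt: str, directory: str) -> str:
    """Write records as json, csv, or jsonl and return a status message."""
    if not records:
        return "Nothing to export: generate data first"
    if fmt not in {"json", "csv", "jsonl"}:
        return f"Unsupported format: {fmt}"

    path = os.path.join(directory, f"synthetic_data.{fmt}")
    with open(path, "w", newline="", encoding="utf-8") as f:
        if fmt == "json":
            json.dump(records, f, indent=2, ensure_ascii=False)
        elif fmt == "jsonl":
            for item in records:
                f.write(json.dumps(item, ensure_ascii=False) + "\n")
        else:
            # Union of keys, so error rows and normal rows share one header
            fieldnames = list(dict.fromkeys(key for item in records for key in item))
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(records)
    return f"Exported {len(records)} records to {path}"

# Usage with made-up records, including one error row
rows = [{"input": "a", "output": "b"}, {"error": "boom", "input": "a"}]
with tempfile.TemporaryDirectory() as tmp_dir:
    empty_status = export_records([], "json", tmp_dir)
    csv_status = export_records(rows, "csv", tmp_dir)
    with open(os.path.join(tmp_dir, "synthetic_data.csv"), encoding="utf-8", newline="") as f:
        header = f.readline().strip()
```

Returning a status string keeps the function compatible with the existing `status_box` output of the export buttons.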

In [None]:
# Launch the interface
if __name__ == "__main__":
    demo = create_data_generator_ui()
    demo.launch()
    
    # For public sharing:
    # demo.launch(share=True)

## Advanced: Custom Dataset Creation

Create custom datasets for specific fine-tuning tasks

In [None]:
def create_finetuning_dataset(topics: List[str], 
                              num_pairs_per_topic: int = 3,
                              output_file: str = "finetuning_dataset.jsonl") -> str:
    """
    Create a fine-tuning dataset with instruction-response pairs
    
    Args:
        topics: List of topics to generate Q&A for
        num_pairs_per_topic: Number of Q&A pairs per topic
        output_file: Output JSONL filename
    
    Returns:
        Path to generated file
    """
    generator = SyntheticDataGenerator("Q&A Generation (T5)")
    dataset = []
    
    for topic in topics:
        print(f"Generating for topic: {topic}")
        
        for i in range(num_pairs_per_topic):
            instruction = f"Tell me about {topic}"
            prompt = f"Generate an educational response about {topic}"
            
            results = generator.generate(prompt, num_samples=1)
            
            if results and "output" in results[0]:
                dataset.append({
                    "instruction": instruction,
                    "input": "",
                    "output": results[0]["output"],
                    "topic": topic,
                    "source": "synthetic_week3_exercise"
                })
    
    # Save to JSONL
    with open(output_file, 'w', encoding='utf-8') as f:
        for item in dataset:
            f.write(json.dumps(item, ensure_ascii=False) + '\n')
    
    print(f"✓ Generated {len(dataset)} samples for fine-tuning")
    return output_file

# Example usage:
# finetuning_file = create_finetuning_dataset(SAMPLE_TOPICS[:3], num_pairs_per_topic=2)
# print(f"Fine-tuning dataset saved to: {finetuning_file}")
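
Because `create_finetuning_dataset` sends the same prompt several times per topic, repeated or near-empty outputs are likely. A minimal sketch of a filtering pass that could run before the file is written is shown below; `normalize`, `deduplicate`, and the `min_chars=20` threshold are assumptions chosen for illustration.

```python
from typing import Dict, List

def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def deduplicate(records: List[Dict], min_chars: int = 20) -> List[Dict]:
    """Drop records whose output is too short or repeats an earlier one."""
    seen = set()
    kept = []
    for record in records:
        key = normalize(record.get("output", ""))
        if len(key) < min_chars or key in seen:
            continue
        seen.add(key)
        kept.append(record)
    return kept

# Made-up records in the shape built by create_finetuning_dataset
records = [
    {"instruction": "Tell me about climate change",
     "output": "Climate change refers to long-term shifts in temperature."},
    {"instruction": "Tell me about climate change",
     "output": "climate change refers to  long-term shifts in temperature."},
    {"instruction": "Tell me about climate change", "output": "Yes."},
]
cleaned = deduplicate(records)
```

Here the second record differs from the first only in case and spacing, and the third is shorter than the threshold, so only the first record is kept.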

## Summary

### Features Implemented:
1. ✅ HuggingFace Transformers Integration - Multiple model architectures
2. ✅ Batch Processing - Efficient generation with configurable batch sizes
3. ✅ Sampling Strategies - Temperature and top-p control for diversity
4. ✅ Multiple Export Formats - JSON, CSV, JSONL for different use cases
5. ✅ Gradio UI - Interactive interface for non-technical users
6. ✅ Template System - Pre-built templates for common dataset types
7. ✅ Fine-tuning Dataset Creator - Generate instruction-response pairs

### Potential Enhancements:
- Add support for more models (Llama, Mistral via HuggingFace)
- Implement data quality filtering
- Add multi-lingual support
- Integrate with HuggingFace Datasets for direct upload
- Add data validation and deduplication

### Lessons Learned:
- HuggingFace pipelines make model inference incredibly simple
- Different models excel at different tasks (causal LM vs seq2seq)
- Temperature control is crucial for balancing diversity and coherence
- Batch processing can speed up large dataset generation, particularly on a GPU
- JSONL format is standard for fine-tuning datasets

---
**Built with ❤️ for LLM Engineering Bootcamp - Week 3**