# Week 3 Exercise - Synthetic Data Generator

## Task
Generate synthetic translation data in Nigerian Pidgin and Yoruba from English input using the HuggingFace transformers library.

### Requirements:
- Use a variety of models and prompts for diverse outputs
- Create a Gradio UI for your product
- Generate translations in both Pidgin (Nigerian Pidgin) and Yoruba
- Use HuggingFace transformers library (following week 3 day 3 & 4 patterns)

### Approach:
1. Multiple HuggingFace models (Llama, Phi, Qwen, Gemma)
2. Different prompt strategies (direct translation, contextual, with examples)
3. Batch generation capabilities
4. Export generated data as CSV/JSON

**Note:** This notebook is designed to run on Google Colab with a T4 GPU for best performance.

## Setup and Installation

In [None]:
# Install required packages (for Google Colab)
# !pip install -q --upgrade bitsandbytes accelerate transformers==4.57.6 gradio pandas

In [None]:
# imports
import os
import json
import pandas as pd
import torch
import gc
from datetime import datetime
from typing import List, Dict
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, BitsAndBytesConfig
import gradio as gr

In [None]:
# For Google Colab - Login to HuggingFace
# Uncomment these lines when running on Colab

# from google.colab import userdata
# from huggingface_hub import login
# hf_token = userdata.get('HF_TOKEN')
# login(hf_token, add_to_git_credential=True)

## Model Selection

We'll use multiple HuggingFace models to create diverse outputs:
- **Llama 3.2**: Meta's efficient small model
- **Phi-4 Mini**: Microsoft's compact but powerful model
- **Gemma**: Google's lightweight model
- **Qwen**: Alibaba's multilingual model

**Note:** For Llama models, you need to request access at https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct

In [None]:
# Define available models
MODELS = {
    "Llama-3.2-1B": "meta-llama/Llama-3.2-1B-Instruct",
    "Phi-4-Mini": "microsoft/Phi-4-mini-instruct",
    "Gemma-270M": "google/gemma-3-270m-it",
    "Qwen-4B": "Qwen/Qwen3-4B-Instruct-2507"
}

# Device setup
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

In [None]:
# Quantization config for efficient memory usage (similar to week3/day4)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
) if device == "cuda" else None
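
As a rough sanity check on why 4-bit loading matters on a 16 GB T4, weight memory can be approximated as parameters × bits / 8. The sketch below is an estimate only: the parameter counts are approximate figures assumed for illustration, `estimate_weight_gb` is a hypothetical helper, and activations, the KV cache, and quantization overhead are all ignored.

```python
# Approximate parameter counts (assumed for illustration, not read from the model cards)
APPROX_PARAMS = {
    "Llama-3.2-1B": 1.2e9,
    "Phi-4-Mini": 3.8e9,
    "Gemma-270M": 0.27e9,
    "Qwen-4B": 4.0e9,
}

def estimate_weight_gb(n_params: float, bits_per_param: int) -> float:
    """Weights only: parameters * bits / 8 bytes, reported in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

for name, n_params in APPROX_PARAMS.items():
    full = estimate_weight_gb(n_params, 16)
    quant = estimate_weight_gb(n_params, 4)
    print(f"{name}: ~{full:.1f} GB at 16-bit, ~{quant:.1f} GB at 4-bit")
```

Real usage will be higher than these figures, so treat them as a lower bound when deciding how many models to keep loaded at once.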

## Translation Prompt Strategies

We'll use different prompt strategies to create diverse outputs:
1. **Direct Translation**: Simple, straightforward translation
2. **Contextual Translation**: Consider context and tone
3. **Few-Shot Translation**: Provide examples for better accuracy
4. **Cultural Adaptation**: Adapt idioms and cultural references

In [None]:
# Prompt templates for different strategies

DIRECT_PROMPT = {
    "pidgin": """Translate the following English text to Nigerian Pidgin English.
Be accurate and natural. Only provide the translation, no explanations.

English: {text}
Nigerian Pidgin:""",
    
    "yoruba": """Translate the following English text to Yoruba language.
Be accurate and use proper Yoruba grammar. Only provide the translation, no explanations.

English: {text}
Yoruba:"""
}

CONTEXTUAL_PROMPT = {
    "pidgin": """You are an expert in Nigerian Pidgin English. Translate the following English text to Nigerian Pidgin,
maintaining the tone, emotion, and cultural context. Make it sound natural and conversational.
Provide only the translation.

English: {text}
Nigerian Pidgin:""",
    
    "yoruba": """You are a Yoruba language expert. Translate the following English text to Yoruba,
preserving the meaning, tone, and cultural nuances. Use appropriate Yoruba expressions.
Provide only the translation.

English: {text}
Yoruba:"""
}

FEW_SHOT_PROMPT = {
    "pidgin": """Translate English to Nigerian Pidgin. Here are some examples:

English: Good morning, how are you?
Nigerian Pidgin: Mornin', how you dey?

English: I am going to the market.
Nigerian Pidgin: I dey go market.

English: What are you doing?
Nigerian Pidgin: Wetin you dey do?

Now translate this (provide only the translation):
English: {text}
Nigerian Pidgin:""",
    
    "yoruba": """Translate English to Yoruba. Here are some examples:

English: Good morning
Yoruba: E kaaro

English: How are you?
Yoruba: Bawo ni?

English: Thank you
Yoruba: E se

Now translate this (provide only the translation):
English: {text}
Yoruba:"""
}

CULTURAL_PROMPT = {
    "pidgin": """Translate to Nigerian Pidgin English, adapting idioms and cultural references to Nigerian context.
Make it sound like how a Nigerian would naturally speak in Pidgin.
Provide only the translation.

English: {text}
Nigerian Pidgin:""",
    
    "yoruba": """Translate to Yoruba, adapting cultural references and idioms to Yoruba tradition and culture.
Use traditional Yoruba expressions where appropriate.
Provide only the translation.

English: {text}
Yoruba:"""
}

PROMPT_STRATEGIES = {
    "Direct": DIRECT_PROMPT,
    "Contextual": CONTEXTUAL_PROMPT,
    "Few-Shot": FEW_SHOT_PROMPT,
    "Cultural": CULTURAL_PROMPT
}
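
Because the models and prompt strategies are plain dictionaries, the full set of model, strategy, and language combinations can be enumerated up front, which is one way to get the output diversity the requirements ask for. A minimal sketch, using small stand-in dictionaries so it runs on its own; `build_jobs` is a hypothetical helper, not part of the notebook.

```python
from itertools import product

# Stand-ins for the notebook's MODELS and PROMPT_STRATEGIES (shortened for illustration)
models = {
    "Phi-4-Mini": "microsoft/Phi-4-mini-instruct",
    "Gemma-270M": "google/gemma-3-270m-it",
}
strategies = {
    "Direct": {
        "pidgin": "Translate to Nigerian Pidgin: {text}",
        "yoruba": "Translate to Yoruba: {text}",
    },
    "Few-Shot": {
        "pidgin": "Examples...\nEnglish: {text}\nNigerian Pidgin:",
        "yoruba": "Examples...\nEnglish: {text}\nYoruba:",
    },
}

def build_jobs(text, models, strategies):
    """Return one job dict per (model, strategy, language) combination."""
    jobs = []
    for model_name, strategy in product(models, strategies):
        for lang_code, template in strategies[strategy].items():
            jobs.append({
                "model": model_name,
                "strategy": strategy,
                "language": lang_code,
                "prompt": template.format(text=text),
            })
    return jobs

jobs = build_jobs("Good morning", models, strategies)
print(len(jobs))  # 2 models x 2 strategies x 2 languages = 8
```

Each job carries its own metadata, so the resulting rows can later be grouped by model or strategy when comparing output quality.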

## Model Loading and Translation Functions

In [None]:
# Cache for loaded models to avoid reloading
model_cache = {}
tokenizer_cache = {}

def load_model(model_name: str):
    """Load model and tokenizer with caching"""
    if model_name in model_cache:
        return model_cache[model_name], tokenizer_cache[model_name]
    
    model_id = MODELS[model_name]
    print(f"Loading {model_name}...")
    
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    
    # Load model with quantization if GPU available
    if device == "cuda" and quant_config is not None:
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            device_map="auto",
            quantization_config=quant_config,
            trust_remote_code=True
        )
    else:
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            device_map="auto" if device == "cuda" else None,
            trust_remote_code=True
        )
        if device == "cpu":
            model = model.to(device)
    
    model_cache[model_name] = model
    tokenizer_cache[model_name] = tokenizer
    
    print(f"{model_name} loaded successfully!")
    return model, tokenizer

def clear_model_cache():
    """Clear model cache to free memory"""
    global model_cache, tokenizer_cache
    model_cache.clear()
    tokenizer_cache.clear()
    gc.collect()
    if device == "cuda":
        torch.cuda.empty_cache()
    print("Model cache cleared")
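
Caching every model indefinitely can exhaust GPU memory once several models have been used in one session. One possible mitigation, sketched below under the assumption that evicting the least recently used model is acceptable, is a size-bounded cache. `BoundedModelCache` and the injected `loader` callable are hypothetical names; in this notebook the loader would wrap the `from_pretrained` calls, and an eviction would be followed by `gc.collect()` and `torch.cuda.empty_cache()`.

```python
from collections import OrderedDict
from typing import Any, Callable, List

class BoundedModelCache:
    """Keep at most `max_models` loaded objects, evicting the least recently used."""

    def __init__(self, loader: Callable[[str], Any], max_models: int = 1):
        if max_models < 1:
            raise ValueError("max_models must be at least 1")
        self.loader = loader
        self.max_models = max_models
        self._cache: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, name: str) -> Any:
        if name in self._cache:
            self._cache.move_to_end(name)  # mark as most recently used
            return self._cache[name]
        while len(self._cache) >= self.max_models:
            self._cache.popitem(last=False)  # drop the least recently used entry
        self._cache[name] = self.loader(name)
        return self._cache[name]

    def loaded(self) -> List[str]:
        return list(self._cache)

# Demo with a fake loader so the sketch runs without downloading anything
cache = BoundedModelCache(loader=lambda name: f"<model {name}>", max_models=1)
cache.get("Phi-4-Mini")
cache.get("Gemma-270M")
print(cache.loaded())
```

Injecting the loader keeps the eviction logic testable on a machine without a GPU.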

In [None]:
def translate_text(text: str, language: str, model_name: str, strategy: str = "Contextual") -> str:
    """Translate text using specified model and strategy"""
    try:
        # Get prompt template
        lang_code = "pidgin" if "Pidgin" in language else "yoruba"
        prompt_template = PROMPT_STRATEGIES[strategy][lang_code]
        prompt = prompt_template.format(text=text)
        
        # Load model
        model, tokenizer = load_model(model_name)
        
        # Prepare messages
        messages = [
            {"role": "user", "content": prompt}
        ]
        
        # Apply chat template
        inputs = tokenizer.apply_chat_template(
            messages,
            return_tensors="pt",
            add_generation_prompt=True
        ).to(device)
        
        # Generate translation
        with torch.no_grad():
            outputs = model.generate(
                inputs,
                max_new_tokens=200,
                temperature=0.7,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id
            )
        
        # Decode only the newly generated tokens. Decoding outputs[0] in full would
        # include the prompt and plain-text role markers from the chat template
        # (e.g. "assistant" or "model"), which the first-line cleanup could then return.
        generated_tokens = outputs[0][inputs.shape[-1]:]
        response = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()
        
        # Drop the label if the model echoes it in its answer
        label = "Nigerian Pidgin:" if lang_code == "pidgin" else "Yoruba:"
        if label in response:
            response = response.split(label)[-1].strip()
        
        # Keep the first non-empty line in case the model adds commentary
        lines = [line.strip() for line in response.split('\n') if line.strip()]
        translation = lines[0] if lines else ""
        
        return translation
        
    except Exception as e:
        return f"Error: {str(e)}"
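
The post-processing step is easier to test when it is factored into a pure function that receives only the generated text. A minimal sketch, where `clean_translation` is a hypothetical helper name and the rule of keeping the first non-empty line is an assumption carried over from the code above:

```python
def clean_translation(generated: str, label: str) -> str:
    """Strip an echoed label such as 'Yoruba:' and keep the first non-empty line."""
    text = generated.strip()
    if label in text:
        text = text.split(label)[-1]
    for line in text.splitlines():
        line = line.strip().strip('"')  # also drop surrounding double quotes
        if line:
            return line
    return ""

raw = "Nigerian Pidgin: I dey go market.\n\nNote: informal."
print(clean_translation(raw, "Nigerian Pidgin:"))
```

Because the function needs no model or tokenizer, edge cases such as empty output or an echoed label can be checked in milliseconds.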

## Test Translations

In [None]:
# Test with a simple sentence
test_text = "Good morning! How are you doing today?"

print("Testing translation with Phi-4-Mini model...\n")
print("="*60)

pidgin_translation = translate_text(test_text, "Nigerian Pidgin", "Phi-4-Mini", "Few-Shot")
print(f"English: {test_text}")
print(f"Pidgin: {pidgin_translation}")

print("\n" + "="*60 + "\n")

yoruba_translation = translate_text(test_text, "Yoruba", "Phi-4-Mini", "Few-Shot")
print(f"English: {test_text}")
print(f"Yoruba: {yoruba_translation}")

## Batch Generation for Synthetic Dataset

In [None]:
def generate_synthetic_dataset(english_texts: List[str], model_name: str = "Phi-4-Mini", 
                                strategy: str = "Contextual", progress_callback=None) -> pd.DataFrame:
    """Generate a complete synthetic dataset with multiple translations"""
    
    data = []
    
    for i, text in enumerate(english_texts):
        if progress_callback:
            progress_callback(f"Processing {i+1}/{len(english_texts)}: {text[:50]}...")
        else:
            print(f"Processing {i+1}/{len(english_texts)}: {text[:50]}...")
        
        # Translate to Pidgin
        pidgin = translate_text(text, "Nigerian Pidgin", model_name, strategy)
        
        # Translate to Yoruba
        yoruba = translate_text(text, "Yoruba", model_name, strategy)
        
        data.append({
            "id": i + 1,
            "english": text,
            "pidgin": pidgin,
            "yoruba": yoruba,
            "model": model_name,
            "strategy": strategy,
            "timestamp": datetime.now().isoformat()
        })
    
    return pd.DataFrame(data)
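
Since `translate_text` returns failures as strings beginning with `Error:`, those strings end up in the dataset alongside real translations. A simple heuristic filter, sketched here with a hypothetical `is_usable` helper, can flag rows worth dropping or regenerating. The echo check is an assumption and may be too strict for Pidgin, where some short phrases legitimately match the English.

```python
def is_usable(english: str, translation: str) -> bool:
    """Heuristic check: reject empty outputs, error strings, and untranslated echoes."""
    t = translation.strip()
    if not t or t.startswith("Error:"):
        return False
    return t.casefold() != english.strip().casefold()

rows = [
    {"english": "Thank you", "pidgin": "Thank you"},  # echoed input
    {"english": "What are you doing?", "pidgin": "Wetin you dey do?"},
    {"english": "Good morning", "pidgin": "Error: CUDA out of memory"},
]
kept = [r for r in rows if is_usable(r["english"], r["pidgin"])]
print(len(kept), "of", len(rows), "rows kept")
```

The same predicate could be applied per column to the DataFrame returned by `generate_synthetic_dataset` before saving it.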

In [None]:
# Sample English sentences for testing
sample_sentences = [
    "Good morning! How are you doing today?",
    "I am going to the market to buy some food.",
    "Please, can you help me with this?"
]

# Generate a test dataset (uncomment to run)
# test_dataset = generate_synthetic_dataset(sample_sentences, "Phi-4-Mini", "Few-Shot")
# test_dataset
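
The approach lists CSV/JSON export, which the cells above do not yet implement. A minimal sketch using only the standard library, assuming the rows are available as a list of dicts (for a DataFrame, `df.to_dict(orient="records")` produces one); `export_records` and the file names are hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path

def export_records(records, csv_path, json_path):
    """Write the same list of row dicts to CSV and JSON, both UTF-8 encoded."""
    fieldnames = list(records[0].keys()) if records else []
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
    with open(json_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Yoruba diacritics readable instead of \u escapes
        json.dump(records, f, ensure_ascii=False, indent=2)

records = [{"id": 1, "english": "Thank you", "yoruba": "Ẹ ṣé"}]
out_dir = Path(tempfile.mkdtemp())  # temporary folder so the demo leaves the working directory untouched
csv_path = out_dir / "synthetic_dataset.csv"
json_path = out_dir / "synthetic_dataset.json"
export_records(records, csv_path, json_path)
print(json_path.read_text(encoding="utf-8"))
```

With pandas, `df.to_csv(path, index=False)` and `df.to_json(path, orient="records", force_ascii=False)` are the closest one-line equivalents.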

## Gradio UI for Synthetic Data Generator

Create an interactive interface for:
1. Single text translation
2. Model and strategy selection
3. Comparison across models
4. Batch processing with export

In [None]:
# Gradio interface functions

def translate_single(english_text, target_language, model_choice, strategy):
    """Translate a single text"""
    if not english_text.strip():
        return "Please enter some text to translate."
    
    result = translate_text(english_text, target_language, model_choice, strategy)
    return result

def compare_models(english_text, target_language, strategy):
    """Compare translations from different models"""
    if not english_text.strip():
        return "Please enter some text to translate."
    
    output = f"**English:** {english_text}\n\n"
    output += f"**Target Language:** {target_language}\n"
    output += f"**Strategy:** {strategy}\n\n"
    output += "---\n\n"
    
    # Try different models (limit to 2-3 to avoid long wait times)
    models_to_test = ["Phi-4-Mini", "Gemma-270M"]
    
    for model_name in models_to_test:
        try:
            translation = translate_text(english_text, target_language, model_name, strategy)
            output += f"**{model_name}:**\n{translation}\n\n"
        except Exception as e:
            output += f"**{model_name}:** Error: {str(e)}\n\n"
    
    return output

def batch_translate(batch_text, model_choice, strategy, progress=gr.Progress()):
    """Translate multiple sentences (one per line)"""
    if not batch_text.strip():
        return None, "Please enter text to translate (one sentence per line)."
    
    sentences = [s.strip() for s in batch_text.split('\n') if s.strip()]
    
    if len(sentences) == 0:
        return None, "No valid sentences found."
    
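    # Progress callback: gr.Progress expects a numeric fraction (or None) as its
    # first argument, so pass the status message as the description instead
    def progress_fn(msg):
        progress(None, desc=msg)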
    
    df = generate_synthetic_dataset(sentences, model_choice, strategy, progress_fn)
    
    # Create summary
    summary = f"✅ Generated {len(df)} translations\n\n"
    summary += f"**Model:** {model_choice}\n"
    summary += f"**Strategy:** {strategy}\n\n"
    summary += "The full dataset is shown in the table below."
    
    return df, summary

In [None]:
# Create the Gradio Interface

with gr.Blocks(title="Synthetic Data Generator - Pidgin & Yoruba", theme=gr.themes.Soft()) as demo:
    gr.Markdown(
        """
        # 🌍 Synthetic Data Generator for Nigerian Languages
        
        Generate synthetic translation data for **Nigerian Pidgin** and **Yoruba** from English text using HuggingFace transformers.
        
        ### Features:
        - Multiple HuggingFace models (Llama, Phi, Gemma, Qwen)
        - Different translation strategies (Direct, Contextual, Few-Shot, Cultural)
        - Single or batch translation
        - Export to CSV
        
        **Note:** First use of a model will take time to download and load. Subsequent uses will be faster.
        """
    )
    
    with gr.Tabs():
        # Tab 1: Single Translation
        with gr.Tab("🔤 Single Translation"):
            with gr.Row():
                with gr.Column():
                    single_input = gr.Textbox(
                        label="English Text",
                        placeholder="Enter English text to translate...",
                        lines=5
                    )
                    with gr.Row():
                        single_lang = gr.Dropdown(
                            choices=["Nigerian Pidgin", "Yoruba"],
                            label="Target Language",
                            value="Nigerian Pidgin"
                        )
                        single_model = gr.Dropdown(
                            choices=list(MODELS.keys()),
                            label="Model",
                            value="Phi-4-Mini"
                        )
                        single_strategy = gr.Dropdown(
                            choices=["Direct", "Contextual", "Few-Shot", "Cultural"],
                            label="Strategy",
                            value="Few-Shot"
                        )
                    translate_btn = gr.Button("Translate", variant="primary")
                
                with gr.Column():
                    single_output = gr.Textbox(
                        label="Translation",
                        lines=5,
                        interactive=False
                    )
            
            translate_btn.click(
                fn=translate_single,
                inputs=[single_input, single_lang, single_model, single_strategy],
                outputs=single_output
            )
            
            gr.Examples(
                examples=[
                    ["Good morning! How are you doing today?", "Nigerian Pidgin", "Phi-4-Mini", "Few-Shot"],
                    ["I am going to the market to buy food.", "Yoruba", "Gemma-270M", "Contextual"],
                    ["Please help me with this work.", "Nigerian Pidgin", "Phi-4-Mini", "Cultural"],
                ],
                inputs=[single_input, single_lang, single_model, single_strategy],
            )
        
        # Tab 2: Compare Models
        with gr.Tab("🔄 Compare Models"):
            gr.Markdown(
                """
                Compare translations from different HuggingFace models to see which produces better results.
                """
            )
            with gr.Row():
                with gr.Column():
                    compare_input = gr.Textbox(
                        label="English Text",
                        placeholder="Enter text to see translations from multiple models...",
                        lines=4
                    )
                    with gr.Row():
                        compare_lang = gr.Dropdown(
                            choices=["Nigerian Pidgin", "Yoruba"],
                            label="Target Language",
                            value="Nigerian Pidgin"
                        )
                        compare_strategy = gr.Dropdown(
                            choices=["Direct", "Contextual", "Few-Shot", "Cultural"],
                            label="Strategy",
                            value="Few-Shot"
                        )
                    compare_btn = gr.Button("Compare Models", variant="primary")
            
            compare_output = gr.Markdown(label="Comparison Results")
            
            compare_btn.click(
                fn=compare_models,
                inputs=[compare_input, compare_lang, compare_strategy],
                outputs=compare_output
            )
        
        # Tab 3: Batch Generation
        with gr.Tab("📊 Batch Generation"):
            gr.Markdown(
                """
                ### Generate Synthetic Dataset
                Enter multiple English sentences (one per line) to generate a complete dataset 
                with translations in both Pidgin and Yoruba.
                
                ⚠️ **Note:** Batch generation may take several minutes depending on the number of sentences.
                """
            )
            
            with gr.Row():
                with gr.Column():
                    batch_input = gr.Textbox(
                        label="English Sentences (one per line)",
                        placeholder="Good morning!\nHow are you?\nI am fine.",
                        lines=10
                    )
                    with gr.Row():
                        batch_model = gr.Dropdown(
                            choices=list(MODELS.keys()),
                            label="Model",
                            value="Phi-4-Mini"
                        )
                        batch_strategy = gr.Dropdown(
                            choices=["Direct", "Contextual", "Few-Shot", "Cultural"],
                            label="Strategy",
                            value="Few-Shot"
                        )
                    generate_btn = gr.Button("Generate Dataset", variant="primary")
                
                with gr.Column():
                    batch_summary = gr.Markdown(label="Summary")
            
            batch_dataframe = gr.Dataframe(
                label="Generated Dataset",
                wrap=True,
                interactive=False
            )
            
            # Store the dataframe state
            df_state = gr.State()
            
            generate_btn.click(
                fn=batch_translate,
                inputs=[batch_input, batch_model, batch_strategy],
                outputs=[df_state, batch_summary]
            ).then(
                fn=lambda x: x,
                inputs=[df_state],
                outputs=[batch_dataframe]
            )
            
            gr.Examples(
                examples=[
                    ["Good morning!\nHow are you?\nI am fine, thank you."],
                    ["The market is open.\nI need to buy rice.\nHow much does it cost?"],
                ],
                inputs=[batch_input],
            )
    
    gr.Markdown(
        """
        ---
        ### About Translation Strategies:
        
        - **Direct**: Simple, straightforward translation
        - **Contextual**: Considers tone, emotion, and cultural context
        - **Few-Shot**: Uses examples to improve translation accuracy
        - **Cultural**: Adapts idioms and cultural references
        
        ### Supported Models:
        
        - **Llama-3.2-1B**: Meta's efficient small model (requires HuggingFace access approval)
        - **Phi-4-Mini**: Microsoft's compact but powerful model
        - **Gemma-270M**: Google's lightweight model
        - **Qwen-4B**: Alibaba's multilingual model
        
        ### Tips:
        - First model load will download weights (from under 1 GB for Gemma-270M to several GB for the 4B-class models)
        - Use GPU (Colab T4) for faster generation
        - Few-Shot is a sensible default strategy; compare it against the others on your own sentences
        - Start with small batches to test before scaling up
        """
    )

# Launch the interface
demo.launch(share=False)