# Synthetic Evaluation Question Generation with NeMo SDG

### Platform: build.nvidia.com

---

In this notebook, you will:
1. Learn about Synthetic Data Generation (SDG)
2. Use NeMo Curator SDG to generate evaluation questions in Hindi, Telugu, and Kannada
3. Create domain-specific multiple-choice questions (MCQs)
4. Generate questions for History, Philosophy, or other subjects
5. Validate the generated questions

The goal is to generate high-quality synthetic training and evaluation data.


## Setup 

In [8]:
# Install required packages
# !pip install -q openai tqdm pandas matplotlib


In [9]:
import os
from openai import OpenAI

# Set your NVIDIA API key for LLM backend
NVIDIA_API_KEY = ""  # Replace with your key

# Or use environment variable
if "nvapi-" not in NVIDIA_API_KEY or NVIDIA_API_KEY == "nvapi-YOUR_KEY_HERE":
    NVIDIA_API_KEY = os.getenv("NVIDIA_API_KEY", "")
    if not NVIDIA_API_KEY:
        print("⚠️  Please set your NVIDIA_API_KEY!")
    else:
        print("✅ API key loaded from environment")
else:
    print("✅ API key set")

# Initialize client for NeMo SDG
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=NVIDIA_API_KEY
)

print("\n🔧 NVIDIA API client initialized for SDG!")


✅ API key set

🔧 NVIDIA API client initialized for SDG!


## Section 2: Configure SDG Pipeline (5 minutes)

Choose a domain and configure the synthetic data generation pipeline.

**Domains to choose from:**
- History (भारतीय इतिहास)
- Philosophy (दर्शन)
- Sanskrit (संस्कृत)
- Law (कानून)
- Science (विज्ञान)


In [10]:
# Choose domain and provide reference content
DOMAIN_EN = "Indian Independence Movement"

# Provide domain context in English (the generator writes the questions in each target language)
DOMAIN_CONTEXT_EN = """
Indian Independence Movement (1857-1947):

Key Events:
- 1857: First War of Independence (Sepoy Mutiny)
- 1885: Formation of Indian National Congress
- 1905: Partition of Bengal and Swadeshi Movement
- 1919: Jallianwala Bagh Massacre
- 1920-22: Non-Cooperation Movement (Mahatma Gandhi)
- 1930: Salt Satyagraha (Dandi March)
- 1942: Quit India Movement
- 1947: August 15 - India gains independence

Major Leaders:
- Mahatma Gandhi: Pioneer of non-violence and Satyagraha
- Jawaharlal Nehru: First Prime Minister
- Subhas Chandra Bose: Leader of Azad Hind Fauj
- Sardar Patel: Key role in unification of India
- Bhagat Singh, Rajguru, Sukhdev: Revolutionary martyrs
- Dr. B.R. Ambedkar: Architect of Constitution

Movement Characteristics:
- Non-violence and Satyagraha
- Civil disobedience
- Swadeshi and Boycott
- Peasant and labor movements
"""

# Configure multiple languages
LANGUAGES = {
    "hindi": "हिंदी",
    "telugu": "తెలుగు",
    "kannada": "ಕನ್ನಡ"
}

print(f"🎯 Domain: {DOMAIN_EN}")
print(f"📚 Context provided: {len(DOMAIN_CONTEXT_EN)} characters")
print(f"🌐 Languages: {', '.join(LANGUAGES.keys())}")

# Configure models: generator (Gemma 2 27B) and evaluator (Llama 3.1 8B)
GENERATOR_MODEL = "google/gemma-2-27b-it"  # Gemma 2 27B generates the questions
EVALUATOR_MODEL = "meta/llama-3.1-8b-instruct"   # A different model answers them

NUM_QUESTIONS_PER_LANGUAGE = 7  # 7 questions per language = 21 total

print(f"\n⚙️  Configuration:")
print(f"   Generator model: {GENERATOR_MODEL}")
print(f"   Evaluator model: {EVALUATOR_MODEL}")
print(f"   Questions per language: {NUM_QUESTIONS_PER_LANGUAGE}")
print(f"   Total questions: {NUM_QUESTIONS_PER_LANGUAGE * len(LANGUAGES)}")

print("\n✅ Configuration complete!")


🎯 Domain: Indian Independence Movement
📚 Context provided: 844 characters
🌐 Languages: hindi, telugu, kannada

⚙️  Configuration:
   Generator model: google/gemma-2-27b-it
   Evaluator model: meta/llama-3.1-8b-instruct
   Questions per language: 7
   Total questions: 21

✅ Configuration complete!


## Section 3: Generate Questions in Multiple Languages (15 minutes)

Use Gemma 2 27B (`google/gemma-2-27b-it`) to generate multiple-choice questions in Hindi, Telugu, and Kannada.
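
Each question costs one API call, and hosted endpoints can rate-limit or fail transiently. The following is a minimal sketch of retrying with exponential backoff; `call_with_retries` and its parameters are hypothetical names introduced here for illustration, not part of the notebook or of any NVIDIA or OpenAI library. It catches every `Exception` for brevity, whereas real code would more likely retry only on the client's transient errors.

```python
import random
import time


def call_with_retries(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts is reached.

    Waits base_delay * 2**attempt seconds (plus a little jitter) between
    tries. `sleep` is injectable so the helper can be exercised without
    actually waiting.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)


# Stand-in for an API call (assumption): fails twice, then succeeds.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient error")
    return "ok"


waits = []  # record the requested delays instead of sleeping
result = call_with_retries(flaky_call, sleep=waits.append)
print(result, attempts["count"], len(waits))
```

In the notebook, the function passed in would be a zero-argument wrapper around the `client.chat.completions.create(...)` call, for example a `lambda`.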


In [11]:
import json
from tqdm import tqdm
import time

def generate_mcq_question(client, model, domain_context, domain_name, language, language_name):
    """Generate one MCQ question using the LLM in specified language"""
    
    prompt = f"""You are an expert question creator. Create a graduate-level multiple choice question (MCQ) on {domain_name} based on the following context.

Context:
{domain_context}

Instructions:
1. Write the question in {language_name} ({language})
2. Four realistic options (A, B, C, D) in {language_name}
3. Only one correct answer
4. Question should be challenging but related to the context
5. Provide explanation in {language_name}

Respond in JSON format ONLY (nothing else):
{{
  "question_text": "Question here in {language_name}",
  "option_a": "Option A in {language_name}",
  "option_b": "Option B in {language_name}",
  "option_c": "Option C in {language_name}",
  "option_d": "Option D in {language_name}",
  "correct_answer": "A",
  "explanation": "Explanation in {language_name}",
  "language": "{language}"
}}"""
    
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
            max_tokens=600,
        )
        
        # Parse JSON response
        content = response.choices[0].message.content.strip()
        
        # Extract JSON (handle markdown code blocks)
        if "```json" in content:
            content = content.split("```json")[1].split("```")[0].strip()
        elif "```" in content:
            content = content.split("```")[1].split("```")[0].strip()
        
        question_data = json.loads(content)
        question_data['language'] = language
        return question_data
        
    except Exception as e:
        print(f"   ⚠️  Error generating {language} question: {e}")
        return None

# Generate questions for all languages
print("🔄 Generating MCQ questions in multiple Indian languages...\n")
print(f"Using model: {GENERATOR_MODEL}")
print("=" * 80)

generated_questions = []

for language, language_name in LANGUAGES.items():
    print(f"\n🌐 Generating {NUM_QUESTIONS_PER_LANGUAGE} questions in {language_name} ({language})...")
    
    for i in tqdm(range(NUM_QUESTIONS_PER_LANGUAGE), desc=language.capitalize(), leave=True):
        question = generate_mcq_question(
            client,
            GENERATOR_MODEL,
            DOMAIN_CONTEXT_EN,
            DOMAIN_EN,
            language,
            language_name
        )
        
        if question:
            # IDs use the first two letters of the language key, e.g. synth_hi_001
            question['question_id'] = f"synth_{language[:2]}_{i+1:03d}"
            question['subject'] = DOMAIN_EN
            generated_questions.append(question)
        
        time.sleep(1)  # Fixed pause between requests as simple rate limiting

# Report totals, then a per-language breakdown
print(f"\n✅ Generated {len(generated_questions)} questions across {len(LANGUAGES)} languages!")
for language in LANGUAGES:
    count = sum(1 for q in generated_questions if q.get('language') == language)
    print(f"   {language.capitalize()}: {count}")


🔄 Generating MCQ questions in multiple Indian languages...

Using model: google/gemma-2-27b-it

🌐 Generating 7 questions in हिंदी (hindi)...

Hindi: 100%|██████████| 7/7 [00:46<00:00,  6.67s/it]

🌐 Generating 7 questions in తెలుగు (telugu)...

Telugu: 100%|██████████| 7/7 [00:49<00:00,  7.12s/it]

🌐 Generating 7 questions in ಕನ್ನಡ (kannada)...

Kannada: 100%|██████████| 7/7 [01:03<00:00,  9.05s/it]

✅ Generated 21 questions across 3 languages!
   Hindi: 7
   Telugu: 7
   Kannada: 7
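
The generation function above strips Markdown fences by splitting on backticks before calling `json.loads`. A possible alternative, sketched here under the assumption that the reply contains one JSON object somewhere in the text, is to let `json.JSONDecoder.raw_decode` find where the object ends. The helper name `extract_json_object` is hypothetical and not part of the notebook.

```python
import json


def extract_json_object(text):
    """Return the first JSON object found in text, or None if none parses.

    Tries each '{' in turn and lets raw_decode determine where the object
    ends, so surrounding Markdown fences or commentary are ignored.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None


# Example reply (made up for illustration) wrapped in a fence plus chatter.
fence = "`" * 3
reply = (
    "Here is the question:\n"
    + fence + "json\n"
    + '{"question_text": "Q?", "correct_answer": "B"}\n'
    + fence + "\nHope this helps!"
)
parsed = extract_json_object(reply)
print(parsed)
print(extract_json_object("no JSON here"))
```

Returning `None` instead of raising mirrors how `generate_mcq_question` already signals a failed generation, so the calling loop would not need to change.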





## Section 3.5: View Generated Questions (5 minutes)

Display the generated questions to see what was created.
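
The display code reads fields such as `question_text` and `option_a` directly, so a record with a missing key would raise `KeyError`. The following is a minimal sketch of a structural check that could run first; `REQUIRED_FIELDS` and `validate_question` are hypothetical names, and the rules (non-empty strings, an answer key of A to D, distinct options) are assumptions about what a well-formed record looks like. It cannot tell whether the keyed answer is factually correct.

```python
REQUIRED_FIELDS = (
    "question_text",
    "option_a",
    "option_b",
    "option_c",
    "option_d",
    "correct_answer",
)


def validate_question(q):
    """Return a list of structural problems in one question dict (empty if none)."""
    problems = []

    # Every required field must be a non-empty string.
    for field in REQUIRED_FIELDS:
        value = q.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")

    # The answer key must name one of the four options.
    answer = q.get("correct_answer")
    if isinstance(answer, str) and answer.strip():
        if answer.strip().upper() not in {"A", "B", "C", "D"}:
            problems.append(f"correct_answer is not A-D: {answer!r}")

    # Options that are present should all differ from one another.
    options = [q.get(f"option_{letter}") for letter in "abcd"]
    texts = [o.strip() for o in options if isinstance(o, str) and o.strip()]
    if len(set(texts)) < len(texts):
        problems.append("duplicate options")

    return problems


# Made-up records for illustration.
good = {
    "question_text": "Q?",
    "option_a": "1857", "option_b": "1885",
    "option_c": "1919", "option_d": "1947",
    "correct_answer": "D",
}
bad = {
    "question_text": "Q?",
    "option_a": "1857", "option_b": "1857",
    "option_c": "1919",
    "correct_answer": "E",
}
print(validate_question(good))
print(validate_question(bad))
```

Applied to the notebook's data, the check would be a filter such as `[q for q in generated_questions if not validate_question(q)]`.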


In [12]:
# Display sample questions from each language
print("📝 Sample Generated Questions:\n")
print("=" * 80)

for language, language_name in LANGUAGES.items():
    lang_questions = [q for q in generated_questions if q.get('language') == language]
    
    if lang_questions:
        print(f"\n🌐 {language_name.upper()} ({language.upper()}):")
        print("-" * 80)
        
        sample = lang_questions[0]
        print(f"\n❓ Question: {sample['question_text']}")
        print(f"\nOptions:")
        print(f"  A) {sample['option_a']}")
        print(f"  B) {sample['option_b']}")
        print(f"  C) {sample['option_c']}")
        print(f"  D) {sample['option_d']}")
        print(f"\n✅ Correct Answer: {sample['correct_answer']}")
        
        if 'explanation' in sample and sample['explanation']:
            print(f"📖 Explanation: {sample['explanation']}")
        
        print("\n" + "=" * 80)

print(f"\n💡 Total questions generated: {len(generated_questions)}")


📝 Sample Generated Questions:


🌐 हिंदी (HINDI):
--------------------------------------------------------------------------------

❓ Question: निम्नलिखित में से कौन सी घटना भारतीय स्वतंत्रता संग्राम में 'सविनय अवज्ञा आंदोलन' का प्रत्यक्ष परिणाम थी?

Options:
  A) जलियाँवाला बाग हत्याकांड
  B) बंगाल का विभाजन
  C) दांडी मार्च
  D) भारत छोड़ो आंदोलन

✅ Correct Answer: A
📖 Explanation: जलियाँवाला बाग हत्याकांड (1919) ने भारतीय जनता में अंग्रेजों के प्रति अविश्वास और क्रोध को बढ़ावा दिया, जिससे महात्म