# Poem-to-Dataset Generator (OpenRouter API)

This notebook reads poem verses from `poem_condense.csv` and generates a semantically grounded synthetic dataset using the OpenRouter API.

**Pipeline:**
1. Load poem verses from CSV
2. For each verse, prompt an LLM to generate:
   - Modern interpretation (`meaning`)
   - Neutral queries (5 diverse prompts)
   - User queries (5 diverse persona-based prompts)
3. Save results to `poem_finetune.jsonl`
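
For orientation, each line of `poem_finetune.jsonl` holds one JSON object that pairs the verse with the generated fields. The sketch below builds one such record by hand. The field names follow the schema used in this notebook; the values are placeholders made up for illustration, not real model output.

```python
import json

# Hypothetical record: field names follow the notebook's schema, values are placeholders.
record = {
    "poem_verse": "<original verse text>",
    "data": {
        "meaning": "<modern paraphrase>",
        "queries": {
            "neutral": [f"<neutral prompt {i}>" for i in range(1, 6)],
            "user": [f"<persona prompt {i}>" for i in range(1, 6)],
        },
    },
}

# JSONL stores exactly one JSON object per line
line = json.dumps(record)
restored = json.loads(line)
```

Serializing with `json.dumps` keeps each record on a single line, so the file can be read back line by line.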

## Cell 1: Imports

In [None]:
import os
import json
import time
import pandas as pd
from openai import OpenAI
from tqdm import tqdm
from typing import Dict, List, Optional
import re

## Cell 2: Configuration

**Important:** Set your OpenRouter API key in the environment variable `OPEN_ROUTER_API_KEY` or directly in the config below.

In [None]:
# Configuration
CONFIG = {
    "api_key": os.getenv("OPEN_ROUTER_API_KEY", "YOUR_API_KEY_HERE"),
    "base_url": "https://openrouter.ai/api/v1",
    "model": "mistralai/mistral-small-creative",
    "temperature": 0.7,
    "max_tokens": 10_000,
    "x_title": "Poem Fine-Tuning Data Generator",
    "input_file": "../data/poem_condense.csv",
    "output_file": "../data/poem_finetune.jsonl",
    "max_retries": 3,
    "retry_delay": 2,  # seconds
}

# Validate API Key
if CONFIG["api_key"] == "YOUR_API_KEY_HERE":
    print("⚠️  WARNING: Please set your OPEN_ROUTER_API_KEY!")
else:
    print("✅ API Key loaded successfully")

## Cell 3: Initialize OpenAI Client

In [None]:
# Initialize OpenRouter client
client = OpenAI(
    base_url=CONFIG["base_url"],
    api_key=CONFIG["api_key"],
)

print(f"🔌 Connected to OpenRouter (Model: {CONFIG['model']})")

## Cell 4: API Utility Functions

In [None]:
def generate_dataset_entry(poem_verse: str, retries: int = CONFIG["max_retries"]) -> Optional[Dict]:
    """
    Generate a dataset entry for a given poem verse using the LLM.
    
    Args:
        poem_verse: The original poem verse
        retries: Number of retry attempts
    
    Returns:
        Dictionary with meaning, neutral queries, and user queries, or None if failed
    """
    prompt = f"""I will provide a poem verse from Project Gutenberg. Return a JSON object with:

* `meaning`: A modern, clear, style-neutral version of the sentence.
* `queries`: A dictionary with two keys:
  * `neutral`: A list of strings of 5 diverse prompts in plain English that would naturally result in the target meaning.
  * `user`: A list of strings of 5 diverse prompts where the user adopts a specific persona or context (e.g., a modern student, a historian, or a casual seeker) that would trigger the assistant to answer using the provided poem verse.

**Constraint:** Ensure prompts vary in length from a single short sentence to a detailed paragraph.

**Poem Verse:**
"{poem_verse}"

Return ONLY the JSON object, no additional text."""

    system_prompt = """You are an expert literary analyst and synthetic dataset architect specializing in LLM fine-tuning data generation. 

Your task is to transform classical poetry into high-quality training examples for language models. You must:

1. **Semantic Grounding**: Extract the core meaning from archaic or poetic language and express it in clear, contemporary terms.

2. **Query Diversity**: Generate prompts that vary significantly in:
   - Length (from 5 words to 100+ words)
   - Complexity (simple questions to nuanced scenarios)
   - Formality (casual to academic)
   - Specificity (general inquiries to precise requests)

3. **Persona Variation**: For user queries, randomly invent diverse personas with varying backgrounds, intentions, and contexts. Be creative and unpredictable; avoid repeating similar persona types. Each persona should feel unique and authentic. Think broadly: different ages, professions, cultural contexts, emotional states, levels of expertise, communication styles, and reasons for asking.
   
4. **Naturalness**: Ensure all queries sound like authentic human requests that would organically lead to the target response. Vary sentence structure, vocabulary, and tone across all queries.

5. **Format Compliance**: Return ONLY a valid JSON object with the exact schema requested. No markdown formatting, no explanatory text."""
    last_error = None
    last_raw_response = None
    
    for attempt in range(retries):
        try:
            response = client.chat.completions.create(
                model=CONFIG["model"],
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": prompt}
                ],
                temperature=CONFIG["temperature"],
                max_tokens=CONFIG["max_tokens"],
            )
            
            raw_response = response.choices[0].message.content.strip()
            last_raw_response = raw_response
            
            # Parse JSON (handle markdown code blocks)
            parsed = parse_json_response(raw_response)
            
            if parsed and validate_response_structure(parsed):
                return {
                    "poem_verse": poem_verse,
                    "data": parsed
                }
            else:
                last_error = f"Invalid response structure: {parsed}"
                print(f"⚠️  Invalid response structure (attempt {attempt + 1}/{retries})")
                
        except Exception as e:
            last_error = f"{type(e).__name__}: {str(e)}"
            print(f"❌ Error on attempt {attempt + 1}/{retries}: {last_error}")
            if attempt < retries - 1:
                time.sleep(CONFIG["retry_delay"] * (attempt + 1))  # Linear backoff: delay grows with each attempt
    
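    # All attempts failed: report the error context before giving up
    error_context = {
        "poem_verse": poem_verse,
        "last_error": last_error,
        "last_raw_response": last_raw_response,
        "timestamp": time.time()
    }
    print(f"❌ Giving up after {retries} attempts: {error_context['last_error']}")
    
    return None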


def parse_json_response(raw_response: str) -> Optional[Dict]:
    """
    Parse JSON from LLM response, handling markdown code blocks.
    """
    # Remove markdown code blocks if present
    raw_response = re.sub(r'^```json\s*', '', raw_response, flags=re.MULTILINE)
    raw_response = re.sub(r'^```\s*', '', raw_response, flags=re.MULTILINE)
    raw_response = raw_response.strip()
    
    try:
        return json.loads(raw_response)
    except json.JSONDecodeError as e:
        print(f"⚠️  JSON Parse Error: {str(e)}")
        print(f"Raw response preview: {raw_response[:200]}...")
        return None


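def validate_response_structure(data: Dict) -> bool:
    """
    Validate that the response has the required structure.
    """
    # json.loads may return a list or scalar, so check the top-level type first
    if not isinstance(data, dict):
        return False

    required_keys = ["meaning", "queries"]
    if not all(key in data for key in required_keys):
        return False
    
    queries = data.get("queries", {})
    if not isinstance(queries, dict):
        return False

    # Both query lists must be present and non-empty
    for key in ("neutral", "user"):
        if not isinstance(queries.get(key), list) or not queries[key]:
            return False

    return True


print("✅ Utility functions loaded")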


## Cell 5: Load Input Data

In [None]:
# Load poem verses from CSV
df = pd.read_csv(CONFIG["input_file"])

print(f"📊 Loaded {len(df)} poem verses from {CONFIG['input_file']}")
print(f"Columns: {list(df.columns)}")
print(f"\nFirst few rows:")
print(df.head())

# Identify the column containing poem text
# Adjust this based on your CSV structure
if 'verse' in df.columns:
    poem_column = 'verse'
elif 'text' in df.columns:
    poem_column = 'text'
elif 'poem' in df.columns:
    poem_column = 'poem'
else:
    poem_column = df.columns[1]  # Fallback: second column (index 1); use df.columns[0] if the verses are in the first column
    
print(f"\n🎯 Using column '{poem_column}' for poem verses")

## Cell 6: Main Processing Loop

This cell processes each poem verse and saves results incrementally to prevent data loss.
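
Because results are appended one line at a time, an interrupted run leaves a usable partial file. The sketch below shows how a rerun could find verses that are already done, assuming the output keeps the `poem_verse` field used in this notebook. `load_processed_verses` is a hypothetical helper, not part of the notebook, and using it would also mean not clearing the output file at the start of the cell.

```python
import json
import os
from typing import Set


def load_processed_verses(path: str) -> Set[str]:
    """Return the set of verses already present in a JSONL output file."""
    processed: Set[str] = set()
    if not os.path.exists(path):
        return processed
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                processed.add(json.loads(line)["poem_verse"])
            except (json.JSONDecodeError, KeyError, TypeError):
                # A partially written or malformed line is skipped rather than aborting
                continue
    return processed
```

Inside the processing loop, a check such as `if str(poem_verse) in processed: continue` would then skip completed verses.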

In [None]:
# Initialize output file (clear if exists)
with open(CONFIG["output_file"], 'w') as f:
    pass

# Tracking
successful = 0
failed = 0
failed_verses = []

# Process each poem verse
for idx, row in tqdm(df.iterrows(), total=len(df), desc="Processing verses"):
    poem_verse = row[poem_column]
    
    # Skip empty verses
    if pd.isna(poem_verse) or not str(poem_verse).strip():
        continue
    
    # Generate dataset entry
    result = generate_dataset_entry(str(poem_verse))
    
    if result:
        # Append to JSONL file (line-by-line to prevent data loss)
        with open(CONFIG["output_file"], 'a') as f:
            f.write(json.dumps(result) + '\n')
        successful += 1
    else:
        failed += 1
        failed_verses.append({
            "index": idx,
            "verse": poem_verse
        })
    
    # Rate limiting (optional)
    # time.sleep(0.5)

print(f"\n{'='*50}")
print(f"✅ Processing Complete!")
print(f"{'='*50}")
print(f"Successful: {successful}")
print(f"Failed: {failed}")
print(f"Output saved to: {CONFIG['output_file']}")

if failed_verses:
    print(f"\n⚠️  {len(failed_verses)} verses failed to process:")
    for item in failed_verses[:5]:  # Show first 5
        print(f"  - Index {item['index']}: {str(item['verse'])[:50]}...")

## Cell 7: Validate Output

Load and inspect the generated dataset.

In [None]:
# Load generated dataset
generated_data = []
with open(CONFIG["output_file"], 'r') as f:
    for line in f:
        generated_data.append(json.loads(line))

print(f"📊 Total entries in output file: {len(generated_data)}")

if generated_data:
    print(f"\n{'='*50}")
    print("Sample Entry:")
    print(f"{'='*50}")
    sample = generated_data[0]
    print(f"Poem Verse: {sample['poem_verse']}\n")
    print(f"Meaning: {sample['data']['meaning']}\n")
    print(f"Neutral Queries ({len(sample['data']['queries']['neutral'])}):")
    for i, q in enumerate(sample['data']['queries']['neutral'], 1):
        print(f"  {i}. {q}")
    print(f"\nUser Queries ({len(sample['data']['queries']['user'])}):")
    for i, q in enumerate(sample['data']['queries']['user'], 1):
        print(f"  {i}. {q}")

## Cell 8: Export Statistics

Generate basic statistics about the generated dataset.
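
The word-count summary can also be factored into a small reusable helper. This is a sketch using the standard-library `statistics` module; `summarize_lengths` is a hypothetical name, and lengths are counted by whitespace splitting, as in the cell below.

```python
import statistics
from typing import Dict, List


def summarize_lengths(queries: List[str]) -> Dict[str, float]:
    """Summarize query lengths in whitespace-separated words."""
    lengths = [len(q.split()) for q in queries]
    if not lengths:
        # Avoid division by zero and min()/max() errors on empty input
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0, "median": 0.0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
    }
```

Note that `statistics.median` averages the two middle values for an even number of queries, whereas indexing the sorted list at `len(lengths)//2` returns the upper of the two.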

In [None]:
if generated_data:
    # Collect all queries across entries
    all_neutral_queries = [q for entry in generated_data for q in entry['data']['queries']['neutral']]
    all_user_queries = [q for entry in generated_data for q in entry['data']['queries']['user']]
    
    # Count the queries actually present instead of assuming 5 neutral + 5 user per entry
    total_queries = len(all_neutral_queries) + len(all_user_queries)
    
    # Average query lengths (in words)
    
    avg_neutral_length = sum(len(q.split()) for q in all_neutral_queries) / len(all_neutral_queries)
    avg_user_length = sum(len(q.split()) for q in all_user_queries) / len(all_user_queries)
    
    print(f"{'='*50}")
    print("Dataset Statistics")
    print(f"{'='*50}")
    print(f"Total Entries: {len(generated_data)}")
    print(f"Total Queries Generated: {total_queries}")
    print(f"Average Neutral Query Length: {avg_neutral_length:.1f} words")
    print(f"Average User Query Length: {avg_user_length:.1f} words")
    print(f"\nQuery Length Distribution (Neutral):")
    lengths = [len(q.split()) for q in all_neutral_queries]
    print(f"  Min: {min(lengths)} words")
    print(f"  Max: {max(lengths)} words")
    print(f"  Median: {sorted(lengths)[len(lengths)//2]} words")
else:
    print("⚠️  No data generated")

## Next Steps

1. **Review Quality:** Manually inspect a sample of the generated queries to ensure quality.
2. **Convert to Training Format:** Transform this data into the format required by your fine-tuning pipeline (e.g., conversational format for chat models).
3. **Split Dataset:** Create train/validation/test splits.
4. **Train Model:** Use the generated dataset in `02_Trainer_Arena.ipynb`.
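
Steps 2 and 3 could be sketched roughly as follows. Everything here is an assumption rather than the format `02_Trainer_Arena.ipynb` expects: the `messages` layout with `role`/`content` keys is a common chat convention, the pairing of neutral queries with `meaning` and user queries with the verse is inferred from the prompt wording in Cell 4, and `to_chat_examples` and `split_entries` are hypothetical helpers.

```python
import random
from typing import Dict, List, Tuple


def to_chat_examples(entry: Dict) -> List[Dict]:
    """Turn one generated entry into chat-style examples, one per query."""
    queries = entry["data"]["queries"]
    # Assumption: neutral queries target the paraphrase, user queries target the verse
    targets = {"neutral": entry["data"]["meaning"], "user": entry["poem_verse"]}
    examples = []
    for kind, target in targets.items():
        for query in queries.get(kind, []):
            examples.append({
                "messages": [
                    {"role": "user", "content": query},
                    {"role": "assistant", "content": target},
                ]
            })
    return examples


def split_entries(
    entries: List[Dict], val_frac: float = 0.1, test_frac: float = 0.1, seed: int = 0
) -> Tuple[List[Dict], List[Dict], List[Dict]]:
    """Shuffle and split at the entry level so one verse never spans two splits."""
    shuffled = entries[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test
```

Splitting before converting to chat examples keeps all queries derived from the same verse in the same split, which avoids leaking near-duplicate targets into validation or test data.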