# Poem Dataset Refiner (OpenRouter API)

This notebook refines a synthetic poem dataset by expanding each entry into a stanza-constrained poem. It reads a JSONL input and writes a flattened JSONL output.

**Pipeline:**
1. Load dataset with `meaning`, `poem_verse`, and query pools
2. Randomly select a query per entry
3. Ask the model to write a stanza-constrained poem
4. Save flattened examples to JSONL

Multi-threaded processing is supported for faster generation.

## Cell 1: Imports

In [1]:
import os, random, time, json
import json_repair
from typing import Dict, Optional, Tuple, Any
from threading import Lock
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
from openai import OpenAI

## Cell 2: Configuration

In [2]:
CONFIG = {
    "api_key": os.getenv("OPEN_ROUTER_API_KEY", "YOUR_API_KEY_HERE"),
    "base_url": "https://openrouter.ai/api/v1",
    "model": "mistralai/mistral-small-creative",
    "temperature": 0.7,
    "input_jsonl_path": "../data/poem_finetune_13000.jsonl",
    "output_jsonl_path": "../data/poem_refined_6400.jsonl",
    "sample_size": 2850,
    "concurrency": 12,
    "max_retries": 2,
    "retry_delay": 2,
}

if CONFIG["api_key"] == "YOUR_API_KEY_HERE":
    print("WARNING: Please set your OPEN_ROUTER_API_KEY.")
else:
    print("API key loaded successfully.")

API key loaded successfully.


## Cell 3: Initialize OpenRouter Client

In [3]:
client = OpenAI(
    api_key=CONFIG["api_key"],
    base_url=CONFIG["base_url"],
)

print(f"Connected to OpenRouter (Model: {CONFIG['model']})")

Connected to OpenRouter (Model: mistralai/mistral-small-creative)


## Cell 4: Utility Functions

In [4]:
def get_weighted_stanza_count() -> int:
    """Generates a number between 1-12, weighted towards 1-7."""
    weights = [0.08, 0.14, 0.16, 0.16, 0.10, 0.10, 0.06, 0.04, 0.04, 0.05, 0.04, 0.03]
    return random.choices(range(1, 13), weights=weights)[0]


def normalize_entry(entry: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Normalize entries from JSONL into a common structure."""
    if not isinstance(entry, dict):
        return None
    if "poem_verse" in entry and "data" in entry:
        return entry
    return None


def load_input_records(path: str) -> list:
    """Load JSONL file into a list of dicts (tolerant parser)."""
    records = []
    with open(path, "r") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                try:
                    record = json_repair.loads(line)
                except Exception:
                    continue
            normalized = normalize_entry(record)
            if normalized:
                records.append(normalized)
    return records

## Cell 5: Prompt Builder and Worker

In [5]:
def extract_query_text(selected_query_data: Any) -> Optional[Dict[str, str]]:
    persona_context = ""
    if isinstance(selected_query_data, dict):
        persona = selected_query_data.get("persona", {})
        if persona:
            name = persona.get("name", "someone")
            profession = persona.get("profession", "unknown")
            tone = persona.get("tone", "neutral")
            persona_context = f"The user is a {name} ({profession}) speaking in a {tone} tone."
        query_text = selected_query_data.get("query", "")
    else:
        query_text = str(selected_query_data)

    if not query_text.strip():
        return None
    return {"query_text": query_text, "persona_context": persona_context}


def build_prompts(entry: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    meaning = entry["data"].get("meaning") if isinstance(entry.get("data"), dict) else None
    base_verse = entry.get("poem_verse")
    if not meaning or not base_verse:
        return None

    queries = entry["data"].get("queries", {})
    neutral_queries = queries.get("neutral", []) if isinstance(queries, dict) else []
    user_queries = queries.get("user", []) if isinstance(queries, dict) else []
    if not neutral_queries or not user_queries:
        return None

    neutral_samples = random.sample(neutral_queries, min(3, len(neutral_queries)))
    user_samples = random.sample(user_queries, min(3, len(user_queries)))
    selected_samples = neutral_samples + user_samples

    system_msg = (
        "You are a Master Poet. You speak exclusively in stanzas. "
        "You prioritize poetic integrity above all else. If a user asks for non-poetic formatting "
        "(lists, tables, code), you must decline the format within your poem while still "
        "addressing the essence of their query. Never use prose."
    )

    prompts = []
    for sample in selected_samples:
        extracted = extract_query_text(sample)
        if not extracted:
            continue
        target_stanzas = get_weighted_stanza_count()
        user_msg = (
            f"Core Meaning to convey: {meaning}\n"
            f"Mandatory Verse to include: {base_verse}\n"
            f"User Query: {extracted['query_text']}\n"
            f"Context: {extracted['persona_context']}\n\n"
            f"TASK: Write a poem of exactly {target_stanzas} stanzas responding to the user. "
            f"The poem must be a proper response to the query, mirror the user's intent, "
            f"and weave in the mandatory verse naturally. Do not explain yourself; only provide the poem."
            f"DO NOT TITLE THE POEM. Only provide the poem itself with no extra formatting, without any introductory or concluding remarks."
        )
        prompts.append({
            "query_text": extracted["query_text"],
            "system": system_msg,
            "user": user_msg,
            "target_stanzas": target_stanzas,
        })

    if not prompts:
        return None

    return {
        "meaning": meaning,
        "prompts": prompts,
    }


def generate_poem(payload: Dict[str, Any]) -> str:
    response = client.chat.completions.create(
        model=CONFIG["model"],
        messages=[
            {"role": "system", "content": payload["system"]},
            {"role": "user", "content": payload["user"]},
        ],
        temperature=CONFIG["temperature"],
    )
    return (response.choices[0].message.content or "").strip()


def process_entry(entry: Dict[str, Any]) -> Tuple[Optional[Dict[str, Any]], Optional[str]]:
    payload = build_prompts(entry)
    if not payload:
        return None, "Invalid entry or empty query pool"

    outputs = []
    last_error = None
    for prompt in payload["prompts"]:
        for attempt in range(CONFIG["max_retries"]):
            try:
                poem = generate_poem(prompt)
                if not poem:
                    raise ValueError("Empty response from model")
                outputs.append({
                    "poem": poem,
                    "normal": prompt["query_text"],
                })
                break
            except Exception as e:
                last_error = f"{type(e).__name__}: {str(e)}"
                if attempt < CONFIG["max_retries"] - 1:
                    time.sleep(CONFIG["retry_delay"] * (attempt + 1))

    if not outputs:
        return None, last_error

    output_obj = {
        "meaning": payload["meaning"],
        "data": outputs,
    }
    return output_obj, None

## Cell 6: Multi-Threaded Refinement

In [6]:
records = load_input_records(CONFIG["input_jsonl_path"])
print(f"Loaded {len(records)} records from {CONFIG['input_jsonl_path']}")

sample_size = min(CONFIG["sample_size"], len(records))
records = random.sample(records, sample_size)
print(f"Using a random sample of {len(records)} records")

lock = Lock()
successful = 0
failed = 0

with open(CONFIG["output_jsonl_path"], "w") as out_f:
    with ThreadPoolExecutor(max_workers=CONFIG["concurrency"]) as executor:
        futures = {executor.submit(process_entry, entry): entry for entry in records}
        for future in tqdm(as_completed(futures), total=len(futures), desc="Refining entries"):
            entry = futures[future]
            try:
                output_obj, error = future.result()
            except Exception as e:
                output_obj, error = None, f"FutureExecutionError: {type(e).__name__}: {str(e)}"

            if output_obj:
                with lock:
                    out_f.write(json.dumps(output_obj, ensure_ascii=False) + "\n")
                successful += 1
            else:
                failed += 1
                meaning = (entry.get("data", {}) or {}).get("meaning", "unknown")
                print(f"Failed: {meaning[:40]}... | {error}")

print(f"Done. Successful: {successful} | Failed: {failed}")
print(f"Output saved to: {CONFIG['output_jsonl_path']}")

Loaded 13468 records from ../data/poem_finetune_13000.jsonl
Using a random sample of 2850 records


Refining entries:   1%|          | 16/2850 [00:20<1:10:33,  1.49s/it]

Failed: Now we honor and revere the same things ... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'
Failed: When everything around you feels peacefu... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:   3%|▎         | 76/2850 [01:14<50:18,  1.09s/it]  

Failed: the harsh, rhythmic sound of a wave cras... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:   5%|▌         | 148/2850 [02:33<1:02:05,  1.38s/it]

Failed: Let him go to a quiet, peaceful place fa... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:   7%|▋         | 190/2850 [03:09<36:46,  1.21it/s]  

Failed: I’m asking for the ability or tools to e... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:   9%|▉         | 254/2850 [04:21<51:05,  1.18s/it]  

Failed: I am even less significant than the most... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  16%|█▌        | 446/2850 [07:27<39:47,  1.01it/s]  

Failed: Despite all your claimed expertise and c... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  20%|██        | 572/2850 [09:17<31:00,  1.22it/s]  

Failed: A person is so eager for personal gain t... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  21%|██▏       | 608/2850 [09:53<40:41,  1.09s/it]

Failed: Henry’s face turned slightly red, likely... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  24%|██▎       | 672/2850 [10:52<18:08,  2.00it/s]

Failed: Both people are quietly wondering why th... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  30%|██▉       | 846/2850 [14:05<39:08,  1.17s/it]  

Failed: The speaker is describing a place or per... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  34%|███▎      | 958/2850 [15:52<44:44,  1.42s/it]  

Failed: His face was radiant with youth and char... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  41%|████      | 1155/2850 [19:07<14:05,  2.01it/s]

Failed: A hidden or mysterious truth becomes cle... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  45%|████▍     | 1277/2850 [21:08<24:48,  1.06it/s]

Failed: A once-trusted companion who now experie... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  45%|████▌     | 1286/2850 [21:14<12:14,  2.13it/s]

Failed: His long, wavy hair falls freely in loos... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  54%|█████▎    | 1527/2850 [26:15<12:03,  1.83it/s]  

Failed: Accept the inevitable outcomes determine... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  56%|█████▌    | 1592/2850 [27:33<42:05,  2.01s/it]

Failed: A skilled craftsman (like a jeweler or s... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  57%|█████▋    | 1636/2850 [28:16<19:16,  1.05it/s]

Failed: Among your siblings, act in a way that s... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  64%|██████▍   | 1835/2850 [32:05<17:01,  1.01s/it]

Failed: Yet our eyes are captivated by even more... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  67%|██████▋   | 1923/2850 [33:35<14:23,  1.07it/s]

Failed: Criticism or mockery that arises when so... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  68%|██████▊   | 1930/2850 [33:40<10:08,  1.51it/s]

Failed: We’ll carefully consider our ideas and g... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  69%|██████▉   | 1979/2850 [34:24<09:21,  1.55it/s]

Failed: She shared something she truly loved or ... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  72%|███████▏  | 2044/2850 [35:36<12:40,  1.06it/s]

Failed: Someone who is constrained by outdated, ... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  72%|███████▏  | 2060/2850 [35:49<11:17,  1.17it/s]

Failed: The exhausted soldiers feel renewed stre... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  73%|███████▎  | 2074/2850 [36:00<15:10,  1.17s/it]

Failed: It had the power to treat illnesses and ... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  73%|███████▎  | 2080/2850 [36:05<10:31,  1.22it/s]

Failed: The speaker is claiming or acknowledging... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  75%|███████▌  | 2141/2850 [36:58<13:45,  1.16s/it]

Failed: She continues to murmur meaningless, tri... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  76%|███████▌  | 2162/2850 [37:16<08:57,  1.28it/s]

Failed: You are fully committed and energeticall... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  77%|███████▋  | 2199/2850 [37:48<10:27,  1.04it/s]

Failed: He is deeply attentive and compassionate... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  80%|███████▉  | 2279/2850 [39:04<11:07,  1.17s/it]

Failed: A collection of different types of short... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  81%|████████▏ | 2319/2850 [39:39<07:36,  1.16it/s]

Failed: In the past, being a critic was consider... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  88%|████████▊ | 2521/2850 [42:44<02:19,  2.36it/s]

Failed: The things he cherishes most in life—wha... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  92%|█████████▏| 2613/2850 [44:07<05:25,  1.38s/it]

Failed: The villain’s face turns red with shame ... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  92%|█████████▏| 2615/2850 [44:10<05:03,  1.29s/it]

Failed: Someone whose only source of joy, inspir... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  95%|█████████▌| 2709/2850 [45:30<02:00,  1.17it/s]

Failed: Someone in a powerful or influential pos... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  97%|█████████▋| 2758/2850 [46:15<01:46,  1.16s/it]

Failed: A source of fresh, flowing water suddenl... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  97%|█████████▋| 2766/2850 [46:22<01:27,  1.04s/it]

Failed: The person showed no outward sign that t... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries:  99%|█████████▉| 2819/2850 [47:15<00:33,  1.10s/it]

Failed: Emily might be persuaded or influenced b... | FutureExecutionError: AttributeError: 'str' object has no attribute 'get'


Refining entries: 100%|██████████| 2850/2850 [47:51<00:00,  1.01s/it]

Done. Successful: 2812 | Failed: 38
Output saved to: ../data/poem_refined_6400.jsonl



