From b7a9d82f00e1dea8a91f200f2ac3f57b3546cc44 Mon Sep 17 00:00:00 2001 From: Asankhaya Sharma Date: Wed, 30 Jul 2025 01:41:14 +0800 Subject: [PATCH 01/12] Refactor and rename LLM prompt optimization example Renamed the example directory from 'llm_prompt_optimazation' to 'llm_prompt_optimization' for correct spelling and consistency. Updated all example files, added HuggingFace dataset support, new configuration files, and improved documentation. The new example supports prompt evolution for any HuggingFace dataset with custom templates and cascading evaluation. Removed the old example files and replaced them with the new structure. --- README.md | 68 ++- examples/llm_prompt_optimazation/README.md | 184 ------- .../llm_prompt_optimazation/best_program.txt | 19 - examples/llm_prompt_optimazation/config.yaml | 58 -- examples/llm_prompt_optimazation/data.json | 510 ------------------ examples/llm_prompt_optimazation/evaluator.py | 196 ------- .../initial_prompt.txt | 11 - .../llm_prompt_optimazation/requirements.txt | 2 - examples/llm_prompt_optimazation/run.sh | 4 - examples/llm_prompt_optimization/README.md | 231 ++++++++ examples/llm_prompt_optimization/config.yaml | 69 +++ examples/llm_prompt_optimization/dataset.yaml | 8 + examples/llm_prompt_optimization/evaluator.py | 272 ++++++++++ .../examples/ag_news_dataset.yaml | 8 + .../examples/ag_news_prompt.txt | 9 + .../examples/emotion_dataset.yaml | 8 + .../examples/emotion_prompt.txt | 11 + .../initial_prompt.txt | 5 + .../llm_prompt_optimization/requirements.txt | 4 + .../templates/full_rewrite_user.txt | 20 + 20 files changed, 711 insertions(+), 986 deletions(-) delete mode 100644 examples/llm_prompt_optimazation/README.md delete mode 100644 examples/llm_prompt_optimazation/best_program.txt delete mode 100644 examples/llm_prompt_optimazation/config.yaml delete mode 100644 examples/llm_prompt_optimazation/data.json delete mode 100644 examples/llm_prompt_optimazation/evaluator.py delete mode 100644 examples/llm_prompt_optimazation/initial_prompt.txt delete mode 100644 examples/llm_prompt_optimazation/requirements.txt delete mode 100644 examples/llm_prompt_optimazation/run.sh create mode 100644 examples/llm_prompt_optimization/README.md create mode 100644 examples/llm_prompt_optimization/config.yaml create mode 100644 examples/llm_prompt_optimization/dataset.yaml create mode 100644 examples/llm_prompt_optimization/evaluator.py create mode 100644 examples/llm_prompt_optimization/examples/ag_news_dataset.yaml create mode 100644 examples/llm_prompt_optimization/examples/ag_news_prompt.txt create mode 100644 examples/llm_prompt_optimization/examples/emotion_dataset.yaml create mode 100644 examples/llm_prompt_optimization/examples/emotion_prompt.txt create mode 100644 examples/llm_prompt_optimization/initial_prompt.txt create mode 100644 examples/llm_prompt_optimization/requirements.txt create mode 100644 examples/llm_prompt_optimization/templates/full_rewrite_user.txt diff --git a/README.md b/README.md index dd00115f1..a20d1665a 100644 --- a/README.md +++ b/README.md @@ -288,12 +288,72 @@ prompt: num_top_programs: 3 # Performance examples num_diverse_programs: 2 # Creative inspiration include_artifacts: true # Execution feedback + + # Template customization + template_dir: null # Directory for custom prompt templates + use_template_stochasticity: true # Enable random variations in prompts + template_variations: {} # Define variation placeholders ``` Sample configuration files are available in the `configs/` directory: - `default_config.yaml`: 
Comprehensive configuration with all available options - `island_config_example.yaml`: Advanced island-based evolution setup +### Template Customization + +OpenEvolve supports advanced prompt template customization to increase diversity in code evolution: + +#### Custom Templates with `template_dir` + +You can override the default prompt templates by providing custom ones: + +```yaml +prompt: + template_dir: "path/to/your/templates" +``` + +Create `.txt` files in your template directory with these names: +- `diff_user.txt` - Template for diff-based evolution +- `full_rewrite_user.txt` - Template for full code rewrites +- `evolution_history.txt` - Format for presenting evolution history +- `top_program.txt` - Format for top-performing programs +- `previous_attempt.txt` - Format for previous attempts + +See these directories for complete examples of custom templates: +- `examples/lm_eval/prompts/` - Custom templates for evaluation tasks +- `examples/llm_prompt_optimization/templates/` - Templates for evolving prompts instead of code + +#### Template Variations with Stochasticity + +To add randomness to your prompts and prevent getting stuck in local optima: + +1. **Enable stochasticity** in your config: +```yaml +prompt: + use_template_stochasticity: true + template_variations: + greeting: + - "Let's improve this code." + - "Time to enhance this program." + - "Here's how we can optimize:" + analysis_intro: + - "Current metrics show" + - "Performance analysis indicates" + - "The evaluation reveals" +``` + +2. **Use variation placeholders** in your custom templates: +``` +# custom_template.txt +{greeting} +{analysis_intro} the following results: +{metrics} +``` + +The system will randomly select one variation for each placeholder during prompt generation, creating diverse prompts that can lead to more creative code evolutions. + +**Note**: The default templates don't include variation placeholders, so you'll need to create custom templates to use this feature effectively. + ### Feature Dimensions in MAP-Elites Feature dimensions control how programs are organized in the MAP-Elites quality-diversity grid: @@ -425,8 +485,12 @@ Demonstrates integration with [optillm](https://github.com/codelion/optillm) for - **Mixture of Agents (MoA)**: Multi-response synthesis for improved accuracy - **Local model optimization**: Enhanced reasoning with smaller models -#### [LLM Prompt Optimization](examples/llm_prompt_optimazation/) -Evolving prompts themselves for better LLM performance, demonstrating self-improving AI systems. +#### [LLM Prompt Optimization](examples/llm_prompt_optimization/) +Evolving prompts for better LLM performance on HuggingFace datasets. Features: +- Custom templates for evolving prompts instead of code +- Two-stage cascading evaluation for efficiency +- Support for any HuggingFace dataset +- Automatic prompt improvement through evolution ### Systems & Performance Optimization diff --git a/examples/llm_prompt_optimazation/README.md b/examples/llm_prompt_optimazation/README.md deleted file mode 100644 index c207a0084..000000000 --- a/examples/llm_prompt_optimazation/README.md +++ /dev/null @@ -1,184 +0,0 @@ -# Evolving Better Prompts with OpenEvolve 🧠✨ - -This example shows how to use **OpenEvolve** to automatically optimize prompts for **Large Language Models (LLMs)**. Whether you're working on classification, summarization, generation, or code tasks, OpenEvolve helps you find high-performing prompts using **evolutionary search**. 
For this example we'll use syntihetic data for sentiment analysis task, but you can adapt it to your own datasets and tasks. - ---- - -## 🎯 What Is Prompt Optimization? - -Prompt engineering is key to getting reliable outputs from LLMsβ€”but finding the right prompt manually can be slow and inconsistent. - -OpenEvolve automates this by: - -* Generating and evolving prompt variations -* Testing them against your task and metrics -* Selecting the best prompts through generations - -You start with a simple prompt and let OpenEvolve evolve it into something smarter and more effective. - ---- - -## πŸš€ Getting Started - -### 1. Install Dependencies - -```bash -cd examples/llm_prompt_optimazation -pip install -r requirements.txt -sh run.sh -``` - -### 2. Add Your models - -1. Update your `config.yaml`: - -```yaml -llm: - primary_model: "llm_name" - api_base: "llm_server_url" - api_key: "your_api_key_here" -``` - -2. Update your task-model in `evaluator.py`: - -```python -TASK_MODEL_NAME = "task_llm_name" -TASK_MODEL_URL = "task_llm_server_url" -TASK_MODEL_API_KEY = "your_api_key_here" -SAMPLE_SIZE = 25 # Number of samples to use for evaluation -MAX_RETRIES = 3 # Number of retries for LLM calls - -``` - -### 3. Run OpenEvolve - -```bash -sh run.sh -``` - ---- - -## πŸ”§ How to Adapt This Template - -### 1. Replace the Dataset - -Edit `data.json` to match your use case: - -```json -[ - { - "id": 1, - "input": "Your input here", - "expected_output": "Target output" - } -] -``` - -### 2. Customize the Evaluator - -In `evaluator.py`, define how to evaluate a prompt: - -* Load your data -* Call the LLM using the prompt -* Measure output quality (accuracy, score, etc.) - -### 3. Write Your Initial Prompt - -Create a basic starting prompt in `initial_prompt.txt`: - -``` -# EVOLVE-BLOCK-START -Your task prompt using {input_text} as a placeholder. -# EVOLVE-BLOCK-END -``` - -This is the part OpenEvolve will improve over time. -Good to add the name of your task in 'initial_prompt.txt' header to help the model understand the context. - ---- - -## βš™οΈ Key Config Options (`config.yaml`) - -```yaml -llm: - primary_model: "gpt-4o" # or your preferred model - secondary_model: "gpt-3.5" # optional for diversity - temperature: 0.9 - max_tokens: 2048 - -database: - population_size: 40 - max_iterations: 15 - elite_selection_ratio: 0.25 - -evaluator: - timeout: 45 - parallel_evaluations: 3 - use_llm_feedback: true -``` - ---- - -## πŸ“ˆ Example Output - -OpenEvolve evolves prompts like this: - -**Initial Prompt:** - -``` -Please analyze the sentiment of the following sentence and provide a sentiment score: - -"{input_text}" - -Rate the sentiment on a scale from 0.0 to 10.0. - -Score: -``` - -**Evolved Prompt:** - -``` -Please analyze the sentiment of the following sentence and provide a sentiment score using the following guidelines: -- 0.0-2.9: Strongly negative sentiment (e.g., expresses anger, sadness, or despair) -- 3.0-6.9: Neutral or mixed sentiment (e.g., factual statements, ambiguous content) -- 7.0-10.0: Strongly positive sentiment (e.g., expresses joy, satisfaction, or hope) - -"{input_text}" - -Rate the sentiment on a scale from 0.0 to 10.0: -- 0.0-2.9: Strongly negative (e.g., "This product is terrible") -- 3.0-6.9: Neutral/mixed (e.g., "The sky is blue today") -- 7.0-10.0: Strongly positive (e.g., "This is amazing!") - -Provide only the numeric score (e.g., "8.5") without any additional text: - -Score: -``` - -**Result**: Improved accuracy and output consistency. 
- ---- - -## πŸ” Where to Use This - -OpenEvolve could be addapted on many tasks: - -* **Text Classification**: Spam detection, intent recognition -* **Content Generation**: Social media posts, product descriptions -* **Question Answering & Summarization** -* **Code Tasks**: Review, generation, completion -* **Structured Output**: JSON, table filling, data extraction - ---- - -## βœ… Best Practices - -* Start with a basic but relevant prompt -* Use good-quality data and clear evaluation metrics -* Run multiple evolutions for better results -* Validate on held-out data before deployment - ---- - -**Ready to discover better prompts?** -Use this template to evolve prompts for any LLM taskβ€”automatically. diff --git a/examples/llm_prompt_optimazation/best_program.txt b/examples/llm_prompt_optimazation/best_program.txt deleted file mode 100644 index 601c29da2..000000000 --- a/examples/llm_prompt_optimazation/best_program.txt +++ /dev/null @@ -1,19 +0,0 @@ -"""Sentiment analysis prompt example for OpenEvolve""" - -# EVOLVE-BLOCK-START -Please analyze the sentiment of the following sentence and provide a sentiment score using the following guidelines: -- 0.0-2.9: Strongly negative sentiment (e.g., expresses anger, sadness, or despair) -- 3.0-6.9: Neutral or mixed sentiment (e.g., factual statements, ambiguous content) -- 7.0-10.0: Strongly positive sentiment (e.g., expresses joy, satisfaction, or hope) - -"{input_text}" - -Rate the sentiment on a scale from 0.0 to 10.0: -- 0.0-2.9: Strongly negative (e.g., "This product is terrible") -- 3.0-6.9: Neutral/mixed (e.g., "The sky is blue today") -- 7.0-10.0: Strongly positive (e.g., "This is amazing!") - -Provide only the numeric score (e.g., "8.5") without any additional text: - -Score: -# EVOLVE-BLOCK-END diff --git a/examples/llm_prompt_optimazation/config.yaml b/examples/llm_prompt_optimazation/config.yaml deleted file mode 100644 index 57483c1aa..000000000 --- a/examples/llm_prompt_optimazation/config.yaml +++ /dev/null @@ -1,58 +0,0 @@ -# Configuration for prompt optimization -max_iterations: 30 -checkpoint_interval: 10 -log_level: "INFO" - -# LLM configuration -llm: - primary_model: "qwen3-32b-fp8" - api_base: "http://localhost:1234/v1" - api_key: "your_api_key_here" - temperature: 0.9 - top_p: 0.95 - max_tokens: 2048 - -# Prompt configuration -prompt: - system_message: | - You are an expert prompt engineer. Your task is to revise an existing prompt designed for large language models (LLMs), without being explicitly told what the task is. - - Your improvements should: - - * Infer the intended task and expected output format based on the structure and language of the original prompt. - * Clarify vague instructions, eliminate ambiguity, and improve overall interpretability for the LLM. - * Strengthen alignment between the prompt and the desired task outcome, ensuring more consistent and accurate responses. - * Improve robustness against edge cases or unclear input phrasing. - * If helpful, include formatting instructions, boundary conditions, or illustrative examples that reinforce the LLM's expected behavior. - * Avoid adding unnecessary verbosity or assumptions not grounded in the original prompt. - - You will receive a prompt that uses the following structure: - - ```python - prompt.format(input_text=some_text) - ``` - - The revised prompt should maintain the same input interface but be more effective, reliable, and production-ready for LLM use. - - Return only the improved prompt text. Do not include explanations or additional comments. 
Your output should be a clean, high-quality replacement that enhances clarity, consistency, and LLM performance. - - num_top_programs: 8 - use_template_stochasticity: true - -# Database configuration -database: - population_size: 40 - archive_size: 20 - num_islands: 3 - elite_selection_ratio: 0.25 - exploitation_ratio: 0.65 - -# Evaluator configuration -evaluator: - timeout: 45 - use_llm_feedback: true - -# Evolution settings -diff_based_evolution: true -allow_full_rewrites: true -diversity_threshold: 0.1 diff --git a/examples/llm_prompt_optimazation/data.json b/examples/llm_prompt_optimazation/data.json deleted file mode 100644 index 9fcdc621e..000000000 --- a/examples/llm_prompt_optimazation/data.json +++ /dev/null @@ -1,510 +0,0 @@ -{ - "book_reviews": [ - { - "id": 1, - "text": "This book was absolutely phenomenal! The writing was masterful and the plot kept me captivated from start to finish.", - "sentiment_score": 9.5 - }, - { - "id": 2, - "text": "I was really disappointed with this novel. The story dragged on and the characters felt flat and uninteresting.", - "sentiment_score": 2.5 - }, - { - "id": 3, - "text": "An incredible literary masterpiece! Brilliant prose and outstanding character development throughout.", - "sentiment_score": 9.8 - }, - { - "id": 4, - "text": "This was one of the worst books I've ever read. Terrible pacing and a completely incoherent storyline.", - "sentiment_score": 0.5 - }, - { - "id": 5, - "text": "A true work of art. Every page was beautifully crafted and emotionally resonant.", - "sentiment_score": 10.0 - }, - { - "id": 6, - "text": "Completely underwhelming. I expected so much more but was left feeling bored and frustrated.", - "sentiment_score": 2.0 - }, - { - "id": 7, - "text": "Incredible storytelling with rich world-building. This book exceeded all my expectations.", - "sentiment_score": 9.2 - }, - { - "id": 8, - "text": "A waste of time and money. Poor writing, bad plot, and overall just a terrible reading experience.", - "sentiment_score": 0.8 - }, - { - "id": 9, - "text": "Outstanding narrative and compelling characters. This book will stay with me for a long time.", - "sentiment_score": 9.0 - }, - { - "id": 10, - "text": "Disappointing and predictable. The book felt like a cheap imitation of much better novels.", - "sentiment_score": 2.8 - }, - { - "id": 11, - "text": "The book was decent. Some chapters were good, others not so much. Overall an average read.", - "sentiment_score": 5.0 - }, - { - "id": 12, - "text": "Not the best novel ever written, but certainly readable. Has its moments of brilliance.", - "sentiment_score": 6.5 - }, - { - "id": 13, - "text": "Pretty good book with solid writing and an interesting premise. Worth reading if you have time.", - "sentiment_score": 7.2 - }, - { - "id": 14, - "text": "The book had potential but fell short in execution. Some good ideas but poorly implemented.", - "sentiment_score": 4.0 - }, - { - "id": 15, - "text": "A truly exceptional piece of literature that pushes the boundaries of storytelling. Pure genius!", - "sentiment_score": 10.0 - }, - { - "id": 16, - "text": "Absolutely terrible in every possible way. I want my money and time back. Avoid at all costs.", - "sentiment_score": 0.0 - }, - { - "id": 17, - "text": "Surprisingly good! Exceeded my expectations with clever plot twists and strong character arcs.", - "sentiment_score": 7.8 - }, - { - "id": 18, - "text": "Mediocre at best. 
Nothing particularly wrong with it, but nothing special either.", - "sentiment_score": 4.5 - }, - { - "id": 19, - "text": "A delightful surprise! Charming prose and a heartwarming story that left me smiling.", - "sentiment_score": 8.5 - }, - { - "id": 20, - "text": "Painfully slow and pretentious. The author seemed more interested in showing off than telling a story.", - "sentiment_score": 1.2 - }, - { - "id": 21, - "text": "An engaging thriller that kept me on the edge of my seat. Well-crafted suspense and believable characters.", - "sentiment_score": 8.3 - }, - { - "id": 22, - "text": "The romance was sweet but the plot was lacking. Some beautiful moments but overall forgettable.", - "sentiment_score": 5.5 - }, - { - "id": 23, - "text": "Brilliant science fiction with thought-provoking themes. The author's imagination is truly remarkable.", - "sentiment_score": 9.1 - }, - { - "id": 24, - "text": "Confusing and poorly structured. I struggled to follow the narrative and lost interest quickly.", - "sentiment_score": 2.3 - }, - { - "id": 25, - "text": "A masterful blend of history and fiction. Thoroughly researched and beautifully written.", - "sentiment_score": 8.9 - }, - { - "id": 26, - "text": "The characters felt one-dimensional and the dialogue was stilted. Not the author's best work.", - "sentiment_score": 3.2 - }, - { - "id": 27, - "text": "Captivating from the first page to the last. A true page-turner with excellent pacing.", - "sentiment_score": 8.7 - }, - { - "id": 28, - "text": "Boring and repetitive. The same themes rehashed over and over without any fresh perspective.", - "sentiment_score": 2.1 - }, - { - "id": 29, - "text": "A profound exploration of human nature. Deep, meaningful, and beautifully executed.", - "sentiment_score": 9.4 - }, - { - "id": 30, - "text": "The plot had too many holes and the ending was unsatisfying. Left me with more questions than answers.", - "sentiment_score": 3.5 - }, - { - "id": 31, - "text": "Solid character development and a compelling mystery. Kept me guessing until the very end.", - "sentiment_score": 7.6 - }, - { - "id": 32, - "text": "The writing style was difficult to follow and the story seemed to go nowhere. A disappointing read.", - "sentiment_score": 2.7 - }, - { - "id": 33, - "text": "Excellent world-building and imaginative storytelling. A fantasy epic that delivers on all fronts.", - "sentiment_score": 8.8 - }, - { - "id": 34, - "text": "The humor fell flat and the characters were annoying rather than endearing. Not my cup of tea.", - "sentiment_score": 3.0 - }, - { - "id": 35, - "text": "A gripping psychological thriller with complex characters and unexpected twists. Highly recommended.", - "sentiment_score": 8.4 - }, - { - "id": 36, - "text": "The book was okay but nothing groundbreaking. Decent enough to finish but not memorable.", - "sentiment_score": 5.2 - }, - { - "id": 37, - "text": "Beautifully written prose that flows like poetry. A literary gem that touched my soul.", - "sentiment_score": 9.3 - }, - { - "id": 38, - "text": "Too much exposition and not enough action. The story moved at a snail's pace throughout.", - "sentiment_score": 3.8 - }, - { - "id": 39, - "text": "An inspiring tale of resilience and hope. The characters' journeys were both realistic and uplifting.", - "sentiment_score": 8.1 - }, - { - "id": 40, - "text": "ClichΓ©d and predictable. I saw every plot twist coming from miles away. 
Very disappointing.", - "sentiment_score": 2.4 - }, - { - "id": 41, - "text": "A thought-provoking exploration of social issues wrapped in an entertaining narrative.", - "sentiment_score": 7.9 - }, - { - "id": 42, - "text": "The book started strong but lost momentum halfway through. The ending felt rushed and unsatisfying.", - "sentiment_score": 4.3 - }, - { - "id": 43, - "text": "Exceptional character depth and emotional resonance. A story that will haunt you long after reading.", - "sentiment_score": 9.6 - }, - { - "id": 44, - "text": "Poorly edited with numerous grammatical errors. The story couldn't overcome the technical flaws.", - "sentiment_score": 1.8 - }, - { - "id": 45, - "text": "A delightful coming-of-age story with authentic characters and relatable struggles.", - "sentiment_score": 7.4 - }, - { - "id": 46, - "text": "The premise was interesting but the execution was lacking. Felt like a missed opportunity.", - "sentiment_score": 4.1 - }, - { - "id": 47, - "text": "Absolutely riveting! Could not put it down once I started. A masterclass in suspenseful storytelling.", - "sentiment_score": 9.7 - }, - { - "id": 48, - "text": "Overly complicated and pretentious. The author tried too hard to be clever and it backfired.", - "sentiment_score": 2.2 - }, - { - "id": 49, - "text": "A heartwarming family saga with memorable characters and beautiful storytelling.", - "sentiment_score": 8.2 - }, - { - "id": 50, - "text": "The dialogue was unrealistic and the plot was full of convenient coincidences. Hard to believe.", - "sentiment_score": 3.3 - }, - { - "id": 51, - "text": "An ambitious epic that mostly succeeds in its grand vision. Some pacing issues but overall impressive.", - "sentiment_score": 7.7 - }, - { - "id": 52, - "text": "Dull and lifeless. The characters had no personality and the story lacked any real conflict.", - "sentiment_score": 2.6 - }, - { - "id": 53, - "text": "A beautiful meditation on love, loss, and redemption. Emotionally powerful and deeply moving.", - "sentiment_score": 8.9 - }, - { - "id": 54, - "text": "The book felt incomplete, like the author ran out of ideas halfway through. Very unsatisfying.", - "sentiment_score": 3.4 - }, - { - "id": 55, - "text": "Clever and witty with sharp social commentary. An entertaining read that also makes you think.", - "sentiment_score": 7.8 - }, - { - "id": 56, - "text": "Repetitive and boring. The same points made over and over without adding anything new.", - "sentiment_score": 2.9 - }, - { - "id": 57, - "text": "A stunning work of historical fiction that brings the past to life with vivid detail.", - "sentiment_score": 8.6 - }, - { - "id": 58, - "text": "The mystery was easy to solve and the red herrings were obvious. Not very engaging.", - "sentiment_score": 3.7 - }, - { - "id": 59, - "text": "Outstanding world-building and character development. A fantasy series starter that promises great things.", - "sentiment_score": 8.3 - }, - { - "id": 60, - "text": "Too many subplots that went nowhere. The main story got lost in all the unnecessary complexity.", - "sentiment_score": 3.6 - }, - { - "id": 61, - "text": "A perfectly crafted thriller with tight pacing and genuine surprises. Everything a good book should be.", - "sentiment_score": 9.0 - }, - { - "id": 62, - "text": "The writing was awkward and the story felt forced. 
Could have used more time in development.", - "sentiment_score": 2.8 - }, - { - "id": 63, - "text": "An enchanting tale that captures the magic of childhood while addressing serious themes.", - "sentiment_score": 7.9 - }, - { - "id": 64, - "text": "The book was reasonably entertaining but nothing I hadn't seen before. Average in every way.", - "sentiment_score": 5.0 - }, - { - "id": 65, - "text": "Brilliant use of multiple perspectives to tell a complex story. Masterfully woven narrative threads.", - "sentiment_score": 9.2 - }, - { - "id": 66, - "text": "The pacing was all wrong - too slow in places, too rushed in others. Needed better editing.", - "sentiment_score": 3.9 - }, - { - "id": 67, - "text": "A touching story of friendship and loyalty that resonated deeply with me. Highly recommended.", - "sentiment_score": 8.0 - }, - { - "id": 68, - "text": "Confusing timeline and unclear motivations made this a frustrating read. Lost potential.", - "sentiment_score": 3.1 - }, - { - "id": 69, - "text": "Exceptional prose and a story that stays with you. A modern classic in the making.", - "sentiment_score": 9.5 - }, - { - "id": 70, - "text": "The book tried to do too much and ended up accomplishing very little. Unfocused and scattered.", - "sentiment_score": 2.5 - }, - { - "id": 71, - "text": "A solid mystery with well-developed characters and a satisfying resolution. Good entertainment.", - "sentiment_score": 7.3 - }, - { - "id": 72, - "text": "Derivative and unoriginal. Felt like I'd read this exact story multiple times before.", - "sentiment_score": 2.0 - }, - { - "id": 73, - "text": "Beautiful, lyrical writing that creates an immersive reading experience. A true work of art.", - "sentiment_score": 8.8 - }, - { - "id": 74, - "text": "The book was readable but forgettable. Nothing particularly good or bad about it.", - "sentiment_score": 5.1 - }, - { - "id": 75, - "text": "An epic adventure with memorable characters and breathtaking scope. Fantasy at its finest.", - "sentiment_score": 9.1 - }, - { - "id": 76, - "text": "Poor character development and a weak plot made this a chore to finish. Very disappointing.", - "sentiment_score": 1.9 - }, - { - "id": 77, - "text": "A compelling drama with realistic characters facing believable challenges. Well worth reading.", - "sentiment_score": 7.6 - }, - { - "id": 78, - "text": "The book meandered without purpose and the ending came out of nowhere. Poorly structured.", - "sentiment_score": 3.2 - }, - { - "id": 79, - "text": "Absolutely captivating! A page-turner that combines great writing with an irresistible plot.", - "sentiment_score": 8.7 - }, - { - "id": 80, - "text": "Too many clichΓ©s and stereotypes. The author relied on tired tropes instead of original ideas.", - "sentiment_score": 2.3 - }, - { - "id": 81, - "text": "A thoughtful exploration of complex themes with nuanced characters and elegant prose.", - "sentiment_score": 8.4 - }, - { - "id": 82, - "text": "The story had potential but was ruined by poor execution and sloppy writing. What a waste.", - "sentiment_score": 2.7 - }, - { - "id": 83, - "text": "An outstanding debut novel that announces the arrival of a major new talent. Brilliant work.", - "sentiment_score": 9.3 - }, - { - "id": 84, - "text": "Bland and uninspiring. 
The characters were flat and the story lacked any real emotion.", - "sentiment_score": 2.1 - }, - { - "id": 85, - "text": "A gripping tale of survival and redemption that kept me reading late into the night.", - "sentiment_score": 8.1 - }, - { - "id": 86, - "text": "The book was okay for what it was, but it didn't really grab me. Decent but unremarkable.", - "sentiment_score": 4.8 - }, - { - "id": 87, - "text": "Masterful storytelling with rich imagery and profound insights into the human condition.", - "sentiment_score": 9.4 - }, - { - "id": 88, - "text": "Choppy writing and an incoherent plot made this difficult to follow and even harder to enjoy.", - "sentiment_score": 1.7 - }, - { - "id": 89, - "text": "A delightful romantic comedy with sparkling dialogue and charming characters. Pure enjoyment.", - "sentiment_score": 7.8 - }, - { - "id": 90, - "text": "The book started promisingly but quickly devolved into nonsense. Very disappointing conclusion.", - "sentiment_score": 3.0 - }, - { - "id": 91, - "text": "An intelligent and well-researched novel that educates as much as it entertains. Excellent work.", - "sentiment_score": 8.2 - }, - { - "id": 92, - "text": "Boring and predictable with cardboard characters and a paint-by-numbers plot. Skip this one.", - "sentiment_score": 1.4 - }, - { - "id": 93, - "text": "A powerful and moving story that tackles difficult subjects with sensitivity and grace.", - "sentiment_score": 8.9 - }, - { - "id": 94, - "text": "The author clearly didn't know how to end the story. The conclusion was abrupt and unsatisfying.", - "sentiment_score": 3.5 - }, - { - "id": 95, - "text": "Extraordinary! A once-in-a-generation masterpiece that redefines what literature can achieve.", - "sentiment_score": 10.0 - }, - { - "id": 96, - "text": "Terrible pacing and wooden dialogue made this one of the worst books I've read this year.", - "sentiment_score": 0.9 - }, - { - "id": 97, - "text": "A satisfying read with good character arcs and a well-constructed plot. Solid entertainment.", - "sentiment_score": 7.1 - }, - { - "id": 98, - "text": "The book felt like a rough draft that was published too early. Needed much more work.", - "sentiment_score": 2.4 - }, - { - "id": 99, - "text": "Brilliant, innovative, and utterly engaging. A book that changes how you think about storytelling.", - "sentiment_score": 9.8 - }, - { - "id": 100, - "text": "Completely unreadable. Poor grammar, worse plotting, and characters with no redeeming qualities.", - "sentiment_score": 0.2 - } - ], - "metadata": { - "description": "Synthesised book review sentiment analysis dataset", - "total_reviews": 100, - "sentiment_scale": "0.0 (extremely negative) to 10.0 (extremely positive)", - "created": "2025-07-01" - } -} \ No newline at end of file diff --git a/examples/llm_prompt_optimazation/evaluator.py b/examples/llm_prompt_optimazation/evaluator.py deleted file mode 100644 index 6a816f15b..000000000 --- a/examples/llm_prompt_optimazation/evaluator.py +++ /dev/null @@ -1,196 +0,0 @@ -""" -Evaluator for the prompt optimization task. -""" - -import re -import traceback -import json -import os -import time -from openai import OpenAI -from tqdm import tqdm - -TASK_MODEL_NAME = "meta-llama-3.1-8b-instruct@q8_0" -TASK_MODEL_URL = "http://localhost:1234/v1" -TASK_MODEL_API_KEY = "your_api_key_here" -SAMPLE_SIZE = 25 # Number of samples to use for evaluation -MAX_RETRIES = 3 # Number of retries for LLM calls - - -def load_dataset(data_file_path): - """ - Load the book review dataset from JSON file. 
- - Args: - data_file_path: Path to the JSON data file - - Returns: - List of review dictionaries with 'text' and 'label' keys - """ - try: - with open(data_file_path, 'r', encoding='utf-8') as f: - data = json.load(f) - - # Convert the data structure to match the expected format - reviews = [] - for review in data.get('book_reviews', []): - reviews.append({ - 'text': review['text'], - 'label': review['sentiment_score'] - }) - - print(f"Successfully loaded {len(reviews)} book reviews from dataset") - return reviews - - except Exception as e: - print(f"Error loading dataset from {data_file_path}: {e}") - traceback.print_exc() - return [] - -# Load dataset from JSON file -data_file_path = os.path.join(os.path.dirname(__file__), "data.json") -ds = load_dataset(data_file_path) - -if not ds: - raise ValueError("Failed to load dataset or dataset is empty") - -def evaluate(prompt_path): - """ - Evaluate the program by run the LLM model on a benchmarck dataset. - - Args: - program_path: Path to the program file - - Returns: - Dictionary of metrics - """ - print('-' * 80) - print("Starting evaluation...") - print('-' * 80) - try: - # Initialize OpenAI test_model with error handling - try: - test_model = OpenAI( - base_url=TASK_MODEL_URL, - api_key=TASK_MODEL_API_KEY - ) - print(f"Initialized OpenAI test_model with model: {TASK_MODEL_NAME}") - except Exception as e: - print(f"Error initializing OpenAI test_model: {e}") - test_model = None - - # Use a subset for faster evaluation during evolution (can be configured) - eval_sample_size = min(SAMPLE_SIZE, len(ds)) - ds_sample = ds[:eval_sample_size] - print(f"Using {len(ds_sample)} samples from {len(ds)} total reviews for evaluation") - - # load the prompt from the file - with open(prompt_path, "r") as f: - prompt = f.read() - - # extract the prompt between the markers - prompt_match = re.search(r"EVOLVE-BLOCK-START(.*)EVOLVE-BLOCK-END", prompt, re.DOTALL) - if prompt_match: - prompt = prompt_match.group(1).strip() - else: - raise ValueError("No EVOLVE-BLOCK found in the prompt file") - - total_score = 0.0 - total_examples = 0 - individual_scores = [] - - print(f"Evaluating with prompt:\n{prompt}\n") - for example in tqdm(ds_sample, desc="Evaluating examples", unit="example"): - total_examples += 1 - input_text = example["text"] - expected_score = example["label"] - - # Prepare the message for the LLM - messages = [ - {"role": "user", "content": prompt.format(input_text=input_text)} - ] - - # Call the LLM with retry logic - max_retries = MAX_RETRIES - for attempt in range(max_retries): - try: - response = test_model.chat.completions.create( - model=TASK_MODEL_NAME, - messages=messages - ) - break - except Exception as e: - if attempt == max_retries - 1: - print(f"Failed to get response after {max_retries} attempts: {e}") - raise e - time.sleep(1) # Brief pause before retry - - output_text = response.choices[0].message.content.strip() - - # Extract numerical score from the response - try: - # Try to extract a number between 0 and 10 - score_match = re.search(r'(\d+(?:\.\d+)?)', output_text) - if score_match: - predicted_score = float(score_match.group(1)) - - # Ensure score is within valid range (0-10) - predicted_score = max(0.0, min(10.0, predicted_score)) - else: - predicted_score = 5.0 # Default to neutral - - # Calculate accuracy based on how close the prediction is to the expected score - # Using 1 - (absolute difference / 10), so perfect match = 1.0, worst case = 0.0 - accuracy = 1.0 - (abs(predicted_score - expected_score) / 10.0) - 
individual_scores.append(accuracy) - total_score += accuracy - - except Exception as e: - print(f"Error processing response '{output_text}': {e}") - individual_scores.append(0.0) # Score 0 for failed predictions - # Calculate comprehensive metrics - average_score = total_score / total_examples if total_examples > 0 else 0.0 - min_score = min(individual_scores) if individual_scores else 0.0 - max_score = max(individual_scores) if individual_scores else 0.0 - - # Calculate additional metrics - std_dev = 0.0 - if len(individual_scores) > 1: - mean = sum(individual_scores) / len(individual_scores) - variance = sum((x - mean) ** 2 for x in individual_scores) / len(individual_scores) - std_dev = variance ** 0.5 - - # Count high-accuracy predictions (>0.8 accuracy) - high_accuracy_count = sum(1 for score in individual_scores if score > 0.8) - high_accuracy_rate = high_accuracy_count / len(individual_scores) if individual_scores else 0.0 - - print(f"Total examples: {total_examples}") - print(f"Average accuracy: {average_score:.3f}") - print(f"Standard deviation: {std_dev:.3f}") - print(f"Min accuracy: {min_score:.3f}") - print(f"Max accuracy: {max_score:.3f}") - print(f"High accuracy rate (>0.8): {high_accuracy_rate:.3f}") - print('-' * 80) - return { - "score": average_score, - "total_examples": total_examples, - "individual_scores": individual_scores, - "min_score": min_score, - "max_score": max_score, - "std_dev": std_dev, - "high_accuracy_rate": high_accuracy_rate - } - - except Exception as e: - print(f"Evaluation failed completely: {str(e)}") - traceback.print_exc() - print('-' * 80) - return { - "score": 0.0, - "total_examples": 0, - "individual_scores": [], - "min_score": 0.0, - "max_score": 0.0, - "std_dev": 0.0, - "high_accuracy_rate": 0.0 - } diff --git a/examples/llm_prompt_optimazation/initial_prompt.txt b/examples/llm_prompt_optimazation/initial_prompt.txt deleted file mode 100644 index 6f12bf353..000000000 --- a/examples/llm_prompt_optimazation/initial_prompt.txt +++ /dev/null @@ -1,11 +0,0 @@ -"""Sentiment analysis prompt example for OpenEvolve""" - -# EVOLVE-BLOCK-START -Please analyze the sentiment of the following sentence and provide a sentiment score: - -"{input_text}" - -Rate the sentiment on a scale from 0.0 to 10.0. - -Score: -# EVOLVE-BLOCK-END diff --git a/examples/llm_prompt_optimazation/requirements.txt b/examples/llm_prompt_optimazation/requirements.txt deleted file mode 100644 index 01354db40..000000000 --- a/examples/llm_prompt_optimazation/requirements.txt +++ /dev/null @@ -1,2 +0,0 @@ -openai -tqdm \ No newline at end of file diff --git a/examples/llm_prompt_optimazation/run.sh b/examples/llm_prompt_optimazation/run.sh deleted file mode 100644 index 7226a0b82..000000000 --- a/examples/llm_prompt_optimazation/run.sh +++ /dev/null @@ -1,4 +0,0 @@ - python ../../openevolve-run.py \ - examples/llm_prompt_optimazation/initial_prompt.txt \ - examples/llm_prompt_optimazation/evaluator.py \ - --config examples/llm_prompt_optimazation/config.yaml \ No newline at end of file diff --git a/examples/llm_prompt_optimization/README.md b/examples/llm_prompt_optimization/README.md new file mode 100644 index 000000000..4a6bd4b72 --- /dev/null +++ b/examples/llm_prompt_optimization/README.md @@ -0,0 +1,231 @@ +# HuggingFace Dataset Prompt Optimization with OpenEvolve πŸš€ + +This example demonstrates how to use OpenEvolve to automatically optimize prompts for any HuggingFace dataset. 
The system uses evolutionary search to discover high-performing prompts by testing them against ground truth data.
+
+## 🎯 Overview
+
+OpenEvolve automatically:
+- Loads any HuggingFace dataset
+- Evolves prompts through multiple generations
+- Uses cascading evaluation for efficiency
+- Finds optimal prompts for your specific task and model
+
+The system uses a clean YAML format for configuration, making it easy to set up prompt optimization for any dataset.
+
+## πŸš€ Quick Start
+
+### 1. Install Dependencies
+
+```bash
+cd examples/llm_prompt_optimization
+pip install -r requirements.txt
+```
+
+### 2. Configure Your Model
+
+Update `config.yaml` with your LLM settings:
+
+```yaml
+llm:
+  api_base: "https://openrouter.ai/api/v1"
+  api_key: "your_api_key_here"
+  models:
+    - name: "google/gemini-2.5-flash"  # Or any OpenAI-compatible model
+      weight: 1.0
+```
+
+### 3. Set Up Your Dataset and Prompt
+
+Configure your dataset in `dataset.yaml`:
+
+```yaml
+# HuggingFace dataset configuration
+dataset_name: "stanfordnlp/imdb"  # Any HuggingFace dataset
+input_field: "text"               # Field containing input data
+target_field: "label"             # Field containing ground truth
+split: "test"                     # Dataset split to use
+
+# Evaluation samples
+max_samples: 50  # Number of samples to evaluate
+```
+
+Create your initial prompt in `initial_prompt.txt`:
+
+```
+Your initial prompt here with {input_text} as placeholder
+```
+
+### 4. Run OpenEvolve
+
+```bash
+python ../../openevolve-run.py initial_prompt.txt evaluator.py --config config.yaml --iterations 100
+```
+
+The system will:
+- Evolve the prompt in `initial_prompt.txt`
+- Use dataset configuration from `dataset.yaml`
+- Test evolved prompts against the HuggingFace dataset
+
+## πŸ“Š Supported Datasets
+
+This optimizer works with any HuggingFace dataset. Example configurations are provided in the `examples/` directory:
+
+- **AG News**: `ag_news_dataset.yaml` + `ag_news_prompt.txt`
+- **Emotion**: `emotion_dataset.yaml` + `emotion_prompt.txt`
+
+To use an example:
+```bash
+# Copy the example files
+cp examples/ag_news_dataset.yaml dataset.yaml
+cp examples/ag_news_prompt.txt initial_prompt.txt
+
+# Run optimization
+python ../../openevolve-run.py initial_prompt.txt evaluator.py --config config.yaml --iterations 100
+```
+
+### Common Dataset Configurations
+
+### Sentiment Analysis
+```yaml
+dataset_name: "stanfordnlp/imdb"
+input_field: "text"
+target_field: "label"  # 0 or 1
+```
+
+### Question Answering
+```yaml
+dataset_name: "squad"
+input_field: "question"
+target_field: "answers"  # Dict with 'text' field
+```
+
+### Text Classification
+```yaml
+dataset_name: "ag_news"
+input_field: "text"
+target_field: "label"  # 0-3 for categories
+```
+
+### Summarization
+```yaml
+dataset_name: "xsum"
+input_field: "document"
+target_field: "summary"
+```
+
+## βš™οΈ How It Works
+
+### Cascading Evaluation
+
+The evaluator uses a two-stage cascading evaluation (enabled via `cascade_evaluation: true` in `config.yaml`):
+
+1. **Load Dataset**: Downloads the specified HuggingFace dataset
+2. **Stage 1 (quick check)**: Tests the prompt on roughly 10% of `max_samples` (at least 10 examples)
+3. **Stage 2 (full evaluation)**: Prompts that reach the `cascade_thresholds` value (90% accuracy by default) are re-tested on all `max_samples` examples
+4. **Calculate Accuracy**: Compares LLM outputs to ground truth labels
+
+### Evolution Process
+
+1. OpenEvolve starts with your initial prompt
+2. The LLM generates variations based on performance feedback
+3. Each variant is tested with the cascading evaluation described above (see the sketch after this list)
+4. Best performers are kept and evolved further
+5. Process continues for specified iterations
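+
+OpenEvolve wires this cascade internally. The sketch below is only an illustration of how the two entry points defined in `evaluator.py` (`evaluate_stage1` and `evaluate_stage2`) fit together; the `run_cascade` helper is hypothetical and not part of this example's code:
+
+```python
+# Illustrative only - OpenEvolve performs this orchestration itself.
+from evaluator import evaluate_stage1, evaluate_stage2
+
+def run_cascade(prompt_path, threshold=0.9):   # mirrors cascade_thresholds: [0.9]
+    stage1 = evaluate_stage1(prompt_path)      # quick pass on ~10% of max_samples
+    if stage1["combined_score"] < threshold:
+        return stage1                          # weak prompts are rejected cheaply
+    return evaluate_stage2(prompt_path)        # full pass on all max_samples
+```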
+
+### 🎭 Custom Templates for Prompt Evolution
+
+By default, OpenEvolve is designed for code evolution. To make it work properly for prompt evolution, this example includes custom templates in the `templates/` directory:
+
+- **`full_rewrite_user.txt`**: Replaces the default code evolution template with prompt-specific language
+
+This ensures the LLM understands it should evolve the prompt text itself, not generate code. The configuration automatically uses these templates via:
+
+```yaml
+prompt:
+  template_dir: "templates"  # Use custom templates for prompt evolution
+```
+
+## 🎯 Configuration Options
+
+### Evaluation Configuration
+
+In `config.yaml`:
+```yaml
+evaluator:
+  parallel_evaluations: 4    # Run 4 evaluations in parallel
+  cascade_evaluation: true   # Two-stage cascading evaluation
+  cascade_thresholds: [0.9]  # Stage 1 must reach 90% accuracy to run stage 2
+```
+
+### Sample Size
+
+Adjust in `dataset.yaml`:
+```yaml
+max_samples: 50  # Number of samples to evaluate
+```
+
+## πŸ“ˆ Example Results
+
+Starting prompt:
+```
+Analyze the sentiment: "{input_text}"
+```
+
+Evolved prompt after 100 iterations:
+```
+Analyze the sentiment of the following text. Determine if the overall emotional tone is positive or negative.
+
+Text: "{input_text}"
+
+Response: Provide only a single digit - either 1 for positive sentiment or 0 for negative sentiment. Do not include any explanation or additional text.
+```
+
+Accuracy improvement: 72% β†’ 94%
+
+## πŸ”§ Advanced Usage
+
+### Custom Evaluation Metrics
+
+The evaluator extracts predictions and compares them to ground truth. For classification tasks, it looks for:
+- Exact number matches (0, 1, etc.)
+- Keywords (positive/negative, yes/no)
+- Custom patterns you define
+
+### Different Task Types
+
+While the default setup is for classification, you can modify the evaluator for:
+- **Regression**: Compare numeric outputs
+- **Generation**: Use BLEU/ROUGE scores
+- **Extraction**: Check if key information is present
+
+## πŸ› Troubleshooting
+
+### Dataset Not Found
+- Check the exact name on HuggingFace
+- Some datasets require acceptance of terms
+
+### Low Stage 1 Accuracy
+- Your initial prompt may be too vague
+- Check if the output format matches expectations
+- Verify the dataset fields are correct
+
+### API Errors
+- Ensure your API key is valid
+- Check rate limits
+- Verify the model name is correct
+
+## πŸš€ Tips for Best Results
+
+1. **Start Simple**: Begin with a clear, working prompt
+2. **Clear Output Format**: Specify exactly what output you expect
+3. **Appropriate Samples**: More samples = better evaluation but slower
+4. **Multiple Runs**: Evolution has randomness; try multiple runs
+5. **Monitor Progress**: Check intermediate best_program.txt files
+
+## πŸ“š Next Steps
+
+- Try different datasets from HuggingFace
+- Experiment with different models
+- Adjust evolution parameters in config.yaml
+- Create task-specific evaluation metrics
+
+Happy prompt evolving! 
🧬✨ \ No newline at end of file diff --git a/examples/llm_prompt_optimization/config.yaml b/examples/llm_prompt_optimization/config.yaml new file mode 100644 index 000000000..0bce0178c --- /dev/null +++ b/examples/llm_prompt_optimization/config.yaml @@ -0,0 +1,69 @@ +# Configuration for HuggingFace prompt optimization +# Based on optimized settings from config2.yaml + +# General settings +max_iterations: 50 +checkpoint_interval: 10 +log_level: "INFO" +diff_based_evolution: false # Full rewrite mode (best for prompt optimization) +max_code_length: 10000 +language: "text" # Explicitly set language to text for prompt evolution + +# LLM Configuration +llm: + api_base: "https://generativelanguage.googleapis.com/v1beta/openai/" + models: + - name: "gemini-2.5-flash-lite" # Using Gemini 2.5 Flash Lite + weight: 1.0 + + temperature: 0.4 # Optimal from experiments + max_tokens: 16000 # Optimal context + timeout: 150 + retries: 3 + +# Prompt Configuration - Optimal settings discovered +prompt: + template_dir: "templates" # Use custom templates for prompt evolution + num_top_programs: 3 # Best balance + num_diverse_programs: 2 # Best balance + include_artifacts: true # +20.7% improvement when enabled + + # System message for prompt evolution + system_message: | + You are an expert prompt engineer. Your task is to revise an existing prompt designed for large language models (LLMs), without being explicitly told what the task is. + + Your improvements should: + + * Infer the intended task and expected output format based on the structure and language of the original prompt. + * Clarify vague instructions, eliminate ambiguity, and improve overall interpretability for the LLM. + * Strengthen alignment between the prompt and the desired task outcome, ensuring more consistent and accurate responses. + * Improve robustness against edge cases or unclear input phrasing. + * If helpful, include formatting instructions, boundary conditions, or illustrative examples that reinforce the LLM's expected behavior. + * Avoid adding unnecessary verbosity or assumptions not grounded in the original prompt. + + The revised prompt should maintain the same input interface but be more effective, reliable, and production-ready for LLM use. + + Return only the improved prompt text. Do not include explanations or additional comments. Your output should be a clean, high-quality replacement that enhances clarity, consistency, and LLM performance. 
+ +# Database Configuration +database: + population_size: 1000 + archive_size: 100 + num_islands: 4 + + # Selection parameters - Optimal ratios from testing + elite_selection_ratio: 0.1 # 10% elite selection + exploration_ratio: 0.3 # 30% exploration + exploitation_ratio: 0.6 # 60% exploitation + + # Migration parameters - Optimal settings + migration_interval: 10 + migration_rate: 0.1 + +# Evaluator Configuration +evaluator: + timeout: 200 + max_retries: 3 + parallel_evaluations: 4 + cascade_evaluation: true # Two-stage cascading evaluation + cascade_thresholds: [0.9] # Stage 1 must achieve 90% accuracy to proceed to stage 2 \ No newline at end of file diff --git a/examples/llm_prompt_optimization/dataset.yaml b/examples/llm_prompt_optimization/dataset.yaml new file mode 100644 index 000000000..8bf503ae3 --- /dev/null +++ b/examples/llm_prompt_optimization/dataset.yaml @@ -0,0 +1,8 @@ +# HuggingFace dataset configuration +dataset_name: "stanfordnlp/imdb" +input_field: "text" +target_field: "label" +split: "test" # Will fallback to train if not available + +# Evaluation samples +max_samples: 50 # Number of samples to evaluate \ No newline at end of file diff --git a/examples/llm_prompt_optimization/evaluator.py b/examples/llm_prompt_optimization/evaluator.py new file mode 100644 index 000000000..3b00d06f8 --- /dev/null +++ b/examples/llm_prompt_optimization/evaluator.py @@ -0,0 +1,272 @@ +""" +Evaluator for HuggingFace dataset-based prompt optimization. +""" + +import re +import traceback +import yaml +import os +import time +from openai import OpenAI +from tqdm import tqdm +from datasets import load_dataset + +# Read config.yaml to get model settings +with open(os.path.join(os.path.dirname(__file__), "config.yaml"), 'r') as f: + config = yaml.safe_load(f) + +# Get model settings from config +llm_config = config.get('llm', {}) +api_base = llm_config.get('api_base', 'http://localhost:1234/v1') + +# Handle both single model and model list configurations +models = llm_config.get('models', []) +if models: + # Use first model from list + TASK_MODEL_NAME = models[0].get('name', 'default-model') +else: + # Fallback to direct model specification + TASK_MODEL_NAME = llm_config.get('primary_model', 'default-model') + +# Get evaluator settings +evaluator_config = config.get('evaluator', {}) +MAX_RETRIES = evaluator_config.get('max_retries', 3) + +# Initialize OpenAI client once for all evaluations +test_model = OpenAI(base_url=api_base) +print(f"Initialized OpenAI client with model: {TASK_MODEL_NAME}") + +def load_prompt_config(prompt_path): + """Load the prompt from text file and dataset config from dataset.yaml.""" + # Load prompt from text file + with open(prompt_path, 'r') as f: + prompt = f.read().strip() + + # Always load dataset configuration from the examples directory + # This ensures it works even when OpenEvolve copies files to temp directories + evaluator_dir = os.path.dirname(os.path.abspath(__file__)) + config_path = os.path.join(evaluator_dir, 'dataset.yaml') + + with open(config_path, 'r') as f: + config = yaml.safe_load(f) + + return config, prompt + +def load_hf_dataset(config): + """Load HuggingFace dataset based on configuration.""" + dataset_name = config['dataset_name'] + split = config.get('split', 'test') + + print(f"Loading dataset: {dataset_name}") + + try: + # Try to load the specified split + dataset = load_dataset(dataset_name, split=split) + except: + # Fallback to train split if test is not available + print(f"Split '{split}' not found, falling back to 'train'") + 
dataset = load_dataset(dataset_name, split='train') + + print(f"Dataset loaded with {len(dataset)} examples") + return dataset + +def evaluate_prompt(prompt, dataset, config, num_samples): + """Evaluate a prompt on a subset of the dataset.""" + input_field = config['input_field'] + target_field = config['target_field'] + + # Sample from dataset + samples = dataset.select(range(min(num_samples, len(dataset)))) + + correct = 0 + total = 0 + + for example in tqdm(samples, desc=f"Evaluating {num_samples} samples"): + input_text = example[input_field] + expected = example[target_field] + + # Prepare the message for the LLM + messages = [ + {"role": "user", "content": prompt.format(input_text=input_text)} + ] + + # Call the LLM with retry logic + for attempt in range(MAX_RETRIES): + try: + response = test_model.chat.completions.create( + model=TASK_MODEL_NAME, + messages=messages, + temperature=0.1, # Low temperature for consistent classification + max_tokens=10 # We only need a short response + ) + break + except Exception as e: + if attempt == MAX_RETRIES - 1: + print(f"Failed to get response after {MAX_RETRIES} attempts: {e}") + raise e + time.sleep(1) + + # Handle potential None response + if not response: + print(f"Warning: No response object from LLM") + total += 1 # Count as incorrect + continue + + if not response.choices: + print(f"Warning: No choices in response from LLM") + total += 1 # Count as incorrect + continue + + if not response.choices[0].message: + print(f"Warning: No message in response choice") + total += 1 # Count as incorrect + continue + + output_text = response.choices[0].message.content + if output_text is None: + print(f"Warning: None content in LLM response") + print(f"Full response: {response}") + total += 1 # Count as incorrect + continue + + output_text = output_text.strip() + + # Extract prediction from output + try: + # Look for a number (0 or 1) in the output + numbers = re.findall(r'\b[01]\b', output_text) + if numbers: + prediction = int(numbers[-1]) # Use the last number found + else: + # Try to infer from keywords + output_lower = output_text.lower() + if 'positive' in output_lower: + prediction = 1 + elif 'negative' in output_lower: + prediction = 0 + else: + prediction = -1 # Invalid prediction + + if prediction == expected: + correct += 1 + + total += 1 + + except Exception as e: + print(f"Error parsing response '{output_text}': {e}") + total += 1 # Count as incorrect + + accuracy = correct / total if total > 0 else 0.0 + return accuracy, correct, total + +def evaluate_stage1(prompt_path): + """ + Stage 1 evaluation: Quick evaluation with 10% of samples + + Args: + prompt_path: Path to the prompt file + + Returns: + Dictionary with combined_score metric + """ + print('-' * 80) + print("Starting Stage 1 evaluation...") + print('-' * 80) + + try: + # Load prompt configuration + config, prompt = load_prompt_config(prompt_path) + print(f"Loaded prompt configuration") + + # Load dataset + dataset = load_hf_dataset(config) + + # Get number of samples from config + num_samples = config.get('max_samples', 50) + stage1_samples = max(10, int(num_samples * 0.1)) + + print(f"Stage 1: Evaluating {stage1_samples} samples...") + + # Run evaluation + accuracy, correct, total = evaluate_prompt( + prompt, dataset, config, stage1_samples + ) + + print(f"Stage 1 accuracy: {accuracy:.3f} ({correct}/{total})") + print('-' * 80) + + return { + "combined_score": accuracy + } + + except Exception as e: + print(f"Stage 1 evaluation failed: {str(e)}") + traceback.print_exc() + 
print('-' * 80) + return { + "combined_score": 0.0, + "error": str(e) + } + + +def evaluate_stage2(prompt_path): + """ + Stage 2 evaluation: Full evaluation with all samples + + Args: + prompt_path: Path to the prompt file + + Returns: + Dictionary with combined_score metric + """ + print('-' * 80) + print("Starting Stage 2 evaluation...") + print('-' * 80) + + try: + # Load prompt configuration + config, prompt = load_prompt_config(prompt_path) + print(f"Loaded prompt configuration") + + # Load dataset + dataset = load_hf_dataset(config) + + # Get number of samples from config + num_samples = config.get('max_samples', 50) + + print(f"Stage 2: Evaluating all {num_samples} samples...") + + # Run evaluation + accuracy, correct, total = evaluate_prompt( + prompt, dataset, config, num_samples + ) + + print(f"Stage 2 accuracy: {accuracy:.3f} ({correct}/{total})") + print('-' * 80) + + return { + "combined_score": accuracy + } + + except Exception as e: + print(f"Stage 2 evaluation failed: {str(e)}") + traceback.print_exc() + print('-' * 80) + return { + "combined_score": 0.0, + "error": str(e) + } + + +def evaluate(prompt_path): + """ + Main evaluation function - for backwards compatibility + Calls evaluate_stage2 for full evaluation + + Args: + prompt_path: Path to the prompt file + + Returns: + Dictionary with combined_score metric + """ + return evaluate_stage2(prompt_path) \ No newline at end of file diff --git a/examples/llm_prompt_optimization/examples/ag_news_dataset.yaml b/examples/llm_prompt_optimization/examples/ag_news_dataset.yaml new file mode 100644 index 000000000..0a333ae0e --- /dev/null +++ b/examples/llm_prompt_optimization/examples/ag_news_dataset.yaml @@ -0,0 +1,8 @@ +# AG News topic classification dataset configuration +dataset_name: "ag_news" +input_field: "text" +target_field: "label" # 0: World, 1: Sports, 2: Business, 3: Sci/Tech +split: "test" + +# Evaluation samples +max_samples: 50 \ No newline at end of file diff --git a/examples/llm_prompt_optimization/examples/ag_news_prompt.txt b/examples/llm_prompt_optimization/examples/ag_news_prompt.txt new file mode 100644 index 000000000..8c2519202 --- /dev/null +++ b/examples/llm_prompt_optimization/examples/ag_news_prompt.txt @@ -0,0 +1,9 @@ +Classify the following news article into one of four categories: +0 - World News +1 - Sports +2 - Business +3 - Science/Technology + +Article: "{input_text}" + +Category (respond with only the number 0-3): \ No newline at end of file diff --git a/examples/llm_prompt_optimization/examples/emotion_dataset.yaml b/examples/llm_prompt_optimization/examples/emotion_dataset.yaml new file mode 100644 index 000000000..2ecb2ba32 --- /dev/null +++ b/examples/llm_prompt_optimization/examples/emotion_dataset.yaml @@ -0,0 +1,8 @@ +# Emotion classification dataset configuration +dataset_name: "dair-ai/emotion" +input_field: "text" +target_field: "label" # 0: sadness, 1: joy, 2: love, 3: anger, 4: fear, 5: surprise +split: "test" + +# Evaluation samples +max_samples: 50 \ No newline at end of file diff --git a/examples/llm_prompt_optimization/examples/emotion_prompt.txt b/examples/llm_prompt_optimization/examples/emotion_prompt.txt new file mode 100644 index 000000000..fe8d18f2c --- /dev/null +++ b/examples/llm_prompt_optimization/examples/emotion_prompt.txt @@ -0,0 +1,11 @@ +Identify the primary emotion expressed in this text: +0 - Sadness +1 - Joy +2 - Love +3 - Anger +4 - Fear +5 - Surprise + +Text: "{input_text}" + +Emotion (number only): \ No newline at end of file diff --git 
a/examples/llm_prompt_optimization/initial_prompt.txt b/examples/llm_prompt_optimization/initial_prompt.txt new file mode 100644 index 000000000..ab329a63f --- /dev/null +++ b/examples/llm_prompt_optimization/initial_prompt.txt @@ -0,0 +1,5 @@ +Analyze the sentiment of the following text and classify it as positive (1) or negative (0). + +Text: "{input_text}" + +Label: \ No newline at end of file diff --git a/examples/llm_prompt_optimization/requirements.txt b/examples/llm_prompt_optimization/requirements.txt new file mode 100644 index 000000000..b72f54907 --- /dev/null +++ b/examples/llm_prompt_optimization/requirements.txt @@ -0,0 +1,4 @@ +openai +tqdm +datasets +pyyaml \ No newline at end of file diff --git a/examples/llm_prompt_optimization/templates/full_rewrite_user.txt b/examples/llm_prompt_optimization/templates/full_rewrite_user.txt new file mode 100644 index 000000000..216844a48 --- /dev/null +++ b/examples/llm_prompt_optimization/templates/full_rewrite_user.txt @@ -0,0 +1,20 @@ +# Current Prompt Information +- Current performance metrics: {metrics} +- Areas identified for improvement: {improvement_areas} + +{artifacts} + +# Prompt Evolution History +{evolution_history} + +# Current Prompt +{current_program} + +# Task +Rewrite the prompt to improve its performance on the specified metrics. +Provide the complete new prompt text. + +IMPORTANT: Make sure your rewritten prompt maintains the same input placeholder ({{input_text}}) +but with improved instructions for better LLM performance. + +Your improved prompt: \ No newline at end of file From accd01c1af9821f37eb6ff842c3af7a2b8a1779d Mon Sep 17 00:00:00 2001 From: Asankhaya Sharma Date: Wed, 30 Jul 2025 10:08:00 +0800 Subject: [PATCH 02/12] Refactor prompt-dataset config matching and add emotion benchmark Updated the evaluator to automatically match prompt files with their corresponding dataset configuration using a naming convention. Added emotion classification benchmark files (`emotion_prompt.txt`, `emotion_prompt_dataset.yaml`) and a wrapper script (`run_evolution.sh`) for easier execution. Deprecated and removed old example files, and improved documentation in the README to reflect the new workflow and dataset handling. 
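As a minimal sketch of the naming convention this commit relies on (illustrative only; the authoritative logic is the `replace(...)` chain added to `evaluator.py` in this patch):

```python
# Sketch of the prompt -> dataset-config naming convention (mirrors evaluator.py):
#   xxx_prompt.txt -> xxx_prompt_dataset.yaml, any other yyy.txt -> yyy_dataset.yaml
import os

def dataset_config_for(prompt_file: str) -> str:
    basename = os.path.basename(prompt_file)
    return basename.replace('_prompt.txt', '_prompt_dataset.yaml').replace('.txt', '_dataset.yaml')

assert dataset_config_for("emotion_prompt.txt") == "emotion_prompt_dataset.yaml"
assert dataset_config_for("initial_prompt.txt") == "initial_prompt_dataset.yaml"
```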
--- examples/llm_prompt_optimization/README.md | 66 ++++++++++------ .../dataset_config.yaml | 9 +++ .../emotion_prompt.txt | 5 ++ .../emotion_prompt_dataset.yaml | 18 +++++ examples/llm_prompt_optimization/evaluator.py | 75 ++++++++++++++----- .../examples/ag_news_dataset.yaml | 8 -- .../examples/ag_news_prompt.txt | 9 --- .../examples/emotion_dataset.yaml | 8 -- .../examples/emotion_prompt.txt | 11 --- ...taset.yaml => initial_prompt_dataset.yaml} | 0 .../llm_prompt_optimization/run_evolution.sh | 17 +++++ 11 files changed, 150 insertions(+), 76 deletions(-) create mode 100644 examples/llm_prompt_optimization/dataset_config.yaml create mode 100644 examples/llm_prompt_optimization/emotion_prompt.txt create mode 100644 examples/llm_prompt_optimization/emotion_prompt_dataset.yaml delete mode 100644 examples/llm_prompt_optimization/examples/ag_news_dataset.yaml delete mode 100644 examples/llm_prompt_optimization/examples/ag_news_prompt.txt delete mode 100644 examples/llm_prompt_optimization/examples/emotion_dataset.yaml delete mode 100644 examples/llm_prompt_optimization/examples/emotion_prompt.txt rename examples/llm_prompt_optimization/{dataset.yaml => initial_prompt_dataset.yaml} (100%) create mode 100644 examples/llm_prompt_optimization/run_evolution.sh diff --git a/examples/llm_prompt_optimization/README.md b/examples/llm_prompt_optimization/README.md index 4a6bd4b72..fdd787ada 100644 --- a/examples/llm_prompt_optimization/README.md +++ b/examples/llm_prompt_optimization/README.md @@ -10,7 +10,7 @@ OpenEvolve automatically: - Uses cascading evaluation for efficiency - Finds optimal prompts for your specific task and model -The system uses a clean YAML format for configuration, making it easy to set up prompt optimization for any dataset. +**Key Feature**: The evaluator automatically matches prompt files with dataset configurations using a naming convention (`xxx_prompt.txt` β†’ `xxx_prompt_dataset.yaml`), making it easy to manage multiple benchmark tasks. ## πŸš€ Quick Start @@ -36,52 +36,74 @@ llm: ### 3. Set Up Your Dataset and Prompt -Configure your dataset in `dataset.yaml`: +This example uses a naming convention to match prompts with their dataset configurations: +- For a prompt file `xxx_prompt.txt`, create a matching `xxx_prompt_dataset.yaml` +- For example: `emotion_prompt.txt` uses `emotion_prompt_dataset.yaml` + +Create your dataset configuration file (e.g., `emotion_prompt_dataset.yaml`): ```yaml # HuggingFace dataset configuration -dataset_name: "stanfordnlp/imdb" # Any HuggingFace dataset +dataset_name: "dair-ai/emotion" # Any HuggingFace dataset input_field: "text" # Field containing input data target_field: "label" # Field containing ground truth split: "test" # Dataset split to use # Evaluation samples -max_samples: 50 # Number of samples to evaluate +max_samples: 200 # Number of samples to evaluate ``` -Create your initial prompt in `initial_prompt.txt`: +Create your initial prompt file (e.g., `emotion_prompt.txt`): ``` -Your initial prompt here with {input_text} as placeholder +Classify the emotion expressed in the following text. + +Text: "{input_text}" + +Emotion (0-5): ``` ### 4. 
Run OpenEvolve +Use the provided `run_evolution.sh` script to ensure the correct dataset is used: + ```bash -python ../../openevolve-run.py initial_prompt.txt evaluator.py --config config.yaml --iterations 100 +# For emotion classification benchmark +./run_evolution.sh emotion_prompt.txt --iterations 50 + +# For IMDB sentiment analysis +./run_evolution.sh initial_prompt.txt --iterations 50 + +# With custom iterations and checkpoint +./run_evolution.sh emotion_prompt.txt --iterations 100 --checkpoint-interval 20 ``` -The system will: -- Evolve the prompt in `initial_prompt.txt` -- Use dataset configuration from `dataset.yaml` -- Test evolved prompts against the HuggingFace dataset +The script automatically: +- Sets the `OPENEVOLVE_PROMPT` environment variable so the evaluator knows which dataset to use +- Passes all additional arguments to OpenEvolve +- Ensures the correct `_dataset.yaml` file is matched with your prompt + +**Note**: If you prefer to run OpenEvolve directly, set the environment variable first: +```bash +export OPENEVOLVE_PROMPT=emotion_prompt.txt +python ../../openevolve-run.py emotion_prompt.txt evaluator.py --config config.yaml --iterations 50 +``` ## πŸ“Š Supported Datasets -This optimizer works with any HuggingFace dataset. Example configurations are provided in the `examples/` directory: +This optimizer works with any HuggingFace dataset. Included examples: -- **AG News**: `ag_news_dataset.yaml` + `ag_news_prompt.txt` -- **Emotion**: `emotion_dataset.yaml` + `emotion_prompt.txt` +- **IMDB Sentiment**: `initial_prompt.txt` + `initial_prompt_dataset.yaml` (binary classification) +- **Emotion**: `emotion_prompt.txt` + `emotion_prompt_dataset.yaml` (6-class, benchmark against DSPy) -To use an example: -```bash -# Copy the example files -cp examples/ag_news_dataset.yaml dataset.yaml -cp examples/ag_news_prompt.txt initial_prompt.txt +### Creating New Tasks -# Run optimization -python ../../openevolve-run.py initial_prompt.txt evaluator.py --config config.yaml --iterations 100 -``` +To add a new dataset: +1. Create `yourtask_prompt.txt` with the initial prompt +2. Create `yourtask_prompt_dataset.yaml` with the dataset configuration +3. Run: `./run_evolution.sh yourtask_prompt.txt --iterations 50` + +**Note**: If you call OpenEvolve directly without the wrapper script, the evaluator will look for a default `dataset_config.yaml` file. ### Common Dataset Configurations: diff --git a/examples/llm_prompt_optimization/dataset_config.yaml b/examples/llm_prompt_optimization/dataset_config.yaml new file mode 100644 index 000000000..08ea83cbf --- /dev/null +++ b/examples/llm_prompt_optimization/dataset_config.yaml @@ -0,0 +1,9 @@ +# Default dataset configuration (fallback when not using run_evolution.sh) +# This is used when OpenEvolve is called directly without setting OPENEVOLVE_PROMPT +dataset_name: "stanfordnlp/imdb" +input_field: "text" +target_field: "label" # 0 or 1 +split: "test" + +# Evaluation samples +max_samples: 50 \ No newline at end of file diff --git a/examples/llm_prompt_optimization/emotion_prompt.txt b/examples/llm_prompt_optimization/emotion_prompt.txt new file mode 100644 index 000000000..c748df51d --- /dev/null +++ b/examples/llm_prompt_optimization/emotion_prompt.txt @@ -0,0 +1,5 @@ +Classify the emotion expressed in the following text. 
+ +Text: "{input_text}" + +Emotion (0-5): \ No newline at end of file diff --git a/examples/llm_prompt_optimization/emotion_prompt_dataset.yaml b/examples/llm_prompt_optimization/emotion_prompt_dataset.yaml new file mode 100644 index 000000000..46a2d5375 --- /dev/null +++ b/examples/llm_prompt_optimization/emotion_prompt_dataset.yaml @@ -0,0 +1,18 @@ +# HuggingFace dataset configuration for emotion classification +# This is a standard benchmark used by DSPy and others +dataset_name: "dair-ai/emotion" +input_field: "text" +target_field: "label" # 0-5: sadness, joy, love, anger, fear, surprise +split: "test" + +# Evaluation samples +max_samples: 200 # Larger sample for 6-class problem + +# Labels mapping for reference +label_names: + 0: "sadness" + 1: "joy" + 2: "love" + 3: "anger" + 4: "fear" + 5: "surprise" \ No newline at end of file diff --git a/examples/llm_prompt_optimization/evaluator.py b/examples/llm_prompt_optimization/evaluator.py index 3b00d06f8..cb4624d59 100644 --- a/examples/llm_prompt_optimization/evaluator.py +++ b/examples/llm_prompt_optimization/evaluator.py @@ -36,18 +36,32 @@ test_model = OpenAI(base_url=api_base) print(f"Initialized OpenAI client with model: {TASK_MODEL_NAME}") +# Determine which dataset to use based on the OPENEVOLVE_PROMPT environment variable +import sys +prompt_file = os.environ.get('OPENEVOLVE_PROMPT') +if not prompt_file: + # Default to a generic dataset config if not using the wrapper script + evaluator_dir = os.path.dirname(os.path.abspath(__file__)) + DATASET_CONFIG_PATH = os.path.join(evaluator_dir, 'dataset_config.yaml') + print("Warning: OPENEVOLVE_PROMPT not set. Using default dataset_config.yaml") +else: + basename = os.path.basename(prompt_file) + dataset_filename = basename.replace('_prompt.txt', '_prompt_dataset.yaml').replace('.txt', '_dataset.yaml') + evaluator_dir = os.path.dirname(os.path.abspath(__file__)) + DATASET_CONFIG_PATH = os.path.join(evaluator_dir, dataset_filename) + print(f"Dataset configuration: {dataset_filename}") + def load_prompt_config(prompt_path): - """Load the prompt from text file and dataset config from dataset.yaml.""" + """Load the prompt from text file and dataset config from matching _dataset.yaml file.""" # Load prompt from text file with open(prompt_path, 'r') as f: prompt = f.read().strip() - # Always load dataset configuration from the examples directory - # This ensures it works even when OpenEvolve copies files to temp directories - evaluator_dir = os.path.dirname(os.path.abspath(__file__)) - config_path = os.path.join(evaluator_dir, 'dataset.yaml') + # Load the configuration (already determined from environment variable) + if not os.path.exists(DATASET_CONFIG_PATH): + raise FileNotFoundError(f"Dataset configuration not found: {DATASET_CONFIG_PATH}") - with open(config_path, 'r') as f: + with open(DATASET_CONFIG_PATH, 'r') as f: config = yaml.safe_load(f) return config, prompt @@ -75,6 +89,9 @@ def evaluate_prompt(prompt, dataset, config, num_samples): input_field = config['input_field'] target_field = config['target_field'] + # Check if this is emotion classification (0-5) or sentiment (0-1) + is_emotion = 'emotion' in config.get('dataset_name', '').lower() + # Sample from dataset samples = dataset.select(range(min(num_samples, len(dataset)))) @@ -97,7 +114,7 @@ def evaluate_prompt(prompt, dataset, config, num_samples): model=TASK_MODEL_NAME, messages=messages, temperature=0.1, # Low temperature for consistent classification - max_tokens=10 # We only need a short response + max_tokens=20 # Allow 
slightly more tokens for emotion labels ) break except Exception as e: @@ -133,19 +150,41 @@ def evaluate_prompt(prompt, dataset, config, num_samples): # Extract prediction from output try: - # Look for a number (0 or 1) in the output - numbers = re.findall(r'\b[01]\b', output_text) - if numbers: - prediction = int(numbers[-1]) # Use the last number found + if is_emotion: + # For emotion classification (0-5) + numbers = re.findall(r'\b[0-5]\b', output_text) + if numbers: + prediction = int(numbers[-1]) # Use the last number found + else: + # Try to infer from emotion keywords + output_lower = output_text.lower() + emotion_map = { + 'sadness': 0, 'sad': 0, + 'joy': 1, 'happy': 1, 'happiness': 1, + 'love': 2, + 'anger': 3, 'angry': 3, + 'fear': 4, 'afraid': 4, 'scared': 4, + 'surprise': 5, 'surprised': 5 + } + prediction = -1 + for emotion, label in emotion_map.items(): + if emotion in output_lower: + prediction = label + break else: - # Try to infer from keywords - output_lower = output_text.lower() - if 'positive' in output_lower: - prediction = 1 - elif 'negative' in output_lower: - prediction = 0 + # For sentiment classification (0-1) + numbers = re.findall(r'\b[01]\b', output_text) + if numbers: + prediction = int(numbers[-1]) # Use the last number found else: - prediction = -1 # Invalid prediction + # Try to infer from keywords + output_lower = output_text.lower() + if 'positive' in output_lower: + prediction = 1 + elif 'negative' in output_lower: + prediction = 0 + else: + prediction = -1 # Invalid prediction if prediction == expected: correct += 1 diff --git a/examples/llm_prompt_optimization/examples/ag_news_dataset.yaml b/examples/llm_prompt_optimization/examples/ag_news_dataset.yaml deleted file mode 100644 index 0a333ae0e..000000000 --- a/examples/llm_prompt_optimization/examples/ag_news_dataset.yaml +++ /dev/null @@ -1,8 +0,0 @@ -# AG News topic classification dataset configuration -dataset_name: "ag_news" -input_field: "text" -target_field: "label" # 0: World, 1: Sports, 2: Business, 3: Sci/Tech -split: "test" - -# Evaluation samples -max_samples: 50 \ No newline at end of file diff --git a/examples/llm_prompt_optimization/examples/ag_news_prompt.txt b/examples/llm_prompt_optimization/examples/ag_news_prompt.txt deleted file mode 100644 index 8c2519202..000000000 --- a/examples/llm_prompt_optimization/examples/ag_news_prompt.txt +++ /dev/null @@ -1,9 +0,0 @@ -Classify the following news article into one of four categories: -0 - World News -1 - Sports -2 - Business -3 - Science/Technology - -Article: "{input_text}" - -Category (respond with only the number 0-3): \ No newline at end of file diff --git a/examples/llm_prompt_optimization/examples/emotion_dataset.yaml b/examples/llm_prompt_optimization/examples/emotion_dataset.yaml deleted file mode 100644 index 2ecb2ba32..000000000 --- a/examples/llm_prompt_optimization/examples/emotion_dataset.yaml +++ /dev/null @@ -1,8 +0,0 @@ -# Emotion classification dataset configuration -dataset_name: "dair-ai/emotion" -input_field: "text" -target_field: "label" # 0: sadness, 1: joy, 2: love, 3: anger, 4: fear, 5: surprise -split: "test" - -# Evaluation samples -max_samples: 50 \ No newline at end of file diff --git a/examples/llm_prompt_optimization/examples/emotion_prompt.txt b/examples/llm_prompt_optimization/examples/emotion_prompt.txt deleted file mode 100644 index fe8d18f2c..000000000 --- a/examples/llm_prompt_optimization/examples/emotion_prompt.txt +++ /dev/null @@ -1,11 +0,0 @@ -Identify the primary emotion expressed in this text: -0 
- Sadness -1 - Joy -2 - Love -3 - Anger -4 - Fear -5 - Surprise - -Text: "{input_text}" - -Emotion (number only): \ No newline at end of file diff --git a/examples/llm_prompt_optimization/dataset.yaml b/examples/llm_prompt_optimization/initial_prompt_dataset.yaml similarity index 100% rename from examples/llm_prompt_optimization/dataset.yaml rename to examples/llm_prompt_optimization/initial_prompt_dataset.yaml diff --git a/examples/llm_prompt_optimization/run_evolution.sh b/examples/llm_prompt_optimization/run_evolution.sh new file mode 100644 index 000000000..2d7daa4c6 --- /dev/null +++ b/examples/llm_prompt_optimization/run_evolution.sh @@ -0,0 +1,17 @@ +#!/bin/bash +# Wrapper script to run OpenEvolve with the correct dataset + +if [ $# -lt 1 ]; then + echo "Usage: $0 [additional_args...]" + echo "Example: $0 emotion_prompt.txt --iterations 50" + exit 1 +fi + +PROMPT_FILE=$1 +shift # Remove first argument + +# Set the environment variable for the evaluator +export OPENEVOLVE_PROMPT=$PROMPT_FILE + +# Run OpenEvolve +python ../../openevolve-run.py "$PROMPT_FILE" evaluator.py --config config.yaml "$@" \ No newline at end of file From b8380995af0fbadfbb0e0bcb3f21cc1645e06cc9 Mon Sep 17 00:00:00 2001 From: Asankhaya Sharma Date: Wed, 30 Jul 2025 11:01:00 +0800 Subject: [PATCH 03/12] Update emotion_prompt.txt --- examples/llm_prompt_optimization/emotion_prompt.txt | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/examples/llm_prompt_optimization/emotion_prompt.txt b/examples/llm_prompt_optimization/emotion_prompt.txt index c748df51d..a947907ac 100644 --- a/examples/llm_prompt_optimization/emotion_prompt.txt +++ b/examples/llm_prompt_optimization/emotion_prompt.txt @@ -1,5 +1,11 @@ -Classify the emotion expressed in the following text. +Classify the emotion in the following text. Choose exactly one emotion from this list: +- sadness +- joy +- love +- anger +- fear +- surprise Text: "{input_text}" -Emotion (0-5): \ No newline at end of file +Emotion (respond with one word only): \ No newline at end of file From 32158b7c045268bcf0783ddf43d5c1338b864ab7 Mon Sep 17 00:00:00 2001 From: Asankhaya Sharma Date: Wed, 30 Jul 2025 14:21:49 +0800 Subject: [PATCH 04/12] Add GSM8K prompt and dataset support for math tasks Introduces GSM8K prompt and dataset configuration files for grade school math problem evaluation. Updates evaluator.py to support GSM8K answer extraction and adjusts evaluation logic for numeric answers. Modifies config.yaml for new optimal parameters and documents GSM8K support in the README. --- examples/llm_prompt_optimization/README.md | 1 + examples/llm_prompt_optimization/config.yaml | 10 ++-- examples/llm_prompt_optimization/evaluator.py | 60 ++++++++++++++++--- .../llm_prompt_optimization/gsm8k_prompt.txt | 5 ++ .../gsm8k_prompt_dataset.yaml | 14 +++++ 5 files changed, 78 insertions(+), 12 deletions(-) create mode 100644 examples/llm_prompt_optimization/gsm8k_prompt.txt create mode 100644 examples/llm_prompt_optimization/gsm8k_prompt_dataset.yaml diff --git a/examples/llm_prompt_optimization/README.md b/examples/llm_prompt_optimization/README.md index fdd787ada..2819c5ec7 100644 --- a/examples/llm_prompt_optimization/README.md +++ b/examples/llm_prompt_optimization/README.md @@ -95,6 +95,7 @@ This optimizer works with any HuggingFace dataset. 
Included examples: - **IMDB Sentiment**: `initial_prompt.txt` + `initial_prompt_dataset.yaml` (binary classification) - **Emotion**: `emotion_prompt.txt` + `emotion_prompt_dataset.yaml` (6-class, benchmark against DSPy) +- **GSM8K**: `gsm8k_prompt.txt` + `gsm8k_prompt_dataset.yaml` (grade school math, DSPy achieves 97.1%) ### Creating New Tasks diff --git a/examples/llm_prompt_optimization/config.yaml b/examples/llm_prompt_optimization/config.yaml index 0bce0178c..f82be7d1f 100644 --- a/examples/llm_prompt_optimization/config.yaml +++ b/examples/llm_prompt_optimization/config.yaml @@ -16,7 +16,7 @@ llm: - name: "gemini-2.5-flash-lite" # Using Gemini 2.5 Flash Lite weight: 1.0 - temperature: 0.4 # Optimal from experiments + temperature: 0.8 # Optimal from experiments max_tokens: 16000 # Optimal context timeout: 150 retries: 3 @@ -53,8 +53,8 @@ database: # Selection parameters - Optimal ratios from testing elite_selection_ratio: 0.1 # 10% elite selection - exploration_ratio: 0.3 # 30% exploration - exploitation_ratio: 0.6 # 60% exploitation + exploration_ratio: 0.5 # 30% exploration + exploitation_ratio: 0.4 # 60% exploitation # Migration parameters - Optimal settings migration_interval: 10 @@ -62,8 +62,8 @@ database: # Evaluator Configuration evaluator: - timeout: 200 + timeout: 600 max_retries: 3 parallel_evaluations: 4 cascade_evaluation: true # Two-stage cascading evaluation - cascade_thresholds: [0.9] # Stage 1 must achieve 90% accuracy to proceed to stage 2 \ No newline at end of file + cascade_thresholds: [0.4] # Stage 1 must achieve 90% accuracy to proceed to stage 2 \ No newline at end of file diff --git a/examples/llm_prompt_optimization/evaluator.py b/examples/llm_prompt_optimization/evaluator.py index cb4624d59..5f1471bb7 100644 --- a/examples/llm_prompt_optimization/evaluator.py +++ b/examples/llm_prompt_optimization/evaluator.py @@ -69,17 +69,24 @@ def load_prompt_config(prompt_path): def load_hf_dataset(config): """Load HuggingFace dataset based on configuration.""" dataset_name = config['dataset_name'] + dataset_config = config.get('dataset_config', None) split = config.get('split', 'test') print(f"Loading dataset: {dataset_name}") try: # Try to load the specified split - dataset = load_dataset(dataset_name, split=split) + if dataset_config: + dataset = load_dataset(dataset_name, dataset_config, split=split) + else: + dataset = load_dataset(dataset_name, split=split) except: # Fallback to train split if test is not available print(f"Split '{split}' not found, falling back to 'train'") - dataset = load_dataset(dataset_name, split='train') + if dataset_config: + dataset = load_dataset(dataset_name, dataset_config, split='train') + else: + dataset = load_dataset(dataset_name, split='train') print(f"Dataset loaded with {len(dataset)} examples") return dataset @@ -89,8 +96,10 @@ def evaluate_prompt(prompt, dataset, config, num_samples): input_field = config['input_field'] target_field = config['target_field'] - # Check if this is emotion classification (0-5) or sentiment (0-1) - is_emotion = 'emotion' in config.get('dataset_name', '').lower() + # Check dataset type + dataset_name = config.get('dataset_name', '').lower() + is_emotion = 'emotion' in dataset_name + is_gsm8k = 'gsm8k' in dataset_name # Sample from dataset samples = dataset.select(range(min(num_samples, len(dataset)))) @@ -110,11 +119,14 @@ def evaluate_prompt(prompt, dataset, config, num_samples): # Call the LLM with retry logic for attempt in range(MAX_RETRIES): try: + # Adjust max_tokens based on task + max_tokens = 
500 if is_gsm8k else 20 + response = test_model.chat.completions.create( model=TASK_MODEL_NAME, messages=messages, - temperature=0.1, # Low temperature for consistent classification - max_tokens=20 # Allow slightly more tokens for emotion labels + temperature=0.1, # Low temperature for consistent results + max_tokens=max_tokens ) break except Exception as e: @@ -150,7 +162,41 @@ def evaluate_prompt(prompt, dataset, config, num_samples): # Extract prediction from output try: - if is_emotion: + if is_gsm8k: + # For GSM8K, extract the numeric answer after #### + # First, extract the expected answer from the ground truth + expected_answer = expected.split('####')[-1].strip() + try: + expected_number = float(expected_answer.replace(',', '')) + except: + print(f"Warning: Could not parse expected answer: {expected_answer}") + total += 1 + continue + + # Extract prediction from model output + prediction = None + if '####' in output_text: + predicted_answer = output_text.split('####')[-1].strip() + # Extract just the number, removing any extra text like $ signs + import re + numbers = re.findall(r'-?\$?[\d,]+\.?\d*', predicted_answer) + if numbers: + try: + # Remove $ and , from the number + number_str = numbers[0].replace('$', '').replace(',', '') + prediction = float(number_str) + except: + pass + + # If we found a prediction, check if it matches + if prediction is not None: + # Check if answers match (with small tolerance for floats) + if abs(prediction - expected_number) < 0.001: + correct += 1 + + total += 1 + + elif is_emotion: # For emotion classification (0-5) numbers = re.findall(r'\b[0-5]\b', output_text) if numbers: diff --git a/examples/llm_prompt_optimization/gsm8k_prompt.txt b/examples/llm_prompt_optimization/gsm8k_prompt.txt new file mode 100644 index 000000000..476efed05 --- /dev/null +++ b/examples/llm_prompt_optimization/gsm8k_prompt.txt @@ -0,0 +1,5 @@ +Solve the following grade school math problem step by step. + +Problem: {input_text} + +Show your work and reasoning for each step. After solving, provide your final numeric answer after "####". \ No newline at end of file diff --git a/examples/llm_prompt_optimization/gsm8k_prompt_dataset.yaml b/examples/llm_prompt_optimization/gsm8k_prompt_dataset.yaml new file mode 100644 index 000000000..db28e49eb --- /dev/null +++ b/examples/llm_prompt_optimization/gsm8k_prompt_dataset.yaml @@ -0,0 +1,14 @@ +# HuggingFace dataset configuration for GSM8K (Grade School Math) +# DSPy achieved 97.1% accuracy with GPT-4 on this benchmark +dataset_name: "openai/gsm8k" +dataset_config: "main" # GSM8K requires config name +input_field: "question" +target_field: "answer" # Contains step-by-step solution ending with #### followed by the numeric answer +split: "test" + +# Evaluation samples +max_samples: 200 # Start with subset, full test set has 1,319 problems + +# Note: The answer field contains the full solution with the format: +# "Step 1 explanation... Step 2... #### numeric_answer" +# The evaluator will need to extract the number after #### \ No newline at end of file From 47147544b8f47f6162d6584bcfeecfab7bbb853a Mon Sep 17 00:00:00 2001 From: Asankhaya Sharma Date: Wed, 30 Jul 2025 15:09:14 +0800 Subject: [PATCH 05/12] Prevent repeated migration of programs between islands Adds a check to avoid re-migrating already migrated programs in the ProgramDatabase migration logic. This prevents exponential duplication of identical programs, conserves computational resources, and maintains diversity in the MAP-Elites + Island hybrid architecture. 
Also updates config.yaml to adjust LLM temperature and selection ratios for improved optimization. --- examples/llm_prompt_optimization/config.yaml | 6 +++--- openevolve/database.py | 20 ++++++++++++++++++++ 2 files changed, 23 insertions(+), 3 deletions(-) diff --git a/examples/llm_prompt_optimization/config.yaml b/examples/llm_prompt_optimization/config.yaml index f82be7d1f..b5544500d 100644 --- a/examples/llm_prompt_optimization/config.yaml +++ b/examples/llm_prompt_optimization/config.yaml @@ -16,7 +16,7 @@ llm: - name: "gemini-2.5-flash-lite" # Using Gemini 2.5 Flash Lite weight: 1.0 - temperature: 0.8 # Optimal from experiments + temperature: 0.4 # Optimal from experiments max_tokens: 16000 # Optimal context timeout: 150 retries: 3 @@ -53,8 +53,8 @@ database: # Selection parameters - Optimal ratios from testing elite_selection_ratio: 0.1 # 10% elite selection - exploration_ratio: 0.5 # 30% exploration - exploitation_ratio: 0.4 # 60% exploitation + exploration_ratio: 0.3 # 30% exploration + exploitation_ratio: 0.6 # 60% exploitation # Migration parameters - Optimal settings migration_interval: 10 diff --git a/openevolve/database.py b/openevolve/database.py index 0d2ba6fd4..1c71e764a 100644 --- a/openevolve/database.py +++ b/openevolve/database.py @@ -1347,6 +1347,26 @@ def migrate_programs(self) -> None: target_islands = [(i + 1) % len(self.islands), (i - 1) % len(self.islands)] for migrant in migrants: + # Prevent re-migration of already migrated programs to avoid exponential duplication. + # Analysis of actual evolution runs shows this causes severe issues: + # - Program cb5d07f2 had 183 descendant copies by iteration 850 + # - Program 5645fbd2 had 31 descendant copies + # - IDs grow exponentially: program_migrant_2_migrant_3_migrant_4_migrant_0... + # + # This is particularly problematic for OpenEvolve's MAP-Elites + Island hybrid architecture: + # 1. All copies have identical code β†’ same complexity/diversity/performance scores + # 2. They all map to the SAME MAP-Elites cell β†’ only 1 survives, rest discarded + # 3. Wastes computation evaluating hundreds of identical programs + # 4. Reduces actual diversity as islands fill with duplicates + # + # By preventing already-migrated programs from migrating again, we ensure: + # - Each program migrates at most once per lineage + # - True diversity is maintained between islands + # - Computational resources aren't wasted on duplicates + # - Aligns with MAP-Elites' one-program-per-cell principle + if migrant.metadata.get("migrant", False): + continue + for target_island in target_islands: # Create a copy for migration (to avoid removing from source) migrant_copy = Program( From 6d6d50e8e104586c4eb5c47eb4749eb1499afb75 Mon Sep 17 00:00:00 2001 From: Asankhaya Sharma Date: Wed, 30 Jul 2025 17:54:22 +0800 Subject: [PATCH 06/12] Fix island initialization to use copies of best program When initializing an empty island, a new copy of the best program is now created with a unique ID, rather than reusing the same program instance. This prevents a program from being assigned to multiple islands and ensures correct lineage tracking. Additional tests were added to verify correct migration behavior, unique program assignment per island, and proper handling of empty island initialization. 
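A minimal sketch of the fix, assuming a Program-like object with `id`, `parent_id`, `timestamp`, and `metadata` attributes (illustrative only; the real change is in `database.py` below):

```python
# Seed an empty island with a copy of the best program under a fresh ID,
# so the same Program instance is never shared across islands and lineage
# still points back to the original.
import time
import uuid
from copy import deepcopy

def seed_empty_island(best_program, island_id):
    clone = deepcopy(best_program)
    clone.id = str(uuid.uuid4())        # new unique ID, never the original's
    clone.parent_id = best_program.id   # lineage tracks the source program
    clone.timestamp = time.time()
    clone.metadata = {"island": island_id}
    return clone
```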
--- openevolve/database.py | 53 ++++++++++-- tests/test_database.py | 178 +++++++++++++++++++++++++++++++++++++++++ 2 files changed, 223 insertions(+), 8 deletions(-) diff --git a/openevolve/database.py b/openevolve/database.py index 1c71e764a..740768839 100644 --- a/openevolve/database.py +++ b/openevolve/database.py @@ -8,6 +8,7 @@ import os import random import time +import uuid from dataclasses import asdict, dataclass, field, fields # FileLock removed - no longer needed with threaded parallel processing @@ -998,12 +999,29 @@ def _sample_exploration_parent(self) -> Program: if not current_island_programs: # If current island is empty, initialize with best program or random program if self.best_program_id and self.best_program_id in self.programs: - # Clone best program to current island + # Create a copy of best program for the empty island (don't reuse same ID) best_program = self.programs[self.best_program_id] - self.islands[self.current_island].add(self.best_program_id) - best_program.metadata["island"] = self.current_island - logger.debug(f"Initialized empty island {self.current_island} with best program") - return best_program + copy_program = Program( + id=str(uuid.uuid4()), + code=best_program.code, + language=best_program.language, + parent_id=best_program.id, + generation=best_program.generation, + timestamp=time.time(), + iteration_found=self.last_iteration, + metrics=best_program.metrics.copy(), + complexity=best_program.complexity, + diversity=best_program.diversity, + metadata={"island": self.current_island}, + artifacts_json=best_program.artifacts_json, + artifact_dir=best_program.artifact_dir, + ) + self.programs[copy_program.id] = copy_program + self.islands[self.current_island].add(copy_program.id) + logger.debug( + f"Initialized empty island {self.current_island} with copy of best program" + ) + return copy_program else: # Use any available program return next(iter(self.programs.values())) @@ -1026,10 +1044,29 @@ def _sample_exploration_parent(self) -> Program: f"Island {self.current_island} has no valid programs after cleanup, reinitializing" ) if self.best_program_id and self.best_program_id in self.programs: + # Create a copy of best program for the empty island (don't reuse same ID) best_program = self.programs[self.best_program_id] - self.islands[self.current_island].add(self.best_program_id) - best_program.metadata["island"] = self.current_island - return best_program + copy_program = Program( + id=str(uuid.uuid4()), + code=best_program.code, + language=best_program.language, + parent_id=best_program.id, + generation=best_program.generation, + timestamp=time.time(), + iteration_found=self.last_iteration, + metrics=best_program.metrics.copy(), + complexity=best_program.complexity, + diversity=best_program.diversity, + metadata={"island": self.current_island}, + artifacts_json=best_program.artifacts_json, + artifact_dir=best_program.artifact_dir, + ) + self.programs[copy_program.id] = copy_program + self.islands[self.current_island].add(copy_program.id) + logger.debug( + f"Reinitialized empty island {self.current_island} with copy of best program" + ) + return copy_program else: return next(iter(self.programs.values())) diff --git a/tests/test_database.py b/tests/test_database.py index 0d17f8961..cd11a7e26 100644 --- a/tests/test_database.py +++ b/tests/test_database.py @@ -3,6 +3,7 @@ """ import unittest +import uuid from openevolve.config import Config from openevolve.database import Program, ProgramDatabase @@ -457,6 +458,183 @@ def 
test_diversity_feature_integration(self): self.assertGreaterEqual(coord, 0) self.assertLess(coord, self.db.feature_bins) + def test_migration_prevents_re_migration(self): + """Test that programs marked as migrants don't migrate again""" + # Create database with multiple islands + config = Config() + config.database.in_memory = True + config.database.num_islands = 3 + config.database.migration_interval = 1 # Migrate every generation + multi_db = ProgramDatabase(config.database) + + # Add programs to each island (avoid "migrant" in original IDs) + for i in range(3): + program = Program( + id=f"test_prog_{i}", + code=f"def test_{i}(): return {i}", + language="python", + metrics={"score": 0.5 + i * 0.1}, + ) + multi_db.add(program, target_island=i) + + # Manually mark one as a migrant + migrant_program = multi_db.get("test_prog_0") + migrant_program.metadata["migrant"] = True + + # Store original ID + original_id = migrant_program.id + + # Count initial programs with "_migrant_" pattern (created by migration) + initial_migrant_count = sum(1 for pid in multi_db.programs if "_migrant_" in pid) + self.assertEqual(initial_migrant_count, 0) # Should be none initially + + # Run migration + multi_db.island_generations[0] = config.database.migration_interval + multi_db.island_generations[1] = config.database.migration_interval + multi_db.island_generations[2] = config.database.migration_interval + multi_db.migrate_programs() + + # Check that the migrant program wasn't re-migrated + # It should still exist with the same ID (not a new migrant ID) + still_exists = multi_db.get(original_id) + self.assertIsNotNone(still_exists) + + # Count new programs created by migration (identified by "_migrant_" pattern) + new_migrant_ids = [pid for pid in multi_db.programs if "_migrant_" in pid] + + # Each non-migrant program (2 of them) migrates to 2 adjacent islands + # So we expect 2 * 2 = 4 new migrant programs + # The already-marked migrant (test_prog_0) should NOT create any new copies + self.assertEqual(len(new_migrant_ids), 4) + + # Verify the already-migrant program didn't create new copies + migrant_descendants = [pid for pid in new_migrant_ids if original_id in pid] + self.assertEqual(len(migrant_descendants), 0, + f"Program {original_id} should not have created migrant copies") + + def test_empty_island_initialization_creates_copies(self): + """Test that empty islands are initialized with copies, not shared references""" + # Create database with multiple islands + config = Config() + config.database.in_memory = True + config.database.num_islands = 3 + # Force exploration mode to test empty island handling + config.database.exploration_ratio = 1.0 + config.database.exploitation_ratio = 0.0 + multi_db = ProgramDatabase(config.database) + + # Add a single program to island 1 + program = Program( + id="original_program", + code="def original(): return 42", + language="python", + metrics={"score": 0.9, "combined_score": 0.9}, + ) + multi_db.add(program, target_island=1) + + # Make it the best program + multi_db.best_program_id = "original_program" + + # Switch to empty island 0 and sample + multi_db.set_current_island(0) + sampled_parent, _ = multi_db.sample() + + # The sampled program should be a copy, not the original + self.assertNotEqual(sampled_parent.id, "original_program") + self.assertEqual(sampled_parent.code, program.code) # Same code + self.assertEqual(sampled_parent.parent_id, "original_program") # Parent is the original + + # Check island membership + self.assertIn("original_program", 
multi_db.islands[1]) + self.assertNotIn("original_program", multi_db.islands[0]) + self.assertIn(sampled_parent.id, multi_db.islands[0]) + + # Run validation - should not raise any errors + multi_db._validate_migration_results() + + def test_no_program_assigned_to_multiple_islands(self): + """Test that programs are never assigned to multiple islands""" + # Create database with multiple islands + config = Config() + config.database.in_memory = True + config.database.num_islands = 4 + multi_db = ProgramDatabase(config.database) + + # Add programs to different islands + program_ids = [] + for i in range(4): + program = Program( + id=f"island_test_{i}", + code=f"def test_{i}(): return {i}", + language="python", + metrics={"score": 0.5 + i * 0.1, "combined_score": 0.5 + i * 0.1}, + ) + multi_db.add(program, target_island=i) + program_ids.append(program.id) + + # Make the best program from island 3 + multi_db.best_program_id = "island_test_3" + + # Sample from empty islands - this should create copies + for empty_island in range(4): + if len(multi_db.islands[empty_island]) == 0: + multi_db.set_current_island(empty_island) + parent, _ = multi_db.sample() + + # Check that no program ID appears in multiple islands + all_island_programs = {} + for island_idx, island_programs in enumerate(multi_db.islands): + for program_id in island_programs: + if program_id in all_island_programs: + self.fail( + f"Program {program_id} found in both island {all_island_programs[program_id]} " + f"and island {island_idx}" + ) + all_island_programs[program_id] = island_idx + + # Run validation - should not raise any errors + multi_db._validate_migration_results() + + def test_migration_validation_passes(self): + """Test that migration validation passes after our fixes""" + # Create database with multiple islands + config = Config() + config.database.in_memory = True + config.database.num_islands = 3 + config.database.migration_interval = 1 + multi_db = ProgramDatabase(config.database) + + # Add programs and run several migration cycles + for i in range(6): + program = Program( + id=f"test_program_{i}", + code=f"def test_{i}(): return {i * 2}", + language="python", + metrics={"score": 0.4 + i * 0.1, "combined_score": 0.4 + i * 0.1}, + ) + multi_db.add(program, target_island=i % 3) + + # Run multiple migration cycles + for cycle in range(3): + # Increment generations to trigger migration + for island in range(3): + multi_db.island_generations[island] += 1 + + # Migrate programs + multi_db.migrate_programs() + + # Validation should pass without warnings + multi_db._validate_migration_results() + + # Verify no program has exponential ID growth + for program_id in multi_db.programs: + # Count occurrences of "migrant" in ID + migrant_count = program_id.count("migrant") + self.assertLessEqual( + migrant_count, 1, + f"Program ID {program_id} has been migrated multiple times" + ) + if __name__ == "__main__": unittest.main() From cd5ccefddd324e0da130809a9d8904747c859632 Mon Sep 17 00:00:00 2001 From: Asankhaya Sharma Date: Wed, 30 Jul 2025 20:03:10 +0800 Subject: [PATCH 07/12] Add custom prompt feature extraction for MAP-Elites Introduces a `calculate_prompt_features` function in the evaluator to bin prompts by length and reasoning strategy, returning these as features for MAP-Elites optimization. Updates config.yaml to specify these features and their binning. Evaluator now returns these features alongside the combined score in both evaluation stages. 
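The returned metrics therefore take roughly the shape sketched below (the values are made up; the two feature keys match `feature_dimensions` in `config.yaml`, and the binning itself is implemented by `calculate_prompt_features` in the diff):

```python
# Illustrative shape of the evaluator's return value after this change
example_result = {
    "combined_score": 0.82,     # accuracy on the sampled examples
    "prompt_length": 4,         # MAP-Elites length bin, 0-9
    "reasoning_strategy": 5,    # MAP-Elites reasoning bin, 0-9
}
```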
--- examples/llm_prompt_optimization/config.yaml | 5 ++ examples/llm_prompt_optimization/evaluator.py | 90 ++++++++++++++++++- 2 files changed, 93 insertions(+), 2 deletions(-) diff --git a/examples/llm_prompt_optimization/config.yaml b/examples/llm_prompt_optimization/config.yaml index b5544500d..19e8b3015 100644 --- a/examples/llm_prompt_optimization/config.yaml +++ b/examples/llm_prompt_optimization/config.yaml @@ -51,6 +51,11 @@ database: archive_size: 100 num_islands: 4 + # Feature dimensions for MAP-Elites + # Using custom features returned by the evaluator + feature_dimensions: ["prompt_length", "reasoning_strategy"] + feature_bins: 10 # 10x10 grid = 100 cells + # Selection parameters - Optimal ratios from testing elite_selection_ratio: 0.1 # 10% elite selection exploration_ratio: 0.3 # 30% exploration diff --git a/examples/llm_prompt_optimization/evaluator.py b/examples/llm_prompt_optimization/evaluator.py index 5f1471bb7..c8259b2d4 100644 --- a/examples/llm_prompt_optimization/evaluator.py +++ b/examples/llm_prompt_optimization/evaluator.py @@ -51,6 +51,80 @@ DATASET_CONFIG_PATH = os.path.join(evaluator_dir, dataset_filename) print(f"Dataset configuration: {dataset_filename}") + +def calculate_prompt_features(prompt): + """ + Calculate custom features for MAP-Elites binning + + Returns: + tuple: (prompt_length, reasoning_strategy) - both in range 0-9 + """ + # Feature 1: Prompt length bin (0-9) + length = len(prompt) + if length < 100: + prompt_length = 0 # Minimal + elif length < 200: + prompt_length = 1 # Very short + elif length < 400: + prompt_length = 2 # Short + elif length < 600: + prompt_length = 3 # Medium-short + elif length < 900: + prompt_length = 4 # Medium + elif length < 1200: + prompt_length = 5 # Medium-long + elif length < 1600: + prompt_length = 6 # Long + elif length < 2000: + prompt_length = 7 # Very long + elif length < 2500: + prompt_length = 8 # Extensive + else: + prompt_length = 9 # Very extensive + + # Feature 2: Reasoning strategy (0-9) + prompt_lower = prompt.lower() + + # Check for few-shot examples + has_example = ('example' in prompt_lower or + prompt.count('####') >= 4 or + bool(re.search(r'problem:.*?solution:', prompt_lower, re.DOTALL))) + + # Check for Chain-of-Thought (CoT) indicators + has_cot = ('step by step' in prompt_lower or + 'step-by-step' in prompt_lower or + any(phrase in prompt_lower for phrase in ['think through', 'reasoning', 'explain your']) or + bool(re.search(r'(first|then|next|finally)', prompt_lower))) + + # Assign reasoning strategy bins + if has_example: + # Few-shot examples (bins 7-9) + if has_cot: + reasoning_strategy = 9 # Few-shot + CoT (most sophisticated) + elif length > 1500: + reasoning_strategy = 8 # Extensive few-shot + else: + reasoning_strategy = 7 # Basic few-shot + elif has_cot: + # Chain-of-thought (bins 4-6) + if 'must' in prompt_lower or 'exactly' in prompt_lower: + reasoning_strategy = 6 # Strict CoT + elif length > 500: + reasoning_strategy = 5 # Detailed CoT + else: + reasoning_strategy = 4 # Basic CoT + else: + # Basic prompts (bins 0-3) + if length < 100: + reasoning_strategy = 0 # Minimal + elif 'solve' in prompt_lower or 'calculate' in prompt_lower: + reasoning_strategy = 2 # Direct instruction + else: + reasoning_strategy = 1 # Simple prompt + + return prompt_length, reasoning_strategy + + def load_prompt_config(prompt_path): """Load the prompt from text file and dataset config from matching _dataset.yaml file.""" # Load prompt from text file @@ -280,8 +354,14 @@ def evaluate_stage1(prompt_path): 
print(f"Stage 1 accuracy: {accuracy:.3f} ({correct}/{total})") print('-' * 80) + # Calculate custom features + prompt_length, reasoning_strategy = calculate_prompt_features(prompt) + print(f"Prompt features - Length bin: {prompt_length}, Reasoning bin: {reasoning_strategy}") + return { - "combined_score": accuracy + "combined_score": accuracy, + "prompt_length": prompt_length, + "reasoning_strategy": reasoning_strategy } except Exception as e: @@ -329,8 +409,14 @@ def evaluate_stage2(prompt_path): print(f"Stage 2 accuracy: {accuracy:.3f} ({correct}/{total})") print('-' * 80) + # Calculate custom features + prompt_length, reasoning_strategy = calculate_prompt_features(prompt) + print(f"Prompt features - Length bin: {prompt_length}, Reasoning bin: {reasoning_strategy}") + return { - "combined_score": accuracy + "combined_score": accuracy, + "prompt_length": prompt_length, + "reasoning_strategy": reasoning_strategy } except Exception as e: From 65118b5e162d4f887e41e759e560e92fc1306b97 Mon Sep 17 00:00:00 2001 From: Asankhaya Sharma Date: Wed, 30 Jul 2025 22:38:16 +0800 Subject: [PATCH 08/12] Update evaluator threshold logic and config parameters Updated the evaluator to prioritize 'combined_score' when checking thresholds for consistency with evolution, falling back to averaging metrics if not present. Increased evaluator timeout and cascade threshold in the config, and switched to using max_tokens from config in prompt evaluation. Also updated the LLM model name in the config. --- examples/llm_prompt_optimization/config.yaml | 6 +++--- examples/llm_prompt_optimization/evaluator.py | 10 ++++++---- openevolve/evaluator.py | 12 +++++++++++- 3 files changed, 20 insertions(+), 8 deletions(-) diff --git a/examples/llm_prompt_optimization/config.yaml b/examples/llm_prompt_optimization/config.yaml index 19e8b3015..64d75256c 100644 --- a/examples/llm_prompt_optimization/config.yaml +++ b/examples/llm_prompt_optimization/config.yaml @@ -13,7 +13,7 @@ language: "text" # Explicitly set language to text for prompt evolution llm: api_base: "https://generativelanguage.googleapis.com/v1beta/openai/" models: - - name: "gemini-2.5-flash-lite" # Using Gemini 2.5 Flash Lite + - name: "gemini-2.5-flash" # Using Gemini 2.5 Flash Lite weight: 1.0 temperature: 0.4 # Optimal from experiments @@ -67,8 +67,8 @@ database: # Evaluator Configuration evaluator: - timeout: 600 + timeout: 1000 max_retries: 3 parallel_evaluations: 4 cascade_evaluation: true # Two-stage cascading evaluation - cascade_thresholds: [0.4] # Stage 1 must achieve 90% accuracy to proceed to stage 2 \ No newline at end of file + cascade_thresholds: [0.8] # Stage 1 must achieve 90% accuracy to proceed to stage 2 \ No newline at end of file diff --git a/examples/llm_prompt_optimization/evaluator.py b/examples/llm_prompt_optimization/evaluator.py index c8259b2d4..10e4acffc 100644 --- a/examples/llm_prompt_optimization/evaluator.py +++ b/examples/llm_prompt_optimization/evaluator.py @@ -32,6 +32,10 @@ evaluator_config = config.get('evaluator', {}) MAX_RETRIES = evaluator_config.get('max_retries', 3) +# Get max_tokens from LLM config +MAX_TOKENS = llm_config.get('max_tokens', 16000) +print(f"Using max_tokens: {MAX_TOKENS}") + # Initialize OpenAI client once for all evaluations test_model = OpenAI(base_url=api_base) print(f"Initialized OpenAI client with model: {TASK_MODEL_NAME}") @@ -193,14 +197,12 @@ def evaluate_prompt(prompt, dataset, config, num_samples): # Call the LLM with retry logic for attempt in range(MAX_RETRIES): try: - # Adjust max_tokens 
based on task - max_tokens = 500 if is_gsm8k else 20 - + # Use max_tokens from config response = test_model.chat.completions.create( model=TASK_MODEL_NAME, messages=messages, temperature=0.1, # Low temperature for consistent results - max_tokens=max_tokens + max_tokens=MAX_TOKENS ) break except Exception as e: diff --git a/openevolve/evaluator.py b/openevolve/evaluator.py index 25d880987..80bcac333 100644 --- a/openevolve/evaluator.py +++ b/openevolve/evaluator.py @@ -644,6 +644,9 @@ def _create_cascade_error_context(self, stage: str, error: Exception) -> dict: def _passes_threshold(self, metrics: Dict[str, float], threshold: float) -> bool: """ Check if metrics pass a threshold + + Uses 'combined_score' if available (for consistency with evolution), + otherwise falls back to averaging all numeric metrics except 'error' Args: metrics: Dictionary of metric name to score @@ -655,7 +658,14 @@ def _passes_threshold(self, metrics: Dict[str, float], threshold: float) -> bool if not metrics: return False - # Calculate average score, skipping non-numeric values and 'error' key + # Use combined_score if available - this is what evolution uses + if "combined_score" in metrics: + score = metrics.get("combined_score") + if isinstance(score, (int, float)): + return float(score) >= threshold + + # Fallback: average all numeric metrics except 'error' + # This maintains backward compatibility valid_metrics = [] for name, value in metrics.items(): # Skip 'error' keys and ensure values are numeric From 2c8e9896df27747fdd4beda10fbcdfd59d40334d Mon Sep 17 00:00:00 2001 From: Asankhaya Sharma Date: Wed, 30 Jul 2025 22:48:05 +0800 Subject: [PATCH 09/12] Update evaluator.py --- examples/llm_prompt_optimization/evaluator.py | 1 + 1 file changed, 1 insertion(+) diff --git a/examples/llm_prompt_optimization/evaluator.py b/examples/llm_prompt_optimization/evaluator.py index 10e4acffc..49fad99ba 100644 --- a/examples/llm_prompt_optimization/evaluator.py +++ b/examples/llm_prompt_optimization/evaluator.py @@ -271,6 +271,7 @@ def evaluate_prompt(prompt, dataset, config, num_samples): correct += 1 total += 1 + continue # Skip the general case to avoid double counting elif is_emotion: # For emotion classification (0-5) From b9468db59f7c1154360c8ab9005cbf870130302f Mon Sep 17 00:00:00 2001 From: Asankhaya Sharma Date: Thu, 31 Jul 2025 08:12:19 +0800 Subject: [PATCH 10/12] Update config.yaml --- examples/llm_prompt_optimization/config.yaml | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/examples/llm_prompt_optimization/config.yaml b/examples/llm_prompt_optimization/config.yaml index 64d75256c..0cabe0710 100644 --- a/examples/llm_prompt_optimization/config.yaml +++ b/examples/llm_prompt_optimization/config.yaml @@ -13,8 +13,10 @@ language: "text" # Explicitly set language to text for prompt evolution llm: api_base: "https://generativelanguage.googleapis.com/v1beta/openai/" models: - - name: "gemini-2.5-flash" # Using Gemini 2.5 Flash Lite - weight: 1.0 + - name: "gemini-2.5-flash-lite" # Using Gemini 2.5 Flash Lite + weight: 0.9 + - name: "gemini-2.5-flash" # Using Gemini 2.5 Flash + weight: 0.1 temperature: 0.4 # Optimal from experiments max_tokens: 16000 # Optimal context @@ -67,8 +69,8 @@ database: # Evaluator Configuration evaluator: - timeout: 1000 + timeout: 1800 max_retries: 3 parallel_evaluations: 4 cascade_evaluation: true # Two-stage cascading evaluation - cascade_thresholds: [0.8] # Stage 1 must achieve 90% accuracy to proceed to stage 2 \ No newline at end of file + 
cascade_thresholds: [0.9] # Stage 1 must achieve 90% accuracy to proceed to stage 2 \ No newline at end of file From 892f80b4247120b00ad78da336f2bc9aa0b5d968 Mon Sep 17 00:00:00 2001 From: Asankhaya Sharma Date: Thu, 31 Jul 2025 11:59:48 +0800 Subject: [PATCH 11/12] Update config.yaml --- examples/llm_prompt_optimization/config.yaml | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/examples/llm_prompt_optimization/config.yaml b/examples/llm_prompt_optimization/config.yaml index 0cabe0710..da644f77c 100644 --- a/examples/llm_prompt_optimization/config.yaml +++ b/examples/llm_prompt_optimization/config.yaml @@ -14,9 +14,7 @@ llm: api_base: "https://generativelanguage.googleapis.com/v1beta/openai/" models: - name: "gemini-2.5-flash-lite" # Using Gemini 2.5 Flash Lite - weight: 0.9 - - name: "gemini-2.5-flash" # Using Gemini 2.5 Flash - weight: 0.1 + weight: 1.0 temperature: 0.4 # Optimal from experiments max_tokens: 16000 # Optimal context From f903f7eccfa8fedea2b1d5c6a7bd80a0c1ae44d9 Mon Sep 17 00:00:00 2001 From: Asankhaya Sharma Date: Thu, 31 Jul 2025 12:01:29 +0800 Subject: [PATCH 12/12] Update README.md --- examples/llm_prompt_optimization/README.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/examples/llm_prompt_optimization/README.md b/examples/llm_prompt_optimization/README.md index 2819c5ec7..77ff57311 100644 --- a/examples/llm_prompt_optimization/README.md +++ b/examples/llm_prompt_optimization/README.md @@ -1,11 +1,11 @@ -# HuggingFace Dataset Prompt Optimization with OpenEvolve πŸš€ +# LLM Prompt Optimization with OpenEvolve πŸš€ -This example demonstrates how to use OpenEvolve to automatically optimize prompts for any HuggingFace dataset. The system uses evolutionary search to discover high-performing prompts by testing them against ground truth data. +This example demonstrates how to use OpenEvolve to automatically optimize prompts for Large Language Models. The system uses evolutionary search to discover high-performing prompts by testing them against ground truth data from various datasets. ## 🎯 Overview OpenEvolve automatically: -- Loads any HuggingFace dataset +- Loads datasets from various sources - Evolves prompts through multiple generations - Uses cascading evaluation for efficiency - Finds optimal prompts for your specific task and model @@ -43,8 +43,8 @@ This example uses a naming convention to match prompts with their dataset config Create your dataset configuration file (e.g., `emotion_prompt_dataset.yaml`): ```yaml -# HuggingFace dataset configuration -dataset_name: "dair-ai/emotion" # Any HuggingFace dataset +# Dataset configuration +dataset_name: "dair-ai/emotion" # Dataset identifier input_field: "text" # Field containing input data target_field: "label" # Field containing ground truth split: "test" # Dataset split to use @@ -91,7 +91,7 @@ python ../../openevolve-run.py emotion_prompt.txt evaluator.py --config config.y ## πŸ“Š Supported Datasets -This optimizer works with any HuggingFace dataset. Included examples: +This optimizer works with a wide variety of datasets. Included examples: - **IMDB Sentiment**: `initial_prompt.txt` + `initial_prompt_dataset.yaml` (binary classification) - **Emotion**: `emotion_prompt.txt` + `emotion_prompt_dataset.yaml` (6-class, benchmark against DSPy) @@ -142,7 +142,7 @@ target_field: "summary" The evaluator uses a straightforward single-stage evaluation: -1. **Load Dataset**: Downloads the specified HuggingFace dataset +1. 
**Load Dataset**: Downloads the specified dataset 2. **Sample Data**: Takes `max_samples` examples from the dataset 3. **Test Prompt**: Sends each example through the LLM with the prompt 4. **Calculate Accuracy**: Compares LLM outputs to ground truth labels @@ -223,7 +223,7 @@ While the default setup is for classification, you can modify the evaluator for: ## πŸ› Troubleshooting ### Dataset Not Found -- Check the exact name on HuggingFace +- Check the exact dataset name and source - Some datasets require acceptance of terms ### Low Stage 1 Accuracy @@ -246,7 +246,7 @@ While the default setup is for classification, you can modify the evaluator for: ## πŸ“š Next Steps -- Try different datasets from HuggingFace +- Try different datasets and benchmarks - Experiment with different models - Adjust evolution parameters in config.yaml - Create task-specific evaluation metrics