## Evaluator-Optimizer Workflow

This notebook demonstrates the evaluator-optimizer workflow, where one LLM call generates a response and a second LLM call evaluates it and provides feedback, in an iterative loop. Each iteration builds on the feedback from previous attempts, and the loop ends once the evaluator accepts the output.

### When to use this workflow

This workflow is particularly effective when:

- You have clear evaluation criteria
- The task can benefit from iterative refinement
- LLM responses can be demonstrably improved when feedback is provided
- The LLM can provide meaningful feedback itself

### Examples where this workflow is useful

- Code generation and review: One LLM writes code while another reviews it for bugs, style issues, and performance problems
- Content writing: One LLM writes content while another checks for tone, clarity, and accuracy
- Data analysis: One LLM performs analysis while another validates the methodology and conclusions
- Document summarization: One LLM creates summaries while another ensures key information is retained

![](https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-cdn.anthropic.com%2Fimages%2F4zrzovbb%2Fwebsite%2F14f51e6406ccb29e695da48b17017e899a6119c7-2401x1000.png&w=3840&q=75)

## Setup - Configure the LLM to use Amazon Bedrock

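To keep the setup simple, we call the model through LiteLLM instead of using the Boto3 Bedrock client directly. Boto3 is used only to look up the AWS region configured for the current session.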

In [None]:
from litellm import completion
import re
import boto3


def llm_call(prompt: str, system_prompt: str = "") -> str:
    """
    Calls the model with the given prompt and returns the response.

    Args:
        prompt (str): The user prompt to send to the model.
        system_prompt (str, optional): The system prompt to send to the model. Defaults to "".

    Returns:
        str: The response from the language model.
    """
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": prompt})
    
    response = completion(
        model="bedrock/us.amazon.nova-pro-v1:0",
        aws_region_name=boto3.Session().region_name,
        messages=messages,
        max_tokens=4096,
        temperature=0.1
    )
    return response.choices[0].message.content

def extract_xml(text: str, tag: str) -> str:
    """
    Extracts the content of the specified XML tag from the given text.
    Used for parsing structured responses.

    Args:
        text (str): The text containing the XML.
        tag (str): The XML tag to extract content from.

    Returns:
        str: The content of the specified XML tag, or an empty string if the tag is not found.
    """
    match = re.search(f'<{tag}>(.*?)</{tag}>', text, re.DOTALL)
    return match.group(1) if match else ""
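
As a quick check of how `extract_xml` behaves, the sketch below runs it on a hand-written reply in the evaluator format used later in this notebook. The `sample_reply` string is made up for illustration and is not real model output; the function is repeated so the snippet runs on its own.

```python
import re


def extract_xml(text: str, tag: str) -> str:
    """Same helper as above, repeated so this snippet is self-contained."""
    match = re.search(f"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return match.group(1) if match else ""


# Hypothetical reply, written by hand to mimic the evaluator format.
sample_reply = """
<evaluation>NEEDS_IMPROVEMENT</evaluation>
<feedback>
pop() does not handle an empty stack.
</feedback>
"""

status = extract_xml(sample_reply, "evaluation")
feedback = extract_xml(sample_reply, "feedback")
missing = extract_xml(sample_reply, "thoughts")

print(repr(status))    # 'NEEDS_IMPROVEMENT'
print(repr(feedback))  # '\npop() does not handle an empty stack.\n'
print(repr(missing))   # ''
```

The extracted text keeps any surrounding newlines, so a caller that compares it against a fixed string such as `"PASS"` may want to call `.strip()` first.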

## Implementation

In [5]:
def generate(prompt: str, task: str, context: str = "") -> tuple[str, str]:
    """Generate and improve a solution based on feedback."""
    full_prompt = f"{prompt}\n{context}\nTask: {task}" if context else f"{prompt}\nTask: {task}"
    response = llm_call(full_prompt)
    thoughts = extract_xml(response, "thoughts")
    result = extract_xml(response, "response")
    
    print("\n=== GENERATION START ===")
    print(f"Thoughts:\n{thoughts}\n")
    print(f"Generated:\n{result}")
    print("=== GENERATION END ===\n")
    
    return thoughts, result

def evaluate(prompt: str, content: str, task: str) -> tuple[str, str]:
    """Evaluate if a solution meets requirements."""
    full_prompt = f"{prompt}\nOriginal task: {task}\nContent to evaluate: {content}"
    response = llm_call(full_prompt)
    evaluation = extract_xml(response, "evaluation")
    feedback = extract_xml(response, "feedback")
    
    print("=== EVALUATION START ===")
    print(f"Status: {evaluation}")
    print(f"Feedback: {feedback}")
    print("=== EVALUATION END ===\n")
    
    return evaluation, feedback

def loop(task: str, evaluator_prompt: str, generator_prompt: str) -> tuple[str, list[dict]]:
    """Keep generating and evaluating until requirements are met."""
    memory = []
    chain_of_thought = []
    
    thoughts, result = generate(generator_prompt, task)
    memory.append(result)
    chain_of_thought.append({"thoughts": thoughts, "result": result})
    
    while True:
        evaluation, feedback = evaluate(evaluator_prompt, result, task)
        # Strip whitespace so a reply like "<evaluation> PASS </evaluation>" still ends the loop.
        if evaluation.strip() == "PASS":
            return result, chain_of_thought
            
        context = "\n".join([
            "Previous attempts:",
            *[f"- {m}" for m in memory],
            f"\nFeedback: {feedback}"
        ])
        
        thoughts, result = generate(generator_prompt, task, context)
        memory.append(result)
        chain_of_thought.append({"thoughts": thoughts, "result": result})
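
The `loop` function above keeps going until the evaluator returns `PASS`, so an evaluator that never passes would keep calling the model indefinitely. One possible safeguard is an iteration cap. The sketch below is an illustrative variant, not part of the original notebook: `bounded_loop`, `build_context`, and `max_iterations` are hypothetical names, and the generator and evaluator are passed in as plain callables so the example runs without a model.

```python
from typing import Callable


def build_context(memory: list[str], feedback: str) -> str:
    """Format earlier attempts and the latest feedback, mirroring loop()."""
    return "\n".join([
        "Previous attempts:",
        *[f"- {m}" for m in memory],
        f"\nFeedback: {feedback}",
    ])


def bounded_loop(
    task: str,
    generate_fn: Callable[[str, str], str],
    evaluate_fn: Callable[[str, str], tuple[str, str]],
    max_iterations: int = 5,
) -> tuple[str, bool]:
    """Run generate/evaluate rounds, stopping at PASS or after max_iterations.

    Returns the last result and whether the evaluator passed it.
    """
    memory: list[str] = []
    context = ""
    result = ""
    for _ in range(max_iterations):
        result = generate_fn(task, context)
        memory.append(result)
        evaluation, feedback = evaluate_fn(result, task)
        if evaluation.strip() == "PASS":
            return result, True
        context = build_context(memory, feedback)
    return result, False


# Stub generator and evaluator standing in for the LLM calls (illustration only).
def fake_generate(task: str, context: str) -> str:
    return "draft with fix" if "Feedback:" in context else "draft"


def fake_evaluate(content: str, task: str) -> tuple[str, str]:
    if "fix" in content:
        return "PASS", ""
    return "NEEDS_IMPROVEMENT", "add the fix"


final, passed = bounded_loop("demo task", fake_generate, fake_evaluate)
print(final, passed)  # draft with fix True
```

Returning a `passed` flag alongside the result lets the caller tell an accepted answer apart from one that merely ran out of attempts.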

## Example 1: Iterative Code Generation

In [3]:
evaluator_prompt = """
Evaluate the following code implementation for:
1. code correctness
2. time complexity
3. style and best practices

You should be evaluating only and not attempting to solve the task.
Only output "PASS" if all criteria are met and you have no further suggestions for improvements.
Output your evaluation concisely in the following format.

<evaluation>PASS, NEEDS_IMPROVEMENT, or FAIL</evaluation>
<feedback>
What needs improvement and why.
</feedback>
"""

generator_prompt = """
Your goal is to complete the task based on <user input>. If there is feedback
from your previous generations, reflect on it to improve your solution.

Output your answer concisely in the following format: 

<thoughts>
[Your understanding of the task and feedback and how you plan to improve]
</thoughts>

<response>
[Your code implementation here]
</response>
"""

task = """
<user input>
Implement a Stack with:
1. push(x)
2. pop()
3. getMin()
All operations should be O(1).
</user input>
"""

loop(task, evaluator_prompt, generator_prompt)
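
For orientation when reading the generated solutions, a common way to meet the O(1) requirement is to keep a second stack that tracks the running minimum. The sketch below is a hand-written reference under that assumption, not output from the model; raising `IndexError` on an empty stack is one design choice among several.

```python
class MinStack:
    """Stack with O(1) push, pop, and getMin using an auxiliary min stack."""

    def __init__(self) -> None:
        self.stack: list[int] = []
        self.min_stack: list[int] = []  # top is always the current minimum

    def push(self, x: int) -> None:
        self.stack.append(x)
        # Record x when it is a new minimum (<= keeps duplicates in sync).
        if not self.min_stack or x <= self.min_stack[-1]:
            self.min_stack.append(x)

    def pop(self) -> int:
        if not self.stack:
            raise IndexError("pop from empty stack")
        value = self.stack.pop()
        if value == self.min_stack[-1]:
            self.min_stack.pop()
        return value

    def getMin(self) -> int:
        if not self.min_stack:
            raise IndexError("getMin from empty stack")
        return self.min_stack[-1]


s = MinStack()
for x in (3, 1, 1, 2):
    s.push(x)
print(s.getMin())  # 1
s.pop()            # removes 2
s.pop()            # removes one of the 1s
print(s.getMin())  # 1
s.pop()            # removes the other 1
print(s.getMin())  # 3
```

Pushing duplicates of the minimum onto `min_stack` (the `<=` comparison) is what keeps `getMin` correct after one of two equal minimums is popped.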


=== GENERATION START ===
Thoughts:

The task is to implement a stack with three operations: push, pop, and getMin, all of which should operate in O(1) time complexity. To achieve this, we need to maintain an additional stack to keep track of the minimum values. When pushing a new element, we compare it with the current minimum and push the minimum value onto the auxiliary stack. When popping, we also pop from the auxiliary stack to maintain consistency. The getMin operation simply returns the top of the auxiliary stack.


Generated:

```python
class MinStack:
    def __init__(self):
        self.stack = []
        self.min_stack = []

    def push(self, x: int) -> None:
        self.stack.append(x)
        if not self.min_stack or x <= self.min_stack[-1]:
            self.min_stack.append(x)

    def pop(self) -> None:
        if self.stack[-1] == self.min_stack[-1]:
            self.min_stack.pop()
        self.stack.pop()

    def top(self) -> int:
        return self.stack[-1]
```

('\n```python\nclass MinStack:\n    """\n    A stack that supports push, pop, peek, and retrieving the minimum element in constant time.\n    """\n\n    def __init__(self):\n        """\n        Initialize the MinStack.\n        """\n        self.stack = []\n        self.min_stack = []\n\n    def push(self, x: int) -> None:\n        """\n        Push element x onto stack.\n        """\n        self.stack.append(x)\n        if not self.min_stack or x <= self.min_stack[-1]:\n            self.min_stack.append(x)\n\n    def pop(self) -> int:\n        """\n        Remove the top element from the stack and return it.\n        """\n        if not self.stack:\n            raise IndexError("pop from empty stack")\n        if self.stack[-1] == self.min_stack[-1]:\n            self.min_stack.pop()\n        return self.stack.pop()\n\n    def peek(self) -> int:\n        """\n        Get the top element.\n        """\n        if not self.stack:\n            raise IndexError("peek from empty stack")\

## Example 2: Content Writing and Review

In [6]:
content_evaluator_prompt = """
Evaluate this content for:
1. Clarity and readability
2. Technical accuracy
3. Engagement and flow
4. Grammar and style

Only output "PASS" if all criteria are met and you have no further suggestions for improvements.
Output your evaluation concisely in the following format.

<evaluation>PASS, NEEDS_IMPROVEMENT, or FAIL</evaluation>
<feedback>
What needs improvement and why.
</feedback>
"""

content_generator_prompt = """
Your goal is to write technical content based on the task. If there is feedback
from your previous attempts, incorporate it to improve your writing.

Output your answer in the following format:

<thoughts>
[Your understanding of the task and feedback and how you plan to improve]
</thoughts>

<response>
[Your content here]
</response>
"""

content_task = """
Write a technical blog post explaining how the evaluator-optimizer workflow works
and its benefits for AI development. Include concrete examples and best practices.
Keep it under 500 words.
"""

loop(content_task, content_evaluator_prompt, content_generator_prompt)


=== GENERATION START ===
Thoughts:

In previous attempts, the feedback suggested that the content was too dense and lacked clear examples. To improve, I will break down the explanation into simpler segments, use concrete examples to illustrate key points, and include best practices for implementing the evaluator-optimizer workflow in AI development. I will also ensure the content is concise and under 500 words.


Generated:

### Understanding the Evaluator-Optimizer Workflow in AI Development

The evaluator-optimizer workflow is a critical process in AI development that enhances model performance through iterative improvement. This workflow involves two main components: the evaluator and the optimizer.

#### How It Works

1. **Evaluation Phase**:
   - The evaluator assesses the current performance of the AI model using predefined metrics (e.g., accuracy, precision, recall).
   - Example: In a image classification task, the evaluator might use validation data to calculate the model’s a

('\n### Understanding the Evaluator-Optimizer Workflow in AI Development\n\nThe evaluator-optimizer workflow is a critical process in AI development that enhances model performance through iterative improvement. This workflow involves two main components: the evaluator and the optimizer.\n\n#### How It Works\n\n1. **Evaluation Phase**:\n   - The evaluator assesses the current performance of the AI model using predefined metrics (e.g., accuracy, precision, recall).\n   - Example: In a image classification task, the evaluator might use validation data to calculate the model’s accuracy.\n\n2. **Optimization Phase**:\n   - Based on the evaluation results, the optimizer adjusts the model’s parameters to improve performance.\n   - Example: If the accuracy is low, the optimizer might tweak the learning rate or modify the model architecture.\n\n3. **Iteration**:\n   - The process repeats, with the evaluator continuously assessing performance and the optimizer making adjustments until the model