## Evaluator-Optimizer Workflow
In this workflow, one LLM call generates a response while another provides evaluation and feedback in a loop.

### When to use this workflow
This workflow is particularly effective when we have:

- Clear evaluation criteria
- Value from iterative refinement

The two signs of good fit are:

- LLM responses can be demonstrably improved when feedback is provided
- The LLM can provide meaningful feedback itself

In [1]:
%pip install anthropic


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m25.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.
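
The next cell imports `llm_call` and `extract_xml` from a local `util` module that is not shown in this notebook. As a rough guide to what the tag-extraction helper does, here is a minimal sketch. The name `extract_xml_sketch` and its exact behavior (first match only, whitespace stripped, empty string when the tag is missing) are assumptions for illustration; the real helper may differ, for example by leaving surrounding whitespace in place.

```python
import re


def extract_xml_sketch(text: str, tag: str) -> str:
    """Return the content of the first <tag>...</tag> pair, or "" if absent.

    Hypothetical stand-in for the `extract_xml` helper imported from `util`.
    """
    # re.DOTALL lets "." match newlines, so multi-line tag bodies are captured.
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1).strip() if match else ""


sample = "<evaluation>PASS</evaluation>\n<feedback>\nLooks good.\n</feedback>"
print(extract_xml_sketch(sample, "evaluation"))
print(extract_xml_sketch(sample, "feedback"))
```

Returning an empty string for a missing tag keeps the calling code simple, but it also means a malformed model response is silently treated as "no verdict", so a caller may want to check for that case explicitly.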


In [2]:
from util import llm_call, extract_xml

def generate(prompt: str, task: str, context: str = "") -> tuple[str, str]:
    """Generate and improve a solution based on feedback."""
    full_prompt = f"{prompt}\n{context}\nTask: {task}" if context else f"{prompt}\nTask: {task}"
    response = llm_call(full_prompt)
    thoughts = extract_xml(response, "thoughts")
    result = extract_xml(response, "response")
    
    print("\n=== GENERATION START ===")
    print(f"Generation raw prompt: {full_prompt}\n ---------")
    print(f"Thoughts:\n{thoughts}\n")
    print(f"Generated:\n{result}")
    print("=== GENERATION END ===\n")
    
    return thoughts, result

def evaluate(prompt: str, content: str, task: str) -> tuple[str, str]:
    """Evaluate if a solution meets requirements."""
    full_prompt = f"{prompt}\nOriginal task: {task}\nContent to evaluate: {content}"
    response = llm_call(full_prompt)
    evaluation = extract_xml(response, "evaluation")
    feedback = extract_xml(response, "feedback")
    
    print("=== EVALUATION START ===")
    print(f"Evaluation raw prompt: {full_prompt}\n ---------")
    
    print(f"Status: {evaluation}")
    print(f"Feedback: {feedback}")
    print("=== EVALUATION END ===\n")
    
    return evaluation, feedback

def loop(task: str, evaluator_prompt: str, generator_prompt: str) -> tuple[str, list[dict]]:
    """Keep generating and evaluating until requirements are met."""
    memory = []
    chain_of_thought = []
    
    thoughts, result = generate(generator_prompt, task)
    memory.append(result)
    chain_of_thought.append({"thoughts": thoughts, "result": result})
    
    while True:
        evaluation, feedback = evaluate(evaluator_prompt, result, task)
        # Strip whitespace so a verdict such as "PASS\n" still ends the loop.
        if evaluation.strip() == "PASS":
            return result, chain_of_thought
            
        context = "\n".join([
            "Previous attempts:",
            *[f"- {m}" for m in memory],
            f"\nFeedback: {feedback}"
        ])
        
        thoughts, result = generate(generator_prompt, task, context)
        memory.append(result)
        chain_of_thought.append({"thoughts": thoughts, "result": result})
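
The `loop` function above keeps iterating until the evaluator returns `PASS`, so an evaluator that is never satisfied would make it run (and call the LLM) indefinitely. One possible safeguard is an iteration cap. The sketch below is not part of the original notebook: `bounded_loop`, `max_iterations`, and the toy `toy_generate` / `toy_evaluate` functions are hypothetical names, and the generator and evaluator are passed in as plain callables so the control flow can be tried without any API calls.

```python
from typing import Callable


def bounded_loop(
    task: str,
    generate_fn: Callable[[str, str], str],
    evaluate_fn: Callable[[str, str], tuple[str, str]],
    max_iterations: int = 5,
) -> tuple[str, bool, int]:
    """Run generate/evaluate rounds until PASS or until the cap is reached.

    Returns (last_result, passed, rounds_used).
    """
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")

    attempts: list[str] = []
    context = ""
    for round_number in range(1, max_iterations + 1):
        result = generate_fn(task, context)
        attempts.append(result)

        status, feedback = evaluate_fn(result, task)
        if status.strip() == "PASS":
            return result, True, round_number

        # Same context layout as in loop(): earlier attempts plus latest feedback.
        context = "\n".join([
            "Previous attempts:",
            *[f"- {a}" for a in attempts],
            f"\nFeedback: {feedback}",
        ])

    # Cap reached without a PASS: hand back the most recent attempt.
    return attempts[-1], False, max_iterations


# Toy stand-ins for the LLM-backed generate() and evaluate().
def toy_generate(task: str, context: str) -> str:
    return "draft v2" if "Feedback" in context else "draft v1"


def toy_evaluate(content: str, task: str) -> tuple[str, str]:
    if content == "draft v2":
        return "PASS", ""
    return "NEEDS_IMPROVEMENT", "Add error handling."


final, passed, rounds = bounded_loop("demo task", toy_generate, toy_evaluate)
print(final, passed, rounds)
```

Returning a `passed` flag alongside the result lets the caller distinguish a solution the evaluator accepted from one that was simply the last attempt before the cap.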

### Example Use Case: Iterative coding loop



In [3]:
evaluator_prompt = """
Evaluate the following code implementation for:
1. code correctness
2. time complexity
3. style and best practices

You should be evaluating only and not attempting to solve the task.
Only output "PASS" if all criteria are met and you have no further suggestions for improvements.
Output your evaluation concisely in the following format.

<evaluation>PASS, NEEDS_IMPROVEMENT, or FAIL</evaluation>
<feedback>
What needs improvement and why.
</feedback>
"""

generator_prompt = """
Your goal is to complete the task based on <user input>. If there are feedback 
from your previous generations, you should reflect on them to improve your solution

Output your answer concisely in the following format: 

<thoughts>
[Your understanding of the task and feedback and how you plan to improve]
</thoughts>

<response>
[Your code implementation here]
</response>
"""

task = """
<user input>
Implement a Stack with:
1. push(x)
2. pop()
3. getMin()
All operations should be O(1).
</user input>
"""

loop(task, evaluator_prompt, generator_prompt)



=== GENERATION START ===
Generation raw prompt: 
Your goal is to complete the task based on <user input>. If there are feedback 
from your previous generations, you should reflect on them to improve your solution

Output your answer concisely in the following format: 

<thoughts>
[Your understanding of the task and feedback and how you plan to improve]
</thoughts>

<response>
[Your code implementation here]
</response>

Task: 
<user input>
Implement a Stack with:
1. push(x)
2. pop()
3. getMin()
All operations should be O(1).
</user input>

 ---------
Thoughts:

The task requires implementing a Stack with constant time operations including finding minimum. 
To achieve O(1) getMin(), we need to track minimum values alongside regular stack operations.
I'll use two stacks - one for actual values and another for tracking minimums at each step.


Generated:

```python
class MinStack:
    def __init__(self):
        self.stack = []
        self.min_stack = []
        
    def push(self, x: i

('\n```python\nclass MinStack:\n    """A stack that supports push, pop, and getting minimum element in O(1) time."""\n    \n    def __init__(self):\n        """Initialize empty stack with two internal stacks."""\n        self._stack = []  # main stack\n        self._min_stack = []  # auxiliary stack for tracking minimums\n        \n    def push(self, x: int) -> None:\n        """\n        Push element onto stack and update minimum if necessary.\n        Args:\n            x: Integer to push onto stack\n        """\n        if not isinstance(x, (int, float)):\n            raise TypeError("Value must be a number")\n            \n        self._stack.append(x)\n        if not self._min_stack or x <= self._min_stack[-1]:\n            self._min_stack.append(x)\n            \n    def pop(self) -> int:\n        """\n        Remove and return top element from stack.\n        Returns:\n            The popped element\n        Raises:\n            IndexError: If stack is empty\n        """\n      