# Evaluator-Optimizer Workflow

In this notebook, we'll explore the evaluator-optimizer pattern, a feedback-loop approach for AI systems.
Using Kotlin and Claude via LangChain4j,
we'll implement a practical example in which one model call generates a solution and a second call evaluates it, repeating until the solution passes.

## What is the evaluator-optimizer pattern?

The evaluator-optimizer pattern creates a feedback loop where:
1. One LLM generates a response (the optimizer)
2. Another LLM evaluates that response against specific criteria (the evaluator)
3. The feedback is used to create an improved response
4. This cycle continues until the response meets quality standards

![Evaluator-Optimizer Workflow Diagram](image/evaluator_optimizer.svg)

This approach mimics how humans refine their work through iteration and feedback,
creating a system that can self-improve over multiple attempts.
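
The four steps above can be sketched without any LLM at all. The following is a minimal, self-contained illustration, not the implementation built later in this notebook: `refine`, `Verdict`, and the `maxIterations` cap are hypothetical names introduced here, and the generator and evaluator are plain functions standing in for model calls.

```kotlin
// Hypothetical verdict type for this sketch; the notebook defines its own types later.
data class Verdict(val pass: Boolean, val feedback: String)

/**
 * Runs a generate -> evaluate cycle until the evaluator passes the draft
 * or maxIterations drafts have been produced. Returns the last draft.
 */
fun refine(
    generate: (feedback: String?) -> String,
    evaluate: (draft: String) -> Verdict,
    maxIterations: Int = 5
): String {
    require(maxIterations >= 1) { "maxIterations must be at least 1" }
    var draft = generate(null)                 // first attempt, no feedback yet
    repeat(maxIterations - 1) {
        val verdict = evaluate(draft)
        if (verdict.pass) return draft         // quality bar met, stop early
        draft = generate(verdict.feedback)     // retry with the evaluator's feedback
    }
    return draft                               // cap reached: return the latest draft
}

fun main() {
    // Stub generator: appends "!" each time it receives feedback.
    var current = "hello"
    val result = refine(
        generate = { feedback -> if (feedback != null) current += "!"; current },
        evaluate = { draft -> Verdict(draft.endsWith("!!"), "add more emphasis") }
    )
    println(result) // prints "hello!!"
}
```

The cap is a design choice of this sketch: it trades a guaranteed stop for the possibility of returning a draft the evaluator never approved.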

### When to use this pattern

- Code generation with specific quality requirements
- Content creation that needs to meet style guidelines
- Complex reasoning tasks that benefit from multiple passes
- Translation work that requires nuanced understanding

## Setting up environment

Let's set up the Kotlin notebook with the necessary dependencies:

In [1]:
%useLatestDescriptors
%use langchain4j(model = anthropic)

## Defining data models

First, create the data structures to represent generator and evaluator responses:

In [2]:
import com.fasterxml.jackson.annotation.JsonCreator
import com.fasterxml.jackson.annotation.JsonProperty

data class GeneratorResponse @JsonCreator constructor(
    @JsonProperty("thoughts") val thoughts: String,
    @JsonProperty("result") val result: String
)

enum class EvalType {
    PASS, NEEDS_IMPROVEMENT, FAIL;
}

data class EvaluatorResponse @JsonCreator constructor(
    @JsonProperty("evaluation") val evaluation: EvalType,
    @JsonProperty("feedback") val feedback: String
)

interface EvalOptimizerLlm {
    fun llmGenerate(input: String): GeneratorResponse
    fun llmEvaluate(input: String): EvaluatorResponse
}

These classes help structure the communication between our generator and evaluator,
ensuring we can easily track the thought process and feedback at each iteration.
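
Because `EvalOptimizerLlm` is a plain interface, it can also be implemented by hand, which is one way to exercise the loop logic without network calls. The sketch below redeclares simplified copies of the types (without the Jackson annotations) so that it runs standalone; `ScriptedLlm` is a hypothetical test double, not part of LangChain4j.

```kotlin
// Simplified copies of the notebook's types so this sketch runs on its own.
data class GeneratorResponse(val thoughts: String, val result: String)
enum class EvalType { PASS, NEEDS_IMPROVEMENT, FAIL }
data class EvaluatorResponse(val evaluation: EvalType, val feedback: String)

interface EvalOptimizerLlm {
    fun llmGenerate(input: String): GeneratorResponse
    fun llmEvaluate(input: String): EvaluatorResponse
}

// Hypothetical test double: replays canned responses in order instead of calling a model.
class ScriptedLlm(
    generations: List<GeneratorResponse>,
    evaluations: List<EvaluatorResponse>
) : EvalOptimizerLlm {
    private val generationQueue = generations.iterator()
    private val evaluationQueue = evaluations.iterator()

    override fun llmGenerate(input: String): GeneratorResponse = generationQueue.next()
    override fun llmEvaluate(input: String): EvaluatorResponse = evaluationQueue.next()
}

fun main() {
    val llm: EvalOptimizerLlm = ScriptedLlm(
        generations = listOf(
            GeneratorResponse("first try", "v1"),
            GeneratorResponse("applied feedback", "v2")
        ),
        evaluations = listOf(
            EvaluatorResponse(EvalType.NEEDS_IMPROVEMENT, "handle the empty case"),
            EvaluatorResponse(EvalType.PASS, "looks good")
        )
    )

    // Data classes support destructuring, which the notebook's helper functions rely on.
    val (thoughts, result) = llm.llmGenerate("any prompt")
    val (evaluation, feedback) = llm.llmEvaluate("any prompt")
    println("$thoughts -> $result: $evaluation ($feedback)")
}
```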

## Setting up LLM

Next, configure our LLM client using Claude:

In [3]:
val model = AnthropicChatModel.builder()
    .apiKey(System.getenv("ANTHROPIC_API_KEY"))
    .modelName(AnthropicChatModelName.CLAUDE_3_7_SONNET_20250219)
    .maxTokens(4096)
    .temperature(0.1)
    .build()


val llm = AiServices.create(EvalOptimizerLlm::class.java, model)

## Generation function

Our generator creates the initial solution and subsequent refined versions:

In [4]:
fun generate(prompt: String, task: String, context: String = ""): Pair<String, String> {
    val fullPrompt = if (context.isNotEmpty())
        "$prompt\n$context\nTask:\n$task"
    else
        "$prompt\nTask:\n$task"

    val response = llm.llmGenerate(fullPrompt)
    val (thoughts, result) = response

    println(
"""
=== GENERATION START ===
Thoughts:
$thoughts
Result:
$result
=== GENERATION END ===
""".trimIndent()
    )

    return thoughts to result
}
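
To see exactly what text reaches the model, the prompt assembly can be pulled out into a pure function. `buildGeneratorPrompt` is a hypothetical helper that mirrors the `if`/`else` in `generate`; it is a sketch for inspection, with made-up prompt and task strings.

```kotlin
// Hypothetical helper mirroring the prompt assembly inside generate().
fun buildGeneratorPrompt(prompt: String, task: String, context: String = ""): String =
    if (context.isNotEmpty()) "$prompt\n$context\nTask:\n$task"
    else "$prompt\nTask:\n$task"

fun main() {
    // First call: no context yet, so the prompt holds only the instructions and the task.
    println(buildGeneratorPrompt("Solve the task.", "Reverse a string."))
    println("---")
    // Later calls: earlier attempts and feedback sit between the instructions and the task.
    println(
        buildGeneratorPrompt(
            prompt = "Solve the task.",
            task = "Reverse a string.",
            context = "Previous attempts:\ns.reversed()\nFeedback: handle null input"
        )
    )
}
```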

## Evaluator function

The evaluator analyzes the current solution against our criteria:

In [5]:
fun evaluate(prompt: String, content: String, task: String): Pair<EvalType, String> {
    val fullPrompt = "$prompt\nOriginal task: $task\nContent to evaluate: $content"

    val response = llm.llmEvaluate(fullPrompt)
    val (evaluation, feedback) = response

    println(
        """
=== EVALUATION START ===
Status:
$evaluation
Feedback:
$feedback
=== EVALUATION END ===
""".trimIndent()
    )

    return evaluation to feedback
}

## Feedback loop

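Now implement the core loop that drives the iterative improvement process. The loop runs until the evaluator returns `PASS`; there is no iteration cap, so consider adding one before using it with a strict evaluator: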

In [6]:
fun loop(task: String, evaluatorPrompt: String, generatorPrompt: String): Pair<String, List<Pair<String, String>>> {
    val memory = mutableListOf<String>()
    val chainOfThought = mutableListOf<Pair<String, String>>()

    var (thoughts, result) = generate(generatorPrompt, task)
    memory.add(result)
    chainOfThought.add(thoughts to result)

    while (true) {
        val (evaluation, feedback) = evaluate(evaluatorPrompt, result, task)
        if (evaluation == EvalType.PASS) {
            return result to chainOfThought
        }

        val context = memory.joinToString(
            prefix = "Previous attempts:\n",
            postfix = "\nFeedback: $feedback",
            separator = "\n"
        )

        val generatorResponse = generate(generatorPrompt, task, context)
        thoughts = generatorResponse.first
        result = generatorResponse.second
        memory.add(result)
        chainOfThought.add(thoughts to result)
    }
}
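
The `context` string is the only state carried from one iteration to the next, so it helps to see what it looks like. The snippet below is a standalone illustration with made-up attempts and feedback; `buildContext` is a hypothetical wrapper around the same `joinToString` call used in `loop`.

```kotlin
// Hypothetical wrapper around the joinToString call used in loop().
fun buildContext(memory: List<String>, feedback: String): String =
    memory.joinToString(
        prefix = "Previous attempts:\n",
        postfix = "\nFeedback: $feedback",
        separator = "\n"
    )

fun main() {
    val memory = listOf("attempt 1", "attempt 2")   // made-up results
    println(buildContext(memory, "pop() should not return null"))
    // Prints:
    // Previous attempts:
    // attempt 1
    // attempt 2
    // Feedback: pop() should not return null
}
```

Only the most recent feedback is included, while every earlier attempt is kept, so the prompt grows with each iteration.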

## Setting up prompts

Now let's define the specific prompts for the code generation task:

In [7]:
val evaluatorPrompt =
    """
    Evaluate the following code implementation for:
    1. code correctness
    2. time complexity
    3. style and best practices

    You should be evaluating only and not attempting to solve the task.
    Only output "PASS" if all criteria are met and you have no further suggestions for improvements.
    Output your evaluation concisely in the following JSON format.

    ```json
    {
        "evaluation": "PASS, NEEDS_IMPROVEMENT, or FAIL",
        "feedback": "What needs improvement and why."
    }
    ```
    """.trimIndent()

val generatorPrompt =
    """
    Your goal is to complete the task based on <user input>. If there is feedback
    from your previous generations, you should reflect on it to improve your solution.

    Output your answer concisely in the following JSON format:

    ```json
    {
        "thoughts": "Your understanding of the task and feedback and how you plan to improve",
        "result": "Your code implementation here"
    }
    ```
    """.trimIndent()

val task =
    """
    <user input>
    Implement a Stack in Kotlin with:
    1. push(x)
    2. pop()
    3. getMin()
    All operations should be O(1).
    </user input>
    """.trimIndent()

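For reference, the O(1) requirement in this task is commonly met with an auxiliary stack of minima. The class below is a hand-written sketch of that technique, not output from the model; it assumes that throwing on an empty stack is acceptable, and its operations are O(1) with `push` amortized because `ArrayDeque` occasionally resizes.

```kotlin
// A hand-written sketch of a min-stack, not the model's output.
class MinStack<T : Comparable<T>> {
    private val mainStack = ArrayDeque<T>()
    private val minStack = ArrayDeque<T>()   // top is always the current minimum

    fun push(x: T) {
        mainStack.addLast(x)
        // Push duplicates of the minimum too, so pop() can remove them one at a time.
        if (minStack.isEmpty() || x <= minStack.last()) minStack.addLast(x)
    }

    fun pop(): T {
        val popped = mainStack.removeLast()  // throws NoSuchElementException when empty
        if (popped.compareTo(minStack.last()) == 0) minStack.removeLast()
        return popped
    }

    fun getMin(): T = minStack.last()        // throws NoSuchElementException when empty
}

fun main() {
    val stack = MinStack<Int>()
    stack.push(3); stack.push(1); stack.push(2)
    println(stack.getMin()) // prints 1
    stack.pop(); stack.pop()
    println(stack.getMin()) // prints 3
}
```
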

Finally, let's execute the evaluator-optimizer loop:

In [8]:
loop(task, evaluatorPrompt, generatorPrompt)

=== GENERATION START ===
Thoughts:
I need to implement a Stack in Kotlin with push, pop, and getMin operations, all with O(1) time complexity. For push and pop, a standard stack implementation will work. For getMin with O(1), I'll need to maintain a second stack that keeps track of the minimum values. Each time we push a value, we'll compare it with the current minimum and push the smaller one to the minStack. When we pop, we'll also pop from the minStack.
Result:
class MinStack<T : Comparable<T>> {
    private val mainStack = mutableListOf<T>()
    private val minStack = mutableListOf<T>()
    
    fun push(x: T) {
        mainStack.add(x)
        
        // If minStack is empty or x is smaller than current min, add x to minStack
        if (minStack.isEmpty() || x <= minStack.last()) {
            minStack.add(x)
        }
    }
    
    fun pop(): T? {
        if (mainStack.isEmpty()) return null
        
        val popped = mainStack.removeAt(mainStack.size - 1)
        
        

(/**
 * A stack implementation that supports constant time minimum value retrieval.
 * @param T The type of elements stored in the stack, must be comparable.
 */
class MinStack<T : Comparable<T>> {
    private val mainStack = ArrayDeque<T>()
    private val minStack = ArrayDeque<T>()
    
    /**
     * Pushes an element onto the stack.
     * @param x The element to push
     */
    fun push(x: T) {
        mainStack.addLast(x)
        
        // Only add to minStack if it's empty or x is less than or equal to current min
        if (minStack.isEmpty() || x.compareTo(minStack.last()) <= 0) {
            minStack.addLast(x)
        }
    }
    
    /**
     * Removes and returns the top element from the stack.
     * @return The top element
     * @throws NoSuchElementException if the stack is empty
     */
    fun pop(): T {
        if (mainStack.isEmpty()) throw NoSuchElementException("Cannot pop from an empty stack")
        
        val popped = mainStack.removeLast()
        
   

## How it works

1. The generator creates a first attempt at solving the problem
2. The evaluator reviews the solution against our criteria
3. If improvements are needed, the feedback is incorporated into the next generation attempt
4. Previous attempts and feedback are retained to inform future improvements
5. The process continues until a satisfactory solution is found


## Conclusion

The evaluator-optimizer pattern gives us a powerful approach to creating AI systems that can self-improve through structured feedback.
It's particularly valuable for tasks where quality criteria are clear and iteration significantly improves outcomes.

By implementing this pattern in Kotlin, we can create flexible,
maintainable AI agents that produce higher quality results through methodical improvement.
The pattern mimics human processes of drafting and revision,
which often leads to more refined outputs than one-shot generation.

This approach bridges the gap between AI capabilities and human quality standards,
creating a system that can adapt and improve based on specific feedback — just like we do.