
# 📊 Prompt Optimization with Evidently: Code Review Quality Classifier

This tutorial shows how to use Evidently's new `PromptOptimizer` API to optimize the prompt of an LLM judge.
We'll walk through optimizing a prompt that classifies the quality of code reviews written for junior developers, using expert labels as the ground truth.

---

## ✅ What you'll learn:
- How to set up a dataset for LLM evaluation
- How to define an LLM judge with a prompt template
- How to run the prompt optimization loop
- How to retrieve and inspect the best-performing prompt


In [None]:
# If you haven't installed the required packages yet:
# !pip install evidently openai pandas

In [None]:
import pandas as pd

from evidently import Dataset, DataDefinition, LLMClassification
from evidently.llm.templates import BinaryClassificationPromptTemplate
from evidently.descriptors import LLMEval
from evidently.llm.optimization import PromptOptimizer

In [None]:
# Load your dataset
review_dataset = pd.read_csv("../datasets/code_review.csv")
review_dataset.head()

In [None]:
# Define how Evidently should interpret your dataset
dd = DataDefinition(
    text_columns=["Generated review", "Expert comment"],
    categorical_columns=["Expert label"],
    llm=LLMClassification(input="Generated review", target="Expert label", reasoning="Expert comment")
)
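
The data definition above refers to three columns by name, so a typo or a renamed header in the CSV would only surface later. As a minimal sketch, the check below reads just the header row with the standard library and reports any missing names. The helper `missing_columns` and the in-memory `sample_csv` text are hypothetical stand-ins for illustration; they are not part of Evidently or of the real `code_review.csv`.

```python
import csv
import io

# Column names used by the DataDefinition above.
REQUIRED_COLUMNS = ["Generated review", "Expert label", "Expert comment"]


def missing_columns(header, required):
    """Return the required column names that do not appear in the header."""
    present = set(header)
    return [name for name in required if name not in present]


# Hypothetical CSV text standing in for code_review.csv.
sample_csv = io.StringIO(
    "Generated review,Expert label,Expert comment\n"
    "Consider extracting this loop into a helper.,good,Specific and actionable.\n"
)
header = next(csv.reader(sample_csv))

print(missing_columns(header, REQUIRED_COLUMNS))
print(missing_columns(["Generated review"], REQUIRED_COLUMNS))
```

With a real file, the same helper could be applied to `list(review_dataset.columns)` after loading the DataFrame.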

In [None]:
# Convert your pandas DataFrame into an Evidently Dataset
dataset = Dataset.from_pandas(review_dataset, data_definition=dd)

In [None]:
# Define a prompt template and judge for classifying code review quality
criteria = '''A review is GOOD when it's actionable and constructive.
A review is BAD when it is non-actionable or overly critical.'''

feedback_quality = BinaryClassificationPromptTemplate(
    pre_messages=[("system", "You are evaluating the quality of code reviews given to junior developers.")],
    criteria=criteria,
    target_category="bad",
    non_target_category="good",
    uncertainty="unknown",
    include_reasoning=True,
)

judge = LLMEval(
    alias="Code Review Judge",
    provider="openai",
    model="gpt-4o-mini",
    column_name="Generated review",
    template=feedback_quality
)
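
The template above makes the judge answer with `bad` or `good` (or `unknown` when it is unsure), and the optimizer compares those answers with the `Expert label` column. The sketch below assumes, without having verified it against Evidently, that labels are matched as exact strings, so it flags any expert label that is not literally `bad` or `good`. The helper `unexpected_labels` and the sample values are hypothetical.

```python
from collections import Counter

# Categories the expert labels are expected to use, taken from the template above.
EXPECTED_LABELS = {"bad", "good"}


def unexpected_labels(labels, allowed):
    """Count labels that are not an exact match for any allowed value."""
    return Counter(label for label in labels if label not in allowed)


# Hypothetical expert labels; real values come from the "Expert label" column.
expert_labels = ["good", "bad", "Good", "needs work"]

print(unexpected_labels(expert_labels, EXPECTED_LABELS))
```

An empty counter suggests the label vocabulary lines up with the template; anything else is worth cleaning up before running the optimizer.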

In [None]:
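# Initialize the optimizer and run it with the "feedback" strategy,
# scoring each candidate prompt by accuracy against the expert labels
optimizer = PromptOptimizer("code_review_example", strategy="feedback")
optimizer.set_input_dataset(dataset)
# Top-level `await` works in a notebook cell
await optimizer.arun(judge, "accuracy")
# In a plain script, use the synchronous version instead:
# optimizer.run(judge, "accuracy")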
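
The optimizer call above scores prompts by `"accuracy"`. Assuming this has its usual meaning (the share of rows where the judge's label equals the expert's label), the computation can be sketched in plain Python. The `accuracy` function and the label lists are hypothetical and for illustration only. This sketch counts an `unknown` answer as a miss, and how Evidently itself treats such answers is not covered here.

```python
def accuracy(predicted, expected):
    """Fraction of positions where the predicted label equals the expected label."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("cannot compute accuracy on an empty dataset")
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


# Hypothetical labels for illustration only.
expert = ["good", "bad", "bad", "good"]
judge_before = ["good", "good", "unknown", "good"]
judge_after = ["good", "bad", "bad", "bad"]

print(accuracy(judge_before, expert))  # 2 of 4 labels match
print(accuracy(judge_after, expert))   # 3 of 4 labels match
```

A prompt revision counts as an improvement under this metric only if it raises the share of matching labels on the same dataset.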

In [None]:
# Show the best-performing prompt template found by the optimizer
print(optimizer.best_prompt())