# Installation and imports

In [None]:
!pip install evidently[llm]

In [None]:
import pandas as pd
import numpy as np

from evidently import Dataset
from evidently import DataDefinition
from evidently import Report
from evidently import BinaryClassification
from evidently.descriptors import *
from evidently.presets import TextEvals, ValueStats, ClassificationPreset, DataSummaryPreset
from evidently.metrics import *

from evidently.llm.templates import BinaryClassificationPromptTemplate

from evidently.sdk.models import PanelMetric
from evidently.sdk.panels import DashboardPanelPlot

from evidently.ui.workspace import CloudWorkspace

In this tutorial, we will:
- Define the evaluation criteria for our LLM judge
- Build an LLM-as-a-Judge using different prompts/models
- Evaluate the quality of the judge by comparing its results to human labels

# (Optional) Set up Evidently Cloud

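Set up API keys for the LLM judges. Uncomment the lines for the providers you plan to use and replace the placeholders with your own keys: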

In [None]:
## import os
## os.environ["OPENAI_API_KEY"] = "OPEN_AI_API_KEY"
## os.environ["ANTHROPIC_API_KEY"] = "ANTHROPIC_API_KEY"
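
Before running a judge, it can help to confirm that the key for the chosen provider is actually set. Here is a minimal sketch using only the standard library. The helper name `missing_keys` is hypothetical and not part of Evidently:

```python
import os

def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Hypothetical usage: list only the providers you plan to call.
missing = missing_keys(["OPENAI_API_KEY"])
if missing:
    print("Set these environment variables first:", ", ".join(missing))
```

Checking up front gives a clearer message than an authentication error raised partway through an evaluation run.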

**Optional**. Connect to Cloud and create a Project:

In [None]:
# ws = CloudWorkspace(token="YOUR_API_TOKEN", url="https://app.evidently.cloud")

In [None]:
#project = ws.create_project("My project name", org_id="YOUR_ORG_ID")
#project.description = "My project description"

# or project = ws.get_project("PROJECT_ID")

# Prepare the dataset

We start with an expert-labeled dataset. We will use it as the ground truth for our LLM judge.

In [None]:
url = "https://raw.githubusercontent.com/evidentlyai/community-examples/main/datasets/code_review_dataset.csv"
review_dataset = pd.read_csv(url)

Preview:

In [None]:
pd.set_option('display.max_colwidth', None)
review_dataset.head(10)

Create an Evidently dataset:

In [None]:
definition = DataDefinition(
    text_columns=["Generated review", "Expert comment"],
    categorical_columns=["Expert label"]
    )

In [None]:
eval_dataset = Dataset.from_pandas(
    pd.DataFrame(review_dataset),
    data_definition=definition)

Preview the distribution of classes:

In [None]:
report = Report([
  ValueStats(column="Expert label")
])

my_eval = report.run(eval_dataset)
my_eval

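**Optional**. Upload the source dataset to Evidently Cloud. This cell needs the `ws` and `project` objects from the Cloud setup above, so skip it if you did not connect.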

In [None]:
ws.add_dataset(
    dataset = eval_dataset,
    name = "source_dataset",
    project_id = project.id,
    description = "Dataset with expert labels on review quality")

# Our goal: create an LLM judge that matches the human labels

**Options**:
- Split the criteria into separate judges (actionable / non-actionable, appropriate tone / inappropriate tone).
- Try to create a single good/bad judge. (It may be useful to introduce a borderline or "needs review" tag for subtle or new cases.)
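
To picture the first option: you could run two narrower judges (one for actionability, one for tone) and combine their verdicts into a single label. The sketch below is plain Python with hypothetical label names and a hypothetical `combine_verdicts` helper. It is not an Evidently API, and the combination rule is an assumption you would adapt to your own labeling guidelines.

```python
def combine_verdicts(actionable, tone):
    """Combine two per-criterion verdicts into one review label.

    actionable: "actionable", "non-actionable", or "unknown"
    tone:       "appropriate", "inappropriate", or "unknown"
    """
    # A clear failure on either criterion makes the review bad.
    if actionable == "non-actionable" or tone == "inappropriate":
        return "bad"
    # Both criteria clearly pass.
    if actionable == "actionable" and tone == "appropriate":
        return "good"
    # Anything else (for example, an uncertain verdict) goes to a human.
    return "needs review"

print(combine_verdicts("actionable", "appropriate"))
print(combine_verdicts("actionable", "unknown"))
```

The rest of this tutorial takes the second option and builds a single good/bad judge.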

# Exp 1. Design the LLM judge - First try

To keep the tutorial easy to follow, we'll spell out every step and run five experiments one after another.

First attempt to create the judge:

In [None]:
# 1. Name the experiment
name = "naive_prompt"

# 2. Define LLM judge prompt template
feedback_quality = BinaryClassificationPromptTemplate(
        pre_messages=[("system", "You are evaluating the quality of code reviews given to junior developers.")],
        criteria = """A review is GOOD when it's actionable and constructive.
        A review is BAD when it is non-actionable or overly critical.
        """,
        target_category="bad",
        non_target_category="good",
        uncertainty="unknown",
        include_reasoning=True,
        )

# 3. Apply the LLM judge
eval_dataset = Dataset.from_pandas(
    pd.DataFrame(review_dataset),
    data_definition=definition,
    descriptors=[
        LLMEval("Generated review",
                template=feedback_quality,  # We can pass new prompt version
                provider="openai",          # We can change the provider
                model="gpt-4o-mini",        # We can change the model
                alias="LLM-judged quality")
    ]
)

# 4. Add TRUE/FALSE for judge alignment
eval_dataset.add_descriptors([
    ExactMatch(columns=["LLM-judged quality", "Expert label"], alias="Judge_alignment")
])

In [None]:
#print(feedback_quality.get_template())

**LLM judgments**. View all results locally:

In [None]:
eval_dataset.as_dataframe()
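
When reading the results, the rows where the judge and the expert disagree are usually the most informative. The sketch below uses plain Python on a handful of made-up rows. The column names come from this notebook, the data is hypothetical, and the helper names are illustrative. With real results you could obtain the same row structure via `eval_dataset.as_dataframe().to_dict("records")`.

```python
from collections import Counter

# Hypothetical rows standing in for the real evaluation results.
rows = [
    {"Expert label": "bad", "LLM-judged quality": "bad"},
    {"Expert label": "good", "LLM-judged quality": "bad"},
    {"Expert label": "bad", "LLM-judged quality": "good"},
    {"Expert label": "good", "LLM-judged quality": "good"},
    {"Expert label": "good", "LLM-judged quality": "UNKNOWN"},
]

def disagreements(rows, expert_col="Expert label", judge_col="LLM-judged quality"):
    """Return the rows where the judge label differs from the expert label."""
    return [r for r in rows if r[judge_col] != r[expert_col]]

def disagreement_counts(rows, expert_col="Expert label", judge_col="LLM-judged quality"):
    """Count disagreements by (expert label, judge label) pair."""
    return Counter(
        (r[expert_col], r[judge_col])
        for r in disagreements(rows, expert_col, judge_col)
    )

for (expert, judge), n in sorted(disagreement_counts(rows).items()):
    print(f"expert={expert!r} judge={judge!r}: {n}")
```

Grouping by direction shows whether the judge tends to be too strict (expert says good, judge says bad) or too lenient, which helps decide how to revise the prompt.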

**Report**. Let's summarize:

In [None]:
report = Report([
    TextEvals()
])

my_eval = report.run(eval_dataset)
my_eval

**Classification quality**. The function below drops the rows the judge labeled `UNKNOWN`, runs the Classification Report with `bad` as the positive class to evaluate the LLM judge quality, and optionally uploads the result to Evidently Cloud (if the workspace is set) with a tag.


In [None]:
def run_classification_report(eval_dataset, name=None, cloud_ws=None, project_id=None):

    df = eval_dataset.as_dataframe()
    df_filtered = df[df["LLM-judged quality"] != "UNKNOWN"]

    # Set the classification Data Definition
    definition_class = DataDefinition(
        classification=[BinaryClassification(
            target="Expert label",
            prediction_labels="LLM-judged quality",
            pos_label="bad"
        )],
        categorical_columns=["Expert label", "LLM-judged quality"]
    )

    # Create a Dataset object
    eval_data = Dataset.from_pandas(df_filtered, data_definition=definition_class)

    # Build classification report
    report = Report([
        ClassificationPreset(),
        ValueStats("LLM-judged quality"),
        ValueStats("Expert label")
    ])

    # Apply tag(s)
    tags = [name] if name else []

    my_eval = report.run(eval_data, tags=tags)

    # Optional: upload to Evidently Cloud
    if cloud_ws and project_id:
        cloud_ws.add_run(project_id, my_eval, include_data=True)

    return my_eval

(See all Evidently Metrics and Presets: https://docs.evidentlyai.com/metrics/all_metrics)
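
To make the headline metrics concrete, here is a small hand-rolled sketch of accuracy, precision, and recall with `bad` as the positive class and `UNKNOWN` rows excluded, mirroring the setup in the function above. It is a plain-Python illustration on hypothetical labels, not how Evidently computes its report, and `binary_metrics` is an illustrative name.

```python
def binary_metrics(expert, judge, pos_label="bad", skip_label="UNKNOWN"):
    """Accuracy, precision, and recall with `pos_label` as the positive class."""
    # Drop rows where the judge did not commit to a label.
    pairs = [(e, j) for e, j in zip(expert, judge) if j != skip_label]
    tp = sum(1 for e, j in pairs if e == pos_label and j == pos_label)
    fp = sum(1 for e, j in pairs if e != pos_label and j == pos_label)
    fn = sum(1 for e, j in pairs if e == pos_label and j != pos_label)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical labels, for illustration only.
expert = ["bad", "bad", "good", "good", "bad", "good"]
judge = ["bad", "good", "bad", "good", "bad", "UNKNOWN"]
print(binary_metrics(expert, judge))
```

With `bad` as the positive class, recall is the share of expert-labeled bad reviews that the judge caught, and precision is the share of the judge's `bad` verdicts that the experts agreed with.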

Run the function to evaluate the LLM judge quality. If you skipped the Cloud setup, drop the `cloud_ws` and `project_id` arguments:

In [None]:
my_eval = run_classification_report(
    eval_dataset,
    name=name,
    cloud_ws=ws, #Optional
    project_id=project.id #Optional
)

You can also preview the classification report locally:

In [None]:
my_eval

# (Optional) Add dashboard-as-code

This will create a dashboard to track evaluation results over time.

In [None]:
project.dashboard.add_panel(
    DashboardPanelPlot(
        title="LLM judge quality",
        subtitle="Quality of the LLM judge that evaluates reviews compared to human labels.",
        size="full",
        values=[
            PanelMetric(metric="Precision", legend="Precision"),
            PanelMetric(metric="Recall", legend="Recall"),
            PanelMetric(metric="Accuracy", legend="Accuracy"),
        ],
        plot_params={"plot_type": "line"},
    ),
    tab="LLM evals",
)

# Exp 2. Try another prompt

Let's try writing a more detailed prompt:

In [None]:
# 1. Name the experiment <- new name
name = "detailed_prompt"

# 2. Define LLM judge prompt template  <- new prompt
feedback_quality_2 = BinaryClassificationPromptTemplate(
    pre_messages=[("system", "You are evaluating the quality of code reviews given to junior developers.")],
    criteria="""
    A review is **GOOD** if it is actionable and constructive. It should:
    - Offer clear, specific suggestions or highlight issues in a way that the developer can address
    - Be respectful and encourage learning or improvement
    - Use professional, helpful language—even when pointing out problems

    A review is **BAD** if it is non-actionable or overly critical. For example:
    - It may be vague, generic, or hedged to the point of being unhelpful
    - It may focus on praise only, without offering guidance
    - It may sound dismissive, contradictory, harsh, or robotic
    - It may raise a concern but fail to explain what should be done
    """,
    target_category="bad",
    non_target_category="good",
    uncertainty="unknown",
    include_reasoning=True,
)

# 3. Apply the LLM judge
eval_dataset = Dataset.from_pandas(
    pd.DataFrame(review_dataset),
    data_definition=definition,
    descriptors=[
        LLMEval("Generated review",
                template=feedback_quality_2,# We can pass new prompt version <- new prompt
                provider="openai",          # We can change the provider
                model="gpt-4o-mini",        # We can change the model
                alias="LLM-judged quality")
    ]
)

# 4. Add TRUE/FALSE for judge alignment
eval_dataset.add_descriptors([
    ExactMatch(columns=["LLM-judged quality", "Expert label"], alias="Judge_alignment")
])

Evaluate the LLM judge quality:

In [None]:
my_eval = run_classification_report(
    eval_dataset,
    name=name,
    cloud_ws=ws, #Optional
    project_id=project.id #Optional
)

# Exp 3. Can we make it better?

In [None]:
# 1. Name the experiment <- new name
name = "detailed_prompt_think_better"

# 2. Define LLM judge prompt template  <- new prompt
feedback_quality_3 = BinaryClassificationPromptTemplate(
    pre_messages=[
        ("system", "You are evaluating the quality of code reviews given to junior developers.")],
    criteria="""
    A review is **GOOD** if it is actionable and constructive. It should:
    - Offer clear, specific suggestions or highlight issues in a way that the developer can address
    - Be respectful and encourage learning or improvement
    - Use professional, helpful language—even when pointing out problems

    A review is **BAD** if it is non-actionable or overly critical. For example:
    - It may be vague, generic, or hedged to the point of being unhelpful
    - It may focus on praise only, without offering guidance
    - It may sound dismissive, contradictory, harsh, or robotic
    - It may raise a concern but fail to explain what should be done

    Always explain your reasoning.
    """,
    target_category="bad",
    non_target_category="good",
    uncertainty="unknown",
    include_reasoning=True,
)

# 3. Apply the LLM judge
eval_dataset = Dataset.from_pandas(
    pd.DataFrame(review_dataset),
    data_definition=definition,
    descriptors=[
        LLMEval("Generated review",
                template=feedback_quality_3,# We can pass new prompt version  <- new prompt
                provider="openai",          # We can change the provider
                model="gpt-4o-mini",        # We can change the model
                alias="LLM-judged quality")
    ]
)

# 4. Add TRUE/FALSE for judge alignment
eval_dataset.add_descriptors([
    ExactMatch(columns=["LLM-judged quality", "Expert label"], alias="Judge_alignment")
])

In [None]:
my_eval = run_classification_report(
    eval_dataset,
    name=name,
    cloud_ws=ws, #Optional
    project_id=project.id #Optional
)

In [None]:
#my_eval

# Exp 4. Try a different model (Turbo)

Can a cheaper, simpler model perform as well?

In [None]:
# 1. Name the experiment  <- new name
name = "turbo"

# 2. Define LLM judge prompt template
feedback_quality_3 = BinaryClassificationPromptTemplate(
    pre_messages=[
        ("system", "You are evaluating the quality of code reviews given to junior developers.")],
    criteria="""
    A review is **GOOD** if it is actionable and constructive. It should:
    - Offer clear, specific suggestions or highlight issues in a way that the developer can address
    - Be respectful and encourage learning or improvement
    - Use professional, helpful language—even when pointing out problems

    A review is **BAD** if it is non-actionable or overly critical. For example:
    - It may be vague, generic, or hedged to the point of being unhelpful
    - It may focus on praise only, without offering guidance
    - It may sound dismissive, contradictory, harsh, or robotic
    - It may raise a concern but fail to explain what should be done

    Always explain your reasoning.
    """,
    target_category="bad",
    non_target_category="good",
    uncertainty="unknown",
    include_reasoning=True,
)

# 3. Apply the LLM judge
eval_dataset = Dataset.from_pandas(
    pd.DataFrame(review_dataset),
    data_definition=definition,
    descriptors=[
        LLMEval("Generated review",
                template=feedback_quality_3,  # We can pass new prompt version
                provider="openai",            # We can change the provider
                model="gpt-3.5-turbo",        # We can change the model  <- different model
                alias="LLM-judged quality")
    ]
)

# 4. Add TRUE/FALSE for judge alignment
eval_dataset.add_descriptors([
    ExactMatch(columns=["LLM-judged quality", "Expert label"], alias="Judge_alignment")
])

In [None]:
my_eval = run_classification_report(
    eval_dataset,
    name=name,
    cloud_ws=ws, #Optional
    project_id=project.id #Optional
)

# Exp 5. Try another provider (Anthropic)

In [None]:
from evidently.legacy.utils.llm.wrapper import AnthropicOptions

In [None]:
# 1. Name the experiment <- new name
name = "anthropic"

# 2. Define LLM judge prompt template
feedback_quality_3 = BinaryClassificationPromptTemplate(
    pre_messages=[
        ("system", "You are evaluating the quality of code reviews given to junior developers.")],
    criteria="""
    A review is **GOOD** if it is actionable and constructive. It should:
    - Offer clear, specific suggestions or highlight issues in a way that the developer can address
    - Be respectful and encourage learning or improvement
    - Use professional, helpful language—even when pointing out problems

    A review is **BAD** if it is non-actionable or overly critical. For example:
    - It may be vague, generic, or hedged to the point of being unhelpful
    - It may focus on praise only, without offering guidance
    - It may sound dismissive, contradictory, harsh, or robotic
    - It may raise a concern but fail to explain what should be done

    Always explain your reasoning.
    """,
    target_category="bad",
    non_target_category="good",
    uncertainty="unknown",
    include_reasoning=True,
)

# 3. Apply the LLM judge
eval_dataset = Dataset.from_pandas(
    pd.DataFrame(review_dataset),
    data_definition=definition,
    descriptors=[
        LLMEval("Generated review",
                template=feedback_quality_3,         # We can pass new prompt version
                provider="anthropic",                # We can change the provider <- new provider
                model="claude-3-5-sonnet-20240620",  # We can change the model <- new model
                alias="LLM-judged quality")
    ],
    options=AnthropicOptions(rpm_limit=50)  # <- rate limit params
)

# 4. Add TRUE/FALSE for judge alignment
eval_dataset.add_descriptors([
    ExactMatch(columns=["LLM-judged quality", "Expert label"], alias="Judge_alignment")
])

In [None]:
my_eval = run_classification_report(
    eval_dataset,
    name=name,
    cloud_ws=ws, #Optional
    project_id=project.id #Optional
)

# What's next?

Can you write a better prompt? Consider splitting the criteria, and review the labels first! (Even humans don't always agree.)

Enjoyed the tutorial? Star Evidently on GitHub to support the project: https://github.com/evidentlyai/evidently