# Crafting robust LLM-as-a-judge scorers

Let's say you're working on a customer service bot and want to evaluate the quality of its outputs. Consider a question like "What is your return policy?". If the ground truth answer is "You can return items within 30 days of purchase" and your bot generates "You can return items within 30 days", an exact string comparison would mark them as different, but an LLM-as-a-judge scorer can recognize that they mean the same thing.
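
To make the limitation concrete, here is a minimal sketch of the exact-match comparison described above. The function name `exact_match` is hypothetical and used only for illustration.

```python
def exact_match(expected: str, generated: str) -> bool:
    """Naive scorer: True only when the two strings are identical."""
    return expected.strip() == generated.strip()


expected = "You can return items within 30 days of purchase"
generated = "You can return items within 30 days"

# The answers mean the same thing, but the strings differ, so this prints False.
print(exact_match(expected, generated))
```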

## Installing dependencies

Let's install a few basic dependencies. We'll use the CoQA dataset (via DuckDB), Braintrust for evals, and OpenAI's LLMs.


In [1]:
%pip install autoevals chevron duckdb braintrust openai

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os

import braintrust
from openai import AsyncOpenAI

client = braintrust.wrap_openai(
    AsyncOpenAI(
        api_key=os.environ["OPENAI_API_KEY"],
        base_url="https://api.braintrust.dev/v1/proxy",  # Caches requests to OpenAI
    )
)



## Explore the dataset


In [3]:
import duckdb

con = duckdb.connect(":memory:")
full_result = con.query("""
    SELECT * FROM 'hf://datasets/stanfordnlp/coqa/data/validation-00000-of-00001.parquet'
        LIMIT 20
""").fetchall()

first_result = full_result[0]

print("Passage:")
print(first_result[1])

print("\nQuestion:")
print(first_result[2][0])

print("\nAnswer:")
print(first_result[3]["input_text"][0])

Passage:
Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. But Cotton wasn't alone in her little home above the barn, oh no. She shared her hay bed with her mommy and 5 other sisters. All of her sisters were cute and fluffy, like Cotton. But she was the only white one in the bunch. The rest of her sisters were all orange with beautiful white tiger stripes like Cotton's mommy. Being different made Cotton quite sad. She often wished she looked like the rest of her family. So one day, when Cotton found a can of the old farmer's orange paint, she used it to paint herself like them. When her mommy and sisters found her they started laughing. 

"What are you doing, Cotton?!" 

"I only wanted to be more like you". 

Cotton's mommy rubbed her face on Cotton's and said "Oh Cotton, but your fur is so pretty and special, like you. We would never want you to

The data contains a series of passages, each with a number of questions and answers. Let's flatten this into a list of `QuestionAnswer` records, one per question. For these reference pairs, the generated answer is simply a copy of the expected answer.


In [4]:
from dataclasses import dataclass


@dataclass
class QuestionAnswer:
    passage: str
    question: str
    expected_answer: str
    generated_answer: str


qa_pairs = [
    QuestionAnswer(
        passage=r[1], question=question, generated_answer=r[3]["input_text"][i], expected_answer=r[3]["input_text"][i]
    )
    for r in full_result
    for (i, question) in enumerate(r[2])
]

print(len(qa_pairs))

297


### Adding hallucinations

In addition to the reference answers, let's add some hallucinated answers. We'll generate them by asking `gpt-4o` to answer each question confidently without showing it the passage.


In [5]:
import asyncio
import random

random.seed(42)


async def hallucinate_answer(qa):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """\
Answer the following question in 1 sentence. Make sure to always answer it confidently,
even if you don't know the answer. Do not use words like "perhaps", "likely", "maybe", etc.
Do not admit that you cannot or do not know the answer.""",
            },
            {"role": "user", "content": qa.question},
        ],
        temperature=0,
        max_tokens=100,
    )
    return response.choices[0].message.content


hallucinated_answers = await asyncio.gather(*[hallucinate_answer(qa) for qa in qa_pairs])


hallucinations = [
    QuestionAnswer(
        passage=qa.passage,
        question=qa.question,
        expected_answer=qa.expected_answer,
        generated_answer=hallucination,
    )
    for (qa, hallucination) in zip(qa_pairs, hallucinated_answers)
]

print("Passage:")
print(hallucinations[10].passage)
print("\nQuestion:")
print(hallucinations[10].question)
print("\nExpected Answer:")
print(hallucinations[10].expected_answer)
print("\nGenerated Answer:")
print(hallucinations[10].generated_answer)


Passage:
Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. But Cotton wasn't alone in her little home above the barn, oh no. She shared her hay bed with her mommy and 5 other sisters. All of her sisters were cute and fluffy, like Cotton. But she was the only white one in the bunch. The rest of her sisters were all orange with beautiful white tiger stripes like Cotton's mommy. Being different made Cotton quite sad. She often wished she looked like the rest of her family. So one day, when Cotton found a can of the old farmer's orange paint, she used it to paint herself like them. When her mommy and sisters found her they started laughing. 

"What are you doing, Cotton?!" 

"I only wanted to be more like you". 

Cotton's mommy rubbed her face on Cotton's and said "Oh Cotton, but your fur is so pretty and special, like you. We would never want you to

## Creating the evaluators

We'll reuse the OpenAI client initialized earlier, which is wrapped with Braintrust to enable logging and caching.


### LLM-as-a-judge #1: Numeric rater

A common initial intuition when creating an LLM-as-a-judge is asking the LLM to rate the answer on a scale of 1 to 10. Let's see how this works for our task.

We'll use a modified version of the [Factuality](https://github.com/braintrustdata/autoevals/blob/main/templates/factuality.yaml) template.


In [19]:
import json

import chevron

PROMPT = """\
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Question]: {{{input}}}
************
[Expert]: {{{expected}}}
************
[Submission]: {{{output}}}
************
[END DATA]

Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
Rate the submission on a scale of 1 to 10.
"""


@braintrust.traced
async def numeric_rater(input, output, expected):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": chevron.render(PROMPT, data=dict(input=input, output=output, expected=expected)),
            }
        ],
        temperature=0,
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "rate",
                    "description": "Rate the submission on a scale of 1 to 10.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "rating": {"type": "integer", "minimum": 1, "maximum": 10},
                        },
                    },
                },
            }
        ],
        tool_choice={"type": "function", "function": {"name": "rate"}},
    )
    arguments = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    # Rescale the 1-10 rating onto the range [0, 1].
    return (arguments["rating"] - 1) / 9


print(qa_pairs[10].question, "On a correct answer:", qa_pairs[10].generated_answer)
print(await numeric_rater(qa_pairs[10].question, qa_pairs[10].generated_answer, qa_pairs[10].expected_answer))

print(hallucinations[10].question, "On a hallucinated answer:", hallucinations[10].generated_answer)
print(
    await numeric_rater(
        hallucinations[10].question, hallucinations[10].generated_answer, hallucinations[10].expected_answer
    )
)


What did the other cats do when Cotton emerged from the bucket of water? On a correct answer: licked her face
1.0
What did the other cats do when Cotton emerged from the bucket of water? On a hallucinated answer: The other cats watched in surprise and amusement as Cotton emerged from the bucket of water.
0.2222222222222222
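
The scores printed above come from rescaling the judge's 1 to 10 rating onto a 0 to 1 range with `(rating - 1) / 9`. As a small sketch (the helper name `normalize_rating` is ours, not part of any library), a rating of 10 maps to 1.0, and the 0.2222... score for the hallucinated answer corresponds to a rating of 3:

```python
def normalize_rating(rating: int, low: int = 1, high: int = 10) -> float:
    """Linearly map an integer rating in [low, high] onto [0, 1]."""
    if not low <= rating <= high:
        raise ValueError(f"rating {rating} is outside [{low}, {high}]")
    return (rating - low) / (high - low)


for rating in (1, 3, 10):
    print(rating, normalize_rating(rating))
```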


In [20]:
from dataclasses import asdict

from braintrust import Eval


def data():
    # for pair in qa_pairs[:20]:
    #     yield dict(input=dict(asdict(pair)), expected=1, metadata=dict(hallucination=False))
    for pair in hallucinations:
        yield dict(input=dict(asdict(pair)), expected=0, metadata=dict(hallucination=True))


async def task(input):
    return await numeric_rater(
        input=input["question"],
        output=input["generated_answer"],
        expected=input["expected_answer"],
    )


def normalized_diff(output, expected):
    return 1 - abs(output - expected)


await Eval(
    "LLM-as-a-judge",
    data=data,
    task=task,
    scores=[normalized_diff],
    experiment_name="Numeric rater",
    max_concurrency=10,
)

Experiment Numeric rater is running at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Numeric%20rater
LLM-as-a-judge [experiment_name=Numeric rater] (data): 297it [00:00, 88605.75it/s]
LLM-as-a-judge [experiment_name=Numeric rater] (tasks): 100%|██████████| 297/297 [00:52<00:00,  5.65it/s]



76.99% 'normalized_diff' score

1.84s duration
1.72s llm_duration
177.31tok prompt_tokens
5tok completion_tokens
182.31tok total_tokens
0.00$ estimated_cost

See results for Numeric rater at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Numeric%20rater


EvalResultWithSummary(summary="...", results=[...])
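
Because every row in this experiment is a hallucination labeled `expected=0`, `normalized_diff` reduces to `1 - output`: the lower the judge rates a hallucinated answer, the higher the score. The sketch below uses illustrative ratings (not values taken from the experiment) to show the relationship.

```python
def normalized_diff(output: float, expected: float) -> float:
    """Score is 1 when the judge's output matches the expected label exactly."""
    return 1 - abs(output - expected)


# Illustrative ratings only; hallucinations are labeled expected=0.
for rating in (1, 3, 10):
    judge_output = (rating - 1) / 9
    print(rating, round(normalized_diff(judge_output, expected=0), 3))
```

Read this way, an average score of 76.99% means the rater gave the hallucinated answers an average normalized rating of about 0.23, so it is still assigning partial credit to answers that should score 0.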

### Adding reasoning

Next, let's extend the tool definition so that the LLM also writes out its reasoning alongside the rating. This technique is called [chain-of-thought reasoning](https://en.wikipedia.org/wiki/Chain_of_thought_reasoning).


In [21]:
@braintrust.traced
async def numeric_rater(input, output, expected):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": chevron.render(PROMPT, data=dict(input=input, output=output, expected=expected)),
            }
        ],
        temperature=0,
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "rate",
                    "description": "Rate the submission on a scale of 1 to 10.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "rating": {"type": "integer", "minimum": 1, "maximum": 10},
                            "reasons": {
                                "description": "Write out in a step by step manner your reasoning to be sure that your conclusion is correct. Avoid simply stating the correct answer at the outset.",
                                "title": "Reasoning",
                                "type": "string",
                            },
                        },
                    },
                },
            }
        ],
        tool_choice={"type": "function", "function": {"name": "rate"}},
    )
    arguments = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    return (arguments["rating"] - 1) / 9


print(qa_pairs[10].question, "On a correct answer:", qa_pairs[10].generated_answer)
print(await numeric_rater(qa_pairs[10].question, qa_pairs[10].generated_answer, qa_pairs[10].expected_answer))

print(hallucinations[10].question, "On a hallucinated answer:", hallucinations[10].generated_answer)
print(
    await numeric_rater(
        hallucinations[10].question, hallucinations[10].generated_answer, hallucinations[10].expected_answer
    )
)

What did the other cats do when Cotton emerged from the bucket of water? On a correct answer: licked her face
1.0
What did the other cats do when Cotton emerged from the bucket of water? On a hallucinated answer: The other cats watched in surprise and amusement as Cotton emerged from the bucket of water.
0.2222222222222222
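
The function above asks the model for `reasons` but returns only the rating. When debugging a judge, it can help to keep the rationale as well. The sketch below parses a tool-call arguments string of the shape requested by the `rate` schema; the helper `parse_rating` and the sample payload are hypothetical, not output from the model.

```python
import json


def parse_rating(arguments_json: str) -> tuple[float, str]:
    """Return (normalized score, reasoning) from a `rate` tool-call payload."""
    arguments = json.loads(arguments_json)
    score = (arguments["rating"] - 1) / 9
    # `reasons` is not marked as required in the schema, so default to "".
    return score, arguments.get("reasons", "")


# Hypothetical payload for illustration.
sample = json.dumps({"rating": 3, "reasons": "The submission contradicts the expert answer."})
score, reasons = parse_rating(sample)
print(round(score, 3), reasons)
```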


In [22]:
await Eval(
    "LLM-as-a-judge",
    data=data,
    task=task,
    scores=[normalized_diff],
    experiment_name="Numeric rater with reasoning",
    max_concurrency=10,
)

Experiment Numeric rater with reasoning is running at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Numeric%20rater%20with%20reasoning
LLM-as-a-judge [experiment_name=Numeric rater with reasoning] (data): 297it [00:00, 60547.70it/s]
LLM-as-a-judge [experiment_name=Numeric rater with reasoning] (tasks): 100%|██████████| 297/297 [01:24<00:00,  3.52it/s]



Numeric rater with reasoning compared to Numeric rater:
77.10% (+00.11%) 'normalized_diff' score	(29 improvements, 33 regressions)

2.96s (+111.88%) 'duration'         	(10 improvements, 267 regressions)
2.88s (+115.66%) 'llm_duration'     	(9 improvements, 268 regressions)
216.31tok (+3900.00%) 'prompt_tokens'    	(0 improvements, 297 regressions)
137.54tok (+13253.87%) 'completion_tokens'	(0 improvements, 297 regressions)
353.85tok (+17153.87%) 'total_tokens'     	(0 improvements, 297 regressions)
0.00$ (+00.14%) 'estimated_cost'   	(0 improvements, 277 regressions)

See results for Numeric rater with reasoning at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Numeric%20rater%20with%20reasoning


EvalResultWithSummary(summary="...", results=[...])

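Interesting: adding reasoning barely changed the score (77.10% vs. 76.99%) and produced slightly more regressions (33) than improvements (29), while increasing the average duration from 1.84s to 2.96s.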


### LLM-as-a-judge #2: Classifier

Next, let's try using classification. The intuition here is that giving the model specific criteria to choose between will produce a more accurate assessment. For illustration purposes, we've written out how this works, but you can access this scorer directly via the `Factuality` scorer in autoevals.


In [17]:
PROMPT = """\
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Question]: {{{input}}}
************
[Expert]: {{{expected}}}
************
[Submission]: {{{output}}}
************
[END DATA]

Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.

Answer the question by calling `select_choice` with your reasoning in a step-by-step manner to be
sure that your conclusion is correct. Avoid simply stating the correct answer at the outset. Select a
single choice by setting the `choice` parameter to a single choice from A, B, C, D, or E.
"""

# Since we're testing for hallucinations, penalize (B) as much as (D).
CHOICE_SCORES = {
    "A": 0.5,
    "B": 0,
    "C": 1,
    "D": 0,
    "E": 1,
}


@braintrust.traced
async def classifier(input, output, expected):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": chevron.render(PROMPT, data=dict(input=input, output=output, expected=expected)),
            }
        ],
        temperature=0,
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "select_choice",
                    "description": "Call this function to select a choice.",
                    "parameters": {
                        "properties": {
                            "reasons": {
                                "description": "Write out in a step by step manner your reasoning to be sure that your conclusion is correct. Avoid simply stating the correct answer at the outset.",
                                "type": "string",
                            },
                            "choice": {
                                "description": "The choice",
                                "type": "string",
                                "enum": ["A", "B", "C", "D", "E"],
                            },
                        },
                        "required": ["reasons", "choice"],
                        "type": "object",
                    },
                },
            }
        ],
        # The tool name must match the `select_choice` name referenced in PROMPT.
        tool_choice={"type": "function", "function": {"name": "select_choice"}},
    )
    arguments = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    choice = arguments["choice"]
    return CHOICE_SCORES[choice] if choice in CHOICE_SCORES else None


print(qa_pairs[10].question, "On a correct answer:", qa_pairs[10].generated_answer)
print(await classifier(qa_pairs[10].question, qa_pairs[10].generated_answer, qa_pairs[10].expected_answer))

print(hallucinations[10].question, "On a hallucinated answer:", hallucinations[10].generated_answer)
print(
    await classifier(
        hallucinations[10].question, hallucinations[10].generated_answer, hallucinations[10].expected_answer
    )
)


What did the other cats do when Cotton emerged from the bucket of water? On a correct answer: licked her face
1
What did the other cats do when Cotton emerged from the bucket of water? On a hallucinated answer: The other cats watched in surprise and amusement as Cotton emerged from the bucket of water.
0
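
To see how the classifier's choices translate into eval scores, the sketch below combines the `CHOICE_SCORES` mapping with `normalized_diff` for a hallucinated row (`expected=0`). The helper `score_choice` is hypothetical and mirrors the lookup at the end of `classifier`.

```python
from typing import Optional

CHOICE_SCORES = {"A": 0.5, "B": 0, "C": 1, "D": 0, "E": 1}


def score_choice(choice: str) -> Optional[float]:
    """Map a choice letter to its score; unknown letters yield None."""
    return CHOICE_SCORES.get(choice)


def normalized_diff(output: float, expected: float) -> float:
    return 1 - abs(output - expected)


# For a hallucination (expected=0), choices B and D earn full credit.
for choice in "ABCDE":
    print(choice, normalized_diff(score_choice(choice), expected=0))
```

Note that a `None` score (an unrecognized choice) would need to be filtered out before calling `normalized_diff`, which cannot subtract from `None`.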


In [18]:
async def task(input):
    return await classifier(
        input=input["question"],
        output=input["generated_answer"],
        expected=input["expected_answer"],
    )


await Eval(
    "LLM-as-a-judge",
    data=data,
    task=task,
    scores=[normalized_diff],
    experiment_name="Classifier",
    max_concurrency=10,
)


Experiment Classifier is running at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Classifier
LLM-as-a-judge [experiment_name=Classifier] (data): 297it [00:00, 65377.78it/s]
LLM-as-a-judge [experiment_name=Classifier] (tasks): 100%|██████████| 297/297 [01:44<00:00,  2.83it/s]



92.59% 'normalized_diff' score

3.71s duration
3.57s llm_duration
394.31tok prompt_tokens
164.23tok completion_tokens
558.54tok total_tokens
0.00$ estimated_cost

See results for Classifier at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Classifier


EvalResultWithSummary(summary="...", results=[...])