
docs: test based eval documentation #916

Merged
planetf1 merged 14 commits into generative-computing:main from seirasto:test-based-eval-docs
Apr 29, 2026

Conversation

@seirasto
Contributor

@seirasto seirasto commented Apr 23, 2026

Misc PR

Type of PR

  • Bug Fix
  • New Feature
  • Documentation
  • Other

Description

We are pleased to see our TestBasedEval contribution covered in the Mellea documentation. We have made some adjustments to further clarify the functionality and advantages of using Generative Unit Tests via LLM-as-a-Judge:

  1. Changed the title to say Generative Unit Tests
  2. Expanded the explanation to clarify that there can be multiple inputs/outputs and that the user provides the instructions and examples.
  3. Updated the example to highlight the need for an LLM judge by requiring a semantic match (it was classification before)

Testing

  • Tests added to the respective file if code was changed
  • New code has 100% coverage if code was added
  • Ensure existing tests and github automation passes (a maintainer will kick off the github automation when the rest of the PR is populated)

Attribution

  • AI coding assistants used

@seirasto seirasto requested a review from a team as a code owner April 23, 2026 20:17
@github-actions
Contributor

The PR description has been updated. Please fill out the template for your PR to be reviewed.

@seirasto seirasto changed the title Test based eval documentation docs: test based eval documentation Apr 23, 2026
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label Apr 23, 2026
Contributor

@planetf1 planetf1 left a comment


A few minor suggestions - but one syntax correction I think is needed as users will hopefully follow the docs

seirasto and others added 2 commits April 24, 2026 10:35
Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
@planetf1
Contributor

One other thing I noticed -- not touched by this PR, but in the same file -- we don't explain 'verdict', at least not in terms of its content. A reader might think it's a boolean (we refer to this in the docs when talking about llm as a judge... but that's using a different pattern, with conversion)

Here it could be anything I think - whatever the model returns. That may be worth clarifying? Just saying it's the raw llm output?

@planetf1
Contributor

One structural suggestion: the three-level table at the top sets up a useful mental model, but TestBasedEval doesn't fit any of those rows — it's a standalone evaluation script where you run your model over a set of examples and pass each output to a judge model to score it. It's not a pytest assertion. A reader has no signal it's coming or how it differs from @pytest.mark.qualitative.

A small addition could help — a fourth row in the table:

| Level | What you assert | Deterministic? |
| --- | --- | --- |
| Type check | `isinstance(result, bool)` | Yes |
| Structural check | `result in [...]` or field names present | Yes |
| Qualitative check | `assert result is True` | No — depends on the model |
| Semantic evaluation | Judge model scores output against reference responses | No — run separately, not a pytest assertion |

And a bridging sentence after the table: "For levels 1–3, use pytest with the patterns below. For semantic evaluation against reference examples — where you want a judge model to score your model's outputs in bulk — see The unit_test_eval component at the end of this page."
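For concreteness, the first three levels could look something like this in pytest (an illustrative sketch, not from the docs; `classify` is a stand-in for whatever model-backed function is under test, and `qualitative` is the custom mark already referenced above):

```python
import pytest


def classify(text: str) -> str:
    # Stand-in for a model-backed classifier; a real version would call the model.
    return "spam" if "offer" in text.lower() else "not_spam"


def test_type_check():
    # Level 1: deterministic type check
    assert isinstance(classify("hello"), str)


def test_structural_check():
    # Level 2: deterministic structural check
    assert classify("hello") in ["spam", "not_spam"]


@pytest.mark.qualitative
def test_qualitative_check():
    # Level 3: depends on the model's behaviour
    assert classify("Limited time offer!!!") == "spam"
```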

@planetf1
Contributor

The Next steps section currently points to the Requirements System and Handling Exceptions. Worth adding a link to Evaluate with LLM-as-a-Judge here — it covers the Requirement-based inline judge pattern, which is a related but distinct approach to what TestBasedEval does. Readers finishing this guide are exactly the audience who'd want to know that alternative exists.

## Next steps

- [The Requirements System](../concepts/requirements-system) — understand how
  `Requirement`, `simple_validate`, and `check` interact with the IVR loop
- [Handling Exceptions](../how-to/handling-exceptions) —
  catch and diagnose errors that occur during generation
- [Evaluate with LLM-as-a-Judge](../evaluation-and-observability/evaluate-with-llm-as-a-judge) —
  the `Requirement`-based approach for inline judge evaluation

@seirasto
Contributor Author

One other thing I noticed -- not touched by this PR, but in the same file -- we don't explain 'verdict', at least not in terms of its content. A reader might think it's a boolean (we refer to this in the docs when talking about llm as a judge... but that's using a different pattern, with conversion)

Here it could be anything I think - whatever the model returns. That may be worth clarifying? Just saying it's the raw llm output?

The default judge prompt in the jinja template does use a boolean value. This could be adjusted of course.

@seirasto
Contributor Author

The Next steps section currently points to the Requirements System and Handling Exceptions. Worth adding a link to Evaluate with LLM-as-a-Judge here — it covers the Requirement-based inline judge pattern, which is a related but distinct approach to what TestBasedEval does. Readers finishing this guide are exactly the audience who'd want to know that alternative exists.

## Next steps

- [The Requirements System](../concepts/requirements-system) — understand how
  `Requirement`, `simple_validate`, and `check` interact with the IVR loop
- [Handling Exceptions](../how-to/handling-exceptions) —
  catch and diagnose errors that occur during generation
- [Evaluate with LLM-as-a-Judge](../evaluation-and-observability/evaluate-with-llm-as-a-judge) —
  the `Requirement`-based approach for inline judge evaluation

Added this

@seirasto
Contributor Author

One structural suggestion: the three-level table at the top sets up a useful mental model, but TestBasedEval doesn't fit any of those rows — it's a standalone evaluation script where you run your model over a set of examples and pass each output to a judge model to score it. It's not a pytest assertion. A reader has no signal it's coming or how it differs from @pytest.mark.qualitative.

A small addition could help — a fourth row in the table:

| Level | What you assert | Deterministic? |
| --- | --- | --- |
| Type check | `isinstance(result, bool)` | Yes |
| Structural check | `result in [...]` or field names present | Yes |
| Qualitative check | `assert result is True` | No — depends on the model |
| Semantic evaluation | Judge model scores output against reference responses | No — run separately, not a pytest assertion |

And a bridging sentence after the table: "For levels 1–3, use pytest with the patterns below. For semantic evaluation against reference examples — where you want a judge model to score your model's outputs in bulk — see The unit_test_eval component at the end of this page."

Added this

@seirasto
Contributor Author

Thanks @planetf1, I think I addressed all your changes. One other thing - you can call the test based eval functionality from the CLI via cli/eval - should we reference this as well?

@planetf1
Contributor

I tested the code example end-to-end in a fresh project (uv add mellea) and found two bugs that prevent it running correctly. Notes and a corrected file below.


Bug 1 — instruct() should be act() (pre-existing)

```python
# as written — broken
verdict = judge_session.instruct(eval_case)

# fix
verdict = judge_session.act(eval_case)
```

TestBasedEval is a Component — it formats its own judge prompt via the Jinja2 template. instruct() expects a plain str; passing a Component causes Python to evaluate it as str(eval_case) which returns the object repr (<TestBasedEval object at 0x...>). The judge model then receives that repr as its prompt and responds with "It appears you have provided an object reference from Python's Mellea library..." — completely wrong output.

This was present before this PR but the original prediction = "no" placeholder masked it. Now that the example is meant to be runnable this needs fixing.


Bug 2 — .value needed on act() result (introduced by this PR)

```python
# as written — passes ComputedModelOutputThunk to set_judge_context
prediction = generation_session.act(
    SimpleComponent(instruction=input_text)
)

# fix
prediction = generation_session.act(
    SimpleComponent(instruction=input_text)
).value
```

set_judge_context is typed prediction: str. act() returns a ComputedModelOutputThunk[str] — calling .value extracts the string the model generated. Without this the prediction field in the judge template receives the thunk object rather than the email text.


Clarification needed — what verdict.value contains

To answer the open question about verdict.value: based on the actual output of this example, the judge model returns a structured JSON string, not a boolean. The TestBasedEval.jinja2 template instructs the model to respond in this format:

{"score": 0_or_1, "justification": "..."}

Actual output from a working run:

case_001: {"score": 1, "justification": "The model output closely follows the instructions provided. It is a professional and appropriate follow-up email after an interview..."}

Worth adding a note after the code block explaining this, e.g.:

Note: verdict.value is the raw JSON string returned by the judge — {"score": 0|1, "justification": "..."}. Score 0 means the guidelines were violated; score 1 means the output is well aligned. Parse it to use the score programmatically:

```python
import json

result = json.loads(verdict.value)
print(f"{eval_case.name}: {result['score']} - {result['justification']}")
```

Corrected example

With all fixes applied (including the SimpleComponent import and granite4:micro tag already noted in other comments):

```python
from mellea import start_session
from mellea.stdlib.components import SimpleComponent
from mellea.stdlib.components.unit_test_eval import TestBasedEval

test_evals = TestBasedEval.from_json_file("tests/eval_data/email_writer.json")

judge_session = start_session(backend_name="ollama", model_id="granite4:micro")
generation_session = start_session(backend_name="ollama", model_id="granite4:micro")

for eval_case in test_evals:
    for idx, input_text in enumerate(eval_case.inputs):
        prediction = generation_session.act(
            SimpleComponent(instruction=input_text)
        ).value

        targets = eval_case.targets[idx] if eval_case.targets else []
        eval_case.set_judge_context(input_text, prediction, targets)

        verdict = judge_session.act(eval_case)
        print(f"{eval_case.name}: {verdict.value}")
```

With this version the example runs and produces correct judge verdicts.
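If a single pass/fail signal is wanted rather than per-case verdicts, the scores could also be aggregated -- a minimal sketch building on the loop above (reusing the same sessions and eval cases, and assuming the default judge template that returns the {"score": 0|1, ...} JSON):

```python
import json

scores = []
for eval_case in test_evals:
    for idx, input_text in enumerate(eval_case.inputs):
        prediction = generation_session.act(
            SimpleComponent(instruction=input_text)
        ).value

        targets = eval_case.targets[idx] if eval_case.targets else []
        eval_case.set_judge_context(input_text, prediction, targets)

        verdict = judge_session.act(eval_case)
        # default template output: {"score": 0|1, "justification": "..."}
        scores.append(json.loads(verdict.value)["score"])

print(f"pass rate: {sum(scores) / len(scores):.0%}")
```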

seirasto and others added 3 commits April 28, 2026 14:17
Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
@planetf1 planetf1 self-requested a review April 29, 2026 07:38
Contributor

@planetf1 planetf1 left a comment


LGTM

@planetf1 planetf1 added this pull request to the merge queue Apr 29, 2026
Merged via the queue into generative-computing:main with commit 208ca9b Apr 29, 2026
7 checks passed

Labels

documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Expand TestBasedEval documentation

2 participants