
docs: test based eval documentation #916

Merged
planetf1 merged 14 commits into generative-computing:main from seirasto:test-based-eval-docs
Apr 29, 2026

Conversation

@seirasto
Contributor

@seirasto seirasto commented Apr 23, 2026

Misc PR

Type of PR

  • Bug Fix
  • New Feature
  • Documentation
  • Other

Description

We are pleased to see our TestBasedEval contribution covered in the Mellea documentation. We have made some adjustments to further clarify the functionality and advantages of using Generative Unit Tests via LLM-as-a-Judge:

  1. Changed the title to say Generative Unit Tests
  2. Expanded the explanation to clarify that there can be multiple inputs/outputs and that the user provides the instructions and examples.
  3. Updated the example to highlight the need for an LLM judge by requiring a semantic match (it was classification before)

Testing

  • Tests added to the respective file if code was changed
  • New code has 100% coverage if code was added
  • Ensure existing tests and github automation passes (a maintainer will kick off the github automation when the rest of the PR is populated)

Attribution

  • AI coding assistants used

@seirasto seirasto requested a review from a team as a code owner April 23, 2026 20:17
@github-actions
Contributor

The PR description has been updated. Please fill out the template for your PR to be reviewed.

@seirasto seirasto changed the title Test based eval documentation docs: test based eval documentation Apr 23, 2026
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label Apr 23, 2026
Contributor

@planetf1 planetf1 left a comment


A few minor suggestions - but one syntax correction I think is needed as users will hopefully follow the docs

seirasto and others added 2 commits April 24, 2026 10:35
Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
@planetf1
Contributor

One other thing I noticed -- not touched by this PR, but in the same file -- we don't explain 'verdict', at least not in terms of its content. A reader might think it's a boolean (we refer to this in the docs when talking about llm as a judge... but that's using a different pattern, with conversion)

Here it could be anything I think - whatever the model returns. That may be worth clarifying? Just saying it's the raw llm output?

@planetf1
Contributor

One structural suggestion: the three-level table at the top sets up a useful mental model, but TestBasedEval doesn't fit any of those rows — it's a standalone evaluation script where you run your model over a set of examples and pass each output to a judge model to score it. It's not a pytest assertion. A reader has no signal it's coming or how it differs from @pytest.mark.qualitative.

A small addition could help — a fourth row in the table:

| Level | What you assert | Deterministic? |
| --- | --- | --- |
| Type check | `isinstance(result, bool)` | Yes |
| Structural check | `result in [...]` or field names present | Yes |
| Qualitative check | `assert result is True` | No — depends on the model |
| Semantic evaluation | Judge model scores output against reference responses | No — run separately, not a pytest assertion |

And a bridging sentence after the table: "For levels 1–3, use pytest with the patterns below. For semantic evaluation against reference examples — where you want a judge model to score your model's outputs in bulk — see The unit_test_eval component at the end of this page."
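For concreteness, the first three levels could look something like this in pytest (an illustrative sketch, not from the docs; `classify` is a stand-in for whatever model-backed function is under test, and `qualitative` is the custom mark already referenced above):

```python
import pytest


def classify(text: str) -> str:
    # Stand-in for a model-backed classifier; a real version would call the model.
    return "spam" if "offer" in text.lower() else "not_spam"


def test_type_check():
    # Level 1: deterministic type check
    assert isinstance(classify("hello"), str)


def test_structural_check():
    # Level 2: deterministic structural check
    assert classify("hello") in ["spam", "not_spam"]


@pytest.mark.qualitative
def test_qualitative_check():
    # Level 3: depends on the model's behaviour
    assert classify("Limited time offer!!!") == "spam"
```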

@planetf1
Contributor

The Next steps section currently points to the Requirements System and Handling Exceptions. Worth adding a link to Evaluate with LLM-as-a-Judge here — it covers the Requirement-based inline judge pattern, which is a related but distinct approach to what TestBasedEval does. Readers finishing this guide are exactly the audience who'd want to know that alternative exists.

## Next steps

- [The Requirements System](../concepts/requirements-system) — understand how
  `Requirement`, `simple_validate`, and `check` interact with the IVR loop
- [Handling Exceptions](../how-to/handling-exceptions) —
  catch and diagnose errors that occur during generation
- [Evaluate with LLM-as-a-Judge](../evaluation-and-observability/evaluate-with-llm-as-a-judge) —
  the `Requirement`-based approach for inline judge evaluation

@seirasto
Contributor Author

One other thing I noticed -- not touched by this PR, but in the same file -- we don't explain 'verdict', at least not in terms of its content. A reader might think it's a boolean (we refer to this in the docs when talking about llm as a judge... but that's using a different pattern, with conversion)

Here it could be anything I think - whatever the model returns. That may be worth clarifying? Just saying it's the raw llm output?

The default judge prompt in the jinja template does use a boolean value. This could be adjusted of course.

@seirasto
Contributor Author

The Next steps section currently points to the Requirements System and Handling Exceptions. Worth adding a link to Evaluate with LLM-as-a-Judge here — it covers the Requirement-based inline judge pattern, which is a related but distinct approach to what TestBasedEval does. Readers finishing this guide are exactly the audience who'd want to know that alternative exists.

## Next steps

- [The Requirements System](../concepts/requirements-system) — understand how
  `Requirement`, `simple_validate`, and `check` interact with the IVR loop
- [Handling Exceptions](../how-to/handling-exceptions) —
  catch and diagnose errors that occur during generation
- [Evaluate with LLM-as-a-Judge](../evaluation-and-observability/evaluate-with-llm-as-a-judge) —
  the `Requirement`-based approach for inline judge evaluation

Added this

@seirasto
Contributor Author

One structural suggestion: the three-level table at the top sets up a useful mental model, but TestBasedEval doesn't fit any of those rows — it's a standalone evaluation script where you run your model over a set of examples and pass each output to a judge model to score it. It's not a pytest assertion. A reader has no signal it's coming or how it differs from @pytest.mark.qualitative.

A small addition could help — a fourth row in the table:

| Level | What you assert | Deterministic? |
| --- | --- | --- |
| Type check | `isinstance(result, bool)` | Yes |
| Structural check | `result in [...]` or field names present | Yes |
| Qualitative check | `assert result is True` | No — depends on the model |
| Semantic evaluation | Judge model scores output against reference responses | No — run separately, not a pytest assertion |

And a bridging sentence after the table: "For levels 1–3, use pytest with the patterns below. For semantic evaluation against reference examples — where you want a judge model to score your model's outputs in bulk — see The unit_test_eval component at the end of this page."

Added this

@seirasto
Contributor Author

Thanks @planetf1, I think I addressed all your changes. One other thing - you can call the test based eval functionality from the CLI via cli/eval - should we reference this as well?

@planetf1
Contributor

I tested the code example end-to-end in a fresh project (uv add mellea) and found two bugs that prevent it running correctly. Notes and a corrected file below.


Bug 1 — instruct() should be act() (pre-existing)

```python
# as written — broken
verdict = judge_session.instruct(eval_case)

# fix
verdict = judge_session.act(eval_case)
```

TestBasedEval is a Component — it formats its own judge prompt via the Jinja2 template. instruct() expects a plain str; passing a Component causes Python to evaluate it as str(eval_case) which returns the object repr (<TestBasedEval object at 0x...>). The judge model then receives that repr as its prompt and responds with "It appears you have provided an object reference from Python's Mellea library..." — completely wrong output.

This was present before this PR but the original prediction = "no" placeholder masked it. Now that the example is meant to be runnable this needs fixing.


Bug 2 — .value needed on act() result (introduced by this PR)

```python
# as written — passes ComputedModelOutputThunk to set_judge_context
prediction = generation_session.act(
    SimpleComponent(instruction=input_text)
)

# fix
prediction = generation_session.act(
    SimpleComponent(instruction=input_text)
).value
```

set_judge_context is typed prediction: str. act() returns a ComputedModelOutputThunk[str] — calling .value extracts the string the model generated. Without this the prediction field in the judge template receives the thunk object rather than the email text.


Clarification needed — what verdict.value contains

To answer the open question about verdict.value: based on the actual output of this example, the judge model returns a structured JSON string, not a boolean. The TestBasedEval.jinja2 template instructs the model to respond in this format:

{"score": 0_or_1, "justification": "..."}

Actual output from a working run:

case_001: {"score": 1, "justification": "The model output closely follows the instructions provided. It is a professional and appropriate follow-up email after an interview..."}

Worth adding a note after the code block explaining this, e.g.:

Note: verdict.value is the raw JSON string returned by the judge — {"score": 0|1, "justification": "..."}. Score 0 means the guidelines were violated; score 1 means the output is well aligned. Parse it to use the score programmatically:

```python
import json

result = json.loads(verdict.value)
print(f"{eval_case.name}: {result['score']} - {result['justification']}")
```

Corrected example

With all fixes applied (including the SimpleComponent import and granite4:micro tag already noted in other comments):

```python
from mellea import start_session
from mellea.stdlib.components import SimpleComponent
from mellea.stdlib.components.unit_test_eval import TestBasedEval

test_evals = TestBasedEval.from_json_file("tests/eval_data/email_writer.json")

judge_session = start_session(backend_name="ollama", model_id="granite4:micro")
generation_session = start_session(backend_name="ollama", model_id="granite4:micro")

for eval_case in test_evals:
    for idx, input_text in enumerate(eval_case.inputs):
        prediction = generation_session.act(
            SimpleComponent(instruction=input_text)
        ).value

        targets = eval_case.targets[idx] if eval_case.targets else []
        eval_case.set_judge_context(input_text, prediction, targets)

        verdict = judge_session.act(eval_case)
        print(f"{eval_case.name}: {verdict.value}")
```

With this version the example runs and produces correct judge verdicts.
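If a single pass/fail signal is wanted rather than per-case verdicts, the scores could also be aggregated -- a minimal sketch building on the loop above (reusing the same sessions and eval cases, and assuming the default judge template that returns the {"score": 0|1, ...} JSON):

```python
import json

scores = []
for eval_case in test_evals:
    for idx, input_text in enumerate(eval_case.inputs):
        prediction = generation_session.act(
            SimpleComponent(instruction=input_text)
        ).value

        targets = eval_case.targets[idx] if eval_case.targets else []
        eval_case.set_judge_context(input_text, prediction, targets)

        verdict = judge_session.act(eval_case)
        # default template output: {"score": 0|1, "justification": "..."}
        scores.append(json.loads(verdict.value)["score"])

print(f"pass rate: {sum(scores) / len(scores):.0%}")
```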

seirasto and others added 3 commits April 28, 2026 14:17
Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
@planetf1 planetf1 self-requested a review April 29, 2026 07:38
Contributor

@planetf1 planetf1 left a comment


LGTM

@planetf1 planetf1 added this pull request to the merge queue Apr 29, 2026
Merged via the queue into generative-computing:main with commit 208ca9b Apr 29, 2026
7 checks passed

Labels

documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Expand TestBasedEval documentation

2 participants