## Problem
`Eval` is generic over `Output`, which is shared across three positions:

- `EvalCase[Input, Output].expected: Output | None` — ground truth / assertion data
- `EvalTask[Input, Output]` return type — the actual model output
- `EvalScorer[Input, Output]` second argument — what the scorer receives as `output`
This works when `expected` is an instance of the same type the task returns (e.g. comparing two strings). But many eval designs use `expected` to carry assertion specs or rubrics (what to check against the output), not an example of the output type itself. For example:

- Task returns a `TypedDict` with the model's structured response
- Expected is a `frozenset[Assertion]` describing what properties to verify
- Scorer receives `(str, ResponseDict, frozenset[Assertion])`

There is no valid `Output` that satisfies all three positions simultaneously. This forces strict-mode users to suppress `reportArgumentType` (pyright) or add `# type: ignore[misc]` (mypy: `Cannot infer value of type parameter "Output"`).
## Reproduction
```python
from dataclasses import dataclass
from typing import TypedDict

from braintrust import Eval, EvalCase
from braintrust.score import Score


@dataclass(frozen=True)
class Assertion:
    check: str


class ResponseDict(TypedDict):
    answer: str
    explanation: str


data = [EvalCase(input="question", expected=frozenset({Assertion(check="is_correct")}))]


async def task(input: str) -> ResponseDict:
    return ResponseDict(answer="42", explanation="because")


async def scorer(
    input: str, output: ResponseDict, expected: frozenset[Assertion]
) -> list[Score]:
    return [Score(name="check", score=1.0)]


Eval("Project", data=data, task=task, scores=[scorer])
# pyright: reportArgumentType — can't assign task/scores because Output doesn't unify
# mypy: Cannot infer value of type parameter "Output" of "Eval"  [misc]
```
## Suggestion
Split the generic into two type parameters — one for the task output, one for the expected data:
```python
def Eval[Input, Output, Expected](
    name: str,
    data: Iterable[EvalCase[Input, Expected]],
    task: EvalTask[Input, Output],
    scores: Sequence[EvalScorer[Input, Output, Expected]],
    ...
) -> ...:
```
This would let `Output` and `Expected` vary independently, which matches how eval frameworks with rubric- or assertion-based scoring actually work. The current single-parameter design only works cleanly for the "compare output to golden answer" pattern. Existing callers are unaffected when both parameters happen to be the same type.