Eval generic Output parameter can't unify when expected and task return type differ #240

@willfrey

Description

Problem

Eval is generic over Output, which is shared across three positions:

  1. EvalCase[Input, Output].expected: Output | None — ground truth / assertion data
  2. EvalTask[Input, Output] return type — the actual model output
  3. EvalScorer[Input, Output] second arg — what the scorer receives as output

This works when expected is an instance of the same type the task returns (e.g. comparing two strings). But many eval designs use expected to carry assertion specs or rubrics (what to check against the output), not an example of the output type itself. For example:

  • Task returns a TypedDict with the model's structured response
  • Expected is a frozenset[Assertion] describing what properties to verify
  • Scorer receives (str, ResponseDict, frozenset[Assertion])

There is no single Output type that satisfies all three positions simultaneously. This forces strict-mode users to suppress the diagnostic with `# pyright: ignore[reportArgumentType]` or `# type: ignore[misc]` (mypy: `Cannot infer value of type parameter "Output"`).
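The unification failure can be reproduced without braintrust at all. The sketch below is a hypothetical stand-in for the `Eval` signature (`Case` and `run_eval` are illustrative names, not the real API) that shares one `Output` parameter across `expected`, the task return type, and the scorer:

```python
from collections.abc import Callable, Iterable
from dataclasses import dataclass
from typing import Generic, TypeVar

Input = TypeVar("Input")
Output = TypeVar("Output")


@dataclass
class Case(Generic[Input, Output]):
    input: Input
    expected: Output  # same type parameter as the task's return type


def run_eval(
    data: Iterable[Case[Input, Output]],
    task: Callable[[Input], Output],
    scorer: Callable[[Input, Output, Output], float],
) -> list[float]:
    return [scorer(c.input, task(c.input), c.expected) for c in data]


# Output must bind to the task's return type (dict) *and* the expected type
# (frozenset) at once; no single type fits, so pyright/mypy reject this call
# even though it runs fine at runtime.
scores = run_eval(
    data=[Case(input="q", expected=frozenset({"is_correct"}))],
    task=lambda q: {"answer": "42"},
    scorer=lambda i, out, exp: 1.0,
)
```

The runtime behavior is fine; the error is purely static, which is why it only bites users running pyright strict or mypy.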

Reproduction

from dataclasses import dataclass
from typing import TypedDict

from braintrust import Eval, EvalCase
from braintrust.score import Score


@dataclass(frozen=True)
class Assertion:
    check: str


class ResponseDict(TypedDict):
    answer: str
    explanation: str


data = [EvalCase(input="question", expected=frozenset({Assertion(check="is_correct")}))]


async def task(input: str) -> ResponseDict:
    return ResponseDict(answer="42", explanation="because")


async def scorer(
    input: str, output: ResponseDict, expected: frozenset[Assertion]
) -> list[Score]:
    return [Score(name="check", score=1.0)]


Eval("Project", data=data, task=task, scores=[scorer])
# pyright: reportArgumentType — can't assign task/scores because Output doesn't unify
# mypy: Cannot infer value of type parameter "Output" of "Eval" [misc]

Suggestion

Split the shared Output parameter in two, one type for the task output and one for the expected data:

def Eval[Input, Output, Expected](
    name: str,
    data: Iterable[EvalCase[Input, Expected]],
    task: EvalTask[Input, Output],
    scores: Sequence[EvalScorer[Input, Output, Expected]],
    ...
) -> ...:

This would let Output and Expected vary independently, which matches how eval frameworks that use rubric/assertion-based scoring actually work. The current single-parameter design only works cleanly for the "compare output to golden answer" pattern.
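Applied to the stand-in sketch from the Problem section (again hypothetical names, not the real API), the split looks like this, and the call from the reproduction now type-checks with no suppressions:

```python
from collections.abc import Callable, Iterable
from dataclasses import dataclass
from typing import Generic, TypeVar

Input = TypeVar("Input")
Output = TypeVar("Output")
Expected = TypeVar("Expected")


@dataclass
class Case(Generic[Input, Expected]):
    input: Input
    expected: Expected  # independent of the task's return type


def run_eval(
    data: Iterable[Case[Input, Expected]],
    task: Callable[[Input], Output],
    scorer: Callable[[Input, Output, Expected], float],
) -> list[float]:
    return [scorer(c.input, task(c.input), c.expected) for c in data]


# Output binds to dict and Expected to frozenset independently, so the
# rubric/assertion pattern type-checks without any ignore comments.
scores = run_eval(
    data=[Case(input="q", expected=frozenset({"is_correct"}))],
    task=lambda q: {"answer": "42"},
    scorer=lambda i, out, exp: 1.0 if "is_correct" in exp else 0.0,
)
```

The golden-answer pattern is still expressible as the special case `Expected == Output`, so the split is strictly more general.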
