Eval generic Output parameter can't unify when expected and task return type differ #240

@willfrey

Description

Problem

Eval is generic over Output, which is shared across three positions:

  1. EvalCase[Input, Output].expected: Output | None — ground truth / assertion data
  2. EvalTask[Input, Output] return type — the actual model output
  3. EvalScorer[Input, Output] second arg — what the scorer receives as output

This works when expected is an instance of the same type the task returns (e.g. comparing two strings). But many eval designs use expected to carry assertion specs or rubrics (what to check against the output), not an example of the output type itself. For example:

  • Task returns a TypedDict with the model's structured response
  • Expected is a frozenset[Assertion] describing what properties to verify
  • Scorer receives (str, ResponseDict, frozenset[Assertion])

There is no single Output type that satisfies all three positions simultaneously. This forces strict-mode users to suppress the diagnostic with `# pyright: ignore[reportArgumentType]` or `# type: ignore[misc]` (mypy: `Cannot infer value of type parameter "Output"`).
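The unification failure can be reproduced without braintrust at all. The sketch below is a hypothetical stand-in for the `Eval` signature (`Case` and `run_eval` are illustrative names, not the real API) that shares one `Output` parameter across `expected`, the task return type, and the scorer:

```python
from collections.abc import Callable, Iterable
from dataclasses import dataclass
from typing import Generic, TypeVar

Input = TypeVar("Input")
Output = TypeVar("Output")


@dataclass
class Case(Generic[Input, Output]):
    input: Input
    expected: Output  # same type parameter as the task's return type


def run_eval(
    data: Iterable[Case[Input, Output]],
    task: Callable[[Input], Output],
    scorer: Callable[[Input, Output, Output], float],
) -> list[float]:
    return [scorer(c.input, task(c.input), c.expected) for c in data]


# Output must bind to the task's return type (dict) *and* the expected type
# (frozenset) at once; no single type fits, so pyright/mypy reject this call
# even though it runs fine at runtime.
scores = run_eval(
    data=[Case(input="q", expected=frozenset({"is_correct"}))],
    task=lambda q: {"answer": "42"},
    scorer=lambda i, out, exp: 1.0,
)
```

The runtime behavior is fine; the error is purely static, which is why it only bites users running pyright strict or mypy.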

Reproduction

from dataclasses import dataclass
from typing import TypedDict

from braintrust import Eval, EvalCase
from braintrust.score import Score


@dataclass(frozen=True)
class Assertion:
    check: str


class ResponseDict(TypedDict):
    answer: str
    explanation: str


data = [EvalCase(input="question", expected=frozenset({Assertion(check="is_correct")}))]


async def task(input: str) -> ResponseDict:
    return ResponseDict(answer="42", explanation="because")


async def scorer(
    input: str, output: ResponseDict, expected: frozenset[Assertion]
) -> list[Score]:
    return [Score(name="check", score=1.0)]


Eval("Project", data=data, task=task, scores=[scorer])
# pyright: reportArgumentType — can't assign task/scores because Output doesn't unify
# mypy: Cannot infer value of type parameter "Output" of "Eval" [misc]

Suggestion

Split the shared Output parameter in two, one type for the task output and one for the expected data:

def Eval[Input, Output, Expected](
    name: str,
    data: Iterable[EvalCase[Input, Expected]],
    task: EvalTask[Input, Output],
    scores: Sequence[EvalScorer[Input, Output, Expected]],
    ...
) -> ...:

This would let Output and Expected vary independently, which matches how eval frameworks that use rubric/assertion-based scoring actually work. The current single-parameter design only works cleanly for the "compare output to golden answer" pattern.
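Applied to the stand-in sketch from the Problem section (again hypothetical names, not the real API), the split looks like this, and the call from the reproduction now type-checks with no suppressions:

```python
from collections.abc import Callable, Iterable
from dataclasses import dataclass
from typing import Generic, TypeVar

Input = TypeVar("Input")
Output = TypeVar("Output")
Expected = TypeVar("Expected")


@dataclass
class Case(Generic[Input, Expected]):
    input: Input
    expected: Expected  # independent of the task's return type


def run_eval(
    data: Iterable[Case[Input, Expected]],
    task: Callable[[Input], Output],
    scorer: Callable[[Input, Output, Expected], float],
) -> list[float]:
    return [scorer(c.input, task(c.input), c.expected) for c in data]


# Output binds to dict and Expected to frozenset independently, so the
# rubric/assertion pattern type-checks without any ignore comments.
scores = run_eval(
    data=[Case(input="q", expected=frozenset({"is_correct"}))],
    task=lambda q: {"answer": "42"},
    scorer=lambda i, out, exp: 1.0 if "is_correct" in exp else 0.0,
)
```

The golden-answer pattern is still expressible as the special case `Expected == Output`, so the split is strictly more general.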
