braintrust eval --dev silently drops evaluators with duplicate eval_name #366

@willfrey

Summary

When braintrust eval --dev loads multiple eval_*.py files whose Eval(...) calls pass the same name, only one evaluator survives the load. The startup log reports the full count loaded, but GET /list returns just one entry. No warning or error is emitted.

Repro

Two minimal entrypoints, both targeting the same Braintrust project (the natural pattern when you want one project containing multiple evaluators):

eval_a.py:

from braintrust import Eval, EvalCase

Eval(
    "MyProject",
    data=lambda: [EvalCase(input="a", expected="a")],
    task=lambda i, h: i,
    scores=[lambda input, output, expected, **_: 1.0 if output == expected else 0.0],
)

eval_b.py:

from braintrust import Eval, EvalCase

Eval(
    "MyProject",
    data=lambda: [EvalCase(input="b", expected="b")],
    task=lambda i, h: i,
    scores=[lambda input, output, expected, **_: 1.0 if output == expected else 0.0],
)

Run:

braintrust eval eval_a.py eval_b.py --dev --dev-host 127.0.0.1 --dev-port 8300 --dev-org-name <my-org>

Startup log:

Loaded 2 evaluator(s): ['MyProject', 'MyProject']

GET /list:

$ curl -s http://127.0.0.1:8300/list \
    -H "Authorization: Bearer $BRAINTRUST_API_KEY" \
    -H "x-bt-org-name: <my-org>" | jq 'keys, length'
[ "MyProject" ]
1

Two evaluators loaded, only one reachable.

Cause

braintrust/devserver/server.py:308:

_all_evaluators = {evaluator.eval_name: evaluator for evaluator in evaluators}

The dict comprehension silently keeps the last entry per duplicate key. The startup log on the previous line prints the full input list before this collapse, which is why the loaded count looks correct.
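
The collapse can be reproduced without Braintrust at all. The following is a minimal sketch using a hypothetical StubEvaluator stand-in (not the library's evaluator class); its source field exists only to show which entry wins:

```python
from dataclasses import dataclass


@dataclass
class StubEvaluator:
    """Hypothetical stand-in for an evaluator; not the braintrust class."""

    eval_name: str
    source: str  # illustrative only: the entrypoint that defined it


evaluators = [
    StubEvaluator("MyProject", "eval_a.py"),
    StubEvaluator("MyProject", "eval_b.py"),
]

# Same shape as the comprehension quoted above: a later duplicate key
# overwrites the earlier one without any error.
registry = {e.eval_name: e for e in evaluators}

print(len(evaluators), len(registry))  # 2 1
print(registry["MyProject"].source)    # eval_b.py
```

The list still has two items, which matches the startup log, while the dict that backs /list has one.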

Compounding factor: the name argument to Eval(...) serves as both the evaluator name and the project-name fallback (per the docstring at framework.py:985: "this corresponds to a project name in Braintrust"). The natural way to put multiple evaluators under one Braintrust project is to pass Eval("SHARED_PROJECT_NAME", ...) in each file, which produces this collision under --dev.

Expected behavior

One of:

  1. Raise at server startup naming the conflicting eval_name and the entrypoints that produced it.
  2. Warn (logger or warnings.warn) on the collapse.
  3. Key _all_evaluators by something more discriminating (e.g., file path + eval name).

Option 1 is the most defensible: silent data loss between the loader and the registered-evaluator dict shouldn't be possible. Option 3 is the most permissive but requires deciding how to disambiguate evaluators in the /list and /eval APIs.
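
For comparison, option 2 might look roughly like the sketch below. It is not taken from the Braintrust codebase: find_duplicate_names and warn_on_duplicates are hypothetical helpers, and the input is a plain list of names rather than real evaluator objects.

```python
import warnings
from collections import Counter


def find_duplicate_names(names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name in dict.fromkeys(names) if counts[name] > 1]


def warn_on_duplicates(names):
    """Emit one warning per duplicated name and return the duplicates."""
    duplicates = find_duplicate_names(names)
    for name in duplicates:
        warnings.warn(
            f"Multiple evaluators share the name {name!r}; "
            "only one of them will be served.",
            stacklevel=2,
        )
    return duplicates
```

A warning keeps existing setups running, but the dropped evaluator is still unreachable, which is why raising (option 1) seems the safer default.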

Workaround (verified)

Each entrypoint passes a distinct name AND an explicit project_id to keep multiple evaluators under one Braintrust project:

from braintrust import Eval, EvalCase, init_dataset

_DATASET = init_dataset(project="MyProject", name="SomeDataset")

Eval(
    name="MyProject — variant_a",
    project_id=_DATASET.project_id,
    data=lambda: [EvalCase(input="a", expected="a")],
    task=lambda i, h: i,
    scores=[lambda input, output, expected, **_: 1.0 if output == expected else 0.0],
)

Verified empirically:

  • GET /list returns two distinct entries (one per entrypoint with distinct name).
  • Experiments from both evaluators land in the existing MyProject Braintrust project (URL pattern …/p/MyProject/experiments/…), confirming project_id overrides the name-based project resolution at write time (per framework.py:767, 1792-1793).

This is the supported pattern, but it's not discoverable: the API design (name doubles as the project-name fallback, no warning on dev-server collapse) leads users into the broken default.

Fix sketch

In braintrust/devserver/server.py, replace the dict comprehension with a guarded loop:

_all_evaluators: dict[str, Evaluator[Any, Any, Any]] = {}
for evaluator in evaluators:
    # Fail fast instead of letting a later evaluator overwrite an earlier one.
    if evaluator.eval_name in _all_evaluators:
        raise ValueError(
            f"Multiple evaluators registered with name {evaluator.eval_name!r}. "
            "Each --dev evaluator must have a distinct name. "
            "Pass a unique `name=` to Eval(...) and `project_id=` to keep them in the same project."
        )
    _all_evaluators[evaluator.eval_name] = evaluator
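
To check the guard in isolation, the same loop can be wrapped in a function and exercised with stand-in objects. This is a sketch only: build_registry, StubEvaluator, and raises_on_duplicates are hypothetical names, and the real server code assigns to _all_evaluators rather than returning a dict.

```python
from dataclasses import dataclass


@dataclass
class StubEvaluator:
    """Hypothetical stand-in exposing only eval_name."""

    eval_name: str


def build_registry(evaluators):
    """Map eval_name to evaluator, raising on the first duplicate name."""
    registry = {}
    for evaluator in evaluators:
        if evaluator.eval_name in registry:
            raise ValueError(
                f"Multiple evaluators registered with name {evaluator.eval_name!r}."
            )
        registry[evaluator.eval_name] = evaluator
    return registry


def raises_on_duplicates():
    """Return True if two evaluators with the same name are rejected."""
    try:
        build_registry([StubEvaluator("MyProject"), StubEvaluator("MyProject")])
    except ValueError:
        return True
    return False
```

Distinct names still register normally, so the workaround above would keep working under this guard.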

Optionally, framework.py's Eval could emit a deprecation note when project_id is omitted and name matches an existing Braintrust project — but that's a separate API discussion.

Environment

  • braintrust 0.17.0
  • Python 3.14
  • macOS
