Summary
When braintrust eval --dev loads multiple eval_*.py files whose Eval(...) calls pass the same name argument, only one evaluator survives the load. The startup log reports the full count of loaded evaluators, but GET /list returns just one entry. No warning or error is emitted.
Repro
Two minimal entrypoints, both targeting the same Braintrust project (the natural pattern when you want one project containing multiple evaluators):
eval_a.py:
from braintrust import Eval, EvalCase
Eval(
    "MyProject",
    data=lambda: [EvalCase(input="a", expected="a")],
    task=lambda i, h: i,
    scores=[lambda input, output, expected, **_: 1.0 if output == expected else 0.0],
)
eval_b.py:
from braintrust import Eval, EvalCase
Eval(
    "MyProject",
    data=lambda: [EvalCase(input="b", expected="b")],
    task=lambda i, h: i,
    scores=[lambda input, output, expected, **_: 1.0 if output == expected else 0.0],
)
Run:
braintrust eval eval_a.py eval_b.py --dev --dev-host 127.0.0.1 --dev-port 8300 --dev-org-name <my-org>
Startup log:
Loaded 2 evaluator(s): ['MyProject', 'MyProject']
GET /list:
$ curl -s http://127.0.0.1:8300/list \
    -H "Authorization: Bearer $BRAINTRUST_API_KEY" \
    -H "x-bt-org-name: <my-org>" | jq 'keys, length'
[ "MyProject" ]
1
Two evaluators loaded, only one reachable.
Cause
braintrust/devserver/server.py:308:
_all_evaluators = {evaluator.eval_name: evaluator for evaluator in evaluators}
The dict comprehension silently keeps the last entry per duplicate key. The startup log on the previous line prints the full input list before this collapse, which is why the loaded count looks correct.
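The collapse is ordinary Python dict behavior and can be reproduced without Braintrust. A minimal sketch, assuming only that the comprehension reads `eval_name`: `FakeEvaluator` is a hypothetical stand-in, not the real Braintrust class, and its `source` field exists purely to show which entry survives.

```python
from dataclasses import dataclass


@dataclass
class FakeEvaluator:
    """Hypothetical stand-in for a loaded evaluator (not the Braintrust class)."""
    eval_name: str
    source: str  # illustrative only: the entrypoint that produced this evaluator


evaluators = [
    FakeEvaluator("MyProject", "eval_a.py"),
    FakeEvaluator("MyProject", "eval_b.py"),
]

# Same shape as the comprehension quoted above: a later duplicate key
# overwrites the earlier one without any error.
all_evaluators = {e.eval_name: e for e in evaluators}

print(len(evaluators))                     # 2
print(len(all_evaluators))                 # 1
print(all_evaluators["MyProject"].source)  # eval_b.py
```

The list still has two items when it is logged, which matches the "Loaded 2 evaluator(s)" line, while the dict built from it has one.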
Compounding factor: Eval(name=...) doubles as both the evaluator name and the project-name fallback (per the docstring at framework.py:985: "this corresponds to a project name in Braintrust"). The natural way to put multiple evaluators under one Braintrust project is to pass Eval("SHARED_PROJECT_NAME", ...) in each file — which produces this collision under --dev.
Expected behavior
One of:
1. Raise at server startup, naming the conflicting eval_name and the entrypoints that produced it.
2. Warn (logger or warnings.warn) on the collapse.
3. Key _all_evaluators by something more discriminating (e.g., file path + eval name).
Option 1 is the most defensible — silent data loss between the loader and the registered-evaluator dict shouldn't be possible. Option 3 is the most permissive but requires deciding how to disambiguate at the /list and /eval API.
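To make the Option 3 trade-off concrete, here is a rough sketch of composite keying. It assumes the source file of each evaluator is available at registration time, which this report has not verified. `LoadedEvaluator`, `source_file`, `key_by_file_and_name`, and the `file::name` key format are all hypothetical.

```python
from typing import NamedTuple


class LoadedEvaluator(NamedTuple):
    """Hypothetical record pairing an evaluator name with its entrypoint file."""
    eval_name: str
    source_file: str


def key_by_file_and_name(evaluators):
    """Register under 'file::name' so same-named evaluators from different files coexist."""
    registry = {}
    for ev in evaluators:
        key = f"{ev.source_file}::{ev.eval_name}"
        if key in registry:
            # Two identical names inside one file are still ambiguous.
            raise ValueError(f"Duplicate evaluator {key!r}")
        registry[key] = ev
    return registry


registry = key_by_file_and_name([
    LoadedEvaluator("MyProject", "eval_a.py"),
    LoadedEvaluator("MyProject", "eval_b.py"),
])
print(sorted(registry))  # ['eval_a.py::MyProject', 'eval_b.py::MyProject']
```

Both evaluators survive, but the registry keys are no longer bare eval names, so /list and /eval would need a rule for how clients address them.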
Workaround (verified)
Each entrypoint passes a distinct name AND an explicit project_id to keep multiple evaluators under one Braintrust project:
from braintrust import Eval, EvalCase, init_dataset
_DATASET = init_dataset(project="MyProject", name="SomeDataset")
Eval(
    name="MyProject — variant_a",
    project_id=_DATASET.project_id,
    data=lambda: [EvalCase(input="a", expected="a")],
    task=lambda i, h: i,
    scores=[lambda input, output, expected, **_: 1.0 if output == expected else 0.0],
)
Verified empirically:
- GET /list returns two distinct entries (one per entrypoint with distinct name).
- Experiments from both evaluators land in the existing MyProject Braintrust project (URL pattern …/p/MyProject/experiments/…), confirming project_id overrides the name-based project resolution at write time (per framework.py:767, 1792-1793).
This is the supported pattern, but it's not discoverable: the API design (name doubles as the project-name fallback, no warning on dev-server collapse) leads users into the broken default.
Fix sketch
In braintrust/devserver/server.py, replace the dict comprehension with a guarded loop:
_all_evaluators: dict[str, Evaluator[Any, Any, Any]] = {}
for evaluator in evaluators:
    if evaluator.eval_name in _all_evaluators:
        raise ValueError(
            f"Multiple evaluators registered with name {evaluator.eval_name!r}. "
            f"Each --dev evaluator must have a distinct name. "
            f"Pass a unique `name=` to Eval(...) and `project_id=` to keep them in the same project."
        )
    _all_evaluators[evaluator.eval_name] = evaluator
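The guard can be exercised in isolation. The sketch below is a variant that collects every colliding name before raising, so a single error lists all of them. `StubEvaluator` and `build_registry` are hypothetical names for illustration and do not exist in Braintrust.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class StubEvaluator:
    """Hypothetical stand-in exposing only the attribute the guard inspects."""
    eval_name: str


def build_registry(evaluators):
    """Return a name -> evaluator dict, refusing to collapse duplicate names."""
    counts = Counter(e.eval_name for e in evaluators)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(
            f"Multiple evaluators registered with the same name: {duplicates}. "
            "Each --dev evaluator must have a distinct name."
        )
    return {e.eval_name: e for e in evaluators}


# Colliding names now fail loudly instead of being dropped.
try:
    build_registry([StubEvaluator("MyProject"), StubEvaluator("MyProject")])
except ValueError as exc:
    print(exc)

# Distinct names register as before.
ok = build_registry([StubEvaluator("variant_a"), StubEvaluator("variant_b")])
print(sorted(ok))  # ['variant_a', 'variant_b']
```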
Optionally, framework.py's Eval could emit a deprecation note when project_id is omitted and name matches an existing Braintrust project — but that's a separate API discussion.
Environment
- braintrust 0.17.0
- Python 3.14
- macOS