feat(ENG-55): first-class LLM-as-judge verifier with dense reward #277
- Add rubric_config.py: Pydantic models for rubric.toml (Criterion, JudgeConfig, ScoringConfig, RubricConfig) with binary/likert/numeric criterion types and score normalization
- Add llm.py: Multi-provider LLM routing (Anthropic/OpenAI/Gemini) with call_judge(), parse_verdict(), exponential backoff retry
- Add file_readers.py: Document extraction (pdf/docx/xlsx/pptx/txt) with find_deliverables() for rollout directory scanning
- Replace placeholder LLMJudgeRewardFunc in builtins.py with full rubric-based judge: per-criterion scoring, prompt templates, configurable aggregation (weighted_mean/all_pass/any_pass/threshold), backward-compatible legacy mode
- Emit per-criterion dense RewardEvents for fine-grained reward signals
- Write evaluation_details.json alongside rollout for transparency
- Re-export Criterion, JudgeConfig, RubricConfig, ScoringConfig, load_rubric_toml from rewards and top-level __init__
- Add test_rubric_config.py (15 tests) and test_llm_judge.py (30 tests)

Closes ENG-55
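To make the score normalization and the four aggregation names above concrete, here is a minimal sketch. It is not the code from builtins.py or rubric_config.py: `CriterionScore`, `normalize`, `aggregate`, and `pass_threshold` are hypothetical names, and the exact semantics of `all_pass`, `any_pass`, and `threshold` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CriterionScore:
    """Hypothetical container: one criterion's normalized score in [0, 1]."""

    name: str
    score: float
    weight: float = 1.0


def normalize(raw: float, low: float, high: float) -> float:
    """Map a raw score (e.g. Likert 1-5) onto [0, 1], clamping out-of-range values."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return min(1.0, max(0.0, (raw - low) / (high - low)))


def aggregate(scores, strategy="weighted_mean", pass_threshold=0.5):
    """Combine per-criterion scores. The semantics here are assumed, not BenchFlow's."""
    if not scores:
        return 0.0
    if strategy == "weighted_mean":
        total_weight = sum(s.weight for s in scores)
        if total_weight == 0:
            return 0.0
        return sum(s.score * s.weight for s in scores) / total_weight
    # Assumption: a criterion "passes" when its score reaches pass_threshold.
    passed = [s.score >= pass_threshold for s in scores]
    if strategy == "all_pass":
        return 1.0 if all(passed) else 0.0
    if strategy == "any_pass":
        return 1.0 if any(passed) else 0.0
    if strategy == "threshold":
        # Assumption: binary result based on the weighted mean.
        mean = aggregate(scores, "weighted_mean")
        return 1.0 if mean >= pass_threshold else 0.0
    raise ValueError(f"unknown strategy: {strategy}")


example = [
    CriterionScore("cites_sources", normalize(1, 0, 1)),   # binary criterion
    CriterionScore("clarity", normalize(3, 1, 5)),         # Likert 1-5
]
mean_score = aggregate(example, "weighted_mean")
```

Each per-criterion score could also be emitted as its own dense reward event before aggregation, which is the behaviour the description attributes to the judge.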
- Replace sync SDK clients with async variants (AsyncAnthropic, AsyncOpenAI, client.aio for Google)
- Replace time.sleep() with await asyncio.sleep() in call_judge retry loop
- Pass actual aggregated score to _write_details instead of recomputing n_passed/total
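The reason for swapping time.sleep() for asyncio.sleep() can be shown with a small stand-alone sketch. The coroutine names below are hypothetical and unrelated to the real call_judge; the point is only that an awaited sleep yields control to the event loop, so other work can proceed during the backoff delay.

```python
import asyncio


async def retry_with_backoff(events, base_delay=0.01):
    """Hypothetical retry loop: records each attempt, then backs off without blocking."""
    for attempt in range(2):
        events.append(f"attempt-{attempt}")
        # Awaiting asyncio.sleep suspends only this coroutine.
        await asyncio.sleep(base_delay * 2**attempt)


async def other_rollout(events):
    """Stand-in for unrelated concurrent work sharing the same event loop."""
    events.append("other-start")
    await asyncio.sleep(0)
    events.append("other-done")


async def main():
    events = []
    await asyncio.gather(retry_with_backoff(events), other_rollout(events))
    return events


events = asyncio.run(main())
```

Here the other coroutine finishes while the retry loop is still waiting out its first delay. With a blocking time.sleep() in its place, the whole event loop would stall for the duration of every backoff.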
    if self._rubric_path is not None:
        return load_rubric_toml(self._rubric_path)
🟡 Constructor judge_model/model silently ignored when rubric_path or auto-discovered rubric is used
When LLMJudgeRewardFunc is constructed with both rubric_path (or when a rubric.toml is auto-discovered) and an explicit judge_model, the constructor's model is silently ignored. In _load_rubric at src/benchflow/rewards/builtins.py:181-182, the rubric_path branch calls load_rubric_toml(self._rubric_path) and returns the TOML's config as-is, without merging self.model. The same applies to auto-discovered rubrics at lines 208-213. However, inline criteria mode at lines 201-202 correctly uses self.model via JudgeConfig(model=self.model, ...). This means LLMJudgeRewardFunc(judge_model="gpt-4o", rubric_path=Path("rubric.toml")) would silently use the TOML's model (defaulting to claude-sonnet-4-6 per src/benchflow/rewards/rubric_config.py:102) instead of gpt-4o, while LLMJudgeRewardFunc(judge_model="gpt-4o", criteria=[...]) correctly uses gpt-4o.
Prompt for agents
The _load_rubric method has an inconsistency: when rubric_path is provided (line 181-182) or a rubric.toml is auto-discovered (lines 208-213), the RubricConfig returned uses the TOML file's judge model, ignoring self.model (which holds the constructor's judge_model parameter). But when inline criteria are used (lines 201-202), self.model IS correctly propagated via JudgeConfig(model=self.model). To fix this, after loading a rubric from a TOML file, merge the constructor's model if it was explicitly provided by the user. One approach: after calling load_rubric_toml(), check if the user explicitly set a model (e.g., track whether judge_model was passed) and if so, override rubric.judge.model with self.model. This needs to be done for both the rubric_path path and the auto-discovery path. You may need to add a boolean flag like self._model_explicit to distinguish between the user passing judge_model and the default being used.
Valid catch. When rubric_path is provided or auto-discovered, self.model from the constructor is silently ignored in favor of the TOML's judge.model. The inline criteria path correctly uses self.model.
Two options to fix:
- Track whether judge_model was explicitly passed (e.g. self._model_explicit flag) and override rubric.judge.model when it was.
- Always merge self.model into loaded rubric configs after load_rubric_toml().
Happy to fix in a follow-up commit if the maintainer wants this addressed now.
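As a rough illustration of the first option, the sketch below applies an explicitly passed model on top of a loaded rubric. The `JudgeConfig` and `RubricConfig` classes here are simplified dataclass stand-ins (the PR describes the real ones as Pydantic models), `apply_model_override` is a hypothetical helper, and using `None` to mean "not passed" is an assumption that replaces the suggested `self._model_explicit` flag.

```python
from dataclasses import dataclass, field, replace
from typing import Optional

DEFAULT_MODEL = "claude-sonnet-4-6"  # default model named in the review comment


@dataclass(frozen=True)
class JudgeConfig:
    """Simplified stand-in for the real JudgeConfig."""

    model: str = DEFAULT_MODEL


@dataclass(frozen=True)
class RubricConfig:
    """Simplified stand-in for the real RubricConfig."""

    judge: JudgeConfig = field(default_factory=JudgeConfig)


def apply_model_override(rubric: RubricConfig, judge_model: Optional[str]) -> RubricConfig:
    """Return the rubric unchanged unless the caller explicitly chose a model."""
    if judge_model is None:
        return rubric  # nothing passed: keep whatever the TOML specified
    # Build a copy so the loaded config object is never mutated in place.
    return replace(rubric, judge=replace(rubric.judge, model=judge_model))


loaded = RubricConfig()  # pretend this came from load_rubric_toml(...)
overridden = apply_model_override(loaded, "gpt-4o")
untouched = apply_model_override(loaded, None)
```

The same helper could be called from both the rubric_path branch and the auto-discovery branch, so the two TOML-based paths behave like the inline-criteria path.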
    for provider in providers:
        for attempt in range(retries):
            try:
                if provider == "anthropic":
                    return await _call_anthropic(bare_model, prompt, max_tokens)
                if provider == "openai":
                    return await _call_openai(bare_model, prompt, max_tokens)
                if provider == "google":
                    return await _call_google(bare_model, prompt)
            except ImportError:
                logger.debug("SDK for %s not installed, skipping", provider)
                break  # No point retrying if SDK is missing
            except Exception as e:
                last_error = e
                if attempt < retries - 1:
                    await asyncio.sleep(2**attempt)
                    continue
                logger.warning(
                    "%s call failed after %d attempts: %s",
                    provider,
                    retries,
                    e,
                )
                break  # Move to next provider
🟡 call_judge retries deterministic failures (wrong model name on fallback providers) with exponential backoff
In call_judge, when the primary provider fails, the code falls back to other providers using the same bare_model name (src/benchflow/rewards/llm.py:97,111-134). For example, if model is claude-sonnet-4-6, it first tries Anthropic (correct), then falls back to OpenAI and Google with bare_model="claude-sonnet-4-6" — a model name neither provider supports. Each fallback provider is retried 3 times with exponential backoff (await asyncio.sleep(2**attempt) at line 126), wasting ~3 seconds per provider (~6 seconds total for 2 fallback providers) on calls that will always fail. The same issue applies to _call_google raising RuntimeError for missing API keys at src/benchflow/rewards/llm.py:181 — this deterministic error is caught by the generic except Exception at line 123 and retried instead of failing fast like ImportError does.
Prompt for agents
The call_judge function at llm.py:111-134 retries all non-ImportError exceptions with exponential backoff, including deterministic failures that will never succeed (e.g., wrong model name on fallback provider, missing API key). Two improvements: (1) In _call_google, the RuntimeError for missing API keys should be a dedicated exception type or be caught similarly to ImportError (break instead of retry). (2) Consider whether cross-provider fallback with the same bare_model name actually makes sense — it will fail for provider-specific model names like 'claude-sonnet-4-6' on OpenAI. Either remove cross-provider fallback or add model-name translation. At minimum, treat 'model not found' / 'invalid model' type errors as non-retryable.
Good point — cross-provider fallback with the same bare_model name doesn't make sense for provider-specific models (e.g. claude-sonnet-4-6 on OpenAI will always 404).
Two improvements worth making:
- Remove cross-provider fallback entirely (only retry the matched provider) — cleaner and avoids wasted retries.
- Treat RuntimeError for missing API keys (and "model not found" errors) as non-retryable, similar to ImportError.
Can address in a follow-up if desired.
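A minimal sketch of what the two improvements might look like together: retry a single provider call with exponential backoff, but fail fast on errors that can never succeed. `NonRetryableError` and `call_with_retry` are hypothetical names, not part of llm.py, and mapping a provider's "model not found" response onto that exception is left as an assumption.

```python
import asyncio


class NonRetryableError(Exception):
    """Hypothetical marker for deterministic failures (missing API key, unknown model)."""


async def call_with_retry(call, retries=3, base_delay=1.0):
    """Await call() up to `retries` times, backing off exponentially between attempts."""
    last_error = None
    for attempt in range(retries):
        try:
            return await call()
        except (ImportError, NonRetryableError):
            raise  # retrying cannot help, so surface the error immediately
        except Exception as e:  # transient errors: network, rate limits, etc.
            last_error = e
            if attempt < retries - 1:
                await asyncio.sleep(base_delay * 2**attempt)
    raise RuntimeError(f"judge call failed after {retries} attempts") from last_error


async def demo():
    calls = {"n": 0}

    async def flaky():
        # Fails twice with a transient error, then succeeds.
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("transient")
        return "verdict"

    result = await call_with_retry(flaky, base_delay=0)
    return result, calls["n"]


demo_result, demo_calls = asyncio.run(demo())
```

With the defaults (retries=3, base_delay=1.0) the waits are 1 s and 2 s, which matches the roughly 3 seconds per provider mentioned in the review comment.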
Test Results — ENG-55 LLM-as-Judge Verifier
Ran 45 existing + 16 adversarial tests targeting the 3 bug fixes, plus full regression suite.
Bug Fix Verification (critical)
Feature Tests (45 existing)
Additional Adversarial Tests (9)
Full Regression Suite
All warnings are pre-existing deprecation warnings from harbor.
- Add docs/llm-judge.md: full user-facing guide with rubric.toml reference, criterion types, aggregation strategies, dense reward events, multi-provider routing, file discovery, inline criteria, and worked examples
- Add src/benchflow/rewards/README.md: module-level guide with usage examples
- Update docs/concepts.md: reference LLM judge in Verifier primitive and add to 'Where to go next' links
…e + Rollout rename)
…, rewards, adapters (#274)

* refactor: add Sandbox protocol + Harbor adapters (ENG-48 Phase A)
  Create BenchFlow-native Sandbox protocol as parallel types alongside existing Harbor imports. No behavioral changes — existing code continues to use Harbor directly.
  New files:
  - src/benchflow/sandbox/protocol.py: ExecResult, Sandbox, ImageRef, ImageConfig, ImageBuilder
  - src/benchflow/sandbox/docker.py: DockerSandbox adapter wrapping Harbor DockerEnvironment
  - src/benchflow/sandbox/daytona.py: DaytonaSandbox adapter wrapping Harbor DaytonaEnvironment
  - tests/test_sandbox_protocol.py: 26 tests covering dataclasses, protocol conformance, and delegation
* refactor: unify Scene/Role/Turn types into _types.py (ENG-47)
  Create src/benchflow/_types.py as the single canonical source for the declarative Role, Scene, and Turn dataclasses.
  Changes:
  - New _types.py with Role (adds timeout_sec, idle_timeout_sec, skills_dir), Scene (adds parallel_group), and Turn
  - trial.py imports from _types.py instead of defining its own copies
  - _scene.py renames its internal Role to SceneRole (different fields: instruction, tools) with backward-compat alias
  - __init__.py re-exports canonical types from _types.py; adds TrialRole/TrialScene backward-compat aliases
  - runtime.py and trial_yaml.py updated to import from _types.py
  New fields default to None — no runtime behavior changes.
* fix: address review — shlex.quote in read_file, cleanup temp in write_file
  - Use shlex.quote(path) in read_file to prevent shell injection
  - Clean up temp files with os.unlink in write_file finally block
  - Move imports to module level
* fix: read_file error checking + export ImageConfig from top-level
  - read_file now raises FileNotFoundError on non-zero return code
  - Export ImageConfig alongside ImageBuilder in __init__.py
  - Add tests for read_file error behavior
* refactor: kill shim layers, single Rollout execution path (ENG-46)
  - Rename trial.py → rollout.py, Trial → Rollout, TrialConfig → RolloutConfig
  - Rename RunResult → RolloutResult in models.py
  - Create evaluation.py from job.py: Job → Evaluation, JobConfig → EvaluationConfig, JobResult → EvaluationResult
  - Create _run.py with bf.run(RolloutConfig) → RolloutResult entry point
  - Replace sdk.py with thin backward-compat shim delegating to rollout.py
  - Replace trial.py with re-export shim from rollout.py
  - Replace job.py with re-export shim from evaluation.py
  - Preserve runtime.py API, update internal imports to use Rollout
  - Update __init__.py: new public API + backward-compat aliases
  - Update test patch targets from benchflow.trial → benchflow.rollout
  - All 841 tests pass, lint clean, pre-existing typecheck errors unchanged
* style: apply ruff format to pass CI format check
* fix: resolve ty typecheck errors (Any for kwargs and sentinel)
* feat: composable Rubric + RewardFunc protocol (ENG-49)
  - Create src/benchflow/rewards/ package with:
    - RewardFunc protocol (single scoring dimension)
    - Rubric dataclass (weighted collection of RewardFuncs)
    - VerifyResult dataclass (aggregated scoring result)
    - RewardEvent dataclass (dense/terminal reward signals)
    - Built-in funcs: TestRewardFunc, LLMJudgeRewardFunc, StringMatchRewardFunc, CodeExecRewardFunc
  - Add reward_events field to RunResult (additive, no breaking changes)
  - Re-export all reward types from benchflow top-level __init__.py
  - Backward compat: Rubric([TestRewardFunc()]) wraps existing test.sh -> reward.txt flow
  - 31 tests covering all reward types, rubric scoring, weights, error handling, protocol conformance, and re-exports
* style: format rewards package and tests with ruff
* fix: disambiguate Rubric.items keys when multiple funcs share a class name
  Append _N suffix on collision so all per-func scores are preserved. Addresses Devin Review feedback on PR #266.
* feat: per-agent capabilities + agent-as-tool infrastructure (ENG-50)
  - Add Role.capabilities field for declarative agent capability lists
  - Add Sandbox.expose_ports property for inter-agent communication ports
  - Document capability boundary: BenchFlow provides sandbox + instruction + observation; agents own loops, tool protocols, and agent-as-tool
  - Update _scene.py and _types.py module docstrings with ENG-50 boundary
  - Add 16 tests covering capabilities, expose_ports, and Coder→Reviewer→Coder turn sequences
* style: apply ruff format to sandbox adapters and tests
* feat: external adapters for Inspect AI + ORS (ENG-51)
  Add benchflow.adapters package with thin format converters:
  - InspectAdapter / to_inspect_task: Scene + Rubric -> Inspect AI task dict
  - ORSAdapter / to_ors_reward: VerifyResult + RewardEvent -> ORS format dict
  No external dependencies required — these are pure format converters. Re-exported from benchflow top-level __init__.py. 14 tests covering both adapters, convenience functions, and round-trips.
* style: fix ruff format for adapters and tests
* fix: convention fixes + trial shim re-exports + __getattr__ narrowing
  - Parametrize bare dict returns to dict[str, Any] in adapters
  - Remove unnecessary @dataclass on adapter classes (no fields)
  - Add from __future__ import annotations to evaluation.py
  - Fix __getattr__ to re-raise ModuleNotFoundError for broken deps
  - Add private helper re-exports to trial.py shim for backward compat
  - Convert test_save_trajectory to async (fixes event loop deprecation)
* rename: backend → sandbox across API, CLI, docs, and tests
  Aligns naming with the v0.4 Sandbox protocol (ENG-48). The 'backend' parameter/flag/attribute on Environment and the CLI is now 'sandbox' everywhere.
  Scope:
  - runtime.py: Environment.__init__(sandbox=), .sandbox, from_task(sandbox=)
  - cli/main.py: --sandbox flag (was --backend), help text
  - _snapshot.py: docstring updates
  - tests/test_runtime.py: updated assertions
  - docs/: all references in task-authoring, CLI ref, Python API ref, running-benchmarks, integration-tests, examples
  NOT renamed (different meaning):
  - _credentials.py 'Vertex AI backend' (provider concept)
  - _provider_runtime.py / bedrock_proxy.py backend_model (LLM model)
  - _sandbox.py 'build backends' (Python packaging)
* docs: update for v0.4 — Rollout as canonical name, add Sandbox/Rubric/Adapter docs
  - README: 'Sandbox backends' → 'Sandboxes', table says Rollout not Trial
  - concepts.md: Environment no longer says 'Backed by Harbor', Trial → Rollout as core primitive, lifecycle diagram uses Rollout.run()
  - getting-started.md: uses RolloutConfig, links to rollout lifecycle
  - python-api.md: RolloutConfig (aliased as TrialConfig), Rollout (aliased as Trial), canonical imports shown first, added v0.4 types section with Sandbox protocol, Rubric + RewardFunc, Adapters, Evaluation, backward-compat aliases
* fix: correct Rubric/RewardFunc API examples in python-api.md
  Fixes 3 issues flagged by Devin Review:
  - StringMatchRewardFunc: remove non-existent 'field' parameter
  - Rubric: use reward_funcs + weights instead of items=[(func, weight)]
  - rubric.score: takes rollout_dir: Path, not sandbox=
* fix: correct evaluation.py docstring aliases, clean stale comment in __init__.py
* docs: remove Harbor migration references
  - use-cases.md: remove 'How it works vs Harbor' and 'Migration from Harbor' sections
  - scene-patterns.ipynb: remove 'Mapping to Harbor PR #1462 Concepts' cell
  - coder-reviewer-demo.py: 'Harbor-format task directory' → 'BenchFlow task directory'
  - swebench notebook: remove Harbor #1316 reference from intro
* CLI: unify sandbox flag (--env → --sandbox, -b → -e) [ENG-56] (#281)
  * cli: unify sandbox flag --env → --sandbox, -b → -e (ENG-56)
    Standardize all CLI commands to use --sandbox/-e for the sandbox parameter. Previously bench run used --sandbox/-b while bench eval create and others used --env/-e.
    - bench run: -b → -e
    - bench eval create/run/compare: --env → --sandbox
    - bench eval progress: --env → --sandbox
    - bench skills eval: --env → --sandbox
    - bench environment create: -b → -e
    - docs/examples: updated flag references
  * cli: drop -e/-b short flags, use --sandbox only (ENG-56)
    Remove all short flags (-e, -b) for the sandbox parameter across all CLI commands. The long flag --sandbox is the only way to specify the sandbox now. Updated 15 files: CLI definitions, help text examples, README, docs, skill files, integration test runner, and test examples.
  ---------
  Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* cli: deprecate bench run in favor of bench eval create (ENG-57) (#282)
  Per ENG-46, bench eval create is the canonical CLI entry point. bench run goes through the old SDK shim and is now deprecated, matching the pattern used for bench job and other legacy commands. Python API bf.run() is unaffected (backward-compat alias stays).
  Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* feat(ENG-55): first-class LLM-as-judge verifier with dense reward (#277)
  * feat(ENG-55): first-class LLM-as-judge verifier with dense reward
    - Add rubric_config.py: Pydantic models for rubric.toml (Criterion, JudgeConfig, ScoringConfig, RubricConfig) with binary/likert/numeric criterion types and score normalization
    - Add llm.py: Multi-provider LLM routing (Anthropic/OpenAI/Gemini) with call_judge(), parse_verdict(), exponential backoff retry
    - Add file_readers.py: Document extraction (pdf/docx/xlsx/pptx/txt) with find_deliverables() for rollout directory scanning
    - Replace placeholder LLMJudgeRewardFunc in builtins.py with full rubric-based judge: per-criterion scoring, prompt templates, configurable aggregation (weighted_mean/all_pass/any_pass/threshold), backward-compatible legacy mode
    - Emit per-criterion dense RewardEvents for fine-grained reward signals
    - Write evaluation_details.json alongside rollout for transparency
    - Re-export Criterion, JudgeConfig, RubricConfig, ScoringConfig, load_rubric_toml from rewards and top-level __init__
    - Add test_rubric_config.py (15 tests) and test_llm_judge.py (30 tests)
    Closes ENG-55
  * fix(ENG-55): async clients, asyncio.sleep, and score mismatch bugs
    - Replace sync SDK clients with async variants (AsyncAnthropic, AsyncOpenAI, client.aio for Google)
    - Replace time.sleep() with await asyncio.sleep() in call_judge retry loop
    - Pass actual aggregated score to _write_details instead of recomputing n_passed/total
  * docs(ENG-55): add LLM-as-judge verifier documentation
    - Add docs/llm-judge.md: full user-facing guide with rubric.toml reference, criterion types, aggregation strategies, dense reward events, multi-provider routing, file discovery, inline criteria, and worked examples
    - Add src/benchflow/rewards/README.md: module-level guide with usage examples
    - Update docs/concepts.md: reference LLM judge in Verifier primitive and add to 'Where to go next' links
  * docs: polish — remove ENG-55 ticket ref, clarify rubric.json exclusion
  ---------
  Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* CLI: remove all short flags, use full names only [ENG-74] (#284)
  * CLI: remove all short flags, use full names only [ENG-74]
    Remove all single-letter short flags (-a, -m, -o, -t, -c, -f, -s, -p, -b, -d) from every CLI command. Full flag names only (--agent, --model, --jobs-dir, --tasks-dir, --concurrency, --config, --skills-dir, --prompt, --benchmark, --dir).
    Updated across 23 files:
    - src/benchflow/cli/main.py: all typer.Option definitions + docstring examples
    - All docs/ (getting-started, running-benchmarks, cli.md, skill-eval, integration-tests, examples/)
    - README.md
    - .claude/skills/ (SKILL.md + references)
    - tests/ (integration/run.sh, test_codex_custom_provider.sh, conformance/README.md)
    - benchmarks/ (models-as-skills.md, harvey-lab/run_harvey_lab.py)
    - Test fix: test_skill_eval_dryrun.py invocation updated from -a to --agent
  * fix: replace remaining -f with --config in task-embedded SKILL.md files
  ---------
  Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* Backport PR #230 and #242 fixes to refactor/v0.4 (#286)
  * fix: apply PR #230 (skills double-deploy) and PR #242 (shell injection) fixes to v0.4
    PR #230: pass effective task path to deploy_skills, set effective_skills in already-injected branch
    PR #242: shlex.quote outbox file paths in _scene.py, harden _snapshot.py with path validation
  * fix: set effective_skills to /skills when Dockerfile already injected
    When deploy_skills detects the Dockerfile already bakes in `COPY _deps/skills /skills/`, the runtime upload is skipped but effective_skills was left as task.config.environment.skills_dir (often None), causing the subsequent linking step to silently skip distribution to agent-specific paths like /home/agent/.agents/skills. Sandbox users relied on that runtime linking since Dockerfile injection only links under /root/.
    Tighten the regression test to assert env.exec was called with the expected `ln -sfn` command.
  ---------
  Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
  Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
* fix locally
* release: 0.3.4
---------
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
…, ACPX integration (#288)

* refactor: BenchFlow v0.3.4 — unified types, Rollout, Sandbox protocol, rewards, adapters (#274)
* refactor: remove Harbor references, purge backward-compat aliases in tests, update terminology
  - Remove 'terminal-bench' from pyproject.toml keywords
  - Rename _make_harbor_mock → _make_sandbox_mock in test files
  - Update Harbor patch paths to sandbox equivalents in tests
  - Rename _from_harbor_yaml → _from_legacy_yaml in evaluation.py
  - Update docstrings/comments: Harbor → BenchFlow/legacy terminology
  - Internalize Harbor types into benchflow.task/ subpackage
  - Create sandbox backends: base.py, docker_impl.py, daytona_impl.py, modal_impl.py
  - Add compose/ subpackage with docker-compose YAML templates
* refactor: modernize file structure — align with test-drive patterns
  - sandbox/: merge docker_impl→docker, daytona_impl→daytona, base→_base
  - sandbox/: move _sandbox→lockdown, _snapshot→snapshot, _daytona_patches→_sdk_ops
  - sandbox/: move process, user, environments→services into sandbox/
  - sandbox/: restructure compose/ → _compose.py + _compose_files/
  - agents/: move _credentials→credentials, _agent_env→env, _agent_setup→install
  - _utils/: create subpackage with yaml_loader, benchmark_repos, task_authoring
  - trajectories/: move viewer, _trajectory→_capture into subpackage
  - experimental/: move mcp/ into experimental/ subpackage
  - __init__.py: purge backward-compat aliases (Trial, Job, etc.)
- Update all imports across src/, tests/, experiments/ - Run ruff check + format (0 errors) * refactor: purge backward-compat aliases + fix review bugs - Remove Trial=Rollout, TrialConfig=RolloutConfig from rollout.py - Remove Job=Evaluation, JobConfig=EvaluationConfig, JobResult=EvaluationResult from evaluation.py - Delete shim files: trial.py, job.py - Update all imports: benchflow.trial → benchflow.rollout, benchflow.job → benchflow.evaluation - Rename classes: Trial→Rollout, Job→Evaluation across tests/, benchmarks/, experiments/ - Fix runtime.py: environment_type→sandbox_type, trial_name→rollout_name (review bug) - Fix conformance scripts: same keyword arg fixes (review bug) - Clean up __all__ and return type annotations * docs: update terminology to v0.4 (Trial→Rollout, Job→Evaluation) + bump version to 0.4.0 * chore: update uv.lock for v0.4.0 * refactor: modernize RL terminology — trial_dir→rollout_dir, trial_name→rollout_name across codebase * style: format viewer.py * fix: exclude optional-dep sandbox files from ty check * fix: configure ty to ignore unresolved-import + exclude optional-dep files * fix: resolve all test failures — patch paths, protocol conformance, RL terminology * fix: resolve merge conflicts, update broken imports and stale references - Resolve merge conflict markers in rollout.py and evaluation.py - Update task_download imports to benchflow._utils.benchmark_repos (11 files) - Fix benchflow.tasks → benchflow._utils.task_authoring import - Update scene-patterns.ipynb: TrialConfig→RolloutConfig, trial→rollout - Fix Job→Evaluation rename in docs - Ruff auto-fixes (import ordering) * style: format 12 files to pass ruff format check * fix: resolve ty check errors in traces/ — dict[str, object] → dict[str, Any] * fix: add skip guards for optional dependency tests (daytona, modal) * fix: update stale imports and references in docs/benchmarks (Devin Review) * refactor: consolidate underscore modules into proper subpackages - _acp_run.py → 
acp/runtime.py - _env_setup.py → sandbox/setup.py - _provider_runtime.py → providers/runtime.py - _scoring.py → _utils/scoring.py - _scene.py → scenes.py (removed Role=SceneRole backward-compat alias) - Fix GEMINI_API_KEY auto-detection: inherit from os.environ as fallback after .env * feat: ACPX integration — acpx/<agent> protocol for headless ACP agent invocation - Add 'acpx' as a valid protocol in parse_agent_spec (acpx/claude, acpx/codex, etc.) - _acpx_wrap() decorates any registered agent to launch via acpx CLI - Installs acpx alongside the underlying agent in the sandbox - Preserves all agent env, credentials, and skill_paths - Replace harbor protocol references with acpx in tests * style: format registry.py for ruff format check * fix: suppress ty invalid-assignment for monkey-patch in _patch_docker_dind * docs: update docs and SKILL.md for v0.4 — RL-first terminology, ACPX, purge backward-compat * fix: rename trial_config_from_yaml → rollout_config_from_yaml, fix stale _run import in sdk.py * fix: docs inconsistencies found by dogfooding — GEMINI_API_KEY, daytona install note * fix: docs path evaluations/ → jobs/ to match actual output directory * fix: ruff import ordering in sdk.py * fix: ty type-check suppression for sdk.py run() return type * feat: add sandbox-daytona and sandbox-modal optional dependency groups * chore: update uv.lock for sandbox-daytona and sandbox-modal optional deps --------- Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com> Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
* refactor: add Sandbox protocol + Harbor adapters (ENG-48 Phase A)
Create BenchFlow-native Sandbox protocol as parallel types alongside
existing Harbor imports. No behavioral changes — existing code continues
to use Harbor directly.
New files:
- src/benchflow/sandbox/protocol.py: ExecResult, Sandbox, ImageRef,
ImageConfig, ImageBuilder
- src/benchflow/sandbox/docker.py: DockerSandbox adapter wrapping
Harbor DockerEnvironment
- src/benchflow/sandbox/daytona.py: DaytonaSandbox adapter wrapping
Harbor DaytonaEnvironment
- tests/test_sandbox_protocol.py: 26 tests covering dataclasses,
protocol conformance, and delegation
* refactor: unify Scene/Role/Turn types into _types.py (ENG-47)
Create src/benchflow/_types.py as the single canonical source for the
declarative Role, Scene, and Turn dataclasses.
Changes:
- New _types.py with Role (adds timeout_sec, idle_timeout_sec,
skills_dir), Scene (adds parallel_group), and Turn
- trial.py imports from _types.py instead of defining its own copies
- _scene.py renames its internal Role to SceneRole (different fields:
instruction, tools) with backward-compat alias
- __init__.py re-exports canonical types from _types.py; adds
TrialRole/TrialScene backward-compat aliases
- runtime.py and trial_yaml.py updated to import from _types.py
New fields default to None — no runtime behavior changes.
* fix: address review — shlex.quote in read_file, cleanup temp in write_file
- Use shlex.quote(path) in read_file to prevent shell injection
- Clean up temp files with os.unlink in write_file finally block
- Move imports to module level
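For reviewers who want to see what the shlex.quote change guards against, here is a minimal sketch. It assumes read_file builds a `cat` command string for the sandbox shell; the helper name `build_read_command` is hypothetical and is not taken from the BenchFlow source.

```python
import shlex


def build_read_command(path: str) -> str:
    """Build a shell command that reads `path` (hypothetical helper)."""
    # shlex.quote wraps the path in single quotes when it contains shell
    # metacharacters, so `;`, `$`, or spaces are passed through literally.
    return f"cat {shlex.quote(path)}"


# A hostile path stays a single argument instead of starting a second command.
command = build_read_command("notes.txt; rm -rf /")
print(command)
```

Without the quoting, the same input would be interpreted by the shell as two commands.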
* fix: read_file error checking + export ImageConfig from top-level
- read_file now raises FileNotFoundError on non-zero return code
- Export ImageConfig alongside ImageBuilder in __init__.py
- Add tests for read_file error behavior
* refactor: kill shim layers, single Rollout execution path (ENG-46)
- Rename trial.py → rollout.py, Trial → Rollout, TrialConfig → RolloutConfig
- Rename RunResult → RolloutResult in models.py
- Create evaluation.py from job.py: Job → Evaluation, JobConfig → EvaluationConfig, JobResult → EvaluationResult
- Create _run.py with bf.run(RolloutConfig) → RolloutResult entry point
- Replace sdk.py with thin backward-compat shim delegating to rollout.py
- Replace trial.py with re-export shim from rollout.py
- Replace job.py with re-export shim from evaluation.py
- Preserve runtime.py API, update internal imports to use Rollout
- Update __init__.py: new public API + backward-compat aliases
- Update test patch targets from benchflow.trial → benchflow.rollout
- All 841 tests pass, lint clean, pre-existing typecheck errors unchanged
* style: apply ruff format to pass CI format check
* fix: resolve ty typecheck errors (Any for kwargs and sentinel)
* feat: composable Rubric + RewardFunc protocol (ENG-49)
- Create src/benchflow/rewards/ package with:
- RewardFunc protocol (single scoring dimension)
- Rubric dataclass (weighted collection of RewardFuncs)
- VerifyResult dataclass (aggregated scoring result)
- RewardEvent dataclass (dense/terminal reward signals)
- Built-in funcs: TestRewardFunc, LLMJudgeRewardFunc,
StringMatchRewardFunc, CodeExecRewardFunc
- Add reward_events field to RunResult (additive, no breaking changes)
- Re-export all reward types from benchflow top-level __init__.py
- Backward compat: Rubric([TestRewardFunc()]) wraps existing
test.sh -> reward.txt flow
- 31 tests covering all reward types, rubric scoring, weights,
error handling, protocol conformance, and re-exports
* style: format rewards package and tests with ruff
* fix: disambiguate Rubric.items keys when multiple funcs share a class name
Append _N suffix on collision so all per-func scores are preserved.
Addresses Devin Review feedback on PR #266.
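The collision handling described above could look roughly like the sketch below. The function name `unique_keys` and the exact numbering (first occurrence bare, later ones `_1`, `_2`, ...) are assumptions for illustration; the commit only states that an `_N` suffix is appended on collision.

```python
def unique_keys(names: list[str]) -> list[str]:
    """Return one key per name, suffixing repeats (assumed numbering scheme)."""
    seen: dict[str, int] = {}
    keys: list[str] = []
    for name in names:
        count = seen.get(name, 0)
        # The first occurrence keeps the bare class name; repeats get _1, _2, ...
        keys.append(name if count == 0 else f"{name}_{count}")
        seen[name] = count + 1
    return keys


keys = unique_keys(["TestRewardFunc", "StringMatchRewardFunc", "TestRewardFunc"])
print(keys)
```

With distinct keys, two reward funcs of the same class no longer overwrite each other's score in a dict keyed by class name.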
* feat: per-agent capabilities + agent-as-tool infrastructure (ENG-50)
- Add Role.capabilities field for declarative agent capability lists
- Add Sandbox.expose_ports property for inter-agent communication ports
- Document capability boundary: BenchFlow provides sandbox + instruction +
observation; agents own loops, tool protocols, and agent-as-tool
- Update _scene.py and _types.py module docstrings with ENG-50 boundary
- Add 16 tests covering capabilities, expose_ports, and Coder→Reviewer→Coder
turn sequences
* style: apply ruff format to sandbox adapters and tests
* feat: external adapters for Inspect AI + ORS (ENG-51)
Add benchflow.adapters package with thin format converters:
- InspectAdapter / to_inspect_task: Scene + Rubric -> Inspect AI task dict
- ORSAdapter / to_ors_reward: VerifyResult + RewardEvent -> ORS format dict
No external dependencies required — these are pure format converters.
Re-exported from benchflow top-level __init__.py.
14 tests covering both adapters, convenience functions, and round-trips.
* style: fix ruff format for adapters and tests
* fix: convention fixes + trial shim re-exports + __getattr__ narrowing
- Parametrize bare dict returns to dict[str, Any] in adapters
- Remove unnecessary @dataclass on adapter classes (no fields)
- Add from __future__ import annotations to evaluation.py
- Fix __getattr__ to re-raise ModuleNotFoundError for broken deps
- Add private helper re-exports to trial.py shim for backward compat
- Convert test_save_trajectory to async (fixes event loop deprecation)
* rename: backend → sandbox across API, CLI, docs, and tests
Aligns naming with the v0.4 Sandbox protocol (ENG-48). The
'backend' parameter/flag/attribute on Environment and the CLI is
now 'sandbox' everywhere.
Scope:
- runtime.py: Environment.__init__(sandbox=), .sandbox, from_task(sandbox=)
- cli/main.py: --sandbox flag (was --backend), help text
- _snapshot.py: docstring updates
- tests/test_runtime.py: updated assertions
- docs/: all references in task-authoring, CLI ref, Python API ref,
running-benchmarks, integration-tests, examples
NOT renamed (different meaning):
- _credentials.py 'Vertex AI backend' (provider concept)
- _provider_runtime.py / bedrock_proxy.py backend_model (LLM model)
- _sandbox.py 'build backends' (Python packaging)
* docs: update for v0.4 — Rollout as canonical name, add Sandbox/Rubric/Adapter docs
- README: 'Sandbox backends' → 'Sandboxes', table says Rollout not Trial
- concepts.md: Environment no longer says 'Backed by Harbor', Trial → Rollout
as core primitive, lifecycle diagram uses Rollout.run()
- getting-started.md: uses RolloutConfig, links to rollout lifecycle
- python-api.md: RolloutConfig (aliased as TrialConfig), Rollout (aliased as
Trial), canonical imports shown first, added v0.4 types section with Sandbox
protocol, Rubric + RewardFunc, Adapters, Evaluation, backward-compat aliases
* fix: correct Rubric/RewardFunc API examples in python-api.md
Fixes 3 issues flagged by Devin Review:
- StringMatchRewardFunc: remove non-existent 'field' parameter
- Rubric: use reward_funcs + weights instead of items=[(func, weight)]
- rubric.score: takes rollout_dir: Path, not sandbox=
* fix: correct evaluation.py docstring aliases, clean stale comment in __init__.py
* docs: remove Harbor migration references
- use-cases.md: remove 'How it works vs Harbor' and 'Migration from Harbor' sections
- scene-patterns.ipynb: remove 'Mapping to Harbor PR #1462 Concepts' cell
- coder-reviewer-demo.py: 'Harbor-format task directory' → 'BenchFlow task directory'
- swebench notebook: remove Harbor #1316 reference from intro
* CLI: unify sandbox flag (--env → --sandbox, -b → -e) [ENG-56] (#281)
* cli: unify sandbox flag --env → --sandbox, -b → -e (ENG-56)
Standardize all CLI commands to use --sandbox/-e for the sandbox
parameter. Previously bench run used --sandbox/-b while bench eval
create and others used --env/-e.
- bench run: -b → -e
- bench eval create/run/compare: --env → --sandbox
- bench eval progress: --env → --sandbox
- bench skills eval: --env → --sandbox
- bench environment create: -b → -e
- docs/examples: updated flag references
* cli: drop -e/-b short flags, use --sandbox only (ENG-56)
Remove all short flags (-e, -b) for the sandbox parameter across all
CLI commands. The long flag --sandbox is the only way to specify the
sandbox now.
Updated 15 files: CLI definitions, help text examples, README,
docs, skill files, integration test runner, and test examples.
---------
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* cli: deprecate bench run in favor of bench eval create (ENG-57) (#282)
Per ENG-46, bench eval create is the canonical CLI entry point.
bench run goes through the old SDK shim and is now deprecated,
matching the pattern used for bench job and other legacy commands.
Python API bf.run() is unaffected (backward-compat alias stays).
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* feat(ENG-55): first-class LLM-as-judge verifier with dense reward (#277)
* feat(ENG-55): first-class LLM-as-judge verifier with dense reward
- Add rubric_config.py: Pydantic models for rubric.toml (Criterion,
JudgeConfig, ScoringConfig, RubricConfig) with binary/likert/numeric
criterion types and score normalization
- Add llm.py: Multi-provider LLM routing (Anthropic/OpenAI/Gemini)
with call_judge(), parse_verdict(), exponential backoff retry
- Add file_readers.py: Document extraction (pdf/docx/xlsx/pptx/txt)
with find_deliverables() for rollout directory scanning
- Replace placeholder LLMJudgeRewardFunc in builtins.py with full
rubric-based judge: per-criterion scoring, prompt templates,
configurable aggregation (weighted_mean/all_pass/any_pass/threshold),
backward-compatible legacy mode
- Emit per-criterion dense RewardEvents for fine-grained reward signals
- Write evaluation_details.json alongside rollout for transparency
- Re-export Criterion, JudgeConfig, RubricConfig, ScoringConfig,
load_rubric_toml from rewards and top-level __init__
- Add test_rubric_config.py (15 tests) and test_llm_judge.py (30 tests)
Closes ENG-55
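To make the four aggregation names concrete, here is a hedged sketch over per-criterion scores already normalized to [0, 1]. The function `aggregate`, its parameters, and the exact semantics of each strategy (in particular the 0.5 pass cutoff and `threshold` comparing the weighted mean to a cutoff) are assumptions, not the behavior confirmed by builtins.py.

```python
def aggregate(
    scores: list[float],
    weights: list[float],
    strategy: str = "weighted_mean",
    threshold: float = 0.5,
) -> float:
    """Combine normalized criterion scores into one reward (illustrative only)."""
    if not scores or len(scores) != len(weights):
        raise ValueError("scores and weights must be non-empty and equal length")
    mean = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
    if strategy == "weighted_mean":
        return mean
    if strategy == "all_pass":
        # Assumed: a criterion passes when its score reaches the cutoff.
        return 1.0 if all(s >= threshold for s in scores) else 0.0
    if strategy == "any_pass":
        return 1.0 if any(s >= threshold for s in scores) else 0.0
    if strategy == "threshold":
        # Assumed: binarize the weighted mean against the cutoff.
        return 1.0 if mean >= threshold else 0.0
    raise ValueError(f"unknown strategy: {strategy}")


print(aggregate([1.0, 0.0], [3.0, 1.0]))
```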
* fix(ENG-55): async clients, asyncio.sleep, and score mismatch bugs
- Replace sync SDK clients with async variants (AsyncAnthropic, AsyncOpenAI, client.aio for Google)
- Replace time.sleep() with await asyncio.sleep() in call_judge retry loop
- Pass actual aggregated score to _write_details instead of recomputing n_passed/total
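The time.sleep() to asyncio.sleep() change matters because a blocking sleep inside a coroutine stalls every other task on the event loop. A minimal sketch of an async retry loop with exponential backoff follows; `call_with_retry` and its parameters are hypothetical names, not the signature of call_judge.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


async def call_with_retry(
    make_call: Callable[[], Awaitable[T]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Retry an async call, doubling the wait after each failure (sketch)."""
    for attempt in range(max_attempts):
        try:
            return await make_call()
        except Exception:
            # Real code would catch only transient, provider-specific errors.
            if attempt == max_attempts - 1:
                raise
            # asyncio.sleep yields control so other rollouts keep running.
            await asyncio.sleep(base_delay * 2**attempt)
    raise RuntimeError("max_attempts must be at least 1")
```

With the defaults the waits are 1 s and then 2 s before the final attempt.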
* docs(ENG-55): add LLM-as-judge verifier documentation
- Add docs/llm-judge.md: full user-facing guide with rubric.toml reference,
criterion types, aggregation strategies, dense reward events, multi-provider
routing, file discovery, inline criteria, and worked examples
- Add src/benchflow/rewards/README.md: module-level guide with usage examples
- Update docs/concepts.md: reference LLM judge in Verifier primitive and
add to 'Where to go next' links
* docs: polish — remove ENG-55 ticket ref, clarify rubric.json exclusion
---------
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* CLI: remove all short flags, use full names only [ENG-74] (#284)
* CLI: remove all short flags, use full names only [ENG-74]
Remove all single-letter short flags (-a, -m, -o, -t, -c, -f, -s, -p, -b, -d)
from every CLI command. Full flag names only (--agent, --model, --jobs-dir,
--tasks-dir, --concurrency, --config, --skills-dir, --prompt, --benchmark, --dir).
Updated across 23 files:
- src/benchflow/cli/main.py: all typer.Option definitions + docstring examples
- All docs/ (getting-started, running-benchmarks, cli.md, skill-eval, integration-tests, examples/)
- README.md
- .claude/skills/ (SKILL.md + references)
- tests/ (integration/run.sh, test_codex_custom_provider.sh, conformance/README.md)
- benchmarks/ (models-as-skills.md, harvey-lab/run_harvey_lab.py)
- Test fix: test_skill_eval_dryrun.py invocation updated from -a to --agent
* fix: replace remaining -f with --config in task-embedded SKILL.md files
---------
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* Backport PR #230 and #242 fixes to refactor/v0.4 (#286)
* fix: apply PR #230 (skills double-deploy) and PR #242 (shell injection) fixes to v0.4
PR #230: pass effective task path to deploy_skills, set effective_skills in already-injected branch
PR #242: shlex.quote outbox file paths in _scene.py, harden _snapshot.py with path validation
* fix: set effective_skills to /skills when Dockerfile already injected
When deploy_skills detects the Dockerfile already bakes in
`COPY _deps/skills /skills/`, the runtime upload is skipped but
effective_skills was left as task.config.environment.skills_dir
(often None), causing the subsequent linking step to silently
skip distribution to agent-specific paths like
/home/agent/.agents/skills. Sandbox users relied on that runtime
linking since Dockerfile injection only links under /root/.
Tighten the regression test to assert env.exec was called with the
expected `ln -sfn` command.
---------
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
* fix locally
* release: 0.3.4
* feat: trace import — `bench tasks generate` from Claude Code + opentraces (#278)
* feat: trace import prototype — Claude Code + opentraces → BenchFlow tasks
Add benchflow.traces package for personal benchmark curation from agent traces:
- parsers: parse Claude Code JSONL sessions and opentraces v0.1-v0.3 records
- models: format-agnostic ParsedTrace intermediate representation
- task_gen: generate task.toml + instruction.md + test.sh from traces
- huggingface: download and parse trace datasets from HuggingFace Hub
- local: discover and parse local ~/.claude/projects/ sessions
- CLI: bench import {local,file,hf,list-datasets} subcommands
44 new tests, all passing.
* fix: address Devin Review findings
- _cache_dir: fall back to cwd instead of / when outside git repo
- HF cache key: include max_rows in filename to avoid stale data
- parse_claude_code_session: deterministic trace_id from session+filename
- _build_test_sh: use shlex.quote() to prevent shell injection
* refactor: align trace import CLI with noun-verb philosophy
- Replace 'bench import {local,file,hf}' with 'bench tasks generate --from-{local,file,hf}'
- Replace 'bench import list-datasets' with 'bench tasks list-sources'
- Integrate trace commands into existing tasks_app via register_tasks_generate()
- Update CLI docs with new command reference
- Follows BenchFlow resource-verb pattern: bench <resource> <verb>
* fix: address Devin Review — limit flag + format name mismatch
- Apply --limit across all sources (was only used for --from-local)
- Map 'claude-code' → 'claude-messages' in _load_hf so --format claude-code
works correctly with --from-hf
* docs: dogfood bench tasks generate in task-authoring and getting-started guides
* fix: generated tasks now pass bench tasks check — add Dockerfile, tests/ dir, reward path, difficulty scaling, TOML safety
* audit: fix task pipeline issues + normalize docstrings across traces package
Pipeline fixes:
- Fix timeout auto-scaling: batch generator default changed from 300 to 0
(300 > 0 prevented difficulty-based scaling from ever triggering)
- Add [task] name section to generated task.toml (trace-import/<slug>)
- Add build_timeout_sec and storage_mb to [environment] section
- Fix test.sh indentation (remove extra leading spaces from textwrap)
Docstring normalization:
- Add missing docstrings to CLI helper functions (_load_local, _load_file,
_load_hf)
- Improve docstring clarity across traces package (detect_format, print
helpers, HF parsers)
- Use reST code block syntax for CLI examples
48/48 tests pass, lint clean.
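The timeout bug above is easy to miss, so here is a small sketch of the assumed logic: a positive timeout is treated as an explicit override, which means a default of 300 always wins and difficulty scaling never runs. The function `resolve_timeout` and the easy/medium values are invented for illustration; only the 1200 s figure for a hard task appears elsewhere in this log.

```python
# Assumed mapping from difficulty to timeout in seconds (illustrative values).
DIFFICULTY_TIMEOUT_SEC = {"easy": 300, "medium": 600, "hard": 1200}


def resolve_timeout(requested_sec: int, difficulty: str) -> int:
    """Pick the task timeout, scaling by difficulty when none is requested."""
    # Any positive value counts as an explicit override.
    if requested_sec > 0:
        return requested_sec
    return DIFFICULTY_TIMEOUT_SEC[difficulty]


# Old default of 300 masks the scaling; the new default of 0 lets it apply.
print(resolve_timeout(300, "hard"), resolve_timeout(0, "hard"))
```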
* fix: multi-session JSONL collapse, outcome detection, zero-tool-call filter
Dogfooding revealed 3 quality issues:
1. Critical: parse_claude_code_session() merged all sessions in a
multi-session JSONL file into one trace, contaminating difficulty,
instruction, and verifier. Added parse_claude_code_file() that
splits by sessionId before parsing each group.
2. Outcome detection missed common completion verbs (fixed, refactored,
built, created, updated, implemented, added).
3. Traces with zero tool calls (pure explanations) produced useless
tasks with pass-through verifiers. Now filtered out in batch mode.
Before: 5-session file → 1 task (hard, 1200s, 19 files listed)
After: 5-session file → 4 tasks (easy/medium, correct files each)
Tests: 55 passing (was 48), lint clean.
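The first fix, splitting a multi-session file before parsing, can be sketched as below. This is not the code of parse_claude_code_file(); `split_by_session` is a hypothetical helper, and the fallback key for records without a `sessionId` is an assumption.

```python
import json
from collections import defaultdict


def split_by_session(lines: list[str]) -> dict[str, list[dict]]:
    """Group JSONL records by their sessionId field (illustrative sketch)."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # Assumed fallback: records without a sessionId share one bucket.
        groups[record.get("sessionId", "unknown")].append(record)
    return dict(groups)


sample = [
    '{"sessionId": "a", "type": "user"}',
    '{"sessionId": "b", "type": "user"}',
    '{"sessionId": "a", "type": "assistant"}',
]
sessions = split_by_session(sample)
print({key: len(value) for key, value in sessions.items()})
```

Each group can then be parsed as its own trace, which keeps difficulty, instruction, and verifier from being mixed across sessions.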
* fix: test.sh file check cap matches instruction.md (10→20)
* fix: real-trace dogfood — path relativization, session artifact cleanup, HF parser robustness
- Add _relativize_path() to convert absolute workspace paths to relative project paths
- Add _clean_user_prompt() to strip session continuation boilerplate
- HF parser: support messages_json key, strip system-reminders, handle tool_result blocks, infer outcome
- CLI: add claude-messages format detection and routing for --from-file
- Add _parse_claude_requests_row() for cc-traces-weka metadata format
- All 55 tests pass, lint clean
* fix: verifier robustness — glob patterns for timestamp-bearing paths
Paths with date/timestamp segments (e.g. migrations/2025-11-28-131040_create_invoices/up.sql)
now use compgen -G glob patterns in test.sh instead of exact [ -f ] checks.
This lets verifiers tolerate agent-generated timestamp variants.
Also updates instruction.md to show globbed paths for dynamic segments.
Adds 10 new tests for _globify_path, _has_dynamic_segments, and
end-to-end verifier pattern selection.
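A rough sketch of the glob selection described above follows. The regex, the function bodies, and the helper name `file_check` are assumptions; the real _globify_path and _has_dynamic_segments may detect more timestamp shapes than this one.

```python
import re
import shlex

# Assumed shape: a date such as 2025-11-28, optionally followed by -HHMMSS.
_TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}(?:-\d{6})?")


def globify_path(path: str) -> str:
    """Replace timestamp-like segments with a shell wildcard."""
    return _TIMESTAMP.sub("*", path)


def file_check(path: str) -> str:
    """Emit the test.sh check for one expected file (hypothetical helper)."""
    pattern = globify_path(path)
    if pattern != path:
        # compgen -G expands the pattern itself, so the pattern stays quoted.
        return f"compgen -G {shlex.quote(pattern)} > /dev/null"
    return f"[ -f {shlex.quote(path)} ]"


print(file_check("migrations/2025-11-28-131040_create_invoices/up.sql"))
print(file_check("src/main.py"))
```

Static paths keep the exact `[ -f ]` check, so only paths with dynamic segments are loosened.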
* fix: real-trace parser — user_prompt+gitdiff format, git context, git-diff verifier
- Handle cc-traces-merged rows with user_prompt + gitdiff (no messages_json)
- Extract TASK DESCRIPTION from structured prompts, strip CRITICAL INSTRUCTIONS boilerplate
- Extract git repo/commit from prompt, populate GitContext
- Dockerfile clones repo at base commit when git context available
- Verifier uses git diff to check files were modified (not just exist)
- Fix source_model='None' string bug
---------
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* refactor: BenchFlow v0.4 — RL-first terminology, module consolidation, ACPX integration (#288)
* refactor: BenchFlow v0.3.4 — unified types, Rollout, Sandbox protocol, rewards, adapters (#274)
* refactor: add Sandbox protocol + Harbor adapters (ENG-48 Phase A)
Create BenchFlow-native Sandbox protocol as parallel types alongside
existing Harbor imports. No behavioral changes — existing code continues
to use Harbor directly.
New files:
- src/benchflow/sandbox/protocol.py: ExecResult, Sandbox, ImageRef,
ImageConfig, ImageBuilder
- src/benchflow/sandbox/docker.py: DockerSandbox adapter wrapping
Harbor DockerEnvironment
- src/benchflow/sandbox/daytona.py: DaytonaSandbox adapter wrapping
Harbor DaytonaEnvironment
- tests/test_sandbox_protocol.py: 26 tests covering dataclasses,
protocol conformance, and delegation
* refactor: unify Scene/Role/Turn types into _types.py (ENG-47)
Create src/benchflow/_types.py as the single canonical source for the
declarative Role, Scene, and Turn dataclasses.
Changes:
- New _types.py with Role (adds timeout_sec, idle_timeout_sec,
skills_dir), Scene (adds parallel_group), and Turn
- trial.py imports from _types.py instead of defining its own copies
- _scene.py renames its internal Role to SceneRole (different fields:
instruction, tools) with backward-compat alias
- __init__.py re-exports canonical types from _types.py; adds
TrialRole/TrialScene backward-compat aliases
- runtime.py and trial_yaml.py updated to import from _types.py
New fields default to None — no runtime behavior changes.
* fix: address review — shlex.quote in read_file, cleanup temp in write_file
- Use shlex.quote(path) in read_file to prevent shell injection
- Clean up temp files with os.unlink in write_file finally block
- Move imports to module level
* fix: read_file error checking + export ImageConfig from top-level
- read_file now raises FileNotFoundError on non-zero return code
- Export ImageConfig alongside ImageBuilder in __init__.py
- Add tests for read_file error behavior
* refactor: kill shim layers, single Rollout execution path (ENG-46)
- Rename trial.py → rollout.py, Trial → Rollout, TrialConfig → RolloutConfig
- Rename RunResult → RolloutResult in models.py
- Create evaluation.py from job.py: Job → Evaluation, JobConfig → EvaluationConfig, JobResult → EvaluationResult
- Create _run.py with bf.run(RolloutConfig) → RolloutResult entry point
- Replace sdk.py with thin backward-compat shim delegating to rollout.py
- Replace trial.py with re-export shim from rollout.py
- Replace job.py with re-export shim from evaluation.py
- Preserve runtime.py API, update internal imports to use Rollout
- Update __init__.py: new public API + backward-compat aliases
- Update test patch targets from benchflow.trial → benchflow.rollout
- All 841 tests pass, lint clean, pre-existing typecheck errors unchanged
* style: apply ruff format to pass CI format check
* fix: resolve ty typecheck errors (Any for kwargs and sentinel)
* feat: composable Rubric + RewardFunc protocol (ENG-49)
- Create src/benchflow/rewards/ package with:
- RewardFunc protocol (single scoring dimension)
- Rubric dataclass (weighted collection of RewardFuncs)
- VerifyResult dataclass (aggregated scoring result)
- RewardEvent dataclass (dense/terminal reward signals)
- Built-in funcs: TestRewardFunc, LLMJudgeRewardFunc,
StringMatchRewardFunc, CodeExecRewardFunc
- Add reward_events field to RunResult (additive, no breaking changes)
- Re-export all reward types from benchflow top-level __init__.py
- Backward compat: Rubric([TestRewardFunc()]) wraps existing
test.sh -> reward.txt flow
- 31 tests covering all reward types, rubric scoring, weights,
error handling, protocol conformance, and re-exports
* style: format rewards package and tests with ruff
* fix: disambiguate Rubric.items keys when multiple funcs share a class name
Append _N suffix on collision so all per-func scores are preserved.
Addresses Devin Review feedback on PR #266.
* feat: per-agent capabilities + agent-as-tool infrastructure (ENG-50)
- Add Role.capabilities field for declarative agent capability lists
- Add Sandbox.expose_ports property for inter-agent communication ports
- Document capability boundary: BenchFlow provides sandbox + instruction +
observation; agents own loops, tool protocols, and agent-as-tool
- Update _scene.py and _types.py module docstrings with ENG-50 boundary
- Add 16 tests covering capabilities, expose_ports, and Coder→Reviewer→Coder
turn sequences
* style: apply ruff format to sandbox adapters and tests
* feat: external adapters for Inspect AI + ORS (ENG-51)
Add benchflow.adapters package with thin format converters:
- InspectAdapter / to_inspect_task: Scene + Rubric -> Inspect AI task dict
- ORSAdapter / to_ors_reward: VerifyResult + RewardEvent -> ORS format dict
No external dependencies required — these are pure format converters.
Re-exported from benchflow top-level __init__.py.
14 tests covering both adapters, convenience functions, and round-trips.
* style: fix ruff format for adapters and tests
* fix: convention fixes + trial shim re-exports + __getattr__ narrowing
- Parametrize bare dict returns to dict[str, Any] in adapters
- Remove unnecessary @dataclass on adapter classes (no fields)
- Add from __future__ import annotations to evaluation.py
- Fix __getattr__ to re-raise ModuleNotFoundError for broken deps
- Add private helper re-exports to trial.py shim for backward compat
- Convert test_save_trajectory to async (fixes event loop deprecation)
* rename: backend → sandbox across API, CLI, docs, and tests
Aligns naming with the v0.4 Sandbox protocol (ENG-48). The
'backend' parameter/flag/attribute on Environment and the CLI is
now 'sandbox' everywhere.
Scope:
- runtime.py: Environment.__init__(sandbox=), .sandbox, from_task(sandbox=)
- cli/main.py: --sandbox flag (was --backend), help text
- _snapshot.py: docstring updates
- tests/test_runtime.py: updated assertions
- docs/: all references in task-authoring, CLI ref, Python API ref,
running-benchmarks, integration-tests, examples
NOT renamed (different meaning):
- _credentials.py 'Vertex AI backend' (provider concept)
- _provider_runtime.py / bedrock_proxy.py backend_model (LLM model)
- _sandbox.py 'build backends' (Python packaging)
* docs: update for v0.4 — Rollout as canonical name, add Sandbox/Rubric/Adapter docs
- README: 'Sandbox backends' → 'Sandboxes', table says Rollout not Trial
- concepts.md: Environment no longer says 'Backed by Harbor', Trial → Rollout
as core primitive, lifecycle diagram uses Rollout.run()
- getting-started.md: uses RolloutConfig, links to rollout lifecycle
- python-api.md: RolloutConfig (aliased as TrialConfig), Rollout (aliased as
Trial), canonical imports shown first, added v0.4 types section with Sandbox
protocol, Rubric + RewardFunc, Adapters, Evaluation, backward-compat aliases
* fix: correct Rubric/RewardFunc API examples in python-api.md
Fixes 3 issues flagged by Devin Review:
- StringMatchRewardFunc: remove non-existent 'field' parameter
- Rubric: use reward_funcs + weights instead of items=[(func, weight)]
- rubric.score: takes rollout_dir: Path, not sandbox=
* fix: correct evaluation.py docstring aliases, clean stale comment in __init__.py
* docs: remove Harbor migration references
- use-cases.md: remove 'How it works vs Harbor' and 'Migration from Harbor' sections
- scene-patterns.ipynb: remove 'Mapping to Harbor PR #1462 Concepts' cell
- coder-reviewer-demo.py: 'Harbor-format task directory' → 'BenchFlow task directory'
- swebench notebook: remove Harbor #1316 reference from intro
* CLI: unify sandbox flag (--env → --sandbox, -b → -e) [ENG-56] (#281)
* cli: unify sandbox flag --env → --sandbox, -b → -e (ENG-56)
Standardize all CLI commands to use --sandbox/-e for the sandbox
parameter. Previously bench run used --sandbox/-b while bench eval
create and others used --env/-e.
- bench run: -b → -e
- bench eval create/run/compare: --env → --sandbox
- bench eval progress: --env → --sandbox
- bench skills eval: --env → --sandbox
- bench environment create: -b → -e
- docs/examples: updated flag references
* cli: drop -e/-b short flags, use --sandbox only (ENG-56)
Remove all short flags (-e, -b) for the sandbox parameter across all
CLI commands. The long flag --sandbox is the only way to specify the
sandbox now.
Updated 15 files: CLI definitions, help text examples, README,
docs, skill files, integration test runner, and test examples.
---------
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* cli: deprecate bench run in favor of bench eval create (ENG-57) (#282)
Per ENG-46, bench eval create is the canonical CLI entry point.
bench run goes through the old SDK shim and is now deprecated,
matching the pattern used for bench job and other legacy commands.
Python API bf.run() is unaffected (backward-compat alias stays).
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* feat(ENG-55): first-class LLM-as-judge verifier with dense reward (#277)
* feat(ENG-55): first-class LLM-as-judge verifier with dense reward
- Add rubric_config.py: Pydantic models for rubric.toml (Criterion,
JudgeConfig, ScoringConfig, RubricConfig) with binary/likert/numeric
criterion types and score normalization
- Add llm.py: Multi-provider LLM routing (Anthropic/OpenAI/Gemini)
with call_judge(), parse_verdict(), exponential backoff retry
- Add file_readers.py: Document extraction (pdf/docx/xlsx/pptx/txt)
with find_deliverables() for rollout directory scanning
- Replace placeholder LLMJudgeRewardFunc in builtins.py with full
rubric-based judge: per-criterion scoring, prompt templates,
configurable aggregation (weighted_mean/all_pass/any_pass/threshold),
backward-compatible legacy mode
- Emit per-criterion dense RewardEvents for fine-grained reward signals
- Write evaluation_details.json alongside rollout for transparency
- Re-export Criterion, JudgeConfig, RubricConfig, ScoringConfig,
load_rubric_toml from rewards and top-level __init__
- Add test_rubric_config.py (15 tests) and test_llm_judge.py (30 tests)
Closes ENG-55
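
A minimal sketch of how per-criterion scores might be normalized and aggregated under the four strategies named in this commit message. Everything here is an assumption for illustration: the `ScoredCriterion` container, the `normalize` and `aggregate` helpers, the 1-5 default range, the 0.5 pass cutoff, and the 0.7 threshold are hypothetical and are not taken from `rubric_config.py` or `builtins.py`.

```python
from dataclasses import dataclass


@dataclass
class ScoredCriterion:
    """Hypothetical container, not the Criterion model from rubric_config.py."""

    score: float  # already normalized to [0, 1]
    weight: float = 1.0


def normalize(raw: float, kind: str, lo: float = 1.0, hi: float = 5.0) -> float:
    """Map a raw judge score to [0, 1]; the 1-5 default range is an assumption."""
    if kind == "binary":
        return 1.0 if raw else 0.0
    if kind in ("likert", "numeric"):
        # Linear rescale from [lo, hi], clamped to [0, 1].
        return min(1.0, max(0.0, (raw - lo) / (hi - lo)))
    raise ValueError(f"unknown criterion type: {kind}")


def aggregate(items, strategy="weighted_mean", pass_cutoff=0.5, threshold=0.7):
    """Combine normalized scores; cutoff and threshold defaults are assumptions."""
    if not items:
        raise ValueError("no criteria to aggregate")
    mean = sum(c.score * c.weight for c in items) / sum(c.weight for c in items)
    if strategy == "weighted_mean":
        return mean
    if strategy == "all_pass":
        return 1.0 if all(c.score >= pass_cutoff for c in items) else 0.0
    if strategy == "any_pass":
        return 1.0 if any(c.score >= pass_cutoff for c in items) else 0.0
    if strategy == "threshold":
        return 1.0 if mean >= threshold else 0.0
    raise ValueError(f"unknown strategy: {strategy}")
```

With weights 3 and 1 and scores 1.0 and 0.0, the weighted mean is 0.75 while a plain passed/total count would give 0.5, which is why an aggregate and a separately recomputed ratio can disagree.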
* fix(ENG-55): async clients, asyncio.sleep, and score mismatch bugs
- Replace sync SDK clients with async variants (AsyncAnthropic, AsyncOpenAI, client.aio for Google)
- Replace time.sleep() with await asyncio.sleep() in call_judge retry loop
- Pass actual aggregated score to _write_details instead of recomputing n_passed/total
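
The `time.sleep()` to `asyncio.sleep()` change can be illustrated with a small retry helper. This is a sketch under stated assumptions: `retry_async`, its parameters, and the doubling schedule are hypothetical and do not reproduce the retry loop in `call_judge`.

```python
import asyncio


async def retry_async(fn, *, attempts=3, base_delay=1.0, sleep=asyncio.sleep):
    """Await fn(), retrying on any exception with exponential backoff.

    Hypothetical helper: the attempt count and delay schedule are assumptions.
    """
    for attempt in range(attempts):
        try:
            return await fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries, surface the last error
            # Awaiting the sleep yields control to the event loop, so other
            # tasks keep running; time.sleep() here would block all of them.
            await sleep(base_delay * 2**attempt)
```

The `sleep` parameter exists only so the delay can be replaced in tests; callers would normally leave it at the default.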
* docs(ENG-55): add LLM-as-judge verifier documentation
- Add docs/llm-judge.md: full user-facing guide with rubric.toml reference,
criterion types, aggregation strategies, dense reward events, multi-provider
routing, file discovery, inline criteria, and worked examples
- Add src/benchflow/rewards/README.md: module-level guide with usage examples
- Update docs/concepts.md: reference LLM judge in Verifier primitive and
add to 'Where to go next' links
* docs: polish — remove ENG-55 ticket ref, clarify rubric.json exclusion
---------
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* CLI: remove all short flags, use full names only [ENG-74] (#284)
* CLI: remove all short flags, use full names only [ENG-74]
Remove all single-letter short flags (-a, -m, -o, -t, -c, -f, -s, -p, -b, -d)
from every CLI command. Full flag names only (--agent, --model, --jobs-dir,
--tasks-dir, --concurrency, --config, --skills-dir, --prompt, --benchmark, --dir).
Updated across 23 files:
- src/benchflow/cli/main.py: all typer.Option definitions + docstring examples
- All docs/ (getting-started, running-benchmarks, cli.md, skill-eval, integration-tests, examples/)
- README.md
- .claude/skills/ (SKILL.md + references)
- tests/ (integration/run.sh, test_codex_custom_provider.sh, conformance/README.md)
- benchmarks/ (models-as-skills.md, harvey-lab/run_harvey_lab.py)
- Test fix: test_skill_eval_dryrun.py invocation updated from -a to --agent
* fix: replace remaining -f with --config in task-embedded SKILL.md files
---------
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* Backport PR #230 and #242 fixes to refactor/v0.4 (#286)
* fix: apply PR #230 (skills double-deploy) and PR #242 (shell injection) fixes to v0.4
PR #230: pass effective task path to deploy_skills, set effective_skills in already-injected branch
PR #242: shlex.quote outbox file paths in _scene.py, harden _snapshot.py with path validation
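
For context, the quoting technique named in the PR #242 fix looks roughly like this. The `build_cat_command` helper and the `cat` command are hypothetical illustrations, not the code in `_scene.py`; only `shlex.quote` itself is the standard-library call.

```python
import shlex


def build_cat_command(path: str) -> str:
    """Build a shell command that reads one file (hypothetical helper).

    shlex.quote wraps the path in single quotes when it contains characters
    the shell would otherwise interpret, so it is passed as a single argument.
    """
    return f"cat {shlex.quote(path)}"


print(build_cat_command("outbox/msg.json"))   # cat outbox/msg.json
print(build_cat_command("out.txt; rm -rf /"))  # cat 'out.txt; rm -rf /'
```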
* fix: set effective_skills to /skills when Dockerfile already injected
When deploy_skills detects the Dockerfile already bakes in
`COPY _deps/skills /skills/`, the runtime upload is skipped but
effective_skills was left as task.config.environment.skills_dir
(often None), causing the subsequent linking step to silently
skip distribution to agent-specific paths like
/home/agent/.agents/skills. Sandbox users relied on that runtime
linking since Dockerfile injection only links under /root/.
Tighten the regression test to assert env.exec was called with the
expected `ln -sfn` command.
---------
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
* fix locally
* release: 0.3.4
---------
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
* refactor: remove Harbor references, purge backward-compat aliases in tests, update terminology
- Remove 'terminal-bench' from pyproject.toml keywords
- Rename _make_harbor_mock → _make_sandbox_mock in test files
- Update Harbor patch paths to sandbox equivalents in tests
- Rename _from_harbor_yaml → _from_legacy_yaml in evaluation.py
- Update docstrings/comments: Harbor → BenchFlow/legacy terminology
- Internalize Harbor types into benchflow.task/ subpackage
- Create sandbox backends: base.py, docker_impl.py, daytona_impl.py, modal_impl.py
- Add compose/ subpackage with docker-compose YAML templates
* refactor: modernize file structure — align with test-drive patterns
- sandbox/: merge docker_impl→docker, daytona_impl→daytona, base→_base
- sandbox/: move _sandbox→lockdown, _snapshot→snapshot, _daytona_patches→_sdk_ops
- sandbox/: move process, user, environments→services into sandbox/
- sandbox/: restructure compose/ → _compose.py + _compose_files/
- agents/: move _credentials→credentials, _agent_env→env, _agent_setup→install
- _utils/: create subpackage with yaml_loader, benchmark_repos, task_authoring
- trajectories/: move viewer, _trajectory→_capture into subpackage
- experimental/: move mcp/ into experimental/ subpackage
- __init__.py: purge backward-compat aliases (Trial, Job, etc.)
- Update all imports across src/, tests/, experiments/
- Run ruff check + format (0 errors)
* refactor: purge backward-compat aliases + fix review bugs
- Remove Trial=Rollout, TrialConfig=RolloutConfig from rollout.py
- Remove Job=Evaluation, JobConfig=EvaluationConfig, JobResult=EvaluationResult from evaluation.py
- Delete shim files: trial.py, job.py
- Update all imports: benchflow.trial → benchflow.rollout, benchflow.job → benchflow.evaluation
- Rename classes: Trial→Rollout, Job→Evaluation across tests/, benchmarks/, experiments/
- Fix runtime.py: environment_type→sandbox_type, trial_name→rollout_name (review bug)
- Fix conformance scripts: same keyword arg fixes (review bug)
- Clean up __all__ and return type annotations
* docs: update terminology to v0.4 (Trial→Rollout, Job→Evaluation) + bump version to 0.4.0
* chore: update uv.lock for v0.4.0
* refactor: modernize RL terminology — trial_dir→rollout_dir, trial_name→rollout_name across codebase
* style: format viewer.py
* fix: exclude optional-dep sandbox files from ty check
* fix: configure ty to ignore unresolved-import + exclude optional-dep files
* fix: resolve all test failures — patch paths, protocol conformance, RL terminology
* fix: resolve merge conflicts, update broken imports and stale references
- Resolve merge conflict markers in rollout.py and evaluation.py
- Update task_download imports to benchflow._utils.benchmark_repos (11 files)
- Fix benchflow.tasks → benchflow._utils.task_authoring import
- Update scene-patterns.ipynb: TrialConfig→RolloutConfig, trial→rollout
- Fix Job→Evaluation rename in docs
- Ruff auto-fixes (import ordering)
* style: format 12 files to pass ruff format check
* fix: resolve ty check errors in traces/ — dict[str, object] → dict[str, Any]
* fix: add skip guards for optional dependency tests (daytona, modal)
* fix: update stale imports and references in docs/benchmarks (Devin Review)
* refactor: consolidate underscore modules into proper subpackages
- _acp_run.py → acp/runtime.py
- _env_setup.py → sandbox/setup.py
- _provider_runtime.py → providers/runtime.py
- _scoring.py → _utils/scoring.py
- _scene.py → scenes.py (removed Role=SceneRole backward-compat alias)
- Fix GEMINI_API_KEY auto-detection: inherit from os.environ as fallback after .env
* feat: ACPX integration — acpx/<agent> protocol for headless ACP agent invocation
- Add 'acpx' as a valid protocol in parse_agent_spec (acpx/claude, acpx/codex, etc.)
- _acpx_wrap() decorates any registered agent to launch via acpx CLI
- Installs acpx alongside the underlying agent in the sandbox
- Preserves all agent env, credentials, and skill_paths
- Replace harbor protocol references with acpx in tests
* style: format registry.py for ruff format check
* fix: suppress ty invalid-assignment for monkey-patch in _patch_docker_dind
* docs: update docs and SKILL.md for v0.4 — RL-first terminology, ACPX, purge backward-compat
* fix: rename trial_config_from_yaml → rollout_config_from_yaml, fix stale _run import in sdk.py
* fix: docs inconsistencies found by dogfooding — GEMINI_API_KEY, daytona install note
* fix: docs path evaluations/ → jobs/ to match actual output directory
* fix: ruff import ordering in sdk.py
* fix: ty type-check suppression for sdk.py run() return type
* feat: add sandbox-daytona and sandbox-modal optional dependency groups
* chore: update uv.lock for sandbox-daytona and sandbox-modal optional deps
---------
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
* skill eval -> refactor (#293)
* fix: address v0.4 dogfood blockers
* test: cover skill eval nested results
* chore: ignore playwright mcp artifacts
* fix: validate config eval agent protocol
* test: add adapter release evidence checker
* test: wire adapter evidence into release runner
* fix: address ENG-91 dogfood regressions
* fix: record role metadata and timeouts
* fix: align dogfood cli reports
* docs: document environment cleanup command
* docs: remove stale harbor task path examples
* Add citation-management skill eval under skills
* Update src/benchflow/cli/main.py
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
---------
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
Summary
Replaces the placeholder `LLMJudgeRewardFunc` (which only read a pre-computed score from `llm_judge_score.txt`) with a full rubric-based LLM judge implementation. Adds dense reward emission (per-criterion `RewardEvent`) as an extension.

New modules:
- `rewards/rubric_config.py` - Pydantic dataclasses for `rubric.toml` parsing: `Criterion` (binary/likert/numeric types with normalization), `JudgeConfig`, `ScoringConfig`, `RubricConfig`
- `rewards/llm.py` - Multi-provider LLM routing (`claude-*` → Anthropic, `gpt-*`/`o1*`/`o3*` → OpenAI, `gemini-*` → Google) with exponential backoff retry and JSON verdict parsing
- `rewards/file_readers.py` - Document text extraction (`pdf`, `docx`, `xlsx`, `pptx`, plain text) with `find_deliverables()` for rollout directory scanning

Modified `LLMJudgeRewardFunc`:
- Loads criteria from `rubric.toml` (explicit path, inline criteria list, or auto-discovery), evaluates each criterion individually via LLM, normalizes scores, and aggregates via a configurable strategy (`weighted_mean`, `all_pass`, `any_pass`, `threshold`)
- Falls back to reading `llm_judge_score.txt` when no rubric is available
- Emits a `RewardEvent(type="dense")` per criterion and writes `evaluation_details.json` to the rollout directory

Exports: `Criterion`, `JudgeConfig`, `RubricConfig`, `ScoringConfig`, `load_rubric_toml` added to `rewards/__init__` and top-level `benchflow/__init__`.

Documentation:
- `docs/llm-judge.md` - full user-facing guide: `rubric.toml` reference, criterion types (binary/likert/numeric), aggregation strategies, dense reward events, multi-provider routing, file discovery, inline criteria, and worked examples (Harvey LAB legal task)
- `src/benchflow/rewards/README.md` - module-level developer guide
- `docs/concepts.md` - updated Verifier primitive to reference LLM judge; added to "Where to go next"

Tests: 45 new tests: 15 for rubric config parsing/normalization, 30 for the judge pipeline (verdict parsing, rubric/inline/auto-discovery modes, aggregation strategies, dense events, error handling, evaluation details output). All existing 1008 tests continue to pass.
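
The prefix-based routing described for `rewards/llm.py` can be sketched as a small lookup. The `route_provider` name, the returned provider strings, and the choice to raise on an unknown model are assumptions; the summary only states which prefixes map to which provider.

```python
def route_provider(model: str) -> str:
    """Pick a provider from the model-name prefix (hypothetical helper)."""
    prefixes = {
        "claude-": "anthropic",
        "gpt-": "openai",
        "o1": "openai",
        "o3": "openai",
        "gemini-": "google",
    }
    for prefix, provider in prefixes.items():
        if model.startswith(prefix):
            return provider
    # Assumption: how llm.py treats unrecognized model names is not stated.
    raise ValueError(f"no provider configured for model {model!r}")


print(route_provider("claude-sonnet-4-6"))      # anthropic
print(route_provider("gemini-3.1-flash-lite"))  # google
```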
Updates since last revision
Revision 4: Docs review & polish
Ran the docs-review skill to dogfood all three doc files against the codebase. All code examples, API imports, default values, constructor signatures, cross-references, and links were verified correct. Fixed two polish items:
- Removed the `ENG-55` ticket reference from `rewards/README.md` (ticket-tracking language that rots for future readers)
- Clarified the `rubric.json` exclusion in `docs/llm-judge.md` as "internal metadata files" to avoid confusion with `rubric.toml`

Revision 3: Documentation
Added user-facing docs (`docs/llm-judge.md`) and module-level README (`src/benchflow/rewards/README.md`). Updated `docs/concepts.md` to cross-reference the LLM judge from the Verifier primitive and the navigation links.

Revision 2: Bug fixes (3)
- `_write_details` score mismatch: `_write_details` previously computed its own `n_passed / total` score, diverging from the actual `_aggregate()` result when weights are non-uniform or non-binary criteria are used. It now receives the aggregated score as a parameter.
- `time.sleep()` in async `call_judge`: replaced with `await asyncio.sleep()` so the retry backoff no longer blocks the event loop.
- `_call_anthropic` now uses `AsyncAnthropic`, `_call_openai` uses `AsyncOpenAI`, and `_call_google` uses `client.aio.models.generate_content()` instead of their synchronous counterparts.

Review & Testing Checklist for Human
- … `call_judge`. The `AsyncAnthropic`, `AsyncOpenAI`, and `client.aio.models.generate_content()` call signatures should be manually verified with at least one real provider (e.g. Claude) to confirm the async interfaces match expectations end-to-end.
- `JudgeConfig.reference` and `prompt_template` are parsed but not wired: the `rubric.toml` reference docs list these fields, but `_build_prompt` and `_rubric_score` don't use them. Either wire them in or remove them from the docs to avoid misleading users.
- `model` default changed from `"gemini-3.1-flash-lite"` to `"claude-sonnet-4-6"` and `prompt` is now optional: verify no existing callers depend on the old defaults or the required `prompt` parameter.
- `_load_rubric` checks `rollout_dir / ".." / "rubric.toml"`: verify this parent-directory traversal is acceptable for your deployment trust model.

Suggested test plan: Create a minimal `rubric.toml` with 2–3 criteria, point `LLMJudgeRewardFunc(rubric_path=...)` at it with a real API key, and run `score()` against a sample rollout directory. Verify `evaluation_details.json` is written with the correct aggregated score and that dense `RewardEvent`s are emitted per criterion.

Notes
- The file-reader dependencies (`pdfplumber`, `openpyxl`, `python-docx`, `markitdown`) are not declared in `pyproject.toml`; they degrade gracefully with `(unsupported: ...)` messages. Consider adding them as an optional dependency group if you want PDF/DOCX support out of the box.
- The `parse_verdict` JSON extractor uses regex-based brace matching as a fallback when code-fence extraction fails; it works for well-formed responses but could be fragile with deeply nested or malformed LLM output.

Link to Devin session: https://app.devin.ai/sessions/dcd907c049b6437499dba8a7280bef93
Requested by: @xdotli
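
The two-stage extraction mentioned in the notes on `parse_verdict` (code fence first, brace matching as fallback) can be sketched as below. This is not the `parse_verdict` implementation: `extract_json` is a hypothetical helper, and the fallback uses a simple depth counter rather than the regex-based matching the notes describe.

```python
import json
import re

_TICKS = "`" * 3  # built at runtime to avoid a literal fence in this example
_FENCE = re.compile(_TICKS + r"(?:json)?\s*(\{.*?\})\s*" + _TICKS, re.DOTALL)


def extract_json(text: str) -> dict:
    """Extract one JSON object from an LLM reply (hypothetical helper)."""
    fenced = _FENCE.search(text)
    if fenced:
        return json.loads(fenced.group(1))
    # Fallback: scan from the first '{' to its matching '}' by tracking depth.
    # Braces inside string values are counted too, which is one way this
    # kind of fallback can break on unusual output.
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(text[start : i + 1])
    raise ValueError("unbalanced braces")
```

A stricter variant could try `json.JSONDecoder().raw_decode` from each `{` position, which handles braces inside strings at the cost of more parsing attempts.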