CLI: remove all short flags, use full names only [ENG-74] by devin-ai-integration[bot] · Pull Request #284 · benchflow-ai/benchflow

devin-ai-integration · 2026-05-17T07:08:00Z

Summary

Remove all single-letter short flags from every CLI command and all documentation/examples/skills/tests. Full flag names only for maximum clarity.

Flags removed:

Short	Full replacement
`-a`	`--agent`
`-m`	`--model`
`-o`	`--jobs-dir`
`-t`	`--tasks-dir`
`-c`	`--concurrency`
`-f`	`--config`
`-s`	`--skills-dir`
`-p`	`--prompt`
`-b`	`--sandbox` (already removed in #281)
`-d`	`--dir`
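
The table above can be read as a rewrite rule. Below is a minimal Python sketch of it, assuming commands appear one per line and begin with `bench` or `benchflow`. `SHORT_TO_LONG` and `rewrite_line` are hypothetical names used for illustration and are not part of the repository:

```python
import re

# Mapping copied from the "Flags removed" table in this PR.
SHORT_TO_LONG = {
    "-a": "--agent",
    "-m": "--model",
    "-o": "--jobs-dir",
    "-t": "--tasks-dir",
    "-c": "--concurrency",
    "-f": "--config",
    "-s": "--skills-dir",
    "-p": "--prompt",
    "-b": "--sandbox",
    "-d": "--dir",
}

# A short flag is a dash plus one lowercase letter standing alone as a token.
_SHORT_FLAG = re.compile(r"(?<!\S)-[a-z](?!\S)")


def rewrite_line(line: str) -> str:
    """Replace known short flags with long names on bench/benchflow command lines."""
    tokens = line.split()
    # Assumption: the command name is the first token on the line.
    if not tokens or tokens[0] not in {"bench", "benchflow"}:
        return line  # leave unrelated commands (grep -r, ln -sfn, ...) untouched
    # Unknown single-letter flags are kept as they are.
    return _SHORT_FLAG.sub(
        lambda m: SHORT_TO_LONG.get(m.group(0), m.group(0)), line
    )
```

Lines that begin with another command are returned unchanged, so unrelated single-letter flags are not rewritten. Commands wrapped in a prefix, such as a shell prompt, would need extra handling.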

23 files changed across:

src/benchflow/cli/main.py — all typer.Option() definitions + docstring examples
7 docs (getting-started, running-benchmarks, cli.md, skill-eval, integration-tests, examples/README, integration-tests)
README.md
.claude/skills/ (3 SKILL.md files + 3 reference docs)
tests/ (integration/run.sh, test_codex_custom_provider.sh, conformance/README.md, test_skill_eval_dryrun.py, test_oracle_chokepoint.py)
benchmarks/ (models-as-skills.md, harvey-lab/run_harvey_lab.py)
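
The change in src/benchflow/cli/main.py amounts to declaring each option under its long name only. The project defines its options with typer; the sketch below swaps in the standard-library `argparse` purely as a stand-in to show the effect, so none of it is the repository's code. `build_parser`, `accepts`, and the `bench-sketch` program name are hypothetical:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # allow_abbrev=False so only the exact long names are accepted.
    parser = argparse.ArgumentParser(prog="bench-sketch", allow_abbrev=False)
    parser.add_argument("--agent")   # no "-a" alias is registered
    parser.add_argument("--config")  # no "-f" alias is registered
    return parser


def accepts(argv: list[str]) -> bool:
    """Return True if the parser accepts argv; argparse exits on unknown options."""
    try:
        build_parser().parse_args(argv)
    except SystemExit:
        return False
    return True
```

With only long names registered, `accepts(["--agent", "x"])` succeeds while `accepts(["-a", "x"])` is rejected, which mirrors the breaking behavior callers of the old short flags will see.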

Review & Testing Checklist for Human

Run `bench eval create --help` and `bench run --help`: verify no short flags appear in the Options output
Run `bench skills eval --help`: verify `--agent`, not `-a`
Spot-check docs/getting-started.md and docs/running-benchmarks.md for any remaining short flags
Run `grep -r ' -[aomcstfbpd] ' --include='*.md' --include='*.sh' docs/ tests/` to verify the sweep was complete
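
The grep pattern in the last item requires a space on both sides of the flag, so it would miss a short flag at the end of a line. A Python sketch of the same check using lookarounds instead, with the same letter set; `find_short_flags` is a hypothetical helper, and hits still need manual review because unrelated commands can use the same letters:

```python
import re

# Same letter set as the grep character class in the checklist.
SHORT_FLAG = re.compile(r"(?<!\S)-[aomcstfbpd](?!\S)")


def find_short_flags(text: str) -> list[tuple[int, str]]:
    """Return (line_number, flag) for each standalone short flag in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in SHORT_FLAG.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```

Multi-letter groups such as `-sfn` and long flags such as `--agent` are not reported, since the pattern only matches a dash followed by exactly one listed letter.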

Notes

2 pre-existing test failures on refactor/v0.4 (not caused by this PR): test_same_provider_native_alias_satisfies_model_check and test_gemini_subscription_auth — both leak env vars from the test environment
Fixed test_cli_dryrun_loads_dataset which invoked CLI with -a (now --agent)
Tracks ENG-74

Link to Devin session: https://app.devin.ai/sessions/206c8b356e29441d88a8882b45c4442e
Requested by: @xdotli

Remove all single-letter short flags (-a, -m, -o, -t, -c, -f, -s, -p, -b, -d) from every CLI command. Full flag names only (--agent, --model, --jobs-dir, --tasks-dir, --concurrency, --config, --skills-dir, --prompt, --benchmark, --dir).

Updated across 23 files:
- src/benchflow/cli/main.py: all typer.Option definitions + docstring examples
- All docs/ (getting-started, running-benchmarks, cli.md, skill-eval, integration-tests, examples/)
- README.md
- .claude/skills/ (SKILL.md + references)
- tests/ (integration/run.sh, test_codex_custom_provider.sh, conformance/README.md)
- benchmarks/ (models-as-skills.md, harvey-lab/run_harvey_lab.py)
- Test fix: test_skill_eval_dryrun.py invocation updated from -a to --agent

devin-ai-integration

Devin Review found 2 potential issues.

devin-ai-integration · 2026-05-17T07:11:58Z

🟡 Incomplete flag migration: -f still used in task-embedded SKILL.md (benchflow-knowledge)

This PR removes the -f short alias from the job command's --config option (src/benchflow/cli/main.py:183), but the task-embedded SKILL.md at .claude/skills/benchflow/tasks/benchflow-knowledge/environment/benchflow/SKILL.md:67 still references benchflow job -f examples/configs/tb2-haiku.yaml. The same file had other short flags correctly updated on lines 36 and 62, and the primary copy at .claude/skills/benchflow/SKILL.md:67 was correctly changed to --config. This SKILL.md is deployed into task sandboxes and read by AI agents — an agent following this documentation will get Error: No such option: -f.

(Refers to line 67)

Fixed in 34b09ef — replaced benchflow job -f with benchflow job --config in this file.

devin-ai-integration · 2026-05-17T07:11:59Z

🟡 Incomplete flag migration: -f still used in task-embedded SKILL.md (create-simple-task)

Same incomplete transformation as BUG-0001 but in the other task-embedded copy. .claude/skills/benchflow/tasks/create-simple-task/environment/benchflow/SKILL.md:67 still references benchflow job -f examples/configs/tb2-haiku.yaml after the -f short alias was removed from the job command in src/benchflow/cli/main.py:183. Other short flags in the same file (lines 36, 62) were correctly updated to long form.

(Refers to line 67)

Fixed in 34b09ef — same fix applied here.
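
Both findings concern copies of SKILL.md under .claude/skills/, which lies outside the docs/ and tests/ directories searched by the checklist grep. A sketch of a check that walks a whole tree for every copy of a file, assuming the stale invocation is matched as a literal string; `stale_copies` is a hypothetical helper name:

```python
from pathlib import Path


def stale_copies(root: Path, filename: str, needle: str) -> list[Path]:
    """Return every file named `filename` under `root` that still contains `needle`."""
    return sorted(
        path
        for path in root.rglob(filename)  # rglob also descends into dot-directories
        if path.is_file() and needle in path.read_text(encoding="utf-8")
    )
```

Called as `stale_copies(Path("."), "SKILL.md", "benchflow job -f")`, it would list each copy that still uses the removed alias, regardless of how deeply it is embedded.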

…, rewards, adapters (#274)

* refactor: add Sandbox protocol + Harbor adapters (ENG-48 Phase A)
Create BenchFlow-native Sandbox protocol as parallel types alongside existing Harbor imports. No behavioral changes; existing code continues to use Harbor directly.
New files:
- src/benchflow/sandbox/protocol.py: ExecResult, Sandbox, ImageRef, ImageConfig, ImageBuilder
- src/benchflow/sandbox/docker.py: DockerSandbox adapter wrapping Harbor DockerEnvironment
- src/benchflow/sandbox/daytona.py: DaytonaSandbox adapter wrapping Harbor DaytonaEnvironment
- tests/test_sandbox_protocol.py: 26 tests covering dataclasses, protocol conformance, and delegation

* refactor: unify Scene/Role/Turn types into _types.py (ENG-47)
Create src/benchflow/_types.py as the single canonical source for the declarative Role, Scene, and Turn dataclasses.
Changes:
- New _types.py with Role (adds timeout_sec, idle_timeout_sec, skills_dir), Scene (adds parallel_group), and Turn
- trial.py imports from _types.py instead of defining its own copies
- _scene.py renames its internal Role to SceneRole (different fields: instruction, tools) with backward-compat alias
- __init__.py re-exports canonical types from _types.py; adds TrialRole/TrialScene backward-compat aliases
- runtime.py and trial_yaml.py updated to import from _types.py
New fields default to None; no runtime behavior changes.

* fix: address review - shlex.quote in read_file, cleanup temp in write_file
- Use shlex.quote(path) in read_file to prevent shell injection
- Clean up temp files with os.unlink in write_file finally block
- Move imports to module level

* fix: read_file error checking + export ImageConfig from top-level
- read_file now raises FileNotFoundError on non-zero return code
- Export ImageConfig alongside ImageBuilder in __init__.py
- Add tests for read_file error behavior

* refactor: kill shim layers, single Rollout execution path (ENG-46)
- Rename trial.py → rollout.py, Trial → Rollout, TrialConfig → RolloutConfig
- Rename RunResult → RolloutResult in models.py
- Create evaluation.py from job.py: Job → Evaluation, JobConfig → EvaluationConfig, JobResult → EvaluationResult
- Create _run.py with bf.run(RolloutConfig) → RolloutResult entry point
- Replace sdk.py with thin backward-compat shim delegating to rollout.py
- Replace trial.py with re-export shim from rollout.py
- Replace job.py with re-export shim from evaluation.py
- Preserve runtime.py API, update internal imports to use Rollout
- Update __init__.py: new public API + backward-compat aliases
- Update test patch targets from benchflow.trial → benchflow.rollout
- All 841 tests pass, lint clean, pre-existing typecheck errors unchanged

* style: apply ruff format to pass CI format check

* fix: resolve ty typecheck errors (Any for kwargs and sentinel)

* feat: composable Rubric + RewardFunc protocol (ENG-49)
- Create src/benchflow/rewards/ package with:
  - RewardFunc protocol (single scoring dimension)
  - Rubric dataclass (weighted collection of RewardFuncs)
  - VerifyResult dataclass (aggregated scoring result)
  - RewardEvent dataclass (dense/terminal reward signals)
  - Built-in funcs: TestRewardFunc, LLMJudgeRewardFunc, StringMatchRewardFunc, CodeExecRewardFunc
- Add reward_events field to RunResult (additive, no breaking changes)
- Re-export all reward types from benchflow top-level __init__.py
- Backward compat: Rubric([TestRewardFunc()]) wraps existing test.sh -> reward.txt flow
- 31 tests covering all reward types, rubric scoring, weights, error handling, protocol conformance, and re-exports

* style: format rewards package and tests with ruff

* fix: disambiguate Rubric.items keys when multiple funcs share a class name
Append _N suffix on collision so all per-func scores are preserved. Addresses Devin Review feedback on PR #266.

* feat: per-agent capabilities + agent-as-tool infrastructure (ENG-50)
- Add Role.capabilities field for declarative agent capability lists
- Add Sandbox.expose_ports property for inter-agent communication ports
- Document capability boundary: BenchFlow provides sandbox + instruction + observation; agents own loops, tool protocols, and agent-as-tool
- Update _scene.py and _types.py module docstrings with ENG-50 boundary
- Add 16 tests covering capabilities, expose_ports, and Coder→Reviewer→Coder turn sequences

* style: apply ruff format to sandbox adapters and tests

* feat: external adapters for Inspect AI + ORS (ENG-51)
Add benchflow.adapters package with thin format converters:
- InspectAdapter / to_inspect_task: Scene + Rubric -> Inspect AI task dict
- ORSAdapter / to_ors_reward: VerifyResult + RewardEvent -> ORS format dict
No external dependencies required; these are pure format converters. Re-exported from benchflow top-level __init__.py. 14 tests covering both adapters, convenience functions, and round-trips.

* style: fix ruff format for adapters and tests

* fix: convention fixes + trial shim re-exports + __getattr__ narrowing
- Parametrize bare dict returns to dict[str, Any] in adapters
- Remove unnecessary @dataclass on adapter classes (no fields)
- Add from __future__ import annotations to evaluation.py
- Fix __getattr__ to re-raise ModuleNotFoundError for broken deps
- Add private helper re-exports to trial.py shim for backward compat
- Convert test_save_trajectory to async (fixes event loop deprecation)

* rename: backend → sandbox across API, CLI, docs, and tests
Aligns naming with the v0.4 Sandbox protocol (ENG-48). The 'backend' parameter/flag/attribute on Environment and the CLI is now 'sandbox' everywhere.
Scope:
- runtime.py: Environment.__init__(sandbox=), .sandbox, from_task(sandbox=)
- cli/main.py: --sandbox flag (was --backend), help text
- _snapshot.py: docstring updates
- tests/test_runtime.py: updated assertions
- docs/: all references in task-authoring, CLI ref, Python API ref, running-benchmarks, integration-tests, examples
NOT renamed (different meaning):
- _credentials.py 'Vertex AI backend' (provider concept)
- _provider_runtime.py / bedrock_proxy.py backend_model (LLM model)
- _sandbox.py 'build backends' (Python packaging)

* docs: update for v0.4 - Rollout as canonical name, add Sandbox/Rubric/Adapter docs
- README: 'Sandbox backends' → 'Sandboxes', table says Rollout not Trial
- concepts.md: Environment no longer says 'Backed by Harbor', Trial → Rollout as core primitive, lifecycle diagram uses Rollout.run()
- getting-started.md: uses RolloutConfig, links to rollout lifecycle
- python-api.md: RolloutConfig (aliased as TrialConfig), Rollout (aliased as Trial), canonical imports shown first, added v0.4 types section with Sandbox protocol, Rubric + RewardFunc, Adapters, Evaluation, backward-compat aliases

* fix: correct Rubric/RewardFunc API examples in python-api.md
Fixes 3 issues flagged by Devin Review:
- StringMatchRewardFunc: remove non-existent 'field' parameter
- Rubric: use reward_funcs + weights instead of items=[(func, weight)]
- rubric.score: takes rollout_dir: Path, not sandbox=

* fix: correct evaluation.py docstring aliases, clean stale comment in __init__.py

* docs: remove Harbor migration references
- use-cases.md: remove 'How it works vs Harbor' and 'Migration from Harbor' sections
- scene-patterns.ipynb: remove 'Mapping to Harbor PR #1462 Concepts' cell
- coder-reviewer-demo.py: 'Harbor-format task directory' → 'BenchFlow task directory'
- swebench notebook: remove Harbor #1316 reference from intro

* CLI: unify sandbox flag (--env → --sandbox, -b → -e) [ENG-56] (#281)

* cli: unify sandbox flag --env → --sandbox, -b → -e (ENG-56)
Standardize all CLI commands to use --sandbox/-e for the sandbox parameter. Previously bench run used --sandbox/-b while bench eval create and others used --env/-e.
- bench run: -b → -e
- bench eval create/run/compare: --env → --sandbox
- bench eval progress: --env → --sandbox
- bench skills eval: --env → --sandbox
- bench environment create: -b → -e
- docs/examples: updated flag references

* cli: drop -e/-b short flags, use --sandbox only (ENG-56)
Remove all short flags (-e, -b) for the sandbox parameter across all CLI commands. The long flag --sandbox is the only way to specify the sandbox now. Updated 15 files: CLI definitions, help text examples, README, docs, skill files, integration test runner, and test examples.

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

* cli: deprecate bench run in favor of bench eval create (ENG-57) (#282)
Per ENG-46, bench eval create is the canonical CLI entry point. bench run goes through the old SDK shim and is now deprecated, matching the pattern used for bench job and other legacy commands. Python API bf.run() is unaffected (backward-compat alias stays).

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

* feat(ENG-55): first-class LLM-as-judge verifier with dense reward (#277)

* feat(ENG-55): first-class LLM-as-judge verifier with dense reward
- Add rubric_config.py: Pydantic models for rubric.toml (Criterion, JudgeConfig, ScoringConfig, RubricConfig) with binary/likert/numeric criterion types and score normalization
- Add llm.py: Multi-provider LLM routing (Anthropic/OpenAI/Gemini) with call_judge(), parse_verdict(), exponential backoff retry
- Add file_readers.py: Document extraction (pdf/docx/xlsx/pptx/txt) with find_deliverables() for rollout directory scanning
- Replace placeholder LLMJudgeRewardFunc in builtins.py with full rubric-based judge: per-criterion scoring, prompt templates, configurable aggregation (weighted_mean/all_pass/any_pass/threshold), backward-compatible legacy mode
- Emit per-criterion dense RewardEvents for fine-grained reward signals
- Write evaluation_details.json alongside rollout for transparency
- Re-export Criterion, JudgeConfig, RubricConfig, ScoringConfig, load_rubric_toml from rewards and top-level __init__
- Add test_rubric_config.py (15 tests) and test_llm_judge.py (30 tests)
Closes ENG-55

* fix(ENG-55): async clients, asyncio.sleep, and score mismatch bugs
- Replace sync SDK clients with async variants (AsyncAnthropic, AsyncOpenAI, client.aio for Google)
- Replace time.sleep() with await asyncio.sleep() in call_judge retry loop
- Pass actual aggregated score to _write_details instead of recomputing n_passed/total

* docs(ENG-55): add LLM-as-judge verifier documentation
- Add docs/llm-judge.md: full user-facing guide with rubric.toml reference, criterion types, aggregation strategies, dense reward events, multi-provider routing, file discovery, inline criteria, and worked examples
- Add src/benchflow/rewards/README.md: module-level guide with usage examples
- Update docs/concepts.md: reference LLM judge in Verifier primitive and add to 'Where to go next' links

* docs: polish - remove ENG-55 ticket ref, clarify rubric.json exclusion

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

* CLI: remove all short flags, use full names only [ENG-74] (#284)

* CLI: remove all short flags, use full names only [ENG-74]
Remove all single-letter short flags (-a, -m, -o, -t, -c, -f, -s, -p, -b, -d) from every CLI command. Full flag names only (--agent, --model, --jobs-dir, --tasks-dir, --concurrency, --config, --skills-dir, --prompt, --benchmark, --dir).
Updated across 23 files:
- src/benchflow/cli/main.py: all typer.Option definitions + docstring examples
- All docs/ (getting-started, running-benchmarks, cli.md, skill-eval, integration-tests, examples/)
- README.md
- .claude/skills/ (SKILL.md + references)
- tests/ (integration/run.sh, test_codex_custom_provider.sh, conformance/README.md)
- benchmarks/ (models-as-skills.md, harvey-lab/run_harvey_lab.py)
- Test fix: test_skill_eval_dryrun.py invocation updated from -a to --agent

* fix: replace remaining -f with --config in task-embedded SKILL.md files

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

* Backport PR #230 and #242 fixes to refactor/v0.4 (#286)

* fix: apply PR #230 (skills double-deploy) and PR #242 (shell injection) fixes to v0.4
PR #230: pass effective task path to deploy_skills, set effective_skills in already-injected branch
PR #242: shlex.quote outbox file paths in _scene.py, harden _snapshot.py with path validation

* fix: set effective_skills to /skills when Dockerfile already injected
When deploy_skills detects the Dockerfile already bakes in `COPY _deps/skills /skills/`, the runtime upload is skipped but effective_skills was left as task.config.environment.skills_dir (often None), causing the subsequent linking step to silently skip distribution to agent-specific paths like /home/agent/.agents/skills. Sandbox users relied on that runtime linking since Dockerfile injection only links under /root/.
Tighten the regression test to assert env.exec was called with the expected `ln -sfn` command.

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>

* fix locally

* release: 0.3.4

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>

…, ACPX integration (#288)

* refactor: BenchFlow v0.3.4 - unified types, Rollout, Sandbox protocol, rewards, adapters (#274)

* refactor: remove Harbor references, purge backward-compat aliases in tests, update terminology
- Remove 'terminal-bench' from pyproject.toml keywords
- Rename _make_harbor_mock → _make_sandbox_mock in test files
- Update Harbor patch paths to sandbox equivalents in tests
- Rename _from_harbor_yaml → _from_legacy_yaml in evaluation.py
- Update docstrings/comments: Harbor → BenchFlow/legacy terminology
- Internalize Harbor types into benchflow.task/ subpackage
- Create sandbox backends: base.py, docker_impl.py, daytona_impl.py, modal_impl.py
- Add compose/ subpackage with docker-compose YAML templates

* refactor: modernize file structure - align with test-drive patterns
- sandbox/: merge docker_impl→docker, daytona_impl→daytona, base→_base
- sandbox/: move _sandbox→lockdown, _snapshot→snapshot, _daytona_patches→_sdk_ops
- sandbox/: move process, user, environments→services into sandbox/
- sandbox/: restructure compose/ → _compose.py + _compose_files/
- agents/: move _credentials→credentials, _agent_env→env, _agent_setup→install
- _utils/: create subpackage with yaml_loader, benchmark_repos, task_authoring
- trajectories/: move viewer, _trajectory→_capture into subpackage
- experimental/: move mcp/ into experimental/ subpackage
- __init__.py: purge backward-compat aliases (Trial, Job, etc.)
- Update all imports across src/, tests/, experiments/
- Run ruff check + format (0 errors)

* refactor: purge backward-compat aliases + fix review bugs
- Remove Trial=Rollout, TrialConfig=RolloutConfig from rollout.py
- Remove Job=Evaluation, JobConfig=EvaluationConfig, JobResult=EvaluationResult from evaluation.py
- Delete shim files: trial.py, job.py
- Update all imports: benchflow.trial → benchflow.rollout, benchflow.job → benchflow.evaluation
- Rename classes: Trial→Rollout, Job→Evaluation across tests/, benchmarks/, experiments/
- Fix runtime.py: environment_type→sandbox_type, trial_name→rollout_name (review bug)
- Fix conformance scripts: same keyword arg fixes (review bug)
- Clean up __all__ and return type annotations

* docs: update terminology to v0.4 (Trial→Rollout, Job→Evaluation) + bump version to 0.4.0

* chore: update uv.lock for v0.4.0

* refactor: modernize RL terminology - trial_dir→rollout_dir, trial_name→rollout_name across codebase

* style: format viewer.py

* fix: exclude optional-dep sandbox files from ty check

* fix: configure ty to ignore unresolved-import + exclude optional-dep files

* fix: resolve all test failures - patch paths, protocol conformance, RL terminology

* fix: resolve merge conflicts, update broken imports and stale references
- Resolve merge conflict markers in rollout.py and evaluation.py
- Update task_download imports to benchflow._utils.benchmark_repos (11 files)
- Fix benchflow.tasks → benchflow._utils.task_authoring import
- Update scene-patterns.ipynb: TrialConfig→RolloutConfig, trial→rollout
- Fix Job→Evaluation rename in docs
- Ruff auto-fixes (import ordering)

* style: format 12 files to pass ruff format check

* fix: resolve ty check errors in traces/ - dict[str, object] → dict[str, Any]

* fix: add skip guards for optional dependency tests (daytona, modal)

* fix: update stale imports and references in docs/benchmarks (Devin Review)

* refactor: consolidate underscore modules into proper subpackages
- _acp_run.py → acp/runtime.py
- _env_setup.py → sandbox/setup.py
- _provider_runtime.py → providers/runtime.py
- _scoring.py → _utils/scoring.py
- _scene.py → scenes.py (removed Role=SceneRole backward-compat alias)
- Fix GEMINI_API_KEY auto-detection: inherit from os.environ as fallback after .env

* feat: ACPX integration - acpx/<agent> protocol for headless ACP agent invocation
- Add 'acpx' as a valid protocol in parse_agent_spec (acpx/claude, acpx/codex, etc.)
- _acpx_wrap() decorates any registered agent to launch via acpx CLI
- Installs acpx alongside the underlying agent in the sandbox
- Preserves all agent env, credentials, and skill_paths
- Replace harbor protocol references with acpx in tests

* style: format registry.py for ruff format check

* fix: suppress ty invalid-assignment for monkey-patch in _patch_docker_dind

* docs: update docs and SKILL.md for v0.4 - RL-first terminology, ACPX, purge backward-compat

* fix: rename trial_config_from_yaml → rollout_config_from_yaml, fix stale _run import in sdk.py

* fix: docs inconsistencies found by dogfooding - GEMINI_API_KEY, daytona install note

* fix: docs path evaluations/ → jobs/ to match actual output directory

* fix: ruff import ordering in sdk.py

* fix: ty type-check suppression for sdk.py run() return type

* feat: add sandbox-daytona and sandbox-modal optional dependency groups

* chore: update uv.lock for sandbox-daytona and sandbox-modal optional deps

---------

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>

* refactor: add Sandbox protocol + Harbor adapters (ENG-48 Phase A) Create BenchFlow-native Sandbox protocol as parallel types alongside existing Harbor imports. No behavioral changes — existing code continues to use Harbor directly. New files: - src/benchflow/sandbox/protocol.py: ExecResult, Sandbox, ImageRef, ImageConfig, ImageBuilder - src/benchflow/sandbox/docker.py: DockerSandbox adapter wrapping Harbor DockerEnvironment - src/benchflow/sandbox/daytona.py: DaytonaSandbox adapter wrapping Harbor DaytonaEnvironment - tests/test_sandbox_protocol.py: 26 tests covering dataclasses, protocol conformance, and delegation * refactor: unify Scene/Role/Turn types into _types.py (ENG-47) Create src/benchflow/_types.py as the single canonical source for the declarative Role, Scene, and Turn dataclasses. Changes: - New _types.py with Role (adds timeout_sec, idle_timeout_sec, skills_dir), Scene (adds parallel_group), and Turn - trial.py imports from _types.py instead of defining its own copies - _scene.py renames its internal Role to SceneRole (different fields: instruction, tools) with backward-compat alias - __init__.py re-exports canonical types from _types.py; adds TrialRole/TrialScene backward-compat aliases - runtime.py and trial_yaml.py updated to import from _types.py New fields default to None — no runtime behavior changes. 
* fix: address review — shlex.quote in read_file, cleanup temp in write_file - Use shlex.quote(path) in read_file to prevent shell injection - Clean up temp files with os.unlink in write_file finally block - Move imports to module level * fix: read_file error checking + export ImageConfig from top-level - read_file now raises FileNotFoundError on non-zero return code - Export ImageConfig alongside ImageBuilder in __init__.py - Add tests for read_file error behavior * refactor: kill shim layers, single Rollout execution path (ENG-46) - Rename trial.py → rollout.py, Trial → Rollout, TrialConfig → RolloutConfig - Rename RunResult → RolloutResult in models.py - Create evaluation.py from job.py: Job → Evaluation, JobConfig → EvaluationConfig, JobResult → EvaluationResult - Create _run.py with bf.run(RolloutConfig) → RolloutResult entry point - Replace sdk.py with thin backward-compat shim delegating to rollout.py - Replace trial.py with re-export shim from rollout.py - Replace job.py with re-export shim from evaluation.py - Preserve runtime.py API, update internal imports to use Rollout - Update __init__.py: new public API + backward-compat aliases - Update test patch targets from benchflow.trial → benchflow.rollout - All 841 tests pass, lint clean, pre-existing typecheck errors unchanged * style: apply ruff format to pass CI format check * fix: resolve ty typecheck errors (Any for kwargs and sentinel) * feat: composable Rubric + RewardFunc protocol (ENG-49) - Create src/benchflow/rewards/ package with: - RewardFunc protocol (single scoring dimension) - Rubric dataclass (weighted collection of RewardFuncs) - VerifyResult dataclass (aggregated scoring result) - RewardEvent dataclass (dense/terminal reward signals) - Built-in funcs: TestRewardFunc, LLMJudgeRewardFunc, StringMatchRewardFunc, CodeExecRewardFunc - Add reward_events field to RunResult (additive, no breaking changes) - Re-export all reward types from benchflow top-level __init__.py - Backward compat: 
Rubric([TestRewardFunc()]) wraps existing test.sh -> reward.txt flow - 31 tests covering all reward types, rubric scoring, weights, error handling, protocol conformance, and re-exports * style: format rewards package and tests with ruff * fix: disambiguate Rubric.items keys when multiple funcs share a class name Append _N suffix on collision so all per-func scores are preserved. Addresses Devin Review feedback on PR #266. * feat: per-agent capabilities + agent-as-tool infrastructure (ENG-50) - Add Role.capabilities field for declarative agent capability lists - Add Sandbox.expose_ports property for inter-agent communication ports - Document capability boundary: BenchFlow provides sandbox + instruction + observation; agents own loops, tool protocols, and agent-as-tool - Update _scene.py and _types.py module docstrings with ENG-50 boundary - Add 16 tests covering capabilities, expose_ports, and Coder→Reviewer→Coder turn sequences * style: apply ruff format to sandbox adapters and tests * feat: external adapters for Inspect AI + ORS (ENG-51) Add benchflow.adapters package with thin format converters: - InspectAdapter / to_inspect_task: Scene + Rubric -> Inspect AI task dict - ORSAdapter / to_ors_reward: VerifyResult + RewardEvent -> ORS format dict No external dependencies required — these are pure format converters. Re-exported from benchflow top-level __init__.py. 14 tests covering both adapters, convenience functions, and round-trips. 
* style: fix ruff format for adapters and tests * fix: convention fixes + trial shim re-exports + __getattr__ narrowing - Parametrize bare dict returns to dict[str, Any] in adapters - Remove unnecessary @dataclass on adapter classes (no fields) - Add from __future__ import annotations to evaluation.py - Fix __getattr__ to re-raise ModuleNotFoundError for broken deps - Add private helper re-exports to trial.py shim for backward compat - Convert test_save_trajectory to async (fixes event loop deprecation) * rename: backend → sandbox across API, CLI, docs, and tests Aligns naming with the v0.4 Sandbox protocol (ENG-48). The 'backend' parameter/flag/attribute on Environment and the CLI is now 'sandbox' everywhere. Scope: - runtime.py: Environment.__init__(sandbox=), .sandbox, from_task(sandbox=) - cli/main.py: --sandbox flag (was --backend), help text - _snapshot.py: docstring updates - tests/test_runtime.py: updated assertions - docs/: all references in task-authoring, CLI ref, Python API ref, running-benchmarks, integration-tests, examples NOT renamed (different meaning): - _credentials.py 'Vertex AI backend' (provider concept) - _provider_runtime.py / bedrock_proxy.py backend_model (LLM model) - _sandbox.py 'build backends' (Python packaging) * docs: update for v0.4 — Rollout as canonical name, add Sandbox/Rubric/Adapter docs - README: 'Sandbox backends' → 'Sandboxes', table says Rollout not Trial - concepts.md: Environment no longer says 'Backed by Harbor', Trial → Rollout as core primitive, lifecycle diagram uses Rollout.run() - getting-started.md: uses RolloutConfig, links to rollout lifecycle - python-api.md: RolloutConfig (aliased as TrialConfig), Rollout (aliased as Trial), canonical imports shown first, added v0.4 types section with Sandbox protocol, Rubric + RewardFunc, Adapters, Evaluation, backward-compat aliases * fix: correct Rubric/RewardFunc API examples in python-api.md Fixes 3 issues flagged by Devin Review: - StringMatchRewardFunc: remove 
non-existent 'field' parameter - Rubric: use reward_funcs + weights instead of items=[(func, weight)] - rubric.score: takes rollout_dir: Path, not sandbox= * fix: correct evaluation.py docstring aliases, clean stale comment in __init__.py * docs: remove Harbor migration references - use-cases.md: remove 'How it works vs Harbor' and 'Migration from Harbor' sections - scene-patterns.ipynb: remove 'Mapping to Harbor PR #1462 Concepts' cell - coder-reviewer-demo.py: 'Harbor-format task directory' → 'BenchFlow task directory' - swebench notebook: remove Harbor #1316 reference from intro * CLI: unify sandbox flag (--env → --sandbox, -b → -e) [ENG-56] (#281) * cli: unify sandbox flag --env → --sandbox, -b → -e (ENG-56) Standardize all CLI commands to use --sandbox/-e for the sandbox parameter. Previously bench run used --sandbox/-b while bench eval create and others used --env/-e. - bench run: -b → -e - bench eval create/run/compare: --env → --sandbox - bench eval progress: --env → --sandbox - bench skills eval: --env → --sandbox - bench environment create: -b → -e - docs/examples: updated flag references * cli: drop -e/-b short flags, use --sandbox only (ENG-56) Remove all short flags (-e, -b) for the sandbox parameter across all CLI commands. The long flag --sandbox is the only way to specify the sandbox now. Updated 15 files: CLI definitions, help text examples, README, docs, skill files, integration test runner, and test examples. --------- Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> * cli: deprecate bench run in favor of bench eval create (ENG-57) (#282) Per ENG-46, bench eval create is the canonical CLI entry point. bench run goes through the old SDK shim and is now deprecated, matching the pattern used for bench job and other legacy commands. Python API bf.run() is unaffected (backward-compat alias stays). 
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> * feat(ENG-55): first-class LLM-as-judge verifier with dense reward (#277) * feat(ENG-55): first-class LLM-as-judge verifier with dense reward - Add rubric_config.py: Pydantic models for rubric.toml (Criterion, JudgeConfig, ScoringConfig, RubricConfig) with binary/likert/numeric criterion types and score normalization - Add llm.py: Multi-provider LLM routing (Anthropic/OpenAI/Gemini) with call_judge(), parse_verdict(), exponential backoff retry - Add file_readers.py: Document extraction (pdf/docx/xlsx/pptx/txt) with find_deliverables() for rollout directory scanning - Replace placeholder LLMJudgeRewardFunc in builtins.py with full rubric-based judge: per-criterion scoring, prompt templates, configurable aggregation (weighted_mean/all_pass/any_pass/threshold), backward-compatible legacy mode - Emit per-criterion dense RewardEvents for fine-grained reward signals - Write evaluation_details.json alongside rollout for transparency - Re-export Criterion, JudgeConfig, RubricConfig, ScoringConfig, load_rubric_toml from rewards and top-level __init__ - Add test_rubric_config.py (15 tests) and test_llm_judge.py (30 tests) Closes ENG-55 * fix(ENG-55): async clients, asyncio.sleep, and score mismatch bugs - Replace sync SDK clients with async variants (AsyncAnthropic, AsyncOpenAI, client.aio for Google) - Replace time.sleep() with await asyncio.sleep() in call_judge retry loop - Pass actual aggregated score to _write_details instead of recomputing n_passed/total * docs(ENG-55): add LLM-as-judge verifier documentation - Add docs/llm-judge.md: full user-facing guide with rubric.toml reference, criterion types, aggregation strategies, dense reward events, multi-provider routing, file discovery, inline criteria, and worked examples - Add src/benchflow/rewards/README.md: module-level guide with usage examples - Update docs/concepts.md: reference LLM judge in Verifier primitive and add to 'Where to go next' links * docs: polish — 
remove ENG-55 ticket ref, clarify rubric.json exclusion --------- Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> * CLI: remove all short flags, use full names only [ENG-74] (#284) * CLI: remove all short flags, use full names only [ENG-74] Remove all single-letter short flags (-a, -m, -o, -t, -c, -f, -s, -p, -b, -d) from every CLI command. Full flag names only (--agent, --model, --jobs-dir, --tasks-dir, --concurrency, --config, --skills-dir, --prompt, --benchmark, --dir). Updated across 23 files: - src/benchflow/cli/main.py: all typer.Option definitions + docstring examples - All docs/ (getting-started, running-benchmarks, cli.md, skill-eval, integration-tests, examples/) - README.md - .claude/skills/ (SKILL.md + references) - tests/ (integration/run.sh, test_codex_custom_provider.sh, conformance/README.md) - benchmarks/ (models-as-skills.md, harvey-lab/run_harvey_lab.py) - Test fix: test_skill_eval_dryrun.py invocation updated from -a to --agent * fix: replace remaining -f with --config in task-embedded SKILL.md files --------- Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> * Backport PR #230 and #242 fixes to refactor/v0.4 (#286) * fix: apply PR #230 (skills double-deploy) and PR #242 (shell injection) fixes to v0.4 PR #230: pass effective task path to deploy_skills, set effective_skills in already-injected branch PR #242: shlex.quote outbox file paths in _scene.py, harden _snapshot.py with path validation * fix: set effective_skills to /skills when Dockerfile already injected When deploy_skills detects the Dockerfile already bakes in `COPY _deps/skills /skills/`, the runtime upload is skipped but effective_skills was left as task.config.environment.skills_dir (often None), causing the subsequent linking step to silently skip distribution to agent-specific paths like /home/agent/.agents/skills. Sandbox users relied on that runtime linking since Dockerfile injection only links under /root/. 
Tighten the regression test to assert env.exec was called with the expected `ln -sfn` command. --------- Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> Co-authored-by: Yifeng He <yfhe.prsn@gmail.com> * fix locally * release: 0.3.4 * feat: trace import — `bench tasks generate` from Claude Code + opentraces (#278) * feat: trace import prototype — Claude Code + opentraces → BenchFlow tasks Add benchflow.traces package for personal benchmark curation from agent traces: - parsers: parse Claude Code JSONL sessions and opentraces v0.1-v0.3 records - models: format-agnostic ParsedTrace intermediate representation - task_gen: generate task.toml + instruction.md + test.sh from traces - huggingface: download and parse trace datasets from HuggingFace Hub - local: discover and parse local ~/.claude/projects/ sessions - CLI: bench import {local,file,hf,list-datasets} subcommands 44 new tests, all passing. * fix: address Devin Review findings - _cache_dir: fall back to cwd instead of / when outside git repo - HF cache key: include max_rows in filename to avoid stale data - parse_claude_code_session: deterministic trace_id from session+filename - _build_test_sh: use shlex.quote() to prevent shell injection * refactor: align trace import CLI with noun-verb philosophy - Replace 'bench import {local,file,hf}' with 'bench tasks generate --from-{local,file,hf}' - Replace 'bench import list-datasets' with 'bench tasks list-sources' - Integrate trace commands into existing tasks_app via register_tasks_generate() - Update CLI docs with new command reference - Follows BenchFlow resource-verb pattern: bench <resource> <verb> * fix: address Devin Review — limit flag + format name mismatch - Apply --limit across all sources (was only used for --from-local) - Map 'claude-code' → 'claude-messages' in _load_hf so --format claude-code works correctly with --from-hf * docs: dogfood bench tasks generate in task-authoring and getting-started guides * fix: generated tasks now pass bench tasks 
check — add Dockerfile, tests/ dir, reward path, difficulty scaling, TOML safety * audit: fix task pipeline issues + normalize docstrings across traces package Pipeline fixes: - Fix timeout auto-scaling: batch generator default changed from 300 to 0 (300 > 0 prevented difficulty-based scaling from ever triggering) - Add [task] name section to generated task.toml (trace-import/<slug>) - Add build_timeout_sec and storage_mb to [environment] section - Fix test.sh indentation (remove extra leading spaces from textwrap) Docstring normalization: - Add missing docstrings to CLI helper functions (_load_local, _load_file, _load_hf) - Improve docstring clarity across traces package (detect_format, print helpers, HF parsers) - Use reST code block syntax for CLI examples 48/48 tests pass, lint clean. * fix: multi-session JSONL collapse, outcome detection, zero-tool-call filter Dogfooding revealed 3 quality issues: 1. Critical: parse_claude_code_session() merged all sessions in a multi-session JSONL file into one trace, contaminating difficulty, instruction, and verifier. Added parse_claude_code_file() that splits by sessionId before parsing each group. 2. Outcome detection missed common completion verbs (fixed, refactored, built, created, updated, implemented, added). 3. Traces with zero tool calls (pure explanations) produced useless tasks with pass-through verifiers. Now filtered out in batch mode. Before: 5-session file → 1 task (hard, 1200s, 19 files listed) After: 5-session file → 4 tasks (easy/medium, correct files each) Tests: 55 passing (was 48), lint clean. 
* fix: test.sh file check cap matches instruction.md (10→20) * fix: real-trace dogfood — path relativization, session artifact cleanup, HF parser robustness - Add _relativize_path() to convert absolute workspace paths to relative project paths - Add _clean_user_prompt() to strip session continuation boilerplate - HF parser: support messages_json key, strip system-reminders, handle tool_result blocks, infer outcome - CLI: add claude-messages format detection and routing for --from-file - Add _parse_claude_requests_row() for cc-traces-weka metadata format - All 55 tests pass, lint clean * fix: verifier robustness — glob patterns for timestamp-bearing paths Paths with date/timestamp segments (e.g. migrations/2025-11-28-131040_create_invoices/up.sql) now use compgen -G glob patterns in test.sh instead of exact [ -f ] checks. This lets verifiers tolerate agent-generated timestamp variants. Also updates instruction.md to show globbed paths for dynamic segments. Adds 10 new tests for _globify_path, _has_dynamic_segments, and end-to-end verifier pattern selection. * fix: real-trace parser — user_prompt+gitdiff format, git context, git-diff verifier - Handle cc-traces-merged rows with user_prompt + gitdiff (no messages_json) - Extract TASK DESCRIPTION from structured prompts, strip CRITICAL INSTRUCTIONS boilerplate - Extract git repo/commit from prompt, populate GitContext - Dockerfile clones repo at base commit when git context available - Verifier uses git diff to check files were modified (not just exist) - Fix source_model='None' string bug --------- Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> * refactor: BenchFlow v0.4 — RL-first terminology, module consolidation, ACPX integration (#288) * refactor: BenchFlow v0.3.4 — unified types, Rollout, Sandbox protocol, rewards, adapters (#274) * refactor: add Sandbox protocol + Harbor adapters (ENG-48 Phase A) Create BenchFlow-native Sandbox protocol as parallel types alongside existing Harbor imports. 
No behavioral changes — existing code continues to use Harbor directly. New files: - src/benchflow/sandbox/protocol.py: ExecResult, Sandbox, ImageRef, ImageConfig, ImageBuilder - src/benchflow/sandbox/docker.py: DockerSandbox adapter wrapping Harbor DockerEnvironment - src/benchflow/sandbox/daytona.py: DaytonaSandbox adapter wrapping Harbor DaytonaEnvironment - tests/test_sandbox_protocol.py: 26 tests covering dataclasses, protocol conformance, and delegation * refactor: unify Scene/Role/Turn types into _types.py (ENG-47) Create src/benchflow/_types.py as the single canonical source for the declarative Role, Scene, and Turn dataclasses. Changes: - New _types.py with Role (adds timeout_sec, idle_timeout_sec, skills_dir), Scene (adds parallel_group), and Turn - trial.py imports from _types.py instead of defining its own copies - _scene.py renames its internal Role to SceneRole (different fields: instruction, tools) with backward-compat alias - __init__.py re-exports canonical types from _types.py; adds TrialRole/TrialScene backward-compat aliases - runtime.py and trial_yaml.py updated to import from _types.py New fields default to None — no runtime behavior changes. 
* fix: address review — shlex.quote in read_file, cleanup temp in write_file - Use shlex.quote(path) in read_file to prevent shell injection - Clean up temp files with os.unlink in write_file finally block - Move imports to module level * fix: read_file error checking + export ImageConfig from top-level - read_file now raises FileNotFoundError on non-zero return code - Export ImageConfig alongside ImageBuilder in __init__.py - Add tests for read_file error behavior * refactor: kill shim layers, single Rollout execution path (ENG-46) - Rename trial.py → rollout.py, Trial → Rollout, TrialConfig → RolloutConfig - Rename RunResult → RolloutResult in models.py - Create evaluation.py from job.py: Job → Evaluation, JobConfig → EvaluationConfig, JobResult → EvaluationResult - Create _run.py with bf.run(RolloutConfig) → RolloutResult entry point - Replace sdk.py with thin backward-compat shim delegating to rollout.py - Replace trial.py with re-export shim from rollout.py - Replace job.py with re-export shim from evaluation.py - Preserve runtime.py API, update internal imports to use Rollout - Update __init__.py: new public API + backward-compat aliases - Update test patch targets from benchflow.trial → benchflow.rollout - All 841 tests pass, lint clean, pre-existing typecheck errors unchanged * style: apply ruff format to pass CI format check * fix: resolve ty typecheck errors (Any for kwargs and sentinel) * feat: composable Rubric + RewardFunc protocol (ENG-49) - Create src/benchflow/rewards/ package with: - RewardFunc protocol (single scoring dimension) - Rubric dataclass (weighted collection of RewardFuncs) - VerifyResult dataclass (aggregated scoring result) - RewardEvent dataclass (dense/terminal reward signals) - Built-in funcs: TestRewardFunc, LLMJudgeRewardFunc, StringMatchRewardFunc, CodeExecRewardFunc - Add reward_events field to RunResult (additive, no breaking changes) - Re-export all reward types from benchflow top-level __init__.py - Backward compat: 
Rubric([TestRewardFunc()]) wraps existing test.sh -> reward.txt flow - 31 tests covering all reward types, rubric scoring, weights, error handling, protocol conformance, and re-exports * style: format rewards package and tests with ruff * fix: disambiguate Rubric.items keys when multiple funcs share a class name Append _N suffix on collision so all per-func scores are preserved. Addresses Devin Review feedback on PR #266. * feat: per-agent capabilities + agent-as-tool infrastructure (ENG-50) - Add Role.capabilities field for declarative agent capability lists - Add Sandbox.expose_ports property for inter-agent communication ports - Document capability boundary: BenchFlow provides sandbox + instruction + observation; agents own loops, tool protocols, and agent-as-tool - Update _scene.py and _types.py module docstrings with ENG-50 boundary - Add 16 tests covering capabilities, expose_ports, and Coder→Reviewer→Coder turn sequences * style: apply ruff format to sandbox adapters and tests * feat: external adapters for Inspect AI + ORS (ENG-51) Add benchflow.adapters package with thin format converters: - InspectAdapter / to_inspect_task: Scene + Rubric -> Inspect AI task dict - ORSAdapter / to_ors_reward: VerifyResult + RewardEvent -> ORS format dict No external dependencies required — these are pure format converters. Re-exported from benchflow top-level __init__.py. 14 tests covering both adapters, convenience functions, and round-trips. 
* style: fix ruff format for adapters and tests * fix: convention fixes + trial shim re-exports + __getattr__ narrowing - Parametrize bare dict returns to dict[str, Any] in adapters - Remove unnecessary @dataclass on adapter classes (no fields) - Add from __future__ import annotations to evaluation.py - Fix __getattr__ to re-raise ModuleNotFoundError for broken deps - Add private helper re-exports to trial.py shim for backward compat - Convert test_save_trajectory to async (fixes event loop deprecation) * rename: backend → sandbox across API, CLI, docs, and tests Aligns naming with the v0.4 Sandbox protocol (ENG-48). The 'backend' parameter/flag/attribute on Environment and the CLI is now 'sandbox' everywhere. Scope: - runtime.py: Environment.__init__(sandbox=), .sandbox, from_task(sandbox=) - cli/main.py: --sandbox flag (was --backend), help text - _snapshot.py: docstring updates - tests/test_runtime.py: updated assertions - docs/: all references in task-authoring, CLI ref, Python API ref, running-benchmarks, integration-tests, examples NOT renamed (different meaning): - _credentials.py 'Vertex AI backend' (provider concept) - _provider_runtime.py / bedrock_proxy.py backend_model (LLM model) - _sandbox.py 'build backends' (Python packaging) * docs: update for v0.4 — Rollout as canonical name, add Sandbox/Rubric/Adapter docs - README: 'Sandbox backends' → 'Sandboxes', table says Rollout not Trial - concepts.md: Environment no longer says 'Backed by Harbor', Trial → Rollout as core primitive, lifecycle diagram uses Rollout.run() - getting-started.md: uses RolloutConfig, links to rollout lifecycle - python-api.md: RolloutConfig (aliased as TrialConfig), Rollout (aliased as Trial), canonical imports shown first, added v0.4 types section with Sandbox protocol, Rubric + RewardFunc, Adapters, Evaluation, backward-compat aliases * fix: correct Rubric/RewardFunc API examples in python-api.md Fixes 3 issues flagged by Devin Review: - StringMatchRewardFunc: remove 
non-existent 'field' parameter - Rubric: use reward_funcs + weights instead of items=[(func, weight)] - rubric.score: takes rollout_dir: Path, not sandbox= * fix: correct evaluation.py docstring aliases, clean stale comment in __init__.py * docs: remove Harbor migration references - use-cases.md: remove 'How it works vs Harbor' and 'Migration from Harbor' sections - scene-patterns.ipynb: remove 'Mapping to Harbor PR #1462 Concepts' cell - coder-reviewer-demo.py: 'Harbor-format task directory' → 'BenchFlow task directory' - swebench notebook: remove Harbor #1316 reference from intro * CLI: unify sandbox flag (--env → --sandbox, -b → -e) [ENG-56] (#281) * cli: unify sandbox flag --env → --sandbox, -b → -e (ENG-56) Standardize all CLI commands to use --sandbox/-e for the sandbox parameter. Previously bench run used --sandbox/-b while bench eval create and others used --env/-e. - bench run: -b → -e - bench eval create/run/compare: --env → --sandbox - bench eval progress: --env → --sandbox - bench skills eval: --env → --sandbox - bench environment create: -b → -e - docs/examples: updated flag references * cli: drop -e/-b short flags, use --sandbox only (ENG-56) Remove all short flags (-e, -b) for the sandbox parameter across all CLI commands. The long flag --sandbox is the only way to specify the sandbox now. Updated 15 files: CLI definitions, help text examples, README, docs, skill files, integration test runner, and test examples. --------- Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> * cli: deprecate bench run in favor of bench eval create (ENG-57) (#282) Per ENG-46, bench eval create is the canonical CLI entry point. bench run goes through the old SDK shim and is now deprecated, matching the pattern used for bench job and other legacy commands. Python API bf.run() is unaffected (backward-compat alias stays). 
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* feat(ENG-55): first-class LLM-as-judge verifier with dense reward (#277)
* feat(ENG-55): first-class LLM-as-judge verifier with dense reward
  - Add rubric_config.py: Pydantic models for rubric.toml (Criterion, JudgeConfig, ScoringConfig, RubricConfig) with binary/likert/numeric criterion types and score normalization
  - Add llm.py: Multi-provider LLM routing (Anthropic/OpenAI/Gemini) with call_judge(), parse_verdict(), exponential backoff retry
  - Add file_readers.py: Document extraction (pdf/docx/xlsx/pptx/txt) with find_deliverables() for rollout directory scanning
  - Replace placeholder LLMJudgeRewardFunc in builtins.py with full rubric-based judge: per-criterion scoring, prompt templates, configurable aggregation (weighted_mean/all_pass/any_pass/threshold), backward-compatible legacy mode
  - Emit per-criterion dense RewardEvents for fine-grained reward signals
  - Write evaluation_details.json alongside rollout for transparency
  - Re-export Criterion, JudgeConfig, RubricConfig, ScoringConfig, load_rubric_toml from rewards and top-level __init__
  - Add test_rubric_config.py (15 tests) and test_llm_judge.py (30 tests)
  Closes ENG-55
* fix(ENG-55): async clients, asyncio.sleep, and score mismatch bugs
  - Replace sync SDK clients with async variants (AsyncAnthropic, AsyncOpenAI, client.aio for Google)
  - Replace time.sleep() with await asyncio.sleep() in call_judge retry loop
  - Pass actual aggregated score to _write_details instead of recomputing n_passed/total
* docs(ENG-55): add LLM-as-judge verifier documentation
  - Add docs/llm-judge.md: full user-facing guide with rubric.toml reference, criterion types, aggregation strategies, dense reward events, multi-provider routing, file discovery, inline criteria, and worked examples
  - Add src/benchflow/rewards/README.md: module-level guide with usage examples
  - Update docs/concepts.md: reference LLM judge in Verifier primitive and add to 'Where to go next' links
* docs: polish — remove ENG-55 ticket ref, clarify rubric.json exclusion
---------
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* CLI: remove all short flags, use full names only [ENG-74] (#284)
* CLI: remove all short flags, use full names only [ENG-74]
  Remove all single-letter short flags (-a, -m, -o, -t, -c, -f, -s, -p, -b, -d) from every CLI command. Full flag names only (--agent, --model, --jobs-dir, --tasks-dir, --concurrency, --config, --skills-dir, --prompt, --benchmark, --dir). Updated across 23 files:
  - src/benchflow/cli/main.py: all typer.Option definitions + docstring examples
  - All docs/ (getting-started, running-benchmarks, cli.md, skill-eval, integration-tests, examples/)
  - README.md
  - .claude/skills/ (SKILL.md + references)
  - tests/ (integration/run.sh, test_codex_custom_provider.sh, conformance/README.md)
  - benchmarks/ (models-as-skills.md, harvey-lab/run_harvey_lab.py)
  - Test fix: test_skill_eval_dryrun.py invocation updated from -a to --agent
* fix: replace remaining -f with --config in task-embedded SKILL.md files
---------
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* Backport PR #230 and #242 fixes to refactor/v0.4 (#286)
* fix: apply PR #230 (skills double-deploy) and PR #242 (shell injection) fixes to v0.4
  PR #230: pass effective task path to deploy_skills, set effective_skills in already-injected branch
  PR #242: shlex.quote outbox file paths in _scene.py, harden _snapshot.py with path validation
* fix: set effective_skills to /skills when Dockerfile already injected
  When deploy_skills detects the Dockerfile already bakes in `COPY _deps/skills /skills/`, the runtime upload is skipped but effective_skills was left as task.config.environment.skills_dir (often None), causing the subsequent linking step to silently skip distribution to agent-specific paths like /home/agent/.agents/skills. Sandbox users relied on that runtime linking since Dockerfile injection only links under /root/.
  Tighten the regression test to assert env.exec was called with the expected `ln -sfn` command.
---------
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
* fix locally
* release: 0.3.4
---------
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
* refactor: remove Harbor references, purge backward-compat aliases in tests, update terminology
  - Remove 'terminal-bench' from pyproject.toml keywords
  - Rename _make_harbor_mock → _make_sandbox_mock in test files
  - Update Harbor patch paths to sandbox equivalents in tests
  - Rename _from_harbor_yaml → _from_legacy_yaml in evaluation.py
  - Update docstrings/comments: Harbor → BenchFlow/legacy terminology
  - Internalize Harbor types into benchflow.task/ subpackage
  - Create sandbox backends: base.py, docker_impl.py, daytona_impl.py, modal_impl.py
  - Add compose/ subpackage with docker-compose YAML templates
* refactor: modernize file structure — align with test-drive patterns
  - sandbox/: merge docker_impl→docker, daytona_impl→daytona, base→_base
  - sandbox/: move _sandbox→lockdown, _snapshot→snapshot, _daytona_patches→_sdk_ops
  - sandbox/: move process, user, environments→services into sandbox/
  - sandbox/: restructure compose/ → _compose.py + _compose_files/
  - agents/: move _credentials→credentials, _agent_env→env, _agent_setup→install
  - _utils/: create subpackage with yaml_loader, benchmark_repos, task_authoring
  - trajectories/: move viewer, _trajectory→_capture into subpackage
  - experimental/: move mcp/ into experimental/ subpackage
  - __init__.py: purge backward-compat aliases (Trial, Job, etc.)
  - Update all imports across src/, tests/, experiments/
  - Run ruff check + format (0 errors)
* refactor: purge backward-compat aliases + fix review bugs
  - Remove Trial=Rollout, TrialConfig=RolloutConfig from rollout.py
  - Remove Job=Evaluation, JobConfig=EvaluationConfig, JobResult=EvaluationResult from evaluation.py
  - Delete shim files: trial.py, job.py
  - Update all imports: benchflow.trial → benchflow.rollout, benchflow.job → benchflow.evaluation
  - Rename classes: Trial→Rollout, Job→Evaluation across tests/, benchmarks/, experiments/
  - Fix runtime.py: environment_type→sandbox_type, trial_name→rollout_name (review bug)
  - Fix conformance scripts: same keyword arg fixes (review bug)
  - Clean up __all__ and return type annotations
* docs: update terminology to v0.4 (Trial→Rollout, Job→Evaluation) + bump version to 0.4.0
* chore: update uv.lock for v0.4.0
* refactor: modernize RL terminology — trial_dir→rollout_dir, trial_name→rollout_name across codebase
* style: format viewer.py
* fix: exclude optional-dep sandbox files from ty check
* fix: configure ty to ignore unresolved-import + exclude optional-dep files
* fix: resolve all test failures — patch paths, protocol conformance, RL terminology
* fix: resolve merge conflicts, update broken imports and stale references
  - Resolve merge conflict markers in rollout.py and evaluation.py
  - Update task_download imports to benchflow._utils.benchmark_repos (11 files)
  - Fix benchflow.tasks → benchflow._utils.task_authoring import
  - Update scene-patterns.ipynb: TrialConfig→RolloutConfig, trial→rollout
  - Fix Job→Evaluation rename in docs
  - Ruff auto-fixes (import ordering)
* style: format 12 files to pass ruff format check
* fix: resolve ty check errors in traces/ — dict[str, object] → dict[str, Any]
* fix: add skip guards for optional dependency tests (daytona, modal)
* fix: update stale imports and references in docs/benchmarks (Devin Review)
* refactor: consolidate underscore modules into proper subpackages
  - _acp_run.py → acp/runtime.py
  - _env_setup.py → sandbox/setup.py
  - _provider_runtime.py → providers/runtime.py
  - _scoring.py → _utils/scoring.py
  - _scene.py → scenes.py (removed Role=SceneRole backward-compat alias)
  - Fix GEMINI_API_KEY auto-detection: inherit from os.environ as fallback after .env
* feat: ACPX integration — acpx/<agent> protocol for headless ACP agent invocation
  - Add 'acpx' as a valid protocol in parse_agent_spec (acpx/claude, acpx/codex, etc.)
  - _acpx_wrap() decorates any registered agent to launch via acpx CLI
  - Installs acpx alongside the underlying agent in the sandbox
  - Preserves all agent env, credentials, and skill_paths
  - Replace harbor protocol references with acpx in tests
* style: format registry.py for ruff format check
* fix: suppress ty invalid-assignment for monkey-patch in _patch_docker_dind
* docs: update docs and SKILL.md for v0.4 — RL-first terminology, ACPX, purge backward-compat
* fix: rename trial_config_from_yaml → rollout_config_from_yaml, fix stale _run import in sdk.py
* fix: docs inconsistencies found by dogfooding — GEMINI_API_KEY, daytona install note
* fix: docs path evaluations/ → jobs/ to match actual output directory
* fix: ruff import ordering in sdk.py
* fix: ty type-check suppression for sdk.py run() return type
* feat: add sandbox-daytona and sandbox-modal optional dependency groups
* chore: update uv.lock for sandbox-daytona and sandbox-modal optional deps
---------
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
* skill eval -> refactor (#293)
* fix: address v0.4 dogfood blockers
* test: cover skill eval nested results
* chore: ignore playwright mcp artifacts
* fix: validate config eval agent protocol
* test: add adapter release evidence checker
* test: wire adapter evidence into release runner
* fix: address ENG-91 dogfood regressions
* fix: record role metadata and timeouts
* fix: align dogfood cli reports
* docs: document environment cleanup command
* docs: remove stale harbor task path examples
* Add citation-management skill eval under skills
* Update src/benchflow/cli/main.py
  Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
---------
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
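For reviewers who want to see what "full flag names only" means at the parser level, here is a minimal sketch of the ENG-74 entry in the commit list above. It is not the code in src/benchflow/cli/main.py, which uses typer.Option; it uses the standard library's argparse as a stand-in so it runs without extra dependencies. The program name `bench-sketch`, the defaults, and the helper names `build_parser` and `parses` are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that exposes long option names only (illustrative)."""
    # allow_abbrev=False stops argparse from accepting prefixes such as --ag,
    # so the full name is the only accepted spelling.
    parser = argparse.ArgumentParser(prog="bench-sketch", allow_abbrev=False)
    # No "-a", "-m", or "-c" aliases are registered: each option has one name.
    parser.add_argument("--agent", default=None, help="Agent to run")
    parser.add_argument("--model", default=None, help="Model identifier")
    parser.add_argument("--concurrency", type=int, default=1, help="Parallel rollouts")
    return parser


def parses(argv: list[str]) -> bool:
    """Return True if argv is accepted, False if argparse rejects it."""
    try:
        build_parser().parse_args(argv)
    except SystemExit:
        # argparse reports unrecognized arguments by exiting with status 2.
        return False
    return True
```

With this sketch, `parses(["--agent", "codex"])` is accepted, while `parses(["-a", "codex"])` and the abbreviated `parses(["--ag", "codex"])` are both rejected, which is the behaviour the `--help` checks in the review checklist are meant to confirm for the real CLI.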

devin-ai-integration Bot assigned xdotli May 17, 2026

devin-ai-integration Bot commented May 17, 2026


fix: replace remaining -f with --config in task-embedded SKILL.md files

34b09ef
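The follow-up commit above exists because a few `-f` usages survived the first sweep. A small scanner along the following lines could catch such leftovers; it is only a sketch, not part of the PR. The flag letters come from the PR summary table, while the function name `find_short_flags` and the regular expression are assumptions.

```python
import re

# Letters of the short flags the PR summary lists as removed.
REMOVED_SHORT_FLAGS = "amotcfspbd"

# Match a single dash plus one listed letter, standing alone as a token:
# not preceded or followed by a word character or another dash.
SHORT_FLAG_RE = re.compile(rf"(?<![\w-])-[{REMOVED_SHORT_FLAGS}](?![\w-])")


def find_short_flags(text: str) -> list[tuple[int, str]]:
    """Return (line_number, flag) pairs for each removed short flag in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in SHORT_FLAG_RE.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```

Long options such as `--config` and `--skills-dir` and bundled flags such as `ln -sfn` are not reported, because the lookarounds require the flag to stand alone. The scanner cannot tell a benchflow flag from the same letter used by another tool (for example `grep -c`), so hits in shell scripts still need a manual look.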

xdotli merged commit 80f2d3a into refactor/v0.4 May 17, 2026
1 check passed

xdotli deleted the devin/1779001175-remove-short-flags branch May 19, 2026 15:41

xdotli merged 2 commits into refactor/v0.4 from devin/1779001175-remove-short-flags
