refactor: BenchFlow v0.3.4 — unified types, Rollout, Sandbox protocol, rewards, adapters by devin-ai-integration[bot] · Pull Request #274 · benchflow-ai/benchflow

devin-ai-integration · 2026-05-16T01:17:30Z

Summary

BenchFlow v0.4 architecture refactor — consolidated from 6 individual PRs (#261, #262, #265, #266, #268, #271) merged in dependency order.

Core changes:

ENG-47: Unified Scene/Role/Turn types into _types.py, added parallel_group + per-role timeouts
ENG-48: Sandbox protocol (Sandbox, ImageBuilder) with Docker/Daytona adapters; Harbor internalized
ENG-46: Single execution path — Rollout replaces Trial as canonical name, backward-compat aliases preserved
ENG-49: Composable rewards — Rubric + RewardFunc protocol, RewardEvent for dense rewards
ENG-50: Agent-as-tool infrastructure (capabilities, sandbox networking)
ENG-51: External framework adapters (InspectAdapter, ORSAdapter)

Additional on this branch:

backend → sandbox rename across CLI/API/docs (11 files)
Python convention fixes (from __future__ import annotations, typed returns, @dataclass cleanup)
__getattr__ fix: correctly re-raises ModuleNotFoundError instead of swallowing as AttributeError
Full docs audit: updated README, concepts.md, getting-started.md, python-api.md for v0.4
Removed all Harbor migration/reference content from docs (use-cases.md, scene-patterns.ipynb, coder-reviewer-demo.py, swebench notebook)
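
The `__getattr__` item above concerns module-level lazy exports (PEP 562). A minimal sketch of the intended behavior, assuming a lazy-export table; the table contents and the helper name `lazy_getattr` are hypothetical and are not taken from BenchFlow's code:

```python
import importlib
from typing import Any

# Hypothetical lazy-export table: attribute name -> module that defines it.
_LAZY_EXPORTS = {
    "OrderedDict": "collections",
    "BrokenThing": "benchflow_missing_optional_dep",  # deliberately not installed
}


def lazy_getattr(name: str) -> Any:
    """Resolve a lazily exported name the way a module-level __getattr__ would."""
    module_name = _LAZY_EXPORTS.get(name)
    if module_name is None:
        # Unknown name: the only case that should look like a missing attribute.
        raise AttributeError(f"module has no attribute {name!r}")
    # No try/except here: a ModuleNotFoundError from a broken dependency
    # propagates unchanged instead of being disguised as AttributeError.
    module = importlib.import_module(module_name)
    return getattr(module, name)
```

With this shape, a caller sees the real cause (the missing dependency) rather than a misleading "no attribute" message.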

Backward compatibility:

All old names preserved as identity-equal aliases: Trial, TrialConfig, Job, JobConfig, RunResult, RuntimeConfig, Agent, Environment
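
"Identity-equal" here means the old name is bound to the very same class object, not to a subclass or wrapper. A small self-contained illustration with stand-in classes (these are not the real BenchFlow classes):

```python
class Rollout:
    """Stand-in for the canonical class."""


class RolloutConfig:
    """Stand-in for the canonical config class."""


# Backward-compat aliases are plain assignments, so both names
# refer to one object and isinstance checks keep working.
Trial = Rollout
TrialConfig = RolloutConfig
```

Because no new class is created, `Trial is Rollout` holds and existing `isinstance(x, Trial)` checks behave exactly as before.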

Review & Testing Checklist for Human

Verify from benchflow import Rollout, RolloutConfig, Rubric, Sandbox works
Verify backward-compat: from benchflow import Trial; assert Trial is Rollout
Run bench eval create -t tasks/jax-computing-basics -a gemini -m gemini-3.1-flash-lite-preview -e daytona end-to-end
Spot-check docs (concepts.md, python-api.md) — no more Harbor references, examples use --sandbox not --backend
Run full test suite: uv run pytest (963 tests expected, 2 pre-existing env-var-leak failures)

Notes

13/13 doc examples dogfooded against this branch — all pass
11/11 integration tests passed (including adapter round-trip tests)
CI green: 961 tests pass, lint/format/typecheck clean
Linear tickets ENG-43 through ENG-51 all marked Done

Link to Devin session: https://app.devin.ai/sessions/206c8b356e29441d88a8882b45c4442e
Requested by: @xdotli

Create BenchFlow-native Sandbox protocol as parallel types alongside existing Harbor imports. No behavioral changes: existing code continues to use Harbor directly.

New files:
- src/benchflow/sandbox/protocol.py: ExecResult, Sandbox, ImageRef, ImageConfig, ImageBuilder
- src/benchflow/sandbox/docker.py: DockerSandbox adapter wrapping Harbor DockerEnvironment
- src/benchflow/sandbox/daytona.py: DaytonaSandbox adapter wrapping Harbor DaytonaEnvironment
- tests/test_sandbox_protocol.py: 26 tests covering dataclasses, protocol conformance, and delegation
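
A rough sketch of the protocol-plus-adapter shape described in this commit. The field names on `ExecResult`, the `exec` signature, and the `FakeEnvironment` class are assumptions made for illustration; only the type names `ExecResult` and `Sandbox` come from the commit message:

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class ExecResult:
    # Field names are assumptions for this sketch.
    return_code: int
    stdout: str = ""
    stderr: str = ""


@runtime_checkable
class Sandbox(Protocol):
    async def exec(self, command: str) -> ExecResult: ...


class FakeEnvironment:
    """Hypothetical stand-in for the wrapped environment object."""

    async def run(self, command: str) -> tuple[int, str, str]:
        return 0, f"ran: {command}", ""


class AdapterSandbox:
    """Satisfies Sandbox structurally by delegating to the wrapped environment."""

    def __init__(self, env: FakeEnvironment) -> None:
        self._env = env

    async def exec(self, command: str) -> ExecResult:
        code, out, err = await self._env.run(command)
        return ExecResult(return_code=code, stdout=out, stderr=err)


sandbox = AdapterSandbox(FakeEnvironment())
result = asyncio.run(sandbox.exec("echo hi"))
```

The adapter never inherits from `Sandbox`; conformance is structural, which is what lets existing environment classes be wrapped without changing them.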

Create src/benchflow/_types.py as the single canonical source for the declarative Role, Scene, and Turn dataclasses.

Changes:
- New _types.py with Role (adds timeout_sec, idle_timeout_sec, skills_dir), Scene (adds parallel_group), and Turn
- trial.py imports from _types.py instead of defining its own copies
- _scene.py renames its internal Role to SceneRole (different fields: instruction, tools) with backward-compat alias
- __init__.py re-exports canonical types from _types.py; adds TrialRole/TrialScene backward-compat aliases
- runtime.py and trial_yaml.py updated to import from _types.py

New fields default to None, so runtime behavior does not change.
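
To make the "new fields default to None" point concrete, here is a sketch of what such dataclasses could look like. The new field names (`timeout_sec`, `idle_timeout_sec`, `skills_dir`, `parallel_group`) come from the commit message; every other field and all of the types are assumptions:

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Role:
    name: str                               # assumed field
    agent: str                              # assumed field
    timeout_sec: float | None = None        # new, optional
    idle_timeout_sec: float | None = None   # new, optional
    skills_dir: str | None = None           # new, optional


@dataclass
class Turn:
    role: str                               # assumed field
    prompt: str | None = None               # assumed field


@dataclass
class Scene:
    name: str                               # assumed field
    roles: list[Role] = field(default_factory=list)
    turns: list[Turn] = field(default_factory=list)
    parallel_group: str | None = None       # new, optional
```

Callers that construct `Role` or `Scene` without the new arguments get the same objects as before, which is why the change is additive.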

…_file

- Use shlex.quote(path) in read_file to prevent shell injection
- Clean up temp files with os.unlink in write_file finally block
- Move imports to module level

- read_file now raises FileNotFoundError on non-zero return code
- Export ImageConfig alongside ImageBuilder in __init__.py
- Add tests for read_file error behavior
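
The two review fixes above (quoting the path, raising on a non-zero return code) can be sketched as follows. This version uses a local `subprocess` call with `cat` as a stand-in for the sandbox's exec call, which is an assumption; it needs a POSIX shell to run:

```python
import os
import shlex
import subprocess
import tempfile


def read_file(path: str) -> str:
    """Read a file through a shell command, quoting the path safely."""
    # shlex.quote keeps spaces and shell metacharacters in `path`
    # from being interpreted by the shell.
    proc = subprocess.run(
        f"cat {shlex.quote(path)}",
        shell=True,
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        # Surface the failure instead of returning empty output.
        raise FileNotFoundError(f"{path}: {proc.stderr.strip()}")
    return proc.stdout


# A file name that would break an unquoted command line.
with tempfile.TemporaryDirectory() as tmp:
    tricky = os.path.join(tmp, "notes; echo injected.txt")
    with open(tricky, "w") as fh:
        fh.write("hello")
    content = read_file(tricky)
```

Without the quoting, the `;` in the file name would end the `cat` command and run `echo` as a second command.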

…ENG-47) refactor: unify Scene/Role/Turn types into _types.py (ENG-47)

…48 Phase A) refactor: add Sandbox protocol + Harbor adapters (ENG-48 Phase A)

- Rename trial.py → rollout.py, Trial → Rollout, TrialConfig → RolloutConfig
- Rename RunResult → RolloutResult in models.py
- Create evaluation.py from job.py: Job → Evaluation, JobConfig → EvaluationConfig, JobResult → EvaluationResult
- Create _run.py with bf.run(RolloutConfig) → RolloutResult entry point
- Replace sdk.py with thin backward-compat shim delegating to rollout.py
- Replace trial.py with re-export shim from rollout.py
- Replace job.py with re-export shim from evaluation.py
- Preserve runtime.py API, update internal imports to use Rollout
- Update __init__.py: new public API + backward-compat aliases
- Update test patch targets from benchflow.trial → benchflow.rollout
- All 841 tests pass, lint clean, pre-existing typecheck errors unchanged

- Create src/benchflow/rewards/ package with:
  - RewardFunc protocol (single scoring dimension)
  - Rubric dataclass (weighted collection of RewardFuncs)
  - VerifyResult dataclass (aggregated scoring result)
  - RewardEvent dataclass (dense/terminal reward signals)
  - Built-in funcs: TestRewardFunc, LLMJudgeRewardFunc, StringMatchRewardFunc, CodeExecRewardFunc
- Add reward_events field to RunResult (additive, no breaking changes)
- Re-export all reward types from benchflow top-level __init__.py
- Backward compat: Rubric([TestRewardFunc()]) wraps existing test.sh -> reward.txt flow
- 31 tests covering all reward types, rubric scoring, weights, error handling, protocol conformance, and re-exports
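
A minimal sketch of how a weighted Rubric over RewardFuncs could compose. The `reward_funcs` and `weights` parameters and the `rollout_dir: Path` argument match what is mentioned elsewhere in this PR; the synchronous `score` method, the normalization by the weight sum, the `items` key format, and `ConstantReward` are assumptions for illustration:

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path
from typing import Protocol


class RewardFunc(Protocol):
    def score(self, rollout_dir: Path) -> float: ...


@dataclass
class VerifyResult:
    reward: float
    items: dict[str, float] = field(default_factory=dict)


@dataclass
class ConstantReward:
    """Hypothetical reward func that always returns a fixed value."""

    value: float

    def score(self, rollout_dir: Path) -> float:
        return self.value


@dataclass
class Rubric:
    reward_funcs: list[RewardFunc]
    weights: list[float] | None = None

    def score(self, rollout_dir: Path) -> VerifyResult:
        # Equal weights when none are given (assumption).
        weights = self.weights or [1.0] * len(self.reward_funcs)
        scores = [f.score(rollout_dir) for f in self.reward_funcs]
        total = sum(weights)
        reward = sum(w * s for w, s in zip(weights, scores)) / total if total else 0.0
        items = {
            f"{i}:{type(f).__name__}": s
            for i, (f, s) in enumerate(zip(self.reward_funcs, scores))
        }
        return VerifyResult(reward=reward, items=items)


rubric = Rubric([ConstantReward(0.8), ConstantReward(0.2)], weights=[0.7, 0.3])
result = rubric.score(Path("."))
```

With weights 0.7 and 0.3 and per-func scores 0.8 and 0.2, the aggregate is 0.7 × 0.8 + 0.3 × 0.2 = 0.62, up to floating-point rounding.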

… name Append _N suffix on collision so all per-func scores are preserved. Addresses Devin Review feedback on PR #266.
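
One way the `_N` suffix rule could work, shown as a standalone helper. The helper name `unique_keys` and the exact numbering (first occurrence unsuffixed, later ones `_2`, `_3`, ...) are assumptions; the commit only states that a suffix is appended on collision:

```python
def unique_keys(names: list[str]) -> list[str]:
    """Append _N to repeated names so no key overwrites another."""
    seen: dict[str, int] = {}
    keys: list[str] = []
    for name in names:
        count = seen.get(name, 0) + 1
        seen[name] = count
        # First occurrence keeps the bare name; later ones get _2, _3, ...
        keys.append(name if count == 1 else f"{name}_{count}")
    return keys
```

For example, two `StringMatchRewardFunc` instances in one rubric would be reported under `StringMatchRewardFunc` and `StringMatchRewardFunc_2` instead of the second overwriting the first.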

- Add Role.capabilities field for declarative agent capability lists
- Add Sandbox.expose_ports property for inter-agent communication ports
- Document capability boundary: BenchFlow provides sandbox + instruction + observation; agents own loops, tool protocols, and agent-as-tool
- Update _scene.py and _types.py module docstrings with ENG-50 boundary
- Add 16 tests covering capabilities, expose_ports, and Coder→Reviewer→Coder turn sequences

…ath (ENG-46) refactor: kill shim layers, single Rollout execution path (ENG-46)

…cture (ENG-50) feat: per-agent capabilities + agent-as-tool infrastructure (ENG-50)

feat: composable Rubric + RewardFunc protocol (ENG-49)

Add benchflow.adapters package with thin format converters:
- InspectAdapter / to_inspect_task: Scene + Rubric -> Inspect AI task dict
- ORSAdapter / to_ors_reward: VerifyResult + RewardEvent -> ORS format dict

No external dependencies required; these are pure format converters. Re-exported from benchflow top-level __init__.py. 14 tests covering both adapters, convenience functions, and round-trips.

feat: external adapters for Inspect AI + ORS (ENG-51)

- Parametrize bare dict returns to dict[str, Any] in adapters
- Remove unnecessary @dataclass on adapter classes (no fields)
- Add from __future__ import annotations to evaluation.py
- Fix __getattr__ to re-raise ModuleNotFoundError for broken deps
- Add private helper re-exports to trial.py shim for backward compat
- Convert test_save_trajectory to async (fixes event loop deprecation)

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

…o rollout.py Port ensure_bedrock_proxy_runtime/stop_provider_runtime from trial.py (PR #267) into rollout.py execution paths: connect(), cleanup(), and connect_as(). Update trial.py shim with _provider_runtime re-exports. Fix test patches to target benchflow.rollout instead of benchflow.trial.

devin-ai-integration · 2026-05-16T17:14:03Z

Test Results — BenchFlow v0.4 Refactor

Branch: refactor/v0.4 @ 5b80c34 (includes merge of main + Bedrock PR #267)
Session: Devin

Results: 8/8 passed

E2E Integration Test (bench eval create)

Full pipeline completed successfully:

Task: jax-computing-basics (from benchflow-ai/skillsbench)
Agent: gemini (gemini-3.1-flash-lite-preview)
Backend: daytona
Pipeline: task download → Daytona env → Dockerfile build → agent install → sandbox user → ACP connection → 15 tool calls → verifier → reward extraction
Reward: 0.0 (expected for lightweight model — pipeline correctness is what matters)
Exit code: 0

API & Compatibility Tests (7/7)

| # | Test | Result |
|---|------|--------|
| 1 | New v0.4 imports (24 types: `Rollout`, `Sandbox`, `Rubric`, `RewardFunc`, etc.) | ✅ |
| 2 | Backward-compat aliases identity-equal (`Trial is Rollout`, `Job is Evaluation`, etc.) | ✅ |
| 3 | `__getattr__` re-raises broken dep `ModuleNotFoundError` (not swallowed as `AttributeError`) | ✅ |
| 4 | `Sandbox` protocol is `@runtime_checkable` (`isinstance` works) | ✅ |
| 5 | `Rubric` composition: `score()` → `VerifyResult(reward=1.0)` with `RewardEvent` | ✅ |
| 6 | Full test suite: 963 passed, 0 failed, 28s | ✅ |
| 7 | Lint (`ruff check`) + format (`ruff format --check`): all clean | ✅ |

Merge Conflict Resolution (Bedrock PR #267)

Ported ensure_bedrock_proxy_runtime / stop_provider_runtime into rollout.py (connect(), cleanup(), connect_as())
Updated trial.py shim with _provider_runtime re-exports
Fixed Bedrock test patches to target benchflow.rollout instead of benchflow.trial
All Bedrock tests pass

devin-ai-integration · 2026-05-16T17:32:50Z

Adapter Feature Tests — Follow-up

Added 3 more tests specifically exercising the ENG-51 adapter features (InspectAdapter + ORSAdapter):

Results: 3/3 passed

Test 9: InspectAdapter (to_inspect_task)

Exercised to_inspect_task() with realistic data:

Multi-role scene (coder + reviewer, 3 turns) → correct name, dataset (3 samples with input/role)
Without rubric → no scorer field
With rubric → scorer: {type: "benchflow_rubric", reward_funcs: 2, weights: [0.6, 0.4]}
Edge cases: empty scene → dataset=[]; None prompt → ""
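
The behaviors listed for Test 9 can be reproduced with a converter along these lines. The output keys (`name`, `dataset`, `input`, `role`, `scorer`, `benchflow_rubric`) are taken from the test description; the `Scene`, `Turn`, and `Rubric` stand-ins and their fields are assumptions, so this is not the actual adapter:

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class Turn:
    role: str
    prompt: str | None = None


@dataclass
class Scene:
    name: str
    turns: list[Turn] = field(default_factory=list)


@dataclass
class Rubric:
    reward_funcs: list[Any]
    weights: list[float]


def to_inspect_task(scene: Scene, rubric: Rubric | None = None) -> dict[str, Any]:
    """Convert a scene (and optional rubric) into a plain task dict."""
    task: dict[str, Any] = {
        "name": scene.name,
        # One sample per turn; a missing prompt becomes an empty string.
        "dataset": [{"input": t.prompt or "", "role": t.role} for t in scene.turns],
    }
    if rubric is not None:
        # The scorer entry only describes the rubric; it does not run it.
        task["scorer"] = {
            "type": "benchflow_rubric",
            "reward_funcs": len(rubric.reward_funcs),
            "weights": list(rubric.weights),
        }
    return task
```

Since the function only builds dictionaries, it needs no Inspect AI dependency, which is consistent with the "pure format converter" description.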

Test 10: ORSAdapter (to_ors_reward)

Exercised to_ors_reward() and ORSAdapter:

Success: VerifyResult(reward=0.75) → {reward: 0.75, is_valid: true, metadata: {items: {...}, events: [...]}}
Error: VerifyResult(error="Verifier timed out") → {is_valid: false, error: "Verifier timed out"}
Dense event step=3 preserved; all timestamp/type/source/reward fields correct
reward_event_to_ors() individual event conversion works
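
A sketch of a converter that produces the shapes shown for Test 10. The output keys come from the test description; the dataclass fields, the choice to carry events on `VerifyResult`, and the default values are assumptions:

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class RewardEvent:
    reward: float
    type: str = "terminal"
    source: str = ""
    timestamp: float = 0.0
    step: int | None = None


@dataclass
class VerifyResult:
    reward: float = 0.0
    items: dict[str, float] = field(default_factory=dict)
    events: list[RewardEvent] = field(default_factory=list)  # assumed location
    error: str | None = None


def reward_event_to_ors(event: RewardEvent) -> dict[str, Any]:
    """Convert one event to a plain dict, keeping every field."""
    return asdict(event)


def to_ors_reward(result: VerifyResult) -> dict[str, Any]:
    """Convert a verification result to an ORS-style dict."""
    if result.error is not None:
        # A failed verification is reported as invalid, with the reason.
        return {"is_valid": False, "error": result.error}
    return {
        "reward": result.reward,
        "is_valid": True,
        "metadata": {
            "items": dict(result.items),
            "events": [reward_event_to_ors(e) for e in result.events],
        },
    }
```

Dense events keep their `step` value through the conversion because `asdict` copies every dataclass field.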

Test 11: Full Adapter Round-Trip

End-to-end: Scene.single(agent='gemini') → to_inspect_task(scene, rubric) → rubric.score() → to_ors_reward(result)

Weighted scoring: 0.7 × 0.8 + 0.3 × 0.2 = 0.62 ✓
All RewardEvent fields preserved through Rubric → ORS pipeline (reward, source, type, timestamp)

Total across both test runs: 11/11 passed (8 original + 3 adapter tests)

Aligns naming with the v0.4 Sandbox protocol (ENG-48). The 'backend' parameter/flag/attribute on Environment and the CLI is now 'sandbox' everywhere.

Scope:
- runtime.py: Environment.__init__(sandbox=), .sandbox, from_task(sandbox=)
- cli/main.py: --sandbox flag (was --backend), help text
- _snapshot.py: docstring updates
- tests/test_runtime.py: updated assertions
- docs/: all references in task-authoring, CLI ref, Python API ref, running-benchmarks, integration-tests, examples

NOT renamed (different meaning):
- _credentials.py 'Vertex AI backend' (provider concept)
- _provider_runtime.py / bedrock_proxy.py backend_model (LLM model)
- _sandbox.py 'build backends' (Python packaging)

…/Adapter docs

- README: 'Sandbox backends' → 'Sandboxes', table says Rollout not Trial
- concepts.md: Environment no longer says 'Backed by Harbor', Trial → Rollout as core primitive, lifecycle diagram uses Rollout.run()
- getting-started.md: uses RolloutConfig, links to rollout lifecycle
- python-api.md: RolloutConfig (aliased as TrialConfig), Rollout (aliased as Trial), canonical imports shown first, added v0.4 types section with Sandbox protocol, Rubric + RewardFunc, Adapters, Evaluation, backward-compat aliases

Fixes 3 issues flagged by Devin Review:
- StringMatchRewardFunc: remove non-existent 'field' parameter
- Rubric: use reward_funcs + weights instead of items=[(func, weight)]
- rubric.score: takes rollout_dir: Path, not sandbox=

devin-ai-integration

Devin Review found 2 new potential issues.

View 19 additional findings in Devin Review.

devin-ai-integration · 2026-05-16T18:27:32Z

+Backward-compat aliases: ``Job = Evaluation``, ``EvaluationConfig = EvaluationConfig``,
+``EvaluationResult = EvaluationResult``.


🟡 Incorrect backward-compat alias documentation in evaluation.py docstring

The module docstring documents the backward-compat aliases as tautologies (EvaluationConfig = EvaluationConfig, EvaluationResult = EvaluationResult) instead of the actual mappings (JobConfig = EvaluationConfig, JobResult = EvaluationResult). The actual code at src/benchflow/evaluation.py:677-679 defines the correct aliases, but the docstring at the top of the file will mislead developers trying to understand the backward-compat story.

Suggested change

-Backward-compat aliases: ``Job = Evaluation``, ``EvaluationConfig = EvaluationConfig``,
-``EvaluationResult = EvaluationResult``.
+Backward-compat aliases: ``Job = Evaluation``, ``JobConfig = EvaluationConfig``,
+``JobResult = EvaluationResult``.


Good catch — the docstring has tautological aliases. Will fix to JobConfig = EvaluationConfig, JobResult = EvaluationResult.

…__init__.py

- use-cases.md: remove 'How it works vs Harbor' and 'Migration from Harbor' sections
- scene-patterns.ipynb: remove 'Mapping to Harbor PR #1462 Concepts' cell
- coder-reviewer-demo.py: 'Harbor-format task directory' → 'BenchFlow task directory'
- swebench notebook: remove Harbor #1316 reference from intro

* cli: unify sandbox flag --env → --sandbox, -b → -e (ENG-56)

  Standardize all CLI commands to use --sandbox/-e for the sandbox parameter. Previously bench run used --sandbox/-b while bench eval create and others used --env/-e.
  - bench run: -b → -e
  - bench eval create/run/compare: --env → --sandbox
  - bench eval progress: --env → --sandbox
  - bench skills eval: --env → --sandbox
  - bench environment create: -b → -e
  - docs/examples: updated flag references

* cli: drop -e/-b short flags, use --sandbox only (ENG-56)

  Remove all short flags (-e, -b) for the sandbox parameter across all CLI commands. The long flag --sandbox is the only way to specify the sandbox now. Updated 15 files: CLI definitions, help text examples, README, docs, skill files, integration test runner, and test examples.

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

Per ENG-46, bench eval create is the canonical CLI entry point. bench run goes through the old SDK shim and is now deprecated, matching the pattern used for bench job and other legacy commands. Python API bf.run() is unaffected (backward-compat alias stays). Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

* feat(ENG-55): first-class LLM-as-judge verifier with dense reward

  - Add rubric_config.py: Pydantic models for rubric.toml (Criterion, JudgeConfig, ScoringConfig, RubricConfig) with binary/likert/numeric criterion types and score normalization
  - Add llm.py: Multi-provider LLM routing (Anthropic/OpenAI/Gemini) with call_judge(), parse_verdict(), exponential backoff retry
  - Add file_readers.py: Document extraction (pdf/docx/xlsx/pptx/txt) with find_deliverables() for rollout directory scanning
  - Replace placeholder LLMJudgeRewardFunc in builtins.py with full rubric-based judge: per-criterion scoring, prompt templates, configurable aggregation (weighted_mean/all_pass/any_pass/threshold), backward-compatible legacy mode
  - Emit per-criterion dense RewardEvents for fine-grained reward signals
  - Write evaluation_details.json alongside rollout for transparency
  - Re-export Criterion, JudgeConfig, RubricConfig, ScoringConfig, load_rubric_toml from rewards and top-level __init__
  - Add test_rubric_config.py (15 tests) and test_llm_judge.py (30 tests)

  Closes ENG-55

* fix(ENG-55): async clients, asyncio.sleep, and score mismatch bugs

  - Replace sync SDK clients with async variants (AsyncAnthropic, AsyncOpenAI, client.aio for Google)
  - Replace time.sleep() with await asyncio.sleep() in call_judge retry loop
  - Pass actual aggregated score to _write_details instead of recomputing n_passed/total

* docs(ENG-55): add LLM-as-judge verifier documentation

  - Add docs/llm-judge.md: full user-facing guide with rubric.toml reference, criterion types, aggregation strategies, dense reward events, multi-provider routing, file discovery, inline criteria, and worked examples
  - Add src/benchflow/rewards/README.md: module-level guide with usage examples
  - Update docs/concepts.md: reference LLM judge in Verifier primitive and add to 'Where to go next' links

* docs: polish - remove ENG-55 ticket ref, clarify rubric.json exclusion

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
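
To illustrate the criterion types and aggregation strategies named above, here is a rough sketch. The strategy and type names come from the commit message; the Likert range (1 to 5), the numeric range, the 0.5 pass mark, and the exact meaning given to each strategy are all assumptions and may differ from the real `rubric.toml` semantics:

```python
from __future__ import annotations


def normalize(kind: str, raw: float, scale_max: float = 5.0) -> float:
    """Map a raw criterion score to the range [0, 1]."""
    if kind == "binary":
        return 1.0 if raw else 0.0
    if kind == "likert":
        # Assumes a 1..scale_max Likert scale.
        return (raw - 1.0) / (scale_max - 1.0)
    if kind == "numeric":
        # Assumes numeric scores lie in 0..scale_max.
        return raw / scale_max
    raise ValueError(f"unknown criterion type: {kind!r}")


def aggregate(
    scores: list[float],
    weights: list[float],
    strategy: str,
    pass_mark: float = 0.5,
) -> float:
    """Combine normalized per-criterion scores into one reward."""
    if not scores:
        return 0.0
    mean = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    if strategy == "weighted_mean":
        return mean
    if strategy == "all_pass":
        return 1.0 if all(s >= pass_mark for s in scores) else 0.0
    if strategy == "any_pass":
        return 1.0 if any(s >= pass_mark for s in scores) else 0.0
    if strategy == "threshold":
        # Pass/fail on the weighted mean as a whole.
        return 1.0 if mean >= pass_mark else 0.0
    raise ValueError(f"unknown aggregation strategy: {strategy!r}")
```

Under these assumptions, each normalized per-criterion score could also be emitted as its own dense reward event before aggregation.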

* CLI: remove all short flags, use full names only [ENG-74]

  Remove all single-letter short flags (-a, -m, -o, -t, -c, -f, -s, -p, -b, -d) from every CLI command. Full flag names only (--agent, --model, --jobs-dir, --tasks-dir, --concurrency, --config, --skills-dir, --prompt, --benchmark, --dir).

  Updated across 23 files:
  - src/benchflow/cli/main.py: all typer.Option definitions + docstring examples
  - All docs/ (getting-started, running-benchmarks, cli.md, skill-eval, integration-tests, examples/)
  - README.md
  - .claude/skills/ (SKILL.md + references)
  - tests/ (integration/run.sh, test_codex_custom_provider.sh, conformance/README.md)
  - benchmarks/ (models-as-skills.md, harvey-lab/run_harvey_lab.py)
  - Test fix: test_skill_eval_dryrun.py invocation updated from -a to --agent

* fix: replace remaining -f with --config in task-embedded SKILL.md files

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

* fix: apply PR #230 (skills double-deploy) and PR #242 (shell injection) fixes to v0.4

  - PR #230: pass effective task path to deploy_skills, set effective_skills in already-injected branch
  - PR #242: shlex.quote outbox file paths in _scene.py, harden _snapshot.py with path validation

* fix: set effective_skills to /skills when Dockerfile already injected

  When deploy_skills detects the Dockerfile already bakes in `COPY _deps/skills /skills/`, the runtime upload is skipped but effective_skills was left as task.config.environment.skills_dir (often None), causing the subsequent linking step to silently skip distribution to agent-specific paths like /home/agent/.agents/skills. Sandbox users relied on that runtime linking since Dockerfile injection only links under /root/.

  Tighten the regression test to assert env.exec was called with the expected `ln -sfn` command.

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
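
The second fix above is easier to follow as code. A simplified sketch, assuming the injection check is a substring match on the Dockerfile text; the helper names `resolve_effective_skills` and `link_command` are hypothetical and do not correspond to functions in the repository:

```python
from __future__ import annotations

import shlex

# Line the fix looks for, as quoted in the commit message.
INJECTED_MARKER = "COPY _deps/skills /skills/"


def resolve_effective_skills(dockerfile_text: str, configured_dir: str | None) -> str | None:
    """Pick the skills path that the later linking step should use."""
    if INJECTED_MARKER in dockerfile_text:
        # Skills are already baked into the image, so the upload is skipped,
        # but the linking step still needs a concrete path.
        return "/skills"
    return configured_dir


def link_command(source: str, dest: str) -> str:
    """Build the symlink command with both paths shell-quoted."""
    return f"ln -sfn {shlex.quote(source)} {shlex.quote(dest)}"
```

Before the fix, the equivalent of the first branch returned `configured_dir`, which is often `None`, so no `ln -sfn` command was issued for agent-specific paths.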

…, ACPX integration (#288) * refactor: BenchFlow v0.3.4 — unified types, Rollout, Sandbox protocol, rewards, adapters (#274) * refactor: add Sandbox protocol + Harbor adapters (ENG-48 Phase A) Create BenchFlow-native Sandbox protocol as parallel types alongside existing Harbor imports. No behavioral changes — existing code continues to use Harbor directly. New files: - src/benchflow/sandbox/protocol.py: ExecResult, Sandbox, ImageRef, ImageConfig, ImageBuilder - src/benchflow/sandbox/docker.py: DockerSandbox adapter wrapping Harbor DockerEnvironment - src/benchflow/sandbox/daytona.py: DaytonaSandbox adapter wrapping Harbor DaytonaEnvironment - tests/test_sandbox_protocol.py: 26 tests covering dataclasses, protocol conformance, and delegation * refactor: unify Scene/Role/Turn types into _types.py (ENG-47) Create src/benchflow/_types.py as the single canonical source for the declarative Role, Scene, and Turn dataclasses. Changes: - New _types.py with Role (adds timeout_sec, idle_timeout_sec, skills_dir), Scene (adds parallel_group), and Turn - trial.py imports from _types.py instead of defining its own copies - _scene.py renames its internal Role to SceneRole (different fields: instruction, tools) with backward-compat alias - __init__.py re-exports canonical types from _types.py; adds TrialRole/TrialScene backward-compat aliases - runtime.py and trial_yaml.py updated to import from _types.py New fields default to None — no runtime behavior changes. 
* fix: address review — shlex.quote in read_file, cleanup temp in write_file - Use shlex.quote(path) in read_file to prevent shell injection - Clean up temp files with os.unlink in write_file finally block - Move imports to module level * fix: read_file error checking + export ImageConfig from top-level - read_file now raises FileNotFoundError on non-zero return code - Export ImageConfig alongside ImageBuilder in __init__.py - Add tests for read_file error behavior * refactor: kill shim layers, single Rollout execution path (ENG-46) - Rename trial.py → rollout.py, Trial → Rollout, TrialConfig → RolloutConfig - Rename RunResult → RolloutResult in models.py - Create evaluation.py from job.py: Job → Evaluation, JobConfig → EvaluationConfig, JobResult → EvaluationResult - Create _run.py with bf.run(RolloutConfig) → RolloutResult entry point - Replace sdk.py with thin backward-compat shim delegating to rollout.py - Replace trial.py with re-export shim from rollout.py - Replace job.py with re-export shim from evaluation.py - Preserve runtime.py API, update internal imports to use Rollout - Update __init__.py: new public API + backward-compat aliases - Update test patch targets from benchflow.trial → benchflow.rollout - All 841 tests pass, lint clean, pre-existing typecheck errors unchanged * style: apply ruff format to pass CI format check * fix: resolve ty typecheck errors (Any for kwargs and sentinel) * feat: composable Rubric + RewardFunc protocol (ENG-49) - Create src/benchflow/rewards/ package with: - RewardFunc protocol (single scoring dimension) - Rubric dataclass (weighted collection of RewardFuncs) - VerifyResult dataclass (aggregated scoring result) - RewardEvent dataclass (dense/terminal reward signals) - Built-in funcs: TestRewardFunc, LLMJudgeRewardFunc, StringMatchRewardFunc, CodeExecRewardFunc - Add reward_events field to RunResult (additive, no breaking changes) - Re-export all reward types from benchflow top-level __init__.py - Backward compat: 
Rubric([TestRewardFunc()]) wraps existing test.sh -> reward.txt flow - 31 tests covering all reward types, rubric scoring, weights, error handling, protocol conformance, and re-exports * style: format rewards package and tests with ruff * fix: disambiguate Rubric.items keys when multiple funcs share a class name Append _N suffix on collision so all per-func scores are preserved. Addresses Devin Review feedback on PR #266. * feat: per-agent capabilities + agent-as-tool infrastructure (ENG-50) - Add Role.capabilities field for declarative agent capability lists - Add Sandbox.expose_ports property for inter-agent communication ports - Document capability boundary: BenchFlow provides sandbox + instruction + observation; agents own loops, tool protocols, and agent-as-tool - Update _scene.py and _types.py module docstrings with ENG-50 boundary - Add 16 tests covering capabilities, expose_ports, and Coder→Reviewer→Coder turn sequences * style: apply ruff format to sandbox adapters and tests * feat: external adapters for Inspect AI + ORS (ENG-51) Add benchflow.adapters package with thin format converters: - InspectAdapter / to_inspect_task: Scene + Rubric -> Inspect AI task dict - ORSAdapter / to_ors_reward: VerifyResult + RewardEvent -> ORS format dict No external dependencies required — these are pure format converters. Re-exported from benchflow top-level __init__.py. 14 tests covering both adapters, convenience functions, and round-trips. 
* style: fix ruff format for adapters and tests * fix: convention fixes + trial shim re-exports + __getattr__ narrowing - Parametrize bare dict returns to dict[str, Any] in adapters - Remove unnecessary @DataClass on adapter classes (no fields) - Add from __future__ import annotations to evaluation.py - Fix __getattr__ to re-raise ModuleNotFoundError for broken deps - Add private helper re-exports to trial.py shim for backward compat - Convert test_save_trajectory to async (fixes event loop deprecation) * rename: backend → sandbox across API, CLI, docs, and tests Aligns naming with the v0.4 Sandbox protocol (ENG-48). The 'backend' parameter/flag/attribute on Environment and the CLI is now 'sandbox' everywhere. Scope: - runtime.py: Environment.__init__(sandbox=), .sandbox, from_task(sandbox=) - cli/main.py: --sandbox flag (was --backend), help text - _snapshot.py: docstring updates - tests/test_runtime.py: updated assertions - docs/: all references in task-authoring, CLI ref, Python API ref, running-benchmarks, integration-tests, examples NOT renamed (different meaning): - _credentials.py 'Vertex AI backend' (provider concept) - _provider_runtime.py / bedrock_proxy.py backend_model (LLM model) - _sandbox.py 'build backends' (Python packaging) * docs: update for v0.4 — Rollout as canonical name, add Sandbox/Rubric/Adapter docs - README: 'Sandbox backends' → 'Sandboxes', table says Rollout not Trial - concepts.md: Environment no longer says 'Backed by Harbor', Trial → Rollout as core primitive, lifecycle diagram uses Rollout.run() - getting-started.md: uses RolloutConfig, links to rollout lifecycle - python-api.md: RolloutConfig (aliased as TrialConfig), Rollout (aliased as Trial), canonical imports shown first, added v0.4 types section with Sandbox protocol, Rubric + RewardFunc, Adapters, Evaluation, backward-compat aliases * fix: correct Rubric/RewardFunc API examples in python-api.md Fixes 3 issues flagged by Devin Review: - StringMatchRewardFunc: remove 
non-existent 'field' parameter - Rubric: use reward_funcs + weights instead of items=[(func, weight)] - rubric.score: takes rollout_dir: Path, not sandbox= * fix: correct evaluation.py docstring aliases, clean stale comment in __init__.py * docs: remove Harbor migration references - use-cases.md: remove 'How it works vs Harbor' and 'Migration from Harbor' sections - scene-patterns.ipynb: remove 'Mapping to Harbor PR #1462 Concepts' cell - coder-reviewer-demo.py: 'Harbor-format task directory' → 'BenchFlow task directory' - swebench notebook: remove Harbor #1316 reference from intro * CLI: unify sandbox flag (--env → --sandbox, -b → -e) [ENG-56] (#281) * cli: unify sandbox flag --env → --sandbox, -b → -e (ENG-56) Standardize all CLI commands to use --sandbox/-e for the sandbox parameter. Previously bench run used --sandbox/-b while bench eval create and others used --env/-e. - bench run: -b → -e - bench eval create/run/compare: --env → --sandbox - bench eval progress: --env → --sandbox - bench skills eval: --env → --sandbox - bench environment create: -b → -e - docs/examples: updated flag references * cli: drop -e/-b short flags, use --sandbox only (ENG-56) Remove all short flags (-e, -b) for the sandbox parameter across all CLI commands. The long flag --sandbox is the only way to specify the sandbox now. Updated 15 files: CLI definitions, help text examples, README, docs, skill files, integration test runner, and test examples. --------- Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> * cli: deprecate bench run in favor of bench eval create (ENG-57) (#282) Per ENG-46, bench eval create is the canonical CLI entry point. bench run goes through the old SDK shim and is now deprecated, matching the pattern used for bench job and other legacy commands. Python API bf.run() is unaffected (backward-compat alias stays). 
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> * feat(ENG-55): first-class LLM-as-judge verifier with dense reward (#277) * feat(ENG-55): first-class LLM-as-judge verifier with dense reward - Add rubric_config.py: Pydantic models for rubric.toml (Criterion, JudgeConfig, ScoringConfig, RubricConfig) with binary/likert/numeric criterion types and score normalization - Add llm.py: Multi-provider LLM routing (Anthropic/OpenAI/Gemini) with call_judge(), parse_verdict(), exponential backoff retry - Add file_readers.py: Document extraction (pdf/docx/xlsx/pptx/txt) with find_deliverables() for rollout directory scanning - Replace placeholder LLMJudgeRewardFunc in builtins.py with full rubric-based judge: per-criterion scoring, prompt templates, configurable aggregation (weighted_mean/all_pass/any_pass/threshold), backward-compatible legacy mode - Emit per-criterion dense RewardEvents for fine-grained reward signals - Write evaluation_details.json alongside rollout for transparency - Re-export Criterion, JudgeConfig, RubricConfig, ScoringConfig, load_rubric_toml from rewards and top-level __init__ - Add test_rubric_config.py (15 tests) and test_llm_judge.py (30 tests) Closes ENG-55 * fix(ENG-55): async clients, asyncio.sleep, and score mismatch bugs - Replace sync SDK clients with async variants (AsyncAnthropic, AsyncOpenAI, client.aio for Google) - Replace time.sleep() with await asyncio.sleep() in call_judge retry loop - Pass actual aggregated score to _write_details instead of recomputing n_passed/total * docs(ENG-55): add LLM-as-judge verifier documentation - Add docs/llm-judge.md: full user-facing guide with rubric.toml reference, criterion types, aggregation strategies, dense reward events, multi-provider routing, file discovery, inline criteria, and worked examples - Add src/benchflow/rewards/README.md: module-level guide with usage examples - Update docs/concepts.md: reference LLM judge in Verifier primitive and add to 'Where to go next' links * docs: polish — 
remove ENG-55 ticket ref, clarify rubric.json exclusion --------- Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> * CLI: remove all short flags, use full names only [ENG-74] (#284) * CLI: remove all short flags, use full names only [ENG-74] Remove all single-letter short flags (-a, -m, -o, -t, -c, -f, -s, -p, -b, -d) from every CLI command. Full flag names only (--agent, --model, --jobs-dir, --tasks-dir, --concurrency, --config, --skills-dir, --prompt, --benchmark, --dir). Updated across 23 files: - src/benchflow/cli/main.py: all typer.Option definitions + docstring examples - All docs/ (getting-started, running-benchmarks, cli.md, skill-eval, integration-tests, examples/) - README.md - .claude/skills/ (SKILL.md + references) - tests/ (integration/run.sh, test_codex_custom_provider.sh, conformance/README.md) - benchmarks/ (models-as-skills.md, harvey-lab/run_harvey_lab.py) - Test fix: test_skill_eval_dryrun.py invocation updated from -a to --agent * fix: replace remaining -f with --config in task-embedded SKILL.md files --------- Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> * Backport PR #230 and #242 fixes to refactor/v0.4 (#286) * fix: apply PR #230 (skills double-deploy) and PR #242 (shell injection) fixes to v0.4 PR #230: pass effective task path to deploy_skills, set effective_skills in already-injected branch PR #242: shlex.quote outbox file paths in _scene.py, harden _snapshot.py with path validation * fix: set effective_skills to /skills when Dockerfile already injected When deploy_skills detects the Dockerfile already bakes in `COPY _deps/skills /skills/`, the runtime upload is skipped but effective_skills was left as task.config.environment.skills_dir (often None), causing the subsequent linking step to silently skip distribution to agent-specific paths like /home/agent/.agents/skills. Sandbox users relied on that runtime linking since Dockerfile injection only links under /root/. 
Tighten the regression test to assert env.exec was called with the expected `ln -sfn` command. --------- Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> Co-authored-by: Yifeng He <yfhe.prsn@gmail.com> * fix locally * release: 0.3.4 --------- Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com> Co-authored-by: Yifeng He <yfhe.prsn@gmail.com> * refactor: remove Harbor references, purge backward-compat aliases in tests, update terminology - Remove 'terminal-bench' from pyproject.toml keywords - Rename _make_harbor_mock → _make_sandbox_mock in test files - Update Harbor patch paths to sandbox equivalents in tests - Rename _from_harbor_yaml → _from_legacy_yaml in evaluation.py - Update docstrings/comments: Harbor → BenchFlow/legacy terminology - Internalize Harbor types into benchflow.task/ subpackage - Create sandbox backends: base.py, docker_impl.py, daytona_impl.py, modal_impl.py - Add compose/ subpackage with docker-compose YAML templates * refactor: modernize file structure — align with test-drive patterns - sandbox/: merge docker_impl→docker, daytona_impl→daytona, base→_base - sandbox/: move _sandbox→lockdown, _snapshot→snapshot, _daytona_patches→_sdk_ops - sandbox/: move process, user, environments→services into sandbox/ - sandbox/: restructure compose/ → _compose.py + _compose_files/ - agents/: move _credentials→credentials, _agent_env→env, _agent_setup→install - _utils/: create subpackage with yaml_loader, benchmark_repos, task_authoring - trajectories/: move viewer, _trajectory→_capture into subpackage - experimental/: move mcp/ into experimental/ subpackage - __init__.py: purge backward-compat aliases (Trial, Job, etc.) 
- Update all imports across src/, tests/, experiments/
- Run ruff check + format (0 errors)

* refactor: purge backward-compat aliases + fix review bugs
- Remove Trial=Rollout, TrialConfig=RolloutConfig from rollout.py
- Remove Job=Evaluation, JobConfig=EvaluationConfig, JobResult=EvaluationResult from evaluation.py
- Delete shim files: trial.py, job.py
- Update all imports: benchflow.trial → benchflow.rollout, benchflow.job → benchflow.evaluation
- Rename classes: Trial→Rollout, Job→Evaluation across tests/, benchmarks/, experiments/
- Fix runtime.py: environment_type→sandbox_type, trial_name→rollout_name (review bug)
- Fix conformance scripts: same keyword arg fixes (review bug)
- Clean up __all__ and return type annotations

* docs: update terminology to v0.4 (Trial→Rollout, Job→Evaluation) + bump version to 0.4.0

* chore: update uv.lock for v0.4.0

* refactor: modernize RL terminology — trial_dir→rollout_dir, trial_name→rollout_name across codebase

* style: format viewer.py

* fix: exclude optional-dep sandbox files from ty check

* fix: configure ty to ignore unresolved-import + exclude optional-dep files

* fix: resolve all test failures — patch paths, protocol conformance, RL terminology

* fix: resolve merge conflicts, update broken imports and stale references
- Resolve merge conflict markers in rollout.py and evaluation.py
- Update task_download imports to benchflow._utils.benchmark_repos (11 files)
- Fix benchflow.tasks → benchflow._utils.task_authoring import
- Update scene-patterns.ipynb: TrialConfig→RolloutConfig, trial→rollout
- Fix Job→Evaluation rename in docs
- Ruff auto-fixes (import ordering)

* style: format 12 files to pass ruff format check

* fix: resolve ty check errors in traces/ — dict[str, object] → dict[str, Any]

* fix: add skip guards for optional dependency tests (daytona, modal)

* fix: update stale imports and references in docs/benchmarks (Devin Review)

* refactor: consolidate underscore modules into proper subpackages
- _acp_run.py → acp/runtime.py
- _env_setup.py → sandbox/setup.py
- _provider_runtime.py → providers/runtime.py
- _scoring.py → _utils/scoring.py
- _scene.py → scenes.py (removed Role=SceneRole backward-compat alias)
- Fix GEMINI_API_KEY auto-detection: inherit from os.environ as fallback after .env

* feat: ACPX integration — acpx/<agent> protocol for headless ACP agent invocation
- Add 'acpx' as a valid protocol in parse_agent_spec (acpx/claude, acpx/codex, etc.)
- _acpx_wrap() decorates any registered agent to launch via acpx CLI
- Installs acpx alongside the underlying agent in the sandbox
- Preserves all agent env, credentials, and skill_paths
- Replace harbor protocol references with acpx in tests

* style: format registry.py for ruff format check

* fix: suppress ty invalid-assignment for monkey-patch in _patch_docker_dind

* docs: update docs and SKILL.md for v0.4 — RL-first terminology, ACPX, purge backward-compat

* fix: rename trial_config_from_yaml → rollout_config_from_yaml, fix stale _run import in sdk.py

* fix: docs inconsistencies found by dogfooding — GEMINI_API_KEY, daytona install note

* fix: docs path evaluations/ → jobs/ to match actual output directory

* fix: ruff import ordering in sdk.py

* fix: ty type-check suppression for sdk.py run() return type

* feat: add sandbox-daytona and sandbox-modal optional dependency groups

* chore: update uv.lock for sandbox-daytona and sandbox-modal optional deps

---------

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
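
For readers unfamiliar with the `acpx/<agent>` spec form mentioned in the ACPX commit, here is a rough sketch of how such a string could be split into a protocol and an agent name. It is not the actual `parse_agent_spec`; the function name `parse_agent_spec_sketch`, the `AgentSpec` tuple, the protocol set, and the error behavior are all assumptions made for illustration.

```python
from __future__ import annotations

from typing import NamedTuple, Optional

# Hypothetical protocol set: only "acpx" is named in the commit message.
KNOWN_PROTOCOLS = frozenset({"acpx"})


class AgentSpec(NamedTuple):
    """Illustrative result type (not part of BenchFlow's API)."""

    protocol: Optional[str]
    agent: str


def parse_agent_spec_sketch(spec: str) -> AgentSpec:
    """Split 'protocol/agent' into parts; a bare 'agent' has no protocol."""
    protocol, sep, agent = spec.partition("/")
    if not sep:
        # No slash: treat the whole string as the agent name.
        return AgentSpec(protocol=None, agent=spec)
    if protocol not in KNOWN_PROTOCOLS:
        raise ValueError(f"unknown protocol {protocol!r} in agent spec {spec!r}")
    if not agent:
        raise ValueError(f"missing agent name in agent spec {spec!r}")
    return AgentSpec(protocol=protocol, agent=agent)


wrapped = parse_agent_spec_sketch("acpx/claude")
plain = parse_agent_spec_sketch("codex")
```

Rejecting unknown prefixes early, rather than passing them through, is one way to keep a typo such as `apcx/claude` from silently selecting the wrong launch path.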

* refactor: add Sandbox protocol + Harbor adapters (ENG-48 Phase A)
Create BenchFlow-native Sandbox protocol as parallel types alongside existing Harbor imports. No behavioral changes — existing code continues to use Harbor directly.
New files:
- src/benchflow/sandbox/protocol.py: ExecResult, Sandbox, ImageRef, ImageConfig, ImageBuilder
- src/benchflow/sandbox/docker.py: DockerSandbox adapter wrapping Harbor DockerEnvironment
- src/benchflow/sandbox/daytona.py: DaytonaSandbox adapter wrapping Harbor DaytonaEnvironment
- tests/test_sandbox_protocol.py: 26 tests covering dataclasses, protocol conformance, and delegation

* refactor: unify Scene/Role/Turn types into _types.py (ENG-47)
Create src/benchflow/_types.py as the single canonical source for the declarative Role, Scene, and Turn dataclasses.
Changes:
- New _types.py with Role (adds timeout_sec, idle_timeout_sec, skills_dir), Scene (adds parallel_group), and Turn
- trial.py imports from _types.py instead of defining its own copies
- _scene.py renames its internal Role to SceneRole (different fields: instruction, tools) with backward-compat alias
- __init__.py re-exports canonical types from _types.py; adds TrialRole/TrialScene backward-compat aliases
- runtime.py and trial_yaml.py updated to import from _types.py
New fields default to None — no runtime behavior changes.
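
The ENG-48 Phase A commit above describes a protocol plus adapters that delegate to an existing environment object. A minimal sketch of that pattern follows. The commit names the types but not their signatures, so the `exec` method, the `ExecResult` fields, and the `LegacyEnvironment` stand-in are assumptions for illustration only.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class ExecResult:
    # Hypothetical fields; the real ExecResult may differ.
    stdout: str
    stderr: str
    return_code: int


@runtime_checkable
class Sandbox(Protocol):
    # Hypothetical method standing in for whatever the real protocol declares.
    async def exec(self, command: str) -> ExecResult: ...


class LegacyEnvironment:
    """Stand-in for a wrapped third-party environment (hypothetical)."""

    async def run(self, command: str) -> tuple[str, str, int]:
        return (f"ran: {command}", "", 0)


class DockerSandboxSketch:
    """Adapter: satisfies Sandbox structurally by delegating to the legacy object."""

    def __init__(self, env: LegacyEnvironment) -> None:
        self._env = env

    async def exec(self, command: str) -> ExecResult:
        stdout, stderr, code = await self._env.run(command)
        return ExecResult(stdout=stdout, stderr=stderr, return_code=code)


sandbox = DockerSandboxSketch(LegacyEnvironment())
result = asyncio.run(sandbox.exec("echo hi"))
```

Because `Sandbox` is a structural protocol, the adapter does not inherit from it; `isinstance(sandbox, Sandbox)` only checks that the expected attribute exists, which is roughly what a "protocol conformance" test can assert.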
* fix: address review — shlex.quote in read_file, cleanup temp in write_file
- Use shlex.quote(path) in read_file to prevent shell injection
- Clean up temp files with os.unlink in write_file finally block
- Move imports to module level

* fix: read_file error checking + export ImageConfig from top-level
- read_file now raises FileNotFoundError on non-zero return code
- Export ImageConfig alongside ImageBuilder in __init__.py
- Add tests for read_file error behavior

* refactor: kill shim layers, single Rollout execution path (ENG-46)
- Rename trial.py → rollout.py, Trial → Rollout, TrialConfig → RolloutConfig
- Rename RunResult → RolloutResult in models.py
- Create evaluation.py from job.py: Job → Evaluation, JobConfig → EvaluationConfig, JobResult → EvaluationResult
- Create _run.py with bf.run(RolloutConfig) → RolloutResult entry point
- Replace sdk.py with thin backward-compat shim delegating to rollout.py
- Replace trial.py with re-export shim from rollout.py
- Replace job.py with re-export shim from evaluation.py
- Preserve runtime.py API, update internal imports to use Rollout
- Update __init__.py: new public API + backward-compat aliases
- Update test patch targets from benchflow.trial → benchflow.rollout
- All 841 tests pass, lint clean, pre-existing typecheck errors unchanged

* style: apply ruff format to pass CI format check

* fix: resolve ty typecheck errors (Any for kwargs and sentinel)

* feat: composable Rubric + RewardFunc protocol (ENG-49)
- Create src/benchflow/rewards/ package with:
  - RewardFunc protocol (single scoring dimension)
  - Rubric dataclass (weighted collection of RewardFuncs)
  - VerifyResult dataclass (aggregated scoring result)
  - RewardEvent dataclass (dense/terminal reward signals)
  - Built-in funcs: TestRewardFunc, LLMJudgeRewardFunc, StringMatchRewardFunc, CodeExecRewardFunc
- Add reward_events field to RunResult (additive, no breaking changes)
- Re-export all reward types from benchflow top-level __init__.py
- Backward compat: Rubric([TestRewardFunc()]) wraps existing test.sh -> reward.txt flow
- 31 tests covering all reward types, rubric scoring, weights, error handling, protocol conformance, and re-exports

* style: format rewards package and tests with ruff

* fix: disambiguate Rubric.items keys when multiple funcs share a class name
Append _N suffix on collision so all per-func scores are preserved. Addresses Devin Review feedback on PR #266.

* feat: per-agent capabilities + agent-as-tool infrastructure (ENG-50)
- Add Role.capabilities field for declarative agent capability lists
- Add Sandbox.expose_ports property for inter-agent communication ports
- Document capability boundary: BenchFlow provides sandbox + instruction + observation; agents own loops, tool protocols, and agent-as-tool
- Update _scene.py and _types.py module docstrings with ENG-50 boundary
- Add 16 tests covering capabilities, expose_ports, and Coder→Reviewer→Coder turn sequences

* style: apply ruff format to sandbox adapters and tests

* feat: external adapters for Inspect AI + ORS (ENG-51)
Add benchflow.adapters package with thin format converters:
- InspectAdapter / to_inspect_task: Scene + Rubric -> Inspect AI task dict
- ORSAdapter / to_ors_reward: VerifyResult + RewardEvent -> ORS format dict
No external dependencies required — these are pure format converters. Re-exported from benchflow top-level __init__.py. 14 tests covering both adapters, convenience functions, and round-trips.
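
The ENG-49 commits above describe a Rubric as a weighted collection of reward functions, with a `_N` suffix added when two functions share a class name. The sketch below illustrates that idea under stated assumptions: `RubricSketch`, `ConstantReward`, the weighted-mean aggregation, and the exact suffix numbering are hypothetical, and only the `reward_funcs`, `weights`, and `rollout_dir` names come from the commit messages.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path
from typing import Protocol


class RewardFunc(Protocol):
    """One scoring dimension (call signature is an assumption)."""

    def __call__(self, rollout_dir: Path) -> float: ...


@dataclass
class ConstantReward:
    """Toy reward func that ignores the rollout and returns a fixed value."""

    value: float

    def __call__(self, rollout_dir: Path) -> float:
        return self.value


@dataclass
class RubricSketch:
    reward_funcs: list[RewardFunc]
    weights: list[float] = field(default_factory=list)

    def score(self, rollout_dir: Path) -> tuple[float, dict[str, float]]:
        # Default to equal weights when none are given (an assumption).
        weights = self.weights or [1.0] * len(self.reward_funcs)
        items: dict[str, float] = {}
        total = 0.0
        for func, weight in zip(self.reward_funcs, weights):
            value = func(rollout_dir)
            base = type(func).__name__
            key, n = base, 1
            # On a name collision, append _N so no per-func score is overwritten.
            while key in items:
                n += 1
                key = f"{base}_{n}"
            items[key] = value
            total += weight * value
        return total / sum(weights), items


rubric = RubricSketch([ConstantReward(1.0), ConstantReward(0.0)], weights=[3.0, 1.0])
score, items = rubric.score(Path("."))
```

Without the suffix step, the second `ConstantReward` would overwrite the first entry in `items`, and the per-function breakdown would no longer add up to the aggregate score.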
* style: fix ruff format for adapters and tests

* fix: convention fixes + trial shim re-exports + __getattr__ narrowing
- Parametrize bare dict returns to dict[str, Any] in adapters
- Remove unnecessary @dataclass on adapter classes (no fields)
- Add from __future__ import annotations to evaluation.py
- Fix __getattr__ to re-raise ModuleNotFoundError for broken deps
- Add private helper re-exports to trial.py shim for backward compat
- Convert test_save_trajectory to async (fixes event loop deprecation)

* rename: backend → sandbox across API, CLI, docs, and tests
Aligns naming with the v0.4 Sandbox protocol (ENG-48). The 'backend' parameter/flag/attribute on Environment and the CLI is now 'sandbox' everywhere.
Scope:
- runtime.py: Environment.__init__(sandbox=), .sandbox, from_task(sandbox=)
- cli/main.py: --sandbox flag (was --backend), help text
- _snapshot.py: docstring updates
- tests/test_runtime.py: updated assertions
- docs/: all references in task-authoring, CLI ref, Python API ref, running-benchmarks, integration-tests, examples
NOT renamed (different meaning):
- _credentials.py 'Vertex AI backend' (provider concept)
- _provider_runtime.py / bedrock_proxy.py backend_model (LLM model)
- _sandbox.py 'build backends' (Python packaging)

* docs: update for v0.4 — Rollout as canonical name, add Sandbox/Rubric/Adapter docs
- README: 'Sandbox backends' → 'Sandboxes', table says Rollout not Trial
- concepts.md: Environment no longer says 'Backed by Harbor', Trial → Rollout as core primitive, lifecycle diagram uses Rollout.run()
- getting-started.md: uses RolloutConfig, links to rollout lifecycle
- python-api.md: RolloutConfig (aliased as TrialConfig), Rollout (aliased as Trial), canonical imports shown first, added v0.4 types section with Sandbox protocol, Rubric + RewardFunc, Adapters, Evaluation, backward-compat aliases

* fix: correct Rubric/RewardFunc API examples in python-api.md
Fixes 3 issues flagged by Devin Review:
- StringMatchRewardFunc: remove non-existent 'field' parameter
- Rubric: use reward_funcs + weights instead of items=[(func, weight)]
- rubric.score: takes rollout_dir: Path, not sandbox=

* fix: correct evaluation.py docstring aliases, clean stale comment in __init__.py

* docs: remove Harbor migration references
- use-cases.md: remove 'How it works vs Harbor' and 'Migration from Harbor' sections
- scene-patterns.ipynb: remove 'Mapping to Harbor PR #1462 Concepts' cell
- coder-reviewer-demo.py: 'Harbor-format task directory' → 'BenchFlow task directory'
- swebench notebook: remove Harbor #1316 reference from intro

* CLI: unify sandbox flag (--env → --sandbox, -b → -e) [ENG-56] (#281)

* cli: unify sandbox flag --env → --sandbox, -b → -e (ENG-56)
Standardize all CLI commands to use --sandbox/-e for the sandbox parameter. Previously bench run used --sandbox/-b while bench eval create and others used --env/-e.
- bench run: -b → -e
- bench eval create/run/compare: --env → --sandbox
- bench eval progress: --env → --sandbox
- bench skills eval: --env → --sandbox
- bench environment create: -b → -e
- docs/examples: updated flag references

* cli: drop -e/-b short flags, use --sandbox only (ENG-56)
Remove all short flags (-e, -b) for the sandbox parameter across all CLI commands. The long flag --sandbox is the only way to specify the sandbox now.
Updated 15 files: CLI definitions, help text examples, README, docs, skill files, integration test runner, and test examples.

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

* cli: deprecate bench run in favor of bench eval create (ENG-57) (#282)
Per ENG-46, bench eval create is the canonical CLI entry point. bench run goes through the old SDK shim and is now deprecated, matching the pattern used for bench job and other legacy commands. Python API bf.run() is unaffected (backward-compat alias stays).
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

* feat(ENG-55): first-class LLM-as-judge verifier with dense reward (#277)

* feat(ENG-55): first-class LLM-as-judge verifier with dense reward
- Add rubric_config.py: Pydantic models for rubric.toml (Criterion, JudgeConfig, ScoringConfig, RubricConfig) with binary/likert/numeric criterion types and score normalization
- Add llm.py: Multi-provider LLM routing (Anthropic/OpenAI/Gemini) with call_judge(), parse_verdict(), exponential backoff retry
- Add file_readers.py: Document extraction (pdf/docx/xlsx/pptx/txt) with find_deliverables() for rollout directory scanning
- Replace placeholder LLMJudgeRewardFunc in builtins.py with full rubric-based judge: per-criterion scoring, prompt templates, configurable aggregation (weighted_mean/all_pass/any_pass/threshold), backward-compatible legacy mode
- Emit per-criterion dense RewardEvents for fine-grained reward signals
- Write evaluation_details.json alongside rollout for transparency
- Re-export Criterion, JudgeConfig, RubricConfig, ScoringConfig, load_rubric_toml from rewards and top-level __init__
- Add test_rubric_config.py (15 tests) and test_llm_judge.py (30 tests)
Closes ENG-55

* fix(ENG-55): async clients, asyncio.sleep, and score mismatch bugs
- Replace sync SDK clients with async variants (AsyncAnthropic, AsyncOpenAI, client.aio for Google)
- Replace time.sleep() with await asyncio.sleep() in call_judge retry loop
- Pass actual aggregated score to _write_details instead of recomputing n_passed/total

* docs(ENG-55): add LLM-as-judge verifier documentation
- Add docs/llm-judge.md: full user-facing guide with rubric.toml reference, criterion types, aggregation strategies, dense reward events, multi-provider routing, file discovery, inline criteria, and worked examples
- Add src/benchflow/rewards/README.md: module-level guide with usage examples
- Update docs/concepts.md: reference LLM judge in Verifier primitive and add to 'Where to go next' links

* docs: polish — remove ENG-55 ticket ref, clarify rubric.json exclusion

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

* CLI: remove all short flags, use full names only [ENG-74] (#284)

* CLI: remove all short flags, use full names only [ENG-74]
Remove all single-letter short flags (-a, -m, -o, -t, -c, -f, -s, -p, -b, -d) from every CLI command. Full flag names only (--agent, --model, --jobs-dir, --tasks-dir, --concurrency, --config, --skills-dir, --prompt, --benchmark, --dir).
Updated across 23 files:
- src/benchflow/cli/main.py: all typer.Option definitions + docstring examples
- All docs/ (getting-started, running-benchmarks, cli.md, skill-eval, integration-tests, examples/)
- README.md
- .claude/skills/ (SKILL.md + references)
- tests/ (integration/run.sh, test_codex_custom_provider.sh, conformance/README.md)
- benchmarks/ (models-as-skills.md, harvey-lab/run_harvey_lab.py)
- Test fix: test_skill_eval_dryrun.py invocation updated from -a to --agent

* fix: replace remaining -f with --config in task-embedded SKILL.md files

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

* Backport PR #230 and #242 fixes to refactor/v0.4 (#286)

* fix: apply PR #230 (skills double-deploy) and PR #242 (shell injection) fixes to v0.4
PR #230: pass effective task path to deploy_skills, set effective_skills in already-injected branch
PR #242: shlex.quote outbox file paths in _scene.py, harden _snapshot.py with path validation

* fix: set effective_skills to /skills when Dockerfile already injected
When deploy_skills detects the Dockerfile already bakes in `COPY _deps/skills /skills/`, the runtime upload is skipped but effective_skills was left as task.config.environment.skills_dir (often None), causing the subsequent linking step to silently skip distribution to agent-specific paths like /home/agent/.agents/skills. Sandbox users relied on that runtime linking since Dockerfile injection only links under /root/.
Tighten the regression test to assert env.exec was called with the expected `ln -sfn` command.

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>

* fix locally

* release: 0.3.4

* feat: trace import — `bench tasks generate` from Claude Code + opentraces (#278)

* feat: trace import prototype — Claude Code + opentraces → BenchFlow tasks
Add benchflow.traces package for personal benchmark curation from agent traces:
- parsers: parse Claude Code JSONL sessions and opentraces v0.1-v0.3 records
- models: format-agnostic ParsedTrace intermediate representation
- task_gen: generate task.toml + instruction.md + test.sh from traces
- huggingface: download and parse trace datasets from HuggingFace Hub
- local: discover and parse local ~/.claude/projects/ sessions
- CLI: bench import {local,file,hf,list-datasets} subcommands
44 new tests, all passing.

* fix: address Devin Review findings
- _cache_dir: fall back to cwd instead of / when outside git repo
- HF cache key: include max_rows in filename to avoid stale data
- parse_claude_code_session: deterministic trace_id from session+filename
- _build_test_sh: use shlex.quote() to prevent shell injection

* refactor: align trace import CLI with noun-verb philosophy
- Replace 'bench import {local,file,hf}' with 'bench tasks generate --from-{local,file,hf}'
- Replace 'bench import list-datasets' with 'bench tasks list-sources'
- Integrate trace commands into existing tasks_app via register_tasks_generate()
- Update CLI docs with new command reference
- Follows BenchFlow resource-verb pattern: bench <resource> <verb>

* fix: address Devin Review — limit flag + format name mismatch
- Apply --limit across all sources (was only used for --from-local)
- Map 'claude-code' → 'claude-messages' in _load_hf so --format claude-code works correctly with --from-hf

* docs: dogfood bench tasks generate in task-authoring and getting-started guides

* fix: generated tasks now pass bench tasks check — add Dockerfile, tests/ dir, reward path, difficulty scaling, TOML safety

* audit: fix task pipeline issues + normalize docstrings across traces package
Pipeline fixes:
- Fix timeout auto-scaling: batch generator default changed from 300 to 0 (300 > 0 prevented difficulty-based scaling from ever triggering)
- Add [task] name section to generated task.toml (trace-import/<slug>)
- Add build_timeout_sec and storage_mb to [environment] section
- Fix test.sh indentation (remove extra leading spaces from textwrap)
Docstring normalization:
- Add missing docstrings to CLI helper functions (_load_local, _load_file, _load_hf)
- Improve docstring clarity across traces package (detect_format, print helpers, HF parsers)
- Use reST code block syntax for CLI examples
48/48 tests pass, lint clean.

* fix: multi-session JSONL collapse, outcome detection, zero-tool-call filter
Dogfooding revealed 3 quality issues:
1. Critical: parse_claude_code_session() merged all sessions in a multi-session JSONL file into one trace, contaminating difficulty, instruction, and verifier. Added parse_claude_code_file() that splits by sessionId before parsing each group.
2. Outcome detection missed common completion verbs (fixed, refactored, built, created, updated, implemented, added).
3. Traces with zero tool calls (pure explanations) produced useless tasks with pass-through verifiers. Now filtered out in batch mode.
Before: 5-session file → 1 task (hard, 1200s, 19 files listed)
After: 5-session file → 4 tasks (easy/medium, correct files each)
Tests: 55 passing (was 48), lint clean.
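
The multi-session fix above comes down to grouping records by `sessionId` before parsing, then dropping sessions with no tool calls. A minimal sketch of that grouping follows. It is not the actual `parse_claude_code_file()`; the record shape (a flat `tool` key) and the `"unknown"` fallback bucket are simplifying assumptions for illustration.

```python
from __future__ import annotations

import json
from collections import defaultdict
from typing import Any


def split_by_session(jsonl_text: str) -> dict[str, list[dict[str, Any]]]:
    """Group JSONL records by their sessionId, preserving record order."""
    groups: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # Records without a sessionId share one fallback bucket (an assumption).
        groups[record.get("sessionId", "unknown")].append(record)
    return dict(groups)


# Simplified records: a "tool" key marks a tool call (hypothetical shape).
sample = "\n".join(
    [
        json.dumps({"sessionId": "a", "text": "fix bug", "tool": "Edit"}),
        json.dumps({"sessionId": "b", "text": "explain code"}),
        json.dumps({"sessionId": "a", "text": "run tests", "tool": "Bash"}),
    ]
)

sessions = split_by_session(sample)
# Zero-tool-call filter: keep only sessions where some record used a tool.
kept = {sid: recs for sid, recs in sessions.items() if any("tool" in r for r in recs)}
```

Here session `b` is a pure explanation with no tool calls, so it is filtered out, which mirrors the "5-session file → 4 tasks" outcome reported in the commit.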
* fix: test.sh file check cap matches instruction.md (10→20)

* fix: real-trace dogfood — path relativization, session artifact cleanup, HF parser robustness
- Add _relativize_path() to convert absolute workspace paths to relative project paths
- Add _clean_user_prompt() to strip session continuation boilerplate
- HF parser: support messages_json key, strip system-reminders, handle tool_result blocks, infer outcome
- CLI: add claude-messages format detection and routing for --from-file
- Add _parse_claude_requests_row() for cc-traces-weka metadata format
- All 55 tests pass, lint clean

* fix: verifier robustness — glob patterns for timestamp-bearing paths
Paths with date/timestamp segments (e.g. migrations/2025-11-28-131040_create_invoices/up.sql) now use compgen -G glob patterns in test.sh instead of exact [ -f ] checks. This lets verifiers tolerate agent-generated timestamp variants. Also updates instruction.md to show globbed paths for dynamic segments. Adds 10 new tests for _globify_path, _has_dynamic_segments, and end-to-end verifier pattern selection.

* fix: real-trace parser — user_prompt+gitdiff format, git context, git-diff verifier
- Handle cc-traces-merged rows with user_prompt + gitdiff (no messages_json)
- Extract TASK DESCRIPTION from structured prompts, strip CRITICAL INSTRUCTIONS boilerplate
- Extract git repo/commit from prompt, populate GitContext
- Dockerfile clones repo at base commit when git context available
- Verifier uses git diff to check files were modified (not just exist)
- Fix source_model='None' string bug

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

* refactor: BenchFlow v0.4 — RL-first terminology, module consolidation, ACPX integration (#288)

* refactor: BenchFlow v0.3.4 — unified types, Rollout, Sandbox protocol, rewards, adapters (#274)

* refactor: add Sandbox protocol + Harbor adapters (ENG-48 Phase A)
Create BenchFlow-native Sandbox protocol as parallel types alongside existing Harbor imports.
No behavioral changes — existing code continues to use Harbor directly. New files: - src/benchflow/sandbox/protocol.py: ExecResult, Sandbox, ImageRef, ImageConfig, ImageBuilder - src/benchflow/sandbox/docker.py: DockerSandbox adapter wrapping Harbor DockerEnvironment - src/benchflow/sandbox/daytona.py: DaytonaSandbox adapter wrapping Harbor DaytonaEnvironment - tests/test_sandbox_protocol.py: 26 tests covering dataclasses, protocol conformance, and delegation * refactor: unify Scene/Role/Turn types into _types.py (ENG-47) Create src/benchflow/_types.py as the single canonical source for the declarative Role, Scene, and Turn dataclasses. Changes: - New _types.py with Role (adds timeout_sec, idle_timeout_sec, skills_dir), Scene (adds parallel_group), and Turn - trial.py imports from _types.py instead of defining its own copies - _scene.py renames its internal Role to SceneRole (different fields: instruction, tools) with backward-compat alias - __init__.py re-exports canonical types from _types.py; adds TrialRole/TrialScene backward-compat aliases - runtime.py and trial_yaml.py updated to import from _types.py New fields default to None — no runtime behavior changes. 
* fix: address review — shlex.quote in read_file, cleanup temp in write_file - Use shlex.quote(path) in read_file to prevent shell injection - Clean up temp files with os.unlink in write_file finally block - Move imports to module level * fix: read_file error checking + export ImageConfig from top-level - read_file now raises FileNotFoundError on non-zero return code - Export ImageConfig alongside ImageBuilder in __init__.py - Add tests for read_file error behavior * refactor: kill shim layers, single Rollout execution path (ENG-46) - Rename trial.py → rollout.py, Trial → Rollout, TrialConfig → RolloutConfig - Rename RunResult → RolloutResult in models.py - Create evaluation.py from job.py: Job → Evaluation, JobConfig → EvaluationConfig, JobResult → EvaluationResult - Create _run.py with bf.run(RolloutConfig) → RolloutResult entry point - Replace sdk.py with thin backward-compat shim delegating to rollout.py - Replace trial.py with re-export shim from rollout.py - Replace job.py with re-export shim from evaluation.py - Preserve runtime.py API, update internal imports to use Rollout - Update __init__.py: new public API + backward-compat aliases - Update test patch targets from benchflow.trial → benchflow.rollout - All 841 tests pass, lint clean, pre-existing typecheck errors unchanged * style: apply ruff format to pass CI format check * fix: resolve ty typecheck errors (Any for kwargs and sentinel) * feat: composable Rubric + RewardFunc protocol (ENG-49) - Create src/benchflow/rewards/ package with: - RewardFunc protocol (single scoring dimension) - Rubric dataclass (weighted collection of RewardFuncs) - VerifyResult dataclass (aggregated scoring result) - RewardEvent dataclass (dense/terminal reward signals) - Built-in funcs: TestRewardFunc, LLMJudgeRewardFunc, StringMatchRewardFunc, CodeExecRewardFunc - Add reward_events field to RunResult (additive, no breaking changes) - Re-export all reward types from benchflow top-level __init__.py - Backward compat: 
Rubric([TestRewardFunc()]) wraps existing test.sh -> reward.txt flow - 31 tests covering all reward types, rubric scoring, weights, error handling, protocol conformance, and re-exports * style: format rewards package and tests with ruff * fix: disambiguate Rubric.items keys when multiple funcs share a class name Append _N suffix on collision so all per-func scores are preserved. Addresses Devin Review feedback on PR #266. * feat: per-agent capabilities + agent-as-tool infrastructure (ENG-50) - Add Role.capabilities field for declarative agent capability lists - Add Sandbox.expose_ports property for inter-agent communication ports - Document capability boundary: BenchFlow provides sandbox + instruction + observation; agents own loops, tool protocols, and agent-as-tool - Update _scene.py and _types.py module docstrings with ENG-50 boundary - Add 16 tests covering capabilities, expose_ports, and Coder→Reviewer→Coder turn sequences * style: apply ruff format to sandbox adapters and tests * feat: external adapters for Inspect AI + ORS (ENG-51) Add benchflow.adapters package with thin format converters: - InspectAdapter / to_inspect_task: Scene + Rubric -> Inspect AI task dict - ORSAdapter / to_ors_reward: VerifyResult + RewardEvent -> ORS format dict No external dependencies required — these are pure format converters. Re-exported from benchflow top-level __init__.py. 14 tests covering both adapters, convenience functions, and round-trips. 
* style: fix ruff format for adapters and tests * fix: convention fixes + trial shim re-exports + __getattr__ narrowing - Parametrize bare dict returns to dict[str, Any] in adapters - Remove unnecessary @DataClass on adapter classes (no fields) - Add from __future__ import annotations to evaluation.py - Fix __getattr__ to re-raise ModuleNotFoundError for broken deps - Add private helper re-exports to trial.py shim for backward compat - Convert test_save_trajectory to async (fixes event loop deprecation) * rename: backend → sandbox across API, CLI, docs, and tests Aligns naming with the v0.4 Sandbox protocol (ENG-48). The 'backend' parameter/flag/attribute on Environment and the CLI is now 'sandbox' everywhere. Scope: - runtime.py: Environment.__init__(sandbox=), .sandbox, from_task(sandbox=) - cli/main.py: --sandbox flag (was --backend), help text - _snapshot.py: docstring updates - tests/test_runtime.py: updated assertions - docs/: all references in task-authoring, CLI ref, Python API ref, running-benchmarks, integration-tests, examples NOT renamed (different meaning): - _credentials.py 'Vertex AI backend' (provider concept) - _provider_runtime.py / bedrock_proxy.py backend_model (LLM model) - _sandbox.py 'build backends' (Python packaging) * docs: update for v0.4 — Rollout as canonical name, add Sandbox/Rubric/Adapter docs - README: 'Sandbox backends' → 'Sandboxes', table says Rollout not Trial - concepts.md: Environment no longer says 'Backed by Harbor', Trial → Rollout as core primitive, lifecycle diagram uses Rollout.run() - getting-started.md: uses RolloutConfig, links to rollout lifecycle - python-api.md: RolloutConfig (aliased as TrialConfig), Rollout (aliased as Trial), canonical imports shown first, added v0.4 types section with Sandbox protocol, Rubric + RewardFunc, Adapters, Evaluation, backward-compat aliases * fix: correct Rubric/RewardFunc API examples in python-api.md Fixes 3 issues flagged by Devin Review: - StringMatchRewardFunc: remove 
non-existent 'field' parameter - Rubric: use reward_funcs + weights instead of items=[(func, weight)] - rubric.score: takes rollout_dir: Path, not sandbox= * fix: correct evaluation.py docstring aliases, clean stale comment in __init__.py * docs: remove Harbor migration references - use-cases.md: remove 'How it works vs Harbor' and 'Migration from Harbor' sections - scene-patterns.ipynb: remove 'Mapping to Harbor PR #1462 Concepts' cell - coder-reviewer-demo.py: 'Harbor-format task directory' → 'BenchFlow task directory' - swebench notebook: remove Harbor #1316 reference from intro * CLI: unify sandbox flag (--env → --sandbox, -b → -e) [ENG-56] (#281) * cli: unify sandbox flag --env → --sandbox, -b → -e (ENG-56) Standardize all CLI commands to use --sandbox/-e for the sandbox parameter. Previously bench run used --sandbox/-b while bench eval create and others used --env/-e. - bench run: -b → -e - bench eval create/run/compare: --env → --sandbox - bench eval progress: --env → --sandbox - bench skills eval: --env → --sandbox - bench environment create: -b → -e - docs/examples: updated flag references * cli: drop -e/-b short flags, use --sandbox only (ENG-56) Remove all short flags (-e, -b) for the sandbox parameter across all CLI commands. The long flag --sandbox is the only way to specify the sandbox now. Updated 15 files: CLI definitions, help text examples, README, docs, skill files, integration test runner, and test examples. --------- Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> * cli: deprecate bench run in favor of bench eval create (ENG-57) (#282) Per ENG-46, bench eval create is the canonical CLI entry point. bench run goes through the old SDK shim and is now deprecated, matching the pattern used for bench job and other legacy commands. Python API bf.run() is unaffected (backward-compat alias stays). 
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> * feat(ENG-55): first-class LLM-as-judge verifier with dense reward (#277) * feat(ENG-55): first-class LLM-as-judge verifier with dense reward - Add rubric_config.py: Pydantic models for rubric.toml (Criterion, JudgeConfig, ScoringConfig, RubricConfig) with binary/likert/numeric criterion types and score normalization - Add llm.py: Multi-provider LLM routing (Anthropic/OpenAI/Gemini) with call_judge(), parse_verdict(), exponential backoff retry - Add file_readers.py: Document extraction (pdf/docx/xlsx/pptx/txt) with find_deliverables() for rollout directory scanning - Replace placeholder LLMJudgeRewardFunc in builtins.py with full rubric-based judge: per-criterion scoring, prompt templates, configurable aggregation (weighted_mean/all_pass/any_pass/threshold), backward-compatible legacy mode - Emit per-criterion dense RewardEvents for fine-grained reward signals - Write evaluation_details.json alongside rollout for transparency - Re-export Criterion, JudgeConfig, RubricConfig, ScoringConfig, load_rubric_toml from rewards and top-level __init__ - Add test_rubric_config.py (15 tests) and test_llm_judge.py (30 tests) Closes ENG-55 * fix(ENG-55): async clients, asyncio.sleep, and score mismatch bugs - Replace sync SDK clients with async variants (AsyncAnthropic, AsyncOpenAI, client.aio for Google) - Replace time.sleep() with await asyncio.sleep() in call_judge retry loop - Pass actual aggregated score to _write_details instead of recomputing n_passed/total * docs(ENG-55): add LLM-as-judge verifier documentation - Add docs/llm-judge.md: full user-facing guide with rubric.toml reference, criterion types, aggregation strategies, dense reward events, multi-provider routing, file discovery, inline criteria, and worked examples - Add src/benchflow/rewards/README.md: module-level guide with usage examples - Update docs/concepts.md: reference LLM judge in Verifier primitive and add to 'Where to go next' links * docs: polish — 
remove ENG-55 ticket ref, clarify rubric.json exclusion
---------
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* CLI: remove all short flags, use full names only [ENG-74] (#284)
* CLI: remove all short flags, use full names only [ENG-74]
  Remove all single-letter short flags (-a, -m, -o, -t, -c, -f, -s, -p, -b, -d) from every CLI command. Full flag names only (--agent, --model, --jobs-dir, --tasks-dir, --concurrency, --config, --skills-dir, --prompt, --benchmark, --dir).
  Updated across 23 files:
  - src/benchflow/cli/main.py: all typer.Option definitions + docstring examples
  - All docs/ (getting-started, running-benchmarks, cli.md, skill-eval, integration-tests, examples/)
  - README.md
  - .claude/skills/ (SKILL.md + references)
  - tests/ (integration/run.sh, test_codex_custom_provider.sh, conformance/README.md)
  - benchmarks/ (models-as-skills.md, harvey-lab/run_harvey_lab.py)
  - Test fix: test_skill_eval_dryrun.py invocation updated from -a to --agent
* fix: replace remaining -f with --config in task-embedded SKILL.md files
---------
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* Backport PR #230 and #242 fixes to refactor/v0.4 (#286)
* fix: apply PR #230 (skills double-deploy) and PR #242 (shell injection) fixes to v0.4
  PR #230: pass effective task path to deploy_skills, set effective_skills in already-injected branch
  PR #242: shlex.quote outbox file paths in _scene.py, harden _snapshot.py with path validation
* fix: set effective_skills to /skills when Dockerfile already injected
  When deploy_skills detects the Dockerfile already bakes in `COPY _deps/skills /skills/`, the runtime upload is skipped but effective_skills was left as task.config.environment.skills_dir (often None), causing the subsequent linking step to silently skip distribution to agent-specific paths like /home/agent/.agents/skills. Sandbox users relied on that runtime linking since Dockerfile injection only links under /root/.
  Tighten the regression test to assert env.exec was called with the expected `ln -sfn` command.
---------
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
* fix locally
* release: 0.3.4
---------
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
* refactor: remove Harbor references, purge backward-compat aliases in tests, update terminology
  - Remove 'terminal-bench' from pyproject.toml keywords
  - Rename _make_harbor_mock → _make_sandbox_mock in test files
  - Update Harbor patch paths to sandbox equivalents in tests
  - Rename _from_harbor_yaml → _from_legacy_yaml in evaluation.py
  - Update docstrings/comments: Harbor → BenchFlow/legacy terminology
  - Internalize Harbor types into benchflow.task/ subpackage
  - Create sandbox backends: base.py, docker_impl.py, daytona_impl.py, modal_impl.py
  - Add compose/ subpackage with docker-compose YAML templates
* refactor: modernize file structure — align with test-drive patterns
  - sandbox/: merge docker_impl→docker, daytona_impl→daytona, base→_base
  - sandbox/: move _sandbox→lockdown, _snapshot→snapshot, _daytona_patches→_sdk_ops
  - sandbox/: move process, user, environments→services into sandbox/
  - sandbox/: restructure compose/ → _compose.py + _compose_files/
  - agents/: move _credentials→credentials, _agent_env→env, _agent_setup→install
  - _utils/: create subpackage with yaml_loader, benchmark_repos, task_authoring
  - trajectories/: move viewer, _trajectory→_capture into subpackage
  - experimental/: move mcp/ into experimental/ subpackage
  - __init__.py: purge backward-compat aliases (Trial, Job, etc.)
  - Update all imports across src/, tests/, experiments/
  - Run ruff check + format (0 errors)
* refactor: purge backward-compat aliases + fix review bugs
  - Remove Trial=Rollout, TrialConfig=RolloutConfig from rollout.py
  - Remove Job=Evaluation, JobConfig=EvaluationConfig, JobResult=EvaluationResult from evaluation.py
  - Delete shim files: trial.py, job.py
  - Update all imports: benchflow.trial → benchflow.rollout, benchflow.job → benchflow.evaluation
  - Rename classes: Trial→Rollout, Job→Evaluation across tests/, benchmarks/, experiments/
  - Fix runtime.py: environment_type→sandbox_type, trial_name→rollout_name (review bug)
  - Fix conformance scripts: same keyword arg fixes (review bug)
  - Clean up __all__ and return type annotations
* docs: update terminology to v0.4 (Trial→Rollout, Job→Evaluation) + bump version to 0.4.0
* chore: update uv.lock for v0.4.0
* refactor: modernize RL terminology — trial_dir→rollout_dir, trial_name→rollout_name across codebase
* style: format viewer.py
* fix: exclude optional-dep sandbox files from ty check
* fix: configure ty to ignore unresolved-import + exclude optional-dep files
* fix: resolve all test failures — patch paths, protocol conformance, RL terminology
* fix: resolve merge conflicts, update broken imports and stale references
  - Resolve merge conflict markers in rollout.py and evaluation.py
  - Update task_download imports to benchflow._utils.benchmark_repos (11 files)
  - Fix benchflow.tasks → benchflow._utils.task_authoring import
  - Update scene-patterns.ipynb: TrialConfig→RolloutConfig, trial→rollout
  - Fix Job→Evaluation rename in docs
  - Ruff auto-fixes (import ordering)
* style: format 12 files to pass ruff format check
* fix: resolve ty check errors in traces/ — dict[str, object] → dict[str, Any]
* fix: add skip guards for optional dependency tests (daytona, modal)
* fix: update stale imports and references in docs/benchmarks (Devin Review)
* refactor: consolidate underscore modules into proper subpackages
  - _acp_run.py →
    acp/runtime.py
  - _env_setup.py → sandbox/setup.py
  - _provider_runtime.py → providers/runtime.py
  - _scoring.py → _utils/scoring.py
  - _scene.py → scenes.py (removed Role=SceneRole backward-compat alias)
  - Fix GEMINI_API_KEY auto-detection: inherit from os.environ as fallback after .env
* feat: ACPX integration — acpx/<agent> protocol for headless ACP agent invocation
  - Add 'acpx' as a valid protocol in parse_agent_spec (acpx/claude, acpx/codex, etc.)
  - _acpx_wrap() decorates any registered agent to launch via acpx CLI
  - Installs acpx alongside the underlying agent in the sandbox
  - Preserves all agent env, credentials, and skill_paths
  - Replace harbor protocol references with acpx in tests
* style: format registry.py for ruff format check
* fix: suppress ty invalid-assignment for monkey-patch in _patch_docker_dind
* docs: update docs and SKILL.md for v0.4 — RL-first terminology, ACPX, purge backward-compat
* fix: rename trial_config_from_yaml → rollout_config_from_yaml, fix stale _run import in sdk.py
* fix: docs inconsistencies found by dogfooding — GEMINI_API_KEY, daytona install note
* fix: docs path evaluations/ → jobs/ to match actual output directory
* fix: ruff import ordering in sdk.py
* fix: ty type-check suppression for sdk.py run() return type
* feat: add sandbox-daytona and sandbox-modal optional dependency groups
* chore: update uv.lock for sandbox-daytona and sandbox-modal optional deps
---------
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
* skill eval -> refactor (#293)
* fix: address v0.4 dogfood blockers
* test: cover skill eval nested results
* chore: ignore playwright mcp artifacts
* fix: validate config eval agent protocol
* test: add adapter release evidence checker
* test: wire adapter evidence into release runner
* fix: address ENG-91 dogfood regressions
* fix: record role metadata and
  timeouts
* fix: align dogfood cli reports
* docs: document environment cleanup command
* docs: remove stale harbor task path examples
* Add citation-management skill eval under skills
* Update src/benchflow/cli/main.py
  Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
---------
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
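
The ENG-55 fix in the squashed message above replaces time.sleep() with await asyncio.sleep() in the call_judge retry loop. The sketch below illustrates why that matters for an async retry with exponential backoff. The helper name retry_async and its max_attempts and base_delay parameters are hypothetical, chosen for illustration; they are not BenchFlow's actual call_judge signature.

```python
import asyncio
import random
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


async def retry_async(
    call: Callable[[], Awaitable[T]],
    *,
    max_attempts: int = 4,
    base_delay: float = 0.5,
) -> T:
    """Retry an async call with exponential backoff (hypothetical helper)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(max_attempts):
        try:
            return await call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The delay doubles on each attempt; jitter spreads out
            # retries that would otherwise fire at the same moment.
            delay = base_delay * (2**attempt) + random.uniform(0, base_delay)
            # asyncio.sleep yields to the event loop, so other coroutines
            # keep running. time.sleep here would block all of them.
            await asyncio.sleep(delay)
    raise AssertionError("unreachable")
```

With a blocking time.sleep() in that loop, one judge call waiting out its backoff would stall every other coroutine on the same event loop, which is presumably the bug the commit addresses.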

xdotli and others added 22 commits May 15, 2026 22:00

fix: address review — shlex.quote in read_file, cleanup temp in write_file

19bbce5

- Use shlex.quote(path) in read_file to prevent shell injection
- Clean up temp files with os.unlink in write_file finally block
- Move imports to module level
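
A minimal sketch of the two fixes named in this commit, assuming the sandbox reads files by running a shell command and writes them by staging a local temp file. The function names build_read_command and write_via_temp and the upload callback are hypothetical stand-ins, not the adapter's real methods.

```python
import os
import shlex
import tempfile
from collections.abc import Callable


def build_read_command(path: str) -> str:
    """Build a shell command that reads `path` without letting it inject syntax."""
    # shlex.quote wraps the path in single quotes when it contains
    # shell metacharacters, so the shell sees it as one argument.
    return f"cat {shlex.quote(path)}"


def write_via_temp(content: str, upload: Callable[[str], None]) -> None:
    """Stage content in a local temp file, pass it to `upload`, always delete it."""
    fd, tmp_path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "w") as fh:
            fh.write(content)
        upload(tmp_path)  # hypothetical transfer step into the sandbox
    finally:
        # Runs on success and on failure, so no temp file is left behind.
        os.unlink(tmp_path)
```

Without the quoting, a path such as `/tmp/a; rm -rf /` would be split by the shell into two commands.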

fix: read_file error checking + export ImageConfig from top-level

6179b82

- read_file now raises FileNotFoundError on non-zero return code
- Export ImageConfig alongside ImageBuilder in __init__.py
- Add tests for read_file error behavior
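
The error-checking behavior described here can be sketched as follows. ExecResult and the exec_fn callback are assumptions made for the example; the real Sandbox protocol's exec signature is not shown in this PR page.

```python
import shlex
from collections.abc import Callable
from dataclasses import dataclass


@dataclass
class ExecResult:
    """Hypothetical result of running one command in a sandbox."""

    return_code: int
    stdout: str = ""


def read_file(exec_fn: Callable[[str], ExecResult], path: str) -> str:
    """Read a file through a command runner, failing loudly if the read fails."""
    result = exec_fn(f"cat {shlex.quote(path)}")
    if result.return_code != 0:
        # Without this check a failed `cat` would look like an empty file.
        raise FileNotFoundError(path)
    return result.stdout
```

Raising instead of returning empty output lets callers tell a missing file apart from a file that exists but is empty.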

Merge PR #262: refactor: unify Scene/Role/Turn types into _types.py (ENG-47)

2a45331

refactor: unify Scene/Role/Turn types into _types.py (ENG-47)

Merge PR #261: refactor: add Sandbox protocol + Harbor adapters (ENG-48 Phase A)

66be050

refactor: add Sandbox protocol + Harbor adapters (ENG-48 Phase A)

style: apply ruff format to pass CI format check

27c50e0

fix: resolve ty typecheck errors (Any for kwargs and sentinel)

8675839

style: format rewards package and tests with ruff

8373739

fix: disambiguate Rubric.items keys when multiple funcs share a class name

a612a1d

Append _N suffix on collision so all per-func scores are preserved. Addresses Devin Review feedback on PR #266.
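
A sketch of the collision handling this commit describes, assuming keys are derived from each reward function's class name. The helper unique_keys and the exact numbering (first occurrence unsuffixed, later ones `_1`, `_2`, and so on) are assumptions; the commit only states that an _N suffix is appended on collision.

```python
from collections.abc import Callable, Sequence


def unique_keys(funcs: Sequence[Callable[..., float]]) -> list[str]:
    """Return one key per func from its class name, suffixing repeats with _N."""
    used: set[str] = set()
    keys: list[str] = []
    for func in funcs:
        base = type(func).__name__
        key, n = base, 0
        # Bump the suffix until the key is free, so a later func never
        # overwrites an earlier func's score under the same name.
        while key in used:
            n += 1
            key = f"{base}_{n}"
        used.add(key)
        keys.append(key)
    return keys
```

Keyed by bare class name alone, two instances of the same reward class would map to one dict entry and the second score would silently replace the first.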

style: apply ruff format to sandbox adapters and tests

ccdf516

Merge PR #268: refactor: kill shim layers, single Rollout execution path (ENG-46)

75bb8f3

refactor: kill shim layers, single Rollout execution path (ENG-46)

Merge PR #265: feat: per-agent capabilities + agent-as-tool infrastructure (ENG-50)

ce77bbf

feat: per-agent capabilities + agent-as-tool infrastructure (ENG-50)

merge: resolve ENG-49 rewards imports with ENG-46 renames

828344b

Merge PR #266: feat: composable Rubric + RewardFunc protocol (ENG-49)

6f71a41

feat: composable Rubric + RewardFunc protocol (ENG-49)

style: fix ruff format for adapters and tests

c468cbe

Merge PR #271: feat: external adapters for Inspect AI + ORS (ENG-51)

1d9e1a5

feat: external adapters for Inspect AI + ORS (ENG-51)

devin-ai-integration Bot assigned xdotli May 16, 2026

devin-ai-integration Bot commented May 16, 2026

xdotli added 2 commits May 16, 2026 17:39

This comment was marked as resolved.

devin-ai-integration Bot commented May 16, 2026

xdotli and others added 6 commits May 16, 2026 18:28

fix: correct evaluation.py docstring aliases, clean stale comment in __init__.py

daff0a0

devin-ai-integration Bot and others added 4 commits May 17, 2026 04:02

fix locally

4139681

merge: resolve refactor/v0.4 with main

485de66

release: 0.3.4

dfdd02e

xdotli changed the title ~~refactor: BenchFlow v0.4 — unified types, Rollout, Sandbox protocol, rewards, adapters~~ refactor: BenchFlow v0.3.4 — unified types, Rollout, Sandbox protocol, rewards, adapters May 17, 2026

xdotli merged commit 3e8b06c into main May 17, 2026
1 of 2 checks passed

This was referenced May 21, 2026

test: guard issue #229 — deploy_skills receives effective task path #308

Merged

deploy_skills double-deploys when skills_dir is set, causing 'cannot overwrite directory "/skills/..."' on container cp #229

Closed

refactor: BenchFlow v0.3.4 — unified types, Rollout, Sandbox protocol, rewards, adapters #274

xdotli merged 36 commits into main from refactor/v0.4

		Backward-compat aliases: ``Job = Evaluation``, ``EvaluationConfig = EvaluationConfig``,
		``EvaluationResult = EvaluationResult``.

devin-ai-integration Bot commented May 16, 2026

🤖 Devin AI Engineer

devin-ai-integration Bot left a comment

✅ Devin Review: No Issues Found

devin-ai-integration Bot commented May 16, 2026

Test Results — BenchFlow v0.4 Refactor

Results: 8/8 passed

devin-ai-integration Bot commented May 16, 2026

Adapter Feature Tests — Follow-up
