refactor: BenchFlow v0.3.4 — unified types, Rollout, Sandbox protocol, rewards, adapters#274

Merged: xdotli merged 36 commits into main from refactor/v0.4 on May 17, 2026

@devin-ai-integration devin-ai-integration Bot commented May 16, 2026

Summary

BenchFlow v0.4 architecture refactor — consolidated from 6 individual PRs (#261, #262, #265, #266, #268, #271) merged in dependency order.

Core changes:

  • ENG-47: Unified Scene/Role/Turn types into _types.py, added parallel_group + per-role timeouts
  • ENG-48: Sandbox protocol (Sandbox, ImageBuilder) with Docker/Daytona adapters; Harbor internalized
  • ENG-46: Single execution path — Rollout replaces Trial as canonical name, backward-compat aliases preserved
  • ENG-49: Composable rewards — Rubric + RewardFunc protocol, RewardEvent for dense rewards
  • ENG-50: Agent-as-tool infrastructure (capabilities, sandbox networking)
  • ENG-51: External framework adapters (InspectAdapter, ORSAdapter)

Additional on this branch:

  • backend → sandbox rename across CLI/API/docs (11 files)
  • Python convention fixes (from __future__ import annotations, typed returns, @dataclass cleanup)
  • __getattr__ fix: correctly re-raises ModuleNotFoundError instead of swallowing as AttributeError
  • Full docs audit: updated README, concepts.md, getting-started.md, python-api.md for v0.4
  • Removed all Harbor migration/reference content from docs (use-cases.md, scene-patterns.ipynb, coder-reviewer-demo.py, swebench notebook)
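
The `__getattr__` fix listed above can be illustrated with a minimal sketch of a PEP 562 style lazy-export hook. The `_LAZY` table, the `lazy_getattr` name, and the module names are hypothetical and not BenchFlow's actual code. The only point is that an import failure inside the hook propagates as `ModuleNotFoundError` rather than being converted to `AttributeError`.

```python
import importlib
from typing import Any

# Hypothetical lazy-export table: public name -> (module path, attribute).
_LAZY: dict[str, tuple[str, str]] = {
    "OrderedDict": ("collections", "OrderedDict"),
    "Broken": ("no_such_dependency_for_demo", "Broken"),
}


def lazy_getattr(name: str) -> Any:
    """Body of a module-level __getattr__ (PEP 562), written as a plain function."""
    if name not in _LAZY:
        # Unknown names are genuine attribute errors.
        raise AttributeError(f"module has no attribute {name!r}")
    module_name, attr = _LAZY[name]
    # Deliberately no `except ImportError: raise AttributeError(...)` here:
    # a missing dependency surfaces as ModuleNotFoundError with its real cause.
    module = importlib.import_module(module_name)
    return getattr(module, attr)
```

With this shape, `hasattr(package, "Broken")` fails loudly with the missing dependency instead of quietly returning `False`.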

Backward compatibility:

All old names preserved as identity-equal aliases: Trial, TrialConfig, Job, JobConfig, RunResult, RuntimeConfig, Agent, Environment
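
As a rough illustration of what "identity-equal alias" means here, the sketch below binds an old name to the same class object. The `Rollout` body is a hypothetical stand-in, not the real class.

```python
class Rollout:
    """Stand-in for the canonical class (hypothetical body)."""

    def __init__(self, task: str) -> None:
        self.task = task


# Backward-compat alias: a second name bound to the same class object,
# not a subclass and not a wrapper.
Trial = Rollout

old_style = Trial("demo-task")
```

Because both names refer to one object, `isinstance` checks written against either name keep working, and `Trial.__name__` reports `"Rollout"`.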

Review & Testing Checklist for Human

  • Verify from benchflow import Rollout, RolloutConfig, Rubric, Sandbox works
  • Verify backward-compat: from benchflow import Trial; assert Trial is Rollout
  • Run bench eval create -t tasks/jax-computing-basics -a gemini -m gemini-3.1-flash-lite-preview -e daytona end-to-end
  • Spot-check docs (concepts.md, python-api.md) — no more Harbor references, examples use --sandbox not --backend
  • Run full test suite: uv run pytest (963 tests expected, 2 pre-existing env-var-leak failures)

Notes

  • 13/13 doc examples dogfooded against this branch — all pass
  • 11/11 integration tests passed (including adapter round-trip tests)
  • CI green: 961 tests pass, lint/format/typecheck clean
  • Linear tickets ENG-43 through ENG-51 all marked Done

Link to Devin session: https://app.devin.ai/sessions/206c8b356e29441d88a8882b45c4442e
Requested by: @xdotli



xdotli and others added 22 commits May 15, 2026 22:00
Create BenchFlow-native Sandbox protocol as parallel types alongside
existing Harbor imports. No behavioral changes — existing code continues
to use Harbor directly.

New files:
- src/benchflow/sandbox/protocol.py: ExecResult, Sandbox, ImageRef,
  ImageConfig, ImageBuilder
- src/benchflow/sandbox/docker.py: DockerSandbox adapter wrapping
  Harbor DockerEnvironment
- src/benchflow/sandbox/daytona.py: DaytonaSandbox adapter wrapping
  Harbor DaytonaEnvironment
- tests/test_sandbox_protocol.py: 26 tests covering dataclasses,
  protocol conformance, and delegation
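
A minimal sketch of the structural-typing idea behind a `Sandbox` protocol, assuming a single async `exec` method. The field names on `ExecResult` and the `EchoSandbox` adapter are invented for illustration; the real protocol in `sandbox/protocol.py` may differ.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class ExecResult:
    # Field names are assumptions for this sketch.
    return_code: int
    stdout: str


@runtime_checkable
class Sandbox(Protocol):
    async def exec(self, command: str) -> ExecResult: ...


class EchoSandbox:
    """Toy adapter: conforms structurally, without inheriting from Sandbox."""

    async def exec(self, command: str) -> ExecResult:
        return ExecResult(return_code=0, stdout=f"ran: {command}")


result = asyncio.run(EchoSandbox().exec("ls"))
```

Note that `isinstance` against a `@runtime_checkable` protocol only checks that the method names exist, not their signatures.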
Create src/benchflow/_types.py as the single canonical source for the
declarative Role, Scene, and Turn dataclasses.

Changes:
- New _types.py with Role (adds timeout_sec, idle_timeout_sec,
  skills_dir), Scene (adds parallel_group), and Turn
- trial.py imports from _types.py instead of defining its own copies
- _scene.py renames its internal Role to SceneRole (different fields:
  instruction, tools) with backward-compat alias
- __init__.py re-exports canonical types from _types.py; adds
  TrialRole/TrialScene backward-compat aliases
- runtime.py and trial_yaml.py updated to import from _types.py

New fields default to None — no runtime behavior changes.
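
To make "new fields default to None" concrete, here is a sketch of what such dataclasses could look like. Only `timeout_sec`, `idle_timeout_sec`, `skills_dir`, and `parallel_group` come from the commit message; every other field and all of the types are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Role:
    name: str                               # assumed field
    agent: str                              # assumed field
    timeout_sec: Optional[int] = None       # new; type is an assumption
    idle_timeout_sec: Optional[int] = None  # new; type is an assumption
    skills_dir: Optional[str] = None        # new; type is an assumption


@dataclass
class Turn:
    role: str                               # assumed field
    prompt: Optional[str] = None            # assumed field


@dataclass
class Scene:
    roles: list[Role] = field(default_factory=list)  # assumed field
    turns: list[Turn] = field(default_factory=list)  # assumed field
    parallel_group: Optional[str] = None    # new; type is an assumption


# Existing call sites that never pass the new fields keep working.
scene = Scene(roles=[Role("coder", "gemini")], turns=[Turn("coder", "fix the bug")])
```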
…_file

- Use shlex.quote(path) in read_file to prevent shell injection
- Clean up temp files with os.unlink in write_file finally block
- Move imports to module level
- read_file now raises FileNotFoundError on non-zero return code
- Export ImageConfig alongside ImageBuilder in __init__.py
- Add tests for read_file error behavior
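
The two `read_file` changes above (quoting the path and raising on a non-zero return code) can be sketched as follows. The `Exec` callable and `fake_run` are hypothetical stand-ins for a sandbox exec call; only `shlex.quote` and `FileNotFoundError` come from the commit message.

```python
import shlex
from typing import Callable

# Hypothetical executor signature: command string -> (return code, stdout).
Exec = Callable[[str], tuple[int, str]]


def read_file(run: Exec, path: str) -> str:
    # Quote the path so shell metacharacters are treated as literal text.
    return_code, stdout = run(f"cat {shlex.quote(path)}")
    if return_code != 0:
        raise FileNotFoundError(path)
    return stdout


def fake_run(command: str) -> tuple[int, str]:
    # Stand-in for a sandbox exec call; only one path "exists".
    return (0, "hello") if command == "cat /tmp/ok.txt" else (1, "")
```

A path such as `x; rm -rf /` reaches the shell as the single quoted argument `'x; rm -rf /'`, so the trailing command is never executed.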
…ENG-47)

refactor: unify Scene/Role/Turn types into _types.py (ENG-47)
…48 Phase A)

refactor: add Sandbox protocol + Harbor adapters (ENG-48 Phase A)
- Rename trial.py → rollout.py, Trial → Rollout, TrialConfig → RolloutConfig
- Rename RunResult → RolloutResult in models.py
- Create evaluation.py from job.py: Job → Evaluation, JobConfig → EvaluationConfig, JobResult → EvaluationResult
- Create _run.py with bf.run(RolloutConfig) → RolloutResult entry point
- Replace sdk.py with thin backward-compat shim delegating to rollout.py
- Replace trial.py with re-export shim from rollout.py
- Replace job.py with re-export shim from evaluation.py
- Preserve runtime.py API, update internal imports to use Rollout
- Update __init__.py: new public API + backward-compat aliases
- Update test patch targets from benchflow.trial → benchflow.rollout
- All 841 tests pass, lint clean, pre-existing typecheck errors unchanged
- Create src/benchflow/rewards/ package with:
  - RewardFunc protocol (single scoring dimension)
  - Rubric dataclass (weighted collection of RewardFuncs)
  - VerifyResult dataclass (aggregated scoring result)
  - RewardEvent dataclass (dense/terminal reward signals)
  - Built-in funcs: TestRewardFunc, LLMJudgeRewardFunc,
    StringMatchRewardFunc, CodeExecRewardFunc
- Add reward_events field to RunResult (additive, no breaking changes)
- Re-export all reward types from benchflow top-level __init__.py
- Backward compat: Rubric([TestRewardFunc()]) wraps existing
  test.sh -> reward.txt flow
- 31 tests covering all reward types, rubric scoring, weights,
  error handling, protocol conformance, and re-exports
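
A simplified sketch of the Rubric idea: a weighted collection of reward functions that each score one dimension. The `reward_funcs` and `weights` names and the `rollout_dir: Path` argument follow the later docs fix in this PR. The `score` method name on `RewardFunc`, the synchronous call style, the plain `float` return (the real `Rubric.score()` is reported to return a `VerifyResult`), and the normalization by total weight are all assumptions.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Protocol


class RewardFunc(Protocol):
    def score(self, rollout_dir: Path) -> float: ...


@dataclass
class Constant:
    """Toy reward function that returns a fixed score."""

    value: float

    def score(self, rollout_dir: Path) -> float:
        return self.value


@dataclass
class Rubric:
    reward_funcs: list[RewardFunc]
    weights: list[float] = field(default_factory=list)

    def score(self, rollout_dir: Path) -> float:
        # Equal weights when none are given (an assumption of this sketch).
        weights = self.weights or [1.0] * len(self.reward_funcs)
        total = sum(w * f.score(rollout_dir) for w, f in zip(weights, self.reward_funcs))
        return total / sum(weights)


rubric = Rubric([Constant(0.8), Constant(0.2)], weights=[0.7, 0.3])
reward = rubric.score(Path("."))
```

With weights 0.7 and 0.3 this gives 0.7 × 0.8 + 0.3 × 0.2, which is 0.62 up to floating-point rounding.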
… name

Append _N suffix on collision so all per-func scores are preserved.
Addresses Devin Review feedback on PR #266.
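
One way to picture the `_N` suffixing is the helper below. The function name and the exact numbering scheme (first occurrence unsuffixed, later ones counted from 1) are assumptions, not the actual implementation.

```python
def disambiguate(names: list[str]) -> list[str]:
    """Return unique keys, appending _N to repeated names."""
    seen: dict[str, int] = {}
    keys: list[str] = []
    for name in names:
        count = seen.get(name, 0)
        seen[name] = count + 1
        # First occurrence keeps the bare name; repeats get a numeric suffix.
        keys.append(name if count == 0 else f"{name}_{count}")
    return keys
```

Without a step like this, two reward functions of the same class would write to the same dictionary key and the earlier score would be overwritten.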
- Add Role.capabilities field for declarative agent capability lists
- Add Sandbox.expose_ports property for inter-agent communication ports
- Document capability boundary: BenchFlow provides sandbox + instruction +
  observation; agents own loops, tool protocols, and agent-as-tool
- Update _scene.py and _types.py module docstrings with ENG-50 boundary
- Add 16 tests covering capabilities, expose_ports, and Coder→Reviewer→Coder
  turn sequences
…ath (ENG-46)

refactor: kill shim layers, single Rollout execution path (ENG-46)
…cture (ENG-50)

feat: per-agent capabilities + agent-as-tool infrastructure (ENG-50)
feat: composable Rubric + RewardFunc protocol (ENG-49)
Add benchflow.adapters package with thin format converters:
- InspectAdapter / to_inspect_task: Scene + Rubric -> Inspect AI task dict
- ORSAdapter / to_ors_reward: VerifyResult + RewardEvent -> ORS format dict

No external dependencies required — these are pure format converters.
Re-exported from benchflow top-level __init__.py.
14 tests covering both adapters, convenience functions, and round-trips.
feat: external adapters for Inspect AI + ORS (ENG-51)
- Parametrize bare dict returns to dict[str, Any] in adapters
- Remove unnecessary @dataclass on adapter classes (no fields)
- Add from __future__ import annotations to evaluation.py
- Fix __getattr__ to re-raise ModuleNotFoundError for broken deps
- Add private helper re-exports to trial.py shim for backward compat
- Convert test_save_trajectory to async (fixes event loop deprecation)

@devin-ai-integration devin-ai-integration Bot left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

…o rollout.py

Port ensure_bedrock_proxy_runtime/stop_provider_runtime from trial.py (PR #267)
into rollout.py execution paths: connect(), cleanup(), and connect_as().
Update trial.py shim with _provider_runtime re-exports.
Fix test patches to target benchflow.rollout instead of benchflow.trial.
@devin-ai-integration
Contributor Author

Test Results — BenchFlow v0.4 Refactor

Branch: refactor/v0.4 @ 5b80c34 (includes merge of main + Bedrock PR #267)
Session: Devin

Results: 8/8 passed

E2E Integration Test (bench eval create)

Full pipeline completed successfully:

  • Task: jax-computing-basics (from benchflow-ai/skillsbench)
  • Agent: gemini (gemini-3.1-flash-lite-preview)
  • Backend: daytona
  • Pipeline: task download → Daytona env → Dockerfile build → agent install → sandbox user → ACP connection → 15 tool calls → verifier → reward extraction
  • Reward: 0.0 (expected for lightweight model — pipeline correctness is what matters)
  • Exit code: 0

API & Compatibility Tests (7/7)

| # | Test | Result |
|---|------|--------|
| 1 | New v0.4 imports (24 types: Rollout, Sandbox, Rubric, RewardFunc, etc.) | passed |
| 2 | Backward-compat aliases identity-equal (Trial is Rollout, Job is Evaluation, etc.) | passed |
| 3 | __getattr__ re-raises broken dep ModuleNotFoundError (not swallowed as AttributeError) | passed |
| 4 | Sandbox protocol is @runtime_checkable (isinstance works) | passed |
| 5 | Rubric composition: score() → VerifyResult(reward=1.0) with RewardEvent | passed |
| 6 | Full test suite: 963 passed, 0 failed, 28s | passed |
| 7 | Lint (ruff check) + format (ruff format --check): all clean | passed |

Merge Conflict Resolution (Bedrock PR #267)
  • Ported ensure_bedrock_proxy_runtime / stop_provider_runtime into rollout.py (connect(), cleanup(), connect_as())
  • Updated trial.py shim with _provider_runtime re-exports
  • Fixed Bedrock test patches to target benchflow.rollout instead of benchflow.trial
  • All Bedrock tests pass

@devin-ai-integration
Contributor Author

Adapter Feature Tests — Follow-up

Added 3 more tests specifically exercising the ENG-51 adapter features (InspectAdapter + ORSAdapter):

Results: 3/3 passed

Test 9: InspectAdapter (to_inspect_task)

Exercised to_inspect_task() with realistic data:

  • Multi-role scene (coder + reviewer, 3 turns) → correct name, dataset (3 samples with input/role)
  • Without rubric → no scorer field
  • With rubric → scorer: {type: "benchflow_rubric", reward_funcs: 2, weights: [0.6, 0.4]}
  • Edge cases: empty scene → dataset=[]; None prompt → ""
Test 10: ORSAdapter (to_ors_reward)

Exercised to_ors_reward() and ORSAdapter:

  • Success: VerifyResult(reward=0.75) → {reward: 0.75, is_valid: true, metadata: {items: {...}, events: [...]}}
  • Error: VerifyResult(error="Verifier timed out") → {is_valid: false, error: "Verifier timed out"}
  • Dense event step=3 preserved; all timestamp/type/source/reward fields correct
  • reward_event_to_ors() individual event conversion works
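
The success and error shapes listed above suggest a conversion along these lines. This is a sketch built only from the dictionaries shown in this comment: `VerifyResult` here is a minimal stand-in, and `sketch_to_ors` is a hypothetical name, not the real `to_ors_reward` implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class VerifyResult:
    # Minimal stand-in; the real dataclass may carry more fields.
    reward: float = 0.0
    items: dict[str, float] = field(default_factory=dict)
    events: list[dict[str, Any]] = field(default_factory=list)
    error: Optional[str] = None


def sketch_to_ors(result: VerifyResult) -> dict[str, Any]:
    if result.error is not None:
        # Error results carry no reward, only the failure reason.
        return {"is_valid": False, "error": result.error}
    return {
        "reward": result.reward,
        "is_valid": True,
        "metadata": {"items": dict(result.items), "events": list(result.events)},
    }
```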
Test 11: Full Adapter Round-Trip

End-to-end: Scene.single(agent='gemini') → to_inspect_task(scene, rubric) → rubric.score() → to_ors_reward(result)

  • Weighted scoring: 0.7 × 0.8 + 0.3 × 0.2 = 0.62
  • All RewardEvent fields preserved through Rubric → ORS pipeline (reward, source, type, timestamp)

Total across both test runs: 11/11 passed (8 original + 3 adapter tests)

xdotli added 2 commits May 16, 2026 17:39
Aligns naming with the v0.4 Sandbox protocol (ENG-48). The
'backend' parameter/flag/attribute on Environment and the CLI is
now 'sandbox' everywhere.

Scope:
- runtime.py: Environment.__init__(sandbox=), .sandbox, from_task(sandbox=)
- cli/main.py: --sandbox flag (was --backend), help text
- _snapshot.py: docstring updates
- tests/test_runtime.py: updated assertions
- docs/: all references in task-authoring, CLI ref, Python API ref,
  running-benchmarks, integration-tests, examples

NOT renamed (different meaning):
- _credentials.py 'Vertex AI backend' (provider concept)
- _provider_runtime.py / bedrock_proxy.py backend_model (LLM model)
- _sandbox.py 'build backends' (Python packaging)
…/Adapter docs

- README: 'Sandbox backends' → 'Sandboxes', table says Rollout not Trial
- concepts.md: Environment no longer says 'Backed by Harbor', Trial → Rollout
  as core primitive, lifecycle diagram uses Rollout.run()
- getting-started.md: uses RolloutConfig, links to rollout lifecycle
- python-api.md: RolloutConfig (aliased as TrialConfig), Rollout (aliased as
  Trial), canonical imports shown first, added v0.4 types section with Sandbox
  protocol, Rubric + RewardFunc, Adapters, Evaluation, backward-compat aliases
devin-ai-integration[bot]

This comment was marked as resolved.

Fixes 3 issues flagged by Devin Review:
- StringMatchRewardFunc: remove non-existent 'field' parameter
- Rubric: use reward_funcs + weights instead of items=[(func, weight)]
- rubric.score: takes rollout_dir: Path, not sandbox=

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 2 new potential issues.

View 19 additional findings in Devin Review.

Comment thread src/benchflow/evaluation.py Outdated
Comment on lines +7 to +8
Backward-compat aliases: ``Job = Evaluation``, ``EvaluationConfig = EvaluationConfig``,
``EvaluationResult = EvaluationResult``.

🟡 Incorrect backward-compat alias documentation in evaluation.py docstring

The module docstring documents the backward-compat aliases as tautologies (EvaluationConfig = EvaluationConfig, EvaluationResult = EvaluationResult) instead of the actual mappings (JobConfig = EvaluationConfig, JobResult = EvaluationResult). The actual code at src/benchflow/evaluation.py:677-679 defines the correct aliases, but the docstring at the top of the file will mislead developers trying to understand the backward-compat story.

Suggested change
- Backward-compat aliases: ``Job = Evaluation``, ``EvaluationConfig = EvaluationConfig``,
- ``EvaluationResult = EvaluationResult``.
+ Backward-compat aliases: ``Job = Evaluation``, ``JobConfig = EvaluationConfig``,
+ ``JobResult = EvaluationResult``.

Good catch — the docstring has tautological aliases. Will fix to JobConfig = EvaluationConfig, JobResult = EvaluationResult.

Comment thread src/benchflow/__init__.py Outdated
xdotli and others added 6 commits May 16, 2026 18:28
- use-cases.md: remove 'How it works vs Harbor' and 'Migration from Harbor' sections
- scene-patterns.ipynb: remove 'Mapping to Harbor PR #1462 Concepts' cell
- coder-reviewer-demo.py: 'Harbor-format task directory' → 'BenchFlow task directory'
- swebench notebook: remove Harbor #1316 reference from intro
* cli: unify sandbox flag --env → --sandbox, -b → -e (ENG-56)

Standardize all CLI commands to use --sandbox/-e for the sandbox
parameter. Previously bench run used --sandbox/-b while bench eval
create and others used --env/-e.

- bench run: -b → -e
- bench eval create/run/compare: --env → --sandbox
- bench eval progress: --env → --sandbox
- bench skills eval: --env → --sandbox
- bench environment create: -b → -e
- docs/examples: updated flag references

* cli: drop -e/-b short flags, use --sandbox only (ENG-56)

Remove all short flags (-e, -b) for the sandbox parameter across all
CLI commands. The long flag --sandbox is the only way to specify the
sandbox now.

Updated 15 files: CLI definitions, help text examples, README,
docs, skill files, integration test runner, and test examples.

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Per ENG-46, bench eval create is the canonical CLI entry point.
bench run goes through the old SDK shim and is now deprecated,
matching the pattern used for bench job and other legacy commands.

Python API bf.run() is unaffected (backward-compat alias stays).

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* feat(ENG-55): first-class LLM-as-judge verifier with dense reward

- Add rubric_config.py: Pydantic models for rubric.toml (Criterion,
  JudgeConfig, ScoringConfig, RubricConfig) with binary/likert/numeric
  criterion types and score normalization
- Add llm.py: Multi-provider LLM routing (Anthropic/OpenAI/Gemini)
  with call_judge(), parse_verdict(), exponential backoff retry
- Add file_readers.py: Document extraction (pdf/docx/xlsx/pptx/txt)
  with find_deliverables() for rollout directory scanning
- Replace placeholder LLMJudgeRewardFunc in builtins.py with full
  rubric-based judge: per-criterion scoring, prompt templates,
  configurable aggregation (weighted_mean/all_pass/any_pass/threshold),
  backward-compatible legacy mode
- Emit per-criterion dense RewardEvents for fine-grained reward signals
- Write evaluation_details.json alongside rollout for transparency
- Re-export Criterion, JudgeConfig, RubricConfig, ScoringConfig,
  load_rubric_toml from rewards and top-level __init__
- Add test_rubric_config.py (15 tests) and test_llm_judge.py (30 tests)

Closes ENG-55
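
The four aggregation strategy names above (weighted_mean, all_pass, any_pass, threshold) could behave roughly as in the sketch below. The semantics of each strategy, the `pass_mark` and `threshold` parameters, and the function name are all guesses for illustration. The commit message only names the strategies, so the actual rules in `builtins.py` may differ.

```python
def aggregate(
    scores: list[float],
    weights: list[float],
    strategy: str = "weighted_mean",
    threshold: float = 0.5,   # assumed parameter
    pass_mark: float = 0.5,   # assumed parameter
) -> float:
    """Combine normalized per-criterion scores (0..1) into one reward."""
    mean = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
    if strategy == "weighted_mean":
        return mean
    if strategy == "all_pass":
        # Assumed meaning: every criterion must reach the pass mark.
        return 1.0 if all(s >= pass_mark for s in scores) else 0.0
    if strategy == "any_pass":
        # Assumed meaning: one passing criterion is enough.
        return 1.0 if any(s >= pass_mark for s in scores) else 0.0
    if strategy == "threshold":
        # Assumed meaning: binarize the weighted mean.
        return 1.0 if mean >= threshold else 0.0
    raise ValueError(f"unknown strategy: {strategy!r}")
```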

* fix(ENG-55): async clients, asyncio.sleep, and score mismatch bugs

- Replace sync SDK clients with async variants (AsyncAnthropic, AsyncOpenAI, client.aio for Google)
- Replace time.sleep() with await asyncio.sleep() in call_judge retry loop
- Pass actual aggregated score to _write_details instead of recomputing n_passed/total
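
The `time.sleep()` to `asyncio.sleep()` change matters because a blocking sleep inside a coroutine stalls the whole event loop during backoff. A minimal retry loop with exponential backoff might look like this. The `call_with_retry` helper, its parameters, and the `flaky` coroutine are hypothetical and not the actual `call_judge` code.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def call_with_retry(
    fn: Callable[[], Awaitable[T]],
    retries: int = 3,
    base_delay: float = 0.01,
) -> T:
    for attempt in range(retries):
        try:
            return await fn()
        except Exception:
            if attempt == retries - 1:
                raise
            # Non-blocking wait: other coroutines keep running during backoff.
            await asyncio.sleep(base_delay * 2**attempt)
    raise RuntimeError("retries must be >= 1")


calls = {"n": 0}


async def flaky() -> str:
    # Fails twice, then succeeds, to exercise the retry path.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "verdict"


result = asyncio.run(call_with_retry(flaky))
```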

* docs(ENG-55): add LLM-as-judge verifier documentation

- Add docs/llm-judge.md: full user-facing guide with rubric.toml reference,
  criterion types, aggregation strategies, dense reward events, multi-provider
  routing, file discovery, inline criteria, and worked examples
- Add src/benchflow/rewards/README.md: module-level guide with usage examples
- Update docs/concepts.md: reference LLM judge in Verifier primitive and
  add to 'Where to go next' links

* docs: polish — remove ENG-55 ticket ref, clarify rubric.json exclusion

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* CLI: remove all short flags, use full names only [ENG-74]

Remove all single-letter short flags (-a, -m, -o, -t, -c, -f, -s, -p, -b, -d)
from every CLI command. Full flag names only (--agent, --model, --jobs-dir,
--tasks-dir, --concurrency, --config, --skills-dir, --prompt, --benchmark, --dir).

Updated across 23 files:
- src/benchflow/cli/main.py: all typer.Option definitions + docstring examples
- All docs/ (getting-started, running-benchmarks, cli.md, skill-eval, integration-tests, examples/)
- README.md
- .claude/skills/ (SKILL.md + references)
- tests/ (integration/run.sh, test_codex_custom_provider.sh, conformance/README.md)
- benchmarks/ (models-as-skills.md, harvey-lab/run_harvey_lab.py)
- Test fix: test_skill_eval_dryrun.py invocation updated from -a to --agent

* fix: replace remaining -f with --config in task-embedded SKILL.md files

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

devin-ai-integration Bot and others added 4 commits May 17, 2026 04:02
* fix: apply PR #230 (skills double-deploy) and PR #242 (shell injection) fixes to v0.4

PR #230: pass effective task path to deploy_skills, set effective_skills in already-injected branch
PR #242: shlex.quote outbox file paths in _scene.py, harden _snapshot.py with path validation

* fix: set effective_skills to /skills when Dockerfile already injected

When deploy_skills detects the Dockerfile already bakes in
`COPY _deps/skills /skills/`, the runtime upload is skipped but
effective_skills was left as task.config.environment.skills_dir
(often None), causing the subsequent linking step to silently
skip distribution to agent-specific paths like
/home/agent/.agents/skills. Sandbox users relied on that runtime
linking since Dockerfile injection only links under /root/.

Tighten the regression test to assert env.exec was called with the
expected `ln -sfn` command.

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
@xdotli xdotli changed the title refactor: BenchFlow v0.4 — unified types, Rollout, Sandbox protocol, rewards, adapters refactor: BenchFlow v0.3.4 — unified types, Rollout, Sandbox protocol, rewards, adapters May 17, 2026
@xdotli xdotli merged commit 3e8b06c into main May 17, 2026
1 of 2 checks passed
xdotli added a commit that referenced this pull request May 18, 2026
…, ACPX integration (#288)

* refactor: BenchFlow v0.3.4 — unified types, Rollout, Sandbox protocol, rewards, adapters (#274)

* refactor: add Sandbox protocol + Harbor adapters (ENG-48 Phase A)

Create BenchFlow-native Sandbox protocol as parallel types alongside
existing Harbor imports. No behavioral changes — existing code continues
to use Harbor directly.

New files:
- src/benchflow/sandbox/protocol.py: ExecResult, Sandbox, ImageRef,
  ImageConfig, ImageBuilder
- src/benchflow/sandbox/docker.py: DockerSandbox adapter wrapping
  Harbor DockerEnvironment
- src/benchflow/sandbox/daytona.py: DaytonaSandbox adapter wrapping
  Harbor DaytonaEnvironment
- tests/test_sandbox_protocol.py: 26 tests covering dataclasses,
  protocol conformance, and delegation

* refactor: unify Scene/Role/Turn types into _types.py (ENG-47)

Create src/benchflow/_types.py as the single canonical source for the
declarative Role, Scene, and Turn dataclasses.

Changes:
- New _types.py with Role (adds timeout_sec, idle_timeout_sec,
  skills_dir), Scene (adds parallel_group), and Turn
- trial.py imports from _types.py instead of defining its own copies
- _scene.py renames its internal Role to SceneRole (different fields:
  instruction, tools) with backward-compat alias
- __init__.py re-exports canonical types from _types.py; adds
  TrialRole/TrialScene backward-compat aliases
- runtime.py and trial_yaml.py updated to import from _types.py

New fields default to None — no runtime behavior changes.

* fix: address review — shlex.quote in read_file, cleanup temp in write_file

- Use shlex.quote(path) in read_file to prevent shell injection
- Clean up temp files with os.unlink in write_file finally block
- Move imports to module level

* fix: read_file error checking + export ImageConfig from top-level

- read_file now raises FileNotFoundError on non-zero return code
- Export ImageConfig alongside ImageBuilder in __init__.py
- Add tests for read_file error behavior

* refactor: kill shim layers, single Rollout execution path (ENG-46)

- Rename trial.py → rollout.py, Trial → Rollout, TrialConfig → RolloutConfig
- Rename RunResult → RolloutResult in models.py
- Create evaluation.py from job.py: Job → Evaluation, JobConfig → EvaluationConfig, JobResult → EvaluationResult
- Create _run.py with bf.run(RolloutConfig) → RolloutResult entry point
- Replace sdk.py with thin backward-compat shim delegating to rollout.py
- Replace trial.py with re-export shim from rollout.py
- Replace job.py with re-export shim from evaluation.py
- Preserve runtime.py API, update internal imports to use Rollout
- Update __init__.py: new public API + backward-compat aliases
- Update test patch targets from benchflow.trial → benchflow.rollout
- All 841 tests pass, lint clean, pre-existing typecheck errors unchanged

* style: apply ruff format to pass CI format check

* fix: resolve ty typecheck errors (Any for kwargs and sentinel)

* feat: composable Rubric + RewardFunc protocol (ENG-49)

- Create src/benchflow/rewards/ package with:
  - RewardFunc protocol (single scoring dimension)
  - Rubric dataclass (weighted collection of RewardFuncs)
  - VerifyResult dataclass (aggregated scoring result)
  - RewardEvent dataclass (dense/terminal reward signals)
  - Built-in funcs: TestRewardFunc, LLMJudgeRewardFunc,
    StringMatchRewardFunc, CodeExecRewardFunc
- Add reward_events field to RunResult (additive, no breaking changes)
- Re-export all reward types from benchflow top-level __init__.py
- Backward compat: Rubric([TestRewardFunc()]) wraps existing
  test.sh -> reward.txt flow
- 31 tests covering all reward types, rubric scoring, weights,
  error handling, protocol conformance, and re-exports

* style: format rewards package and tests with ruff

* fix: disambiguate Rubric.items keys when multiple funcs share a class name

Append _N suffix on collision so all per-func scores are preserved.
Addresses Devin Review feedback on PR #266.

* feat: per-agent capabilities + agent-as-tool infrastructure (ENG-50)

- Add Role.capabilities field for declarative agent capability lists
- Add Sandbox.expose_ports property for inter-agent communication ports
- Document capability boundary: BenchFlow provides sandbox + instruction +
  observation; agents own loops, tool protocols, and agent-as-tool
- Update _scene.py and _types.py module docstrings with ENG-50 boundary
- Add 16 tests covering capabilities, expose_ports, and Coder→Reviewer→Coder
  turn sequences

* style: apply ruff format to sandbox adapters and tests

* feat: external adapters for Inspect AI + ORS (ENG-51)

Add benchflow.adapters package with thin format converters:
- InspectAdapter / to_inspect_task: Scene + Rubric -> Inspect AI task dict
- ORSAdapter / to_ors_reward: VerifyResult + RewardEvent -> ORS format dict

No external dependencies required — these are pure format converters.
Re-exported from benchflow top-level __init__.py.
14 tests covering both adapters, convenience functions, and round-trips.

* style: fix ruff format for adapters and tests

* fix: convention fixes + trial shim re-exports + __getattr__ narrowing

- Parametrize bare dict returns to dict[str, Any] in adapters
- Remove unnecessary @dataclass on adapter classes (no fields)
- Add from __future__ import annotations to evaluation.py
- Fix __getattr__ to re-raise ModuleNotFoundError for broken deps
- Add private helper re-exports to trial.py shim for backward compat
- Convert test_save_trajectory to async (fixes event loop deprecation)

* rename: backend → sandbox across API, CLI, docs, and tests

Aligns naming with the v0.4 Sandbox protocol (ENG-48). The
'backend' parameter/flag/attribute on Environment and the CLI is
now 'sandbox' everywhere.

Scope:
- runtime.py: Environment.__init__(sandbox=), .sandbox, from_task(sandbox=)
- cli/main.py: --sandbox flag (was --backend), help text
- _snapshot.py: docstring updates
- tests/test_runtime.py: updated assertions
- docs/: all references in task-authoring, CLI ref, Python API ref,
  running-benchmarks, integration-tests, examples

NOT renamed (different meaning):
- _credentials.py 'Vertex AI backend' (provider concept)
- _provider_runtime.py / bedrock_proxy.py backend_model (LLM model)
- _sandbox.py 'build backends' (Python packaging)

* docs: update for v0.4 — Rollout as canonical name, add Sandbox/Rubric/Adapter docs

- README: 'Sandbox backends' → 'Sandboxes', table says Rollout not Trial
- concepts.md: Environment no longer says 'Backed by Harbor', Trial → Rollout
  as core primitive, lifecycle diagram uses Rollout.run()
- getting-started.md: uses RolloutConfig, links to rollout lifecycle
- python-api.md: RolloutConfig (aliased as TrialConfig), Rollout (aliased as
  Trial), canonical imports shown first, added v0.4 types section with Sandbox
  protocol, Rubric + RewardFunc, Adapters, Evaluation, backward-compat aliases

* fix: correct Rubric/RewardFunc API examples in python-api.md

Fixes 3 issues flagged by Devin Review:
- StringMatchRewardFunc: remove non-existent 'field' parameter
- Rubric: use reward_funcs + weights instead of items=[(func, weight)]
- rubric.score: takes rollout_dir: Path, not sandbox=

* fix: correct evaluation.py docstring aliases, clean stale comment in __init__.py

* docs: remove Harbor migration references

- use-cases.md: remove 'How it works vs Harbor' and 'Migration from Harbor' sections
- scene-patterns.ipynb: remove 'Mapping to Harbor PR #1462 Concepts' cell
- coder-reviewer-demo.py: 'Harbor-format task directory' → 'BenchFlow task directory'
- swebench notebook: remove Harbor #1316 reference from intro

* CLI: unify sandbox flag (--env → --sandbox, -b → -e) [ENG-56] (#281)

* cli: unify sandbox flag --env → --sandbox, -b → -e (ENG-56)

Standardize all CLI commands to use --sandbox/-e for the sandbox
parameter. Previously bench run used --sandbox/-b while bench eval
create and others used --env/-e.

- bench run: -b → -e
- bench eval create/run/compare: --env → --sandbox
- bench eval progress: --env → --sandbox
- bench skills eval: --env → --sandbox
- bench environment create: -b → -e
- docs/examples: updated flag references

* cli: drop -e/-b short flags, use --sandbox only (ENG-56)

Remove all short flags (-e, -b) for the sandbox parameter across all
CLI commands. The long flag --sandbox is the only way to specify the
sandbox now.

Updated 15 files: CLI definitions, help text examples, README,
docs, skill files, integration test runner, and test examples.

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

* cli: deprecate bench run in favor of bench eval create (ENG-57) (#282)

Per ENG-46, bench eval create is the canonical CLI entry point.
bench run goes through the old SDK shim and is now deprecated,
matching the pattern used for bench job and other legacy commands.

Python API bf.run() is unaffected (backward-compat alias stays).

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

* feat(ENG-55): first-class LLM-as-judge verifier with dense reward (#277)

* feat(ENG-55): first-class LLM-as-judge verifier with dense reward

- Add rubric_config.py: Pydantic models for rubric.toml (Criterion,
  JudgeConfig, ScoringConfig, RubricConfig) with binary/likert/numeric
  criterion types and score normalization
- Add llm.py: Multi-provider LLM routing (Anthropic/OpenAI/Gemini)
  with call_judge(), parse_verdict(), exponential backoff retry
- Add file_readers.py: Document extraction (pdf/docx/xlsx/pptx/txt)
  with find_deliverables() for rollout directory scanning
- Replace placeholder LLMJudgeRewardFunc in builtins.py with full
  rubric-based judge: per-criterion scoring, prompt templates,
  configurable aggregation (weighted_mean/all_pass/any_pass/threshold),
  backward-compatible legacy mode
- Emit per-criterion dense RewardEvents for fine-grained reward signals
- Write evaluation_details.json alongside rollout for transparency
- Re-export Criterion, JudgeConfig, RubricConfig, ScoringConfig,
  load_rubric_toml from rewards and top-level __init__
- Add test_rubric_config.py (15 tests) and test_llm_judge.py (30 tests)

Closes ENG-55

* fix(ENG-55): async clients, asyncio.sleep, and score mismatch bugs

- Replace sync SDK clients with async variants (AsyncAnthropic, AsyncOpenAI, client.aio for Google)
- Replace time.sleep() with await asyncio.sleep() in call_judge retry loop
- Pass actual aggregated score to _write_details instead of recomputing n_passed/total

* docs(ENG-55): add LLM-as-judge verifier documentation

- Add docs/llm-judge.md: full user-facing guide with rubric.toml reference,
  criterion types, aggregation strategies, dense reward events, multi-provider
  routing, file discovery, inline criteria, and worked examples
- Add src/benchflow/rewards/README.md: module-level guide with usage examples
- Update docs/concepts.md: reference LLM judge in Verifier primitive and
  add to 'Where to go next' links

* docs: polish — remove ENG-55 ticket ref, clarify rubric.json exclusion

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

* CLI: remove all short flags, use full names only [ENG-74] (#284)

* CLI: remove all short flags, use full names only [ENG-74]

Remove all single-letter short flags (-a, -m, -o, -t, -c, -f, -s, -p, -b, -d)
from every CLI command. Full flag names only (--agent, --model, --jobs-dir,
--tasks-dir, --concurrency, --config, --skills-dir, --prompt, --benchmark, --dir).

Updated across 23 files:
- src/benchflow/cli/main.py: all typer.Option definitions + docstring examples
- All docs/ (getting-started, running-benchmarks, cli.md, skill-eval, integration-tests, examples/)
- README.md
- .claude/skills/ (SKILL.md + references)
- tests/ (integration/run.sh, test_codex_custom_provider.sh, conformance/README.md)
- benchmarks/ (models-as-skills.md, harvey-lab/run_harvey_lab.py)
- Test fix: test_skill_eval_dryrun.py invocation updated from -a to --agent

* fix: replace remaining -f with --config in task-embedded SKILL.md files

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

* Backport PR #230 and #242 fixes to refactor/v0.4 (#286)

* fix: apply PR #230 (skills double-deploy) and PR #242 (shell injection) fixes to v0.4

PR #230: pass effective task path to deploy_skills, set effective_skills in already-injected branch
PR #242: shlex.quote outbox file paths in _scene.py, harden _snapshot.py with path validation
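
To illustrate the class of bug the PR #242 backport addresses: interpolating a file path into a shell command string lets a crafted path inject extra commands, and `shlex.quote` neutralizes that. The helper name `build_cat_command` is hypothetical and only shows the quoting step, not the code in _scene.py.

```python
import shlex


def build_cat_command(path: str) -> str:
    """Build a shell command that reads `path`, quoting it for POSIX shells."""
    # Without shlex.quote, a path such as "a; rm -rf /" would run two commands.
    return f"cat {shlex.quote(path)}"
```

Safe paths pass through unchanged, while a path containing shell metacharacters is wrapped in single quotes so the shell treats it as one argument.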

* fix: set effective_skills to /skills when Dockerfile already injected

When deploy_skills detects the Dockerfile already bakes in
`COPY _deps/skills /skills/`, the runtime upload is skipped but
effective_skills was left as task.config.environment.skills_dir
(often None), causing the subsequent linking step to silently
skip distribution to agent-specific paths like
/home/agent/.agents/skills. Sandbox users relied on that runtime
linking since Dockerfile injection only links under /root/.

Tighten the regression test to assert env.exec was called with the
expected `ln -sfn` command.

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>

* fix locally

* release: 0.3.4

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>

* refactor: remove Harbor references, purge backward-compat aliases in tests, update terminology

- Remove 'terminal-bench' from pyproject.toml keywords
- Rename _make_harbor_mock → _make_sandbox_mock in test files
- Update Harbor patch paths to sandbox equivalents in tests
- Rename _from_harbor_yaml → _from_legacy_yaml in evaluation.py
- Update docstrings/comments: Harbor → BenchFlow/legacy terminology
- Internalize Harbor types into benchflow.task/ subpackage
- Create sandbox backends: base.py, docker_impl.py, daytona_impl.py, modal_impl.py
- Add compose/ subpackage with docker-compose YAML templates

* refactor: modernize file structure — align with test-drive patterns

- sandbox/: merge docker_impl→docker, daytona_impl→daytona, base→_base
- sandbox/: move _sandbox→lockdown, _snapshot→snapshot, _daytona_patches→_sdk_ops
- sandbox/: move process, user, environments→services into sandbox/
- sandbox/: restructure compose/ → _compose.py + _compose_files/
- agents/: move _credentials→credentials, _agent_env→env, _agent_setup→install
- _utils/: create subpackage with yaml_loader, benchmark_repos, task_authoring
- trajectories/: move viewer, _trajectory→_capture into subpackage
- experimental/: move mcp/ into experimental/ subpackage
- __init__.py: purge backward-compat aliases (Trial, Job, etc.)
- Update all imports across src/, tests/, experiments/
- Run ruff check + format (0 errors)

* refactor: purge backward-compat aliases + fix review bugs

- Remove Trial=Rollout, TrialConfig=RolloutConfig from rollout.py
- Remove Job=Evaluation, JobConfig=EvaluationConfig, JobResult=EvaluationResult from evaluation.py
- Delete shim files: trial.py, job.py
- Update all imports: benchflow.trial → benchflow.rollout, benchflow.job → benchflow.evaluation
- Rename classes: Trial→Rollout, Job→Evaluation across tests/, benchmarks/, experiments/
- Fix runtime.py: environment_type→sandbox_type, trial_name→rollout_name (review bug)
- Fix conformance scripts: same keyword arg fixes (review bug)
- Clean up __all__ and return type annotations

* docs: update terminology to v0.4 (Trial→Rollout, Job→Evaluation) + bump version to 0.4.0

* chore: update uv.lock for v0.4.0

* refactor: modernize RL terminology — trial_dir→rollout_dir, trial_name→rollout_name across codebase

* style: format viewer.py

* fix: exclude optional-dep sandbox files from ty check

* fix: configure ty to ignore unresolved-import + exclude optional-dep files

* fix: resolve all test failures — patch paths, protocol conformance, RL terminology

* fix: resolve merge conflicts, update broken imports and stale references

- Resolve merge conflict markers in rollout.py and evaluation.py
- Update task_download imports to benchflow._utils.benchmark_repos (11 files)
- Fix benchflow.tasks → benchflow._utils.task_authoring import
- Update scene-patterns.ipynb: TrialConfig→RolloutConfig, trial→rollout
- Fix Job→Evaluation rename in docs
- Ruff auto-fixes (import ordering)

* style: format 12 files to pass ruff format check

* fix: resolve ty check errors in traces/ — dict[str, object] → dict[str, Any]

* fix: add skip guards for optional dependency tests (daytona, modal)

* fix: update stale imports and references in docs/benchmarks (Devin Review)

* refactor: consolidate underscore modules into proper subpackages

- _acp_run.py → acp/runtime.py
- _env_setup.py → sandbox/setup.py
- _provider_runtime.py → providers/runtime.py
- _scoring.py → _utils/scoring.py
- _scene.py → scenes.py (removed Role=SceneRole backward-compat alias)
- Fix GEMINI_API_KEY auto-detection: inherit from os.environ as fallback after .env

* feat: ACPX integration — acpx/<agent> protocol for headless ACP agent invocation

- Add 'acpx' as a valid protocol in parse_agent_spec (acpx/claude, acpx/codex, etc.)
- _acpx_wrap() decorates any registered agent to launch via acpx CLI
- Installs acpx alongside the underlying agent in the sandbox
- Preserves all agent env, credentials, and skill_paths
- Replace harbor protocol references with acpx in tests

* style: format registry.py for ruff format check

* fix: suppress ty invalid-assignment for monkey-patch in _patch_docker_dind

* docs: update docs and SKILL.md for v0.4 — RL-first terminology, ACPX, purge backward-compat

* fix: rename trial_config_from_yaml → rollout_config_from_yaml, fix stale _run import in sdk.py

* fix: docs inconsistencies found by dogfooding — GEMINI_API_KEY, daytona install note

* fix: docs path evaluations/ → jobs/ to match actual output directory

* fix: ruff import ordering in sdk.py

* fix: ty type-check suppression for sdk.py run() return type

* feat: add sandbox-daytona and sandbox-modal optional dependency groups

* chore: update uv.lock for sandbox-daytona and sandbox-modal optional deps

---------

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
xdotli added a commit that referenced this pull request May 19, 2026
* refactor: add Sandbox protocol + Harbor adapters (ENG-48 Phase A)

Create BenchFlow-native Sandbox protocol as parallel types alongside
existing Harbor imports. No behavioral changes — existing code continues
to use Harbor directly.

New files:
- src/benchflow/sandbox/protocol.py: ExecResult, Sandbox, ImageRef,
  ImageConfig, ImageBuilder
- src/benchflow/sandbox/docker.py: DockerSandbox adapter wrapping
  Harbor DockerEnvironment
- src/benchflow/sandbox/daytona.py: DaytonaSandbox adapter wrapping
  Harbor DaytonaEnvironment
- tests/test_sandbox_protocol.py: 26 tests covering dataclasses,
  protocol conformance, and delegation

* refactor: unify Scene/Role/Turn types into _types.py (ENG-47)

Create src/benchflow/_types.py as the single canonical source for the
declarative Role, Scene, and Turn dataclasses.

Changes:
- New _types.py with Role (adds timeout_sec, idle_timeout_sec,
  skills_dir), Scene (adds parallel_group), and Turn
- trial.py imports from _types.py instead of defining its own copies
- _scene.py renames its internal Role to SceneRole (different fields:
  instruction, tools) with backward-compat alias
- __init__.py re-exports canonical types from _types.py; adds
  TrialRole/TrialScene backward-compat aliases
- runtime.py and trial_yaml.py updated to import from _types.py

New fields default to None — no runtime behavior changes.

* fix: address review — shlex.quote in read_file, cleanup temp in write_file

- Use shlex.quote(path) in read_file to prevent shell injection
- Clean up temp files with os.unlink in write_file finally block
- Move imports to module level

* fix: read_file error checking + export ImageConfig from top-level

- read_file now raises FileNotFoundError on non-zero return code
- Export ImageConfig alongside ImageBuilder in __init__.py
- Add tests for read_file error behavior

* refactor: kill shim layers, single Rollout execution path (ENG-46)

- Rename trial.py → rollout.py, Trial → Rollout, TrialConfig → RolloutConfig
- Rename RunResult → RolloutResult in models.py
- Create evaluation.py from job.py: Job → Evaluation, JobConfig → EvaluationConfig, JobResult → EvaluationResult
- Create _run.py with bf.run(RolloutConfig) → RolloutResult entry point
- Replace sdk.py with thin backward-compat shim delegating to rollout.py
- Replace trial.py with re-export shim from rollout.py
- Replace job.py with re-export shim from evaluation.py
- Preserve runtime.py API, update internal imports to use Rollout
- Update __init__.py: new public API + backward-compat aliases
- Update test patch targets from benchflow.trial → benchflow.rollout
- All 841 tests pass, lint clean, pre-existing typecheck errors unchanged

* style: apply ruff format to pass CI format check

* fix: resolve ty typecheck errors (Any for kwargs and sentinel)

* feat: composable Rubric + RewardFunc protocol (ENG-49)

- Create src/benchflow/rewards/ package with:
  - RewardFunc protocol (single scoring dimension)
  - Rubric dataclass (weighted collection of RewardFuncs)
  - VerifyResult dataclass (aggregated scoring result)
  - RewardEvent dataclass (dense/terminal reward signals)
  - Built-in funcs: TestRewardFunc, LLMJudgeRewardFunc,
    StringMatchRewardFunc, CodeExecRewardFunc
- Add reward_events field to RunResult (additive, no breaking changes)
- Re-export all reward types from benchflow top-level __init__.py
- Backward compat: Rubric([TestRewardFunc()]) wraps existing
  test.sh -> reward.txt flow
- 31 tests covering all reward types, rubric scoring, weights,
  error handling, protocol conformance, and re-exports
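
As a rough picture of the protocol-plus-weighted-collection shape described above, the sketch below uses `typing.Protocol` and a dataclass. The class names `RewardFuncSketch` and `RubricSketch`, the synchronous `score(rollout_dir)` method, and the equal-weight default are assumptions made for illustration; the real types in src/benchflow/rewards/ may differ.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Protocol, runtime_checkable


@runtime_checkable
class RewardFuncSketch(Protocol):
    """One scoring dimension: maps a rollout directory to a score."""

    def score(self, rollout_dir: Path) -> float: ...


@dataclass
class RubricSketch:
    """Weighted collection of reward functions (hypothetical shape)."""

    reward_funcs: list[RewardFuncSketch]
    weights: list[float] | None = None

    def score(self, rollout_dir: Path) -> float:
        # Assumed default: equal weights when none are given.
        weights = self.weights or [1.0] * len(self.reward_funcs)
        if len(weights) != len(self.reward_funcs):
            raise ValueError("one weight per reward function is required")
        total = sum(weights)
        if total == 0:
            return 0.0
        scored = (f.score(rollout_dir) * w for f, w in zip(self.reward_funcs, weights))
        return sum(scored) / total
```

Because the protocol is structural, any object with a matching `score` method conforms without inheriting from a base class.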

* style: format rewards package and tests with ruff

* fix: disambiguate Rubric.items keys when multiple funcs share a class name

Append _N suffix on collision so all per-func scores are preserved.
Addresses Devin Review feedback on PR #266.
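
A minimal sketch of the suffix-on-collision idea, assuming the first occurrence keeps the bare class name and later ones get `_1`, `_2`, and so on. The exact numbering used in the fix is not stated here, and `unique_keys` is a hypothetical helper.

```python
def unique_keys(names: list[str]) -> list[str]:
    """Return one key per name, appending _N to repeated names."""
    seen: dict[str, int] = {}
    keys: list[str] = []
    for name in names:
        count = seen.get(name, 0)
        seen[name] = count + 1
        # First occurrence keeps the plain name; repeats get a numeric suffix.
        keys.append(name if count == 0 else f"{name}_{count}")
    return keys
```

Note that this simple scheme can still collide if a class is literally named like a suffixed key (for example `Foo_1`).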

* feat: per-agent capabilities + agent-as-tool infrastructure (ENG-50)

- Add Role.capabilities field for declarative agent capability lists
- Add Sandbox.expose_ports property for inter-agent communication ports
- Document capability boundary: BenchFlow provides sandbox + instruction +
  observation; agents own loops, tool protocols, and agent-as-tool
- Update _scene.py and _types.py module docstrings with ENG-50 boundary
- Add 16 tests covering capabilities, expose_ports, and Coder→Reviewer→Coder
  turn sequences

* style: apply ruff format to sandbox adapters and tests

* feat: external adapters for Inspect AI + ORS (ENG-51)

Add benchflow.adapters package with thin format converters:
- InspectAdapter / to_inspect_task: Scene + Rubric -> Inspect AI task dict
- ORSAdapter / to_ors_reward: VerifyResult + RewardEvent -> ORS format dict

No external dependencies required — these are pure format converters.
Re-exported from benchflow top-level __init__.py.
14 tests covering both adapters, convenience functions, and round-trips.

* style: fix ruff format for adapters and tests

* fix: convention fixes + trial shim re-exports + __getattr__ narrowing

- Parametrize bare dict returns to dict[str, Any] in adapters
- Remove unnecessary @dataclass on adapter classes (no fields)
- Add from __future__ import annotations to evaluation.py
- Fix __getattr__ to re-raise ModuleNotFoundError for broken deps
- Add private helper re-exports to trial.py shim for backward compat
- Convert test_save_trajectory to async (fixes event loop deprecation)
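
The `__getattr__` narrowing above can be pictured with a lazy-export helper: unknown names raise `AttributeError`, but a failed import of a known name is allowed to propagate as `ModuleNotFoundError` so a broken optional dependency is not mistaken for a missing attribute. `lazy_getattr` and its arguments are hypothetical, not the code in `__init__.py`.

```python
import importlib
from typing import Any


def lazy_getattr(name: str, lazy_map: dict[str, str], package: str) -> Any:
    """Resolve `name` from the module listed in `lazy_map`, importing on demand."""
    if name not in lazy_map:
        # Genuinely unknown attribute: this is the only AttributeError case.
        raise AttributeError(f"module {package!r} has no attribute {name!r}")
    # Deliberately no try/except here: a ModuleNotFoundError from a missing
    # dependency propagates unchanged instead of being hidden.
    module = importlib.import_module(lazy_map[name])
    return getattr(module, name)
```

In a package this would typically be wired up as a module-level `__getattr__(name)` that delegates to the helper.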

* rename: backend → sandbox across API, CLI, docs, and tests

Aligns naming with the v0.4 Sandbox protocol (ENG-48). The
'backend' parameter/flag/attribute on Environment and the CLI is
now 'sandbox' everywhere.

Scope:
- runtime.py: Environment.__init__(sandbox=), .sandbox, from_task(sandbox=)
- cli/main.py: --sandbox flag (was --backend), help text
- _snapshot.py: docstring updates
- tests/test_runtime.py: updated assertions
- docs/: all references in task-authoring, CLI ref, Python API ref,
  running-benchmarks, integration-tests, examples

NOT renamed (different meaning):
- _credentials.py 'Vertex AI backend' (provider concept)
- _provider_runtime.py / bedrock_proxy.py backend_model (LLM model)
- _sandbox.py 'build backends' (Python packaging)

* docs: update for v0.4 — Rollout as canonical name, add Sandbox/Rubric/Adapter docs

- README: 'Sandbox backends' → 'Sandboxes', table says Rollout not Trial
- concepts.md: Environment no longer says 'Backed by Harbor', Trial → Rollout
  as core primitive, lifecycle diagram uses Rollout.run()
- getting-started.md: uses RolloutConfig, links to rollout lifecycle
- python-api.md: RolloutConfig (aliased as TrialConfig), Rollout (aliased as
  Trial), canonical imports shown first, added v0.4 types section with Sandbox
  protocol, Rubric + RewardFunc, Adapters, Evaluation, backward-compat aliases

* fix: correct Rubric/RewardFunc API examples in python-api.md

Fixes 3 issues flagged by Devin Review:
- StringMatchRewardFunc: remove non-existent 'field' parameter
- Rubric: use reward_funcs + weights instead of items=[(func, weight)]
- rubric.score: takes rollout_dir: Path, not sandbox=

* fix: correct evaluation.py docstring aliases, clean stale comment in __init__.py

* docs: remove Harbor migration references

- use-cases.md: remove 'How it works vs Harbor' and 'Migration from Harbor' sections
- scene-patterns.ipynb: remove 'Mapping to Harbor PR #1462 Concepts' cell
- coder-reviewer-demo.py: 'Harbor-format task directory' → 'BenchFlow task directory'
- swebench notebook: remove Harbor #1316 reference from intro

* CLI: unify sandbox flag (--env → --sandbox, -b → -e) [ENG-56] (#281)

* cli: unify sandbox flag --env → --sandbox, -b → -e (ENG-56)

Standardize all CLI commands to use --sandbox/-e for the sandbox
parameter. Previously bench run used --sandbox/-b while bench eval
create and others used --env/-e.

- bench run: -b → -e
- bench eval create/run/compare: --env → --sandbox
- bench eval progress: --env → --sandbox
- bench skills eval: --env → --sandbox
- bench environment create: -b → -e
- docs/examples: updated flag references

* cli: drop -e/-b short flags, use --sandbox only (ENG-56)

Remove all short flags (-e, -b) for the sandbox parameter across all
CLI commands. The long flag --sandbox is the only way to specify the
sandbox now.

Updated 15 files: CLI definitions, help text examples, README,
docs, skill files, integration test runner, and test examples.

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

* cli: deprecate bench run in favor of bench eval create (ENG-57) (#282)

Per ENG-46, bench eval create is the canonical CLI entry point.
bench run goes through the old SDK shim and is now deprecated,
matching the pattern used for bench job and other legacy commands.

Python API bf.run() is unaffected (backward-compat alias stays).

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

* feat(ENG-55): first-class LLM-as-judge verifier with dense reward (#277)

* feat(ENG-55): first-class LLM-as-judge verifier with dense reward

- Add rubric_config.py: Pydantic models for rubric.toml (Criterion,
  JudgeConfig, ScoringConfig, RubricConfig) with binary/likert/numeric
  criterion types and score normalization
- Add llm.py: Multi-provider LLM routing (Anthropic/OpenAI/Gemini)
  with call_judge(), parse_verdict(), exponential backoff retry
- Add file_readers.py: Document extraction (pdf/docx/xlsx/pptx/txt)
  with find_deliverables() for rollout directory scanning
- Replace placeholder LLMJudgeRewardFunc in builtins.py with full
  rubric-based judge: per-criterion scoring, prompt templates,
  configurable aggregation (weighted_mean/all_pass/any_pass/threshold),
  backward-compatible legacy mode
- Emit per-criterion dense RewardEvents for fine-grained reward signals
- Write evaluation_details.json alongside rollout for transparency
- Re-export Criterion, JudgeConfig, RubricConfig, ScoringConfig,
  load_rubric_toml from rewards and top-level __init__
- Add test_rubric_config.py (15 tests) and test_llm_judge.py (30 tests)

Closes ENG-55

* fix(ENG-55): async clients, asyncio.sleep, and score mismatch bugs

- Replace sync SDK clients with async variants (AsyncAnthropic, AsyncOpenAI, client.aio for Google)
- Replace time.sleep() with await asyncio.sleep() in call_judge retry loop
- Pass actual aggregated score to _write_details instead of recomputing n_passed/total

* docs(ENG-55): add LLM-as-judge verifier documentation

- Add docs/llm-judge.md: full user-facing guide with rubric.toml reference,
  criterion types, aggregation strategies, dense reward events, multi-provider
  routing, file discovery, inline criteria, and worked examples
- Add src/benchflow/rewards/README.md: module-level guide with usage examples
- Update docs/concepts.md: reference LLM judge in Verifier primitive and
  add to 'Where to go next' links

* docs: polish — remove ENG-55 ticket ref, clarify rubric.json exclusion

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

* CLI: remove all short flags, use full names only [ENG-74] (#284)

* CLI: remove all short flags, use full names only [ENG-74]

Remove all single-letter short flags (-a, -m, -o, -t, -c, -f, -s, -p, -b, -d)
from every CLI command. Full flag names only (--agent, --model, --jobs-dir,
--tasks-dir, --concurrency, --config, --skills-dir, --prompt, --benchmark, --dir).

Updated across 23 files:
- src/benchflow/cli/main.py: all typer.Option definitions + docstring examples
- All docs/ (getting-started, running-benchmarks, cli.md, skill-eval, integration-tests, examples/)
- README.md
- .claude/skills/ (SKILL.md + references)
- tests/ (integration/run.sh, test_codex_custom_provider.sh, conformance/README.md)
- benchmarks/ (models-as-skills.md, harvey-lab/run_harvey_lab.py)
- Test fix: test_skill_eval_dryrun.py invocation updated from -a to --agent

* fix: replace remaining -f with --config in task-embedded SKILL.md files

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

* Backport PR #230 and #242 fixes to refactor/v0.4 (#286)

* fix: apply PR #230 (skills double-deploy) and PR #242 (shell injection) fixes to v0.4

PR #230: pass effective task path to deploy_skills, set effective_skills in already-injected branch
PR #242: shlex.quote outbox file paths in _scene.py, harden _snapshot.py with path validation

* fix: set effective_skills to /skills when Dockerfile already injected

When deploy_skills detects the Dockerfile already bakes in
`COPY _deps/skills /skills/`, the runtime upload is skipped but
effective_skills was left as task.config.environment.skills_dir
(often None), causing the subsequent linking step to silently
skip distribution to agent-specific paths like
/home/agent/.agents/skills. Sandbox users relied on that runtime
linking since Dockerfile injection only links under /root/.

Tighten the regression test to assert env.exec was called with the
expected `ln -sfn` command.

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>

* fix locally

* release: 0.3.4

* feat: trace import — `bench tasks generate` from Claude Code + opentraces (#278)

* feat: trace import prototype — Claude Code + opentraces → BenchFlow tasks

Add benchflow.traces package for personal benchmark curation from agent traces:

- parsers: parse Claude Code JSONL sessions and opentraces v0.1-v0.3 records
- models: format-agnostic ParsedTrace intermediate representation
- task_gen: generate task.toml + instruction.md + test.sh from traces
- huggingface: download and parse trace datasets from HuggingFace Hub
- local: discover and parse local ~/.claude/projects/ sessions
- CLI: bench import {local,file,hf,list-datasets} subcommands

44 new tests, all passing.

* fix: address Devin Review findings

- _cache_dir: fall back to cwd instead of / when outside git repo
- HF cache key: include max_rows in filename to avoid stale data
- parse_claude_code_session: deterministic trace_id from session+filename
- _build_test_sh: use shlex.quote() to prevent shell injection

* refactor: align trace import CLI with noun-verb philosophy

- Replace 'bench import {local,file,hf}' with 'bench tasks generate --from-{local,file,hf}'
- Replace 'bench import list-datasets' with 'bench tasks list-sources'
- Integrate trace commands into existing tasks_app via register_tasks_generate()
- Update CLI docs with new command reference
- Follows BenchFlow resource-verb pattern: bench <resource> <verb>

* fix: address Devin Review — limit flag + format name mismatch

- Apply --limit across all sources (was only used for --from-local)
- Map 'claude-code' → 'claude-messages' in _load_hf so --format claude-code
  works correctly with --from-hf

* docs: dogfood bench tasks generate in task-authoring and getting-started guides

* fix: generated tasks now pass bench tasks check — add Dockerfile, tests/ dir, reward path, difficulty scaling, TOML safety

* audit: fix task pipeline issues + normalize docstrings across traces package

Pipeline fixes:
- Fix timeout auto-scaling: batch generator default changed from 300 to 0
  (300 > 0 prevented difficulty-based scaling from ever triggering)
- Add [task] name section to generated task.toml (trace-import/<slug>)
- Add build_timeout_sec and storage_mb to [environment] section
- Fix test.sh indentation (remove extra leading spaces from textwrap)

Docstring normalization:
- Add missing docstrings to CLI helper functions (_load_local, _load_file,
  _load_hf)
- Improve docstring clarity across traces package (detect_format, print
  helpers, HF parsers)
- Use reST code block syntax for CLI examples

48/48 tests pass, lint clean.
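
The timeout bug above comes down to a sentinel check: if the default is already positive, the "use difficulty-based scaling" branch can never run. A sketch under assumed values follows; only the 1200s figure for hard tasks appears in this log, and the other numbers and the `resolve_timeout` name are hypothetical.

```python
# Hypothetical difficulty table; only "hard" = 1200 is taken from the log.
DIFFICULTY_TIMEOUTS = {"easy": 300, "medium": 600, "hard": 1200}


def resolve_timeout(difficulty: str, timeout_sec: int = 0) -> int:
    """Return an explicit timeout if given, else scale by difficulty."""
    if timeout_sec > 0:
        # An explicit positive value always wins. With a default of 300,
        # this branch was always taken and scaling never happened.
        return timeout_sec
    return DIFFICULTY_TIMEOUTS.get(difficulty, 600)
```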

* fix: multi-session JSONL collapse, outcome detection, zero-tool-call filter

Dogfooding revealed 3 quality issues:

1. Critical: parse_claude_code_session() merged all sessions in a
   multi-session JSONL file into one trace, contaminating difficulty,
   instruction, and verifier. Added parse_claude_code_file() that
   splits by sessionId before parsing each group.

2. Outcome detection missed common completion verbs (fixed, refactored,
   built, created, updated, implemented, added).

3. Traces with zero tool calls (pure explanations) produced useless
   tasks with pass-through verifiers. Now filtered out in batch mode.

Before: 5-session file → 1 task (hard, 1200s, 19 files listed)
After:  5-session file → 4 tasks (easy/medium, correct files each)

Tests: 55 passing (was 48), lint clean.
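
The first fix above, splitting a multi-session file by sessionId before parsing, can be sketched as a grouping pass over the JSONL lines. The function name `split_sessions` and the `"unknown"` fallback key are assumptions; this is not the body of parse_claude_code_file().

```python
import json
from typing import Any


def split_sessions(jsonl_text: str) -> dict[str, list[dict[str, Any]]]:
    """Group JSONL records by their sessionId, preserving first-seen order."""
    groups: dict[str, list[dict[str, Any]]] = {}
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        session_id = record.get("sessionId", "unknown")
        groups.setdefault(session_id, []).append(record)
    return groups
```

Each group can then be parsed on its own, so difficulty, instruction, and verifier are derived from a single session.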

* fix: test.sh file check cap matches instruction.md (10→20)

* fix: real-trace dogfood — path relativization, session artifact cleanup, HF parser robustness

- Add _relativize_path() to convert absolute workspace paths to relative project paths
- Add _clean_user_prompt() to strip session continuation boilerplate
- HF parser: support messages_json key, strip system-reminders, handle tool_result blocks, infer outcome
- CLI: add claude-messages format detection and routing for --from-file
- Add _parse_claude_requests_row() for cc-traces-weka metadata format
- All 55 tests pass, lint clean

* fix: verifier robustness — glob patterns for timestamp-bearing paths

Paths with date/timestamp segments (e.g. migrations/2025-11-28-131040_create_invoices/up.sql)
now use compgen -G glob patterns in test.sh instead of exact [ -f ] checks.
This lets verifiers tolerate agent-generated timestamp variants.

Also updates instruction.md to show globbed paths for dynamic segments.

Adds 10 new tests for _globify_path, _has_dynamic_segments, and
end-to-end verifier pattern selection.
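
A rough sketch of the path-globbing idea, assuming that a dynamic segment is a `YYYY-MM-DD` date optionally followed by a 4 to 6 digit time. The regex and the name `globify_path_sketch` are illustrative assumptions; the actual _globify_path and _has_dynamic_segments helpers may use different rules.

```python
import re

# Assumed pattern: a date, optionally followed by a separator and a time.
_DYNAMIC = re.compile(r"\d{4}-\d{2}-\d{2}(?:[-_T]?\d{4,6})?")


def has_dynamic_segments_sketch(path: str) -> bool:
    """True when any part of the path looks like a date or timestamp."""
    return _DYNAMIC.search(path) is not None


def globify_path_sketch(path: str) -> str:
    """Replace date/timestamp runs in each path segment with '*'."""
    return "/".join(_DYNAMIC.sub("*", segment) for segment in path.split("/"))
```

The resulting pattern is the kind of string that a `compgen -G` check in test.sh could take in place of an exact `[ -f ]` path.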

* fix: real-trace parser — user_prompt+gitdiff format, git context, git-diff verifier

- Handle cc-traces-merged rows with user_prompt + gitdiff (no messages_json)
- Extract TASK DESCRIPTION from structured prompts, strip CRITICAL INSTRUCTIONS boilerplate
- Extract git repo/commit from prompt, populate GitContext
- Dockerfile clones repo at base commit when git context available
- Verifier uses git diff to check files were modified (not just exist)
- Fix source_model='None' string bug

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

* refactor: BenchFlow v0.4 — RL-first terminology, module consolidation, ACPX integration (#288)

* refactor: BenchFlow v0.3.4 — unified types, Rollout, Sandbox protocol, rewards, adapters (#274)

* refactor: add Sandbox protocol + Harbor adapters (ENG-48 Phase A)

Create BenchFlow-native Sandbox protocol as parallel types alongside
existing Harbor imports. No behavioral changes — existing code continues
to use Harbor directly.

New files:
- src/benchflow/sandbox/protocol.py: ExecResult, Sandbox, ImageRef,
  ImageConfig, ImageBuilder
- src/benchflow/sandbox/docker.py: DockerSandbox adapter wrapping
  Harbor DockerEnvironment
- src/benchflow/sandbox/daytona.py: DaytonaSandbox adapter wrapping
  Harbor DaytonaEnvironment
- tests/test_sandbox_protocol.py: 26 tests covering dataclasses,
  protocol conformance, and delegation

* refactor: unify Scene/Role/Turn types into _types.py (ENG-47)

Create src/benchflow/_types.py as the single canonical source for the
declarative Role, Scene, and Turn dataclasses.

Changes:
- New _types.py with Role (adds timeout_sec, idle_timeout_sec,
  skills_dir), Scene (adds parallel_group), and Turn
- trial.py imports from _types.py instead of defining its own copies
- _scene.py renames its internal Role to SceneRole (different fields:
  instruction, tools) with backward-compat alias
- __init__.py re-exports canonical types from _types.py; adds
  TrialRole/TrialScene backward-compat aliases
- runtime.py and trial_yaml.py updated to import from _types.py

New fields default to None — no runtime behavior changes.

* fix: address review — shlex.quote in read_file, cleanup temp in write_file

- Use shlex.quote(path) in read_file to prevent shell injection
- Clean up temp files with os.unlink in write_file finally block
- Move imports to module level

* fix: read_file error checking + export ImageConfig from top-level

- read_file now raises FileNotFoundError on non-zero return code
- Export ImageConfig alongside ImageBuilder in __init__.py
- Add tests for read_file error behavior

* refactor: kill shim layers, single Rollout execution path (ENG-46)

- Rename trial.py → rollout.py, Trial → Rollout, TrialConfig → RolloutConfig
- Rename RunResult → RolloutResult in models.py
- Create evaluation.py from job.py: Job → Evaluation, JobConfig → EvaluationConfig, JobResult → EvaluationResult
- Create _run.py with bf.run(RolloutConfig) → RolloutResult entry point
- Replace sdk.py with thin backward-compat shim delegating to rollout.py
- Replace trial.py with re-export shim from rollout.py
- Replace job.py with re-export shim from evaluation.py
- Preserve runtime.py API, update internal imports to use Rollout
- Update __init__.py: new public API + backward-compat aliases
- Update test patch targets from benchflow.trial → benchflow.rollout
- All 841 tests pass, lint clean, pre-existing typecheck errors unchanged

* style: apply ruff format to pass CI format check

* fix: resolve ty typecheck errors (Any for kwargs and sentinel)

* feat: composable Rubric + RewardFunc protocol (ENG-49)

- Create src/benchflow/rewards/ package with:
  - RewardFunc protocol (single scoring dimension)
  - Rubric dataclass (weighted collection of RewardFuncs)
  - VerifyResult dataclass (aggregated scoring result)
  - RewardEvent dataclass (dense/terminal reward signals)
  - Built-in funcs: TestRewardFunc, LLMJudgeRewardFunc,
    StringMatchRewardFunc, CodeExecRewardFunc
- Add reward_events field to RunResult (additive, no breaking changes)
- Re-export all reward types from benchflow top-level __init__.py
- Backward compat: Rubric([TestRewardFunc()]) wraps existing
  test.sh -> reward.txt flow
- 31 tests covering all reward types, rubric scoring, weights,
  error handling, protocol conformance, and re-exports

* style: format rewards package and tests with ruff

* fix: disambiguate Rubric.items keys when multiple funcs share a class name

Append _N suffix on collision so all per-func scores are preserved.
Addresses Devin Review feedback on PR #266.

* feat: per-agent capabilities + agent-as-tool infrastructure (ENG-50)

- Add Role.capabilities field for declarative agent capability lists
- Add Sandbox.expose_ports property for inter-agent communication ports
- Document capability boundary: BenchFlow provides sandbox + instruction +
  observation; agents own loops, tool protocols, and agent-as-tool
- Update _scene.py and _types.py module docstrings with ENG-50 boundary
- Add 16 tests covering capabilities, expose_ports, and Coder→Reviewer→Coder
  turn sequences

* style: apply ruff format to sandbox adapters and tests

* feat: external adapters for Inspect AI + ORS (ENG-51)

Add benchflow.adapters package with thin format converters:
- InspectAdapter / to_inspect_task: Scene + Rubric -> Inspect AI task dict
- ORSAdapter / to_ors_reward: VerifyResult + RewardEvent -> ORS format dict

No external dependencies required — these are pure format converters.
Re-exported from benchflow top-level __init__.py.
14 tests covering both adapters, convenience functions, and round-trips.

* style: fix ruff format for adapters and tests

* fix: convention fixes + trial shim re-exports + __getattr__ narrowing

- Parametrize bare dict returns to dict[str, Any] in adapters
- Remove unnecessary @dataclass on adapter classes (no fields)
- Add from __future__ import annotations to evaluation.py
- Fix __getattr__ to re-raise ModuleNotFoundError for broken deps
- Add private helper re-exports to trial.py shim for backward compat
- Convert test_save_trajectory to async (fixes event loop deprecation)

* rename: backend → sandbox across API, CLI, docs, and tests

Aligns naming with the v0.4 Sandbox protocol (ENG-48). The
'backend' parameter/flag/attribute on Environment and the CLI is
now 'sandbox' everywhere.

Scope:
- runtime.py: Environment.__init__(sandbox=), .sandbox, from_task(sandbox=)
- cli/main.py: --sandbox flag (was --backend), help text
- _snapshot.py: docstring updates
- tests/test_runtime.py: updated assertions
- docs/: all references in task-authoring, CLI ref, Python API ref,
  running-benchmarks, integration-tests, examples

NOT renamed (different meaning):
- _credentials.py 'Vertex AI backend' (provider concept)
- _provider_runtime.py / bedrock_proxy.py backend_model (LLM model)
- _sandbox.py 'build backends' (Python packaging)

* docs: update for v0.4 — Rollout as canonical name, add Sandbox/Rubric/Adapter docs

- README: 'Sandbox backends' → 'Sandboxes', table says Rollout not Trial
- concepts.md: Environment no longer says 'Backed by Harbor', Trial → Rollout
  as core primitive, lifecycle diagram uses Rollout.run()
- getting-started.md: uses RolloutConfig, links to rollout lifecycle
- python-api.md: RolloutConfig (aliased as TrialConfig), Rollout (aliased as
  Trial), canonical imports shown first, added v0.4 types section with Sandbox
  protocol, Rubric + RewardFunc, Adapters, Evaluation, backward-compat aliases

* fix: correct Rubric/RewardFunc API examples in python-api.md

Fixes 3 issues flagged by Devin Review:
- StringMatchRewardFunc: remove non-existent 'field' parameter
- Rubric: use reward_funcs + weights instead of items=[(func, weight)]
- rubric.score: takes rollout_dir: Path, not sandbox=

* fix: correct evaluation.py docstring aliases, clean stale comment in __init__.py

* docs: remove Harbor migration references

- use-cases.md: remove 'How it works vs Harbor' and 'Migration from Harbor' sections
- scene-patterns.ipynb: remove 'Mapping to Harbor PR #1462 Concepts' cell
- coder-reviewer-demo.py: 'Harbor-format task directory' → 'BenchFlow task directory'
- swebench notebook: remove Harbor #1316 reference from intro

* CLI: unify sandbox flag (--env → --sandbox, -b → -e) [ENG-56] (#281)

* cli: unify sandbox flag --env → --sandbox, -b → -e (ENG-56)

Standardize all CLI commands to use --sandbox/-e for the sandbox
parameter. Previously bench run used --sandbox/-b while bench eval
create and others used --env/-e.

- bench run: -b → -e
- bench eval create/run/compare: --env → --sandbox
- bench eval progress: --env → --sandbox
- bench skills eval: --env → --sandbox
- bench environment create: -b → -e
- docs/examples: updated flag references

* cli: drop -e/-b short flags, use --sandbox only (ENG-56)

Remove all short flags (-e, -b) for the sandbox parameter across all
CLI commands. The long flag --sandbox is the only way to specify the
sandbox now.

Updated 15 files: CLI definitions, help text examples, README,
docs, skill files, integration test runner, and test examples.

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

* cli: deprecate bench run in favor of bench eval create (ENG-57) (#282)

Per ENG-46, bench eval create is the canonical CLI entry point.
bench run goes through the old SDK shim and is now deprecated,
matching the pattern used for bench job and other legacy commands.

Python API bf.run() is unaffected (backward-compat alias stays).

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

* feat(ENG-55): first-class LLM-as-judge verifier with dense reward (#277)

* feat(ENG-55): first-class LLM-as-judge verifier with dense reward

- Add rubric_config.py: Pydantic models for rubric.toml (Criterion,
  JudgeConfig, ScoringConfig, RubricConfig) with binary/likert/numeric
  criterion types and score normalization
- Add llm.py: Multi-provider LLM routing (Anthropic/OpenAI/Gemini)
  with call_judge(), parse_verdict(), exponential backoff retry
- Add file_readers.py: Document extraction (pdf/docx/xlsx/pptx/txt)
  with find_deliverables() for rollout directory scanning
- Replace placeholder LLMJudgeRewardFunc in builtins.py with full
  rubric-based judge: per-criterion scoring, prompt templates,
  configurable aggregation (weighted_mean/all_pass/any_pass/threshold),
  backward-compatible legacy mode
- Emit per-criterion dense RewardEvents for fine-grained reward signals
- Write evaluation_details.json alongside rollout for transparency
- Re-export Criterion, JudgeConfig, RubricConfig, ScoringConfig,
  load_rubric_toml from rewards and top-level __init__
- Add test_rubric_config.py (15 tests) and test_llm_judge.py (30 tests)

Closes ENG-55
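
As a rough sketch of how criterion normalization and the four aggregation names listed above could fit together: the commit does not spell out the semantics, so the 1 to 5 default scale, the 0.5 pass mark, and the function names `normalize` and `aggregate` are assumptions made for illustration and may differ from `rubric_config.py` and `builtins.py`.

```python
def normalize(kind: str, raw: float, lo: float = 1.0, hi: float = 5.0) -> float:
    """Map a raw criterion score to [0, 1] (assumed scheme)."""
    if kind == "binary":
        return 1.0 if raw else 0.0
    if kind in ("likert", "numeric"):
        # Linear rescale from [lo, hi], clamped to the unit interval.
        return min(1.0, max(0.0, (raw - lo) / (hi - lo)))
    raise ValueError(f"unknown criterion type: {kind}")


def aggregate(
    scores: list[float],
    weights: list[float],
    strategy: str,
    pass_mark: float = 0.5,
) -> float:
    """Combine normalized per-criterion scores into a single reward."""
    mean = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
    if strategy == "weighted_mean":
        return mean
    passed = [s >= pass_mark for s in scores]
    if strategy == "all_pass":
        return float(all(passed))
    if strategy == "any_pass":
        return float(any(passed))
    if strategy == "threshold":
        # Assumed meaning: pass when the weighted mean reaches the mark.
        return float(mean >= pass_mark)
    raise ValueError(f"unknown aggregation strategy: {strategy}")
```

Under these assumptions, each normalized per-criterion score is also what a dense reward event would carry, while `aggregate` produces the single final reward.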

* fix(ENG-55): async clients, asyncio.sleep, and score mismatch bugs

- Replace sync SDK clients with async variants (AsyncAnthropic, AsyncOpenAI, client.aio for Google)
- Replace time.sleep() with await asyncio.sleep() in call_judge retry loop
- Pass actual aggregated score to _write_details instead of recomputing n_passed/total
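
The retry change above is easiest to see in isolation. The following is a generic sketch of an async retry loop with exponential backoff, not the actual `call_judge` implementation; `retry_async` and its parameters are hypothetical names.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


async def retry_async(
    call: Callable[[], Awaitable[T]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Await call(), retrying failures with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return await call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # asyncio.sleep yields to the event loop; time.sleep would
            # block every other coroutine for the whole delay.
            await asyncio.sleep(base_delay * 2**attempt)
    raise ValueError("max_attempts must be at least 1")
```

With `time.sleep()` in that loop, one judge call backing off would stall every concurrent rollout sharing the event loop, which is presumably why the fix pairs the sleep change with the switch to async SDK clients.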

* docs(ENG-55): add LLM-as-judge verifier documentation

- Add docs/llm-judge.md: full user-facing guide with rubric.toml reference,
  criterion types, aggregation strategies, dense reward events, multi-provider
  routing, file discovery, inline criteria, and worked examples
- Add src/benchflow/rewards/README.md: module-level guide with usage examples
- Update docs/concepts.md: reference LLM judge in Verifier primitive and
  add to 'Where to go next' links

* docs: polish — remove ENG-55 ticket ref, clarify rubric.json exclusion

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

* CLI: remove all short flags, use full names only [ENG-74] (#284)

* CLI: remove all short flags, use full names only [ENG-74]

Remove all single-letter short flags (-a, -m, -o, -t, -c, -f, -s, -p, -b, -d)
from every CLI command. Full flag names only (--agent, --model, --jobs-dir,
--tasks-dir, --concurrency, --config, --skills-dir, --prompt, --benchmark, --dir).

Updated across 23 files:
- src/benchflow/cli/main.py: all typer.Option definitions + docstring examples
- All docs/ (getting-started, running-benchmarks, cli.md, skill-eval, integration-tests, examples/)
- README.md
- .claude/skills/ (SKILL.md + references)
- tests/ (integration/run.sh, test_codex_custom_provider.sh, conformance/README.md)
- benchmarks/ (models-as-skills.md, harvey-lab/run_harvey_lab.py)
- Test fix: test_skill_eval_dryrun.py invocation updated from -a to --agent

* fix: replace remaining -f with --config in task-embedded SKILL.md files

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

* Backport PR #230 and #242 fixes to refactor/v0.4 (#286)

* fix: apply PR #230 (skills double-deploy) and PR #242 (shell injection) fixes to v0.4

PR #230: pass effective task path to deploy_skills, set effective_skills in already-injected branch
PR #242: shlex.quote outbox file paths in _scene.py, harden _snapshot.py with path validation
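
For context, the shell-injection hardening relies on `shlex.quote` from the standard library. The helper below is a hypothetical stand-in for the command construction in `_scene.py`, shown only to illustrate the technique.

```python
import shlex


def build_read_command(path: str) -> str:
    """Return a shell command that reads `path`, safe against injection."""
    # shlex.quote wraps the path in single quotes when it contains shell
    # metacharacters, so the shell treats it as one literal argument.
    return f"cat {shlex.quote(path)}"
```

A file name such as `a.txt; rm -rf /` then reaches `cat` as a single argument instead of being parsed as a second command.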

* fix: set effective_skills to /skills when Dockerfile already injected

When deploy_skills detects the Dockerfile already bakes in
`COPY _deps/skills /skills/`, the runtime upload is skipped but
effective_skills was left as task.config.environment.skills_dir
(often None), causing the subsequent linking step to silently
skip distribution to agent-specific paths like
/home/agent/.agents/skills. Sandbox users relied on that runtime
linking since Dockerfile injection only links under /root/.

Tighten the regression test to assert env.exec was called with the
expected `ln -sfn` command.

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>

* fix locally

* release: 0.3.4

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>

* refactor: remove Harbor references, purge backward-compat aliases in tests, update terminology

- Remove 'terminal-bench' from pyproject.toml keywords
- Rename _make_harbor_mock → _make_sandbox_mock in test files
- Update Harbor patch paths to sandbox equivalents in tests
- Rename _from_harbor_yaml → _from_legacy_yaml in evaluation.py
- Update docstrings/comments: Harbor → BenchFlow/legacy terminology
- Internalize Harbor types into benchflow.task/ subpackage
- Create sandbox backends: base.py, docker_impl.py, daytona_impl.py, modal_impl.py
- Add compose/ subpackage with docker-compose YAML templates

* refactor: modernize file structure — align with test-drive patterns

- sandbox/: merge docker_impl→docker, daytona_impl→daytona, base→_base
- sandbox/: move _sandbox→lockdown, _snapshot→snapshot, _daytona_patches→_sdk_ops
- sandbox/: move process, user, environments→services into sandbox/
- sandbox/: restructure compose/ → _compose.py + _compose_files/
- agents/: move _credentials→credentials, _agent_env→env, _agent_setup→install
- _utils/: create subpackage with yaml_loader, benchmark_repos, task_authoring
- trajectories/: move viewer, _trajectory→_capture into subpackage
- experimental/: move mcp/ into experimental/ subpackage
- __init__.py: purge backward-compat aliases (Trial, Job, etc.)
- Update all imports across src/, tests/, experiments/
- Run ruff check + format (0 errors)

* refactor: purge backward-compat aliases + fix review bugs

- Remove Trial=Rollout, TrialConfig=RolloutConfig from rollout.py
- Remove Job=Evaluation, JobConfig=EvaluationConfig, JobResult=EvaluationResult from evaluation.py
- Delete shim files: trial.py, job.py
- Update all imports: benchflow.trial → benchflow.rollout, benchflow.job → benchflow.evaluation
- Rename classes: Trial→Rollout, Job→Evaluation across tests/, benchmarks/, experiments/
- Fix runtime.py: environment_type→sandbox_type, trial_name→rollout_name (review bug)
- Fix conformance scripts: same keyword arg fixes (review bug)
- Clean up __all__ and return type annotations

* docs: update terminology to v0.4 (Trial→Rollout, Job→Evaluation) + bump version to 0.4.0

* chore: update uv.lock for v0.4.0

* refactor: modernize RL terminology — trial_dir→rollout_dir, trial_name→rollout_name across codebase

* style: format viewer.py

* fix: exclude optional-dep sandbox files from ty check

* fix: configure ty to ignore unresolved-import + exclude optional-dep files

* fix: resolve all test failures — patch paths, protocol conformance, RL terminology

* fix: resolve merge conflicts, update broken imports and stale references

- Resolve merge conflict markers in rollout.py and evaluation.py
- Update task_download imports to benchflow._utils.benchmark_repos (11 files)
- Fix benchflow.tasks → benchflow._utils.task_authoring import
- Update scene-patterns.ipynb: TrialConfig→RolloutConfig, trial→rollout
- Fix Job→Evaluation rename in docs
- Ruff auto-fixes (import ordering)

* style: format 12 files to pass ruff format check

* fix: resolve ty check errors in traces/ — dict[str, object] → dict[str, Any]

* fix: add skip guards for optional dependency tests (daytona, modal)
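
A skip guard of this kind can be written with the standard library alone. This is a sketch, not the guard used in the test suite: `has_module` is a hypothetical helper, and the import name `daytona` is assumed from the dependency name. With pytest, `pytest.importorskip("daytona")` achieves the same effect.

```python
import importlib.util
import unittest


def has_module(name: str) -> bool:
    """Report whether a top-level module can be imported, without importing it."""
    return importlib.util.find_spec(name) is not None


@unittest.skipUnless(has_module("daytona"), "daytona SDK not installed")
class DaytonaSandboxTests(unittest.TestCase):
    def test_sdk_importable(self) -> None:
        # Only runs when the optional dependency is present.
        import daytona  # noqa: F401
```

The tests are then reported as skipped rather than failing at import time on machines without the optional dependency groups.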

* fix: update stale imports and references in docs/benchmarks (Devin Review)

* refactor: consolidate underscore modules into proper subpackages

- _acp_run.py → acp/runtime.py
- _env_setup.py → sandbox/setup.py
- _provider_runtime.py → providers/runtime.py
- _scoring.py → _utils/scoring.py
- _scene.py → scenes.py (removed Role=SceneRole backward-compat alias)
- Fix GEMINI_API_KEY auto-detection: inherit from os.environ as fallback after .env

* feat: ACPX integration — acpx/<agent> protocol for headless ACP agent invocation

- Add 'acpx' as a valid protocol in parse_agent_spec (acpx/claude, acpx/codex, etc.)
- _acpx_wrap() decorates any registered agent to launch via acpx CLI
- Installs acpx alongside the underlying agent in the sandbox
- Preserves all agent env, credentials, and skill_paths
- Replace harbor protocol references with acpx in tests
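
To illustrate the `acpx/<agent>` spec format, here is a minimal parser sketch. It is not the real `parse_agent_spec`: the `AgentSpec` type, the `acp` default for bare agent names, and the rejection of unknown protocols are assumptions made for the example.

```python
from typing import NamedTuple

# Assumed protocol set; the real registry may accept others.
KNOWN_PROTOCOLS = {"acp", "acpx"}


class AgentSpec(NamedTuple):
    protocol: str
    agent: str


def parse_spec(spec: str, default_protocol: str = "acp") -> AgentSpec:
    """Split 'protocol/agent' into its parts; bare names get the default."""
    protocol, sep, agent = spec.partition("/")
    if not sep:
        # No slash: the whole string is the agent name.
        return AgentSpec(default_protocol, spec)
    if protocol not in KNOWN_PROTOCOLS:
        raise ValueError(f"unknown protocol: {protocol!r}")
    if not agent:
        raise ValueError(f"missing agent name in spec: {spec!r}")
    return AgentSpec(protocol, agent)
```

Under these assumptions, `acpx/claude` and `acpx/codex` select the same registered agents as the bare names, and only the launch protocol differs.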

* style: format registry.py for ruff format check

* fix: suppress ty invalid-assignment for monkey-patch in _patch_docker_dind

* docs: update docs and SKILL.md for v0.4 — RL-first terminology, ACPX, purge backward-compat

* fix: rename trial_config_from_yaml → rollout_config_from_yaml, fix stale _run import in sdk.py

* fix: docs inconsistencies found by dogfooding — GEMINI_API_KEY, daytona install note

* fix: docs path evaluations/ → jobs/ to match actual output directory

* fix: ruff import ordering in sdk.py

* fix: ty type-check suppression for sdk.py run() return type

* feat: add sandbox-daytona and sandbox-modal optional dependency groups

* chore: update uv.lock for sandbox-daytona and sandbox-modal optional deps

---------

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>

* skill eval -> refactor (#293)

* fix: address v0.4 dogfood blockers

* test: cover skill eval nested results

* chore: ignore playwright mcp artifacts

* fix: validate config eval agent protocol

* test: add adapter release evidence checker

* test: wire adapter evidence into release runner

* fix: address ENG-91 dogfood regressions

* fix: record role metadata and timeouts

* fix: align dogfood cli reports

* docs: document environment cleanup command

* docs: remove stale harbor task path examples

* Add citation-management skill eval under skills

* Update src/benchflow/cli/main.py

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>

---------

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>