feat(ENG-55): first-class LLM-as-judge verifier with dense reward by devin-ai-integration[bot] · Pull Request #277 · benchflow-ai/benchflow

devin-ai-integration · 2026-05-16T17:54:06Z

Summary

Replaces the placeholder LLMJudgeRewardFunc (which only read a pre-computed score from llm_judge_score.txt) with a full rubric-based LLM judge implementation. Adds dense reward emission (per-criterion RewardEvent) as an extension.

New modules:

rewards/rubric_config.py — Pydantic dataclasses for rubric.toml parsing: Criterion (binary/likert/numeric types with normalization), JudgeConfig, ScoringConfig, RubricConfig
rewards/llm.py — Multi-provider LLM routing (claude-* → Anthropic, gpt-*/o1*/o3* → OpenAI, gemini-* → Google) with exponential backoff retry and JSON verdict parsing
rewards/file_readers.py — Document text extraction (pdf, docx, xlsx, pptx, plain text) with find_deliverables() for rollout directory scanning
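
For reviewers who want a concrete picture of the JSON verdict parsing in rewards/llm.py, here is a minimal sketch of the approach described in the Notes: try fenced code blocks first, then fall back to brace matching. It is not the PR's `parse_verdict`; the name `parse_verdict_sketch`, the helper `_first_balanced_object`, and the `score`/`reason` keys in the usage are assumptions for illustration. The fallback here is a small scanner rather than a regex, so it tolerates nested braces and braces inside string values.

```python
from __future__ import annotations

import json
import re
from typing import Any

# Built dynamically so this example can itself sit inside a fenced block.
FENCE = "`" * 3
_FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def _first_balanced_object(text: str) -> str | None:
    """Return the first balanced {...} span, ignoring braces inside JSON strings."""
    start = text.find("{")
    while start != -1:
        depth = 0
        in_string = False
        escaped = False
        for i in range(start, len(text)):
            ch = text[i]
            if in_string:
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    return text[start : i + 1]
        # This opening brace never closed; try the next one.
        start = text.find("{", start + 1)
    return None


def parse_verdict_sketch(response: str) -> dict[str, Any] | None:
    """Hypothetical parser: fenced blocks first, then the brace-matching fallback."""
    candidates = [m.group(1) for m in _FENCE_RE.finditer(response)]
    balanced = _first_balanced_object(response)
    if balanced is not None:
        candidates.append(balanced)
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None
```

Returning `None` on unparseable output leaves the caller to decide whether a malformed verdict counts as a failed criterion or an error.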

Modified LLMJudgeRewardFunc:

Rubric mode (new): loads rubric.toml (explicit path, inline criteria list, or auto-discovery), evaluates each criterion individually via LLM, normalizes scores, aggregates via configurable strategy (weighted_mean, all_pass, any_pass, threshold)
Legacy mode (preserved): falls back to reading llm_judge_score.txt when no rubric is available
Emits a RewardEvent(type="dense") per criterion and writes evaluation_details.json to the rollout directory
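
The normalize-then-aggregate step can be sketched as follows. This is an illustration under stated assumptions, not the code in builtins.py: the `CriterionScore` container, the `pass_mark` parameter, and the exact semantics of `all_pass`, `any_pass`, and `threshold` are hypothetical. Only the strategy names, the "likert 3/5 gives 0.5" case, the "weight=3 pass + weight=1 fail gives 0.75" case, and "empty criteria returns 0.0" come from this PR's test results.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CriterionScore:
    """Hypothetical container for one judged criterion."""

    kind: str  # "binary", "likert", or "numeric"
    raw: float
    weight: float = 1.0
    low: float = 0.0  # scale minimum (a 1-5 likert scale would use low=1)
    high: float = 1.0  # scale maximum


def normalize(c: CriterionScore) -> float:
    """Map a raw score onto [0, 1]."""
    if c.kind == "binary":
        return 1.0 if c.raw >= 1 else 0.0
    if c.high <= c.low:
        raise ValueError("high must be greater than low")
    scaled = (c.raw - c.low) / (c.high - c.low)
    return min(1.0, max(0.0, scaled))  # clamp out-of-range judge output


def aggregate(
    scores: list[CriterionScore],
    strategy: str = "weighted_mean",
    threshold: float = 0.5,  # assumed cutoff for the "threshold" strategy
    pass_mark: float = 1.0,  # assumed normalized score that counts as a pass
) -> float:
    if not scores:
        return 0.0
    normalized = [normalize(c) for c in scores]
    if strategy == "weighted_mean":
        total_weight = sum(c.weight for c in scores)
        if total_weight <= 0:
            return 0.0
        return sum(n * c.weight for n, c in zip(normalized, scores)) / total_weight
    if strategy == "all_pass":
        return 1.0 if all(n >= pass_mark for n in normalized) else 0.0
    if strategy == "any_pass":
        return 1.0 if any(n >= pass_mark for n in normalized) else 0.0
    if strategy == "threshold":
        mean = aggregate(scores, "weighted_mean")
        return 1.0 if mean >= threshold else 0.0
    raise ValueError(f"unknown strategy: {strategy!r}")
```

With one passing binary criterion of weight 3 and one failing criterion of weight 1, `weighted_mean` yields 3/4 = 0.75, whereas a naive `n_passed / total` would report 0.5. That gap is the mismatch the Revision 2 `_write_details` fix addresses.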

Exports: Criterion, JudgeConfig, RubricConfig, ScoringConfig, load_rubric_toml added to rewards/__init__ and top-level benchflow/__init__.

Documentation:

docs/llm-judge.md — full user-facing guide: rubric.toml reference, criterion types (binary/likert/numeric), aggregation strategies, dense reward events, multi-provider routing, file discovery, inline criteria, and worked examples (Harvey LAB legal task)
src/benchflow/rewards/README.md — module-level developer guide
docs/concepts.md — updated Verifier primitive to reference LLM judge; added to "Where to go next"

Tests: 45 new tests — 15 for rubric config parsing/normalization, 30 for the judge pipeline (verdict parsing, rubric/inline/auto-discovery modes, aggregation strategies, dense events, error handling, evaluation details output). All existing 1008 tests continue to pass.

Updates since last revision

Revision 4 — Docs review & polish

Ran docs-review skill to dogfood all three doc files against the codebase. All code examples, API imports, default values, constructor signatures, cross-references, and links verified correct. Fixed two polish items:

Removed ENG-55 ticket reference from rewards/README.md (ticket-tracking language that rots for future readers)
Clarified rubric.json exclusion in docs/llm-judge.md as "internal metadata files" to avoid confusion with rubric.toml

Revision 3 — Documentation

Added user-facing docs (docs/llm-judge.md) and module-level README (src/benchflow/rewards/README.md). Updated docs/concepts.md to cross-reference the LLM judge from the Verifier primitive and the navigation links.

Revision 2 — Bug fixes (3)

_write_details score mismatch — _write_details previously computed its own n_passed / total score, diverging from the actual _aggregate() result when weights are non-uniform or non-binary criteria are used. Now receives the aggregated score as a parameter.
Blocking time.sleep() in async call_judge — replaced with await asyncio.sleep() so the retry backoff no longer blocks the event loop.
Sync SDK clients in async provider functions — _call_anthropic now uses AsyncAnthropic, _call_openai uses AsyncOpenAI, and _call_google uses client.aio.models.generate_content() instead of their synchronous counterparts.
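
To show why the `time.sleep()` fix matters, here is a small self-contained sketch of an async retry loop with exponential backoff. The helper `retry_with_backoff`, its `base_delay` parameter, and the `demo` coroutine are hypothetical names, not code from llm.py. The demo runs a background ticker alongside a flaky call; because the backoff awaits `asyncio.sleep()`, the ticker keeps advancing while the retry loop waits.

```python
import asyncio
import contextlib


async def retry_with_backoff(call, retries: int = 3, base_delay: float = 1.0):
    """Await call() up to `retries` times, sleeping base_delay * 2**attempt between tries."""
    last_error = None
    for attempt in range(retries):
        try:
            return await call()
        except Exception as exc:
            last_error = exc
            if attempt < retries - 1:
                # Non-blocking: other tasks on the event loop keep running.
                await asyncio.sleep(base_delay * 2**attempt)
    raise RuntimeError(f"failed after {retries} attempts") from last_error


async def demo():
    attempts = 0
    ticks = 0

    async def flaky():
        nonlocal attempts
        attempts += 1
        if attempts < 3:
            raise ConnectionError("transient failure")
        return "ok"

    async def ticker():
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    ticker_task = asyncio.create_task(ticker())
    result = await retry_with_backoff(flaky, retries=3, base_delay=0.05)
    ticker_task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await ticker_task
    return result, attempts, ticks


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Swapping the `await asyncio.sleep(...)` for `time.sleep(...)` would freeze the ticker for the whole backoff window, which is the behavior Revision 2 removed.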

Review & Testing Checklist for Human

Async provider API surfaces untested against real SDKs: All 45 tests mock call_judge. The AsyncAnthropic, AsyncOpenAI, and client.aio.models.generate_content() call signatures should be manually verified with at least one real provider (e.g. Claude) to confirm the async interfaces match expectations end-to-end.
JudgeConfig.reference and prompt_template are parsed but not wired: The rubric.toml reference docs list these fields, but _build_prompt and _rubric_score don't use them. Either wire them in or remove them from the docs to avoid misleading users.
Constructor default change: model default changed from "gemini-3.1-flash-lite" to "claude-sonnet-4-6" and prompt is now optional — verify no existing callers depend on the old defaults or required prompt parameter.
Auto-discovery path traversal: _load_rubric checks rollout_dir / ".." / "rubric.toml" — verify this parent-directory traversal is acceptable for your deployment trust model.
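
If the parent-directory lookup turns out to be too permissive for a given deployment, one possible mitigation is to resolve each candidate and confine it to an allowed root. The sketch below is an assumption-laden illustration: `discover_rubric`, the `allowed_root` parameter, and the candidate order are hypothetical, and only the `rollout_dir / ".." / "rubric.toml"` lookup is taken from this PR.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def discover_rubric(rollout_dir: Path, allowed_root: Path | None = None) -> Path | None:
    """Return the first rubric.toml found, optionally confined to allowed_root."""
    candidates = [
        rollout_dir / "rubric.toml",  # assumed first lookup
        rollout_dir / ".." / "rubric.toml",  # parent-directory lookup from the PR
    ]
    root = allowed_root.resolve() if allowed_root is not None else None
    for candidate in candidates:
        resolved = candidate.resolve()  # collapses ".." and follows symlinks
        if root is not None and not resolved.is_relative_to(root):
            continue  # candidate escapes the trusted tree
        if resolved.is_file():
            return resolved
    return None


def demo() -> tuple[bool, bool, bool]:
    with tempfile.TemporaryDirectory() as tmp:
        jobs = Path(tmp).resolve() / "jobs"
        rollout = jobs / "rollout-1"
        rollout.mkdir(parents=True)
        (jobs / "rubric.toml").write_text("# placeholder\n")
        unrestricted = discover_rubric(rollout) == jobs / "rubric.toml"
        blocked = discover_rubric(rollout, allowed_root=rollout) is None
        allowed = discover_rubric(rollout, allowed_root=jobs) == jobs / "rubric.toml"
        return unrestricted, blocked, allowed
```

Resolving before the containment check matters: comparing unresolved paths would let a symlinked rollout directory point the lookup outside the intended tree.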

Suggested test plan: Create a minimal rubric.toml with 2–3 criteria, point LLMJudgeRewardFunc(rubric_path=...) at it with a real API key, and run score() against a sample rollout directory. Verify evaluation_details.json is written with correct aggregated score and that dense RewardEvents are emitted per criterion.

Notes

File reader optional dependencies (pdfplumber, openpyxl, python-docx, markitdown) are not declared in pyproject.toml — they degrade gracefully with (unsupported: ...) messages. Consider adding them as an optional dependency group if you want PDF/DOCX support out of the box.
The parse_verdict JSON extractor uses regex-based brace matching as a fallback when code-fence extraction fails — works for well-formed responses but could be fragile with deeply nested or malformed LLM output.
No concurrency for multi-criterion rubrics: criteria are scored sequentially. For rubrics with many criteria, this will be slow. Acceptable for now?
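
On the sequential-scoring note: if concurrency is wanted later, a bounded `asyncio.gather` is one low-risk option. The sketch below is not part of this PR; `score_concurrently`, `max_concurrency`, and the fake judge are hypothetical, and a real version would wrap whatever coroutine scores a single criterion.

```python
import asyncio


async def score_concurrently(criteria, judge, max_concurrency: int = 4):
    """Score criteria in parallel; results come back in input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(criterion):
        # The semaphore caps simultaneous judge calls to respect provider rate limits.
        async with semaphore:
            return await judge(criterion)

    return await asyncio.gather(*(guarded(c) for c in criteria))


async def demo():
    in_flight = 0
    peak = 0

    async def fake_judge(criterion):
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stands in for the LLM round trip
        in_flight -= 1
        return f"verdict for {criterion}"

    criteria = [f"c{i}" for i in range(6)]
    results = await score_concurrently(criteria, fake_judge, max_concurrency=3)
    return results, peak


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

`asyncio.gather` preserves input order, so per-criterion dense events could still be emitted in rubric order. Passing `return_exceptions=True` would let one failed criterion be recorded without cancelling the rest.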

Link to Devin session: https://app.devin.ai/sessions/dcd907c049b6437499dba8a7280bef93
Requested by: @xdotli

- Add rubric_config.py: Pydantic models for rubric.toml (Criterion, JudgeConfig, ScoringConfig, RubricConfig) with binary/likert/numeric criterion types and score normalization
- Add llm.py: Multi-provider LLM routing (Anthropic/OpenAI/Gemini) with call_judge(), parse_verdict(), exponential backoff retry
- Add file_readers.py: Document extraction (pdf/docx/xlsx/pptx/txt) with find_deliverables() for rollout directory scanning
- Replace placeholder LLMJudgeRewardFunc in builtins.py with full rubric-based judge: per-criterion scoring, prompt templates, configurable aggregation (weighted_mean/all_pass/any_pass/threshold), backward-compatible legacy mode
- Emit per-criterion dense RewardEvents for fine-grained reward signals
- Write evaluation_details.json alongside rollout for transparency
- Re-export Criterion, JudgeConfig, RubricConfig, ScoringConfig, load_rubric_toml from rewards and top-level __init__
- Add test_rubric_config.py (15 tests) and test_llm_judge.py (30 tests)

Closes ENG-55

- Replace sync SDK clients with async variants (AsyncAnthropic, AsyncOpenAI, client.aio for Google)
- Replace time.sleep() with await asyncio.sleep() in call_judge retry loop
- Pass actual aggregated score to _write_details instead of recomputing n_passed/total

devin-ai-integration

Devin Review found 2 new potential issues.

devin-ai-integration · 2026-05-16T18:10:32Z

+        if self._rubric_path is not None:
+            return load_rubric_toml(self._rubric_path)


🟡 Constructor judge_model/model silently ignored when rubric_path or auto-discovered rubric is used

When LLMJudgeRewardFunc is constructed with both rubric_path (or when a rubric.toml is auto-discovered) and an explicit judge_model, the constructor's model is silently ignored. In _load_rubric at src/benchflow/rewards/builtins.py:181-182, the rubric_path branch calls load_rubric_toml(self._rubric_path) and returns the TOML's config as-is, without merging self.model. The same applies to auto-discovered rubrics at lines 208-213. However, inline criteria mode at lines 201-202 correctly uses self.model via JudgeConfig(model=self.model, ...). This means LLMJudgeRewardFunc(judge_model="gpt-4o", rubric_path=Path("rubric.toml")) would silently use the TOML's model (defaulting to claude-sonnet-4-6 per src/benchflow/rewards/rubric_config.py:102) instead of gpt-4o, while LLMJudgeRewardFunc(judge_model="gpt-4o", criteria=[...]) correctly uses gpt-4o.

Prompt for agents

The _load_rubric method has an inconsistency: when rubric_path is provided (lines 181-182) or a rubric.toml is auto-discovered (lines 208-213), the RubricConfig returned uses the TOML file's judge model, ignoring self.model (which holds the constructor's judge_model parameter). But when inline criteria are used (lines 201-202), self.model IS correctly propagated via JudgeConfig(model=self.model). To fix this, after loading a rubric from a TOML file, merge the constructor's model if it was explicitly provided by the user. One approach: after calling load_rubric_toml(), check if the user explicitly set a model (e.g., track whether judge_model was passed) and if so, override rubric.judge.model with self.model. This needs to be done for both the rubric_path path and the auto-discovery path. You may need to add a boolean flag like self._model_explicit to distinguish between the user passing judge_model and the default being used.

Valid catch. When rubric_path is provided or auto-discovered, self.model from the constructor is silently ignored in favor of the TOML's judge.model. The inline criteria path correctly uses self.model.

Two options to fix:

Track whether judge_model was explicitly passed (e.g. self._model_explicit flag) and override rubric.judge.model when it was.

Always merge self.model into loaded rubric configs after load_rubric_toml().

Happy to fix in a follow-up commit if the maintainer wants this addressed now.
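
A minimal sketch of the first option (tracking whether `judge_model` was passed) might look like this. Everything here is a stand-in: `JudgeSettings`, `LoadedRubric`, `JudgeFuncSketch`, and `apply_model_override` are hypothetical names that mirror, but are not, the PR's `JudgeConfig`, `RubricConfig`, and `LLMJudgeRewardFunc`. The sketch also assumes the constructor accepts `None` as the "not provided" sentinel.

```python
from __future__ import annotations

from dataclasses import dataclass, field, replace

DEFAULT_MODEL = "claude-sonnet-4-6"


@dataclass(frozen=True)
class JudgeSettings:
    """Hypothetical stand-in for JudgeConfig."""

    model: str = DEFAULT_MODEL


@dataclass(frozen=True)
class LoadedRubric:
    """Hypothetical stand-in for a RubricConfig loaded from rubric.toml."""

    judge: JudgeSettings = field(default_factory=JudgeSettings)


class JudgeFuncSketch:
    def __init__(self, judge_model: str | None = None):
        # Remember whether the caller chose a model, not just which model is set.
        self._model_explicit = judge_model is not None
        self.model = judge_model or DEFAULT_MODEL

    def apply_model_override(self, rubric: LoadedRubric) -> LoadedRubric:
        """Override the TOML's judge model only when the caller passed one."""
        if not self._model_explicit:
            return rubric
        return replace(rubric, judge=replace(rubric.judge, model=self.model))
```

The second option (always merging `self.model`) would be simpler but would overwrite a model set in rubric.toml with the constructor default whenever the caller passes nothing, so the explicit flag keeps the TOML authoritative in that case.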

devin-ai-integration · 2026-05-16T18:10:33Z

+    for provider in providers:
+        for attempt in range(retries):
+            try:
+                if provider == "anthropic":
+                    return await _call_anthropic(bare_model, prompt, max_tokens)
+                if provider == "openai":
+                    return await _call_openai(bare_model, prompt, max_tokens)
+                if provider == "google":
+                    return await _call_google(bare_model, prompt)
+            except ImportError:
+                logger.debug("SDK for %s not installed, skipping", provider)
+                break  # No point retrying if SDK is missing
+            except Exception as e:
+                last_error = e
+                if attempt < retries - 1:
+                    await asyncio.sleep(2**attempt)
+                    continue
+                logger.warning(
+                    "%s call failed after %d attempts: %s",
+                    provider,
+                    retries,
+                    e,
+                )
+                break  # Move to next provider


🟡 call_judge retries deterministic failures (wrong model name on fallback providers) with exponential backoff

In call_judge, when the primary provider fails, the code falls back to other providers using the same bare_model name (src/benchflow/rewards/llm.py:97,111-134). For example, if model is claude-sonnet-4-6, it first tries Anthropic (correct), then falls back to OpenAI and Google with bare_model="claude-sonnet-4-6" — a model name neither provider supports. Each fallback provider is retried 3 times with exponential backoff (await asyncio.sleep(2**attempt) at line 126), wasting ~3 seconds per provider (~6 seconds total for 2 fallback providers) on calls that will always fail. The same issue applies to _call_google raising RuntimeError for missing API keys at src/benchflow/rewards/llm.py:181 — this deterministic error is caught by the generic except Exception at line 123 and retried instead of failing fast like ImportError does.

Prompt for agents

The call_judge function at llm.py:111-134 retries all non-ImportError exceptions with exponential backoff, including deterministic failures that will never succeed (e.g., wrong model name on fallback provider, missing API key). Two improvements: (1) In _call_google, the RuntimeError for missing API keys should be a dedicated exception type or be caught similarly to ImportError (break instead of retry). (2) Consider whether cross-provider fallback with the same bare_model name actually makes sense — it will fail for provider-specific model names like 'claude-sonnet-4-6' on OpenAI. Either remove cross-provider fallback or add model-name translation. At minimum, treat 'model not found' / 'invalid model' type errors as non-retryable.

Good point — cross-provider fallback with the same bare_model name doesn't make sense for provider-specific models (e.g. claude-sonnet-4-6 on OpenAI will always 404).

Two improvements worth making:

Remove cross-provider fallback entirely (only retry the matched provider) — cleaner and avoids wasted retries.

Treat RuntimeError for missing API keys (and "model not found" errors) as non-retryable, similar to ImportError.

Can address in a follow-up if desired.
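
For discussion, the two improvements could be combined roughly as below. This is a sketch under assumptions, not a patch against llm.py: `NonRetryableError`, `route_provider`, `call_judge_sketch`, and the injected `backends` mapping are hypothetical, and the prefix rules simply restate the routing table from the PR summary.

```python
from __future__ import annotations

import asyncio


class NonRetryableError(Exception):
    """Hypothetical marker for failures that retrying cannot fix (missing key, bad model)."""


def route_provider(model: str) -> str:
    """Pick exactly one provider from the model name prefix."""
    if model.startswith("claude-"):
        return "anthropic"
    if model.startswith(("gpt-", "o1", "o3")):
        return "openai"
    if model.startswith("gemini-"):
        return "google"
    raise NonRetryableError(f"no provider matches model {model!r}")


async def call_judge_sketch(model, prompt, backends, retries=3, base_delay=1.0):
    """Retry only the matched provider; fail fast on deterministic errors."""
    provider = route_provider(model)
    call = backends[provider]  # async callable: (model, prompt) -> str
    last_error = None
    for attempt in range(retries):
        try:
            return await call(model, prompt)
        except (ImportError, NonRetryableError):
            raise  # missing SDK, missing key, unknown model: retrying will not help
        except Exception as exc:
            last_error = exc
            if attempt < retries - 1:
                await asyncio.sleep(base_delay * 2**attempt)
    raise RuntimeError(f"{provider} call failed after {retries} attempts") from last_error
```

With this shape, a request for `claude-sonnet-4-6` never reaches the OpenAI or Google clients, so the wasted fallback backoff disappears, and a missing API key surfaces on the first attempt.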

devin-ai-integration · 2026-05-16T18:20:18Z

Test Results — ENG-55 LLM-as-Judge Verifier

Ran the 45 existing tests plus 16 adversarial tests (7 verifying the 3 bug fixes, 9 additional), followed by the full regression suite.

Bug Fix Verification (critical)

| Test | Result | Detail |
| --- | --- | --- |
| `_write_details` score mismatch (weighted) | ✅ passed | weight=3 pass + weight=1 fail → details score = 0.75, not 0.5 |
| `_write_details` score mismatch (likert) | ✅ passed | likert 3/5 → details score = 0.5, not 1.0 |
| `score()` matches `evaluation_details.json` | ✅ passed | Return value identical to file |
| No `time.sleep` in `llm.py` | ✅ passed | Static analysis: `asyncio.sleep` used |
| `AsyncAnthropic` in `_call_anthropic` | ✅ passed | |
| `AsyncOpenAI` in `_call_openai` | ✅ passed | |
| `client.aio` in `_call_google` | ✅ passed | |

Feature Tests (45 existing)

Rubric config parsing: 15/15 passed (normalization, TOML parsing, defaults)
LLM judge pipeline: 30/30 passed (binary/likert/numeric, all aggregation modes, inline criteria, auto-discovery, dense events, error handling)

Additional Adversarial Tests (9)

File readers: discovers supported files, skips hidden + rubric.json ✓
Empty/nonexistent directory handling ✓
Public API exports from benchflow and benchflow.rewards ✓
Auto-discovery from parent directory ✓
Empty criteria rubric returns 0.0 ✓
No deliverables still attempts scoring ✓

Full Regression Suite

1008 passed, 1 deselected, 34 warnings in 22.97s

All warnings are pre-existing deprecation warnings from harbor.

- Add docs/llm-judge.md: full user-facing guide with rubric.toml reference, criterion types, aggregation strategies, dense reward events, multi-provider routing, file discovery, inline criteria, and worked examples
- Add src/benchflow/rewards/README.md: module-level guide with usage examples
- Update docs/concepts.md: reference LLM judge in Verifier primitive and add to 'Where to go next' links

…e + Rollout rename)

…, rewards, adapters (#274) * refactor: add Sandbox protocol + Harbor adapters (ENG-48 Phase A) Create BenchFlow-native Sandbox protocol as parallel types alongside existing Harbor imports. No behavioral changes — existing code continues to use Harbor directly. New files: - src/benchflow/sandbox/protocol.py: ExecResult, Sandbox, ImageRef, ImageConfig, ImageBuilder - src/benchflow/sandbox/docker.py: DockerSandbox adapter wrapping Harbor DockerEnvironment - src/benchflow/sandbox/daytona.py: DaytonaSandbox adapter wrapping Harbor DaytonaEnvironment - tests/test_sandbox_protocol.py: 26 tests covering dataclasses, protocol conformance, and delegation * refactor: unify Scene/Role/Turn types into _types.py (ENG-47) Create src/benchflow/_types.py as the single canonical source for the declarative Role, Scene, and Turn dataclasses. Changes: - New _types.py with Role (adds timeout_sec, idle_timeout_sec, skills_dir), Scene (adds parallel_group), and Turn - trial.py imports from _types.py instead of defining its own copies - _scene.py renames its internal Role to SceneRole (different fields: instruction, tools) with backward-compat alias - __init__.py re-exports canonical types from _types.py; adds TrialRole/TrialScene backward-compat aliases - runtime.py and trial_yaml.py updated to import from _types.py New fields default to None — no runtime behavior changes. 
* fix: address review — shlex.quote in read_file, cleanup temp in write_file - Use shlex.quote(path) in read_file to prevent shell injection - Clean up temp files with os.unlink in write_file finally block - Move imports to module level * fix: read_file error checking + export ImageConfig from top-level - read_file now raises FileNotFoundError on non-zero return code - Export ImageConfig alongside ImageBuilder in __init__.py - Add tests for read_file error behavior * refactor: kill shim layers, single Rollout execution path (ENG-46) - Rename trial.py → rollout.py, Trial → Rollout, TrialConfig → RolloutConfig - Rename RunResult → RolloutResult in models.py - Create evaluation.py from job.py: Job → Evaluation, JobConfig → EvaluationConfig, JobResult → EvaluationResult - Create _run.py with bf.run(RolloutConfig) → RolloutResult entry point - Replace sdk.py with thin backward-compat shim delegating to rollout.py - Replace trial.py with re-export shim from rollout.py - Replace job.py with re-export shim from evaluation.py - Preserve runtime.py API, update internal imports to use Rollout - Update __init__.py: new public API + backward-compat aliases - Update test patch targets from benchflow.trial → benchflow.rollout - All 841 tests pass, lint clean, pre-existing typecheck errors unchanged * style: apply ruff format to pass CI format check * fix: resolve ty typecheck errors (Any for kwargs and sentinel) * feat: composable Rubric + RewardFunc protocol (ENG-49) - Create src/benchflow/rewards/ package with: - RewardFunc protocol (single scoring dimension) - Rubric dataclass (weighted collection of RewardFuncs) - VerifyResult dataclass (aggregated scoring result) - RewardEvent dataclass (dense/terminal reward signals) - Built-in funcs: TestRewardFunc, LLMJudgeRewardFunc, StringMatchRewardFunc, CodeExecRewardFunc - Add reward_events field to RunResult (additive, no breaking changes) - Re-export all reward types from benchflow top-level __init__.py - Backward compat: 
Rubric([TestRewardFunc()]) wraps existing test.sh -> reward.txt flow - 31 tests covering all reward types, rubric scoring, weights, error handling, protocol conformance, and re-exports * style: format rewards package and tests with ruff * fix: disambiguate Rubric.items keys when multiple funcs share a class name Append _N suffix on collision so all per-func scores are preserved. Addresses Devin Review feedback on PR #266. * feat: per-agent capabilities + agent-as-tool infrastructure (ENG-50) - Add Role.capabilities field for declarative agent capability lists - Add Sandbox.expose_ports property for inter-agent communication ports - Document capability boundary: BenchFlow provides sandbox + instruction + observation; agents own loops, tool protocols, and agent-as-tool - Update _scene.py and _types.py module docstrings with ENG-50 boundary - Add 16 tests covering capabilities, expose_ports, and Coder→Reviewer→Coder turn sequences * style: apply ruff format to sandbox adapters and tests * feat: external adapters for Inspect AI + ORS (ENG-51) Add benchflow.adapters package with thin format converters: - InspectAdapter / to_inspect_task: Scene + Rubric -> Inspect AI task dict - ORSAdapter / to_ors_reward: VerifyResult + RewardEvent -> ORS format dict No external dependencies required — these are pure format converters. Re-exported from benchflow top-level __init__.py. 14 tests covering both adapters, convenience functions, and round-trips. 
* style: fix ruff format for adapters and tests * fix: convention fixes + trial shim re-exports + __getattr__ narrowing - Parametrize bare dict returns to dict[str, Any] in adapters - Remove unnecessary @DataClass on adapter classes (no fields) - Add from __future__ import annotations to evaluation.py - Fix __getattr__ to re-raise ModuleNotFoundError for broken deps - Add private helper re-exports to trial.py shim for backward compat - Convert test_save_trajectory to async (fixes event loop deprecation) * rename: backend → sandbox across API, CLI, docs, and tests Aligns naming with the v0.4 Sandbox protocol (ENG-48). The 'backend' parameter/flag/attribute on Environment and the CLI is now 'sandbox' everywhere. Scope: - runtime.py: Environment.__init__(sandbox=), .sandbox, from_task(sandbox=) - cli/main.py: --sandbox flag (was --backend), help text - _snapshot.py: docstring updates - tests/test_runtime.py: updated assertions - docs/: all references in task-authoring, CLI ref, Python API ref, running-benchmarks, integration-tests, examples NOT renamed (different meaning): - _credentials.py 'Vertex AI backend' (provider concept) - _provider_runtime.py / bedrock_proxy.py backend_model (LLM model) - _sandbox.py 'build backends' (Python packaging) * docs: update for v0.4 — Rollout as canonical name, add Sandbox/Rubric/Adapter docs - README: 'Sandbox backends' → 'Sandboxes', table says Rollout not Trial - concepts.md: Environment no longer says 'Backed by Harbor', Trial → Rollout as core primitive, lifecycle diagram uses Rollout.run() - getting-started.md: uses RolloutConfig, links to rollout lifecycle - python-api.md: RolloutConfig (aliased as TrialConfig), Rollout (aliased as Trial), canonical imports shown first, added v0.4 types section with Sandbox protocol, Rubric + RewardFunc, Adapters, Evaluation, backward-compat aliases * fix: correct Rubric/RewardFunc API examples in python-api.md Fixes 3 issues flagged by Devin Review: - StringMatchRewardFunc: remove 
non-existent 'field' parameter - Rubric: use reward_funcs + weights instead of items=[(func, weight)] - rubric.score: takes rollout_dir: Path, not sandbox= * fix: correct evaluation.py docstring aliases, clean stale comment in __init__.py * docs: remove Harbor migration references - use-cases.md: remove 'How it works vs Harbor' and 'Migration from Harbor' sections - scene-patterns.ipynb: remove 'Mapping to Harbor PR #1462 Concepts' cell - coder-reviewer-demo.py: 'Harbor-format task directory' → 'BenchFlow task directory' - swebench notebook: remove Harbor #1316 reference from intro * CLI: unify sandbox flag (--env → --sandbox, -b → -e) [ENG-56] (#281) * cli: unify sandbox flag --env → --sandbox, -b → -e (ENG-56) Standardize all CLI commands to use --sandbox/-e for the sandbox parameter. Previously bench run used --sandbox/-b while bench eval create and others used --env/-e. - bench run: -b → -e - bench eval create/run/compare: --env → --sandbox - bench eval progress: --env → --sandbox - bench skills eval: --env → --sandbox - bench environment create: -b → -e - docs/examples: updated flag references * cli: drop -e/-b short flags, use --sandbox only (ENG-56) Remove all short flags (-e, -b) for the sandbox parameter across all CLI commands. The long flag --sandbox is the only way to specify the sandbox now. Updated 15 files: CLI definitions, help text examples, README, docs, skill files, integration test runner, and test examples. --------- Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> * cli: deprecate bench run in favor of bench eval create (ENG-57) (#282) Per ENG-46, bench eval create is the canonical CLI entry point. bench run goes through the old SDK shim and is now deprecated, matching the pattern used for bench job and other legacy commands. Python API bf.run() is unaffected (backward-compat alias stays). 
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> * feat(ENG-55): first-class LLM-as-judge verifier with dense reward (#277) * feat(ENG-55): first-class LLM-as-judge verifier with dense reward - Add rubric_config.py: Pydantic models for rubric.toml (Criterion, JudgeConfig, ScoringConfig, RubricConfig) with binary/likert/numeric criterion types and score normalization - Add llm.py: Multi-provider LLM routing (Anthropic/OpenAI/Gemini) with call_judge(), parse_verdict(), exponential backoff retry - Add file_readers.py: Document extraction (pdf/docx/xlsx/pptx/txt) with find_deliverables() for rollout directory scanning - Replace placeholder LLMJudgeRewardFunc in builtins.py with full rubric-based judge: per-criterion scoring, prompt templates, configurable aggregation (weighted_mean/all_pass/any_pass/threshold), backward-compatible legacy mode - Emit per-criterion dense RewardEvents for fine-grained reward signals - Write evaluation_details.json alongside rollout for transparency - Re-export Criterion, JudgeConfig, RubricConfig, ScoringConfig, load_rubric_toml from rewards and top-level __init__ - Add test_rubric_config.py (15 tests) and test_llm_judge.py (30 tests) Closes ENG-55 * fix(ENG-55): async clients, asyncio.sleep, and score mismatch bugs - Replace sync SDK clients with async variants (AsyncAnthropic, AsyncOpenAI, client.aio for Google) - Replace time.sleep() with await asyncio.sleep() in call_judge retry loop - Pass actual aggregated score to _write_details instead of recomputing n_passed/total * docs(ENG-55): add LLM-as-judge verifier documentation - Add docs/llm-judge.md: full user-facing guide with rubric.toml reference, criterion types, aggregation strategies, dense reward events, multi-provider routing, file discovery, inline criteria, and worked examples - Add src/benchflow/rewards/README.md: module-level guide with usage examples - Update docs/concepts.md: reference LLM judge in Verifier primitive and add to 'Where to go next' links * docs: polish — 
remove ENG-55 ticket ref, clarify rubric.json exclusion --------- Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> * CLI: remove all short flags, use full names only [ENG-74] (#284) * CLI: remove all short flags, use full names only [ENG-74] Remove all single-letter short flags (-a, -m, -o, -t, -c, -f, -s, -p, -b, -d) from every CLI command. Full flag names only (--agent, --model, --jobs-dir, --tasks-dir, --concurrency, --config, --skills-dir, --prompt, --benchmark, --dir). Updated across 23 files: - src/benchflow/cli/main.py: all typer.Option definitions + docstring examples - All docs/ (getting-started, running-benchmarks, cli.md, skill-eval, integration-tests, examples/) - README.md - .claude/skills/ (SKILL.md + references) - tests/ (integration/run.sh, test_codex_custom_provider.sh, conformance/README.md) - benchmarks/ (models-as-skills.md, harvey-lab/run_harvey_lab.py) - Test fix: test_skill_eval_dryrun.py invocation updated from -a to --agent * fix: replace remaining -f with --config in task-embedded SKILL.md files --------- Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> * Backport PR #230 and #242 fixes to refactor/v0.4 (#286) * fix: apply PR #230 (skills double-deploy) and PR #242 (shell injection) fixes to v0.4 PR #230: pass effective task path to deploy_skills, set effective_skills in already-injected branch PR #242: shlex.quote outbox file paths in _scene.py, harden _snapshot.py with path validation * fix: set effective_skills to /skills when Dockerfile already injected When deploy_skills detects the Dockerfile already bakes in `COPY _deps/skills /skills/`, the runtime upload is skipped but effective_skills was left as task.config.environment.skills_dir (often None), causing the subsequent linking step to silently skip distribution to agent-specific paths like /home/agent/.agents/skills. Sandbox users relied on that runtime linking since Dockerfile injection only links under /root/. 
Tighten the regression test to assert env.exec was called with the expected `ln -sfn` command. --------- Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> Co-authored-by: Yifeng He <yfhe.prsn@gmail.com> * fix locally * release: 0.3.4 --------- Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com> Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>

…, ACPX integration (#288) * refactor: BenchFlow v0.3.4 — unified types, Rollout, Sandbox protocol, rewards, adapters (#274) * refactor: add Sandbox protocol + Harbor adapters (ENG-48 Phase A) Create BenchFlow-native Sandbox protocol as parallel types alongside existing Harbor imports. No behavioral changes — existing code continues to use Harbor directly. New files: - src/benchflow/sandbox/protocol.py: ExecResult, Sandbox, ImageRef, ImageConfig, ImageBuilder - src/benchflow/sandbox/docker.py: DockerSandbox adapter wrapping Harbor DockerEnvironment - src/benchflow/sandbox/daytona.py: DaytonaSandbox adapter wrapping Harbor DaytonaEnvironment - tests/test_sandbox_protocol.py: 26 tests covering dataclasses, protocol conformance, and delegation * refactor: unify Scene/Role/Turn types into _types.py (ENG-47) Create src/benchflow/_types.py as the single canonical source for the declarative Role, Scene, and Turn dataclasses. Changes: - New _types.py with Role (adds timeout_sec, idle_timeout_sec, skills_dir), Scene (adds parallel_group), and Turn - trial.py imports from _types.py instead of defining its own copies - _scene.py renames its internal Role to SceneRole (different fields: instruction, tools) with backward-compat alias - __init__.py re-exports canonical types from _types.py; adds TrialRole/TrialScene backward-compat aliases - runtime.py and trial_yaml.py updated to import from _types.py New fields default to None — no runtime behavior changes. 
* fix: address review — shlex.quote in read_file, cleanup temp in write_file - Use shlex.quote(path) in read_file to prevent shell injection - Clean up temp files with os.unlink in write_file finally block - Move imports to module level * fix: read_file error checking + export ImageConfig from top-level - read_file now raises FileNotFoundError on non-zero return code - Export ImageConfig alongside ImageBuilder in __init__.py - Add tests for read_file error behavior * refactor: kill shim layers, single Rollout execution path (ENG-46) - Rename trial.py → rollout.py, Trial → Rollout, TrialConfig → RolloutConfig - Rename RunResult → RolloutResult in models.py - Create evaluation.py from job.py: Job → Evaluation, JobConfig → EvaluationConfig, JobResult → EvaluationResult - Create _run.py with bf.run(RolloutConfig) → RolloutResult entry point - Replace sdk.py with thin backward-compat shim delegating to rollout.py - Replace trial.py with re-export shim from rollout.py - Replace job.py with re-export shim from evaluation.py - Preserve runtime.py API, update internal imports to use Rollout - Update __init__.py: new public API + backward-compat aliases - Update test patch targets from benchflow.trial → benchflow.rollout - All 841 tests pass, lint clean, pre-existing typecheck errors unchanged * style: apply ruff format to pass CI format check * fix: resolve ty typecheck errors (Any for kwargs and sentinel) * feat: composable Rubric + RewardFunc protocol (ENG-49) - Create src/benchflow/rewards/ package with: - RewardFunc protocol (single scoring dimension) - Rubric dataclass (weighted collection of RewardFuncs) - VerifyResult dataclass (aggregated scoring result) - RewardEvent dataclass (dense/terminal reward signals) - Built-in funcs: TestRewardFunc, LLMJudgeRewardFunc, StringMatchRewardFunc, CodeExecRewardFunc - Add reward_events field to RunResult (additive, no breaking changes) - Re-export all reward types from benchflow top-level __init__.py - Backward compat: 
Rubric([TestRewardFunc()]) wraps existing test.sh -> reward.txt flow - 31 tests covering all reward types, rubric scoring, weights, error handling, protocol conformance, and re-exports * style: format rewards package and tests with ruff * fix: disambiguate Rubric.items keys when multiple funcs share a class name Append _N suffix on collision so all per-func scores are preserved. Addresses Devin Review feedback on PR #266. * feat: per-agent capabilities + agent-as-tool infrastructure (ENG-50) - Add Role.capabilities field for declarative agent capability lists - Add Sandbox.expose_ports property for inter-agent communication ports - Document capability boundary: BenchFlow provides sandbox + instruction + observation; agents own loops, tool protocols, and agent-as-tool - Update _scene.py and _types.py module docstrings with ENG-50 boundary - Add 16 tests covering capabilities, expose_ports, and Coder→Reviewer→Coder turn sequences * style: apply ruff format to sandbox adapters and tests * feat: external adapters for Inspect AI + ORS (ENG-51) Add benchflow.adapters package with thin format converters: - InspectAdapter / to_inspect_task: Scene + Rubric -> Inspect AI task dict - ORSAdapter / to_ors_reward: VerifyResult + RewardEvent -> ORS format dict No external dependencies required — these are pure format converters. Re-exported from benchflow top-level __init__.py. 14 tests covering both adapters, convenience functions, and round-trips. 
* style: fix ruff format for adapters and tests
* fix: convention fixes + trial shim re-exports + __getattr__ narrowing
  - Parametrize bare dict returns to dict[str, Any] in adapters
  - Remove unnecessary @dataclass on adapter classes (no fields)
  - Add from __future__ import annotations to evaluation.py
  - Fix __getattr__ to re-raise ModuleNotFoundError for broken deps
  - Add private helper re-exports to trial.py shim for backward compat
  - Convert test_save_trajectory to async (fixes event loop deprecation)
* rename: backend → sandbox across API, CLI, docs, and tests
  Aligns naming with the v0.4 Sandbox protocol (ENG-48). The 'backend' parameter/flag/attribute on Environment and the CLI is now 'sandbox' everywhere.
  Scope:
  - runtime.py: Environment.__init__(sandbox=), .sandbox, from_task(sandbox=)
  - cli/main.py: --sandbox flag (was --backend), help text
  - _snapshot.py: docstring updates
  - tests/test_runtime.py: updated assertions
  - docs/: all references in task-authoring, CLI ref, Python API ref, running-benchmarks, integration-tests, examples
  NOT renamed (different meaning):
  - _credentials.py 'Vertex AI backend' (provider concept)
  - _provider_runtime.py / bedrock_proxy.py backend_model (LLM model)
  - _sandbox.py 'build backends' (Python packaging)
* docs: update for v0.4 — Rollout as canonical name, add Sandbox/Rubric/Adapter docs
  - README: 'Sandbox backends' → 'Sandboxes', table says Rollout not Trial
  - concepts.md: Environment no longer says 'Backed by Harbor', Trial → Rollout as core primitive, lifecycle diagram uses Rollout.run()
  - getting-started.md: uses RolloutConfig, links to rollout lifecycle
  - python-api.md: RolloutConfig (aliased as TrialConfig), Rollout (aliased as Trial), canonical imports shown first, added v0.4 types section with Sandbox protocol, Rubric + RewardFunc, Adapters, Evaluation, backward-compat aliases
* fix: correct Rubric/RewardFunc API examples in python-api.md
  Fixes 3 issues flagged by Devin Review:
  - StringMatchRewardFunc: remove non-existent 'field' parameter
  - Rubric: use reward_funcs + weights instead of items=[(func, weight)]
  - rubric.score: takes rollout_dir: Path, not sandbox=
* fix: correct evaluation.py docstring aliases, clean stale comment in __init__.py
* docs: remove Harbor migration references
  - use-cases.md: remove 'How it works vs Harbor' and 'Migration from Harbor' sections
  - scene-patterns.ipynb: remove 'Mapping to Harbor PR #1462 Concepts' cell
  - coder-reviewer-demo.py: 'Harbor-format task directory' → 'BenchFlow task directory'
  - swebench notebook: remove Harbor #1316 reference from intro
* CLI: unify sandbox flag (--env → --sandbox, -b → -e) [ENG-56] (#281)
* cli: unify sandbox flag --env → --sandbox, -b → -e (ENG-56)
  Standardize all CLI commands to use --sandbox/-e for the sandbox parameter. Previously bench run used --sandbox/-b while bench eval create and others used --env/-e.
  - bench run: -b → -e
  - bench eval create/run/compare: --env → --sandbox
  - bench eval progress: --env → --sandbox
  - bench skills eval: --env → --sandbox
  - bench environment create: -b → -e
  - docs/examples: updated flag references
* cli: drop -e/-b short flags, use --sandbox only (ENG-56)
  Remove all short flags (-e, -b) for the sandbox parameter across all CLI commands. The long flag --sandbox is the only way to specify the sandbox now.
  Updated 15 files: CLI definitions, help text examples, README, docs, skill files, integration test runner, and test examples.
---------
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* cli: deprecate bench run in favor of bench eval create (ENG-57) (#282)
  Per ENG-46, bench eval create is the canonical CLI entry point. bench run goes through the old SDK shim and is now deprecated, matching the pattern used for bench job and other legacy commands. Python API bf.run() is unaffected (backward-compat alias stays).
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* feat(ENG-55): first-class LLM-as-judge verifier with dense reward (#277)
* feat(ENG-55): first-class LLM-as-judge verifier with dense reward
  - Add rubric_config.py: Pydantic models for rubric.toml (Criterion, JudgeConfig, ScoringConfig, RubricConfig) with binary/likert/numeric criterion types and score normalization
  - Add llm.py: Multi-provider LLM routing (Anthropic/OpenAI/Gemini) with call_judge(), parse_verdict(), exponential backoff retry
  - Add file_readers.py: Document extraction (pdf/docx/xlsx/pptx/txt) with find_deliverables() for rollout directory scanning
  - Replace placeholder LLMJudgeRewardFunc in builtins.py with full rubric-based judge: per-criterion scoring, prompt templates, configurable aggregation (weighted_mean/all_pass/any_pass/threshold), backward-compatible legacy mode
  - Emit per-criterion dense RewardEvents for fine-grained reward signals
  - Write evaluation_details.json alongside rollout for transparency
  - Re-export Criterion, JudgeConfig, RubricConfig, ScoringConfig, load_rubric_toml from rewards and top-level __init__
  - Add test_rubric_config.py (15 tests) and test_llm_judge.py (30 tests)
  Closes ENG-55
* fix(ENG-55): async clients, asyncio.sleep, and score mismatch bugs
  - Replace sync SDK clients with async variants (AsyncAnthropic, AsyncOpenAI, client.aio for Google)
  - Replace time.sleep() with await asyncio.sleep() in call_judge retry loop
  - Pass actual aggregated score to _write_details instead of recomputing n_passed/total
* docs(ENG-55): add LLM-as-judge verifier documentation
  - Add docs/llm-judge.md: full user-facing guide with rubric.toml reference, criterion types, aggregation strategies, dense reward events, multi-provider routing, file discovery, inline criteria, and worked examples
  - Add src/benchflow/rewards/README.md: module-level guide with usage examples
  - Update docs/concepts.md: reference LLM judge in Verifier primitive and add to 'Where to go next' links
* docs: polish — remove ENG-55 ticket ref, clarify rubric.json exclusion
---------
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* CLI: remove all short flags, use full names only [ENG-74] (#284)
* CLI: remove all short flags, use full names only [ENG-74]
  Remove all single-letter short flags (-a, -m, -o, -t, -c, -f, -s, -p, -b, -d) from every CLI command. Full flag names only (--agent, --model, --jobs-dir, --tasks-dir, --concurrency, --config, --skills-dir, --prompt, --benchmark, --dir).
  Updated across 23 files:
  - src/benchflow/cli/main.py: all typer.Option definitions + docstring examples
  - All docs/ (getting-started, running-benchmarks, cli.md, skill-eval, integration-tests, examples/)
  - README.md
  - .claude/skills/ (SKILL.md + references)
  - tests/ (integration/run.sh, test_codex_custom_provider.sh, conformance/README.md)
  - benchmarks/ (models-as-skills.md, harvey-lab/run_harvey_lab.py)
  - Test fix: test_skill_eval_dryrun.py invocation updated from -a to --agent
* fix: replace remaining -f with --config in task-embedded SKILL.md files
---------
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* Backport PR #230 and #242 fixes to refactor/v0.4 (#286)
* fix: apply PR #230 (skills double-deploy) and PR #242 (shell injection) fixes to v0.4
  PR #230: pass effective task path to deploy_skills, set effective_skills in already-injected branch
  PR #242: shlex.quote outbox file paths in _scene.py, harden _snapshot.py with path validation
* fix: set effective_skills to /skills when Dockerfile already injected
  When deploy_skills detects the Dockerfile already bakes in `COPY _deps/skills /skills/`, the runtime upload is skipped but effective_skills was left as task.config.environment.skills_dir (often None), causing the subsequent linking step to silently skip distribution to agent-specific paths like /home/agent/.agents/skills. Sandbox users relied on that runtime linking since Dockerfile injection only links under /root/.
  Tighten the regression test to assert env.exec was called with the expected `ln -sfn` command.
---------
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
* fix locally
* release: 0.3.4
---------
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
* refactor: remove Harbor references, purge backward-compat aliases in tests, update terminology
  - Remove 'terminal-bench' from pyproject.toml keywords
  - Rename _make_harbor_mock → _make_sandbox_mock in test files
  - Update Harbor patch paths to sandbox equivalents in tests
  - Rename _from_harbor_yaml → _from_legacy_yaml in evaluation.py
  - Update docstrings/comments: Harbor → BenchFlow/legacy terminology
  - Internalize Harbor types into benchflow.task/ subpackage
  - Create sandbox backends: base.py, docker_impl.py, daytona_impl.py, modal_impl.py
  - Add compose/ subpackage with docker-compose YAML templates
* refactor: modernize file structure — align with test-drive patterns
  - sandbox/: merge docker_impl→docker, daytona_impl→daytona, base→_base
  - sandbox/: move _sandbox→lockdown, _snapshot→snapshot, _daytona_patches→_sdk_ops
  - sandbox/: move process, user, environments→services into sandbox/
  - sandbox/: restructure compose/ → _compose.py + _compose_files/
  - agents/: move _credentials→credentials, _agent_env→env, _agent_setup→install
  - _utils/: create subpackage with yaml_loader, benchmark_repos, task_authoring
  - trajectories/: move viewer, _trajectory→_capture into subpackage
  - experimental/: move mcp/ into experimental/ subpackage
  - __init__.py: purge backward-compat aliases (Trial, Job, etc.)
  - Update all imports across src/, tests/, experiments/
  - Run ruff check + format (0 errors)
* refactor: purge backward-compat aliases + fix review bugs
  - Remove Trial=Rollout, TrialConfig=RolloutConfig from rollout.py
  - Remove Job=Evaluation, JobConfig=EvaluationConfig, JobResult=EvaluationResult from evaluation.py
  - Delete shim files: trial.py, job.py
  - Update all imports: benchflow.trial → benchflow.rollout, benchflow.job → benchflow.evaluation
  - Rename classes: Trial→Rollout, Job→Evaluation across tests/, benchmarks/, experiments/
  - Fix runtime.py: environment_type→sandbox_type, trial_name→rollout_name (review bug)
  - Fix conformance scripts: same keyword arg fixes (review bug)
  - Clean up __all__ and return type annotations
* docs: update terminology to v0.4 (Trial→Rollout, Job→Evaluation) + bump version to 0.4.0
* chore: update uv.lock for v0.4.0
* refactor: modernize RL terminology — trial_dir→rollout_dir, trial_name→rollout_name across codebase
* style: format viewer.py
* fix: exclude optional-dep sandbox files from ty check
* fix: configure ty to ignore unresolved-import + exclude optional-dep files
* fix: resolve all test failures — patch paths, protocol conformance, RL terminology
* fix: resolve merge conflicts, update broken imports and stale references
  - Resolve merge conflict markers in rollout.py and evaluation.py
  - Update task_download imports to benchflow._utils.benchmark_repos (11 files)
  - Fix benchflow.tasks → benchflow._utils.task_authoring import
  - Update scene-patterns.ipynb: TrialConfig→RolloutConfig, trial→rollout
  - Fix Job→Evaluation rename in docs
  - Ruff auto-fixes (import ordering)
* style: format 12 files to pass ruff format check
* fix: resolve ty check errors in traces/ — dict[str, object] → dict[str, Any]
* fix: add skip guards for optional dependency tests (daytona, modal)
* fix: update stale imports and references in docs/benchmarks (Devin Review)
* refactor: consolidate underscore modules into proper subpackages
  - _acp_run.py → acp/runtime.py
  - _env_setup.py → sandbox/setup.py
  - _provider_runtime.py → providers/runtime.py
  - _scoring.py → _utils/scoring.py
  - _scene.py → scenes.py (removed Role=SceneRole backward-compat alias)
  - Fix GEMINI_API_KEY auto-detection: inherit from os.environ as fallback after .env
* feat: ACPX integration — acpx/<agent> protocol for headless ACP agent invocation
  - Add 'acpx' as a valid protocol in parse_agent_spec (acpx/claude, acpx/codex, etc.)
  - _acpx_wrap() decorates any registered agent to launch via acpx CLI
  - Installs acpx alongside the underlying agent in the sandbox
  - Preserves all agent env, credentials, and skill_paths
  - Replace harbor protocol references with acpx in tests
* style: format registry.py for ruff format check
* fix: suppress ty invalid-assignment for monkey-patch in _patch_docker_dind
* docs: update docs and SKILL.md for v0.4 — RL-first terminology, ACPX, purge backward-compat
* fix: rename trial_config_from_yaml → rollout_config_from_yaml, fix stale _run import in sdk.py
* fix: docs inconsistencies found by dogfooding — GEMINI_API_KEY, daytona install note
* fix: docs path evaluations/ → jobs/ to match actual output directory
* fix: ruff import ordering in sdk.py
* fix: ty type-check suppression for sdk.py run() return type
* feat: add sandbox-daytona and sandbox-modal optional dependency groups
* chore: update uv.lock for sandbox-daytona and sandbox-modal optional deps
---------
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
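The ENG-55 fix listed above replaces time.sleep() with await asyncio.sleep() in the call_judge retry loop. The sketch below shows why that matters in an exponential backoff loop: the awaited sleep yields to the event loop instead of blocking it. The names `retry_async`, `max_attempts`, `base_delay`, and `flaky_judge` are hypothetical, and the real call_judge signature is not shown in this log.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_async(
    call: Callable[[], Awaitable[T]],
    max_attempts: int = 3,
    base_delay: float = 0.01,
) -> T:
    """Retry an async call, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return await call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Non-blocking wait: other coroutines keep running meanwhile.
            await asyncio.sleep(base_delay * 2**attempt)
    raise RuntimeError("max_attempts must be at least 1")


attempts = 0


async def flaky_judge() -> str:
    """Hypothetical judge call that fails twice, then succeeds."""
    global attempts
    attempts += 1
    if attempts < 3:
        raise ConnectionError("transient failure")
    return '{"verdict": "pass"}'


result = asyncio.run(retry_async(flaky_judge))
print(result, attempts)
```

With time.sleep() in place of the awaited sleep, every concurrent criterion evaluation on the same event loop would stall for the full backoff delay.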

* refactor: add Sandbox protocol + Harbor adapters (ENG-48 Phase A)
  Create BenchFlow-native Sandbox protocol as parallel types alongside existing Harbor imports. No behavioral changes — existing code continues to use Harbor directly.
  New files:
  - src/benchflow/sandbox/protocol.py: ExecResult, Sandbox, ImageRef, ImageConfig, ImageBuilder
  - src/benchflow/sandbox/docker.py: DockerSandbox adapter wrapping Harbor DockerEnvironment
  - src/benchflow/sandbox/daytona.py: DaytonaSandbox adapter wrapping Harbor DaytonaEnvironment
  - tests/test_sandbox_protocol.py: 26 tests covering dataclasses, protocol conformance, and delegation
* refactor: unify Scene/Role/Turn types into _types.py (ENG-47)
  Create src/benchflow/_types.py as the single canonical source for the declarative Role, Scene, and Turn dataclasses.
  Changes:
  - New _types.py with Role (adds timeout_sec, idle_timeout_sec, skills_dir), Scene (adds parallel_group), and Turn
  - trial.py imports from _types.py instead of defining its own copies
  - _scene.py renames its internal Role to SceneRole (different fields: instruction, tools) with backward-compat alias
  - __init__.py re-exports canonical types from _types.py; adds TrialRole/TrialScene backward-compat aliases
  - runtime.py and trial_yaml.py updated to import from _types.py
  New fields default to None — no runtime behavior changes.
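The ENG-47 entry notes that the new Role fields default to None, so existing call sites keep working. A minimal sketch of that pattern follows, assuming a hypothetical pre-existing `name` field and simplified field types; the real Role in _types.py may be defined differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Role:
    """Simplified stand-in for the declarative Role type."""

    name: str  # hypothetical pre-existing field
    # Fields named in the ENG-47 entry; None means "not set" (types are assumptions).
    timeout_sec: Optional[int] = None
    idle_timeout_sec: Optional[int] = None
    skills_dir: Optional[str] = None


# A call site written before the new fields existed still constructs a Role.
old_style = Role(name="coder")
# New call sites can opt in field by field.
new_style = Role(name="reviewer", timeout_sec=600)
print(old_style, new_style)
```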
* feat: trace import — `bench tasks generate` from Claude Code + opentraces (#278) * feat: trace import prototype — Claude Code + opentraces → BenchFlow tasks Add benchflow.traces package for personal benchmark curation from agent traces: - parsers: parse Claude Code JSONL sessions and opentraces v0.1-v0.3 records - models: format-agnostic ParsedTrace intermediate representation - task_gen: generate task.toml + instruction.md + test.sh from traces - huggingface: download and parse trace datasets from HuggingFace Hub - local: discover and parse local ~/.claude/projects/ sessions - CLI: bench import {local,file,hf,list-datasets} subcommands 44 new tests, all passing. * fix: address Devin Review findings - _cache_dir: fall back to cwd instead of / when outside git repo - HF cache key: include max_rows in filename to avoid stale data - parse_claude_code_session: deterministic trace_id from session+filename - _build_test_sh: use shlex.quote() to prevent shell injection * refactor: align trace import CLI with noun-verb philosophy - Replace 'bench import {local,file,hf}' with 'bench tasks generate --from-{local,file,hf}' - Replace 'bench import list-datasets' with 'bench tasks list-sources' - Integrate trace commands into existing tasks_app via register_tasks_generate() - Update CLI docs with new command reference - Follows BenchFlow resource-verb pattern: bench <resource> <verb> * fix: address Devin Review — limit flag + format name mismatch - Apply --limit across all sources (was only used for --from-local) - Map 'claude-code' → 'claude-messages' in _load_hf so --format claude-code works correctly with --from-hf * docs: dogfood bench tasks generate in task-authoring and getting-started guides * fix: generated tasks now pass bench tasks
check — add Dockerfile, tests/ dir, reward path, difficulty scaling, TOML safety
* audit: fix task pipeline issues + normalize docstrings across traces package
  Pipeline fixes:
  - Fix timeout auto-scaling: batch generator default changed from 300 to 0 (300 > 0 prevented difficulty-based scaling from ever triggering)
  - Add [task] name section to generated task.toml (trace-import/<slug>)
  - Add build_timeout_sec and storage_mb to [environment] section
  - Fix test.sh indentation (remove extra leading spaces from textwrap)
  Docstring normalization:
  - Add missing docstrings to CLI helper functions (_load_local, _load_file, _load_hf)
  - Improve docstring clarity across traces package (detect_format, print helpers, HF parsers)
  - Use reST code block syntax for CLI examples
  48/48 tests pass, lint clean.
* fix: multi-session JSONL collapse, outcome detection, zero-tool-call filter
  Dogfooding revealed 3 quality issues:
  1. Critical: parse_claude_code_session() merged all sessions in a multi-session JSONL file into one trace, contaminating difficulty, instruction, and verifier. Added parse_claude_code_file() that splits by sessionId before parsing each group.
  2. Outcome detection missed common completion verbs (fixed, refactored, built, created, updated, implemented, added).
  3. Traces with zero tool calls (pure explanations) produced useless tasks with pass-through verifiers. Now filtered out in batch mode.
  Before: 5-session file → 1 task (hard, 1200s, 19 files listed)
  After: 5-session file → 4 tasks (easy/medium, correct files each)
  Tests: 55 passing (was 48), lint clean.
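The multi-session fix above splits a JSONL file by sessionId before parsing each group. A minimal sketch of that grouping step is shown here; the function name `split_by_session`, the "unknown" bucket for records without a sessionId, and the sample record fields other than sessionId are assumptions, not the actual parse_claude_code_file() implementation.

```python
import json
from collections import defaultdict


def split_by_session(jsonl_text: str) -> dict[str, list[dict]]:
    """Group JSONL records by their sessionId, preserving line order."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # Assumption: records without a sessionId share one "unknown" bucket.
        groups[record.get("sessionId", "unknown")].append(record)
    return dict(groups)


sample = "\n".join(
    json.dumps(r)
    for r in [
        {"sessionId": "a", "text": "fix the bug"},
        {"sessionId": "b", "text": "add a test"},
        {"sessionId": "a", "text": "done"},
    ]
)
sessions = split_by_session(sample)
print({sid: len(records) for sid, records in sessions.items()})
```

Each group can then be parsed as its own trace, so difficulty, instruction, and verifier are derived from one session only.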
* fix: test.sh file check cap matches instruction.md (10→20) * fix: real-trace dogfood — path relativization, session artifact cleanup, HF parser robustness - Add _relativize_path() to convert absolute workspace paths to relative project paths - Add _clean_user_prompt() to strip session continuation boilerplate - HF parser: support messages_json key, strip system-reminders, handle tool_result blocks, infer outcome - CLI: add claude-messages format detection and routing for --from-file - Add _parse_claude_requests_row() for cc-traces-weka metadata format - All 55 tests pass, lint clean * fix: verifier robustness — glob patterns for timestamp-bearing paths Paths with date/timestamp segments (e.g. migrations/2025-11-28-131040_create_invoices/up.sql) now use compgen -G glob patterns in test.sh instead of exact [ -f ] checks. This lets verifiers tolerate agent-generated timestamp variants. Also updates instruction.md to show globbed paths for dynamic segments. Adds 10 new tests for _globify_path, _has_dynamic_segments, and end-to-end verifier pattern selection. * fix: real-trace parser — user_prompt+gitdiff format, git context, git-diff verifier - Handle cc-traces-merged rows with user_prompt + gitdiff (no messages_json) - Extract TASK DESCRIPTION from structured prompts, strip CRITICAL INSTRUCTIONS boilerplate - Extract git repo/commit from prompt, populate GitContext - Dockerfile clones repo at base commit when git context available - Verifier uses git diff to check files were modified (not just exist) - Fix source_model='None' string bug --------- Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> * refactor: BenchFlow v0.4 — RL-first terminology, module consolidation, ACPX integration (#288) * refactor: BenchFlow v0.3.4 — unified types, Rollout, Sandbox protocol, rewards, adapters (#274)
* style: fix ruff format for adapters and tests * fix: convention fixes + trial shim re-exports + __getattr__ narrowing - Parametrize bare dict returns to dict[str, Any] in adapters - Remove unnecessary @DataClass on adapter classes (no fields) - Add from __future__ import annotations to evaluation.py - Fix __getattr__ to re-raise ModuleNotFoundError for broken deps - Add private helper re-exports to trial.py shim for backward compat - Convert test_save_trajectory to async (fixes event loop deprecation) * rename: backend → sandbox across API, CLI, docs, and tests Aligns naming with the v0.4 Sandbox protocol (ENG-48). The 'backend' parameter/flag/attribute on Environment and the CLI is now 'sandbox' everywhere. Scope: - runtime.py: Environment.__init__(sandbox=), .sandbox, from_task(sandbox=) - cli/main.py: --sandbox flag (was --backend), help text - _snapshot.py: docstring updates - tests/test_runtime.py: updated assertions - docs/: all references in task-authoring, CLI ref, Python API ref, running-benchmarks, integration-tests, examples NOT renamed (different meaning): - _credentials.py 'Vertex AI backend' (provider concept) - _provider_runtime.py / bedrock_proxy.py backend_model (LLM model) - _sandbox.py 'build backends' (Python packaging) * docs: update for v0.4 — Rollout as canonical name, add Sandbox/Rubric/Adapter docs - README: 'Sandbox backends' → 'Sandboxes', table says Rollout not Trial - concepts.md: Environment no longer says 'Backed by Harbor', Trial → Rollout as core primitive, lifecycle diagram uses Rollout.run() - getting-started.md: uses RolloutConfig, links to rollout lifecycle - python-api.md: RolloutConfig (aliased as TrialConfig), Rollout (aliased as Trial), canonical imports shown first, added v0.4 types section with Sandbox protocol, Rubric + RewardFunc, Adapters, Evaluation, backward-compat aliases * fix: correct Rubric/RewardFunc API examples in python-api.md Fixes 3 issues flagged by Devin Review: - StringMatchRewardFunc: remove 
non-existent 'field' parameter - Rubric: use reward_funcs + weights instead of items=[(func, weight)] - rubric.score: takes rollout_dir: Path, not sandbox= * fix: correct evaluation.py docstring aliases, clean stale comment in __init__.py * docs: remove Harbor migration references - use-cases.md: remove 'How it works vs Harbor' and 'Migration from Harbor' sections - scene-patterns.ipynb: remove 'Mapping to Harbor PR #1462 Concepts' cell - coder-reviewer-demo.py: 'Harbor-format task directory' → 'BenchFlow task directory' - swebench notebook: remove Harbor #1316 reference from intro * CLI: unify sandbox flag (--env → --sandbox, -b → -e) [ENG-56] (#281) * cli: unify sandbox flag --env → --sandbox, -b → -e (ENG-56) Standardize all CLI commands to use --sandbox/-e for the sandbox parameter. Previously bench run used --sandbox/-b while bench eval create and others used --env/-e. - bench run: -b → -e - bench eval create/run/compare: --env → --sandbox - bench eval progress: --env → --sandbox - bench skills eval: --env → --sandbox - bench environment create: -b → -e - docs/examples: updated flag references * cli: drop -e/-b short flags, use --sandbox only (ENG-56) Remove all short flags (-e, -b) for the sandbox parameter across all CLI commands. The long flag --sandbox is the only way to specify the sandbox now. Updated 15 files: CLI definitions, help text examples, README, docs, skill files, integration test runner, and test examples. --------- Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> * cli: deprecate bench run in favor of bench eval create (ENG-57) (#282) Per ENG-46, bench eval create is the canonical CLI entry point. bench run goes through the old SDK shim and is now deprecated, matching the pattern used for bench job and other legacy commands. Python API bf.run() is unaffected (backward-compat alias stays). 
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> * feat(ENG-55): first-class LLM-as-judge verifier with dense reward (#277) * feat(ENG-55): first-class LLM-as-judge verifier with dense reward - Add rubric_config.py: Pydantic models for rubric.toml (Criterion, JudgeConfig, ScoringConfig, RubricConfig) with binary/likert/numeric criterion types and score normalization - Add llm.py: Multi-provider LLM routing (Anthropic/OpenAI/Gemini) with call_judge(), parse_verdict(), exponential backoff retry - Add file_readers.py: Document extraction (pdf/docx/xlsx/pptx/txt) with find_deliverables() for rollout directory scanning - Replace placeholder LLMJudgeRewardFunc in builtins.py with full rubric-based judge: per-criterion scoring, prompt templates, configurable aggregation (weighted_mean/all_pass/any_pass/threshold), backward-compatible legacy mode - Emit per-criterion dense RewardEvents for fine-grained reward signals - Write evaluation_details.json alongside rollout for transparency - Re-export Criterion, JudgeConfig, RubricConfig, ScoringConfig, load_rubric_toml from rewards and top-level __init__ - Add test_rubric_config.py (15 tests) and test_llm_judge.py (30 tests) Closes ENG-55 * fix(ENG-55): async clients, asyncio.sleep, and score mismatch bugs - Replace sync SDK clients with async variants (AsyncAnthropic, AsyncOpenAI, client.aio for Google) - Replace time.sleep() with await asyncio.sleep() in call_judge retry loop - Pass actual aggregated score to _write_details instead of recomputing n_passed/total * docs(ENG-55): add LLM-as-judge verifier documentation - Add docs/llm-judge.md: full user-facing guide with rubric.toml reference, criterion types, aggregation strategies, dense reward events, multi-provider routing, file discovery, inline criteria, and worked examples - Add src/benchflow/rewards/README.md: module-level guide with usage examples - Update docs/concepts.md: reference LLM judge in Verifier primitive and add to 'Where to go next' links * docs: polish — 
remove ENG-55 ticket ref, clarify rubric.json exclusion --------- Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> * CLI: remove all short flags, use full names only [ENG-74] (#284) * CLI: remove all short flags, use full names only [ENG-74] Remove all single-letter short flags (-a, -m, -o, -t, -c, -f, -s, -p, -b, -d) from every CLI command. Full flag names only (--agent, --model, --jobs-dir, --tasks-dir, --concurrency, --config, --skills-dir, --prompt, --benchmark, --dir). Updated across 23 files: - src/benchflow/cli/main.py: all typer.Option definitions + docstring examples - All docs/ (getting-started, running-benchmarks, cli.md, skill-eval, integration-tests, examples/) - README.md - .claude/skills/ (SKILL.md + references) - tests/ (integration/run.sh, test_codex_custom_provider.sh, conformance/README.md) - benchmarks/ (models-as-skills.md, harvey-lab/run_harvey_lab.py) - Test fix: test_skill_eval_dryrun.py invocation updated from -a to --agent * fix: replace remaining -f with --config in task-embedded SKILL.md files --------- Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> * Backport PR #230 and #242 fixes to refactor/v0.4 (#286) * fix: apply PR #230 (skills double-deploy) and PR #242 (shell injection) fixes to v0.4 PR #230: pass effective task path to deploy_skills, set effective_skills in already-injected branch PR #242: shlex.quote outbox file paths in _scene.py, harden _snapshot.py with path validation * fix: set effective_skills to /skills when Dockerfile already injected When deploy_skills detects the Dockerfile already bakes in `COPY _deps/skills /skills/`, the runtime upload is skipped but effective_skills was left as task.config.environment.skills_dir (often None), causing the subsequent linking step to silently skip distribution to agent-specific paths like /home/agent/.agents/skills. Sandbox users relied on that runtime linking since Dockerfile injection only links under /root/. 
Tighten the regression test to assert env.exec was called with the expected `ln -sfn` command. --------- Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> Co-authored-by: Yifeng He <yfhe.prsn@gmail.com> * fix locally * release: 0.3.4 --------- Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com> Co-authored-by: Yifeng He <yfhe.prsn@gmail.com> * refactor: remove Harbor references, purge backward-compat aliases in tests, update terminology - Remove 'terminal-bench' from pyproject.toml keywords - Rename _make_harbor_mock → _make_sandbox_mock in test files - Update Harbor patch paths to sandbox equivalents in tests - Rename _from_harbor_yaml → _from_legacy_yaml in evaluation.py - Update docstrings/comments: Harbor → BenchFlow/legacy terminology - Internalize Harbor types into benchflow.task/ subpackage - Create sandbox backends: base.py, docker_impl.py, daytona_impl.py, modal_impl.py - Add compose/ subpackage with docker-compose YAML templates * refactor: modernize file structure — align with test-drive patterns - sandbox/: merge docker_impl→docker, daytona_impl→daytona, base→_base - sandbox/: move _sandbox→lockdown, _snapshot→snapshot, _daytona_patches→_sdk_ops - sandbox/: move process, user, environments→services into sandbox/ - sandbox/: restructure compose/ → _compose.py + _compose_files/ - agents/: move _credentials→credentials, _agent_env→env, _agent_setup→install - _utils/: create subpackage with yaml_loader, benchmark_repos, task_authoring - trajectories/: move viewer, _trajectory→_capture into subpackage - experimental/: move mcp/ into experimental/ subpackage - __init__.py: purge backward-compat aliases (Trial, Job, etc.) 
- Update all imports across src/, tests/, experiments/ - Run ruff check + format (0 errors) * refactor: purge backward-compat aliases + fix review bugs - Remove Trial=Rollout, TrialConfig=RolloutConfig from rollout.py - Remove Job=Evaluation, JobConfig=EvaluationConfig, JobResult=EvaluationResult from evaluation.py - Delete shim files: trial.py, job.py - Update all imports: benchflow.trial → benchflow.rollout, benchflow.job → benchflow.evaluation - Rename classes: Trial→Rollout, Job→Evaluation across tests/, benchmarks/, experiments/ - Fix runtime.py: environment_type→sandbox_type, trial_name→rollout_name (review bug) - Fix conformance scripts: same keyword arg fixes (review bug) - Clean up __all__ and return type annotations * docs: update terminology to v0.4 (Trial→Rollout, Job→Evaluation) + bump version to 0.4.0 * chore: update uv.lock for v0.4.0 * refactor: modernize RL terminology — trial_dir→rollout_dir, trial_name→rollout_name across codebase * style: format viewer.py * fix: exclude optional-dep sandbox files from ty check * fix: configure ty to ignore unresolved-import + exclude optional-dep files * fix: resolve all test failures — patch paths, protocol conformance, RL terminology * fix: resolve merge conflicts, update broken imports and stale references - Resolve merge conflict markers in rollout.py and evaluation.py - Update task_download imports to benchflow._utils.benchmark_repos (11 files) - Fix benchflow.tasks → benchflow._utils.task_authoring import - Update scene-patterns.ipynb: TrialConfig→RolloutConfig, trial→rollout - Fix Job→Evaluation rename in docs - Ruff auto-fixes (import ordering) * style: format 12 files to pass ruff format check * fix: resolve ty check errors in traces/ — dict[str, object] → dict[str, Any] * fix: add skip guards for optional dependency tests (daytona, modal) * fix: update stale imports and references in docs/benchmarks (Devin Review) * refactor: consolidate underscore modules into proper subpackages - _acp_run.py → 
acp/runtime.py - _env_setup.py → sandbox/setup.py - _provider_runtime.py → providers/runtime.py - _scoring.py → _utils/scoring.py - _scene.py → scenes.py (removed Role=SceneRole backward-compat alias) - Fix GEMINI_API_KEY auto-detection: inherit from os.environ as fallback after .env * feat: ACPX integration — acpx/<agent> protocol for headless ACP agent invocation - Add 'acpx' as a valid protocol in parse_agent_spec (acpx/claude, acpx/codex, etc.) - _acpx_wrap() decorates any registered agent to launch via acpx CLI - Installs acpx alongside the underlying agent in the sandbox - Preserves all agent env, credentials, and skill_paths - Replace harbor protocol references with acpx in tests * style: format registry.py for ruff format check * fix: suppress ty invalid-assignment for monkey-patch in _patch_docker_dind * docs: update docs and SKILL.md for v0.4 — RL-first terminology, ACPX, purge backward-compat * fix: rename trial_config_from_yaml → rollout_config_from_yaml, fix stale _run import in sdk.py * fix: docs inconsistencies found by dogfooding — GEMINI_API_KEY, daytona install note * fix: docs path evaluations/ → jobs/ to match actual output directory * fix: ruff import ordering in sdk.py * fix: ty type-check suppression for sdk.py run() return type * feat: add sandbox-daytona and sandbox-modal optional dependency groups * chore: update uv.lock for sandbox-daytona and sandbox-modal optional deps --------- Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com> Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai> Co-authored-by: Yifeng He <yfhe.prsn@gmail.com> * skill eval -> refactor (#293) * fix: address v0.4 dogfood blockers * test: cover skill eval nested results * chore: ignore playwright mcp artifacts * fix: validate config eval agent protocol * test: add adapter release evidence checker * test: wire adapter evidence into release runner * fix: address ENG-91 dogfood regressions * fix: record role metadata and 
timeouts * fix: align dogfood cli reports * docs: document environment cleanup command * docs: remove stale harbor task path examples * Add citation-management skill eval under skills * Update src/benchflow/cli/main.py Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com> --------- Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com> Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
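
The ENG-55 entries in the log above mention score normalization and four aggregation strategies (weighted_mean, all_pass, any_pass, threshold) without spelling out their semantics. The following is a minimal sketch of how such aggregation could work, not the code from this PR. `CriterionScore`, `normalize_likert`, `aggregate`, and `pass_threshold` are hypothetical names, the linear Likert scaling is an assumption, and "threshold" is assumed to mean "the weighted mean reaches the threshold".

```python
from dataclasses import dataclass


@dataclass
class CriterionScore:
    """Hypothetical container for one criterion's result (not the PR's type)."""

    name: str
    score: float  # assumed to be normalized to [0, 1] already
    weight: float = 1.0


def normalize_likert(raw: int, low: int = 1, high: int = 5) -> float:
    """Map a Likert rating onto [0, 1], assuming simple linear scaling."""
    return (raw - low) / (high - low)


def aggregate(
    scores: list[CriterionScore],
    strategy: str = "weighted_mean",
    pass_threshold: float = 0.5,
) -> float:
    """Collapse per-criterion scores into one reward in [0, 1]."""
    if not scores:
        return 0.0
    total_weight = sum(s.weight for s in scores)
    mean = (
        sum(s.score * s.weight for s in scores) / total_weight
        if total_weight
        else 0.0
    )
    if strategy == "weighted_mean":
        return mean
    if strategy == "all_pass":
        # Every criterion must individually reach the pass threshold.
        return 1.0 if all(s.score >= pass_threshold for s in scores) else 0.0
    if strategy == "any_pass":
        # One passing criterion is enough.
        return 1.0 if any(s.score >= pass_threshold for s in scores) else 0.0
    if strategy == "threshold":
        # Assumed semantics: binary outcome based on the weighted mean.
        return 1.0 if mean >= pass_threshold else 0.0
    raise ValueError(f"unknown aggregation strategy: {strategy!r}")


if __name__ == "__main__":
    results = [
        CriterionScore("cites_statute", 1.0, weight=2.0),
        CriterionScore("tone", normalize_likert(3)),
    ]
    for name in ("weighted_mean", "all_pass", "any_pass", "threshold"):
        print(name, aggregate(results, name))
```

In a design like this, each `CriterionScore` would map naturally onto one dense reward event, while the aggregated value serves as the terminal score.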

devin-ai-integration Bot assigned xdotli May 16, 2026

devin-ai-integration Bot commented May 16, 2026

xdotli added 3 commits May 17, 2026 02:06

merge: resolve concepts.md conflict with refactor/v0.4 (keep LLM judge + Rollout rename)

137cd5c

docs: polish — remove ENG-55 ticket ref, clarify rubric.json exclusion

54dfaf9

xdotli merged commit 526b740 into refactor/v0.4 May 17, 2026
1 check passed

xdotli deleted the devin/ENG-55-1778953466 branch May 19, 2026 15:41

xdotli merged 5 commits into refactor/v0.4 from devin/ENG-55-1778953466

		if self._rubric_path is not None:
			return load_rubric_toml(self._rubric_path)
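
The fragment above is the explicit-path branch of rubric loading. The PR summary lists three rubric sources (explicit path, inline criteria list, auto-discovery) plus a legacy fallback to llm_judge_score.txt. Here is a hedged sketch of one possible resolution order. The function name `resolve_rubric_source`, the precedence among the sources, and the assumption that auto-discovery looks for `rubric.toml` directly inside the rollout directory are all illustrative guesses, not confirmed behavior of `LLMJudgeRewardFunc`.

```python
from pathlib import Path
from typing import Optional, Sequence, Tuple, Union

# (mode, payload): the payload is a path or the inline criteria list.
Resolved = Tuple[str, Union[Path, list]]


def resolve_rubric_source(
    rollout_dir: Union[str, Path],
    rubric_path: Optional[Union[str, Path]] = None,
    inline_criteria: Optional[Sequence[dict]] = None,
) -> Resolved:
    """Pick where the rubric comes from (assumed precedence, see lead-in)."""
    rollout_dir = Path(rollout_dir)

    # 1. An explicit path wins, mirroring the `_rubric_path` check above.
    if rubric_path is not None:
        return ("path", Path(rubric_path))

    # 2. Inline criteria passed directly by the caller.
    if inline_criteria:
        return ("inline", list(inline_criteria))

    # 3. Auto-discovery: assumed to look for rubric.toml in the rollout dir.
    candidate = rollout_dir / "rubric.toml"
    if candidate.is_file():
        return ("discovered", candidate)

    # 4. Legacy mode: fall back to the pre-computed score file.
    return ("legacy", rollout_dir / "llm_judge_score.txt")
```

Returning a `(mode, payload)` pair keeps the decision separate from the parsing, so each branch can be unit-tested without calling an LLM.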

devin-ai-integration Bot commented May 16, 2026

Test Results — ENG-55 LLM-as-Judge Verifier
