feat: add ProgramBench integration, remove TB2, migrate .ref/ → benchmarks/ by devin-ai-integration[bot] · Pull Request #237 · benchflow-ai/benchflow

devin-ai-integration · 2026-05-06T08:26:04Z

Summary

Adds ProgramBench (201 program-reconstruction tasks across 7 languages) as a BenchFlow benchmark, removes TB2, migrates .ref/ → benchmarks/, and adds 12 tests verifying allow_internet actually works.

Changes:

- ProgramBench adapter (benchmarks/programbench/): converter, CLI, runner, parity results, benchmark.yaml (standardized to match Harvey LAB format)
- TB2 removal: deleted run_tb2.py, all tb2-*.yaml configs, terminal-bench-2 from TASK_REPOS
- .ref/ → benchmarks/ migration: all code, docs, YAML configs, notebooks, skills
- allow_internet test suite: 12 new tests verifying shell commands produce valid agent configs on disk, idempotency, merge-with-existing, and Harbor Task model parsing

Agent parity results (same submission → both ProgramBench eval and BenchFlow verifier):

Task	Lang	PB Reward	BF Reward	Delta
cmatrix	C	0.3758	0.3758	0.0000
zoxide	Rust	1.0000	1.0000	0.0000
shellharden	Rust	0.9992	0.9992	0.0000
ditaa	Java	0.0088	0.0088	0.0000
chroma	Go	0.0000	0.0132	0.0132
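
The Delta column is the absolute difference between the two rewards. A minimal sketch of that comparison, using the values from the table above; `TOLERANCE` is a hypothetical threshold chosen for illustration, since the PR does not state an acceptance bound:

```python
# Parity rows copied from the table above: (task, language, PB reward, BF reward).
PARITY_ROWS = [
    ("cmatrix", "C", 0.3758, 0.3758),
    ("zoxide", "Rust", 1.0000, 1.0000),
    ("shellharden", "Rust", 0.9992, 0.9992),
    ("ditaa", "Java", 0.0088, 0.0088),
    ("chroma", "Go", 0.0000, 0.0132),
]

# Hypothetical tolerance; the PR does not specify one.
TOLERANCE = 0.02


def parity_deltas(rows):
    """Return {task: |BF - PB|}, rounded to four decimals as in the table."""
    return {task: round(abs(bf - pb), 4) for task, _lang, pb, bf in rows}


deltas = parity_deltas(PARITY_ROWS)
mismatched = [task for task, delta in deltas.items() if delta > 0]
within_tolerance = all(delta <= TOLERANCE for delta in deltas.values())
```

With these rows, only chroma shows a non-zero delta.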

Review & Testing Checklist for Human

Spot-check a generated ProgramBench task dir structure against benchmarks/programbench/README.md format comparison tables
Verify benchmark.yaml fields are consistent with parity_experiment.json numbers
Run python -m pytest tests/test_internet_policy.py -v and confirm all 24 tests pass (12 new)
Confirm no remaining .ref/ references in code (only CHANGELOG.md historical entries should remain): rg '\.ref/' --glob '!CHANGELOG.md' src/ benchmarks/ docs/

Suggested test plan: Generate 2-3 ProgramBench tasks locally (python -m benchmarks.programbench.main --output-dir /tmp/pb-test --limit 3), run bench tasks check on them, and verify they have allow_internet = false in their task.toml.

Notes

ProgramBench Docker images are linux/amd64 only — requires x86_64 machine
The only Harbor adapter step that doesn't fully apply is oracle verification (running oracle through full eval) — oracle solve.sh exists and was spot-checked but not run through the full 201-task eval pipeline
3 bugs were found and fixed during parity testing: wrong Docker base image, wrong anti-cheat ordering, shallow clone in oracle solve.sh

Link to Devin session: https://app.devin.ai/sessions/f3761955c99449d7a3e3c2380ed664da
Requested by: @xdotli

Generate BenchFlow task directories from ProgramBench's 200 program-reconstruction instances. Each task gives an agent a compiled binary and its documentation; the agent must re-implement the program from scratch.

Files:
- benchmarks/programbench/generate.py: reads ProgramBench task.yaml + tests.json, emits task.toml / instruction.md / Dockerfile / test.sh / verify.py per instance
- benchmarks/programbench/main.py: CLI entry point for generation
- benchmarks/run_programbench.py: Job runner (mirrors run_skillsbench.py)
- benchmarks/programbench-gemini-flash-lite.yaml: default config
- src/benchflow/task_download.py: extended to support generated benchmarks; clones ProgramBench upstream, runs the generator, caches under .ref/

- _ensure_generated now generates into a staging directory and renames atomically on success, preventing partial cache on failure
- verify.py wraps tar extraction in try/except so a corrupt archive for one branch doesn't crash the entire verifier
- Fix ruff format on task_download.py
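
The two robustness fixes above follow common patterns. The sketch below is not the code from task_download.py or verify.py; `generate_atomically` and `extract_or_skip` are hypothetical names, and the exception set caught during extraction is an assumption.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path
from typing import Callable


def generate_atomically(cache_dir: Path, generate: Callable[[Path], None]) -> Path:
    """Populate cache_dir via a staging directory so a failure leaves no partial cache."""
    if cache_dir.exists():
        return cache_dir
    cache_dir.parent.mkdir(parents=True, exist_ok=True)
    # Stage next to the target so the final rename stays on one filesystem.
    staging = Path(tempfile.mkdtemp(prefix=cache_dir.name + ".tmp-", dir=cache_dir.parent))
    try:
        generate(staging)
        staging.rename(cache_dir)  # atomic on POSIX when both paths share a filesystem
    except BaseException:
        shutil.rmtree(staging, ignore_errors=True)
        raise
    return cache_dir


def extract_or_skip(archive: Path, dest: Path) -> bool:
    """Extract a tar archive; return False instead of raising if it is unreadable."""
    try:
        with tarfile.open(archive) as tar:
            tar.extractall(dest)
    except (tarfile.TarError, OSError, EOFError):
        return False
    return True
```

A caller that loops over branches can treat a `False` result as zero passed tests for that branch and continue with the rest.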

Validates the full pipeline end-to-end: Docker image build, Gemini API query, compilation, and verifier execution. Uses a single-shot prompt (not multi-turn agent), so 0% scores are expected on these hard tasks.

ProgramBench cleanroom images don't need 4 CPUs — reducing to 2 makes the benchmark runnable on smaller machines.

…ch/tasks/ Generated tasks now live under benchmarks/ instead of .ref/ per project convention. Added benchmarks/programbench/tasks/ to .gitignore since these are generated at runtime.

Use fallback pattern (try without --break-system-packages first, then with) so the install works on both old and new pip.

Wraps the test run subprocess in try/except so a hanging test branch doesn't crash the verifier and lose results from completed branches.
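
A minimal sketch of this timeout handling, assuming each test branch is launched with `subprocess.run`. The function names and the result dictionary shape are hypothetical, not taken from verify.py.

```python
import subprocess
import sys


def run_branch(cmd: list[str], timeout_s: float) -> dict:
    """Run one test branch; a hang is recorded as a timeout instead of propagating."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "returncode": None}
    status = "ok" if proc.returncode == 0 else "failed"
    return {"status": status, "returncode": proc.returncode}


def run_all(branches: dict[str, list[str]], timeout_s: float) -> dict[str, dict]:
    """Run every branch; results from completed branches survive a hanging one."""
    return {name: run_branch(cmd, timeout_s) for name, cmd in branches.items()}


# Illustrative branches: one exits immediately, one sleeps past the timeout.
results = run_all(
    {
        "fast": [sys.executable, "-c", "pass"],
        "hang": [sys.executable, "-c", "import time; time.sleep(30)"],
    },
    timeout_s=5.0,
)
```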

…nt.json

- Add [task] name field to generated task.toml (programbench/<instance_id>)
- Add adapter_metadata.json with structured benchmark metadata
- Add parity_experiment.json with results from 10 diverse tasks
  - 8/10 exact test count match, 2 minor variance (<0.5%)
  - Covers C, Rust, Go, C++, Java across easy/medium difficulties

The oracle checks out the original source code at the specified commit from the upstream repo — this is the gold answer for ProgramBench tasks. Each task now generates a solution/ directory with solve.sh.

Detailed tables covering directory structure, evaluation pipeline, field mappings, and what changes vs stays the same.

Add BenchFlow adapter for Harvey LAB: 1,251 legal tasks across 24 practice areas (M&A, insurance, IP, tax, real estate, etc.).

- benchflow.py: Translates Harvey LAB task.json → BenchFlow task format (task.toml, instruction.md, Dockerfile, LLM-as-judge verifier)
- evaluate.py: Gemini 3.1 Flash Lite judge grades deliverables against rubric criteria (PASS/FAIL per criterion, partial credit reward)
- parity_test.py: Structural + eval parity tests
  - Structural: 1251/1251 tasks pass (all files, metadata, criteria match)
  - Eval: 5/5 tasks pass (Gemini judge pipeline works end-to-end)
- run_harvey_lab.py + YAML config for running benchmarks
- Register harvey-lab in task_download.py for auto-download

The runner now:
1. Downloads raw Harvey LAB data via ensure_tasks()
2. Runs benchflow.py adapter to convert task.json → task.toml format
3. Writes converted tasks to .ref/harvey-lab-benchflow/
4. YAML config updated to use tasks_dir pointing to converted output

- Fix raw_dir.parent.parent → raw_dir.parent in run_harvey_lab.py (ensure_tasks returns .ref/harvey-lab/tasks, so one parent up is .ref/harvey-lab which is the correct harvey-root)
- Replace str.format() with str.replace() in evaluate.py's judge prompt to prevent crashes when agent output or criteria contain curly braces (common in legal documents)

- Replace sequential .replace() chain with string.Template.safe_substitute() in both benchflow.py (generated evaluate.py) and parity_test.py
- Prevents agent output containing literal placeholder strings from corrupting later substitutions
- Add side-by-side parity test mode (Harbor Step 5): runs original Harvey LAB prompt template vs adapted BenchFlow prompt through the same Gemini judge on identical agent output
- Results: 25/25 criteria agree (100% agreement rate) across 5 tasks
- Add parity_experiment.json with detailed per-criterion results
- Add adapter_metadata.json with benchmark metadata
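
To illustrate why a chain of `.replace()` calls is fragile, the sketch below renders the same prompt both ways. The prompt text and the `$criterion` / `$output` placeholder names are hypothetical; the actual judge prompt in evaluate.py is not shown in this PR.

```python
from string import Template

# Hypothetical judge prompt with two placeholders.
PROMPT = "Criterion: $criterion\nAgent output: $output\nReply with PASS or FAIL."

# Agent output that happens to contain a literal placeholder and curly braces.
agent_output = "Per $criterion, damages = {amount}."
criterion = "Cites clause 4.2"


def render_with_replace_chain(output: str, crit: str) -> str:
    # Text inserted by the first replace is rescanned by the second one.
    return PROMPT.replace("$output", output).replace("$criterion", crit)


def render_with_template(output: str, crit: str) -> str:
    # One pass over the template; substituted values are never rescanned.
    return Template(PROMPT).safe_substitute(output=output, criterion=crit)


chained = render_with_replace_chain(agent_output, criterion)
safe = render_with_template(agent_output, criterion)
```

In `chained`, the agent's own text is altered (its `$criterion` is replaced), while `safe` keeps the agent output verbatim, including the braces.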

- Remove all Harbor mentions from parity_test.py
- Rewrite README with BenchFlow-native adapter convention table
- Add step-by-step parity results table (all 9 steps documented)
- Add side-by-side parity breakdown by practice area
- Document BenchFlow adapter file structure

…luate.py The parity test's _ADAPTED_PROMPT had 8-space indentation that didn't match the actual generated evaluate.py (which goes through textwrap.dedent). Fixed to use no extra indentation. Re-ran side-by-side parity: still 25/25 (100% agreement).
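
The mismatch described above is easy to reproduce. In this sketch the prompt text is a hypothetical stand-in; only the 8-space indentation and the use of `textwrap.dedent` come from the commit message.

```python
import textwrap

# Prompt as it might be written inside an indented code generator
# (8 leading spaces on every line).
INDENTED_PROMPT = """\
        You are a legal grader.
        Answer PASS or FAIL.
"""

# What the generated file receives after dedenting.
generated = textwrap.dedent(INDENTED_PROMPT)

# What a test would compare against if it hard-coded the indented text.
hand_copied = INDENTED_PROMPT

prompts_match = generated == hand_copied
```

Here `prompts_match` is false, so a parity test using the indented copy would send the judge a different prompt than the generated evaluator does.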

- Rename adapter_metadata.json → benchmark_metadata.json
- Replace 'adapter' with 'converter' in code/docs/comments
- Update README title from 'Harvey LAB Adapter' to 'Harvey LAB'
- Rename _run_adapter → _run_converter, _ADAPTER → _CONVERTER
- Section renamed: 'Adapter Structure' → 'Directory Structure'
- Convention renamed: 'BenchFlow Adapter Convention' → 'BenchFlow Benchmark Convention'

- Dockerfile now uses :task (not :task_cleanroom) matching ProgramBench eval's environment, with workspace reset to cleanroom state.
- Anti-cheat hash check now runs BEFORE compile (matching ProgramBench eval order), preventing false positives on legitimately rebuilt executables.
- Updated README comparison tables to reflect image change.

…ata.json

Introduce benchmark.yaml as the standard benchmark descriptor for BenchFlow benchmarks. This replaces benchmark_metadata.json with a structured YAML format covering:
- name, description, url, author
- tasks (count, categories, tags)
- conversion (script, source format, oracle solutions)
- verification (method, judge model, reward type)
- parity (structural, eval pipeline, side-by-side results)

Job configs (how to run) remain in separate YAML files.

Shallow clone with --depth 1 always fetches HEAD, so the fallback block that checks out the specific commit never ran. Now always does full clone followed by git checkout at the task's commit.
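
The fixed sequence can be written down as two git invocations. The real solve.sh is a shell script that is not reproduced in this PR page, so the Python helper below (`oracle_checkout_commands`, a hypothetical name) only sketches the command order; the URL and commit in the example are illustrative.

```python
def oracle_checkout_commands(repo_url: str, commit: str, dest: str) -> list[list[str]]:
    """Return the git commands for a full clone followed by a checkout of one commit."""
    return [
        # Full clone: no --depth, so the task's commit is present locally.
        ["git", "clone", repo_url, dest],
        # Check out the exact commit the task was built from.
        ["git", "-C", dest, "checkout", commit],
    ]


# Illustrative values only.
commands = oracle_checkout_commands("https://example.com/repo.git", "5c082c6", "/tmp/src")
```

Each inner list can be passed to `subprocess.run(cmd, check=True)` in order.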

Fix skillsbench-claude-glm51.yaml pointing to stale .ref/ path. Update all docs, examples, notebooks, skills, and configs to use benchmarks/ paths. Only CHANGELOG.md retains .ref/ as historical.

Ran Harvey LAB's own harness (agent loop + 6 tools + system prompt) via DirectSandbox on 5 tasks in both original and BenchFlow-converted formats.

Results (aggregate across 4 evaluated tasks):
- Original: 64/261 (24.5%)
- BenchFlow: 74/261 (28.4%)
- Delta: +3.8% (within expected non-determinism range)

Bug fix: harness read tool was failing because parse-doc command (used to parse .docx/.xlsx/.pdf inside sandbox) wasn't available outside the Podman container. DirectSandbox now requires parse-doc in PATH.

Also updates parity config model to gemini-3.1-flash-lite-preview.
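
The aggregate rates can be reproduced from the raw counts. The sketch assumes the delta is a difference in percentage points computed from the unrounded rates, which is why it comes out as 3.8 rather than 28.4 − 24.5 = 3.9.

```python
TOTAL = 261            # denominator reported for both runs
original_passed = 64
benchflow_passed = 74

original_rate = original_passed / TOTAL
benchflow_rate = benchflow_passed / TOTAL

# Difference in percentage points, from the unrounded rates.
delta_points = (benchflow_rate - original_rate) * 100

summary = f"{original_rate:.1%} vs {benchflow_rate:.1%}, delta {delta_points:+.1f} pts"
# summary == "24.5% vs 28.4%, delta +3.8 pts"
```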

…o benchmarks/

- Accept main's two-field source pattern (source.repo + source.path)
- Accept main's .cache/datasets/ caching for cloned benchmarks
- Keep ProgramBench generated-benchmark support (_GENERATED_BENCHMARKS)
- Add harvey-lab to TASK_ALIASES for backward compatibility
- Delete TB2 YAML configs (user requested TB2 removal)
- Update skillsbench YAML to use source: pattern
- Keep ProgramBench YAML with tasks_dir: (generated benchmarks)
- All 811 tests pass, ruff + ty clean

… tasks_dir override

- _to_host_path: add separator check so /workspace/documents_backup doesn't match /workspace/documents
- Harvey LAB verifier: normalize verdict to .lower() (parity_test already did this)
- DirectSandbox.exec: use 'is not None' check so timeout=0 isn't treated as default
- run_harvey_lab.py: override Job tasks_dir with absolute converted path
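
Two of these fixes are classic edge cases: prefix matching on paths and falsy-zero defaults. The sketch below shows both under stated assumptions; `to_host_path`, `effective_timeout`, and `DEFAULT_TIMEOUT_S` are hypothetical stand-ins, not the signatures of `_to_host_path` or `DirectSandbox.exec`.

```python
from typing import Optional

DEFAULT_TIMEOUT_S = 60.0  # hypothetical default


def is_under(path: str, root: str) -> bool:
    """True if path equals root or lies inside it, matching whole path components."""
    root = root.rstrip("/")
    return path == root or path.startswith(root + "/")


def to_host_path(sandbox_path: str, sandbox_root: str, host_root: str) -> str:
    """Map a sandbox path under sandbox_root to host_root; leave other paths alone."""
    if not is_under(sandbox_path, sandbox_root):
        return sandbox_path
    suffix = sandbox_path[len(sandbox_root.rstrip("/")):]
    return host_root.rstrip("/") + suffix


def effective_timeout(timeout: Optional[float]) -> float:
    # `timeout or DEFAULT_TIMEOUT_S` would silently turn 0 into the default.
    return timeout if timeout is not None else DEFAULT_TIMEOUT_S
```

A bare `startswith("/workspace/documents")` would also match `/workspace/documents_backup`; appending the separator before comparing avoids that.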

Harvey LAB belongs in its own PR (#239). Removed:
- benchmarks/harvey-lab/ (converter, parity tests, configs, runner)
- src/benchflow/agents/harvey_lab_acp_shim.py
- harvey-lab-harness agent in registry.py
- harvey-lab alias in task_download.py TASK_ALIASES
- harvey-lab-benchflow/ from .gitignore
- Harvey LAB reference in CONVERT.md

…rambench-adapter

- Replace adapter_metadata.json with benchmark.yaml (matches Harvey LAB)
- Move YAML config + runner into benchmarks/programbench/ (matches Harvey LAB)
- Add 12 new allow_internet tests: shell-command verification for all agents (claude, gemini, opencode, openhands), idempotency, merge-with-existing, task.toml parsing through Harbor's Task model, explicit true/false/default
- Verify setup_cmd actually produces valid agent config files on disk

Previous commit accidentally replaced the comprehensive docs (Harvey LAB, SkillsBench, agents, backends, parity validation, YAML config reference) with ProgramBench-only content. This restores all existing content, removes TB2 references, and adds ProgramBench as a new section.

All examples now use bench eval create / bench run instead of raw Python scripts. Parity validation (developer workflow) is the only exception and uses uv run python explicitly.

devin-ai-integration

Devin Review found 1 new potential issue.

devin-ai-integration · 2026-05-15T22:13:57Z

+[environment]
+cpus = 2
+memory_mb = 4096
+storage_mb = 20480
+allow_internet = false


🔴 Oracle runs silently fail for ProgramBench tasks because allow_internet = false blocks network but solve.sh and verifier both require internet

ProgramBench tasks set allow_internet = false in the generated task.toml (benchmarks/programbench/benchflow.py:119). When running with the oracle agent (-a oracle), _disallow_web_tools is False (because of the cfg.primary_agent != "oracle" guard at src/benchflow/trial.py:518), so preserve_agent_network is also False (src/benchflow/trial.py:559). This means _create_environment at src/benchflow/_env_setup.py:463 does NOT override allow_internet back to True, and the Docker container truly has no network access.

However, the oracle's solve.sh requires internet to git clone the original repository (benchmarks/programbench/benchflow.py:464), and the verifier's verify.py requires internet to download test blobs from HuggingFace (benchmarks/programbench/benchflow.py:305-313). Both operations fail silently — solve.sh errors out, and _download_test_blob returns (0, 0) per branch — producing a reward of 0 instead of the expected score.

For LLM agent runs, preserve_agent_network=True coincidentally keeps the container network on, so both agent and verifier work. But for oracle runs, this workaround is bypassed. The docs explicitly recommend oracle verification at docs/running-benchmarks.md:148:

bench run benchmarks/programbench/tasks/abishekvashok__cmatrix.5c082c6 -a oracle -b docker

This command will silently produce reward 0.

Prompt for agents

The ProgramBench tasks set allow_internet = false in the generated task.toml, but both the oracle solve.sh (which does git clone) and the verifier verify.py (which downloads test blobs from HuggingFace) require internet access. For LLM agent runs, preserve_agent_network=True (set via _disallow_web_tools in src/benchflow/trial.py:559) coincidentally keeps the container network on, but for oracle runs _disallow_web_tools is False (because primary_agent == oracle at trial.py:518), so the container truly has no internet.

Possible fixes:
1. Remove allow_internet = false from the generated task.toml (simplest, but changes the task contract)
2. Pre-download HuggingFace test blobs into the Docker image at build time so the verifier works offline, and bundle the oracle source similarly
3. Ensure preserve_agent_network is also True for oracle runs when the verifier/oracle need internet (e.g. add a separate flag or always preserve network for oracle)
4. Add a task.toml field like verifier_needs_internet that the framework can use to keep network on during the verify phase regardless of allow_internet

The key files involved are:
- benchmarks/programbench/benchflow.py (_render_task_toml generates allow_internet=false)
- src/benchflow/trial.py (lines 516-559 compute _disallow_web_tools and preserve_agent_network)
- src/benchflow/_env_setup.py (lines 463-469 conditionally override allow_internet)

Good catch. For LLM agent runs, preserve_agent_network=True keeps the container network alive (needed for model API calls), which also lets the verifier download HuggingFace test blobs. For oracle runs, _disallow_web_tools is False so preserve_agent_network is also False, meaning the container has no network — causing both solve.sh (git clone) and verify.py (HF download) to fail silently.

In practice this hasn't been hit because our parity runs used the :task Docker image with full network (the allow_internet enforcement is at the agent-tool level, not Docker network level, when preserve_agent_network=True). But the Devin Review analysis is correct that oracle + allow_internet=false is a broken combination.

The cleanest fix would be option 3 — ensure preserve_agent_network is always True for oracle runs since the oracle doesn't have "web tools" to disable. This is a pre-existing framework behavior rather than a ProgramBench-specific bug, so I'd suggest addressing it in a separate PR. Noting it here for tracking.
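
The interaction described in this thread can be modelled in a few lines. This is a simplified sketch of the reported behavior, not the code in trial.py or _env_setup.py: `RunConfig`, `container_has_network`, and the `oracle_keeps_network` switch are hypothetical, and the exact condition under which `_disallow_web_tools` becomes true is assumed to be "task disables internet and the agent is not the oracle".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    primary_agent: str
    task_allow_internet: bool


def container_has_network(cfg: RunConfig, oracle_keeps_network: bool = False) -> bool:
    """Model whether the container ends up with network access."""
    # Assumed guard: web tools are only disabled for non-oracle agents.
    disallow_web_tools = (not cfg.task_allow_internet) and cfg.primary_agent != "oracle"
    # As described in the review, network preservation follows that flag.
    preserve_agent_network = disallow_web_tools
    # Option 3 from the review: also preserve the network for oracle runs.
    if oracle_keeps_network and cfg.primary_agent == "oracle":
        preserve_agent_network = True
    return cfg.task_allow_internet or preserve_agent_network
```

Under this model an LLM agent run with `allow_internet = false` keeps its network, an oracle run loses it, and enabling `oracle_keeps_network` restores it for the oracle.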

…benchmarks)

- ProgramBench + Harvey LAB YAML configs now use source.repo instead of tasks_dir
- Remove _GENERATED_BENCHMARKS and _ensure_generated() from task_download.py
- Add programbench + harvey-lab to TASK_ALIASES pointing at benchflow-ai/benchmarks
- Remove benchmarks/programbench/tasks/ from .gitignore
- Update docs YAML config reference to show consistent source pattern
- Replace TB2 test with programbench alias test

devin-ai-integration Bot assigned xdotli May 6, 2026

xdotli added 8 commits May 6, 2026 08:34

feat: add standalone parity test script for ProgramBench pipeline

3061190

fix: lint parity_test.py (unused import, simplify toggle)

0b3c60f

fix: correct Gemini model name to gemini-3.1-flash-lite-preview

d8acdba

rename: generate.py → benchflow.py and update all references

a36ba3f

fix: reduce environment resources to cpus=2, memory=4096MB

880ab89

refactor: move programbench tasks from .ref/ to benchmarks/programben…

c2c171f

fix: Dockerfile pip install compatibility with older pip versions

431c8a8

xdotli added 4 commits May 6, 2026 09:26

fix: handle TimeoutExpired in test branch execution

ea8c977

feat: add solution/solve.sh oracle for each task

9c1147f

docs: add format comparison tables (ProgramBench vs BenchFlow)

e15258e

xdotli added 9 commits May 6, 2026 18:49

fix: oracle solve.sh always does full clone + checkout

0b483be

xdotli and others added 4 commits May 6, 2026 23:23

chore: complete .ref/ → benchmarks/ migration across docs and configs

b1cdffe

merge: incorporate Harvey LAB benchmark from session 046003a8

24163c9

fix: harmonize harvey-lab merge — move to TASK_REPOS, migrate .ref/ t…

4c1740e

…o benchmarks/

devin-ai-integration Bot changed the title ~~feat: add ProgramBench integration~~ feat: add ProgramBench + Harvey LAB integrations, remove TB2, migrate .ref/ → benchmarks/ May 7, 2026

xdotli added 2 commits May 14, 2026 20:46

style: format task_download.py for ruff

40fc14a

xdotli added 2 commits May 14, 2026 21:17

fix: clean up stale temp dirs before clone in _ensure_generated

493e117

devin-ai-integration Bot changed the title ~~feat: add ProgramBench + Harvey LAB integrations, remove TB2, migrate .ref/ → benchmarks/~~ feat: add ProgramBench integration, remove TB2, migrate .ref/ → benchmarks/ May 14, 2026

xdotli and others added 11 commits May 15, 2026 02:30

merge: resolve conflicts with main (CONVERT.md, running-benchmarks.md)

8b250ca

Merge remote-tracking branch 'origin/main' into devin/1778055558-prog…

1801ec7

…rambench-adapter

style: ruff format test_internet_policy.py

7b914d4

ci: trigger re-run

ad7e462

ci: retry (cache issue)

29c3ec1

merge: pull latest main (metrics.py fix + AGENTS.md)

59773f4

style: format metrics.py after merge

6dcb1c3

docs: rewrite running-benchmarks to use bench CLI everywhere

3c90875

docs: clean up code blocks — one command per line, no inline comments

af1d3fb

xdotli merged commit 3f3960b into main May 15, 2026
3 checks passed

devin-ai-integration Bot mentioned this pull request May 15, 2026

refactor: kill shim layers, single Rollout execution path (ENG-46) #268

Merged

4 tasks

xdotli deleted the devin/1778055558-programbench-adapter branch May 17, 2026 05:51

xdotli merged 54 commits into main from devin/1778055558-programbench-adapter
