feat: add ProgramBench integration, remove TB2, migrate .ref/ → benchmarks/#237
Conversation
Generate BenchFlow task directories from ProgramBench's 200 program- reconstruction instances. Each task gives an agent a compiled binary and its documentation; the agent must re-implement the program from scratch. Files: - benchmarks/programbench/generate.py — reads ProgramBench task.yaml + tests.json, emits task.toml / instruction.md / Dockerfile / test.sh / verify.py per instance - benchmarks/programbench/main.py — CLI entry point for generation - benchmarks/run_programbench.py — Job runner (mirrors run_skillsbench.py) - benchmarks/programbench-gemini-flash-lite.yaml — default config - src/benchflow/task_download.py — extended to support generated benchmarks; clones ProgramBench upstream, runs the generator, caches under .ref/
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
- _ensure_generated now generates into a staging directory and renames atomically on success, preventing partial cache on failure - verify.py wraps tar extraction in try/except so a corrupt archive for one branch doesn't crash the entire verifier - Fix ruff format on task_download.py
Validates the full pipeline end-to-end: Docker image build, Gemini API query, compilation, and verifier execution. Uses a single-shot prompt (not multi-turn agent), so 0% scores are expected on these hard tasks.
ProgramBench cleanroom images don't need 4 CPUs — reducing to 2 makes the benchmark runnable on smaller machines.
…ch/tasks/ Generated tasks now live under benchmarks/ instead of .ref/ per project convention. Added benchmarks/programbench/tasks/ to .gitignore since these are generated at runtime.
Use fallback pattern (try without --break-system-packages first, then with) so the install works on both old and new pip.
Wraps the test run subprocess in try/except so a hanging test branch doesn't crash the verifier and lose results from completed branches.
…nt.json - Add [task] name field to generated task.toml (programbench/<instance_id>) - Add adapter_metadata.json with structured benchmark metadata - Add parity_experiment.json with results from 10 diverse tasks - 8/10 exact test count match, 2 minor variance (<0.5%) - Covers C, Rust, Go, C++, Java across easy/medium difficulties
The oracle checks out the original source code at the specified commit from the upstream repo — this is the gold answer for ProgramBench tasks. Each task now generates a solution/ directory with solve.sh.
Detailed tables covering directory structure, evaluation pipeline, field mappings, and what changes vs stays the same.
Add BenchFlow adapter for Harvey LAB — 1,251 legal tasks across 24 practice areas (M&A, insurance, IP, tax, real estate, etc.). - benchflow.py: Translates Harvey LAB task.json → BenchFlow task format (task.toml, instruction.md, Dockerfile, LLM-as-judge verifier) - evaluate.py: Gemini 3.1 Flash Lite judge grades deliverables against rubric criteria (PASS/FAIL per criterion, partial credit reward) - parity_test.py: Structural + eval parity tests - Structural: 1251/1251 tasks pass (all files, metadata, criteria match) - Eval: 5/5 tasks pass (Gemini judge pipeline works end-to-end) - run_harvey_lab.py + YAML config for running benchmarks - Register harvey-lab in task_download.py for auto-download
The runner now: 1. Downloads raw Harvey LAB data via ensure_tasks() 2. Runs benchflow.py adapter to convert task.json → task.toml format 3. Writes converted tasks to .ref/harvey-lab-benchflow/ 4. YAML config updated to use tasks_dir pointing to converted output
- Fix raw_dir.parent.parent → raw_dir.parent in run_harvey_lab.py (ensure_tasks returns .ref/harvey-lab/tasks, so one parent up is .ref/harvey-lab which is the correct harvey-root) - Replace str.format() with str.replace() in evaluate.py's judge prompt to prevent crashes when agent output or criteria contain curly braces (common in legal documents)
- Replace sequential .replace() chain with string.Template.safe_substitute() in both benchflow.py (generated evaluate.py) and parity_test.py - Prevents agent output containing literal placeholder strings from corrupting later substitutions - Add side-by-side parity test mode (Harbor Step 5): runs original Harvey LAB prompt template vs adapted BenchFlow prompt through the same Gemini judge on identical agent output - Results: 25/25 criteria agree (100% agreement rate) across 5 tasks - Add parity_experiment.json with detailed per-criterion results - Add adapter_metadata.json with benchmark metadata
- Remove all Harbor mentions from parity_test.py - Rewrite README with BenchFlow-native adapter convention table - Add step-by-step parity results table (all 9 steps documented) - Add side-by-side parity breakdown by practice area - Document BenchFlow adapter file structure
…luate.py The parity test's _ADAPTED_PROMPT had 8-space indentation that didn't match the actual generated evaluate.py (which goes through textwrap.dedent). Fixed to use no extra indentation. Re-ran side-by-side parity: still 25/25 (100% agreement).
- Rename adapter_metadata.json → benchmark_metadata.json - Replace 'adapter' with 'converter' in code/docs/comments - Update README title from 'Harvey LAB Adapter' to 'Harvey LAB' - Rename _run_adapter → _run_converter, _ADAPTER → _CONVERTER - Section renamed: 'Adapter Structure' → 'Directory Structure' - Convention renamed: 'BenchFlow Adapter Convention' → 'BenchFlow Benchmark Convention'
- Dockerfile now uses :task (not :task_cleanroom) matching ProgramBench eval's environment, with workspace reset to cleanroom state. - Anti-cheat hash check now runs BEFORE compile (matching ProgramBench eval order), preventing false positives on legitimately rebuilt executables. - Updated README comparison tables to reflect image change.
…ata.json Introduce benchmark.yaml as the standard benchmark descriptor for BenchFlow benchmarks. This replaces benchmark_metadata.json with a structured YAML format covering: - name, description, url, author - tasks (count, categories, tags) - conversion (script, source format, oracle solutions) - verification (method, judge model, reward type) - parity (structural, eval pipeline, side-by-side results) Job configs (how to run) remain in separate YAML files.
Shallow clone with --depth 1 always fetches HEAD, so the fallback block that checks out the specific commit never ran. Now always does full clone followed by git checkout at the task's commit.
Fix skillsbench-claude-glm51.yaml pointing to stale .ref/ path. Update all docs, examples, notebooks, skills, and configs to use benchmarks/ paths. Only CHANGELOG.md retains .ref/ as historical.
Ran Harvey LAB's own harness (agent loop + 6 tools + system prompt) via DirectSandbox on 5 tasks in both original and BenchFlow-converted formats. Results (aggregate across 4 evaluated tasks): - Original: 64/261 (24.5%) - BenchFlow: 74/261 (28.4%) - Delta: +3.8% (within expected non-determinism range) Bug fix: harness read tool was failing because parse-doc command (used to parse .docx/.xlsx/.pdf inside sandbox) wasn't available outside the Podman container. DirectSandbox now requires parse-doc in PATH. Also updates parity config model to gemini-3.1-flash-lite-preview.
- Accept main's two-field source pattern (source.repo + source.path) - Accept main's .cache/datasets/ caching for cloned benchmarks - Keep ProgramBench generated-benchmark support (_GENERATED_BENCHMARKS) - Add harvey-lab to TASK_ALIASES for backward compatibility - Delete TB2 YAML configs (user requested TB2 removal) - Update skillsbench YAML to use source: pattern - Keep ProgramBench YAML with tasks_dir: (generated benchmarks) - All 811 tests pass, ruff + ty clean
… tasks_dir override - _to_host_path: add separator check so /workspace/documents_backup doesn't match /workspace/documents - Harvey LAB verifier: normalize verdict to .lower() (parity_test already did this) - DirectSandbox.exec: use 'is not None' check so timeout=0 isn't treated as default - run_harvey_lab.py: override Job tasks_dir with absolute converted path
Harvey LAB belongs in its own PR (#239). Removed: - benchmarks/harvey-lab/ (converter, parity tests, configs, runner) - src/benchflow/agents/harvey_lab_acp_shim.py - harvey-lab-harness agent in registry.py - harvey-lab alias in task_download.py TASK_ALIASES - harvey-lab-benchflow/ from .gitignore - Harvey LAB reference in CONVERT.md
- Replace adapter_metadata.json with benchmark.yaml (matches Harvey LAB) - Move YAML config + runner into benchmarks/programbench/ (matches Harvey LAB) - Add 12 new allow_internet tests: shell-command verification for all agents (claude, gemini, opencode, openhands), idempotency, merge-with-existing, task.toml parsing through Harbor's Task model, explicit true/false/default - Verify setup_cmd actually produces valid agent config files on disk
Previous commit accidentally replaced the comprehensive docs (Harvey LAB, SkillsBench, agents, backends, parity validation, YAML config reference) with ProgramBench-only content. This restores all existing content, removes TB2 references, and adds ProgramBench as a new section.
All examples now use bench eval create / bench run instead of raw Python scripts. Parity validation (developer workflow) is the only exception and uses uv run python explicitly.
| [environment] | ||
| cpus = 2 | ||
| memory_mb = 4096 | ||
| storage_mb = 20480 | ||
| allow_internet = false |
There was a problem hiding this comment.
🔴 Oracle runs silently fail for ProgramBench tasks because allow_internet = false blocks network but solve.sh and verifier both require internet
ProgramBench tasks set allow_internet = false in the generated task.toml (benchmarks/programbench/benchflow.py:119). When running with the oracle agent (-a oracle), _disallow_web_tools is False (because of the cfg.primary_agent != "oracle" guard at src/benchflow/trial.py:518), so preserve_agent_network is also False (src/benchflow/trial.py:559). This means _create_environment at src/benchflow/_env_setup.py:463 does NOT override allow_internet back to True, and the Docker container truly has no network access.
However, the oracle's solve.sh requires internet to git clone the original repository (benchmarks/programbench/benchflow.py:464), and the verifier's verify.py requires internet to download test blobs from HuggingFace (benchmarks/programbench/benchflow.py:305-313). Both operations fail silently — solve.sh errors out, and _download_test_blob returns (0, 0) per branch — producing a reward of 0 instead of the expected score.
For LLM agent runs, preserve_agent_network=True coincidentally keeps the container network on, so both agent and verifier work. But for oracle runs, this workaround is bypassed. The docs explicitly recommend oracle verification at docs/running-benchmarks.md:148:
bench run benchmarks/programbench/tasks/abishekvashok__cmatrix.5c082c6 -a oracle -b docker
This command will silently produce reward 0.
Prompt for agents
The ProgramBench tasks set allow_internet = false in the generated task.toml, but both the oracle solve.sh (which does git clone) and the verifier verify.py (which downloads test blobs from HuggingFace) require internet access. For LLM agent runs, preserve_agent_network=True (set via _disallow_web_tools in src/benchflow/trial.py:559) coincidentally keeps the container network on, but for oracle runs _disallow_web_tools is False (because primary_agent == oracle at trial.py:518), so the container truly has no internet.
Possible fixes:
1. Remove allow_internet = false from the generated task.toml (simplest, but changes the task contract)
2. Pre-download HuggingFace test blobs into the Docker image at build time so the verifier works offline, and bundle the oracle source similarly
3. Ensure preserve_agent_network is also True for oracle runs when the verifier/oracle need internet (e.g. add a separate flag or always preserve network for oracle)
4. Add a task.toml field like verifier_needs_internet that the framework can use to keep network on during the verify phase regardless of allow_internet
The key files involved are: benchmarks/programbench/benchflow.py (_render_task_toml generates allow_internet=false), src/benchflow/trial.py (lines 516-559 compute _disallow_web_tools and preserve_agent_network), src/benchflow/_env_setup.py (lines 463-469 conditionally override allow_internet).
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
Good catch. For LLM agent runs, preserve_agent_network=True keeps the container network alive (needed for model API calls), which also lets the verifier download HuggingFace test blobs. For oracle runs, _disallow_web_tools is False so preserve_agent_network is also False, meaning the container has no network — causing both solve.sh (git clone) and verify.py (HF download) to fail silently.
In practice this hasn't been hit because our parity runs used the :task Docker image with full network (the allow_internet enforcement is at the agent-tool level, not Docker network level, when preserve_agent_network=True). But the Devin Review analysis is correct that oracle + allow_internet=false is a broken combination.
The cleanest fix would be option 3 — ensure preserve_agent_network is always True for oracle runs since the oracle doesn't have "web tools" to disable. This is a pre-existing framework behavior rather than a ProgramBench-specific bug, so I'd suggest addressing it in a separate PR. Noting it here for tracking.
…benchmarks) - ProgramBench + Harvey LAB YAML configs now use source.repo instead of tasks_dir - Remove _GENERATED_BENCHMARKS and _ensure_generated() from task_download.py - Add programbench + harvey-lab to TASK_ALIASES pointing at benchflow-ai/benchmarks - Remove benchmarks/programbench/tasks/ from .gitignore - Update docs YAML config reference to show consistent source pattern - Replace TB2 test with programbench alias test
Summary
Adds ProgramBench (201 program-reconstruction tasks across 7 languages) as a BenchFlow benchmark, removes TB2, migrates
.ref/→benchmarks/, and adds 12 tests verifyingallow_internetactually works.Changes:
benchmarks/programbench/) — converter, CLI, runner, parity results, benchmark.yaml (standardized to match Harvey LAB format)run_tb2.py, alltb2-*.yamlconfigs,terminal-bench-2fromTASK_REPOS.ref/→benchmarks/migration — all code, docs, YAML configs, notebooks, skillsallow_internettest suite — 12 new tests verifying shell commands produce valid agent configs on disk, idempotency, merge-with-existing, and Harbor Task model parsingAgent parity results (same submission → both ProgramBench eval and BenchFlow verifier):
Review & Testing Checklist for Human
benchmarks/programbench/README.mdformat comparison tablesbenchmark.yamlfields are consistent withparity_experiment.jsonnumberspython -m pytest tests/test_internet_policy.py -vand confirm all 24 tests pass (12 new).ref/references in code (onlyCHANGELOG.mdhistorical entries should remain):rg '\.ref/' --glob '!CHANGELOG.md' src/ benchmarks/ docs/Suggested test plan: Generate 2-3 ProgramBench tasks locally (
python -m benchmarks.programbench.main --output-dir /tmp/pb-test --limit 3), runbench tasks checkon them, and verify they haveallow_internet = falsein theirtask.toml.Notes
solve.shexists and was spot-checked but not run through the full 201-task eval pipelineLink to Devin session: https://app.devin.ai/sessions/f3761955c99449d7a3e3c2380ed664da
Requested by: @xdotli