feat: add ProgramBench integration, remove TB2, migrate .ref/ → benchmarks/ #237

Merged: xdotli merged 54 commits into main from devin/1778055558-programbench-adapter on May 15, 2026

Conversation

@devin-ai-integration devin-ai-integration Bot commented May 6, 2026

Summary

Adds ProgramBench (201 program-reconstruction tasks across 7 languages) as a BenchFlow benchmark, removes TB2, migrates .ref/ → benchmarks/, and adds 12 tests verifying that allow_internet actually works.

Changes:

  1. ProgramBench adapter (benchmarks/programbench/): converter, CLI, runner, parity results, benchmark.yaml (standardized to match Harvey LAB format)
  2. TB2 removal: deleted run_tb2.py, all tb2-*.yaml configs, and terminal-bench-2 from TASK_REPOS
  3. .ref/ → benchmarks/ migration: all code, docs, YAML configs, notebooks, skills
  4. allow_internet test suite: 12 new tests verifying that shell commands produce valid agent configs on disk, plus idempotency, merge-with-existing, and Harbor Task model parsing

Agent parity results (same submission → both ProgramBench eval and BenchFlow verifier):

| Task | Lang | PB Reward | BF Reward | Delta |
|---|---|---|---|---|
| cmatrix | C | 0.3758 | 0.3758 | 0.0000 |
| zoxide | Rust | 1.0000 | 1.0000 | 0.0000 |
| shellharden | Rust | 0.9992 | 0.9992 | 0.0000 |
| ditaa | Java | 0.0088 | 0.0088 | 0.0000 |
| chroma | Go | 0.0000 | 0.0132 | 0.0132 |
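As a quick way to re-check the Delta column, the sketch below recomputes BF minus PB from the rewards listed above and reports any task that differs. The `tolerance` value and the function name are illustrative assumptions, not part of the PR.

```python
# Rewards copied from the parity table above: (task, language, PB reward, BF reward).
PARITY_ROWS = [
    ("cmatrix", "C", 0.3758, 0.3758),
    ("zoxide", "Rust", 1.0000, 1.0000),
    ("shellharden", "Rust", 0.9992, 0.9992),
    ("ditaa", "Java", 0.0088, 0.0088),
    ("chroma", "Go", 0.0000, 0.0132),
]


def parity_deltas(rows, tolerance=1e-4):
    """Return (task, delta) pairs whose |BF - PB| exceeds the tolerance."""
    mismatches = []
    for task, _lang, pb_reward, bf_reward in rows:
        delta = round(bf_reward - pb_reward, 4)
        if abs(delta) > tolerance:  # tolerance is an assumed threshold
            mismatches.append((task, delta))
    return mismatches


if __name__ == "__main__":
    for task, delta in parity_deltas(PARITY_ROWS):
        print(f"{task}: BF - PB = {delta:+.4f}")
```

With the numbers in the table, only chroma is reported.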

Review & Testing Checklist for Human

  • Spot-check a generated ProgramBench task dir structure against benchmarks/programbench/README.md format comparison tables
  • Verify benchmark.yaml fields are consistent with parity_experiment.json numbers
  • Run python -m pytest tests/test_internet_policy.py -v and confirm all 24 tests pass (12 new)
  • Confirm no remaining .ref/ references in code (only CHANGELOG.md historical entries should remain): rg '\.ref/' --glob '!CHANGELOG.md' src/ benchmarks/ docs/

Suggested test plan: Generate 2-3 ProgramBench tasks locally (python -m benchmarks.programbench.main --output-dir /tmp/pb-test --limit 3), run bench tasks check on them, and verify they have allow_internet = false in their task.toml.

Notes

  • ProgramBench Docker images are linux/amd64 only — requires x86_64 machine
  • The only Harbor adapter step that doesn't fully apply is oracle verification (running oracle through full eval) — oracle solve.sh exists and was spot-checked but not run through the full 201-task eval pipeline
  • 3 bugs were found and fixed during parity testing: wrong Docker base image, wrong anti-cheat ordering, shallow clone in oracle solve.sh

Link to Devin session: https://app.devin.ai/sessions/f3761955c99449d7a3e3c2380ed664da
Requested by: @xdotli



Generate BenchFlow task directories from ProgramBench's 200 program-
reconstruction instances.  Each task gives an agent a compiled binary and
its documentation; the agent must re-implement the program from scratch.

Files:
- benchmarks/programbench/generate.py — reads ProgramBench task.yaml +
  tests.json, emits task.toml / instruction.md / Dockerfile / test.sh /
  verify.py per instance
- benchmarks/programbench/main.py — CLI entry point for generation
- benchmarks/run_programbench.py — Job runner (mirrors run_skillsbench.py)
- benchmarks/programbench-gemini-flash-lite.yaml — default config
- src/benchflow/task_download.py — extended to support generated benchmarks;
  clones ProgramBench upstream, runs the generator, caches under .ref/

devin-ai-integration[bot]

This comment was marked as resolved.

xdotli added 8 commits May 6, 2026 08:34
- _ensure_generated now generates into a staging directory and renames
  atomically on success, preventing partial cache on failure
- verify.py wraps tar extraction in try/except so a corrupt archive
  for one branch doesn't crash the entire verifier
- Fix ruff format on task_download.py
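The staging-directory pattern from the first bullet above can be sketched as follows. This is not the actual `_ensure_generated` code: the function name, the `generate` callback, and the staging prefix are assumptions made for illustration.

```python
import os
import shutil
import tempfile
from pathlib import Path


def generate_atomically(cache_dir, generate):
    """Run generate(staging_dir) and publish the result to cache_dir only on success."""
    cache_dir = Path(cache_dir)
    if cache_dir.exists():
        return cache_dir  # already generated
    cache_dir.parent.mkdir(parents=True, exist_ok=True)
    # Stage next to the target so the rename stays on one filesystem.
    staging = Path(
        tempfile.mkdtemp(prefix=cache_dir.name + ".staging-", dir=cache_dir.parent)
    )
    try:
        generate(staging)
        # A rename within one filesystem is atomic on POSIX, so readers see
        # either no cache directory or a complete one.
        os.rename(staging, cache_dir)
    except BaseException:
        shutil.rmtree(staging, ignore_errors=True)  # never leave a partial cache
        raise
    return cache_dir
```

If `generate` raises, the staging directory is removed and the cache path is left absent, so the next call starts from scratch.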
Validates the full pipeline end-to-end: Docker image build, Gemini API
query, compilation, and verifier execution.  Uses a single-shot prompt
(not multi-turn agent), so 0% scores are expected on these hard tasks.

ProgramBench cleanroom images don't need 4 CPUs; reducing to 2 makes
the benchmark runnable on smaller machines.

…ch/tasks/

Generated tasks now live under benchmarks/ instead of .ref/ per project
convention. Added benchmarks/programbench/tasks/ to .gitignore since
these are generated at runtime.

Use fallback pattern (try without --break-system-packages first, then
with) so the install works on both old and new pip.
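The fallback described above most likely lives in a shell or Dockerfile step. Rendered in Python for illustration, it might look like the sketch below. The function name and the injectable `run` parameter are assumptions added so the logic can be exercised without installing anything.

```python
import subprocess
import sys


def pip_install(packages, run=subprocess.run):
    """Try a plain pip install first, then retry with --break-system-packages."""
    base = [sys.executable, "-m", "pip", "install", *packages]
    # Older pip rejects the unknown flag, while newer pip on an externally
    # managed environment refuses the plain install. Trying both covers each case.
    for extra in ([], ["--break-system-packages"]):
        result = run(base + extra)
        if result.returncode == 0:
            return base + extra  # the command that succeeded
    raise RuntimeError(f"pip install failed for {packages}")
```

Trying the plain form first keeps the flag out of the way on older pip versions that do not recognise it.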
devin-ai-integration[bot]

This comment was marked as resolved.

xdotli added 4 commits May 6, 2026 09:26
Wraps the test run subprocess in try/except so a hanging test branch
doesn't crash the verifier and lose results from completed branches.
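A minimal sketch of that guard, assuming the verifier runs one test command per branch. The function names, result dictionary, and default timeout are hypothetical; only `subprocess.run` with `timeout=` and `subprocess.TimeoutExpired` are standard library behaviour.

```python
import subprocess


def run_branch_tests(cmd, timeout_s):
    """Run one branch's test command and report a timeout instead of raising."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left hanging.
        return {"status": "timeout", "returncode": None}
    except OSError as exc:  # for example, the test binary is missing
        return {"status": f"error: {exc}", "returncode": None}
    return {"status": "completed", "returncode": proc.returncode}


def run_all_branches(branch_cmds, timeout_s=60.0):
    """Collect a result for every branch, even when some hang or fail to start."""
    return {name: run_branch_tests(cmd, timeout_s) for name, cmd in branch_cmds.items()}
```

Because each branch returns a result record, one hanging branch no longer discards the results of the branches that finished.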
…nt.json

- Add [task] name field to generated task.toml (programbench/<instance_id>)
- Add adapter_metadata.json with structured benchmark metadata
- Add parity_experiment.json with results from 10 diverse tasks
  - 8/10 exact test count match, 2 minor variance (<0.5%)
  - Covers C, Rust, Go, C++, Java across easy/medium difficulties

The oracle checks out the original source code at the specified commit
from the upstream repo; this is the gold answer for ProgramBench tasks.
Each task now generates a solution/ directory with solve.sh.

Detailed tables covering directory structure, evaluation pipeline,
field mappings, and what changes vs stays the same.

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 1 new potential issue.

View 13 additional findings in Devin Review.


xdotli added 9 commits May 6, 2026 18:49
Add BenchFlow adapter for Harvey LAB — 1,251 legal tasks across 24
practice areas (M&A, insurance, IP, tax, real estate, etc.).

- benchflow.py: Translates Harvey LAB task.json → BenchFlow task format
  (task.toml, instruction.md, Dockerfile, LLM-as-judge verifier)
- evaluate.py: Gemini 3.1 Flash Lite judge grades deliverables against
  rubric criteria (PASS/FAIL per criterion, partial credit reward)
- parity_test.py: Structural + eval parity tests
  - Structural: 1251/1251 tasks pass (all files, metadata, criteria match)
  - Eval: 5/5 tasks pass (Gemini judge pipeline works end-to-end)
- run_harvey_lab.py + YAML config for running benchmarks
- Register harvey-lab in task_download.py for auto-download

The runner now:
1. Downloads raw Harvey LAB data via ensure_tasks()
2. Runs benchflow.py adapter to convert task.json → task.toml format
3. Writes converted tasks to .ref/harvey-lab-benchflow/
4. YAML config updated to use tasks_dir pointing to converted output

- Fix raw_dir.parent.parent → raw_dir.parent in run_harvey_lab.py
  (ensure_tasks returns .ref/harvey-lab/tasks, so one parent up is
  .ref/harvey-lab which is the correct harvey-root)
- Replace str.format() with str.replace() in evaluate.py's judge
  prompt to prevent crashes when agent output or criteria contain
  curly braces (common in legal documents)
- Replace sequential .replace() chain with string.Template.safe_substitute()
  in both benchflow.py (generated evaluate.py) and parity_test.py
- Prevents agent output containing literal placeholder strings from
  corrupting later substitutions
- Add side-by-side parity test mode (Harbor Step 5): runs original
  Harvey LAB prompt template vs adapted BenchFlow prompt through the
  same Gemini judge on identical agent output
- Results: 25/25 criteria agree (100% agreement rate) across 5 tasks
- Add parity_experiment.json with detailed per-criterion results
- Add adapter_metadata.json with benchmark metadata
- Remove all Harbor mentions from parity_test.py
- Rewrite README with BenchFlow-native adapter convention table
- Add step-by-step parity results table (all 9 steps documented)
- Add side-by-side parity breakdown by practice area
- Document BenchFlow adapter file structure

…luate.py

The parity test's _ADAPTED_PROMPT had 8-space indentation that didn't
match the actual generated evaluate.py (which goes through
textwrap.dedent). Fixed to use no extra indentation.

Re-ran side-by-side parity: still 25/25 (100% agreement).
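The indentation mismatch comes from `textwrap.dedent`, which strips the whitespace common to all lines. A short sketch with a made-up prompt (the real `_ADAPTED_PROMPT` text is not shown in this PR) illustrates why an indented copy no longer matches the generated one:

```python
import textwrap

# Hypothetical prompt text, indented the way it might sit inside a generator function.
INDENTED_PROMPT = """
        You are grading a deliverable.
        Answer PASS or FAIL.
"""


def as_generated(prompt):
    """Mimic the generator: strip the common leading whitespace from every line."""
    return textwrap.dedent(prompt)


if __name__ == "__main__":
    generated = as_generated(INDENTED_PROMPT)
    print(generated == INDENTED_PROMPT)  # the two differ once indentation is stripped
    print(repr(generated.splitlines()[1]))
```

A parity test that compares against the indented constant is therefore testing a prompt the judge never sees.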
- Rename adapter_metadata.json → benchmark_metadata.json
- Replace 'adapter' with 'converter' in code/docs/comments
- Update README title from 'Harvey LAB Adapter' to 'Harvey LAB'
- Rename _run_adapter → _run_converter, _ADAPTER → _CONVERTER
- Section renamed: 'Adapter Structure' → 'Directory Structure'
- Convention renamed: 'BenchFlow Adapter Convention' → 'BenchFlow Benchmark Convention'

- Dockerfile now uses :task (not :task_cleanroom) matching ProgramBench
  eval's environment, with workspace reset to cleanroom state.
- Anti-cheat hash check now runs BEFORE compile (matching ProgramBench
  eval order), preventing false positives on legitimately rebuilt
  executables.
- Updated README comparison tables to reflect image change.

…ata.json

Introduce benchmark.yaml as the standard benchmark descriptor for
BenchFlow benchmarks. This replaces benchmark_metadata.json with a
structured YAML format covering:
- name, description, url, author
- tasks (count, categories, tags)
- conversion (script, source format, oracle solutions)
- verification (method, judge model, reward type)
- parity (structural, eval pipeline, side-by-side results)

Job configs (how to run) remain in separate YAML files.
devin-ai-integration[bot]

This comment was marked as resolved.

Shallow clone with --depth 1 always fetches HEAD, so the fallback
block that checks out the specific commit never ran. Now always
does full clone followed by git checkout at the task's commit.
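The oracle's solve.sh is a shell script. The Python sketch below only illustrates the command sequence described above, a full clone followed by a checkout of the pinned commit. The function names and arguments are hypothetical.

```python
import subprocess


def oracle_checkout_commands(repo_url, commit, dest):
    """Build the git commands for a full clone followed by a checkout of one commit."""
    return [
        # No --depth flag: a shallow clone only contains the branch tip, so an
        # older pinned commit would not be available to check out.
        ["git", "clone", repo_url, dest],
        ["git", "-C", dest, "checkout", commit],
    ]


def checkout_oracle_source(repo_url, commit, dest):
    """Run the commands, raising CalledProcessError if either step fails."""
    for cmd in oracle_checkout_commands(repo_url, commit, dest):
        subprocess.run(cmd, check=True)
```

Using `check=True` makes a failed clone or checkout visible immediately instead of leaving the workspace at the wrong revision.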
devin-ai-integration[bot]

This comment was marked as resolved.

xdotli and others added 4 commits May 6, 2026 23:23
Fix skillsbench-claude-glm51.yaml pointing to stale .ref/ path.
Update all docs, examples, notebooks, skills, and configs to use
benchmarks/ paths. Only CHANGELOG.md retains .ref/ as historical.

Ran Harvey LAB's own harness (agent loop + 6 tools + system prompt) via
DirectSandbox on 5 tasks in both original and BenchFlow-converted formats.

Results (aggregate across 4 evaluated tasks):
- Original: 64/261 (24.5%)
- BenchFlow: 74/261 (28.4%)
- Delta: +3.8 percentage points (within expected non-determinism range)

Bug fix: harness read tool was failing because parse-doc command
(used to parse .docx/.xlsx/.pdf inside sandbox) wasn't available
outside the Podman container. DirectSandbox now requires parse-doc
in PATH.

Also updates parity config model to gemini-3.1-flash-lite-preview.
@devin-ai-integration devin-ai-integration Bot changed the title from "feat: add ProgramBench integration" to "feat: add ProgramBench + Harvey LAB integrations, remove TB2, migrate .ref/ → benchmarks/" on May 7, 2026
devin-ai-integration[bot]

This comment was marked as resolved.

xdotli added 2 commits May 14, 2026 20:46
- Accept main's two-field source pattern (source.repo + source.path)
- Accept main's .cache/datasets/ caching for cloned benchmarks
- Keep ProgramBench generated-benchmark support (_GENERATED_BENCHMARKS)
- Add harvey-lab to TASK_ALIASES for backward compatibility
- Delete TB2 YAML configs (user requested TB2 removal)
- Update skillsbench YAML to use source: pattern
- Keep ProgramBench YAML with tasks_dir: (generated benchmarks)
- All 811 tests pass, ruff + ty clean
devin-ai-integration[bot]

This comment was marked as resolved.

… tasks_dir override

- _to_host_path: add separator check so /workspace/documents_backup doesn't match /workspace/documents
- Harvey LAB verifier: normalize verdict to .lower() (parity_test already did this)
- DirectSandbox.exec: use 'is not None' check so timeout=0 isn't treated as default
- run_harvey_lab.py: override Job tasks_dir with absolute converted path
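Two of the fixes above are small enough to sketch directly: matching paths by whole components rather than by string prefix, and testing `is not None` so that a timeout of 0 is honoured. The function names and the default value are assumptions; the real `_to_host_path` and `DirectSandbox.exec` are not shown in this PR.

```python
from pathlib import PurePosixPath


def is_within(path, root):
    """True if path equals root or lies below it, comparing whole path components."""
    path_parts = PurePosixPath(path).parts
    root_parts = PurePosixPath(root).parts
    return path_parts[: len(root_parts)] == root_parts


def effective_timeout(timeout, default=30.0):
    """Fall back to the default only when no timeout was given; 0 is a real value."""
    return timeout if timeout is not None else default


if __name__ == "__main__":
    root = "/workspace/documents"
    sibling = "/workspace/documents_backup/a.txt"
    print(sibling.startswith(root))   # naive prefix check wrongly matches
    print(is_within(sibling, root))   # component-wise check does not
    print(effective_timeout(0))       # stays 0 instead of becoming the default
```

A plain truthiness check (`timeout or default`) would silently replace 0 with the default, which is the bug the third bullet describes.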
devin-ai-integration[bot]

This comment was marked as resolved.

xdotli added 2 commits May 14, 2026 21:17
Harvey LAB belongs in its own PR (#239). Removed:
- benchmarks/harvey-lab/ (converter, parity tests, configs, runner)
- src/benchflow/agents/harvey_lab_acp_shim.py
- harvey-lab-harness agent in registry.py
- harvey-lab alias in task_download.py TASK_ALIASES
- harvey-lab-benchflow/ from .gitignore
- Harvey LAB reference in CONVERT.md
@devin-ai-integration devin-ai-integration Bot changed the title from "feat: add ProgramBench + Harvey LAB integrations, remove TB2, migrate .ref/ → benchmarks/" to "feat: add ProgramBench integration, remove TB2, migrate .ref/ → benchmarks/" on May 14, 2026
xdotli and others added 11 commits May 15, 2026 02:30
- Replace adapter_metadata.json with benchmark.yaml (matches Harvey LAB)
- Move YAML config + runner into benchmarks/programbench/ (matches Harvey LAB)
- Add 12 new allow_internet tests: shell-command verification for all agents
  (claude, gemini, opencode, openhands), idempotency, merge-with-existing,
  task.toml parsing through Harbor's Task model, explicit true/false/default
- Verify setup_cmd actually produces valid agent config files on disk

Previous commit accidentally replaced the comprehensive docs (Harvey LAB,
SkillsBench, agents, backends, parity validation, YAML config reference)
with ProgramBench-only content. This restores all existing content, removes
TB2 references, and adds ProgramBench as a new section.

All examples now use bench eval create / bench run instead of raw
Python scripts. Parity validation (developer workflow) is the only
exception and uses uv run python explicitly.

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 1 new potential issue.

View 32 additional findings in Devin Review.


Comment on lines +115 to +119
[environment]
cpus = 2
memory_mb = 4096
storage_mb = 20480
allow_internet = false

🔴 Oracle runs silently fail for ProgramBench tasks because allow_internet = false blocks network but solve.sh and verifier both require internet

ProgramBench tasks set allow_internet = false in the generated task.toml (benchmarks/programbench/benchflow.py:119). When running with the oracle agent (-a oracle), _disallow_web_tools is False (because of the cfg.primary_agent != "oracle" guard at src/benchflow/trial.py:518), so preserve_agent_network is also False (src/benchflow/trial.py:559). This means _create_environment at src/benchflow/_env_setup.py:463 does NOT override allow_internet back to True, and the Docker container truly has no network access.

However, the oracle's solve.sh requires internet to git clone the original repository (benchmarks/programbench/benchflow.py:464), and the verifier's verify.py requires internet to download test blobs from HuggingFace (benchmarks/programbench/benchflow.py:305-313). Both operations fail silently — solve.sh errors out, and _download_test_blob returns (0, 0) per branch — producing a reward of 0 instead of the expected score.

For LLM agent runs, preserve_agent_network=True coincidentally keeps the container network on, so both agent and verifier work. But for oracle runs, this workaround is bypassed. The docs explicitly recommend oracle verification at docs/running-benchmarks.md:148:

bench run benchmarks/programbench/tasks/abishekvashok__cmatrix.5c082c6 -a oracle -b docker

This command will silently produce reward 0.
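The decision chain described in this comment can be condensed into a simplified model. This is a paraphrase of the review's description, not the code in trial.py or _env_setup.py. The function name is hypothetical and the real logic may have further conditions.

```python
def container_has_network(allow_internet, primary_agent):
    """Simplified model of the decision chain described in the review comment."""
    # Web tools are only disabled for non-oracle agents on tasks that forbid internet.
    disallow_web_tools = (not allow_internet) and primary_agent != "oracle"
    # The container network is preserved exactly when web tools are being disabled.
    preserve_agent_network = disallow_web_tools
    # With the network preserved, the environment overrides allow_internet to True.
    return True if preserve_agent_network else allow_internet


if __name__ == "__main__":
    for agent in ("claude", "oracle"):
        print(agent, container_has_network(False, agent))
```

Under this model the oracle is the only agent for which `allow_internet = false` actually removes the container network, which is why solve.sh and the verifier downloads fail only in oracle runs.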

Prompt for agents
The ProgramBench tasks set allow_internet = false in the generated task.toml, but both the oracle solve.sh (which does git clone) and the verifier verify.py (which downloads test blobs from HuggingFace) require internet access. For LLM agent runs, preserve_agent_network=True (set via _disallow_web_tools in src/benchflow/trial.py:559) coincidentally keeps the container network on, but for oracle runs _disallow_web_tools is False (because primary_agent == oracle at trial.py:518), so the container truly has no internet.

Possible fixes:
1. Remove allow_internet = false from the generated task.toml (simplest, but changes the task contract)
2. Pre-download HuggingFace test blobs into the Docker image at build time so the verifier works offline, and bundle the oracle source similarly
3. Ensure preserve_agent_network is also True for oracle runs when the verifier/oracle need internet (e.g. add a separate flag or always preserve network for oracle)
4. Add a task.toml field like verifier_needs_internet that the framework can use to keep network on during the verify phase regardless of allow_internet

The key files involved are: benchmarks/programbench/benchflow.py (_render_task_toml generates allow_internet=false), src/benchflow/trial.py (lines 516-559 compute _disallow_web_tools and preserve_agent_network), src/benchflow/_env_setup.py (lines 463-469 conditionally override allow_internet).

Good catch. For LLM agent runs, preserve_agent_network=True keeps the container network alive (needed for model API calls), which also lets the verifier download HuggingFace test blobs. For oracle runs, _disallow_web_tools is False so preserve_agent_network is also False, meaning the container has no network — causing both solve.sh (git clone) and verify.py (HF download) to fail silently.

In practice this hasn't been hit because our parity runs used the :task Docker image with full network (the allow_internet enforcement is at the agent-tool level, not Docker network level, when preserve_agent_network=True). But the Devin Review analysis is correct that oracle + allow_internet=false is a broken combination.

The cleanest fix would be option 3 — ensure preserve_agent_network is always True for oracle runs since the oracle doesn't have "web tools" to disable. This is a pre-existing framework behavior rather than a ProgramBench-specific bug, so I'd suggest addressing it in a separate PR. Noting it here for tracking.

…benchmarks)

- ProgramBench + Harvey LAB YAML configs now use source.repo instead of tasks_dir
- Remove _GENERATED_BENCHMARKS and _ensure_generated() from task_download.py
- Add programbench + harvey-lab to TASK_ALIASES pointing at benchflow-ai/benchmarks
- Remove benchmarks/programbench/tasks/ from .gitignore
- Update docs YAML config reference to show consistent source pattern
- Replace TB2 test with programbench alias test
@xdotli xdotli merged commit 3f3960b into main May 15, 2026
3 checks passed
@xdotli xdotli deleted the devin/1778055558-programbench-adapter branch May 17, 2026 05:51