refactor: adopt two-field source pattern for dataset sourcing by devin-ai-integration[bot] · Pull Request #252 · benchflow-ai/benchflow

devin-ai-integration · 2026-05-14T02:00:49Z

Summary

Adopts a two-field source.repo + source.path pattern for dataset sourcing (inspired by Vercel's gitRepository.repo + rootDirectory), replacing both the old .ref/ directory and the ambiguous single-string format.

YAML config format

source:
  repo: benchflow-ai/skillsbench   # GitHub org/repo
  path: tasks                       # optional subpath within repo
  ref: main                         # optional branch/tag

CLI flags (mirrors YAML)

bench run --source-repo benchflow-ai/skillsbench --source-path tasks/regex-log
bench eval create --source-repo benchflow-ai/skillsbench --source-path tasks -c 64

-t/--tasks-dir remains for local paths only — no magic resolution.

Why two fields?

No ambiguity about where org/repo ends and path begins
Monorepo-friendly (one repo holds many benchmarks, each config picks a different path)
CLI ↔ YAML consistency (--source-repo = source.repo, --source-path = source.path)
Extensible (easy to add ref, type, sha fields later)
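To make the "no ambiguity" point concrete, here is a minimal sketch of what a two-field source record could look like. `SourceSpec` and its `clone_url` property are hypothetical names for illustration; the PR's actual `Source` dataclass in `task_download.py` is not shown on this page and may differ.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SourceSpec:
    """Hypothetical stand-in for the PR's Source dataclass (not the real one)."""

    repo: str                # GitHub "org/repo"
    path: str | None = None  # optional subpath within the repo
    ref: str | None = None   # optional branch or tag

    def __post_init__(self) -> None:
        # With two fields, repo must be exactly "org/repo"; anything deeper
        # belongs in `path`, so there is nothing to guess when parsing.
        parts = self.repo.split("/")
        if len(parts) != 2 or not all(parts):
            raise ValueError(f"repo must look like 'org/repo', got {self.repo!r}")
        if self.path is not None and self.path.startswith("/"):
            raise ValueError("path must be relative to the repo root")

    @property
    def clone_url(self) -> str:
        # Assumption: sources are GitHub repositories cloned over HTTPS.
        return f"https://github.com/{self.repo}.git"


spec = SourceSpec(repo="benchflow-ai/skillsbench", path="tasks", ref="main")
print(spec.clone_url)
```

A single string such as `benchflow-ai/skillsbench/tasks` would be rejected here, which is the kind of mistake the two-field layout is meant to surface early.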

Changes

| Area | What |
| --- | --- |
| `src/benchflow/task_download.py` | New `Source` dataclass + `resolve_source(repo, path, ref)` |
| `src/benchflow/job.py` | Parses `source:` block (backward compat with `tasks_dir:`) |
| `src/benchflow/cli/main.py` | New `--source-repo`, `--source-path`, `--source-ref` flags on `bench run` and `bench eval create` |
| `benchmarks/*.yaml` | All 4 configs migrated to `source:` format |
| Docs | README, getting-started, use-cases, cli ref, examples: all use `benchflow-ai/skillsbench` |
| Tests | 7 tests covering new two-field API |

Backward compatibility preserved: tasks_dir: local/path and tasks_dir: alias-name still work in YAML; -t still works for local paths in CLI.
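The fallback from `source:` to `tasks_dir:` could be sketched roughly as follows. This operates on an already-parsed YAML mapping; `pick_task_source` is a hypothetical helper, and the rule that `source:` wins when both keys are present is an assumption, not something the PR states.

```python
from __future__ import annotations

from typing import Any


def pick_task_source(config: dict[str, Any]) -> tuple[str, Any]:
    """Return ("source", {...}) for the new block, or ("tasks_dir", value) for the legacy key."""
    source = config.get("source")
    if source is not None:
        # Assumption: the new block takes precedence over tasks_dir.
        if not isinstance(source, dict) or "repo" not in source:
            raise ValueError("source: must be a mapping with at least a 'repo' key")
        return "source", {
            "repo": source["repo"],
            "path": source.get("path"),  # optional subpath
            "ref": source.get("ref"),    # optional branch/tag
        }
    if "tasks_dir" in config:
        # Legacy form: a local path or an alias name; resolving it is left to the caller.
        return "tasks_dir", config["tasks_dir"]
    raise ValueError("config needs either a source: block or tasks_dir:")


new_style = {"source": {"repo": "benchflow-ai/skillsbench", "path": "tasks", "ref": "main"}}
old_style = {"tasks_dir": "local/path"}
print(pick_task_source(new_style))
print(pick_task_source(old_style))
```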

Review & Testing Checklist for Human

Run bench eval create -f benchmarks/skillsbench-claude-glm51.yaml — verify it resolves benchflow-ai/skillsbench with path: tasks + ref: main
Run bench eval create --source-repo benchflow-ai/skillsbench --source-path tasks --help — verify new flags appear
Run bench eval create -t ./some-local-dir — verify local paths still work without --source-repo

Notes

808/808 tests pass, ruff/ty clean
All doc examples now use benchflow-ai/skillsbench (not harbor-framework)
.cache/datasets/ is gitignored — repos are cloned on first use and cached locally
Old resolve_ref and is_remote_ref functions removed — resolve_source is the single entry point
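A rough sketch of the clone-on-first-use caching described in these notes, including the stale `clone_tmp` cleanup that a later commit in this PR mentions. The helper name `ensure_cached`, the `org__repo` directory naming, and the injected `clone` callable are all assumptions for illustration; the real layout under `.cache/datasets/` is not shown on this page.

```python
from __future__ import annotations

import shutil
from pathlib import Path
from typing import Callable


def ensure_cached(cache_root: Path, repo: str, clone: Callable[[str, Path], None]) -> Path:
    """Return the cached checkout for repo, cloning it only when it is missing."""
    dest = cache_root / repo.replace("/", "__")  # assumed naming scheme
    if dest.exists():
        return dest  # cache hit: no network access

    # Clone into a temporary sibling first so an interrupted clone never
    # leaves a half-populated directory at the final location.
    tmp = dest.with_name(dest.name + ".clone_tmp")
    if tmp.exists():
        shutil.rmtree(tmp)  # leftover from a previously interrupted clone
    tmp.parent.mkdir(parents=True, exist_ok=True)
    clone(repo, tmp)  # the callable is expected to create and fill tmp
    tmp.rename(dest)
    return dest
```

Passing the clone step in as a callable keeps the caching logic testable without touching the network; in practice it would wrap a `git clone` invocation.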

Link to Devin session: https://app.devin.ai/sessions/046003a8ff3843e3ad272bd3af9c7726
Requested by: @xdotli

Benchmark datasets now live in benchflow-ai/benchmarks. The local .ref/ directory was a confusing hidden name for downloaded task repos. Rename to datasets/ for clarity and alignment with the external dataset repo.

Changes:
- task_download.py: clone into datasets/ instead of .ref/
- .gitignore: ignore datasets/ instead of .ref/
- All YAML configs, docs, experiments, skills updated
- Tests updated to match new path

devin-ai-integration · 2026-05-14T02:00:51Z

devin-ai-integration

Devin Review found 2 potential issues.

⚠️ 2 issues in files not directly in the diff

⚠️ Incomplete .ref/ → datasets/ rename in scene-patterns.ipynb — executable code uses old path (`docs/examples/scene-patterns.ipynb:49`)

The PR's intent is a mechanical rename of .ref/ to datasets/ across the entire codebase. docs/examples/scene-patterns.ipynb was missed. It contains executable code at line 49 (TASK = Path(".ref/terminal-bench-2/regex-log")) that will resolve to a non-existent directory after this rename lands, causing the notebook to fail when users run it. The markdown description at line 24 and cached output at line 36 also reference the old .ref/ path.

⚠️ Incomplete .ref/ → datasets/ rename in swebench_pro_progressive_disclosure.ipynb (`docs/examples/swebench_pro_progressive_disclosure.ipynb:318`)

The PR's mechanical .ref/ → datasets/ rename missed docs/examples/swebench_pro_progressive_disclosure.ipynb. Line 318 contains # .ref/swebenchpro/instance_qutebrowser__.../task.toml in a markdown TOML comment block. While this is only a comment (not executable code), it is inconsistent with the rest of the rename and will confuse readers who follow the notebook.

devin-ai-integration · 2026-05-14T02:03:50Z

Fixed both in 9680949 — updated .ref/ → datasets/ in scene-patterns.ipynb and swebench_pro_progressive_disclosure.ipynb.

- Remove benchmarks/run_skillsbench.py and benchmarks/run_tb2.py (redundant with 'bench eval create -f <yaml>')
- Add auto-download to Job.from_yaml: if tasks_dir under datasets/ doesn't exist and matches a known benchmark, fetch it automatically
- Update docs to show CLI commands as primary interface

The benchflow:// protocol is not wired into the active CLI (bench run and bench eval create in main.py). Use datasets/ paths which work today.

…urcing

YAML configs now use:

source:
  repo: org/repo
  path: optional/subpath   # optional
  ref: branch-or-tag       # optional

instead of the single-string tasks_dir field. The two-field pattern (inspired by Vercel's gitRepository.repo + rootDirectory) provides:
- Explicit separation of concerns (which repo vs which subpath)
- Monorepo-friendly design
- Extensibility (easy to add ref, type, sha fields later)
- No ambiguity in parsing

Backward compat: tasks_dir: still works for local paths and aliases. Also introduces Source dataclass in task_download.py for type-safe usage.

Devin Review correctly identified that the CLI treated -t as a plain Path, so org/repo/path arguments would fail with file-not-found. Added _resolve_task_path() helper that:
1. Checks if path exists locally (fast path)
2. If not, parses as org/repo/path and calls resolve_source()

Also fixed scene-patterns.ipynb to use resolve_source instead of bare Path() with org/repo string.
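The local-first lookup this commit describes might look something like the sketch below. The function names and the injected `resolve_remote` callable are hypothetical stand-ins (the real `_resolve_task_path()` and `resolve_source()` bodies are not shown here). A later commit in this PR restricts `-t` to local paths and moves remote resolution to the explicit `--source-*` flags.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable


def split_remote_spec(spec: str) -> tuple[str, str | None]:
    """Split 'org/repo/sub/path' into ('org/repo', 'sub/path' or None)."""
    parts = [p for p in spec.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected at least 'org/repo', got {spec!r}")
    repo = "/".join(parts[:2])
    subpath = "/".join(parts[2:]) or None
    return repo, subpath


def resolve_task_path(arg: str, resolve_remote: Callable[[str, str | None], Path]) -> Path:
    local = Path(arg)
    if local.exists():
        return local  # fast path: a real local directory always wins
    # Otherwise guess that the first two segments are org/repo. This guess is
    # exactly the ambiguity the two-field source pattern removes.
    repo, subpath = split_remote_spec(arg)
    return resolve_remote(repo, subpath)
```

Note that a local directory which happens to share a name with an `org/repo/path` string silently shadows the remote, which is one reason to prefer explicit flags.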

bench run and bench eval create now support explicit source flags that mirror the YAML config source: block:

bench run --source-repo benchflow-ai/skillsbench --source-path tasks/regex-log
bench eval create --source-repo benchflow-ai/skillsbench --source-path tasks

-t/--tasks-dir remains for local paths only (no magic resolution). All docs updated to use benchflow-ai/skillsbench as the primary example.

Previously, if a repo was cached at a different branch, a call with a new ref would silently return the stale clone. Now on cache hit we compare the current HEAD to the requested ref and fetch+checkout if they differ.
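One way to picture the cache-hit check is the sketch below, which compares the cached checkout's HEAD with the requested ref and refetches on a mismatch. The helper names are hypothetical and the comparison rule is an assumption: it treats a matching branch name or commit prefix as fresh, and errs toward refetching otherwise.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def git(repo_dir: Path, *args: str) -> str:
    """Run a git command inside repo_dir and return its trimmed stdout."""
    result = subprocess.run(
        ["git", "-C", str(repo_dir), *args],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def is_stale(head_branch: str, head_sha: str, requested_ref: str | None) -> bool:
    """Decide whether the cached checkout needs a fetch + checkout."""
    if requested_ref is None:
        return False  # caller did not pin a ref; reuse whatever is cached
    if requested_ref == head_branch:
        return False  # already on the requested branch (may still lag the remote)
    if len(requested_ref) >= 7 and head_sha.startswith(requested_ref):
        return False  # requested ref is a prefix of the current commit
    return True


def ensure_ref(repo_dir: Path, requested_ref: str | None) -> None:
    head_branch = git(repo_dir, "rev-parse", "--abbrev-ref", "HEAD")  # "HEAD" when detached
    head_sha = git(repo_dir, "rev-parse", "HEAD")
    if requested_ref is not None and is_stale(head_branch, head_sha, requested_ref):
        git(repo_dir, "fetch", "origin", requested_ref)  # branch or tag name
        git(repo_dir, "checkout", "FETCH_HEAD")
```

Keeping the decision in a pure function separates the policy from the git calls, so the policy can be unit-tested without a repository.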

Experiment scripts previously called resolve_source() at module level, triggering git clone on import. Now uses lazy get_*() helpers so network calls only happen when main() actually runs. Also cleans stale clone_tmp directory before git clone to handle interrupted clones in a single retry.
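The move from import-time resolution to lazy helpers can be illustrated as below. `fake_resolve_source` is a stand-in that only records its calls (the real `resolve_source()` would clone over the network), and `get_skillsbench_tasks` plus the returned path layout are hypothetical examples of the `get_*()` pattern.

```python
from __future__ import annotations

from functools import lru_cache
from pathlib import Path

CALLS: list[tuple[str, str | None, str | None]] = []


def fake_resolve_source(repo: str, path: str | None = None, ref: str | None = None) -> Path:
    """Stand-in for resolve_source(): records the call instead of cloning."""
    CALLS.append((repo, path, ref))
    return Path(".cache/datasets") / repo / (path or "")  # assumed layout


# Before: a module-level constant would trigger the clone as soon as the
# script is imported, e.g.
#   TASKS = fake_resolve_source("benchflow-ai/skillsbench", "tasks")

@lru_cache(maxsize=None)
def get_skillsbench_tasks() -> Path:
    """Resolve on first use only; later calls reuse the cached result."""
    return fake_resolve_source("benchflow-ai/skillsbench", "tasks")


def main() -> Path:
    # The network-touching call happens here, not at import time.
    return get_skillsbench_tasks()
```

Importing this module performs no resolution at all; the first `main()` call does the work and repeated calls reuse the result.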

regex-log doesn't exist in benchflow-ai/skillsbench — it was a terminal-bench-2 task. All doc examples now reference edit-pdf which actually exists and resolves correctly.

…hardcoded)

devin-ai-integration

Devin Review found 1 new potential issue.

devin-ai-integration · 2026-05-14T20:45:52Z

    model_name: anthropic/claude-haiku-4-5-20251001
 datasets:
-  - path: .ref/terminal-bench-2
+  - path: harbor-framework/terminal-bench-2


🟡 Harbor-compatible YAML docs updated with remote repo paths, but _from_harbor_yaml code not updated to resolve them

The PR updated the Harbor-compatible YAML examples in multiple SKILL.md files from path: .ref/terminal-bench-2 to path: harbor-framework/terminal-bench-2. However, _from_harbor_yaml at src/benchflow/job.py:368 still treats this value as a local filesystem path via Path(datasets[0].get("path", "tasks")). Unlike _from_native_yaml which was updated to handle the new source: block and resolve_source(), the Harbor parser will construct Path("harbor-framework/terminal-bench-2") which doesn't exist on disk. Anyone following the documented Harbor-compatible YAML format will hit a downstream error when the non-existent path is used as a tasks directory. The SKILL.md files serve as instructions for AI agents, making this likely to generate broken configs.

Prompt for agents

The Harbor-compatible YAML example in three SKILL.md files (.claude/skills/benchflow/SKILL.md, .claude/skills/benchflow/tasks/benchflow-knowledge/environment/benchflow/SKILL.md, and .claude/skills/benchflow/tasks/create-simple-task/environment/benchflow/SKILL.md) was updated to show `path: harbor-framework/terminal-bench-2` in the datasets section. However, the _from_harbor_yaml method in src/benchflow/job.py (line 368) still does `tasks_dir = Path(datasets[0].get('path', 'tasks'))` which treats the path as a local filesystem path. There are two approaches to fix this: (1) Revert the SKILL.md Harbor YAML examples to use a local path format or note that Harbor-compatible YAML still requires local paths, or (2) Update _from_harbor_yaml to detect org/repo-style paths in datasets[].path and call resolve_source() to clone them, similar to how _from_native_yaml handles the new source: block.

- Accept main's two-field source pattern (source.repo + source.path)
- Accept main's .cache/datasets/ caching for cloned benchmarks
- Keep ProgramBench generated-benchmark support (_GENERATED_BENCHMARKS)
- Add harvey-lab to TASK_ALIASES for backward compatibility
- Delete TB2 YAML configs (user requested TB2 removal)
- Update skillsbench YAML to use source: pattern
- Keep ProgramBench YAML with tasks_dir: (generated benchmarks)
- All 811 tests pass, ruff + ty clean

…tent

- concepts.md: restore edit-pdf task name (not regex-log)
- use-cases.md: restore mcp_reviewer_hook + pre_agent_hooks pattern
- use-cases.md: restore service hooks explicit warning + Migration from Harbor
- python-api.md: restore User-Driven Loops section (was replaced with stale 0.3 Limitations)
- task-authoring.md: restore bench run for single-task examples + skills example
- task-authoring.md: restore full security note

* docs: sync all docs from www.benchflow.ai (source of latest updates)

Syncs benchflow/docs/ with the website docs which were updated in www.benchflow.ai PRs #7 and #9 to match benchflow 0.3.x patterns. Key changes across all files:
- .ref/ local paths → --source-repo remote pattern
- Harbor-specific references → generic sandbox language
- Added Modal as environment backend
- Relative links normalized to ./X.md format
- use-cases.md: source: {repo:...} YAML format
- reference/cli.md: --source-repo, --exclude flags
- progressive-disclosure.md: resolve_source() instead of .ref/ paths

* docs: fix review findings (backend→environment, add missing Path imports)

* docs: fix broken ../examples/ links in progressive-disclosure.md

examples/ is at docs/examples/, not repo root. Changed ../examples/ → ./examples/.

* docs: restore opencode/openhands agents, fix auto-start services claim
- Restore opencode and openhands to Registered Agents table (both still in registry)
- Revert incorrect auto-start services claim to accurate wording: services require explicit pre_agent_hooks wiring

* docs: fix user_dogfood.py description (edit-pdf, not regex-log)

* fix: revert regressions where website overwrote correct post-#252 content
- concepts.md: restore edit-pdf task name (not regex-log)
- use-cases.md: restore mcp_reviewer_hook + pre_agent_hooks pattern
- use-cases.md: restore service hooks explicit warning + Migration from Harbor
- python-api.md: restore User-Driven Loops section (was replaced with stale 0.3 Limitations)
- task-authoring.md: restore bench run for single-task examples + skills example
- task-authoring.md: restore full security note

* fix: address Devin Review findings (lockdown_paths claim + non-existent CLI flags)
- use-cases.md: revert incorrect lockdown_paths per-role isolation claim; restore correct outbox communication description
- cli.md: remove --agent-env/--ae and --exclude from bench eval create table (--ae only exists on bench run; --exclude only in YAML config)

* fix: update broken anchor links after heading renames
- README.md: update progressive-disclosure.md anchor (Harbor #1316 parity removed)
- notebook: update use-cases.md anchor (#1-interactive-user-simulation)

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

devin-ai-integration Bot assigned xdotli May 14, 2026

devin-ai-integration Bot commented May 14, 2026

fix: update .ref/ references in notebooks missed in initial rename

9680949