refactor: adopt two-field source pattern for dataset sourcing #252
Benchmark datasets now live in benchflow-ai/benchmarks. The local .ref/ directory was a confusing hidden name for downloaded task repos. Rename to datasets/ for clarity and alignment with the external dataset repo.
Changes:
- task_download.py: clone into datasets/ instead of .ref/
- .gitignore: ignore datasets/ instead of .ref/
- All YAML configs, docs, experiments, skills updated
- Tests updated to match new path
Devin Review found 2 potential issues.
⚠️ 2 issues in files not directly in the diff
⚠️ Incomplete .ref/ → datasets/ rename in scene-patterns.ipynb — executable code uses old path (docs/examples/scene-patterns.ipynb:49)
The PR's intent is a mechanical rename of .ref/ to datasets/ across the entire codebase. docs/examples/scene-patterns.ipynb was missed. It contains executable code at line 49 (TASK = Path(".ref/terminal-bench-2/regex-log")) that will resolve to a non-existent directory after this rename lands, causing the notebook to fail when users run it. The markdown description at line 24 and cached output at line 36 also reference the old .ref/ path.
⚠️ Incomplete .ref/ → datasets/ rename in swebench_pro_progressive_disclosure.ipynb (docs/examples/swebench_pro_progressive_disclosure.ipynb:318)
The PR's mechanical .ref/ → datasets/ rename missed docs/examples/swebench_pro_progressive_disclosure.ipynb. Line 318 contains # .ref/swebenchpro/instance_qutebrowser__.../task.toml in a markdown TOML comment block. While this is only a comment (not executable code), it is inconsistent with the rest of the rename and will confuse readers who follow the notebook.
Fixed both in 9680949 — updated |
- Remove benchmarks/run_skillsbench.py and benchmarks/run_tb2.py (redundant with 'bench eval create -f <yaml>')
- Add auto-download to Job.from_yaml: if tasks_dir under datasets/ doesn't exist and matches a known benchmark, fetch it automatically
- Update docs to show CLI commands as primary interface
The benchflow:// protocol is not wired into the active CLI (bench run and bench eval create in main.py). Use datasets/ paths which work today.
YAML configs now use:
    source:
      repo: org/repo
      path: optional/subpath  # optional
      ref: branch-or-tag      # optional
Instead of the single-string tasks_dir field. The two-field pattern
(inspired by Vercel's gitRepository.repo + rootDirectory) provides:
- Explicit separation of concerns (which repo vs which subpath)
- Monorepo-friendly design
- Extensibility (easy to add ref, type, sha fields later)
- No ambiguity in parsing
Backward compat: tasks_dir: still works for local paths and aliases.
Also introduces Source dataclass in task_download.py for type-safe usage.
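The parsing rule described in this commit message (prefer a source: block, fall back to tasks_dir:) can be sketched roughly as follows. This is an illustration only: the field names mirror the YAML keys above, but the class body, the parse_task_source helper, and its validation rules are assumptions and not the actual code in task_download.py or job.py.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Source:
    """Illustrative stand-in for the PR's Source dataclass (fields assumed)."""

    repo: str                   # "org/repo"
    path: Optional[str] = None  # optional subpath inside the repo
    ref: Optional[str] = None   # optional branch or tag


def parse_task_source(config: dict[str, Any]) -> Source | str:
    """Return a Source for a `source:` block, else the legacy tasks_dir string.

    Hypothetical helper: the name and the error messages are not from the PR.
    """
    block = config.get("source")
    if block is not None:
        if not isinstance(block, dict) or "repo" not in block:
            raise ValueError("source: must be a mapping with a 'repo' key")
        repo = str(block["repo"])
        # Exactly one slash and two non-empty halves, e.g. "org/repo".
        if repo.count("/") != 1 or not all(repo.split("/")):
            raise ValueError(f"source.repo must look like org/repo, got {repo!r}")
        return Source(repo=repo, path=block.get("path"), ref=block.get("ref"))
    if "tasks_dir" in config:
        # Backward-compatible path: local directory or alias, left untouched.
        return str(config["tasks_dir"])
    raise ValueError("config needs either source: or tasks_dir:")
```

Keeping repo and path as separate fields means the parser never has to guess where the repository name ends and the subpath begins.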
Devin Review correctly identified that the CLI treated -t as a plain Path, so org/repo/path arguments would fail with file-not-found. Added _resolve_task_path() helper that:
1. Checks if path exists locally (fast path)
2. If not, parses as org/repo/path and calls resolve_source()
Also fixed scene-patterns.ipynb to use resolve_source instead of bare Path() with org/repo string.
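A minimal sketch of the two-step lookup this commit describes, assuming the first two segments of the argument are the repository and the rest is the subpath. The function below is not the PR's _resolve_task_path(); the resolver is passed in as a callable so the sketch stays self-contained, and its (repo, path) signature is an assumption.

```python
from pathlib import Path
from typing import Callable, Optional


def resolve_task_path(
    arg: str,
    resolve_source: Callable[[str, Optional[str]], Path],
) -> Path:
    """Resolve a CLI task argument to a directory on disk (illustrative)."""
    local = Path(arg)
    if local.exists():
        # Fast path: the argument is already a local directory or file.
        return local
    if arg.startswith(("/", ".", "~")):
        # Clearly meant as a filesystem path, so do not reinterpret it.
        raise FileNotFoundError(arg)
    parts = arg.strip("/").split("/")
    if len(parts) < 2 or not all(parts):
        raise FileNotFoundError(f"{arg!r} is neither a local path nor org/repo[/path]")
    repo = "/".join(parts[:2])            # "org/repo"
    subpath = "/".join(parts[2:]) or None  # remainder, if any
    return resolve_source(repo, subpath)
```

The guess that two leading segments form a repository name is the kind of implicit resolution that a later commit in this PR removes from -t in favour of explicit --source-repo and --source-path flags.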
bench run and bench eval create now support explicit source flags that mirror the YAML config source: block:
    bench run --source-repo benchflow-ai/skillsbench --source-path tasks/regex-log
    bench eval create --source-repo benchflow-ai/skillsbench --source-path tasks
-t/--tasks-dir remains for local paths only (no magic resolution). All docs updated to use benchflow-ai/skillsbench as the primary example.
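To illustrate how the flags can map one-to-one onto the YAML source: block, here is a sketch using argparse. The flag names come from this PR; the CLI framework actually used in main.py, the task_location helper, and the rule that remote and local flags are mutually exclusive are all assumptions made for the example.

```python
import argparse
from typing import Any


def build_parser() -> argparse.ArgumentParser:
    """Stand-in parser exposing only the source-related flags."""
    parser = argparse.ArgumentParser(prog="bench-sketch")
    parser.add_argument("--source-repo", help="remote repo as org/repo")
    parser.add_argument("--source-path", help="subpath inside the repo")
    parser.add_argument("--source-ref", help="branch or tag")
    parser.add_argument("-t", "--tasks-dir", help="local directory only")
    return parser


def task_location(args: argparse.Namespace) -> dict[str, Any]:
    """Turn parsed flags into the same shape a YAML config would have."""
    if args.source_repo and args.tasks_dir:
        # Assumed rule: a run is either remote or local, never both.
        raise ValueError("use either --source-repo or --tasks-dir, not both")
    if args.source_repo:
        return {
            "source": {
                "repo": args.source_repo,
                "path": args.source_path,
                "ref": args.source_ref,
            }
        }
    if args.source_path or args.source_ref:
        raise ValueError("--source-path/--source-ref require --source-repo")
    if args.tasks_dir:
        return {"tasks_dir": args.tasks_dir}
    raise ValueError("no task location given")
```

Because the output has the same shape as the YAML config, one code path can handle both entry points.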
Previously, if a repo was cached at a different branch, a call with a new ref would silently return the stale clone. Now on cache hit we compare the current HEAD to the requested ref and fetch+checkout if they differ.
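The cache-hit check can be sketched as below. This is a simplified illustration, not the code in task_download.py: ensure_ref and run_git are hypothetical names, the remote is assumed to be called origin, and the git runner is injectable so the logic can be exercised without a real clone.

```python
import subprocess
from pathlib import Path
from typing import Callable, Optional, Sequence

GitRunner = Callable[[Sequence[str], Path], str]


def run_git(args: Sequence[str], cwd: Path) -> str:
    """Run a git subcommand in cwd and return its trimmed stdout."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def ensure_ref(clone_dir: Path, ref: Optional[str], git: GitRunner = run_git) -> bool:
    """Make a cached clone match `ref`; return True if a checkout happened."""
    if ref is None:
        # No ref requested: accept whatever the cache currently holds.
        return False
    current_branch = git(["rev-parse", "--abbrev-ref", "HEAD"], clone_dir)
    current_sha = git(["rev-parse", "HEAD"], clone_dir)
    if ref in (current_branch, current_sha):
        return False
    # Cached clone points elsewhere: fetch the requested ref and switch to it.
    git(["fetch", "origin", ref], clone_dir)
    git(["checkout", "FETCH_HEAD"], clone_dir)
    return True
```

One limitation of this sketch: checking out FETCH_HEAD leaves the clone detached, so a later call with the same branch name fetches again. That is redundant but never stale.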
Experiment scripts previously called resolve_source() at module level, triggering git clone on import. Now uses lazy get_*() helpers so network calls only happen when main() actually runs. Also cleans stale clone_tmp directory before git clone to handle interrupted clones in a single retry.
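The lazy-helper idea can be illustrated with a cached getter. In this sketch resolve_source_stub stands in for the real, network-bound resolve_source(), and get_skillsbench_tasks is a hypothetical name following the get_*() pattern mentioned above; neither is taken from the experiment scripts themselves.

```python
from functools import lru_cache
from pathlib import Path
from typing import Optional

CALLS = []  # records each simulated clone so the laziness is observable


def resolve_source_stub(repo: str, path: Optional[str] = None) -> Path:
    """Stand-in for a resolver that would clone `repo` on first use."""
    CALLS.append(repo)
    return Path(".cache/datasets") / repo / (path or "")


@lru_cache(maxsize=None)
def get_skillsbench_tasks() -> Path:
    """Resolve the task directory on first call, then reuse the result."""
    return resolve_source_stub("benchflow-ai/skillsbench", "tasks")


def main() -> None:
    # The (simulated) network work happens here, not at import time.
    tasks = get_skillsbench_tasks()
    print(f"tasks directory: {tasks}")


if __name__ == "__main__":
    main()
```

Importing such a module has no side effects, so test collection and tooling that import experiment scripts do not trigger a clone.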
regex-log doesn't exist in benchflow-ai/skillsbench — it was a terminal-bench-2 task. All doc examples now reference edit-pdf which actually exists and resolves correctly.
  model_name: anthropic/claude-haiku-4-5-20251001
  datasets:
-   - path: .ref/terminal-bench-2
+   - path: harbor-framework/terminal-bench-2
🟡 Harbor-compatible YAML docs updated with remote repo paths, but _from_harbor_yaml code not updated to resolve them
The PR updated the Harbor-compatible YAML examples in multiple SKILL.md files from path: .ref/terminal-bench-2 to path: harbor-framework/terminal-bench-2. However, _from_harbor_yaml at src/benchflow/job.py:368 still treats this value as a local filesystem path via Path(datasets[0].get("path", "tasks")). Unlike _from_native_yaml which was updated to handle the new source: block and resolve_source(), the Harbor parser will construct Path("harbor-framework/terminal-bench-2") which doesn't exist on disk. Anyone following the documented Harbor-compatible YAML format will hit a downstream error when the non-existent path is used as a tasks directory. The SKILL.md files serve as instructions for AI agents, making this likely to generate broken configs.
Prompt for agents
The Harbor-compatible YAML example in three SKILL.md files (.claude/skills/benchflow/SKILL.md, .claude/skills/benchflow/tasks/benchflow-knowledge/environment/benchflow/SKILL.md, and .claude/skills/benchflow/tasks/create-simple-task/environment/benchflow/SKILL.md) was updated to show `path: harbor-framework/terminal-bench-2` in the datasets section. However, the _from_harbor_yaml method in src/benchflow/job.py (line 368) still does `tasks_dir = Path(datasets[0].get('path', 'tasks'))` which treats the path as a local filesystem path. There are two approaches to fix this: (1) Revert the SKILL.md Harbor YAML examples to use a local path format or note that Harbor-compatible YAML still requires local paths, or (2) Update _from_harbor_yaml to detect org/repo-style paths in datasets[].path and call resolve_source() to clone them, similar to how _from_native_yaml handles the new source: block.
- Accept main's two-field source pattern (source.repo + source.path)
- Accept main's .cache/datasets/ caching for cloned benchmarks
- Keep ProgramBench generated-benchmark support (_GENERATED_BENCHMARKS)
- Add harvey-lab to TASK_ALIASES for backward compatibility
- Delete TB2 YAML configs (user requested TB2 removal)
- Update skillsbench YAML to use source: pattern
- Keep ProgramBench YAML with tasks_dir: (generated benchmarks)
- All 811 tests pass, ruff + ty clean
…tent
- concepts.md: restore edit-pdf task name (not regex-log)
- use-cases.md: restore mcp_reviewer_hook + pre_agent_hooks pattern
- use-cases.md: restore service hooks explicit warning + Migration from Harbor
- python-api.md: restore User-Driven Loops section (was replaced with stale 0.3 Limitations)
- task-authoring.md: restore bench run for single-task examples + skills example
- task-authoring.md: restore full security note
* docs: sync all docs from www.benchflow.ai (source of latest updates)
  Syncs benchflow/docs/ with the website docs which were updated in www.benchflow.ai PRs #7 and #9 to match benchflow 0.3.x patterns. Key changes across all files:
  - .ref/ local paths → --source-repo remote pattern
  - Harbor-specific references → generic sandbox language
  - Added Modal as environment backend
  - Relative links normalized to ./X.md format
  - use-cases.md: source: {repo:...} YAML format
  - reference/cli.md: --source-repo, --exclude flags
  - progressive-disclosure.md: resolve_source() instead of .ref/ paths
* docs: fix review findings (backend→environment, add missing Path imports)
* docs: fix broken ../examples/ links in progressive-disclosure.md
  examples/ is at docs/examples/, not repo root. Changed ../examples/ → ./examples/.
* docs: restore opencode/openhands agents, fix auto-start services claim
  - Restore opencode and openhands to Registered Agents table (both still in registry)
  - Revert incorrect auto-start services claim to accurate wording: services require explicit pre_agent_hooks wiring
* docs: fix user_dogfood.py description (edit-pdf, not regex-log)
* fix: revert regressions where website overwrote correct post-#252 content
  - concepts.md: restore edit-pdf task name (not regex-log)
  - use-cases.md: restore mcp_reviewer_hook + pre_agent_hooks pattern
  - use-cases.md: restore service hooks explicit warning + Migration from Harbor
  - python-api.md: restore User-Driven Loops section (was replaced with stale 0.3 Limitations)
  - task-authoring.md: restore bench run for single-task examples + skills example
  - task-authoring.md: restore full security note
* fix: address Devin Review findings (lockdown_paths claim + non-existent CLI flags)
  - use-cases.md: revert incorrect lockdown_paths per-role isolation claim; restore correct outbox communication description
  - cli.md: remove --agent-env/--ae and --exclude from bench eval create table (--ae only exists on bench run; --exclude only in YAML config)
* fix: update broken anchor links after heading renames
  - README.md: update progressive-disclosure.md anchor (Harbor #1316 parity removed)
  - notebook: update use-cases.md anchor (#1-interactive-user-simulation)
---------
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
Summary
Adopts a two-field source.repo + source.path pattern for dataset sourcing (inspired by Vercel's gitRepository.repo + rootDirectory), replacing both the old .ref/ directory and the ambiguous single-string format.
YAML config format
CLI flags (mirrors YAML)
    bench run --source-repo benchflow-ai/skillsbench --source-path tasks/regex-log
    bench eval create --source-repo benchflow-ai/skillsbench --source-path tasks -c 64
-t/--tasks-dir remains for local paths only, with no magic resolution.
Why two fields?
- … path)
- … --source-repo = source.repo, --source-path = source.path)
- … ref, type, sha fields later)
Changes
- src/benchflow/task_download.py: Source dataclass + resolve_source(repo, path, ref)
- src/benchflow/job.py: … source: block (backward compat with tasks_dir:)
- src/benchflow/cli/main.py: --source-repo, --source-path, --source-ref flags on bench run and bench eval create
- benchmarks/*.yaml: … source: format
- … benchflow-ai/skillsbench
Backward compatibility preserved:
tasks_dir: local/path and tasks_dir: alias-name still work in YAML; -t still works for local paths in CLI.
Review & Testing Checklist for Human
- bench eval create -f benchmarks/skillsbench-claude-glm51.yaml: verify it resolves benchflow-ai/skillsbench with path: tasks + ref: main
- bench eval create --source-repo benchflow-ai/skillsbench --source-path tasks --help: verify new flags appear
- bench eval create -t ./some-local-dir: verify local paths still work without --source-repo
Notes
- … benchflow-ai/skillsbench (not harbor-framework)
- .cache/datasets/ is gitignored; repos are cloned on first use and cached locally
- resolve_ref and is_remote_ref functions removed; resolve_source is the single entry point
Link to Devin session: https://app.devin.ai/sessions/046003a8ff3843e3ad272bd3af9c7726
Requested by: @xdotli