refactor: adopt two-field source pattern for dataset sourcing #252

Merged: xdotli merged 11 commits into main from devin/1778723768-drop-dot-ref on May 14, 2026

@devin-ai-integration devin-ai-integration Bot commented May 14, 2026

Summary

Adopts a two-field source.repo + source.path pattern for dataset sourcing (inspired by Vercel's gitRepository.repo + rootDirectory), replacing both the old .ref/ directory and the ambiguous single-string format.

YAML config format

source:
  repo: benchflow-ai/skillsbench   # GitHub org/repo
  path: tasks                       # optional subpath within repo
  ref: main                         # optional branch/tag

CLI flags (mirrors YAML)

bench run --source-repo benchflow-ai/skillsbench --source-path tasks/regex-log
bench eval create --source-repo benchflow-ai/skillsbench --source-path tasks -c 64

-t/--tasks-dir remains for local paths only; it is never resolved as a remote org/repo reference.
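
The flag split above can be sketched with a small stand-alone parser. This is not the code in src/benchflow/cli/main.py: the `build_parser` and `validate` names, the use of argparse, and the rule that the remote and local flags are mutually exclusive are all assumptions made for illustration. Only the flag names come from this PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Stand-alone parser that reuses the flag names from this PR."""
    parser = argparse.ArgumentParser(prog="bench-sketch")
    parser.add_argument("-t", "--tasks-dir", help="local directory of tasks")
    parser.add_argument("--source-repo", help="GitHub org/repo")
    parser.add_argument("--source-path", help="optional subpath within the repo")
    parser.add_argument("--source-ref", help="optional branch or tag")
    return parser


def validate(args: argparse.Namespace) -> str:
    """Return "remote" or "local"; the rejection rules are assumptions."""
    if args.source_repo and args.tasks_dir:
        raise ValueError("pass either --source-repo or -t/--tasks-dir, not both")
    if (args.source_path or args.source_ref) and not args.source_repo:
        raise ValueError("--source-path and --source-ref require --source-repo")
    if args.source_repo:
        return "remote"
    if args.tasks_dir:
        return "local"
    raise ValueError("no task location given")
```

Keeping the two modes apart means a value passed to -t never has to be guessed at: it is a directory on disk or an error.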

Why two fields?

  • No ambiguity about where org/repo ends and path begins
  • Monorepo-friendly (one repo holds many benchmarks, each config picks a different path)
  • CLI ↔ YAML consistency (--source-repo = source.repo, --source-path = source.path)
  • Extensible (easy to add ref, type, sha fields later)
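
A minimal sketch of what a two-field source value might look like. The field names mirror the YAML keys in this PR, but the class body and its validation are assumptions, not the Source dataclass that landed in task_download.py.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Source:
    """Illustrative stand-in; fields mirror the YAML keys repo/path/ref."""

    repo: str                   # GitHub "org/repo"
    path: Optional[str] = None  # optional subpath within the repo
    ref: Optional[str] = None   # optional branch or tag

    def __post_init__(self) -> None:
        # Assumed validation: exactly one slash, with non-empty halves.
        org, sep, name = self.repo.partition("/")
        if not (org and sep and name) or "/" in name:
            raise ValueError(f"repo must look like 'org/repo', got {self.repo!r}")


# One monorepo, several configs: each config picks its own subpath.
all_tasks = Source(repo="benchflow-ai/skillsbench", path="tasks", ref="main")
whole_repo = Source(repo="benchflow-ai/skillsbench")
```

Because the repo and the subpath arrive as separate values, nothing has to infer where one ends and the other begins, and a new field such as a commit SHA can be added without changing how existing configs parse.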

Changes

| Area | What |
| --- | --- |
| src/benchflow/task_download.py | New Source dataclass + resolve_source(repo, path, ref) |
| src/benchflow/job.py | Parses source: block (backward compat with tasks_dir:) |
| src/benchflow/cli/main.py | New --source-repo, --source-path, --source-ref flags on bench run and bench eval create |
| benchmarks/*.yaml | All 4 configs migrated to source: format |
| Docs | README, getting-started, use-cases, cli ref, examples: all use benchflow-ai/skillsbench |
| Tests | 7 tests covering new two-field API |

Backward compatibility preserved: tasks_dir: local/path and tasks_dir: alias-name still work in YAML; -t still works for local paths in CLI.
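
The precedence described above (a source: block if present, otherwise the legacy tasks_dir: key) could look roughly like this. `read_task_location` is a hypothetical helper operating on an already-parsed config mapping; it does not reproduce the parsing in job.py, and alias lookup for tasks_dir: alias-name is left out.

```python
from pathlib import Path
from typing import Any, Mapping, Optional, Tuple, Union

RemoteSource = Tuple[str, Optional[str], Optional[str]]  # (repo, path, ref)


def read_task_location(config: Mapping[str, Any]) -> Union[RemoteSource, Path]:
    """Prefer the new source: block; fall back to the legacy tasks_dir: key."""
    source = config.get("source")
    if source is not None:
        if not isinstance(source, Mapping) or "repo" not in source:
            raise ValueError("source: must be a mapping with a 'repo' key")
        # path and ref are optional, so missing keys become None.
        return (source["repo"], source.get("path"), source.get("ref"))
    if "tasks_dir" in config:
        # Legacy form: treated here as a plain local path (aliases not handled).
        return Path(config["tasks_dir"])
    raise ValueError("config needs either 'source:' or 'tasks_dir:'")
```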

Review & Testing Checklist for Human

  • Run bench eval create -f benchmarks/skillsbench-claude-glm51.yaml — verify it resolves benchflow-ai/skillsbench with path: tasks + ref: main
  • Run bench eval create --source-repo benchflow-ai/skillsbench --source-path tasks --help — verify new flags appear
  • Run bench eval create -t ./some-local-dir — verify local paths still work without --source-repo

Notes

  • 808/808 tests pass, ruff/ty clean
  • All doc examples now use benchflow-ai/skillsbench (not harbor-framework)
  • .cache/datasets/ is gitignored — repos are cloned on first use and cached locally
  • Old resolve_ref and is_remote_ref functions removed — resolve_source is the single entry point
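
A sketch of clone-on-first-use caching under .cache/datasets/. The `cached_clone` name, the cache directory naming, the shallow clone, and the temporary-directory suffix are assumptions; only the cache location and the clone-then-reuse behaviour come from the notes above. The `run` parameter exists so the sketch can be exercised without network access.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable, Sequence

CACHE_ROOT = Path(".cache/datasets")  # cache location named in the PR notes


def _run(cmd: Sequence[str]) -> None:
    subprocess.run(list(cmd), check=True)


def cached_clone(
    repo: str,
    cache_root: Path = CACHE_ROOT,
    run: Callable[[Sequence[str]], None] = _run,
) -> Path:
    """Clone github.com/<repo> once; later calls reuse the cached directory."""
    target = cache_root / repo.replace("/", "__")  # naming scheme is an assumption
    if target.exists():
        return target  # cache hit: no network access

    tmp = target.with_name(target.name + ".clone_tmp")
    if tmp.exists():
        shutil.rmtree(tmp)  # leftover from an interrupted clone
    tmp.parent.mkdir(parents=True, exist_ok=True)
    run(["git", "clone", "--depth", "1", f"https://github.com/{repo}.git", str(tmp)])
    tmp.rename(target)  # publish the directory only after the clone finished
    return target
```

Cloning into a temporary directory and renaming it afterwards keeps a half-finished clone from ever being mistaken for a valid cache entry.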

Link to Devin session: https://app.devin.ai/sessions/046003a8ff3843e3ad272bd3af9c7726
Requested by: @xdotli



Benchmark datasets now live in benchflow-ai/benchmarks. The local .ref/
directory was a confusing hidden name for downloaded task repos. Rename
to datasets/ for clarity and alignment with the external dataset repo.

Changes:
- task_download.py: clone into datasets/ instead of .ref/
- .gitignore: ignore datasets/ instead of .ref/
- All YAML configs, docs, experiments, skills updated
- Tests updated to match new path

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 2 potential issues.

⚠️ 2 issues in files not directly in the diff

⚠️ Incomplete .ref/ → datasets/ rename in scene-patterns.ipynb — executable code uses old path (docs/examples/scene-patterns.ipynb:49)

The PR's intent is a mechanical rename of .ref/ to datasets/ across the entire codebase. docs/examples/scene-patterns.ipynb was missed. It contains executable code at line 49 (TASK = Path(".ref/terminal-bench-2/regex-log")) that will resolve to a non-existent directory after this rename lands, causing the notebook to fail when users run it. The markdown description at line 24 and cached output at line 36 also reference the old .ref/ path.


⚠️ Incomplete .ref/ → datasets/ rename in swebench_pro_progressive_disclosure.ipynb (docs/examples/swebench_pro_progressive_disclosure.ipynb:318)

The PR's mechanical .ref/ → datasets/ rename missed docs/examples/swebench_pro_progressive_disclosure.ipynb. Line 318 contains # .ref/swebenchpro/instance_qutebrowser__.../task.toml in a markdown TOML comment block. While this is only a comment (not executable code), it is inconsistent with the rest of the rename and will confuse readers who follow the notebook.

@devin-ai-integration

Fixed both in 9680949: updated .ref/ → datasets/ in scene-patterns.ipynb and swebench_pro_progressive_disclosure.ipynb.

- Remove benchmarks/run_skillsbench.py and benchmarks/run_tb2.py
  (redundant with 'bench eval create -f <yaml>')
- Add auto-download to Job.from_yaml: if tasks_dir under datasets/
  doesn't exist and matches a known benchmark, fetch it automatically
- Update docs to show CLI commands as primary interface

xdotli added 2 commits May 14, 2026 02:15
The benchflow:// protocol is not wired into the active CLI (bench run
and bench eval create in main.py). Use datasets/ paths which work today.
…urcing

YAML configs now use:
  source:
    repo: org/repo
    path: optional/subpath   # optional
    ref: branch-or-tag       # optional

Instead of the single-string tasks_dir field. The two-field pattern
(inspired by Vercel's gitRepository.repo + rootDirectory) provides:
- Explicit separation of concerns (which repo vs which subpath)
- Monorepo-friendly design
- Extensibility (easy to add ref, type, sha fields later)
- No ambiguity in parsing

Backward compat: tasks_dir: still works for local paths and aliases.

Also introduces Source dataclass in task_download.py for type-safe usage.
@devin-ai-integration devin-ai-integration Bot changed the title from "refactor: replace .ref/ with datasets/ for benchmark task storage" to "refactor: adopt two-field source pattern for dataset sourcing" on May 14, 2026

xdotli added 2 commits May 14, 2026 19:52
Devin Review correctly identified that the CLI treated -t as a plain
Path — org/repo/path arguments would fail with file-not-found.

Added _resolve_task_path() helper that:
1. Checks if path exists locally (fast path)
2. If not, parses as org/repo/path and calls resolve_source()

Also fixed scene-patterns.ipynb to use resolve_source instead of
bare Path() with org/repo string.
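
The two-step lookup in this commit can be sketched as follows. `resolve_task_path_sketch` and its `resolve_remote` callback are hypothetical names; the callback stands in for resolve_source() so the example runs without cloning anything. The next commit in this PR restricts -t to local paths again, so this shows the intermediate behaviour only.

```python
from pathlib import Path
from typing import Callable, Optional


def resolve_task_path_sketch(
    value: str,
    resolve_remote: Callable[[str, Optional[str]], Path],
) -> Path:
    """Local path first; otherwise read the value as org/repo[/sub/path]."""
    local = Path(value)
    if local.exists():
        return local  # fast path: nothing to download

    parts = value.strip("/").split("/")
    if len(parts) < 2:
        raise FileNotFoundError(f"{value!r} is neither a local path nor org/repo")
    repo = "/".join(parts[:2])            # first two segments name the repo
    subpath = "/".join(parts[2:]) or None  # the rest, if any, is the subpath
    return resolve_remote(repo, subpath)
```

The sketch also shows the weakness of a single string: any local directory that happens to match the value takes priority over the remote repository, which is the kind of ambiguity the explicit --source-repo and --source-path flags avoid.
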
bench run and bench eval create now support explicit source flags that
mirror the YAML config source: block:

  bench run --source-repo benchflow-ai/skillsbench --source-path tasks/regex-log
  bench eval create --source-repo benchflow-ai/skillsbench --source-path tasks

-t/--tasks-dir remains for local paths only (no magic resolution).
All docs updated to use benchflow-ai/skillsbench as the primary example.

Previously, if a repo was cached at a different branch, a call with a
new ref would silently return the stale clone. Now on cache hit we
compare the current HEAD to the requested ref and fetch+checkout if
they differ.
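
A rough sketch of the cache-hit check described in this commit. `ensure_ref` is a hypothetical name, and comparing branch names via `git rev-parse --abbrev-ref HEAD` is one possible way to compare HEAD with the requested ref; the PR does not show which comparison the real code uses. The `git` parameter is injectable so the logic can be exercised without a repository.

```python
import subprocess
from pathlib import Path
from typing import Callable, Optional, Sequence


def _git(cmd: Sequence[str]) -> str:
    done = subprocess.run(list(cmd), check=True, capture_output=True, text=True)
    return done.stdout.strip()


def ensure_ref(
    repo_dir: Path,
    ref: Optional[str],
    git: Callable[[Sequence[str]], str] = _git,
) -> bool:
    """Bring a cached clone to `ref`; return True if fetch+checkout ran."""
    if ref is None:
        return False  # no ref requested: accept whatever is cached
    base = ["git", "-C", str(repo_dir)]
    current = git(base + ["rev-parse", "--abbrev-ref", "HEAD"])
    if current == ref:
        return False  # already on the requested branch
    git(base + ["fetch", "origin", ref])
    git(base + ["checkout", "--detach", "FETCH_HEAD"])
    return True
```

One limitation of this particular comparison: after a detached checkout, `--abbrev-ref HEAD` prints `HEAD`, so the next call fetches again even when the ref has not changed. Comparing commit hashes would avoid that at the cost of an extra lookup.
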

xdotli added 2 commits May 14, 2026 20:18
Experiment scripts previously called resolve_source() at module level,
triggering git clone on import. Now uses lazy get_*() helpers so network
calls only happen when main() actually runs.

Also cleans stale clone_tmp directory before git clone to handle
interrupted clones in a single retry.
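
The lazy-helper pattern from this commit, sketched with a stand-in resolver. `get_tasks_dir` is a hypothetical instance of the get_*() helpers, and the `resolve_source` defined here is a local fake that records calls instead of cloning; its signature is simplified for the example.

```python
from functools import lru_cache
from pathlib import Path

RESOLVE_CALLS = []  # records calls so the laziness is observable


def resolve_source(repo: str, path: str) -> Path:
    """Local fake: the real resolver would clone over the network."""
    RESOLVE_CALLS.append((repo, path))
    return Path(".cache/datasets") / repo / path


# Eager form (the pattern this commit removes): runs as soon as the module
# is imported, even if main() is never called.
# TASKS_DIR = resolve_source("benchflow-ai/skillsbench", "tasks")


@lru_cache(maxsize=None)
def get_tasks_dir() -> Path:
    """Resolve on first call only; later calls reuse the cached result."""
    return resolve_source("benchflow-ai/skillsbench", "tasks")


def main() -> None:
    print(f"running tasks from {get_tasks_dir()}")
```

Importing a module written this way has no side effects, so tests and tooling can import experiment scripts without triggering a clone.
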
regex-log doesn't exist in benchflow-ai/skillsbench — it was a
terminal-bench-2 task. All doc examples now reference edit-pdf which
actually exists and resolves correctly.

@xdotli xdotli merged commit 5f0740c into main May 14, 2026
2 of 3 checks passed

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 1 new potential issue.

  model_name: anthropic/claude-haiku-4-5-20251001
  datasets:
- - path: .ref/terminal-bench-2
+ - path: harbor-framework/terminal-bench-2

🟡 Harbor-compatible YAML docs updated with remote repo paths, but _from_harbor_yaml code not updated to resolve them

The PR updated the Harbor-compatible YAML examples in multiple SKILL.md files from path: .ref/terminal-bench-2 to path: harbor-framework/terminal-bench-2. However, _from_harbor_yaml at src/benchflow/job.py:368 still treats this value as a local filesystem path via Path(datasets[0].get("path", "tasks")). Unlike _from_native_yaml which was updated to handle the new source: block and resolve_source(), the Harbor parser will construct Path("harbor-framework/terminal-bench-2") which doesn't exist on disk. Anyone following the documented Harbor-compatible YAML format will hit a downstream error when the non-existent path is used as a tasks directory. The SKILL.md files serve as instructions for AI agents, making this likely to generate broken configs.

Prompt for agents
The Harbor-compatible YAML example in three SKILL.md files was updated to show `path: harbor-framework/terminal-bench-2` in the datasets section:

  • .claude/skills/benchflow/SKILL.md
  • .claude/skills/benchflow/tasks/benchflow-knowledge/environment/benchflow/SKILL.md
  • .claude/skills/benchflow/tasks/create-simple-task/environment/benchflow/SKILL.md

However, the _from_harbor_yaml method in src/benchflow/job.py (line 368) still does `tasks_dir = Path(datasets[0].get('path', 'tasks'))` which treats the path as a local filesystem path. There are two approaches to fix this:

  (1) Revert the SKILL.md Harbor YAML examples to use a local path format or note that Harbor-compatible YAML still requires local paths, or
  (2) Update _from_harbor_yaml to detect org/repo-style paths in datasets[].path and call resolve_source() to clone them, similar to how _from_native_yaml handles the new source: block.
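
For approach (2), the detection step might look like the sketch below. `classify_dataset_path` and its heuristic are assumptions for illustration and are not part of job.py: a value counts as remote only when it does not exist on disk, does not start like a filesystem path, and has exactly two segments.

```python
import re
from pathlib import Path

# Two segments of GitHub-style name characters, e.g. "org/repo".
_REPO_RE = re.compile(r"^[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+$")


def classify_dataset_path(value: str, base_dir: Path = Path(".")) -> str:
    """Return "local" or "remote" for a Harbor-style datasets[].path value."""
    if (base_dir / value).exists():
        return "local"  # an existing directory always wins
    if value.startswith((".", "/", "~")):
        return "local"  # looks like a filesystem path, e.g. ".ref/..."
    if _REPO_RE.match(value):
        return "remote"  # candidate for resolve_source()
    return "local"
```

A heuristic like this can still misread a two-segment relative path that has not been created yet, which is the same ambiguity the explicit source: block removes for native configs.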

devin-ai-integration Bot pushed a commit that referenced this pull request May 14, 2026
- Accept main's two-field source pattern (source.repo + source.path)
- Accept main's .cache/datasets/ caching for cloned benchmarks
- Keep ProgramBench generated-benchmark support (_GENERATED_BENCHMARKS)
- Add harvey-lab to TASK_ALIASES for backward compatibility
- Delete TB2 YAML configs (user requested TB2 removal)
- Update skillsbench YAML to use source: pattern
- Keep ProgramBench YAML with tasks_dir: (generated benchmarks)
- All 811 tests pass, ruff + ty clean
devin-ai-integration Bot pushed a commit that referenced this pull request May 15, 2026
…tent

- concepts.md: restore edit-pdf task name (not regex-log)
- use-cases.md: restore mcp_reviewer_hook + pre_agent_hooks pattern
- use-cases.md: restore service hooks explicit warning + Migration from Harbor
- python-api.md: restore User-Driven Loops section (was replaced with stale 0.3 Limitations)
- task-authoring.md: restore bench run for single-task examples + skills example
- task-authoring.md: restore full security note
xdotli added a commit that referenced this pull request May 15, 2026
* docs: sync all docs from www.benchflow.ai (source of latest updates)

Syncs benchflow/docs/ with the website docs which were updated in
www.benchflow.ai PRs #7 and #9 to match benchflow 0.3.x patterns.

Key changes across all files:
- .ref/ local paths → --source-repo remote pattern
- Harbor-specific references → generic sandbox language
- Added Modal as environment backend
- Relative links normalized to ./X.md format
- use-cases.md: source: {repo:...} YAML format
- reference/cli.md: --source-repo, --exclude flags
- progressive-disclosure.md: resolve_source() instead of .ref/ paths

* docs: fix review findings — backend→environment, add missing Path imports

* docs: fix broken ../examples/ links in progressive-disclosure.md

examples/ is at docs/examples/, not repo root. Changed ../examples/ → ./examples/.

* docs: restore opencode/openhands agents, fix auto-start services claim

- Restore opencode and openhands to Registered Agents table (both still in registry)
- Revert incorrect auto-start services claim to accurate wording:
  services require explicit pre_agent_hooks wiring

* docs: fix user_dogfood.py description — edit-pdf, not regex-log

* fix: revert regressions where website overwrote correct post-#252 content

- concepts.md: restore edit-pdf task name (not regex-log)
- use-cases.md: restore mcp_reviewer_hook + pre_agent_hooks pattern
- use-cases.md: restore service hooks explicit warning + Migration from Harbor
- python-api.md: restore User-Driven Loops section (was replaced with stale 0.3 Limitations)
- task-authoring.md: restore bench run for single-task examples + skills example
- task-authoring.md: restore full security note

* fix: address Devin Review findings — lockdown_paths claim + non-existent CLI flags

- use-cases.md: revert incorrect lockdown_paths per-role isolation claim;
  restore correct outbox communication description
- cli.md: remove --agent-env/--ae and --exclude from bench eval create table
  (--ae only exists on bench run; --exclude only in YAML config)

* fix: update broken anchor links after heading renames

- README.md: update progressive-disclosure.md anchor (Harbor #1316 parity removed)
- notebook: update use-cases.md anchor (#1-interactive-user-simulation)

---------

Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
@xdotli xdotli deleted the devin/1778723768-drop-dot-ref branch May 15, 2026 21:34