feat: per-agent capabilities + agent-as-tool infrastructure (ENG-50) by devin-ai-integration[bot] · Pull Request #265 · benchflow-ai/benchflow

devin-ai-integration · 2026-05-15T22:29:30Z

Summary

Implements ENG-50 for BenchFlow v0.4 — establishes the capability boundary between BenchFlow (infrastructure) and agents (loops, tools, agent-as-tool).

Depends on: #262 (ENG-47 — unified types) and #261 (ENG-48 — sandbox protocol). This branch merges both.

Changes (ENG-50 commit only):

- Role.capabilities field (_types.py): optional list[str] declaring what the agent natively supports (e.g. ["tool-use", "agent-as-tool", "loop"]). BenchFlow records these in metadata but does not act on them.
- Sandbox.expose_ports property (sandbox/protocol.py, docker.py, daytona.py): allows sandbox implementations to declare ports for inter-agent communication. All roles share one sandbox, so localhost is reachable by default; this extends the protocol for explicit port exposure.
- Capability boundary documentation: module-level docstrings in _types.py and _scene.py clearly establish:
  - BenchFlow provides sandbox + instruction + observation
  - BenchFlow does NOT orchestrate agent-internal loops
  - Agent-as-tool is a per-agent capability, not a BenchFlow feature
  - BenchFlow ensures sandbox networking allows inter-agent communication
- 16 new tests (test_eng50_capabilities.py) covering:
  - Role.capabilities field (defaults, list, backward compat)
  - expose_ports on Docker and Daytona adapters (default, list, copy safety, protocol conformance)
  - Coder → Reviewer → Coder turn sequence with message routing
  - Role env vars preserved through Scene construction
  - max_rounds enforcement
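
For reviewers who want a concrete picture of the two additions, here is a minimal sketch. It is not the code in this branch. Only the names Role.capabilities and Sandbox.expose_ports come from the description above; the `name` field, the `int` port type, the `FakeSandbox` adapter, and the `build_metadata` helper are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class Role:
    """Illustrative stand-in for the Role type in _types.py."""

    name: str  # hypothetical field, not taken from the PR
    capabilities: list[str] | None = None  # optional; None when not declared


@runtime_checkable
class Sandbox(Protocol):
    """Illustrative stand-in for the protocol in sandbox/protocol.py."""

    @property
    def expose_ports(self) -> list[int]: ...  # int ports are an assumption


class FakeSandbox:
    """Hypothetical adapter; the real Docker/Daytona adapters may differ."""

    def __init__(self, expose_ports: list[int] | None = None) -> None:
        self._expose_ports = list(expose_ports or [])

    @property
    def expose_ports(self) -> list[int]:
        # Return a copy so callers cannot mutate the adapter's own list.
        return list(self._expose_ports)


def build_metadata(role: Role) -> dict[str, object]:
    """Record declared capabilities as metadata without acting on them."""
    return {"role": role.name, "capabilities": role.capabilities}
```

Returning a copy from the property is one way to get the "copy safety" behavior listed in the tests: appending to the returned list leaves the sandbox's declared ports unchanged.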

What this PR does NOT add (by design):

No BenchFlow-level loop management
No BenchFlow-level Tool protocol
No automatic agent-as-tool invocation

Review & Testing Checklist for Human

Verify Role.capabilities field is optional and backward-compatible — existing YAML/code that doesn't set capabilities should work unchanged
Confirm Sandbox.expose_ports doesn't break existing Docker/Daytona sandbox usage (property defaults to [])
Check that the capability boundary docs in _types.py and _scene.py accurately reflect the intended v0.4 architecture
Run integration test: uv run bench run --source-repo benchflow-ai/skillsbench --source-path tasks/jax-computing-basics --agent gemini --model gemini-3.1-flash-lite-preview -e daytona

Notes

This is intentionally a small, focused PR — mostly documentation + one new field + one new protocol property
The ENG-47 and ENG-48 merges are included in the branch; the ENG-50-specific commit is the last one
All 856 existing tests pass, plus 16 new tests added

Link to Devin session: https://app.devin.ai/sessions/0efb94a21c464b0ab4a7cf764f98d40c
Requested by: @xdotli

devin-ai-integration · 2026-05-15T22:29:33Z

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

Disable automatic comment and CI monitoring

devin-ai-integration · 2026-05-15T22:47:55Z

Integration Test Results (ENG-50)

Session: https://app.devin.ai/sessions/0efb94a21c464b0ab4a7cf764f98d40c

Integration Test — bench run with Daytona

uv run bench run --source-repo benchflow-ai/skillsbench \
  --source-path tasks/jax-computing-basics \
  --agent gemini --model gemini-3.1-flash-lite-preview -b daytona

Task: jax-computing-basics
Agent: gemini-cli
Rewards: {'reward': 0.0}
Tool calls: 31

Exit code: 0 — full Trial lifecycle completed (task download → Daytona sandbox → ACP agent → 31 tool calls → verifier → clean exit).

Unit Tests + Lint + Typecheck

| Check | Result |
| --- | --- |
| `pytest tests/` | 856 passed, 1 skipped, 1 deselected (27.72s) |
| `ruff check .` | All checks passed |
| `ty check src/` | All 59 files passed |
| `Role.capabilities` round-trip | Passed |

Note on CLI flag

The original command used -e daytona but the correct flag is -b daytona (or --backend daytona). The -e/--environment flag was renamed to -b/--backend.

devin-ai-integration · 2026-05-15T22:54:25Z

Integration Test Results (ENG-50) — Corrected

Session: https://app.devin.ai/sessions/0efb94a21c464b0ab4a7cf764f98d40c

Integration Test — bench eval create with Daytona

uv run bench eval create \
  --source-repo benchflow-ai/skillsbench --source-path tasks/jax-computing-basics \
  -a gemini -m gemini-3.1-flash-lite-preview \
  -e daytona -c 1 -o jobs/integration-smoke/gemini

Task: jax-computing-basics
Agent: gemini (gemini-3.1-flash-lite-preview)
Reward: 0.0
Tool calls: 10

Exit code: 0 — full Job/Evaluation pipeline completed (task download → Daytona sandbox → ACP agent → 10 tool calls → verifier → clean exit).

Unit Tests + Lint + Typecheck

| Check | Result |
| --- | --- |
| `pytest tests/` | 856 passed, 1 skipped, 1 deselected (27.72s) |
| `ruff check .` | All checks passed |
| `ty check src/` | All 59 files passed |
| `Role.capabilities` round-trip | Passed |

Notes

Used bench eval create (canonical integration path), not bench run
--include flag does not exist on this branch; used --source-path tasks/jax-computing-basics instead
Role.capabilities defaults to None, Sandbox.expose_ports defaults to [] — backward compatible, no impact on pipeline
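
The backward-compatibility claim and the round-trip check can be illustrated with a small sketch. It assumes a dataclass-style Role and a hypothetical `role_from_config` loader; neither is taken from this branch, and the real round-trip check may work differently.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass


@dataclass
class Role:
    """Illustrative stand-in; only `capabilities` is named in the PR."""

    name: str  # hypothetical field
    capabilities: list[str] | None = None


def role_from_config(config: dict) -> Role:
    """Build a Role from a parsed config mapping; `capabilities` may be absent."""
    return Role(name=config["name"], capabilities=config.get("capabilities"))


# A config written before this PR has no `capabilities` key and still loads.
legacy = role_from_config({"name": "coder"})

# A config that declares capabilities survives a dict round-trip unchanged.
declared = role_from_config({"name": "reviewer", "capabilities": ["tool-use", "loop"]})
restored = role_from_config(asdict(declared))
```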

devin-ai-integration · 2026-05-16T00:21:25Z

Closing — this is now included in the combined refactor branch: #272 (refactor/v0.4 → main). All changes from this PR are preserved there.

- Add Role.capabilities field for declarative agent capability lists
- Add Sandbox.expose_ports property for inter-agent communication ports
- Document capability boundary: BenchFlow provides sandbox + instruction + observation; agents own loops, tool protocols, and agent-as-tool
- Update _scene.py and _types.py module docstrings with ENG-50 boundary
- Add 16 tests covering capabilities, expose_ports, and Coder→Reviewer→Coder turn sequences
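
The Coder→Reviewer→Coder turn sequence and the max_rounds cap exercised by the tests can be pictured roughly as below. This is a sketch under assumptions: `run_turns`, the `Agent` callable type, and the plain-string message format are hypothetical and do not reflect the real Scene API, and max_rounds is treated here as a cap on the number of turns.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical agent type: receives the incoming message, returns its reply.
Agent = Callable[[str], str]


def run_turns(
    agents: dict[str, Agent],
    order: list[str],
    first_message: str,
    max_rounds: int,
) -> list[tuple[str, str]]:
    """Route each reply to the next role in `order`, for at most max_rounds turns."""
    transcript: list[tuple[str, str]] = []
    message = first_message
    for role_name in order[:max_rounds]:
        # The previous role's output becomes this role's input.
        message = agents[role_name](message)
        transcript.append((role_name, message))
    return transcript
```

The routing layer only passes messages along; whatever loop or tool use happens inside each agent callable stays the agent's own business, which matches the boundary this PR documents.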

devin-ai-integration Bot assigned xdotli May 15, 2026

devin-ai-integration Bot mentioned this pull request May 16, 2026

refactor: BenchFlow v0.4 — unified types, Rollout, Sandbox protocol, rewards, adapters #272

Closed

5 tasks

xdotli closed this May 16, 2026

devin-ai-integration Bot reopened this May 16, 2026

devin-ai-integration Bot changed the base branch from main to refactor/v0.4 May 16, 2026 01:07

xdotli added 2 commits May 16, 2026 01:09

style: apply ruff format to sandbox adapters and tests

ccdf516

devin-ai-integration Bot force-pushed the devin/1778883855-eng-50-agent-capabilities branch from 006092d to ccdf516 Compare May 16, 2026 01:09

devin-ai-integration Bot merged commit ce77bbf into refactor/v0.4 May 16, 2026
1 of 2 checks passed

devin-ai-integration Bot mentioned this pull request May 16, 2026

refactor: BenchFlow v0.3.4 — unified types, Rollout, Sandbox protocol, rewards, adapters #274

Merged

5 tasks

xdotli deleted the devin/1778883855-eng-50-agent-capabilities branch May 17, 2026 05:51

devin-ai-integration[bot] merged 2 commits into refactor/v0.4 from devin/1778883855-eng-50-agent-capabilities
