feat: per-agent capabilities + agent-as-tool infrastructure (ENG-50) #265
Integration Test Results (ENG-50)
Session: https://app.devin.ai/sessions/0efb94a21c464b0ab4a7cf764f98d40c

Integration Test: bench run with Daytona
uv run bench run --source-repo benchflow-ai/skillsbench \
    --source-path tasks/jax-computing-basics \
    --agent gemini --model gemini-3.1-flash-lite-preview -b daytona
Exit code: 0. Full Trial lifecycle completed (task download → Daytona sandbox → ACP agent → 31 tool calls → verifier → clean exit).

Unit Tests + Lint + Typecheck

Note on CLI flag
The original command used
Integration Test Results (ENG-50): Corrected
Session: https://app.devin.ai/sessions/0efb94a21c464b0ab4a7cf764f98d40c

Integration Test: bench eval create with Daytona
uv run bench eval create \
    --source-repo benchflow-ai/skillsbench --source-path tasks/jax-computing-basics \
    -a gemini -m gemini-3.1-flash-lite-preview \
    -e daytona -c 1 -o jobs/integration-smoke/gemini
Exit code: 0. Full Job/Evaluation pipeline completed (task download → Daytona sandbox → ACP agent → 10 tool calls → verifier → clean exit).

Unit Tests + Lint + Typecheck
Notes
Closing — this is now included in the combined refactor branch: #272 ( |
- Add Role.capabilities field for declarative agent capability lists
- Add Sandbox.expose_ports property for inter-agent communication ports
- Document capability boundary: BenchFlow provides sandbox + instruction + observation; agents own loops, tool protocols, and agent-as-tool
- Update _scene.py and _types.py module docstrings with ENG-50 boundary
- Add 16 tests covering capabilities, expose_ports, and Coder→Reviewer→Coder turn sequences
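
For reviewers who want a concrete picture of the first two items in the commit message above, here is a minimal sketch of how an optional capabilities list and an expose_ports property could look. It is not the code in this PR. The field names name and agent, the None default for capabilities, the int port type, the FakeSandbox class, and the role_metadata helper are all assumptions made for illustration; only Role.capabilities, Sandbox.expose_ports, the list[str] type, and the empty-list default for ports come from the PR text.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, runtime_checkable


@dataclass
class Role:
    """Hypothetical stand-in for the Role type in _types.py."""

    name: str   # assumed field, for illustration only
    agent: str  # assumed field, for illustration only
    # Optional declarative list, e.g. ["tool-use", "agent-as-tool", "loop"].
    # The None default is an assumption; roles that omit it keep working.
    capabilities: Optional[list[str]] = None


@runtime_checkable
class Sandbox(Protocol):
    """Hypothetical slice of the sandbox protocol: only expose_ports."""

    @property
    def expose_ports(self) -> list[int]: ...


class FakeSandbox:
    """Illustrative adapter; the real Docker/Daytona adapters are not shown."""

    def __init__(self, expose_ports: Optional[list[int]] = None) -> None:
        # Copy on the way in so later caller mutations do not leak in.
        self._expose_ports = list(expose_ports or [])

    @property
    def expose_ports(self) -> list[int]:
        # Copy on the way out so callers cannot mutate internal state.
        return list(self._expose_ports)


def role_metadata(role: Role) -> dict:
    """Record capabilities as metadata only; nothing here acts on them."""
    return {"name": role.name, "capabilities": list(role.capabilities or [])}


if __name__ == "__main__":
    reviewer = Role(name="reviewer", agent="gemini", capabilities=["tool-use"])
    sandbox = FakeSandbox(expose_ports=[8080])
    print(role_metadata(reviewer))
    print(sandbox.expose_ports, isinstance(sandbox, Sandbox))
```

Returning a copy from the property is one way to get the "copy safety" the tests mention: appending to the returned list leaves the sandbox's own port list unchanged.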
006092d to ccdf516
Summary
Implements ENG-50 for BenchFlow v0.4 — establishes the capability boundary between BenchFlow (infrastructure) and agents (loops, tools, agent-as-tool).
Depends on: #262 (ENG-47 — unified types) and #261 (ENG-48 — sandbox protocol). This branch merges both.
Changes (ENG-50 commit only):
- Role.capabilities field (_types.py): optional list[str] declaring what the agent natively supports (e.g. ["tool-use", "agent-as-tool", "loop"]). BenchFlow records these in metadata but does not act on them.
- Sandbox.expose_ports property (sandbox/protocol.py, docker.py, daytona.py): allows sandbox implementations to declare ports for inter-agent communication. All roles share one sandbox, so localhost is reachable by default; this extends the protocol for explicit port exposure.
- Capability boundary documentation: module-level docstrings in _types.py and _scene.py clearly establish:
- 16 new tests (test_eng50_capabilities.py) covering:
  - Role.capabilities field (defaults, list, backward compat)
  - expose_ports on Docker and Daytona adapters (default, list, copy safety, protocol conformance)

What this PR does NOT add (by design):
Review & Testing Checklist for Human
- Role.capabilities field is optional and backward-compatible: existing YAML/code that doesn't set capabilities should work unchanged
- Sandbox.expose_ports doesn't break existing Docker/Daytona sandbox usage (property defaults to [])
- Docstrings in _types.py and _scene.py accurately reflect the intended v0.4 architecture
- uv run bench run --source-repo benchflow-ai/skillsbench --source-path tasks/jax-computing-basics --agent gemini --model gemini-3.1-flash-lite-preview -e daytona

Notes
Link to Devin session: https://app.devin.ai/sessions/0efb94a21c464b0ab4a7cf764f98d40c
Requested by: @xdotli