feat: per-agent capabilities + agent-as-tool infrastructure (ENG-50) #265

Merged: devin-ai-integration[bot] merged 2 commits into refactor/v0.4 from devin/1778883855-eng-50-agent-capabilities on May 16, 2026

Conversation

@devin-ai-integration
Contributor

@devin-ai-integration devin-ai-integration Bot commented May 15, 2026

Summary

Implements ENG-50 for BenchFlow v0.4 — establishes the capability boundary between BenchFlow (infrastructure) and agents (loops, tools, agent-as-tool).

Depends on: #262 (ENG-47 — unified types) and #261 (ENG-48 — sandbox protocol). This branch merges both.

Changes (ENG-50 commit only):

  1. Role.capabilities field (_types.py) — optional list[str] declaring what the agent natively supports (e.g. ["tool-use", "agent-as-tool", "loop"]). BenchFlow records these in metadata but does not act on them.

  2. Sandbox.expose_ports property (sandbox/protocol.py, docker.py, daytona.py) — allows sandbox implementations to declare ports for inter-agent communication. All roles share one sandbox, so localhost is reachable by default; this extends the protocol for explicit port exposure.

  3. Capability boundary documentation — module-level docstrings in _types.py and _scene.py clearly establish:

    • BenchFlow provides sandbox + instruction + observation
    • BenchFlow does NOT orchestrate agent-internal loops
    • Agent-as-tool is a per-agent capability, not a BenchFlow feature
    • BenchFlow ensures sandbox networking allows inter-agent communication
  4. 16 new tests (test_eng50_capabilities.py) covering:

    • Role.capabilities field (defaults, list, backward compat)
    • expose_ports on Docker and Daytona adapters (default, list, copy safety, protocol conformance)
    • Coder → Reviewer → Coder turn sequence with message routing
    • Role env vars preserved through Scene construction
    • max_rounds enforcement
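
The two API additions above can be pictured with a minimal sketch. Everything here is an assumption for illustration: the class bodies, the `agent` field, the `DockerSandbox` constructor, and the `role_metadata` helper are hypothetical stand-ins, not the actual contents of `_types.py` or `sandbox/protocol.py`. The sketch only shows the described behavior: `capabilities` is optional and recorded but never acted on, and `expose_ports` defaults to an empty list and returns a copy.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class Role:
    """Hypothetical stand-in for the Role type in _types.py."""

    name: str
    agent: str
    # Optional and purely declarative; None means nothing was declared.
    capabilities: list[str] | None = None


@runtime_checkable
class Sandbox(Protocol):
    """Hypothetical stand-in for the protocol in sandbox/protocol.py."""

    @property
    def expose_ports(self) -> list[int]: ...


class DockerSandbox:
    """Toy adapter; a real Docker adapter does far more than this."""

    def __init__(self, expose_ports: list[int] | None = None) -> None:
        self._expose_ports = list(expose_ports or [])

    @property
    def expose_ports(self) -> list[int]:
        # Return a copy so callers cannot mutate the adapter's state.
        return list(self._expose_ports)


def role_metadata(role: Role) -> dict:
    """Record capabilities in metadata without acting on them."""
    return {"name": role.name, "capabilities": role.capabilities}


reviewer = Role("reviewer", "gemini", capabilities=["tool-use", "agent-as-tool"])
sandbox = DockerSandbox(expose_ports=[8080])
```

Returning a copy from the property is what the "copy safety" tests would exercise under this reading: appending to the returned list leaves the sandbox's own port list unchanged.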

What this PR does NOT add (by design):

  • No BenchFlow-level loop management
  • No BenchFlow-level Tool protocol
  • No automatic agent-as-tool invocation
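
To make the boundary concrete, here is a rough sketch of scene-level turn routing of the kind the Coder → Reviewer → Coder tests describe. It is not BenchFlow's Scene implementation: `Message`, `run_turns`, and `echo_reply` are hypothetical names, and treating one round as one turn is an assumption about how `max_rounds` is counted. The point is that the router only passes messages along and enforces a limit; whatever loop or tool use happens inside `reply` belongs to the agent.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Message:
    sender: str
    recipient: str
    content: str


def run_turns(
    order: list[str],
    max_rounds: int,
    reply: Callable[[str, Optional[Message]], str],
) -> list[Message]:
    """Route one message per turn through `order`, stopping at max_rounds.

    `reply(role, incoming)` stands in for whatever the agent does
    internally; the router never looks inside it.
    """
    transcript: list[Message] = []
    incoming: Optional[Message] = None
    for turn, role in enumerate(order):
        if turn >= max_rounds:
            break  # enforce the round limit
        # The last role in the sequence reports back to the scene itself.
        recipient = order[turn + 1] if turn + 1 < len(order) else "scene"
        message = Message(role, recipient, reply(role, incoming))
        transcript.append(message)
        incoming = message
    return transcript


def echo_reply(role: str, incoming: Optional[Message]) -> str:
    """Placeholder agent that just reports what it received."""
    previous = incoming.content if incoming else "task instruction"
    return f"{role} saw: {previous}"


log = run_turns(["coder", "reviewer", "coder"], max_rounds=3, reply=echo_reply)
```

With `max_rounds=2` the same call would stop after the reviewer's turn, which is the kind of cutoff the `max_rounds` enforcement tests would check.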

Review & Testing Checklist for Human

  • Verify Role.capabilities field is optional and backward-compatible — existing YAML/code that doesn't set capabilities should work unchanged
  • Confirm Sandbox.expose_ports doesn't break existing Docker/Daytona sandbox usage (property defaults to [])
  • Check that the capability boundary docs in _types.py and _scene.py accurately reflect the intended v0.4 architecture
  • Run integration test: uv run bench run --source-repo benchflow-ai/skillsbench --source-path tasks/jax-computing-basics --agent gemini --model gemini-3.1-flash-lite-preview -e daytona
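
For the first checklist item, a reviewer could reason about backward compatibility with a small round-trip check like the one below. This is a sketch under stated assumptions: the `Role` shape, the `agent` field, and the `from_dict` helper are hypothetical and stand in for however BenchFlow actually loads roles from YAML or code.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass


@dataclass
class Role:
    """Hypothetical Role shape; the real fields live in _types.py."""

    name: str
    agent: str
    capabilities: list[str] | None = None

    @classmethod
    def from_dict(cls, data: dict) -> "Role":
        # A missing key behaves exactly like an explicit None.
        caps = data.get("capabilities")
        return cls(
            name=data["name"],
            agent=data["agent"],
            capabilities=list(caps) if caps is not None else None,
        )


# Config written before this PR: no capabilities key at all.
legacy = {"name": "coder", "agent": "gemini"}
# Config that opts in to declaring capabilities.
declared = {"name": "reviewer", "agent": "gemini", "capabilities": ["tool-use", "loop"]}

legacy_role = Role.from_dict(legacy)
declared_role = Role.from_dict(declared)
```

If the legacy config loads with `capabilities` left as `None` and the declared config survives `asdict` unchanged, the field is both optional and round-trippable in this simplified model.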

Notes

  • This is intentionally a small, focused PR — mostly documentation + one new field + one new protocol property
  • The ENG-47 and ENG-48 merges are included in the branch; the ENG-50-specific commit is the last one
  • All 856 existing tests pass, plus 16 new tests added

Link to Devin session: https://app.devin.ai/sessions/0efb94a21c464b0ab4a7cf764f98d40c
Requested by: @xdotli



@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@devin-ai-integration
Contributor Author

Integration Test Results (ENG-50)

Session: https://app.devin.ai/sessions/0efb94a21c464b0ab4a7cf764f98d40c

Integration Test — bench run with Daytona
uv run bench run --source-repo benchflow-ai/skillsbench \
  --source-path tasks/jax-computing-basics \
  --agent gemini --model gemini-3.1-flash-lite-preview -b daytona

Task: jax-computing-basics
Agent: gemini-cli
Rewards: {'reward': 0.0}
Tool calls: 31

Exit code: 0 — full Trial lifecycle completed (task download → Daytona sandbox → ACP agent → 31 tool calls → verifier → clean exit).

Unit Tests + Lint + Typecheck

| Check | Result |
| --- | --- |
| pytest tests/ | 856 passed, 1 skipped, 1 deselected (27.72s) |
| ruff check . | All checks passed |
| ty check src/ | All 59 files passed |
| Role.capabilities round-trip | Passed |

Note on CLI flag

The original command used -e daytona but the correct flag is -b daytona (or --backend daytona). The -e/--environment flag was renamed to -b/--backend.

@devin-ai-integration
Contributor Author

Integration Test Results (ENG-50) — Corrected

Session: https://app.devin.ai/sessions/0efb94a21c464b0ab4a7cf764f98d40c

Integration Test — bench eval create with Daytona
uv run bench eval create \
  --source-repo benchflow-ai/skillsbench --source-path tasks/jax-computing-basics \
  -a gemini -m gemini-3.1-flash-lite-preview \
  -e daytona -c 1 -o jobs/integration-smoke/gemini

Task: jax-computing-basics
Agent: gemini (gemini-3.1-flash-lite-preview)
Reward: 0.0
Tool calls: 10

Exit code: 0 — full Job/Evaluation pipeline completed (task download → Daytona sandbox → ACP agent → 10 tool calls → verifier → clean exit).

Unit Tests + Lint + Typecheck

| Check | Result |
| --- | --- |
| pytest tests/ | 856 passed, 1 skipped, 1 deselected (27.72s) |
| ruff check . | All checks passed |
| ty check src/ | All 59 files passed |
| Role.capabilities round-trip | Passed |

Notes
  • Used bench eval create (canonical integration path), not bench run
  • --include flag does not exist on this branch; used --source-path tasks/jax-computing-basics instead
  • Role.capabilities defaults to None, Sandbox.expose_ports defaults to [] — backward compatible, no impact on pipeline

@devin-ai-integration
Contributor Author

Closing: this is now included in the combined refactor branch, #272 (refactor/v0.4 → main). All changes from this PR are preserved there.

@xdotli xdotli closed this May 16, 2026
@devin-ai-integration devin-ai-integration Bot changed the base branch from main to refactor/v0.4 May 16, 2026 01:07
xdotli added 2 commits May 16, 2026 01:09
- Add Role.capabilities field for declarative agent capability lists
- Add Sandbox.expose_ports property for inter-agent communication ports
- Document capability boundary: BenchFlow provides sandbox + instruction +
  observation; agents own loops, tool protocols, and agent-as-tool
- Update _scene.py and _types.py module docstrings with ENG-50 boundary
- Add 16 tests covering capabilities, expose_ports, and Coder→Reviewer→Coder
  turn sequences
@devin-ai-integration devin-ai-integration Bot force-pushed the devin/1778883855-eng-50-agent-capabilities branch from 006092d to ccdf516 Compare May 16, 2026 01:09
@devin-ai-integration devin-ai-integration Bot merged commit ce77bbf into refactor/v0.4 May 16, 2026
1 of 2 checks passed
@xdotli xdotli deleted the devin/1778883855-eng-50-agent-capabilities branch May 17, 2026 05:51