Benchmark adapters (umbrella): SWE-bench, WebArena, GAIA, AgentBench, BrowseComp #53

@bordeauxred

Why

ClawLoop ships adapters for math, Harbor/BFCL, CRMArena (Entropic), Taubench, CAR-bench, and EnterpriseOps-Gym. Each one showed that the loop generalizes across task shapes, but each was also bespoke. Adding new adapters painlessly is blocked on the ProblemEnv abstraction tracked in #41.

This umbrella collects the next wave of benchmark families that would materially broaden what ClawLoop can learn on. It depends on #41's ProblemEnv work — the intent is that these adapters become the concrete drivers that shape ProblemEnv's surface, not the other way around.

Adapters

  • SWE-bench / SWE-bench Verified — long-horizon code-repair episodes. Stresses multi-step reasoning, tool use, and partial-credit reward signals.
  • WebArena — realistic web-navigation tasks with a container-backed env. Strong test for the proxy harness + playbook evolution across sessions.
  • GAIA — general assistant benchmark with tool use and multi-step reasoning. Good fit for the harness layer and reward composition system.
  • AgentBench — multi-domain agent evaluation. Forces the adapter abstraction to handle heterogeneous task shapes under one runner.
  • BrowseComp — web-search agent benchmark; complements WebArena with shorter horizons.
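To make "heterogeneous task shapes under one runner" concrete, here is a minimal sketch of a single episode loop driving both a single-step and a multi-step task. Everything in it (`SketchAdapter`, `StepResult`, `EchoTask`, `CountdownTask`, `run_episode`) is hypothetical and invented for illustration; it is not the ProblemEnv surface from #41, nor the shape of the existing ClawLoop adapters.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class StepResult:
    observation: str
    reward: float  # per-step reward; partial credit is allowed
    done: bool


class SketchAdapter(Protocol):
    """Hypothetical minimal adapter surface (NOT the real ProblemEnv API)."""

    def reset(self) -> str: ...
    def step(self, action: str) -> StepResult: ...


class EchoTask:
    """Toy single-step task: reward 1.0 if the action equals the target."""

    def __init__(self, target: str) -> None:
        self.target = target

    def reset(self) -> str:
        return f"Repeat: {self.target}"

    def step(self, action: str) -> StepResult:
        return StepResult("", float(action == self.target), True)


class CountdownTask:
    """Toy multi-step task: each correct 'tick' earns an equal share of credit."""

    def __init__(self, steps: int) -> None:
        self.steps = steps
        self.remaining = steps

    def reset(self) -> str:
        self.remaining = self.steps
        return f"{self.remaining} ticks left"

    def step(self, action: str) -> StepResult:
        self.remaining -= 1
        reward = 1.0 / self.steps if action == "tick" else 0.0
        return StepResult(f"{self.remaining} ticks left", reward, self.remaining == 0)


def run_episode(env: SketchAdapter, policy: Callable[[str], str], max_steps: int = 100) -> float:
    """Drive any adapter with the same loop and return the summed reward."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        result = env.step(policy(obs))
        total += result.reward
        if result.done:
            break
        obs = result.observation
    return total
```

The point of the sketch is only that the runner never branches on the task type: a short-horizon task and a longer-horizon one both go through `run_episode` unchanged.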

Why an umbrella?

Each adapter is a multi-day effort and mostly independent. Tracking them together surfaces the cross-cutting design questions (reward normalization across task types, partial-credit signals, long-horizon episode handling) that should land in ProblemEnv.
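As one possible starting point for the reward-normalization question, the sketch below maps raw scores from differently scaled task families onto a shared [0, 1] range and derives a partial-credit signal from a pass count. The names (`RewardRange`, `fraction_passed`, `RANGES`) and the example ranges are assumptions made for illustration; none of them come from ClawLoop or from any benchmark's official scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardRange:
    """Declared raw-reward bounds for one task family (hypothetical)."""

    low: float
    high: float

    def normalize(self, raw: float) -> float:
        """Clip raw to [low, high], then rescale linearly to [0, 1]."""
        if self.high <= self.low:
            raise ValueError("high must exceed low")
        clipped = min(max(raw, self.low), self.high)
        return (clipped - self.low) / (self.high - self.low)


def fraction_passed(passed: int, total: int) -> float:
    """Partial credit as the share of checks passed; 0.0 when there are none."""
    return passed / total if total else 0.0


# Illustrative ranges only; real adapters would declare their own.
RANGES = {
    "binary": RewardRange(0.0, 1.0),
    "score_0_100": RewardRange(0.0, 100.0),
}
```

Min-max scaling against declared bounds is the simplest option and keeps rewards comparable without needing running statistics. It does not account for differences in difficulty between families, which is one of the design questions this umbrella is meant to surface.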

Blocks on

Contributor notes

If you want to contribute an adapter, comment on the relevant sub-item before starting. Existing adapters (clawloop/environments/math.py, clawloop/environments/harbor.py) are the reference shape until ProblemEnv lands.

Labels: enhancement (New feature or request), roadmap (Future direction; not a launch blocker)
