Cocoa-Agent is an agent framework for building and evaluating general digital agents. It integrates seamlessly with AIO Sandbox, an all-in-one Docker environment, and equips agents with a full suite of tools (browser automation, terminal access, file operations, and code interpreters), enabling them to operate like human developers in realistic settings. The framework is model-agnostic, and we provide example scripts for running agents with both open-source LLMs such as Qwen3-VL and commercial models such as GPT-5.1 on the example tasks of CocoaBench. To support robust evaluation at scale, Cocoa-Agent implements both dynamic runtime tests for verifying computational correctness and lightweight static-matching checks for deterministic answers.
This framework provides:
- Model-agnostic execution - Works with any OpenAI-compatible LLM or human controllers
- Comprehensive tool suite - Browser automation, terminal, file operations, code interpretation
- Scalable evaluation - Dynamic runtime tests and lightweight static-matching checks
- Execution tracking - Full conversation history and action traces for analysis
- Docker isolation - Sandboxed task environments with custom configurations
Execute all tasks in a directory using the main entry point:
```bash
python inference_main.py \
    --config configs/default_gpt.json \
    --tasks-dir tasks/ \
    --output-dir results/
```

Command-line Options:
- `--config CONFIG_FILE`: Path to configuration file (default: `config.json`)
- `--tasks-dir TASKS_DIR`: Directory containing task subdirectories (default: `tasks/`)
- `--output-dir OUTPUT_DIR`: Output directory for results JSON files (default: `results/`)
- `--model MODEL_NAME`: Override model name from config
Example configs/default_gpt.json:
```json
{
    "log_level": "DEBUG",
    "use_encrypted_tasks": false,
    "controller": {
        "type": "llm",
        "args": {
            "model": "gpt-5.1",
            "base_url": "",
            "api_key": "sk-proj-..."
        }
    },
    "sandbox": {
        "client_type": "unified",
        "docker_port": 8084,
        "max_iterations": 30
    }
}
```

Configuration Keys:
| Key | Type | Description |
|---|---|---|
| `log_level` | string | Logging verbosity: DEBUG, INFO, WARNING, ERROR |
| `use_encrypted_tasks` | bool | Enable encrypted task files (default: false) |
| `controller.type` | string | Agent type: "llm" (AI model) or "human" (interactive) |
| `controller.args.model` | string | Model identifier (e.g., "gpt-5.1", "Qwen3-VL") |
| `controller.args.base_url` | string | API endpoint (empty for OpenAI, required for local servers) |
| `controller.args.api_key` | string | Authentication token |
| `sandbox.client_type` | string | Sandbox mode: "unified" (all tools) or "browser" (UI only) |
| `sandbox.docker_port` | int | Host port for sandbox access (default: 8080) |
| `sandbox.max_iterations` | int | Maximum iterations per task before timeout |
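For open-source models served behind an OpenAI-compatible endpoint, set `controller.args.base_url` to the local server. The following variant of the config above is only a sketch: the port, local URL, and placeholder API key are illustrative assumptions, not values fixed by the framework.

```json
{
    "log_level": "INFO",
    "use_encrypted_tasks": false,
    "controller": {
        "type": "llm",
        "args": {
            "model": "Qwen3-VL",
            "base_url": "http://localhost:8000/v1",
            "api_key": "EMPTY"
        }
    },
    "sandbox": {
        "client_type": "unified",
        "docker_port": 8084,
        "max_iterations": 30
    }
}
```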
The system uses a host-side evaluation approach where the `test()` function in `test.py` validates task results.
The evaluation script runs on the host machine after task completion. The framework:
- Loads the `test()` function from `test.py` in the task directory
- Calls `test(result)` with the complete execution result dictionary
- Expects a dictionary return value with evaluation results
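As a minimal sketch of how this host-side loading step can work, the snippet below dynamically imports a task's `test.py` and calls its `test()` function. The helper name `load_test_fn` and the use of `importlib` are illustrative assumptions, not the framework's actual implementation.

```python
import importlib.util
from pathlib import Path


def load_test_fn(task_dir: str):
    """Load the test() function from a task directory's test.py (illustrative sketch)."""
    test_path = Path(task_dir) / "test.py"
    spec = importlib.util.spec_from_file_location("task_test", test_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # execute test.py so its test() is defined
    return module.test


# After a task finishes, the harness would call test() with the execution result dict:
# test_fn = load_test_fn("tasks/example-task")
# eval_result = test_fn(result)  # expected to return {"passed": ..., "feedback": ..., ...}
```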
Function Signature:
```python
def test(result: dict) -> dict:
    """Evaluate task results after execution.

    Args:
        result: Complete execution result containing:
            - conversation: Full message history with controller
            - execution_trace: All actions and their outputs
            - status: Task status ("success" or "failed")
            - instruction: Original task instruction
            - iterations: Number of iterations completed
            - sandbox: Sandbox configuration (docker_port, etc.)

    Returns:
        Dictionary with:
            - passed (bool): Whether task passed evaluation
            - feedback (str): Human-readable evaluation message
            - details (dict, optional): Additional metrics
    """
```

Return Format:
```python
{
    "passed": True,                             # Required: Whether task passed
    "feedback": "Task completed successfully",  # Required: Human-readable message
    "details": {                                # Optional: Additional metrics
        "key1": "value1",
        "key2": "value2"
    }
}
```

Result Dictionary Contents:
```python
result = {
    "task_name": "task-name",
    "instruction": "Task instruction...",
    "status": "success",  # or "failed"
    "iterations": 5,
    "conversation": [
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."}
    ],
    "execution_trace": [
        {
            "action": {"action_type": "shell", "command": "ls"},
            "feedback": {"done": False, "message": "..."}
        }
    ],
    "sandbox": {
        "docker_port": 8084,
        "client_type": "unified"
    }
}
```

Call Sandbox APIs from `test.py` to inspect container state:
```python
from pathlib import Path
import sys

# Initialize sandbox client
sandbox_sdk_path = Path(__file__).parent.parent / "sandbox" / "sdk" / "python"
if sandbox_sdk_path.exists():
    sys.path.insert(0, str(sandbox_sdk_path))
from agent_sandbox import Sandbox

docker_port = result.get("sandbox", {}).get("docker_port", 8080)
sandbox = Sandbox(base_url=f"http://localhost:{docker_port}")

# Read file
content = sandbox.file.read_file(file="/home/gem/output.txt").data.content

# List directory
entries = sandbox.file.list_path(path="/home/gem").data.entries

# Download file
binary_data = sandbox.file.download_file(path="/home/gem/report.pdf")

# Take screenshot
image = sandbox.browser.screenshot().data.image
```

Common File APIs:
| Capability | API | Returns |
|---|---|---|
| Read file | `sandbox.file.read_file(file=path)` | `.data.content` with full text |
| List directory | `sandbox.file.list_path(path=dir)` | `.data.entries` list |
| Download file | `sandbox.file.download_file(path=path)` | Binary data for streaming |
| Screenshot | `sandbox.browser.screenshot()` | `.data.image` as base64 |
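Putting the pieces together, here is a hedged example of a complete `test()` that combines a static check on the execution result with a file check through the Sandbox SDK. The expected file path `/home/gem/output.txt` and the expected answer `"42"` are illustrative assumptions for a hypothetical task, not part of the framework.

```python
from pathlib import Path
import sys


def test(result: dict) -> dict:
    """Example evaluation: verify the task succeeded and produced the expected file."""
    # Static check: the run must have finished with status "success"
    if result.get("status") != "success":
        return {"passed": False, "feedback": f"Task ended with status {result.get('status')!r}"}

    # Dynamic check: inspect the sandbox filesystem through the SDK
    sandbox_sdk_path = Path(__file__).parent.parent / "sandbox" / "sdk" / "python"
    if sandbox_sdk_path.exists():
        sys.path.insert(0, str(sandbox_sdk_path))
    from agent_sandbox import Sandbox

    docker_port = result.get("sandbox", {}).get("docker_port", 8080)
    sandbox = Sandbox(base_url=f"http://localhost:{docker_port}")

    # Hypothetical expected artifact for this example task
    content = sandbox.file.read_file(file="/home/gem/output.txt").data.content
    passed = "42" in content  # illustrative expected answer
    return {
        "passed": passed,
        "feedback": "Found expected answer in output.txt" if passed else "output.txt missing expected answer",
        "details": {"iterations": result.get("iterations"), "output_length": len(content)},
    }
```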
Each task produces a JSON file in the output directory (e.g., results/task-name.json) with complete execution details.
Result Dictionary Fields:
| Field | Type | Description |
|---|---|---|
| `task_name` | string | Name of the task |
| `instruction` | string | Original task instruction |
| `status` | string | Task status: "success" or "failed" |
| `iterations` | integer | Number of controller iterations |
| `conversation` | array | Full conversation with controller (role/content pairs) |
| `execution_trace` | array | List of actions and their feedback |
| `eval` | object | Evaluation results from `test.py` |
| `execution_time` | float | Total execution time in seconds |
| `docker_port` | integer | Docker port used for this task |
| `client_type` | string | Sandbox client type (unified, browser, etc.) |
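These per-task JSON files are easy to aggregate. Below is a small illustrative script (not part of the framework) that summarizes a results directory, assuming the `eval` object mirrors the `test()` return format and carries the `passed` flag.

```python
import json
from pathlib import Path

results_dir = Path("results")
total, passed, elapsed = 0, 0, 0.0

for path in sorted(results_dir.glob("*.json")):
    with path.open() as f:
        result = json.load(f)
    total += 1
    # Assumes eval mirrors the test() return format and includes "passed"
    if result.get("eval", {}).get("passed"):
        passed += 1
    elapsed += result.get("execution_time", 0.0)

print(f"{passed}/{total} tasks passed, total execution time {elapsed:.1f}s")
```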
- Python 3.10+
- Docker and Docker Compose (for running sandboxed tasks)
- uv (Python package manager, recommended)
- Clone the repository
- Install dependencies:
  ```bash
  uv sync
  ```
- Set up configuration file:
  - Copy or create a config file in the `configs/` directory
  - Update the API key for your LLM provider
  - Configure sandbox settings (docker_port, max_iterations, etc.)
Example configuration setup:
```bash
cp configs/default_gpt.json configs/my-config.json
# Edit my-config.json and set your API key
python inference_main.py --config configs/my-config.json --tasks-dir tasks/ --output-dir results/
```