Website · Paper · Synthesis Quick Start · Evaluation Quick Start · License
We present OpenComputer, a verifier-grounded framework for constructing verifiable software worlds for computer-use agents. OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification layer that improves verifier reliability using execution-grounded feedback, (3) a task-generation pipeline that synthesizes realistic and machine-checkable desktop tasks, and (4) an evaluation harness that records full trajectories and computes auditable partial-credit rewards. In its current form, OpenComputer covers 33 desktop applications and 1,000 finalized tasks spanning browsers, office tools, creative software, development environments, file managers, and communication applications. Experiments show that OpenComputer's hard-coded verifiers align more closely with human adjudication than LLM-as-judge evaluation, especially when success depends on fine-grained application state. Frontier agents struggle with end-to-end completion despite partial progress, and open-source models exhibit sharp drops from their OSWorld-Verified scores, exposing a persistent gap in robust computer automation.
Hand-authored GUI benchmarks don't scale: every task needs a unique workbook, project file, or screen state, and every task needs a verifier that knows where to look. OpenComputer automates both sides:
- Tasks are generated, not hand-written. The task generator (`task_generator/`) proposes goals, scores them on complexity and how easily their input artifacts can be synthesized, matches each accepted goal to a verifier endpoint, and emits a final `task.json` plus a fully synthesized `env/` directory of input files (CSVs, ODT/ODS/ODP/XLSX docs, PNG/SVG images, project files, configs, …).
- Environments are reproducible. Every task ships with the exact files needed to seed the sandbox; the runtime uploads them, launches the right app, and waits for it to be ready before the agent is given control.
- Verification is programmatic first, LLM-judged second. Each app has a verifier module exposing `check-*` CLI endpoints that read live IPC state (CDP, D-Bus, UNO, AT-SPI), parse files on disk, or query SQLite profile DBs. An LLM judge backs them up where automatic inspection isn't enough.
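The programmatic-first, judge-second ordering and the partial-credit reward can be illustrated with a minimal sketch. Everything in it (the `Check` dataclass, its `weight` field, `score_task`, and the `llm_judge` callback) is hypothetical and is not the repository's verifier API; it only shows the shape of the idea, assuming each check either decides on its own or returns `None` to defer.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Check:
    """Hypothetical check: a named, weighted probe of application state."""
    name: str
    weight: float
    # Returns True/False when the state can be inspected automatically,
    # or None when automatic inspection cannot decide.
    probe: Callable[[], Optional[bool]]


def score_task(checks, llm_judge=None):
    """Return (reward in [0, 1], audit log with one entry per check)."""
    total = sum(c.weight for c in checks)
    earned = 0.0
    audit = []
    for c in checks:
        verdict = c.probe()          # programmatic inspection comes first
        source = "programmatic"
        if verdict is None:
            if llm_judge is not None:
                verdict = llm_judge(c.name)   # judge is only a fallback
                source = "llm"
            else:
                source = "undecided"          # counted as not passed
        passed = bool(verdict)
        if passed:
            earned += c.weight
        audit.append({"check": c.name, "passed": passed, "source": source})
    reward = earned / total if total else 0.0
    return reward, audit


if __name__ == "__main__":
    demo = [
        Check("file_saved", 1.0, lambda: True),
        Check("cell_value_matches", 2.0, lambda: False),
        Check("chart_is_readable", 1.0, lambda: None),
    ]
    reward, audit = score_task(demo, llm_judge=lambda name: True)
    print(reward)  # 0.5
    for entry in audit:
        print(entry)
```

Keeping the per-check `source` in the audit log is what makes the reward inspectable afterwards: a reviewer can see which points came from direct state inspection and which came from the judge.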
```bash
cp .env.example .env
pip install -r requirements.txt
```

Then make sure the following are in place:
| Requirement | Why it's needed | How to set it |
|---|---|---|
| E2B sandbox | every stage spins up the `desktop-all-apps` template | set `E2B_API_KEY` in `.env`, then run `python computer_env/provision/e2b/build_all_apps_template.py` once |
| Reasoning-backend CLI | drives verifier authoring, smoke judge/comparator/repair, and task generation | install and authenticate `claude` (default) or `codex`. A single `SMOKE_BACKEND` in `.env` selects the CLI for `smoke/smoke_loop.py` (default `claude`) |
| Smoke GUI-agent key | the in-sandbox agent that actually executes smoke tasks | set the API key matching `SMOKE_MODEL` (default `kimi-k2.6` → `KIMI_API_KEY`). Override `SMOKE_MODEL` to use Claude / GPT / Gemini / etc. and set that family's key instead |
| LLM-as-judge key (optional) | only used when a verifier check has `judge: llm` | `JUDGE_MODEL` defaults to `gpt-5.4` → `OPENAI_API_KEY`. Override the model to reuse an existing key |
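A small preflight check can catch missing credentials before any sandbox is launched. The sketch below is an illustration, not part of the repository: `missing_settings` and `KEY_FOR_MODEL` are hypothetical names, and the only model-to-key pairs it knows are the two defaults listed in the table. Models outside that mapping are skipped rather than guessed.

```python
import os

# Only these two pairings are stated in the requirements table; any other
# model-to-key mapping would be a guess, so unknown models are skipped.
KEY_FOR_MODEL = {
    "kimi-k2.6": "KIMI_API_KEY",
    "gpt-5.4": "OPENAI_API_KEY",
}


def missing_settings(env=None, needs_llm_judge=False):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    required = ["E2B_API_KEY"]

    smoke_model = env.get("SMOKE_MODEL", "kimi-k2.6")
    if smoke_model in KEY_FOR_MODEL:
        required.append(KEY_FOR_MODEL[smoke_model])

    # The judge key is optional: it matters only when a check uses judge: llm.
    if needs_llm_judge:
        judge_model = env.get("JUDGE_MODEL", "gpt-5.4")
        if judge_model in KEY_FOR_MODEL:
            required.append(KEY_FOR_MODEL[judge_model])

    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_settings(needs_llm_judge=True)
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All required settings are present.")
```

Passing a plain dict as `env` makes the check easy to exercise without touching the real environment.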
The pipeline runs in a strict order, because verifier quality gates task quality and task quality gates evaluation:
- Verifier (`verifiers/`): generate the per-app `check-*` endpoints used to score tasks.
- Smoke (`smoke/`): exercise each verifier endpoint with a real agent in a live sandbox and repair any bugs.
- Task generation (`task_generator/`): propose, evaluate, match to the verifier, and synthesize env files for finalized tasks.
The root `CLAUDE.md` is the canonical guide. There are two ways to follow it:
- Automated: hand `CLAUDE.md` to a coding agent and let it drive the full loop end-to-end (synthesize verifiers → smoke test → generate tasks → optionally repair task verifiers).
- Manual: step through each stage yourself by following the per-stage READMEs above in order, inspecting output quality between stages.
```bash
git clone <this repo>
cd OpenComputer
cp .env.example .env   # fill in backend-related variables + the model API keys you plan to use
pip install -r requirements.txt
```

`requirements.txt` installs the E2B core SDK and Desktop SDK from PyPI, plus `anthropic`, `openai`, `dashscope`, `Pillow`, `httpx`, `python-dotenv`, etc.
Pick one backend.
```bash
# E2B (cloud sandbox; recommended, and the canonical target)
python computer_env/provision/e2b/build_all_apps_template.py

# Docker (local; useful for offline iteration)
bash computer_env/provision/docker/build_image.sh

# remote_docker (AWS or Tencent Cloud worker fleet)
# Follow one of the dedicated provisioning guides:
#   computer_env/provision/aws/README.md
#   computer_env/provision/tencentcloud/README.md
```

All three backends use the same Ubuntu/XFCE desktop stack with the app suite preinstalled. The E2B build pushes a template named `desktop-all-apps`; the Docker backends use an OCI image.
For the full `remote_docker` provisioning flow, follow one of:
- `computer_env/provision/aws/README.md`
- `computer_env/provision/tencentcloud/README.md`

Those provider docs are the source of truth for:
- provider auth verification
- provider prerequisites (`setup_prereqs.py`)
- desktop image build + registry push
- worker launch / stream inspection / termination
- `run_eval.py --env-backend remote_docker` usage and runtime notes
```bash
# Single task
python evaluation/run_eval.py --app chrome --task chrome_form_fill_httpbin --model claude-sonnet-4-6

# All tasks for one app, parallel
python evaluation/run_eval.py --app libreoffice_calc --model gpt-5.4 --parallel 4

# Smoke test one task per app on Docker
python evaluation/run_eval.py --env-backend docker --tasks-per-app 1

# Run against a remote Docker fleet
python evaluation/run_eval.py \
  --env-backend remote_docker \
  --docker-image <remote-registry>/<repo>:latest \
  --tasks-per-app 1 \
  --parallel 2

# Sample N tasks per app
python evaluation/run_eval.py --tasks-per-app 3 --model kimi-k2.6

# Resume a previous run (skips already-completed tasks)
python evaluation/run_eval.py --resume <run_id>

# Point the agent at a local OpenAI-compatible endpoint
python evaluation/run_eval.py --model gui-owl-1.5 --endpoint-port 8001
```

Trajectories, screenshots, and `report.json` land in `evaluation/runs/<run_id>/` (gitignored).
The agent registry (agents/registry.py) maps friendly aliases to agent classes and sensible defaults. Pass any alias to --model.
| Family | Aliases |
|---|---|
| Claude (Anthropic) | claude-sonnet-4-5, claude-sonnet-4, claude-sonnet-4-6, claude-opus-4, claude-opus-4-1, claude-opus-4-5, claude-opus-4-6, claude-3-7-sonnet |
| ChatGPT (OpenAI / Azure) | chatgpt, gpt-5, gpt-5.4, computer-use-preview, azure-chatgpt, azure-gpt-5.4, azure-computer-use-preview, azure-gpt-5.3-chat |
| Gemini (Google AI Studio) | gemini-3-flash, gemini-3-flash-preview, gemini-2.5-computer-use |
| Kimi (Moonshot) | kimi-k2.5, kimi-k2.6 |
| Qwen (DashScope / OpenAI-compatible) | qwen3-vl, qwen2.5-vl-72b, qwen3.5-35b-a3b, qwen3.5-27b, qwen3.5-9b, qwen3.5-4b |
| GUI-Owl | owl1.5, gui-owl-1.5 |
| EvoCUA | evocua-s1, evocua-s2 |
| Specialised CUA models | mano, opencua, dart |
`python evaluation/run_eval.py --list-models` enumerates everything currently registered. Unknown model IDs are routed by family heuristics (any name containing `claude`, `kimi`, `qwen`, `gemini`, `gpt-`, etc.), so you can also pass full vendor model IDs directly.
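The alias-then-heuristic lookup described above might look roughly like the following sketch. It is not the code in `agents/registry.py`: `resolve_family`, `FAMILY_MARKERS`, and the family labels are hypothetical, and the marker list covers only the substrings named in the text.

```python
# Substrings named in the text, paired with illustrative family labels.
# Order matters only if an ID could contain more than one marker.
FAMILY_MARKERS = [
    ("claude", "claude"),
    ("kimi", "kimi"),
    ("qwen", "qwen"),
    ("gemini", "gemini"),
    ("gpt-", "chatgpt"),
]


def resolve_family(model_id, registry):
    """Look up an explicit alias first, then fall back to substring routing."""
    key = model_id.lower()
    if key in registry:
        return registry[key]
    for marker, family in FAMILY_MARKERS:
        if marker in key:
            return family
    raise ValueError(f"cannot route model id: {model_id!r}")


if __name__ == "__main__":
    registry = {"gui-owl-1.5": "gui-owl"}  # explicit aliases win
    print(resolve_family("gui-owl-1.5", registry))         # gui-owl
    print(resolve_family("my-claude-finetune", registry))  # claude
```

Checking the registry before the heuristics means an explicit alias can always override a substring match.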
This section intentionally lists only the top-level variables most users need to run evaluations. Backend-specific and operator-facing settings are documented in:
- `computer_env/README.md` for sandbox/backend runtime settings
- `computer_env/provision/aws/README.md` for `remote_docker` on AWS
- `computer_env/provision/tencentcloud/README.md` for `remote_docker` on Tencent Cloud China
- `evaluation/repair/README.md` for repair-only settings
- `.env.example` for a starter template
| Variable | Default | Purpose |
|---|---|---|
| `EVAL_MODEL` | `kimi-k2.6` | Default agent for `evaluation/run_eval.py`. |
| `EVAL_MAX_ITERATIONS` | `100` | Max screenshot-action steps per task. |
| `EVAL_SANDBOX_TIMEOUT` | `3600` | Per-task sandbox lifetime in seconds. |
| `JUDGE_MODEL` | `gpt-5.4` | LLM-as-judge model used by verification. |
| `SMOKE_BACKEND` | `claude` | Shared reasoning-backend CLI for `smoke/smoke_loop.py` and `evaluation/repair/repair_loop.py` (`claude` or `codex`). Override per-run with `--backend`. Other `REPAIR_*` knobs live in the repair README. |
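For reference, the defaults in the table can be read from the environment along the following lines. This is a sketch under the assumption that each variable is a plain string in the process environment; `EvalSettings` and `load_settings` are hypothetical names, not the loader the repository uses.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSettings:
    eval_model: str
    max_iterations: int
    sandbox_timeout_s: int
    judge_model: str
    smoke_backend: str


def load_settings(env=None):
    """Read the top-level variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    backend = env.get("SMOKE_BACKEND", "claude")
    if backend not in ("claude", "codex"):
        raise ValueError(f"SMOKE_BACKEND must be 'claude' or 'codex', got {backend!r}")
    return EvalSettings(
        eval_model=env.get("EVAL_MODEL", "kimi-k2.6"),
        max_iterations=int(env.get("EVAL_MAX_ITERATIONS", "100")),
        sandbox_timeout_s=int(env.get("EVAL_SANDBOX_TIMEOUT", "3600")),
        judge_model=env.get("JUDGE_MODEL", "gpt-5.4"),
        smoke_backend=backend,
    )


if __name__ == "__main__":
    print(load_settings())
```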
- New app or more tasks for an existing app: see Synthesizing your own environment above.
- New agent: subclass `agents/base.py:BaseAgent`, parse the model's output into pyautogui code strings (or the special `DONE`/`FAIL`/`WAIT` tokens), and register the alias in `agents/registry.py`. Unknown model IDs are routed by family heuristics, so any name containing `claude`, `kimi`, `qwen`, `gemini`, `gpt-`, etc. will pick up the right agent class even without an explicit registry entry.
- New backend: drop a runtime adapter under `computer_env/backends/<name>/` implementing `BaseComputerEnvironment`, register it in `computer_env/factory.py`, and add a `--env-backend <name>` branch in `evaluation/run_eval.py`.
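The parsing step for a new agent can be sketched independently of `BaseAgent`, whose interface is not shown here. The example assumes a hypothetical reply format in which the model wraps each action in `<action>...</action>` tags; that format, the `parse_model_output` helper, and the choice to fall back to `FAIL` on unparseable replies are all illustrative assumptions.

```python
import re

SPECIAL_TOKENS = ("DONE", "FAIL", "WAIT")

# Hypothetical wire format: each action is wrapped in <action>...</action>.
ACTION_RE = re.compile(r"<action>(.*?)</action>", re.DOTALL)


def parse_model_output(text):
    """Turn a raw model reply into a list of pyautogui code strings,
    or a single special token."""
    stripped = text.strip()
    if stripped in SPECIAL_TOKENS:
        return [stripped]
    actions = [body.strip() for body in ACTION_RE.findall(text) if body.strip()]
    # Assumption: a reply with no recognizable action is treated as a failure.
    return actions or ["FAIL"]


if __name__ == "__main__":
    reply = "Clicking the button. <action>pyautogui.click(100, 200)</action>"
    print(parse_model_output(reply))   # ['pyautogui.click(100, 200)']
    print(parse_model_output("DONE"))  # ['DONE']
```

A real agent would call a function like this from whatever method `BaseAgent` requires subclasses to implement, then hand the resulting strings to the runtime for execution.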
If you use OpenComputer in your research or find it helpful, please cite:
```bibtex
@misc{wei2026opencomputerverifiablesoftwareworlds,
      title={OpenComputer: Verifiable Software Worlds for Computer-Use Agents},
      author={Jinbiao Wei and Qianran Ma and Yilun Zhao and Xiao Zhou and Kangqi Ni and Guo Gan and Arman Cohan},
      year={2026},
      eprint={2605.19769},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.19769},
}
```

OpenComputer is released under the Apache License 2.0.
Copyright 2026 Yale NLP Lab
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0