Max Fu*,1,2, Justin Yu*,2, Karim El-Refai*,2, Ethan Kou*,2, Haoru Xue*,1,2, Huang Huang3, Wenli Xiao4, Guanzhi Wang1, Fei-Fei Li3, Guanya Shi4, Jiajun Wu3, Shankar Sastry2, Yuke Zhu1, Ken Goldberg†,2, Jim Fan†,1
1NVIDIA 2UC Berkeley 3Stanford University 4Carnegie Mellon University
*Equal contribution †Equal advising
CaP-X is an open-access framework for systematically studying Code-as-Policy agents in robot manipulation. It consists of four components:
| Component | What it does |
|---|---|
| CaP-Gym | Interactive Gymnasium environments where agents control robots by generating Python code that composes perception and control primitives. 39 tasks across Robosuite, LIBERO-PRO, and BEHAVIOR. |
| CaP-Bench | Systematic benchmark evaluating coding agents across abstraction levels, interaction modes, and visual grounding modalities. 8 tiers (S1-S4 single-turn, M1-M4 multi-turn). |
| CaP-Agent0 | Training-free agentic framework with multi-turn visual differencing, auto-synthesized skill libraries, and parallel ensembled reasoning. |
| CaP-RL | Reinforcement learning on the coding agent via GRPO, using environment rewards to post-train language models. Transfers from sim to real with minimal gap. |
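As a purely illustrative example, a Code-as-Policy program composes perception and control primitives in ordinary Python. The primitive names below (`detect`, `grasp`, `place_on`) are hypothetical stand-ins, not the actual CaP-Gym API:

```python
# Hypothetical sketch of an agent-generated policy. The Env stub records
# primitive calls so the control flow is easy to inspect.
class Env:
    def __init__(self):
        self.log = []

    def detect(self, name):           # perception primitive (stub)
        self.log.append(("detect", name))
        return name

    def grasp(self, obj):             # control primitive (stub)
        self.log.append(("grasp", obj))

    def place_on(self, obj, target):  # control primitive (stub)
        self.log.append(("place_on", obj, target))

def policy(env):
    """Stack the red cube on the blue cube."""
    red = env.detect("red cube")
    blue = env.detect("blue cube")
    env.grasp(red)
    env.place_on(red, blue)

env = Env()
policy(env)
print(env.log)
```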
CaP-X uses `uv` for dependency management and requires Python 3.10 and a CUDA-capable GPU.
```bash
git clone --recurse-submodules https://github.com/capgym/cap-x && cd CaP-X
# Or, if already cloned without --recurse-submodules:
git submodule update --init --recursive

# Install uv (if not present)
curl -LsSf https://astral.sh/uv/install.sh | sh
uv python install 3.10 && uv venv -p 3.10

# Base install
uv sync
```

Pick one simulator family to install; Robosuite and LIBERO conflict in the same environment.
```bash
uv sync --extra robosuite
```

LIBERO requires a separate virtual environment (its Robosuite fork conflicts with the standard one):

```bash
uv venv .venv-libero --python 3.12
source .venv-libero/bin/activate
uv sync --active --extra libero --extra contactgraspnet
```

See docs/libero-tasks.md for running any of the 130+ LIBERO tasks.
BEHAVIOR tasks run on NVIDIA Isaac Sim via OmniGibson. Requires Python 3.10 and CUDA 12.x.
```bash
cd capx/third_party/b1k
./uv_install.sh --dataset  # installs OmniGibson, Isaac Sim, BDDL, cuRobo, and downloads assets
cd ../../..                # back to the repo root

# Post-install fix: copy the cuRobo JIT headers into site-packages
cp capx/third_party/curobo/src/curobo/curobolib/cpp/*.h \
  $(python -c "import sysconfig; print(sysconfig.get_path('purelib'))")/curobo/curobolib/cpp/
```

The `--dataset` flag downloads robot assets, BEHAVIOR-1K scene/object assets, and the 2025 challenge task instances. You will be prompted to accept the NVIDIA Isaac Sim EULA and the BEHAVIOR dataset license; to auto-accept, add `--accept-dataset-tos`.
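A quick way to confirm that the header copy landed in the right place (a sanity-check sketch, not part of the install scripts):

```python
import glob
import sysconfig
from pathlib import Path

# Locate the cuRobo JIT header directory inside the active environment's
# site-packages and count the copied headers.
def curobo_header_dir() -> Path:
    return Path(sysconfig.get_path("purelib")) / "curobo" / "curobolib" / "cpp"

headers = glob.glob(str(curobo_header_dir() / "*.h"))
print(f"found {len(headers)} header(s) in {curobo_header_dir()}")
```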
For headless servers:
```bash
sudo apt-get update && sudo apt-get install -y libegl1 libgl1
# Remove a duplicate Vulkan ICD if present (it causes segfaults on multi-GPU systems)
sudo rm -f /usr/share/vulkan/icd.d/nvidia_icd.json
```

See docs/behavior-tasks.md for task details and expected baselines.
```bash
uv sync --extra verl             # RL training with VeRL/GRPO
uv sync --extra contactgraspnet  # Contact-GraspNet grasp planning
uv sync --extra curobo           # cuRobo GPU-accelerated IK & motion planning (requires CUDA)
```

Perception servers (SAM3, ContactGraspNet, PyRoKi) are launched automatically by the YAML config when you run an evaluation; no manual setup is required for most configs.
To pre-launch servers (e.g., for sharing across multiple eval runs):

```bash
# Start SAM3 + GraspNet + PyRoKi with automatic GPU allocation
uv run --no-sync --active capx/serving/launch_servers.py --profile default
```

Use `--dry-run` to preview the allocation. Other profiles:

```bash
--profile full     # All perception servers (SAM3, GraspNet, PyRoKi, OWL-ViT, SAM2)
--profile minimal  # PyRoKi only (for oracle/privileged evals)
```

The evaluation harness queries an LLM through a local proxy that exposes an OpenAI-compatible API.
```bash
# OpenRouter (get a key at openrouter.ai/keys)
echo "sk-or-v1-your-key-here" > .openrouterkey
uv run --no-sync --active capx/serving/openrouter_server.py --key-file .openrouterkey --port 8110
```

Note: `.openrouterkey` is git-ignored. The default server URL in configs is `http://127.0.0.1:8110/chat/completions`.
See docs/configuration.md for all provider options (OpenRouter, NVIDIA, vLLM, custom).
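Because the proxy speaks the standard OpenAI chat-completions format, any OpenAI-style client can talk to it. A minimal stdlib-only sketch (the URL and model name are taken from the examples in this README; adjust for your setup):

```python
import json
import urllib.request

PROXY_URL = "http://127.0.0.1:8110/chat/completions"

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Assemble an OpenAI-style chat-completions request for the local proxy."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        PROXY_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# With the proxy running:
# resp = urllib.request.urlopen(build_request("google/gemini-3.1-pro-preview", "Hello"))
# print(json.load(resp)["choices"][0]["message"]["content"])
```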
```bash
# Robosuite: single-turn benchmark (100 trials, 12 parallel workers)
uv run --no-sync --active capx/envs/launch.py \
  --config-path env_configs/cube_stack/franka_robosuite_cube_stack.yaml \
  --model "google/gemini-3.1-pro-preview"

# Robosuite: multi-turn with visual differencing
uv run --no-sync --active capx/envs/launch.py \
  --config-path env_configs/cube_stack/franka_robosuite_cube_stack_multiturn_vdm.yaml \
  --model "google/gemini-3.1-pro-preview"

# LIBERO-PRO: spatial task (requires .venv-libero)
source .venv-libero/bin/activate
uv run --no-sync --active capx/envs/launch.py \
  --config-path env_configs/libero/franka_libero_spatial_0.yaml \
  --model "google/gemini-3.1-pro-preview"

# BEHAVIOR: R1Pro radio pickup (20 trials)
OMNI_KIT_ACCEPT_EULA=YES OMNIGIBSON_HEADLESS=1 \
uv run --no-sync --active capx/envs/launch.py \
  --config-path env_configs/r1pro/r1pro_pick_up_radio.yaml \
  --model "google/gemini-3.1-pro-preview"
```
```bash
# Interactive Web UI
uv run --no-sync --active capx/envs/launch.py \
  --config-path env_configs/cube_stack/franka_robosuite_cube_stack.yaml \
  --web-ui True
# Then open http://localhost:8200
```
```bash
# Regression tests
./scripts/regression_test.sh quick  # 10-trial smoke test (~30s)
./scripts/regression_test.sh test1  # Full single-turn (~3 min)
```

Tip (BEHAVIOR): Isaac Sim uses `OMNIGIBSON_GPU_ID` (not `CUDA_VISIBLE_DEVICES`) to select its GPU. For best performance on multi-GPU systems, run the perception servers on a separate GPU, e.g. `OMNIGIBSON_GPU_ID=0` for the eval and pre-launched SAM3/GraspNet with `CUDA_VISIBLE_DEVICES=1`. Set `OMNI_KIT_ACCEPT_EULA=YES` and `OMNIGIBSON_HEADLESS=1` for headless evaluation.
| Guide | Contents |
|---|---|
| Adding Environments | Creating simulators, task environments, YAML configs |
| Adding APIs | Implementing and registering new robot control APIs |
| Configuration | YAML format, CLI flags, LLM provider setup |
| LIBERO-PRO Tasks | Setup, running any of 130+ LIBERO tasks, suite reference |
| BEHAVIOR Tasks | Setup, R1Pro tasks, expected baselines, environment variables |
| Development | Testing, linting, LIBERO/GraspNet setup, checkpoints, known issues |
| Real-World Franka Panda Bringup | Bringup with robots_realtime, real-robot QuickStart |
| RL Training | CaP-RL with GRPO/VeRL, sim-to-real transfer |
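CaP-RL post-trains the coding LLM with GRPO, whose core idea is to score each rollout's reward relative to the other rollouts sampled for the same prompt. A minimal sketch of that group-relative advantage (illustrative only, not CaP-RL's implementation):

```python
import statistics

# Group-relative advantage as used in GRPO-style training: each reward in a
# group of rollouts for the same prompt is normalized by the group's mean
# and standard deviation.
def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# A symmetric success/failure group yields advantages near +/-1
print(group_advantages([1.0, 0.0, 1.0, 0.0]))
```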
```bibtex
@inproceedings{fu2025capx,
  title  = {{CaP-X}: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation},
  author = {Fu, Max and Yu, Justin and El-Refai, Karim and Kou, Ethan and Xue, Haoru and Huang, Huang and Xiao, Wenli and Wang, Guanzhi and Li, Fei-Fei and Shi, Guanya and Wu, Jiajun and Sastry, Shankar and Zhu, Yuke and Goldberg, Ken and Fan, Jim},
  year   = {2025}
}
```

This project is released under the MIT License.