
CaP-X

A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation

Project Page  |  Paper

Max Fu*,1,2, Justin Yu*,2, Karim El-Refai*,2, Ethan Kou*,2, Haoru Xue*,1,2, Huang Huang3, Wenli Xiao4, Guanzhi Wang1, Fei-Fei Li3, Guanya Shi4, Jiajun Wu3, Shankar Sastry2, Yuke Zhu1, Ken Goldberg†,2, Jim Fan†,1

1NVIDIA   2UC Berkeley   3Stanford University   4Carnegie Mellon University

*Equal contribution   †Equal advising


CaP-X is an open-access framework for systematically studying Code-as-Policy agents in robot manipulation. It consists of four components:

| Component | What it does |
| --- | --- |
| CaP-Gym | Interactive Gymnasium environments where agents control robots by generating Python code that composes perception and control primitives. 39 tasks across Robosuite, LIBERO-PRO, and BEHAVIOR. |
| CaP-Bench | Systematic benchmark evaluating coding agents across abstraction levels, interaction modes, and visual grounding modalities. 8 tiers (S1-S4 single-turn, M1-M4 multi-turn). |
| CaP-Agent0 | Training-free agentic framework with multi-turn visual differencing, auto-synthesized skill libraries, and parallel ensembled reasoning. |
| CaP-RL | Reinforcement learning on the coding agent via GRPO, using environment rewards to post-train language models. Transfers from sim to real with minimal gap. |
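The CaP-Gym row above describes agents that act by emitting Python which composes perception and control primitives. A minimal sketch of that code-as-policy pattern follows; the primitive names and signatures (`detect`, `grasp`, `place`) are hypothetical placeholders for illustration, not the actual CaP-Gym API:

```python
# Sketch of the code-as-policy pattern: the agent emits a short Python
# program that composes perception and control primitives, and the
# environment executes it. Primitive names/signatures are hypothetical.

def detect(obj_name):
    """Stub perception primitive: return a fake 3D position for obj_name."""
    positions = {"red_cube": (0.4, 0.0, 0.02), "blue_cube": (0.4, 0.1, 0.02)}
    return positions[obj_name]

trace = []  # record of executed control-primitive calls

def grasp(pos):
    trace.append(("grasp", pos))

def place(pos):
    trace.append(("place", pos))

# --- code an agent might generate for a cube-stacking task ---
agent_code = """
red = detect("red_cube")
blue = detect("blue_cube")
grasp(red)
place((blue[0], blue[1], blue[2] + 0.04))  # drop on top of the blue cube
"""

# Execute the generated program against the exposed primitives.
exec(agent_code, {"detect": detect, "grasp": grasp, "place": place})
print(trace)
```

The environment then scores the resulting state, which is what makes the same interface usable both for benchmarking (CaP-Bench) and as a reward signal for RL (CaP-RL).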

Installation

CaP-X uses uv for dependency management. Requires Python 3.10 and a CUDA-capable GPU.

git clone --recurse-submodules https://github.com/capgym/cap-x && cd cap-x

# Or if already cloned without --recurse-submodules:
git submodule update --init --recursive

# Install uv (if not present)
curl -LsSf https://astral.sh/uv/install.sh | sh

uv python install 3.10 && uv venv -p 3.10

# Base install
uv sync

Simulator-specific setup

Pick one simulator family to install — Robosuite and LIBERO conflict in the same environment.

Robosuite

uv sync --extra robosuite

LIBERO-PRO

LIBERO requires a separate virtual environment (its Robosuite fork conflicts with the standard one).

uv venv .venv-libero --python 3.12
source .venv-libero/bin/activate
uv sync --active --extra libero --extra contactgraspnet

See docs/libero-tasks.md for running any of 130+ LIBERO tasks.

BEHAVIOR (Isaac Sim)

BEHAVIOR tasks run on NVIDIA Isaac Sim via OmniGibson. Requires Python 3.10 and CUDA 12.x.

cd capx/third_party/b1k
./uv_install.sh --dataset          # installs OmniGibson, Isaac Sim, BDDL, cuRobo, and downloads assets
cd ../../..                        # back to repo root

# Post-install fixes — copy cuRobo JIT headers to site-packages
cp capx/third_party/curobo/src/curobo/curobolib/cpp/*.h \
   $(python -c "import sysconfig; print(sysconfig.get_path('purelib'))")/curobo/curobolib/cpp/

The --dataset flag downloads robot assets, BEHAVIOR-1K scene/object assets, and 2025 challenge task instances. You will be prompted to accept the NVIDIA Isaac Sim EULA and BEHAVIOR dataset license. To auto-accept, add --accept-dataset-tos.

For headless servers:

sudo apt-get update && sudo apt-get install -y libegl1 libgl1
# Remove duplicate Vulkan ICD if present (causes segfault on multi-GPU systems)
sudo rm -f /usr/share/vulkan/icd.d/nvidia_icd.json

See docs/behavior-tasks.md for task details and expected baselines.

Optional extras

uv sync --extra verl             # RL training with VeRL/GRPO
uv sync --extra contactgraspnet  # Contact-GraspNet grasp planning
uv sync --extra curobo           # cuRobo GPU-accelerated IK & motion planning (requires CUDA)

Quick Start

1. Perception servers (auto-launched)

Perception servers (SAM3, ContactGraspNet, PyRoKi) are auto-launched by the YAML config when you run an evaluation. No manual setup required for most configs.

To pre-launch servers (e.g. for sharing across multiple eval runs):

# Start SAM3 + GraspNet + PyRoKi with automatic GPU allocation
uv run --no-sync --active capx/serving/launch_servers.py --profile default

Use --dry-run to preview the allocation. Other profiles:

--profile full      # All perception servers (SAM3, GraspNet, PyRoKi, OWL-ViT, SAM2)
--profile minimal   # PyRoKi only (for oracle/privileged evals)

2. Set up an LLM proxy

The evaluation harness queries an LLM through a local proxy that exposes an OpenAI-compatible API.

# OpenRouter (get a key at openrouter.ai/keys)
echo "sk-or-v1-your-key-here" > .openrouterkey
uv run --no-sync --active capx/serving/openrouter_server.py --key-file .openrouterkey --port 8110

Note: .openrouterkey is git-ignored. The default server URL in configs is http://127.0.0.1:8110/chat/completions.

See docs/configuration.md for all provider options (OpenRouter, NVIDIA, vLLM, custom).
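To see what the harness sends over the wire, here is a minimal sketch of querying the proxy by hand. The endpoint and model name come from this README; the exact request fields are an assumption based on the standard OpenAI chat-completions schema, and the prompt strings are made up:

```python
import json

# Local proxy endpoint from step 2 (OpenAI-compatible).
URL = "http://127.0.0.1:8110/chat/completions"

# Assumed request shape: standard OpenAI chat-completions fields.
payload = {
    "model": "google/gemini-3.1-pro-preview",
    "messages": [
        {"role": "system", "content": "You control a robot by writing Python."},
        {"role": "user", "content": "Stack the red cube on the blue cube."},
    ],
}
body = json.dumps(payload).encode()

# To actually send it (requires the proxy from step 2 to be running):
# import urllib.request
# req = urllib.request.Request(
#     URL, data=body, headers={"Content-Type": "application/json"})
# reply = json.loads(urllib.request.urlopen(req).read())
# generated_code = reply["choices"][0]["message"]["content"]
print(len(body), "bytes")
```

Because the proxy speaks the OpenAI schema, any OpenAI-compatible client library can be pointed at it by overriding the base URL.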

3. Run evaluation

# Robosuite: single-turn benchmark (100 trials, 12 parallel workers)
uv run --no-sync --active capx/envs/launch.py \
    --config-path env_configs/cube_stack/franka_robosuite_cube_stack.yaml \
    --model "google/gemini-3.1-pro-preview"

# Robosuite: multi-turn with visual differencing
uv run --no-sync --active capx/envs/launch.py \
    --config-path env_configs/cube_stack/franka_robosuite_cube_stack_multiturn_vdm.yaml \
    --model "google/gemini-3.1-pro-preview"

# LIBERO-PRO: spatial task (requires .venv-libero)
source .venv-libero/bin/activate
uv run --no-sync --active capx/envs/launch.py \
    --config-path env_configs/libero/franka_libero_spatial_0.yaml \
    --model "google/gemini-3.1-pro-preview"

# BEHAVIOR: R1Pro radio pickup (20 trials)
OMNI_KIT_ACCEPT_EULA=YES OMNIGIBSON_HEADLESS=1 \
uv run --no-sync --active capx/envs/launch.py \
    --config-path env_configs/r1pro/r1pro_pick_up_radio.yaml \
    --model "google/gemini-3.1-pro-preview"

# Interactive Web UI
uv run --no-sync --active capx/envs/launch.py \
    --config-path env_configs/cube_stack/franka_robosuite_cube_stack.yaml \
    --web-ui True
# Open http://localhost:8200

# Regression tests
./scripts/regression_test.sh quick    # 10-trial smoke test (~30s)
./scripts/regression_test.sh test1    # Full single-turn (~3 min)

Tip (BEHAVIOR): Isaac Sim uses OMNIGIBSON_GPU_ID (not CUDA_VISIBLE_DEVICES) to select the GPU. For best performance on multi-GPU systems, run perception servers on a separate GPU (e.g. OMNIGIBSON_GPU_ID=0 for the eval, and pre-launch SAM3/GraspNet with CUDA_VISIBLE_DEVICES=1). Set OMNI_KIT_ACCEPT_EULA=YES and OMNIGIBSON_HEADLESS=1 for headless evaluation.
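The GPU split in the tip above can be expressed as the environments passed to the two processes. This helper is a hypothetical illustration (not part of CaP-X); only the environment-variable names come from the README:

```python
import os

# Isaac Sim reads OMNIGIBSON_GPU_ID, while the perception servers respect
# CUDA_VISIBLE_DEVICES, so each process gets its own environment dict.

def behavior_eval_env(sim_gpu: int) -> dict:
    """Environment for the headless BEHAVIOR evaluation process."""
    return {
        **os.environ,
        "OMNIGIBSON_GPU_ID": str(sim_gpu),  # GPU used by Isaac Sim
        "OMNI_KIT_ACCEPT_EULA": "YES",      # skip the interactive EULA prompt
        "OMNIGIBSON_HEADLESS": "1",         # no viewer window
    }

def perception_env(server_gpu: int) -> dict:
    """Environment for pre-launched SAM3/GraspNet servers."""
    return {**os.environ, "CUDA_VISIBLE_DEVICES": str(server_gpu)}

# Simulator on GPU 0, perception servers on GPU 1, as in the tip above.
eval_env = behavior_eval_env(sim_gpu=0)
servers_env = perception_env(server_gpu=1)
print(eval_env["OMNIGIBSON_GPU_ID"], servers_env["CUDA_VISIBLE_DEVICES"])
```

Pass the resulting dicts as the `env=` argument of `subprocess.Popen` (or export the variables in each shell, as the BEHAVIOR example above does inline).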


Documentation

| Guide | Contents |
| --- | --- |
| Adding Environments | Creating simulators, task environments, YAML configs |
| Adding APIs | Implementing and registering new robot control APIs |
| Configuration | YAML format, CLI flags, LLM provider setup |
| LIBERO-PRO Tasks | Setup, running any of 130+ LIBERO tasks, suite reference |
| BEHAVIOR Tasks | Setup, R1Pro tasks, expected baselines, environment variables |
| Development | Testing, linting, LIBERO/GraspNet setup, checkpoints, known issues |
| Real-World Franka Panda Bringup | Bringup with robots_realtime, real-robot QuickStart |
| RL Training | CaP-RL with GRPO/VeRL, sim-to-real transfer |

Citation

@inproceedings{fu2025capx,
  title     = {{CaP-X}: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation},
  author    = {Fu, Max and Yu, Justin and El-Refai, Karim and Kou, Ethan and Xue, Haoru and Huang, Huang and Xiao, Wenli and Wang, Guanzhi and Li, Fei-Fei and Shi, Guanya and Wu, Jiajun and Sastry, Shankar and Zhu, Yuke and Goldberg, Ken and Fan, Jim},
  year      = {2025}
}

License

This project is released under the MIT License.
