alexantaluo0/ACoT-VLA-WM

ACOT-VLA-WM: Precise Robotic Subgoal Generation and Execution

Project Page | arXiv | License: CC BY 4.0 | License: MIT

ACOT-VLA-WM is an improved variant of ACoT-VLA that integrates a Predictive World Model into the Action Chain-of-Thought framework, enabling precise robotic subgoal generation and execution on long-horizon, high-precision manipulation tasks.

📖 Full technical details, demos, and evaluation videos: acotvla-wm.xyz


Overview

Robot foundation models need not only a semantic understanding of task objectives but also concrete intermediate subgoals during both training and inference. ACOT-VLA-WM addresses this by tightly coupling the main ACoT-VLA model with a predictive world model: the world model forecasts multi-view future frames, which are embedded as visual subgoals into the main model's training. This significantly improves physical fault tolerance and action precision.

Key Improvements over ACoT-VLA

  • World Model Integration: A finetuned predictive world model (BAGEL-based) generates multi-view future subgoal images that are embedded into ACoT-VLA training prompts.
  • Mixed Subgoal Sampling (75% / 12.5% / 12.5%):
    • 75% — uniformly sample a future frame 0–4 seconds ahead, improving robustness to execution delay and speed perturbations.
    • 12.5% — use the sub-step terminal frame as subgoal.
    • 12.5% — use world-model-generated future frames as subgoal.
  • Robust Long-Horizon Execution: On 5 industrial manipulation scenarios (10 rollouts each), baseline ACoT-VLA achieves 80% overall success rate, while ACOT-VLA-WM reaches 100%, including challenging tasks such as picking up a barcode scanner and scanning 5 QR codes on a marble table.

Core training logic lives in src/openpi/training/subgoal_dataset.py and src/openpi/training/sampler.py, which dynamically schedule world-model predictions and real future frames during training.
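The 75% / 12.5% / 12.5% mixture described above can be sketched in a few lines. This is a minimal illustration, not the code in src/openpi/training/subgoal_dataset.py or sampler.py: the names `sample_subgoal` and `SubgoalChoice`, the frame-index arithmetic, and the clamping to the episode end are all assumptions made for the example. Only the mixture weights and the 0–4 second horizon come from the description above.

```python
import random
from dataclasses import dataclass
from typing import Optional

# Mixture weights and horizon are taken from the README text;
# every name below is hypothetical and only illustrates the idea.
P_FUTURE = 0.75
P_TERMINAL = 0.125
MAX_HORIZON_S = 4.0


@dataclass
class SubgoalChoice:
    source: str                 # "future", "terminal", or "world_model"
    frame_index: Optional[int]  # None when the world model supplies the image


def sample_subgoal(t: int, substep_end: int, episode_len: int,
                   fps: float, rng: random.Random) -> SubgoalChoice:
    """Pick the subgoal source for the training sample at frame t."""
    last = episode_len - 1
    u = rng.random()
    if u < P_FUTURE:
        # Uniform time offset in [0, 4] s, converted to frames and clamped.
        offset_s = rng.uniform(0.0, MAX_HORIZON_S)
        return SubgoalChoice("future", min(t + round(offset_s * fps), last))
    if u < P_FUTURE + P_TERMINAL:
        # Terminal frame of the current sub-step.
        return SubgoalChoice("terminal", min(substep_end, last))
    # Remaining 12.5%: defer to a world-model-generated image.
    return SubgoalChoice("world_model", None)


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_subgoal(t=100, substep_end=250, episode_len=1000, fps=30.0, rng=rng))
```

Drawing a single uniform number and comparing it against cumulative thresholds keeps the three branches mutually exclusive. Passing in a seeded `random.Random` makes the schedule reproducible across runs.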


Get Started

1. Installation

We use uv to manage the Python environment.

git clone https://github.com/AgibotTech/ACOT-VLA-WM.git
cd ACOT-VLA-WM
git submodule update --init --recursive
GIT_LFS_SKIP_SMUDGE=1 uv sync
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e .

2. G2 Robot: Video Data → Images (>2× Faster Model Training)

Convert LeRobot video datasets to parquet-embedded image datasets for faster training I/O.

export HF_LEROBOT_HOME=/data/dataset/Robotdataset/Robotdataset/G2_Robot/phone_packaging

uv run python scripts/predecode_lerobot_videos_to_images.py \
  --input-repo-id video_data \
  --output-repo-id images_data \
  --image-format jpeg \
  --jpeg-quality 98 \
  --num-workers 16 \
  --overwrite

3. Compute Normalization Statistics

uv run scripts/compute_norm_stats.py \
  --config-name acot-vla-wm-task \
  --robot-action-dim=24
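For intuition, the sketch below shows what normalization statistics for a fixed-width action vector usually amount to. It assumes per-dimension mean and standard deviation with z-score normalization; the actual statistics written by scripts/compute_norm_stats.py are not documented here and may differ (for example, quantile-based). The function names `compute_norm_stats` and `normalize` are hypothetical, and the example uses 2 dimensions instead of 24 for brevity.

```python
import math
from typing import List, Sequence, Tuple


def compute_norm_stats(actions: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Return per-dimension (mean, std) over a list of action vectors."""
    n = len(actions)
    dim = len(actions[0])
    mean = [sum(a[d] for a in actions) / n for d in range(dim)]
    # Population standard deviation, computed per dimension.
    std = [
        math.sqrt(sum((a[d] - mean[d]) ** 2 for a in actions) / n)
        for d in range(dim)
    ]
    return mean, std


def normalize(action: Sequence[float], mean: Sequence[float],
              std: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Z-score one action; eps guards against constant dimensions."""
    return [(x - m) / (s + eps) for x, m, s in zip(action, mean, std)]


if __name__ == "__main__":
    demo_actions = [[0.0, 10.0], [2.0, 10.0]]  # toy data, not from the dataset
    mean, std = compute_norm_stats(demo_actions)
    print(mean, std, normalize([2.0, 10.0], mean, std))
```

The `eps` term matters for dimensions that never change in the dataset, such as an unused joint, where the standard deviation is zero.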

4. Model Training

# Load image-based dataset for training
export XLA_PYTHON_CLIENT_MEM_FRACTION=0.6

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 uv run scripts/train.py acot-vla-wm-task \
  --exp-name=id02 \
  --data.use-parquet-images \
  --batch-size 64 \
  --num-workers 16 \
  --overwrite

5. Model Deployment

export CUDA_VISIBLE_DEVICES=4,5,6,7
export XLA_PYTHON_CLIENT_MEM_FRACTION=0.5

GIT_LFS_SKIP_SMUDGE=1 uv run python scripts/serve_policy.py \
  --env G2SIM \
  --port 8067 \
  policy:checkpoint \
  --policy.config acot-vla-wm-task \
  --policy.dir "/data/code/ACoT-VLA/checkpoints/acot-vla-wm-task/id01/20000"

How to Train the World Model?

ACOT-VLA-WM relies on a finetuned predictive world model to generate multi-view future subgoal images. For world model training and finetuning, see the companion repository:

BAGEL-WM

Acknowledgements

This repo is built upon the OpenPI framework and the ACoT-VLA codebase. We sincerely thank the authors for their contributions to the community.
