ACOT-VLA-WM is an improved variant of ACoT-VLA that integrates a Predictive World Model into the Action Chain-of-Thought framework, enabling precise robotic subgoal generation and execution on long-horizon, high-precision manipulation tasks.
📖 Full technical details, demos, and evaluation videos: acotvla-wm.xyz
Foundational robot models need not only a semantic understanding of the task objective but also concrete intermediate subgoals during both training and inference. ACOT-VLA-WM addresses this by tightly coupling the main ACoT-VLA model with a predictive world model: the world model forecasts multi-view future frames, which are embedded as visual subgoals into the main model's training, significantly improving physical fault tolerance and action precision.
- World Model Integration: A finetuned predictive world model (BAGEL-based) generates multi-view future subgoal images that are embedded into ACoT-VLA training prompts.
- Mixed Subgoal Sampling (75% / 12.5% / 12.5%):
  - 75%: uniformly sample a future frame 0–4 seconds ahead, improving robustness to execution delay and speed perturbations.
  - 12.5%: use the sub-step terminal frame as the subgoal.
  - 12.5%: use world-model-generated future frames as the subgoal.
- Robust Long-Horizon Execution: On 5 industrial manipulation scenarios (10 rollouts each), the baseline ACoT-VLA achieves an 80% overall success rate, while ACOT-VLA-WM reaches 100%, including on challenging tasks such as picking up a barcode scanner and scanning 5 QR codes on a marble table.
Core training logic lives in src/openpi/training/subgoal_dataset.py and src/openpi/training/sampler.py, which dynamically schedule world-model predictions and real future frames during training.
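The 75% / 12.5% / 12.5% mixture described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code in `src/openpi/training/sampler.py`: the function name `sample_subgoal_source`, its arguments, the `fps` conversion, and the clamping of the look-ahead to the end of the episode are all assumptions made for the example.

```python
import random

# Mixture weights and look-ahead window quoted in the feature list.
P_UNIFORM_FUTURE = 0.75
P_SUBSTEP_TERMINAL = 0.125
MAX_LOOKAHEAD_S = 4.0


def sample_subgoal_source(t, fps, substep_end, episode_len, rng):
    """Pick the subgoal source for the frame at index t (hypothetical helper).

    Returns (source, frame_index). For the world-model branch frame_index is
    None, because the subgoal image is generated rather than read from the
    episode.
    """
    u = rng.random()
    if u < P_UNIFORM_FUTURE:
        # Uniform look-ahead of 0-4 s, converted to frames.
        # Clamping to the last frame is an assumption of this sketch.
        offset = int(round(rng.uniform(0.0, MAX_LOOKAHEAD_S) * fps))
        return "uniform_future", min(t + offset, episode_len - 1)
    if u < P_UNIFORM_FUTURE + P_SUBSTEP_TERMINAL:
        # Terminal frame of the current sub-step.
        return "substep_terminal", substep_end
    # Remaining 12.5%: defer to the world model for a generated frame.
    return "world_model", None


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        print(sample_subgoal_source(t=100, fps=30, substep_end=150,
                                    episode_len=200, rng=rng))
```

Passing an explicit `random.Random` instance keeps the draw reproducible per data-loader worker; whether the actual sampler seeds its randomness this way is not stated here.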
We use uv to manage the Python environment.
git clone https://github.com/AgibotTech/ACOT-VLA-WM.git
cd ACOT-VLA-WM
git submodule update --init --recursive
GIT_LFS_SKIP_SMUDGE=1 uv sync
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e .

Convert LeRobot video datasets to parquet-embedded image datasets for faster training I/O.
export HF_LEROBOT_HOME=/data/dataset/Robotdataset/Robotdataset/G2_Robot/phone_packaging
uv run python scripts/predecode_lerobot_videos_to_images.py \
--input-repo-id video_data \
--output-repo-id images_data \
--image-format jpeg \
--jpeg-quality 98 \
--num-workers 16 \
--overwrite

uv run scripts/compute_norm_stats.py \
--config-name acot-vla-wm-task \
--robot-action-dim=24

# Load image-based dataset for training
export XLA_PYTHON_CLIENT_MEM_FRACTION=0.6
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 uv run scripts/train.py acot-vla-wm-task \
--exp-name=id02 \
--data.use-parquet-images \
--batch-size 64 \
--num-workers 16 \
--overwrite

export CUDA_VISIBLE_DEVICES=4,5,6,7
export XLA_PYTHON_CLIENT_MEM_FRACTION=0.5
GIT_LFS_SKIP_SMUDGE=1 uv run python scripts/serve_policy.py \
--env G2SIM \
--port 8067 \
policy:checkpoint \
--policy.config acot-vla-wm-task \
--policy.dir "/data/code/ACoT-VLA/checkpoints/acot-vla-wm-task/id01/20000"

ACOT-VLA-WM relies on a finetuned predictive world model to generate multi-view future subgoal images. For world model training and finetuning, see the companion repository:
This repo is built upon the OpenPI framework and the ACoT-VLA codebase. We sincerely thank the authors for their contributions to the community.