ICML 2024
Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, Chuang Gan
3D-VLA is a framework that connects vision-language-action (VLA) models to the 3D physical world. Unlike traditional 2D models, 3D-VLA integrates 3D perception, reasoning, and action through a generative world model, similar to human cognitive processes. It is built on the 3D-LLM and uses interaction tokens to engage with the environment. Embodied diffusion models are trained and aligned with the LLM to predict goal images and point clouds.
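To make the interaction-token idea above concrete, here is a minimal, hypothetical sketch of how special tokens could be registered on a Hugging Face seq2seq backbone and the embedding table resized to match. The token names and the `t5-base` placeholder backbone are assumptions for illustration only; the actual token set, backbone, and wiring used by 3D-VLA are defined in this repository.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hypothetical interaction tokens for illustration only; 3D-VLA defines its own token set.
INTERACTION_TOKENS = ["<scene>", "</scene>", "<pred>", "</pred>", "<action>", "</action>"]

tokenizer = AutoTokenizer.from_pretrained("t5-base")           # placeholder backbone, not the 3D-LLM checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

# Register the new special tokens and grow the embedding table accordingly.
tokenizer.add_special_tokens({"additional_special_tokens": INTERACTION_TOKENS})
model.resize_token_embeddings(len(tokenizer))

# A prompt that interleaves scene tokens, an instruction, and an action slot.
prompt = "<scene> ...scene features... </scene> pick up the red mug <action>"
inputs = tokenizer(prompt, return_tensors="pt")
print(inputs.input_ids.shape)
```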
To set up the environment:

```bash
conda create -n 3dvla python=3.9
conda activate 3dvla
pip install -r requirements.txt
```
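After installing, a quick sanity check that PyTorch (pulled in through `requirements.txt`) is importable and sees your GPU; the exact versions depend on what the requirements file pins:

```python
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```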
- Train the goal image latent diffusion model with the following command:

  ```bash
  bash launcher/train_ldm.sh
  ```

  If you want to include depth information, you could add `--include_depth` to the command in the `train_ldm.sh` file.

- Then you could generate the goal image with the following command:

  ```bash
  python inference_ldm_goal_image.py --ckpt_folder lavis/output/LDM/runs/pix2pix (--include_depth)
  ```

  The results will be saved in the `lavis/output/LDM/results` folder (see the inspection sketch below).
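The generated goal images are ordinary image files, so they can be inspected with standard tooling. A minimal sketch, assuming the outputs are written as PNG files directly inside the results folder (the actual filenames and layout depend on your run):

```python
from pathlib import Path
from PIL import Image

results_dir = Path("lavis/output/LDM/results")   # output folder of inference_ldm_goal_image.py

# The *.png pattern is an assumption; adjust it to match the files your run produces.
for img_path in sorted(results_dir.glob("*.png")):
    img = Image.open(img_path)
    print(img_path.name, img.size)
```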
If you find our work useful, please consider citing:

```bibtex
@article{zhen20243dvla,
  author  = {Zhen, Haoyu and Qiu, Xiaowen and Chen, Peihao and Yang, Jincheng and Yan, Xin and Du, Yilun and Hong, Yining and Gan, Chuang},
  title   = {3D-VLA: 3D Vision-Language-Action Generative World Model},
  journal = {arXiv preprint arXiv:2403.09631},
  year    = {2024},
}
```
We would like to thank the following projects for their great work:
- SAM, ConceptFusion and 3D-CLR for Data Processing.
- Diffusers, InstructPix2Pix, StableDiffusion and Point-E for the Diffusion Model.
- LAVIS and 3D-LLM for the Codebase and Architecture.
- OpenX for the Dataset.
- RLBench and Hiveformer for Evaluation.