ICML 2024
Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, Chuang Gan
3D-VLA is a framework that connects vision-language-action (VLA) models to the 3D physical world. Unlike traditional 2D models, 3D-VLA integrates 3D perception, reasoning, and action through a generative world model, similar to human cognitive processes. It is built on the 3D-LLM and uses interaction tokens to engage with the environment. Embodied diffusion models are trained and aligned with the LLM to predict goal images and point clouds.
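To make the interaction-token idea above concrete, here is a minimal, hypothetical sketch of how special tokens could be registered on a Hugging Face seq2seq backbone and the embedding table resized to match. The token names and the `t5-base` placeholder backbone are assumptions for illustration only; the actual token set, backbone, and wiring used by 3D-VLA are defined in this repository.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hypothetical interaction tokens for illustration only; 3D-VLA defines its own token set.
INTERACTION_TOKENS = ["<scene>", "</scene>", "<pred>", "</pred>", "<action>", "</action>"]

tokenizer = AutoTokenizer.from_pretrained("t5-base")           # placeholder backbone, not the 3D-LLM checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

# Register the new special tokens and grow the embedding table accordingly.
tokenizer.add_special_tokens({"additional_special_tokens": INTERACTION_TOKENS})
model.resize_token_embeddings(len(tokenizer))

# A prompt that interleaves scene tokens, an instruction, and an action slot.
prompt = "<scene> ...scene features... </scene> pick up the red mug <action>"
inputs = tokenizer(prompt, return_tensors="pt")
print(inputs.input_ids.shape)
```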
To set up the environment:

```bash
conda create -n 3dvla python=3.9
conda activate 3dvla
pip install -r requirements.txt
```
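After installing, a quick sanity check that PyTorch (pulled in through `requirements.txt`) is importable and sees your GPU; the exact versions depend on what the requirements file pins:

```python
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```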
- Train the goal image latent diffusion model with the following command:

  ```bash
  bash launcher/train_ldm.sh
  ```

  If you want to include depth information, you could add `--include_depth` to the command in the `train_ldm.sh` file.

- Then you could generate the goal image with the following command:

  ```bash
  python inference_ldm_goal_image.py --ckpt_folder lavis/output/LDM/runs/pix2pix (--include_depth)
  ```

  The results will be saved in the `lavis/output/LDM/results` folder (see the inspection sketch below).
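The generated goal images are ordinary image files, so they can be inspected with standard tooling. A minimal sketch, assuming the outputs are written as PNG files directly inside the results folder (the actual filenames and layout depend on your run):

```python
from pathlib import Path
from PIL import Image

results_dir = Path("lavis/output/LDM/results")   # output folder of inference_ldm_goal_image.py

# The *.png pattern is an assumption; adjust it to match the files your run produces.
for img_path in sorted(results_dir.glob("*.png")):
    img = Image.open(img_path)
    print(img_path.name, img.size)
```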
If you find our work useful, please consider citing:

```bibtex
@article{zhen20243dvla,
  author  = {Zhen, Haoyu and Qiu, Xiaowen and Chen, Peihao and Yang, Jincheng and Yan, Xin and Du, Yilun and Hong, Yining and Gan, Chuang},
  title   = {3D-VLA: 3D Vision-Language-Action Generative World Model},
  journal = {arXiv preprint arXiv:2403.09631},
  year    = {2024},
}
```
We would like to thank the following projects for their great work:
- SAM, ConceptFusion and 3D-CLR for Data Processing.
- Diffusers, InstructPix2Pix, StableDiffusion and Point-E for the Diffusion Model.
- LAVIS and 3D-LLM for the Codebase and Architecture.
- OpenX for the Dataset.
- RLBench and Hiveformer for Evaluation.