
3D-VLA: A 3D Vision-Language-Action Generative World Model

ICML 2024

Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, Chuang Gan

Paper PDF | Project Page

Table of Contents
  1. Method
  2. Installation
  3. Embodied Diffusion Models
  4. Citation
  5. Acknowledgement

Method

3D-VLA is a framework that connects vision-language-action (VLA) models to the 3D physical world. Unlike traditional 2D-only models, 3D-VLA integrates 3D perception, reasoning, and action through a generative world model, mirroring human cognitive processes. It is built on top of 3D-LLM and uses a set of interaction tokens to engage with the embodied environment. Embodied diffusion models are trained and aligned with the LLM to predict goal images and point clouds.


Installation

conda create -n 3dvla python=3.9
conda activate 3dvla
pip install -r requirements.txt
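
If conda is not available, a plain Python virtual environment should also work. This is only a sketch based on the Python 3.9 and requirements.txt requirements above, not an officially supported setup:

python3.9 -m venv .venv          # requires a local Python 3.9 interpreter
source .venv/bin/activate
pip install -r requirements.txt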

Embodied Diffusion Models

Goal Image Generation

  • Train the goal image latent diffusion model with the following command:

    bash launcher/train_ldm.sh

    If you want to include depth information, add --include_depth to the command in the train_ldm.sh file (see the example after this list).

  • Then generate the goal image with the following command:

    python inference_ldm_goal_image.py --ckpt_folder lavis/output/LDM/runs/pix2pix (--include_depth)

    The results will be saved in the lavis/output/LDM/results folder.
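
For example, to train with depth conditioning and then run inference on the resulting checkpoint (this assumes --include_depth has already been added to the command inside launcher/train_ldm.sh, as described above):

bash launcher/train_ldm.sh
python inference_ldm_goal_image.py --ckpt_folder lavis/output/LDM/runs/pix2pix --include_depth

As with the RGB-only variant, the generated results are written to the lavis/output/LDM/results folder.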

Citation

@article{zhen20243dvla,
  author = {Zhen, Haoyu and Qiu, Xiaowen and Chen, Peihao and Yang, Jincheng and Yan, Xin and Du, Yilun and Hong, Yining and Gan, Chuang},
  title = {3D-VLA: 3D Vision-Language-Action Generative World Model},
  journal = {arXiv preprint arXiv:2403.09631},
  year = {2024},
}

Acknowledgement

Here we would like to thank the following resources for their great work: