GaussianVLM: Scene-centric 3D Vision-Language Models using Language-aligned Gaussian Splats for Embodied Reasoning and Beyond
Anna-Maria Halacheva | Jan-Nico Zaech | Xi Wang | Danda Pani Paudel | Luc Van Gool
We are releasing this early-access version of the codebase in response to multiple requests from the community.
> [!CAUTION]
> This is an early release. A more detailed set of instructions and a thoroughly cleaned repository for easier setup will follow in the coming weeks. For immediate setup help or specific queries, please contact the first author.
GaussianVLM is a novel scene-centric 3D Vision-Language Model (VLM) designed for comprehensive 3D scene understanding. By leveraging Language-aligned Gaussian Splatting, our model achieves state-of-the-art results across a wide range of embodied reasoning tasks without the need for traditional object detectors.
- Scene-centric Reasoning: Operates on dense, language-augmented representations.
- Dual Sparsification: Efficiently distills 3D Gaussian features into task-relevant tokens for LLMs (see the illustrative sketch after this list).
- Versatile Benchmarking: High performance on both scene-level (planning/embodied reasoning) and object-level (captioning/QA) tasks.
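To make the dual-sparsification idea concrete, here is a minimal, hypothetical sketch of score-and-select token reduction. The `TokenSparsifier` module, its linear scorer, and all shapes are illustrative assumptions for exposition only, not the module used in this repository:

```python
# Hypothetical sketch of sparsifying dense Gaussian features into a small
# set of LLM tokens. Names and shapes are assumptions, NOT the actual
# GaussianVLM implementation.
import torch
import torch.nn as nn

class TokenSparsifier(nn.Module):
    """Keep the top-k highest-scoring Gaussian features as LLM tokens."""
    def __init__(self, feat_dim: int, num_tokens: int):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)  # learned per-Gaussian relevance
        self.num_tokens = num_tokens

    def forward(self, gaussian_feats: torch.Tensor) -> torch.Tensor:
        # gaussian_feats: (batch, num_gaussians, feat_dim)
        scores = self.scorer(gaussian_feats).squeeze(-1)       # (B, N)
        top_idx = scores.topk(self.num_tokens, dim=1).indices  # (B, K)
        idx = top_idx.unsqueeze(-1).expand(-1, -1, gaussian_feats.size(-1))
        return gaussian_feats.gather(1, idx)                   # (B, K, D)

# Example: distill 20,000 Gaussian features into 256 tokens of dim 512.
feats = torch.randn(2, 20_000, 512)
tokens = TokenSparsifier(512, 256)(feats)
print(tokens.shape)  # torch.Size([2, 256, 512])
```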
This repository is built upon the foundation provided by LEO. We sincerely thank the authors of LEO for their incredible effort and for open-sourcing their framework.
To ensure environment reproducibility, we provide a pre-packaged conda environment created with conda-pack.
```bash
# Create a directory for the environment
mkdir -p gaussian_vlm_env

# Unpack the provided environment archive
tar -xzf gaussian_vlm_env.tar.gz -C gaussian_vlm_env

# Activate the environment, then fix up prefixes (conda-pack workflow)
source gaussian_vlm_env/bin/activate
conda-unpack
```
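Once the environment is activated, a quick sanity check can confirm it resolved correctly. This check assumes the packaged environment includes PyTorch, which the setup steps above do not state explicitly:

```python
# Sanity check after `source gaussian_vlm_env/bin/activate` and `conda-unpack`.
# Assumes PyTorch ships with the packaged environment (an assumption, not
# confirmed by the instructions above).
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```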