GaussianVLM: Scene-centric 3D Vision-Language Models using Language-aligned Gaussian Splats for Embodied Reasoning and Beyond
Anna-Maria Halacheva | Jan-Nico Zaech | Xi Wang | Danda Pani Paudel | Luc Van Gool
We are releasing this early-access version of the codebase in response to multiple requests from the community.
> [!CAUTION]
> This is an early release. A more detailed set of instructions and a thoroughly cleaned repository for easier setup will follow in the coming weeks. For immediate setup help or specific queries, please contact the first author.
GaussianVLM is a novel scene-centric 3D Vision-Language Model (VLM) designed for comprehensive 3D scene understanding. By leveraging Language-aligned Gaussian Splatting, our model achieves state-of-the-art results across a wide range of embodied reasoning tasks without the need for traditional object detectors.
- Scene-centric Reasoning: Operates on dense, language-augmented representations.
- Dual Sparsification: Efficiently distills 3D Gaussian features into task-relevant tokens for LLMs (see the illustrative sketch after this list).
- Versatile Benchmarking: High performance on both scene-level (planning/embodied reasoning) and object-level (captioning/QA) tasks.
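To make the dual-sparsification idea concrete, here is a minimal, hypothetical sketch of score-and-select token reduction. The `TokenSparsifier` module, its linear scorer, and all shapes are illustrative assumptions for exposition only, not the module used in this repository:

```python
# Hypothetical sketch of sparsifying dense Gaussian features into a small
# set of LLM tokens. Names and shapes are assumptions, NOT the actual
# GaussianVLM implementation.
import torch
import torch.nn as nn

class TokenSparsifier(nn.Module):
    """Keep the top-k highest-scoring Gaussian features as LLM tokens."""
    def __init__(self, feat_dim: int, num_tokens: int):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)  # learned per-Gaussian relevance
        self.num_tokens = num_tokens

    def forward(self, gaussian_feats: torch.Tensor) -> torch.Tensor:
        # gaussian_feats: (batch, num_gaussians, feat_dim)
        scores = self.scorer(gaussian_feats).squeeze(-1)       # (B, N)
        top_idx = scores.topk(self.num_tokens, dim=1).indices  # (B, K)
        idx = top_idx.unsqueeze(-1).expand(-1, -1, gaussian_feats.size(-1))
        return gaussian_feats.gather(1, idx)                   # (B, K, D)

# Example: distill 20,000 Gaussian features into 256 tokens of dim 512.
feats = torch.randn(2, 20_000, 512)
tokens = TokenSparsifier(512, 256)(feats)
print(tokens.shape)  # torch.Size([2, 256, 512])
```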
This repository is built upon the foundation provided by LEO. We sincerely thank the authors of LEO for their incredible effort and for open-sourcing their framework.
To ensure environment reproducibility, we provide a pre-packaged conda environment created with conda-pack.
```bash
# Create a directory for the environment
mkdir -p gaussian_vlm_env

# Unpack the provided environment archive
tar -xzf gaussian_vlm_env.tar.gz -C gaussian_vlm_env

# Activate the environment, then fix up prefixes (conda-pack workflow)
source gaussian_vlm_env/bin/activate
conda-unpack
```
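Once the environment is activated, a quick sanity check can confirm it resolved correctly. This check assumes the packaged environment includes PyTorch, which the setup steps above do not state explicitly:

```python
# Sanity check after `source gaussian_vlm_env/bin/activate` and `conda-unpack`.
# Assumes PyTorch ships with the packaged environment (an assumption, not
# confirmed by the instructions above).
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```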