be2rlab/RADIO-ViPE

RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments

BE2R Lab — Biomechatronics and Energy-Efficient Robotics Laboratory, ITMO University

🌐 Project Page: be2rlab.github.io/radio_vipe  |  📄 Paper: Coming Soon


Pipeline

Abstract

We present RADIO-ViPE (Reduce All Domains Into One — Video Pose Engine), an online semantic SLAM system that enables geometry-aware open-vocabulary grounding, associating arbitrary natural language queries with localized 3D regions and objects in dynamic environments.

Unlike existing approaches that require calibrated, posed RGB-D input, RADIO-ViPE operates directly on raw monocular RGB video streams and requires no prior camera intrinsics, depth sensors, or pose initialization. The system tightly couples multi-modal embeddings spanning vision and language, derived from agglomerative foundation models (e.g., RADIO), with geometric scene information. This vision-language-geometric fusion is optimized using adaptive robust kernels designed to handle both actively moving objects and agent-displaced scene elements (e.g., furniture rearranged during ego-centric sessions).
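To give some intuition for why robust kernels help in dynamic scenes, here is a minimal sketch of iteratively reweighted estimation with a Huber kernel. This is a generic illustration and not the kernel used by RADIO-ViPE: the Huber choice, the fixed `delta`, the function names, and the toy data are all assumptions, and the adaptive part (how the kernel is tuned online) is not modeled.

```python
def huber_weight(residual, delta=1.0):
    """Weight implied by a Huber kernel: 1 for small residuals, delta/|r| for large ones."""
    r = abs(residual)
    return 1.0 if r <= delta else delta / r


def robust_mean(values, delta=1.0, iterations=10):
    """Iteratively reweighted estimate of a scalar (hypothetical helper).

    Large residuals, such as those produced by a moving object, receive
    small weights and therefore pull the estimate less.
    """
    estimate = sorted(values)[len(values) // 2]  # start from the median
    for _ in range(iterations):
        weights = [huber_weight(v - estimate, delta) for v in values]
        estimate = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return estimate


# Toy measurements: four consistent values and one outlier (assumed data).
measurements = [1.0, 1.1, 0.9, 1.0, 10.0]
plain = sum(measurements) / len(measurements)
robust = robust_mean(measurements)
print(f"plain mean: {plain:.2f}, robust estimate: {robust:.2f}")
```

The outlier drags the plain mean far from the consistent measurements, while the reweighted estimate stays close to them. In a SLAM setting the same idea would apply to per-observation residuals rather than scalar measurements.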

Experiments demonstrate that RADIO-ViPE achieves state-of-the-art results on the dynamic TUM-RGBD benchmark while maintaining competitive performance against offline open-vocabulary methods that rely on calibrated data and static scene assumptions. RADIO-ViPE bridges a critical gap in real-world deployment, enabling robust open-vocabulary semantic grounding for autonomous robotics, AR/VR applications, and unconstrained in-the-wild video streams.
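The open-vocabulary grounding described above can be pictured as a similarity search between a query embedding and per-point embeddings stored in the map. The sketch below uses plain cosine similarity with a fixed threshold. It is an assumption-laden illustration, not the RADIO-ViPE implementation: `ground_query`, the threshold, and the toy 3-dimensional embeddings are hypothetical, and real embeddings would come from the vision and language encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def ground_query(query_embedding, point_embeddings, threshold=0.8):
    """Return indices of map points whose embedding matches the query (hypothetical helper)."""
    return [
        index
        for index, embedding in enumerate(point_embeddings)
        if cosine_similarity(query_embedding, embedding) >= threshold
    ]


# Toy embeddings for three map points (assumed data, not model output).
point_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
query_embedding = [1.0, 0.0, 0.0]  # stands in for an encoded text query

matches = ground_query(query_embedding, point_embeddings)
print(matches)
```

The returned indices identify which 3D points would be highlighted for the query. A fixed threshold is the simplest possible decision rule and is used here only to keep the example short.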

Demo


Installation

Docker

# Build the Docker image
make build

# Run the Docker image
make DATA_DIR={YOUR_DATA_DIR} run

# Inside the container, install the package
pip install --no-build-isolation -e .

Usage

# Run the full pipeline
python run.py pipeline=default streams=raw_mp4_stream streams.base_path=YOUR_VIDEO_OR_DIR_PATH

# Run the pose-only pipeline (without depth estimation)
python run.py pipeline=default streams=raw_mp4_stream streams.base_path=YOUR_VIDEO_OR_DIR_PATH pipeline.post.depth_align_model=null
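When processing several videos, it can be convenient to assemble these commands programmatically. The following sketch only rebuilds the two command lines shown above; `build_command` and the example video paths are hypothetical and not part of the repository.

```python
import shlex


def build_command(video_path, pose_only=False):
    """Assemble the run.py invocation shown above as an argument list (hypothetical helper)."""
    args = [
        "python",
        "run.py",
        "pipeline=default",
        "streams=raw_mp4_stream",
        f"streams.base_path={video_path}",
    ]
    if pose_only:
        # Same override as the pose-only example: disables depth estimation.
        args.append("pipeline.post.depth_align_model=null")
    return args


# Example paths are placeholders; replace them with your own videos.
videos = ["videos/kitchen.mp4", "videos/office walk.mp4"]
for video in videos:
    print(shlex.join(build_command(video, pose_only=True)))
```

Passing the argument list to `subprocess.run(args, check=True)` from the repository root would launch the pipeline. Building a list rather than a single string keeps paths containing spaces intact.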

Evaluation

Documentation and evaluation scripts are coming soon.


Acknowledgments

RADIO-ViPE builds upon many outstanding open-source research projects and codebases, including (non-exhaustive):

| Project | Reference |
| --- | --- |
| RAD-SEG | arXiv:2511.19704 |
| KM-ViPE | arXiv:2512.01889 |
| RayFronts | arXiv:2504.06994 |
| ViPE | GitHub |
| RADIO | arXiv:2601.17237 |
| DINOv3 | GitHub |
| Talk2DINO | GitHub |
| RVWO | GitHub |
| UniDepth | GitHub |

License

This project will download and install additional third-party models and software. Note that these are not distributed by NVIDIA — please review their respective license terms before use.

This source code is released under the Apache 2.0 License.
