RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments
BE2R Lab — Biomechatronics and Energy-Efficient Robotics Laboratory, ITMO University
🌐 Project Page: be2rlab.github.io/radio_vipe | 📄 Paper: Coming Soon
We present RADIO-ViPE (Reduce All Domains Into One — Video Pose Engine), an online semantic SLAM system that enables geometry-aware open-vocabulary grounding, associating arbitrary natural language queries with localized 3D regions and objects in dynamic environments.
Unlike existing approaches that require calibrated, posed RGB-D input, RADIO-ViPE operates directly on raw monocular RGB video streams and requires no prior camera intrinsics, depth sensors, or pose initialization. The system tightly couples multi-modal embeddings, spanning vision and language and derived from agglomerative foundation models (e.g., RADIO), with geometric scene information. This vision-language-geometric fusion is optimized with adaptive robust kernels designed to handle both actively moving objects and agent-displaced scene elements (e.g., furniture rearranged during ego-centric sessions).
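To make the idea of an adaptive robust kernel concrete, here is a minimal sketch. It is not the RADIO-ViPE implementation: this README does not say which kernel is used, so the example assumes a Huber kernel whose threshold is re-estimated from the residuals through the median absolute deviation (MAD). The function names `huber_weights` and `adaptive_delta` and the toy residuals are hypothetical.

```python
import numpy as np

def huber_weights(residuals, delta):
    """Reweighting factors for the Huber kernel.

    Residuals within `delta` keep weight 1; larger ones are scaled by
    delta / |r|, which limits the influence of outliers such as moving objects.
    """
    abs_r = np.abs(residuals)
    return np.where(abs_r <= delta, 1.0, delta / np.maximum(abs_r, 1e-12))

def adaptive_delta(residuals, k=1.345):
    """Assumed adaptation rule: threshold = k * robust scale (MAD-based)."""
    med = np.median(residuals)
    mad = np.median(np.abs(residuals - med))
    sigma = 1.4826 * mad  # MAD -> standard deviation under a Gaussian model
    return k * max(sigma, 1e-12)

# Toy residuals: five consistent with a static scene, one gross outlier.
residuals = np.array([0.1, -0.2, 0.05, 0.15, -0.1, 5.0])
delta = adaptive_delta(residuals)
weights = huber_weights(residuals, delta)
print(delta, weights)
```

In an iteratively reweighted least-squares loop, these weights would multiply each residual term before the next solve, so the outlier contributes far less than the inliers.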
Experiments demonstrate that RADIO-ViPE achieves state-of-the-art results on the dynamic TUM-RGBD benchmark while maintaining competitive performance against offline open-vocabulary methods that rely on calibrated data and static scene assumptions. RADIO-ViPE bridges a critical gap in real-world deployment, enabling robust open-vocabulary semantic grounding for autonomous robotics, AR/VR applications, and unconstrained in-the-wild video streams.
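The open-vocabulary grounding described above can be illustrated with a small sketch: each 3D point carries a feature vector in the same embedding space as the text query, and points are selected by cosine similarity. This is an assumption about the general technique, not the project's API. The function `ground_query`, the threshold value, and the toy 2D vectors are hypothetical; in practice the embeddings would come from a vision-language model.

```python
import numpy as np

def ground_query(point_features, text_embedding, threshold=0.5):
    """Score each 3D point against a text embedding by cosine similarity.

    point_features: (N, D) array, one language-aligned feature per point.
    text_embedding: (D,) array for the natural language query.
    Returns per-point scores and a boolean mask of points above `threshold`.
    """
    eps = 1e-12  # guards against division by zero for all-zero features
    pf = point_features / (np.linalg.norm(point_features, axis=1, keepdims=True) + eps)
    te = text_embedding / (np.linalg.norm(text_embedding) + eps)
    scores = pf @ te
    return scores, scores >= threshold

# Toy data: three points with 2D features and one query embedding.
point_features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
text_embedding = np.array([1.0, 0.0])
scores, mask = ground_query(point_features, text_embedding)
print(scores, mask)
```

The mask identifies which points of the map would be highlighted for the query; the threshold trades recall against precision and would need tuning per model.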
# Build the Docker image
make build
# Run the Docker image
make DATA_DIR={YOUR_DATA_DIR} run
# Inside the container, install the package
pip install --no-build-isolation -e .
# Run the full pipeline
python run.py pipeline=default streams=raw_mp4_stream streams.base_path=YOUR_VIDEO_OR_DIR_PATH
# Run the pose-only pipeline (without depth estimation)
python run.py pipeline=default streams=raw_mp4_stream streams.base_path=YOUR_VIDEO_OR_DIR_PATH pipeline.post.depth_align_model=null

Documentation and evaluation scripts are coming soon.
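For batch runs, the two commands above can be assembled programmatically. The sketch below only rebuilds the argument lists shown in this README; the helper name `build_command` and the example paths are hypothetical, and it assumes `run.py` is invoked from the repository root inside the container.

```python
import shlex

def build_command(base_path, pose_only=False):
    """Return the argument list for one pipeline run.

    Uses only the overrides shown in this README. With pose_only=True the
    depth alignment model is set to null, as in the pose-only command.
    """
    cmd = [
        "python", "run.py",
        "pipeline=default",
        "streams=raw_mp4_stream",
        f"streams.base_path={base_path}",
    ]
    if pose_only:
        cmd.append("pipeline.post.depth_align_model=null")
    return cmd

# Example paths are placeholders; replace them with real videos or directories.
full_cmd = build_command("videos/office.mp4")
pose_cmd = build_command("videos/office.mp4", pose_only=True)
print(shlex.join(full_cmd))
print(shlex.join(pose_cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to launch the run; keeping the arguments as a list avoids shell quoting problems with paths that contain spaces.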
RADIO-ViPE builds upon many outstanding open-source research projects and codebases, including (non-exhaustive):
| Project | Reference |
|---|---|
| RAD-SEG | arXiv:2511.19704 |
| KM-ViPE | arXiv:2512.01889 |
| RayFronts | arXiv:2504.06994 |
| ViPE | GitHub |
| RADIO | arXiv:2601.17237 |
| DINOv3 | GitHub |
| Talk2DINO | GitHub |
| RVWO | GitHub |
| UniDepth | GitHub |
This project will download and install additional third-party models and software. Note that these are not distributed by NVIDIA — please review their respective license terms before use.
This source code is released under the Apache 2.0 License.

