Ziyang Xie, Zhizheng Liu, Zhenghao Peng, Wayne Wu, Bolei Zhou
Vid2Sim is a novel framework that converts monocular videos into photorealistic and physically interactive simulation environments for training embodied agents with minimal sim-to-real gap.
```bash
# Clone the repository
git clone https://github.com/Vid2Sim/Vid2Sim.git --recursive
cd Vid2Sim

# Create a new environment
conda create -n vid2sim python=3.10
conda activate vid2sim

# Install dependencies
pip install -e .

# Install reconstruction dependencies
pip install -e submodules/vid2sim-rasterizer
pip install -e submodules/vid2sim-deva-segmentation
pip install -e submodules/simple-knn

# Install RL dependencies
pip install -r src/vid2sim_rl/requirements.txt
pip install -e submodules/ml-agents/ml-agents

# Optional: install the R3M visual encoder
pip install -e submodules/r3m
```
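After installation, a quick smoke test can confirm that the core dependencies resolve. This is a minimal sketch: aside from `torch`, which the CUDA rasterizer and `simple-knn` submodules build against, nothing here is prescribed by the repository.

```python
# Minimal post-install smoke test (illustrative; anything beyond the torch
# import is an assumption, not part of the documented setup).
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # a GPU is needed for rasterization
```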
Vid2Sim transforms monocular videos into simulation environments by reconstructing the scene's geometry and appearance. The resulting environments preserve real-world diversity and visual fidelity, minimizing the sim-to-real gap for agent training.
👉 To get started, follow the reconstruction guide in `vid2sim_recon` to reconstruct the simulation environment from video.
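Before reconstruction, the source video typically needs to be split into individual frames. The sketch below does this preprocessing with OpenCV; the paths are illustrative, and the actual pipeline in `vid2sim_recon` may handle frame extraction differently.

```python
# Illustrative preprocessing: split a monocular clip into per-frame images
# for reconstruction. Paths are hypothetical placeholders.
import os
import cv2

video_path = "data/example_clip.mp4"     # hypothetical source clip
out_dir = "data/example_clip/images"
os.makedirs(out_dir, exist_ok=True)

cap = cv2.VideoCapture(video_path)
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imwrite(os.path.join(out_dir, f"{idx:05d}.png"), frame)
    idx += 1
cap.release()
print(f"Extracted {idx} frames")  # e.g. 450 frames for a 15 s clip at 30 fps
```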
After the environment is reconstructed, Vid2Sim converts it into an interactive environment with both realistic visual appearance and physical collision geometry, so the agent can be trained in diverse situations.
👉 To set up the environment and launch RL training, refer to `vid2sim_rl`.
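Because the interactive environments are Unity builds driven through ML-Agents (installed above), a minimal interaction loop with the low-level `mlagents_envs` Python API looks roughly like the sketch below. The binary path is a placeholder, and the real training entry point lives in `vid2sim_rl`; this only illustrates how an agent exchanges observations and actions with a Unity environment.

```python
# Minimal random-action rollout against a Unity environment via ML-Agents.
# The binary path is a placeholder; see vid2sim_rl for the actual training setup.
from mlagents_envs.environment import UnityEnvironment

env = UnityEnvironment(file_name="path/to/unity_env_binary")  # hypothetical build
env.reset()

behavior_name = list(env.behavior_specs)[0]
spec = env.behavior_specs[behavior_name]

for _ in range(100):
    decision_steps, terminal_steps = env.get_steps(behavior_name)
    # Sample a random action for every agent awaiting a decision
    action = spec.action_spec.random_action(len(decision_steps))
    env.set_actions(behavior_name, action)
    env.step()

env.close()
```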
```
Vid2Sim/
├── data/              # Source data
├── src/
│   ├── vid2sim_recon/ # Reconstruct the simulation environment from video
│   └── vid2sim_rl/    # Train the agent in real-to-sim environments
├── tools/             # Tool scripts
└── README.md          # This file
```
The Vid2Sim dataset includes 30 high-quality real-to-sim simulation environments reconstructed from video clips sourced from 9 web videos. Each clip includes 15 seconds of forward-facing video recorded at 30 fps, providing 450 frames per scene for environment reconstruction and simulation.
We provide both the source video data and interactive Unity environments for agent training.
If you find this work useful in your research, please consider citing:
```bibtex
@inproceedings{xie2024vid2sim,
  title={Vid2Sim: Realistic and Interactive Simulation from Video for Urban Navigation},
  author={Xie, Ziyang and Liu, Zhizheng and Peng, Zhenghao and Wu, Wayne and Zhou, Bolei},
  booktitle={CVPR},
  year={2025}
}
```