Official implementation of "MObI: Multimodal Object Inpainting Using Diffusion Models" - CVPR Workshop on Data-Driven Autonomous Driving Simulation (DDADS)
MObI addresses limitations in existing approaches:
- Object inpainting methods based on edit masks alone (e.g., Paint-by-Example) achieve high realism but can lead to surprising results because there are often multiple semantically consistent ways to inpaint an object within a scene.
- Methods based on 3D reconstruction (e.g., NeuRAD) have strong controllability but sometimes lead to low realism, especially for object viewpoints that have not been observed.

Key features:
- Joint inpainting across multiple modalities (RGB camera, lidar depth and intensity)
- Object insertion using just a single reference image
- 3D bounding box conditioning for accurate spatial positioning
- Improved controllability compared to traditional inpainting methods
MObI extends Paint-by-Example, a reference-based image inpainting method, to include bounding box conditioning and jointly generate camera and lidar perception inputs. Therefore, this repository is based on the Paint-by-Example repo.
Clone repository and set the project root directory:
git clone https://github.com/alexbuburuzan/MObI.git
cd MObI
echo "export WORK_DIR_MOBI=$(pwd)" >> ~/.bashrc
source ~/.bashrc

Install conda environment based on CUDA 11.3 (you may be unable to properly install mmdet3d if using a different CUDA version):
conda env create -f environment.yml
conda activate mobi

This codebase is partly based on the BEVFusion repo, particularly the data preprocessing code. Refer to their documentation if you have issues building mmdet3d. Install the following:
# uses a pre-built wheel; you could also build mmcv-full from source
pip install mmcv-full==1.4.0 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.10.0/index.html --no-cache-dir
pip install mmdet==2.20.0
pip install nuscenes-devkit
cd bevfusion
# builds mmdet3d; use older gcc version
python setup.py develop

Install additional dependencies and the project itself:
pip install git+https://github.com/openai/CLIP.git
pip install -e git+https://github.com/CompVis/taming-transformers.git@master#egg=taming-transformers
cd "$WORK_DIR_MOBI"
pip install -e .
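
After installation, a quick check can catch a failed build early, mmdet3d in particular. The sketch below only tests whether each module can be located in the active environment; it does not verify that compiled extensions load. The module list is an assumption (the usual import names for the packages installed above), and `find_missing` is a hypothetical helper, not part of this repository.

```python
import importlib.util

# Import names assumed to correspond to the packages installed above
# (mmcv-full -> mmcv, nuscenes-devkit -> nuscenes, taming-transformers -> taming).
REQUIRED_MODULES = ["torch", "mmcv", "mmdet", "mmdet3d", "nuscenes", "clip", "taming"]


def find_missing(modules):
    """Return the top-level modules that cannot be located in this environment."""
    missing = []
    for name in modules:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for modules in a broken state; treat as missing.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

If `mmdet3d` is reported missing, rebuilding it inside `bevfusion/` with the CUDA version noted above is the first thing to try.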
First, download the nuScenes dataset.
Then run the data processing script for the camera-lidar inpainting model:
bash scripts/process_data.sh

Expected directory structure:
MObI/
├── checkpoints/  # Pretrained models
│   ├── model.ckpt  # Paint-by-Example pretrained model
│   ├── mobi_nusc_512/
│   │   └── mobi_nuscenes_epoch28.ckpt  # MObI trained model
│   └── autoencoders/
│       └── range_autoencoder.ckpt  # Range view autoencoder
├── processed-data/  # Preprocessed datasets
│   ├── nuscenes/  # Full nuScenes dataset
│   │   ├── nuscenes_infos_train.pkl
│   │   ├── nuscenes_infos_val.pkl
│   │   ├── nuscenes_dbinfos_pbe_train.csv
│   │   ├── nuscenes_dbinfos_pbe_val.csv
│   │   ├── nuscenes_scene_infos_pbe_train.pkl
│   │   ├── nuscenes_scene_infos_pbe_val.pkl
│   │   ├── nuscenes_pbe_gt_database_train/
│   │   └── nuscenes_pbe_gt_database_val/
│   └── nuscenes-mini/  # Mini nuScenes dataset
│       └── ...
├── data/
│   └── nuscenes/  # Raw nuScenes data
│       ├── samples/  # Sensor data samples
│       ├── sweeps/  # Sensor data sweeps
│       ├── maps/  # Map data
│       ├── can_bus/  # CAN bus data
│       ├── panoptic/  # Panoptic segmentation
│       ├── v1.0-trainval/  # Train/val annotations
│       ├── v1.0-test/  # Test annotations
│       ├── v1.0-mini/  # Mini dataset annotations
│       ├── test_v1.0-mini/
│       ├── nuscenes_gt_database/
│       ├── nuscenes_infos_train_mono3d.coco.json
│       ├── nuscenes_infos_val_mono3d.coco.json
│       ├── nuscenes_map_anns_val.json
│       ├── nuScenes_license.pdf
│       ├── VERSION.txt
│       └── DISCLAIMER.txt
├── configs/  # Configuration files
│   ├── mobi_nusc_256.yaml
│   ├── mobi_nusc_512.yaml
│   ├── mobi_nusc_all-classes_256.yaml
│   ├── mobi_nusc_all-classes_512.yaml
│   ├── mobi_nusc-mini_256.yaml
│   ├── mobi_nusc-mini_512.yaml
│   ├── pbe.yaml
│   └── range_autoencoder.yaml
├── scripts/  # Training and evaluation scripts
├── ldm/  # Latent diffusion model modules
├── eval_tool/  # Evaluation metrics (camera & lidar)
├── bevfusion/  # BEVFusion repo
├── assets/  # Assets and media
├── environment.yaml  # Conda environment specification
└── main.py  # Main training script
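
Before training or evaluation, it can help to confirm that the key files are where the scripts expect them. The following is a minimal sketch that checks a subset of the layout above under `WORK_DIR_MOBI`. The nesting of the checkpoint and raw-data entries is an assumption read from the tree, and `missing_paths` is a hypothetical helper, not something the repository provides.

```python
import os
from pathlib import Path

# A subset of the expected layout; nesting assumed from the directory tree above.
EXPECTED_PATHS = [
    "checkpoints/model.ckpt",
    "checkpoints/mobi_nusc_512/mobi_nuscenes_epoch28.ckpt",
    "checkpoints/autoencoders/range_autoencoder.ckpt",
    "processed-data/nuscenes/nuscenes_infos_train.pkl",
    "processed-data/nuscenes/nuscenes_infos_val.pkl",
    "data/nuscenes/samples",
    "data/nuscenes/sweeps",
    "configs/mobi_nusc_512.yaml",
]


def missing_paths(root, relative_paths=EXPECTED_PATHS):
    """Return the relative paths that do not exist under the given root."""
    root = Path(root)
    return [p for p in relative_paths if not (root / p).exists()]


if __name__ == "__main__":
    # Fall back to the current directory if WORK_DIR_MOBI is not set.
    root = os.environ.get("WORK_DIR_MOBI", ".")
    missing = missing_paths(root)
    for path in missing:
        print("missing:", path)
    if not missing:
        print("All checked paths exist under", root)
```

Checkpoints will be reported missing until the model weights have been downloaded.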
Download the weights for MObI, its range view autoencoder, and Paint-by-Example:
bash scripts/download_models.sh

Run the following script to perform model inference and realism evaluation given the setting described in the paper:
bash scripts/realism_test_bench.sh

You should obtain:
| Model | Reference Type | FID | LPIPS | CLIP | D-LPIPS | I-LPIPS |
|---|---|---|---|---|---|---|
| mobi_nuscenes_epoch28 | id-ref | 6.503 | 0.114 | 84.9 | 0.130 | 0.147 |
| mobi_nuscenes_epoch28 | track-ref | 6.703 | 0.115 | 83.5 | 0.129 | 0.149 |
| mobi_nuscenes_epoch28 | in-domain-ref | 8.947 | 0.127 | 77.5 | 0.132 | 0.154 |
| mobi_nuscenes_epoch28 | cross-domain-ref | 9.046 | 0.130 | 76.0 | 0.132 | 0.153 |
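
To compare your own run against these numbers, a small helper along the following lines may be convenient. The reference values are copied from the table above; the 5% relative tolerance and the `compare_to_reference` function are illustrative assumptions, not a threshold or tool defined by the authors, and the output format of the evaluation script is not assumed here.

```python
# Metric order matches the columns of the table above.
METRICS = ("FID", "LPIPS", "CLIP", "D-LPIPS", "I-LPIPS")

# Reference values for mobi_nuscenes_epoch28, copied from the table above.
EXPECTED = {
    "id-ref": (6.503, 0.114, 84.9, 0.130, 0.147),
    "track-ref": (6.703, 0.115, 83.5, 0.129, 0.149),
    "in-domain-ref": (8.947, 0.127, 77.5, 0.132, 0.154),
    "cross-domain-ref": (9.046, 0.130, 76.0, 0.132, 0.153),
}


def compare_to_reference(ref_type, obtained, rel_tol=0.05):
    """Return {metric: (expected, obtained)} for metrics deviating by more than rel_tol.

    rel_tol is a hypothetical tolerance chosen for illustration only.
    """
    deviations = {}
    for name, expected, got in zip(METRICS, EXPECTED[ref_type], obtained):
        if abs(got - expected) > rel_tol * abs(expected):
            deviations[name] = (expected, got)
    return deviations


if __name__ == "__main__":
    # Replace these values with the metrics from your own run.
    my_run = (6.6, 0.115, 84.5, 0.131, 0.148)
    print(compare_to_reference("id-ref", my_run))
```

An empty dictionary means every metric is within the chosen tolerance of the reported value.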
See bevfusion/edited-objects-eval.md for detailed instructions on how to run BEVFusion model on reinserted objects and measure its performance.
Train MObI using Paint-by-Example pretraining and provided range view autoencoder (this codebase provides scripts to train your own range view VAE, too):
bash scripts/train.sh
The training script will save the top-5 checkpoints. To select the best checkpoint, run a short evaluation on each of them using the following script:
bash scripts/model_selection.sh
To train your own range view autoencoder, first extract the image VAE of Paint-by-Example, then run the fine-tuning script:
cd "$WORK_DIR_MOBI"
python scripts/extract_autoencoder.py
bash scripts/finetune_autonecoder.sh
If you find our work useful in your research, please consider citing:
@inproceedings{buburuzan2025mobi,
title={Mobi: Multimodal object inpainting using diffusion models},
author={Buburuzan, Alexandru and Sharma, Anuj and Redford, John and Dokania, Puneet K and Mueller, Romain},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={1974--1984},
year={2025}
}

LICENSE_MObI covers the MObI-specific code and assets. Please note that this codebase builds upon other works such as Paint-by-Example and BEVFusion, which have their own respective licenses.

