Official PyTorch Implementation
This repository contains the official implementation of the paper "Single-Modal-Operable Multimodal Collaborative Perception" (SiMO, ICLR 2026).
Update: Pretrained checkpoints are now available! Download them from our Hugging Face repository.
Multimodal collaborative perception promises robust 3D object detection by fusing complementary sensor data from multiple connected vehicles. However, existing methods suffer from catastrophic performance degradation when one modality becomes unavailable during deployment, a common scenario in real-world autonomous driving. SiMO addresses this critical limitation through two key innovations:
- LAMMA (Length-Adaptive Multi-Modal Fusion): A novel fusion module that adaptively handles a variable number of input modalities, operating like a parallel circuit rather than series fusion.
- PAFR Training Strategy: A four-stage training paradigm (Pretrain-Align-Fuse-Random Drop) that prevents modality competition and enables seamless single-modal operation.
| Modality | AP@30 | AP@50 | AP@70 |
|---|---|---|---|
| LiDAR + Camera | 98.30 | 97.94 | 94.64 |
| LiDAR-only | 97.32 | 97.07 | 94.06 |
| Camera-only | 80.81 | 69.63 | 44.82 |
Key Result: SiMO achieves state-of-the-art performance on OPV2V-H with graceful degradation when modalities fail.
- Single-Modal Operability: First multimodal collaborative perception framework that maintains functional performance with any subset of modalities
- Adaptive Fusion: LAMMA module dynamically adjusts to available modalities without architecture changes
- No Modality Competition: PAFR training prevents feature suppression between modalities
- Drop-in Replacement: Compatible with existing fusion frameworks like HEAL's Pyramid Fusion
- Multi-Dataset Support: Evaluated on OPV2V-H, V2XSet, and DAIR-V2X-C
LAMMA is the core fusion module that enables SiMO's single-modal operability:
```
Input: Camera Features (B, N, C, H, W) + LiDAR Features (B, N, C, H, W)
        ↓
[Positional Encoding]
        ↓
[Feature Projection] ← Downsampling (2x)
        ↓
[Modality-Aware Masking] ← Single-mode or Random Drop
        ↓
[Cross-Attention] × 2 (Camera branch + LiDAR branch)
        ↓
[Parallel Fusion] ← Sum of attended features
        ↓
[Feature Recovery] ← Upsampling (2x)
        ↓
Output: Fused Features + Single-Modal Features
```
Key Design Principles:
- Parallel Processing: Unlike sequential fusion, LAMMA processes modalities in parallel and sums their contributions
- Adaptive Masking: During training, random modality dropout forces the network to learn robust single-modal representations
- Cross-Attention: Each modality attends to the concatenated features of all available modalities
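The parallel, mask-aware fusion described by these principles can be sketched in plain Python. This is a simplified, framework-free illustration of the idea (single attention head, no projections or positional encoding); the function and variable names are ours, not the repository's API:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(query, keys, values):
    # Scaled dot-product attention for one query vector over a token pool.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, kv)) / math.sqrt(d) for kv in keys]
    w = softmax(scores)
    return [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(d)]

def lamma_style_fuse(features, available):
    """Parallel fusion sketch: each AVAILABLE modality's tokens attend to the
    concatenated tokens of all available modalities; the per-modality outputs
    are summed (parallel circuit), so any subset of modalities still works."""
    # features: {modality: list of token vectors}; available: set of names
    pool = [tok for m in available for tok in features[m]]  # shared key/value pool
    fused = None
    for m in available:  # one branch per surviving modality
        attended = [cross_attention(q, pool, pool) for q in features[m]]
        # Pool this branch's tokens (mean), then sum branches together.
        branch = [sum(tok[j] for tok in attended) / len(attended)
                  for j in range(len(attended[0]))]
        fused = branch if fused is None else [a + b for a, b in zip(fused, branch)]
    return fused
```

Because the branches are summed rather than chained, removing a modality simply removes one term from the sum: `lamma_style_fuse(feats, {"lidar"})` is still well-defined, which is the essence of single-modal operability.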
SiMO works seamlessly with HEAL's Pyramid Fusion framework:
```
Stage 1: Single-Modal Encoders (PointPillar for LiDAR, Lift-Splat-Shoot for Camera)
        ↓
Stage 2: Single-Modal Backbones (ResNet-based BEV feature extraction)
        ↓
Stage 3: Modality Alignment (ConvNeXt-based feature alignment)
        ↓
Stage 4: LAMMA Fusion (Adaptive multimodal fusion)
        ↓
Stage 5: Pyramid Fusion Backbone (Multi-scale collaborative aggregation)
        ↓
Stage 6: Detection Head (Anchor-based 3D object detection)
```
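The six-stage data flow amounts to a simple function composition, sketched below. The stage names are placeholders for illustration; the real modules live under `opencood/models/`:

```python
def run_pipeline(lidar_points, camera_images, stages):
    """Schematic of SiMO's six-stage flow within HEAL's Pyramid Fusion."""
    lidar_feat = stages["lidar_encoder"](lidar_points)    # Stages 1-2: encoder + backbone
    cam_feat = stages["camera_encoder"](camera_images)    # Stages 1-2
    lidar_feat = stages["lidar_aligner"](lidar_feat)      # Stage 3: modality alignment
    cam_feat = stages["camera_aligner"](cam_feat)         # Stage 3
    fused = stages["lamma"](lidar_feat, cam_feat)         # Stage 4: LAMMA fusion
    collab = stages["pyramid_fusion"](fused)              # Stage 5: multi-scale aggregation
    return stages["detection_head"](collab)               # Stage 6: 3D detection
```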
The PAFR (Pretrain-Align-Fuse-Random Drop) strategy consists of four stages:
Goal: Train single-modal feature extractors independently
```bash
# Pretrain LiDAR branch
python opencood/tools/train.py --hypes_yaml opencood/hypes_yaml/opv2v/LiDAROnly/lidar_pyramid.yaml

# Pretrain Camera branch
python opencood/tools/train.py --hypes_yaml opencood/hypes_yaml/opv2v/CameraOnly/camera_pyramid.yaml
```

Configuration: Set `freeze: true` for all pretrained components in subsequent stages.
Goal: Align multi-modal features to a common representation space using ConvNeXt
Key Configuration:
```yaml
aligner_args:
  core_method: convnext
  freeze: true
  spatial_align: false
  args:
    num_of_blocks: 3
    dim: 64
```

Training: Train with both modalities, allowing the aligner to learn cross-modal feature correspondence.
Goal: Train LAMMA fusion module with full multimodal inputs
Key Configuration:
```yaml
mm_fusion_method: 'lamma3'
lamma:
  freeze: false
  feature_stride: 2
  feat_dim: 64
  dim: 128
  heads: 2
  single_mode: false
  random_drop: false
```

Important: Keep `random_drop: false` and `single_mode: false` during this stage.
Goal: Fine-tune with random modality dropout to enable single-modal operation
Key Configuration:
```yaml
lamma:
  random_drop: true
  lidar_drop_ratio: 0.5  # 50% chance to drop LiDAR when dropping
```

Training: With 50% probability, randomly drop one modality during training. This forces the network to maintain functional performance with either modality alone.
This project is implemented based on HEAL and adopts the same environment setup. Please refer to the HEAL repository for detailed installation instructions and troubleshooting.
- Python >= 3.8
- PyTorch >= 1.12.0
- CUDA >= 11.3
- spconv >= 2.0
```bash
git clone https://github.com/dempsey-wen/SiMO.git
cd SiMO
```

```bash
# Install PyTorch (adjust CUDA version as needed)
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113

# Install spconv (for LiDAR feature extraction)
pip install spconv-cu113

# Install other requirements
pip install -r requirements.txt
```

Key Dependencies:
```
easydict~=1.9
opencv-python-headless~=4.5.1.48
timm
einops
shapely==2.0.0
efficientnet_pytorch==0.7.0
```
```bash
pip install -e .

cd opencood/pcdet_utils/pointnet2
python setup.py install
cd ../iou3d_nms
python setup.py install
cd ../../..
```

SiMO supports the following collaborative perception datasets:
| Dataset | Scenarios | Modalities | Download |
|---|---|---|---|
| OPV2V-H | Highway, Urban | LiDAR, Camera | Link |
| V2XSet | Highway, Urban | LiDAR, Camera | Link |
| DAIR-V2X-C | Real-world | LiDAR, Camera | Link |
```
data/
├── OPV2V/
│   ├── train/
│   ├── validate/
│   └── test/
├── V2XSet/
│   ├── train/
│   ├── validate/
│   └── test/
└── DAIR-V2X/
    └── ...
```
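A quick stdlib-only sanity check for this layout can save a failed training run. The helper below is our own illustration (DAIR-V2X uses its own internal structure, so only its root directory is checked here):

```python
from pathlib import Path

# Expected splits per dataset, matching the layout above.
EXPECTED = {
    "OPV2V": ("train", "validate", "test"),
    "V2XSet": ("train", "validate", "test"),
    "DAIR-V2X": (),
}

def missing_dataset_dirs(data_root):
    """Return the expected dataset directories missing under data_root."""
    root = Path(data_root)
    missing = []
    for dataset, splits in EXPECTED.items():
        paths = [root / dataset] + [root / dataset / s for s in splits]
        missing += [str(p) for p in paths if not p.is_dir()]
    return missing
```

An empty return value means the layout matches; otherwise the returned paths tell you exactly which directories still need to be created or symlinked.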
```bash
# LiDAR-only pretraining (20 epochs)
python opencood/tools/train.py \
    --hypes_yaml opencood/hypes_yaml/opv2v/LiDAROnly/lidar_pyramid.yaml

# Camera-only pretraining (50 epochs)
python opencood/tools/train.py \
    --hypes_yaml opencood/hypes_yaml/opv2v/CameraOnly/camera_pyramid.yaml
```

Output: Model checkpoints are saved to `saved_models/opv2v_lidar_pyramid/` and `saved_models/opv2v_camera_pyramid/`.
This stage trains the modality aligners to align LiDAR and camera features to a common representation space.
Key Configuration: The aligner trains independently for each modality with frozen encoders and backbones.
Modify the config to set `single_modality: lidar` and freeze the camera aligner:

```yaml
model:
  args:
    single_modality: lidar
    lidar_aligner:
      freeze: false
    camera_aligner:
      freeze: true    # Freeze camera aligner
    lidar_encoder:
      freeze: true    # Freeze pretrained LiDAR encoder
    lidar_backbone:
      freeze: true    # Freeze pretrained LiDAR backbone
    camera_encoder:
      freeze: true    # Freeze pretrained camera encoder
    camera_backbone:
      freeze: true    # Freeze pretrained camera backbone
```

Then run training:

```bash
python opencood/tools/train.py \
    --hypes_yaml opencood/hypes_yaml/opv2v/MoreModality/lidar_camera_lamma3_pyramid_fusion.yaml
```

Modify the config to set `single_modality: camera` and freeze the LiDAR aligner:
```yaml
model:
  args:
    single_modality: camera
    lidar_aligner:
      freeze: true    # Freeze LiDAR aligner
    camera_aligner:
      freeze: false
    # Keep encoders and backbones frozen
```

Then run training:

```bash
python opencood/tools/train.py \
    --hypes_yaml opencood/hypes_yaml/opv2v/MoreModality/lidar_camera_lamma3_pyramid_fusion.yaml
```

Output: Checkpoints are saved with the trained aligners. Load these checkpoints in the next stage.
This stage trains the LAMMA fusion module with both aligners frozen.
Key Configuration: Set single_modality: false to enable full multimodal fusion.
```yaml
model:
  args:
    single_modality: false   # Enable full multimodal fusion
    lidar_aligner:
      freeze: true           # Freeze both aligners
    camera_aligner:
      freeze: true
    lamma:
      random_drop: false     # Disable random drop in this stage
      single_mode: false
```

Set `model_dir` to load the pretrained aligner checkpoints from the Align stage.

```bash
python opencood/tools/train.py \
    --hypes_yaml opencood/hypes_yaml/opv2v/MoreModality/lidar_camera_lamma3_pyramid_fusion.yaml \
    --model_dir saved_models/opv2v_lidarcamera_aligned/
```

This final stage enables random modality dropout during training to ensure robust single-modal operation.
Key Configuration: Set lamma.random_drop: true to enable random dropout.
```yaml
model:
  args:
    single_modality: false   # Still full multimodal fusion
    lamma:
      random_drop: true      # Enable random modality dropout
      lidar_drop_ratio: 0.5  # 50% probability to drop LiDAR when dropping
      single_mode: false
```

Then resume training from the Fusion stage checkpoint:

```bash
python opencood/tools/train.py \
    --hypes_yaml opencood/hypes_yaml/opv2v/MoreModality/lidar_camera_lamma3_pyramid_fusion.yaml \
    --model_dir saved_models/opv2v_lidarcamera_lamma3_fused/
```

Training Notes:
- During training, with 50% probability, one modality is randomly dropped
- This forces the network to maintain functional performance with either modality alone
- The final checkpoint will have robust single-modal operability
```bash
python opencood/tools/inference.py \
    --model_dir saved_models/opv2v_lidarcamera_lamma3_pyramid_fusion/ \
    --fusion_method intermediate
```

Modify the config to set `single_modality: lidar`:

```yaml
model:
  args:
    single_modality: lidar
```

Then run inference:
```bash
python opencood/tools/inference.py \
    --model_dir saved_models/opv2v_lidarcamera_lamma3_pyramid_fusion/ \
    --fusion_method intermediate
```

For camera-only inference, set `single_modality: camera` in the config and run inference again:

```yaml
model:
  args:
    single_modality: camera
```

```bash
python opencood/tools/inference.py \
    --model_dir saved_models/opv2v_lidarcamera_lamma3_pyramid_fusion/ \
    --fusion_method intermediate
```

To evaluate within a restricted detection range:

```bash
python opencood/tools/inference.py \
    --model_dir saved_models/opv2v_lidarcamera_lamma3_pyramid_fusion/ \
    --fusion_method intermediate \
    --range 51.2,51.2
```

To save detection visualizations periodically during inference:

```bash
python opencood/tools/inference.py \
    --model_dir saved_models/opv2v_lidarcamera_lamma3_pyramid_fusion/ \
    --fusion_method intermediate \
    --save_vis_interval 10
```

| Method | Modality | AP@30 | AP@50 | AP@70 | Modality Drop? |
|---|---|---|---|---|---|
| SiMO-PF | LiDAR + Camera | 98.30 | 97.94 | 94.64 | No |
| SiMO-PF | LiDAR only | 97.32 | 97.07 | 94.06 | Yes |
| SiMO-PF | Camera only | 80.81 | 69.63 | 44.82 | Yes |
Key Observations:
- SiMO maintains >97% AP@50 even when operating with LiDAR alone
- Camera-only performance is competitive for low-precision detection (AP@30 = 80.81)
- Graceful degradation pattern enables safe fallback strategies
| Method | LiDAR+Camera AP@50 | LiDAR-Only AP@50 | Camera-Only AP@50 |
|---|---|---|---|
| BM2CP (Zhao et al., 2023) | 91.45 | 91.31 | 0.00 |
| BEVFusion (Liu et al., 2023) | 94.21 | 91.99 | 0.00 |
| UniBEV (Wang et al., 2024a) | 91.71 | 91.73 | 0.00 |
| AttFusion (Xu et al., 2022c) | - | 95.09 | 52.91 |
| HEAL (Lu et al., 2024) | - | 98.00 | 60.48 |
| SiMO (AttFusion w/ RD) | 94.98 | 94.02 | 49.69 |
| SiMO (Pyramid Fusion w/ RD) (Ours) | 97.94 | 97.07 | 69.63 |
| Method | LiDAR+Camera AP@50 | LiDAR-Only AP@50 | Camera-Only AP@50 |
|---|---|---|---|
| SiMO-PF | 92.66 | 90.44 | 56.42 |
| Method | LiDAR+Camera AP@50 | LiDAR-Only AP@50 | Camera-Only AP@50 |
|---|---|---|---|
| SiMO-PF | 51.82 | 52.33 | 2.24 |
Pretrained models are available on Hugging Face.
| Model | Dataset | Config | Checkpoint |
|---|---|---|---|
| SiMO-PF | OPV2V-H | Config | 🤗 HF |
| SiMO-AttFuse | OPV2V-H | Config | 🤗 HF |
```bash
# Install huggingface-hub
pip install huggingface-hub

# Download all checkpoints
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='DempseyWen/SiMO', repo_type='model')"

# Or download a specific file
python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='DempseyWen/SiMO', filename='path/to/checkpoint.pth')"
```

The downloaded checkpoints are saved to `~/.cache/huggingface/hub/`. You can also download them manually from Hugging Face.
```
SiMO/
├── opencood/
│   ├── models/
│   │   ├── fuse_modules/
│   │   │   ├── lamma.py               # LAMMA implementation
│   │   │   └── pyramid_fuse.py        # Pyramid Fusion
│   │   └── heter_pyramid_collab.py    # Main model
│   ├── tools/
│   │   ├── train.py                   # Training script
│   │   ├── train_ddp.py               # Distributed training
│   │   └── inference.py               # Testing script
│   ├── hypes_yaml/
│   │   └── opv2v/
│   │       ├── LiDAROnly/             # Single-modal configs
│   │       ├── CameraOnly/
│   │       └── MoreModality/          # Multimodal configs
│   └── data_utils/
│       └── datasets/                  # Dataset loaders
├── requirements.txt
├── setup.py
└── README.md
```
If you find this work useful for your research, please cite:
```bibtex
@inproceedings{wen2026simo,
  title={Single-Modal-Operable Multimodal Collaborative Perception},
  author={Wen, Dempsey and Lu, Yifan and others},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}
```

If you use the OpenCOOD framework, please also cite:
```bibtex
@inproceedings{xu2022opencood,
  title={OpenCOOD: An Open Cooperative Perception Framework for Autonomous Driving},
  author={Xu, Runsheng and Lu, Yifan and others},
  booktitle={IEEE International Conference on Robotics and Automation (ICRA)},
  year={2022}
}
```

This project is licensed under the MIT License. See LICENSE for details.
The code is based on OpenCOOD and HEAL.
We thank the authors of OpenCOOD and HEAL for their excellent open-source frameworks. This work builds upon their contributions to collaborative perception research.
For questions or issues, please open an issue on GitHub or contact the authors.