
SiMO: Single-Modal-Operable Multimodal Collaborative Perception


Official PyTorch Implementation

This repository contains the official implementation of the paper "Single-Modal-Operable Multimodal Collaborative Perception" (ICLR 2026).

🎉 Update: Pretrained checkpoints are now available! Download them from our Hugging Face repository.


Abstract

Multimodal collaborative perception promises robust 3D object detection by fusing complementary sensor data from multiple connected vehicles. However, existing methods suffer from catastrophic performance degradation when one modality becomes unavailable during deployment, a common scenario in real-world autonomous driving. SiMO addresses this critical limitation through two key innovations:

  1. LAMMA (Length-Adaptive Multi-Modal Fusion): A novel fusion module that adaptively handles a variable number of input modalities, operating like a parallel circuit rather than a sequential fusion chain.

  2. PAFR Training Strategy: A four-stage training paradigm (Pretrain-Align-Fuse-Random Drop) that prevents modality competition and enables seamless single-modal operation.

| Modality | AP@30 | AP@50 | AP@70 |
|---|---|---|---|
| LiDAR + Camera | 98.30 | 97.94 | 94.64 |
| LiDAR-only | 97.32 | 97.07 | 94.06 |
| Camera-only | 80.81 | 69.63 | 44.82 |

Key Result: SiMO achieves state-of-the-art performance on OPV2V-H with graceful degradation when modalities fail.


Key Features

  • Single-Modal Operability: First multimodal collaborative perception framework that maintains functional performance with any subset of modalities
  • Adaptive Fusion: LAMMA module dynamically adjusts to available modalities without architecture changes
  • No Modality Competition: PAFR training prevents feature suppression between modalities
  • Drop-in Replacement: Compatible with existing fusion frameworks like HEAL's Pyramid Fusion
  • Multi-Dataset Support: Evaluated on OPV2V-H, V2XSet, and DAIR-V2X-C

Architecture Overview

LAMMA (Length-Adaptive Multi-Modal Fusion)

LAMMA is the core fusion module that enables SiMO's single-modal operability:

Input: Camera Features (B, N, C, H, W) + LiDAR Features (B, N, C, H, W)
         ↓
    [Positional Encoding]
         ↓
    [Feature Projection] → Downsampling (2x)
         ↓
    [Modality-Aware Masking] ← Single-mode or Random Drop
         ↓
    [Cross-Attention] × 2 (Camera branch + LiDAR branch)
         ↓
    [Parallel Fusion] → Sum of attended features
         ↓
    [Feature Recovery] → Upsampling (2x)
         ↓
Output: Fused Features + Single-Modal Features

Key Design Principles:

  • Parallel Processing: Unlike sequential fusion, LAMMA processes modalities in parallel and sums their contributions
  • Adaptive Masking: During training, random modality dropout forces the network to learn robust single-modal representations
  • Cross-Attention: Each modality attends to the concatenated features of all available modalities
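The parallel, mask-aware design described above can be sketched in a few lines of PyTorch. This is an illustrative toy module, not the repository's LAMMA implementation; class and argument names are ours, and it assumes both modalities are flattened to the same number of BEV tokens so their attended outputs can be summed:

```python
import torch
import torch.nn as nn

class ParallelCrossAttentionFusion(nn.Module):
    """Toy sketch of parallel cross-attention fusion (not the official LAMMA
    code): each available modality attends to the concatenation of all
    available modalities, and the attended outputs are summed in parallel."""

    def __init__(self, dim=64, heads=2):
        super().__init__()
        self.cam_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lidar_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cam_tokens, lidar_tokens,
                cam_available=True, lidar_available=True):
        # Modality-aware masking: the context contains only present modalities.
        available = []
        if cam_available:
            available.append(cam_tokens)
        if lidar_available:
            available.append(lidar_tokens)
        context = torch.cat(available, dim=1)  # (B, sum of tokens, C)

        # Parallel branches: each present modality queries the shared context,
        # and contributions are summed (assumes equal token counts per modality).
        fused = 0
        if cam_available:
            out, _ = self.cam_attn(cam_tokens, context, context)
            fused = fused + out
        if lidar_available:
            out, _ = self.lidar_attn(lidar_tokens, context, context)
            fused = fused + out
        return fused
```

Because the context is built only from available modalities, the same module runs unchanged in multimodal, LiDAR-only, or camera-only mode, which is the property that enables single-modal operability.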

Integration with Pyramid Fusion

SiMO works seamlessly with HEAL's Pyramid Fusion framework:

Stage 1: Single-Modal Encoders (PointPillar for LiDAR, Lift-Splat-Shoot for Camera)
         ↓
Stage 2: Single-Modal Backbones (ResNet-based BEV feature extraction)
         ↓
Stage 3: Modality Alignment (ConvNeXt-based feature alignment)
         ↓
Stage 4: LAMMA Fusion (Adaptive multimodal fusion)
         ↓
Stage 5: Pyramid Fusion Backbone (Multi-scale collaborative aggregation)
         ↓
Stage 6: Detection Head (Anchor-based 3D object detection)
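The six stages above amount to a simple forward composition. The sketch below is conceptual only; the attribute names (`lidar_encoder`, `lamma`, `pyramid_backbone`, etc.) are illustrative and do not correspond to the repository's actual module API:

```python
def simo_forward(lidar_points, camera_images, model):
    """Conceptual forward pass through the six SiMO stages
    (module names are illustrative, not the repo's API)."""
    # Stages 1-2: single-modal encoders + BEV feature backbones
    lidar_bev = model.lidar_backbone(model.lidar_encoder(lidar_points))
    cam_bev = model.camera_backbone(model.camera_encoder(camera_images))
    # Stage 3: ConvNeXt-based alignment to a shared representation space
    lidar_bev = model.lidar_aligner(lidar_bev)
    cam_bev = model.camera_aligner(cam_bev)
    # Stage 4: LAMMA adaptive multimodal fusion
    fused = model.lamma(cam_bev, lidar_bev)
    # Stage 5: multi-scale collaborative aggregation across agents
    collab = model.pyramid_backbone(fused)
    # Stage 6: anchor-based 3D detection head
    return model.head(collab)
```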

PAFR Training Strategy

The PAFR (Pretrain-Align-Fuse-Random Drop) strategy consists of four stages:

Stage 1: Pretrain (P)

Goal: Train single-modal feature extractors independently

# Pretrain LiDAR branch
python opencood/tools/train.py --hypes_yaml opencood/hypes_yaml/opv2v/LiDAROnly/lidar_pyramid.yaml

# Pretrain Camera branch
python opencood/tools/train.py --hypes_yaml opencood/hypes_yaml/opv2v/CameraOnly/camera_pyramid.yaml

Configuration: Set freeze: true for all pretrained components in subsequent stages.

Stage 2: Align (A)

Goal: Align multi-modal features to a common representation space using ConvNeXt

Key Configuration:

aligner_args:
  core_method: convnext
  freeze: true
  spatial_align: false
  args:
    num_of_blocks: 3
    dim: 64

Training: Train with both modalities, allowing the aligner to learn cross-modal feature correspondence.

Stage 3: Fuse (F)

Goal: Train LAMMA fusion module with full multimodal inputs

Key Configuration:

mm_fusion_method: 'lamma3'
lamma:
  freeze: false
  feature_stride: 2
  feat_dim: 64
  dim: 128
  heads: 2
  single_mode: false
  random_drop: false

Important: Keep random_drop: false and single_mode: false during this stage.

Stage 4: Random Drop (RD)

Goal: Fine-tune with random modality dropout to enable single-modal operation

Key Configuration:

lamma:
  random_drop: true
  lidar_drop_ratio: 0.5  # 50% chance to drop LiDAR when dropping

Training: With 50% probability, randomly drop one modality during training. This forces the network to maintain functional performance with either modality alone.
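The sampling logic of the Random Drop stage can be sketched as follows. Function and parameter names here are ours (illustrative, not the repo's): with probability `drop_prob` one modality is dropped, and when dropping, LiDAR is chosen with probability `lidar_drop_ratio`, otherwise the camera is dropped:

```python
import random

def sample_modality_mask(random_drop=True, drop_prob=0.5,
                         lidar_drop_ratio=0.5, rng=random):
    """Illustrative sketch of Random Drop sampling (names are ours).

    Returns (use_lidar, use_camera); at least one is always True, so the
    network always receives at least one modality."""
    use_lidar, use_camera = True, True
    if random_drop and rng.random() < drop_prob:
        if rng.random() < lidar_drop_ratio:
            use_lidar = False   # drop LiDAR, train camera branch alone
        else:
            use_camera = False  # drop camera, train LiDAR branch alone
    return use_lidar, use_camera
```

With `lidar_drop_ratio: 0.5`, each modality is dropped in roughly a quarter of the training iterations overall, so both single-modal paths receive gradient signal.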


Installation

This project is implemented based on HEAL and adopts the same environment setup. Please refer to the HEAL repository for detailed installation instructions and troubleshooting.

Prerequisites

  • Python >= 3.8
  • PyTorch >= 1.12.0
  • CUDA >= 11.3
  • spconv >= 2.0

Step 1: Clone Repository

git clone https://github.com/dempsey-wen/SiMO.git
cd SiMO

Step 2: Install Dependencies

# Install PyTorch (adjust CUDA version as needed)
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113

# Install spconv (for LiDAR feature extraction)
pip install spconv-cu113

# Install other requirements
pip install -r requirements.txt

Key Dependencies:

  • easydict~=1.9
  • opencv-python-headless~=4.5.1.48
  • timm
  • einops
  • shapely==2.0.0
  • efficientnet_pytorch==0.7.0

Step 3: Install OpenCOOD

pip install -e .

Step 4: Compile CUDA Extensions

cd opencood/pcdet_utils/pointnet2
python setup.py install
cd ../iou3d_nms
python setup.py install
cd ../../..

Data Preparation

Supported Datasets

SiMO supports the following collaborative perception datasets:

| Dataset | Scenarios | Modalities | Download |
|---|---|---|---|
| OPV2V-H | Highway, Urban | LiDAR, Camera | Link |
| V2XSet | Highway, Urban | LiDAR, Camera | Link |
| DAIR-V2X-C | Real-world | LiDAR, Camera | Link |

Directory Structure

data/
├── OPV2V/
│   ├── train/
│   ├── validate/
│   └── test/
├── V2XSet/
│   ├── train/
│   ├── validate/
│   └── test/
└── DAIR-V2X/
    └── ...

Training Commands

Complete PAFR Pipeline

Step 1: Pretrain Single-Modal Branches

# LiDAR-only pretraining (20 epochs)
python opencood/tools/train.py \
    --hypes_yaml opencood/hypes_yaml/opv2v/LiDAROnly/lidar_pyramid.yaml

# Camera-only pretraining (50 epochs)
python opencood/tools/train.py \
    --hypes_yaml opencood/hypes_yaml/opv2v/CameraOnly/camera_pyramid.yaml

Output: Model checkpoints saved to saved_models/opv2v_lidar_pyramid/ and saved_models/opv2v_camera_pyramid/

Step 2: Align Multi-Modal Features

This stage trains the modality aligners to align LiDAR and camera features to a common representation space.

Key Configuration: The aligner trains independently for each modality with frozen encoders and backbones.

2.1 Train LiDAR Aligner

Modify config to set single_modality: lidar and freeze camera aligner:

model:
  args:
    single_modality: lidar
    lidar_aligner:
      freeze: false
    camera_aligner:
      freeze: true  # Freeze camera aligner
    lidar_encoder:
      freeze: true   # Freeze pretrained LiDAR encoder
    lidar_backbone:
      freeze: true   # Freeze pretrained LiDAR backbone
    camera_encoder:
      freeze: true   # Freeze pretrained camera encoder
    camera_backbone:
      freeze: true   # Freeze pretrained camera backbone

Then run training:

python opencood/tools/train.py \
    --hypes_yaml opencood/hypes_yaml/opv2v/MoreModality/lidar_camera_lamma3_pyramid_fusion.yaml

2.2 Train Camera Aligner

Modify config to set single_modality: camera and freeze LiDAR aligner:

model:
  args:
    single_modality: camera
    lidar_aligner:
      freeze: true   # Freeze LiDAR aligner
    camera_aligner:
      freeze: false
    # Keep encoders and backbones frozen

Then run training:

python opencood/tools/train.py \
    --hypes_yaml opencood/hypes_yaml/opv2v/MoreModality/lidar_camera_lamma3_pyramid_fusion.yaml

Output: Checkpoints saved with trained aligners. Load these checkpoints for the next stage.

Step 3: Train LAMMA Fusion

This stage trains the LAMMA fusion module with both aligners frozen.

Key Configuration: Set single_modality: false to enable full multimodal fusion.

model:
  args:
    single_modality: false   # Enable full multimodal fusion
    lidar_aligner:
      freeze: true    # Freeze both aligners
    camera_aligner:
      freeze: true
    lamma:
      random_drop: false  # Disable random drop in this stage
      single_mode: false

Set model_dir to load the pretrained aligner checkpoints from the Align stage.

python opencood/tools/train.py \
    --hypes_yaml opencood/hypes_yaml/opv2v/MoreModality/lidar_camera_lamma3_pyramid_fusion.yaml \
    --model_dir saved_models/opv2v_lidarcamera_aligned/

Step 4: Random Drop Fine-tuning

This final stage enables random modality dropout during training to ensure robust single-modal operation.

Key Configuration: Set lamma.random_drop: true to enable random dropout.

model:
  args:
    single_modality: false   # Still enable full multimodal fusion
    lamma:
      random_drop: true       # Enable random modality dropout
      lidar_drop_ratio: 0.5  # 50% probability to drop LiDAR when dropping
      single_mode: false

Then resume training with the Fusion stage checkpoint:

python opencood/tools/train.py \
    --hypes_yaml opencood/hypes_yaml/opv2v/MoreModality/lidar_camera_lamma3_pyramid_fusion.yaml \
    --model_dir saved_models/opv2v_lidarcamera_lamma3_fused/

Training Notes:

  • During training, with 50% probability, one modality is randomly dropped
  • This forces the network to maintain functional performance with either modality alone
  • The final checkpoint will have robust single-modal operability

Testing Commands

Multimodal Testing (LiDAR + Camera)

python opencood/tools/inference.py \
    --model_dir saved_models/opv2v_lidarcamera_lamma3_pyramid_fusion/ \
    --fusion_method intermediate

Single-Modal Testing

LiDAR-Only Inference

Modify the config to set single_modality: lidar:

model:
  args:
    single_modality: lidar

Then run inference:

python opencood/tools/inference.py \
    --model_dir saved_models/opv2v_lidarcamera_lamma3_pyramid_fusion/ \
    --fusion_method intermediate

Camera-Only Inference

model:
  args:
    single_modality: camera
python opencood/tools/inference.py \
    --model_dir saved_models/opv2v_lidarcamera_lamma3_pyramid_fusion/ \
    --fusion_method intermediate

Evaluation with Different Ranges

python opencood/tools/inference.py \
    --model_dir saved_models/opv2v_lidarcamera_lamma3_pyramid_fusion/ \
    --fusion_method intermediate \
    --range 51.2,51.2

Save Visualization

python opencood/tools/inference.py \
    --model_dir saved_models/opv2v_lidarcamera_lamma3_pyramid_fusion/ \
    --fusion_method intermediate \
    --save_vis_interval 10

Benchmark Results

OPV2V-H Test Set

SiMO-PF (Pyramid Fusion + LAMMA)

| Method | Modality | AP@30 | AP@50 | AP@70 | Modality Dropped? |
|---|---|---|---|---|---|
| SiMO-PF | LiDAR + Camera | 98.30 | 97.94 | 94.64 | No |
| SiMO-PF | LiDAR only | 97.32 | 97.07 | 94.06 | Yes |
| SiMO-PF | Camera only | 80.81 | 69.63 | 44.82 | Yes |

Key Observations:

  • SiMO maintains >97% AP@50 even when operating with LiDAR alone
  • Camera-only performance is competitive for low-precision detection (AP@30 = 80.81)
  • Graceful degradation pattern enables safe fallback strategies

Comparison with Baselines

| Method | LiDAR+Camera AP@50 | LiDAR-Only AP@50 | Camera-Only AP@50 |
|---|---|---|---|
| BM2CP (Zhao et al., 2023) | 91.45 | 91.31 | 0.00 |
| BEVFusion (Liu et al., 2023) | 94.21 | 91.99 | 0.00 |
| UniBEV (Wang et al., 2024a) | 91.71 | 91.73 | 0.00 |
| AttFusion (Xu et al., 2022c) | - | 95.09 | 52.91 |
| HEAL (Lu et al., 2024) | - | 98.00 | 60.48 |
| SiMO (AttFusion w/ RD) | 94.98 | 94.02 | 49.69 |
| SiMO (Pyramid Fusion w/ RD) (Ours) | 97.94 | 97.07 | 69.63 |

V2XSet Test Set

| Method | LiDAR+Camera AP@50 | LiDAR-Only AP@50 | Camera-Only AP@50 |
|---|---|---|---|
| SiMO-PF | 92.66 | 90.44 | 56.42 |

DAIR-V2X-C Test Set

| Method | LiDAR+Camera AP@50 | LiDAR-Only AP@50 | Camera-Only AP@50 |
|---|---|---|---|
| SiMO-PF | 51.82 | 52.33 | 2.24 |

Model Zoo

Pretrained models are available on Hugging Face.

| Model | Dataset | Config | Checkpoint |
|---|---|---|---|
| SiMO-PF | OPV2V-H | Config | 🤗 HF |
| SiMO-AttFuse | OPV2V-H | Config | 🤗 HF |

Download Models from Hugging Face

# Install huggingface-hub
pip install huggingface-hub

# Download all checkpoints
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='DempseyWen/SiMO', repo_type='model')"

# Or download specific model
python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='DempseyWen/SiMO', filename='path/to/checkpoint.pth')"

The downloaded checkpoints will be saved to ~/.cache/huggingface/hub/. You can also manually download from Hugging Face.

Project Structure

SiMO/
β”œβ”€β”€ opencood/
β”‚   β”œβ”€β”€ models/
β”‚   β”‚   β”œβ”€β”€ fuse_modules/
β”‚   β”‚   β”‚   β”œβ”€β”€ lamma.py              # LAMMA implementation
β”‚   β”‚   β”‚   └── pyramid_fuse.py       # Pyramid Fusion
β”‚   β”‚   └── heter_pyramid_collab.py   # Main model
β”‚   β”œβ”€β”€ tools/
β”‚   β”‚   β”œβ”€β”€ train.py                  # Training script
β”‚   β”‚   β”œβ”€β”€ train_ddp.py              # Distributed training
β”‚   β”‚   └── inference.py              # Testing script
β”‚   β”œβ”€β”€ hypes_yaml/
β”‚   β”‚   └── opv2v/
β”‚   β”‚       β”œβ”€β”€ LiDAROnly/            # Single-modal configs
β”‚   β”‚       β”œβ”€β”€ CameraOnly/
β”‚   β”‚       └── MoreModality/         # Multimodal configs
β”‚   └── data_utils/
β”‚       └── datasets/                 # Dataset loaders
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ setup.py
└── README.md

Citation

If you find this work useful for your research, please cite:

@inproceedings{wen2026simo,
  title={Single-Modal-Operable Multimodal Collaborative Perception},
  author={Wen, Dempsey and Lu, Yifan and others},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}

If you use the OpenCOOD framework, please also cite:

@inproceedings{xu2022opencood,
  title={OpenCOOD: An Open Cooperative Perception Framework for Autonomous Driving},
  author={Xu, Runsheng and Lu, Yifan and others},
  booktitle={IEEE International Conference on Robotics and Automation (ICRA)},
  year={2022}
}

License

This project is licensed under the MIT License. See LICENSE for details.

The code is based on OpenCOOD and HEAL.


Acknowledgements

We thank the authors of OpenCOOD and HEAL for their excellent open-source frameworks. This work builds upon their contributions to collaborative perception research.


Contact

For questions or issues, please open an issue on GitHub or contact the authors.
