SLGNet 🎮: Synergizing Structural Priors and Language-Guided Modulation for Multimodal Object Detection
Xiantai Xiang1,2,3 Guangyao Zhou1,2 Zixiao Wen1,2 Wenshuai Li1,2 Ben Niu1,2 Feng Wang1,2 Lijia Huang1,2,3 Qiantong Wang1,2 Yuhan Liu1,2,3 Zongxu Pan4 Yuxin Hu1,2,3
2 Key Laboratory of Target Cognition and Application Technology, CAS;
3 University of Chinese Academy of Sciences;
4 Xi'an Jiaotong University
This repository is the official implementation of the paper: SLGNet: Synergizing Structural Priors and Language-Guided Modulation for Multimodal Object Detection.
SLGNet is a parameter-efficient framework designed to transfer the capabilities of frozen RGB foundation models (e.g., DINOv2) to multimodal RGB-IR object detection. By synergizing a Structure-Aware Adapter and Language-Guided Modulation (LGM), SLGNet addresses structural degradation in Vision Transformers while endowing the model with robust environmental awareness.
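The "affine transformation" idea behind LGM can be illustrated with a minimal, FiLM-style sketch. This is a plain-Python toy under our own assumptions, not SLGNet's actual implementation: we assume a structured caption is already encoded into per-channel scale (`gamma`) and shift (`beta`) parameters that recalibrate the visual features.

```python
# Toy sketch of language-guided affine modulation (FiLM-style).
# gamma/beta would be predicted from a caption embedding in practice;
# here they are hard-coded for illustration.

def affine_modulate(features, gamma, beta):
    """Per-channel affine modulation: out[c][i] = gamma[c] * features[c][i] + beta[c]."""
    return [
        [gamma[c] * v + beta[c] for v in channel]
        for c, channel in enumerate(features)
    ]

# Toy visual features: 2 channels, 3 spatial positions each.
features = [[1.0, 2.0, 3.0],
            [4.0, 5.0, 6.0]]
# Hypothetical parameters derived from a structured caption
# (e.g. "Environment: night; Thermal: high contrast").
gamma = [0.5, 2.0]
beta = [1.0, -1.0]

print(affine_modulate(features, gamma, beta))
# [[1.5, 2.0, 2.5], [7.0, 9.0, 11.0]]
```

In a real network the modulated features would feed the next transformer block; the point is only that a lightweight, caption-conditioned scale-and-shift can steer a frozen backbone.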
Figure 1: Overview of the SLGNet framework. The architecture synergizes a frozen ViT backbone with a Structure-Aware Adapter for geometric localization and LGM for environmental adaptation.
- Parameter-Efficient Tuning: Effectively adapts frozen foundation models while reducing trainable parameters by approximately 87% compared to full fine-tuning.
- Structure-Aware Adapter (SA-Adapter): Extracts hierarchical structural priors (edges/contours) from both modalities to compensate for spatial detail loss in ViTs.
- Language-Guided Modulation (LGM): Exploits VLM-driven structured captions (Environment, Scene, Objects, Thermal) to dynamically recalibrate visual features via affine transformations.
- State-of-the-art Performance: Establishes new SOTA results on LLVIP, FLIR, KAIST, and Drone Vehicle benchmarks.
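To make the "structural priors (edges/contours)" bullet concrete, here is a toy sketch of extracting an edge-magnitude map with Sobel filters, the kind of low-level cue a structure-aware adapter could inject. This is an illustrative stand-in, not the SA-Adapter's actual code.

```python
# Toy edge-prior extraction with 3x3 Sobel filters (valid convolution).

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def conv3x3(img, kernel, y, x):
    """3x3 correlation centered at (y, x)."""
    return sum(kernel[dy][dx] * img[y + dy - 1][x + dx - 1]
               for dy in range(3) for dx in range(3))

def edge_magnitude(img):
    """Gradient-magnitude map over interior pixels only."""
    h, w = len(img), len(img[0])
    return [
        [(conv3x3(img, SOBEL_X, y, x) ** 2 +
          conv3x3(img, SOBEL_Y, y, x) ** 2) ** 0.5
         for x in range(1, w - 1)]
        for y in range(1, h - 1)
    ]

# Toy image with a vertical intensity step between columns 2 and 3.
img = [[0, 0, 0, 10, 10]] * 4
print(edge_magnitude(img))
# [[0.0, 40.0, 40.0], [0.0, 40.0, 40.0]] -- response concentrates at the step
```

The same operator applied to both RGB and IR inputs yields complementary contour maps, which is the intuition behind fusing structural priors from both modalities.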
| Status | Task Description |
|---|---|
| 🔜 | Release model checkpoints |
| 🔜 | Training and evaluation scripts |
| 🔜 | Open-source model code |
ℹ️ Note: All the resources mentioned above will be made publicly available upon the official publication of our research paper.
| Methods | Modality | LLVIP | FLIR | Drone Vehicle | Trainable Params |
|---|---|---|---|---|---|
| Full Fine-tuning | RGB+IR | 54.5 | 82.2 | 75.3 | 96.0M |
| UniRGB-IR | RGB+IR | 96.1 | 81.4 | - | 8.9M |
| COFNet | RGB+IR | 97.7 | 83.6 | - | 90.2M |
| SLGNet (Ours) | RGB+IR | 98.3 | 85.8 | 80.7 | 12.1M |
Detailed results and comparisons are available in the technical paper.
Feel free to contact us via email (xiangxiantai@gmail.com) or by opening a GitHub issue.
If you find our paper or benchmark helpful for your research, please consider citing it. Thank you very much!
@misc{xiang2026slgnetsynergizingstructuralpriors,
title={SLGNet: Synergizing Structural Priors and Language-Guided Modulation for Multimodal Object Detection},
author={Xiantai Xiang and Guangyao Zhou and Zixiao Wen and Wenshuai Li and Ben Niu and Feng Wang and Lijia Huang and Qiantong Wang and Yuhan Liu and Zongxu Pan and Yuxin Hu},
year={2026},
eprint={2601.02249},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.02249},
}
