# SLGNet 🎮: Synergizing Structural Priors and Language-Guided Modulation for Multimodal Object Detection

Xiantai Xiang<sup>1,2,3</sup>, Guangyao Zhou<sup>1,2</sup>, Zixiao Wen<sup>1,2</sup>, Wenshuai Li<sup>1,2</sup>, Ben Niu<sup>1,2</sup>, Feng Wang<sup>1,2</sup>, Lijia Huang<sup>1,2,3</sup>, Qiantong Wang<sup>1,2</sup>, Yuhan Liu<sup>1,2,3</sup>, Zongxu Pan<sup>4</sup>, Yuxin Hu<sup>1,2,3</sup>

<sup>1</sup> Aerospace Information Research Institute, CAS;
<sup>2</sup> Key Laboratory of Target Cognition and Application Technology, CAS;
<sup>3</sup> University of Chinese Academy of Sciences;
<sup>4</sup> Xi'an Jiaotong University

This repository is the official implementation of the paper: SLGNet: Synergizing Structural Priors and Language-Guided Modulation for Multimodal Object Detection.

SLGNet is a parameter-efficient framework designed to transfer the capabilities of frozen RGB foundation models (e.g., DINOv2) to multimodal RGB-IR object detection. By synergizing a Structure-Aware Adapter and Language-Guided Modulation (LGM), SLGNet addresses structural degradation in Vision Transformers while endowing the model with robust environmental awareness.

*Figure 1: Overview of the SLGNet framework. The architecture synergizes a frozen ViT backbone with a Structure-Aware Adapter for geometric localization and LGM for environmental adaptation.*


## Core Features 🌟

- **Parameter-Efficient Tuning:** Effectively adapts frozen foundation models while reducing trainable parameters by approximately 87% compared to full fine-tuning.
- **Structure-Aware Adapter (SA-Adapter):** Extracts hierarchical structural priors (edges/contours) from both modalities to compensate for spatial detail loss in ViTs.
- **Language-Guided Modulation (LGM):** Exploits VLM-driven structured captions (Environment, Scene, Objects, Thermal) to dynamically recalibrate visual features via affine transformations.
- **State-of-the-Art Performance:** Establishes new SOTA results on the LLVIP, FLIR, KAIST, and DroneVehicle benchmarks.
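Since the model code is not yet released, the LGM idea can be illustrated with a minimal FiLM-style sketch: caption-derived scale and shift parameters recalibrate each visual feature channel through an affine transform. The function name, shapes, and values below are illustrative assumptions, not the paper's actual implementation (where `gamma`/`beta` would be predicted from the structured caption by a text encoder).

```python
# Hypothetical sketch of Language-Guided Modulation as a per-channel
# affine transform: y_c = gamma_c * x_c + beta_c. All names and shapes
# are illustrative, not SLGNet's actual code.

def language_guided_modulation(features, gamma, beta):
    """Modulate each feature channel with a scale (gamma) and shift (beta)."""
    return [
        [g * x + b for x in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]

# Toy example: 2 channels, 3 spatial positions each.
visual = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]
gamma = [2.0, 0.5]   # per-channel scale (would come from the caption encoder)
beta  = [0.0, 1.0]   # per-channel shift (would come from the caption encoder)

modulated = language_guided_modulation(visual, gamma, beta)
print(modulated)  # [[2.0, 4.0, 6.0], [3.0, 3.5, 4.0]]
```

The key property of this form is that the text signal changes *how* visual features are weighted without adding spatial parameters, which keeps the modulation cheap relative to cross-attention.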

## 📋 TODO List

### 🔧 Development Roadmap

| Status | Task | Description |
| --- | --- | --- |
| 🔜 | Release model checkpoints | Coming soon after publication |
| 🔜 | Training and evaluation scripts | Coming soon after publication |
| 🔜 | Open-source model code | Coming soon after publication |

> ℹ️ **Note:** All the resources mentioned above will be made publicly available upon the official publication of our research paper.

## Performance 📊

### Quantitative Comparison (mAP)

| Method | Modality | LLVIP ($mAP_{50}$) | FLIR ($mAP_{50}$) | DroneVehicle ($mAP$) | Trainable Params |
| --- | --- | --- | --- | --- | --- |
| Full Fine-tuning | RGB+IR | 54.5 | 82.2 | 75.3 | 96.0M |
| UniRGB-IR | RGB+IR | 96.1 | 81.4 | - | 8.9M |
| COFNet | RGB+IR | 97.7 | 83.6 | - | 90.2M |
| **SLGNet (Ours)** | RGB+IR | **98.3** | **85.8** | **80.7** | 12.1M |

Detailed results and comparisons are available in the technical paper.
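The "approximately 87%" parameter reduction claimed above can be sanity-checked directly from the table's trainable-parameter counts (12.1M for SLGNet versus 96.0M for full fine-tuning); the short arithmetic sketch below is ours, not part of the release.

```python
# Sanity-checking the "~87% fewer trainable parameters" claim using the
# numbers reported in the comparison table.

full_ft_params = 96.0e6   # full fine-tuning, from the table
slgnet_params = 12.1e6    # SLGNet (Ours), from the table

reduction = 1.0 - slgnet_params / full_ft_params
print(f"{reduction:.1%}")  # 87.4%
```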

## Contact and Acknowledgments

Feel free to contact me via email (xiangxiantai@gmail.com) or by opening a GitHub issue.

## Citation

If you find our paper or benchmark helpful for your research, please consider citing it. Thank you very much!

```bibtex
@misc{xiang2026slgnetsynergizingstructuralpriors,
      title={SLGNet: Synergizing Structural Priors and Language-Guided Modulation for Multimodal Object Detection},
      author={Xiantai Xiang and Guangyao Zhou and Zixiao Wen and Wenshuai Li and Ben Niu and Feng Wang and Lijia Huang and Qiantong Wang and Yuhan Liu and Zongxu Pan and Yuxin Hu},
      year={2026},
      eprint={2601.02249},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.02249},
}
```
