
XIEFOX/PixDLM


📄 [CVPR 2026 Highlight] PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation

arXiv Dataset Model


✨ Authors

Shuyan Ke1, Yifan Mei1, Changli Wu1, 2, Yonghan Zheng1, Jiayi Ji1, Liujuan Cao1, Rongrong Ji1

1 Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
2 Shanghai Innovation Institute


📢 News

  • [2026.04] 🔥 PixDLM is selected as CVPR 2026 Highlight!
  • [2026.04] 📄 Paper is available on arXiv.
  • [2026.04] Pretrained models and inference code are available!
  • [2026.04] Training code released.
  • [2026.03] 🧩 DRSeg dataset released on HuggingFace.
  • [2026.01] 🎉 PixDLM is accepted by CVPR 2026!

🚀 Overview

Understanding complex aerial scenes requires not only pixel-level perception but also structured reasoning under UAV-specific visual characteristics such as small objects, large viewpoint variations, and high scene complexity.

In this project, we introduce PixDLM, a Dual-Path Multimodal Language Model designed for UAV reasoning segmentation, a new task that integrates instruction following, reasoning, and fine-grained segmentation.

PixDLM explicitly models the synergy between:

  • Semantic reasoning (Language-aligned reasoning path)
  • Fine-grained perception (Pixel-level visual path)

This dual-path design enables robust performance under challenging UAV scenarios.
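At a high level, the fusion of the two paths can be thought of as scoring every pixel feature against a query derived from the instruction. The sketch below is purely illustrative (the toy dimensions, the random projection standing in for the learned reasoning head, and the sigmoid fusion are our own simplifications, not the paper's architecture):

```python
import math
import random

random.seed(0)

# Hypothetical toy dimensions; the real model builds on LLaVA/CLIP/SAM features.
D_TEXT, D_PIX, H, W = 6, 4, 3, 3

# Fixed random projection standing in for the learned reasoning head.
W_Q = [[random.gauss(0, 1) for _ in range(D_PIX)] for _ in range(D_TEXT)]

def reasoning_path(instruction_emb):
    # Language-aligned path: map the instruction embedding to a query
    # vector that will score every pixel.
    return [sum(instruction_emb[i] * W_Q[i][j] for i in range(D_TEXT))
            for j in range(D_PIX)]

def segment(instruction_emb, pixel_feats):
    # Pixel-level path keeps fine-grained per-pixel features; fusion is a
    # dot product between the reasoning query and each pixel feature,
    # squashed to a soft mask value in (0, 1).
    q = reasoning_path(instruction_emb)
    return [[1 / (1 + math.exp(-sum(f * qj for f, qj in zip(px, q))))
             for px in row] for row in pixel_feats]

emb = [random.gauss(0, 1) for _ in range(D_TEXT)]
feats = [[[random.gauss(0, 1) for _ in range(D_PIX)]
          for _ in range(W)] for _ in range(H)]
mask = segment(emb, feats)  # H x W soft segmentation mask
```

The point of the decoupling is that the pixel path never has to compress its features into the language space; only the query crosses over.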


🌟 Highlights

  • 📌 New Task: UAV Reasoning Segmentation. We formalize it as an instruction-driven pixel-level prediction task, highlighting the limitations of existing reasoning models under aerial viewpoints.
  • 📊 New Dataset: DRSeg. The first large-scale UAV reasoning segmentation benchmark, featuring high-resolution aerial imagery and chain-of-thought (CoT) aligned reasoning annotations.
  • 🧠 New Model: PixDLM. A dual-path pixel-level MLLM that decouples reasoning and perception, enabling structured reasoning-guided segmentation with strong performance on both UAV and referring segmentation benchmarks.

🗂️ DRSeg Dataset

DRSeg is a large-scale benchmark designed specifically for UAV reasoning segmentation.

  • Statistics: 10,000 high-resolution UAV images | 10,000 instance masks | 10,000 reasoning QA pairs.
  • Reasoning Types: Spatial reasoning (33.33%), Attribute reasoning (33.34%), Scene-level reasoning (33.33%).
  • UAV-specific Properties: Multi-altitude distribution (30m, 60m, 100m) and small-object dominance (58.08% of instances occupy < 2% of the image area).
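The small-object statistic above uses relative area as its criterion. A minimal helper (our own, not part of the dataset tooling) that computes the same ratio from per-instance mask areas:

```python
def small_object_ratio(instance_areas, image_area, thresh=0.02):
    # Fraction of instances covering less than `thresh` (2%) of the image
    # area -- the criterion behind DRSeg's 58.08% small-object statistic.
    small = sum(1 for a in instance_areas if a / image_area < thresh)
    return small / len(instance_areas)

# Toy example: three instance masks on a 1000x1000 image.
areas = [5_000, 12_000, 400_000]               # pixels per instance mask
ratio = small_object_ratio(areas, 1000 * 1000)  # 2 of 3 are "small"
```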

⚙️ Installation

```shell
git clone https://github.com/XIEFOX/PixDLM.git
cd PixDLM

pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
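Since flash-attn builds against your local CUDA toolchain and can fail silently, a quick sanity check (our own snippet, not a repo script) that the key packages are importable:

```python
import importlib.util

def is_installed(pkg: str) -> bool:
    # True if the package can be imported in the current environment.
    return importlib.util.find_spec(pkg) is not None

# flash-attn is needed for the fast attention kernels.
for pkg in ("torch", "flash_attn"):
    print(f"{pkg}: {'ok' if is_installed(pkg) else 'missing'}")
```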

📁 Data & Weights Preparation

Please download the pretrained PixDLM weights and the DRSeg dataset (see the Dataset and Model links above).

Organize your project directory as follows:

```
PixDLM/
├── PixDLM/
│   └── pytorch_model.bin        <-- Place the pretrained weights here
├── data/
│   └── DRSeg/                   <-- Place the dataset here
├── model/
├── train.sh
├── eval.sh
└── ...
```

⚠️ Important: make sure the pretrained weight file `pytorch_model.bin` is placed at `PixDLM/PixDLM/pytorch_model.bin`; otherwise the model will not load correctly.
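To fail early with a clear message instead of a confusing load-time error, you can verify the expected layout before launching anything (a hypothetical helper, not part of the repo; `root` is the directory containing the inner `PixDLM/` folder):

```python
from pathlib import Path

def checkpoint_path(root: str = ".") -> Path:
    # PixDLM expects the checkpoint at <root>/PixDLM/pytorch_model.bin.
    ckpt = Path(root) / "PixDLM" / "pytorch_model.bin"
    if not ckpt.is_file():
        raise FileNotFoundError(
            f"pretrained weights not found at {ckpt}; "
            "download pytorch_model.bin and place it there"
        )
    return ckpt
```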

🏋️ Training

🔹 Pretrained Initialization

Our model is initialized from:

  • liuhaotian/llava-v1.6-vicuna-7b
  • openai/clip-vit-large-patch14
  • sam2.1_hiera_l

🔹 Training

```shell
sh train.sh
```

🔧 Post-processing

Convert DeepSpeed weights

```shell
cd path_to_ckpt_model
python zero_to_fp32.py . ../pytorch_model.bin
```

Merge LoRA weights

```shell
sh merge.sh
```

Extract model components

```shell
python extract.py
```

Update the weight path in `./model/llava/model/multimodal_encoder/multipath_encoder_wapper.py`:

```python
weights_dict = torch.load("path_to_extracted_model", map_location="cpu")
```
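When loading the extracted weights into the encoder wrapper, a common pitfall is a key mismatch between the saved dict and the target module. A small standalone helper (our own, not part of the repo) that mirrors what PyTorch's `load_state_dict(strict=False)` silently tolerates, so you can inspect the mismatches:

```python
def split_state_dict(weights_dict, model_keys):
    # Partition a loaded state dict into entries the target module can
    # accept, keys the module expects but the dict lacks, and leftovers.
    kept = {k: v for k, v in weights_dict.items() if k in model_keys}
    missing = sorted(set(model_keys) - set(kept))
    unexpected = sorted(set(weights_dict) - set(kept))
    return kept, missing, unexpected

kept, missing, unexpected = split_state_dict(
    {"vision.proj.weight": 1, "lm_head.weight": 2},
    {"vision.proj.weight", "vision.proj.bias"},
)
# missing == ['vision.proj.bias'], unexpected == ['lm_head.weight']
```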

📊 Evaluation

```shell
sh eval.sh
```

🔮 Future Work

  • Scaling UAV reasoning datasets
  • Improving long-chain reasoning consistency
  • Real-world UAV deployment

📌 Citation

If you find our work useful, please consider citing:

```bibtex
@inproceedings{ke2026pixdlm,
  title={PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation},
  author={Ke, Shuyan and Mei, Yifan and Wu, Changli and Zheng, Yonghan and Ji, Jiayi and Cao, Liujuan and Ji, Rongrong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```

⭐ Acknowledgements

This project is built upon the following open-source works:

  • LLaVA
  • CLIP
  • Segment Anything (SAM)

We sincerely thank the authors for their contributions.


⭐ If you find this project useful, please consider giving it a star!
