
XIEFOX/PixDLM


📄 [CVPR 2026 Highlight] PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation

arXiv Dataset Model


✨ Authors

Shuyan Ke1, Yifan Mei1, Changli Wu1, 2, Yonghan Zheng1, Jiayi Ji1, Liujuan Cao1, Rongrong Ji1

1 Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
2 Shanghai Innovation Institute


📢 News

  • [2026.04] 🔥 PixDLM is selected as CVPR 2026 Highlight!
  • [2026.04] 📄 Paper is available on arXiv.
  • [2026.04] Pretrained models and inference code are available!
  • [2026.04] Training code released.
  • [2026.03] 🧩 DRSeg dataset released on HuggingFace.
  • [2026.01] 🎉 PixDLM is accepted by CVPR 2026!

🚀 Overview

Understanding complex aerial scenes requires not only pixel-level perception but also structured reasoning under UAV-specific visual characteristics such as small objects, large viewpoint variations, and high scene complexity.

In this project, we introduce PixDLM, a Dual-Path Multimodal Language Model designed for UAV reasoning segmentation, a new task that integrates instruction following, reasoning, and fine-grained segmentation.

PixDLM explicitly models the synergy between:

  • Semantic reasoning (Language-aligned reasoning path)
  • Fine-grained perception (Pixel-level visual path)

This dual-path design enables robust performance under challenging UAV scenarios.
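At a high level, the fusion of the two paths can be thought of as scoring every pixel feature against a query derived from the instruction. The sketch below is purely illustrative (the toy dimensions, the random projection standing in for the learned reasoning head, and the sigmoid fusion are our own simplifications, not the paper's architecture):

```python
import math
import random

random.seed(0)

# Hypothetical toy dimensions; the real model builds on LLaVA/CLIP/SAM features.
D_TEXT, D_PIX, H, W = 6, 4, 3, 3

# Fixed random projection standing in for the learned reasoning head.
W_Q = [[random.gauss(0, 1) for _ in range(D_PIX)] for _ in range(D_TEXT)]

def reasoning_path(instruction_emb):
    # Language-aligned path: map the instruction embedding to a query
    # vector that will score every pixel.
    return [sum(instruction_emb[i] * W_Q[i][j] for i in range(D_TEXT))
            for j in range(D_PIX)]

def segment(instruction_emb, pixel_feats):
    # Pixel-level path keeps fine-grained per-pixel features; fusion is a
    # dot product between the reasoning query and each pixel feature,
    # squashed to a soft mask value in (0, 1).
    q = reasoning_path(instruction_emb)
    return [[1 / (1 + math.exp(-sum(f * qj for f, qj in zip(px, q))))
             for px in row] for row in pixel_feats]

emb = [random.gauss(0, 1) for _ in range(D_TEXT)]
feats = [[[random.gauss(0, 1) for _ in range(D_PIX)]
          for _ in range(W)] for _ in range(H)]
mask = segment(emb, feats)  # H x W soft segmentation mask
```

The point of the decoupling is that the pixel path never has to compress its features into the language space; only the query crosses over.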


🌟 Highlights

  • 📌 New Task: UAV Reasoning Segmentation. We formalize it as an instruction-driven pixel-level prediction task, highlighting the limitations of existing reasoning models under aerial viewpoints.
  • 📊 New Dataset: DRSeg. The first large-scale UAV reasoning segmentation benchmark, featuring high-resolution aerial imagery and chain-of-thought (CoT) aligned reasoning annotations.
  • 🧠 New Model: PixDLM. A dual-path pixel-level MLLM that decouples reasoning and perception, enabling structured reasoning-guided segmentation with strong performance on both UAV and referring segmentation benchmarks.

🗂️ DRSeg Dataset

DRSeg is a large-scale benchmark designed specifically for UAV reasoning segmentation.

  • Statistics: 10,000 high-resolution UAV images | 10,000 instance masks | 10,000 reasoning QA pairs.
  • Reasoning Types: Spatial reasoning (33.33%), Attribute reasoning (33.34%), Scene-level reasoning (33.33%).
  • UAV-specific Properties: Multi-altitude distribution (30m, 60m, 100m) and small-object dominance (58.08% of instances occupy < 2% of the image area).
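The small-object statistic above uses relative area as its criterion. A minimal helper (our own, not part of the dataset tooling) that computes the same ratio from per-instance mask areas:

```python
def small_object_ratio(instance_areas, image_area, thresh=0.02):
    # Fraction of instances covering less than `thresh` (2%) of the image
    # area -- the criterion behind DRSeg's 58.08% small-object statistic.
    small = sum(1 for a in instance_areas if a / image_area < thresh)
    return small / len(instance_areas)

# Toy example: three instance masks on a 1000x1000 image.
areas = [5_000, 12_000, 400_000]               # pixels per instance mask
ratio = small_object_ratio(areas, 1000 * 1000)  # 2 of 3 are "small"
```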

⚙️ Installation

```shell
git clone https://github.com/XIEFOX/PixDLM.git
cd PixDLM

pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
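Since flash-attn builds against your local CUDA toolchain and can fail silently, a quick sanity check (our own snippet, not a repo script) that the key packages are importable:

```python
import importlib.util

def is_installed(pkg: str) -> bool:
    # True if the package can be imported in the current environment.
    return importlib.util.find_spec(pkg) is not None

# flash-attn is needed for the fast attention kernels.
for pkg in ("torch", "flash_attn"):
    print(f"{pkg}: {'ok' if is_installed(pkg) else 'missing'}")
```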

📁 Data & Weights Preparation

Please download the pretrained PixDLM weights and the DRSeg dataset (see the Dataset and Model links above).

Organize your project directory as follows:

```
PixDLM/
├── PixDLM/
│   └── pytorch_model.bin        <-- Place the pretrained weights here
├── data/
│   └── DRSeg/                   <-- Place the dataset here
├── model/
├── train.sh
├── eval.sh
└── ...
```

⚠️ Important: make sure the pretrained weight file `pytorch_model.bin` is placed at `PixDLM/PixDLM/pytorch_model.bin`; otherwise the model will not load correctly.
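To fail early with a clear message instead of a confusing load-time error, you can verify the expected layout before launching anything (a hypothetical helper, not part of the repo; `root` is the directory containing the inner `PixDLM/` folder):

```python
from pathlib import Path

def checkpoint_path(root: str = ".") -> Path:
    # PixDLM expects the checkpoint at <root>/PixDLM/pytorch_model.bin.
    ckpt = Path(root) / "PixDLM" / "pytorch_model.bin"
    if not ckpt.is_file():
        raise FileNotFoundError(
            f"pretrained weights not found at {ckpt}; "
            "download pytorch_model.bin and place it there"
        )
    return ckpt
```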

🏋️ Training

🔹 Pretrained Initialization

Our model is initialized from:

  • liuhaotian/llava-v1.6-vicuna-7b
  • openai/clip-vit-large-patch14
  • sam2.1_hiera_l

🔹 Training

```shell
sh train.sh
```

🔧 Post-processing

Convert DeepSpeed weights

```shell
cd path_to_ckpt_model
python zero_to_fp32.py . ../pytorch_model.bin
```

Merge LoRA weights

```shell
sh merge.sh
```

Extract model components

```shell
python extract.py
```

Update the weight path in `./model/llava/model/multimodal_encoder/multipath_encoder_wapper.py`:

```python
weights_dict = torch.load("path_to_extracted_model", map_location="cpu")
```
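When loading the extracted weights into the encoder wrapper, a common pitfall is a key mismatch between the saved dict and the target module. A small standalone helper (our own, not part of the repo) that mirrors what PyTorch's `load_state_dict(strict=False)` silently tolerates, so you can inspect the mismatches:

```python
def split_state_dict(weights_dict, model_keys):
    # Partition a loaded state dict into entries the target module can
    # accept, keys the module expects but the dict lacks, and leftovers.
    kept = {k: v for k, v in weights_dict.items() if k in model_keys}
    missing = sorted(set(model_keys) - set(kept))
    unexpected = sorted(set(weights_dict) - set(kept))
    return kept, missing, unexpected

kept, missing, unexpected = split_state_dict(
    {"vision.proj.weight": 1, "lm_head.weight": 2},
    {"vision.proj.weight", "vision.proj.bias"},
)
# missing == ['vision.proj.bias'], unexpected == ['lm_head.weight']
```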

📊 Evaluation

```shell
sh eval.sh
```

🔮 Future Work

  • Scaling UAV reasoning datasets
  • Improving long-chain reasoning consistency
  • Real-world UAV deployment

📌 Citation

If you find our work useful, please consider citing:

```bibtex
@inproceedings{ke2026pixdlm,
  title={PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation},
  author={Ke, Shuyan and Mei, Yifan and Wu, Changli and Zheng, Yonghan and Ji, Jiayi and Cao, Liujuan and Ji, Rongrong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```

⭐ Acknowledgements

This project is built upon the following open-source works:

  • LLaVA
  • CLIP
  • Segment Anything (SAM)

We sincerely thank the authors for their contributions.


⭐ If you find this project useful, please consider giving it a star!
