SAMChat: Introducing Chain-of-Thought Reasoning and GRPO to a Multimodal Small Language Model for Small-Scale Remote Sensing
We introduce SAMChat, a lightweight multimodal language model adapted to analyze remote sensing imagery of secluded areas, including challenging missile launch sites. We compiled a new dataset, SAMData, from hundreds of expert-verified aerial images, with detailed captions that highlight subtle military installations. We performed supervised fine-tuning on a 2B-parameter open-source MLLM with chain-of-thought reasoning annotations, enabling more accurate and interpretable explanations. In addition, we leveraged group relative policy optimization (GRPO) to sharpen the model's ability to detect critical domain-specific cues, such as defensive layouts and key military structures, while minimizing false positives on civilian scenes. Empirical evaluations show that SAMChat significantly outperforms both larger general-purpose multimodal models and existing remote sensing-adapted approaches on open-ended captioning and classification metrics, achieving over 80% recall and 98% precision on the newly proposed SAMData benchmark and underscoring the value of targeted fine-tuning and reinforcement learning in specialized real-world applications.

| dataset | purpose | link |
|---|---|---|
| SAMData-300-Train | Training | aybora/SAMData-300-Train |
| SAMData-300-Test | Testing | aybora/SAMData-300-Test |

| model | type | link |
|---|---|---|
| SAMChat-Base | instant | aybora/Qwen2-VL-SAMChat-Base |
| SAMChat-Distill | reasoning (sft only) | aybora/Qwen2-VL-SAMChat-Distill |
| SAMChat-Zero | reasoning (rl only) | aybora/Qwen2-VL-SAMChat-Zero |
| SAMChat-R1 | reasoning (sft+rl) | aybora/Qwen2-VL-SAMChat-R1 |
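
All SAMChat checkpoints are Qwen2-VL-based, so they should load with the stock Hugging Face stack. The sketch below is a minimal, untested quick-start assuming the standard Qwen2-VL API; the image path and prompt are placeholders:

```python
# Minimal inference sketch for a SAMChat checkpoint (assumes the standard
# Qwen2-VL loading path; the image path and prompt below are placeholders).
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "aybora/Qwen2-VL-SAMChat-Base", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("aybora/Qwen2-VL-SAMChat-Base")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/aerial_image.png"},
        {"type": "text", "text": "Describe any military installations in this image."},
    ],
}]

# Standard Qwen2-VL preprocessing: chat template + packed vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```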
For best reproducibility, we suggest creating three separate environments: one each for fine-tuning, GRPO training, and evaluation.
For SFT:

```bash
git clone https://github.com/aybora/SAMChat
conda env create -f environment.yaml
conda activate sft
pip install qwen-vl-utils
pip install flash-attn --no-build-isolation
```

For GRPO:
```bash
git clone https://github.com/aybora/SAMChat
conda create -n grpo python=3.10 -y
conda activate grpo
cd ~/SAMChat/grpo
pip3 install -e ".[dev]"
pip3 install wandb==0.18.3
pip3 install flash-attn==2.7.4.post1 --no-build-isolation
```

For Evaluation:
```bash
conda create -n eval python=3.10 -y
conda activate eval
cd ~/SAMChat/eval
pip install -r requirements.txt
pip3 install flash-attn==2.7.4.post1 --no-build-isolation
```

First, download the training set folder from our Hugging Face repo.
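If you prefer scripting the download, here is a minimal sketch using `huggingface_hub` (the repo ID comes from the dataset table above; the local path is a placeholder):

```python
# Sketch: fetch the SAMData-300 training folder from the Hugging Face Hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="aybora/SAMData-300-Train",  # dataset repo from the table above
    repo_type="dataset",
    local_dir="./SAMData-300-Train",     # placeholder; point --image_folder here
)
```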
To reproduce the Base and Distill models, you can follow the sample script below, which runs on one node with 4x H100 or A100 GPUs. Use sam_300_inst.json for the Base model and sam_300_reasoning_inst.json for the Distill model. With the defaults below, GRAD_ACCUM_STEPS evaluates to 128 / (16 × 4) = 2.
```bash
conda activate sft
cd ./SAMChat/sft/
MODEL_NAME="Qwen/Qwen2-VL-2B-Instruct"
GLOBAL_BATCH_SIZE=128
BATCH_PER_DEVICE=16
NUM_DEVICES=4
GRAD_ACCUM_STEPS=$((GLOBAL_BATCH_SIZE / (BATCH_PER_DEVICE * NUM_DEVICES)))
export PYTHONPATH=src:$PYTHONPATH
deepspeed --master_port 29400 src/training/train.py \
--deepspeed scripts/zero3_offload.json \
--model_id $MODEL_NAME \
--data_path "Your path to sam_300_inst.json or sam_300_reasoning_inst.json file" \
--image_folder "Your path to SAMData-300-Train folder" \
--remove_unused_columns False \
--freeze_vision_tower False \
--freeze_llm False \
--tune_merger True \
--bf16 True \
--fp16 False \
--disable_flash_attn2 False \
--output_dir output/qwen_sam_300 \
--num_train_epochs 1 \
--per_device_train_batch_size $BATCH_PER_DEVICE \
--gradient_accumulation_steps $GRAD_ACCUM_STEPS \
--image_min_pixels $((512 * 28 * 28)) \
--image_max_pixels $((1280 * 28 * 28)) \
--learning_rate 1e-5 \
--merger_lr 1e-5 \
--vision_lr 2e-6 \
--weight_decay 0.1 \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--gradient_checkpointing True \
--report_to tensorboard \
--lazy_preprocess True \
--save_strategy "steps" \
--save_steps 200 \
--save_total_limit 5 \
--dataloader_num_workers 4
```

The script below runs on at least one node with 4x H100 or A100 GPUs (65-80 GB). Use "Qwen/Qwen2-VL-2B-Instruct" to reproduce the Zero model and "aybora/Qwen2-VL-SAMChat-Distill" for the R1 model.
```bash
export WANDB_RUN_NAME=Qwen-VL-2B-GRPO-$(date +%Y-%m-%d-%H-%M-%S)
torchrun \
--nproc_per_node="$GPUS_PER_NODE" \
--nnodes="$SLURM_NNODES" \
--node_rank="$SLURM_NODEID" \
--rdzv_backend=c10d \
--rdzv_endpoint ${MASTER_ADDR}:${MASTER_PORT} \
--rdzv_id $SLURM_JOB_ID \
src/open_r1/grpo.py \
--deepspeed local_scripts/zero3.json \
--output_dir checkpoints/${WANDB_RUN_NAME} \
--model_name_or_path aybora/Qwen2-VL-SAMChat-Distill \
--dataset_name aybora/VHM_dataset_grpo \
--max_prompt_length 8192 \
--max_completion_length 8192 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 1 \
--logging_steps 1 \
--bf16 true \
--beta 0.001 \
--report_to wandb \
--gradient_checkpointing true \
--attn_implementation flash_attention_2 \
--max_pixels 2359296 \
--save_total_limit 6 \
--num_train_epochs 64 \
--num_generations 4 \
--save_steps 100 \
--run_name $WANDB_RUN_NAME
```

You may need to adjust some of the parameters (MASTER_ADDR, GPUS_PER_NODE, etc.) depending on your multi-GPU, multi-node setup.
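GRPO optimizes against verifiable rewards rather than a learned reward model. Purely as an illustration, the sketch below shows open-r1-style format and accuracy rewards over string completions; the actual reward functions used for SAMChat live in src/open_r1/grpo.py, and every name, tag, and signature here is an assumption, not the repo's API:

```python
# Illustrative GRPO reward sketch (NOT the repo's implementation):
# open-r1-style rewards over plain-string completions.
import re

def format_reward(completions, **kwargs):
    """1.0 if reasoning sits in <think> tags followed by an <answer> tag, else 0.0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return [1.0 if re.fullmatch(pattern, c.strip(), re.DOTALL) else 0.0
            for c in completions]

def accuracy_reward(completions, solution, **kwargs):
    """1.0 when the text inside <answer> matches the ground-truth string."""
    rewards = []
    for c, gt in zip(completions, solution):
        m = re.search(r"<answer>(.*?)</answer>", c, re.DOTALL)
        pred = m.group(1).strip().lower() if m else ""
        rewards.append(1.0 if pred == gt.strip().lower() else 0.0)
    return rewards
```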
First, download the eval folder of the datasets from our Hugging Face repo.
To evaluate our models, or your own reproduction, you can use the script below:
```bash
DATA_ROOT="Your path to SAMData-300-Test folder"
OUTPUT_DIR="Your path to eval log files"
MODEL_PATH=aybora/Qwen2-VL-SAMChat-R1 #or your own local model
python samchat_infer_eval.py \
--model $MODEL_PATH \
--test_folder $DATA_ROOT \
--output_dir $OUTPUT_DIR \
--num_gpus 1 # supports multi-GPU
```
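The abstract reports recall and precision on SAMData; if you want to recompute them from your own evaluation logs, a minimal sketch follows. The JSONL layout and field names ("pred", "label") are assumptions about the log format, not the actual output schema of samchat_infer_eval.py:

```python
# Sketch: precision/recall over binary military-vs-civilian predictions.
# Assumes one JSON object per line, e.g. {"pred": 1, "label": 0}; adapt the
# field names to whatever samchat_infer_eval.py actually writes.
import json

def precision_recall(log_path):
    tp = fp = fn = 0
    with open(log_path) as f:
        for line in f:
            rec = json.loads(line)
            p, y = rec["pred"], rec["label"]
            tp += int(p == 1 and y == 1)
            fp += int(p == 1 and y == 0)
            fn += int(p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

print(precision_recall("eval_log.jsonl"))  # placeholder path
```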
Our work is derived from Qwen2-VL for the base model, Qwen-VL-Series-Finetune for the forked main SFT code, and open-r1-multimodal for the forked main GRPO code. We appreciate all of these great works.
If you find this code useful for your research, please consider citing our works:
```bibtex
@ARTICLE{koksal2025samchat,
author={Köksal, Aybora and Alatan, A. Aydın},
journal={IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing},
title={SAMChat: Introducing Chain-of-Thought Reasoning and GRPO to a Multimodal Small Language Model for Small-Scale Remote Sensing},
year={2026},
volume={19},
number={},
pages={795-804},
keywords={Cognition;Adaptation models;Computational modeling;Remote sensing;Visualization;Missiles;Satellite images;Mathematical models;Large language models;Analytical models;Aerial image analysis;chain-of-thought (CoT) reasoning;domain adaptation;group relative policy optimization (GRPO);multimodal large language models (MLLMs);remote sensing (RS)},
doi={10.1109/JSTARS.2025.3637115}}
```

```bibtex
@article{koksal2025tinyrs,
title={TinyRS-R1: Compact Vision Language Model for Remote Sensing},
author={K{\"o}ksal, Aybora and Alatan, A Ayd{\i}n},
journal={IEEE Geoscience and Remote Sensing Letters},
year={2025},
publisher={IEEE}
}
```

If you are interested in this work, you may find the following work also useful:

```bibtex
@inproceedings{koksal2025few,
title={Few-Shot Vision-Language Reasoning for Satellite Imagery via Verifiable Rewards},
author={K{\"o}ksal, Aybora and Alatan, A Ayd{\i}n},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={6901--6910},
year={2025}
}
```