[Installation][Datasets][Training][Evaluation]
Visual grounding is the task of localising image regions from natural language queries and is critical for reasoning-capable Graphical User Interface (GUI) agents. Many existing methods rely on massive, noisy synthetic datasets. This repository provides code for building an efficient training pipeline that combines model-based data filtering with parameter-efficient fine-tuning. Vision-Language Models can be trained under three regimes: supervised fine-tuning, chain-of-thought-augmented fine-tuning, and reinforcement learning via Group Relative Policy Optimisation.
Install Poetry
curl -sSL https://install.python-poetry.org | python3 -
Install Dependencies
# Use Python 3.11.13 in the project
pyenv local 3.11.13
# Tell Poetry to use pyenv
poetry env use $(pyenv which python)
# Install dependencies
poetry install
Or using Conda environments
# Use Python 3.11.13 in the project
conda create -n venv python=3.11.13
conda activate venv
# Install dependencies
poetry install
Clone the OSWorld-G repo to a directory of your choice.
We provide two options:
- Option 1: Download the existing ranking/filtered datasets (recommended)
- Option 2: Perform the filtering yourself. This is not recommended, as you will need to download a large number of instances from Aria-UI/Show-UI and may run into storage issues. You will also need an OpenAI API key for the final filtering steps (see below for guidelines if you decide to follow this option)
from datasets import load_dataset
gui_grounding_dataset = load_dataset("alan-turing-institute/gui_grounding")
ranking_dataset = load_dataset("alan-turing-institute/gui_ranking")
Below are the step-by-step instructions for replicating the filtering process:
mkdir -p storage/downsampled/
python scripts/downsample_examples.py \
--input-json /path/to/aria/local/snapshot/directory/mobile/aria_ui_mobile_AMEX_1077k.json \
--image-root /path/to/amex/local/snapshot/directory/AMEX/screenshot/ \
--output-json storage/downsampled/aria_ui_mobile_AMEX_1077k_downsampled.json
mkdir -p storage/downsampled/
python scripts/downsample_examples.py \
--input-json /path/to/aria/local/snapshot/directory/web/aria_ui_web_2889k.json \
--image-root /path/to/web/local/snapshot/directory/web/images/ \
--output-json storage/downsampled/aria_ui_web_2889k_downsampled.json
mkdir -p storage/zero_shot_predictions
accelerate launch --num_processes=2 src/gui_agent/evaluation/generate_predictions_qwen2_5_aria.py \
--end-index 100 --dataset-subset desktop \
--annotation-json-path /path/to/web/local/snapshot/directory/desktop/aria_ui_desktop_fix_batch_2_with_instructions.json \
--images-folder-path /path/to/web/local/snapshot/directory/desktop/screenshots_fix_batch2 \
--output-json storage/zero_shot_predictions/qwen2.5vl_3b_aria_desktop_results.json
mkdir -p storage/zero_shot_predictions
accelerate launch --num_processes=2 src/gui_agent/evaluation/generate_predictions_qwen2_5_aria.py \
--end-index 100 --dataset-subset mobile \
--annotation-json-path storage/downsampled/aria_ui_mobile_AMEX_1077k_downsampled.json \
--images-folder-path /path/to/amex/local/snapshot/directory/AMEX/screenshot/ \
--output-json storage/zero_shot_predictions/qwen2.5vl_3b_aria_ui_mobile_AMEX_1077k_downsampled.json
mkdir -p storage/zero_shot_predictions
accelerate launch --num_processes=2 src/gui_agent/evaluation/generate_predictions_qwen2_5_aria.py \
--end-index 100 --dataset-subset web \
--annotation-json-path storage/downsampled/aria_ui_web_2889k_downsampled.json \
--images-folder-path /path/to/web/local/snapshot/directory/web/images/ \
--output-json storage/zero_shot_predictions/qwen2.5vl_3b_aria_ui_web_2889k_downsampled.json
mkdir -p storage/zero_shot_predictions
python src/gui_agent/evaluation/generate_predictions_qwen2_5showui.py \
--output-json storage/zero_shot_predictions/qwen2.5vl_3b_showui_desktop_results.json
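For context, here is a minimal, hedged sketch of what these zero-shot grounding scripts roughly do for a single example: given a screenshot and an instruction, prompt Qwen2.5-VL-3B for the target element's location. The prompt wording, image path, and generation settings are illustrative assumptions, not the scripts' exact behaviour.

```python
# Illustrative single-example sketch of zero-shot GUI grounding with Qwen2.5-VL-3B.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("screenshot.png")        # placeholder path
instruction = "Open the settings menu"      # placeholder instruction
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": f"Output the click coordinates for the element described by: '{instruction}'"},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=64)
prediction = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(prediction)
```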
In this step, we create a ranking dataset from the examples where the model correctly predicted the target UI element. These examples are grouped per image to create positive/negative examples for training the ranker model (a minimal sketch of this grouping is shown after the command below).
mkdir storage/ranking -p
python scripts/prepare_ranking_dataset.py \
--aria-ui-desktop-annotations /path/to/web/local/snapshot/directory/desktop/aria_ui_desktop_fix_batch_2_with_instructions.json \
--aria-ui-desktop-prediction storage/zero_shot_predictions/qwen2.5vl_3b_aria_desktop_results.json \
--aria-ui-desktop-image-path /path/to/web/local/snapshot/directory/desktop/screenshots_fix_batch2 \
--aria-ui-mobile-annotations storage/downsampled/aria_ui_mobile_AMEX_1077k_downsampled.json \
--aria-ui-mobile-prediction storage/zero_shot_predictions/qwen2.5vl_3b_aria_ui_mobile_AMEX_1077k_downsampled.json \
--aria-ui-mobile-image-path /path/to/amex/local/snapshot/directory/AMEX/screenshot/ \
--aria-ui-web-annotations storage/downsampled/aria_ui_web_2889k_downsampled.json \
--aria-ui-web-prediction storage/zero_shot_predictions/qwen2.5vl_3b_aria_ui_web_2889k_downsampled.json \
--aria-ui-web-image-path /path/to/web/local/snapshot/directory/web/images/ \
--show-ui-prediction storage/zero_shot_predictions/qwen2.5vl_3b_showui_desktop_results.json \
--output-json storage/ranking/ranking_dataset.json
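For illustration, the sketch below shows one way the grouping could work: collect predictions per image and pair correct predictions (positives) with incorrect ones (negatives). The field names ("image", "is_correct", etc.) and the pairing scheme are assumptions, not the exact logic of prepare_ranking_dataset.py.

```python
# Illustrative sketch: build positive/negative ranking pairs from zero-shot predictions.
import json
from collections import defaultdict

with open("storage/zero_shot_predictions/qwen2.5vl_3b_aria_desktop_results.json") as f:
    predictions = json.load(f)  # assumed to be a list of per-example dicts

per_image = defaultdict(list)
for example in predictions:
    per_image[example["image"]].append(example)  # "image" field name is an assumption

ranking_examples = []
for image, examples in per_image.items():
    positives = [e for e in examples if e.get("is_correct")]
    negatives = [e for e in examples if not e.get("is_correct")]
    for pos in positives:
        for neg in negatives:
            ranking_examples.append({"image": image, "chosen": pos, "rejected": neg})
```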
This step requires a ranking model; see below for details on how to train it.
mkdir -p storage/ranking/
accelerate launch --num_processes=2 src/gui_agent/evaluation/generate_ranking_predictions_qwen2_5_aria_accelerate.py \
--model-path /path/to/ranker/model \
--annotation-json /path/to/web/local/snapshot/directory/desktop/aria_ui_desktop_fix_batch_2_with_instructions.json \
--image-folder-path /path/to/web/local/snapshot/directory/desktop/screenshots_fix_batch2 \
--dataset-subset desktop \
--prediction-json storage/zero_shot_predictions/qwen2.5vl_3b_aria_desktop_results.json \
--output-json storage/ranking/qwen2.5vl_3b_aria_desktop.json
mkdir -p storage/ranking/
accelerate launch --num_processes=2 src/gui_agent/evaluation/generate_ranking_predictions_qwen2_5_aria_accelerate.py \
--model-path /path/to/ranker/model \
--annotation-json storage/downsampled/aria_ui_mobile_AMEX_1077k_downsampled.json \
--image-folder-path /path/to/amex/local/snapshot/directory/AMEX/screenshot/ \
--dataset-subset mobile \
--prediction-json storage/zero_shot_predictions/qwen2.5vl_3b_aria_ui_mobile_AMEX_1077k_downsampled.json \
--output-json storage/ranking/qwen2.5vl_3b_aria_ui_mobile_AMEX_1077k_downsampled.json
mkdir -p storage/ranking/
accelerate launch --num_processes=2 src/gui_agent/evaluation/generate_ranking_predictions_qwen2_5_aria_accelerate.py \
--model-path /path/to/ranker/model \
--annotation-json storage/downsampled/aria_ui_web_2889k_downsampled.json \
--image-folder-path /path/to/web/local/snapshot/directory/web/images/ \
--dataset-subset web \
--prediction-json storage/zero_shot_predictions/qwen2.5vl_3b_aria_ui_web_2889k_downsampled.json \
--output-json storage/ranking/qwen2.5vl_3b_aria_ui_web_2889k_downsampled.json
mkdir -p storage/ranking/
python src/gui_agent/evaluation/generate_ranking_predictions_qwen2_5_showui.py \
--prediction-json storage/zero_shot_predictions/qwen2.5vl_3b_showui_desktop_results.json \
--output-json storage/ranking/qwen2.5vl_3b_showui_desktop_results.json
Finally, group the ranking predictions into a single JSON file and, optionally, push the results to a Hugging Face repo:
python scripts/prepare_full_dataset.py \
--ranking-predictions-path storage/ranking/ \
--output-json storage/gui_grounding_ranked.json \
--push-to-hub \
--hf-repo-id /my/repo/id/ranked
First, extract the last hidden states from the ranked examples, using the repo ID from the previous step:
python scripts/get_hidden_states_from_ranked_examples.py \
--model-path /path/to/ranker/model \
--input-hf-repo-id /my/repo/id/ranked \
--output-directory storage/hidden_states \
--push-to-hub \
--hf-repo-id /my/hf/repo/id/hidden/states/
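For intuition, here is a hedged sketch of what "extracting the last hidden state" can look like for a single example with a Qwen2.5-VL-based ranker; the pooling choice (final layer, last token) is an assumption about the script's behaviour.

```python
# Illustrative sketch: embed one (screenshot, instruction) pair with the ranker's last hidden state.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_path = "/path/to/ranker/model"
processor = AutoProcessor.from_pretrained(model_path)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)

def embed(image, text):
    messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": text}]}]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
    inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # Final layer, last token: one embedding vector per example.
    return outputs.hidden_states[-1][:, -1, :].float().cpu()
```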
Run clustering on the hidden states for every benchmark (Aria-UI desktop/web/mobile and ShowUI desktop).
Important note: for efficient clustering we use the faiss library. Unfortunately, faiss is built against conda and forces a numpy version < 2.0, which breaks the rest of the codebase. You should either 1) install faiss in a separate venv, or 2) install it within the current venv; the latter will break the package dependencies, but you can restore them afterwards by re-running poetry install.
mkdir storage/clustering/ -p
python scripts/run_clustering.py \
--dataset-name /my/repo/id/ranked \
--source aria-ui-mobile \
--embeddings-path storage/hidden_states \
--output-path storage/clustering/aria_ui_mobile_indices.json
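As a rough illustration of the clustering step, the sketch below runs faiss k-means over precomputed embeddings and keeps the example closest to each centroid. The number of clusters, the on-disk embedding layout, and the output JSON schema are assumptions rather than the script's exact settings.

```python
# Illustrative sketch: k-means clustering of hidden-state embeddings with faiss.
import json
import numpy as np
import faiss

# Assumed layout: one .npy embedding matrix per source under storage/hidden_states.
embeddings = np.load("storage/hidden_states/aria_ui_mobile.npy").astype("float32")

kmeans = faiss.Kmeans(embeddings.shape[1], 100, niter=20, seed=12345, verbose=True)  # 100 clusters is an assumption
kmeans.train(embeddings)

# For each centroid, keep the index of the closest example as the cluster representative.
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
_, representative_indices = index.search(kmeans.centroids, 1)

with open("storage/clustering/aria_ui_mobile_indices_sketch.json", "w") as f:
    json.dump(representative_indices.flatten().tolist(), f)  # output schema is an assumption
```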
Filter the clusters using GPT-4o (you will need your OpenAI API key) for every benchmark (Aria-UI desktop/web/mobile and ShowUI desktop).
mkdir storage/gpt4o-filtering/ -p
python scripts/gpt4_filtering.py \
--dataset-name /my/repo/id/ranked \
--openai-api-key <YOUR_OPEN_AI_API_KEY> \
--indices-path storage/clustering/aria_ui_mobile_indices.json \
--output-json storage/gpt4o-filtering/aria_ui_mobile.json
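To make the filtering step concrete, here is a hedged sketch of the kind of GPT-4o call involved for a single example; the prompt wording and the yes/no decision format are assumptions, not the exact behaviour of gpt4_filtering.py.

```python
# Illustrative sketch: ask GPT-4o whether an instruction unambiguously grounds to one UI element.
import base64
from openai import OpenAI

client = OpenAI(api_key="<YOUR_OPEN_AI_API_KEY>")

def keep_example(image_path: str, instruction: str) -> bool:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Does the instruction '{instruction}' unambiguously describe a single UI element "
                         "in this screenshot? Answer yes or no."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```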
Collate all filtered results:
python scripts/collate_filtering_results.py \
--input-clustering-json storage/clustering/aria_ui_mobile_indices.json storage/clustering/aria_ui_web_indices.json storage/clustering/aria_ui_desktop_indices.json storage/clustering/show_ui_desktop_indices.json \
--input-filtering-json storage/gpt4o-filtering/aria_ui_mobile.json storage/gpt4o-filtering/aria_ui_web.json storage/gpt4o-filtering/aria_ui_desktop.json storage/gpt4o-filtering/show_ui_desktop.json \
--output-json storage/collated/gui_grounding_final.json \
--push-to-hub
This step assumes that you have already generated the zero-shot predictions (see the filtering steps above).
Example usage:
accelerate launch --num_processes=2 train_ranker_from_correct_predictions.py \
--model_name Qwen/Qwen2.5-VL-3B-Instruct \
--output_dir storage/models/qwen_2_5_3b_vl_ranker \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--min_pixels 3136 \
--max_pixels 4028160 \
--num_train_epochs 2 \
--logging_steps 1 \
--save_strategy steps \
--save_steps 0.1 \
--learning_rate 1e-5 \
--lr_scheduler_type cosine \
--gradient_checkpointing True \
--deepspeed configs/trainer/zero3.json \
--save_total_limit 2 \
--log_level info \
--save_safetensors True \
--seed 12345 \
--data_seed 12345 \
--dataloader_num_workers 4 \
--logging_nan_inf_filter False \
--run_name qwen_2_5_3b_vl_ranker \
--project_name project_mllm_ranker
Example usage:
python src/gui_agent/training/train_sft.py \
--dataloader_num_workers 4 \
--output_dir storage/models/qwen_2_5_3b_vl_full_data_all_linears_r256_lora_alpha512.json_lr1e-6_bsz32 \
--num_train_epochs 2 \
--learning_rate 1e-6 \
--run_name qwen_2_5_3b_vl_full_data_all_linears_r256_lora_alpha512.json_lr1e-6_bsz32 \
--lora_config configs/lora/all_linears_r256_lora_alpha512.json \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 16 \
--save_total_limit 1
See configs/lora for more LoRA configurations; if you want to do full fine-tuning, simply remove the --lora_config argument.
DeepSpeed configurations are available under configs/deepspeed.
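For reference, an "all linears, r=256, alpha=512" configuration roughly corresponds to the following peft settings; the exact keys in the JSON configs and the dropout value are assumptions.

```python
# Roughly equivalent peft configuration for an "all linears, r=256, alpha=512" LoRA setup.
from peft import LoraConfig

lora_config = LoraConfig(
    r=256,
    lora_alpha=512,
    lora_dropout=0.05,            # dropout value is an assumption
    target_modules="all-linear",  # apply LoRA adapters to every linear layer
    task_type="CAUSAL_LM",
)
```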
Evaluate the ranker on OSWorld-G
python src/gui_agent/evaluation/evaluate_qwen2_5_ranker_osworldg.py \
--osworldg-path /path/to/osworld/repo \
--model-path /path/to/ranker/model \
--output-json storage/results/qwen_2_5_3b_vl_ranker_osworld.json
Evaluate the ranker on ScreenSpot
python src/gui_agent/evaluation/evaluate_qwen2_5_ranker_screenspot.py \
--dataset-name rootsautomation/ScreenSpot \
--model-path /path/to/ranker/model \
--output-json storage/results/qwen_2_5_3b_vl_ranker_screenspot.json
Evaluate the ranker on ScreenSpot-v2
python src/gui_agent/evaluation/evaluate_qwen2_5_ranker_screenspot.py \
--dataset-name HongxinLi/ScreenSpot_v2 \
--model-path /path/to/ranker/model \
--output-json storage/results/qwen_2_5_3b_vl_ranker_screenspotv2.json
Example usage:
python src/gui_agent/evaluation/evaluate_qwen2_5_screenspot_trained.py \
--output-json storage/results/grounding/qwen_2_5_3b_vl_full_data_lr1e-6_bsz32_epochs2_cot.json \
--model-name /path/to/trained/model/
If you used LoRA fine-tuning, --model-name must be the base model name from HF, and --checkpoint-peft-path must point to the adapters produced during training.
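For clarity, here is a hedged sketch of how such a LoRA checkpoint can be loaded for evaluation: the base model from HF plus the PEFT adapters saved during training (the adapter path is a placeholder).

```python
# Illustrative sketch: load the base model and attach the trained LoRA adapters.
import torch
from peft import PeftModel
from transformers import Qwen2_5_VLForConditionalGeneration

base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "/path/to/checkpoint-peft")  # placeholder adapter path
model = model.merge_and_unload()  # optional: merge adapters for faster inference
```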
We evaluate our agent using GPT-4o planning instructions. For a fair comparison, we abstract away the impact of the planning instructions by using the GPT-4o-based planners from UGround.
For evaluation on each agentic benchmark, see the README under src/gui_agent/evaluation/agent_evaluation.
If you use this repository, please cite:
@misc{pantazopoulos2025efficienttrainingpipelinereasoning,
title={An Efficient Training Pipeline for Reasoning Graphical User Interface Agents},
author={Georgios Pantazopoulos and Eda B. Özyiğit},
year={2025},
eprint={2511.08172},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2511.08172},
}