[Installation][Datasets][Training][Evaluation]
Visual grounding is the task of localising image regions from natural language queries and is critical for reasoning-capable Graphical User Interface (GUI) agents. Many existing methods rely on massive, noisy synthetic datasets. This repository provides code for building an efficient training pipeline that combines model-based data filtering with parameter-efficient fine-tuning. Vision-Language Models can be trained under three regimes: supervised fine-tuning, chain-of-thought-augmented fine-tuning, and reinforcement learning via Group Relative Policy Optimisation.
Install Poetry
curl -sSL https://install.python-poetry.org | python3 -
Install Dependencies
# Use Python 3.11.13 in the project
pyenv local 3.11.13
# Tell Poetry to use pyenv
poetry env use $(pyenv which python)
# Install dependencies
poetry install
Or using Conda environments
# Use Python 3.11.13 in the project
conda create -n venv python=3.11.13
conda activate venv
# Install dependencies
poetry install
Clone the OSWorld-G repo to a directory of your choice.
We provide two options:
- Option 1: Download the existing ranking/filtered datasets (recommended)
- Option 2: Perform the filtering yourself. This is not recommended, as you will need to download a large number of instances from Aria-UI/Show-UI and may run into storage issues. You will also need an OpenAI API key for the final filtering steps (see below for guidelines if you decide to follow this option)
from datasets import load_dataset
gui_grounding_dataset = load_dataset("alan-turing-institute/gui_grounding")
ranking_dataset = load_dataset("alan-turing-institute/gui_ranking")
Below are the step-by-step instructions for replicating the filtering process:
mkdir -p storage/downsampled/
python scripts/downsample_examples.py \
--input-json /path/to/aria/local/snapshot/directory/mobile/aria_ui_mobile_AMEX_1077k.json \
--image-root /path/to/amex/local/snapshot/directory/AMEX/screenshot/ \
--output-json storage/downsampled/aria_ui_mobile_AMEX_1077k_downsampled.json
mkdir -p storage/downsampled/
python scripts/downsample_examples.py \
--input-json /path/to/aria/local/snapshot/directory/web/aria_ui_web_2889k.json \
--image-root /path/to/web/local/snapshot/directory/web/images/ \
--output-json storage/downsampled/aria_ui_web_2889k_downsampled.json
mkdir -p storage/zero_shot_predictions
accelerate launch --num_processes=2 src/gui_agent/evaluation/generate_predictions_qwen2_5_aria.py \
--end-index 100 --dataset-subset desktop \
--annotation-json-path /path/to/web/local/snapshot/directory/desktop/aria_ui_desktop_fix_batch_2_with_instructions.json \
--images-folder-path /path/to/web/local/snapshot/directory/desktop/screenshots_fix_batch2 \
--output-json storage/zero_shot_predictions/qwen2.5vl_3b_aria_desktop_results.json
mkdir -p storage/zero_shot_predictions
accelerate launch --num_processes=2 src/gui_agent/evaluation/generate_predictions_qwen2_5_aria.py \
--end-index 100 --dataset-subset mobile \
--annotation-json-path storage/downsampled/aria_ui_mobile_AMEX_1077k_downsampled.json \
--images-folder-path /path/to/amex/local/snapshot/directory/AMEX/screenshot/ \
--output-json storage/zero_shot_predictions/qwen2.5vl_3b_aria_ui_mobile_AMEX_1077k_downsampled.json
mkdir -p storage/zero_shot_predictions
accelerate launch --num_processes=2 src/gui_agent/evaluation/generate_predictions_qwen2_5_aria.py \
--end-index 100 --dataset-subset web \
--annotation-json-path storage/downsampled/aria_ui_web_2889k_downsampled.json \
--images-folder-path /path/to/web/local/snapshot/directory/web/images/ \
--output-json storage/zero_shot_predictions/qwen2.5vl_3b_aria_ui_web_2889k_downsampled.json
mkdir -p storage/zero_shot_predictions
python src/gui_agent/evaluation/generate_predictions_qwen2_5showui.py \
--output-json storage/zero_shot_predictions/qwen2.5vl_3b_showui_desktop_results.json
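For context, here is a minimal, hedged sketch of what these zero-shot grounding scripts roughly do for a single example: given a screenshot and an instruction, prompt Qwen2.5-VL-3B for the target element's location. The prompt wording, image path, and generation settings are illustrative assumptions, not the scripts' exact behaviour.

```python
# Illustrative single-example sketch of zero-shot GUI grounding with Qwen2.5-VL-3B.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("screenshot.png")        # placeholder path
instruction = "Open the settings menu"      # placeholder instruction
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": f"Output the click coordinates for the element described by: '{instruction}'"},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=64)
prediction = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(prediction)
```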
In this step, we create a ranking dataset from the examples where the model correctly predicted the target UI element. These examples are grouped per image to create positive/negative examples for training the ranker model (a minimal sketch of this grouping is shown after the command below).
mkdir storage/ranking -p
python scripts/prepare_ranking_dataset.py \
--aria-ui-desktop-annotations /path/to/web/local/snapshot/directory/desktop/aria_ui_desktop_fix_batch_2_with_instructions.json \
--aria-ui-desktop-prediction storage/zero_shot_predictions/qwen2.5vl_3b_aria_desktop_results.json \
--aria-ui-desktop-image-path /path/to/web/local/snapshot/directory/desktop/screenshots_fix_batch2 \
--aria-ui-mobile-annotations storage/downsampled/aria_ui_mobile_AMEX_1077k_downsampled.json \
--aria-ui-mobile-prediction storage/zero_shot_predictions/qwen2.5vl_3b_aria_ui_mobile_AMEX_1077k_downsampled.json \
--aria-ui-mobile-image-path /path/to/amex/local/snapshot/directory/AMEX/screenshot/ \
--aria-ui-web-annotations storage/downsampled/aria_ui_web_2889k_downsampled.json \
--aria-ui-web-prediction storage/zero_shot_predictions/qwen2.5vl_3b_aria_ui_web_2889k_downsampled.json \
--aria-ui-web-image-path /path/to/web/local/snapshot/directory/web/images/ \
--show-ui-prediction storage/zero_shot_predictions/qwen2.5vl_3b_showui_desktop_results.json \
--output-json storage/ranking/ranking_dataset.json
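For illustration, the sketch below shows one way the grouping could work: collect predictions per image and pair correct predictions (positives) with incorrect ones (negatives). The field names ("image", "is_correct", etc.) and the pairing scheme are assumptions, not the exact logic of prepare_ranking_dataset.py.

```python
# Illustrative sketch: build positive/negative ranking pairs from zero-shot predictions.
import json
from collections import defaultdict

with open("storage/zero_shot_predictions/qwen2.5vl_3b_aria_desktop_results.json") as f:
    predictions = json.load(f)  # assumed to be a list of per-example dicts

per_image = defaultdict(list)
for example in predictions:
    per_image[example["image"]].append(example)  # "image" field name is an assumption

ranking_examples = []
for image, examples in per_image.items():
    positives = [e for e in examples if e.get("is_correct")]
    negatives = [e for e in examples if not e.get("is_correct")]
    for pos in positives:
        for neg in negatives:
            ranking_examples.append({"image": image, "chosen": pos, "rejected": neg})
```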
This step requires a ranking model; see below for details on how to train it.
mkdir -p storage/ranking/
accelerate launch --num_processes=2 src/gui_agent/evaluation/generate_ranking_predictions_qwen2_5_aria_accelerate.py \
--model-path /path/to/ranker/model \
--annotation-json /path/to/web/local/snapshot/directory/desktop/aria_ui_desktop_fix_batch_2_with_instructions.json \
--image-folder-path /path/to/web/local/snapshot/directory/desktop/screenshots_fix_batch2 \
--dataset-subset desktop \
--prediction-json storage/zero_shot_predictions/qwen2.5vl_3b_aria_desktop_results.json \
--output-json storage/ranking/qwen2.5vl_3b_aria_desktop.json
mkdir -p storage/ranking/
accelerate launch --num_processes=2 src/gui_agent/evaluation/generate_ranking_predictions_qwen2_5_aria_accelerate.py \
--model-path /path/to/ranker/model \
--annotation-json storage/downsampled/aria_ui_mobile_AMEX_1077k_downsampled.json \
--image-folder-path /path/to/amex/local/snapshot/directory/AMEX/screenshot/ \
--dataset-subset mobile \
--prediction-json storage/zero_shot_predictions/qwen2.5vl_3b_aria_ui_mobile_AMEX_1077k_downsampled.json \
--output-json storage/ranking/qwen2.5vl_3b_aria_ui_mobile_AMEX_1077k_downsampled.json
mkdir -p storage/ranking/
accelerate launch --num_processes=2 src/gui_agent/evaluation/generate_ranking_predictions_qwen2_5_aria_accelerate.py \
--model-path /path/to/ranker/model \
--annotation-json storage/downsampled/aria_ui_web_2889k_downsampled.json \
--image-folder-path /path/to/web/local/snapshot/directory/web/images/ \
--dataset-subset web \
--prediction-json storage/zero_shot_predictions/qwen2.5vl_3b_aria_ui_web_2889k_downsampled.json \
--output-json storage/ranking/qwen2.5vl_3b_aria_ui_web_2889k_downsampled.json
mkdir -p storage/ranking/
python src/gui_agent/evaluation/generate_ranking_predictions_qwen2_5_showui.py \
--prediction-json storage/zero_shot_predictions/qwen2.5vl_3b_showui_desktop_results.json \
--output-json storage/ranking/qwen2.5vl_3b_showui_desktop_results.json
Finally, group the ranking predictions into a single JSON file and, optionally, push the results to a Hugging Face repo:
python scripts/prepare_full_dataset.py \
--ranking-predictions-path storage/ranking/ \
--output-json storage/gui_grounding_ranked.json \
--push-to-hub \
--hf-repo-id /my/repo/id/ranked
First, extract the last hidden states from the ranked examples, using the repo ID from the previous step:
python scripts/get_hidden_states_from_ranked_examples.py \
--model-path /path/to/ranker/model \
--input-hf-repo-id /my/repo/id/ranked \
--output-directory storage/hidden_states \
--push-to-hub \
--hf-repo-id /my/hf/repo/id/hidden/states/
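For intuition, here is a hedged sketch of what "extracting the last hidden state" can look like for a single example with a Qwen2.5-VL-based ranker; the pooling choice (final layer, last token) is an assumption about the script's behaviour.

```python
# Illustrative sketch: embed one (screenshot, instruction) pair with the ranker's last hidden state.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_path = "/path/to/ranker/model"
processor = AutoProcessor.from_pretrained(model_path)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)

def embed(image, text):
    messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": text}]}]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
    inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # Final layer, last token: one embedding vector per example.
    return outputs.hidden_states[-1][:, -1, :].float().cpu()
```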
Run clustering on the hidden states for every benchmark (Aria-UI desktop/web/mobile and ShowUI desktop).
Important note: for efficient clustering we use the faiss library. Unfortunately, faiss is built against conda and forces a numpy version < 2.0, which breaks the rest of the codebase. You should either 1) install faiss in a separate venv, or 2) install it within the current venv; the latter will break the package dependencies, but you can restore them afterwards by re-running poetry install.
mkdir storage/clustering/ -p
python scripts/run_clustering.py \
--dataset-name /my/repo/id/ranked \
--source aria-ui-mobile \
--embeddings-path storage/hidden_states \
--output-path storage/clustering/aria_ui_mobile_indices.json
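As a rough illustration of the clustering step, the sketch below runs faiss k-means over precomputed embeddings and keeps the example closest to each centroid. The number of clusters, the on-disk embedding layout, and the output JSON schema are assumptions rather than the script's exact settings.

```python
# Illustrative sketch: k-means clustering of hidden-state embeddings with faiss.
import json
import numpy as np
import faiss

# Assumed layout: one .npy embedding matrix per source under storage/hidden_states.
embeddings = np.load("storage/hidden_states/aria_ui_mobile.npy").astype("float32")

kmeans = faiss.Kmeans(embeddings.shape[1], 100, niter=20, seed=12345, verbose=True)  # 100 clusters is an assumption
kmeans.train(embeddings)

# For each centroid, keep the index of the closest example as the cluster representative.
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
_, representative_indices = index.search(kmeans.centroids, 1)

with open("storage/clustering/aria_ui_mobile_indices_sketch.json", "w") as f:
    json.dump(representative_indices.flatten().tolist(), f)  # output schema is an assumption
```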
Filter the clusters using GPT-4o (you will need your OpenAI API key) for every benchmark (Aria-UI desktop/web/mobile and ShowUI desktop).
mkdir storage/gpt4o-filtering/ -p
python scripts/gpt4_filtering.py \
--dataset-name /my/repo/id/ranked \
--openai-api-key <YOUR_OPEN_AI_API_KEY> \
--indices-path storage/clustering/aria_ui_mobile_indices.json \
--output-json storage/gpt4o-filtering/aria_ui_mobile.json
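To make the filtering step concrete, here is a hedged sketch of the kind of GPT-4o call involved for a single example; the prompt wording and the yes/no decision format are assumptions, not the exact behaviour of gpt4_filtering.py.

```python
# Illustrative sketch: ask GPT-4o whether an instruction unambiguously grounds to one UI element.
import base64
from openai import OpenAI

client = OpenAI(api_key="<YOUR_OPEN_AI_API_KEY>")

def keep_example(image_path: str, instruction: str) -> bool:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Does the instruction '{instruction}' unambiguously describe a single UI element "
                         "in this screenshot? Answer yes or no."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```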
Collate all filtered results:
python scripts/collate_filtering_results.py \
--input-clustering-json storage/clustering/aria_ui_mobile_indices.json storage/clustering/aria_ui_web_indices.json storage/clustering/aria_ui_desktop_indices.json storage/clustering/show_ui_desktop_indices.json \
--input-filtering-json storage/gpt4o-filtering/aria_ui_mobile.json storage/gpt4o-filtering/aria_ui_web.json storage/gpt4o-filtering/aria_ui_desktop.json storage/gpt4o-filtering/show_ui_desktop.json \
--output-json storage/collated/gui_grounding_final.json \
--push-to-hub
This step assumes that you have already generated the zero-shot predictions (see the filtering steps above).
Example usage:
accelerate launch --num_processes=2 train_ranker_from_correct_predictions.py \
--model_name Qwen/Qwen2.5-VL-3B-Instruct \
--output_dir storage/models/qwen_2_5_3b_vl_ranker \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--min_pixels 3136 \
--max_pixels 4028160 \
--num_train_epochs 2 \
--logging_steps 1 \
--save_strategy steps \
--save_steps 0.1 \
--learning_rate 1e-5 \
--lr_scheduler_type cosine \
--gradient_checkpointing True \
--deepspeed configs/trainer/zero3.json \
--save_total_limit 2 \
--log_level info \
--save_safetensors True \
--seed 12345 \
--data_seed 12345 \
--dataloader_num_workers 4 \
--logging_nan_inf_filter False \
--run_name qwen_2_5_3b_vl_ranker \
--project_name project_mllm_ranker
Example usage:
python src/gui_agent/training/train_sft.py \
--dataloader_num_workers 4 \
--output_dir storage/models/qwen_2_5_3b_vl_full_data_all_linears_r256_lora_alpha512.json_lr1e-6_bsz32 \
--num_train_epochs 2 \
--learning_rate 1e-6 \
--run_name qwen_2_5_3b_vl_full_data_all_linears_r256_lora_alpha512.json_lr1e-6_bsz32 \
--lora_config configs/lora/all_linears_r256_lora_alpha512.json \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 16 \
--save_total_limit 1
See configs/lora for more LoRA configurations; if you want to do full fine-tuning, simply remove the --lora_config argument.
DeepSpeed configurations are available under configs/deepspeed.
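For reference, an "all linears, r=256, alpha=512" configuration roughly corresponds to the following peft settings; the exact keys in the JSON configs and the dropout value are assumptions.

```python
# Roughly equivalent peft configuration for an "all linears, r=256, alpha=512" LoRA setup.
from peft import LoraConfig

lora_config = LoraConfig(
    r=256,
    lora_alpha=512,
    lora_dropout=0.05,            # dropout value is an assumption
    target_modules="all-linear",  # apply LoRA adapters to every linear layer
    task_type="CAUSAL_LM",
)
```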
Evaluate the ranker on OSWorld-G
python src/gui_agent/evaluation/evaluate_qwen2_5_ranker_osworldg.py \
--osworldg-path /path/to/osworld/repo \
--model-path /path/to/ranker/model \
--output-json storage/results/qwen_2_5_3b_vl_ranker_osworld.json
Evaluate the ranker on ScreenSpot
python src/gui_agent/evaluation/evaluate_qwen2_5_ranker_screenspot.py \
--dataset-name rootsautomation/ScreenSpot \
--model-path /path/to/ranker/model \
--output-json storage/results/qwen_2_5_3b_vl_ranker_screenspot.json
Evaluate the ranker on ScreenSpot-v2
python src/gui_agent/evaluation/evaluate_qwen2_5_ranker_screenspot.py \
--dataset-name HongxinLi/ScreenSpot_v2 \
--model-path /path/to/ranker/model \
--output-json storage/results/qwen_2_5_3b_vl_ranker_screenspotv2.json
Example usage:
python src/gui_agent/evaluation/evaluate_qwen2_5_screenspot_trained.py \
--output-json storage/results/grounding/qwen_2_5_3b_vl_full_data_lr1e-6_bsz32_epochs2_cot.json \
--model-name /path/to/trained/model/
If you used LoRA fine-tuning, --model-name must be the base model name from HF, and --checkpoint-peft-path must point to the adapters produced during training.
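For clarity, here is a hedged sketch of how such a LoRA checkpoint can be loaded for evaluation: the base model from HF plus the PEFT adapters saved during training (the adapter path is a placeholder).

```python
# Illustrative sketch: load the base model and attach the trained LoRA adapters.
import torch
from peft import PeftModel
from transformers import Qwen2_5_VLForConditionalGeneration

base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "/path/to/checkpoint-peft")  # placeholder adapter path
model = model.merge_and_unload()  # optional: merge adapters for faster inference
```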
We evaluate our agent using GPT-4o planning instructions. For a fair comparison, we abstract away the impact of the planning instructions by using the GPT-4o-based planners from UGround.
For evaluation on each agentic benchmark, see the README under src/gui_agent/evaluation/agent_evaluation.
If you use this repository, please cite:
@misc{pantazopoulos2025efficienttrainingpipelinereasoning,
title={An Efficient Training Pipeline for Reasoning Graphical User Interface Agents},
author={Georgios Pantazopoulos and Eda B. Özyiğit},
year={2025},
eprint={2511.08172},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2511.08172},
}