An Efficient Training Pipeline for Reasoning Graphical User Interface Agents

[Installation] | [Datasets] | [Training] | [Evaluation]

Visual grounding is the task of localising image regions from natural language queries and is critical for reasoning-capable Graphical User Interface (GUI) agents. Many existing methods rely on massive, noisy synthetic datasets. This repository provides code for building an efficient training pipeline that combines model-based data filtering with parameter-efficient fine-tuning. Vision-Language Models can be trained under three regimes: supervised fine-tuning, chain-of-thought-augmented fine-tuning, and reinforcement learning via Group Relative Policy Optimisation.

Installation

Install Poetry

curl -sSL https://install.python-poetry.org | python3 -

Install Dependencies

# Use Python 3.11.13 in the project
pyenv local 3.11.13

# Tell Poetry to use pyenv
poetry env use $(pyenv which python)

# Install dependencies
poetry install

Or using Conda environments

# Create a Conda environment with Python 3.11 and activate it
conda create -n venv python=3.11
conda activate venv

# Install dependencies
poetry install

Datasets

Datasets used for Training


Datasets used for Evaluation

Clone the OSWorld-G repo to a directory of your choice.

Dataset Filtering

We provide two options:

  • Option 1: Download the existing ranking/filtered datasets (recommended)
  • Option 2: Perform the filtering yourself. This is not recommended, as you will need to download a large number of instances from Aria-UI/Show-UI and may run into storage issues. You will also need an OpenAI API key for the final filtering step (see below for guidelines if you decide to follow this option).

Option 1: Download the existing ranking/filtered datasets:
from datasets import load_dataset
gui_grounding_dataset = load_dataset("alan-turing-institute/gui_grounding")
ranking_dataset = load_dataset("alan-turing-institute/gui_ranking")

Below are the step-by-step instructions for replicating the filtering process:

Step 1: Downsample Data
Downsample Aria-UI Mobile
mkdir -p storage/downsampled/

python scripts/downsample_examples.py \
	--input-json /path/to/aria/local/snapshot/directory/mobile/aria_ui_mobile_AMEX_1077k.json \
	--image-root /path/to/amex/local/snapshot/directory/AMEX/screenshot/ \
	--output-json storage/downsampled/aria_ui_mobile_AMEX_1077k_downsampled.json
Downsample Aria-UI Web
mkdir -p storage/downsampled/

python scripts/downsample_examples.py \
	--input-json /path/to/aria/local/snapshot/directory/web/aria_ui_web_2889k.json \
	--image-root /path/to/web/local/snapshot/directory/web/images/ \
	--output-json storage/downsampled/aria_ui_web_2889k_downsampled.json
Step 2: Generate Predictions
Aria-UI Desktop
mkdir -p storage/zero_shot_predictions
accelerate launch --num_processes=2 src/gui_agent/evaluation/generate_predictions_qwen2_5_aria.py \
	--end-index 100 --dataset-subset desktop \
	--annotation-json-path /path/to/web/local/snapshot/directory/desktop/aria_ui_desktop_fix_batch_2_with_instructions.json \
	--images-folder-path /path/to/web/local/snapshot/directory/desktop/screenshots_fix_batch2 \
	--output-json storage/zero_shot_predictions/qwen2.5vl_3b_aria_desktop_results.json
Aria-UI Mobile
mkdir -p storage/zero_shot_predictions
accelerate launch --num_processes=2 src/gui_agent/evaluation/generate_predictions_qwen2_5_aria.py \
	--end-index 100 --dataset-subset mobile \
	--annotation-json-path storage/downsampled/aria_ui_mobile_AMEX_1077k_downsampled.json \
	--images-folder-path /path/to/amex/local/snapshot/directory/AMEX/screenshot/ \
	--output-json storage/zero_shot_predictions/qwen2.5vl_3b_aria_ui_mobile_AMEX_1077k_downsampled.json
Aria-UI Web
mkdir -p storage/zero_shot_predictions
accelerate launch --num_processes=2 src/gui_agent/evaluation/generate_predictions_qwen2_5_aria.py \
	--end-index 100 --dataset-subset web \
	--annotation-json-path storage/downsampled/aria_ui_web_2889k_downsampled.json \
	--images-folder-path /path/to/web/local/snapshot/directory/web/images/ \
	--output-json storage/zero_shot_predictions/qwen2.5vl_3b_aria_ui_web_2889k_downsampled.json
Show-UI Desktop
mkdir -p storage/zero_shot_predictions
python src/gui_agent/evaluation/generate_predictions_qwen2_5showui.py \
	--output-json storage/zero_shot_predictions/qwen2.5vl_3b_showui_desktop_results.json

Step 3: Preparing Ranking Dataset

In this step we create a ranking dataset based on the examples where the model correctly predicted the target UI element. These examples are grouped per image to create positive/negative examples for training the ranker model.
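For intuition, here is a minimal sketch of the grouping idea, assuming each prediction record carries an image path, the predicted click point, the ground-truth box, and the instruction; the field names are hypothetical and scripts/prepare_ranking_dataset.py is the source of truth.

# Sketch only: group predictions per screenshot and pair correct (positive)
# with incorrect (negative) predictions on the same image.
from collections import defaultdict

def point_in_bbox(point, bbox):
    x, y = point
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2

def build_ranking_examples(predictions):
    per_image = defaultdict(lambda: {"positive": [], "negative": []})
    for pred in predictions:
        correct = point_in_bbox(pred["predicted_point"], pred["target_bbox"])
        per_image[pred["image_path"]]["positive" if correct else "negative"].append(pred)

    examples = []
    for image_path, group in per_image.items():
        for pos in group["positive"]:
            for neg in group["negative"]:
                examples.append({"image": image_path,
                                 "positive_instruction": pos["instruction"],
                                 "negative_instruction": neg["instruction"]})
    return examples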

mkdir storage/ranking -p

python scripts/prepare_ranking_dataset.py \
	--aria-ui-desktop-annotations /path/to/web/local/snapshot/directory/desktop/aria_ui_desktop_fix_batch_2_with_instructions.json \
	--aria-ui-desktop-prediction storage/zero_shot_predictions/qwen2.5vl_3b_aria_desktop_results.json \
	--aria-ui-desktop-image-path /path/to/web/local/snapshot/directory/desktop/screenshots_fix_batch2 \
	--aria-ui-mobile-annotations storage/downsampled/aria_ui_mobile_AMEX_1077k_downsampled.json \
	--aria-ui-mobile-prediction storage/zero_shot_predictions/qwen2.5vl_3b_aria_ui_mobile_AMEX_1077k_downsampled.json \
	--aria-ui-mobile-image-path /path/to/amex/local/snapshot/directory/AMEX/screenshot/ \
	--aria-ui-web-annotations storage/downsampled/aria_ui_web_2889k_downsampled.json \
	--aria-ui-web-prediction storage/zero_shot_predictions/qwen2.5vl_3b_aria_ui_web_2889k_downsampled.json \
	--aria-ui-web-image-path /path/to/web/local/snapshot/directory/web/images/ \
	--show-ui-prediction storage/zero_shot_predictions/qwen2.5vl_3b_showui_desktop_results.json \
	--output-json storage/ranking/ranking_dataset.json

Step 4: Rank Predictions

This step requires a ranking model; for details on how to train one, see the Training section below.

Aria-UI Desktop
mkdir -p storage/ranking/
accelerate launch --num_processes=2 src/gui_agent/evaluation/generate_ranking_predictions_qwen2_5_aria_accelerate.py \
	--model-path /path/to/ranker/model \
	--annotation-json /path/to/web/local/snapshot/directory/desktop/aria_ui_desktop_fix_batch_2_with_instructions.json \
	--image-folder-path /path/to/web/local/snapshot/directory/desktop/screenshots_fix_batch2 \
	--dataset-subset desktop \
	--prediction-json storage/zero_shot_predictions/qwen2.5vl_3b_aria_desktop_results.json \
	--output-json storage/ranking/qwen2.5vl_3b_aria_desktop.json
Aria-UI Mobile
mkdir -p storage/ranking/
accelerate launch --num_processes=2 src/gui_agent/evaluation/generate_ranking_predictions_qwen2_5_aria_accelerate.py \
	--model-path /path/to/ranker/model \
	--annotation-json storage/downsampled/aria_ui_mobile_AMEX_1077k_downsampled.json \
	--image-folder-path /path/to/amex/local/snapshot/directory/AMEX/screenshot/ \
	--dataset-subset mobile \
	--prediction-json storage/zero_shot_predictions/qwen2.5vl_3b_aria_ui_mobile_AMEX_1077k_downsampled.json \
	--output-json storage/ranking/qwen2.5vl_3b_aria_ui_mobile_AMEX_1077k_downsampled.json
Aria-UI Web
mkdir -p storage/ranking/
accelerate launch --num_processes=2 src/gui_agent/evaluation/generate_ranking_predictions_qwen2_5_aria_accelerate.py \
	--model-path /path/to/ranker/model \
	--annotation-json storage/downsampled/aria_ui_web_2889k_downsampled.json \
	--image-folder-path /path/to/web/local/snapshot/directory/web/images/ \
	--dataset-subset web \
	--prediction-json storage/zero_shot_predictions/qwen2.5vl_3b_aria_ui_web_2889k_downsampled.json \
	--output-json storage/ranking/qwen2.5vl_3b_aria_ui_web_2889k_downsampled.json
Show-UI Desktop
mkdir -p storage/ranking/

python src/gui_agent/evaluation/generate_ranking_predictions_qwen2_5_showui.py \
	--prediction-json storage/zero_shot_predictions/qwen2.5vl_3b_showui_desktop_results.json \
	--output-json storage/ranking/qwen2.5vl_3b_showui_desktop_results.json

Finally, group the ranking predictions into a single JSON file and, optionally, push the results to a Hugging Face repo:

python scripts/prepare_full_dataset.py \
	--ranking-predictions-path storage/ranking/ \
	--output-json storage/gui_grounding_ranked.json \
	--push-to-hub \
	--hf-repo-id /my/repo/id/ranked
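If you prefer to push the grouped file manually instead of using --push-to-hub, a minimal sketch with the datasets library looks like this (the repo id is a placeholder):

from datasets import Dataset

dataset = Dataset.from_json("storage/gui_grounding_ranked.json")
dataset.push_to_hub("my-username/gui_grounding_ranked")  # replace with your own repo id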

Step 5: Cluster Ranked Predictions

First, extract the last hidden states from the ranked examples, using the repo id from the previous step:

python scripts/get_hidden_states_from_ranked_examples.py \
	--model-path /path/to/ranker/model \
	--input-hf-repo-id /my/repo/id/ranked \
	--output-directory storage/hidden_states \
	--push-to-hub \
	--hf-repo-id /my/hf/repo/id/hidden/states/
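As a rough illustration of what "last hidden state" means here, the sketch below embeds a single (screenshot, instruction) pair with the ranker checkpoint; the message format and last-token pooling are assumptions, not necessarily what scripts/get_hidden_states_from_ranked_examples.py does.

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_path = "/path/to/ranker/model"
processor = AutoProcessor.from_pretrained(model_path)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto")

image = Image.open("screenshot.png")  # placeholder example
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Click the settings icon"}]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Last layer, last token: one embedding vector per example.
embedding = outputs.hidden_states[-1][:, -1, :].float().cpu().numpy()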

Run clustering on the hidden states for every data source (Aria-UI desktop/web/mobile and Show-UI desktop).

Important note: for efficient clustering we use the faiss library. Unfortunately, faiss is built for conda and pins a NumPy version < 2.0, which breaks the rest of the codebase. You should either 1) install faiss in a separate virtual environment, or 2) install it within the current environment; the latter breaks the package dependencies, but you can restore them afterwards by re-running poetry install.

mkdir storage/clustering/ -p
python scripts/run_clustering.py \
	--dataset-name /my/repo/id/ranked \
	--source aria-ui-mobile \
	--embeddings-path storage/hidden_states \
	--output-path storage/clustering/aria_ui_mobile_indices.json
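For reference, here is a minimal faiss k-means sketch over saved embeddings; the embeddings file name, number of clusters, and representative-selection strategy are assumptions, and scripts/run_clustering.py may do this differently.

import json
import numpy as np
import faiss

embeddings = np.load("storage/hidden_states/aria_ui_mobile.npy").astype("float32")  # hypothetical file
n_clusters = 64  # hypothetical setting

# Train k-means and assign every example to its nearest centroid.
kmeans = faiss.Kmeans(embeddings.shape[1], n_clusters, niter=20, seed=12345)
kmeans.train(embeddings)
_, assignments = kmeans.index.search(embeddings, 1)

# Keep, for each cluster, the example closest to the centroid as its representative.
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
_, representatives = index.search(kmeans.centroids, 1)

with open("storage/clustering/aria_ui_mobile_indices.json", "w") as f:
    json.dump({"assignments": assignments.ravel().tolist(),
               "representatives": representatives.ravel().tolist()}, f)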

Step 6: GPT-4o based filtering

Filter the clusters using GPT-4o (you will need your OpenAI API key) for every data source (Aria-UI desktop/web/mobile and Show-UI desktop):

mkdir storage/gpt4o-filtering/ -p
python scripts/gpt4_filtering.py \
	--dataset-name /my/repo/id/ranked \
	--openai-api-key <YOUR_OPEN_AI_API_KEY> \
	--indices-path storage/clustering/aria_ui_mobile_indices.json \
	--output-json storage/gpt4o-filtering/aria_ui_mobile.json
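For illustration, here is a minimal sketch of what a single GPT-4o filtering call could look like; the actual prompt and decision logic live in scripts/gpt4_filtering.py and may differ.

import base64
from openai import OpenAI

client = OpenAI(api_key="<YOUR_OPEN_AI_API_KEY>")

def keep_example(image_path: str, instruction: str) -> bool:
    # Ask GPT-4o whether the instruction unambiguously refers to one element in the screenshot.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Does the instruction '{instruction}' unambiguously describe a single "
                         "UI element in this screenshot? Answer yes or no."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")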

Collate all filtered results:

python scripts/collate_filtering_results.py \
	--input-clustering-json storage/clustering/aria_ui_mobile_indices.json storage/clustering/aria_ui_web_indices.json storage/clustering/aria_ui_desktop_indices.json storage/clustering/show_ui_desktop_indices.json \
	--input-filtering-json storage/gpt4o-filtering/aria_ui_mobile.json storage/gpt4o-filtering/aria_ui_web.json storage/gpt4o-filtering/aria_ui_desktop.json storage/gpt4o-filtering/show_ui_desktop.json \
	--output-json storage/collated/gui_grounding_final.json \
	--push-to-hub 

Training

Train a ranking model from correct zero-shot predictions

This step assumes that you have already generated the zero-shot predictions (see Step 2 of the dataset filtering instructions above).

Example Usage:

accelerate launch --num_processes=2 train_ranker_from_correct_predictions.py \
	--model_name Qwen/Qwen2.5-VL-3B-Instruct \
	--output_dir storage/models/qwen_2_5_3b_vl_ranker \
	--per_device_train_batch_size 2 \
	--per_device_eval_batch_size 4 \
	--gradient_accumulation_steps 8 \
	--min_pixels 3136 \
	--max_pixels 4028160 \
	--num_train_epochs 2 \
	--logging_steps 1 \
	--save_strategy steps \
	--save_steps 0.1 \
	--learning_rate 1e-5 \
	--lr_scheduler_type cosine \
	--gradient_checkpointing True \
	--deepspeed configs/trainer/zero3.json \
	--save_total_limit 2 \
	--log_level info \
	--save_safetensors True \
	--seed 12345 \
	--data_seed 12345 \
	--dataloader_num_workers 4 \
	--logging_nan_inf_filter False \
	--run_name qwen_2_5_3b_vl_ranker \
	--project_name project_mllm_ranker

Train GUI model

Example usage:

python src/gui_agent/training/train_sft.py \
	--dataloader_num_workers 4 \
	--output_dir storage/models/qwen_2_5_3b_vl_full_data_all_linears_r256_lora_alpha512.json_lr1e-6_bsz32 \
	--num_train_epochs 2 \
	--learning_rate 1e-6 \
	--run_name qwen_2_5_3b_vl_full_data_all_linears_r256_lora_alpha512.json_lr1e-6_bsz32 \
	--lora_config configs/lora/all_linears_r256_lora_alpha512.json \
	--per_device_train_batch_size 2 \
	--gradient_accumulation_steps 16 \
	--save_total_limit 1

More LoRA configs are available under configs/lora; if you want to do full fine-tuning, simply remove the --lora_config argument. A rough sketch of what the config used above corresponds to is shown below.
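The config name all_linears_r256_lora_alpha512 suggests an adapter roughly equivalent to the peft LoraConfig below; the dropout value is a placeholder, and the JSON files under configs/lora remain the source of truth.

from peft import LoraConfig

lora_config = LoraConfig(
    r=256,                        # adapter rank
    lora_alpha=512,               # scaling factor
    lora_dropout=0.05,            # placeholder value
    target_modules="all-linear",  # adapt every linear layer
    task_type="CAUSAL_LM",
)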

DeepSpeed configs are available under configs/deepspeed.


Evaluation

Evaluate ranker

Evaluate the ranker on OSWorld-G

python src/gui_agent/evaluation/evaluate_qwen2_5_ranker_osworldg.py \
	--osworldg-path /path/to/osworld/repo \
	--model-path /path/to/ranker/model \
	--output-json storage/results/qwen_2_5_3b_vl_ranker_osworld.json

Evaluate the ranker on ScreenSpot

python src/gui_agent/evaluation/evaluate_qwen2_5_ranker_screenspot.py \
	--dataset-name rootsautomation/ScreenSpot \
	--model-path /path/to/ranker/model \
	--output-json storage/results/qwen_2_5_3b_vl_ranker_screenspot.json

Evaluate the ranker on ScreenSpot-v2

python src/gui_agent/evaluation/evaluate_qwen2_5_ranker_screenspot.py \
	--dataset-name HongxinLi/ScreenSpot_v2 \
	--model-path /path/to/ranker/model \
	--output-json storage/results/qwen_2_5_3b_vl_ranker_screenspotv2.json
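ScreenSpot-style grounding is usually scored by whether the predicted click point falls inside the ground-truth bounding box; below is a minimal sketch of that metric, with hypothetical field names that may not match the evaluation scripts.

def click_accuracy(results):
    # results: list of dicts with a predicted click point and a ground-truth box
    correct = 0
    for item in results:
        x, y = item["predicted_point"]
        x1, y1, x2, y2 = item["bbox"]  # (left, top, right, bottom), same coordinate system
        correct += int(x1 <= x <= x2 and y1 <= y <= y2)
    return correct / len(results)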

Evaluate GUI agent

Grounding Evaluation

Example usage:

python src/gui_agent/evaluation/evaluate_qwen2_5_screenspot_trained.py \
	--output-json storage/results/grounding/qwen_2_5_3b_vl_full_data_lr1e-6_bsz32_epochs2_cot.json \
	--model-name path/to/trained/model/

If you used LoRA fine-tuning, --model-name must be the base model name from HF, and --checkpoint-peft-path must point to the adapters produced during training.
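As a rough sketch, this is how such a checkpoint is typically assembled for inference with peft; the exact loading logic lives in the evaluation script.

import torch
from peft import PeftModel
from transformers import Qwen2_5_VLForConditionalGeneration

base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, "/path/to/checkpoint/peft/adapters")
model = model.merge_and_unload()  # optional: merge adapters for faster inference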

Offline Agentic Evaluation

We evaluate our agent offline using GPT-4o planning instructions.

For a fair comparison, we abstract away the impact of the planning instructions by using the GPT-4o-based planners from UGround. For evaluation on each agentic benchmark, see the README under src/gui_agent/evaluation/agent_evaluation.

Citation

If you use this repository, please cite:

@misc{pantazopoulos2025efficienttrainingpipelinereasoning,
      title={An Efficient Training Pipeline for Reasoning Graphical User Interface Agents}, 
      author={Georgios Pantazopoulos and Eda B. Özyiğit},
      year={2025},
      eprint={2511.08172},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2511.08172}, 
}
