A Novel Approach for Multimodal GUI Retrieval Based on CLIP

📝 Paper | 💾 Dataset | 📦 Pretrained Models
GUICAN addresses the challenges in GUI retrieval (high sensitivity to image quality, fine-grained alignment issues) by adapting large-scale vision-language models to downstream tasks.
Core Contributions:
- SemAlign-GUI Framework: A novel two-stage pipeline incorporating Task-Oriented Fine-Tuning and Attention-Guided Gated Fusion.
- SOTA Performance: Achieves 69% higher Recall@10 and 37% higher MRR than baseline methods.
- Large-Scale Dataset: We introduce GL3D, containing 62,530 triplets for composed GUI retrieval.
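For reference, the reported metrics can be computed from ranked retrieval lists as follows (a minimal sketch; the function names are ours, not from the released code):

```python
def recall_at_k(ranked_lists, targets, k=10):
    """Fraction of queries whose target appears in the top-k results."""
    hits = sum(1 for ranked, t in zip(ranked_lists, targets) if t in ranked[:k])
    return hits / len(targets)

def mean_reciprocal_rank(ranked_lists, targets):
    """Mean of 1/rank of the correct result (0 if it is absent)."""
    total = 0.0
    for ranked, t in zip(ranked_lists, targets):
        if t in ranked:
            total += 1.0 / (ranked.index(t) + 1)
    return total / len(targets)

# Toy example: 3 queries, each with a ranked list of candidate IDs.
ranked = [["b", "a", "c"], ["a", "c", "b"], ["c", "b", "a"]]
targets = ["a", "a", "a"]
```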
Our approach consists of two main stages that bridge the gap between general vision-language pre-training and GUI retrieval.

Stage 1: Task-Oriented Fine-Tuning. We adapt the CLIP encoders to the GUI domain using a progressive unfreezing strategy.

Figure 2: Task-oriented fine-tuning with progressive unfreezing strategy.

Stage 2: Attention-Guided Gated Fusion. We freeze the CLIP encoders and train the AGGF (MEDR-Combiner) network from scratch to fuse the multimodal features.
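The second stage can be illustrated with a minimal gated-fusion sketch (a simplification of AGGF: the feature dimensions, the single sigmoid gate, and the stand-in encoders are our assumptions, not the released architecture):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Minimal gated fusion of image and text features."""
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    def forward(self, img_feat, txt_feat):
        g = self.gate(torch.cat([img_feat, txt_feat], dim=-1))  # per-dimension gate in [0, 1]
        fused = g * img_feat + (1 - g) * txt_feat               # convex combination of the two modalities
        return self.proj(fused)

# Stage 2 recipe: freeze the (pretrained) encoders, train only the fusion head.
image_encoder = nn.Linear(768, 512)  # stand-in for the frozen CLIP image tower
text_encoder = nn.Linear(768, 512)   # stand-in for the frozen CLIP text tower
for enc in (image_encoder, text_encoder):
    for p in enc.parameters():
        p.requires_grad = False

fusion = GatedFusion(dim=512)  # trained from scratch
```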
We release all datasets and pre-trained weights for reproducibility.
| Resource Name | Description | Download Link |
|---|---|---|
| GL3D Dataset | 62,530 triplets (Ref, Text, Target) for composed retrieval. | [Google Drive] |
| Rico-Topic | 5,562 curated screenshots across 10 themes. | [Google Drive] |
| Model Weights | Pre-trained GUIAlignFusion weights. | [Google Drive] |
We constructed GL3D using a hybrid pipeline involving OpenCV and GPT-2.
- Used HSV conversion and contour detection via OpenCV.
- Mapped BGR colors to component types (e.g., Green -> TextView).
- Detection includes thresholding, morphological operations, and contour analysis.
- Employed a hybrid rule-based + GPT-2 approach.
- GPT-2 refined the mechanically generated change statistics into natural English (temperature = 0.7, repetition penalty = 1.1).
- Calculated Euclidean distance for component differences.
- Filtered top-3 significant changes based on weight and area.
- Single Tesla V100-SXM2-32GB GPU (Recommended)
- Python 3.x / PyTorch (Add your versions here)
- Optimizer: AdamW (lr=3e-5 for fusion, 3e-6 for CLIP)
- Input Size: 288×288 pixels
- Batch Size: (Add your batch size)
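The two learning rates above map naturally onto AdamW parameter groups; a minimal sketch with stand-in modules:

```python
import torch
import torch.nn as nn

clip_model = nn.Linear(512, 512)     # stand-in for the CLIP encoders
fusion_model = nn.Linear(1024, 512)  # stand-in for the fusion network

optimizer = torch.optim.AdamW([
    {"params": clip_model.parameters(), "lr": 3e-6},    # gently adapt CLIP
    {"params": fusion_model.parameters(), "lr": 3e-5},  # train fusion faster
])
```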
```bash
# Example: how to run training (once the training code is available)
python train.py --dataset gl3d --epochs 50
```


