anonymous1-cloud/GUICAN

GUICAN: Deep GUI Cross-Modal Alignment Network

Overview

A Novel Approach for Multimodal GUI Retrieval Based on CLIP

> 📝 Paper | 💾 Dataset | 📦 Pretrained Models

📖 Introduction

GUICAN addresses two challenges in GUI retrieval, high sensitivity to image quality and weak fine-grained alignment between modalities, by adapting a large-scale vision-language model (CLIP) to the GUI retrieval task.

Core Contributions:

  • SemAlign-GUI Framework: A novel two-stage pipeline incorporating Task-Oriented Fine-Tuning and Attention-Guided Gated Fusion.
  • SOTA Performance: Achieves 69% higher Recall@10 and 37% higher MRR than the baselines.
  • Large-Scale Dataset: We introduce GL3D, containing 62,530 triplets for composed GUI retrieval.
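The README does not say how Recall@10 and MRR are computed here, so the following is only a minimal sketch of the standard definitions. The function names and the toy rankings are illustrative assumptions, not code from this repository.

```python
from typing import Sequence


def recall_at_k(rankings: Sequence[Sequence[str]], targets: Sequence[str], k: int) -> float:
    """Fraction of queries whose target appears among the top-k retrieved items."""
    hits = sum(1 for ranking, target in zip(rankings, targets) if target in ranking[:k])
    return hits / len(targets)


def mean_reciprocal_rank(rankings: Sequence[Sequence[str]], targets: Sequence[str]) -> float:
    """Average of 1/rank of the target; a query whose target is missing contributes 0."""
    total = 0.0
    for ranking, target in zip(rankings, targets):
        if target in ranking:
            total += 1.0 / (list(ranking).index(target) + 1)
    return total / len(targets)


# Toy data (hypothetical): three queries, each with a ranked list of GUI ids.
rankings = [["a", "b", "c"], ["b", "c", "a"], ["c", "a", "b"]]
targets = ["a", "a", "x"]  # the third target is never retrieved

recall_1 = recall_at_k(rankings, targets, k=1)
recall_3 = recall_at_k(rankings, targets, k=3)
mrr = mean_reciprocal_rank(rankings, targets)
```

With these toy rankings, the first query finds its target at rank 1, the second at rank 3, and the third not at all.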

🛠️ Method Overview

Our approach consists of two main stages to bridge the gap between general pre-training and GUI tasks.

Stage 1: Feature Alignment

We freeze CLIP encoders and train the AGGF module.

Figure 2 (Stage 1 architecture): Task-oriented fine-tuning with progressive unfreezing strategy.
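The internals of the AGGF module are not described in this README. As a generic illustration of what "gated fusion" usually means, the sketch below computes a per-dimension sigmoid gate from the concatenated image and text features and uses it to blend the two. The gate parameterization, the function names, and the toy vectors are all assumptions, and the attention-guided part is not modeled.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    image_feat: List[float],
    text_feat: List[float],
    gate_weights: List[List[float]],
    gate_bias: List[float],
) -> List[float]:
    """fused[i] = g[i] * image[i] + (1 - g[i]) * text[i],
    where g = sigmoid(W @ concat(image, text) + b)."""
    concat = image_feat + text_feat
    fused = []
    for i, (img, txt) in enumerate(zip(image_feat, text_feat)):
        logit = sum(w * v for w, v in zip(gate_weights[i], concat)) + gate_bias[i]
        gate = sigmoid(logit)
        fused.append(gate * img + (1.0 - gate) * txt)
    return fused


# Toy 2-d features (hypothetical). Zero weights and bias give a gate of 0.5,
# so the fused vector is the plain average of the two inputs.
image_feat = [1.0, 0.0]
text_feat = [0.0, 1.0]
zero_w = [[0.0] * 4 for _ in range(2)]
balanced = gated_fusion(image_feat, text_feat, zero_w, [0.0, 0.0])

# A large positive bias pushes the gate towards 1, so the image features dominate.
image_heavy = gated_fusion(image_feat, text_feat, zero_w, [50.0, 50.0])
```

In a trained model the gate weights would be learned, so the blend can differ per feature dimension and per input pair.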

Stage 2: Feature Fusion

We train the MEDR-Combiner network from scratch to fuse multimodal features.

Figures 3 and 4: Stage 2 pipeline and MEDR-Combiner structure.
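Once the reference-image and text features have been fused into a single query vector, composed retrieval typically reduces to ranking candidate GUIs by similarity to that query. The sketch below uses cosine similarity, which is the usual choice for CLIP-style embeddings; the README does not confirm the similarity measure, and the gallery names and vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_targets(query: Sequence[float], gallery: Dict[str, Sequence[float]]) -> List[str]:
    """Return gallery ids sorted from most to least similar to the fused query."""
    return sorted(gallery, key=lambda name: cosine(query, gallery[name]), reverse=True)


# Hypothetical fused query and candidate GUI embeddings.
fused_query = [1.0, 0.0]
gallery = {
    "settings": [0.0, 1.0],
    "login": [0.9, 0.1],
    "profile": [0.5, 0.5],
}
ranking = rank_targets(fused_query, gallery)
```

The resulting ranking is what metrics such as Recall@10 and MRR are evaluated on.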

💾 Resources (Datasets & Checkpoints)

We release all datasets and pre-trained weights for reproducibility.

| Resource Name | Description | Download Link |
| --- | --- | --- |
| GL3D Dataset | 62,530 triplets (Ref, Text, Target) for composed retrieval. | [Google Drive] |
| Rico-Topic | 5,562 curated screenshots across 10 themes. | [Google Drive] |
| Model Weights | Pre-trained GUIAlignFusion weights. | [Google Drive] |

🔍 Dataset Construction Details

We constructed GL3D using a hybrid pipeline involving OpenCV and GPT-2.

1. GUI Element Extraction

  • Used HSV conversion and contour detection via OpenCV.
  • Mapped BGR colors to component types (e.g., Green -> TextView).
  • Detection includes thresholding, morphological operations, and contour analysis.
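The color-to-component mapping can be illustrated without OpenCV. The sketch below converts a single BGR color to HSV with Python's standard `colorsys` module and looks up a component type by hue. Only the Green -> TextView pair comes from this README; the hue bounds, the saturation and value thresholds, and the function name are assumptions, and the full pipeline (thresholding, morphology, contour analysis) is not reproduced.

```python
import colorsys
from typing import List, Optional, Tuple

# (min hue in degrees, max hue in degrees, component type).
# Green -> TextView is from the README; the 90-150 degree bounds are assumed.
HUE_TO_COMPONENT: List[Tuple[float, float, str]] = [
    (90.0, 150.0, "TextView"),
]


def classify_bgr(
    bgr: Tuple[int, int, int],
    min_saturation: float = 0.3,  # assumed threshold to skip gray/white pixels
    min_value: float = 0.3,       # assumed threshold to skip near-black pixels
) -> Optional[str]:
    """Map one 8-bit BGR color to a component type, or None if no rule matches."""
    b, g, r = (channel / 255.0 for channel in bgr)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)  # h, s, v are all in [0, 1]
    if s < min_saturation or v < min_value:
        return None
    hue_deg = h * 360.0
    for low, high, component in HUE_TO_COMPONENT:
        if low <= hue_deg <= high:
            return component
    return None


green = classify_bgr((0, 255, 0))      # pure green in BGR order
blue = classify_bgr((255, 0, 0))       # pure blue in BGR order, no rule defined
gray = classify_bgr((128, 128, 128))   # unsaturated, rejected by the threshold
```

Note that OpenCV stores 8-bit hue on a 0 to 179 scale, so degree-based bounds like these would be halved when applied to the output of `cv2.cvtColor`.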

2. Natural Language Description (GPT-2)

  • Employed a hybrid rule-based + GPT-2 approach.
  • GPT-2 rewrote the rule-generated component statistics as natural English (temperature 0.7, repetition penalty 1.1).
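To show what the two decoding settings do, the sketch below applies a repetition penalty and a temperature to a toy set of next-token logits and then normalizes them with a softmax. It follows the commonly used CTRL-style penalty (positive logits of already generated tokens are divided by the penalty, negative ones multiplied); whether the authors' generation code behaves exactly this way is an assumption, and the vocabulary and logits are made up.

```python
import math
from typing import Dict, Iterable


def adjust_logits(
    logits: Dict[str, float],
    generated: Iterable[str],
    temperature: float = 0.7,
    repetition_penalty: float = 1.1,
) -> Dict[str, float]:
    """Return next-token probabilities after penalty, temperature, and softmax."""
    seen = set(generated)
    scaled = {}
    for token, logit in logits.items():
        if token in seen:
            # Make already generated tokens less likely.
            logit = logit / repetition_penalty if logit > 0 else logit * repetition_penalty
        # Temperature below 1 sharpens the distribution.
        scaled[token] = logit / temperature
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {token: math.exp(value - peak) for token, value in scaled.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


# Toy logits (hypothetical): "button" was already generated, "screen" was not.
toy_logits = {"button": 2.0, "screen": 2.0, "the": 1.0}
probs = adjust_logits(toy_logits, generated=["button"])
```

Although "button" and "screen" start with equal logits, the penalty leaves "button" with the lower probability, which discourages repeated words in the generated description.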

3. Difference Matching

  • Calculated Euclidean distance for component differences.
  • Filtered top-3 significant changes based on weight and area.
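The README gives no formula for this step, so the sketch below is one plausible reading: each reference component is matched to the nearest target component of the same type by Euclidean distance between centers, each change is scored as type weight × area × distance, and the three highest scores are kept. The `Component` fields, the scoring formula, the weights, and every component type other than TextView are assumptions.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Component:
    kind: str    # component type, e.g. "TextView"
    cx: float    # center x
    cy: float    # center y
    area: float  # bounding-box area in pixels


def euclidean(a: Component, b: Component) -> float:
    return math.hypot(a.cx - b.cx, a.cy - b.cy)


def top_changes(
    reference: List[Component],
    target: List[Component],
    type_weight: Dict[str, float],
    k: int = 3,
) -> List[Tuple[float, str, float]]:
    """Return up to k (score, kind, distance) tuples, highest score first."""
    scored = []
    for ref in reference:
        candidates = [t for t in target if t.kind == ref.kind]
        if not candidates:
            continue
        distance = min(euclidean(ref, t) for t in candidates)
        if distance > 0:  # unchanged components are not reported
            score = type_weight.get(ref.kind, 1.0) * ref.area * distance
            scored.append((score, ref.kind, distance))
    scored.sort(reverse=True)
    return scored[:k]


# Hypothetical layouts: four components move, one stays in place.
reference = [
    Component("TextView", 10, 10, 100),
    Component("Button", 50, 50, 200),
    Component("Image", 0, 0, 400),
    Component("Icon", 5, 5, 10),
    Component("Switch", 20, 20, 50),
]
target = [
    Component("TextView", 13, 14, 100),
    Component("Button", 50, 50, 200),
    Component("Image", 0, 10, 400),
    Component("Icon", 5, 6, 10),
    Component("Switch", 20, 26, 50),
]
changes = top_changes(reference, target, type_weight={"Image": 0.5})
changed_kinds = [kind for _, kind, _ in changes]
```

Here the small Icon shift is dropped by the top-3 cut, and the unmoved Button is never scored.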

💻 Implementation & Usage

Prerequisites

  • Single Tesla V100-SXM2-32GB GPU (Recommended)
  • Python 3.x and PyTorch (exact versions not specified)

Key Parameters

  • Optimizer: AdamW (lr=3e-5 for fusion, 3e-6 for CLIP)
  • Input Size: 288×288 pixels
  • Batch Size: not specified

# Example training command (assumes the repository provides train.py)
python train.py --dataset gl3d --epochs 50
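The two learning rates listed under Key Parameters imply separate optimizer parameter groups. A minimal sketch of how such groups could be built is shown below; the `clip.` name prefix, the function name, and the placeholder parameters are assumptions about the model layout rather than code from this repository. The returned list has the shape that `torch.optim.AdamW` accepts for per-group options.

```python
from typing import Dict, Iterable, List, Tuple

CLIP_LR = 3e-6    # learning rate for CLIP parameters (from Key Parameters)
FUSION_LR = 3e-5  # learning rate for fusion parameters (from Key Parameters)


def build_param_groups(
    named_params: Iterable[Tuple[str, object]],
    clip_prefix: str = "clip.",  # hypothetical prefix for CLIP submodules
) -> List[Dict]:
    """Split (name, parameter) pairs into a CLIP group and a fusion group."""
    clip_params, fusion_params = [], []
    for name, param in named_params:
        if name.startswith(clip_prefix):
            clip_params.append(param)
        else:
            fusion_params.append(param)
    return [
        {"params": clip_params, "lr": CLIP_LR},
        {"params": fusion_params, "lr": FUSION_LR},
    ]


# Placeholder strings stand in for tensors so the sketch runs without PyTorch.
demo_params = [
    ("clip.visual.proj", "p0"),
    ("clip.text.proj", "p1"),
    ("combiner.fc.weight", "p2"),
]
param_groups = build_param_groups(demo_params)

# With PyTorch installed, the same helper would be used as:
#   optimizer = torch.optim.AdamW(build_param_groups(model.named_parameters()))
```

Keeping the CLIP learning rate ten times smaller than the fusion one limits how far the pre-trained weights drift during fine-tuning.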
