A Novel Approach for Multimodal GUI Retrieval Based on CLIP

📝 Paper | 💾 Dataset | 📦 Pretrained Models
GUICAN addresses the challenges in GUI retrieval (high sensitivity to image quality, fine-grained alignment issues) by adapting large-scale vision-language models to downstream tasks.
Core Contributions:
- SemAlign-GUI Framework: A novel two-stage pipeline incorporating Task-Oriented Fine-Tuning and Attention-Guided Gated Fusion.
- SOTA Performance: Achieves 69% higher Recall@10 and 37% higher MRR than baseline methods.
- Large-Scale Dataset: We introduce GL3D, containing 62,530 triplets for composed GUI retrieval.
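For reference, the reported metrics can be computed from ranked retrieval lists as follows (a minimal sketch; the function names are ours, not from the released code):

```python
def recall_at_k(ranked_lists, targets, k=10):
    """Fraction of queries whose target appears in the top-k results."""
    hits = sum(1 for ranked, t in zip(ranked_lists, targets) if t in ranked[:k])
    return hits / len(targets)

def mean_reciprocal_rank(ranked_lists, targets):
    """Mean of 1/rank of the correct result (0 if it is absent)."""
    total = 0.0
    for ranked, t in zip(ranked_lists, targets):
        if t in ranked:
            total += 1.0 / (ranked.index(t) + 1)
    return total / len(targets)

# Toy example: 3 queries, each with a ranked list of candidate IDs.
ranked = [["b", "a", "c"], ["a", "c", "b"], ["c", "b", "a"]]
targets = ["a", "a", "a"]
```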
Our approach consists of two main stages that bridge the gap between general vision-language pre-training and GUI retrieval.

Stage 1: Task-Oriented Fine-Tuning. We adapt the CLIP encoders to the GUI domain using a progressive unfreezing strategy.

Figure 2: Task-oriented fine-tuning with progressive unfreezing strategy.

Stage 2: Attention-Guided Gated Fusion. We freeze the CLIP encoders and train the AGGF (MEDR-Combiner) network from scratch to fuse the multimodal features.
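The second stage can be illustrated with a minimal gated-fusion sketch (a simplification of AGGF: the feature dimensions, the single sigmoid gate, and the stand-in encoders are our assumptions, not the released architecture):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Minimal gated fusion of image and text features."""
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    def forward(self, img_feat, txt_feat):
        g = self.gate(torch.cat([img_feat, txt_feat], dim=-1))  # per-dimension gate in [0, 1]
        fused = g * img_feat + (1 - g) * txt_feat               # convex combination of the two modalities
        return self.proj(fused)

# Stage 2 recipe: freeze the (pretrained) encoders, train only the fusion head.
image_encoder = nn.Linear(768, 512)  # stand-in for the frozen CLIP image tower
text_encoder = nn.Linear(768, 512)   # stand-in for the frozen CLIP text tower
for enc in (image_encoder, text_encoder):
    for p in enc.parameters():
        p.requires_grad = False

fusion = GatedFusion(dim=512)  # trained from scratch
```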
We release all datasets and pre-trained weights for reproducibility.
| Resource Name | Description | Download Link |
|---|---|---|
| GL3D Dataset | 62,530 triplets (Ref, Text, Target) for composed retrieval. | [Google Drive] |
| Rico-Topic | 5,562 curated screenshots across 10 themes. | [Google Drive] |
| Model Weights | Pre-trained GUIAlignFusion weights. | [Google Drive] |
We constructed GL3D using a hybrid pipeline involving OpenCV and GPT-2.
- Used HSV conversion and contour detection via OpenCV.
- Mapped BGR colors to component types (e.g., Green -> TextView).
- Detection includes thresholding, morphological operations, and contour analysis.
- Employed a hybrid rule-based + GPT-2 approach.
- GPT-2 refined the mechanically generated change statistics into natural English (temperature = 0.7, repetition penalty = 1.1).
- Calculated Euclidean distance for component differences.
- Filtered top-3 significant changes based on weight and area.
- Single Tesla V100-SXM2-32GB GPU (Recommended)
- Python 3.x / PyTorch (Add your versions here)
- Optimizer: AdamW (lr=3e-5 for fusion, 3e-6 for CLIP)
- Input Size: 288×288 pixels
- Batch Size: (Add your batch size)
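The two learning rates above map naturally onto AdamW parameter groups; a minimal sketch with stand-in modules:

```python
import torch
import torch.nn as nn

clip_model = nn.Linear(512, 512)     # stand-in for the CLIP encoders
fusion_model = nn.Linear(1024, 512)  # stand-in for the fusion network

optimizer = torch.optim.AdamW([
    {"params": clip_model.parameters(), "lr": 3e-6},    # gently adapt CLIP
    {"params": fusion_model.parameters(), "lr": 3e-5},  # train fusion faster
])
```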
```bash
# Example: how to run training (once the training code is available)
python train.py --dataset gl3d --epochs 50
```


