## DiMo-GUI: A Python Implementation of a Visual Grounding Framework

This notebook implements the research paper "DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning" ([arXiv:2507.00008v2](https://arxiv.org/pdf/2507.00008v2)). The goal is to replicate the paper's core logic (dynamic zooming and modality decoupling) and test its effectiveness on a general-purpose Vision-Language Model.

In [None]:
!pip install transformers torch pillow accelerate bitsandbytes

**Step 1: Setup and Helper Functions**

Here, we install the necessary Python libraries (in the cell above) and load a pre-trained Vision-Language Model (LLaVA-1.5-7B) from Hugging Face. We load it with 4-bit quantization so that it fits within the free Google Colab environment.
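
To see why 4-bit quantization matters here, the weight memory can be estimated with simple arithmetic. This is only a rough sketch: the parameter count of 7 billion is an approximation for a "7b" model, the helper name `weight_memory_gb` is made up for illustration, and activations and other runtime overhead are ignored.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Assumption: roughly 7 billion parameters for a "7b" checkpoint.
approx_params = 7e9

print(f"float16 weights: {weight_memory_gb(approx_params, 16):.1f} GB")
print(f"4-bit weights:   {weight_memory_gb(approx_params, 4):.1f} GB")
```

Under these assumptions the weights shrink to about a quarter of their half-precision size, which is what leaves headroom on a small GPU.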

A key utility function, `extract_bbox`, is defined in the first code cell of Step 2. Since the VLM outputs coordinates in natural language (e.g., "The bounding box is [128, 352, 307, 417]"), this function's job is to parse that text. It handles both absolute pixel values and normalized decimal coordinates, converting them into a standardized list of integers that the Pillow library can use for image manipulation.
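
The two coordinate conventions can be illustrated with a simplified, standalone sketch. `parse_box` is a hypothetical stand-in written only for this example; it is not the notebook's `extract_bbox`, and the 800x600 image size is an arbitrary assumption.

```python
import re

def parse_box(text, image_width, image_height):
    """Simplified sketch: read '[a, b, c, d]' from text and return pixel integers, or None."""
    match = re.search(r"\[([^\[\]]+)\]", text)
    if not match:
        return None
    try:
        coords = [float(part) for part in match.group(1).split(",")]
    except ValueError:
        return None
    if len(coords) != 4:
        return None
    # Values that all lie in [0, 1] are treated as normalized coordinates.
    if all(0 <= c <= 1 for c in coords):
        scale = [image_width, image_height, image_width, image_height]
        return [int(c * s) for c, s in zip(coords, scale)]
    # Otherwise the values are taken to be pixel coordinates already.
    return [int(c) for c in coords]

print(parse_box("The bounding box is [128, 352, 307, 417]", 800, 600))
print(parse_box("[0.25, 0.5, 0.75, 1.0]", 800, 600))
```

Note that the normalized check is a heuristic: a pixel box whose values all happen to be 0 or 1 would be misread as normalized.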

In [None]:
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration
from PIL import Image, ImageDraw
import requests
from google.colab import userdata

# Read the Hugging Face token from Colab secrets (requires an HF_TOKEN secret to exist).
# The value is kept in a variable; it is not passed to from_pretrained below.
hf_token = userdata.get('HF_TOKEN')

# Load the model and processor from Hugging Face.
# 4-bit quantization is used so the model fits in Colab's free GPU memory.
model_id = "llava-hf/llava-1.5-7b-hf"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    low_cpu_mem_usage=True,
)

processor = AutoProcessor.from_pretrained(model_id)

print("Setup Complete! Model is ready.")

**Step 2: Implementing the DiMo-GUI Framework**

The core logic of the paper is implemented in three main functions, each corresponding to a stage in the DiMo-GUI process:

`divide_modalities`: This function begins the process by prompting the VLM twice, once for text-only elements and once for icon-only elements, to get two independent initial guesses.

`dynamic_grounding`: This is the iterative refinement stage. It takes an initial guess, crops the image to that region, and asks the model to look again, progressively zooming in to find a more precise location.

`select_answer`: In the final stage, the two refined candidates (one text, one icon) are presented back to the model, which makes a final decision on which one best matches the user's command.
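
The zooming step relies on one piece of bookkeeping: a box predicted inside a crop must be shifted by the crop's origin to get back to full-image coordinates. The model-free sketch below shows that loop in isolation. `refine_box` and `fake_locate` are hypothetical names used only here, and `fake_locate` merely stands in for a model call by always choosing the central half of the crop.

```python
def refine_box(initial_bbox, locate, max_iterations=3, min_size=10):
    """Iteratively zoom in; `locate(width, height)` returns a box relative to the current crop."""
    current = list(initial_bbox)
    for _ in range(max_iterations):
        width = current[2] - current[0]
        height = current[3] - current[1]
        if width < min_size or height < min_size:
            break  # crop is too small to refine further
        relative = locate(width, height)
        if relative is None:
            break  # no new prediction, keep the last box
        origin_x, origin_y = current[0], current[1]
        # Shift the crop-relative box back into full-image coordinates.
        current = [origin_x + relative[0], origin_y + relative[1],
                   origin_x + relative[2], origin_y + relative[3]]
    return current

def fake_locate(width, height):
    """Hypothetical stand-in for the model: pick the central half of the crop."""
    return [width // 4, height // 4, 3 * width // 4, 3 * height // 4]

print(refine_box([0, 0, 800, 600], fake_locate))
```

Each pass halves the box in both dimensions here, so three iterations shrink an 800x600 region to roughly 100x75 pixels around the image center.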

In [None]:
import re

def extract_bbox(model_output_text, image_width, image_height):
    """Parse '[x1, y1, x2, y2]' from the model output and return integer pixel coordinates, or None."""
    # The decoded output echoes the prompt, so only search the text after the last "ASSISTANT:"
    response_text = model_output_text.split("ASSISTANT:")[-1]

    # Match four comma-separated numbers (integers or decimals) inside square brackets
    match = re.search(r'\[(\s*[\d\.]+\s*,\s*[\d\.]+\s*,\s*[\d\.]+\s*,\s*[\d\.]+\s*)\]', response_text)
    if not match:
        return None  # No bounding box found

    try:
        # Convert the matched strings to floats first
        coords = [float(n.strip()) for n in match.group(1).split(',')]
    except ValueError:
        return None  # A token such as "1.2.3" is not a valid number

    # Coordinates that all lie in [0, 1] are treated as normalized and scaled to pixels
    if all(0 <= c <= 1 for c in coords):
        x1 = int(coords[0] * image_width)
        y1 = int(coords[1] * image_height)
        x2 = int(coords[2] * image_width)
        y2 = int(coords[3] * image_height)
        return [x1, y1, x2, y2]

    # Otherwise they are already pixel values, so just convert them to integers
    return [int(c) for c in coords]

In [None]:
def divide_modalities(image, command, model, processor):
    width, height = image.size
    # Prompt restricted to text elements
    text_prompt = f"USER: <image>\nFind the TEXT element for the command: '{command}'. Respond with only the bounding box in [x1, y1, x2, y2] format.\nASSISTANT:"
    inputs = processor(text=text_prompt, images=image, return_tensors="pt").to("cuda", torch.float16)
    generate_ids = model.generate(**inputs, max_new_tokens=50)
    text_output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    initial_text_bbox = extract_bbox(text_output, width, height)
    print(f"Initial Text Guess Output: {text_output}")

    # Prompt restricted to icon elements
    icon_prompt = f"USER: <image>\nFind the ICON element for the command: '{command}'. Respond with only the bounding box in [x1, y1, x2, y2] format.\nASSISTANT:"
    inputs = processor(text=icon_prompt, images=image, return_tensors="pt").to("cuda", torch.float16)
    generate_ids = model.generate(**inputs, max_new_tokens=50)
    icon_output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    initial_icon_bbox = extract_bbox(icon_output, width, height)
    print(f"Initial Icon Guess Output: {icon_output}")

    return initial_text_bbox, initial_icon_bbox
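
The text and icon prompts differ only in one word, so they could be produced by a single helper. The sketch below is one possible way to do that; `build_grounding_prompt` is a hypothetical name and is not used elsewhere in this notebook.

```python
def build_grounding_prompt(modality, command):
    """Hypothetical helper: build the grounding prompt for the 'text' or 'icon' modality."""
    if modality not in ("text", "icon"):
        raise ValueError(f"Unknown modality: {modality!r}")
    return (
        f"USER: <image>\nFind the {modality.upper()} element for the command: '{command}'. "
        "Respond with only the bounding box in [x1, y1, x2, y2] format.\nASSISTANT:"
    )

print(build_grounding_prompt("icon", "Start Navigation"))
```

Keeping the template in one place would make it harder for the initial-guess prompt and the refinement prompt to drift apart.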

In [None]:
def dynamic_grounding(image, command, initial_bbox, modality, model, processor, max_iterations=3):
    if not initial_bbox:
        print(f"Skipping dynamic grounding for {modality} due to no initial bbox.")
        return None
    current_bbox = initial_bbox

    # Select the prompt for the requested modality
    if modality == "text":
        prompt_template = "USER: <image>\nFind the TEXT element for the command: '{command}'. Respond with only the bounding box in [x1, y1, x2, y2] format.\nASSISTANT:"
    else:
        prompt_template = "USER: <image>\nFind the ICON element for the command: '{command}'. Respond with only the bounding box in [x1, y1, x2, y2] format.\nASSISTANT:"

    print(f"\n--- Refining {modality.upper()} Bounding Box ---")
    for i in range(max_iterations):
        print(f"Iteration {i+1} with bbox: {current_bbox}")
        box_width = current_bbox[2] - current_bbox[0]
        box_height = current_bbox[3] - current_bbox[1]
        if box_width < 10 or box_height < 10:
            print(f"  > Bbox size ({box_width}x{box_height}) is too small. Stopping refinement.")
            break

        cropped_image = image.crop(current_bbox)
        prompt = prompt_template.format(command=command)
        inputs = processor(text=prompt, images=cropped_image, return_tensors="pt").to("cuda", torch.float16)
        generate_ids = model.generate(**inputs, max_new_tokens=50)
        output_text = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

        width, height = cropped_image.size
        relative_bbox = extract_bbox(output_text, width, height)
        print(f"  > Model output: {output_text.strip()}")

        if relative_bbox:
            origin_x, origin_y = current_bbox[0], current_bbox[1]
            current_bbox = [origin_x + relative_bbox[0], origin_y + relative_bbox[1], origin_x + relative_bbox[2], origin_y + relative_bbox[3]]
        else:
            print("  > Could not find a new bbox. Stopping refinement.")
            break

    print(f"Final Refined {modality.upper()} Bbox: {current_bbox}")
    return current_bbox
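
A model's predicted box is not guaranteed to lie inside the image or to have its corners in the right order. One optional safeguard, sketched below under the assumption that clipping is the desired behavior, is to clamp a box to the image bounds before cropping or drawing. `clamp_bbox` is a hypothetical helper that the functions above do not call.

```python
def clamp_bbox(bbox, image_width, image_height):
    """Clip [x1, y1, x2, y2] to the image bounds; return None if nothing remains."""
    x1 = max(0, min(bbox[0], image_width))
    y1 = max(0, min(bbox[1], image_height))
    x2 = max(0, min(bbox[2], image_width))
    y2 = max(0, min(bbox[3], image_height))
    if x2 <= x1 or y2 <= y1:
        return None  # inverted or empty box
    return [x1, y1, x2, y2]

print(clamp_bbox([-5, 10, 900, 700], 800, 600))
print(clamp_bbox([300, 200, 100, 400], 800, 600))
```

Returning `None` for an inverted box fits the existing convention in this notebook, where a missing box stops refinement.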

In [None]:
def select_answer(image, command, final_text_bbox, final_icon_bbox, model, processor):
    # Handle cases where one of the candidates might be missing
    if not final_text_bbox or not final_icon_bbox:
        # If one bbox is missing, default to the one that exists
        return final_text_bbox if final_text_bbox else final_icon_bbox

    print("\n--- Stage 3: Selecting Final Answer ---")

    # Create the prompt for the final decision, modeled on Figure 3
    selection_prompt = (
        f"USER: <image>\nYou are given a UI screenshot and a user command: '{command}'. "
        f"There are two candidate UI elements:\n"
        f"Candidate 1 (text-based): {final_text_bbox}\n"
        f"Candidate 2 (icon-based): {final_icon_bbox}\n"
        f"Based on the command and the description of candidates, "
        f"choose which candidate (1 for text, 2 for icon) better matches the command. "
        f"Just answer with '1' or '2'.\nASSISTANT:"
    )

    inputs = processor(text=selection_prompt, images=image, return_tensors="pt").to("cuda", torch.float16)
    generate_ids = model.generate(**inputs, max_new_tokens=10) # Only need a short answer
    output_text = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

    print(f"Selection Model Output: {output_text.strip()}")

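    # The decoded text echoes the prompt (which itself contains '1' and '2'),
    # so inspect only the model's reply after the final "ASSISTANT:"
    answer = output_text.split("ASSISTANT:")[-1]
    choice = re.search(r'[12]', answer)
    if choice and choice.group() == '1':
        print("Model chose Candidate 1 (Text-based).")
        return final_text_bbox
    elif choice and choice.group() == '2':
        print("Model chose Candidate 2 (Icon-based).")
        return final_icon_bbox
    else:
        # Default to the text-based box if the model gives an unclear answer
        print("Could not determine choice, defaulting to text-based candidate.")
        return final_text_bbox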

**Step 3: The Comparative Experiment**

This final section brings everything together to demonstrate the framework's impact. A `run_vanilla_test` function establishes a baseline by sending a single, generic prompt to the model. It is compared against a `run_with_dimo_gui` function, which executes the full three-stage process.

By running both tests on a small, curated set of challenging UI screenshots, we can visually compare the model's raw output (the "vanilla" result, drawn in purple) against its output when guided by the DiMo-GUI framework (drawn in red). This side-by-side comparison makes the effect of the paper's approach easy to inspect.
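
Visual inspection could be complemented with a numeric score if ground-truth boxes were available. This notebook has none, so the boxes in the sketch below are invented purely for illustration, and `iou` and `center_hit` are hypothetical helper names.

```python
def iou(box_a, box_b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    inter_w = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

def center_hit(predicted, target):
    """True if the center of the predicted box falls inside the target box."""
    center_x = (predicted[0] + predicted[2]) / 2
    center_y = (predicted[1] + predicted[3]) / 2
    return target[0] <= center_x <= target[2] and target[1] <= center_y <= target[3]

# Invented example boxes, not taken from any run of this notebook.
target_box = [100, 100, 200, 150]
predicted_box = [120, 110, 220, 160]

print(f"IoU: {iou(predicted_box, target_box):.3f}")
print(f"Center inside target: {center_hit(predicted_box, target_box)}")
```

A center-based check is more forgiving than IoU for a click-style task, since a loose box can still land a click on the right element.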

In [None]:
def run_vanilla_test(image, command, model, processor):
    print("\n--- Running VANILLA Test (Single Generic Prompt) ---")

    image_true_vanilla = image.copy()
    draw = ImageDraw.Draw(image_true_vanilla)
    width, height = image.size

    # Single generic prompt with no modality hint
    generic_prompt = f"USER: <image>\nYour task is to find a UI element. What is the bounding box of the element for the command: '{command}'? Respond with only the bounding box in [x1, y1, x2, y2] format.\nASSISTANT:"

    inputs = processor(text=generic_prompt, images=image, return_tensors="pt").to("cuda", torch.float16)
    generate_ids = model.generate(**inputs, max_new_tokens=50)
    output_text = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

    bbox = extract_bbox(output_text, width, height)

    if bbox:
        print(f"Vanilla Guess: {bbox}")
        draw.rectangle(bbox, outline="purple", width=5)
    else:
        print(f"Vanilla model could not find a bounding box. Output: {output_text}")

    return image_true_vanilla

In [None]:
def run_with_dimo_gui(image, command, model, processor):
    print("\n--- Running WITH DiMo-GUI (Full Workflow) ---")

    # Create a copy of the image to draw on
    image_dimo = image.copy()
    draw = ImageDraw.Draw(image_dimo)

    # Stage 1: Divide Modalities
    initial_text_bbox, initial_icon_bbox = divide_modalities(image, command, model, processor)

    # Stage 2: Dynamic Grounding
    final_text_bbox = dynamic_grounding(image, command, initial_text_bbox, "text", model, processor)
    final_icon_bbox = dynamic_grounding(image, command, initial_icon_bbox, "icon", model, processor)

    # Stage 3: Select Answer
    winning_bbox = select_answer(image, command, final_text_bbox, final_icon_bbox, model, processor)

    # Draw the final winning box if it was found
    if winning_bbox:
        print(f"\nDiMo-GUI Winning Bbox: {winning_bbox}")
        draw.rectangle(winning_bbox, outline="red", width=5)
    else:
        print("\nDiMo-GUI could not determine the final bounding box.")

    return image_dimo

In [None]:
# Create our mini-dataset of challenging test cases
candidate_cases = [
    {
        "name": "PowerPoint - Simple Case",
        "command": "Slide show from beginning",
        "image_url": "https://kajabi-storefronts-production.kajabi-cdn.com/kajabi-storefronts-production/file-uploads/blogs/2147484362/images/78372fc-17c2-bb2-33f0-32d5f745d72_Step_2.PNG",
        "image_path": "pres.png",
    },
    {
        "name": "Google Maps - Ambiguous Icon/Text",
        "command": "Start Navigation",
        "image_url": "https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjneJLERTIFbAKsOCK8TSPFGFC3a4eIa3A1fZhkWX1gVXrOU8UcSaq4n95F8iZE8AZELfQK5Y-x-OYkshEKTDJBNc_uEDkxO47js0bykp25iRGPHqs2hcjZ6FGMxpS0W9Qafyd85w5hfGY/s1742/Google-Maps-new-route-UI.jpg",
        "image_path": "maps.png",
    },
]

# Download each screenshot and keep only the cases whose image was saved successfully
test_cases = []
for case in candidate_cases:
    try:
        response = requests.get(case["image_url"], stream=True, timeout=30)
        response.raise_for_status()
        Image.open(response.raw).convert("RGB").save(case["image_path"])
        test_cases.append(case)
    except Exception as e:
        print(f"Could not download the image for '{case['name']}': {e}")


# Loop through each test case and run the vanilla vs. DiMo-GUI comparison
for case in test_cases:
    print(f"\n{'='*20}\nRunning Test Case: {case['name']}\nCommand: '{case['command']}'\n{'='*20}")

    try:
        image = Image.open(case["image_path"]).convert("RGB")

        # Run the 'Vanilla' test
        vanilla_result_image = run_vanilla_test(image, case['command'], model, processor)
        print("\nResult VANILLA (Purple=Single Generic Guess):")
        display(vanilla_result_image)

        # Run the 'With DiMo-GUI' test
        dimo_gui_result_image = run_with_dimo_gui(image, case['command'], model, processor)
        print("\nResult WITH DiMo-GUI (Red=Final Answer):")
        display(dimo_gui_result_image)

    except FileNotFoundError:
        print(f"ERROR: Could not find the image file '{case['image_path']}'. Please make sure it is uploaded.")
    except Exception as e:
        print(f"An unexpected error occurred during the test case: {e}")

**Step 4: Analysis of Results**

The experiment shows that the vanilla model often fails to locate the correct UI element, and its guesses can appear essentially random.

Although the full DiMo-GUI framework follows its logic correctly (zooming, then selecting), its final answer is also incorrect. This is not a bug in the implementation but a demonstration of the model capability gap discussed in the paper. The general-purpose LLaVA model lacks the specialized training needed to understand GUIs accurately, so it makes poor initial guesses that the framework cannot recover from. This highlights the importance of pairing the DiMo-GUI framework with a specialized, GUI-aware model, as the authors did.