Skip to content

Add Visual Grid Screenshot and Coordinate-Based Clicking Features #5

@brendanjerwin

Description

@brendanjerwin

Feature Request: Visual Grid Overlay and Coordinate-Based Interaction

Problem Statement

Agents struggle to map visual elements to their HTML/CSS/XPath selectors, especially when dealing with complex or dynamically generated interfaces. This creates friction when trying to automate interactions with web elements that are visually obvious but structurally complex to target.

Proposed Solution

Implement two complementary features that enable visual-to-coordinate mapping for more intuitive browser automation:

1. Grid Screenshot Tool: mcp__selenium__take_grid_screenshot

Functionality

  • Inject JavaScript to overlay a coordinate grid on the current viewport
  • Display grid lines at regular intervals (e.g., every 50px)
  • Show coordinate labels at grid intersections
  • Optionally highlight interactive elements with colored borders
  • Number clickable elements for easy reference
  • Return screenshot with grid overlay visible

JavaScript Injection Example

// Create grid overlay
const gridOverlay = document.createElement('div');
gridOverlay.style.position = 'fixed';
gridOverlay.style.top = '0';
gridOverlay.style.left = '0';
gridOverlay.style.width = '100%';
gridOverlay.style.height = '100%';
gridOverlay.style.pointerEvents = 'none';
gridOverlay.style.zIndex = '999999';

// Add vertical and horizontal grid lines
for (let x = 0; x <= window.innerWidth; x += 50) {
    const vLine = document.createElement('div');
    vLine.style.position = 'absolute';
    vLine.style.left = x + 'px';
    vLine.style.top = '0';
    vLine.style.width = '1px';
    vLine.style.height = '100%';
    vLine.style.background = 'rgba(0, 0, 255, 0.2)';
    gridOverlay.appendChild(vLine);
    
    // Add coordinate label
    if (x % 100 === 0) {
        const label = document.createElement('div');
        label.style.position = 'absolute';
        label.style.left = x + 'px';
        label.style.top = '5px';
        label.style.color = 'blue';
        label.style.fontSize = '10px';
        label.textContent = x;
        gridOverlay.appendChild(label);
    }
}

// Similar for horizontal lines
document.body.appendChild(gridOverlay);

// Highlight clickable elements
const clickables = document.querySelectorAll('a, button, input, [onclick], [role="button"]');
clickables.forEach((el, index) => {
    el.style.outline = '2px solid red';
    // Optionally add number labels
});

Tool Parameters

{
    "grid_spacing": 50,  # Pixels between grid lines
    "show_coordinates": true,  # Show coordinate labels
    "highlight_clickables": true,  # Highlight interactive elements
    "number_elements": false,  # Add numbers to clickable elements
    "outputPath": "/tmp/grid_screenshot.png"  # Optional output path
}

2. Coordinate Click Tool: mcp__selenium__click_at_coordinates

Functionality

  • Accept x,y viewport coordinates
  • Use Selenium's ActionChains to click at exact coordinates
  • Support both absolute viewport coordinates and element-relative offsets
  • Validate coordinates are within viewport bounds
  • Optionally scroll if coordinates are outside current view

Implementation Using ActionChains

from selenium.webdriver.common.action_chains import ActionChains

def click_at_coordinates(driver, x, y, relative_to="viewport"):
    if relative_to == "viewport":
        # Click at absolute viewport coordinates
        actions = ActionChains(driver)
        # Reset to top-left corner
        body = driver.find_element_by_tag_name('body')
        actions.move_to_element_with_offset(body, -body.size['width']/2, -body.size['height']/2)
        # Move to target coordinates and click
        actions.move_by_offset(x, y).click().perform()
    elif relative_to == "center":
        # Click relative to viewport center
        actions = ActionChains(driver)
        actions.move_by_offset(x, y).click().perform()

Tool Parameters

{
    "x": 250,  # X coordinate
    "y": 150,  # Y coordinate
    "relative_to": "viewport",  # "viewport" or "center"
    "scroll_if_needed": true  # Auto-scroll if outside viewport
}

3. Enhanced Agent Workflow

Visual Element Identification Process

  1. Agent calls take_grid_screenshot to get visual with coordinates
  2. Analyzes image to identify target element visually
  3. Notes grid coordinates (e.g., "blue button at approximately 250,150")
  4. Calls click_at_coordinates(x=250, y=150) to interact

Benefits

  • Intuitive: Agents can describe elements by visual position
  • Fallback: When selectors fail, coordinates provide alternative
  • Debugging: Grid screenshots help diagnose clicking issues
  • Universal: Works with canvas, SVG, and complex dynamic content

4. Additional Enhancements

Element Mapper Tool (Optional)

Create a tool that combines both approaches:

def map_visual_elements(driver):
    # Inject JS to find all clickable elements
    # Get their boundaries and positions
    # Create a mapping table
    # Return: {
    #     "elements": [
    #         {"type": "button", "text": "Submit", "coords": [250, 150], "selector": "#submit-btn"},
    #         {"type": "link", "text": "Home", "coords": [50, 30], "selector": "a.home-link"}
    #     ]
    # }

Implementation Priority

  1. Phase 1: Implement click_at_coordinates using existing ActionChains
  2. Phase 2: Add take_grid_screenshot with basic grid overlay
  3. Phase 3: Enhance grid with element highlighting and numbering
  4. Phase 4: Create element mapping utilities

CRITICAL Considerations

  • Coordinates are relative to viewport, not page
  • Account for scroll position when calculating clicks
  • Handle high-DPI displays and zoom levels
  • Validate coordinates are within viewport bounds
  • Consider MoveTargetOutOfBoundsException handling
  • Maintain a consistent coordinate space (units and origin) across methods regardless of the grid spacing. i.e. even if the grid leaves 20 pts between lines, the addressable coordinate system is still in pts.

This feature set would significantly improve the agent's ability to interact with complex web interfaces by bridging the gap between visual identification and programmatic interaction.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions