forked from angiejones/mcp-selenium
Feature Request: Visual Grid Overlay and Coordinate-Based Interaction
Problem Statement
Agents struggle to map visual elements to their HTML/CSS/XPath selectors, especially when dealing with complex or dynamically generated interfaces. This creates friction when trying to automate interactions with web elements that are visually obvious but structurally complex to target.
Proposed Solution
Implement two complementary features that enable visual-to-coordinate mapping for more intuitive browser automation:
1. Grid Screenshot Tool: mcp__selenium__take_grid_screenshot
Functionality
- Inject JavaScript to overlay a coordinate grid on the current viewport
- Display grid lines at regular intervals (e.g., every 50px)
- Show coordinate labels at grid intersections
- Optionally highlight interactive elements with colored borders
- Number clickable elements for easy reference
- Return screenshot with grid overlay visible
JavaScript Injection Example
    // Create a full-viewport overlay that does not intercept clicks
    const gridOverlay = document.createElement('div');
    gridOverlay.style.position = 'fixed';
    gridOverlay.style.top = '0';
    gridOverlay.style.left = '0';
    gridOverlay.style.width = '100%';
    gridOverlay.style.height = '100%';
    gridOverlay.style.pointerEvents = 'none';
    gridOverlay.style.zIndex = '999999';

    // Add vertical grid lines
    for (let x = 0; x <= window.innerWidth; x += 50) {
      const vLine = document.createElement('div');
      vLine.style.position = 'absolute';
      vLine.style.left = x + 'px';
      vLine.style.top = '0';
      vLine.style.width = '1px';
      vLine.style.height = '100%';
      vLine.style.background = 'rgba(0, 0, 255, 0.2)';
      gridOverlay.appendChild(vLine);

      // Add a coordinate label every 100px
      if (x % 100 === 0) {
        const label = document.createElement('div');
        label.style.position = 'absolute';
        label.style.left = x + 'px';
        label.style.top = '5px';
        label.style.color = 'blue';
        label.style.fontSize = '10px';
        label.textContent = String(x);
        gridOverlay.appendChild(label);
      }
    }

    // Add horizontal grid lines the same way (y labels would follow the x-label pattern)
    for (let y = 0; y <= window.innerHeight; y += 50) {
      const hLine = document.createElement('div');
      hLine.style.position = 'absolute';
      hLine.style.left = '0';
      hLine.style.top = y + 'px';
      hLine.style.width = '100%';
      hLine.style.height = '1px';
      hLine.style.background = 'rgba(0, 0, 255, 0.2)';
      gridOverlay.appendChild(hLine);
    }

    document.body.appendChild(gridOverlay);

    // Highlight clickable elements
    const clickables = document.querySelectorAll('a, button, input, [onclick], [role="button"]');
    clickables.forEach((el, index) => {
      el.style.outline = '2px solid red';
      // Optionally add number labels (e.g. index + 1)
    });
Tool Parameters
    {
      "grid_spacing": 50,                       # Pixels between grid lines
      "show_coordinates": true,                 # Show coordinate labels
      "highlight_clickables": true,             # Highlight interactive elements
      "number_elements": false,                 # Add numbers to clickable elements
      "outputPath": "/tmp/grid_screenshot.png"  # Optional output path
    }
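To illustrate how these parameters could feed the injected script, here is a minimal Python sketch. The names `DEFAULT_GRID_PARAMS`, `GRID_JS_TEMPLATE`, and `build_grid_script` are hypothetical and not part of mcp-selenium, and the embedded JavaScript is abbreviated to vertical lines only. The main point is validating `grid_spacing` before injection, since a zero or negative value would make the drawing loop run forever.

```python
import json

# Defaults mirror the parameter listing above (hypothetical names).
# outputPath belongs to the screenshot step, not to the injected script.
DEFAULT_GRID_PARAMS = {
    "grid_spacing": 50,
    "show_coordinates": True,
    "highlight_clickables": True,
    "number_elements": False,
}

# Abbreviated overlay script (vertical lines only). __PARAMS__ is a
# placeholder that build_grid_script replaces with a JSON object literal.
GRID_JS_TEMPLATE = """
const params = __PARAMS__;
const overlay = document.createElement('div');
overlay.style.cssText =
  'position:fixed;top:0;left:0;width:100%;height:100%;' +
  'pointer-events:none;z-index:999999';
for (let x = 0; x <= window.innerWidth; x += params.grid_spacing) {
  const line = document.createElement('div');
  line.style.cssText =
    'position:absolute;top:0;width:1px;height:100%;' +
    'background:rgba(0,0,255,0.2);left:' + x + 'px';
  overlay.appendChild(line);
}
document.body.appendChild(overlay);
"""


def build_grid_script(params=None):
    """Return overlay JavaScript with validated parameters embedded."""
    merged = {**DEFAULT_GRID_PARAMS, **(params or {})}
    merged.pop("outputPath", None)  # not needed inside the page
    unknown = set(merged) - set(DEFAULT_GRID_PARAMS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    spacing = merged["grid_spacing"]
    # A zero or negative spacing would make the JS loop run forever.
    if isinstance(spacing, bool) or not isinstance(spacing, int) or spacing <= 0:
        raise ValueError("grid_spacing must be a positive integer")
    return GRID_JS_TEMPLATE.replace("__PARAMS__", json.dumps(merged))
```

The returned string could then be passed to the driver's script-execution call before the screenshot is taken.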
2. Coordinate Click Tool: mcp__selenium__click_at_coordinates
Functionality
- Accept x,y viewport coordinates
- Use Selenium's ActionChains to click at exact coordinates
- Support both absolute viewport coordinates and element-relative offsets
- Validate coordinates are within viewport bounds
- Optionally scroll if coordinates are outside current view
Implementation Using ActionChains
    from selenium.webdriver.common.action_chains import ActionChains

    def click_at_coordinates(driver, x, y, relative_to="viewport"):
        if relative_to == "center":
            # Treat (x, y) as an offset from the viewport center
            width, height = driver.execute_script(
                "return [window.innerWidth, window.innerHeight];")
            x, y = x + width // 2, y + height // 2
        elif relative_to != "viewport":
            raise ValueError(f"unsupported relative_to: {relative_to!r}")

        # move_by_offset() is relative to the pointer's last position, so use the
        # underlying W3C pointer move, whose origin is the viewport's top-left corner
        actions = ActionChains(driver)
        pointer = actions.w3c_actions.pointer_action
        pointer.move_to_location(int(x), int(y))
        pointer.click()
        actions.perform()
Tool Parameters
    {
      "x": 250,                   # X coordinate
      "y": 150,                   # Y coordinate
      "relative_to": "viewport",  # "viewport" or "center"
      "scroll_if_needed": true    # Auto-scroll if outside viewport
    }
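The bounds check and the `scroll_if_needed` behaviour can be kept separate from the browser call. The sketch below is one possible reading of those parameters, not a settled design: `plan_click` is a hypothetical helper that takes the viewport size as plain numbers and returns how far to scroll and where to click afterwards. It assumes the page can actually scroll by the requested amount; a real implementation would read the scroll position back and adjust.

```python
def _plan_axis(value, size):
    """Return (scroll_delta, click_position) for one axis, in CSS pixels."""
    if 0 <= value < size:
        return 0, value  # already visible, no scrolling needed
    delta = value - size // 2  # scroll so the target lands mid-viewport
    return delta, value - delta


def plan_click(x, y, viewport_width, viewport_height,
               relative_to="viewport", scroll_if_needed=True):
    """Return (scroll_dx, scroll_dy, click_x, click_y) for a requested click."""
    if relative_to == "center":
        # Interpret (x, y) as an offset from the viewport center
        x += viewport_width // 2
        y += viewport_height // 2
    elif relative_to != "viewport":
        raise ValueError(f"unsupported relative_to: {relative_to!r}")

    scroll_dx, click_x = _plan_axis(x, viewport_width)
    scroll_dy, click_y = _plan_axis(y, viewport_height)
    if (scroll_dx or scroll_dy) and not scroll_if_needed:
        raise ValueError(f"({x}, {y}) is outside the viewport")
    return scroll_dx, scroll_dy, click_x, click_y
```

For an 800x600 viewport, `plan_click(100, 900, 800, 600)` asks for a 600px vertical scroll and then a click at (100, 300), while the same call with `scroll_if_needed=False` raises `ValueError`.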
3. Enhanced Agent Workflow
Visual Element Identification Process
- Agent calls take_grid_screenshot to get a visual with coordinates
- Analyzes the image to identify the target element visually
- Notes grid coordinates (e.g., "blue button at approximately 250,150")
- Calls click_at_coordinates(x=250, y=150) to interact
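The process above amounts to a short screenshot, locate, click cycle. The sketch below shows that cycle with the three steps passed in as callables; `click_by_sight` and `locate_target` are hypothetical names, and how the agent actually reads coordinates off the image is left entirely to `locate_target`.

```python
def click_by_sight(take_grid_screenshot, locate_target,
                   click_at_coordinates, description):
    """Run one screenshot -> locate -> click cycle and return the clicked point."""
    # 1. Capture the viewport with the grid overlay visible
    screenshot_path = take_grid_screenshot()
    # 2. Let the agent (or any vision step) read coordinates off the grid
    x, y = locate_target(screenshot_path, description)
    # 3. Click at the coordinates it reported
    click_at_coordinates(x=x, y=y)
    return x, y
```

Keeping the steps injectable makes the cycle easy to exercise with stubs before any browser is involved.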
Benefits
- Intuitive: Agents can describe elements by visual position
- Fallback: When selectors fail, coordinates provide alternative
- Debugging: Grid screenshots help diagnose clicking issues
- Universal: Works with canvas, SVG, and complex dynamic content
4. Additional Enhancements
Element Mapper Tool (Optional)
Create a tool that combines both approaches:
    def map_visual_elements(driver):
        # Outline only:
        #   1. Inject JS to find all clickable elements
        #   2. Get their boundaries and positions
        #   3. Create a mapping table
        # Return shape:
        # {
        #   "elements": [
        #     {"type": "button", "text": "Submit", "coords": [250, 150], "selector": "#submit-btn"},
        #     {"type": "link", "text": "Home", "coords": [50, 30], "selector": "a.home-link"}
        #   ]
        # }
        raise NotImplementedError("outline only")
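One way the outline above might be filled in is sketched here. It assumes only that `driver` exposes `execute_script` (as a Selenium WebDriver does), reports each element's center point as its coordinates, and uses the tag name for `type`. The `selector` it produces is a deliberately naive placeholder (`#id` when an id exists, otherwise the bare tag name), so it will not be unique in general; `FIND_CLICKABLES_JS` is a hypothetical name.

```python
# JavaScript that returns tag, id, text, and bounding box for each clickable.
FIND_CLICKABLES_JS = """
return Array.from(
  document.querySelectorAll('a, button, input, [onclick], [role="button"]')
).map(el => {
  const r = el.getBoundingClientRect();
  return {
    tag: el.tagName.toLowerCase(),
    id: el.id,
    text: (el.innerText || el.value || '').trim(),
    left: r.left, top: r.top, width: r.width, height: r.height
  };
});
"""


def map_visual_elements(driver):
    """Return {"elements": [...]} with center coordinates for each clickable."""
    elements = []
    for box in driver.execute_script(FIND_CLICKABLES_JS):
        if box["width"] <= 0 or box["height"] <= 0:
            continue  # skip elements with no rendered box
        elements.append({
            "type": box["tag"],
            "text": box["text"],
            # Center of the bounding box, in viewport (CSS pixel) coordinates
            "coords": [round(box["left"] + box["width"] / 2),
                       round(box["top"] + box["height"] / 2)],
            # Naive placeholder selector; not guaranteed to be unique
            "selector": f"#{box['id']}" if box["id"] else box["tag"],
        })
    return {"elements": elements}
```

Because the coordinates come from `getBoundingClientRect()`, they are viewport-relative and can be passed straight to a viewport-based click.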
Implementation Priority
- Phase 1: Implement click_at_coordinates using existing ActionChains
- Phase 2: Add take_grid_screenshot with basic grid overlay
- Phase 3: Enhance grid with element highlighting and numbering
- Phase 4: Create element mapping utilities
CRITICAL Considerations
- Coordinates are relative to viewport, not page
- Account for scroll position when calculating clicks
- Handle high-DPI displays and zoom levels
- Validate coordinates are within viewport bounds
- Consider MoveTargetOutOfBoundsException handling
- Maintain a consistent coordinate space (units and origin) across methods, regardless of grid spacing. For example, even if the grid leaves 20 pts between lines, coordinates are still addressed in single pts, not in grid cells.
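Two of these considerations reduce to simple arithmetic, sketched below with hypothetical helper names. The first assumes the screenshot is captured at device-pixel resolution, so a position measured on the image must be divided by the page's `window.devicePixelRatio` to get CSS pixels. The second subtracts the scroll offset (`window.scrollX`, `window.scrollY`) to turn page coordinates into viewport coordinates.

```python
def screenshot_to_viewport(px, py, device_pixel_ratio):
    """Convert a position measured on the screenshot into CSS viewport pixels.

    Assumes the screenshot is rendered at device-pixel resolution.
    """
    if device_pixel_ratio <= 0:
        raise ValueError("device_pixel_ratio must be positive")
    return px / device_pixel_ratio, py / device_pixel_ratio


def page_to_viewport(page_x, page_y, scroll_x, scroll_y):
    """Convert page coordinates into viewport coordinates using the scroll offset."""
    return page_x - scroll_x, page_y - scroll_y
```

For example, a point at (500, 300) on a screenshot taken with a device pixel ratio of 2 corresponds to viewport coordinates (250.0, 150.0).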
This feature set would significantly improve the agent's ability to interact with complex web interfaces by bridging the gap between visual identification and programmatic interaction.