# Evaluating a multimodal web navigation agent

Web navigation poses unique challenges for AI systems. Effective agents must interpret page structures, understand visual elements, maintain context across steps, and accurately select actions. This cookbook focuses on action selection, where models determine whether to click elements, type inputs, or select from dropdown menus at each step.

We'll build a robust evaluation pipeline using the [Multimodal-Mind2Web dataset](https://osu-nlp-group.github.io/Mind2Web/), which pairs HTML structures with webpage screenshots to inform multimodal decision-making. Although this approach relies on both visual and structural cues, the same evaluation pipeline can be adapted to other methods. By the end, you'll have a framework for evaluating web agent performance and pinpointing areas for improvement.

## Getting started

To follow along, start by installing the required packages:
```bash
pip install lxml openai datasets pillow braintrust autoevals
```

Next, make sure you have a [Braintrust](https://www.braintrust.dev/signup) account, along with an [OpenAI API key](https://platform.openai.com/). To authenticate with Braintrust, export your `BRAINTRUST_API_KEY` as an environment variable:

```bash
export BRAINTRUST_API_KEY="YOUR_API_KEY_HERE"
```

<Callout type="info">
Exporting your API key is a best practice, but to make it easier to follow along with this cookbook, you can also hardcode it into the code below.
</Callout>

We'll import our modules and initialize the OpenAI client using the Braintrust proxy:



In [15]:
import os
import json
import base64
import re
import time
from typing import Dict, Any, List, Optional, Tuple

from lxml import etree
import openai
from datasets import load_dataset
from PIL import Image
from io import BytesIO

from braintrust import (
    Eval,
    Attachment,
    start_span,
    current_span,
    wrap_openai,
)
from autoevals import LLMClassifier

# Constants
MAX_SAMPLES = 50
HTML_MAX_ELEMENTS = 50
MAX_PREVIOUS_ACTIONS = 3

# Uncomment the following line to hardcode your API key
# os.environ["BRAINTRUST_API_KEY"] = "YOUR_API_KEY_HERE"


client = wrap_openai(
    openai.OpenAI(
        api_key=os.environ.get("BRAINTRUST_API_KEY"),
        base_url="https://api.braintrust.dev/v1/proxy",
    )
)

## Approaches to web navigation

There are multiple approaches to web navigation. HTML-only approaches parse document structures, excelling in semantic understanding but missing visual cues such as element prominence. Screenshot-only methods capture visual context effectively but may overlook interactive element properties.

Multimodal methods integrate HTML and visual data, providing comprehensive insights but necessitating more complex processing. Additionally, hybrid methods combine multimodal data with specialized modules to handle interactions like date pickers.

In this cookbook, we'll implement a multimodal method using HTML DOM and visual screenshots to inform decision making.

## Processing screenshots

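First, let's create a function that converts a webpage screenshot into an `Attachment` that we can pass to our model and [attach to our eval](/docs/guides/evals/write#attachments). It accepts a PIL image, a file path, raw bytes, or a dictionary containing base64-encoded data, and returns `None` for anything else.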

In [16]:
def process_screenshot(screenshot_input: Any) -> Optional[Attachment]:
    with start_span(name="process_screenshot") as span:
        try:
            # Handle PIL Image
            if isinstance(screenshot_input, Image.Image):
                img_byte_arr = BytesIO()
                screenshot_input.save(img_byte_arr, format="PNG")
                image_data = img_byte_arr.getvalue()

            # Handle file path
            elif isinstance(screenshot_input, str) and os.path.exists(screenshot_input):
                with open(screenshot_input, "rb") as f:
                    image_data = f.read()

            # Handle bytes
            elif isinstance(screenshot_input, bytes):
                image_data = screenshot_input

            # Handle dictionary with base64 data
            elif isinstance(screenshot_input, dict) and "data" in screenshot_input:
                data = screenshot_input["data"]
                if not isinstance(data, str):
                    return None

                # Process base64 data
                if data.startswith("data:image"):
                    base64_data = data.split(",", 1)[1]
                elif data.startswith("/9j/") or data.startswith("iVBOR"):
                    base64_data = data
                else:
                    return None

                image_data = base64.b64decode(base64_data)
            else:
                return None

            # Create attachment
            result = Attachment(
                data=image_data,
                filename="screenshot.png",
                content_type="image/png",
            )

            return result

        except Exception:
            return None
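
`process_screenshot` labels every attachment as `image/png`, even though it also accepts base64 data starting with `/9j/`, which is the JPEG signature. The sketch below shows one way to pick the content type from the leading bytes instead. `sniff_image_content_type` is a hypothetical helper for illustration, not part of Braintrust or the original pipeline.

```python
import base64

# Leading bytes ("magic numbers") that identify common image formats.
_MAGIC_PREFIXES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
}


def sniff_image_content_type(image_data: bytes) -> str:
    """Return a MIME type based on the file signature (hypothetical helper)."""
    for prefix, content_type in _MAGIC_PREFIXES.items():
        if image_data.startswith(prefix):
            return content_type
    # Unknown signature: fall back to a generic binary type.
    return "application/octet-stream"


# The base64 prefixes checked in process_screenshot decode to these signatures.
jpeg_head = base64.b64decode("/9j/")
png_head = base64.b64decode("iVBORw0KGgo=")
print(sniff_image_content_type(jpeg_head))
print(sniff_image_content_type(png_head))
```

If you adopt something like this, the result could be passed as `content_type` when building the `Attachment`, with a matching file extension in `filename`.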

Next, we create a function to extract the most important elements from HTML documents:

In [17]:
def get_enhanced_tree_summary(
    html_content: str, max_items: int = HTML_MAX_ELEMENTS
) -> str:
    with start_span(name="html_parsing") as span:
        if not html_content:
            return "No HTML content provided"

        try:
            # Parse HTML
            parser = etree.HTMLParser()
            dom_tree = etree.fromstring(html_content, parser)

            # XPath for interactive elements, sorted by relevance
            xpath_queries = [
                "//button | //input[@type='submit'] | //input[@type='button']",
                "//a[@href] | //*[@role='button'] | //*[@onclick]",
                "//input[not(@type='hidden')] | //select | //textarea",
                "//label | //form",
                "//h1 | //h2 | //h3 | //nav | //*[@role='navigation']",
            ]

            # Collect elements by priority until max_items is reached
            important_elements = []
            for query in xpath_queries:
                if len(important_elements) >= max_items:
                    break
                elements = dom_tree.xpath(query)
                remaining_slots = max_items - len(important_elements)
                important_elements.extend(elements[:remaining_slots])

            # Create a concise representation
            summary = []
            for elem in important_elements:
                tag = elem.tag

                # Get text content, limited to 30 chars
                text = elem.text.strip() if elem.text else ""
                if not text:
                    for child in elem.xpath(".//text()"):
                        if child.strip():
                            text += " " + child.strip()
                text = text.strip()[:30]

                # Get key attributes
                key_attrs = [
                    "id",
                    "type",
                    "placeholder",
                    "href",
                    "role",
                    "aria-label",
                    "value",
                    "name",
                ]
                attrs = []
                for k in key_attrs:
                    if k in elem.attrib:
                        attrs.append(f'{k}="{elem.attrib[k]}"')

                # Format element representation
                elem_repr = f"<{tag} {' '.join(attrs)}>{text}</{tag}>"
                summary.append(elem_repr)

            return "\n".join(summary)

        except Exception as e:
            return f"Error parsing HTML: {str(e)}"
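
One thing to watch in the priority-ordered collection above: an element can match more than one query (for example, a submit input matches both the first and the third XPath expression), so it may take up two of the `max_items` slots. The following is a minimal sketch of the same fill-by-priority idea with de-duplication. It uses plain strings as stand-ins for parsed elements, and `fill_by_priority` is a hypothetical name.

```python
from typing import Hashable, Iterable, List


def fill_by_priority(groups: Iterable[List[Hashable]], max_items: int) -> List[Hashable]:
    """Take items group by group, skipping repeats, until max_items is reached."""
    selected: List[Hashable] = []
    seen = set()
    for group in groups:
        for item in group:
            if len(selected) >= max_items:
                return selected
            if item not in seen:
                seen.add(item)
                selected.append(item)
    return selected


# Stand-ins for XPath results, highest priority first.
# "input#submit" appears in both the first and the third group.
buttons = ["button#search", "input#submit"]
links = ["a#home", "a#deals"]
inputs = ["input#submit", "input#city"]

print(fill_by_priority([buttons, links, inputs], max_items=5))
# ['button#search', 'input#submit', 'a#home', 'a#deals', 'input#city']
```

Without the `seen` check, the fifth slot would go to the repeated submit input and `input#city` would be dropped.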

## Action history and parsing operations

Web navigation relies on sequential context. Each previous action provides critical information about the agent's current state and influences future decisions. Without historical context, the agent may misinterpret its position in the navigation flow, potentially repeating actions or selecting inappropriate elements.
Maintaining a structured record of previous interactions gives the model that context and can improve prediction accuracy. Our implementation keeps only the most recent actions, capped at `MAX_PREVIOUS_ACTIONS`, and formats them as a numbered list for the prompt:

In [18]:
def format_previous_actions(
    actions: List[str], max_actions: int = MAX_PREVIOUS_ACTIONS
) -> str:
    if not actions:
        return "None"

    # Only take the most recent actions
    recent_actions = actions[-max_actions:]

    # Format with numbering
    formatted = "\n".join(
        [f"{i+1}. {action}" for i, action in enumerate(recent_actions)]
    )

    # Indicate if there were more actions before these
    if len(actions) > max_actions:
        formatted = (
            f"Showing {max_actions} most recent of {len(actions)} total actions\n"
            + formatted
        )

    return formatted
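
`format_previous_actions` renumbers the retained actions starting from 1. As a possible variation, the sketch below keeps each action's original step number, so the model can see how far into the task it is. The function name and the action strings are made up for illustration.

```python
from typing import List


def format_with_absolute_steps(actions: List[str], max_actions: int = 3) -> str:
    """Variant of format_previous_actions that keeps original step numbers."""
    if not actions:
        return "None"
    # Index of the first action that is kept.
    start = max(len(actions) - max_actions, 0)
    return "\n".join(
        f"{start + i + 1}. {action}" for i, action in enumerate(actions[start:])
    )


# Illustrative history; real entries come from the dataset's action_reprs field.
history = [
    "[searchbox] Search -> TYPE: flights",
    "[button] Search -> CLICK",
    "[combobox] Cabin -> SELECT: Economy",
    "[button] Continue -> CLICK",
]
print(format_with_absolute_steps(history, max_actions=2))
# 3. [combobox] Cabin -> SELECT: Economy
# 4. [button] Continue -> CLICK
```

Whether absolute or relative numbering works better is an empirical question, which makes it a good candidate for comparing two experiments.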

We also need a consistent way to parse ground-truth actions from the dataset:

In [19]:
def parse_operation_string(operation_str: str) -> Dict[str, str]:
    with start_span(name="parse_operation") as span:
        # Default values
        operation = {"op": "CLICK", "value": ""}

        if not operation_str:
            return operation

        try:
            # Try parsing as JSON first
            if operation_str.strip().startswith("{"):
                parsed = json.loads(operation_str)
                if isinstance(parsed, dict):
                    operation["op"] = parsed.get("op", "CLICK")
                    operation["value"] = parsed.get("value", "")
            else:
                # Fallback to regex parsing (re is already imported at the top)
                match_op = re.search(r"(CLICK|TYPE|SELECT)", operation_str)
                if match_op:
                    operation["op"] = match_op.group(1)
                    match_value = re.search(
                        r'value\s*[:=]\s*["\']?([^"\']+)["\']?', operation_str
                    )
                    if match_value:
                        operation["value"] = match_value.group(1)
        except Exception:
            pass

        return operation
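
To see how the two parsing paths behave, here is a tracing-free copy of the same logic that runs without Braintrust. `parse_operation` is a hypothetical name, and the input strings are illustrative rather than taken from the dataset.

```python
import json
import re
from typing import Dict


def parse_operation(operation_str: str) -> Dict[str, str]:
    """Tracing-free copy of the parsing logic above, for experimentation."""
    operation = {"op": "CLICK", "value": ""}
    text = (operation_str or "").strip()

    # JSON path: the string looks like a serialized object.
    if text.startswith("{"):
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            return operation
        if isinstance(parsed, dict):
            operation["op"] = parsed.get("op", "CLICK")
            operation["value"] = parsed.get("value", "")
        return operation

    # Regex path: look for an operation keyword and an optional value.
    match_op = re.search(r"(CLICK|TYPE|SELECT)", text)
    if match_op:
        operation["op"] = match_op.group(1)
        match_value = re.search(r'value\s*[:=]\s*["\']?([^"\']+)["\']?', text)
        if match_value:
            operation["value"] = match_value.group(1)
    return operation


print(parse_operation('{"op": "SELECT", "value": "Economy"}'))
# {'op': 'SELECT', 'value': 'Economy'}
print(parse_operation("TYPE value: New York"))
# {'op': 'TYPE', 'value': 'New York'}
print(parse_operation(""))
# {'op': 'CLICK', 'value': ''}
```

Note that empty or malformed input falls back to `CLICK`, so unparseable labels are silently counted as clicks. If that matters for your analysis, consider returning a distinct marker instead.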

## Creating our dataset

Now we load and process samples from the [Multimodal-Mind2Web dataset](https://huggingface.co/datasets/osunlp/Multimodal-Mind2Web):

In [20]:
def load_mind2web_samples(
    max_samples: int = MAX_SAMPLES, use_smaller_subset: bool = True
) -> List[Dict[str, Any]]:

    # Load the dataset with streaming to conserve memory
    split = "test_domain" if use_smaller_subset else "train"
    dataset = load_dataset(
        "osunlp/Multimodal-Mind2Web", split=split, streaming=True
    )

    processed_samples = []
    successful_samples = 0

    # Process samples
    for item in dataset:
        if successful_samples >= max_samples:
            break

        try:
            with start_span(name="process_sample") as sample_span:
                # Extract basic fields
                annotation_id = item.get(
                    "annotation_id", f"sample_{successful_samples}"
                )
                website = item.get("website", "unknown")
                confirmed_task = item.get("confirmed_task", "Navigate the website")
                cleaned_html = item.get("cleaned_html", "<html></html>")
                operation_str = item.get(
                    "operation", '{"op": "CLICK", "value": ""}'
                )

                # Process operation
                operation = parse_operation_string(operation_str)

                # Process screenshot
                screenshot_attachment = None
                screenshot_dict = item.get("screenshot")
                if screenshot_dict:
                    screenshot_attachment = process_screenshot(screenshot_dict)

                # Process HTML summary
                html_summary = get_enhanced_tree_summary(
                    cleaned_html, max_items=HTML_MAX_ELEMENTS
                )

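                # Process previous actions. `action_reprs` lists every step of the
                # task, so keep only the steps before the current one (given by
                # `target_action_index`) to avoid leaking the target action.
                action_reprs = item.get("action_reprs", [])
                try:
                    step_index = int(
                        item.get("target_action_index", len(action_reprs))
                    )
                except (TypeError, ValueError):
                    step_index = len(action_reprs)
                previous_actions_str = format_previous_actions(
                    action_reprs[:step_index], max_actions=MAX_PREVIOUS_ACTIONS
                )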

                # Map operation type to the correct option letter
                expected_option = "A"  # Default to CLICK
                if operation["op"] == "TYPE":
                    expected_option = "B"
                elif operation["op"] == "SELECT":
                    expected_option = "C"

                # Create a focused prompt
                formatted_prompt = f"""
Task: {confirmed_task}

Key webpage elements:
{html_summary}

Previous actions:
{previous_actions_str}

What should be the next action? Select from:
A. Click the appropriate element based on the task
B. Type text into an input field
C. Select an option from a dropdown
"""
                # Create target string
                target = f"Answer: {expected_option}\nAction: {operation['op']}"
                if operation["op"] != "CLICK" and operation["value"]:
                    target += f"\nValue: {operation['value']}"

                # Build complete sample
                sample = {
                    "annotation_id": annotation_id,
                    "website": website,
                    "confirmed_task": confirmed_task,
                    "html_summary": html_summary,
                    "operation": operation,
                    "previous_actions_str": previous_actions_str,
                    "formatted_prompt": formatted_prompt,
                    "target": target,
                    "expected_option": expected_option,
                    "expected_action": operation["op"],
                    "expected_value": operation["value"],
                    "screenshot_attachment": screenshot_attachment,
                }

                processed_samples.append(sample)
                successful_samples += 1

        except Exception:
            continue

    return processed_samples
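
The mapping from operation type to option letter and the construction of the `target` string can be isolated into a small helper, which makes the expected format easier to test. This is a sketch under the same conventions as the loader above. `OPTION_BY_OP` and `build_target` are hypothetical names.

```python
# Option letters as presented in the prompt: A=Click, B=Type, C=Select.
OPTION_BY_OP = {"CLICK": "A", "TYPE": "B", "SELECT": "C"}


def build_target(op: str, value: str = "") -> str:
    """Build the expected answer string in the same layout as `target` above."""
    # Unknown operations fall back to option A, matching the loader's default.
    option = OPTION_BY_OP.get(op, "A")
    target = f"Answer: {option}\nAction: {op}"
    # Only TYPE and SELECT carry a value.
    if op != "CLICK" and value:
        target += f"\nValue: {value}"
    return target


print(build_target("TYPE", "cheap flights"))
# Answer: B
# Action: TYPE
# Value: cheap flights
```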

We transform these samples to a format that facilitates easy evaluation:

In [21]:
def create_braintrust_dataset(samples: List[Dict[str, Any]]) -> List[Dict[str, Any]]:

    dataset_samples = []

    for sample in samples:
        if not isinstance(sample, dict):
            continue

        # Extract operation details
        operation = sample.get("operation", {})
        operation_type = (
            operation.get("op", "CLICK") if isinstance(operation, dict) else "CLICK"
        )
        operation_value = (
            operation.get("value", "") if isinstance(operation, dict) else ""
        )

        # Create dataset entry
        dataset_entry = {
            "input": {
                "prompt": sample.get("formatted_prompt", ""),
                "task": sample.get("confirmed_task", ""),
                "website": sample.get("website", ""),
                "previous_actions": sample.get("previous_actions_str", "None"),
            },
            "expected": {
                "option": sample.get("expected_option", ""),
                "action": operation_type,
                "value": operation_value,
                "target": sample.get("target", ""),
            },
            "metadata": {
                "annotation_id": sample.get("annotation_id", ""),
                "website": sample.get("website", ""),
                "operation_type": operation_type,
            },
        }

        # Add screenshot attachment if available
        if sample.get("screenshot_attachment"):
            dataset_entry["input"]["screenshot"] = sample["screenshot_attachment"]

        dataset_samples.append(dataset_entry)

    return dataset_samples
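
Before running an eval, it can help to check how the action types are distributed in the dataset: if one type dominates, a model that always predicts it will look deceptively strong. The following sketch counts `operation_type` across entries shaped like the ones built above. The toy entries are illustrative, not real dataset rows.

```python
from collections import Counter
from typing import Any, Dict, List


def operation_counts(entries: List[Dict[str, Any]]) -> Counter:
    """Count dataset entries per operation type stored in their metadata."""
    return Counter(
        entry.get("metadata", {}).get("operation_type", "UNKNOWN")
        for entry in entries
    )


# Toy entries with only the field this helper reads.
toy_entries = [
    {"metadata": {"operation_type": "CLICK"}},
    {"metadata": {"operation_type": "CLICK"}},
    {"metadata": {"operation_type": "TYPE"}},
]

counts = operation_counts(toy_entries)
for op, n in counts.most_common():
    print(f"{op}: {n} ({n / len(toy_entries):.0%})")
```

On real data, you would pass the list returned by `create_braintrust_dataset` instead of `toy_entries`.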

## Building the prediction function

We pass the task, page information (screenshot and HTML summary), and previous actions to `gpt-4o` so it can determine the next action:

In [22]:
def extract_model_response(response_text: str) -> Tuple[str, str, str]:

    with start_span(name="extract_response") as span:
        # Initialize default values
        option = ""
        action = ""
        value = ""

        try:
            # Extract option (A, B, or C)
            option_match = re.search(r"Answer:\s*([ABC])", response_text, re.IGNORECASE)
            if option_match:
                option = option_match.group(1).upper()

            # Extract action (CLICK, TYPE, or SELECT)
            action_match = re.search(
                r"Action:\s*(CLICK|TYPE|SELECT)", response_text, re.IGNORECASE
            )
            if action_match:
                action = action_match.group(1).upper()

            # Extract value (for TYPE or SELECT)
            value_match = re.search(r"Value:\s*(.+?)(?:\n|$)", response_text)
            if value_match:
                value = value_match.group(1).strip()

            return option, action, value

        except Exception:
            return option, action, value


def predict_with_gpt4o(input_data: Dict[str, Any]) -> str:
    with start_span(name="model_prediction") as predict_span:
        try:
            # Extract input components
            prompt = input_data.get("prompt", "")
            task = input_data.get("task", "")
            website = input_data.get("website", "")
            screenshot_attachment = input_data.get("screenshot")

            # Create system message
            system_message = """You are a web navigation assistant that helps users complete tasks online.
Analyze the webpage and determine the best action to take next based on the task.

Format your answer EXACTLY as follows:
Answer: [option letter - must be A, B, or C]
Action: [CLICK, TYPE, or SELECT]
Value: [Only provide value for TYPE/SELECT actions]

Example for clicking:
Answer: A
Action: CLICK

Example for typing:
Answer: B
Action: TYPE
Value: search query text

Example for selecting:
Answer: C
Action: SELECT
Value: dropdown option
"""

            # Create messages array
            messages = [{"role": "system", "content": system_message}]

            # Add screenshot if available
            if screenshot_attachment and hasattr(screenshot_attachment, "data"):
                try:
                    image_data = screenshot_attachment.data
                    base64_image = base64.b64encode(image_data).decode("utf-8")

                    messages.append(
                        {
                            "role": "user",
                            "content": [
                                {
                                    "type": "image_url",
                                    "image_url": {
                                        "url": f"data:image/png;base64,{base64_image}"
                                    },
                                },
                                {"type": "text", "text": prompt},
                            ],
                        }
                    )
                except Exception:
                    messages.append({"role": "user", "content": prompt})
            else:
                messages.append({"role": "user", "content": prompt})

            # Use the wrapped client for proper tracing
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                max_tokens=150,
                temperature=0.2,
            )

            result = response.choices[0].message.content

            return result

        except Exception as e:
            return f"Error: {str(e)}"

## Defining our scorers

For web navigation tasks, simply measuring overall success isn't enough. We need metrics that can pinpoint specific strengths and weaknesses in our agent. In this example, we create two LLM-powered scorers using [LLMClassifier from autoevals](https://github.com/braintrustdata/autoevals?tab=readme-ov-file#custom-evaluation-prompts). The first scorer measures whether the model correctly identified the action type. The second evaluates whether the details of the action were correct. This separation helps identify whether errors come from misunderstanding the task context or from incorrectly formulating the action details.

In [23]:

option_selection_scorer = LLMClassifier(
    name="option_selection",
    prompt_template="""
You are evaluating if a web navigation assistant selected the correct option.

Task context: {{input.task}}
Expected option: {{expected.option}} (A=Click, B=Type, C=Select)
Model's raw response: 
'''
{{output}}
'''

First, identify which option (A, B, or C) the model actually selected.
Then determine if this matches the expected option {{expected.option}}.

Did the model correctly select option {{expected.option}}? Answer only Y or N.
""",
    choice_scores={"Y": 1, "N": 0},
)

action_correctness_scorer = LLMClassifier(
    name="action_correctness",
    prompt_template="""
You are evaluating if a web navigation assistant identified the correct action type and value.

Expected action: {{expected.action}} (CLICK, TYPE, or SELECT)
Expected value: {{expected.value}} (should be provided for TYPE and SELECT actions only)
Model's raw response:
'''
{{output}}
'''

First, identify what action (CLICK, TYPE, or SELECT) the model actually specified.
For TYPE and SELECT actions, also identify what value the model specified.
Then determine if these match the expected action and value.

Did the model correctly identify both the action type and value (if applicable)? 
Answer only Y or N.
""",
    choice_scores={"Y": 1, "N": 0},
)
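
Because the model is asked to answer in a fixed format, the option letter can also be checked deterministically, which is cheaper than an LLM call and useful as a cross-check on the LLM-based scorer. The sketch below assumes that a plain function taking `input`, `output`, and `expected` and returning a number between 0 and 1 can be used as a scorer. `parse_option` and `exact_option_match` are hypothetical names.

```python
import re
from typing import Any, Dict, Optional


def parse_option(response_text: str) -> Optional[str]:
    """Extract the option letter (A, B, or C) from a formatted response."""
    match = re.search(r"Answer:\s*([ABC])\b", response_text or "", re.IGNORECASE)
    return match.group(1).upper() if match else None


def exact_option_match(input: Any, output: str, expected: Dict[str, str]) -> float:
    """Return 1.0 when the parsed option letter equals the expected one, else 0.0."""
    predicted = parse_option(output)
    if predicted is None:
        # Unparseable responses (including "Error: ..." strings) score zero.
        return 0.0
    return 1.0 if predicted == expected.get("option") else 0.0


print(exact_option_match(None, "Answer: B\nAction: TYPE\nValue: shoes", {"option": "B"}))
# 1.0
print(exact_option_match(None, "Error: request timed out", {"option": "A"}))
# 0.0
```

Disagreements between this scorer and `option_selection_scorer` are worth inspecting, since they usually point to either a formatting problem in the response or a judgment error by the LLM scorer.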

## Running the evaluation

With everything in place, we can run the evaluation:

In [None]:
def run_mind2web_evaluation(sample_size: int = MAX_SAMPLES) -> None:
    try:
        # Load samples
        samples = load_mind2web_samples(max_samples=sample_size)

        if not samples:
            return

        # Create Braintrust dataset
        dataset = create_braintrust_dataset(samples)

        # Run the evaluation
        experiment_name = f"mind2web-{int(time.time())}"
        Eval(
            "multimodal-mind2web-eval",  # Project name
            data=dataset,
            task=predict_with_gpt4o,
            scores=[option_selection_scorer, action_correctness_scorer],
            experiment_name=experiment_name,
            metadata={
                "model": "gpt-4o",
            },
        )

    except Exception as e:
        print(f"Evaluation failed: {e}")


if __name__ == "__main__":
    # Run evaluation with a smaller sample size for testing. Adjust this number to run on more or fewer samples.
    run_mind2web_evaluation(sample_size=10)

## Analyzing the results

Web agents have many configuration parameters that affect decision-making quality. Braintrust provides comprehensive tracing capabilities that show exactly what happened at each step of agent execution. This visibility includes all attachments and intermediate processing, which simplifies debugging and enables rapid iterations.

![attachment](./assets/attachment.gif)

Performance often varies across different contexts. Your agent might excel with certain websites while struggling with others, or handle some interaction types better than others. To uncover these differences, you can filter or group by metadata:

![grouping](./assets/grouping.png)
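
If you export results or want a quick offline check, the same grouping idea can be sketched in a few lines of Python. The rows below are made-up examples that mimic the metadata fields used in this cookbook, and `mean_score_by_group` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Any, Dict, List


def mean_score_by_group(
    rows: List[Dict[str, Any]], group_key: str, score_key: str = "score"
) -> Dict[str, float]:
    """Average a score column per distinct value of a metadata field."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for row in rows:
        group = row.get(group_key, "unknown")
        totals[group] += row[score_key]
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Illustrative rows, not real experiment results.
rows = [
    {"website": "site-a", "operation_type": "CLICK", "score": 1.0},
    {"website": "site-a", "operation_type": "TYPE", "score": 1.0},
    {"website": "site-b", "operation_type": "CLICK", "score": 0.0},
]

print(mean_score_by_group(rows, "operation_type"))
# {'CLICK': 0.5, 'TYPE': 1.0}
print(mean_score_by_group(rows, "website"))
# {'site-a': 1.0, 'site-b': 0.0}
```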





### Learning from the data

Taking the time to analyze results will reveal specific improvement opportunities in your agent implementation. You might discover that certain HTML cleaning strategies produce better results for form-heavy websites, or even that providing more detailed previous action context improves performance on multi-step tasks. The ability to trace steps, filter results, and compare different approaches enables you to systematically enhance your agent's capabilities rather than making blind adjustments.

## Next steps

Now that you've explored how to evaluate the decision-making ability of a web agent, here are some next steps:

- Learn more about [how to evaluate agents](/blog/evaluating-agents)
- Check out the [guide to what you should do after running an eval](/blog/after-evals)
- Try out another [agent cookbook](/docs/cookbook/recipes/PromptChaining)