# Exploring WebArena Results with Zeno 


[Zeno](https://zenoml.com/) provides an interactive interface for exploring the results of your agents in WebArena. You can easily:
* Visualize the trajectories
* Compare the performance of different agents
* Interactively select and analyze trajectories with various filters such as trajectory length 

In [100]:
!pip install zeno_client


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m23.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.11 -m pip install --upgrade pip[0m


In [101]:
import pandas as pd
import json
import os
from dotenv import load_dotenv

import zeno_client

First, convert and combine the output `HTML` trajectories into a single `JSON` file using the `html2json.py` script.
Remember to change `--result_folder` to the folder where you saved your `render_*.html` files. The results will be saved to `{result_folder}/json_dump.json`. For example:

In [102]:
# /Users/guozhitong/webarena/cache/91206_gpt35_16k_cot_na_deberta

!python html2json.py --result_folder /Users/guozhitong/webarena/cache/91206_gpt35_16k_cot_na_deberta --config_json ../config_files/test.raw.json

Total number of files: 809
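Before uploading, it can help to confirm that every entry in `json_dump.json` carries the fields the rest of this notebook reads (`sites`, `eval_types`, `achievable`, `intent`, `messages`, `success`). The following is a minimal sketch, assuming the dump is a dict keyed by example id; `missing_fields` and the toy data are hypothetical and not part of `html2json.py`.

```python
# Fields that later cells of this notebook read from each entry.
EXPECTED_FIELDS = {"sites", "eval_types", "achievable", "intent", "messages", "success"}


def missing_fields(dump: dict) -> dict:
    """Map example id -> sorted list of expected fields absent from that entry."""
    problems = {}
    for example_id, entry in dump.items():
        absent = sorted(EXPECTED_FIELDS - set(entry))
        if absent:
            problems[example_id] = absent
    return problems


# Toy data for illustration only; real entries come from json_dump.json.
toy_dump = {
    "0": {
        "sites": ["shopping"],
        "eval_types": ["string_match"],
        "achievable": True,
        "intent": "Find the cheapest item",
        "messages": [],
        "success": 0,
    },
    "1": {"sites": ["reddit"], "intent": "Upvote a post"},
}
print(missing_fields(toy_dump))
# {'1': ['achievable', 'eval_types', 'messages', 'success']}
```

To check a real dump, load it with `json.load` and pass the resulting dict to `missing_fields`; an empty result means every entry has all six fields.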


In [103]:
# !python html2json.py --result_folder ../cache/918_text_bison_001_cot --config_json ../config_files/test.raw.json
# !python html2json.py --result_folder ../cache/919_gpt35_16k_cot --config_json ../config_files/test.raw.json
# !python html2json.py --result_folder ../cache/919_gpt35_16k_cot_na --config_json ../config_files/test.raw.json
# !python html2json.py --result_folder ../cache/919_gpt35_16k_direct --config_json ../config_files/test.raw.json
# !python html2json.py --result_folder ../cache/919_gpt35_16k_direct_na --config_json ../config_files/test.raw.json
# !python html2json.py --result_folder ../cache/919_gpt4_8k_cot --config_json ../config_files/test.raw.json

Next, list the JSON file paths in `RESULT_JSONS` and the corresponding model tags in `RESULT_NAMES`, in the same order:

In [104]:
RESULT_JSONS = [
    # "../cache/918_text_bison_001_cot/json_dump.json", 
    # "../cache/919_gpt35_16k_cot/json_dump.json",
    # "../cache/919_gpt35_16k_cot_na/json_dump.json",
    # "../cache/919_gpt35_16k_direct/json_dump.json",
    # "../cache/919_gpt35_16k_direct_na/json_dump.json",
    "/Users/guozhitong/webarena/cache/91206_gpt35_16k_cot_na_deberta/json_dump.json",
    ]
RESULT_NAMES = ["gpt3.5-deberta-cot"]
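Because the upload loop at the end of this notebook pairs `RESULT_JSONS[i]` with `RESULT_NAMES[i]`, the two lists need the same length and order. A small sanity check along these lines can catch a mismatch early; `check_results` is a hypothetical helper, and the path and tag in the demo call are placeholders.

```python
from pathlib import Path


def check_results(result_jsons, result_names):
    """Raise if the lists cannot be paired one-to-one; return paths that are not files."""
    if len(result_jsons) != len(result_names):
        raise ValueError(
            f"{len(result_jsons)} result files but {len(result_names)} model tags"
        )
    return [p for p in result_jsons if not Path(p).is_file()]


# Placeholder values for illustration; pass RESULT_JSONS and RESULT_NAMES instead.
print(check_results(["does/not/exist.json"], ["toy-model"]))
# ['does/not/exist.json']
```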

## Obtaining Data

We can use the first results file to create the base `dataset` we'll upload to Zeno with just the initial prompt intent.

In [105]:
with open(RESULT_JSONS[0], "r") as f:
    raw_json: dict = json.load(f)

In [106]:
df = pd.DataFrame(
    {
        "example_id": list(raw_json.keys()),
        "site": [", ".join(x["sites"]) for x in raw_json.values()],
        "eval_type": [", ".join(x["eval_types"]) for x in raw_json.values()],
        "achievable": [x["achievable"] for x in raw_json.values()],
        "context": [
            json.dumps(
                [
                    {
                        "role": "system",
                        "content": row["intent"],
                    }
                ]
            )
            for row in raw_json.values()
        ],
    }
)
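The `context` column holds a JSON-encoded list with a single system message whose content is the task intent. As a rough illustration of what one cell of that column looks like, here is a standalone sketch; `intent_context` is a hypothetical helper and the intent string is made up.

```python
import json


def intent_context(intent: str) -> str:
    """Encode a task intent as a one-element list of chat-style messages."""
    return json.dumps([{"role": "system", "content": intent}])


# Made-up intent for illustration only.
encoded = intent_context("Find the cheapest item")
print(encoded)
# [{"role": "system", "content": "Find the cheapest item"}]

# The string decodes back to the list of messages.
print(json.loads(encoded)[0]["content"])
# Find the cheapest item
```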

## Authenticate and Create a Project

We can now create a new [Zeno](https://zenoml.com) project and upload this data.

Create an account and API key by signing up at [Zeno Hub](https://hub.zenoml.com) and going to your [Account page](http://hub.zenoml.com/account). Save the API key in a `.env` file as `ZENO_API_KEY`, which the next cell reads.

In [107]:
# read ZENO_API_KEY from .env file
load_dotenv(override=True)

client = zeno_client.ZenoClient(os.environ.get("ZENO_API_KEY"))
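Note that `os.environ.get` returns `None` when the variable is missing, so a misconfigured `.env` file would only surface later as an authentication problem. One way to fail earlier is sketched below; `require_env` is a hypothetical helper, and the `.env` file is assumed to contain a line of the form `ZENO_API_KEY=<your key>`.

```python
import os


def require_env(name: str, env=os.environ) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = env.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# Demo with a stand-in mapping; in the notebook, call require_env("ZENO_API_KEY").
print(require_env("ZENO_API_KEY", {"ZENO_API_KEY": "toy-key"}))
# toy-key
```

The returned string can then be passed to `zeno_client.ZenoClient` in place of the `os.environ.get(...)` call.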

In [108]:
project = client.create_project(
    name="Webarena with DeBERTa (GPT3.5)",
    view={
        "data": {
            "type": "list",
            "elements": {"type": "message", "content": {"type": "markdown"}},
            "collapsible": "top",
        },
        "label": {"type": "markdown"},
        "output": {
            "type": "list",
            "elements": {
                "type": "message",
                "highlight": True,
                "content": {"type": "markdown"},
            },
            "collapsible": "top",
        },
    },
    metrics=[
        zeno_client.ZenoMetric(name="success", type="mean", columns=["success"]),
        zeno_client.ZenoMetric(
            name="# of go backs", type="mean", columns=["# of go_backs"]
        ),
        zeno_client.ZenoMetric(name="# of steps", type="mean", columns=["# of steps"]),
    ],
)

Successfully updated project.
Access your project at  https://hub.zenoml.com/project/zhitongguo/Webarena%20with%20DeBERTa%20%28GPT3.5%29


In [109]:
result_file_3 = "/Users/guozhitong/webarena/cache/91206_gpt35_16k_cot_na_deberta/json_dump.json"

In [110]:
with open(result_file_3, "r") as f:
    all_data_3: dict = json.load(f)

In [111]:
df = pd.DataFrame(
    {
        "example_id": list(all_data_3.keys()),
        "site": [", ".join(x["sites"]) for x in all_data_3.values()],
        "eval_type": [", ".join(x["eval_types"]) for x in all_data_3.values()],
        "achievable": [x["achievable"] for x in all_data_3.values()],
        "context": [
            json.dumps(
                [
                    {
                        "role": "system",
                        "content": row["intent"],
                    }
                ]
            )
            for row in all_data_3.values()
        ],
    }
)

In [112]:
project.upload_dataset(df, id_column="example_id", data_column="context")

  0%|          | 0/1 [00:00<?, ?it/s]

Successfully uploaded data


## Uploading Model Outputs

We can now upload the full trajectory outputs for our models.

If you want to display the images, upload them to a publicly accessible location and set `image_base_url` to the base URL of that location.

In [113]:
image_base_url = None

In [114]:
def format_message(row):
    return_list = []
    for message in row["messages"]:
        role = "user" if "user" in message else "assistant"

        if role == "user":
            if image_base_url:
                content = (
                    "[![image](%s/%s)](%s/%s)\n%s"
                    % (
                        image_base_url,
                        "/".join(message["image"].split("/")[-2:]),
                        image_base_url,
                        "/".join(message["image"].split("/")[-2:]),
                        message[role],
                    )
                )
            else:
                content = message[role]
        else:
            content = message[role]
        return_list.append({"role": role, "content": content})
    return return_list
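When `image_base_url` is set, `format_message` keeps only the last two components of each image path and prepends a clickable markdown image to the user message. The sketch below isolates that step; `image_markdown`, the base URL, and the image path are all hypothetical and only mirror the string formatting above.

```python
def image_markdown(base_url: str, image_path: str, text: str) -> str:
    """Build a clickable markdown image followed by the message text."""
    # Keep only the last two path components, as format_message does.
    relative = "/".join(image_path.split("/")[-2:])
    url = f"{base_url}/{relative}"
    return f"[![image]({url})]({url})\n{text}"


# Hypothetical base URL and image path, for illustration only.
rendered = image_markdown(
    "https://example.com/webarena",
    "cache/run1/images/step_0.png",
    "OBSERVATION ...",
)
print(rendered)
```

With these inputs the image URL resolves to `https://example.com/webarena/images/step_0.png`, so the files at the hosting location need to keep the same last-two-component layout as the local paths.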

In [115]:
def get_system_df(result_path: str):
    with open(result_path, "r") as f:
        json_input: dict = json.load(f)
    return pd.DataFrame(
        {
            "example_id": list(json_input.keys()),
            "# of clicks": [
                sum(
                    [
                        1
                        for x in r["messages"]
                        if "assistant" in x and "`click" in x["assistant"]
                    ]
                )
                for r in json_input.values()
            ],
            "# of types": [
                sum(
                    [
                        1
                        for x in r["messages"]
                        if "assistant" in x and "`type" in x["assistant"]
                    ]
                )
                for r in json_input.values()
            ],
            "# of go_backs": [
                sum(
                    [
                        1
                        for x in r["messages"]
                        if "assistant" in x and "`go_back" in x["assistant"]
                    ]
                )
                for r in json_input.values()
            ],
            "# of steps": [len(r["messages"]) for r in json_input.values()],
            "context": [json.dumps(format_message(row)) for row in json_input.values()],
            "success": [r["success"] for r in json_input.values()],
        }
    )
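The per-action counts in `get_system_df` are substring matches: an assistant message counts as a click if its text contains a backtick followed by `click`, and likewise for `type` and `go_back`. `# of steps` is the total number of messages, user and assistant alike. The following sketch reproduces that logic on a toy trajectory; `count_action` is a hypothetical helper, and the message texts are invented to resemble backtick-wrapped actions.

```python
def count_action(messages, marker: str) -> int:
    """Count assistant messages whose text contains the given marker."""
    return sum(
        1 for m in messages if "assistant" in m and marker in m["assistant"]
    )


# Toy trajectory for illustration; real messages come from json_dump.json.
toy_messages = [
    {"user": "OBSERVATION: ..."},
    {"assistant": "I will open the menu. `click [12]`"},
    {"user": "OBSERVATION: ..."},
    {"assistant": "`type [7] [laptop]`"},
    {"user": "OBSERVATION: ..."},
    {"assistant": "`go_back`"},
]

print(count_action(toy_messages, "`click"))    # 1
print(count_action(toy_messages, "`type"))     # 1
print(count_action(toy_messages, "`go_back"))  # 1
print(len(toy_messages))                       # 6, counts user messages too
```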

In [116]:
for i, system in enumerate(RESULT_JSONS):
    output_df = get_system_df(system)
    project.upload_system(
        output_df, name=RESULT_NAMES[i], id_column="example_id", output_column="context"
    ) 

  0%|          | 0/6 [00:00<?, ?it/s]

Successfully uploaded system
