##### WebVoyager
WebVoyager is a vision-enabled web-browsing agent capable of controlling the mouse and keyboard.
It works by viewing annotated browser screenshots on each turn, then choosing the next step to take. The agent architecture is a basic reasoning and action (ReAct) loop. The unique aspects of the agent are:
- Its use of Set-of-Marks-like image annotations to serve as UI affordances for the agent.
- Its application in the browser, using tools to control both the mouse and keyboard.
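
The loop described above can be sketched in a few lines. This is a minimal illustration rather than the notebook's implementation: `react_loop`, `observe`, `decide`, and `act` are hypothetical stand-ins for the screenshot annotation, the model call, and the browser tools.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical action shape: (action name, optional argument).
Action = Tuple[str, Optional[str]]


def react_loop(
    observe: Callable[[], str],
    decide: Callable[[str, List[str]], Action],
    act: Callable[[Action], str],
    max_steps: int = 10,
) -> Optional[str]:
    """Observe, decide, act; repeat until the policy answers or steps run out."""
    scratchpad: List[str] = []
    for _ in range(max_steps):
        observation = observe()
        name, arg = decide(observation, scratchpad)
        if name == "ANSWER":
            return arg
        # Record what happened so the next decision can see earlier steps.
        scratchpad.append(act((name, arg)))
    return None


# Toy policy for illustration: click once, then answer.
def toy_decide(observation: str, scratchpad: List[str]) -> Action:
    if not scratchpad:
        return ("Click", "3")
    return ("ANSWER", "done")


result = react_loop(
    observe=lambda: "screenshot",
    decide=toy_decide,
    act=lambda action: f"{action[0]} {action[1]}",
)
```

In the real agent, the observation is an annotated screenshot and the decision comes from a multimodal model; the control flow is the same.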

In [1]:
import os
from dotenv import load_dotenv

In [2]:
load_dotenv()

True

In [3]:
import nest_asyncio
nest_asyncio.apply()

##### Define Graph State
The state provides the inputs to each node in the graph.

In our case, the agent will track the webpage object (within the browser), annotated images and bounding boxes, the user's initial request, and the messages containing the agent scratchpad, system prompt, and other information.

In [4]:
from typing import List, Optional, TypedDict
from langchain_core.messages import BaseMessage, SystemMessage
from playwright.async_api import Page

class BBox(TypedDict):
    x: float
    y: float
    text: str
    type: str
    ariaLabel: str
    
    
class Prediction(TypedDict):
    action: str
    args: Optional[List[str]]
    
    
class AgentState(TypedDict):
    page: Page
    input: str
    img: str
    bboxes: List[BBox]
    prediction: Prediction
    
    scratchpad: List[BaseMessage]
    observation: str

##### Define tools
The agent has 6 simple tools:
1. Click
2. Type
3. Scroll
4. Wait
5. Go back
6. Go to search engine (Google)

Each tool is defined as an async function below.

In [5]:
import asyncio
import platform

async def click(state: AgentState):
    page = state["page"]
    click_args = state["prediction"]["args"]
    
    if click_args is None or len(click_args) != 1:
        return f"Failed to click bounding box labeled as number {click_args}"
    
    bbox_id = click_args[0]
    
    try:
        # A non-numeric label and an out-of-range index are both reported back to the agent.
        bbox = state["bboxes"][int(bbox_id)]
    except (ValueError, IndexError):
        return f"Error: no bbox for : {bbox_id}"
    x, y = bbox["x"], bbox["y"]
    
    await page.mouse.click(x, y)
    
    return f"clicked {bbox_id}"


async def type_text(state: AgentState):
    page = state["page"]
    type_args = state["prediction"]["args"]
    
    if type_args is None or len(type_args)!= 2:
        return (
            f"Failed to type in element from bounding box labeled as number {type_args}"
        )

    bbox_id = type_args[0]
    bbox_id = int(bbox_id)
    
    bbox = state["bboxes"][bbox_id]
    x, y = bbox["x"], bbox["y"]
    
    text_content = type_args[1]
    
    await page.mouse.click(x, y)
    
    select_all = "Meta+A" if platform.system() == "Darwin" else "Control+A"
    await page.keyboard.press(select_all)
    await page.keyboard.press("Backspace")
    await page.keyboard.type(text_content)
    await page.keyboard.press("Enter")
    
    return f"Typed {text_content} and submitted"

async def scroll(state: AgentState):
    page = state["page"]
    scroll_args = state["prediction"]["args"]
    
    if scroll_args is None or len(scroll_args) != 2:
        return "Failed to scroll due to incorrect arguments."
    
    target, direction = scroll_args
    
    if target.upper() == "WINDOW":
        scroll_amount = 500
        scroll_direction = (
            -scroll_amount if direction.lower() == "up" else scroll_amount
        )
        await page.evaluate(f"window.scrollBy(0, {scroll_direction})")
        
    else:
        scroll_amount = 200
        target_id = int(target)
        bbox = state["bboxes"][target_id]
        
        x, y = bbox["x"], bbox["y"]
        
        scroll_direction = (
            - scroll_amount if direction.lower() == "up" else scroll_amount
        )
        await page.mouse.move(x, y)
        await page.mouse.wheel(0, scroll_direction)
    
    return f"Scrolled {direction} in {'window' if target.upper() == 'WINDOW' else 'element'}"


async def wait(state: AgentState):
    sleep_time = 5
    await asyncio.sleep(sleep_time)
    return f"Waited for {sleep_time}"

async def go_back(state: AgentState):
    page = state["page"]
    await page.go_back()
    return f"Navigated back a page to {page.url}"

async def to_google(state: AgentState):
    page = state["page"]
    await page.goto("https://www.google.com/")
    return "Navigated to google.com"
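
Each tool that takes a bounding-box label repeats the same steps: parse the label, look it up, and read its coordinates. One way to factor that out is sketched below. `resolve_bbox` is a hypothetical helper, not part of the notebook, and it uses a plain dict as a simplified stand-in for `BBox`. It returns an error string instead of raising, so the message can be passed back to the agent as an observation.

```python
from typing import Any, Dict, List, Tuple, Union


def resolve_bbox(
    bboxes: List[Dict[str, Any]], label: str
) -> Union[Tuple[float, float], str]:
    """Return the (x, y) coordinates for a label, or an error message string."""
    try:
        index = int(label)
    except ValueError:
        return f"Error: {label!r} is not a numeric label"
    # Reject negative indices too; Python would otherwise count from the end.
    if not 0 <= index < len(bboxes):
        return f"Error: no bbox for: {index}"
    bbox = bboxes[index]
    return float(bbox["x"]), float(bbox["y"])


sample_boxes = [{"x": 10, "y": 20, "text": "Search", "type": "input", "ariaLabel": ""}]
found = resolve_bbox(sample_boxes, "0")
missing = resolve_bbox(sample_boxes, "5")
```

A tool could then check `isinstance(result, str)` and return the message early before touching the page.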

#### Define Agent
The agent is driven by a multi-modal model and decides the action to take for each step. It is composed of a few runnable objects:
1. A `mark_page` function to annotate the current page with bounding boxes
2. A prompt to hold the user question, annotated image, and agent scratchpad
3. GPT-4V to decide the next steps
4. Parsing logic to extract the action

Let's first define the annotations step:

##### Browser Annotations
This function annotates all buttons, inputs, text areas, etc. with numbered bounding boxes. GPT-4V then only has to refer to a bounding box when taking actions, reducing the complexity of the overall task.


In [6]:
import asyncio
import base64

from langchain_core.runnables import chain as chain_decorator


with open("mark_page.js") as f:
    mark_page_script = f.read()
    
@chain_decorator
async def mark_page(page):
    await page.evaluate(mark_page_script)
    for _ in range(10):
        try:
            bboxes = await page.evaluate("markPage()")
            break
        except Exception:
            # The page may still be loading; pause before retrying.
            await asyncio.sleep(3)
    screenshot =  await page.screenshot()
    
    await page.evaluate("unmarkPage()")
    return {
        "img": base64.b64encode(screenshot).decode(),
        "bboxes": bboxes,
    }
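
The retry loop in `mark_page` follows a common pattern: try an awaitable, pause, and try again. A generic version might look like the sketch below. `retry_async` and its parameters are hypothetical names used for illustration. Note that `asyncio.sleep` only pauses when it is awaited.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional, TypeVar

T = TypeVar("T")


async def retry_async(
    fn: Callable[[], Awaitable[T]], attempts: int = 10, delay: float = 3.0
) -> T:
    """Call fn until it succeeds, sleeping between failed attempts."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return await fn()
        except Exception as exc:
            last_error = exc
            # Skip the pause after the final attempt.
            if attempt < attempts - 1:
                await asyncio.sleep(delay)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


# Toy coroutine that fails twice before succeeding.
calls: Dict[str, int] = {"n": 0}


async def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("page not ready")
    return "ok"


outcome = asyncio.run(retry_async(flaky, attempts=5, delay=0.01))
```

Raising after the last attempt makes a persistent failure visible instead of leaving the result undefined. Inside a notebook cell, `await retry_async(...)` can be used directly in place of `asyncio.run`.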

##### Agent Definition
Now we'll compose this function with the prompt, LLM, and output parser to complete our agent.

In [7]:
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

async def annotate(state: AgentState):
    marked_page = await mark_page.with_retry().ainvoke(state["page"])
    return {**state, **marked_page}

async def format_descriptions(state):
    labels = []
    for i, bbox in enumerate(state["bboxes"]):
        # Prefer the ARIA label; fall back to the element's visible text.
        text = bbox.get("ariaLabel") or ""
        if not text.strip():
            text = bbox["text"]
        el_type = bbox.get("type")
        labels.append(f"{i}(<{el_type}/>): '{text}'")
        
    bbox_descriptions = "\nValid Bounding Boxes :\n " + "\n".join(labels)
    return {**state, "bbox_descriptions": bbox_descriptions}

def parse(text: str):
    action_prefix = "Action:"

    if not text.strip().split("\n")[-1].startswith(action_prefix):
        return {"action": "retry", "args": f"Could not parse LLM output: {text}"}
    action_block = text.strip().split("\n")[-1]
    
    # Strip the prefix and surrounding whitespace before splitting off the arguments.
    action_str = action_block[len(action_prefix):].strip()
    split_output = action_str.split(" ", 1)
    
    if len(split_output) == 1:
        action, action_input = split_output[0], None
    else:
        action, action_input = split_output
        
    action = action.strip()
    if action_input is not None:
        action_input = [
            inp.strip().strip("[]") for inp in action_input.strip().split(";")
        ]
    # Always return a prediction, including for actions without arguments.
    return {"action": action, "args": action_input}
    

prompt = hub.pull("wfh/web-voyager")
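
To make the expected output format concrete, here is a standalone sketch of the same parsing idea applied to sample strings. It assumes the model ends its reply with a line such as `Action: Type [4]; [hello world]`. That format is inferred from the parsing code above, not taken from the prompt itself, and `parse_action` is a hypothetical name used to keep the example self-contained.

```python
from typing import Any, Dict, List, Optional


def parse_action(text: str) -> Dict[str, Any]:
    """Extract the action name and arguments from the last line of a reply."""
    prefix = "Action:"
    last_line = text.strip().split("\n")[-1]
    if not last_line.startswith(prefix):
        return {"action": "retry", "args": f"Could not parse LLM output: {text}"}
    # Drop the prefix, then split the action name from its arguments.
    parts = last_line[len(prefix):].strip().split(" ", 1)
    action = parts[0].strip()
    args: Optional[List[str]] = None
    if len(parts) == 2:
        # Arguments are separated by ';' and may be wrapped in brackets.
        args = [a.strip().strip("[]") for a in parts[1].strip().split(";")]
    return {"action": action, "args": args}


typed = parse_action("Thought: search for it\nAction: Type [4]; [hello world]")
clicked = parse_action("Action: Click [12]")
waited = parse_action("Action: Wait")
bad = parse_action("I am not sure what to do")
```

Here `typed` carries two arguments, `clicked` carries one, `waited` has `args` set to `None`, and `bad` is routed to a retry.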

In [8]:
llm = ChatOpenAI(model="gpt-4-vision-preview", max_tokens=4096)
agent = annotate  | RunnablePassthrough.assign(
    prediction=format_descriptions | prompt | llm | StrOutputParser() | parse
)

##### Define graph

We've created most of the important logic. We have one more function to define that will help us update the graph state after a tool is called.

In [9]:
import re

def update_scratchpad(state: AgentState):
    """ 
    After a tool is invoked, we want to update this 
    scratchpad so the agent is aware of its previous steps
    """
    old = state.get("scratchpad")
    
    if old:
        txt = old[0].content
        last_line = txt.rsplit("\n", 1)[-1]
        step = int(re.match(r"\d+", last_line).group()) + 1
    else:
        txt = "Previous action observation: \n"
        step = 1
    txt += f"\n{step}. {state['observation']}"
    
    return {**state, "scratchpad": [SystemMessage(content=txt)]}
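
The numbering scheme is easier to see on plain strings. The sketch below is a simplified variation, not the function above: `append_observation` and `HEADER` are hypothetical names, and it counts the numbered lines already present instead of reading only the last line, so a multi-line observation does not break the numbering.

```python
import re
from typing import Optional

HEADER = "Previous action observations:\n"


def append_observation(scratchpad: Optional[str], observation: str) -> str:
    """Append a numbered observation to the running scratchpad text."""
    text = scratchpad or HEADER
    # Count the lines that already start with "<number>. " to get the next step.
    step = len(re.findall(r"^\d+\. ", text, flags=re.MULTILINE)) + 1
    return f"{text}\n{step}. {observation}"


pad = append_observation(None, "clicked 3")
pad = append_observation(pad, "Typed hello and submitted")
```

After these two calls, the last two lines of `pad` are `1. clicked 3` and `2. Typed hello and submitted`.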

In [10]:
from langchain_core.runnables import RunnableLambda
from langgraph.graph import END, StateGraph

graph_builder = StateGraph(AgentState)

graph_builder.add_node("agent", agent)
graph_builder.set_entry_point("agent")


graph_builder.add_node("update_scratchpad", update_scratchpad)
graph_builder.add_edge("update_scratchpad", "agent")


tools = {
    "Click": click,
    "Type": type_text,
    "Scroll": scroll,
    "Wait": wait,
    "GoBack": go_back,
    "Google": to_google
}

for node_name, tool in tools.items():
    graph_builder.add_node(
        node_name,
        RunnableLambda(tool) | (lambda observation: {"observation": observation})
    )
    
    graph_builder.add_edge(node_name, "update_scratchpad")
    

def select_tool(state: AgentState):
    action = state["prediction"]["action"]
    if action == "ANSWER":
        return END
    if action == "retry":
        return "agent"
    # Every other action name matches a tool node registered above.
    return action


graph_builder.add_conditional_edges("agent", select_tool)
graph = graph_builder.compile()
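
The routing decision can be checked in isolation, without building a graph. The sketch below is a hedged variation on `select_tool`: `route`, `END_SENTINEL`, and `TOOL_NODES` are hypothetical names, and sending unknown actions back to the agent is an added assumption, not behavior from the notebook.

```python
from typing import Any, Dict

END_SENTINEL = "__end__"  # hypothetical stand-in for the graph's END marker
TOOL_NODES = {"Click", "Type", "Scroll", "Wait", "GoBack", "Google"}


def route(prediction: Dict[str, Any]) -> str:
    """Pick the next node name from a parsed prediction."""
    action = prediction["action"]
    if action == "ANSWER":
        return END_SENTINEL
    if action == "retry" or action not in TOOL_NODES:
        # Unknown actions go back to the agent instead of a missing node.
        return "agent"
    return action


next_for_click = route({"action": "Click", "args": ["1"]})
next_for_answer = route({"action": "ANSWER", "args": ["42"]})
next_for_unknown = route({"action": "Dance", "args": None})
```

Guarding against unknown names keeps a malformed model reply from routing to a node that was never registered.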

##### Run Agent
Now that we've created the whole agent executor, we can run it on a few questions.

In [11]:
import playwright
from IPython import display
from playwright.async_api import async_playwright


browser = await async_playwright().start()

browser = await browser.chromium.launch(headless=False, args=None)
page = await browser.new_page()

_ = await page.goto("https://www.google.com")

async def call_agent(question:str, page, max_steps: int  = 150):
    event_stream = graph.astream(
        {
            "page": page,
            "input": question,
            "scratchpad": []
        },{
            "recursion_limit": max_steps
        }
    )
    
    final_answer = None
    steps = []
    
    async for event in event_stream:
        if "agent" not in event:
            continue
        
        pred = event["agent"].get("prediction") or {}
        action = pred.get("action")
        
        action_input = pred.get("args")
        
        display.clear_output(wait=False)
        steps.append(f"{len(steps) + 1}. {action}:{action_input}")
        print("\n".join(steps))
        # The screenshot was base64-encoded in mark_page, so decode it the same way.
        display.display(display.Image(base64.b64decode(event["agent"]["img"])))
        if action and "ANSWER" in action:
            final_answer = action_input[0]
            break
    return final_answer

Error: Executable doesn't exist at /Users/kosisochukwuasuzu/Library/Caches/ms-playwright/chromium-1097/chrome-mac/Chromium.app/Contents/MacOS/Chromium
╔════════════════════════════════════════════════════════════╗
║ Looks like Playwright was just installed or updated.       ║
║ Please run the following command to download new browsers: ║
║                                                            ║
║     playwright install                                     ║
║                                                            ║
║ <3 Playwright Team                                         ║
╚════════════════════════════════════════════════════════════╝