# InSTA: Towards Internet-Scale Training For Agents

![Pipeline Overview](https://data-for-agents.github.io/static/images/pipeline_overview.png)

**Brandon Trabucco (1) Gunnar Sigurdsson (2) Robinson Piramuthu (2) Ruslan Salakhutdinov (1)**

**(1) Carnegie Mellon University, Machine Learning Department (2) Amazon**

The predominant approach for training web navigation agents gathers human demonstrations for a set of popular websites and hand-written tasks, but it is becoming clear that human data are an inefficient resource. We develop a pipeline to facilitate Internet-scale training for agents without laborious human annotations. In the first stage, an LLM generates tasks for 150k diverse websites. In the next stage, LLM agents complete tasks and produce trajectories. In the final stage, an LLM reviews the trajectories and judges their success. Language models are competitive with human annotators, detecting and filtering out harmful content with an accuracy of 97%, generating feasible tasks at an 89% rate, and judging successful trajectories with 82.6% accuracy. Scaling the pipeline, agents based on Llama 3.1 70B solve 16.7% of tasks for 150k sites. Training on the data generated by our pipeline is competitive with training on human demonstrations. In data-limited settings derived from Mind2Web and WebLINX, we improve Step Accuracy by up to +89.5% and +122.1% respectively for agents trained on mixtures of data from our pipeline and human data. When training agents with all available human data from these benchmarks, agents fail to generalize to diverse real sites, and adding our data improves their generalization by +149.0% for WebLINX and +156.3% for Mind2Web. Code is available at [data-for-agents.github.io](https://data-for-agents.github.io).

[website](https://data-for-agents.github.io)    |    [paper](https://arxiv.org/abs/2502.06776)    |    [data](https://huggingface.co/datasets/data-for-agents/insta-150k)

---

This notebook provides a demo of the InSTA pipeline for agents.
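
Before running the real pipeline, it may help to picture the three stages from the abstract (task generation, agent rollout, judgment) as a plain loop. The sketch below is illustrative only: `Trajectory`, `run_three_stages`, and the three callables are hypothetical names invented here, not part of the `insta` package, and the lambdas stand in for the LLM and browser so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trajectory:
    """Hypothetical record of one agent attempt (not the InSTA data format)."""
    domain: str
    task: str
    actions: List[str] = field(default_factory=list)
    success: float = 0.0  # score assigned by the judge


def run_three_stages(
    domains: List[str],
    propose_task: Callable[[str], str],
    run_agent: Callable[[str, str], List[str]],
    judge: Callable[[str, List[str]], float],
) -> List[Trajectory]:
    """Stage 1 proposes a task per site, stage 2 rolls out an agent,
    and stage 3 scores the rollout with a judge."""
    trajectories = []
    for domain in domains:
        task = propose_task(domain)          # stage 1: task generation
        actions = run_agent(domain, task)    # stage 2: agent rollout
        score = judge(task, actions)         # stage 3: judgment
        trajectories.append(Trajectory(domain, task, actions, score))
    return trajectories


# Stand-in callables (assumptions) so the sketch runs without an LLM or browser.
demo = run_three_stages(
    domains=["example.com"],
    propose_task=lambda domain: "Find the contact page on {}.".format(domain),
    run_agent=lambda domain, task: ["click [id: 1]", "stop"],
    judge=lambda task, actions: 1.0 if actions and actions[-1] == "stop" else 0.0,
)
print(demo[0])
```

In the notebook itself, `InstaPipeline.run_pipeline` plays the role of this loop; the cells below call it directly.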

In [1]:
# Uncomment and run the following commands to install InSTA and prepare the environment

# !docker pull brandontrabucco/insta-browser-environment
# !docker run -p 7860:7860 -p 3000-3007:3000-3007 -t brandontrabucco/insta-browser-environment &
# !pip install git+https://github.com/data-for-agents/insta

In [2]:
from insta import (
    InstaPipeline,
    create_demo_videos
)

In [3]:
pipeline = InstaPipeline()

In [4]:
prepared_demo_options = [
    {   # 0
        "domain": "sustainablewebdesign.org",
        "task": "Retrieve a guide on reducing website carbon emissions."
    },
    {   # 1
        "domain": "statejobs.ny.gov",
        "task": "Search for currently available jobs in the field of environmental conservation."
    },
    {   # 2
        "domain": "quanthub.com",
        "task": "Find a research paper on quantum computing algorithms."
    },
    {   # 3
        "domain": "nameberry.com",
        "task": "Find the most popular baby names of the past decade."
    },
    {   # 4
        "domain": "apple.es",
        "task": "Find the technical specifications of the latest iPhone model."
    },
    {   # 5
        "domain": "agro.bayer.nl",
        "task": "Locate information on crop protection products for wheat."
    },
    {   # 6
        "domain": "misti.mit.edu",
        "task": "Find a course lecture on introductory computer science."
    },
    {   # 7
        "domain": "sharjahairport.ae",
        "task": "Check the flight schedule for arrivals at Sharjah International Airport."
    },
    {   # 8
        "domain": "w.org",
        "task": "Find documentation on how to install WordPress on a website."
    },
    {   # 9
        "domain": "visuwords.com",
        "task": "Visualize the word relationships for \"artificial intelligence\"."
    }
]

In [5]:
IDX = 4

pipeline.run_pipeline(
    dataset = [prepared_demo_options[IDX]]
)

Processing: apple.es: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [05:53<00:00, 353.94s/it]


In [6]:
create_demo_videos(
    task_is_feasible_threshold = 0.0,
    success_threshold = 0.0,
    on_right_track_threshold = 0.0,
)

Processing: apple.es: 100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:20<00:00, 20.64s/it]
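
All three thresholds are set to `0.0` in the call above, which presumably means no trajectory is filtered out before its video is rendered. A minimal sketch of how such thresholds could gate trajectories, assuming each judgment carries three scores in [0, 1] under keys that mirror the parameter names. That layout is an assumption made for illustration: this notebook only shows a `response` key in the judgment files, and `filter_judgments` is a hypothetical helper, not an `insta` function.

```python
from typing import Dict, List


def filter_judgments(
    judgments: List[Dict[str, float]],
    task_is_feasible_threshold: float = 0.0,
    success_threshold: float = 0.0,
    on_right_track_threshold: float = 0.0,
) -> List[Dict[str, float]]:
    """Keep only the judgments whose three scores all meet their thresholds."""
    kept = []
    for judgment in judgments:
        if (
            judgment["task_is_feasible"] >= task_is_feasible_threshold
            and judgment["success"] >= success_threshold
            and judgment["on_right_track"] >= on_right_track_threshold
        ):
            kept.append(judgment)
    return kept


# Made-up scores, for illustration only.
scores = [
    {"task_is_feasible": 1.0, "success": 0.9, "on_right_track": 1.0},
    {"task_is_feasible": 0.8, "success": 0.1, "on_right_track": 0.4},
]

print(len(filter_judgments(scores)))                         # thresholds at 0.0 keep both
print(len(filter_judgments(scores, success_threshold=0.5)))  # keeps only the first
```

Raising a threshold above zero would narrow the set of videos to trajectories the judge rated more favorably.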


In [7]:
from ipywidgets import Output, GridspecLayout
from IPython import display

import glob
import json
import os

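MAX_VIDEOS = 10

# sort the matches so the panel order is stable across runs
selected_video_files = sorted(glob.glob(
    "data/videos/*.mp4"
))[:MAX_VIDEOS]

# the grid has a single row with one column per video
num_videos = len(selected_video_files)

video_grid = GridspecLayout(
    1, num_videos
)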

for panel_idx, video_path in enumerate(
    selected_video_files
):
    
    video_output = Output()
    
    with video_output:
        
        display.display(display.Video(
            video_path,
            embed = True,
            width = 1000
        ))
        
    video_grid[0, panel_idx] = video_output

video_grid

GridspecLayout(children=(Output(layout=Layout(grid_area='widget001')),), layout=Layout(grid_template_areas='"w…

In [8]:
first_video_path = selected_video_files[0]

# each video data/videos/<name>.mp4 has matching <name>.json files
# under data/observations, data/actions, and data/judgments

json_name = (
    os.path.basename(first_video_path)
    .replace(".mp4", ".json")
)

# show the final observation, action, and judgment responses:

with open(os.path.join("data", "observations", json_name), "r") as file:

    # observations are stored as a list, the last entry is the final step
    final_observation = json.load(file)[-1]

with open(os.path.join("data", "actions", json_name), "r") as file:

    # actions are stored as a list, the last entry is the final step
    final_action = json.load(file)[-1]

with open(os.path.join("data", "judgments", json_name), "r") as file:

    # there is a single judgment for the whole trajectory
    judgment = json.load(file)

print(
    "Final Observation for {}:\n\n{}\n\n".format(
        first_video_path,
        final_observation["processed_text"]
    )
)

print(
    "Final Action for {}:\n\n{}\n\n".format(
        first_video_path,
        final_action["response"]
    )
)

print(
    "Judgment for {}:\n\n{}\n\n".format(
        first_video_path,
        judgment["response"]
    )
)

Final Observation for data/videos/apple.es.mp4:

iPhone 16 y iPhone 16 Plus - Especificaciones técnicas - Apple (ES) 
* [id: 13] Apple link
* * [id: 27] Tienda link
    * [id: 37] Menú de la tienda button
 
    * [id: 87] Mac link
    * [id: 97] Menú del Mac button
 
    * [id: 165] iPad link
    * [id: 175] Menú del iPad button
 
    * [id: 239] iPhone link
    * [id: 249] Menú del iPhone button
 
    * [id: 313] Watch link
    * [id: 323] Menú del Apple Watch button
 
    * [id: 385] AirPods link
    * [id: 395] Menú de los AirPods button
 
    * [id: 437] TV y Casa link
    * [id: 447] Menú de TV y Casa button
 
    * [id: 505] Entretenimiento link
    * [id: 515] Menú de entretenimiento button
 
    * [id: 556] Accesorios link
    * [id: 566] Menú de accesorios button
 
    * [id: 605] Soporte link
    * [id: 615] Menú de soporte button
* [id: 665] Buscar en apple.com link
* [id: 697] Bolsa link
 [id: 746] iPhone 16 link 
* [id: 755] Descripción link
* [id: 757] Pasarse de Android 
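
The observation printed above lists interactive elements as `[id: N]` followed by a label. A minimal sketch of pulling those ids out of `processed_text` with a regular expression, assuming only the line shape visible in this output. `Element` and `parse_elements` are hypothetical helpers written for this example and are not part of `insta`.

```python
import re
from typing import List, NamedTuple


class Element(NamedTuple):
    element_id: int
    label: str


# Matches "[id: 13] Apple link" anywhere in a line, ignoring leading bullets.
ELEMENT_PATTERN = re.compile(r"\[id:\s*(\d+)\]\s*(.*)")


def parse_elements(processed_text: str) -> List[Element]:
    """Collect (id, label) pairs from lines that contain an [id: N] marker."""
    elements = []
    for line in processed_text.splitlines():
        match = ELEMENT_PATTERN.search(line)
        if match:
            elements.append(
                Element(int(match.group(1)), match.group(2).strip())
            )
    return elements


# A few lines copied from the observation shown above.
sample = "\n".join([
    "* [id: 13] Apple link",
    "    * [id: 37] Menú de la tienda button",
    " [id: 746] iPhone 16 link ",
])

for element in parse_elements(sample):
    print(element.element_id, element.label)
```

The label is kept whole here; the trailing word (`link`, `button`) looks like an element type, but splitting it off would rest on a guess about the format.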

In [9]:
from ipywidgets import Output, GridspecLayout
from IPython import display

import glob
import json
import os

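MAX_VIDEOS = 10

# load the videos under data-backup/videos, one per prepared demo option
selected_video_files = [
    os.path.join("data-backup", "videos", "{}.mp4".format(x["domain"]))
    for x in prepared_demo_options[:MAX_VIDEOS]
]

# the grid has a single row with one column per video
num_videos = len(selected_video_files)

video_grid = GridspecLayout(
    1, num_videos
)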

for panel_idx, video_path in enumerate(
    selected_video_files
):
    
    video_output = Output()
    
    with video_output:
        
        display.display(display.Video(
            video_path,
            embed = True,
            width = 1000
        ))
        
    video_grid[0, panel_idx] = video_output

video_grid

GridspecLayout(children=(Output(layout=Layout(grid_area='widget001')), Output(layout=Layout(grid_area='widget0…