To train this agent, click _Runtime_ and press _Run all_. Make sure you've enabled a free Tesla T4 GPU!

<div class="align-center">
<a href="https://github.com/openpipe/art"><img src="https://github.com/openpipe/art/raw/main/assets/ART_pill.png" height="50"></a>
<a href="https://discord.gg/zbBHRUpwf4"><img src="https://github.com/openpipe/art/raw/main/assets/Discord_pill.png" height="50"></a>
<a href="https://openpipe.ai/blog/art-e-mail-agent"><img src="https://github.com/openpipe/art/raw/main/assets/ART_E_pill.png" height="50"></a>

Questions? Join the Discord and ask away! For feature requests or to leave a star, visit our [GitHub](https://github.com/openpipe/art).

</div>

<a href="https://art.openpipe.ai/"><img src="https://github.com/openpipe/art/raw/main/assets/Header_separator.png" height="5"></a>

This notebook shows how to train a Qwen 2.5 14B model to navigate Wikipedia. It demonstrates how to set up a multi-turn agent that learns to hop between Wikipedia pages by selecting the links most likely to lead to a target page.

Completions will be logged to OpenPipe, and metrics will be logged to Weights & Biases.

You will learn how to construct an [agentic environment](#Environment), how to define a [rollout](#Rollout), and how to run a [training loop](#Loop).


In [1]:
!uv pip install "numpy<2.0.0"

Using Python 3.10.13 environment at: /root/sky_workdir/.venv
Audited 1 package in 11ms


### WARNING:

If you are running in Google Colab and installing numpy does not say "Requirement already satisfied: numpy<2.0.0" then click "Runtime" and "Restart Session."


In [2]:
# make sure we're using numpy 1.*.*
import numpy as np

if (np.__version__).startswith("1."):
    print("Numpy version is 1.*.*, you're good to go!")
else:
    raise ValueError("Please restart your runtime using the above instructions!")

Numpy version is 1.*.*, you're good to go!


### Environment Variables

Later on in the notebook, we'll be creating a model that can automatically log metrics to Weights & Biases. To do so, you'll need to provide your Weights & Biases API key as an environment variable.

You can also optionally initialize an OpenPipe client to report completions to a [dashboard](https://app.openpipe.ai), where you can see what your model's completions look like and how they change over time. Logging to OpenPipe is free, but is not required for training!


In [3]:
import os


# Optional
WANDB_API_KEY = ""
if WANDB_API_KEY:
    os.environ["WANDB_API_KEY"] = WANDB_API_KEY

# Optional
OPENPIPE_API_KEY = ""
if OPENPIPE_API_KEY:
    os.environ["OPENPIPE_API_KEY"] = OPENPIPE_API_KEY

### Installation

In [4]:
%%capture
!uv pip install openpipe-art==0.3.11 openpipe accelerate==1.7.0 requests beautifulsoup4 --prerelease allow --no-cache-dir

### Agentic Environment

<a name="Environment"></a>

ART allows your agent to learn by interacting with its environment. In this example, we'll create an environment in which the agent navigates Wikipedia by selecting links to reach target pages.

The agent starts at the Philosophy Wikipedia page and must navigate to various target pages by selecting the best links from the opening paragraphs of each page. The environment functions handle Wikipedia scraping, link extraction, and target matching.

Feel free to read as much or as little of this section's code as you'd like. The important thing to understand is that we're defining the rules of this agent's environment.


In [5]:
import random
import re
import xml.etree.ElementTree as ET
from typing import List
import requests
from bs4 import BeautifulSoup


def read_first_paragraphs_with_links(url: str, num_paragraphs: int = 3) -> str:
    """Given a link to a wikipedia page, read the first paragraphs with links."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Find the main content div
        content = soup.find('div', {'id': 'mw-content-text'})
        if not content:
            return "Could not find main content"

        num_good_paragraphs = 0
        parsed_paragraphs = []
        # also count ordered and unordered lists
        paragraphs = content.find_all(['p', 'ol', 'ul'])
        paragraph_index = 0

        while num_good_paragraphs < num_paragraphs and paragraph_index < len(paragraphs):
            p = paragraphs[paragraph_index]
            text = p.get_text().strip()
            # Get the paragraph with links preserved
            paragraph_html = str(p)
            parsed_paragraphs.append(paragraph_html)
            if len(text) > 50 and p.find('a'):  # Must have substantial content and a link
                num_good_paragraphs += 1
            paragraph_index += 1
        
        if len(parsed_paragraphs) == 0:
            raise Exception("No substantial first paragraphs found")
        
        return "\n\n".join(parsed_paragraphs)
        
    except Exception as e:
        return f"Error reading page: {str(e)}"


def extract_urls(first_paragraph: str) -> List[str]:
    """Extract Wikipedia URLs from the first paragraph HTML."""
    soup = BeautifulSoup(first_paragraph, 'html.parser')
    links = []
    for a_tag in soup.find_all('a', href=True):
        href = a_tag['href']
        text = a_tag.get_text().strip()
        if href.startswith('/wiki/') and ':' not in href and text:
            full_url = f"https://en.wikipedia.org{href}"
            links.append(full_url)
    return links



def check_target_match(current_url: str, target_page: str) -> bool:
    """Given the current url, determine whether it is a match for the target page."""
    # Normalize URLs for comparison
    current_url = current_url.strip().rstrip('/')
    target_page = target_page.strip().rstrip('/')
    
    # Direct match
    if current_url == target_page:
        return True
    
    # Extract the page title from both URLs
    def extract_page_title(url):
        if '/wiki/' in url:
            return url.split('/wiki/')[-1]
        return url
    
    current_title = extract_page_title(current_url)
    target_title = extract_page_title(target_page)
    
    return current_title == target_title


# Target URLs for training scenarios
TARGET_URLS = [
    "https://en.wikipedia.org/wiki/Maple_syrup",
    "https://en.wikipedia.org/wiki/2011_Christchurch_earthquake",
    "https://en.wikipedia.org/wiki/Alena_Vesel%C3%A1",
    "https://en.wikipedia.org/wiki/Horsehair",
    "https://en.wikipedia.org/wiki/French_Army",
    "https://en.wikipedia.org/wiki/Grant_Park_Music_Festival",
    "https://en.wikipedia.org/wiki/World_War_II",
    "https://en.wikipedia.org/wiki/1782",
    "https://en.wikipedia.org/wiki/Privateer",
    "https://en.wikipedia.org/wiki/Greater_Western_Sydney_vs_Brisbane_Lions_(2024_AFL_season)",
    "https://en.wikipedia.org/wiki/Murder_of_Nancy_Titterton",
    "https://en.wikipedia.org/wiki/The_Beatles",
    "https://en.wikipedia.org/wiki/The_Acres",
    "https://en.wikipedia.org/wiki/Hungarian_Revolution_of_1956",
    "https://en.wikipedia.org/wiki/Orphic_Hymns",
    "https://en.wikipedia.org/wiki/Baking"
]

STARTING_URL = "https://en.wikipedia.org/wiki/Philosophy"
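
`check_target_match` compares page titles as raw strings, so a percent-encoded title such as `Alena_Vesel%C3%A1` only matches a URL that is encoded the same way. The sketch below is one possible stricter normalization. The helper names `normalized_title` and `titles_match` are hypothetical and are not used anywhere else in this notebook.

```python
from urllib.parse import unquote


def normalized_title(url: str) -> str:
    """Return a comparable page title (hypothetical helper, not part of the notebook)."""
    title = url.strip().rstrip("/").split("/wiki/")[-1]
    title = title.split("#")[0]  # drop any section anchor
    # Decode percent-escapes and use underscores for spaces, as Wikipedia URLs do
    return unquote(title).replace(" ", "_")


def titles_match(current_url: str, target_url: str) -> bool:
    """Compare two Wikipedia URLs by their normalized page titles."""
    return normalized_title(current_url) == normalized_title(target_url)


print(titles_match(
    "https://en.wikipedia.org/wiki/Alena_Vesel%C3%A1",
    "https://en.wikipedia.org/wiki/Alena_Veselá",
))
```

Whether this extra leniency is desirable depends on how strictly you want to score the agent; the notebook's own matcher is left unchanged.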


In [6]:
print(extract_urls(read_first_paragraphs_with_links("https://en.wikipedia.org/wiki/Syrup")))

['https://en.wikipedia.org/wiki/Cooking', 'https://en.wikipedia.org/wiki/Arabic_language', 'https://en.wikipedia.org/wiki/Latin_language', 'https://en.wikipedia.org/wiki/Viscous', 'https://en.wikipedia.org/wiki/Solution_(chemistry)', 'https://en.wikipedia.org/wiki/Sugar', 'https://en.wikipedia.org/wiki/Crystal', 'https://en.wikipedia.org/wiki/Molasses', 'https://en.wikipedia.org/wiki/Hydrogen_bond', 'https://en.wikipedia.org/wiki/Hydroxyl', 'https://en.wikipedia.org/wiki/Agave_nectar', 'https://en.wikipedia.org/wiki/Agave', 'https://en.wikipedia.org/wiki/Cane_syrup', 'https://en.wikipedia.org/wiki/Chocolate_syrup', 'https://en.wikipedia.org/wiki/Corn_syrup', 'https://en.wikipedia.org/wiki/Glucose_syrup', 'https://en.wikipedia.org/wiki/Golden_syrup', 'https://en.wikipedia.org/wiki/Sugar', 'https://en.wikipedia.org/wiki/High_fructose_corn_syrup', 'https://en.wikipedia.org/wiki/Maple_syrup', 'https://en.wikipedia.org/wiki/Table_syrup', 'https://en.wikipedia.org/wiki/Mixed_drink']
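
To see the link filter in isolation, without fetching a live page, here is a minimal sketch that applies the same three rules as `extract_urls` (href starts with `/wiki/`, no `:` in the href, non-empty link text) to a hand-written HTML snippet. It uses the standard library's `html.parser` instead of BeautifulSoup purely to stay dependency-free; the class name `WikiLinkCollector` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser
from typing import List, Optional


class WikiLinkCollector(HTMLParser):
    """Collect article links using the same filter rules as `extract_urls`."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            href, text = self._href, "".join(self._text).strip()
            # Keep only article links: /wiki/ prefix, no namespace colon, visible text
            if href.startswith("/wiki/") and ":" not in href and text:
                self.links.append(f"https://en.wikipedia.org{href}")
            self._href = None


sample_html = (
    '<p><a href="/wiki/Sugar">sugar</a> '
    '<a href="/wiki/File:Syrup.jpg">image</a> '
    '<a href="#cite_note-1">[1]</a> '
    '<a href="/wiki/Molasses"></a></p>'
)
collector = WikiLinkCollector()
collector.feed(sample_html)
print(collector.links)
```

Only the `Sugar` link survives: the `File:` link contains a colon, the citation link does not start with `/wiki/`, and the `Molasses` link has no visible text.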


### Creating a Model

Now that we've defined the rules of our environment, we can create a model that will learn to navigate Wikipedia. We'll use a Qwen 2.5 14B model for this example. The `name` parameter will be associated with a wandb run, and the `base_model` parameter is the model that we'll be training a LoRA on top of.

In [7]:
import art
from dotenv import load_dotenv

from openpipe.client import OpenPipe
from art.local import LocalBackend

load_dotenv()

op_client = OpenPipe()
print("OpenPipe client initialized")

random.seed(42)

backend = LocalBackend(path="./.art")

INFO 07-01 11:15:49 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 07-01 11:15:49 [__init__.py:239] Automatically detected platform cuda.
OpenPipe client initialized


### Model Registration

Now we'll register our model with the ART backend. This creates the infrastructure needed for training and tracks our model's progress through the training steps.


In [8]:
import os

TEST_MODE = False


if TEST_MODE:
    model_or = art.Model(
        name="closed_or",
        project="wikihop-navigation",
        inference_base_url="https://openrouter.ai/api/v1",
        inference_api_key=os.getenv("OPENROUTER_API_KEY"),
        inference_model_name="openai/gpt-4.1",
    )
    model = art.Model(
        name="closed",
        project="wikihop-navigation",
        inference_base_url="https://api.openai.com/v1",
        inference_api_key=os.getenv("OPENAI_API_KEY"),
        inference_model_name="gpt-4.1",
    )
else:

    model = art.TrainableModel(
        name="003-wikihop", project="wikihop-navigation", base_model="Qwen/Qwen2.5-14B-Instruct"
    )
    await model.register(backend)


Please restructure your imports with 'import unsloth' at the top of your file.
  import unsloth  # type: ignore


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 07-01 11:15:59 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 07-01 11:15:59 [__init__.py:239] Automatically detected platform cuda.
==((====))==  Unsloth 2025.5.1: Fast Qwen2 patching. Transformers: 4.51.3. vLLM: 0.8.5.post1.
   \\   /|    NVIDIA H100 PCIe. Num GPUs = 1. Max memory: 79.189 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 9.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen2.5-14b-instruct-unsloth-bnb-4bit with actual GPU utilization = 78.47%
Unsloth: Your GPU has CUDA compute capability 9.0 with VRAM = 79.19 GB.
Unsloth: Using conservativeness = 1.0

Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:01<00:02,  1.04s/it]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:01<00:00,  1.68it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00,  1.28it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00,  1.29it/s]

Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:00<00:01,  1.04it/s]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:01<00:00,  1.85it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00,  1.39it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00,  1.40it/s]



INFO 07-01 11:17:49 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 07-01 11:17:50 [model_runner.py:1140] Model loading took 10.6668 GiB and 90.747297 seconds
INFO 07-01 11:17:56 [worker.py:287] Memory profiling takes 5.67 seconds
INFO 07-01 11:17:56 [worker.py:287] the current vLLM instance can use total_gpu_memory (79.19GiB) x gpu_memory_utilization (0.78) = 62.14GiB
INFO 07-01 11:17:56 [worker.py:287] model weights take 10.67GiB; non_torch_memory takes 0.14GiB; PyTorch activation peak memory takes 4.26GiB; the rest of the memory reserved for KV Cache is 47.07GiB.
INFO 07-01 11:17:56 [executor_base.py:112] # cuda blocks: 16066, # CPU blocks: 2048
INFO 07-01 11:17:56 [executor_base.py:117] Maximum concurrency for 32768 tokens per request: 7.84x
INFO 07-01 11:18:01 [model_runner.py:1450] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI

Capturing CUDA graph shapes: 100%|██████████| 49/49 [00:50<00:00,  1.04s/it]


INFO 07-01 11:18:52 [model_runner.py:1592] Graph capturing finished in 51 secs, took 1.94 GiB
INFO 07-01 11:18:52 [llm_engine.py:437] init engine (profile, create kv cache, warmup model) took 62.03 seconds
Unsloth: Just some info: will skip parsing ['pre_feedforward_layernorm', 'k_norm', 'q_norm', 'post_feedforward_layernorm']
Unsloth: Just some info: will skip parsing ['pre_feedforward_layernorm', 'k_norm', 'q_norm', 'post_feedforward_layernorm']


Unsloth 2025.5.1 patched 48 layers with 48 QKV layers, 48 O layers and 48 MLP layers.


### Defining a Rollout

<a name="Rollout"></a>

A rollout is a single episode of an agent performing its task. It generates one or more trajectories, which are lists of messages and choices.

In this example, the rollout function starts the agent at the Philosophy Wikipedia page and gives it a target page to reach. The agent navigates by selecting links from the opening paragraphs of each page until it either reaches the target or uses up its budget of 30 hops.

When the navigation is finished, the `reward` for the agent's performance is calculated based on whether it successfully reached the target, how many steps it took, and whether it made any errors.

This rollout function will be called many times in parallel during each step of the training loop.
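
The reward rule used by the rollout function can be summarized as a small pure function. This is a sketch for illustration only: `navigation_reward` is a hypothetical name, and the rollout itself assigns the reward inline rather than calling a helper.

```python
def navigation_reward(reached_target: bool, hops: int, failure_penalty: int = 100) -> int:
    """Reward rule mirrored from the rollout: fewer hops is better, failure is costly."""
    if reached_target:
        # Success: the reward is the negative of the number of hops used
        return -hops
    # Failure (error, invalid link, or hop budget exhausted): a large penalty,
    # partially offset by the number of valid hops made before failing
    return -failure_penalty + hops


for reached, hops in [(True, 2), (True, 29), (False, 0), (False, 30)]:
    print(reached, hops, navigation_reward(reached, hops))
```

With a 30-hop budget, the worst successful episode (-29) still scores higher than the best failed one (-70), so reaching the target always beats failing.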

In [9]:
import art
import openai
import time
from pydantic import BaseModel

class WikihopScenario(BaseModel):
    step: int
    target_url: str

@art.retry(exceptions=(openai.LengthFinishReasonError,))
async def rollout(
    model: art.Model, scenario: WikihopScenario
) -> art.Trajectory:
    current_url = STARTING_URL
    target_url = scenario.target_url
    max_hops = 30
    
    trajectory = art.Trajectory(
        messages_and_choices=[
            {
                "role": "system",
                "content": f"You are a Wikipedia navigator. Your goal is to reach the target page: {target_url}\n\nYou will be shown the first few paragraphs of each Wikipedia page. Select the link from these paragraphs that is most likely to get you closer to your target. If you see a direct link to your target, choose it immediately.\n\nRespond with ONLY the full URL in this exact format: <url>https://en.wikipedia.org/wiki/YourChoice</url>. You must always choose a link from the list of available links.",
            }
        ],
        reward=0,
    )

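    hop_number = 0
    # Defined up front so the OpenPipe report below still has values
    # if the loop exits before the first completion is requested
    requested_at = int(time.time() * 1000)
    chat_completion = None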
    
    while hop_number < max_hops:
        # Check if we've reached the target
        if check_target_match(current_url, target_url):
            print("reached target")
            trajectory.reward = -hop_number  # Negative of total hops used
            trajectory.metrics["success"] = 1
            trajectory.metrics["hops"] = hop_number
            break
            
        # Read the first paragraph of the current page
        try:
            first_paragraph = read_first_paragraphs_with_links(current_url)
            if first_paragraph.startswith("Error"):
                trajectory.reward = -100 + hop_number
                trajectory.metrics["success"] = 0
                trajectory.metrics["hops"] = hop_number
                trajectory.metadata["error"] = "page_read_error"
                break
        except Exception as e:
            trajectory.reward = -100 + hop_number
            trajectory.metrics["success"] = 0
            trajectory.metrics["hops"] = hop_number
            trajectory.metadata["error"] = "page_read_exception"
            break

        # Extract links from the first paragraph
        links = extract_urls(first_paragraph)
        
        if not links:
            trajectory.reward = -100 + hop_number
            trajectory.metrics["success"] = 0
            trajectory.metrics["hops"] = hop_number
            trajectory.metadata["error"] = "no_valid_links"
            break

        links_text = "\n".join(links)

        # Add the current page content to the trajectory
        page_title = current_url.split('/wiki/')[-1].replace('_', ' ')
        trajectory.messages_and_choices.append({
            "role": "user", 
            "content": f"Current page: {page_title}\nHop {hop_number + 1}/{max_hops}\n\nChoose from one of the following available links:\n{links_text}"
        })

        requested_at = int(time.time() * 1000)

        try:
            
            
            # Get the model's choice of next URL
            client = model.openai_client()

            chat_completion = None
            chat_completion = await client.chat.completions.create(
                model=model.get_inference_name(),
                messages=trajectory.messages(),
                max_completion_tokens=2000,
            )

            
            response = chat_completion.choices[0].message.content
            if not response:
                trajectory.reward = -100 + hop_number
                trajectory.metrics["success"] = 0
                trajectory.metrics["hops"] = hop_number
                trajectory.metadata["error"] = "empty_model_response"
                break
            
            # Parse the XML response to get the selected URL
            try:
                root = ET.fromstring(response)
                next_url = root.text
                if not next_url:
                    trajectory.reward = -100 + hop_number
                    trajectory.metrics["success"] = 0
                    trajectory.metrics["hops"] = hop_number
                    trajectory.metadata["error"] = "empty_url_in_response"
                    break
                next_url = next_url.strip()
            except ET.ParseError:
                # Try to extract URL with regex as fallback
                url_pattern = r'https://en\.wikipedia\.org/wiki/[^\s<>]+'
                matches = re.findall(url_pattern, response)
                if matches:
                    next_url = matches[0]
                else:
                    trajectory.reward = -100 + hop_number
                    trajectory.metrics["success"] = 0
                    trajectory.metrics["hops"] = hop_number
                    trajectory.metadata["error"] = "could_not_parse_url"
                    break

            print(target_url.split('/wiki/')[-1], next_url)
            
            # Validate that the selected URL is actually in the first paragraph
            if next_url not in links:
                trajectory.reward = -100 + hop_number
                trajectory.metrics["success"] = 0
                trajectory.metrics["hops"] = hop_number
                trajectory.metadata["error"] = "selected_url_not_in_text"
                break
            
        except openai.LengthFinishReasonError:
            # Propagate to the @art.retry decorator
            raise
        except Exception as e:
            print(f"caught exception during URL selection: {e}")
            trajectory.reward = -100 + hop_number
            trajectory.metrics["success"] = 0
            trajectory.metrics["hops"] = hop_number
            trajectory.metadata["error"] = "url_selection_error"
            break

        # Record the choice
        trajectory.messages_and_choices.append(chat_completion.choices[0])

        # Move to the next page
        current_url = next_url
        hop_number += 1

    # If we ran out of hops without reaching the target
    if hop_number >= max_hops and not check_target_match(current_url, target_url):
        trajectory.reward = -100 + hop_number  # Error penalty for not reaching target, but reward good initial hops
        trajectory.metrics["success"] = 0
        trajectory.metrics["hops"] = hop_number
        trajectory.metadata["error"] = "max_hops_exceeded"
    
    # Final metrics
    if "success" not in trajectory.metrics:
        trajectory.metrics["success"] = 1 if check_target_match(current_url, target_url) else 0
    if "hops" not in trajectory.metrics:
        trajectory.metrics["hops"] = hop_number

    if op_client.api_key:
        messages = trajectory.messages()
        if messages[-1]["role"] == "assistant":
            messages = messages[:-1]

        try:
            op_client.report(
                requested_at=requested_at,
                received_at=int(time.time() * 1000),
                req_payload={
                    "model": model.name,
                    "messages": messages,
                    "metadata": {
                        "notebook-id": "wikihop",
                        "step": str(scenario.step),
                        "final_hops": str(hop_number),
                        "success": str(trajectory.metrics["success"]),
                        "reward": str(trajectory.reward),
                        "target_url": target_url,
                        "final_url": current_url,
                        "error": trajectory.metadata["error"] if "error" in trajectory.metadata else "none",
                    },
                },
                resp_payload=chat_completion,
                status_code=200,
            )
        except Exception as e:
            print(f"Error reporting to OpenPipe: {e}")

    return trajectory
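
The response-parsing step of the rollout (XML first, regex as a fallback) can be tried on its own. The sketch below restates that logic as a standalone function; `parse_selected_url` and `WIKI_URL_PATTERN` are hypothetical names that do not appear in the rollout itself.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

WIKI_URL_PATTERN = re.compile(r"https://en\.wikipedia\.org/wiki/[^\s<>]+")


def parse_selected_url(response: str) -> Optional[str]:
    """Return the URL chosen by the model, or None if none can be recovered."""
    try:
        # Expected case: the whole response is a single <url>...</url> element
        text = ET.fromstring(response).text
        return text.strip() if text and text.strip() else None
    except ET.ParseError:
        # Fallback: the response has extra prose or is not well-formed XML
        match = WIKI_URL_PATTERN.search(response)
        return match.group(0) if match else None


print(parse_selected_url("<url>https://en.wikipedia.org/wiki/Sugar</url>"))
print(parse_selected_url("I pick https://en.wikipedia.org/wiki/Sugar next"))
print(parse_selected_url("no link here"))
```

The first two calls return the `Sugar` URL and the last returns `None`, which corresponds to the `could_not_parse_url` error branch in the rollout.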


In [10]:

if TEST_MODE:
    await rollout(model, WikihopScenario(step=0, target_url="https://en.wikipedia.org/wiki/Maple_syrup"))
    raise Exception("stopping early for test mode")

<a name="Loop"></a>

### Training Loop

The training loop is where the magic happens. For each of the 50 steps defined below, the rollout function will be called 32 times in parallel: 4 target pages per step, with 8 rollouts per target. Targets are taken from `TARGET_URLS` in order, cycling back to the start once the list is exhausted.

The `gather` step waits for all of the trajectories to be generated. The loop then deletes all but the most recent checkpoint and trains the model on the new trajectories.

Inference will be blocked until the training is complete.
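
One way to pick a fixed-size batch of targets per step, wrapping around the end of the list, is sketched below. The function name `select_batch` and the placeholder target names are assumptions made for this example.

```python
from typing import List, Sequence


def select_batch(targets: Sequence[str], step: int, batch_size: int) -> List[str]:
    """Return `batch_size` consecutive targets for `step`, wrapping past the end."""
    start = (step * batch_size) % len(targets)
    # Index each element modulo the list length so the batch never comes up short
    return [targets[(start + offset) % len(targets)] for offset in range(batch_size)]


targets = [f"page_{n}" for n in range(10)]
for step in range(4):
    print(step, select_batch(targets, step, batch_size=4))
```

Indexing element by element avoids the pitfall of slicing with a wrapped end index, where a slice such as `targets[12:0]` is empty.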


In [None]:
batch_size = 4

for i in range(await model.get_step(), 50):

    # Take `batch_size` consecutive targets, wrapping around the end of the list
    batch_start_idx = i * batch_size % len(TARGET_URLS)
    batch_urls = [
        TARGET_URLS[(batch_start_idx + j) % len(TARGET_URLS)] for j in range(batch_size)
    ]

    
    train_groups = await art.gather_trajectory_groups(
        (
            art.TrajectoryGroup(
                rollout(model, WikihopScenario(step=i, target_url=target_url)) for _ in range(8)
            )
            for target_url in batch_urls
        ),
        pbar_desc="gather",
    )
    await model.delete_checkpoints()
    await model.train(train_groups, config=art.TrainConfig(learning_rate=5e-5))

gather:   0%|          | 0/32 [00:00<?, ?it/s]

https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia

[34m[1mwandb[0m: Currently logged in as: [33mopenpipe[0m ([33mopenpipe-team[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Wandb run initialized! You can view it at https://wandb.ai/openpipe-team/wikihop-navigation/runs/003-wikihop


tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Packed 32 trajectories into 13 sequences of length 30720


train:   0%|          | 0/13 [00:00<?, ?it/s]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 10,000,000 | Num Epochs = 3 | Total steps = 30,000,000
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 1 x 1) = 2
 "-____-"     Trainable parameters = 34,406,400/14,000,000,000 (0.25% trained)


Unsloth: Will smartly offload gradients to save VRAM!


gather:   0%|          | 0/32 [00:00<?, ?it/s]

https://en.wikipedia.org/wiki/History_of_philosophy
https://en.wikipedia.org/wiki/History_of_philosophy
https://en.wikipedia.org/wiki/History_of_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/History_of_philosophy
https://en.wikipedia.org/wiki/History_of_philosophy
https://en.wikipedia.org/wiki/French_philosophy
https://en.wikipedia.org/wiki/French_philosophy
https://en.wikipedia.org/wiki/French_philosophy
https://en.wikipedia.org/wiki/American_philosophy
https://en.wikipedia.org/wiki/American_philosophy
https://en.wikipedia.org/wiki/American_philosophy
https://en.wikipedia.org/wiki/American_philosophy
https://en.wikipedia.org/wiki/French_philosophy
https://en.wikipedia.org/wiki/American_philosophy
https://en.wikipedia.org/wiki/Modern_philosophy
https://en.wikipedia.org/wiki/American_philosophy
http

train:   0%|          | 0/9 [00:00<?, ?it/s]

gather:   0%|          | 0/32 [00:00<?, ?it/s]

https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/British_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/British_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/History_of_philosophy
https://en.wikipedia.org/wiki/British_philosophy
https://en.wikipedia.org/wiki/British_philosophy
https://en.wikipedia.org/wiki/History_of_philosophy
https://en.wikipedia.org/wiki/History_of_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en.wikipedia.org/wiki/British_philosophy
https://en.wikipedia.org/wiki/Medieval_philosophy
https://en.wikipedia.org/wiki/Western_philosophy
https://en

### Using the Model

Just like that, you've trained an agent to navigate Wikipedia! Now it's time to use your model outside of ART, in the wild! The easiest way to do that is to load it from disk, where it was saved after each training step, and either run inference on it locally or upload it to a central hub like Hugging Face.

Check out the code below for a small demo of the model you just trained navigating Wikipedia to reach a target page!


In [33]:
import torch
from unsloth import FastLanguageModel


# example: .art/wikihop-navigation/models/001-wikihop/0003
lora_model_path = (
    f".art/{model.project}/models/{model.name}/{await model.get_step():04d}"
)

print(f"loading model from {lora_model_path}\n")

peft_model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=lora_model_path,
    max_seq_length=16384,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(peft_model)

# Demo navigation to a target page
current_url = STARTING_URL
target_url = random.choice(TARGET_URLS)
max_hops = 5
hop_number = 0

print(f"🎯 Target: {target_url}")
print(f"🚀 Starting from: {current_url}\n")

messages = [
    {
        "role": "system",
        "content": f"You are a Wikipedia navigator. Your goal is to reach the target page: {target_url}\n\nYou will be shown the first paragraph of each Wikipedia page. Select the link that is most likely to get you closer to your target. If you see a direct link to your target, choose it immediately.\n\nRespond with ONLY the full URL in this exact format: <url>https://en.wikipedia.org/wiki/YourChoice</url>",
    },
]

while hop_number < max_hops:
    # Check if we've reached the target
    if check_target_match(current_url, target_url):
        print(f"🎉 SUCCESS! Reached target in {hop_number} hops!")
        break
        
    # Read the first paragraph of the current page
    try:
        first_paragraph = read_first_paragraphs_with_links(current_url)
        if first_paragraph.startswith("Error"):
            print(f"❌ Error reading page: {first_paragraph}")
            break
    except Exception as e:
        print(f"❌ Exception reading page: {e}")
        break

    # Show current page
    page_title = current_url.split('/wiki/')[-1].replace('_', ' ')
    print(f"📖 Hop {hop_number + 1}: Currently on '{page_title}'")
    
    # Extract and show available links
    links = extract_urls(first_paragraph)
    print(f"🔗 Found {len(links)} links: {', '.join(links[:5])}{'...' if len(links) > 5 else ''}")
    
    # Add the current page's links to the conversation, using the same format as the training rollout
    links_text = "\n".join(links)
    page_content = f"Current page: {page_title}\nHop {hop_number + 1}/{max_hops}\n\nChoose from one of the following available links:\n{links_text}"
    messages.append({"role": "user", "content": page_content})

    inputs = tokenizer.apply_chat_template(
        messages, return_tensors="pt", add_generation_prompt=True
    ).to("cuda")

    def get_completion() -> str:
        with torch.no_grad():
            outputs = peft_model.generate(
                input_ids=inputs,
                max_new_tokens=256,
                do_sample=True,
                temperature=0.7,
                top_p=0.9,
            )
            return tokenizer.decode(
                outputs[0][inputs.shape[1] :], skip_special_tokens=True
            )

    try:
        content = get_completion()
        print(f"🤖 Model response: {content}")
        
        # Parse the URL from the response
        try:
            root = ET.fromstring(content)
            next_url = root.text
            if not next_url:
                raise ValueError("Empty URL in response")
            next_url = next_url.strip()
        except ET.ParseError:
            # Try to extract URL with regex as fallback (`re` is imported earlier in the notebook)
            url_pattern = r'https://en\.wikipedia\.org/wiki/[^\s<>]+'
            matches = re.findall(url_pattern, content)
            if matches:
                next_url = matches[0]
            else:
                print("❌ Could not parse URL from response")
                break
                
    except Exception as e:
        print(f"❌ Error generating completion: {e}")
        break

    messages.append({"role": "assistant", "content": content})

    # Show the selected link
    next_title = next_url.split('/wiki/')[-1].replace('_', ' ')
    print(f"🔗 Selected: '{next_title}' -> {next_url}")
    
    # Move to the next page
    current_url = next_url
    hop_number += 1
    print()

# Final result
if hop_number >= max_hops and not check_target_match(current_url, target_url):
    print(f"⏰ Reached maximum hops ({max_hops}) without finding target")
    final_title = current_url.split('/wiki/')[-1].replace('_', ' ')
    print(f"📍 Final page: '{final_title}'")
elif check_target_match(current_url, target_url):
    print(f"🎉 SUCCESS! Found target '{target_url}' in {hop_number} hops!")
else:
    print(f"❌ Navigation stopped due to error")



