To train this agent, click **Runtime** > **Run all**. Make sure you've enabled a free Tesla T4 GPU!

<div class="align-center">
<a href="https://github.com/openpipe/art"><img src="https://github.com/openpipe/art/raw/main/assets/ART_pill.png" height="50"></a>
<a href="https://discord.gg/zbBHRUpwf4"><img src="https://github.com/openpipe/art/raw/main/assets/Discord.png" height="50"></a>
<a href="https://art.openpipe.ai"><img src="https://github.com/openpipe/art/raw/main/assets/Documentation_pill.png" height="50"></a>

Questions? Join the Discord and ask away! For feature requests or to leave a star, visit our [GitHub](https://github.com/openpipe/art).

</div>

<a href="https://art.openpipe.ai/"><img src="https://github.com/openpipe/art/raw/main/assets/Header_separator.png" height="5"></a>

This notebook shows how to train a Qwen 2.5 3B model to solve Sudoku puzzles. It demonstrates how to set up a multi-turn agent, how to train it, and how to evaluate it.

Completions and metrics will be logged to Weights & Biases.


### Installation


In [None]:
%%capture
# Portions adapted from Unsloth Notebooks (https://github.com/unslothai/notebooks)
# Copyright (c) Unsloth contributors.
# License: GNU LGPL v3.0.
# Modifications by OpenPipe:
# - switched to uv
# - changed vllm/triton pinning logic
# - added protobuf pins
# See /licenses/LGPL-3.0.txt and /licenses/GPL-3.0.txt for full text.
import os

if "COLAB_" not in "".join(os.environ.keys()):
    !uv pip install openpipe-art[backend]==0.4.11 --prerelease allow --no-cache-dir
else:
    try:
        import numpy

        get_numpy = f"numpy=={numpy.__version__}"
    except ImportError:
        get_numpy = "numpy"
    try:
        import subprocess

        is_t4 = "Tesla T4" in str(subprocess.check_output(["nvidia-smi"]))
    except Exception:
        is_t4 = False
    get_vllm, get_triton = (
        ("vllm==0.9.2", "triton==3.2.0") if is_t4 else ("vllm", "triton")
    )
    !uv pip install --upgrade \
        openpipe-art[backend]==0.4.11 protobuf==5.29.5 {get_vllm} {get_numpy} --prerelease allow --no-cache-dir
    !uv pip install -qqq {get_triton}

### Environment Variables

Later in the notebook, we'll create a model that automatically logs metrics and chat completions to Weights & Biases. To enable this, provide your Weights & Biases API key as an environment variable.


In [None]:
import os

# Optional
WANDB_API_KEY = ""
if WANDB_API_KEY:
    os.environ["WANDB_API_KEY"] = WANDB_API_KEY

### Agentic Environment

<a name="Environment"></a>

ART allows your agent to learn by interacting with its environment. In this example, we'll create an environment in which the agent can solve Sudoku puzzles.

Feel free to read as much or as little of this section's code as you'd like. The important thing to understand is that we're defining the rules of this agent's environment. In many cases, this will already be defined by the task you're trying to solve, but if you need to define a custom environment, this is how you do it.
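
As a rough illustration of the kind of rule the environment enforces, here is a minimal, standalone sketch of the Sudoku placement check: a number may not repeat in its row, column, or 3x3 box. The names `placement_is_legal` and `example_board` are hypothetical and not part of the notebook. As a simplification, this sketch rejects any filled cell, whereas the notebook's `_is_valid_move` only protects the initial clues.

```python
from typing import List

Board = List[List[int]]  # 9x9 grid; 0 marks an empty cell


def placement_is_legal(board: Board, row: int, col: int, number: int) -> bool:
    """Return True if `number` can go at (row, col) without breaking a Sudoku rule."""
    if board[row][col] != 0:
        return False  # simplification: any filled cell is off limits
    if number in board[row]:
        return False  # already present in this row
    if any(board[r][col] == number for r in range(9)):
        return False  # already present in this column
    top, left = 3 * (row // 3), 3 * (col // 3)
    for r in range(top, top + 3):
        for c in range(left, left + 3):
            if board[r][c] == number:
                return False  # already present in this 3x3 box
    return True


# Hypothetical board with a single clue: a 5 in the top-left corner (cell A1).
example_board: Board = [[0] * 9 for _ in range(9)]
example_board[0][0] = 5
```

With this board, a second 5 is rejected anywhere in row A, column 1, or the top-left box, but accepted in the center cell E5.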


In [None]:
import random
import xml.etree.ElementTree as ET
from typing import Literal, TypedDict, List


class SudokuGame(TypedDict):
    board: List[List[int]]
    initial_board: List[List[int]]
    difficulty: Literal["easy", "medium", "hard"]


class PrimeSudokuEnv:
    """Prime Intellect style Sudoku Environment"""
    def __init__(self, difficulty: str = "easy"):
        self.difficulty = difficulty
        self.reset()

    def reset(self):
        if self.difficulty == "easy":
            self.board = [
                [5, 3, 0, 0, 7, 0, 0, 0, 0],
                [6, 0, 0, 1, 9, 5, 0, 0, 0],
                [0, 9, 8, 0, 0, 0, 0, 6, 0],
                [8, 0, 0, 0, 6, 0, 0, 0, 3],
                [4, 0, 0, 8, 0, 3, 0, 0, 1],
                [7, 0, 0, 0, 2, 0, 0, 0, 6],
                [0, 6, 0, 0, 0, 0, 2, 8, 0],
                [0, 0, 0, 4, 1, 9, 0, 0, 5],
                [0, 0, 0, 0, 8, 0, 0, 7, 9]
            ]
        else:
            self.board = [[0 for _ in range(9)] for _ in range(9)]

        self.initial_board = [row[:] for row in self.board]
        self.steps = 0
        self.solved = False
        return {"text": self._get_board_text()}

    def step(self, action):
        try:
            self._apply_move(action)
            self.steps += 1

            game_state = self._check_complete()
            if game_state == "solved":
                reward = 1.0
                self.solved = True
                done = True
            elif game_state == "invalid":
                reward = -0.5
                done = True
            else:
                reward = 0.1
                done = self.steps >= 81

        except ValueError as e:
            reward = -1.0
            done = False

        obs = {"text": self._get_board_text()}
        info = {"steps": self.steps, "solved": self.solved}
        return obs, reward, done, info

    def _apply_move(self, move_xml):
        try:
            root = ET.fromstring(move_xml)
            move_text = root.text.strip().upper()
        except Exception:
            raise ValueError("Invalid XML format")

        if '=' not in move_text:
            raise ValueError("Move must contain '='")

        cell, number_str = move_text.split('=')
        if len(cell) != 2:
            raise ValueError("Cell must be like 'A1'")

        row = ord(cell[0]) - 65
        col = int(cell[1]) - 1
        number = int(number_str)

        if row < 0 or row >= 9 or col < 0 or col >= 9:
            raise ValueError("Row or column out of bounds")
        if number < 1 or number > 9:
            raise ValueError("Number must be between 1-9")

        if not self._is_valid_move(row, col, number):
            raise ValueError(f"Invalid move: {move_text}")

        self.board[row][col] = number

    def _is_valid_move(self, row, col, number):
        if self.initial_board[row][col] != 0:
            return False

        if number in self.board[row]:
            return False

        if number in [self.board[i][col] for i in range(9)]:
            return False

        box_row, box_col = 3 * (row // 3), 3 * (col // 3)
        for i in range(box_row, box_row + 3):
            for j in range(box_col, box_col + 3):
                if self.board[i][j] == number:
                    return False

        return True

    def _check_complete(self):
        if any(0 in row for row in self.board):
            return "incomplete"

        for row in self.board:
            if sorted(row) != list(range(1, 10)):
                return "invalid"

        for col in range(9):
            column = [self.board[row][col] for row in range(9)]
            if sorted(column) != list(range(1, 10)):
                return "invalid"

        for box_row in range(0, 9, 3):
            for box_col in range(0, 9, 3):
                box = []
                for i in range(3):
                    for j in range(3):
                        box.append(self.board[box_row + i][box_col + j])
                if sorted(box) != list(range(1, 10)):
                    return "invalid"

        return "solved"

    def _get_board_text(self):
        board = self.board
        board_str = "   1 2 3   4 5 6   7 8 9\n"
        board_str += "  ┌───────┬───────┬───────┐\n"

        for i in range(9):
            if i in [3, 6]:
                board_str += "  ├───────┼───────┼───────┤\n"

            row_str = f"{chr(65+i)} │ "
            for j in range(9):
                if j in [3, 6]:
                    row_str += "│ "
                cell = board[i][j]
                row_str += f"{cell if cell != 0 else '.'} "

            row_str += "│\n"
            board_str += row_str

        board_str += "  └───────┴───────┴───────┘\n"
        return board_str

    def render(self):
        print(self._get_board_text())


def generate_sudoku_game(difficulty: str = "easy") -> SudokuGame:
    prime_env = PrimeSudokuEnv(difficulty)
    prime_env.reset()

    return {
        "board": [row[:] for row in prime_env.board],
        "initial_board": [row[:] for row in prime_env.initial_board],
        "difficulty": difficulty
    }


def render_sudoku_board(game: SudokuGame) -> str:
    board = game["board"]
    board_str = "   1 2 3   4 5 6   7 8 9\n"
    board_str += "  ┌───────┬───────┬───────┐\n"

    for i in range(9):
        if i in [3, 6]:
            board_str += "  ├───────┼───────┼───────┤\n"

        row_str = f"{chr(65+i)} │ "
        for j in range(9):
            if j in [3, 6]:
                row_str += "│ "
            cell = board[i][j]
            row_str += f"{cell if cell != 0 else '.'} "

        row_str += "│\n"
        board_str += row_str

    board_str += "  └───────┴───────┴───────┘\n"
    return board_str


def apply_agent_move(game: SudokuGame, move: str) -> None:
    prime_env = PrimeSudokuEnv(game["difficulty"])
    prime_env.board = [row[:] for row in game["board"]]
    prime_env.initial_board = [row[:] for row in game["initial_board"]]

    try:
        prime_env._apply_move(move)
        game["board"] = [row[:] for row in prime_env.board]
    except ValueError as e:
        raise ValueError(str(e))


def check_sudoku_complete(game: SudokuGame) -> Literal["solved", "invalid", "incomplete"]:
    prime_env = PrimeSudokuEnv(game["difficulty"])
    prime_env.board = [row[:] for row in game["board"]]
    prime_env.initial_board = [row[:] for row in game["initial_board"]]

    return prime_env._check_complete()
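
The move format accepted by `_apply_move` above can be illustrated in isolation. The following is a minimal sketch, not the notebook's implementation: `parse_move` is a hypothetical helper that turns a string such as `<move>A1=5</move>` into a zero-based row, a zero-based column, and the number to place, raising `ValueError` for anything malformed.

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def parse_move(move_xml: str) -> Tuple[int, int, int]:
    """Parse '<move>A1=5</move>' into zero-based (row, col) plus the number to place."""
    try:
        root = ET.fromstring(move_xml)
    except ET.ParseError as exc:
        raise ValueError("Invalid XML format") from exc

    text = (root.text or "").strip().upper()
    cell, sep, number_str = text.partition("=")
    if not sep or len(cell) != 2:
        raise ValueError("Move must look like 'A1=5'")

    row = ord(cell[0]) - ord("A")  # rows are lettered A-I
    col = int(cell[1]) - 1         # columns are numbered 1-9
    number = int(number_str)       # int() raises ValueError on non-numeric text

    if not (0 <= row < 9 and 0 <= col < 9 and 1 <= number <= 9):
        raise ValueError("Row, column, or number out of range")
    return row, col, number
```

Because every failure surfaces as a `ValueError`, a caller can treat unparseable output and out-of-range moves the same way.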

### Initializing the Backend

Now that we've defined the rules of our environment, we can set up ART. The cell below loads any environment variables from a `.env` file, seeds Python's random number generator, and creates a `LocalBackend` that stores its data under `./.art`.


In [None]:
from dotenv import load_dotenv

import art
from art.local import LocalBackend

load_dotenv()

random.seed(42)

backend = LocalBackend(path="./.art")

### Creating a Model

Next, we can create a model that will learn to solve Sudoku puzzles. We'll use a Qwen 2.5 3B model for this example. The `name` parameter will be associated with a wandb run, and the `base_model` parameter is the model that we'll be training a LoRA on top of.


In [None]:
import os

model = art.TrainableModel(
    name="001-sudoku-script",
    project="sudoku-solver",
    base_model="Qwen/Qwen2.5-3B-Instruct",
)
await model.register(backend)

### Defining a Rollout

<a name="Rollout"></a>

A rollout is a single episode of an agent performing its task. It generates one or more trajectories, which are lists of messages and choices.

In this example, the rollout function generates a Sudoku puzzle, and the agent plays it until the puzzle is finished or the agent makes an invalid move. It then returns a trajectory which contains all the `system` and `user` messages presented to the agent, as well as all the `choices` that the agent made.

When the game ends, the `reward` for the agent's performance is calculated based on whether the puzzle was solved, ended in an invalid or incomplete state, or was cut short by an illegal move. That reward is then assigned to the trajectory.
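
To make the reward values concrete, here is a small standalone sketch of the numbers the rollout function below works with. The outcome values (1, 0, and 0.5) and the penalty formula are taken from that function; the names `OUTCOME_REWARDS` and `invalid_move_reward` are hypothetical and exist only for this illustration.

```python
import math

# Reward assigned for each final board state, as in the rollout function.
OUTCOME_REWARDS = {"solved": 1.0, "invalid": 0.0, "incomplete": 0.5}


def invalid_move_reward(moves_made: int) -> float:
    """Penalty computed when a move is rejected after `moves_made` legal moves.

    The penalty starts at -1 and shrinks logarithmically toward 0, so an agent
    that survives longer before its first rejected move is punished less.
    """
    return -1 + math.log(moves_made + 1) / math.log(100)


for moves in (0, 9, 99):
    print(moves, round(invalid_move_reward(moves), 3))
```

A rejected first move scores -1, a rejection after 9 legal moves scores about -0.5, and the penalty only reaches 0 after 99 legal moves, which is more moves than a 9x9 board has cells.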

This rollout function will be called many times in parallel during each step of the training loop.


In [None]:
import math

import openai
import weave
from openai import AsyncOpenAI
from pydantic import BaseModel

import art

if os.getenv("WANDB_API_KEY", ""):
    print("initializing weave")
    weave.init(model.project, settings={"print_call_link": False})


class SudokuScenario(BaseModel):
    step: int


@weave.op
@art.retry(exceptions=(openai.LengthFinishReasonError,))
async def rollout(model: art.Model, scenario: SudokuScenario) -> art.Trajectory:
    game = generate_sudoku_game()

    trajectory = art.Trajectory(
        messages_and_choices=[
            {
                "role": "system",
                "content": "You are a Sudoku solver. Your goal is to solve the Sudoku puzzle completely and correctly. Always choose the move that brings you closer to solving the puzzle. Rows are labeled A-I and columns are numbered 1-9. Return your move as an XML element named 'move', like so: <move>A1=5</move>. Example moves are 'A1=1', 'B3=9', 'C2=7', etc. You can place numbers 1-9 in empty cells.",
            }
        ],
        metadata={
            "notebook-id": "sudoku-solver",
            "step": scenario.step,
        },
        reward=0,
    )

    move_number = 0

    while check_sudoku_complete(game) == "incomplete":
        trajectory.messages_and_choices.append(
            {"role": "user", "content": render_sudoku_board(game)}
        )

        messages = trajectory.messages()

        try:
            client = AsyncOpenAI(
                base_url=model.inference_base_url,
                api_key=model.inference_api_key,
            )

            chat_completion = await client.chat.completions.create(
                model=model.get_inference_name(),
                messages=messages,
                max_completion_tokens=128,
            )
        except openai.LengthFinishReasonError as e:
            raise e
        except Exception as e:
            print("caught exception generating chat completion")
            print(e)
            global failing_trajectory
            failing_trajectory = trajectory
            raise e

        choice = chat_completion.choices[0]
        content = choice.message.content
        assert isinstance(content, str)
        trajectory.messages_and_choices.append(choice)

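        try:
            apply_agent_move(game, content)
        except ValueError:
            # Malformed or illegal move: assign the penalty (milder the more legal
            # moves were made) and return immediately, so the outcome-based
            # rewards below cannot overwrite it.
            trajectory.reward = -1 + (math.log(move_number + 1) / math.log(100))
            trajectory.metrics["solved"] = 0
            trajectory.metrics["num_moves"] = move_number
            return trajectory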

        move_number += 1
        if check_sudoku_complete(game) != "incomplete":
            break

    winner = check_sudoku_complete(game)

    if winner == "solved":
        trajectory.reward = 1
        trajectory.metrics["solved"] = 1
    elif winner == "invalid":
        trajectory.reward = 0
        trajectory.metrics["solved"] = 0
    elif winner == "incomplete":
        trajectory.reward = 0.5
        trajectory.metrics["solved"] = 0.5

    trajectory.metrics["num_moves"] = move_number

    return trajectory

<a name="Loop"></a>

### Training Loop

The training loop is where the magic happens. For each of the 2 steps defined below (`TRAINING_STEPS`), the rollout function will be called 48 times in parallel (`ROLLOUTS_PER_STEP`). This means that 48 games will be played at once. Each game will produce a trajectory, which will be used to update the model.

The `gather` step will wait for all of the trajectories to be generated, then it will delete all but the most recent checkpoint and train the model on the new trajectories.

Inference will be blocked until the training is complete.
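
The idea of running many rollouts at once can be sketched with plain `asyncio`. This is only an analogy under stated assumptions: `fake_rollout`, `gather_rewards`, and `run_step` are hypothetical stand-ins, and the sketch does not reproduce what ART's `gather_trajectory_groups` does internally.

```python
import asyncio
import random
from typing import List


async def fake_rollout(seed: int) -> float:
    """Hypothetical stand-in for `rollout`: pretend to play a game, return a reward."""
    await asyncio.sleep(0)  # yield control, as a real inference request would
    return random.Random(seed).choice([0.0, 0.5, 1.0])


async def gather_rewards(rollouts_per_step: int) -> List[float]:
    """Start every rollout at once and wait until all of them have finished."""
    return await asyncio.gather(
        *(fake_rollout(seed) for seed in range(rollouts_per_step))
    )


def run_step(rollouts_per_step: int = 48) -> List[float]:
    # asyncio.run creates a fresh event loop, which suits a plain script.
    # Inside a notebook cell, use `await gather_rewards(...)` instead.
    return asyncio.run(gather_rewards(rollouts_per_step))
```

Calling `run_step(48)` in a script returns one reward per simulated game, in the same order the rollouts were started.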


In [None]:
TRAINING_STEPS = 2
ROLLOUTS_PER_STEP = 48
LEARNING_RATE = 5e-5

for i in range(await model.get_step(), TRAINING_STEPS):
    train_groups = await art.gather_trajectory_groups(
        (
            art.TrajectoryGroup(
                rollout(model, SudokuScenario(step=i))
                for _ in range(ROLLOUTS_PER_STEP)
            )
            for _ in range(1)
        ),
        pbar_desc="gather",
    )
    await model.delete_checkpoints()
    await model.train(train_groups, config=art.TrainConfig(learning_rate=LEARNING_RATE))

### Using the Model

Just like that, you've trained an agent to solve Sudoku puzzles! Now it's time to use your model outside of ART, in the wild! The easiest way to do that is to load it from disk, where it was saved after each training step, and either run inference on it locally or upload it to a central hub like Hugging Face.

Check out the code below for a small demo of the model you just trained solving a Sudoku puzzle!
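
The demo cell below looks for the latest checkpoint at a path built from the project name, the model name, and the zero-padded training step. As a minimal sketch of that layout, assuming the backend was created with `path="./.art"` as in this notebook, a hypothetical helper `checkpoint_dir` could build the same path:

```python
from pathlib import Path


def checkpoint_dir(project: str, name: str, step: int, root: str = ".art") -> Path:
    """Build the checkpoint directory for a given training step.

    The step is zero-padded to four digits, so step 3 becomes '0003'.
    """
    return Path(root) / project / "models" / name / "checkpoints" / f"{step:04d}"


print(checkpoint_dir("sudoku-solver", "001-sudoku-script", 3).as_posix())
```

If the directory does not exist, for example because no training step has completed yet, the demo cell does nothing.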


In [None]:
import os
from pathlib import Path

# example: .art/sudoku-solver/models/001-sudoku-script/checkpoints/0003
lora_model_path = (
    f".art/{model.project}/models/{model.name}/checkpoints/{await model.get_step():04d}"
)

if Path(lora_model_path).exists():
    import torch
    from unsloth import FastLanguageModel

    print(f"loading model from {lora_model_path}\n")

    peft_model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=lora_model_path,
        max_seq_length=16384,
        dtype=torch.bfloat16,
        load_in_4bit=True,
    )
    FastLanguageModel.for_inference(peft_model)

    game = generate_sudoku_game()
    move_number = 0

    messages = [
        {
            "role": "system",
            "content": "You are a Sudoku solver. Your goal is to solve the Sudoku puzzle completely and correctly. Return your move as an XML object with a single property 'move', like so: <move>A1=5</move>.",
        },
    ]

    while check_sudoku_complete(game) == "incomplete":
        rendered_board = render_sudoku_board(game)
        messages.append({"role": "user", "content": rendered_board})

        inputs = tokenizer.apply_chat_template(
            messages, return_tensors="pt", add_generation_prompt=True
        ).to("cuda")

        content = ""

        def get_completion() -> str:
            with torch.no_grad():
                outputs = peft_model.generate(
                    input_ids=inputs,
                    max_new_tokens=100,
                    do_sample=True,
                    temperature=0.7,
                    top_p=0.9,
                )
                return tokenizer.decode(
                    outputs[0][inputs.shape[1] :], skip_special_tokens=True
                )

        try:
            content = get_completion()
        except Exception as e:
            print("caught exception generating chat completion", e)
            raise e

        messages.append({"role": "assistant", "content": content})

        try:
            apply_agent_move(game, content)
            move_number += 1
        except ValueError as e:
            print(f"Invalid move on move {move_number}: {content}")
            print(f"Reason: {e}")
            # stop here so repeated invalid moves cannot loop forever
            break

        # print the board every move
        print(f"\nmove {move_number}")
        print(f"board:\n{rendered_board}")
        print(f"agent move: {content}")
        print(f"updated board:\n{render_sudoku_board(game)}")

        if check_sudoku_complete(game) != "incomplete":
            break

    winner = check_sudoku_complete(game)

    print(f"game finished in {move_number} moves")

    if winner == "solved":
        print("puzzle solved! 💪")
    elif winner == "invalid":
        print("invalid solution! 😢")
    elif winner == "incomplete":
        print("puzzle incomplete! 🤷‍♂️")

    print(f"final board:\n\n{render_sudoku_board(game)}")

<div class="align-center">
<a href="https://github.com/openpipe/art"><img src="https://github.com/openpipe/art/raw/main/assets/ART_pill.png" height="50"></a>
<a href="https://discord.gg/zbBHRUpwf4"><img src="https://github.com/openpipe/art/raw/main/assets/Discord.png" height="50"></a>
<a href="https://openpipe.ai/blog/art-e-mail-agent"><img src="https://github.com/openpipe/art/raw/main/assets/ART_E_pill.png" height="50"></a>

Questions? Join the Discord and ask away! For feature requests or to leave a star, visit our [GitHub](https://github.com/openpipe/art).

</div>
