<a href="https://colab.research.google.com/github/bianhao123/APTOS2019BlindnessDetection/blob/master/examples/2048/2048.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To train this agent, click **Runtime** > **Run all**. Make sure you've set your `WANDB_API_KEY`.

<div class="align-center">
<a href="https://github.com/openpipe/art"><img src="https://github.com/openpipe/art/raw/main/assets/ART_pill.png" height="50"></a>
<a href="https://discord.gg/zbBHRUpwf4"><img src="https://github.com/openpipe/art/raw/main/assets/Discord.png" height="50"></a>
<a href="https://art.openpipe.ai"><img src="https://github.com/openpipe/art/raw/main/assets/Documentation_pill.png" height="50"></a>

Questions? Join the Discord and ask away! For feature requests or to leave a star, visit our [GitHub](https://github.com/openpipe/art).

</div>

<a href="https://art.openpipe.ai/"><img src="https://github.com/openpipe/art/raw/main/assets/Header_separator.png" height="5"></a>

This notebook shows how to train a Qwen 3 14B model to play 2048. It will demonstrate how to set up a multi-turn agent, how to train it, and how to evaluate it.

Completions, metrics, and model checkpoints will be saved to Weights & Biases.


### Installation


In [1]:
# Install a pinned version of the openpipe-art library with uv pip
!uv pip install openpipe-art==0.5.0

Using Python 3.12.12 environment at: /usr
Resolved 86 packages in 2.06s
Prepared 17 packages in 873ms
Uninstalled 2 packages in 88ms
Installed 18 packages in 87ms
 + abnf==2.2.0
 + backoff==2.2.1
 + cint==1.0.0
 + diskcache==5.6.3
 + eval-type-backport==0.2.2
 + fickling==0.1.4
 + gql==4.0.0
 + graphql-core==3.2.6
 + intervaltree==3.1.0
 + kaitaistruct==0.11
 + litellm==1.74.1
 - openai==1.109.1
 + openai==1.99.1
 + openpipe-art==0.5.0
 + pdfminer-six==20250506
 + polyfile-we

### Environment Variables

Later in the notebook, we'll create a model that automatically logs metrics to Weights & Biases and chat completions to Weave. To do so, you'll need to provide your Weights & Biases API key as an environment variable.

*If you don't already have a W&B API key, you can get one [here](https://wandb.ai/home).*


In [2]:
import os

# Set the WANDB_API_KEY environment variable. In real use, store the key with
# Colab's Secrets feature instead of hardcoding it in the notebook.
os.environ["WANDB_API_KEY"] = "YOUR_WANDB_API_KEY"

# Fail early if the key has not been set.
if not os.environ.get("WANDB_API_KEY"):
    raise ValueError("WANDB_API_KEY is required for inference, training, and logging to Weights & Biases.")
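
To avoid pasting the key into a cell, one option is to read it from Colab Secrets and fall back to the process environment. The sketch below assumes a secret named `WANDB_API_KEY` has been added in the Colab sidebar; the helper name `load_wandb_api_key` is hypothetical and not part of ART or W&B.

```python
import os


def load_wandb_api_key() -> str:
    """Return the W&B API key from Colab Secrets if available, else from the environment."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata

        key = userdata.get("WANDB_API_KEY")
        if key:
            return key
    except Exception:
        # Not running in Colab, or the secret is missing or not shared with this notebook.
        pass

    key = os.environ.get("WANDB_API_KEY")
    if not key:
        raise ValueError("WANDB_API_KEY was not found in Colab Secrets or the environment.")
    return key
```

Calling `os.environ["WANDB_API_KEY"] = load_wandb_api_key()` once at the top of the notebook would then make the key visible to the later cells without storing it in the notebook file.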

### Agentic Environment

<a name="Environment"></a>

ART allows your agent to learn by interacting with its environment. In this example, we'll create an environment in which the agent can play 2048.

Feel free to read as much or as little of this section's code as you'd like. The important thing to understand is that we're defining the rules of this agent's environment. In many cases, this will already be defined by the task you're trying to solve, but if you need to define a custom environment, this is how you do it.

NOTE: To speed up training, we're reducing the winning value from 2048 to 64, which in turn reduces the minimum number of moves to win.
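
The core rule of the environment is how a single row or column slides and merges. As a minimal standalone sketch of that rule (the helper name `merge_toward_front` is hypothetical and only mirrors the idea used by the environment code in this section):

```python
from typing import Optional


def merge_toward_front(cells: list[Optional[int]]) -> list[Optional[int]]:
    """Slide tiles toward index 0, merging equal neighbours at most once per move."""
    tiles = [cell for cell in cells if cell is not None]  # drop the gaps
    merged: list[Optional[int]] = []
    i = 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)  # two equal tiles become one doubled tile
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    # Pad with empty cells so the length is unchanged.
    return merged + [None] * (len(cells) - len(merged))


print(merge_toward_front([2, 2, 4, None]))  # [4, 4, None, None]
print(merge_toward_front([2, 2, 2, 2]))  # [4, 4, None, None]
# A move in the opposite direction reverses the sequence before and after merging.
print(merge_toward_front([2, 2, 4, None][::-1])[::-1])  # [None, None, 4, 4]
```

Note that a freshly merged tile does not merge again in the same move, which is why `[2, 2, 4, None]` becomes `[4, 4, None, None]` rather than `[8, None, None, None]`.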


In [3]:
import random
import string
import xml.etree.ElementTree as ET
from typing import Literal, TypedDict

from dotenv import load_dotenv

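# random generates random values, string supplies the alphabet for game IDs,
# xml.etree.ElementTree parses the agent's XML moves, and typing provides type hints.
# load_dotenv loads environment variables from a .env file, if one exists.
load_dotenv()

# Winning tile value, lowered from 2048 to 64 to speed up training.
WINNING_VALUE = 64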


# State of a 2048 game.
class TwentyFortyEightGame(TypedDict):
    id: str  # game ID
    board: list[list[int | None]]  # nested lists of cells; None marks an empty cell


# Fill one random empty cell on the board with a 2 or a 4.
def populate_random_cell(game: TwentyFortyEightGame) -> None:
    # Coordinates of every empty cell on the board.
    all_clear_coordinates = [
        (i, j)
        for i in range(len(game["board"]))
        for j in range(len(game["board"][i]))
        if game["board"][i][j] is None
    ]
    # Pick one empty cell at random.
    random_clear_coordinates = random.choice(all_clear_coordinates)
    # Fill it with a 2 (90% of the time) or a 4 (10% of the time).
    game["board"][random_clear_coordinates[0]][random_clear_coordinates[1]] = (
        2 if random.random() < 0.9 else 4
    )


# Generate a new game of 2048. board_length is the side length of the board (default 4).
def generate_game(board_length: int = 4) -> TwentyFortyEightGame:
    # Random 6-character game ID.
    id = "".join(random.choices(string.ascii_letters + string.digits, k=6))
    # Start from an empty board.
    game = {
        "id": id,
        "board": [[None for _ in range(board_length)] for _ in range(board_length)],
    }

    # Populate two random cells.
    populate_random_cell(game)
    populate_random_cell(game)

    return game


# Render the board in a human-readable format.
def render_board(game: TwentyFortyEightGame) -> str:
    board = game["board"]
    # The output looks like this, where _ marks an empty cell:
    #   _|  2|  _|  4
    #   4|  8|  2| 16
    #  16| 32| 64|128
    #   _|  2|  2|  4

    # Width of the widest non-empty cell, used to align the columns.
    max_cell_width = max(
        [len(str(cell)) for row in board for cell in row if cell is not None]
    )

    board_str = ""
    for row in board:
        # Right-justify every cell to the same width and join the row with "|".
        board_str += "|".join(
            [
                str(cell).rjust(max_cell_width)
                if cell is not None
                else "_".rjust(max_cell_width)
                for cell in row
            ]
        )
        board_str += "\n"
    return board_str


# Condense a sequence, merging matching cells at the start of the sequence first.
# Sequences should be passed starting with the cells furthest along in the
# direction the board is being condensed.
def condense_sequence(sequence: list[int | None]) -> list[int | None]:
    condensed_sequence = []

    # Drop the empty cells so that only the tiles remain.
    gapless_sequence = [cell for cell in sequence if cell is not None]

    i = 0
    while i < len(gapless_sequence):
        if (
            i + 1 < len(gapless_sequence)
            and gapless_sequence[i] == gapless_sequence[i + 1]
        ):
            # Two equal neighbours merge into one tile of double the value;
            # skip the second tile so it cannot merge again.
            condensed_sequence.append(gapless_sequence[i] * 2)
            i += 2
        else:
            # No match (or last tile): carry the tile over unchanged.
            condensed_sequence.append(gapless_sequence[i])
            i += 1

    # Pad the end with None so the result has the same length as the input
    # (previously hardcoded to 4, which broke boards of any other size).
    return condensed_sequence + [None] * (len(sequence) - len(condensed_sequence))


# Condense the board in the given direction.
def condense_board(
    game: TwentyFortyEightGame, direction: Literal["left", "right", "up", "down"]
) -> None:
    if direction == "left":
        for row in game["board"]:
            # Condense the row and write it back in place.
            condensed_row = condense_sequence(row)
            for i in range(len(row)):
                row[i] = condensed_row[i]

    if direction == "right":
        for row in game["board"]:
            # Reverse the row before and after condensing.
            reversed_row = row[::-1]
            condensed_row = condense_sequence(reversed_row)[::-1]
            for i in range(len(row)):
                row[i] = condensed_row[i]

    if direction == "up":
        for col_index in range(len(game["board"][0])):
            # Extract the column, condense it, and write it back.
            column = [row[col_index] for row in game["board"]]
            condensed_column = condense_sequence(column)
            for row_index in range(len(column)):
                game["board"][row_index][col_index] = condensed_column[row_index]

    if direction == "down":
        for col_index in range(len(game["board"][0])):
            # Extract the column, then reverse it before and after condensing.
            column = [row[col_index] for row in game["board"]]
            reversed_column = column[::-1]
            condensed_column = condense_sequence(reversed_column)[::-1]
            for row_index in range(len(column)):
                game["board"][row_index][col_index] = condensed_column[row_index]


# Apply the agent's move to the game board.
def apply_agent_move(game: TwentyFortyEightGame, move_xml: str) -> None:
    direction = None
    # Parse the move: the text of the XML root element is the direction.
    try:
        root = ET.fromstring(move_xml)
        direction = root.text
    except Exception:
        raise ValueError("Invalid xml")

    if direction not in ["left", "right", "up", "down"]:
        raise ValueError("Invalid direction")

    # Condense the board in the chosen direction, then spawn a new tile.
    condense_board(game, direction)

    populate_random_cell(game)


# Return the largest cell value on the board.
def max_cell_value(game: TwentyFortyEightGame) -> int:
    return max([cell for row in game["board"] for cell in row if cell is not None])


# Check whether the game is finished.
def check_game_finished(game: TwentyFortyEightGame) -> bool:
    # The game is won once any cell reaches the winning value.
    if max_cell_value(game) >= WINNING_VALUE:
        return True

    # The game continues while any cell is still empty.
    if any(cell is None for row in game["board"] for cell in row):
        return False

    # A full 2048 implementation would also check whether any merge is still possible.
    # This simplified version ends the game as soon as the board is full without a win.
    return True


# Return the sum of all cell values on the board.
def total_board_value(game: TwentyFortyEightGame) -> int:
    return sum([cell for row in game["board"] for cell in row if cell is not None])
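
The simplified end-of-game check treats a full board as a loss even when adjacent equal tiles could still merge. For comparison, here is a hedged sketch of the stricter rule used by standard 2048; the name `has_valid_move` is hypothetical and this helper is not part of the notebook's environment.

```python
from typing import Optional

Board = list[list[Optional[int]]]


def has_valid_move(board: Board) -> bool:
    """Return True if any cell is empty or any two adjacent cells hold equal values."""
    for r, row in enumerate(board):
        for c, cell in enumerate(row):
            if cell is None:
                return True  # an empty cell always allows a move
            if c + 1 < len(row) and row[c + 1] == cell:
                return True  # horizontal merge available
            if r + 1 < len(board) and board[r + 1][c] == cell:
                return True  # vertical merge available
    return False


print(has_valid_move([[2, 4], [4, 2]]))  # False: full board, no equal neighbours
print(has_valid_move([[2, 2], [4, 8]]))  # True: the two 2s can merge
```

Adopting this rule would also require the environment to handle moves that leave the board full, since `populate_random_cell` calls `random.choice` on the list of empty cells and would fail if that list were empty.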

### Creating a Model

Now that we've defined the rules of our environment, we can create a model that will learn to play 2048. We'll use a Qwen 3 14B model for this example. The `name` parameter will be associated with a wandb run, and the `base_model` parameter is the model that we'll be training a LoRA on top of. `ServerlessBackend` hooks into Serverless RL through W&B Training to autoscale GPUs as your job progresses.


In [4]:
from dotenv import load_dotenv

import art
from art.serverless.backend import ServerlessBackend

# load_dotenv loads environment variables, art is the ART framework, and
# ServerlessBackend is ART's serverless backend.
load_dotenv()

# Seed the random number generator for reproducibility.
random.seed(42)

# Declare the model.
model = art.TrainableModel(
    name="agent-001",  # model name, associated with a W&B run
    project="2048",  # Weights & Biases project name
    base_model="OpenPipe/Qwen3-14B-Instruct",  # base model that the LoRA is trained on top of
)

# Initialize the backend. Training and inference will run on Weights & Biases servers.
backend = ServerlessBackend()

# Register the model with the serverless backend (sets up logging, inference, and training).
await model.register(backend)

### Defining a Rollout

<a name="Rollout"></a>

A rollout is a single episode of an agent performing its task. It generates one or more trajectories, which are lists of messages and choices.

In this example, the rollout function generates a game of 2048, and the agent plays it until the game is finished. It then returns a trajectory which contains all the `system` and `user` messages presented to the agent, as well as all the `choices` that the agent made.

When the game is finished, the `reward` for the agent's performance is calculated from the highest cell value and the total value of the board, then assigned to the trajectory. A malformed or invalid move ends the game immediately with a reward of -1.

This rollout function will be called many times in parallel during each step of the training loop.
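
To make the reward concrete, here is a standalone sketch of the shaping formula used at the end of a game, restated with `math.log2`. The function name `shaped_reward` is hypothetical; the constants mirror the notebook's `WINNING_VALUE = 64` and 4x4 board.

```python
import math

WINNING_VALUE = 64  # same reduced target as the environment


def shaped_reward(max_value: int, board_value: int) -> float:
    """Reward for a finished game: 2 for a win, otherwise a log-scaled partial score."""
    if max_value >= WINNING_VALUE:
        return 2.0
    # Highest tile, log-scaled so that 2 maps to 0 and WINNING_VALUE maps to 1.
    max_value_reward = (math.log2(max_value) - 1) / (math.log2(WINNING_VALUE) - 1)
    # Board total, log-scaled so that 2 maps to 0 and WINNING_VALUE * 16 maps to 1.
    board_value_reward = (math.log2(board_value) - 1) / (
        math.log2(WINNING_VALUE * 16) - 1
    )
    # The highest tile dominates; the board total contributes at 20% weight.
    return max_value_reward + 0.2 * board_value_reward


# A lost game with a top tile of 32 and a board total of 128:
# (5 - 1) / (6 - 1) + 0.2 * (7 - 1) / (10 - 1) = 0.8 + 0.1333...
print(round(shaped_reward(32, 128), 3))  # 0.933
print(shaped_reward(64, 200))  # 2.0
```

Because a lost game always scores below 1 + 0.2 = 1.2, a win (reward 2) is always strictly better than any loss, and any completed game scores higher than the -1 given for an invalid move.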


In [5]:
import math

import requests
import weave
from openai import AsyncOpenAI
from pydantic import BaseModel

import art

# math is used for the reward's log scaling, requests provides the ReadTimeout exception,
# weave traces calls, AsyncOpenAI is the async OpenAI client, and BaseModel is Pydantic's base class.

# Initialize Weave for the model's project without printing call links.
weave.init(model.project, settings={"print_call_link": False})

# Pydantic model describing a 2048 scenario.
class Scenario2048(BaseModel):
    step: int  # training step the scenario belongs to


# weave.op records each call as a Weave operation; art.retry retries the rollout
# when a requests.ReadTimeout is raised. Note the trailing comma, which makes
# `exceptions` a one-element tuple rather than a bare class.
@weave.op
@art.retry(exceptions=(requests.ReadTimeout,))
async def rollout(model: art.Model, scenario: Scenario2048) -> art.Trajectory:
    # Play one game of 2048 with the given model and return the resulting trajectory.

    # Client for the model's inference endpoint.
    client = AsyncOpenAI(
        base_url=model.inference_base_url,
        api_key=model.inference_api_key,
    )
    game = generate_game()

    move_number = 0

    trajectory = art.Trajectory(
        messages_and_choices=[
            # System message telling the agent how to play and how to format its move.
            {
                "role": "system",
                "content": "You are an excellent 2048 player. Always choose the move most likely to lead to combine cells to eventually reach the number 2048. Optional moves are 'left', 'right', 'up', 'down'. Return your move as an XML object with a single property 'move', like so: <move>left</move>",
            }
        ],
        metadata={
            "game_id": game["id"],  # game ID
            "notebook-id": "2048",  # notebook ID
            "step": scenario.step,  # current training step
        },
        reward=0,  # placeholder until the game ends
    )

    while True:
        # Show the agent the current board as a user message.
        trajectory.messages_and_choices.append(
            {"role": "user", "content": render_board(game)}
        )

        try:
            # Request the agent's next move given the conversation so far.
            messages = trajectory.messages()
            chat_completion = await client.chat.completions.create(
                max_completion_tokens=128,  # cap on generated tokens
                messages=messages,
                model=model.get_inference_name(),
            )
        except Exception as e:
            # Log the failure, then re-raise with the original traceback.
            print("caught exception generating chat completion", e)
            raise

        choice = chat_completion.choices[0]
        content = choice.message.content
        assert isinstance(content, str)
        # Record the model's choice in the trajectory.
        trajectory.messages_and_choices.append(choice)

        try:
            apply_agent_move(game, content)
            move_number += 1
        except ValueError:
            # Malformed XML or an invalid direction ends the game with a penalty.
            trajectory.reward = -1
            break

        if check_game_finished(game):
            max_value = max_cell_value(game)
            board_value = total_board_value(game)
            # Log the outcome as trajectory metrics.
            trajectory.metrics["max_value"] = max_value
            trajectory.metrics["board_value"] = board_value
            trajectory.metrics["move_number"] = move_number

            # Compute the reward from the outcome of the game:
            # get as close to the winning value as possible,
            # otherwise maximize the number of high-value cells on the board,
            # but above all else: WIN THE GAME!
            if max_value < WINNING_VALUE:
                # Game lost. Scale the max value logarithmically between
                # 0 (for 2) and 1 (for WINNING_VALUE).
                max_value_reward = (math.log(max_value, 2) - 1) / (
                    math.log(WINNING_VALUE, 2) - 1
                )
                # Scale the board total logarithmically; with this formula a total
                # of 2 maps to 0 and WINNING_VALUE * 16 maps to 1.
                board_value_reward = (math.log(board_value, 2) - 1) / (
                    math.log(WINNING_VALUE * 16, 2) - 1
                )
                # Combine the two rewards, weighting the max value more heavily.
                trajectory.reward = max_value_reward + (board_value_reward * 0.2)
            else:
                # Game won: fixed reward of 2, above anything a lost game can earn.
                trajectory.reward = 2
            break
    # Return the completed trajectory.
    return trajectory

weave: wandb version 0.22.3 is available!  To upgrade, please run:
weave:  $ pip install wandb --upgrade
weave: Logged in as Weights & Biases user: bianhao123.
weave: View Weave data at https://wandb.ai/bianhao_team/2048/weave


<a name="Loop"></a>

### Training Loop

The training loop is where the magic happens. For each of the 20 steps defined below, the rollout function will be called 18 times in parallel. This means that 18 games will be played at once. Each game will produce a trajectory, which will be used to update the model.

The `gather` step will wait for all of the trajectories to be generated, then it will delete all but the best-performing and most recent checkpoints and train the model on the new trajectories. Inference will be blocked until the training is complete.

While training executes, track your agent's metrics and traces in [W&B](https://wandb.ai/home).
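
The idea of running many rollouts concurrently while tolerating a bounded number of failures can be illustrated with plain `asyncio`. This is only a conceptual sketch, not how ART implements `gather_trajectory_groups`; `play_one_game` and `gather_games` are hypothetical stand-ins.

```python
import asyncio


async def play_one_game(game_index: int) -> float:
    """Stand-in for a rollout: returns a fake reward, or fails for some games."""
    await asyncio.sleep(0)  # yield control, as a real inference call would
    if game_index % 6 == 5:
        raise TimeoutError(f"game {game_index} timed out")
    return float(game_index)


async def gather_games(n_games: int, max_exceptions: int) -> list[float]:
    """Run all games concurrently and keep the results of those that succeeded."""
    results = await asyncio.gather(
        *(play_one_game(i) for i in range(n_games)),
        return_exceptions=True,  # collect failures instead of aborting the batch
    )
    failures = [r for r in results if isinstance(r, Exception)]
    if len(failures) > max_exceptions:
        raise failures[0]
    return [r for r in results if not isinstance(r, Exception)]


# Inside a notebook, use `await gather_games(18, max_exceptions=18)` instead of asyncio.run.
rewards = asyncio.run(gather_games(18, max_exceptions=18))
print(len(rewards))  # 15 of the 18 simulated games succeed
```

Setting the failure budget equal to the batch size, as the training cell does with `max_exceptions=18`, means a step never aborts because of failed rollouts; a smaller budget trades robustness for earlier detection of systematic errors.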


In [None]:
# Train from the model's current step up to step 19 (20 steps in total).
for i in range(await model.get_step(), 20):
    # Generate the training trajectories in parallel.
    train_groups = await art.gather_trajectory_groups(
        (
            # One group of 18 rollouts, each playing a full game.
            art.TrajectoryGroup(rollout(model, Scenario2048(step=i)) for _ in range(18))
            for _ in range(1)
        ),
        pbar_desc="gather",  # progress bar label
        max_exceptions=18,  # tolerate up to 18 failed rollouts
    )
    # Delete all checkpoints except the most recent one and the best one by train/reward.
    await model.delete_checkpoints('train/reward')
    # Train the model on the collected trajectories.
    await model.train(
        train_groups,
        config=art.TrainConfig(learning_rate=1e-5),
    )


### Using the Model

Just like that, you've trained an agent to play 2048! Now it's time to use your model outside of ART, in the wild! The easiest way to do that is to create an OpenAI client and make a chat completion request to W&B Inference, where it's already deployed 😊.

Check out the code below for a small demo of the model you just trained playing 2048!
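
The demo stops with an error if the model returns anything other than a well-formed move. As a standalone sketch of that validation (the helper name `parse_move` is hypothetical; it mirrors the checks in `apply_agent_move` without touching a board):

```python
import xml.etree.ElementTree as ET

VALID_DIRECTIONS = ("left", "right", "up", "down")


def parse_move(move_xml: str) -> str:
    """Extract the direction from a reply such as '<move>left</move>'."""
    try:
        root = ET.fromstring(move_xml)
    except ET.ParseError as exc:
        raise ValueError("Invalid xml") from exc
    # Only the element's text is checked, not its tag name or surrounding whitespace.
    if root.text not in VALID_DIRECTIONS:
        raise ValueError("Invalid direction")
    return root.text


print(parse_move("<move>left</move>"))  # left

for bad_reply in ("left", "<move>diagonal</move>"):
    try:
        parse_move(bad_reply)
    except ValueError as err:
        print(f"{bad_reply!r} rejected: {err}")
```

A bare word such as `left` is rejected because it is not XML, and `<move>diagonal</move>` is rejected because the direction is not one of the four allowed values.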


In [None]:
from openai import AsyncOpenAI

# Most recent training step of the model.
last_step = await model.get_step()

# Name of the model deployed on W&B Inference, in the form <model name>:step<N>.
deployed_inference_model_name = f"{model.get_inference_name()}:step{last_step}"

print(f"step {last_step} deployed as {deployed_inference_model_name}")

# OpenAI client pointed at the model's inference endpoint.
client = AsyncOpenAI(
    base_url=model.inference_base_url,
    api_key=model.inference_api_key,
)

# Start a new game of 2048.
game = generate_game()
move_number = 0

# Conversation history, starting with the system message that instructs the agent.
messages = [
    {
        "role": "system",
        "content": "You are an excellent 2048 player. Always choose the move most likely to lead to combine cells to eventually reach the number 2048. Optional moves are 'left', 'right', 'up', 'down'. Return your move as an XML object with a single property 'move', like so: <move>left</move>",
    },
]

while not check_game_finished(game):
    # Show the agent the current board as a user message.
    rendered_board = render_board(game)
    messages.append({"role": "user", "content": rendered_board})

    try:
        # Ask the deployed model for its next move, passing the full history.
        content = (await client.chat.completions.create(
            model=deployed_inference_model_name,
            messages=messages,
        )).choices[0].message.content
    except Exception as e:
        # Log the failure, then re-raise with the original traceback.
        print("caught exception generating chat completion", e)
        raise

    # Record the model's reply as an assistant message.
    messages.append({"role": "assistant", "content": content})

    try:
        apply_agent_move(game, content)
        move_number += 1
    except ValueError:
        # Report which move was invalid and what the model returned.
        raise ValueError(f"Invalid move on move {move_number}: {content}")

    # Every 10 moves, print the move number, the board before the move,
    # the agent's move, and the updated board.
    if move_number % 10 == 0:
        print(f"\nmove {move_number}")
        print(f"board:\n{rendered_board}")
        print(f"agent move: {content}")
        print(f"updated board:\n{render_board(game)}")


print(f"game finished in {move_number} moves")

max_value = max_cell_value(game)
board_value = total_board_value(game)

# The game is won if the largest cell reached the winning value.
if max_value >= WINNING_VALUE:
    print("game won! 💪")
else:
    print("game lost! 😢")


# Print the final board along with its largest cell and total value.
print(f"final board:\n\n{render_board(game)}")
print(f"max value: {max_value}")
print(f"board value: {board_value}")

step 10 deployed as wandb-artifact:///openpipe/2048/agent-001:step10

move 10
board:
_|2|4|2
_|_|4|4
2|_|4|2
_|_|_|_

agent move: <move>down</move>
updated board:
_|_|_|_
4|_|_|2
_|_|4|4
2|2|8|2


move 20
board:
 _| _| 2| 4
 _| _| _|16
 _| _| 2| 8
 _| 4| 2| 8

agent move: <move>up</move>
updated board:
 _| 4| 4| 4
 _| _| 2|16
 _| 2| _|16
 _| _| _| _


move 30
board:
 2| _| _| _
 2| 4| _| _
 2| 8| 2| _
 2| 8| 4|32

agent move: <move>down</move>
updated board:
 _| _| _| 2
 _| _| _| _
 4| 4| 2| _
 4|16| 4|32


move 40
board:
 _| _|16| 4
 _| 8|32| 8
 _| _|16| 4
 _| 2| _| _

agent move: <move>down</move>
updated board:
 2| _| _| _
 _| _|16| 4
 _| 8|32| 8
 _| 2|16| 4


move 50
board:
 _| _| _| 4
 2| _|16| 4
 _| 4| 8|32
 2|16| 8|16

agent move: <move>up</move>
updated board:
 4| 4|16| 8
 _|16|16|32
 _| _| _|16
 _| _| _| 2


move 60
board:
 8| _| _| _
 2| 4| 2| _
32|64| _| 2
 4| 8| 8| _

agent move: <move>down</move>
updated board:
 8| _| _| _
 2| 4| 2| _
32|64| 2| _
 4| 8| 8| 2


move 70
boar

<div class="align-center">
<a href="https://github.com/openpipe/art"><img src="https://github.com/openpipe/art/raw/main/assets/ART_pill.png" height="50"></a>
<a href="https://discord.gg/zbBHRUpwf4"><img src="https://github.com/openpipe/art/raw/main/assets/Discord.png" height="50"></a>
<a href="https://openpipe.ai/blog/art-e-mail-agent"><img src="https://github.com/openpipe/art/raw/main/assets/ART_E_pill.png" height="50"></a>

Congratulations! Now that you've seen a basic notebook, try training a more realistic [email search agent](https://colab.research.google.com/github/openpipe/art-notebooks/blob/main/examples/art-e.ipynb).

If you have questions along the way, join the Discord and ask away! For feature requests or to leave a star, visit our [GitHub](https://github.com/openpipe/art).

</div>
