# Solving the Pole-Balancing Problem (CartPole) with DQN

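In this notebook, we implement a Deep Q-Network (DQN) that learns to play the `CartPole-v1` game. The goal is to train an agent that keeps the pole balanced for as long as possible by pushing the cart left or right.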

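**Core concepts:**
1.  **Q-Network**: We use a simple fully connected neural network to approximate the Q-function. The input is the environment state (4 values) and the output is a Q-value for each possible action (push left or push right).
2.  **Experience Replay**: A buffer stores past transitions. Conceptually each one is a `(state, action, reward, next state, done)` tuple; the code in this notebook stores `(state, action, next_state, reward)` and marks a terminal step with `next_state = None`. During training, a mini-batch is sampled at random from the buffer, which breaks the correlation between consecutive samples and makes training more stable.
3.  **Target Network**: A separate, slowly updated target network computes the TD target, which helps dampen oscillations during training and improves stability.
4.  **ε-greedy policy**: Balances exploration (choosing a random action) and exploitation (choosing the action with the highest Q-value).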
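
The TD target and the slow target-network update can be written compactly. This is a sketch in standard DQN notation rather than notation taken from the notebook: $\theta$ is assumed to denote the online (policy) network parameters, $\theta^-$ the target network parameters, $\gamma$ the discount factor (`GAMMA` in the code), and $\tau$ the soft-update coefficient (`TAU` in the code).

```latex
y_t =
\begin{cases}
r_t, & \text{if } s_{t+1} \text{ is terminal},\\[4pt]
r_t + \gamma \max_{a'} Q_{\theta^-}(s_{t+1}, a'), & \text{otherwise},
\end{cases}
\qquad
\theta^- \leftarrow \tau\,\theta + (1-\tau)\,\theta^-
```

The online network is trained to bring $Q_\theta(s_t, a_t)$ close to $y_t$; the code does this with a Huber loss.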

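## 1. Environment Setup

First, install the required libraries. `gymnasium` provides the environment, `torch` is the deep learning framework, and `tensorboard` is used for visualization.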

In [1]:
!pip install gymnasium torch tensorboard --user



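## 2. Launch TensorBoard (Optional)

To visualize the reward and loss curves during training, you can start TensorBoard by running the following command in a terminal.

In [None]:
# Run this command in your terminal (Terminal/CMD), from the notebook's working directory
# tensorboard --logdir runs_cartpole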

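## 3. DQN Implementation

Below is the complete Python code for the DQN algorithm. It consists of a `ReplayMemory` class, a `QNetwork` class, and the functions `select_action` and `optimize_model`, followed by the training loop.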
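
Before the full listing, it may help to see how a bounded replay buffer behaves in isolation. The sketch below uses only the standard library and toy integer "states" (an assumption made for readability; the notebook stores tensors). It shows the oldest transitions being evicted once capacity is reached, and how a sampled batch is regrouped field by field.

```python
import random
from collections import deque, namedtuple

# Same field order as the notebook's Transition tuple.
Transition = namedtuple("Transition", ("state", "action", "next_state", "reward"))

# A tiny buffer (capacity 3) so the eviction is easy to see.
buffer = deque([], maxlen=3)
for t in range(5):
    # Toy integer states; the last step is terminal, marked with next_state=None.
    next_state = None if t == 4 else t + 1
    buffer.append(Transition(t, 0, next_state, 1.0))

# Only the 3 most recent transitions survive.
kept_states = [tr.state for tr in buffer]
print(kept_states)  # [2, 3, 4]

# Uniform sampling without replacement, then regroup the batch field by field.
random.seed(0)
batch = Transition(*zip(*random.sample(list(buffer), 2)))
non_final_mask = [s is not None for s in batch.next_state]
print(batch.state, non_final_mask)
```

The `Transition(*zip(*...))` idiom turns a list of transitions into one tuple per field, which is what lets the training code build batched tensors with a single `torch.cat` per field.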

In [2]:
import gymnasium as gym
import math
import random
import time
from collections import namedtuple, deque

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.tensorboard import SummaryWriter

# Set up the environment
env = gym.make("CartPole-v1")

# Select the device (GPU or CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# A single transition stored in the replay buffer.
# next_state is None when the episode terminated at this step.
Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward'))


# Experience replay buffer
class ReplayMemory(object):
    def __init__(self, capacity):
        self.memory = deque([], maxlen=capacity)

    def push(self, *args):
        """Save a transition."""
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        """Sample a random batch from the buffer."""
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

# Define the Q-network (a simple fully connected network)
class QNetwork(nn.Module):
    def __init__(self, n_observations, n_actions):
        super(QNetwork, self).__init__()
        self.layer1 = nn.Linear(n_observations, 128)
        self.layer2 = nn.Linear(128, 128)
        self.layer3 = nn.Linear(128, n_actions)

    def forward(self, x):
        x = F.relu(self.layer1(x))
        x = F.relu(self.layer2(x))
        return self.layer3(x)

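# --- Hyperparameters --- #
BATCH_SIZE = 128         # number of transitions per training batch
GAMMA = 0.99             # reward discount factor
EPS_START = 0.9          # initial value of epsilon
EPS_END = 0.05           # final value of epsilon
EPS_DECAY = 1000         # epsilon decay time constant in steps (larger = slower decay)
TAU = 0.005              # soft-update coefficient for the target network
LR = 1e-4                # learning rate
MEMORY_CAPACITY = 10000  # replay buffer capacity
NUM_EPISODES = 600       # total number of training episodes

# Get the dimensions of the state and action spaces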
n_actions = env.action_space.n
state, info = env.reset()
n_observations = len(state)

# Initialize the networks, optimizer, and replay buffer
policy_net = QNetwork(n_observations, n_actions).to(device)
target_net = QNetwork(n_observations, n_actions).to(device)
target_net.load_state_dict(policy_net.state_dict())
target_net.eval() # the target network is never trained directly

optimizer = optim.AdamW(policy_net.parameters(), lr=LR, amsgrad=True)
memory = ReplayMemory(MEMORY_CAPACITY)
writer = SummaryWriter(f"runs_cartpole/dqn_{int(time.time())}")

steps_done = 0

def select_action(state):
    """Select an action with the epsilon-greedy policy."""
    global steps_done
    sample = random.random()
    # Compute the current epsilon value
    eps_threshold = EPS_END + (EPS_START - EPS_END) * \
        math.exp(-1. * steps_done / EPS_DECAY)
    steps_done += 1

    if sample > eps_threshold:
        # Exploit: pick the action with the highest Q-value
        with torch.no_grad():
            return policy_net(state).max(1)[1].view(1, 1)
    else:
        # Explore: pick a random action
        return torch.tensor([[env.action_space.sample()]], device=device, dtype=torch.long)

def optimize_model():
    """Run one optimization step."""
    if len(memory) < BATCH_SIZE:
        return

    transitions = memory.sample(BATCH_SIZE)
    batch = Transition(*zip(*transitions))

    # Build a mask of non-terminal transitions and gather their next states
    non_final_mask = torch.tensor(tuple(map(lambda s: s is not None, batch.next_state)), device=device, dtype=torch.bool)
    non_final_next_states = torch.cat([s for s in batch.next_state if s is not None])

    state_batch = torch.cat(batch.state)
    action_batch = torch.cat(batch.action)
    reward_batch = torch.cat(batch.reward)

    # Q-values of the actions actually taken: Q(s_t, a_t)
    state_action_values = policy_net(state_batch).gather(1, action_batch)

    # Value of the next state from the target network: V(s_{t+1}) = max_a Q_target(s_{t+1}, a);
    # terminal states keep the value 0
    next_state_values = torch.zeros(BATCH_SIZE, device=device)
    with torch.no_grad():
        next_state_values[non_final_mask] = target_net(non_final_next_states).max(1)[0]

    # Compute the expected Q-values (TD target)
    expected_state_action_values = (next_state_values * GAMMA) + reward_batch

    # Compute the loss (Huber loss)
    criterion = nn.SmoothL1Loss()
    loss = criterion(state_action_values, expected_state_action_values.unsqueeze(1))
    writer.add_scalar('Loss', loss.item(), steps_done)

    # Optimize
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_value_(policy_net.parameters(), 100)
    optimizer.step()

# --- Training loop --- #
for i_episode in range(NUM_EPISODES):
    state, info = env.reset()
    state = torch.tensor(state, dtype=torch.float32, device=device).unsqueeze(0)
    total_reward = 0

    while True:
        action = select_action(state)
        observation, reward, terminated, truncated, _ = env.step(action.item())
        total_reward += reward
        reward = torch.tensor([reward], device=device)
        done = terminated or truncated

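        if terminated:
            # Terminal step: there is no next state to bootstrap from
            next_state = None
        else:
            next_state = torch.tensor(observation, dtype=torch.float32, device=device).unsqueeze(0)

        # Store the transition in the replay buffer
        memory.push(state, action, next_state, reward)

        state = next_state

        # Run one optimization step
        optimize_model()

        # Soft-update the target network weights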
        target_net_state_dict = target_net.state_dict()
        policy_net_state_dict = policy_net.state_dict()
        for key in policy_net_state_dict:
            target_net_state_dict[key] = policy_net_state_dict[key]*TAU + target_net_state_dict[key]*(1-TAU)
        target_net.load_state_dict(target_net_state_dict)

        if done:
            writer.add_scalar('Reward per Episode', total_reward, i_episode)
            if (i_episode + 1) % 50 == 0:
                print(f'Episode {i_episode+1}/{NUM_EPISODES}, Total Reward: {total_reward}')
            break

print('Training complete')
env.close()
writer.close()

Using device: cuda
Episode 50/600, Total Reward: 9.0
Episode 100/600, Total Reward: 12.0
Episode 150/600, Total Reward: 12.0
Episode 200/600, Total Reward: 62.0
Episode 250/600, Total Reward: 71.0
Episode 300/600, Total Reward: 115.0
Episode 350/600, Total Reward: 140.0
Episode 400/600, Total Reward: 210.0
Episode 450/600, Total Reward: 66.0
Episode 500/600, Total Reward: 500.0
Episode 550/600, Total Reward: 500.0
Episode 600/600, Total Reward: 500.0
Training complete
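
To get a feel for the time scales set by `EPS_DECAY` and `TAU`, the following standalone sketch evaluates the exploration schedule used in `select_action` and estimates how quickly the soft update replaces the target network's original weights. The helper names `epsilon_at` and `old_target_fraction` are illustrative and do not appear in the notebook.

```python
import math

# Values copied from the notebook's hyperparameter block.
EPS_START, EPS_END, EPS_DECAY = 0.9, 0.05, 1000
TAU = 0.005


def epsilon_at(step):
    """Exploration rate after `step` environment steps (same formula as select_action)."""
    return EPS_END + (EPS_START - EPS_END) * math.exp(-step / EPS_DECAY)


def old_target_fraction(n_updates):
    """Weight of the target network's original parameters after n soft updates."""
    return (1.0 - TAU) ** n_updates


for step in (0, 1000, 3000, 10000):
    print(f"step {step:>5}: epsilon = {epsilon_at(step):.3f}")

# Number of soft updates until the original target weights are diluted to one half.
half_life = math.log(0.5) / math.log(1.0 - TAU)
print(f"target half-life: {half_life:.1f} updates")
```

With these settings epsilon falls from 0.9 to roughly 0.36 after 1000 steps and is close to its floor of 0.05 after a few thousand steps, while the target network's original weights are halved after about 138 soft updates.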


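## 4. Results and Summary

After training, you can use TensorBoard to inspect the total reward of each episode.

If training succeeds, the reward curve trends upward as training proceeds, though not monotonically (the log above dips to 66 at episode 450), and eventually settles at a high level. For `CartPole-v1` the maximum episode reward is 500, because the environment gives a reward of 1 per step and truncates episodes at 500 steps.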
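
Because individual episode rewards fluctuate, a trailing average often makes the trend easier to read. The sketch below applies one to the twelve values printed in the training log above. This is only the every-50-episodes sample, not the full curve, and `moving_average` is an illustrative helper rather than part of the notebook.

```python
# Rewards printed every 50 episodes in the training log (not the full curve).
logged_rewards = [9.0, 12.0, 12.0, 62.0, 71.0, 115.0,
                  140.0, 210.0, 66.0, 500.0, 500.0, 500.0]


def moving_average(values, window):
    """Trailing mean over the last `window` values; uses fewer at the start."""
    averaged = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


smoothed = moving_average(logged_rewards, window=3)
for episode, raw, avg in zip(range(50, 601, 50), logged_rewards, smoothed):
    print(f"episode {episode:>3}: reward {raw:6.1f}, 3-point average {avg:6.1f}")
```

TensorBoard's scalar view has a smoothing slider that serves the same purpose for the full per-episode curve.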

To view the results, run TensorBoard:

tensorboard --logdir runs_cartpole