# DQN Tutorial

- From: https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html<br>
- Other resources:<br>
Markdown syntax: https://www.runoob.com/markdown/md-tutorial.html<br>

## 1 Environment setup

### 1.1 Importing libraries

In [152]:
import gym
import math
import random
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from collections import namedtuple, deque
from itertools import count
from PIL import Image

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision.transforms as T
from torchvision.transforms import InterpolationMode  # passing interpolation=Image.CUBIC raises an error

### 1.2 Hardware: GPU

In [153]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device:{}".format(device))

device:cuda


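### 1.3 Loading the gym environment: CartPole as the example

Every time step in which the pole stays upright yields a reward of +1. The episode ends when the pole tilts more than 15 degrees from vertical (the gym source code itself uses a 12-degree threshold) or when the cart moves more than 2.4 units away from the center.<br>
Reportedly, most gym environments are wrapped in TimeLimit (see its source code), which caps the number of steps per episode, for example at 200. The episode therefore ends once the cart has balanced for 200 steps, even though the pole has not fallen. `env.unwrapped` returns the original class, which can be stepped for as long as the pole stays up and does not stop after 200 steps.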
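To make the wrapper idea concrete, here is a toy sketch in plain Python. `ToyEnv`, `ToyTimeLimit`, and `run_until_done` are hypothetical stand-ins written for this note. They are not gym's classes and only mimic the step-capping behavior described above.

```python
class ToyEnv:
    """Hypothetical environment that never ends on its own."""

    def step(self, action):
        # (observation, reward, done, info), the 4-tuple shape used in this notebook
        return 0.0, 1.0, False, {}

    @property
    def unwrapped(self):
        return self


class ToyTimeLimit:
    """Hypothetical wrapper that forces done=True after max_steps steps."""

    def __init__(self, env, max_steps):
        self.env = env
        self.max_steps = max_steps
        self.elapsed = 0

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.elapsed += 1
        if self.elapsed >= self.max_steps:
            done = True  # the cap was reached; the pole did not fall
        return obs, reward, done, info

    @property
    def unwrapped(self):
        return self.env.unwrapped


def run_until_done(env, limit=1000):
    """Step with a fixed action until done, or give up after `limit` steps."""
    for t in range(1, limit + 1):
        _, _, done, _ = env.step(0)
        if done:
            return t
    return None


wrapped = ToyTimeLimit(ToyEnv(), max_steps=200)
steps_wrapped = run_until_done(wrapped)           # stops at the cap
steps_raw = run_until_done(wrapped.unwrapped)     # the inner env never reports done
print(steps_wrapped, steps_raw)
```

The wrapped environment ends at the cap, while the unwrapped one keeps going until the caller gives up. This is why the training loop later in this notebook can produce episodes longer than 200 steps.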

In [154]:
env = gym.make('CartPole-v0').unwrapped

### 1.4 Configuring matplotlib for live display in Jupyter
Ref: https://blog.csdn.net/sinat_22510827/article/details/90693385 <br>
plt.ion() is the key call that turns on interactive mode

In [155]:
is_ipython = 'inline' in matplotlib.get_backend()
if is_ipython:
    from IPython import display
    
plt.ion()


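## 2 Basic data structures

**(1) Data is stored as transition tuples:** DQN's training data is sampled from the gym environment. Each episode has n steps, and the state, action, next_state, and reward of each step form one transition tuple;<br>
**(2) Transitions are stored in a replay memory:** the data lives in a memory deque, and the dataset is abstracted as a ReplayMemory class;<br>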

In [156]:
Transition = namedtuple('Transition',
                        ('state', 'action', 'next_state', 'reward'))


class ReplayMemory(object):
    
    def __init__(self, capacity):
        self.memory = deque([],maxlen=capacity)
        
    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
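
A small usage sketch, assuming the same `deque(maxlen=...)` design as the class above. The class is redefined here only so that the snippet runs on its own, and the integer states are placeholders rather than real screens. It shows that the oldest transitions are dropped once capacity is reached and that `sample` draws without replacement.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward'))


class ReplayMemory:
    def __init__(self, capacity):
        self.memory = deque([], maxlen=capacity)

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)


memory = ReplayMemory(3)
for step in range(5):
    # placeholder transition: state=step, action=0, next_state=step+1, reward=1.0
    memory.push(step, 0, step + 1, 1.0)

stored_states = [t.state for t in memory.memory]  # only the 3 newest remain
batch = memory.sample(2)
print(len(memory), stored_states)
```

Because old transitions fall out automatically, the memory always reflects the most recent `capacity` steps of experience.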

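## 3 Building the model

### DQN:
**(1) Overall framework:** DQN searches for a good policy using Generalized Policy Iteration together with a Q-Network;<br>
**(2) The Q-Network is built as:** Input (raw game frames fed in directly) - Conv - ReLU - Conv - ReLU - Conv - ReLU - FC - ReLU - Output (FC); the loss is MSE (the CartPole code in this notebook uses the Huber loss, nn.SmoothL1Loss, instead);<br>
### DQN for CartPole:
**(1) class DQN:**  <br>
- Contains: the forward network structure (3 Conv+BN+ReLU layers, 1 FC layer) and the init that creates the layers;<br>
- Input: image tensor, BCHW;<br>
- Output: tensor([ B [ scores (outputs) of the 2 actions ] ]);<br>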
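**(2) init:** <br>
- ReLU, Pooling, and similar operations have no parameters, so nothing has to be created for them in init; only the layers with learnable parameters (Conv, BN, and FC) are created there;<br>
- Function: torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True);<br>
Here out_channels (the number of filters) is usually a power of 2, kernel_size is usually odd, and stride is usually chosen so that ( input_size + padding x 2 - kernel_size ) / stride is an integer; when it is not, the output size is rounded down;<br>
- Process (divisions are rounded down):<br>
 **Input:** w0 x h0 x 3;<br>
 **Conv 1:** 16 filters, each of size 5x5x3, stride 2;<br>output is [(w0-5)/2+1] x [(h0-5)/2+1] x 16  =  w1 x h1 x 16;<br>
 **Conv 2:** 32 filters, each of size 5x5x16, stride 2;<br>output is [(w1-5)/2+1] x [(h1-5)/2+1] x 32  =  w2 x h2 x 32;<br>
 **Conv 3:** 32 filters, each of size 5x5x32, stride 2;<br>output is [(w2-5)/2+1] x [(h2-5)/2+1] x 32  =  w3 x h3 x 32;<br>
 **FC:** produces outputs; for one sample the output is two real-valued scores [Q(s, left), Q(s, right)], not a one-hot vector such as [0,1]; when x is a batch of samples, the output is one such pair per sample, [[Q_left, Q_right], ..., [Q_left, Q_right]];<br>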

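**(3) forward:** <br>
 - x.view(x.size(0), -1): flattens each sample's C x H x W feature map into a single vector while keeping the batch dimension, so that the FC layer can consume it;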
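
The size formulas above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the 40 x 90 input that the screen preprocessing later in this notebook produces. `conv_stack_size` is a hypothetical helper name, and the `padding` argument is added only for generality.

```python
def conv2d_size_out(size, kernel_size=5, stride=2, padding=0):
    """Output length of one spatial dimension after a conv layer; // rounds down."""
    return (size + 2 * padding - kernel_size) // stride + 1


def conv_stack_size(size, layers=3):
    """Apply the same 5x5 / stride-2 convolution `layers` times."""
    for _ in range(layers):
        size = conv2d_size_out(size)
    return size


h, w = 40, 90                      # assumed input height and width
conv_h = conv_stack_size(h)        # 40 -> 18 -> 7 -> 2
conv_w = conv_stack_size(w)        # 90 -> 43 -> 20 -> 8
linear_input_size = conv_h * conv_w * 32   # 32 channels after Conv 3
print(conv_h, conv_w, linear_input_size)
```

The product is the length of the flattened vector that `x.view(x.size(0), -1)` hands to the FC layer, which is why the FC input size has to be derived from the image size.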



In [157]:
class DQN(nn.Module):

    def __init__(self, h, w, outputs):
        super(DQN, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=5, stride=2)
        self.bn1 = nn.BatchNorm2d(16)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=5, stride=2)
        self.bn2 = nn.BatchNorm2d(32)
        self.conv3 = nn.Conv2d(32, 32, kernel_size=5, stride=2)
        self.bn3 = nn.BatchNorm2d(32)

        # Number of Linear input connections depends on output of conv2d layers
        # and therefore the input image size, so compute it.
        def conv2d_size_out(size, kernel_size = 5, stride = 2):
            return (size - (kernel_size - 1) - 1) // stride  + 1
        convw = conv2d_size_out(conv2d_size_out(conv2d_size_out(w)))
        convh = conv2d_size_out(conv2d_size_out(conv2d_size_out(h)))
        linear_input_size = convw * convh * 32
        self.head = nn.Linear(linear_input_size, outputs)

    # Called with either one element to determine next action, or a batch
    # during optimization. Returns tensor([[left0exp,right0exp]...]).
    def forward(self, x):
        x = x.to(device)
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        return self.head(x.view(x.size(0), -1))

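## 4 Training the model

### Overall steps:
(1) import the libraries; (2) obtain env; (Part 1, Environment setup)<br>
(3) initialize the image parameters, the training hyperparameters, and the policy and target networks; (Part 4, Training the model; uses Part 2, Basic data structures, to build an empty memory, and Part 3, Building the model, to initialize the parameters of the two DQN networks)<br>
(4) train: get the initial image, then run the policy-iteration loop (feed the state to the network to compute the Q-values of all actions, pick the currently best action, move to the next state, update the network parameters) while storing the data in the replay memory. (Part 4, Training the model; samples from memory and updates the network)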


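### 4.1 Initializing image parameters (getting images from env):
**CartPole environment:**<br>
Ref: https://www.jianshu.com/p/66d7df5fc34d<br>
- The cart only moves left or right. Its horizontal origin is 0 (the horizontal midpoint of the image), and its maximum range of motion is ±env.x_threshold;<br>
- The environment state is env.state, and the cart's horizontal coordinate is env.state[0]. It is given in world coordinates and has to be converted to a pixel position in the image;<br>

**(1) Get the initial image (numpy array):** env.render(); then convert the image from HWC order to torch's CHW order;<br> 
**(2) Crop the image:** take the top-left corner of the image as (0, 0) and the bottom-right corner as (1, 1);<br>
- H: keep 40% of the height, namely the band from 40% to 80%;<br>
- W: keep 60% of the width. If the cart has at least 30% of the width free on both sides, take 30% on each side of the cart; if the cart is too close to the left (or right) edge for that, take the leftmost (or rightmost) 60%;<br>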

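**(3) Rescale the image further:** <br> 
- After slicing, an array is often non-contiguous in memory; np.ascontiguousarray() turns it into an array stored contiguously, which makes later operations faster; ref: https://zhuanlan.zhihu.com/p/59767914 ;<br> 
- The image starts as a numpy array of uint8 values in [0, 255]. It is converted to float32 and scaled to [0, 1] by np.ascontiguousarray(screen, dtype=np.float32) / 255, wrapped as a tensor with torch.from_numpy(), and then passed to ToPILImage() so that the torchvision library (the Resize function) can process it;<br>
- torchvision is PyTorch's vision library. It serves the PyTorch deep learning framework and is used mainly for building computer-vision models. It consists of:<br>
 ① torchvision.datasets: data-loading functions and interfaces to common datasets;<br>
 ② torchvision.models: common model architectures (including pretrained models), such as AlexNet, VGG, and ResNet;<br>
 ③ torchvision.transforms: common image transforms, such as cropping and rotation;<br>
 ④ torchvision.utils: other useful helpers.<br>
- Functions in T = torchvision.transforms:<br>
 Ref: https://blog.csdn.net/ai_faker/article/details/115320418 ;<br>Ref: https://blog.csdn.net/u014380165/article/details/79167753 ;<br>
 
**Image size at each stage:**<br>
Initially (H400, W600, C3); (1) after reordering, (C3, H400, W600); (2) after cropping, (C3, H160, W360); (3) after rescaling, (C3, H40, W90);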
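
The cropping rule can be traced with plain integers before looking at the real code. This is a minimal sketch, assuming a 400 x 600 frame and the 2.4-unit threshold mentioned earlier. `cart_pixel` and `crop_bounds` are hypothetical helper names that mirror the logic described above.

```python
def cart_pixel(x, screen_width, x_threshold=2.4):
    """Map the cart's world coordinate x to a pixel column."""
    world_width = x_threshold * 2
    scale = screen_width / world_width
    return int(x * scale + screen_width / 2.0)


def crop_bounds(cart_location, screen_width, keep=0.6):
    """Return (start, stop) columns of the window kept around the cart."""
    view_width = int(screen_width * keep)
    half = view_width // 2
    if cart_location < half:
        return 0, view_width                            # too close to the left edge
    if cart_location > screen_width - half:
        return screen_width - view_width, screen_width  # too close to the right edge
    return cart_location - half, cart_location + half   # centered on the cart


screen_height, screen_width = 400, 600
rows = (int(screen_height * 0.4), int(screen_height * 0.8))  # 40%-80% band
center = cart_pixel(0.0, screen_width)                       # cart at the origin
print(rows, center, crop_bounds(center, screen_width))
```

With the cart at the origin the window is centered on it. Near either edge the window sticks to that edge, so the crop is always 360 columns wide.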

In [158]:
# resize = T.Compose([T.ToPILImage(),T.Resize(40, interpolation=Image.CUBIC),T.ToTensor()]) triggers a warning
resize = T.Compose([T.ToPILImage(), # convert the tensor to a PIL image
                    T.Resize(40, interpolation=InterpolationMode.BICUBIC),  # Resize scales the shorter side to 40; the longer side scales proportionally
                    T.ToTensor()]) # convert the PIL image back to a tensor

def get_cart_location(screen_width):
    world_width = env.x_threshold * 2
    scale = screen_width / world_width
    return int(env.state[0] * scale + screen_width / 2.0)  # MIDDLE OF CART

def get_screen():
    # Returned screen requested by gym is 400x600x3, but is sometimes larger
    # such as 800x1200x3. Transpose it into torch order (CHW).
    screen = env.render(mode='rgb_array').transpose((2, 0, 1))
    # Cart is in the lower half, so strip off the top and bottom of the screen
    _, screen_height, screen_width = screen.shape
    screen = screen[:, int(screen_height*0.4):int(screen_height * 0.8)]
    view_width = int(screen_width * 0.6)
    cart_location = get_cart_location(screen_width)
    if cart_location < view_width // 2:
        slice_range = slice(view_width)
    elif cart_location > (screen_width - view_width // 2):
        slice_range = slice(-view_width, None)
    else:
        slice_range = slice(cart_location - view_width // 2,
                            cart_location + view_width // 2)
    # Strip off the edges, so that the image is centered on the cart when possible
    screen = screen[:, :, slice_range]
    # Convert to float, rescale, convert to torch tensor
    # (this doesn't require a copy)
    screen = np.ascontiguousarray(screen, dtype=np.float32) / 255
    screen = torch.from_numpy(screen)
    # Resize, and add a batch dimension (BCHW)
    return resize(screen).unsqueeze(0)  # unsqueeze adds a leading dimension, giving BCHW

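**Testing image capture from env:**<br>
Ref: https://blog.csdn.net/WASEFADG/article/details/81043075 ;<br>
Create the environment: env = gym.make('CartPole-v0').unwrapped; already done in Part 1, Environment setup;<br>
Initialize (reset) the environment: env.reset();<br>
Graphics engine (render the environment): env.render(); handled earlier in this part (4 Training the model) by calling get_screen();<br>
Physics engine (execute an action): env.step();<br>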

In [159]:
# env.reset()
# plt.figure()
# plt.imshow(get_screen().cpu().squeeze(0).permute(1, 2, 0).numpy(),
#            interpolation='none') # imshow needs the original image layout (numpy, HWC)
# plt.title('Example extracted screen')
# plt.show()

### 4.2 Initializing hyperparameters and networks:
- **load_state_dict():** the state_dict of a torch.nn.Module holds the weights and biases learned during training. It is a Python dictionary that maps each parameter name to a tensor. It covers every layer with learnable parameters (here Conv, BN, and FC) and also includes persistent buffers, so a network with batchnorm, such as a VGG variant with BN, stores batchnorm's running_mean and running_var there as well;<br>
 Ref: https://blog.csdn.net/bigFatCat_Tom/article/details/90722261 ;<br>
 Ref: https://www.cnblogs.com/marsggbo/p/12075356.html ;<br>
- **net.train() and net.eval():**<br>
 Ref: https://blog.csdn.net/qq_46284579/article/details/120439049?spm=1001.2101.3001.6650.5&utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7EBlogCommendFromBaidu%7Edefault-5.no_search_link&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7EBlogCommendFromBaidu%7Edefault-5.no_search_link ;<br>

In [160]:
BATCH_SIZE = 128
GAMMA = 0.999
TARGET_UPDATE = 10

# Get screen size so that we can initialize layers correctly based on shape
# returned from AI gym. Typical dimensions at this point are close to 3x40x90
# which is the result of a clamped and down-scaled render buffer in get_screen()
env.reset()
init_screen = get_screen()
_, _, screen_height, screen_width = init_screen.shape

# Get number of actions from gym action space
n_actions = env.action_space.n

policy_net = DQN(screen_height, screen_width, n_actions).to(device)
target_net = DQN(screen_height, screen_width, n_actions).to(device)
target_net.load_state_dict(policy_net.state_dict())
target_net.eval()

optimizer = optim.RMSprop(policy_net.parameters())
memory = ReplayMemory(10000)

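### 4.3 Other functions used during training:

**Action selection in policy iteration:**<br>
- sample > eps_threshold:<br>
 A fraction eps_threshold of the decisions is left to random choice. Randomness is therefore high early on (when the fitted Q-values are still inaccurate) and decreases as training proceeds;<br>
- policy_net(state).max(1)[1].view(1, 1):<br>
 policy_net outputs a tensor of the form [1 sample [2 action scores]]: dimension 0 is the current sample (batch size = 1) and dimension 1 holds the 2 action scores;<br>
 max(1) takes the maximum along dimension 1, that is, over actions. max(1)[1] returns the index of the maximum: [[large, small]] gives action index 0 (left) and [[small, large]] gives action index 1 (right). max(1)[0] would return the maximum value itself;<br>
- view(1,1):<br>
 reshapes the action to 1x1 so that actions are easy to combine into a batch in the policy-improvement step;<br>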

In [161]:
EPS_START = 0.9
EPS_END = 0.05
EPS_DECAY = 200  # the smaller this is, the faster eps_threshold decays
steps_done = 0
# a trailing \ continues the statement on the next line
eps_threshold = EPS_END + (EPS_START - EPS_END) * \
        math.exp(-1. * steps_done / EPS_DECAY)


def select_action(state):
    global steps_done
    sample = random.random()
    # as steps_done grows, eps_threshold decays from 0.9 to about 0.363 (at steps_done=200) and then approaches EPS_END=0.05, not 0;
    eps_threshold = EPS_END + (EPS_START - EPS_END) * \
        math.exp(-1. * steps_done / EPS_DECAY)
    steps_done += 1
    if sample > eps_threshold:
        with torch.no_grad():
            # t.max(1) will return largest column value of each row.
            # second column on max result is index of where max element was
            # found, so we pick action with the larger expected reward.
            return policy_net(state).max(1)[1].view(1, 1)
    else:
        return torch.tensor([[random.randrange(n_actions)]], device=device, dtype=torch.long)
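
The decay schedule is easy to inspect on its own. This is a minimal sketch using the constants above, with a plain list of Q-values standing in for the network output. `epsilon_greedy` is a hypothetical torch-free helper, not the function used in training.

```python
import math
import random

EPS_START, EPS_END, EPS_DECAY = 0.9, 0.05, 200


def eps_threshold(steps_done):
    """Exponential decay from EPS_START toward EPS_END."""
    return EPS_END + (EPS_START - EPS_END) * math.exp(-1.0 * steps_done / EPS_DECAY)


def epsilon_greedy(q_values, steps_done, rng=random):
    """Greedy action with probability 1 - eps, otherwise a uniformly random one."""
    if rng.random() > eps_threshold(steps_done):
        return max(range(len(q_values)), key=lambda a: q_values[a])
    return rng.randrange(len(q_values))


for step in (0, 200, 1000, 5000):
    print(step, round(eps_threshold(step), 4))
```

The threshold never goes below `EPS_END`, so a small amount of exploration remains however long training runs.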

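**Policy improvement:**<br>
- batch = Transition(\*zip(\*transitions))<br>
① zip(\*transitions) transposes the batch sampled from memory: instead of one tuple per sample, it yields one tuple per field (all states, all actions, and so on);<br>
② Transition(\*...) packs those per-field tuples into a single Transition tuple;<br>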
For example:<br>
memory = deque([],maxlen=3)<br>
Transition = namedtuple('Transition',('state', 'action'))<br>
memory.append(Transition(torch.tensor([0]),1))<br>
memory.append(Transition(torch.tensor([1]),0))<br>
memory.append(Transition(torch.tensor([2]),2))<br>
transitions = random.sample(memory, 2)<br>
print("transitions: {}".format(transitions))<br>
batch = Transition(\*zip(\*transitions))<br>
print("batch: {}".format(batch))<br>
which prints (for one possible random sample):<br>
transitions: [Transition(state=tensor([0]), action=1), Transition(state=tensor([2]), action=2)]<br>
batch: Transition(state=(tensor([0]), tensor([2])), action=(1, 2))<br>
- torch.cat() concatenates each field of batch into a single tensor:<br>
 state_batch = torch.cat(batch.state)<br>
 print("state_batch: {}".format(state_batch))<br>
 which prints:<br>
 state_batch: tensor([0, 2])<br>
- torch.gather():<br>
 applies to tensors with the same number of dimensions; assume a and b are both 2-D;<br>
 c=a.gather(0,b) means c[ i, j ]=a[ b[i,j], j ];<br>
 c=a.gather(1,b) means c[ i, j ]=a[ i, b[i,j] ];<br>
- torch.detach():<br>
 Ref: https://blog.csdn.net/qq_27825451/article/details/95498211 ;<br>
- Shapes:<br>
 (1) transitions: the raw samples drawn from memory<br>
 Shape: [ **batch_size x** Transition **(** state=tensor([1 (sample) [ 3 [ 40 [ 90 values ] ] ]]),  action=tensor([ 1 (sample) [ 1 action index ]], device='cuda:0'),  next_state=tensor(1CHW) or None,  reward=tensor([ 1 value ], device='cuda:0') **)** ]<br>
 (2) batch: the samples merged into 1 Transition tuple<br>
 Shape: Transition **(** state= **( batch_size x** tensor([1 (sample) [ 3 [ 40 [ 90 values ] ] ]]) **)** ,  action= **( batch_size x** tensor([ 1 (sample) [ 1 action index ]], device='cuda:0') **)** ,  next_state= **( batch_size x** tensor(1CHW) or None **)** ,  reward= **( batch_size x** tensor([ 1 value ], device='cuda:0') **)**  **)**<br>
 (3) non_final_mask: for each sample, True if its next state exists and False if the episode ended at that step (done)<br>
 Shape: tensor([ batch_size x True/False], device='cuda:0')<br>
 (4) non_final_next_states: concatenated with cat<br>
 Shape: tensor([ B minus the final samples [C [H [W] ] ] ])<br>
 (5) state_batch: concatenated with cat<br>
 Shape: tensor([ B [C [H [W] ] ] ])<br>
 (6) action_batch: concatenated with cat<br>
 Shape: tensor([ B [1 action index] ], device='cuda:0')<br>
 (7) reward_batch: concatenated with cat<br>
 Shape: tensor([ B values ], device='cuda:0')<br>
 (8) policy_net(state_batch):<br>
 Shape: tensor([ B [ 2 action scores (outputs) ] ], device='cuda:0', grad_fn=\<AddmmBackward\>)<br>
 (9) state_action_values:<br>
 Shape: tensor([ B [ score of the action actually taken ] ], device='cuda:0', grad_fn=\<GatherBackward\>)<br>
 (10) next_state_values:<br>
 target_net(non_final_next_states).max(1)[0].detach(): the target network's output has shape B x 2, and max is taken along dimension 1, that is, the larger of the 2 action scores. torch.max()[0] returns the maximum values and torch.max()[1] returns their indices, so this expression returns the maximum values. detach() cuts them out of the computation graph, so the backward pass does not propagate gradients through them;<br>
 Shape: tensor([ B values ], device='cuda:0')<br>
 (11) expected_state_action_values.unsqueeze(1):<br>
 expected_state_action_values has the same shape as (10), [ B values ], that is, shape (B,); unsqueeze(1) adds dimension 1 to give shape (B, 1),<br>
 because state_action_values in (9) has shape (B, 1): [ B [ score of the action actually taken ] ];
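
The indexing rules for `gather(1, ...)` and `max(1)` described in the list above can be mimicked with nested lists. This is a plain-Python sketch for intuition only. `gather_dim1` and `max_dim1` are hypothetical helpers, not torch functions, and the Q-values are made up.

```python
def gather_dim1(a, index):
    """Analogue of a.gather(1, index): c[i][j] = a[i][index[i][j]]."""
    return [[row[j] for j in idx_row] for row, idx_row in zip(a, index)]


def max_dim1(a):
    """Analogue of a.max(1): per-row maximum values and their indices."""
    values = [max(row) for row in a]
    indices = [row.index(max(row)) for row in a]
    return values, indices


q_values = [[0.2, 0.8],    # [Q(s0, left), Q(s0, right)]
            [0.5, 0.1],
            [0.3, 0.9]]
action_batch = [[0], [0], [1]]   # action taken in each sample, shape (B, 1)

state_action_values = gather_dim1(q_values, action_batch)   # shape (B, 1)
best_values, best_actions = max_dim1(q_values)              # each of shape (B,)
print(state_action_values, best_values, best_actions)
```

The gathered result keeps the (B, 1) shape of the index, while the row maxima have shape (B,). This is the mismatch that `unsqueeze(1)` resolves before the loss is computed.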

In [162]:
def optimize_model():
    if len(memory) < BATCH_SIZE:
        return
    transitions = memory.sample(BATCH_SIZE)
    # Transpose the batch (see https://stackoverflow.com/a/19343/3343043 for detailed explanation). 
    # This converts batch-array of Transitions to Transition of batch-arrays.
    batch = Transition(*zip(*transitions))

    # Compute a mask of non-final states and concatenate the batch elements
    # (a final state would've been the one after which simulation ended)
    # map applies lambda s: s is not None to every element of batch.next_state to test whether it is None;
    # the result is a mask such as [True, False, ...] (batch_size booleans)
    non_final_mask = torch.tensor(tuple(map(lambda s: s is not None,
                                          batch.next_state)), device=device, dtype=torch.bool)
    # torch.cat concatenates the non-None next_states, the states, the actions, and the rewards into one tensor each;
    non_final_next_states = torch.cat([s for s in batch.next_state
                                                if s is not None])
    state_batch = torch.cat(batch.state)
    action_batch = torch.cat(batch.action)
    reward_batch = torch.cat(batch.reward)

    # Compute Q(s_t, a) - the model computes Q(s_t), then we select the
    # columns of actions taken. These are the actions which would've been taken
    # for each batch state according to policy_net
    state_action_values = policy_net(state_batch).gather(1, action_batch)

    # Compute V(s_{t+1}) for all next states.
    # Expected values of actions for non_final_next_states are computed based
    # on the "older" target_net; selecting their best reward with max(1)[0].
    # This is merged based on the mask, such that we'll have either the expected
    # state value or 0 in case the state was final.
    next_state_values = torch.zeros(BATCH_SIZE, device=device)
    next_state_values[non_final_mask] = target_net(non_final_next_states).max(1)[0].detach()
    # Compute the expected Q values
    expected_state_action_values = (next_state_values * GAMMA) + reward_batch

    # Compute Huber loss
    criterion = nn.SmoothL1Loss()
    loss = criterion(state_action_values, expected_state_action_values.unsqueeze(1))

    # Optimize the model
    optimizer.zero_grad() # clear the gradients of all parameters
    loss.backward()
    for param in policy_net.parameters():
        param.grad.data.clamp_(-1, 1) # clamp every gradient element to the range [-1, 1]
    optimizer.step() # update the model parameters
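
The target and loss computed in `optimize_model` reduce to simple arithmetic on one small batch. This is a torch-free sketch with made-up numbers. `td_targets` and `smooth_l1` are hypothetical helpers, and `smooth_l1` is meant to follow the Huber form with threshold 1 and mean reduction, which is how `nn.SmoothL1Loss` behaves with its default settings as far as I know.

```python
GAMMA = 0.999


def td_targets(rewards, next_values, non_final_mask, gamma=GAMMA):
    """reward + gamma * V(next state), using 0 as the value of a final state."""
    targets = []
    remaining = iter(next_values)   # one entry per non-final sample, in batch order
    for reward, non_final in zip(rewards, non_final_mask):
        next_value = next(remaining) if non_final else 0.0
        targets.append(reward + gamma * next_value)
    return targets


def smooth_l1(predictions, targets):
    """Mean Huber loss: quadratic for errors below 1, linear above."""
    total = 0.0
    for p, t in zip(predictions, targets):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total / len(predictions)


rewards = [1.0, 1.0, 1.0]
non_final_mask = [True, False, True]
next_values = [2.0, 4.0]            # values for the two non-final samples only
targets = td_targets(rewards, next_values, non_final_mask)
loss = smooth_l1([2.5, 1.0, 8.0], targets)
print(targets, loss)
```

The large error on the third sample contributes linearly rather than quadratically, which is the reason a Huber-style loss is more robust to outliers than MSE here.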

### 4.4 Training and showing the results
**Plotting:**

In [163]:
episode_durations = []


def plot_durations():
    plt.figure(2)
    plt.clf()
    durations_t = torch.tensor(episode_durations, dtype=torch.float)
    plt.title('Training...')
    plt.xlabel('Episode')
    plt.ylabel('Duration')
    plt.plot(durations_t.numpy())
    # Take 100 episode averages and plot them too
    if len(durations_t) >= 100:
        means = durations_t.unfold(0, 100, 1).mean(1).view(-1)
        means = torch.cat((torch.zeros(99), means))
        plt.plot(means.numpy())

    plt.pause(0.001)  # pause a bit so that plots are updated
    if is_ipython:
        display.clear_output(wait=True)
        display.display(plt.gcf())
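
The averaged curve in `plot_durations` is a sliding-window mean padded with zeros so that it lines up with the episode axis. Here is a plain-Python sketch of the same idea, using a window of 3 instead of 100 to keep the numbers readable. `padded_moving_average` is a hypothetical helper name.

```python
def padded_moving_average(values, window=100):
    """Mean of every sliding window, left-padded with zeros to align with episodes."""
    if len(values) < window:
        return []                     # not enough episodes yet, nothing to plot
    means = [sum(values[i:i + window]) / window
             for i in range(len(values) - window + 1)]
    return [0.0] * (window - 1) + means


durations = [1, 2, 3, 4, 5]
curve = padded_moving_average(durations, window=3)
print(curve)
```

The padded curve has one point per episode, so both lines share the same x-axis. The leading zeros are only placeholders for episodes that do not yet have a full window behind them.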

**Run and show the results:**<br>
get_screen() returns a BCHW tensor; next_state = current_screen - last_screen, so next_state is a 4-D tensor (1x3x40x90);

In [164]:
num_episodes = 50
for i_episode in range(num_episodes):
    # Initialize the environment and state
    env.reset()
    last_screen = get_screen()
    current_screen = get_screen()
    state = current_screen - last_screen
    for t in count():
        # Select and perform an action
        action = select_action(state)
        _, reward, done, _ = env.step(action.item())
        # reward is a plain float; convert it to a tensor
        reward = torch.tensor([reward], device=device)

        # Observe new state
        last_screen = current_screen
        current_screen = get_screen()
        if not done:
            next_state = current_screen - last_screen
        else:
            next_state = None

        # Store the transition in memory
        memory.push(state, action, next_state, reward)

        # Move to the next state
        state = next_state

        # Perform one step of the optimization (on the policy network)
        optimize_model()
        if done:
            episode_durations.append(t + 1)
            plot_durations()
            break
    # Update the target network, copying all weights and biases in DQN
    if i_episode % TARGET_UPDATE == 0:
        target_net.load_state_dict(policy_net.state_dict())

print('Complete')
env.render()
env.close()
plt.ioff()
plt.show()


Complete
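
The periodic target-network sync in the training loop can be illustrated without torch. In this sketch the dictionaries are stand-ins for the two networks' state_dicts, and adding 1.0 per episode is a made-up substitute for training. Only the `i_episode % TARGET_UPDATE == 0` rule is taken from the loop above.

```python
import copy

TARGET_UPDATE = 10
num_episodes = 50

policy_weights = {'w': 0.0}                      # stand-in for policy_net.state_dict()
target_weights = copy.deepcopy(policy_weights)   # stand-in for target_net's weights
sync_episodes = []

for i_episode in range(num_episodes):
    policy_weights['w'] += 1.0                   # pretend one episode of training changed the weights
    if i_episode % TARGET_UPDATE == 0:
        target_weights = copy.deepcopy(policy_weights)   # hard update: full copy
        sync_episodes.append(i_episode)

print(sync_episodes, policy_weights['w'], target_weights['w'])
```

Between syncs the target weights stay frozen while the policy weights keep moving, which keeps the regression target stable for several episodes at a time.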

