# Deep Reinforcement Learning Algorithms Based on the Deterministic Policy Gradient
&emsp;&emsp;<font size=4>
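    In reinforcement learning tasks with discrete actions, one can usually enumerate all actions to compute the action-value function $q(s, a)$ and thereby obtain the optimal action-value function $q_{*}(s, a)$. In large or continuous action spaces, however, enumerating every action is impractical and computationally prohibitive. To address continuous action spaces, T. P. Lillicrap et al. proposed the Deep Deterministic Policy Gradient (DDPG) algorithm in 2016. DDPG represents a deterministic policy $\mu(s)$ with a deep neural network and updates the network parameters with the deterministic policy gradient, which makes it effective for reinforcement learning tasks with large or continuous action spaces.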
</font><br>

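### Background
&emsp;&emsp;<font size=4>
    <b>(1) The PG method</b>
</font><br>
&emsp;&emsp;<font size=4>
    The policy gradient (PG) method constructs a policy probability distribution $\pi\left(a \mid s, \boldsymbol{\theta}^{\pi}\right)$; at each time step, the agent selects an action by sampling from this distribution: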
\begin{equation}
    a \sim \pi\left(a \mid s, \boldsymbol{\theta}^{\pi}\right) \label{bbb.1}\tag{bbb.1}
\end{equation}
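where $\boldsymbol{\theta}^{\pi}$ denotes the parameters of the stochastic policy $\pi$.
</font><br>
&emsp;&emsp;<font size=4>
    Because the PG algorithm involves both the state space and the action space, learning a stochastic policy in large-scale problems requires a large number of samples. Sampling therefore consumes considerable computational resources, and the algorithm is comparatively inefficient.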
</font><br>
&emsp;&emsp;<font size=4>
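    <b>(2) The DPG algorithm</b>
</font><br>
&emsp;&emsp;<font size=4>
    The deterministic policy gradient (DPG) algorithm constructs a deterministic policy function $\mu\left(s, \boldsymbol{\theta}^{\mu}\right)$; at each time step, the agent obtains a definite action from this function: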
\begin{equation}
    a=\mu\left(s, \boldsymbol{\theta}^{\mu}\right)  \label{bbb.2}\tag{bbb.2}
\end{equation}
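where $\boldsymbol{\theta}^{\mu}$ denotes the parameters of the deterministic policy $\mu$.
</font><br>
&emsp;&emsp;<font size=4>
    Because the DPG algorithm involves only the state space, it needs fewer samples than the PG algorithm, and its efficiency improves markedly, especially in tasks with large or continuous action spaces.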
</font><br>
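&emsp;&emsp;<font size=4>
    To make the contrast between Eqs. (bbb.1) and (bbb.2) concrete, the following minimal sketch compares sampling from a stochastic policy with evaluating a deterministic one. The linear parameterization, the fixed standard deviation, and the names <code>stochastic_policy</code> and <code>deterministic_policy</code> are illustrative assumptions, not part of the original algorithms.
</font><br>

```python
import numpy as np

rng = np.random.default_rng(0)

state_dim, action_dim = 3, 2
# Hypothetical linear policy weights, drawn at random for illustration only.
weights = rng.normal(size=(action_dim, state_dim))
log_std = np.full(action_dim, -1.0)  # assumed fixed log standard deviation


def stochastic_policy(state, rng):
    """Sample a ~ pi(a | s): a Gaussian centred on a linear function of the state."""
    mean = weights @ state
    return mean + np.exp(log_std) * rng.normal(size=action_dim)


def deterministic_policy(state):
    """Return a = mu(s): the same linear map, with no sampling."""
    return weights @ state


state = np.array([0.5, -1.0, 2.0])

a1 = stochastic_policy(state, rng)
a2 = stochastic_policy(state, rng)
d1 = deterministic_policy(state)
d2 = deterministic_policy(state)

print("stochastic samples differ:", not np.allclose(a1, a2))
print("deterministic outputs equal:", np.array_equal(d1, d2))
```

&emsp;&emsp;<font size=4>
    For the same state, repeated calls to the stochastic policy return different actions, whereas the deterministic policy always returns the same one.
</font><br>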
&emsp;&emsp;<font size=4>
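    <b>(3) The DDPG method</b>
</font><br>
&emsp;&emsp;<font size=4>
    Deep policy gradient methods need to sample $N$ complete episodes $\left\{\varpi_{i}\right\}_{i=1}^{N}$ as training data at every iteration and then construct the gradient of the objective function with respect to the policy parameters in order to solve for the optimal policy. In many real-world tasks, however, it is difficult to collect a large number of complete episodes online. In robot manipulation in a real environment, for example, collecting and using many complete episodes online is very expensive, and because the actions are continuous, episodes sampled online in batches cannot cover the whole state feature space. These problems can cause the algorithm to settle on a local optimum when solving for the optimal policy. To address them, the actor-critic (AC) framework from traditional reinforcement learning can be extended to deep policy gradient methods; such algorithms are collectively called AC-based deep policy gradient methods. The most representative of them is the deep deterministic policy gradient (DDPG) algorithm, which can solve a range of control problems in continuous action spaces. DDPG builds on the DPG method and uses the AC framework: deep neural networks learn an approximate action-value function $Q(s, a, \boldsymbol{w})$ and a deterministic policy $\mu(s, \boldsymbol{\theta})$, where $\boldsymbol{w}$ and $\boldsymbol{\theta}$ are the weights of the value network and the policy network, respectively. The value network evaluates the Q-value of the current state-action pair and then provides the policy network with the gradient information used to update the policy weights; it corresponds to the critic in the AC framework. The policy network selects actions and corresponds to the actor. DDPG mainly involves the following concepts: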
</font><br>
&emsp;&emsp;<font size=4>
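    (1) Behavior policy $\beta$: an exploratory policy that introduces random noise to influence action selection;
</font><br>
&emsp;&emsp;<font size=4>
    (2) State distribution $\rho^{\beta}$: the distribution of states generated when the agent follows the behavior policy $\beta$;
</font><br>
&emsp;&emsp;<font size=4>
    (3) Policy network, or actor network: DDPG uses a deep network to approximate the deterministic policy function $\mu(s, \boldsymbol{\theta})$, where $\boldsymbol{\theta}$ denotes the network parameters; the input is the current state $s$ and the output is a deterministic action $a$. The parameters $\boldsymbol{\theta}$ are sometimes written as $\boldsymbol{\theta}^{\mu}$;
</font><br>
&emsp;&emsp;<font size=4>
    (4) Value network, or critic network: DDPG uses a deep network to approximate the action-value function $Q(s, a, \boldsymbol{w})$, where $\boldsymbol{w}$ denotes the network parameters. The parameters $\boldsymbol{w}$ are sometimes written as $\boldsymbol{\theta}^{Q}$.
</font><br>
&emsp;&emsp;<font size=4>
    Compared with the DPG method, the main improvements of DDPG are as follows:
</font><br>
&emsp;&emsp;<font size=4>
    (1) Deep neural networks: a policy network and a value network are built to learn the deterministic policy function $\mu(s, \boldsymbol{\theta})$ and the approximate action-value function $Q(s, a, \boldsymbol{w})$, respectively, and both are trained with Adam;
</font><br>
&emsp;&emsp;<font size=4>
    (2) Experience replay: the transitions generated while the agent interacts with the environment are temporally correlated. Experience replay reduces the bias in value-function estimation, mitigates the correlation between samples and the non-stationarity of the data distribution, and makes the algorithm easier to converge;
</font><br> 
&emsp;&emsp;<font size=4>
    (3) Dual-network architecture: both the policy function and the value function use two networks, a prediction network and a target network, which makes learning more stable and convergence faster.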
</font><br> 

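### Core Ideas
&emsp;&emsp;<font size=4>
    In DDPG, the value network acts as the critic: it evaluates the policy, learns the Q-function, and provides gradient information to the policy network. The policy network acts as the actor: it uses the Q-function learned by the critic and its gradient to improve the policy. DDPG also introduces noise-based exploration and a soft-update method. This section presents the core ideas and main techniques of DDPG.
</font><br>
&emsp;&emsp;<font size=4>
    <b>Dual-network architecture</b>
</font><br>
&emsp;&emsp;<font size=4>
    Experience with DQN shows that learning with a single network is unstable. DDPG therefore introduces a target network for both the value network and the policy network:
</font><br>
&emsp;&emsp;<font size=4>
    $\left\{\begin{array}{ll}\text { Prediction value network: }&Q(s, a, \boldsymbol{w}), \text { update } \boldsymbol{w} \\ \text { Target value network: } &Q\left(s, a, \boldsymbol{w}^{\prime}\right), \text { update } \boldsymbol{w}^{\prime}\end{array}\right.$
</font><br>
&emsp;&emsp;<font size=4>
    $\left\{\begin{array}{ll}\text { Prediction policy network: }&\mu(s, \boldsymbol{\theta}), \text { update } \boldsymbol{\theta} \\ \text { Target policy network: } &\mu\left(s, \boldsymbol{\theta}^{\prime}\right), \text { update } \boldsymbol{\theta}^{\prime}\end{array}\right.$
</font><br>
&emsp;&emsp;<font size=4>
    After each mini-batch of transitions has been processed, the prediction policy network parameters are updated by mini-batch gradient ascent (MBGA), the prediction value network parameters are updated by mini-batch gradient descent (MBGD), and the target network parameters are then updated by the soft-update rule.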
</font><br>
&emsp;&emsp;<font size=4>
    <b>Policy network: the actor</b>
</font><br>
&emsp;&emsp;<font size=4>
    In DDPG, the optimization objective is defined as the expected cumulative discounted reward:
</font><br>
\begin{equation}
J(\boldsymbol{\theta})=\mathbb{E}_{\boldsymbol{\theta}}\left[r_{0}+\gamma r_{1}+\gamma^{2} r_{2}+\ldots\right] \label{bbb.3}\tag{bbb.3}
\end{equation}
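&emsp;&emsp;<font size=4>
    As a small worked example of the quantity inside the expectation in Eq. (bbb.3), the sketch below computes the discounted return of a single episode. The reward values and the function name <code>discounted_return</code> are illustrative assumptions.
</font><br>

```python
def discounted_return(rewards, gamma):
    """Compute r_0 + gamma * r_1 + gamma^2 * r_2 + ... for one episode."""
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]   # illustrative rewards, not taken from any experiment
gamma = 0.9
episode_return = discounted_return(rewards, gamma)
print(episode_return)
```

&emsp;&emsp;<font size=4>
    With three rewards of 1 and $\gamma=0.9$, the return is $1+0.9+0.81$, approximately 2.71; $J(\boldsymbol{\theta})$ is the expectation of this quantity over episodes generated by the policy.
</font><br>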
&emsp;&emsp;<font size=4>
    The gradient of this objective with respect to the policy parameters is then used for stochastic gradient updates. Silver et al. proved that, for a deterministic policy, the gradient of the objective with respect to the weights $\boldsymbol{\theta}$ equals the expected gradient of the Q-function with respect to $\boldsymbol{\theta}$:
</font><br>  
\begin{equation}
\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})=\mathbb{E}_{s \sim \rho^{\beta}}\left[\nabla_{\boldsymbol{\theta}} Q(s, a, \boldsymbol{w})\right]  \label{bbb.4}\tag{bbb.4}
\end{equation}
&emsp;&emsp;<font size=4>
    Applying the chain rule with the deterministic policy $a=\mu(s, \boldsymbol{\theta})$ gives:
</font><br>
\begin{equation}\nabla_{\theta} \hat{J}_{\beta}(\boldsymbol{\theta})=\mathbb{E}_{S_{t} \sim \rho^{\beta}}\left[\left.\nabla_{\theta} \mu\left(S_{t}, \boldsymbol{\theta}\right) \nabla_{a} Q\left(S_{t}, a, \boldsymbol{w}\right)\right|_{a=\mu\left(S_{t}, \boldsymbol{\theta}\right)}\right] \label{bbb.5}\tag{bbb.5}
\end{equation} 
&emsp;&emsp;<font size=4>
    Using MBGA, a mini-batch of $N$ transitions is sampled at random from the replay buffer $\mathcal{D}$ to form a sample estimate of the expectation: 
</font><br>
\begin{equation}
\left.\nabla_{\theta} \hat{J}_{\beta}(\boldsymbol{\theta}) \approx \frac{1}{N} \sum_{i} \nabla_{\theta} \mu\left(S_{i}, \boldsymbol{\theta}\right) \nabla_{a} Q\left(S_{i}, a, \boldsymbol{w}\right)\right|_{a=\mu\left(S_{i}, \boldsymbol{\theta}\right)} \label{bbb.6}\tag{bbb.6}
\end{equation}
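&emsp;&emsp;<font size=4>
    The chain-rule estimate in Eq. (bbb.6) can be checked numerically. The sketch below assumes a linear actor $\mu(s)=Ws$ and a hand-written quadratic critic, both hypothetical stand-ins chosen so that $\nabla_{a} Q$ has a closed form, and compares the chain-rule gradient with a finite-difference gradient of the mini-batch objective.
</font><br>

```python
import numpy as np

rng = np.random.default_rng(0)
n, state_dim, action_dim = 8, 3, 2

states = rng.normal(size=(n, state_dim))
W = rng.normal(size=(action_dim, state_dim))   # actor parameters (theta), hypothetical
C = rng.normal(size=(action_dim, state_dim))   # fixed critic parameters (w), hypothetical


def actor(W, states):
    """Linear deterministic policy a = W s for a batch of states."""
    return states @ W.T                        # shape (n, action_dim)


def critic(states, actions):
    """Toy critic Q(s, a) = -||a - C s||^2, so grad_a Q = -2 (a - C s)."""
    return -np.sum((actions - states @ C.T) ** 2, axis=1)


def objective(W):
    """Mini-batch estimate of J: mean of Q(s_i, mu(s_i))."""
    return critic(states, actor(W, states)).mean()


# Chain rule as in Eq. (bbb.6): average of grad_theta mu(s_i) * grad_a Q(s_i, a).
# For a linear actor, grad_W of (W s) contracted with grad_a Q is outer(grad_a Q, s).
actions = actor(W, states)
grad_a_q = -2.0 * (actions - states @ C.T)     # shape (n, action_dim)
chain_rule_grad = grad_a_q.T @ states / n      # shape (action_dim, state_dim)

# Central finite differences of the same objective, one parameter at a time.
eps = 1e-6
numeric_grad = np.zeros_like(W)
for j in range(action_dim):
    for k in range(state_dim):
        step = np.zeros_like(W)
        step[j, k] = eps
        numeric_grad[j, k] = (objective(W + step) - objective(W - step)) / (2 * eps)

print(np.allclose(chain_rule_grad, numeric_grad, atol=1e-5))
```

&emsp;&emsp;<font size=4>
    In the notebook's PyTorch code the same gradient is obtained by automatic differentiation of the critic's output with respect to the actor's parameters, so the chain rule never has to be written out by hand.
</font><br>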
&emsp;&emsp;<font size=4>
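    <b>Value network: the critic</b>
</font><br>
&emsp;&emsp;<font size=4>
    Like DQN, DDPG uses the mean squared error (MSE) of the TD error as its loss function; the two differ in how the target value $y_{i}$ is computed: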
</font><br>
\begin{equation}
L(\boldsymbol{w})=\mathbb{E}\left[\left(r+\gamma Q^{\prime}\left(s^{\prime}, \mu^{\prime}\left(s^{\prime}, \boldsymbol{\theta}^{\prime}\right), \boldsymbol{w}^{\prime}\right)-Q(s, a, \boldsymbol{w})\right)^{2}\right] \label{bbb.7}\tag{bbb.7}
\end{equation}
&emsp;&emsp;<font size=4>
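    As can be seen, computing the target value $y_{i}=r+\gamma Q^{\prime}\left(s^{\prime}, \mu^{\prime}\left(s^{\prime}, \boldsymbol{\theta}^{\prime}\right), \boldsymbol{w}^{\prime}\right)$ involves the target policy network $\mu^{\prime}$ and the target value network $Q^{\prime}$, which makes the prediction value network more stable during learning and easier to converge.
</font><br>
&emsp;&emsp;<font size=4>
    The value network aims to minimize the loss function, so MBGD is used: a mini-batch of $N$ transitions is sampled at random from the replay buffer $\mathcal{D}$ to form a sample estimate of the expectation:
</font><br>
\begin{equation}
\nabla_{w} L(\boldsymbol{w}) \approx -\frac{2}{N} \sum_{i=1}^{N}\left(r+\gamma Q^{\prime}\left(s^{\prime}, \mu^{\prime}\left(s^{\prime}, \boldsymbol{\theta}^{\prime}\right), \boldsymbol{w}^{\prime}\right)-Q(s, a, \boldsymbol{w})\right) \nabla_{w} Q(s, a, \boldsymbol{w}) \label{bbb.8}\tag{bbb.8}
\end{equation}
&emsp;&emsp;<font size=4>
    where $\boldsymbol{\theta}^{\prime}$ and $\boldsymbol{w}^{\prime}$ denote the weights of the target policy network $\mu^{\prime}$ and the target value network $Q^{\prime}$, respectively. At each update, DDPG uses experience replay to draw a fixed number (for example $N$) of transitions from the buffer and passes the gradient of the Q-function with respect to the action from the critic network to the actor network. Finally, the policy network parameters are updated according to Eq. (bbb.5) in the direction that increases the Q-value, in order to solve for the optimal policy.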
</font><br>
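&emsp;&emsp;<font size=4>
    The arithmetic behind the target value and the MSE loss in Eq. (bbb.7) can be traced on a two-transition batch. All numbers below are made up for illustration; <code>target_q_next</code> stands in for $Q^{\prime}(s^{\prime}, \mu^{\prime}(s^{\prime}))$ and <code>current_q</code> for $Q(s, a)$. The <code>not_done</code> mask is not part of Eq. (bbb.7); it follows the convention used in this notebook's <code>train</code> method, which drops the bootstrap term for terminal transitions.
</font><br>

```python
import numpy as np

gamma = 0.9
reward = np.array([1.0, 0.5])
target_q_next = np.array([2.0, 4.0])   # stand-in for Q'(s', mu'(s')), illustrative values
current_q = np.array([2.0, 1.0])       # stand-in for Q(s, a), illustrative values
not_done = np.array([1.0, 0.0])        # the second transition ended its episode

# Target value: bootstrap from the target networks unless the episode terminated.
y = reward + gamma * not_done * target_q_next

# TD error and its mean squared value (the critic loss).
td_error = y - current_q
loss = np.mean(td_error ** 2)

print(y, td_error, loss)
```

&emsp;&emsp;<font size=4>
    Here the targets are $1+0.9 \times 2=2.8$ and $0.5$, the TD errors are $0.8$ and $-0.5$, and the loss is $(0.64+0.25)/2=0.445$.
</font><br>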
&emsp;&emsp;<font size=4>
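    <b>Exploration mechanism</b>
</font><br>
&emsp;&emsp;<font size=4>
    For the exploration-exploitation trade-off, DQN uses the $\varepsilon$-greedy policy, which works well in tasks with discrete action spaces but is of no use in tasks with large or continuous action spaces. Exploiting the AC framework and the properties of a deterministic policy, DDPG adds exploration by injecting noise into the parameter space or the action space: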
</font><br>
<center>
<table align="center" width="100%">
    <tr>
        <th><img src='./参数空间或动作空间中添加噪声.png' width='500px' height='200px'></th>
    </tr>
    <tr>
        <td><font size=3><center> Figure bbb.1&ensp;Adding noise in the parameter space or the action space</center></font></td>
    </tr>
</table>
</center>
&emsp;&emsp;<font size=4>
  For example, at time $t$ an action can be selected by adding noise $\mathcal{N}$ in the action space:
</font><br>
\begin{equation}
A_{t}=\mu\left(S_{t}, \boldsymbol{\theta}\right)+\mathcal{N}  \label{bbb.9}\tag{bbb.9}
\end{equation}
&emsp;&emsp;<font size=4>
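    The noise is generated by an Ornstein-Uhlenbeck (OU) process. The OU process is temporally correlated, much like the sequential structure of a reinforcement learning task, and with this noise the agent explores well in some physical environments. The noise follows the recursion: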
</font><br>
\begin{equation}
\mathcal{N}_{t} \leftarrow \mathcal{N}_{t-1}-\varphi \mathcal{N}_{t-1}+N(0, \sigma I) 
\end{equation}
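&emsp;&emsp;<font size=4>
    A minimal sketch of such noise is given below, assuming the zero-mean discretization $\mathcal{N}_{t}=\mathcal{N}_{t-1}-\varphi \mathcal{N}_{t-1}+\epsilon_{t}$ with Gaussian $\epsilon_{t}$. The values $\varphi=0.15$ and a noise standard deviation of $0.2$, as well as the helper names <code>ou_noise</code> and <code>lag1_autocorr</code>, are assumptions made for illustration. The lag-1 autocorrelation makes the temporal correlation visible next to uncorrelated Gaussian noise.
</font><br>

```python
import numpy as np


def ou_noise(steps, dim, phi=0.15, sigma=0.2, seed=0):
    """Generate zero-mean OU-style noise: x_t = x_{t-1} - phi * x_{t-1} + sigma * eps_t."""
    rng = np.random.default_rng(seed)
    noise = np.zeros((steps, dim))
    x = np.zeros(dim)
    for t in range(steps):
        x = x - phi * x + sigma * rng.normal(size=dim)
        noise[t] = x
    return noise


def lag1_autocorr(series):
    """Correlation between a 1-D series and itself shifted by one step."""
    return np.corrcoef(series[:-1], series[1:])[0, 1]


ou = ou_noise(steps=5000, dim=1)[:, 0]
white = 0.2 * np.random.default_rng(1).normal(size=5000)   # uncorrelated Gaussian noise

ou_corr = lag1_autocorr(ou)
white_corr = lag1_autocorr(white)
print(ou_corr, white_corr)
```

&emsp;&emsp;<font size=4>
    Consecutive OU samples are strongly correlated (the lag-1 autocorrelation is close to $1-\varphi$), so the perturbation drifts smoothly over several time steps, whereas the autocorrelation of the white noise is close to zero.
</font><br>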
&emsp;&emsp;<font size=4>
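    Later work showed, however, that uncorrelated zero-mean Gaussian noise performs at least as well as the OU process and is simpler to implement. With Gaussian noise of mean 0 and variance 1, action selection becomes: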
</font><br>
\begin{equation}
A_{t}=\mu\left(S_{t}, \boldsymbol{\theta}\right)+\mathcal{N}(0,1)
\end{equation}
&emsp;&emsp;<font size=4>
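    <b>Soft update</b>
</font><br>
&emsp;&emsp;<font size=4>
    DDPG also synchronizes its target network parameters differently from DQN. DQN uses a hard update: the parameters are copied from the prediction network to the target network only once every fixed number of steps. DDPG uses a soft update: after every update of the prediction network, the target network parameters move a small step toward those of the prediction network: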
 </font><br>
\begin{equation}\left\{\begin{array}{l}
\mathbf{w}^{\prime} \leftarrow \tau \mathbf{w}+(1-\tau) \mathbf{w}^{\prime} \\
\boldsymbol{\theta}^{\prime} \leftarrow \tau \boldsymbol{\theta}+(1-\tau) \boldsymbol{\theta}^{\prime}
\end{array}\right. \label{bbb.10}\tag{bbb.10}
\end{equation}  
&emsp;&emsp;<font size=4>
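    where $\tau$ is a hyperparameter much smaller than 1, usually set to 0.001.
</font><br>
&emsp;&emsp;<font size=4>
    With soft updates, the target value moves slowly and continuously toward the current estimate. This keeps the target parameters up to date while keeping the gradients of the prediction network relatively stable during training, which makes the algorithm easier to converge. The drawback is that the parameters change very little at each step, so learning takes longer.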
 </font><br>   
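&emsp;&emsp;<font size=4>
    To see how slowly the target parameters follow, the sketch below applies the rule in Eq. (bbb.10) repeatedly while holding the prediction weights fixed. Freezing the prediction weights is a simplifying assumption for illustration (in training they change at every step), and the weight values are made up.
</font><br>

```python
import numpy as np

tau = 0.001
w = np.array([1.0, -2.0, 0.5])   # prediction-network weights, held fixed for illustration
w_target = np.zeros(3)           # target-network weights

initial_gap = np.linalg.norm(w - w_target)
steps = 1000
for _ in range(steps):
    # Soft update: w' <- tau * w + (1 - tau) * w'
    w_target = tau * w + (1.0 - tau) * w_target

# Each update shrinks the gap by a factor (1 - tau), so after k updates
# the remaining fraction is (1 - tau) ** k.
remaining = np.linalg.norm(w - w_target) / initial_gap
print(remaining, (1.0 - tau) ** steps)
```

&emsp;&emsp;<font size=4>
    With $\tau=0.001$, roughly 37% of the initial gap still remains after 1000 updates, which illustrates both the stability and the slowness of soft updates.
</font><br>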

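### The DDPG Algorithm
&emsp;&emsp;<font size=4>
    As Section bbb.1.2 shows, DDPG aims to maximize the policy objective $J_{\beta}(\boldsymbol{\theta})$ while minimizing the loss $L(\boldsymbol{w})$ of the value network. The flowchart of the algorithm is shown in Figure bbb.2: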
</font><br>
<center>
<table align="center" width="100%">
    <tr>
        <th><img src='./DDPG算法流程图.png' width='500px' height='200px'></th>
    </tr>
    <tr>
        <td><font size=3><center>Figure bbb.2&ensp;Flowchart of the DDPG algorithm</center></font></td>
    </tr>
</table>
</center>

The DDPG method for building the policy network model is given in Algorithm bbb.1:
<hr style="height:1px;border:none;border-top:1px solid #555555;" />
&emsp;&emsp;<font size=3.5><b>Algorithm bbb.1</b> DDPG algorithm for building the policy network model (Lillicrap et al. 2016)</font><br>
<hr>
&emsp;&emsp;<font size=3.5>Initialization:</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    1.&emsp;Initialize the prediction policy network $\mu(s, \boldsymbol{\theta})$ and the prediction value network $Q(s, a, \boldsymbol{w})$, with parameters $\boldsymbol{\theta}$ and $\boldsymbol{w}$
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    2.&emsp;Initialize the target policy network $\mu\left(s, \boldsymbol{\theta}^{\prime}\right)$ and the target value network $Q\left(s, a, \boldsymbol{w}^{\prime}\right)$, with parameters $\boldsymbol{\theta}^{\prime} \leftarrow \boldsymbol{\theta}, \quad \boldsymbol{w}^{\prime} \leftarrow \boldsymbol{w}$
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    3.&emsp;Replay buffer $\mathcal{D}$ with capacity $N$
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    4.&emsp;Total number of iterations $M$, discount factor $\gamma$, $\tau=0.001$, mini-batch size $n$
</font><br>
<hr>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    5.&emsp;<b>for</b> $e$=1 <b>to</b> $M$ <b>do:</b>
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    6.&emsp;&emsp;&emsp;Initialize a random process $\mathcal{N}$ for action exploration
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    7.&emsp;&emsp;&emsp;Set the initial state to $S_{0}$
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    8.&emsp;&emsp;&emsp;<b>repeat</b> (for each time step $t$ of the episode):
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    9.&emsp;&emsp;&emsp;&emsp;&emsp;Select an action from the current prediction policy network plus exploration noise: $A_{t}=\mu\left(S_{t}, \boldsymbol{\theta}\right)+\mathcal{N}_{t}$
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    10.&emsp;&emsp;&emsp;&emsp;&emsp;Execute action $A_{t}$ and observe the reward $R_{t+1}$ and the next state $S_{t+1}$
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    11.&emsp;&emsp;&emsp;&emsp;&emsp;Store the transition $\left(S_{t}, A_{t}, R_{t+1}, S_{t+1}\right)$ in the replay buffer $\mathcal{D}$
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    12.&emsp;&emsp;&emsp;&emsp;&emsp;Sample a random mini-batch of $n$ transitions $\left(S_{i}, A_{i}, R_{i+1}, S_{i+1}\right)$ from the replay buffer $\mathcal{D}$ and compute the target values:
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    13.&emsp;&emsp;&emsp;&emsp;&emsp;$y_{i}=R_{i+1}+\gamma Q\left(S_{i+1}, \mu\left(S_{i+1}, \boldsymbol{\theta}^{\prime}\right), \boldsymbol{w}^{\prime}\right)$
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    14.&emsp;&emsp;&emsp;&emsp;&emsp;Update the value network (critic) parameters $\boldsymbol{w}$ by MBGD, minimizing the loss function:
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    &emsp;&emsp;&emsp;&emsp;&emsp;
          $\nabla_{w} L(\boldsymbol{w}) \approx -\frac{2}{n} \sum_{i=1}^{n}\left(y_{i}-Q\left(S_{i}, A_{i}, \boldsymbol{w}\right)\right) \nabla_{w} Q\left(S_{i}, A_{i}, \boldsymbol{w}\right)$
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    15.&emsp;&emsp;&emsp;&emsp;&emsp;Update the policy network (actor) parameters $\boldsymbol{\theta}$ by MBGA, maximizing the objective function:
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    &emsp;&emsp;&emsp;&emsp;&emsp;
              $\left.\nabla_{\theta} \hat{J}_{\beta}(\boldsymbol{\theta}) \approx \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta} \mu\left(S_{i}, \boldsymbol{\theta}\right) \nabla_{a} Q\left(S_{i}, a, \boldsymbol{w}\right)\right|_{a=\mu\left(S_{i}, \boldsymbol{\theta}\right)}$
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    16.&emsp;&emsp;&emsp;&emsp;&emsp;Soft-update the target networks: $\left\{\begin{array}{l}\boldsymbol{w}^{\prime} \leftarrow \tau \boldsymbol{w}+(1-\tau) \boldsymbol{w}^{\prime} \\ \boldsymbol{\theta}^{\prime} \leftarrow \tau \boldsymbol{\theta}+(1-\tau) \boldsymbol{\theta}^{\prime}\end{array}\right.$
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    17.&emsp;&emsp;&emsp;<b>until</b> $S_{t}$ is a terminal state
</font><br>
<hr style="height:1px;border:none;border-top:1px solid #555555;" /><br>

In [1]:
import copy
import os

import gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

## USE CUDA

In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Replay Buffer

In [3]:
class ReplayBuffer(object):
    def __init__(self, state_dim, action_dim, max_size=int(1e6)):
        self.max_size = max_size
        self.ptr = 0
        self.size = 0

        self.state = np.zeros((max_size, state_dim))
        self.action = np.zeros((max_size, action_dim))
        self.next_state = np.zeros((max_size, state_dim))
        self.reward = np.zeros((max_size, 1))
        self.not_done = np.zeros((max_size, 1))

        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    def add(self, state, action, next_state, reward, done):
        self.state[self.ptr] = state
        self.action[self.ptr] = action
        self.next_state[self.ptr] = next_state
        self.reward[self.ptr] = reward
        self.not_done[self.ptr] = 1. - done

        self.ptr = (self.ptr + 1) % self.max_size
        self.size = min(self.size + 1, self.max_size)

    def sample(self, batch_size):
        ind = np.random.randint(0, self.size, size=batch_size)

        return (
            torch.FloatTensor(self.state[ind]).to(self.device),
            torch.FloatTensor(self.action[ind]).to(self.device),
            torch.FloatTensor(self.next_state[ind]).to(self.device),
            torch.FloatTensor(self.reward[ind]).to(self.device),
            torch.FloatTensor(self.not_done[ind]).to(self.device)
        )
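
The `ptr` and `size` bookkeeping in `add` makes the buffer circular: once `max_size` transitions are stored, each new transition overwrites the oldest one. The standalone sketch below reproduces only that indexing logic on a tiny array; the capacity of 3 and the stored values are made up for illustration.

```python
import numpy as np

max_size = 3
storage = np.zeros(max_size)
ptr, size = 0, 0

for value in [10.0, 20.0, 30.0, 40.0, 50.0]:
    storage[ptr] = value              # write at the current position
    ptr = (ptr + 1) % max_size        # wrap around when the end is reached
    size = min(size + 1, max_size)    # number of valid entries, capped at capacity

print(storage, ptr, size)
```

After five insertions into a buffer of capacity 3, the two oldest values (10 and 20) have been overwritten by 40 and 50, while `size` stays at 3 so that `sample` only draws valid indices.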

In [4]:
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super(Actor, self).__init__()

        self.l1 = nn.Linear(state_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, action_dim)

        self.max_action = max_action


    def forward(self, state):
        a = F.relu(self.l1(state))
        a = F.relu(self.l2(a))
        return self.max_action * torch.tanh(self.l3(a))


class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()

        self.l1 = nn.Linear(state_dim, 400)
        self.l2 = nn.Linear(400 + action_dim, 300)
        self.l3 = nn.Linear(300, 1)


    def forward(self, state, action):
        q = F.relu(self.l1(state))
        q = F.relu(self.l2(torch.cat([q, action], 1)))
        return self.l3(q)

In [9]:
actor1=Actor(17,6,1.0)
for ch in actor1.children():
    print(ch)
print("*********************")
critic1=Critic(17,6)
for ch in critic1.children():
    print(ch)

Linear(in_features=17, out_features=400, bias=True)
Linear(in_features=400, out_features=300, bias=True)
Linear(in_features=300, out_features=6, bias=True)
*********************
Linear(in_features=17, out_features=400, bias=True)
Linear(in_features=406, out_features=300, bias=True)
Linear(in_features=300, out_features=1, bias=True)


## DDPG Update

In [5]:
class DDPG(object):
    def __init__(self, state_dim, action_dim, max_action, discount=0.99, tau=0.001):
        self.actor = Actor(state_dim, action_dim, max_action).to(device)
        self.actor_target = copy.deepcopy(self.actor)
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), lr=1e-4)

        self.critic = Critic(state_dim, action_dim).to(device)
        self.critic_target = copy.deepcopy(self.critic)
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), weight_decay=1e-2)

        self.discount = discount
        self.tau = tau


    def select_action(self, state):
        state = torch.FloatTensor(state.reshape(1, -1)).to(device)
        return self.actor(state).cpu().data.numpy().flatten()


    def train(self, replay_buffer, batch_size=64):
        # Sample replay buffer 
        state, action, next_state, reward, not_done = replay_buffer.sample(batch_size)

        # Compute the target Q value
        target_Q = self.critic_target(next_state, self.actor_target(next_state))
        target_Q = reward + (not_done * self.discount * target_Q).detach()

        # Get current Q estimate
        current_Q = self.critic(state, action)

        # Compute critic loss
        critic_loss = F.mse_loss(current_Q, target_Q)

        # Optimize the critic
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        # Compute actor loss
        actor_loss = -self.critic(state, self.actor(state)).mean()
        
        # Optimize the actor 
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # Update the frozen target models
        for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
            target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)

        for param, target_param in zip(self.actor.parameters(), self.actor_target.parameters()):
            target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)


    def save(self, filename):
        torch.save(self.critic.state_dict(), filename + "_critic")
        torch.save(self.critic_optimizer.state_dict(), filename + "_critic_optimizer")
        
        torch.save(self.actor.state_dict(), filename + "_actor")
        torch.save(self.actor_optimizer.state_dict(), filename + "_actor_optimizer")


    def load(self, filename):
        self.critic.load_state_dict(torch.load(filename + "_critic"))
        self.critic_optimizer.load_state_dict(torch.load(filename + "_critic_optimizer"))
        self.critic_target = copy.deepcopy(self.critic)

        self.actor.load_state_dict(torch.load(filename + "_actor"))
        self.actor_optimizer.load_state_dict(torch.load(filename + "_actor_optimizer"))
        self.actor_target = copy.deepcopy(self.actor)

In [6]:
# Runs policy for X episodes and returns average reward
# A fixed seed is used for the eval environment
def eval_policy(policy, env_name, seed, eval_episodes=10):
    eval_env = gym.make(env_name)
    eval_env.seed(seed + 100)

    avg_reward = 0.
    for _ in range(eval_episodes):
        state, done = eval_env.reset(), False
        while not done:
            action = policy.select_action(np.array(state))
            state, reward, done, _ = eval_env.step(action)
            avg_reward += reward

    avg_reward /= eval_episodes

    print("---------------------------------------")
    print(f"Evaluation over {eval_episodes} episodes: {avg_reward:.3f}")
    print("---------------------------------------")
    return avg_reward

In [7]:
policy="DDPG"
env_name="Walker2d-v2"          # OpenAI gym environment name
seed=0                        # Sets Gym, PyTorch and Numpy seeds
start_timesteps=25e3         # Time steps initial random policy is used
eval_freq=5e3               # How often (time steps) we evaluate
max_timesteps=1e6   # Max time steps to run environment
expl_noise=0.1                 # Std of Gaussian exploration noise
batch_size=256      # Batch size for both actor and critic
discount=0.99                 # Discount factor
tau=0.005                     # Target network update rate
policy_noise=0.2              # TD3 only: noise added to target policy during critic update (unused by this DDPG)
noise_clip=0.5                # TD3 only: range to clip target policy noise (unused by this DDPG)
policy_freq=2                 # TD3 only: frequency of delayed policy updates (unused by this DDPG)
save_model=True               # Save model and optimizer parameters
load_model=""                # Model load file name, "" doesn't load, "default" uses file_name

file_name = f"{policy}_{env_name}_{seed}"
print("---------------------------------------")
print(f"Policy: {policy}, Env: {env_name}, Seed: {seed}")
print("---------------------------------------")

if not os.path.exists("./results"):
    os.makedirs("./results")

if save_model and not os.path.exists("./models"):
    os.makedirs("./models")

env = gym.make(env_name)

# Set seeds
env.seed(seed)
torch.manual_seed(seed)
np.random.seed(seed)

state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0] 
max_action = float(env.action_space.high[0])

kwargs = {
    "state_dim": state_dim,
    "action_dim": action_dim,
    "max_action": max_action,
    "discount": discount,
    "tau": tau,
}


policy = DDPG(**kwargs)

if load_model != "":
    policy_file = file_name if load_model == "default" else load_model
    policy.load(f"./models/{policy_file}")

replay_buffer = ReplayBuffer(state_dim, action_dim)

# Evaluate untrained policy
evaluations = [eval_policy(policy, env_name, seed)]

state, done = env.reset(), False
episode_reward = 0
episode_timesteps = 0
episode_num = 0

for t in range(int(max_timesteps)):

    episode_timesteps += 1

    # Select action randomly or according to policy
    if t < start_timesteps:
        action = env.action_space.sample()
    else:
        action = (
            policy.select_action(np.array(state))
            + np.random.normal(0, max_action * expl_noise, size=action_dim)
        ).clip(-max_action, max_action)

    # Perform action
    next_state, reward, done, _ = env.step(action) 
    done_bool = float(done) if episode_timesteps < env._max_episode_steps else 0

    # Store data in replay buffer
    replay_buffer.add(state, action, next_state, reward, done_bool)

    state = next_state
    episode_reward += reward

    # Train agent after collecting sufficient data
    if t >= start_timesteps:
        policy.train(replay_buffer, batch_size)

    if done: 
        # +1 to account for 0 indexing. +0 on ep_timesteps since it will increment +1 even if done=True
        print(f"Total T: {t+1} Episode Num: {episode_num+1} Episode T: {episode_timesteps} Reward: {episode_reward:.3f}")
        # Reset environment
        state, done = env.reset(), False
        episode_reward = 0
        episode_timesteps = 0
        episode_num += 1 

    # Evaluate the policy periodically; save results (and the model) only at evaluation time
    if (t + 1) % eval_freq == 0:
        evaluations.append(eval_policy(policy, env_name, seed))
        np.save(f"./results/{file_name}", evaluations)
        if save_model:
            policy.save(f"./models/{file_name}")

---------------------------------------
Policy: DDPG, Env: Walker2d-v2, Seed: 0
---------------------------------------
---------------------------------------
Evaluation over 10 episodes: 2.778
---------------------------------------
Total T: 16 Episode Num: 1 Episode T: 16 Reward: -2.204
Total T: 32 Episode Num: 2 Episode T: 16 Reward: 3.876
Total T: 86 Episode Num: 3 Episode T: 54 Reward: 17.944
Total T: 110 Episode Num: 4 Episode T: 24 Reward: -1.971
Total T: 128 Episode Num: 5 Episode T: 18 Reward: -5.018
Total T: 139 Episode Num: 6 Episode T: 11 Reward: -1.377
Total T: 154 Episode Num: 7 Episode T: 15 Reward: -1.568
Total T: 235 Episode Num: 8 Episode T: 81 Reward: 50.821
Total T: 272 Episode Num: 9 Episode T: 37 Reward: 4.655
Total T: 306 Episode Num: 10 Episode T: 34 Reward: 16.521
Total T: 330 Episode Num: 11 Episode T: 24 Reward: 5.578
Total T: 356 Episode Num: 12 Episode T: 26 Reward: 15.820
Total T: 384 Episode Num: 13 Episode T: 28 Reward: 1.189
Total T: 394 Episode Num: 1

Total T: 2957 Episode Num: 141 Episode T: 37 Reward: 1.916
Total T: 2977 Episode Num: 142 Episode T: 20 Reward: -7.539
Total T: 3015 Episode Num: 143 Episode T: 38 Reward: 11.741
Total T: 3054 Episode Num: 144 Episode T: 39 Reward: -0.730
Total T: 3071 Episode Num: 145 Episode T: 17 Reward: -2.328
Total T: 3093 Episode Num: 146 Episode T: 22 Reward: 0.613
Total T: 3107 Episode Num: 147 Episode T: 14 Reward: 1.201
Total T: 3127 Episode Num: 148 Episode T: 20 Reward: 4.688
Total T: 3141 Episode Num: 149 Episode T: 14 Reward: 0.432
Total T: 3157 Episode Num: 150 Episode T: 16 Reward: -2.430
Total T: 3168 Episode Num: 151 Episode T: 11 Reward: -3.701
Total T: 3190 Episode Num: 152 Episode T: 22 Reward: 7.893
Total T: 3206 Episode Num: 153 Episode T: 16 Reward: -0.248
Total T: 3235 Episode Num: 154 Episode T: 29 Reward: -6.261
Total T: 3258 Episode Num: 155 Episode T: 23 Reward: 3.304
Total T: 3275 Episode Num: 156 Episode T: 17 Reward: 0.388
Total T: 3319 Episode Num: 157 Episode T: 44 Rew

Total T: 5793 Episode Num: 281 Episode T: 18 Reward: 1.096
Total T: 5815 Episode Num: 282 Episode T: 22 Reward: 3.788
Total T: 5843 Episode Num: 283 Episode T: 28 Reward: 5.568
Total T: 5853 Episode Num: 284 Episode T: 10 Reward: -1.379
Total T: 5877 Episode Num: 285 Episode T: 24 Reward: 7.184
Total T: 5911 Episode Num: 286 Episode T: 34 Reward: 2.713
Total T: 5924 Episode Num: 287 Episode T: 13 Reward: -0.829
Total T: 5943 Episode Num: 288 Episode T: 19 Reward: -4.762
Total T: 5956 Episode Num: 289 Episode T: 13 Reward: 3.783
Total T: 5984 Episode Num: 290 Episode T: 28 Reward: 0.875
Total T: 6004 Episode Num: 291 Episode T: 20 Reward: -6.211
Total T: 6020 Episode Num: 292 Episode T: 16 Reward: 2.624
Total T: 6038 Episode Num: 293 Episode T: 18 Reward: -13.402
Total T: 6051 Episode Num: 294 Episode T: 13 Reward: 0.188
Total T: 6067 Episode Num: 295 Episode T: 16 Reward: 5.040
Total T: 6078 Episode Num: 296 Episode T: 11 Reward: -0.263
Total T: 6100 Episode Num: 297 Episode T: 22 Rewa

Total T: 8470 Episode Num: 423 Episode T: 13 Reward: -0.795
Total T: 8488 Episode Num: 424 Episode T: 18 Reward: 0.932
Total T: 8501 Episode Num: 425 Episode T: 13 Reward: -1.247
Total T: 8516 Episode Num: 426 Episode T: 15 Reward: -4.530
Total T: 8528 Episode Num: 427 Episode T: 12 Reward: -0.693
Total T: 8549 Episode Num: 428 Episode T: 21 Reward: 1.912
Total T: 8556 Episode Num: 429 Episode T: 7 Reward: -3.572
Total T: 8567 Episode Num: 430 Episode T: 11 Reward: -0.460
Total T: 8591 Episode Num: 431 Episode T: 24 Reward: 3.405
Total T: 8613 Episode Num: 432 Episode T: 22 Reward: 3.256
Total T: 8639 Episode Num: 433 Episode T: 26 Reward: 13.353
Total T: 8672 Episode Num: 434 Episode T: 33 Reward: 4.822
Total T: 8692 Episode Num: 435 Episode T: 20 Reward: 4.625
Total T: 8719 Episode Num: 436 Episode T: 27 Reward: 2.612
Total T: 8741 Episode Num: 437 Episode T: 22 Reward: -2.218
Total T: 8754 Episode Num: 438 Episode T: 13 Reward: -1.205
Total T: 8765 Episode Num: 439 Episode T: 11 Rew

Total T: 11106 Episode Num: 558 Episode T: 15 Reward: 2.578
Total T: 11134 Episode Num: 559 Episode T: 28 Reward: -1.724
Total T: 11187 Episode Num: 560 Episode T: 53 Reward: 5.568
Total T: 11201 Episode Num: 561 Episode T: 14 Reward: 2.212
Total T: 11223 Episode Num: 562 Episode T: 22 Reward: -4.343
Total T: 11260 Episode Num: 563 Episode T: 37 Reward: 12.025
Total T: 11281 Episode Num: 564 Episode T: 21 Reward: 7.593
Total T: 11302 Episode Num: 565 Episode T: 21 Reward: -7.765
Total T: 11321 Episode Num: 566 Episode T: 19 Reward: -4.540
Total T: 11353 Episode Num: 567 Episode T: 32 Reward: 8.846
Total T: 11366 Episode Num: 568 Episode T: 13 Reward: 0.022
Total T: 11381 Episode Num: 569 Episode T: 15 Reward: -4.678
Total T: 11415 Episode Num: 570 Episode T: 34 Reward: 20.188
Total T: 11430 Episode Num: 571 Episode T: 15 Reward: -1.628
Total T: 11449 Episode Num: 572 Episode T: 19 Reward: 1.332
Total T: 11480 Episode Num: 573 Episode T: 31 Reward: 17.731
Total T: 11498 Episode Num: 574

Total T: 13983 Episode Num: 698 Episode T: 22 Reward: 5.579
Total T: 14002 Episode Num: 699 Episode T: 19 Reward: 1.149
Total T: 14020 Episode Num: 700 Episode T: 18 Reward: 1.331
Total T: 14033 Episode Num: 701 Episode T: 13 Reward: 3.006
Total T: 14057 Episode Num: 702 Episode T: 24 Reward: 0.091
Total T: 14074 Episode Num: 703 Episode T: 17 Reward: -0.380
Total T: 14106 Episode Num: 704 Episode T: 32 Reward: 0.415
Total T: 14123 Episode Num: 705 Episode T: 17 Reward: -0.896
Total T: 14137 Episode Num: 706 Episode T: 14 Reward: 4.938
Total T: 14162 Episode Num: 707 Episode T: 25 Reward: 4.206
Total T: 14174 Episode Num: 708 Episode T: 12 Reward: -2.987
Total T: 14194 Episode Num: 709 Episode T: 20 Reward: -3.920
Total T: 14209 Episode Num: 710 Episode T: 15 Reward: 2.401
Total T: 14236 Episode Num: 711 Episode T: 27 Reward: -1.199
Total T: 14248 Episode Num: 712 Episode T: 12 Reward: -1.194
Total T: 14264 Episode Num: 713 Episode T: 16 Reward: 1.144
Total T: 14281 Episode Num: 714 Ep

Total T: 16681 Episode Num: 832 Episode T: 14 Reward: 3.023
Total T: 16691 Episode Num: 833 Episode T: 10 Reward: -2.014
Total T: 16732 Episode Num: 834 Episode T: 41 Reward: 3.237
Total T: 16760 Episode Num: 835 Episode T: 28 Reward: 1.718
Total T: 16774 Episode Num: 836 Episode T: 14 Reward: 2.972
Total T: 16786 Episode Num: 837 Episode T: 12 Reward: 2.765
Total T: 16805 Episode Num: 838 Episode T: 19 Reward: 1.219
Total T: 16827 Episode Num: 839 Episode T: 22 Reward: 7.001
Total T: 16838 Episode Num: 840 Episode T: 11 Reward: 1.257
Total T: 16851 Episode Num: 841 Episode T: 13 Reward: -5.595
Total T: 16878 Episode Num: 842 Episode T: 27 Reward: 9.473
Total T: 16897 Episode Num: 843 Episode T: 19 Reward: 0.931
Total T: 16922 Episode Num: 844 Episode T: 25 Reward: -2.518
Total T: 16951 Episode Num: 845 Episode T: 29 Reward: 1.700
Total T: 16966 Episode Num: 846 Episode T: 15 Reward: -0.510
Total T: 16983 Episode Num: 847 Episode T: 17 Reward: -1.488
Total T: 17000 Episode Num: 848 Epi

Total T: 19489 Episode Num: 972 Episode T: 22 Reward: -1.909
Total T: 19508 Episode Num: 973 Episode T: 19 Reward: -7.741
Total T: 19521 Episode Num: 974 Episode T: 13 Reward: -1.423
Total T: 19550 Episode Num: 975 Episode T: 29 Reward: -0.580
Total T: 19582 Episode Num: 976 Episode T: 32 Reward: -4.144
Total T: 19614 Episode Num: 977 Episode T: 32 Reward: 11.719
Total T: 19631 Episode Num: 978 Episode T: 17 Reward: -1.952
Total T: 19645 Episode Num: 979 Episode T: 14 Reward: -2.084
Total T: 19659 Episode Num: 980 Episode T: 14 Reward: -1.933
Total T: 19686 Episode Num: 981 Episode T: 27 Reward: -2.291
Total T: 19700 Episode Num: 982 Episode T: 14 Reward: 2.243
Total T: 19718 Episode Num: 983 Episode T: 18 Reward: 3.538
Total T: 19765 Episode Num: 984 Episode T: 47 Reward: 16.434
Total T: 19780 Episode Num: 985 Episode T: 15 Reward: -3.104
Total T: 19802 Episode Num: 986 Episode T: 22 Reward: -2.110
Total T: 19832 Episode Num: 987 Episode T: 30 Reward: 6.193
Total T: 19852 Episode Num:

Total T: 22291 Episode Num: 1106 Episode T: 13 Reward: 2.404
Total T: 22316 Episode Num: 1107 Episode T: 25 Reward: 0.723
Total T: 22338 Episode Num: 1108 Episode T: 22 Reward: 5.493
Total T: 22354 Episode Num: 1109 Episode T: 16 Reward: 1.705
Total T: 22371 Episode Num: 1110 Episode T: 17 Reward: 1.327
Total T: 22381 Episode Num: 1111 Episode T: 10 Reward: -1.966
Total T: 22395 Episode Num: 1112 Episode T: 14 Reward: -2.358
Total T: 22407 Episode Num: 1113 Episode T: 12 Reward: -2.527
Total T: 22419 Episode Num: 1114 Episode T: 12 Reward: -0.684
Total T: 22438 Episode Num: 1115 Episode T: 19 Reward: -6.146
Total T: 22447 Episode Num: 1116 Episode T: 9 Reward: 0.348
Total T: 22475 Episode Num: 1117 Episode T: 28 Reward: 5.108
Total T: 22491 Episode Num: 1118 Episode T: 16 Reward: -2.423
Total T: 22506 Episode Num: 1119 Episode T: 15 Reward: 4.719
Total T: 22521 Episode Num: 1120 Episode T: 15 Reward: 0.684
Total T: 22530 Episode Num: 1121 Episode T: 9 Reward: -2.803
Total T: 22542 Epis

Total T: 24941 Episode Num: 1243 Episode T: 16 Reward: 2.567
Total T: 24959 Episode Num: 1244 Episode T: 18 Reward: 2.301
Total T: 24972 Episode Num: 1245 Episode T: 13 Reward: -2.360
Total T: 24986 Episode Num: 1246 Episode T: 14 Reward: 2.399
---------------------------------------
Evaluation over 10 episodes: 2.778
---------------------------------------
Total T: 25107 Episode Num: 1247 Episode T: 121 Reward: -6.151
Total T: 25173 Episode Num: 1248 Episode T: 66 Reward: -11.525
Total T: 25236 Episode Num: 1249 Episode T: 63 Reward: -12.990
Total T: 25305 Episode Num: 1250 Episode T: 69 Reward: -8.602
Total T: 25372 Episode Num: 1251 Episode T: 67 Reward: -9.933
Total T: 25439 Episode Num: 1252 Episode T: 67 Reward: -10.767
Total T: 25508 Episode Num: 1253 Episode T: 69 Reward: -6.807
Total T: 25576 Episode Num: 1254 Episode T: 68 Reward: -9.404
Total T: 25640 Episode Num: 1255 Episode T: 64 Reward: -10.926
Total T: 25710 Episode Num: 1256 Episode T: 70 Reward: -11.379
Total T: 25886

Total T: 38937 Episode Num: 1368 Episode T: 238 Reward: -78.982
Total T: 39295 Episode Num: 1369 Episode T: 358 Reward: 169.019
Total T: 39376 Episode Num: 1370 Episode T: 81 Reward: 135.274
Total T: 39511 Episode Num: 1371 Episode T: 135 Reward: 278.460
Total T: 39727 Episode Num: 1372 Episode T: 216 Reward: 259.184
Total T: 39833 Episode Num: 1373 Episode T: 106 Reward: 175.252
---------------------------------------
Evaluation over 10 episodes: 174.911
---------------------------------------
Total T: 40040 Episode Num: 1374 Episode T: 207 Reward: -2.287
Total T: 40240 Episode Num: 1375 Episode T: 200 Reward: 180.474
Total T: 40311 Episode Num: 1376 Episode T: 71 Reward: 113.524
Total T: 40398 Episode Num: 1377 Episode T: 87 Reward: 92.144
Total T: 40524 Episode Num: 1378 Episode T: 126 Reward: 96.460
Total T: 40711 Episode Num: 1379 Episode T: 187 Reward: 29.617
Total T: 40818 Episode Num: 1380 Episode T: 107 Reward: 179.010
Total T: 41141 Episode Num: 1381 Episode T: 323 Reward: 36

Total T: 58244 Episode Num: 1490 Episode T: 152 Reward: 235.196
Total T: 58377 Episode Num: 1491 Episode T: 133 Reward: 201.829
Total T: 58560 Episode Num: 1492 Episode T: 183 Reward: 353.309
Total T: 58741 Episode Num: 1493 Episode T: 181 Reward: 300.524
Total T: 58841 Episode Num: 1494 Episode T: 100 Reward: 104.375
Total T: 58918 Episode Num: 1495 Episode T: 77 Reward: 51.065
Total T: 59120 Episode Num: 1496 Episode T: 202 Reward: 349.078
Total T: 59276 Episode Num: 1497 Episode T: 156 Reward: 339.729
Total T: 59493 Episode Num: 1498 Episode T: 217 Reward: 35.844
Total T: 59613 Episode Num: 1499 Episode T: 120 Reward: 189.060
Total T: 59671 Episode Num: 1500 Episode T: 58 Reward: 8.178
Total T: 59881 Episode Num: 1501 Episode T: 210 Reward: 183.101
---------------------------------------
Evaluation over 10 episodes: 184.272
---------------------------------------
Total T: 60317 Episode Num: 1502 Episode T: 436 Reward: 430.324
Total T: 60514 Episode Num: 1503 Episode T: 197 Reward: 1

Total T: 82276 Episode Num: 1610 Episode T: 267 Reward: 373.629
Total T: 82665 Episode Num: 1611 Episode T: 389 Reward: 459.201
Total T: 82919 Episode Num: 1612 Episode T: 254 Reward: 375.415
Total T: 83104 Episode Num: 1613 Episode T: 185 Reward: 354.908
Total T: 83226 Episode Num: 1614 Episode T: 122 Reward: 213.752
Total T: 83310 Episode Num: 1615 Episode T: 84 Reward: 37.722
Total T: 83388 Episode Num: 1616 Episode T: 78 Reward: 51.993
Total T: 83548 Episode Num: 1617 Episode T: 160 Reward: 262.736
Total T: 83671 Episode Num: 1618 Episode T: 123 Reward: -1.517
Total T: 83906 Episode Num: 1619 Episode T: 235 Reward: 338.218
Total T: 84135 Episode Num: 1620 Episode T: 229 Reward: -19.079
Total T: 84281 Episode Num: 1621 Episode T: 146 Reward: 232.720
Total T: 84396 Episode Num: 1622 Episode T: 115 Reward: 185.563
Total T: 84482 Episode Num: 1623 Episode T: 86 Reward: 124.981
Total T: 84612 Episode Num: 1624 Episode T: 130 Reward: 211.522
Total T: 84700 Episode Num: 1625 Episode T: 88

Total T: 105067 Episode Num: 1731 Episode T: 239 Reward: 549.353
Total T: 105333 Episode Num: 1732 Episode T: 266 Reward: 443.980
Total T: 105431 Episode Num: 1733 Episode T: 98 Reward: 116.959
Total T: 105504 Episode Num: 1734 Episode T: 73 Reward: 125.045
Total T: 105572 Episode Num: 1735 Episode T: 68 Reward: 118.658
Total T: 105957 Episode Num: 1736 Episode T: 385 Reward: 221.771
Total T: 106033 Episode Num: 1737 Episode T: 76 Reward: 123.245
Total T: 106584 Episode Num: 1738 Episode T: 551 Reward: 732.377
Total T: 106659 Episode Num: 1739 Episode T: 75 Reward: 114.346
Total T: 106785 Episode Num: 1740 Episode T: 126 Reward: 241.841
Total T: 106941 Episode Num: 1741 Episode T: 156 Reward: 308.920
Total T: 107125 Episode Num: 1742 Episode T: 184 Reward: 364.747
Total T: 107233 Episode Num: 1743 Episode T: 108 Reward: 187.125
Total T: 107332 Episode Num: 1744 Episode T: 99 Reward: 168.261
Total T: 107422 Episode Num: 1745 Episode T: 90 Reward: 156.066
Total T: 107535 Episode Num: 174

Total T: 125733 Episode Num: 1851 Episode T: 145 Reward: 251.943
Total T: 125841 Episode Num: 1852 Episode T: 108 Reward: 193.105
Total T: 126054 Episode Num: 1853 Episode T: 213 Reward: 494.351
Total T: 126236 Episode Num: 1854 Episode T: 182 Reward: 377.682
Total T: 126343 Episode Num: 1855 Episode T: 107 Reward: 236.185
Total T: 126490 Episode Num: 1856 Episode T: 147 Reward: 234.675
Total T: 126659 Episode Num: 1857 Episode T: 169 Reward: 241.992
Total T: 126742 Episode Num: 1858 Episode T: 83 Reward: 114.764
Total T: 126858 Episode Num: 1859 Episode T: 116 Reward: 251.382
Total T: 127012 Episode Num: 1860 Episode T: 154 Reward: 43.613
Total T: 127158 Episode Num: 1861 Episode T: 146 Reward: 199.184
Total T: 127264 Episode Num: 1862 Episode T: 106 Reward: 180.867
Total T: 127478 Episode Num: 1863 Episode T: 214 Reward: 455.816
Total T: 127571 Episode Num: 1864 Episode T: 93 Reward: 149.361
Total T: 127678 Episode Num: 1865 Episode T: 107 Reward: 150.341
Total T: 128035 Episode Num:

Total T: 143439 Episode Num: 1973 Episode T: 132 Reward: 253.351
Total T: 143529 Episode Num: 1974 Episode T: 90 Reward: 226.191
Total T: 143616 Episode Num: 1975 Episode T: 87 Reward: 135.794
Total T: 143744 Episode Num: 1976 Episode T: 128 Reward: 277.360
Total T: 143886 Episode Num: 1977 Episode T: 142 Reward: 303.952
Total T: 143913 Episode Num: 1978 Episode T: 27 Reward: 6.559
Total T: 144036 Episode Num: 1979 Episode T: 123 Reward: 226.783
Total T: 144121 Episode Num: 1980 Episode T: 85 Reward: 89.317
Total T: 144275 Episode Num: 1981 Episode T: 154 Reward: 344.391
Total T: 144514 Episode Num: 1982 Episode T: 239 Reward: 257.885
Total T: 144687 Episode Num: 1983 Episode T: 173 Reward: 388.523
Total T: 144874 Episode Num: 1984 Episode T: 187 Reward: 374.341
Total T: 145000 Episode Num: 1985 Episode T: 126 Reward: 273.063
---------------------------------------
Evaluation over 10 episodes: 380.983
---------------------------------------
Total T: 145125 Episode Num: 1986 Episode T: 

Total T: 168206 Episode Num: 2091 Episode T: 181 Reward: 393.460
Total T: 168381 Episode Num: 2092 Episode T: 175 Reward: 238.260
Total T: 168470 Episode Num: 2093 Episode T: 89 Reward: 148.607
Total T: 168754 Episode Num: 2094 Episode T: 284 Reward: 380.352
Total T: 168900 Episode Num: 2095 Episode T: 146 Reward: 251.166
Total T: 169099 Episode Num: 2096 Episode T: 199 Reward: 434.351
Total T: 169334 Episode Num: 2097 Episode T: 235 Reward: 412.254
Total T: 169515 Episode Num: 2098 Episode T: 181 Reward: 415.377
Total T: 169942 Episode Num: 2099 Episode T: 427 Reward: 798.801
---------------------------------------
Evaluation over 10 episodes: 393.320
---------------------------------------
Total T: 170165 Episode Num: 2100 Episode T: 223 Reward: 361.749
Total T: 170536 Episode Num: 2101 Episode T: 371 Reward: 440.724
Total T: 170794 Episode Num: 2102 Episode T: 258 Reward: 555.680
Total T: 171225 Episode Num: 2103 Episode T: 431 Reward: 218.448
Total T: 172225 Episode Num: 2104 Episo

Total T: 194091 Episode Num: 2209 Episode T: 234 Reward: 551.072
Total T: 194802 Episode Num: 2210 Episode T: 711 Reward: 514.044
---------------------------------------
Evaluation over 10 episodes: 401.686
---------------------------------------
Total T: 195054 Episode Num: 2211 Episode T: 252 Reward: 29.009
Total T: 195212 Episode Num: 2212 Episode T: 158 Reward: 288.407
Total T: 195352 Episode Num: 2213 Episode T: 140 Reward: 242.554
Total T: 195689 Episode Num: 2214 Episode T: 337 Reward: 167.472
Total T: 195874 Episode Num: 2215 Episode T: 185 Reward: 342.217
Total T: 196663 Episode Num: 2216 Episode T: 789 Reward: 902.929
Total T: 196750 Episode Num: 2217 Episode T: 87 Reward: 277.578
Total T: 196986 Episode Num: 2218 Episode T: 236 Reward: 490.999
Total T: 197180 Episode Num: 2219 Episode T: 194 Reward: 295.598
Total T: 197283 Episode Num: 2220 Episode T: 103 Reward: 180.847
Total T: 197528 Episode Num: 2221 Episode T: 245 Reward: 201.095
Total T: 197625 Episode Num: 2222 Episod

Total T: 219739 Episode Num: 2327 Episode T: 235 Reward: 417.661
Total T: 219811 Episode Num: 2328 Episode T: 72 Reward: 135.968
Total T: 219949 Episode Num: 2329 Episode T: 138 Reward: 281.298
---------------------------------------
Evaluation over 10 episodes: 337.965
---------------------------------------
Total T: 220140 Episode Num: 2330 Episode T: 191 Reward: 328.833
Total T: 220310 Episode Num: 2331 Episode T: 170 Reward: 384.142
Total T: 220427 Episode Num: 2332 Episode T: 117 Reward: 206.390
Total T: 220643 Episode Num: 2333 Episode T: 216 Reward: 432.763
Total T: 220723 Episode Num: 2334 Episode T: 80 Reward: 100.223
Total T: 220824 Episode Num: 2335 Episode T: 101 Reward: 118.912
Total T: 221011 Episode Num: 2336 Episode T: 187 Reward: 384.578
Total T: 221202 Episode Num: 2337 Episode T: 191 Reward: 290.456
Total T: 221491 Episode Num: 2338 Episode T: 289 Reward: 634.735
Total T: 221607 Episode Num: 2339 Episode T: 116 Reward: 165.832
Total T: 221786 Episode Num: 2340 Episod

Total T: 241239 Episode Num: 2445 Episode T: 227 Reward: 431.944
Total T: 241485 Episode Num: 2446 Episode T: 246 Reward: 459.234
Total T: 241731 Episode Num: 2447 Episode T: 246 Reward: 479.180
Total T: 241811 Episode Num: 2448 Episode T: 80 Reward: 137.864
Total T: 241915 Episode Num: 2449 Episode T: 104 Reward: 153.376
Total T: 241993 Episode Num: 2450 Episode T: 78 Reward: 147.834
Total T: 242215 Episode Num: 2451 Episode T: 222 Reward: 442.902
Total T: 242432 Episode Num: 2452 Episode T: 217 Reward: 390.548
Total T: 242667 Episode Num: 2453 Episode T: 235 Reward: 349.632
Total T: 242752 Episode Num: 2454 Episode T: 85 Reward: 151.836
Total T: 242897 Episode Num: 2455 Episode T: 145 Reward: 311.290
Total T: 243198 Episode Num: 2456 Episode T: 301 Reward: 599.688
Total T: 243397 Episode Num: 2457 Episode T: 199 Reward: 416.326
Total T: 243532 Episode Num: 2458 Episode T: 135 Reward: 215.130
Total T: 243702 Episode Num: 2459 Episode T: 170 Reward: 325.634
Total T: 243801 Episode Num:

Total T: 263237 Episode Num: 2565 Episode T: 83 Reward: 96.471
Total T: 263309 Episode Num: 2566 Episode T: 72 Reward: 93.029
Total T: 263519 Episode Num: 2567 Episode T: 210 Reward: 397.189
Total T: 263573 Episode Num: 2568 Episode T: 54 Reward: 61.031
Total T: 263808 Episode Num: 2569 Episode T: 235 Reward: 555.866
Total T: 263926 Episode Num: 2570 Episode T: 118 Reward: 268.644
Total T: 263960 Episode Num: 2571 Episode T: 34 Reward: 40.268
Total T: 263994 Episode Num: 2572 Episode T: 34 Reward: 39.379
Total T: 264028 Episode Num: 2573 Episode T: 34 Reward: 38.451
Total T: 264205 Episode Num: 2574 Episode T: 177 Reward: 385.984
Total T: 264241 Episode Num: 2575 Episode T: 36 Reward: 43.949
Total T: 264294 Episode Num: 2576 Episode T: 53 Reward: 77.737
Total T: 264373 Episode Num: 2577 Episode T: 79 Reward: 117.743
Total T: 264718 Episode Num: 2578 Episode T: 345 Reward: 643.174
Total T: 264939 Episode Num: 2579 Episode T: 221 Reward: 370.339
---------------------------------------
Ev

Total T: 287947 Episode Num: 2683 Episode T: 363 Reward: 962.984
Total T: 288378 Episode Num: 2684 Episode T: 431 Reward: 1121.549
Total T: 288733 Episode Num: 2685 Episode T: 355 Reward: 859.545
Total T: 289048 Episode Num: 2686 Episode T: 315 Reward: 889.558
Total T: 289134 Episode Num: 2687 Episode T: 86 Reward: 105.435
Total T: 289310 Episode Num: 2688 Episode T: 176 Reward: 321.829
Total T: 289391 Episode Num: 2689 Episode T: 81 Reward: 103.908
Total T: 289519 Episode Num: 2690 Episode T: 128 Reward: 133.298
Total T: 289794 Episode Num: 2691 Episode T: 275 Reward: 496.115
Total T: 289879 Episode Num: 2692 Episode T: 85 Reward: 148.151
---------------------------------------
Evaluation over 10 episodes: 584.139
---------------------------------------
Total T: 290145 Episode Num: 2693 Episode T: 266 Reward: 567.537
Total T: 290463 Episode Num: 2694 Episode T: 318 Reward: 692.576
Total T: 290717 Episode Num: 2695 Episode T: 254 Reward: 678.315
Total T: 290965 Episode Num: 2696 Episod

Total T: 322331 Episode Num: 2797 Episode T: 128 Reward: 236.479
Total T: 322562 Episode Num: 2798 Episode T: 231 Reward: 583.117
Total T: 322843 Episode Num: 2799 Episode T: 281 Reward: 521.526
Total T: 323243 Episode Num: 2800 Episode T: 400 Reward: 884.413
Total T: 323466 Episode Num: 2801 Episode T: 223 Reward: 515.284
Total T: 323543 Episode Num: 2802 Episode T: 77 Reward: 139.347
Total T: 323720 Episode Num: 2803 Episode T: 177 Reward: 407.621
Total T: 323903 Episode Num: 2804 Episode T: 183 Reward: 344.279
Total T: 324353 Episode Num: 2805 Episode T: 450 Reward: 977.884
Total T: 324519 Episode Num: 2806 Episode T: 166 Reward: 376.639
Total T: 324641 Episode Num: 2807 Episode T: 122 Reward: 221.057
---------------------------------------
Evaluation over 10 episodes: 430.331
---------------------------------------
Total T: 325494 Episode Num: 2808 Episode T: 853 Reward: 1950.377
Total T: 325650 Episode Num: 2809 Episode T: 156 Reward: 211.827
Total T: 325922 Episode Num: 2810 Epis

Total T: 369513 Episode Num: 2907 Episode T: 38 Reward: 30.742
---------------------------------------
Evaluation over 10 episodes: 495.879
---------------------------------------
Total T: 370099 Episode Num: 2908 Episode T: 586 Reward: 1269.408
Total T: 370566 Episode Num: 2909 Episode T: 467 Reward: 1263.999
Total T: 371172 Episode Num: 2910 Episode T: 606 Reward: 1646.018
Total T: 371837 Episode Num: 2911 Episode T: 665 Reward: 1889.039
Total T: 372837 Episode Num: 2912 Episode T: 1000 Reward: 2410.164
Total T: 373364 Episode Num: 2913 Episode T: 527 Reward: 1323.298
Total T: 373823 Episode Num: 2914 Episode T: 459 Reward: 1290.048
Total T: 374298 Episode Num: 2915 Episode T: 475 Reward: 1530.164
Total T: 374452 Episode Num: 2916 Episode T: 154 Reward: 335.231
---------------------------------------
Evaluation over 10 episodes: 732.591
---------------------------------------
Total T: 375098 Episode Num: 2917 Episode T: 646 Reward: 1474.263
Total T: 375275 Episode Num: 2918 Episode T

Total T: 411477 Episode Num: 3017 Episode T: 418 Reward: 1145.658
Total T: 411957 Episode Num: 3018 Episode T: 480 Reward: 1444.394
Total T: 412082 Episode Num: 3019 Episode T: 125 Reward: 247.076
Total T: 412231 Episode Num: 3020 Episode T: 149 Reward: 172.578
Total T: 412602 Episode Num: 3021 Episode T: 371 Reward: 932.277
Total T: 412937 Episode Num: 3022 Episode T: 335 Reward: 810.190
Total T: 413786 Episode Num: 3023 Episode T: 849 Reward: 2350.829
Total T: 414211 Episode Num: 3024 Episode T: 425 Reward: 1251.564
Total T: 414316 Episode Num: 3025 Episode T: 105 Reward: 228.548
Total T: 414371 Episode Num: 3026 Episode T: 55 Reward: 51.593
Total T: 414473 Episode Num: 3027 Episode T: 102 Reward: 244.837
Total T: 414820 Episode Num: 3028 Episode T: 347 Reward: 990.606
---------------------------------------
Evaluation over 10 episodes: 878.207
---------------------------------------
Total T: 415197 Episode Num: 3029 Episode T: 377 Reward: 965.923
Total T: 415290 Episode Num: 3030 Ep

Total T: 456724 Episode Num: 3126 Episode T: 520 Reward: 1324.200
Total T: 457031 Episode Num: 3127 Episode T: 307 Reward: 929.598
Total T: 457598 Episode Num: 3128 Episode T: 567 Reward: 1675.363
Total T: 458422 Episode Num: 3129 Episode T: 824 Reward: 2430.696
Total T: 458708 Episode Num: 3130 Episode T: 286 Reward: 966.081
Total T: 459196 Episode Num: 3131 Episode T: 488 Reward: 1456.347
Total T: 459817 Episode Num: 3132 Episode T: 621 Reward: 1839.792
---------------------------------------
Evaluation over 10 episodes: 1331.363
---------------------------------------
Total T: 460145 Episode Num: 3133 Episode T: 328 Reward: 1058.925
Total T: 460893 Episode Num: 3134 Episode T: 748 Reward: 2033.624
Total T: 461352 Episode Num: 3135 Episode T: 459 Reward: 1266.566
Total T: 461444 Episode Num: 3136 Episode T: 92 Reward: 37.785
Total T: 461631 Episode Num: 3137 Episode T: 187 Reward: 378.323
Total T: 461929 Episode Num: 3138 Episode T: 298 Reward: 850.393
Total T: 462454 Episode Num: 31

Total T: 512786 Episode Num: 3231 Episode T: 83 Reward: 206.742
Total T: 512944 Episode Num: 3232 Episode T: 158 Reward: 321.837
Total T: 513793 Episode Num: 3233 Episode T: 849 Reward: 2557.185
Total T: 514777 Episode Num: 3234 Episode T: 984 Reward: 3154.665
Total T: 514960 Episode Num: 3235 Episode T: 183 Reward: 292.184
---------------------------------------
Evaluation over 10 episodes: 2386.527
---------------------------------------
Total T: 515960 Episode Num: 3236 Episode T: 1000 Reward: 3358.985
Total T: 516753 Episode Num: 3237 Episode T: 793 Reward: 2789.253
Total T: 516846 Episode Num: 3238 Episode T: 93 Reward: 226.935
Total T: 516930 Episode Num: 3239 Episode T: 84 Reward: 188.840
Total T: 517351 Episode Num: 3240 Episode T: 421 Reward: 1377.414
Total T: 518351 Episode Num: 3241 Episode T: 1000 Reward: 3269.131
Total T: 519351 Episode Num: 3242 Episode T: 1000 Reward: 3347.150
Total T: 519688 Episode Num: 3243 Episode T: 337 Reward: 1017.519
-----------------------------

Total T: 564594 Episode Num: 3338 Episode T: 114 Reward: 91.869
---------------------------------------
Evaluation over 10 episodes: 1283.277
---------------------------------------
Total T: 565362 Episode Num: 3339 Episode T: 768 Reward: 2543.709
Total T: 566362 Episode Num: 3340 Episode T: 1000 Reward: 3643.319
Total T: 566441 Episode Num: 3341 Episode T: 79 Reward: 155.668
Total T: 567069 Episode Num: 3342 Episode T: 628 Reward: 2211.771
Total T: 567169 Episode Num: 3343 Episode T: 100 Reward: 210.506
Total T: 568162 Episode Num: 3344 Episode T: 993 Reward: 3203.896
Total T: 568518 Episode Num: 3345 Episode T: 356 Reward: 1121.505
Total T: 568707 Episode Num: 3346 Episode T: 189 Reward: 554.090
Total T: 569309 Episode Num: 3347 Episode T: 602 Reward: 2033.980
Total T: 569897 Episode Num: 3348 Episode T: 588 Reward: 1813.551
---------------------------------------
Evaluation over 10 episodes: 2482.809
---------------------------------------
Total T: 570139 Episode Num: 3349 Episode T

Total T: 627090 Episode Num: 3440 Episode T: 62 Reward: 169.806
Total T: 627358 Episode Num: 3441 Episode T: 268 Reward: 820.199
Total T: 628358 Episode Num: 3442 Episode T: 1000 Reward: 3713.561
Total T: 628825 Episode Num: 3443 Episode T: 467 Reward: 1622.042
Total T: 629374 Episode Num: 3444 Episode T: 549 Reward: 1965.364
Total T: 629525 Episode Num: 3445 Episode T: 151 Reward: 375.905
---------------------------------------
Evaluation over 10 episodes: 2669.089
---------------------------------------
Total T: 630358 Episode Num: 3446 Episode T: 833 Reward: 3076.045
Total T: 631358 Episode Num: 3447 Episode T: 1000 Reward: 3523.710
Total T: 631442 Episode Num: 3448 Episode T: 84 Reward: 206.801
Total T: 632442 Episode Num: 3449 Episode T: 1000 Reward: 3894.434
Total T: 633442 Episode Num: 3450 Episode T: 1000 Reward: 3840.398
Total T: 634442 Episode Num: 3451 Episode T: 1000 Reward: 3702.714
Total T: 634522 Episode Num: 3452 Episode T: 80 Reward: 196.427
---------------------------

Total T: 684172 Episode Num: 3545 Episode T: 134 Reward: 389.227
Total T: 684653 Episode Num: 3546 Episode T: 481 Reward: 1550.706
Total T: 684765 Episode Num: 3547 Episode T: 112 Reward: 217.792
Total T: 684880 Episode Num: 3548 Episode T: 115 Reward: 253.801
Total T: 684995 Episode Num: 3549 Episode T: 115 Reward: 271.824
---------------------------------------
Evaluation over 10 episodes: 1572.997
---------------------------------------
Total T: 685329 Episode Num: 3550 Episode T: 334 Reward: 1103.681
Total T: 685690 Episode Num: 3551 Episode T: 361 Reward: 1239.907
Total T: 686029 Episode Num: 3552 Episode T: 339 Reward: 1087.562
Total T: 686179 Episode Num: 3553 Episode T: 150 Reward: 149.851
Total T: 686361 Episode Num: 3554 Episode T: 182 Reward: 487.767
Total T: 686468 Episode Num: 3555 Episode T: 107 Reward: 203.293
Total T: 686565 Episode Num: 3556 Episode T: 97 Reward: 188.036
Total T: 687129 Episode Num: 3557 Episode T: 564 Reward: 2156.061
Total T: 687565 Episode Num: 3558

---------------------------------------
Evaluation over 10 episodes: 2407.371
---------------------------------------
Total T: 740836 Episode Num: 3650 Episode T: 863 Reward: 3455.338
Total T: 741047 Episode Num: 3651 Episode T: 211 Reward: 657.404
Total T: 741163 Episode Num: 3652 Episode T: 116 Reward: 232.751
Total T: 742163 Episode Num: 3653 Episode T: 1000 Reward: 3904.535
Total T: 742779 Episode Num: 3654 Episode T: 616 Reward: 2309.852
Total T: 743448 Episode Num: 3655 Episode T: 669 Reward: 2429.703
Total T: 743957 Episode Num: 3656 Episode T: 509 Reward: 1827.088
Total T: 744388 Episode Num: 3657 Episode T: 431 Reward: 1645.993
Total T: 744882 Episode Num: 3658 Episode T: 494 Reward: 1787.429
---------------------------------------
Evaluation over 10 episodes: 3619.119
---------------------------------------
Total T: 745634 Episode Num: 3659 Episode T: 752 Reward: 2921.369
Total T: 746189 Episode Num: 3660 Episode T: 555 Reward: 1790.269
Total T: 746770 Episode Num: 3661 Episo

Total T: 791643 Episode Num: 3755 Episode T: 199 Reward: 554.922
Total T: 791852 Episode Num: 3756 Episode T: 209 Reward: 663.995
Total T: 792274 Episode Num: 3757 Episode T: 422 Reward: 1483.609
Total T: 792448 Episode Num: 3758 Episode T: 174 Reward: 594.218
Total T: 793389 Episode Num: 3759 Episode T: 941 Reward: 3768.638
Total T: 793834 Episode Num: 3760 Episode T: 445 Reward: 1588.900
Total T: 794267 Episode Num: 3761 Episode T: 433 Reward: 1534.100
---------------------------------------
Evaluation over 10 episodes: 3098.346
---------------------------------------
Total T: 795089 Episode Num: 3762 Episode T: 822 Reward: 3108.675
Total T: 795306 Episode Num: 3763 Episode T: 217 Reward: 659.898
Total T: 796116 Episode Num: 3764 Episode T: 810 Reward: 2980.841
Total T: 796799 Episode Num: 3765 Episode T: 683 Reward: 2564.109
Total T: 797143 Episode Num: 3766 Episode T: 344 Reward: 1257.907
Total T: 797690 Episode Num: 3767 Episode T: 547 Reward: 2114.262
Total T: 797779 Episode Num:

Total T: 849952 Episode Num: 3860 Episode T: 109 Reward: 146.735
---------------------------------------
Evaluation over 10 episodes: 2547.280
---------------------------------------
Total T: 850952 Episode Num: 3861 Episode T: 1000 Reward: 3587.060
Total T: 851910 Episode Num: 3862 Episode T: 958 Reward: 3649.289
Total T: 852519 Episode Num: 3863 Episode T: 609 Reward: 2187.973
Total T: 852926 Episode Num: 3864 Episode T: 407 Reward: 1464.679
Total T: 853824 Episode Num: 3865 Episode T: 898 Reward: 3421.606
Total T: 853949 Episode Num: 3866 Episode T: 125 Reward: 246.019
Total T: 854680 Episode Num: 3867 Episode T: 731 Reward: 2861.730
Total T: 854861 Episode Num: 3868 Episode T: 181 Reward: 510.567
---------------------------------------
Evaluation over 10 episodes: 1648.322
---------------------------------------
Total T: 855861 Episode Num: 3869 Episode T: 1000 Reward: 3735.085
Total T: 856861 Episode Num: 3870 Episode T: 1000 Reward: 3854.299
Total T: 857693 Episode Num: 3871 Epis

---------------------------------------
Evaluation over 10 episodes: 583.653
---------------------------------------
Total T: 915147 Episode Num: 3961 Episode T: 905 Reward: 3195.430
Total T: 915607 Episode Num: 3962 Episode T: 460 Reward: 1710.066
Total T: 915767 Episode Num: 3963 Episode T: 160 Reward: 437.122
Total T: 916438 Episode Num: 3964 Episode T: 671 Reward: 2540.032
Total T: 917039 Episode Num: 3965 Episode T: 601 Reward: 2207.391
Total T: 917281 Episode Num: 3966 Episode T: 242 Reward: 851.251
Total T: 918237 Episode Num: 3967 Episode T: 956 Reward: 3857.827
Total T: 918338 Episode Num: 3968 Episode T: 101 Reward: 196.984
Total T: 919329 Episode Num: 3969 Episode T: 991 Reward: 3449.478
Total T: 919609 Episode Num: 3970 Episode T: 280 Reward: 1163.454
---------------------------------------
Evaluation over 10 episodes: 1882.094
---------------------------------------
Total T: 920609 Episode Num: 3971 Episode T: 1000 Reward: 3877.765
Total T: 921134 Episode Num: 3972 Episode

Total T: 970093 Episode Num: 4065 Episode T: 153 Reward: 304.315
Total T: 970490 Episode Num: 4066 Episode T: 397 Reward: 1458.838
Total T: 970590 Episode Num: 4067 Episode T: 100 Reward: 253.114
Total T: 971427 Episode Num: 4068 Episode T: 837 Reward: 2966.119
Total T: 971640 Episode Num: 4069 Episode T: 213 Reward: 783.596
Total T: 972640 Episode Num: 4070 Episode T: 1000 Reward: 4082.652
Total T: 973235 Episode Num: 4071 Episode T: 595 Reward: 2541.211
Total T: 973520 Episode Num: 4072 Episode T: 285 Reward: 997.129
Total T: 973986 Episode Num: 4073 Episode T: 466 Reward: 1592.400
Total T: 974645 Episode Num: 4074 Episode T: 659 Reward: 2642.586
---------------------------------------
Evaluation over 10 episodes: 1133.766
---------------------------------------
Total T: 975094 Episode Num: 4075 Episode T: 449 Reward: 1588.009
Total T: 975819 Episode Num: 4076 Episode T: 725 Reward: 2881.579
Total T: 976819 Episode Num: 4077 Episode T: 1000 Reward: 3904.513
Total T: 977672 Episode Nu