## Twin Delayed Deep Deterministic Policy Gradient (TD3) Algorithm

&emsp;&emsp;<font size=4>
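    Building on DDPG, TD3 aims to reduce the bias and variance that function approximation introduces into the actor-critic (AC) framework. On the one hand, noisy value estimates lead to overestimation; to counter this, TD3 applies clipped Double Q-learning to the AC framework. On the other hand, high variance leads to error accumulation; to counter this, TD3 uses two techniques: delayed policy updates, and target policy smoothing by adding noise to the target action.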
</font><br>

### Addressing the overestimation problem

&emsp;&emsp;<font size=4>
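    Policy-gradient-based reinforcement learning is known to suffer from overestimation, even though the DDPG critic's target is not built from the optimal action-value function and therefore contains no explicit maximization. A first attempt is to apply the Double DQN idea directly to the DDPG critic, which gives the following target: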
</font><br>
\begin{equation}
y=r+\gamma Q\left(s^{\prime}, \mu\left(s^{\prime}, \boldsymbol{\theta}\right), \boldsymbol{w}^{\prime}\right)  \label{bbb.11}\tag{bbb.11}
\end{equation}
&emsp;&emsp;<font size=4>
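    In practice this works poorly. In continuous action spaces the policy changes slowly and the actor is updated gradually, so the predicted $Q$ value and the target $Q$ value differ very little, and overestimation is not avoided.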
</font><br>
&emsp;&emsp;<font size=4>
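    Next, consider applying the Double Q-learning idea to DDPG with two independent critics $Q_{w_{1}}$, $Q_{w_{2}}$ and two independent actors $\mu_{\theta_{1}}$, $\mu_{\theta_{2}}$: with probability 50%, $Q_{1}$ is used to select the action and the estimate of $Q_{2}$ is updated, and with the remaining 50% the roles are swapped. The two targets needed for the updates are: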
</font><br> 
\begin{equation}
\left\{\begin{array}{l}
y_{1}=r+\gamma Q\left(s^{\prime}, \mu\left(s^{\prime}, \boldsymbol{\theta}_{1}\right), \boldsymbol{w}_{2}^{\prime}\right) \\
y_{2}=r+\gamma Q\left(s^{\prime}, \mu\left(s^{\prime}, \boldsymbol{\theta}_{2}\right), \boldsymbol{w}_{1}^{\prime}\right)
\end{array}\right.  \label{bbb.12}\tag{bbb.12}
\end{equation}
&emsp;&emsp;<font size=4>
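    However, all samples come from the same replay buffer, so the data are not fully independent and the two estimates remain correlated; in some cases this can even aggravate overestimation. Following the principle "better to underestimate than to overestimate", Double Q-learning is modified to give the Clipped Double Q-learning target: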
</font><br> 
\begin{equation}
y=r+\gamma \min _{i=1,2} Q\left(s^{\prime}, \mu\left(s^{\prime}, \theta_{1}\right), w_{i}^{\prime}\right) \label{bbb.13}\tag{bbb.13}
\end{equation}
&emsp;&emsp;<font size=4>
    As equation (bbb.13) shows, the target uses only one actor network $\mu_{\theta_{1}}$ and takes the minimum over the two critic networks $Q_{w_{1}}$ and $Q_{w_{2}}$ as the value estimate.
</font><br>
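The clipped target in equation (bbb.13) can be sketched for a single transition as follows. This is a minimal illustration in plain Python, not the implementation used later in this document. The function name `clipped_double_q_target` and the scalar inputs are assumptions made for the example, and the `not_done` factor is borrowed from the way the implementation handles terminal transitions rather than from the equation itself.

```python
def clipped_double_q_target(reward, gamma, q1_next, q2_next, not_done=1.0):
    """Return y = r + gamma * min(Q1', Q2') for one transition.

    q1_next and q2_next stand for the two target critics evaluated at
    (s', mu(s')). not_done is 0.0 for a terminal transition; this term is
    an assumption added for the sketch and does not appear in (bbb.13).
    """
    return reward + not_done * gamma * min(q1_next, q2_next)


# The two target critics disagree; the smaller value is bootstrapped from.
y = clipped_double_q_target(reward=1.0, gamma=0.99, q1_next=10.0, q2_next=8.0)
y_terminal = clipped_double_q_target(1.0, 0.99, 10.0, 8.0, not_done=0.0)
print(y, y_terminal)
```

Because the minimum is taken before bootstrapping, an optimistic error in one critic cannot raise the target on its own; the price is a possible underestimate when the lower critic is the one in error.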
&emsp;&emsp;<font size=4>
    Both critic networks $Q_{w_{1}}$ and $Q_{w_{2}}$ are updated toward the same target $y$ from equation (bbb.13), using the same form of loss:
</font><br>
\begin{equation}
L\left(\boldsymbol{w}_{i}\right)=\mathbb{E}_{s, a, r, s^{\prime} \sim \mathcal{D}}\left[\left(y-Q\left(s, a, \boldsymbol{w}_{i}\right)\right)^{2}\right]    \label{bbb.14}\tag{bbb.14}
\end{equation}
&emsp;&emsp;<font size=4>
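    Compared with the original algorithm (DDPG), the only difference is an auxiliary critic $Q_{w_{2}}$ that is updated in step with the original critic $Q_{w_{1}}$, with the minimum of the two taken when computing the target $y$. This may still seem puzzling: $Q_{w_{1}}$ and $Q_{w_{2}}$ differ only in their initial parameters and receive identical updates afterwards, so it is fair to ask whether two such similar critics can effectively remove the biased estimates caused by TD errors.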
</font><br>

### Addressing the error accumulation problem

&emsp;&emsp;<font size=4>
    With function approximation, the overestimation of TD(0) is aggravated further, because each update leaves a certain amount of TD error $\delta(s, a)$:
</font><br>
\begin{equation}
Q(s, a, \boldsymbol{w})=r+\gamma \mathbb{E}\left[Q\left(s^{\prime}, a^{\prime}, \boldsymbol{w}\right)\right]-\delta(s, a) \label{bbb.15}\tag{bbb.15}
\end{equation}
&emsp;&emsp;<font size=4>
    After many updates, these errors accumulate:
</font><br>
\begin{equation}
\begin{aligned}
Q\left(S_{t}, A_{t}, \boldsymbol{w}\right) &=R_{t+1}+\gamma \mathbb{E}\left[Q\left(S_{t+1}, A_{t+1}, \boldsymbol{w}\right)\right]-\delta_{t+1} \\
&=R_{t+1}+\gamma \mathbb{E}\left[R_{t+2}+\gamma \mathbb{E}\left[Q\left(S_{t+2}, A_{t+2}, \boldsymbol{w}\right)\right]-\delta_{t+2}\right]-\delta_{t+1} \\
& \cdots \cdots \\
&=\mathbb{E}_{S_{i} \sim \rho^{\beta}, A_{i} \sim \mu}\left[\sum_{i=t}^{\mathrm{T}-1} \gamma^{i-t}\left(R_{i+1}-\delta_{i+1}\right)\right]
\end{aligned}  \label{bbb.16}\tag{bbb.16}
\end{equation}
&emsp;&emsp;<font size=4>
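    The variance of the estimate is therefore proportional to the variance of the future rewards and future TD errors. When the discount factor $\gamma$ is large, the variance can grow rapidly with each update, so the error of each individual update needs to be kept small (the implementation later in this document keeps $\gamma=0.99$).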
</font><br>
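To make the effect of $\gamma$ concrete, the sketch below evaluates the variance of the discounted error sum in equation (bbb.16) under a simplifying assumption: the per-step TD errors are independent and share one variance $\sigma^{2}$, so the accumulated variance is $\sum_{k} \gamma^{2k} \sigma^{2}$. The function name, the horizon of 1000 steps, and the three values of $\gamma$ are illustrative choices, not part of the original text.

```python
def accumulated_error_variance(gamma, step_variance, horizon):
    """Variance of sum_{k=0}^{horizon-1} gamma^k * delta_k.

    Assumes the delta_k are independent with variance step_variance,
    which is a simplification made only for this illustration.
    """
    return sum((gamma ** (2 * k)) * step_variance for k in range(horizon))


variances = {
    gamma: accumulated_error_variance(gamma, step_variance=1.0, horizon=1000)
    for gamma in (0.5, 0.9, 0.99)
}
for gamma, variance in variances.items():
    print(f"gamma={gamma}: accumulated variance = {variance:.3f}")
```

For a long horizon the sum approaches $\sigma^{2} /\left(1-\gamma^{2}\right)$, which is why the same per-step error is amplified far more at $\gamma=0.99$ than at $\gamma=0.5$.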

### Delayed policy updates

&emsp;&emsp;<font size=4>
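    TD3 updates its target networks in the same way as DDPG, with soft updates. Although soft updates are more stable than hard updates, AC algorithms can still fail, usually because the actor and critic updates interact: when the critic's value estimates are inaccurate, the actor improves the policy in the wrong direction; when the actor produces a poor policy, the critic's error accumulation gets worse. The two feed each other in a vicious cycle.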
</font><br>
&emsp;&emsp;<font size=4>
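    To break this cycle, TD3 delays the policy update: the actor is updated less often, so that the critic has as much time as possible to converge before each policy update. Delayed updates reduce accumulated error and therefore variance; they also avoid unnecessary repeated updates, which improves efficiency to some extent. In practice, TD3 updates the actor once for every $d$ critic updates.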
</font><br>
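The update schedule described above can be sketched as a simple counter loop. This is only an illustration of the timing; the function name `count_updates` is hypothetical and the actual gradient steps are replaced by counters.

```python
def count_updates(total_iterations, policy_delay):
    """Count critic and actor updates under a delayed-update schedule."""
    critic_updates = 0
    actor_updates = 0
    for it in range(1, total_iterations + 1):
        critic_updates += 1            # the critics are updated every iteration
        if it % policy_delay == 0:     # the actor only every policy_delay-th one
            actor_updates += 1
    return critic_updates, actor_updates


print(count_updates(total_iterations=10, policy_delay=2))  # (10, 5)
```

With `policy_delay=2`, the value used for `policy_freq` in the implementation later in this document, the actor receives half as many updates as the critics.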

### Target policy smoothing

&emsp;&emsp;<font size=4>
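    The previous subsection reduced error accumulation by delaying policy updates; this one turns to the error itself. The root of the error is the bias produced by value function approximation. In machine learning, a common way to reduce estimation bias is to regularize the parameter updates, and the same idea can be applied to reinforcement learning.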
</font><br>
&emsp;&emsp;<font size=4>
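    A natural idea is that similar actions should have similar values: if the value is sufficiently smooth over a small region of the action space around the target action, less error is produced. Concretely, TD3 adds clipped noise to the target action: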
</font><br>
\begin{equation}
\begin{array}{l}
\tilde{a} \leftarrow \mu\left(s^{\prime}, \boldsymbol{\theta}^{\prime}\right)+\varepsilon \\
\varepsilon \sim \operatorname{clip}(\mathcal{N}(0, \sigma),-c, c)
\end{array}   \label{bbb.17}\tag{bbb.17}
\end{equation}
&emsp;&emsp;<font size=4>
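    This noise is also a form of regularization. The smoothing improves generalization, mitigates overfitting, and reduces the interference that poor states with overestimated values cause in policy learning.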
</font><br>
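Equation (bbb.17) can be sketched in plain Python as below. The function name `smoothed_target_action` and the list-based action are assumptions for the example. The final clip to `[-max_action, max_action]` is not part of equation (bbb.17); it mirrors the extra clamp applied in the implementation later in this document.

```python
import random


def smoothed_target_action(target_action, sigma, noise_clip, max_action, rng):
    """Add clipped Gaussian noise to each component of a target action."""
    smoothed = []
    for a in target_action:
        eps = rng.gauss(0.0, sigma)                    # eps ~ N(0, sigma)
        eps = max(-noise_clip, min(noise_clip, eps))   # clip(eps, -c, c)
        # Keep the perturbed action inside the valid range (assumption, see lead-in).
        smoothed.append(max(-max_action, min(max_action, a + eps)))
    return smoothed


rng = random.Random(0)  # fixed seed so the sketch is repeatable
action = [0.9, -0.2, 0.0]
a_tilde = smoothed_target_action(action, sigma=0.2, noise_clip=0.5,
                                 max_action=1.0, rng=rng)
print(a_tilde)
```

Because the noise is clipped to $[-c, c]$, the perturbed action never moves further than $c$ from the target action, so the critic is evaluated only in a small neighbourhood of it.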

## The TD3 algorithm

&emsp;&emsp;<font size=4>
    The TD3 algorithm is shown in Algorithm bbb.2:
</font><br>

<hr style="height:1px;border:none;border-top:1px solid #555555;" />
&emsp;&emsp;<font size=3.5><b>Algorithm bbb.2</b> TD3 algorithm (Fujimoto et al. 2018)</font><br>
<hr>
&emsp;&emsp;<font size=3.5>Initialization:</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    1.&emsp;Initialize the online value networks $Q_{w_{1}}$ and $Q_{w_{2}}$ with parameters $\boldsymbol{w}_{1}$ and $\boldsymbol{w}_{2}$
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    2.&emsp;Initialize the target value networks $Q_{\boldsymbol{w}_{1}^{\prime}}$ and $Q_{\boldsymbol{w}_{2}^{\prime}}$ with parameters $\boldsymbol{w}_{1}^{\prime}$ and $\boldsymbol{w}_{2}^{\prime}$
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    3.&emsp;Initialize the online policy network $\mu_{\theta}$ and the target policy network $\mu_{\theta^{\prime}}$ with parameters $\theta$ and $\theta^{\prime}$
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    4.&emsp;Synchronize the parameters: $w_{1}^{\prime} \leftarrow w_{1}$, $w_{2}^{\prime} \leftarrow w_{2}$, $\theta^{\prime} \leftarrow \theta$
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    5.&emsp;Replay buffer $\mathcal{D}$ with capacity $N$
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    6.&emsp;Total number of episodes $M$, discount factor $\gamma$, soft-update rate $\tau=0.0001$, minibatch size $n$
</font><br>
<hr>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    7.&emsp;<b>for</b> $e$=1 <b>to</b> $M$ <b>do:</b>
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    8.&emsp;&emsp;&emsp;Set the initial state to $S_{0}$
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    9.&emsp;&emsp;&emsp;<b>repeat</b> (for each time step $t=0,1,2, \ldots$ of the episode):
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    10.&emsp;&emsp;&emsp;&emsp;&emsp;Select an action from the current online policy network plus exploration noise: $A_{t}=\mu\left(S_{t}, \boldsymbol{\theta}\right)+\varepsilon_{t}$,
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    &emsp;&emsp;&emsp;&emsp;&emsp;where $\varepsilon_{t} \sim \mathcal{N}_{t}(0, \sigma)$
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    11.&emsp;&emsp;&emsp;&emsp;&emsp;Execute action $A_{t}$ and observe the reward $R_{t+1}$ and the next state $S_{t+1}$
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    12.&emsp;&emsp;&emsp;&emsp;&emsp;Store the transition $\left(S_{t}, A_{t}, R_{t+1}, S_{t+1}\right)$ in the replay buffer $\mathcal{D}$
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    13.&emsp;&emsp;&emsp;&emsp;&emsp;Sample a random minibatch of $n$ transitions $\left(S_{i}, A_{i}, R_{i+1}, S_{i+1}\right)$ from the replay buffer $\mathcal{D}$ and compute:
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    &emsp;&emsp;&emsp;&emsp;&emsp;(1) the perturbed action $\tilde{a}_{i+1} \leftarrow \mu\left(S_{i+1}, \boldsymbol{\theta}^{\prime}\right)+\varepsilon_{i}$, where $\varepsilon_{i} \sim \operatorname{clip}\left(\mathcal{N}_{t}(0, \tilde{\sigma}),-c, c\right)$
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    &emsp;&emsp;&emsp;&emsp;&emsp;(2) the update target $y_{i}=R_{i+1}+\gamma \min _{j=1,2} Q\left(S_{i+1}, \tilde{a}_{i+1}, \boldsymbol{w}_{j}^{\prime}\right)$
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    14.&emsp;&emsp;&emsp;&emsp;&emsp;Using minibatch gradient descent (MBGD), update the value (critic) network parameters $\boldsymbol{w}$ by minimizing the loss:
</font><br>
\begin{equation}
\nabla_{w_{j}} L\left(\boldsymbol{w}_{j}\right) \approx -\frac{2}{n} \sum_{i=1}^{n}\left(y_{i}-Q\left(S_{i}, A_{i}, \boldsymbol{w}_{j}\right)\right) \nabla_{w_{j}} Q\left(S_{i}, A_{i}, \boldsymbol{w}_{j}\right), \quad j=1,2
\end{equation}
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    15.&emsp;&emsp;&emsp;&emsp;&emsp;<b>if</b> $t$ mod $d = 0$ <b>then</b>
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    16.&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;Using minibatch gradient ascent (MBGA), update the policy (actor) network parameters $\theta$ by maximizing the objective:
</font><br>
\begin{equation}
\left.\nabla_{\theta} \hat{J}_{\beta}(\boldsymbol{\theta}) \approx \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta} \mu\left(S_{i}, \boldsymbol{\theta}\right) \nabla_{a} Q\left(S_{i}, a, \boldsymbol{w}_{1}\right)\right|_{a=\mu\left(S_{i}, \boldsymbol{\theta}\right)}
\end{equation}
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    17.&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;Soft-update the target networks: $\left\{\begin{array}{l}\boldsymbol{w}_{j}^{\prime} \leftarrow \tau \boldsymbol{w}_{j}+(1-\tau) \boldsymbol{w}_{j}^{\prime},\ j=1,2 \\ \boldsymbol{\theta}^{\prime} \leftarrow \tau \boldsymbol{\theta}+(1-\tau) \boldsymbol{\theta}^{\prime}\end{array}\right.$
</font><br>
&emsp;&emsp;&emsp;&emsp;<font size=3.5>
    18.&emsp;&emsp;&emsp;<b>until</b> $t=\mathrm{T}-1$
</font><br>
<hr style="height:1px;border:none;border-top:1px solid #555555;" /><br>
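The soft update in step 17 of the algorithm can be sketched on plain Python lists as follows. The function name `soft_update` and the toy parameter values are assumptions for the example; $\tau=0.005$ is the `tau` default used by the implementation later in this document, not the value listed in the pseudocode.

```python
def soft_update(target_params, online_params, tau):
    """Return tau * w + (1 - tau) * w' for each parameter pair."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]


online = [1.0, -2.0]   # held fixed here to isolate the tracking behaviour
target = [0.0, 0.0]
for _ in range(100):
    target = soft_update(target, online, tau=0.005)
print(target)
```

With the online parameters held fixed, the remaining gap shrinks by a factor of $(1-\tau)$ per update, so after $k$ updates the target has covered a fraction $1-(1-\tau)^{k}$ of the distance. A small $\tau$ therefore makes the target networks change slowly, which is what keeps the bootstrapped targets stable.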

In [1]:
import numpy as np
import torch
import gym
import os
import copy
import torch.nn as nn
import torch.nn.functional as F

## USE CUDA

In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Replay Buffer

In [3]:
class ReplayBuffer(object):
    def __init__(self, state_dim, action_dim, max_size=int(1e6)):
        self.max_size = max_size
        self.ptr = 0
        self.size = 0

        self.state = np.zeros((max_size, state_dim))
        self.action = np.zeros((max_size, action_dim))
        self.next_state = np.zeros((max_size, state_dim))
        self.reward = np.zeros((max_size, 1))
        self.not_done = np.zeros((max_size, 1))

        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    def add(self, state, action, next_state, reward, done):
        self.state[self.ptr] = state
        self.action[self.ptr] = action
        self.next_state[self.ptr] = next_state
        self.reward[self.ptr] = reward
        self.not_done[self.ptr] = 1. - done

        self.ptr = (self.ptr + 1) % self.max_size
        self.size = min(self.size + 1, self.max_size)

    def sample(self, batch_size):
        ind = np.random.randint(0, self.size, size=batch_size)

        return (
            torch.FloatTensor(self.state[ind]).to(self.device),
            torch.FloatTensor(self.action[ind]).to(self.device),
            torch.FloatTensor(self.next_state[ind]).to(self.device),
            torch.FloatTensor(self.reward[ind]).to(self.device),
            torch.FloatTensor(self.not_done[ind]).to(self.device)
        )
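
The `ptr` and `size` bookkeeping in `ReplayBuffer.add` implements a circular buffer: once `max_size` transitions are stored, new ones overwrite the oldest. The stripped-down sketch below isolates that behaviour; the class `TinyRingBuffer` is hypothetical and stores plain Python objects instead of NumPy arrays.

```python
class TinyRingBuffer:
    """Minimal circular buffer mirroring the ptr/size logic of ReplayBuffer."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.ptr = 0                    # next slot to write
        self.size = 0                   # number of valid entries
        self.data = [None] * max_size

    def add(self, item):
        self.data[self.ptr] = item
        self.ptr = (self.ptr + 1) % self.max_size       # wrap around at the end
        self.size = min(self.size + 1, self.max_size)   # saturates at capacity


buf = TinyRingBuffer(max_size=3)
for step in range(5):
    buf.add(step)
print(buf.data, buf.ptr, buf.size)  # [3, 4, 2] 2 3
```

Sampling indices uniformly from `range(size)`, as `ReplayBuffer.sample` does, then only ever touches slots that have been written.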

In [4]:
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super(Actor, self).__init__()

        self.l1 = nn.Linear(state_dim, 256)
        self.l2 = nn.Linear(256, 256)
        self.l3 = nn.Linear(256, action_dim)

        self.max_action = max_action


    def forward(self, state):
        a = F.relu(self.l1(state))
        a = F.relu(self.l2(a))
        return self.max_action * torch.tanh(self.l3(a))


class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()

        # Q1 architecture
        self.l1 = nn.Linear(state_dim + action_dim, 256)
        self.l2 = nn.Linear(256, 256)
        self.l3 = nn.Linear(256, 1)

        # Q2 architecture
        self.l4 = nn.Linear(state_dim + action_dim, 256)
        self.l5 = nn.Linear(256, 256)
        self.l6 = nn.Linear(256, 1)


    def forward(self, state, action):
        sa = torch.cat([state, action], 1)

        q1 = F.relu(self.l1(sa))
        q1 = F.relu(self.l2(q1))
        q1 = self.l3(q1)

        q2 = F.relu(self.l4(sa))
        q2 = F.relu(self.l5(q2))
        q2 = self.l6(q2)
        return q1, q2


    def Q1(self, state, action):
        sa = torch.cat([state, action], 1)

        q1 = F.relu(self.l1(sa))
        q1 = F.relu(self.l2(q1))
        q1 = self.l3(q1)
        return q1

In [9]:
actor1=Actor(17,6,1.0)
for ch in actor1.children():
    print(ch)
print("*********************")
critic1=Critic(17,6)
for ch in critic1.children():
    print(ch)

Linear(in_features=17, out_features=256, bias=True)
Linear(in_features=256, out_features=256, bias=True)
Linear(in_features=256, out_features=6, bias=True)
*********************
Linear(in_features=23, out_features=256, bias=True)
Linear(in_features=256, out_features=256, bias=True)
Linear(in_features=256, out_features=1, bias=True)
Linear(in_features=23, out_features=256, bias=True)
Linear(in_features=256, out_features=256, bias=True)
Linear(in_features=256, out_features=1, bias=True)


In [5]:
class TD3(object):
    def __init__(
        self,
        state_dim,
        action_dim,
        max_action,
        discount=0.99,
        tau=0.005,
        policy_noise=0.2,
        noise_clip=0.5,
        policy_freq=2
    ):

        self.actor = Actor(state_dim, action_dim, max_action).to(device)
        self.actor_target = copy.deepcopy(self.actor)
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), lr=3e-4)

        self.critic = Critic(state_dim, action_dim).to(device)
        self.critic_target = copy.deepcopy(self.critic)
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), lr=3e-4)

        self.max_action = max_action
        self.discount = discount
        self.tau = tau
        self.policy_noise = policy_noise
        self.noise_clip = noise_clip
        self.policy_freq = policy_freq

        self.total_it = 0


    def select_action(self, state):
        state = torch.FloatTensor(state.reshape(1, -1)).to(device)
        return self.actor(state).cpu().data.numpy().flatten()


    def train(self, replay_buffer, batch_size=100):
        self.total_it += 1

        # Sample replay buffer 
        state, action, next_state, reward, not_done = replay_buffer.sample(batch_size)

        with torch.no_grad():
            # Select action according to policy and add clipped noise
            noise = (
                torch.randn_like(action) * self.policy_noise
            ).clamp(-self.noise_clip, self.noise_clip)

            next_action = (
                self.actor_target(next_state) + noise
            ).clamp(-self.max_action, self.max_action)

            # Compute the target Q value
            target_Q1, target_Q2 = self.critic_target(next_state, next_action)
            target_Q = torch.min(target_Q1, target_Q2)
            target_Q = reward + not_done * self.discount * target_Q

        # Get current Q estimates
        current_Q1, current_Q2 = self.critic(state, action)

        # Compute critic loss
        critic_loss = F.mse_loss(current_Q1, target_Q) + F.mse_loss(current_Q2, target_Q)

        # Optimize the critic
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        # Delayed policy updates
        if self.total_it % self.policy_freq == 0:

            # Compute actor loss
            actor_loss = -self.critic.Q1(state, self.actor(state)).mean()

            # Optimize the actor 
            self.actor_optimizer.zero_grad()
            actor_loss.backward()
            self.actor_optimizer.step()

            # Update the frozen target models
            for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
                target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)

            for param, target_param in zip(self.actor.parameters(), self.actor_target.parameters()):
                target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)


    def save(self, filename):
        torch.save(self.critic.state_dict(), filename + "_critic")
        torch.save(self.critic_optimizer.state_dict(), filename + "_critic_optimizer")

        torch.save(self.actor.state_dict(), filename + "_actor")
        torch.save(self.actor_optimizer.state_dict(), filename + "_actor_optimizer")


    def load(self, filename):
        self.critic.load_state_dict(torch.load(filename + "_critic"))
        self.critic_optimizer.load_state_dict(torch.load(filename + "_critic_optimizer"))
        self.critic_target = copy.deepcopy(self.critic)

        self.actor.load_state_dict(torch.load(filename + "_actor"))
        self.actor_optimizer.load_state_dict(torch.load(filename + "_actor_optimizer"))
        self.actor_target = copy.deepcopy(self.actor)

In [6]:
# Runs policy for X episodes and returns average reward
# A fixed seed is used for the eval environment
def eval_policy(policy, env_name, seed, eval_episodes=10):
    eval_env = gym.make(env_name)
    eval_env.seed(seed + 100)

    avg_reward = 0.
    for _ in range(eval_episodes):
        state, done = eval_env.reset(), False
        while not done:
            action = policy.select_action(np.array(state))
            state, reward, done, _ = eval_env.step(action)
            avg_reward += reward

    avg_reward /= eval_episodes

    print("---------------------------------------")
    print(f"Evaluation over {eval_episodes} episodes: {avg_reward:.3f}")
    print("---------------------------------------")
    return avg_reward

In [7]:
policy="TD3"
env_name="Walker2d-v2"          # OpenAI gym environment name
seed=0                        # Sets Gym, PyTorch and Numpy seeds
start_timesteps=25e3         # Time steps initial random policy is used
eval_freq=5e3               # How often (time steps) we evaluate
max_timesteps=1e6   # Max time steps to run environment
expl_noise=0.1                 # Std of Gaussian exploration noise
batch_size=256      # Batch size for both actor and critic
discount=0.99                 # Discount factor
tau=0.005                     # Target network update rate
policy_noise=0.2              # Noise added to target policy during critic update
noise_clip=0.5                # Range to clip target policy noise
policy_freq=2                 # Frequency of delayed policy updates
save_model=True               # Save model and optimizer parameters
load_model=""                # Model load file name, "" doesn't load, "default" uses file_name

file_name = f"{policy}_{env_name}_{seed}"
print("---------------------------------------")
print(f"Policy: {policy}, Env: {env_name}, Seed: {seed}")
print("---------------------------------------")

if not os.path.exists("./results"):
    os.makedirs("./results")

if save_model and not os.path.exists("./models"):
    os.makedirs("./models")

env = gym.make(env_name)

# Set seeds
env.seed(seed)
torch.manual_seed(seed)
np.random.seed(seed)

state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0] 
max_action = float(env.action_space.high[0])

kwargs = {
    "state_dim": state_dim,
    "action_dim": action_dim,
    "max_action": max_action,
    "discount": discount,
    "tau": tau,
    "policy_noise": policy_noise * max_action,
    "noise_clip": noise_clip * max_action,
    "policy_freq": policy_freq
}


policy = TD3(**kwargs)

if load_model != "":
    policy_file = file_name if load_model == "default" else load_model
    policy.load(f"./models/{policy_file}")

replay_buffer = ReplayBuffer(state_dim, action_dim)

# Evaluate untrained policy
evaluations = [eval_policy(policy, env_name, seed)]

state, done = env.reset(), False
episode_reward = 0
episode_timesteps = 0
episode_num = 0

for t in range(int(max_timesteps)):

    episode_timesteps += 1

    # Select action randomly or according to policy
    if t < start_timesteps:
        action = env.action_space.sample()
    else:
        action = (
            policy.select_action(np.array(state))
            + np.random.normal(0, max_action * expl_noise, size=action_dim)
        ).clip(-max_action, max_action)

    # Perform action
    next_state, reward, done, _ = env.step(action) 
    done_bool = float(done) if episode_timesteps < env._max_episode_steps else 0

    # Store data in replay buffer
    replay_buffer.add(state, action, next_state, reward, done_bool)

    state = next_state
    episode_reward += reward

    # Train agent after collecting sufficient data
    if t >= start_timesteps:
        policy.train(replay_buffer, batch_size)

    if done: 
        # +1 to account for 0 indexing. +0 on ep_timesteps since it will increment +1 even if done=True
        print(f"Total T: {t+1} Episode Num: {episode_num+1} Episode T: {episode_timesteps} Reward: {episode_reward:.3f}")
        # Reset environment
        state, done = env.reset(), False
        episode_reward = 0
        episode_timesteps = 0
        episode_num += 1 

    # Evaluate the policy periodically and checkpoint the model at the same time
    if (t + 1) % eval_freq == 0:
        evaluations.append(eval_policy(policy, env_name, seed))
        np.save(f"./results/{file_name}", evaluations)
        if save_model:
            policy.save(f"./models/{file_name}")

---------------------------------------
Policy: TD3, Env: Walker2d-v2, Seed: 0
---------------------------------------
---------------------------------------
Evaluation over 10 episodes: 7.883
---------------------------------------
Total T: 18 Episode Num: 1 Episode T: 18 Reward: 0.863
Total T: 33 Episode Num: 2 Episode T: 15 Reward: 0.775
Total T: 48 Episode Num: 3 Episode T: 15 Reward: -8.107
Total T: 68 Episode Num: 4 Episode T: 20 Reward: 2.861
Total T: 81 Episode Num: 5 Episode T: 13 Reward: -1.973
Total T: 98 Episode Num: 6 Episode T: 17 Reward: 1.578
Total T: 109 Episode Num: 7 Episode T: 11 Reward: -2.600
Total T: 122 Episode Num: 8 Episode T: 13 Reward: 2.055
Total T: 155 Episode Num: 9 Episode T: 33 Reward: 11.694
Total T: 184 Episode Num: 10 Episode T: 29 Reward: 7.228
Total T: 196 Episode Num: 11 Episode T: 12 Reward: -3.649
Total T: 216 Episode Num: 12 Episode T: 20 Reward: 2.729
Total T: 238 Episode Num: 13 Episode T: 22 Reward: -1.251
Total T: 269 Episode Num: 14 Episo

Total T: 2864 Episode Num: 140 Episode T: 14 Reward: -3.958
Total T: 2878 Episode Num: 141 Episode T: 14 Reward: 1.810
Total T: 2893 Episode Num: 142 Episode T: 15 Reward: -0.519
Total T: 2917 Episode Num: 143 Episode T: 24 Reward: 4.097
Total T: 2946 Episode Num: 144 Episode T: 29 Reward: 14.824
Total T: 2960 Episode Num: 145 Episode T: 14 Reward: 4.699
Total T: 2989 Episode Num: 146 Episode T: 29 Reward: 2.584
Total T: 3027 Episode Num: 147 Episode T: 38 Reward: 1.286
Total T: 3045 Episode Num: 148 Episode T: 18 Reward: 1.103
Total T: 3101 Episode Num: 149 Episode T: 56 Reward: 32.155
Total T: 3109 Episode Num: 150 Episode T: 8 Reward: -2.982
Total T: 3129 Episode Num: 151 Episode T: 20 Reward: -1.912
Total T: 3157 Episode Num: 152 Episode T: 28 Reward: 8.522
Total T: 3172 Episode Num: 153 Episode T: 15 Reward: -3.467
Total T: 3194 Episode Num: 154 Episode T: 22 Reward: -0.799
Total T: 3223 Episode Num: 155 Episode T: 29 Reward: 9.442
Total T: 3242 Episode Num: 156 Episode T: 19 Rewa

Total T: 5627 Episode Num: 278 Episode T: 25 Reward: 10.907
Total T: 5640 Episode Num: 279 Episode T: 13 Reward: -1.665
Total T: 5665 Episode Num: 280 Episode T: 25 Reward: 8.104
Total T: 5675 Episode Num: 281 Episode T: 10 Reward: -0.196
Total T: 5686 Episode Num: 282 Episode T: 11 Reward: -2.271
Total T: 5719 Episode Num: 283 Episode T: 33 Reward: 8.561
Total T: 5728 Episode Num: 284 Episode T: 9 Reward: -2.376
Total T: 5759 Episode Num: 285 Episode T: 31 Reward: -1.148
Total T: 5779 Episode Num: 286 Episode T: 20 Reward: -6.399
Total T: 5809 Episode Num: 287 Episode T: 30 Reward: -3.334
Total T: 5831 Episode Num: 288 Episode T: 22 Reward: 0.140
Total T: 5851 Episode Num: 289 Episode T: 20 Reward: -0.119
Total T: 5863 Episode Num: 290 Episode T: 12 Reward: 0.794
Total T: 5879 Episode Num: 291 Episode T: 16 Reward: -0.785
Total T: 5894 Episode Num: 292 Episode T: 15 Reward: -0.445
Total T: 5910 Episode Num: 293 Episode T: 16 Reward: -2.775
Total T: 5927 Episode Num: 294 Episode T: 17 

Total T: 8560 Episode Num: 419 Episode T: 11 Reward: -3.072
Total T: 8595 Episode Num: 420 Episode T: 35 Reward: 2.889
Total T: 8612 Episode Num: 421 Episode T: 17 Reward: -0.215
Total T: 8639 Episode Num: 422 Episode T: 27 Reward: -2.735
Total T: 8664 Episode Num: 423 Episode T: 25 Reward: 9.905
Total T: 8685 Episode Num: 424 Episode T: 21 Reward: 9.672
Total T: 8698 Episode Num: 425 Episode T: 13 Reward: -0.227
Total T: 8711 Episode Num: 426 Episode T: 13 Reward: 0.171
Total T: 8734 Episode Num: 427 Episode T: 23 Reward: -4.664
Total T: 8777 Episode Num: 428 Episode T: 43 Reward: 10.907
Total T: 8792 Episode Num: 429 Episode T: 15 Reward: -1.877
Total T: 8805 Episode Num: 430 Episode T: 13 Reward: 1.071
Total T: 8844 Episode Num: 431 Episode T: 39 Reward: 5.077
Total T: 8859 Episode Num: 432 Episode T: 15 Reward: 1.975
Total T: 8869 Episode Num: 433 Episode T: 10 Reward: -0.908
Total T: 8882 Episode Num: 434 Episode T: 13 Reward: 0.218
Total T: 8897 Episode Num: 435 Episode T: 15 Rew

Total T: 11266 Episode Num: 555 Episode T: 33 Reward: 11.064
Total T: 11285 Episode Num: 556 Episode T: 19 Reward: -0.856
Total T: 11302 Episode Num: 557 Episode T: 17 Reward: 4.289
Total T: 11317 Episode Num: 558 Episode T: 15 Reward: -2.102
Total T: 11341 Episode Num: 559 Episode T: 24 Reward: -1.828
Total T: 11361 Episode Num: 560 Episode T: 20 Reward: 1.399
Total T: 11376 Episode Num: 561 Episode T: 15 Reward: 3.854
Total T: 11391 Episode Num: 562 Episode T: 15 Reward: 1.496
Total T: 11404 Episode Num: 563 Episode T: 13 Reward: -4.943
Total T: 11421 Episode Num: 564 Episode T: 17 Reward: 5.138
Total T: 11441 Episode Num: 565 Episode T: 20 Reward: -5.927
Total T: 11453 Episode Num: 566 Episode T: 12 Reward: -0.165
Total T: 11478 Episode Num: 567 Episode T: 25 Reward: 14.339
Total T: 11494 Episode Num: 568 Episode T: 16 Reward: 4.875
Total T: 11519 Episode Num: 569 Episode T: 25 Reward: 11.570
Total T: 11550 Episode Num: 570 Episode T: 31 Reward: 14.996
Total T: 11565 Episode Num: 57

Total T: 14172 Episode Num: 695 Episode T: 33 Reward: 2.011
Total T: 14180 Episode Num: 696 Episode T: 8 Reward: -2.995
Total T: 14195 Episode Num: 697 Episode T: 15 Reward: -4.263
Total T: 14210 Episode Num: 698 Episode T: 15 Reward: -1.418
Total T: 14233 Episode Num: 699 Episode T: 23 Reward: 0.208
Total T: 14258 Episode Num: 700 Episode T: 25 Reward: 4.665
Total T: 14297 Episode Num: 701 Episode T: 39 Reward: 13.827
Total T: 14320 Episode Num: 702 Episode T: 23 Reward: -4.615
Total T: 14342 Episode Num: 703 Episode T: 22 Reward: 7.359
Total T: 14380 Episode Num: 704 Episode T: 38 Reward: 12.914
Total T: 14406 Episode Num: 705 Episode T: 26 Reward: -6.024
Total T: 14432 Episode Num: 706 Episode T: 26 Reward: -0.292
Total T: 14478 Episode Num: 707 Episode T: 46 Reward: 20.052
Total T: 14496 Episode Num: 708 Episode T: 18 Reward: 3.736
Total T: 14512 Episode Num: 709 Episode T: 16 Reward: -1.584
Total T: 14527 Episode Num: 710 Episode T: 15 Reward: -0.409
Total T: 14542 Episode Num: 71

Total T: 16865 Episode Num: 829 Episode T: 16 Reward: 5.276
Total T: 16876 Episode Num: 830 Episode T: 11 Reward: 0.261
Total T: 16890 Episode Num: 831 Episode T: 14 Reward: 4.265
Total T: 16903 Episode Num: 832 Episode T: 13 Reward: -2.853
Total T: 16923 Episode Num: 833 Episode T: 20 Reward: -3.970
Total T: 16932 Episode Num: 834 Episode T: 9 Reward: -2.570
Total T: 16950 Episode Num: 835 Episode T: 18 Reward: 6.706
Total T: 16969 Episode Num: 836 Episode T: 19 Reward: 11.170
Total T: 17007 Episode Num: 837 Episode T: 38 Reward: 0.183
Total T: 17017 Episode Num: 838 Episode T: 10 Reward: -3.852
Total T: 17031 Episode Num: 839 Episode T: 14 Reward: -5.139
Total T: 17042 Episode Num: 840 Episode T: 11 Reward: 0.890
Total T: 17059 Episode Num: 841 Episode T: 17 Reward: -3.008
Total T: 17084 Episode Num: 842 Episode T: 25 Reward: -6.775
Total T: 17102 Episode Num: 843 Episode T: 18 Reward: -0.814
Total T: 17123 Episode Num: 844 Episode T: 21 Reward: -1.274
Total T: 17136 Episode Num: 845

Total T: 19734 Episode Num: 968 Episode T: 21 Reward: 5.077
Total T: 19767 Episode Num: 969 Episode T: 33 Reward: -3.490
Total T: 19787 Episode Num: 970 Episode T: 20 Reward: -3.924
Total T: 19819 Episode Num: 971 Episode T: 32 Reward: 17.148
Total T: 19845 Episode Num: 972 Episode T: 26 Reward: -0.728
Total T: 19865 Episode Num: 973 Episode T: 20 Reward: -2.894
Total T: 19876 Episode Num: 974 Episode T: 11 Reward: -2.479
Total T: 19906 Episode Num: 975 Episode T: 30 Reward: 6.039
Total T: 19916 Episode Num: 976 Episode T: 10 Reward: 0.594
Total T: 19932 Episode Num: 977 Episode T: 16 Reward: 2.538
Total T: 19973 Episode Num: 978 Episode T: 41 Reward: 25.587
Total T: 19985 Episode Num: 979 Episode T: 12 Reward: -0.889
---------------------------------------
Evaluation over 10 episodes: 7.883
---------------------------------------
Total T: 20010 Episode Num: 980 Episode T: 25 Reward: -2.288
Total T: 20027 Episode Num: 981 Episode T: 17 Reward: -0.194
Total T: 20047 Episode Num: 982 Epi

Total T: 22435 Episode Num: 1101 Episode T: 14 Reward: -0.924
Total T: 22458 Episode Num: 1102 Episode T: 23 Reward: 8.039
Total T: 22470 Episode Num: 1103 Episode T: 12 Reward: 0.798
Total T: 22491 Episode Num: 1104 Episode T: 21 Reward: 8.616
Total T: 22514 Episode Num: 1105 Episode T: 23 Reward: 1.106
Total T: 22530 Episode Num: 1106 Episode T: 16 Reward: -2.162
Total T: 22550 Episode Num: 1107 Episode T: 20 Reward: 1.834
Total T: 22578 Episode Num: 1108 Episode T: 28 Reward: 9.464
Total T: 22592 Episode Num: 1109 Episode T: 14 Reward: 0.347
Total T: 22614 Episode Num: 1110 Episode T: 22 Reward: 4.412
Total T: 22622 Episode Num: 1111 Episode T: 8 Reward: -4.190
Total T: 22645 Episode Num: 1112 Episode T: 23 Reward: 0.356
Total T: 22661 Episode Num: 1113 Episode T: 16 Reward: 6.085
Total T: 22690 Episode Num: 1114 Episode T: 29 Reward: 0.507
Total T: 22708 Episode Num: 1115 Episode T: 18 Reward: -7.338
Total T: 22721 Episode Num: 1116 Episode T: 13 Reward: 3.468
Total T: 22744 Episod

Total T: 25879 Episode Num: 1233 Episode T: 67 Reward: -8.655
Total T: 25946 Episode Num: 1234 Episode T: 67 Reward: -10.205
Total T: 26013 Episode Num: 1235 Episode T: 67 Reward: -8.761
Total T: 26080 Episode Num: 1236 Episode T: 67 Reward: -10.304
Total T: 26147 Episode Num: 1237 Episode T: 67 Reward: -5.801
Total T: 26326 Episode Num: 1238 Episode T: 179 Reward: 66.475
Total T: 26483 Episode Num: 1239 Episode T: 157 Reward: 163.491
Total T: 26548 Episode Num: 1240 Episode T: 65 Reward: 65.211
Total T: 27444 Episode Num: 1241 Episode T: 896 Reward: 928.468
Total T: 27654 Episode Num: 1242 Episode T: 210 Reward: 104.904
Total T: 27716 Episode Num: 1243 Episode T: 62 Reward: -13.670
Total T: 27780 Episode Num: 1244 Episode T: 64 Reward: -13.845
Total T: 27851 Episode Num: 1245 Episode T: 71 Reward: -21.468
Total T: 27934 Episode Num: 1246 Episode T: 83 Reward: -35.847
Total T: 28039 Episode Num: 1247 Episode T: 105 Reward: -45.785
Total T: 28114 Episode Num: 1248 Episode T: 75 Reward: 

Total T: 53865 Episode Num: 1354 Episode T: 108 Reward: 170.969
Total T: 54017 Episode Num: 1355 Episode T: 152 Reward: 215.335
Total T: 54439 Episode Num: 1356 Episode T: 422 Reward: -60.193
Total T: 54538 Episode Num: 1357 Episode T: 99 Reward: 95.236
Total T: 54767 Episode Num: 1358 Episode T: 229 Reward: 211.356
Total T: 54868 Episode Num: 1359 Episode T: 101 Reward: 157.090
---------------------------------------
Evaluation over 10 episodes: 183.039
---------------------------------------
Total T: 55121 Episode Num: 1360 Episode T: 253 Reward: 171.075
Total T: 55300 Episode Num: 1361 Episode T: 179 Reward: 45.174
Total T: 55409 Episode Num: 1362 Episode T: 109 Reward: 151.887
Total T: 55528 Episode Num: 1363 Episode T: 119 Reward: 179.961
Total T: 55625 Episode Num: 1364 Episode T: 97 Reward: 165.469
Total T: 55722 Episode Num: 1365 Episode T: 97 Reward: 184.770
Total T: 55813 Episode Num: 1366 Episode T: 91 Reward: 133.829
Total T: 56678 Episode Num: 1367 Episode T: 865 Reward: 7

Total T: 82377 Episode Num: 1472 Episode T: 318 Reward: 19.652
Total T: 82690 Episode Num: 1473 Episode T: 313 Reward: 520.614
Total T: 83106 Episode Num: 1474 Episode T: 416 Reward: 461.350
Total T: 83353 Episode Num: 1475 Episode T: 247 Reward: 398.683
Total T: 83501 Episode Num: 1476 Episode T: 148 Reward: 65.418
Total T: 83631 Episode Num: 1477 Episode T: 130 Reward: 132.998
Total T: 84021 Episode Num: 1478 Episode T: 390 Reward: 504.989
Total T: 84131 Episode Num: 1479 Episode T: 110 Reward: 172.956
Total T: 84438 Episode Num: 1480 Episode T: 307 Reward: 477.119
Total T: 84649 Episode Num: 1481 Episode T: 211 Reward: 311.925
Total T: 84811 Episode Num: 1482 Episode T: 162 Reward: 346.369
---------------------------------------
Evaluation over 10 episodes: 347.666
---------------------------------------
Total T: 85015 Episode Num: 1483 Episode T: 204 Reward: 234.833
Total T: 85180 Episode Num: 1484 Episode T: 165 Reward: 40.879
Total T: 85485 Episode Num: 1485 Episode T: 305 Reward

Total T: 110411 Episode Num: 1589 Episode T: 205 Reward: 226.163
Total T: 110555 Episode Num: 1590 Episode T: 144 Reward: 194.261
Total T: 110675 Episode Num: 1591 Episode T: 120 Reward: 191.056
Total T: 110928 Episode Num: 1592 Episode T: 253 Reward: 292.348
Total T: 111480 Episode Num: 1593 Episode T: 552 Reward: 665.548
Total T: 112062 Episode Num: 1594 Episode T: 582 Reward: 808.799
Total T: 112229 Episode Num: 1595 Episode T: 167 Reward: 248.624
Total T: 112731 Episode Num: 1596 Episode T: 502 Reward: 590.276
Total T: 112819 Episode Num: 1597 Episode T: 88 Reward: 141.984
Total T: 113819 Episode Num: 1598 Episode T: 1000 Reward: 1169.023
Total T: 113959 Episode Num: 1599 Episode T: 140 Reward: 172.131
Total T: 114368 Episode Num: 1600 Episode T: 409 Reward: 703.049
Total T: 114857 Episode Num: 1601 Episode T: 489 Reward: 868.429
---------------------------------------
Evaluation over 10 episodes: 349.893
---------------------------------------
Total T: 115062 Episode Num: 1602 Epi

Total T: 150195 Episode Num: 1701 Episode T: 233 Reward: 348.175
Total T: 150647 Episode Num: 1702 Episode T: 452 Reward: 686.320
Total T: 150793 Episode Num: 1703 Episode T: 146 Reward: 242.390
Total T: 150971 Episode Num: 1704 Episode T: 178 Reward: 224.004
Total T: 151239 Episode Num: 1705 Episode T: 268 Reward: 465.584
Total T: 151587 Episode Num: 1706 Episode T: 348 Reward: 765.143
Total T: 151836 Episode Num: 1707 Episode T: 249 Reward: 405.539
Total T: 151979 Episode Num: 1708 Episode T: 143 Reward: 223.635
Total T: 152148 Episode Num: 1709 Episode T: 169 Reward: 165.841
Total T: 152322 Episode Num: 1710 Episode T: 174 Reward: 432.648
Total T: 152713 Episode Num: 1711 Episode T: 391 Reward: 703.177
Total T: 152853 Episode Num: 1712 Episode T: 140 Reward: 232.945
Total T: 153159 Episode Num: 1713 Episode T: 306 Reward: 595.880
Total T: 153375 Episode Num: 1714 Episode T: 216 Reward: 484.073
Total T: 153584 Episode Num: 1715 Episode T: 209 Reward: 405.316
Total T: 153712 Episode N

Total T: 176000 Episode Num: 1819 Episode T: 96 Reward: 176.486
Total T: 176483 Episode Num: 1820 Episode T: 483 Reward: 703.162
Total T: 176647 Episode Num: 1821 Episode T: 164 Reward: 438.400
Total T: 176742 Episode Num: 1822 Episode T: 95 Reward: 163.549
Total T: 176975 Episode Num: 1823 Episode T: 233 Reward: 638.489
Total T: 177054 Episode Num: 1824 Episode T: 79 Reward: 144.807
Total T: 177258 Episode Num: 1825 Episode T: 204 Reward: 475.350
Total T: 177426 Episode Num: 1826 Episode T: 168 Reward: 322.442
Total T: 177509 Episode Num: 1827 Episode T: 83 Reward: 145.667
Total T: 177636 Episode Num: 1828 Episode T: 127 Reward: 302.471
Total T: 177707 Episode Num: 1829 Episode T: 71 Reward: 126.178
Total T: 177902 Episode Num: 1830 Episode T: 195 Reward: 376.442
Total T: 178042 Episode Num: 1831 Episode T: 140 Reward: 370.343
Total T: 178249 Episode Num: 1832 Episode T: 207 Reward: 550.821
Total T: 178479 Episode Num: 1833 Episode T: 230 Reward: 596.784
Total T: 178743 Episode Num: 1

Total T: 203279 Episode Num: 1937 Episode T: 269 Reward: 749.911
Total T: 203539 Episode Num: 1938 Episode T: 260 Reward: 672.776
Total T: 203804 Episode Num: 1939 Episode T: 265 Reward: 675.027
Total T: 204005 Episode Num: 1940 Episode T: 201 Reward: 425.738
Total T: 204292 Episode Num: 1941 Episode T: 287 Reward: 773.744
Total T: 204543 Episode Num: 1942 Episode T: 251 Reward: 623.929
Total T: 204790 Episode Num: 1943 Episode T: 247 Reward: 629.789
---------------------------------------
Evaluation over 10 episodes: 633.254
---------------------------------------
Total T: 205061 Episode Num: 1944 Episode T: 271 Reward: 680.560
Total T: 205364 Episode Num: 1945 Episode T: 303 Reward: 791.311
Total T: 205611 Episode Num: 1946 Episode T: 247 Reward: 676.597
Total T: 205871 Episode Num: 1947 Episode T: 260 Reward: 683.301
Total T: 206115 Episode Num: 1948 Episode T: 244 Reward: 604.645
Total T: 206230 Episode Num: 1949 Episode T: 115 Reward: 313.924
Total T: 206475 Episode Num: 1950 Epis

Total T: 231735 Episode Num: 2053 Episode T: 277 Reward: 687.260
Total T: 232024 Episode Num: 2054 Episode T: 289 Reward: 668.026
Total T: 232266 Episode Num: 2055 Episode T: 242 Reward: 619.889
Total T: 232529 Episode Num: 2056 Episode T: 263 Reward: 637.128
Total T: 232658 Episode Num: 2057 Episode T: 129 Reward: 332.160
Total T: 232806 Episode Num: 2058 Episode T: 148 Reward: 419.828
Total T: 232917 Episode Num: 2059 Episode T: 111 Reward: 259.298
Total T: 233194 Episode Num: 2060 Episode T: 277 Reward: 653.443
Total T: 233537 Episode Num: 2061 Episode T: 343 Reward: 864.166
Total T: 233890 Episode Num: 2062 Episode T: 353 Reward: 875.187
Total T: 234197 Episode Num: 2063 Episode T: 307 Reward: 790.965
Total T: 234478 Episode Num: 2064 Episode T: 281 Reward: 766.644
Total T: 234783 Episode Num: 2065 Episode T: 305 Reward: 692.846
---------------------------------------
Evaluation over 10 episodes: 809.429
---------------------------------------
Total T: 235118 Episode Num: 2066 Epis

Total T: 262896 Episode Num: 2169 Episode T: 307 Reward: 836.424
Total T: 263216 Episode Num: 2170 Episode T: 320 Reward: 845.823
Total T: 263530 Episode Num: 2171 Episode T: 314 Reward: 832.452
Total T: 263834 Episode Num: 2172 Episode T: 304 Reward: 804.999
Total T: 264171 Episode Num: 2173 Episode T: 337 Reward: 893.774
Total T: 264480 Episode Num: 2174 Episode T: 309 Reward: 861.668
Total T: 264663 Episode Num: 2175 Episode T: 183 Reward: 516.306
Total T: 264795 Episode Num: 2176 Episode T: 132 Reward: 377.532
---------------------------------------
Evaluation over 10 episodes: 682.235
---------------------------------------
Total T: 265105 Episode Num: 2177 Episode T: 310 Reward: 831.799
Total T: 265426 Episode Num: 2178 Episode T: 321 Reward: 848.613
Total T: 265736 Episode Num: 2179 Episode T: 310 Reward: 757.976
Total T: 266031 Episode Num: 2180 Episode T: 295 Reward: 813.952
Total T: 266312 Episode Num: 2181 Episode T: 281 Reward: 707.916
Total T: 266594 Episode Num: 2182 Epis

Total T: 295775 Episode Num: 2283 Episode T: 256 Reward: 619.913
Total T: 295993 Episode Num: 2284 Episode T: 218 Reward: 586.031
Total T: 296319 Episode Num: 2285 Episode T: 326 Reward: 933.993
Total T: 296539 Episode Num: 2286 Episode T: 220 Reward: 614.814
Total T: 296871 Episode Num: 2287 Episode T: 332 Reward: 862.225
Total T: 297152 Episode Num: 2288 Episode T: 281 Reward: 772.299
Total T: 297484 Episode Num: 2289 Episode T: 332 Reward: 935.086
Total T: 297728 Episode Num: 2290 Episode T: 244 Reward: 677.991
Total T: 297930 Episode Num: 2291 Episode T: 202 Reward: 496.575
Total T: 298249 Episode Num: 2292 Episode T: 319 Reward: 828.632
Total T: 298559 Episode Num: 2293 Episode T: 310 Reward: 866.878
Total T: 298883 Episode Num: 2294 Episode T: 324 Reward: 828.553
Total T: 299115 Episode Num: 2295 Episode T: 232 Reward: 648.516
Total T: 299351 Episode Num: 2296 Episode T: 236 Reward: 786.757
Total T: 299658 Episode Num: 2297 Episode T: 307 Reward: 927.436
-------------------------

---------------------------------------
Evaluation over 10 episodes: 329.778
---------------------------------------
Total T: 325099 Episode Num: 2401 Episode T: 354 Reward: 919.749
Total T: 325412 Episode Num: 2402 Episode T: 313 Reward: 815.527
Total T: 325654 Episode Num: 2403 Episode T: 242 Reward: 569.250
Total T: 326011 Episode Num: 2404 Episode T: 357 Reward: 1019.801
Total T: 326238 Episode Num: 2405 Episode T: 227 Reward: 633.255
Total T: 326625 Episode Num: 2406 Episode T: 387 Reward: 868.573
Total T: 326711 Episode Num: 2407 Episode T: 86 Reward: 172.662
Total T: 326851 Episode Num: 2408 Episode T: 140 Reward: 365.743
Total T: 326925 Episode Num: 2409 Episode T: 74 Reward: 139.015
Total T: 327136 Episode Num: 2410 Episode T: 211 Reward: 603.876
Total T: 327463 Episode Num: 2411 Episode T: 327 Reward: 924.724
Total T: 327717 Episode Num: 2412 Episode T: 254 Reward: 717.446
Total T: 328035 Episode Num: 2413 Episode T: 318 Reward: 950.079
Total T: 328355 Episode Num: 2414 Episo

---------------------------------------
Evaluation over 10 episodes: 744.907
---------------------------------------
Total T: 355175 Episode Num: 2517 Episode T: 230 Reward: 696.429
Total T: 355487 Episode Num: 2518 Episode T: 312 Reward: 935.270
Total T: 355727 Episode Num: 2519 Episode T: 240 Reward: 822.013
Total T: 355868 Episode Num: 2520 Episode T: 141 Reward: 427.477
Total T: 356281 Episode Num: 2521 Episode T: 413 Reward: 1452.947
Total T: 356590 Episode Num: 2522 Episode T: 309 Reward: 900.058
Total T: 356811 Episode Num: 2523 Episode T: 221 Reward: 610.004
Total T: 357075 Episode Num: 2524 Episode T: 264 Reward: 861.630
Total T: 357141 Episode Num: 2525 Episode T: 66 Reward: 110.259
Total T: 357357 Episode Num: 2526 Episode T: 216 Reward: 607.601
Total T: 357809 Episode Num: 2527 Episode T: 452 Reward: 1428.440
Total T: 358063 Episode Num: 2528 Episode T: 254 Reward: 648.535
Total T: 358226 Episode Num: 2529 Episode T: 163 Reward: 509.961
Total T: 358536 Episode Num: 2530 Epi

Total T: 395071 Episode Num: 2627 Episode T: 616 Reward: 1890.514
Total T: 395319 Episode Num: 2628 Episode T: 248 Reward: 821.115
Total T: 395466 Episode Num: 2629 Episode T: 147 Reward: 440.936
Total T: 395634 Episode Num: 2630 Episode T: 168 Reward: 531.738
Total T: 395699 Episode Num: 2631 Episode T: 65 Reward: 112.848
Total T: 396152 Episode Num: 2632 Episode T: 453 Reward: 1546.802
Total T: 396256 Episode Num: 2633 Episode T: 104 Reward: 226.748
Total T: 396673 Episode Num: 2634 Episode T: 417 Reward: 1358.200
Total T: 396817 Episode Num: 2635 Episode T: 144 Reward: 422.299
Total T: 397817 Episode Num: 2636 Episode T: 1000 Reward: 3026.000
Total T: 398209 Episode Num: 2637 Episode T: 392 Reward: 1119.119
Total T: 399209 Episode Num: 2638 Episode T: 1000 Reward: 2883.344
Total T: 399599 Episode Num: 2639 Episode T: 390 Reward: 1316.774
Total T: 399972 Episode Num: 2640 Episode T: 373 Reward: 1218.510
---------------------------------------
Evaluation over 10 episodes: 2847.417
---

Total T: 453000 Episode Num: 2732 Episode T: 1000 Reward: 3447.649
Total T: 453581 Episode Num: 2733 Episode T: 581 Reward: 2062.212
Total T: 453939 Episode Num: 2734 Episode T: 358 Reward: 1166.117
Total T: 454051 Episode Num: 2735 Episode T: 112 Reward: 282.863
Total T: 454597 Episode Num: 2736 Episode T: 546 Reward: 1906.434
Total T: 454871 Episode Num: 2737 Episode T: 274 Reward: 847.996
---------------------------------------
Evaluation over 10 episodes: 1179.382
---------------------------------------
Total T: 455628 Episode Num: 2738 Episode T: 757 Reward: 2477.154
Total T: 456103 Episode Num: 2739 Episode T: 475 Reward: 1661.590
Total T: 456484 Episode Num: 2740 Episode T: 381 Reward: 1265.147
Total T: 456830 Episode Num: 2741 Episode T: 346 Reward: 1146.007
Total T: 457108 Episode Num: 2742 Episode T: 278 Reward: 861.986
Total T: 457280 Episode Num: 2743 Episode T: 172 Reward: 419.561
Total T: 457764 Episode Num: 2744 Episode T: 484 Reward: 1578.206
Total T: 458179 Episode Num

Total T: 499031 Episode Num: 2841 Episode T: 170 Reward: 505.405
Total T: 499839 Episode Num: 2842 Episode T: 808 Reward: 2977.675
---------------------------------------
Evaluation over 10 episodes: 1820.490
---------------------------------------
Total T: 500229 Episode Num: 2843 Episode T: 390 Reward: 1291.781
Total T: 500691 Episode Num: 2844 Episode T: 462 Reward: 1603.157
Total T: 501128 Episode Num: 2845 Episode T: 437 Reward: 1552.526
Total T: 501642 Episode Num: 2846 Episode T: 514 Reward: 1932.535
Total T: 501799 Episode Num: 2847 Episode T: 157 Reward: 435.890
Total T: 502386 Episode Num: 2848 Episode T: 587 Reward: 2171.934
Total T: 502967 Episode Num: 2849 Episode T: 581 Reward: 2282.172
Total T: 503281 Episode Num: 2850 Episode T: 314 Reward: 1030.541
Total T: 503842 Episode Num: 2851 Episode T: 561 Reward: 2167.196
Total T: 504143 Episode Num: 2852 Episode T: 301 Reward: 1051.795
Total T: 504506 Episode Num: 2853 Episode T: 363 Reward: 1284.594
Total T: 504618 Episode Nu

---------------------------------------
Evaluation over 10 episodes: 2794.426
---------------------------------------
Total T: 560862 Episode Num: 2944 Episode T: 1000 Reward: 4292.526
Total T: 561613 Episode Num: 2945 Episode T: 751 Reward: 3086.040
Total T: 562613 Episode Num: 2946 Episode T: 1000 Reward: 4135.038
Total T: 562959 Episode Num: 2947 Episode T: 346 Reward: 1298.268
Total T: 563083 Episode Num: 2948 Episode T: 124 Reward: 311.018
Total T: 563239 Episode Num: 2949 Episode T: 156 Reward: 419.643
Total T: 563338 Episode Num: 2950 Episode T: 99 Reward: 223.161
Total T: 564292 Episode Num: 2951 Episode T: 954 Reward: 4175.232
Total T: 564868 Episode Num: 2952 Episode T: 576 Reward: 2394.081
---------------------------------------
Evaluation over 10 episodes: 3592.318
---------------------------------------
Total T: 565329 Episode Num: 2953 Episode T: 461 Reward: 1894.680
Total T: 565606 Episode Num: 2954 Episode T: 277 Reward: 938.494
Total T: 566019 Episode Num: 2955 Episode

Total T: 638318 Episode Num: 3039 Episode T: 1000 Reward: 4249.230
Total T: 639318 Episode Num: 3040 Episode T: 1000 Reward: 4267.123
---------------------------------------
Evaluation over 10 episodes: 4172.185
---------------------------------------
Total T: 640318 Episode Num: 3041 Episode T: 1000 Reward: 4286.727
Total T: 641318 Episode Num: 3042 Episode T: 1000 Reward: 4200.828
Total T: 642228 Episode Num: 3043 Episode T: 910 Reward: 3987.868
Total T: 643085 Episode Num: 3044 Episode T: 857 Reward: 3711.723
Total T: 643505 Episode Num: 3045 Episode T: 420 Reward: 1730.029
Total T: 644040 Episode Num: 3046 Episode T: 535 Reward: 2213.184
---------------------------------------
Evaluation over 10 episodes: 3615.827
---------------------------------------
Total T: 645040 Episode Num: 3047 Episode T: 1000 Reward: 4248.520
Total T: 646040 Episode Num: 3048 Episode T: 1000 Reward: 4070.327
Total T: 647040 Episode Num: 3049 Episode T: 1000 Reward: 4188.960
Total T: 648040 Episode Num: 30

Total T: 722340 Episode Num: 3132 Episode T: 1000 Reward: 4123.218
Total T: 723340 Episode Num: 3133 Episode T: 1000 Reward: 4161.516
Total T: 724340 Episode Num: 3134 Episode T: 1000 Reward: 4043.134
---------------------------------------
Evaluation over 10 episodes: 3892.184
---------------------------------------
Total T: 725340 Episode Num: 3135 Episode T: 1000 Reward: 4199.475
Total T: 726340 Episode Num: 3136 Episode T: 1000 Reward: 4202.914
Total T: 727340 Episode Num: 3137 Episode T: 1000 Reward: 4315.049
Total T: 728340 Episode Num: 3138 Episode T: 1000 Reward: 4210.778
Total T: 729340 Episode Num: 3139 Episode T: 1000 Reward: 4186.181
---------------------------------------
Evaluation over 10 episodes: 4077.844
---------------------------------------
Total T: 730340 Episode Num: 3140 Episode T: 1000 Reward: 4075.218
Total T: 731171 Episode Num: 3141 Episode T: 831 Reward: 3463.781
Total T: 732171 Episode Num: 3142 Episode T: 1000 Reward: 4233.412
Total T: 733171 Episode Num:

Total T: 810623 Episode Num: 3223 Episode T: 1000 Reward: 4410.662
Total T: 811623 Episode Num: 3224 Episode T: 1000 Reward: 4472.352
Total T: 812623 Episode Num: 3225 Episode T: 1000 Reward: 4354.994
Total T: 813623 Episode Num: 3226 Episode T: 1000 Reward: 4309.307
Total T: 814623 Episode Num: 3227 Episode T: 1000 Reward: 4220.009
---------------------------------------
Evaluation over 10 episodes: 4180.755
---------------------------------------
Total T: 815623 Episode Num: 3228 Episode T: 1000 Reward: 4372.823
Total T: 816623 Episode Num: 3229 Episode T: 1000 Reward: 4343.525
Total T: 817623 Episode Num: 3230 Episode T: 1000 Reward: 4280.633
Total T: 818623 Episode Num: 3231 Episode T: 1000 Reward: 4549.305
Total T: 819623 Episode Num: 3232 Episode T: 1000 Reward: 4476.145
Total T: 819903 Episode Num: 3233 Episode T: 280 Reward: 934.012
---------------------------------------
Evaluation over 10 episodes: 4531.719
---------------------------------------
Total T: 820903 Episode Num: 

Total T: 899741 Episode Num: 3316 Episode T: 1000 Reward: 4353.110
---------------------------------------
Evaluation over 10 episodes: 4380.195
---------------------------------------
Total T: 900741 Episode Num: 3317 Episode T: 1000 Reward: 4312.863
Total T: 901741 Episode Num: 3318 Episode T: 1000 Reward: 4251.099
Total T: 902741 Episode Num: 3319 Episode T: 1000 Reward: 4346.339
Total T: 903741 Episode Num: 3320 Episode T: 1000 Reward: 4409.965
Total T: 904741 Episode Num: 3321 Episode T: 1000 Reward: 4381.219
---------------------------------------
Evaluation over 10 episodes: 4409.121
---------------------------------------
Total T: 905741 Episode Num: 3322 Episode T: 1000 Reward: 4434.765
Total T: 906741 Episode Num: 3323 Episode T: 1000 Reward: 4397.271
Total T: 907741 Episode Num: 3324 Episode T: 1000 Reward: 4362.079
Total T: 908741 Episode Num: 3325 Episode T: 1000 Reward: 4412.058
Total T: 909741 Episode Num: 3326 Episode T: 1000 Reward: 4214.954
---------------------------

---------------------------------------
Evaluation over 10 episodes: 4513.698
---------------------------------------
Total T: 990005 Episode Num: 3407 Episode T: 1000 Reward: 4483.452
Total T: 991005 Episode Num: 3408 Episode T: 1000 Reward: 4483.843
Total T: 992005 Episode Num: 3409 Episode T: 1000 Reward: 4450.741
Total T: 993005 Episode Num: 3410 Episode T: 1000 Reward: 4502.575
Total T: 994005 Episode Num: 3411 Episode T: 1000 Reward: 4420.689
---------------------------------------
Evaluation over 10 episodes: 4543.741
---------------------------------------
Total T: 995005 Episode Num: 3412 Episode T: 1000 Reward: 4496.086
Total T: 996005 Episode Num: 3413 Episode T: 1000 Reward: 4513.234
Total T: 997005 Episode Num: 3414 Episode T: 1000 Reward: 4541.839
Total T: 998005 Episode Num: 3415 Episode T: 1000 Reward: 4609.012
Total T: 999005 Episode Num: 3416 Episode T: 1000 Reward: 4557.577
---------------------------------------
Evaluation over 10 episodes: 4551.109
----------------
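
The log above is easier to read as a learning curve than as raw text. The following is a minimal sketch, not part of the original notebook, of how the per-episode lines and the `Evaluation over 10 episodes` summaries could be pulled out with regular expressions. The function name `parse_training_log` and the variable `sample` are hypothetical, and the sketch assumes rewards are always printed with three decimal places, as they are in the log shown here. Lines cut off by output truncation do not match the patterns and are skipped rather than guessed at.

```python
import re

# Complete per-episode line. A truncated line (for example one ending in
# "Reward: 7") does not match, because three decimal places are required.
EPISODE_RE = re.compile(
    r"^Total T: (\d+) Episode Num: (\d+) Episode T: (\d+) Reward: (-?\d+\.\d{3})$"
)
# Evaluation summary line printed between the dashed separators.
EVAL_RE = re.compile(r"^Evaluation over (\d+) episodes: (-?\d+\.\d{3})$")


def parse_training_log(text):
    """Return (episodes, evaluations) parsed from the printed training log.

    episodes:    list of (total_t, episode_num, episode_t, reward) tuples
    evaluations: list of average evaluation returns, in order of appearance
    """
    episodes, evaluations = [], []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = EPISODE_RE.match(line)
        if match:
            total_t, episode_num, episode_t, reward = match.groups()
            episodes.append(
                (int(total_t), int(episode_num), int(episode_t), float(reward))
            )
            continue
        match = EVAL_RE.match(line)
        if match:
            evaluations.append(float(match.group(2)))
    return episodes, evaluations


# A few lines copied from the log above, including one truncated line.
sample = """\
Total T: 54868 Episode Num: 1359 Episode T: 101 Reward: 157.090
---------------------------------------
Evaluation over 10 episodes: 183.039
---------------------------------------
Total T: 55121 Episode Num: 1360 Episode T: 253 Reward: 171.075
Total T: 56678 Episode Num: 1367 Episode T: 865 Reward: 7
"""

episodes, evaluations = parse_training_log(sample)
print(len(episodes), evaluations)
```

The `evaluations` list can then be plotted against its index to see the trend in the evaluation averages. Keep in mind that the log shown here is only a partial excerpt, so any curve built from it has gaps.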

In [10]:
state_dim

17