# SIRX: Training RL

Baseline comparison in terms of total loss and energy.

To run this script:
1. Please make sure that the required data folder is available at the paths used by the script.
You may generate the required data by running the python script
```nodec_experiments/sirx/gen_parameters.py```.

2. The scripts below:
 - ```nodec_experiments/sirx/sirx.py```
 - ```nodec_experiments/sirx/rl_utils.py```
 - ```nodec_experiments/sirx/sirx_utils.py```
contain utilities that the training, evaluation and plotting scripts depend on. Please make sure that they are available on the Python path when running experiments.

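Reinforcement learning takes a significant amount of time to train: in the run logged below, each epoch takes roughly 30 to 40 seconds and up to 100 epochs are requested.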

As neural network initialization is stochastic, please fix the random seeds or expect some variance relative to the results reported in the paper.
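
The notebook itself does not set any seeds. A minimal sketch of a seeding helper is shown below, assuming it is called before the networks and environments are created. The name `seed_everything` and the seed value are illustrative choices, not part of the original code, and seeding alone may not remove every source of nondeterminism (for example some GPU kernels).

```python
import random

import numpy as np


def seed_everything(seed):
    """Seed the Python, NumPy and (if installed) PyTorch random generators."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch  # imported lazily so the helper also runs without PyTorch
    except ImportError:
        return
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)


seed_everything(0)  # the value 0 is an arbitrary illustrative choice
first_draw = (random.random(), float(np.random.rand()))
seed_everything(0)
second_draw = (random.random(), float(np.random.rand()))
print(first_draw == second_draw)  # identical draws after re-seeding
```

Exploration noise and replay-buffer sampling also draw random numbers, so the same call would need to come before training starts, not only before the networks are built.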

## Imports

In [1]:
%load_ext autoreload
%autoreload 2
import os
import sys
sys.path.append("../../../") # make modules three directories up importable
sys.path.append("../") # make modules in the parent dir importable

import time

import copy

import numpy as np
import gym
from gym.spaces import Box
import torch
from torchdiffeq import odeint, odeint_adjoint
from nnc.controllers.neural_network.nnc_controllers import NNCDynamics
from nnc.helpers.torch_utils.graphs import drivers_to_tensor

In [2]:
from sirx import SIRDelta, neighborhood_mask, flat_to_channels, GCNNControl
from rl_utils import SIRXEnv, RLGCNN, Actor, Critic

import tianshou as ts
from tianshou.policy import TD3Policy
from tianshou.trainer import offpolicy_trainer
from tianshou.data import Collector, ReplayBuffer, to_torch
from tianshou.exploration import GaussianNoise

from torch.utils.tensorboard import SummaryWriter

In [3]:
device = 'cuda:0'
dtype = torch.float

### Graph parameters

In [4]:
graph = 'lattice'
parameters_folder = '../data/parameters/sirx/'
results_folder = '../data/results/sirx/'+graph+'/'

graph_parameters_folder = parameters_folder + graph + '/'

adjacency_matrix = torch.load(graph_parameters_folder + 'adjacency.pt', map_location=device).to(dtype)
n_nodes = adjacency_matrix.shape[-1]
drivers = torch.load(graph_parameters_folder + 'drivers.pt', map_location='cpu').to(torch.long)
driver_matrix = drivers_to_tensor(n_nodes, drivers).to(dtype=dtype, device=device)
alpha = adjacency_matrix
beta = driver_matrix
side_size = int(np.sqrt(n_nodes))

### Dynamics Parameters

In [5]:
x0 = torch.load(graph_parameters_folder + 'initial_state.pt').to(device=device, dtype=dtype)
target_subgraph = torch.load(graph_parameters_folder + 'target_subgraph_nodes.pt')
dynamics_params = torch.load(graph_parameters_folder + 'dynamics_parameters.pt')
# budget and rates need to be chosen according to graph size
budget = dynamics_params['budget']
infection_rate = dynamics_params['infection_rate']
recovery_rate = dynamics_params['recovery_rate']
total_time = 5 # time horizon, determined by simulating the dynamics without control

In [6]:
sirx_dyn = SIRDelta(
             adjacency_matrix=alpha,
             infection_rate=infection_rate,
             recovery_rate=recovery_rate,
             driver_matrix=beta,
             k_0=0.0,
            ).to(device=device, dtype=dtype)

In [7]:
rl_dt = 0.01 # RL interaction interval (time between consecutive actions)
env_config={
    'sirx' : sirx_dyn,
    'target_nodes' : target_subgraph.tolist(),
    'dt' : rl_dt,
    'T' : total_time,
    'ode_solve_method' : 'dopri5',
    'reward_type' : 'sum_to_max',
    'x0' : x0,
    'budget' : budget    
}
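
The values of `dt` and `T` determine how many times the agent acts per episode. As a rough check, the sketch below works out the counts under the assumption that the environment places its decision times on a uniform grid starting at 0 with spacing `dt` and ending before `T`; this is an assumption about `SIRXEnv`, whose code is not shown here. The result agrees with the `len=499` and `n/st=998` values that appear in the training log.

```python
dt = 0.01          # rl_dt from the configuration above
total_time = 5     # T from the configuration above
n_envs = 2         # number of training environments used in this notebook

# Assumption: decision times form a uniform grid 0, dt, 2*dt, ... below T.
n_time_points = round(total_time / dt)     # grid points
steps_per_episode = n_time_points - 1      # transitions between consecutive points
env_steps_per_collect = n_envs * steps_per_episode

print(n_time_points, steps_per_episode, env_steps_per_collect)  # prints: 500 499 998
```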

In [8]:
train_envs = ts.env.DummyVectorEnv([lambda: SIRXEnv(env_config) for _ in range(2)])
test_envs = ts.env.DummyVectorEnv([lambda: SIRXEnv(env_config) for _ in range(2)])

### RL Neural Networks
If you check the code of RLGCNN (in rl_utils.py), you will see that the policy network has the same learnable parameters and structure as the network
used for NODEC before the decision layer.

In [9]:
mask, ninds = neighborhood_mask(alpha)
in_preprocessor = lambda x: flat_to_channels(x, n_nodes=n_nodes, mask=mask, inds=ninds)

policy_net = RLGCNN(
                   adjacency_matrix = alpha,
                   driver_matrix = beta, 
                   input_preprocessor = in_preprocessor,
                   in_channels=4,
                   feat_channels=5,
                   message_passes=4
                  )

actor = Actor(model = policy_net, device=device).to(device)
actor_optim = torch.optim.Adam(actor.parameters(), lr=0.0003)

critic1 = Critic(1, 4096, 512, device=device).to(device)
critic1_optim = torch.optim.Adam(critic1.parameters(), lr=1e-4)

critic2 = Critic(1, 4096, 512, device=device).to(device)
critic2_optim = torch.optim.Adam(critic2.parameters(), lr=1e-4)


In [10]:
# for transfer learning, pretrained weights can be loaded directly into the actor's model
#actor.model.load_state_dict(torch.load('../sir/sirx_best.torch'))
secs = int(round(time.time()))
log_path = results_folder + 'rl/td3/time_'+str(secs)
log_path

'../data/results/sirx/lattice/rl/td3/time_1608770137'
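
The training cell that follows sets three TD3-specific hyperparameters: `tau` (soft target-network update rate), and `policy_noise` with `noise_clip` (target policy smoothing). The sketch below illustrates the textbook TD3 rules these names refer to, on toy NumPy arrays. It is not tianshou's implementation; the helper names `soft_update` and `smoothed_target_action` and the action range [0, 1] are assumptions made for illustration.

```python
import numpy as np


def soft_update(target_params, online_params, tau):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return tau * online_params + (1.0 - tau) * target_params


def smoothed_target_action(action, rng, policy_noise, noise_clip, low, high):
    """Add clipped Gaussian noise to a target action, then clip to the action range."""
    noise = rng.normal(0.0, policy_noise, size=action.shape)
    noise = np.clip(noise, -noise_clip, noise_clip)
    return np.clip(action + noise, low, high)


rng = np.random.default_rng(0)

# Toy parameter vectors; tau is the value used in the notebook.
target = np.zeros(3)
online = np.ones(3)
target = soft_update(target, online, tau=0.005)
print(target)  # each entry moves 0.5% of the way towards the online value

# Toy action; policy_noise and noise_clip are the values used in the notebook.
# Assumed action range [0, 1]; the notebook reads the real bounds from env.action_space.
action = np.array([0.2, 0.999, 0.0])
smoothed = smoothed_target_action(action, rng, policy_noise=0.001,
                                  noise_clip=0.5, low=0.0, high=1.0)
print(smoothed)
```

With `policy_noise = 0.001` and `noise_clip = 0.5`, the clip on the noise sits 500 standard deviations out, so in practice the smoothing acts as plain Gaussian noise with a very small scale.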

In [12]:
# Policy training procedure
# evaluation environment
env = SIRXEnv(env_config)


# You can change TD3 to SAC or any other continuous-action policy provided by tianshou
policy = TD3Policy(
    actor = actor,
    actor_optim = actor_optim,
    critic1 = critic1,
    critic1_optim = critic1_optim,
    critic2 = critic2,
    critic2_optim = critic2_optim,
    tau= 0.005,
    gamma = 0.999,
    exploration_noise = GaussianNoise(0.01),
    policy_noise = 0.001,
    update_actor_freq = 5,
    noise_clip = 0.5,
    action_range =  [env.action_space.low[0], env.action_space.high[0]],
    reward_normalization = True,
    ignore_done = False,
)


   
# Experience Collector
train_collector = Collector(
    policy, train_envs, ReplayBuffer(8000))
test_collector = Collector(policy, test_envs)
writer = SummaryWriter(log_path)

def save_fn(policy):
    # save best model
    torch.save(policy.state_dict(), os.path.join(log_path, 'policy.pth'))

# trainer
result = offpolicy_trainer(
    policy = policy,
    train_collector = train_collector,
    test_collector = test_collector,
    max_epoch = 100,
    step_per_epoch = len(env.time_steps),
    collect_per_step = 1,
    episode_per_test = 1,
    batch_size = len(env.time_steps),
    save_fn = save_fn,
    writer = writer,
    log_interval = 1,
    verbose = True,
)
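
With `verbose = True` the trainer prints one summary line per epoch of the form `Epoch #k: test_reward: ..., best_reward: ... in #j`. A minimal sketch for extracting those numbers from captured output, using only the standard library, could look like the following. The three sample lines are copied from the output of this run; the function name `parse_test_rewards` is hypothetical.

```python
import re

# Matches the per-epoch summary lines; progress-bar lines do not match.
LINE_RE = re.compile(
    r"Epoch #(\d+): test_reward: (-?\d+\.\d+) ± \d+\.\d+, "
    r"best_reward: (-?\d+\.\d+) ± \d+\.\d+ in #(\d+)"
)


def parse_test_rewards(log_text):
    """Return a list of (epoch, test_reward, best_reward, best_epoch) tuples."""
    rows = []
    for match in LINE_RE.finditer(log_text):
        epoch, test_reward, best_reward, best_epoch = match.groups()
        rows.append((int(epoch), float(test_reward), float(best_reward), int(best_epoch)))
    return rows


sample_log = """
Epoch #1: test_reward: -0.089169 ± 0.000000, best_reward: -0.089169 ± 0.000000 in #1
Epoch #2: test_reward: -0.091385 ± 0.000000, best_reward: -0.089169 ± 0.000000 in #1
Epoch #6: test_reward: -0.078287 ± 0.000000, best_reward: -0.078287 ± 0.000000 in #6
"""

rows = parse_test_rewards(sample_log)
best_row = max(rows, key=lambda row: row[1])  # row with the highest test reward
print(best_row)  # prints: (6, -0.078287, -0.078287, 6)
```

The same scalars are also written to TensorBoard through the `SummaryWriter`, so parsing text is mainly useful when only the printed output has been kept.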


Epoch #1: 500it [00:30, 16.16it/s, env_step=998, len=499, loss/actor=-0.182702, loss/critic1=0.000060, loss/critic2=0.000069, n/ep=2, n/st=998, rew=-0.09, v/ep=0.33, v/st=163.73]                         
Epoch #2:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #1: test_reward: -0.089169 ± 0.000000, best_reward: -0.089169 ± 0.000000 in #1


Epoch #2: 500it [00:30, 16.13it/s, env_step=1996, len=499, loss/actor=-0.226563, loss/critic1=0.000258, loss/critic2=0.000202, n/ep=2, n/st=998, rew=-0.09, v/ep=0.33, v/st=166.62]                         
Epoch #3:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #2: test_reward: -0.091385 ± 0.000000, best_reward: -0.089169 ± 0.000000 in #1


Epoch #3: 500it [00:31, 16.07it/s, env_step=2994, len=499, loss/actor=-0.255267, loss/critic1=0.000115, loss/critic2=0.000057, n/ep=2, n/st=998, rew=-0.09, v/ep=0.33, v/st=166.12]                         
Epoch #4:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #3: test_reward: -0.089622 ± 0.000000, best_reward: -0.089169 ± 0.000000 in #1


Epoch #4: 500it [00:31, 16.03it/s, env_step=3992, len=499, loss/actor=-0.220343, loss/critic1=0.000244, loss/critic2=0.002239, n/ep=2, n/st=998, rew=-0.09, v/ep=0.33, v/st=166.39]                         
Epoch #5:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #4: test_reward: -0.089925 ± 0.000000, best_reward: -0.089169 ± 0.000000 in #1


Epoch #5: 500it [00:31, 16.02it/s, env_step=4990, len=499, loss/actor=-0.174341, loss/critic1=0.000558, loss/critic2=0.000803, n/ep=2, n/st=998, rew=-0.09, v/ep=0.33, v/st=166.34]                         
Epoch #6:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #5: test_reward: -0.091205 ± 0.000000, best_reward: -0.089169 ± 0.000000 in #1


Epoch #6: 500it [00:31, 15.99it/s, env_step=5988, len=499, loss/actor=-0.185871, loss/critic1=0.003273, loss/critic2=0.000078, n/ep=2, n/st=998, rew=-0.09, v/ep=0.33, v/st=166.56]                         
Epoch #7:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #6: test_reward: -0.078287 ± 0.000000, best_reward: -0.078287 ± 0.000000 in #6


Epoch #7: 500it [00:31, 15.97it/s, env_step=6986, len=499, loss/actor=-0.497847, loss/critic1=0.000594, loss/critic2=0.000087, n/ep=2, n/st=998, rew=-0.08, v/ep=0.33, v/st=165.55]                         
Epoch #8:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #7: test_reward: -0.051297 ± 0.000000, best_reward: -0.051297 ± 0.000000 in #7


Epoch #8: 500it [00:31, 15.95it/s, env_step=7984, len=499, loss/actor=-0.453223, loss/critic1=0.000642, loss/critic2=0.000444, n/ep=2, n/st=998, rew=-0.05, v/ep=0.33, v/st=163.77]                         
Epoch #9:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #8: test_reward: -0.043418 ± 0.000000, best_reward: -0.043418 ± 0.000000 in #8


Epoch #9: 500it [00:31, 15.90it/s, env_step=8982, len=499, loss/actor=-0.269737, loss/critic1=0.002413, loss/critic2=0.002117, n/ep=2, n/st=998, rew=-0.04, v/ep=0.33, v/st=162.78]                         
Epoch #10:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #9: test_reward: -0.045763 ± 0.000000, best_reward: -0.043418 ± 0.000000 in #8


Epoch #10: 500it [00:31, 15.87it/s, env_step=9980, len=499, loss/actor=-0.115519, loss/critic1=0.009713, loss/critic2=0.006006, n/ep=2, n/st=998, rew=-0.04, v/ep=0.32, v/st=161.57]                         
Epoch #11:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #10: test_reward: -0.030054 ± 0.000000, best_reward: -0.030054 ± 0.000000 in #10


Epoch #11: 500it [00:31, 15.87it/s, env_step=10978, len=499, loss/actor=-0.492415, loss/critic1=0.006899, loss/critic2=0.006998, n/ep=2, n/st=998, rew=-0.03, v/ep=0.32, v/st=160.84]                         
Epoch #12:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #11: test_reward: -0.029304 ± 0.000000, best_reward: -0.029304 ± 0.000000 in #11


Epoch #12: 500it [00:31, 15.86it/s, env_step=11976, len=499, loss/actor=-0.462911, loss/critic1=0.015160, loss/critic2=0.010416, n/ep=2, n/st=998, rew=-0.03, v/ep=0.32, v/st=159.97]                         
Epoch #13:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #12: test_reward: -0.024714 ± 0.000000, best_reward: -0.024714 ± 0.000000 in #12


Epoch #13: 500it [00:31, 16.00it/s, env_step=12974, len=499, loss/actor=-0.510774, loss/critic1=0.032960, loss/critic2=0.020230, n/ep=2, n/st=998, rew=-0.02, v/ep=0.33, v/st=166.21]                         
Epoch #14:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #13: test_reward: -0.025791 ± 0.000000, best_reward: -0.024714 ± 0.000000 in #12


Epoch #14: 500it [00:31, 15.98it/s, env_step=13972, len=499, loss/actor=-0.500411, loss/critic1=0.019060, loss/critic2=0.019045, n/ep=2, n/st=998, rew=-0.03, v/ep=0.33, v/st=165.48]                         
Epoch #15:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #14: test_reward: -0.028609 ± 0.000000, best_reward: -0.024714 ± 0.000000 in #12


Epoch #15: 500it [00:31, 16.01it/s, env_step=14970, len=499, loss/actor=-0.526824, loss/critic1=0.056588, loss/critic2=0.020147, n/ep=2, n/st=998, rew=-0.03, v/ep=0.33, v/st=165.59]                         
Epoch #16:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #15: test_reward: -0.030261 ± 0.000000, best_reward: -0.024714 ± 0.000000 in #12


Epoch #16: 500it [00:31, 16.00it/s, env_step=15968, len=499, loss/actor=-0.621937, loss/critic1=0.019864, loss/critic2=0.021427, n/ep=2, n/st=998, rew=-0.03, v/ep=0.33, v/st=165.80]                         
Epoch #17:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #16: test_reward: -0.032767 ± 0.000000, best_reward: -0.024714 ± 0.000000 in #12


Epoch #17: 500it [00:31, 15.99it/s, env_step=16966, len=499, loss/actor=-0.483102, loss/critic1=0.030204, loss/critic2=0.088933, n/ep=2, n/st=998, rew=-0.03, v/ep=0.33, v/st=165.98]                         
Epoch #18:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #17: test_reward: -0.032895 ± 0.000000, best_reward: -0.024714 ± 0.000000 in #12


Epoch #18: 500it [00:31, 15.98it/s, env_step=17964, len=499, loss/actor=-0.394851, loss/critic1=0.036553, loss/critic2=0.046492, n/ep=2, n/st=998, rew=-0.03, v/ep=0.33, v/st=165.84]                         
Epoch #19:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #18: test_reward: -0.035359 ± 0.000000, best_reward: -0.024714 ± 0.000000 in #12


Epoch #19: 500it [00:31, 15.90it/s, env_step=18962, len=499, loss/actor=-0.468322, loss/critic1=0.023306, loss/critic2=0.026111, n/ep=2, n/st=998, rew=-0.04, v/ep=0.32, v/st=160.95]                         
Epoch #20:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #19: test_reward: -0.037397 ± 0.000000, best_reward: -0.024714 ± 0.000000 in #12


Epoch #20: 500it [00:31, 15.98it/s, env_step=19960, len=499, loss/actor=-0.481417, loss/critic1=0.038642, loss/critic2=0.044475, n/ep=2, n/st=998, rew=-0.04, v/ep=0.33, v/st=163.75]                         
Epoch #21:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #20: test_reward: -0.043690 ± 0.000000, best_reward: -0.024714 ± 0.000000 in #12


Epoch #21: 500it [00:31, 15.98it/s, env_step=20958, len=499, loss/actor=-0.377054, loss/critic1=0.061318, loss/critic2=0.038936, n/ep=2, n/st=998, rew=-0.04, v/ep=0.33, v/st=166.89]                         
Epoch #22:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #21: test_reward: -0.040046 ± 0.000000, best_reward: -0.024714 ± 0.000000 in #12


Epoch #22: 500it [00:31, 16.04it/s, env_step=21956, len=499, loss/actor=-0.534506, loss/critic1=0.034281, loss/critic2=0.034406, n/ep=2, n/st=998, rew=-0.04, v/ep=0.34, v/st=168.27]                         
Epoch #23:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #22: test_reward: -0.041237 ± 0.000000, best_reward: -0.024714 ± 0.000000 in #12


Epoch #23: 500it [00:31, 15.98it/s, env_step=22954, len=499, loss/actor=-0.429760, loss/critic1=0.069779, loss/critic2=0.078621, n/ep=2, n/st=998, rew=-0.04, v/ep=0.33, v/st=165.38]                         
Epoch #24:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #23: test_reward: -0.036368 ± 0.000000, best_reward: -0.024714 ± 0.000000 in #12


Epoch #24: 500it [00:31, 16.00it/s, env_step=23952, len=499, loss/actor=-0.375796, loss/critic1=0.060198, loss/critic2=0.045705, n/ep=2, n/st=998, rew=-0.04, v/ep=0.33, v/st=166.29]                         
Epoch #25:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #24: test_reward: -0.031343 ± 0.000000, best_reward: -0.024714 ± 0.000000 in #12


Epoch #25: 500it [00:31, 16.05it/s, env_step=24950, len=499, loss/actor=-0.446282, loss/critic1=0.086104, loss/critic2=0.053187, n/ep=2, n/st=998, rew=-0.03, v/ep=0.34, v/st=167.95]                         
Epoch #26:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #25: test_reward: -0.027228 ± 0.000000, best_reward: -0.024714 ± 0.000000 in #12


Epoch #26: 500it [00:31, 16.06it/s, env_step=25948, len=499, loss/actor=-0.584010, loss/critic1=0.052122, loss/critic2=0.050325, n/ep=2, n/st=998, rew=-0.03, v/ep=0.34, v/st=169.67]                         
Epoch #27:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #26: test_reward: -0.025952 ± 0.000000, best_reward: -0.024714 ± 0.000000 in #12


Epoch #27: 500it [00:31, 16.05it/s, env_step=26946, len=499, loss/actor=-0.402688, loss/critic1=0.067775, loss/critic2=0.069221, n/ep=2, n/st=998, rew=-0.03, v/ep=0.34, v/st=170.10]                         
Epoch #28:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #27: test_reward: -0.024353 ± 0.000000, best_reward: -0.024353 ± 0.000000 in #27


Epoch #28: 500it [00:31, 16.02it/s, env_step=27944, len=499, loss/actor=-0.267946, loss/critic1=0.063548, loss/critic2=0.069296, n/ep=2, n/st=998, rew=-0.02, v/ep=0.34, v/st=168.26]                         
Epoch #29:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #28: test_reward: -0.022796 ± 0.000000, best_reward: -0.022796 ± 0.000000 in #28


Epoch #29: 500it [00:31, 16.06it/s, env_step=28942, len=499, loss/actor=-0.627075, loss/critic1=0.059310, loss/critic2=0.060191, n/ep=2, n/st=998, rew=-0.02, v/ep=0.34, v/st=170.88]                         
Epoch #30:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #29: test_reward: -0.021358 ± 0.000000, best_reward: -0.021358 ± 0.000000 in #29


Epoch #30: 500it [00:31, 16.10it/s, env_step=29940, len=499, loss/actor=-0.639575, loss/critic1=0.070597, loss/critic2=0.070623, n/ep=2, n/st=998, rew=-0.02, v/ep=0.34, v/st=170.38]                         
Epoch #31:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #30: test_reward: -0.020400 ± 0.000000, best_reward: -0.020400 ± 0.000000 in #30


Epoch #31: 500it [00:31, 16.07it/s, env_step=30938, len=499, loss/actor=-0.662914, loss/critic1=0.064313, loss/critic2=0.059997, n/ep=2, n/st=998, rew=-0.02, v/ep=0.34, v/st=169.90]                         
Epoch #32:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #31: test_reward: -0.019138 ± 0.000000, best_reward: -0.019138 ± 0.000000 in #31


Epoch #32: 500it [00:31, 16.09it/s, env_step=31936, len=499, loss/actor=-0.589951, loss/critic1=0.072795, loss/critic2=0.068830, n/ep=2, n/st=998, rew=-0.02, v/ep=0.34, v/st=170.11]                         
Epoch #33:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #32: test_reward: -0.018770 ± 0.000000, best_reward: -0.018770 ± 0.000000 in #32


Epoch #33: 500it [00:31, 16.06it/s, env_step=32934, len=499, loss/actor=-0.604604, loss/critic1=0.058737, loss/critic2=0.061849, n/ep=2, n/st=998, rew=-0.02, v/ep=0.34, v/st=170.66]                         
Epoch #34:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #33: test_reward: -0.016963 ± 0.000000, best_reward: -0.016963 ± 0.000000 in #33


Epoch #34: 500it [00:31, 16.02it/s, env_step=33932, len=499, loss/actor=-0.480831, loss/critic1=0.067464, loss/critic2=0.073987, n/ep=2, n/st=998, rew=-0.02, v/ep=0.33, v/st=167.02]                         
Epoch #35:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #34: test_reward: -0.016121 ± 0.000000, best_reward: -0.016121 ± 0.000000 in #34


Epoch #35: 500it [00:31, 16.07it/s, env_step=34930, len=499, loss/actor=-0.581225, loss/critic1=0.100201, loss/critic2=0.093458, n/ep=2, n/st=998, rew=-0.02, v/ep=0.34, v/st=168.69]                         
Epoch #36:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #35: test_reward: -0.016104 ± 0.000000, best_reward: -0.016104 ± 0.000000 in #35


Epoch #36: 500it [00:31, 16.04it/s, env_step=35928, len=499, loss/actor=-0.708248, loss/critic1=0.081576, loss/critic2=0.089730, n/ep=2, n/st=998, rew=-0.02, v/ep=0.34, v/st=170.19]                         
Epoch #37:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #36: test_reward: -0.016782 ± 0.000000, best_reward: -0.016104 ± 0.000000 in #35


Epoch #37: 500it [00:31, 16.07it/s, env_step=36926, len=499, loss/actor=-0.621277, loss/critic1=0.074873, loss/critic2=0.095040, n/ep=2, n/st=998, rew=-0.02, v/ep=0.34, v/st=170.12]                         
Epoch #38:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #37: test_reward: -0.016940 ± 0.000000, best_reward: -0.016104 ± 0.000000 in #35


Epoch #38: 500it [00:31, 16.03it/s, env_step=37924, len=499, loss/actor=-0.884892, loss/critic1=0.072329, loss/critic2=0.111590, n/ep=2, n/st=998, rew=-0.02, v/ep=0.34, v/st=169.97]                         
Epoch #39:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #38: test_reward: -0.016758 ± 0.000000, best_reward: -0.016104 ± 0.000000 in #35


Epoch #39: 500it [00:31, 16.08it/s, env_step=38922, len=499, loss/actor=-0.810772, loss/critic1=0.096546, loss/critic2=0.097327, n/ep=2, n/st=998, rew=-0.02, v/ep=0.34, v/st=171.35]                         
Epoch #40:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #39: test_reward: -0.016631 ± 0.000000, best_reward: -0.016104 ± 0.000000 in #35


Epoch #40: 500it [00:31, 15.99it/s, env_step=39920, len=499, loss/actor=-1.130542, loss/critic1=0.096329, loss/critic2=0.126593, n/ep=2, n/st=998, rew=-0.02, v/ep=0.33, v/st=165.48]                         
Epoch #41:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #40: test_reward: -0.016851 ± 0.000000, best_reward: -0.016104 ± 0.000000 in #35


Epoch #41: 500it [00:31, 15.98it/s, env_step=40918, len=499, loss/actor=-0.999636, loss/critic1=0.118851, loss/critic2=0.115903, n/ep=2, n/st=998, rew=-0.02, v/ep=0.33, v/st=166.16]                         
Epoch #42:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #41: test_reward: -0.016906 ± 0.000000, best_reward: -0.016104 ± 0.000000 in #35


Epoch #42: 500it [00:31, 15.95it/s, env_step=41916, len=499, loss/actor=-1.000379, loss/critic1=0.075484, loss/critic2=0.088067, n/ep=2, n/st=998, rew=-0.02, v/ep=0.33, v/st=166.88]                         
Epoch #43:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #42: test_reward: -0.017137 ± 0.000000, best_reward: -0.016104 ± 0.000000 in #35


Epoch #43: 500it [00:31, 15.94it/s, env_step=42914, len=499, loss/actor=-1.260322, loss/critic1=0.095813, loss/critic2=0.102007, n/ep=2, n/st=998, rew=-0.02, v/ep=0.33, v/st=164.48]                         
Epoch #44:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #43: test_reward: -0.016307 ± 0.000000, best_reward: -0.016104 ± 0.000000 in #35


Epoch #44: 500it [00:31, 16.07it/s, env_step=43912, len=499, loss/actor=-1.131633, loss/critic1=0.125257, loss/critic2=0.120295, n/ep=2, n/st=998, rew=-0.02, v/ep=0.34, v/st=170.02]                         
Epoch #45:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #44: test_reward: -0.016134 ± 0.000000, best_reward: -0.016104 ± 0.000000 in #35


Epoch #45: 500it [00:31, 16.02it/s, env_step=44910, len=499, loss/actor=-0.897106, loss/critic1=0.095607, loss/critic2=0.096458, n/ep=2, n/st=998, rew=-0.02, v/ep=0.34, v/st=168.67]                         
Epoch #46:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #45: test_reward: -0.017515 ± 0.000000, best_reward: -0.016104 ± 0.000000 in #35


Epoch #46: 500it [00:31, 15.98it/s, env_step=45908, len=499, loss/actor=-0.866447, loss/critic1=0.146163, loss/critic2=0.134884, n/ep=2, n/st=998, rew=-0.02, v/ep=0.33, v/st=166.45]                         
Epoch #47:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #46: test_reward: -0.017190 ± 0.000000, best_reward: -0.016104 ± 0.000000 in #35


Epoch #47: 500it [00:31, 15.99it/s, env_step=46906, len=499, loss/actor=-0.807570, loss/critic1=0.163035, loss/critic2=0.162596, n/ep=2, n/st=998, rew=-0.02, v/ep=0.33, v/st=167.12]                         
Epoch #48:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #47: test_reward: -0.017389 ± 0.000000, best_reward: -0.016104 ± 0.000000 in #35


Epoch #48: 500it [00:31, 15.99it/s, env_step=47904, len=499, loss/actor=-0.998103, loss/critic1=0.134446, loss/critic2=0.130056, n/ep=2, n/st=998, rew=-0.02, v/ep=0.33, v/st=167.16]                         
Epoch #49:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #48: test_reward: -0.017132 ± 0.000000, best_reward: -0.016104 ± 0.000000 in #35


Epoch #49: 500it [00:31, 15.94it/s, env_step=48902, len=499, loss/actor=-0.761924, loss/critic1=0.153445, loss/critic2=0.131725, n/ep=2, n/st=998, rew=-0.02, v/ep=0.33, v/st=165.38]                         
Epoch #50:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #49: test_reward: -0.018502 ± 0.000000, best_reward: -0.016104 ± 0.000000 in #35


Epoch #50: 500it [00:31, 15.93it/s, env_step=49900, len=499, loss/actor=-0.619672, loss/critic1=0.147713, loss/critic2=0.137555, n/ep=2, n/st=998, rew=-0.02, v/ep=0.33, v/st=162.39]                         
Epoch #51:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #50: test_reward: -0.020606 ± 0.000000, best_reward: -0.016104 ± 0.000000 in #35


Epoch #51: 500it [00:31, 15.90it/s, env_step=50898, len=499, loss/actor=-0.638134, loss/critic1=0.162925, loss/critic2=0.191500, n/ep=2, n/st=998, rew=-0.02, v/ep=0.32, v/st=161.95]                         
Epoch #52:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #51: test_reward: -0.020087 ± 0.000000, best_reward: -0.016104 ± 0.000000 in #35


Epoch #52: 500it [00:31, 15.93it/s, env_step=51896, len=499, loss/actor=-0.415417, loss/critic1=0.133458, loss/critic2=0.127756, n/ep=2, n/st=998, rew=-0.02, v/ep=0.33, v/st=163.58]                         
Epoch #53:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #52: test_reward: -0.017712 ± 0.000000, best_reward: -0.016104 ± 0.000000 in #35


Epoch #53: 500it [00:31, 15.89it/s, env_step=52894, len=499, loss/actor=-0.217052, loss/critic1=0.159227, loss/critic2=0.155172, n/ep=2, n/st=998, rew=-0.02, v/ep=0.33, v/st=163.64]                         
Epoch #54:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #53: test_reward: -0.016210 ± 0.000000, best_reward: -0.016104 ± 0.000000 in #35


Epoch #54: 500it [00:31, 15.95it/s, env_step=53892, len=499, loss/actor=0.013910, loss/critic1=0.172546, loss/critic2=0.143311, n/ep=2, n/st=998, rew=-0.02, v/ep=0.33, v/st=164.44]                          
Epoch #55:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #54: test_reward: -0.015813 ± 0.000000, best_reward: -0.015813 ± 0.000000 in #54


Epoch #55: 500it [00:31, 16.01it/s, env_step=54890, len=499, loss/actor=-0.173229, loss/critic1=0.195589, loss/critic2=0.177469, n/ep=2, n/st=998, rew=-0.02, v/ep=0.33, v/st=166.55]                         
Epoch #56:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #55: test_reward: -0.014946 ± 0.000000, best_reward: -0.014946 ± 0.000000 in #55


Epoch #56: 500it [00:31, 15.98it/s, env_step=55888, len=499, loss/actor=-0.383734, loss/critic1=0.216212, loss/critic2=0.235020, n/ep=2, n/st=998, rew=-0.02, v/ep=0.33, v/st=165.76]                         
Epoch #57:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #56: test_reward: -0.015388 ± 0.000000, best_reward: -0.014946 ± 0.000000 in #55


Epoch #57: 500it [00:31, 15.81it/s, env_step=56886, len=499, loss/actor=0.082549, loss/critic1=0.227139, loss/critic2=0.203271, n/ep=2, n/st=998, rew=-0.02, v/ep=0.33, v/st=166.71]                          
Epoch #58:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #57: test_reward: -0.015391 ± 0.000000, best_reward: -0.014946 ± 0.000000 in #55


Epoch #58: 500it [00:31, 16.00it/s, env_step=57884, len=499, loss/actor=-0.189052, loss/critic1=0.317611, loss/critic2=0.236140, n/ep=2, n/st=998, rew=-0.02, v/ep=0.33, v/st=167.08]                         
Epoch #59:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #58: test_reward: -0.015269 ± 0.000000, best_reward: -0.014946 ± 0.000000 in #55


Epoch #59: 500it [00:31, 16.00it/s, env_step=58882, len=499, loss/actor=-0.479744, loss/critic1=0.190270, loss/critic2=0.180963, n/ep=2, n/st=998, rew=-0.01, v/ep=0.34, v/st=167.41]                         
Epoch #60:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #59: test_reward: -0.015200 ± 0.000000, best_reward: -0.014946 ± 0.000000 in #55


Epoch #60: 500it [00:31, 16.01it/s, env_step=59880, len=499, loss/actor=-0.911451, loss/critic1=0.196787, loss/critic2=0.212976, n/ep=2, n/st=998, rew=-0.02, v/ep=0.33, v/st=167.00]                         
Epoch #61:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #60: test_reward: -0.014598 ± 0.000000, best_reward: -0.014598 ± 0.000000 in #60


Epoch #61: 500it [00:31, 16.04it/s, env_step=60878, len=499, loss/actor=-0.518148, loss/critic1=0.137666, loss/critic2=0.180793, n/ep=2, n/st=998, rew=-0.01, v/ep=0.34, v/st=169.73]                         
Epoch #62:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #61: test_reward: -0.015318 ± 0.000000, best_reward: -0.014598 ± 0.000000 in #60


Epoch #62: 500it [00:31, 15.95it/s, env_step=61876, len=499, loss/actor=-1.058979, loss/critic1=0.156956, loss/critic2=0.199887, n/ep=2, n/st=998, rew=-0.02, v/ep=0.33, v/st=164.93]                         
Epoch #63:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #62: test_reward: -0.015279 ± 0.000000, best_reward: -0.014598 ± 0.000000 in #60


Epoch #63: 500it [00:31, 15.94it/s, env_step=62874, len=499, loss/actor=-0.448369, loss/critic1=0.167204, loss/critic2=0.193395, n/ep=2, n/st=998, rew=-0.02, v/ep=0.33, v/st=164.34]                         
Epoch #64:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #63: test_reward: -0.016019 ± 0.000000, best_reward: -0.014598 ± 0.000000 in #60


Epoch #64: 500it [00:31, 15.94it/s, env_step=63872, len=499, loss/actor=-1.320117, loss/critic1=0.186311, loss/critic2=0.184243, n/ep=2, n/st=998, rew=-0.02, v/ep=0.33, v/st=164.16]                         
Epoch #65:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #64: test_reward: -0.016864 ± 0.000000, best_reward: -0.014598 ± 0.000000 in #60


Epoch #65: 500it [00:31, 15.96it/s, env_step=64870, len=499, loss/actor=-1.145816, loss/critic1=0.227179, loss/critic2=0.225754, n/ep=2, n/st=998, rew=-0.02, v/ep=0.33, v/st=163.85]                         
Epoch #66:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #65: test_reward: -0.016699 ± 0.000000, best_reward: -0.014598 ± 0.000000 in #60


Epoch #66: 500it [00:32, 15.47it/s, env_step=65868, len=499, loss/actor=-1.077983, loss/critic1=0.276546, loss/critic2=0.364836, n/ep=2, n/st=998, rew=-0.02, v/ep=0.34, v/st=168.92]                         
Epoch #67:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #66: test_reward: -0.016382 ± 0.000000, best_reward: -0.014598 ± 0.000000 in #60


Epoch #67: 500it [00:38, 12.89it/s, env_step=66866, len=499, loss/actor=-1.180386, loss/critic1=0.234801, loss/critic2=0.308236, n/ep=2, n/st=998, rew=-0.02, v/ep=0.19, v/st=94.52]                         
Epoch #68:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #67: test_reward: -0.016068 ± 0.000000, best_reward: -0.014598 ± 0.000000 in #60


Epoch #68: 500it [00:33, 14.76it/s, env_step=67864, len=499, loss/actor=-1.030148, loss/critic1=0.242382, loss/critic2=0.254361, n/ep=2, n/st=998, rew=-0.02, v/ep=0.33, v/st=164.95]                         
Epoch #69:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #68: test_reward: -0.015219 ± 0.000000, best_reward: -0.014598 ± 0.000000 in #60


Epoch #69: 500it [00:37, 13.27it/s, env_step=68862, len=499, loss/actor=-0.738607, loss/critic1=0.297289, loss/critic2=0.324159, n/ep=2, n/st=998, rew=-0.02, v/ep=0.19, v/st=94.44]                         
Epoch #70:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #69: test_reward: -0.015120 ± 0.000000, best_reward: -0.014598 ± 0.000000 in #60


Epoch #70: 500it [00:31, 15.91it/s, env_step=69860, len=499, loss/actor=-0.789814, loss/critic1=0.260475, loss/critic2=0.260871, n/ep=2, n/st=998, rew=-0.02, v/ep=0.33, v/st=162.44]                         
Epoch #71:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #70: test_reward: -0.015153 ± 0.000000, best_reward: -0.014598 ± 0.000000 in #60


Epoch #71: 500it [00:31, 15.92it/s, env_step=70858, len=499, loss/actor=-0.821940, loss/critic1=0.265607, loss/critic2=0.250218, n/ep=2, n/st=998, rew=-0.02, v/ep=0.33, v/st=163.35]                         
Epoch #72:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #71: test_reward: -0.014880 ± 0.000000, best_reward: -0.014598 ± 0.000000 in #60


Epoch #72: 500it [00:34, 14.70it/s, env_step=71856, len=499, loss/actor=-1.172746, loss/critic1=0.205058, loss/critic2=0.192140, n/ep=2, n/st=998, rew=-0.02, v/ep=0.33, v/st=162.74]                         
Epoch #73:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #72: test_reward: -0.015625 ± 0.000000, best_reward: -0.014598 ± 0.000000 in #60


Epoch #73: 500it [00:37, 13.25it/s, env_step=72854, len=499, loss/actor=-1.201870, loss/critic1=0.302965, loss/critic2=0.289139, n/ep=2, n/st=998, rew=-0.02, v/ep=0.19, v/st=93.75]                         
Epoch #74:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #73: test_reward: -0.014803 ± 0.000000, best_reward: -0.014598 ± 0.000000 in #60


Epoch #74: 500it [00:31, 15.92it/s, env_step=73852, len=499, loss/actor=-1.662886, loss/critic1=0.245337, loss/critic2=0.224895, n/ep=2, n/st=998, rew=-0.01, v/ep=0.33, v/st=163.42]                         
Epoch #75:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #74: test_reward: -0.014140 ± 0.000000, best_reward: -0.014140 ± 0.000000 in #74


Epoch #75: 500it [00:33, 15.11it/s, env_step=74850, len=499, loss/actor=-1.203461, loss/critic1=0.256071, loss/critic2=0.239267, n/ep=2, n/st=998, rew=-0.01, v/ep=0.33, v/st=163.30]                         
Epoch #76:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #75: test_reward: -0.013144 ± 0.000000, best_reward: -0.013144 ± 0.000000 in #75


Epoch #76: 500it [00:35, 13.90it/s, env_step=75848, len=499, loss/actor=-1.237971, loss/critic1=0.244746, loss/critic2=0.225662, n/ep=2, n/st=998, rew=-0.01, v/ep=0.19, v/st=93.85]                         
Epoch #77:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #76: test_reward: -0.010869 ± 0.000000, best_reward: -0.010869 ± 0.000000 in #76


Epoch #77: 500it [00:31, 15.91it/s, env_step=76846, len=499, loss/actor=-1.790272, loss/critic1=0.262149, loss/critic2=0.239071, n/ep=2, n/st=998, rew=-0.01, v/ep=0.32, v/st=161.89]                         
Epoch #78:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #77: test_reward: -0.010202 ± 0.000000, best_reward: -0.010202 ± 0.000000 in #77


Epoch #78: 500it [00:31, 16.04it/s, env_step=77844, len=499, loss/actor=-1.429572, loss/critic1=0.258055, loss/critic2=0.203298, n/ep=2, n/st=998, rew=-0.01, v/ep=0.34, v/st=168.56]                         
Epoch #79:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #78: test_reward: -0.010009 ± 0.000000, best_reward: -0.010009 ± 0.000000 in #78


Epoch #79: 500it [00:32, 15.27it/s, env_step=78842, len=499, loss/actor=-1.513995, loss/critic1=0.249623, loss/critic2=0.223798, n/ep=2, n/st=998, rew=-0.01, v/ep=0.33, v/st=166.91]                         
Epoch #80:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #79: test_reward: -0.010892 ± 0.000000, best_reward: -0.010009 ± 0.000000 in #78


Epoch #80: 500it [00:38, 12.98it/s, env_step=79840, len=499, loss/actor=-1.158707, loss/critic1=0.234318, loss/critic2=0.190834, n/ep=2, n/st=998, rew=-0.01, v/ep=0.19, v/st=95.04]                         
Epoch #81:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #80: test_reward: -0.010429 ± 0.000000, best_reward: -0.010009 ± 0.000000 in #78


Epoch #81: 500it [00:36, 13.64it/s, env_step=80838, len=499, loss/actor=-1.161834, loss/critic1=0.272483, loss/critic2=0.317208, n/ep=2, n/st=998, rew=-0.01, v/ep=0.19, v/st=94.31]                         
Epoch #82:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #81: test_reward: -0.010715 ± 0.000000, best_reward: -0.010009 ± 0.000000 in #78


Epoch #82: 500it [00:31, 15.98it/s, env_step=81836, len=499, loss/actor=-0.987569, loss/critic1=0.237641, loss/critic2=0.225861, n/ep=2, n/st=998, rew=-0.01, v/ep=0.34, v/st=167.43]                         
Epoch #83:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #82: test_reward: -0.011318 ± 0.000000, best_reward: -0.010009 ± 0.000000 in #78


Epoch #83: 500it [00:31, 15.99it/s, env_step=82834, len=499, loss/actor=-1.429172, loss/critic1=0.259348, loss/critic2=0.330440, n/ep=2, n/st=998, rew=-0.01, v/ep=0.33, v/st=166.25]                         
Epoch #84:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #83: test_reward: -0.010426 ± 0.000000, best_reward: -0.010009 ± 0.000000 in #78


Epoch #84: 500it [00:31, 15.97it/s, env_step=83832, len=499, loss/actor=-0.709398, loss/critic1=0.282268, loss/critic2=0.300735, n/ep=2, n/st=998, rew=-0.01, v/ep=0.33, v/st=165.69]                         
Epoch #85:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #84: test_reward: -0.011415 ± 0.000000, best_reward: -0.010009 ± 0.000000 in #78


Epoch #85: 500it [00:31, 15.98it/s, env_step=84830, len=499, loss/actor=-1.005684, loss/critic1=0.281056, loss/critic2=0.266120, n/ep=2, n/st=998, rew=-0.01, v/ep=0.33, v/st=165.31]                         
Epoch #86:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #85: test_reward: -0.010847 ± 0.000000, best_reward: -0.010009 ± 0.000000 in #78


Epoch #86: 500it [00:31, 15.91it/s, env_step=85828, len=499, loss/actor=-0.929738, loss/critic1=0.305226, loss/critic2=0.260069, n/ep=2, n/st=998, rew=-0.01, v/ep=0.33, v/st=164.81]                         
Epoch #87:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #86: test_reward: -0.010105 ± 0.000000, best_reward: -0.010009 ± 0.000000 in #78


Epoch #87: 500it [00:31, 15.89it/s, env_step=86826, len=499, loss/actor=-0.753291, loss/critic1=0.302677, loss/critic2=0.440805, n/ep=2, n/st=998, rew=-0.01, v/ep=0.33, v/st=163.86]                         
Epoch #88:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #87: test_reward: -0.010432 ± 0.000000, best_reward: -0.010009 ± 0.000000 in #78


Epoch #88: 500it [00:31, 15.95it/s, env_step=87824, len=499, loss/actor=-0.796024, loss/critic1=0.370298, loss/critic2=0.509255, n/ep=2, n/st=998, rew=-0.01, v/ep=0.33, v/st=165.04]                         
Epoch #89:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #88: test_reward: -0.010527 ± 0.000000, best_reward: -0.010009 ± 0.000000 in #78


Epoch #89: 500it [00:31, 15.96it/s, env_step=88822, len=499, loss/actor=-1.193232, loss/critic1=0.291014, loss/critic2=0.386739, n/ep=2, n/st=998, rew=-0.01, v/ep=0.33, v/st=165.24]                         
Epoch #90:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #89: test_reward: -0.010595 ± 0.000000, best_reward: -0.010009 ± 0.000000 in #78


Epoch #90: 500it [00:31, 15.92it/s, env_step=89820, len=499, loss/actor=-0.546546, loss/critic1=0.291215, loss/critic2=0.345743, n/ep=2, n/st=998, rew=-0.01, v/ep=0.33, v/st=162.73]                         
Epoch #91:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #90: test_reward: -0.010587 ± 0.000000, best_reward: -0.010009 ± 0.000000 in #78


Epoch #91: 500it [00:33, 14.80it/s, env_step=90818, len=499, loss/actor=-0.754443, loss/critic1=0.368899, loss/critic2=0.462581, n/ep=2, n/st=998, rew=-0.01, v/ep=0.33, v/st=162.87]                         
Epoch #92:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #91: test_reward: -0.012053 ± 0.000000, best_reward: -0.010009 ± 0.000000 in #78


Epoch #92: 500it [00:37, 13.24it/s, env_step=91816, len=499, loss/actor=-0.722450, loss/critic1=0.326318, loss/critic2=0.430169, n/ep=2, n/st=998, rew=-0.01, v/ep=0.18, v/st=91.61]                         
Epoch #93:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #92: test_reward: -0.013542 ± 0.000000, best_reward: -0.010009 ± 0.000000 in #78


Epoch #93: 500it [00:33, 15.12it/s, env_step=92814, len=499, loss/actor=-1.098051, loss/critic1=0.418341, loss/critic2=0.533243, n/ep=2, n/st=998, rew=-0.01, v/ep=0.31, v/st=157.13]                         
Epoch #94:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #93: test_reward: -0.012690 ± 0.000000, best_reward: -0.010009 ± 0.000000 in #78


Epoch #94: 500it [00:38, 12.88it/s, env_step=93812, len=499, loss/actor=-0.729689, loss/critic1=0.499788, loss/critic2=0.708758, n/ep=2, n/st=998, rew=-0.01, v/ep=0.18, v/st=90.92]                         
Epoch #95:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #94: test_reward: -0.011997 ± 0.000000, best_reward: -0.010009 ± 0.000000 in #78


Epoch #95: 500it [00:31, 15.86it/s, env_step=94810, len=499, loss/actor=-0.405377, loss/critic1=0.354375, loss/critic2=0.620257, n/ep=2, n/st=998, rew=-0.01, v/ep=0.32, v/st=160.47]                         
Epoch #96:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #95: test_reward: -0.011282 ± 0.000000, best_reward: -0.010009 ± 0.000000 in #78


Epoch #96: 500it [00:31, 15.86it/s, env_step=95808, len=499, loss/actor=-0.774895, loss/critic1=0.422788, loss/critic2=0.595026, n/ep=2, n/st=998, rew=-0.01, v/ep=0.32, v/st=159.76]                         
Epoch #97:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #96: test_reward: -0.011125 ± 0.000000, best_reward: -0.010009 ± 0.000000 in #78


Epoch #97: 500it [00:31, 15.80it/s, env_step=96806, len=499, loss/actor=-0.181300, loss/critic1=0.707088, loss/critic2=0.698642, n/ep=2, n/st=998, rew=-0.01, v/ep=0.32, v/st=157.58]                         
Epoch #98:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #97: test_reward: -0.010974 ± 0.000000, best_reward: -0.010009 ± 0.000000 in #78


Epoch #98: 500it [00:31, 15.73it/s, env_step=97804, len=499, loss/actor=-0.215555, loss/critic1=0.783396, loss/critic2=0.736205, n/ep=2, n/st=998, rew=-0.01, v/ep=0.31, v/st=156.22]                         
Epoch #99:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #98: test_reward: -0.011363 ± 0.000000, best_reward: -0.010009 ± 0.000000 in #78


Epoch #99: 500it [00:32, 15.41it/s, env_step=98802, len=499, loss/actor=-0.009768, loss/critic1=0.818182, loss/critic2=0.748141, n/ep=2, n/st=998, rew=-0.01, v/ep=0.32, v/st=157.85]                         
Epoch #100:   0%|          | 0/499 [00:00<?, ?it/s]

Epoch #99: test_reward: -0.010926 ± 0.000000, best_reward: -0.010009 ± 0.000000 in #78


Epoch #100: 500it [00:39, 12.77it/s, env_step=99800, len=499, loss/actor=-1.335035, loss/critic1=0.559528, loss/critic2=0.500978, n/ep=2, n/st=998, rew=-0.01, v/ep=0.18, v/st=91.55]                         


Epoch #100: test_reward: -0.014636 ± 0.000000, best_reward: -0.010009 ± 0.000000 in #78
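
The per-epoch evaluation lines in the log above share a fixed shape (`Epoch #N: test_reward: x ± s, best_reward: y ± s in #M`), so the test-reward curve can be recovered from the printed text. The following is a minimal sketch, assuming the log is available as a plain string; `TEST_LINE`, `parse_test_rewards` and `sample_log` are hypothetical names introduced here for illustration and are not part of the notebook or of tianshou. The sample lines are copied from the output above.

```python
import re

# Matches the evaluation lines printed after each epoch, e.g.
# "Epoch #78: test_reward: -0.010009 ± 0.000000, best_reward: -0.010009 ± 0.000000 in #78"
# The tqdm progress lines ("Epoch #78: 500it [...]") do not match and are skipped.
TEST_LINE = re.compile(
    r"Epoch #(\d+): test_reward: (-?\d+\.\d+) ± (\d+\.\d+), "
    r"best_reward: (-?\d+\.\d+) ± (\d+\.\d+) in #(\d+)"
)


def parse_test_rewards(log_text):
    """Return a dict mapping epoch number to its test reward (hypothetical helper)."""
    rewards = {}
    for match in TEST_LINE.finditer(log_text):
        epoch = int(match.group(1))
        rewards[epoch] = float(match.group(2))
    return rewards


# A few evaluation lines copied verbatim from the log above.
sample_log = """
Epoch #77: test_reward: -0.010202 ± 0.000000, best_reward: -0.010202 ± 0.000000 in #77
Epoch #78: test_reward: -0.010009 ± 0.000000, best_reward: -0.010009 ± 0.000000 in #78
Epoch #79: test_reward: -0.010892 ± 0.000000, best_reward: -0.010009 ± 0.000000 in #78
Epoch #100: test_reward: -0.014636 ± 0.000000, best_reward: -0.010009 ± 0.000000 in #78
"""

rewards = parse_test_rewards(sample_log)

# Rewards are negative, so the best epoch is the one with the largest value.
best_epoch = max(rewards, key=rewards.get)
last_epoch = max(rewards)

print(f"best epoch: #{best_epoch} ({rewards[best_epoch]:.6f})")
print(f"last epoch: #{last_epoch} ({rewards[last_epoch]:.6f})")
```

On these lines the best test reward is -0.010009 at epoch #78, while the final epoch #100 reaches only -0.014636, which is why comparing against the best epoch rather than the last one matters when reading this log.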
