# Cartpole tests with policy gradient

This notebook contains a simple test for each implemented policy gradient method. To check that they work properly, we rely on the [Cartpole](https://gym.openai.com/envs/CartPole-v0/) environment, which ships with OpenAI Gym. As stated in Gym's documentation, the problem is considered "solved" when the agent obtains a mean return of at least 195 over 100 consecutive episodes.
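
As a minimal sketch of the criterion above, the helper below keeps the most recent episode returns and reports whether their mean reaches the threshold. The names `SolvedTracker`, `window`, and `threshold` are hypothetical and not part of this repository. Requiring a full window before declaring success is also an assumption.

```python
from collections import deque


class SolvedTracker:
    """Track recent episode returns and check a "solved" threshold.

    Hypothetical helper for illustration; not part of `src.policies`.
    """

    def __init__(self, window=100, threshold=195.0):
        self.threshold = threshold
        # A bounded deque drops the oldest return once it is full.
        self.returns = deque(maxlen=window)

    def add(self, episode_return):
        self.returns.append(float(episode_return))

    def mean_return(self):
        # Mean over the episodes seen so far (at most `window` of them).
        if not self.returns:
            return 0.0
        return sum(self.returns) / len(self.returns)

    def solved(self):
        # Assumption: only declare success once the window is full.
        window_full = len(self.returns) == self.returns.maxlen
        return window_full and self.mean_return() >= self.threshold


tracker = SolvedTracker(window=100, threshold=195.0)
for _ in range(100):
    tracker.add(200.0)
print(tracker.solved())
```

A tracker like this could be fed with the return of each finished episode and polled after every epoch to decide whether training can stop early.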

## Prerequisites

The cells below install and import the libraries needed to run the notebook examples.

In [67]:
import sys
sys.path.append('../')

In [68]:
%%capture
!pip install -r ../init/requirements.txt

In [69]:
import numpy as np
import gym

from src import models, policies

%load_ext autoreload
%autoreload 2



## Utilities

The cells below define the environment, along with common variables used throughout the notebook.

In [70]:
env = gym.make('CartPole-v0')

In [74]:
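# Cartpole observations have 4 components and there are 2 discrete actions
observation_space_size = 4
action_space_size = 2
hidden_sizes = [32, 32]
epochs = 800
steps_per_epoch = 200
minibatch_size = 100
# Number of most recent episodes used for the running mean return
episodes_mean_return = 100
# Read the W&B API key with a context manager so the file handle is closed
with open("../wandb_api_key_file", "r") as api_key_file:
    wandb_api_key = api_key_file.read().strip()
wandb_config = {
    "api_key": wandb_api_key,
    "project": "cpr-appropriation",
    "entity": "wadaboa",
}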

## VPG

This section trains a Cartpole agent with our custom Vanilla Policy Gradient (VPG) implementation.
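
For orientation, vanilla policy gradient weights the log-probability of each action by an estimate of how good that action was, typically the reward-to-go minus a learned baseline. The sketch below computes those weights in plain Python for a single episode. It is an illustration under stated assumptions, not the code in `src.policies`: the function names, the discount factor `gamma`, and the example values are all hypothetical.

```python
def rewards_to_go(rewards, gamma=1.0):
    """Return the discounted sum of future rewards for every time step."""
    result = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the sum computed for the next one.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        result[t] = running
    return result


def advantages(returns, baseline_values):
    """Subtract the baseline prediction from each return."""
    return [g - b for g, b in zip(returns, baseline_values)]


def policy_loss(log_probs, advs):
    """Negative mean of log-probability times advantage."""
    return -sum(lp * a for lp, a in zip(log_probs, advs)) / len(log_probs)


# Cartpole yields a reward of 1 per step, so the undiscounted
# reward-to-go simply counts the remaining steps.
episode_rewards = [1.0, 1.0, 1.0, 1.0]
returns = rewards_to_go(episode_rewards)
print(returns)  # [4.0, 3.0, 2.0, 1.0]
```

A baseline that predicts returns well shrinks the advantages toward zero, which reduces the variance of the gradient estimate without changing its expectation.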

In [75]:
vpg_policy_nn = models.MLP(observation_space_size, hidden_sizes, action_space_size)
vpg_baseline_nn = models.MLP(observation_space_size, hidden_sizes, 1, log_softmax=False)
vpg_policy = policies.VPGPolicy(env, vpg_policy_nn, baseline_nn=vpg_baseline_nn)
vpg_policy.train(
    epochs,
    steps_per_epoch,
    minibatch_size,
    enable_wandb=True,
    wandb_config={**wandb_config, "group": "VPG"},
    episodes_mean_return=episodes_mean_return
)

[34m[1mwandb[0m: Currently logged in as: [33mwadaboa[0m (use `wandb login --relogin` to force relogin)
[34m[1mwandb[0m: wandb version 0.12.1 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade


2021-08-26 22:49:31.622 | INFO     | src.policies:train:116 - Epoch 1 / 800
2021-08-26 22:49:31.623 | INFO     | src.policies:collect_trajectories:213 - Episode 1
2021-08-26 22:49:31.652 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:31.654 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 39.0
2021-08-26 22:49:31.655 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 39.0
2021-08-26 22:49:31.656 | INFO     | src.policies:collect_trajectories:213 - Episode 2
2021-08-26 22:49:31.679 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:31.681 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 46.0
2021-08-26 22:49:31.682 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 42.5
2021-08-26 22:49:31.682 | INFO     | src.policies:collect_trajectories:213 - Episode 3
2021-08-26 22:49:31.692

2021-08-26 22:49:32.018 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:49:32.021 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.4283561110496521
2021-08-26 22:49:32.023 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.10767123848199844
2021-08-26 22:49:32.025 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.7013068795204163
2021-08-26 22:49:32.027 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.10767123848199844
2021-08-26 22:49:32.030 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4999993145465851
2021-08-26 22:49:32.034 | INFO     | src.policies:train:116 - Epoch 3 / 800
2021-08-26 22:49:32.036 | INFO     | src.policies:collect_trajectories:213 - Episode 15
2021-08-26 22:49:32.043 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22

2021-08-26 22:49:32.364 | INFO     | src.policies:collect_trajectories:213 - Episode 29
2021-08-26 22:49:32.389 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:32.391 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 52.0
2021-08-26 22:49:32.392 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 33.166666666666664
2021-08-26 22:49:32.393 | INFO     | src.policies:collect_trajectories:213 - Episode 30
2021-08-26 22:49:32.408 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:32.409 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 31.0
2021-08-26 22:49:32.410 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 32.857142857142854
2021-08-26 22:49:32.417 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:49:32.420 | INFO     | src.policies:minibatch_update:270 - Tota

2021-08-26 22:49:32.662 | INFO     | src.policies:collect_trajectories:213 - Episode 41
2021-08-26 22:49:32.677 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:32.678 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 34.0
2021-08-26 22:49:32.679 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 23.5
2021-08-26 22:49:32.680 | INFO     | src.policies:collect_trajectories:213 - Episode 42
2021-08-26 22:49:32.686 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:32.687 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 10.0
2021-08-26 22:49:32.688 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.8
2021-08-26 22:49:32.689 | INFO     | src.policies:collect_trajectories:213 - Episode 43
2021-08-26 22:49:32.768 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agen

2021-08-26 22:49:32.989 | INFO     | src.policies:collect_trajectories:213 - Episode 57
2021-08-26 22:49:33.000 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:33.002 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 18.0
2021-08-26 22:49:33.003 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 21.2
2021-08-26 22:49:33.010 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:49:33.013 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.39514970779418945
2021-08-26 22:49:33.015 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.10121826082468033
2021-08-26 22:49:33.017 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.7331262230873108
2021-08-26 22:49:33.020 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.10121826082468033
2021-08-26 22:

2021-08-26 22:49:33.365 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:33.366 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 57.0
2021-08-26 22:49:33.367 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 29.166666666666668
2021-08-26 22:49:33.368 | INFO     | src.policies:collect_trajectories:213 - Episode 70
2021-08-26 22:49:33.378 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:33.379 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 19.0
2021-08-26 22:49:33.380 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 27.714285714285715
2021-08-26 22:49:33.381 | INFO     | src.policies:collect_trajectories:213 - Episode 71
2021-08-26 22:49:33.399 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:33.400 | INFO     | src.policies:co

2021-08-26 22:49:33.743 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:33.744 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 31.0
2021-08-26 22:49:33.745 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 26.5
2021-08-26 22:49:33.746 | INFO     | src.policies:collect_trajectories:213 - Episode 82
2021-08-26 22:49:33.763 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:33.764 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 30.0
2021-08-26 22:49:33.765 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 27.666666666666668
2021-08-26 22:49:33.766 | INFO     | src.policies:collect_trajectories:213 - Episode 83
2021-08-26 22:49:33.786 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:33.787 | INFO     | src.policies:collect_trajecto

2021-08-26 22:49:34.135 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 93.0
2021-08-26 22:49:34.136 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 46.333333333333336
2021-08-26 22:49:34.137 | INFO     | src.policies:collect_trajectories:213 - Episode 94
2021-08-26 22:49:34.144 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:34.145 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 15.0
2021-08-26 22:49:34.146 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 38.5
2021-08-26 22:49:34.147 | INFO     | src.policies:collect_trajectories:213 - Episode 95
2021-08-26 22:49:34.174 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:34.176 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 51.0
2021-08-26 22:49:34.177 | INFO     | src.policies:collect_trajector

2021-08-26 22:49:34.476 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 30.0
2021-08-26 22:49:34.477 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 30.333333333333332
2021-08-26 22:49:34.477 | INFO     | src.policies:collect_trajectories:213 - Episode 106
2021-08-26 22:49:34.493 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:34.494 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 35.0
2021-08-26 22:49:34.495 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 31.5
2021-08-26 22:49:34.496 | INFO     | src.policies:collect_trajectories:213 - Episode 107
2021-08-26 22:49:34.519 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:34.520 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 58.0
2021-08-26 22:49:34.521 | INFO     | src.policies:collect_traject

2021-08-26 22:49:34.745 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 20.0
2021-08-26 22:49:34.746 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 29.5
2021-08-26 22:49:34.747 | INFO     | src.policies:collect_trajectories:213 - Episode 118
2021-08-26 22:49:34.759 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:34.761 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 21.0
2021-08-26 22:49:34.762 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 26.666666666666668
2021-08-26 22:49:34.763 | INFO     | src.policies:collect_trajectories:213 - Episode 119
2021-08-26 22:49:34.776 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:34.777 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 29.0
2021-08-26 22:49:34.778 | INFO     | src.policies:collect_traject

2021-08-26 22:49:35.266 | INFO     | src.policies:train:116 - Epoch 19 / 800
2021-08-26 22:49:35.267 | INFO     | src.policies:collect_trajectories:213 - Episode 128
2021-08-26 22:49:35.288 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:35.289 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 56.0
2021-08-26 22:49:35.290 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 56.0
2021-08-26 22:49:35.291 | INFO     | src.policies:collect_trajectories:213 - Episode 129
2021-08-26 22:49:35.327 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:35.328 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 95.0
2021-08-26 22:49:35.329 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 75.5
2021-08-26 22:49:35.330 | INFO     | src.policies:collect_trajectories:213 - Episode 130
2021-08-26 22:49

2021-08-26 22:49:35.642 | INFO     | src.policies:collect_trajectories:213 - Episode 140
2021-08-26 22:49:35.654 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:35.655 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 24.0
2021-08-26 22:49:35.661 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 37.5
2021-08-26 22:49:35.712 | INFO     | src.policies:collect_trajectories:213 - Episode 141
2021-08-26 22:49:35.867 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:35.868 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 68.0
2021-08-26 22:49:35.869 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 43.6
2021-08-26 22:49:35.875 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:49:35.877 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.2558278441429138

2021-08-26 22:49:36.134 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:36.135 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 62.0
2021-08-26 22:49:36.136 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 48.0
2021-08-26 22:49:36.137 | INFO     | src.policies:collect_trajectories:213 - Episode 153
2021-08-26 22:49:36.154 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:36.155 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 38.0
2021-08-26 22:49:36.156 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 45.5
2021-08-26 22:49:36.157 | INFO     | src.policies:collect_trajectories:213 - Episode 154
2021-08-26 22:49:36.168 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:36.169 | INFO     | src.policies:collect_trajectories:229 - M

2021-08-26 22:49:36.502 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 62.0
2021-08-26 22:49:36.502 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 43.5
2021-08-26 22:49:36.503 | INFO     | src.policies:collect_trajectories:213 - Episode 165
2021-08-26 22:49:36.547 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:36.548 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 117.0
2021-08-26 22:49:36.549 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 58.2
2021-08-26 22:49:36.557 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:49:36.559 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.23798063397407532
2021-08-26 22:49:36.562 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.12938730418682098
2021-08-26 22:49:36.565 | INFO     | src.policies:minibatch_upda

2021-08-26 22:49:36.945 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.3729385733604431
2021-08-26 22:49:36.947 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:49:36.950 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.19929251074790955
2021-08-26 22:49:36.953 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.18349701166152954
2021-08-26 22:49:36.955 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.3832055330276489
2021-08-26 22:49:36.957 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.18349701166152954
2021-08-26 22:49:36.959 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.3832055330276489
2021-08-26 22:49:36.962 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:49:36.964 | INFO     | src.policies:minibatch

2021-08-26 22:49:37.284 | INFO     | src.policies:collect_trajectories:213 - Episode 183
2021-08-26 22:49:37.301 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:37.303 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 36.0
2021-08-26 22:49:37.304 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 41.5
2021-08-26 22:49:37.305 | INFO     | src.policies:collect_trajectories:213 - Episode 184
2021-08-26 22:49:37.411 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:37.413 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 127.0
2021-08-26 22:49:37.414 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 70.0
2021-08-26 22:49:37.420 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:49:37.423 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.244942456483840

2021-08-26 22:49:37.704 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.17992708086967468
2021-08-26 22:49:37.706 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.2877075672149658
2021-08-26 22:49:37.708 | INFO     | src.policies:train:116 - Epoch 33 / 800
2021-08-26 22:49:37.709 | INFO     | src.policies:collect_trajectories:213 - Episode 192
2021-08-26 22:49:37.742 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:37.743 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 94.0
2021-08-26 22:49:37.744 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 94.0
2021-08-26 22:49:37.745 | INFO     | src.policies:collect_trajectories:213 - Episode 193
2021-08-26 22:49:37.761 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:37.762 | INFO     | s

2021-08-26 22:49:38.198 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:38.199 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 33.0
2021-08-26 22:49:38.200 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 40.25
2021-08-26 22:49:38.201 | INFO     | src.policies:collect_trajectories:213 - Episode 204
2021-08-26 22:49:38.220 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:38.222 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 43.0
2021-08-26 22:49:38.223 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 40.8
2021-08-26 22:49:38.229 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:49:38.231 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.1482749581336975
2021-08-26 22:49:38.234 | INFO     | src.policies:minibatch_update:277 - Policy network

2021-08-26 22:49:38.580 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.07648938149213791
2021-08-26 22:49:38.583 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.3158435523509979
2021-08-26 22:49:38.585 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.07648938149213791
2021-08-26 22:49:38.587 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.3158435523509979
2021-08-26 22:49:38.589 | INFO     | src.policies:train:116 - Epoch 38 / 800
2021-08-26 22:49:38.590 | INFO     | src.policies:collect_trajectories:213 - Episode 213
2021-08-26 22:49:38.613 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:38.615 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 60.0
2021-08-26 22:49:38.615 | INFO     | src.policies:collect_trajectories:230 - Last 100 episo

2021-08-26 22:49:38.920 | INFO     | src.policies:collect_trajectories:213 - Episode 224
2021-08-26 22:49:38.945 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:38.946 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 67.0
2021-08-26 22:49:38.947 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 46.0
2021-08-26 22:49:38.948 | INFO     | src.policies:collect_trajectories:213 - Episode 225
2021-08-26 22:49:38.970 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:38.971 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 51.0
2021-08-26 22:49:38.971 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 47.0
2021-08-26 22:49:38.979 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:49:38.981 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.2684170305728912

2021-08-26 22:49:39.404 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:49:39.407 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.19670796394348145
2021-08-26 22:49:39.409 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.10500463843345642
2021-08-26 22:49:39.410 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.31188490986824036
2021-08-26 22:49:39.413 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.10500463843345642
2021-08-26 22:49:39.415 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.31188490986824036
2021-08-26 22:49:39.418 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:49:39.420 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.19266214966773987
2021-08-26 22:49:39.421 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gra

2021-08-26 22:49:39.826 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 87.0
2021-08-26 22:49:39.827 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 86.0
2021-08-26 22:49:39.827 | INFO     | src.policies:collect_trajectories:213 - Episode 245
2021-08-26 22:49:39.852 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:39.853 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 59.0
2021-08-26 22:49:39.854 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 77.0
2021-08-26 22:49:39.859 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:49:39.862 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.21463578939437866
2021-08-26 22:49:39.864 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.08497659116983414
2021-08-26 22:49:39.865 | INFO     | src.policies:minibatch_updat

2021-08-26 22:49:40.342 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.33217546343803406
2021-08-26 22:49:40.345 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.22028931975364685
2021-08-26 22:49:40.347 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.33217546343803406
2021-08-26 22:49:40.350 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.22028931975364685
2021-08-26 22:49:40.353 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:49:40.355 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.23770281672477722
2021-08-26 22:49:40.358 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.10079822689294815
2021-08-26 22:49:40.360 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.16729538142681122
2021-08-26 22:49:40.363 

2021-08-26 22:49:40.734 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1659277230501175
2021-08-26 22:49:40.736 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.22833475470542908
2021-08-26 22:49:40.739 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1659277230501175
2021-08-26 22:49:40.742 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:49:40.744 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.22699880599975586
2021-08-26 22:49:40.748 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.16039010882377625
2021-08-26 22:49:40.750 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.16152530908584595
2021-08-26 22:49:40.752 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.16039010882377625
2021-08-26 

2021-08-26 22:49:41.220 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:41.221 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 101.0
2021-08-26 22:49:41.222 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 90.33333333333333
2021-08-26 22:49:41.228 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:49:41.231 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.25063449144363403
2021-08-26 22:49:41.233 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1484300047159195
2021-08-26 22:49:41.235 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.19556082785129547
2021-08-26 22:49:41.237 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1484300047159195
2021-08-26 22:49:41.239 | INFO     | src.policies:minibatch_update:295 - Baseline network

2021-08-26 22:49:41.599 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 36.0
2021-08-26 22:49:41.600 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 36.0
2021-08-26 22:49:41.601 | INFO     | src.policies:collect_trajectories:213 - Episode 282
2021-08-26 22:49:41.619 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:41.620 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 43.0
2021-08-26 22:49:41.621 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 38.333333333333336
2021-08-26 22:49:41.622 | INFO     | src.policies:collect_trajectories:213 - Episode 283
2021-08-26 22:49:41.667 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:41.668 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 124.0
2021-08-26 22:49:41.670 | INFO     | src.policies:collect_trajec

2021-08-26 22:49:42.052 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.25999996066093445
2021-08-26 22:49:42.054 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.2518532872200012
2021-08-26 22:49:42.056 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.25999996066093445
2021-08-26 22:49:42.058 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.2518532872200012
2021-08-26 22:49:42.060 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:49:42.062 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.18275102972984314
2021-08-26 22:49:42.064 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.18266722559928894
2021-08-26 22:49:42.065 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1876288801431656
2021-08-26 22:49:42.067 | I

2021-08-26 22:49:42.612 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:49:42.615 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.18267181515693665
2021-08-26 22:49:42.617 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.23336182534694672
2021-08-26 22:49:42.619 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.2188604772090912
2021-08-26 22:49:42.621 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.23336182534694672
2021-08-26 22:49:42.623 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.2188604772090912
2021-08-26 22:49:42.625 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:49:42.627 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.14089226722717285
2021-08-26 22:49:42.630 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradi

2021-08-26 22:49:42.964 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 75.5
2021-08-26 22:49:42.965 | INFO     | src.policies:collect_trajectories:213 - Episode 307
2021-08-26 22:49:43.020 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:43.021 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 145.0
2021-08-26 22:49:43.022 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 98.66666666666667
2021-08-26 22:49:43.028 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:49:43.031 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.24848851561546326
2021-08-26 22:49:43.033 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.06283150613307953
2021-08-26 22:49:43.034 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.12664857506752014
2021-08-26 22:49:43.036 | 

2021-08-26 22:49:43.399 | INFO     | src.policies:train:116 - Epoch 67 / 800
2021-08-26 22:49:43.400 | INFO     | src.policies:collect_trajectories:213 - Episode 315
2021-08-26 22:49:43.463 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:43.464 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 170.0
2021-08-26 22:49:43.465 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 170.0
2021-08-26 22:49:43.466 | INFO     | src.policies:collect_trajectories:213 - Episode 316
2021-08-26 22:49:43.517 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:43.518 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 135.0
2021-08-26 22:49:43.518 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 152.5
2021-08-26 22:49:43.524 | INFO     | src.policies:train:152 - Mini-batch 1 / 3

2021-08-26 22:49:43.881 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.20216815173625946
2021-08-26 22:49:43.883 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:49:43.885 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.09522151947021484
2021-08-26 22:49:43.887 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1771339774131775
2021-08-26 22:49:43.889 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.18630990386009216
2021-08-26 22:49:43.891 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1771339774131775
2021-08-26 22:49:43.893 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.18630990386009216
2021-08-26 22:49:43.896 | INFO     | src.policies:train:116 - Epoch 70 / 800
2021-08-26 22:49:43.897 | INFO     | src.policies:collect_tr

2021-08-26 22:49:44.580 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.14127522706985474
2021-08-26 22:49:44.581 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.21122312545776367
2021-08-26 22:49:44.584 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.14127522706985474
2021-08-26 22:49:44.586 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:49:44.588 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.2028685212135315
2021-08-26 22:49:44.590 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.20584654808044434
2021-08-26 22:49:44.592 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.12231768667697906
2021-08-26 22:49:44.594 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.20584654808044434

2021-08-26 22:49:45.074 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 183.0
2021-08-26 22:49:45.075 | INFO     | src.policies:collect_trajectories:213 - Episode 335
2021-08-26 22:49:45.090 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:45.091 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 35.0
2021-08-26 22:49:45.092 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 109.0
2021-08-26 22:49:45.098 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:49:45.100 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.16279956698417664
2021-08-26 22:49:45.102 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.26400598883628845
2021-08-26 22:49:45.104 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.13482248783111572

2021-08-26 22:49:45.548 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.16100776195526123
2021-08-26 22:49:45.550 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.4142434000968933
2021-08-26 22:49:45.552 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.2879733145236969
2021-08-26 22:49:45.554 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4142434000968933
2021-08-26 22:49:45.557 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.2879733145236969
2021-08-26 22:49:45.559 | INFO     | src.policies:train:116 - Epoch 79 / 800
2021-08-26 22:49:45.561 | INFO     | src.policies:collect_trajectories:213 - Episode 341
2021-08-26 22:49:45.615 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:45.616 | INFO     | src.policies:collect_trajectories:229 - Mean episode r

2021-08-26 22:49:46.066 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.2633207142353058
2021-08-26 22:49:46.068 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:49:46.070 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.09053359925746918
2021-08-26 22:49:46.072 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.03557436540722847
2021-08-26 22:49:46.074 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.3024747371673584
2021-08-26 22:49:46.076 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.03557436540722847
2021-08-26 22:49:46.078 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.3024747371673584
2021-08-26 22:49:46.080 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:49:46.082 | INFO     | src.policies:minibatch

2021-08-26 22:49:46.617 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:49:46.620 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.09498931467533112
2021-08-26 22:49:46.622 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.14341124892234802
2021-08-26 22:49:46.623 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.2495340257883072
2021-08-26 22:49:46.626 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.14341124892234802
2021-08-26 22:49:46.627 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.2495340257883072
2021-08-26 22:49:46.630 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:49:46.632 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.0777641236782074
2021-08-26 22:49:46.634 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradie

2021-08-26 22:49:47.117 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.16990286111831665
2021-08-26 22:49:47.118 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.361283540725708
2021-08-26 22:49:47.121 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.16990286111831665
2021-08-26 22:49:47.123 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.361283540725708
2021-08-26 22:49:47.125 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:49:47.127 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.07448747754096985
2021-08-26 22:49:47.129 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1910359412431717
2021-08-26 22:49:47.131 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.35267066955566406

2021-08-26 22:49:47.912 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:47.913 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:49:47.914 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 191.5
2021-08-26 22:49:47.921 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:49:47.923 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.24754023551940918
2021-08-26 22:49:47.926 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.05925721675157547
2021-08-26 22:49:47.927 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.10569735616445541
2021-08-26 22:49:47.929 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.05925721675157547
2021-08-26 22:49:47.932 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradie

2021-08-26 22:49:48.466 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 187.0
2021-08-26 22:49:48.467 | INFO     | src.policies:collect_trajectories:213 - Episode 374
2021-08-26 22:49:48.487 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:48.488 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 42.0
2021-08-26 22:49:48.489 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 114.5
2021-08-26 22:49:48.496 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:49:48.500 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.04129369556903839
2021-08-26 22:49:48.502 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.11471763998270035
2021-08-26 22:49:48.504 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.2445218861103058

2021-08-26 22:49:48.931 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.11358138918876648
2021-08-26 22:49:48.933 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.27639442682266235
2021-08-26 22:49:48.934 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.2847876250743866
2021-08-26 22:49:48.936 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.27639442682266235
2021-08-26 22:49:48.938 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.2847876250743866
2021-08-26 22:49:48.941 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:49:48.942 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.0836828351020813
2021-08-26 22:49:48.944 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.21017581224441528
2021-08-26 22:49:48.946 | INFO     | src.policies:

2021-08-26 22:49:49.394 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.40747636556625366
2021-08-26 22:49:49.397 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:49:49.399 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.06018088757991791
2021-08-26 22:49:49.402 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.14005638659000397
2021-08-26 22:49:49.404 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.3003602623939514
2021-08-26 22:49:49.406 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.14005638659000397
2021-08-26 22:49:49.408 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.3003602623939514
2021-08-26 22:49:49.410 | INFO     | src.policies:train:116 - Epoch 100 / 800
2021-08-26 22:49:49.411 | INFO     | src.policies:collect_t

2021-08-26 22:49:49.985 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:49:49.988 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.1399112194776535
2021-08-26 22:49:49.991 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.138768270611763
2021-08-26 22:49:49.993 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.21427612006664276
2021-08-26 22:49:49.996 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.138768270611763
2021-08-26 22:49:49.998 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.21427612006664276
2021-08-26 22:49:50.002 | INFO     | src.policies:train:116 - Epoch 103 / 800
2021-08-26 22:49:50.003 | INFO     | src.policies:collect_trajectories:213 - Episode 393
2021-08-26 22:49:50.055 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:49:50.565 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.07561386376619339
2021-08-26 22:49:50.567 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.38848841190338135
2021-08-26 22:49:50.568 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.07561386376619339
2021-08-26 22:49:50.571 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.38848841190338135
2021-08-26 22:49:50.574 | INFO     | src.policies:train:116 - Epoch 106 / 800
2021-08-26 22:49:50.575 | INFO     | src.policies:collect_trajectories:213 - Episode 399
2021-08-26 22:49:50.626 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:50.627 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 147.0
2021-08-26 22:49:50.628 | INFO     | src.policies:collect_trajectories:230 - Last 100 e

2021-08-26 22:49:51.050 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.19296394288539886
2021-08-26 22:49:51.052 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.20095664262771606
2021-08-26 22:49:51.054 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.19296394288539886
2021-08-26 22:49:51.057 | INFO     | src.policies:train:116 - Epoch 109 / 800
2021-08-26 22:49:51.058 | INFO     | src.policies:collect_trajectories:213 - Episode 405
2021-08-26 22:49:51.067 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:51.068 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 20.0
2021-08-26 22:49:51.069 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.0
2021-08-26 22:49:51.070 | INFO     | src.policies:collect_trajectories:213 - Episode 406

2021-08-26 22:49:51.539 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.24659758806228638
2021-08-26 22:49:51.541 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.23566071689128876
2021-08-26 22:49:51.543 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:49:51.545 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.026142612099647522
2021-08-26 22:49:51.547 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2719423174858093
2021-08-26 22:49:51.549 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.2105761170387268
2021-08-26 22:49:51.551 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2719423174858093
2021-08-26 22:49:51.553 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.21057611703872

2021-08-26 22:49:51.933 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.09010351449251175
2021-08-26 22:49:51.936 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.07698588073253632
2021-08-26 22:49:51.938 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.09010351449251175
2021-08-26 22:49:51.941 | INFO     | src.policies:train:116 - Epoch 115 / 800
2021-08-26 22:49:51.942 | INFO     | src.policies:collect_trajectories:213 - Episode 420
2021-08-26 22:49:52.067 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:52.068 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:49:52.069 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:49:52.074 | INFO     | src.policies:train:152 - Mini-batch 1 / 2

2021-08-26 22:49:52.463 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.2027144730091095
2021-08-26 22:49:52.466 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:49:52.469 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.07325020432472229
2021-08-26 22:49:52.471 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.18643805384635925
2021-08-26 22:49:52.474 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.22389422357082367
2021-08-26 22:49:52.477 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.18643805384635925
2021-08-26 22:49:52.480 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.22389422357082367
2021-08-26 22:49:52.483 | INFO     | src.policies:train:116 - Epoch 118 / 800
2021-08-26 22:49:52.484 | INFO     | src.policies:collect_

2021-08-26 22:49:53.121 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.12743215262889862
2021-08-26 22:49:53.124 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:49:53.127 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.14106321334838867
2021-08-26 22:49:53.130 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.24108971655368805
2021-08-26 22:49:53.132 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.16562175750732422
2021-08-26 22:49:53.135 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.24108971655368805
2021-08-26 22:49:53.138 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.16562175750732422
2021-08-26 22:49:53.142 | INFO     | src.policies:train:116 - Epoch 121 / 800
2021-08-26 22:49:53.143 | INFO     | src.policies:collect

2021-08-26 22:49:53.694 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:49:53.697 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.12344437837600708
2021-08-26 22:49:53.699 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.11194245517253876
2021-08-26 22:49:53.702 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.13123200833797455
2021-08-26 22:49:53.704 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.11194245517253876
2021-08-26 22:49:53.706 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.13123200833797455
2021-08-26 22:49:53.709 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:49:53.711 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.22448566555976868
2021-08-26 22:49:53.713 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gra

2021-08-26 22:49:54.159 | INFO     | src.policies:collect_trajectories:213 - Episode 442
2021-08-26 22:49:54.204 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:54.205 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 117.0
2021-08-26 22:49:54.206 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 147.0
2021-08-26 22:49:54.212 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:49:54.216 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.1883345991373062
2021-08-26 22:49:54.218 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2196209579706192
2021-08-26 22:49:54.221 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.13382810354232788
2021-08-26 22:49:54.223 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2196209579706192

2021-08-26 22:49:54.891 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.10634850710630417
2021-08-26 22:49:54.893 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.09235862642526627
2021-08-26 22:49:54.895 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.10634850710630417
2021-08-26 22:49:54.897 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.09235862642526627
2021-08-26 22:49:54.900 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:49:54.902 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.09374651312828064
2021-08-26 22:49:54.904 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.14050139486789703
2021-08-26 22:49:54.906 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.11810189485549927

2021-08-26 22:49:55.512 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 149.0
2021-08-26 22:49:55.514 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 169.0
2021-08-26 22:49:55.521 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:49:55.523 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.12433525919914246
2021-08-26 22:49:55.526 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.09631463140249252
2021-08-26 22:49:55.528 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1166088804602623
2021-08-26 22:49:55.531 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.09631463140249252
2021-08-26 22:49:55.533 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1166088804602623
2021-08-26 22:49:55.535 | INFO     | src.policies:train:152 - 

2021-08-26 22:49:55.934 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1408836990594864
2021-08-26 22:49:55.936 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.04948481172323227
2021-08-26 22:49:55.939 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:49:55.941 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.1835571527481079
2021-08-26 22:49:55.943 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.28056347370147705
2021-08-26 22:49:55.945 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.05993125960230827
2021-08-26 22:49:55.947 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.28056347370147705
2021-08-26 22:49:55.949 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.05993125960230

2021-08-26 22:49:56.438 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.09066928923130035
2021-08-26 22:49:56.440 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.29271239042282104
2021-08-26 22:49:56.442 | INFO     | src.policies:train:116 - Epoch 139 / 800
2021-08-26 22:49:56.443 | INFO     | src.policies:collect_trajectories:213 - Episode 464
2021-08-26 22:49:56.468 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:56.469 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 63.0
2021-08-26 22:49:56.470 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 63.0
2021-08-26 22:49:56.471 | INFO     | src.policies:collect_trajectories:213 - Episode 465
2021-08-26 22:49:56.506 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:49:57.079 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:49:57.082 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.16047269105911255
2021-08-26 22:49:57.084 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.21614892780780792
2021-08-26 22:49:57.086 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1171339675784111
2021-08-26 22:49:57.087 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.21614892780780792
2021-08-26 22:49:57.090 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1171339675784111
2021-08-26 22:49:57.132 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:49:57.135 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.1318444013595581
2021-08-26 22:49:57.137 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradie

2021-08-26 22:49:57.555 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.03246726095676422
2021-08-26 22:49:57.557 | INFO     | src.policies:train:116 - Epoch 146 / 800
2021-08-26 22:49:57.558 | INFO     | src.policies:collect_trajectories:213 - Episode 474
2021-08-26 22:49:57.607 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:57.608 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 130.0
2021-08-26 22:49:57.609 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 130.0
2021-08-26 22:49:57.610 | INFO     | src.policies:collect_trajectories:213 - Episode 475
2021-08-26 22:49:57.733 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:57.734 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 195.0
2021-08-26 22:49:57.735 | INFO     | src.policies:collect_trajectories:

2021-08-26 22:49:58.092 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:58.093 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 118.0
2021-08-26 22:49:58.094 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 71.0
2021-08-26 22:49:58.095 | INFO     | src.policies:collect_trajectories:213 - Episode 481
2021-08-26 22:49:58.157 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:58.158 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 153.0
2021-08-26 22:49:58.159 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 98.33333333333333
2021-08-26 22:49:58.165 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:49:58.168 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.14585018157958984
2021-08-26 22:49:58.171 | INFO     | src.policies:minibatch_update:277 -

2021-08-26 22:49:58.617 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.07036369293928146
2021-08-26 22:49:58.619 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:49:58.621 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.21223807334899902
2021-08-26 22:49:58.624 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.10527066886425018
2021-08-26 22:49:58.626 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.02591940574347973
2021-08-26 22:49:58.628 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.10527066886425018
2021-08-26 22:49:58.630 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.02591940574347973
2021-08-26 22:49:58.632 | INFO     | src.policies:train:116 - Epoch 153 / 800
2021-08-26 22:49:58.633 | INFO     | src.policies:collect

2021-08-26 22:49:59.170 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.10096419602632523
2021-08-26 22:49:59.172 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.067191481590271
2021-08-26 22:49:59.175 | INFO     | src.policies:train:116 - Epoch 156 / 800
2021-08-26 22:49:59.176 | INFO     | src.policies:collect_trajectories:213 - Episode 494
2021-08-26 22:49:59.198 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:59.199 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 58.0
2021-08-26 22:49:59.200 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 58.0
2021-08-26 22:49:59.201 | INFO     | src.policies:collect_trajectories:213 - Episode 495
2021-08-26 22:49:59.224 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:49:59.611 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.18560320138931274
2021-08-26 22:49:59.613 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.15772047638893127
2021-08-26 22:49:59.615 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.04691487178206444
2021-08-26 22:49:59.617 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.15772047638893127
2021-08-26 22:49:59.619 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.04691487178206444
2021-08-26 22:49:59.621 | INFO     | src.policies:train:116 - Epoch 159 / 800
2021-08-26 22:49:59.622 | INFO     | src.policies:collect_trajectories:213 - Episode 503
2021-08-26 22:49:59.690 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:49:59.692 | INFO     | src.policies:collect_trajectories:229 - Mean epis

2021-08-26 22:50:00.094 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.187574565410614
2021-08-26 22:50:00.097 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.042182814329862595
2021-08-26 22:50:00.100 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.187574565410614
2021-08-26 22:50:00.102 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.042182814329862595
2021-08-26 22:50:00.106 | INFO     | src.policies:train:116 - Epoch 162 / 800
2021-08-26 22:50:00.107 | INFO     | src.policies:collect_trajectories:213 - Episode 511
2021-08-26 22:50:00.162 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:00.163 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 116.0
2021-08-26 22:50:00.164 | INFO     | src.policies:collect_trajectories:230 - Last 100 epi

2021-08-26 22:50:00.577 | INFO     | src.policies:train:116 - Epoch 165 / 800
2021-08-26 22:50:00.578 | INFO     | src.policies:collect_trajectories:213 - Episode 518
2021-08-26 22:50:00.600 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:00.601 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 54.0
2021-08-26 22:50:00.602 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 54.0
2021-08-26 22:50:00.603 | INFO     | src.policies:collect_trajectories:213 - Episode 519
2021-08-26 22:50:00.641 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:00.642 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 93.0
2021-08-26 22:50:00.643 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 73.5
2021-08-26 22:50:00.644 | INFO     | src.policies:collect_trajectories:213 - Episode 520
2021-08-26 22:5

2021-08-26 22:50:01.271 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.19260120391845703
2021-08-26 22:50:01.274 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1019611656665802
2021-08-26 22:50:01.278 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:01.281 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.20859259366989136
2021-08-26 22:50:01.283 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.08736707270145416
2021-08-26 22:50:01.285 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.102872833609581
2021-08-26 22:50:01.287 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.08736707270145416
2021-08-26 22:50:01.289 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.102872833609581

2021-08-26 22:50:01.800 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2244955152273178
2021-08-26 22:50:01.802 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1614861935377121
2021-08-26 22:50:01.804 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2244955152273178
2021-08-26 22:50:01.806 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1614861935377121
2021-08-26 22:50:01.809 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:50:01.811 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.071723073720932
2021-08-26 22:50:01.813 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.14146578311920166
2021-08-26 22:50:01.815 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1830005794763565
2021-08-26 22:50:01.817 | INFO 

2021-08-26 22:50:02.333 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1806880682706833
2021-08-26 22:50:02.334 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.2827039062976837
2021-08-26 22:50:02.336 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1806880682706833
2021-08-26 22:50:02.338 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.2827039062976837
2021-08-26 22:50:02.341 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:50:02.343 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.1243915855884552
2021-08-26 22:50:02.345 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.23783111572265625
2021-08-26 22:50:02.347 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.19146943092346191
2021-08-26 22:50:02.349 | INF

2021-08-26 22:50:02.780 | INFO     | src.policies:train:116 - Epoch 176 / 800
2021-08-26 22:50:02.781 | INFO     | src.policies:collect_trajectories:213 - Episode 541
2021-08-26 22:50:02.834 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:02.836 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 138.0
2021-08-26 22:50:02.837 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 138.0
2021-08-26 22:50:02.838 | INFO     | src.policies:collect_trajectories:213 - Episode 542
2021-08-26 22:50:02.877 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:02.878 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 103.0
2021-08-26 22:50:02.879 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 120.5
2021-08-26 22:50:02.885 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:02.8

2021-08-26 22:50:03.464 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 159.0
2021-08-26 22:50:03.464 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 161.0
2021-08-26 22:50:03.470 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:50:03.474 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.06117938458919525
2021-08-26 22:50:03.476 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.0649133175611496
2021-08-26 22:50:03.478 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.19582022726535797
2021-08-26 22:50:03.480 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.0649133175611496
2021-08-26 22:50:03.482 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.19582022726535797
2021-08-26 22:50:03.485 | INFO     | src.policies:train:152 - 

2021-08-26 22:50:03.933 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:03.935 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 143.0
2021-08-26 22:50:03.935 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 143.0
2021-08-26 22:50:03.936 | INFO     | src.policies:collect_trajectories:213 - Episode 553
2021-08-26 22:50:04.014 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:04.015 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:50:04.016 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 171.5
2021-08-26 22:50:04.022 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:50:04.025 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.12511980533599854
2021-08-26 22:50:04.027 | INFO     | src.policies:minibatch_update:277 - Policy net

2021-08-26 22:50:04.471 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:04.472 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 128.0
2021-08-26 22:50:04.473 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 128.0
2021-08-26 22:50:04.474 | INFO     | src.policies:collect_trajectories:213 - Episode 558
2021-08-26 22:50:04.548 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:04.549 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:50:04.550 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 164.0
2021-08-26 22:50:04.556 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:50:04.559 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.084699347615242
2021-08-26 22:50:04.561 | INFO     | src.policies:minibatch_update:277 - Policy netwo

2021-08-26 22:50:05.008 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.37926730513572693
2021-08-26 22:50:05.010 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.19739627838134766
2021-08-26 22:50:05.013 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:50:05.015 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.14661180973052979
2021-08-26 22:50:05.017 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2758459448814392
2021-08-26 22:50:05.019 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1718427836894989
2021-08-26 22:50:05.021 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2758459448814392
2021-08-26 22:50:05.023 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.171842783689498

2021-08-26 22:50:05.581 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:05.582 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 121.0
2021-08-26 22:50:05.583 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 121.0
2021-08-26 22:50:05.584 | INFO     | src.policies:collect_trajectories:213 - Episode 568
2021-08-26 22:50:05.646 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:05.647 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 164.0
2021-08-26 22:50:05.648 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 142.5
2021-08-26 22:50:05.654 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:05.657 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.17982622981071472
2021-08-26 22:50:05.659 | INFO     | src.policies:minibatch_update:277 - Policy net

2021-08-26 22:50:06.066 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.16382890939712524
2021-08-26 22:50:06.068 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.09642701596021652
2021-08-26 22:50:06.071 | INFO     | src.policies:train:116 - Epoch 196 / 800
2021-08-26 22:50:06.072 | INFO     | src.policies:collect_trajectories:213 - Episode 573
2021-08-26 22:50:06.131 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:06.132 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 135.0
2021-08-26 22:50:06.134 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 135.0
2021-08-26 22:50:06.135 | INFO     | src.policies:collect_trajectories:213 - Episode 574
2021-08-26 22:50:06.170 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:06.171 | INFO    

2021-08-26 22:50:06.521 | INFO     | src.policies:train:116 - Epoch 199 / 800
2021-08-26 22:50:06.522 | INFO     | src.policies:collect_trajectories:213 - Episode 579
2021-08-26 22:50:06.690 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:06.691 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 57.0
2021-08-26 22:50:06.692 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 57.0
2021-08-26 22:50:06.693 | INFO     | src.policies:collect_trajectories:213 - Episode 580
2021-08-26 22:50:06.766 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:06.767 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 190.0
2021-08-26 22:50:06.768 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 123.5
2021-08-26 22:50:06.775 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:06.777

2021-08-26 22:50:07.319 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:07.320 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 167.0
2021-08-26 22:50:07.321 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 167.0
2021-08-26 22:50:07.322 | INFO     | src.policies:collect_trajectories:213 - Episode 586
2021-08-26 22:50:07.398 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:07.399 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:50:07.400 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 183.5
2021-08-26 22:50:07.406 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:50:07.409 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.10492515563964844
2021-08-26 22:50:07.411 | INFO     | src.policies:minibatch_update:277 - Policy net

2021-08-26 22:50:07.781 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.14179536700248718
2021-08-26 22:50:07.783 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.08485374599695206
2021-08-26 22:50:07.785 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.14179536700248718
2021-08-26 22:50:07.788 | INFO     | src.policies:train:116 - Epoch 205 / 800
2021-08-26 22:50:07.790 | INFO     | src.policies:collect_trajectories:213 - Episode 591
2021-08-26 22:50:07.908 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:07.909 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:50:07.910 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:50:07.914 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:07.

2021-08-26 22:50:08.326 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1383872926235199
2021-08-26 22:50:08.328 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2528882920742035
2021-08-26 22:50:08.330 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1383872926235199
2021-08-26 22:50:08.333 | INFO     | src.policies:train:116 - Epoch 208 / 800
2021-08-26 22:50:08.334 | INFO     | src.policies:collect_trajectories:213 - Episode 596
2021-08-26 22:50:08.462 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:08.464 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:50:08.465 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:50:08.469 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:08.472

2021-08-26 22:50:08.885 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.21955759823322296
2021-08-26 22:50:08.887 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.06244621053338051
2021-08-26 22:50:08.889 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.21955759823322296
2021-08-26 22:50:08.892 | INFO     | src.policies:train:116 - Epoch 211 / 800
2021-08-26 22:50:08.893 | INFO     | src.policies:collect_trajectories:213 - Episode 601
2021-08-26 22:50:09.009 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:09.010 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:50:09.011 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:50:09.016 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:09.

2021-08-26 22:50:09.542 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.3151629865169525
2021-08-26 22:50:09.544 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.039626095443964005
2021-08-26 22:50:09.546 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.3151629865169525
2021-08-26 22:50:09.549 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:09.551 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.0551215261220932
2021-08-26 22:50:09.553 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.13557516038417816
2021-08-26 22:50:09.555 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.2293519526720047
2021-08-26 22:50:09.557 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.13557516038417816
2021-08-26 2

2021-08-26 22:50:10.109 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.11676235496997833
2021-08-26 22:50:10.111 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.11556469649076462
2021-08-26 22:50:10.113 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1016831248998642
2021-08-26 22:50:10.115 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.11556469649076462
2021-08-26 22:50:10.117 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1016831248998642
2021-08-26 22:50:10.120 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:50:10.122 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.07262501120567322
2021-08-26 22:50:10.124 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.13780871033668518
2021-08-26 22:50:10.126 | INFO     | src.policies

2021-08-26 22:50:10.524 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.116268090903759
2021-08-26 22:50:10.526 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:10.528 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.2003350853919983
2021-08-26 22:50:10.531 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2492891103029251
2021-08-26 22:50:10.532 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.11235776543617249
2021-08-26 22:50:10.534 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2492891103029251
2021-08-26 22:50:10.536 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.11235776543617249
2021-08-26 22:50:10.593 | INFO     | src.policies:train:116 - Epoch 221 / 800
2021-08-26 22:50:10.595 | INFO     | src.policies:collect_traj

2021-08-26 22:50:11.062 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.2543608546257019
2021-08-26 22:50:11.065 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:50:11.067 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.160756453871727
2021-08-26 22:50:11.069 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.0712086632847786
2021-08-26 22:50:11.071 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.2001902163028717
2021-08-26 22:50:11.073 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.0712086632847786
2021-08-26 22:50:11.075 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.2001902163028717
2021-08-26 22:50:11.078 | INFO     | src.policies:train:116 - Epoch 224 / 800
2021-08-26 22:50:11.079 | INFO     | src.policies:collect_trajec

2021-08-26 22:50:11.716 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.0941266417503357
2021-08-26 22:50:11.718 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.3059103786945343
2021-08-26 22:50:11.720 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.0941266417503357
2021-08-26 22:50:11.722 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.3059103786945343
2021-08-26 22:50:11.724 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:50:11.726 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.03763571381568909
2021-08-26 22:50:11.728 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.07073380798101425
2021-08-26 22:50:11.730 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.2439696490764618
2021-08-26 22:50:11.732 | INF

2021-08-26 22:50:12.163 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.04289606958627701
2021-08-26 22:50:12.164 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1808641254901886
2021-08-26 22:50:12.167 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.04289606958627701
2021-08-26 22:50:12.169 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1808641254901886
2021-08-26 22:50:12.171 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:50:12.173 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.013781517744064331
2021-08-26 22:50:12.175 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.0779852345585823
2021-08-26 22:50:12.177 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.11658982932567596
2021-08-26 22:50:12.179 | 

2021-08-26 22:50:12.676 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.09829751402139664
2021-08-26 22:50:12.678 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.11087430268526077
2021-08-26 22:50:12.680 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.09829751402139664
2021-08-26 22:50:12.682 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.11087430268526077
2021-08-26 22:50:12.684 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:50:12.686 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.04293800890445709
2021-08-26 22:50:12.688 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.10055872052907944
2021-08-26 22:50:12.690 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1768072545528412
2021-08-26 22:50:12.692 |

2021-08-26 22:50:13.280 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.23887857794761658
2021-08-26 22:50:13.282 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.08533885329961777
2021-08-26 22:50:13.284 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.23887857794761658
2021-08-26 22:50:13.286 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.08533885329961777
2021-08-26 22:50:13.288 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:50:13.290 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.1425696313381195
2021-08-26 22:50:13.292 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.33308854699134827
2021-08-26 22:50:13.294 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.05398673936724663
2021-08-26 22:50:13.296 |

2021-08-26 22:50:13.843 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:50:13.846 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.06176924705505371
2021-08-26 22:50:13.847 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.13214649260044098
2021-08-26 22:50:13.849 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.29921963810920715
2021-08-26 22:50:13.851 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.13214649260044098
2021-08-26 22:50:13.854 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.29921963810920715
2021-08-26 22:50:13.857 | INFO     | src.policies:train:116 - Epoch 240 / 800
2021-08-26 22:50:13.857 | INFO     | src.policies:collect_trajectories:213 - Episode 647
2021-08-26 22:50:13.972 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08

2021-08-26 22:50:14.330 | INFO     | src.policies:train:116 - Epoch 243 / 800
2021-08-26 22:50:14.331 | INFO     | src.policies:collect_trajectories:213 - Episode 652
2021-08-26 22:50:14.393 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:14.394 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 161.0
2021-08-26 22:50:14.395 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 161.0
2021-08-26 22:50:14.396 | INFO     | src.policies:collect_trajectories:213 - Episode 653
2021-08-26 22:50:14.513 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:14.515 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:50:14.515 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 180.5
2021-08-26 22:50:14.521 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:50:14.5

2021-08-26 22:50:14.914 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:14.915 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 192.0
2021-08-26 22:50:14.916 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 192.0
2021-08-26 22:50:14.917 | INFO     | src.policies:collect_trajectories:213 - Episode 659
2021-08-26 22:50:14.988 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:14.989 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 192.0
2021-08-26 22:50:14.990 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 192.0
2021-08-26 22:50:14.996 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:50:14.999 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.05147503316402435
2021-08-26 22:50:15.001 | INFO     | src.policies:minibatch_update:277 - Policy net

2021-08-26 22:50:15.424 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1867082267999649
2021-08-26 22:50:15.426 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.07354892045259476
2021-08-26 22:50:15.428 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1867082267999649
2021-08-26 22:50:15.431 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:50:15.433 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.10777440667152405
2021-08-26 22:50:15.435 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2146608829498291
2021-08-26 22:50:15.437 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.23507922887802124
2021-08-26 22:50:15.439 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2146608829498291
2021-08-26 22

2021-08-26 22:50:15.972 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.07839391380548477
2021-08-26 22:50:15.974 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.08981142938137054
2021-08-26 22:50:15.976 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.07839391380548477
2021-08-26 22:50:15.979 | INFO     | src.policies:train:116 - Epoch 252 / 800
2021-08-26 22:50:15.980 | INFO     | src.policies:collect_trajectories:213 - Episode 669
2021-08-26 22:50:16.024 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:16.025 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 112.0
2021-08-26 22:50:16.026 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 112.0
2021-08-26 22:50:16.027 | INFO     | src.policies:collect_trajectories:213 - Episode 670
2021-08-26

2021-08-26 22:50:16.526 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.09634234011173248
2021-08-26 22:50:16.529 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1801166534423828
2021-08-26 22:50:16.531 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.40822622179985046
2021-08-26 22:50:16.532 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1801166534423828
2021-08-26 22:50:16.535 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.40822622179985046
2021-08-26 22:50:16.538 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:50:16.540 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.10531994700431824
2021-08-26 22:50:16.541 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.032934993505477905
2021-08-26 22:50:16.543 | INFO     | src.policie

2021-08-26 22:50:17.001 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.12413647770881653
2021-08-26 22:50:17.004 | INFO     | src.policies:train:116 - Epoch 258 / 800
2021-08-26 22:50:17.005 | INFO     | src.policies:collect_trajectories:213 - Episode 680
2021-08-26 22:50:17.079 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:17.080 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:50:17.081 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:50:17.086 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:17.088 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.04279908537864685
2021-08-26 22:50:17.090 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.07491110265254974
2021-08-26 22:50:17.092 | INFO     | src.policies:minibatc

2021-08-26 22:50:17.558 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.04146260395646095
2021-08-26 22:50:17.560 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:17.563 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.29638952016830444
2021-08-26 22:50:17.565 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.16053660213947296
2021-08-26 22:50:17.566 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.014020716771483421
2021-08-26 22:50:17.568 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.16053660213947296
2021-08-26 22:50:17.571 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.014020716771483421
2021-08-26 22:50:17.574 | INFO     | src.policies:train:116 - Epoch 262 / 800
2021-08-26 22:50:17.574 | INFO     | src.policies:colle

2021-08-26 22:50:18.157 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.21187493205070496
2021-08-26 22:50:18.159 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:50:18.161 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.0409858375787735
2021-08-26 22:50:18.163 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1003480777144432
2021-08-26 22:50:18.165 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.15439419448375702
2021-08-26 22:50:18.167 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1003480777144432
2021-08-26 22:50:18.169 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.15439419448375702
2021-08-26 22:50:18.172 | INFO     | src.policies:train:116 - Epoch 265 / 800
2021-08-26 22:50:18.173 | INFO     | src.policies:collect_tr

2021-08-26 22:50:18.730 | INFO     | src.policies:train:116 - Epoch 268 / 800
2021-08-26 22:50:18.731 | INFO     | src.policies:collect_trajectories:213 - Episode 696
2021-08-26 22:50:18.809 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:18.810 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:50:18.811 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:50:18.816 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:18.818 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.033207252621650696
2021-08-26 22:50:18.820 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.16467146575450897
2021-08-26 22:50:18.822 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.25621235370635986
2021-08-26 22:50:18.824 | INFO     | src.policies:minibatch_update:288 -

2021-08-26 22:50:19.259 | INFO     | src.policies:collect_trajectories:213 - Episode 701
2021-08-26 22:50:19.342 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:19.343 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:50:19.343 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 156.5
2021-08-26 22:50:19.350 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:50:19.353 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.1003355085849762
2021-08-26 22:50:19.355 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.23941706120967865
2021-08-26 22:50:19.357 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.11149785667657852
2021-08-26 22:50:19.358 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.23941706120967865
2021-08-26 

2021-08-26 22:50:19.902 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:19.905 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.2970401644706726
2021-08-26 22:50:19.908 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2655510902404785
2021-08-26 22:50:19.911 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.04291829094290733
2021-08-26 22:50:19.914 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2655510902404785
2021-08-26 22:50:19.917 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.04291829094290733
2021-08-26 22:50:19.920 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:19.922 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.16479122638702393
2021-08-26 22:50:19.925 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradie

2021-08-26 22:50:20.371 | INFO     | src.policies:train:116 - Epoch 279 / 800
2021-08-26 22:50:20.372 | INFO     | src.policies:collect_trajectories:213 - Episode 711
2021-08-26 22:50:20.447 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:20.448 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:50:20.449 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:50:20.452 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:20.456 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.2071923315525055
2021-08-26 22:50:20.458 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.165283203125
2021-08-26 22:50:20.460 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.37884119153022766
2021-08-26 22:50:20.462 | INFO     | src.policies:minibatch_update:288 - Policy

2021-08-26 22:50:20.879 | INFO     | src.policies:train:116 - Epoch 282 / 800
2021-08-26 22:50:20.880 | INFO     | src.policies:collect_trajectories:213 - Episode 716
2021-08-26 22:50:20.957 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:20.958 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:50:20.959 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:50:20.963 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:20.965 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.08715009689331055
2021-08-26 22:50:20.967 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.23309801518917084
2021-08-26 22:50:20.969 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.23836159706115723
2021-08-26 22:50:20.972 | INFO     | src.policies:minibatch_update:288 - 

2021-08-26 22:50:21.476 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3044602870941162
2021-08-26 22:50:21.477 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.2793906331062317
2021-08-26 22:50:21.479 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.3044602870941162
2021-08-26 22:50:21.482 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.2793906331062317
2021-08-26 22:50:21.484 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:21.486 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.015357792377471924
2021-08-26 22:50:21.488 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.06635867804288864
2021-08-26 22:50:21.490 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.24525439739227295
2021-08-26 22:50:21.492 | I

2021-08-26 22:50:22.084 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.16417638957500458
2021-08-26 22:50:22.085 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.13263872265815735
2021-08-26 22:50:22.087 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.16417638957500458
2021-08-26 22:50:22.089 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.13263872265815735
2021-08-26 22:50:22.092 | INFO     | src.policies:train:116 - Epoch 289 / 800
2021-08-26 22:50:22.093 | INFO     | src.policies:collect_trajectories:213 - Episode 727
2021-08-26 22:50:22.143 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:22.144 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 132.0
2021-08-26 22:50:22.145 | INFO     | src.policies:collect_trajectories:230 - Last 100 e

2021-08-26 22:50:22.577 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.059715792536735535
2021-08-26 22:50:22.579 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.18699485063552856
2021-08-26 22:50:22.581 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.24440526962280273
2021-08-26 22:50:22.583 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.18699485063552856
2021-08-26 22:50:22.584 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.24440526962280273
2021-08-26 22:50:22.587 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:50:22.589 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.07396899163722992
2021-08-26 22:50:22.591 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.07197662442922592
2021-08-26 22:50:22.592 | INFO     | src.polic

2021-08-26 22:50:23.131 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.06341315805912018
2021-08-26 22:50:23.134 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.08465559035539627
2021-08-26 22:50:23.135 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1725986748933792
2021-08-26 22:50:23.137 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.08465559035539627
2021-08-26 22:50:23.139 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1725986748933792
2021-08-26 22:50:23.142 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:50:23.144 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.0958579033613205
2021-08-26 22:50:23.146 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.06889658421278
2021-08-26 22:50:23.148 | INFO     | src.policies:min

2021-08-26 22:50:23.569 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1591573804616928
2021-08-26 22:50:23.572 | INFO     | src.policies:train:116 - Epoch 299 / 800
2021-08-26 22:50:23.573 | INFO     | src.policies:collect_trajectories:213 - Episode 743
2021-08-26 22:50:23.637 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:23.638 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 167.0
2021-08-26 22:50:23.639 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 167.0
2021-08-26 22:50:23.640 | INFO     | src.policies:collect_trajectories:213 - Episode 744
2021-08-26 22:50:23.710 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:23.711 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 184.0
2021-08-26 22:50:23.712 | INFO     | src.policies:collect_trajectories:2

2021-08-26 22:50:24.312 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.1303234100341797
2021-08-26 22:50:24.315 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2301756739616394
2021-08-26 22:50:24.318 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.2255486696958542
2021-08-26 22:50:24.320 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2301756739616394
2021-08-26 22:50:24.323 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.2255486696958542
2021-08-26 22:50:24.326 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:50:24.328 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.14186349511146545
2021-08-26 22:50:24.331 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2071799486875534
2021-08-26 22:50:24.334 | INFO     | src.policies:min

2021-08-26 22:50:24.820 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:50:24.824 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:24.827 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.07857903093099594
2021-08-26 22:50:24.829 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.04123426601290703
2021-08-26 22:50:24.831 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.5923202037811279
2021-08-26 22:50:24.833 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.04123426601290703
2021-08-26 22:50:24.835 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.49999913573265076
2021-08-26 22:50:24.837 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:24.839 | INFO     | src.policies:minibatch_update:270 - Total loss: 0

2021-08-26 22:50:25.326 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.208669513463974
2021-08-26 22:50:25.328 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.41763272881507874
2021-08-26 22:50:25.330 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:25.332 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.14260493218898773
2021-08-26 22:50:25.334 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3343997597694397
2021-08-26 22:50:25.336 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.34032008051872253
2021-08-26 22:50:25.338 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.3343997597694397
2021-08-26 22:50:25.340 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.3403200805187225

2021-08-26 22:50:25.883 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:25.885 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:50:25.886 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:50:25.891 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:25.893 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.09253391623497009
2021-08-26 22:50:25.895 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1582746058702469
2021-08-26 22:50:25.897 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.2926237881183624
2021-08-26 22:50:25.899 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1582746058702469
2021-08-26 22:50:25.901 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient 

2021-08-26 22:50:26.481 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.06805937737226486
2021-08-26 22:50:26.483 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.31022313237190247
2021-08-26 22:50:26.485 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.06805937737226486
2021-08-26 22:50:26.487 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.31022313237190247
2021-08-26 22:50:26.489 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:26.492 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.05900971591472626
2021-08-26 22:50:26.493 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.14645899832248688
2021-08-26 22:50:26.495 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.23339591920375824
2021-08-26 22:50:26.498 

2021-08-26 22:50:26.943 | INFO     | src.policies:train:116 - Epoch 322 / 800
2021-08-26 22:50:26.944 | INFO     | src.policies:collect_trajectories:213 - Episode 770
2021-08-26 22:50:27.021 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:27.022 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:50:27.023 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:50:27.028 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:27.030 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.12283819913864136
2021-08-26 22:50:27.032 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.18293903768062592
2021-08-26 22:50:27.034 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.02278011292219162
2021-08-26 22:50:27.036 | INFO     | src.policies:minibatch_update:288 - 

2021-08-26 22:50:27.553 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.057816680520772934
2021-08-26 22:50:27.554 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1351991593837738
2021-08-26 22:50:27.556 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.057816680520772934
2021-08-26 22:50:27.559 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1351991593837738
2021-08-26 22:50:27.561 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:50:27.563 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.030414700508117676
2021-08-26 22:50:27.566 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.07975611090660095
2021-08-26 22:50:27.568 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1512582004070282
2021-08-26 22:50:27.570 

2021-08-26 22:50:27.992 | INFO     | src.policies:train:116 - Epoch 328 / 800
2021-08-26 22:50:27.993 | INFO     | src.policies:collect_trajectories:213 - Episode 780
2021-08-26 22:50:28.067 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:28.069 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:50:28.069 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:50:28.075 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:28.078 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.054427370429039
2021-08-26 22:50:28.080 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.048582300543785095
2021-08-26 22:50:28.082 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.20427246391773224
2021-08-26 22:50:28.084 | INFO     | src.policies:minibatch_update:288 - P

2021-08-26 22:50:28.632 | INFO     | src.policies:collect_trajectories:213 - Episode 785
2021-08-26 22:50:28.707 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:28.708 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:50:28.709 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 192.5
2021-08-26 22:50:28.716 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:50:28.718 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.13692861795425415
2021-08-26 22:50:28.720 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.10633552819490433
2021-08-26 22:50:28.722 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.412436842918396
2021-08-26 22:50:28.724 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.10633552819490433
2021-08-26 2

2021-08-26 22:50:29.187 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:29.190 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.0806557834148407
2021-08-26 22:50:29.192 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.12473457306623459
2021-08-26 22:50:29.194 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.34777554869651794
2021-08-26 22:50:29.196 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.12473457306623459
2021-08-26 22:50:29.198 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.34777554869651794
2021-08-26 22:50:29.200 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:29.202 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.021488681435585022
2021-08-26 22:50:29.204 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gra

2021-08-26 22:50:29.807 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:29.810 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.022948771715164185
2021-08-26 22:50:29.812 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.23230475187301636
2021-08-26 22:50:29.814 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.13667525351047516
2021-08-26 22:50:29.816 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.23230475187301636
2021-08-26 22:50:29.818 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.13667525351047516
2021-08-26 22:50:29.821 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:29.823 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.05793607234954834
2021-08-26 22:50:29.825 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gr

2021-08-26 22:50:30.391 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:30.393 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.30700793862342834
2021-08-26 22:50:30.395 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.05304235219955444
2021-08-26 22:50:30.398 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.022601064294576645
2021-08-26 22:50:30.400 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.05304235219955444
2021-08-26 22:50:30.402 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.022601064294576645
2021-08-26 22:50:30.405 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:30.407 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.24199604988098145
2021-08-26 22:50:30.409 | INFO     | src.policies:minibatch_update:277 - Policy network L2 g

2021-08-26 22:50:30.887 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:30.890 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.07999080419540405
2021-08-26 22:50:30.892 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.18390238285064697
2021-08-26 22:50:30.894 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.15793442726135254
2021-08-26 22:50:30.896 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.18390238285064697
2021-08-26 22:50:30.898 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.15793442726135254
2021-08-26 22:50:30.900 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:30.902 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.1112641990184784
2021-08-26 22:50:30.904 | INFO     | src.policies:minibatch_update:277 - Policy network L2 grad

2021-08-26 22:50:31.306 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.016839297488331795
2021-08-26 22:50:31.310 | INFO     | src.policies:train:116 - Epoch 351 / 800
2021-08-26 22:50:31.310 | INFO     | src.policies:collect_trajectories:213 - Episode 809
2021-08-26 22:50:31.433 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:31.434 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:50:31.435 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:50:31.438 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:31.442 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.1924101710319519
2021-08-26 22:50:31.445 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.329741895198822
2021-08-26 22:50:31.446 | INFO     | src.policies:minibatch_

2021-08-26 22:50:31.827 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.014475219883024693
2021-08-26 22:50:31.830 | INFO     | src.policies:train:116 - Epoch 355 / 800
2021-08-26 22:50:31.831 | INFO     | src.policies:collect_trajectories:213 - Episode 814
2021-08-26 22:50:31.947 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:31.948 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:50:31.948 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:50:31.952 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:31.955 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.24396592378616333
2021-08-26 22:50:31.958 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.5787656307220459
2021-08-26 22:50:31.960 | INFO     | src.policies:minibatc

2021-08-26 22:50:32.406 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.01940731890499592
2021-08-26 22:50:32.409 | INFO     | src.policies:train:116 - Epoch 359 / 800
2021-08-26 22:50:32.410 | INFO     | src.policies:collect_trajectories:213 - Episode 819
2021-08-26 22:50:32.448 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:32.466 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 98.0
2021-08-26 22:50:32.485 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 98.0
2021-08-26 22:50:32.496 | INFO     | src.policies:collect_trajectories:213 - Episode 820
2021-08-26 22:50:32.572 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:32.573 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:50:32.574 | INFO     | src.policies:collect_trajectories:23

2021-08-26 22:50:32.963 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.054371584206819534
2021-08-26 22:50:32.966 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.34128716588020325
2021-08-26 22:50:32.968 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.054371584206819534
2021-08-26 22:50:32.970 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:32.972 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.14050358533859253
2021-08-26 22:50:32.974 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.14106646180152893
2021-08-26 22:50:32.976 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.12942004203796387
2021-08-26 22:50:32.978 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.14106646180152893
2021-08

2021-08-26 22:50:33.434 | INFO     | src.policies:collect_trajectories:213 - Episode 830
2021-08-26 22:50:33.479 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:33.481 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 121.0
2021-08-26 22:50:33.481 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 121.0
2021-08-26 22:50:33.482 | INFO     | src.policies:collect_trajectories:213 - Episode 831
2021-08-26 22:50:33.597 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:33.598 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:50:33.599 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 160.5
2021-08-26 22:50:33.605 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:50:33.608 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.266214847564

2021-08-26 22:50:33.929 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 78.0
2021-08-26 22:50:33.930 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 78.0
2021-08-26 22:50:33.931 | INFO     | src.policies:collect_trajectories:213 - Episode 837
2021-08-26 22:50:34.008 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:34.009 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:50:34.010 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 139.0
2021-08-26 22:50:34.016 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:34.019 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.3041178584098816
2021-08-26 22:50:34.021 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.12923328578472137
2021-08-26 22:50:34.022 | INFO     | src.policies:minibatch_upda

2021-08-26 22:50:34.554 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.012815063819289207
2021-08-26 22:50:34.556 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999910593032837
2021-08-26 22:50:34.558 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.012815063819289207
2021-08-26 22:50:34.560 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:34.562 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.2944730222225189
2021-08-26 22:50:34.564 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3780598044395447
2021-08-26 22:50:34.566 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.012605500407516956
2021-08-26 22:50:34.568 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.3780598044395447
2021-08-2

2021-08-26 22:50:35.059 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.14449745416641235
2021-08-26 22:50:35.061 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.15366584062576294
2021-08-26 22:50:35.063 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.11164379864931107
2021-08-26 22:50:35.065 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.15366584062576294
2021-08-26 22:50:35.067 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.11164379864931107
2021-08-26 22:50:35.070 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:50:35.072 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.3390929102897644
2021-08-26 22:50:35.074 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.37994176149368286
2021-08-26 22:50:35.076 | INFO     | src.policie

2021-08-26 22:50:35.573 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.19387684762477875
2021-08-26 22:50:35.575 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.019537167623639107
2021-08-26 22:50:35.577 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.19387684762477875
2021-08-26 22:50:35.579 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.019537167623639107
2021-08-26 22:50:35.581 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:50:35.584 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.3070634603500366
2021-08-26 22:50:35.586 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2953226566314697
2021-08-26 22:50:35.588 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.02842722460627556
2021-08-26 22:50:35.590 

2021-08-26 22:50:36.080 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3969847857952118
2021-08-26 22:50:36.082 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.026616493239998817
2021-08-26 22:50:36.084 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.3969847857952118
2021-08-26 22:50:36.086 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.026616493239998817
2021-08-26 22:50:36.089 | INFO     | src.policies:train:116 - Epoch 381 / 800
2021-08-26 22:50:36.090 | INFO     | src.policies:collect_trajectories:213 - Episode 864
2021-08-26 22:50:36.150 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:36.151 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 155.0
2021-08-26 22:50:36.152 | INFO     | src.policies:collect_trajectories:230 - Last 100 e

2021-08-26 22:50:36.623 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.4646194279193878
2021-08-26 22:50:36.624 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.013998680748045444
2021-08-26 22:50:36.626 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4646194279193878
2021-08-26 22:50:36.629 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.013998680748045444
2021-08-26 22:50:36.631 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:36.633 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.22840464115142822
2021-08-26 22:50:36.636 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.22210745513439178
2021-08-26 22:50:36.637 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.013781673274934292
2021-08-26 22:50:36.639

2021-08-26 22:50:37.079 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:37.081 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.32249459624290466
2021-08-26 22:50:37.083 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.10866470634937286
2021-08-26 22:50:37.085 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.12367802858352661
2021-08-26 22:50:37.087 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.10866470634937286
2021-08-26 22:50:37.089 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.12367802858352661
2021-08-26 22:50:37.091 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:37.093 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.31854483485221863
2021-08-26 22:50:37.095 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gra

2021-08-26 22:50:37.430 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:37.432 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 83.0
2021-08-26 22:50:37.433 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 83.0
2021-08-26 22:50:37.434 | INFO     | src.policies:collect_trajectories:213 - Episode 892
2021-08-26 22:50:37.533 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:37.534 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 104.0
2021-08-26 22:50:37.535 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 93.5
2021-08-26 22:50:37.536 | INFO     | src.policies:collect_trajectories:213 - Episode 893
2021-08-26 22:50:37.551 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:37.554 | INFO     | src.policies:collect_trajectories:229 - 

2021-08-26 22:50:37.890 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.3002433776855469
2021-08-26 22:50:37.891 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3969693183898926
2021-08-26 22:50:37.893 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.014261995442211628
2021-08-26 22:50:37.895 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.3969693183898926
2021-08-26 22:50:37.897 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.014261995442211628
2021-08-26 22:50:37.900 | INFO     | src.policies:train:116 - Epoch 392 / 800
2021-08-26 22:50:37.901 | INFO     | src.policies:collect_trajectories:213 - Episode 901
2021-08-26 22:50:37.931 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:37.932 | INFO     | src.policies:collect_trajectories:229 - Mean episo

2021-08-26 22:50:38.364 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 109.0
2021-08-26 22:50:38.369 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:38.372 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.24968817830085754
2021-08-26 22:50:38.374 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.5234021544456482
2021-08-26 22:50:38.376 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.058699220418930054
2021-08-26 22:50:38.377 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4999990165233612
2021-08-26 22:50:38.380 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.058699220418930054
2021-08-26 22:50:38.382 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:38.384 | INFO     | src.policies:minibatch_update:270 - Total loss: 

2021-08-26 22:50:38.933 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03334549814462662
2021-08-26 22:50:38.935 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.38500145077705383
2021-08-26 22:50:38.937 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.03334549814462662
2021-08-26 22:50:38.940 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:38.942 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.33774030208587646
2021-08-26 22:50:38.944 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.40459752082824707
2021-08-26 22:50:38.946 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.02761583775281906
2021-08-26 22:50:38.948 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.40459752082824707

2021-08-26 22:50:39.417 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.013701554387807846
2021-08-26 22:50:39.420 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:39.422 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.2671576142311096
2021-08-26 22:50:39.424 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.22712884843349457
2021-08-26 22:50:39.425 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.018610714003443718
2021-08-26 22:50:39.427 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.22712884843349457
2021-08-26 22:50:39.429 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.018610714003443718
2021-08-26 22:50:39.432 | INFO     | src.policies:train:116 - Epoch 401 / 800
2021-08-26 22:50:39.433 | INFO     | src.policies:colle

2021-08-26 22:50:39.978 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:50:39.982 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:39.985 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.04768413305282593
2021-08-26 22:50:39.987 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.16258321702480316
2021-08-26 22:50:39.989 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1835079938173294
2021-08-26 22:50:39.991 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.16258321702480316
2021-08-26 22:50:39.993 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1835079938173294
2021-08-26 22:50:39.995 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:39.997 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.

2021-08-26 22:50:40.514 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.20082245767116547
2021-08-26 22:50:40.516 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:50:40.519 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.19035357236862183
2021-08-26 22:50:40.520 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.17332348227500916
2021-08-26 22:50:40.522 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.10953614860773087
2021-08-26 22:50:40.524 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.17332348227500916
2021-08-26 22:50:40.526 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.10953614860773087
2021-08-26 22:50:40.529 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:50:40.531 | INFO     | src.policies:miniba

2021-08-26 22:50:41.164 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.0510978139936924
2021-08-26 22:50:41.166 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.11386759579181671
2021-08-26 22:50:41.168 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0510978139936924
2021-08-26 22:50:41.171 | INFO     | src.policies:train:116 - Epoch 411 / 800
2021-08-26 22:50:41.172 | INFO     | src.policies:collect_trajectories:213 - Episode 937
2021-08-26 22:50:41.244 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:41.245 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:50:41.246 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:50:41.251 | INFO     | src.policies:train:152 - Mini-batch 1 / 2

2021-08-26 22:50:41.728 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.0057471501640975475
2021-08-26 22:50:41.730 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.35033077001571655
2021-08-26 22:50:41.732 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0057471501640975475
2021-08-26 22:50:41.735 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:41.737 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.3218575119972229
2021-08-26 22:50:41.739 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.7242187261581421
2021-08-26 22:50:41.741 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.007472440134733915
2021-08-26 22:50:41.743 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4999992847442627

2021-08-26 22:50:42.208 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.1876528263092041
2021-08-26 22:50:42.210 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.13254719972610474
2021-08-26 22:50:42.212 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.22639188170433044
2021-08-26 22:50:42.214 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.13254719972610474
2021-08-26 22:50:42.216 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.22639188170433044
2021-08-26 22:50:42.219 | INFO     | src.policies:train:116 - Epoch 419 / 800
2021-08-26 22:50:42.220 | INFO     | src.policies:collect_trajectories:213 - Episode 946
2021-08-26 22:50:42.288 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:42.289 | INFO     | src.policies:collect_trajectories:229 - Mean episo

2021-08-26 22:50:42.882 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:42.883 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 160.0
2021-08-26 22:50:42.884 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 171.0
2021-08-26 22:50:42.890 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:50:42.893 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.06060895323753357
2021-08-26 22:50:42.895 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.16555723547935486
2021-08-26 22:50:42.897 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.11125269532203674
2021-08-26 22:50:42.898 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.16555723547935486
2021-08-26 22:50:42.901 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradie

2021-08-26 22:50:43.359 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.2043611705303192
2021-08-26 22:50:43.361 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.18902339041233063
2021-08-26 22:50:43.363 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.2043611705303192
2021-08-26 22:50:43.366 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:43.368 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.20159456133842468
2021-08-26 22:50:43.370 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.08244379609823227
2021-08-26 22:50:43.372 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.22643329203128815
2021-08-26 22:50:43.374 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.08244379609823227

2021-08-26 22:50:43.888 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 140.5
2021-08-26 22:50:43.895 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:43.897 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.27495434880256653
2021-08-26 22:50:43.899 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.15231339633464813
2021-08-26 22:50:43.901 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.015707239508628845
2021-08-26 22:50:43.903 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.15231339633464813
2021-08-26 22:50:43.905 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.015707239508628845
2021-08-26 22:50:43.907 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:43.909 | INFO     | src.policies:minibatch_update:270 - Total loss

2021-08-26 22:50:44.373 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 136.0
2021-08-26 22:50:44.373 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 136.0
2021-08-26 22:50:44.374 | INFO     | src.policies:collect_trajectories:213 - Episode 968
2021-08-26 22:50:44.427 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:44.428 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 139.0
2021-08-26 22:50:44.429 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 137.5
2021-08-26 22:50:44.435 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:44.438 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.13684888184070587
2021-08-26 22:50:44.440 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3154646158218384
2021-08-26 22:50:44.442 | INFO     | src.policies:minibatch_up

2021-08-26 22:50:45.008 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:45.012 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.23683816194534302
2021-08-26 22:50:45.013 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1894134134054184
2021-08-26 22:50:45.015 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.010653441771864891
2021-08-26 22:50:45.017 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1894134134054184
2021-08-26 22:50:45.019 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.010653441771864891
2021-08-26 22:50:45.021 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:45.023 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.31088143587112427
2021-08-26 22:50:45.025 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gra

2021-08-26 22:50:45.438 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.012453668750822544
2021-08-26 22:50:45.441 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4944152235984802
2021-08-26 22:50:45.443 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.012453668750822544
2021-08-26 22:50:45.447 | INFO     | src.policies:train:116 - Epoch 439 / 800
2021-08-26 22:50:45.448 | INFO     | src.policies:collect_trajectories:213 - Episode 980
2021-08-26 22:50:45.500 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:45.501 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 139.0
2021-08-26 22:50:45.502 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 139.0
2021-08-26 22:50:45.503 | INFO     | src.policies:collect_trajectories:213 - Episode 981

2021-08-26 22:50:45.914 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999940395355225
2021-08-26 22:50:45.916 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.044613614678382874
2021-08-26 22:50:45.919 | INFO     | src.policies:train:116 - Epoch 442 / 800
2021-08-26 22:50:45.920 | INFO     | src.policies:collect_trajectories:213 - Episode 986
2021-08-26 22:50:45.977 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:45.978 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 160.0
2021-08-26 22:50:45.979 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 160.0
2021-08-26 22:50:45.980 | INFO     | src.policies:collect_trajectories:213 - Episode 987
2021-08-26 22:50:46.047 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:46.048 | INFO   

2021-08-26 22:50:46.379 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:46.381 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.2806705832481384
2021-08-26 22:50:46.384 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.5964418053627014
2021-08-26 22:50:46.385 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.044140320271253586
2021-08-26 22:50:46.387 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4999992251396179
2021-08-26 22:50:46.389 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.044140320271253586
2021-08-26 22:50:46.528 | INFO     | src.policies:train:116 - Epoch 445 / 800
2021-08-26 22:50:46.539 | INFO     | src.policies:collect_trajectories:213 - Episode 992
2021-08-26 22:50:46.575 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:50:47.019 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.17791950702667236
2021-08-26 22:50:47.021 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.09937770664691925
2021-08-26 22:50:47.023 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.00725850835442543
2021-08-26 22:50:47.025 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.09937770664691925
2021-08-26 22:50:47.027 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.00725850835442543
2021-08-26 22:50:47.029 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:47.031 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.3589542508125305
2021-08-26 22:50:47.033 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.21099704504013062
2021-08-26 22:50:47.034 | INFO     | src.policie

2021-08-26 22:50:47.459 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.2396906614303589
2021-08-26 22:50:47.461 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3813312351703644
2021-08-26 22:50:47.463 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03927498683333397
2021-08-26 22:50:47.465 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.3813312351703644
2021-08-26 22:50:47.467 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.03927498683333397
2021-08-26 22:50:47.470 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:47.472 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.32472914457321167
2021-08-26 22:50:47.474 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1896585077047348
2021-08-26 22:50:47.475 | INFO     | src.policies:m

2021-08-26 22:50:47.848 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 52.0
2021-08-26 22:50:47.849 | INFO     | src.policies:collect_trajectories:213 - Episode 1017
2021-08-26 22:50:47.904 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:47.905 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 144.0
2021-08-26 22:50:47.906 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 82.66666666666667
2021-08-26 22:50:47.912 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:47.914 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.1329035758972168
2021-08-26 22:50:47.916 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.17193442583084106
2021-08-26 22:50:47.918 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.13155721127986908
2021-08-26 22:50:47.920 | 

2021-08-26 22:50:48.288 | INFO     | src.policies:collect_trajectories:213 - Episode 1025
2021-08-26 22:50:48.333 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:48.334 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 119.0
2021-08-26 22:50:48.335 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 94.0
2021-08-26 22:50:48.336 | INFO     | src.policies:collect_trajectories:213 - Episode 1026
2021-08-26 22:50:48.362 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:48.363 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 67.0
2021-08-26 22:50:48.363 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 85.0
2021-08-26 22:50:48.370 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:48.372 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.2364711463451

2021-08-26 22:50:48.706 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.0800708457827568
2021-08-26 22:50:48.708 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.00579051161184907
2021-08-26 22:50:48.710 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:48.712 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.27144068479537964
2021-08-26 22:50:48.715 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2452518343925476
2021-08-26 22:50:48.717 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.02319670468568802
2021-08-26 22:50:48.779 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2452518343925476
2021-08-26 22:50:48.782 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.023196704685688

2021-08-26 22:50:49.262 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3579123914241791
2021-08-26 22:50:49.264 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.003814294468611479
2021-08-26 22:50:49.266 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.3579123914241791
2021-08-26 22:50:49.268 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.003814294468611479
2021-08-26 22:50:49.271 | INFO     | src.policies:train:116 - Epoch 462 / 800
2021-08-26 22:50:49.272 | INFO     | src.policies:collect_trajectories:213 - Episode 1041
2021-08-26 22:50:49.370 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:49.371 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 158.0
2021-08-26 22:50:49.372 | INFO     | src.policies:collect_trajectories:230 - Last 100 

2021-08-26 22:50:49.717 | INFO     | src.policies:train:116 - Epoch 465 / 800
2021-08-26 22:50:49.718 | INFO     | src.policies:collect_trajectories:213 - Episode 1048
2021-08-26 22:50:49.777 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:49.778 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 155.0
2021-08-26 22:50:49.779 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 155.0
2021-08-26 22:50:49.780 | INFO     | src.policies:collect_trajectories:213 - Episode 1049
2021-08-26 22:50:49.824 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:49.825 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 121.0
2021-08-26 22:50:49.826 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 138.0
2021-08-26 22:50:49.877 | INFO     | src.policies:train:152 - Mini-batch 1 / 2

2021-08-26 22:50:50.239 | INFO     | src.policies:collect_trajectories:213 - Episode 1056
2021-08-26 22:50:50.273 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:50.274 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 91.0
2021-08-26 22:50:50.275 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 88.66666666666667
2021-08-26 22:50:50.280 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:50.283 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.22927314043045044
2021-08-26 22:50:50.285 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.01394630502909422
2021-08-26 22:50:50.287 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.015063644386827946
2021-08-26 22:50:50.289 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.013946305029094

2021-08-26 22:50:50.758 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.007208407856523991
2021-08-26 22:50:50.760 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.29954269528388977
2021-08-26 22:50:50.763 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.007208407856523991
2021-08-26 22:50:50.765 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:50:50.767 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.2176058292388916
2021-08-26 22:50:50.769 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.208116814494133
2021-08-26 22:50:50.771 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.01274249516427517
2021-08-26 22:50:50.772 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.208116814494133

2021-08-26 22:50:51.322 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.19000062346458435
2021-08-26 22:50:51.324 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.03984362259507179
2021-08-26 22:50:51.326 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:51.328 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.18820658326148987
2021-08-26 22:50:51.331 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.06247049942612648
2021-08-26 22:50:51.333 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.15751665830612183
2021-08-26 22:50:51.335 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.06247049942612648
2021-08-26 22:50:51.337 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.157516658306

2021-08-26 22:50:51.854 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.3824352025985718
2021-08-26 22:50:51.856 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.017355266958475113
2021-08-26 22:50:51.858 | INFO     | src.policies:train:116 - Epoch 478 / 800
2021-08-26 22:50:51.859 | INFO     | src.policies:collect_trajectories:213 - Episode 1074
2021-08-26 22:50:51.914 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:51.915 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 157.0
2021-08-26 22:50:51.916 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 157.0
2021-08-26 22:50:51.917 | INFO     | src.policies:collect_trajectories:213 - Episode 1075
2021-08-26 22:50:51.992 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:51.993 | INFO  

2021-08-26 22:50:52.363 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 106.0
2021-08-26 22:50:52.363 | INFO     | src.policies:collect_trajectories:213 - Episode 1080
2021-08-26 22:50:52.403 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:52.404 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 104.0
2021-08-26 22:50:52.405 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 105.0
2021-08-26 22:50:52.410 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:52.413 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.03148356080055237
2021-08-26 22:50:52.415 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.11476098001003265
2021-08-26 22:50:52.417 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.3299673795700073
2021-08-26 22:50:52.418 | INFO     | 

2021-08-26 22:50:52.863 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.24444043636322021
2021-08-26 22:50:52.865 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.24454432725906372
2021-08-26 22:50:52.867 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.051801491528749466
2021-08-26 22:50:52.869 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.24454432725906372
2021-08-26 22:50:52.870 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.051801491528749466
2021-08-26 22:50:52.873 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:50:52.875 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.1805572807788849
2021-08-26 22:50:52.877 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.14899884164333344
2021-08-26 22:50:52.879 | INFO     | src.polic

2021-08-26 22:50:53.420 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.1282637119293213
2021-08-26 22:50:53.423 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.14825862646102905
2021-08-26 22:50:53.424 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.2783467173576355
2021-08-26 22:50:53.426 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.14825862646102905
2021-08-26 22:50:53.429 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.2783467173576355
2021-08-26 22:50:53.431 | INFO     | src.policies:train:116 - Epoch 488 / 800
2021-08-26 22:50:53.432 | INFO     | src.policies:collect_trajectories:213 - Episode 1091
2021-08-26 22:50:53.480 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:53.481 | INFO     | src.policies:collect_trajectories:229 - Mean episod

2021-08-26 22:50:53.942 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 115.0
2021-08-26 22:50:53.943 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 115.0
2021-08-26 22:50:53.944 | INFO     | src.policies:collect_trajectories:213 - Episode 1098
2021-08-26 22:50:54.001 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:54.002 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 155.0
2021-08-26 22:50:54.003 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 135.0
2021-08-26 22:50:54.008 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:54.011 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.15359053015708923
2021-08-26 22:50:54.014 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.19363363087177277
2021-08-26 22:50:54.016 | INFO     | src.policies:minibatch_

2021-08-26 22:50:54.348 | INFO     | src.policies:train:116 - Epoch 494 / 800
2021-08-26 22:50:54.348 | INFO     | src.policies:collect_trajectories:213 - Episode 1104
2021-08-26 22:50:54.453 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:54.454 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 154.0
2021-08-26 22:50:54.455 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 154.0
2021-08-26 22:50:54.456 | INFO     | src.policies:collect_trajectories:213 - Episode 1105
2021-08-26 22:50:54.525 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:54.527 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 195.0
2021-08-26 22:50:54.527 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 174.5
2021-08-26 22:50:54.533 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:50:54

2021-08-26 22:50:54.882 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.23069047927856445
2021-08-26 22:50:54.884 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.09610175341367722
2021-08-26 22:50:54.886 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:50:54.888 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.1626356840133667
2021-08-26 22:50:54.890 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.47343936562538147
2021-08-26 22:50:54.891 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03388514742255211
2021-08-26 22:50:54.893 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.47343936562538147
2021-08-26 22:50:54.895 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0338851474225

2021-08-26 22:50:55.460 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 131.0
2021-08-26 22:50:55.461 | INFO     | src.policies:collect_trajectories:213 - Episode 1117
2021-08-26 22:50:55.561 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:55.562 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 167.0
2021-08-26 22:50:55.563 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 149.0
2021-08-26 22:50:55.569 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:55.571 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.08261317759752274
2021-08-26 22:50:55.573 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.060713671147823334
2021-08-26 22:50:55.575 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.18272840976715088
2021-08-26 22:50:55.577 | INFO     

2021-08-26 22:50:55.983 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:55.985 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.29566726088523865
2021-08-26 22:50:55.987 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.45738309621810913
2021-08-26 22:50:55.989 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.06601870059967041
2021-08-26 22:50:55.991 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.45738309621810913
2021-08-26 22:50:55.993 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.06601870059967041
2021-08-26 22:50:55.995 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:55.998 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.2056254744529724
2021-08-26 22:50:56.000 | INFO     | src.policies:minibatch_update:277 - Policy network L2 grad

2021-08-26 22:50:56.441 | INFO     | src.policies:train:116 - Epoch 507 / 800
2021-08-26 22:50:56.442 | INFO     | src.policies:collect_trajectories:213 - Episode 1128
2021-08-26 22:50:56.492 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:56.494 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 138.0
2021-08-26 22:50:56.495 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 138.0
2021-08-26 22:50:56.496 | INFO     | src.policies:collect_trajectories:213 - Episode 1129
2021-08-26 22:50:56.561 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:56.563 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 170.0
2021-08-26 22:50:56.564 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 154.0
2021-08-26 22:50:56.569 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:50:56

2021-08-26 22:50:56.934 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.02181559056043625
2021-08-26 22:50:56.936 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.3183847665786743
2021-08-26 22:50:56.938 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.02181559056043625
2021-08-26 22:50:56.939 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.3183847665786743
2021-08-26 22:50:56.942 | INFO     | src.policies:train:116 - Epoch 510 / 800
2021-08-26 22:50:56.943 | INFO     | src.policies:collect_trajectories:213 - Episode 1134
2021-08-26 22:50:57.017 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:57.018 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:50:57.019 | INFO     | src.policies:collect_trajectories:230 - Last 100 ep

2021-08-26 22:50:57.638 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.11095424741506577
2021-08-26 22:50:57.640 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.257376104593277
2021-08-26 22:50:57.642 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.11095424741506577
2021-08-26 22:50:57.644 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.257376104593277
2021-08-26 22:50:57.646 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:50:57.648 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.16152982413768768
2021-08-26 22:50:57.650 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.18907244503498077
2021-08-26 22:50:57.652 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.4406777024269104
2021-08-26 22:50:57.654 | INF

2021-08-26 22:50:58.134 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 98.0
2021-08-26 22:50:58.134 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 122.5
2021-08-26 22:50:58.140 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:50:58.143 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.11454181373119354
2021-08-26 22:50:58.145 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.17266294360160828
2021-08-26 22:50:58.147 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.10428410023450851
2021-08-26 22:50:58.149 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.17266294360160828
2021-08-26 22:50:58.151 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.10428410023450851
2021-08-26 22:50:58.153 | INFO     | src.policies:train:152 -

2021-08-26 22:50:58.590 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.515013575553894
2021-08-26 22:50:58.591 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.024272123351693153
2021-08-26 22:50:58.593 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999910593032837
2021-08-26 22:50:58.595 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.024272123351693153
2021-08-26 22:50:58.598 | INFO     | src.policies:train:116 - Epoch 520 / 800
2021-08-26 22:50:58.599 | INFO     | src.policies:collect_trajectories:213 - Episode 1149
2021-08-26 22:50:58.671 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:58.672 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:50:58.673 | INFO     | src.policies:collect_trajectories:230 - Last 100 

2021-08-26 22:50:59.085 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.17296797037124634
2021-08-26 22:50:59.087 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.35370615124702454
2021-08-26 22:50:59.088 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.17296797037124634
2021-08-26 22:50:59.090 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.35370615124702454
2021-08-26 22:50:59.093 | INFO     | src.policies:train:116 - Epoch 523 / 800
2021-08-26 22:50:59.094 | INFO     | src.policies:collect_trajectories:213 - Episode 1154
2021-08-26 22:50:59.167 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:59.168 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:50:59.169 | INFO     | src.policies:collect_trajectories:230 - Last 100 

2021-08-26 22:50:59.672 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.44540247321128845
2021-08-26 22:50:59.673 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.06728771328926086
2021-08-26 22:50:59.675 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.44540247321128845
2021-08-26 22:50:59.677 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.06728771328926086
2021-08-26 22:50:59.680 | INFO     | src.policies:train:116 - Epoch 526 / 800
2021-08-26 22:50:59.681 | INFO     | src.policies:collect_trajectories:213 - Episode 1159
2021-08-26 22:50:59.742 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:50:59.743 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 179.0
2021-08-26 22:50:59.744 | INFO     | src.policies:collect_trajectories:230 - Last 100 

2021-08-26 22:51:00.158 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.005086780525743961
2021-08-26 22:51:00.161 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.14336062967777252
2021-08-26 22:51:00.164 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.005086780525743961
2021-08-26 22:51:00.167 | INFO     | src.policies:train:116 - Epoch 529 / 800
2021-08-26 22:51:00.169 | INFO     | src.policies:collect_trajectories:213 - Episode 1165
2021-08-26 22:51:00.206 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:00.207 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 96.0
2021-08-26 22:51:00.208 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 96.0
2021-08-26 22:51:00.209 | INFO     | src.policies:collect_trajectories:213 - Episode 1166
2021-08-

2021-08-26 22:51:00.690 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:51:00.692 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.0638907253742218
2021-08-26 22:51:00.694 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.40353262424468994
2021-08-26 22:51:00.696 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.07894197851419449
2021-08-26 22:51:00.698 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.40353262424468994
2021-08-26 22:51:00.701 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.07894197851419449
2021-08-26 22:51:00.704 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:51:00.706 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.11210939288139343
2021-08-26 22:51:00.709 | INFO     | src.policies:minibatch_update:277 - Policy network L2 grad

2021-08-26 22:51:01.239 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:51:01.241 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.22910571098327637
2021-08-26 22:51:01.244 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.33839544653892517
2021-08-26 22:51:01.245 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.007446709088981152
2021-08-26 22:51:01.247 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.33839544653892517
2021-08-26 22:51:01.250 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.007446709088981152
2021-08-26 22:51:01.253 | INFO     | src.policies:train:116 - Epoch 536 / 800
2021-08-26 22:51:01.254 | INFO     | src.policies:collect_trajectories:213 - Episode 1177
2021-08-26 22:51:01.317 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021

2021-08-26 22:51:01.796 | INFO     | src.policies:train:116 - Epoch 539 / 800
2021-08-26 22:51:01.797 | INFO     | src.policies:collect_trajectories:213 - Episode 1182
2021-08-26 22:51:01.830 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:01.831 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 92.0
2021-08-26 22:51:01.832 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 92.0
2021-08-26 22:51:01.833 | INFO     | src.policies:collect_trajectories:213 - Episode 1183
2021-08-26 22:51:01.893 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:01.894 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 164.0
2021-08-26 22:51:01.895 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 128.0
2021-08-26 22:51:01.901 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:51:01.9

2021-08-26 22:51:02.394 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 188.0
2021-08-26 22:51:02.395 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 189.5
2021-08-26 22:51:02.401 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:51:02.404 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.32051125168800354
2021-08-26 22:51:02.406 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.42937374114990234
2021-08-26 22:51:02.407 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.11837245523929596
2021-08-26 22:51:02.409 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.42937374114990234
2021-08-26 22:51:02.411 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.11837245523929596
2021-08-26 22:51:02.413 | INFO     | src.policies:train:152 

2021-08-26 22:51:02.911 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.06250330060720444
2021-08-26 22:51:02.913 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.5874732732772827
2021-08-26 22:51:02.915 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.06250330060720444
2021-08-26 22:51:02.917 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.49999910593032837
2021-08-26 22:51:02.919 | INFO     | src.policies:train:116 - Epoch 546 / 800
2021-08-26 22:51:02.920 | INFO     | src.policies:collect_trajectories:213 - Episode 1193
2021-08-26 22:51:02.977 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:02.978 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 163.0
2021-08-26 22:51:02.979 | INFO     | src.policies:collect_trajectories:230 - Last 100 e

2021-08-26 22:51:03.364 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1606079339981079
2021-08-26 22:51:03.366 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.015070408582687378
2021-08-26 22:51:03.368 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1606079339981079
2021-08-26 22:51:03.370 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.015070408582687378
2021-08-26 22:51:03.373 | INFO     | src.policies:train:116 - Epoch 549 / 800
2021-08-26 22:51:03.373 | INFO     | src.policies:collect_trajectories:213 - Episode 1198
2021-08-26 22:51:03.494 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:03.495 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:51:03.496 | INFO     | src.policies:collect_trajectories:230 - Last 100 

2021-08-26 22:51:03.988 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.13818520307540894
2021-08-26 22:51:03.990 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.07272384315729141
2021-08-26 22:51:03.993 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.13818520307540894
2021-08-26 22:51:03.995 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.07272384315729141
2021-08-26 22:51:03.999 | INFO     | src.policies:train:116 - Epoch 553 / 800
2021-08-26 22:51:04.000 | INFO     | src.policies:collect_trajectories:213 - Episode 1203
2021-08-26 22:51:04.058 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:04.059 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 155.0
2021-08-26 22:51:04.060 | INFO     | src.policies:collect_trajectories:230 - Last 100 

2021-08-26 22:51:04.546 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:04.547 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:51:04.548 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:51:04.552 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:51:04.555 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.02752131223678589
2021-08-26 22:51:04.557 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.07964790612459183
2021-08-26 22:51:04.559 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.13669896125793457
2021-08-26 22:51:04.561 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.07964790612459183
2021-08-26 22:51:04.563 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradie

2021-08-26 22:51:04.966 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1429767906665802
2021-08-26 22:51:04.968 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.12023212760686874
2021-08-26 22:51:04.970 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:51:04.972 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.2004520744085312
2021-08-26 22:51:04.974 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3011122941970825
2021-08-26 22:51:04.976 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.10746533423662186
2021-08-26 22:51:04.978 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.3011122941970825
2021-08-26 22:51:04.980 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1074653342366218

2021-08-26 22:51:05.458 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0077144308015704155
2021-08-26 22:51:05.460 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:51:05.462 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.22534781694412231
2021-08-26 22:51:05.464 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.255471408367157
2021-08-26 22:51:05.466 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.01728423498570919
2021-08-26 22:51:05.467 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.255471408367157
2021-08-26 22:51:05.469 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.01728423498570919
2021-08-26 22:51:05.472 | INFO     | src.policies:train:116 - Epoch 563 / 800
2021-08-26 22:51:05.473 | INFO     | src.policies:collect_t

2021-08-26 22:51:06.063 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.26492998003959656
2021-08-26 22:51:06.065 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:51:06.067 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.13183307647705078
2021-08-26 22:51:06.069 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.5342816710472107
2021-08-26 22:51:06.071 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.26912304759025574
2021-08-26 22:51:06.073 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4999990165233612
2021-08-26 22:51:06.075 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.26912304759025574
2021-08-26 22:51:06.077 | INFO     | src.policies:train:116 - Epoch 566 / 800
2021-08-26 22:51:06.078 | INFO     | src.policies:collect_t

2021-08-26 22:51:06.583 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:51:06.586 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.08938856422901154
2021-08-26 22:51:06.588 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1378592550754547
2021-08-26 22:51:06.590 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.34511518478393555
2021-08-26 22:51:06.591 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1378592550754547
2021-08-26 22:51:06.593 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.34511518478393555
2021-08-26 22:51:06.596 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:51:06.598 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.03544531762599945
2021-08-26 22:51:06.600 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradi

2021-08-26 22:51:07.053 | INFO     | src.policies:collect_trajectories:213 - Episode 1236
2021-08-26 22:51:07.105 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:07.106 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 140.0
2021-08-26 22:51:07.106 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 157.0
2021-08-26 22:51:07.112 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:51:07.114 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.1568862795829773
2021-08-26 22:51:07.116 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.12693236768245697
2021-08-26 22:51:07.118 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.0671130120754242
2021-08-26 22:51:07.120 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.12693236768245697
2021-08-26 

2021-08-26 22:51:07.529 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:51:07.531 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.28473302721977234
2021-08-26 22:51:07.534 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.45023754239082336
2021-08-26 22:51:07.536 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.015701908618211746
2021-08-26 22:51:07.538 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.45023754239082336
2021-08-26 22:51:07.541 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.015701908618211746
2021-08-26 22:51:07.544 | INFO     | src.policies:train:116 - Epoch 574 / 800
2021-08-26 22:51:07.545 | INFO     | src.policies:collect_trajectories:213 - Episode 1241
2021-08-26 22:51:07.594 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021

2021-08-26 22:51:08.144 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 126.0
2021-08-26 22:51:08.145 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 117.0
2021-08-26 22:51:08.150 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:51:08.153 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.02114645391702652
2021-08-26 22:51:08.155 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.08855842798948288
2021-08-26 22:51:08.156 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.24443601071834564
2021-08-26 22:51:08.158 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.08855842798948288
2021-08-26 22:51:08.160 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.24443601071834564
2021-08-26 22:51:08.162 | INFO     | src.policies:train:152 

2021-08-26 22:51:08.671 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.0819208025932312
2021-08-26 22:51:08.674 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.013616879470646381
2021-08-26 22:51:08.676 | INFO     | src.policies:train:116 - Epoch 581 / 800
2021-08-26 22:51:08.677 | INFO     | src.policies:collect_trajectories:213 - Episode 1253
2021-08-26 22:51:08.726 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:08.727 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 137.0
2021-08-26 22:51:08.728 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 137.0
2021-08-26 22:51:08.729 | INFO     | src.policies:collect_trajectories:213 - Episode 1254
2021-08-26 22:51:08.780 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:08.781 | INFO  

2021-08-26 22:51:09.155 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.016943318769335747
2021-08-26 22:51:09.157 | INFO     | src.policies:train:116 - Epoch 584 / 800
2021-08-26 22:51:09.158 | INFO     | src.policies:collect_trajectories:213 - Episode 1259
2021-08-26 22:51:09.217 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:09.218 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 156.0
2021-08-26 22:51:09.218 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 156.0
2021-08-26 22:51:09.219 | INFO     | src.policies:collect_trajectories:213 - Episode 1260
2021-08-26 22:51:09.272 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:09.273 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 147.0
2021-08-26 22:51:09.274 | INFO     | src.policies:collect_trajectori

2021-08-26 22:51:09.653 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:51:09.655 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.17076677083969116
2021-08-26 22:51:09.657 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.07876074314117432
2021-08-26 22:51:09.659 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.0629790797829628
2021-08-26 22:51:09.661 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.07876074314117432
2021-08-26 22:51:09.663 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0629790797829628
2021-08-26 22:51:09.665 | INFO     | src.policies:train:116 - Epoch 587 / 800
2021-08-26 22:51:09.666 | INFO     | src.policies:collect_trajectories:213 - Episode 1265
2021-08-26 22:51:09.733 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:51:10.246 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1638917326927185
2021-08-26 22:51:10.248 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.06297844648361206
2021-08-26 22:51:10.250 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1638917326927185
2021-08-26 22:51:10.252 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.06297844648361206
2021-08-26 22:51:10.254 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:51:10.256 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.09490025043487549
2021-08-26 22:51:10.258 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.03498108685016632
2021-08-26 22:51:10.259 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.2258244752883911
2021-08-26 22:51:10.261 | I

2021-08-26 22:51:10.689 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:51:10.691 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.06971395015716553
2021-08-26 22:51:10.693 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.12113124877214432
2021-08-26 22:51:10.695 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.06354758888483047
2021-08-26 22:51:10.696 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.12113124877214432
2021-08-26 22:51:10.698 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.06354758888483047
2021-08-26 22:51:10.701 | INFO     | src.policies:train:116 - Epoch 593 / 800
2021-08-26 22:51:10.702 | INFO     | src.policies:collect_trajectories:213 - Episode 1275
2021-08-26 22:51:10.750 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:51:11.190 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:51:11.192 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.03389202058315277
2021-08-26 22:51:11.194 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.03203164041042328
2021-08-26 22:51:11.195 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.17175617814064026
2021-08-26 22:51:11.198 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.03203164041042328
2021-08-26 22:51:11.199 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.17175617814064026
2021-08-26 22:51:11.202 | INFO     | src.policies:train:116 - Epoch 596 / 800
2021-08-26 22:51:11.203 | INFO     | src.policies:collect_trajectories:213 - Episode 1280
2021-08-26 22:51:11.239 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:51:11.636 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.5158048868179321
2021-08-26 22:51:11.638 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.05817002058029175
2021-08-26 22:51:11.640 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4999990165233612
2021-08-26 22:51:11.642 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.05817002058029175
2021-08-26 22:51:11.645 | INFO     | src.policies:train:116 - Epoch 599 / 800
2021-08-26 22:51:11.646 | INFO     | src.policies:collect_trajectories:213 - Episode 1286
2021-08-26 22:51:11.740 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:11.741 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 127.0
2021-08-26 22:51:11.742 | INFO     | src.policies:collect_trajectories:230 - Last 100 ep

2021-08-26 22:51:12.231 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.032337237149477005
2021-08-26 22:51:12.270 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2696237564086914
2021-08-26 22:51:12.273 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.032337237149477005
2021-08-26 22:51:12.276 | INFO     | src.policies:train:116 - Epoch 602 / 800
2021-08-26 22:51:12.277 | INFO     | src.policies:collect_trajectories:213 - Episode 1292
2021-08-26 22:51:12.336 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:12.337 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 167.0
2021-08-26 22:51:12.338 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 167.0
2021-08-26 22:51:12.339 | INFO     | src.policies:collect_trajectories:213 - Episode 1293

2021-08-26 22:51:12.758 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.17202167212963104
2021-08-26 22:51:12.760 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.007522324565798044
2021-08-26 22:51:12.762 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:51:12.764 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.3115899860858917
2021-08-26 22:51:12.765 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3710125982761383
2021-08-26 22:51:12.767 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.00852640625089407
2021-08-26 22:51:12.769 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.3710125982761383
2021-08-26 22:51:12.771 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.00852640625089

2021-08-26 22:51:13.223 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:13.224 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 122.0
2021-08-26 22:51:13.225 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 122.0
2021-08-26 22:51:13.226 | INFO     | src.policies:collect_trajectories:213 - Episode 1303
2021-08-26 22:51:13.281 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:13.282 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 153.0
2021-08-26 22:51:13.283 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 137.5
2021-08-26 22:51:13.289 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:51:13.291 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.38995763659477234
2021-08-26 22:51:13.293 | INFO     | src.policies:minibatch_update:277 - Policy ne

2021-08-26 22:51:13.814 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:51:13.817 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.1323886513710022
2021-08-26 22:51:13.819 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.29505494236946106
2021-08-26 22:51:13.820 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03464160114526749
2021-08-26 22:51:13.822 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.29505494236946106
2021-08-26 22:51:13.824 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.03464160114526749
2021-08-26 22:51:13.827 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:51:13.829 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.24939674139022827
2021-08-26 22:51:13.831 | INFO     | src.policies:minibatch_update:277 - Policy network L2 grad

2021-08-26 22:51:14.391 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 122.0
2021-08-26 22:51:14.392 | INFO     | src.policies:collect_trajectories:213 - Episode 1314
2021-08-26 22:51:14.494 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:14.496 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:51:14.497 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 161.0
2021-08-26 22:51:14.502 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:51:14.505 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.2563542127609253
2021-08-26 22:51:14.507 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3765414357185364
2021-08-26 22:51:14.509 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.009449108503758907
2021-08-26 22:51:14.511 | INFO     | 

2021-08-26 22:51:14.921 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:14.923 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 154.0
2021-08-26 22:51:14.923 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 162.5
2021-08-26 22:51:14.929 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:51:14.933 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.17277994751930237
2021-08-26 22:51:14.934 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2646491229534149
2021-08-26 22:51:14.936 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.0885806530714035
2021-08-26 22:51:14.938 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2646491229534149
2021-08-26 22:51:14.940 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient 

2021-08-26 22:51:15.339 | INFO     | src.policies:train:116 - Epoch 619 / 800
2021-08-26 22:51:15.339 | INFO     | src.policies:collect_trajectories:213 - Episode 1325
2021-08-26 22:51:15.390 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:15.391 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 136.0
2021-08-26 22:51:15.392 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 136.0
2021-08-26 22:51:15.393 | INFO     | src.policies:collect_trajectories:213 - Episode 1326
2021-08-26 22:51:15.458 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:15.459 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 179.0
2021-08-26 22:51:15.460 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 157.5
2021-08-26 22:51:15.466 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:51:15

2021-08-26 22:51:15.849 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:51:15.852 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.2444203495979309
2021-08-26 22:51:15.854 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.24990622699260712
2021-08-26 22:51:15.855 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.038495734333992004
2021-08-26 22:51:15.857 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.24990622699260712
2021-08-26 22:51:15.859 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.038495734333992004
2021-08-26 22:51:15.862 | INFO     | src.policies:train:116 - Epoch 623 / 800
2021-08-26 22:51:15.863 | INFO     | src.policies:collect_trajectories:213 - Episode 1330
2021-08-26 22:51:15.923 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:51:16.435 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:51:16.437 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.20054662227630615
2021-08-26 22:51:16.439 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.06387918442487717
2021-08-26 22:51:16.441 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.505163848400116
2021-08-26 22:51:16.442 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.06387918442487717
2021-08-26 22:51:16.445 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4999990165233612
2021-08-26 22:51:16.448 | INFO     | src.policies:train:116 - Epoch 626 / 800
2021-08-26 22:51:16.449 | INFO     | src.policies:collect_trajectories:213 - Episode 1335
2021-08-26 22:51:16.494 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:51:16.876 | INFO     | src.policies:train:116 - Epoch 629 / 800
2021-08-26 22:51:16.877 | INFO     | src.policies:collect_trajectories:213 - Episode 1340
2021-08-26 22:51:16.930 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:16.931 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 145.0
2021-08-26 22:51:16.933 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 145.0
2021-08-26 22:51:16.934 | INFO     | src.policies:collect_trajectories:213 - Episode 1341
2021-08-26 22:51:17.019 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:17.020 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 158.0
2021-08-26 22:51:17.022 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 151.5
2021-08-26 22:51:17.031 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:51:17

2021-08-26 22:51:17.423 | INFO     | src.policies:train:116 - Epoch 632 / 800
2021-08-26 22:51:17.424 | INFO     | src.policies:collect_trajectories:213 - Episode 1345
2021-08-26 22:51:17.488 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:17.489 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 177.0
2021-08-26 22:51:17.490 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 177.0
2021-08-26 22:51:17.491 | INFO     | src.policies:collect_trajectories:213 - Episode 1346
2021-08-26 22:51:17.535 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:17.536 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 113.0
2021-08-26 22:51:17.537 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 145.0
2021-08-26 22:51:17.542 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:51:17

2021-08-26 22:51:17.993 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:17.994 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:51:17.996 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:51:18.000 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:51:18.003 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.20634639263153076
2021-08-26 22:51:18.005 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.11883773654699326
2021-08-26 22:51:18.007 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.6134584546089172
2021-08-26 22:51:18.009 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.11883773654699326
2021-08-26 22:51:18.011 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradien

2021-08-26 22:51:18.623 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:51:18.627 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.05472484230995178
2021-08-26 22:51:18.629 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.11678256094455719
2021-08-26 22:51:18.630 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.08899006247520447
2021-08-26 22:51:18.632 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.11678256094455719
2021-08-26 22:51:18.634 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.08899006247520447
2021-08-26 22:51:18.636 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:51:18.639 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.02622462809085846
2021-08-26 22:51:18.641 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gr

2021-08-26 22:51:19.033 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.11621406674385071
2021-08-26 22:51:19.035 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.27819597721099854
2021-08-26 22:51:19.037 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.11621406674385071
2021-08-26 22:51:19.040 | INFO     | src.policies:train:116 - Epoch 641 / 800
2021-08-26 22:51:19.041 | INFO     | src.policies:collect_trajectories:213 - Episode 1362
2021-08-26 22:51:19.102 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:19.103 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 169.0
2021-08-26 22:51:19.104 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 169.0
2021-08-26 22:51:19.105 | INFO     | src.policies:collect_trajectories:213 - Episode 1363

2021-08-26 22:51:19.594 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.10265800356864929
2021-08-26 22:51:19.596 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.21751205623149872
2021-08-26 22:51:19.599 | INFO     | src.policies:train:116 - Epoch 644 / 800
2021-08-26 22:51:19.600 | INFO     | src.policies:collect_trajectories:213 - Episode 1368
2021-08-26 22:51:19.661 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:19.662 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 173.0
2021-08-26 22:51:19.663 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 173.0
2021-08-26 22:51:19.664 | INFO     | src.policies:collect_trajectories:213 - Episode 1369
2021-08-26 22:51:19.708 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:19.709 | INFO  

2021-08-26 22:51:20.088 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.25076761841773987
2021-08-26 22:51:20.091 | INFO     | src.policies:train:116 - Epoch 647 / 800
2021-08-26 22:51:20.092 | INFO     | src.policies:collect_trajectories:213 - Episode 1374
2021-08-26 22:51:20.142 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:20.143 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 140.0
2021-08-26 22:51:20.144 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 140.0
2021-08-26 22:51:20.145 | INFO     | src.policies:collect_trajectories:213 - Episode 1375
2021-08-26 22:51:20.195 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:20.196 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 135.0
2021-08-26 22:51:20.197 | INFO     | src.policies:collect_trajectorie

2021-08-26 22:51:20.674 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:51:20.676 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.03745415061712265
2021-08-26 22:51:20.678 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.13118641078472137
2021-08-26 22:51:20.680 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.11313624680042267
2021-08-26 22:51:20.683 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.13118641078472137
2021-08-26 22:51:20.686 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.11313624680042267
2021-08-26 22:51:20.691 | INFO     | src.policies:train:116 - Epoch 650 / 800
2021-08-26 22:51:20.692 | INFO     | src.policies:collect_trajectories:213 - Episode 1380
2021-08-26 22:51:20.759 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:51:21.173 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:51:21.175 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.016595035791397095
2021-08-26 22:51:21.177 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.11798276752233505
2021-08-26 22:51:21.179 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.22870662808418274
2021-08-26 22:51:21.181 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.11798276752233505
2021-08-26 22:51:21.183 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.22870662808418274
2021-08-26 22:51:21.185 | INFO     | src.policies:train:116 - Epoch 653 / 800
2021-08-26 22:51:21.186 | INFO     | src.policies:collect_trajectories:213 - Episode 1385
2021-08-26 22:51:21.237 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:51:21.692 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.24446700513362885
2021-08-26 22:51:21.694 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.17619265615940094
2021-08-26 22:51:21.696 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.24446700513362885
2021-08-26 22:51:21.698 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.17619265615940094
2021-08-26 22:51:21.700 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:51:21.702 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.03956633433699608
2021-08-26 22:51:21.704 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2359544187784195
2021-08-26 22:51:21.705 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.15481576323509216
2021-08-26 22:51:21.707 |

2021-08-26 22:51:22.210 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.12696614861488342
2021-08-26 22:51:22.212 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.24345539510250092
2021-08-26 22:51:22.214 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.12696614861488342
2021-08-26 22:51:22.216 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.24345539510250092
2021-08-26 22:51:22.218 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:51:22.220 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.25983625650405884
2021-08-26 22:51:22.222 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.10098832845687866
2021-08-26 22:51:22.224 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.05040422081947327
2021-08-26 22:51:22.226 

2021-08-26 22:51:22.757 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.08248629420995712
2021-08-26 22:51:22.759 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.06637105345726013
2021-08-26 22:51:22.761 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.08248629420995712
2021-08-26 22:51:22.763 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.06637105345726013
2021-08-26 22:51:22.765 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:51:22.767 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.006671123206615448
2021-08-26 22:51:22.769 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.12165968865156174
2021-08-26 22:51:22.771 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.39422091841697693
2021-08-26 22:51:22.772

2021-08-26 22:51:23.152 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1320773959159851
2021-08-26 22:51:23.154 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03768468648195267
2021-08-26 22:51:23.156 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1320773959159851
2021-08-26 22:51:23.157 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.03768468648195267
2021-08-26 22:51:23.160 | INFO     | src.policies:train:116 - Epoch 667 / 800
2021-08-26 22:51:23.161 | INFO     | src.policies:collect_trajectories:213 - Episode 1406
2021-08-26 22:51:23.215 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:23.216 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 157.0
2021-08-26 22:51:23.217 | INFO     | src.policies:collect_trajectories:230 - Last 100 ep

2021-08-26 22:51:23.675 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:51:23.678 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.01735152304172516
2021-08-26 22:51:23.681 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.22803854942321777
2021-08-26 22:51:23.682 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.11906569451093674
2021-08-26 22:51:23.684 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.22803854942321777
2021-08-26 22:51:23.686 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.11906569451093674
2021-08-26 22:51:23.688 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:51:23.690 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.0038774609565734863
2021-08-26 22:51:23.693 | INFO     | src.policies:minibatch_update:277 - Policy network L2 

2021-08-26 22:51:24.162 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.002059429883956909
2021-08-26 22:51:24.164 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.04324883595108986
2021-08-26 22:51:24.166 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.13019856810569763
2021-08-26 22:51:24.168 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.04324883595108986
2021-08-26 22:51:24.170 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.13019856810569763
2021-08-26 22:51:24.172 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:51:24.174 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.011422619223594666
2021-08-26 22:51:24.176 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.055900510400533676
2021-08-26 22:51:24.177 | INFO     | src.p

2021-08-26 22:51:24.700 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.04088572785258293
2021-08-26 22:51:24.702 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:51:24.704 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.23838478326797485
2021-08-26 22:51:24.706 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2073172628879547
2021-08-26 22:51:24.708 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.12728868424892426
2021-08-26 22:51:24.710 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2073172628879547
2021-08-26 22:51:24.712 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.12728868424892426
2021-08-26 22:51:24.715 | INFO     | src.policies:train:116 - Epoch 677 / 800
2021-08-26 22:51:24.716 | INFO     | src.policies:collect_t

2021-08-26 22:51:25.207 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.06158411130309105
2021-08-26 22:51:25.209 | INFO     | src.policies:train:116 - Epoch 680 / 800
2021-08-26 22:51:25.210 | INFO     | src.policies:collect_trajectories:213 - Episode 1428
2021-08-26 22:51:25.261 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:25.262 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 154.0
2021-08-26 22:51:25.263 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 154.0
2021-08-26 22:51:25.264 | INFO     | src.policies:collect_trajectories:213 - Episode 1429
2021-08-26 22:51:25.321 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:25.322 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 160.0
2021-08-26 22:51:25.323 | INFO     | src.policies:collect_trajectorie

2021-08-26 22:51:25.673 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.021233312785625458
2021-08-26 22:51:25.676 | INFO     | src.policies:train:116 - Epoch 683 / 800
2021-08-26 22:51:25.677 | INFO     | src.policies:collect_trajectories:213 - Episode 1433
2021-08-26 22:51:25.748 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:25.749 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:51:25.750 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:51:25.753 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:51:25.756 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.2739567458629608
2021-08-26 22:51:25.758 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2725011110305786
2021-08-26 22:51:25.760 | INFO     | src.policies:minibatc

2021-08-26 22:51:26.234 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:26.235 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 140.0
2021-08-26 22:51:26.236 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 152.0
2021-08-26 22:51:26.241 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:51:26.244 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.035647809505462646
2021-08-26 22:51:26.245 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1894620656967163
2021-08-26 22:51:26.247 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.13485129177570343
2021-08-26 22:51:26.249 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1894620656967163
2021-08-26 22:51:26.251 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradie

2021-08-26 22:51:26.789 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:26.791 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 147.0
2021-08-26 22:51:26.791 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 155.0
2021-08-26 22:51:26.796 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:51:26.799 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.06210167706012726
2021-08-26 22:51:26.801 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.14576245844364166
2021-08-26 22:51:26.802 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.28516674041748047
2021-08-26 22:51:26.804 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.14576245844364166
2021-08-26 22:51:26.806 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradie

2021-08-26 22:51:27.140 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:51:27.142 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.048953235149383545
2021-08-26 22:51:27.143 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1608526110649109
2021-08-26 22:51:27.145 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.16090331971645355
2021-08-26 22:51:27.147 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1608526110649109
2021-08-26 22:51:27.149 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.16090331971645355
2021-08-26 22:51:27.152 | INFO     | src.policies:train:116 - Epoch 692 / 800
2021-08-26 22:51:27.153 | INFO     | src.policies:collect_trajectories:213 - Episode 1449
2021-08-26 22:51:27.206 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:51:27.627 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1150212287902832
2021-08-26 22:51:27.629 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.17771057784557343
2021-08-26 22:51:27.631 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1150212287902832
2021-08-26 22:51:27.634 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.17771057784557343
2021-08-26 22:51:27.636 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:51:27.638 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.03049786388874054
2021-08-26 22:51:27.640 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.047843169420957565
2021-08-26 22:51:27.642 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.0977887287735939

2021-08-26 22:51:28.151 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.02263367921113968
2021-08-26 22:51:28.153 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.14359013736248016
2021-08-26 22:51:28.155 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.02263367921113968
2021-08-26 22:51:28.157 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:51:28.159 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.24269503355026245
2021-08-26 22:51:28.161 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.04564585164189339
2021-08-26 22:51:28.162 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.007498441729694605
2021-08-26 22:51:28.164 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.04564585164189339

2021-08-26 22:51:28.613 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 145.0
2021-08-26 22:51:28.619 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:51:28.621 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.0009727329015731812
2021-08-26 22:51:28.623 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1392420083284378
2021-08-26 22:51:28.624 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.0438799113035202
2021-08-26 22:51:28.626 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1392420083284378
2021-08-26 22:51:28.628 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0438799113035202
2021-08-26 22:51:28.630 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:51:28.632 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.

2021-08-26 22:51:29.172 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:29.174 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 181.0
2021-08-26 22:51:29.175 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 156.5
2021-08-26 22:51:29.184 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:51:29.187 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.07271414995193481
2021-08-26 22:51:29.190 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1240379586815834
2021-08-26 22:51:29.192 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.27206122875213623
2021-08-26 22:51:29.195 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1240379586815834
2021-08-26 22:51:29.196 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient

2021-08-26 22:51:29.598 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:51:29.600 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.06383132934570312
2021-08-26 22:51:29.602 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.20710188150405884
2021-08-26 22:51:29.603 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.048496875911951065
2021-08-26 22:51:29.605 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.20710188150405884
2021-08-26 22:51:29.607 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.048496875911951065
2021-08-26 22:51:29.610 | INFO     | src.policies:train:116 - Epoch 706 / 800
2021-08-26 22:51:29.611 | INFO     | src.policies:collect_trajectories:213 - Episode 1478
2021-08-26 22:51:29.667 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:51:30.138 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.117725670337677
2021-08-26 22:51:30.140 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1286548674106598
2021-08-26 22:51:30.141 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.117725670337677
2021-08-26 22:51:30.143 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1286548674106598
2021-08-26 22:51:30.146 | INFO     | src.policies:train:116 - Epoch 709 / 800
2021-08-26 22:51:30.147 | INFO     | src.policies:collect_trajectories:213 - Episode 1484
2021-08-26 22:51:30.161 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:30.162 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 39.0
2021-08-26 22:51:30.163 | INFO     | src.policies:collect_trajectories:230 - Last 100 episode

2021-08-26 22:51:30.583 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.16025902330875397
2021-08-26 22:51:30.585 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.15339185297489166
2021-08-26 22:51:30.587 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:51:30.589 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.013208389282226562
2021-08-26 22:51:30.592 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.15089696645736694
2021-08-26 22:51:30.593 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.0787583738565445
2021-08-26 22:51:30.595 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.15089696645736694
2021-08-26 22:51:30.597 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.078758373856

2021-08-26 22:51:31.094 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1382221281528473
2021-08-26 22:51:31.097 | INFO     | src.policies:train:116 - Epoch 715 / 800
2021-08-26 22:51:31.098 | INFO     | src.policies:collect_trajectories:213 - Episode 1496
2021-08-26 22:51:31.151 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:31.152 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 166.0
2021-08-26 22:51:31.153 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 166.0
2021-08-26 22:51:31.154 | INFO     | src.policies:collect_trajectories:213 - Episode 1497
2021-08-26 22:51:31.256 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:31.257 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 181.0
2021-08-26 22:51:31.258 | INFO     | src.policies:collect_trajectories

2021-08-26 22:51:31.574 | INFO     | src.policies:collect_trajectories:213 - Episode 1502
2021-08-26 22:51:31.617 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:31.618 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 117.0
2021-08-26 22:51:31.619 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 129.0
2021-08-26 22:51:31.625 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:51:31.627 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.003215193748474121
2021-08-26 22:51:31.629 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.05142078921198845
2021-08-26 22:51:31.631 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.09892131388187408
2021-08-26 22:51:31.633 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.05142078921198845

2021-08-26 22:51:32.092 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.21241021156311035
2021-08-26 22:51:32.094 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.14868932962417603
2021-08-26 22:51:32.096 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.21241021156311035
2021-08-26 22:51:32.098 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:51:32.100 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.04942697286605835
2021-08-26 22:51:32.102 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.14201036095619202
2021-08-26 22:51:32.103 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.135990709066391
2021-08-26 22:51:32.105 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.14201036095619202

2021-08-26 22:51:32.565 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 151.0
2021-08-26 22:51:32.571 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:51:32.574 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.06497043371200562
2021-08-26 22:51:32.576 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.15888243913650513
2021-08-26 22:51:32.578 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.21177133917808533
2021-08-26 22:51:32.579 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.15888243913650513
2021-08-26 22:51:32.581 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.21177133917808533
2021-08-26 22:51:32.583 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:51:32.585 | INFO     | src.policies:minibatch_update:270 - Total loss: 

2021-08-26 22:51:33.063 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.15821883082389832
2021-08-26 22:51:33.065 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.06665670871734619
2021-08-26 22:51:33.067 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:51:33.069 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.03812667727470398
2021-08-26 22:51:33.070 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.09772111475467682
2021-08-26 22:51:33.072 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.11614081263542175
2021-08-26 22:51:33.074 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.09772111475467682
2021-08-26 22:51:33.076 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.116140812635

2021-08-26 22:51:33.515 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.07820313423871994
2021-08-26 22:51:33.517 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:51:33.519 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.012768641114234924
2021-08-26 22:51:33.521 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.06148962303996086
2021-08-26 22:51:33.523 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.09693700820207596
2021-08-26 22:51:33.524 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.06148962303996086
2021-08-26 22:51:33.526 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.09693700820207596
2021-08-26 22:51:33.529 | INFO     | src.policies:train:116 - Epoch 731 / 800
2021-08-26 22:51:33.530 | INFO     | src.policies:collec

2021-08-26 22:51:34.010 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:34.012 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 136.0
2021-08-26 22:51:34.012 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 129.0
2021-08-26 22:51:34.017 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:51:34.020 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.01917821168899536
2021-08-26 22:51:34.022 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.10209304094314575
2021-08-26 22:51:34.024 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.2384638786315918
2021-08-26 22:51:34.025 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.10209304094314575
2021-08-26 22:51:34.027 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradie

2021-08-26 22:51:34.414 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.17548899352550507
2021-08-26 22:51:34.416 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.22112996876239777
2021-08-26 22:51:34.418 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:51:34.420 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.018655136227607727
2021-08-26 22:51:34.422 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.10102660953998566
2021-08-26 22:51:34.423 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.5602988004684448
2021-08-26 22:51:34.425 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.10102660953998566
2021-08-26 22:51:34.427 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.499999076128

2021-08-26 22:51:34.822 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.09737395495176315
2021-08-26 22:51:34.825 | INFO     | src.policies:train:116 - Epoch 741 / 800
2021-08-26 22:51:34.826 | INFO     | src.policies:collect_trajectories:213 - Episode 1545
2021-08-26 22:51:34.869 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:34.871 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 130.0
2021-08-26 22:51:34.871 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 130.0
2021-08-26 22:51:34.872 | INFO     | src.policies:collect_trajectories:213 - Episode 1546
2021-08-26 22:51:34.942 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:34.943 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:51:34.944 | INFO     | src.policies:collect_trajectorie

2021-08-26 22:51:35.511 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 143.0
2021-08-26 22:51:35.512 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 113.66666666666667
2021-08-26 22:51:35.517 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:51:35.520 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.01774004101753235
2021-08-26 22:51:35.521 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.07129848748445511
2021-08-26 22:51:35.523 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.21627651154994965
2021-08-26 22:51:35.525 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.07129848748445511
2021-08-26 22:51:35.527 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.21627651154994965
2021-08-26 22:51:35.529 | INFO     | src.polic

2021-08-26 22:51:35.967 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.07462441176176071
2021-08-26 22:51:35.969 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.0597870908677578
2021-08-26 22:51:35.971 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.07462441176176071
2021-08-26 22:51:35.972 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0597870908677578
2021-08-26 22:51:35.974 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:51:35.976 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.00268767774105072
2021-08-26 22:51:35.978 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.09012290090322495
2021-08-26 22:51:35.980 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.13530796766281128

2021-08-26 22:51:36.397 | INFO     | src.policies:train:116 - Epoch 751 / 800
2021-08-26 22:51:36.398 | INFO     | src.policies:collect_trajectories:213 - Episode 1560
2021-08-26 22:51:36.443 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:36.444 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 125.0
2021-08-26 22:51:36.445 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 125.0
2021-08-26 22:51:36.445 | INFO     | src.policies:collect_trajectories:213 - Episode 1561
2021-08-26 22:51:36.492 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:36.493 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 132.0
2021-08-26 22:51:36.494 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 128.5
2021-08-26 22:51:36.499 | INFO     | src.policies:train:152 - Mini-batch 1 / 2

2021-08-26 22:51:36.895 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.27891090512275696
2021-08-26 22:51:36.896 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.07398828119039536
2021-08-26 22:51:36.899 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.27891090512275696
2021-08-26 22:51:36.901 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.07398828119039536
2021-08-26 22:51:36.904 | INFO     | src.policies:train:116 - Epoch 754 / 800
2021-08-26 22:51:36.904 | INFO     | src.policies:collect_trajectories:213 - Episode 1566
2021-08-26 22:51:36.948 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:36.950 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 132.0
2021-08-26 22:51:36.950 | INFO     | src.policies:collect_trajectories:230 - Last 100 

2021-08-26 22:51:37.499 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:37.500 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:51:37.501 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:51:37.506 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:51:37.508 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.35202187299728394
2021-08-26 22:51:37.510 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1309645175933838
2021-08-26 22:51:37.512 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.014424742199480534
2021-08-26 22:51:37.513 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1309645175933838
2021-08-26 22:51:37.515 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradien

2021-08-26 22:51:37.885 | INFO     | src.policies:collect_trajectories:213 - Episode 1578
2021-08-26 22:51:38.004 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:38.006 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:51:38.006 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 177.5
2021-08-26 22:51:38.012 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:51:38.016 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.3708369731903076
2021-08-26 22:51:38.018 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.11294364184141159
2021-08-26 22:51:38.019 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.014974146150052547
2021-08-26 22:51:38.021 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.11294364184141159

2021-08-26 22:51:38.325 | INFO     | src.policies:train:116 - Epoch 763 / 800
2021-08-26 22:51:38.326 | INFO     | src.policies:collect_trajectories:213 - Episode 1583
2021-08-26 22:51:38.391 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:38.392 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 188.0
2021-08-26 22:51:38.392 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 188.0
2021-08-26 22:51:38.393 | INFO     | src.policies:collect_trajectories:213 - Episode 1584
2021-08-26 22:51:38.439 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:38.440 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 129.0
2021-08-26 22:51:38.440 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 158.5
2021-08-26 22:51:38.446 | INFO     | src.policies:train:152 - Mini-batch 1 / 3

2021-08-26 22:51:38.812 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.12191780656576157
2021-08-26 22:51:38.813 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.006434567738324404
2021-08-26 22:51:38.815 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.12191780656576157
2021-08-26 22:51:38.817 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.006434567738324404
2021-08-26 22:51:38.819 | INFO     | src.policies:train:116 - Epoch 766 / 800
2021-08-26 22:51:38.820 | INFO     | src.policies:collect_trajectories:213 - Episode 1589
2021-08-26 22:51:38.889 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:38.890 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:51:38.891 | INFO     | src.policies:collect_trajectories:230 - Last 10

2021-08-26 22:51:39.396 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:39.397 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 124.0
2021-08-26 22:51:39.398 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 124.0
2021-08-26 22:51:39.399 | INFO     | src.policies:collect_trajectories:213 - Episode 1595
2021-08-26 22:51:39.433 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:39.434 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 94.0
2021-08-26 22:51:39.434 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 109.0
2021-08-26 22:51:39.440 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:51:39.442 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.18914681673049927
2021-08-26 22:51:39.445 | INFO     | src.policies:minibatch_update:277 - Policy net

2021-08-26 22:51:39.838 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.21515782177448273
2021-08-26 22:51:39.840 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.16554126143455505
2021-08-26 22:51:39.842 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.21515782177448273
2021-08-26 22:51:39.844 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:51:39.846 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.12649156153202057
2021-08-26 22:51:39.848 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.07443153113126755
2021-08-26 22:51:39.850 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.23082399368286133
2021-08-26 22:51:39.851 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.07443153113126755

2021-08-26 22:51:40.423 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 156.5
2021-08-26 22:51:40.428 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:51:40.431 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.33822768926620483
2021-08-26 22:51:40.433 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.18990498781204224
2021-08-26 22:51:40.434 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.0028217018116265535
2021-08-26 22:51:40.436 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.18990498781204224
2021-08-26 22:51:40.438 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0028217018116265535
2021-08-26 22:51:40.440 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:51:40.442 | INFO     | src.policies:minibatch_update:270 - Total lo

2021-08-26 22:51:40.874 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.26053714752197266
2021-08-26 22:51:40.876 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.37934058904647827
2021-08-26 22:51:40.878 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.003005524631589651
2021-08-26 22:51:40.879 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.37934058904647827
2021-08-26 22:51:40.881 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.003005524631589651
2021-08-26 22:51:40.883 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:51:40.885 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.27629247307777405
2021-08-26 22:51:40.887 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.0994214415550232
2021-08-26 22:51:40.889 | INFO     | src.polic

2021-08-26 22:51:41.243 | INFO     | src.policies:collect_trajectories:213 - Episode 1617
2021-08-26 22:51:41.266 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:41.267 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 61.0
2021-08-26 22:51:41.268 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 96.5
2021-08-26 22:51:41.269 | INFO     | src.policies:collect_trajectories:213 - Episode 1618
2021-08-26 22:51:41.426 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:41.427 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 105.0
2021-08-26 22:51:41.428 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 99.33333333333333
2021-08-26 22:51:41.434 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:51:41.436 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.

2021-08-26 22:51:41.825 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.33014485239982605
2021-08-26 22:51:41.827 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.531157374382019
2021-08-26 22:51:41.828 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.015719881281256676
2021-08-26 22:51:41.830 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4999990165233612
2021-08-26 22:51:41.832 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.015719881281256676
2021-08-26 22:51:41.834 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:51:41.836 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.21804016828536987
2021-08-26 22:51:41.838 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1798676997423172
2021-08-26 22:51:41.840 | INFO     | src.policies

2021-08-26 22:51:42.226 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.2013465315103531
2021-08-26 22:51:42.229 | INFO     | src.policies:train:116 - Epoch 787 / 800
2021-08-26 22:51:42.230 | INFO     | src.policies:collect_trajectories:213 - Episode 1629
2021-08-26 22:51:42.272 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:42.273 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 112.0
2021-08-26 22:51:42.274 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 112.0
2021-08-26 22:51:42.275 | INFO     | src.policies:collect_trajectories:213 - Episode 1630
2021-08-26 22:51:42.327 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:42.329 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 153.0
2021-08-26 22:51:42.329 | INFO     | src.policies:collect_trajectories

2021-08-26 22:51:42.782 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 156.0
2021-08-26 22:51:42.788 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:51:42.790 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.07145292311906815
2021-08-26 22:51:42.792 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.11236056685447693
2021-08-26 22:51:42.794 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.2002604603767395
2021-08-26 22:51:42.796 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.11236056685447693
2021-08-26 22:51:42.798 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.2002604603767395
2021-08-26 22:51:42.800 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:51:42.802 | INFO     | src.policies:minibatch_update:270 - Total loss: -0

2021-08-26 22:51:43.185 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.08903299272060394
2021-08-26 22:51:43.187 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.10998163372278214
2021-08-26 22:51:43.188 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.08903299272060394
2021-08-26 22:51:43.190 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.10998163372278214
2021-08-26 22:51:43.193 | INFO     | src.policies:train:116 - Epoch 793 / 800
2021-08-26 22:51:43.194 | INFO     | src.policies:collect_trajectories:213 - Episode 1641
2021-08-26 22:51:43.245 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:43.246 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 149.0
2021-08-26 22:51:43.246 | INFO     | src.policies:collect_trajectories:230 - Last 100 

2021-08-26 22:51:43.781 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:43.782 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 117.0
2021-08-26 22:51:43.783 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 117.0
2021-08-26 22:51:43.784 | INFO     | src.policies:collect_trajectories:213 - Episode 1647
2021-08-26 22:51:43.844 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:51:43.845 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 169.0
2021-08-26 22:51:43.846 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 143.0
2021-08-26 22:51:43.850 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:51:43.853 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.001399010419845581
2021-08-26 22:51:43.856 | INFO     | src.policies:minibatch_update:277 - Policy n

2021-08-26 22:51:44.321 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:51:44.324 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.05085162818431854
2021-08-26 22:51:44.326 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.052461765706539154
2021-08-26 22:51:44.327 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.13830402493476868
2021-08-26 22:51:44.329 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.052461765706539154
2021-08-26 22:51:44.331 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.13830402493476868
2021-08-26 22:51:44.334 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:51:44.335 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.011080294847488403
2021-08-26 22:51:44.337 | INFO     | src.policies:minibatch_update:277 - Policy network L2 

| Metric | Value |
| --- | --- |
| `loss` | 0.08073 |
| `mean_return` | 135.5 |
| `_runtime` | 134.0 |
| `_timestamp` | 1630011104.0 |
| `_step` | 799.0 |


| Metric | History |
| --- | --- |
| `loss` | █▅▅▅▃▃▃▅▆▃▃▆▂▅▂▆▂▄▅▅▂▃▅▆▇▂▂▂▁▃▄▁▁▂▁▂▂█▂▂ |
| `mean_return` | ▁▁▃▄▄▆█▇▄███▇▆▇██▆▄▂█▅▃▅▅█▇▅▅▆▆▆▆▇▆▅▆▇█▅ |
| `_runtime` | ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
| `_timestamp` | ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
| `_step` | ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |


## TRPO

This section deals with training a Cartpole agent using our custom Trust Region Policy Optimization implementation.

In [76]:
beta = 1.0
kl_target = 0.01
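
The notebook does not show how `TRPOPolicy` consumes `beta` and `kl_target`. The names suggest a KL-penalized surrogate objective with an adaptive penalty coefficient rather than a hard trust-region constraint, but that is an assumption. The sketch below illustrates that reading in plain Python for a two-action distribution such as Cartpole's. The helper names (`kl_categorical`, `penalized_surrogate`, `adapt_beta`) and the `factor` and `tolerance` defaults are hypothetical and are not taken from `src.policies`.

```python
import math


def kl_categorical(p_old, p_new):
    """KL(p_old || p_new) between two discrete action distributions."""
    return sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po > 0.0)


def penalized_surrogate(p_old, p_new, action, advantage, beta):
    """Importance-weighted advantage minus beta times the KL divergence (one sample)."""
    ratio = p_new[action] / p_old[action]
    return ratio * advantage - beta * kl_categorical(p_old, p_new)


def adapt_beta(beta, observed_kl, kl_target, factor=2.0, tolerance=1.5):
    """Hypothetical schedule: raise beta if the KL overshoots the target, lower it if it undershoots."""
    if observed_kl > tolerance * kl_target:
        return beta * factor
    if observed_kl < kl_target / tolerance:
        return beta / factor
    return beta


# Same hyperparameter values as the cell above, repeated so the sketch is self-contained.
beta, kl_target = 1.0, 0.01
p_old = [0.5, 0.5]  # action probabilities before the update
p_new = [0.6, 0.4]  # action probabilities after the update

kl = kl_categorical(p_old, p_new)
objective = penalized_surrogate(p_old, p_new, action=0, advantage=1.0, beta=beta)
new_beta = adapt_beta(beta, kl, kl_target)
print(f"KL={kl:.4f}  objective={objective:.4f}  new beta={new_beta}")
```

In this toy update the KL divergence is roughly twice `kl_target`, so the hypothetical schedule doubles `beta`, which penalizes large policy changes more heavily in the next update.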

In [77]:
trpo_policy_nn = models.MLP(observation_space_size, hidden_sizes, action_space_size)
trpo_baseline_nn = models.MLP(observation_space_size, hidden_sizes, 1, log_softmax=False)
trpo_policy = policies.TRPOPolicy(env, trpo_policy_nn, trpo_baseline_nn, beta=beta, kl_target=kl_target)
trpo_policy.train(
    epochs,
    steps_per_epoch,
    minibatch_size,
    enable_wandb=True,
    wandb_config={**wandb_config, "group": "TRPO"},
    episodes_mean_return=episodes_mean_return
)

[34m[1mwandb[0m: wandb version 0.12.1 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade


2021-08-26 22:52:20.507 | INFO     | src.policies:train:116 - Epoch 1 / 800
2021-08-26 22:52:20.508 | INFO     | src.policies:collect_trajectories:213 - Episode 1
2021-08-26 22:52:20.523 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:20.524 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 33.0
2021-08-26 22:52:20.525 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 33.0
2021-08-26 22:52:20.526 | INFO     | src.policies:collect_trajectories:213 - Episode 2
2021-08-26 22:52:20.540 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:20.541 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 22.0
2021-08-26 22:52:20.542 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 27.5
2021-08-26 22:52:20.543 | INFO     | src.policies:collect_trajectories:213 - Episode 3
2021-08-26 22:52:20.551

2021-08-26 22:52:20.807 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:52:20.810 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.5782373547554016
2021-08-26 22:52:20.812 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.06215010583400726
2021-08-26 22:52:20.814 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 1.6139529943466187
2021-08-26 22:52:20.816 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.06215010583400726
2021-08-26 22:52:20.818 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4999997019767761
2021-08-26 22:52:20.821 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:52:20.823 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.5835348963737488
2021-08-26 22:52:20.826 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradien

2021-08-26 22:52:21.208 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:21.209 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 16.0
2021-08-26 22:52:21.210 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 16.666666666666668
2021-08-26 22:52:21.210 | INFO     | src.policies:collect_trajectories:213 - Episode 30
2021-08-26 22:52:21.220 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:21.221 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 18.0
2021-08-26 22:52:21.221 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 17.0
2021-08-26 22:52:21.222 | INFO     | src.policies:collect_trajectories:213 - Episode 31
2021-08-26 22:52:21.248 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:21.249 | INFO     | src.policies:collect_trajecto

2021-08-26 22:52:21.464 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.5345230102539062
2021-08-26 22:52:21.467 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.05467909201979637
2021-08-26 22:52:21.468 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 1.6633318662643433
2021-08-26 22:52:21.470 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.05467909201979637
2021-08-26 22:52:21.472 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4999997317790985
2021-08-26 22:52:21.474 | INFO     | src.policies:train:116 - Epoch 6 / 800
2021-08-26 22:52:21.475 | INFO     | src.policies:collect_trajectories:213 - Episode 43
2021-08-26 22:52:21.489 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:21.491 | INFO     | src.policies:collect_trajectories:229 - Mean episode re

2021-08-26 22:52:21.796 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:21.796 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 11.0
2021-08-26 22:52:21.797 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 28.0
2021-08-26 22:52:21.798 | INFO     | src.policies:collect_trajectories:213 - Episode 58
2021-08-26 22:52:21.808 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:21.809 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 19.0
2021-08-26 22:52:21.810 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 26.875
2021-08-26 22:52:21.815 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:52:21.818 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.5392647385597229
2021-08-26 22:52:21.820 | INFO     | src.policies:minibatch_update:277 - Policy network

2021-08-26 22:52:22.029 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 59.0
2021-08-26 22:52:22.029 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 36.5
2021-08-26 22:52:22.030 | INFO     | src.policies:collect_trajectories:213 - Episode 70
2021-08-26 22:52:22.044 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:22.045 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 26.0
2021-08-26 22:52:22.046 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 33.0
2021-08-26 22:52:22.047 | INFO     | src.policies:collect_trajectories:213 - Episode 71
2021-08-26 22:52:22.065 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:22.066 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 39.0
2021-08-26 22:52:22.067 | INFO     | src.policies:collect_trajectories:230 - Last

2021-08-26 22:52:22.354 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 15.0
2021-08-26 22:52:22.354 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.1
2021-08-26 22:52:22.361 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:52:22.363 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.505031943321228
2021-08-26 22:52:22.366 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.02263684757053852
2021-08-26 22:52:22.367 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 1.588011384010315
2021-08-26 22:52:22.369 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.02263684757053852
2021-08-26 22:52:22.371 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.49999967217445374
2021-08-26 22:52:22.373 | INFO     | src.policies:train:152 - Mini

2021-08-26 22:52:22.689 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 33.0
2021-08-26 22:52:22.690 | INFO     | src.policies:collect_trajectories:213 - Episode 98
2021-08-26 22:52:22.704 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:22.705 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 19.0
2021-08-26 22:52:22.706 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 28.333333333333332
2021-08-26 22:52:22.707 | INFO     | src.policies:collect_trajectories:213 - Episode 99
2021-08-26 22:52:22.716 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:22.717 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 11.0
2021-08-26 22:52:22.718 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 24.0
2021-08-26 22:52:22.718 | INFO     | src.policies:collect

2021-08-26 22:52:22.984 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 21.666666666666668
2021-08-26 22:52:22.985 | INFO     | src.policies:collect_trajectories:213 - Episode 114
2021-08-26 22:52:22.992 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:22.993 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 13.0
2021-08-26 22:52:22.994 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.8
2021-08-26 22:52:23.002 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:52:23.004 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.47702494263648987
2021-08-26 22:52:23.007 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.09040562808513641
2021-08-26 22:52:23.009 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 1.5703024864196777
2021-08-26 22:52:23.011 | I

2021-08-26 22:52:23.212 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 22.0
2021-08-26 22:52:23.213 | INFO     | src.policies:collect_trajectories:213 - Episode 126
2021-08-26 22:52:23.220 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:23.221 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 10.0
2021-08-26 22:52:23.221 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 16.0
2021-08-26 22:52:23.222 | INFO     | src.policies:collect_trajectories:213 - Episode 127
2021-08-26 22:52:23.229 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:23.230 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 13.0
2021-08-26 22:52:23.231 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 15.0
2021-08-26 22:52:23.231 | INFO     | src.policies:collect_trajectorie

2021-08-26 22:52:23.528 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 27.666666666666668
2021-08-26 22:52:23.528 | INFO     | src.policies:collect_trajectories:213 - Episode 142
2021-08-26 22:52:23.536 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:23.537 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 16.0
2021-08-26 22:52:23.538 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 26.0
2021-08-26 22:52:23.539 | INFO     | src.policies:collect_trajectories:213 - Episode 143
2021-08-26 22:52:23.556 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:23.557 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 42.0
2021-08-26 22:52:23.558 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 28.0
2021-08-26 22:52:23.563 | INFO     | src.policies:train

2021-08-26 22:52:23.732 | INFO     | src.policies:train:116 - Epoch 18 / 800
2021-08-26 22:52:23.733 | INFO     | src.policies:collect_trajectories:213 - Episode 154
2021-08-26 22:52:23.743 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:23.744 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 22.0
2021-08-26 22:52:23.745 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 22.0
2021-08-26 22:52:23.746 | INFO     | src.policies:collect_trajectories:213 - Episode 155
2021-08-26 22:52:23.755 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:23.756 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 19.0
2021-08-26 22:52:23.756 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.5
2021-08-26 22:52:23.757 | INFO     | src.policies:collect_trajectories:213 - Episode 156
2021-08-26 22:52

2021-08-26 22:52:24.119 | INFO     | src.policies:collect_trajectories:213 - Episode 170
2021-08-26 22:52:24.132 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:24.133 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 23.0
2021-08-26 22:52:24.134 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 21.142857142857142
2021-08-26 22:52:24.135 | INFO     | src.policies:collect_trajectories:213 - Episode 171
2021-08-26 22:52:24.148 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:24.149 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 24.0
2021-08-26 22:52:24.150 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 21.5
2021-08-26 22:52:24.152 | INFO     | src.policies:collect_trajectories:213 - Episode 172
2021-08-26 22:52:24.161 | DEBUG    | src.policies:execute_episode:398 - Early s

2021-08-26 22:52:24.359 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:52:24.361 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.4383414387702942
2021-08-26 22:52:24.363 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.0904468223452568
2021-08-26 22:52:24.364 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 1.455326795578003
2021-08-26 22:52:24.366 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.0904468223452568
2021-08-26 22:52:24.368 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4999997019767761
2021-08-26 22:52:24.370 | INFO     | src.policies:train:116 - Epoch 21 / 800
2021-08-26 22:52:24.371 | INFO     | src.policies:collect_trajectories:213 - Episode 184
2021-08-26 22:52:24.382 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:

2021-08-26 22:52:24.797 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:24.798 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 17.0
2021-08-26 22:52:24.799 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 32.833333333333336
2021-08-26 22:52:24.800 | INFO     | src.policies:collect_trajectories:213 - Episode 199
2021-08-26 22:52:24.816 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:24.818 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 36.0
2021-08-26 22:52:24.818 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 33.285714285714285
2021-08-26 22:52:24.824 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:52:24.826 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.41912466287612915
2021-08-26 22:52:24.828 | INFO     | src.policies:minibatch

2021-08-26 22:52:25.035 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.07846525311470032
2021-08-26 22:52:25.037 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.49999967217445374
2021-08-26 22:52:25.040 | INFO     | src.policies:train:116 - Epoch 24 / 800
2021-08-26 22:52:25.041 | INFO     | src.policies:collect_trajectories:213 - Episode 211
2021-08-26 22:52:25.049 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:25.051 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 12.0
2021-08-26 22:52:25.051 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 12.0
2021-08-26 22:52:25.052 | INFO     | src.policies:collect_trajectories:213 - Episode 212
2021-08-26 22:52:25.077 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:25.078 | INFO     | 

2021-08-26 22:52:25.395 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:25.396 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 36.0
2021-08-26 22:52:25.397 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 23.375
2021-08-26 22:52:25.398 | INFO     | src.policies:collect_trajectories:213 - Episode 227
2021-08-26 22:52:25.421 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:25.422 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 51.0
2021-08-26 22:52:25.422 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 26.444444444444443
2021-08-26 22:52:25.428 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:52:25.430 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.3766910135746002
2021-08-26 22:52:25.432 | INFO     | src.policies:minibatch_update:277 -

2021-08-26 22:52:25.670 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:25.670 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 41.0
2021-08-26 22:52:25.671 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 54.0
2021-08-26 22:52:25.672 | INFO     | src.policies:collect_trajectories:213 - Episode 239
2021-08-26 22:52:25.687 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:25.688 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 31.0
2021-08-26 22:52:25.688 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 46.333333333333336
2021-08-26 22:52:25.689 | INFO     | src.policies:collect_trajectories:213 - Episode 240
2021-08-26 22:52:25.703 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:25.704 | INFO     | src.policies:collect_trajec

2021-08-26 22:52:26.000 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 25.0
2021-08-26 22:52:26.000 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 25.0
2021-08-26 22:52:26.001 | INFO     | src.policies:collect_trajectories:213 - Episode 251
2021-08-26 22:52:26.017 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:26.018 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 36.0
2021-08-26 22:52:26.019 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 30.5
2021-08-26 22:52:26.019 | INFO     | src.policies:collect_trajectories:213 - Episode 252
2021-08-26 22:52:26.028 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:26.029 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 17.0
2021-08-26 22:52:26.030 | INFO     | src.policies:collect_trajectories:230 - La

2021-08-26 22:52:26.285 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.3215941786766052
2021-08-26 22:52:26.287 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.04853447899222374
2021-08-26 22:52:26.288 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 1.4664180278778076
2021-08-26 22:52:26.290 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.04853447899222374
2021-08-26 22:52:26.292 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.49999964237213135
2021-08-26 22:52:26.294 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:52:26.296 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.32131892442703247
2021-08-26 22:52:26.298 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.055662136524915695
2021-08-26 22:52:26.300 | INFO     | src.policie

2021-08-26 22:52:26.646 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 55.0
2021-08-26 22:52:26.647 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 34.6
2021-08-26 22:52:26.648 | INFO     | src.policies:collect_trajectories:213 - Episode 279
2021-08-26 22:52:26.662 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:26.663 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 27.0
2021-08-26 22:52:26.663 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 33.333333333333336
2021-08-26 22:52:26.668 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:52:26.670 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.34590572118759155
2021-08-26 22:52:26.672 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.028686242178082466
2021-08-26 22:52:26.673 | INFO     | src.policies:

2021-08-26 22:52:27.151 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 28.333333333333332
2021-08-26 22:52:27.152 | INFO     | src.policies:collect_trajectories:213 - Episode 291
2021-08-26 22:52:27.177 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:27.178 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 50.0
2021-08-26 22:52:27.179 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 33.75
2021-08-26 22:52:27.180 | INFO     | src.policies:collect_trajectories:213 - Episode 292
2021-08-26 22:52:27.197 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:27.199 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 35.0
2021-08-26 22:52:27.200 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 34.0
2021-08-26 22:52:27.200 | INFO     | src.policies:coll

2021-08-26 22:52:27.425 | INFO     | src.policies:collect_trajectories:213 - Episode 303
2021-08-26 22:52:27.440 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:27.441 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 32.0
2021-08-26 22:52:27.441 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 24.5
2021-08-26 22:52:27.442 | INFO     | src.policies:collect_trajectories:213 - Episode 304
2021-08-26 22:52:27.464 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:27.465 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 49.0
2021-08-26 22:52:27.466 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 32.666666666666664
2021-08-26 22:52:27.467 | INFO     | src.policies:collect_trajectories:213 - Episode 305
2021-08-26 22:52:27.474 | DEBUG    | src.policies:execute_episode:398 - Early s

2021-08-26 22:52:27.874 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:52:27.876 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.2624530792236328
2021-08-26 22:52:27.879 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.07526176422834396
2021-08-26 22:52:27.880 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 1.3939461708068848
2021-08-26 22:52:27.882 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.07526176422834396
2021-08-26 22:52:27.884 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.49999964237213135
2021-08-26 22:52:27.887 | INFO     | src.policies:train:116 - Epoch 38 / 800
2021-08-26 22:52:27.887 | INFO     | src.policies:collect_trajectories:213 - Episode 317
2021-08-26 22:52:27.902 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26

2021-08-26 22:52:28.142 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:28.143 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 16.0
2021-08-26 22:52:28.144 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 28.666666666666668
2021-08-26 22:52:28.144 | INFO     | src.policies:collect_trajectories:213 - Episode 332
2021-08-26 22:52:28.152 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:28.153 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 12.0
2021-08-26 22:52:28.153 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 26.285714285714285
2021-08-26 22:52:28.154 | INFO     | src.policies:collect_trajectories:213 - Episode 333
2021-08-26 22:52:28.178 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:28.179 | INFO     | src.policies:

2021-08-26 22:52:28.540 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:28.541 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 10.0
2021-08-26 22:52:28.541 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 31.4
2021-08-26 22:52:28.542 | INFO     | src.policies:collect_trajectories:213 - Episode 344
2021-08-26 22:52:28.553 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:28.554 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 22.0
2021-08-26 22:52:28.554 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 29.833333333333332
2021-08-26 22:52:28.555 | INFO     | src.policies:collect_trajectories:213 - Episode 345
2021-08-26 22:52:28.570 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:28.572 | INFO     | src.policies:collect_trajec

2021-08-26 22:52:28.811 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:28.812 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 27.0
2021-08-26 22:52:28.864 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 19.666666666666668
2021-08-26 22:52:28.875 | INFO     | src.policies:collect_trajectories:213 - Episode 356
2021-08-26 22:52:28.886 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:28.887 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 18.0
2021-08-26 22:52:28.888 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 19.428571428571427
2021-08-26 22:52:28.888 | INFO     | src.policies:collect_trajectories:213 - Episode 357
2021-08-26 22:52:28.897 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:28.898 | INFO     | src.policies:

2021-08-26 22:52:29.264 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:29.265 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 38.0
2021-08-26 22:52:29.265 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 33.5
2021-08-26 22:52:29.266 | INFO     | src.policies:collect_trajectories:213 - Episode 368
2021-08-26 22:52:29.277 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:29.278 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 19.0
2021-08-26 22:52:29.279 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 28.666666666666668
2021-08-26 22:52:29.280 | INFO     | src.policies:collect_trajectories:213 - Episode 369
2021-08-26 22:52:29.299 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:29.300 | INFO     | src.policies:collect_trajec

2021-08-26 22:52:29.605 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.21985796093940735
2021-08-26 22:52:29.607 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.02173626236617565
2021-08-26 22:52:29.609 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 1.2585043907165527
2021-08-26 22:52:29.611 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.02173626236617565
2021-08-26 22:52:29.613 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.49999964237213135
2021-08-26 22:52:29.615 | INFO     | src.policies:train:116 - Epoch 47 / 800
2021-08-26 22:52:29.617 | INFO     | src.policies:collect_trajectories:213 - Episode 381
2021-08-26 22:52:29.638 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:29.639 | INFO     | src.policies:collect_trajectories:229 - Mean episod

2021-08-26 22:52:29.912 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.07506834715604782
2021-08-26 22:52:29.914 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4999995827674866
2021-08-26 22:52:29.917 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:52:29.919 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.2329246550798416
2021-08-26 22:52:29.922 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.06825419515371323
2021-08-26 22:52:29.923 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 1.2216711044311523
2021-08-26 22:52:29.925 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.06825419515371323
2021-08-26 22:52:29.927 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.499999612569808

2021-08-26 22:52:30.292 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.49999964237213135
2021-08-26 22:52:30.295 | INFO     | src.policies:train:116 - Epoch 51 / 800
2021-08-26 22:52:30.296 | INFO     | src.policies:collect_trajectories:213 - Episode 404
2021-08-26 22:52:30.306 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:30.307 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 22.0
2021-08-26 22:52:30.308 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 22.0
2021-08-26 22:52:30.309 | INFO     | src.policies:collect_trajectories:213 - Episode 405
2021-08-26 22:52:30.321 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:30.322 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 25.0
2021-08-26 22:52:30.323 | INFO     | src.policies:collect_trajectories:230 

2021-08-26 22:52:30.666 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.09036493301391602
2021-08-26 22:52:30.668 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 1.1994211673736572
2021-08-26 22:52:30.670 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.09036493301391602
2021-08-26 22:52:30.672 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.49999961256980896
2021-08-26 22:52:30.675 | INFO     | src.policies:train:116 - Epoch 53 / 800
2021-08-26 22:52:30.676 | INFO     | src.policies:collect_trajectories:213 - Episode 417
2021-08-26 22:52:30.684 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:30.685 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 16.0
2021-08-26 22:52:30.685 | INFO     | src.policies:collect_trajectories:230 - Last 100 epis

2021-08-26 22:52:30.936 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 23.0
2021-08-26 22:52:30.937 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 23.333333333333332
2021-08-26 22:52:30.938 | INFO     | src.policies:collect_trajectories:213 - Episode 432
2021-08-26 22:52:30.951 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:30.952 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 30.0
2021-08-26 22:52:30.953 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 24.285714285714285
2021-08-26 22:52:30.954 | INFO     | src.policies:collect_trajectories:213 - Episode 433
2021-08-26 22:52:30.970 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:30.971 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 34.0
2021-08-26 22:52:30.971 | INFO     | src.policies:c

2021-08-26 22:52:31.346 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 24.0
2021-08-26 22:52:31.347 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 24.0
2021-08-26 22:52:31.347 | INFO     | src.policies:collect_trajectories:213 - Episode 444
2021-08-26 22:52:31.361 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:31.362 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 29.0
2021-08-26 22:52:31.363 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 26.5
2021-08-26 22:52:31.363 | INFO     | src.policies:collect_trajectories:213 - Episode 445
2021-08-26 22:52:31.380 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:31.381 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 37.0
2021-08-26 22:52:31.382 | INFO     | src.policies:collect_trajectories:230 - La

2021-08-26 22:52:31.686 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 34.5
2021-08-26 22:52:31.687 | INFO     | src.policies:collect_trajectories:213 - Episode 456
2021-08-26 22:52:31.697 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:31.698 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 19.0
2021-08-26 22:52:31.842 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 29.333333333333332
2021-08-26 22:52:31.843 | INFO     | src.policies:collect_trajectories:213 - Episode 457
2021-08-26 22:52:31.858 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:31.859 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 30.0
2021-08-26 22:52:31.860 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 29.5
2021-08-26 22:52:31.861 | INFO     | src.policies:colle

2021-08-26 22:52:32.093 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.0
2021-08-26 22:52:32.093 | INFO     | src.policies:collect_trajectories:213 - Episode 468
2021-08-26 22:52:32.109 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:32.110 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 29.0
2021-08-26 22:52:32.111 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 24.5
2021-08-26 22:52:32.111 | INFO     | src.policies:collect_trajectories:213 - Episode 469
2021-08-26 22:52:32.126 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:32.127 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 28.0
2021-08-26 22:52:32.128 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 25.666666666666668
2021-08-26 22:52:32.128 | INFO     | src.policies:colle

2021-08-26 22:52:32.464 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4999995529651642
2021-08-26 22:52:32.467 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:52:32.469 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.17864316701889038
2021-08-26 22:52:32.471 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.10470758378505707
2021-08-26 22:52:32.473 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 1.0636709928512573
2021-08-26 22:52:32.475 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.10470758378505707
2021-08-26 22:52:32.478 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4999995529651642
2021-08-26 22:52:32.481 | INFO     | src.policies:train:116 - Epoch 62 / 800
2021-08-26 22:52:32.482 | INFO     | src.policies:collect_tra

2021-08-26 22:52:32.793 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.0659799873828888
2021-08-26 22:52:32.795 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 1.1103901863098145
2021-08-26 22:52:32.797 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.0659799873828888
2021-08-26 22:52:32.799 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4999995827674866
2021-08-26 22:52:32.801 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:52:32.803 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.15028220415115356
2021-08-26 22:52:32.805 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.07093583047389984
2021-08-26 22:52:32.807 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 1.0837658643722534
2021-08-26 22:52:32.809 | INF

2021-08-26 22:52:33.296 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 35.57142857142857
2021-08-26 22:52:33.302 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:52:33.305 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.11193668842315674
2021-08-26 22:52:33.307 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.10806912183761597
2021-08-26 22:52:33.309 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 1.0756102800369263
2021-08-26 22:52:33.311 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.10806912183761597
2021-08-26 22:52:33.313 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4999995827674866
2021-08-26 22:52:33.315 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:52:33.317 | INFO     | src.policies:minibatch_update:270 - To

2021-08-26 22:52:33.615 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 27.25
2021-08-26 22:52:33.616 | INFO     | src.policies:collect_trajectories:213 - Episode 520
2021-08-26 22:52:33.634 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:33.635 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 38.0
2021-08-26 22:52:33.636 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 29.4
2021-08-26 22:52:33.636 | INFO     | src.policies:collect_trajectories:213 - Episode 521
2021-08-26 22:52:33.656 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:33.657 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 41.0
2021-08-26 22:52:33.658 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 31.333333333333332
2021-08-26 22:52:33.658 | INFO     | src.policies:coll

2021-08-26 22:52:33.895 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 30.0
2021-08-26 22:52:33.896 | INFO     | src.policies:collect_trajectories:213 - Episode 532
2021-08-26 22:52:33.905 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:33.906 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 11.0
2021-08-26 22:52:33.906 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.5
2021-08-26 22:52:33.907 | INFO     | src.policies:collect_trajectories:213 - Episode 533
2021-08-26 22:52:33.918 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:33.919 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 21.0
2021-08-26 22:52:33.920 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.666666666666668
2021-08-26 22:52:33.920 | INFO     | src.policies:colle

2021-08-26 22:52:34.273 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 19.0
2021-08-26 22:52:34.274 | INFO     | src.policies:collect_trajectories:213 - Episode 544
2021-08-26 22:52:34.282 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:34.283 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 14.0
2021-08-26 22:52:34.284 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 16.5
2021-08-26 22:52:34.285 | INFO     | src.policies:collect_trajectories:213 - Episode 545
2021-08-26 22:52:34.311 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:34.312 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 62.0
2021-08-26 22:52:34.313 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 31.666666666666668
2021-08-26 22:52:34.314 | INFO     | src.policies:colle

2021-08-26 22:52:34.586 | INFO     | src.policies:train:116 - Epoch 73 / 800
2021-08-26 22:52:34.587 | INFO     | src.policies:collect_trajectories:213 - Episode 556
2021-08-26 22:52:34.602 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:34.603 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 24.0
2021-08-26 22:52:34.604 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 24.0
2021-08-26 22:52:34.605 | INFO     | src.policies:collect_trajectories:213 - Episode 557
2021-08-26 22:52:34.675 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:34.676 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 15.0
2021-08-26 22:52:34.676 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 19.5
2021-08-26 22:52:34.677 | INFO     | src.policies:collect_trajectories:213 - Episode 558
2021-08-26 22:52

2021-08-26 22:52:34.962 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:52:34.965 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.08922648429870605
2021-08-26 22:52:34.967 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.05981934443116188
2021-08-26 22:52:34.969 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.9673019647598267
2021-08-26 22:52:34.971 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.05981934443116188
2021-08-26 22:52:34.973 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4999994933605194
2021-08-26 22:52:34.976 | INFO     | src.policies:train:116 - Epoch 75 / 800
2021-08-26 22:52:34.977 | INFO     | src.policies:collect_trajectories:213 - Episode 570
2021-08-26 22:52:34.989 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26

2021-08-26 22:52:35.565 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:52:35.568 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.09063112735748291
2021-08-26 22:52:35.572 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.04903574287891388
2021-08-26 22:52:35.574 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.9597866535186768
2021-08-26 22:52:35.578 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.04903574287891388
2021-08-26 22:52:35.580 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.499999463558197
2021-08-26 22:52:35.583 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:52:35.586 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.11441957950592041
2021-08-26 22:52:35.588 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradie

2021-08-26 22:52:35.931 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 33.0
2021-08-26 22:52:35.931 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 27.285714285714285
2021-08-26 22:52:35.932 | INFO     | src.policies:collect_trajectories:213 - Episode 597
2021-08-26 22:52:35.943 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:35.944 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 17.0
2021-08-26 22:52:35.945 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 26.0
2021-08-26 22:52:35.954 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:52:36.013 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.03127947449684143
2021-08-26 22:52:36.016 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.07136518508195877
2021-08-26 22:52:36.017 | INFO     | src.policies:m

2021-08-26 22:52:36.313 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 76.0
2021-08-26 22:52:36.314 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 37.0
2021-08-26 22:52:36.315 | INFO     | src.policies:collect_trajectories:213 - Episode 609
2021-08-26 22:52:36.330 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:36.331 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 28.0
2021-08-26 22:52:36.332 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 35.2
2021-08-26 22:52:36.334 | INFO     | src.policies:collect_trajectories:213 - Episode 610
2021-08-26 22:52:36.357 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:36.358 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 36.0
2021-08-26 22:52:36.359 | INFO     | src.policies:collect_trajectories:230 - La

2021-08-26 22:52:36.728 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 44.5
2021-08-26 22:52:36.729 | INFO     | src.policies:collect_trajectories:213 - Episode 621
2021-08-26 22:52:36.740 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:36.741 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 18.0
2021-08-26 22:52:36.742 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 39.2
2021-08-26 22:52:36.743 | INFO     | src.policies:collect_trajectories:213 - Episode 622
2021-08-26 22:52:36.757 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:36.758 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 22.0
2021-08-26 22:52:36.759 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 36.333333333333336
2021-08-26 22:52:36.765 | INFO     | src.policies:train

2021-08-26 22:52:37.054 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 46.0
2021-08-26 22:52:37.055 | INFO     | src.policies:collect_trajectories:213 - Episode 633
2021-08-26 22:52:37.071 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:37.072 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 30.0
2021-08-26 22:52:37.073 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 42.0
2021-08-26 22:52:37.079 | INFO     | src.policies:collect_trajectories:213 - Episode 634
2021-08-26 22:52:37.161 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:37.163 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 46.0
2021-08-26 22:52:37.163 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 42.8
2021-08-26 22:52:37.170 | INFO     | src.policies:train:152 - Mini-ba

2021-08-26 22:52:37.613 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:52:37.615 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.06430017948150635
2021-08-26 22:52:37.618 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.06703066825866699
2021-08-26 22:52:37.620 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.8796427249908447
2021-08-26 22:52:37.622 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.06703066825866699
2021-08-26 22:52:37.624 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.49999940395355225
2021-08-26 22:52:37.627 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:52:37.630 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.0779096782207489
2021-08-26 22:52:37.632 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradi

2021-08-26 22:52:37.988 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.8612942695617676
2021-08-26 22:52:37.990 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.11535674333572388
2021-08-26 22:52:37.994 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.49999937415122986
2021-08-26 22:52:37.997 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:52:37.999 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.03867313265800476
2021-08-26 22:52:38.003 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.09465716034173965
2021-08-26 22:52:38.005 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.8582290410995483
2021-08-26 22:52:38.007 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.09465716034173965
2021-08-26 

2021-08-26 22:52:38.420 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.0407601036131382
2021-08-26 22:52:38.423 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.49999934434890747
2021-08-26 22:52:38.426 | INFO     | src.policies:train:116 - Epoch 91 / 800
2021-08-26 22:52:38.427 | INFO     | src.policies:collect_trajectories:213 - Episode 666
2021-08-26 22:52:38.442 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:38.443 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 31.0
2021-08-26 22:52:38.444 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 31.0
2021-08-26 22:52:38.445 | INFO     | src.policies:collect_trajectories:213 - Episode 667
2021-08-26 22:52:38.457 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:38.458 | INFO     | s

2021-08-26 22:52:38.816 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.13940098881721497
2021-08-26 22:52:38.819 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.49999940395355225
2021-08-26 22:52:38.823 | INFO     | src.policies:train:116 - Epoch 93 / 800
2021-08-26 22:52:38.824 | INFO     | src.policies:collect_trajectories:213 - Episode 678
2021-08-26 22:52:38.848 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:38.849 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 52.0
2021-08-26 22:52:38.850 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 52.0
2021-08-26 22:52:38.851 | INFO     | src.policies:collect_trajectories:213 - Episode 679
2021-08-26 22:52:38.872 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:38.874 | INFO     | 

2021-08-26 22:52:39.175 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:39.176 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 31.0
2021-08-26 22:52:39.177 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 31.0
2021-08-26 22:52:39.178 | INFO     | src.policies:collect_trajectories:213 - Episode 690
2021-08-26 22:52:39.204 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:39.206 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 54.0
2021-08-26 22:52:39.207 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 42.5
2021-08-26 22:52:39.208 | INFO     | src.policies:collect_trajectories:213 - Episode 691
2021-08-26 22:52:39.234 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:39.236 | INFO     | src.policies:collect_trajectories:229 - M

2021-08-26 22:52:39.696 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:39.697 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 14.0
2021-08-26 22:52:39.698 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 21.0
2021-08-26 22:52:39.699 | INFO     | src.policies:collect_trajectories:213 - Episode 702
2021-08-26 22:52:39.748 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:39.749 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 93.0
2021-08-26 22:52:39.750 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 39.0
2021-08-26 22:52:39.751 | INFO     | src.policies:collect_trajectories:213 - Episode 703
2021-08-26 22:52:39.802 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:39.804 | INFO     | src.policies:collect_trajectories:229 - M

2021-08-26 22:52:40.296 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 30.0
2021-08-26 22:52:40.297 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 41.5
2021-08-26 22:52:40.298 | INFO     | src.policies:collect_trajectories:213 - Episode 714
2021-08-26 22:52:40.312 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:40.313 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 19.0
2021-08-26 22:52:40.314 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 37.0
2021-08-26 22:52:40.315 | INFO     | src.policies:collect_trajectories:213 - Episode 715
2021-08-26 22:52:40.359 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:40.360 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 74.0
2021-08-26 22:52:40.361 | INFO     | src.policies:collect_trajectories:230 - La

2021-08-26 22:52:40.789 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 42.0
2021-08-26 22:52:40.790 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 58.5
2021-08-26 22:52:40.791 | INFO     | src.policies:collect_trajectories:213 - Episode 726
2021-08-26 22:52:40.817 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:40.818 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 37.0
2021-08-26 22:52:40.820 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 51.333333333333336
2021-08-26 22:52:40.822 | INFO     | src.policies:collect_trajectories:213 - Episode 727
2021-08-26 22:52:40.835 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:40.836 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 17.0
2021-08-26 22:52:40.838 | INFO     | src.policies:collect_traject

2021-08-26 22:52:41.267 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 29.0
2021-08-26 22:52:41.269 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 43.0
2021-08-26 22:52:41.269 | INFO     | src.policies:collect_trajectories:213 - Episode 738
2021-08-26 22:52:41.288 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:41.290 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 29.0
2021-08-26 22:52:41.291 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 39.5
2021-08-26 22:52:41.292 | INFO     | src.policies:collect_trajectories:213 - Episode 739
2021-08-26 22:52:41.305 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:41.307 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 18.0
2021-08-26 22:52:41.308 | INFO     | src.policies:collect_trajectories:230 - La

2021-08-26 22:52:41.796 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 86.0
2021-08-26 22:52:41.797 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 51.333333333333336
2021-08-26 22:52:41.798 | INFO     | src.policies:collect_trajectories:213 - Episode 750
2021-08-26 22:52:41.831 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:41.832 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 66.0
2021-08-26 22:52:41.833 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 55.0
2021-08-26 22:52:41.842 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:52:41.845 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.0011345446109771729
2021-08-26 22:52:41.847 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.11881952732801437
2021-08-26 22:52:41.849 | INFO     | src.policies

2021-08-26 22:52:42.176 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4999992847442627
2021-08-26 22:52:42.180 | INFO     | src.policies:train:116 - Epoch 108 / 800
2021-08-26 22:52:42.181 | INFO     | src.policies:collect_trajectories:213 - Episode 758
2021-08-26 22:52:42.213 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:42.214 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 66.0
2021-08-26 22:52:42.215 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 66.0
2021-08-26 22:52:42.216 | INFO     | src.policies:collect_trajectories:213 - Episode 759
2021-08-26 22:52:42.230 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:42.231 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 21.0
2021-08-26 22:52:42.232 | INFO     | src.policies:collect_trajectories:230 

2021-08-26 22:52:42.602 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4999992847442627
2021-08-26 22:52:42.605 | INFO     | src.policies:train:116 - Epoch 110 / 800
2021-08-26 22:52:42.606 | INFO     | src.policies:collect_trajectories:213 - Episode 770
2021-08-26 22:52:42.643 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:42.644 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 75.0
2021-08-26 22:52:42.645 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 75.0
2021-08-26 22:52:42.646 | INFO     | src.policies:collect_trajectories:213 - Episode 771
2021-08-26 22:52:42.672 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:42.673 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 50.0
2021-08-26 22:52:42.674 | INFO     | src.policies:collect_trajectories:230 

2021-08-26 22:52:43.049 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.041585035622119904
2021-08-26 22:52:43.051 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.6874145269393921
2021-08-26 22:52:43.053 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.041585035622119904
2021-08-26 22:52:43.056 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4999992251396179
2021-08-26 22:52:43.059 | INFO     | src.policies:train:116 - Epoch 112 / 800
2021-08-26 22:52:43.060 | INFO     | src.policies:collect_trajectories:213 - Episode 783
2021-08-26 22:52:43.072 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:43.074 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 24.0
2021-08-26 22:52:43.074 | INFO     | src.policies:collect_trajectories:230 - Last 100 ep

2021-08-26 22:52:43.521 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 115.0
2021-08-26 22:52:43.522 | INFO     | src.policies:collect_trajectories:213 - Episode 794
2021-08-26 22:52:43.549 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:43.550 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 53.0
2021-08-26 22:52:43.551 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 84.0
2021-08-26 22:52:43.552 | INFO     | src.policies:collect_trajectories:213 - Episode 795
2021-08-26 22:52:43.587 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:43.589 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 68.0
2021-08-26 22:52:43.590 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 78.66666666666667
2021-08-26 22:52:43.596 | INFO     | src.policies:train

2021-08-26 22:52:44.028 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 33.666666666666664
2021-08-26 22:52:44.036 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:52:44.040 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.052268266677856445
2021-08-26 22:52:44.045 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.09489969164133072
2021-08-26 22:52:44.048 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.6683734059333801
2021-08-26 22:52:44.051 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.09489969164133072
2021-08-26 22:52:44.054 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4999992251396179
2021-08-26 22:52:44.059 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:52:44.063 | INFO     | src.policies:minibatch_update:270 -

2021-08-26 22:52:44.460 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.08535466343164444
2021-08-26 22:52:44.461 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.6393449902534485
2021-08-26 22:52:44.464 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.08535466343164444
2021-08-26 22:52:44.466 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4999992251396179
2021-08-26 22:52:44.468 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:52:44.471 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.06879010796546936
2021-08-26 22:52:44.473 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.07079216092824936
2021-08-26 22:52:44.475 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.6466624140739441
2021-08-26 22:52:44.477 | 

2021-08-26 22:52:44.871 | INFO     | src.policies:train:116 - Epoch 121 / 800
2021-08-26 22:52:44.872 | INFO     | src.policies:collect_trajectories:213 - Episode 826
2021-08-26 22:52:44.895 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:44.896 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 46.0
2021-08-26 22:52:44.897 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 46.0
2021-08-26 22:52:44.898 | INFO     | src.policies:collect_trajectories:213 - Episode 827
2021-08-26 22:52:44.924 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:44.925 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 47.0
2021-08-26 22:52:44.926 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 46.5
2021-08-26 22:52:44.927 | INFO     | src.policies:collect_trajectories:213 - Episode 828
2021-08-26 22:5

2021-08-26 22:52:45.460 | INFO     | src.policies:collect_trajectories:213 - Episode 838
2021-08-26 22:52:45.485 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:45.487 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 41.0
2021-08-26 22:52:45.490 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 48.666666666666664
2021-08-26 22:52:45.493 | INFO     | src.policies:collect_trajectories:213 - Episode 839
2021-08-26 22:52:45.509 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:45.510 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 14.0
2021-08-26 22:52:45.511 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 40.0
2021-08-26 22:52:45.512 | INFO     | src.policies:collect_trajectories:213 - Episode 840
2021-08-26 22:52:45.526 | DEBUG    | src.policies:execute_episode:398 - Early s

2021-08-26 22:52:46.029 | INFO     | src.policies:collect_trajectories:213 - Episode 850
2021-08-26 22:52:46.070 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:46.071 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 73.0
2021-08-26 22:52:46.072 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 40.8
2021-08-26 22:52:46.081 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:52:46.083 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.08444657921791077
2021-08-26 22:52:46.086 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.10160551965236664
2021-08-26 22:52:46.088 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.6089768409729004
2021-08-26 22:52:46.090 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.10160551965236664
2021-08-26 2

2021-08-26 22:52:46.481 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.0820196270942688
2021-08-26 22:52:46.483 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.15849046409130096
2021-08-26 22:52:46.485 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.588524580001831
2021-08-26 22:52:46.487 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.15849046409130096
2021-08-26 22:52:46.490 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.49999910593032837
2021-08-26 22:52:46.493 | INFO     | src.policies:train:116 - Epoch 128 / 800
2021-08-26 22:52:46.494 | INFO     | src.policies:collect_trajectories:213 - Episode 860
2021-08-26 22:52:46.560 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:46.561 | INFO     | src.policies:collect_trajectories:229 - Mean episod

2021-08-26 22:52:47.007 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 36.333333333333336
2021-08-26 22:52:47.009 | INFO     | src.policies:collect_trajectories:213 - Episode 869
2021-08-26 22:52:47.047 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:47.048 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 76.0
2021-08-26 22:52:47.049 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 46.25
2021-08-26 22:52:47.050 | INFO     | src.policies:collect_trajectories:213 - Episode 870
2021-08-26 22:52:47.080 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:47.081 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 62.0
2021-08-26 22:52:47.082 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 49.4
2021-08-26 22:52:47.089 | INFO     | src.policies:trai

2021-08-26 22:52:47.498 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.036059364676475525
2021-08-26 22:52:47.500 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.49999910593032837
2021-08-26 22:52:47.503 | INFO     | src.policies:train:116 - Epoch 133 / 800
2021-08-26 22:52:47.504 | INFO     | src.policies:collect_trajectories:213 - Episode 878
2021-08-26 22:52:47.545 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:47.546 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 84.0
2021-08-26 22:52:47.547 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 84.0
2021-08-26 22:52:47.548 | INFO     | src.policies:collect_trajectories:213 - Episode 879
2021-08-26 22:52:47.573 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:47.575 | INFO     

2021-08-26 22:52:48.091 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:48.092 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 30.0
2021-08-26 22:52:48.093 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 50.25
2021-08-26 22:52:48.099 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:52:48.102 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.14944884181022644
2021-08-26 22:52:48.105 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.12453635782003403
2021-08-26 22:52:48.107 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.5401336550712585
2021-08-26 22:52:48.109 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.12453635782003403
2021-08-26 22:52:48.111 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradien

2021-08-26 22:52:48.450 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.11296074092388153
2021-08-26 22:52:48.452 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4999990463256836
2021-08-26 22:52:48.455 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:52:48.457 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.11143922805786133
2021-08-26 22:52:48.459 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.06599635630846024
2021-08-26 22:52:48.461 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.5285908579826355
2021-08-26 22:52:48.463 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.06599635630846024
2021-08-26 22:52:48.466 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4999990165233

2021-08-26 22:52:48.989 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 18.0
2021-08-26 22:52:48.990 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 68.33333333333333
2021-08-26 22:52:48.995 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:52:48.999 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.12162625789642334
2021-08-26 22:52:49.001 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.10617437213659286
2021-08-26 22:52:49.003 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.5125166773796082
2021-08-26 22:52:49.006 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.10617437213659286
2021-08-26 22:52:49.009 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4999989867210388
2021-08-26 22:52:49.011 | INFO     | src.policies:

2021-08-26 22:52:49.453 | INFO     | src.policies:train:116 - Epoch 143 / 800
2021-08-26 22:52:49.454 | INFO     | src.policies:collect_trajectories:213 - Episode 916
2021-08-26 22:52:49.475 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:49.476 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 40.0
2021-08-26 22:52:49.477 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 40.0
2021-08-26 22:52:49.478 | INFO     | src.policies:collect_trajectories:213 - Episode 917
2021-08-26 22:52:49.495 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:49.496 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 29.0
2021-08-26 22:52:49.497 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 34.5
2021-08-26 22:52:49.498 | INFO     | src.policies:collect_trajectories:213 - Episode 918
2021-08-26 22:5

2021-08-26 22:52:49.984 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.48207321763038635
2021-08-26 22:52:49.986 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1071329340338707
2021-08-26 22:52:49.988 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.48207321763038635
2021-08-26 22:52:49.991 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:52:49.993 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.10901278257369995
2021-08-26 22:52:49.996 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1446356177330017
2021-08-26 22:52:49.998 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.47856056690216064
2021-08-26 22:52:50.000 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1446356177330017
2021-08-26 

2021-08-26 22:52:50.747 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.10535874217748642
2021-08-26 22:52:50.749 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4748004078865051
2021-08-26 22:52:50.752 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:52:50.754 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.1585676670074463
2021-08-26 22:52:50.757 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.11571209877729416
2021-08-26 22:52:50.759 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.4602741003036499
2021-08-26 22:52:50.761 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.11571209877729416
2021-08-26 22:52:50.763 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.46027410030364

2021-08-26 22:52:51.189 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 15.0
2021-08-26 22:52:51.190 | INFO     | src.policies:collect_trajectories:213 - Episode 943
2021-08-26 22:52:51.213 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:51.214 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 44.0
2021-08-26 22:52:51.215 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 29.5
2021-08-26 22:52:51.216 | INFO     | src.policies:collect_trajectories:213 - Episode 944
2021-08-26 22:52:51.253 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:51.254 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 75.0
2021-08-26 22:52:51.255 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 44.666666666666664
2021-08-26 22:52:51.256 | INFO     | src.policies:colle

2021-08-26 22:52:51.734 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2889844477176666
2021-08-26 22:52:51.736 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.4499155282974243
2021-08-26 22:52:51.738 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2889844477176666
2021-08-26 22:52:51.741 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4499155282974243
2021-08-26 22:52:51.743 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:52:51.746 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.16035783290863037
2021-08-26 22:52:51.749 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1472388505935669
2021-08-26 22:52:51.750 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.44356611371040344
2021-08-26 22:52:51.753 | IN

2021-08-26 22:52:52.293 | INFO     | src.policies:collect_trajectories:213 - Episode 963
2021-08-26 22:52:52.341 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:52.342 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 91.0
2021-08-26 22:52:52.343 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 53.0
2021-08-26 22:52:52.349 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:52:52.352 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.13573282957077026
2021-08-26 22:52:52.355 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.17562243342399597
2021-08-26 22:52:52.356 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.423320472240448
2021-08-26 22:52:52.359 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.17562243342399597
2021-08-26 22

2021-08-26 22:52:52.739 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.18519610166549683
2021-08-26 22:52:52.742 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.418766587972641
2021-08-26 22:52:52.745 | INFO     | src.policies:train:116 - Epoch 159 / 800
2021-08-26 22:52:52.746 | INFO     | src.policies:collect_trajectories:213 - Episode 972
2021-08-26 22:52:52.766 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:52.767 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 40.0
2021-08-26 22:52:52.768 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 40.0
2021-08-26 22:52:52.769 | INFO     | src.policies:collect_trajectories:213 - Episode 973
2021-08-26 22:52:52.787 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:52.788 | INFO     | s

2021-08-26 22:52:53.211 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 62.0
2021-08-26 22:52:53.213 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 45.5
2021-08-26 22:52:53.214 | INFO     | src.policies:collect_trajectories:213 - Episode 982
2021-08-26 22:52:53.295 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:53.296 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 167.0
2021-08-26 22:52:53.297 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 86.0
2021-08-26 22:52:53.303 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:52:53.306 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.20199447870254517
2021-08-26 22:52:53.308 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.12508277595043182
2021-08-26 22:52:53.310 | INFO     | src.policies:minibatch_upd

2021-08-26 22:52:53.696 | INFO     | src.policies:train:116 - Epoch 164 / 800
2021-08-26 22:52:53.697 | INFO     | src.policies:collect_trajectories:213 - Episode 990
2021-08-26 22:52:53.750 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:53.751 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 114.0
2021-08-26 22:52:53.752 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 114.0
2021-08-26 22:52:53.753 | INFO     | src.policies:collect_trajectories:213 - Episode 991
2021-08-26 22:52:53.762 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:53.763 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 11.0
2021-08-26 22:52:53.764 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 62.5
2021-08-26 22:52:53.764 | INFO     | src.policies:collect_trajectories:213 - Episode 992
2021-08-26 22

2021-08-26 22:52:54.423 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:52:54.427 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.1845109462738037
2021-08-26 22:52:54.429 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.09033987671136856
2021-08-26 22:52:54.431 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.37927675247192383
2021-08-26 22:52:54.433 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.09033987671136856
2021-08-26 22:52:54.435 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.37927675247192383
2021-08-26 22:52:54.438 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:52:54.440 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.15569621324539185
2021-08-26 22:52:54.443 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gr

2021-08-26 22:52:54.843 | INFO     | src.policies:train:116 - Epoch 169 / 800
2021-08-26 22:52:54.844 | INFO     | src.policies:collect_trajectories:213 - Episode 1007
2021-08-26 22:52:54.853 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:54.854 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 15.0
2021-08-26 22:52:54.855 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 15.0
2021-08-26 22:52:54.856 | INFO     | src.policies:collect_trajectories:213 - Episode 1008
2021-08-26 22:52:54.882 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:54.883 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 48.0
2021-08-26 22:52:54.884 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 31.5
2021-08-26 22:52:54.885 | INFO     | src.policies:collect_trajectories:213 - Episode 1009
2021-08-26 2

2021-08-26 22:52:55.420 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:55.422 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 51.0
2021-08-26 22:52:55.422 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 80.0
2021-08-26 22:52:55.429 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:52:55.432 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.19529080390930176
2021-08-26 22:52:55.434 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.15775766968727112
2021-08-26 22:52:55.436 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.3572413921356201
2021-08-26 22:52:55.439 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.15775766968727112
2021-08-26 22:52:55.441 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient

2021-08-26 22:52:55.977 | INFO     | src.policies:train:116 - Epoch 174 / 800
2021-08-26 22:52:55.978 | INFO     | src.policies:collect_trajectories:213 - Episode 1024
2021-08-26 22:52:56.012 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:56.013 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 69.0
2021-08-26 22:52:56.014 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 69.0
2021-08-26 22:52:56.015 | INFO     | src.policies:collect_trajectories:213 - Episode 1025
2021-08-26 22:52:56.038 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:56.039 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 43.0
2021-08-26 22:52:56.040 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 56.0
2021-08-26 22:52:56.041 | INFO     | src.policies:collect_trajectories:213 - Episode 1026
2021-08-26 2

2021-08-26 22:52:56.575 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:52:56.577 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.184362530708313
2021-08-26 22:52:56.580 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1204412579536438
2021-08-26 22:52:56.582 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.3360117971897125
2021-08-26 22:52:56.584 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1204412579536438
2021-08-26 22:52:56.586 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.3360117971897125
2021-08-26 22:52:56.590 | INFO     | src.policies:train:116 - Epoch 177 / 800
2021-08-26 22:52:56.591 | INFO     | src.policies:collect_trajectories:213 - Episode 1034
2021-08-26 22:52:56.607 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 

2021-08-26 22:52:57.084 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:52:57.086 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.1803220510482788
2021-08-26 22:52:57.089 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1278926581144333
2021-08-26 22:52:57.091 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.32781389355659485
2021-08-26 22:52:57.093 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1278926581144333
2021-08-26 22:52:57.095 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.32781389355659485
2021-08-26 22:52:57.098 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:52:57.100 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.17215067148208618
2021-08-26 22:52:57.103 | INFO     | src.policies:minibatch_update:277 - Policy network L2 grad

2021-08-26 22:52:57.553 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.316709965467453
2021-08-26 22:52:57.602 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.18509125709533691
2021-08-26 22:52:57.605 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.316709965467453
2021-08-26 22:52:57.608 | INFO     | src.policies:train:116 - Epoch 182 / 800
2021-08-26 22:52:57.609 | INFO     | src.policies:collect_trajectories:213 - Episode 1052
2021-08-26 22:52:57.677 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:57.678 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 146.0
2021-08-26 22:52:57.679 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 146.0
2021-08-26 22:52:57.680 | INFO     | src.policies:collect_trajectories:213 - Episode 1053
2021-08-26 2

2021-08-26 22:52:58.227 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:52:58.230 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.1812518835067749
2021-08-26 22:52:58.233 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.28765419125556946
2021-08-26 22:52:58.234 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.31375330686569214
2021-08-26 22:52:58.236 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.28765419125556946
2021-08-26 22:52:58.239 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.31375330686569214
2021-08-26 22:52:58.241 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:52:58.244 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.20962828397750854
2021-08-26 22:52:58.246 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gr

2021-08-26 22:52:58.769 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.19523540139198303
2021-08-26 22:52:58.770 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.29934024810791016
2021-08-26 22:52:58.772 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.19523540139198303
2021-08-26 22:52:58.775 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.29934024810791016
2021-08-26 22:52:58.777 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:52:58.780 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.21711468696594238
2021-08-26 22:52:58.782 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.09373374283313751
2021-08-26 22:52:58.784 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.29762497544288635
2021-08-26 22:52:58.787

2021-08-26 22:52:59.424 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:52:59.427 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.23467344045639038
2021-08-26 22:52:59.430 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.11745359003543854
2021-08-26 22:52:59.431 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.2929501533508301
2021-08-26 22:52:59.434 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.11745359003543854
2021-08-26 22:52:59.436 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.2929501533508301
2021-08-26 22:52:59.438 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:52:59.441 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.21001511812210083
2021-08-26 22:52:59.444 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gra

2021-08-26 22:52:59.977 | INFO     | src.policies:train:116 - Epoch 192 / 800
2021-08-26 22:52:59.978 | INFO     | src.policies:collect_trajectories:213 - Episode 1081
2021-08-26 22:52:59.986 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:52:59.987 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 13.0
2021-08-26 22:52:59.988 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 13.0
2021-08-26 22:52:59.989 | INFO     | src.policies:collect_trajectories:213 - Episode 1082
2021-08-26 22:53:00.057 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:00.058 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 107.0
2021-08-26 22:53:00.059 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 60.0
2021-08-26 22:53:00.060 | INFO     | src.policies:collect_trajectories:213 - Episode 1083
2021-08-26 

2021-08-26 22:53:00.673 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.12462160736322403
2021-08-26 22:53:00.676 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.28055620193481445
2021-08-26 22:53:00.679 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.12462160736322403
2021-08-26 22:53:00.681 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.28055620193481445
2021-08-26 22:53:00.685 | INFO     | src.policies:train:116 - Epoch 195 / 800
2021-08-26 22:53:00.686 | INFO     | src.policies:collect_trajectories:213 - Episode 1089
2021-08-26 22:53:00.776 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:00.777 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:53:00.778 | INFO     | src.policies:collect_trajectories:230 - Last 100 

2021-08-26 22:53:01.187 | INFO     | src.policies:collect_trajectories:213 - Episode 1096
2021-08-26 22:53:01.285 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:01.286 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 180.0
2021-08-26 22:53:01.287 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 79.33333333333333
2021-08-26 22:53:01.295 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:01.299 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2472485899925232
2021-08-26 22:53:01.303 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.13933125138282776
2021-08-26 22:53:01.306 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.2731243669986725
2021-08-26 22:53:01.309 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1393312513828277

2021-08-26 22:53:01.835 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.19765937328338623
2021-08-26 22:53:01.838 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1648487001657486
2021-08-26 22:53:01.840 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.26396042108535767
2021-08-26 22:53:01.842 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1648487001657486
2021-08-26 22:53:01.844 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.26396042108535767
2021-08-26 22:53:01.847 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:53:01.849 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.19339144229888916
2021-08-26 22:53:01.852 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.12887884676456451
2021-08-26 22:53:01.854 | INFO     | src.polici

2021-08-26 22:53:02.276 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.2528345286846161
2021-08-26 22:53:02.278 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999919533729553
2021-08-26 22:53:02.281 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.2528345286846161
2021-08-26 22:53:02.284 | INFO     | src.policies:train:116 - Epoch 204 / 800
2021-08-26 22:53:02.285 | INFO     | src.policies:collect_trajectories:213 - Episode 1112
2021-08-26 22:53:02.300 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:02.302 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 28.0
2021-08-26 22:53:02.302 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 28.0
2021-08-26 22:53:02.303 | INFO     | src.policies:collect_trajectories:213 - Episode 1113
2021-08-26 2

2021-08-26 22:53:02.963 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.24719618260860443
2021-08-26 22:53:02.965 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.251698762178421
2021-08-26 22:53:02.969 | INFO     | src.policies:train:116 - Epoch 207 / 800
2021-08-26 22:53:02.971 | INFO     | src.policies:collect_trajectories:213 - Episode 1120
2021-08-26 22:53:03.020 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:03.022 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 103.0
2021-08-26 22:53:03.022 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 103.0
2021-08-26 22:53:03.023 | INFO     | src.policies:collect_trajectories:213 - Episode 1121
2021-08-26 22:53:03.096 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:03.109 | INFO    

2021-08-26 22:53:03.530 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.24113549292087555
2021-08-26 22:53:03.533 | INFO     | src.policies:train:116 - Epoch 210 / 800
2021-08-26 22:53:03.534 | INFO     | src.policies:collect_trajectories:213 - Episode 1126
2021-08-26 22:53:03.563 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:03.565 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 61.0
2021-08-26 22:53:03.565 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 61.0
2021-08-26 22:53:03.566 | INFO     | src.policies:collect_trajectories:213 - Episode 1127
2021-08-26 22:53:03.613 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:03.614 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 95.0
2021-08-26 22:53:03.615 | INFO     | src.policies:collect_trajectories:2

2021-08-26 22:53:04.078 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.22703465819358826
2021-08-26 22:53:04.080 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.37650975584983826
2021-08-26 22:53:04.083 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.22703465819358826
2021-08-26 22:53:04.086 | INFO     | src.policies:train:116 - Epoch 213 / 800
2021-08-26 22:53:04.087 | INFO     | src.policies:collect_trajectories:213 - Episode 1135
2021-08-26 22:53:04.106 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:04.107 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 35.0
2021-08-26 22:53:04.108 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 35.0
2021-08-26 22:53:04.109 | INFO     | src.policies:collect_trajectories:213 - Episode 1136
2021-08-26

2021-08-26 22:53:04.688 | INFO     | src.policies:collect_trajectories:213 - Episode 1142
2021-08-26 22:53:04.861 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:04.863 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 158.0
2021-08-26 22:53:04.864 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 135.0
2021-08-26 22:53:04.870 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:04.873 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.19360768795013428
2021-08-26 22:53:04.876 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.30428677797317505
2021-08-26 22:53:04.878 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.23673386871814728
2021-08-26 22:53:04.881 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.30428677797317505
2021-08-

2021-08-26 22:53:05.281 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.21705341339111328
2021-08-26 22:53:05.285 | INFO     | src.policies:train:116 - Epoch 219 / 800
2021-08-26 22:53:05.285 | INFO     | src.policies:collect_trajectories:213 - Episode 1149
2021-08-26 22:53:05.339 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:05.341 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 15.0
2021-08-26 22:53:05.341 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 15.0
2021-08-26 22:53:05.342 | INFO     | src.policies:collect_trajectories:213 - Episode 1150
2021-08-26 22:53:05.399 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:05.400 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 120.0
2021-08-26 22:53:05.401 | INFO     | src.policies:collect_trajectories:

2021-08-26 22:53:05.913 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.2199387401342392
2021-08-26 22:53:05.916 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.11452616751194
2021-08-26 22:53:05.918 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.2199387401342392
2021-08-26 22:53:05.921 | INFO     | src.policies:train:116 - Epoch 222 / 800
2021-08-26 22:53:05.923 | INFO     | src.policies:collect_trajectories:213 - Episode 1156
2021-08-26 22:53:06.024 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:06.026 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:53:06.027 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:53:06.031 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:06.034 

2021-08-26 22:53:06.510 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.211342453956604
2021-08-26 22:53:06.513 | INFO     | src.policies:train:116 - Epoch 225 / 800
2021-08-26 22:53:06.514 | INFO     | src.policies:collect_trajectories:213 - Episode 1162
2021-08-26 22:53:06.605 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:06.606 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:53:06.607 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:53:06.611 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:06.614 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2592660188674927
2021-08-26 22:53:06.616 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.0658656433224678
2021-08-26 22:53:06.618 | INFO     | src.policies:minibatch_

2021-08-26 22:53:07.108 | INFO     | src.policies:collect_trajectories:213 - Episode 1168
2021-08-26 22:53:07.201 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:07.202 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:53:07.203 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:53:07.207 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:07.210 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.23093688488006592
2021-08-26 22:53:07.212 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.0793556347489357
2021-08-26 22:53:07.214 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.2101147174835205
2021-08-26 22:53:07.217 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.0793556347489357
2021-08-26 

2021-08-26 22:53:07.763 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.20217879116535187
2021-08-26 22:53:07.766 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1726842224597931
2021-08-26 22:53:07.768 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.20217879116535187
2021-08-26 22:53:07.771 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:53:07.773 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.24144762754440308
2021-08-26 22:53:07.775 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1699773520231247
2021-08-26 22:53:07.777 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.20830196142196655
2021-08-26 22:53:07.779 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1699773520231247
2021-08-26 

2021-08-26 22:53:08.263 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 64.0
2021-08-26 22:53:08.265 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 70.0
2021-08-26 22:53:08.265 | INFO     | src.policies:collect_trajectories:213 - Episode 1182
2021-08-26 22:53:08.299 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:08.300 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 68.0
2021-08-26 22:53:08.301 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 69.33333333333333
2021-08-26 22:53:08.307 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:08.310 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2156173586845398
2021-08-26 22:53:08.312 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.25937986373901367
2021-08-26 22:53:08.314 | INFO     | src.policies:m

2021-08-26 22:53:08.978 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.19114409387111664
2021-08-26 22:53:08.980 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.17051658034324646
2021-08-26 22:53:08.982 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.19114409387111664
2021-08-26 22:53:08.986 | INFO     | src.policies:train:116 - Epoch 237 / 800
2021-08-26 22:53:08.987 | INFO     | src.policies:collect_trajectories:213 - Episode 1189
2021-08-26 22:53:09.048 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:09.049 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 129.0
2021-08-26 22:53:09.049 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 129.0
2021-08-26 22:53:09.050 | INFO     | src.policies:collect_trajectories:213 - Episode 1190
2021-08-

2021-08-26 22:53:09.548 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:09.549 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 24.0
2021-08-26 22:53:09.550 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 72.0
2021-08-26 22:53:09.551 | INFO     | src.policies:collect_trajectories:213 - Episode 1197
2021-08-26 22:53:09.643 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:09.644 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 186.0
2021-08-26 22:53:09.645 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 110.0
2021-08-26 22:53:09.651 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:53:09.654 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.25822150707244873
2021-08-26 22:53:09.657 | INFO     | src.policies:minibatch_update:277 - Policy net

2021-08-26 22:53:10.144 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 147.0
2021-08-26 22:53:10.145 | INFO     | src.policies:collect_trajectories:213 - Episode 1203
2021-08-26 22:53:10.167 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:10.168 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 41.0
2021-08-26 22:53:10.169 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 94.0
2021-08-26 22:53:10.170 | INFO     | src.policies:collect_trajectories:213 - Episode 1204
2021-08-26 22:53:10.236 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:10.238 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 147.0
2021-08-26 22:53:10.238 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 111.66666666666667
2021-08-26 22:53:10.245 | INFO     | src.policies:t

2021-08-26 22:53:10.612 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.10435272753238678
2021-08-26 22:53:10.614 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1790165901184082
2021-08-26 22:53:10.617 | INFO     | src.policies:train:116 - Epoch 246 / 800
2021-08-26 22:53:10.618 | INFO     | src.policies:collect_trajectories:213 - Episode 1210
2021-08-26 22:53:10.680 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:10.681 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 125.0
2021-08-26 22:53:10.682 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 125.0
2021-08-26 22:53:10.683 | INFO     | src.policies:collect_trajectories:213 - Episode 1211
2021-08-26 22:53:10.713 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:10.714 | INFO   

2021-08-26 22:53:11.286 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.17193210124969482
2021-08-26 22:53:11.290 | INFO     | src.policies:train:116 - Epoch 249 / 800
2021-08-26 22:53:11.290 | INFO     | src.policies:collect_trajectories:213 - Episode 1218
2021-08-26 22:53:11.348 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:11.350 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 121.0
2021-08-26 22:53:11.350 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 121.0
2021-08-26 22:53:11.351 | INFO     | src.policies:collect_trajectories:213 - Episode 1219
2021-08-26 22:53:11.414 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:11.415 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 60.0
2021-08-26 22:53:11.416 | INFO     | src.policies:collect_trajectories

2021-08-26 22:53:11.888 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.16856823861598969
2021-08-26 22:53:11.890 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.17400969564914703
2021-08-26 22:53:11.892 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.16856823861598969
2021-08-26 22:53:11.894 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.17400969564914703
2021-08-26 22:53:11.939 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:53:11.942 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2888936400413513
2021-08-26 22:53:11.944 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.07983715832233429
2021-08-26 22:53:11.946 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.16860945522785187
2021-08-26 22:53:11.949 

2021-08-26 22:53:12.517 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.17236649990081787
2021-08-26 22:53:12.520 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.12570695579051971
2021-08-26 22:53:12.522 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.17236649990081787
2021-08-26 22:53:12.525 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:53:12.527 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2508639693260193
2021-08-26 22:53:12.530 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.34430360794067383
2021-08-26 22:53:12.531 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1686696857213974
2021-08-26 22:53:12.533 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.34430360794067383
2021-08-26

2021-08-26 22:53:12.884 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.16974994540214539
2021-08-26 22:53:12.887 | INFO     | src.policies:train:116 - Epoch 257 / 800
2021-08-26 22:53:12.888 | INFO     | src.policies:collect_trajectories:213 - Episode 1243
2021-08-26 22:53:12.940 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:12.942 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 105.0
2021-08-26 22:53:12.943 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 105.0
2021-08-26 22:53:12.944 | INFO     | src.policies:collect_trajectories:213 - Episode 1244
2021-08-26 22:53:13.149 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:13.150 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 171.0
2021-08-26 22:53:13.151 | INFO     | src.policies:collect_trajectorie

2021-08-26 22:53:13.501 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.16428342461585999
2021-08-26 22:53:13.504 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:53:13.506 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.29744935035705566
2021-08-26 22:53:13.509 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.06810221821069717
2021-08-26 22:53:13.511 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.15448978543281555
2021-08-26 22:53:13.513 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.06810221821069717
2021-08-26 22:53:13.515 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.15448978543281555
2021-08-26 22:53:13.519 | INFO     | src.policies:train:116 - Epoch 260 / 800
2021-08-26 22:53:13.520 | INFO     | src.policies:collec

2021-08-26 22:53:14.031 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2525608241558075
2021-08-26 22:53:14.032 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.15290716290473938
2021-08-26 22:53:14.035 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2525608241558075
2021-08-26 22:53:14.037 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.15290716290473938
2021-08-26 22:53:14.040 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:53:14.042 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.29254990816116333
2021-08-26 22:53:14.044 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.06019378826022148
2021-08-26 22:53:14.046 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.15663646161556244
2021-08-26 22:53:14.048 |

2021-08-26 22:53:14.557 | INFO     | src.policies:collect_trajectories:213 - Episode 1269
2021-08-26 22:53:14.576 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:14.577 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 31.0
2021-08-26 22:53:14.578 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 31.5
2021-08-26 22:53:14.579 | INFO     | src.policies:collect_trajectories:213 - Episode 1270
2021-08-26 22:53:14.597 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:14.598 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 34.0
2021-08-26 22:53:14.599 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 32.0
2021-08-26 22:53:14.600 | INFO     | src.policies:collect_trajectories:213 - Episode 1271
2021-08-26 22:53:14.688 | DEBUG    | src.policies:execute_episode:398 - Early stopping, al

2021-08-26 22:53:15.304 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:53:15.307 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2528013586997986
2021-08-26 22:53:15.310 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.400821328163147
2021-08-26 22:53:15.312 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1562713086605072
2021-08-26 22:53:15.315 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.400821328163147
2021-08-26 22:53:15.318 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1562713086605072
2021-08-26 22:53:15.321 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:53:15.323 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.25974345207214355
2021-08-26 22:53:15.326 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient

2021-08-26 22:53:15.791 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:53:15.794 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.27632462978363037
2021-08-26 22:53:15.796 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.09272276610136032
2021-08-26 22:53:15.798 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1476481556892395
2021-08-26 22:53:15.800 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.09272276610136032
2021-08-26 22:53:15.803 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1476481556892395
2021-08-26 22:53:15.806 | INFO     | src.policies:train:116 - Epoch 270 / 800
2021-08-26 22:53:15.807 | INFO     | src.policies:collect_trajectories:213 - Episode 1284
2021-08-26 22:53:15.874 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:53:16.281 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 25.0
2021-08-26 22:53:16.282 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 39.0
2021-08-26 22:53:16.283 | INFO     | src.policies:collect_trajectories:213 - Episode 1293
2021-08-26 22:53:16.377 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:16.378 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:53:16.379 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 79.25
2021-08-26 22:53:16.386 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:53:16.388 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2822829484939575
2021-08-26 22:53:16.390 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3681652247905731
2021-08-26 22:53:16.393 | INFO     | src.policies:minibatch_upd

2021-08-26 22:53:16.886 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:53:16.889 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2722461223602295
2021-08-26 22:53:16.891 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.25983956456184387
2021-08-26 22:53:16.893 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1436367630958557
2021-08-26 22:53:16.895 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.25983956456184387
2021-08-26 22:53:16.898 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1436367630958557
2021-08-26 22:53:16.901 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:53:16.903 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2634822726249695
2021-08-26 22:53:16.906 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradi

2021-08-26 22:53:17.500 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.21823865175247192
2021-08-26 22:53:17.502 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1431506723165512
2021-08-26 22:53:17.504 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.21823865175247192
2021-08-26 22:53:17.506 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1431506723165512
2021-08-26 22:53:17.510 | INFO     | src.policies:train:116 - Epoch 278 / 800
2021-08-26 22:53:17.511 | INFO     | src.policies:collect_trajectories:213 - Episode 1305
2021-08-26 22:53:17.575 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:17.577 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 51.0
2021-08-26 22:53:17.577 | INFO     | src.policies:collect_trajectories:230 - Last 100 epi

2021-08-26 22:53:18.006 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.46355825662612915
2021-08-26 22:53:18.009 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1325530707836151
2021-08-26 22:53:18.012 | INFO     | src.policies:train:116 - Epoch 280 / 800
2021-08-26 22:53:18.013 | INFO     | src.policies:collect_trajectories:213 - Episode 1313
2021-08-26 22:53:18.037 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:18.038 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 46.0
2021-08-26 22:53:18.038 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 46.0
2021-08-26 22:53:18.039 | INFO     | src.policies:collect_trajectories:213 - Episode 1314
2021-08-26 22:53:18.158 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:53:18.601 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 79.0
2021-08-26 22:53:18.602 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 77.0
2021-08-26 22:53:18.608 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:18.612 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2539801001548767
2021-08-26 22:53:18.614 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.22992394864559174
2021-08-26 22:53:18.616 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1371193379163742
2021-08-26 22:53:18.659 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.22992394864559174
2021-08-26 22:53:18.662 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1371193379163742
2021-08-26 22:53:18.665 | INFO     | src.policies:train:152 - Mi

2021-08-26 22:53:19.095 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 85.5
2021-08-26 22:53:19.096 | INFO     | src.policies:collect_trajectories:213 - Episode 1329
2021-08-26 22:53:19.211 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:19.212 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 162.0
2021-08-26 22:53:19.214 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 111.0
2021-08-26 22:53:19.220 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:53:19.223 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.26149511337280273
2021-08-26 22:53:19.226 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.08883772045373917
2021-08-26 22:53:19.227 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.13635684549808502

2021-08-26 22:53:19.639 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.10716300457715988
2021-08-26 22:53:19.642 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.13234765827655792
2021-08-26 22:53:19.645 | INFO     | src.policies:train:116 - Epoch 289 / 800
2021-08-26 22:53:19.646 | INFO     | src.policies:collect_trajectories:213 - Episode 1336
2021-08-26 22:53:19.747 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:19.748 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 116.0
2021-08-26 22:53:19.749 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 116.0
2021-08-26 22:53:19.750 | INFO     | src.policies:collect_trajectories:213 - Episode 1337
2021-08-26 22:53:19.780 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:53:20.157 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:20.159 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.26980865001678467
2021-08-26 22:53:20.162 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.24666713178157806
2021-08-26 22:53:20.164 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.13121137022972107
2021-08-26 22:53:20.166 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.24666713178157806
2021-08-26 22:53:20.168 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.13121137022972107
2021-08-26 22:53:20.171 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:53:20.173 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.24401652812957764
2021-08-26 22:53:20.176 | INFO     | src.policies:minibatch_update:277 - Policy network L2 g

2021-08-26 22:53:20.798 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 121.0
2021-08-26 22:53:20.799 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 46.75
2021-08-26 22:53:20.800 | INFO     | src.policies:collect_trajectories:213 - Episode 1356
2021-08-26 22:53:20.817 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:20.818 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 31.0
2021-08-26 22:53:20.819 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 43.6
2021-08-26 22:53:20.825 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:20.828 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3024306893348694
2021-08-26 22:53:20.830 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.26611292362213135
2021-08-26 22:53:20.833 | INFO     | src.policies:minibatch_up

2021-08-26 22:53:21.345 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 178.0
2021-08-26 22:53:21.346 | INFO     | src.policies:collect_trajectories:213 - Episode 1364
2021-08-26 22:53:21.438 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:21.439 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 33.0
2021-08-26 22:53:21.440 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 105.5
2021-08-26 22:53:21.480 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:21.483 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.26915550231933594
2021-08-26 22:53:21.485 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.26528531312942505
2021-08-26 22:53:21.487 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1288987696170807

2021-08-26 22:53:21.823 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:53:21.826 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.27836817502975464
2021-08-26 22:53:21.828 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.11327140778303146
2021-08-26 22:53:21.830 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.12098182737827301
2021-08-26 22:53:21.833 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.11327140778303146
2021-08-26 22:53:21.835 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.12098182737827301
2021-08-26 22:53:21.838 | INFO     | src.policies:train:116 - Epoch 300 / 800
2021-08-26 22:53:21.839 | INFO     | src.policies:collect_trajectories:213 - Episode 1374
2021-08-26 22:53:21.852 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:53:22.341 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.26937806606292725
2021-08-26 22:53:22.343 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.12417688965797424
2021-08-26 22:53:22.345 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:53:22.348 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.24666154384613037
2021-08-26 22:53:22.350 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.22820183634757996
2021-08-26 22:53:22.352 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.12502053380012512
2021-08-26 22:53:22.354 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.22820183634757996
2021-08-26 22:53:22.356 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.12502053380

2021-08-26 22:53:22.854 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 136.0
2021-08-26 22:53:22.855 | INFO     | src.policies:collect_trajectories:213 - Episode 1391
2021-08-26 22:53:22.906 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:22.907 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 106.0
2021-08-26 22:53:22.908 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 121.0
2021-08-26 22:53:22.914 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:22.917 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2529377341270447
2021-08-26 22:53:22.919 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.09079615771770477
2021-08-26 22:53:22.921 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.12233386188745499

2021-08-26 22:53:23.554 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:23.557 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.23942416906356812
2021-08-26 22:53:23.559 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.4504494369029999
2021-08-26 22:53:23.561 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1190524622797966
2021-08-26 22:53:23.563 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4504494369029999
2021-08-26 22:53:23.566 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1190524622797966
2021-08-26 22:53:23.569 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:53:23.571 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2796165943145752
2021-08-26 22:53:23.574 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradie

2021-08-26 22:53:23.998 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 17.0
2021-08-26 22:53:23.999 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 17.0
2021-08-26 22:53:24.000 | INFO     | src.policies:collect_trajectories:213 - Episode 1406
2021-08-26 22:53:24.066 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:24.067 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 134.0
2021-08-26 22:53:24.068 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 75.5
2021-08-26 22:53:24.069 | INFO     | src.policies:collect_trajectories:213 - Episode 1407
2021-08-26 22:53:24.125 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:24.126 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 118.0
2021-08-26 22:53:24.126 | INFO     | src.policies:collect_trajectories:230 

2021-08-26 22:53:24.517 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 54.0
2021-08-26 22:53:24.518 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 54.0
2021-08-26 22:53:24.519 | INFO     | src.policies:collect_trajectories:213 - Episode 1414
2021-08-26 22:53:24.575 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:24.577 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 116.0
2021-08-26 22:53:24.577 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 85.0
2021-08-26 22:53:24.578 | INFO     | src.policies:collect_trajectories:213 - Episode 1415
2021-08-26 22:53:24.647 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:24.648 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 145.0
2021-08-26 22:53:24.649 | INFO     | src.policies:collect_trajectories:230 

2021-08-26 22:53:25.047 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.08207173645496368
2021-08-26 22:53:25.050 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.11279896646738052
2021-08-26 22:53:25.053 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.08207173645496368
2021-08-26 22:53:25.057 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.11279896646738052
2021-08-26 22:53:25.061 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:53:25.064 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2544935941696167
2021-08-26 22:53:25.068 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.15793125331401825
2021-08-26 22:53:25.071 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.11074353009462357

2021-08-26 22:53:25.725 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:25.727 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 107.0
2021-08-26 22:53:25.727 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 103.5
2021-08-26 22:53:25.733 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:25.737 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.24364089965820312
2021-08-26 22:53:25.739 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.29760393500328064
2021-08-26 22:53:25.741 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.11175183206796646
2021-08-26 22:53:25.743 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.29760393500328064
2021-08-26 22:53:25.745 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradi

2021-08-26 22:53:26.281 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.24152517318725586
2021-08-26 22:53:26.283 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.34825727343559265
2021-08-26 22:53:26.285 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.10909350961446762
2021-08-26 22:53:26.288 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.34825727343559265
2021-08-26 22:53:26.290 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.10909350961446762
2021-08-26 22:53:26.293 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:53:26.295 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.27776962518692017
2021-08-26 22:53:26.298 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1629704236984253
2021-08-26 22:53:26.300 | INFO     | src.polic

2021-08-26 22:53:26.923 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2747313380241394
2021-08-26 22:53:26.926 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.10105405002832413
2021-08-26 22:53:26.928 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.10538236051797867
2021-08-26 22:53:26.930 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.10105405002832413
2021-08-26 22:53:26.932 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.10538236051797867
2021-08-26 22:53:26.935 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:53:26.938 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2589210867881775
2021-08-26 22:53:26.940 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1916641891002655
2021-08-26 22:53:26.942 | INFO     | src.policie

2021-08-26 22:53:27.357 | INFO     | src.policies:train:116 - Epoch 328 / 800
2021-08-26 22:53:27.358 | INFO     | src.policies:collect_trajectories:213 - Episode 1453
2021-08-26 22:53:27.415 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:27.417 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 117.0
2021-08-26 22:53:27.418 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 117.0
2021-08-26 22:53:27.420 | INFO     | src.policies:collect_trajectories:213 - Episode 1454
2021-08-26 22:53:27.487 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:27.489 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 139.0
2021-08-26 22:53:27.489 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 128.0
2021-08-26 22:53:27.495 | INFO     | src.policies:train:152 - Mini-batch 1 / 2

2021-08-26 22:53:28.058 | INFO     | src.policies:collect_trajectories:213 - Episode 1461
2021-08-26 22:53:28.089 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:28.090 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 61.0
2021-08-26 22:53:28.091 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 76.33333333333333
2021-08-26 22:53:28.128 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:28.132 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2697315216064453
2021-08-26 22:53:28.134 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.048965614289045334
2021-08-26 22:53:28.137 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.10267767310142517
2021-08-26 22:53:28.139 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.048965614289045

2021-08-26 22:53:28.575 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.12742911279201508
2021-08-26 22:53:28.577 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.10351883620023727
2021-08-26 22:53:28.579 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.12742911279201508
2021-08-26 22:53:28.582 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.10351883620023727
2021-08-26 22:53:28.584 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:53:28.587 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.27212709188461304
2021-08-26 22:53:28.589 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.28155258297920227
2021-08-26 22:53:28.591 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.10379909723997116

2021-08-26 22:53:29.124 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.10202275961637497
2021-08-26 22:53:29.126 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4999990165233612
2021-08-26 22:53:29.128 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.10202275961637497
2021-08-26 22:53:29.131 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:53:29.133 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2573700547218323
2021-08-26 22:53:29.135 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.15225693583488464
2021-08-26 22:53:29.138 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.10285602509975433
2021-08-26 22:53:29.140 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.15225693583488464

2021-08-26 22:53:29.689 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 88.0
2021-08-26 22:53:29.690 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 69.66666666666667
2021-08-26 22:53:29.696 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:29.699 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2686969041824341
2021-08-26 22:53:29.701 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.4414989948272705
2021-08-26 22:53:29.703 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.09948237985372543
2021-08-26 22:53:29.705 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4414989948272705
2021-08-26 22:53:29.708 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.09948237985372543
2021-08-26 22:53:29.813 | INFO     | src.policies:t

2021-08-26 22:53:30.234 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.09465302526950836
2021-08-26 22:53:30.237 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:53:30.240 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.28503310680389404
2021-08-26 22:53:30.242 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.28780901432037354
2021-08-26 22:53:30.244 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.09210831671953201
2021-08-26 22:53:30.246 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.28780901432037354
2021-08-26 22:53:30.248 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.09210831671953201
2021-08-26 22:53:30.252 | INFO     | src.policies:train:116 - Epoch 342 / 800
2021-08-26 22:53:30.253 | INFO     | src.policies:collec

2021-08-26 22:53:30.836 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:30.837 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 85.0
2021-08-26 22:53:30.838 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 73.33333333333333
2021-08-26 22:53:30.845 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:30.847 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.25334739685058594
2021-08-26 22:53:30.850 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.17426197230815887
2021-08-26 22:53:30.852 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.09517588466405869
2021-08-26 22:53:30.854 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.17426197230815887
2021-08-26 22:53:30.856 | INFO     | src.policies:minibatch_update:295 - Baseline netwo

2021-08-26 22:53:31.395 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 118.0
2021-08-26 22:53:31.396 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 114.5
2021-08-26 22:53:31.401 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:31.404 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.24940162897109985
2021-08-26 22:53:31.407 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.21146146953105927
2021-08-26 22:53:31.409 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.09620987623929977
2021-08-26 22:53:31.451 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.21146146953105927
2021-08-26 22:53:31.454 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.09620987623929977
2021-08-26 22:53:31.457 | INFO     | src.policies:train:152

2021-08-26 22:53:32.085 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:53:32.089 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2814508080482483
2021-08-26 22:53:32.092 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.12659130990505219
2021-08-26 22:53:32.095 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.09282840043306351
2021-08-26 22:53:32.098 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.12659130990505219
2021-08-26 22:53:32.102 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.09282840043306351
2021-08-26 22:53:32.106 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:53:32.109 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.29267871379852295
2021-08-26 22:53:32.112 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gr

2021-08-26 22:53:32.735 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:53:32.738 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2735423445701599
2021-08-26 22:53:32.740 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.23789110779762268
2021-08-26 22:53:32.742 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.08966871351003647
2021-08-26 22:53:32.745 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.23789110779762268
2021-08-26 22:53:32.747 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.08966871351003647
2021-08-26 22:53:32.751 | INFO     | src.policies:train:116 - Epoch 354 / 800
2021-08-26 22:53:32.752 | INFO     | src.policies:collect_trajectories:213 - Episode 1515
2021-08-26 22:53:32.831 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:53:33.268 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.08832826465368271
2021-08-26 22:53:33.270 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.17421719431877136
2021-08-26 22:53:33.272 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.08832826465368271
2021-08-26 22:53:33.276 | INFO     | src.policies:train:116 - Epoch 357 / 800
2021-08-26 22:53:33.277 | INFO     | src.policies:collect_trajectories:213 - Episode 1521
2021-08-26 22:53:33.339 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:33.341 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 134.0
2021-08-26 22:53:33.341 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 134.0
2021-08-26 22:53:33.342 | INFO     | src.policies:collect_trajectories:213 - Episode 1522
2021-08-

2021-08-26 22:53:33.863 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:33.865 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2775306701660156
2021-08-26 22:53:33.868 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.08392682671546936
2021-08-26 22:53:33.870 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.08458039164543152
2021-08-26 22:53:33.951 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.08392682671546936
2021-08-26 22:53:33.954 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.08458039164543152
2021-08-26 22:53:33.956 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:53:33.959 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2690761089324951
2021-08-26 22:53:33.961 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gra

2021-08-26 22:53:34.433 | INFO     | src.policies:train:116 - Epoch 363 / 800
2021-08-26 22:53:34.433 | INFO     | src.policies:collect_trajectories:213 - Episode 1533
2021-08-26 22:53:34.508 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:34.509 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 169.0
2021-08-26 22:53:34.510 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 169.0
2021-08-26 22:53:34.511 | INFO     | src.policies:collect_trajectories:213 - Episode 1534
2021-08-26 22:53:34.579 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:34.580 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 156.0
2021-08-26 22:53:34.580 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 162.5
2021-08-26 22:53:34.586 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:53:34

2021-08-26 22:53:34.986 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:34.987 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 28.0
2021-08-26 22:53:34.988 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 28.0
2021-08-26 22:53:34.989 | INFO     | src.policies:collect_trajectories:213 - Episode 1540
2021-08-26 22:53:35.004 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:35.005 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 26.0
2021-08-26 22:53:35.006 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 27.0
2021-08-26 22:53:35.007 | INFO     | src.policies:collect_trajectories:213 - Episode 1541
2021-08-26 22:53:35.023 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:35.024 | INFO     | src.policies:collect_trajectories:229 -

2021-08-26 22:53:35.565 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.08009141683578491
2021-08-26 22:53:35.567 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.18566402792930603
2021-08-26 22:53:35.569 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.08009141683578491
2021-08-26 22:53:35.572 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:53:35.574 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.28787994384765625
2021-08-26 22:53:35.576 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.13970443606376648
2021-08-26 22:53:35.578 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.08233178406953812
2021-08-26 22:53:35.580 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.13970443606376648
2021-08-

2021-08-26 22:53:36.194 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.08467475324869156
2021-08-26 22:53:36.197 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:53:36.199 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.25301146507263184
2021-08-26 22:53:36.201 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3786354660987854
2021-08-26 22:53:36.203 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.08257825672626495
2021-08-26 22:53:36.206 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.3786354660987854
2021-08-26 22:53:36.208 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.08257825672626495
2021-08-26 22:53:36.211 | INFO     | src.policies:train:116 - Epoch 373 / 800
2021-08-26 22:53:36.212 | INFO     | src.policies:collect_

2021-08-26 22:53:36.753 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:36.754 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 196.0
2021-08-26 22:53:36.755 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 105.0
2021-08-26 22:53:36.760 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:36.763 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.31045424938201904
2021-08-26 22:53:36.766 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.31987473368644714
2021-08-26 22:53:36.767 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.07821853458881378
2021-08-26 22:53:36.769 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.31987473368644714
2021-08-26 22:53:36.772 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradi

2021-08-26 22:53:37.256 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:53:37.258 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.27588313817977905
2021-08-26 22:53:37.260 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.17981138825416565
2021-08-26 22:53:37.262 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.0797596126794815
2021-08-26 22:53:37.265 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.17981138825416565
2021-08-26 22:53:37.267 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0797596126794815
2021-08-26 22:53:37.270 | INFO     | src.policies:train:116 - Epoch 380 / 800
2021-08-26 22:53:37.271 | INFO     | src.policies:collect_trajectories:213 - Episode 1564
2021-08-26 22:53:37.360 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08

2021-08-26 22:53:37.858 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:53:37.861 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2764965891838074
2021-08-26 22:53:37.863 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2033068984746933
2021-08-26 22:53:37.865 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.07807987928390503
2021-08-26 22:53:37.867 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2033068984746933
2021-08-26 22:53:37.869 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.07807987928390503
2021-08-26 22:53:37.872 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:53:37.874 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2862950563430786
2021-08-26 22:53:37.877 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradi

2021-08-26 22:53:38.462 | INFO     | src.policies:train:116 - Epoch 387 / 800
2021-08-26 22:53:38.463 | INFO     | src.policies:collect_trajectories:213 - Episode 1574
2021-08-26 22:53:38.682 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:38.683 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:53:38.684 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:53:38.688 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:38.691 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2933970093727112
2021-08-26 22:53:38.693 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.25793856382369995
2021-08-26 22:53:38.695 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.07631514221429825
2021-08-26 22:53:38.697 | INFO     | src.policies:minibatch_update:288 -

2021-08-26 22:53:39.141 | INFO     | src.policies:train:116 - Epoch 391 / 800
2021-08-26 22:53:39.142 | INFO     | src.policies:collect_trajectories:213 - Episode 1579
2021-08-26 22:53:39.268 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:39.269 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 185.0
2021-08-26 22:53:39.270 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 185.0
2021-08-26 22:53:39.271 | INFO     | src.policies:collect_trajectories:213 - Episode 1580
2021-08-26 22:53:39.324 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:39.325 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 114.0
2021-08-26 22:53:39.326 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 149.5
2021-08-26 22:53:39.331 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:39

2021-08-26 22:53:39.692 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.09957858920097351
2021-08-26 22:53:39.694 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.07394430786371231
2021-08-26 22:53:39.697 | INFO     | src.policies:train:116 - Epoch 394 / 800
2021-08-26 22:53:39.698 | INFO     | src.policies:collect_trajectories:213 - Episode 1586
2021-08-26 22:53:39.810 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:39.811 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 160.0
2021-08-26 22:53:39.812 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 160.0
2021-08-26 22:53:39.813 | INFO     | src.policies:collect_trajectories:213 - Episode 1587
2021-08-26 22:53:39.843 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:39.844 | INFO  

2021-08-26 22:53:40.392 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3011589050292969
2021-08-26 22:53:40.395 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.19863292574882507
2021-08-26 22:53:40.397 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.07325095683336258
2021-08-26 22:53:40.399 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.19863292574882507
2021-08-26 22:53:40.401 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.07325095683336258
2021-08-26 22:53:40.403 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:53:40.406 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.27606308460235596
2021-08-26 22:53:40.408 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2801094055175781
2021-08-26 22:53:40.410 | INFO     | src.polici

2021-08-26 22:53:40.942 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:40.944 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.29023057222366333
2021-08-26 22:53:40.946 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.14548823237419128
2021-08-26 22:53:40.948 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.07305245101451874
2021-08-26 22:53:40.951 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.14548823237419128
2021-08-26 22:53:40.953 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.07305245101451874
2021-08-26 22:53:40.956 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:53:40.958 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2894570827484131
2021-08-26 22:53:40.960 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gr

2021-08-26 22:53:41.440 | INFO     | src.policies:train:116 - Epoch 404 / 800
2021-08-26 22:53:41.441 | INFO     | src.policies:collect_trajectories:213 - Episode 1603
2021-08-26 22:53:41.510 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:41.511 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 151.0
2021-08-26 22:53:41.512 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 151.0
2021-08-26 22:53:41.513 | INFO     | src.policies:collect_trajectories:213 - Episode 1604
2021-08-26 22:53:41.554 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:41.555 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 90.0
2021-08-26 22:53:41.556 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 120.5
2021-08-26 22:53:41.561 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:41.

2021-08-26 22:53:42.065 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.07412563264369965
2021-08-26 22:53:42.067 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.07053814083337784
2021-08-26 22:53:42.069 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.07412563264369965
2021-08-26 22:53:42.071 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.07053814083337784
2021-08-26 22:53:42.074 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:53:42.076 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.31150591373443604
2021-08-26 22:53:42.079 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1532408595085144
2021-08-26 22:53:42.080 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.06835576146841049
2021-08-26 22:53:42.083 

2021-08-26 22:53:42.656 | INFO     | src.policies:train:116 - Epoch 411 / 800
2021-08-26 22:53:42.657 | INFO     | src.policies:collect_trajectories:213 - Episode 1613
2021-08-26 22:53:42.701 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:42.702 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 99.0
2021-08-26 22:53:42.703 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 99.0
2021-08-26 22:53:42.704 | INFO     | src.policies:collect_trajectories:213 - Episode 1614
2021-08-26 22:53:42.760 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:42.761 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 124.0
2021-08-26 22:53:42.762 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 111.5
2021-08-26 22:53:42.767 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:42.7

2021-08-26 22:53:43.177 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999934434890747
2021-08-26 22:53:43.180 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.06933325529098511
2021-08-26 22:53:43.183 | INFO     | src.policies:train:116 - Epoch 414 / 800
2021-08-26 22:53:43.183 | INFO     | src.policies:collect_trajectories:213 - Episode 1620
2021-08-26 22:53:43.231 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:43.232 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 104.0
2021-08-26 22:53:43.232 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 104.0
2021-08-26 22:53:43.233 | INFO     | src.policies:collect_trajectories:213 - Episode 1621
2021-08-26 22:53:43.275 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:43.276 | INFO  

2021-08-26 22:53:43.740 | INFO     | src.policies:train:116 - Epoch 417 / 800
2021-08-26 22:53:43.741 | INFO     | src.policies:collect_trajectories:213 - Episode 1626
2021-08-26 22:53:43.760 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:43.761 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 41.0
2021-08-26 22:53:43.762 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 41.0
2021-08-26 22:53:43.763 | INFO     | src.policies:collect_trajectories:213 - Episode 1627
2021-08-26 22:53:43.853 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:43.854 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:53:43.855 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 120.5
2021-08-26 22:53:43.861 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:43.8

2021-08-26 22:53:44.484 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 166.0
2021-08-26 22:53:44.485 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 179.5
2021-08-26 22:53:44.491 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:53:44.493 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.27747321128845215
2021-08-26 22:53:44.496 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.5441098213195801
2021-08-26 22:53:44.498 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.06564519554376602
2021-08-26 22:53:44.500 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999895691871643
2021-08-26 22:53:44.502 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.06564519554376602
2021-08-26 22:53:44.505 | INFO     | src.policies:train:152 

2021-08-26 22:53:44.966 | INFO     | src.policies:collect_trajectories:213 - Episode 1639
2021-08-26 22:53:45.023 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:45.025 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 126.0
2021-08-26 22:53:45.025 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 130.5
2021-08-26 22:53:45.031 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:45.034 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.27314555644989014
2021-08-26 22:53:45.037 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1780727207660675
2021-08-26 22:53:45.039 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.06668971478939056
2021-08-26 22:53:45.041 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1780727207660675
2021-08-26

2021-08-26 22:53:45.552 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.15109463036060333
2021-08-26 22:53:45.555 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.06013299152255058
2021-08-26 22:53:45.557 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:53:45.560 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30217576026916504
2021-08-26 22:53:45.562 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.39554429054260254
2021-08-26 22:53:45.563 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.061946891248226166
2021-08-26 22:53:45.565 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.39554429054260254
2021-08-26 22:53:45.568 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0619468912

2021-08-26 22:53:46.055 | INFO     | src.policies:train:116 - Epoch 429 / 800
2021-08-26 22:53:46.056 | INFO     | src.policies:collect_trajectories:213 - Episode 1654
2021-08-26 22:53:46.139 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:46.140 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 182.0
2021-08-26 22:53:46.140 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 182.0
2021-08-26 22:53:46.141 | INFO     | src.policies:collect_trajectories:213 - Episode 1655
2021-08-26 22:53:46.230 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:46.232 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:53:46.232 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 191.0
2021-08-26 22:53:46.239 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:53:46

2021-08-26 22:53:46.734 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.15491729974746704
2021-08-26 22:53:46.736 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.06107913330197334
2021-08-26 22:53:46.738 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.15491729974746704
2021-08-26 22:53:46.741 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.06107913330197334
2021-08-26 22:53:46.744 | INFO     | src.policies:train:116 - Epoch 432 / 800
2021-08-26 22:53:46.745 | INFO     | src.policies:collect_trajectories:213 - Episode 1660
2021-08-26 22:53:46.835 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:46.837 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:53:46.837 | INFO     | src.policies:collect_trajectories:230 - Last 100 

2021-08-26 22:53:47.285 | INFO     | src.policies:collect_trajectories:213 - Episode 1667
2021-08-26 22:53:47.347 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:47.349 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 140.0
2021-08-26 22:53:47.349 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 135.0
2021-08-26 22:53:47.355 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:47.358 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2676020860671997
2021-08-26 22:53:47.360 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.17452801764011383
2021-08-26 22:53:47.362 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.062072303146123886
2021-08-26 22:53:47.364 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.17452801764011383
2021-08-

2021-08-26 22:53:47.886 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.23536469042301178
2021-08-26 22:53:47.888 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.06107884272933006
2021-08-26 22:53:47.891 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:53:47.893 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.27500009536743164
2021-08-26 22:53:47.895 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.27604949474334717
2021-08-26 22:53:47.897 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.05963527783751488
2021-08-26 22:53:47.899 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.27604949474334717
2021-08-26 22:53:47.901 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.05963527783

2021-08-26 22:53:48.365 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:53:48.367 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2898544669151306
2021-08-26 22:53:48.370 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.20080412924289703
2021-08-26 22:53:48.371 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.05949198082089424
2021-08-26 22:53:48.374 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.20080412924289703
2021-08-26 22:53:48.376 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.05949198082089424
2021-08-26 22:53:48.379 | INFO     | src.policies:train:116 - Epoch 442 / 800
2021-08-26 22:53:48.380 | INFO     | src.policies:collect_trajectories:213 - Episode 1680
2021-08-26 22:53:48.561 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-0

2021-08-26 22:53:49.016 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.05939440056681633
2021-08-26 22:53:49.018 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.3777102530002594
2021-08-26 22:53:49.020 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.05939440056681633
2021-08-26 22:53:49.023 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:53:49.025 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2873687744140625
2021-08-26 22:53:49.028 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.48181095719337463
2021-08-26 22:53:49.029 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.058072857558727264
2021-08-26 22:53:49.031 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.48181095719337463
2021-08-2

2021-08-26 22:53:49.519 | INFO     | src.policies:collect_trajectories:213 - Episode 1690
2021-08-26 22:53:49.618 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:49.619 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 146.0
2021-08-26 22:53:49.620 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 146.0
2021-08-26 22:53:49.621 | INFO     | src.policies:collect_trajectories:213 - Episode 1691
2021-08-26 22:53:49.713 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:49.714 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:53:49.715 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 173.0
2021-08-26 22:53:49.721 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:53:49.724 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.281431436

2021-08-26 22:53:50.172 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.28816890716552734
2021-08-26 22:53:50.174 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.13045121729373932
2021-08-26 22:53:50.176 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.058623213320970535
2021-08-26 22:53:50.178 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.13045121729373932
2021-08-26 22:53:50.180 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.058623213320970535
2021-08-26 22:53:50.183 | INFO     | src.policies:train:116 - Epoch 452 / 800
2021-08-26 22:53:50.184 | INFO     | src.policies:collect_trajectories:213 - Episode 1695
2021-08-26 22:53:50.276 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:50.277 | INFO     | src.policies:collect_trajectories:229 - Mean 

2021-08-26 22:53:50.783 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.05465243011713028
2021-08-26 22:53:50.785 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.252528578042984
2021-08-26 22:53:50.787 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.05465243011713028
2021-08-26 22:53:50.790 | INFO     | src.policies:train:116 - Epoch 455 / 800
2021-08-26 22:53:50.791 | INFO     | src.policies:collect_trajectories:213 - Episode 1701
2021-08-26 22:53:50.829 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:50.831 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 84.0
2021-08-26 22:53:50.831 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 84.0
2021-08-26 22:53:50.832 | INFO     | src.policies:collect_trajectories:213 - Episode 1702
2021-08-26 2

2021-08-26 22:53:51.489 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.3425098955631256
2021-08-26 22:53:51.491 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.056264445185661316
2021-08-26 22:53:51.494 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:53:51.496 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.294073224067688
2021-08-26 22:53:51.498 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2072756290435791
2021-08-26 22:53:51.501 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.054059673100709915
2021-08-26 22:53:51.504 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2072756290435791
2021-08-26 22:53:51.506 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.05405967310070

2021-08-26 22:53:52.045 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2559564411640167
2021-08-26 22:53:52.048 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.05223684757947922
2021-08-26 22:53:52.050 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:53:52.053 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2898896336555481
2021-08-26 22:53:52.055 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.18297968804836273
2021-08-26 22:53:52.057 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.055186185985803604
2021-08-26 22:53:52.060 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.18297968804836273
2021-08-26 22:53:52.062 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.055186185985

2021-08-26 22:53:52.699 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2369442880153656
2021-08-26 22:53:52.702 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.05451539158821106
2021-08-26 22:53:52.705 | INFO     | src.policies:train:116 - Epoch 465 / 800
2021-08-26 22:53:52.706 | INFO     | src.policies:collect_trajectories:213 - Episode 1717
2021-08-26 22:53:52.793 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:52.794 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:53:52.795 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:53:52.799 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:52.802 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3011283278465271
2021-08-26 22:53:52.805 | INFO     | src.po

2021-08-26 22:53:53.300 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:53:53.302 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2984055280685425
2021-08-26 22:53:53.305 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.21273714303970337
2021-08-26 22:53:53.307 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.05305636674165726
2021-08-26 22:53:53.309 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.21273714303970337
2021-08-26 22:53:53.311 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.05305636674165726
2021-08-26 22:53:53.314 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:53:53.316 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2659403085708618
2021-08-26 22:53:53.319 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gra

2021-08-26 22:53:53.858 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.05201487988233566
2021-08-26 22:53:53.861 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:53:53.863 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.31120723485946655
2021-08-26 22:53:53.866 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.34848567843437195
2021-08-26 22:53:53.868 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.05225111544132233
2021-08-26 22:53:53.870 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.34848567843437195
2021-08-26 22:53:53.872 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.05225111544132233
2021-08-26 22:53:53.875 | INFO     | src.policies:train:116 - Epoch 473 / 800
2021-08-26 22:53:53.876 | INFO     | src.policies:collec

2021-08-26 22:53:54.422 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.049309149384498596
2021-08-26 22:53:54.424 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.12783852219581604
2021-08-26 22:53:54.426 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.049309149384498596
2021-08-26 22:53:54.429 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:53:54.431 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.302049458026886
2021-08-26 22:53:54.433 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2302563339471817
2021-08-26 22:53:54.435 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.047019343823194504
2021-08-26 22:53:54.437 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2302563339471817
2021-08-2

2021-08-26 22:53:55.060 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 138.0
2021-08-26 22:53:55.061 | INFO     | src.policies:collect_trajectories:213 - Episode 1740
2021-08-26 22:53:55.183 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:55.184 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:53:55.185 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 169.0
2021-08-26 22:53:55.191 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:53:55.194 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2845802307128906
2021-08-26 22:53:55.197 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2931581139564514
2021-08-26 22:53:55.199 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.051614921540021896
2021-08-26 22:53:55.201 | INFO     |

2021-08-26 22:53:55.692 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:53:55.693 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 147.0
2021-08-26 22:53:55.699 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:55.701 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2982407808303833
2021-08-26 22:53:55.704 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3641453683376312
2021-08-26 22:53:55.706 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.05003003031015396
2021-08-26 22:53:55.708 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.3641453683376312
2021-08-26 22:53:55.710 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.05003003031015396
2021-08-26 22:53:55.713 | INFO     | src.policies:train:152 - 

2021-08-26 22:53:56.144 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.21319988369941711
2021-08-26 22:53:56.146 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.048762645572423935
2021-08-26 22:53:56.149 | INFO     | src.policies:train:116 - Epoch 485 / 800
2021-08-26 22:53:56.150 | INFO     | src.policies:collect_trajectories:213 - Episode 1752
2021-08-26 22:53:56.287 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:56.289 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:53:56.289 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:53:56.293 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:56.296 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30805498361587524
2021-08-26 22:53:56.298 | INFO     | src

2021-08-26 22:53:56.901 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.14472636580467224
2021-08-26 22:53:56.903 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.048711083829402924
2021-08-26 22:53:56.905 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.14472636580467224
2021-08-26 22:53:56.907 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.048711083829402924
2021-08-26 22:53:56.910 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:53:56.912 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2918463945388794
2021-08-26 22:53:56.915 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.08737782388925552
2021-08-26 22:53:56.916 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.04910911247134209
2021-08-26 22:53:56.91

2021-08-26 22:53:57.498 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:57.499 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:53:57.500 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 187.0
2021-08-26 22:53:57.507 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:53:57.510 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30348098278045654
2021-08-26 22:53:57.512 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1373632550239563
2021-08-26 22:53:57.514 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.04901393502950668
2021-08-26 22:53:57.516 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1373632550239563
2021-08-26 22:53:57.519 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradien

2021-08-26 22:53:58.070 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30246424674987793
2021-08-26 22:53:58.073 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3410787582397461
2021-08-26 22:53:58.075 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.04726189374923706
2021-08-26 22:53:58.077 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.3410787582397461
2021-08-26 22:53:58.079 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.04726189374923706
2021-08-26 22:53:58.082 | INFO     | src.policies:train:116 - Epoch 495 / 800
2021-08-26 22:53:58.083 | INFO     | src.policies:collect_trajectories:213 - Episode 1769
2021-08-26 22:53:58.121 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:58.123 | INFO     | src.policies:collect_trajectories:229 - Mean epis

2021-08-26 22:53:58.680 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2942041754722595
2021-08-26 22:53:58.683 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.13181813061237335
2021-08-26 22:53:58.685 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.04708116501569748
2021-08-26 22:53:58.687 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.13181813061237335
2021-08-26 22:53:58.689 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.04708116501569748
2021-08-26 22:53:58.692 | INFO     | src.policies:train:116 - Epoch 499 / 800
2021-08-26 22:53:58.693 | INFO     | src.policies:collect_trajectories:213 - Episode 1774
2021-08-26 22:53:58.786 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:58.787 | INFO     | src.policies:collect_trajectories:229 - Mean epi

2021-08-26 22:53:59.283 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.04658818989992142
2021-08-26 22:53:59.286 | INFO     | src.policies:train:116 - Epoch 503 / 800
2021-08-26 22:53:59.287 | INFO     | src.policies:collect_trajectories:213 - Episode 1778
2021-08-26 22:53:59.375 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:53:59.376 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:53:59.377 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:53:59.381 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:53:59.383 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2882588505744934
2021-08-26 22:53:59.386 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.28536897897720337
2021-08-26 22:53:59.387 | INFO     | src.policies:minibat

2021-08-26 22:53:59.883 | INFO     | src.policies:collect_trajectories:213 - Episode 1784
2021-08-26 22:54:00.000 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:00.001 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:00.003 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 167.0
2021-08-26 22:54:00.009 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:54:00.012 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30647391080856323
2021-08-26 22:54:00.014 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.5331760048866272
2021-08-26 22:54:00.016 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.0453246645629406
2021-08-26 22:54:00.019 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4999989867210388
2021-08-26 

2021-08-26 22:54:00.406 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3955647051334381
2021-08-26 22:54:00.408 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.04464150220155716
2021-08-26 22:54:00.410 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.3955647051334381
2021-08-26 22:54:00.412 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.04464150220155716
2021-08-26 22:54:00.415 | INFO     | src.policies:train:116 - Epoch 509 / 800
2021-08-26 22:54:00.416 | INFO     | src.policies:collect_trajectories:213 - Episode 1790
2021-08-26 22:54:00.473 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:00.474 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 125.0
2021-08-26 22:54:00.475 | INFO     | src.policies:collect_trajectories:230 - Last 100 ep

2021-08-26 22:54:01.085 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1392490565776825
2021-08-26 22:54:01.088 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.04528578370809555
2021-08-26 22:54:01.090 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1392490565776825
2021-08-26 22:54:01.092 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.04528578370809555
2021-08-26 22:54:01.096 | INFO     | src.policies:train:116 - Epoch 512 / 800
2021-08-26 22:54:01.096 | INFO     | src.policies:collect_trajectories:213 - Episode 1795
2021-08-26 22:54:01.187 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:01.188 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:01.189 | INFO     | src.policies:collect_trajectories:230 - Last 100 ep

2021-08-26 22:54:01.654 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.11306509375572205
2021-08-26 22:54:01.657 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.045110367238521576
2021-08-26 22:54:01.659 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:54:01.661 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2952841520309448
2021-08-26 22:54:01.663 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.06501305103302002
2021-08-26 22:54:01.665 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.043025773018598557
2021-08-26 22:54:01.667 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.06501305103302002
2021-08-26 22:54:01.669 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0430257730

2021-08-26 22:54:02.241 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.23022815585136414
2021-08-26 22:54:02.243 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.04343334585428238
2021-08-26 22:54:02.246 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:54:02.248 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.29724961519241333
2021-08-26 22:54:02.251 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.08502446860074997
2021-08-26 22:54:02.253 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.043469689786434174
2021-08-26 22:54:02.255 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.08502446860074997
2021-08-26 22:54:02.257 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0434696897

2021-08-26 22:54:02.805 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.30252909660339355
2021-08-26 22:54:02.807 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.04192591831088066
2021-08-26 22:54:02.810 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:54:02.812 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.31143856048583984
2021-08-26 22:54:02.814 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2721191644668579
2021-08-26 22:54:02.816 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.042179979383945465
2021-08-26 22:54:02.818 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2721191644668579
2021-08-26 22:54:02.820 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.042179979383

2021-08-26 22:54:03.472 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:54:03.476 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:03.479 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30167442560195923
2021-08-26 22:54:03.482 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.11191126704216003
2021-08-26 22:54:03.483 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.042317137122154236
2021-08-26 22:54:03.486 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.11191126704216003
2021-08-26 22:54:03.488 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.042317137122154236
2021-08-26 22:54:03.491 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:54:03.493 | INFO     | src.policies:minibatch_update:270 - Total los

2021-08-26 22:54:04.092 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0410425029695034
2021-08-26 22:54:04.094 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:54:04.096 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.31117719411849976
2021-08-26 22:54:04.099 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.34792476892471313
2021-08-26 22:54:04.100 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.04082007706165314
2021-08-26 22:54:04.102 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.34792476892471313
2021-08-26 22:54:04.104 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.04082007706165314
2021-08-26 22:54:04.107 | INFO     | src.policies:train:116 - Epoch 531 / 800
2021-08-26 22:54:04.108 | INFO     | src.policies:collect

2021-08-26 22:54:04.642 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2967163920402527
2021-08-26 22:54:04.644 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.5525871515274048
2021-08-26 22:54:04.646 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.04115193337202072
2021-08-26 22:54:04.648 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.499999076128006
2021-08-26 22:54:04.650 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.04115193337202072
2021-08-26 22:54:04.653 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:54:04.655 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2798410654067993
2021-08-26 22:54:04.657 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.15275676548480988
2021-08-26 22:54:04.659 | INFO     | src.policies:

2021-08-26 22:54:05.208 | INFO     | src.policies:train:116 - Epoch 538 / 800
2021-08-26 22:54:05.209 | INFO     | src.policies:collect_trajectories:213 - Episode 1833
2021-08-26 22:54:05.251 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:05.252 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 99.0
2021-08-26 22:54:05.252 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 99.0
2021-08-26 22:54:05.253 | INFO     | src.policies:collect_trajectories:213 - Episode 1834
2021-08-26 22:54:05.295 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:05.296 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 97.0
2021-08-26 22:54:05.297 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 98.0
2021-08-26 22:54:05.298 | INFO     | src.policies:collect_trajectories:213 - Episode 1835
2021-08-26 2

2021-08-26 22:54:05.808 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.040268074721097946
2021-08-26 22:54:05.810 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.21377675235271454
2021-08-26 22:54:05.812 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.040268074721097946
2021-08-26 22:54:05.814 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:54:05.816 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30528807640075684
2021-08-26 22:54:05.818 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.11166584491729736
2021-08-26 22:54:05.820 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.04049777612090111
2021-08-26 22:54:05.822 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.11166584491729736
2021-0

2021-08-26 22:54:06.308 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.04153243452310562
2021-08-26 22:54:06.310 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.15682123601436615
2021-08-26 22:54:06.312 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.04153243452310562
2021-08-26 22:54:06.315 | INFO     | src.policies:train:116 - Epoch 545 / 800
2021-08-26 22:54:06.316 | INFO     | src.policies:collect_trajectories:213 - Episode 1844
2021-08-26 22:54:06.399 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:06.401 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:06.402 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:54:06.405 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:06

2021-08-26 22:54:06.856 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.03887000307440758
2021-08-26 22:54:06.859 | INFO     | src.policies:train:116 - Epoch 548 / 800
2021-08-26 22:54:06.860 | INFO     | src.policies:collect_trajectories:213 - Episode 1850
2021-08-26 22:54:06.946 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:06.947 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:06.948 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:54:06.952 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:06.954 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2961767911911011
2021-08-26 22:54:06.957 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.21670538187026978
2021-08-26 22:54:06.958 | INFO     | src.policies:minibat

2021-08-26 22:54:07.559 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:07.559 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 123.0
2021-08-26 22:54:07.566 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:54:07.568 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.31002193689346313
2021-08-26 22:54:07.571 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.4450438916683197
2021-08-26 22:54:07.572 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03765593096613884
2021-08-26 22:54:07.575 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4450438916683197
2021-08-26 22:54:07.577 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.03765593096613884
2021-08-26 22:54:07.579 | INFO     | src.policies:train:152 -

2021-08-26 22:54:07.998 | INFO     | src.policies:train:116 - Epoch 555 / 800
2021-08-26 22:54:07.999 | INFO     | src.policies:collect_trajectories:213 - Episode 1860
2021-08-26 22:54:08.086 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:08.087 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:08.088 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:54:08.091 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:08.095 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3144221305847168
2021-08-26 22:54:08.097 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.5836208462715149
2021-08-26 22:54:08.099 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03803987428545952
2021-08-26 22:54:08.101 | INFO     | src.policies:minibatch_update:288 - 

2021-08-26 22:54:08.591 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.23342885076999664
2021-08-26 22:54:08.592 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.037684351205825806
2021-08-26 22:54:08.595 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.23342885076999664
2021-08-26 22:54:08.597 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.037684351205825806
2021-08-26 22:54:08.599 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:54:08.601 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2753326892852783
2021-08-26 22:54:08.603 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.29362377524375916
2021-08-26 22:54:08.605 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03891152888536453
2021-08-26 22:54:08.60

2021-08-26 22:54:09.091 | INFO     | src.policies:train:116 - Epoch 562 / 800
2021-08-26 22:54:09.092 | INFO     | src.policies:collect_trajectories:213 - Episode 1870
2021-08-26 22:54:09.107 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:09.108 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 34.0
2021-08-26 22:54:09.109 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 34.0
2021-08-26 22:54:09.110 | INFO     | src.policies:collect_trajectories:213 - Episode 1871
2021-08-26 22:54:09.140 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:09.141 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 67.0
2021-08-26 22:54:09.142 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 50.5
2021-08-26 22:54:09.143 | INFO     | src.policies:collect_trajectories:213 - Episode 1872
2021-08-26 2

2021-08-26 22:54:09.718 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1486855298280716
2021-08-26 22:54:09.720 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.037295326590538025
2021-08-26 22:54:09.722 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1486855298280716
2021-08-26 22:54:09.724 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.037295326590538025
2021-08-26 22:54:09.727 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:54:09.729 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30706989765167236
2021-08-26 22:54:09.732 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.10745086520910263
2021-08-26 22:54:09.734 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03685114532709122
2021-08-26 22:54:09.736

2021-08-26 22:54:10.316 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.32570159435272217
2021-08-26 22:54:10.317 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03608683496713638
2021-08-26 22:54:10.319 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.32570159435272217
2021-08-26 22:54:10.321 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.03608683496713638
2021-08-26 22:54:10.324 | INFO     | src.policies:train:116 - Epoch 568 / 800
2021-08-26 22:54:10.325 | INFO     | src.policies:collect_trajectories:213 - Episode 1883
2021-08-26 22:54:10.410 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:10.411 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:10.411 | INFO     | src.policies:collect_trajectories:230 - Last 100 

2021-08-26 22:54:10.842 | INFO     | src.policies:train:116 - Epoch 572 / 800
2021-08-26 22:54:10.843 | INFO     | src.policies:collect_trajectories:213 - Episode 1887
2021-08-26 22:54:10.890 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:10.891 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 116.0
2021-08-26 22:54:10.892 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 116.0
2021-08-26 22:54:10.893 | INFO     | src.policies:collect_trajectories:213 - Episode 1888
2021-08-26 22:54:10.973 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:10.974 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 191.0
2021-08-26 22:54:10.975 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 153.5
2021-08-26 22:54:10.980 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:54:10

2021-08-26 22:54:11.471 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:54:11.474 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30587059259414673
2021-08-26 22:54:11.476 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.11663305759429932
2021-08-26 22:54:11.477 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.0354250967502594
2021-08-26 22:54:11.479 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.11663305759429932
2021-08-26 22:54:11.481 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0354250967502594
2021-08-26 22:54:11.484 | INFO     | src.policies:train:116 - Epoch 576 / 800
2021-08-26 22:54:11.485 | INFO     | src.policies:collect_trajectories:213 - Episode 1892
2021-08-26 22:54:11.607 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08

2021-08-26 22:54:12.054 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:12.057 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2960748076438904
2021-08-26 22:54:12.059 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.22555674612522125
2021-08-26 22:54:12.061 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.035572078078985214
2021-08-26 22:54:12.100 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.22555674612522125
2021-08-26 22:54:12.102 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.035572078078985214
2021-08-26 22:54:12.105 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:54:12.107 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30285972356796265
2021-08-26 22:54:12.110 | INFO     | src.policies:minibatch_update:277 - Policy network L2 

2021-08-26 22:54:12.547 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4999992251396179
2021-08-26 22:54:12.550 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.033812228590250015
2021-08-26 22:54:12.552 | INFO     | src.policies:train:116 - Epoch 583 / 800
2021-08-26 22:54:12.553 | INFO     | src.policies:collect_trajectories:213 - Episode 1901
2021-08-26 22:54:12.676 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:12.677 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:12.678 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:54:12.682 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:12.684 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30396944284439087
2021-08-26 22:54:12.687 | INFO     | src.

2021-08-26 22:54:13.073 | INFO     | src.policies:train:116 - Epoch 586 / 800
2021-08-26 22:54:13.074 | INFO     | src.policies:collect_trajectories:213 - Episode 1907
2021-08-26 22:54:13.198 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:13.199 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:13.200 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:54:13.204 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:13.207 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30018752813339233
2021-08-26 22:54:13.209 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.5296239256858826
2021-08-26 22:54:13.210 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03473404049873352
2021-08-26 22:54:13.213 | INFO     | src.policies:minibatch_update:288 -

2021-08-26 22:54:13.803 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:54:13.805 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2904307246208191
2021-08-26 22:54:13.807 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.28333112597465515
2021-08-26 22:54:13.809 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03355945274233818
2021-08-26 22:54:13.811 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.28333112597465515
2021-08-26 22:54:13.813 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.03355945274233818
2021-08-26 22:54:13.816 | INFO     | src.policies:train:116 - Epoch 590 / 800
2021-08-26 22:54:13.817 | INFO     | src.policies:collect_trajectories:213 - Episode 1912
2021-08-26 22:54:13.890 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-0

2021-08-26 22:54:14.342 | INFO     | src.policies:train:116 - Epoch 593 / 800
2021-08-26 22:54:14.342 | INFO     | src.policies:collect_trajectories:213 - Episode 1917
2021-08-26 22:54:14.384 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:14.385 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 95.0
2021-08-26 22:54:14.385 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 95.0
2021-08-26 22:54:14.386 | INFO     | src.policies:collect_trajectories:213 - Episode 1918
2021-08-26 22:54:14.416 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:14.417 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 68.0
2021-08-26 22:54:14.417 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 81.5
2021-08-26 22:54:14.418 | INFO     | src.policies:collect_trajectories:213 - Episode 1919
2021-08-26 2

2021-08-26 22:54:14.901 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4242055118083954
2021-08-26 22:54:14.903 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.033642321825027466
2021-08-26 22:54:14.906 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:54:14.908 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2982507348060608
2021-08-26 22:54:14.910 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.09726490825414658
2021-08-26 22:54:14.912 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03324086219072342
2021-08-26 22:54:14.914 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.09726490825414658
2021-08-26 22:54:14.916 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.033240862190

2021-08-26 22:54:15.526 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1039590910077095
2021-08-26 22:54:15.528 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0329912006855011
2021-08-26 22:54:15.531 | INFO     | src.policies:train:116 - Epoch 599 / 800
2021-08-26 22:54:15.532 | INFO     | src.policies:collect_trajectories:213 - Episode 1929
2021-08-26 22:54:15.615 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:15.616 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:15.617 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:54:15.621 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:15.624 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2924343943595886
2021-08-26 22:54:15.627 | INFO     | src.pol

2021-08-26 22:54:16.156 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.21609248220920563
2021-08-26 22:54:16.158 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03329548239707947
2021-08-26 22:54:16.160 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.21609248220920563
2021-08-26 22:54:16.162 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.03329548239707947
2021-08-26 22:54:16.165 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:54:16.167 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.29590415954589844
2021-08-26 22:54:16.169 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.5767894387245178
2021-08-26 22:54:16.171 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.030912742018699646
2021-08-26 22:54:16.173

2021-08-26 22:54:16.666 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.0316665843129158
2021-08-26 22:54:16.668 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.11846891790628433
2021-08-26 22:54:16.670 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0316665843129158
2021-08-26 22:54:16.673 | INFO     | src.policies:train:116 - Epoch 606 / 800
2021-08-26 22:54:16.674 | INFO     | src.policies:collect_trajectories:213 - Episode 1941
2021-08-26 22:54:16.758 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:16.759 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:16.760 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:54:16.764 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:16.7

2021-08-26 22:54:17.280 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 164.0
2021-08-26 22:54:17.286 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:54:17.288 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2877456545829773
2021-08-26 22:54:17.291 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.5329164862632751
2021-08-26 22:54:17.292 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03187131881713867
2021-08-26 22:54:17.294 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.499999076128006
2021-08-26 22:54:17.296 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.03187131881713867
2021-08-26 22:54:17.299 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:54:17.301 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.

2021-08-26 22:54:17.804 | INFO     | src.policies:collect_trajectories:213 - Episode 1950
2021-08-26 22:54:17.883 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:17.885 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 191.0
2021-08-26 22:54:17.885 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 191.0
2021-08-26 22:54:17.886 | INFO     | src.policies:collect_trajectories:213 - Episode 1951
2021-08-26 22:54:17.975 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:17.976 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:17.977 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 195.5
2021-08-26 22:54:17.982 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:54:17.985 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.291019320

2021-08-26 22:54:18.372 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3624099791049957
2021-08-26 22:54:18.374 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.031126130372285843
2021-08-26 22:54:18.376 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.3624099791049957
2021-08-26 22:54:18.378 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.031126130372285843
2021-08-26 22:54:18.381 | INFO     | src.policies:train:116 - Epoch 616 / 800
2021-08-26 22:54:18.382 | INFO     | src.policies:collect_trajectories:213 - Episode 1956
2021-08-26 22:54:18.437 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:18.438 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 126.0
2021-08-26 22:54:18.438 | INFO     | src.policies:collect_trajectories:230 - Last 100 

2021-08-26 22:54:18.941 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.13945582509040833
2021-08-26 22:54:18.943 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.031016645953059196
2021-08-26 22:54:18.945 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.13945582509040833
2021-08-26 22:54:18.947 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.031016645953059196
2021-08-26 22:54:18.950 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:54:18.952 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30974119901657104
2021-08-26 22:54:18.954 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.19979700446128845
2021-08-26 22:54:18.956 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.030497480183839798
2021-08-26 22:54:18.

2021-08-26 22:54:19.465 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:19.466 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 134.0
2021-08-26 22:54:19.467 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 134.0
2021-08-26 22:54:19.468 | INFO     | src.policies:collect_trajectories:213 - Episode 1967
2021-08-26 22:54:19.554 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:19.556 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:19.556 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 167.0
2021-08-26 22:54:19.562 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:54:19.565 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.28883200883865356
2021-08-26 22:54:19.567 | INFO     | src.policies:minibatch_update:277 - Policy n

2021-08-26 22:54:20.018 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:54:20.021 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.29609405994415283
2021-08-26 22:54:20.023 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.5391917824745178
2021-08-26 22:54:20.025 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.030345143750309944
2021-08-26 22:54:20.027 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999916553497314
2021-08-26 22:54:20.029 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.030345143750309944
2021-08-26 22:54:20.032 | INFO     | src.policies:train:116 - Epoch 625 / 800
2021-08-26 22:54:20.032 | INFO     | src.policies:collect_trajectories:213 - Episode 1973
2021-08-26 22:54:20.117 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021

2021-08-26 22:54:20.579 | INFO     | src.policies:collect_trajectories:213 - Episode 1978
2021-08-26 22:54:20.599 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:20.600 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 41.0
2021-08-26 22:54:20.601 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 104.5
2021-08-26 22:54:20.606 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:20.609 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.31558936834335327
2021-08-26 22:54:20.611 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2944258451461792
2021-08-26 22:54:20.613 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.028780115768313408
2021-08-26 22:54:20.614 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2944258451461792
2021-08-26

2021-08-26 22:54:21.121 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:21.122 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:21.123 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 121.33333333333333
2021-08-26 22:54:21.129 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:54:21.132 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30287086963653564
2021-08-26 22:54:21.134 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2652326226234436
2021-08-26 22:54:21.136 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.0289254579693079
2021-08-26 22:54:21.138 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2652326226234436
2021-08-26 22:54:21.140 | INFO     | src.policies:minibatch_update:295 - Baseline networ

2021-08-26 22:54:21.545 | INFO     | src.policies:collect_trajectories:213 - Episode 1991
2021-08-26 22:54:21.576 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:21.577 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 73.0
2021-08-26 22:54:21.578 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 73.0
2021-08-26 22:54:21.579 | INFO     | src.policies:collect_trajectories:213 - Episode 1992
2021-08-26 22:54:21.602 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:21.603 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 48.0
2021-08-26 22:54:21.603 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 60.5
2021-08-26 22:54:21.604 | INFO     | src.policies:collect_trajectories:213 - Episode 1993
2021-08-26 22:54:21.773 | DEBUG    | src.policies:execute_episode:398 - Early stopping, al

2021-08-26 22:54:22.244 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 46.0
2021-08-26 22:54:22.244 | INFO     | src.policies:collect_trajectories:213 - Episode 1998
2021-08-26 22:54:22.331 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:22.333 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:22.333 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 123.0
2021-08-26 22:54:22.339 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:22.342 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3248080015182495
2021-08-26 22:54:22.344 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1944286972284317
2021-08-26 22:54:22.346 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.02806437760591507
2021-08-26 22:54:22.348 | INFO     | s

2021-08-26 22:54:22.822 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.17286761105060577
2021-08-26 22:54:22.824 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.029120774939656258
2021-08-26 22:54:22.826 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.17286761105060577
2021-08-26 22:54:22.828 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.029120774939656258
2021-08-26 22:54:22.831 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:54:22.833 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3070395588874817
2021-08-26 22:54:22.835 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3432284891605377
2021-08-26 22:54:22.837 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.02834097109735012
2021-08-26 22:54:22.839

2021-08-26 22:54:23.417 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:23.418 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:23.419 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 144.0
2021-08-26 22:54:23.424 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:23.427 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.32405781745910645
2021-08-26 22:54:23.429 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.23571448028087616
2021-08-26 22:54:23.431 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.027707539498806
2021-08-26 22:54:23.433 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.23571448028087616
2021-08-26 22:54:23.435 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradien

2021-08-26 22:54:24.016 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2934643626213074
2021-08-26 22:54:24.019 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.7667868137359619
2021-08-26 22:54:24.021 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.028410401195287704
2021-08-26 22:54:24.023 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999943375587463
2021-08-26 22:54:24.025 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.028410401195287704
2021-08-26 22:54:24.027 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:54:24.030 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3139236569404602
2021-08-26 22:54:24.032 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.21338844299316406
2021-08-26 22:54:24.034 | INFO     | src.polic

2021-08-26 22:54:24.505 | INFO     | src.policies:collect_trajectories:213 - Episode 2019
2021-08-26 22:54:24.593 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:24.594 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:24.595 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:54:24.600 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:24.602 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3111090660095215
2021-08-26 22:54:24.605 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.24664092063903809
2021-08-26 22:54:24.607 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.027243303135037422
2021-08-26 22:54:24.609 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.24664092063903809
2021-08-

2021-08-26 22:54:25.061 | INFO     | src.policies:collect_trajectories:213 - Episode 2024
2021-08-26 22:54:25.148 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:25.149 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:25.150 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:54:25.203 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:25.207 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2924400568008423
2021-08-26 22:54:25.210 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2895870804786682
2021-08-26 22:54:25.213 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.027694417163729668
2021-08-26 22:54:25.216 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2895870804786682
2021-08-26

2021-08-26 22:54:25.661 | INFO     | src.policies:collect_trajectories:213 - Episode 2029
2021-08-26 22:54:25.781 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:25.782 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:25.783 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:54:25.787 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:25.790 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.32596665620803833
2021-08-26 22:54:25.792 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3341117799282074
2021-08-26 22:54:25.794 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.026482393965125084
2021-08-26 22:54:25.796 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.3341117799282074
2021-08-2

2021-08-26 22:54:26.352 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 107.5
2021-08-26 22:54:26.358 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:26.360 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30819231271743774
2021-08-26 22:54:26.363 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.23651495575904846
2021-08-26 22:54:26.364 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.02756297029554844
2021-08-26 22:54:26.366 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.23651495575904846
2021-08-26 22:54:26.368 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.02756297029554844
2021-08-26 22:54:26.371 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:54:26.373 | INFO     | src.policies:minibatch_update:270 - Total loss:

2021-08-26 22:54:26.918 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:26.919 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:54:26.923 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:26.926 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3024836778640747
2021-08-26 22:54:26.928 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.4572785794734955
2021-08-26 22:54:26.930 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.02719397470355034
2021-08-26 22:54:26.932 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4572785794734955
2021-08-26 22:54:26.934 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.02719397470355034
2021-08-26 22:54:26.937 | INFO     | src.policies:train:152 - 

2021-08-26 22:54:27.400 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.29709339141845703
2021-08-26 22:54:27.402 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.025604991242289543
2021-08-26 22:54:27.405 | INFO     | src.policies:train:116 - Epoch 669 / 800
2021-08-26 22:54:27.406 | INFO     | src.policies:collect_trajectories:213 - Episode 2046
2021-08-26 22:54:27.489 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:27.490 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:27.491 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:54:27.495 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:27.498 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30464571714401245
2021-08-26 22:54:27.500 | INFO     | src

2021-08-26 22:54:28.083 | INFO     | src.policies:collect_trajectories:213 - Episode 2052
2021-08-26 22:54:28.094 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:28.095 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 22.0
2021-08-26 22:54:28.095 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 22.0
2021-08-26 22:54:28.096 | INFO     | src.policies:collect_trajectories:213 - Episode 2053
2021-08-26 22:54:28.107 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:28.108 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 21.0
2021-08-26 22:54:28.109 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 21.5
2021-08-26 22:54:28.110 | INFO     | src.policies:collect_trajectories:213 - Episode 2054
2021-08-26 22:54:28.191 | DEBUG    | src.policies:execute_episode:398 - Early stopping, al

2021-08-26 22:54:28.629 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 110.0
2021-08-26 22:54:28.630 | INFO     | src.policies:collect_trajectories:213 - Episode 2059
2021-08-26 22:54:28.716 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:28.717 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:28.718 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 155.0
2021-08-26 22:54:28.723 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:54:28.726 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.31246328353881836
2021-08-26 22:54:28.728 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.44651222229003906
2021-08-26 22:54:28.730 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.025134656578302383
2021-08-26 22:54:28.732 | INFO    

2021-08-26 22:54:29.062 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2562631666660309
2021-08-26 22:54:29.064 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0251831766217947
2021-08-26 22:54:29.067 | INFO     | src.policies:train:116 - Epoch 678 / 800
2021-08-26 22:54:29.068 | INFO     | src.policies:collect_trajectories:213 - Episode 2066
2021-08-26 22:54:29.195 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:29.196 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:29.197 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:54:29.200 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:29.203 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.29665130376815796
2021-08-26 22:54:29.205 | INFO     | src.po

2021-08-26 22:54:29.690 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.13385139405727386
2021-08-26 22:54:29.692 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.025768503546714783
2021-08-26 22:54:29.695 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:54:29.697 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3105490803718567
2021-08-26 22:54:29.699 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.08188678324222565
2021-08-26 22:54:29.701 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.024897854775190353
2021-08-26 22:54:29.703 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.08188678324222565
2021-08-26 22:54:29.705 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0248978547

2021-08-26 22:54:30.313 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999910593032837
2021-08-26 22:54:30.315 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.02481148950755596
2021-08-26 22:54:30.317 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:54:30.319 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30013608932495117
2021-08-26 22:54:30.322 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.23359812796115875
2021-08-26 22:54:30.323 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.025710055604577065
2021-08-26 22:54:30.325 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.23359812796115875
2021-08-26 22:54:30.327 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0257100556

2021-08-26 22:54:30.796 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.024717465043067932
2021-08-26 22:54:30.799 | INFO     | src.policies:train:116 - Epoch 689 / 800
2021-08-26 22:54:30.800 | INFO     | src.policies:collect_trajectories:213 - Episode 2081
2021-08-26 22:54:30.866 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:30.868 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 153.0
2021-08-26 22:54:30.869 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 153.0
2021-08-26 22:54:30.869 | INFO     | src.policies:collect_trajectories:213 - Episode 2082
2021-08-26 22:54:30.911 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:30.912 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 97.0
2021-08-26 22:54:30.913 | INFO     | src.policies:collect_trajectorie

2021-08-26 22:54:31.342 | INFO     | src.policies:collect_trajectories:213 - Episode 2087
2021-08-26 22:54:31.438 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:31.439 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:31.440 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 143.5
2021-08-26 22:54:31.445 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:31.448 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.323092520236969
2021-08-26 22:54:31.450 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.5928782820701599
2021-08-26 22:54:31.452 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.0238866675645113
2021-08-26 22:54:31.454 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999913573265076
2021-08-26 2

2021-08-26 22:54:31.945 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:54:31.947 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3148271441459656
2021-08-26 22:54:31.950 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.47830653190612793
2021-08-26 22:54:31.952 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.024396125227212906
2021-08-26 22:54:31.954 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.47830653190612793
2021-08-26 22:54:31.957 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.024396125227212906
2021-08-26 22:54:31.959 | INFO     | src.policies:train:116 - Epoch 696 / 800
2021-08-26 22:54:31.960 | INFO     | src.policies:collect_trajectories:213 - Episode 2092
2021-08-26 22:54:32.025 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021

2021-08-26 22:54:32.598 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.455705851316452
2021-08-26 22:54:32.600 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.02461831457912922
2021-08-26 22:54:32.602 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.455705851316452
2021-08-26 22:54:32.605 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.02461831457912922
2021-08-26 22:54:32.607 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:54:32.610 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30762237310409546
2021-08-26 22:54:32.612 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.12029845267534256
2021-08-26 22:54:32.613 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.024210406467318535
2021-08-26 22:54:32.615 | 

2021-08-26 22:54:33.101 | INFO     | src.policies:train:116 - Epoch 703 / 800
2021-08-26 22:54:33.103 | INFO     | src.policies:collect_trajectories:213 - Episode 2102
2021-08-26 22:54:33.123 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:33.124 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 38.0
2021-08-26 22:54:33.125 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 38.0
2021-08-26 22:54:33.126 | INFO     | src.policies:collect_trajectories:213 - Episode 2103
2021-08-26 22:54:33.212 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:33.213 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:33.214 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 119.0
2021-08-26 22:54:33.219 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:33.2

2021-08-26 22:54:33.699 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.08386650681495667
2021-08-26 22:54:33.702 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.02331782691180706
2021-08-26 22:54:33.704 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.08386650681495667
2021-08-26 22:54:33.707 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.02331782691180706
2021-08-26 22:54:33.710 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:54:33.712 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3003171682357788
2021-08-26 22:54:33.715 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.08270534127950668
2021-08-26 22:54:33.717 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.024408048018813133
2021-08-26 22:54:33.719

2021-08-26 22:54:34.368 | INFO     | src.policies:train:116 - Epoch 710 / 800
2021-08-26 22:54:34.369 | INFO     | src.policies:collect_trajectories:213 - Episode 2112
2021-08-26 22:54:34.450 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:34.451 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:34.452 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:54:34.455 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:34.458 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.29704827070236206
2021-08-26 22:54:34.460 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3215537667274475
2021-08-26 22:54:34.462 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.02391010895371437
2021-08-26 22:54:34.464 | INFO     | src.policies:minibatch_update:288 -

2021-08-26 22:54:34.923 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:54:34.925 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3017128109931946
2021-08-26 22:54:34.927 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2431601583957672
2021-08-26 22:54:34.928 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.023169752210378647
2021-08-26 22:54:34.930 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2431601583957672
2021-08-26 22:54:34.932 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.023169752210378647
2021-08-26 22:54:34.935 | INFO     | src.policies:train:116 - Epoch 714 / 800
2021-08-26 22:54:34.936 | INFO     | src.policies:collect_trajectories:213 - Episode 2117
2021-08-26 22:54:35.017 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-0

2021-08-26 22:54:35.494 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.13726572692394257
2021-08-26 22:54:35.495 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.0231423769146204
2021-08-26 22:54:35.497 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.13726572692394257
2021-08-26 22:54:35.499 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0231423769146204
2021-08-26 22:54:35.502 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:54:35.504 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.28959763050079346
2021-08-26 22:54:35.506 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.5930097103118896
2021-08-26 22:54:35.508 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.02323167957365513
2021-08-26 22:54:35.510 | 

2021-08-26 22:54:35.957 | INFO     | src.policies:collect_trajectories:213 - Episode 2127
2021-08-26 22:54:36.019 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:36.020 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 145.0
2021-08-26 22:54:36.021 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 126.0
2021-08-26 22:54:36.025 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:36.029 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3025578260421753
2021-08-26 22:54:36.031 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.4335062801837921
2021-08-26 22:54:36.033 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.022333955392241478
2021-08-26 22:54:36.035 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4335062801837921
2021-08-26

2021-08-26 22:54:36.538 | INFO     | src.policies:train:116 - Epoch 725 / 800
2021-08-26 22:54:36.539 | INFO     | src.policies:collect_trajectories:213 - Episode 2132
2021-08-26 22:54:36.625 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:36.626 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:36.626 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:54:36.630 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:36.633 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.29506927728652954
2021-08-26 22:54:36.635 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2678067088127136
2021-08-26 22:54:36.637 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.022966446354985237
2021-08-26 22:54:36.639 | INFO     | src.policies:minibatch_update:288 

2021-08-26 22:54:37.046 | INFO     | src.policies:train:116 - Epoch 729 / 800
2021-08-26 22:54:37.047 | INFO     | src.policies:collect_trajectories:213 - Episode 2137
2021-08-26 22:54:37.069 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:37.070 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 50.0
2021-08-26 22:54:37.070 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 50.0
2021-08-26 22:54:37.071 | INFO     | src.policies:collect_trajectories:213 - Episode 2138
2021-08-26 22:54:37.152 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:37.153 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:37.154 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 125.0
2021-08-26 22:54:37.159 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:37.1

2021-08-26 22:54:37.596 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.30244722962379456
2021-08-26 22:54:37.598 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.022560663521289825
2021-08-26 22:54:37.600 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.30244722962379456
2021-08-26 22:54:37.602 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.022560663521289825
2021-08-26 22:54:37.605 | INFO     | src.policies:train:116 - Epoch 733 / 800
2021-08-26 22:54:37.605 | INFO     | src.policies:collect_trajectories:213 - Episode 2143
2021-08-26 22:54:37.687 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:37.688 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:37.688 | INFO     | src.policies:collect_trajectories:230 - Last 10

2021-08-26 22:54:38.108 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.44542720913887024
2021-08-26 22:54:38.110 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.021527748554944992
2021-08-26 22:54:38.112 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.44542720913887024
2021-08-26 22:54:38.114 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.021527748554944992
2021-08-26 22:54:38.117 | INFO     | src.policies:train:116 - Epoch 737 / 800
2021-08-26 22:54:38.118 | INFO     | src.policies:collect_trajectories:213 - Episode 2148
2021-08-26 22:54:38.198 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:38.199 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:38.200 | INFO     | src.policies:collect_trajectories:230 - Last 10

2021-08-26 22:54:38.704 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.5206540822982788
2021-08-26 22:54:38.706 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.02129553258419037
2021-08-26 22:54:38.708 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4999990165233612
2021-08-26 22:54:38.710 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.02129553258419037
2021-08-26 22:54:38.713 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:54:38.715 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30550169944763184
2021-08-26 22:54:38.717 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.12103249132633209
2021-08-26 22:54:38.718 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.0224489476531744
2021-08-26 22:54:38.720 | 

2021-08-26 22:54:39.166 | INFO     | src.policies:train:116 - Epoch 744 / 800
2021-08-26 22:54:39.167 | INFO     | src.policies:collect_trajectories:213 - Episode 2157
2021-08-26 22:54:39.247 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:39.248 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:39.249 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:54:39.252 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:39.255 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30553561449050903
2021-08-26 22:54:39.257 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.7979013919830322
2021-08-26 22:54:39.259 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.021572912111878395
2021-08-26 22:54:39.261 | INFO     | src.policies:minibatch_update:288 

2021-08-26 22:54:39.713 | INFO     | src.policies:train:116 - Epoch 748 / 800
2021-08-26 22:54:39.714 | INFO     | src.policies:collect_trajectories:213 - Episode 2162
2021-08-26 22:54:39.795 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:39.796 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:39.797 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:54:39.800 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:39.803 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2960975170135498
2021-08-26 22:54:39.805 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.24237221479415894
2021-08-26 22:54:39.807 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.02182997390627861
2021-08-26 22:54:39.809 | INFO     | src.policies:minibatch_update:288 -

2021-08-26 22:54:40.341 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:54:40.343 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3007923364639282
2021-08-26 22:54:40.345 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3726906478404999
2021-08-26 22:54:40.347 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.02128283679485321
2021-08-26 22:54:40.349 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.3726906478404999
2021-08-26 22:54:40.351 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.02128283679485321
2021-08-26 22:54:40.353 | INFO     | src.policies:train:116 - Epoch 752 / 800
2021-08-26 22:54:40.354 | INFO     | src.policies:collect_trajectories:213 - Episode 2167
2021-08-26 22:54:40.516 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-

2021-08-26 22:54:40.912 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.020753685384988785
2021-08-26 22:54:40.915 | INFO     | src.policies:train:116 - Epoch 756 / 800
2021-08-26 22:54:40.916 | INFO     | src.policies:collect_trajectories:213 - Episode 2171
2021-08-26 22:54:40.969 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:40.970 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 136.0
2021-08-26 22:54:40.971 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 136.0
2021-08-26 22:54:40.972 | INFO     | src.policies:collect_trajectories:213 - Episode 2172
2021-08-26 22:54:41.055 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:41.056 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:41.056 | INFO     | src.policies:collect_trajectori

2021-08-26 22:54:41.452 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.020618753507733345
2021-08-26 22:54:41.454 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.08356388658285141
2021-08-26 22:54:41.456 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.020618753507733345
2021-08-26 22:54:41.458 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:54:41.460 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.29825782775878906
2021-08-26 22:54:41.464 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.5229067206382751
2021-08-26 22:54:41.467 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.020528456196188927
2021-08-26 22:54:41.469 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999910593032837
2021-0

2021-08-26 22:54:41.953 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.02025754190981388
2021-08-26 22:54:41.956 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:54:41.958 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3068805932998657
2021-08-26 22:54:41.960 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.225212961435318
2021-08-26 22:54:41.962 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.019781263545155525
2021-08-26 22:54:41.964 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.225212961435318
2021-08-26 22:54:41.966 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.019781263545155525
2021-08-26 22:54:41.969 | INFO     | src.policies:train:116 - Epoch 762 / 800
2021-08-26 22:54:41.970 | INFO     | src.policies:collect_t

2021-08-26 22:54:42.500 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.020589107647538185
2021-08-26 22:54:42.502 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:54:42.504 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.299655556678772
2021-08-26 22:54:42.584 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.476081907749176
2021-08-26 22:54:42.586 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.0202321819961071
2021-08-26 22:54:42.589 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.476081907749176
2021-08-26 22:54:42.591 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0202321819961071
2021-08-26 22:54:42.593 | INFO     | src.policies:train:116 - Epoch 765 / 800
2021-08-26 22:54:42.594 | INFO     | src.policies:collect_traje

2021-08-26 22:54:43.113 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:54:43.116 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:43.119 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3022756576538086
2021-08-26 22:54:43.122 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.6266167163848877
2021-08-26 22:54:43.123 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.020230161026120186
2021-08-26 22:54:43.125 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4999992549419403
2021-08-26 22:54:43.127 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.020230161026120186
2021-08-26 22:54:43.129 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:54:43.132 | INFO     | src.policies:minibatch_update:270 - Total loss: 

2021-08-26 22:54:43.620 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.25664231181144714
2021-08-26 22:54:43.622 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.020442796871066093
2021-08-26 22:54:43.624 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:54:43.626 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2989763617515564
2021-08-26 22:54:43.628 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.4566536247730255
2021-08-26 22:54:43.630 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.01979731395840645
2021-08-26 22:54:43.632 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4566536247730255
2021-08-26 22:54:43.634 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0197973139584

2021-08-26 22:54:44.141 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.26881909370422363
2021-08-26 22:54:44.142 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.02018764242529869
2021-08-26 22:54:44.145 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:54:44.147 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2914416193962097
2021-08-26 22:54:44.149 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.6668685078620911
2021-08-26 22:54:44.151 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.01981613039970398
2021-08-26 22:54:44.152 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999910593032837
2021-08-26 22:54:44.154 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0198161303997

2021-08-26 22:54:44.723 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2453111857175827
2021-08-26 22:54:44.725 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.01941939815878868
2021-08-26 22:54:44.727 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2453111857175827
2021-08-26 22:54:44.729 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.01941939815878868
2021-08-26 22:54:44.732 | INFO     | src.policies:train:116 - Epoch 781 / 800
2021-08-26 22:54:44.733 | INFO     | src.policies:collect_trajectories:213 - Episode 2207
2021-08-26 22:54:44.814 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:44.815 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:44.816 | INFO     | src.policies:collect_trajectories:230 - Last 100 ep

2021-08-26 22:54:45.213 | INFO     | src.policies:train:116 - Epoch 785 / 800
2021-08-26 22:54:45.214 | INFO     | src.policies:collect_trajectories:213 - Episode 2211
2021-08-26 22:54:45.299 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:45.300 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:45.301 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:54:45.305 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:45.307 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2803589701652527
2021-08-26 22:54:45.309 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.5508317351341248
2021-08-26 22:54:45.311 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.019933786243200302
2021-08-26 22:54:45.313 | INFO     | src.policies:minibatch_update:288 -

2021-08-26 22:54:45.829 | INFO     | src.policies:train:116 - Epoch 789 / 800
2021-08-26 22:54:45.830 | INFO     | src.policies:collect_trajectories:213 - Episode 2216
2021-08-26 22:54:45.890 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:45.891 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 142.0
2021-08-26 22:54:45.892 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 142.0
2021-08-26 22:54:45.892 | INFO     | src.policies:collect_trajectories:213 - Episode 2217
2021-08-26 22:54:45.972 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:45.973 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:45.974 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 171.0
2021-08-26 22:54:45.978 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:54:45

2021-08-26 22:54:46.398 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.09692373871803284
2021-08-26 22:54:46.400 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.018764425069093704
2021-08-26 22:54:46.402 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.09692373871803284
2021-08-26 22:54:46.404 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.018764425069093704
2021-08-26 22:54:46.406 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:54:46.409 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30169934034347534
2021-08-26 22:54:46.411 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.24301904439926147
2021-08-26 22:54:46.412 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.01956826075911522
2021-08-26 22:54:46.4

2021-08-26 22:54:47.015 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:47.018 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.27715641260147095
2021-08-26 22:54:47.021 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.11753910034894943
2021-08-26 22:54:47.022 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.019061055034399033
2021-08-26 22:54:47.024 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.11753910034894943
2021-08-26 22:54:47.026 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.019061055034399033
2021-08-26 22:54:47.028 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:54:47.031 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2850976586341858
2021-08-26 22:54:47.033 | INFO     | src.policies:minibatch_update:277 - Policy network L2 

2021-08-26 22:54:47.420 | INFO     | src.policies:train:116 - Epoch 798 / 800
2021-08-26 22:54:47.421 | INFO     | src.policies:collect_trajectories:213 - Episode 2234
2021-08-26 22:54:47.501 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:54:47.502 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:54:47.503 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:54:47.508 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:54:47.511 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3033214807510376
2021-08-26 22:54:47.513 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.21069003641605377
2021-08-26 22:54:47.515 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.01867402158677578
2021-08-26 22:54:47.516 | INFO     | src.policies:minibatch_update:288 -

| Metric | Value |
|---|---|
| `loss` | -0.61041 |
| `mean_return` | 200.0 |
| `_runtime` | 148.0 |
| `_timestamp` | 1630011287.0 |
| `_step` | 799.0 |


| Metric | History |
|---|---|
| `loss` | █▇▆▅▅▄▄▃▃▂▃▂▂▁▂▂▂▂▂▂▂▂▁▂▁▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂ |
| `mean_return` | ▁▁▁▂▁▂▅▄▂▇▅▄▄▃▂▄▅▅█▄██▇█▆████▅▇█▅▃▆███▅█ |
| `_runtime` | ▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇████ |
| `_timestamp` | ▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇████ |
| `_step` | ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
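
The log entries above report each network's L2 gradient norm before and after clipping. In the entries shown for this run the two values coincide, which is what clipping by global norm produces when the norm is already below the threshold. The snippet below is a minimal NumPy sketch of that operation, not the code in `src.policies`: the function name `clip_by_global_norm` and the threshold of `0.5` are assumptions made for illustration, and the rescaling rule follows the one used by PyTorch's `torch.nn.utils.clip_grad_norm_`.

```python
import numpy as np


def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm.

    Hypothetical helper: it mirrors the rescaling rule of
    torch.nn.utils.clip_grad_norm_, not the notebook's own implementation.
    """
    # Joint L2 norm over all parameter gradients, as if they were one flat vector.
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    # Scale factor is capped at 1, so gradients are only ever shrunk, never enlarged.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    clipped = [g * scale for g in grads]
    return clipped, total_norm


# A gradient below the (assumed) threshold is left unchanged.
small, small_norm = clip_by_global_norm([np.array([0.03, 0.04])], max_norm=0.5)

# A gradient above the threshold is rescaled to just under max_norm.
large, large_norm = clip_by_global_norm([np.array([3.0, 4.0])], max_norm=0.5)
```

Because of the small constant in the denominator, a clipped gradient ends up marginally below the threshold rather than exactly on it.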


## PPO

This section trains a Cartpole agent using our custom Proximal Policy Optimization (PPO) implementation.

In [78]:
alpha = 1.0
beta = 0.01
eps = 0.2
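
To make the role of these hyperparameters concrete, here is a minimal NumPy sketch of the standard PPO clipped surrogate loss. It is an illustration under stated assumptions, not the code of `policies.PPOPolicy`: the function name `ppo_loss` and its arguments are hypothetical, and we assume that `eps` is the clipping range of the probability ratio, `alpha` weights the baseline (value) loss and `beta` weights an entropy bonus.

```python
import numpy as np


def ppo_loss(new_log_probs, old_log_probs, advantages, values, returns,
             entropy, alpha=1.0, beta=0.01, eps=0.2):
    """Standard PPO clipped objective written as a loss to minimize.

    Assumed roles (not confirmed by the notebook): alpha scales the value
    loss, beta scales the entropy bonus, eps is the ratio clipping range.
    """
    # Probability ratio between the updated policy and the data-collecting policy.
    ratio = np.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic (element-wise minimum) surrogate, negated to obtain a loss.
    policy_loss = -np.mean(np.minimum(unclipped, clipped))
    # Mean squared error between baseline predictions and empirical returns.
    value_loss = np.mean((values - returns) ** 2)
    # Entropy is subtracted, so a higher-entropy policy lowers the loss.
    return policy_loss + alpha * value_loss - beta * np.mean(entropy)


# One sample whose ratio is 2 with a positive advantage: the ratio is clipped to 1 + eps.
example_loss = ppo_loss(
    new_log_probs=np.log([2.0]),
    old_log_probs=np.log([1.0]),
    advantages=np.array([1.0]),
    values=np.array([0.0]),
    returns=np.array([0.0]),
    entropy=np.array([0.0]),
)
```

With `eps = 0.2`, a sample with a positive advantage stops contributing additional gradient once its ratio exceeds 1.2, which limits how far a single update can move the policy away from the one that collected the data.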

In [79]:
ppo_policy_nn = models.MLP(observation_space_size, hidden_sizes, action_space_size)
ppo_baseline_nn = models.MLP(observation_space_size, hidden_sizes, 1, log_softmax=False)
ppo_policy = policies.PPOPolicy(env, ppo_policy_nn, ppo_baseline_nn, alpha=alpha, beta=beta, eps=eps)
ppo_policy.train(
    epochs,
    steps_per_epoch,
    minibatch_size,
    enable_wandb=True,
    wandb_config={**wandb_config, "group": "PPO"},
    episodes_mean_return=episodes_mean_return
)

[34m[1mwandb[0m: wandb version 0.12.1 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade


2021-08-26 22:55:12.903 | INFO     | src.policies:train:116 - Epoch 1 / 800
2021-08-26 22:55:12.904 | INFO     | src.policies:collect_trajectories:213 - Episode 1
2021-08-26 22:55:12.913 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:12.915 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 14.0
2021-08-26 22:55:12.916 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 14.0
2021-08-26 22:55:12.917 | INFO     | src.policies:collect_trajectories:213 - Episode 2
2021-08-26 22:55:12.935 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:12.936 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 26.0
2021-08-26 22:55:12.937 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.0
2021-08-26 22:55:12.938 | INFO     | src.policies:collect_trajectories:213 - Episode 3
2021-08-26 22:55:12.950

2021-08-26 22:55:13.184 | INFO     | src.policies:collect_trajectories:213 - Episode 17
2021-08-26 22:55:13.194 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:13.195 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 13.0
2021-08-26 22:55:13.196 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 14.5
2021-08-26 22:55:13.197 | INFO     | src.policies:collect_trajectories:213 - Episode 18
2021-08-26 22:55:13.205 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:13.206 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 11.0
2021-08-26 22:55:13.207 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 13.8
2021-08-26 22:55:13.208 | INFO     | src.policies:collect_trajectories:213 - Episode 19
2021-08-26 22:55:13.216 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agen

2021-08-26 22:55:13.501 | INFO     | src.policies:collect_trajectories:213 - Episode 33
2021-08-26 22:55:13.509 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:13.510 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 9.0
2021-08-26 22:55:13.511 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 14.571428571428571
2021-08-26 22:55:13.512 | INFO     | src.policies:collect_trajectories:213 - Episode 34
2021-08-26 22:55:13.521 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:13.522 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 14.0
2021-08-26 22:55:13.523 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 14.5
2021-08-26 22:55:13.524 | INFO     | src.policies:collect_trajectories:213 - Episode 35
2021-08-26 22:55:13.532 | DEBUG    | src.policies:execute_episode:398 - Early stopp

2021-08-26 22:55:13.733 | INFO     | src.policies:collect_trajectories:213 - Episode 49
2021-08-26 22:55:13.742 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:13.743 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 15.0
2021-08-26 22:55:13.744 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 15.7
2021-08-26 22:55:13.745 | INFO     | src.policies:collect_trajectories:213 - Episode 50
2021-08-26 22:55:13.757 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:13.758 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 19.0
2021-08-26 22:55:13.759 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 16.0
2021-08-26 22:55:13.759 | INFO     | src.policies:collect_trajectories:213 - Episode 51
2021-08-26 22:55:13.767 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agen

2021-08-26 22:55:14.057 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:55:14.059 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.010800574906170368
2021-08-26 22:55:14.062 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.17468653619289398
2021-08-26 22:55:14.064 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.7414193153381348
2021-08-26 22:55:14.067 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.17468653619289398
2021-08-26 22:55:14.069 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4999993145465851
2021-08-26 22:55:14.071 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:55:14.074 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.03334324061870575
2021-08-26 22:55:14.076 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gra

2021-08-26 22:55:14.278 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.0006263924296945333
2021-08-26 22:55:14.280 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4999992847442627
2021-08-26 22:55:14.283 | INFO     | src.policies:train:116 - Epoch 7 / 800
2021-08-26 22:55:14.284 | INFO     | src.policies:collect_trajectories:213 - Episode 78
2021-08-26 22:55:14.292 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:14.293 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 12.0
2021-08-26 22:55:14.293 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 12.0
2021-08-26 22:55:14.294 | INFO     | src.policies:collect_trajectories:213 - Episode 79
2021-08-26 22:55:14.308 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:14.309 | INFO     | sr

2021-08-26 22:55:14.540 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:14.541 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 11.0
2021-08-26 22:55:14.541 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 13.25
2021-08-26 22:55:14.542 | INFO     | src.policies:collect_trajectories:213 - Episode 94
2021-08-26 22:55:14.739 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:14.740 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 16.0
2021-08-26 22:55:14.741 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 13.8
2021-08-26 22:55:14.742 | INFO     | src.policies:collect_trajectories:213 - Episode 95
2021-08-26 22:55:14.749 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:14.749 | INFO     | src.policies:collect_trajectories:229 - Me

2021-08-26 22:55:15.055 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:15.056 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 13.0
2021-08-26 22:55:15.057 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 14.857142857142858
2021-08-26 22:55:15.057 | INFO     | src.policies:collect_trajectories:213 - Episode 110
2021-08-26 22:55:15.073 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:15.074 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 31.0
2021-08-26 22:55:15.075 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 16.875
2021-08-26 22:55:15.076 | INFO     | src.policies:collect_trajectories:213 - Episode 111
2021-08-26 22:55:15.084 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:15.085 | INFO     | src.policies:collect_traj

2021-08-26 22:55:15.330 | INFO     | src.policies:collect_trajectories:213 - Episode 125
2021-08-26 22:55:15.345 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:15.346 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 29.0
2021-08-26 22:55:15.347 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 17.5
2021-08-26 22:55:15.348 | INFO     | src.policies:collect_trajectories:213 - Episode 126
2021-08-26 22:55:15.356 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:15.357 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 13.0
2021-08-26 22:55:15.358 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 17.09090909090909
2021-08-26 22:55:15.359 | INFO     | src.policies:collect_trajectories:213 - Episode 127
2021-08-26 22:55:15.369 | DEBUG    | src.policies:execute_episode:398 - Early st

2021-08-26 22:55:15.554 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:55:15.556 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.015508119016885757
2021-08-26 22:55:15.559 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.07395588606595993
2021-08-26 22:55:15.561 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.6984230279922485
2021-08-26 22:55:15.562 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.07395588606595993
2021-08-26 22:55:15.564 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4999992549419403
2021-08-26 22:55:15.567 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:55:15.569 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.06168364733457565
2021-08-26 22:55:15.572 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gr

2021-08-26 22:55:15.763 | INFO     | src.policies:minibatch_update:270 - Total loss: 5.016196519136429e-05
2021-08-26 22:55:15.765 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.045356638729572296
2021-08-26 22:55:15.767 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.683060348033905
2021-08-26 22:55:15.769 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.045356638729572296
2021-08-26 22:55:15.771 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4999992847442627
2021-08-26 22:55:15.774 | INFO     | src.policies:train:116 - Epoch 13 / 800
2021-08-26 22:55:15.774 | INFO     | src.policies:collect_trajectories:213 - Episode 155
2021-08-26 22:55:15.782 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:15.783 | INFO     | src.policies:collect_trajectories:229 - Mean epis

2021-08-26 22:55:16.043 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:55:16.045 | INFO     | src.policies:minibatch_update:270 - Total loss: 0.007758242543786764
2021-08-26 22:55:16.047 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.09836028516292572
2021-08-26 22:55:16.049 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.6831592321395874
2021-08-26 22:55:16.051 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.09836028516292572
2021-08-26 22:55:16.053 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4999992549419403
2021-08-26 22:55:16.055 | INFO     | src.policies:train:116 - Epoch 14 / 800
2021-08-26 22:55:16.056 | INFO     | src.policies:collect_trajectories:213 - Episode 171
2021-08-26 22:55:16.063 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-2

2021-08-26 22:55:16.252 | INFO     | src.policies:collect_trajectories:213 - Episode 185
2021-08-26 22:55:16.260 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:16.261 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 12.0
2021-08-26 22:55:16.262 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 13.0
2021-08-26 22:55:16.262 | INFO     | src.policies:collect_trajectories:213 - Episode 186
2021-08-26 22:55:16.270 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:16.271 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 12.0
2021-08-26 22:55:16.271 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 12.75
2021-08-26 22:55:16.272 | INFO     | src.policies:collect_trajectories:213 - Episode 187
2021-08-26 22:55:16.280 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all 

2021-08-26 22:55:16.562 | INFO     | src.policies:collect_trajectories:213 - Episode 201
2021-08-26 22:55:16.575 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:16.576 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 25.0
2021-08-26 22:55:16.576 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 19.6
2021-08-26 22:55:16.577 | INFO     | src.policies:collect_trajectories:213 - Episode 202
2021-08-26 22:55:16.585 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:16.586 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 10.0
2021-08-26 22:55:16.586 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 18.0
2021-08-26 22:55:16.587 | INFO     | src.policies:collect_trajectories:213 - Episode 203
2021-08-26 22:55:16.594 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all a

2021-08-26 22:55:16.824 | INFO     | src.policies:collect_trajectories:213 - Episode 217
2021-08-26 22:55:16.832 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:16.833 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 16.0
2021-08-26 22:55:16.834 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 18.77777777777778
2021-08-26 22:55:16.835 | INFO     | src.policies:collect_trajectories:213 - Episode 218
2021-08-26 22:55:16.843 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:16.844 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 11.0
2021-08-26 22:55:16.844 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 18.0
2021-08-26 22:55:16.845 | INFO     | src.policies:collect_trajectories:213 - Episode 219
2021-08-26 22:55:16.854 | DEBUG    | src.policies:execute_episode:398 - Early st

2021-08-26 22:55:17.264 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:55:17.267 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.034528106451034546
2021-08-26 22:55:17.269 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.19259928166866302
2021-08-26 22:55:17.270 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.6662390828132629
2021-08-26 22:55:17.272 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.19259928166866302
2021-08-26 22:55:17.274 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4999992251396179
2021-08-26 22:55:17.277 | INFO     | src.policies:train:116 - Epoch 19 / 800
2021-08-26 22:55:17.278 | INFO     | src.policies:collect_trajectories:213 - Episode 231
2021-08-26 22:55:17.288 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-

2021-08-26 22:55:17.482 | INFO     | src.policies:collect_trajectories:213 - Episode 245
2021-08-26 22:55:17.490 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:17.491 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 14.0
2021-08-26 22:55:17.492 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 29.5
2021-08-26 22:55:17.493 | INFO     | src.policies:collect_trajectories:213 - Episode 246
2021-08-26 22:55:17.501 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:17.502 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 15.0
2021-08-26 22:55:17.503 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 24.666666666666668
2021-08-26 22:55:17.503 | INFO     | src.policies:collect_trajectories:213 - Episode 247
2021-08-26 22:55:17.515 | DEBUG    | src.policies:execute_episode:398 - Early s

2021-08-26 22:55:17.851 | INFO     | src.policies:collect_trajectories:213 - Episode 261
2021-08-26 22:55:17.859 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:17.860 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 15.0
2021-08-26 22:55:17.861 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 17.0
2021-08-26 22:55:17.862 | INFO     | src.policies:collect_trajectories:213 - Episode 262
2021-08-26 22:55:17.871 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:17.872 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 17.0
2021-08-26 22:55:17.872 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 17.0
2021-08-26 22:55:17.873 | INFO     | src.policies:collect_trajectories:213 - Episode 263
2021-08-26 22:55:17.881 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all a

2021-08-26 22:55:18.070 | INFO     | src.policies:collect_trajectories:213 - Episode 277
2021-08-26 22:55:18.083 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:18.084 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 24.0
2021-08-26 22:55:18.084 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 17.0
2021-08-26 22:55:18.085 | INFO     | src.policies:collect_trajectories:213 - Episode 278
2021-08-26 22:55:18.092 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:18.093 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 9.0
2021-08-26 22:55:18.093 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 16.272727272727273
2021-08-26 22:55:18.094 | INFO     | src.policies:collect_trajectories:213 - Episode 279
2021-08-26 22:55:18.106 | DEBUG    | src.policies:execute_episode:398 - Early st

2021-08-26 22:55:18.283 | INFO     | src.policies:collect_trajectories:213 - Episode 293
2021-08-26 22:55:18.289 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:18.290 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 9.0
2021-08-26 22:55:18.290 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 14.714285714285714
2021-08-26 22:55:18.296 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:55:18.298 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.03778481483459473
2021-08-26 22:55:18.301 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.14273551106452942
2021-08-26 22:55:18.302 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.6392008066177368
2021-08-26 22:55:18.304 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.14273551106452942

2021-08-26 22:55:18.574 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.11383811384439468
2021-08-26 22:55:18.576 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.49999916553497314
2021-08-26 22:55:18.578 | INFO     | src.policies:train:116 - Epoch 25 / 800
2021-08-26 22:55:18.579 | INFO     | src.policies:collect_trajectories:213 - Episode 306
2021-08-26 22:55:18.586 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:18.587 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 10.0
2021-08-26 22:55:18.587 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 10.0
2021-08-26 22:55:18.589 | INFO     | src.policies:collect_trajectories:213 - Episode 307
2021-08-26 22:55:18.603 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:18.605 | INFO     | 

2021-08-26 22:55:18.813 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:18.814 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 10.0
2021-08-26 22:55:18.814 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 12.6
2021-08-26 22:55:18.815 | INFO     | src.policies:collect_trajectories:213 - Episode 322
2021-08-26 22:55:18.829 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:18.830 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 28.0
2021-08-26 22:55:18.831 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 15.166666666666666
2021-08-26 22:55:18.832 | INFO     | src.policies:collect_trajectories:213 - Episode 323
2021-08-26 22:55:18.841 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:18.841 | INFO     | src.policies:collect_trajec

2021-08-26 22:55:19.090 | INFO     | src.policies:collect_trajectories:213 - Episode 337
2021-08-26 22:55:19.097 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:19.098 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 11.0
2021-08-26 22:55:19.098 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 16.666666666666668
2021-08-26 22:55:19.099 | INFO     | src.policies:collect_trajectories:213 - Episode 338
2021-08-26 22:55:19.111 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:19.112 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 22.0
2021-08-26 22:55:19.113 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 17.428571428571427
2021-08-26 22:55:19.114 | INFO     | src.policies:collect_trajectories:213 - Episode 339
2021-08-26 22:55:19.129 | DEBUG    | src.policies:execute_episode

2021-08-26 22:55:19.457 | INFO     | src.policies:collect_trajectories:213 - Episode 353
2021-08-26 22:55:19.468 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:19.469 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 19.0
2021-08-26 22:55:19.469 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 22.88888888888889
2021-08-26 22:55:19.475 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:55:19.477 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.06901927292346954
2021-08-26 22:55:19.479 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.14044715464115143
2021-08-26 22:55:19.481 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.6135467886924744
2021-08-26 22:55:19.482 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.14044715464115143

2021-08-26 22:55:19.721 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.5786499381065369
2021-08-26 22:55:19.723 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.036995116621255875
2021-08-26 22:55:19.725 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.49999913573265076
2021-08-26 22:55:19.727 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:55:19.730 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.0462387390434742
2021-08-26 22:55:19.732 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.15504737198352814
2021-08-26 22:55:19.734 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.6054503917694092
2021-08-26 22:55:19.735 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.15504737198352814
2021-08-26

2021-08-26 22:55:19.952 | INFO     | src.policies:collect_trajectories:213 - Episode 381
2021-08-26 22:55:19.960 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:19.961 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 13.0
2021-08-26 22:55:19.962 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 15.8
2021-08-26 22:55:19.963 | INFO     | src.policies:collect_trajectories:213 - Episode 382
2021-08-26 22:55:19.971 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:19.972 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 13.0
2021-08-26 22:55:19.972 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 15.333333333333334
2021-08-26 22:55:19.973 | INFO     | src.policies:collect_trajectories:213 - Episode 383
2021-08-26 22:55:19.984 | DEBUG    | src.policies:execute_episode:398 - Early s

2021-08-26 22:55:20.259 | INFO     | src.policies:collect_trajectories:213 - Episode 397
2021-08-26 22:55:20.268 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:20.269 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 14.0
2021-08-26 22:55:20.269 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.375
2021-08-26 22:55:20.270 | INFO     | src.policies:collect_trajectories:213 - Episode 398
2021-08-26 22:55:20.283 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:20.284 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 24.0
2021-08-26 22:55:20.284 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.77777777777778
2021-08-26 22:55:20.285 | INFO     | src.policies:collect_trajectories:213 - Episode 399
2021-08-26 22:55:20.292 | DEBUG    | src.policies:execute_episode:398 - Early 

2021-08-26 22:55:20.520 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:55:20.522 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.09183013439178467
2021-08-26 22:55:20.525 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.040860097855329514
2021-08-26 22:55:20.527 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.5855798125267029
2021-08-26 22:55:20.528 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.040860097855329514
2021-08-26 22:55:20.530 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.49999916553497314
2021-08-26 22:55:20.533 | INFO     | src.policies:train:116 - Epoch 34 / 800
2021-08-26 22:55:20.534 | INFO     | src.policies:collect_trajectories:213 - Episode 411
2021-08-26 22:55:20.542 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-0

2021-08-26 22:55:20.895 | INFO     | src.policies:collect_trajectories:213 - Episode 425
2021-08-26 22:55:20.903 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:20.904 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 13.0
2021-08-26 22:55:20.905 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 13.0
2021-08-26 22:55:20.905 | INFO     | src.policies:collect_trajectories:213 - Episode 426
2021-08-26 22:55:20.913 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:20.914 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 12.0
2021-08-26 22:55:20.915 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 12.5
2021-08-26 22:55:20.916 | INFO     | src.policies:collect_trajectories:213 - Episode 427
2021-08-26 22:55:20.924 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all a

2021-08-26 22:55:21.122 | INFO     | src.policies:collect_trajectories:213 - Episode 441
2021-08-26 22:55:21.132 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:21.133 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 18.0
2021-08-26 22:55:21.134 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 13.75
2021-08-26 22:55:21.134 | INFO     | src.policies:collect_trajectories:213 - Episode 442
2021-08-26 22:55:21.149 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:21.150 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 27.0
2021-08-26 22:55:21.151 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 16.4
2021-08-26 22:55:21.151 | INFO     | src.policies:collect_trajectories:213 - Episode 443
2021-08-26 22:55:21.160 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all 

2021-08-26 22:55:21.534 | INFO     | src.policies:collect_trajectories:213 - Episode 457
2021-08-26 22:55:21.555 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:21.556 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 43.0
2021-08-26 22:55:21.557 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 19.0
2021-08-26 22:55:21.557 | INFO     | src.policies:collect_trajectories:213 - Episode 458
2021-08-26 22:55:21.568 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:21.569 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 17.0
2021-08-26 22:55:21.570 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 18.75
2021-08-26 22:55:21.571 | INFO     | src.policies:collect_trajectories:213 - Episode 459
2021-08-26 22:55:21.580 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all 

2021-08-26 22:55:21.793 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:55:21.795 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.09737798571586609
2021-08-26 22:55:21.798 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.05243741720914841
2021-08-26 22:55:21.799 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.5562484264373779
2021-08-26 22:55:21.801 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.05243741720914841
2021-08-26 22:55:21.804 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.49999910593032837
2021-08-26 22:55:21.806 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:55:21.808 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.09640289843082428
2021-08-26 22:55:21.811 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gr

2021-08-26 22:55:22.077 | INFO     | src.policies:collect_trajectories:213 - Episode 485
2021-08-26 22:55:22.084 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:22.085 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 11.0
2021-08-26 22:55:22.086 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 11.5
2021-08-26 22:55:22.087 | INFO     | src.policies:collect_trajectories:213 - Episode 486
2021-08-26 22:55:22.097 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:22.098 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 18.0
2021-08-26 22:55:22.099 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 13.666666666666666
2021-08-26 22:55:22.100 | INFO     | src.policies:collect_trajectories:213 - Episode 487
2021-08-26 22:55:22.109 | DEBUG    | src.policies:execute_episode:398 - Early s

2021-08-26 22:55:22.329 | INFO     | src.policies:collect_trajectories:213 - Episode 501
2021-08-26 22:55:22.343 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:22.344 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 25.0
2021-08-26 22:55:22.344 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 25.0
2021-08-26 22:55:22.345 | INFO     | src.policies:collect_trajectories:213 - Episode 502
2021-08-26 22:55:22.353 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:22.354 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 12.0
2021-08-26 22:55:22.355 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 23.142857142857142
2021-08-26 22:55:22.355 | INFO     | src.policies:collect_trajectories:213 - Episode 503
2021-08-26 22:55:22.364 | DEBUG    | src.policies:execute_episode:398 - Early s

2021-08-26 22:55:22.656 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.499999076128006
2021-08-26 22:55:22.659 | INFO     | src.policies:train:116 - Epoch 43 / 800
2021-08-26 22:55:22.660 | INFO     | src.policies:collect_trajectories:213 - Episode 514
2021-08-26 22:55:22.667 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:22.668 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 12.0
2021-08-26 22:55:22.668 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 12.0
2021-08-26 22:55:22.669 | INFO     | src.policies:collect_trajectories:213 - Episode 515
2021-08-26 22:55:22.678 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:22.678 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 15.0
2021-08-26 22:55:22.679 | INFO     | src.policies:collect_trajectories:230 - 

2021-08-26 22:55:22.883 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 21.0
2021-08-26 22:55:22.884 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 16.5
2021-08-26 22:55:22.885 | INFO     | src.policies:collect_trajectories:213 - Episode 530
2021-08-26 22:55:22.896 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:22.897 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 18.0
2021-08-26 22:55:22.898 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 17.0
2021-08-26 22:55:22.898 | INFO     | src.policies:collect_trajectories:213 - Episode 531
2021-08-26 22:55:22.910 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:22.911 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 19.0
2021-08-26 22:55:22.912 | INFO     | src.policies:collect_trajectories:230 - La

2021-08-26 22:55:23.096 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 12.0
2021-08-26 22:55:23.097 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 14.5
2021-08-26 22:55:23.098 | INFO     | src.policies:collect_trajectories:213 - Episode 546
2021-08-26 22:55:23.105 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:23.106 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 11.0
2021-08-26 22:55:23.106 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 13.8
2021-08-26 22:55:23.107 | INFO     | src.policies:collect_trajectories:213 - Episode 547
2021-08-26 22:55:23.121 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:23.121 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 25.0
2021-08-26 22:55:23.122 | INFO     | src.policies:collect_trajectories:230 - La

2021-08-26 22:55:23.414 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 30.0
2021-08-26 22:55:23.415 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 17.142857142857142
2021-08-26 22:55:23.415 | INFO     | src.policies:collect_trajectories:213 - Episode 562
2021-08-26 22:55:23.424 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:23.553 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 13.0
2021-08-26 22:55:23.554 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 16.625
2021-08-26 22:55:23.555 | INFO     | src.policies:collect_trajectories:213 - Episode 563
2021-08-26 22:55:23.567 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:23.567 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 19.0
2021-08-26 22:55:23.568 | INFO     | src.policies:collect_traje

2021-08-26 22:55:23.927 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 25.0
2021-08-26 22:55:23.928 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 19.454545454545453
2021-08-26 22:55:23.934 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:55:23.937 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.10389266908168793
2021-08-26 22:55:23.939 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.15450350940227509
2021-08-26 22:55:23.941 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.5040988326072693
2021-08-26 22:55:23.943 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.15450350940227509
2021-08-26 22:55:23.945 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.49999895691871643
2021-08-26 22:55:23.947 | INFO     | src.policie

2021-08-26 22:55:24.159 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1290827989578247
2021-08-26 22:55:24.160 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.4994846284389496
2021-08-26 22:55:24.162 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1290827989578247
2021-08-26 22:55:24.164 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4994846284389496
2021-08-26 22:55:24.167 | INFO     | src.policies:train:116 - Epoch 49 / 800
2021-08-26 22:55:24.168 | INFO     | src.policies:collect_trajectories:213 - Episode 591
2021-08-26 22:55:24.174 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:24.175 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 10.0
2021-08-26 22:55:24.176 | INFO     | src.policies:collect_trajectories:230 - Last 100 episode

2021-08-26 22:55:24.479 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 10.0
2021-08-26 22:55:24.480 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 15.5
2021-08-26 22:55:24.480 | INFO     | src.policies:collect_trajectories:213 - Episode 606
2021-08-26 22:55:24.489 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:24.490 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 12.0
2021-08-26 22:55:24.491 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 14.8
2021-08-26 22:55:24.491 | INFO     | src.policies:collect_trajectories:213 - Episode 607
2021-08-26 22:55:24.502 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:24.503 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 10.0
2021-08-26 22:55:24.505 | INFO     | src.policies:collect_trajectories:230 - La

2021-08-26 22:55:24.701 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 13.0
2021-08-26 22:55:24.702 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 13.2
2021-08-26 22:55:24.703 | INFO     | src.policies:collect_trajectories:213 - Episode 622
2021-08-26 22:55:24.711 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:24.712 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 11.0
2021-08-26 22:55:24.712 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 12.833333333333334
2021-08-26 22:55:24.713 | INFO     | src.policies:collect_trajectories:213 - Episode 623
2021-08-26 22:55:24.722 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:24.723 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 13.0
2021-08-26 22:55:24.724 | INFO     | src.policies:collect_traject

2021-08-26 22:55:25.052 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 19.0
2021-08-26 22:55:25.053 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.0
2021-08-26 22:55:25.054 | INFO     | src.policies:collect_trajectories:213 - Episode 638
2021-08-26 22:55:25.068 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:25.070 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 24.0
2021-08-26 22:55:25.070 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.444444444444443
2021-08-26 22:55:25.071 | INFO     | src.policies:collect_trajectories:213 - Episode 639
2021-08-26 22:55:25.087 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:25.089 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 18.0
2021-08-26 22:55:25.090 | INFO     | src.policies:collect_traject

2021-08-26 22:55:25.315 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4386177659034729
2021-08-26 22:55:25.318 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:55:25.321 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.16807502508163452
2021-08-26 22:55:25.324 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.05872773006558418
2021-08-26 22:55:25.325 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.4692980647087097
2021-08-26 22:55:25.327 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.05872773006558418
2021-08-26 22:55:25.329 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4692980647087097
2021-08-26 22:55:25.333 | INFO     | src.policies:train:116 - Epoch 54 / 800
2021-08-26 22:55:25.334 | INFO     | src.policies:collect_tr

2021-08-26 22:55:25.772 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 15.8
2021-08-26 22:55:25.773 | INFO     | src.policies:collect_trajectories:213 - Episode 666
2021-08-26 22:55:25.782 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:25.782 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 14.0
2021-08-26 22:55:25.783 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 15.5
2021-08-26 22:55:25.784 | INFO     | src.policies:collect_trajectories:213 - Episode 667
2021-08-26 22:55:25.800 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:25.801 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 28.0
2021-08-26 22:55:25.802 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 17.285714285714285
2021-08-26 22:55:25.803 | INFO     | src.policies:colle

2021-08-26 22:55:26.048 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.125
2021-08-26 22:55:26.049 | INFO     | src.policies:collect_trajectories:213 - Episode 682
2021-08-26 22:55:26.059 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:26.060 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 17.0
2021-08-26 22:55:26.061 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 19.77777777777778
2021-08-26 22:55:26.062 | INFO     | src.policies:collect_trajectories:213 - Episode 683
2021-08-26 22:55:26.070 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:26.070 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 11.0
2021-08-26 22:55:26.071 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 18.9
2021-08-26 22:55:26.072 | INFO     | src.policies:coll

2021-08-26 22:55:26.376 | INFO     | src.policies:collect_trajectories:213 - Episode 694
2021-08-26 22:55:26.398 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:26.400 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 27.0
2021-08-26 22:55:26.401 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 27.0
2021-08-26 22:55:26.403 | INFO     | src.policies:collect_trajectories:213 - Episode 695
2021-08-26 22:55:26.415 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:26.416 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 14.0
2021-08-26 22:55:26.416 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.5
2021-08-26 22:55:26.418 | INFO     | src.policies:collect_trajectories:213 - Episode 696
2021-08-26 22:55:26.426 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all a

2021-08-26 22:55:26.670 | INFO     | src.policies:collect_trajectories:213 - Episode 710
2021-08-26 22:55:26.678 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:26.680 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 11.0
2021-08-26 22:55:26.680 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 16.5
2021-08-26 22:55:26.681 | INFO     | src.policies:collect_trajectories:213 - Episode 711
2021-08-26 22:55:26.695 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:26.697 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 19.0
2021-08-26 22:55:26.697 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 17.0
2021-08-26 22:55:26.698 | INFO     | src.policies:collect_trajectories:213 - Episode 712
2021-08-26 22:55:26.712 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all a

2021-08-26 22:55:27.072 | INFO     | src.policies:collect_trajectories:213 - Episode 726
2021-08-26 22:55:27.083 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:27.084 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 18.0
2021-08-26 22:55:27.085 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.77777777777778
2021-08-26 22:55:27.086 | INFO     | src.policies:collect_trajectories:213 - Episode 727
2021-08-26 22:55:27.098 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:27.099 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 15.0
2021-08-26 22:55:27.101 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.2
2021-08-26 22:55:27.109 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:55:27.112 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.11

2021-08-26 22:55:27.528 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.13737471401691437
2021-08-26 22:55:27.531 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.4272163212299347
2021-08-26 22:55:27.534 | INFO     | src.policies:train:116 - Epoch 62 / 800
2021-08-26 22:55:27.535 | INFO     | src.policies:collect_trajectories:213 - Episode 739
2021-08-26 22:55:27.558 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:27.559 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 35.0
2021-08-26 22:55:27.560 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 35.0
2021-08-26 22:55:27.561 | INFO     | src.policies:collect_trajectories:213 - Episode 740
2021-08-26 22:55:27.572 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:27.573 | INFO     | s

2021-08-26 22:55:27.878 | INFO     | src.policies:collect_trajectories:213 - Episode 754
2021-08-26 22:55:27.891 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:27.892 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 17.0
2021-08-26 22:55:27.892 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 12.0
2021-08-26 22:55:27.893 | INFO     | src.policies:collect_trajectories:213 - Episode 755
2021-08-26 22:55:27.903 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:27.904 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 13.0
2021-08-26 22:55:27.904 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 12.166666666666666
2021-08-26 22:55:27.905 | INFO     | src.policies:collect_trajectories:213 - Episode 756
2021-08-26 22:55:27.927 | DEBUG    | src.policies:execute_episode:398 - Early s

2021-08-26 22:55:28.292 | INFO     | src.policies:collect_trajectories:213 - Episode 770
2021-08-26 22:55:28.320 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:28.321 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 50.0
2021-08-26 22:55:28.322 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 21.3
2021-08-26 22:55:28.329 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:55:28.332 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.18038812279701233
2021-08-26 22:55:28.335 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.09969808161258698
2021-08-26 22:55:28.337 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.42527779936790466
2021-08-26 22:55:28.339 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.09969808161258698

2021-08-26 22:55:28.569 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:55:28.572 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.15261580049991608
2021-08-26 22:55:28.575 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.04702445864677429
2021-08-26 22:55:28.577 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.40982910990715027
2021-08-26 22:55:28.579 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.04702445864677429
2021-08-26 22:55:28.581 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.40982910990715027
2021-08-26 22:55:28.584 | INFO     | src.policies:train:116 - Epoch 66 / 800
2021-08-26 22:55:28.585 | INFO     | src.policies:collect_trajectories:213 - Episode 784
2021-08-26 22:55:28.596 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:55:28.946 | INFO     | src.policies:collect_trajectories:213 - Episode 798
2021-08-26 22:55:28.958 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:28.960 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 13.0
2021-08-26 22:55:28.960 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.6
2021-08-26 22:55:28.962 | INFO     | src.policies:collect_trajectories:213 - Episode 799
2021-08-26 22:55:28.976 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:28.977 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 21.0
2021-08-26 22:55:28.978 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.666666666666668
2021-08-26 22:55:28.978 | INFO     | src.policies:collect_trajectories:213 - Episode 800
2021-08-26 22:55:28.989 | DEBUG    | src.policies:execute_episode:398 - Early s

2021-08-26 22:55:29.301 | INFO     | src.policies:collect_trajectories:213 - Episode 814
2021-08-26 22:55:29.321 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:29.322 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 32.0
2021-08-26 22:55:29.324 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 23.125
2021-08-26 22:55:29.325 | INFO     | src.policies:collect_trajectories:213 - Episode 815
2021-08-26 22:55:29.336 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:29.337 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 16.0
2021-08-26 22:55:29.338 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 22.333333333333332
2021-08-26 22:55:29.346 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:55:29.348 | INFO     | src.policies:minibatch_update:270 - Total loss: -0

2021-08-26 22:55:29.573 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:55:29.576 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.1719375103712082
2021-08-26 22:55:29.579 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.05417793616652489
2021-08-26 22:55:29.581 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.400288462638855
2021-08-26 22:55:29.583 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.05417793616652489
2021-08-26 22:55:29.585 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.400288462638855
2021-08-26 22:55:29.589 | INFO     | src.policies:train:116 - Epoch 70 / 800
2021-08-26 22:55:29.590 | INFO     | src.policies:collect_trajectories:213 - Episode 828
2021-08-26 22:55:29.604 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:55:30.019 | INFO     | src.policies:collect_trajectories:213 - Episode 842
2021-08-26 22:55:30.028 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:30.029 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 13.0
2021-08-26 22:55:30.030 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 19.0
2021-08-26 22:55:30.031 | INFO     | src.policies:collect_trajectories:213 - Episode 843
2021-08-26 22:55:30.040 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:30.041 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 13.0
2021-08-26 22:55:30.042 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 17.8
2021-08-26 22:55:30.043 | INFO     | src.policies:collect_trajectories:213 - Episode 844
2021-08-26 22:55:30.051 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all a

2021-08-26 22:55:30.300 | INFO     | src.policies:collect_trajectories:213 - Episode 858
2021-08-26 22:55:30.310 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:30.311 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 14.0
2021-08-26 22:55:30.312 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 22.5
2021-08-26 22:55:30.313 | INFO     | src.policies:collect_trajectories:213 - Episode 859
2021-08-26 22:55:30.326 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:30.327 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 19.0
2021-08-26 22:55:30.328 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 22.11111111111111
2021-08-26 22:55:30.328 | INFO     | src.policies:collect_trajectories:213 - Episode 860
2021-08-26 22:55:30.340 | DEBUG    | src.policies:execute_episode:398 - Early st

2021-08-26 22:55:30.642 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.38431742787361145
2021-08-26 22:55:30.644 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.037653662264347076
2021-08-26 22:55:30.647 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.38431742787361145
2021-08-26 22:55:30.649 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:55:30.652 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.16378046572208405
2021-08-26 22:55:30.655 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.11567526310682297
2021-08-26 22:55:30.657 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.38997024297714233
2021-08-26 22:55:30.660 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.11567526310682297

2021-08-26 22:55:30.929 | INFO     | src.policies:collect_trajectories:213 - Episode 886
2021-08-26 22:55:30.974 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:30.975 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 76.0
2021-08-26 22:55:30.976 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 47.5
2021-08-26 22:55:30.977 | INFO     | src.policies:collect_trajectories:213 - Episode 887
2021-08-26 22:55:30.986 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:30.987 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 10.0
2021-08-26 22:55:30.988 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 35.0
2021-08-26 22:55:30.988 | INFO     | src.policies:collect_trajectories:213 - Episode 888
2021-08-26 22:55:31.160 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all a

2021-08-26 22:55:31.446 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:55:31.449 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.19565880298614502
2021-08-26 22:55:31.452 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.054347746074199677
2021-08-26 22:55:31.454 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.3895874321460724
2021-08-26 22:55:31.456 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.054347746074199677
2021-08-26 22:55:31.459 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.3895874321460724
2021-08-26 22:55:31.463 | INFO     | src.policies:train:116 - Epoch 77 / 800
2021-08-26 22:55:31.464 | INFO     | src.policies:collect_trajectories:213 - Episode 900
2021-08-26 22:55:31.480 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:55:31.785 | INFO     | src.policies:collect_trajectories:213 - Episode 914
2021-08-26 22:55:31.795 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:31.796 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 11.0
2021-08-26 22:55:31.797 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 17.666666666666668
2021-08-26 22:55:31.798 | INFO     | src.policies:collect_trajectories:213 - Episode 915
2021-08-26 22:55:31.827 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:31.828 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 39.0
2021-08-26 22:55:31.829 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 23.0
2021-08-26 22:55:31.830 | INFO     | src.policies:collect_trajectories:213 - Episode 916
2021-08-26 22:55:31.850 | DEBUG    | src.policies:execute_episode:398 - Early s

2021-08-26 22:55:32.206 | INFO     | src.policies:collect_trajectories:213 - Episode 930
2021-08-26 22:55:32.216 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:32.217 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 11.0
2021-08-26 22:55:32.218 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 19.555555555555557
2021-08-26 22:55:32.219 | INFO     | src.policies:collect_trajectories:213 - Episode 931
2021-08-26 22:55:32.287 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:32.288 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 20.0
2021-08-26 22:55:32.288 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 19.6
2021-08-26 22:55:32.289 | INFO     | src.policies:collect_trajectories:213 - Episode 932
2021-08-26 22:55:32.310 | DEBUG    | src.policies:execute_episode:398 - Early s

2021-08-26 22:55:32.547 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.3516126871109009
2021-08-26 22:55:32.549 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.10999050736427307
2021-08-26 22:55:32.551 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.3516126871109009
2021-08-26 22:55:32.554 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:55:32.557 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.20246724784374237
2021-08-26 22:55:32.560 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.07216232270002365
2021-08-26 22:55:32.562 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.35319283604621887
2021-08-26 22:55:32.564 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.07216232270002365

2021-08-26 22:55:32.872 | INFO     | src.policies:collect_trajectories:213 - Episode 958
2021-08-26 22:55:32.890 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:32.891 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 28.0
2021-08-26 22:55:32.892 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 22.0
2021-08-26 22:55:32.893 | INFO     | src.policies:collect_trajectories:213 - Episode 959
2021-08-26 22:55:32.906 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:32.907 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 19.0
2021-08-26 22:55:32.908 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 21.0
2021-08-26 22:55:32.909 | INFO     | src.policies:collect_trajectories:213 - Episode 960
2021-08-26 22:55:32.922 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all a

2021-08-26 22:55:33.179 | INFO     | src.policies:collect_trajectories:213 - Episode 974
2021-08-26 22:55:33.202 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:33.203 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 39.0
2021-08-26 22:55:33.204 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 28.571428571428573
2021-08-26 22:55:33.209 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:55:33.211 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.17878910899162292
2021-08-26 22:55:33.214 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.17548257112503052
2021-08-26 22:55:33.216 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.3461248576641083
2021-08-26 22:55:33.218 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1754825711250305

2021-08-26 22:55:33.544 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:33.545 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 13.0
2021-08-26 22:55:33.545 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 12.5
2021-08-26 22:55:33.546 | INFO     | src.policies:collect_trajectories:213 - Episode 987
2021-08-26 22:55:33.572 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:33.573 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 49.0
2021-08-26 22:55:33.574 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 24.666666666666668
2021-08-26 22:55:33.575 | INFO     | src.policies:collect_trajectories:213 - Episode 988
2021-08-26 22:55:33.598 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:33.599 | INFO     | src.policies:collect_trajec

2021-08-26 22:55:33.858 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:33.859 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 11.0
2021-08-26 22:55:33.860 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 18.5
2021-08-26 22:55:33.861 | INFO     | src.policies:collect_trajectories:213 - Episode 1003
2021-08-26 22:55:33.869 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:33.870 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 10.0
2021-08-26 22:55:33.871 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 17.727272727272727
2021-08-26 22:55:33.871 | INFO     | src.policies:collect_trajectories:213 - Episode 1004
2021-08-26 22:55:33.895 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:33.896 | INFO     | src.policies:collect_traj

2021-08-26 22:55:34.301 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:55:34.304 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.22080090641975403
2021-08-26 22:55:34.306 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.08128251880407333
2021-08-26 22:55:34.308 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.3205864727497101
2021-08-26 22:55:34.310 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.08128251880407333
2021-08-26 22:55:34.312 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.3205864727497101
2021-08-26 22:55:34.315 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:55:34.318 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.21618498861789703
2021-08-26 22:55:34.322 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gra

2021-08-26 22:55:34.704 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:34.705 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 36.0
2021-08-26 22:55:34.706 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 24.0
2021-08-26 22:55:34.707 | INFO     | src.policies:collect_trajectories:213 - Episode 1031
2021-08-26 22:55:34.722 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:34.723 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 25.0
2021-08-26 22:55:34.724 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 24.333333333333332
2021-08-26 22:55:34.725 | INFO     | src.policies:collect_trajectories:213 - Episode 1032
2021-08-26 22:55:34.734 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:34.735 | INFO     | src.policies:collect_traj

2021-08-26 22:55:34.959 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:34.960 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 10.0
2021-08-26 22:55:34.961 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 13.0
2021-08-26 22:55:34.962 | INFO     | src.policies:collect_trajectories:213 - Episode 1047
2021-08-26 22:55:34.979 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:34.979 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 25.0
2021-08-26 22:55:34.980 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 14.714285714285714
2021-08-26 22:55:34.981 | INFO     | src.policies:collect_trajectories:213 - Episode 1048
2021-08-26 22:55:34.993 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:34.994 | INFO     | src.policies:collect_traj

2021-08-26 22:55:35.314 | INFO     | src.policies:collect_trajectories:213 - Episode 1062
2021-08-26 22:55:35.326 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:35.327 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 17.0
2021-08-26 22:55:35.328 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 21.0
2021-08-26 22:55:35.328 | INFO     | src.policies:collect_trajectories:213 - Episode 1063
2021-08-26 22:55:35.338 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:35.340 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 15.0
2021-08-26 22:55:35.341 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.333333333333332
2021-08-26 22:55:35.341 | INFO     | src.policies:collect_trajectories:213 - Episode 1064
2021-08-26 22:55:35.354 | DEBUG    | src.policies:execute_episode:398 - Earl

2021-08-26 22:55:35.583 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:55:35.586 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.22298596799373627
2021-08-26 22:55:35.589 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.021209096536040306
2021-08-26 22:55:35.591 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.3124350309371948
2021-08-26 22:55:35.594 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.021209096536040306
2021-08-26 22:55:35.596 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.3124350309371948
2021-08-26 22:55:35.599 | INFO     | src.policies:train:116 - Epoch 93 / 800
2021-08-26 22:55:35.600 | INFO     | src.policies:collect_trajectories:213 - Episode 1076
2021-08-26 22:55:35.612 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:55:35.917 | INFO     | src.policies:collect_trajectories:213 - Episode 1090
2021-08-26 22:55:35.927 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:35.928 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 14.0
2021-08-26 22:55:35.929 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 23.0
2021-08-26 22:55:35.930 | INFO     | src.policies:collect_trajectories:213 - Episode 1091
2021-08-26 22:55:35.941 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:35.942 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 15.0
2021-08-26 22:55:35.943 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 21.0
2021-08-26 22:55:35.944 | INFO     | src.policies:collect_trajectories:213 - Episode 1092
2021-08-26 22:55:35.960 | DEBUG    | src.policies:execute_episode:398 - Early stopping, al

2021-08-26 22:55:36.360 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:36.362 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 18.0
2021-08-26 22:55:36.363 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 19.88888888888889
2021-08-26 22:55:36.364 | INFO     | src.policies:collect_trajectories:213 - Episode 1107
2021-08-26 22:55:36.381 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:36.382 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 22.0
2021-08-26 22:55:36.384 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.1
2021-08-26 22:55:36.392 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:55:36.395 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.20513826608657837
2021-08-26 22:55:36.400 | INFO     | src.policies:minibatch_update:277 -

2021-08-26 22:55:36.704 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:36.705 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 19.0
2021-08-26 22:55:36.706 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 26.0
2021-08-26 22:55:36.707 | INFO     | src.policies:collect_trajectories:213 - Episode 1119
2021-08-26 22:55:36.715 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:36.716 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 12.0
2021-08-26 22:55:36.717 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 21.333333333333332
2021-08-26 22:55:36.718 | INFO     | src.policies:collect_trajectories:213 - Episode 1120
2021-08-26 22:55:36.735 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:36.736 | INFO     | src.policies:collect_traj

2021-08-26 22:55:37.083 | INFO     | src.policies:collect_trajectories:213 - Episode 1134
2021-08-26 22:55:37.106 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:37.107 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 40.0
2021-08-26 22:55:37.108 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 19.727272727272727
2021-08-26 22:55:37.116 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:55:37.119 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.21379758417606354
2021-08-26 22:55:37.122 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.02932455576956272
2021-08-26 22:55:37.125 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.30042701959609985
2021-08-26 22:55:37.127 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.02932455576956

2021-08-26 22:55:37.408 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.0253145694732666
2021-08-26 22:55:37.410 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.28190815448760986
2021-08-26 22:55:37.414 | INFO     | src.policies:train:116 - Epoch 100 / 800
2021-08-26 22:55:37.415 | INFO     | src.policies:collect_trajectories:213 - Episode 1147
2021-08-26 22:55:37.500 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:37.502 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 31.0
2021-08-26 22:55:37.504 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 31.0
2021-08-26 22:55:37.505 | INFO     | src.policies:collect_trajectories:213 - Episode 1148
2021-08-26 22:55:37.519 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:37.520 | INFO     

2021-08-26 22:55:37.808 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:37.809 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 17.0
2021-08-26 22:55:37.810 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 19.5
2021-08-26 22:55:37.811 | INFO     | src.policies:collect_trajectories:213 - Episode 1163
2021-08-26 22:55:37.847 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:37.848 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 57.0
2021-08-26 22:55:37.848 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 24.857142857142858
2021-08-26 22:55:37.849 | INFO     | src.policies:collect_trajectories:213 - Episode 1164
2021-08-26 22:55:37.859 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:37.861 | INFO     | src.policies:collect_traj

2021-08-26 22:55:38.201 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1389981508255005
2021-08-26 22:55:38.306 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.2763362228870392
2021-08-26 22:55:38.310 | INFO     | src.policies:train:116 - Epoch 103 / 800
2021-08-26 22:55:38.311 | INFO     | src.policies:collect_trajectories:213 - Episode 1175
2021-08-26 22:55:38.321 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:38.322 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 16.0
2021-08-26 22:55:38.323 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 16.0
2021-08-26 22:55:38.324 | INFO     | src.policies:collect_trajectories:213 - Episode 1176
2021-08-26 22:55:38.343 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:38.344 | INFO     |

2021-08-26 22:55:38.780 | INFO     | src.policies:collect_trajectories:213 - Episode 1190
2021-08-26 22:55:38.790 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:38.791 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 12.0
2021-08-26 22:55:38.791 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 16.0
2021-08-26 22:55:38.792 | INFO     | src.policies:collect_trajectories:213 - Episode 1191
2021-08-26 22:55:38.800 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:38.802 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 10.0
2021-08-26 22:55:38.802 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 15.142857142857142
2021-08-26 22:55:38.803 | INFO     | src.policies:collect_trajectories:213 - Episode 1192
2021-08-26 22:55:38.819 | DEBUG    | src.policies:execute_episode:398 - Earl

2021-08-26 22:55:39.077 | INFO     | src.policies:collect_trajectories:213 - Episode 1206
2021-08-26 22:55:39.092 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:39.093 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 22.0
2021-08-26 22:55:39.094 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 14.8
2021-08-26 22:55:39.095 | INFO     | src.policies:collect_trajectories:213 - Episode 1207
2021-08-26 22:55:39.107 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:39.108 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 15.0
2021-08-26 22:55:39.108 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 14.818181818181818
2021-08-26 22:55:39.110 | INFO     | src.policies:collect_trajectories:213 - Episode 1208
2021-08-26 22:55:39.125 | DEBUG    | src.policies:execute_episode:398 - Earl

2021-08-26 22:55:39.473 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.27053675055503845
2021-08-26 22:55:39.476 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.04299597814679146
2021-08-26 22:55:39.478 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.27053675055503845
2021-08-26 22:55:39.482 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:55:39.485 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2416742593050003
2021-08-26 22:55:39.489 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.047184985131025314
2021-08-26 22:55:39.492 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.26378270983695984
2021-08-26 22:55:39.494 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.047184985131025314

2021-08-26 22:55:39.748 | INFO     | src.policies:collect_trajectories:213 - Episode 1234
2021-08-26 22:55:39.757 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:39.758 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 9.0
2021-08-26 22:55:39.759 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 11.0
2021-08-26 22:55:39.760 | INFO     | src.policies:collect_trajectories:213 - Episode 1235
2021-08-26 22:55:39.870 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:39.871 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 28.0
2021-08-26 22:55:39.872 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 16.666666666666668
2021-08-26 22:55:39.873 | INFO     | src.policies:collect_trajectories:213 - Episode 1236
2021-08-26 22:55:39.891 | DEBUG    | src.policies:execute_episode:398 - Early

2021-08-26 22:55:40.210 | INFO     | src.policies:collect_trajectories:213 - Episode 1250
2021-08-26 22:55:40.229 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:40.230 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 22.0
2021-08-26 22:55:40.231 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.1
2021-08-26 22:55:40.241 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:55:40.244 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.25115859508514404
2021-08-26 22:55:40.247 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1417851746082306
2021-08-26 22:55:40.249 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.2576025128364563
2021-08-26 22:55:40.251 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1417851746082306

2021-08-26 22:55:40.633 | INFO     | src.policies:collect_trajectories:213 - Episode 1262
2021-08-26 22:55:40.643 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:40.644 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 14.0
2021-08-26 22:55:40.645 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 14.0
2021-08-26 22:55:40.646 | INFO     | src.policies:collect_trajectories:213 - Episode 1263
2021-08-26 22:55:40.667 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:40.669 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 34.0
2021-08-26 22:55:40.669 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 24.0
2021-08-26 22:55:40.670 | INFO     | src.policies:collect_trajectories:213 - Episode 1264
2021-08-26 22:55:40.702 | DEBUG    | src.policies:execute_episode:398 - Early stopping, al

2021-08-26 22:55:41.036 | INFO     | src.policies:collect_trajectories:213 - Episode 1278
2021-08-26 22:55:41.056 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:41.057 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 32.0
2021-08-26 22:55:41.058 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.555555555555557
2021-08-26 22:55:41.059 | INFO     | src.policies:collect_trajectories:213 - Episode 1279
2021-08-26 22:55:41.071 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:41.072 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 15.0
2021-08-26 22:55:41.073 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.0
2021-08-26 22:55:41.079 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:55:41.082 | INFO     | src.policies:minibatch_update:270 - Total loss: -0

2021-08-26 22:55:41.322 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.24118879437446594
2021-08-26 22:55:41.326 | INFO     | src.policies:train:116 - Epoch 114 / 800
2021-08-26 22:55:41.327 | INFO     | src.policies:collect_trajectories:213 - Episode 1291
2021-08-26 22:55:41.344 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:41.345 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 27.0
2021-08-26 22:55:41.346 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 27.0
2021-08-26 22:55:41.347 | INFO     | src.policies:collect_trajectories:213 - Episode 1292
2021-08-26 22:55:41.366 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:41.367 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 31.0
2021-08-26 22:55:41.368 | INFO     | src.policies:collect_trajectories:2

2021-08-26 22:55:41.734 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 13.0
2021-08-26 22:55:41.734 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 19.0
2021-08-26 22:55:41.735 | INFO     | src.policies:collect_trajectories:213 - Episode 1307
2021-08-26 22:55:41.755 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:41.756 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 32.0
2021-08-26 22:55:41.757 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.3
2021-08-26 22:55:41.766 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:55:41.768 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.24755236506462097
2021-08-26 22:55:41.771 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.0464613251388073
2021-08-26 22:55:41.773 | INFO     | src.policies:minibatch_upda

2021-08-26 22:55:42.045 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.2442205399274826
2021-08-26 22:55:42.048 | INFO     | src.policies:train:116 - Epoch 117 / 800
2021-08-26 22:55:42.049 | INFO     | src.policies:collect_trajectories:213 - Episode 1319
2021-08-26 22:55:42.072 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:42.073 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 39.0
2021-08-26 22:55:42.074 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 39.0
2021-08-26 22:55:42.075 | INFO     | src.policies:collect_trajectories:213 - Episode 1320
2021-08-26 22:55:42.153 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:42.155 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 11.0
2021-08-26 22:55:42.156 | INFO     | src.policies:collect_trajectories:23

2021-08-26 22:55:42.558 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 17.0
2021-08-26 22:55:42.559 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 19.5
2021-08-26 22:55:42.560 | INFO     | src.policies:collect_trajectories:213 - Episode 1335
2021-08-26 22:55:42.568 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:42.570 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 9.0
2021-08-26 22:55:42.570 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 18.0
2021-08-26 22:55:42.571 | INFO     | src.policies:collect_trajectories:213 - Episode 1336
2021-08-26 22:55:42.583 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:42.584 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 16.0
2021-08-26 22:55:42.585 | INFO     | src.policies:collect_trajectories:230 - L

2021-08-26 22:55:43.029 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2545557916164398
2021-08-26 22:55:43.032 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.03741961345076561
2021-08-26 22:55:43.034 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.23393194377422333
2021-08-26 22:55:43.037 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.03741961345076561
2021-08-26 22:55:43.040 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.23393194377422333
2021-08-26 22:55:43.043 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:55:43.046 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.25123855471611023
2021-08-26 22:55:43.049 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.05243513733148575
2021-08-26 22:55:43.052 | INFO     | src.polic

2021-08-26 22:55:43.378 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 13.0
2021-08-26 22:55:43.379 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 14.0
2021-08-26 22:55:43.380 | INFO     | src.policies:collect_trajectories:213 - Episode 1363
2021-08-26 22:55:43.403 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:43.404 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 36.0
2021-08-26 22:55:43.405 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 21.333333333333332
2021-08-26 22:55:43.405 | INFO     | src.policies:collect_trajectories:213 - Episode 1364
2021-08-26 22:55:43.421 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:43.422 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 22.0
2021-08-26 22:55:43.423 | INFO     | src.policies:collect_traje

2021-08-26 22:55:43.753 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.10303343832492828
2021-08-26 22:55:43.755 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.23362180590629578
2021-08-26 22:55:43.758 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:55:43.762 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.23736457526683807
2021-08-26 22:55:43.765 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.07013633847236633
2021-08-26 22:55:43.767 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.2363833338022232
2021-08-26 22:55:43.769 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.07013633847236633
2021-08-26 22:55:43.772 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.236383333802

2021-08-26 22:55:44.150 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 14.0
2021-08-26 22:55:44.151 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 29.75
2021-08-26 22:55:44.152 | INFO     | src.policies:collect_trajectories:213 - Episode 1391
2021-08-26 22:55:44.181 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:44.182 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 44.0
2021-08-26 22:55:44.183 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 32.6
2021-08-26 22:55:44.184 | INFO     | src.policies:collect_trajectories:213 - Episode 1392
2021-08-26 22:55:44.195 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:44.196 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 13.0
2021-08-26 22:55:44.197 | INFO     | src.policies:collect_trajectories:230 -

2021-08-26 22:55:44.642 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.23234720528125763
2021-08-26 22:55:44.645 | INFO     | src.policies:train:116 - Epoch 126 / 800
2021-08-26 22:55:44.646 | INFO     | src.policies:collect_trajectories:213 - Episode 1403
2021-08-26 22:55:44.663 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:44.664 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 28.0
2021-08-26 22:55:44.665 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 28.0
2021-08-26 22:55:44.666 | INFO     | src.policies:collect_trajectories:213 - Episode 1404
2021-08-26 22:55:44.675 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:44.676 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 12.0
2021-08-26 22:55:44.676 | INFO     | src.policies:collect_trajectories:2

2021-08-26 22:55:44.942 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 9.0
2021-08-26 22:55:44.942 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 17.166666666666668
2021-08-26 22:55:44.943 | INFO     | src.policies:collect_trajectories:213 - Episode 1419
2021-08-26 22:55:44.953 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:44.954 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 13.0
2021-08-26 22:55:44.955 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 16.571428571428573
2021-08-26 22:55:44.955 | INFO     | src.policies:collect_trajectories:213 - Episode 1420
2021-08-26 22:55:44.964 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:44.965 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 10.0
2021-08-26 22:55:44.966 | INFO     | src.policies:

2021-08-26 22:55:45.313 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.05694304406642914
2021-08-26 22:55:45.315 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.21975651383399963
2021-08-26 22:55:45.318 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:55:45.321 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2609027624130249
2021-08-26 22:55:45.324 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.050557851791381836
2021-08-26 22:55:45.326 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.21409185230731964
2021-08-26 22:55:45.328 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.050557851791381836
2021-08-26 22:55:45.331 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.2140918523

2021-08-26 22:55:45.676 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:45.677 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 23.0
2021-08-26 22:55:45.678 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 26.5
2021-08-26 22:55:45.679 | INFO     | src.policies:collect_trajectories:213 - Episode 1447
2021-08-26 22:55:45.690 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:45.691 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 12.0
2021-08-26 22:55:45.692 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 21.666666666666668
2021-08-26 22:55:45.693 | INFO     | src.policies:collect_trajectories:213 - Episode 1448
2021-08-26 22:55:45.706 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:45.707 | INFO     | src.policies:collect_traj

2021-08-26 22:55:45.971 | INFO     | src.policies:collect_trajectories:213 - Episode 1462
2021-08-26 22:55:45.988 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:45.989 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 24.0
2021-08-26 22:55:45.989 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 19.75
2021-08-26 22:55:45.991 | INFO     | src.policies:collect_trajectories:213 - Episode 1463
2021-08-26 22:55:46.006 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:46.007 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 22.0
2021-08-26 22:55:46.008 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.0
2021-08-26 22:55:46.009 | INFO     | src.policies:collect_trajectories:213 - Episode 1464
2021-08-26 22:55:46.018 | DEBUG    | src.policies:execute_episode:398 - Early stopping, a

2021-08-26 22:55:46.378 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.029618773609399796
2021-08-26 22:55:46.382 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.21152520179748535
2021-08-26 22:55:46.386 | INFO     | src.policies:train:116 - Epoch 133 / 800
2021-08-26 22:55:46.387 | INFO     | src.policies:collect_trajectories:213 - Episode 1475
2021-08-26 22:55:46.393 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:46.394 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 8.0
2021-08-26 22:55:46.394 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 8.0
2021-08-26 22:55:46.395 | INFO     | src.policies:collect_trajectories:213 - Episode 1476
2021-08-26 22:55:46.415 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:46.416 | INFO     

2021-08-26 22:55:46.933 | INFO     | src.policies:collect_trajectories:213 - Episode 1490
2021-08-26 22:55:46.951 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:46.952 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 27.0
2021-08-26 22:55:46.953 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 23.571428571428573
2021-08-26 22:55:46.954 | INFO     | src.policies:collect_trajectories:213 - Episode 1491
2021-08-26 22:55:46.968 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:46.969 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 20.0
2021-08-26 22:55:46.970 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 23.125
2021-08-26 22:55:46.971 | INFO     | src.policies:collect_trajectories:213 - Episode 1492
2021-08-26 22:55:46.982 | DEBUG    | src.policies:execute_episode:398 - Ea

2021-08-26 22:55:47.255 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.032497256994247437
2021-08-26 22:55:47.257 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.19384533166885376
2021-08-26 22:55:47.261 | INFO     | src.policies:train:116 - Epoch 136 / 800
2021-08-26 22:55:47.262 | INFO     | src.policies:collect_trajectories:213 - Episode 1503
2021-08-26 22:55:47.273 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:47.274 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 17.0
2021-08-26 22:55:47.275 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 17.0
2021-08-26 22:55:47.276 | INFO     | src.policies:collect_trajectories:213 - Episode 1504
2021-08-26 22:55:47.286 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:47.287 | INFO   

2021-08-26 22:55:47.677 | INFO     | src.policies:collect_trajectories:213 - Episode 1518
2021-08-26 22:55:47.688 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:47.689 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 13.0
2021-08-26 22:55:47.690 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 21.285714285714285
2021-08-26 22:55:47.691 | INFO     | src.policies:collect_trajectories:213 - Episode 1519
2021-08-26 22:55:47.704 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:47.705 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 16.0
2021-08-26 22:55:47.706 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.625
2021-08-26 22:55:47.706 | INFO     | src.policies:collect_trajectories:213 - Episode 1520
2021-08-26 22:55:47.717 | DEBUG    | src.policies:execute_episode:398 - Ea

2021-08-26 22:55:48.075 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.08963936567306519
2021-08-26 22:55:48.079 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.2043556123971939
2021-08-26 22:55:48.085 | INFO     | src.policies:train:116 - Epoch 139 / 800
2021-08-26 22:55:48.086 | INFO     | src.policies:collect_trajectories:213 - Episode 1531
2021-08-26 22:55:48.099 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:48.100 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 15.0
2021-08-26 22:55:48.101 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 15.0
2021-08-26 22:55:48.103 | INFO     | src.policies:collect_trajectories:213 - Episode 1532
2021-08-26 22:55:48.119 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:48.121 | INFO     

2021-08-26 22:55:48.575 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:48.576 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 48.0
2021-08-26 22:55:48.577 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 37.0
2021-08-26 22:55:48.588 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:55:48.592 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.24481827020645142
2021-08-26 22:55:48.597 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.0683913379907608
2021-08-26 22:55:48.600 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1972133219242096
2021-08-26 22:55:48.604 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.0683913379907608
2021-08-26 22:55:48.608 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient n

2021-08-26 22:55:48.992 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:48.993 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 11.0
2021-08-26 22:55:48.994 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 11.0
2021-08-26 22:55:48.995 | INFO     | src.policies:collect_trajectories:213 - Episode 1559
2021-08-26 22:55:49.009 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:49.010 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 19.0
2021-08-26 22:55:49.011 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 15.0
2021-08-26 22:55:49.011 | INFO     | src.policies:collect_trajectories:213 - Episode 1560
2021-08-26 22:55:49.028 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:49.029 | INFO     | src.policies:collect_trajectories:229 -

2021-08-26 22:55:49.391 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:49.392 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 15.0
2021-08-26 22:55:49.393 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 21.75
2021-08-26 22:55:49.394 | INFO     | src.policies:collect_trajectories:213 - Episode 1575
2021-08-26 22:55:49.410 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:49.411 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 22.0
2021-08-26 22:55:49.412 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 21.77777777777778
2021-08-26 22:55:49.413 | INFO     | src.policies:collect_trajectories:213 - Episode 1576
2021-08-26 22:55:49.431 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:49.432 | INFO     | src.policies:collect_traj

2021-08-26 22:55:49.795 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:49.796 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 27.0
2021-08-26 22:55:49.797 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 31.5
2021-08-26 22:55:49.798 | INFO     | src.policies:collect_trajectories:213 - Episode 1587
2021-08-26 22:55:49.812 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:49.813 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 21.0
2021-08-26 22:55:49.814 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 28.0
2021-08-26 22:55:49.815 | INFO     | src.policies:collect_trajectories:213 - Episode 1588
2021-08-26 22:55:49.828 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:49.829 | INFO     | src.policies:collect_trajectories:229 -

2021-08-26 22:55:50.157 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:50.159 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 18.0
2021-08-26 22:55:50.160 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 24.0
2021-08-26 22:55:50.243 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:55:50.247 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.25247564911842346
2021-08-26 22:55:50.250 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.05742190033197403
2021-08-26 22:55:50.252 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1833108812570572
2021-08-26 22:55:50.255 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.05742190033197403
2021-08-26 22:55:50.258 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient

2021-08-26 22:55:50.575 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:50.576 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 22.0
2021-08-26 22:55:50.577 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 21.75
2021-08-26 22:55:50.578 | INFO     | src.policies:collect_trajectories:213 - Episode 1615
2021-08-26 22:55:50.591 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:50.592 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 13.0
2021-08-26 22:55:50.593 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.0
2021-08-26 22:55:50.594 | INFO     | src.policies:collect_trajectories:213 - Episode 1616
2021-08-26 22:55:50.613 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:50.614 | INFO     | src.policies:collect_trajectories:229 

2021-08-26 22:55:51.246 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.07273554801940918
2021-08-26 22:55:51.249 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.18345649540424347
2021-08-26 22:55:51.252 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:55:51.255 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.26702871918678284
2021-08-26 22:55:51.259 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.04630584269762039
2021-08-26 22:55:51.262 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.18513524532318115
2021-08-26 22:55:51.265 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.04630584269762039
2021-08-26 22:55:51.268 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.18513524532

2021-08-26 22:55:51.671 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:51.672 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 19.0
2021-08-26 22:55:51.673 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 24.4
2021-08-26 22:55:51.674 | INFO     | src.policies:collect_trajectories:213 - Episode 1643
2021-08-26 22:55:51.688 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:51.690 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 19.0
2021-08-26 22:55:51.690 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 23.5
2021-08-26 22:55:51.691 | INFO     | src.policies:collect_trajectories:213 - Episode 1644
2021-08-26 22:55:51.704 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:51.705 | INFO     | src.policies:collect_trajectories:229 -

2021-08-26 22:55:52.006 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.23937499523162842
2021-08-26 22:55:52.010 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.027889076620340347
2021-08-26 22:55:52.013 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.179753378033638
2021-08-26 22:55:52.016 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.027889076620340347
2021-08-26 22:55:52.019 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.179753378033638
2021-08-26 22:55:52.023 | INFO     | src.policies:train:116 - Epoch 153 / 800
2021-08-26 22:55:52.024 | INFO     | src.policies:collect_trajectories:213 - Episode 1656
2021-08-26 22:55:52.034 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:52.035 | INFO     | src.policies:collect_trajectories:229 - Mean epis

2021-08-26 22:55:52.416 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:52.417 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 17.0
2021-08-26 22:55:52.418 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 19.333333333333332
2021-08-26 22:55:52.419 | INFO     | src.policies:collect_trajectories:213 - Episode 1671
2021-08-26 22:55:52.439 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:52.441 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 26.0
2021-08-26 22:55:52.441 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 21.0
2021-08-26 22:55:52.443 | INFO     | src.policies:collect_trajectories:213 - Episode 1672
2021-08-26 22:55:52.457 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:52.458 | INFO     | src.policies:collect_traj

2021-08-26 22:55:52.895 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.04967246949672699
2021-08-26 22:55:52.898 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.16978314518928528
2021-08-26 22:55:52.902 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.04967246949672699
2021-08-26 22:55:52.905 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.16978314518928528
2021-08-26 22:55:52.911 | INFO     | src.policies:train:116 - Epoch 156 / 800
2021-08-26 22:55:52.912 | INFO     | src.policies:collect_trajectories:213 - Episode 1684
2021-08-26 22:55:52.930 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:52.931 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 24.0
2021-08-26 22:55:52.932 | INFO     | src.policies:collect_trajectories:230 - Last 100 e

2021-08-26 22:55:53.386 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 11.0
2021-08-26 22:55:53.387 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 24.285714285714285
2021-08-26 22:55:53.388 | INFO     | src.policies:collect_trajectories:213 - Episode 1699
2021-08-26 22:55:53.404 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:53.405 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 21.0
2021-08-26 22:55:53.406 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 23.875
2021-08-26 22:55:53.407 | INFO     | src.policies:collect_trajectories:213 - Episode 1700
2021-08-26 22:55:53.420 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:53.422 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 18.0
2021-08-26 22:55:53.422 | INFO     | src.policies:collect_tra

2021-08-26 22:55:53.699 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.16290897130966187
2021-08-26 22:55:53.702 | INFO     | src.policies:train:116 - Epoch 159 / 800
2021-08-26 22:55:53.704 | INFO     | src.policies:collect_trajectories:213 - Episode 1711
2021-08-26 22:55:53.721 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:53.722 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 26.0
2021-08-26 22:55:53.723 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 26.0
2021-08-26 22:55:53.724 | INFO     | src.policies:collect_trajectories:213 - Episode 1712
2021-08-26 22:55:53.737 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:53.738 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 18.0
2021-08-26 22:55:53.739 | INFO     | src.policies:collect_trajectories:2

2021-08-26 22:55:54.117 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 37.0
2021-08-26 22:55:54.118 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 21.428571428571427
2021-08-26 22:55:54.119 | INFO     | src.policies:collect_trajectories:213 - Episode 1727
2021-08-26 22:55:54.138 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:54.139 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 22.0
2021-08-26 22:55:54.140 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 21.5
2021-08-26 22:55:54.141 | INFO     | src.policies:collect_trajectories:213 - Episode 1728
2021-08-26 22:55:54.165 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:54.167 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 33.0
2021-08-26 22:55:54.168 | INFO     | src.policies:collect_traje

2021-08-26 22:55:54.566 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 30.0
2021-08-26 22:55:54.567 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 22.0
2021-08-26 22:55:54.568 | INFO     | src.policies:collect_trajectories:213 - Episode 1739
2021-08-26 22:55:54.585 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:54.587 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 22.0
2021-08-26 22:55:54.588 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 22.0
2021-08-26 22:55:54.588 | INFO     | src.policies:collect_trajectories:213 - Episode 1740
2021-08-26 22:55:54.611 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:54.612 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 32.0
2021-08-26 22:55:54.613 | INFO     | src.policies:collect_trajectories:230 - 

2021-08-26 22:55:55.012 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.07158282399177551
2021-08-26 22:55:55.015 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1509724110364914
2021-08-26 22:55:55.018 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.07158282399177551
2021-08-26 22:55:55.021 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1509724110364914
2021-08-26 22:55:55.025 | INFO     | src.policies:train:116 - Epoch 164 / 800
2021-08-26 22:55:55.026 | INFO     | src.policies:collect_trajectories:213 - Episode 1752
2021-08-26 22:55:55.064 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:55.065 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 60.0
2021-08-26 22:55:55.066 | INFO     | src.policies:collect_trajectories:230 - Last 100 epi

2021-08-26 22:55:55.666 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 11.0
2021-08-26 22:55:55.667 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 22.0
2021-08-26 22:55:55.669 | INFO     | src.policies:collect_trajectories:213 - Episode 1767
2021-08-26 22:55:55.715 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:55.717 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 63.0
2021-08-26 22:55:55.718 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 27.125
2021-08-26 22:55:55.727 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:55:55.731 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.287461519241333
2021-08-26 22:55:55.735 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.02900301106274128
2021-08-26 22:55:55.737 | INFO     | src.policies:minibatch_upd

2021-08-26 22:55:56.077 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 49.0
2021-08-26 22:55:56.078 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 35.25
2021-08-26 22:55:56.079 | INFO     | src.policies:collect_trajectories:213 - Episode 1779
2021-08-26 22:55:56.091 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:56.093 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 15.0
2021-08-26 22:55:56.093 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 31.2
2021-08-26 22:55:56.094 | INFO     | src.policies:collect_trajectories:213 - Episode 1780
2021-08-26 22:55:56.106 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:56.108 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 15.0
2021-08-26 22:55:56.109 | INFO     | src.policies:collect_trajectories:230 -

2021-08-26 22:55:56.467 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.14342527091503143
2021-08-26 22:55:56.470 | INFO     | src.policies:train:116 - Epoch 169 / 800
2021-08-26 22:55:56.471 | INFO     | src.policies:collect_trajectories:213 - Episode 1791
2021-08-26 22:55:56.508 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:56.509 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 58.0
2021-08-26 22:55:56.510 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 58.0
2021-08-26 22:55:56.511 | INFO     | src.policies:collect_trajectories:213 - Episode 1792
2021-08-26 22:55:56.521 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:56.522 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 11.0
2021-08-26 22:55:56.523 | INFO     | src.policies:collect_trajectories:2

2021-08-26 22:55:56.921 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2979762554168701
2021-08-26 22:55:56.924 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1132265105843544
2021-08-26 22:55:56.926 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.15040884912014008
2021-08-26 22:55:56.929 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1132265105843544
2021-08-26 22:55:56.932 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.15040884912014008
2021-08-26 22:55:56.935 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:55:56.938 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.27825942635536194
2021-08-26 22:55:56.942 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.14278066158294678
2021-08-26 22:55:56.944 | INFO     | src.policie

2021-08-26 22:55:57.422 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 54.0
2021-08-26 22:55:57.424 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 29.5
2021-08-26 22:55:57.425 | INFO     | src.policies:collect_trajectories:213 - Episode 1819
2021-08-26 22:55:57.457 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:57.458 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 32.0
2021-08-26 22:55:57.461 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 29.857142857142858
2021-08-26 22:55:57.472 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:55:57.477 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.25106582045555115
2021-08-26 22:55:57.480 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.0750446766614914
2021-08-26 22:55:57.483 | INFO     | src.policies:

2021-08-26 22:55:57.899 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 23.142857142857142
2021-08-26 22:55:57.900 | INFO     | src.policies:collect_trajectories:213 - Episode 1831
2021-08-26 22:55:57.913 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:57.914 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 15.0
2021-08-26 22:55:57.915 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 22.125
2021-08-26 22:55:57.916 | INFO     | src.policies:collect_trajectories:213 - Episode 1832
2021-08-26 22:55:57.937 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:57.939 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 28.0
2021-08-26 22:55:57.940 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 22.77777777777778
2021-08-26 22:55:57.947 | INFO     | s

2021-08-26 22:55:58.207 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2994788587093353
2021-08-26 22:55:58.210 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.05976088345050812
2021-08-26 22:55:58.212 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1300099790096283
2021-08-26 22:55:58.214 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.05976088345050812
2021-08-26 22:55:58.217 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1300099790096283
2021-08-26 22:55:58.220 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:55:58.223 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30041128396987915
2021-08-26 22:55:58.226 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.050080012530088425
2021-08-26 22:55:58.228 | INFO     | src.polici

2021-08-26 22:55:58.608 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 20.0
2021-08-26 22:55:58.609 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.0
2021-08-26 22:55:58.610 | INFO     | src.policies:collect_trajectories:213 - Episode 1859
2021-08-26 22:55:58.628 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:58.629 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 25.0
2021-08-26 22:55:58.630 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 21.25
2021-08-26 22:55:58.631 | INFO     | src.policies:collect_trajectories:213 - Episode 1860
2021-08-26 22:55:58.642 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:58.644 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 12.0
2021-08-26 22:55:58.645 | INFO     | src.policies:collect_trajectories:230 -

2021-08-26 22:55:59.047 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 18.0
2021-08-26 22:55:59.048 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 16.545454545454547
2021-08-26 22:55:59.049 | INFO     | src.policies:collect_trajectories:213 - Episode 1875
2021-08-26 22:55:59.071 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:59.072 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 30.0
2021-08-26 22:55:59.073 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 17.666666666666668
2021-08-26 22:55:59.082 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:55:59.085 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.25698792934417725
2021-08-26 22:55:59.088 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.12100367248058319
2021-08-26 22:55:59.091 | INFO     

2021-08-26 22:55:59.488 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 25.0
2021-08-26 22:55:59.489 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 21.0
2021-08-26 22:55:59.490 | INFO     | src.policies:collect_trajectories:213 - Episode 1887
2021-08-26 22:55:59.513 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:59.514 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 34.0
2021-08-26 22:55:59.515 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 24.25
2021-08-26 22:55:59.516 | INFO     | src.policies:collect_trajectories:213 - Episode 1888
2021-08-26 22:55:59.608 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:59.609 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 66.0
2021-08-26 22:55:59.610 | INFO     | src.policies:collect_trajectories:230 -

2021-08-26 22:55:59.904 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.13165242969989777
2021-08-26 22:55:59.908 | INFO     | src.policies:train:116 - Epoch 182 / 800
2021-08-26 22:55:59.909 | INFO     | src.policies:collect_trajectories:213 - Episode 1899
2021-08-26 22:55:59.922 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:59.923 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 19.0
2021-08-26 22:55:59.924 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 19.0
2021-08-26 22:55:59.925 | INFO     | src.policies:collect_trajectories:213 - Episode 1900
2021-08-26 22:55:59.948 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:55:59.949 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 36.0
2021-08-26 22:55:59.950 | INFO     | src.policies:collect_trajectories:2

2021-08-26 22:56:00.532 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 58.0
2021-08-26 22:56:00.533 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 27.333333333333332
2021-08-26 22:56:00.541 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:00.544 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.27696555852890015
2021-08-26 22:56:00.548 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1126735582947731
2021-08-26 22:56:00.550 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.13339053094387054
2021-08-26 22:56:00.552 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1126735582947731
2021-08-26 22:56:00.555 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.13339053094387054
2021-08-26 22:56:00.558 | INFO     | src.policies

2021-08-26 22:56:00.941 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 26.0
2021-08-26 22:56:00.942 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 21.2
2021-08-26 22:56:00.943 | INFO     | src.policies:collect_trajectories:213 - Episode 1927
2021-08-26 22:56:00.957 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:00.958 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 18.0
2021-08-26 22:56:00.959 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.666666666666668
2021-08-26 22:56:00.960 | INFO     | src.policies:collect_trajectories:213 - Episode 1928
2021-08-26 22:56:00.988 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:00.989 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 44.0
2021-08-26 22:56:00.990 | INFO     | src.policies:collect_traje

2021-08-26 22:56:01.472 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 11.0
2021-08-26 22:56:01.473 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 36.5
2021-08-26 22:56:01.474 | INFO     | src.policies:collect_trajectories:213 - Episode 1939
2021-08-26 22:56:01.491 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:01.492 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 24.0
2021-08-26 22:56:01.493 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 32.333333333333336
2021-08-26 22:56:01.494 | INFO     | src.policies:collect_trajectories:213 - Episode 1940
2021-08-26 22:56:01.530 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:01.532 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 53.0
2021-08-26 22:56:01.533 | INFO     | src.policies:collect_traje

2021-08-26 22:56:01.904 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.0633489340543747
2021-08-26 22:56:01.906 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1303432434797287
2021-08-26 22:56:01.909 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.0633489340543747
2021-08-26 22:56:01.912 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1303432434797287
2021-08-26 22:56:01.915 | INFO     | src.policies:train:116 - Epoch 189 / 800
2021-08-26 22:56:01.916 | INFO     | src.policies:collect_trajectories:213 - Episode 1952
2021-08-26 22:56:01.956 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:01.957 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 64.0
2021-08-26 22:56:01.959 | INFO     | src.policies:collect_trajectories:230 - Last 100 episo

2021-08-26 22:56:02.311 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1283903270959854
2021-08-26 22:56:02.314 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:56:02.317 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2694997489452362
2021-08-26 22:56:02.321 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.12049993872642517
2021-08-26 22:56:02.323 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.12771710753440857
2021-08-26 22:56:02.326 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.12049993872642517
2021-08-26 22:56:02.328 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.12771710753440857
2021-08-26 22:56:02.332 | INFO     | src.policies:train:116 - Epoch 191 / 800
2021-08-26 22:56:02.333 | INFO     | src.policies:collect_

2021-08-26 22:56:02.756 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 36.8
2021-08-26 22:56:02.756 | INFO     | src.policies:collect_trajectories:213 - Episode 1979
2021-08-26 22:56:02.781 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:02.782 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 34.0
2021-08-26 22:56:02.783 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 36.333333333333336
2021-08-26 22:56:02.790 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:02.793 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.29193487763404846
2021-08-26 22:56:02.797 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.05482494831085205
2021-08-26 22:56:02.799 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.12535595893859863
2021-08-26 22:56:02.801 

2021-08-26 22:56:03.189 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.0
2021-08-26 22:56:03.190 | INFO     | src.policies:collect_trajectories:213 - Episode 1991
2021-08-26 22:56:03.209 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:03.210 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 29.0
2021-08-26 22:56:03.211 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 22.25
2021-08-26 22:56:03.212 | INFO     | src.policies:collect_trajectories:213 - Episode 1992
2021-08-26 22:56:03.223 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:03.224 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 12.0
2021-08-26 22:56:03.225 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.2
2021-08-26 22:56:03.226 | INFO     | src.policies:collect_trajecto

2021-08-26 22:56:03.669 | INFO     | src.policies:collect_trajectories:213 - Episode 2003
2021-08-26 22:56:03.678 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:03.679 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 12.0
2021-08-26 22:56:03.680 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 12.0
2021-08-26 22:56:03.681 | INFO     | src.policies:collect_trajectories:213 - Episode 2004
2021-08-26 22:56:03.701 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:03.702 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 28.0
2021-08-26 22:56:03.703 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 20.0
2021-08-26 22:56:03.704 | INFO     | src.policies:collect_trajectories:213 - Episode 2005
2021-08-26 22:56:03.724 | DEBUG    | src.policies:execute_episode:398 - Early stopping, al

2021-08-26 22:56:04.157 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:04.160 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.28664809465408325
2021-08-26 22:56:04.163 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1265854835510254
2021-08-26 22:56:04.166 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.11811679601669312
2021-08-26 22:56:04.168 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1265854835510254
2021-08-26 22:56:04.172 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.11811679601669312
2021-08-26 22:56:04.175 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:56:04.178 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3078401982784271
2021-08-26 22:56:04.182 | INFO     | src.policies:minibatch_update:277 - Policy network L2 grad

2021-08-26 22:56:04.527 | INFO     | src.policies:collect_trajectories:213 - Episode 2031
2021-08-26 22:56:04.548 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:04.549 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 29.0
2021-08-26 22:56:04.550 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 35.2
2021-08-26 22:56:04.551 | INFO     | src.policies:collect_trajectories:213 - Episode 2032
2021-08-26 22:56:04.567 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:04.568 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 19.0
2021-08-26 22:56:04.569 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 32.5
2021-08-26 22:56:04.570 | INFO     | src.policies:collect_trajectories:213 - Episode 2033
2021-08-26 22:56:04.580 | DEBUG    | src.policies:execute_episode:398 - Early stopping, al

2021-08-26 22:56:04.962 | INFO     | src.policies:collect_trajectories:213 - Episode 2043
2021-08-26 22:56:04.985 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:04.986 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 34.0
2021-08-26 22:56:04.988 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 38.666666666666664
2021-08-26 22:56:04.988 | INFO     | src.policies:collect_trajectories:213 - Episode 2044
2021-08-26 22:56:05.002 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:05.003 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 17.0
2021-08-26 22:56:05.004 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 33.25
2021-08-26 22:56:05.005 | INFO     | src.policies:collect_trajectories:213 - Episode 2045
2021-08-26 22:56:05.019 | DEBUG    | src.policies:execute_episode:398 - Ear

2021-08-26 22:56:05.509 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.030366597697138786
2021-08-26 22:56:05.512 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.11648187041282654
2021-08-26 22:56:05.515 | INFO     | src.policies:train:116 - Epoch 203 / 800
2021-08-26 22:56:05.516 | INFO     | src.policies:collect_trajectories:213 - Episode 2056
2021-08-26 22:56:05.553 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:05.554 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 60.0
2021-08-26 22:56:05.555 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 60.0
2021-08-26 22:56:05.556 | INFO     | src.policies:collect_trajectories:213 - Episode 2057
2021-08-26 22:56:05.655 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:05.656 | INFO   

2021-08-26 22:56:06.045 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:06.048 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2870808243751526
2021-08-26 22:56:06.052 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.11890727281570435
2021-08-26 22:56:06.054 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.11146587133407593
2021-08-26 22:56:06.057 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.11890727281570435
2021-08-26 22:56:06.060 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.11146587133407593
2021-08-26 22:56:06.063 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:56:06.066 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.29363688826560974
2021-08-26 22:56:06.069 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gr

2021-08-26 22:56:06.400 | INFO     | src.policies:collect_trajectories:213 - Episode 2083
2021-08-26 22:56:06.481 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:06.482 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 26.0
2021-08-26 22:56:06.483 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 33.0
2021-08-26 22:56:06.484 | INFO     | src.policies:collect_trajectories:213 - Episode 2084
2021-08-26 22:56:06.502 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:06.503 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 25.0
2021-08-26 22:56:06.504 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 31.4
2021-08-26 22:56:06.506 | INFO     | src.policies:collect_trajectories:213 - Episode 2085
2021-08-26 22:56:06.516 | DEBUG    | src.policies:execute_episode:398 - Early stopping, al

2021-08-26 22:56:06.806 | INFO     | src.policies:collect_trajectories:213 - Episode 2095
2021-08-26 22:56:06.817 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:06.819 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 15.0
2021-08-26 22:56:06.820 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 15.0
2021-08-26 22:56:06.821 | INFO     | src.policies:collect_trajectories:213 - Episode 2096
2021-08-26 22:56:06.840 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:06.841 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 27.0
2021-08-26 22:56:06.842 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 21.0
2021-08-26 22:56:06.843 | INFO     | src.policies:collect_trajectories:213 - Episode 2097
2021-08-26 22:56:06.857 | DEBUG    | src.policies:execute_episode:398 - Early stopping, al

2021-08-26 22:56:07.244 | INFO     | src.policies:collect_trajectories:213 - Episode 2111
2021-08-26 22:56:07.252 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:07.254 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 9.0
2021-08-26 22:56:07.254 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 24.5
2021-08-26 22:56:07.255 | INFO     | src.policies:collect_trajectories:213 - Episode 2112
2021-08-26 22:56:07.285 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:07.287 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 45.0
2021-08-26 22:56:07.288 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 26.77777777777778
2021-08-26 22:56:07.296 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:07.299 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2

2021-08-26 22:56:07.809 | INFO     | src.policies:collect_trajectories:213 - Episode 2123
2021-08-26 22:56:07.865 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:07.866 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 95.0
2021-08-26 22:56:07.867 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 57.5
2021-08-26 22:56:07.874 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:07.877 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.25382667779922485
2021-08-26 22:56:07.880 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.08603190630674362
2021-08-26 22:56:07.882 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.108711376786232
2021-08-26 22:56:07.885 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.08603190630674362
2021-08-26 2

2021-08-26 22:56:08.260 | INFO     | src.policies:collect_trajectories:213 - Episode 2135
2021-08-26 22:56:08.269 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:08.270 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 10.0
2021-08-26 22:56:08.271 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 23.75
2021-08-26 22:56:08.272 | INFO     | src.policies:collect_trajectories:213 - Episode 2136
2021-08-26 22:56:08.283 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:08.284 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 13.0
2021-08-26 22:56:08.285 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 21.6
2021-08-26 22:56:08.286 | INFO     | src.policies:collect_trajectories:213 - Episode 2137
2021-08-26 22:56:08.311 | DEBUG    | src.policies:execute_episode:398 - Early stopping, a

2021-08-26 22:56:08.599 | INFO     | src.policies:collect_trajectories:213 - Episode 2151
2021-08-26 22:56:08.609 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:08.610 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 11.0
2021-08-26 22:56:08.611 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 18.09090909090909
2021-08-26 22:56:08.612 | INFO     | src.policies:collect_trajectories:213 - Episode 2152
2021-08-26 22:56:08.626 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:08.628 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 19.0
2021-08-26 22:56:08.628 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 18.166666666666668
2021-08-26 22:56:08.637 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:08.640 | INFO     | src.policies:minibatch_update:270 - T

2021-08-26 22:56:09.110 | INFO     | src.policies:collect_trajectories:213 - Episode 2163
2021-08-26 22:56:09.137 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:09.138 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 40.0
2021-08-26 22:56:09.139 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 39.333333333333336
2021-08-26 22:56:09.141 | INFO     | src.policies:collect_trajectories:213 - Episode 2164
2021-08-26 22:56:09.151 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:09.152 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 11.0
2021-08-26 22:56:09.153 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 32.25
2021-08-26 22:56:09.154 | INFO     | src.policies:collect_trajectories:213 - Episode 2165
2021-08-26 22:56:09.169 | DEBUG    | src.policies:execute_episode:398 - Ear

2021-08-26 22:56:09.546 | INFO     | src.policies:collect_trajectories:213 - Episode 2175
2021-08-26 22:56:09.601 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:09.603 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 84.0
2021-08-26 22:56:09.604 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 84.0
2021-08-26 22:56:09.606 | INFO     | src.policies:collect_trajectories:213 - Episode 2176
2021-08-26 22:56:09.624 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:09.626 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 14.0
2021-08-26 22:56:09.627 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 49.0
2021-08-26 22:56:09.629 | INFO     | src.policies:collect_trajectories:213 - Episode 2177
2021-08-26 22:56:09.677 | DEBUG    | src.policies:execute_episode:398 - Early stopping, al

2021-08-26 22:56:10.127 | INFO     | src.policies:collect_trajectories:213 - Episode 2187
2021-08-26 22:56:10.136 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:10.137 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 11.0
2021-08-26 22:56:10.138 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 11.0
2021-08-26 22:56:10.139 | INFO     | src.policies:collect_trajectories:213 - Episode 2188
2021-08-26 22:56:10.151 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:10.153 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 16.0
2021-08-26 22:56:10.153 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 13.5
2021-08-26 22:56:10.154 | INFO     | src.policies:collect_trajectories:213 - Episode 2189
2021-08-26 22:56:10.168 | DEBUG    | src.policies:execute_episode:398 - Early stopping, al

2021-08-26 22:56:10.777 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:56:10.780 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2422390878200531
2021-08-26 22:56:10.784 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3177555799484253
2021-08-26 22:56:10.786 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.10153663903474808
2021-08-26 22:56:10.789 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.3177555799484253
2021-08-26 22:56:10.792 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.10153663903474808
2021-08-26 22:56:10.796 | INFO     | src.policies:train:116 - Epoch 222 / 800
2021-08-26 22:56:10.797 | INFO     | src.policies:collect_trajectories:213 - Episode 2201
2021-08-26 22:56:10.821 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-

2021-08-26 22:56:11.191 | INFO     | src.policies:collect_trajectories:213 - Episode 2215
2021-08-26 22:56:11.225 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:11.226 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 44.0
2021-08-26 22:56:11.227 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 29.166666666666668
2021-08-26 22:56:11.228 | INFO     | src.policies:collect_trajectories:213 - Episode 2216
2021-08-26 22:56:11.244 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:11.245 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 22.0
2021-08-26 22:56:11.246 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 28.142857142857142
2021-08-26 22:56:11.247 | INFO     | src.policies:collect_trajectories:213 - Episode 2217
2021-08-26 22:56:11.261 | DEBUG    | src.policies:execute_epis

2021-08-26 22:56:11.542 | INFO     | src.policies:collect_trajectories:213 - Episode 2227
2021-08-26 22:56:11.560 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:11.561 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 25.0
2021-08-26 22:56:11.562 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 16.333333333333332
2021-08-26 22:56:11.563 | INFO     | src.policies:collect_trajectories:213 - Episode 2228
2021-08-26 22:56:11.580 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:11.581 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 24.0
2021-08-26 22:56:11.582 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 18.25
2021-08-26 22:56:11.583 | INFO     | src.policies:collect_trajectories:213 - Episode 2229
2021-08-26 22:56:11.599 | DEBUG    | src.policies:execute_episode:398 - Ear

2021-08-26 22:56:12.090 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:56:12.093 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2584100067615509
2021-08-26 22:56:12.096 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.16484305262565613
2021-08-26 22:56:12.098 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.1010412648320198
2021-08-26 22:56:12.101 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.16484305262565613
2021-08-26 22:56:12.103 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.1010412648320198
2021-08-26 22:56:12.107 | INFO     | src.policies:train:116 - Epoch 227 / 800
2021-08-26 22:56:12.108 | INFO     | src.policies:collect_trajectories:213 - Episode 2241
2021-08-26 22:56:12.136 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:56:12.532 | INFO     | src.policies:collect_trajectories:213 - Episode 2255
2021-08-26 22:56:12.551 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:12.553 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 27.0
2021-08-26 22:56:12.553 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 27.25
2021-08-26 22:56:12.561 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:12.564 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.28740307688713074
2021-08-26 22:56:12.567 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.03970184549689293
2021-08-26 22:56:12.570 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.09415505081415176
2021-08-26 22:56:12.572 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.03970184549689293

2021-08-26 22:56:13.032 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:13.035 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2796669900417328
2021-08-26 22:56:13.038 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.08426021784543991
2021-08-26 22:56:13.041 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.09695254266262054
2021-08-26 22:56:13.044 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.08426021784543991
2021-08-26 22:56:13.047 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.09695254266262054
2021-08-26 22:56:13.050 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:56:13.053 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2863869369029999
2021-08-26 22:56:13.057 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gra

2021-08-26 22:56:13.459 | INFO     | src.policies:collect_trajectories:213 - Episode 2279
2021-08-26 22:56:13.500 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:13.501 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 63.0
2021-08-26 22:56:13.502 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 48.25
2021-08-26 22:56:13.503 | INFO     | src.policies:collect_trajectories:213 - Episode 2280
2021-08-26 22:56:13.551 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:13.552 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 75.0
2021-08-26 22:56:13.553 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 53.6
2021-08-26 22:56:13.560 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:13.563 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.234589666128

2021-08-26 22:56:13.902 | INFO     | src.policies:collect_trajectories:213 - Episode 2291
2021-08-26 22:56:14.026 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:14.028 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 51.0
2021-08-26 22:56:14.028 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 41.0
2021-08-26 22:56:14.029 | INFO     | src.policies:collect_trajectories:213 - Episode 2292
2021-08-26 22:56:14.054 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:14.055 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 37.0
2021-08-26 22:56:14.056 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 39.666666666666664
2021-08-26 22:56:14.057 | INFO     | src.policies:collect_trajectories:213 - Episode 2293
2021-08-26 22:56:14.074 | DEBUG    | src.policies:execute_episode:398 - Earl

2021-08-26 22:56:14.370 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.13732203841209412
2021-08-26 22:56:14.372 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.09250142425298691
2021-08-26 22:56:14.377 | INFO     | src.policies:train:116 - Epoch 236 / 800
2021-08-26 22:56:14.378 | INFO     | src.policies:collect_trajectories:213 - Episode 2304
2021-08-26 22:56:14.389 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:14.390 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 15.0
2021-08-26 22:56:14.391 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 15.0
2021-08-26 22:56:14.392 | INFO     | src.policies:collect_trajectories:213 - Episode 2305
2021-08-26 22:56:14.458 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:14.459 | INFO    

2021-08-26 22:56:14.828 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.0874493196606636
2021-08-26 22:56:14.830 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.08400269597768784
2021-08-26 22:56:14.833 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0874493196606636
2021-08-26 22:56:14.837 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:56:14.840 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2815437912940979
2021-08-26 22:56:14.843 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1489488184452057
2021-08-26 22:56:14.845 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.09062229841947556
2021-08-26 22:56:14.847 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1489488184452057

2021-08-26 22:56:15.321 | INFO     | src.policies:collect_trajectories:213 - Episode 2331
2021-08-26 22:56:15.353 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:15.355 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 26.0
2021-08-26 22:56:15.357 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 26.25
2021-08-26 22:56:15.358 | INFO     | src.policies:collect_trajectories:213 - Episode 2332
2021-08-26 22:56:15.379 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:15.381 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 21.0
2021-08-26 22:56:15.382 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 25.2
2021-08-26 22:56:15.385 | INFO     | src.policies:collect_trajectories:213 - Episode 2333
2021-08-26 22:56:15.424 | DEBUG    | src.policies:execute_episode:398 - Early stopping, a

2021-08-26 22:56:15.914 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.054173532873392105
2021-08-26 22:56:15.918 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.09266211837530136
2021-08-26 22:56:15.922 | INFO     | src.policies:train:116 - Epoch 241 / 800
2021-08-26 22:56:15.924 | INFO     | src.policies:collect_trajectories:213 - Episode 2344
2021-08-26 22:56:15.952 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:15.954 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 29.0
2021-08-26 22:56:15.955 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 29.0
2021-08-26 22:56:15.957 | INFO     | src.policies:collect_trajectories:213 - Episode 2345
2021-08-26 22:56:15.971 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:15.972 | INFO   

2021-08-26 22:56:16.493 | INFO     | src.policies:collect_trajectories:213 - Episode 2359
2021-08-26 22:56:16.512 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:16.513 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 22.0
2021-08-26 22:56:16.514 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 24.333333333333332
2021-08-26 22:56:16.515 | INFO     | src.policies:collect_trajectories:213 - Episode 2360
2021-08-26 22:56:16.553 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:16.554 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 41.0
2021-08-26 22:56:16.556 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 26.714285714285715
2021-08-26 22:56:16.557 | INFO     | src.policies:collect_trajectories:213 - Episode 2361
2021-08-26 22:56:16.576 | DEBUG    | src.policies:execute_epis

2021-08-26 22:56:17.045 | INFO     | src.policies:collect_trajectories:213 - Episode 2371
2021-08-26 22:56:17.068 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:17.069 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 28.0
2021-08-26 22:56:17.070 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 35.0
2021-08-26 22:56:17.071 | INFO     | src.policies:collect_trajectories:213 - Episode 2372
2021-08-26 22:56:17.104 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:17.105 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 43.0
2021-08-26 22:56:17.106 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 36.333333333333336
2021-08-26 22:56:17.114 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:17.120 | INFO     | src.policies:minibatch_update:270 - Total loss: -0

2021-08-26 22:56:17.477 | INFO     | src.policies:collect_trajectories:213 - Episode 2383
2021-08-26 22:56:17.508 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:17.509 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 46.0
2021-08-26 22:56:17.510 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 35.0
2021-08-26 22:56:17.511 | INFO     | src.policies:collect_trajectories:213 - Episode 2384
2021-08-26 22:56:17.522 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:17.523 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 15.0
2021-08-26 22:56:17.524 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 30.0
2021-08-26 22:56:17.525 | INFO     | src.policies:collect_trajectories:213 - Episode 2385
2021-08-26 22:56:17.541 | DEBUG    | src.policies:execute_episode:398 - Early stopping, al

2021-08-26 22:56:17.944 | INFO     | src.policies:collect_trajectories:213 - Episode 2395
2021-08-26 22:56:17.982 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:17.983 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 61.0
2021-08-26 22:56:17.984 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 61.0
2021-08-26 22:56:17.985 | INFO     | src.policies:collect_trajectories:213 - Episode 2396
2021-08-26 22:56:18.001 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:18.002 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 20.0
2021-08-26 22:56:18.004 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 40.5
2021-08-26 22:56:18.005 | INFO     | src.policies:collect_trajectories:213 - Episode 2397
2021-08-26 22:56:18.035 | DEBUG    | src.policies:execute_episode:398 - Early stopping, al

2021-08-26 22:56:18.553 | INFO     | src.policies:collect_trajectories:213 - Episode 2407
2021-08-26 22:56:18.569 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:18.571 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 22.0
2021-08-26 22:56:18.571 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 24.25
2021-08-26 22:56:18.572 | INFO     | src.policies:collect_trajectories:213 - Episode 2408
2021-08-26 22:56:18.622 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:18.623 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 79.0
2021-08-26 22:56:18.624 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 35.2
2021-08-26 22:56:18.625 | INFO     | src.policies:collect_trajectories:213 - Episode 2409
2021-08-26 22:56:18.658 | DEBUG    | src.policies:execute_episode:398 - Early stopping, a

2021-08-26 22:56:19.048 | INFO     | src.policies:collect_trajectories:213 - Episode 2419
2021-08-26 22:56:19.065 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:19.066 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 23.0
2021-08-26 22:56:19.067 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 32.0
2021-08-26 22:56:19.068 | INFO     | src.policies:collect_trajectories:213 - Episode 2420
2021-08-26 22:56:19.107 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:19.108 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 59.0
2021-08-26 22:56:19.109 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 37.4
2021-08-26 22:56:19.110 | INFO     | src.policies:collect_trajectories:213 - Episode 2421
2021-08-26 22:56:19.125 | DEBUG    | src.policies:execute_episode:398 - Early stopping, al

2021-08-26 22:56:19.394 | INFO     | src.policies:collect_trajectories:213 - Episode 2431
2021-08-26 22:56:19.414 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:19.415 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 31.0
2021-08-26 22:56:19.416 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 31.0
2021-08-26 22:56:19.417 | INFO     | src.policies:collect_trajectories:213 - Episode 2432
2021-08-26 22:56:19.445 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:19.447 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 32.0
2021-08-26 22:56:19.448 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 31.5
2021-08-26 22:56:19.450 | INFO     | src.policies:collect_trajectories:213 - Episode 2433
2021-08-26 22:56:19.484 | DEBUG    | src.policies:execute_episode:398 - Early stopping, al

2021-08-26 22:56:19.898 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.20475931465625763
2021-08-26 22:56:19.901 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.08397302776575089
2021-08-26 22:56:19.904 | INFO     | src.policies:train:116 - Epoch 256 / 800
2021-08-26 22:56:19.905 | INFO     | src.policies:collect_trajectories:213 - Episode 2444
2021-08-26 22:56:19.961 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:19.962 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 92.0
2021-08-26 22:56:19.963 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 92.0
2021-08-26 22:56:19.964 | INFO     | src.policies:collect_trajectories:213 - Episode 2445
2021-08-26 22:56:20.005 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:20.007 | INFO    

2021-08-26 22:56:20.476 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:20.477 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 50.0
2021-08-26 22:56:20.478 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 29.5
2021-08-26 22:56:20.479 | INFO     | src.policies:collect_trajectories:213 - Episode 2456
2021-08-26 22:56:20.497 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:20.498 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 26.0
2021-08-26 22:56:20.499 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 28.333333333333332
2021-08-26 22:56:20.500 | INFO     | src.policies:collect_trajectories:213 - Episode 2457
2021-08-26 22:56:20.509 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:20.510 | INFO     | src.policies:collect_traj

2021-08-26 22:56:21.051 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:21.052 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 36.0
2021-08-26 22:56:21.052 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 61.5
2021-08-26 22:56:21.053 | INFO     | src.policies:collect_trajectories:213 - Episode 2468
2021-08-26 22:56:21.067 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:21.068 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 17.0
2021-08-26 22:56:21.069 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 46.666666666666664
2021-08-26 22:56:21.070 | INFO     | src.policies:collect_trajectories:213 - Episode 2469
2021-08-26 22:56:21.084 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:21.085 | INFO     | src.policies:collect_traj

2021-08-26 22:56:21.442 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0787685364484787
2021-08-26 22:56:21.446 | INFO     | src.policies:train:116 - Epoch 262 / 800
2021-08-26 22:56:21.447 | INFO     | src.policies:collect_trajectories:213 - Episode 2480
2021-08-26 22:56:21.459 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:21.460 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 18.0
2021-08-26 22:56:21.461 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 18.0
2021-08-26 22:56:21.463 | INFO     | src.policies:collect_trajectories:213 - Episode 2481
2021-08-26 22:56:21.474 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:21.476 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 12.0
2021-08-26 22:56:21.477 | INFO     | src.policies:collect_trajectories:23

2021-08-26 22:56:21.913 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.07710754871368408
2021-08-26 22:56:21.917 | INFO     | src.policies:train:116 - Epoch 264 / 800
2021-08-26 22:56:21.918 | INFO     | src.policies:collect_trajectories:213 - Episode 2492
2021-08-26 22:56:21.935 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:21.936 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 24.0
2021-08-26 22:56:21.937 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 24.0
2021-08-26 22:56:21.938 | INFO     | src.policies:collect_trajectories:213 - Episode 2493
2021-08-26 22:56:21.968 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:21.970 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 47.0
2021-08-26 22:56:21.970 | INFO     | src.policies:collect_trajectories:2

2021-08-26 22:56:22.465 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 47.0
2021-08-26 22:56:22.466 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 47.0
2021-08-26 22:56:22.467 | INFO     | src.policies:collect_trajectories:213 - Episode 2504
2021-08-26 22:56:22.507 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:22.508 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 66.0
2021-08-26 22:56:22.509 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 56.5
2021-08-26 22:56:22.510 | INFO     | src.policies:collect_trajectories:213 - Episode 2505
2021-08-26 22:56:22.524 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:22.525 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 19.0
2021-08-26 22:56:22.526 | INFO     | src.policies:collect_trajectories:230 - 

2021-08-26 22:56:22.948 | INFO     | src.policies:collect_trajectories:213 - Episode 2514
2021-08-26 22:56:23.033 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:23.034 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 60.0
2021-08-26 22:56:23.035 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 41.0
2021-08-26 22:56:23.036 | INFO     | src.policies:collect_trajectories:213 - Episode 2515
2021-08-26 22:56:23.082 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:23.083 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 72.0
2021-08-26 22:56:23.084 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 48.75
2021-08-26 22:56:23.085 | INFO     | src.policies:collect_trajectories:213 - Episode 2516
2021-08-26 22:56:23.122 | DEBUG    | src.policies:execute_episode:398 - Early stopping, a

2021-08-26 22:56:23.585 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.07308842986822128
2021-08-26 22:56:23.587 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.30801936984062195
2021-08-26 22:56:23.590 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.07308842986822128
2021-08-26 22:56:23.593 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:56:23.596 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3119126558303833
2021-08-26 22:56:23.599 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.19706782698631287
2021-08-26 22:56:23.601 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.07012777030467987
2021-08-26 22:56:23.603 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.19706782698631287

2021-08-26 22:56:23.977 | INFO     | src.policies:collect_trajectories:213 - Episode 2538
2021-08-26 22:56:24.021 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:24.022 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 66.0
2021-08-26 22:56:24.023 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 28.75
2021-08-26 22:56:24.030 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:24.034 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3014032244682312
2021-08-26 22:56:24.037 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1345890462398529
2021-08-26 22:56:24.039 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.07201746851205826
2021-08-26 22:56:24.042 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1345890462398529

2021-08-26 22:56:24.596 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:24.599 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2578205466270447
2021-08-26 22:56:24.603 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2489953190088272
2021-08-26 22:56:24.605 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.07218374311923981
2021-08-26 22:56:24.644 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2489953190088272
2021-08-26 22:56:24.659 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.07218374311923981
2021-08-26 22:56:24.663 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:56:24.666 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2553223967552185
2021-08-26 22:56:24.670 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradi

2021-08-26 22:56:25.086 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.14934180676937103
2021-08-26 22:56:25.088 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0701296329498291
2021-08-26 22:56:25.092 | INFO     | src.policies:train:116 - Epoch 277 / 800
2021-08-26 22:56:25.093 | INFO     | src.policies:collect_trajectories:213 - Episode 2559
2021-08-26 22:56:25.109 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:25.111 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 24.0
2021-08-26 22:56:25.111 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 24.0
2021-08-26 22:56:25.112 | INFO     | src.policies:collect_trajectories:213 - Episode 2560
2021-08-26 22:56:25.220 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:25.221 | INFO     

2021-08-26 22:56:25.667 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:25.668 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 119.0
2021-08-26 22:56:25.669 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 58.333333333333336
2021-08-26 22:56:25.670 | INFO     | src.policies:collect_trajectories:213 - Episode 2571
2021-08-26 22:56:25.703 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:25.704 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 53.0
2021-08-26 22:56:25.846 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 57.0
2021-08-26 22:56:25.852 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:25.856 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.27699318528175354
2021-08-26 22:56:25.859 | INFO     | src.policies:minibatch_update:277

2021-08-26 22:56:26.253 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.0954604372382164
2021-08-26 22:56:26.255 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.07041503489017487
2021-08-26 22:56:26.259 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:56:26.261 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2969401478767395
2021-08-26 22:56:26.265 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2398710399866104
2021-08-26 22:56:26.267 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.06774205714464188
2021-08-26 22:56:26.269 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2398710399866104
2021-08-26 22:56:26.271 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.067742057144641

2021-08-26 22:56:26.820 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2772785723209381
2021-08-26 22:56:26.823 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.11418332904577255
2021-08-26 22:56:26.825 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.06894896924495697
2021-08-26 22:56:26.828 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.11418332904577255
2021-08-26 22:56:26.830 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.06894896924495697
2021-08-26 22:56:26.834 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:56:26.836 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2880690395832062
2021-08-26 22:56:26.840 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2505217492580414
2021-08-26 22:56:26.842 | INFO     | src.policie

2021-08-26 22:56:27.337 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 53.0
2021-08-26 22:56:27.338 | INFO     | src.policies:collect_trajectories:213 - Episode 2603
2021-08-26 22:56:27.349 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:27.350 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 13.0
2021-08-26 22:56:27.350 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 39.666666666666664
2021-08-26 22:56:27.351 | INFO     | src.policies:collect_trajectories:213 - Episode 2604
2021-08-26 22:56:27.428 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:27.429 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 131.0
2021-08-26 22:56:27.430 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 62.5
2021-08-26 22:56:27.436 | INFO     | src.policies:tr

2021-08-26 22:56:27.879 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.06786402314901352
2021-08-26 22:56:27.882 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.22247736155986786
2021-08-26 22:56:27.884 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.06786402314901352
2021-08-26 22:56:27.888 | INFO     | src.policies:train:116 - Epoch 289 / 800
2021-08-26 22:56:27.889 | INFO     | src.policies:collect_trajectories:213 - Episode 2612
2021-08-26 22:56:27.918 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:27.919 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 45.0
2021-08-26 22:56:27.920 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 45.0
2021-08-26 22:56:27.921 | INFO     | src.policies:collect_trajectories:213 - Episode 2613

2021-08-26 22:56:28.362 | INFO     | src.policies:collect_trajectories:213 - Episode 2623
2021-08-26 22:56:28.375 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:28.376 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 16.0
2021-08-26 22:56:28.377 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 19.0
2021-08-26 22:56:28.378 | INFO     | src.policies:collect_trajectories:213 - Episode 2624
2021-08-26 22:56:28.391 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:28.392 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 18.0
2021-08-26 22:56:28.393 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 18.666666666666668
2021-08-26 22:56:28.394 | INFO     | src.policies:collect_trajectories:213 - Episode 2625
2021-08-26 22:56:28.403 | DEBUG    | src.policies:execute_episode:398 - Earl

2021-08-26 22:56:28.934 | INFO     | src.policies:collect_trajectories:213 - Episode 2635
2021-08-26 22:56:28.990 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:28.991 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 92.0
2021-08-26 22:56:28.992 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 79.33333333333333
2021-08-26 22:56:28.998 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:29.002 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.25002437829971313
2021-08-26 22:56:29.006 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.13557440042495728
2021-08-26 22:56:29.008 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.0678061842918396
2021-08-26 22:56:29.011 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1355744004249572

2021-08-26 22:56:29.444 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.06196199357509613
2021-08-26 22:56:29.447 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.16555944085121155
2021-08-26 22:56:29.450 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.06196199357509613
2021-08-26 22:56:29.453 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:56:29.456 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3009762167930603
2021-08-26 22:56:29.459 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.07135962694883347
2021-08-26 22:56:29.461 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.06450454145669937
2021-08-26 22:56:29.463 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.07135962694883347
2021-08-2

2021-08-26 22:56:29.950 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:29.952 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 24.0
2021-08-26 22:56:29.952 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 37.5
2021-08-26 22:56:29.953 | INFO     | src.policies:collect_trajectories:213 - Episode 2656
2021-08-26 22:56:29.979 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:29.980 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 39.0
2021-08-26 22:56:29.981 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 38.0
2021-08-26 22:56:29.982 | INFO     | src.policies:collect_trajectories:213 - Episode 2657
2021-08-26 22:56:30.000 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:30.002 | INFO     | src.policies:collect_trajectories:229 -

2021-08-26 22:56:30.422 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:30.424 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 27.0
2021-08-26 22:56:30.425 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 27.0
2021-08-26 22:56:30.426 | INFO     | src.policies:collect_trajectories:213 - Episode 2668
2021-08-26 22:56:30.451 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:30.452 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 40.0
2021-08-26 22:56:30.453 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 33.5
2021-08-26 22:56:30.454 | INFO     | src.policies:collect_trajectories:213 - Episode 2669
2021-08-26 22:56:30.470 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:30.472 | INFO     | src.policies:collect_trajectories:229 -

2021-08-26 22:56:31.083 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:31.084 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 72.0
2021-08-26 22:56:31.085 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 63.25
2021-08-26 22:56:31.091 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:31.096 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.27356234192848206
2021-08-26 22:56:31.099 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.12118469178676605
2021-08-26 22:56:31.102 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.06378164142370224
2021-08-26 22:56:31.105 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.12118469178676605
2021-08-26 22:56:31.107 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradie

2021-08-26 22:56:31.585 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.06150249019265175
2021-08-26 22:56:31.589 | INFO     | src.policies:train:116 - Epoch 305 / 800
2021-08-26 22:56:31.590 | INFO     | src.policies:collect_trajectories:213 - Episode 2688
2021-08-26 22:56:31.625 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:31.626 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 59.0
2021-08-26 22:56:31.627 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 59.0
2021-08-26 22:56:31.628 | INFO     | src.policies:collect_trajectories:213 - Episode 2689
2021-08-26 22:56:31.665 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:31.666 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 58.0
2021-08-26 22:56:31.667 | INFO     | src.policies:collect_trajectories:2

2021-08-26 22:56:32.186 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 13.0
2021-08-26 22:56:32.187 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 18.666666666666668
2021-08-26 22:56:32.188 | INFO     | src.policies:collect_trajectories:213 - Episode 2700
2021-08-26 22:56:32.220 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:32.223 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 52.0
2021-08-26 22:56:32.224 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 27.0
2021-08-26 22:56:32.225 | INFO     | src.policies:collect_trajectories:213 - Episode 2701
2021-08-26 22:56:32.244 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:32.246 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 27.0
2021-08-26 22:56:32.247 | INFO     | src.policies:collect_traje

2021-08-26 22:56:32.780 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 50.0
2021-08-26 22:56:32.781 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 46.5
2021-08-26 22:56:32.782 | INFO     | src.policies:collect_trajectories:213 - Episode 2712
2021-08-26 22:56:32.795 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:32.796 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 17.0
2021-08-26 22:56:32.797 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 36.666666666666664
2021-08-26 22:56:32.798 | INFO     | src.policies:collect_trajectories:213 - Episode 2713
2021-08-26 22:56:32.824 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:32.825 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 36.0
2021-08-26 22:56:32.826 | INFO     | src.policies:collect_traje

2021-08-26 22:56:33.423 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 73.66666666666667
2021-08-26 22:56:33.430 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:33.433 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30878645181655884
2021-08-26 22:56:33.436 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3436775803565979
2021-08-26 22:56:33.439 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.05754058435559273
2021-08-26 22:56:33.441 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.3436775803565979
2021-08-26 22:56:33.444 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.05754058435559273
2021-08-26 22:56:33.447 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:56:33.450 | INFO     | src.policies:minibatch_update:270 - T

2021-08-26 22:56:33.952 | INFO     | src.policies:collect_trajectories:213 - Episode 2728
2021-08-26 22:56:34.021 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:34.023 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 116.0
2021-08-26 22:56:34.024 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 116.0
2021-08-26 22:56:34.025 | INFO     | src.policies:collect_trajectories:213 - Episode 2729
2021-08-26 22:56:34.047 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:34.048 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 31.0
2021-08-26 22:56:34.049 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 73.5
2021-08-26 22:56:34.050 | INFO     | src.policies:collect_trajectories:213 - Episode 2730
2021-08-26 22:56:34.076 | DEBUG    | src.policies:execute_episode:398 - Early stopping, 

2021-08-26 22:56:34.539 | INFO     | src.policies:collect_trajectories:213 - Episode 2740
2021-08-26 22:56:34.555 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:34.556 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 18.0
2021-08-26 22:56:34.557 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 86.0
2021-08-26 22:56:34.558 | INFO     | src.policies:collect_trajectories:213 - Episode 2741
2021-08-26 22:56:34.602 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:34.604 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 70.0
2021-08-26 22:56:34.605 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 80.66666666666667
2021-08-26 22:56:34.611 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:34.614 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.

2021-08-26 22:56:35.100 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:56:35.103 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3008156716823578
2021-08-26 22:56:35.107 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.08540557324886322
2021-08-26 22:56:35.110 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.05650414153933525
2021-08-26 22:56:35.112 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.08540557324886322
2021-08-26 22:56:35.115 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.05650414153933525
2021-08-26 22:56:35.119 | INFO     | src.policies:train:116 - Epoch 319 / 800
2021-08-26 22:56:35.120 | INFO     | src.policies:collect_trajectories:213 - Episode 2750
2021-08-26 22:56:35.161 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-0

2021-08-26 22:56:35.619 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:35.620 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 59.0
2021-08-26 22:56:35.621 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 59.0
2021-08-26 22:56:35.622 | INFO     | src.policies:collect_trajectories:213 - Episode 2761
2021-08-26 22:56:35.640 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:35.641 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 24.0
2021-08-26 22:56:35.643 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 41.5
2021-08-26 22:56:35.645 | INFO     | src.policies:collect_trajectories:213 - Episode 2762
2021-08-26 22:56:35.677 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:35.678 | INFO     | src.policies:collect_trajectories:229 -

2021-08-26 22:56:36.131 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:36.132 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 50.0
2021-08-26 22:56:36.133 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 50.0
2021-08-26 22:56:36.134 | INFO     | src.policies:collect_trajectories:213 - Episode 2773
2021-08-26 22:56:36.158 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:36.160 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 38.0
2021-08-26 22:56:36.161 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 44.0
2021-08-26 22:56:36.162 | INFO     | src.policies:collect_trajectories:213 - Episode 2774
2021-08-26 22:56:36.198 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:36.199 | INFO     | src.policies:collect_trajectories:229 -

2021-08-26 22:56:36.606 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.05448506027460098
2021-08-26 22:56:36.610 | INFO     | src.policies:train:116 - Epoch 325 / 800
2021-08-26 22:56:36.611 | INFO     | src.policies:collect_trajectories:213 - Episode 2785
2021-08-26 22:56:36.625 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:36.626 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 21.0
2021-08-26 22:56:36.627 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 21.0
2021-08-26 22:56:36.628 | INFO     | src.policies:collect_trajectories:213 - Episode 2786
2021-08-26 22:56:36.700 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:36.702 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 118.0
2021-08-26 22:56:36.703 | INFO     | src.policies:collect_trajectories:

2021-08-26 22:56:37.269 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 60.0
2021-08-26 22:56:37.270 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 65.66666666666667
2021-08-26 22:56:37.271 | INFO     | src.policies:collect_trajectories:213 - Episode 2797
2021-08-26 22:56:37.287 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:37.288 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 21.0
2021-08-26 22:56:37.289 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 54.5
2021-08-26 22:56:37.296 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:37.299 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2937665581703186
2021-08-26 22:56:37.302 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3993516266345978
2021-08-26 22:56:37.304 | INFO     | src.policies:mi

2021-08-26 22:56:37.903 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3186514377593994
2021-08-26 22:56:37.905 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.05596983805298805
2021-08-26 22:56:37.908 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.3186514377593994
2021-08-26 22:56:37.911 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.05596983805298805
2021-08-26 22:56:37.915 | INFO     | src.policies:train:116 - Epoch 330 / 800
2021-08-26 22:56:37.916 | INFO     | src.policies:collect_trajectories:213 - Episode 2806
2021-08-26 22:56:37.985 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:37.986 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 118.0
2021-08-26 22:56:37.987 | INFO     | src.policies:collect_trajectories:230 - Last 100 ep

2021-08-26 22:56:38.440 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 63.0
2021-08-26 22:56:38.441 | INFO     | src.policies:collect_trajectories:213 - Episode 2815
2021-08-26 22:56:38.473 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:38.475 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 50.0
2021-08-26 22:56:38.476 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 56.5
2021-08-26 22:56:38.477 | INFO     | src.policies:collect_trajectories:213 - Episode 2816
2021-08-26 22:56:38.510 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:38.511 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 50.0
2021-08-26 22:56:38.512 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 54.333333333333336
2021-08-26 22:56:38.513 | INFO     | src.policies:col

2021-08-26 22:56:39.066 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 58.333333333333336
2021-08-26 22:56:39.067 | INFO     | src.policies:collect_trajectories:213 - Episode 2827
2021-08-26 22:56:39.100 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:39.102 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 56.0
2021-08-26 22:56:39.103 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 57.75
2021-08-26 22:56:39.111 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:39.116 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3041873574256897
2021-08-26 22:56:39.119 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.0997590720653534
2021-08-26 22:56:39.121 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.051122792065143585
2021-08-26 22:56:39.124 

2021-08-26 22:56:39.556 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 45.6
2021-08-26 22:56:39.564 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:39.567 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.27309709787368774
2021-08-26 22:56:39.571 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3662108778953552
2021-08-26 22:56:39.573 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.05117981135845184
2021-08-26 22:56:39.575 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.3662108778953552
2021-08-26 22:56:39.578 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.05117981135845184
2021-08-26 22:56:39.581 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:56:39.584 | INFO     | src.policies:minibatch_update:270 - Total loss: -0

2021-08-26 22:56:40.147 | INFO     | src.policies:collect_trajectories:213 - Episode 2847
2021-08-26 22:56:40.160 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:40.161 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 16.0
2021-08-26 22:56:40.162 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 40.5
2021-08-26 22:56:40.163 | INFO     | src.policies:collect_trajectories:213 - Episode 2848
2021-08-26 22:56:40.251 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:40.252 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 146.0
2021-08-26 22:56:40.253 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 61.6
2021-08-26 22:56:40.261 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:56:40.264 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.316342532634

2021-08-26 22:56:40.711 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.04260454326868057
2021-08-26 22:56:40.713 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.05124981701374054
2021-08-26 22:56:40.715 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.04260454326868057
2021-08-26 22:56:40.718 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.05124981701374054
2021-08-26 22:56:40.722 | INFO     | src.policies:train:116 - Epoch 342 / 800
2021-08-26 22:56:40.723 | INFO     | src.policies:collect_trajectories:213 - Episode 2855
2021-08-26 22:56:40.766 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:40.767 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 68.0
2021-08-26 22:56:40.768 | INFO     | src.policies:collect_trajectories:230 - Last 100 e

2021-08-26 22:56:41.448 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.051002755761146545
2021-08-26 22:56:41.451 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:56:41.455 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.28135767579078674
2021-08-26 22:56:41.458 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.473884642124176
2021-08-26 22:56:41.461 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.05082181468605995
2021-08-26 22:56:41.464 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.473884642124176
2021-08-26 22:56:41.466 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.05082181468605995
2021-08-26 22:56:41.470 | INFO     | src.policies:train:116 - Epoch 345 / 800
2021-08-26 22:56:41.471 | INFO     | src.policies:collect_t

2021-08-26 22:56:41.994 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.05074350908398628
2021-08-26 22:56:41.996 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.44172289967536926
2021-08-26 22:56:41.999 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.05074350908398628
2021-08-26 22:56:42.002 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:56:42.005 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2851710021495819
2021-08-26 22:56:42.009 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.19231681525707245
2021-08-26 22:56:42.011 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.049839913845062256
2021-08-26 22:56:42.014 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.19231681525707245
2021-08-

2021-08-26 22:56:42.604 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:42.607 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.27793702483177185
2021-08-26 22:56:42.611 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.13417086005210876
2021-08-26 22:56:42.613 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.04649943485856056
2021-08-26 22:56:42.615 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.13417086005210876
2021-08-26 22:56:42.618 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.04649943485856056
2021-08-26 22:56:42.621 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:56:42.624 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.28171223402023315
2021-08-26 22:56:42.628 | INFO     | src.policies:minibatch_update:277 - Policy network L2 g

2021-08-26 22:56:43.133 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.04722391441464424
2021-08-26 22:56:43.137 | INFO     | src.policies:train:116 - Epoch 352 / 800
2021-08-26 22:56:43.138 | INFO     | src.policies:collect_trajectories:213 - Episode 2893
2021-08-26 22:56:43.246 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:43.247 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 173.0
2021-08-26 22:56:43.248 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 173.0
2021-08-26 22:56:43.249 | INFO     | src.policies:collect_trajectories:213 - Episode 2894
2021-08-26 22:56:43.285 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:43.286 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 58.0
2021-08-26 22:56:43.287 | INFO     | src.policies:collect_trajectories

2021-08-26 22:56:43.753 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.04830154776573181
2021-08-26 22:56:43.757 | INFO     | src.policies:train:116 - Epoch 355 / 800
2021-08-26 22:56:43.758 | INFO     | src.policies:collect_trajectories:213 - Episode 2901
2021-08-26 22:56:43.812 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:43.814 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 94.0
2021-08-26 22:56:43.814 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 94.0
2021-08-26 22:56:43.815 | INFO     | src.policies:collect_trajectories:213 - Episode 2902
2021-08-26 22:56:43.845 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:43.846 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 46.0
2021-08-26 22:56:43.847 | INFO     | src.policies:collect_trajectories:2

2021-08-26 22:56:44.460 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 26.0
2021-08-26 22:56:44.461 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 51.5
2021-08-26 22:56:44.468 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:44.471 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3007268011569977
2021-08-26 22:56:44.474 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.12195339798927307
2021-08-26 22:56:44.476 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.04527238383889198
2021-08-26 22:56:44.479 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.12195339798927307
2021-08-26 22:56:44.482 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.04527238383889198
2021-08-26 22:56:44.485 | INFO     | src.policies:train:152 - 

2021-08-26 22:56:45.103 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 40.5
2021-08-26 22:56:45.104 | INFO     | src.policies:collect_trajectories:213 - Episode 2921
2021-08-26 22:56:45.223 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:45.225 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:56:45.225 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 93.66666666666667
2021-08-26 22:56:45.232 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:45.235 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30030542612075806
2021-08-26 22:56:45.239 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.27875328063964844
2021-08-26 22:56:45.240 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.046880945563316345
2021-08-26 22:56:45.243

2021-08-26 22:56:45.743 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.047680020332336426
2021-08-26 22:56:45.746 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.10681883990764618
2021-08-26 22:56:45.748 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.047680020332336426
2021-08-26 22:56:45.752 | INFO     | src.policies:train:116 - Epoch 363 / 800
2021-08-26 22:56:45.753 | INFO     | src.policies:collect_trajectories:213 - Episode 2930
2021-08-26 22:56:45.774 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:45.775 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 32.0
2021-08-26 22:56:45.776 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 32.0
2021-08-26 22:56:45.777 | INFO     | src.policies:collect_trajectories:213 - Episode 2931
2021-08-

2021-08-26 22:56:46.523 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:46.524 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 99.0
2021-08-26 22:56:46.525 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 143.0
2021-08-26 22:56:46.531 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:46.535 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.27876150608062744
2021-08-26 22:56:46.538 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.391166090965271
2021-08-26 22:56:46.540 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.04642132297158241
2021-08-26 22:56:46.543 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.391166090965271
2021-08-26 22:56:46.546 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient n

2021-08-26 22:56:47.059 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 117.0
2021-08-26 22:56:47.060 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 117.0
2021-08-26 22:56:47.061 | INFO     | src.policies:collect_trajectories:213 - Episode 2946
2021-08-26 22:56:47.078 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:47.079 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 23.0
2021-08-26 22:56:47.079 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 70.0
2021-08-26 22:56:47.080 | INFO     | src.policies:collect_trajectories:213 - Episode 2947
2021-08-26 22:56:47.181 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:47.182 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 70.0
2021-08-26 22:56:47.183 | INFO     | src.policies:collect_trajectories:230 

2021-08-26 22:56:47.649 | INFO     | src.policies:train:116 - Epoch 372 / 800
2021-08-26 22:56:47.650 | INFO     | src.policies:collect_trajectories:213 - Episode 2954
2021-08-26 22:56:47.669 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:47.670 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 29.0
2021-08-26 22:56:47.671 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 29.0
2021-08-26 22:56:47.672 | INFO     | src.policies:collect_trajectories:213 - Episode 2955
2021-08-26 22:56:47.723 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:47.724 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 84.0
2021-08-26 22:56:47.725 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 56.5
2021-08-26 22:56:47.726 | INFO     | src.policies:collect_trajectories:213 - Episode 2956

2021-08-26 22:56:48.324 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:56:48.328 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.29743319749832153
2021-08-26 22:56:48.332 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.15951327979564667
2021-08-26 22:56:48.334 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.04222971946001053
2021-08-26 22:56:48.338 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.15951327979564667
2021-08-26 22:56:48.340 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.04222971946001053
2021-08-26 22:56:48.344 | INFO     | src.policies:train:116 - Epoch 375 / 800
2021-08-26 22:56:48.345 | INFO     | src.policies:collect_trajectories:213 - Episode 2964
2021-08-26 22:56:48.354 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:56:48.948 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2759862542152405
2021-08-26 22:56:48.954 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2612268626689911
2021-08-26 22:56:48.956 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.043827082961797714
2021-08-26 22:56:48.960 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2612268626689911
2021-08-26 22:56:48.963 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.043827082961797714
2021-08-26 22:56:48.966 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:56:48.970 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.28075966238975525
2021-08-26 22:56:48.973 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.5064104795455933
2021-08-26 22:56:48.976 | INFO     | src.polici

2021-08-26 22:56:49.484 | INFO     | src.policies:train:116 - Epoch 380 / 800
2021-08-26 22:56:49.486 | INFO     | src.policies:collect_trajectories:213 - Episode 2983
2021-08-26 22:56:49.532 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:49.533 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 72.0
2021-08-26 22:56:49.534 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 72.0
2021-08-26 22:56:49.535 | INFO     | src.policies:collect_trajectories:213 - Episode 2984
2021-08-26 22:56:49.560 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:49.561 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 38.0
2021-08-26 22:56:49.562 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 55.0
2021-08-26 22:56:49.563 | INFO     | src.policies:collect_trajectories:213 - Episode 2985

2021-08-26 22:56:50.104 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2756487727165222
2021-08-26 22:56:50.107 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.22352895140647888
2021-08-26 22:56:50.109 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.0414997898042202
2021-08-26 22:56:50.112 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.22352895140647888
2021-08-26 22:56:50.114 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0414997898042202
2021-08-26 22:56:50.118 | INFO     | src.policies:train:116 - Epoch 383 / 800
2021-08-26 22:56:50.119 | INFO     | src.policies:collect_trajectories:213 - Episode 2993
2021-08-26 22:56:50.166 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:50.167 | INFO     | src.policies:collect_trajectories:229 - Mean episo

2021-08-26 22:56:50.670 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:56:50.673 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2844807505607605
2021-08-26 22:56:50.676 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.4753780663013458
2021-08-26 22:56:50.678 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.04187894985079765
2021-08-26 22:56:50.680 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4753780663013458
2021-08-26 22:56:50.683 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.04187894985079765
2021-08-26 22:56:50.686 | INFO     | src.policies:train:116 - Epoch 386 / 800
2021-08-26 22:56:50.687 | INFO     | src.policies:collect_trajectories:213 - Episode 3002
2021-08-26 22:56:50.888 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:56:51.473 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:51.475 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2855741083621979
2021-08-26 22:56:51.479 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.152935191988945
2021-08-26 22:56:51.481 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.04129117354750633
2021-08-26 22:56:51.483 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.152935191988945
2021-08-26 22:56:51.486 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.04129117354750633
2021-08-26 22:56:51.489 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:56:51.492 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2681371569633484
2021-08-26 22:56:51.495 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradien

2021-08-26 22:56:52.064 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.22962026298046112
2021-08-26 22:56:52.067 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.04055247828364372
2021-08-26 22:56:52.070 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:56:52.073 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.27893516421318054
2021-08-26 22:56:52.077 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.18553180992603302
2021-08-26 22:56:52.079 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.04069621488451958
2021-08-26 22:56:52.081 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.18553180992603302
2021-08-26 22:56:52.084 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.04069621488

2021-08-26 22:56:52.571 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 19.0
2021-08-26 22:56:52.571 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 19.0
2021-08-26 22:56:52.572 | INFO     | src.policies:collect_trajectories:213 - Episode 3029
2021-08-26 22:56:52.598 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:52.599 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 41.0
2021-08-26 22:56:52.600 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 30.0
2021-08-26 22:56:52.601 | INFO     | src.policies:collect_trajectories:213 - Episode 3030
2021-08-26 22:56:52.623 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:52.624 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 32.0
2021-08-26 22:56:52.625 | INFO     | src.policies:collect_trajectories:230 - 

2021-08-26 22:56:53.333 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.03999787196516991
2021-08-26 22:56:53.337 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:56:53.340 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2922423481941223
2021-08-26 22:56:53.343 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.11286705732345581
2021-08-26 22:56:53.345 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.039202600717544556
2021-08-26 22:56:53.348 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.11286705732345581
2021-08-26 22:56:53.350 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.039202600717544556
2021-08-26 22:56:53.353 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:56:53.356 | INFO     | src.policies:mini

2021-08-26 22:56:53.899 | INFO     | src.policies:collect_trajectories:213 - Episode 3047
2021-08-26 22:56:53.916 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:53.917 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 24.0
2021-08-26 22:56:53.918 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 55.0
2021-08-26 22:56:53.919 | INFO     | src.policies:collect_trajectories:213 - Episode 3048
2021-08-26 22:56:53.945 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:53.946 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 38.0
2021-08-26 22:56:53.947 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 50.75
2021-08-26 22:56:53.954 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:53.958 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.302822589874

2021-08-26 22:56:54.484 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.07360643148422241
2021-08-26 22:56:54.487 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.03940258547663689
2021-08-26 22:56:54.491 | INFO     | src.policies:train:116 - Epoch 402 / 800
2021-08-26 22:56:54.492 | INFO     | src.policies:collect_trajectories:213 - Episode 3056
2021-08-26 22:56:54.518 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:54.519 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 43.0
2021-08-26 22:56:54.520 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 43.0
2021-08-26 22:56:54.521 | INFO     | src.policies:collect_trajectories:213 - Episode 3057
2021-08-26 22:56:54.567 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:54.568 | INFO    

2021-08-26 22:56:55.231 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.499999076128006
2021-08-26 22:56:55.234 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.03736886382102966
2021-08-26 22:56:55.237 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:56:55.240 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2909526526927948
2021-08-26 22:56:55.244 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3089672923088074
2021-08-26 22:56:55.246 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03802939131855965
2021-08-26 22:56:55.248 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.3089672923088074
2021-08-26 22:56:55.251 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0380293913185596

2021-08-26 22:56:55.866 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 98.5
2021-08-26 22:56:55.867 | INFO     | src.policies:collect_trajectories:213 - Episode 3074
2021-08-26 22:56:55.899 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:55.900 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 46.0
2021-08-26 22:56:55.901 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 81.0
2021-08-26 22:56:55.906 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:55.911 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2974347174167633
2021-08-26 22:56:55.914 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.4785585105419159
2021-08-26 22:56:55.917 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03655194863677025
2021-08-26 22:56:55.919 | INFO     | src

2021-08-26 22:56:56.494 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:56:56.497 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.29525384306907654
2021-08-26 22:56:56.500 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.6069056987762451
2021-08-26 22:56:56.502 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03664113208651543
2021-08-26 22:56:56.505 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999910593032837
2021-08-26 22:56:56.508 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.03664113208651543
2021-08-26 22:56:56.511 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:56:56.514 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30516791343688965
2021-08-26 22:56:56.517 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gr

2021-08-26 22:56:57.017 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03687359765172005
2021-08-26 22:56:57.020 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4999992251396179
2021-08-26 22:56:57.023 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.03687359765172005
2021-08-26 22:56:57.026 | INFO     | src.policies:train:116 - Epoch 412 / 800
2021-08-26 22:56:57.027 | INFO     | src.policies:collect_trajectories:213 - Episode 3090
2021-08-26 22:56:57.041 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:57.042 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 19.0
2021-08-26 22:56:57.043 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 19.0
2021-08-26 22:56:57.044 | INFO     | src.policies:collect_trajectories:213 - Episode 3091

2021-08-26 22:56:57.779 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:56:57.782 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30316412448883057
2021-08-26 22:56:57.785 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.08191973716020584
2021-08-26 22:56:57.787 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03520447760820389
2021-08-26 22:56:57.789 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.08191973716020584
2021-08-26 22:56:57.792 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.03520447760820389
2021-08-26 22:56:57.796 | INFO     | src.policies:train:116 - Epoch 415 / 800
2021-08-26 22:56:57.797 | INFO     | src.policies:collect_trajectories:213 - Episode 3099
2021-08-26 22:56:57.822 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:56:58.256 | INFO     | src.policies:collect_trajectories:213 - Episode 3109
2021-08-26 22:56:58.340 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:58.342 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 146.0
2021-08-26 22:56:58.342 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 146.0
2021-08-26 22:56:58.343 | INFO     | src.policies:collect_trajectories:213 - Episode 3110
2021-08-26 22:56:58.488 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:58.489 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 177.0
2021-08-26 22:56:58.490 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 161.5
2021-08-26 22:56:58.497 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:56:58.500 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.285985231

2021-08-26 22:56:58.921 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.036091987043619156
2021-08-26 22:56:58.925 | INFO     | src.policies:train:116 - Epoch 420 / 800
2021-08-26 22:56:58.926 | INFO     | src.policies:collect_trajectories:213 - Episode 3116
2021-08-26 22:56:59.001 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:59.003 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 48.0
2021-08-26 22:56:59.005 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 48.0
2021-08-26 22:56:59.007 | INFO     | src.policies:collect_trajectories:213 - Episode 3117
2021-08-26 22:56:59.031 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:56:59.032 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 26.0
2021-08-26 22:56:59.034 | INFO     | src.policies:collect_trajectories:

2021-08-26 22:56:59.657 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 45.2
2021-08-26 22:56:59.665 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:56:59.668 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.29255911707878113
2021-08-26 22:56:59.671 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.11349646002054214
2021-08-26 22:56:59.674 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03543896973133087
2021-08-26 22:56:59.676 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.11349646002054214
2021-08-26 22:56:59.679 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.03543896973133087
2021-08-26 22:56:59.682 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:56:59.685 | INFO     | src.policies:minibatch_update:270 - Total loss: 

2021-08-26 22:57:00.298 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03534034267067909
2021-08-26 22:57:00.301 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.19083882868289948
2021-08-26 22:57:00.303 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.03534034267067909
2021-08-26 22:57:00.308 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:57:00.312 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30364617705345154
2021-08-26 22:57:00.316 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1503119021654129
2021-08-26 22:57:00.319 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.034402862191200256
2021-08-26 22:57:00.323 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1503119021654129

2021-08-26 22:57:00.939 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 32.0
2021-08-26 22:57:00.940 | INFO     | src.policies:collect_trajectories:213 - Episode 3139
2021-08-26 22:57:01.043 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:01.045 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 174.0
2021-08-26 22:57:01.045 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 103.0
2021-08-26 22:57:01.052 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:01.055 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3088871240615845
2021-08-26 22:57:01.058 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.08830227702856064
2021-08-26 22:57:01.061 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.033721327781677246
2021-08-26 22:57:01.063 | INFO     |

2021-08-26 22:57:01.611 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2608410120010376
2021-08-26 22:57:01.615 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.5751303434371948
2021-08-26 22:57:01.617 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03411950543522835
2021-08-26 22:57:01.619 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4999990463256836
2021-08-26 22:57:01.622 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.03411950543522835
2021-08-26 22:57:01.626 | INFO     | src.policies:train:116 - Epoch 431 / 800
2021-08-26 22:57:01.627 | INFO     | src.policies:collect_trajectories:213 - Episode 3147
2021-08-26 22:57:01.697 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:01.698 | INFO     | src.policies:collect_trajectories:229 - Mean episo

2021-08-26 22:57:02.276 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 83.5
2021-08-26 22:57:02.278 | INFO     | src.policies:collect_trajectories:213 - Episode 3156
2021-08-26 22:57:02.316 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:02.318 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 54.0
2021-08-26 22:57:02.318 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 73.66666666666667
2021-08-26 22:57:02.325 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:02.329 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3028586506843567
2021-08-26 22:57:02.332 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.08762990683317184
2021-08-26 22:57:02.334 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.032312702387571335
2021-08-26 22:57:02.337 |

2021-08-26 22:57:02.789 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03227188065648079
2021-08-26 22:57:02.791 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.13572204113006592
2021-08-26 22:57:02.794 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.03227188065648079
2021-08-26 22:57:02.798 | INFO     | src.policies:train:116 - Epoch 436 / 800
2021-08-26 22:57:02.799 | INFO     | src.policies:collect_trajectories:213 - Episode 3165
2021-08-26 22:57:02.843 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:02.844 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 74.0
2021-08-26 22:57:02.845 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 74.0
2021-08-26 22:57:02.846 | INFO     | src.policies:collect_trajectories:213 - Episode 3166

2021-08-26 22:57:03.513 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03273715823888779
2021-08-26 22:57:03.515 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.29616406559944153
2021-08-26 22:57:03.518 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.03273715823888779
2021-08-26 22:57:03.521 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:57:03.524 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3181101977825165
2021-08-26 22:57:03.527 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.4398089051246643
2021-08-26 22:57:03.529 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03165533393621445
2021-08-26 22:57:03.532 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4398089051246643

2021-08-26 22:57:04.179 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.03188670426607132
2021-08-26 22:57:04.182 | INFO     | src.policies:train:116 - Epoch 441 / 800
2021-08-26 22:57:04.183 | INFO     | src.policies:collect_trajectories:213 - Episode 3183
2021-08-26 22:57:04.300 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:04.302 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:57:04.302 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:57:04.307 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:04.311 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2913825213909149
2021-08-26 22:57:04.314 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.47692689299583435
2021-08-26 22:57:04.316 | INFO     | src.policies:minibat

2021-08-26 22:57:04.913 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.031376518309116364
2021-08-26 22:57:04.916 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2940537631511688
2021-08-26 22:57:04.918 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.031376518309116364
2021-08-26 22:57:04.922 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:57:04.925 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3132569193840027
2021-08-26 22:57:04.928 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1425091177225113
2021-08-26 22:57:04.930 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.030239511281251907
2021-08-26 22:57:04.932 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1425091177225113

2021-08-26 22:57:05.492 | INFO     | src.policies:collect_trajectories:213 - Episode 3196
2021-08-26 22:57:05.586 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:05.587 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 102.0
2021-08-26 22:57:05.588 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 67.5
2021-08-26 22:57:05.595 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:05.598 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.28771087527275085
2021-08-26 22:57:05.601 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.368285208940506
2021-08-26 22:57:05.603 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03159444034099579
2021-08-26 22:57:05.605 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.368285208940506

2021-08-26 22:57:06.205 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:06.208 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2799498736858368
2021-08-26 22:57:06.211 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3520474135875702
2021-08-26 22:57:06.213 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.030566992238163948
2021-08-26 22:57:06.215 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.3520474135875702
2021-08-26 22:57:06.218 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.030566992238163948
2021-08-26 22:57:06.221 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:57:06.223 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.29908740520477295
2021-08-26 22:57:06.227 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gr

2021-08-26 22:57:06.831 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 77.0
2021-08-26 22:57:06.837 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:06.841 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.31482458114624023
2021-08-26 22:57:06.843 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.24747535586357117
2021-08-26 22:57:06.845 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.029468979686498642
2021-08-26 22:57:06.848 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.24747535586357117
2021-08-26 22:57:06.850 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.029468979686498642
2021-08-26 22:57:06.853 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:57:06.856 | INFO     | src.policies:minibatch_update:270 - Total loss

2021-08-26 22:57:07.418 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:07.422 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.26618191599845886
2021-08-26 22:57:07.425 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1413114368915558
2021-08-26 22:57:07.427 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.030818039551377296
2021-08-26 22:57:07.429 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1413114368915558
2021-08-26 22:57:07.432 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.030818039551377296
2021-08-26 22:57:07.435 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:57:07.438 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.29622673988342285
2021-08-26 22:57:07.442 | INFO     | src.policies:minibatch_update:277 - Policy network L2 g

2021-08-26 22:57:08.122 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.03031815029680729
2021-08-26 22:57:08.126 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:57:08.129 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3010702431201935
2021-08-26 22:57:08.132 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3054133951663971
2021-08-26 22:57:08.134 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03099871426820755
2021-08-26 22:57:08.136 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.3054133951663971
2021-08-26 22:57:08.139 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.03099871426820755
2021-08-26 22:57:08.142 | INFO     | src.policies:train:116 - Epoch 460 / 800
2021-08-26 22:57:08.143 | INFO     | src.policies:collect_t

2021-08-26 22:57:08.865 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:08.866 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 113.0
2021-08-26 22:57:08.867 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 122.5
2021-08-26 22:57:08.872 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:08.876 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2715616524219513
2021-08-26 22:57:08.879 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.21564476191997528
2021-08-26 22:57:08.881 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.03074672259390354
2021-08-26 22:57:08.884 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.21564476191997528
2021-08-26 22:57:08.886 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradie

2021-08-26 22:57:09.442 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 58.25
2021-08-26 22:57:09.448 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:09.452 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.32373449206352234
2021-08-26 22:57:09.455 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.20744645595550537
2021-08-26 22:57:09.457 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.027667690068483353
2021-08-26 22:57:09.460 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.20744645595550537
2021-08-26 22:57:09.462 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.027667690068483353
2021-08-26 22:57:09.465 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:57:09.468 | INFO     | src.policies:minibatch_update:270 - Total los

2021-08-26 22:57:10.140 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999919533729553
2021-08-26 22:57:10.143 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.029139775782823563
2021-08-26 22:57:10.146 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:57:10.148 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2973706126213074
2021-08-26 22:57:10.151 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.15139274299144745
2021-08-26 22:57:10.153 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.029051240533590317
2021-08-26 22:57:10.155 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.15139274299144745
2021-08-26 22:57:10.158 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0290512405

2021-08-26 22:57:10.774 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 53.0
2021-08-26 22:57:10.775 | INFO     | src.policies:collect_trajectories:213 - Episode 3250
2021-08-26 22:57:10.798 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:10.799 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 35.0
2021-08-26 22:57:10.800 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 44.0
2021-08-26 22:57:10.801 | INFO     | src.policies:collect_trajectories:213 - Episode 3251
2021-08-26 22:57:10.838 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:10.839 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 59.0
2021-08-26 22:57:10.840 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 49.0
2021-08-26 22:57:10.841 | INFO     | src.policies:collect_trajector

2021-08-26 22:57:11.473 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.02800358273088932
2021-08-26 22:57:11.476 | INFO     | src.policies:train:116 - Epoch 475 / 800
2021-08-26 22:57:11.477 | INFO     | src.policies:collect_trajectories:213 - Episode 3257
2021-08-26 22:57:11.498 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:11.499 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 32.0
2021-08-26 22:57:11.500 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 32.0
2021-08-26 22:57:11.501 | INFO     | src.policies:collect_trajectories:213 - Episode 3258
2021-08-26 22:57:11.565 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:11.567 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 116.0
2021-08-26 22:57:11.573 | INFO     | src.policies:collect_trajectories:

2021-08-26 22:57:12.320 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 1.1034870147705078
2021-08-26 22:57:12.322 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.027741141617298126
2021-08-26 22:57:12.325 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999961256980896
2021-08-26 22:57:12.328 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.027741141617298126
2021-08-26 22:57:12.331 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:57:12.334 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2825361490249634
2021-08-26 22:57:12.337 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.6805127263069153
2021-08-26 22:57:12.339 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.02759338915348053

2021-08-26 22:57:12.935 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2911072373390198
2021-08-26 22:57:12.939 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.662714958190918
2021-08-26 22:57:12.941 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.028106411918997765
2021-08-26 22:57:12.943 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4999992251396179
2021-08-26 22:57:12.945 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.028106411918997765
2021-08-26 22:57:12.948 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:57:12.952 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30190181732177734
2021-08-26 22:57:12.955 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.46041974425315857
2021-08-26 22:57:12.956 | INFO     | src.polici

2021-08-26 22:57:13.498 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:13.502 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.27293238043785095
2021-08-26 22:57:13.505 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.41259855031967163
2021-08-26 22:57:13.508 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.02798188105225563
2021-08-26 22:57:13.510 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.41259855031967163
2021-08-26 22:57:13.513 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.02798188105225563
2021-08-26 22:57:13.518 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:57:13.522 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2942679524421692
2021-08-26 22:57:13.525 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gr

2021-08-26 22:57:14.099 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.026222392916679382
2021-08-26 22:57:14.102 | INFO     | src.policies:train:116 - Epoch 486 / 800
2021-08-26 22:57:14.103 | INFO     | src.policies:collect_trajectories:213 - Episode 3287
2021-08-26 22:57:14.210 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:14.211 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 194.0
2021-08-26 22:57:14.212 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 194.0
2021-08-26 22:57:14.213 | INFO     | src.policies:collect_trajectories:213 - Episode 3288
2021-08-26 22:57:14.235 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:14.236 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 34.0
2021-08-26 22:57:14.236 | INFO     | src.policies:collect_trajectorie

2021-08-26 22:57:14.848 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.025492828339338303
2021-08-26 22:57:14.851 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.10165286809206009
2021-08-26 22:57:14.853 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.025492828339338303
2021-08-26 22:57:14.856 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:57:14.859 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30547165870666504
2021-08-26 22:57:14.862 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1835535168647766
2021-08-26 22:57:14.864 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.026748111471533775
2021-08-26 22:57:14.866 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1835535168647766

2021-08-26 22:57:15.462 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 76.0
2021-08-26 22:57:15.463 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 73.0
2021-08-26 22:57:15.468 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:15.472 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.28878405690193176
2021-08-26 22:57:15.475 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.22805002331733704
2021-08-26 22:57:15.477 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.026622114703059196
2021-08-26 22:57:15.480 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.22805002331733704
2021-08-26 22:57:15.483 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.026622114703059196
2021-08-26 22:57:15.486 | INFO     | src.policies:train:152

2021-08-26 22:57:16.071 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 48.0
2021-08-26 22:57:16.072 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 66.66666666666667
2021-08-26 22:57:16.076 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:16.079 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3085852563381195
2021-08-26 22:57:16.082 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.34480059146881104
2021-08-26 22:57:16.084 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.02542964555323124
2021-08-26 22:57:16.087 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.34480059146881104
2021-08-26 22:57:16.090 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.02542964555323124
2021-08-26 22:57:16.093 | INFO     | src.policies

2021-08-26 22:57:16.721 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:57:16.724 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2889276146888733
2021-08-26 22:57:16.728 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.21326303482055664
2021-08-26 22:57:16.730 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.025123389437794685
2021-08-26 22:57:16.733 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.21326303482055664
2021-08-26 22:57:16.736 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.025123389437794685
2021-08-26 22:57:16.739 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:57:16.742 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3019549250602722
2021-08-26 22:57:16.745 | INFO     | src.policies:minibatch_update:277 - Policy network L2 g

2021-08-26 22:57:17.377 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.0257832370698452
2021-08-26 22:57:17.379 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.06632621586322784
2021-08-26 22:57:17.382 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0257832370698452
2021-08-26 22:57:17.385 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:57:17.388 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2727995812892914
2021-08-26 22:57:17.392 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.23665165901184082
2021-08-26 22:57:17.395 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.02615041844546795
2021-08-26 22:57:17.399 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.23665165901184082

2021-08-26 22:57:18.099 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 107.0
2021-08-26 22:57:18.099 | INFO     | src.policies:collect_trajectories:213 - Episode 3328
2021-08-26 22:57:18.123 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:18.124 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 40.0
2021-08-26 22:57:18.125 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 73.5
2021-08-26 22:57:18.126 | INFO     | src.policies:collect_trajectories:213 - Episode 3329
2021-08-26 22:57:18.204 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:18.205 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 76.0
2021-08-26 22:57:18.206 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 74.33333333333333
2021-08-26 22:57:18.214 | INFO     | src.policies:tra

2021-08-26 22:57:18.605 | INFO     | src.policies:collect_trajectories:213 - Episode 3336
2021-08-26 22:57:18.635 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:18.636 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 48.0
2021-08-26 22:57:18.637 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 28.0
2021-08-26 22:57:18.638 | INFO     | src.policies:collect_trajectories:213 - Episode 3337
2021-08-26 22:57:18.756 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:18.757 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 130.0
2021-08-26 22:57:18.758 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 53.5
2021-08-26 22:57:18.765 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:18.768 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.316569775342

2021-08-26 22:57:19.294 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:19.296 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 71.0
2021-08-26 22:57:19.296 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 47.0
2021-08-26 22:57:19.297 | INFO     | src.policies:collect_trajectories:213 - Episode 3345
2021-08-26 22:57:19.329 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:19.330 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 52.0
2021-08-26 22:57:19.331 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 48.666666666666664
2021-08-26 22:57:19.332 | INFO     | src.policies:collect_trajectories:213 - Episode 3346
2021-08-26 22:57:19.369 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:19.370 | INFO     | src.policies:collect_traj

2021-08-26 22:57:20.028 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2671276926994324
2021-08-26 22:57:20.031 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.02272876538336277
2021-08-26 22:57:20.033 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2671276926994324
2021-08-26 22:57:20.036 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.02272876538336277
2021-08-26 22:57:20.039 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:57:20.043 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2948514223098755
2021-08-26 22:57:20.046 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.425786554813385
2021-08-26 22:57:20.048 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.024654198437929153

2021-08-26 22:57:20.546 | INFO     | src.policies:collect_trajectories:213 - Episode 3361
2021-08-26 22:57:20.592 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:20.594 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 82.0
2021-08-26 22:57:20.594 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 90.33333333333333
2021-08-26 22:57:20.600 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:20.603 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.29583197832107544
2021-08-26 22:57:20.607 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3055625259876251
2021-08-26 22:57:20.609 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.024001434445381165
2021-08-26 22:57:20.611 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.305562525987625

2021-08-26 22:57:21.197 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.024506622925400734
2021-08-26 22:57:21.199 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2549601197242737
2021-08-26 22:57:21.202 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.024506622925400734
2021-08-26 22:57:21.205 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:57:21.208 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3244853615760803
2021-08-26 22:57:21.211 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.0678793415427208
2021-08-26 22:57:21.213 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.023091893643140793
2021-08-26 22:57:21.215 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.0678793415427208

2021-08-26 22:57:21.769 | INFO     | src.policies:train:116 - Epoch 521 / 800
2021-08-26 22:57:21.770 | INFO     | src.policies:collect_trajectories:213 - Episode 3374
2021-08-26 22:57:21.778 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:21.779 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 14.0
2021-08-26 22:57:21.780 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 14.0
2021-08-26 22:57:21.781 | INFO     | src.policies:collect_trajectories:213 - Episode 3375
2021-08-26 22:57:21.826 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:21.827 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 83.0
2021-08-26 22:57:21.828 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 48.5
2021-08-26 22:57:21.829 | INFO     | src.policies:collect_trajectories:213 - Episode 3376

2021-08-26 22:57:22.433 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:57:22.435 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.32254159450531006
2021-08-26 22:57:22.439 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.15960480272769928
2021-08-26 22:57:22.440 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.022569645196199417
2021-08-26 22:57:22.443 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.15960480272769928
2021-08-26 22:57:22.445 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.022569645196199417
2021-08-26 22:57:22.449 | INFO     | src.policies:train:116 - Epoch 524 / 800
2021-08-26 22:57:22.450 | INFO     | src.policies:collect_trajectories:213 - Episode 3382
2021-08-26 22:57:22.588 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:57:23.119 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.40185078978538513
2021-08-26 22:57:23.121 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.02258380316197872
2021-08-26 22:57:23.124 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.40185078978538513
2021-08-26 22:57:23.126 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.02258380316197872
2021-08-26 22:57:23.129 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:57:23.132 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.27169719338417053
2021-08-26 22:57:23.135 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.11841149628162384
2021-08-26 22:57:23.136 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.023102832958102226

2021-08-26 22:57:23.703 | INFO     | src.policies:collect_trajectories:213 - Episode 3395
2021-08-26 22:57:23.741 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:23.742 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 63.0
2021-08-26 22:57:23.743 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 128.5
2021-08-26 22:57:23.749 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:23.753 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.293363481760025
2021-08-26 22:57:23.756 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.908274233341217
2021-08-26 22:57:23.758 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.02256372570991516
2021-08-26 22:57:23.760 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999940395355225

2021-08-26 22:57:24.321 | INFO     | src.policies:train:116 - Epoch 534 / 800
2021-08-26 22:57:24.322 | INFO     | src.policies:collect_trajectories:213 - Episode 3400
2021-08-26 22:57:24.395 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:24.397 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 127.0
2021-08-26 22:57:24.398 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 127.0
2021-08-26 22:57:24.398 | INFO     | src.policies:collect_trajectories:213 - Episode 3401
2021-08-26 22:57:24.502 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:24.503 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:57:24.504 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 163.5
2021-08-26 22:57:24.510 | INFO     | src.policies:train:152 - Mini-batch 1 / 3

2021-08-26 22:57:25.151 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.9728999733924866
2021-08-26 22:57:25.153 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.023003119975328445
2021-08-26 22:57:25.155 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4999995529651642
2021-08-26 22:57:25.158 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.023003119975328445
2021-08-26 22:57:25.161 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:57:25.163 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.29120147228240967
2021-08-26 22:57:25.166 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.6158120036125183
2021-08-26 22:57:25.168 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.022377952933311462

2021-08-26 22:57:25.789 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:25.790 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:57:25.791 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 119.5
2021-08-26 22:57:25.796 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:25.800 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.314378023147583
2021-08-26 22:57:25.836 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 1.1260391473770142
2021-08-26 22:57:25.838 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.021214617416262627
2021-08-26 22:57:25.841 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999964237213135
2021-08-26 22:57:25.844 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradien

2021-08-26 22:57:26.295 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 26.0
2021-08-26 22:57:26.296 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 26.0
2021-08-26 22:57:26.297 | INFO     | src.policies:collect_trajectories:213 - Episode 3420
2021-08-26 22:57:26.455 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:26.456 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:57:26.457 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 113.0
2021-08-26 22:57:26.464 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:26.467 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.31798848509788513
2021-08-26 22:57:26.471 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.23601093888282776
2021-08-26 22:57:26.474 | INFO     | src.policies:minibatch_u

2021-08-26 22:57:27.020 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2922179698944092
2021-08-26 22:57:27.023 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.021520815789699554
2021-08-26 22:57:27.026 | INFO     | src.policies:train:116 - Epoch 546 / 800
2021-08-26 22:57:27.027 | INFO     | src.policies:collect_trajectories:213 - Episode 3425
2021-08-26 22:57:27.037 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:27.038 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 16.0
2021-08-26 22:57:27.039 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 16.0
2021-08-26 22:57:27.040 | INFO     | src.policies:collect_trajectories:213 - Episode 3426
2021-08-26 22:57:27.142 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:27.143 | INFO    

2021-08-26 22:57:27.593 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3064151704311371
2021-08-26 22:57:27.596 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.547818124294281
2021-08-26 22:57:27.598 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.02138439193367958
2021-08-26 22:57:27.601 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4999990463256836
2021-08-26 22:57:27.604 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.02138439193367958
2021-08-26 22:57:27.607 | INFO     | src.policies:train:116 - Epoch 549 / 800
2021-08-26 22:57:27.608 | INFO     | src.policies:collect_trajectories:213 - Episode 3434
2021-08-26 22:57:27.673 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:27.675 | INFO     | src.policies:collect_trajectories:229 - Mean episod

2021-08-26 22:57:28.342 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:57:28.346 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:28.350 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30436310172080994
2021-08-26 22:57:28.354 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.629210889339447
2021-08-26 22:57:28.356 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.021040914580225945
2021-08-26 22:57:28.359 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999910593032837
2021-08-26 22:57:28.362 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.021040914580225945
2021-08-26 22:57:28.365 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:57:28.368 | INFO     | src.policies:minibatch_update:270 - Total loss:

2021-08-26 22:57:28.876 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.45392921566963196
2021-08-26 22:57:28.878 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.020969782024621964
2021-08-26 22:57:28.881 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.45392921566963196
2021-08-26 22:57:28.883 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.020969782024621964
2021-08-26 22:57:28.886 | INFO     | src.policies:train:116 - Epoch 555 / 800
2021-08-26 22:57:28.887 | INFO     | src.policies:collect_trajectories:213 - Episode 3447
2021-08-26 22:57:28.959 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:28.961 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 132.0
2021-08-26 22:57:28.962 | INFO     | src.policies:collect_trajectories:230 - Last 10

2021-08-26 22:57:29.548 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.020862765610218048
2021-08-26 22:57:29.550 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999916553497314
2021-08-26 22:57:29.553 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.020862765610218048
2021-08-26 22:57:29.556 | INFO     | src.policies:train:116 - Epoch 558 / 800
2021-08-26 22:57:29.558 | INFO     | src.policies:collect_trajectories:213 - Episode 3453
2021-08-26 22:57:29.569 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:29.571 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 18.0
2021-08-26 22:57:29.572 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 18.0
2021-08-26 22:57:29.574 | INFO     | src.policies:collect_trajectories:213 - Episode 3454

2021-08-26 22:57:30.206 | INFO     | src.policies:train:116 - Epoch 561 / 800
2021-08-26 22:57:30.207 | INFO     | src.policies:collect_trajectories:213 - Episode 3460
2021-08-26 22:57:30.300 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:30.301 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 88.0
2021-08-26 22:57:30.302 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 88.0
2021-08-26 22:57:30.303 | INFO     | src.policies:collect_trajectories:213 - Episode 3461
2021-08-26 22:57:30.346 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:30.347 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 80.0
2021-08-26 22:57:30.348 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 84.0
2021-08-26 22:57:30.349 | INFO     | src.policies:collect_trajectories:213 - Episode 3462

2021-08-26 22:57:30.916 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 81.0
2021-08-26 22:57:30.917 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 81.0
2021-08-26 22:57:30.918 | INFO     | src.policies:collect_trajectories:213 - Episode 3467
2021-08-26 22:57:31.015 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:31.016 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 187.0
2021-08-26 22:57:31.017 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 134.0
2021-08-26 22:57:31.022 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:31.026 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.27180910110473633
2021-08-26 22:57:31.029 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.8208596110343933
2021-08-26 22:57:31.031 | INFO     | src.policies:minibatch_up

2021-08-26 22:57:31.626 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:57:31.628 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.27587592601776123
2021-08-26 22:57:31.631 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 1.0056484937667847
2021-08-26 22:57:31.633 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.020833726972341537
2021-08-26 22:57:31.636 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999943375587463
2021-08-26 22:57:31.638 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.020833726972341537
2021-08-26 22:57:31.641 | INFO     | src.policies:train:116 - Epoch 567 / 800
2021-08-26 22:57:31.643 | INFO     | src.policies:collect_trajectories:213 - Episode 3473
2021-08-26 22:57:31.721 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:57:32.391 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4121323823928833
2021-08-26 22:57:32.394 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.01974847912788391
2021-08-26 22:57:32.396 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:57:32.399 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30555468797683716
2021-08-26 22:57:32.402 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.4607636034488678
2021-08-26 22:57:32.404 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.019965972751379013
2021-08-26 22:57:32.407 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4607636034488678
2021-08-26 22:57:32.409 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.0199659727513

2021-08-26 22:57:33.018 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:33.021 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2907657027244568
2021-08-26 22:57:33.024 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.4176085293292999
2021-08-26 22:57:33.026 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.020166080445051193
2021-08-26 22:57:33.029 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4176085293292999
2021-08-26 22:57:33.031 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.020166080445051193
2021-08-26 22:57:33.034 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:57:33.037 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2942068576812744
2021-08-26 22:57:33.039 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gra

2021-08-26 22:57:33.606 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:57:33.608 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.31853556632995605
2021-08-26 22:57:33.611 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.09841176122426987
2021-08-26 22:57:33.613 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.01929500512778759
2021-08-26 22:57:33.615 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.09841176122426987
2021-08-26 22:57:33.618 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.01929500512778759
2021-08-26 22:57:33.621 | INFO     | src.policies:train:116 - Epoch 576 / 800
2021-08-26 22:57:33.622 | INFO     | src.policies:collect_trajectories:213 - Episode 3491
2021-08-26 22:57:33.725 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:57:34.323 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:57:34.326 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.31203538179397583
2021-08-26 22:57:34.330 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.4388614296913147
2021-08-26 22:57:34.332 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.019164524972438812
2021-08-26 22:57:34.370 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4388614296913147
2021-08-26 22:57:34.374 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.019164524972438812
2021-08-26 22:57:34.381 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:57:34.385 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2883283495903015
2021-08-26 22:57:34.390 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gr

2021-08-26 22:57:34.975 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 74.0
2021-08-26 22:57:34.976 | INFO     | src.policies:collect_trajectories:213 - Episode 3503
2021-08-26 22:57:35.091 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:35.092 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:57:35.093 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 116.0
2021-08-26 22:57:35.100 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:57:35.105 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3145298957824707
2021-08-26 22:57:35.108 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.15789386630058289
2021-08-26 22:57:35.110 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.018910614773631096
2021-08-26 22:57:35.113 | INFO     |

2021-08-26 22:57:35.593 | INFO     | src.policies:collect_trajectories:213 - Episode 3509
2021-08-26 22:57:35.691 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:35.693 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 188.0
2021-08-26 22:57:35.694 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 149.0
2021-08-26 22:57:35.699 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:35.702 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.28506219387054443
2021-08-26 22:57:35.706 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.47975805401802063
2021-08-26 22:57:35.708 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.019425153732299805
2021-08-26 22:57:35.710 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.47975805401802063

2021-08-26 22:57:36.281 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 123.0
2021-08-26 22:57:36.282 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 123.0
2021-08-26 22:57:36.283 | INFO     | src.policies:collect_trajectories:213 - Episode 3516
2021-08-26 22:57:36.370 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:36.371 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 168.0
2021-08-26 22:57:36.421 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 145.5
2021-08-26 22:57:36.427 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:36.431 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2723264992237091
2021-08-26 22:57:36.434 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3102252781391144
2021-08-26 22:57:36.436 | INFO     | src.policies:minibatch_u

2021-08-26 22:57:36.971 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.01840086653828621
2021-08-26 22:57:36.973 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4194217920303345
2021-08-26 22:57:36.976 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.01840086653828621
2021-08-26 22:57:36.979 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:57:36.982 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3074713349342346
2021-08-26 22:57:36.985 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2613820433616638
2021-08-26 22:57:36.987 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.018756847828626633
2021-08-26 22:57:36.989 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2613820433616638

2021-08-26 22:57:37.660 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3081865906715393
2021-08-26 22:57:37.663 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.5128113627433777
2021-08-26 22:57:37.665 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.01825053058564663
2021-08-26 22:57:37.667 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4999990165233612
2021-08-26 22:57:37.669 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.01825053058564663
2021-08-26 22:57:37.673 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:57:37.675 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2832038104534149
2021-08-26 22:57:37.678 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 1.1214325428009033
2021-08-26 22:57:37.680 | INFO     | src.policies:

2021-08-26 22:57:38.315 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:38.319 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.32154759764671326
2021-08-26 22:57:38.322 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.4275633990764618
2021-08-26 22:57:38.324 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.017735466361045837
2021-08-26 22:57:38.326 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4275633990764618
2021-08-26 22:57:38.328 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.017735466361045837
2021-08-26 22:57:38.331 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:57:38.334 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.31433284282684326
2021-08-26 22:57:38.337 | INFO     | src.policies:minibatch_update:277 - Policy network L2 g

2021-08-26 22:57:38.960 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2840895354747772
2021-08-26 22:57:38.962 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.017690561711788177
2021-08-26 22:57:38.964 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2840895354747772
2021-08-26 22:57:38.966 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.017690561711788177
2021-08-26 22:57:38.970 | INFO     | src.policies:train:116 - Epoch 601 / 800
2021-08-26 22:57:38.971 | INFO     | src.policies:collect_trajectories:213 - Episode 3543
2021-08-26 22:57:39.016 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:39.017 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 84.0
2021-08-26 22:57:39.017 | INFO     | src.policies:collect_trajectories:230 - Last 100 e

2021-08-26 22:57:39.542 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.017459942027926445
2021-08-26 22:57:39.545 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.37794220447540283
2021-08-26 22:57:39.548 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.017459942027926445
2021-08-26 22:57:39.552 | INFO     | src.policies:train:116 - Epoch 604 / 800
2021-08-26 22:57:39.553 | INFO     | src.policies:collect_trajectories:213 - Episode 3551
2021-08-26 22:57:39.567 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:39.568 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 19.0
2021-08-26 22:57:39.569 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 19.0
2021-08-26 22:57:39.570 | INFO     | src.policies:collect_trajectories:213 - Episode 3552

2021-08-26 22:57:40.179 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:40.182 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.28968966007232666
2021-08-26 22:57:40.185 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 1.1038368940353394
2021-08-26 22:57:40.187 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.017946450039744377
2021-08-26 22:57:40.189 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4999994933605194
2021-08-26 22:57:40.192 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.017946450039744377
2021-08-26 22:57:40.195 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:57:40.197 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.31241604685783386
2021-08-26 22:57:40.201 | INFO     | src.policies:minibatch_update:277 - Policy network L2 g

2021-08-26 22:57:40.847 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 109.5
2021-08-26 22:57:40.852 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:40.856 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2699507474899292
2021-08-26 22:57:40.859 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.4755857586860657
2021-08-26 22:57:40.861 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.01788509264588356
2021-08-26 22:57:40.863 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4755857586860657
2021-08-26 22:57:40.865 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.01788509264588356
2021-08-26 22:57:40.868 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:57:40.871 | INFO     | src.policies:minibatch_update:270 - Total loss: -0

2021-08-26 22:57:41.345 | INFO     | src.policies:collect_trajectories:213 - Episode 3573
2021-08-26 22:57:41.360 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:41.361 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 24.0
2021-08-26 22:57:41.362 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 34.0
2021-08-26 22:57:41.363 | INFO     | src.policies:collect_trajectories:213 - Episode 3574
2021-08-26 22:57:41.392 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:41.393 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 52.0
2021-08-26 22:57:41.394 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 40.0
2021-08-26 22:57:41.395 | INFO     | src.policies:collect_trajectories:213 - Episode 3575
2021-08-26 22:57:41.456 | DEBUG    | src.policies:execute_episode:398 - Early stopping, al

2021-08-26 22:57:41.923 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.01690228097140789
2021-08-26 22:57:41.927 | INFO     | src.policies:train:116 - Epoch 615 / 800
2021-08-26 22:57:41.928 | INFO     | src.policies:collect_trajectories:213 - Episode 3580
2021-08-26 22:57:42.004 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:42.006 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 143.0
2021-08-26 22:57:42.007 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 143.0
2021-08-26 22:57:42.008 | INFO     | src.policies:collect_trajectories:213 - Episode 3581
2021-08-26 22:57:42.116 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:42.117 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 129.0
2021-08-26 22:57:42.118 | INFO     | src.policies:collect_trajectorie

2021-08-26 22:57:42.736 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 133.0
2021-08-26 22:57:42.741 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:42.745 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.31743860244750977
2021-08-26 22:57:42.748 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3296823799610138
2021-08-26 22:57:42.750 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.016248106956481934
2021-08-26 22:57:42.753 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.3296823799610138
2021-08-26 22:57:42.755 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.016248106956481934
2021-08-26 22:57:42.759 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:57:42.762 | INFO     | src.policies:minibatch_update:270 - Total loss:

2021-08-26 22:57:43.257 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:57:43.260 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.33233165740966797
2021-08-26 22:57:43.263 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.04741125926375389
2021-08-26 22:57:43.264 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.015459174290299416
2021-08-26 22:57:43.267 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.04741125926375389
2021-08-26 22:57:43.269 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.015459174290299416
2021-08-26 22:57:43.272 | INFO     | src.policies:train:116 - Epoch 621 / 800
2021-08-26 22:57:43.273 | INFO     | src.policies:collect_trajectories:213 - Episode 3596
2021-08-26 22:57:43.303 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:57:43.956 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:57:43.959 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3058852255344391
2021-08-26 22:57:43.962 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.23487231135368347
2021-08-26 22:57:43.964 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.016776572912931442
2021-08-26 22:57:43.966 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.23487231135368347
2021-08-26 22:57:43.969 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.016776572912931442
2021-08-26 22:57:43.971 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:57:43.974 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.295919269323349
2021-08-26 22:57:43.977 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gr

2021-08-26 22:57:44.549 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:57:44.552 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30212461948394775
2021-08-26 22:57:44.555 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2871689796447754
2021-08-26 22:57:44.557 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.016295770183205605
2021-08-26 22:57:44.560 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2871689796447754
2021-08-26 22:57:44.562 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.016295770183205605
2021-08-26 22:57:44.566 | INFO     | src.policies:train:116 - Epoch 627 / 800
2021-08-26 22:57:44.567 | INFO     | src.policies:collect_trajectories:213 - Episode 3609
2021-08-26 22:57:44.733 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:57:45.207 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.016458075493574142
2021-08-26 22:57:45.210 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2237679809331894
2021-08-26 22:57:45.212 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.016458075493574142
2021-08-26 22:57:45.216 | INFO     | src.policies:train:116 - Epoch 630 / 800
2021-08-26 22:57:45.217 | INFO     | src.policies:collect_trajectories:213 - Episode 3615
2021-08-26 22:57:45.440 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:45.442 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 178.0
2021-08-26 22:57:45.442 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 178.0
2021-08-26 22:57:45.443 | INFO     | src.policies:collect_trajectories:213 - Episode 3616

2021-08-26 22:57:46.009 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:46.013 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.28560447692871094
2021-08-26 22:57:46.016 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.31709975004196167
2021-08-26 22:57:46.018 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.01594584994018078
2021-08-26 22:57:46.021 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.31709975004196167
2021-08-26 22:57:46.023 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.01594584994018078
2021-08-26 22:57:46.026 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:57:46.029 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.28010672330856323
2021-08-26 22:57:46.032 | INFO     | src.policies:minibatch_update:277 - Policy network L2 g

2021-08-26 22:57:46.720 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:57:46.724 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2944965362548828
2021-08-26 22:57:46.728 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2557138204574585
2021-08-26 22:57:46.731 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.01594972051680088
2021-08-26 22:57:46.734 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2557138204574585
2021-08-26 22:57:46.737 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.01594972051680088
2021-08-26 22:57:46.740 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:57:46.743 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.29089540243148804
2021-08-26 22:57:46.747 | INFO     | src.policies:minibatch_update:277 - Policy network L2 grad

2021-08-26 22:57:47.278 | INFO     | src.policies:train:116 - Epoch 639 / 800
2021-08-26 22:57:47.279 | INFO     | src.policies:collect_trajectories:213 - Episode 3635
2021-08-26 22:57:47.389 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:47.391 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:57:47.392 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:57:47.398 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:47.403 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2990208864212036
2021-08-26 22:57:47.409 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.795745849609375
2021-08-26 22:57:47.412 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.01581105962395668
2021-08-26 22:57:47.417 | INFO     | src.policies:minibatch_update:288 - P

2021-08-26 22:57:47.979 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 105.0
2021-08-26 22:57:47.980 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 78.5
2021-08-26 22:57:47.981 | INFO     | src.policies:collect_trajectories:213 - Episode 3642
2021-08-26 22:57:48.136 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:48.138 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:57:48.138 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 119.0
2021-08-26 22:57:48.145 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:57:48.147 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30694347620010376
2021-08-26 22:57:48.150 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.7092357873916626
2021-08-26 22:57:48.152 | INFO     | src.policies:minibatch_u

2021-08-26 22:57:48.786 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:57:48.788 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.306841641664505
2021-08-26 22:57:48.792 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.4895041286945343
2021-08-26 22:57:48.794 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.015380792319774628
2021-08-26 22:57:48.797 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4895041286945343
2021-08-26 22:57:48.800 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.015380792319774628
2021-08-26 22:57:48.803 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:57:48.807 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2881334722042084
2021-08-26 22:57:48.810 | INFO     | src.policies:minibatch_update:277 - Policy network L2 grad

2021-08-26 22:57:49.347 | INFO     | src.policies:train:116 - Epoch 648 / 800
2021-08-26 22:57:49.348 | INFO     | src.policies:collect_trajectories:213 - Episode 3653
2021-08-26 22:57:49.455 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:49.456 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:57:49.457 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:57:49.462 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:49.465 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.286763995885849
2021-08-26 22:57:49.468 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.7545670866966248
2021-08-26 22:57:49.471 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.015255951322615147
2021-08-26 22:57:49.473 | INFO     | src.policies:minibatch_update:288 - 

2021-08-26 22:57:49.982 | INFO     | src.policies:train:116 - Epoch 652 / 800
2021-08-26 22:57:49.983 | INFO     | src.policies:collect_trajectories:213 - Episode 3658
2021-08-26 22:57:50.013 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:50.014 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 51.0
2021-08-26 22:57:50.015 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 51.0
2021-08-26 22:57:50.016 | INFO     | src.policies:collect_trajectories:213 - Episode 3659
2021-08-26 22:57:50.108 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:50.109 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 167.0
2021-08-26 22:57:50.111 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 109.0
2021-08-26 22:57:50.119 | INFO     | src.policies:train:152 - Mini-batch 1 / 2

2021-08-26 22:57:50.612 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.22926181554794312
2021-08-26 22:57:50.614 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.015075194649398327
2021-08-26 22:57:50.618 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:57:50.621 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.28495046496391296
2021-08-26 22:57:50.623 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.19227010011672974
2021-08-26 22:57:50.626 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.015245568007230759
2021-08-26 22:57:50.628 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.19227010011672974
2021-08-26 22:57:50.631 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.015245568

2021-08-26 22:57:51.258 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.015073740854859352
2021-08-26 22:57:51.262 | INFO     | src.policies:train:116 - Epoch 658 / 800
2021-08-26 22:57:51.263 | INFO     | src.policies:collect_trajectories:213 - Episode 3673
2021-08-26 22:57:51.281 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:51.282 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 29.0
2021-08-26 22:57:51.283 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 29.0
2021-08-26 22:57:51.284 | INFO     | src.policies:collect_trajectories:213 - Episode 3674
2021-08-26 22:57:51.352 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:51.353 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 44.0
2021-08-26 22:57:51.354 | INFO     | src.policies:collect_trajectories:

2021-08-26 22:57:51.941 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.014439516700804234
2021-08-26 22:57:51.945 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:57:51.948 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3072916269302368
2021-08-26 22:57:51.950 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.17134051024913788
2021-08-26 22:57:51.952 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.014557864516973495
2021-08-26 22:57:51.955 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.17134051024913788
2021-08-26 22:57:51.957 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.014557864516973495
2021-08-26 22:57:51.960 | INFO     | src.policies:train:116 - Epoch 661 / 800
2021-08-26 22:57:51.961 | INFO     | src.policies:coll

2021-08-26 22:57:52.714 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:57:52.717 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2944014370441437
2021-08-26 22:57:52.720 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.33244067430496216
2021-08-26 22:57:52.722 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.014780120924115181
2021-08-26 22:57:52.725 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.33244067430496216
2021-08-26 22:57:52.728 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.014780120924115181
2021-08-26 22:57:52.731 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:57:52.736 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.29428184032440186
2021-08-26 22:57:52.741 | INFO     | src.policies:minibatch_update:277 - Policy network L2 

2021-08-26 22:57:53.281 | INFO     | src.policies:train:116 - Epoch 666 / 800
2021-08-26 22:57:53.282 | INFO     | src.policies:collect_trajectories:213 - Episode 3694
2021-08-26 22:57:53.328 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:53.329 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 80.0
2021-08-26 22:57:53.329 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 80.0
2021-08-26 22:57:53.330 | INFO     | src.policies:collect_trajectories:213 - Episode 3695
2021-08-26 22:57:53.412 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:53.414 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 127.0
2021-08-26 22:57:53.414 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 103.5
2021-08-26 22:57:53.421 | INFO     | src.policies:train:152 - Mini-batch 1 / 2

2021-08-26 22:57:53.926 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2674057185649872
2021-08-26 22:57:53.928 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.015187746845185757
2021-08-26 22:57:53.932 | INFO     | src.policies:train:116 - Epoch 669 / 800
2021-08-26 22:57:53.933 | INFO     | src.policies:collect_trajectories:213 - Episode 3701
2021-08-26 22:57:53.958 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:53.959 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 44.0
2021-08-26 22:57:53.960 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 44.0
2021-08-26 22:57:53.960 | INFO     | src.policies:collect_trajectories:213 - Episode 3702
2021-08-26 22:57:53.974 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:57:54.603 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:57:54.608 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.28852856159210205
2021-08-26 22:57:54.611 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.3326222896575928
2021-08-26 22:57:54.647 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.014477125369012356
2021-08-26 22:57:54.650 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.3326222896575928
2021-08-26 22:57:54.653 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.014477125369012356
2021-08-26 22:57:54.656 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:57:54.660 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.29481086134910583
2021-08-26 22:57:54.663 | INFO     | src.policies:minibatch_update:277 - Policy network L2 g

2021-08-26 22:57:55.284 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:57:55.288 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3094884753227234
2021-08-26 22:57:55.292 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.030953025445342064
2021-08-26 22:57:55.294 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.014424847438931465
2021-08-26 22:57:55.298 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.030953025445342064
2021-08-26 22:57:55.301 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.014424847438931465
2021-08-26 22:57:55.307 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:57:55.311 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2881554067134857
2021-08-26 22:57:55.314 | INFO     | src.policies:minibatch_update:277 - Policy network L2

2021-08-26 22:57:55.899 | INFO     | src.policies:collect_trajectories:213 - Episode 3722
2021-08-26 22:57:56.008 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:56.010 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:57:56.010 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 129.5
2021-08-26 22:57:56.017 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:56.020 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2960105538368225
2021-08-26 22:57:56.023 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.6022521257400513
2021-08-26 22:57:56.025 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.01406506635248661
2021-08-26 22:57:56.028 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999913573265076

2021-08-26 22:57:56.597 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2892976403236389
2021-08-26 22:57:56.599 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.014175993390381336
2021-08-26 22:57:56.602 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2892976403236389
2021-08-26 22:57:56.604 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.014175993390381336
2021-08-26 22:57:56.609 | INFO     | src.policies:train:116 - Epoch 680 / 800
2021-08-26 22:57:56.610 | INFO     | src.policies:collect_trajectories:213 - Episode 3728
2021-08-26 22:57:56.701 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:56.702 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 170.0
2021-08-26 22:57:56.703 | INFO     | src.policies:collect_trajectories:230 - Last 100 

2021-08-26 22:57:57.280 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:57:57.283 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2760487198829651
2021-08-26 22:57:57.286 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.8100676536560059
2021-08-26 22:57:57.288 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.013895486481487751
2021-08-26 22:57:57.290 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999934434890747
2021-08-26 22:57:57.331 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.013895486481487751
2021-08-26 22:57:57.335 | INFO     | src.policies:train:116 - Epoch 683 / 800
2021-08-26 22:57:57.336 | INFO     | src.policies:collect_trajectories:213 - Episode 3735
2021-08-26 22:57:57.431 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:57:57.943 | INFO     | src.policies:train:116 - Epoch 686 / 800
2021-08-26 22:57:57.944 | INFO     | src.policies:collect_trajectories:213 - Episode 3740
2021-08-26 22:57:58.040 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:58.041 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 176.0
2021-08-26 22:57:58.042 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 176.0
2021-08-26 22:57:58.043 | INFO     | src.policies:collect_trajectories:213 - Episode 3741
2021-08-26 22:57:58.155 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:57:58.156 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:57:58.157 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 188.0
2021-08-26 22:57:58.163 | INFO     | src.policies:train:152 - Mini-batch 1 / 3

2021-08-26 22:57:58.628 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4999990165233612
2021-08-26 22:57:58.631 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.013205881230533123
2021-08-26 22:57:58.634 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:57:58.636 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3203781843185425
2021-08-26 22:57:58.639 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.20894566178321838
2021-08-26 22:57:58.641 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.013590945862233639
2021-08-26 22:57:58.644 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.20894566178321838
2021-08-26 22:57:58.646 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.01359094586

2021-08-26 22:57:59.399 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:57:59.403 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.29767128825187683
2021-08-26 22:57:59.408 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.5011746287345886
2021-08-26 22:57:59.411 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.013857178390026093
2021-08-26 22:57:59.414 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999895691871643
2021-08-26 22:57:59.418 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.013857178390026093
2021-08-26 22:57:59.423 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:57:59.426 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2979484498500824
2021-08-26 22:57:59.429 | INFO     | src.policies:minibatch_update:277 - Policy network L2 g

2021-08-26 22:58:00.195 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.306401789188385
2021-08-26 22:58:00.198 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.064024418592453
2021-08-26 22:58:00.200 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.013496562838554382
2021-08-26 22:58:00.202 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.064024418592453
2021-08-26 22:58:00.204 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.013496562838554382
2021-08-26 22:58:00.207 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:58:00.210 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30604544281959534
2021-08-26 22:58:00.213 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.15663361549377441
2021-08-26 22:58:00.215 | INFO     | src.policies

2021-08-26 22:58:00.860 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:58:00.863 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.28769707679748535
2021-08-26 22:58:00.866 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.6587099432945251
2021-08-26 22:58:00.868 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.013804948888719082
2021-08-26 22:58:00.870 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4999992251396179
2021-08-26 22:58:00.872 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.013804948888719082
2021-08-26 22:58:00.875 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:58:00.878 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3067438304424286
2021-08-26 22:58:00.881 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gr

2021-08-26 22:58:01.545 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:58:01.548 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2929074466228485
2021-08-26 22:58:01.551 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.7848409414291382
2021-08-26 22:58:01.553 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.013090195134282112
2021-08-26 22:58:01.555 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4999992847442627
2021-08-26 22:58:01.558 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.013090195134282112
2021-08-26 22:58:01.561 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:58:01.563 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2927578091621399
2021-08-26 22:58:01.566 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gra

2021-08-26 22:58:02.198 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.7548058032989502
2021-08-26 22:58:02.200 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.013274841010570526
2021-08-26 22:58:02.202 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999934434890747
2021-08-26 22:58:02.205 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.013274841010570526
2021-08-26 22:58:02.208 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:58:02.210 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2917375862598419
2021-08-26 22:58:02.214 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.26849162578582764
2021-08-26 22:58:02.216 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.013754425570368767

2021-08-26 22:58:02.993 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 185.0
2021-08-26 22:58:02.994 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 189.0
2021-08-26 22:58:03.000 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:58:03.004 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3053581416606903
2021-08-26 22:58:03.007 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.6020772457122803
2021-08-26 22:58:03.009 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.01280638575553894
2021-08-26 22:58:03.012 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999916553497314
2021-08-26 22:58:03.014 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.01280638575553894
2021-08-26 22:58:03.017 | INFO     | src.policies:train:152 -

2021-08-26 22:58:03.511 | INFO     | src.policies:collect_trajectories:213 - Episode 3787
2021-08-26 22:58:03.609 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:58:03.610 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 185.0
2021-08-26 22:58:03.611 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 185.0
2021-08-26 22:58:03.612 | INFO     | src.policies:collect_trajectories:213 - Episode 3788
2021-08-26 22:58:03.723 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:58:03.724 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:58:03.725 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 192.5
2021-08-26 22:58:03.731 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:58:03.735 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.317931920

2021-08-26 22:58:04.268 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.012257533147931099
2021-08-26 22:58:04.270 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2032630741596222
2021-08-26 22:58:04.272 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.012257533147931099
2021-08-26 22:58:04.275 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:58:04.278 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2994735836982727
2021-08-26 22:58:04.281 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.48825109004974365
2021-08-26 22:58:04.282 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.012506666593253613
2021-08-26 22:58:04.285 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.48825109004974365

2021-08-26 22:58:05.013 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 160.0
2021-08-26 22:58:05.016 | INFO     | src.policies:collect_trajectories:213 - Episode 3799
2021-08-26 22:58:05.168 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:58:05.171 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 55.0
2021-08-26 22:58:05.173 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 107.5
2021-08-26 22:58:05.192 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:58:05.199 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.300044447183609
2021-08-26 22:58:05.206 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1328098177909851
2021-08-26 22:58:05.209 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.012555507011711597

2021-08-26 22:58:06.142 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:58:06.146 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2998853921890259
2021-08-26 22:58:06.149 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.23526225984096527
2021-08-26 22:58:06.151 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.012751780450344086
2021-08-26 22:58:06.153 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.23526225984096527
2021-08-26 22:58:06.155 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.012751780450344086
2021-08-26 22:58:06.158 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:58:06.161 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30196329951286316
2021-08-26 22:58:06.164 | INFO     | src.policies:minibatch_update:277 - Policy network L2 

2021-08-26 22:58:06.741 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2705455422401428
2021-08-26 22:58:06.743 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.01224600151181221
2021-08-26 22:58:06.746 | INFO     | src.policies:train:116 - Epoch 724 / 800
2021-08-26 22:58:06.747 | INFO     | src.policies:collect_trajectories:213 - Episode 3809
2021-08-26 22:58:06.764 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:58:06.765 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 29.0
2021-08-26 22:58:06.766 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 29.0
2021-08-26 22:58:06.767 | INFO     | src.policies:collect_trajectories:213 - Episode 3810
2021-08-26 22:58:06.787 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done

2021-08-26 22:58:07.472 | INFO     | src.policies:collect_trajectories:213 - Episode 3815
2021-08-26 22:58:07.576 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:58:07.577 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 187.0
2021-08-26 22:58:07.578 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 187.0
2021-08-26 22:58:07.579 | INFO     | src.policies:collect_trajectories:213 - Episode 3816
2021-08-26 22:58:07.689 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:58:07.690 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:58:07.691 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 193.5
2021-08-26 22:58:07.697 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:58:07.701 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.288944214

2021-08-26 22:58:08.207 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 142.0
2021-08-26 22:58:08.208 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 142.0
2021-08-26 22:58:08.209 | INFO     | src.policies:collect_trajectories:213 - Episode 3822
2021-08-26 22:58:08.344 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:58:08.347 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:58:08.348 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 171.0
2021-08-26 22:58:08.354 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:58:08.358 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3058047592639923
2021-08-26 22:58:08.361 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.09006359428167343
2021-08-26 22:58:08.363 | INFO     | src.policies:minibatch_

2021-08-26 22:58:08.901 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:58:08.905 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2739379107952118
2021-08-26 22:58:08.907 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.5989775657653809
2021-08-26 22:58:08.909 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.012270725332200527
2021-08-26 22:58:08.912 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4999992251396179
2021-08-26 22:58:08.914 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.012270725332200527
2021-08-26 22:58:08.917 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:58:08.920 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2889585793018341
2021-08-26 22:58:08.923 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gra

2021-08-26 22:58:09.579 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:58:09.583 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:58:09.586 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2885074317455292
2021-08-26 22:58:09.589 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.4409751296043396
2021-08-26 22:58:09.590 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.01244575809687376
2021-08-26 22:58:09.593 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4409751296043396
2021-08-26 22:58:09.595 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.01244575809687376
2021-08-26 22:58:09.598 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:58:09.601 | INFO     | src.policies:minibatch_update:270 - Total loss: -0

2021-08-26 22:58:10.208 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:58:10.212 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.315337598323822
2021-08-26 22:58:10.214 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.2301425188779831
2021-08-26 22:58:10.216 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.011623308062553406
2021-08-26 22:58:10.219 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.2301425188779831
2021-08-26 22:58:10.221 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.011623308062553406
2021-08-26 22:58:10.224 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:58:10.227 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.31092333793640137
2021-08-26 22:58:10.230 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gra

2021-08-26 22:58:10.942 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.011900962330400944
2021-08-26 22:58:10.945 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.43523845076560974
2021-08-26 22:58:10.949 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.011900962330400944
2021-08-26 22:58:10.953 | INFO     | src.policies:train:116 - Epoch 745 / 800
2021-08-26 22:58:10.954 | INFO     | src.policies:collect_trajectories:213 - Episode 3844
2021-08-26 22:58:11.061 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:58:11.062 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 139.0
2021-08-26 22:58:11.063 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 139.0
2021-08-26 22:58:11.064 | INFO     | src.policies:collect_trajectories:213 - Episode 3845

2021-08-26 22:58:11.737 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 155.0
2021-08-26 22:58:11.738 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 155.0
2021-08-26 22:58:11.738 | INFO     | src.policies:collect_trajectories:213 - Episode 3850
2021-08-26 22:58:11.846 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:58:11.847 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:58:11.848 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 177.5
2021-08-26 22:58:11.854 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:58:11.857 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.32565826177597046
2021-08-26 22:58:11.860 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.21252429485321045
2021-08-26 22:58:11.862 | INFO     | src.policies:minibatch

2021-08-26 22:58:12.358 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4999995827674866
2021-08-26 22:58:12.360 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.011811789125204086
2021-08-26 22:58:12.364 | INFO     | src.policies:train:116 - Epoch 751 / 800
2021-08-26 22:58:12.365 | INFO     | src.policies:collect_trajectories:213 - Episode 3855
2021-08-26 22:58:12.421 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:58:12.423 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 86.0
2021-08-26 22:58:12.423 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 86.0
2021-08-26 22:58:12.424 | INFO     | src.policies:collect_trajectories:213 - Episode 3856
2021-08-26 22:58:12.516 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:58:12.517 | INFO    

2021-08-26 22:58:13.029 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.011793366633355618
2021-08-26 22:58:13.032 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:58:13.035 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.29917627573013306
2021-08-26 22:58:13.038 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.12731142342090607
2021-08-26 22:58:13.040 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.011441436596214771
2021-08-26 22:58:13.042 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.12731142342090607
2021-08-26 22:58:13.045 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.011441436596214771
2021-08-26 22:58:13.048 | INFO     | src.policies:train:116 - Epoch 754 / 800
2021-08-26 22:58:13.049 | INFO     | src.policies:col

2021-08-26 22:58:13.798 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.011871371418237686
2021-08-26 22:58:13.801 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.20694491267204285
2021-08-26 22:58:13.803 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.011871371418237686
2021-08-26 22:58:13.806 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:58:13.809 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.27833282947540283
2021-08-26 22:58:13.812 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.25795045495033264
2021-08-26 22:58:13.814 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.011494992300868034
2021-08-26 22:58:13.816 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.25795045495033264
2021-

2021-08-26 22:58:14.409 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.011381029151380062
2021-08-26 22:58:14.412 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999916553497314
2021-08-26 22:58:14.414 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.011381029151380062
2021-08-26 22:58:14.418 | INFO     | src.policies:train:116 - Epoch 760 / 800
2021-08-26 22:58:14.419 | INFO     | src.policies:collect_trajectories:213 - Episode 3873
2021-08-26 22:58:14.524 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:58:14.525 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 191.0
2021-08-26 22:58:14.526 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 191.0
2021-08-26 22:58:14.527 | INFO     | src.policies:collect_trajectories:213 - Episode 3874
2021-0

2021-08-26 22:58:15.226 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.011439383029937744
2021-08-26 22:58:15.228 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4353860914707184
2021-08-26 22:58:15.231 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.011439383029937744
2021-08-26 22:58:15.234 | INFO     | src.policies:train:116 - Epoch 763 / 800
2021-08-26 22:58:15.235 | INFO     | src.policies:collect_trajectories:213 - Episode 3878
2021-08-26 22:58:15.281 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:58:15.282 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 78.0
2021-08-26 22:58:15.283 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 78.0
2021-08-26 22:58:15.284 | INFO     | src.policies:collect_trajectories:213 - Episode 3879
2021-08-2

2021-08-26 22:58:16.088 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:58:16.092 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.26192179322242737
2021-08-26 22:58:16.096 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.8251069784164429
2021-08-26 22:58:16.099 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.011505273170769215
2021-08-26 22:58:16.103 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999943375587463
2021-08-26 22:58:16.106 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.011505273170769215
2021-08-26 22:58:16.109 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:58:16.112 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2993890345096588
2021-08-26 22:58:16.115 | INFO     | src.policies:minibatch_update:277 - Policy network L2 g

2021-08-26 22:58:16.601 | INFO     | src.policies:train:116 - Epoch 769 / 800
2021-08-26 22:58:16.602 | INFO     | src.policies:collect_trajectories:213 - Episode 3890
2021-08-26 22:58:16.707 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:58:16.708 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:58:16.709 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 200.0
2021-08-26 22:58:16.713 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:58:16.716 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2990042269229889
2021-08-26 22:58:16.719 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.8091145753860474
2021-08-26 22:58:16.721 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.010973858647048473
2021-08-26 22:58:16.723 | INFO     | src.policies:minibatch_update:288 -

2021-08-26 22:58:17.284 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.011089697480201721
2021-08-26 22:58:17.286 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999916553497314
2021-08-26 22:58:17.289 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.011089697480201721
2021-08-26 22:58:17.293 | INFO     | src.policies:train:116 - Epoch 772 / 800
2021-08-26 22:58:17.293 | INFO     | src.policies:collect_trajectories:213 - Episode 3896
2021-08-26 22:58:17.385 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:58:17.386 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 176.0
2021-08-26 22:58:17.387 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 176.0
2021-08-26 22:58:17.388 | INFO     | src.policies:collect_trajectories:213 - Episode 3897
2021-0

2021-08-26 22:58:17.926 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.14631833136081696
2021-08-26 22:58:17.928 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.01113736629486084
2021-08-26 22:58:17.931 | INFO     | src.policies:train:116 - Epoch 775 / 800
2021-08-26 22:58:17.932 | INFO     | src.policies:collect_trajectories:213 - Episode 3902
2021-08-26 22:58:17.985 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:58:17.986 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 93.0
2021-08-26 22:58:17.987 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 93.0
2021-08-26 22:58:17.988 | INFO     | src.policies:collect_trajectories:213 - Episode 3903
2021-08-26 22:58:18.075 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:58:18.076 | INFO    

2021-08-26 22:58:18.660 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.932918906211853
2021-08-26 22:58:18.662 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.010599218308925629
2021-08-26 22:58:18.664 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.49999943375587463
2021-08-26 22:58:18.668 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.010599218308925629
2021-08-26 22:58:18.672 | INFO     | src.policies:train:152 - Mini-batch 3 / 3
2021-08-26 22:58:18.675 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2836046814918518
2021-08-26 22:58:18.678 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.5007028579711914
2021-08-26 22:58:18.680 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.011001737788319588
2021-08-26 22:58:18.683 

2021-08-26 22:58:19.359 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:58:19.360 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 185.0
2021-08-26 22:58:19.361 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 185.0
2021-08-26 22:58:19.362 | INFO     | src.policies:collect_trajectories:213 - Episode 3915
2021-08-26 22:58:19.527 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:58:19.528 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:58:19.529 | INFO     | src.policies:collect_trajectories:230 - Last 100 episodes mean return: 192.5
2021-08-26 22:58:19.535 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:58:19.539 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.293401300907135
2021-08-26 22:58:19.542 | INFO     | src.policies:minibatch_update:277 - Policy net

2021-08-26 22:58:20.054 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.6651943325996399
2021-08-26 22:58:20.056 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.010735505260527134
2021-08-26 22:58:20.058 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4999992847442627
2021-08-26 22:58:20.061 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.010735505260527134
2021-08-26 22:58:20.064 | INFO     | src.policies:train:116 - Epoch 785 / 800
2021-08-26 22:58:20.065 | INFO     | src.policies:collect_trajectories:213 - Episode 3919
2021-08-26 22:58:20.203 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:58:20.204 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 200.0
2021-08-26 22:58:20.205 | INFO     | src.policies:collect_trajectories:230 - Last 100 

2021-08-26 22:58:20.744 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.44154155254364014
2021-08-26 22:58:20.746 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.010623294860124588
2021-08-26 22:58:20.748 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.44154155254364014
2021-08-26 22:58:20.751 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.010623294860124588
2021-08-26 22:58:20.754 | INFO     | src.policies:train:116 - Epoch 789 / 800
2021-08-26 22:58:20.756 | INFO     | src.policies:collect_trajectories:213 - Episode 3924
2021-08-26 22:58:20.792 | DEBUG    | src.policies:execute_episode:398 - Early stopping, all agents done
2021-08-26 22:58:20.793 | INFO     | src.policies:collect_trajectories:229 - Mean episode return: 63.0
2021-08-26 22:58:20.794 | INFO     | src.policies:collect_trajectories:230 - Last 100

2021-08-26 22:58:21.431 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:58:21.434 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3087337017059326
2021-08-26 22:58:21.437 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.7073695063591003
2021-08-26 22:58:21.439 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.010314984247088432
2021-08-26 22:58:21.442 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4999993145465851
2021-08-26 22:58:21.444 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.010314984247088432
2021-08-26 22:58:21.447 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:58:21.450 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.2974664568901062
2021-08-26 22:58:21.453 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gra

2021-08-26 22:58:22.201 | INFO     | src.policies:train:152 - Mini-batch 1 / 2
2021-08-26 22:58:22.204 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.29566463828086853
2021-08-26 22:58:22.207 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 1.0724290609359741
2021-08-26 22:58:22.208 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.010281178168952465
2021-08-26 22:58:22.211 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.4999995529651642
2021-08-26 22:58:22.213 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.010281178168952465
2021-08-26 22:58:22.216 | INFO     | src.policies:train:152 - Mini-batch 2 / 2
2021-08-26 22:58:22.218 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.3085186183452606
2021-08-26 22:58:22.222 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gr

2021-08-26 22:58:22.972 | INFO     | src.policies:train:152 - Mini-batch 1 / 3
2021-08-26 22:58:22.976 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.29156261682510376
2021-08-26 22:58:22.979 | INFO     | src.policies:minibatch_update:277 - Policy network L2 gradient norm: 0.1961611807346344
2021-08-26 22:58:22.981 | INFO     | src.policies:minibatch_update:281 - Baseline network L2 gradient norm: 0.010258766822516918
2021-08-26 22:58:22.984 | INFO     | src.policies:minibatch_update:288 - Policy network L2 gradient norm after clipping: 0.1961611807346344
2021-08-26 22:58:22.986 | INFO     | src.policies:minibatch_update:295 - Baseline network L2 gradient norm after clipping: 0.010258766822516918
2021-08-26 22:58:22.989 | INFO     | src.policies:train:152 - Mini-batch 2 / 3
2021-08-26 22:58:22.992 | INFO     | src.policies:minibatch_update:270 - Total loss: -0.30782777070999146
2021-08-26 22:58:22.996 | INFO     | src.policies:minibatch_update:277 - Policy network L2 g

| Metric | Value |
| --- | --- |
| loss | -0.91733 |
| mean_return | 193.5 |
| _runtime | 192.0 |
| _timestamp | 1630011503.0 |
| _step | 799.0 |


| Metric | History |
| --- | --- |
| loss | █▇▆▅▅▄▄▅▃▄▄▄▄▄▄▂▄▄▄▄▁▄▄▁▁▃▄▃▄▂▃▄▄▄▄▃▃▂▁▁ |
| mean_return | ▁▁▁▁▁▁▁▁▁▂▁▂▂▁▂▄▃▅▃▂▃▃█▄▅▆█▄█▄▅▅▄▅███▆▆█ |
| _runtime | ▁▁▁▁▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇███ |
| _timestamp | ▁▁▁▁▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇███ |
| _step | ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
