# CartPole tests with policy gradient methods

This notebook contains a simple test for each implemented policy gradient method. To check that they work properly, we rely on the [CartPole](https://gym.openai.com/envs/CartPole-v0/) environment, which ships with OpenAI Gym. According to Gym's documentation, the problem is considered "solved" when the agent obtains a mean return of at least 195 over 100 consecutive episodes.
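The "solved" criterion can be expressed as a rolling mean over the most recent episode returns. The sketch below is only an illustration of that criterion; the helper name `is_solved` and its parameters are hypothetical and are not part of this repository.

```python
from collections import deque


def is_solved(episode_returns, threshold=195.0, window=100):
    """Return True once the mean of the last `window` returns reaches `threshold`."""
    recent = deque(maxlen=window)  # keeps only the most recent `window` returns
    for episode_return in episode_returns:
        recent.append(episode_return)
        # Only judge the criterion once a full window has been observed.
        if len(recent) == window and sum(recent) / window >= threshold:
            return True
    return False


# 50 weak episodes followed by 100 episodes at CartPole-v0's 200-step cap
print(is_solved([20.0] * 50 + [200.0] * 100))  # True
print(is_solved([20.0] * 150))                 # False
```

Requiring a full window before declaring success avoids calling the task solved after a single lucky episode.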

## Pre-requisites

The cells below install and import the libraries needed to run the notebook examples.

In [1]:
import sys
sys.path.append('../')

In [None]:
%%capture
!pip install -r ../init/requirements.txt

In [2]:
import numpy as np
import gym

from src import models, policies

%load_ext autoreload
%autoreload 2

## Utilities

The cells below define the environment, along with common variables used throughout the notebook.

In [3]:
env = gym.make('CartPole-v0')

In [4]:
observation_space_size = 4
action_space_size = 2
hidden_sizes = [32, 32]
epochs = 800
steps_per_epoch = 200
minibatch_size = 100
episodes_mean_return = 100
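# Read the W&B API key with a context manager so the file handle is closed.
with open("../wandb_api_key_file", "r") as key_file:
    wandb_api_key = key_file.read().strip()

wandb_config = {
    "api_key": wandb_api_key,
    "project": "cpr-appropriation",
    "entity": "wadaboa",
}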
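The hard-coded sizes above (4 observations, 2 actions) match CartPole. As a possible alternative, they could be read from the environment's spaces. The sketch below assumes a Gym-style environment with a `Box` observation space and a `Discrete` action space; `space_sizes` and `fake_env` are hypothetical names used only for illustration.

```python
from types import SimpleNamespace


def space_sizes(env):
    """Read the observation and action sizes from a Gym-style environment."""
    observation_size = env.observation_space.shape[0]  # Box space: length of the state vector
    action_size = env.action_space.n                   # Discrete space: number of actions
    return observation_size, action_size


# Stand-in object mimicking CartPole-v0's spaces, so the sketch runs without Gym installed.
fake_env = SimpleNamespace(
    observation_space=SimpleNamespace(shape=(4,)),
    action_space=SimpleNamespace(n=2),
)
print(space_sizes(fake_env))  # (4, 2)
```

With the real environment, `space_sizes(env)` would be called on the object returned by `gym.make('CartPole-v0')`.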

## VPG

This section trains a CartPole agent using our custom Vanilla Policy Gradient (VPG) implementation.
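For readers unfamiliar with VPG, the following NumPy sketch shows the two quantities a VPG update with a learned baseline typically optimizes: a policy surrogate loss weighted by advantages, and a regression loss for the baseline. It is a textbook-style illustration under stated assumptions (rewards-to-go as returns, undiscounted by default), not the code in `src.policies`; the function names are hypothetical.

```python
import numpy as np


def rewards_to_go(rewards, gamma=1.0):
    """Discounted sum of future rewards for every time step of one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def vpg_losses(log_probs, returns, baseline_values):
    """Policy-gradient surrogate loss and baseline regression loss."""
    # In an autodiff framework the advantages are treated as constants here.
    advantages = returns - baseline_values
    policy_loss = -np.mean(log_probs * advantages)  # minimizing this increases expected return
    baseline_loss = np.mean((baseline_values - returns) ** 2)
    return policy_loss, baseline_loss


returns = rewards_to_go([1.0, 1.0, 1.0])
print(returns)  # [3. 2. 1.]
```

Subtracting the baseline does not change the expected gradient but usually lowers its variance, which is why a second network is trained alongside the policy.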

In [6]:
vpg_policy_nn = models.MLP(observation_space_size, hidden_sizes, action_space_size)
vpg_baseline_nn = models.MLP(observation_space_size, hidden_sizes, 1, log_softmax=False)
vpg_policy = policies.VPGPolicy(env, vpg_policy_nn, baseline_nn=vpg_baseline_nn)
vpg_policy.train(
    epochs,
    steps_per_epoch,
    minibatch_size,
    enable_wandb=True,
    wandb_config={**wandb_config, "group": "VPG"},
    episodes_mean_return=episodes_mean_return
)

2021-09-07 17:17:49.160 | DEBUG    | src.models:__init__:56 - Model summary: MLP(
  (mlp): Sequential(
    (0): Linear(in_features=4, out_features=32, bias=True)
    (1): LeakyReLU(negative_slope=0.01)
    (2): Linear(in_features=32, out_features=32, bias=True)
    (3): LeakyReLU(negative_slope=0.01)
    (4): Linear(in_features=32, out_features=2, bias=True)
    (5): LeakyReLU(negative_slope=0.01)
  )
  (out): LogSoftmax(dim=-1)
)
2021-09-07 17:17:49.162 | DEBUG    | src.models:__init__:56 - Model summary: MLP(
  (mlp): Sequential(
    (0): Linear(in_features=4, out_features=32, bias=True)
    (1): LeakyReLU(negative_slope=0.01)
    (2): Linear(in_features=32, out_features=32, bias=True)
    (3): LeakyReLU(negative_slope=0.01)
    (4): Linear(in_features=32, out_features=1, bias=True)
    (5): LeakyReLU(negative_slope=0.01)
  )
  (out): Identity()
)
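
The printed summaries show that every `Linear` layer, including the last one, is followed by a `LeakyReLU`, and that the policy network ends with a `LogSoftmax`. The NumPy sketch below reproduces that forward pass with random weights to show that the policy outputs are log-probabilities. The weights and helper names are illustrative assumptions; only the layer sizes and the layer order come from the summary.

```python
import numpy as np


def leaky_relu(x, negative_slope=0.01):
    return np.where(x > 0, x, negative_slope * x)


def log_softmax(x):
    shifted = x - x.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def mlp_forward(x, weights, biases):
    """Linear -> LeakyReLU after every layer (including the last), then LogSoftmax."""
    for w, b in zip(weights, biases):
        x = leaky_relu(x @ w + b)
    return log_softmax(x)


rng = np.random.default_rng(0)
sizes = [4, 32, 32, 2]  # layer widths taken from the printed summary
weights = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

log_probs = mlp_forward(rng.normal(size=(1, 4)), weights, biases)
print(np.exp(log_probs).sum())  # approximately 1: the outputs are log-probabilities
```

The baseline network uses the same stack but ends with `Identity()`, so it outputs an unconstrained scalar value estimate instead.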



Run summary:

| Metric | Value |
| --- | --- |
| loss | 0.03829 |
| mean_return | 200.0 |
| _runtime | 113.0 |
| _timestamp | 1631027869.0 |
| _step | 728.0 |

Run history:

| Metric | Trend |
| --- | --- |
| loss | ▆▄▄▃▂▃▂▃▃▆▂▂▃▂▂▂▂▂█▃▂▂▂▁▂▃▂▁▁▁▂▁▁▂▁▂▃▃▃▃ |
| mean_return | ▁▂▂▂▄▂▆▅█▇█▇████████████████▇████▅▆▄████ |
| _runtime | ▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇████ |
| _timestamp | ▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇████ |
| _step | ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇████ |


[34m[1mwandb[0m: wandb version 0.12.1 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade


2021-09-07 17:17:56.889 | INFO     | src.policies:train:123 - Epoch 1 / 800
2021-09-07 17:17:56.889 | INFO     | src.policies:collect_trajectories:221 - Episode 1
2021-09-07 17:17:56.897 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:17:56.898 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 34.0
2021-09-07 17:17:56.898 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 34.0
2021-09-07 17:17:56.899 | INFO     | src.policies:collect_trajectories:221 - Episode 2
2021-09-07 17:17:56.904 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:17:56.904 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 17.0
2021-09-07 17:17:56.905 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 25.5
2021-09-07 17:17:56.905 | INFO     | src.policies:collect_trajectories:221 - Episode 3
2021-09-07 17:17:56.911

2021-09-07 17:17:57.043 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 17.0
2021-09-07 17:17:57.043 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 21.555555555555557
2021-09-07 17:17:57.044 | INFO     | src.policies:collect_trajectories:221 - Episode 17
2021-09-07 17:17:57.048 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:17:57.048 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 10.0
2021-09-07 17:17:57.049 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 20.4
2021-09-07 17:17:57.053 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:17:57.054 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.1984850913286209, 'baseline_loss': 1.553328275680542, 'total_loss': 0.5781790614128113}
2021-09-07 17:17:57.055 | INFO     | src.policies:minibatch_update:287 - Policy network L2 grad
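
The `Losses` entries in these logs are numerically consistent with `total_loss = policy_loss + 0.5 * baseline_loss`. The check below uses three triples copied from complete log lines in this notebook. The coefficient 0.5 is inferred from the logged numbers only; it has not been confirmed against the source of `src.policies`, and `combine` is a hypothetical helper.

```python
# (policy_loss, baseline_loss, total_loss) triples copied from log lines in this notebook
logged = [
    (-0.1984850913286209, 1.553328275680542, 0.5781790614128113),
    (-0.17285577952861786, 1.4730218648910522, 0.5636551380157471),
    (-0.11859670281410217, 1.0739201307296753, 0.4183633625507355),
]

BASELINE_COEF = 0.5  # assumption inferred from the numbers above, not read from the source


def combine(policy_loss, baseline_loss, coef=BASELINE_COEF):
    """Weighted sum of the two losses."""
    return policy_loss + coef * baseline_loss


# Largest gap between the recomputed and the logged total loss
max_gap = max(abs(combine(p, b) - total) for p, b, total in logged)
print(max_gap < 1e-6)  # True
```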

2021-09-07 17:17:57.155 | INFO     | src.policies:train:123 - Epoch 4 / 800
2021-09-07 17:17:57.155 | INFO     | src.policies:collect_trajectories:221 - Episode 28
2021-09-07 17:17:57.160 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:17:57.161 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 21.0
2021-09-07 17:17:57.162 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 21.0
2021-09-07 17:17:57.162 | INFO     | src.policies:collect_trajectories:221 - Episode 29
2021-09-07 17:17:57.168 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:17:57.169 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 18.0
2021-09-07 17:17:57.169 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 19.5
2021-09-07 17:17:57.170 | INFO     | src.policies:collect_trajectories:221 - Episode 30
2021-09-07 17:17:57.

2021-09-07 17:17:57.279 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 25.0
2021-09-07 17:17:57.279 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 21.0
2021-09-07 17:17:57.280 | INFO     | src.policies:collect_trajectories:221 - Episode 44
2021-09-07 17:17:57.285 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:17:57.286 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 23.0
2021-09-07 17:17:57.286 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 21.333333333333332
2021-09-07 17:17:57.287 | INFO     | src.policies:collect_trajectories:221 - Episode 45
2021-09-07 17:17:57.292 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:17:57.292 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 20.0
2021-09-07 17:17:57.293 | INFO     | src.policies:collect_trajector

2021-09-07 17:17:57.488 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:17:57.489 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.17285577952861786, 'baseline_loss': 1.4730218648910522, 'total_loss': 0.5636551380157471}
2021-09-07 17:17:57.490 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.18745467066764832
2021-09-07 17:17:57.491 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.7886437177658081
2021-09-07 17:17:57.492 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.18745467066764832
2021-09-07 17:17:57.493 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999943375587463
2021-09-07 17:17:57.495 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:17:57.496 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.240296855568885
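
In the log lines above, the baseline gradient norm drops from about 0.79 to just under 0.5 after clipping, while the policy gradient norm (about 0.19) is unchanged. This is the behaviour expected from clipping by global L2 norm with a maximum of 0.5. The NumPy sketch below illustrates that technique; the threshold is inferred from the logs, the small `eps` term is an assumption, and the function names are hypothetical rather than taken from the repository.

```python
import numpy as np


def global_norm(grads):
    """Joint L2 norm of a list of gradient arrays."""
    return np.sqrt(sum(np.sum(g ** 2) for g in grads))


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients by one factor so their joint L2 norm is at most `max_norm`."""
    scale = min(1.0, max_norm / (global_norm(grads) + eps))  # never scale gradients up
    return [g * scale for g in grads]


big = [np.array([0.6, 0.0]), np.array([0.0, 0.5])]     # norm ~0.78, above the threshold
small = [np.array([0.1, 0.0]), np.array([0.0, 0.15])]  # norm ~0.18, below the threshold

clipped_big = clip_by_global_norm(big, max_norm=0.5)
clipped_small = clip_by_global_norm(small, max_norm=0.5)
print(global_norm(clipped_big))    # just under 0.5
print(global_norm(clipped_small))  # unchanged, ~0.18
```

Using a single scale factor for all parameters preserves the direction of the gradient, which is the usual reason for preferring norm clipping over element-wise clipping.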

2021-09-07 17:17:57.613 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 13.0
2021-09-07 17:17:57.613 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 20.4
2021-09-07 17:17:57.614 | INFO     | src.policies:collect_trajectories:221 - Episode 71
2021-09-07 17:17:57.618 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:17:57.618 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 18.0
2021-09-07 17:17:57.619 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 20.0
2021-09-07 17:17:57.619 | INFO     | src.policies:collect_trajectories:221 - Episode 72
2021-09-07 17:17:57.625 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:17:57.625 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 21.0
2021-09-07 17:17:57.626 | INFO     | src.policies:collect_trajectories:238 - Last

2021-09-07 17:17:57.739 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.11859670281410217, 'baseline_loss': 1.0739201307296753, 'total_loss': 0.4183633625507355}
2021-09-07 17:17:57.740 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.18032552301883698
2021-09-07 17:17:57.741 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.8120718598365784
2021-09-07 17:17:57.742 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.18032552301883698
2021-09-07 17:17:57.744 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999934434890747
2021-09-07 17:17:57.745 | INFO     | src.policies:train:123 - Epoch 10 / 800
2021-09-07 17:17:57.746 | INFO     | src.policies:collect_trajectories:221 - Episode 84
2021-09-07 17:17:57.751 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021

2021-09-07 17:17:57.877 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 23.0
2021-09-07 17:17:57.877 | INFO     | src.policies:collect_trajectories:221 - Episode 98
2021-09-07 17:17:57.882 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:17:57.882 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 14.0
2021-09-07 17:17:57.883 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 22.0
2021-09-07 17:17:57.883 | INFO     | src.policies:collect_trajectories:221 - Episode 99
2021-09-07 17:17:57.887 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:17:57.888 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 10.0
2021-09-07 17:17:57.888 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 20.8
2021-09-07 17:17:57.892 | INFO     | src.policies:train:159 - Mini-batc

2021-09-07 17:17:58.078 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:17:58.079 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.15498928725719452, 'baseline_loss': 1.1203265190124512, 'total_loss': 0.4051739573478699}
2021-09-07 17:17:58.080 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.10220279544591904
2021-09-07 17:17:58.082 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.806701123714447
2021-09-07 17:17:58.083 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.10220279544591904
2021-09-07 17:17:58.084 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999940395355225
2021-09-07 17:17:58.086 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:17:58.087 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.1634016782045364

2021-09-07 17:17:58.199 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 12.0
2021-09-07 17:17:58.200 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 24.333333333333332
2021-09-07 17:17:58.200 | INFO     | src.policies:collect_trajectories:221 - Episode 125
2021-09-07 17:17:58.206 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:17:58.207 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 18.0
2021-09-07 17:17:58.207 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 22.75
2021-09-07 17:17:58.208 | INFO     | src.policies:collect_trajectories:221 - Episode 126
2021-09-07 17:17:58.212 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:17:58.213 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 18.0
2021-09-07 17:17:58.213 | INFO     | src.policies:collect_trajec

2021-09-07 17:17:58.316 | INFO     | src.policies:collect_trajectories:221 - Episode 140
2021-09-07 17:17:58.323 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:17:58.324 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 29.0
2021-09-07 17:17:58.325 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 23.11111111111111
2021-09-07 17:17:58.328 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:17:58.330 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.1983831524848938, 'baseline_loss': 1.1767024993896484, 'total_loss': 0.3899680972099304}
2021-09-07 17:17:58.330 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.04361369088292122
2021-09-07 17:17:58.331 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.8305467963218689
2021-09-07 17:17:58.332 | INFO     | src.policies:minibat

2021-09-07 17:17:58.428 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:17:58.428 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 11.0
2021-09-07 17:17:58.429 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 11.0
2021-09-07 17:17:58.429 | INFO     | src.policies:collect_trajectories:221 - Episode 152
2021-09-07 17:17:58.432 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:17:58.433 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 12.0
2021-09-07 17:17:58.433 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 11.5
2021-09-07 17:17:58.434 | INFO     | src.policies:collect_trajectories:221 - Episode 153
2021-09-07 17:17:58.437 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:17:58.438 | INFO     | src.policies:collect_trajectories:237 - M

2021-09-07 17:17:58.548 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 19.8
2021-09-07 17:17:58.549 | INFO     | src.policies:collect_trajectories:221 - Episode 167
2021-09-07 17:17:58.555 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:17:58.556 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 33.0
2021-09-07 17:17:58.556 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 22.0
2021-09-07 17:17:58.557 | INFO     | src.policies:collect_trajectories:221 - Episode 168
2021-09-07 17:17:58.568 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:17:58.569 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 56.0
2021-09-07 17:17:58.569 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 26.857142857142858
2021-09-07 17:17:58.569 | INFO     | src.policies:colle

2021-09-07 17:17:58.873 | INFO     | src.policies:collect_trajectories:221 - Episode 178
2021-09-07 17:17:58.880 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:17:58.880 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 31.0
2021-09-07 17:17:58.881 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 26.5
2021-09-07 17:17:58.881 | INFO     | src.policies:collect_trajectories:221 - Episode 179
2021-09-07 17:17:58.885 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:17:58.886 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 14.0
2021-09-07 17:17:58.887 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 22.333333333333332
2021-09-07 17:17:58.887 | INFO     | src.policies:collect_trajectories:221 - Episode 180
2021-09-07 17:17:59.032 | DEBUG    | src.policies:execute_episode:413 - Early s

2021-09-07 17:17:59.155 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.1210079938173294
2021-09-07 17:17:59.156 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.7229351997375488
2021-09-07 17:17:59.157 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.1210079938173294
2021-09-07 17:17:59.158 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999992847442627
2021-09-07 17:17:59.160 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:17:59.162 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.23609676957130432, 'baseline_loss': 1.1550917625427246, 'total_loss': 0.341449111700058}
2021-09-07 17:17:59.163 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.09787659347057343
2021-09-07 17:17:59.164 | INFO     | src.policies:minibatch_update:291 -

2021-09-07 17:17:59.349 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:17:59.350 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 48.0
2021-09-07 17:17:59.350 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 24.8
2021-09-07 17:17:59.351 | INFO     | src.policies:collect_trajectories:221 - Episode 206
2021-09-07 17:17:59.355 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:17:59.356 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 14.0
2021-09-07 17:17:59.356 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 23.0
2021-09-07 17:17:59.357 | INFO     | src.policies:collect_trajectories:221 - Episode 207
2021-09-07 17:17:59.362 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:17:59.362 | INFO     | src.policies:collect_trajectories:237 - M

2021-09-07 17:17:59.466 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:17:59.468 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.26167404651641846, 'baseline_loss': 1.0578150749206543, 'total_loss': 0.2672334909439087}
2021-09-07 17:17:59.469 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.08644621074199677
2021-09-07 17:17:59.470 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.672928512096405
2021-09-07 17:17:59.471 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.08644621074199677
2021-09-07 17:17:59.472 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999992549419403
2021-09-07 17:17:59.474 | INFO     | src.policies:train:123 - Epoch 25 / 800
2021-09-07 17:17:59.474 | INFO     | src.policies:collect_trajectories:221 - Episode 219
2021-09-07 17:17:59.481 | DEBU

2021-09-07 17:17:59.622 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 72.0
2021-09-07 17:17:59.623 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 37.714285714285715
2021-09-07 17:17:59.628 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:17:59.630 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.25840359926223755, 'baseline_loss': 1.1044288873672485, 'total_loss': 0.2938108444213867}
2021-09-07 17:17:59.631 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.06333320587873459
2021-09-07 17:17:59.631 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.5550678968429565
2021-09-07 17:17:59.633 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.06333320587873459
2021-09-07 17:17:59.634 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm 

2021-09-07 17:17:59.747 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 36.0
2021-09-07 17:17:59.747 | INFO     | src.policies:collect_trajectories:221 - Episode 244
2021-09-07 17:17:59.754 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:17:59.755 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 34.0
2021-09-07 17:17:59.756 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 35.333333333333336
2021-09-07 17:17:59.756 | INFO     | src.policies:collect_trajectories:221 - Episode 245
2021-09-07 17:17:59.762 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:17:59.763 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 24.0
2021-09-07 17:17:59.763 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 32.5
2021-09-07 17:17:59.764 | INFO     | src.policies:colle

2021-09-07 17:17:59.973 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:17:59.974 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.2889930009841919, 'baseline_loss': 1.1442428827285767, 'total_loss': 0.28312844038009644}
2021-09-07 17:17:59.975 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.09033088386058807
2021-09-07 17:17:59.976 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.648177981376648
2021-09-07 17:17:59.977 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.09033088386058807
2021-09-07 17:17:59.978 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999916553497314
2021-09-07 17:17:59.980 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:17:59.981 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.2852475643157959

2021-09-07 17:18:00.086 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 25.0
2021-09-07 17:18:00.087 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 18.5
2021-09-07 17:18:00.087 | INFO     | src.policies:collect_trajectories:221 - Episode 271
2021-09-07 17:18:00.094 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:00.095 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 36.0
2021-09-07 17:18:00.095 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 24.333333333333332
2021-09-07 17:18:00.096 | INFO     | src.policies:collect_trajectories:221 - Episode 272
2021-09-07 17:18:00.102 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:00.103 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 30.0
2021-09-07 17:18:00.103 | INFO     | src.policies:collect_traject

2021-09-07 17:18:00.221 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:00.222 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.3240794241428375, 'baseline_loss': 1.1539690494537354, 'total_loss': 0.25290510058403015}
2021-09-07 17:18:00.223 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.050189897418022156
2021-09-07 17:18:00.224 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.5918150544166565
2021-09-07 17:18:00.226 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.050189897418022156
2021-09-07 17:18:00.227 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999910593032837
2021-09-07 17:18:00.228 | INFO     | src.policies:train:123 - Epoch 33 / 800
2021-09-07 17:18:00.229 | INFO     | src.policies:collect_trajectories:221 - Episode 284
2021-09-07 17:18:00.232 | 

2021-09-07 17:18:00.351 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 16.0
2021-09-07 17:18:00.351 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 31.6
2021-09-07 17:18:00.351 | INFO     | src.policies:collect_trajectories:221 - Episode 298
2021-09-07 17:18:00.358 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:00.358 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 33.0
2021-09-07 17:18:00.359 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 31.833333333333332
2021-09-07 17:18:00.360 | INFO     | src.policies:collect_trajectories:221 - Episode 299
2021-09-07 17:18:00.364 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:00.365 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 13.0
2021-09-07 17:18:00.365 | INFO     | src.policies:collect_traject

2021-09-07 17:18:00.647 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 24.666666666666668
2021-09-07 17:18:00.647 | INFO     | src.policies:collect_trajectories:221 - Episode 309
2021-09-07 17:18:00.652 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:00.653 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 21.0
2021-09-07 17:18:00.653 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 23.75
2021-09-07 17:18:00.654 | INFO     | src.policies:collect_trajectories:221 - Episode 310
2021-09-07 17:18:00.658 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:00.658 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 17.0
2021-09-07 17:18:00.659 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 22.4
2021-09-07 17:18:00.659 | INFO     | src.policies:coll

2021-09-07 17:18:00.771 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:00.772 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.43674349784851074, 'baseline_loss': 1.1087031364440918, 'total_loss': 0.11760807037353516}
2021-09-07 17:18:00.773 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.3013475239276886
2021-09-07 17:18:00.774 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.4778384268283844
2021-09-07 17:18:00.775 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.3013475239276886
2021-09-07 17:18:00.776 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4778384268283844
2021-09-07 17:18:00.777 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:00.778 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.4433099329471588,

2021-09-07 17:18:00.911 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 59.0
2021-09-07 17:18:00.912 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 38.25
2021-09-07 17:18:00.912 | INFO     | src.policies:collect_trajectories:221 - Episode 336
2021-09-07 17:18:00.918 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:00.919 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 27.0
2021-09-07 17:18:00.919 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 36.0
2021-09-07 17:18:00.920 | INFO     | src.policies:collect_trajectories:221 - Episode 337
2021-09-07 17:18:00.930 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:00.931 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 51.0
2021-09-07 17:18:00.932 | INFO     | src.policies:collect_trajectories:238 - L

2021-09-07 17:18:01.228 | INFO     | src.policies:train:123 - Epoch 41 / 800
2021-09-07 17:18:01.229 | INFO     | src.policies:collect_trajectories:221 - Episode 347
2021-09-07 17:18:01.233 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:01.233 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 20.0
2021-09-07 17:18:01.234 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 20.0
2021-09-07 17:18:01.234 | INFO     | src.policies:collect_trajectories:221 - Episode 348
2021-09-07 17:18:01.237 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:01.238 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 13.0
2021-09-07 17:18:01.238 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 16.5
2021-09-07 17:18:01.239 | INFO     | src.policies:collect_trajectories:221 - Episode 349
2021-09-07 17:18

2021-09-07 17:18:01.371 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.4456835091114044, 'baseline_loss': 1.200881838798523, 'total_loss': 0.15475741028785706}
2021-09-07 17:18:01.371 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.07501809298992157
2021-09-07 17:18:01.372 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.23358958959579468
2021-09-07 17:18:01.373 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.07501809298992157
2021-09-07 17:18:01.374 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.23358958959579468
2021-09-07 17:18:01.375 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:01.376 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.4127517342567444, 'baseline_loss': 1.1794102191925049, 'total_loss': 0.17695337533950806}
2021

2021-09-07 17:18:01.503 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 34.6
2021-09-07 17:18:01.504 | INFO     | src.policies:collect_trajectories:221 - Episode 374
2021-09-07 17:18:01.516 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:01.517 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 64.0
2021-09-07 17:18:01.517 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 39.5
2021-09-07 17:18:01.522 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:01.524 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.3692206144332886, 'baseline_loss': 1.0083894729614258, 'total_loss': 0.13497412204742432}
2021-09-07 17:18:01.525 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.11954299360513687
2021-09-07 17:18:01.527 | INFO     | src.policies:minibatch_update:291 - Baseline n

2021-09-07 17:18:01.647 | INFO     | src.policies:collect_trajectories:221 - Episode 385
2021-09-07 17:18:01.651 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:01.652 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 15.0
2021-09-07 17:18:01.652 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 28.4
2021-09-07 17:18:01.652 | INFO     | src.policies:collect_trajectories:221 - Episode 386
2021-09-07 17:18:01.656 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:01.656 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 12.0
2021-09-07 17:18:01.657 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 25.666666666666668
2021-09-07 17:18:01.657 | INFO     | src.policies:collect_trajectories:221 - Episode 387
2021-09-07 17:18:01.663 | DEBUG    | src.policies:execute_episode:413 - Early s

2021-09-07 17:18:01.852 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.10557867586612701
2021-09-07 17:18:01.853 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.20648349821567535
2021-09-07 17:18:01.855 | INFO     | src.policies:train:123 - Epoch 48 / 800
2021-09-07 17:18:01.856 | INFO     | src.policies:collect_trajectories:221 - Episode 397
2021-09-07 17:18:01.860 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:01.860 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 14.0
2021-09-07 17:18:01.861 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 14.0
2021-09-07 17:18:01.861 | INFO     | src.policies:collect_trajectories:221 - Episode 398
2021-09-07 17:18:01.868 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:01.868 | INFO     | 

2021-09-07 17:18:01.991 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 30.285714285714285
2021-09-07 17:18:01.994 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:01.995 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.5116304159164429, 'baseline_loss': 1.2619110345840454, 'total_loss': 0.11932510137557983}
2021-09-07 17:18:01.996 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.15278927981853485
2021-09-07 17:18:01.997 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.34671786427497864
2021-09-07 17:18:01.998 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.15278927981853485
2021-09-07 17:18:02.000 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.34671786427497864
2021-09-07 17:18:02.002 | INFO     | src.policies:train:159 - Mini

2021-09-07 17:18:02.116 | INFO     | src.policies:collect_trajectories:221 - Episode 423
2021-09-07 17:18:02.121 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:02.122 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 19.0
2021-09-07 17:18:02.122 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 18.333333333333332
2021-09-07 17:18:02.123 | INFO     | src.policies:collect_trajectories:221 - Episode 424
2021-09-07 17:18:02.130 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:02.130 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 22.0
2021-09-07 17:18:02.131 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 19.25
2021-09-07 17:18:02.131 | INFO     | src.policies:collect_trajectories:221 - Episode 425
2021-09-07 17:18:02.148 | DEBUG    | src.policies:execute_episode:413 - Early 

2021-09-07 17:18:02.260 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.5169611573219299, 'baseline_loss': 1.3333277702331543, 'total_loss': 0.14970272779464722}
2021-09-07 17:18:02.261 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.07441367954015732
2021-09-07 17:18:02.262 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.23442316055297852
2021-09-07 17:18:02.263 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.07441367954015732
2021-09-07 17:18:02.265 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.23442316055297852
2021-09-07 17:18:02.267 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:02.268 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.4976745545864105, 'baseline_loss': 1.2325730323791504, 'total_loss': 0.11861196160316467}

2021-09-07 17:18:02.560 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 18.0
2021-09-07 17:18:02.560 | INFO     | src.policies:collect_trajectories:221 - Episode 450
2021-09-07 17:18:02.569 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:02.570 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 40.0
2021-09-07 17:18:02.570 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 21.666666666666668
2021-09-07 17:18:02.571 | INFO     | src.policies:collect_trajectories:221 - Episode 451
2021-09-07 17:18:02.577 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:02.577 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 27.0
2021-09-07 17:18:02.578 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 22.428571428571427
2021-09-07 17:18:02.578 | INFO     | src.

2021-09-07 17:18:02.693 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:02.694 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.46958351135253906, 'baseline_loss': 1.2992199659347534, 'total_loss': 0.18002647161483765}
2021-09-07 17:18:02.695 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.09623665362596512
2021-09-07 17:18:02.696 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.35026031732559204
2021-09-07 17:18:02.697 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.09623665362596512
2021-09-07 17:18:02.698 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.35026031732559204
2021-09-07 17:18:02.700 | INFO     | src.policies:train:123 - Epoch 56 / 800
2021-09-07 17:18:02.701 | INFO     | src.policies:collect_trajectories:221 - Episode 463
2021-09-07 17:18:02.710 | 

2021-09-07 17:18:02.847 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.11830920726060867
2021-09-07 17:18:02.849 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.39152973890304565
2021-09-07 17:18:02.850 | INFO     | src.policies:train:123 - Epoch 58 / 800
2021-09-07 17:18:02.851 | INFO     | src.policies:collect_trajectories:221 - Episode 473
2021-09-07 17:18:02.862 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:02.863 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 58.0
2021-09-07 17:18:02.863 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 58.0
2021-09-07 17:18:02.864 | INFO     | src.policies:collect_trajectories:221 - Episode 474
2021-09-07 17:18:02.871 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:02.871 | INFO     | 

2021-09-07 17:18:03.075 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 23.0
2021-09-07 17:18:03.075 | INFO     | src.policies:collect_trajectories:221 - Episode 488
2021-09-07 17:18:03.079 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:03.080 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 14.0
2021-09-07 17:18:03.080 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 22.0
2021-09-07 17:18:03.081 | INFO     | src.policies:collect_trajectories:221 - Episode 489
2021-09-07 17:18:03.087 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:03.088 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 25.0
2021-09-07 17:18:03.088 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 22.3
2021-09-07 17:18:03.092 | INFO     | src.policies:train:159 - Mini-ba

2021-09-07 17:18:03.371 | INFO     | src.policies:collect_trajectories:221 - Episode 499
2021-09-07 17:18:03.381 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:03.382 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 51.0
2021-09-07 17:18:03.383 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 29.25
2021-09-07 17:18:03.383 | INFO     | src.policies:collect_trajectories:221 - Episode 500
2021-09-07 17:18:03.387 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:03.388 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 14.0
2021-09-07 17:18:03.388 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 26.2
2021-09-07 17:18:03.389 | INFO     | src.policies:collect_trajectories:221 - Episode 501
2021-09-07 17:18:03.394 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all 

2021-09-07 17:18:03.512 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 32.0
2021-09-07 17:18:03.512 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 32.0
2021-09-07 17:18:03.513 | INFO     | src.policies:collect_trajectories:221 - Episode 511
2021-09-07 17:18:03.528 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:03.529 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 81.0
2021-09-07 17:18:03.529 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 56.5
2021-09-07 17:18:03.530 | INFO     | src.policies:collect_trajectories:221 - Episode 512
2021-09-07 17:18:03.549 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:03.550 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 104.0
2021-09-07 17:18:03.550 | INFO     | src.policies:collect_trajectories:238 - L

2021-09-07 17:18:03.717 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 19.4
2021-09-07 17:18:03.717 | INFO     | src.policies:collect_trajectories:221 - Episode 522
2021-09-07 17:18:03.721 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:03.722 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 11.0
2021-09-07 17:18:03.722 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 18.0
2021-09-07 17:18:03.722 | INFO     | src.policies:collect_trajectories:221 - Episode 523
2021-09-07 17:18:03.728 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:03.729 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 32.0
2021-09-07 17:18:03.729 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 20.0
2021-09-07 17:18:03.730 | INFO     | src.policies:collect_trajectorie

2021-09-07 17:18:03.836 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.32200539112091064, 'baseline_loss': 0.8831784725189209, 'total_loss': 0.1195838451385498}
2021-09-07 17:18:03.837 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.15828406810760498
2021-09-07 17:18:03.838 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.7376359701156616
2021-09-07 17:18:03.839 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.15828406810760498
2021-09-07 17:18:03.840 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999992847442627
2021-09-07 17:18:03.842 | INFO     | src.policies:train:123 - Epoch 67 / 800
2021-09-07 17:18:03.842 | INFO     | src.policies:collect_trajectories:221 - Episode 535
2021-09-07 17:18:03.848 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done

2021-09-07 17:18:03.975 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.3110700249671936
2021-09-07 17:18:03.976 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.6891926527023315
2021-09-07 17:18:03.978 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.3110700249671936
2021-09-07 17:18:03.979 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999992847442627
2021-09-07 17:18:03.981 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:03.982 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.5169339179992676, 'baseline_loss': 1.3587348461151123, 'total_loss': 0.16243350505828857}
2021-09-07 17:18:03.983 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.27006807923316956
2021-09-07 17:18:03.984 | INFO     | src.policies:minibatch_update:291 

2021-09-07 17:18:04.114 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:04.114 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 29.0
2021-09-07 17:18:04.114 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 22.428571428571427
2021-09-07 17:18:04.115 | INFO     | src.policies:collect_trajectories:221 - Episode 561
2021-09-07 17:18:04.121 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:04.122 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 26.0
2021-09-07 17:18:04.122 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 22.875
2021-09-07 17:18:04.123 | INFO     | src.policies:collect_trajectories:221 - Episode 562
2021-09-07 17:18:04.303 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:04.304 | INFO     | src.policies:collect_traj

2021-09-07 17:18:04.432 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 101.0
2021-09-07 17:18:04.432 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 50.25
2021-09-07 17:18:04.435 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:04.437 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.48016202449798584, 'baseline_loss': 1.272067904472351, 'total_loss': 0.1558719277381897}
2021-09-07 17:18:04.438 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.32943859696388245
2021-09-07 17:18:04.439 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.5054675340652466
2021-09-07 17:18:04.441 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.32943859696388245
2021-09-07 17:18:04.442 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clippin

2021-09-07 17:18:04.591 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 31.166666666666668
2021-09-07 17:18:04.592 | INFO     | src.policies:collect_trajectories:221 - Episode 583
2021-09-07 17:18:04.597 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:04.597 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 21.0
2021-09-07 17:18:04.598 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 29.714285714285715
2021-09-07 17:18:04.603 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:04.604 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.5609764456748962, 'baseline_loss': 1.3868539333343506, 'total_loss': 0.13245052099227905}
2021-09-07 17:18:04.605 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.31398704648017883
2021-09-07 17:18:04.606 | INFO     | src.policies:minib

2021-09-07 17:18:04.736 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.12812195718288422
2021-09-07 17:18:04.737 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.5674140453338623
2021-09-07 17:18:04.738 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.12812195718288422
2021-09-07 17:18:04.739 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999916553497314
2021-09-07 17:18:04.741 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:04.742 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.5222604274749756, 'baseline_loss': 1.2480316162109375, 'total_loss': 0.10175538063049316}
2021-09-07 17:18:04.743 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.1463005095720291
2021-09-07 17:18:04.744 | INFO     | src.policies:minibatch_update:29

2021-09-07 17:18:04.976 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:04.976 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 37.0
2021-09-07 17:18:04.977 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 42.5
2021-09-07 17:18:04.977 | INFO     | src.policies:collect_trajectories:221 - Episode 602
2021-09-07 17:18:04.987 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:04.987 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 40.0
2021-09-07 17:18:04.988 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 41.666666666666664
2021-09-07 17:18:04.988 | INFO     | src.policies:collect_trajectories:221 - Episode 603
2021-09-07 17:18:04.997 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:04.998 | INFO     | src.policies:collect_trajec

2021-09-07 17:18:05.138 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 71.0
2021-09-07 17:18:05.138 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 43.333333333333336
2021-09-07 17:18:05.139 | INFO     | src.policies:collect_trajectories:221 - Episode 613
2021-09-07 17:18:05.150 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:05.150 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 38.0
2021-09-07 17:18:05.151 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 42.0
2021-09-07 17:18:05.151 | INFO     | src.policies:collect_trajectories:221 - Episode 614
2021-09-07 17:18:05.157 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:05.158 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 22.0
2021-09-07 17:18:05.158 | INFO     | src.policies:collect_traject

2021-09-07 17:18:05.307 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 51.333333333333336
2021-09-07 17:18:05.307 | INFO     | src.policies:collect_trajectories:221 - Episode 624
2021-09-07 17:18:05.313 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:05.314 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 25.0
2021-09-07 17:18:05.314 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 44.75
2021-09-07 17:18:05.314 | INFO     | src.policies:collect_trajectories:221 - Episode 625
2021-09-07 17:18:05.509 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:05.510 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 26.0
2021-09-07 17:18:05.510 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 41.0
2021-09-07 17:18:05.514 | INFO     | src.policies:trai

2021-09-07 17:18:05.660 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.1929067075252533
2021-09-07 17:18:05.661 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.8657205104827881
2021-09-07 17:18:05.662 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.1929067075252533
2021-09-07 17:18:05.664 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999934434890747
2021-09-07 17:18:05.665 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:05.666 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.5015142560005188, 'baseline_loss': 1.4489381313323975, 'total_loss': 0.22295480966567993}
2021-09-07 17:18:05.667 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.29608049988746643
2021-09-07 17:18:05.668 | INFO     | src.policies:minibatch_update:291

2021-09-07 17:18:05.818 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:05.819 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.49479275941848755, 'baseline_loss': 1.2154130935668945, 'total_loss': 0.11291378736495972}
2021-09-07 17:18:05.820 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.19849590957164764
2021-09-07 17:18:05.821 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.464422345161438
2021-09-07 17:18:05.822 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.19849590957164764
2021-09-07 17:18:05.823 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.464422345161438
2021-09-07 17:18:05.825 | INFO     | src.policies:train:123 - Epoch 88 / 800
2021-09-07 17:18:05.826 | INFO     | src.policies:collect_trajectories:221 - Episode 644
2021-09-07 17:18:05.829 | DEBU

2021-09-07 17:18:05.986 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:05.987 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 26.0
2021-09-07 17:18:05.988 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 26.0
2021-09-07 17:18:05.988 | INFO     | src.policies:collect_trajectories:221 - Episode 654
2021-09-07 17:18:06.001 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:06.002 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 65.0
2021-09-07 17:18:06.002 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 45.5
2021-09-07 17:18:06.003 | INFO     | src.policies:collect_trajectories:221 - Episode 655
2021-09-07 17:18:06.161 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:06.162 | INFO     | src.policies:collect_trajectories:237 - M

2021-09-07 17:18:06.304 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 56.0
2021-09-07 17:18:06.304 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 45.6
2021-09-07 17:18:06.308 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:06.309 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.3086192011833191, 'baseline_loss': 0.8722154498100281, 'total_loss': 0.12748852372169495}
2021-09-07 17:18:06.310 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.10154945403337479
2021-09-07 17:18:06.311 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.6041718125343323
2021-09-07 17:18:06.312 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.10154945403337479
2021-09-07 17:18:06.313 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping

2021-09-07 17:18:06.448 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.4185495674610138
2021-09-07 17:18:06.449 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.3027212917804718
2021-09-07 17:18:06.450 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.4185495674610138
2021-09-07 17:18:06.451 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.3027212917804718
2021-09-07 17:18:06.452 | INFO     | src.policies:train:123 - Epoch 95 / 800
2021-09-07 17:18:06.453 | INFO     | src.policies:collect_trajectories:221 - Episode 673
2021-09-07 17:18:06.472 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:06.472 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 107.0
2021-09-07 17:18:06.473 | INFO     | src.policies:collect_trajectories:238 - Last 100 episod

2021-09-07 17:18:06.636 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999916553497314
2021-09-07 17:18:06.637 | INFO     | src.policies:train:123 - Epoch 98 / 800
2021-09-07 17:18:06.638 | INFO     | src.policies:collect_trajectories:221 - Episode 679
2021-09-07 17:18:06.651 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:06.652 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 73.0
2021-09-07 17:18:06.658 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 73.0
2021-09-07 17:18:06.721 | INFO     | src.policies:collect_trajectories:221 - Episode 680
2021-09-07 17:18:06.743 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:06.743 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 115.0
2021-09-07 17:18:06.744 | INFO     | src.policies:collect_trajectories:238

2021-09-07 17:18:06.902 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.3213431239128113
2021-09-07 17:18:06.903 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.4919082522392273
2021-09-07 17:18:06.904 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.3213431239128113
2021-09-07 17:18:06.905 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4919082522392273
2021-09-07 17:18:06.906 | INFO     | src.policies:train:123 - Epoch 101 / 800
2021-09-07 17:18:06.907 | INFO     | src.policies:collect_trajectories:221 - Episode 687
2021-09-07 17:18:06.934 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:06.935 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 177.0
2021-09-07 17:18:06.935 | INFO     | src.policies:collect_trajectories:238 - Last 100 episo

2021-09-07 17:18:07.093 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999943375587463
2021-09-07 17:18:07.095 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:07.096 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.6969920992851257, 'baseline_loss': 1.7827702760696411, 'total_loss': 0.19439303874969482}
2021-09-07 17:18:07.097 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.5955057740211487
2021-09-07 17:18:07.098 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.5756452083587646
2021-09-07 17:18:07.099 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.49999919533729553
2021-09-07 17:18:07.101 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999961256980896
2021-09-07 17:18:07.103 | INFO     | src.policies:tr

2021-09-07 17:18:07.353 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:07.354 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.3567737638950348, 'baseline_loss': 0.9221231341362, 'total_loss': 0.10428780317306519}
2021-09-07 17:18:07.355 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.09743321686983109
2021-09-07 17:18:07.356 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.46000877022743225
2021-09-07 17:18:07.357 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.09743321686983109
2021-09-07 17:18:07.358 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.46000877022743225
2021-09-07 17:18:07.360 | INFO     | src.policies:train:123 - Epoch 107 / 800
2021-09-07 17:18:07.361 | INFO     | src.policies:collect_trajectories:221 - Episode 702
2021-09-07 17:18:07.385 | DEB

2021-09-07 17:18:07.678 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:07.678 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 62.0
2021-09-07 17:18:07.679 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 46.0
2021-09-07 17:18:07.679 | INFO     | src.policies:collect_trajectories:221 - Episode 710
2021-09-07 17:18:07.692 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:07.692 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 64.0
2021-09-07 17:18:07.693 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 50.5
2021-09-07 17:18:07.696 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:07.698 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.5769041776657104, 'baseline_loss': 1.2368029356002808, 'total_loss': 0.04149729013442993}

2021-09-07 17:18:07.904 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:07.904 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 20.0
2021-09-07 17:18:07.905 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 20.0
2021-09-07 17:18:07.905 | INFO     | src.policies:collect_trajectories:221 - Episode 717
2021-09-07 17:18:07.923 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:07.923 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 99.0
2021-09-07 17:18:07.924 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 59.5
2021-09-07 17:18:07.924 | INFO     | src.policies:collect_trajectories:221 - Episode 718
2021-09-07 17:18:07.943 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:07.944 | INFO     | src.policies:collect_trajectories:237 - M

2021-09-07 17:18:08.123 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.5424262285232544, 'baseline_loss': 1.2853182554244995, 'total_loss': 0.10023289918899536}
2021-09-07 17:18:08.124 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.33427363634109497
2021-09-07 17:18:08.126 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.7276410460472107
2021-09-07 17:18:08.127 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.33427363634109497
2021-09-07 17:18:08.128 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999992251396179
2021-09-07 17:18:08.130 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:08.131 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.473753958940506, 'baseline_loss': 1.386712670326233, 'total_loss': 0.21960237622261047}

2021-09-07 17:18:08.279 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4783543348312378
2021-09-07 17:18:08.280 | INFO     | src.policies:train:123 - Epoch 117 / 800
2021-09-07 17:18:08.281 | INFO     | src.policies:collect_trajectories:221 - Episode 731
2021-09-07 17:18:08.296 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:08.298 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 108.0
2021-09-07 17:18:08.299 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 108.0
2021-09-07 17:18:08.300 | INFO     | src.policies:collect_trajectories:221 - Episode 732
2021-09-07 17:18:08.315 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:08.316 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 93.0
2021-09-07 17:18:08.316 | INFO     | src.policies:collect_trajectories:23

2021-09-07 17:18:08.609 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999994933605194
2021-09-07 17:18:08.610 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:08.611 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.2660945653915405, 'baseline_loss': 0.579770565032959, 'total_loss': 0.023790717124938965}
2021-09-07 17:18:08.612 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.1691228300333023
2021-09-07 17:18:08.613 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.143545150756836
2021-09-07 17:18:08.615 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.1691228300333023
2021-09-07 17:18:08.616 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999995827674866
2021-09-07 17:18:08.618 | INFO     | src.policies:train:

2021-09-07 17:18:08.791 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:08.793 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.37853333353996277, 'baseline_loss': 0.6674444079399109, 'total_loss': -0.044811129570007324}
2021-09-07 17:18:08.794 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.12044239044189453
2021-09-07 17:18:08.795 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.4347134828567505
2021-09-07 17:18:08.796 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.12044239044189453
2021-09-07 17:18:08.797 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4347134828567505
2021-09-07 17:18:08.799 | INFO     | src.policies:train:123 - Epoch 123 / 800
2021-09-07 17:18:08.800 | INFO     | src.policies:collect_trajectories:221 - Episode 745
2021-09-07 17:18:08.807 |

2021-09-07 17:18:09.055 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:09.055 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 146.0
2021-09-07 17:18:09.056 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 86.66666666666667
2021-09-07 17:18:09.060 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:09.061 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.41632840037345886, 'baseline_loss': 0.7949144840240479, 'total_loss': -0.018871158361434937}
2021-09-07 17:18:09.062 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.21490542590618134
2021-09-07 17:18:09.063 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.30727341771125793
2021-09-07 17:18:09.065 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.21490542590618134

2021-09-07 17:18:09.257 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.4362315833568573
2021-09-07 17:18:09.258 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999934434890747
2021-09-07 17:18:09.260 | INFO     | src.policies:train:159 - Mini-batch 2 / 3
2021-09-07 17:18:09.261 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.4691557288169861, 'baseline_loss': 1.071799874305725, 'total_loss': 0.06674420833587646}
2021-09-07 17:18:09.262 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.2041151374578476
2021-09-07 17:18:09.263 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.9188663959503174
2021-09-07 17:18:09.264 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.2041151374578476
2021-09-07 17:18:09.265 | INFO     | src.policies:minibat

2021-09-07 17:18:09.431 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 95.0
2021-09-07 17:18:09.431 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 95.0
2021-09-07 17:18:09.432 | INFO     | src.policies:collect_trajectories:221 - Episode 763
2021-09-07 17:18:09.453 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:09.454 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 113.0
2021-09-07 17:18:09.455 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 104.0
2021-09-07 17:18:09.458 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:09.461 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.3032114803791046, 'baseline_loss': 0.4971197545528412, 'total_loss': -0.05465160310268402}
2021-09-07 17:18:09.462 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient nor

2021-09-07 17:18:09.806 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.19782021641731262
2021-09-07 17:18:09.807 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.519811749458313
2021-09-07 17:18:09.808 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.19782021641731262
2021-09-07 17:18:09.810 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999989867210388
2021-09-07 17:18:09.811 | INFO     | src.policies:train:123 - Epoch 134 / 800
2021-09-07 17:18:09.812 | INFO     | src.policies:collect_trajectories:221 - Episode 769
2021-09-07 17:18:09.822 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:09.822 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 54.0
2021-09-07 17:18:09.823 | INFO     | src.policies:collect_trajectories:238 - Last 100 episo

2021-09-07 17:18:09.971 | INFO     | src.policies:train:123 - Epoch 136 / 800
2021-09-07 17:18:09.971 | INFO     | src.policies:collect_trajectories:221 - Episode 777
2021-09-07 17:18:09.985 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:09.986 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 87.0
2021-09-07 17:18:09.986 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 87.0
2021-09-07 17:18:09.987 | INFO     | src.policies:collect_trajectories:221 - Episode 778
2021-09-07 17:18:09.994 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:09.994 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 28.0
2021-09-07 17:18:09.995 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 57.5
2021-09-07 17:18:09.995 | INFO     | src.policies:collect_trajectories:221 - Episode 779

2021-09-07 17:18:10.168 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.8260717988014221
2021-09-07 17:18:10.170 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.2238411158323288
2021-09-07 17:18:10.172 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999937415122986
2021-09-07 17:18:10.173 | INFO     | src.policies:train:123 - Epoch 139 / 800
2021-09-07 17:18:10.173 | INFO     | src.policies:collect_trajectories:221 - Episode 785
2021-09-07 17:18:10.184 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:10.184 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 62.0
2021-09-07 17:18:10.185 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 62.0
2021-09-07 17:18:10.185 | INFO     | src.policies:collect_trajectories:221 - Episode 786

2021-09-07 17:18:10.430 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:10.431 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.27866441011428833, 'baseline_loss': 0.4892826974391937, 'total_loss': -0.03402306139469147}
2021-09-07 17:18:10.432 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.13048571348190308
2021-09-07 17:18:10.433 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.9919720888137817
2021-09-07 17:18:10.434 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.13048571348190308
2021-09-07 17:18:10.435 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999943375587463
2021-09-07 17:18:10.436 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:10.437 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.2058754861354

2021-09-07 17:18:10.623 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.7120163440704346
2021-09-07 17:18:10.624 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.16904424130916595
2021-09-07 17:18:10.625 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999993145465851
2021-09-07 17:18:10.626 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:10.627 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.25482800602912903, 'baseline_loss': 0.7666146159172058, 'total_loss': 0.12847930192947388}
2021-09-07 17:18:10.628 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.19430610537528992
2021-09-07 17:18:10.629 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.7946175932884216
2021-09-07 17:18:10.631 | INFO     | src.policies:minibatch_update:

2021-09-07 17:18:10.980 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.4009104073047638, 'baseline_loss': 0.6585647463798523, 'total_loss': -0.07162803411483765}
2021-09-07 17:18:10.981 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.4507552981376648
2021-09-07 17:18:10.982 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.6161497831344604
2021-09-07 17:18:10.983 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.4507552981376648
2021-09-07 17:18:10.985 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999916553497314
2021-09-07 17:18:10.986 | INFO     | src.policies:train:159 - Mini-batch 2 / 3
2021-09-07 17:18:10.987 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.32568567991256714, 'baseline_loss': 0.5754920840263367, 'total_loss': -0.037939637899398804}

2021-09-07 17:18:11.203 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 188.0
2021-09-07 17:18:11.204 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 190.5
2021-09-07 17:18:11.208 | INFO     | src.policies:train:159 - Mini-batch 1 / 3
2021-09-07 17:18:11.210 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.2520011067390442, 'baseline_loss': 0.5222777724266052, 'total_loss': 0.009137779474258423}
2021-09-07 17:18:11.211 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.27723222970962524
2021-09-07 17:18:11.212 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.0586267709732056
2021-09-07 17:18:11.213 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.27723222970962524
2021-09-07 17:18:11.214 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipp

2021-09-07 17:18:11.378 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 71.0
2021-09-07 17:18:11.379 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 71.0
2021-09-07 17:18:11.379 | INFO     | src.policies:collect_trajectories:221 - Episode 816
2021-09-07 17:18:11.397 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:11.398 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 98.0
2021-09-07 17:18:11.398 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 84.5
2021-09-07 17:18:11.399 | INFO     | src.policies:collect_trajectories:221 - Episode 817
2021-09-07 17:18:11.406 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:11.407 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 31.0
2021-09-07 17:18:11.408 | INFO     | src.policies:collect_trajectories:238 - La

2021-09-07 17:18:11.626 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:11.627 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.410436749458313, 'baseline_loss': 0.6089588403701782, 'total_loss': -0.10595732927322388}
2021-09-07 17:18:11.628 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.12385214865207672
2021-09-07 17:18:11.629 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.3309799134731293
2021-09-07 17:18:11.630 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.12385214865207672
2021-09-07 17:18:11.631 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.3309799134731293
2021-09-07 17:18:11.632 | INFO     | src.policies:train:123 - Epoch 156 / 800
2021-09-07 17:18:11.633 | INFO     | src.policies:collect_trajectories:221 - Episode 823
2021-09-07 17:18:11.650 | DE

2021-09-07 17:18:11.931 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.7860007286071777
2021-09-07 17:18:11.932 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.41065138578414917
2021-09-07 17:18:11.934 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999934434890747
2021-09-07 17:18:11.935 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:11.936 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.5896462798118591, 'baseline_loss': 1.580430030822754, 'total_loss': 0.20056873559951782}
2021-09-07 17:18:11.937 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.2952743470668793
2021-09-07 17:18:11.938 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.0590906143188477
2021-09-07 17:18:11.939 | INFO     | src.policies:minibatch_update:29

2021-09-07 17:18:12.155 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.3692125082015991
2021-09-07 17:18:12.156 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.38704851269721985
2021-09-07 17:18:12.158 | INFO     | src.policies:train:123 - Epoch 161 / 800
2021-09-07 17:18:12.158 | INFO     | src.policies:collect_trajectories:221 - Episode 834
2021-09-07 17:18:12.184 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:12.185 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 167.0
2021-09-07 17:18:12.185 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 167.0
2021-09-07 17:18:12.186 | INFO     | src.policies:collect_trajectories:221 - Episode 835
2021-09-07 17:18:12.208 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:12.209 | INFO     

2021-09-07 17:18:12.358 | INFO     | src.policies:collect_trajectories:221 - Episode 840
2021-09-07 17:18:12.378 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:12.379 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 127.0
2021-09-07 17:18:12.379 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 127.0
2021-09-07 17:18:12.380 | INFO     | src.policies:collect_trajectories:221 - Episode 841
2021-09-07 17:18:12.398 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:12.398 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 85.0
2021-09-07 17:18:12.399 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 106.0
2021-09-07 17:18:12.403 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:12.405 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.

2021-09-07 17:18:12.615 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:12.615 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 43.0
2021-09-07 17:18:12.616 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 39.0
2021-09-07 17:18:12.616 | INFO     | src.policies:collect_trajectories:221 - Episode 848
2021-09-07 17:18:12.640 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:12.641 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 141.0
2021-09-07 17:18:12.641 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 73.0
2021-09-07 17:18:12.645 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:12.648 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.6406899094581604, 'baseline_loss': 1.4064102172851562, 'total_loss': 0.06251519918441772}

2021-09-07 17:18:12.816 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:12.817 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 70.0
2021-09-07 17:18:12.818 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 70.0
2021-09-07 17:18:12.818 | INFO     | src.policies:collect_trajectories:221 - Episode 855
2021-09-07 17:18:12.835 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:12.836 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 96.0
2021-09-07 17:18:12.837 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 83.0
2021-09-07 17:18:12.837 | INFO     | src.policies:collect_trajectories:221 - Episode 856
2021-09-07 17:18:12.855 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:12.855 | INFO     | src.policies:collect_trajectories:237 - M

2021-09-07 17:18:13.025 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 100.0
2021-09-07 17:18:13.026 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 105.0
2021-09-07 17:18:13.029 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:13.031 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.2076948583126068, 'baseline_loss': 0.45004206895828247, 'total_loss': 0.017326176166534424}
2021-09-07 17:18:13.032 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.9417464137077332
2021-09-07 17:18:13.033 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.1202471256256104
2021-09-07 17:18:13.034 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.49999937415122986
2021-09-07 17:18:13.035 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipp

2021-09-07 17:18:13.238 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.499999076128006
2021-09-07 17:18:13.240 | INFO     | src.policies:train:123 - Epoch 176 / 800
2021-09-07 17:18:13.241 | INFO     | src.policies:collect_trajectories:221 - Episode 869
2021-09-07 17:18:13.255 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:13.256 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 89.0
2021-09-07 17:18:13.256 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 89.0
2021-09-07 17:18:13.257 | INFO     | src.policies:collect_trajectories:221 - Episode 870
2021-09-07 17:18:13.283 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:13.284 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 122.0
2021-09-07 17:18:13.284 | INFO     | src.policies:collect_trajectories:238 

2021-09-07 17:18:13.537 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 97.0
2021-09-07 17:18:13.537 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 97.0
2021-09-07 17:18:13.538 | INFO     | src.policies:collect_trajectories:221 - Episode 876
2021-09-07 17:18:13.554 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:13.555 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 93.0
2021-09-07 17:18:13.555 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 95.0
2021-09-07 17:18:13.556 | INFO     | src.policies:collect_trajectories:221 - Episode 877
2021-09-07 17:18:13.578 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:13.579 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 142.0
2021-09-07 17:18:13.580 | INFO     | src.policies:collect_trajectories:238 - L

2021-09-07 17:18:13.815 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 196.0
2021-09-07 17:18:13.927 | INFO     | src.policies:train:159 - Mini-batch 1 / 3
2021-09-07 17:18:13.928 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.3746868073940277, 'baseline_loss': 0.9085872769355774, 'total_loss': 0.07960683107376099}
2021-09-07 17:18:13.929 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.10403470695018768
2021-09-07 17:18:13.930 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.0704869031906128
2021-09-07 17:18:13.931 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.10403470695018768
2021-09-07 17:18:13.932 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999995529651642
2021-09-07 17:18:13.933 | INFO     | src.policies:train:159 - Mini-batch 2 / 3

2021-09-07 17:18:14.114 | INFO     | src.policies:train:159 - Mini-batch 1 / 3
2021-09-07 17:18:14.116 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.5142244100570679, 'baseline_loss': 0.997755229473114, 'total_loss': -0.015346795320510864}
2021-09-07 17:18:14.118 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.14766205847263336
2021-09-07 17:18:14.119 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.1928467750549316
2021-09-07 17:18:14.120 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.14766205847263336
2021-09-07 17:18:14.121 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999995231628418
2021-09-07 17:18:14.123 | INFO     | src.policies:train:159 - Mini-batch 2 / 3
2021-09-07 17:18:14.124 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.596564471721649

2021-09-07 17:18:14.350 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:14.350 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 162.0
2021-09-07 17:18:14.351 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 162.0
2021-09-07 17:18:14.351 | INFO     | src.policies:collect_trajectories:221 - Episode 892
2021-09-07 17:18:14.385 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:14.386 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:14.386 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 181.0
2021-09-07 17:18:14.389 | INFO     | src.policies:train:159 - Mini-batch 1 / 3
2021-09-07 17:18:14.391 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.6169216632843018, 'baseline_loss': 1.7305978536605835, 'total_loss': 0.24837726354599}

2021-09-07 17:18:14.541 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.4163171648979187
2021-09-07 17:18:14.542 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 2.1008622646331787
2021-09-07 17:18:14.543 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.4163171648979187
2021-09-07 17:18:14.544 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997615814209
2021-09-07 17:18:14.546 | INFO     | src.policies:train:123 - Epoch 190 / 800
2021-09-07 17:18:14.546 | INFO     | src.policies:collect_trajectories:221 - Episode 896
2021-09-07 17:18:14.562 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:14.563 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 100.0
2021-09-07 17:18:14.563 | INFO     | src.policies:collect_trajectories:238 - Last 100 episo

2021-09-07 17:18:14.827 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999995529651642
2021-09-07 17:18:14.829 | INFO     | src.policies:train:159 - Mini-batch 2 / 3
2021-09-07 17:18:14.830 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.4271913766860962, 'baseline_loss': 1.1676315069198608, 'total_loss': 0.15662437677383423}
2021-09-07 17:18:14.831 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.24686473608016968
2021-09-07 17:18:14.832 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.0573545694351196
2021-09-07 17:18:14.833 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.24686473608016968
2021-09-07 17:18:14.834 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999994933605194
2021-09-07 17:18:14.835 | INFO     | src.policies:tra

2021-09-07 17:18:15.016 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.4099530279636383
2021-09-07 17:18:15.017 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999993145465851
2021-09-07 17:18:15.020 | INFO     | src.policies:train:123 - Epoch 195 / 800
2021-09-07 17:18:15.020 | INFO     | src.policies:collect_trajectories:221 - Episode 906
2021-09-07 17:18:15.056 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:15.057 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 197.0
2021-09-07 17:18:15.058 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 197.0
2021-09-07 17:18:15.059 | INFO     | src.policies:collect_trajectories:221 - Episode 907
2021-09-07 17:18:15.090 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:15.091 | INFO     |

2021-09-07 17:18:15.261 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.2836533486843109
2021-09-07 17:18:15.262 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999940395355225
2021-09-07 17:18:15.263 | INFO     | src.policies:train:159 - Mini-batch 2 / 3
2021-09-07 17:18:15.265 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.3039223253726959, 'baseline_loss': 0.7863755822181702, 'total_loss': 0.08926546573638916}
2021-09-07 17:18:15.266 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.3370254933834076
2021-09-07 17:18:15.267 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.9874021410942078
2021-09-07 17:18:15.268 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.3370254933834076
2021-09-07 17:18:15.269 | INFO     | src.policies:miniba

2021-09-07 17:18:15.504 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 136.0
2021-09-07 17:18:15.505 | INFO     | src.policies:collect_trajectories:221 - Episode 918
2021-09-07 17:18:15.527 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:15.528 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 122.0
2021-09-07 17:18:15.528 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 129.0
2021-09-07 17:18:15.531 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:15.533 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.6269293427467346, 'baseline_loss': 1.3814196586608887, 'total_loss': 0.06378048658370972}
2021-09-07 17:18:15.534 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.31059348583221436
2021-09-07 17:18:15.535 | INFO     | src.policies:minibatch_update:291 - Baselin

2021-09-07 17:18:15.725 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.25861579179763794
2021-09-07 17:18:15.727 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.1243698596954346
2021-09-07 17:18:15.728 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.25861579179763794
2021-09-07 17:18:15.730 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999961256980896
2021-09-07 17:18:15.731 | INFO     | src.policies:train:159 - Mini-batch 2 / 3
2021-09-07 17:18:15.732 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.2396404892206192, 'baseline_loss': 0.513653576374054, 'total_loss': 0.017186298966407776}
2021-09-07 17:18:15.733 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.307763010263443
2021-09-07 17:18:15.734 | INFO     | src.policies:minibatch_update:291

2021-09-07 17:18:16.183 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:16.184 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:16.185 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 137.5
2021-09-07 17:18:16.188 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:16.190 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.4739943742752075, 'baseline_loss': 1.0751060247421265, 'total_loss': 0.06355863809585571}
2021-09-07 17:18:16.191 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.08597654849290848
2021-09-07 17:18:16.192 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.9974623918533325
2021-09-07 17:18:16.194 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.08597654849290848

2021-09-07 17:18:16.380 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.3526490032672882
2021-09-07 17:18:16.381 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999995231628418
2021-09-07 17:18:16.383 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:16.384 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.287180095911026, 'baseline_loss': 0.48187899589538574, 'total_loss': -0.04624059796333313}
2021-09-07 17:18:16.385 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.25190550088882446
2021-09-07 17:18:16.386 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.9173980355262756
2021-09-07 17:18:16.387 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.25190550088882446
2021-09-07 17:18:16.389 | INFO     | src.policies:mini

2021-09-07 17:18:16.589 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4654307961463928
2021-09-07 17:18:16.590 | INFO     | src.policies:train:159 - Mini-batch 3 / 3
2021-09-07 17:18:16.592 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.45172202587127686, 'baseline_loss': 1.060646414756775, 'total_loss': 0.0786011815071106}
2021-09-07 17:18:16.593 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.45834168791770935
2021-09-07 17:18:16.593 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.4597052335739136
2021-09-07 17:18:16.594 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.45834168791770935
2021-09-07 17:18:16.595 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999967217445374
2021-09-07 17:18:16.597 | INFO     | src.policies:tra

2021-09-07 17:18:16.858 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.3888026475906372
2021-09-07 17:18:16.859 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.20917141437530518
2021-09-07 17:18:16.860 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999961256980896
2021-09-07 17:18:16.862 | INFO     | src.policies:train:159 - Mini-batch 2 / 3
2021-09-07 17:18:16.863 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.20374350249767303, 'baseline_loss': 0.6509544253349304, 'total_loss': 0.12173371016979218}
2021-09-07 17:18:16.864 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.43471044301986694
2021-09-07 17:18:16.866 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.2041435241699219
2021-09-07 17:18:16.867 | INFO     | src.policies:minibatch_update

2021-09-07 17:18:17.082 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:17.083 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:17.083 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:17.086 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:17.087 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.6841245293617249, 'baseline_loss': 2.4629721641540527, 'total_loss': 0.5473615527153015}
2021-09-07 17:18:17.088 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.1343100219964981
2021-09-07 17:18:17.089 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 3.397463083267212
2021-09-07 17:18:17.090 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.1343100219964981

2021-09-07 17:18:17.351 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 151.5
2021-09-07 17:18:17.354 | INFO     | src.policies:train:159 - Mini-batch 1 / 3
2021-09-07 17:18:17.356 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.40451037883758545, 'baseline_loss': 0.6178886294364929, 'total_loss': -0.09556606411933899}
2021-09-07 17:18:17.357 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.4665355384349823
2021-09-07 17:18:17.358 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.7057991623878479
2021-09-07 17:18:17.360 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.4665355384349823
2021-09-07 17:18:17.361 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999992251396179
2021-09-07 17:18:17.362 | INFO     | src.policies:train:159 - Mini-batch 2 / 3

2021-09-07 17:18:17.523 | INFO     | src.policies:collect_trajectories:221 - Episode 958
2021-09-07 17:18:17.543 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:17.543 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 114.0
2021-09-07 17:18:17.544 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 114.0
2021-09-07 17:18:17.544 | INFO     | src.policies:collect_trajectories:221 - Episode 959
2021-09-07 17:18:17.584 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:17.585 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:17.585 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 157.0
2021-09-07 17:18:17.588 | INFO     | src.policies:train:159 - Mini-batch 1 / 3
2021-09-07 17:18:17.590 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0

2021-09-07 17:18:17.824 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.36337754130363464
2021-09-07 17:18:17.825 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997019767761
2021-09-07 17:18:17.826 | INFO     | src.policies:train:159 - Mini-batch 2 / 3
2021-09-07 17:18:17.828 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.6087082028388977, 'baseline_loss': 1.4948841333389282, 'total_loss': 0.1387338638305664}
2021-09-07 17:18:17.829 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.17246513068675995
2021-09-07 17:18:17.830 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.7622268199920654
2021-09-07 17:18:17.831 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.17246513068675995
2021-09-07 17:18:17.832 | INFO     | src.policies:minib

2021-09-07 17:18:18.022 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.2858635187149048
2021-09-07 17:18:18.023 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.1645325422286987
2021-09-07 17:18:18.024 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.2858635187149048
2021-09-07 17:18:18.025 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999995231628418
2021-09-07 17:18:18.027 | INFO     | src.policies:train:123 - Epoch 229 / 800
2021-09-07 17:18:18.027 | INFO     | src.policies:collect_trajectories:221 - Episode 968
2021-09-07 17:18:18.062 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:18.063 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:18.064 | INFO     | src.policies:collect_trajectories:238 - Last 100 episo

2021-09-07 17:18:18.404 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.491789847612381
2021-09-07 17:18:18.406 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.23112176358699799
2021-09-07 17:18:18.408 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.491789847612381
2021-09-07 17:18:18.410 | INFO     | src.policies:train:159 - Mini-batch 2 / 3
2021-09-07 17:18:18.412 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.3467756509780884, 'baseline_loss': 0.8108263611793518, 'total_loss': 0.058637529611587524}
2021-09-07 17:18:18.413 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.28771477937698364
2021-09-07 17:18:18.414 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.6814273595809937
2021-09-07 17:18:18.415 | INFO     | src.policies:minibatch_update:29

2021-09-07 17:18:18.634 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:18.634 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 192.0
2021-09-07 17:18:18.638 | INFO     | src.policies:train:159 - Mini-batch 1 / 3
2021-09-07 17:18:18.640 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.2408805638551712, 'baseline_loss': 0.4384835362434387, 'total_loss': -0.021638795733451843}
2021-09-07 17:18:18.642 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.15399572253227234
2021-09-07 17:18:18.642 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.5763144493103027
2021-09-07 17:18:18.643 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.15399572253227234
2021-09-07 17:18:18.645 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clip

2021-09-07 17:18:18.812 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:18.813 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.48540154099464417, 'baseline_loss': 0.963829517364502, 'total_loss': -0.0034867823123931885}
2021-09-07 17:18:18.814 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.19994764029979706
2021-09-07 17:18:18.815 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.4462238550186157
2021-09-07 17:18:18.817 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.19994764029979706
2021-09-07 17:18:18.818 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999967217445374
2021-09-07 17:18:18.819 | INFO     | src.policies:train:123 - Epoch 239 / 800
2021-09-07 17:18:18.820 | INFO     | src.policies:collect_trajectories:221 - Episode 981

2021-09-07 17:18:19.146 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999995827674866
2021-09-07 17:18:19.148 | INFO     | src.policies:train:123 - Epoch 242 / 800
2021-09-07 17:18:19.148 | INFO     | src.policies:collect_trajectories:221 - Episode 985
2021-09-07 17:18:19.179 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:19.180 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:19.180 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:19.182 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:19.184 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.4816175103187561, 'baseline_loss': 1.1864101886749268, 'total_loss': 0.11158758401870728}
2021-09-07 17:18:19.186 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient 

2021-09-07 17:18:19.374 | INFO     | src.policies:train:159 - Mini-batch 2 / 3
2021-09-07 17:18:19.375 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.29196497797966003, 'baseline_loss': 0.4633726179599762, 'total_loss': -0.060278668999671936}
2021-09-07 17:18:19.376 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.22475449740886688
2021-09-07 17:18:19.377 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.3689671754837036
2021-09-07 17:18:19.378 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.22475449740886688
2021-09-07 17:18:19.380 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999964237213135
2021-09-07 17:18:19.381 | INFO     | src.policies:train:159 - Mini-batch 3 / 3
2021-09-07 17:18:19.382 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.278042376041

2021-09-07 17:18:19.626 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 127.0
2021-09-07 17:18:19.626 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 127.0
2021-09-07 17:18:19.627 | INFO     | src.policies:collect_trajectories:221 - Episode 994
2021-09-07 17:18:19.661 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:19.662 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:19.663 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 163.5
2021-09-07 17:18:19.667 | INFO     | src.policies:train:159 - Mini-batch 1 / 3
2021-09-07 17:18:19.669 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.162174791097641, 'baseline_loss': 0.4498845636844635, 'total_loss': 0.06276749074459076}
2021-09-07 17:18:19.670 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient nor

2021-09-07 17:18:19.857 | INFO     | src.policies:train:159 - Mini-batch 1 / 3
2021-09-07 17:18:19.859 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.21299688518047333, 'baseline_loss': 0.45348572731018066, 'total_loss': 0.013745978474617004}
2021-09-07 17:18:19.860 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.5140014886856079
2021-09-07 17:18:19.862 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.414684772491455
2021-09-07 17:18:19.863 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.4999989867210388
2021-09-07 17:18:19.864 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999964237213135
2021-09-07 17:18:19.866 | INFO     | src.policies:train:159 - Mini-batch 2 / 3
2021-09-07 17:18:19.867 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.253276944160461

2021-09-07 17:18:20.035 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.5013110041618347
2021-09-07 17:18:20.036 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 4.68612003326416
2021-09-07 17:18:20.037 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.49999895691871643
2021-09-07 17:18:20.039 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999988079071045
2021-09-07 17:18:20.098 | INFO     | src.policies:train:123 - Epoch 256 / 800
2021-09-07 17:18:20.099 | INFO     | src.policies:collect_trajectories:221 - Episode 1002
2021-09-07 17:18:20.131 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:20.132 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:20.132 | INFO     | src.policies:collect_trajectories:238 - Last 100 epis

2021-09-07 17:18:20.399 | INFO     | src.policies:collect_trajectories:221 - Episode 1006
2021-09-07 17:18:20.416 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:20.416 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 104.0
2021-09-07 17:18:20.417 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 104.0
2021-09-07 17:18:20.417 | INFO     | src.policies:collect_trajectories:221 - Episode 1007
2021-09-07 17:18:20.448 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:20.449 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:20.449 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 152.0
2021-09-07 17:18:20.452 | INFO     | src.policies:train:159 - Mini-batch 1 / 3
2021-09-07 17:18:20.454 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': 

2021-09-07 17:18:20.652 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997019767761
2021-09-07 17:18:20.653 | INFO     | src.policies:train:159 - Mini-batch 3 / 3
2021-09-07 17:18:20.655 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.6352637410163879, 'baseline_loss': 1.8485995531082153, 'total_loss': 0.2890360355377197}
2021-09-07 17:18:20.656 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.16021248698234558
2021-09-07 17:18:20.658 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 2.1023333072662354
2021-09-07 17:18:20.659 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.16021248698234558
2021-09-07 17:18:20.660 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997317790985
2021-09-07 17:18:20.662 | INFO     | src.policies:trai

2021-09-07 17:18:20.843 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.3881438076496124
2021-09-07 17:18:20.845 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999988079071045
2021-09-07 17:18:20.846 | INFO     | src.policies:train:123 - Epoch 265 / 800
2021-09-07 17:18:20.847 | INFO     | src.policies:collect_trajectories:221 - Episode 1016
2021-09-07 17:18:20.877 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:20.878 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:20.878 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:20.881 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:20.883 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.14878083765506744, 'baseline_loss': 0.5227115750

2021-09-07 17:18:21.079 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:21.082 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:21.083 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.7921368479728699, 'baseline_loss': 2.3416764736175537, 'total_loss': 0.378701388835907}
2021-09-07 17:18:21.084 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.1639082133769989
2021-09-07 17:18:21.086 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 4.400983810424805
2021-09-07 17:18:21.087 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.1639082133769989
2021-09-07 17:18:21.088 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999988079071045
2021-09-07 17:18:21.089 | INFO     | src.policies:train:159 - Mini-batch 2 / 2

2021-09-07 17:18:21.388 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.23468299210071564
2021-09-07 17:18:21.389 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999964237213135
2021-09-07 17:18:21.391 | INFO     | src.policies:train:123 - Epoch 272 / 800
2021-09-07 17:18:21.391 | INFO     | src.policies:collect_trajectories:221 - Episode 1025
2021-09-07 17:18:21.410 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:21.411 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 124.0
2021-09-07 17:18:21.411 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 124.0
2021-09-07 17:18:21.412 | INFO     | src.policies:collect_trajectories:221 - Episode 1026
2021-09-07 17:18:21.444 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:21.445 | INFO  

2021-09-07 17:18:21.616 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:21.618 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:21.621 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.30286672711372375, 'baseline_loss': 0.7161865830421448, 'total_loss': 0.05522656440734863}
2021-09-07 17:18:21.623 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.21945995092391968
2021-09-07 17:18:21.624 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.098760962486267
2021-09-07 17:18:21.625 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.21945995092391968
2021-09-07 17:18:21.627 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999995529651642
2021-09-07 17:18:21.628 | INFO     | src.policies:train:159 - Mini-batch 2 / 2

2021-09-07 17:18:21.851 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.29798436164855957
2021-09-07 17:18:21.852 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999964237213135
2021-09-07 17:18:21.854 | INFO     | src.policies:train:123 - Epoch 279 / 800
2021-09-07 17:18:21.854 | INFO     | src.policies:collect_trajectories:221 - Episode 1034
2021-09-07 17:18:21.874 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:21.875 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 131.0
2021-09-07 17:18:21.875 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 131.0
2021-09-07 17:18:21.876 | INFO     | src.policies:collect_trajectories:221 - Episode 1035
2021-09-07 17:18:21.907 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:21.907 | INFO  

2021-09-07 17:18:22.042 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999995827674866
2021-09-07 17:18:22.044 | INFO     | src.policies:train:123 - Epoch 282 / 800
2021-09-07 17:18:22.044 | INFO     | src.policies:collect_trajectories:221 - Episode 1039
2021-09-07 17:18:22.064 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:22.065 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 136.0
2021-09-07 17:18:22.065 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 136.0
2021-09-07 17:18:22.066 | INFO     | src.policies:collect_trajectories:221 - Episode 1040
2021-09-07 17:18:22.083 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:22.084 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 108.0
2021-09-07 17:18:22.084 | INFO     | src.policies:collect_trajectories

2021-09-07 17:18:22.261 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 146.0
2021-09-07 17:18:22.264 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:22.266 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.4249342381954193, 'baseline_loss': 0.9055987596511841, 'total_loss': 0.02786514163017273}
2021-09-07 17:18:22.266 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.3636590540409088
2021-09-07 17:18:22.267 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.9418933987617493
2021-09-07 17:18:22.268 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.3636590540409088
2021-09-07 17:18:22.269 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999940395355225
2021-09-07 17:18:22.270 | INFO     | src.policies:train:159 - Mini-batch 2 / 2

2021-09-07 17:18:22.592 | INFO     | src.policies:train:159 - Mini-batch 1 / 3
2021-09-07 17:18:22.594 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.19629985094070435, 'baseline_loss': 0.47057095170021057, 'total_loss': 0.03898562490940094}
2021-09-07 17:18:22.595 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.16877257823944092
2021-09-07 17:18:22.596 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.6679171323776245
2021-09-07 17:18:22.597 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.16877257823944092
2021-09-07 17:18:22.598 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997019767761
2021-09-07 17:18:22.599 | INFO     | src.policies:train:159 - Mini-batch 2 / 3
2021-09-07 17:18:22.601 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.27599990367889

2021-09-07 17:18:22.789 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:22.790 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:22.790 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:22.793 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:22.795 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.13593655824661255, 'baseline_loss': 0.4831361770629883, 'total_loss': 0.10563153028488159}
2021-09-07 17:18:22.796 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.23522652685642242
2021-09-07 17:18:22.797 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.7093541622161865
2021-09-07 17:18:22.799 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.23522652685642242

2021-09-07 17:18:23.035 | INFO     | src.policies:train:123 - Epoch 295 / 800
2021-09-07 17:18:23.036 | INFO     | src.policies:collect_trajectories:221 - Episode 1058
2021-09-07 17:18:23.067 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:23.068 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:23.068 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:23.070 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:23.073 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.31735873222351074, 'baseline_loss': 0.43542158603668213, 'total_loss': -0.09964793920516968}
2021-09-07 17:18:23.073 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.18790099024772644
2021-09-07 17:18:23.074 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.11

2021-09-07 17:18:23.228 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999961256980896
2021-09-07 17:18:23.229 | INFO     | src.policies:train:123 - Epoch 299 / 800
2021-09-07 17:18:23.230 | INFO     | src.policies:collect_trajectories:221 - Episode 1062
2021-09-07 17:18:23.259 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:23.259 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:23.260 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:23.262 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:23.264 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.4234309494495392, 'baseline_loss': 1.2459063529968262, 'total_loss': 0.1995222270488739}
2021-09-07 17:18:23.264 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient

2021-09-07 17:18:23.431 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:23.434 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.3927168548107147, 'baseline_loss': 1.092500925064087, 'total_loss': 0.15353360772132874}
2021-09-07 17:18:23.435 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.14398762583732605
2021-09-07 17:18:23.436 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.656749427318573
2021-09-07 17:18:23.437 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.14398762583732605
2021-09-07 17:18:23.438 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999916553497314
2021-09-07 17:18:23.439 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:23.441 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.389823317527771, 

2021-09-07 17:18:23.788 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:23.792 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:23.794 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.2663658857345581, 'baseline_loss': 0.46792933344841003, 'total_loss': -0.03240121901035309}
2021-09-07 17:18:23.795 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.1195637583732605
2021-09-07 17:18:23.797 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.9078469276428223
2021-09-07 17:18:23.798 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.1195637583732605
2021-09-07 17:18:23.799 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999937415122986
2021-09-07 17:18:23.801 | INFO     | src.policies:train:159 - Mini-batch 2 / 2

2021-09-07 17:18:24.023 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:24.024 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:24.025 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:24.027 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.7233911156654358, 'baseline_loss': 2.3884923458099365, 'total_loss': 0.47085505723953247}
2021-09-07 17:18:24.028 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.4977518320083618
2021-09-07 17:18:24.029 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 3.5169384479522705
2021-09-07 17:18:24.030 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.4977518320083618
2021-09-07 17:18:24.031 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping

2021-09-07 17:18:24.232 | INFO     | src.policies:collect_trajectories:221 - Episode 1078
2021-09-07 17:18:24.257 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:24.258 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 177.0
2021-09-07 17:18:24.259 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 177.0
2021-09-07 17:18:24.259 | INFO     | src.policies:collect_trajectories:221 - Episode 1079
2021-09-07 17:18:24.281 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:24.282 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 137.0
2021-09-07 17:18:24.282 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 157.0
2021-09-07 17:18:24.286 | INFO     | src.policies:train:159 - Mini-batch 1 / 3
2021-09-07 17:18:24.288 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': 

2021-09-07 17:18:24.580 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:24.580 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 188.5
2021-09-07 17:18:24.583 | INFO     | src.policies:train:159 - Mini-batch 1 / 3
2021-09-07 17:18:24.586 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.44219815731048584, 'baseline_loss': 1.0291526317596436, 'total_loss': 0.07237815856933594}
2021-09-07 17:18:24.588 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.2264396995306015
2021-09-07 17:18:24.589 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.5679299831390381
2021-09-07 17:18:24.590 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.2264396995306015
2021-09-07 17:18:24.591 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clippin

2021-09-07 17:18:24.825 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.5714660286903381
2021-09-07 17:18:24.826 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.6887465119361877
2021-09-07 17:18:24.827 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.49999910593032837
2021-09-07 17:18:24.828 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999992251396179
2021-09-07 17:18:24.830 | INFO     | src.policies:train:159 - Mini-batch 2 / 3
2021-09-07 17:18:24.831 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.5792365670204163, 'baseline_loss': 1.3950262069702148, 'total_loss': 0.11827653646469116}
2021-09-07 17:18:24.832 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.7248097062110901
2021-09-07 17:18:24.833 | INFO     | src.policies:minibatch_update:291 

2021-09-07 17:18:25.002 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.5137607455253601, 'baseline_loss': 1.5708383321762085, 'total_loss': 0.27165842056274414}
2021-09-07 17:18:25.003 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.2789570689201355
2021-09-07 17:18:25.004 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.9556273221969604
2021-09-07 17:18:25.005 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.2789570689201355
2021-09-07 17:18:25.006 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997615814209
2021-09-07 17:18:25.008 | INFO     | src.policies:train:123 - Epoch 323 / 800
2021-09-07 17:18:25.009 | INFO     | src.policies:collect_trajectories:221 - Episode 1092
2021-09-07 17:18:25.016 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done

2021-09-07 17:18:25.198 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.12148456275463104
2021-09-07 17:18:25.199 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.6569643020629883
2021-09-07 17:18:25.201 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.12148456275463104
2021-09-07 17:18:25.203 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999992549419403
2021-09-07 17:18:25.204 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:25.205 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.2938474714756012, 'baseline_loss': 0.44085174798965454, 'total_loss': -0.07342159748077393}
2021-09-07 17:18:25.206 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.1278126835823059
2021-09-07 17:18:25.207 | INFO     | src.policies:minibatch_update:2

2021-09-07 17:18:25.457 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.13168972730636597, 'baseline_loss': 0.4728235900402069, 'total_loss': 0.10472206771373749}
2021-09-07 17:18:25.458 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.07956526428461075
2021-09-07 17:18:25.459 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.5186136960983276
2021-09-07 17:18:25.460 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.07956526428461075
2021-09-07 17:18:25.461 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999967217445374
2021-09-07 17:18:25.463 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:25.464 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.1535518616437912, 'baseline_loss': 0.4465338885784149, 'total_loss': 0.06971508264541626}

2021-09-07 17:18:25.645 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:25.646 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:25.649 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.13594159483909607, 'baseline_loss': 0.45421794056892395, 'total_loss': 0.0911673754453659}
2021-09-07 17:18:25.650 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.14391183853149414
2021-09-07 17:18:25.651 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.518497347831726
2021-09-07 17:18:25.652 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.14391183853149414
2021-09-07 17:18:25.653 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997019767761
2021-09-07 17:18:25.655 | INFO     | src.policies:train:159 - Mini-batch 2 / 2

2021-09-07 17:18:25.888 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:25.889 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:25.890 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:25.893 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.6721581816673279, 'baseline_loss': 2.8475711345672607, 'total_loss': 0.7516273856163025}
2021-09-07 17:18:25.894 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.8537241220474243
2021-09-07 17:18:25.895 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 3.9244863986968994
2021-09-07 17:18:25.896 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.49999943375587463
2021-09-07 17:18:25.897 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping

2021-09-07 17:18:26.051 | INFO     | src.policies:collect_trajectories:221 - Episode 1112
2021-09-07 17:18:26.080 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:26.081 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:26.081 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:26.083 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:26.085 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.30326366424560547, 'baseline_loss': 0.49540820717811584, 'total_loss': -0.055559560656547546}
2021-09-07 17:18:26.086 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.3937210440635681
2021-09-07 17:18:26.087 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.304807424545288
2021-09-07 17:18:26.089 | INFO     | src.policies:minibatch_upda

2021-09-07 17:18:26.237 | INFO     | src.policies:train:123 - Epoch 346 / 800
2021-09-07 17:18:26.237 | INFO     | src.policies:collect_trajectories:221 - Episode 1116
2021-09-07 17:18:26.266 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:26.266 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:26.267 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:26.269 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:26.272 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.38264986872673035, 'baseline_loss': 1.3723766803741455, 'total_loss': 0.3035384714603424}
2021-09-07 17:18:26.273 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.47700411081314087
2021-09-07 17:18:26.274 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.53053

2021-09-07 17:18:26.687 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999961256980896
2021-09-07 17:18:26.688 | INFO     | src.policies:train:123 - Epoch 350 / 800
2021-09-07 17:18:26.688 | INFO     | src.policies:collect_trajectories:221 - Episode 1120
2021-09-07 17:18:26.716 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:26.717 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:26.717 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:26.719 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:26.722 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.6201171278953552, 'baseline_loss': 1.2991151809692383, 'total_loss': 0.029440462589263916}
2021-09-07 17:18:26.723 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradie

2021-09-07 17:18:26.894 | INFO     | src.policies:train:159 - Mini-batch 2 / 3
2021-09-07 17:18:26.895 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.12072617560625076, 'baseline_loss': 0.5078584551811218, 'total_loss': 0.13320305943489075}
2021-09-07 17:18:26.896 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.2404179871082306
2021-09-07 17:18:26.897 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.4141533374786377
2021-09-07 17:18:26.898 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.2404179871082306
2021-09-07 17:18:26.899 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999964237213135
2021-09-07 17:18:26.900 | INFO     | src.policies:train:159 - Mini-batch 3 / 3
2021-09-07 17:18:26.901 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.0965600982308387

2021-09-07 17:18:27.147 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.36891815066337585, 'baseline_loss': 0.6763364672660828, 'total_loss': -0.030749917030334473}
2021-09-07 17:18:27.148 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.2511575222015381
2021-09-07 17:18:27.149 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.6517954468727112
2021-09-07 17:18:27.150 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.2511575222015381
2021-09-07 17:18:27.151 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999992549419403
2021-09-07 17:18:27.152 | INFO     | src.policies:train:159 - Mini-batch 2 / 3
2021-09-07 17:18:27.153 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.3629641830921173, 'baseline_loss': 0.6805068850517273, 'total_loss': -0.022710740566253662}

2021-09-07 17:18:27.312 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.6505684852600098
2021-09-07 17:18:27.313 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.10676635801792145
2021-09-07 17:18:27.314 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997317790985
2021-09-07 17:18:27.315 | INFO     | src.policies:train:123 - Epoch 360 / 800
2021-09-07 17:18:27.316 | INFO     | src.policies:collect_trajectories:221 - Episode 1133
2021-09-07 17:18:27.344 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:27.344 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:27.345 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:27.347 | INFO     | src.policies:train:159 - Mini-batch 1 / 2

2021-09-07 17:18:27.505 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.30805283784866333
2021-09-07 17:18:27.505 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.5495187044143677
2021-09-07 17:18:27.507 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.30805283784866333
2021-09-07 17:18:27.507 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997019767761
2021-09-07 17:18:27.509 | INFO     | src.policies:train:123 - Epoch 364 / 800
2021-09-07 17:18:27.509 | INFO     | src.policies:collect_trajectories:221 - Episode 1137
2021-09-07 17:18:27.540 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:27.540 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:27.541 | INFO     | src.policies:collect_trajectories:238 - Last 100 ep

2021-09-07 17:18:27.751 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.375771164894104, 'baseline_loss': 0.5763034820556641, 'total_loss': -0.08761942386627197}
2021-09-07 17:18:27.752 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.1350293755531311
2021-09-07 17:18:27.753 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.222791850566864
2021-09-07 17:18:27.754 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.1350293755531311
2021-09-07 17:18:27.755 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.222791850566864
2021-09-07 17:18:27.757 | INFO     | src.policies:train:123 - Epoch 368 / 800
2021-09-07 17:18:27.757 | INFO     | src.policies:collect_trajectories:221 - Episode 1141
2021-09-07 17:18:27.786 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done

2021-09-07 17:18:27.938 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:27.939 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.501645028591156, 'baseline_loss': 0.8406635522842407, 'total_loss': -0.08131325244903564}
2021-09-07 17:18:27.940 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.21512524783611298
2021-09-07 17:18:27.941 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.9607008099555969
2021-09-07 17:18:27.943 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.21512524783611298
2021-09-07 17:18:27.945 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999943375587463
2021-09-07 17:18:27.946 | INFO     | src.policies:train:123 - Epoch 372 / 800
2021-09-07 17:18:27.947 | INFO     | src.policies:collect_trajectories:221 - Episode 1145

2021-09-07 17:18:28.182 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999990463256836
2021-09-07 17:18:28.183 | INFO     | src.policies:train:123 - Epoch 375 / 800
2021-09-07 17:18:28.183 | INFO     | src.policies:collect_trajectories:221 - Episode 1149
2021-09-07 17:18:28.213 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:28.213 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:28.214 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:28.215 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:28.218 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.5076902508735657, 'baseline_loss': 1.1023855209350586, 'total_loss': 0.04350250959396362}
2021-09-07 17:18:28.219 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient

2021-09-07 17:18:28.367 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.1362721472978592
2021-09-07 17:18:28.368 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999961256980896
2021-09-07 17:18:28.370 | INFO     | src.policies:train:123 - Epoch 379 / 800
2021-09-07 17:18:28.370 | INFO     | src.policies:collect_trajectories:221 - Episode 1153
2021-09-07 17:18:28.399 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:28.400 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:28.400 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:28.402 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:28.404 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.2899593710899353, 'baseline_loss': 0.54514986276

2021-09-07 17:18:28.555 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.6260919570922852
2021-09-07 17:18:28.557 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.09445828199386597
2021-09-07 17:18:28.558 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999916553497314
2021-09-07 17:18:28.559 | INFO     | src.policies:train:123 - Epoch 383 / 800
2021-09-07 17:18:28.559 | INFO     | src.policies:collect_trajectories:221 - Episode 1157
2021-09-07 17:18:28.588 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:28.589 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:28.589 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:28.592 | INFO     | src.policies:train:159 - Mini-batch 1 / 2

2021-09-07 17:18:29.002 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.16373836994171143
2021-09-07 17:18:29.003 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999995529651642
2021-09-07 17:18:29.005 | INFO     | src.policies:train:159 - Mini-batch 2 / 3
2021-09-07 17:18:29.006 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.2896394729614258, 'baseline_loss': 0.5093467831611633, 'total_loss': -0.034966081380844116}
2021-09-07 17:18:29.007 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.15536366403102875
2021-09-07 17:18:29.008 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.7232069373130798
2021-09-07 17:18:29.009 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.15536366403102875
2021-09-07 17:18:29.010 | INFO     | src.policies:mi

2021-09-07 17:18:29.162 | INFO     | src.policies:train:123 - Epoch 390 / 800
2021-09-07 17:18:29.163 | INFO     | src.policies:collect_trajectories:221 - Episode 1165
2021-09-07 17:18:29.191 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:29.192 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:29.192 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:29.194 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:29.196 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.40575259923934937, 'baseline_loss': 0.5981131792068481, 'total_loss': -0.10669600963592529}
2021-09-07 17:18:29.197 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.2700880765914917
2021-09-07 17:18:29.198 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.4408

2021-09-07 17:18:29.415 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.2526404857635498, 'baseline_loss': 0.42742615938186646, 'total_loss': -0.03892740607261658}
2021-09-07 17:18:29.416 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.1337449550628662
2021-09-07 17:18:29.417 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.3824814558029175
2021-09-07 17:18:29.418 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.1337449550628662
2021-09-07 17:18:29.419 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999964237213135
2021-09-07 17:18:29.420 | INFO     | src.policies:train:159 - Mini-batch 3 / 3
2021-09-07 17:18:29.422 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.2914535403251648, 'baseline_loss': 0.41025570034980774, 'total_loss': -0.08632569015026093}

2021-09-07 17:18:29.601 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:29.603 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:29.605 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.4642680883407593, 'baseline_loss': 1.2792465686798096, 'total_loss': 0.1753551959991455}
2021-09-07 17:18:29.606 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.2607796788215637
2021-09-07 17:18:29.607 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.913916826248169
2021-09-07 17:18:29.608 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.2607796788215637
2021-09-07 17:18:29.609 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997317790985
2021-09-07 17:18:29.610 | INFO     | src.policies:train:159 - Mini-batch 2 / 2

2021-09-07 17:18:29.780 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.2548757493495941
2021-09-07 17:18:29.781 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999998211860657
2021-09-07 17:18:29.783 | INFO     | src.policies:train:123 - Epoch 401 / 800
2021-09-07 17:18:29.783 | INFO     | src.policies:collect_trajectories:221 - Episode 1178
2021-09-07 17:18:29.813 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:29.814 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:29.814 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:29.816 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:29.818 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.4290468096733093, 'baseline_loss': 1.107812285423

2021-09-07 17:18:30.031 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.6868997812271118
2021-09-07 17:18:30.032 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.1197444275021553
2021-09-07 17:18:30.033 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997615814209
2021-09-07 17:18:30.035 | INFO     | src.policies:train:123 - Epoch 405 / 800
2021-09-07 17:18:30.035 | INFO     | src.policies:collect_trajectories:221 - Episode 1182
2021-09-07 17:18:30.064 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:30.064 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:30.065 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:30.067 | INFO     | src.policies:train:159 - Mini-batch 1 / 2

2021-09-07 17:18:30.228 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.4999992549419403
2021-09-07 17:18:30.229 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999992549419403
2021-09-07 17:18:30.231 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:30.232 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.2718289792537689, 'baseline_loss': 1.213218331336975, 'total_loss': 0.33478018641471863}
2021-09-07 17:18:30.233 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.1636267602443695
2021-09-07 17:18:30.233 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.8592065572738647
2021-09-07 17:18:30.235 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.1636267602443695
2021-09-07 17:18:30.236 | INFO     | src.policies:minibatc

2021-09-07 17:18:30.413 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:30.415 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:30.418 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.668261706829071, 'baseline_loss': 1.69557785987854, 'total_loss': 0.17952722311019897}
2021-09-07 17:18:30.419 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.25066953897476196
2021-09-07 17:18:30.420 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 2.429137945175171
2021-09-07 17:18:30.421 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.25066953897476196
2021-09-07 17:18:30.422 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999998211860657
2021-09-07 17:18:30.423 | INFO     | src.policies:train:159 - Mini-batch 2 / 2

2021-09-07 17:18:30.653 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:30.654 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:30.657 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:30.659 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.18899936974048615, 'baseline_loss': 0.36808180809020996, 'total_loss': -0.0049584656953811646}
2021-09-07 17:18:30.660 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.1260567009449005
2021-09-07 17:18:30.661 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.1095333099365234
2021-09-07 17:18:30.662 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.1260567009449005
2021-09-07 17:18:30.663 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after cli

2021-09-07 17:18:30.977 | INFO     | src.policies:train:159 - Mini-batch 1 / 3
2021-09-07 17:18:30.980 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.21794359385967255, 'baseline_loss': 0.460184246301651, 'total_loss': 0.012148529291152954}
2021-09-07 17:18:30.981 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.19535812735557556
2021-09-07 17:18:30.982 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.5485113859176636
2021-09-07 17:18:30.983 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.19535812735557556
2021-09-07 17:18:30.984 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999967217445374
2021-09-07 17:18:30.985 | INFO     | src.policies:train:159 - Mini-batch 2 / 3
2021-09-07 17:18:30.986 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.15412920713424

2021-09-07 17:18:31.264 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.05726742744445801
2021-09-07 17:18:31.265 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.0902591943740845
2021-09-07 17:18:31.266 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.05726742744445801
2021-09-07 17:18:31.267 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999995529651642
2021-09-07 17:18:31.268 | INFO     | src.policies:train:123 - Epoch 423 / 800
2021-09-07 17:18:31.269 | INFO     | src.policies:collect_trajectories:221 - Episode 1204
2021-09-07 17:18:31.297 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:31.298 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:31.298 | INFO     | src.policies:collect_trajectories:238 - Last 100 ep

2021-09-07 17:18:31.445 | INFO     | src.policies:collect_trajectories:221 - Episode 1208
2021-09-07 17:18:31.465 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:31.466 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 106.0
2021-09-07 17:18:31.466 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 106.0
2021-09-07 17:18:31.467 | INFO     | src.policies:collect_trajectories:221 - Episode 1209
2021-09-07 17:18:31.497 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:31.498 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:31.498 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 153.0
2021-09-07 17:18:31.501 | INFO     | src.policies:train:159 - Mini-batch 1 / 3
2021-09-07 17:18:31.504 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': 

2021-09-07 17:18:31.709 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.39560645818710327
2021-09-07 17:18:31.710 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 2.8080506324768066
2021-09-07 17:18:31.712 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.39560645818710327
2021-09-07 17:18:31.713 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999998211860657
2021-09-07 17:18:31.714 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:31.715 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.7150508761405945, 'baseline_loss': 1.6249600648880005, 'total_loss': 0.09742915630340576}
2021-09-07 17:18:31.716 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.471818745136261
2021-09-07 17:18:31.717 | INFO     | src.policies:minibatch_update:291 

2021-09-07 17:18:31.898 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.23047740757465363, 'baseline_loss': 0.3649948835372925, 'total_loss': -0.047979965806007385}
2021-09-07 17:18:31.899 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.10850037634372711
2021-09-07 17:18:31.900 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.6939641833305359
2021-09-07 17:18:31.901 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.10850037634372711
2021-09-07 17:18:31.902 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999992251396179
2021-09-07 17:18:31.903 | INFO     | src.policies:train:123 - Epoch 433 / 800
2021-09-07 17:18:31.904 | INFO     | src.policies:collect_trajectories:221 - Episode 1217
2021-09-07 17:18:31.935 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done

2021-09-07 17:18:32.080 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.2829112112522125
2021-09-07 17:18:32.082 | INFO     | src.policies:train:123 - Epoch 436 / 800
2021-09-07 17:18:32.082 | INFO     | src.policies:collect_trajectories:221 - Episode 1221
2021-09-07 17:18:32.109 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:32.109 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 156.0
2021-09-07 17:18:32.110 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 156.0
2021-09-07 17:18:32.110 | INFO     | src.policies:collect_trajectories:221 - Episode 1222
2021-09-07 17:18:32.130 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:32.130 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 128.0
2021-09-07 17:18:32.131 | INFO     | src.policies:collect_trajectories

2021-09-07 17:18:32.350 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.2665089964866638
2021-09-07 17:18:32.350 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.9203476905822754
2021-09-07 17:18:32.351 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.2665089964866638
2021-09-07 17:18:32.353 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999943375587463
2021-09-07 17:18:32.354 | INFO     | src.policies:train:159 - Mini-batch 3 / 3
2021-09-07 17:18:32.355 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.49069443345069885, 'baseline_loss': 0.9107549786567688, 'total_loss': -0.03531694412231445}
2021-09-07 17:18:32.356 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.27168649435043335
2021-09-07 17:18:32.357 | INFO     | src.policies:minibatch_update:2

2021-09-07 17:18:32.515 | INFO     | src.policies:train:123 - Epoch 441 / 800
2021-09-07 17:18:32.515 | INFO     | src.policies:collect_trajectories:221 - Episode 1231
2021-09-07 17:18:32.542 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:32.542 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 179.0
2021-09-07 17:18:32.543 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 179.0
2021-09-07 17:18:32.543 | INFO     | src.policies:collect_trajectories:221 - Episode 1232
2021-09-07 17:18:32.565 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:32.565 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 127.0
2021-09-07 17:18:32.566 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 153.0
2021-09-07 17:18:32.569 | INFO     | src.policies:train:159 - Mini-batch 1 / 3

2021-09-07 17:18:32.739 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.28209519386291504
2021-09-07 17:18:32.740 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.12486843019723892
2021-09-07 17:18:32.796 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.28209519386291504
2021-09-07 17:18:32.798 | INFO     | src.policies:train:159 - Mini-batch 2 / 3
2021-09-07 17:18:32.800 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.38458192348480225, 'baseline_loss': 0.6489148139953613, 'total_loss': -0.06012451648712158}
2021-09-07 17:18:32.801 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.2208016812801361
2021-09-07 17:18:32.802 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.5125776529312134
2021-09-07 17:18:32.803 | INFO     | src.policies:minibatch_updat

2021-09-07 17:18:33.086 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.31028109788894653
2021-09-07 17:18:33.088 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.209540605545044
2021-09-07 17:18:33.089 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.31028109788894653
2021-09-07 17:18:33.089 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999964237213135
2021-09-07 17:18:33.091 | INFO     | src.policies:train:123 - Epoch 446 / 800
2021-09-07 17:18:33.091 | INFO     | src.policies:collect_trajectories:221 - Episode 1241
2021-09-07 17:18:33.117 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:33.118 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 186.0
2021-09-07 17:18:33.118 | INFO     | src.policies:collect_trajectories:238 - Last 100 ep

2021-09-07 17:18:33.431 | INFO     | src.policies:train:159 - Mini-batch 1 / 3
2021-09-07 17:18:33.434 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.4364531636238098, 'baseline_loss': 0.5621360540390015, 'total_loss': -0.15538513660430908}
2021-09-07 17:18:33.435 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.2369960993528366
2021-09-07 17:18:33.436 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.0914401039481163
2021-09-07 17:18:33.437 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.2369960993528366
2021-09-07 17:18:33.438 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.0914401039481163
2021-09-07 17:18:33.440 | INFO     | src.policies:train:159 - Mini-batch 2 / 3
2021-09-07 17:18:33.441 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.3659766912460327,

2021-09-07 17:18:33.618 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:33.618 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:33.619 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:33.621 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:33.623 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.6000048518180847, 'baseline_loss': 1.255016565322876, 'total_loss': 0.02750343084335327}
2021-09-07 17:18:33.624 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.4312174618244171
2021-09-07 17:18:33.625 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 2.6483101844787598
2021-09-07 17:18:33.626 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.4312174618244171

2021-09-07 17:18:33.808 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.18856137990951538
2021-09-07 17:18:33.809 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.38066989183425903
2021-09-07 17:18:33.810 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.18856137990951538
2021-09-07 17:18:33.811 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.38066989183425903
2021-09-07 17:18:33.813 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:33.814 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.43304935097694397, 'baseline_loss': 0.6687614321708679, 'total_loss': -0.09866863489151001}
2021-09-07 17:18:33.814 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.20943184196949005
2021-09-07 17:18:33.815 | INFO     | src.policies:minibatch_updat

2021-09-07 17:18:34.055 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.330289751291275, 'baseline_loss': 0.4181672930717468, 'total_loss': -0.12120610475540161}
2021-09-07 17:18:34.056 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.3505391478538513
2021-09-07 17:18:34.057 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.9796504974365234
2021-09-07 17:18:34.059 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.3505391478538513
2021-09-07 17:18:34.060 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999994933605194
2021-09-07 17:18:34.061 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:34.062 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.2611180245876312, 'baseline_loss': 0.3743806779384613, 'total_loss': -0.07392768561840057}

2021-09-07 17:18:34.228 | INFO     | src.policies:train:123 - Epoch 462 / 800
2021-09-07 17:18:34.229 | INFO     | src.policies:collect_trajectories:221 - Episode 1263
2021-09-07 17:18:34.253 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:34.254 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 165.0
2021-09-07 17:18:34.254 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 165.0
2021-09-07 17:18:34.254 | INFO     | src.policies:collect_trajectories:221 - Episode 1264
2021-09-07 17:18:34.285 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:34.286 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:34.286 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 182.5
2021-09-07 17:18:34.289 | INFO     | src.policies:train:159 - Mini-batch 1 / 3

2021-09-07 17:18:34.523 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:34.523 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 191.5
2021-09-07 17:18:34.526 | INFO     | src.policies:train:159 - Mini-batch 1 / 3
2021-09-07 17:18:34.530 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.5933182835578918, 'baseline_loss': 1.5986636877059937, 'total_loss': 0.20601356029510498}
2021-09-07 17:18:34.531 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.26852476596832275
2021-09-07 17:18:34.532 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 2.5573196411132812
2021-09-07 17:18:34.534 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.26852476596832275
2021-09-07 17:18:34.535 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clippi

2021-09-07 17:18:34.684 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.26410624384880066
2021-09-07 17:18:34.684 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.1952199935913086
2021-09-07 17:18:34.686 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.26410624384880066
2021-09-07 17:18:34.687 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999961256980896
2021-09-07 17:18:34.688 | INFO     | src.policies:train:123 - Epoch 468 / 800
2021-09-07 17:18:34.689 | INFO     | src.policies:collect_trajectories:221 - Episode 1273
2021-09-07 17:18:34.716 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:34.717 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:34.717 | INFO     | src.policies:collect_trajectories:238 - Last 100 e

2021-09-07 17:18:34.866 | INFO     | src.policies:collect_trajectories:221 - Episode 1277
2021-09-07 17:18:34.896 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:34.897 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:34.897 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:34.899 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:34.901 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.28201940655708313, 'baseline_loss': 0.6047320365905762, 'total_loss': 0.020346611738204956}
2021-09-07 17:18:34.902 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.30891484022140503
2021-09-07 17:18:34.903 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.5286729335784912
2021-09-07 17:18:34.904 | INFO     | src.policies:minibatch_upda

2021-09-07 17:18:35.201 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.20422542095184326
2021-09-07 17:18:35.202 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.9440163969993591
2021-09-07 17:18:35.203 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.20422542095184326
2021-09-07 17:18:35.204 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999995231628418
2021-09-07 17:18:35.205 | INFO     | src.policies:train:123 - Epoch 474 / 800
2021-09-07 17:18:35.206 | INFO     | src.policies:collect_trajectories:221 - Episode 1282
2021-09-07 17:18:35.234 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:35.235 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:35.235 | INFO     | src.policies:collect_trajectories:238 - Last 100 ep

2021-09-07 17:18:35.376 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.12622776627540588
2021-09-07 17:18:35.377 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997913837433
2021-09-07 17:18:35.378 | INFO     | src.policies:train:123 - Epoch 477 / 800
2021-09-07 17:18:35.379 | INFO     | src.policies:collect_trajectories:221 - Episode 1287
2021-09-07 17:18:35.410 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:35.410 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:35.411 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:35.412 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:35.415 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.52718186378479, 'baseline_loss': 1.2792353630065

2021-09-07 17:18:35.567 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 2.584242343902588
2021-09-07 17:18:35.568 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.29293182492256165
2021-09-07 17:18:35.569 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999998211860657
2021-09-07 17:18:35.570 | INFO     | src.policies:train:123 - Epoch 481 / 800
2021-09-07 17:18:35.571 | INFO     | src.policies:collect_trajectories:221 - Episode 1291
2021-09-07 17:18:35.746 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:35.747 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:35.747 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:35.749 | INFO     | src.policies:train:159 - Mini-batch 1 / 2

2021-09-07 17:18:35.914 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.17592914402484894
2021-09-07 17:18:35.915 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999967217445374
2021-09-07 17:18:35.917 | INFO     | src.policies:train:159 - Mini-batch 2 / 3
2021-09-07 17:18:35.918 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.23816627264022827, 'baseline_loss': 0.50181645154953, 'total_loss': 0.012741953134536743}
2021-09-07 17:18:35.919 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.13147585093975067
2021-09-07 17:18:35.920 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.5255435705184937
2021-09-07 17:18:35.921 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.13147585093975067
2021-09-07 17:18:35.922 | INFO     | src.policies:min

2021-09-07 17:18:36.110 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999992251396179
2021-09-07 17:18:36.111 | INFO     | src.policies:train:159 - Mini-batch 2 / 3
2021-09-07 17:18:36.112 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.4831022620201111, 'baseline_loss': 1.0190833806991577, 'total_loss': 0.026439428329467773}
2021-09-07 17:18:36.113 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.35903674364089966
2021-09-07 17:18:36.114 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.2428405284881592
2021-09-07 17:18:36.115 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.35903674364089966
2021-09-07 17:18:36.116 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999964237213135
2021-09-07 17:18:36.117 | INFO     | src.policies:t

2021-09-07 17:18:36.346 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:36.347 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.403476744890213, 'baseline_loss': 1.2739126682281494, 'total_loss': 0.2334795892238617}
2021-09-07 17:18:36.348 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.2988198697566986
2021-09-07 17:18:36.349 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.2415099143981934
2021-09-07 17:18:36.350 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.2988198697566986
2021-09-07 17:18:36.351 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999995529651642
2021-09-07 17:18:36.352 | INFO     | src.policies:train:123 - Epoch 491 / 800
2021-09-07 17:18:36.352 | INFO     | src.policies:collect_trajectories:221 - Episode 1306
2021-09-07 17:18:36.382 | DEBUG

2021-09-07 17:18:36.540 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999985098838806
2021-09-07 17:18:36.542 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:36.543 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.5774255990982056, 'baseline_loss': 1.287003993988037, 'total_loss': 0.06607639789581299}
2021-09-07 17:18:36.544 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.21309787034988403
2021-09-07 17:18:36.545 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 2.252711534500122
2021-09-07 17:18:36.546 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.21309787034988403
2021-09-07 17:18:36.547 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997317790985
2021-09-07 17:18:36.548 | INFO     | src.policies:trai

2021-09-07 17:18:36.720 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.25692859292030334
2021-09-07 17:18:36.721 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999967217445374
2021-09-07 17:18:36.723 | INFO     | src.policies:train:123 - Epoch 498 / 800
2021-09-07 17:18:36.723 | INFO     | src.policies:collect_trajectories:221 - Episode 1314
2021-09-07 17:18:36.753 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:36.754 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:36.754 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:36.756 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:36.759 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.14623822271823883, 'baseline_loss': 0.556932151

2021-09-07 17:18:36.994 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:36.995 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:36.998 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.16689759492874146, 'baseline_loss': 0.4295600652694702, 'total_loss': 0.04788243770599365}
2021-09-07 17:18:36.999 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.15999653935432434
2021-09-07 17:18:37.000 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.942746639251709
2021-09-07 17:18:37.002 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.15999653935432434
2021-09-07 17:18:37.004 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997615814209
2021-09-07 17:18:37.005 | INFO     | src.policies:train:159 - Mini-batch 2 / 2

2021-09-07 17:18:37.296 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:37.296 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:37.297 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:37.300 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:37.302 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.42909935116767883, 'baseline_loss': 0.9432332515716553, 'total_loss': 0.042517274618148804}
2021-09-07 17:18:37.303 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.14085087180137634
2021-09-07 17:18:37.304 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.4099324941635132
2021-09-07 17:18:37.305 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.14085087180137634

2021-09-07 17:18:37.499 | INFO     | src.policies:train:123 - Epoch 509 / 800
2021-09-07 17:18:37.500 | INFO     | src.policies:collect_trajectories:221 - Episode 1326
2021-09-07 17:18:37.529 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:37.530 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:37.530 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:37.532 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:37.535 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.058391470462083817, 'baseline_loss': 0.5033005475997925, 'total_loss': 0.19325880706310272}
2021-09-07 17:18:37.536 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.11148972064256668
2021-09-07 17:18:37.537 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.886

2021-09-07 17:18:37.688 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999998211860657
2021-09-07 17:18:37.689 | INFO     | src.policies:train:123 - Epoch 513 / 800
2021-09-07 17:18:37.690 | INFO     | src.policies:collect_trajectories:221 - Episode 1330
2021-09-07 17:18:37.718 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:37.719 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:37.719 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:37.723 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:37.726 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.5639148950576782, 'baseline_loss': 0.9142826199531555, 'total_loss': -0.10677358508110046}
2021-09-07 17:18:37.727 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradien

2021-09-07 17:18:37.939 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.2904076874256134
2021-09-07 17:18:37.940 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999998211860657
2021-09-07 17:18:37.942 | INFO     | src.policies:train:123 - Epoch 517 / 800
2021-09-07 17:18:37.943 | INFO     | src.policies:collect_trajectories:221 - Episode 1334
2021-09-07 17:18:37.972 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:37.973 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:37.973 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:37.975 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:37.978 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.6502429246902466, 'baseline_loss': 2.058404922485

2021-09-07 17:18:38.127 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.9533582925796509
2021-09-07 17:18:38.128 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.31842276453971863
2021-09-07 17:18:38.129 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997317790985
2021-09-07 17:18:38.131 | INFO     | src.policies:train:123 - Epoch 521 / 800
2021-09-07 17:18:38.131 | INFO     | src.policies:collect_trajectories:221 - Episode 1338
2021-09-07 17:18:38.140 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:38.141 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 53.0
2021-09-07 17:18:38.141 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 53.0
2021-09-07 17:18:38.141 | INFO     | src.policies:collect_trajectories:221 - Episode 1339

2021-09-07 17:18:38.322 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.11898521333932877, 'baseline_loss': 0.4788737893104553, 'total_loss': 0.1204516813158989}
2021-09-07 17:18:38.322 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.24109013378620148
2021-09-07 17:18:38.324 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.5099471807479858
2021-09-07 17:18:38.325 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.24109013378620148
2021-09-07 17:18:38.326 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999967217445374
2021-09-07 17:18:38.327 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:38.328 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.20087888836860657, 'baseline_loss': 0.5130429267883301, 'total_loss': 0.05564257502555847}

2021-09-07 17:18:38.682 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.13892114162445068
2021-09-07 17:18:38.683 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.5000180006027222
2021-09-07 17:18:38.684 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.13892114162445068
2021-09-07 17:18:38.685 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997019767761
2021-09-07 17:18:38.686 | INFO     | src.policies:train:159 - Mini-batch 2 / 3
2021-09-07 17:18:38.687 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.3052327036857605, 'baseline_loss': 0.6739030480384827, 'total_loss': 0.031718820333480835}
2021-09-07 17:18:38.688 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.3018686771392822
2021-09-07 17:18:38.689 | INFO     | src.policies:minibatch_update:29

2021-09-07 17:18:38.837 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999967217445374
2021-09-07 17:18:38.839 | INFO     | src.policies:train:123 - Epoch 530 / 800
2021-09-07 17:18:38.839 | INFO     | src.policies:collect_trajectories:221 - Episode 1352
2021-09-07 17:18:38.870 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:38.871 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:38.871 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:38.873 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:38.876 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.5183829665184021, 'baseline_loss': 1.0907915830612183, 'total_loss': 0.02701282501220703}
2021-09-07 17:18:38.877 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradien

2021-09-07 17:18:39.060 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:39.063 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.18301011621952057, 'baseline_loss': 0.436050683259964, 'total_loss': 0.035015225410461426}
2021-09-07 17:18:39.064 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.6024095416069031
2021-09-07 17:18:39.065 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.414191722869873
2021-09-07 17:18:39.066 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.49999919533729553
2021-09-07 17:18:39.067 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997019767761
2021-09-07 17:18:39.068 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:39.069 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.1980394423007965,

2021-09-07 17:18:39.423 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:39.425 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:39.428 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.6816398501396179, 'baseline_loss': 2.143871545791626, 'total_loss': 0.39029592275619507}
2021-09-07 17:18:39.429 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.5347380042076111
2021-09-07 17:18:39.430 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 2.8587143421173096
2021-09-07 17:18:39.431 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.49999910593032837
2021-09-07 17:18:39.432 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999998211860657
2021-09-07 17:18:39.434 | INFO     | src.policies:train:159 - Mini-batch 2 / 2

2021-09-07 17:18:39.613 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:39.613 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:39.614 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:39.615 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:39.617 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.5151874423027039, 'baseline_loss': 1.1617519855499268, 'total_loss': 0.06568855047225952}
2021-09-07 17:18:39.618 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.3044017255306244
2021-09-07 17:18:39.619 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.9316846132278442
2021-09-07 17:18:39.620 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.3044017255306244

2021-09-07 17:18:39.832 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.4355959892272949
2021-09-07 17:18:39.833 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.8495416641235352
2021-09-07 17:18:39.834 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.4355959892272949
2021-09-07 17:18:39.835 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999937415122986
2021-09-07 17:18:39.836 | INFO     | src.policies:train:123 - Epoch 545 / 800
2021-09-07 17:18:39.837 | INFO     | src.policies:collect_trajectories:221 - Episode 1369
2021-09-07 17:18:39.866 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:39.866 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:39.867 | INFO     | src.policies:collect_trajectories:238 - Last 100 epi

2021-09-07 17:18:40.036 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.43449491262435913
2021-09-07 17:18:40.038 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.3339749276638031
2021-09-07 17:18:40.039 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.43449491262435913
2021-09-07 17:18:40.041 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:40.042 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.27779650688171387, 'baseline_loss': 0.4306434392929077, 'total_loss': -0.06247478723526001}
2021-09-07 17:18:40.043 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.19702868163585663
2021-09-07 17:18:40.044 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.7401382923126221
2021-09-07 17:18:40.045 | INFO     | src.policies:minibatch_updat

2021-09-07 17:18:40.298 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 161.0
2021-09-07 17:18:40.303 | INFO     | src.policies:train:159 - Mini-batch 1 / 3
2021-09-07 17:18:40.305 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.49292999505996704, 'baseline_loss': 1.366382122039795, 'total_loss': 0.19026106595993042}
2021-09-07 17:18:40.306 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.22806499898433685
2021-09-07 17:18:40.307 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.5438116788864136
2021-09-07 17:18:40.309 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.22806499898433685
2021-09-07 17:18:40.310 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997317790985
2021-09-07 17:18:40.311 | INFO     | src.policies:train:159 - Mini-batch 2 / 3

2021-09-07 17:18:40.463 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.4060472548007965, 'baseline_loss': 0.699405312538147, 'total_loss': -0.05634459853172302}
2021-09-07 17:18:40.464 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.4500613510608673
2021-09-07 17:18:40.465 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.5931352376937866
2021-09-07 17:18:40.466 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.4500613510608673
2021-09-07 17:18:40.467 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999913573265076
2021-09-07 17:18:40.468 | INFO     | src.policies:train:123 - Epoch 556 / 800
2021-09-07 17:18:40.469 | INFO     | src.policies:collect_trajectories:221 - Episode 1382
2021-09-07 17:18:40.496 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done

2021-09-07 17:18:40.644 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999995231628418
2021-09-07 17:18:40.645 | INFO     | src.policies:train:123 - Epoch 559 / 800
2021-09-07 17:18:40.646 | INFO     | src.policies:collect_trajectories:221 - Episode 1386
2021-09-07 17:18:40.677 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:40.677 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:40.678 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:40.680 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:40.682 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.5706575512886047, 'baseline_loss': 2.0816848278045654, 'total_loss': 0.470184862613678}
2021-09-07 17:18:40.683 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient n

2021-09-07 17:18:40.982 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.2884061932563782
2021-09-07 17:18:40.983 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999992847442627
2021-09-07 17:18:40.984 | INFO     | src.policies:train:123 - Epoch 563 / 800
2021-09-07 17:18:40.985 | INFO     | src.policies:collect_trajectories:221 - Episode 1390
2021-09-07 17:18:41.016 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:41.016 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:41.017 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:41.019 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:41.022 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.37429848313331604, 'baseline_loss': 1.08387684822

2021-09-07 17:18:41.175 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.090597152709961
2021-09-07 17:18:41.176 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.2663496732711792
2021-09-07 17:18:41.177 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999961256980896
2021-09-07 17:18:41.179 | INFO     | src.policies:train:123 - Epoch 567 / 800
2021-09-07 17:18:41.179 | INFO     | src.policies:collect_trajectories:221 - Episode 1394
2021-09-07 17:18:41.208 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:41.209 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:41.209 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:41.211 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:41.21

2021-09-07 17:18:41.502 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.2049683928489685
2021-09-07 17:18:41.503 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 2.4258971214294434
2021-09-07 17:18:41.504 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.2049683928489685
2021-09-07 17:18:41.505 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997913837433
2021-09-07 17:18:41.506 | INFO     | src.policies:train:123 - Epoch 571 / 800
2021-09-07 17:18:41.507 | INFO     | src.policies:collect_trajectories:221 - Episode 1398
2021-09-07 17:18:41.535 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:41.536 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:41.536 | INFO     | src.policies:collect_trajectories:238 - Last 100 epis

2021-09-07 17:18:41.696 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.537049412727356
2021-09-07 17:18:41.697 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.499999076128006
2021-09-07 17:18:41.698 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997317790985
2021-09-07 17:18:41.699 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:41.701 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.22554819285869598, 'baseline_loss': 0.7761709690093994, 'total_loss': 0.16253729164600372}
2021-09-07 17:18:41.701 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.3168170750141144
2021-09-07 17:18:41.702 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.6796780824661255
2021-09-07 17:18:41.703 | INFO     | src.policies:minibatch_update:298 

2021-09-07 17:18:41.882 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.25319433212280273
2021-09-07 17:18:41.883 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.49043965339660645
2021-09-07 17:18:41.884 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.25319433212280273
2021-09-07 17:18:41.885 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49043965339660645
2021-09-07 17:18:41.886 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:41.887 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.3772844076156616, 'baseline_loss': 0.9291239380836487, 'total_loss': 0.08727756142616272}
2021-09-07 17:18:41.888 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.19684934616088867
2021-09-07 17:18:41.889 | INFO     | src.policies:minibatch_update:

2021-09-07 17:18:42.144 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.4545673429965973, 'baseline_loss': 1.2037596702575684, 'total_loss': 0.1473124921321869}
2021-09-07 17:18:42.145 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.08548345416784286
2021-09-07 17:18:42.146 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.2261079549789429
2021-09-07 17:18:42.147 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.08548345416784286
2021-09-07 17:18:42.148 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999995529651642
2021-09-07 17:18:42.149 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:42.151 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.48054200410842896, 'baseline_loss': 1.1204025745391846, 'total_loss': 0.07965928316116333}

2021-09-07 17:18:42.332 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.15392541885375977
2021-09-07 17:18:42.333 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.7927494049072266
2021-09-07 17:18:42.334 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.15392541885375977
2021-09-07 17:18:42.336 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999940395355225
2021-09-07 17:18:42.337 | INFO     | src.policies:train:159 - Mini-batch 3 / 3
2021-09-07 17:18:42.338 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.3210279047489166, 'baseline_loss': 0.7324683666229248, 'total_loss': 0.045206278562545776}
2021-09-07 17:18:42.340 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.28407859802246094
2021-09-07 17:18:42.341 | INFO     | src.policies:minibatch_update:

2021-09-07 17:18:42.560 | INFO     | src.policies:collect_trajectories:221 - Episode 1420
2021-09-07 17:18:42.581 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:42.582 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 137.0
2021-09-07 17:18:42.582 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 137.0
2021-09-07 17:18:42.583 | INFO     | src.policies:collect_trajectories:221 - Episode 1421
2021-09-07 17:18:42.612 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:42.613 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:42.613 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 168.5
2021-09-07 17:18:42.616 | INFO     | src.policies:train:159 - Mini-batch 1 / 3
2021-09-07 17:18:42.618 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': 

2021-09-07 17:18:42.764 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.2930562198162079
2021-09-07 17:18:42.764 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.0396625995635986
2021-09-07 17:18:42.766 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.2930562198162079
2021-09-07 17:18:42.767 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999995231628418
2021-09-07 17:18:42.769 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:42.770 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.368330717086792, 'baseline_loss': 0.7901374697685242, 'total_loss': 0.026738017797470093}
2021-09-07 17:18:42.771 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.22940462827682495
2021-09-07 17:18:42.772 | INFO     | src.policies:minibatch_update:291 

2021-09-07 17:18:42.938 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.44421327114105225, 'baseline_loss': 1.4502133131027222, 'total_loss': 0.28089338541030884}
2021-09-07 17:18:42.939 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.1316864788532257
2021-09-07 17:18:42.940 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.7327853441238403
2021-09-07 17:18:42.942 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.1316864788532257
2021-09-07 17:18:42.943 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999967217445374
2021-09-07 17:18:42.945 | INFO     | src.policies:train:123 - Epoch 596 / 800
2021-09-07 17:18:42.945 | INFO     | src.policies:collect_trajectories:221 - Episode 1429
2021-09-07 17:18:42.977 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done

2021-09-07 17:18:43.294 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:43.295 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.4885525107383728, 'baseline_loss': 1.18677818775177, 'total_loss': 0.10483658313751221}
2021-09-07 17:18:43.296 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.07817389070987701
2021-09-07 17:18:43.297 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.9368093013763428
2021-09-07 17:18:43.299 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.07817389070987701
2021-09-07 17:18:43.300 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997019767761
2021-09-07 17:18:43.302 | INFO     | src.policies:train:123 - Epoch 600 / 800
2021-09-07 17:18:43.302 | INFO     | src.policies:collect_trajectories:221 - Episode 1433
2021-09-07 17:18:43.332 | DEB

2021-09-07 17:18:43.613 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.17343492805957794
2021-09-07 17:18:43.614 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 2.6620967388153076
2021-09-07 17:18:43.615 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.17343492805957794
2021-09-07 17:18:43.616 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997317790985
2021-09-07 17:18:43.617 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:43.618 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.20193175971508026, 'baseline_loss': 0.48283684253692627, 'total_loss': 0.039486661553382874}
2021-09-07 17:18:43.619 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.13193991780281067
2021-09-07 17:18:43.620 | INFO     | src.policies:minibatch_update

2021-09-07 17:18:43.831 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.8420733213424683
2021-09-07 17:18:43.832 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.12574395537376404
2021-09-07 17:18:43.833 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997615814209
2021-09-07 17:18:43.834 | INFO     | src.policies:train:123 - Epoch 607 / 800
2021-09-07 17:18:43.835 | INFO     | src.policies:collect_trajectories:221 - Episode 1443
2021-09-07 17:18:43.863 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:43.864 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:43.864 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:43.867 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:43.8

2021-09-07 17:18:44.021 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.34482482075691223
2021-09-07 17:18:44.022 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 3.5065619945526123
2021-09-07 17:18:44.023 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.34482482075691223
2021-09-07 17:18:44.024 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999985098838806
2021-09-07 17:18:44.026 | INFO     | src.policies:train:123 - Epoch 611 / 800
2021-09-07 17:18:44.026 | INFO     | src.policies:collect_trajectories:221 - Episode 1447
2021-09-07 17:18:44.055 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:44.056 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:44.056 | INFO     | src.policies:collect_trajectories:238 - Last 100 e

2021-09-07 17:18:44.211 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.26144635677337646, 'baseline_loss': 0.45284390449523926, 'total_loss': -0.035024404525756836}
2021-09-07 17:18:44.212 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.09167583286762238
2021-09-07 17:18:44.213 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.9416763186454773
2021-09-07 17:18:44.214 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.09167583286762238
2021-09-07 17:18:44.215 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999940395355225
2021-09-07 17:18:44.216 | INFO     | src.policies:train:123 - Epoch 615 / 800
2021-09-07 17:18:44.216 | INFO     | src.policies:collect_trajectories:221 - Episode 1451
2021-09-07 17:18:44.245 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents do

2021-09-07 17:18:44.466 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:44.467 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.5149837732315063, 'baseline_loss': 1.4602454900741577, 'total_loss': 0.2151389718055725}
2021-09-07 17:18:44.468 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.2791266143321991
2021-09-07 17:18:44.469 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 2.404263496398926
2021-09-07 17:18:44.470 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.2791266143321991
2021-09-07 17:18:44.471 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997615814209
2021-09-07 17:18:44.472 | INFO     | src.policies:train:123 - Epoch 619 / 800
2021-09-07 17:18:44.473 | INFO     | src.policies:collect_trajectories:221 - Episode 1455
2021-09-07 17:18:44.501 | DEBUG

2021-09-07 17:18:44.654 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999961256980896
2021-09-07 17:18:44.656 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:44.658 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.4351358413696289, 'baseline_loss': 1.10561203956604, 'total_loss': 0.11767017841339111}
2021-09-07 17:18:44.659 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.14469456672668457
2021-09-07 17:18:44.661 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.943028450012207
2021-09-07 17:18:44.663 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.14469456672668457
2021-09-07 17:18:44.664 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997317790985
2021-09-07 17:18:44.666 | INFO     | src.policies:train

2021-09-07 17:18:44.898 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.2136746197938919
2021-09-07 17:18:44.899 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999985098838806
2021-09-07 17:18:44.900 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:44.901 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.49169591069221497, 'baseline_loss': 1.7329188585281372, 'total_loss': 0.37476351857185364}
2021-09-07 17:18:44.902 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.3738683760166168
2021-09-07 17:18:44.903 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 3.0979790687561035
2021-09-07 17:18:44.904 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.3738683760166168
2021-09-07 17:18:44.905 | INFO     | src.policies:minib

2021-09-07 17:18:45.085 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.8550032377243042
2021-09-07 17:18:45.087 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.1972937285900116
2021-09-07 17:18:45.088 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997019767761
2021-09-07 17:18:45.089 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:45.091 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.3694514334201813, 'baseline_loss': 1.234114408493042, 'total_loss': 0.24760577082633972}
2021-09-07 17:18:45.092 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.40959858894348145
2021-09-07 17:18:45.092 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.6521518230438232
2021-09-07 17:18:45.094 | INFO     | src.policies:minibatch_update:298

2021-09-07 17:18:45.271 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.11504877358675003
2021-09-07 17:18:45.272 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.6986887454986572
2021-09-07 17:18:45.273 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.11504877358675003
2021-09-07 17:18:45.274 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997317790985
2021-09-07 17:18:45.275 | INFO     | src.policies:train:123 - Epoch 634 / 800
2021-09-07 17:18:45.275 | INFO     | src.policies:collect_trajectories:221 - Episode 1471
2021-09-07 17:18:45.309 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:45.309 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:45.310 | INFO     | src.policies:collect_trajectories:238 - Last 100 ep

2021-09-07 17:18:45.711 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.6262121796607971, 'baseline_loss': 2.11067271232605, 'total_loss': 0.4291241765022278}
2021-09-07 17:18:45.712 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.18052417039871216
2021-09-07 17:18:45.713 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 3.429443836212158
2021-09-07 17:18:45.714 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.18052417039871216
2021-09-07 17:18:45.715 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999999701976776
2021-09-07 17:18:45.716 | INFO     | src.policies:train:123 - Epoch 638 / 800
2021-09-07 17:18:45.717 | INFO     | src.policies:collect_trajectories:221 - Episode 1475
2021-09-07 17:18:45.742 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done

2021-09-07 17:18:45.897 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999934434890747
2021-09-07 17:18:45.898 | INFO     | src.policies:train:123 - Epoch 641 / 800
2021-09-07 17:18:45.899 | INFO     | src.policies:collect_trajectories:221 - Episode 1479
2021-09-07 17:18:45.928 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:45.929 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:45.929 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:45.931 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:45.933 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.275915265083313, 'baseline_loss': 0.7543948292732239, 'total_loss': 0.10128214955329895}
2021-09-07 17:18:45.934 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient

2021-09-07 17:18:46.105 | INFO     | src.policies:train:159 - Mini-batch 2 / 3
2021-09-07 17:18:46.106 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.46314480900764465, 'baseline_loss': 1.5077975988388062, 'total_loss': 0.2907539904117584}
2021-09-07 17:18:46.107 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.33227112889289856
2021-09-07 17:18:46.107 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 2.393627166748047
2021-09-07 17:18:46.108 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.33227112889289856
2021-09-07 17:18:46.109 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997615814209
2021-09-07 17:18:46.111 | INFO     | src.policies:train:159 - Mini-batch 3 / 3
2021-09-07 17:18:46.112 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.4626815915107727,

2021-09-07 17:18:46.348 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:46.349 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:46.352 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.24239151179790497, 'baseline_loss': 0.4858238995075226, 'total_loss': 0.0005204379558563232}
2021-09-07 17:18:46.353 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.13303139805793762
2021-09-07 17:18:46.354 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.9879471063613892
2021-09-07 17:18:46.355 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.13303139805793762
2021-09-07 17:18:46.356 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997913837433
2021-09-07 17:18:46.357 | INFO     | src.policies:train:159 - Mini-batch 2 / 2

2021-09-07 17:18:46.537 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:46.537 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:46.540 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:46.543 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.40786197781562805, 'baseline_loss': 1.2847477197647095, 'total_loss': 0.23451188206672668}
2021-09-07 17:18:46.544 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.1223452091217041
2021-09-07 17:18:46.545 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.8372920751571655
2021-09-07 17:18:46.546 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.1223452091217041
2021-09-07 17:18:46.548 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clippin

2021-09-07 17:18:46.761 | INFO     | src.policies:collect_trajectories:221 - Episode 1495
2021-09-07 17:18:46.790 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:46.791 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:46.791 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:46.793 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:46.795 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.39176875352859497, 'baseline_loss': 0.8250106573104858, 'total_loss': 0.02073657512664795}
2021-09-07 17:18:46.796 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.33280062675476074
2021-09-07 17:18:46.797 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.9901783466339111
2021-09-07 17:18:46.798 | INFO     | src.policies:minibatch_updat

2021-09-07 17:18:46.949 | INFO     | src.policies:train:123 - Epoch 660 / 800
2021-09-07 17:18:46.950 | INFO     | src.policies:collect_trajectories:221 - Episode 1499
2021-09-07 17:18:46.979 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:46.980 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:46.980 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:46.982 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:46.985 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.37029239535331726, 'baseline_loss': 2.1093592643737793, 'total_loss': 0.68438720703125}
2021-09-07 17:18:46.985 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.535637617111206
2021-09-07 17:18:46.986 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 2.613523960

2021-09-07 17:18:47.147 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.3569842576980591
2021-09-07 17:18:47.147 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.7523989677429199
2021-09-07 17:18:47.149 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.3569842576980591
2021-09-07 17:18:47.149 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999993145465851
2021-09-07 17:18:47.151 | INFO     | src.policies:train:123 - Epoch 664 / 800
2021-09-07 17:18:47.151 | INFO     | src.policies:collect_trajectories:221 - Episode 1504
2021-09-07 17:18:47.180 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:47.181 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:47.181 | INFO     | src.policies:collect_trajectories:238 - Last 100 epis

2021-09-07 17:18:47.407 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.9502004384994507
2021-09-07 17:18:47.408 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.22903744876384735
2021-09-07 17:18:47.409 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999967217445374
2021-09-07 17:18:47.410 | INFO     | src.policies:train:159 - Mini-batch 2 / 3
2021-09-07 17:18:47.411 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.30934417247772217, 'baseline_loss': 0.4393107295036316, 'total_loss': -0.08968880772590637}
2021-09-07 17:18:47.412 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.3877227306365967
2021-09-07 17:18:47.413 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.8568661212921143
2021-09-07 17:18:47.414 | INFO     | src.policies:minibatch_update

2021-09-07 17:18:47.573 | INFO     | src.policies:train:123 - Epoch 671 / 800
2021-09-07 17:18:47.573 | INFO     | src.policies:collect_trajectories:221 - Episode 1512
2021-09-07 17:18:47.602 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:47.602 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:47.603 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:47.605 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:47.607 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.4585162103176117, 'baseline_loss': 1.008881688117981, 'total_loss': 0.045924633741378784}
2021-09-07 17:18:47.608 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.3128708004951477
2021-09-07 17:18:47.609 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.985119

2021-09-07 17:18:47.987 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999988079071045
2021-09-07 17:18:47.989 | INFO     | src.policies:train:123 - Epoch 675 / 800
2021-09-07 17:18:47.989 | INFO     | src.policies:collect_trajectories:221 - Episode 1516
2021-09-07 17:18:48.020 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:48.021 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:48.022 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:48.024 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:48.026 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.4650082290172577, 'baseline_loss': 0.8088168501853943, 'total_loss': -0.06059980392456055}
2021-09-07 17:18:48.027 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradie

2021-09-07 17:18:48.172 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.09769832342863083
2021-09-07 17:18:48.173 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999943375587463
2021-09-07 17:18:48.174 | INFO     | src.policies:train:123 - Epoch 679 / 800
2021-09-07 17:18:48.175 | INFO     | src.policies:collect_trajectories:221 - Episode 1520
2021-09-07 17:18:48.204 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:48.205 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:48.205 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:48.208 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:48.210 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.6386047601699829, 'baseline_loss': 2.1643052101

2021-09-07 17:18:48.360 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 5.128989219665527
2021-09-07 17:18:48.361 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.49999937415122986
2021-09-07 17:18:48.362 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999988079071045
2021-09-07 17:18:48.363 | INFO     | src.policies:train:123 - Epoch 683 / 800
2021-09-07 17:18:48.364 | INFO     | src.policies:collect_trajectories:221 - Episode 1524
2021-09-07 17:18:48.392 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:48.392 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:48.393 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:48.395 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:48.3

2021-09-07 17:18:48.613 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.36569589376449585
2021-09-07 17:18:48.614 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997317790985
2021-09-07 17:18:48.615 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:48.617 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.143715962767601, 'baseline_loss': 0.43199270963668823, 'total_loss': 0.0722803920507431}
2021-09-07 17:18:48.618 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.24173404276371002
2021-09-07 17:18:48.619 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.4369062185287476
2021-09-07 17:18:48.620 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.24173404276371002
2021-09-07 17:18:48.621 | INFO     | src.policies:minib

2021-09-07 17:18:48.806 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.9078142046928406
2021-09-07 17:18:48.807 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.3134273290634155
2021-09-07 17:18:48.808 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999994933605194
2021-09-07 17:18:48.810 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:48.811 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.26315486431121826, 'baseline_loss': 0.4232656955718994, 'total_loss': -0.051522016525268555}
2021-09-07 17:18:48.813 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.10172299295663834
2021-09-07 17:18:48.814 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.6598833799362183
2021-09-07 17:18:48.815 | INFO     | src.policies:minibatch_update

2021-09-07 17:18:49.050 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.3069393038749695
2021-09-07 17:18:49.051 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.5704988241195679
2021-09-07 17:18:49.052 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.3069393038749695
2021-09-07 17:18:49.053 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997019767761
2021-09-07 17:18:49.054 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:49.055 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.07859165966510773, 'baseline_loss': 0.529151439666748, 'total_loss': 0.1859840601682663}
2021-09-07 17:18:49.056 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.23823602497577667
2021-09-07 17:18:49.057 | INFO     | src.policies:minibatch_update:291 -

2021-09-07 17:18:49.232 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.31697970628738403, 'baseline_loss': 0.41786572337150574, 'total_loss': -0.10804684460163116}
2021-09-07 17:18:49.233 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.20085330307483673
2021-09-07 17:18:49.234 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.5135137438774109
2021-09-07 17:18:49.235 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.20085330307483673
2021-09-07 17:18:49.236 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999889731407166
2021-09-07 17:18:49.237 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:49.238 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.3100651502609253, 'baseline_loss': 0.4149540066719055, 'total_loss': -0.10258814692497253}

2021-09-07 17:18:49.415 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:49.418 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:49.420 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.2290123552083969, 'baseline_loss': 0.41896918416023254, 'total_loss': -0.01952776312828064}
2021-09-07 17:18:49.421 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.524260401725769
2021-09-07 17:18:49.423 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.8242222666740417
2021-09-07 17:18:49.424 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.4999990165233612
2021-09-07 17:18:49.425 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999937415122986
2021-09-07 17:18:49.427 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
20

2021-09-07 17:18:49.663 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:49.663 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:49.664 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:49.666 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:49.668 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.1440471112728119, 'baseline_loss': 0.450190007686615, 'total_loss': 0.0810478925704956}
2021-09-07 17:18:49.669 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.12156384438276291
2021-09-07 17:18:49.670 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.0245121717453003
2021-09-07 17:18:49.671 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.12156384438276291
2021-09-07 17:18:

2021-09-07 17:18:49.946 | INFO     | src.policies:train:123 - Epoch 710 / 800
2021-09-07 17:18:49.946 | INFO     | src.policies:collect_trajectories:221 - Episode 1552
2021-09-07 17:18:49.977 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:49.978 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:49.978 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:49.981 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:49.984 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.4538159966468811, 'baseline_loss': 0.833057701587677, 'total_loss': -0.0372871458530426}
2021-09-07 17:18:49.985 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.16567187011241913
2021-09-07 17:18:49.986 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.353061

2021-09-07 17:18:50.273 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999999403953552
2021-09-07 17:18:50.275 | INFO     | src.policies:train:123 - Epoch 714 / 800
2021-09-07 17:18:50.275 | INFO     | src.policies:collect_trajectories:221 - Episode 1556
2021-09-07 17:18:50.310 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:50.311 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:50.311 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:50.313 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:50.315 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.5379276871681213, 'baseline_loss': 0.9081854224205017, 'total_loss': -0.08383497595787048}
2021-09-07 17:18:50.316 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradien

2021-09-07 17:18:50.491 | INFO     | src.policies:collect_trajectories:221 - Episode 1561
2021-09-07 17:18:50.522 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:50.522 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:50.523 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 199.0
2021-09-07 17:18:50.526 | INFO     | src.policies:train:159 - Mini-batch 1 / 3
2021-09-07 17:18:50.528 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.37649112939834595, 'baseline_loss': 0.5719707012176514, 'total_loss': -0.09050577878952026}
2021-09-07 17:18:50.529 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.5684343576431274
2021-09-07 17:18:50.530 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.4801543653011322
2021-09-07 17:18:50.531 | INFO     | src.policies:minibatch_updat

2021-09-07 17:18:50.684 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999940395355225
2021-09-07 17:18:50.685 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:50.686 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.2180750072002411, 'baseline_loss': 0.3635489344596863, 'total_loss': -0.03630053997039795}
2021-09-07 17:18:50.687 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.19295372068881989
2021-09-07 17:18:50.688 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.3128821849822998
2021-09-07 17:18:50.690 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.19295372068881989
2021-09-07 17:18:50.691 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999967217445374
2021-09-07 17:18:50.692 | INFO     | src.policies:

2021-09-07 17:18:50.917 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.209433451294899
2021-09-07 17:18:50.918 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.499999463558197
2021-09-07 17:18:50.920 | INFO     | src.policies:train:123 - Epoch 724 / 800
2021-09-07 17:18:50.920 | INFO     | src.policies:collect_trajectories:221 - Episode 1569
2021-09-07 17:18:50.927 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:50.928 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 32.0
2021-09-07 17:18:50.928 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 32.0
2021-09-07 17:18:50.929 | INFO     | src.policies:collect_trajectories:221 - Episode 1570
2021-09-07 17:18:50.955 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:50.955 | INFO     | s

2021-09-07 17:18:51.155 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 182.0
2021-09-07 17:18:51.155 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 168.5
2021-09-07 17:18:51.159 | INFO     | src.policies:train:159 - Mini-batch 1 / 3
2021-09-07 17:18:51.161 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.4157874584197998, 'baseline_loss': 0.5975469350814819, 'total_loss': -0.11701399087905884}
2021-09-07 17:18:51.162 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.31346216797828674
2021-09-07 17:18:51.163 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.4413022994995117
2021-09-07 17:18:51.165 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.31346216797828674
2021-09-07 17:18:51.166 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipp

2021-09-07 17:18:51.383 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 128.0
2021-09-07 17:18:51.384 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 128.0
2021-09-07 17:18:51.384 | INFO     | src.policies:collect_trajectories:221 - Episode 1581
2021-09-07 17:18:51.402 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:51.403 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 113.0
2021-09-07 17:18:51.403 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 120.5
2021-09-07 17:18:51.406 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:51.410 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.4991490840911865, 'baseline_loss': 0.7263426780700684, 'total_loss': -0.13597774505615234}
2021-09-07 17:18:51.411 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient 

2021-09-07 17:18:51.555 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.499999076128006
2021-09-07 17:18:51.556 | INFO     | src.policies:train:123 - Epoch 732 / 800
2021-09-07 17:18:51.557 | INFO     | src.policies:collect_trajectories:221 - Episode 1586
2021-09-07 17:18:51.574 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:51.574 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 121.0
2021-09-07 17:18:51.575 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 121.0
2021-09-07 17:18:51.575 | INFO     | src.policies:collect_trajectories:221 - Episode 1587
2021-09-07 17:18:51.599 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:51.600 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 155.0
2021-09-07 17:18:51.600 | INFO     | src.policies:collect_trajectories:

2021-09-07 17:18:51.772 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 183.0
2021-09-07 17:18:51.772 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 183.0
2021-09-07 17:18:51.773 | INFO     | src.policies:collect_trajectories:221 - Episode 1593
2021-09-07 17:18:51.799 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:51.800 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 176.0
2021-09-07 17:18:51.800 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 179.5
2021-09-07 17:18:51.804 | INFO     | src.policies:train:159 - Mini-batch 1 / 3
2021-09-07 17:18:51.806 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.3652666211128235, 'baseline_loss': 0.49952232837677, 'total_loss': -0.11550545692443848}
2021-09-07 17:18:51.807 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient no

2021-09-07 17:18:52.110 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.12641166150569916
2021-09-07 17:18:52.111 | INFO     | src.policies:train:159 - Mini-batch 3 / 3
2021-09-07 17:18:52.112 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.4592663049697876, 'baseline_loss': 0.6244103312492371, 'total_loss': -0.14706113934516907}
2021-09-07 17:18:52.113 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.19572122395038605
2021-09-07 17:18:52.114 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.277847021818161
2021-09-07 17:18:52.115 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.19572122395038605
2021-09-07 17:18:52.116 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.277847021818161
2021-09-07 17:18:52.118 | INFO     | src.policies:tra

2021-09-07 17:18:52.290 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.23601418733596802
2021-09-07 17:18:52.292 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999940395355225
2021-09-07 17:18:52.293 | INFO     | src.policies:train:123 - Epoch 741 / 800
2021-09-07 17:18:52.294 | INFO     | src.policies:collect_trajectories:221 - Episode 1602
2021-09-07 17:18:52.320 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:52.321 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 158.0
2021-09-07 17:18:52.322 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 158.0
2021-09-07 17:18:52.322 | INFO     | src.policies:collect_trajectories:221 - Episode 1603
2021-09-07 17:18:52.348 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:52.348 | INFO  

2021-09-07 17:18:52.547 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999992251396179
2021-09-07 17:18:52.548 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:52.549 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.3400154113769531, 'baseline_loss': 0.5419145822525024, 'total_loss': -0.0690581202507019}
2021-09-07 17:18:52.550 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.22335396707057953
2021-09-07 17:18:52.551 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.3963835537433624
2021-09-07 17:18:52.552 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.22335396707057953
2021-09-07 17:18:52.553 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.3963835537433624
2021-09-07 17:18:52.555 | INFO     | src.policies:tra

2021-09-07 17:18:52.751 | INFO     | src.policies:train:159 - Mini-batch 2 / 3
2021-09-07 17:18:52.752 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.3809751868247986, 'baseline_loss': 0.5614993572235107, 'total_loss': -0.10022550821304321}
2021-09-07 17:18:52.754 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.3881306052207947
2021-09-07 17:18:52.754 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.4239238202571869
2021-09-07 17:18:52.755 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.3881306052207947
2021-09-07 17:18:52.756 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4239238202571869
2021-09-07 17:18:52.758 | INFO     | src.policies:train:159 - Mini-batch 3 / 3
2021-09-07 17:18:52.759 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.41040074825286865

2021-09-07 17:18:52.922 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.2720468044281006
2021-09-07 17:18:52.924 | INFO     | src.policies:train:123 - Epoch 749 / 800
2021-09-07 17:18:52.924 | INFO     | src.policies:collect_trajectories:221 - Episode 1616
2021-09-07 17:18:52.956 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:52.957 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 182.0
2021-09-07 17:18:52.957 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 182.0
2021-09-07 17:18:52.958 | INFO     | src.policies:collect_trajectories:221 - Episode 1617
2021-09-07 17:18:52.988 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:52.989 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 196.0
2021-09-07 17:18:52.990 | INFO     | src.policies:collect_trajectories

2021-09-07 17:18:53.269 | INFO     | src.policies:train:123 - Epoch 752 / 800
2021-09-07 17:18:53.270 | INFO     | src.policies:collect_trajectories:221 - Episode 1621
2021-09-07 17:18:53.298 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:53.298 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:53.299 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:53.301 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:53.303 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.5964189171791077, 'baseline_loss': 1.3721433877944946, 'total_loss': 0.08965277671813965}
2021-09-07 17:18:53.304 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.45487701892852783
2021-09-07 17:18:53.306 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 2.76113

2021-09-07 17:18:53.464 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997317790985
2021-09-07 17:18:53.465 | INFO     | src.policies:train:123 - Epoch 756 / 800
2021-09-07 17:18:53.465 | INFO     | src.policies:collect_trajectories:221 - Episode 1625
2021-09-07 17:18:53.494 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:53.495 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:53.495 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:53.497 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:53.499 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.2526150345802307, 'baseline_loss': 0.3795895278453827, 'total_loss': -0.06282027065753937}
2021-09-07 17:18:53.500 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradien

2021-09-07 17:18:53.650 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.3246631622314453
2021-09-07 17:18:53.651 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997615814209
2021-09-07 17:18:53.653 | INFO     | src.policies:train:123 - Epoch 760 / 800
2021-09-07 17:18:53.653 | INFO     | src.policies:collect_trajectories:221 - Episode 1629
2021-09-07 17:18:53.682 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:53.683 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:53.683 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:53.715 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:53.746 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.3866950273513794, 'baseline_loss': 0.554188787937

2021-09-07 17:18:53.894 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.3990775346755981
2021-09-07 17:18:53.895 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.26723575592041016
2021-09-07 17:18:53.896 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999997019767761
2021-09-07 17:18:53.898 | INFO     | src.policies:train:123 - Epoch 764 / 800
2021-09-07 17:18:53.898 | INFO     | src.policies:collect_trajectories:221 - Episode 1633
2021-09-07 17:18:53.927 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:53.928 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:53.928 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:53.930 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:53.9

2021-09-07 17:18:54.196 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.6871870160102844
2021-09-07 17:18:54.197 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 4.204105377197266
2021-09-07 17:18:54.198 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.4999992549419403
2021-09-07 17:18:54.199 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999988079071045
2021-09-07 17:18:54.200 | INFO     | src.policies:train:123 - Epoch 768 / 800
2021-09-07 17:18:54.200 | INFO     | src.policies:collect_trajectories:221 - Episode 1637
2021-09-07 17:18:54.228 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:54.228 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:54.229 | INFO     | src.policies:collect_trajectories:238 - Last 100 epis

2021-09-07 17:18:54.431 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.43598711490631104, 'baseline_loss': 0.9752878546714783, 'total_loss': 0.0516568124294281}
2021-09-07 17:18:54.432 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.3677012026309967
2021-09-07 17:18:54.433 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.1977490186691284
2021-09-07 17:18:54.434 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.3677012026309967
2021-09-07 17:18:54.435 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999995231628418
2021-09-07 17:18:54.437 | INFO     | src.policies:train:123 - Epoch 772 / 800
2021-09-07 17:18:54.438 | INFO     | src.policies:collect_trajectories:221 - Episode 1641
2021-09-07 17:18:54.467 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021

2021-09-07 17:18:54.624 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:54.625 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.18616394698619843, 'baseline_loss': 0.363148033618927, 'total_loss': -0.004589930176734924}
2021-09-07 17:18:54.626 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.22245916724205017
2021-09-07 17:18:54.627 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.1853972673416138
2021-09-07 17:18:54.628 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.22245916724205017
2021-09-07 17:18:54.629 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999995529651642
2021-09-07 17:18:54.631 | INFO     | src.policies:train:123 - Epoch 776 / 800
2021-09-07 17:18:54.631 | INFO     | src.policies:collect_trajectories:221 - Episode 1645
2021-09-07 17:18:54.659 |

2021-09-07 17:18:54.870 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999992549419403
2021-09-07 17:18:54.871 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:54.872 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.44053542613983154, 'baseline_loss': 0.7074931263923645, 'total_loss': -0.08678886294364929}
2021-09-07 17:18:54.873 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.15074780583381653
2021-09-07 17:18:54.874 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.5984836220741272
2021-09-07 17:18:54.875 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.15074780583381653
2021-09-07 17:18:54.877 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999910593032837
2021-09-07 17:18:54.878 | INFO     | src.policies:

2021-09-07 17:18:55.088 | INFO     | src.policies:train:159 - Mini-batch 2 / 3
2021-09-07 17:18:55.089 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.36925631761550903, 'baseline_loss': 0.5933966636657715, 'total_loss': -0.07255798578262329}
2021-09-07 17:18:55.090 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.41722798347473145
2021-09-07 17:18:55.091 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.535599946975708
2021-09-07 17:18:55.092 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.41722798347473145
2021-09-07 17:18:55.093 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999989867210388
2021-09-07 17:18:55.094 | INFO     | src.policies:train:159 - Mini-batch 3 / 3
2021-09-07 17:18:55.095 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.298089742660522

2021-09-07 17:18:55.273 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999940395355225
2021-09-07 17:18:55.274 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:55.275 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.4267500042915344, 'baseline_loss': 0.6324383616447449, 'total_loss': -0.11053082346916199}
2021-09-07 17:18:55.276 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.3172348141670227
2021-09-07 17:18:55.277 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.5520754456520081
2021-09-07 17:18:55.278 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.3172348141670227
2021-09-07 17:18:55.279 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.499999076128006
2021-09-07 17:18:55.280 | INFO     | src.policies:trai

2021-09-07 17:18:55.622 | INFO     | src.policies:train:159 - Mini-batch 2 / 3
2021-09-07 17:18:55.623 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.20925326645374298, 'baseline_loss': 0.337091326713562, 'total_loss': -0.040707603096961975}
2021-09-07 17:18:55.624 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.30465254187583923
2021-09-07 17:18:55.625 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.017712950706482
2021-09-07 17:18:55.626 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.30465254187583923
2021-09-07 17:18:55.628 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.4999995529651642
2021-09-07 17:18:55.629 | INFO     | src.policies:train:159 - Mini-batch 3 / 3
2021-09-07 17:18:55.630 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.366127610206604

2021-09-07 17:18:55.789 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999940395355225
2021-09-07 17:18:55.791 | INFO     | src.policies:train:123 - Epoch 791 / 800
2021-09-07 17:18:55.791 | INFO     | src.policies:collect_trajectories:221 - Episode 1667
2021-09-07 17:18:55.822 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:55.823 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 200.0
2021-09-07 17:18:55.824 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 200.0
2021-09-07 17:18:55.826 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:55.828 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.42970898747444153, 'baseline_loss': 0.5271084308624268, 'total_loss': -0.16615477204322815}
2021-09-07 17:18:55.829 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradi

2021-09-07 17:18:56.061 | INFO     | src.policies:collect_trajectories:221 - Episode 1672
2021-09-07 17:18:56.202 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:56.203 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 186.0
2021-09-07 17:18:56.203 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 186.5
2021-09-07 17:18:56.206 | INFO     | src.policies:train:159 - Mini-batch 1 / 3
2021-09-07 17:18:56.209 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.5090958476066589, 'baseline_loss': 0.8084837198257446, 'total_loss': -0.10485398769378662}
2021-09-07 17:18:56.210 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.48964014649391174
2021-09-07 17:18:56.211 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.1354342699050903
2021-09-07 17:18:56.212 | INFO     | src.policies:minibatch_updat

2021-09-07 17:18:56.357 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999964237213135
2021-09-07 17:18:56.359 | INFO     | src.policies:train:159 - Mini-batch 2 / 2
2021-09-07 17:18:56.360 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.5166565179824829, 'baseline_loss': 0.9258697032928467, 'total_loss': -0.05372166633605957}
2021-09-07 17:18:56.360 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.14023199677467346
2021-09-07 17:18:56.361 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 1.3529202938079834
2021-09-07 17:18:56.362 | INFO     | src.policies:minibatch_update:298 - Policy network L2 gradient norm after clipping: 0.14023199677467346
2021-09-07 17:18:56.363 | INFO     | src.policies:minibatch_update:305 - Baseline network L2 gradient norm after clipping: 0.49999964237213135
2021-09-07 17:18:56.365 | INFO     | src.policies:

2021-09-07 17:18:56.546 | INFO     | src.policies:collect_trajectories:221 - Episode 1681
2021-09-07 17:18:56.614 | DEBUG    | src.policies:execute_episode:413 - Early stopping, all agents done
2021-09-07 17:18:56.614 | INFO     | src.policies:collect_trajectories:237 - Mean episode return: 172.0
2021-09-07 17:18:56.615 | INFO     | src.policies:collect_trajectories:238 - Last 100 episodes mean return: 140.0
2021-09-07 17:18:56.618 | INFO     | src.policies:train:159 - Mini-batch 1 / 2
2021-09-07 17:18:56.620 | INFO     | src.policies:minibatch_update:281 - Losses: {'policy_loss': -0.4010089039802551, 'baseline_loss': 0.7162554264068604, 'total_loss': -0.04288119077682495}
2021-09-07 17:18:56.621 | INFO     | src.policies:minibatch_update:287 - Policy network L2 gradient norm: 0.18547064065933228
2021-09-07 17:18:56.622 | INFO     | src.policies:minibatch_update:291 - Baseline network L2 gradient norm: 0.4452331066131592
2021-09-07 17:18:56.623 | INFO     | src.policies:minibatch_updat

| Metric | Value |
| --- | --- |
| `loss` | -0.2117 |
| `mean_return` | 140.0 |
| `_runtime` | 61.0 |
| `_timestamp` | 1631027936.0 |
| `_step` | 799.0 |


| Metric | History |
| --- | --- |
| `loss` | █▆▄▅▅▇▃▅▃▃▃█▃▄▃▅▅▃▂█▅▃▁▆▆▄▇▅▄▄▃▇▄▂▄▂▂▄▄▂ |
| `mean_return` | ▁▁▁▂▂█▆▅▄▇▅▆█▇▆█████▆██▇▆██▆████████▅██▆ |
| `_runtime` | ▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇▇████ |
| `_timestamp` | ▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇▇████ |
| `_step` | ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
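
As a rough cross-check of the "solved" criterion quoted at the top of this notebook (a mean return of 195 over the last 100 episodes), the sketch below applies that rule to a plain list of episode returns. `is_solved` is a hypothetical helper written for illustration; it is not part of `src.policies`, and it assumes the values passed in are undiscounted per-episode returns.

```python
from collections import deque

SOLVED_THRESHOLD = 195.0  # CartPole-v0 threshold quoted from Gym's documentation
WINDOW = 100              # number of most recent episodes to average over


def is_solved(episode_returns, threshold=SOLVED_THRESHOLD, window=WINDOW):
    """Hypothetical helper: True if the mean of the last `window` returns reaches `threshold`."""
    # A bounded deque keeps only the most recent `window` returns.
    recent = deque(episode_returns, maxlen=window)
    if len(recent) < window:
        # Not enough episodes yet to evaluate the criterion.
        return False
    return sum(recent) / len(recent) >= threshold
```

If the `mean_return` of 140.0 in the run summary above reflects this windowed mean, the VPG policy would fall short of the threshold at the final step, even though many of the logged epochs reach a return of 200.0.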


## TRPO

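This section trains a Cartpole agent with our custom Trust Region Policy Optimization (TRPO) implementation. It reuses the network sizes and training settings defined in the utilities section and adds two TRPO-specific hyperparameters, `beta` and `kl_target`.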

In [None]:
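# Assumed from the name (check src/policies.py): initial KL penalty coefficient
beta = 1.0
# Assumed from the name: target KL divergence between the old and new policies
kl_target = 0.01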

In [None]:
trpo_policy_nn = models.MLP(observation_space_size, hidden_sizes, action_space_size)
trpo_baseline_nn = models.MLP(observation_space_size, hidden_sizes, 1, log_softmax=False)
trpo_policy = policies.TRPOPolicy(env, trpo_policy_nn, trpo_baseline_nn, beta=beta, kl_target=kl_target)
trpo_policy.train(
    epochs,
    steps_per_epoch,
    minibatch_size,
    enable_wandb=True,
    wandb_config={**wandb_config, "group": "TRPO"},
    episodes_mean_return=episodes_mean_return
)
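
The `beta` and `kl_target` arguments suggest a KL-penalized objective whose penalty coefficient is tuned toward a target divergence. How `TRPOPolicy` actually uses them is not shown in this notebook, so the following is only a minimal sketch of one common scheme (the adaptive KL penalty rule described in the PPO paper), not the implementation in `src.policies`. The names `categorical_kl` and `adapt_beta`, as well as the `tolerance` and `factor` defaults, are assumptions made for illustration.

```python
import math


def categorical_kl(p_old, p_new):
    """KL(p_old || p_new) for two discrete action distributions given as lists.

    Assumes every entry of p_new is strictly positive.
    """
    return sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po > 0.0)


def adapt_beta(beta, observed_kl, kl_target, tolerance=1.5, factor=2.0):
    """Hypothetical update rule: move beta so that the observed KL tracks kl_target."""
    if observed_kl > kl_target * tolerance:
        # Policy moved too far: penalize divergence more strongly.
        return beta * factor
    if observed_kl < kl_target / tolerance:
        # Policy barely moved: relax the penalty.
        return beta / factor
    return beta


# Example with the notebook's values beta = 1.0 and kl_target = 0.01.
kl = categorical_kl([0.5, 0.5], [0.6, 0.4])
new_beta = adapt_beta(1.0, kl, 0.01)
```

Under this scheme the penalty coefficient is self-correcting, which makes the initial value of `beta` less critical than `kl_target`.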

## PPO

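This section trains a Cartpole agent with our custom Proximal Policy Optimization (PPO) implementation. It reuses the network sizes and training settings defined in the utilities section and adds three PPO-specific hyperparameters: `alpha`, `beta` and `eps`.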

In [None]:
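# Coefficient passed to PPOPolicy; its exact role is defined in src/policies.py
alpha = 1.0
# Note: this rebinds the `beta` set in the TRPO section, so re-run that cell before re-running TRPO
beta = 0.01
# Presumably the clipping range of the PPO surrogate objective
eps = 0.2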

In [None]:
ppo_policy_nn = models.MLP(observation_space_size, hidden_sizes, action_space_size)
ppo_baseline_nn = models.MLP(observation_space_size, hidden_sizes, 1, log_softmax=False)
ppo_policy = policies.PPOPolicy(env, ppo_policy_nn, ppo_baseline_nn, alpha=alpha, beta=beta, eps=eps)
ppo_policy.train(
    epochs,
    steps_per_epoch,
    minibatch_size,
    enable_wandb=True,
    wandb_config={**wandb_config, "group": "PPO"},
    episodes_mean_return=episodes_mean_return
)
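
The value `eps = 0.2` matches the clipping range commonly used with PPO's clipped surrogate objective. Assuming that is how `PPOPolicy` uses it (the notebook does not show this), the per-sample objective can be sketched as below. `clipped_surrogate` is a hypothetical function for illustration only, and the roles of `alpha` and `beta` are deliberately left out because they cannot be inferred from this notebook.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped objective (to be maximized).

    ratio:     pi_new(a|s) / pi_old(a|s)
    advantage: advantage estimate for the same (s, a) pair
    """
    # Restrict the probability ratio to [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Take the pessimistic (lower) of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


# With a positive advantage, gains from raising the ratio beyond 1 + eps are cut off.
gain = clipped_surrogate(1.5, 1.0)
# With a negative advantage, lowering the ratio below 1 - eps earns no extra credit.
loss_side = clipped_surrogate(0.5, -1.0)
```

Taking the minimum of the two terms removes the incentive to push the policy ratio outside the `[1 - eps, 1 + eps]` interval in a single update.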