# Portfolio Management with Amazon SageMaker RL


Portfolio management is the process of continuously redistributing capital across a set of financial assets. Given the historical prices of a list of stocks and the current portfolio allocation, the goal is to maximize the return while limiting the risk. In this demo, we use a reinforcement learning framework to manage the portfolio by continuously reallocating capital among several stocks. Based on the setup in [1], we use a tensor input constructed from historical price data, and then apply an actor-critic policy gradient algorithm to accommodate the continuous actions (reallocations). The customized environment is constructed using OpenAI Gym and the RL agents are trained using Amazon SageMaker.

[1] Jiang, Zhengyao, Dixing Xu, and Jinjun Liang. ["A deep reinforcement learning framework for the financial portfolio management problem." arXiv preprint arXiv:1706.10059 (2017)](https://arxiv.org/abs/1706.10059).

## Problem Statement

We start with $m$ preselected stocks. Without loss of generality, the total investment value is set to 1 dollar at the initial timestamp. At timestamp $t$, letting $v_{i,t}$ denote the closing price of stock $i$ (for $i = 1, \dots, m$), the *price relative vector* is defined as 
$$ y_t = \left( 1, \frac{v_{1,t}}{v_{1,t-1}}, \frac{v_{2,t}}{v_{2,t-1}}, \dots, \frac{v_{m,t}}{v_{m,t-1}} \right). $$
The first element corresponds to the cash we maintain. The value of cash does not change over time, so this element is always 1. During training, the investment redistribution at step $t$ is characterized by the portfolio weight vector $\omega_t = (\omega_{0,t}, \omega_{1,t}, \dots, \omega_{m,t})$. 
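
As a minimal sketch of the definition above, the price relative vector can be computed from two consecutive lists of closing prices. The function name `price_relative_vector` and the sample prices are illustrative assumptions and are not taken from the notebook's source files.

```python
def price_relative_vector(prev_close, curr_close):
    """Return y_t = (1, v_{1,t}/v_{1,t-1}, ..., v_{m,t}/v_{m,t-1})."""
    if len(prev_close) != len(curr_close):
        raise ValueError("both price lists must cover the same stocks")
    # The leading 1.0 is the cash entry, whose value does not change.
    return [1.0] + [curr / prev for prev, curr in zip(prev_close, curr_close)]


# Made-up closing prices for m = 2 stocks at t-1 and t.
prev_close = [100.0, 50.0]
curr_close = [110.0, 45.0]
y_t = price_relative_vector(prev_close, curr_close)
print(y_t)
```

An entry above 1 means the stock gained value over the period, and an entry below 1 means it lost value.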

1. *Objective:*
The portfolio consists of a group of stocks. We aim to maximize the portfolio value by adjusting the weights of each stock and reallocating the portfolio at the end of each day.

2. *Environment:*
Custom developed environment using Gym.

3. *States:*
The portfolio weight vector from the last trading day, $\omega_{t-1}$, together with a historic price tensor constructed from the close, open, high and low prices of each stock. For more details, please refer to [1].

4. *Actions:*
New weight vector $\omega_{t}$ satisfying $\sum_{i=0}^{m}\omega_{i,t}=1$.

5. *Reward:* 
Average logarithmic cumulated return. Given a trading cost factor $\mu_t$, the average logarithmic cumulated return after timestamp $T$ is $$ R := \frac{1}{T} \sum_{t=1}^{T+1} \ln(\mu_{t}\, y_{t}\cdot\omega_{t-1}).$$
We use the maximum commission rate at Poloniex, $0.25\%$, to determine $\mu_t$.
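
The reward can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the environment's actual reward code: it treats the trading cost factor as a constant `mu = 1 - 0.0025`, reading the 0.25% figure as a commission rate (in [1] the factor depends on each reallocation), and it averages over the number of periods passed in rather than reproducing the exact summation bounds. The function name and sample numbers are hypothetical.

```python
import math


def average_log_return(price_relatives, prev_weights, mu):
    """Average of ln(mu * y_t . w_{t-1}) over the supplied periods."""
    total = 0.0
    for y_t, w_prev in zip(price_relatives, prev_weights):
        # Growth factor of the portfolio value over one period.
        growth = sum(y * w for y, w in zip(y_t, w_prev))
        total += math.log(mu * growth)
    return total / len(price_relatives)


# Two periods, cash plus two stocks (made-up numbers).
price_relatives = [[1.0, 1.1, 0.9], [1.0, 1.0, 1.2]]
prev_weights = [[0.2, 0.5, 0.3], [0.0, 0.5, 0.5]]
mu = 1 - 0.0025  # assumption: constant cost factor from a 0.25% commission

R = average_log_return(price_relatives, prev_weights, mu)
print(R)
```

Each weight vector sums to 1, matching the constraint on actions, so every dot product is the one-period growth factor of the whole portfolio.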


## Dataset

In this notebook, we use the dataset generated by [Chi Zhang](https://github.com/vermouth1992/drl-portfolio-management/tree/master/src/utils/datasets). It contains the historic price of 16 target stocks from NASDAQ100, including open, close, high and low prices from 2012-08-13 to 2017-08-11. Specifically, those stocks are: “AAPL”, “ATVI”, “CMCSA”, “COST”, “CSX”, “DISH”, “EA”, “EBAY”, “FB”, “GOOGL”, “HAS”, “ILMN”, “INTC”, “MAR”, “REGN” and “SBUX”.


### Dataset License
This dataset is licensed under the MIT License.

Copyright (c) 2017 Chi Zhang

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

## Using reinforcement learning on Amazon SageMaker RL

Amazon SageMaker RL allows you to train your RL agents using an on-demand and fully managed infrastructure. You do not have to worry about setting up your machines with the RL toolkits and deep learning frameworks, as pre-built RL environments are provided. You can easily switch between many different machine types that are set up for you, including powerful GPU machines that can give a big speedup. You can also choose to use multiple machines in a cluster to further speed up training, which is often necessary for production-level loads.



## Pre-requisites

### Roles and permissions

To get started, we'll import the Python libraries we need and set up the environment with a few prerequisites for permissions and configurations.

In [1]:
import sagemaker
import boto3
import sys
import os
import glob
import re
import subprocess
from IPython.display import HTML
import time
from time import gmtime, strftime
sys.path.append("common")
from misc import get_execution_role, wait_for_s3_object
from sagemaker.rl import RLEstimator, RLToolkit, RLFramework

### Set up S3 buckets

Set up the linkage and authentication to the S3 bucket that you want to use for checkpoints and metadata. 

In [2]:
sage_session = sagemaker.session.Session()
s3_bucket = sage_session.default_bucket()  
s3_output_path = 's3://{}/'.format(s3_bucket)
print("S3 bucket path: {}".format(s3_output_path))

S3 bucket path: s3://sagemaker-eu-central-1-415877977751/


### Define Variables 

We define variables such as the job prefix for the training jobs.

In [3]:
# create unique job name 
job_name_prefix = 'rl-macroeconomic'

### Configure settings

You can run your RL training jobs on a SageMaker notebook instance or on your own machine. In both of these scenarios, you can run the following in either `local` or `SageMaker` mode. The `local` mode uses the SageMaker Python SDK to run your code in a local container before deploying to SageMaker. This can speed up iterative testing and debugging while using the same familiar Python SDK interface. To use it, set `local_mode = True`.

In [4]:
# run in local mode?
local_mode = False

### Create an IAM role
When running from a SageMaker notebook, get the execution role with `role = sagemaker.get_execution_role()`. When running from a local machine, use the utility method `role = get_execution_role()` to create an execution role.

In [5]:
try:
    role = sagemaker.get_execution_role()
except Exception:
    # Fall back to the helper imported from common/misc.py when the SDK call fails
    role = get_execution_role()

print("Using IAM role arn: {}".format(role))

Couldn't call 'get_role' to get Role ARN from role name bluhmben to get Role path.


Using IAM role arn: arn:aws:iam::415877977751:role/sagemaker


### Install docker for `local` mode

In order to work in `local` mode, you need to have docker installed. When running from your local machine, please make sure that you have docker or docker-compose (for local CPU machines) and nvidia-docker (for local GPU machines) installed. Alternatively, when running from a SageMaker notebook instance, you can simply run the following script to install dependencies.

Note that you can only run a single local notebook at a time.

In [None]:
# Run on SageMaker notebook instance
if local_mode:
    !/bin/bash ./common/setup.sh

## Set up the environment

The environment is defined in a Python file called `macroeconomic_env.py`, which is located in the `src` directory. 

The environment implements the `__init__()`, `step()` and `reset()` functions that describe how the environment behaves. This is consistent with the OpenAI Gym interface for defining an environment. 


1. `__init__()` - initialize the environment in a pre-defined state
2. `step()` - take an action on the environment
3. `reset()` - restart the environment on a new episode
4. [if applicable] `render()` - get a rendered image of the environment in its current state
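
To make the roles of these methods concrete, here is a minimal sketch of an environment with the same method names. It is a hypothetical toy (cash plus one stock with made-up random price moves), not the environment used in this notebook, and it deliberately does not subclass `gym.Env` so that it runs with the standard library only. `step()` returns the classic Gym four-tuple of observation, reward, done flag and info dictionary.

```python
import math
import random


class ToyAllocationEnv:
    """Hypothetical two-asset environment with Gym-style method names."""

    def __init__(self, episode_length=3, seed=0):
        # Initialize the environment in a pre-defined state.
        self.episode_length = episode_length
        self._rng = random.Random(seed)
        self._t = 0
        self._weights = [1.0, 0.0]  # start fully in cash

    def reset(self):
        # Restart the environment for a new episode and return the first observation.
        self._t = 0
        self._weights = [1.0, 0.0]
        return list(self._weights)

    def step(self, action):
        # Normalize the action so that the weights sum to 1.
        total = sum(action)
        if total <= 0 or min(action) < 0:
            raise ValueError("action must be non-negative with a positive sum")
        self._weights = [a / total for a in action]
        # Made-up price move for the stock; the cash entry stays at 1.
        relative = [1.0, self._rng.uniform(0.9, 1.1)]
        growth = sum(y * w for y, w in zip(relative, self._weights))
        reward = math.log(growth)
        self._t += 1
        done = self._t >= self.episode_length
        return list(self._weights), reward, done, {}


env = ToyAllocationEnv(episode_length=3)
observation = env.reset()
done = False
steps = 0
while not done:
    observation, reward, done, info = env.step([0.5, 0.5])
    steps += 1
print(steps, observation)
```

The loop at the end mirrors how an agent interacts with any Gym-style environment: reset once per episode, then call `step()` until the episode reports that it is done.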

In [6]:
!pygmentize src/macroeconomic_env.py

import gym
import gym.spaces
import random
import math
import csv
import numpy as np
from scipy.stats import norm


class MacroeconomicEnv(gym.Env):
    """
    An environment for optimal consumption/savings policy in a macroeconomic life-cycle model.
    Based on [Bluhm 2020](...)
    """

    def __init__(self, **config):
        config_defaults = {
            # Parameter choice based on https://www.sas.upenn.edu/~jesusfv/Guid

## Configure the presets for RL algorithm 

The presets that configure the RL training jobs are defined in the `preset-macroeconomic-clippedppo.py` file, which is also located in the `src` directory. Using the preset file, you can define agent parameters to select the specific agent algorithm. You can also set the environment parameters, define the schedule and visualization parameters, and define the graph manager. The schedule presets define the number of heat-up steps, the periodic evaluation steps, and the training steps between evaluations.

The preset can be overridden at runtime by specifying the `RLCOACH_PRESET` hyperparameter. Additionally, the preset file can be used to define custom hyperparameters. 


In [7]:
!pygmentize src/preset-macroeconomic-clippedppo.py

from rl_coach.agents.clipped_ppo_agent import ClippedPPOAgentParameters
from rl_coach.architectures.layers import Dense, Conv2d
from rl_coach.base_parameters import VisualizationParameters, PresetValidationParameters
from rl_coach.base_parameters import MiddlewareScheme, DistributedCoachSynchronizationType, EmbedderScheme
from rl_coach.core_types import TrainingSteps, EnvironmentEpisodes, EnvironmentSteps, RunPhase
from 

## Write the Training Code 

The training code is written in the file `train-coach.py`, which is located in the `src` directory. 
It first imports the environment files and the preset files, and then defines the `main()` function. 

In [8]:
!pygmentize src/train-coach.py

from sagemaker_rl.coach_launcher import SageMakerCoachPresetLauncher


class MyLauncher(SageMakerCoachPresetLauncher):

    def default_preset_name(self):
        """This points to a .py file that configures everything about the RL job.
        It can be overridden at runtime by specifying the RLCOACH_PRESET hyperparameter.
        """
        return 'preset-macroeconomic-clippedppo'

    def map_hyperparameter(self, name, value):
        """Here we configure some shortcut names for hyperparameters that we expect to use frequently.
        Essentially anything in the preset file can be overridden through a hyperparamet

## Train the RL model using the Python SDK Script mode

If you are using local mode, the training will run on the notebook instance. When using SageMaker for training, you can select a GPU or CPU instance. The RLEstimator is used for training RL jobs. 

1. Specify the source directory where the environment, presets and training code are uploaded.
2. Specify the entry point as the training code.
3. Specify the choice of RL toolkit and framework. This automatically resolves to the ECR path for the RL Container. 
4. Define the training parameters such as the instance count, instance type, job name prefix and S3 path for output. 
5. Specify the hyperparameters for the RL agent algorithm. The `RLCOACH_PRESET` can be used to specify the RL agent algorithm you want to use. 
6. [Optional] Choose the metrics that you are interested in capturing in your logs. These can also be visualized in CloudWatch and SageMaker Notebooks. The metrics are defined using regular expression matching.
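
As an illustration of regular-expression-based metrics, the sketch below applies two patterns to log lines in the format printed by the training job. The list of `Name`/`Regex` dictionaries follows the shape that SageMaker estimators accept for their `metric_definitions` argument; the metric names, the patterns and the helper `extract_metric` are assumptions made for this example, and the patterns are only checked here against clean log lines without colour codes.

```python
import re

# Hypothetical metric definitions in Name/Regex form.
metric_definitions = [
    {"Name": "reward-heatup", "Regex": r"^Heatup>.*Total reward=(-?[0-9.]+),"},
    {"Name": "reward-training", "Regex": r"^Training>.*Total reward=(-?[0-9.]+),"},
]


def extract_metric(log_lines, regex):
    """Return the first capture group of every matching line as a float."""
    pattern = re.compile(regex)
    values = []
    for line in log_lines:
        match = pattern.search(line)
        if match:
            values.append(float(match.group(1)))
    return values


sample_log = [
    "Heatup> Name=main_level/agent, Worker=0, Episode=35, Total reward=-0.69, Steps=105, Training iteration=0",
    "Training> Name=main_level/agent, Worker=0, Episode=3336, Total reward=-0.53, Steps=10008, Training iteration=1",
    "Training> Name=main_level/agent, Worker=0, Episode=3337, Total reward=-0.67, Steps=10011, Training iteration=1",
]

heatup_rewards = extract_metric(sample_log, metric_definitions[0]["Regex"])
training_rewards = extract_metric(sample_log, metric_definitions[1]["Regex"])
print(heatup_rewards, training_rewards)
```

Keeping separate patterns for the heat-up and training phases makes it possible to plot only the rewards collected after policy updates have started.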


In [10]:
if local_mode:
    instance_type = 'local'
else:
    instance_type = "ml.m4.xlarge"
    
estimator = RLEstimator(source_dir='src',
                      entry_point="train-coach.py",
                      dependencies=["common/sagemaker_rl"],
                      toolkit=RLToolkit.COACH,
                      toolkit_version='0.11.0',
                      framework=RLFramework.TENSORFLOW,
                      role=role,
                      train_instance_count=1,
                      train_instance_type=instance_type,
                      output_path=s3_output_path,
                      base_job_name=job_name_prefix,
                      hyperparameters = {
                          "RLCOACH_PRESET" : "preset-macroeconomic-clippedppo",
                          "rl.agent_params.algorithm.discount": 0.97,
                          "rl.evaluation_steps:EnvironmentEpisodes": 5
                      }
                    )
# takes ~15min
# The log may show KL divergence=[0.]. This is expected because the divergences were not necessarily required for 
# Clipped PPO. By default they are not calculated for computational efficiency.
estimator.fit()

2020-05-13 12:06:49 Starting - Starting the training job...
2020-05-13 12:06:51 Starting - Launching requested ML instances...
2020-05-13 12:07:47 Starting - Preparing the instances for training......
2020-05-13 12:08:35 Downloading - Downloading input data...
2020-05-13 12:09:15 Training - Training image download completed. Training in progress.[34mbash: cannot set terminal process group (-1): Inappropriate ioctl for device[0m
[34mbash: no job control in this shell[0m
[34m2020-05-13 12:09:16,119 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training[0m
[34m2020-05-13 12:09:16,122 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2020-05-13 12:09:16,234 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2020-05-13 12:09:16,247 sagemaker-containers INFO     Invoking user script
[0m
[34mTraining Env:
[0m
[34m{
    "additional_framework_parameters": {
        "sagemaker_esti

[34mHeatup> Name=main_level/agent, Worker=0, Episode=35, Total reward=-0.69, Steps=105, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=36, Total reward=-0.69, Steps=108, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=37, Total reward=-0.56, Steps=111, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=38, Total reward=-0.72, Steps=114, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=39, Total reward=-0.55, Steps=117, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=40, Total reward=-1.13, Steps=120, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=41, Total reward=-0.91, Steps=123, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=42, Total reward=-0.73, Steps=126, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=43, Total reward=-0.79, Steps=129,

[34mHeatup> Name=main_level/agent, Worker=0, Episode=393, Total reward=-0.79, Steps=1179, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=394, Total reward=-0.55, Steps=1182, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=395, Total reward=-0.44, Steps=1185, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=396, Total reward=-0.9, Steps=1188, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=397, Total reward=-0.49, Steps=1191, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=398, Total reward=-0.56, Steps=1194, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=399, Total reward=-0.54, Steps=1197, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=400, Total reward=-0.48, Steps=1200, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=401, Total reward=-

[34mHeatup> Name=main_level/agent, Worker=0, Episode=626, Total reward=-0.51, Steps=1878, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=627, Total reward=-0.62, Steps=1881, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=628, Total reward=-0.84, Steps=1884, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=629, Total reward=-0.61, Steps=1887, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=630, Total reward=-0.71, Steps=1890, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=631, Total reward=-2.0, Steps=1893, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=632, Total reward=-0.6, Steps=1896, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=633, Total reward=-0.58, Steps=1899, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=634, Total reward=-1

[34mHeatup> Name=main_level/agent, Worker=0, Episode=850, Total reward=-0.75, Steps=2550, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=851, Total reward=-0.51, Steps=2553, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=852, Total reward=-0.64, Steps=2556, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=853, Total reward=-0.98, Steps=2559, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=854, Total reward=-0.62, Steps=2562, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=855, Total reward=-0.76, Steps=2565, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=856, Total reward=-0.64, Steps=2568, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=857, Total reward=-1.05, Steps=2571, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=858, Total reward=

[34mHeatup> Name=main_level/agent, Worker=0, Episode=1068, Total reward=-0.57, Steps=3204, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1069, Total reward=-0.64, Steps=3207, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1070, Total reward=-0.58, Steps=3210, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1071, Total reward=-0.48, Steps=3213, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1072, Total reward=-0.57, Steps=3216, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1073, Total reward=-0.55, Steps=3219, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1074, Total reward=-0.95, Steps=3222, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1075, Total reward=-0.73, Steps=3225, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1076, Tota

[34mHeatup> Name=main_level/agent, Worker=0, Episode=1287, Total reward=-0.56, Steps=3861, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1288, Total reward=-0.82, Steps=3864, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1289, Total reward=-0.63, Steps=3867, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1290, Total reward=-0.78, Steps=3870, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1291, Total reward=-0.74, Steps=3873, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1292, Total reward=-0.52, Steps=3876, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1293, Total reward=-0.72, Steps=3879, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1294, Total reward=-0.75, Steps=3882, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1295, Tota

[34mHeatup> Name=main_level/agent, Worker=0, Episode=1502, Total reward=-0.84, Steps=4506, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1503, Total reward=-0.71, Steps=4509, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1504, Total reward=-0.59, Steps=4512, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1505, Total reward=-0.47, Steps=4515, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1506, Total reward=-0.79, Steps=4518, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1507, Total reward=-0.48, Steps=4521, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1508, Total reward=-0.64, Steps=4524, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1509, Total reward=-0.55, Steps=4527, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1510, Tota

[34mHeatup> Name=main_level/agent, Worker=0, Episode=1711, Total reward=-0.66, Steps=5133, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1712, Total reward=-0.67, Steps=5136, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1713, Total reward=-0.59, Steps=5139, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1714, Total reward=-0.49, Steps=5142, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1715, Total reward=-0.92, Steps=5145, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1716, Total reward=-1.36, Steps=5148, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1717, Total reward=-0.59, Steps=5151, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1718, Total reward=-0.46, Steps=5154, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1719, Tota

[34mHeatup> Name=main_level/agent, Worker=0, Episode=1919, Total reward=-0.49, Steps=5757, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1920, Total reward=-0.56, Steps=5760, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1921, Total reward=-0.99, Steps=5763, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1922, Total reward=-0.71, Steps=5766, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1923, Total reward=-1.02, Steps=5769, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1924, Total reward=-0.57, Steps=5772, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1925, Total reward=-0.73, Steps=5775, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1926, Total reward=-0.87, Steps=5778, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=1927, Tota

[34mHeatup> Name=main_level/agent, Worker=0, Episode=2123, Total reward=-1.12, Steps=6369, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2124, Total reward=-0.65, Steps=6372, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2125, Total reward=-0.72, Steps=6375, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2126, Total reward=-0.69, Steps=6378, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2127, Total reward=-0.61, Steps=6381, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2128, Total reward=-0.45, Steps=6384, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2129, Total reward=-0.42, Steps=6387, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2130, Total reward=-0.61, Steps=6390, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2131, Tota

[34mHeatup> Name=main_level/agent, Worker=0, Episode=2327, Total reward=-0.77, Steps=6981, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2328, Total reward=-0.43, Steps=6984, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2329, Total reward=-0.6, Steps=6987, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2330, Total reward=-0.44, Steps=6990, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2331, Total reward=-0.62, Steps=6993, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2332, Total reward=-0.66, Steps=6996, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2333, Total reward=-0.81, Steps=6999, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2334, Total reward=-0.52, Steps=7002, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2335, Total

[34mHeatup> Name=main_level/agent, Worker=0, Episode=2530, Total reward=-0.63, Steps=7590, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2531, Total reward=-0.58, Steps=7593, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2532, Total reward=-0.74, Steps=7596, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2533, Total reward=-0.67, Steps=7599, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2534, Total reward=-1.01, Steps=7602, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2535, Total reward=-0.65, Steps=7605, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2536, Total reward=-0.79, Steps=7608, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2537, Total reward=-0.53, Steps=7611, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2538, Tota

[34mHeatup> Name=main_level/agent, Worker=0, Episode=2823, Total reward=-0.73, Steps=8469, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2824, Total reward=-0.7, Steps=8472, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2825, Total reward=-0.88, Steps=8475, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2826, Total reward=-0.59, Steps=8478, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2827, Total reward=-0.52, Steps=8481, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2828, Total reward=-0.58, Steps=8484, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2829, Total reward=-0.9, Steps=8487, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2830, Total reward=-0.53, Steps=8490, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=2831, Total 

[34mHeatup> Name=main_level/agent, Worker=0, Episode=3019, Total reward=-0.53, Steps=9057, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=3020, Total reward=-0.55, Steps=9060, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=3021, Total reward=-0.72, Steps=9063, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=3022, Total reward=-0.63, Steps=9066, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=3023, Total reward=-0.82, Steps=9069, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=3024, Total reward=-0.95, Steps=9072, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=3025, Total reward=-0.46, Steps=9075, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=3026, Total reward=-1.33, Steps=9078, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=3027, Tota

[34mHeatup> Name=main_level/agent, Worker=0, Episode=3209, Total reward=-0.62, Steps=9627, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=3210, Total reward=-0.52, Steps=9630, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=3211, Total reward=-0.62, Steps=9633, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=3212, Total reward=-0.44, Steps=9636, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=3213, Total reward=-0.63, Steps=9639, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=3214, Total reward=-0.55, Steps=9642, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=3215, Total reward=-0.94, Steps=9645, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=3216, Total reward=-0.55, Steps=9648, Training iteration=0[0m
[34mHeatup> Name=main_level/agent, Worker=0, Episode=3217, Tota

[34mPolicy training> Surrogate loss=-0.11318536102771759, KL divergence=0.03260501101613045, Entropy=1.3921188116073608, training epoch=9, learning_rate=0.0003[0m
[34mTraining> Name=main_level/agent, Worker=0, Episode=3336, Total reward=-0.53, Steps=10008, Training iteration=1[0m
[34mTraining> Name=main_level/agent, Worker=0, Episode=3337, Total reward=-0.67, Steps=10011, Training iteration=1[0m
[34mTraining> Name=main_level/agent, Worker=0, Episode=3338, Total reward=-0.79, Steps=10014, Training iteration=1[0m
[34mTraining> Name=main_level/agent, Worker=0, Episode=3339, Total reward=-0.57, Steps=10017, Training iteration=1[0m
[34mTraining> Name=main_level/agent, Worker=0, Episode=3340, Total reward=-0.75, Steps=10020, Training iteration=1[0m
[34mTraining> Name=main_level/agent, Worker=0, Episode=3341, Total reward=-0.62, Steps=10023, Training iteration=1[0m
[34mTraining> Name=main_level/agent, Worker=0, Episode=3342, Total reward=-0.6, Steps=10026, Training iteration=1

Training> Name=main_level/agent, Worker=0, Episode=3501, Total reward=-0.59, Steps=10503, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=3502, Total reward=-0.63, Steps=10506, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=3503, Total reward=-0.49, Steps=10509, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=3504, Total reward=-0.52, Steps=10512, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=3505, Total reward=-1.33, Steps=10515, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=3506, Total reward=-0.57, Steps=10518, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=3507, Total reward=-0.78, Steps=10521, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=3508, Total reward=-0.98, Steps=10524, Training iteration=1
...

Training> Name=main_level/agent, Worker=0, Episode=3668, Total reward=-1.07, Steps=11004, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=3669, Total reward=-0.82, Steps=11007, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=3670, Total reward=-0.65, Steps=11010, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=3671, Total reward=-0.54, Steps=11013, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=3672, Total reward=-0.55, Steps=11016, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=3673, Total reward=-0.51, Steps=11019, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=3674, Total reward=-0.84, Steps=11022, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=3675, Total reward=-1.2, Steps=11025, Training iteration=1
...

Training> Name=main_level/agent, Worker=0, Episode=3835, Total reward=-0.5, Steps=11505, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=3836, Total reward=-0.58, Steps=11508, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=3837, Total reward=-0.59, Steps=11511, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=3838, Total reward=-0.65, Steps=11514, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=3839, Total reward=-0.81, Steps=11517, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=3840, Total reward=-0.86, Steps=11520, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=3841, Total reward=-0.65, Steps=11523, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=3842, Total reward=-0.81, Steps=11526, Training iteration=1
...

Training> Name=main_level/agent, Worker=0, Episode=3997, Total reward=-0.65, Steps=11991, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=3998, Total reward=-0.94, Steps=11994, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=3999, Total reward=-0.65, Steps=11997, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=4000, Total reward=-1.27, Steps=12000, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=4001, Total reward=-0.66, Steps=12003, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=4002, Total reward=-0.62, Steps=12006, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=4003, Total reward=-0.7, Steps=12009, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=4004, Total reward=-0.6, Steps=12012, Training iteration=1
...

Training> Name=main_level/agent, Worker=0, Episode=4138, Total reward=-0.61, Steps=12416, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=4139, Total reward=-0.63, Steps=12419, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=4140, Total reward=-0.57, Steps=12422, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=4141, Total reward=-0.66, Steps=12425, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=4142, Total reward=-0.84, Steps=12428, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=4143, Total reward=-1.02, Steps=12431, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=4144, Total reward=-0.43, Steps=12434, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=4145, Total reward=-0.54, Steps=12437, Training iteration=2
...

Training> Name=main_level/agent, Worker=0, Episode=4309, Total reward=-0.45, Steps=12929, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=4310, Total reward=-0.48, Steps=12932, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=4311, Total reward=-1.08, Steps=12935, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=4312, Total reward=-0.7, Steps=12938, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=4313, Total reward=-0.82, Steps=12941, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=4314, Total reward=-0.78, Steps=12944, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=4315, Total reward=-0.58, Steps=12947, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=4316, Total reward=-0.69, Steps=12950, Training iteration=2
...

Training> Name=main_level/agent, Worker=0, Episode=4478, Total reward=-0.67, Steps=13436, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=4479, Total reward=-0.41, Steps=13439, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=4480, Total reward=-0.49, Steps=13442, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=4481, Total reward=-0.51, Steps=13445, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=4482, Total reward=-0.6, Steps=13448, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=4483, Total reward=-0.65, Steps=13451, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=4484, Total reward=-0.56, Steps=13454, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=4485, Total reward=-0.75, Steps=13457, Training iteration=2
...

Training> Name=main_level/agent, Worker=0, Episode=4704, Total reward=-0.51, Steps=14116, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=4705, Total reward=-0.55, Steps=14119, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=4706, Total reward=-0.68, Steps=14122, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=4707, Total reward=-0.49, Steps=14125, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=4708, Total reward=-0.61, Steps=14128, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=4709, Total reward=-0.63, Steps=14131, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=4710, Total reward=-0.64, Steps=14134, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=4711, Total reward=-0.51, Steps=14137, Training iteration=3
...

Training> Name=main_level/agent, Worker=0, Episode=4872, Total reward=-0.75, Steps=14620, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=4873, Total reward=-0.44, Steps=14623, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=4874, Total reward=-0.54, Steps=14626, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=4875, Total reward=-0.84, Steps=14629, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=4876, Total reward=-0.63, Steps=14632, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=4877, Total reward=-0.9, Steps=14635, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=4878, Total reward=-0.48, Steps=14638, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=4879, Total reward=-1.17, Steps=14641, Training iteration=3
...

Training> Name=main_level/agent, Worker=0, Episode=5038, Total reward=-0.93, Steps=15118, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=5039, Total reward=-0.76, Steps=15121, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=5040, Total reward=-0.55, Steps=15124, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=5041, Total reward=-0.63, Steps=15127, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=5042, Total reward=-0.55, Steps=15130, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=5043, Total reward=-0.58, Steps=15133, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=5044, Total reward=-1.99, Steps=15136, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=5045, Total reward=-0.58, Steps=15139, Training iteration=3
...

Training> Name=main_level/agent, Worker=0, Episode=5204, Total reward=-0.64, Steps=15616, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=5205, Total reward=-0.74, Steps=15619, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=5206, Total reward=-0.63, Steps=15622, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=5207, Total reward=-0.68, Steps=15625, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=5208, Total reward=-0.42, Steps=15628, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=5209, Total reward=-0.75, Steps=15631, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=5210, Total reward=-0.58, Steps=15634, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=5211, Total reward=-0.59, Steps=15637, Training iteration=3
...

Training> Name=main_level/agent, Worker=0, Episode=5369, Total reward=-0.6, Steps=16111, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=5370, Total reward=-0.53, Steps=16114, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=5371, Total reward=-0.63, Steps=16117, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=5372, Total reward=-0.51, Steps=16120, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=5373, Total reward=-0.6, Steps=16123, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=5374, Total reward=-0.58, Steps=16126, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=5375, Total reward=-0.45, Steps=16129, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=5376, Total reward=-0.7, Steps=16132, Training iteration=3
...

Training> Name=main_level/agent, Worker=0, Episode=5503, Total reward=-0.65, Steps=16515, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=5504, Total reward=-0.66, Steps=16518, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=5505, Total reward=-0.69, Steps=16521, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=5506, Total reward=-0.6, Steps=16524, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=5507, Total reward=-0.64, Steps=16527, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=5508, Total reward=-0.43, Steps=16530, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=5509, Total reward=-0.51, Steps=16533, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=5510, Total reward=-0.81, Steps=16536, Training iteration=4
...

Training> Name=main_level/agent, Worker=0, Episode=5663, Total reward=-0.61, Steps=16995, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=5664, Total reward=-0.51, Steps=16998, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=5665, Total reward=-0.54, Steps=17001, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=5666, Total reward=-0.49, Steps=17004, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=5667, Total reward=-0.47, Steps=17007, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=5668, Total reward=-0.59, Steps=17010, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=5669, Total reward=-0.83, Steps=17013, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=5670, Total reward=-0.61, Steps=17016, Training iteration=4
...

Training> Name=main_level/agent, Worker=0, Episode=5822, Total reward=-0.58, Steps=17472, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=5823, Total reward=-0.47, Steps=17475, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=5824, Total reward=-0.6, Steps=17478, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=5825, Total reward=-0.42, Steps=17481, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=5826, Total reward=-0.48, Steps=17484, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=5827, Total reward=-0.99, Steps=17487, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=5828, Total reward=-0.63, Steps=17490, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=5829, Total reward=-0.57, Steps=17493, Training iteration=4
...

Training> Name=main_level/agent, Worker=0, Episode=5982, Total reward=-0.45, Steps=17952, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=5983, Total reward=-0.66, Steps=17955, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=5984, Total reward=-0.44, Steps=17958, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=5985, Total reward=-0.71, Steps=17961, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=5986, Total reward=-0.55, Steps=17964, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=5987, Total reward=-0.47, Steps=17967, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=5988, Total reward=-0.55, Steps=17970, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=5989, Total reward=-0.43, Steps=17973, Training iteration=4
...

Training> Name=main_level/agent, Worker=0, Episode=6111, Total reward=-1.72, Steps=18341, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=6112, Total reward=-0.46, Steps=18344, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=6113, Total reward=-0.54, Steps=18347, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=6114, Total reward=-0.54, Steps=18350, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=6115, Total reward=-0.52, Steps=18353, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=6116, Total reward=-0.6, Steps=18356, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=6117, Total reward=-0.96, Steps=18359, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=6118, Total reward=-0.48, Steps=18362, Training iteration=5
...

Training> Name=main_level/agent, Worker=0, Episode=6268, Total reward=-0.57, Steps=18812, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=6269, Total reward=-0.5, Steps=18815, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=6270, Total reward=-0.46, Steps=18818, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=6271, Total reward=-0.53, Steps=18821, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=6272, Total reward=-0.66, Steps=18824, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=6273, Total reward=-0.72, Steps=18827, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=6274, Total reward=-0.64, Steps=18830, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=6275, Total reward=-0.7, Steps=18833, Training iteration=5
...

Training> Name=main_level/agent, Worker=0, Episode=6505, Total reward=-0.69, Steps=19523, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=6506, Total reward=-0.61, Steps=19526, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=6507, Total reward=-0.44, Steps=19529, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=6508, Total reward=-0.69, Steps=19532, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=6509, Total reward=-0.51, Steps=19535, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=6510, Total reward=-0.47, Steps=19538, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=6511, Total reward=-0.49, Steps=19541, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=6512, Total reward=-0.68, Steps=19544, Training iteration=5
...

Training> Name=main_level/agent, Worker=0, Episode=6661, Total reward=-0.48, Steps=19991, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=6662, Total reward=-0.65, Steps=19994, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=6663, Total reward=-0.57, Steps=19997, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=6664, Total reward=-0.57, Steps=20000, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=6665, Total reward=-0.56, Steps=20003, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=6666, Total reward=-0.56, Steps=20006, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=6667, Total reward=-0.6, Steps=20009, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=6668, Total reward=-0.65, Steps=20012, Training iteration=5
...

Training> Name=main_level/agent, Worker=0, Episode=6784, Total reward=-0.48, Steps=20362, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=6785, Total reward=-0.49, Steps=20365, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=6786, Total reward=-0.47, Steps=20368, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=6787, Total reward=-0.64, Steps=20371, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=6788, Total reward=-0.54, Steps=20374, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=6789, Total reward=-0.59, Steps=20377, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=6790, Total reward=-0.55, Steps=20380, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=6791, Total reward=-0.52, Steps=20383, Training iteration=6
...

Training> Name=main_level/agent, Worker=0, Episode=6937, Total reward=-0.45, Steps=20821, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=6938, Total reward=-0.54, Steps=20824, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=6939, Total reward=-0.56, Steps=20827, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=6940, Total reward=-0.59, Steps=20830, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=6941, Total reward=-0.5, Steps=20833, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=6942, Total reward=-0.61, Steps=20836, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=6943, Total reward=-0.61, Steps=20839, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=6944, Total reward=-0.51, Steps=20842, Training iteration=6
...

Training> Name=main_level/agent, Worker=0, Episode=7089, Total reward=-0.55, Steps=21277, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=7090, Total reward=-0.44, Steps=21280, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=7091, Total reward=-0.59, Steps=21283, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=7092, Total reward=-0.63, Steps=21286, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=7093, Total reward=-0.5, Steps=21289, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=7094, Total reward=-0.64, Steps=21292, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=7095, Total reward=-0.58, Steps=21295, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=7096, Total reward=-0.55, Steps=21298, Training iteration=6
...

Training> Name=main_level/agent, Worker=0, Episode=7238, Total reward=-0.59, Steps=21724, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=7239, Total reward=-0.44, Steps=21727, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=7240, Total reward=-0.62, Steps=21730, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=7241, Total reward=-0.64, Steps=21733, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=7242, Total reward=-0.45, Steps=21736, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=7243, Total reward=-0.67, Steps=21739, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=7244, Total reward=-0.41, Steps=21742, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=7245, Total reward=-0.56, Steps=21745, Training iteration=6
...

Training> Name=main_level/agent, Worker=0, Episode=7389, Total reward=-0.57, Steps=22177, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=7390, Total reward=-0.42, Steps=22180, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=7391, Total reward=-0.7, Steps=22183, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=7392, Total reward=-0.56, Steps=22186, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=7393, Total reward=-0.65, Steps=22189, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=7394, Total reward=-0.62, Steps=22192, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=7395, Total reward=-0.58, Steps=22195, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=7396, Total reward=-0.66, Steps=22198, Training iteration=6
...

Training> Name=main_level/agent, Worker=0, Episode=7512, Total reward=-0.59, Steps=22548, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=7513, Total reward=-0.55, Steps=22551, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=7514, Total reward=-0.55, Steps=22554, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=7515, Total reward=-0.67, Steps=22557, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=7516, Total reward=-0.49, Steps=22560, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=7517, Total reward=-0.62, Steps=22563, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=7518, Total reward=-0.77, Steps=22566, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=7519, Total reward=-0.51, Steps=22569, Training iteration=7
...

Training> Name=main_level/agent, Worker=0, Episode=7659, Total reward=-0.61, Steps=22989, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=7660, Total reward=-0.64, Steps=22992, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=7661, Total reward=-0.44, Steps=22995, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=7662, Total reward=-0.53, Steps=22998, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=7663, Total reward=-0.51, Steps=23001, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=7664, Total reward=-0.44, Steps=23004, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=7665, Total reward=-0.64, Steps=23007, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=7666, Total reward=-0.47, Steps=23010, Training iteration=7
...

Training> Name=main_level/agent, Worker=0, Episode=7806, Total reward=-0.54, Steps=23430, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=7807, Total reward=-0.46, Steps=23433, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=7808, Total reward=-0.49, Steps=23436, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=7809, Total reward=-0.41, Steps=23439, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=7810, Total reward=-0.46, Steps=23442, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=7811, Total reward=-0.49, Steps=23445, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=7812, Total reward=-0.59, Steps=23448, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=7813, Total reward=-0.48, Steps=23451, Training iteration=7
...

Training> Name=main_level/agent, Worker=0, Episode=7952, Total reward=-0.47, Steps=23868, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=7953, Total reward=-0.56, Steps=23871, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=7954, Total reward=-0.57, Steps=23874, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=7955, Total reward=-0.42, Steps=23877, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=7956, Total reward=-0.55, Steps=23880, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=7957, Total reward=-0.65, Steps=23883, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=7958, Total reward=-0.67, Steps=23886, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=7959, Total reward=-0.5, Steps=23889, Training iteration=7
...

Training> Name=main_level/agent, Worker=0, Episode=8097, Total reward=-0.62, Steps=24303, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=8098, Total reward=-0.42, Steps=24306, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=8099, Total reward=-0.46, Steps=24309, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=8100, Total reward=-0.56, Steps=24312, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=8101, Total reward=-0.58, Steps=24315, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=8102, Total reward=-0.7, Steps=24318, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=8103, Total reward=-0.59, Steps=24321, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=8104, Total reward=-0.65, Steps=24324, Training iteration=7
...

Training> Name=main_level/agent, Worker=0, Episode=8218, Total reward=-0.76, Steps=24668, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=8219, Total reward=-0.64, Steps=24671, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=8220, Total reward=-0.58, Steps=24674, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=8221, Total reward=-0.48, Steps=24677, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=8222, Total reward=-0.55, Steps=24680, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=8223, Total reward=-0.45, Steps=24683, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=8224, Total reward=-0.6, Steps=24686, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=8225, Total reward=-0.45, Steps=24689, Training iteration=8
...

Training> Name=main_level/agent, Worker=0, Episode=8437, Total reward=-0.65, Steps=25325, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=8438, Total reward=-0.66, Steps=25328, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=8439, Total reward=-0.54, Steps=25331, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=8440, Total reward=-0.64, Steps=25334, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=8441, Total reward=-0.59, Steps=25337, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=8442, Total reward=-0.52, Steps=25340, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=8443, Total reward=-0.51, Steps=25343, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=8444, Total reward=-0.44, Steps=25346, Training iteration=8
...

Training> Name=main_level/agent, Worker=0, Episode=8579, Total reward=-0.54, Steps=25751, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=8580, Total reward=-0.54, Steps=25754, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=8581, Total reward=-0.49, Steps=25757, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=8582, Total reward=-0.56, Steps=25760, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=8583, Total reward=-0.64, Steps=25763, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=8584, Total reward=-0.74, Steps=25766, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=8585, Total reward=-0.53, Steps=25769, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=8586, Total reward=-0.69, Steps=25772, Training iteration=8
...

Training> Name=main_level/agent, Worker=0, Episode=8722, Total reward=-0.54, Steps=26180, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=8723, Total reward=-0.52, Steps=26183, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=8724, Total reward=-0.64, Steps=26186, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=8725, Total reward=-0.54, Steps=26189, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=8726, Total reward=-0.44, Steps=26192, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=8727, Total reward=-0.53, Steps=26195, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=8728, Total reward=-0.48, Steps=26198, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=8729, Total reward=-0.53, Steps=26201, Training iteration=8
Training> Name=main_level/agent, Wo

Training> Name=main_level/agent, Worker=0, Episode=8836, Total reward=-0.48, Steps=26524, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=8837, Total reward=-0.49, Steps=26527, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=8838, Total reward=-0.42, Steps=26530, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=8839, Total reward=-0.54, Steps=26533, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=8840, Total reward=-0.49, Steps=26536, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=8841, Total reward=-0.56, Steps=26539, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=8842, Total reward=-0.76, Steps=26542, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=8843, Total reward=-0.7, Steps=26545, Training iteration=9
Training> Name=main_level/agent, Wor

Training> Name=main_level/agent, Worker=0, Episode=8976, Total reward=-0.43, Steps=26944, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=8977, Total reward=-0.61, Steps=26947, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=8978, Total reward=-0.59, Steps=26950, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=8979, Total reward=-0.54, Steps=26953, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=8980, Total reward=-0.45, Steps=26956, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=8981, Total reward=-0.59, Steps=26959, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=8982, Total reward=-0.53, Steps=26962, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=8983, Total reward=-0.46, Steps=26965, Training iteration=9
Training> Name=main_level/agent, Wo

Checkpoint> Saving in path=['/opt/ml/output/data/checkpoint/38_Step-17368.ckpt']
Training> Name=main_level/agent, Worker=0, Episode=9119, Total reward=-0.45, Steps=27373, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=9120, Total reward=-0.56, Steps=27376, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=9121, Total reward=-0.69, Steps=27379, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=9122, Total reward=-0.53, Steps=27382, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=9123, Total reward=-0.57, Steps=27385, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=9124, Total reward=-0.49, Steps=27388, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=9125, Total reward=-0.56, Steps=27391, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=9126, Total re

Training> Name=main_level/agent, Worker=0, Episode=9264, Total reward=-0.51, Steps=27808, Training iteration=9
Checkpoint> Saving in path=['/opt/ml/output/data/checkpoint/39_Step-17806.ckpt']
Training> Name=main_level/agent, Worker=0, Episode=9265, Total reward=-0.77, Steps=27811, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=9266, Total reward=-0.48, Steps=27814, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=9267, Total reward=-0.57, Steps=27817, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=9268, Total reward=-0.63, Steps=27820, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=9269, Total reward=-0.58, Steps=27823, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=9270, Total reward=-0.46, Steps=27826, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=9271, Total re

Training> Name=main_level/agent, Worker=0, Episode=9403, Total reward=-0.48, Steps=28225, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=9404, Total reward=-0.62, Steps=28228, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=9405, Total reward=-0.47, Steps=28231, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=9406, Total reward=-0.49, Steps=28234, Training iteration=9
Checkpoint> Saving in path=['/opt/ml/output/data/checkpoint/40_Step-18232.ckpt']
Training> Name=main_level/agent, Worker=0, Episode=9407, Total reward=-0.55, Steps=28237, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=9408, Total reward=-0.44, Steps=28240, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=9409, Total reward=-0.41, Steps=28243, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=9410, Total re

Training> Name=main_level/agent, Worker=0, Episode=9517, Total reward=-0.58, Steps=28569, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=9518, Total reward=-0.56, Steps=28572, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=9519, Total reward=-0.49, Steps=28575, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=9520, Total reward=-0.49, Steps=28578, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=9521, Total reward=-0.67, Steps=28581, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=9522, Total reward=-0.62, Steps=28584, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=9523, Total reward=-0.66, Steps=28587, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=9524, Total reward=-0.49, Steps=28590, Training iteration=10
Checkpoint> Saving in path=

Training> Name=main_level/agent, Worker=0, Episode=9722, Total reward=-0.54, Steps=29184, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=9723, Total reward=-0.47, Steps=29187, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=9724, Total reward=-0.54, Steps=29190, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=9725, Total reward=-0.43, Steps=29193, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=9726, Total reward=-0.72, Steps=29196, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=9727, Total reward=-0.67, Steps=29199, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=9728, Total reward=-0.55, Steps=29202, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=9729, Total reward=-0.42, Steps=29205, Training iteration=10
Training> Name=main_level/a

Training> Name=main_level/agent, Worker=0, Episode=9792, Total reward=-0.5, Steps=29394, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=9793, Total reward=-0.46, Steps=29397, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=9794, Total reward=-0.62, Steps=29400, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=9795, Total reward=-0.61, Steps=29403, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=9796, Total reward=-0.63, Steps=29406, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=9797, Total reward=-0.5, Steps=29409, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=9798, Total reward=-0.48, Steps=29412, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=9799, Total reward=-0.58, Steps=29415, Training iteration=10
Training> Name=main_level/age

Training> Name=main_level/agent, Worker=0, Episode=9927, Total reward=-0.42, Steps=29799, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=9928, Total reward=-0.55, Steps=29802, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=9929, Total reward=-0.59, Steps=29805, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=9930, Total reward=-0.72, Steps=29808, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=9931, Total reward=-0.49, Steps=29811, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=9932, Total reward=-0.54, Steps=29814, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=9933, Total reward=-0.7, Steps=29817, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=9934, Total reward=-0.59, Steps=29820, Training iteration=10
Training> Name=main_level/ag

Training> Name=main_level/agent, Worker=0, Episode=10128, Total reward=-0.63, Steps=30402, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=10129, Total reward=-0.55, Steps=30405, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=10130, Total reward=-0.62, Steps=30408, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=10131, Total reward=-0.52, Steps=30411, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=10132, Total reward=-0.47, Steps=30414, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=10133, Total reward=-0.6, Steps=30417, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=10134, Total reward=-0.57, Steps=30420, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=10135, Total reward=-0.47, Steps=30423, Training iteration=10
Training> Name=main_

Training> Name=main_level/agent, Worker=0, Episode=10241, Total reward=-0.64, Steps=30743, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10242, Total reward=-0.51, Steps=30746, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10243, Total reward=-0.65, Steps=30749, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10244, Total reward=-0.43, Steps=30752, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10245, Total reward=-0.5, Steps=30755, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10246, Total reward=-0.58, Steps=30758, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10247, Total reward=-0.44, Steps=30761, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10248, Total reward=-0.58, Steps=30764, Training iteration=11
Training> Name=main_

Training> Name=main_level/agent, Worker=0, Episode=10376, Total reward=-0.56, Steps=31148, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10377, Total reward=-0.53, Steps=31151, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10378, Total reward=-0.5, Steps=31154, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10379, Total reward=-0.62, Steps=31157, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10380, Total reward=-0.52, Steps=31160, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10381, Total reward=-0.63, Steps=31163, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10382, Total reward=-0.47, Steps=31166, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10383, Total reward=-0.59, Steps=31169, Training iteration=11
Training> Name=main_

Training> Name=main_level/agent, Worker=0, Episode=10509, Total reward=-0.49, Steps=31547, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10510, Total reward=-0.66, Steps=31550, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10511, Total reward=-0.46, Steps=31553, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10512, Total reward=-0.57, Steps=31556, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10513, Total reward=-0.43, Steps=31559, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10514, Total reward=-1.77, Steps=31562, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10515, Total reward=-0.6, Steps=31565, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10516, Total reward=-0.66, Steps=31568, Training iteration=11
Training> Name=main_

Training> Name=main_level/agent, Worker=0, Episode=10643, Total reward=-0.61, Steps=31949, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10644, Total reward=-0.5, Steps=31952, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10645, Total reward=-0.49, Steps=31955, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10646, Total reward=-0.41, Steps=31958, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10647, Total reward=-0.46, Steps=31961, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10648, Total reward=-0.6, Steps=31964, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10649, Total reward=-0.5, Steps=31967, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10650, Total reward=-0.66, Steps=31970, Training iteration=11
Training> Name=main_le

Training> Name=main_level/agent, Worker=0, Episode=10775, Total reward=-0.56, Steps=32345, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10776, Total reward=-0.61, Steps=32348, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10777, Total reward=-0.63, Steps=32351, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10778, Total reward=-0.66, Steps=32354, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10779, Total reward=-0.51, Steps=32357, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10780, Total reward=-0.43, Steps=32360, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10781, Total reward=-0.6, Steps=32363, Training iteration=11
Training> Name=main_level/agent, Worker=0, Episode=10782, Total reward=-0.5, Steps=32366, Training iteration=11
Training> Name=main_l

Training> Name=main_level/agent, Worker=0, Episode=10882, Total reward=-0.58, Steps=32668, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=10883, Total reward=-0.57, Steps=32671, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=10884, Total reward=-0.61, Steps=32674, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=10885, Total reward=-0.72, Steps=32677, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=10886, Total reward=-0.51, Steps=32680, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=10887, Total reward=-0.59, Steps=32683, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=10888, Total reward=-0.51, Steps=32686, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=10889, Total reward=-0.49, Steps=32689, Training iteration=12
Training> Name=main

Training> Name=main_level/agent, Worker=0, Episode=11015, Total reward=-0.63, Steps=33067, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=11016, Total reward=-0.49, Steps=33070, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=11017, Total reward=-0.44, Steps=33073, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=11018, Total reward=-0.42, Steps=33076, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=11019, Total reward=-0.58, Steps=33079, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=11020, Total reward=-0.44, Steps=33082, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=11021, Total reward=-0.46, Steps=33085, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=11022, Total reward=-0.47, Steps=33088, Training iteration=12
Training> Name=main

Training> Name=main_level/agent, Worker=0, Episode=11145, Total reward=-0.53, Steps=33457, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=11146, Total reward=-0.5, Steps=33460, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=11147, Total reward=-0.58, Steps=33463, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=11148, Total reward=-0.48, Steps=33466, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=11149, Total reward=-0.67, Steps=33469, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=11150, Total reward=-0.5, Steps=33472, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=11151, Total reward=-0.56, Steps=33475, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=11152, Total reward=-0.5, Steps=33478, Training iteration=12
Training> Name=main_le

Training> Name=main_level/agent, Worker=0, Episode=11276, Total reward=-0.58, Steps=33850, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=11277, Total reward=-0.58, Steps=33853, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=11278, Total reward=-0.46, Steps=33856, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=11279, Total reward=-0.63, Steps=33859, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=11280, Total reward=-0.53, Steps=33862, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=11281, Total reward=-0.43, Steps=33865, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=11282, Total reward=-0.41, Steps=33868, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=11283, Total reward=-0.53, Steps=33871, Training iteration=12
Training> Name=main

Training> Name=main_level/agent, Worker=0, Episode=11472, Total reward=-0.67, Steps=34438, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=11473, Total reward=-0.58, Steps=34441, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=11474, Total reward=-0.55, Steps=34444, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=11475, Total reward=-0.58, Steps=34447, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=11476, Total reward=-0.59, Steps=34450, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=11477, Total reward=-0.53, Steps=34453, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=11478, Total reward=-0.43, Steps=34456, Training iteration=12
Training> Name=main_level/agent, Worker=0, Episode=11479, Total reward=-0.54, Steps=34459, Training iteration=12
Training> Name=main

Training> Name=main_level/agent, Worker=0, Episode=11577, Total reward=-0.43, Steps=34755, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=11578, Total reward=-0.47, Steps=34758, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=11579, Total reward=-0.6, Steps=34761, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=11580, Total reward=-0.63, Steps=34764, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=11581, Total reward=-0.46, Steps=34767, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=11582, Total reward=-0.55, Steps=34770, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=11583, Total reward=-0.46, Steps=34773, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=11584, Total reward=-0.43, Steps=34776, Training iteration=13
Training> Name=main_

Training> Name=main_level/agent, Worker=0, Episode=11705, Total reward=-0.49, Steps=35139, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=11706, Total reward=-0.52, Steps=35142, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=11707, Total reward=-0.47, Steps=35145, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=11708, Total reward=-0.45, Steps=35148, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=11709, Total reward=-0.53, Steps=35151, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=11710, Total reward=-0.5, Steps=35154, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=11711, Total reward=-0.53, Steps=35157, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=11712, Total reward=-0.54, Steps=35160, Training iteration=13
Training> Name=main_

Training> Name=main_level/agent, Worker=0, Episode=11834, Total reward=-0.63, Steps=35526, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=11835, Total reward=-0.53, Steps=35529, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=11836, Total reward=-0.51, Steps=35532, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=11837, Total reward=-0.44, Steps=35535, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=11838, Total reward=-0.62, Steps=35538, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=11839, Total reward=-0.45, Steps=35541, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=11840, Total reward=-0.52, Steps=35544, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=11841, Total reward=-0.49, Steps=35547, Training iteration=13
Training> Name=main

Training> Name=main_level/agent, Worker=0, Episode=11962, Total reward=-0.49, Steps=35910, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=11963, Total reward=-0.49, Steps=35913, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=11964, Total reward=-0.47, Steps=35916, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=11965, Total reward=-0.5, Steps=35919, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=11966, Total reward=-0.55, Steps=35922, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=11967, Total reward=-0.43, Steps=35925, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=11968, Total reward=-0.48, Steps=35928, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=11969, Total reward=-0.75, Steps=35931, Training iteration=13
Training> Name=main_

Training> Name=main_level/agent, Worker=0, Episode=12089, Total reward=-0.57, Steps=36291, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=12090, Total reward=-0.45, Steps=36294, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=12091, Total reward=-0.52, Steps=36297, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=12092, Total reward=-0.72, Steps=36300, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=12093, Total reward=-0.6, Steps=36303, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=12094, Total reward=-0.52, Steps=36306, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=12095, Total reward=-0.52, Steps=36309, Training iteration=13
Training> Name=main_level/agent, Worker=0, Episode=12096, Total reward=-0.68, Steps=36312, Training iteration=13
Training> Name=main_

Policy training> Surrogate loss=-0.014491169713437557, KL divergence=0.009127105586230755, Entropy=1.0463768243789673, training epoch=5, learning_rate=0.0003
Policy training> Surrogate loss=-0.021831920370459557, KL divergence=0.005874939728528261, Entropy=1.0444287061691284, training epoch=6, learning_rate=0.0003
Policy training> Surrogate loss=-0.01539511140435934, KL divergence=0.0087219113484025, Entropy=1.0425406694412231, training epoch=7, learning_rate=0.0003
Policy training> Surrogate loss=-0.014202599413692951, KL divergence=0.00633451621979475, Entropy=1.0407545566558838, training epoch=8, learning_rate=0.0003
Policy training> Surrogate loss=-0.014990899711847305, KL divergence=0.00811222568154335, Entropy=1.0392119884490967, training epoch=9, learning_rate=0.0003
Training> Name=main_level/agent, Worker=0, Episode=12202, Total reward=-0.46, Steps=36632, Training iteration=14
Training> Name=main_level/agent, Worker=0, 

Training> Name=main_level/agent, Worker=0, Episode=12320, Total reward=-0.4, Steps=36986, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12321, Total reward=-0.42, Steps=36989, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12322, Total reward=-0.49, Steps=36992, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12323, Total reward=-0.52, Steps=36995, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12324, Total reward=-0.6, Steps=36998, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12325, Total reward=-0.51, Steps=37001, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12326, Total reward=-0.47, Steps=37004, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12327, Total reward=-0.54, Steps=37007, Training iteration=14
Training> Name=main_l

Training> Name=main_level/agent, Worker=0, Episode=12447, Total reward=-0.52, Steps=37367, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12448, Total reward=-0.57, Steps=37370, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12449, Total reward=-0.48, Steps=37373, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12450, Total reward=-0.68, Steps=37376, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12451, Total reward=-0.52, Steps=37379, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12452, Total reward=-0.75, Steps=37382, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12453, Total reward=-0.45, Steps=37385, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12454, Total reward=-0.39, Steps=37388, Training iteration=14
Training> Name=main

Training> Name=main_level/agent, Worker=0, Episode=12572, Total reward=-0.55, Steps=37742, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12573, Total reward=-0.61, Steps=37745, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12574, Total reward=-0.46, Steps=37748, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12575, Total reward=-0.57, Steps=37751, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12576, Total reward=-0.48, Steps=37754, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12577, Total reward=-0.71, Steps=37757, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12578, Total reward=-0.61, Steps=37760, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12579, Total reward=-0.54, Steps=37763, Training iteration=14
Training> Name=main

Training> Name=main_level/agent, Worker=0, Episode=12695, Total reward=-0.47, Steps=38111, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12696, Total reward=-0.65, Steps=38114, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12697, Total reward=-0.53, Steps=38117, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12698, Total reward=-0.62, Steps=38120, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12699, Total reward=-0.46, Steps=38123, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12700, Total reward=-0.56, Steps=38126, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12701, Total reward=-0.51, Steps=38129, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12702, Total reward=-0.48, Steps=38132, Training iteration=14
Training> Name=main

Training> Name=main_level/agent, Worker=0, Episode=12819, Total reward=-0.6, Steps=38483, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12820, Total reward=-0.48, Steps=38486, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12821, Total reward=-0.49, Steps=38489, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12822, Total reward=-0.51, Steps=38492, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12823, Total reward=-0.5, Steps=38495, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12824, Total reward=-0.48, Steps=38498, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12825, Total reward=-0.47, Steps=38501, Training iteration=14
Training> Name=main_level/agent, Worker=0, Episode=12826, Total reward=-0.52, Steps=38504, Training iteration=14

Training> Name=main_level/agent, Worker=0, Episode=12983, Total reward=-0.43, Steps=38977, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=12984, Total reward=-0.57, Steps=38980, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=12985, Total reward=-0.42, Steps=38983, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=12986, Total reward=-0.64, Steps=38986, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=12987, Total reward=-0.43, Steps=38989, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=12988, Total reward=-0.46, Steps=38992, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=12989, Total reward=-0.42, Steps=38995, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=12990, Total reward=-0.43, Steps=38998, Training iteration=15

Training> Name=main_level/agent, Worker=0, Episode=13108, Total reward=-0.59, Steps=39352, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=13109, Total reward=-0.58, Steps=39355, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=13110, Total reward=-0.48, Steps=39358, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=13111, Total reward=-0.67, Steps=39361, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=13112, Total reward=-0.62, Steps=39364, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=13113, Total reward=-0.55, Steps=39367, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=13114, Total reward=-0.45, Steps=39370, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=13115, Total reward=-0.55, Steps=39373, Training iteration=15

Training> Name=main_level/agent, Worker=0, Episode=13231, Total reward=-0.61, Steps=39721, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=13232, Total reward=-0.69, Steps=39724, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=13233, Total reward=-0.5, Steps=39727, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=13234, Total reward=-0.44, Steps=39730, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=13235, Total reward=-0.56, Steps=39733, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=13236, Total reward=-0.55, Steps=39736, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=13237, Total reward=-0.59, Steps=39739, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=13238, Total reward=-0.58, Steps=39742, Training iteration=15

Training> Name=main_level/agent, Worker=0, Episode=13354, Total reward=-0.54, Steps=40090, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=13355, Total reward=-0.53, Steps=40093, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=13356, Total reward=-0.51, Steps=40096, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=13357, Total reward=-0.45, Steps=40099, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=13358, Total reward=-0.53, Steps=40102, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=13359, Total reward=-0.55, Steps=40105, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=13360, Total reward=-0.64, Steps=40108, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=13361, Total reward=-0.52, Steps=40111, Training iteration=15

Training> Name=main_level/agent, Worker=0, Episode=13478, Total reward=-0.57, Steps=40462, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=13479, Total reward=-0.72, Steps=40465, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=13480, Total reward=-0.5, Steps=40468, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=13481, Total reward=-0.58, Steps=40471, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=13482, Total reward=-0.46, Steps=40474, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=13483, Total reward=-0.54, Steps=40477, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=13484, Total reward=-0.39, Steps=40480, Training iteration=15
Training> Name=main_level/agent, Worker=0, Episode=13485, Total reward=-0.52, Steps=40483, Training iteration=15

Training> Name=main_level/agent, Worker=0, Episode=13576, Total reward=-0.55, Steps=40758, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=13577, Total reward=-0.48, Steps=40761, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=13578, Total reward=-0.53, Steps=40764, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=13579, Total reward=-0.68, Steps=40767, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=13580, Total reward=-0.46, Steps=40770, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=13581, Total reward=-0.56, Steps=40773, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=13582, Total reward=-0.45, Steps=40776, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=13583, Total reward=-0.56, Steps=40779, Training iteration=16

Training> Name=main_level/agent, Worker=0, Episode=13698, Total reward=-0.45, Steps=41124, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=13699, Total reward=-0.46, Steps=41127, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=13700, Total reward=-0.48, Steps=41130, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=13701, Total reward=-0.47, Steps=41133, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=13702, Total reward=-0.68, Steps=41136, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=13703, Total reward=-0.71, Steps=41139, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=13704, Total reward=-0.44, Steps=41142, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=13705, Total reward=-0.52, Steps=41145, Training iteration=16

Training> Name=main_level/agent, Worker=0, Episode=13818, Total reward=-0.5, Steps=41484, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=13819, Total reward=-0.46, Steps=41487, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=13820, Total reward=-0.41, Steps=41490, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=13821, Total reward=-0.48, Steps=41493, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=13822, Total reward=-0.62, Steps=41496, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=13823, Total reward=-0.58, Steps=41499, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=13824, Total reward=-0.49, Steps=41502, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=13825, Total reward=-0.49, Steps=41505, Training iteration=16

Training> Name=main_level/agent, Worker=0, Episode=13937, Total reward=-0.52, Steps=41841, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=13938, Total reward=-0.45, Steps=41844, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=13939, Total reward=-0.42, Steps=41847, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=13940, Total reward=-0.43, Steps=41850, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=13941, Total reward=-0.57, Steps=41853, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=13942, Total reward=-0.72, Steps=41856, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=13943, Total reward=-0.55, Steps=41859, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=13944, Total reward=-0.45, Steps=41862, Training iteration=16

Training> Name=main_level/agent, Worker=0, Episode=14054, Total reward=-0.44, Steps=42192, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=14055, Total reward=-0.59, Steps=42195, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=14056, Total reward=-0.46, Steps=42198, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=14057, Total reward=-0.5, Steps=42201, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=14058, Total reward=-0.59, Steps=42204, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=14059, Total reward=-0.56, Steps=42207, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=14060, Total reward=-0.56, Steps=42210, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=14061, Total reward=-0.42, Steps=42213, Training iteration=16

Training> Name=main_level/agent, Worker=0, Episode=14171, Total reward=-0.61, Steps=42543, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=14172, Total reward=-0.57, Steps=42546, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=14173, Total reward=-0.61, Steps=42549, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=14174, Total reward=-0.43, Steps=42552, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=14175, Total reward=-0.53, Steps=42555, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=14176, Total reward=-0.67, Steps=42558, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=14177, Total reward=-0.58, Steps=42561, Training iteration=16
Training> Name=main_level/agent, Worker=0, Episode=14178, Total reward=-0.58, Steps=42564, Training iteration=16

Training> Name=main_level/agent, Worker=0, Episode=14322, Total reward=-0.69, Steps=42998, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14323, Total reward=-0.53, Steps=43001, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14324, Total reward=-0.5, Steps=43004, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14325, Total reward=-0.57, Steps=43007, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14326, Total reward=-0.51, Steps=43010, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14327, Total reward=-0.49, Steps=43013, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14328, Total reward=-0.56, Steps=43016, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14329, Total reward=-0.54, Steps=43019, Training iteration=17

Checkpoint> Saving in path=['/opt/ml/output/data/checkpoint/80_Step-33347.ckpt']
Training> Name=main_level/agent, Worker=0, Episode=14440, Total reward=-0.49, Steps=43352, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14441, Total reward=-0.41, Steps=43355, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14442, Total reward=-0.57, Steps=43358, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14443, Total reward=-0.66, Steps=43361, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14444, Total reward=-0.64, Steps=43364, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14445, Total reward=-0.54, Steps=43367, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14446, Total reward=-0.42, Steps=43370, Training iteration=17

Training> Name=main_level/agent, Worker=0, Episode=14556, Total reward=-0.46, Steps=43700, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14557, Total reward=-0.41, Steps=43703, Training iteration=17
Checkpoint> Saving in path=['/opt/ml/output/data/checkpoint/81_Step-33701.ckpt']
Training> Name=main_level/agent, Worker=0, Episode=14558, Total reward=-0.4, Steps=43706, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14559, Total reward=-0.57, Steps=43709, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14560, Total reward=-0.64, Steps=43712, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14561, Total reward=-0.69, Steps=43715, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14562, Total reward=-0.69, Steps=43718, Training iteration=17

Training> Name=main_level/agent, Worker=0, Episode=14672, Total reward=-0.61, Steps=44048, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14673, Total reward=-0.66, Steps=44051, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14674, Total reward=-0.49, Steps=44054, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14675, Total reward=-0.51, Steps=44057, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14676, Total reward=-0.5, Steps=44060, Training iteration=17
Checkpoint> Saving in path=['/opt/ml/output/data/checkpoint/82_Step-34058.ckpt']
Training> Name=main_level/agent, Worker=0, Episode=14677, Total reward=-0.56, Steps=44063, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14678, Total reward=-0.48, Steps=44066, Training iteration=17

Training> Name=main_level/agent, Worker=0, Episode=14787, Total reward=-0.65, Steps=44393, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14788, Total reward=-0.67, Steps=44396, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14789, Total reward=-0.69, Steps=44399, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14790, Total reward=-0.57, Steps=44402, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14791, Total reward=-0.5, Steps=44405, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14792, Total reward=-0.69, Steps=44408, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14793, Total reward=-0.64, Steps=44411, Training iteration=17
Checkpoint> Saving in path=['/opt/ml/output/data/checkpoint/83_Step-34409.ckpt']

Training> Name=main_level/agent, Worker=0, Episode=14902, Total reward=-0.72, Steps=44738, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14903, Total reward=-0.46, Steps=44741, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14904, Total reward=-0.68, Steps=44744, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14905, Total reward=-0.45, Steps=44747, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14906, Total reward=-0.56, Steps=44750, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14907, Total reward=-0.4, Steps=44753, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14908, Total reward=-0.57, Steps=44756, Training iteration=17
Training> Name=main_level/agent, Worker=0, Episode=14909, Total reward=-0.41, Steps=44759, Training iteration=17

Training> Name=main_level/agent, Worker=0, Episode=14996, Total reward=-0.52, Steps=45022, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=14997, Total reward=-0.61, Steps=45025, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=14998, Total reward=-0.62, Steps=45028, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=14999, Total reward=-0.45, Steps=45031, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15000, Total reward=-0.43, Steps=45034, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15001, Total reward=-0.6, Steps=45037, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15002, Total reward=-0.57, Steps=45040, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15003, Total reward=-0.64, Steps=45043, Training iteration=18

Training> Name=main_level/agent, Worker=0, Episode=15113, Total reward=-0.54, Steps=45373, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15114, Total reward=-0.7, Steps=45376, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15115, Total reward=-0.55, Steps=45379, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15116, Total reward=-0.46, Steps=45382, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15117, Total reward=-0.54, Steps=45385, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15118, Total reward=-0.49, Steps=45388, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15119, Total reward=-0.49, Steps=45391, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15120, Total reward=-0.66, Steps=45394, Training iteration=18

Training> Name=main_level/agent, Worker=0, Episode=15228, Total reward=-0.67, Steps=45718, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15229, Total reward=-0.58, Steps=45721, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15230, Total reward=-0.5, Steps=45724, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15231, Total reward=-0.65, Steps=45727, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15232, Total reward=-0.43, Steps=45730, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15233, Total reward=-0.43, Steps=45733, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15234, Total reward=-0.54, Steps=45736, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15235, Total reward=-0.65, Steps=45739, Training iteration=18

Training> Name=main_level/agent, Worker=0, Episode=15398, Total reward=-0.71, Steps=46228, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15399, Total reward=-0.55, Steps=46231, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15400, Total reward=-0.6, Steps=46234, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15401, Total reward=-0.56, Steps=46237, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15402, Total reward=-0.71, Steps=46240, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15403, Total reward=-0.44, Steps=46243, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15404, Total reward=-0.47, Steps=46246, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15405, Total reward=-0.48, Steps=46249, Training iteration=18

Training> Name=main_level/agent, Worker=0, Episode=15512, Total reward=-0.66, Steps=46570, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15513, Total reward=-0.66, Steps=46573, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15514, Total reward=-0.39, Steps=46576, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15515, Total reward=-0.52, Steps=46579, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15516, Total reward=-0.62, Steps=46582, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15517, Total reward=-0.53, Steps=46585, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15518, Total reward=-0.49, Steps=46588, Training iteration=18
Training> Name=main_level/agent, Worker=0, Episode=15519, Total reward=-0.59, Steps=46591, Training iteration=18

Policy training> Surrogate loss=-0.002802168019115925, KL divergence=0.004277006257325411, Entropy=0.9246556162834167, training epoch=6, learning_rate=0.0003
Policy training> Surrogate loss=-0.003149024210870266, KL divergence=0.0038310654927045107, Entropy=0.9245476126670837, training epoch=7, learning_rate=0.0003
Policy training> Surrogate loss=-0.006610292941331863, KL divergence=0.005210080184042454, Entropy=0.924480676651001, training epoch=8, learning_rate=0.0003
Policy training> Surrogate loss=-0.003178780199959874, KL divergence=0.003861798672005534, Entropy=0.9244409203529358, training epoch=9, learning_rate=0.0003
Training> Name=main_level/agent, Worker=0, Episode=15612, Total reward=-0.6, Steps=46872, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=15613, Total reward=-0.5, Steps=46875, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=15614, Total reward=-0.57, Steps=4

Training> Name=main_level/agent, Worker=0, Episode=15719, Total reward=-0.53, Steps=47193, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=15720, Total reward=-0.49, Steps=47196, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=15721, Total reward=-0.51, Steps=47199, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=15722, Total reward=-0.55, Steps=47202, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=15723, Total reward=-0.57, Steps=47205, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=15724, Total reward=-0.63, Steps=47208, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=15725, Total reward=-0.46, Steps=47211, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=15726, Total reward=-0.52, Steps=47214, Training iteration=19

Training> Name=main_level/agent, Worker=0, Episode=15830, Total reward=-0.6, Steps=47526, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=15831, Total reward=-0.45, Steps=47529, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=15832, Total reward=-0.61, Steps=47532, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=15833, Total reward=-0.68, Steps=47535, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=15834, Total reward=-0.6, Steps=47538, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=15835, Total reward=-0.58, Steps=47541, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=15836, Total reward=-0.49, Steps=47544, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=15837, Total reward=-0.44, Steps=47547, Training iteration=19

Training> Name=main_level/agent, Worker=0, Episode=15941, Total reward=-0.49, Steps=47859, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=15942, Total reward=-0.45, Steps=47862, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=15943, Total reward=-0.72, Steps=47865, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=15944, Total reward=-0.69, Steps=47868, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=15945, Total reward=-0.56, Steps=47871, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=15946, Total reward=-0.41, Steps=47874, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=15947, Total reward=-0.39, Steps=47877, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=15948, Total reward=-0.51, Steps=47880, Training iteration=19

Training> Name=main_level/agent, Worker=0, Episode=16055, Total reward=-0.49, Steps=48201, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=16056, Total reward=-0.49, Steps=48204, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=16057, Total reward=-0.66, Steps=48207, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=16058, Total reward=-0.47, Steps=48210, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=16059, Total reward=-0.46, Steps=48213, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=16060, Total reward=-0.57, Steps=48216, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=16061, Total reward=-0.46, Steps=48219, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=16062, Total reward=-0.5, Steps=48222, Training iteration=19

Training> Name=main_level/agent, Worker=0, Episode=16166, Total reward=-0.59, Steps=48534, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=16167, Total reward=-0.48, Steps=48537, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=16168, Total reward=-0.43, Steps=48540, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=16169, Total reward=-0.47, Steps=48543, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=16170, Total reward=-0.42, Steps=48546, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=16171, Total reward=-0.62, Steps=48549, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=16172, Total reward=-0.64, Steps=48552, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=16173, Total reward=-0.68, Steps=48555, Training iteration=19

Training> Name=main_level/agent, Worker=0, Episode=16278, Total reward=-0.48, Steps=48870, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=16279, Total reward=-0.7, Steps=48873, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=16280, Total reward=-0.5, Steps=48876, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=16281, Total reward=-0.62, Steps=48879, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=16282, Total reward=-0.62, Steps=48882, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=16283, Total reward=-0.44, Steps=48885, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=16284, Total reward=-0.55, Steps=48888, Training iteration=19
Training> Name=main_level/agent, Worker=0, Episode=16285, Total reward=-0.66, Steps=48891, Training iteration=19

Training> Name=main_level/agent, Worker=0, Episode=16369, Total reward=-0.52, Steps=49145, Training iteration=20
Training> Name=main_level/agent, Worker=0, Episode=16370, Total reward=-0.54, Steps=49148, Training iteration=20
Training> Name=main_level/agent, Worker=0, Episode=16371, Total reward=-0.47, Steps=49151, Training iteration=20
Training> Name=main_level/agent, Worker=0, Episode=16372, Total reward=-0.51, Steps=49154, Training iteration=20
Training> Name=main_level/agent, Worker=0, Episode=16373, Total reward=-0.7, Steps=49157, Training iteration=20
Training> Name=main_level/agent, Worker=0, Episode=16374, Total reward=-0.41, Steps=49160, Training iteration=20
Training> Name=main_level/agent, Worker=0, Episode=16375, Total reward=-0.58, Steps=49163, Training iteration=20
Training> Name=main_level/agent, Worker=0, Episode=16376, Total reward=-0.45, Steps=49166, Training iteration=20

Training> Name=main_level/agent, Worker=0, Episode=16481, Total reward=-0.44, Steps=49481, Training iteration=20
Training> Name=main_level/agent, Worker=0, Episode=16482, Total reward=-0.52, Steps=49484, Training iteration=20
Training> Name=main_level/agent, Worker=0, Episode=16483, Total reward=-0.56, Steps=49487, Training iteration=20
Training> Name=main_level/agent, Worker=0, Episode=16484, Total reward=-0.58, Steps=49490, Training iteration=20
Training> Name=main_level/agent, Worker=0, Episode=16485, Total reward=-0.51, Steps=49493, Training iteration=20
Training> Name=main_level/agent, Worker=0, Episode=16486, Total reward=-0.65, Steps=49496, Training iteration=20
Training> Name=main_level/agent, Worker=0, Episode=16487, Total reward=-0.57, Steps=49499, Training iteration=20
Training> Name=main_level/agent, Worker=0, Episode=16488, Total reward=-0.44, Steps=49502, Training iteration=20

Training> Name=main_level/agent, Worker=0, Episode=16649, Total reward=-0.68, Steps=49985, Training iteration=20
Training> Name=main_level/agent, Worker=0, Episode=16650, Total reward=-0.45, Steps=49988, Training iteration=20
Training> Name=main_level/agent, Worker=0, Episode=16651, Total reward=-0.64, Steps=49991, Training iteration=20
Training> Name=main_level/agent, Worker=0, Episode=16652, Total reward=-0.5, Steps=49994, Training iteration=20
Training> Name=main_level/agent, Worker=0, Episode=16653, Total reward=-0.69, Steps=49997, Training iteration=20
Training> Name=main_level/agent, Worker=0, Episode=16654, Total reward=-0.52, Steps=50000, Training iteration=20
Training> Name=main_level/agent, Worker=0, Episode=16655, Total reward=-0.44, Steps=50003, Training iteration=20
Training> Name=main_level/agent, Worker=0, Episode=16656, Total reward=-0.54, Steps=50006, Training iteration=20
[34mTraining> Name=main_

Training> Name=main_level/agent, Worker=0, Episode=16760, Total reward=-0.6, Steps=50318, Training iteration=20
Training> Name=main_level/agent, Worker=0, Episode=16761, Total reward=-0.48, Steps=50321, Training iteration=20
Training> Name=main_level/agent, Worker=0, Episode=16762, Total reward=-0.41, Steps=50324, Training iteration=20
Training> Name=main_level/agent, Worker=0, Episode=16763, Total reward=-0.52, Steps=50327, Training iteration=20
Training> Name=main_level/agent, Worker=0, Episode=16764, Total reward=-0.48, Steps=50330, Training iteration=20
Training> Name=main_level/agent, Worker=0, Episode=16765, Total reward=-0.41, Steps=50333, Training iteration=20
Training> Name=main_level/agent, Worker=0, Episode=16766, Total reward=-0.41, Steps=50336, Training iteration=20
Training> Name=main_level/agent, Worker=0, Episode=16767, Total reward=-0.55, Steps=50339, Training iteration=20
Training> Name=main_

Policy training> Surrogate loss=0.0009211614378727973, KL divergence=0.002152812434360385, Entropy=0.910165011882782, training epoch=0, learning_rate=0.0003
Policy training> Surrogate loss=-0.0008141179569065571, KL divergence=0.002277419902384281, Entropy=0.908646285533905, training epoch=1, learning_rate=0.0003
Policy training> Surrogate loss=-0.002903543645516038, KL divergence=0.0008970082271844149, Entropy=0.9068589806556702, training epoch=2, learning_rate=0.0003
Policy training> Surrogate loss=-0.003665210912004113, KL divergence=0.002556883729994297, Entropy=0.9054972529411316, training epoch=3, learning_rate=0.0003
Policy training> Surrogate loss=0.0016276357928290963, KL divergence=0.057227373123168945, Entropy=0.9014577269554138, training epoch=4, learning_rate=0.0003
Policy training> Surrogate loss=0.013527121394872665, KL divergence=0.05245048925280571, Entropy=0.8965607285499573, training epoch=5, learning_rate=0.0003


Training> Name=main_level/agent, Worker=0, Episode=17068, Total reward=-0.51, Steps=51244, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17069, Total reward=-0.75, Steps=51247, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17070, Total reward=-0.68, Steps=51250, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17071, Total reward=-0.49, Steps=51253, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17072, Total reward=-0.53, Steps=51256, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17073, Total reward=-0.54, Steps=51259, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17074, Total reward=-0.58, Steps=51262, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17075, Total reward=-0.46, Steps=51265, Training iteration=21
Training> Name=main

Training> Name=main_level/agent, Worker=0, Episode=17176, Total reward=-0.51, Steps=51568, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17177, Total reward=-0.5, Steps=51571, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17178, Total reward=-0.62, Steps=51574, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17179, Total reward=-0.52, Steps=51577, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17180, Total reward=-0.6, Steps=51580, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17181, Total reward=-0.49, Steps=51583, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17182, Total reward=-0.46, Steps=51586, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17183, Total reward=-0.54, Steps=51589, Training iteration=21
Training> Name=main_l

Training> Name=main_level/agent, Worker=0, Episode=17284, Total reward=-0.45, Steps=51892, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17285, Total reward=-0.43, Steps=51895, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17286, Total reward=-0.58, Steps=51898, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17287, Total reward=-0.5, Steps=51901, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17288, Total reward=-0.51, Steps=51904, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17289, Total reward=-0.61, Steps=51907, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17290, Total reward=-0.56, Steps=51910, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17291, Total reward=-0.46, Steps=51913, Training iteration=21
Training> Name=main_

Training> Name=main_level/agent, Worker=0, Episode=17392, Total reward=-0.47, Steps=52216, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17393, Total reward=-0.55, Steps=52219, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17394, Total reward=-0.5, Steps=52222, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17395, Total reward=-0.5, Steps=52225, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17396, Total reward=-0.59, Steps=52228, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17397, Total reward=-0.45, Steps=52231, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17398, Total reward=-0.55, Steps=52234, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17399, Total reward=-0.61, Steps=52237, Training iteration=21
Training> Name=main_l

Training> Name=main_level/agent, Worker=0, Episode=17499, Total reward=-0.58, Steps=52537, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17500, Total reward=-0.46, Steps=52540, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17501, Total reward=-0.61, Steps=52543, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17502, Total reward=-0.51, Steps=52546, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17503, Total reward=-0.45, Steps=52549, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17504, Total reward=-0.43, Steps=52552, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17505, Total reward=-0.5, Steps=52555, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17506, Total reward=-0.59, Steps=52558, Training iteration=21
Training> Name=main_

Training> Name=main_level/agent, Worker=0, Episode=17607, Total reward=-0.41, Steps=52861, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17608, Total reward=-0.52, Steps=52864, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17609, Total reward=-0.46, Steps=52867, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17610, Total reward=-0.47, Steps=52870, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17611, Total reward=-0.7, Steps=52873, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17612, Total reward=-0.47, Steps=52876, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17613, Total reward=-0.49, Steps=52879, Training iteration=21
Training> Name=main_level/agent, Worker=0, Episode=17614, Total reward=-0.43, Steps=52882, Training iteration=21
Training> Name=main_

Training> Name=main_level/agent, Worker=0, Episode=17694, Total reward=-0.44, Steps=53124, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=17695, Total reward=-0.47, Steps=53127, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=17696, Total reward=-0.54, Steps=53130, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=17697, Total reward=-0.5, Steps=53133, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=17698, Total reward=-0.55, Steps=53136, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=17699, Total reward=-0.6, Steps=53139, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=17700, Total reward=-0.48, Steps=53142, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=17701, Total reward=-0.42, Steps=53145, Training iteration=22
Training> Name=main_l

Training> Name=main_level/agent, Worker=0, Episode=17833, Total reward=-0.46, Steps=53541, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=17834, Total reward=-0.54, Steps=53544, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=17835, Total reward=-0.47, Steps=53547, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=17836, Total reward=-0.6, Steps=53550, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=17837, Total reward=-0.47, Steps=53553, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=17838, Total reward=-0.73, Steps=53556, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=17839, Total reward=-0.75, Steps=53559, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=17840, Total reward=-0.64, Steps=53562, Training iteration=22
Training> Name=main_

Training> Name=main_level/agent, Worker=0, Episode=17962, Total reward=-0.63, Steps=53928, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=17963, Total reward=-0.6, Steps=53931, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=17964, Total reward=-0.47, Steps=53934, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=17965, Total reward=-0.56, Steps=53937, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=17966, Total reward=-0.42, Steps=53940, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=17967, Total reward=-0.42, Steps=53943, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=17968, Total reward=-0.41, Steps=53946, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=17969, Total reward=-0.44, Steps=53949, Training iteration=22
Training> Name=main_

Training> Name=main_level/agent, Worker=0, Episode=18068, Total reward=-0.7, Steps=54246, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=18069, Total reward=-0.45, Steps=54249, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=18070, Total reward=-0.56, Steps=54252, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=18071, Total reward=-0.44, Steps=54255, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=18072, Total reward=-0.4, Steps=54258, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=18073, Total reward=-0.43, Steps=54261, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=18074, Total reward=-0.54, Steps=54264, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=18075, Total reward=-0.56, Steps=54267, Training iteration=22
Training> Name=main_l

Training> Name=main_level/agent, Worker=0, Episode=18174, Total reward=-0.62, Steps=54564, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=18175, Total reward=-0.54, Steps=54567, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=18176, Total reward=-0.52, Steps=54570, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=18177, Total reward=-0.43, Steps=54573, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=18178, Total reward=-0.59, Steps=54576, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=18179, Total reward=-0.53, Steps=54579, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=18180, Total reward=-0.59, Steps=54582, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=18181, Total reward=-0.46, Steps=54585, Training iteration=22
Training> Name=main

Training> Name=main_level/agent, Worker=0, Episode=18280, Total reward=-0.57, Steps=54882, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=18281, Total reward=-0.64, Steps=54885, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=18282, Total reward=-0.58, Steps=54888, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=18283, Total reward=-0.65, Steps=54891, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=18284, Total reward=-0.42, Steps=54894, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=18285, Total reward=-0.4, Steps=54897, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=18286, Total reward=-0.65, Steps=54900, Training iteration=22
Training> Name=main_level/agent, Worker=0, Episode=18287, Total reward=-0.55, Steps=54903, Training iteration=22
Training> Name=main_

Training> Name=main_level/agent, Worker=0, Episode=18364, Total reward=-0.64, Steps=55136, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18365, Total reward=-0.56, Steps=55139, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18366, Total reward=-0.58, Steps=55142, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18367, Total reward=-0.57, Steps=55145, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18368, Total reward=-0.5, Steps=55148, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18369, Total reward=-0.54, Steps=55151, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18370, Total reward=-0.59, Steps=55154, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18371, Total reward=-0.51, Steps=55157, Training iteration=23
Training> Name=main_

Training> Name=main_level/agent, Worker=0, Episode=18467, Total reward=-0.63, Steps=55445, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18468, Total reward=-0.48, Steps=55448, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18469, Total reward=-0.44, Steps=55451, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18470, Total reward=-0.53, Steps=55454, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18471, Total reward=-0.45, Steps=55457, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18472, Total reward=-0.61, Steps=55460, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18473, Total reward=-0.52, Steps=55463, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18474, Total reward=-0.6, Steps=55466, Training iteration=23
Training> Name=main_

Training> Name=main_level/agent, Worker=0, Episode=18570, Total reward=-0.43, Steps=55754, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18571, Total reward=-0.47, Steps=55757, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18572, Total reward=-0.57, Steps=55760, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18573, Total reward=-0.6, Steps=55763, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18574, Total reward=-0.5, Steps=55766, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18575, Total reward=-0.53, Steps=55769, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18576, Total reward=-0.57, Steps=55772, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18577, Total reward=-0.51, Steps=55775, Training iteration=23
Training> Name=main_l

Training> Name=main_level/agent, Worker=0, Episode=18672, Total reward=-0.54, Steps=56060, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18673, Total reward=-0.69, Steps=56063, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18674, Total reward=-0.5, Steps=56066, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18675, Total reward=-0.61, Steps=56069, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18676, Total reward=-0.69, Steps=56072, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18677, Total reward=-0.45, Steps=56075, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18678, Total reward=-0.65, Steps=56078, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18679, Total reward=-0.5, Steps=56081, Training iteration=23
Training> Name=main_l

Training> Name=main_level/agent, Worker=0, Episode=18773, Total reward=-0.55, Steps=56363, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18774, Total reward=-0.52, Steps=56366, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18775, Total reward=-0.43, Steps=56369, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18776, Total reward=-0.63, Steps=56372, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18777, Total reward=-0.64, Steps=56375, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18778, Total reward=-0.47, Steps=56378, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18779, Total reward=-0.67, Steps=56381, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18780, Total reward=-0.49, Steps=56384, Training iteration=23
Training> Name=main

Training> Name=main_level/agent, Worker=0, Episode=18876, Total reward=-0.47, Steps=56672, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18877, Total reward=-0.53, Steps=56675, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18878, Total reward=-0.54, Steps=56678, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18879, Total reward=-0.5, Steps=56681, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18880, Total reward=-0.47, Steps=56684, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18881, Total reward=-0.51, Steps=56687, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18882, Total reward=-0.63, Steps=56690, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18883, Total reward=-0.66, Steps=56693, Training iteration=23
Training> Name=main_

Training> Name=main_level/agent, Worker=0, Episode=18980, Total reward=-0.45, Steps=56984, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18981, Total reward=-0.5, Steps=56987, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18982, Total reward=-0.54, Steps=56990, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18983, Total reward=-0.63, Steps=56993, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18984, Total reward=-0.62, Steps=56996, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18985, Total reward=-0.5, Steps=56999, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18986, Total reward=-0.46, Steps=57002, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=18987, Total reward=-0.52, Steps=57005, Training iteration=23
Training> Name=main_l

Training> Name=main_level/agent, Worker=0, Episode=19064, Total reward=-0.55, Steps=57238, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19065, Total reward=-0.7, Steps=57241, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19066, Total reward=-0.42, Steps=57244, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19067, Total reward=-0.55, Steps=57247, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19068, Total reward=-0.44, Steps=57250, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19069, Total reward=-0.48, Steps=57253, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19070, Total reward=-0.47, Steps=57256, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19071, Total reward=-0.42, Steps=57259, Training iteration=24
Training> Name=main_

Training> Name=main_level/agent, Worker=0, Episode=19216, Total reward=-0.45, Steps=57694, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19217, Total reward=-0.54, Steps=57697, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19218, Total reward=-0.63, Steps=57700, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19219, Total reward=-0.5, Steps=57703, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19220, Total reward=-0.46, Steps=57706, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19221, Total reward=-0.52, Steps=57709, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19222, Total reward=-0.55, Steps=57712, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19223, Total reward=-0.47, Steps=57715, Training iteration=24
Training> Name=main_

Checkpoint> Saving in path=['/opt/ml/output/data/checkpoint/125_Step-47998.ckpt']
Training> Name=main_level/agent, Worker=0, Episode=19319, Total reward=-0.47, Steps=58003, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19320, Total reward=-0.58, Steps=58006, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19321, Total reward=-0.5, Steps=58009, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19322, Total reward=-0.46, Steps=58012, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19323, Total reward=-0.71, Steps=58015, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19324, Total reward=-0.76, Steps=58018, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19325, Total reward=-0.43, Steps=58021, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=

Training> Name=main_level/agent, Worker=0, Episode=19422, Total reward=-0.47, Steps=58312, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19423, Total reward=-0.54, Steps=58315, Training iteration=24
Checkpoint> Saving in path=['/opt/ml/output/data/checkpoint/126_Step-48313.ckpt']
Training> Name=main_level/agent, Worker=0, Episode=19424, Total reward=-0.57, Steps=58318, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19425, Total reward=-0.49, Steps=58321, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19426, Total reward=-0.44, Steps=58324, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19427, Total reward=-0.62, Steps=58327, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19428, Total reward=-0.48, Steps=58330, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode

Training> Name=main_level/agent, Worker=0, Episode=19524, Total reward=-0.42, Steps=58618, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19525, Total reward=-0.59, Steps=58621, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19526, Total reward=-0.58, Steps=58624, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19527, Total reward=-0.43, Steps=58627, Training iteration=24
Checkpoint> Saving in path=['/opt/ml/output/data/checkpoint/127_Step-48625.ckpt']
Training> Name=main_level/agent, Worker=0, Episode=19528, Total reward=-0.52, Steps=58630, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19529, Total reward=-0.65, Steps=58633, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19530, Total reward=-0.56, Steps=58636, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode

Training> Name=main_level/agent, Worker=0, Episode=19627, Total reward=-0.6, Steps=58927, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19628, Total reward=-0.72, Steps=58930, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19629, Total reward=-0.74, Steps=58933, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19630, Total reward=-0.58, Steps=58936, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19631, Total reward=-0.46, Steps=58939, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=19632, Total reward=-0.49, Steps=58942, Training iteration=24
Checkpoint> Saving in path=['/opt/ml/output/data/checkpoint/128_Step-48940.ckpt']
Training> Name=main_level/agent, Worker=0, Episode=19633, Total reward=-0.69, Steps=58945, Training iteration=24
Training> Name=main_level/agent, Worker=0, Episode=

Training> Name=main_level/agent, Worker=0, Episode=19708, Total reward=-0.67, Steps=59172, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=19709, Total reward=-0.49, Steps=59175, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=19710, Total reward=-0.55, Steps=59178, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=19711, Total reward=-0.46, Steps=59181, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=19712, Total reward=-0.48, Steps=59184, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=19713, Total reward=-0.44, Steps=59187, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=19714, Total reward=-0.62, Steps=59190, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=19715, Total reward=-0.5, Steps=59193, Training iteration=25
Training> Name=main_

Training> Name=main_level/agent, Worker=0, Episode=19810, Total reward=-0.49, Steps=59478, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=19811, Total reward=-0.66, Steps=59481, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=19812, Total reward=-0.53, Steps=59484, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=19813, Total reward=-0.63, Steps=59487, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=19814, Total reward=-0.73, Steps=59490, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=19815, Total reward=-0.69, Steps=59493, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=19816, Total reward=-0.75, Steps=59496, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=19817, Total reward=-0.43, Steps=59499, Training iteration=25
Training> Name=main

Training> Name=main_level/agent, Worker=0, Episode=19912, Total reward=-0.54, Steps=59784, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=19913, Total reward=-0.66, Steps=59787, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=19914, Total reward=-0.67, Steps=59790, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=19915, Total reward=-0.57, Steps=59793, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=19916, Total reward=-0.47, Steps=59796, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=19917, Total reward=-0.41, Steps=59799, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=19918, Total reward=-0.68, Steps=59802, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=19919, Total reward=-0.41, Steps=59805, Training iteration=25
Training> Name=main

Training> Name=main_level/agent, Worker=0, Episode=20012, Total reward=-0.52, Steps=60084, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=20013, Total reward=-0.48, Steps=60087, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=20014, Total reward=-0.41, Steps=60090, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=20015, Total reward=-0.4, Steps=60093, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=20016, Total reward=-0.45, Steps=60096, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=20017, Total reward=-0.5, Steps=60099, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=20018, Total reward=-0.39, Steps=60102, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=20019, Total reward=-0.47, Steps=60105, Training iteration=25
Training> Name=main_l

Training> Name=main_level/agent, Worker=0, Episode=20160, Total reward=-0.49, Steps=60528, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=20161, Total reward=-0.46, Steps=60531, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=20162, Total reward=-0.46, Steps=60534, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=20163, Total reward=-0.52, Steps=60537, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=20164, Total reward=-0.64, Steps=60540, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=20165, Total reward=-0.54, Steps=60543, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=20166, Total reward=-0.49, Steps=60546, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=20167, Total reward=-0.48, Steps=60549, Training iteration=25
Training> Name=main

Training> Name=main_level/agent, Worker=0, Episode=20261, Total reward=-0.46, Steps=60831, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=20262, Total reward=-0.39, Steps=60834, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=20263, Total reward=-0.51, Steps=60837, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=20264, Total reward=-0.44, Steps=60840, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=20265, Total reward=-0.64, Steps=60843, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=20266, Total reward=-0.52, Steps=60846, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=20267, Total reward=-0.61, Steps=60849, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=20268, Total reward=-0.56, Steps=60852, Training iteration=25
Training> Name=main

Training> Name=main_level/agent, Worker=0, Episode=20362, Total reward=-0.6, Steps=61134, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=20363, Total reward=-0.63, Steps=61137, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=20364, Total reward=-0.58, Steps=61140, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=20365, Total reward=-0.47, Steps=61143, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=20366, Total reward=-0.64, Steps=61146, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=20367, Total reward=-0.44, Steps=61149, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=20368, Total reward=-0.47, Steps=61152, Training iteration=25
Training> Name=main_level/agent, Worker=0, Episode=20369, Total reward=-0.61, Steps=61155, Training iteration=25
Training> Name=main_

## Store intermediate training output and model checkpoints 

The output from the training job above is stored either in a local directory (`local` mode) or on S3 (`SageMaker` mode).


In [11]:
%%time

job_name=estimator._current_job_name
print("Job name: {}".format(job_name))

s3_url = "s3://{}/{}".format(s3_bucket,job_name)

if local_mode:
    output_tar_key = "{}/output.tar.gz".format(job_name)
else:
    output_tar_key = "{}/output/output.tar.gz".format(job_name)

intermediate_folder_key = "{}/output/intermediate/".format(job_name)
output_url = "s3://{}/{}".format(s3_bucket, output_tar_key)
intermediate_url = "s3://{}/{}".format(s3_bucket, intermediate_folder_key)

print("S3 job path: {}".format(s3_url))
print("Output.tar.gz location: {}".format(output_url))
print("Intermediate folder path: {}".format(intermediate_url))
    
tmp_dir = "/tmp/{}".format(job_name)
os.makedirs(tmp_dir, exist_ok=True)  # no error if the folder already exists
print("Create local folder {}".format(tmp_dir))

Job name: rl-macroeconomic-2020-05-13-12-06-48-952
S3 job path: s3://sagemaker-eu-central-1-415877977751/rl-macroeconomic-2020-05-13-12-06-48-952
Output.tar.gz location: s3://sagemaker-eu-central-1-415877977751/rl-macroeconomic-2020-05-13-12-06-48-952/output/output.tar.gz
Intermediate folder path: s3://sagemaker-eu-central-1-415877977751/rl-macroeconomic-2020-05-13-12-06-48-952/output/intermediate/
Create local folder /tmp/rl-macroeconomic-2020-05-13-12-06-48-952
CPU times: user 0 ns, sys: 8.17 ms, total: 8.17 ms
Wall time: 6.97 ms


In [12]:
%%time

wait_for_s3_object(s3_bucket, output_tar_key, tmp_dir)

# Fail early if the archive was not downloaded, then unpack it in place.
if not os.path.isfile("{}/output.tar.gz".format(tmp_dir)):
    raise FileNotFoundError("File output.tar.gz not found")
os.system("tar -xvzf {}/output.tar.gz -C {}".format(tmp_dir, tmp_dir))

# In SageMaker mode, the intermediate output is stored separately on S3.
if not local_mode:
    os.system("aws s3 cp --recursive {} {}".format(intermediate_url, tmp_dir))
print("Copied output files to {}".format(tmp_dir))

if local_mode:
    checkpoint_dir = "{}/data/checkpoint".format(tmp_dir)
    info_dir = "{}/data/".format(tmp_dir)
else:
    checkpoint_dir = "{}/checkpoint".format(tmp_dir)
    info_dir = "{}/".format(tmp_dir)

print("Checkpoint directory {}".format(checkpoint_dir))
print("info directory {}".format(info_dir))

Waiting for s3://sagemaker-eu-central-1-415877977751/rl-macroeconomic-2020-05-13-12-06-48-952/output/output.tar.gz...
Downloading rl-macroeconomic-2020-05-13-12-06-48-952/output/output.tar.gz
Copied output files to /tmp/rl-macroeconomic-2020-05-13-12-06-48-952
Checkpoint directory /tmp/rl-macroeconomic-2020-05-13-12-06-48-952/checkpoint
info directory /tmp/rl-macroeconomic-2020-05-13-12-06-48-952/
CPU times: user 209 ms, sys: 84.3 ms, total: 294 ms
Wall time: 3.97 s
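
The cell above shells out to `tar` through `os.system`, which returns the exit status instead of raising, so a failed extraction can go unnoticed. As an alternative, here is a minimal sketch that uses only the standard-library `tarfile` module. The helper name `extract_archive` and the demo file names are hypothetical and are not part of the notebook or its helper library.

```python
import os
import tarfile
import tempfile


def extract_archive(archive_path, destination):
    """Extract a .tar.gz archive, raising if it is missing or unreadable."""
    if not os.path.isfile(archive_path):
        raise FileNotFoundError("Archive not found: {}".format(archive_path))
    with tarfile.open(archive_path, "r:gz") as archive:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support extraction filters; "data" blocks unsafe members.
            archive.extractall(destination, filter="data")
        else:
            # Older versions: only extract archives you trust.
            archive.extractall(destination)
        return archive.getnames()


# Demonstration on a throwaway archive with made-up contents.
with tempfile.TemporaryDirectory() as work_dir:
    source_file = os.path.join(work_dir, "metrics.txt")
    with open(source_file, "w") as handle:
        handle.write("demo")
    archive_file = os.path.join(work_dir, "output.tar.gz")
    with tarfile.open(archive_file, "w:gz") as archive:
        archive.add(source_file, arcname="metrics.txt")
    target_dir = os.path.join(work_dir, "extracted")
    os.makedirs(target_dir)
    extracted_names = extract_archive(archive_file, target_dir)
    extracted_ok = os.path.isfile(os.path.join(target_dir, "metrics.txt"))
```

In the notebook, the equivalent call would presumably be `extract_archive("{}/output.tar.gz".format(tmp_dir), tmp_dir)`.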


## Visualization

### Plot rate of learning

We can view the rewards during training using the code below. This visualization shows how the model's performance, measured by the reward, changes over time. To keep the training time short, we restrict the number of episodes. If the final reward (average logarithmic cumulative return) is still below zero, try increasing the number of training steps, which can be configured in the preset file.

In [None]:
%matplotlib inline
import pandas as pd

csv_file_name = "worker_0.simple_rl_graph.main_level.main_level.agent_0.csv"
key = os.path.join(intermediate_folder_key, csv_file_name)
wait_for_s3_object(s3_bucket, key, tmp_dir)

csv_file = "{}/{}".format(tmp_dir, csv_file_name)
df = pd.read_csv(csv_file)
df = df.dropna(subset=['Training Reward'])
# print(list(df))
x_axis = 'Episode #'
y_axis = 'Training Reward'

# df.plot returns a matplotlib Axes, so name it `ax` rather than shadowing `plt`.
ax = df.plot(x=x_axis, y=y_axis, figsize=(12, 5), legend=True, style='b-')
ax.set_ylabel(y_axis);
ax.set_xlabel(x_axis);
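
Per-episode rewards are noisy, so the trend in the plot above can be hard to read. Below is a minimal sketch of a trailing moving average in plain Python. The function name `moving_average` and the example values are illustrative assumptions, not part of the notebook. With pandas, a comparable result can be obtained with `df[y_axis].rolling(window, min_periods=1).mean()`.

```python
from collections import deque


def moving_average(values, window):
    """Trailing mean over the last `window` values (fewer at the start)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)
    running_sum = 0.0
    smoothed = []
    for value in values:
        if len(recent) == window:
            # The oldest value is about to be evicted from the deque.
            running_sum -= recent[0]
        recent.append(value)
        running_sum += value
        smoothed.append(running_sum / len(recent))
    return smoothed


# Illustrative rewards only, not taken from the training log.
example_rewards = [-0.6, -0.4, -0.5, -0.3]
smoothed_rewards = moving_average(example_rewards, window=2)
```

A larger window gives a smoother curve but reacts more slowly to real changes in the reward.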

### Visualize the portfolio value

We use the result of the last evaluation phase as an example to visualize the portfolio value. The following figure plots the reward against the date. The Sharpe ratio and maximum drawdown are also calculated to help readers weigh the return of an investment against its risk.

In [None]:
info_dir

In [None]:
import pandas as pd
pd.set_option('display.max_rows', 300)
info = info_dir + 'macroeconomic.csv'
df_info = pd.read_csv(info)
# df_info[df_info['state'] == '(0.0, 2.05, 1.2446216995824309)']
df_info
#df_info['date'] = pd.to_datetime(df_info['date'], format='%Y-%m-%d')
#df_info.set_index('date', inplace=True)
#mdd = max_drawdown(df_info.rate_of_return + 1)
#sharpe_ratio = sharpe(df_info.rate_of_return)
#title = 'max_drawdown={: 2.2%} sharpe_ratio={: 2.4f}'.format(mdd, sharpe_ratio)
#df_info[["portfolio_value", "market_value"]].plot(title=title, fig=plt.gcf(), rot=30)
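
The commented-out lines in the cell above call `max_drawdown` and `sharpe`, whose definitions are not shown in this section. As a rough guide to what such metrics measure, here is a sketch based on common textbook definitions, assuming per-period simple returns and a per-period risk-free rate. The names `sharpe_ratio_sketch` and `max_drawdown_sketch`, the default of 252 periods per year, and the example returns are all assumptions, and the notebook's own helpers may use different conventions.

```python
import math
import statistics


def sharpe_ratio_sketch(returns, periods_per_year=252, risk_free_rate=0.0):
    """Annualized Sharpe ratio of per-period simple returns."""
    excess = [r - risk_free_rate for r in returns]
    spread = statistics.pstdev(excess)
    if spread == 0:
        raise ValueError("returns have zero variance; Sharpe ratio is undefined")
    return math.sqrt(periods_per_year) * statistics.mean(excess) / spread


def max_drawdown_sketch(returns):
    """Largest peak-to-trough loss of the compounded value, as a fraction <= 0."""
    value = 1.0
    peak = 1.0
    worst = 0.0
    for r in returns:
        value *= 1.0 + r          # compound the portfolio value
        peak = max(peak, value)   # highest value seen so far
        worst = min(worst, value / peak - 1.0)
    return worst


# Made-up returns for illustration: +10%, -50%, +20%.
example_returns = [0.10, -0.50, 0.20]
example_mdd = max_drawdown_sketch(example_returns)
example_sharpe = sharpe_ratio_sketch(example_returns, periods_per_year=1)
```

In this example the value peaks at 1.1 and then falls to 0.55, so the maximum drawdown is about -50%.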

## Load the checkpointed models for evaluation

Checkpointed data from the previously trained models is passed on for evaluation / inference in the `checkpoint` channel. In `local` mode, we can simply use the local directory, whereas in `SageMaker` mode, the files need to be copied to S3 first.

Because TensorFlow checkpoint files contain the absolute paths from when they were generated (see [issue](https://github.com/tensorflow/tensorflow/issues/9146)), we need to replace the absolute paths with relative paths. This is implemented within `evaluate-coach.py`.
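
How `evaluate-coach.py` performs this replacement is not shown here. As an illustration of the idea only, the sketch below rewrites the quoted paths in a TensorFlow-style `checkpoint` index file to their base names. The function name `make_checkpoint_paths_relative` and the demo paths are hypothetical.

```python
import os
import re
import tempfile

# Matches lines such as: model_checkpoint_path: "/abs/dir/model.ckpt-1"
_PATH_LINE = re.compile(r'^(\s*\w+:\s*")([^"]*)(")\s*$')


def make_checkpoint_paths_relative(index_file):
    """Rewrite every quoted path in a `checkpoint` index file to its base name."""
    with open(index_file) as handle:
        lines = handle.read().splitlines()
    rewritten = []
    for line in lines:
        match = _PATH_LINE.match(line)
        if match:
            prefix, path, suffix = match.groups()
            line = prefix + os.path.basename(path) + suffix
        rewritten.append(line)
    with open(index_file, "w") as handle:
        handle.write("\n".join(rewritten) + "\n")
    return rewritten


# Demonstration on a throwaway file with made-up paths.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_index = os.path.join(demo_dir, "checkpoint")
    with open(demo_index, "w") as handle:
        handle.write('model_checkpoint_path: "/opt/ml/output/checkpoint/model.ckpt-25"\n')
        handle.write('all_model_checkpoint_paths: "/opt/ml/output/checkpoint/model.ckpt-25"\n')
    demo_result = make_checkpoint_paths_relative(demo_index)
```

After the rewrite, the index file refers to files by name only, so the checkpoint directory can be moved to another location.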


In [None]:
%%time

if local_mode:
    checkpoint_path = 'file://{}'.format(checkpoint_dir)
    print("Local checkpoint file path: {}".format(checkpoint_path))
else:
    checkpoint_path = "s3://{}/{}/checkpoint/".format(s3_bucket, job_name)
    if not os.listdir(checkpoint_dir):
        raise FileNotFoundError("Checkpoint files not found under the path")
    os.system("aws s3 cp --recursive {} {}".format(checkpoint_dir, checkpoint_path))
    print("S3 checkpoint file path: {}".format(checkpoint_path))

### Run the evaluation step

Use the checkpointed model to run the evaluation step. 


In [None]:
%%time

estimator_eval = RLEstimator(role=role,
                      source_dir='src/',
                      dependencies=["common/sagemaker_rl"],
                      toolkit=RLToolkit.COACH,
                      toolkit_version='0.11.0',
                      framework=RLFramework.MXNET,
                      entry_point="evaluate-coach.py",
                      train_instance_count=1,
                      train_instance_type=instance_type,
                      hyperparameters = {
                          "evaluate_steps": 731*2 # evaluate on 2 episodes
                      }
                    )
estimator_eval.fit({'checkpoint': checkpoint_path})

## Risk Disclaimer (for live-trading)

This notebook is for educational purposes only. Past trading performance does not guarantee future performance. Trading losses can be substantial, and therefore **investors should use all trading strategies at their own risk**.