# Learning Tic-Tac-Toe with Reinforcement Learning
**_Train with SageMaker RL and evaluate interactively within the notebook_**

---


## Outline

1. [Overview](#Overview)
1. [Setup](#Setup)
1. [Code](#Code)
  1. [Environment](#Environment)
  1. [Preset](#Preset)
  1. [Launcher](#Launcher)
1. [Train](#Train)
1. [Deploy](#Deploy)
  1. [Inference](#Inference)
1. [Play](#Play)
1. [Wrap Up](#Wrap-Up)

---

## Overview

Tic-tac-toe is one of the first games children learn to play and was one of the [first computer games ever](https://en.wikipedia.org/wiki/OXO).  Optimal play through exhaustive search is relatively straightforward; however, approaching the game with a reinforcement learning agent can be educational.

This notebook shows how to train a reinforcement learning agent with SageMaker RL and then play against it locally and interactively within the notebook.  Unlike SageMaker local mode, this method does not require a Docker container to run locally; instead, it uses an endpoint integrated with a small Jupyter app.
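
To make the exhaustive-search remark concrete, here is a minimal minimax sketch. The tuple-based board encoding (+1, -1, 0) and the function names are illustrative assumptions, not code from this notebook's `./src` directory.

```python
from functools import lru_cache

# All eight winning lines on a board indexed 0..8 in row-major order.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return +1 or -1 if that player has three in a row, else 0."""
    for a, b, c in LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0


@lru_cache(maxsize=None)
def minimax(board, player):
    """Game value from player +1's point of view, with `player` to move."""
    won = winner(board)
    if won != 0:
        return won
    if 0 not in board:
        return 0  # full board with no winner: draw
    values = []
    for i in range(9):
        if board[i] == 0:
            child = board[:i] + (player,) + board[i + 1:]
            values.append(minimax(child, -player))
    # +1 maximizes the value, -1 minimizes it.
    return max(values) if player == 1 else min(values)


empty = (0,) * 9
print(minimax(empty, 1))  # 0: perfect play from both sides ends in a draw
```

The search visits only a few thousand distinct positions, which is why a lookup-style solution is easy here and why the learning approach is mainly of educational interest.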

---

## Setup

Let's start by defining our S3 bucket and IAM role.

In [1]:
import sagemaker

bucket = sagemaker.Session().default_bucket()
role = sagemaker.get_execution_role()

Let's import the libraries we'll use.

In [27]:
import os
import numpy as np
import sagemaker
from sagemaker.rl import RLEstimator, RLToolkit, RLFramework
from tic_tac_toe_game import TicTacToeGame

---

## Code

Our tic-tac-toe example requires three scripts to train our agent using SageMaker RL.  The scripts are placed in the `./src` directory, which is sent to the container when the SageMaker training job is initiated.

### Environment

For our tic-tac-toe use case we'll create a custom Gym environment.  This means we'll specify a Python class which inherits from `gym.Env` and implements two core methods, `reset()` and `step()`, alongside its constructor.  These provide the agent with its states, actions, and rewards for learning.  In more detail:

The `__init__()` method is called at the beginning of the SageMaker training job and:
1. Initializes the 3x3 tic-tac-toe board as a NumPy array of zeros
1. Prepares the state space as a flattened version of the board (length 9)
1. Defines a discrete action space with 9 possible options (one for each place on the board)
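
As a small illustration of the state and action encoding described above, the sketch below flattens a 3x3 board into a length-9 observation and maps a discrete action back to a board cell. The row-major ordering and the helper names are assumptions made for illustration; the environment's actual code may order cells differently.

```python
def flatten(board):
    """Turn a 3x3 board (a list of rows) into the length-9 observation."""
    return [cell for row in board for cell in row]


def action_to_cell(action):
    """Map a discrete action 0..8 to a (row, column) pair, assuming row-major order."""
    if not 0 <= action < 9:
        raise ValueError("action must be in 0..8")
    return divmod(action, 3)


board = [[0, 0, 0],
         [0, 1, 0],
         [0, 0, -1]]
print(flatten(board))     # [0, 0, 0, 0, 1, 0, 0, 0, -1]
print(action_to_cell(5))  # (1, 2)
```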

The `reset()` method is called at the beginning of each episode and:
1. Clears the 3x3 board (sets all values to 0)
1. Does some minor record-keeping for tracking across tic-tac-toe games

The `step()` method is called for each iteration in an episode and:
1. Updates the board with the action the agent chose given the previous state
1. Generates rewards based on performance
1. Automatically chooses the move for the agent's opponent if needed

Note:
* The opponent has not been programmed for perfect play.  If we trained our agent only against a perfect opponent, it would not generalize to games in which the opponent deviates from perfect play.
* If our agent selects an occupied space, it is given a minor penalty (-0.1) and asked to choose again.  Although the state doesn't change across these steps (meaning the network's predicted action distribution stays the same), randomness in the agent's action sampling should eventually result in a different action.  However, if the agent chooses an occupied space 10 times in a row, the game is forfeited.  Restricting the agent to available spaces would have required more substantial modification than was desired for this example.
* Other rewards only occur when a game is completed (+1 for win, 0 for draw, -1 for loss).
* The board is saved as a NumPy array where a value of +1 represents our agent's moves (`X`s) and a value of -1 represents the opponent's moves (`O`s).
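
The reward rules in these notes can be sketched as follows. This is a simplified, pure-Python illustration rather than the code in `./src/tic_tac_toe.py`: the flat row-major board, the function names, and the choice to score a forfeit as a loss (-1) are all assumptions. The opponent's reply, and the -1 reward when the opponent wins, would be handled after this function returns.

```python
# Winning lines on a flat board indexed 0..8 (row-major; an assumed layout).
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

WIN, DRAW, LOSS = 1.0, 0.0, -1.0
INVALID_PENALTY = -0.1    # penalty for picking an occupied space
MAX_INVALID_STREAK = 10   # consecutive invalid picks before a forfeit


def winner(cells):
    """Return +1 (agent) or -1 (opponent) if either has three in a row, else 0."""
    for a, b, c in LINES:
        if cells[a] != 0 and cells[a] == cells[b] == cells[c]:
            return cells[a]
    return 0


def score_agent_move(cells, action, invalid_streak):
    """Score one agent action; returns (cells, reward, done, invalid_streak)."""
    if cells[action] != 0:  # occupied space: board is unchanged
        invalid_streak += 1
        if invalid_streak >= MAX_INVALID_STREAK:
            # Forfeit; scoring it as a loss is an assumption.
            return cells, LOSS, True, invalid_streak
        return cells, INVALID_PENALTY, False, invalid_streak
    new_cells = cells[:action] + [1] + cells[action + 1:]
    if winner(new_cells) == 1:
        return new_cells, WIN, True, 0
    if 0 not in new_cells:
        return new_cells, DRAW, True, 0
    return new_cells, 0.0, False, 0  # game continues; opponent moves next


cells = [1, 1, 0, -1, -1, 0, 0, 0, 0]
print(score_agent_move(cells, 0, 0)[1:])  # (-0.1, False, 1)
print(score_agent_move(cells, 2, 1)[1:])  # (1.0, True, 0)
```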

In [3]:
!pygmentize ./src/tic_tac_toe.py

import gym
from gym import spaces
import numpy as np
import os
import time


class TicTacToeEnv(gym.Env):


    def __init__(self, opponent='moderate'):
        self.opponent = opponent
        self.episode = 0
        self.observation_space = spaces.Box(low=-1, high=1, shape=(9, ), dtype=np.int)
        self.action_space = spaces.Discrete(9)


    def reset(self):
        self.episode += 1

### Preset

The preset file specifies Coach parameters used by our reinforcement learning agent.  For this problem we'll use a [Clipped PPO algorithm](https://nervanasystems.github.io/coach/components/agents/policy_optimization/cppo.html).  We have kept the preset file deliberately spartan, deferring to defaults for most parameters, in order to focus on just the key components.  Performance of our agent could likely be improved with increased tuning.
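
For intuition about what Clipped PPO optimizes, the sketch below computes the standard per-sample clipped surrogate objective. It is a plain-Python illustration, not Coach's implementation; the function name is hypothetical, and `clip_eps=0.2` is a commonly used value assumed here rather than a setting read from the preset.

```python
import math


def clipped_surrogate(new_logp, old_logp, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate objective (a quantity to maximize)."""
    # Probability ratio between the updated policy and the data-collecting policy.
    ratio = math.exp(new_logp - old_logp)
    # Restrict the ratio to [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes any incentive to move the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


# Ratio 1.5 with a positive advantage is capped at 1.2 * advantage.
print(round(clipped_surrogate(math.log(1.5), 0.0, 1.0), 6))   # 1.2
# With a negative advantage the unclipped (more pessimistic) term is kept.
print(round(clipped_surrogate(math.log(1.5), 0.0, -1.0), 6))  # -1.5
```

Training code typically minimizes the negative of this objective averaged over a batch, which is consistent with a reported surrogate loss that is mostly slightly negative.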

In [4]:
!pygmentize ./src/preset.py

from rl_coach.agents.clipped_ppo_agent import ClippedPPOAgentParameters
from rl_coach.base_parameters import VisualizationParameters, PresetValidationParameters
from rl_coach.core_types import TrainingSteps, EnvironmentEpisodes, EnvironmentSteps
from rl_coach.environments.gym_environment import GymVectorEnvironment
from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
from rl_coach.graph_managers.graph_manager import ScheduleParameters

####################
# Graph Scheduling #
####################

schedule_params = ScheduleParameters()
schedule_params.improve_steps = TrainingSteps(50

### Launcher

The launcher is a script used by Amazon SageMaker to drive the training job on the SageMaker RL container.  We have kept it minimal, only specifying the name of the preset file to be used for the training job.

In [5]:
!pygmentize ./src/train-coach.py

from sagemaker_rl.coach_launcher import SageMakerCoachPresetLauncher


class MyLauncher(SageMakerCoachPresetLauncher):

    def default_preset_name(self):
        """This points to a .py file that configures everything about the RL job.
        It can be overridden at runtime by specifying the RLCOACH_PRESET hyperparameter.
        """
        return 'preset'


if __name__ == '__main__':
    MyLauncher.train_main()
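
As the launcher's docstring notes, the preset can be chosen at runtime through the `RLCOACH_PRESET` hyperparameter. A hedged sketch of such a hyperparameter dictionary, which would be passed as `hyperparameters=` when constructing the estimator, is shown below; naming the default `preset` module explicitly is only for illustration.

```python
# Hyperparameters for the training job (illustrative; only these two keys are shown).
hyperparameters = {
    "RLCOACH_PRESET": "preset",  # preset module in ./src, given without the .py suffix
    "save_model": 1,             # ask for the trained network to be exported
}
print(hyperparameters["RLCOACH_PRESET"])  # preset
```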


---

## Train

Now, let's kick off the training job in Amazon SageMaker.  This call can include hyperparameters that override values in `train-coach.py` or `preset.py`, but in our case, we've limited ourselves to defining:
1. The location of our agent code `./src` and dependencies in `common`.
1. Which RL toolkit and deep learning framework to use, here Coach with MXNet (SageMaker also supports [Ray RLlib](https://ray.readthedocs.io/en/latest/rllib.html) and Coach TensorFlow).
1. The IAM role granted permissions to our data in S3 and the ability to create SageMaker training jobs.
1. Training job hardware specifications (in this case just 1 ml.m4.xlarge instance).
1. Output path for our checkpoints and saved episodes.
1. A single hyperparameter specifying that we would like our agent's network to be output (in this case as an ONNX model).

In [6]:
estimator = RLEstimator(source_dir='src',
                        entry_point="train-coach.py",
                        dependencies=["common/sagemaker_rl"],
                        toolkit=RLToolkit.COACH,
                        toolkit_version='0.11.0',
                        framework=RLFramework.MXNET,
                        role=role,
                        train_instance_count=1,
                        train_instance_type='ml.m4.xlarge',
                        output_path='s3://{}/'.format(bucket),
                        base_job_name='DEMO-rl-tic-tac-toe',
                        hyperparameters={'save_model': 1})

estimator.fit()

2019-06-07 15:35:19 Starting - Starting the training job...
2019-06-07 15:35:21 Starting - Launching requested ML instances.........
2019-06-07 15:36:53 Starting - Preparing the instances for training......
2019-06-07 15:38:10 Downloading - Downloading input data
2019-06-07 15:38:10 Training - Downloading the training image..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2019-06-07 15:38:30,105 sagemaker-containers INFO     Imported framework sagemaker_mxnet_container.training
2019-06-07 15:38:30,109 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-06-07 15:38:30,123 sagemaker_mxnet_container.training INFO     MXNet training environment: {'SM_TRAINING_ENV': '{"additional_framework_parameters":{"sagemaker_estimator":"RLEstimator"},"channel_input_dirs":{},"current_host":"algo-1","framework_module":"sagemaker_mxnet_container.training:main","hosts":["a

Training> Name=main_level/agent, Worker=0, Episode=1, Total reward=-1.1, Steps=5, Training iteration=0
Training> Name=main_level/agent, Worker=0, Episode=2, Total reward=-1, Steps=8, Training iteration=0
Training> Name=main_level/agent, Worker=0, Episode=3, Total reward=-1.2, Steps=13, Training iteration=0
Training> Name=main_level/agent, Worker=0, Episode=4, Total reward=-1.1, Steps=17, Training iteration=0
Training> Name=main_level/agent, Worker=0, Episode=5, Total reward=-0.2, Steps=24, Training iteration=0
Training> Name=main_level/agent, Worker=0, Episode=6, Total reward=-0.5, Steps=34, Training iteration=0
Training> Name=main_level/agent, Worker=0, Episode=7, Total reward=-0.7, Steps=46, Training iteration=0
Training> Name=main_level/agent, Worker=0, Episode=8, Total reward=-1.1, Steps=51, Training iteration=0
Training> Name=main_level/agent, Worker=0, Episode=9, Total reward=0.1, Steps=64, Training iter

Training> Name=main_level/agent, Worker=0, Episode=173, Total reward=-1, Steps=1242, Training iteration=0
Training> Name=main_level/agent, Worker=0, Episode=174, Total reward=-0.6, Steps=1253, Training iteration=0
Training> Name=main_level/agent, Worker=0, Episode=175, Total reward=-1, Steps=1257, Training iteration=0
Training> Name=main_level/agent, Worker=0, Episode=176, Total reward=-2.1, Steps=1272, Training iteration=0
Training> Name=main_level/agent, Worker=0, Episode=177, Total reward=-0.4, Steps=1281, Training iteration=0
Training> Name=main_level/agent, Worker=0, Episode=178, Total reward=-1, Steps=1284, Training iteration=0
Training> Name=main_level/agent, Worker=0, Episode=179, Total reward=-1, Steps=1287, Training iteration=0
Training> Name=main_level/agent, Worker=0, Episode=180, Total reward=-2.1, Steps=1303, Training iteration=0
Training> Name=main_level/agent, Worker=0, Episode=181, Total rewar

Policy training> Surrogate loss=-0.016231467947363853, KL divergence=[0.], Entropy=[-0.0218709], training epoch=6, learning_rate=0.00025
Policy training> Surrogate loss=-0.01799607463181019, KL divergence=[0.], Entropy=[-0.02186245], training epoch=7, learning_rate=0.00025
Policy training> Surrogate loss=-0.018739134073257446, KL divergence=[0.], Entropy=[-0.02184578], training epoch=8, learning_rate=0.00025
Policy training> Surrogate loss=-0.01849059946835041, KL divergence=[0.], Entropy=[-0.02183821], training epoch=9, learning_rate=0.00025
Checkpoint> Saving in path=['/opt/ml/output/data/checkpoint/1_Step-2053.ckpt.main_level.agent.main.online', '/opt/ml/output/data/checkpoint/1_Step-2053.ckpt.main_level.agent.main.online.onnx']
Training> Name=main_level/agent, Worker=0, Episode=280, Total reward=-1.3, Steps=2060, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=281, Total reward=-0.8, Steps=2073, Trai


2019-06-07 15:39:01 Training - Training image download completed. Training in progress.
Training> Name=main_level/agent, Worker=0, Episode=411, Total reward=-1.3, Steps=3095, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=412, Total reward=-1, Steps=3099, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=413, Total reward=-1.1, Steps=3103, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=414, Total reward=-0.7, Steps=3115, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=415, Total reward=-1, Steps=3118, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=416, Total reward=-0.6, Steps=3129, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=417, Total reward=-1.4, Steps=3136, Training iteration=1
Training> Name=main_level/agent, Worker=0, Episode=418, Total reward=-1.1, Steps=3140, Train

Policy training> Surrogate loss=-0.006139806006103754, KL divergence=[0.], Entropy=[-0.02179379], training epoch=2, learning_rate=0.00025
Policy training> Surrogate loss=-0.009154296480119228, KL divergence=[0.], Entropy=[-0.02176896], training epoch=3, learning_rate=0.00025
Policy training> Surrogate loss=-0.010454955510795116, KL divergence=[0.], Entropy=[-0.02176537], training epoch=4, learning_rate=0.00025
Policy training> Surrogate loss=-0.011527474038302898, KL divergence=[0.], Entropy=[-0.02174321], training epoch=5, learning_rate=0.00025
Policy training> Surrogate loss=-0.011554984375834465, KL divergence=[0.], Entropy=[-0.02171823], training epoch=6, learning_rate=0.00025
Policy training> Surrogate loss=-0.011834641918540001, KL divergence=[0.], Entropy=[-0.0217067], training epoch=7, learning_rate=0.00025
Policy training> Surrogate loss=-0.012181682512164116, KL divergence=[0.], Entropy=[-0.02170207], training epoch=8

Training> Name=main_level/agent, Worker=0, Episode=643, Total reward=-1, Steps=4895, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=644, Total reward=-2.5, Steps=4915, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=645, Total reward=-0.6, Steps=4926, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=646, Total reward=-0.1, Steps=4932, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=647, Total reward=-2.3, Steps=4950, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=648, Total reward=-1.3, Steps=4956, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=649, Total reward=-1, Steps=4959, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=650, Total reward=-2.2, Steps=4976, Training iteration=2
Training> Name=main_level/agent, Worker=0, Episode=651, Total r

Policy training> Surrogate loss=0.006649320479482412, KL divergence=[0.], Entropy=[-0.02166798], training epoch=0, learning_rate=0.00025
Policy training> Surrogate loss=-0.00990191288292408, KL divergence=[0.], Entropy=[-0.02168251], training epoch=1, learning_rate=0.00025
Policy training> Surrogate loss=-0.014921552501618862, KL divergence=[0.], Entropy=[-0.02163666], training epoch=2, learning_rate=0.00025
Policy training> Surrogate loss=-0.014410367235541344, KL divergence=[0.], Entropy=[-0.02162141], training epoch=3, learning_rate=0.00025
Policy training> Surrogate loss=-0.012273123487830162, KL divergence=[0.], Entropy=[-0.02161769], training epoch=4, learning_rate=0.00025
Policy training> Surrogate loss=-0.01582665555179119, KL divergence=[0.], Entropy=[-0.02159612], training epoch=5, learning_rate=0.00025
Policy training> Surrogate loss=-0.011142424307763577, KL divergence=[0.], Entropy=[-0.02159632], training epoch=6,

Training> Name=main_level/agent, Worker=0, Episode=875, Total reward=-1.4, Steps=6765, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=876, Total reward=-1.1, Steps=6769, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=877, Total reward=-0.4, Steps=6778, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=878, Total reward=-1.1, Steps=6783, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=879, Total reward=-1, Steps=6786, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=880, Total reward=-1.3, Steps=6793, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=881, Total reward=-1.8, Steps=6816, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=882, Total reward=-1.1, Steps=6820, Training iteration=3
Training> Name=main_level/agent, Worker=0, Episode=883, Total

Policy training> Surrogate loss=-0.012167873792350292, KL divergence=[0.], Entropy=[-0.02146081], training epoch=5, learning_rate=0.00025
Policy training> Surrogate loss=-0.013184120878577232, KL divergence=[0.], Entropy=[-0.02146401], training epoch=6, learning_rate=0.00025
Policy training> Surrogate loss=-0.013755375519394875, KL divergence=[0.], Entropy=[-0.02143923], training epoch=7, learning_rate=0.00025
Policy training> Surrogate loss=-0.013607991859316826, KL divergence=[0.], Entropy=[-0.02142605], training epoch=8, learning_rate=0.00025
Policy training> Surrogate loss=-0.014061628840863705, KL divergence=[0.], Entropy=[-0.02141408], training epoch=9, learning_rate=0.00025
Checkpoint> Saving in path=['/opt/ml/output/data/checkpoint/7_Step-8210.ckpt.main_level.agent.main.online', '/opt/ml/output/data/checkpoint/7_Step-8210.ckpt.main_level.agent.main.online.onnx']
Training> Name=main_level/agent, Worker=0, Episode=1071, T

Training> Name=main_level/agent, Worker=0, Episode=1188, Total reward=-1.4, Steps=9147, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=1189, Total reward=-1.1, Steps=9163, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=1190, Total reward=-1.5, Steps=9171, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=1191, Total reward=-1.2, Steps=9177, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=1192, Total reward=-2.0, Steps=9192, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=1193, Total reward=-0.2, Steps=9199, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=1194, Total reward=-0.8, Steps=9212, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=1195, Total reward=-2.0, Steps=9226, Training iteration=4
Training> Name=main_level/agent, Worker=0, Episode=

Policy training> Surrogate loss=-0.01481508556753397, KL divergence=[0.], Entropy=[-0.021208], training epoch=2, learning_rate=0.00025
Policy training> Surrogate loss=-0.01102064736187458, KL divergence=[0.], Entropy=[-0.02123042], training epoch=3, learning_rate=0.00025
Policy training> Surrogate loss=-0.020161183550953865, KL divergence=[0.], Entropy=[-0.02118429], training epoch=4, learning_rate=0.00025
Policy training> Surrogate loss=-0.016819478943943977, KL divergence=[0.], Entropy=[-0.02117538], training epoch=5, learning_rate=0.00025
Policy training> Surrogate loss=-0.009511539712548256, KL divergence=[0.], Entropy=[-0.02116999], training epoch=6, learning_rate=0.00025
Policy training> Surrogate loss=-0.018424885347485542, KL divergence=[0.], Entropy=[-0.02116523], training epoch=7, learning_rate=0.00025
Policy training> Surrogate loss=-0.019215207546949387, KL divergence=[0.], Entropy=[-0.02118155], training epoch=8, l

Training> Name=main_level/agent, Worker=0, Episode=1416, Total reward=-0.3, Steps=11018, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=1417, Total reward=-1.1, Steps=11022, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=1418, Total reward=-1.3, Steps=11028, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=1419, Total reward=-1.5, Steps=11037, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=1420, Total reward=-1.3, Steps=11043, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=1421, Total reward=-1.2, Steps=11048, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=1422, Total reward=-1.1, Steps=11053, Training iteration=5
Training> Name=main_level/agent, Worker=0, Episode=1423, Total reward=-0.2, Steps=11060, Training iteration=5
Training> Name=main_level/agent, Worker=0,

Policy training> Surrogate loss=-0.00402833940461278, KL divergence=[0.], Entropy=[-0.02125546], training epoch=0, learning_rate=0.00025
Policy training> Surrogate loss=-0.012601017951965332, KL divergence=[0.], Entropy=[-0.02131982], training epoch=1, learning_rate=0.00025
Policy training> Surrogate loss=-0.01281595416367054, KL divergence=[0.], Entropy=[-0.02131479], training epoch=2, learning_rate=0.00025
Policy training> Surrogate loss=-0.019701676443219185, KL divergence=[0.], Entropy=[-0.02129945], training epoch=3, learning_rate=0.00025
Policy training> Surrogate loss=-0.019200138747692108, KL divergence=[0.], Entropy=[-0.02126358], training epoch=4, learning_rate=0.00025
Policy training> Surrogate loss=-0.020444797351956367, KL divergence=[0.], Entropy=[-0.02125837], training epoch=5, learning_rate=0.00025
Policy training> Surrogate loss=-0.018390145152807236, KL divergence=[0.], Entropy=[-0.02122474], training epoch=6,

Training> Name=main_level/agent, Worker=0, Episode=1642, Total reward=-1.1, Steps=12859, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=1643, Total reward=-0.4, Steps=12868, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=1644, Total reward=-1.3, Steps=12875, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=1645, Total reward=-1.8, Steps=12887, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=1646, Total reward=-2.0, Steps=12902, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=1647, Total reward=-1.4, Steps=12910, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=1648, Total reward=0.9, Steps=12915, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=1649, Total reward=-0.6, Steps=12926, Training iteration=6
Training> Name=main_level/agent, Worker=0, E

Training> Name=main_level/agent, Worker=0, Episode=1799, Total reward=-1.4, Steps=14120, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=1800, Total reward=-0.4, Steps=14129, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=1801, Total reward=-1, Steps=14132, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=1802, Total reward=-0.5, Steps=14142, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=1803, Total reward=-1.1, Steps=14146, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=1804, Total reward=-1.8, Steps=14158, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=1805, Total reward=-0.3, Steps=14166, Training iteration=6
Training> Name=main_level/agent, Worker=0, Episode=1806, Total reward=-1, Steps=14169, Training iteration=6
Training> Name=main_level/agent, Worker=0, Epis

Training> Name=main_level/agent, Worker=0, Episode=1864, Total reward=-2.0, Steps=14666, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=1865, Total reward=-1, Steps=14669, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=1866, Total reward=1, Steps=14673, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=1867, Total reward=-2.2, Steps=14690, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=1868, Total reward=-1.1, Steps=14694, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=1869, Total reward=-1.3, Steps=14700, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=1870, Total reward=-1.2, Steps=14706, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=1871, Total reward=-2.1, Steps=14722, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episo

Testing> Name=main_level/agent, Worker=0, Episode=2020, Total reward=-2.0, Steps=16000, Training iteration=7
Testing> Name=main_level/agent, Worker=0, Episode=2020, Total reward=-2.0, Steps=16000, Training iteration=7
Testing> Name=main_level/agent, Worker=0, Episode=2020, Total reward=-2.0, Steps=16000, Training iteration=7
Testing> Name=main_level/agent, Worker=0, Episode=2020, Total reward=-2.0, Steps=16000, Training iteration=7
Testing> Name=main_level/agent, Worker=0, Episode=2020, Total reward=-2.0, Steps=16000, Training iteration=7
## agent: Finished evaluation phase. Success rate = 0.0, Avg Total Reward = -2.0
Training> Name=main_level/agent, Worker=0, Episode=2021, Total reward=-1, Steps=16003, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=2022, Total reward=-0.9, Steps=16017, Training iteration=7
Training> Name=main_level/agent, Worker=0, Episode=2023, Total reward=-1.2, Ste

Training> Name=main_level/agent, Worker=0, Episode=2083, Total reward=-1.1, Steps=16455, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=2084, Total reward=-1.3, Steps=16461, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=2085, Total reward=-1.1, Steps=16465, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=2086, Total reward=-1, Steps=16468, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=2087, Total reward=-0.3, Steps=16476, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=2088, Total reward=-1, Steps=16479, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=2089, Total reward=-2.5, Steps=16499, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=2090, Total reward=-1.2, Steps=16504, Training iteration=8
Training> Name=main_level/agent, Worker=0, Epis

Training> Name=main_level/agent, Worker=0, Episode=2242, Total reward=-1.5, Steps=17766, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=2243, Total reward=-1, Steps=17769, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=2244, Total reward=-0.8, Steps=17782, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=2245, Total reward=-2.0, Steps=17797, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=2246, Total reward=0.9, Steps=17802, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=2247, Total reward=-1.1, Steps=17806, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=2248, Total reward=-2.3, Steps=17824, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episode=2249, Total reward=-1, Steps=17827, Training iteration=8
Training> Name=main_level/agent, Worker=0, Episo

Policy training> Surrogate loss=-0.013172081671655178, KL divergence=[0.], Entropy=[-0.02112224], training epoch=8, learning_rate=0.00025
Policy training> Surrogate loss=-0.018213534727692604, KL divergence=[0.], Entropy=[-0.02110823], training epoch=9, learning_rate=0.00025
Checkpoint> Saving in path=['/opt/ml/output/data/checkpoint/17_Step-18468.ckpt.main_level.agent.main.online', '/opt/ml/output/data/checkpoint/17_Step-18468.ckpt.main_level.agent.main.online.onnx']
Training> Name=main_level/agent, Worker=0, Episode=2320, Total reward=-1.7, Steps=18479, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=2321, Total reward=-0.4, Steps=18488, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=2322, Total reward=-1.2, Steps=18493, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=2323, Total reward=-1, Steps=18497, Training iteration=9
Training> Name=main_l

Training> Name=main_level/agent, Worker=0, Episode=2458, Total reward=-0.7, Steps=19614, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=2459, Total reward=-1.1, Steps=19618, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=2460, Total reward=-0.6, Steps=19629, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=2461, Total reward=-1.3, Steps=19635, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=2462, Total reward=-1.1, Steps=19640, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=2463, Total reward=-1.2, Steps=19645, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=2464, Total reward=0.3, Steps=19656, Training iteration=9
Training> Name=main_level/agent, Worker=0, Episode=2465, Total reward=-1, Steps=19659, Training iteration=9
Training> Name=main_level/agent, Worker=0, Epi

Policy training> Surrogate loss=-0.008461217395961285, KL divergence=[0.], Entropy=[-0.02112949], training epoch=3, learning_rate=0.00025
Policy training> Surrogate loss=-0.009390773251652718, KL divergence=[0.], Entropy=[-0.02114019], training epoch=4, learning_rate=0.00025
Policy training> Surrogate loss=-0.009794431738555431, KL divergence=[0.], Entropy=[-0.02109722], training epoch=5, learning_rate=0.00025
Policy training> Surrogate loss=-0.009543782100081444, KL divergence=[0.], Entropy=[-0.02112355], training epoch=6, learning_rate=0.00025
Policy training> Surrogate loss=-0.009771621786057949, KL divergence=[0.], Entropy=[-0.02112409], training epoch=7, learning_rate=0.00025
Policy training> Surrogate loss=-0.010077225975692272, KL divergence=[0.], Entropy=[-0.02110557], training epoch=8, learning_rate=0.00025
Policy training> Surrogate loss=-0.01098724827170372, KL divergence=[0.], Entropy=[-0.02112223], training epoch=9

Training> Name=main_level/agent, Worker=0, Episode=2676, Total reward=0.9, Steps=21204, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=2677, Total reward=-1, Steps=21207, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=2678, Total reward=-1.1, Steps=21211, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=2679, Total reward=0.9, Steps=21216, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=2680, Total reward=-1.1, Steps=21220, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=2681, Total reward=-1.1, Steps=21225, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=2682, Total reward=-1.1, Steps=21229, Training iteration=10
Training> Name=main_level/agent, Worker=0, Episode=2683, Total reward=-1.3, Steps=21235, Training iteration=10
Training> Name=main_level/agent, Worker

[31mTraining> Name=main_level/agent, Worker=0, Episode=2827, Total reward=-0.8, Steps=22377, Training iteration=10[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=2828, Total reward=0.3, Steps=22389, Training iteration=10[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=2829, Total reward=-1.5, Steps=22398, Training iteration=10[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=2830, Total reward=0.1, Steps=22411, Training iteration=10[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=2831, Total reward=-1.1, Steps=22416, Training iteration=10[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=2832, Total reward=-2.4, Steps=22435, Training iteration=10[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=2833, Total reward=-1.4, Steps=22442, Training iteration=10[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=2834, Total reward=0.1, Steps=22455, Training iteration=10[0m
[31mTraining> Name=main_level/agent, Worke

[31mTraining> Name=main_level/agent, Worker=0, Episode=2887, Total reward=-1.4, Steps=23011, Training iteration=11[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=2888, Total reward=-1.1, Steps=23015, Training iteration=11[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=2889, Total reward=-1.1, Steps=23019, Training iteration=11[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=2890, Total reward=-1.3, Steps=23025, Training iteration=11[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=2891, Total reward=-2.4, Steps=23044, Training iteration=11[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=2892, Total reward=0.5, Steps=23053, Training iteration=11[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=2893, Total reward=-1, Steps=23057, Training iteration=11[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=2894, Total reward=0.0, Steps=23072, Training iteration=11[0m
[31mTraining> Name=main_level/agent, Worker

[31mTraining> Name=main_level/agent, Worker=0, Episode=3032, Total reward=-0.3, Steps=24293, Training iteration=11[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3033, Total reward=-1.1, Steps=24297, Training iteration=11[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3034, Total reward=-2.0, Steps=24312, Training iteration=11[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3035, Total reward=-1, Steps=24315, Training iteration=11[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3036, Total reward=0.6, Steps=24323, Training iteration=11[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3037, Total reward=-1.1, Steps=24327, Training iteration=11[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3038, Total reward=-1.0, Steps=24342, Training iteration=11[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3039, Total reward=-2.2, Steps=24358, Training iteration=11[0m
[31mTraining> Name=main_level/agent, Worke

[31mTraining> Name=main_level/agent, Worker=0, Episode=3094, Total reward=-1.3, Steps=24823, Training iteration=12[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3095, Total reward=-2.5, Steps=24843, Training iteration=12[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3096, Total reward=-1.1, Steps=24847, Training iteration=12[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3097, Total reward=-1.4, Steps=24854, Training iteration=12[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3098, Total reward=-2.9, Steps=24878, Training iteration=12[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3099, Total reward=-0.8, Steps=24891, Training iteration=12[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3100, Total reward=-0.3, Steps=24899, Training iteration=12[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3101, Total reward=-1.1, Steps=24904, Training iteration=12[0m
[31mTraining> Name=main_level/agent, Wo

[31mTraining> Name=main_level/agent, Worker=0, Episode=3240, Total reward=-1.1, Steps=26090, Training iteration=12[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3241, Total reward=-1, Steps=26093, Training iteration=12[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3242, Total reward=-1.5, Steps=26101, Training iteration=12[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3243, Total reward=-1.1, Steps=26105, Training iteration=12[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3244, Total reward=-1.0, Steps=26120, Training iteration=12[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3245, Total reward=-1.1, Steps=26124, Training iteration=12[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3246, Total reward=-1, Steps=26127, Training iteration=12[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3247, Total reward=-1.9, Steps=26140, Training iteration=12[0m
[31mTraining> Name=main_level/agent, Worker

[31mPolicy training> Surrogate loss=-0.017740141600370407, KL divergence=[0.], Entropy=[-0.02055498], training epoch=9, learning_rate=0.00025[0m
[31mCheckpoint> Saving in path=['/opt/ml/output/data/checkpoint/25_Step-26685.ckpt.main_level.agent.main.online', '/opt/ml/output/data/checkpoint/25_Step-26685.ckpt.main_level.agent.main.online.onnx'][0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3310, Total reward=-1.1, Steps=26689, Training iteration=13[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3311, Total reward=-1.1, Steps=26694, Training iteration=13[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3312, Total reward=-1.4, Steps=26702, Training iteration=13[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3313, Total reward=-1, Steps=26706, Training iteration=13[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3314, Total reward=-2.8, Steps=26729, Training iteration=13[0m
[31mTraining> Name=main_level/agent, Worker=0, E

[31mTraining> Name=main_level/agent, Worker=0, Episode=3454, Total reward=-1, Steps=27904, Training iteration=13[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3455, Total reward=-0.3, Steps=27912, Training iteration=13[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3456, Total reward=-2.2, Steps=27929, Training iteration=13[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3457, Total reward=-1, Steps=27933, Training iteration=13[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3458, Total reward=-2.0, Steps=27947, Training iteration=13[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3459, Total reward=-1.4, Steps=27955, Training iteration=13[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3460, Total reward=-1.1, Steps=27959, Training iteration=13[0m
[31mCheckpoint> Saving in path=['/opt/ml/output/data/checkpoint/26_Step-27959.ckpt.main_level.agent.main.online', '/opt/ml/output/data/checkpoint/26_Step-27959.ckp

[31mPolicy training> Surrogate loss=-0.017668835818767548, KL divergence=[0.], Entropy=[-0.02066642], training epoch=4, learning_rate=0.00025[0m
[31mPolicy training> Surrogate loss=-0.01636495441198349, KL divergence=[0.], Entropy=[-0.02060301], training epoch=5, learning_rate=0.00025[0m
[31mPolicy training> Surrogate loss=-0.017108911648392677, KL divergence=[0.], Entropy=[-0.02063004], training epoch=6, learning_rate=0.00025[0m
[31mPolicy training> Surrogate loss=-0.01837288774549961, KL divergence=[0.], Entropy=[-0.02065354], training epoch=7, learning_rate=0.00025[0m
[31mPolicy training> Surrogate loss=-0.01626228541135788, KL divergence=[0.], Entropy=[-0.02060197], training epoch=8, learning_rate=0.00025[0m
[31mPolicy training> Surrogate loss=-0.01825937069952488, KL divergence=[0.], Entropy=[-0.02062425], training epoch=9, learning_rate=0.00025[0m
[31mCheckpoint> Saving in path=['/opt/ml/output/data/checkpoint/27_Step-28744.ckpt.main_level.agent.main.online', '/opt/m

[31mTraining> Name=main_level/agent, Worker=0, Episode=3663, Total reward=-1.3, Steps=29567, Training iteration=14[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3664, Total reward=-1.1, Steps=29571, Training iteration=14[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3665, Total reward=-1, Steps=29574, Training iteration=14[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3666, Total reward=1, Steps=29578, Training iteration=14[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3667, Total reward=-2.2, Steps=29595, Training iteration=14[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3668, Total reward=-2.3, Steps=29613, Training iteration=14[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3669, Total reward=-1.0, Steps=29628, Training iteration=14[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3670, Total reward=0.1, Steps=29641, Training iteration=14[0m
[31mTraining> Name=main_level/agent, Worker=0

[31mTraining> Name=main_level/agent, Worker=0, Episode=3809, Total reward=-1.1, Steps=30733, Training iteration=14[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3810, Total reward=-1, Steps=30736, Training iteration=14[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3811, Total reward=-1.3, Steps=30743, Training iteration=14[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3812, Total reward=-1.2, Steps=30748, Training iteration=14[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3813, Total reward=-2.2, Steps=30764, Training iteration=14[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3814, Total reward=0.4, Steps=30775, Training iteration=14[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3815, Total reward=-1, Steps=30778, Training iteration=14[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3816, Total reward=-1.3, Steps=30785, Training iteration=14[0m
[31mTraining> Name=main_level/agent, Worker=

[31mTraining> Name=main_level/agent, Worker=0, Episode=3869, Total reward=-1.4, Steps=31245, Training iteration=15[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3870, Total reward=-1.1, Steps=31250, Training iteration=15[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3871, Total reward=-2.0, Steps=31264, Training iteration=15[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3872, Total reward=-1.1, Steps=31280, Training iteration=15[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3873, Total reward=-1.3, Steps=31286, Training iteration=15[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3874, Total reward=-1.4, Steps=31293, Training iteration=15[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3875, Total reward=-2.3, Steps=31311, Training iteration=15[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=3876, Total reward=-2.6, Steps=31332, Training iteration=15[0m
[31mTraining> Name=main_level/agent, Wo

[31mTraining> Name=main_level/agent, Worker=0, Episode=4009, Total reward=-1.3, Steps=32457, Training iteration=15[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4010, Total reward=-2.3, Steps=32475, Training iteration=15[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4011, Total reward=-1.2, Steps=32480, Training iteration=15[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4012, Total reward=-0.9, Steps=32494, Training iteration=15[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4013, Total reward=0.9, Steps=32498, Training iteration=15[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4014, Total reward=-2.2, Steps=32515, Training iteration=15[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4015, Total reward=-0.5, Steps=32525, Training iteration=15[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4016, Total reward=-1.2, Steps=32530, Training iteration=15[0m
[31mTraining> Name=main_level/agent, Wor

[31mTraining> Name=main_level/agent, Worker=0, Episode=4071, Total reward=0.9, Steps=32955, Training iteration=16[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4072, Total reward=-0.4, Steps=32964, Training iteration=16[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4073, Total reward=-1.3, Steps=32970, Training iteration=16[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4074, Total reward=-2.5, Steps=32990, Training iteration=16[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4075, Total reward=-0.3, Steps=32998, Training iteration=16[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4076, Total reward=-1.3, Steps=33005, Training iteration=16[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4077, Total reward=-1, Steps=33008, Training iteration=16[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4078, Total reward=-0.6, Steps=33019, Training iteration=16[0m
[31mTraining> Name=main_level/agent, Worke

[31mTraining> Name=main_level/agent, Worker=0, Episode=4211, Total reward=-1.1, Steps=34115, Training iteration=16[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4212, Total reward=-1.2, Steps=34120, Training iteration=16[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4213, Total reward=-1, Steps=34123, Training iteration=16[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4214, Total reward=-2.1, Steps=34139, Training iteration=16[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4215, Total reward=-1.8, Steps=34151, Training iteration=16[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4216, Total reward=-1.4, Steps=34159, Training iteration=16[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4217, Total reward=-2.4, Steps=34178, Training iteration=16[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4218, Total reward=-1, Steps=34181, Training iteration=16[0m
[31mTraining> Name=main_level/agent, Worker

[31mPolicy training> Surrogate loss=-0.014726868830621243, KL divergence=[0.], Entropy=[-0.02040864], training epoch=6, learning_rate=0.00025[0m
[31mPolicy training> Surrogate loss=-0.014971758238971233, KL divergence=[0.], Entropy=[-0.02042085], training epoch=7, learning_rate=0.00025[0m
[31mPolicy training> Surrogate loss=-0.014140975661575794, KL divergence=[0.], Entropy=[-0.0203804], training epoch=8, learning_rate=0.00025[0m
[31mPolicy training> Surrogate loss=-0.014284862205386162, KL divergence=[0.], Entropy=[-0.02037345], training epoch=9, learning_rate=0.00025[0m
[31mCheckpoint> Saving in path=['/opt/ml/output/data/checkpoint/33_Step-34910.ckpt.main_level.agent.main.online', '/opt/ml/output/data/checkpoint/33_Step-34910.ckpt.main_level.agent.main.online.onnx'][0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4302, Total reward=0.3, Steps=34922, Training iteration=17[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4303, Total reward=-1.2, Steps=3

[31mTraining> Name=main_level/agent, Worker=0, Episode=4417, Total reward=-1.1, Steps=35873, Training iteration=17[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4418, Total reward=0.8, Steps=35879, Training iteration=17[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4419, Total reward=-1.1, Steps=35883, Training iteration=17[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4420, Total reward=-1, Steps=35886, Training iteration=17[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4421, Total reward=-1, Steps=35889, Training iteration=17[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4422, Total reward=-1, Steps=35892, Training iteration=17[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4423, Total reward=-1, Steps=35896, Training iteration=17[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4424, Total reward=-1.7, Steps=35906, Training iteration=17[0m
[31mTraining> Name=main_level/agent, Worker=0, E

[31mPolicy training> Surrogate loss=-0.0012222720542922616, KL divergence=[0.], Entropy=[-0.02048531], training epoch=0, learning_rate=0.00025[0m
[31mPolicy training> Surrogate loss=-0.006166974548250437, KL divergence=[0.], Entropy=[-0.02049991], training epoch=1, learning_rate=0.00025[0m
[31mPolicy training> Surrogate loss=-0.016840631142258644, KL divergence=[0.], Entropy=[-0.02049147], training epoch=2, learning_rate=0.00025[0m
[31mPolicy training> Surrogate loss=-0.0134761743247509, KL divergence=[0.], Entropy=[-0.02042898], training epoch=3, learning_rate=0.00025[0m
[31mPolicy training> Surrogate loss=-0.008649528957903385, KL divergence=[0.], Entropy=[-0.02045467], training epoch=4, learning_rate=0.00025[0m
[31mPolicy training> Surrogate loss=-0.008873945102095604, KL divergence=[0.], Entropy=[-0.02041368], training epoch=5, learning_rate=0.00025[0m
[31mPolicy training> Surrogate loss=-0.014790860936045647, KL divergence=[0.], Entropy=[-0.02038978], training epoch=6

[31mTraining> Name=main_level/agent, Worker=0, Episode=4619, Total reward=-1.8, Steps=37553, Training iteration=18[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4620, Total reward=-2.5, Steps=37573, Training iteration=18[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4621, Total reward=-1.2, Steps=37578, Training iteration=18[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4622, Total reward=-1.6, Steps=37587, Training iteration=18[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4623, Total reward=-1.3, Steps=37593, Training iteration=18[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4624, Total reward=-0.6, Steps=37604, Training iteration=18[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4625, Total reward=-2.1, Steps=37619, Training iteration=18[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4626, Total reward=-1.2, Steps=37624, Training iteration=18[0m
[31mTraining> Name=main_level/agent, Wo

[31mTraining> Name=main_level/agent, Worker=0, Episode=4758, Total reward=-1.3, Steps=38628, Training iteration=18[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4759, Total reward=-0.5, Steps=38638, Training iteration=18[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4760, Total reward=-1.4, Steps=38645, Training iteration=18[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4761, Total reward=1, Steps=38649, Training iteration=18[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4762, Total reward=-1.1, Steps=38653, Training iteration=18[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4763, Total reward=-1.0, Steps=38668, Training iteration=18[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4764, Total reward=-1, Steps=38671, Training iteration=18[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4765, Total reward=-0.7, Steps=38683, Training iteration=18[0m
[31mTraining> Name=main_level/agent, Worker=

[31mTraining> Name=main_level/agent, Worker=0, Episode=4815, Total reward=-1.4, Steps=39188, Training iteration=19[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4816, Total reward=-2.4, Steps=39207, Training iteration=19[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4817, Total reward=-1.1, Steps=39211, Training iteration=19[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4818, Total reward=-1.7, Steps=39222, Training iteration=19[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4819, Total reward=-0.3, Steps=39230, Training iteration=19[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4820, Total reward=-1.1, Steps=39234, Training iteration=19[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4821, Total reward=-1.9, Steps=39247, Training iteration=19[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4822, Total reward=-2.2, Steps=39263, Training iteration=19[0m
[31mTraining> Name=main_level/agent, Wo

[31mTraining> Name=main_level/agent, Worker=0, Episode=4950, Total reward=-1, Steps=40387, Training iteration=19[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4951, Total reward=-0.1, Steps=40393, Training iteration=19[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4952, Total reward=-1.3, Steps=40400, Training iteration=19[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4953, Total reward=-0.9, Steps=40414, Training iteration=19[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4954, Total reward=-1.2, Steps=40419, Training iteration=19[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4955, Total reward=-0.8, Steps=40432, Training iteration=19[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4956, Total reward=-0.7, Steps=40444, Training iteration=19[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=4957, Total reward=-1.0, Steps=40459, Training iteration=19[0m
[31mTraining> Name=main_level/agent, Work

[31mPolicy training> Surrogate loss=-0.015470999293029308, KL divergence=[0.], Entropy=[-0.02025846], training epoch=8, learning_rate=0.00025[0m
[31mPolicy training> Surrogate loss=-0.015366313979029655, KL divergence=[0.], Entropy=[-0.02029875], training epoch=9, learning_rate=0.00025[0m
[31mCheckpoint> Saving in path=['/opt/ml/output/data/checkpoint/39_Step-41075.ckpt.main_level.agent.main.online', '/opt/ml/output/data/checkpoint/39_Step-41075.ckpt.main_level.agent.main.online.onnx'][0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5024, Total reward=-1, Steps=41079, Training iteration=20[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5025, Total reward=-1.3, Steps=41085, Training iteration=20[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5026, Total reward=-3.1, Steps=41111, Training iteration=20[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5027, Total reward=-1.5, Steps=41120, Training iteration=20[0m
[31mTraining> Name=ma

[31mTraining> Name=main_level/agent, Worker=0, Episode=5146, Total reward=-1.6, Steps=42104, Training iteration=20[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5147, Total reward=-1.1, Steps=42109, Training iteration=20[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5148, Total reward=-2.2, Steps=42126, Training iteration=20[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5149, Total reward=-2.4, Steps=42145, Training iteration=20[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5150, Total reward=-0.5, Steps=42155, Training iteration=20[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5151, Total reward=-1, Steps=42158, Training iteration=20[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5152, Total reward=-0.2, Steps=42165, Training iteration=20[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5153, Total reward=-1, Steps=42168, Training iteration=20[0m
[31mTraining> Name=main_level/agent, Worker

[31mPolicy training> Surrogate loss=0.008609688840806484, KL divergence=[0.], Entropy=[-0.02019304], training epoch=0, learning_rate=0.00025[0m
[31mPolicy training> Surrogate loss=-0.017000550404191017, KL divergence=[0.], Entropy=[-0.02012397], training epoch=1, learning_rate=0.00025[0m
[31mPolicy training> Surrogate loss=-0.008341547101736069, KL divergence=[0.], Entropy=[-0.02010873], training epoch=2, learning_rate=0.00025[0m
[31mPolicy training> Surrogate loss=-0.024310817942023277, KL divergence=[0.], Entropy=[-0.02006064], training epoch=3, learning_rate=0.00025[0m
[31mPolicy training> Surrogate loss=-0.013124095275998116, KL divergence=[0.], Entropy=[-0.02000138], training epoch=4, learning_rate=0.00025[0m
[31mPolicy training> Surrogate loss=-0.00766280060634017, KL divergence=[0.], Entropy=[-0.02001655], training epoch=5, learning_rate=0.00025[0m
[31mPolicy training> Surrogate loss=-0.013710000552237034, KL divergence=[0.], Entropy=[-0.01994273], training epoch=6,

[31mTraining> Name=main_level/agent, Worker=0, Episode=5349, Total reward=-1.8, Steps=43669, Training iteration=21[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5350, Total reward=-1.1, Steps=43673, Training iteration=21[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5351, Total reward=-1.1, Steps=43677, Training iteration=21[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5352, Total reward=-1, Steps=43680, Training iteration=21[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5353, Total reward=-1.1, Steps=43684, Training iteration=21[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5354, Total reward=-1.2, Steps=43690, Training iteration=21[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5355, Total reward=-1, Steps=43693, Training iteration=21[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5356, Total reward=-1.1, Steps=43697, Training iteration=21[0m
[31mTraining> Name=main_level/agent, Worker

[31mTraining> Name=main_level/agent, Worker=0, Episode=5484, Total reward=-1.2, Steps=44782, Training iteration=21[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5485, Total reward=-1.3, Steps=44789, Training iteration=21[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5486, Total reward=-1.6, Steps=44798, Training iteration=21[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5487, Total reward=-1, Steps=44801, Training iteration=21[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5488, Total reward=-1.2, Steps=44806, Training iteration=21[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5489, Total reward=-1, Steps=44809, Training iteration=21[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5490, Total reward=-1.2, Steps=44815, Training iteration=21[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5491, Total reward=-0.3, Steps=44823, Training iteration=21[0m
[31mTraining> Name=main_level/agent, Worker

[31mTraining> Name=main_level/agent, Worker=0, Episode=5542, Total reward=-1.2, Steps=45226, Training iteration=22[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5543, Total reward=-1.3, Steps=45244, Training iteration=22[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5544, Total reward=-2.5, Steps=45264, Training iteration=22[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5545, Total reward=-1.1, Steps=45268, Training iteration=22[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5546, Total reward=-1.1, Steps=45272, Training iteration=22[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5547, Total reward=-2.1, Steps=45287, Training iteration=22[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5548, Total reward=0.9, Steps=45292, Training iteration=22[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5549, Total reward=-0.6, Steps=45303, Training iteration=22[0m
[31mTraining> Name=main_level/agent, Wor

[31mTraining> Name=main_level/agent, Worker=0, Episode=5666, Total reward=-1.1, Steps=46381, Training iteration=22[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5667, Total reward=-0.9, Steps=46395, Training iteration=22[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5668, Total reward=-3.1, Steps=46421, Training iteration=22[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5669, Total reward=-1.2, Steps=46426, Training iteration=22[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5670, Total reward=-1.5, Steps=46435, Training iteration=22[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5671, Total reward=-2.1, Steps=46451, Training iteration=22[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5672, Total reward=-1.1, Steps=46455, Training iteration=22[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5673, Total reward=-2.0, Steps=46470, Training iteration=22[0m
[31mTraining> Name=main_level/agent, Wo

[31mPolicy training> Surrogate loss=-0.013428492471575737, KL divergence=[0.], Entropy=[-0.01978944], training epoch=4, learning_rate=0.00025[0m
[31mPolicy training> Surrogate loss=-0.013824593275785446, KL divergence=[0.], Entropy=[-0.01980574], training epoch=5, learning_rate=0.00025[0m
[31mPolicy training> Surrogate loss=-0.01403997465968132, KL divergence=[0.], Entropy=[-0.0197537], training epoch=6, learning_rate=0.00025[0m
[31mPolicy training> Surrogate loss=-0.014880860224366188, KL divergence=[0.], Entropy=[-0.01974027], training epoch=7, learning_rate=0.00025[0m
[31mPolicy training> Surrogate loss=-0.013231315650045872, KL divergence=[0.], Entropy=[-0.01975147], training epoch=8, learning_rate=0.00025[0m
[31mPolicy training> Surrogate loss=-0.013218604028224945, KL divergence=[0.], Entropy=[-0.01968562], training epoch=9, learning_rate=0.00025[0m
[31mCheckpoint> Saving in path=['/opt/ml/output/data/checkpoint/45_Step-47229.ckpt.main_level.agent.main.online', '/opt

[31mTesting> Name=main_level/agent, Worker=0, Episode=5861, Total reward=-2.0, Steps=48000, Training iteration=23[0m
[31mTesting> Name=main_level/agent, Worker=0, Episode=5861, Total reward=-2.0, Steps=48000, Training iteration=23[0m
[31mTesting> Name=main_level/agent, Worker=0, Episode=5861, Total reward=1, Steps=48000, Training iteration=23[0m
[31m## agent: Finished evaluation phase. Success rate = 0.0, Avg Total Reward = -1.4[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5862, Total reward=-0.6, Steps=48011, Training iteration=23[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5863, Total reward=-1.3, Steps=48017, Training iteration=23[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5864, Total reward=-1, Steps=48020, Training iteration=23[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5865, Total reward=-0.9, Steps=48034, Training iteration=23[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5866, Total reward=-1.

[31mTraining> Name=main_level/agent, Worker=0, Episode=5995, Total reward=-0.1, Steps=49163, Training iteration=23[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5996, Total reward=-1.2, Steps=49168, Training iteration=23[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5997, Total reward=-0.1, Steps=49174, Training iteration=23[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5998, Total reward=-1.5, Steps=49182, Training iteration=23[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=5999, Total reward=-2.0, Steps=49197, Training iteration=23[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=6000, Total reward=-1.1, Steps=49201, Training iteration=23[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=6001, Total reward=-1, Steps=49205, Training iteration=23[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=6002, Total reward=-2.3, Steps=49223, Training iteration=23[0m
[31mTraining> Name=main_level/agent, Work


2019-06-07 15:47:46 Uploading - Uploading generated training model
[31mTraining> Name=main_level/agent, Worker=0, Episode=6052, Total reward=0.1, Steps=49673, Training iteration=24[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=6053, Total reward=-2.1, Steps=49689, Training iteration=24[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=6054, Total reward=-1.1, Steps=49705, Training iteration=24[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=6055, Total reward=-1.3, Steps=49712, Training iteration=24[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=6056, Total reward=-1.2, Steps=49718, Training iteration=24[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=6057, Total reward=-0.9, Steps=49732, Training iteration=24[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=6058, Total reward=-0.6, Steps=49743, Training iteration=24[0m
[31mTraining> Name=main_level/agent, Worker=0, Episode=6059, Total reward=-1.1, Steps=49748, 

---

## Deploy

Normally we would evaluate our agent by looking for reward convergence or monitoring performance across episodes.  Other SageMaker RL example notebooks cover this in detail.  We'll skip that in favor of a more tangible approach: playing against the trained agent ourselves.  To do that, we'll first deploy the agent to a real-time endpoint to get predictions.
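
As a rough illustration of the reward-convergence check mentioned above, the sketch below parses `Training>` lines in the format shown in the training log and computes a trailing mean of the total reward.  The helper names `parse_rewards` and `rolling_mean` and the window size are assumptions made for this example; they are not part of SageMaker RL.

```python
import re

# Matches complete lines such as:
# Training> Name=main_level/agent, Worker=0, Episode=5862, Total reward=-0.6, Steps=48011, ...
# The trailing comma is required so that truncated lines are skipped.
LINE_RE = re.compile(r"Training> .*?Episode=(\d+), Total reward=(-?\d+(?:\.\d+)?),")


def parse_rewards(log_text):
    """Return a list of (episode, total_reward) pairs from Training> lines."""
    pairs = []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            pairs.append((int(match.group(1)), float(match.group(2))))
    return pairs


def rolling_mean(values, window):
    """Trailing mean over at most the `window` most recent values."""
    means = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        means.append(sum(chunk) / len(chunk))
    return means


sample_log = """\
Training> Name=main_level/agent, Worker=0, Episode=5862, Total reward=-0.6, Steps=48011, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=5863, Total reward=-1.3, Steps=48017, Training iteration=23
Training> Name=main_level/agent, Worker=0, Episode=5864, Total reward=-1, Steps=48020, Training iteration=23
"""

rewards = [reward for _, reward in parse_rewards(sample_log)]
smoothed = rolling_mean(rewards, window=2)
```

A smoothed curve that flattens out over many episodes would suggest convergence; a three-line sample like this one is only enough to show the mechanics.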

### Inference

Our deployment code:
1. Unpacks the ONNX model output and prepares it for inference in `model_fn`
1. Generates predictions from our network, given state (a flattened tic-tac-toe board) in `transform_fn`
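
To make the "flattened board" idea concrete, here is a minimal sketch of how a 3x3 board could map to a 9-element state vector, and how a flat action index maps back to a row and column.  The cell encoding (1 for the agent, -1 for the opponent, 0 for empty) and the helper names are assumptions for illustration; the actual encoding is defined by the environment in `./src`.

```python
def flatten_board(board):
    """Flatten a 3x3 board (a list of rows) into a 9-element list, row by row."""
    return [cell for row in board for cell in row]


def action_to_cell(action):
    """Convert a flat action index (0-8) into a (row, column) pair."""
    return divmod(action, 3)


# Assumed encoding: 1 = agent, -1 = opponent, 0 = empty.
board = [
    [1, 0, -1],
    [0, 1, 0],
    [-1, 0, 0],
]
state = flatten_board(board)
```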

In [7]:
!pygmentize ./src/deploy-coach.py

import os
import json
import mxnet as mx
from mxnet.contrib import onnx as onnx_mxnet
from mxnet import gluon, nd
import numpy as np


def model_fn(model_dir):
    """
    Load the onnx model. Called once when hosting service starts.

    :param: model_dir The directory where model files are stored.
    :return: a model
    """
    onnx_path = os.path.join(model_dir, "model.onnx")
    ctx = mx.cpu() # todo: pass into function
    # lo

### Endpoint

Now we'll actually create a SageMaker endpoint to call for predictions.

*Note: this step could be replaced by importing the ONNX model into the notebook environment.*

In [8]:
predictor = estimator.deploy(initial_instance_count=1, 
                             instance_type='ml.m4.xlarge', 
                             entry_point='deploy-coach.py')

---------------------------------------------------------------------------------------!

---

## Play 

Let's play against our agent.  After running the cell below, just click on one of the boxes to make your move.  To restart the game, simply execute the cell again.

*This cell uses the `TicTacToeGame` class from the `tic_tac_toe_game.py` script to build an extremely basic tic-tac-toe app within a Jupyter notebook.  The opponent's moves are generated by invoking the `predictor` passed at initialization.  Please refer to the code for additional details.*

In [37]:
t = TicTacToeGame(predictor)
t.start()

VBox(children=(VBox(children=(HBox(children=(Button(layout=Layout(height='75px', width='75px'), style=ButtonSt…
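
The sketch below illustrates one way an opponent's move could be chosen from an endpoint response: score all nine cells, then take the highest-scoring empty one.  This is not the code in `tic_tac_toe_game.py`; `choose_move`, `StubPredictor`, and the assumption that `predict()` returns nine scores are hypothetical, and the stub stands in for the real `predictor` so the example runs without an endpoint.

```python
class StubPredictor:
    """Hypothetical stand-in for a deployed predictor; returns fixed scores."""

    def __init__(self, scores):
        self.scores = scores

    def predict(self, state):
        # A real endpoint would run the trained network on `state`.
        return list(self.scores)


def choose_move(predictor, state):
    """Return the index of the highest-scoring empty cell, or None if the board is full.

    `state` is a flat list of 9 cells in which 0 marks an empty cell.
    """
    scores = predictor.predict(state)
    legal = [i for i, cell in enumerate(state) if cell == 0]
    if not legal:
        return None
    return max(legal, key=lambda i: scores[i])


stub = StubPredictor([0.9, 0.1, 0.2, 0.3, 0.8, 0.1, 0.0, 0.4, 0.5])
state = [1, 0, -1, 0, 0, 0, -1, 0, 0]  # cell 0 is occupied, so its 0.9 score is ignored
move = choose_move(stub, state)
```

Restricting the choice to empty cells keeps an imperfectly trained agent from attempting an illegal move, whatever the network outputs.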

---

## Wrap Up

In this notebook we trained a reinforcement learning agent to play a simple game of tic-tac-toe, using a custom Gym environment.  It could be built upon to solve other problems or improved by:

- Training for more episodes
- Using a different reinforcement learning algorithm
- Tuning hyperparameters for improved performance
- Or how about a nice game of [global thermonuclear war](https://youtu.be/s93KC4AGKnY?t=41)?

Let's finish by cleaning up our endpoint to prevent any persistent costs.

In [38]:
predictor.delete_endpoint()