# Introduction

This notebook outlines the steps involved in building and deploying a Battlesnake model using Ray RLlib and TensorFlow on Amazon SageMaker.

Library versions currently in use: TensorFlow 2.1, Ray RLlib 0.8.2

The model is first trained using multi-agent PPO, and then deployed to a managed _TensorFlow Serving_ SageMaker endpoint that can be used for inference.

<br/>

**Note:** This notebook is a work in progress.

### Comments and Known Issues

* The CNN kernels in `cnn_tf.py` are currently fixed, so they will likely need adjusting if you specify a map size other than 11x11
  * Ideally, the kernels would adapt to the map size automatically
* The current MultiAgentBattlesnake environment uses 2 frames for each observation. If you only want to use one frame, you'll need to adjust the observation code in `ma_battlesnake.py` accordingly
* The original TF model export code in `ray_launcher.py` did not work with TF 2.1.
  * I switched to RLlib's `export_model()` method, which appears to work here
* RLlib's built-in `{'use_lstm': True}` model parameter, which wraps the CNN in an LSTM, worked for local training and inference but has not yet been tested with the SageMaker inference endpoint
* Regardless of the number of snakes in the gym, or which policy is 'best', only `policy_0` is currently exported as a TF model. Refer to `common/sagemaker_rl/tf_serving_utils.py` and the comment in the inference section below
* The Ray dashboard fails to start and logs errors during training, but this does not abort the training job
* Training emits many warnings; most appear to be benign, but they are annoying
* Both local-mode and SageMaker-based training and inference have been tested and appear to be working
    * Local-mode inference might generate some warnings but seems to work regardless
* GPU training/inference has not been tested
* Single-instance training has been tested. Distributed multi-instance RLlib training has not yet been tested.
* Although the hosted model is able to provide predictions, I haven't yet verified that the predictions are correct or useful.
* The default hyperparameters are unlikely to generate an impressive model. Modify the hyperparameters and rewards if you are hoping to see something cool.
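
To see why fixed kernels tie the network to one map size, the sketch below applies the standard convolution output-size formula, `out = (in + 2 * padding - kernel) // stride + 1`, layer by layer. The kernel stack used here is hypothetical and is not taken from `cnn_tf.py`; substitute the real values before drawing conclusions about a particular map size.

```python
def conv_out_size(in_size, kernel, stride=1, padding=0):
    """Spatial output size of one convolution layer along one axis."""
    return (in_size + 2 * padding - kernel) // stride + 1


def final_feature_size(map_size, layers):
    """Apply each (kernel, stride) pair in turn; fail if a kernel no longer fits."""
    size = map_size
    for kernel, stride in layers:
        if size < kernel:
            raise ValueError(f"kernel {kernel} does not fit an input of size {size}")
        size = conv_out_size(size, kernel, stride)
    return size


# Hypothetical kernel stack for illustration only (NOT the values in cnn_tf.py)
example_layers = [(5, 1), (4, 1), (4, 1)]

for map_size in (7, 11, 19):
    try:
        print(map_size, "->", final_feature_size(map_size, example_layers))
    except ValueError as err:
        print(map_size, "-> incompatible:", err)
```

With this illustrative stack, an 11x11 map shrinks to a 1x1 feature map, a larger map leaves a bigger feature map than the following layers may expect, and a smaller map cannot pass through the stack at all.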

In [1]:
import sagemaker
from sagemaker.rl import RLEstimator, RLToolkit
import boto3

In [2]:
sm_session = sagemaker.session.Session()
s3_bucket = sm_session.default_bucket()

s3_output_path = 's3://{}/'.format(s3_bucket)
print("S3 bucket path: {}".format(s3_output_path))

S3 bucket path: s3://sagemaker-us-west-2-599069043765/


In [3]:
job_name_prefix = 'battlesnake-rllib-ppo'

role = sagemaker.get_execution_role()
print(role)

arn:aws:iam::599069043765:role/service-role/AmazonSageMaker-ExecutionRole-20191203T172993


In [4]:
# Change local_mode to True if you want to do local training within this Notebook instance
# Otherwise, we'll spin-up a SageMaker training instance to handle the training

local_mode = False

if local_mode:
    instance_type = 'local'
else:
    instance_type = "ml.m5.4xlarge"
    
# If training locally, do some Docker housekeeping..
if local_mode:
    !/bin/bash ./common/setup.sh

In [5]:
# Specify the new TF v2.1 / Ray RLlib 0.8.2 container
#    Adjust 'cpu' or 'gpu' in the image name, as required
image_name = '462105765813.dkr.ecr.us-west-2.amazonaws.com/sagemaker-rl-ray-container:ray-0.8.2-tf-cpu-py36'

In [6]:
%%time

# Define and execute our training job
# Adjust hyperparameters and train_instance_count accordingly

metric_definitions = RLEstimator.default_metric_definitions(RLToolkit.RAY)
    
estimator = RLEstimator(entry_point="train-mabs.py",
                        source_dir='src',
                        dependencies=["common/sagemaker_rl", "common/battlesnake_gym"],
                        image_name=image_name,
                        role=role,
                        train_instance_type=instance_type,
                        train_instance_count=1,
                        output_path=s3_output_path,
                        base_job_name=job_name_prefix,
                        metric_definitions=metric_definitions,
                        hyperparameters={
                            # See train-mabs.py to add additional hyperparameters
                            # Also see ray_launcher.py for the rl.training.* hyperparameters
                            #
                            # number of training iterations
                            "num_iters": 10,
                            # number of snakes in the gym
                            "num_agents": 5,
                            # dimension of the gym. changing this could require changes to the CNN kernels
                            # in cnn_tf.py
                            "map_height": 11,
                        }
                    )

estimator.fit()

job_name = estimator.latest_training_job.job_name
print("Training job: %s" % job_name)

2020-03-20 22:01:36 Starting - Starting the training job...
2020-03-20 22:01:39 Starting - Launching requested ML instances...
2020-03-20 22:02:36 Starting - Preparing the instances for training......
2020-03-20 22:03:29 Downloading - Downloading input data...
2020-03-20 22:03:38 Training - Downloading the training image.....
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-03-20 22:04:42,024 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
2020-03-20 22:04:42,031 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-03-20 22:04:42,158 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-03-20 22:04:42,173 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-03-20 22:04:42,188 sagemaker-containers INFO     No GPUs detected (normal if no g

(pid=145)   obj = yaml.load(type_)

(pid=145) 2020-03-20 22:05:27,618 INFO trainable.py:178 -- _setup took 38.992 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
Result for PPO_MultiAgentBattlesnake-v1_cf3b6376:
  custom_metrics: {}
  date: 2020-03-20_22-06-34
  done: false
  episode_len_mean: 4.412625096227868
  episode_reward_max: 27.0
  episode_reward_mean: -1.9214780600461894
  episode_reward_min: -15.0
  episodes_this_iter: 2598
  episodes_total: 2598
  experiment_id: 6b777adc731a4f3c8323165ca3c20cb9
  experiment_tag: '0'
  hostname: ip-10-0-207-160.us-west-2.compute.internal
  info:
    grad_time_ms: 23809.142
    learner:
      policy_0:
        cur_kl_coeff: 0.20000000298023224
        cur_lr: 0.0005000000237487257
        entropy: 1.3616435527801514
        entropy_coeff: 0.0
        kl: 0.023356305435299873
        policy_loss: -0.06204693019390106
        total_loss: 2.0016069412231445
        vf_exp

Result for PPO_MultiAgentBattlesnake-v1_cf3b6376:
  custom_metrics: {}
  date: 2020-03-20_22-07-27
  done: false
  episode_len_mean: 5.8001009591115595
  episode_reward_max: 43.0
  episode_reward_mean: 2.2120141342756185
  episode_reward_min: -15.0
  episodes_this_iter: 1981
  episodes_total: 6981
  experiment_id: 6b777adc731a4f3c8323165ca3c20cb9
  experiment_tag: '0'
  hostname: ip-10-0-207-160.us-west-2.compute.internal
  info:
    grad_time_ms: 18322.211
    learner:
      policy_0:
        cur_kl_coeff: 0.44999998807907104
        cur_lr: 0.0005000000237487257
        entropy: 1.2594250440597534
        entropy_coeff: 0.0
        kl: 0.029049284756183624
        policy_loss: -0.10358740389347076
        total_loss: 3.2191340923309326
        vf_explained_var: 0.3344835937023163
        vf_loss: 3.3096489906311035
      policy_1:
        cur_kl_coeff: 0.44999998807907104
        cur_lr: 0.0005000000237487257
        entropy: 1.2532711029052734
        entropy_coeff: 0.0

Result for PPO_MultiAgentBattlesnake-v1_cf3b6376:
  custom_metrics: {}
  date: 2020-03-20_22-08-17
  done: false
  episode_len_mean: 9.664418212478921
  episode_reward_max: 73.0
  episode_reward_mean: 13.267284991568296
  episode_reward_min: -12.0
  episodes_this_iter: 1186
  episodes_total: 9697
  experiment_id: 6b777adc731a4f3c8323165ca3c20cb9
  experiment_tag: '0'
  hostname: ip-10-0-207-160.us-west-2.compute.internal
  info:
    grad_time_ms: 17044.031
    learner:
      policy_0:
        cur_kl_coeff: 1.0125000476837158
        cur_lr: 0.0005000000237487257
        entropy: 1.1744681596755981
        entropy_coeff: 0.0
        kl: 0.021839885041117668
        policy_loss: -0.09286774694919586
        total_loss: 5.419380187988281
        vf_explained_var: 0.4603818356990814
        vf_loss: 5.490135192871094
      policy_1:
        cur_kl_coeff: 1.0125000476837158
        cur_lr: 0.0005000000237487257
        entropy: 1.1476343870162964
        entropy_coeff: 0.0
        kl:

Result for PPO_MultiAgentBattlesnake-v1_cf3b6376:
  custom_metrics: {}
  date: 2020-03-20_22-09-07
  done: false
  episode_len_mean: 15.113606340819022
  episode_reward_max: 109.0
  episode_reward_mean: 28.878467635402906
  episode_reward_min: -8.0
  episodes_this_iter: 757
  episodes_total: 11391
  experiment_id: 6b777adc731a4f3c8323165ca3c20cb9
  experiment_tag: '0'
  hostname: ip-10-0-207-160.us-west-2.compute.internal
  info:
    grad_time_ms: 16425.453
    learner:
      policy_0:
        cur_kl_coeff: 1.5187499523162842
        cur_lr: 0.0005000000237487257
        entropy: 1.1271483898162842
        entropy_coeff: 0.0
        kl: 0.01589125208556652
        policy_loss: -0.07924960553646088
        total_loss: 7.6601433753967285
        vf_explained_var: 0.5912667512893677
        vf_loss: 7.7152581214904785
      policy_1:
        cur_kl_coeff: 1.5187499523162842
        cur_lr: 0.0005000000237487257
        entropy: 1.0806876420974731
        entropy_coeff: 0.0
        kl

Result for PPO_MultiAgentBattlesnake-v1_cf3b6376:
  custom_metrics: {}
  date: 2020-03-20_22-09-57
  done: false
  episode_len_mean: 21.741509433962264
  episode_reward_max: 172.0
  episode_reward_mean: 46.51320754716981
  episode_reward_min: -5.0
  episodes_this_iter: 530
  episodes_total: 12567
  experiment_id: 6b777adc731a4f3c8323165ca3c20cb9
  experiment_tag: '0'
  hostname: ip-10-0-207-160.us-west-2.compute.internal
  info:
    grad_time_ms: 16111.883
    learner:
      policy_0:
        cur_kl_coeff: 1.5187499523162842
        cur_lr: 0.0005000000237487257
        entropy: 1.0844929218292236
        entropy_coeff: 0.0
        kl: 0.014590064994990826
        policy_loss: -0.07125009596347809
        total_loss: 10.055526733398438
        vf_explained_var: 0.6800039410591125
        vf_loss: 10.104618072509766
      policy_1:
        cur_kl_coeff: 1.5187499523162842
        cur_lr: 0.0005000000237487257
        entropy: 1.0302314758300781
        entropy_coeff: 0.0
        kl

Saved the checkpoint file /opt/ml/output/intermediate/training/PPO_MultiAgentBattlesnake-v1_cf3b6376_0_2020-03-20_22-04-45q_xnsn0n/checkpoint_10/checkpoint-10 as /opt/ml/model/checkpoint
Saved the checkpoint file /opt/ml/output/intermediate/training/PPO_MultiAgentBattlesnake-v1_cf3b6376_0_2020-03-20_22-04-45q_xnsn0n/checkpoint_10/checkpoint-10.tune_metadata as /opt/ml/model/checkpoint.tune_metadata
2020-03-20 22:10:27,509 INFO trainer.py:420 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
2020-03-20 22:10:27,514 INFO trainer.py:580 -- Current log_level is ERROR. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
  obj = yaml.load(type_)
(pid=5403)   obj = yaml.load(type_)

In [7]:
# Where is the model stored in S3?
estimator.model_data

's3://sagemaker-us-west-2-599069043765/battlesnake-rllib-ppo-2020-03-20-22-01-36-294/output/model.tar.gz'
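
If you want to inspect the exported artifact yourself, the model URI first has to be split into a bucket name and an object key. The sketch below does that with the standard library only; `split_s3_uri` is a helper name introduced here for illustration and is not part of the SageMaker SDK.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# URI copied from the estimator.model_data output above
model_uri = ("s3://sagemaker-us-west-2-599069043765/"
             "battlesnake-rllib-ppo-2020-03-20-22-01-36-294/output/model.tar.gz")
bucket, key = split_s3_uri(model_uri)
print(bucket)
print(key)
```

The resulting pair could then be passed to `boto3.client('s3').download_file(bucket, key, 'model.tar.gz')`, assuming the notebook's execution role is allowed to read the bucket.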

In [8]:
from sagemaker.tensorflow.serving import Model

model = Model(model_data=estimator.model_data,
              role=role,
              framework_version='2.1.0',
             )

if local_mode:
    inf_instance_type = 'local'
else:
    inf_instance_type = "ml.t2.medium"

# Deploy an inference endpoint
predictor = model.deploy(initial_instance_count=1, instance_type=inf_instance_type)

-----------!

In [23]:
# Spoof an observation from a Battlesnake environment, and get the predicted action from the model
#
# This example uses a single observation for a 5-agent environment with an 11x11 map
# The last axis is 12 because the current MultiAgentEnv concatenates 2 frames
#   5 agent maps + 1 food map = 6 maps total    6 maps * 2 frames = 12
#
# Note: this prediction is for the first policy in the environment, "policy_0"
#   We need to fix this to export the 'best' policy, all policies, etc.
#   Also, the agent's policy number and its position within the observation *do* currently matter.
#   For example, if we export policy_4 for inference, we need to ensure that the agent's current
#   snake representation (during inference) is located at index 4 of the observations (food is index 0)

import numpy as np
from time import time

fake_obs = np.zeros(shape=(1,11,11,12), dtype=np.float32).tolist()

test_data = {"inputs": { 'observations': fake_obs,
                        'prev_action': -1,
                        'is_training': False,
                        'prev_reward': -1,
                        'seq_lens': -1
                       } }
before = time()
result = predictor.predict(test_data)
elapsed = time() - before

print("Raw inference results:")
for key in sorted(result['outputs'].keys()):
    print("  ", key, ": ", result['outputs'][key])

print()
print("Our model predicts that the next action to take is: action", result['outputs']['actions'][0])
print()
print("Inference took %.2f ms" % (elapsed*1000))

Raw inference results:
   action_logp :  [0.0]
   action_prob :  [1.0]
   actions :  [1]
   behaviour_logits :  [[-0.0531836264, 0.0837905332, -0.113031864, 0.0553472936]]
   vf_preds :  [-1.23689604]

Our model predicts that the next action to take is: action 1

Inference took 32.20 ms
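
The channel arithmetic from the inference cell, and the relationship between `behaviour_logits` and the returned action, can be reproduced offline without calling the endpoint. This is a minimal sketch: `build_request` and `softmax` are helper names introduced here, and it assumes `behaviour_logits` are unnormalized scores over four discrete actions. How each action index maps to a movement direction is defined by the gym and is not shown here.

```python
import numpy as np


def build_request(num_agents=5, map_size=11, num_frames=2):
    """Build an all-zero request with (num_agents + 1 food map) * num_frames channels."""
    channels = (num_agents + 1) * num_frames
    obs = np.zeros((1, map_size, map_size, channels), dtype=np.float32)
    return {"inputs": {"observations": obs.tolist(),
                       "prev_action": -1,
                       "is_training": False,
                       "prev_reward": -1,
                       "seq_lens": -1}}


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = np.asarray(logits, dtype=np.float64)
    shifted = shifted - shifted.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


request = build_request()
print(np.array(request["inputs"]["observations"]).shape)  # (1, 11, 11, 12)

# Logits copied from the sample inference output above
logits = [[-0.0531836264, 0.0837905332, -0.113031864, 0.0553472936]]
probs = softmax(logits)
print(probs.round(3))
print("greedy action:", int(np.argmax(probs, axis=-1)[0]))
```

The greedy choice here is index 1, which agrees with the `actions :  [1]` value in the sample output. Changing `num_agents` or `num_frames` changes the last axis of the observation, so the request shape must match whatever the exported policy was trained on.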


In [10]:
# Uncomment and run to delete the endpoint
# predictor.delete_endpoint()