# Introduction

This notebook outlines the steps involved in building and deploying a Battlesnake model using Ray RLlib and TensorFlow on Amazon SageMaker.

Library versions currently in use: TensorFlow 2.1, Ray RLlib 0.8.2.

The model is first trained using multi-agent PPO, and then deployed to a managed _TensorFlow Serving_ SageMaker endpoint that can be used for inference.

<br/>

**Note:** This notebook is a work in progress.

### Comments and Known Issues

* `cnn_tf.py` currently contains default CNN filters for map sizes ranging from 7x7 to 21x21. These default filter sizes can be overridden via the `conv_filters` config parameter.
* The current MultiAgentBattlesnake environment uses 2 frames for each observation. If you only want to use one frame, you'll need to adjust the observation code in `ma_battlesnake.py` accordingly.
* The original TF model export code in `ray_launcher.py` did not work for TF2.1.
  * I switched over to RLlib's `export_model()` method, which seems to be working here.
* RLlib's built-in `{'use_lstm': True}` model parameter, which wraps the CNN in an LSTM, was working for local training/inference but has not yet been tested with the SageMaker inference endpoint.
* Regardless of the number of snakes in the gym, or which policy is 'best', only `policy_0` is currently exported as a TF model. Refer to `common/sagemaker_rl/tf_serving_utils.py` and see the comment in the inference section, below.
* The Ray dashboard fails to start and reports errors during training, but this does not abort the training job.
* There are many warnings during training. Most appear to be benign, but they are annoying.
* Both local-mode and SageMaker-based training and inference have been tested, and appear to be working
    * local-mode inference might generate some warnings, but seems to work regardless
* GPU training has been tested
* GPU inference has not been tested
* Single-instance training has been tested. Distributed multi-instance RLlib training has not yet been tested.
* Although the hosted model is able to provide predictions, I haven't yet verified that the predictions are correct or useful.
* The default hyperparameters are unlikely to generate an impressive model. Modify the hyperparameters and rewards if you are hoping to see something cool.
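
As a rough illustration of the `conv_filters` override mentioned in the first item above, the sketch below assumes RLlib's usual filter format, a list of `[out_channels, [kernel_h, kernel_w], stride]` entries. It also assumes 'same' padding on every layer except the last and 'valid' padding on the last, which is how I understand RLlib's default vision network to behave. Whether `cnn_tf.py` follows the same convention is not confirmed here. The helper `conv_output_size` and the filter values are hypothetical; the helper only checks that a filter stack reduces a square map to 1x1.

```python
import math


def conv_output_size(map_size, conv_filters):
    """Return the spatial size left after applying conv_filters to a square map.

    Assumption: 'same' padding on all layers but the last, 'valid' on the last.
    """
    size = map_size
    last_index = len(conv_filters) - 1
    for i, (_out_channels, kernel, stride) in enumerate(conv_filters):
        if i < last_index:
            # 'same' padding: only the stride shrinks the map
            size = math.ceil(size / stride)
        else:
            # 'valid' padding: the kernel must fit inside the map
            size = math.ceil((size - kernel[0] + 1) / stride)
    return size


# Illustrative filters for a 21x21 observation (not the defaults from cnn_tf.py)
example_filters = [
    [16, [5, 5], 2],   # 21 -> 11
    [32, [3, 3], 2],   # 11 -> 6
    [256, [6, 6], 1],  # 6 -> 1
]

print(conv_output_size(21, example_filters))
```

If the result is not 1, the last kernel size probably needs to match the map size that reaches the final layer.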

In [None]:
import sagemaker
from sagemaker.rl import RLEstimator, RLToolkit
import boto3

In [None]:
sm_session = sagemaker.session.Session()
s3_bucket = sm_session.default_bucket()

s3_output_path = 's3://{}/'.format(s3_bucket)
print("S3 bucket path: {}".format(s3_output_path))

In [None]:
job_name_prefix = 'Battlesnake-job-rllib'

role = sagemaker.get_execution_role()
print(role)

In [None]:
# Set local_mode to True to train locally within this notebook instance.
# Otherwise, a SageMaker training instance is launched to handle the training.

local_mode = False

if local_mode:
    instance_type = 'local'
    # Local training needs some Docker housekeeping first
    !/bin/bash ./common/setup.sh
else:
    # Placeholder: substitute an actual SageMaker training instance type before running
    instance_type = "SAGEMAKER_TRAINING_INSTANCE_TYPE"

In [None]:
region = sm_session.boto_region_name
device = "cpu"
image_name = '462105765813.dkr.ecr.{region}.amazonaws.com/sagemaker-rl-ray-container:ray-0.8.2-tf-{device}-py36'.format(region=region, device=device)
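
The container image name above depends on both the region and the device. A minimal sketch of wrapping that template in a helper is shown here; `rl_image_name` is a hypothetical name, and the sketch assumes a `gpu` tag is published alongside the `cpu` one, which this notebook does not confirm.

```python
# Template copied from the cell above; only region and device vary
IMAGE_TEMPLATE = ('462105765813.dkr.ecr.{region}.amazonaws.com/'
                  'sagemaker-rl-ray-container:ray-0.8.2-tf-{device}-py36')


def rl_image_name(region, device="cpu"):
    """Build the training image name for a region and device ('cpu' or 'gpu')."""
    if device not in ("cpu", "gpu"):
        raise ValueError("device must be 'cpu' or 'gpu', got {!r}".format(device))
    return IMAGE_TEMPLATE.format(region=region, device=device)


# Example with an illustrative region
print(rl_image_name("us-west-2", "gpu"))
```

Rejecting unknown device strings up front avoids launching a training job that fails only when the image pull is attempted.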

In [None]:
%%time

# Define and execute our training job
# Adjust hyperparameters and train_instance_count accordingly

metric_definitions = RLEstimator.default_metric_definitions(RLToolkit.RAY)
    
estimator = RLEstimator(entry_point="train-mabs.py",
                        source_dir='rllib_src',
                        dependencies=["rllib_common/sagemaker_rl", "battlesnake_gym/"],
                        image_name=image_name,
                        role=role,
                        train_instance_type=instance_type,
                        train_instance_count=1,
                        output_path=s3_output_path,
                        base_job_name=job_name_prefix,
                        metric_definitions=metric_definitions,
                        hyperparameters={
                            # See train-mabs.py to add additional hyperparameters
                            # Also see ray_launcher.py for the rl.training.* hyperparameters
                            #
                            # number of training iterations
                            "num_iters": 30,
                            # number of snakes in the gym
                            "num_agents": 5,
                            # dimension of the gym; changing this could require changes to the CNN kernels
                            # in cnn_tf.py
                            "map_height": 15,
                            
                            # How the game state is represented. Options: ["flat-num", "bordered-num",
                            # "max-bordered-num", "flat-51s", "bordered-51s", "max-bordered-51s"]
                            "observation_type": "max-bordered-51s"
                        }
                    )

estimator.fit()

job_name = estimator.latest_training_job.job_name
print("Training job: %s" % job_name)
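
Because a mistyped hyperparameter only surfaces after the training instance has started, it can help to check the dictionary before calling `fit()`. The sketch below is an assumption-laden example: `check_hyperparameters` is a hypothetical helper, the observation types are copied from the comment in the cell above, and the positive-integer rule is my own guess at what `train-mabs.py` expects.

```python
# Observation types as listed in the hyperparameter comment above
OBSERVATION_TYPES = ["flat-num", "bordered-num", "max-bordered-num",
                     "flat-51s", "bordered-51s", "max-bordered-51s"]


def check_hyperparameters(hyperparameters):
    """Return a list of human-readable problems; an empty list means no problem found."""
    problems = []
    obs_type = hyperparameters.get("observation_type")
    if obs_type not in OBSERVATION_TYPES:
        problems.append("unknown observation_type: {!r}".format(obs_type))
    # Assumption: these three values must be positive integers
    for name in ("num_iters", "num_agents", "map_height"):
        value = hyperparameters.get(name)
        if not isinstance(value, int) or value < 1:
            problems.append("{} must be a positive integer, got {!r}".format(name, value))
    return problems


print(check_hyperparameters({"num_iters": 30, "num_agents": 5,
                             "map_height": 15,
                             "observation_type": "max-bordered-51s"}))
```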

In [None]:
# Where is the model stored in S3?
estimator.model_data
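
`estimator.model_data` is an S3 URI. To inspect the exported model locally, the URI first has to be split into a bucket and a key. The sketch below shows one way to do that with the standard library; `split_s3_uri` is a hypothetical helper and the example URI is made up.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {!r}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


# Made-up URI for illustration; in the notebook, pass estimator.model_data instead
bucket, key = split_s3_uri("s3://example-bucket/Battlesnake-job-rllib-example/output/model.tar.gz")
print(bucket, key)

# With the pieces in hand, the archive could be fetched with boto3, for example:
# boto3.client("s3").download_file(bucket, key, "model.tar.gz")
```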

# Create an endpoint to host the policy
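First, delete the previous endpoint, endpoint configuration, and model. On a first run none of these exist yet, so the delete calls in the next cell raise an error that can be ignored.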

In [None]:
sm_client = boto3.client(service_name='sagemaker')
sm_client.delete_endpoint(EndpointName='battlesnake-endpoint')
sm_client.delete_endpoint_config(EndpointConfigName='battlesnake-endpoint')
sm_client.delete_model(ModelName="battlesnake-rllib")
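
The delete calls above stop at the first resource that does not exist. One possible way to make the cleanup tolerant of missing resources is sketched below. `call_ignoring` is a hypothetical helper, and the stand-in function only simulates a failing delete; with boto3, the exception to ignore would most likely be `botocore.exceptions.ClientError`, as shown in the commented usage.

```python
def call_ignoring(fn, ignored, **kwargs):
    """Call fn(**kwargs). Return True on success, False if an ignored exception was raised."""
    try:
        fn(**kwargs)
        return True
    except ignored as err:
        name = getattr(fn, "__name__", repr(fn))
        print("Skipped {}: {}".format(name, err))
        return False


# Stand-in for a delete call on a resource that does not exist
def fake_delete(EndpointName):
    raise LookupError("Could not find endpoint {}".format(EndpointName))


deleted = call_ignoring(fake_delete, LookupError, EndpointName="battlesnake-endpoint")
print(deleted)

# Assumed usage with the real client:
# from botocore.exceptions import ClientError
# call_ignoring(sm_client.delete_endpoint, ClientError, EndpointName='battlesnake-endpoint')
# call_ignoring(sm_client.delete_endpoint_config, ClientError, EndpointConfigName='battlesnake-endpoint')
# call_ignoring(sm_client.delete_model, ClientError, ModelName='battlesnake-rllib')
```

Note that this ignores every error of the given type, not only "not found" errors, so the printed messages are worth a glance.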

In [None]:
from sagemaker.tensorflow.serving import Model

model = Model(model_data=estimator.model_data,
              role=role,
              entry_point="inference.py",
              source_dir='rllib_inference/src',
              framework_version='2.1.0',
              name="battlesnake-rllib",
             )

if local_mode:
    inf_instance_type = 'local'
else:
    # Placeholder: substitute an actual SageMaker inference instance type before running
    inf_instance_type = "SAGEMAKER_INFERENCE_INSTANCE_TYPE"

# Deploy an inference endpoint
predictor = model.deploy(initial_instance_count=1, instance_type=inf_instance_type, endpoint_name='battlesnake-endpoint')

# Test the endpoint

This example uses a single observation for a 5-agent environment.
The last axis is 12 because the current MultiAgentEnv concatenates 2 frames:
5 agent maps + 1 food map = 6 maps per frame, and 6 maps * 2 frames = 12.
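
The channel arithmetic above can be written as a small function, which makes it easier to adapt the test input when the number of agents or frames changes. `observation_shape` is a hypothetical helper, and the default map size of 21 is taken from the test cell below rather than from the environment code.

```python
def observation_shape(num_agents, num_frames=2, map_size=21, batch_size=1):
    """Return (batch, height, width, channels) for one stacked observation."""
    maps_per_frame = num_agents + 1  # one map per agent plus one food map
    channels = maps_per_frame * num_frames
    return (batch_size, map_size, map_size, channels)


print(observation_shape(num_agents=5))
```

The resulting tuple can be passed directly as the `shape` argument of `np.zeros`.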

In [None]:
import numpy as np
from time import time

data1 = np.zeros(shape=(1, 21, 21, 12), dtype=np.float32).tolist()

health_dict = {0: 50, 1: 50}
json = {
    "turn": 4,
    "board": {
        "height": 15,
        "width": 15,
        "food": [],
        "snakes": []
    },
    "you": {
        "id": "snake-id-string",
        "name": "Sneky Snek",
        "health": 90,
        "body": [{"x": 1, "y": 3}]
    }
}

before = time()
action = predictor.predict({"state": data1, "prev_action": -1, 
                           "prev_reward": -1, "seq_lens": -1,  
                           "all_health": health_dict, "json": json})
elapsed = time() - before

action_to_take = action["outputs"]["heuristisc_action"]
print("Action to take {}".format(action_to_take))
print("Inference took %.2f ms" % (elapsed*1000))
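
A single timed call includes connection setup and any warm-up on the endpoint, so it may not reflect steady-state latency. The sketch below repeats a call several times and summarizes the timings. `time_calls_ms` is a hypothetical helper, and `fake_predict` stands in for the real `predictor.predict(...)` call so that the example runs without an endpoint.

```python
from time import perf_counter


def time_calls_ms(fn, repeats=10):
    """Call fn() repeatedly and return the elapsed time of each call in milliseconds."""
    timings = []
    for _ in range(repeats):
        start = perf_counter()
        fn()
        timings.append((perf_counter() - start) * 1000.0)
    return timings


# Stand-in for the endpoint call; replace with a lambda wrapping predictor.predict(...)
def fake_predict():
    return {"outputs": {"heuristisc_action": 0}}


timings = time_calls_ms(fake_predict, repeats=5)
print("min %.3f ms, mean %.3f ms, max %.3f ms"
      % (min(timings), sum(timings) / len(timings), max(timings)))
```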