This page explains how to retrain the model, and how to modify the training settings with MXNet or with RLlib
This is controlled mainly by a single notebook and will use a training gym environment.
PRE-REQUISITE: You need to run Step 1 before following these instructions.
This step adds a training instance to the Battlesnake environment:
Estimated cost : This environment adds a training instance to Step 1: the free tiers include 50 hours per month for the training instance during the first two months. After the free tiers are exceeded, the charge will be $0.269 per hour ($6.5 per 24 hour period) when the training instance is running. Cost savings tip : Once you have finished training, you can stop your training instance in order to stop consuming free tiers or incurring charges. You can easily restart them at a later date to continue with training.
Similar to accessing the heuristics developer environment, check the CloudFormation 'Outputs' tab and open ModelTrainingEnvironment:
You need to be authenticated for that link to work. Click on SourceEditionInNotebook link if you are denied access
- Once you have opened the SagemakerModelTraining.ipynb notebook, run through the notebook in order to train the model or run hyperparameter optimization (Press ► on the top to run the notebook). See here for a tutorial on how to use Jupyter Notebooks.
- set
run_hpo = False
to train your model, orrun_hpo = True
to run hyperparameter optimization.
The reinforcement learning components of this project include an OpenAI gym to train your Battlesnake AI (https://play.battlesnake.com/) and an MXNet example notebook to train your own neural network.
The OpenAI gym was designed to follow the official Battlesnake rules outlined here: https://docs.battlesnake.com/rules.
From the Cloudformation stack created during STEP 1, go to the 'Outputs' tab and click on the link next to ModelTrainingEnvironment:
The notebook contains code for training and automatic deployment of the model.
Press ► on the top to run the notebook (see here for a tutorial on how to use jupyter notebooks).
The main entry point (Amazon SageMaker endpoint) of the training the model is mxnet_src/train.py
for MXNet and rllib_src/train.py` for RLlib.
This gym provide several options for the the observation space.
The observation space provided by the gym is of size N x M x C
where N
and M
are the width and height of the map and C
is the number of snakes (+1 to account for the food). The food is indicated by values of 1
in C=0
. The snakes are stored in C=1 to C=num_snakes+1
and can be represented in 2 possible ways (51s
, num
):
Figure 1: 51s snake representation | Figure 2: num snake representation |
---|---|
![]() |
![]() |
The gym also provides an option to increase the map size by 2 to include -1 in the border.
Figure 3: bordered 51s snake representation
For each snake, the possible actions are UP, DOWN, LEFT, RIGHT (0, 1, 2, 3 respectively). Please note that according to the rules of Battlesnake, if your snake is facing UP and your snake performs a DOWN action, your snake will die.
The food spawning algorithm is not provided in the official Battlesnake rules. The gym uses a food spawning mechanism based on the code provided here
Designing an appropriate reward function is an important aspect of reinforcement learning. Currently, the gym records the following events that can be used to help shape your reward function:
- Surviving another turn (labelled as:
"another_turn"
) - Eating food (labelled as:
"ate_food"
) - Winning the game (labelled as:
"won"
) - Losing the game (labelled as:
"died"
) - Eating another snake (labelled as:
"ate_another_snake"
) - Dying by hitting a wall (labelled as:
"hit_wall"
) - Hitting another snake (labelled as:
"hit_other_snake"
) - Hitting yourself (labelled as:
"hit_self"
) - Was eaten by another snake (labelled as:
"was_eaten"
) - Another snake hits your body (labelled as:
"other_snake_hit_body"
) - Performing a forbidden move (i.e., moving south when facing north) (labelled as:
"forbidden_move"
) - Dying by starving (labelled as:
"starved"
)
The current reward function is simple ("another_turn"=1, "won"=2, "died"=-3, "forbidden_move"=-1
).
More complex reward functions with methods that can handle sparse rewards may be greatly beneficial. Also, it is possible to design different rewards for different snakes.
Based on the OpenAI gym framework, the following functions are used to interact with the gym:
-
state, _, dones, info = env.reset()
This function resets the environment. This is called when first creating the environment, and also at the end of each episode. 'state' provides the initial observations. 'info' provides information on the turn count and the health of each snake
-
state, rewards, dones, info = env.step(actions)
This function executes one time step within the environment, based on the set of snake actions contained in actions
actions
should be a numpy array of size num_snakes
containing integers 0 to 3, to represent the desired action for each snake in the environment
rewards
is a dictionary of reward values, keyed by the snake id. For example {'0': 45, '1': 37, '2': 60}
dones
is a dictionary of booleans indicating if each snake has died. Like rewards, the dictionary is keyed by snake ID, beginning at 0.
-
env.render(mode="rgb_array")
This function renders the environment based on its current state.
mode
can be rbg_array
, ascii
, human
rbg_array
outputs an expanded numpy array that can be used for creating gifsascii
outputs a text based representation that can be printed in the command prompthuman
an OpenAI plot will be generated
python mxnet_src/train.py --should_render --print_progress --number_of_snakes 4
Please refer to https://github.com/awslabs/sagemaker-battlesnake-ai/blob/master/TrainingEnvironment/examples/train.py for the other hyperparameters
This code uses multi-agent DQN to train the bots. The N
snakes share the same Qnetwork and the network is configured as follows:
Given a state representation N x M x C
to get the action of snake i
, the network expects that C=1
is the representation of snake i
and C=2 ... num_snakes+1
are the remaining snakes (note that C=0
represents the food).
The inputs 'snake health' and 'snake ID' in the figure correspond to the health and ID of Snake i
.