Key | Description |
---|---|
D | Enter/exit debug mode |
S | Skip trial |
Space | Next frame |
Up Arrow | Speed up frame rate by 100 |
Down Arrow | Slow down frame rate by 100 |
Clone the GitHub repository, then install and activate a Python virtual environment.
To create a Python virtual environment (optional):
python3 -m venv [path to virtual environment]
To activate the virtual environment and install the required packages:
cd [path to virtual environment]
source bin/activate
pip install -r requirements.txt
Run the code by typing:
python3 learn.py
or
python3 learn.py -d
to start the program in debug mode.
Once the snake program is launched, you can adjust the frame rate of the snake with the up and down arrow keys.
If you press the 'D' key, you will enter debug mode. This allows you to see graphically what the weight values of each possible move are. The value drawn in red indicates the direction the snake chose for its next move.
Pressing the 'D' key one more time will exit debug mode.
Press the 'S' key to skip one trial. A trial is defined to be 10,000 frames or until the snake dies.
Press the space bar to allow the snake to move one frame.
On the terminal, the current feature values will be displayed followed by a dashed line.
Feature | Description |
---|---|
foodLeft | Food is to the left of the snake |
foodRight | Food is to the right of the snake |
foodDown | Food is below the snake |
foodUp | Food is above the snake |
dangerDown | The snake will die if it travels one block down |
dangerUp | The snake will die if it travels one block up |
dangerLeft | The snake will die if it travels one block to the left |
dangerRight | The snake will die if it travels one block to the right |
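As a rough illustration of how these features could be computed, here is a minimal sketch; the function and parameter names (`extract_state_features`, `head`, `food`, `body`, `grid_size`) are assumptions for illustration, not the repository's actual API.

```python
# Hypothetical sketch of the boolean features above; names are not taken
# from the actual learn.py implementation.
def extract_state_features(head, food, body, grid_size):
    """head/food are (x, y) tuples, body is a list of (x, y) segments."""
    hx, hy = head
    fx, fy = food

    def is_danger(cell):
        x, y = cell
        # Death if the cell is outside the grid or occupied by the body.
        return (x < 0 or y < 0 or x >= grid_size or y >= grid_size
                or cell in body)

    return {
        "foodLeft":  int(fx < hx),
        "foodRight": int(fx > hx),
        "foodDown":  int(fy > hy),   # assuming y grows downward, as in pygame
        "foodUp":    int(fy < hy),
        "dangerDown":  int(is_danger((hx, hy + 1))),
        "dangerUp":    int(is_danger((hx, hy - 1))),
        "dangerLeft":  int(is_danger((hx - 1, hy))),
        "dangerRight": int(is_danger((hx + 1, hy))),
    }
```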
The snake AI uses reinforcement learning to find an optimal way to play Snake.
The AI uses the Q-learning algorithm to solve the problem.
There are 4 actions the AI can take: "up", "down", "left", or "right". If the body length is greater than 1, this can be limited to 3 actions, since the snake cannot reverse into its own body.
A "state" in this problem is defined to be:
- x,y coordinates of the head of the snake
- Length of the snake
- x,y coordinates of the food
- The current snake body
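A state could be held in a simple container like the sketch below; the class and field names are hypothetical and only illustrate the pieces listed above.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical container for the state pieces listed above;
# the actual code may store these differently.
@dataclass(frozen=True)
class SnakeState:
    head: Tuple[int, int]               # x, y coordinates of the snake's head
    length: int                         # length of the snake
    food: Tuple[int, int]               # x, y coordinates of the food
    body: Tuple[Tuple[int, int], ...]   # current snake body segments
```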
The food is randomly generated across the screen (except on the first iteration, so the snake can quickly learn that eating the food gives a good reward), which means there are far too many possible states to visit. Therefore, instead of saving and trying to visit all possible states, approximate Q-learning is used.
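In approximate Q-learning, the Q-value is not stored per state; it is represented as a weighted sum of feature values. With the 8 features described below, this takes the form:

$Q(s,a) = \sum_{i} w_i * f_i(s,a)$

where $w_i$ is the learned weight for feature $i$ and $f_i(s,a)$ is that feature's value for the given state-action pair.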
For the actual rewards, two methods were tested.
- Reward for every closer block to the fruit.
- Reward for eating fruit: 10
- Reward for being in a new state closer to the fruit compared to the previous state: 1
- Reward for being in a new state farther from the fruit compared to the previous state: -1
- Reward for dying: -1000
The idea was to get the snake always moving towards the fruit, but it ended up determining that dying might be worth more than going around its long tail in later stages.
- Reward for only the fruit.
- Reward for eating fruit: 10
- Reward for dying: -1000
- Any other reward: 0
The idea was to try to get the snake to prioritize staying alive, giving it no negative feedback when it takes the long way around.
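The two reward schemes could be sketched roughly as follows; the function and parameter names are hypothetical, while the constants are the ones listed above.

```python
# Rough sketch of the two reward schemes described above.
# Function and parameter names are hypothetical.
def reward_distance_shaped(ate_food, died, old_dist, new_dist):
    """Scheme 1: also rewards moving closer to the fruit."""
    if died:
        return -1000
    if ate_food:
        return 10
    return 1 if new_dist < old_dist else -1

def reward_fruit_only(ate_food, died):
    """Scheme 2: only eating the fruit and dying matter."""
    if died:
        return -1000
    if ate_food:
        return 10
    return 0
```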
For each given state and action, the featureExtractor will generate a bitlist. There are 8 features with the following meanings:
Index | Description |
---|---|
0 | (Bit is 1 if food is on the left side, 0 otherwise, action) |
1 | (Bit is 1 if food is on the right side, 0 otherwise, action) |
2 | (Bit is 1 if food is on the bottom side, 0 otherwise, action) |
3 | (Bit is 1 if food is on the top side, 0 otherwise, action) |
4 | (Bit is 1 if the snake will die if it travels one block below, 0 otherwise, action) |
5 | (Bit is 1 if the snake will die if it travels one block above, 0 otherwise, action) |
6 | (Bit is 1 if the snake will die if it travels one block to the left, 0 otherwise, action) |
7 | (Bit is 1 if the snake will die if it travels one block to the right, 0 otherwise, action) |
For example, if the snake were in the top left corner without a tail and the food were in the middle of the screen, the resulting bitlist would be:
- (0, 1, 1, 0, 0, 1, 1, 0)
With a probability of $\epsilon$, the next action is chosen at random so the snake keeps exploring new states.
With a probability of $1 - \epsilon$, the next action is the one with the highest predicted Q-value:
Next Action = $argmax_{a}Q(s,a)$
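A minimal sketch of this $\epsilon$-greedy action selection, assuming hypothetical helpers `get_legal_actions` and `get_q_value`:

```python
import random

# Sketch of epsilon-greedy action selection; helper names are assumptions.
def choose_action(state, epsilon, get_legal_actions, get_q_value):
    actions = get_legal_actions(state)          # e.g. "up", "down", "left", "right"
    if random.random() < epsilon:
        return random.choice(actions)           # explore: random move
    # Exploit: pick the action with the highest predicted Q-value.
    return max(actions, key=lambda a: get_q_value(state, a))
```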
Now the actual learning part. To update the respective weights for these specific feature values, for a given weight $w_i$ the update is:

$w_i \leftarrow w_i + \alpha * difference * f_i(s,a)$

Where $\alpha$ is the learning rate and $f_i(s,a)$ is the value of feature $i$ for the state-action pair.

When we take this action $a$ in state $s$ we land in state $s'$. The difference variable used in the weight update is:
- [reward observed in s' + discount value * max(future predicted rewards)] - predicted reward
- Mathematically this can be written as $difference = [r(s') + \gamma * max_{a'}Q(s',a')] - Q(s,a)$
- $\gamma$ = discount value
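Putting the pieces together, the approximate Q-learning update could look roughly like the sketch below; the weight storage and function names are assumptions, not the repository's actual code.

```python
# Sketch of the approximate Q-learning weight update described above.
# `weights` is a dict {feature_name: weight}; names are hypothetical.
def q_value(weights, features):
    """Q(s, a) as a weighted sum of feature values."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def update_weights(weights, features, reward, max_next_q, alpha=0.01, gamma=0.9):
    """Apply w_i <- w_i + alpha * difference * f_i(s, a)."""
    predicted = q_value(weights, features)
    difference = (reward + gamma * max_next_q) - predicted
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
    return weights
```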
Issues resolved:
- The code has an $\epsilon$ value that acts as the probability of the snake making a random move, with the purpose of exploring new states. This causes an issue in later stages, so the $\epsilon$ value for each trial is set to $\epsilon$ = (max $\epsilon$) / trial, which decreases randomness as the number of trials increases (see the sketch after this list).
- The fruit could spawn inside the snake's body, causing the snake to crash into itself.
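A minimal sketch of that decay schedule, assuming a hypothetical `MAX_EPSILON` constant and 1-based trial numbers:

```python
MAX_EPSILON = 0.5          # hypothetical starting value

def epsilon_for_trial(trial):
    # Trial 1 -> MAX_EPSILON, trial 10 -> MAX_EPSILON / 10, and so on.
    return MAX_EPSILON / trial
```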
Future Implementations:
- Adjust $\alpha$ on the fly.