# Jupyter Snake

---

In this project our goal is to beat the game of Snake by applying multiple RL techniques to teach an agent how to play.\
The project is divided into 3 parts:
1. Development of the Environment
2. Implementation of the Algorithms
3. Training and evaluation of the Algorithms

This whole project was developed as the final project for the "Reinforcement Learning" course (2024-2025).

Authors : *Bredariol Francesco, Savorgnan Enrico, Tic Ruben*

In [2]:
from algorithms import *
from eligibility_traces import *
from epsilon_scheduler import * 
from snake_environment import *
from states_bracket import *
from utils import *

## PART 1
---
*Environment*

##### **The Game**

For those who don't know it, Snake is not just a game but a genre of action video games.\
It originated with the 1976 competitive arcade video game Blockade, where the goal was to survive longer than the other players while collecting as much food as possible.\
In this game you control the head of a snake on a grid world and you aim to eat food in order to grow. The main difficulty is that if you hit your own tail you die (this is the only rule common to all Snake variants).\
There are multiple versions of the game and some of them are really weird (teleportation can occur, or some food can actually kill you).\
We considered the most basic version, where:

1. The world is a discrete grid world of size $n\times m$
2. There is always exactly one piece of food (an apple) on the grid world; once you eat it, it reappears at a new position
3. There are no periodic boundary conditions, meaning that if you hit a wall you die

The rest is just as described in the introduction to the game.\
As a side note, this version is inspired by "Snake Byte", published by Sirius Software in 1982.

##### **The Implementation**

Thanks to Gymnasium and PyGame the implementation of this simple version of the game is pretty straightforward.\
Developed completely in Python in the file "snake_environment.py", this implementation follows the Gym protocol, defining:

1. Step
2. Reset
3. Render, which uses PyGame
4. Close

All other functions are private and only used inside the class, with the exception of `get_possible_actions`, which is useful for our purposes.

One important detail is that we defined a maximum number of steps inside the environment, to prevent infinite loops driven by bad policies while training. This is a parameter of `__init__` with a default value of 1000.
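To make the step cap concrete, here is a minimal sketch of an environment that exposes the same four methods and cuts off an episode after a maximum number of steps. It is a toy stand-in, not the project's `SnakeEnv`: the class `TinyGridEnv`, the parameter name `max_step`, and the reward values are all assumptions made for illustration, and the sketch deliberately avoids importing Gymnasium so it stays self-contained.

```python
class TinyGridEnv:
    """Toy environment mirroring the Gym method shapes (hypothetical, not SnakeEnv)."""

    # Actions: 0 = up, 1 = right, 2 = down, 3 = left.
    MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]

    def __init__(self, size=5, max_step=1000):
        self.size = size
        self.max_step = max_step  # assumed name for the step cap
        self.steps = 0
        self.pos = (size // 2, size // 2)

    def reset(self):
        self.steps = 0
        self.pos = (self.size // 2, self.size // 2)
        return self.pos, {}  # (observation, info)

    def step(self, action):
        dx, dy = self.MOVES[action]
        self.pos = (self.pos[0] + dx, self.pos[1] + dy)
        self.steps += 1
        x, y = self.pos
        # The episode terminates when the agent leaves the grid (hits a wall).
        terminated = not (0 <= x < self.size and 0 <= y < self.size)
        # The episode is truncated when the step cap is reached first.
        truncated = (not terminated) and self.steps >= self.max_step
        reward = -1.0 if terminated else 0.0
        return self.pos, reward, terminated, truncated, {}

    def render(self):
        print(self.pos)

    def close(self):
        pass


# A policy that walks in a small square forever never hits a wall,
# so only the step cap can end the episode.
env = TinyGridEnv(size=5, max_step=20)
obs, info = env.reset()
terminated = truncated = False
while not (terminated or truncated):
    action = env.steps % 4  # up, right, down, left, repeated
    obs, reward, terminated, truncated, info = env.step(action)
env.close()
print(env.steps, terminated, truncated)
```

Without the cap, the loop above would never finish, which is exactly the situation a bad policy can create during training.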

In [13]:
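import random

env = SnakeEnv(render_mode="human")

done, keep = False, True

state, _ = env.reset()
action = 0

# Play random legal moves until the episode ends or rendering asks to stop.
while not done and keep:
    action = random.choice(env.get_possible_actions(action))
    state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated  # also stop when the step limit is reached
    keep = env.render()

env.close()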

##### **The Dimensionality Problem**

Once the environment is defined, one can ask how big the space of all possible configurations is.\
With a rough (but still useful) approximation, taking a state to be the matrix representation of the grid with 0 (empty cell), 1 (apple), 2 (head) and 3 (tail), the number of possible configurations ends up being something like this:
$$
    |S| = (n\times m)(n\times m)2^{(n\times m)}
$$
This counts all possible positions of the apple, all possible positions of the head, and all possible configurations of the tail on the grid (this last factor is the big approximation, since the tail configuration is not independent of the head position). Even so, if one adds "blocks" to the grid world (static cells that kill the snake when touched), the count should be exactly that large.
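To get a feeling for how fast this number grows, the formula can be evaluated for a few grid sizes. The helper name `approx_state_count` and the grid sizes below are chosen only for illustration; Python integers have arbitrary precision, so the count is exact.

```python
def approx_state_count(n, m):
    """Evaluate |S| = (n*m) * (n*m) * 2**(n*m) for an n x m grid."""
    cells = n * m
    return cells * cells * 2 ** cells


for n, m in [(4, 4), (10, 10), (20, 20)]:
    count = approx_state_count(n, m)
    # Report only the order of magnitude, since the full number is unreadable.
    print(f"{n}x{m}: roughly 10^{len(str(count)) - 1} states")
```

Even a modest grid is far beyond what a lookup table with one entry per state could hold.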

A state space this large is not easy to deal with while learning. The solution? Bidding (or bracketing, as we actually call it). More on this soon.

## PART 2
---
*Algorithms*

##### **Why not MDP**

We do not treat the problem as an explicit MDP to be solved directly: we lack the transition function, and there are too many states to define a workable Markov decision process.

##### **Our Choices**

In the end we decided to develop 5 different algorithms (with their variants) in order to gain familiarity with the whole RL framework. These are the algorithms we implemented:

1. Monte Carlo
2. SARSA
3. Q-Learning
4. DQL
5. Policy Gradient
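As a reminder of how the two tabular temporal-difference methods in this list differ, here are their update rules in standard textbook notation. The symbols $\alpha$ (step size) and $\gamma$ (discount factor) are assumed here and are not taken from the project code.

$$
    Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma\, Q(s',a') - Q(s,a)\right] \quad \text{(SARSA)}
$$
$$
    Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right] \quad \text{(Q-Learning)}
$$

SARSA bootstraps from the action $a'$ actually taken in the next state (on-policy), while Q-Learning bootstraps from the greedy action (off-policy).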

We first implemented a superclass that defines a protocol for all the algorithms and provides useful functions such as `get_action_epsilon_greedy`.\
In addition, utils.py contains many useful functions for handling the default dict, the structure we used to store the Q-value lookup table (for the algorithms that require it). Since we used a lot of bidding (more about this soon), we chose a default dict so that we could handle a total number of possible configurations that is not fixed in advance.
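The idea of a Q-value table that grows on demand can be sketched as follows. This is not the project's code: it assumes the table is a `collections.defaultdict`, that there are 4 actions, and the function `epsilon_greedy` is a hypothetical stand-in for the superclass method mentioned above.

```python
import random
from collections import defaultdict

N_ACTIONS = 4  # assumed: up, right, down, left

# A state that has never been seen gets a zero-initialised row on first access,
# so the table never needs to know the total number of states in advance.
q_table = defaultdict(lambda: [0.0] * N_ACTIONS)


def epsilon_greedy(q_table, state, epsilon, rng=random):
    """With probability epsilon explore, otherwise pick a best-valued action."""
    if rng.random() < epsilon:
        return rng.randrange(N_ACTIONS)
    values = q_table[state]
    best = max(values)
    # Break ties at random so no action is favoured by its index.
    return rng.choice([a for a, v in enumerate(values) if v == best])


state = (1, 0, 0, 1)           # any hashable state label works as a key
q_table[state][2] = 1.0        # pretend action 2 has been learnt to be good
greedy_action = epsilon_greedy(q_table, state, epsilon=0.0)
print(greedy_action, len(q_table))
```

Only states that are actually visited occupy memory, which is what makes the table usable when the set of state labels depends on the chosen bidding.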

##### **Bidding (or Bracketing)**

As shown before, the Snake game has a huge state dimension.\
A lookup table of that size makes no sense (we doubt our computers could even store it), and it is practically impossible to see even one example of every state-action pair in a lifetime, so we needed another idea.\
Thanks to a lesson, we learnt about bidding (which we called bracketing for the entire project, shame on Francesco) and decided to try it out.\
Bidding is essentially a supervised technique (in the sense that a human encodes how it works) that groups similar states together in order to reduce the dimensionality of the problem.\
A very silly example could be the following: label each state randomly as 0 or 1. The agent no longer sees the entire state representation, only the label you gave it. The example is silly because, with this random strategy, the agent ends up with no knowledge at all. BUT look at what happened to the state dimension: it dropped from whatever it was to 2! Pretty neat, huh?

We discussed together and decided to try many different bidding techniques, and we ended up discovering the big tradeoff in this field: information against dimension.\
Minimal state biddings are easy to develop, but if they are too small there is no guarantee that they carry enough information for the agent to actually learn. On the other hand, if you give the agent too much information, you end up again with a state dimension that is too big to deal with.
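To illustrate the tradeoff, here is a sketch of one possible compact bidding. It is a hypothetical example, not one of the brackets implemented in the project: the function `compact_bracket`, its arguments, and the coordinate convention ($x$ in $[0, n)$, $y$ in $[0, m)$, with "up" meaning decreasing $y$) are all assumptions. The label keeps only whether each neighbouring cell is deadly and in which direction the apple lies.

```python
def compact_bracket(head, apple, body, n, m):
    """Map a full grid configuration to a 6-value label (hypothetical bracket)."""
    hx, hy = head
    # Neighbouring cells in the order up, right, down, left.
    neighbours = [(hx, hy - 1), (hx + 1, hy), (hx, hy + 1), (hx - 1, hy)]

    def deadly(cell):
        x, y = cell
        outside = not (0 <= x < n and 0 <= y < m)
        return outside or cell in body

    def sign(v):
        return (v > 0) - (v < 0)

    danger = tuple(int(deadly(c)) for c in neighbours)
    food = (sign(apple[0] - hx), sign(apple[1] - hy))
    return danger + food


# Head in the top-left corner, one body segment to its right, apple further away.
label = compact_bracket(head=(0, 0), apple=(3, 2), body={(1, 0)}, n=5, m=5)
max_labels = 2 ** 4 * 3 ** 2  # 4 danger bits times 3 x 3 apple directions
print(label, max_labels)
```

This label can take at most 144 values regardless of the grid size, but it discards the shape of the tail, so the agent cannot tell when it is about to trap itself. Adding that information back makes the number of labels grow again.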