# Jupyter Snake

---

In this project our goal is to beat the game of Snake by applying multiple RL techniques to teach an agent how to play.\
The project is divided into 3 parts:
1. Development of the Environment
2. Implementation of the Algorithms
3. Learning and evaluating phase of the Algorithms

This whole project was developed as the final project for the "Reinforcement Learning" course (2024-2025).

Authors : *Bredariol Francesco, Savorgnan Enrico, Tic Ruben*

In [1]:
import numpy as np  # used explicitly in the examples below

from algorithms import *
from eligibility_traces import *
from epsilon_scheduler import *
from snake_environment import *
from states_bracket import *
from utils import *

## PART 1
---
*Environment*

### **The Game**

For those who don't know it, Snake is not just a game but a whole genre of action video games.\
It was born in 1976 with the competitive arcade video game Blockade, where the goal was to survive longer than the other players while collecting as much food as possible.\
In this game you control the head of a snake on a grid world and you aim to eat food in order to grow. The main difficulty is that if you hit your own tail you die (this is the only rule common to all Snake variants).\
There are multiple versions of the game, and some of them are really weird (teleportation can occur, or some food can actually kill you).\
We considered the most basic version, where:

1. The world is a discrete grid world of size $n\times m$
2. There is always exactly one piece of food (an apple) on the grid world, and when you eat it, it moves to a new position
3. There are no periodic boundary conditions, meaning that if you hit a wall you die

The rest is just as described in the introduction to the game above.\
As a little side note, this version is inspired by "Snake Byte", published by Sirius Software in 1982.

### **The Implementation**

Thanks to Gymnasium and PyGame the implementation of this simple version of the game is pretty straightforward.\
Developed completely in Python in the file "snake_environment.py", this implementation follows the Gym protocol, defining:

1. Step
2. Reset
3. Render, which uses PyGame
4. Close

All other functions are private and only used inside the class, with the exception of `get_possible_actions`, which is useful for our purposes.

One important detail is that we defined a maximum number of steps inside the environment to prevent infinite loops driven by bad policies during training. This is the `max_step` parameter of `__init__`, with default value 1000.
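To make the role of the step limit concrete, here is a minimal sketch of the reset/step pattern on a toy problem. `CorridorEnv` and `run_episode` are hypothetical names invented for this illustration and are not part of `snake_environment.py`; we also assume the usual five-value return of `step` (state, reward, terminated, truncated, info).

```python
class CorridorEnv:
    """Hypothetical 1-D corridor, used only to illustrate the reset/step protocol."""

    def __init__(self, size=4, max_step=1000):
        self.size = size
        self.max_step = max_step
        self.pos = 0
        self.steps = 0

    def reset(self):
        self.pos, self.steps = 0, 0
        return self.pos, {}

    def step(self, action):
        # action is -1 (move left) or +1 (move right)
        self.steps += 1
        self.pos += action
        terminated = not (0 <= self.pos < self.size)  # walked off the corridor
        truncated = not terminated and self.steps >= self.max_step
        reward = -1.0 if terminated else 0.0
        return self.pos, reward, terminated, truncated, {}


def run_episode(env, policy):
    """Run one episode and return (steps, terminated, truncated)."""
    state, _ = env.reset()
    terminated = truncated = False
    while not (terminated or truncated):
        state, _, terminated, truncated, _ = env.step(policy(state))
    return env.steps, terminated, truncated


# A policy that bounces between cells 0 and 1 would never end on its own:
# only the step limit stops the episode.
bounce = lambda s: 1 if s == 0 else -1
print(run_episode(CorridorEnv(max_step=10), bounce))
```

Without the `truncated` flag the bouncing policy would loop forever, which is exactly the situation the step limit is meant to prevent during training.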

In [2]:
import random

env = SnakeEnv(render_mode="human")

done, trunc, keep = False, False, True

state, _ = env.reset()
action = 0

# Play random legal moves until the episode ends (death or step limit)
# or render() returns a falsy value.
while not (done or trunc) and keep:
    action = random.choice(env.get_possible_actions(action))
    state, reward, done, trunc, info = env.step(action)
    keep = env.render()

env.close()

### **The Dimensionality Problem**

Once the environment is defined, one can think about how big the space of all possible configurations is.\
With a rough (but still useful) approximation, representing a state as the grid matrix with 0 (empty cell), 1 (apple), 2 (head) and 3 (tail), the number of possible configurations ends up being something like this:
$$
    |S| = (n\times m)(n\times m)2^{(n\times m)}
$$
This should count all the possible positions of the apple, all the possible positions of the head and all possible configurations of the tail on the grid (this is the big approximation, since the tail configuration is not independent of the head position). Anyway, even if this is an approximation, one can simply add "blocks" to the grid world (static cells that kill the snake if touched), and then the dimension should be exactly that big.
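To get a feeling for the numbers, the bound above can be evaluated for a few grid sizes. This is just the formula from the text turned into code; `approx_state_count` is a name we made up for the illustration.

```python
def approx_state_count(n, m):
    """Rough count from the text: apple positions * head positions * tail masks."""
    cells = n * m
    return cells * cells * 2 ** cells


for n, m in [(3, 3), (5, 5), (10, 10)]:
    print(f"{n}x{m}: about {approx_state_count(n, m):.3e} configurations")
```

Already on a $10 \times 10$ grid the count is of the order of $10^{34}$, far beyond anything a look-up table could hold.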

This is not a simple thing to deal with while learning. The solution? Binning (or bracketing, as we actually call it). More on this soon.

## PART 2
---
*Algorithms*

## **BINNING (or Bracketing)**

As shown before, the Snake game has a huge state dimension.\
Since a look-up table of that size makes no sense (we don't even think our computers could store something like that), and since it is practically impossible to see even one example of each state-action pair in an entire lifetime, we had to come up with something.\
Thanks to a lecture, we learnt about binning (which we called bracketing for the entire project, shame on Francesco) and decided to try it out.\
Binning is essentially a supervised technique (it is the human who encodes how it works) that groups similar states together in order to reduce the dimensionality of the problem.\
A very stupid example could be the following: each state is just randomly labelled as 0 or 1. The agent will then not see the entire state representation, but only the label you gave it. This example is stupid because, with this random strategy, you end up with no knowledge at all. BUT look at what happened to your state dimension: it fell from whatever it was to 2! Pretty neat, huh?

We discussed it together and decided to try many different binning techniques, and we ended up discovering the big tradeoff in this field: **information against dimension**.\
Minimal binnings are easy to implement, but if they are too small there is no guarantee that they carry enough information for the agent to actually learn. On the other hand, if you give the agent too much information, you end up again with a state dimension too big to deal with.

Another important aspect of binning is that you "completely" lose the transition function underlying your system. It is true that you can define a new transition function on a new space, the "binning space", but it is not easy and it is not guaranteed to be meaningful.

We will now quickly go through all the binnings we implemented. In states_bracket.py we defined a super class (StateBracket) which implements the protocol for all the binning techniques.

### **FOOD INFORMATION ONLY**

Our first approach was to give only information about the position of the apple with respect to the head of the snake. Two variants of this idea came up:

1. Food Relative Position 
2. Food Direction

Both techniques are pretty straightforward to implement, and they are very fast to compute as well. However, the snake loses all information about its tail, the walls and the boundaries of the grid.


##### **Food Relative Position**

Once you get the apple position $(a_y, a_x)$ and the head position $(h_y, h_x)$ you just return as state the tuple $(a_y - h_y, a_x - h_x)$. This approach reduces the state dimension to at most $2m \times 2n$.
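A minimal sketch of this idea on plain nested lists, following the formula $(a_y - h_y, a_x - h_x)$ literally. The helper names are ours, the cell codes (1 for the apple, 2 for the head) are taken from the matrix encoding described in Part 1, and the component order or sign convention used by `FoodRelativePositionBracket` may well differ.

```python
APPLE, HEAD = 1, 2  # cell codes assumed from the matrix encoding in Part 1


def find_cell(grid, value):
    """Return (row, col) of the first cell equal to `value`, or None."""
    for y, row in enumerate(grid):
        for x, cell in enumerate(row):
            if cell == value:
                return y, x
    return None


def food_relative_position(grid):
    """Offset of the apple from the head, as (rows, cols)."""
    a_y, a_x = find_cell(grid, APPLE)
    h_y, h_x = find_cell(grid, HEAD)
    return (a_y - h_y, a_x - h_x)


def count_offsets(n, m):
    """Number of distinct offsets that can occur on an n x m grid."""
    cells = [(y, x) for y in range(n) for x in range(m)]
    return len({(ay - hy, ax - hx) for ay, ax in cells for hy, hx in cells})


grid = [
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 2, 0, 0],
]
print(food_relative_position(grid))  # apple is 2 rows up and 2 columns right
print(count_offsets(3, 4))           # stays below the 2n x 2m bound
```

Enumerating the offsets shows that the binned space has $(2n-1)(2m-1)$ elements, which is indeed within the $2m \times 2n$ bound.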

##### **Food Direction**

Once you get the apple position $(a_y, a_x)$ and the head position $(h_y, h_x)$ you just return as state the tuple $(a_y < h_y, a_y > h_y, a_x < h_x, a_x > h_x)$.\
While the first technique is straightforward, this one is a little more subtle: it tells you whether your head is above, below, to the left, or to the right of the food.\
This is very minimal information: the states are condensed into just $8$ bins (3 vertical relations times 3 horizontal ones, minus the case where the apple sits on the head).
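The count of $8$ bins can be checked by brute force. The sketch below encodes the four comparisons as 0/1 flags and enumerates every apple/head pair on a small grid; the function names are ours, and the flag order used by `FoodDirectionBracket` may differ.

```python
def food_direction(apple, head):
    """Four 0/1 flags comparing apple and head coordinates, given as (row, col)."""
    a_y, a_x = apple
    h_y, h_x = head
    return (int(a_y < h_y), int(a_y > h_y), int(a_x < h_x), int(a_x > h_x))


def reachable_direction_states(n, m):
    """All flag tuples that occur on an n x m grid with apple and head on different cells."""
    cells = [(y, x) for y in range(n) for x in range(m)]
    return {food_direction(a, h) for a in cells for h in cells if a != h}


states = reachable_direction_states(3, 3)
print(len(states))
print(sorted(states))
```

Note that a pair of flags on the same axis can never both be 1, which is why the four binary values give 8 reachable states and not 16.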



In [3]:
state = np.array([
    [0, 0, 0], 
    [0, 0, 1], 
    [0, 0, 2]
    ])
frp = FoodRelativePositionBracket()
fd = FoodDirectionBracket()
print(frp.bracket(state))
print(fd.bracket(state))

(0, -1)
(0, 0, 1, 0)


### **FOOD AND NEIGHBORHOOD INFORMATION**

These binning techniques combine what we have just seen with information about the neighborhood of the head.\
The neighborhood can be of either Von Neumann or Moore type. The radius is a parameter of this binning. Notice that the greater the radius, the greater the total state dimension. The information within the neighborhood is expressed as 0s and 1s: 0 if a cell is free and 1 if a cell is occupied by the tail.

For further details on the implementation we suggest reading the code directly, which is fully commented.
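As a rough illustration of the neighborhood encoding described above (not the project's actual code), the sketch below builds the cell offsets for both neighborhood types and turns them into 0/1 flags. The function names, the scan order and the choice to mark out-of-grid cells as occupied are assumptions made for this example; the tail code 3 comes from the matrix encoding in Part 1.

```python
def neighborhood_offsets(radius, kind="von_neumann"):
    """(dy, dx) offsets around the head, excluding the head itself."""
    offsets = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if (dy, dx) == (0, 0):
                continue
            # Von Neumann keeps only cells within Manhattan distance `radius`;
            # Moore keeps the whole square.
            if kind == "von_neumann" and abs(dy) + abs(dx) > radius:
                continue
            offsets.append((dy, dx))
    return offsets


def neighborhood_bits(grid, head, radius=1, kind="von_neumann", tail=3):
    """1 for cells holding the tail or lying outside the grid (assumption), else 0."""
    h_y, h_x = head
    bits = []
    for dy, dx in neighborhood_offsets(radius, kind):
        y, x = h_y + dy, h_x + dx
        inside = 0 <= y < len(grid) and 0 <= x < len(grid[0])
        bits.append(int(not inside or grid[y][x] == tail))
    return tuple(bits)


grid = [
    [0, 0, 0],
    [0, 3, 3],
    [0, 0, 2],
]
print(neighborhood_bits(grid, head=(2, 2)))
for r in (1, 2, 3):
    vn = len(neighborhood_offsets(r))
    moore = len(neighborhood_offsets(r, "moore"))
    print(f"radius {r}: {vn} Von Neumann cells, {moore} Moore cells")
```

Each extra cell doubles the number of possible neighborhood patterns, which is why the state dimension grows so quickly with the radius.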

In [4]:
state = np.array([
    [0, 0, 0], 
    [0, 0, 1], 
    [0, 0, 2]
    ])
npfrp = NeighPlusFoodRelativePositionBracket(radius=1)
npfd = NeighPlusFoodDirectionBracket(radius=1)
print(npfrp.bracket(state))
print(npfd.bracket(state))

(0, -1, 0, 0, 0, 1, 1)
(0, 0, 1, 0, 0, 0, 0, 1, 1)


### **ONLY NEIGHBORHOOD INFORMATION**

An interesting experiment was the definition of a binning containing only information about the neighborhood of the head (adding the value 2 for the apple in this case).\
The idea was that, with this type of binning, the agent could learn to search very carefully, but with no knowledge of its own position it is probably impossible for it to learn a valid strategy.

In [5]:
state = np.array([
    [0, 0, 0], 
    [0, 0, 1], 
    [0, 0, 2]
    ])
n = NeighborhoodBracket()
print(n.bracket(state))

(0, 0, 0, 1, 1)


### **FOOD AND NEIGHBORHOOD AND TAIL INFORMATION**

The last type of binning we explored combines all the information described so far plus a little information about the tail.\
The information about the tail is its length, which should help the agent learn to be a little more careful as the tail gets longer.

In [6]:
state = np.array([
    [0, 0, 0], 
    [0, 0, 1], 
    [0, 0, 2]
    ])
tpnpfrp = NeighPlusFoodRelativePositionPlusTailBracket()
tpnpfd = NeighPlusFoodDirectionPlusTailBracket()
print(tpnpfrp.bracket(state))
print(tpnpfd.bracket(state))

(0, 0, -1, 0, 0, 0, 1, 1)
(0, 0, 0, 1, 0, 0, 0, 0, 1, 1)


## ALGORITHMS

### **Why not MDP**

As we have seen in the previous section, the only way to obtain a manageable state dimension is to use binning. \
The problem is that, using binning, we lose the transition function underlying our MDP. Indeed, being in the same state, taking the same action and ending up in the same state can still lead to different rewards.
To give an example, consider the two situations below, where we marked with $O$ the head of the snake, with $0$ its body and with $X$ the food (consider the snake without tail):


$$
\begin{array}{|c|c|c|}
\hline
 & 0 & 0 \\
\hline
 & O & 0 \\
\hline
X &  &  \\
\hline
\end{array}
$$

$$
\begin{array}{|c|c|c|}
\hline
 &  &  \\
\hline
 & O & 0 \\
\hline
X &  & 0 \\
\hline
\end{array}
$$


Both situations are binned into the same condensed state $(-1, -1)$. However, the action "UP" leads to a negative reward in the first example, because the snake hits its own body, and to a zero reward in the second.

This removes the possibility of obtaining good results by solving the problem as an MDP via policy iteration.
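The aliasing in the two grids above can be reproduced in a few lines. In this sketch the grids are written with the numeric codes from Part 1 (1 apple, 2 head, 3 body), rows grow downward, and the binned state is computed as $(a_y - h_y, a_x - h_x)$, so the printed tuple may not match the sign convention of the label used in the text. What matters is that both grids map to the same tuple while "UP" has different consequences. All function names are ours.

```python
APPLE, HEAD, BODY = 1, 2, 3  # cell codes assumed from the encoding in Part 1


def find_cell(grid, value):
    """Return (row, col) of the first cell equal to `value`."""
    for y, row in enumerate(grid):
        for x, cell in enumerate(row):
            if cell == value:
                return y, x
    return None


def binned_state(grid):
    """Food-relative-position binning: everything else on the grid is discarded."""
    a_y, a_x = find_cell(grid, APPLE)
    h_y, h_x = find_cell(grid, HEAD)
    return (a_y - h_y, a_x - h_x)


def up_is_fatal(grid):
    """True if moving the head one row up hits the body or leaves the grid."""
    h_y, h_x = find_cell(grid, HEAD)
    return h_y - 1 < 0 or grid[h_y - 1][h_x] == BODY


first = [
    [0, 3, 3],
    [0, 2, 3],
    [1, 0, 0],
]
second = [
    [0, 0, 0],
    [0, 2, 3],
    [1, 0, 3],
]
print(binned_state(first), binned_state(second))  # identical binned states
print(up_is_fatal(first), up_is_fatal(second))    # different outcomes for "UP"
```

From the agent's point of view the same (state, action) pair therefore yields different rewards, so the binned process no longer has a well-defined transition and reward model.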



### **Our Choices**

In the end we decided to develop 5 different **model-free algorithms** (plus some variants of them) in order to become familiar with the whole RL framework. These are the algorithms we implemented:

1. Monte Carlo
2. SARSA
3. Q-Learning
4. DDQL
5. Policy Gradient

We first implemented a super class that defines a protocol for all the algorithms and provides useful functions such as "get_action_epsilon_greedy" and so on.\
Other useful methods of the class are the save and upload methods, which store the obtained results as a pickled dictionary so that they can be retrieved later.\
In addition, utils.py contains many useful functions for dealing with the default dict, the structure we used to store the Q-value look-up table (for the algorithms that require it). Since we used many different binnings, we used a default dict to cope with a state dimension that is not fixed in advance.

Before diving into our actual implementation, let's briefly recall some key concepts that will be used.
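To illustrate why a default dict is convenient here, the sketch below stores Q-values in a `collections.defaultdict` keyed by the binned state, picks actions epsilon-greedily and pickles the table. This is our own minimal version under the assumption that the project uses the standard-library `defaultdict`; the helper names are hypothetical and do not mirror `utils.py` or the super class.

```python
import pickle
import random
from collections import defaultdict

N_ACTIONS = 4


def make_q_table(n_actions=N_ACTIONS):
    # Unseen binned states get a zero-initialised row on first access,
    # so the table never needs to know the state space in advance.
    return defaultdict(lambda: [0.0] * n_actions)


def epsilon_greedy(q_table, state, epsilon, rng=random):
    """Random action with probability epsilon, otherwise a greedy one."""
    values = q_table[state]
    if rng.random() < epsilon:
        return rng.randrange(len(values))
    return max(range(len(values)), key=values.__getitem__)


q = make_q_table()
q[(0, -1)][2] = 1.0                    # pretend action 2 looked good in this state
print(epsilon_greedy(q, (0, -1), epsilon=0.0))
print(epsilon_greedy(q, (5, 5), epsilon=0.0), len(q))  # new state added on the fly

# A defaultdict built from a lambda cannot be pickled directly,
# so we convert it to a plain dict before saving.
snapshot = pickle.dumps(dict(q))
restored = pickle.loads(snapshot)
```

The same table works unchanged for every binning, whatever the length or content of the state tuples.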

*Credit for the key concepts to [this tutorial](https://medium.com/@hsinhungw/intro-to-reinforcement-learning-monte-carlo-to-policy-gradient-1c7ede4eed6e)*

### **Policy-based vs Value-based**

Most of our algorithms are value-based but policy gradient is in fact policy-based. Let's define the difference between these two classes.

1. **Policy-based methods**: The agent learns the policy directly.
2. **Value-based methods**: The agent learns a value function that gives the expected return of being in a given state or performing a given action in a given state. The policy can then be derived from the learned value function.

### **Off-policy / On-policy**

In RL, the agent generates experience by interacting with the environment under a certain policy and then learns a policy from that experience. With this in mind, algorithms can take different approaches to the policy. Two main classes exist; let's break them down.


1. **On-policy methods**: The agent attempts to learn the policy that is also used to generate the experience.
2. **Off-policy methods**: The agent learns a policy that is different from the one used to generate experiences.
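The distinction shows up clearly in the bootstrapped target of two of the algorithms listed above: SARSA (on-policy) and Q-Learning (off-policy). The sketch below uses a plain dictionary as Q-table and made-up numbers; the function names are ours.

```python
def sarsa_target(q, reward, next_state, next_action, gamma):
    """On-policy: bootstrap from the action the behaviour policy actually picked."""
    return reward + gamma * q[next_state][next_action]


def q_learning_target(q, reward, next_state, gamma):
    """Off-policy: bootstrap from the greedy action, whatever is played next."""
    return reward + gamma * max(q[next_state])


q = {"s1": [0.0, 2.0, -1.0, 0.5]}

# Suppose the epsilon-greedy behaviour policy explored and picked action 2 in s1.
on_policy = sarsa_target(q, reward=1.0, next_state="s1", next_action=2, gamma=0.9)
off_policy = q_learning_target(q, reward=1.0, next_state="s1", gamma=0.9)
print(on_policy, off_policy)
```

The on-policy target is pulled down by the exploratory move, while the off-policy target evaluates the greedy policy regardless of what the agent does next.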

### **Q-Value Learning**

Our goal is to find the optimal policy given only the experience of an environment. Again, the experience consists of the trajectories we collect as we explore the environment:

$\tau_i = (S^i_0, A^i_0, R^i_1, S^i_1, A^i_1, R^i_2, \dots, S^i_{T^i})$


To do this, it is more convenient to consider, instead of the _state_-function $V_{\pi}(s)$, the _state/action_-function $Q_{\pi}(s,a)$. It is defined as:

$$
Q_{\pi}(s,a) = \mathbb{E}_\pi\bigg[ \sum_{t=0}^\infty \gamma^t \, R_{t+1} \, \, \Big| \, \, S_0 = s, A_0 = a \bigg] \ .
$$

**Definition 1** _"The expected discounted cumulative sum of all future rewards when starting from state $s$, taking action $a$ and then always following the policy $\pi$."_

or equivalently as:

$$
Q_\pi(s,a) = \mathbb{E}_{\pi}\bigg[ R_1 + \gamma V_\pi(S_1) \,\, \Big| \,\, S_0 = s, A_0 = a\bigg] \ .
$$

**Definition 2** _"The expected value of the immediate reward plus the discounted $V$-value of the following state $S_1$, when starting from state $s$, taking action $a$ and then always following the policy $\pi$."_

or again as:

$$
Q_\pi(s,a) = \mathbb{E}_{\pi}\bigg[ R_1 + \gamma  \, \sum_{a'} \, Q_\pi(S_1, a') \, \pi(a' | S_1) \,\, \Big| \,\, S_0 = s, A_0 = a\bigg] \ .
$$
**Definition 3** _"The expected value of the immediate reward plus the discounted $Q$-values of the following state $S_1$ over all possible actions $a'$, each weighted by the probability of taking that action ($\pi(a' | S_1)$), when starting from state $s$, taking action $a$ and then always following the policy $\pi$."_

*Credit to the 4th tutorial for the definition*
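On a single sampled trajectory, Definitions 1 and 2 correspond to two ways of computing the same return: summing the discounted rewards directly, or applying the recursion $G_t = R_{t+1} + \gamma \, G_{t+1}$ backwards from the end of the episode. The small sketch below (function names are ours) checks that they agree on a finite reward sequence.

```python
def discounted_return(rewards, gamma):
    """G_0 = sum over t of gamma^t * R_{t+1}, for a finite sequence R_1, R_2, ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def returns_backward(rewards, gamma):
    """All returns G_0, G_1, ... via the recursion G_t = R_{t+1} + gamma * G_{t+1}."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


rewards = [0.0, 0.0, 1.0]   # made-up episode: the apple is eaten at the third step
gamma = 0.9
print(discounted_return(rewards, gamma))
print(returns_backward(rewards, gamma))
```

Averaging such sampled returns over many episodes estimates the expectation in Definition 1, while the recursion is the sample version of the one-step form in Definition 2.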

### **Monte Carlo**

Monte Carlo methods are ways of estimating value functions through experiences (sample sequence of states, actions, and rewards). Recall that the value of a state is the expected return (expected cumulative future discounted reward) starting from that state. One way to estimate the expected return is simply to average the returns observed after visits to that state. As more returns are observed, by the law of large numbers, the average should converge to the expected return. This idea underlies all Monte Carlo methods.

The first algorithm we developed was **on-policy Monte Carlo**. It works as follows:

1. On-policy MC uses an **ε-soft policy**. A soft policy refers to any policy that has a small, but finite, probability of selecting any possible action, ensuring exploration of alternative actions.
2. After each episode, the **observed returns** are used to learn the q-value function, and then the policy is improved based on the learned value function for all the states visited in the episode.

This means that we use the first definition. The resulting estimate has no bias but a high variance.

![onpolicymc](/images/on%20policy%20mc.png)
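The update step of the procedure above can be sketched as follows. This is a simplified version under our own assumptions: it is every-visit rather than first-visit, and it moves each Q-value towards the observed return with a constant learning rate instead of keeping an explicit average. The actual `Montecarlo` class may differ, and `mc_update` is a name we made up.

```python
from collections import defaultdict


def mc_update(q_table, episode, gamma, lr):
    """Every-visit, constant-step-size Monte Carlo update from one finished episode.

    `episode` is a list of (state, action, reward) triples, where `reward`
    is the reward received after taking `action` in `state`.
    """
    g = 0.0
    for state, action, reward in reversed(episode):
        g = reward + gamma * g  # return observed from this step onwards
        q_table[state][action] += lr * (g - q_table[state][action])
    return q_table


q = defaultdict(lambda: [0.0] * 4)
episode = [("a", 0, 0.0), ("b", 1, 1.0)]  # made-up two-step episode
mc_update(q, episode, gamma=0.9, lr=0.5)
print(q["a"][0], q["b"][1])
```

Policy improvement is then implicit: the next episode is generated by acting epsilon-greedily with respect to the updated table, which keeps the policy ε-soft.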

In [12]:
# Train without rendering, with short episodes
env = SnakeEnv(render_mode="non-human", max_step=100)
mc = Montecarlo(action_space=4, gamma=0.90, lr_v=0.01)
eps = ConstantEpsilonDecay(0.3)
bracketer = NeighPlusFoodDirectionBracket()
mc.learning(env=env, epsilon_schedule=eps, n_episodes=400, bracketer=bracketer)

# Watch the learned policy play, with a larger step budget
env = SnakeEnv(render_mode="human", max_step=2000)
mc.play(env=env, bracketer=bracketer)

Episode 399/400


Learning finished


Episode 0/400 : Performance -10.0
Episode 100/400 : Performance -11.5
Episode 200/400 : Performance -2.0
Episode 300/400 : Performance -4.5


27.0

### **SARSA**

SARSA is a fascinating on-policy method which uses the current state and action, the reward, and the next state and action (State Action Reward State Action) to update its estimates. There are many variants of SARSA, and they differ in how they update the Q-value table.