# Gridworld

*Andrea Mazzolini, amazzoli@ictp.it, Como 2019. Adapted from a previous tutorial by Jacopo Talamini.*

<br/><br/>

Here we want to find the optimal strategy for a 2D gridworld problem using a model-free reinforcement-learning algorithm: **Q-learning**.

A 2D gridworld is a two-dimensional lattice of size $d \times d$ in which an agent navigates and looks for rewards. At each step the agent can move to one of the four neighbouring cells or stay still. Some cells are special and contain rewards, which the agent acquires when it moves onto them or stays on them.

The agent starts from an initial position $s_0$ and has to move and acquire rewards in order to maximize the expected exponentially discounted return, with discount factor $\gamma \in [0, 1)$:
$$
V_\pi(s_0) = \mathbb{E}_\pi\bigg[ \sum_{t=0}^\infty \gamma^t\,r_t \bigg] \ .
$$
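
As a minimal sketch of how this return is evaluated for one realized trajectory, the snippet below sums $\gamma^t r_t$ over a finite list of rewards. The function name `discounted_return` and the example numbers are illustrative assumptions, and cutting the infinite sum at a finite horizon is only an approximation of the formula above.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite list of rewards r_0, r_1, ..."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# Hypothetical trajectory: nothing for two steps, then a reward of 1.
example_rewards = [0.0, 0.0, 1.0]
print(discounted_return(example_rewards, gamma=0.5))  # 0.5**2 * 1.0 = 0.25
```

The expectation $\mathbb{E}_\pi$ would then be an average of this quantity over trajectories generated by the policy $\pi$.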

We consider a case in which there are two rewards $R_a > R_b$, placed in the gridworld, such that:

*   $R_a$ is far from the starting position of the agent, $s_0$
*   $R_b$ is close to $s_0$

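Therefore the discount factor $\gamma$ controls a tradeoff: for small $\gamma$ the agent prefers the nearby reward $R_b$, while for $\gamma$ close to 1 the larger but more distant reward $R_a$ becomes preferable.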
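
To make the tradeoff concrete, here is a rough sketch under simplifying assumptions: the agent walks along a shortest path to one reward cell and then stays there forever, collecting that reward at every step. The helper names, the reward values and the distances are all hypothetical choices for illustration, not values from the tutorial.

```python
def commit_and_stay_return(reward, distance, gamma):
    """Return of walking `distance` steps to a reward cell and staying there.

    Assumes 0 <= gamma < 1 and distance >= 1. The reward is first collected
    on arrival (time index distance - 1) and then at every later step, which
    gives a geometric series.
    """
    return reward * gamma ** (distance - 1) / (1.0 - gamma)


def critical_gamma(r_far, r_near, d_far, d_near):
    """Discount factor at which the two commit-and-stay strategies tie."""
    return (r_near / r_far) ** (1.0 / (d_far - d_near))


# Hypothetical values, chosen only for illustration.
R_a, R_b = 2.0, 1.0      # R_a > R_b
dist_a, dist_b = 5, 1    # R_a is farther from s_0 than R_b

gamma_star = critical_gamma(R_a, R_b, dist_a, dist_b)
print("critical gamma:", gamma_star)
for gamma in (0.5, 0.9):
    far = commit_and_stay_return(R_a, dist_a, gamma)
    near = commit_and_stay_return(R_b, dist_b, gamma)
    print(gamma, "prefers", "R_a" if far > near else "R_b")
```

Under these assumptions the far reward wins only when $\gamma$ exceeds $(R_b / R_a)^{1/(n_a - n_b)}$, where $n_a$ and $n_b$ denote the two distances from $s_0$.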

## Gridworld as a Markov Decision Process

### States
The state space corresponds to the physical space of the gridworld. Each state is therefore identified by its two coordinates $(x, y)$, and the whole space is composed of $d^2$ states:

$$
\mathcal{S} = \{ 0, 1, \ldots, d-1 \} \times \{ 0, 1, \ldots, d-1 \}
$$
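
One possible way to enumerate this state space in code is as a list of coordinate tuples. This is only a sketch: the grid size `d = 4` and the variable name `states` are assumptions made for the example.

```python
from itertools import product

d = 4  # hypothetical grid size

# Cartesian product {0, ..., d-1} x {0, ..., d-1}, one (x, y) tuple per state.
states = list(product(range(d), repeat=2))

print(len(states))  # d**2 = 16
```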

### Actions
The agent has five actions: it can move to one of the four nearest neighbours or stay in the cell without moving:

$$
\mathcal{A} = \{ \text{up}, \text{left}, \text{down}, \text{right}, \text{stay} \} = \{ (0,1), (-1,0), (0,-1), (1,0), (0,0)\}
$$

which can also be expressed as translation vectors.
In practice, not all of these actions are available in every state: the agent cannot cross the boundaries. This makes the actions state-dependent. For example, if the agent is on the left boundary (with $0 < y < d-1$), then $\mathcal{A}(0,y) = \{ \text{up}, \text{down}, \text{right}, \text{stay} \}$, while in the top-right corner $\mathcal{A}(d-1,d-1) = \{ \text{left}, \text{down}, \text{stay} \}$.
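
The state-dependent action sets can be sketched as follows. The dictionary `ACTIONS` mirrors the translation vectors above, while the helper name `available_actions` and the grid size used in the example are assumptions for illustration.

```python
# Translation vectors (dx, dy) for the five actions.
ACTIONS = {
    "up": (0, 1),
    "left": (-1, 0),
    "down": (0, -1),
    "right": (1, 0),
    "stay": (0, 0),
}


def available_actions(state, d):
    """Names of the actions that keep the agent inside a d x d grid."""
    x, y = state
    allowed = []
    for name, (dx, dy) in ACTIONS.items():
        if 0 <= x + dx < d and 0 <= y + dy < d:
            allowed.append(name)
    return allowed


d = 4  # hypothetical grid size
print(available_actions((0, 2), d))          # ['up', 'down', 'right', 'stay']
print(available_actions((d - 1, d - 1), d))  # ['left', 'down', 'stay']
```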

### Transition probabilities

The transition probabilities between states are deterministic: the next state is just the old state plus the translation action chosen by the agent:

$$
p(s_{t+1} | a_t, s_t) = \delta (s_{t+1} = a_t + s_t)
$$
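
A minimal sketch of this deterministic transition, assuming states and actions are stored as `(x, y)` tuples. The function name `step` and the choice to raise an error on a boundary-crossing move are illustrative assumptions; an implementation could instead filter the allowed actions beforehand.

```python
def step(state, action, d):
    """Deterministic transition: next state = state + translation vector."""
    x, y = state
    dx, dy = action
    next_state = (x + dx, y + dy)
    # The agent cannot cross the boundaries of the d x d grid.
    if not (0 <= next_state[0] < d and 0 <= next_state[1] < d):
        raise ValueError(f"action {action} is not allowed in state {state}")
    return next_state


print(step((1, 1), (0, 1), d=4))  # moving up from (1, 1) gives (1, 2)
```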

### Rewards

The rewards depend only on the arrival state, $r(s_{t+1})$, and are zero for all states except a few special ones chosen to contain a resource.
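
Such a reward function could be sketched as a lookup table with a default of zero. The positions and values in `REWARDS` are hypothetical, as are the names `REWARDS` and `reward`.

```python
# Hypothetical reward cells: a large reward far away, a small one nearby.
REWARDS = {
    (3, 3): 2.0,
    (1, 0): 1.0,
}


def reward(next_state):
    """Reward depends only on the arrival state; zero for ordinary cells."""
    return REWARDS.get(next_state, 0.0)


print(reward((3, 3)))  # 2.0
print(reward((2, 2)))  # 0.0
```

Because the lookup uses only the arrival state, staying on a reward cell yields the same reward again at every step.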