# Grid world

In this notebook, we will demonstrate some algorithms used in **Dynamic Programming**.

We define a game where a robot learns to find an optimal path in a room to a target while avoiding traps. 

The room is described by a 3x4 grid whose cells are addressed as (row, column), starting from the top-left corner. Initially the robot is in (2,0), the target is in (0,3), the trap is in (1,3), and the cell (1,1) is unreachable (wall or pillar).

```
    +---+---+---+---+
    |   |   |   | +1|
    +---+---+---+---+
    |   |XXX|   | -1|
    +---+---+---+---+
    | R |   |   |   |
    +---+---+---+---+
    
```
The actions available to the robot are up, down, left and right (U,D,L,R).

This problem exposes 11 states and 4 actions.
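
The count of 11 states follows from the 12 cells of the grid minus the wall. A quick sketch that enumerates them (the names `wall`, `states` and `actions` are illustrative and not part of the notebook's code):

```python
# Enumerate the cells of the 3x4 room and drop the wall at (1,1).
height, width = 3, 4
wall = (1, 1)

states = [(r, c) for r in range(height) for c in range(width) if (r, c) != wall]
actions = ['U', 'D', 'L', 'R']

print(len(states), len(actions))  # 11 4
```

Note that the 11 states include the two terminal cells (target and trap), from which no action is available.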

In [1]:
import numpy as np

DEBUG=0

The class `Grid` describes a generic room (size, rewards for each cell and actions possible from each cell) and keeps the current state (the position of the agent).

In [2]:
class Grid:
    def __init__(self, height, width):
        self.width = width
        self.height = height
        
    def configure(self, rewards, actions, transitions, start):
        self.rewards = rewards
        self.actions = actions
        self.transitions = transitions
        self.state = start
        
    def is_terminal_state(self, s):
        return s not in self.actions
    
    def is_game_over(self):
        return self.is_terminal_state(self.state)
    
    def all_states(self):
        for r in range(self.height):
            for c in range(self.width):
                yield (r,c)
    
    def move(self, a):
        if self.is_game_over(): return 0.0
        
        # get the transition distribution from the current state with action a
        targets = self.transitions(self.state, a)        
        if not targets: return 0 # move is not possible
        
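        # sample the index of the next state: np.random.choice needs a 1-D
        # sequence, and the states themselves are (row, col) tuples
        states = list(targets.keys())
        i = np.random.choice(len(states), p=list(targets.values()))
        self.state = states[i]
        
        return self.rewards.get(self.state, 0.0)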

Let's now define two grid factory functions: `build_standard_grid` and `build_negative_grid`. Both create a new `Grid` instance and configure it as described above. The difference is that the second one penalizes each move with a negative reward, while the first one gives no reward on empty cells.

In [3]:
def build_standard_grid():
    g = Grid(3,4)
    
    rewards = { (0,3): 1.0, 
                (1,3): -1.0
              }
    
    actions = { (0,0): ['R', 'D'],
                (0,1): ['L', 'R'],
                (0,2): ['L', 'R', 'D'],
                (1,0): ['U', 'D'],
                (1,2): ['R', 'U', 'D'],
                (2,0): ['R', 'U'],
                (2,1): ['L', 'R'],
                (2,2): ['L', 'R', 'U'],
                (2,3): ['L', 'U']
              }
    
    def transitions(s, a):
        if a not in actions[s]: return {}
        if a=='D': return {(s[0]+1, s[1]  ): 1.0}
        if a=='U': return {(s[0]-1, s[1]  ): 1.0}
        if a=='R': return {(s[0]  , s[1]+1): 1.0}
        if a=='L': return {(s[0]  , s[1]-1): 1.0}

    
    g.configure(rewards, actions, transitions, (2,0))
    return g

def build_negative_grid(step_cost=-0.1):
    g = build_standard_grid()
    # entering any non-terminal state costs step_cost
    g.rewards.update({ s:step_cost for s in g.actions.keys() })
    return g


We will also want to inspect the different solutions we have found. The following functions display the value function and the policy on the grid.

In [4]:
def print_value(V, g):
    print ("Value function")
    for c in range(g.width):
        print("+-------", end='')
    print("+")
    for r in range(g.height):
        for c in range(g.width):
            print (f"| {V.get((r,c),0):+0.2f} ", end='')
        print("|")
        for c in range(g.width):
            print("+-------", end='')
        print("+")
        
        
def print_policy(pi, g):
    print ("Policy")
    
    for c in range(g.width):
        print("+-------------", end='')
    print("+")
    
    for r in range(g.height):
        for c in range(g.width):
            s = (r,c)
            print (f"|    {pi(g, 'U', s):+0.2f}    ", end='')
        print("|")
        for c in range(g.width):
            s = (r,c)
            print (f"| {pi(g, 'L', s):+0.2f} {pi(g, 'R', s):+0.2f} ", end='')
        print("|")
        for c in range(g.width):
            s = (r,c)
            print (f"|    {pi(g, 'D', s):+0.2f}    ", end='')
        print("|")
        for c in range(g.width):
            print("+-------------", end='')
        print("+")



## Iterative Policy Evaluation

In the next steps, we will need to evaluate different policies. For that we will use the **Iterative Policy Evaluation** algorithm, which repeatedly updates the value function at each state using the Bellman equation until it converges.

The Bellman equation computes the value of a state from the values of its possible successor states, according to a policy:

$$V_\pi(s) = \sum_{a}{\pi(a \mid s) \sum_{s'}\sum_{r}{p(s',r \mid s,a)(r + \gamma V_\pi(s'))}}$$

where:

- $V_\pi(s)$ is the value of the state $s$ according to the policy $\pi$
- $\pi(a \mid s)$ is the policy. It gives the probability of taking the action $a$ given we are in state $s$
- $p(s',r \mid s,a)$ is the probability of transitioning to state $s'$ and getting the reward $r$, given that we are in state $s$ and take the action $a$. In our case, transitions from state to state are deterministic, so this value is always $1$ or $0$.
- $r$ is the reward of the transition $s \rightarrow s'$
- $\gamma$ is the discount factor

Initially, all values are set to zero. Then we sweep repeatedly over all states, applying the Bellman equation to compute the value at each state. The loop finishes when the values stabilize, that is, when the largest change in any state's value during a full sweep falls below a small threshold ($10^{-3}$ in the code below).
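
To make the update concrete, here is a minimal sketch of a single Bellman backup for the state (0,2) during the first sweep, assuming the uniform random policy and $\gamma = 1$. The dictionaries `V`, `rewards` and `successors` are written out by hand for this one state and are illustrative only:

```python
gamma = 1.0

# Values of the three neighbours of (0,2) before the first update (all zero).
V = {(0, 1): 0.0, (0, 3): 0.0, (1, 2): 0.0}

# Reward received when entering a cell; cells not listed give 0.
rewards = {(0, 3): 1.0, (1, 3): -1.0}

# Deterministic successors of (0,2): each action leads to one cell with p = 1.
successors = {'L': (0, 1), 'R': (0, 3), 'D': (1, 2)}

# Uniform policy: each of the three available actions has probability 1/3.
pa = 1.0 / len(successors)

v = 0.0
for a, next_state in successors.items():
    prob = 1.0  # p(s', r | s, a) for a deterministic move
    v += pa * prob * (rewards.get(next_state, 0.0) + gamma * V[next_state])

print(round(v, 3))  # 0.333
```

Only the move to the right contributes, because it is the only one that enters a rewarded cell while all neighbouring values are still zero.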


In [5]:
def iterative_policy_evaluation(grid, pi, gamma):
    
    # initialize the value function to 0
    V = { s:0.0 for s in grid.all_states() }
    
    
    # loop until stabilization
    loop = 0
    while True:
        if DEBUG: print (f"--- Loop #{loop}")
        delta = 0
        
        # enumerate all states
        for s in grid.all_states():
            if DEBUG: print (f"   State {s}")
            
            # if terminal, the value is still zero
            if grid.is_terminal_state(s):
                if DEBUG: print (f"   -> terminal")
                continue
                
                
            # Bellman equation : sum over all available actions
            v = 0
            for a in grid.actions[s]:
                if DEBUG: print (f"      Action: {a}")
                
                # get the probability of taking action 'a' while in state 's'
                pa = pi(grid, a, s)
                if DEBUG: print (f"          pa={pa}")
                if pa == 0: continue
                
                # get the distribution of possible targets if we take action 'a' from state 's'
                targets = grid.transitions(s,a)
                if DEBUG: print (f"          targets={targets}")
                if not targets: continue
                
                for next_state, prob in targets.items():
                    if DEBUG: print (f"              next_state={next_state}, prob={prob}")
                    if DEBUG: print (f"              V[next_state]={V[next_state]}")
                    v += pa * prob * (grid.rewards.get(next_state, 0.0) + gamma * V[next_state])

            
            delta = max(delta, abs(v - V[s]))
            if DEBUG: print (f"   Vs={V[s]} -> {v} : delta={delta}")
                    
            V[s] = v
            
        # count the sweep so that the debug trace shows the right loop number
        loop += 1
        if (delta < 1e-3):
            break
                
    return V

### Random policy
To begin, we evaluate a random policy. This policy gives the same probability to every available action in each state. More formally, at each state $s$, there is a set of available actions $A_s = \{a_0, a_1, \ldots , a_N \}$. The probability of taking the action $a_i$ is $\pi(a_i \mid s) = 1\,/\,|A_s|$.
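
As a sanity check on this definition, the probabilities should sum to 1 in every non-terminal state. The sketch below verifies that on the action table of the standard grid; `uniform_probability` is a hypothetical stand-alone helper introduced for this check, not the notebook's policy function:

```python
# Available actions per state, copied from build_standard_grid above.
actions = {
    (0, 0): ['R', 'D'], (0, 1): ['L', 'R'], (0, 2): ['L', 'R', 'D'],
    (1, 0): ['U', 'D'], (1, 2): ['R', 'U', 'D'],
    (2, 0): ['R', 'U'], (2, 1): ['L', 'R'], (2, 2): ['L', 'R', 'U'],
    (2, 3): ['L', 'U'],
}

def uniform_probability(a, s):
    """Hypothetical helper: probability of action a in state s under a uniform policy."""
    if s not in actions or a not in actions[s]:
        return 0.0
    return 1.0 / len(actions[s])

# In every non-terminal state the four action probabilities must sum to 1.
for s in actions:
    total = sum(uniform_probability(a, s) for a in 'UDLR')
    assert abs(total - 1.0) < 1e-12

print(uniform_probability('R', (0, 2)))  # one of three available actions
```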

In [6]:
def random_policy(grid, a, s):
    if s not in grid.actions: return 0.0
    if a not in grid.actions[s]: return 0.0
    return 1.0 / len(grid.actions[s])



std_grid = build_standard_grid()

V = iterative_policy_evaluation(std_grid, random_policy, 1.0)

print_policy(random_policy, std_grid)
print_value (V, std_grid)


Policy
+-------------+-------------+-------------+-------------+
|    +0.00    |    +0.00    |    +0.00    |    +0.00    |
| +0.00 +0.50 | +0.50 +0.50 | +0.33 +0.33 | +0.00 +0.00 |
|    +0.50    |    +0.00    |    +0.33    |    +0.00    |
+-------------+-------------+-------------+-------------+
|    +0.50    |    +0.00    |    +0.33    |    +0.00    |
| +0.00 +0.00 | +0.00 +0.00 | +0.00 +0.33 | +0.00 +0.00 |
|    +0.50    |    +0.00    |    +0.33    |    +0.00    |
+-------------+-------------+-------------+-------------+
|    +0.50    |    +0.00    |    +0.33    |    +0.50    |
| +0.00 +0.50 | +0.50 +0.50 | +0.33 +0.33 | +0.50 +0.00 |
|    +0.00    |    +0.00    |    +0.00    |    +0.00    |
+-------------+-------------+-------------+-------------+
Value function
+-------+-------+-------+-------+
| -0.03 | +0.09 | +0.22 | +0.00 |
+-------+-------+-------+-------+
| -0.16 | +0.00 | -0.44 | +0.00 |
+-------+-------+-------+-------+
| -0.29 | -0.41 | -0.54 | -0.77 |
+-------+-------+-------+-------+

### Fixed policy
The next policy we want to evaluate is a fixed policy: in each state, exactly one action is always taken. For our example, any state in the left column or the top row leads to the target, and any other state leads to the trap.

```
    +---+---+---+---+
    | → | → | → | +1|
    +---+---+---+---+
    | ↑ |XXX| → | -1|
    +---+---+---+---+
    | ↑ | → | → | ↑ |
    +---+---+---+---+
    
```

In [7]:
def fixed_policy(grid, a, s):
    fixed = { (0,0):'R', (0,1):'R', (0,2):'R', 
              (1,0):'U',            (1,2):'R',  
              (2,0):'U', (2,1):'R', (2,2):'R', (2,3):'U',  
            }
    return 1.0 if s in fixed and fixed[s] == a else 0.0


V = iterative_policy_evaluation(std_grid, fixed_policy, 0.9)

print_policy(fixed_policy, std_grid)
print_value (V, std_grid)


Policy
+-------------+-------------+-------------+-------------+
|    +0.00    |    +0.00    |    +0.00    |    +0.00    |
| +0.00 +1.00 | +0.00 +1.00 | +0.00 +1.00 | +0.00 +0.00 |
|    +0.00    |    +0.00    |    +0.00    |    +0.00    |
+-------------+-------------+-------------+-------------+
|    +1.00    |    +0.00    |    +0.00    |    +0.00    |
| +0.00 +0.00 | +0.00 +0.00 | +0.00 +1.00 | +0.00 +0.00 |
|    +0.00    |    +0.00    |    +0.00    |    +0.00    |
+-------------+-------------+-------------+-------------+
|    +1.00    |    +0.00    |    +0.00    |    +1.00    |
| +0.00 +0.00 | +0.00 +1.00 | +0.00 +1.00 | +0.00 +0.00 |
|    +0.00    |    +0.00    |    +0.00    |    +0.00    |
+-------------+-------------+-------------+-------------+
Value function
+-------+-------+-------+-------+
| +0.81 | +0.90 | +1.00 | +0.00 |
+-------+-------+-------+-------+
| +0.73 | +0.00 | -1.00 | +0.00 |
+-------+-------+-------+-------+
| +0.66 | -0.81 | -0.90 | -1.00 |
+-------+-------+-------+-------+
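
Because the fixed policy is deterministic and only the two terminal cells carry a reward, the values printed above can be cross-checked by hand: a state that reaches a terminal cell in $k$ moves has value $\gamma^{k-1}$ times the reward of that cell. A minimal sketch of this check, where the `paths` table (number of moves and final reward per state) is read off the policy diagram by hand and is not part of the notebook's code:

```python
gamma = 0.9

# state -> (number of moves to reach a terminal cell, reward on arrival),
# following the arrows of the fixed policy
paths = {
    (0, 2): (1, 1.0), (0, 1): (2, 1.0), (0, 0): (3, 1.0),
    (1, 0): (4, 1.0), (2, 0): (5, 1.0),
    (1, 2): (1, -1.0), (2, 3): (1, -1.0), (2, 2): (2, -1.0), (2, 1): (3, -1.0),
}

# The reward is received on the k-th move, so it is discounted k-1 times.
closed_form = {s: reward * gamma ** (k - 1) for s, (k, reward) in paths.items()}

for s in sorted(closed_form):
    print(s, f"{closed_form[s]:+0.2f}")
```

For the start cell (2,0) this gives $0.9^4 \approx 0.66$, which matches the bottom-left entry of the table.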